**PARTICLE: Part Discovery and Contrastive Learning for Fine-grained Recognition**

[Oindrila Saha](http://oindrilasaha.github.io), [Subhransu Maji](http://people.cs.umass.edu/~smaji/)

_University of Massachusetts - Amherst_

![Figure [coarsesup_splash]: **Self-supervised fine-tuning using part discovery and contrastive learning (PARTICLE)**: Given a collection of unlabeled images, at each iteration we cluster pixel features from an initial network to obtain part segmentations, and fine-tune the network using a contrastive objective between parts.](./partaware.jpg)

We develop techniques for refining representations for fine-grained classification and segmentation tasks in a self-supervised manner. We find that fine-tuning methods based on instance-discriminative contrastive learning are not as effective, and posit that recognizing part-specific variations is crucial for fine-grained categorization. We present an iterative learning approach that incorporates part-centric equivariance and invariance objectives. First, pixel representations are clustered to discover parts. We analyze the representations from convolutional and vision transformer networks that are best suited for this task. Then, a part-centric learning step aggregates and contrasts representations of parts within an image. We show that this improves performance on image classification and part segmentation tasks across datasets. For example, under a linear-evaluation scheme, the classification accuracy of a ResNet50 trained on ImageNet using DetCon, a self-supervised learning approach, improves from 35.4% to 42.0% on Caltech-UCSD Birds, from 35.5% to 44.1% on FGVC Aircraft, and from 29.7% to 37.4% on Stanford Cars. We also observe significant gains in few-shot part segmentation using the proposed technique, whereas instance-discriminative learning was not as effective. Smaller, yet consistent, improvements are also observed for stronger transformer-based networks.
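The iteration described above, clustering pixel features into parts and then contrasting part-pooled representations between views, can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions: the plain k-means clustering, the average pooling per part, and the InfoNCE-style loss form are illustrative stand-ins, not the authors' exact implementation (see the paper and code release for the real objectives).

```python
import numpy as np

def kmeans(feats, k, iters=10, seed=0):
    """Cluster pixel features (N, D) into k parts; returns labels (N,).
    Plain Lloyd's k-means as an illustrative part-discovery step."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        # squared distances (N, k), assign each pixel to nearest center
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = feats[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return labels

def pool_parts(feats, labels, k):
    """Average-pool pixel features within each discovered part -> (k, D)."""
    return np.stack([feats[labels == j].mean(0) if (labels == j).any()
                     else np.zeros(feats.shape[1]) for j in range(k)])

def part_contrastive_loss(parts_a, parts_b, tau=0.1):
    """InfoNCE-style loss: the j-th part of view A should match the
    j-th part of view B, against all other parts as negatives."""
    a = parts_a / (np.linalg.norm(parts_a, axis=1, keepdims=True) + 1e-8)
    b = parts_b / (np.linalg.norm(parts_b, axis=1, keepdims=True) + 1e-8)
    logits = a @ b.T / tau
    logits -= logits.max(1, keepdims=True)          # numerical stability
    logp = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    return -np.mean(np.diag(logp))                  # -log p(match) averaged
```

In the actual method these steps alternate: part labels are re-estimated from the current network's pixel features, and the network is then fine-tuned against the part-contrastive objective.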
PUBLICATION
==========================================================================================

**PARTICLE: Part Discovery and Contrastive Learning for Fine-grained Recognition**
Oindrila Saha, Subhransu Maji
International Conference on Computer Vision, VIPriors Workshop (ICCVW), 2023.
[[arXiv](https://arxiv.org/abs/2309.13822)]

CODE
===============================================================================

The code for reproducing our results, along with pretrained models, is available [here](https://github.com/oindrilasaha/PARTICLE).
RESULTS
===============================================================================

**Table 1**: *Comparison of our method with baselines on the Caltech-UCSD Birds dataset. (Please refer to the paper for more details.)*

| Architecture | Method | CUB Cls | CUB Seg |
|---|---|---|---|
| ResNet50 | Supervised ImageNet | 66.29 | 47.41 ± 0.88 |
| | MoCoV2 (ImageNet) | 28.92 | 46.08 ± 0.55 |
| | MoCoV2 fine-tuned | 31.17 | 46.22 ± 0.70 |
| | PARTICLE fine-tuned | **36.09** | **47.40 ± 1.06** |
| | DetCon (ImageNet) | 35.39 | 47.42 ± 0.92 |
| | DetCon fine-tuned | 37.15 | 47.88 ± 1.18 |
| | PARTICLE fine-tuned | **41.98** | **50.21 ± 0.85** |
| ViT S/8 | DINO (ImageNet) | 83.36 | 49.57 ± 1.26 |
| | DINO fine-tuned | 83.36 | 49.66 ± 0.98 |
| | PARTICLE fine-tuned | **84.15** | **51.40 ± 1.29** |

ACKNOWLEDGEMENTS
===============================================================================

The project was funded in part by NSF grant #1749833 to Subhransu Maji. Our experiments were performed on the University of Massachusetts GPU cluster funded by the Mass. Technology Collaborative.

**Cite us:**

(embed bib.txt height=115px here)