Improving Image Encoders for General-Purpose Nearest Neighbor Search and Classification
Abstract
Recent advances in computer vision research have led to large vision foundation models that generalize to a broad range of image domains and perform exceptionally well in various image-based tasks. However, content-based image-to-image retrieval is often overlooked in this context. This paper investigates the effectiveness of different vision foundation models on two challenging nearest neighbor search-based tasks: zero-shot retrieval and k-NN classification. We establish a benchmark for evaluating the performance of various vision encoders and their pre-training methods, and observe significant differences in performance between these models. Additionally, we propose a fine-tuning regime that improves zero-shot retrieval and k-NN classification through training with a combination of large publicly available datasets without specializing in any data domain. Our results show that the retrained vision encoders have a higher degree of generalization across different search-based tasks and can be used as general-purpose embedding models for image retrieval.
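For readers unfamiliar with the two tasks named in the abstract, the following is a minimal sketch of nearest neighbor retrieval and k-NN classification over image embeddings. It is not the paper's evaluation code: the function names, the toy 2-D vectors, the labels, and the choice of cosine similarity are all assumptions made for illustration. In practice the embeddings would come from a vision encoder.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_search(query, index, k):
    """Return positions of the k index embeddings most similar to the query."""
    ranked = sorted(
        range(len(index)),
        key=lambda i: cosine_similarity(query, index[i]),
        reverse=True,
    )
    return ranked[:k]


def knn_classify(query, index, index_labels, k):
    """Predict a label by majority vote over the k nearest index embeddings."""
    neighbors = knn_search(query, index, k)
    votes = Counter(index_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data standing in for encoder outputs and their labels.
index = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
index_labels = ["cat", "cat", "dog", "dog"]
query = [0.8, 0.2]

retrieved = knn_search(query, index, k=2)                 # retrieval: ranked neighbors
predicted = knn_classify(query, index, index_labels, k=3)  # classification: majority vote
```

Both tasks share the same ranking step, so the quality of the embedding space determines performance in each; no classifier is trained on top of the encoder.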
BibTeX
If you use our work in your research, please cite our publication:
@inproceedings{10.1145/3591106.3592266,
author = {Schall, Konstantin and Barthel, Kai Uwe and Hezel, Nico and Jung, Klaus},
title = {Improving Image Encoders for General-Purpose Nearest Neighbor Search and Classification},
year = {2023},
isbn = {9798400701788},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3591106.3592266},
doi = {10.1145/3591106.3592266},
booktitle = {Proceedings of the 2023 ACM International Conference on Multimedia Retrieval},
pages = {57--66},
numpages = {10},
keywords = {Generalization in Nearest Neighbor-Based Tasks, Deep Learning, Content-Based Image Retrieval},
location = {Thessaloniki, Greece},
series = {ICMR '23}
}