Could you provide more detailed information about CLIP fine-tuning for multimodal retrieval?
Specifically, I'm interested in understanding how to handle composite inputs during training, where both queries and database entries may contain combinations of image and text modalities.
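For concreteness, here is a minimal sketch of what I mean by composite inputs. It is not an established recipe: the model name, the simple mean-fusion of the image and text embeddings, and the toy usage are all illustrative assumptions. During fine-tuning, I imagine applying a symmetric contrastive loss between fused query embeddings and fused target embeddings, but I'd like to know how this is typically done in practice.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant would do for this sketch.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode_composite(image=None, text=None):
    """Encode an input that may contain an image, a text, or both.
    Single-modality inputs fall through unchanged; composites are
    mean-fused (just one of many possible fusion choices)."""
    feats = []
    if image is not None:
        inputs = processor(images=image, return_tensors="pt")
        feats.append(model.get_image_features(**inputs))
    if text is not None:
        inputs = processor(text=[text], return_tensors="pt", padding=True)
        feats.append(model.get_text_features(**inputs))
    emb = torch.stack(feats).mean(dim=0)  # mean-fuse available modalities
    return F.normalize(emb, dim=-1)       # unit-norm, as CLIP retrieval expects

# Toy usage: a composite (image + modification text) query
# scored against a small database of mixed-modality entries.
dummy = Image.new("RGB", (224, 224))
query = encode_composite(image=dummy, text="same jacket but in red")
db = torch.cat([
    encode_composite(text="a red jacket"),  # text-only entry
    encode_composite(image=dummy),          # image-only entry
])
scores = query @ db.T  # cosine similarity; highest score = best match
print(scores)
```

Given a setup like this, my questions are whether mean-fusion is reasonable, whether the fusion should instead be learned, and how the contrastive batches are usually constructed when entries mix modalities.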