Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective (Université Sorbonne Paris Nord)
Conference paper, 2022


Abstract

In recent years, joint text-image embeddings have improved significantly thanks to the development of transformer-based Vision-Language models. Despite these advances, we still need to better understand the representations produced by those models. In this paper, we compare pre-trained and fine-tuned representations at the vision, language, and multimodal levels. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models, and introduce new datasets specifically for multimodal probing. These datasets are carefully designed to address a range of multimodal capabilities while minimizing the potential for models to rely on bias. Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models seem to prefer relying on bias in text data for object position and size. On semantically adversarial examples, we find that those models are able to pinpoint fine-grained multimodal differences. Finally, we also notice that fine-tuning a Vision-Language model on multimodal tasks does not necessarily improve its multimodal ability. We make all datasets and code available to replicate the experiments.
Main file: 11931.SalinE-7.pdf (2.99 MB)
Origin: files produced by the author(s)

Dates and versions

hal-03521715 , version 1 (11-01-2022)
hal-03521715 , version 2 (17-03-2022)

Identifiers

  • HAL Id : hal-03521715 , version 1

Cite

Emmanuelle Salin, Badreddine Farah, Stéphane Ayache, Benoit Favre. Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective. AAAI 2022, Feb 2022, Vancouver, Canada. ⟨hal-03521715v1⟩
667 views
656 downloads
