This repository provides the official implementation of our ACM MM 2025 paper:
Grounding Emotion Recognition with Visual Prototypes: VEGA - Revisiting CLIP in MERC
Guanyu Hu, Dimitrios Kollias, Xinyu Yang
VEGA addresses Multimodal Emotion Recognition in Conversations (MERC) with explicit visual semantic grounding. Instead of only feature fusion, VEGA introduces CLIP-based visual emotion anchors and aligns unimodal/fused representations to anchor prototypes.
Core ideas:
- Build class-level visual prototypes in CLIP space.
- Align text/audio/visual/fused representations to prototypes.
- Optimize with classification + distillation in both label and anchor spaces.
VEGA is trained with two collaborative objectives:
- Classification losses for the unimodal and fused outputs.
- Distillation from the fused prediction to the unimodal predictions.

At the model level, VEGA:
- Projects multimodal utterance representations into a visual semantic space.
- Performs visual semantic anchoring by aligning utterance features to visual anchors.
- Optimizes anchor-conditioned classification and semantic distillation to enforce anchor-grounded decision boundaries.
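The objectives above can be sketched in PyTorch. This is a minimal illustration, not the repository's actual implementation: the cosine-similarity anchor logits, the temperature values, and the helper names (`anchor_logits`, `vega_style_loss`) are all assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def anchor_logits(features, anchor_center, tau=0.07):
    """Cosine similarity between utterance features and class anchor
    centers, scaled by a temperature (a common CLIP-style choice)."""
    f = F.normalize(features, dim=-1)
    a = F.normalize(anchor_center, dim=-1)
    return f @ a.t() / tau

def vega_style_loss(fused_logits, unimodal_logits, labels, T=2.0):
    """Classification on the fused and unimodal heads, plus KL
    distillation from the fused prediction to each unimodal one."""
    loss = F.cross_entropy(fused_logits, labels)
    teacher = F.softmax(fused_logits.detach() / T, dim=-1)
    for logits in unimodal_logits:
        loss = loss + F.cross_entropy(logits, labels)
        loss = loss + F.kl_div(F.log_softmax(logits / T, dim=-1),
                               teacher, reduction="batchmean") * T * T
    return loss
```

The same anchor-logit head can serve both the anchor-space alignment and the anchor-conditioned classification, since the logits are just similarities to the class prototypes.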
```
VEGA/
├─ run.py         # Training entrypoint
├─ main.py        # Main training/evaluation flow
├─ train.py       # Optimization and metrics
├─ model.py       # Backbone + VEGA modules
├─ inference.py   # Inference script
├─ dataloader.py  # Runtime dataloader
├─ configs/       # Configuration presets
├─ vega_utils/    # Utilities (anchors/checkpoints/reports/common)
├─ docs/figures/  # Figures used in README
├─ data/          # Multimodal feature files
├─ anchor/        # Visual anchor assets/cache
├─ checkpoint/    # Pretrained checkpoints
└─ output/        # Training outputs
```
- Python 3.10+
- Core packages: torch, numpy, pandas, scikit-learn, tqdm, transformers, pillow, pytz

```shell
pip install torch numpy pandas scikit-learn tqdm transformers pillow pytz
```

Place files exactly as follows:
| Resource | Filename | Target Path | Google Drive |
|---|---|---|---|
| Visual anchors (anchor images, optional) | 35_anchor.zip | unzip to `anchor/35_anchor/` | https://drive.google.com/file/d/1DOmYn6tISoEPJ4PQDD4F-gB1M58G-NS1/view?usp=sharing |
| Visual anchors (ready features, recommended) | 35_anchor.pt | `anchor/35_anchor.pt` | https://drive.google.com/file/d/1F-ajsUUHihO0RgREros5AJUiptuVOoIl/view?usp=sharing |
| Multimodal features | IEMOCAP.pkl | `data/IEMOCAP.pkl` | https://drive.google.com/file/d/1dx4yikoU8hYZ7FxyrRwzcANzaJBcg90N/view?usp=sharing |
| Checkpoint | IEMOCAP.pth | `checkpoint/IEMOCAP.pth` | https://drive.google.com/file/d/1piNYmb1GfRNruKkDHf6_GSUh1ctEYvul/view?usp=sharing |
If you choose 35_anchor.zip, extract the images, build the anchor cache, and save it to `anchor/35_anchor.pt`.
When building from raw images on IEMOCAP, ensure the folder includes images for all six labels:
happy, sad, neutral, anger, excited, frustration.
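The repository's own anchor-building script may differ; as a rough sketch, assuming class prototypes are mean-pooled CLIP image features per label (`build_anchor_cache` is a hypothetical helper, and the commented CLIP extraction requires downloading the checkpoint):

```python
import torch

LABELS = ["happy", "sad", "neutral", "anger", "excited", "frustration"]

def build_anchor_cache(feats_by_label):
    """Assemble the cache layout expected at anchor/35_anchor.pt:
    per-class image features plus their mean as the class prototype.
    `feats_by_label` maps label -> [num_images, clip_dim] tensor."""
    anchor_center = torch.stack([feats_by_label[l].mean(dim=0) for l in LABELS])
    anchor_img_dict = {l: {"feature": feats_by_label[l]} for l in LABELS}
    return {"anchor_center": anchor_center, "anchor_img_dict": anchor_img_dict}

# Per-image CLIP features could be extracted like this (needs network
# access to fetch the checkpoint; the function above is independent):
#
#   from transformers import CLIPModel, CLIPProcessor
#   from PIL import Image
#   model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
#   processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
#   inputs = processor(images=[Image.open(p) for p in paths], return_tensors="pt")
#   with torch.no_grad():
#       feats = model.get_image_features(**inputs)  # [num_images, 512]
```

Save the result with `torch.save(cache, "anchor/35_anchor.pt")`.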
Required structure of `35_anchor.pt`:

```python
{
    "anchor_center": torch.Tensor,   # [num_classes, clip_dim], e.g. [6, 512]
    "anchor_img_dict": {
        "happy":       {"feature": torch.Tensor},  # [num_images, clip_dim]
        "sad":         {"feature": torch.Tensor},
        "neutral":     {"feature": torch.Tensor},
        "anger":       {"feature": torch.Tensor},
        "excited":     {"feature": torch.Tensor},
        "frustration": {"feature": torch.Tensor},
    },
}
```

Notes:
- `feature` and `anchor_center` are CLIP visual features.
- CPU or GPU tensors are both acceptable when saving.
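A quick way to sanity-check a cache file against this structure before training (a small helper sketch, not part of the repository):

```python
import torch

def check_anchor_cache(path, num_classes=6, clip_dim=512):
    """Load an anchor cache and verify the structure described above."""
    cache = torch.load(path, map_location="cpu")
    assert cache["anchor_center"].shape == (num_classes, clip_dim)
    for label, entry in cache["anchor_img_dict"].items():
        # Each class holds a [num_images, clip_dim] feature matrix.
        assert entry["feature"].shape[-1] == clip_dim
    return cache
```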
Training:

```shell
python run.py --Dataset IEMOCAP --CLIP_Model openai/clip-vit-base-patch32 --cls_loss --clip_loss --clip_all_clip_kl_loss --cls_all_cls_kl_loss
```

Inference:

```shell
python inference.py --checkpoint "checkpoint/IEMOCAP.pth"
```

Citation:

```bibtex
@inproceedings{hu2025grounding,
  title={Grounding Emotion Recognition with Visual Prototypes: VEGA - Revisiting CLIP in MERC},
  author={Hu, Guanyu and Kollias, Dimitrios and Yang, Xinyu},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={5667--5676},
  year={2025},
  doi={10.1145/3746027.3755340}
}
```

This project builds on excellent open-source foundations.
