Multimodal Emotion Recognition in Conversation through Image and Text Fusion
IEEE Envision Project 2025
- Vanshika Mittal
- Rakshith Ashok Kumar
- Pradyun Diwakar
- Shriya Bharadwaj
- Vashisth Patel
- Deepthi Komar
- Karthikeya Gupta
- Kowndinya Vasudev
To develop a multimodal system using the MELD dataset that integrates textual and visual inputs through a late fusion architecture to perform emotion classification in conversational settings.
Emotions are key to effective communication, influencing interactions and decision-making. This project aims to bridge the gap between humans and machines by recognizing emotions in conversations using both text and facial expressions. Leveraging the MELD dataset, we implement two parallel modules:
- a TF-IDF-based NLP model for dialogue processing
- a ResNet-18-based vision model for facial expression analysis
By combining these through a late fusion strategy, our system achieves more accurate emotion detection.
- Python
- PyTorch
- Streamlit
The MELD (Multimodal EmotionLines Dataset) is a benchmark dataset for emotion recognition in multi-party conversations, derived from the TV show Friends. It contains over 13,000 utterances across 1,400+ dialogues, each labeled with one of seven emotions: anger, disgust, fear, joy, neutral, sadness, or surprise. Each utterance is paired with text, audio, and video, enabling multimodal analysis. MELD retains conversational context and speaker information, making it ideal for studying emotion dynamics in dialogue.
In our project, we use its text and visual components to build a multimodal emotion classification system.
In our project, we utilize the TF-IDF (Term Frequency-Inverse Document Frequency) representation in combination with Logistic Regression to classify the emotional content of dialogue utterances in the MELD dataset.
TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection (or corpus). It balances two components:
- Term Frequency (TF): Measures how frequently a term appears in a document. A higher frequency indicates greater importance.
- Inverse Document Frequency (IDF): Measures how unique or rare a term is across all documents. Rare terms across the corpus receive higher weights.
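The two components above can be computed directly. The sketch below uses plain Python with the unsmoothed formulation tf × log(N/df); libraries such as scikit-learn use slightly different smoothing and normalization conventions, so the exact weights differ, but the intuition is the same:

```python
import math

def tf_idf(corpus):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in corpus:
        w = {}
        for term in doc:
            tf = doc.count(term) / len(doc)    # term frequency in this document
            idf = math.log(n_docs / df[term])  # rare terms get higher weight
            w[term] = tf * idf
        weights.append(w)
    return weights

docs = [["i", "am", "happy"], ["i", "am", "angry"], ["i", "feel", "sad"]]
w = tf_idf(docs)
# "i" appears in every document, so its IDF (and weight) is zero;
# "happy" appears in only one document, so it is weighted highly.
```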
Logistic Regression is a linear classifier; in its multinomial form, it models the probability that a given input belongs to each class using the softmax function.
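Putting the two together, the text pipeline can be sketched with scikit-learn. This is an assumption on our part (the library is not named above), and the toy utterances and labels below are illustrative stand-ins, not MELD data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for MELD utterances and their emotion labels
utterances = [
    "I can't believe you did that!", "This is wonderful news",
    "Leave me alone.", "What a fantastic surprise!",
    "I'm furious about this", "That makes me so happy",
]
labels = ["anger", "joy", "anger", "surprise", "anger", "joy"]

# TF-IDF features feeding a (multinomial) logistic regression classifier
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(utterances, labels)

# One probability per class, summing to 1 via softmax
probs = clf.predict_proba(["I'm so happy for you"])[0]
```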
The presence of images alongside text helps in capturing subtle emotional cues more effectively during emotion analysis. To make use of this, we extracted keyframes from videos present in the MELD dataset — selecting frames where different individuals displayed distinct emotions. These facial images were then mapped to the corresponding utterances and labelled emotions from the textual dataset.
To extract meaningful visual features from each face, we used a Residual Neural Network (ResNet-18). This architecture is composed of stacked 3×3 convolutional layers, each followed by batch normalization and ReLU activation. Towards the end, an adaptive average pooling layer reduces the spatial dimensions of the feature maps to a fixed size, enabling consistent output regardless of input image size. Since our task involved predicting 7 emotion classes (instead of the 1000 classes used in ImageNet), we removed the final fully connected (classification) layer of ResNet-18.
Our emotion recognition model employs late fusion to integrate insights from textual and visual data using the MELD dataset. Text features are extracted using TF-IDF, followed by classification through Logistic Regression, capturing linguistic indicators of emotion. Visual cues, such as facial expressions, are processed using a ResNet architecture, which effectively extracts deep spatial features from images. This modular approach ensures that the strengths of each modality are preserved without interference during feature learning, handles noisy or missing data better, and avoids the complexities of early fusion.
The result is a more accurate and context-aware emotion recognition system leveraging complementary cues from both language and facial expressions.
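Concretely, late fusion can be as simple as a weighted average of the per-class probability vectors produced by each modality. The sketch below uses illustrative fusion weights and made-up probabilities; the report does not specify how the two modalities were weighted:

```python
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def late_fusion(p_text, p_image, w_text=0.6, w_image=0.4):
    """Weighted average of per-modality class probabilities."""
    fused = [w_text * t + w_image * v for t, v in zip(p_text, p_image)]
    total = sum(fused)  # renormalize so the result is a distribution
    return [p / total for p in fused]

# Illustrative outputs: the text model leans 'joy', the face frame 'surprise'
p_text  = [0.05, 0.02, 0.03, 0.60, 0.10, 0.05, 0.15]
p_image = [0.05, 0.05, 0.05, 0.30, 0.10, 0.05, 0.40]

fused = late_fusion(p_text, p_image)
label = EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]
```

Because each modality's classifier is trained independently, a weak or missing signal in one stream (e.g. an occluded face) only down-weights that stream rather than corrupting a jointly learned feature space.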
To evaluate the performance of our emotion recognition system, we conducted experiments across three setups:
The TF-IDF-based model performed reasonably well on shorter utterances and common emotion categories. However, it struggled with context-dependent cues such as sarcasm.
Using ResNet-18 for facial expression classification offered moderate performance. The model was sensitive to facial visibility, lighting, and resolution, limitations inherent to static frame analysis.
By combining predictions from both modalities using a late fusion strategy, we observed a significant boost in classification performance.
This project provided an introduction to both Computer Vision and Natural Language Processing. Through hands-on implementation, we explored key machine learning concepts such as linear and logistic regression, artificial neural networks (ANNs), and loss functions. In the CV module, we learned to process facial images using Convolutional Neural Networks (CNNs) and advanced architectures like ResNet.
On the NLP side, we explored text vectorization techniques, including TF-IDF. The project also introduced multimodal fusion strategies, specifically late fusion, highlighting how diverse modalities can be combined effectively to enhance predictive performance.