Multimodal Emotion Recognition in Conversation through Image and Text Fusion
IEEE Envision Project 2025
- Vanshika Mittal
- Rakshith Ashok Kumar
- Pradyun Diwakar
- Shriya Bharadwaj
- Vashisth Patel
- Deepthi Komar
- Karthikeya Gupta
- Kowndinya Vasudev
To develop a multimodal system using the MELD dataset that integrates textual and visual inputs through a late fusion architecture to perform emotion classification in conversational settings.
Emotions are key to effective communication, influencing interactions and decision-making. This project aims to bridge the gap between humans and machines by recognizing emotions in conversations using both text and facial expressions. Leveraging the MELD dataset, we implement two parallel modules:
- a TF-IDF-based NLP model for dialogue processing
- a ResNet-18-based vision model for facial expression analysis
By combining these through a late fusion strategy, our system achieves more accurate emotion detection.
- Python
- PyTorch
- Streamlit
The MELD (Multimodal EmotionLines Dataset) is a benchmark dataset for emotion recognition in multi-party conversations, derived from the TV show Friends. It contains over 13,000 utterances across 1,400+ dialogues, each labeled with one of seven emotions: anger, disgust, fear, joy, neutral, sadness, or surprise. Each utterance is paired with text, audio, and video, enabling multimodal analysis. MELD retains conversational context and speaker information, making it ideal for studying emotion dynamics in dialogue.
In our project, we use its text and visual components to build a multimodal emotion classification system.
In our project, we utilize the TF-IDF (Term Frequency-Inverse Document Frequency) representation in combination with Logistic Regression to classify the emotional content of dialogue utterances in the MELD dataset.
TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection (or corpus). It balances two components:
- Term Frequency (TF): Measures how frequently a term appears in a document. A higher frequency indicates greater importance.
- Inverse Document Frequency (IDF): Measures how unique or rare a term is across all documents. Rare terms across the corpus receive higher weights.
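The two components above can be computed directly. The sketch below uses plain Python with the unsmoothed formulation tf × log(N/df); libraries such as scikit-learn use slightly different smoothing and normalization conventions, so the exact weights differ, but the intuition is the same:

```python
import math

def tf_idf(corpus):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in corpus:
        w = {}
        for term in doc:
            tf = doc.count(term) / len(doc)    # term frequency in this document
            idf = math.log(n_docs / df[term])  # rare terms get higher weight
            w[term] = tf * idf
        weights.append(w)
    return weights

docs = [["i", "am", "happy"], ["i", "am", "angry"], ["i", "feel", "sad"]]
w = tf_idf(docs)
# "i" appears in every document, so its IDF (and weight) is zero;
# "happy" appears in only one document, so it is weighted highly.
```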
Logistic Regression is a linear classifier; in its multinomial form, it models the probability that a given input belongs to each class using the softmax function.
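Putting the two together, the text pipeline can be sketched with scikit-learn. This is an assumption on our part (the library is not named above), and the toy utterances and labels below are illustrative stand-ins, not MELD data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for MELD utterances and their emotion labels
utterances = [
    "I can't believe you did that!", "This is wonderful news",
    "Leave me alone.", "What a fantastic surprise!",
    "I'm furious about this", "That makes me so happy",
]
labels = ["anger", "joy", "anger", "surprise", "anger", "joy"]

# TF-IDF features feeding a (multinomial) logistic regression classifier
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(utterances, labels)

# One probability per class, summing to 1 via softmax
probs = clf.predict_proba(["I'm so happy for you"])[0]
```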
The presence of images alongside text helps in capturing subtle emotional cues more effectively during emotion analysis. To make use of this, we extracted keyframes from videos present in the MELD dataset — selecting frames where different individuals displayed distinct emotions. These facial images were then mapped to the corresponding utterances and labelled emotions from the textual dataset.
To extract meaningful visual features from each face, we used a Residual Neural Network (ResNet-18). This architecture is composed of stacked 3×3 convolutional layers, each followed by batch normalization and ReLU activation. Towards the end, an adaptive average pooling layer reduces the spatial dimensions of the feature maps to a fixed size, enabling consistent output regardless of input image size. Since our task involved predicting 7 emotion classes (instead of the 1000 classes used in ImageNet), we removed the final fully connected (classification) layer of ResNet-18.
Our emotion recognition model employs late fusion to integrate insights from textual and visual data using the MELD dataset. Text features are extracted using TF-IDF, followed by classification through Logistic Regression, capturing linguistic indicators of emotion. Visual cues, such as facial expressions, are processed using a ResNet architecture, which effectively extracts deep spatial features from images. This modular approach ensures that the strengths of each modality are preserved without interference during feature learning, handles noisy or missing data better, and avoids the complexities of early fusion.
The result is a more accurate and context-aware emotion recognition system leveraging complementary cues from both language and facial expressions.
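Concretely, late fusion can be as simple as a weighted average of the per-class probability vectors produced by each modality. The sketch below uses illustrative fusion weights and made-up probabilities; the report does not specify how the two modalities were weighted:

```python
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def late_fusion(p_text, p_image, w_text=0.6, w_image=0.4):
    """Weighted average of per-modality class probabilities."""
    fused = [w_text * t + w_image * v for t, v in zip(p_text, p_image)]
    total = sum(fused)  # renormalize so the result is a distribution
    return [p / total for p in fused]

# Illustrative outputs: the text model leans 'joy', the face frame 'surprise'
p_text  = [0.05, 0.02, 0.03, 0.60, 0.10, 0.05, 0.15]
p_image = [0.05, 0.05, 0.05, 0.30, 0.10, 0.05, 0.40]

fused = late_fusion(p_text, p_image)
label = EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]
```

Because each modality's classifier is trained independently, a weak or missing signal in one stream (e.g. an occluded face) only down-weights that stream rather than corrupting a jointly learned feature space.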
To evaluate the performance of our emotion recognition system, we conducted experiments across three setups:
The TF-IDF-based model performed reasonably well on shorter utterances and common emotion categories. However, it struggled with context-dependent cues such as sarcasm.
Using ResNet-18 for facial expression classification offered moderate performance. The model was sensitive to facial visibility, lighting, and resolution, limitations inherent to static frame analysis.
By combining predictions from both modalities using a late fusion strategy, we observed a significant boost in classification performance.
This project provided an introduction to both Computer Vision and Natural Language Processing. Through hands-on implementation, we explored key machine learning concepts such as linear and logistic regression, artificial neural networks (ANNs), and loss functions. In the CV module, we learned to process facial images using Convolutional Neural Networks (CNNs) and advanced architectures like ResNet.
On the NLP side, we explored text vectorization techniques, including TF-IDF. The project also introduced multimodal fusion strategies, specifically late fusion, highlighting how diverse modalities can be combined effectively to enhance predictive performance.