
🌐 PolyGlot NLP Engine

Semantic Cross-Lingual Vector Space Alignment & Visualization

PolyGlot NLP is a modern Natural Language Processing system that performs cross-lingual synonym expansion, Part-of-Speech (POS) tagging, and 300-dimensional vector space visualization. It bridges English with Hindi, Kannada, French, Spanish, and German using Facebook's FastText embeddings and Stanford's Stanza library, backed by a high-performance MySQL database.


🚀 Key Features

  • 🧠 Deep Semantic Search: Uses 300-dimensional FastText vectors to find synonyms based on mathematical meaning, not just string matching.
  • 🌍 Multi-Language Support: Translates contextually between English and 5 target languages (Hi, Kn, Fr, Es, De).
  • 📊 Vector Space Visualization: Projects 300D word embeddings into 2D using PCA (Principal Component Analysis) to visualize semantic relationships and language alignment.
  • ⚡ High-Performance Backend: Uses MySQL with SQLAlchemy for fast, indexed lookups across hundreds of thousands of stored vectors.
  • 🤖 AI-Powered Grammar: Uses Stanza (Stanford NLP) for real-time Part-of-Speech (Noun, Verb, Adj) tagging.
  • 🛡️ Robust Architecture: Includes self-healing database migration scripts, batch processing, and duplicate detection.

🛠️ Tech Stack

Backend

  • Language: Python 3.9+
  • Framework: Flask (REST API)
  • Database: MySQL (Storage for 50k+ vectors per language)
  • ORM: SQLAlchemy

AI & NLP

  • Embeddings: Facebook FastText (Wiki Vectors)
  • Tagging: Stanza (Stanford NLP Group)
  • Synonyms: NLTK WordNet
  • Math/Stats: NumPy, Scikit-Learn (PCA), Matplotlib

Frontend

  • UI: HTML5, JavaScript (ES6)
  • Styling: Tailwind CSS (Modern, Responsive)
  • Icons: FontAwesome

📂 Project Structure

```
PolyGlotNLP/
│
├── app.py                   # 🚀 Main Flask Server (Entry Point)
├── setup_database.py        # 🛠️ Database Installer & ETL Pipeline
├── requirements.txt         # 📦 Python Dependencies
├── .gitignore               # Files Git Ignores
├── README.md                # The Standard Summary File
│
├── data/                    # 💾 Raw Asset Storage (Zips, Vectors, Text dictionaries)
│
├── src/                     # 🧠 Core Logic Module
│   ├── database.py          # MySQL Connection Handler
│   ├── models.py            # SQLAlchemy Table Definitions
│   └── nlp_service.py       # The "Brain" (Stanza, PCA Logic, Query Processing)
│
├── static/                  # 🎨 Frontend Logic
│   └── app.js               # API calls, Graph rendering, UI updates
│
├── templates/               # 🖥️ Frontend UI
│   └── index.html           # Main Interface
│
└── tools/                            # 🔧 Developer Utilities
    ├── get_english.py               # Download/Slice English Vectors
    ├── create_real_assets_sliced.py # Download/Slice Target-Language Vectors
    ├── generate_dictionary.py       # Generate dictionaries via Google Translate
    ├── check_db.py                  # Data Validation
    └── ping_db.py                   # Connection Testing
```

⚙️ Installation & Setup

1. Prerequisites

  • Python (v3.8 or higher)
  • MySQL Server (Running locally)

2. Configure Database

  1. Open MySQL Workbench or the command line, and keep the MySQL server running in the background.
  2. Create the database:

     ```sql
     CREATE DATABASE nlp_project CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
     ```

  3. Point Python at the database by setting the connection URL:

     ```python
     SQLALCHEMY_DATABASE_URL = "mysql+pymysql://root:YOUR_PASSWORD@localhost/nlp_project?charset=utf8mb4"
     ```

3. Install Dependencies

```bash
python -m venv venv

# Windows
venv\Scripts\activate
# Mac/Linux
source venv/bin/activate

pip install -r requirements.txt
```

4. Prepare Data Assets

Since raw vector files are huge (6 GB+), we use "slicer" scripts to extract the top 50,000 words.

```bash
# Download and slice target languages (Hindi, Kannada, etc.)
python tools/create_real_assets_sliced.py

# Download and slice English vectors (crucial for graphing)
python tools/get_english.py

# (Optional) Generate the Kannada dictionary if missing
python tools/generate_dictionary.py
```

5. Load the Database

Run the ETL pipeline to ingest vectors and dictionaries into MySQL:

```bash
python setup_database.py
```

Wait for the "🎉 ALL DONE" message in the terminal.

6. Run the Application

```bash
python app.py
```

Then visit http://localhost:5000 in your browser.


🔬 Theoretical Concepts

1. Vector Embeddings (FastText)

Unlike simple dictionaries, this project represents words as Vectors (lists of 300 numbers).

  • Concept: Words that appear in similar contexts have similar numbers.
  • Example: Vector(King) - Vector(Man) + Vector(Woman) ≈ Vector(Queen).
  • Why FastText? It handles sub-word information (n-grams), making it better for morphologically rich languages like Hindi and Kannada than Word2Vec.
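
The famous analogy above can be sketched with toy 2-D vectors and NumPy. This is illustration only: real FastText embeddings are 300-dimensional, and the words, axes, and values here are invented.

```python
import numpy as np

# Toy 2-D embeddings (illustration only -- real FastText vectors are 300-D).
# Axis 0 is roughly "royalty", axis 1 is roughly "maleness".
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
}

def nearest(target, exclude=()):
    """Vocabulary word whose vector is closest (Euclidean) to `target`."""
    candidates = {w: v for w, v in vecs.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

# Vector(King) - Vector(Man) + Vector(Woman) ≈ Vector(Queen)
result = nearest(vecs["king"] - vecs["man"] + vecs["woman"], exclude=("king",))
print(result)  # queen
```

With real embeddings the same arithmetic works because each dimension ends up encoding a soft mixture of semantic features learned from context.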

2. Semantic Alignment (Procrustes Analysis)

We map English vectors and Target vectors into a Shared Vector Space.

  • We use Pre-aligned Vectors provided by Meta Research.
  • These vectors have been rotated using Orthogonal Procrustes Analysis so that the coordinate for Dog (English) is mathematically close to Kutta (Hindi).
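
The rotation itself has a closed-form solution that can be sketched in NumPy on synthetic data. The pre-aligned MUSE vectors ship with this rotation already applied; the matrices below are random stand-ins, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: X holds "English" vectors, Y holds the same words' vectors
# in a target-language space that is just X rotated by a hidden orthogonal
# matrix -- so a perfect alignment exists.
X = rng.normal(size=(100, 5))                      # 5-D instead of 300-D for the demo
Q_true, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # hidden rotation
Y = X @ Q_true

# Orthogonal Procrustes: the orthogonal W minimizing ||XW - Y||_F has the
# closed form W = U V^T, where U S V^T is the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

error = np.linalg.norm(X @ W - Y)
print(f"alignment error after rotation: {error:.2e}")  # ~0
```

Because the rotation is orthogonal, it preserves all distances and angles within each language's space; it only lines the two spaces up.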

3. Dimensionality Reduction (PCA)

We use Principal Component Analysis (PCA) to visualize the data.

  • Problem: We cannot see 300 dimensions.
  • PCA mathematically "squashes" the 300 dimensions down to the 2 most important dimensions (X and Y) while preserving the relative distances between words.
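
This projection is a one-liner with scikit-learn; here is a minimal sketch using random stand-in embeddings in place of the real word vectors:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Stand-in data: 20 "words", each a 300-dimensional embedding.
embeddings = rng.normal(size=(20, 300))

# Squash 300 dimensions down to the 2 directions of greatest variance.
pca = PCA(n_components=2)
points_2d = pca.fit_transform(embeddings)

print(points_2d.shape)  # (20, 2) -- one (x, y) point per word
```

Each row of `points_2d` becomes one plotted point in the visualization.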

4. Semantic Filtering (Cosine Similarity)

To avoid "noisy" synonyms (e.g., Run -> Bank), we calculate the Cosine Similarity between the input word and its synonyms.

  • Formula: cos(θ) = (A · B) / (||A|| ||B||)
  • If Similarity(Input, Synonym) < 0.25, the synonym is discarded.
  • This ensures high-quality, relevant results.
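
A minimal sketch of this filter, using invented 3-D toy vectors (real embeddings are 300-D):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = (A · B) / (||A|| ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_synonyms(input_vec, candidates, threshold=0.25):
    """Keep only candidates whose similarity to the input passes the cut-off."""
    return [word for word, vec in candidates.items()
            if cosine_similarity(input_vec, vec) >= threshold]

# Toy vectors: "sprint" points the same way as "run"; "bank" is near-orthogonal.
run   = np.array([0.9, 0.1, 0.0])
cands = {"sprint": np.array([0.8, 0.2, 0.1]),
         "bank":   np.array([0.0, 0.1, 0.9])}

print(filter_synonyms(run, cands))  # ['sprint']
```

Cosine similarity ignores vector length and compares direction only, so frequent and rare words are scored on equal footing.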

📊 Visualization Guide

When you click "Generate Graph", the system:

  1. Retrieves the 300D vectors for your search results.
  2. Colors English words Blue and Target words Red.
  3. Plots them on a Cartesian plane.

What to look for:

  • Clustering: English words and their translations should appear close together (e.g., "Run" next to "Oodu").
  • Distance: Unrelated concepts (e.g., "King" vs "Computer") should be far apart.
  • Meaning: If synonyms cluster, the AI successfully captures the semantic field.
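
The plotting steps above can be sketched with Matplotlib. The 2-D coordinates below are made up, standing in for the PCA output, and the word pairs are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up 2-D coordinates, standing in for the PCA output of the real system.
english = {"run": (0.10, 0.20), "king": (2.0, 1.5)}
target  = {"oodu": (0.15, 0.25), "raja": (2.1, 1.4)}  # hypothetical translations

fig, ax = plt.subplots()
for words, color, label in [(english, "blue", "English"), (target, "red", "Target")]:
    xs, ys = zip(*words.values())
    ax.scatter(xs, ys, c=color, label=label)  # English blue, Target red
    for word, (x, y) in words.items():
        ax.annotate(word, (x, y))             # label each point with its word

ax.legend()
fig.savefig("alignment_graph.png")            # aligned pairs plot close together
```

In a well-aligned space, each blue point should have its red translation sitting right next to it.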

📜 License

This project uses open-source datasets and libraries:

  • MUSE (Multilingual Unsupervised or Supervised Embeddings): Created by Facebook Research. [License: BSD]
  • WordNet: Created by Princeton University. [License: WordNet 3.0]
  • Stanza: Created by Stanford NLP Group. [License: Apache 2.0]

Intended for Educational and Research Purposes.
