
🌐 PolyGlot NLP Engine

Semantic Cross-Lingual Vector Space Alignment & Visualization

PolyGlot NLP is a modern Natural Language Processing system that performs cross-lingual synonym expansion, Part-of-Speech (POS) tagging, and 300-dimensional vector space visualization. It bridges English with Hindi, Kannada, French, Spanish, and German using Facebook's FastText embeddings and Stanford's Stanza library, backed by a high-performance MySQL database.


🚀 Key Features

  • 🧠 Deep Semantic Search: Uses 300-dimensional FastText vectors to find synonyms based on mathematical meaning, not just string matching.
  • 🌍 Multi-Language Support: Translates contextually between English and 5 target languages (Hi, Kn, Fr, Es, De).
  • 📊 Vector Space Visualization: Projects 300D word embeddings into 2D using PCA (Principal Component Analysis) to visualize semantic relationships and language alignment.
  • ⚡ High-Performance Backend: Uses MySQL with SQLAlchemy for fast, indexed lookups across hundreds of thousands of stored vectors.
  • 🤖 AI-Powered Grammar: Uses Stanza (Stanford NLP) for real-time Part-of-Speech (Noun, Verb, Adj) tagging.
  • 🛡️ Robust Architecture: Includes self-healing database migration scripts, batch processing, and duplicate detection.

🛠️ Tech Stack

Backend

  • Language: Python 3.9+
  • Framework: Flask (REST API)
  • Database: MySQL (Storage for 50k+ vectors per language)
  • ORM: SQLAlchemy

AI & NLP

  • Embeddings: Facebook FastText (Wiki Vectors)
  • Tagging: Stanza (Stanford NLP Group)
  • Synonyms: NLTK WordNet
  • Math/Stats: NumPy, Scikit-Learn (PCA), Matplotlib

Frontend

  • UI: HTML5, JavaScript (ES6)
  • Styling: Tailwind CSS (Modern, Responsive)
  • Icons: FontAwesome

📂 Project Structure

```
PolyGlotNLP/
│
├── app.py                   # 🚀 Main Flask Server (Entry Point)
├── setup_database.py        # 🛠️ Database Installer & ETL Pipeline
├── requirements.txt         # 📦 Python Dependencies
├── .gitignore               # Files Git Ignores
├── README.md                # The Standard Summary File
│
├── data/                    # 💾 Raw Asset Storage (Zips, Vectors, Text dictionaries)
│
├── src/                     # 🧠 Core Logic Module
│   ├── database.py          # MySQL Connection Handler
│   ├── models.py            # SQLAlchemy Table Definitions
│   └── nlp_service.py       # The "Brain" (Stanza, PCA Logic, Query Processing)
│
├── static/                  # 🎨 Frontend Logic
│   └── app.js               # API calls, Graph rendering, UI updates
│
├── templates/               # 🖥️ Frontend UI
│   └── index.html           # Main Interface
│
└── tools/                            # 🔧 Developer Utilities
    ├── get_english.py               # Download/Slice English Vectors
    ├── create_real_assets_sliced.py # Download/Slice Target-Language Vectors
    ├── generate_dictionary.py       # Generate dictionaries via Google Translate
    ├── check_db.py                  # Data Validation
    └── ping_db.py                   # Connection Testing
```

⚙️ Installation & Setup

1. Prerequisites

  • Python (v3.8 or higher)
  • MySQL Server (Running locally)

2. Configure Database

  1. Open MySQL Workbench or the command line, and keep the MySQL server running in the background.
  2. Create the database:

     ```sql
     CREATE DATABASE nlp_project CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
     ```

  3. Point Python at the database by setting the connection URL:

     ```python
     SQLALCHEMY_DATABASE_URL = "mysql+pymysql://root:YOUR_PASSWORD@localhost/nlp_project?charset=utf8mb4"
     ```

3. Install Dependencies

```bash
python -m venv venv

# Windows
venv\Scripts\activate
# Mac/Linux
source venv/bin/activate

pip install -r requirements.txt
```

4. Prepare Data Assets

Since raw vector files are huge (6 GB+), we use "slicer" scripts to extract the top 50,000 words.

```bash
# Download and slice target languages (Hindi, Kannada, etc.)
python tools/create_real_assets_sliced.py

# Download and slice English vectors (crucial for graphing)
python tools/get_english.py

# (Optional) Generate the Kannada dictionary if missing
python tools/generate_dictionary.py
```

5. Load the Database

Run the ETL pipeline to ingest vectors and dictionaries into MySQL:

```bash
python setup_database.py
```

Wait for the "🎉 ALL DONE" message in the terminal.

6. Run the Application

```bash
python app.py
```

Then visit http://localhost:5000 in your browser.


🔬 Theoretical Concepts

1. Vector Embeddings (FastText)

Unlike simple dictionaries, this project represents words as Vectors (lists of 300 numbers).

  • Concept: Words that appear in similar contexts have similar numbers.
  • Example: Vector(King) - Vector(Man) + Vector(Woman) ≈ Vector(Queen).
  • Why FastText? It handles sub-word information (n-grams), making it better for morphologically rich languages like Hindi and Kannada than Word2Vec.
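
The famous analogy above can be sketched with toy 2-D vectors and NumPy. This is illustration only: real FastText embeddings are 300-dimensional, and the words, axes, and values here are invented.

```python
import numpy as np

# Toy 2-D embeddings (illustration only -- real FastText vectors are 300-D).
# Axis 0 is roughly "royalty", axis 1 is roughly "maleness".
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
}

def nearest(target, exclude=()):
    """Vocabulary word whose vector is closest (Euclidean) to `target`."""
    candidates = {w: v for w, v in vecs.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

# Vector(King) - Vector(Man) + Vector(Woman) ≈ Vector(Queen)
result = nearest(vecs["king"] - vecs["man"] + vecs["woman"], exclude=("king",))
print(result)  # queen
```

With real embeddings the same arithmetic works because each dimension ends up encoding a soft mixture of semantic features learned from context.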

2. Semantic Alignment (Procrustes Analysis)

We map English vectors and Target vectors into a Shared Vector Space.

  • We use Pre-aligned Vectors provided by Meta Research.
  • These vectors have been rotated using Orthogonal Procrustes Analysis so that the coordinate for Dog (English) is mathematically close to Kutta (Hindi).
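
The rotation itself has a closed-form solution that can be sketched in NumPy on synthetic data. The pre-aligned MUSE vectors ship with this rotation already applied; the matrices below are random stand-ins, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: X holds "English" vectors, Y holds the same words' vectors
# in a target-language space that is just X rotated by a hidden orthogonal
# matrix -- so a perfect alignment exists.
X = rng.normal(size=(100, 5))                      # 5-D instead of 300-D for the demo
Q_true, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # hidden rotation
Y = X @ Q_true

# Orthogonal Procrustes: the orthogonal W minimizing ||XW - Y||_F has the
# closed form W = U V^T, where U S V^T is the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

error = np.linalg.norm(X @ W - Y)
print(f"alignment error after rotation: {error:.2e}")  # ~0
```

Because the rotation is orthogonal, it preserves all distances and angles within each language's space; it only lines the two spaces up.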

3. Dimensionality Reduction (PCA)

We use Principal Component Analysis (PCA) to visualize the data.

  • Problem: We cannot see 300 dimensions.
  • PCA mathematically "squashes" the 300 dimensions down to the 2 most important dimensions (X and Y) while preserving the relative distances between words.
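
This projection is a one-liner with scikit-learn; here is a minimal sketch using random stand-in embeddings in place of the real word vectors:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Stand-in data: 20 "words", each a 300-dimensional embedding.
embeddings = rng.normal(size=(20, 300))

# Squash 300 dimensions down to the 2 directions of greatest variance.
pca = PCA(n_components=2)
points_2d = pca.fit_transform(embeddings)

print(points_2d.shape)  # (20, 2) -- one (x, y) point per word
```

Each row of `points_2d` becomes one plotted point in the visualization.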

4. Semantic Filtering (Cosine Similarity)

To avoid "noisy" synonyms (e.g., Run -> Bank), we calculate the Cosine Similarity between the input word and its synonyms.

  • Formula: cos(θ) = (A · B) / (||A|| ||B||)
  • If Similarity(Input, Synonym) < 0.25, the synonym is discarded.
  • This ensures high-quality, relevant results.
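
A minimal sketch of this filter, using invented 3-D toy vectors (real embeddings are 300-D):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = (A · B) / (||A|| ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_synonyms(input_vec, candidates, threshold=0.25):
    """Keep only candidates whose similarity to the input passes the cut-off."""
    return [word for word, vec in candidates.items()
            if cosine_similarity(input_vec, vec) >= threshold]

# Toy vectors: "sprint" points the same way as "run"; "bank" is near-orthogonal.
run   = np.array([0.9, 0.1, 0.0])
cands = {"sprint": np.array([0.8, 0.2, 0.1]),
         "bank":   np.array([0.0, 0.1, 0.9])}

print(filter_synonyms(run, cands))  # ['sprint']
```

Cosine similarity ignores vector length and compares direction only, so frequent and rare words are scored on equal footing.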

📊 Visualization Guide

When you click "Generate Graph", the system:

  1. Retrieves the 300D vectors for your search results.
  2. Colors English words Blue and Target words Red.
  3. Plots them on a Cartesian plane.

What to look for:

  • Clustering: English words and their translations should appear close together (e.g., "Run" next to "Oodu").
  • Distance: Unrelated concepts (e.g., "King" vs "Computer") should be far apart.
  • Meaning: If synonyms cluster, the AI successfully captures the semantic field.
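
The plotting steps above can be sketched with Matplotlib. The 2-D coordinates below are made up, standing in for the PCA output, and the word pairs are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up 2-D coordinates, standing in for the PCA output of the real system.
english = {"run": (0.10, 0.20), "king": (2.0, 1.5)}
target  = {"oodu": (0.15, 0.25), "raja": (2.1, 1.4)}  # hypothetical translations

fig, ax = plt.subplots()
for words, color, label in [(english, "blue", "English"), (target, "red", "Target")]:
    xs, ys = zip(*words.values())
    ax.scatter(xs, ys, c=color, label=label)  # English blue, Target red
    for word, (x, y) in words.items():
        ax.annotate(word, (x, y))             # label each point with its word

ax.legend()
fig.savefig("alignment_graph.png")            # aligned pairs plot close together
```

In a well-aligned space, each blue point should have its red translation sitting right next to it.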

📜 License

This project uses open-source datasets and libraries:

  • MUSE (Multilingual Unsupervised or Supervised Embeddings): Created by Facebook Research. [License: BSD]
  • WordNet: Created by Princeton University. [License: WordNet 3.0]
  • Stanza: Created by Stanford NLP Group. [License: Apache 2.0]

Intended for Educational and Research Purposes.
