
🧠 GemmaSense: Hybrid Architecture for Robotics Perception


GemmaSense is a Hybrid Vision-Language System that solves the "Semantic Grounding" problem in robotics. It bridges the gap between fast, local edge inference and deep, cloud-based visual reasoning.

🏗 The Hybrid Brain Architecture

GemmaSense Architecture

🔄 System Lifecycle & Pipeline

GemmaSense Detailed Workflow

GemmaSense follows a rigorous three-phase lifecycle designed for robust robotic deployment:

  1. Phase 1: Training - Local dataset collection, automated labeling via PaliGemma 2, and high-speed LoRA fine-tuning for environment-specific adaptation.
  2. Phase 2: Testing - Visual evaluation and confidence scoring. The system routes queries between the Local Edge (low latency) and the Cloud Brain (Gemini 3.0, deeper reasoning) based on uncertainty thresholds.
  3. Phase 3: Deployment - Live hybrid inference stream feeding into an interactive Semantic Grounding Q&A loop, ultimately generating high-precision robotic action maps.

🌓 Operating Modes

GemmaSense operates in two distinct modes to ensure the robot is never "lost" in a new environment:

  • Local Edge Mode (PaliGemma 2): Quantized 3B model running on-device for known environments. Low latency, zero-shot detection.
  • Cloud Brain Mode (Gemini 3.0): High-level reasoning for environment shifts. Triggered when local confidence is low. Deep contextual understanding and recursive grounding.
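
The confidence-based hand-off between the two modes can be illustrated with a minimal sketch. The threshold value, the local model wrapper, and the cloud model identifier below are illustrative assumptions rather than the actual GemmaSense API:

import os

import google.generativeai as genai
from PIL import Image

CONFIDENCE_THRESHOLD = 0.65  # assumed uncertainty threshold

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
cloud_model = genai.GenerativeModel("gemini-1.5-pro")  # placeholder model ID

def ground_frame(frame: Image.Image, local_model) -> str:
    """Route a frame: try the local edge model first, escalate on low confidence."""
    caption, confidence = local_model.describe(frame)  # hypothetical local API
    if confidence >= CONFIDENCE_THRESHOLD:
        return caption  # Local Edge Mode: fast, on-device answer
    # Cloud Brain Mode: send the same frame for deep contextual reasoning
    response = cloud_model.generate_content(
        ["Describe this scene for a robot and flag any safety-relevant object states.", frame]
    )
    return response.text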

✨ Features

  • Semantic Grounding: Detects objects and simultaneously queries the user for their "state" (e.g., Is that soldering iron hot? Is that glass beaker fragile?).
  • Context Transfer: Allows a robot trained for a "Kitchen" to understand a "Workshop" by leveraging high-level cloud reasoning.
  • LoRA Support: Includes a built-in training engine to fine-tune local models on your own specialized datasets.
  • Interactive UI: A futuristic HUD for human-in-the-loop verification of robotic world models.

🚀 Getting Started

📦 Installation

git clone https://github.com/pgeedh/GemmaSense
cd GemmaSense

# Setup Python environment
pip install -r requirements.txt

# (Optional) Add your Gemini API Key for Cloud Mode
export GEMINI_API_KEY="your_key_here"

# Start the engine
chmod +x start.sh
./start.sh

🧠 Training Locally

To tune the model for your specific kitchen or lab:

  1. Place 10-20 images in dataset/images/.
  2. Run python3 auto_label.py to bootstrap descriptions.
  3. Run python3 train_engine.py to generate your local LoRA adapter.
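
The internals of train_engine.py are not reproduced here, but a LoRA adapter for a PaliGemma checkpoint can be attached with the Hugging Face peft library along these lines; the base model ID, target modules, and hyperparameters are illustrative assumptions:

import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from peft import LoraConfig, get_peft_model

model_id = "google/paligemma2-3b-pt-224"  # assumed base checkpoint
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter is updated during training

Only the adapter weights need to be saved afterwards (e.g. with model.save_pretrained), which keeps environment-specific fine-tunes small enough to swap in and out on the edge device.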


🦾 Robotic Stack Integration (LeRobot / SO-101)

GemmaSense is designed to be plugged directly into high-level robotic control stacks such as LeRobot. It acts as a "Cognitive Filter" that sits between the low-level camera feed and the motion policy execution.

Architectural Flow:

LeRobot Integration Flow

🛠 Integration Steps:

  1. Initialize the Node: Import the GemmaSenseNode into your LeRobot policy script.
  2. Context Injection: Before the arm executes a pick or place command, pass the current camera frame to the node.
  3. Semantic Verification: The node analyzes the object (e.g., "glass") for contextual safety (Is it fragile? Is it full?).
  4. Human-in-the-Loop (HITL): If uncertainty exists, the system triggers an inquiry.
  5. Motion Execution: The robot only proceeds once the semantic context is grounded and verified.

Headless Integration Example:

from gemma_sense_node import GemmaSenseNode
from PIL import Image

# Initialize the engine
node = GemmaSenseNode()

# Capture frame from your robot's camera
frame = Image.open("robot_view.jpg")

# Get contextual grounding before executing a move
context = node.ground_objects(frame, target_object="glass")

if context['safety_warning']:
    print(f"ROBOT HALT: {context['recommended_inquiry']}")
    # Trigger voice prompt or wait for user confirmation
else:
    # Execute SO-101 movement policy
    pass

📁 Repository Structure

  • backend/: FastAPI server managing the PaliGemma/Gemini hybrid logic.
  • frontend/: Real-time dashboard for visual grounding.
  • train_engine.py: Native PyTorch loop for high-speed local fine-tuning.
  • auto_label.py: Utility to auto-generate training data using pre-trained models.
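
As a rough illustration of how the backend/ server could expose the hybrid logic over HTTP, a grounding endpoint might look like the following; the route name, request fields, and response shape are assumptions based on the headless example above, not the repository's actual API:

import io

from fastapi import FastAPI, File, Form, UploadFile
from PIL import Image

from gemma_sense_node import GemmaSenseNode

app = FastAPI()
node = GemmaSenseNode()

@app.post("/ground")
async def ground(image: UploadFile = File(...), target_object: str = Form("glass")):
    # Decode the uploaded frame and run the same grounding call used headlessly
    frame = Image.open(io.BytesIO(await image.read()))
    context = node.ground_objects(frame, target_object=target_object)
    return context  # e.g. {"safety_warning": ..., "recommended_inquiry": ...}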

“Bridging the gap between seeing and understanding, from the edge to the cloud.”

