GemmaSense is a Hybrid Vision-Language System that solves the "Semantic Grounding" problem in robotics. It bridges the gap between fast, local edge inference and deep, cloud-based visual reasoning.
GemmaSense follows a rigorous three-phase lifecycle designed for robust robotic deployment:
- Phase 1: Training - Local dataset collection, automated labeling via PaliGemma 2, and high-speed LoRA fine-tuning for environment-specific adaptation.
- Phase 2: Testing - Visual evaluation and confidence scoring. The system routes each query between the Local Edge (low latency) and the Cloud Brain (Gemini 3.0, deep reasoning) based on uncertainty thresholds.
- Phase 3: Deployment - Live hybrid inference stream feeding into an interactive Semantic Grounding Q&A loop, ultimately generating high-precision robotic action maps.
GemmaSense operates in two distinct modes to ensure the robot is never "lost" in a new environment:
- Local Edge Mode (PaliGemma 2): Quantized 3B model running on-device for known environments. Low latency, zero-shot detection.
- Cloud Brain Mode (Gemini 3.0): High-level reasoning for environment shifts. Triggered when local confidence is low. Deep contextual understanding and recursive grounding.
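A minimal version of that hand-off can be written in a few lines. The sketch below is purely illustrative: `route_query`, the model objects, and the 0.65 threshold are assumptions for illustration, not the shipped backend code.

# Illustrative routing sketch -- the helper names and threshold are assumptions,
# not the actual backend implementation.
CONFIDENCE_THRESHOLD = 0.65  # assumed cutoff between edge and cloud

def route_query(frame, local_model, cloud_model):
    # Fast path: quantized PaliGemma 2 running on-device
    detections, confidence = local_model.detect(frame)
    if confidence >= CONFIDENCE_THRESHOLD:
        return detections, "local_edge"
    # Low local confidence: escalate to Gemini for deep contextual reasoning
    return cloud_model.ground(frame, hints=detections), "cloud_brain"

The threshold is the main knob: raising it sends more frames to the cloud for deeper reasoning at the cost of latency.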
- Semantic Grounding: Detects objects and simultaneously queries the user for their "state" (e.g., Is that soldering iron hot? Is that glass beaker fragile?). A minimal sketch of this Q&A loop appears right after this list.
- Context Transfer: Allows a robot trained for a "Kitchen" to understand a "Workshop" by leveraging high-level cloud reasoning.
- LoRA Support: Includes a built-in training engine to fine-tune local models on your own specialized datasets.
- Interactive UI: A futuristic HUD for human-in-the-loop verification of robotic world models.
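As a concrete illustration of that grounding loop, the snippet below reuses the `GemmaSenseNode` API from the integration example further down; looping over candidate objects and building a `world_model` dict are assumptions added here for illustration.

# Conceptual Q&A loop -- only ground_objects, safety_warning and recommended_inquiry
# come from the integration example below; everything else is illustrative.
from gemma_sense_node import GemmaSenseNode
from PIL import Image

node = GemmaSenseNode()
frame = Image.open("robot_view.jpg")

world_model = {}
for obj in ["soldering iron", "glass beaker"]:
    context = node.ground_objects(frame, target_object=obj)
    if context['safety_warning']:
        # Ask the human for the object's state before the robot acts on it.
        answer = input(f"{context['recommended_inquiry']} ")
        world_model[obj] = {"state": answer, "verified_by_human": True}
    else:
        world_model[obj] = {"state": "nominal", "verified_by_human": False}

print(world_model)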
git clone https://github.com/pgeedh/GemmaSense
cd GemmaSense
# Setup Python environment
pip install -r requirements.txt
# (Optional) Add your Gemini API Key for Cloud Mode
export GEMINI_API_KEY="your_key_here"
# Start the engine
chmod +x start.sh
./start.sh
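Once the key is exported, Cloud Brain requests typically go through the official google-generativeai client. The snippet below only sketches that pattern; the model id, prompt, and wiring are placeholders rather than the backend's exact calls.

# Sketch of how GEMINI_API_KEY could be consumed for a Cloud Brain query.
# The model id below is a placeholder, not necessarily what the backend targets.
import os
import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")  # placeholder model id

frame = Image.open("robot_view.jpg")
response = model.generate_content(
    ["List every object in this scene and flag anything fragile or hot.", frame]
)
print(response.text)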
To tune the model for your specific kitchen or lab:
- Place 10-20 images in `dataset/images/`.
- Run `python3 auto_label.py` to bootstrap descriptions.
- Run `python3 train_engine.py` to generate your local LoRA adapter.
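For orientation, one common way to attach a LoRA adapter to PaliGemma 2 uses Hugging Face transformers and peft, as sketched below; the checkpoint id, rank, and target modules are assumptions, and `train_engine.py` itself may do this differently with its native PyTorch loop.

# Illustrative LoRA setup -- checkpoint, rank and target modules are assumptions;
# see train_engine.py for the actual training loop.
import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from peft import LoraConfig, get_peft_model

MODEL_ID = "google/paligemma2-3b-pt-224"  # assumed base checkpoint

processor = PaliGemmaProcessor.from_pretrained(MODEL_ID)
model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections only.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# A standard PyTorch loop over (image, caption) pairs from dataset/images/ then
# trains only the adapter weights and saves them for local inference:
# model.save_pretrained("adapters/my_environment")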
GemmaSense is designed to be plugged directly into high-level robotic control stacks such as LeRobot. It acts as a "Cognitive Filter" that sits between the low-level camera feed and the motion policy execution.
- Initialize the Node: Import the `GemmaSenseNode` into your LeRobot policy script.
- Context Injection: Before the arm executes a `pick` or `place` command, pass the current camera frame to the node.
- Semantic Verification: The node analyzes the object (e.g., "glass") for contextual safety (is it fragile? is it full?).
- Human-in-the-Loop (HITL): If uncertainty exists, the system triggers an inquiry.
- Motion Execution: The robot only proceeds once the semantic context is grounded and verified.
from gemma_sense_node import GemmaSenseNode
from PIL import Image
# Initialize the engine
node = GemmaSenseNode()
# Capture frame from your robot's camera
frame = Image.open("robot_view.jpg")
# Get contextual grounding before executing a move
context = node.ground_objects(frame, target_object="glass")
if context['safety_warning']:
    print(f"ROBOT HALT: {context['recommended_inquiry']}")
    # Trigger voice prompt or wait for user confirmation
else:
    # Execute SO-101 movement policy
    pass

Repository layout:
- `backend/`: FastAPI server managing the PaliGemma/Gemini hybrid logic (sketched below).
- `frontend/`: Real-time dashboard for visual grounding.
- `train_engine.py`: Native PyTorch loop for high-speed local fine-tuning.
- `auto_label.py`: Utility to auto-generate training data using pre-trained models.
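To show how that backend might expose the hybrid logic, here is a rough FastAPI sketch; the `/ground` route and payload shape are hypothetical, not the actual `backend/` API.

# Hypothetical endpoint sketch -- route name and payload are illustrative only.
import io
from fastapi import FastAPI, File, Form, UploadFile
from PIL import Image
from gemma_sense_node import GemmaSenseNode

app = FastAPI()
node = GemmaSenseNode()

@app.post("/ground")
async def ground(image: UploadFile = File(...), target_object: str = Form("unknown")):
    # Decode the uploaded camera frame and run hybrid grounding on it.
    frame = Image.open(io.BytesIO(await image.read()))
    return node.ground_objects(frame, target_object=target_object)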
“Bridging the gap between seeing and understanding, from the edge to the cloud.”


