feat: add multimodal image support to chat input#2305

Open
riyo264 wants to merge 1 commit into arc53:main from riyo264:multimodal-clean

Conversation


@riyo264 riyo264 commented Mar 16, 2026

🚀 feat: Multimodal Vision Support (Gemini & OpenAI)

📝 Summary
This PR introduces multimodal capabilities to DocsGPT, allowing users to upload images alongside their text queries. The system can now analyze visual data (diagrams, screenshots, etc.) while leveraging the existing RAG pipeline to provide context-aware answers.

🛠️ Key Changes

  1. Backend (Python/Flask)
    `multimodal_service.py`: A new centralized service to handle routing between Google Gemini and OpenAI Vision models.

Docling/RAG Integration: The service explicitly uses the docs_together context retrieved from the Docling pipeline, ensuring visual analysis is grounded in the provided documentation.
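As a minimal sketch of this grounding step (the function name and message shape are assumptions for illustration; only `docs_together` is named in this PR), the retrieved context can be injected into the system message while the image travels alongside the user's question:

```python
# Hypothetical sketch: pair retrieved Docling context with an uploaded image.
# Only docs_together comes from the PR; other names are illustrative.
def build_multimodal_messages(question: str, docs_together: str,
                              image_base64: str) -> list[dict]:
    """Build a provider-agnostic message list grounding vision in RAG context."""
    system = (
        "Answer using the provided documentation context. "
        "If the image contradicts the context, say so.\n\n"
        f"Context:\n{docs_together}"
    )
    return [
        {"role": "system", "content": system},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
            ],
        },
    ]
```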

API Normalization: Added normalize_question_payload to handle both camelCase and snake_case keys and support legacy JSON-string formats for backward compatibility.
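A sketch of what this normalization might look like (the alias table below is an assumption; the PR only states that camelCase, snake_case, and legacy JSON-string payloads are all accepted):

```python
import json

def normalize_question_payload(payload):
    """Normalize incoming payloads to snake_case dicts.

    Accepts camelCase keys, snake_case keys, and the legacy
    JSON-string format (hypothetical sketch of the behavior above).
    """
    if isinstance(payload, str):  # legacy JSON-string format
        payload = json.loads(payload)
    # Illustrative alias table; the real mapping may differ.
    aliases = {"imageBase64": "image_base64", "conversationId": "conversation_id"}
    return {aliases.get(key, key): value for key, value in payload.items()}
```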

Stable API Handshaking: Targeted the Gemini v1 stable endpoint to ensure production reliability and prevent v1beta handshake errors.

  2. Frontend (React/Redux)
    State Persistence: Updated `sharedConversationSlice.ts` and `conversationSlice.ts` to store `imageBase64` within the Query objects. This ensures that uploaded images persist in the chat history and remain visible when switching between conversations.

API Handlers: Updated `conversationHandlers.ts` to pass multimodal data through both standard and streaming paths.

Models: Expanded the `RetrievalPayload` and `Query` interfaces in `conversationModels.ts` to support image data.

  3. Dependencies
    Updated `requirements.txt` to include `langchain-google-genai>=1.0.0`.

⚙️ Environment Configuration
To enable this feature, the following variables should be added to the `.env` file:

`GOOGLE_API_KEY`: Required for Gemini support.

`LLM_PROVIDER`: Set to `google` or `openai`.

`LLM_NAME`: Set to a vision-capable model (e.g., `gemini-1.5-flash`).
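Putting the three variables together, a Gemini-backed `.env` would look like this (the API key value is a placeholder):

```
GOOGLE_API_KEY=your-key-here
LLM_PROVIDER=google
LLM_NAME=gemini-1.5-flash
```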

🧪 Testing & Verification
Full-Stack Data Flow: Confirmed that images are correctly encoded to Base64 on the frontend and decoded in the service layer.
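The encode/decode round trip being verified here reduces to standard Base64 handling; a minimal sketch of the service-layer side (function name is illustrative, not from the PR):

```python
import base64

def decode_image(image_base64: str) -> bytes:
    """Decode the Base64 string produced on the frontend back into raw bytes."""
    return base64.b64decode(image_base64)

# Round trip: the frontend encodes raw image bytes, the service decodes them.
raw = b"\x89PNG\r\n\x1a\n"  # the first bytes of any PNG file
encoded = base64.b64encode(raw).decode("ascii")
assert decode_image(encoded) == raw
```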

Isolation: Verified that the multimodal path only triggers when an image is present, ensuring zero impact on standard text-only RAG or OCR/Docling ingestion paths.
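This isolation guarantee amounts to a simple dispatch on the presence of image data; a hypothetical sketch (names are illustrative):

```python
def route_query(payload: dict) -> str:
    """Take the multimodal path only when an image is actually present,
    leaving the standard text-only RAG path untouched (illustrative sketch)."""
    if payload.get("image_base64"):
        return "multimodal"
    return "text_rag"
```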

State Integrity: Confirmed that images remain associated with their respective queries when navigating the conversation history.

This change was needed to:

  • Bridge the Visual Gap: Allow users to troubleshoot "real-world" scenarios by uploading screenshots of error messages or configurations.

  • Enhance Docling Context: Complement the Docling ingestion pipeline by allowing the LLM to perform native visual reasoning on diagrams that are difficult to represent through OCR alone.

  • Future-Proof the Architecture: Update the project's state management and routing "plumbing" to handle multimodal data (Base64/MIME), setting the foundation for future media-rich interactions.

Note:

During local testing, some 404 NOT_FOUND errors were observed due to local API key/region versioning quirks (v1beta). This has been mitigated in the code by explicitly targeting the stable v1 endpoint, which is the standard for production environments.

Fixes #1451


vercel bot commented Mar 16, 2026

@riyo264 is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.



Development

Successfully merging this pull request may close these issues:

🚀 Feature: Extract and process images from source uploads
