feat: add multimodal image support to chat input#2305

Open
riyo264 wants to merge 1 commit into arc53:main from riyo264:multimodal-clean

Conversation


@riyo264 riyo264 commented Mar 16, 2026

🚀 feat: Multimodal Vision Support (Gemini & OpenAI)

📝 Summary
This PR introduces multimodal capabilities to DocsGPT, allowing users to upload images alongside their text queries. The system can now analyze visual data (diagrams, screenshots, etc.) while leveraging the existing RAG pipeline to provide context-aware answers.

🛠️ Key Changes

  1. Backend (Python/Flask)
    `multimodal_service.py`: A new centralized service to handle routing between Google Gemini and OpenAI Vision models.

Docling/RAG Integration: The service explicitly uses the docs_together context retrieved from the Docling pipeline, ensuring visual analysis is grounded in the provided documentation.
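As a minimal sketch of this grounding step (the function name and message shape are assumptions for illustration; only `docs_together` is named in this PR), the retrieved context can be injected into the system message while the image travels alongside the user's question:

```python
# Hypothetical sketch: pair retrieved Docling context with an uploaded image.
# Only docs_together comes from the PR; other names are illustrative.
def build_multimodal_messages(question: str, docs_together: str,
                              image_base64: str) -> list[dict]:
    """Build a provider-agnostic message list grounding vision in RAG context."""
    system = (
        "Answer using the provided documentation context. "
        "If the image contradicts the context, say so.\n\n"
        f"Context:\n{docs_together}"
    )
    return [
        {"role": "system", "content": system},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
            ],
        },
    ]
```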

API Normalization: Added normalize_question_payload to handle both camelCase and snake_case keys and support legacy JSON-string formats for backward compatibility.
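A sketch of what this normalization might look like (the alias table below is an assumption; the PR only states that camelCase, snake_case, and legacy JSON-string payloads are all accepted):

```python
import json

def normalize_question_payload(payload):
    """Normalize incoming payloads to snake_case dicts.

    Accepts camelCase keys, snake_case keys, and the legacy
    JSON-string format (hypothetical sketch of the behavior above).
    """
    if isinstance(payload, str):  # legacy JSON-string format
        payload = json.loads(payload)
    # Illustrative alias table; the real mapping may differ.
    aliases = {"imageBase64": "image_base64", "conversationId": "conversation_id"}
    return {aliases.get(key, key): value for key, value in payload.items()}
```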

Stable API Handshaking: Targeted the Gemini v1 stable endpoint to ensure production reliability and prevent v1beta handshake errors.

  2. Frontend (React/Redux)
    State Persistence: Updated `sharedConversationSlice.ts` and `conversationSlice.ts` to store `imageBase64` within the Query objects. This ensures that uploaded images persist in the chat history and remain visible when switching between conversations.

API Handlers: Updated `conversationHandlers.ts` to pass multimodal data through both standard and streaming paths.

Models: Expanded the `RetrievalPayload` and `Query` interfaces in `conversationModels.ts` to support image data.

  3. Dependencies
    Updated `requirements.txt` to include `langchain-google-genai>=1.0.0`.

⚙️ Environment Configuration
To enable this feature, the following variables should be added to the `.env` file:

`GOOGLE_API_KEY`: Required for Gemini support.

`LLM_PROVIDER`: Set to `google` or `openai`.

`LLM_NAME`: Set to a vision-capable model (e.g., `gemini-1.5-flash`).
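Putting the three variables together, a Gemini-backed `.env` would look like this (the API key value is a placeholder):

```
GOOGLE_API_KEY=your-key-here
LLM_PROVIDER=google
LLM_NAME=gemini-1.5-flash
```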

🧪 Testing & Verification
Full-Stack Data Flow: Confirmed that images are correctly encoded to Base64 on the frontend and decoded in the service layer.
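The encode/decode round trip being verified here reduces to standard Base64 handling; a minimal sketch of the service-layer side (function name is illustrative, not from the PR):

```python
import base64

def decode_image(image_base64: str) -> bytes:
    """Decode the Base64 string produced on the frontend back into raw bytes."""
    return base64.b64decode(image_base64)

# Round trip: the frontend encodes raw image bytes, the service decodes them.
raw = b"\x89PNG\r\n\x1a\n"  # the first bytes of any PNG file
encoded = base64.b64encode(raw).decode("ascii")
assert decode_image(encoded) == raw
```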

Isolation: Verified that the multimodal path only triggers when an image is present, ensuring zero impact on standard text-only RAG or OCR/Docling ingestion paths.
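This isolation guarantee amounts to a simple dispatch on the presence of image data; a hypothetical sketch (names are illustrative):

```python
def route_query(payload: dict) -> str:
    """Take the multimodal path only when an image is actually present,
    leaving the standard text-only RAG path untouched (illustrative sketch)."""
    if payload.get("image_base64"):
        return "multimodal"
    return "text_rag"
```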

State Integrity: Confirmed that images remain associated with their respective queries when navigating the conversation history.

This change was needed to:

  • Bridge the Visual Gap: Allow users to troubleshoot "real-world" scenarios by uploading screenshots of error messages or configurations.

  • Enhance Docling Context: Complement the Docling ingestion pipeline by allowing the LLM to perform native visual reasoning on diagrams that are difficult to represent through OCR alone.

  • Future-Proof the Architecture: Update the project's state management and routing "plumbing" to handle multimodal data (Base64/MIME), setting the foundation for future media-rich interactions.

Note:

During local testing, some 404 NOT_FOUND errors were observed due to local API key/region versioning quirks (v1beta). This has been mitigated in the code by explicitly targeting the stable v1 endpoint, which is the standard for production environments.

Fixes #1451


vercel bot commented Mar 16, 2026

@riyo264 is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.



Development

Successfully merging this pull request may close these issues:

🚀 Feature: Extract and process images from source uploads
