feat: add multimodal image support to chat input #2305
Open · riyo264 wants to merge 1 commit into arc53:main
🚀 feat: Multimodal Vision Support (Gemini & OpenAI)
📝 Summary
This PR introduces multimodal capabilities to DocsGPT, allowing users to upload images alongside their text queries. The system can now analyze visual data (diagrams, screenshots, etc.) while leveraging the existing RAG pipeline to provide context-aware answers.
🛠️ Key Changes
- `multimodal_service.py`: A new centralized service that routes requests between Google Gemini and OpenAI vision models (see the sketches after this list).
- Docling/RAG Integration: The service explicitly uses the `docs_together` context retrieved from the Docling pipeline, ensuring visual analysis is grounded in the provided documentation.
- API Normalization: Added `normalize_question_payload` to accept both camelCase and snake_case keys and to support legacy JSON-string payloads for backward compatibility.
- Stable API Targeting: Pinned requests to the Gemini v1 stable endpoint to ensure production reliability and avoid v1beta handshake errors.
- State Persistence: Updated `sharedConversationSlice.ts` and `conversationSlice.ts` to store `imageBase64` within the `Query` objects, so uploaded images persist in the chat history and remain visible when switching between conversations.
- API Handlers: Updated `conversationHandlers.ts` to pass multimodal data through both the standard and streaming paths.
- Models: Expanded the `RetrievalPayload` and `Query` interfaces in `conversationModels.ts` to carry image data.
- Dependencies: Updated `requirements.txt` to include `langchain-google-genai>=1.0.0`.
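For illustration, here is a minimal sketch of what the payload normalization might look like. The alias table and the legacy JSON-string handling are assumptions based on the description above, not a copy of the actual implementation:

```python
import json
from typing import Any


def normalize_question_payload(payload: Any) -> dict:
    """Hypothetical sketch: accept camelCase keys, snake_case keys,
    or a legacy JSON-encoded string, and return one canonical dict."""
    # Legacy clients may send the whole payload as a JSON string.
    if isinstance(payload, str):
        payload = json.loads(payload)

    # Map both naming conventions onto canonical snake_case keys.
    aliases = {
        "imageBase64": "image_base64",
        "conversationId": "conversation_id",
    }
    return {aliases.get(key, key): value for key, value in payload.items()}
```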
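And a sketch of the provider routing with the Docling context injected into the vision prompt. The URL is the public Gemini v1 `generateContent` REST path; the function name, prompt wording, and default MIME type are illustrative assumptions rather than the actual `multimodal_service.py` code:

```python
import os
import requests

# v1 (stable), not v1beta, to avoid the handshake errors noted below.
GEMINI_V1_URL = (
    "https://generativelanguage.googleapis.com/v1/models/"
    "{model}:generateContent?key={key}"
)


def answer_with_vision(question: str, image_base64: str, docs_together: str) -> str:
    """Hypothetical routing sketch: ground the visual query in the
    retrieved Docling context, then dispatch to the configured provider."""
    prompt = (
        f"Use the following documentation context:\n{docs_together}\n\n"
        f"Question: {question}"
    )
    provider = os.environ.get("LLM_PROVIDER", "google")
    if provider == "google":
        body = {
            "contents": [{
                "parts": [
                    {"text": prompt},
                    # MIME type is assumed; a real service would derive it
                    # from the uploaded file.
                    {"inline_data": {"mime_type": "image/png",
                                     "data": image_base64}},
                ]
            }]
        }
        url = GEMINI_V1_URL.format(
            model=os.environ.get("LLM_NAME", "gemini-1.5-flash"),
            key=os.environ["GOOGLE_API_KEY"],
        )
        resp = requests.post(url, json=body, timeout=60)
        resp.raise_for_status()
        return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
    # The OpenAI vision path would mirror this with an image_url content part.
    raise NotImplementedError(f"unsupported provider: {provider}")
```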
⚙️ Environment Configuration
To enable this feature, add the following variables to the `.env` file (example below):
- `GOOGLE_API_KEY`: Required for Gemini support.
- `LLM_PROVIDER`: Set to `google` or `openai`.
- `LLM_NAME`: Set to a vision-capable model (e.g., `gemini-1.5-flash`).
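A minimal example `.env` (the API key value is a placeholder):

```
GOOGLE_API_KEY=your-api-key-here
LLM_PROVIDER=google
LLM_NAME=gemini-1.5-flash
```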
🧪 Testing & Verification
- Full-Stack Data Flow: Confirmed that images are correctly encoded to Base64 on the frontend and decoded in the service layer (see the sketch after this list).
- Isolation: Verified that the multimodal path only triggers when an image is present, ensuring zero impact on standard text-only RAG or the OCR/Docling ingestion paths.
- State Integrity: Confirmed that images remain associated with their respective queries when navigating the conversation history.
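As an illustration of the decode half of that flow, here is a sketch of how the service layer might turn a frontend data URL back into raw bytes. The data-URL prefix handling and the default MIME type are assumptions, not the actual helper:

```python
import base64


def decode_image(image_base64: str) -> tuple[bytes, str]:
    """Hypothetical helper: strip an optional data-URL prefix
    (e.g. 'data:image/png;base64,...') and decode the payload."""
    mime_type = "image/png"  # assumed default when no prefix is present
    if image_base64.startswith("data:"):
        header, _, image_base64 = image_base64.partition(",")
        mime_type = header[len("data:"):].split(";")[0]
    return base64.b64decode(image_base64), mime_type
```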
This change was needed to:
- Bridge the Visual Gap: Allow users to troubleshoot "real-world" scenarios by uploading screenshots of error messages or configurations.
- Enhance Docling Context: Complement the Docling ingestion pipeline by letting the LLM perform native visual reasoning on diagrams that are difficult to represent through OCR alone.
- Future-Proof the Architecture: Update the project's state management and routing "plumbing" to handle multimodal data (Base64/MIME), setting the foundation for future media-rich interactions.
Note: During local testing, some 404 NOT_FOUND errors were observed due to local API key/region versioning quirks with v1beta. This has been mitigated in the code by explicitly targeting the stable v1 endpoint, which is the standard for production environments.
Fixes #1451