Project 1: Build a GUI Agent with local LLM/VLM and OpenVINO #34948
Replies: 3 comments
-
Hey @openvino-dev-samples and @zhuo-yoyowz, thank you very much for your time. I am very excited, as this is the project I have been working on for the past year, so it is a dream come true for me. If you could please guide me, I would be really grateful.
-
Hi @Rishabh-Sahni-0809, thanks for your engagement.
3. I would recommend focusing on a few specific scenarios in this project and optimizing their user experience with a local model. Do not expect a local model to be able to do everything.
-
Hey @openvino-dev-samples and @zhuo-yoyowz, could you kindly tell me which other parts need fine-tuning?
-
Hi @openvino-dev-samples and @zhuo-yoyowz ,
I've been following this thread and wanted to share both my understanding of the project and what I've already built toward it.
My understanding of the project:
At its core, this is about building a desktop agent that perceives the screen as a human does — not through accessibility trees or hardcoded selectors, but by visually understanding the UI through a VLM. The LLM handles reasoning and goal decomposition while the VLM provides grounded visual perception. Together they enable an iterative Plan → Execute → Observe → Update loop that can handle dynamic, real-world interfaces. The OpenVINO requirement means all inference must run locally on CPU — no cloud APIs, reproducible and low-latency.
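To make the iterative loop concrete, here is a minimal Python sketch of the Plan → Execute → Observe → Update cycle described above. All of the callable names (`perceive`, `plan`, `execute`, `verify`) and the `Step` schema are illustrative assumptions for this sketch, not code from the actual project:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str    # e.g. "click", "type" (assumed action vocabulary)
    target: str    # element label the VLM should ground on screen
    expected: str  # description of the expected post-action state

def run_goal(goal, perceive, plan, execute, verify, max_replans=3):
    """Iterative Plan -> Execute -> Observe -> Update loop (sketch).

    perceive(): capture and describe the current screen
    plan(goal, obs): decompose the goal into a list of Steps
    execute(step): perform one UI action
    verify(step, obs): True if the expected state was reached
    """
    for _ in range(max_replans):
        obs = perceive()                  # Observe current screen state
        steps = plan(goal, obs)           # Plan: goal -> subtasks
        for step in steps:
            execute(step)                 # Execute one action
            obs = perceive()              # Re-capture the screen
            if not verify(step, obs):     # Update: mismatch -> re-plan
                break
        else:
            return True                   # every step verified
    return False                          # gave up after max_replans
```

The key design point is the `else` on the inner `for` loop: success is only declared when every step's post-state is verified against a fresh screen capture, and any mismatch falls back to re-planning with the latest observation.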
What I have already built (Digital Dave):
I have a working Python desktop agent I've been developing for the past year, evolving from a basic voice assistant into a full agentic system. It currently includes:
I've attached an architecture diagram showing the current pipeline and exactly where the OpenVINO VLM layer and new modules slot in.
Core proposal (GSoC deliverables):
The OCR-based perception layer is the key bottleneck I want to replace. My plan is to swap Tesseract for an OpenVINO-accelerated VLM (SmolVLM or nanoLLaVA, quantized INT8, CPU-only), outputting structured JSON with element labels, roles, and grounded (x, y) coordinates. On top of this I will build a proper agentic planner that decomposes a natural language goal into subtasks, executes them, verifies success by re-capturing the screen, and re-plans if the expected state was not reached.
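As a rough illustration of the structured-JSON perception output described above, here is a small sketch of how the VLM's response might be parsed into grounded UI elements and queried for a click target. The field names (`label`, `role`, `x`, `y`) and both helper functions are hypothetical assumptions about the schema, not the project's actual format:

```python
import json

def parse_elements(vlm_output: str):
    """Parse a VLM JSON response into a list of UI elements.

    Each element is expected to carry a label, a role, and grounded
    (x, y) screen coordinates; entries missing any field are skipped.
    """
    elements = []
    for item in json.loads(vlm_output).get("elements", []):
        if {"label", "role", "x", "y"} <= item.keys():
            elements.append(item)
    return elements

def find_target(elements, label):
    """Return the (x, y) click point for a labeled element, or None."""
    for el in elements:
        if el["label"].lower() == label.lower():
            return (el["x"], el["y"])
    return None
```

Keeping the perception output in a schema like this means the planner never touches pixels directly: it reasons over labels and roles, and the grounded coordinates are only consumed at execution time.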
Deliverables:
Stretch goals (if time permits):
I believe starting from a working codebase rather than from scratch puts me in a position to deliver meaningfully more during the coding period. I am happy to share the full repository and walk through any part of the design.
Two questions for the mentors:
Below are screenshots of the current UI and app that I am trying to improve, plus a YouTube video link from an earlier phase of the same project so you can also view the features. It started as a normal voice assistant, but I have since made it run fully locally and added many improvements, as shown in the GitHub repo below.
Please do correct me if I've misunderstood anything or if you'd prefer a different direction.
GitHub: [GitHub repository link]
YouTube: https://youtu.be/9NFLYU2uIxU?si=lm5ZJdIOi6vDRVVs