Project 1: Build a GUI Agent with local LLM/VLM and OpenVINO #34948
Replies: 3 comments
-
Hey @openvino-dev-samples and @zhuo-yoyowz, thank you very much for your time. I am very excited, as this is the project I have been working on for the past year, so it is a dream come true for me. If you could please guide me, I would be really grateful.
-
Hi @Rishabh-Sahni-0809, thanks for your engagement.
3. I would recommend focusing on a few specific scenarios in this project and optimizing their user experience with a local model. Do not expect a local model to be able to do everything.
-
Hey @openvino-dev-samples and @zhuo-yoyowz, could you kindly tell me which other parts need fine-tuning?
-
Hi @openvino-dev-samples and @zhuo-yoyowz ,
I've been following this thread and wanted to share both my understanding of the project and what I've already built toward it.
My understanding of the project:
At its core, this is about building a desktop agent that perceives the screen as a human does — not through accessibility trees or hardcoded selectors, but by visually understanding the UI through a VLM. The LLM handles reasoning and goal decomposition while the VLM provides grounded visual perception. Together they enable an iterative Plan → Execute → Observe → Update loop that can handle dynamic, real-world interfaces. The OpenVINO requirement means all inference must run locally on CPU — no cloud APIs, reproducible and low-latency.
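To make the iterative loop concrete, here is a minimal Python sketch of the Plan → Execute → Observe → Update cycle described above. All of the callable names (`perceive`, `plan`, `execute`, `verify`) and the `Step` schema are illustrative assumptions for this sketch, not code from the actual project:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str    # e.g. "click", "type" (assumed action vocabulary)
    target: str    # element label the VLM should ground on screen
    expected: str  # description of the expected post-action state

def run_goal(goal, perceive, plan, execute, verify, max_replans=3):
    """Iterative Plan -> Execute -> Observe -> Update loop (sketch).

    perceive(): capture and describe the current screen
    plan(goal, obs): decompose the goal into a list of Steps
    execute(step): perform one UI action
    verify(step, obs): True if the expected state was reached
    """
    for _ in range(max_replans):
        obs = perceive()                  # Observe current screen state
        steps = plan(goal, obs)           # Plan: goal -> subtasks
        for step in steps:
            execute(step)                 # Execute one action
            obs = perceive()              # Re-capture the screen
            if not verify(step, obs):     # Update: mismatch -> re-plan
                break
        else:
            return True                   # every step verified
    return False                          # gave up after max_replans
```

The key design point is the `else` on the inner `for` loop: success is only declared when every step's post-state is verified against a fresh screen capture, and any mismatch falls back to re-planning with the latest observation.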
What I have already built (Digital Dave):
I have a working Python desktop agent I've been developing for the past year, evolving from a basic voice assistant into a full agentic system. It currently includes:
I've attached an architecture diagram showing the current pipeline and exactly where the OpenVINO VLM layer and new modules slot in.
Core proposal (GSoC deliverables):
The OCR-based perception layer is the key bottleneck I want to replace. My plan is to swap Tesseract for an OpenVINO-accelerated VLM (SmolVLM or nanoLLaVA, quantized INT8, CPU-only), outputting structured JSON with element labels, roles, and grounded (x, y) coordinates. On top of this I will build a proper agentic planner that decomposes a natural language goal into subtasks, executes them, verifies success by re-capturing the screen, and re-plans if the expected state was not reached.
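As a rough illustration of the structured-JSON perception output described above, here is a small sketch of how the VLM's response might be parsed into grounded UI elements and queried for a click target. The field names (`label`, `role`, `x`, `y`) and both helper functions are hypothetical assumptions about the schema, not the project's actual format:

```python
import json

def parse_elements(vlm_output: str):
    """Parse a VLM JSON response into a list of UI elements.

    Each element is expected to carry a label, a role, and grounded
    (x, y) screen coordinates; entries missing any field are skipped.
    """
    elements = []
    for item in json.loads(vlm_output).get("elements", []):
        if {"label", "role", "x", "y"} <= item.keys():
            elements.append(item)
    return elements

def find_target(elements, label):
    """Return the (x, y) click point for a labeled element, or None."""
    for el in elements:
        if el["label"].lower() == label.lower():
            return (el["x"], el["y"])
    return None
```

Keeping the perception output in a schema like this means the planner never touches pixels directly: it reasons over labels and roles, and the grounded coordinates are only consumed at execution time.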
Deliverables:
Stretch goals (if time permits):
I believe starting from a working codebase rather than from scratch puts me in a position to deliver meaningfully more during the coding period. I am happy to share the full repository and walk through any part of the design.
Two questions for the mentors:
Below are screenshots of the current UI and app that I am trying to improve, plus a YouTube video link from an earlier phase of the same project so you can also view the features. It started as a normal voice assistant, but I have since made it run fully locally and added many improvements, as shown in the GitHub repo below.
Please do correct me if I've misunderstood anything or if you'd prefer a different direction.
GitHub: [GitHub repository link]
YouTube: https://youtu.be/9NFLYU2uIxU?si=lm5ZJdIOi6vDRVVs