Automate from how you actually work.
Turn screen recordings into skills you can run, edit, and reuse.
Video Driven Skill is an open-source automation studio that transforms screen recordings into runnable, editable skill packages. Upload a video, extract key frames, annotate intent, let a multimodal AI model draft the skill — then refine, run, version, archive, and export it.
The project is designed for teams and individuals who want automation to start from how work is actually performed, not from a blank script editor.
Workflow: Record the process → Pick the frames that matter → Annotate intent → Generate a skill → Review & run → Export & deploy
- Video-to-Skill Pipeline — Upload an operation recording and automatically convert it into a structured skill package with SKILL.md, package.json, scripts, and variables.
- Smart Frame Extraction — Auto-extract key frames via FFmpeg, or manually capture the moments that matter.
- Visual Annotation — Mark up frames with arrows, notes, and corrections to tell the AI exactly what to do.
- Multimodal AI Generation — Leverages any OpenAI-compatible vision model to generate browser, Android, iOS, or desktop automation code.
- In-Browser Code Editor — Review, edit, and refine generated code with syntax highlighting and variable management.
- Incremental Regeneration — Regenerate the full skill or just a selected code range, with diff review between versions.
- Local Skill Runner — Run skills directly with streamed logs and optional screenshots.
- Skill Repository — Browse, search, import, export (ZIP), and drag-to-reorder your skill collection.
- Knowledge Base — Attach reference images, documents, and notes to each skill for richer context.
- Archive System — Preserve videos, frames, and requirements for building future skills from past material.
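The Smart Frame Extraction feature relies on FFmpeg. The project's own extraction logic is not shown in this README, so the following is only a minimal sketch of one common approach: sampling a frame at a fixed interval with FFmpeg's `fps` filter. The function name, output file pattern, and default interval are illustrative assumptions, and fixed-interval sampling is a stand-in rather than the project's key-frame selection.

```python
from pathlib import Path


def build_ffmpeg_command(video: str, out_dir: str, every_seconds: float = 2.0) -> list[str]:
    """Build an ffmpeg argv that saves one PNG frame every `every_seconds` seconds.

    Hypothetical helper: fixed-interval sampling is an assumption, not
    necessarily how Video Driven Skill picks key frames.
    """
    if every_seconds <= 0:
        raise ValueError("every_seconds must be positive")
    # ffmpeg expands %04d into a zero-padded frame counter.
    pattern = str(Path(out_dir) / "frame_%04d.png")
    # The fps filter takes an output rate in frames per second.
    rate = f"{1 / every_seconds:g}"
    return ["ffmpeg", "-i", video, "-vf", f"fps={rate}", pattern]
```

If FFmpeg is on your PATH and the output directory exists, the returned list can be passed to `subprocess.run(cmd, check=True)`.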
Install Docker first, then choose the path that matches your goal.
Use this if you just want to run the app. The install script downloads the release Compose file, creates .env, pulls the pre-built images, and starts the stack.
macOS / Linux:
curl -fsSL https://raw.githubusercontent.com/thought2code/video-driven-skill/main/scripts/install.sh | bash
Windows (PowerShell):
irm https://raw.githubusercontent.com/thought2code/video-driven-skill/main/scripts/install.ps1 | iex
Default install location:
- macOS / Linux: ~/video-driven-skill
- Windows: %USERPROFILE%\video-driven-skill
Open http://localhost after the script finishes (Docker uses standard ports 80 / 443).
To use AI generation, set your API key in the generated .env file:
AI_API_KEY=your-key-here
AI_BASE_URL=your-base-url
AI_MODEL=your-model
Common install options: --tag v1.0.0, --dir <path>, --no-open. Local dev with npm run dev uses port 3000.
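Before starting the stack, it can help to confirm that the AI settings in .env are no longer placeholders. The sketch below is a hypothetical check, not part of the project: it parses simple KEY=VALUE lines and reports which of the three AI keys are missing or still set to the example values shown above. Treating all three keys as required is an assumption; the README only states that the API key must be set.

```python
REQUIRED_KEYS = ("AI_API_KEY", "AI_BASE_URL", "AI_MODEL")
# Example values from the README, treated here as "not configured yet".
PLACEHOLDERS = {"", "your-key-here", "your-base-url", "your-model"}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_ai_settings(text: str) -> list[str]:
    """Return the AI keys that are absent or still hold a placeholder value."""
    env = parse_env(text)
    return [key for key in REQUIRED_KEYS if env.get(key, "") in PLACEHOLDERS]
```

For example, `missing_ai_settings(open(".env").read())` returns an empty list once every key has a real value.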
Use this for development, unreleased main, or local builds. It requires Docker and Git.
git clone https://github.com/thought2code/video-driven-skill.git
cd video-driven-skill
macOS / Linux:
chmod +x scripts/run-in-docker.sh
./scripts/run-in-docker.sh
Windows:
.\scripts\run-in-docker.cmd
On first run, .env is created from .env.example; set AI_API_KEY before using AI features:
AI_API_KEY=your-key-here
AI_BASE_URL=your-base-url
AI_MODEL=your-model
For faster base-image pulls in China, add --cn. To skip opening the browser, add --no-open.
The frontend runs Caddy as a reverse proxy. Set a public hostname in .env and Caddy will obtain and renew Let's Encrypt certificates automatically. With no domain configured, the stack serves HTTP only at http://localhost.
Prerequisites
- A server with a public IP and Docker installed.
- An A record for your hostname (e.g. vds.example.com) pointing to that IP.
- Firewall / security group allowing 80 and 443 (TCP; optional 443/UDP for HTTP/3).
Configuration (see .env.example):
VDS_DOMAIN=vds.example.com
ACME_EMAIL=you@example.com
- VDS_DOMAIN: hostname only (no https:// or path).
- ACME_EMAIL: optional, for Let's Encrypt expiry notices.
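A common mistake is to paste a full URL into VDS_DOMAIN. The following sketch shows one way to check the "hostname only" rule before writing the value to .env. The function name is hypothetical, and requiring at least two dot-separated labels (so that localhost is rejected) is an assumption based on the need for a public hostname.

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_bare_hostname(value: str) -> bool:
    """Return True if `value` is a hostname with no scheme, port, or path."""
    if not value or len(value) > 253:
        return False
    if "://" in value or "/" in value or ":" in value:
        return False
    labels = value.split(".")
    # Assumption: a public hostname has at least two labels.
    return len(labels) >= 2 and all(_LABEL.match(label) for label in labels)
```

With this check, `vds.example.com` passes, while `https://vds.example.com` and `vds.example.com/app` are rejected.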
Start
docker compose up -d --build
On first start with VDS_DOMAIN set, allow time for ACME validation (often 30 seconds to a few minutes), then open https://vds.example.com. HTTP redirects to HTTPS.
Certificates persist in Docker volumes caddy-data and caddy-config.
Troubleshooting
- Certificate not issued: verify DNS (dig vds.example.com) and that ports 80/443 are reachable from the internet.
- Logs: docker compose logs -f frontend
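The DNS and port checks above can also be scripted. This is a minimal sketch using only Python's standard `socket` module; the helper names are hypothetical. Note that it tests reachability from the machine where it runs, so run it from outside the server's network to approximate what Let's Encrypt sees.

```python
import socket


def resolve_ipv4(hostname: str) -> list[str]:
    """Return the IPv4 addresses the hostname resolves to (empty if none)."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A quick check might compare `resolve_ipv4("vds.example.com")` with the server's public IP, then call `port_open("vds.example.com", 80)` and `port_open("vds.example.com", 443)`.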
- Upload — Upload an operation recording (e.g., a screen capture of a workflow).
- Extract Frames — Auto-extract key frames or manually capture the moments that matter.
- Annotate — Mark up frames with arrows, notes, and corrections.
- Describe Intent — Tell the AI what you want, e.g., "Collect item names from this page and export them."
- Generate — Let the multimodal model produce a complete skill package.
- Review & Edit — Inspect generated code, adjust variables, and refine the output.
- Run — Execute the skill locally with streamed log output.
- Iterate — Regenerate the full skill or just a selected section, with diff comparison.
- Export & Deploy — Package as a ZIP or deploy to your local skill directory.
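The diff comparison in the Iterate step can be pictured with a small example. The sketch below uses Python's standard `difflib` to produce a unified diff between the current and a regenerated version of one skill file. It only illustrates the idea; the studio's own comparison view may work differently, and the function name and labels are assumptions.

```python
import difflib


def review_diff(old: str, new: str, name: str = "scripts/main.js") -> str:
    """Return a unified diff between two versions of one skill file."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"{name} (current)",
        tofile=f"{name} (candidate)",
    )
    # unified_diff yields nothing when the two versions are identical.
    return "".join(lines)
```

An empty result means the regenerated file is unchanged; otherwise lines starting with `-` were removed and lines starting with `+` were added by the candidate revision.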
video-driven-skill/
├── backend/ # Spring Boot — API, video processing, AI, skill runner
├── frontend/ # React + Vite — studio UI
├── docker-compose.yml # Docker deployment (build from source)
├── docker-compose.release.yml # GHCR images (no clone)
├── docker-compose.cn.yml # Optional mirror overlay (local build)
├── ARCHITECTURE.md # Architecture (English)
├── ARCHITECTURE.zh-CN.md # Architecture (Chinese)
├── scripts/
│ ├── install.sh / install.ps1 # Install from GHCR (no clone)
│ ├── run-in-docker.cmd / .sh # Build & run from source
│ └── kill-midscene.sh # Optional cleanup helper
| Module | Responsibility |
|---|---|
| controller/ | REST API & WebSocket entry points |
| service/VideoService | Video upload, FFmpeg frame extraction, streaming |
| service/AIService | Prompt construction & multimodal API calls |
| service/SkillService | Skill CRUD, import/export, versioning |
| service/SkillRunnerService | Workspace setup, dependency injection, execution, log collection |
| service/KnowledgeService | Per-skill reference files & manifest |
| model/ & repository/ | SQLite-backed domain entities |
Runtime data lives under ~/video-driven-skill/ by default (override with VIDEO_DRIVEN_SKILL_HOME; on Windows, the same folder name under your user profile):
- uploads/: uploaded videos & extracted frames
- skills/: generated skill source files
- archives/: reusable video/frame/requirement resources
- video-driven-skill.db: SQLite database
With Docker Compose, the same layout is stored at /data inside the backend container (Compose volume app-data), not under ~/video-driven-skill/. Inspect the host path with docker volume inspect video-driven-skill_app-data.
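For scripts that need to locate this data outside Docker, the lookup rule described above can be sketched as follows. The helper names are hypothetical; the environment variable, default folder name, and entries are taken from this section.

```python
import os
from pathlib import Path


def data_home(environ=None) -> Path:
    """Resolve the runtime data directory, honoring VIDEO_DRIVEN_SKILL_HOME."""
    env = os.environ if environ is None else environ
    override = env.get("VIDEO_DRIVEN_SKILL_HOME", "").strip()
    if override:
        return Path(override).expanduser()
    # Default: a video-driven-skill folder in the user's home directory.
    return Path.home() / "video-driven-skill"


def layout(home: Path) -> dict[str, Path]:
    """Map each documented runtime entry to its path under `home`."""
    return {
        "uploads": home / "uploads",
        "skills": home / "skills",
        "archives": home / "archives",
        "database": home / "video-driven-skill.db",
    }
```

Calling `layout(data_home())` gives the expected paths for a non-Docker run; with Docker Compose the same entries live under /data in the backend container instead.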
| Component | Responsibility |
|---|---|
| HomePage | Upload, import, and recent resources |
| PlaygroundPage | Frame annotation & skill workspace |
| FrameTimeline / FrameAnnotator / FrameList | Visual evidence collection |
| AIProcessor | Generation control & streamed status |
| SkillList | Skill repository with drag-to-reorder |
| SkillEditor / SkillExport / SkillRunner | Review, export & execution |
| RegeneratePanel / CodeComparisonView | Iteration workflow |
| KnowledgeBasePanel | Extra context per skill |
SKILL.md # Skill intent, instructions, and variable docs
package.json # Metadata
variables.json # User-editable runtime inputs
scripts/main.js # Executable entrypoint
knowledge/ # Optional reference files
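When importing or hand-editing a skill, a quick structural check can catch missing files early. The sketch below is a hypothetical validator built from the layout listed above. It assumes that every entry except the optional knowledge/ folder is required and that package.json and variables.json must parse as JSON; the project itself may apply different rules.

```python
import json
from pathlib import Path

# Assumption: everything in the layout except knowledge/ is mandatory.
REQUIRED_FILES = ("SKILL.md", "package.json", "variables.json", "scripts/main.js")


def check_skill_package(root: Path) -> list[str]:
    """Return a list of problems found in a skill package (empty if none)."""
    problems = [
        f"missing {rel}" for rel in REQUIRED_FILES if not (root / rel).is_file()
    ]
    for rel in ("package.json", "variables.json"):
        path = root / rel
        if path.is_file():
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError as exc:
                problems.append(f"{rel} is not valid JSON: {exc.msg}")
    return problems
```

For example, `check_skill_package(Path("my-skill"))` returns an empty list for a complete package and a list of readable messages otherwise.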
For a deeper walkthrough, see ARCHITECTURE.md.
| Method | Path | Purpose |
|---|---|---|
| POST | /api/videos/upload | Upload a video |
| POST | /api/videos/{id}/frames/auto | Auto-extract frames |
| POST | /api/videos/{id}/frames/manual | Manual frame capture |
| GET | /api/videos/{id}/stream | Stream uploaded video |
| GET | /api/skills | List all skills |
| PUT | /api/skills/order | Persist skill ordering |
| POST | /api/skills/generate | Generate a skill |
| GET | /api/skills/{id} | Read a skill |
| PUT | /api/skills/{id}/files | Update skill files |
| GET | /api/skills/{id}/export | Export skill as ZIP |
| POST | /api/skills/{id}/regenerate | Generate candidate revision |
| POST | /api/skills/{id}/partial-regenerate | Regenerate selected code range |
| POST | /api/skills/{id}/accept | Accept candidate revision |
| GET | /api/skills/{id}/versions | List skill versions |
| POST | /api/skills/{id}/deploy | Deploy skill locally |
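To call these endpoints from a script, it helps to keep the methods and paths in one place. The sketch below is a hypothetical helper that turns a route name into a method and full URL. The route names and the default base URL (http://localhost, as used by the Docker install) are assumptions; request and response bodies are not documented in this README, so they are deliberately left out.

```python
from urllib.parse import quote

# Methods and path templates copied from the API table; the keys are made-up names.
ROUTES = {
    "upload_video": ("POST", "/api/videos/upload"),
    "auto_frames": ("POST", "/api/videos/{id}/frames/auto"),
    "manual_frame": ("POST", "/api/videos/{id}/frames/manual"),
    "stream_video": ("GET", "/api/videos/{id}/stream"),
    "list_skills": ("GET", "/api/skills"),
    "order_skills": ("PUT", "/api/skills/order"),
    "generate_skill": ("POST", "/api/skills/generate"),
    "read_skill": ("GET", "/api/skills/{id}"),
    "update_files": ("PUT", "/api/skills/{id}/files"),
    "export_skill": ("GET", "/api/skills/{id}/export"),
    "regenerate": ("POST", "/api/skills/{id}/regenerate"),
    "partial_regenerate": ("POST", "/api/skills/{id}/partial-regenerate"),
    "accept": ("POST", "/api/skills/{id}/accept"),
    "list_versions": ("GET", "/api/skills/{id}/versions"),
    "deploy": ("POST", "/api/skills/{id}/deploy"),
}


def route(name: str, base_url: str = "http://localhost", **params: str) -> tuple[str, str]:
    """Return (HTTP method, full URL) for a named route, URL-encoding path parameters."""
    method, template = ROUTES[name]
    encoded = {key: quote(value, safe="") for key, value in params.items()}
    return method, base_url.rstrip("/") + template.format(**encoded)
```

For instance, `route("export_skill", id="abc")` yields the GET URL for downloading a skill ZIP, which can then be passed to any HTTP client.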
This repository is prepared for open-source use:
- No API keys or credentials are committed.
- Local databases, uploads, archives, generated skills, logs, and build outputs are git-ignored.
- Runtime configuration comes from environment variables or local .env files.
- Do not upload private recordings, credentials, customer data, or production screenshots to any public instance.
If you discover a security issue, please report it responsibly. See SECURITY.md.
This project is licensed under the MIT License. See LICENSE for details.
Built with care by the Video Driven Skill team.