Semantic Code Search for AI Assistants
Give your AI assistant the power to truly understand your codebase
Features • Quick Start • Capabilities • Limitations • Integrations • Use Cases
CodeGrok MCP is a Model Context Protocol (MCP) server that enables AI assistants to intelligently search and understand codebases using semantic embeddings and Tree-sitter parsing.
Unlike simple text search, CodeGrok understands code structure - it knows what functions, classes, and methods are, and can find relevant code even when you describe it in natural language.
You: "Where is authentication handled?"
CodeGrok: Returns auth middleware, login handlers, JWT validation code...
The Problem: AI assistants have limited context windows. Sending your entire codebase is expensive and often impossible.
The Solution: CodeGrok indexes your code once, then the AI can query semantically and receive only the 5-10 most relevant code snippets - a 10-100x token reduction versus naive "read all files" approaches.
- Semantic Code Search - Find code by meaning, not just keywords
- 9 Languages Supported - Python, JavaScript, TypeScript, C, C++, Go, Java, Kotlin, Bash
- 28 File Extensions - Comprehensive coverage including `.jsx`, `.tsx`, `.mjs`, `.hpp`, etc.
- Fast Parallel Indexing - 3-5x faster with multi-threaded parsing
- Incremental Updates - Only re-index changed files (auto mode)
- Local & Private - All data stays on your machine in
.codegrok/folder - Zero LLM Dependencies - Lightweight, focused tool (no API keys required)
- GPU Acceleration - Auto-detects CUDA for faster embeddings
- Works with Any MCP Client - Claude, Cursor, Cline, and more
| Capability | Description |
|---|---|
| Semantic Code Search | Natural language queries → vector similarity search against indexed code |
| Find Code by Purpose | Query "How does auth work?" → Returns relevant auth files with line numbers |
| Symbol Extraction | Extracts functions, classes, methods with signatures, docstrings, calls, imports |
| Incremental Updates | `learn` with `auto` mode re-indexes only modified files, using file modification times (sketched below) |
| Persistent Storage | Index survives restarts in the `.codegrok/` folder |
| Load Existing Index | `learn` with `mode='load_only'` instantly loads a previously indexed codebase |
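For illustration, here's a minimal sketch of how mtime-based change detection can work. The `metadata.json` layout (a `file_mtimes` map) is an assumption for this example, not CodeGrok's exact schema:

```python
# Illustrative sketch of mtime-based incremental re-indexing (assumed schema).
import json
from pathlib import Path

def changed_files(root: str, metadata_path: str = ".codegrok/metadata.json") -> list[Path]:
    # Assumed layout: {"file_mtimes": {"src/app.py": 1716212345.0, ...}}
    recorded = json.loads(Path(metadata_path).read_text()).get("file_mtimes", {})
    changed = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_mtime > recorded.get(str(path), 0.0):
            changed.append(path)  # new or modified since the last index
    return changed
```

Once an index exists, you can also ask higher-level questions: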
| Capability | Description |
|---|---|
| Entry Point Discovery | Query "main entry point" to find where execution starts |
| Architecture Understanding | Query "database connection" to find DB layer |
| Domain Concepts | Query "user authentication flow" to find auth logic |
| Index Statistics | See files parsed, symbols extracted, timing info |
Important: Understanding limitations helps you use the tool effectively.
| Limitation | Explanation |
|---|---|
| Code Execution | Pure indexing/search - no interpreter, no running tests |
| Code Modification | Read-only search - doesn't write or edit files |
| Real-time File Watching | No daemon mode - manually call `learn` again to update the index |
| Cross-repository Search | Single codebase per index - can't search multiple projects simultaneously |
| Find All Usages | Finds definitions, not references (no "who calls this function?") |
| Type Inference / LSP | No language server - no jump-to-definition, no autocomplete |
| Git History Analysis | Indexes current state only - no commit history or blame |
| Regex/Exact Search | Semantic only - use grep or ripgrep for exact string matching |
| Code Metrics | No complexity scoring, no linting, no coverage data |
| Constraint | Impact |
|---|---|
| First index is slow | ~50 chunks/second (~3-4 min for 10K symbols) |
| Memory usage | Embedding models use 500MB-2GB RAM |
| Model download | First run downloads ~500MB model from HuggingFace |
| Query latency | ~50-100ms per search |
```bash
# Clone the repository
git clone https://github.com/rdondeti/CodeGrok_mcp.git
cd CodeGrok_mcp

# Option 1: Use the setup script (recommended)
./setup.sh    # Linux/macOS
# or
.\setup.ps1   # Windows PowerShell

# Option 2: Manual install
python -m venv .venv
source .venv/bin/activate   # Linux/macOS
pip install -e .

# Verify installation
codegrok-mcp --help
```

Setup script options:
| Flag | Description |
|---|---|
| `--clean` | Remove the existing venv before creating a new one |
| `--prod` | Install production dependencies only |
| `--no-verify` | Skip the verification step |
Once integrated with your AI tool (see below), ask your assistant:
"Learn my codebase at /path/to/my/project"
Then search:
"Find how API endpoints are defined"
"Where is error handling implemented?"
"Show me the database models"
How CodeGrok Saves Tokens:
```
Without CodeGrok:
  AI tries to read entire codebase → exceeds context window → fails or costs $$

With CodeGrok:
  AI: "I need to add a new route"
    ↓ calls get_sources("Express route definition")
  CodeGrok: Returns routes/api.js:15, routes/auth.js:8
    ↓ AI reads only those 2 files
  Result: 10-100x fewer tokens, faster responses
```
Step 1: "Learn my codebase at ~/projects/big-app"
Step 2: "Where is the main entry point?"
Step 3: "How is authentication implemented?"
Step 4: "Find the database connection logic"
Step 5: "Show me how API errors are handled"
"Find all functions that handle user input"
"Where is validation performed?"
"Show me error handling patterns"
The easiest way to add CodeGrok to Claude Code:
```bash
# Add the MCP server
claude mcp add codegrok-mcp -- codegrok-mcp
```

Or manually add to your settings (`~/.claude/settings.json`):
```json
{
  "mcpServers": {
    "codegrok": {
      "command": "codegrok-mcp"
    }
  }
}
```

Usage in Claude Code:
```
> learn my codebase at ./my-project
> find authentication logic
> where is the main entry point?
```
Add to your Claude Desktop configuration:
| Platform | Config File Location |
|---|---|
| macOS | `~/Library/Application Support/Claude/claude_desktop_config.json` |
| Windows | `%APPDATA%\Claude\claude_desktop_config.json` |
| Linux | `~/.config/Claude/claude_desktop_config.json` |
```json
{
  "mcpServers": {
    "codegrok": {
      "command": "codegrok-mcp",
      "args": []
    }
  }
}
```

Restart Claude Desktop after saving.
Cursor supports MCP servers through its extension system:
- Open Settings → Extensions → MCP
- Add Server Configuration:
```json
{
  "codegrok": {
    "command": "codegrok-mcp",
    "transport": "stdio"
  }
}
```

Or add to `.cursor/mcp.json` in your project:
```json
{
  "servers": {
    "codegrok": {
      "command": "codegrok-mcp"
    }
  }
}
```

Windsurf supports MCP through Cascade:
- Open Cascade Settings
- Navigate to MCP Servers
- Add configuration:
```json
{
  "codegrok": {
    "command": "codegrok-mcp",
    "transport": "stdio"
  }
}
```

Add to Cline's MCP settings in VS Code:
- Open the Command Palette (`Ctrl+Shift+P` / `Cmd+Shift+P`)
- Search "Cline: Open MCP Settings"
- Add:
```json
{
  "mcpServers": {
    "codegrok": {
      "command": "codegrok-mcp"
    }
  }
}
```

Zed supports MCP through its assistant panel. Add to settings:
```json
{
  "assistant": {
    "mcp_servers": {
      "codegrok": {
        "command": "codegrok-mcp"
      }
    }
  }
}
```

Add to your Continue configuration (`~/.continue/config.json`):
```json
{
  "mcpServers": [
    {
      "name": "codegrok",
      "command": "codegrok-mcp"
    }
  ]
}
```

For any MCP-compatible client, use stdio transport:
```
# Command to run
codegrok-mcp

# Transport
stdio (stdin/stdout)

# Protocol
Model Context Protocol (MCP)
```

CodeGrok provides 4 tools for AI assistants:
| Tool | Description | Key Parameters |
|---|---|---|
| `learn` | Index a codebase (smart modes) | `path` (required), `mode` (auto/full/load_only), `file_extensions`, `embedding_model` |
| `get_sources` | Semantic code search | `question` (required), `n_results` (1-50, default: 10), `language`, `symbol_type` |
| `get_stats` | Get index statistics | None |
| `list_supported_languages` | List supported languages | None |
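For reference, here's a minimal sketch of calling these tools from your own code with the official MCP Python SDK (`pip install mcp`). This mirrors what MCP clients like Claude do under the hood; treat it as a sketch, not a supported client API of this project:

```python
# Sketch: drive CodeGrok's MCP tools over stdio using the MCP Python SDK.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    params = StdioServerParameters(command="codegrok-mcp", args=[])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Index (or incrementally update) the project, then search it.
            await session.call_tool("learn", {"path": "/home/user/my-project", "mode": "auto"})
            result = await session.call_tool(
                "get_sources",
                {"question": "How is user authentication implemented?", "n_results": 5},
            )
            print(result.content)

asyncio.run(main())
```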
Learn modes:
- `auto` (default): smart detection - incremental re-index if an index exists, full index if new
- `full`: force a complete re-index (destroys the existing index)
- `load_only`: load an existing index without any indexing
Example: index a codebase with `learn`:

```json
{
  "tool": "learn",
  "arguments": {
    "path": "/home/user/my-project",
    "mode": "auto"
  }
}
```

Response:
```json
{
  "success": true,
  "message": "Indexed 150 files with 1,247 symbols",
  "stats": {
    "total_files": 150,
    "total_symbols": 1247,
    "total_chunks": 2834,
    "indexing_time": 12.5
  }
}
```

Example: search with `get_sources`:

```json
{
  "tool": "get_sources",
  "arguments": {
    "question": "How is user authentication implemented?",
    "n_results": 5
  }
}
```

Response:
```json
{
  "sources": [
    {
      "file": "src/auth/middleware.py",
      "symbol": "authenticate_request",
      "type": "function",
      "line": 45,
      "content": "def authenticate_request(request):\n ...",
      "score": 0.89
    }
  ]
}
```

Example: incremental update, calling `learn` again in `auto` mode:

```json
{
  "tool": "learn",
  "arguments": {
    "path": "/home/user/my-project",
    "mode": "auto"
  }
}
```

Response (when an index exists):
```json
{
  "success": true,
  "mode_used": "incremental",
  "files_added": 2,
  "files_modified": 5,
  "files_deleted": 1
}
```

| Language | Extensions | Parser |
|---|---|---|
| Python | `.py`, `.pyi`, `.pyw` | tree-sitter-python |
| JavaScript | `.js`, `.jsx`, `.mjs`, `.cjs` | tree-sitter-javascript |
| TypeScript | `.ts`, `.tsx`, `.mts`, `.cts` | tree-sitter-typescript |
| C | `.c`, `.h` | tree-sitter-c |
| C++ | `.cpp`, `.cc`, `.cxx`, `.hpp`, `.hh`, `.hxx` | tree-sitter-cpp |
| Go | `.go` | tree-sitter-go |
| Java | `.java` | tree-sitter-java |
| Kotlin | `.kt`, `.kts` | tree-sitter-kotlin |
| Bash | `.sh`, `.bash`, `.zsh` | tree-sitter-bash |
Total: 9 languages, 28 file extensions
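To make the Parser column concrete, here's a minimal sketch of Tree-sitter-based symbol extraction using the py-tree-sitter 0.22+ API and the `tree-sitter-python` grammar. This is illustrative only, not CodeGrok's internal parser code:

```python
# Illustrative Tree-sitter symbol extraction (pip install tree-sitter tree-sitter-python).
import tree_sitter_python
from tree_sitter import Language, Parser

parser = Parser(Language(tree_sitter_python.language()))
source = b"def authenticate_request(request):\n    return check_token(request)\n"
tree = parser.parse(source)

def walk(node) -> None:
    # Report each function definition with its name and 1-based line number.
    if node.type == "function_definition":
        name = node.child_by_field_name("name")
        print(f"function {name.text.decode()} at line {node.start_point[0] + 1}")
    for child in node.children:
        walk(child)

walk(tree.root_node)  # -> function authenticate_request at line 1
```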
```
┌────────────────────────────────────────────────────────────┐
│                         MCP Client                         │
│               (Claude, Cursor, Cline, etc.)                │
└─────────────────────────┬──────────────────────────────────┘
                          │ MCP Protocol (stdio)
                          ▼
┌────────────────────────────────────────────────────────────┐
│                    CodeGrok MCP Server                     │
├────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │
│  │   Parsers   │  │ Embeddings  │  │   Vector Storage    │ │
│  │(Tree-sitter)│  │  (Sentence  │  │     (ChromaDB)      │ │
│  │             │  │Transformers)│  │                     │ │
│  └─────────────┘  └─────────────┘  └─────────────────────┘ │
└────────────────────────────────────────────────────────────┘
```
Indexing pipeline:

```
Source Files → Tree-sitter Parser → Symbol Extraction →
Code Chunks → Embeddings → ChromaDB Storage
```
- Parse: Tree-sitter extracts functions, classes, and methods with signatures
- Chunk: Code is split into semantic chunks with context (docstrings, imports, calls)
- Embed: Sentence-transformers create vector embeddings
- Store: ChromaDB persists vectors locally in `.codegrok/` (see the sketch below)
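Conceptually, the chunk → embed → store steps reduce to a few library calls. A simplified sketch with sentence-transformers and ChromaDB (chunking is hard-coded here; the real pipeline feeds in Tree-sitter output):

```python
# Simplified indexing: embed code chunks and persist them in ChromaDB.
from sentence_transformers import SentenceTransformer
import chromadb

model = SentenceTransformer("nomic-ai/CodeRankEmbed", trust_remote_code=True)
client = chromadb.PersistentClient(path=".codegrok/chroma")
collection = client.get_or_create_collection("code_chunks")

# In CodeGrok these chunks come from the Tree-sitter parsers.
chunks = [{"id": "src/auth/middleware.py:authenticate_request",
           "text": "def authenticate_request(request): ..."}]
collection.add(
    ids=[c["id"] for c in chunks],
    embeddings=model.encode([c["text"] for c in chunks]).tolist(),
    documents=[c["text"] for c in chunks],
)
```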
Search pipeline:

```
Query → Embedding → Vector Similarity → Ranked Results
```

- Embed Query: Convert natural language to a vector
- Search: Find similar vectors in ChromaDB
- Return: Top-k results with file paths, line numbers, and code snippets (see the sketch below)
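And the matching search step, reusing `model` and `collection` from the indexing sketch above:

```python
# Simplified search: embed the query, then rank chunks by vector similarity.
query_embedding = model.encode(["How is user authentication implemented?"]).tolist()
results = collection.query(query_embeddings=query_embedding, n_results=5)
for chunk_id, distance in zip(results["ids"][0], results["distances"][0]):
    print(chunk_id, distance)  # smaller distance = more relevant
```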
All data is stored locally in your project:
```
your-project/
└── .codegrok/
    ├── chroma/           # Vector database
    └── metadata.json     # Index metadata (stats, file mtimes)
```
| Variable | Description | Default |
|---|---|---|
| `CODEGROK_EMBEDDING_MODEL` | Embedding model to use | `nomic-embed-code` |
| `CODEGROK_DEVICE` | Compute device (`cpu`/`cuda`/`mps`) | Auto-detect |
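For illustration, here's how a server might resolve these variables, including a device auto-detect fallback (a sketch assuming PyTorch is used for detection, which matches the CUDA auto-detection feature above; not the actual configuration code):

```python
# Sketch: resolve the embedding model and compute device from the environment.
import os
import torch

model_name = os.environ.get("CODEGROK_EMBEDDING_MODEL", "nomic-embed-code")
device = os.environ.get("CODEGROK_DEVICE")
if device is None:  # auto-detect
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"
print(f"Using model {model_name!r} on device {device!r}")
```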
| Model | Size | Best For |
|---|---|---|
| `coderankembed` | 768d / 137M | Code (default, recommended) - uses `nomic-ai/CodeRankEmbed` |
The default model (`nomic-ai/CodeRankEmbed`) is optimized for code retrieval with:
- 768-dimensional embeddings
- 8192 max sequence length
- State-of-the-art performance on CodeSearchNet benchmarks
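You can check these properties yourself (the `trust_remote_code` flag is explained in the next section):

```python
# Quick check of the default model's embedding size and context window.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/CodeRankEmbed", trust_remote_code=True)
print(model.get_sentence_embedding_dimension())  # expected: 768
print(model.max_seq_length)                      # expected: 8192
```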
The default embedding model (`nomic-ai/CodeRankEmbed`) requires `trust_remote_code=True` when loading via SentenceTransformers. This flag allows execution of custom Python code bundled with the model.
Why it's required:
- The model uses a custom Nomic BERT architecture that isn't part of the standard HuggingFace model library
- Custom files: `modeling_hf_nomic_bert.py` (model architecture), `configuration_hf_nomic_bert.py` (config)
Security audit: The custom code has been reviewed and contains:
- Standard PyTorch neural network definitions
- No `exec()`, `eval()`, or dynamic code execution
- No subprocess or shell commands
- No network requests beyond HuggingFace's standard model download APIs
- Only imports from trusted libraries (torch, transformers, einops, safetensors)
For maximum security:
- Review the model code yourself: nomic-ai/CodeRankEmbed on HuggingFace
- Pin to a specific model revision in production deployments
- Consider using Microsoft CodeBERT (`microsoft/codebert-base`) as an alternative that doesn't require `trust_remote_code` (with potential quality trade-offs)
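A sketch of the revision-pinning recommendation; the revision value below is a placeholder, not a real commit hash:

```python
# Pin the model to an audited revision so remote code can't change underneath you.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "nomic-ai/CodeRankEmbed",
    trust_remote_code=True,
    revision="<pinned-commit-sha>",  # placeholder: use an audited commit hash
)
```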
```bash
# Clone
git clone https://github.com/rdondeti/CodeGrok_mcp.git
cd CodeGrok_mcp

# Run setup script
./setup.sh    # Linux/macOS (includes dev dependencies)
.\setup.ps1   # Windows PowerShell

# For a clean reinstall:
./setup.sh --clean
```

```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=src/codegrok_mcp --cov-report=term-missing
# Run specific test categories
pytest tests/unit/ -v # Fast unit tests
pytest tests/integration/ -v # Integration tests (uses real embeddings)
pytest tests/mcp/ -v          # MCP protocol simulation tests
```

```bash
# Format code
black src/
# Type checking
mypy src/
# Linting
flake8 src/
```

Server won't start

```bash
# Check installation
pip show codegrok-mcp
# Check Python version (need 3.10+)
python --version
# Reinstall
pip install -e .
```

Indexing is slow
- Large codebases (>10k files) take longer on first index
- Use `learn` again after the first index for incremental updates (auto mode)
- Close other heavy applications
- Consider indexing a subdirectory first
Search returns irrelevant results
- Be more specific in queries (e.g., "JWT token validation" instead of "auth")
- Re-index if codebase changed significantly
- Check that the code type you're searching exists
Out of memory
- Index smaller portions of the codebase
- The default `coderankembed` model uses ~500MB-2GB of RAM
"No index loaded" error
Use the `learn` tool first:
"Learn my codebase at /path/to/project"
| Feature | CodeGrok MCP | grep/ripgrep | GitHub Search | Sourcegraph |
|---|---|---|---|---|
| Semantic Search | ✅ | ❌ | Partial | ✅ |
| Local/Private | ✅ | ✅ | ❌ | ❌ |
| MCP Support | ✅ | ❌ | ❌ | ❌ |
| No API Keys | ✅ | ✅ | ❌ | ❌ |
| Multi-language | ✅ | ✅ | ✅ | ✅ |
| Code Structure Aware | ✅ | ❌ | Partial | ✅ |
| Offline | ✅ | ✅ | ❌ | ❌ |
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing`)
- Make your changes
- Run tests (`pytest`)
- Format code (`black src/`)
- Submit a Pull Request
- Follow Black formatting (line length 100)
- Add type hints to all functions
- Write tests for new features
- Update documentation
MIT License - see LICENSE for details.
- Model Context Protocol - The protocol that powers this integration
- Tree-sitter - Fast, accurate code parsing
- ChromaDB - Vector database for embeddings
- Sentence Transformers - State-of-the-art embeddings
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ for developers who want AI that truly understands their code