Guide to Customizing Embabel Guide for Codebase Ingestion

This guide shows you how to customize Embabel Guide to work with your codebase instead of documentation URLs.

Overview

Guide already has everything you need:

  • ✅ Full RAG infrastructure with ingestDirectory(String dir) method
  • ✅ Chat functionality with ChatActions
  • ✅ Frontend integration (Embabel Hub)
  • ✅ Agent framework and LLM integration

You only need to point it at your codebase instead of URLs, which takes one new property, a small change to the ingestion method, and one configuration entry.

Step-by-Step Customization

Step 1: Get Guide

git clone https://github.com/embabel/guide.git
cd guide

Step 2: Add Codebase Path Configuration

Modify src/main/java/com/embabel/guide/GuideProperties.java:

Before:

public record GuideProperties(
        boolean reloadContentOnStartup,
        String defaultPersona,
        LlmOptions chatLlm,
        String projectsPath,
        ContentChunker.Config chunkerConfig,
        String referencesFile,
        List<String> urls,  // Only URLs
        String toolPrefix,
        Set<String> toolGroups
) {}

After:

public record GuideProperties(
        boolean reloadContentOnStartup,
        String defaultPersona,
        LlmOptions chatLlm,
        String projectsPath,
        ContentChunker.Config chunkerConfig,
        String referencesFile,
        List<String> urls,  // Keep for docs
        String codebasePath,  // ADD: Your codebase path
        String toolPrefix,
        Set<String> toolGroups
) {}
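
Steps 3 and onward repeat the same null-or-empty check on codebasePath. One optional way to centralize it is a compact constructor that normalizes the value once. The sketch below uses a hypothetical, trimmed-down record (CodebaseSettings and hasCodebasePath() are illustrative names, not part of Guide) so that it compiles on its own; the same idea could be applied to GuideProperties if its other components allow it.

```java
import java.util.List;

// Hypothetical stand-in for GuideProperties, reduced to the two fields
// that matter for choosing between codebase and URL ingestion.
record CodebaseSettings(List<String> urls, String codebasePath) {

    // Compact constructor: normalize inputs once so callers need no null checks.
    CodebaseSettings {
        urls = (urls == null) ? List.of() : List.copyOf(urls);
        codebasePath = (codebasePath == null || codebasePath.isBlank())
                ? null
                : codebasePath.trim();
    }

    // True only when a non-blank path was configured.
    boolean hasCodebasePath() {
        return codebasePath != null;
    }

    public static void main(String[] args) {
        CodebaseSettings settings = new CodebaseSettings(null, " /path/to/your/codebase ");
        System.out.println(settings.hasCodebasePath() + " " + settings.codebasePath());
    }
}
```

With a helper like this, the condition in loadReferences() would shrink to a single call, and a whitespace-only value would be treated as "not configured".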

Step 3: Modify Data Ingestion

Modify src/main/java/com/embabel/guide/rag/DataManager.java:

In the loadReferences() method, change:

Before:

public void loadReferences() {
    int successCount = 0;
    int failureCount = 0;

    for (var url : guideProperties.urls()) {
        try {
            logger.info("⏳Loading URL: {}...", url);
            ingestPage(url);
            logger.info("✅ Loaded URL: {}", url);
            successCount++;
        } catch (Throwable t) {
            logger.error("❌ Failure loading URL {}: {}", url, t.getMessage(), t);
            failureCount++;
        }
    }
    logger.info("Loaded {}/{} URLs successfully ({} failed)",
            successCount, guideProperties.urls().size(), failureCount);
}

After:

public void loadReferences() {
    // If codebase path is configured, use directory ingestion
    if (guideProperties.codebasePath() != null && !guideProperties.codebasePath().isEmpty()) {
        try {
            logger.info("⏳Loading codebase from: {}...", guideProperties.codebasePath());
            ingestDirectory(guideProperties.codebasePath());
            logger.info("✅ Loaded codebase from: {}", guideProperties.codebasePath());
        } catch (Throwable t) {
            logger.error("❌ Failure loading codebase from {}: {}", 
                    guideProperties.codebasePath(), t.getMessage(), t);
            throw new RuntimeException("Failed to load codebase", t);
        }
        return;  // Skip URL ingestion if using codebase
    }

    // Otherwise, use existing URL-based ingestion
    int successCount = 0;
    int failureCount = 0;

    for (var url : guideProperties.urls()) {
        try {
            logger.info("⏳Loading URL: {}...", url);
            ingestPage(url);
            logger.info("✅ Loaded URL: {}", url);
            successCount++;
        } catch (Throwable t) {
            logger.error("❌ Failure loading URL {}: {}", url, t.getMessage(), t);
            failureCount++;
        }
    }
    logger.info("Loaded {}/{} URLs successfully ({} failed)",
            successCount, guideProperties.urls().size(), failureCount);
}

Step 4: Configure Application

Modify src/main/resources/application.yml:

guide:
  reload-content-on-startup: true
  
  # Your codebase path (absolute or relative)
  codebase-path: /path/to/your/codebase
  
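  # URLs can be empty if using codebase only
  urls: []
  
  # To ingest docs as well, list URLs here and use the loadReferences() variant
  # from "Alternative: Support Both Codebase and URLs"; the Step 3 version
  # returns early and skips URLs whenever codebase-path is set:
  # urls:
  #   - https://docs.example.com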
  
  default-persona: adaptive
  chat-llm:
    model: gpt-4.1-mini
  
  # ... rest of configuration

Step 5: Build and Run

# Build Guide
mvn clean install

# Run Guide
mvn spring-boot:run

That's it! Guide will now:

  1. Index your codebase on startup (if reload-content-on-startup: true)
  2. Allow users to chat about your codebase
  3. Use RAG to find relevant code sections
  4. Provide references to code locations

Alternative: Support Both Codebase and URLs

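The Step 3 version skips URL ingestion whenever a codebase path is set. To ingest both, load each source independently. In this variant a codebase failure is logged rather than rethrown, so URL ingestion still runs: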

public void loadReferences() {
    // Load codebase if configured
    if (guideProperties.codebasePath() != null && !guideProperties.codebasePath().isEmpty()) {
        try {
            logger.info("⏳Loading codebase from: {}...", guideProperties.codebasePath());
            ingestDirectory(guideProperties.codebasePath());
            logger.info("✅ Loaded codebase from: {}", guideProperties.codebasePath());
        } catch (Throwable t) {
            logger.error("❌ Failure loading codebase: {}", t.getMessage(), t);
        }
    }

    // Load URLs if configured
    if (guideProperties.urls() != null && !guideProperties.urls().isEmpty()) {
        int successCount = 0;
        int failureCount = 0;
        for (var url : guideProperties.urls()) {
            try {
                logger.info("⏳Loading URL: {}...", url);
                ingestPage(url);
                logger.info("✅ Loaded URL: {}", url);
                successCount++;
            } catch (Throwable t) {
                logger.error("❌ Failure loading URL {}: {}", url, t.getMessage(), t);
                failureCount++;
            }
        }
        logger.info("Loaded {}/{} URLs successfully ({} failed)",
                successCount, guideProperties.urls().size(), failureCount);
    }
}
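
The count-and-log loop appears in every variant of loadReferences() above. As a sketch of how it could be factored out, the helper below takes the list of sources and the ingestion call as parameters and returns the counts. IngestionRunner, Result, and ingestAll are hypothetical names, and the sketch catches RuntimeException rather than Throwable because a Consumer cannot throw checked exceptions.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Guide: runs one ingestion call per source
// and counts successes and failures instead of stopping at the first error.
final class IngestionRunner {

    record Result(int succeeded, int failed) {
        int total() {
            return succeeded + failed;
        }
    }

    private IngestionRunner() {
    }

    static Result ingestAll(List<String> sources, Consumer<String> ingest) {
        int succeeded = 0;
        int failed = 0;
        for (String source : sources) {
            try {
                ingest.accept(source);
                succeeded++;
            } catch (RuntimeException e) {
                // Keep going: one bad source should not block the rest.
                System.err.println("Failure loading " + source + ": " + e.getMessage());
                failed++;
            }
        }
        return new Result(succeeded, failed);
    }

    public static void main(String[] args) {
        Result result = ingestAll(
                List.of("https://docs.example.com", "bad-source"),
                source -> {
                    if (source.startsWith("bad")) {
                        throw new IllegalStateException("simulated failure");
                    }
                });
        System.out.println(result.succeeded() + "/" + result.total() + " loaded");
    }
}
```

Inside DataManager, the second argument would be a method reference to the existing ingestPage or ingestDirectory method.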

What Guide Provides Out of the Box

After this customization, you get:

  1. Codebase Indexing: Your codebase is indexed using Tika for parsing and DrivineStore for vector storage
  2. Chat Interface: Users can ask questions about your codebase
  3. RAG Search: Vector search finds relevant code sections
  4. Frontend: Embabel Hub frontend (or custom frontend) connects automatically
  5. Agent Framework: Full Embabel agent framework with @Action triggers
  6. References: Code references with file paths and locations

Example: Chatting About Your Codebase

Once Guide is running with your codebase:

User: "How does authentication work in this codebase?"

Guide (using RAG):

  • Searches indexed code for authentication-related code
  • Finds relevant classes/functions
  • Provides answer with references to specific files

User: "Show me the API endpoints"

Guide:

  • Finds REST controllers or API definitions
  • Lists endpoints with their implementations
  • Provides file paths and line numbers

Troubleshooting

Codebase Not Loading

Local:

  • Check path: Ensure codebase-path is an absolute path, or a relative path that resolves correctly from the directory Guide is started in
  • Check permissions: Ensure Guide can read the directory
  • Check logs: Look for parsing errors in logs
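
The path and permission checks above can also be run in code before ingestion starts, so a misconfigured path fails with a clear message. This is a minimal sketch using only the JDK; CodebasePreflight and check are hypothetical names, and where to call it from in Guide is left open.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical preflight check, not part of Guide.
final class CodebasePreflight {

    private CodebasePreflight() {
    }

    // Returns a list of problems; an empty list means the path looks usable.
    static List<String> check(String codebasePath) {
        List<String> problems = new ArrayList<>();
        if (codebasePath == null || codebasePath.isBlank()) {
            problems.add("codebase-path is not set");
            return problems;
        }
        // Relative paths resolve against the JVM's working directory.
        Path resolved = Path.of(codebasePath).toAbsolutePath().normalize();
        if (!Files.exists(resolved)) {
            problems.add("path does not exist: " + resolved);
        } else if (!Files.isDirectory(resolved)) {
            problems.add("path is not a directory: " + resolved);
        } else if (!Files.isReadable(resolved)) {
            problems.add("directory is not readable: " + resolved);
        }
        return problems;
    }

    public static void main(String[] args) {
        String path = args.length > 0 ? args[0] : ".";
        List<String> problems = check(path);
        System.out.println(problems.isEmpty() ? "OK: " + path : String.join("; ", problems));
    }
}
```

Printing the resolved absolute path is the useful part for relative paths, since it shows which directory was actually tested.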

Docker:

  • Verify volume mount: Check that volume mount path matches GUIDE_CODEBASE_PATH
    docker exec embabel-guide ls -la <codebase-path>
  • Check environment variable: Verify GUIDE_CODEBASE_PATH is set correctly
    docker exec embabel-guide env | grep GUIDE_CODEBASE_PATH
  • Check container logs: Look for Java exceptions in Guide logs
    docker logs embabel-guide

Environment Variable Not Recognized

  • Spring Boot binding: Spring Boot's relaxed binding maps the GUIDE_CODEBASE_PATH environment variable to the guide.codebase-path property
  • Check property name: Verify @ConfigurationProperties(prefix = "guide") in GuideProperties
  • YAML vs Environment: Environment variables override application.yml values
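
The binding rule above can be sketched in code. Spring Boot's documentation gives the canonical conversion from a property name to an environment variable name as: replace dots with underscores, remove dashes, and convert to uppercase. The helper below (EnvVarNames and canonicalEnvName are hypothetical names, not part of Guide or Spring) applies that rule. Relaxed binding typically also accepts the spelling with underscores in place of dashes, which is the spelling this guide uses (GUIDE_CODEBASE_PATH), so trying both spellings is a reasonable step if a variable is not picked up.

```java
import java.util.Locale;

// Hypothetical helper that applies the documented canonical conversion.
final class EnvVarNames {

    private EnvVarNames() {
    }

    static String canonicalEnvName(String propertyName) {
        return propertyName
                .replace('.', '_')   // dots become underscores
                .replace("-", "")    // dashes are removed
                .toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(canonicalEnvName("guide.codebase-path"));
        System.out.println(canonicalEnvName("guide.reload-content-on-startup"));
    }
}
```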

Ingestion Fails Silently

  • Check Guide logs: Look for Java exceptions
    # Local
    tail -f logs/guide.log
    
    # Docker
    docker logs embabel-guide
  • Verify Neo4j connection: Ensure Neo4j is accessible from Guide
    # Docker (works only if ping is installed in the Guide image)
    docker exec embabel-guide ping neo4j

No Results in Chat

  • Wait for indexing: Large codebases take time to index
  • Check Neo4j: Ensure Neo4j is running and Guide can connect
  • Check embeddings: Verify embedding model is configured correctly
  • Verify ingestion: Check logs to confirm codebase was indexed

Frontend Not Connecting

  • Check port: The default port is 1337; ensure it is accessible from where the frontend runs
  • Check CORS: If using external frontend, configure CORS in Guide
  • Check WebSocket: Ensure WebSocket/SSE endpoints are accessible
  • Verify API endpoint: Check that frontend points to correct Guide URL
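
The port check above can be done with a plain TCP connection attempt. The sketch below uses only the JDK; PortProbe and isReachable are hypothetical names. It confirms only that something is listening on the port, so CORS and WebSocket/SSE problems still need to be checked separately.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical probe, not part of Guide.
final class PortProbe {

    private PortProbe() {
    }

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, unknown host, and timeouts.
            return false;
        }
    }

    public static void main(String[] args) {
        // 1337 is the default Guide port mentioned above.
        System.out.println("Guide reachable: " + isReachable("localhost", 1337, 1000));
    }
}
```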

Docker-Specific Issues

Container can't find codebase:

  • Verify volume mount syntax in docker-compose.override.yml
  • Check that host path exists before mounting
  • Use absolute paths for reliability

Environment variables not applied:

  • Check that the .env file is in the same directory as compose.yaml
  • Or pass environment variables inline:
    GUIDE_CODEBASE_PATH=/codebase docker compose up -d

Container restarts and loses data:

  • Neo4j data persists in Docker volumes (default behavior)
  • Codebase ingestion runs on startup if GUIDE_RELOAD_CONTENT_ON_STARTUP=true
  • To skip re-ingestion on restart, set GUIDE_RELOAD_CONTENT_ON_STARTUP=false

For more detailed troubleshooting, see TESTING.md.

Docker Setup

The customized Guide runs under Docker without further code changes. You can build a custom Docker image and use Docker Compose with volume mounts.

Building Custom Guide Image

After making the customization changes:

cd guide
docker build -t guide-custom:latest -f Dockerfile .

Using Docker Compose with Codebase Mount

  1. Create a docker-compose override file (copy from talk-to-your-repo/docker-compose.override.yml.example):
services:
  guide:
    image: guide-custom:latest  # Or use build: context: .
    environment:
      - GUIDE_CODEBASE_PATH=/codebase  # Path inside container
      - GUIDE_RELOAD_CONTENT_ON_STARTUP=true
    volumes:
      - ./test-codebase:/codebase:ro  # Mount your codebase (read-only)
      - /var/run/docker.sock:/var/run/docker.sock
  2. Set environment variables (create .env file or pass inline):
OPENAI_API_KEY=sk-your-key-here
GUIDE_CODEBASE_PATH=/codebase
NEO4J_PASSWORD=your-password
  3. Start services:
cd guide
docker compose --profile java up --build -d
  4. Verify:
# Check container logs
docker logs embabel-guide

# Verify codebase is mounted
docker exec embabel-guide ls -la /codebase

# Check environment variable
docker exec embabel-guide env | grep GUIDE_CODEBASE_PATH

How It Works

  • Configuration Binding: Spring Boot automatically binds GUIDE_CODEBASE_PATH environment variable to guide.codebase-path property
  • Volume Mounts: Docker Compose mounts your codebase directory into the container
  • No Deep Changes: The customization doesn't require any Docker-specific code changes

See TESTING.md for detailed Docker testing instructions.

Next Steps

  1. Customize references: Add codebase-specific references to references.yml
  2. Adjust chunking: Configure content-chunker settings for code
  3. Add filters: Filter out unwanted files (node_modules, target/, etc.)
  4. Multiple codebases: Extend to support multiple codebase paths
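
As a starting point for the filtering step, the sketch below decides whether a file should be ingested based on the directory names in its relative path. CodebaseFilter and isIngestible are hypothetical names; node_modules and target come from the list above, and .git is an added assumption. How to hook such a predicate into ingestDirectory is not shown, because that depends on Guide's ingestion internals.

```java
import java.util.Set;

// Hypothetical filter, not part of Guide.
final class CodebaseFilter {

    // Directory names whose contents should never be indexed.
    private static final Set<String> EXCLUDED_DIRS = Set.of("node_modules", "target", ".git");

    private CodebaseFilter() {
    }

    // Accepts a path relative to the codebase root, with either separator style.
    // Note: a file or directory named exactly like an excluded entry is rejected
    // wherever it appears in the path.
    static boolean isIngestible(String relativePath) {
        for (String segment : relativePath.replace('\\', '/').split("/")) {
            if (EXCLUDED_DIRS.contains(segment)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIngestible("src/main/java/App.java"));
        System.out.println(isIngestible("frontend/node_modules/react/index.js"));
    }
}
```

Matching whole path segments rather than substrings keeps directories such as src/targeting from being excluded by accident.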

Summary

Customizing Guide for codebase ingestion requires minimal changes:

  • ✅ Add codebasePath to GuideProperties (~1 line)
  • ✅ Modify loadReferences() to use ingestDirectory() (~10 lines)
  • ✅ Configure in application.yml (~1 line)

Total changes: ~12 lines of code vs thousands of lines for reimplementation!

Everything else (ChatActions, RAG, frontend, agent framework) works as-is. 🎉