# LLM Evaluation Framework

A Python-based framework for evaluating Large Language Models (LLMs), based on Anthropic's research paper and using the DRACO AI dataset.
- Quick Start
- Features
- Installation
- Configuration
- Usage
- Project Structure
- Example Plots
- Contributing
- Troubleshooting
- License
## Quick Start

- Clone the repo
- Set up environment variables
- Install dependencies
- Run:

```bash
python main.py
```
## Features

- 🤖 Multiple Model Support: OpenAI, Anthropic, Together AI, Groq, OpenRouter, Gemini, HuggingFace
- 📊 Evaluation Metrics: Completeness, relevance, conciseness, confidence, factuality, judgement, and custom
- 🔍 RAG Implementation: FAISS vectorstore with BGE embeddings and reranking
- 🛠️ Tool Usage: Code execution, simulation running, SmolAgents integration
- ⚖️ Multiple Judges: Support for secondary judge models
- 📈 Statistical Analysis: Comprehensive statistics and visualization
- 🌐 Cross-Platform: Windows, macOS, and Linux support
## Installation

1. Clone the repository:

```bash
git clone https://github.com/nsourlos/LLM_evaluation_framework.git
cd LLM_evaluation_framework
```

2. Create and activate a virtual environment:

```bash
python -m venv DRACO
source DRACO/bin/activate  # On Windows, use `DRACO\Scripts\activate`
```

3. Install the dependencies:

```bash
pip install -r requirements.txt
# Optionally: ipywidgets==8.1.7 for running in a Jupyter notebook
# Optionally: flash-attn==2.6.3 for GPU support
```

4. (Optional) Use the environment within a Jupyter Notebook:

```bash
pip install ipykernel
python -m ipykernel install --user --name=DRACO --display-name "Python (DRACO)"
```

5. (Optional) Set up the code execution environment:

```bash
# When using code execution features, a separate environment is needed
# to safely run generated code without conflicts
conda create -n test_LLM python==3.10 -y
conda activate test_LLM
pip install -r data/requirements_code_execution.txt
```

Note: If using venv instead of conda, the paths in `src/llm_eval/utils/paths.py` must be modified to point to the correct venv location.

This creates an isolated environment for running generated code, preventing potential conflicts with the main evaluation environment.
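As a hedged sketch of the kind of edit `paths.py` may need when a venv named DRACO is used: the names `base_path` and `venv_path` come from the Configuration section of this README, but the concrete locations below are placeholders for your machine, and the actual variable layout in the repo may differ.

```python
# Sketch only: base_path and venv_path are the names this README says
# src/llm_eval/utils/paths.py exposes; the locations are placeholders.
import platform
from pathlib import Path

if platform.system() == "Windows":
    base_path = Path(r"C:\Users\you\LLM_evaluation_framework")
    venv_path = base_path / "DRACO" / "Scripts" / "python.exe"
else:  # macOS / Linux
    base_path = Path.home() / "LLM_evaluation_framework"
    venv_path = base_path / "DRACO" / "bin" / "python"
```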
## Configuration

1. Rename `env_example` to `env` and add your API keys:

```
OPENAI_API_KEY="your_openai_api_key"
GEMINI_API_KEY="your_gemini_api_key"
TOGETHER_API_KEY="your_together_api_key"
GROQ_API_KEY="your_groq_api_key"
ANTHROPIC_API_KEY="your_anthropic_api_key"
HF_TOKEN="your_huggingface_token"
OPEN_ROUTER_API_KEY="your_openrouter_api_key"
```
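Once `env` is populated, the keys can be loaded into the process environment. A minimal dependency-free sketch (the framework itself may load them differently, e.g. via python-dotenv; `load_env_file` and the demo file name are illustrative):

```python
import os

def load_env_file(path):
    """Parse KEY="value" lines into os.environ.
    Overrides existing values; python-dotenv's load_dotenv by default does not."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

# Demo with a throwaway file; point this at the real `env` file in practice.
with open("env_demo", "w") as fh:
    fh.write('OPENAI_API_KEY="your_openai_api_key"\n')
load_env_file("env_demo")
```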
2. Edit `src/llm_eval/utils/paths.py` to set your system-specific paths:

- For the corresponding OS, set `base_path` and `venv_path`
3. Edit `src/llm_eval/config.py` to configure:

- `excel_file_name`: Your dataset Excel file (must be supplied by the user)
- `embedding_model`: Model for RAG embeddings
- `reranker_model_name`: Model for reranking
- `models`: List of models to evaluate (e.g. OpenAI, Together, Gemini models)
- `judge_model`: Models used to judge the results
- `commercial_api_providers`: Used to distinguish commercial and HuggingFace models
- `max_output_tokens`: Maximum tokens in judge LLM output
- `generate_max_tokens`: Token limit for regular model responses
- `generation_max_tokens_thinking`: Token limit for reasoning model responses
- `domain`: Domain of evaluation (e.g. "Water" Engineering)
- `n_resamples`: Number of times to resample the dataset
- `continue_from_resample`: Which resample iteration to continue from
- `tool_usage`: Enable/disable tool usage for answering questions
- `use_RAG`: Enable/disable RAG (Retrieval Augmented Generation)
- `use_smolagents`: Enable/disable SmolAgents for code execution
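For orientation, a hypothetical `config.py` snippet is shown below. Every value (model identifiers, file name, token limits) is a placeholder, not the project's defaults; consult `src/llm_eval/config.py` for the real option set.

```python
# Illustrative values only -- not the repository's defaults.
excel_file_name = "my_dataset.xlsx"          # supplied by the user
embedding_model = "BAAI/bge-large-en-v1.5"   # RAG embeddings (BGE, per Features)
reranker_model_name = "BAAI/bge-reranker-large"
models = ["gpt-4o", "gemini-1.5-pro"]        # models to evaluate
judge_model = ["gpt-4o"]                     # judge(s) for scoring
max_output_tokens = 1024                     # judge LLM output limit
generate_max_tokens = 512                    # regular model responses
generation_max_tokens_thinking = 4096        # reasoning model responses
domain = "Water"
n_resamples = 3
continue_from_resample = 0
tool_usage = False
use_RAG = True
use_smolagents = False
```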
The input Excel file must contain at least two columns:

- `input`: The questions or prompts to evaluate
- `output`: The expected answers or ground truth

Additional columns may be added:

- `id`: Column to uniquely identify questions
- `origin_file`: The JSON file from which the question-answer pair was extracted
- `topic`/`subtopic`: The topic/subtopic of the question
- `Reference`: Information on where the question-answer pair was obtained
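A quick way to produce a conforming input file is with pandas. The file name and rows below are invented examples; writing `.xlsx` additionally requires the openpyxl package.

```python
import pandas as pd

# Required columns: input (question) and output (ground truth);
# extra columns such as id and topic are optional.
df = pd.DataFrame({
    "input": ["What is the head loss in a 100 m pipe at 2 m/s?"],
    "output": ["Apply the Darcy-Weisbach equation ..."],
    "id": [1],
    "topic": ["hydraulics"],
})
# df.to_excel("my_dataset.xlsx", index=False)  # uncomment to write; needs openpyxl
```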
## Usage

1. Configure parameters:
   - Set up your environment variables in `env_example` and rename it to `env`
   - Configure paths in `src/llm_eval/utils/paths.py`
   - Modify prompts and the list of metrics in `src/llm_eval/evaluation/prompts.py`
   - Adjust parameters in `src/llm_eval/config.py`

2. Run the evaluation:

```bash
python main.py
# Optionally: python main.py | tee data/log.txt  to save terminal output to a txt file
```
The script will:
- Load and process your Excel dataset
- Run evaluations on specified models
- Generate Excel results files
- Create JSON files for statistics
- Produce visualization plots
## Project Structure

```
llm_evaluation_framework/
├── src/
│   └── llm_eval/
│       ├── config.py                 # All configuration parameters
│       ├── core/
│       │   ├── data_loader.py        # Functions for loading data and models
│       │   └── model_utils.py        # Model initialization and utilities
│       ├── evaluation/
│       │   ├── evaluator.py          # Evaluation functions
│       │   └── prompts.py            # All evaluation prompt strings
│       ├── providers/
│       │   └── api_handlers.py       # Helper functions for LLM APIs
│       ├── tools/
│       │   ├── code_execution.py     # Logic for tool handling
│       │   └── tool_usage.py         # Tool usage definition and decision logic
│       └── utils/
│           ├── paths.py              # OS-specific path configurations
│           ├── plotting.py           # Visualization functions
│           ├── processing.py         # Processing and Excel file creation
│           ├── rag.py                # RAG implementation
│           ├── scoring.py            # Scoring utilities
│           └── statistics.py         # Statistical calculations
├── notebooks/
│   └── convert_DRACO_to_excel.ipynb  # Create Excel file from JSON files with question-answer pairs
├── data/
│   ├── requirements_code_execution.txt  # Dependencies for code execution environment
│   ├── network_0.inp                 # Input file for network comparison
│   ├── network_test.inp              # Input file for network testing scenarios
│   ├── compare_networks_test.py      # Test script for network comparison functionality
│   └── compare_networks.py           # Main network comparison implementation
├── runpod/
│   ├── README_runpod.md              # RunPod instructions
│   └── runpod_initialize.ipynb       # Notebook that automatically initializes RunPod and copies files to it
├── example_imgs/
│   ├── metric_comparison_grid.png    # Example comparison grid of models for different metrics
│   ├── model_performance_summary.png # Example metric comparisons between models
│   ├── model_statistical_comparisons.png  # Example statistical comparisons between models
│   └── spider_chart_judge_deepseek-ai_DeepSeek-V3.png  # Example spider chart comparing metrics across models
├── main.py                           # Main script
├── env_example                       # Environment variables (to be renamed to env)
├── requirements.txt                  # Dependencies
└── README.md                         # This file
```
## Example Plots

The framework generates various visualization plots to help analyze the evaluation results. Here are some examples comparing two models:

- Overall performance summary of evaluated models
- Spider chart showing metric distribution
- Comparison of different metrics across models
- Statistical comparison between models with p-values
## Contributing

When making changes:
- Maintain backward compatibility
- Preserve original function signatures
- Keep all comments and logging
- Remove Langsmith
- Replace txt saves with logging
All operations are logged to txt files to track errors. To modify the list of metrics to be evaluated, change `list_of_metrics` in `prompts.py`.
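As a guide, `list_of_metrics` plausibly looks like the following. The metric names are taken from the Features section of this README; `prompts.py` in the repo is authoritative for the exact entries and structure.

```python
# Hypothetical contents of list_of_metrics in src/llm_eval/evaluation/prompts.py.
list_of_metrics = [
    "completeness",
    "relevance",
    "conciseness",
    "confidence",
    "factuality",
    "judgement",
]
```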
To be decided ....