# OneMillion-Bench

A rubric-based automated evaluation system for assessing language agents' capabilities on conceptual tasks of economic value across professional domains. It evaluates 50+ models and agent systems from 6 API-based providers, with weighted binary grading, async concurrency, cost tracking, and Excel/JSON reporting.

Contact: liuyang@bigai.ai
## Table of contents
- Overview
- Dataset
- Installation
- Quick start
- CLI reference
- Configuration
- Task format
- Output format
- Architecture
- Development
- Frequently asked questions
- License
## Overview

OneMillion-Bench is a CLI tool (`omb`) that generates LLM responses, grades them against weighted rubrics using judge models, and produces reports.
- 50+ models across 6 providers (OpenRouter, Qwen/DashScope, VolcEngine, Hunyuan, Ling-1T, LiteLLM)
- Concurrent async processing with up to 128 parallel requests
- Weighted rubric scoring with binary (yes/no) evaluation
- Repeated sampling & judging for variance estimation
- Web search augmentation for search-enabled generation
- Automatic cost tracking per model with token usage and pricing
- Rich reporting: Gruvbox-themed Excel workbooks, JSON summaries
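The repeated sampling and judging feature reports score variability across runs. Conceptually, per-model variance can be summarized as a mean and sample standard deviation over repeated scores; the following is a minimal standalone sketch, not code from the `omb` package:

```python
from statistics import mean, stdev

def score_variance(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over repeated runs.

    `scores` holds one aggregate score per repeated sample/judge run.
    """
    if len(scores) < 2:
        return (scores[0] if scores else 0.0, 0.0)
    return (mean(scores), stdev(scores))

# e.g. three repeated runs of one generator on one domain
m, s = score_variance([0.72, 0.68, 0.75])
```

With `--repeat-sample 3`, each generator produces three independent responses per question, so an estimate like this is available per model.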
## Dataset

5 professional domains, bilingual (Chinese and English), 400 questions:
| Domain | CN | EN | Total | Description |
|---|---|---|---|---|
| Healthcare | 40 | 40 | 80 | Clinical medicine, oncology, pharma, gene & cell therapy |
| Finance | 40 | 40 | 80 | Investment, equities, financial analysis |
| Industry | 40 | 40 | 80 | Systems engineering, embedded systems, robotics |
| Law | 40 | 40 | 80 | Corporate/commercial law, M&A |
| Natural Sciences | 40 | 40 | 80 | Chemistry, materials science |
Each test case contains a prompt, an optional system prompt, domain tags, and 5–23 weighted rubrics, each labeled as one of: Factual Information, Analytical Reasoning, Instructions Following, or Structure and Formatting.
## Installation

Requires Python 3.10+; conda is recommended.

```bash
conda create -n evals python=3.10 -y && conda activate evals
git clone https://github.com/humanlaya/OneMillion-Bench.git
cd OneMillion-Bench
pip install -e .
omb --version  # verify installation
```

Download the dataset:

```bash
mkdir datasets
hf download Humanlaya-Research-Insititution/OneMillion-Bench --repo-type=dataset --local-dir datasets/OneMillion-Bench
```

Set API keys as environment variables:

```bash
export OPENROUTER_API_KEY="your-key"  # Required (covers 50+ models)
export DASHSCOPE_API_KEY="your-key"   # Optional: Qwen / DashScope
export VOLC_API_KEY="your-key"        # Optional: VolcEngine (Doubao)
export HUNYUAN_API_KEY="your-key"     # Optional: Tencent Hunyuan
export LING_API_KEY="your-key"        # Optional: Ling-1T
export LITELLM_API_KEY="your-key"     # Optional: LiteLLM
```

Add these to `~/.bashrc` or `~/.zshrc` to persist across sessions.
## Quick start

```bash
cp src/omb/config/default.yaml my_config.yaml
```

Edit `my_config.yaml` to select models (API keys are read from environment variables, not the config file):

```yaml
JUDGE_MODELS:
- "google/gemini-3-pro-preview"
GENERATOR_MODELS:
- "openai/gpt-5.4"
- "anthropic/claude-opus-4.6"
- "google/gemini-3.1-pro-preview"
```

Run the evaluation:

```bash
omb eval --dataset datasets/OneMillion-Bench/Healthcare --config my_config.yaml  # single domain
omb eval --dataset datasets/OneMillion-Bench --config my_config.yaml --recursive  # all domains
omb eval --dataset datasets/OneMillion-Bench/Law --config my_config.yaml --grade-only  # grade only
```

Or use the example script:

```bash
python examples/auto_grading.py --dataset datasets/OneMillion-Bench/Healthcare --config my_config.yaml
```

Results are written to `outputs/result_YYYYMMDD_HHMMSS/`.
## CLI reference

```bash
omb eval [OPTIONS]
```

| Flag | Description |
|---|---|
| `--dataset PATH` | Path to test file or directory (required) |
| `--config / -c FILE` | YAML configuration file |
| `--recursive` | Scan subdirectories recursively |
| `--grade-only` | Grade existing responses only, skip generation |
| `--overwrite` | Force regeneration and regrading |
| `--enable-search` | Enable web search augmentation |
| `--limit N` | Limit number of test questions |
| `--repeat-sample N` | Independent responses per generator (default: 1) |
| `--repeat-judge N` | Repeated judge runs per response (default: 1) |
| `--detect-metadata` | Auto-detect model context length and max tokens |
| `--verbose / -v` | Verbose output |
| `--quiet / -q` | Suppress non-essential output |
| `--debug` | Show questions and responses |
| `--no-color` | Disable colored output |
Exit codes: 0 = success, 2 = config error, 3 = input error, 4 = partial failure, 130 = interrupted.
## Configuration

Configuration is managed via YAML files. Copy `src/omb/config/default.yaml` and modify as needed. API keys are read from environment variables, not config files.

| Environment variable | Provider |
|---|---|
| `OPENROUTER_API_KEY` | OpenRouter (50+ models) |
| `DASHSCOPE_API_KEY` | Qwen / DashScope |
| `VOLC_API_KEY` | VolcEngine (Doubao) |
| `HUNYUAN_API_KEY` | Tencent Hunyuan |
| `LING_API_KEY` | Ling-1T |
| `LITELLM_API_KEY` | LiteLLM |
| Parameter | Description |
|---|---|
| `JUDGE_MODELS` | Models used for grading |
| `GENERATOR_MODELS` | Models to evaluate |
| `REFERENCE_MODELS` | Optional baseline models with existing responses |
| `REASONING_EFFORT` | Judge reasoning effort: `"low"`, `"medium"`, `"high"`, `"xhigh"`, or `null` |
| Parameter | Default | Description |
|---|---|---|
| `MAX_TOKENS` | 128,000 | Max output tokens |
| `TIMEOUT` | 600 | API timeout (seconds) |
| `RETRY_TIMES` | 8 | Retry attempts |
| `SAMPLE_K` | 1 | Response samples per generator |
| `REPEATED_JUDGE` | 1 | Judge runs per response |
| `MAX_GENERATION_CONCURRENCY` | 128 | Parallel generation requests |
| `MAX_GRADING_CONCURRENCY` | 128 | Parallel grading requests |
| `TEMPERATURE` | 1.0 | Sampling temperature |
| `TOP_P` | 0.95 | Nucleus sampling threshold |
| `TOP_K` | 20 | Top-k sampling |
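As a sketch, adjusting a few of these parameters in a copy of `default.yaml` might look like the following; the values are illustrative only, and the flat upper-case keys follow the defaults table above:

```yaml
GENERATOR_MODELS:
- "openai/gpt-5.4"
MAX_TOKENS: 32000
TIMEOUT: 300
TEMPERATURE: 0.7
MAX_GENERATION_CONCURRENCY: 32
```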
## Task format

Each test case is a JSON file:

| Field | Type | Description |
|---|---|---|
| `prompt` | string | Question or instruction |
| `system_prompt` | string | Optional system prompt |
| `tags` | list[string] | Domain hierarchy, e.g., `["Healthcare", "Pharma", "Gene Therapy"]` |
| `case_id` | int | Unique test case ID |
| `rubrics` | list[object] | Evaluation rubrics (see below) |
Each rubric:

| Field | Type | Description |
|---|---|---|
| `rubric_number` | int | Index within the case |
| `rubric_detail` | string | Evaluation criterion |
| `rubric_weight` | int | Score weight (positive to award, negative to penalize) |
| `rubricLabel` | string | Category label |
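A case file can be sanity-checked against this schema before running an evaluation. The helper below is a standalone sketch, not part of the toolkit; note that because weights can be negative, only positive weights count toward the maximum attainable score:

```python
import json

# expected rubric fields and their JSON types, per the table above
REQUIRED_RUBRIC_FIELDS = {"rubric_number": int, "rubric_detail": str,
                          "rubric_weight": int, "rubricLabel": str}

def validate_case(case: dict) -> int:
    """Check required fields and return the max attainable score."""
    for field in ("prompt", "tags", "case_id", "rubrics"):
        if field not in case:
            raise ValueError(f"missing field: {field}")
    for rubric in case["rubrics"]:
        for name, typ in REQUIRED_RUBRIC_FIELDS.items():
            if not isinstance(rubric.get(name), typ):
                raise ValueError(f"rubric field {name!r} missing or wrong type")
    # negative weights penalize, so only positive weights count toward the max
    return sum(r["rubric_weight"] for r in case["rubrics"] if r["rubric_weight"] > 0)

case = json.loads("""{"prompt": "p", "tags": ["Law"], "case_id": 1,
  "rubrics": [
    {"rubric_number": 1, "rubric_detail": "d", "rubric_weight": 10, "rubricLabel": "Factual Information"},
    {"rubric_number": 2, "rubric_detail": "d", "rubric_weight": -5, "rubricLabel": "Instructions Following"}]}""")
max_score = validate_case(case)
```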
Example:

```json
{
  "prompt": "Explain the concept of viral titer in gene therapy ...",
  "system_prompt": "",
  "tags": ["Healthcare", "Pharma", "Gene Therapy"],
  "case_id": 2860,
  "rubrics": [
    {"rubric_number": 1, "rubric_detail": "Clearly defines viral titer and distinguishes infectious vs. physical titer.", "rubric_weight": 10, "rubricLabel": "Analytical Reasoning"},
    {"rubric_number": 2, "rubric_detail": "Lists at least 3 common titer measurement methods.", "rubric_weight": 8, "rubricLabel": "Factual Information"}
  ]
}
```

## Output format

Results are saved to `outputs/result_YYYYMMDD_HHMMSS/`:
- Updated JSON files — input files augmented with `model_response`, `rubric_auto_score`, `judge_cot`, and `rubric_auto_vs_human` fields.
- `grading_results.xlsx` — Gruvbox-themed workbook with per-judge scores, aggregate sheets, and cost breakdowns.
- `grading_results.json` — machine-readable summary with per-model metrics and costs.
## Architecture

```text
src/omb/
├── cli/              # CLI entry point (omb eval ...)
├── orchestrator.py   # Main workflow: load -> generate -> grade -> retry -> report
├── clients/          # LLM API clients (OpenRouter, Qwen, VolcEngine, Hunyuan, Ling-1T)
├── config/           # YAML configuration management with pricing data
├── grading/          # Grading prompt construction and response parsing
├── processing/       # Async generation and grading with semaphore concurrency
├── reporting/        # Excel (Gruvbox-themed) and JSON report generation
├── tracking/         # Token usage and cost tracking per model
└── utils/            # Rich console UI and Gruvbox color palette
```

See `docs/arch.md` for detailed architecture documentation.
## Development

```bash
make install       # Install dev dependencies
make test          # Run tests (pytest)
make lint          # Static analysis (flake8, mypy, pylint)
make format        # Format code (black, isort)
make format-check  # Check formatting without modifying
make check         # Run all checks
make clean         # Remove build artifacts
```

## Frequently asked questions

**What models are supported?** 50+ models through 6 providers, specified as `provider/model` (e.g., `openai/gpt-5.4`). See `src/omb/config/default.yaml` for the full list.
**Can I add a new LLM provider?** Yes. Extend `BaseLLMClient` in `src/omb/clients/` and implement the `call()` method.
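The exact interface of `BaseLLMClient` is not documented here, so the following is a hypothetical sketch: the base class below is a stand-in with an assumed async signature, and `EchoClient` is an invented toy provider used only to illustrate the shape of a subclass.

```python
# Hypothetical sketch: the real BaseLLMClient lives in src/omb/clients/ and
# its exact interface may differ. This stand-in only illustrates the shape.
import asyncio
from abc import ABC, abstractmethod

class BaseLLMClient(ABC):  # stand-in for omb's base class
    @abstractmethod
    async def call(self, model: str, messages: list[dict]) -> str: ...

class EchoClient(BaseLLMClient):
    """Toy provider that echoes the last user message (illustration only)."""
    async def call(self, model: str, messages: list[dict]) -> str:
        return messages[-1]["content"]

reply = asyncio.run(EchoClient().call(
    "echo/echo-1", [{"role": "user", "content": "hi"}]))
```

A real client would issue an HTTP request to the provider's API inside `call()` and report token usage for cost tracking.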
**Can I use custom rubrics?** Yes. Create JSON files following the task format schema.
**How does grading work?** The judge evaluates each rubric with a binary yes/no judgment and a chain-of-thought justification. Scores are weighted by the rubric weights. Multiple judges and repeated runs are supported.
**How is cost tracked?** Token usage is tracked per model per call. Costs are computed from the pricing table in the config and reported in both the Excel and JSON outputs.
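Cost computation from token counts and a per-million-token pricing table might look like the sketch below. The pricing values and table schema here are invented for illustration; the real table ships with omb's config and may differ.

```python
# Hypothetical pricing table: USD per 1M tokens (values are made up).
PRICING = {
    "openai/gpt-5.4": {"input": 2.50, "output": 10.00},
}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one API call from per-million-token input/output prices."""
    p = PRICING[model]
    return (prompt_tokens * p["input"]
            + completion_tokens * p["output"]) / 1_000_000

cost = call_cost("openai/gpt-5.4", 12_000, 3_000)  # 0.03 input + 0.03 output
```

Summing `call_cost` over every generation and grading call, grouped by model, yields the per-model cost breakdown reported in the outputs.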