WIP watch diff with upstream main branch by Jeronymous · Pull Request #6 · OpenLLM-France/lighteval

Jeronymous · 2026-04-22T12:26:19Z

No description provided.

…ion of the dataset)

…q len (131072) is larger than the maximum number of tokens that can be stored in KV cache (130944). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine"

… caching

…and new version of the dataset is different)

…t it has eos_token_id)

…o avoid failures or NaN

Add Red Teaming benchmark based on AvgBench

Upstream refactor splits src/lighteval/tasks into per-task files under src/lighteval/tasks/tasks/ and src/lighteval/tasks/multilingual/tasks/, drops default_tasks.py / default_prompts.py / multilingual/tasks.py, and removes the suite field from LightevalTaskConfig. Port our edits to the new structure: - tasks/gsm_plus.py: generation_size 16384 - tasks/gsm8k.py: generation_size 2048 - tasks/mgsm.py: hf_revision, suffix exact_match + expr_gold_metric, language-specific stop sequences for all 11 subsets - tasks/piqa.py: switch to lighteval/piqa mirror - tasks/siqa.py: pin hf_revision - tasks/mmlu_pro.py: fix upstream's hardcoded ABCD letters so the prompt uses dynamic letters based on the number of options; add a parallel mmlu_pro_raw task exposing the handmade prompt (no inspect_ai) - tasks/ruler.py: new home for the ruler prompt helper - tasks/advbench.py: move here from community_tasks/ - multilingual/tasks/mathalea.py: move here from community_tasks/ - multilingual/tasks/french.py: keep jzhang86/fr_ifeval fallback and the generative GPQA-fr-diamond variant with prompt_gpqa_fr_instruct Other conflict resolutions: - pyproject.toml: take upstream unpinned transformers, vllm>=0.11.0, new inspect-ai and openai deps - vllm_model.py: keep max_seq_len_to_capture fallback, Mistral eos_token guard, prefix-cache None-skip in logprob loop, and skip_reading_prefix_cache via guarded attribute assignment; adopt upstream's build_vllm_token_prompts helper - llm_as_judge.py: keep max_model_len=65536, adopt upstream's api_key/base_url litellm pass-through - lighteval_task.py: preserve name/data_dir fallback in load_dataset while picking up upstream's data_files support; keep partial args detail in __str__ for deterministic cache hashing - cache_management.py: adopt name-only task_to_configs lookup; keep regex that strips function memory addresses for hash determinism

litellm.completion expects an int, not a (N,) tuple.

Current RAG-style tasks need the row-specific retrieved context to live in the system role, not prepended to the user query. Opt-in flag keeps all existing tasks unchanged.

…ge LLM (to avoid some memory errors)

…ntually called)

squad_v2 was filtering out questions with no answer, which is exactly the half of the dataset that tests refusal behavior. Replace the filter with an explicit "unanswerable" choice.

…options, not all the possible ones. Also increase generation_size from 100 to 1024 (for thinking models)

The generator had been narrowed to MCFFormulation + the ALL label only, which dropped the _cf/_hybrid variants and the CA/CS/UNK labels. Restore the full formulation list and sensitivity labels.

…kken + max_images to skip vision profiling)

…dict=False to get token ids, not a BatchEncoding)

…huggingface#1067 regression)

…text' setting

Oligou and others added 30 commits October 14, 2025 11:51

Merge branch

fd58a24

skip task if no documents

80fb9cd

Change default use_chat_template when loading the tokenizer fails

acd19f1

Take HF_HOME env variable into account (if set)

3cc6315

Fix MGSM evals

f0f7162

fix reshape bug

df19f29

Remove padding from response

646d657

add ruler metric and prompt

8c07847

Add RULER in metrics

ed1718b

make FLORES translation benchmark work with datasets v2 (parquet vers…

58d0ccf

…ion of the dataset)

Fix possible failure around stop_sequences

1deed74

Fix failure reported in huggingface#1005 (from Pull Request huggingfa…

769a575

…ce#1006)

Do not use GPT as a judge

2d001dd

Fix IFBench subset

e7069e2

Fix IFEval-fr dataset repo

628d2b0

limit the model length to avoid error "ValueError: The model's max se…

2d1f146

…q len (131072) is larger than the maximum number of tokens that can be stored in KV cache (130944). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine"

make cache string independant of function random address

b7cf5ff

Do not take version of transformers that is bug w.r.t OFFLINE behaviour

9436e15

Fix use of sets in eval code

4c9e90c

Fix corner case

bc164c1

Misc fixes in RULER evaluation

cb2da29

Change the code to make it work with more recent versions of vllm

82805ab

Fix vllm call in LLM as a judge

41dec9a

Fix error in logprob computation with vllm >= 0.12, because of prefix…

2e968b2

… caching

Fix GPQA-French benchmark (original dataset cannot be found anymore, …

d9af025

…and new version of the dataset is different)

Fix for Mistral tokenizer, that does not have eos_token attribute (bu…

a7e4591

…t it has eos_token_id)

Fix corner cases

45ba41e

Fix corner case on IFBench

9ba96b0

override max_position_embedding with max_length passed by the user, t…

e74e9c0

…o avoid failures or NaN

add COMET and MetricX metrics to lighteval

ddce778

Jeronymous and others added 25 commits April 9, 2026 15:28

Make results deterministic. Add the judgement in the details

280f450

Also add another judgement where the judge does not see the question

8d5c991

Add possibility to avoid running evaluation

da058f2

Merge pull request #4 from OpenLLM-France/advbench

481d9bd

Add Red Teaming benchmark based on AvgBench

Fix ruff style and lint after merge

180975c

Solve version incompatibility in project install

2466d64

less differences with the upstream branch

68494ca

Add copyright

9ca1f4b

less differences with the upstream branch

6ee2a9e

do not build doc on fork

d9fe736

Add safety / red-teaming benchmarks

379ed71

fix max_tokens tuple bug in JudgeLM litellm call

a7febad

litellm.completion expects an int, not a (N,) tuple.

support per-doc system role via Doc.specific["instruction_as_system"]

b68623f

Current RAG-style tasks need the row-specific retrieved context to live in the system role, not prepended to the user query. Opt-in flag keeps all existing tasks unchanged.

Add environment variable to possibly tune the memory usage of the jud…

4ecdb69

…ge LLM (to avoid some memory errors)

make sure the memory of the LLM is freed (before the judge LLM is eve…

9f90fba

…ntually called)

Add generative task variant for MathAlea

ca639a2

keep unanswerable rows in squad_v2

5946dea

squad_v2 was filtering out questions with no answer, which is exactly the half of the dataset that tests refusal behavior. Replace the filter with an explicit "unanswerable" choice.

Fix MixEval: For FreeForm, the judge was onloy seeing the first good …

84f1717

…options, not all the possible ones. Also increase generation_size from 100 to 1024 (for thinking models)

add luciole_rag citation-aware grounded QA benchmark

7dabf27

Add Exo7 benchmark

cfb4c2d

Remove unsupported 'suite' argument from safety task configs

e03dd8a

Remove unsupported 'suite' argument from registry docstring example

eb76c0c

Restore CF/Hybrid formulations and sensitivity labels in global_mmlu

e0b6b4e

The generator had been narrowed to MCFFormulation + the ALL label only, which dropped the _cf/_hybrid variants and the CA/CS/UNK labels. Restore the full formulation list and sensitivity labels.

Add comet and metricx metrics to flores200

f122b15

Jeronymous force-pushed the merge_hf_main branch from 53a2e84 to f122b15 Compare June 17, 2026 13:38

Jeronymous added 4 commits June 17, 2026 17:45

vllm: fix Ministral on transformers v5 (mistral tokenizer_mode for te…

c36d670

…kken + max_images to skip vision profiling)

judge: fix vLLM judge on transformers v5 (apply_chat_template return_…

32707db

…dict=False to get token ids, not a BatchEncoding)

metrics: fix apply_metric for batched metrics returning list-of-dicts (…

e50ecd3

…huggingface#1067 regression)

Safety benchmarks: Use Llama Guard 4 judge. And don't compute 'no_con…

5535c34

…text' setting

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP watch diff with upstream main branch#6

WIP watch diff with upstream main branch#6
Jeronymous wants to merge 85 commits into
upstream-mainfrom
merge_hf_main

Jeronymous commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

Jeronymous commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants