Skip to content

WIP watch diff with upstream main branch#6

Open
Jeronymous wants to merge 85 commits into
upstream-mainfrom
merge_hf_main
Open

WIP watch diff with upstream main branch#6
Jeronymous wants to merge 85 commits into
upstream-mainfrom
merge_hf_main

Conversation

@Jeronymous

Copy link
Copy Markdown
Member

No description provided.

Oligou and others added 30 commits October 14, 2025 11:51
…q len (131072) is larger than the maximum number of tokens that can be stored in KV cache (130944). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine"
…and new version of the dataset is different)
Jeronymous and others added 25 commits April 9, 2026 15:28
Add Red Teaming benchmark based on AvgBench
Upstream refactor splits src/lighteval/tasks into per-task files under
src/lighteval/tasks/tasks/ and src/lighteval/tasks/multilingual/tasks/,
drops default_tasks.py / default_prompts.py / multilingual/tasks.py, and
removes the suite field from LightevalTaskConfig.

Port our edits to the new structure:
- tasks/gsm_plus.py: generation_size 16384
- tasks/gsm8k.py: generation_size 2048
- tasks/mgsm.py: hf_revision, suffix exact_match + expr_gold_metric,
  language-specific stop sequences for all 11 subsets
- tasks/piqa.py: switch to lighteval/piqa mirror
- tasks/siqa.py: pin hf_revision
- tasks/mmlu_pro.py: fix upstream's hardcoded ABCD letters so the prompt
  uses dynamic letters based on the number of options; add a parallel
  mmlu_pro_raw task exposing the handmade prompt (no inspect_ai)
- tasks/ruler.py: new home for the ruler prompt helper
- tasks/advbench.py: move here from community_tasks/
- multilingual/tasks/mathalea.py: move here from community_tasks/
- multilingual/tasks/french.py: keep jzhang86/fr_ifeval fallback and the
  generative GPQA-fr-diamond variant with prompt_gpqa_fr_instruct

Other conflict resolutions:
- pyproject.toml: take upstream unpinned transformers, vllm>=0.11.0,
  new inspect-ai and openai deps
- vllm_model.py: keep max_seq_len_to_capture fallback, Mistral eos_token
  guard, prefix-cache None-skip in logprob loop, and
  skip_reading_prefix_cache via guarded attribute assignment; adopt
  upstream's build_vllm_token_prompts helper
- llm_as_judge.py: keep max_model_len=65536, adopt upstream's
  api_key/base_url litellm pass-through
- lighteval_task.py: preserve name/data_dir fallback in load_dataset
  while picking up upstream's data_files support; keep partial args
  detail in __str__ for deterministic cache hashing
- cache_management.py: adopt name-only task_to_configs lookup; keep
  regex that strips function memory addresses for hash determinism
litellm.completion expects an int, not a (N,) tuple.
Current RAG-style tasks need the row-specific retrieved context to
live in the system role, not prepended to the user query. Opt-in
flag keeps all existing tasks unchanged.
squad_v2 was filtering out questions with no answer, which is
exactly the half of the dataset that tests refusal behavior.
Replace the filter with an explicit "unanswerable" choice.
…options, not all the possible ones. Also increase generation_size from 100 to 1024 (for thinking models)
The generator had been narrowed to MCFFormulation + the ALL label only,
which dropped the _cf/_hybrid variants and the CA/CS/UNK labels. Restore
the full formulation list and sensitivity labels.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants