Learning from LLM Disagreement in Retrieval Evaluation

Code and Data for Ingram et al., JCDL 2025

Overview

This repository contains code for reproducing the experiments in:

Ingram, W. A., Banerjee, B., and Fox, E. A. (2025). “Learning from LLM Disagreement in Retrieval Evaluation.” Proceedings of JCDL 2025.

The repository provides scripts and notebooks for:
• Building TF–IDF representations for SDG-retrieved abstracts.
• Isolating agreement and disagreement subsets for LLaMA and Qwen relevance labels.
• Computing lexical divergence, permutation statistics, and KL divergence.
• Simulating centroid-based and representative-query retrieval over disagreement sets.
• Evaluating learnability of disagreement using logistic regression and ROC curves.
• Cross-referencing disagreement rows with student-ready teacher scores to test whether disagreement cases sit in low-confidence regions.
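As a rough illustration of the lexical-divergence step, the sketch below compares the TF-IDF term mass of an agreement subset with that of a disagreement subset using KL divergence, then checks the result against a label-shuffling permutation baseline. The toy abstracts, the function names (`term_distribution`, `kl_divergence`, `permutation_p_value`), and the additive smoothing constant are assumptions made for this example and are not taken from the repository's scripts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def term_distribution(tfidf_rows, smoothing=1e-6):
    """Sum TF-IDF weights per term and normalize to a probability vector.

    The smoothing constant is an assumption; it keeps every entry positive
    so the KL divergence stays finite.
    """
    mass = np.asarray(tfidf_rows.sum(axis=0)).ravel() + smoothing
    return mass / mass.sum()


def kl_divergence(p, q):
    """KL(p || q) in nats for two strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))


def permutation_p_value(matrix, n_first, n_permutations=200, seed=0):
    """Compare the observed KL with KLs obtained after shuffling the rows."""
    rng = np.random.default_rng(seed)
    observed = kl_divergence(
        term_distribution(matrix[:n_first]), term_distribution(matrix[n_first:])
    )
    at_least_as_large = 0
    for _ in range(n_permutations):
        shuffled = matrix[rng.permutation(matrix.shape[0])]
        stat = kl_divergence(
            term_distribution(shuffled[:n_first]),
            term_distribution(shuffled[n_first:]),
        )
        at_least_as_large += int(stat >= observed)
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


# Invented toy abstracts standing in for the two subsets.
agreement_docs = [
    "clinical trial reduces child mortality",
    "vaccination program lowers mortality rates",
    "solar panels increase rural energy access",
]
disagreement_docs = [
    "stakeholder discourse on energy policy frameworks",
    "conceptual framework for poverty narratives",
    "policy discourse around health governance",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(agreement_docs + disagreement_docs)
n_agree = len(agreement_docs)

p = term_distribution(matrix[:n_agree])
q = term_distribution(matrix[n_agree:])
observed, p_value = permutation_p_value(matrix, n_agree)
print(f"KL(agreement || disagreement) = {observed:.3f}, permutation p = {p_value:.3f}")
```

KL divergence is asymmetric, so swapping the two subsets generally gives a different value; which direction is reported is a design choice this sketch does not settle.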

All code is implemented in Python, using scikit-learn for vectorization and modeling.
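To make the learnability idea concrete, here is a minimal sketch that fits a logistic regression on TF-IDF features to predict whether a document falls in the disagreement set, and scores it with a ROC curve. The synthetic documents, vocabulary lists, and five-fold cross-validation setup are assumptions chosen for illustration; the repository's notebooks may use different features, splits, and hyperparameters.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Invented vocabularies: shared terms plus terms specific to each class.
shared_terms = ["health", "energy", "poverty", "policy", "study"]
agree_terms = ["clinical", "trial", "mortality"]
disagree_terms = ["framework", "stakeholder", "discourse"]


def make_doc(class_terms):
    """Build a toy abstract from shared terms plus class-specific terms."""
    words = list(rng.choice(shared_terms, size=4)) + list(
        rng.choice(class_terms, size=2)
    )
    return " ".join(words)


docs = [make_doc(agree_terms) for _ in range(40)] + [
    make_doc(disagree_terms) for _ in range(40)
]
labels = np.array([0] * 40 + [1] * 40)  # 1 = disagreement case

# Vectorizer and classifier live in one pipeline so that each fold
# fits TF-IDF on its own training portion only.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Out-of-fold probabilities for the disagreement class.
probs = cross_val_predict(model, docs, labels, cv=5, method="predict_proba")[:, 1]

auc = roc_auc_score(labels, probs)
fpr, tpr, thresholds = roc_curve(labels, probs)
print(f"Out-of-fold ROC AUC: {auc:.3f}")
```

The toy classes are trivially separable by construction, so the AUC here says nothing about real data; an AUC well above 0.5 on real abstracts would suggest that disagreement carries a learnable lexical signal.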

Additional Analysis Script

The repository includes a standalone script that links disagreement rows back to the original per-SDG Qwen probability CSVs and summarizes the corresponding teacher probability (p1) and teacher logit (teacher_logit) scores. To install dependencies and run it:

uv sync
python scripts/analyze_disagreement_scores.py

By default, the script reads:

data/model_disagreements.csv
data/sdg1xsdg1_2023_train__scopus_sdg1_qwen_binary_bit_with_probs_v1.csv
data/sdg1xsdg1_2023_test__scopus_sdg1_qwen_binary_bit_with_probs_v1.csv
data/sdg3xsdg3_2023_train__scopus_sdg1_qwen_binary_bit_with_probs_v1.csv
data/sdg3xsdg3_2023_test__scopus_sdg1_qwen_binary_bit_with_probs_v1.csv
data/sdg7xsdg7_2023_train__scopus_sdg1_qwen_binary_bit_with_probs_v1.csv
data/sdg7xsdg7_2023_test__scopus_sdg1_qwen_binary_bit_with_probs_v1.csv

and writes CSV outputs under:

data/analysis/disagreement_scores/

The main outputs are:

• disagreement_score_rows.csv: row-level disagreement cases with the matched per-SDG p1 and teacher_logit scores from the original Qwen scoring CSVs.
• disagreement_score_summary.csv: grouped summaries by SDG and disagreement direction.
• disagreement_score_summary_by_split.csv: the same summaries broken out by train/test split.
• disagreement_score_baseline_by_sdg.csv: baseline score distributions for the full per-SDG Qwen scoring corpus.
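The linking and summarizing steps can be pictured with a small pandas sketch. Only `p1` and `teacher_logit` are column names mentioned in this README; the key columns (`doc_id`, `sdg`), the `split` and `direction` columns, and all values are hypothetical stand-ins, and the actual logic in scripts/analyze_disagreement_scores.py may differ.

```python
import pandas as pd

# Hypothetical disagreement rows (stand-in for data/model_disagreements.csv).
disagreements = pd.DataFrame(
    {
        "doc_id": ["a", "b", "c"],
        "sdg": [1, 1, 3],
        "direction": ["llama_only", "qwen_only", "llama_only"],
    }
)

# Hypothetical per-SDG Qwen scores; the numbers are invented.
scores = pd.DataFrame(
    {
        "doc_id": ["a", "b", "c", "d"],
        "sdg": [1, 1, 3, 3],
        "split": ["train", "test", "train", "test"],
        "p1": [0.55, 0.48, 0.91, 0.10],
        "teacher_logit": [0.20, -0.08, 2.31, -2.20],
    }
)

# Row-level output: each disagreement case with its matched scores.
# validate= raises if a key appears more than once on either side.
rows = disagreements.merge(
    scores, on=["doc_id", "sdg"], how="left", validate="one_to_one"
)

# Grouped summary by SDG and disagreement direction.
summary = rows.groupby(["sdg", "direction"], as_index=False).agg(
    n=("p1", "size"),
    mean_p1=("p1", "mean"),
    median_teacher_logit=("teacher_logit", "median"),
)

# Baseline over the full scoring corpus, for comparison with the summary.
baseline = scores.groupby("sdg", as_index=False).agg(
    n=("p1", "size"), mean_p1=("p1", "mean")
)
print(summary)
```

A left join keeps disagreement rows that have no matching score, so unmatched cases show up as missing values instead of disappearing silently.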
