
Commit 4ce62b2

Merge pull request #315 from cvs-health/release/v0.5.0
Minor release: `v0.5.0`

2 parents 9d30468 + 54858ea

File tree: 143 files changed (+19827, -763 lines)


.gitignore (7 additions, 2 deletions)

```diff
@@ -74,7 +74,8 @@ instance/
 
 # Sphinx documentation
 docs/_build/
-docs/
+docs/build/
+docs/source/_autosummary/
 docs_srcs/
 
 # PyBuilder
@@ -145,4 +146,8 @@ dmypy.json
 
 
 .vscode/
-.settings/
+.settings/
+.specstory/
+.cursor/
+.cursorindexingignore
+.cursorignore
```

README.md (80 additions, 15 deletions)

````diff
@@ -25,7 +25,7 @@ pip install uqlm
 ```
 
 ## Hallucination Detection
-UQLM provides a suite of response-level scorers for quantifying the uncertainty of Large Language Model (LLM) outputs. Each scorer returns a confidence score between 0 and 1, where higher scores indicate a lower likelihood of errors or hallucinations. We categorize these scorers into four main types:
+UQLM provides a suite of response-level scorers for quantifying the uncertainty of Large Language Model (LLM) outputs. Each scorer returns a confidence score between 0 and 1, where higher scores indicate a lower likelihood of errors or hallucinations. We categorize these scorers into different types:
 
 
 
@@ -35,6 +35,8 @@ UQLM provides a suite of response-level scorers for quantifying the uncertainty
 | [White-Box Scorers](#white-box-scorers-token-probability-based) | ⚡ Minimal\* (token probabilities already returned) | ✔️ None\* (no extra LLM calls) | 🔒 Limited (requires access to token probabilities) | ✅ Off-the-shelf |
 | [LLM-as-a-Judge Scorers](#llm-as-a-judge-scorers) | ⏳ Low-Medium (additional judge calls add latency) | 💵 Low-High (depends on number of judges) | 🌍 Universal (any LLM can serve as judge) | ✅ Off-the-shelf |
 | [Ensemble Scorers](#ensemble-scorers) | 🔀 Flexible (combines various scorers) | 🔀 Flexible (combines various scorers) | 🔀 Flexible (combines various scorers) | ✅ Off-the-shelf (beginner-friendly); 🛠️ Can be tuned (best for advanced users) |
+| [Long-Text Scorers](#long-text-scorers-claim-level) | ⏱️ High-Very high (multiple generations & claim-level comparisons) | 💸 High (multiple LLM calls) | 🌍 Universal | ✅ Off-the-shelf |
+
 
 <sup><sup> \*Does not apply to multi-generation white-box scorers, which have higher cost and latency. </sup></sup>
 
````
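As editorial context for the scorer comparison table above: the consistency idea behind the black-box and long-text scorers can be sketched in a few lines. The `consistency_score` function below is a hypothetical illustration, not part of the `uqlm` API; it uses exact string matching where the real scorers use semantic similarity or NLI entailment.

```python
from itertools import combinations

def consistency_score(responses):
    """Fraction of response pairs that exactly match: a naive stand-in
    for the semantic-similarity / entailment comparisons real
    consistency-based scorers perform across sampled generations."""
    pairs = list(combinations(responses, 2))
    if not pairs:  # a single response has no pairs to compare
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Three sampled answers to the same prompt; one disagrees,
# so only 1 of the 3 pairs matches.
print(consistency_score(["42", "42", "41"]))  # -> 0.3333...
```

Higher agreement across samples yields a score closer to 1, matching the convention that higher scores indicate a lower likelihood of hallucination.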

```diff
@@ -69,7 +71,7 @@ results.to_df()
 <img src="https://raw.githubusercontent.com/cvs-health/uqlm/main/assets/images/black_box_output4.png" />
 </p>
 
-Above, `use_best=True` implements mitigation so that the uncertainty-minimized responses is selected. Note that although we use `ChatOpenAI` in this example, any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used. For a more detailed demo, refer to our [Black-Box UQ Demo](./examples/black_box_demo.ipynb).
+Above, `use_best=True` implements mitigation so that the uncertainty-minimized response is selected. Note that although we use `ChatOpenAI` in this example, any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used. For a more detailed demo, refer to our [Black-Box UQ Demo](./examples/black_box_demo.ipynb).
 
 
 **Available Scorers:**
```
````diff
@@ -227,22 +229,85 @@ As with the other examples, any [LangChain Chat Model](https://js.langchain.com/
 * BS Detector ([Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175))
 * Generalized UQ Ensemble ([Bouchard & Chauhan, 2025](https://arxiv.org/abs/2504.19254))
 
+
+### Long-Text Scorers (Claim-Level)
+
+These scorers take a fine-grained approach and score confidence/uncertainty at the claim or sentence level. An extension of [black-box scorers](#black-box-scorers-consistency-based), long-text scorers sample multiple responses to the same prompt, decompose the original response into claims or sentences, and evaluate consistency of each original claim/sentence with the sampled responses.
+
+<p align="center">
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="assets/images/luq_example_dark.png">
+    <source media="(prefers-color-scheme: light)" srcset="assets/images/luq_example.png">
+    <img src="assets/images/luq_example.png" alt="LUQ Graphic" />
+  </picture>
+</p>
+
+After scoring claims in the response, the response can be refined by removing claims with confidence scores less than a specified threshold and reconstructing the response from the retained claims. This approach allows for improved factual precision of long-text generations.
+
+<p align="center">
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="assets/images/uad_graphic_dark.png">
+    <source media="(prefers-color-scheme: light)" srcset="assets/images/uad_graphic.png">
+    <img src="assets/images/uad_graphic.png" alt="UAD Graphic" />
+  </picture>
+</p>
+
+**Example Usage:**
+Below is a sample of code illustrating how to use the `LongTextUQ` class to conduct claim-level hallucination detection and uncertainty-aware response refinement.
+
+```python
+from langchain_openai import ChatOpenAI
+llm = ChatOpenAI(model="gpt-4o")
+
+from uqlm import LongTextUQ
+luq = LongTextUQ(llm=llm, scorers=["entailment"], response_refinement=True)
+
+results = await luq.generate_and_score(prompts=prompts, num_responses=5)
+results_df = results.to_df()
+results_df
+
+# Preview the data for a specific claim in the first response
+# results_df["claims_data"][0][0]
+# Output:
+# {
+#     'claim': 'Suthida Bajrasudhabimalalakshana was born on June 3, 1978.',
+#     'removed': False,
+#     'entailment': 0.9548099517822266
+# }
+```
+<p align="center">
+  <img src="https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/long_text_output.png" />
+</p>
+
+Above, `response` and `entailment` reflect the original response and response-level confidence score, while `refined_response` and `refined_entailment` are the corresponding values after response refinement. The `claims_data` column includes granular data for each response, including claims, claim-level confidence scores, and whether each claim is retained in the response refinement process. We use `ChatOpenAI` in this example, but any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used. For a more detailed demo, refer to our [Long-Text UQ Demo](./examples/long_text_uq_demo.ipynb).
+
+
+**Available Scorers:**
+
+* LUQ scorers ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279); [Zhang et al., 2025](https://arxiv.org/abs/2410.13246))
+* Graph-based scorers ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))
+* Generalized long-form semantic entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0))
+
 ## Documentation
 Check out our [documentation site](https://cvs-health.github.io/uqlm/latest/index.html) for detailed instructions on using this package, including API reference and more.
 
-## Example notebooks
-UQLM offers a broad collection of tutorial notebooks to demonstrate usage of the various scorers. These notebooks aim to have versatile coverage of various LLMs and datasets, but you can easily replace them with your LLM and dataset of choice. Below is a list of these tutorials:
-
-- [Black-Box Uncertainty Quantification](https://github.com/cvs-health/uqlm/blob/main/examples/black_box_demo.ipynb): A notebook demonstrating hallucination detection with black-box (consistency) scorers.
-- [White-Box Uncertainty Quantification (Single-Generation)](https://github.com/cvs-health/uqlm/blob/main/examples/white_box_single_generation_demo.ipynb): A notebook demonstrating hallucination detection with white-box (token probability-based) scorers requiring only a single generation per response (fastest and cheapest).
-- [White-Box Uncertainty Quantification (Multi-Generation)](https://github.com/cvs-health/uqlm/blob/main/examples/white_box_multi_generation_demo.ipynb): A notebook demonstrating hallucination detection with white-box (token probability-based) scorers requiring multiple generations per response (slower and more expensive, but higher performance).
-- [LLM-as-a-Judge](https://github.com/cvs-health/uqlm/blob/main/examples/judges_demo.ipynb): A notebook demonstrating hallucination detection with LLM-as-a-Judge.
-- [Tunable UQ Ensemble](https://github.com/cvs-health/uqlm/blob/main/examples/ensemble_tuning_demo.ipynb): A notebook demonstrating hallucination detection with a tunable ensemble of UQ scorers ([Bouchard & Chauhan, 2025](https://arxiv.org/abs/2504.19254)).
-- [Off-the-Shelf UQ Ensemble](https://github.com/cvs-health/uqlm/blob/main/examples/ensemble_off_the_shelf_demo.ipynb): A notebook demonstrating hallucination detection using BS Detector ([Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175)) off-the-shelf ensemble.
-- [Semantic Entropy](https://github.com/cvs-health/uqlm/blob/main/examples/semantic_entropy_demo.ipynb): A notebook demonstrating token-probability-based semantic entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0); [Kuhn et al., 2023](https://arxiv.org/abs/2302.09664)), a state-of-the-art multi-generation white-box scorer.
-- [Semantic Density](https://github.com/cvs-health/uqlm/blob/main/examples/semantic_density_demo.ipynb): A notebook demonstrating semantic density Semantic Density ([Qiu et al., 2024](https://arxiv.org/abs/2405.13845))), a state-of-the-art multi-generation white-box scorer.
-- [Multimodal Uncertainty Quantification](https://github.com/cvs-health/uqlm/blob/main/examples/multimodal_demo.ipynb): A notebook demonstrating UQLM's scoring approach with multimodal inputs (compatible with black-box UQ and white-box UQ).
-- [Score Calibration](https://github.com/cvs-health/uqlm/blob/main/examples/score_calibration_demo.ipynb): A notebook illustrating transformation of confidence scores into calibrated probabilities that better reflect the true likelihood of correctness.
+## Example notebooks and tutorials
+
+UQLM comes with a comprehensive set of example notebooks to help you get started with different uncertainty quantification approaches. These examples demonstrate how to use UQLM for various tasks, from basic hallucination detection to advanced ensemble methods.
+
+**[Browse all example notebooks →](https://github.com/cvs-health/uqlm/blob/main/examples/)**
+
+The examples directory contains tutorials for:
+- Black-box and white-box uncertainty quantification
+- Single and multi-generation approaches
+- LLM-as-a-judge techniques
+- Ensemble methods
+- State-of-the-art techniques like Semantic Entropy and Semantic Density
+- Multimodal uncertainty quantification
+- Score calibration
+
+Each notebook includes detailed explanations and code samples that you can adapt to your specific use case.
 
 ## Citation
 A technical description of the `uqlm` scorers and extensive experimental results are presented in **[this paper](https://arxiv.org/abs/2504.19254)**. If you use our framework or toolkit, please cite:
````
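The refinement step described in the new Long-Text Scorers section (drop claims below a confidence threshold, then rebuild the response from what remains) can be sketched as follows. `refine_response` is a hypothetical helper, not UQLM's implementation; the claim dicts mirror the `claims_data` format shown in the example output above.

```python
def refine_response(claims_data, threshold=0.5, score_key="entailment"):
    """Keep only claims whose confidence score meets the threshold,
    then rebuild the response text from the retained claims."""
    kept = [c["claim"] for c in claims_data if c[score_key] >= threshold]
    return " ".join(kept)

# Toy claims in the same shape as the claims_data entries above
claims = [
    {"claim": "Paris is the capital of France.", "removed": False, "entailment": 0.97},
    {"claim": "Paris was founded in 1200 BC.", "removed": False, "entailment": 0.12},
]
print(refine_response(claims))  # -> Paris is the capital of France.
```

Raising or lowering `threshold` trades recall of the original content against factual precision of the refined response.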

assets/README_PYPI.md (70 additions, 14 deletions)

````diff
@@ -21,7 +21,7 @@ pip install uqlm
 ```
 
 ## Hallucination Detection
-UQLM provides a suite of response-level scorers for quantifying the uncertainty of Large Language Model (LLM) outputs. Each scorer returns a confidence score between 0 and 1, where higher scores indicate a lower likelihood of errors or hallucinations. We categorize these scorers into four main types:
+UQLM provides a suite of response-level scorers for quantifying the uncertainty of Large Language Model (LLM) outputs. Each scorer returns a confidence score between 0 and 1, where higher scores indicate a lower likelihood of errors or hallucinations. We categorize these scorers into different types:
 
 
 
@@ -31,6 +31,8 @@ UQLM provides a suite of response-level scorers for quantifying the uncertainty
 | [White-Box Scorers](#white-box-scorers-token-probability-based) | ⚡ Minimal\* (token probabilities already returned) | ✔️ None\* (no extra LLM calls) | 🔒 Limited (requires access to token probabilities) | ✅ Off-the-shelf |
 | [LLM-as-a-Judge Scorers](#llm-as-a-judge-scorers) | ⏳ Low-Medium (additional judge calls add latency) | 💵 Low-High (depends on number of judges) | 🌍 Universal (any LLM can serve as judge) | ✅ Off-the-shelf |
 | [Ensemble Scorers](#ensemble-scorers) | 🔀 Flexible (combines various scorers) | 🔀 Flexible (combines various scorers) | 🔀 Flexible (combines various scorers) | ✅ Off-the-shelf (beginner-friendly); 🛠️ Can be tuned (best for advanced users) |
+| [Long-Text Scorers](#long-text-scorers-claim-level) | ⏱️ High-Very high (multiple generations & claim-level comparisons) | 💸 High (multiple LLM calls) | 🌍 Universal | ✅ Off-the-shelf |
+
 
 <sup><sup> \*Does not apply to multi-generation white-box scorers, which have higher cost and latency. </sup></sup>
 
````

````diff
@@ -207,22 +209,76 @@ As with the other examples, any [LangChain Chat Model](https://js.langchain.com/
 * BS Detector ([Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175))
 * Generalized UQ Ensemble ([Bouchard & Chauhan, 2025](https://arxiv.org/abs/2504.19254))
 
+### Long-Text Scorers (Claim-Level)
+
+These scorers take a fine-grained approach and score confidence/uncertainty at the claim or sentence level. An extension of [black-box scorers](#black-box-scorers-consistency-based), long-text scorers sample multiple responses to the same prompt, decompose the original response into claims or sentences, and evaluate consistency of each original claim/sentence with the sampled responses.
+
+<p align="center">
+  <img src="https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/luq_example.png" />
+</p>
+
+
+After scoring claims in the response, the response can be refined by removing claims with confidence scores less than a specified threshold and reconstructing the response from the retained claims. This approach allows for improved factual precision of long-text generations.
+
+<p align="center">
+  <img src="https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/uad_graphic.png" />
+</p>
+
+**Example Usage:**
+Below is a sample of code illustrating how to use the `LongTextUQ` class to conduct claim-level hallucination detection and uncertainty-aware response refinement.
+
+```python
+from langchain_openai import ChatOpenAI
+llm = ChatOpenAI(model="gpt-4o-mini")
+
+from uqlm import LongTextUQ
+luq = LongTextUQ(llm=llm, scorers=["entailment"], response_refinement=True)
+
+results = await luq.generate_and_score(prompts=prompts, num_responses=5)
+results_df = results.to_df()
+results_df
+
+# Preview the data for a specific claim in the first response
+# results_df["claims_data"][0][0]
+# Output:
+# {
+#     'claim': 'Suthida Bajrasudhabimalalakshana was born on June 3, 1978.',
+#     'removed': False,
+#     'entailment': 0.9548099517822266
+# }
+```
+<p align="center">
+  <img src="https://raw.githubusercontent.com/cvs-health/uqlm/develop/assets/images/long_text_output.png" />
+</p>
+
+Above, `response` and `entailment` reflect the original response and response-level confidence score, while `refined_response` and `refined_entailment` are the corresponding values after response refinement. The `claims_data` column includes granular data for each response, including claims, claim-level confidence scores, and whether each claim is retained in the response refinement process. We use `ChatOpenAI` in this example, but any [LangChain Chat Model](https://js.langchain.com/docs/integrations/chat/) may be used. For a more detailed demo, refer to our [Long-Text UQ Demo](./examples/long_text_uq_demo.ipynb).
+
+
+**Available Scorers:**
+
+* LUQ scorers ([Zhang et al., 2024](https://arxiv.org/abs/2403.20279); [Zhang et al., 2025](https://arxiv.org/abs/2410.13246))
+* Graph-based scorers ([Jiang et al., 2024](https://arxiv.org/abs/2410.20783))
+* Generalized long-form semantic entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0))
+
 ## Documentation
 Check out our [documentation site](https://cvs-health.github.io/uqlm/latest/index.html) for detailed instructions on using this package, including API reference and more.
 
-## Example notebooks
-UQLM offers a broad collection of tutorial notebooks to demonstrate usage of the various scorers. These notebooks aim to have versatile coverage of various LLMs and datasets, but you can easily replace them with your LLM and dataset of choice. Below is a list of these tutorials:
-
-- [Black-Box Uncertainty Quantification](https://github.com/cvs-health/uqlm/blob/main/examples/black_box_demo.ipynb): A notebook demonstrating hallucination detection with black-box (consistency) scorers.
-- [White-Box Uncertainty Quantification (Single-Generation)](https://github.com/cvs-health/uqlm/blob/main/examples/white_box_single_generation_demo.ipynb): A notebook demonstrating hallucination detection with white-box (token probability-based) scorers requiring only a single generation per response (fastest and cheapest).
-- [White-Box Uncertainty Quantification (Multi-Generation)](https://github.com/cvs-health/uqlm/blob/main/examples/white_box_multi_generation_demo.ipynb): A notebook demonstrating hallucination detection with white-box (token probability-based) scorers requiring multiple generations per response (slower and more expensive, but higher performance).
-- [LLM-as-a-Judge](https://github.com/cvs-health/uqlm/blob/main/examples/judges_demo.ipynb): A notebook demonstrating hallucination detection with LLM-as-a-Judge.
-- [Tunable UQ Ensemble](https://github.com/cvs-health/uqlm/blob/main/examples/ensemble_tuning_demo.ipynb): A notebook demonstrating hallucination detection with a tunable ensemble of UQ scorers ([Bouchard & Chauhan, 2025](https://arxiv.org/abs/2504.19254)).
-- [Off-the-Shelf UQ Ensemble](https://github.com/cvs-health/uqlm/blob/main/examples/ensemble_off_the_shelf_demo.ipynb): A notebook demonstrating hallucination detection using BS Detector ([Chen & Mueller, 2023](https://arxiv.org/abs/2308.16175)) off-the-shelf ensemble.
-- [Semantic Entropy](https://github.com/cvs-health/uqlm/blob/main/examples/semantic_entropy_demo.ipynb): A notebook demonstrating token-probability-based semantic entropy ([Farquhar et al., 2024](https://www.nature.com/articles/s41586-024-07421-0); [Kuhn et al., 2023](https://arxiv.org/abs/2302.09664)), a state-of-the-art multi-generation white-box scorer.
-- [Semantic Density](https://github.com/cvs-health/uqlm/blob/main/examples/semantic_density_demo.ipynb): A notebook demonstrating semantic density Semantic Density ([Qiu et al., 2024](https://arxiv.org/abs/2405.13845))), a state-of-the-art multi-generation white-box scorer.
-- [Multimodal Uncertainty Quantification](https://github.com/cvs-health/uqlm/blob/main/examples/multimodal_demo.ipynb): A notebook demonstrating UQLM's scoring approach with multimodal inputs (compatible with black-box UQ and white-box UQ).
-- [Score Calibration](https://github.com/cvs-health/uqlm/blob/main/examples/score_calibration_demo.ipynb): A notebook illustrating transformation of confidence scores into calibrated probabilities that better reflect the true likelihood of correctness.
+## Example notebooks and tutorials
+
+UQLM comes with a comprehensive set of example notebooks to help you get started with different uncertainty quantification approaches. These examples demonstrate how to use UQLM for various tasks, from basic hallucination detection to advanced ensemble methods.
+
+**[Browse all example notebooks →](https://github.com/cvs-health/uqlm/blob/main/examples/)**
+
+The examples directory contains tutorials for:
+- Black-box and white-box uncertainty quantification
+- Single and multi-generation approaches
+- LLM-as-a-judge techniques
+- Ensemble methods
+- State-of-the-art techniques like Semantic Entropy and Semantic Density
+- Multimodal uncertainty quantification
+- Score calibration
+
+Each notebook includes detailed explanations and code samples that you can adapt to your specific use case.
 
 ## Citation
 A technical description of the `uqlm` scorers and extensive experimental results are presented in **[this paper](https://arxiv.org/abs/2504.19254)**. If you use our framework or toolkit, please cite:
````
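The Score Calibration tutorial referenced in the notebook list above transforms raw confidence scores into calibrated probabilities. A minimal sketch of one such approach is histogram binning: map each raw score to the empirical accuracy of its bin on a labeled validation set. `bin_calibrate` below is a hypothetical illustration, not UQLM's implementation.

```python
def bin_calibrate(scores, labels, n_bins=5):
    """Toy histogram-binning calibration: partition [0, 1] into equal
    bins and map a raw confidence score to the empirical accuracy of
    its bin, estimated from (score, label) validation pairs."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        bins[min(int(s * n_bins), n_bins - 1)].append(y)
    rates = [sum(b) / len(b) if b else None for b in bins]

    def calibrated(s):
        r = rates[min(int(s * n_bins), n_bins - 1)]
        return r if r is not None else s  # fall back to raw score for empty bins

    return calibrated

# Two bins: high raw scores were always correct, low raw scores half the time
cal = bin_calibrate([0.9, 0.95, 0.2, 0.1], [1, 1, 0, 1], n_bins=2)
print(cal(0.85))  # upper-bin empirical accuracy -> 1.0
print(cal(0.30))  # lower-bin empirical accuracy -> 0.5
```

In practice more data per bin (or a parametric method such as Platt scaling or isotonic regression) is needed for stable estimates.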

Binary image assets added:

- assets/images/claim_qa_graphic.png (185 KB)
- assets/images/graph-uq3.png (100 KB)
- assets/images/long_text_output.png (85.8 KB)
- assets/images/luq_example.png (326 KB)
- assets/images/luq_example_dark.png (327 KB)
- (filename not captured in this view) (205 KB)
- assets/images/uad_graphic.png (101 KB)