Fix helm_normalizer mis-scoring numbers (homogeneize before remove_punc) #1273

Open
iamsharduld wants to merge 1 commit into huggingface:main from iamsharduld:fix/helm-normalizer-numbers

@iamsharduld

What

In helm_normalizer (the quasi-exact-match / HELM normalizer), remove_punc runs before
homogeneize_numbers, so it strips the decimal point before the float() cast. The result is wrong
on numeric answers:

from lighteval.metrics.normalizations import helm_normalizer
from lighteval.metrics.metrics_sample import ExactMatches
em = ExactMatches(normalize_gold=helm_normalizer, normalize_pred=helm_normalizer)

em.compute_one_item("10", "1.0")   # 1  -> distinct numbers scored as EXACT MATCH (false positive)
em.compute_one_item("3.14", "314") # 1  -> false positive
em.compute_one_item("1.0", "1")    # 0  -> false NEGATIVE, contradicting homogeneize_numbers' docstring

Root cause: '1.0' -> remove_punc -> '10' -> float -> '10.0', and '3.14' -> '314' -> '314.0',
so different numbers normalize to the same string while 1.0 and 1 normalize differently.
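
The collision can be reproduced outside lighteval with a minimal sketch. The two helpers below are simplified stand-ins written for illustration (they assume remove_punc drops ASCII punctuation and homogeneize_numbers round-trips a token through float()); they are not the library's code, and `punc_then_numbers` is a hypothetical name for the ordering described above.

```python
import string


def remove_punc(text: str) -> str:
    # Stand-in (assumption): drop every ASCII punctuation character, "." included.
    return "".join(ch for ch in text if ch not in string.punctuation)


def homogeneize_numbers(text: str) -> str:
    # Stand-in (assumption): round-trip through float() when the token parses as a number.
    try:
        return str(float(text))
    except ValueError:
        return text


def punc_then_numbers(token: str) -> str:
    # Hypothetical helper: the ordering this PR describes as buggy.
    return homogeneize_numbers(remove_punc(token))


for token in ("1.0", "10", "3.14", "314", "1"):
    print(repr(token), "->", repr(punc_then_numbers(token)))
```

Under these assumptions '1.0' and '10' both come out as '10.0', '3.14' and '314' both come out as '314.0', and '1' comes out as '1.0', which no longer equals the result for '1.0'.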

Fix

Run homogeneize_numbers before remove_punc, so float() sees the original token. Distinct
numbers then stay distinct and equal-but-differently-formatted numbers match, per the docstring intent.
Non-numeric tokens are unaffected (homogeneize_numbers returns them unchanged either way).
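
A minimal sketch of the reordered pipeline, again using simplified stand-ins rather than lighteval's actual helpers (`numbers_then_punc` is a hypothetical name, and the stand-in bodies are assumptions). It only checks the three numeric pairs from this PR plus one plain word, so it illustrates the change rather than proving it for every input.

```python
import string


def remove_punc(text: str) -> str:
    # Stand-in (assumption): drop every ASCII punctuation character, "." included.
    return "".join(ch for ch in text if ch not in string.punctuation)


def homogeneize_numbers(text: str) -> str:
    # Stand-in (assumption): round-trip through float() when the token parses as a number.
    try:
        return str(float(text))
    except ValueError:
        return text


def numbers_then_punc(token: str) -> str:
    # Hypothetical helper: float() sees the original token, punctuation is stripped afterwards.
    return remove_punc(homogeneize_numbers(token))


pairs = [("1.0", "1"), ("10", "1.0"), ("3.14", "314"), ("hello!", "hello")]
for gold, pred in pairs:
    same = numbers_then_punc(gold) == numbers_then_punc(pred)
    print(f"{gold!r} vs {pred!r}: {'match' if same else 'no match'}")
```

With this ordering, '1.0' and '1' normalize to the same string, while '10' vs '1.0' and '3.14' vs '314' no longer collide. The plain word token is treated the same as before, since float() rejects it and only the punctuation step applies.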

Tests

tests/test_unit_base_metrics.py::test_quasi_exact_match_numbers checks that 1.0 == 1 matches, while
10 != 1.0 and 3.14 != 314 don't. It fails before the change and passes after; the existing
test_quasi_exact_match (sentence text) still passes.

In helm_normalizer, remove_punc ran before homogeneize_numbers, so it stripped
the decimal point before the float() cast: '1.0' -> '10' -> '10.0'. This made
distinct numbers collide ('10' and '1.0' both -> '10.0'; '3.14' and '314' both
-> '314.0') and broke the function's own documented goal of matching '1.0' with '1', causing
false exact-match scores on numeric answers (QuAC/DROP-style quasi_exact_match).

Run homogeneize_numbers before remove_punc so float() sees the original token.