Merged
58 changes: 45 additions & 13 deletions dingo/model/llm/text_quality/llm_text_quality_v5.py
@@ -30,39 +30,67 @@ class LLMTextQualityV5(BaseTextQuality):
**Impact**: Broken structures prevent models from learning correct formatting patterns.

**Check for**:
- **Error_Formula**: Mathematical expressions with **unmatched delimiters** or **unclosed environments**
- **Error_Formula**: Mathematical content with **broken syntax** OR **systematically stripped symbols/formulas**

Two failure modes:

**(A) Broken LaTeX syntax** — delimiters or environments are present but malformed:
- Delimiters unmatched: $ without closing $ (LaTeX context, not dollar signs)
- Environments unclosed: \\begin{{align}} without \\end{{align}}
- Syntax broken: \\frac{{a}}{{b missing closing }}
Contributor

medium

The example provided for "Syntax broken" (\\frac{{a}}{{b missing closing }}) results in a syntactically valid LaTeX expression (\frac{a}{b missing closing }) when the template is formatted. To effectively demonstrate broken syntax to the LLM, the closing brace should be omitted so that the expression remains unclosed.

Suggested change
- Syntax broken: \\frac{{a}}{{b missing closing }}
- Syntax broken: \\frac{{a}}{{b missing closing

- HTML tags unclosed: <sub>text without </sub>
Contributor

medium

The example for "HTML tags unclosed" (<sub>text without </sub>) actually shows a correctly closed tag. To properly demonstrate an unclosed tag to the model, the closing </sub> should be removed.

Suggested change
- HTML tags unclosed: <sub>text without </sub>
- HTML tags unclosed: <sub>text without

- Impact: Prevents >50% of mainstream parsers from rendering

**(B) Stripped mathematical content** — symbols/formulas systematically removed during extraction:
- Orphan hyphens from stripped Greek letters: "κ-solutions" → "-solutions", "ε-net" → "-net"
- Empty positions after connective words: "thus ;" or "the interval ;" where a formula was removed
- Sentences referencing variables/expressions that are absent: "a small number" (number missing), "we have ." (equation missing)
- Systematic loss: multiple occurrences throughout the text, not just one or two typos
- Impact: Mathematical text becomes incoherent; models learn broken academic writing patterns

Example (BAD — stripped symbols):
"Let be a -solution to the Ricci flow which is -noncollapsed. Ancient, in the sense that t ranges on the interval ; Bounded curvature, thus ;"
(Greek letters κ stripped from "κ-solution" and "κ-noncollapsed"; interval expression and inequality after "thus" removed entirely)
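
The stripped-symbol signals above are mechanical enough to sketch as a screening heuristic. A minimal illustration (the helper name and regexes are hypothetical, not part of this checker): one or two hits may be typos, while many hits across a text point to systematic loss.

```python
import re

def stripped_math_signals(text: str) -> list[str]:
    """Collect fragments suggesting math symbols were stripped at extraction."""
    patterns = [
        r"(?:^|\s)-[a-z]+",                # orphan hyphen: "-solutions" from "κ-solutions"
        r"\b(?:thus|interval|hence) +;",   # connective word followed by a bare ";"
        r"\bwe have *\.",                  # sentence ends where an equation should be
    ]
    hits: list[str] = []
    for pat in patterns:
        hits.extend(m.group(0).strip() for m in re.finditer(pat, text))
    return hits
```

On the BAD example above this returns hits such as `-solution`, `interval ;`, and `thus ;`, while clean text like `"Let $g$ be a $\\kappa$-solution"` yields none.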

⚠️ **Normal patterns (DO NOT flag)**:
- Mixing inline ($...$) and display ($$...$$) formulas
- Using \\begin{{align}}...\\end{{align}} within $$...$$
- Line breaks with \\\\ in alignment environments
- HTML tags: <sub>x</sub>, <sup>2</sup> for subscripts/superscripts
- Mixing LaTeX and HTML in web-extracted content

✅ **Only flag when**:
- Delimiters unmatched: $ without closing $ (LaTeX context, not dollar signs)
- Environments unclosed: \\begin{{align}} without \\end{{align}}
- Syntax broken: \\frac{{a}}{{b missing closing }}
- HTML tags unclosed: <sub>text without </sub>
- Plain-text math without any LaTeX (e.g., "a^2 + b^2 = c^2" without $ delimiters) — this is fine as long as the expressions are actually present

⚠️ **Important**: Distinguish LaTeX $ from dollar signs ($100)
- Dollar sign: "$100", "$5.99" (followed by numbers) → NOT LaTeX
- LaTeX delimiter: "$x$", "$\\alpha$" (contains math symbols) → IS LaTeX
- Example: "The price is $100 and equation $x=y$ costs $50" has 4 dollar symbols but only 2 are LaTeX delimiters (and they match)

- Example (BAD): "$x^2 + y^2 is broken here $$a = b$$$"
- Example (BAD — broken delimiters): "$x^2 + y^2 is broken here $$a = b$$$"
(First LaTeX $ never closes, extra $ at end)
- Example (GOOD): "The item costs $100 and satisfies $x^2 + y^2 = z^2$ where price is $50"
(Dollar signs for money + proper LaTeX pair)
- Impact: Only flag errors that prevent >50% of mainstream parsers (pdflatex, MathJax, KaTeX, Pandoc, Jupyter) from rendering
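
The currency-vs-delimiter rule can be sketched as a small scanner. This is a rough illustration under the stated heuristic (drop `$` followed by digits, then pair up what remains), not the checker's actual implementation:

```python
import re

def dollars_balanced(text: str) -> bool:
    """Return True when $ / $$ math delimiters pair up correctly."""
    cleaned = re.sub(r"\$\d[\d,.]*", "", text)  # drop currency amounts like $100, $5.99
    in_inline = in_display = False
    for tok in re.findall(r"\$\$|\$", cleaned):
        if tok == "$$":
            if in_inline:
                return False  # display math opened inside unclosed inline math
            in_display = not in_display
        else:
            if in_display:
                return False  # stray single $ inside display math
            in_inline = not in_inline
    return not (in_inline or in_display)
```

It rejects the BAD example (`"$x^2 + y^2 is broken here $$a = b$$$"`) and accepts the money-plus-LaTeX GOOD example.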

- **Error_Table**: Table structures that are malformed or unreadable
- Example (BAD): Misaligned columns, missing headers, or garbled HTML tags
- Impact: Models cannot learn proper table representation
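
One concrete, checkable form of table malformation is rows that disagree on column count. A crude screen for pipe-delimited Markdown tables might look like this (illustrative only; real tables also break in ways this misses, such as garbled HTML or a missing header separator):

```python
def malformed_markdown_table(lines: list[str]) -> bool:
    """Flag a pipe-delimited Markdown table whose rows disagree on column count."""
    rows = [ln.strip() for ln in lines if ln.strip().startswith("|")]
    if len(rows) < 2:
        return False  # not enough rows to call it a table
    cell_counts = {row.strip("|").count("|") for row in rows}
    return len(cell_counts) > 1  # rows disagree on number of columns
```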

- **Error_Code**: Code blocks with formatting corruption
- Example (BAD): Line numbers mixed with code, broken syntax highlighting markers
- Impact: Teaches incorrect code structure
**Common corruption patterns**:
- Missing code fence (` ``` `): code appears as plain text without language block
- Lost indentation: Python/YAML code with all indentation stripped (flat lines)
- Broken identifiers: spaces injected into tokens, e.g. `sys .argv`, `pts .append`, `i[ 0]`
- Line numbers mixed with code, broken syntax highlighting markers
- Keywords wrapped in inline backticks instead of a fenced block, e.g. `` `import` sys ``

Example (BAD — indentation and identifiers destroyed):
```
`import` sys
pts = []
for i in range( 1,len(sys .argv), 2):
pts .append([int(sys .argv[i]), int(sys .argv[i +1])])
```
Correct version would have a code fence, proper indentation, and no spaces inside `sys.argv`.
Contributor

medium

The explanation states that a "Correct version would have a code fence", but the "BAD" example provided in the preceding lines (85-90) already includes a code fence. This inconsistency might confuse the LLM. The explanation should be updated to focus on the issues actually demonstrated in the example, such as the lack of indentation and the corrupted identifiers.

Suggested change
Correct version would have a code fence, proper indentation, and no spaces inside `sys.argv`.
Correct version would have proper indentation and no spaces inside identifiers.


- Impact: Teaches incorrect code syntax, broken tokenization patterns, and wrong indentation conventions
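
The injected-space patterns above lend themselves to a regex screen. A minimal sketch (hypothetical helper, not the checker's real code; extraction tools break identifiers in more ways than these two):

```python
import re

def identifier_corruption(code: str) -> list[str]:
    """Find spaces injected into tokens, e.g. 'sys .argv' or 'i[ 0]'."""
    patterns = [
        r"\w+ +\.\w+",   # space before attribute access: "sys .argv"
        r"\w+\[ +\w+",   # space after an opening bracket: "i[ 0"
    ]
    hits: list[str] = []
    for pat in patterns:
        hits.extend(m.group(0) for m in re.finditer(pat, code))
    return hits
```

Run against the BAD example above, this surfaces `pts .append` and `sys .argv`, while clean code such as `pts.append(int(sys.argv[1]))` produces no hits.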

**Key Question**: "Can the model learn proper formatting from this structure?"

@@ -160,10 +188,14 @@ class LLMTextQualityV5(BaseTextQuality):
Input: "The eigenstate $\\psi_n$ where <sub>n</sub> is quantum number and energy E<sup>2</sup> = m<sup>2</sup>c<sup>4</sup>"
Output: {{"score": 1, "type": "Good", "name": "None", "reason": "Normal mix of LaTeX and HTML tags from web content"}}

**Example 2 (Bad - Completeness)**:
**Example 2 (Bad - Completeness, broken delimiters)**:
Input: "The formula $x^2 + y^2 is broken here $$a = b$$$"
Output: {"score": 0, "type": "Completeness", "name": "Error_Formula", "reason": "Unmatched delimiters: first $ never closes, extra $ at end"}

**Example 2.5 (Bad - Completeness, stripped math)**:
Input: "Definition 1.(-solutions) A -solution is a Ricci flow which is -noncollapsed at every scale. Ancient, in the sense that t ranges on the interval ; Bounded curvature, thus ;"
Output: {{"score": 0, "type": "Completeness", "name": "Error_Formula", "reason": "Mathematical symbols systematically stripped: Greek letters removed ('-solutions' instead of 'κ-solutions'), formulas missing after 'the interval' and 'thus'"}}

**Example 3 (Bad - Effectiveness)**:
Input: "Theappleisredandtasty�withsomegarbledtext□□"
Output: {"score": 0, "type": "Effectiveness", "name": "Error_Garbled_Characters", "reason": "Contains encoding corruption (�, □) and missing spaces (>1% of text)"}
4 changes: 3 additions & 1 deletion docs/metrics.md
@@ -24,8 +24,10 @@ This document provides comprehensive information about all quality metrics used
| `LLMMathCompare` | LLMMathCompare | Compares the effectiveness of two tools in extracting mathematical formulas from HTML to Markdown format by evaluatin... | Internal Implementation | N/A | N/A |
| `LLMSecurityPolitics` | LLMSecurityPolitics | Evaluates whether the text contains politics-related content | Internal Implementation | N/A | N/A |
| `LLMTableCompare` | LLMTableCompare | Compares the effectiveness of two tools in extracting tables from HTML to Markdown format by evaluating recognition r... | Internal Implementation | N/A | N/A |
| `LLMTextEquation` | LLMTextEquation | Impact-driven text quality evaluation for LLM pretraining, focusing on structural completeness, readability, diversit... | [WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/abs/2501.14506) (Yu et al., 2025) | [📊 See Results](eval/prompt/redpajama_data_evaluated_by_prompt.md) | [📝 View Example](../examples/llm_and_rule/llm_local.py) |
| `LLMTextQualityV4` | LLMTextQualityV4 | Enhanced text quality evaluation covering completeness (formulas, tables, code), effectiveness (garbled text, spacing... | [WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/abs/2501.14506) (Yu et al., 2025) | [📊 See Results](eval/prompt/redpajama_data_evaluated_by_prompt.md) | N/A |
| `LLMTextQualityV5` | LLMTextQualityV5 | Impact-driven text quality evaluation for LLM pretraining, focusing on structural completeness, readability, diversit... | [WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/abs/2501.14506) (Yu et al., 2025) | [📊 See Results](eval/prompt/redpajama_data_evaluated_by_prompt.md) | [📝 View Example](../examples/llm_and_rule/llm_local.py) |
| `LLMTextTable` | LLMTextTable | Impact-driven text quality evaluation for LLM pretraining, focusing on structural completeness, readability, diversit... | [WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/abs/2501.14506) (Yu et al., 2025) | [📊 See Results](eval/prompt/redpajama_data_evaluated_by_prompt.md) | [📝 View Example](../examples/llm_and_rule/llm_local.py) |

### SFT Data Assessment Metrics

@@ -58,7 +60,7 @@ This document provides comprehensive information about all quality metrics used
| Type | Metric | Description | Paper Source | Evaluation Results | Examples |
|------|--------|-------------|--------------|-------------------|----------|
| `QUALITY_BAD_COMPLETENESS` | RuleLineEndWithEllipsis, RuleLineEndWithTerminal, RuleSentenceNumber, RuleWordNumber | Checks whether the ratio of lines ending with ellipsis is below threshold; Checks whether the ratio of lines ending w... | [RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) | [📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md) | N/A |
| `QUALITY_BAD_EFFECTIVENESS` | RuleAbnormalChar, RuleAbnormalHtml, RuleAlphaWords, RuleAudioDataFormat, RuleCharNumber, RuleColonEnd, RuleContentNull, RuleContentShort, RuleContentShortMultiLan, RuleEnterAndSpace, RuleEnterMore, RuleEnterRatioMore, RuleHtmlEntity, RuleHtmlTag, RuleInvisibleChar, RuleImageDataFormat, RuleLatexSpecialChar, RuleLineJavascriptCount, RuleLoremIpsum, RuleMeanWordLength, RuleNlpDataFormat, RuleSftDataFormat, RuleSpaceMore, RuleSpecialCharacter, RuleStopWord, RuleSymbolWordRatio, RuleVedioDataFormat, RuleOnlyUrl, RuleDoi, RuleIsbn | Detects garbled text and anti-crawling characters by combining special character and invisible character detection; D... | [RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) | [📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md) | N/A |
| `QUALITY_BAD_EFFECTIVENESS` | RuleDoi, RuleIsbn, RuleAbnormalChar, RuleAbnormalHtml, RuleAlphaWords, RuleAudioDataFormat, RuleCharNumber, RuleColonEnd, RuleContentNull, RuleContentShort, RuleContentShortMultiLan, RuleEnterAndSpace, RuleEnterMore, RuleEnterRatioMore, RuleHtmlEntity, RuleHtmlTag, RuleInvisibleChar, RuleImageDataFormat, RuleLatexSpecialChar, RuleLineJavascriptCount, RuleLoremIpsum, RuleMeanWordLength, RuleNlpDataFormat, RuleSftDataFormat, RuleSpaceMore, RuleSpecialCharacter, RuleStopWord, RuleSymbolWordRatio, RuleVedioDataFormat, RuleOnlyUrl | Check whether the string is in the correct format of the doi; Check whether the string is in the correct format of th... | Internal Implementation | N/A | N/A |
| `QUALITY_BAD_FLUENCY` | RuleAbnormalNumber, RuleCharSplit, RuleNoPunc, RuleWordSplit, RuleWordStuck | Checks PDF content for abnormal book page or index numbers that disrupt text flow; Checks PDF content for abnormal ch... | [RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) | [📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md) | N/A |
| `QUALITY_BAD_RELEVANCE` | RuleHeadWordAr, RuleHeadWordCs, RuleHeadWordHu, RuleHeadWordKo, RuleHeadWordRu, RuleHeadWordSr, RuleHeadWordTh, RuleHeadWordVi, RulePatternSearch, RuleWatermark | Checks whether Arabic content contains irrelevant tail source information; Checks whether Czech content contains irre... | [RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) | [📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md) | N/A |
| `QUALITY_BAD_SECURITY` | RuleIDCard, RuleUnsafeWords, RulePIIDetection | Checks whether content contains ID card information; Checks whether content contains unsafe words; Detects Personal I... | [RedPajama: an Open Dataset for Training Large Language Models](https://github.com/togethercomputer/RedPajama-Data) (Together Computer, 2023) | [📊 See Results](eval/rule/slimpajama_data_evaluated_by_rule.md) | N/A |