feat: add LLMHtmlExtractCompareV3 metric#384

Merged
e06084 merged 4 commits into MigoXLab:dev from e06084:dev
Apr 3, 2026

Conversation

@e06084
Collaborator

@e06084 e06084 commented Apr 3, 2026

No description provided.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces the LLMHtmlExtractCompareV3 evaluator for comparing HTML extraction quality and adds a --count flag to the dingo info CLI. Feedback focuses on making LLM response parsing more robust, specifically around thinking blocks and JSON extraction, and on refining the status-reporting and output-formatting logic in the example scripts.

Comment on lines +181 to +188

if response.startswith("<think>"):
    think_content = re.search(
        r"<think>(.*?)</think>", response, flags=re.DOTALL
    )
    if think_content:
        response_think = think_content.group(1).strip()
        response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
        response = response.strip()
Contributor

Severity: medium

The logic for extracting the <think> block is fragile because it relies on response.startswith("<think>"). If the LLM output contains leading whitespace or a newline (which is common), this check will fail, and the thinking content will not be extracted. Furthermore, the <think> block will remain in the response, likely causing the subsequent JSON parsing to fail. It is better to strip the response first and use a regex to find and remove the thinking block regardless of its position.

Suggested change
-if response.startswith("<think>"):
-    think_content = re.search(
-        r"<think>(.*?)</think>", response, flags=re.DOTALL
-    )
-    if think_content:
-        response_think = think_content.group(1).strip()
-        response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
-        response = response.strip()
+response = response.strip()
+response_think = ""
+think_match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
+if think_match:
+    response_think = think_match.group(1).strip()
+    response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
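The suggested fix can be sanity-checked in isolation with a small standalone sketch (`split_think` is a hypothetical helper name, not part of the PR):

```python
import re

def split_think(response: str):
    """Separate an optional <think>...</think> block from the answer body."""
    response = response.strip()
    response_think = ""
    think_match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if think_match:
        response_think = think_match.group(1).strip()
        response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return response_think, response

# A leading newline before <think> no longer defeats extraction:
think, body = split_think('\n<think>compare the two tables</think>\n{"score": 1}')
print(think)  # compare the two tables
print(body)   # {"score": 1}
```

With the original `startswith` check, the same whitespace-prefixed input would leave the `<think>` block in place and break the downstream `json.loads` call.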

Comment on lines +190 to +196

if response.startswith("```json"):
    response = response[7:]
if response.startswith("```"):
    response = response[3:]
if response.endswith("```"):
    response = response[:-3]
response = response.strip()
Contributor

Severity: medium

The sequential if statements for stripping markdown code blocks are slightly redundant. If the response starts with ```json, it will be stripped, and then the code will check if it starts with ``` again. Using elif is cleaner and more intentional. Also, ensure the stripping logic is robust against variations in LLM output.

Suggested change
-if response.startswith("```json"):
-    response = response[7:]
-if response.startswith("```"):
-    response = response[3:]
-if response.endswith("```"):
-    response = response[:-3]
-response = response.strip()
+if response.startswith("```json"):
+    response = response[7:]
+elif response.startswith("```"):
+    response = response[3:]
+if response.endswith("```"):
+    response = response[:-3]
+response = response.strip()
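The elif version can be verified standalone (`strip_code_fence` is an illustrative wrapper around the suggested lines, not a name from the PR):

```python
def strip_code_fence(response: str) -> str:
    """Remove a surrounding markdown code fence, if present."""
    response = response.strip()
    if response.startswith("```json"):
        response = response[7:]
    elif response.startswith("```"):
        response = response[3:]
    if response.endswith("```"):
        response = response[:-3]
    return response.strip()

print(strip_code_fence('```json\n{"score": 2}\n```'))  # {"score": 2}
print(strip_code_fence('{"score": 0}'))                # unfenced input passes through
```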

try:
    response_json = json.loads(response)
    if response_think:
        response_json["reason"] = response_json.get("reason", "") + "\n" + response_think
Contributor

Severity: medium

This line adds a newline to the reason field even if response_think is empty. It's better to only append the thinking content if it exists and handle the formatting more cleanly.

Suggested change
-response_json["reason"] = response_json.get("reason", "") + "\n" + response_think
+if response_think:
+    reason = response_json.get("reason", "")
+    response_json["reason"] = f"{reason}\n{response_think}".strip()

except json.JSONDecodeError:
    raise ConvertJsonError(f"Convert to JSON format failed: {response}")

response_model = ResponseScoreTypeNameReason(**response_json)
Contributor

Severity: medium

Instantiating the Pydantic model ResponseScoreTypeNameReason directly from response_json can raise a ValidationError if the LLM output is malformed or missing required fields. This should be wrapped in a try-except block to provide a more descriptive error or handle the failure gracefully.

Suggested change
-response_model = ResponseScoreTypeNameReason(**response_json)
+try:
+    response_model = ResponseScoreTypeNameReason(**response_json)
+except Exception as e:
+    raise ConvertJsonError(f"Invalid response structure: {e}")
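The pattern can be illustrated with a plain dataclass standing in for the Pydantic model. Everything here is a local sketch: the field names are guessed from the class name `ResponseScoreTypeNameReason`, and `ConvertJsonError` is re-declared so the snippet runs standalone.

```python
import json
from dataclasses import dataclass

class ConvertJsonError(Exception):
    """Local stand-in for dingo's ConvertJsonError."""

@dataclass
class ResponseStandIn:
    # Field names are assumptions based on the real model's name.
    score: int
    type: str
    name: str
    reason: str

def parse_response(raw: str) -> ResponseStandIn:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ConvertJsonError(f"Convert to JSON format failed: {raw}")
    try:
        # A Pydantic model would raise ValidationError here; a dataclass
        # raises TypeError on missing or unexpected fields.
        return ResponseStandIn(**data)
    except TypeError as e:
        raise ConvertJsonError(f"Invalid response structure: {e}")

ok = parse_response('{"score": 1, "type": "A_BETTER", "name": "demo", "reason": "ok"}')
print(ok.score)  # 1
```

Either way, the caller sees one exception type for both failure modes (unparseable JSON and a parseable-but-malformed payload), which is what the review asks for.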

else:
    tmp_type = "EXTRACTION_EQUAL"

result.status = response_model.score != 1
Contributor

Severity: medium

The logic result.status = response_model.score != 1 treats score=0 (Equal) as status=True. In the context of the provided examples (e.g., html_extract_compare_v3_example_dataset.py), status=True is used to identify samples where Tool B is better. If the tools are equal, it is generally not considered a 'finding' or 'bad' sample. Please verify if score=0 should indeed trigger status=True.

Suggested change
-result.status = response_model.score != 1
+result.status = response_model.score == 2
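The two readings of status can be contrasted in a tiny sketch. The 0/1/2 score semantics (equal / tool A better / tool B better) are inferred from the review comment, not confirmed by the diff:

```python
# Assumed semantics: 0 = extraction equal, 1 = tool A better, 2 = tool B better.
def status_current(score: int) -> bool:
    return score != 1       # flags both "equal" and "B better" as findings

def status_suggested(score: int) -> bool:
    return score == 2       # flags only "B better" as a finding

print([status_current(s) for s in (0, 1, 2)])    # [True, False, True]
print([status_suggested(s) for s in (0, 1, 2)])  # [False, False, True]
```

The difference is exactly the score=0 case, which is the behavior the reviewer asks the author to verify.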

print(f"\n推理过程:\n{result.eval_details.reason[0]}")
print(f"是否存在问题: {result.status}")
print(f"评估结果类型: {result.label}")
print(f"\n推理过程:\n{result.reason}")
Contributor

Severity: medium

In EvalDetail, the reason attribute is a list of strings. Printing result.reason directly will output the list representation (e.g., ['...']). For a human-readable example, it is better to print the first element of the list.

Suggested change
-print(f"\n推理过程:\n{result.reason}")
+print(f"\n推理过程:\n{result.reason[0]}")

# print(f"判断名称: {result.name}")
print(f"是否存在问题: {result.status}")
print(f"评估结果类型: {result.label}")
print(f"\n推理过程:\n{result.reason}")
Contributor

Severity: medium

Similar to the V2 example, result.reason is a list. Printing it directly results in a list representation which is less readable for users. Accessing the first element is preferred.

Suggested change
-print(f"\n推理过程:\n{result.reason}")
+print(f"\n推理过程:\n{result.reason[0]}")
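The readability difference is easy to see with a stand-in list (the reason text is invented for the demo):

```python
reason = ["Tool B preserved the table structure; Tool A dropped two rows."]
print(reason)     # ['Tool B preserved the table structure; Tool A dropped two rows.']
print(reason[0])  # Tool B preserved the table structure; Tool A dropped two rows.
```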

@e06084 e06084 merged commit 65fc0fb into MigoXLab:dev Apr 3, 2026
2 checks passed