feat: add LLMHtmlExtractCompareV3 metric (#384)
Conversation
Code Review
This pull request introduces the `LLMHtmlExtractCompareV3` evaluator for comparing HTML extraction quality and adds a `--count` flag to the `dingo info` CLI. Feedback focuses on improving the robustness of LLM response parsing (specifically for thinking blocks and JSON extraction) and refining the logic for status reporting and output formatting in example scripts.
```python
if response.startswith("<think>"):
    think_content = re.search(
        r"<think>(.*?)</think>", response, flags=re.DOTALL
    )
    if think_content:
        response_think = think_content.group(1).strip()
    response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    response = response.strip()
```
The logic for extracting the `<think>` block is fragile because it relies on `response.startswith("<think>")`. If the LLM output contains leading whitespace or a newline (which is common), this check will fail and the thinking content will not be extracted. Furthermore, the `<think>` block will remain in the response, likely causing the subsequent JSON parsing to fail. It is better to strip the response first and use a regex to find and remove the thinking block regardless of its position.
Suggested change:

```diff
-if response.startswith("<think>"):
-    think_content = re.search(
-        r"<think>(.*?)</think>", response, flags=re.DOTALL
-    )
-    if think_content:
-        response_think = think_content.group(1).strip()
-    response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
-    response = response.strip()
+response = response.strip()
+response_think = ""
+think_match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
+if think_match:
+    response_think = think_match.group(1).strip()
+response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
```
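As a self-contained sketch of why this matters (the `extract_think` helper and the sample response are illustrative, not from the PR), a single leading newline defeats the `startswith` check, while the strip-first version still works:

```python
import re

def extract_think(response: str) -> tuple[str, str]:
    """Strip first, then pull out the <think> block wherever it appears."""
    response = response.strip()
    think = ""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match:
        think = match.group(1).strip()
    # Remove the block so the remainder is plain JSON.
    response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return think, response

# A leading newline defeats the original startswith("<think>") check...
raw = '\n<think>comparing extractions</think>\n{"score": 2}'
assert not raw.startswith("<think>")

# ...but the strip-first version still extracts and removes the block.
think, body = extract_think(raw)
assert think == "comparing extractions"
assert body == '{"score": 2}'
```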
```python
if response.startswith("```json"):
    response = response[7:]
if response.startswith("```"):
    response = response[3:]
if response.endswith("```"):
    response = response[:-3]
response = response.strip()
```
The sequential `if` statements for stripping markdown code fences are slightly redundant: if the response starts with `` ```json ``, it is stripped, and then the code checks whether it starts with `` ``` `` again. Using `elif` is cleaner and more intentional. Also, ensure the stripping logic is robust against variations in LLM output.
Suggested change:

```diff
-if response.startswith("```json"):
-    response = response[7:]
-if response.startswith("```"):
-    response = response[3:]
-if response.endswith("```"):
-    response = response[:-3]
-response = response.strip()
+if response.startswith("```json"):
+    response = response[7:]
+elif response.startswith("```"):
+    response = response[3:]
+if response.endswith("```"):
+    response = response[:-3]
+response = response.strip()
```
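A minimal sketch of the `elif` variant (the `strip_fences` name is illustrative), checked against the common fence variations an LLM tends to emit:

```python
def strip_fences(response: str) -> str:
    """Remove a surrounding markdown code fence, if any."""
    response = response.strip()
    if response.startswith("```json"):
        response = response[7:]
    elif response.startswith("```"):
        # Only reached when the fence is bare, never after the ```json branch.
        response = response[3:]
    if response.endswith("```"):
        response = response[:-3]
    return response.strip()

assert strip_fences('```json\n{"score": 1}\n```') == '{"score": 1}'
assert strip_fences('```\n{"score": 1}\n```') == '{"score": 1}'
assert strip_fences('{"score": 1}') == '{"score": 1}'
```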
```python
try:
    response_json = json.loads(response)
    if response_think:
        response_json["reason"] = response_json.get("reason", "") + "\n" + response_think
```
This line adds a newline to the reason field even if response_think is empty. It's better to only append the thinking content if it exists and handle the formatting more cleanly.
Suggested change:

```diff
-response_json["reason"] = response_json.get("reason", "") + "\n" + response_think
+if response_think:
+    reason = response_json.get("reason", "")
+    response_json["reason"] = f"{reason}\n{response_think}".strip()
```
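A small illustration (the `merge_reason` helper is hypothetical) of why the trailing `.strip()` matters: it also covers the case where the original `reason` is empty, so no stray leading newline survives:

```python
def merge_reason(reason: str, think: str) -> str:
    # Only join when there is thinking content; strip() also handles
    # the case where the original reason is empty.
    if not think:
        return reason
    return f"{reason}\n{think}".strip()

assert merge_reason("scores differ", "tool B kept tables") == "scores differ\ntool B kept tables"
assert merge_reason("", "tool B kept tables") == "tool B kept tables"
assert merge_reason("scores differ", "") == "scores differ"
```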
```python
except json.JSONDecodeError:
    raise ConvertJsonError(f"Convert to JSON format failed: {response}")

# ...

response_model = ResponseScoreTypeNameReason(**response_json)
```
Instantiating the Pydantic model ResponseScoreTypeNameReason directly from response_json can raise a ValidationError if the LLM output is malformed or missing required fields. This should be wrapped in a try-except block to provide a more descriptive error or handle the failure gracefully.
Suggested change:

```diff
-response_model = ResponseScoreTypeNameReason(**response_json)
+try:
+    response_model = ResponseScoreTypeNameReason(**response_json)
+except Exception as e:
+    raise ConvertJsonError(f"Invalid response structure: {e}")
```
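To keep this sketch self-contained it uses a hand-rolled check instead of Pydantic, but the shape is the same: wrap model construction in `try/except` and re-raise as the project's error type. In the real code the `try` body would be `ResponseScoreTypeNameReason(**response_json)` and the except clause could catch `pydantic.ValidationError` specifically. The field names below are only assumed from the class name:

```python
class ConvertJsonError(Exception):
    """Stand-in for the project's error type (name taken from the diff)."""

def build_response_model(response_json: dict) -> dict:
    # Minimal stand-in for ResponseScoreTypeNameReason validation.
    required = ("score", "type", "name", "reason")
    try:
        missing = [k for k in required if k not in response_json]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        return {k: response_json[k] for k in required}
    except Exception as e:
        # Translate any validation failure into the domain error.
        raise ConvertJsonError(f"Invalid response structure: {e}")

model = build_response_model(
    {"score": 2, "type": "EXTRACTION_B_BETTER", "name": "compare", "reason": "ok"}
)
assert model["score"] == 2

try:
    build_response_model({"score": 2})
except ConvertJsonError:
    pass
else:
    raise AssertionError("expected ConvertJsonError")
```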
```python
else:
    tmp_type = "EXTRACTION_EQUAL"

# ...

result.status = response_model.score != 1
```
The logic result.status = response_model.score != 1 treats score=0 (Equal) as status=True. In the context of the provided examples (e.g., html_extract_compare_v3_example_dataset.py), status=True is used to identify samples where Tool B is better. If the tools are equal, it is generally not considered a 'finding' or 'bad' sample. Please verify if score=0 should indeed trigger status=True.
Suggested change:

```diff
-result.status = response_model.score != 1
+result.status = response_model.score == 2
```
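Assuming the score semantics implied by the review (0 = tools equal, 2 = tool B better, 1 presumably meaning no finding), the two checks differ exactly on the equal case; `status_old`/`status_new` are illustrative names, not from the PR:

```python
def status_old(score: int) -> bool:
    # Original check: anything that is not 1 is flagged.
    return score != 1

def status_new(score: int) -> bool:
    # Suggested check: only "tool B better" is flagged.
    return score == 2

# The original check flags "equal" samples as problems; the fix does not.
assert status_old(0) is True
assert status_new(0) is False
# Both agree on the "tool B better" case.
assert status_old(2) is True
assert status_new(2) is True
```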
```python
print(f"\n推理过程:\n{result.eval_details.reason[0]}")
print(f"是否存在问题: {result.status}")
print(f"评估结果类型: {result.label}")
print(f"\n推理过程:\n{result.reason}")
```
In EvalDetail, the reason attribute is a list of strings. Printing result.reason directly will output the list representation (e.g., ['...']). For a human-readable example, it is better to print the first element of the list.
Suggested change:

```diff
-print(f"\n推理过程:\n{result.reason}")
+print(f"\n推理过程:\n{result.reason[0]}")
```
```python
# print(f"判断名称: {result.name}")
print(f"是否存在问题: {result.status}")
print(f"评估结果类型: {result.label}")
print(f"\n推理过程:\n{result.reason}")
```