feat: add LLMHtmlExtractCompareV3 metric (#384)
Conversation
Code Review
This pull request introduces the `LLMHtmlExtractCompareV3` evaluator for comparing HTML extraction quality and adds a `--count` flag to the `dingo info` CLI. Feedback focuses on improving the robustness of LLM response parsing (specifically for thinking blocks and JSON extraction) and refining the logic for status reporting and output formatting in example scripts.
```python
if response.startswith("<think>"):
    think_content = re.search(
        r"<think>(.*?)</think>", response, flags=re.DOTALL
    )
    if think_content:
        response_think = think_content.group(1).strip()
    response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    response = response.strip()
```
The logic for extracting the `<think>` block is fragile because it relies on `response.startswith("<think>")`. If the LLM output contains leading whitespace or a newline (which is common), this check will fail and the thinking content will not be extracted. Furthermore, the `<think>` block will remain in the response, likely causing the subsequent JSON parsing to fail. It is better to strip the response first and use a regex to find and remove the thinking block regardless of its position.
Suggested change:

```diff
-if response.startswith("<think>"):
-    think_content = re.search(
-        r"<think>(.*?)</think>", response, flags=re.DOTALL
-    )
-    if think_content:
-        response_think = think_content.group(1).strip()
-    response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
-    response = response.strip()
+response = response.strip()
+response_think = ""
+think_match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
+if think_match:
+    response_think = think_match.group(1).strip()
+response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
```
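As a self-contained sketch of why this matters (the `extract_think` helper and the sample response are illustrative, not from the PR), a single leading newline defeats the `startswith` check, while the strip-first version still works:

```python
import re

def extract_think(response: str) -> tuple[str, str]:
    """Strip first, then pull out the <think> block wherever it appears."""
    response = response.strip()
    think = ""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match:
        think = match.group(1).strip()
    # Remove the block so the remainder is plain JSON.
    response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return think, response

# A leading newline defeats the original startswith("<think>") check...
raw = '\n<think>comparing extractions</think>\n{"score": 2}'
assert not raw.startswith("<think>")

# ...but the strip-first version still extracts and removes the block.
think, body = extract_think(raw)
assert think == "comparing extractions"
assert body == '{"score": 2}'
```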
```python
if response.startswith("```json"):
    response = response[7:]
if response.startswith("```"):
    response = response[3:]
if response.endswith("```"):
    response = response[:-3]
response = response.strip()
```
The sequential `if` statements for stripping markdown code fences are slightly redundant: if the response starts with `` ```json ``, it is stripped, and then the code checks whether it starts with `` ``` `` again. Using `elif` is cleaner and more intentional. Also, ensure the stripping logic is robust against variations in LLM output.
Suggested change:

```diff
-if response.startswith("```json"):
-    response = response[7:]
-if response.startswith("```"):
-    response = response[3:]
-if response.endswith("```"):
-    response = response[:-3]
-response = response.strip()
+if response.startswith("```json"):
+    response = response[7:]
+elif response.startswith("```"):
+    response = response[3:]
+if response.endswith("```"):
+    response = response[:-3]
+response = response.strip()
```
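A minimal sketch of the `elif` variant (the `strip_fences` name is illustrative), checked against the common fence variations an LLM tends to emit:

```python
def strip_fences(response: str) -> str:
    """Remove a surrounding markdown code fence, if any."""
    response = response.strip()
    if response.startswith("```json"):
        response = response[7:]
    elif response.startswith("```"):
        # Only reached when the fence is bare, never after the ```json branch.
        response = response[3:]
    if response.endswith("```"):
        response = response[:-3]
    return response.strip()

assert strip_fences('```json\n{"score": 1}\n```') == '{"score": 1}'
assert strip_fences('```\n{"score": 1}\n```') == '{"score": 1}'
assert strip_fences('{"score": 1}') == '{"score": 1}'
```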
```python
try:
    response_json = json.loads(response)
    if response_think:
        response_json["reason"] = response_json.get("reason", "") + "\n" + response_think
```
This line adds a newline to the reason field even if response_think is empty. It's better to only append the thinking content if it exists and handle the formatting more cleanly.
Suggested change:

```diff
-response_json["reason"] = response_json.get("reason", "") + "\n" + response_think
+if response_think:
+    reason = response_json.get("reason", "")
+    response_json["reason"] = f"{reason}\n{response_think}".strip()
```
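A small illustration (the `merge_reason` helper is hypothetical) of why the trailing `.strip()` matters: it also covers the case where the original `reason` is empty, so no stray leading newline survives:

```python
def merge_reason(reason: str, think: str) -> str:
    # Only join when there is thinking content; strip() also handles
    # the case where the original reason is empty.
    if not think:
        return reason
    return f"{reason}\n{think}".strip()

assert merge_reason("scores differ", "tool B kept tables") == "scores differ\ntool B kept tables"
assert merge_reason("", "tool B kept tables") == "tool B kept tables"
assert merge_reason("scores differ", "") == "scores differ"
```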
```python
except json.JSONDecodeError:
    raise ConvertJsonError(f"Convert to JSON format failed: {response}")

# ...

response_model = ResponseScoreTypeNameReason(**response_json)
```
Instantiating the Pydantic model ResponseScoreTypeNameReason directly from response_json can raise a ValidationError if the LLM output is malformed or missing required fields. This should be wrapped in a try-except block to provide a more descriptive error or handle the failure gracefully.
Suggested change:

```diff
-response_model = ResponseScoreTypeNameReason(**response_json)
+try:
+    response_model = ResponseScoreTypeNameReason(**response_json)
+except Exception as e:
+    raise ConvertJsonError(f"Invalid response structure: {e}")
```
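To keep this sketch self-contained it uses a hand-rolled check instead of Pydantic, but the shape is the same: wrap model construction in `try/except` and re-raise as the project's error type. In the real code the `try` body would be `ResponseScoreTypeNameReason(**response_json)` and the except clause could catch `pydantic.ValidationError` specifically. The field names below are only assumed from the class name:

```python
class ConvertJsonError(Exception):
    """Stand-in for the project's error type (name taken from the diff)."""

def build_response_model(response_json: dict) -> dict:
    # Minimal stand-in for ResponseScoreTypeNameReason validation.
    required = ("score", "type", "name", "reason")
    try:
        missing = [k for k in required if k not in response_json]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        return {k: response_json[k] for k in required}
    except Exception as e:
        # Translate any validation failure into the domain error.
        raise ConvertJsonError(f"Invalid response structure: {e}")

model = build_response_model(
    {"score": 2, "type": "EXTRACTION_B_BETTER", "name": "compare", "reason": "ok"}
)
assert model["score"] == 2

try:
    build_response_model({"score": 2})
except ConvertJsonError:
    pass
else:
    raise AssertionError("expected ConvertJsonError")
```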
```python
else:
    tmp_type = "EXTRACTION_EQUAL"

# ...

result.status = response_model.score != 1
```
The logic result.status = response_model.score != 1 treats score=0 (Equal) as status=True. In the context of the provided examples (e.g., html_extract_compare_v3_example_dataset.py), status=True is used to identify samples where Tool B is better. If the tools are equal, it is generally not considered a 'finding' or 'bad' sample. Please verify if score=0 should indeed trigger status=True.
Suggested change:

```diff
-result.status = response_model.score != 1
+result.status = response_model.score == 2
```
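Assuming the score semantics implied by the review (0 = tools equal, 2 = tool B better, 1 presumably meaning no finding), the two checks differ exactly on the equal case; `status_old`/`status_new` are illustrative names, not from the PR:

```python
def status_old(score: int) -> bool:
    # Original check: anything that is not 1 is flagged.
    return score != 1

def status_new(score: int) -> bool:
    # Suggested check: only "tool B better" is flagged.
    return score == 2

# The original check flags "equal" samples as problems; the fix does not.
assert status_old(0) is True
assert status_new(0) is False
# Both agree on the "tool B better" case.
assert status_old(2) is True
assert status_new(2) is True
```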
```python
print(f"\n推理过程:\n{result.eval_details.reason[0]}")
print(f"是否存在问题: {result.status}")
print(f"评估结果类型: {result.label}")
print(f"\n推理过程:\n{result.reason}")
```
In EvalDetail, the reason attribute is a list of strings. Printing result.reason directly will output the list representation (e.g., ['...']). For a human-readable example, it is better to print the first element of the list.
Suggested change:

```diff
-print(f"\n推理过程:\n{result.reason}")
+print(f"\n推理过程:\n{result.reason[0]}")
```
```python
# print(f"判断名称: {result.name}")
print(f"是否存在问题: {result.status}")
print(f"评估结果类型: {result.label}")
print(f"\n推理过程:\n{result.reason}")
```