Add SDK-backed OpenHands SFT converter#241
Conversation
There was a problem hiding this comment.
🟡 Acceptable — the converter is well-structured and the tests are solid. Three issues worth cleaning up before merge.
[PR Description] The custom-codereview-guide and AGENTS.md require a design-decision catalog for non-obvious implementation choices. Several decisions in this PR warrant entries: why <finish> is extracted via regex from MessageAction rather than expecting raw ApiAction in the standardized format; how browser_enabled: true interacts with custom tool registration for mind2web/webshop (custom tools get registered then skipped in sdk_tool_specs); and the single-dataset-per-process constraint enforced by the module-level dataset global. Please add a design-decision section to the PR description covering at least these three.
See inline comments for the technical issues.
[RISK ASSESSMENT]
VERDICT: ✅ Worth merging after addressing the dead CLI args and the PR description gap. The module-level global and the cross-dataset tool collision are latent issues that can be tracked separately.
KEY INSIGHT: The module-level dataset = os.getenv("MY_DATASET") bakes the dataset identity into the module at import time, creating a hidden coupling that the CLI design makes safe today but that future in-process callers will hit unexpectedly.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26555004793
This review was generated by an AI agent (OpenHands) on behalf of the user.
…ft-events # Conflicts: # requirements.txt
|
Addressed and resolved all review threads. Changes pushed in
Local validation:
Review threads resolved via GraphQL:
GitHub checks on latest head are passing. |
There was a problem hiding this comment.
🟡 Acceptable — the SDK-backed converter is well-structured and the five-dataset rollout is a solid first cut. Two issues below need attention before merge.
[IMPORTANT ISSUES]
-
agenttuning_webshoplabel→index corruption —coerce_browser_indexextracts the first digit from the element label string.click(element="Buy Now")→browser_click(index=0),click(element="B07T3G7L37")→browser_click(index=7). These indices are meaningless: webshop uses named labels, not DOM indices. The SFT output silently contains incorrect click targets for every non-numeric label. Per AGENTS.md the design-decision catalog must include the specific tradeoff here — the label→index coercion is not documented as a known limitation, only the general statement that "browser-like actions are mapped to the SDK browser tool." -
Test state pollution —
test_openhands_sdk_converter_rejects_conflicting_custom_tool_schemascallsregister_metadata_tools(first_metadata)in-process, which writes to both_REGISTERED_METADATA_TOOL_SPECSand the SDK-global tool registry viaregister_tool. Neither is torn down. Current tests are unaffected by ordering, but this is a latent trap for future in-process tests.
[RISK ASSESSMENT]
- [Overall PR] Risk Assessment: 🟡 MEDIUM — Broad schema version bump across 40+ datasets and new SDK dependency. CI passes, but the webshop label→index silent corruption is a data-quality risk embedded in committed sample artifacts.
VERDICT: Worth merging after (1) documenting the webshop label→index limitation as an explicit design-decision entry in the PR description, and (2) confirming the corrupted indices are acceptable for the training goal (or fixing coerce_browser_index to preserve the original string label instead of extracting a digit).
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26558168612
This review was generated by an AI agent (OpenHands) on behalf of the user.
Summary
agents/openhands_sdk/std_to_sft.py, a replacement OpenHands SDK converter that builds SFT records through SDK primitives instead of hand-rolled OpenAI chat formatting.metadata.jsoncustom_toolsas SDK custom tools, create a real SDKConversation, append standardized actions/observations as SDK events, and serialize via SDK event/message formatting.agenttuning_alfworld,agenttuning_kg,agenttuning_mind2web,agenttuning_os, andagenttuning_webshop.Design Decisions
<finish>in legacyMessageActioncontent is converted with a narrow regex because some standardized samples already encode final answers as assistant messages rather than rawApiAction(function="finish")events. Converting those messages into SDKfinishtool calls preserves the agent trajectory shape expected by SDK LLM formatting without requiring an upstream data migration in this PR.custom_tools, butsdk_tool_specsmaps index-based browser actions such as Mind2Webclick(bid=...)/fill(bid=...)to the SDK browser tool whenbrowser_enabledis true. Label-based custom tools such as WebShopclick(element="Buy Now")stay as dataset custom tools so the original target label is preserved instead of being coerced into an arbitrary DOM index. Same-name custom tools with different schemas now raise if processed in one Python process.process_row, defaulting to the currentMY_DATASETvalue at call time. This keeps the CLI one-dataset-per-invocation behavior while avoiding stale module-import state for tests or future in-process callers.--is_weband--api_envare rejected for this converter because SDK tool configuration comes fromdatasets/$MY_DATASET/metadata.json; silently accepting those legacy OpenHands v0 flags would imply behavior they do not control.Validation
OPENHANDS_SUPPRESS_BANNER=1 uv run --python 3.12 --with-requirements requirements.txt pytest -q tests/test_*.py uv run --python 3.12 --with-requirements requirements.txt pre-commit run --all-filesResults:
535 passed, 16 skippedtest (3.12).Replaces closed PR #233 with the SDK-event generation approach.
This PR was created by an AI agent on behalf of the user.