Add OpenHands SDK SFT converter#233
Conversation
3c4ba33 to
f1e6ecd
Compare
Co-authored-by: openhands <openhands@all-hands.dev>
f1e6ecd to
e80b092
Compare
|
PR Artifacts Notice This PR contains a
|
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
There was a problem hiding this comment.
🟡 Acceptable — The core converter (agents/openhands_sdk/std_to_sft.py) is solid: clean OpenAI chat-completions output, proper tool-call/result pairing with explicit IDs, good browser-action heuristics, and metadata-driven tool selection. Tests exercise real code paths and cover important edge cases (legacy XML finish, parallel tool calls, explicit IDs). Two documentation issues need attention.
Also note: .agents/skills/custom-codereview-guide.md (not modified in this PR, so no inline comment) still says api.py is required for ApiAction datasets (line 16) and validates against available_apis (lines 24–25). After this PR merges those review rules will fire incorrectly on every future dataset PR — api.py no longer exists and the field is now available_custom_tools. This should be updated in the same commit that updates AGENTS.md.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM. Large fan-out (~370 files, 50 datasets updated) increases the chance of a silent regression. The stale AGENTS.md and reviewer-skill guide could mislead future contributors and automated reviewers.
VERDICT: ✅ Worth merging after fixing the stale documentation.
KEY INSIGHT: The api.py → metadata.json migration is complete in code but incomplete in documentation — AGENTS.md and the reviewer skill still enforce the old contract.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26491199686
| @@ -89,6 +89,7 @@ Example correct format: | |||
| - Every `ApiAction.kwargs` object must validate by calling the function in `api.py`; include required parameters such as the `message` argument for `finish`. | |||
| - If a trajectory has `available_apis`, it must be a top-level list of function names that is a subset of that dataset's `api.py` functions, and every `ApiAction.function` used in that trajectory must appear in the list. | |||
| - Only populate `available_apis` for datasets that have `api.py` and whose source data explicitly specifies per-instance tool/API availability. Do not fill it with all functions from `api.py`, and do not infer it merely from the APIs used in the trajectory. | |||
| - If a trajectory has `CodeAction` entries, it must have top-level `available_code_languages` containing exactly the `CodeAction.language` values used in that trajectory. | |||
There was a problem hiding this comment.
🟠 Important: This new line was added alongside lines 88–91 and line 15 which still reference the old api.py contract and the renamed field available_apis. This PR removes every dataset-local api.py and renames the trajectory field to available_custom_tools, but the surrounding rules were not updated. Future contributors will add api.py files and use available_apis in JSON — both now wrong.
Required updates in the same file:
- Line 15 (directory tree):
api.py (required if ApiAction is used)→metadata.json (required when sample_std.json exists) - Lines 88–89: replace
api.pywithmetadata.json custom_tools - Lines 90–91:
available_apis→available_custom_tools;api.py functions→metadata.json custom_tools - Line 49: remove
api.pyfrom the fix-and-regenerate list - Line 172:
api.py function signature→metadata.json custom_tools schema
| @@ -0,0 +1,28 @@ | |||
| # OpenHands SDK Validation Runs | |||
There was a problem hiding this comment.
🟡 Suggestion: This directory adds ~150 files (~26,000 lines). Per summary.json, 46 of 50 datasets are pending_rerun — only agenttuning_db and screenagent actually completed a run. Every example.py is a 523-line script that is byte-for-byte identical across all 50 datasets except three top-level constants (DATASET_NAME, RECORD_INDEX, RECORD_ID).
Consider:
- Replacing the 50 copies with a single parameterised script, or
- Not committing the 46
pending_rerunplaceholder stubs — they add no validation signal.
The 50× duplication will make the harness expensive to maintain when the SDK API changes.
|
Closing this in favor of a smaller replacement PR that generates OpenHands SDK SFT data through SDK event/message construction instead of the hand-rolled OpenAI-chat conversion and large validation artifact set. |
Summary
agents/openhands_sdk, a V1 OpenHands Software Agent SDK SFT converter that emits OpenAI chat-completions style records with nativeassistant.tool_calls,toolmessages, SDK-style text content blocks, and atoolsarray.api.pystubs with canonicalmetadata.jsonfiles for all 50 standardized datasets.1.4.0and rename per-trajectoryavailable_apistoavailable_custom_tools.custom_tools: OpenAI-standard function tool specs for custom dataset tools.code_enabled: exact dataset-level set ofCodeAction.languagevalues.browser_enabled: dataset-level browser capability flag.metadata.jsonand reject action/metadata inconsistencies.OpenHands SDK Behavior
code_enabled: ["bash"]enables the SDKTerminalToolrepresentation.code_enabledintentionally fails OpenHands SDK conversion for now.browser_enabled: trueenables the SDK browser tool family.available_custom_toolsnarrows the custom tool set for a trajectory; when absent, all metadatacustom_toolsare available.str_replace_editor,submit,finish, and browser-shaped actions are mapped to SDK-native tool names instead of duplicated as custom tools.goto(line_number)and Androidclick(x, y).Current Dataset Status
50/50datasets withsample_std.json.api.pyfiles removed: all replaced bymetadata.json.agenttuning_db:mysqlcode_feedback:cpp,go,javascript,pythoncodeactinstruct:pythonjupyter-agent-dataset:pythonomniact:pythonopenhands:bash,pythonValidation
Results:
461 passed, 14 skipped, 4 warnings1.3.0 -> 1.4.0)Notes
The non-bash SDK gap is intentional in this PR. Future work can define harness-specific mappings for Python and other programming languages without weakening the harness-independent metadata schema.
This PR description was updated by an AI agent on behalf of the user.