Skip to content

Add OpenHands SDK SFT converter#233

Closed
neubig wants to merge 19 commits into
mainfrom
codex/openhands-sdk-sft-converter
Closed

Add OpenHands SDK SFT converter#233
neubig wants to merge 19 commits into
mainfrom
codex/openhands-sdk-sft-converter

Conversation

@neubig
Copy link
Copy Markdown
Contributor

@neubig neubig commented May 25, 2026

Summary

  • Add agents/openhands_sdk, a V1 OpenHands Software Agent SDK SFT converter that emits OpenAI chat-completions style records with native assistant.tool_calls, tool messages, SDK-style text content blocks, and a tools array.
  • Replace dataset-local api.py stubs with canonical metadata.json files for all 50 standardized datasets.
  • Bump standardized schema to 1.4.0 and rename per-trajectory available_apis to available_custom_tools.
  • Add dataset metadata fields:
    • custom_tools: OpenAI-standard function tool specs for custom dataset tools.
    • code_enabled: exact dataset-level set of CodeAction.language values.
    • browser_enabled: dataset-level browser capability flag.
  • Update OpenHands v0, OpenHands v0 MCP, SWE-agent, and OpenHands SDK SFT converters to read metadata.json and reject action/metadata inconsistencies.
  • Regenerate OpenHands SDK SFT samples from metadata-driven tool availability.

OpenHands SDK Behavior

  • code_enabled: ["bash"] enables the SDK TerminalTool representation.
  • Any non-bash language in code_enabled intentionally fails OpenHands SDK conversion for now.
  • browser_enabled: true enables the SDK browser tool family.
  • available_custom_tools narrows the custom tool set for a trajectory; when absent, all metadata custom_tools are available.
  • SDK-native aliases such as str_replace_editor, submit, finish, and browser-shaped actions are mapped to SDK-native tool names instead of duplicated as custom tools.
  • Browser detection is context-aware so non-browser homonyms remain custom tools, e.g. SWE-agent goto(line_number) and Android click(x, y).

Current Dataset Status

  • Metadata files added: 50/50 datasets with sample_std.json.
  • Dataset-local api.py files removed: all replaced by metadata.json.
  • OpenHands SDK SFT samples regenerated for datasets compatible with current SDK code handling.
  • OpenHands SDK SFT samples intentionally removed for datasets with non-bash code actions:
    • agenttuning_db: mysql
    • code_feedback: cpp, go, javascript, python
    • codeactinstruct: python
    • jupyter-agent-dataset: python
    • omniact: python
    • openhands: bash, python

Validation

uv run --with ruff ruff check .
uv run --with pytest --with pydantic --with Pillow --with beautifulsoup4 --with tqdm --with datasets --with matplotlib --with pre-commit pytest -q tests/test_*.py
python scripts/check_schema_version_bump.py --base-ref origin/main --head-ref HEAD

Results:

  • Ruff: passed
  • Full test suite: 461 passed, 14 skipped, 4 warnings
  • Schema version bump check: passed (1.3.0 -> 1.4.0)

Notes

The non-bash SDK gap is intentional in this PR. Future work can define harness-specific mappings for Python and other programming languages without weakening the harness-independent metadata schema.

This PR description was updated by an AI agent on behalf of the user.

@neubig neubig force-pushed the codex/openhands-sdk-sft-converter branch from 3c4ba33 to f1e6ecd Compare May 25, 2026 03:51
Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig force-pushed the codex/openhands-sdk-sft-converter branch from f1e6ecd to e80b092 Compare May 25, 2026 12:02
@github-actions
Copy link
Copy Markdown

PR Artifacts Notice

This PR contains a .pr/ directory with PR-specific documents. This directory will be automatically removed when the PR is approved.

For fork PRs: Manual removal is required before merging.

@neubig neubig marked this pull request as ready for review May 27, 2026 04:41
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable — The core converter (agents/openhands_sdk/std_to_sft.py) is solid: clean OpenAI chat-completions output, proper tool-call/result pairing with explicit IDs, good browser-action heuristics, and metadata-driven tool selection. Tests exercise real code paths and cover important edge cases (legacy XML finish, parallel tool calls, explicit IDs). Two documentation issues need attention.

Also note: .agents/skills/custom-codereview-guide.md (not modified in this PR, so no inline comment) still says api.py is required for ApiAction datasets (line 16) and validates against available_apis (lines 24–25). After this PR merges those review rules will fire incorrectly on every future dataset PR — api.py no longer exists and the field is now available_custom_tools. This should be updated in the same commit that updates AGENTS.md.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM. Large fan-out (~370 files, 50 datasets updated) increases the chance of a silent regression. The stale AGENTS.md and reviewer-skill guide could mislead future contributors and automated reviewers.

VERDICT: ✅ Worth merging after fixing the stale documentation.

KEY INSIGHT: The api.py → metadata.json migration is complete in code but incomplete in documentation — AGENTS.md and the reviewer skill still enforce the old contract.


Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26491199686

Comment thread AGENTS.md
@@ -89,6 +89,7 @@ Example correct format:
- Every `ApiAction.kwargs` object must validate by calling the function in `api.py`; include required parameters such as the `message` argument for `finish`.
- If a trajectory has `available_apis`, it must be a top-level list of function names that is a subset of that dataset's `api.py` functions, and every `ApiAction.function` used in that trajectory must appear in the list.
- Only populate `available_apis` for datasets that have `api.py` and whose source data explicitly specifies per-instance tool/API availability. Do not fill it with all functions from `api.py`, and do not infer it merely from the APIs used in the trajectory.
- If a trajectory has `CodeAction` entries, it must have top-level `available_code_languages` containing exactly the `CodeAction.language` values used in that trajectory.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important: This new line was added alongside lines 88–91 and line 15 which still reference the old api.py contract and the renamed field available_apis. This PR removes every dataset-local api.py and renames the trajectory field to available_custom_tools, but the surrounding rules were not updated. Future contributors will add api.py files and use available_apis in JSON — both now wrong.

Required updates in the same file:

  • Line 15 (directory tree): api.py (required if ApiAction is used)metadata.json (required when sample_std.json exists)
  • Lines 88–89: replace api.py with metadata.json custom_tools
  • Lines 90–91: available_apisavailable_custom_tools; api.py functionsmetadata.json custom_tools
  • Line 49: remove api.py from the fix-and-regenerate list
  • Line 172: api.py function signaturemetadata.json custom_tools schema

Comment thread .pr/validation/README.md
@@ -0,0 +1,28 @@
# OpenHands SDK Validation Runs
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: This directory adds ~150 files (~26,000 lines). Per summary.json, 46 of 50 datasets are pending_rerun — only agenttuning_db and screenagent actually completed a run. Every example.py is a 523-line script that is byte-for-byte identical across all 50 datasets except three top-level constants (DATASET_NAME, RECORD_INDEX, RECORD_ID).

Consider:

  1. Replacing the 50 copies with a single parameterised script, or
  2. Not committing the 46 pending_rerun placeholder stubs — they add no validation signal.

The 50× duplication will make the harness expensive to maintain when the SDK API changes.

@neubig
Copy link
Copy Markdown
Contributor Author

neubig commented May 28, 2026

Closing this in favor of a smaller replacement PR that generates OpenHands SDK SFT data through SDK event/message construction instead of the hand-rolled OpenAI-chat conversion and large validation artifact set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants