Add SDK-backed OpenHands SFT converter #241

Merged
neubig merged 8 commits into main from codex/openhands-sdk-sft-events
May 28, 2026

Conversation

@neubig
Contributor

@neubig neubig commented May 28, 2026

Summary

  • Add agents/openhands_sdk/std_to_sft.py, a replacement OpenHands SDK converter that builds SFT records through SDK primitives instead of hand-rolled OpenAI chat formatting.
  • Register dataset metadata.json custom_tools as SDK custom tools, create a real SDK Conversation, append standardized actions/observations as SDK events, and serialize via SDK event/message formatting.
  • Add metadata and generated OpenHands SDK SFT samples for five test datasets: agenttuning_alfworld, agenttuning_kg, agenttuning_mind2web, agenttuning_os, and agenttuning_webshop.
  • Keep unsupported non-bash code actions explicitly rejected for now.
  • Run repo CI on Python 3.12 so the OpenHands SDK dependency is installed and exercised by tests.

Design Decisions

  • <finish> in legacy MessageAction content is converted with a narrow regex because some standardized samples already encode final answers as assistant messages rather than raw ApiAction(function="finish") events. Converting those messages into SDK finish tool calls preserves the agent trajectory shape expected by SDK LLM formatting without requiring an upstream data migration in this PR.
  • Browser datasets still register their metadata custom_tools, but sdk_tool_specs maps index-based browser actions such as Mind2Web click(bid=...) / fill(bid=...) to the SDK browser tool when browser_enabled is true. Label-based custom tools such as WebShop click(element="Buy Now") stay as dataset custom tools so the original target label is preserved instead of being coerced into an arbitrary DOM index. Registering same-name custom tools with different schemas in one Python process now raises an error.
  • Dataset identity is resolved inside process_row, defaulting to the current MY_DATASET value at call time. This keeps the CLI one-dataset-per-invocation behavior while avoiding stale module-import state for tests or future in-process callers.
  • --is_web and --api_env are rejected for this converter because SDK tool configuration comes from datasets/$MY_DATASET/metadata.json; silently accepting those legacy OpenHands v0 flags would imply behavior they do not control.
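
A minimal sketch of the <finish> conversion described in the first design decision, assuming the "narrow" regex only fires when the whole assistant message is a single <finish>...</finish> block. The pattern, the function name finish_call_from_message, and the returned dict shape are hypothetical illustrations; the converter's actual regex and SDK event types are not shown in this description.

```python
import re
from typing import Optional

# Hypothetical pattern: the entire message must be one <finish>...</finish> block.
FINISH_RE = re.compile(r"^\s*<finish>(.*?)</finish>\s*$", re.DOTALL)


def finish_call_from_message(content: str) -> Optional[dict]:
    """Return a finish tool-call dict if the message is a bare <finish> block."""
    match = FINISH_RE.match(content)
    if match is None:
        return None  # ordinary assistant message; leave it unchanged
    # Illustrative payload shape only, not the SDK's real event schema.
    return {"tool": "finish", "arguments": {"message": match.group(1).strip()}}
```

Anchoring the pattern at both ends is what keeps it narrow: a message that merely mentions <finish> in passing does not match and stays a plain assistant message.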

Validation

OPENHANDS_SUPPRESS_BANNER=1 uv run --python 3.12 --with-requirements requirements.txt pytest -q tests/test_*.py
uv run --python 3.12 --with-requirements requirements.txt pre-commit run --all-files

Results:

  • Full pytest suite: 535 passed, 16 skipped
  • Pre-commit: passed
  • GitHub checks on latest head are passing, including test (3.12).

Replaces closed PR #233 with the SDK-event generation approach.

This PR was created by an AI agent on behalf of the user.

@github-actions github-actions Bot left a comment

🟡 Acceptable — the converter is well-structured and the tests are solid. Three issues worth cleaning up before merge.

[PR Description] The custom-codereview-guide and AGENTS.md require a design-decision catalog for non-obvious implementation choices. Several decisions in this PR warrant entries: why <finish> is extracted via regex from MessageAction rather than expecting raw ApiAction in the standardized format; how browser_enabled: true interacts with custom tool registration for mind2web/webshop (custom tools get registered then skipped in sdk_tool_specs); and the single-dataset-per-process constraint enforced by the module-level dataset global. Please add a design-decision section to the PR description covering at least these three.

See inline comments for the technical issues.

[RISK ASSESSMENT]
⚠️ Risk: 🟡 MEDIUM — new code path that generates training data; bugs in conversion could silently produce wrong tool schemas or role assignments in SFT records. The existing tests cover the happy path well but rely on pre-committed fixtures for browser datasets (mind2web, webshop), so regressions there won't be caught by the regeneration test.

VERDICT: ✅ Worth merging after addressing the dead CLI args and the PR description gap. The module-level global and the cross-dataset tool collision are latent issues that can be tracked separately.

KEY INSIGHT: The module-level dataset = os.getenv("MY_DATASET") bakes the dataset identity into the module at import time, creating a hidden coupling that the CLI design makes safe today but that future in-process callers will hit unexpectedly.
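
The coupling described in this insight can be illustrated with a short sketch that contrasts import-time capture with call-time lookup. The names DATASET_AT_IMPORT and resolve_dataset are hypothetical; only the MY_DATASET environment variable and the idea of an optional dataset_name override come from this PR.

```python
import os
from typing import Optional

# Import-time capture: this value is frozen when the module is first loaded,
# so later changes to MY_DATASET are not seen by code that reads it.
DATASET_AT_IMPORT = os.getenv("MY_DATASET")


def resolve_dataset(dataset_name: Optional[str] = None) -> str:
    """Resolve the dataset at call time; an explicit argument wins."""
    name = dataset_name or os.getenv("MY_DATASET")
    if not name:
        raise ValueError("set MY_DATASET or pass dataset_name")
    return name
```

A CLI that processes one dataset per invocation behaves the same either way; the difference only shows up for tests or in-process callers that switch datasets after import.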


Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26555004793

This review was generated by an AI agent (OpenHands) on behalf of the user.

Comment thread agents/openhands_sdk/std_to_sft.py Outdated
Comment thread agents/openhands_sdk/std_to_sft.py Outdated
Comment thread agents/openhands_sdk/std_to_sft.py Outdated
@neubig
Contributor Author

neubig commented May 28, 2026

Addressed and resolved all review threads.

Changes pushed in 741cc65:

  • Moved dataset lookup into process_row, with optional dataset_name override for in-process callers.
  • Replaced the registered-tool name set with stored serialized tool specs and now raise on same-name/different-schema custom tools.
  • Reject legacy --is_web / --api_env flags with a clear NotImplementedError instead of accepting no-op flags.
  • Added regression tests for all three cases.
  • Updated the PR description with the requested design-decision section.
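
As a rough illustration of the second change above, the sketch below stores a canonical serialization of each tool spec and raises when the same name reappears with a different schema. The registry name, function name, and use of sorted-key JSON as the canonical form are assumptions for illustration, not the converter's actual code.

```python
import json

# Hypothetical registry: tool name -> canonical JSON of its spec.
_REGISTERED_SPECS: dict[str, str] = {}


def register_custom_tool(name: str, spec: dict) -> bool:
    """Register a tool spec; return False if an identical spec already exists."""
    # sort_keys makes the comparison independent of key order in the metadata.
    serialized = json.dumps(spec, sort_keys=True)
    existing = _REGISTERED_SPECS.get(name)
    if existing is None:
        _REGISTERED_SPECS[name] = serialized
        return True
    if existing != serialized:
        raise ValueError(
            f"custom tool {name!r} is already registered with a different schema"
        )
    return False
```

Keeping the serialized spec rather than just the name is what allows a repeat registration of the same tool to be a no-op while a conflicting one fails loudly.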

Local validation:

  • OPENHANDS_SUPPRESS_BANNER=1 uv run --python 3.12 --with-requirements requirements.txt pytest -q tests/test_*.py -> 538 passed, 16 skipped
  • uv run --python 3.12 --with-requirements requirements.txt pre-commit run --all-files -> passed

Review threads resolved via GraphQL:

  • PRRT_kwDOMKFAs86FS2iX
  • PRRT_kwDOMKFAs86FS2iZ
  • PRRT_kwDOMKFAs86FS2ia

GitHub checks on latest head are passing.

@neubig neubig requested a review from openhands-agent May 28, 2026 06:15

@github-actions github-actions Bot left a comment

🟡 Acceptable — the SDK-backed converter is well-structured and the five-dataset rollout is a solid first cut. Two issues below need attention before merge.

[IMPORTANT ISSUES]

  1. agenttuning_webshop label→index corruption: coerce_browser_index extracts the first digit from the element label string, so click(element="Buy Now") → browser_click(index=0) and click(element="B07T3G7L37") → browser_click(index=7). These indices are meaningless: webshop uses named labels, not DOM indices. The SFT output silently contains incorrect click targets for every non-numeric label. Per AGENTS.md the design-decision catalog must include the specific tradeoff here. The label→index coercion is not documented as a known limitation; the description gives only the general statement that "browser-like actions are mapped to the SDK browser tool."

  2. Test state pollution: test_openhands_sdk_converter_rejects_conflicting_custom_tool_schemas calls register_metadata_tools(first_metadata) in-process, which writes to both _REGISTERED_METADATA_TOOL_SPECS and the SDK-global tool registry via register_tool. Neither is torn down. Current tests are unaffected by ordering, but this is a latent trap for future in-process tests.
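
One way the teardown gap in item 2 could be closed is a snapshot-and-restore helper around any test that mutates a module-level registry. This is a sketch under assumptions: REGISTRY stands in for a dict such as _REGISTERED_METADATA_TOOL_SPECS, isolated_registry is a hypothetical helper, and it does not address the SDK-global registry, whose reset mechanism is not described here.

```python
from contextlib import contextmanager
from typing import Iterator

# Stand-in for a module-level registry dict mutated by the code under test.
REGISTRY: dict[str, str] = {}


@contextmanager
def isolated_registry(registry: dict) -> Iterator[dict]:
    """Snapshot a registry dict and restore it on exit, even if the body fails."""
    snapshot = dict(registry)
    try:
        yield registry
    finally:
        # Mutate in place so other modules holding a reference see the restore.
        registry.clear()
        registry.update(snapshot)
```

Wrapping the same logic in a test fixture would give each test a clean registry regardless of execution order.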

[RISK ASSESSMENT]

  • [Overall PR] Risk Assessment: 🟡 MEDIUM — Broad schema version bump across 40+ datasets and new SDK dependency. CI passes, but the webshop label→index silent corruption is a data-quality risk embedded in committed sample artifacts.

VERDICT: Worth merging after (1) documenting the webshop label→index limitation as an explicit design-decision entry in the PR description, and (2) confirming the corrupted indices are acceptable for the training goal (or fixing coerce_browser_index to preserve the original string label instead of extracting a digit).
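
The two options in this verdict can be contrasted in a short sketch. coerce_first_digits is a guess at the lossy behavior: taking the first run of digits (or 0 when there is none) is one reading that reproduces both examples quoted in the review, but the real coerce_browser_index is not shown here. route_click is a hypothetical alternative that keeps non-numeric labels verbatim; the tool names in its return values are illustrative.

```python
import re


def coerce_first_digits(label: str) -> int:
    """Lossy: first run of digits as an int, or 0 when there is none."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else 0


def route_click(target: str) -> dict:
    """Send purely numeric targets to an index-based tool; keep labels verbatim."""
    if target.strip().isdecimal():
        return {"tool": "browser_click", "index": int(target)}
    # Non-numeric label: preserve the original string instead of inventing an index.
    return {"tool": "click", "element": target}
```

Under the lossy version, "Buy Now" and any other digit-free label collapse to the same index, which is why the resulting click targets carry no information.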


Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26558168612

This review was generated by an AI agent (OpenHands) on behalf of the user.

Comment thread agents/openhands_sdk/std_to_sft.py
Comment thread tests/test_openhands_sdk_sft_conversion.py
@neubig neubig merged commit 274f935 into main May 28, 2026
5 checks passed
@neubig neubig deleted the codex/openhands-sdk-sft-events branch May 28, 2026 06:32