Add SDK-backed OpenHands SFT converter by neubig · Pull Request #241 · neulab/agent-data-protocol

neubig · 2026-05-28T04:38:25Z

Summary

Add agents/openhands_sdk/std_to_sft.py, a replacement OpenHands SDK converter that builds SFT records through SDK primitives instead of hand-rolled OpenAI chat formatting.
Register dataset metadata.json custom_tools as SDK custom tools, create a real SDK Conversation, append standardized actions/observations as SDK events, and serialize via SDK event/message formatting.
Add metadata and generated OpenHands SDK SFT samples for five test datasets: agenttuning_alfworld, agenttuning_kg, agenttuning_mind2web, agenttuning_os, and agenttuning_webshop.
Keep unsupported non-bash code actions explicitly rejected for now.
Run repo CI on Python 3.12 so the OpenHands SDK dependency is installed and exercised by tests.

Design Decisions

<finish> in legacy MessageAction content is converted with a narrow regex because some standardized samples already encode final answers as assistant messages rather than raw ApiAction(function="finish") events. Converting those messages into SDK finish tool calls preserves the agent trajectory shape expected by SDK LLM formatting without requiring an upstream data migration in this PR.
Browser datasets still register their metadata custom_tools, but sdk_tool_specs maps index-based browser actions such as Mind2Web click(bid=...) / fill(bid=...) to the SDK browser tool when browser_enabled is true. Label-based custom tools such as WebShop click(element="Buy Now") stay as dataset custom tools so the original target label is preserved instead of being coerced into an arbitrary DOM index. Same-name custom tools with different schemas now raise an error if they are registered in the same Python process.
Dataset identity is resolved inside process_row, defaulting to the current MY_DATASET value at call time. This keeps the CLI one-dataset-per-invocation behavior while avoiding stale module-import state for tests or future in-process callers.
--is_web and --api_env are rejected for this converter because SDK tool configuration comes from datasets/$MY_DATASET/metadata.json; silently accepting those legacy OpenHands v0 flags would imply behavior they do not control.
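To make the <finish> decision above concrete, here is a minimal sketch of a narrow regex conversion. The pattern, the function name message_to_finish_call, and the shape of the returned dict are all assumptions for illustration; the actual code in agents/openhands_sdk/std_to_sft.py is not shown in this thread and may differ.

```python
import re
from typing import Optional

# Hypothetical pattern: one <finish>...</finish> block inside message content.
# The closing-tag form is an assumption; the PR only mentions "<finish>".
FINISH_RE = re.compile(r"<finish>\s*(?P<answer>.*?)\s*</finish>", re.DOTALL)


def message_to_finish_call(content: str) -> Optional[dict]:
    """Return a finish tool-call dict when content carries a <finish> block, else None."""
    match = FINISH_RE.search(content)
    if match is None:
        # Ordinary assistant message: leave it for the normal message path.
        return None
    return {"name": "finish", "arguments": {"message": match.group("answer")}}
```

Keeping the pattern this narrow means ordinary assistant messages fall through untouched, and only messages that already encode a final answer become finish calls.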

Validation

OPENHANDS_SUPPRESS_BANNER=1 uv run --python 3.12 --with-requirements requirements.txt pytest -q tests/test_*.py
uv run --python 3.12 --with-requirements requirements.txt pre-commit run --all-files

Results:

Full pytest suite: 535 passed, 16 skipped
Pre-commit: passed
GitHub checks on latest head are passing, including test (3.12).

Replaces closed PR #233 with the SDK-event generation approach.

This PR was created by an AI agent on behalf of the user.

github-actions

🟡 Acceptable — the converter is well-structured and the tests are solid. Three issues worth cleaning up before merge.

[PR Description] The custom-codereview-guide and AGENTS.md require a design-decision catalog for non-obvious implementation choices. Several decisions in this PR warrant entries: why <finish> is extracted via regex from MessageAction rather than expecting raw ApiAction in the standardized format; how browser_enabled: true interacts with custom tool registration for mind2web/webshop (custom tools get registered then skipped in sdk_tool_specs); and the single-dataset-per-process constraint enforced by the module-level dataset global. Please add a design-decision section to the PR description covering at least these three.

See inline comments for the technical issues.

[RISK ASSESSMENT]
⚠️ Risk: 🟡 MEDIUM — new code path that generates training data; bugs in conversion could silently produce wrong tool schemas or role assignments in SFT records. The existing tests cover the happy path well but rely on pre-committed fixtures for browser datasets (mind2web, webshop), so regressions there won't be caught by the regeneration test.

VERDICT: ✅ Worth merging after addressing the dead CLI args and the PR description gap. The module-level global and the cross-dataset tool collision are latent issues that can be tracked separately.

KEY INSIGHT: The module-level dataset = os.getenv("MY_DATASET") bakes the dataset identity into the module at import time, creating a hidden coupling that the CLI design makes safe today but that future in-process callers will hit unexpectedly.
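The coupling described in the key insight can be reproduced in a few lines. This is a sketch, not the converter's code: dataset_frozen and dataset_at_call are hypothetical names, and the environment variable is set inside the snippet only so the demo is deterministic.

```python
import os

os.environ["MY_DATASET"] = "agenttuning_os"

# Import-time lookup: the value is captured once, when this line executes.
DATASET_AT_IMPORT = os.getenv("MY_DATASET")


def dataset_frozen():
    """Return the value captured at import time, ignoring later changes."""
    return DATASET_AT_IMPORT


def dataset_at_call(dataset_name=None):
    """Resolve the dataset on every call, with an optional explicit override."""
    return dataset_name if dataset_name is not None else os.getenv("MY_DATASET")


# A later in-process caller switches datasets.
os.environ["MY_DATASET"] = "agenttuning_kg"

print(dataset_frozen())                         # agenttuning_os (stale)
print(dataset_at_call())                        # agenttuning_kg
print(dataset_at_call("agenttuning_webshop"))   # agenttuning_webshop
```

A CLI that handles one dataset per invocation never observes the stale value, which is why the problem stays hidden until a test or library caller changes MY_DATASET within one process.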

Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26555004793

This review was generated by an AI agent (OpenHands) on behalf of the user.

neubig · 2026-05-28T05:24:03Z

Addressed and resolved all review threads.

Changes pushed in 741cc65:

Moved dataset lookup into process_row, with optional dataset_name override for in-process callers.
Replaced the registered-tool name set with stored serialized tool specs and now raise on same-name/different-schema custom tools.
Reject legacy --is_web / --api_env flags with a clear NotImplementedError instead of accepting no-op flags.
Added regression tests for all three cases.
Updated the PR description with the requested design-decision section.

Local validation:

OPENHANDS_SUPPRESS_BANNER=1 uv run --python 3.12 --with-requirements requirements.txt pytest -q tests/test_*.py -> 538 passed, 16 skipped
uv run --python 3.12 --with-requirements requirements.txt pre-commit run --all-files -> passed

Review threads resolved via GraphQL:

PRRT_kwDOMKFAs86FS2iX
PRRT_kwDOMKFAs86FS2iZ
PRRT_kwDOMKFAs86FS2ia

GitHub checks on latest head are passing.

github-actions

🟡 Acceptable — the SDK-backed converter is well-structured and the five-dataset rollout is a solid first cut. Two issues below need attention before merge.

[IMPORTANT ISSUES]

agenttuning_webshop label→index corruption — coerce_browser_index extracts the first digit from the element label string. click(element="Buy Now") → browser_click(index=0), click(element="B07T3G7L37") → browser_click(index=7). These indices are meaningless: webshop uses named labels, not DOM indices. The SFT output silently contains incorrect click targets for every non-numeric label. Per AGENTS.md the design-decision catalog must include the specific tradeoff here — the label→index coercion is not documented as a known limitation, only the general statement that "browser-like actions are mapped to the SDK browser tool."
Test state pollution — test_openhands_sdk_converter_rejects_conflicting_custom_tool_schemas calls register_metadata_tools(first_metadata) in-process, which writes to both _REGISTERED_METADATA_TOOL_SPECS and the SDK-global tool registry via register_tool. Neither is torn down. Current tests are unaffected by ordering, but this is a latent trap for future in-process tests.

[RISK ASSESSMENT]

[Overall PR] Risk Assessment: 🟡 MEDIUM — Broad schema version bump across 40+ datasets and new SDK dependency. CI passes, but the webshop label→index silent corruption is a data-quality risk embedded in committed sample artifacts.

VERDICT: Worth merging after (1) documenting the webshop label→index limitation as an explicit design-decision entry in the PR description, and (2) confirming the corrupted indices are acceptable for the training goal (or fixing coerce_browser_index to preserve the original string label instead of extracting a digit).
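The two outcomes in the verdict can be contrasted with a small sketch. coerce_first_digit_run is a hypothetical reconstruction that happens to reproduce both indices reported in the review (it is not the actual coerce_browser_index, whose source is not shown in this thread), and preserve_label shows the alternative of keeping the string label on a dataset-level click tool.

```python
import re


def coerce_first_digit_run(label):
    """Assumed lossy behavior: first run of digits as an int, 0 when none exists."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else 0


def preserve_label(label):
    """Alternative: keep the original label as the click target (hypothetical shape)."""
    return {"name": "click", "arguments": {"element": label}}


print(coerce_first_digit_run("Buy Now"))      # 0
print(coerce_first_digit_run("B07T3G7L37"))   # 7, because "07" parses as 7
print(preserve_label("Buy Now"))
```

Under this reading, distinct labels collapse onto the same small set of indices, which is why the review treats the coercion as silent corruption rather than a harmless normalization.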

Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26558168612

This review was generated by an AI agent (OpenHands) on behalf of the user.

Add SDK-backed OpenHands SFT converter

88ac3d7

github-actions Bot reviewed May 28, 2026

Comment thread agents/openhands_sdk/std_to_sft.py Outdated

Comment thread agents/openhands_sdk/std_to_sft.py Outdated

Comment thread agents/openhands_sdk/std_to_sft.py Outdated

Graham Neubig added 2 commits May 28, 2026 01:20

Merge remote-tracking branch 'origin/main' into codex/openhands-sdk-s…

2ca994f

…ft-events # Conflicts: # requirements.txt

Normalize SDK SFT records for HF chat templates

087001d

Graham Neubig added 4 commits May 28, 2026 01:26

Keep SDK dependencies optional for CI

e642c0a

Bump schema version for dataset metadata

0bcc716

Run CI on Python 3.12 for SDK support

e53e431

Address SDK converter review feedback

741cc65

neubig requested a review from openhands-agent May 28, 2026 06:15

github-actions Bot reviewed May 28, 2026

Comment thread agents/openhands_sdk/std_to_sft.py

Comment thread tests/test_openhands_sdk_sft_conversion.py

Preserve webshop custom click labels

31110f4

neubig merged commit 274f935 into main May 28, 2026
5 checks passed

neubig deleted the codex/openhands-sdk-sft-events branch May 28, 2026 06:32

neubig merged 8 commits into main from codex/openhands-sdk-sft-events
