Fix unhandled exception noise from background safetensors conversion thread #45752
dhruv7477 wants to merge 1 commit into huggingface:main
Conversation
…thread Signed-off-by: Dhruv Sharma <dhruv7477@gmail.com>
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45752&sha=ac7fda
Re: Failing CI checks

The two failing tests (`test_tp_generation` for exaone4 and `test_ep_forward` for gpt_oss) are pre-existing failures unrelated to this PR.

- `GptOssModelTest::test_ep_forward`: tracked by open issue #45161 ("Only TP not working with GPT-OSS MoE model", filed April 1, 2026). The `ProcessRaisedException` in the EP forward pass is a known infrastructure issue.
- `Exaone4ModelTest::test_tp_generation`: SIGABRT (process crash) in the distributed subprocess. This is a CUDA/NCCL-level crash that cannot be caused by changing the `ignore_errors_during_conversion` flag on a background Python thread in `modeling_utils.py`.

This PR touches a single line in the background safetensors conversion thread. It has no interaction with tensor-parallel or expert-parallel computation paths. Could a maintainer re-run the required `run_tests` check?
The background thread `Thread-auto_conversion` in `modeling_utils.py` was spawned with `ignore_errors_during_conversion=False`. When `get_repo_discussions()` raises `HfHubHTTPError` 403 (discussions disabled on the repo), the exception propagated uncaught inside the thread, and Python printed the full traceback to stderr: the noise reported in #44403.

Since this thread is explicitly fire-and-forget (the comment on line 720 reads "try to launch safetensors conversion for next time"), errors from it should never surface to the user. Changing the flag to `True` causes `auto_conversion` to catch and suppress the exception cleanly.

A prior attempt (#44440) was closed because it was opened two days after the issue while discussion was still ongoing, and made additional changes to `safetensors_conversion.py`. This PR is a single-line fix to the actual root cause.

Fixes #44403
Tests run: `python -m pytest tests/utils/test_modeling_utils.py -x -v`
Result: 94 passed, 38 skipped, 1 xfailed
I used an AI assistant to help trace the root cause and identify the relevant code path, but I reviewed the change, ran the tests, and verified the fix against the stack trace in the issue.