
feat: integrate transformers OpenAI-compatible serve engine (#1384)#1765

Open
robert-cronin wants to merge 14 commits into kaito-project:main from robert-cronin:feat/1384

Conversation

@robert-cronin
Contributor

Integrate transformers OpenAI-compatible serve engine

Reason for Change:
Add a new inference server (inference.py) that wraps HuggingFace's built-in transformers serve OpenAI-compatible engine, providing /v1/chat/completions and /v1/models endpoints. This enables day 1 support for new HuggingFace models without needing custom serving code.
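For illustration, a client call against the two endpoints could look like the following (a minimal sketch; the host and port are assumptions, not values fixed by this PR):

  import requests

  BASE_URL = "http://localhost:5000"  # assumed host/port; the in-cluster service address will differ

  # List the models the server advertises (OpenAI-compatible /v1/models).
  models = requests.get(f"{BASE_URL}/v1/models", timeout=30).json()
  print([m["id"] for m in models["data"]])

  # Request a chat completion (OpenAI-compatible /v1/chat/completions).
  resp = requests.post(
      f"{BASE_URL}/v1/chat/completions",
      json={
          "model": models["data"][0]["id"],
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 32,
      },
      timeout=120,
  )
  resp.raise_for_status()
  print(resp.json()["choices"][0]["message"]["content"])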

Requirements

  • Added unit tests and e2e tests (if applicable).

Issue Fixed:
Fixes #1384

Notes for Reviewers:
This adds inference.py wrapping transformers serve and switches DefaultTransformersMainFile to point to it, making it the default HF runtime. The old inference_api.py is preserved in the repo but no longer launched by default.
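As a rough sketch of the wrapping approach (not the actual inference.py; the transformers serve flags shown are assumptions to be checked against `transformers serve --help`):

  import subprocess
  import sys

  def launch_serve(host: str = "0.0.0.0", port: int = 5000) -> int:
      # Delegate serving to HuggingFace's built-in OpenAI-compatible engine.
      # The real inference.py also carries over KAITO's adapter loading and
      # configuration logic before handing off to the serve engine.
      cmd = ["transformers", "serve", "--host", host, "--port", str(port)]
      return subprocess.run(cmd).returncode

  if __name__ == "__main__":
      sys.exit(launch_serve())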

feat: integrate transformers OpenAI-compatible serve engine (kaito-project#1384)

Signed-off-by: Robert Cronin <robert@robertcronin.com>
@Fei-Guo
Collaborator

Fei-Guo commented Feb 11, 2026

Super cool! Have you done manual tests, and would you mind adding the test results to the change description? I also think it is worth adding an e2e test that exercises the HF server with a small model.

@robert-cronin
Contributor Author

Super cool! Have you done manual tests, and would you mind adding the test results to the change description? I also think it is worth adding an e2e test that exercises the HF server with a small model.

Thank you, yes. I did some manual testing on SmolLM2-135M and added some unit tests, which have passed in the unit-tests GH workflow. However, pytest has now been hanging for 2 hours; could you help me cancel that job? I will fix the hang and add e2e tests in a new commit.
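For the e2e case, the check could take roughly this shape (a sketch only; the base URL and the exact SmolLM2 model id are assumptions, and the real test is Go code under test/e2e/preset_test.go):

  import requests

  def smoke_test_chat_completions(base_url: str, model: str) -> None:
      # The model should be advertised before generation is exercised.
      served = requests.get(f"{base_url}/v1/models", timeout=30).json()["data"]
      assert any(m["id"] == model for m in served), f"{model} not served"

      # A tiny prompt keeps the check cheap on a 135M-parameter model.
      resp = requests.post(
          f"{base_url}/v1/chat/completions",
          json={
              "model": model,
              "messages": [{"role": "user", "content": "ping"}],
              "max_tokens": 8,
          },
          timeout=120,
      )
      assert resp.status_code == 200, resp.text
      assert resp.json()["choices"][0]["message"]["content"]

  smoke_test_chat_completions("http://localhost:5000", "HuggingFaceTB/SmolLM2-135M")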

Contributor

Copilot AI left a comment


Pull request overview

This PR integrates HuggingFace's transformers serve OpenAI-compatible engine as a new inference server (inference.py), providing day 1 support for new models. The implementation wraps the transformers library's built-in serving functionality while preserving KAITO's adapter loading and configuration logic. The new server becomes the default for all transformers-based models by changing the DefaultTransformersMainFile constant.

Changes:

  • Added new inference.py wrapping transformers serve with OpenAI-compatible endpoints (/v1/chat/completions, /v1/models)
  • Updated DefaultTransformersMainFile to point to inference.py instead of inference_api.py
  • Added comprehensive unit tests for the new inference server and updated e2e tests to validate new endpoints
  • Added transformers[serving] dependency to enable the serving functionality

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Summary per file:

  • presets/workspace/inference/text-generation/inference.py: new OpenAI-compatible inference server wrapping the transformers serve engine, with adapter support
  • presets/workspace/inference/text-generation/tests/test_inference.py: comprehensive unit tests for the new inference server endpoints and metrics
  • presets/workspace/dependencies/requirements.txt: added the transformers[serving] extra to enable serving functionality
  • pkg/workspace/inference/preset_inference_types.go: changed the DefaultTransformersMainFile constant to use inference.py
  • pkg/workspace/inference/preset_inferences_test.go: updated tests to expect inference.py instead of inference_api.py
  • pkg/utils/test/test_model.go: updated all test model definitions to use inference.py
  • test/e2e/preset_test.go: added a validateChatCompletionsEndpoint function and calls in the relevant test cases
  • docker/presets/models/tfs/Dockerfile: added inference.py to the Docker image alongside the existing inference_api.py


Collaborator

@zhuangqh zhuangqh left a comment


Good job.

@zhuangqh
Collaborator

This failed in the private model case:

  Loading llama-3.1-8b-instruct@main
  INFO:     10.224.0.9:50389 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
  ERROR:    Exception in ASGI application
  Traceback (most recent call last):
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 403, in hf_raise_for_status
      response.raise_for_status()
    File "/usr/local/lib/python3.12/site-packages/requests/models.py", line 1026, in raise_for_status
      raise HTTPError(http_error_msg, response=self)
  requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/llama-3.1-8b-instruct/resolve/main/processor_config.json
  The above exception was the direct cause of the following exception:
  Traceback (most recent call last):
    File "/usr/local/lib/python3.12/site-packages/transformers/utils/hub.py", line 479, in cached_files
      hf_hub_download(
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
      return fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1014, in hf_hub_download
      return _hf_hub_download_to_cache_dir(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1121, in _hf_hub_download_to_cache_dir
      _raise_on_head_call_error(head_call_error, force_download, local_files_only)
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1662, in _raise_on_head_call_error
      raise head_call_error
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1550, in _get_metadata_or_catch_error
      metadata = get_hf_file_metadata(
                 ^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
      return fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1467, in get_hf_file_metadata
      r = _request_wrapper(
          ^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 283, in _request_wrapper
      response = _request_wrapper(
                 ^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 307, in _request_wrapper
      hf_raise_for_status(response)
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 453, in hf_raise_for_status
      raise _format(RepositoryNotFoundError, message, response) from e
  huggingface_hub.errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-6997d8d9-4279887f656a9868337b07fc;415534fa-dd25-4769-a30e-50b32498010f)
  Repository Not Found for url: https://huggingface.co/llama-3.1-8b-instruct/resolve/main/processor_config.json.
  Please make sure you specified the correct `repo_id` and `repo_type`.

Command line:

accelerate launch --gpu_ids=all --num_processes=1 --num_machines=1 --machine_rank=0 /workspace/tfs/inference_api.py --pretrained_model_name_or_path=meta-llama/Llama-3.1-8B-Instruct --revision=0e9e39f249a16976918f6564b8830bc894c89659 --torch_dtype=bfloat16 --pipeline=text-generation --chat_template=/workspace/chat_templates/llama-3-instruct.jinja --allow_remote_files --served_model_name=llama-3.1-8b-instruct
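One plausible reading of the trace (an interpretation, not something confirmed in this thread): the serve engine resolves the requested model name as a Hub repo id, so the served alias llama-3.1-8b-instruct 404s even though the launch command points at the real repo. Roughly:

  from huggingface_hub import hf_hub_download
  from huggingface_hub.errors import RepositoryNotFoundError

  # The served alias is not a Hub repo id, so resolution 404s, as in the log:
  try:
      hf_hub_download("llama-3.1-8b-instruct", "config.json")
  except RepositoryNotFoundError as err:
      print("alias does not resolve:", err)

  # The actual (gated) repo id from the launch command would resolve, given
  # a token with access:
  # hf_hub_download("meta-llama/Llama-3.1-8B-Instruct", "config.json", token="hf_...")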
