
feat: integrate transformers OpenAI-compatible serve engine (#1384)#1765

Open
robert-cronin wants to merge 14 commits into kaito-project:main from robert-cronin:feat/1384

Conversation

@robert-cronin
Contributor

Integrate transformers OpenAI-compatible serve engine

Reason for Change:
Add a new inference server (inference.py) that wraps HuggingFace's built-in transformers serve OpenAI-compatible engine, providing /v1/chat/completions and /v1/models endpoints. This enables day 1 support for new HuggingFace models without needing custom serving code.
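For illustration, a client call against the two endpoints could look like the following (a minimal sketch; the host and port are assumptions, not values fixed by this PR):

  import requests

  BASE_URL = "http://localhost:5000"  # assumed host/port; the in-cluster service address will differ

  # List the models the server advertises (OpenAI-compatible /v1/models).
  models = requests.get(f"{BASE_URL}/v1/models", timeout=30).json()
  print([m["id"] for m in models["data"]])

  # Request a chat completion (OpenAI-compatible /v1/chat/completions).
  resp = requests.post(
      f"{BASE_URL}/v1/chat/completions",
      json={
          "model": models["data"][0]["id"],
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 32,
      },
      timeout=120,
  )
  resp.raise_for_status()
  print(resp.json()["choices"][0]["message"]["content"])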

Requirements

  • Added unit tests and e2e tests (if applicable).

Issue Fixed:
Fixes #1384

Notes for Reviewers:
This adds inference.py wrapping transformers serve and switches DefaultTransformersMainFile to point to it, making it the default HF runtime. The old inference_api.py is preserved in the repo but no longer launched by default.
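As a rough sketch of the wrapping approach (not the actual inference.py; the transformers serve flags shown are assumptions to be checked against `transformers serve --help`):

  import subprocess
  import sys

  def launch_serve(host: str = "0.0.0.0", port: int = 5000) -> int:
      # Delegate serving to HuggingFace's built-in OpenAI-compatible engine.
      # The real inference.py also carries over KAITO's adapter loading and
      # configuration logic before handing off to the serve engine.
      cmd = ["transformers", "serve", "--host", host, "--port", str(port)]
      return subprocess.run(cmd).returncode

  if __name__ == "__main__":
      sys.exit(launch_serve())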

feat: integrate transformers OpenAI-compatible serve engine (kaito-project#1384)

Signed-off-by: Robert Cronin <robert@robertcronin.com>
@Fei-Guo
Collaborator

Fei-Guo commented Feb 11, 2026

Super cool! Have you done manual tests, and would you mind adding the test results to the change description? I also think it is worth adding an e2e test that exercises the HF server with a small model.

@robert-cronin
Contributor Author

Super cool! Have you done manual tests, and would you mind adding the test results to the change description? I also think it is worth adding an e2e test that exercises the HF server with a small model.

Thank you, yes. I did some manual testing on SmolLM2-135M and added some unit tests, which have passed in the unit-tests GH workflow. However, pytest has now been hanging for 2 hours; could you help me cancel that job? I will fix the hang and add e2e tests in a new commit.
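For the e2e case, the check could take roughly this shape (a sketch only; the base URL and the exact SmolLM2 model id are assumptions, and the real test is Go code under test/e2e/preset_test.go):

  import requests

  def smoke_test_chat_completions(base_url: str, model: str) -> None:
      # The model should be advertised before generation is exercised.
      served = requests.get(f"{base_url}/v1/models", timeout=30).json()["data"]
      assert any(m["id"] == model for m in served), f"{model} not served"

      # A tiny prompt keeps the check cheap on a 135M-parameter model.
      resp = requests.post(
          f"{base_url}/v1/chat/completions",
          json={
              "model": model,
              "messages": [{"role": "user", "content": "ping"}],
              "max_tokens": 8,
          },
          timeout=120,
      )
      assert resp.status_code == 200, resp.text
      assert resp.json()["choices"][0]["message"]["content"]

  smoke_test_chat_completions("http://localhost:5000", "HuggingFaceTB/SmolLM2-135M")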

Contributor

Copilot AI left a comment


Pull request overview

This PR integrates HuggingFace's transformers serve OpenAI-compatible engine as a new inference server (inference.py), providing day 1 support for new models. The implementation wraps the transformers library's built-in serving functionality while preserving KAITO's adapter loading and configuration logic. The new server becomes the default for all transformers-based models by changing the DefaultTransformersMainFile constant.

Changes:

  • Added new inference.py wrapping transformers serve with OpenAI-compatible endpoints (/v1/chat/completions, /v1/models)
  • Updated DefaultTransformersMainFile to point to inference.py instead of inference_api.py
  • Added comprehensive unit tests for the new inference server and updated e2e tests to validate new endpoints
  • Added transformers[serving] dependency to enable the serving functionality

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Summary per file:

  • presets/workspace/inference/text-generation/inference.py: new OpenAI-compatible inference server wrapping the transformers serve engine, with adapter support
  • presets/workspace/inference/text-generation/tests/test_inference.py: comprehensive unit tests for the new inference server endpoints and metrics
  • presets/workspace/dependencies/requirements.txt: added the transformers[serving] extra to enable serving functionality
  • pkg/workspace/inference/preset_inference_types.go: changed the DefaultTransformersMainFile constant to use inference.py
  • pkg/workspace/inference/preset_inferences_test.go: updated tests to expect inference.py instead of inference_api.py
  • pkg/utils/test/test_model.go: updated all test model definitions to use inference.py
  • test/e2e/preset_test.go: added a validateChatCompletionsEndpoint function and calls in the relevant test cases
  • docker/presets/models/tfs/Dockerfile: added inference.py to the Docker image alongside the existing inference_api.py


Collaborator

@zhuangqh zhuangqh left a comment


Good job.

@zhuangqh
Collaborator

This failed in the private model case:

  Loading llama-3.1-8b-instruct@main
  INFO:     10.224.0.9:50389 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
  ERROR:    Exception in ASGI application
  Traceback (most recent call last):
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 403, in hf_raise_for_status
      response.raise_for_status()
    File "/usr/local/lib/python3.12/site-packages/requests/models.py", line 1026, in raise_for_status
      raise HTTPError(http_error_msg, response=self)
  requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/llama-3.1-8b-instruct/resolve/main/processor_config.json
  The above exception was the direct cause of the following exception:
  Traceback (most recent call last):
    File "/usr/local/lib/python3.12/site-packages/transformers/utils/hub.py", line 479, in cached_files
      hf_hub_download(
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
      return fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1014, in hf_hub_download
      return _hf_hub_download_to_cache_dir(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1121, in _hf_hub_download_to_cache_dir
      _raise_on_head_call_error(head_call_error, force_download, local_files_only)
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1662, in _raise_on_head_call_error
      raise head_call_error
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1550, in _get_metadata_or_catch_error
      metadata = get_hf_file_metadata(
                 ^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
      return fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 1467, in get_hf_file_metadata
      r = _request_wrapper(
          ^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 283, in _request_wrapper
      response = _request_wrapper(
                 ^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/file_download.py", line 307, in _request_wrapper
      hf_raise_for_status(response)
    File "/usr/local/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 453, in hf_raise_for_status
      raise _format(RepositoryNotFoundError, message, response) from e
  huggingface_hub.errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-6997d8d9-4279887f656a9868337b07fc;415534fa-dd25-4769-a30e-50b32498010f)
  Repository Not Found for url: https://huggingface.co/llama-3.1-8b-instruct/resolve/main/processor_config.json.
  Please make sure you specified the correct `repo_id` and `repo_type`.

Command line:

accelerate launch --gpu_ids=all --num_processes=1 --num_machines=1 --machine_rank=0 /workspace/tfs/inference_api.py --pretrained_model_name_or_path=meta-llama/Llama-3.1-8B-Instruct --revision=0e9e39f249a16976918f6564b8830bc894c89659 --torch_dtype=bfloat16 --pipeline=text-generation --chat_template=/workspace/chat_templates/llama-3-instruct.jinja --allow_remote_files --served_model_name=llama-3.1-8b-instruct
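One plausible reading of the trace (an interpretation, not something confirmed in this thread): the serve engine resolves the requested model name as a Hub repo id, so the served alias llama-3.1-8b-instruct 404s even though the launch command points at the real repo. Roughly:

  from huggingface_hub import hf_hub_download
  from huggingface_hub.errors import RepositoryNotFoundError

  # The served alias is not a Hub repo id, so resolution 404s, as in the log:
  try:
      hf_hub_download("llama-3.1-8b-instruct", "config.json")
  except RepositoryNotFoundError as err:
      print("alias does not resolve:", err)

  # The actual (gated) repo id from the launch command would resolve, given
  # a token with access:
  # hf_hub_download("meta-llama/Llama-3.1-8B-Instruct", "config.json", token="hf_...")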
