feat: integrate transformers OpenAI-compatible serve engine (#1384) #1765
robert-cronin wants to merge 14 commits into kaito-project:main
Conversation
feat: integrate transformers OpenAI-compatible serve engine (kaito-project#1384)
Signed-off-by: Robert Cronin <robert@robertcronin.com>
Super cool! Have you done manual tests, and do you mind adding the test results to the change description? I also think it is worth adding an e2e test that exercises the HF server with a small model.
Thank you, yes, I did some manual testing on SmolLM2-135M and added some unit tests, which passed in the unit-tests GH workflow. But it seems pytest has been hanging for 2 hours now. Could you help me cancel that job? I will fix the hang and add e2e tests in a new commit.
Signed-off-by: Robert Cronin <robert@robertcronin.com>
Pull request overview
This PR integrates HuggingFace's `transformers serve` OpenAI-compatible engine as a new inference server (`inference.py`), providing day 1 support for new models. The implementation wraps the transformers library's built-in serving functionality while preserving KAITO's adapter loading and configuration logic. The new server becomes the default for all transformers-based models by changing the `DefaultTransformersMainFile` constant.
Changes:
- Added new `inference.py` wrapping `transformers serve` with OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/models`)
- Updated `DefaultTransformersMainFile` to point to `inference.py` instead of `inference_api.py`
- Added comprehensive unit tests for the new inference server and updated e2e tests to validate the new endpoints
- Added the `transformers[serving]` dependency to enable the serving functionality
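For context, requests to the new `/v1/chat/completions` endpoint follow the standard OpenAI chat-completions shape. A minimal sketch of building such a request body (the model id is the one mentioned in the thread; the field values are illustrative, not taken from this PR's code):

```python
import json

def build_chat_request(model: str, user_message: str) -> dict:
    """Return a /v1/chat/completions request body in the OpenAI format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        # max_tokens is optional in the OpenAI schema; 64 is an arbitrary choice here.
        "max_tokens": 64,
    }

payload = build_chat_request("HuggingFaceTB/SmolLM2-135M", "Hello!")
print(json.dumps(payload, indent=2))
```

Any OpenAI-compatible client can POST this body to the server; the point of the PR is that no custom serving code is needed to get this interface.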
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| presets/workspace/inference/text-generation/inference.py | New OpenAI-compatible inference server wrapping transformers serve engine with adapter support |
| presets/workspace/inference/text-generation/tests/test_inference.py | Comprehensive unit tests for the new inference server endpoints and metrics |
| presets/workspace/dependencies/requirements.txt | Added transformers[serving] extra to enable serving functionality |
| pkg/workspace/inference/preset_inference_types.go | Changed DefaultTransformersMainFile constant to use inference.py |
| pkg/workspace/inference/preset_inferences_test.go | Updated tests to expect inference.py instead of inference_api.py |
| pkg/utils/test/test_model.go | Updated all test model definitions to use inference.py |
| test/e2e/preset_test.go | Added validateChatCompletionsEndpoint function and calls in relevant test cases |
| docker/presets/models/tfs/Dockerfile | Added inference.py to Docker image alongside existing inference_api.py |
Signed-off-by: Robert Cronin <robert@robertcronin.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Failed in the private model case (command line).
Integrate transformers OpenAI-compatible serve engine
Reason for Change:
Add a new inference server (`inference.py`) that wraps HuggingFace's built-in `transformers serve` OpenAI-compatible engine, providing `/v1/chat/completions` and `/v1/models` endpoints. This enables day 1 support for new HuggingFace models without needing custom serving code.
Requirements
Issue Fixed:
Fixes #1384
Notes for Reviewers:
This adds `inference.py` wrapping `transformers serve` and switches `DefaultTransformersMainFile` to point to it, making it the default HF runtime. The old `inference_api.py` is preserved in the repo but is no longer launched by default.
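Since the e2e tests now validate the new endpoints, it may help reviewers to recall the response shape `/v1/models` is expected to return: the standard OpenAI model-list format. A minimal sketch of parsing such a response (the sample JSON is illustrative, not captured from a running server):

```python
import json

# Illustrative /v1/models response in the standard OpenAI list format.
SAMPLE_RESPONSE = json.dumps({
    "object": "list",
    "data": [{"id": "HuggingFaceTB/SmolLM2-135M", "object": "model"}],
})

def list_model_ids(raw: str) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    body = json.loads(raw)
    if body.get("object") != "list":
        raise ValueError("unexpected response shape")
    return [entry["id"] for entry in body.get("data", [])]

print(list_model_ids(SAMPLE_RESPONSE))  # ['HuggingFaceTB/SmolLM2-135M']
```

A check along these lines is the kind of assertion the new `validateChatCompletionsEndpoint`-style e2e helpers can make against the served model.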