Add /v1/completions endpoint (OpenAI legacy completions API) to transformers serve #44558
Conversation
|
Added documentation for the new API endpoint
|
Thanks for the PR, I'll take a look!
|
Thanks for those improvements @stevhliu, I've committed them all.
|
Hey @rain-1, we're making significant changes to the structure of serve at the moment with @SunMarc, so sorry your PR is getting delayed. By the way, what prevents you from updating to a newer, maintained API? I'm not sure it's transformers serve's place to introduce deprecated API endpoints instead of supporting the existing/future ones.
|
@LysandreJik No worries. I'll hold off on other contributions until this is done.
Do any of these support text continuation? I'm working with pretrained LLM models. They do not have a chat template, so I need an API endpoint that does input "Once upon a time, " -> output "Once upon a time, there was a princess who".
OpenAI deprecated this endpoint because they cannot safely serve base models to customers as part of their product; it's hard to do safety tuning on those. Regardless, LLM development still has the pretraining phase. Most tools use a chat template, which is only trained into the model in the post-training phase.
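To illustrate the request shape involved, here is a minimal sketch of a legacy-completions request body. The model name and defaults are illustrative assumptions, not values from this PR:

```python
import json


def build_completion_request(prompt: str, model: str, max_tokens: int = 16,
                             stream: bool = False) -> dict:
    """Build a minimal /v1/completions request body (legacy Completions shape)."""
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "stream": stream}


# A raw continuation request: no chat template, just text in, text out.
body = build_completion_request("Once upon a time, ", "my-base-model")
print(json.dumps(body))
```

The point is that the prompt is freeform text, unlike the messages array of the chat completions API.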
|
@LysandreJik @stevhliu What do you think about this PR? Any response to my justification for its value?
|
Hey @rain-1, with the refactor done, we are open to having more endpoints, as it should be much simpler to maintain. Feel free to update your PR to the latest version of serve.
|
@SunMarc awesome! Here's a PR. |
SunMarc
left a comment
Thanks, this is already looking quite good, as it is simple to read the code and see what is supported @rain-1. Can you make sure that we have feature parity between non-streaming and streaming? I also left a bunch of comments, but they are small fixes and we can merge this soon.
```python
# Copyright 2026 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Handler for the /v1/completions endpoint (OpenAI legacy Completions API).
```
Let's just call the file completion.py; I prefer not to have "legacy" in the name.
```python
from fastapi.responses import JSONResponse, StreamingResponse

from .chat_completion import ChatCompletionHandler
from .legacy_completion import LegacyCompletionHandler
```
Remove "legacy" and just call it CompletionHandler; do the same for the others.
Renamed it to CompletionHandler.
```python
prompt = body.get("prompt", "")
if isinstance(prompt, list):
    if len(prompt) > 0 and isinstance(prompt[0], str):
        prompt = "".join(prompt)
    else:
        raise HTTPException(
            status_code=400,
            detail="Token array prompts are not supported. Please pass a string prompt.",
        )
```
Do we really need this? Can the prompt actually be a list? If it's really needed, instead of having this here, can you add a new method called get_processor_inputs? Also, does the completions API only expect text?
It should be text-only. I think we should be strict about accepting only a plain string; there's no need to take a list of strings and join them if the client can just do that.
The clearest approach is to validate that it's a string and return a 400 if not. We won't need a get_processor_inputs method because we don't do any complex parsing or validation.
Yeah, this is text only.
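A self-contained sketch of the strict check described above, with a plain ValueError standing in for FastAPI's HTTPException (the function name is illustrative, not the PR's):

```python
def get_prompt(body: dict) -> str:
    """Return the prompt, accepting only a plain string (no token arrays or lists)."""
    prompt = body.get("prompt", "")
    if not isinstance(prompt, str):
        # In the real handler this would raise HTTPException(status_code=400, ...).
        raise ValueError("400: prompt must be a plain string")
    return prompt


print(get_prompt({"prompt": "Once upon a time, "}))
```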
```python
suffix = body.get("suffix")

if body.get("stream"):
```
Not a big deal, but let's try to keep the same code style as the others: `streaming = body.get("stream")`.
```python
chunk = {
    "id": request_id,
    "object": "text_completion",
    "created": int(time.time()),
    "model": model_id,
    "choices": [{"text": text, "index": 0, "logprobs": None}],
}
```
You mean we can have something like _build_chunk_sse where finish_reason is None?
Using _build_chunk_sse instead of constructing the object directly now.
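For reference, a hedged sketch of what a _build_chunk_sse-style helper could look like, based on the chunk shape in the diff above; the actual signature in the PR may differ:

```python
import json
import time


def build_chunk_sse(request_id, model_id, text, finish_reason=None):
    """Serialize one streaming chunk as a server-sent-event data line."""
    chunk = {
        "id": request_id,
        "object": "text_completion",
        "created": int(time.time()),
        "model": model_id,
        "choices": [{"text": text, "index": 0, "logprobs": None,
                     "finish_reason": finish_reason}],
    }
    # SSE frames are "data: <payload>" followed by a blank line.
    return f"data: {json.dumps(chunk)}\n\n"


event = build_chunk_sse("cmpl-1", "my-model", "Hello")
print(event)
```

Intermediate chunks pass finish_reason=None; the final chunk carries "stop" or "length".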
```python
usage = CompletionUsage(
    prompt_tokens=input_len,
    completion_tokens=completion_tokens,
    total_tokens=input_len + completion_tokens,
)
```
Added this to streaming as well!
```python
class TestLegacyCompletion(unittest.TestCase):
    """Integration tests for /v1/completions with a real model."""


@slow
@require_serve
class TestLegacyCompletion(unittest.TestCase):
```
Try to add a few more integration tests that matter, I think. There are plenty of tests you can take inspiration from in responses or chat completion. Maybe you can add some CB (continuous batching) tests, as it is quite an important use case. For this class, since we will probably not update it much, let's just have one class.
New tests added!
Non-streaming
- test_non_streaming — basic response shape: text_completion object, non-empty text, valid finish_reason
- test_non_streaming_usage — prompt/completion/total token counts are correct
- test_finish_reason_length — max_tokens=1 gives finish_reason="length"
- test_finish_reason_stop — model generates to EOS, gives finish_reason="stop"
- test_stop_strings — generation halts at the stop word
- test_suffix — suffix is appended to generated text
Streaming
- test_streaming — chunks contain text, last chunk has a valid finish_reason
- test_streaming_usage — last chunk carries correct token usage
- test_streaming_finish_reason_length — max_tokens=1 stream ends with "length"
- test_streaming_suffix — suffix appears as a final text chunk
- test_request_cancellation — closing the stream early doesn't crash the server
Continuous batching
- test_cb_streaming — CB path produces text with a valid finish_reason
- test_cb_non_streaming — CB path returns a full non-empty response
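As an illustration of the non-streaming checks listed above, a minimal assertion helper; the field names follow the OpenAI legacy schema, not necessarily this PR's exact test code:

```python
def check_completion_response(resp: dict) -> None:
    """Validate the basic shape checked by a test like test_non_streaming."""
    assert resp["object"] == "text_completion"
    choice = resp["choices"][0]
    assert isinstance(choice["text"], str) and choice["text"]
    assert choice["finish_reason"] in ("stop", "length")
    usage = resp["usage"]
    assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]


# A hypothetical well-formed response for demonstration.
sample = {
    "object": "text_completion",
    "choices": [{"text": "there was a princess", "index": 0,
                 "logprobs": None, "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 4, "total_tokens": 9},
}
check_completion_response(sample)
print("ok")
```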
…sformers serve` Adds support for the legacy text completions endpoint, which accepts a freeform text prompt (no chat template) and returns generated text in choices[].text. Supports both streaming and non-streaming modes, suffix for fill-in-the-middle insertion, and proper finish_reason detection. The implementation follows the refactored serving architecture: a new CompletionHandler in serving/completion.py extends BaseHandler, and the route is registered in serving/server.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
run-slow: cli
SunMarc
left a comment
Awesome, thanks for iterating on this PR!
|
This comment contains models: ["cli"]
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
|
cc @LysandreJik for viz
What does this PR do?
Hello, my motivation for this:
I work with base models and often need to continue text documents, so I need the OpenAI /v1/completions endpoint, which uses an LLM to continue text. This is applicable to base models/pretrained foundation models, which have not been post-trained as instruct models to follow a chat template.
The transformers tool is amazingly useful for bringing up a quick API endpoint, and I would love for it to support the ability to continue a prompt! That's why I worked with Claude Code/Opus 4.6 to produce this PR.
Note: while this is a legacy feature of the OpenAI API (they have moved away from providing base model support), vLLM still supports it. It's very useful for research or any work with pretrained models.
Here's an example script that tests the functionality.
I run `transformers serve`, then run that script, and I get the result, so this implementation is working well for my purposes.
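The original script isn't reproduced here; a minimal stand-in using only the standard library might look like this. The localhost port and model name are assumptions, not values from the PR:

```python
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed default serve address


def extract_text(response_json: dict) -> str:
    """Pull the generated continuation out of a legacy-completions response."""
    return response_json["choices"][0]["text"]


def complete(prompt: str, model: str = "my-base-model", max_tokens: int = 32) -> str:
    """POST a freeform prompt and return the model's continuation."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return extract_text(json.load(resp))


# Example (requires a running server):
#   print(complete("Once upon a time, "))
```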
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section? Yes
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. No, this PR is the first contact
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings. Yes, API endpoint documented
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Based on git blame, I tag @LysandreJik.