Add /v1/completions endpoint (OpenAI legacy completions API) to transformers serve#44558

Merged
SunMarc merged 1 commit into huggingface:main from rain-1:main on Apr 22, 2026

Conversation

@rain-1
Contributor

@rain-1 rain-1 commented Mar 10, 2026

Adds support for the legacy text completions endpoint, which accepts a freeform text prompt (no chat template) and returns generated text in choices[].text. Supports both streaming and non-streaming modes, suffix for fill-in-the-middle insertion, and proper finish_reason detection.

What does this PR do?

Hello, my motivation for this:

I work with base models. I often need to continue text documents, so I need the OpenAI /v1/completions endpoint, which uses an LLM to continue text. This is applicable to base models/pretrained foundation models, which have not been post-trained as instruct models to follow a chat template.

The 'transformers' tool is amazingly useful for bringing up a quick API endpoint, and I would love this to support the ability to continue a prompt! That's why I worked with Claude Code/Opus 4.6 to produce this PR.

Note: while this is a legacy feature of the OpenAI API (they have moved away from providing base model support), vLLM still supports it. It's very useful for research or any work with pretrained models.

Here's an example script that tests the functionality:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Non-streaming
resp = client.completions.create(
    model="Qwen/Qwen3.5-0.8B-Base",
    prompt="The capital of France is",
    max_tokens=20
)
for i, choice in enumerate(resp.choices):
    print(i)
    print("-")
    print(choice.text)
    print("\n\n")

# Streaming
for chunk in client.completions.create(
    model="Qwen/Qwen3.5-0.8B-Base",
    prompt="Once upon a time",
    max_tokens=50,
    stream=True,
):
    print(chunk.choices[0].text, end="", flush=True)

So I run transformers serve, then run that script, and I get this result:

0
-
 Paris, and the capital of the United States is Washington, D.C. The capital of the United



, in a magical land of science, there was a very special kind of animal called a jellyfish. This jellyfish had a really interesting way of living.

So this implementation is working well for my purposes.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). No, this provides a new feature to the transformers command line tool.
  • Did you read the contributor guideline,
    Pull Request section? Yes.
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case. No, this PR is the first contact.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings. Yes, the API endpoint is documented.
  • Did you write any new necessary tests? Yes, we added a test for this feature, and also tested the functionality with an independent Python script.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Based on git blame, I tag @LysandreJik.

@rain-1
Contributor Author

rain-1 commented Mar 10, 2026

Added documentation for the new API endpoint

@LysandreJik
Member

Thanks for the PR, I'll take a look!

Member

@stevhliu stevhliu left a comment


the docs side lgtm, thanks!

Comment thread docs/source/en/serve-cli/serving.md Outdated
Comment thread docs/source/en/serve-cli/serving.md Outdated
Comment thread docs/source/en/serve-cli/serving.md Outdated
@rain-1
Contributor Author

rain-1 commented Mar 11, 2026

Thanks for those improvements @stevhliu, I've committed them all.

@LysandreJik
Member

Hey @rain-1, we're making significant changes to the structure of serve at the moment with @SunMarc, sorry your PR is getting delayed.

Btw, what prevents you from updating to a newer, maintained API? I'm not sure if it's transformers serve's place to introduce deprecated API endpoints instead of supporting the existing/future ones.

@rain-1
Contributor Author

rain-1 commented Mar 22, 2026

@LysandreJik No worries. I'll hold off on other contributions until this is done.

newer, maintained API

Do any of these support text continuation?

I'm working with pretrained LLMs. They do not have a chat template, so I need an API endpoint that does input "Once upon a time, " -> output "Once upon a time, there was a princess who".

I'm not sure if it's transformers serve's place to introduce deprecated API endpoints

OpenAI deprecated this endpoint because they cannot safely serve base models to customers as part of their product; it's hard to do safety tuning on those.

Regardless of that, LLM development still has the pretraining phase.

Most tools use a chat template, which is only trained into the model in the post-training phase.

@rain-1
Contributor Author

rain-1 commented Apr 6, 2026

@LysandreJik @stevhliu What are you thinking about this PR? Any response to my justification for the value of this?

@SunMarc
Member

SunMarc commented Apr 15, 2026

Hey @rain-1, with the refactor done, we are open to having more endpoints, as it should be way simpler to maintain. Feel free to update your PR to the latest version of serve.

@rain-1
Contributor Author

rain-1 commented Apr 20, 2026

@SunMarc awesome! Here's a PR.

Member

@SunMarc SunMarc left a comment


Thanks, this is already looking quite good, as it is simple to look at the code and see what is supported @rain-1. Can you make sure that we have feature parity between non-streaming and streaming? I also left a bunch of comments, but they are small fixes and we can merge this soon.

Comment on lines +1 to +15
# Copyright 2026 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Handler for the /v1/completions endpoint (OpenAI legacy Completions API).
Member


Let's just call the file completion.py. I prefer not to add "legacy" to it.

Contributor Author


Comment thread src/transformers/cli/serving/server.py Outdated
from fastapi.responses import JSONResponse, StreamingResponse

from .chat_completion import ChatCompletionHandler
from .legacy_completion import LegacyCompletionHandler
Member


Remove legacy. Just call it CompletionHandler. Do the same for the others.

Contributor Author


Renamed it to CompletionHandler.

Comment on lines +90 to +98
prompt = body.get("prompt", "")
if isinstance(prompt, list):
    if len(prompt) > 0 and isinstance(prompt[0], str):
        prompt = "".join(prompt)
    else:
        raise HTTPException(
            status_code=400,
            detail="Token array prompts are not supported. Please pass a string prompt.",
        )
Member


Do we really need this? Can the prompt really be a list? If it's really needed, instead of having this here, can you have a new method called get_processor_inputs maybe? Also, does the completions API only expect text?

Contributor Author


It should be text only. I think we should be strict about taking only a plain string; no need to take a list of strings and join them if the client can just do that.

The clearest approach is to validate that it's a string and return a 400 if not. We won't need a get_processor_inputs method because we don't do any complex parsing or validation.

Yeah, this is text only.
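The strict approach described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: validate_prompt is a hypothetical helper name, and a plain ValueError stands in for FastAPI's HTTPException(status_code=400, detail=...).

```python
# Hypothetical helper (not the PR's actual code): strictly validate that the
# request body's "prompt" field is a plain string; reject token arrays and
# lists outright. A ValueError stands in for FastAPI's HTTPException.
def validate_prompt(body: dict) -> str:
    prompt = body.get("prompt", "")
    if not isinstance(prompt, str):
        raise ValueError("Token array prompts are not supported. Please pass a string prompt.")
    return prompt
```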


suffix = body.get("suffix")

if body.get("stream"):
Member


Not a big deal, but let's try to keep the same code as the others: streaming = body.get("stream")

Contributor Author


Thank you! Done.

Comment on lines +163 to +169
chunk = {
    "id": request_id,
    "object": "text_completion",
    "created": int(time.time()),
    "model": model_id,
    "choices": [{"text": text, "index": 0, "logprobs": None}],
}
Member


You mean we can have something like _build_chunk_sse where finish_reason is None?

Contributor Author


Using _build_chunk_sse instead of constructing the object directly now.
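A helper along the lines of _build_chunk_sse might look like this. This is a sketch with an assumed signature, not the PR's implementation; it wraps one text_completion chunk in Server-Sent Events framing ("data: <json>" followed by a blank line).

```python
import json
import time


def build_chunk_sse(request_id: str, model_id: str, text: str, finish_reason=None) -> str:
    # Sketch (hypothetical signature): build one streaming chunk in the
    # OpenAI text_completion shape and frame it as a Server-Sent Event.
    # finish_reason stays None for intermediate chunks.
    chunk = {
        "id": request_id,
        "object": "text_completion",
        "created": int(time.time()),
        "model": model_id,
        "choices": [{"text": text, "index": 0, "logprobs": None, "finish_reason": finish_reason}],
    }
    return f"data: {json.dumps(chunk)}\n\n"
```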

Comment thread src/transformers/cli/serving/completion.py
Comment on lines +224 to +228
usage = CompletionUsage(
    prompt_tokens=input_len,
    completion_tokens=completion_tokens,
    total_tokens=input_len + completion_tokens,
)
Member


add this for streaming also

Contributor Author


Added this to streaming as well!
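The token accounting now shared by both paths can be sketched as follows. This is a minimal illustration: the PR uses the OpenAI CompletionUsage type, approximated here as a plain dict under a hypothetical helper name.

```python
def build_usage(prompt_tokens: int, completion_tokens: int) -> dict:
    # Sketch: same accounting as the CompletionUsage object quoted above,
    # attached to the final streaming chunk as well as to non-streaming
    # responses (plain dict and helper name are illustrative assumptions).
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```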

Comment thread tests/cli/test_serve.py Outdated
Comment on lines +984 to +985
class TestLegacyCompletion(unittest.TestCase):
"""Integration tests for /v1/completions with a real model."""
Member


update name to TestCompletion

Contributor Author


done

Comment thread tests/cli/test_serve.py Outdated

@slow
@require_serve
class TestLegacyCompletion(unittest.TestCase):
Member


Try to add a few more integration tests that matter, I think. There are plenty of tests you can take inspiration from in response or chat completion. Maybe you can add some CB (continuous batching) tests, as it is quite an important use case. For this class, since we will probably not update it much, let's just have one class.

Contributor Author

@rain-1 rain-1 Apr 22, 2026


New tests added!

Non-streaming

  • test_non_streaming — basic response shape: text_completion object, non-empty text, valid finish_reason
  • test_non_streaming_usage — prompt/completion/total token counts are correct
  • test_finish_reason_length — max_tokens=1 gives finish_reason="length"
  • test_finish_reason_stop — model generates to EOS, gives finish_reason="stop"
  • test_stop_strings — generation halts at the stop word
  • test_suffix — suffix is appended to generated text

Streaming

  • test_streaming — chunks contain text, last chunk has a valid finish_reason
  • test_streaming_usage — last chunk carries correct token usage
  • test_streaming_finish_reason_length — max_tokens=1 stream ends with "length"
  • test_streaming_suffix — suffix appears as a final text chunk
  • test_request_cancellation — closing the stream early doesn't crash the server

Continuous batching

  • test_cb_streaming — CB path produces text with a valid finish_reason
  • test_cb_non_streaming — CB path returns a full non-empty response
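The finish_reason behavior these tests exercise follows the OpenAI completions convention. A hedged sketch (hypothetical helper, not the PR's code): "stop" when generation ends at EOS or a stop string, "length" when the max_tokens budget is exhausted first.

```python
def detect_finish_reason(num_generated: int, max_tokens: int, hit_eos: bool, hit_stop_string: bool) -> str:
    # Sketch of the OpenAI completions convention (helper name and signature
    # are illustrative assumptions, not the PR's actual code).
    if hit_eos or hit_stop_string:
        return "stop"          # generation ended naturally or at a stop word
    if num_generated >= max_tokens:
        return "length"        # token budget exhausted
    return "stop"
```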

@rain-1 rain-1 force-pushed the main branch 2 times, most recently from 84cb51a to 4e64376 on April 22, 2026 07:03

Add /v1/completions endpoint (OpenAI legacy completions API) to `transformers serve`

Adds support for the legacy text completions endpoint, which accepts a
freeform text prompt (no chat template) and returns generated text in
choices[].text. Supports both streaming and non-streaming modes, suffix
for fill-in-the-middle insertion, and proper finish_reason detection.

The implementation follows the refactored serving architecture: a new
CompletionHandler in serving/completion.py extends BaseHandler, and the
route is registered in serving/server.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@SunMarc
Member

SunMarc commented Apr 22, 2026

run-slow: cli

Member

@SunMarc SunMarc left a comment


Awesome, thanks for iterating on this PR!

@github-actions
Copy link
Copy Markdown
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["cli"]
quantizations: []

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN c10dde9d workflow commit (merge commit)
PR f3b1c5fd branch commit (from PR)
main 2da596b0 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@SunMarc SunMarc added this pull request to the merge queue Apr 22, 2026
@SunMarc
Member

SunMarc commented Apr 22, 2026

cc @LysandreJik for viz

Merged via the queue into huggingface:main with commit 0ea540e Apr 22, 2026
18 checks passed