Add /v1/completions endpoint (OpenAI legacy completions API) to transformers serve#44558

Merged
SunMarc merged 1 commit into huggingface:main from rain-1:main on Apr 22, 2026

Conversation

@rain-1
Contributor

@rain-1 rain-1 commented Mar 10, 2026

Adds support for the legacy text completions endpoint, which accepts a freeform text prompt (no chat template) and returns generated text in choices[].text. Supports both streaming and non-streaming modes, suffix for fill-in-the-middle insertion, and proper finish_reason detection.

What does this PR do?

Hello, my motivation for this:

I work with base models. I often need to continue text documents, so I need the OpenAI /v1/completions endpoint, which uses an LLM to continue text. This is applicable to base models/pretrained foundation models, which have not been post-trained as instruct models to follow a chat template.

The 'transformers' tool is amazingly useful for bringing up a quick API endpoint, and I would love this to support the ability to continue a prompt! That's why I worked with Claude Code/Opus 4.6 to produce this PR.

Note: while this is a legacy feature of the OpenAI API (they have moved away from providing base model support), vLLM still supports it. It's very useful for research or any work with pretrained models.

Here's an example script that tests the functionality:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Non-streaming
resp = client.completions.create(
    model="Qwen/Qwen3.5-0.8B-Base",
    prompt="The capital of France is",
    max_tokens=20
)
for i, choice in enumerate(resp.choices):
    print(i)
    print("-")
    print(choice.text)
    print("\n\n")

# Streaming
for chunk in client.completions.create(
    model="Qwen/Qwen3.5-0.8B-Base",
    prompt="Once upon a time",
    max_tokens=50,
    stream=True,
):
    print(chunk.choices[0].text, end="", flush=True)

So I run transformers serve, then run that script, and I get this result:

0
-
 Paris, and the capital of the United States is Washington, D.C. The capital of the United



, in a magical land of science, there was a very special kind of animal called a jellyfish. This jellyfish had a really interesting way of living.

So this implementation is working well for my purposes.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). No, this provides a new feature to the transformers command line tool.
  • Did you read the contributor guideline,
    Pull Request section? Yes.
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case. No, this PR is the first contact.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings. Yes, the API endpoint is documented.
  • Did you write any new necessary tests? Yes, we added a test for this feature, and also tested the functionality with an independent Python script.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Based on git blame, I tag @LysandreJik.

@rain-1
Contributor Author

rain-1 commented Mar 10, 2026

Added documentation for the new API endpoint

@LysandreJik
Member

Thanks for the PR, I'll take a look!

Member

@stevhliu stevhliu left a comment


the docs side lgtm, thanks!

Comment thread docs/source/en/serve-cli/serving.md Outdated
Comment thread docs/source/en/serve-cli/serving.md Outdated
Comment thread docs/source/en/serve-cli/serving.md Outdated
@rain-1
Contributor Author

rain-1 commented Mar 11, 2026

Thanks for those improvements @stevhliu, I've committed them all.

@LysandreJik
Member

Hey @rain-1, we're making significant changes to the structure of serve at the moment with @SunMarc, sorry your PR is getting delayed.

Btw, what prevents you from updating to a newer, maintained API? I'm not sure if it's transformers serve's place to introduce deprecated API endpoints instead of supporting the existing/future ones.

@rain-1
Contributor Author

rain-1 commented Mar 22, 2026

@LysandreJik No worries. I'll hold off on other contributions until this is done.

newer, maintained API

Do any of these support text continuation?

I'm working with pretrained LLMs. They do not have a chat template, so I need an API endpoint that does input "Once upon a time, " -> output "Once upon a time, there was a princess who".

I'm not sure if it's transformers serve's place to introduce deprecated API endpoints

OpenAI deprecated this endpoint because they cannot safely serve base models to customers as part of their product; it's hard to do safety tuning on those.

Regardless of that, LLM development still has the pretraining phase.

Most tools use a chat template, which is only trained into the model in the post-training phase.

@rain-1
Contributor Author

rain-1 commented Apr 6, 2026

@LysandreJik @stevhliu What are you thinking about this PR? Any response to my justification for the value of this?

@SunMarc
Member

SunMarc commented Apr 15, 2026

Hey @rain-1, with the refactor done, we are open to having more endpoints, as it should be way simpler to maintain. Feel free to update your PR to the latest version of serve.

@rain-1
Contributor Author

rain-1 commented Apr 20, 2026

@SunMarc awesome! Here's a PR.

Member

@SunMarc SunMarc left a comment


Thanks, this is already looking quite good, as it is simple to look at the code and see what is supported @rain-1. Can you make sure that we have feature parity between non-streaming and streaming? I also left a bunch of comments, but they are small fixes and we can merge this soon.

Comment on lines +1 to +15
# Copyright 2026 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Handler for the /v1/completions endpoint (OpenAI legacy Completions API).
Member


Let's just call the file completion.py. I prefer not to add "legacy" to it.

Contributor Author


Comment thread src/transformers/cli/serving/server.py Outdated
from fastapi.responses import JSONResponse, StreamingResponse

from .chat_completion import ChatCompletionHandler
from .legacy_completion import LegacyCompletionHandler
Member


Remove legacy. Just call it CompletionHandler. Do the same for the others.

Contributor Author


Renamed it to CompletionHandler.

Comment on lines +90 to +98
prompt = body.get("prompt", "")
if isinstance(prompt, list):
    if len(prompt) > 0 and isinstance(prompt[0], str):
        prompt = "".join(prompt)
    else:
        raise HTTPException(
            status_code=400,
            detail="Token array prompts are not supported. Please pass a string prompt.",
        )
Member


Do we really need this? Can the prompt really be a list? If it's really needed, instead of having this here, can you have a new method called get_processor_inputs maybe? Also, does the completions API only expect text?

Contributor Author


It should be text only. I think we should be strict about taking only a plain string; no need to take a list of strings and join them if the client can just do that.

The clearest approach is to validate that it's a string and return a 400 if not. We won't need a get_processor_inputs method because we don't do any complex parsing or validation.

Yeah, this is text only.
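The strict approach described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: validate_prompt is a hypothetical helper name, and a plain ValueError stands in for FastAPI's HTTPException(status_code=400, detail=...).

```python
# Hypothetical helper (not the PR's actual code): strictly validate that the
# request body's "prompt" field is a plain string; reject token arrays and
# lists outright. A ValueError stands in for FastAPI's HTTPException.
def validate_prompt(body: dict) -> str:
    prompt = body.get("prompt", "")
    if not isinstance(prompt, str):
        raise ValueError("Token array prompts are not supported. Please pass a string prompt.")
    return prompt
```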


suffix = body.get("suffix")

if body.get("stream"):
Member


Not a big deal, but let's try to keep the same code as the others: streaming = body.get("stream")

Contributor Author


Thank you! Done.

Comment on lines +163 to +169
chunk = {
    "id": request_id,
    "object": "text_completion",
    "created": int(time.time()),
    "model": model_id,
    "choices": [{"text": text, "index": 0, "logprobs": None}],
}
Member


You mean we can have something like _build_chunk_sse where finish_reason is None?

Contributor Author


Using _build_chunk_sse instead of constructing the object directly now.
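A helper along the lines of _build_chunk_sse might look like this. This is a sketch with an assumed signature, not the PR's implementation; it wraps one text_completion chunk in Server-Sent Events framing ("data: <json>" followed by a blank line).

```python
import json
import time


def build_chunk_sse(request_id: str, model_id: str, text: str, finish_reason=None) -> str:
    # Sketch (hypothetical signature): build one streaming chunk in the
    # OpenAI text_completion shape and frame it as a Server-Sent Event.
    # finish_reason stays None for intermediate chunks.
    chunk = {
        "id": request_id,
        "object": "text_completion",
        "created": int(time.time()),
        "model": model_id,
        "choices": [{"text": text, "index": 0, "logprobs": None, "finish_reason": finish_reason}],
    }
    return f"data: {json.dumps(chunk)}\n\n"
```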

Comment thread src/transformers/cli/serving/completion.py
Comment on lines +224 to +228
usage = CompletionUsage(
    prompt_tokens=input_len,
    completion_tokens=completion_tokens,
    total_tokens=input_len + completion_tokens,
)
Member


add this for streaming also

Contributor Author


Added this to streaming as well!
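The token accounting now shared by both paths can be sketched as follows. This is a minimal illustration: the PR uses the OpenAI CompletionUsage type, approximated here as a plain dict under a hypothetical helper name.

```python
def build_usage(prompt_tokens: int, completion_tokens: int) -> dict:
    # Sketch: same accounting as the CompletionUsage object quoted above,
    # attached to the final streaming chunk as well as to non-streaming
    # responses (plain dict and helper name are illustrative assumptions).
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```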

Comment thread tests/cli/test_serve.py Outdated
Comment on lines +984 to +985
class TestLegacyCompletion(unittest.TestCase):
"""Integration tests for /v1/completions with a real model."""
Member


update name to TestCompletion

Contributor Author


done

Comment thread tests/cli/test_serve.py Outdated

@slow
@require_serve
class TestLegacyCompletion(unittest.TestCase):
Member


Try to add a few more integration tests that matter, I think. There are plenty of tests you can take inspiration from in response or chat completion. Maybe you can add some CB (continuous batching) tests, as it is quite an important use case. For this class, since we will probably not update it much, let's just have one class.

Contributor Author

@rain-1 rain-1 Apr 22, 2026


New tests added!

Non-streaming

  • test_non_streaming — basic response shape: text_completion object, non-empty text, valid finish_reason
  • test_non_streaming_usage — prompt/completion/total token counts are correct
  • test_finish_reason_length — max_tokens=1 gives finish_reason="length"
  • test_finish_reason_stop — model generates to EOS, gives finish_reason="stop"
  • test_stop_strings — generation halts at the stop word
  • test_suffix — suffix is appended to generated text

Streaming

  • test_streaming — chunks contain text, last chunk has a valid finish_reason
  • test_streaming_usage — last chunk carries correct token usage
  • test_streaming_finish_reason_length — max_tokens=1 stream ends with "length"
  • test_streaming_suffix — suffix appears as a final text chunk
  • test_request_cancellation — closing the stream early doesn't crash the server

Continuous batching

  • test_cb_streaming — CB path produces text with a valid finish_reason
  • test_cb_non_streaming — CB path returns a full non-empty response
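The finish_reason behavior these tests exercise follows the OpenAI completions convention. A hedged sketch (hypothetical helper, not the PR's code): "stop" when generation ends at EOS or a stop string, "length" when the max_tokens budget is exhausted first.

```python
def detect_finish_reason(num_generated: int, max_tokens: int, hit_eos: bool, hit_stop_string: bool) -> str:
    # Sketch of the OpenAI completions convention (helper name and signature
    # are illustrative assumptions, not the PR's actual code).
    if hit_eos or hit_stop_string:
        return "stop"          # generation ended naturally or at a stop word
    if num_generated >= max_tokens:
        return "length"        # token budget exhausted
    return "stop"
```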

@rain-1 rain-1 force-pushed the main branch 2 times, most recently from 84cb51a to 4e64376 on April 22, 2026 07:03

Add /v1/completions endpoint (OpenAI legacy completions API) to `transformers serve`

Adds support for the legacy text completions endpoint, which accepts a
freeform text prompt (no chat template) and returns generated text in
choices[].text. Supports both streaming and non-streaming modes, suffix
for fill-in-the-middle insertion, and proper finish_reason detection.

The implementation follows the refactored serving architecture: a new
CompletionHandler in serving/completion.py extends BaseHandler, and the
route is registered in serving/server.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@SunMarc
Member

SunMarc commented Apr 22, 2026

run-slow: cli

Member

@SunMarc SunMarc left a comment


Awesome, thanks for iterating on this PR!

@github-actions
Copy link
Copy Markdown
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["cli"]
quantizations: []

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN c10dde9d workflow commit (merge commit)
PR f3b1c5fd branch commit (from PR)
main 2da596b0 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@SunMarc SunMarc added this pull request to the merge queue Apr 22, 2026
@SunMarc
Member

SunMarc commented Apr 22, 2026

cc @LysandreJik for viz

Merged via the queue into huggingface:main with commit 0ea540e Apr 22, 2026
18 checks passed