
Streaming chat completions

Stream tokens as the model generates them instead of waiting for the whole response. Pass stream=True to chat.completions.create(...) and you get an iterator of ChatCompletionChunk objects, each carrying one incremental piece of text on chunk.choices[0].delta.content. Use this for chat UIs, agent loops, long generations, and anywhere time to first token matters.

Inference on Pareta is OpenAI-compatible, so the streaming format here is the same vLLM-style data-only SSE that the openai SDK consumes. The model id you pass is the endpoint id you get from deploying an endpoint, and streamed inference is metered against your org balance exactly like a non-streaming call.

Quickstart

Python

from pareta import Pareta

pa = Pareta.from_env() # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)

stream = pa.chat.completions.create(
    model="ep_contract_kie",  # an endpoint id from endpoints.deploy(...)
    messages=[{"role": "user", "content": "Write a haiku about throughput."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

TypeScript

import { Pareta } from "pareta";

const pa = Pareta.fromEnv(); // reads PARETA_API_KEY (+ optional PARETA_BASE_URL)

const stream = pa.chat.completions.create({
  model: "ep_contract_kie", // an endpoint id from endpoints.deploy(...)
  messages: [{ role: "user", content: "Write a haiku about throughput." }],
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0].delta.content;
  if (delta) {
    process.stdout.write(delta);
  }
}
console.log();

stream=True changes the return type: instead of a single ChatCompletion, create(...) returns an Iterator[ChatCompletionChunk]. Nothing is sent until you start iterating, and the connection stays open for the life of the loop.

Reading a chunk

A streaming chunk has the same schema as a ChatCompletion, but each choice carries a delta (the incremental token) instead of a full message:

Python

chunk.choices[0].delta.content # str | None — the new text in this chunk
chunk.choices[0].delta.role # str | None — usually only set on the first chunk
chunk.choices[0].finish_reason # str | None — "stop" / "length" on the last chunk
chunk.id # str | None
chunk.model # str | None

TypeScript

chunk.choices[0].delta.content // string | null — the new text in this chunk
chunk.choices[0].delta.role // string | null — usually only set on the first chunk
chunk.choices[0].finishReason // string | null — "stop" / "length" on the last chunk
chunk.id // string | null
chunk.model // string | null

delta.content is None (null in TypeScript) on chunks that carry no text (for example the opening role chunk, or a final chunk that only sets finish_reason), so always guard with an if delta: check before printing or appending. The stream ends when the server sends [DONE]; the SDK consumes that sentinel and stops the iterator for you, so a plain for loop terminates cleanly.
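For intuition about what the SDK handles on the wire, here is a minimal sketch of parsing a data-only SSE stream. It assumes each event is a single `data: <json>` line and that the stream ends with `data: [DONE]`. The helper name `iter_sse_payloads` and the sample lines are hypothetical and not taken from the SDK, whose real parser may differ.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from data-only SSE lines (hypothetical helper)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and non-data fields are skipped
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # sentinel: stop iterating, nothing is yielded for it
        yield json.loads(data)


# Illustrative wire lines, not captured from a real server.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hi"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]

# Chunks without text contribute an empty string, mirroring the `if delta:` guard.
text = "".join(
    payload["choices"][0]["delta"].get("content") or ""
    for payload in iter_sse_payloads(sample)
)
print(text)
```

In normal use you never write this yourself; iterating the stream object is enough.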

Need the raw server JSON for a field the typed layer does not surface? Every response object keeps it: chunk.to_dict() returns the untouched payload.

Accumulating the full text

Collect the deltas into a buffer to reconstruct the complete message:

Python

from pareta import Pareta

pa = Pareta.from_env()

chunks = pa.chat.completions.create(
    model="ep_contract_kie",
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Summarize what an invoice number is."},
    ],
    stream=True,
    temperature=0.2,  # extra OpenAI params pass straight through
    max_tokens=256,
)

parts = []
finish_reason = None
for chunk in chunks:
    choice = chunk.choices[0]
    if choice.delta.content:
        parts.append(choice.delta.content)
    if choice.finish_reason:
        finish_reason = choice.finish_reason

full_text = "".join(parts)
print(full_text)
print("finish_reason:", finish_reason)  # e.g. "stop" or "length"

TypeScript

import { Pareta } from "pareta";

const pa = Pareta.fromEnv();

const chunks = pa.chat.completions.create({
  model: "ep_contract_kie",
  messages: [
    { role: "system", content: "You are concise." },
    { role: "user", content: "Summarize what an invoice number is." },
  ],
  stream: true,
  temperature: 0.2, // extra OpenAI params pass straight through
  max_tokens: 256,
});

const parts: string[] = [];
let finishReason: string | null = null;
for await (const chunk of chunks) {
  const choice = chunk.choices[0];
  if (choice.delta.content) {
    parts.push(choice.delta.content);
  }
  if (choice.finishReason) {
    finishReason = choice.finishReason;
  }
}

const fullText = parts.join("");
console.log(fullText);
console.log("finishReason:", finishReason); // e.g. "stop" or "length"

A finish_reason of "length" means the model hit max_tokens before it was done; raise max_tokens if you need the full answer.

Note: token usage is not reliably populated on streamed chunks. If you need the usage counts (prompt_tokens / completion_tokens / total_tokens), make the same call with stream=False and read completion.usage.
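If you accumulate streams in more than one place, the loop can be wrapped in a small helper that also reports truncation. The sketch below is one way to do it. `collect_stream` and `fake_chunk` are hypothetical names, and the stand-in chunks only imitate the attributes used here (choices[0].delta.content and choices[0].finish_reason) so the example runs without a network call.

```python
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def collect_stream(chunks: Iterable) -> Tuple[str, Optional[str]]:
    """Join delta.content across chunks and return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason


def fake_chunk(content=None, finish_reason=None):
    """Stand-in for a ChatCompletionChunk, used only for this demo."""
    delta = SimpleNamespace(content=content, role=None)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


demo = [
    fake_chunk(),                        # opening chunk with no text
    fake_chunk("An invoice "),
    fake_chunk("number"),
    fake_chunk(finish_reason="length"),  # final chunk, no text
]

text, reason = collect_stream(demo)
truncated = reason == "length"
print(text, "(truncated)" if truncated else "")
```

With a real stream you would pass the iterator returned by create(..., stream=True) in place of `demo`, and raise max_tokens when `truncated` is true.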

Extra parameters

Any OpenAI chat parameter you pass as a keyword argument is forwarded verbatim in the request body: temperature, max_tokens, top_p, stop, frequency_penalty, and so on. There is no hardware knob — GPUs, quantization, and tensor-parallelism are resolved by Pareta when you deploy the endpoint, so the only model selector here is the endpoint id you pass to model.

Python

stream = pa.chat.completions.create(
    model="ep_contract_kie",
    messages=[{"role": "user", "content": "List three GPU-free wins."}],
    stream=True,
    top_p=0.9,
    stop=["\n\n"],
)

TypeScript

const stream = pa.chat.completions.create({
  model: "ep_contract_kie",
  messages: [{ role: "user", content: "List three GPU-free wins." }],
  stream: true,
  top_p: 0.9,
  stop: ["\n\n"],
});

Async streaming

AsyncPareta mirrors the sync client. create(...) is a coroutine, so await it once to get the async iterator, then drive it with async for:

Python

import asyncio
from pareta import AsyncPareta


async def main():
    async with AsyncPareta.from_env() as pa:
        stream = await pa.chat.completions.create(
            model="ep_contract_kie",
            messages=[{"role": "user", "content": "Stream me a limerick."}],
            stream=True,
        )
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                print(delta, end="", flush=True)
        print()


asyncio.run(main())

TypeScript

// There is no AsyncPareta in TypeScript: the single `Pareta` client is already
// async. `create({ stream: true })` returns an AsyncIterable<ChatCompletionChunk>
// directly — drive it with `for await`, no separate await for the stream handle.
import { Pareta } from "pareta";

const pa = Pareta.fromEnv();

const stream = pa.chat.completions.create({
  model: "ep_contract_kie",
  messages: [{ role: "user", content: "Stream me a limerick." }],
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0].delta.content;
  if (delta) {
    process.stdout.write(delta);
  }
}
console.log();

The async with block calls aclose() for you when the block exits, releasing the HTTP client. The chunk shape is identical to the sync path: chunk.choices[0].delta.content is the incremental text.
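A common reason to use the async client is to drain several streams at once. The sketch below shows the pattern with asyncio.gather. `fake_stream` is a hypothetical stand-in for the async iterator returned by create(..., stream=True), so the example runs offline; with the real client you would await create(...) and pass the result to `collect`.

```python
import asyncio
from types import SimpleNamespace


async def fake_stream(words):
    """Stand-in for an async chunk iterator (assumption, not the SDK's class)."""
    for word in words:
        await asyncio.sleep(0)  # hand control back to the loop, as a network read would
        delta = SimpleNamespace(content=word)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


async def collect(stream) -> str:
    """Drain one async stream into a single string."""
    parts = []
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)


async def main():
    # gather() runs both consumers on one event loop and returns
    # their results in argument order.
    return await asyncio.gather(
        collect(fake_stream(["There ", "once ", "was"])),
        collect(fake_stream(["a ", "limerick"])),
    )


results = asyncio.run(main())
print(results)
```

Each stream still holds its own connection open until it is fully consumed, so bound the number of concurrent streams to what your application can handle.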

Metering and errors

Streamed inference debits your org balance on success, the same as a non-streaming completion. Top-ups are browser-only; the SDK does not expose balance or payment methods. If the balance is empty, the call raises InsufficientCreditsError (HTTP 402) before any tokens flow:

Python

from pareta import Pareta
from pareta import InsufficientCreditsError, EndpointNotReadyError

pa = Pareta.from_env()

try:
    stream = pa.chat.completions.create(
        model="ep_contract_kie",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()
except InsufficientCreditsError:
    print("Out of credit: top up in the dashboard.")
except EndpointNotReadyError:
    print("Endpoint is cold or stopped: start it and retry.")

TypeScript

import { Pareta, InsufficientCreditsError, EndpointNotReadyError } from "pareta";

const pa = Pareta.fromEnv();

try {
  const stream = pa.chat.completions.create({
    model: "ep_contract_kie",
    messages: [{ role: "user", content: "Hello" }],
    stream: true,
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0].delta.content;
    if (delta) {
      process.stdout.write(delta);
    }
  }
  console.log();
} catch (e) {
  if (e instanceof InsufficientCreditsError) {
    console.log("Out of credit: top up in the dashboard.");
  } else if (e instanceof EndpointNotReadyError) {
    console.log("Endpoint is cold or stopped: start it and retry.");
  } else {
    throw e;
  }
}

A few things to know about how the stream behaves under failure:

  • model / messages validation is local. Passing an empty model or empty messages raises ValueError immediately, before any network call.
  • Errors surface before the first byte. Non-2xx responses (402, 401, 404, 503, and so on) are raised as the matching ParetaError subclass when the stream starts, not mid-loop. A stopped or cold endpoint raises EndpointNotReadyError (503).
  • Mid-stream drops are not retried. Retries cover only the initial connect/handshake. Once SSE bytes are flowing, a dropped connection raises APIConnectionError or APITimeoutError rather than silently resuming, because a partial generation cannot be safely continued. Wrap the loop and re-issue the request if you need at-least-once delivery.
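The "wrap the loop and re-issue" advice from the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's own retry logic: `StreamDropped` stands in for APIConnectionError / APITimeoutError, and `stream_with_restart` and `flaky_stream` are hypothetical names. The key point is that each attempt starts from an empty buffer, since partial output from a dropped stream is discarded rather than resumed.

```python
from typing import Callable, Iterable


class StreamDropped(Exception):
    """Stand-in for APIConnectionError / APITimeoutError in this sketch."""


def stream_with_restart(
    start_stream: Callable[[], Iterable[str]], max_attempts: int = 3
) -> str:
    """Re-issue the whole request after a mid-stream drop, discarding partial text."""
    last_error = None
    for _ in range(max_attempts):
        parts = []  # reset: a partial generation cannot be resumed
        try:
            for delta in start_stream():
                parts.append(delta)
            return "".join(parts)
        except StreamDropped as exc:
            last_error = exc
    raise RuntimeError(f"stream failed after {max_attempts} attempts") from last_error


# Demo: the first attempt drops after one delta, the second completes.
attempts = {"count": 0}


def flaky_stream():
    attempts["count"] += 1
    yield "Hello"
    if attempts["count"] == 1:
        raise StreamDropped("connection reset")
    yield ", world"


result = stream_with_restart(flaky_stream)
print(result)
```

With the real client, `start_stream` would call create(..., stream=True) and yield each non-empty delta.content. A UI that has already rendered partial text needs to clear it before the retry starts printing again.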

See error handling for the full exception hierarchy.