Quickstart
Deploy the recommended open-weights model for a task and run inference against it, end to end, in about a dozen lines. Pareta picks the GPU and serving config for you, so there is no hardware to choose. Inference is OpenAI-compatible and metered against your org's balance.
Install
pip install pareta # or: uv add pareta / poetry add pareta
Authenticate
Mint a pareta_sk_ key in the dashboard (key management is browser-only) and
export it. Pareta.from_env() reads PARETA_API_KEY (and an optional
PARETA_BASE_URL).
export PARETA_API_KEY="pareta_sk_..."
The SDK only ever consumes a key. It never creates, lists, or revokes them, and it never exposes your balance or payment methods. Topping up credit is browser-only.
Deploy and run inference
This is the whole loop: name a task, let Pareta pick the recommended model,
deploy it, and send a request. The wait=True flag blocks through the deploy
SSE stream and hands you back a live Endpoint.
Python
from pareta import Pareta
pa = Pareta.from_env() # reads PARETA_API_KEY
task = "contract-key-fields" # a subtask id from the catalog
# Inspect the model deploy(model="recommended") will pick (a per-task alias).
print("recommended:", pa.tasks.recommended(task)) # e.g. "qwen-vl-2"
# Deploy it. No GPU, quantization, or parallelism knob — Pareta resolves all of it.
ep = pa.endpoints.deploy(task=task, model="recommended", wait=True)
print("live endpoint:", ep.id, ep.status) # e.g. "ep_a1b2c3" "live"
# Run OpenAI-compatible inference against the endpoint id.
resp = pa.chat.completions.create(
model=ep.id, # the endpoint id, not the alias
messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)
print("tokens:", resp.usage.total_tokens)
TypeScript
import { Pareta } from "pareta";
const pa = Pareta.fromEnv(); // reads PARETA_API_KEY
const task = "contract-key-fields"; // a subtask id from the catalog
// Inspect the model deploy({ model: "recommended" }) will pick (a per-task alias).
console.log("recommended:", await pa.tasks.recommended(task)); // e.g. "qwen-vl-2"
// Deploy it. No GPU, quantization, or parallelism knob — Pareta resolves all of it.
const ep = await pa.endpoints.deploy({ task, model: "recommended", wait: true });
console.log("live endpoint:", ep.id, ep.status); // e.g. "ep_a1b2c3" "live"
// Run OpenAI-compatible inference against the endpoint id.
const resp = await pa.chat.completions.create({
model: ep.id, // the endpoint id, not the alias
messages: [{ role: "user", content: "Say hello in one short sentence." }],
});
console.log(resp.choices[0].message.content);
console.log("tokens:", resp.usage.totalTokens);
Output:
recommended: qwen-vl-2
live endpoint: ep_a1b2c3 live
Hello, it is good to meet you.
tokens: 27
A few things worth pinning down:
taskis a subtask id (for example"contract-key-fields"). Discover ids withpa.tasks.list(), or turn a sentence into a task withpa.tasks.match("extract fields from contracts"). See Tasks.model="recommended"(the default) resolves server-side to the task's curated pick, falling back to the top open model on the leaderboard. You can also pass a specific per-task alias. Real model ids never reach the client.ep.idis what you pass tochat.completions.create(model=...). That is the deployed endpoint id, distinct fromep.model, which is the per-task public alias the endpoint serves.- No hardware knob.
deploy()takes onlytask,model, and an optionalname. Pareta selects the GPU and serving class from its registry.
Stream the response
Pass stream=True to get an iterator of ChatCompletionChunk. The incremental
text lives on chunk.choices[0].delta.content (it can be None on the first
and last chunks, so guard it).
Python
for chunk in pa.chat.completions.create(
model=ep.id,
messages=[{"role": "user", "content": "Write a haiku about invoices."}],
stream=True,
):
print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
TypeScript
for await (const chunk of pa.chat.completions.create({
model: ep.id,
messages: [{ role: "user", content: "Write a haiku about invoices." }],
stream: true,
})) {
process.stdout.write(chunk.choices[0].delta.content || "");
}
console.log();
Extra OpenAI parameters (temperature, max_tokens, top_p, and so on) pass
straight through as keyword arguments.
Cost and credit
Every successful completion debits your org's balance. If the balance is empty,
the call raises InsufficientCreditsError (HTTP 402). Top-up is browser-only.
Python
from pareta import InsufficientCreditsError
try:
resp = pa.chat.completions.create(model=ep.id, messages=[
{"role": "user", "content": "ping"},
])
except InsufficientCreditsError:
print("Out of credit — top up in the dashboard.")
TypeScript
import { InsufficientCreditsError } from "pareta";
try {
const resp = await pa.chat.completions.create({
model: ep.id,
messages: [{ role: "user", content: "ping" }],
});
} catch (e) {
if (e instanceof InsufficientCreditsError) {
console.log("Out of credit — top up in the dashboard.");
} else {
throw e;
}
}
Evaluation runs are metered the same way (open plus frontier compute). An
EvalRun reports its billed total on run.cost, a Decimal in dollars floored
to whole cents (so a sub-cent run reads Decimal("0.00")); the raw value is on
run.cost_micro_usd. See Evals.
Clean up
Stop the endpoint when you are done so it stops accruing cost, and close the client (or use it as a context manager).
Python
pa.endpoints.stop(ep.id) # later: pa.endpoints.start(ep.id) / pa.endpoints.delete(ep.id)
pa.close()
TypeScript
await pa.endpoints.stop(ep.id); // later: pa.endpoints.start(ep.id) / pa.endpoints.delete(ep.id)
// No close() in TS: the client owns no connection (it uses the native fetch),
// so there is nothing to release and no context-manager form to wrap it in.
Python
# Context-manager form closes the HTTP client for you.
with Pareta.from_env() as pa:
resp = pa.chat.completions.create(model=ep.id, messages=[
{"role": "user", "content": "hi"},
])
TypeScript
// No context manager in TS — just construct and use it; nothing to close.
const pa = Pareta.fromEnv();
const resp = await pa.chat.completions.create({
model: ep.id,
messages: [{ role: "user", content: "hi" }],
});
List what you can call
models.list() returns the OpenAI-compatible subset: deployed endpoints with a
live URL. Each id is usable directly in chat.completions.create(model=...).
Python
for m in pa.models.list():
print(m.id, m.owned_by)
TypeScript
for (const m of await pa.models.list()) {
console.log(m.id, m.ownedBy);
}
Async
AsyncPareta mirrors the sync client; resource methods are async def and
streams are async iterators.
Python
import asyncio
from pareta import AsyncPareta
async def main():
async with AsyncPareta.from_env() as pa:
ep = await pa.endpoints.deploy(
task="contract-key-fields", model="recommended", wait=True,
)
resp = await pa.chat.completions.create(
model=ep.id,
messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
asyncio.run(main())
TypeScript
// There is no AsyncPareta in TypeScript — the single `Pareta` client is already
// async. Every I/O method returns a Promise (await it), and streams are async
// iterables (`for await`). `tasks.recommended()` / `tasks.leaderboard()` are
// present here too (no sync-only carve-out).
import { Pareta } from "pareta";
const pa = Pareta.fromEnv();
const ep = await pa.endpoints.deploy({
task: "contract-key-fields",
model: "recommended",
wait: true,
});
const resp = await pa.chat.completions.create({
model: ep.id,
messages: [{ role: "user", content: "Say hello." }],
});
console.log(resp.choices[0].message.content);
One async difference to note: tasks.recommended() and tasks.leaderboard()
are sync-only for now.
Already using the OpenAI SDK?
You do not need this SDK just to call a deployed endpoint. Point the openai
client at your base_url plus your pareta_sk_ key:
Python
from openai import OpenAI
client = OpenAI(api_key="pareta_sk_...", base_url="https://api.pareta.ai/v1")
resp = client.chat.completions.create(
model="ep_a1b2c3",
messages=[{"role": "user", "content": "hi"}],
)
TypeScript
import OpenAI from "openai";
const client = new OpenAI({ apiKey: "pareta_sk_...", baseURL: "https://api.pareta.ai/v1" });
const resp = await client.chat.completions.create({
model: "ep_a1b2c3",
messages: [{ role: "user", content: "hi" }],
});
This SDK's unique value is the control plane: deploy, operate, and eval models from code.