From a sentence to a deployed winner
You have a job to do ("pull the key fields out of these contracts") and a pile of your own examples. You want the cheapest open-weights model that actually does the job well, serving live inference, without ever touching a GPU console.
This page walks the whole funnel, end to end:
- tasks.match turns your plain-English intent into a benchmark task id.
- tasks.leaderboard shows you which models lead that task and what the recommended pick is.
- evals.runs.create scores a shortlist on your data, with frontier baselines for context.
- You pick the best open model from the results (kind == "open").
- endpoints.deploy stands it up. Pareta resolves the hardware.
- chat.completions.create runs OpenAI-compatible inference against it.
A few platform truths that shape the code below:
- GPUs are hidden. endpoints.deploy(task=, model=) takes a task and a model, never a GPU, tensor-parallel degree, or quantization knob. Pareta resolves the serving class.
- Models are per-task aliases. Every open-model id you see (leaderboard rows, run.results[].model_id, endpoint.model) is a public per-task alias. The real weights id never crosses into the SDK. Pass the alias straight back to deploy(model=...).
- Evals and inference are metered against your org balance. An eval run debits for the open and frontier compute it used; run.cost is the billed total in dollars. A successful completion debits too. An empty balance raises InsufficientCreditsError (402). Top-up is browser-only; the SDK never exposes balance or payment.
- Inference is OpenAI-compatible. Once deployed, the endpoint behaves like any OpenAI chat endpoint.
Setup
Python
from pareta import Pareta
pa = Pareta.from_env() # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
TypeScript
import { Pareta } from "pareta";
const pa = Pareta.fromEnv(); // reads PARETA_API_KEY (+ optional PARETA_BASE_URL)
from_env() is the preferred constructor. To pass the key explicitly:
Pareta(api_key="pareta_sk_...").
Step 1: Intent to task with tasks.match
match takes free text and returns a TaskMatch with ranked candidates. Use
.matched to gate on a confident hit and .chosen for the best candidate.
Python
m = pa.tasks.match("extract the key fields from a contract", top_k=5)
if not m.matched:
    # No confident hit. Inspect the alternates and pick one, or rephrase.
    for c in m.candidates:
        print(f"{c.task_id:30} score={c.score:.2f} ({c.confidence})")
    raise SystemExit("No confident task match. Pick a candidate above.")
task_id = m.chosen.task_id
print(f"matched {task_id} (score={m.chosen.score:.2f}, via {m.matcher})")
if m.ambiguous:
    print("Heads up: the top two candidates scored close together.")
TypeScript
const m = await pa.tasks.match("extract the key fields from a contract", { topK: 5 });
if (!m.matched) {
  // No confident hit. Inspect the alternates and pick one, or rephrase.
  for (const c of m.candidates) {
    console.log(`${c.taskId} score=${c.score?.toFixed(2)} (${c.confidence})`);
  }
  throw new Error("No confident task match. Pick a candidate above.");
}
const taskId = m.chosen.taskId;
console.log(`matched ${taskId} (score=${m.chosen.score?.toFixed(2)}, via ${m.matcher})`);
if (m.ambiguous) {
  console.log("Heads up: the top two candidates scored close together.");
}
TaskMatch fields: .query, .matched (bool), .chosen
(TaskMatchCandidate | None), .candidates (ranked list), .ambiguous,
.matcher ("keyword" or "semantic"). Each TaskMatchCandidate has
.task_id, .score (0 to 1), and .confidence ("high" / "medium" /
"low"). match raises ValueError if the query is empty.
If you already know the task id, skip straight to step 2. Browse the full catalog
with pa.tasks.list() (each Task has .id, .default_scorer, and
.has_blob_input, where the last tells you whether the task takes documents or
images).
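The gating logic described above can be wrapped in one small policy function. The following is a minimal sketch, not SDK code: Candidate and Match are stand-in dataclasses that mirror the documented TaskMatchCandidate and TaskMatch fields, and resolve_task_id with its allow_ambiguous flag is a hypothetical helper. The task id used in the usage note is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    """Stand-in for TaskMatchCandidate (fields as documented above)."""
    task_id: str
    score: float      # 0 to 1
    confidence: str   # "high" / "medium" / "low"


@dataclass
class Match:
    """Stand-in for TaskMatch; not the SDK class."""
    matched: bool
    chosen: Optional[Candidate]
    candidates: List[Candidate] = field(default_factory=list)
    ambiguous: bool = False


def resolve_task_id(m: Match, allow_ambiguous: bool = False) -> Optional[str]:
    """Return a task id to proceed with, or None when a human should decide.

    allow_ambiguous is a hypothetical local policy flag, not an SDK option.
    """
    if not m.matched or m.chosen is None:
        return None  # no confident hit: inspect m.candidates or rephrase
    if m.ambiguous and not allow_ambiguous:
        return None  # top two candidates are close: do not guess silently
    return m.chosen.task_id
```

With the real SDK you would pass the object returned by pa.tasks.match(...) instead of a stand-in, for example resolve_task_id(m) or raise SystemExit(...). Returning None rather than raising keeps the decision about what to do next (prompt the user, rephrase, pick an alternate) with the caller.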
Step 2: See who leads with tasks.leaderboard
The leaderboard ranks models for a task by quality and cost, names the
recommended deployable pick, and includes a frontier baseline so you know what
you are saving against.
Python
lb = pa.tasks.leaderboard(task_id)
print(f"recommended: {lb.recommended}")
print(f"ranked by {lb.metric}, cost per {lb.cost_unit}\n")
for e in lb.models:
    cost_usd = (e.cost_per_request_micro_usd or 0) / 1_000_000
    print(f"{e.name:24} {e.kind:8} q={e.quality:.3f} ${cost_usd:.6f}/req {e.context_k}k ctx")
if lb.frontier:
    f = lb.frontier
    print(f"\nfrontier baseline: {f.name} q={f.quality:.3f}")
TypeScript
const lb = await pa.tasks.leaderboard(taskId);
console.log(`recommended: ${lb.recommended}`);
console.log(`ranked by ${lb.metric}, cost per ${lb.costUnit}\n`);
for (const e of lb.models) {
  const costUsd = (e.costPerRequestMicroUsd ?? 0) / 1_000_000;
  console.log(`${e.name} ${e.kind} q=${e.quality?.toFixed(3)} $${costUsd.toFixed(6)}/req ${e.contextK}k ctx`);
}
if (lb.frontier) {
  const f = lb.frontier;
  console.log(`\nfrontier baseline: ${f.name} q=${f.quality?.toFixed(3)}`);
}
Leaderboard fields: .task_id, .metric, .cost_unit, .recommended
(deployable model alias, or None), .models (ranked LeaderboardEntry list),
and .frontier (a single baseline entry, or None). Each LeaderboardEntry has
.name, .kind ("open" or "frontier"), .quality (0 to 1),
.cost_per_request_micro_usd (raw micro-USD, not floored), .context_k
(context window in thousands), and .run_mode.
cost_per_request_micro_usd is raw micro-USD: 1,000,000 micro-USD = $1.00. The
SDK keeps sub-cent unit rates in micro-USD on purpose. Flooring them to whole
cents would erase the open-vs-frontier gap that makes the comparison worth doing.
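To make the flooring point concrete, here is a small arithmetic sketch using only the standard library. The two per-request rates are invented for illustration and do not come from any real leaderboard; micro_to_usd, floor_to_cents, and projected_usd are hypothetical helper names.

```python
from decimal import Decimal, ROUND_FLOOR

MICRO_PER_USD = 1_000_000


def micro_to_usd(micro: int) -> Decimal:
    """Exact conversion: 1,000,000 micro-USD = $1.00."""
    return Decimal(micro) / MICRO_PER_USD


def floor_to_cents(usd: Decimal) -> Decimal:
    """Floor a dollar amount to whole cents."""
    return usd.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)


def projected_usd(rate_micro: int, requests: int) -> Decimal:
    """Multiply in integer micro-USD first, convert to dollars last."""
    return micro_to_usd(rate_micro * requests)


# Hypothetical per-request rates, invented for this example.
OPEN_RATE = 180        # micro-USD per request
FRONTIER_RATE = 4_200  # micro-USD per request

# Floored to cents, both unit rates collapse to the same value.
print(floor_to_cents(micro_to_usd(OPEN_RATE)),
      floor_to_cents(micro_to_usd(FRONTIER_RATE)))

# Kept in micro-USD, the gap is visible once you project to volume.
print(projected_usd(OPEN_RATE, 1_000_000),
      projected_usd(FRONTIER_RATE, 1_000_000))
```

Doing the multiplication in integer micro-USD and converting to dollars only at the end avoids accumulating rounding error across many requests.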
Want just the deployable pick without the full board:
Python
pick = pa.tasks.recommended(task_id) # -> str | None, the recommended alias
TypeScript
const pick = await pa.tasks.recommended(taskId); // -> string | null, the recommended alias
This is exactly what endpoints.deploy(model="recommended") resolves to under the
hood. Inspect it here before you commit.
Step 3: Prove it on your data with evals.runs.create
The leaderboard is the catalog's published view. Your contracts are not the catalog's contracts. Run a real eval on your rows before you deploy anything.
Build a shortlist from the leaderboard's open entries, then score them against
your data. You can create the eval set inline in the same call by passing
task= + items=.
Python
# Shortlist: top open models off the leaderboard (these are deployable aliases).
candidates = [e.name for e in lb.models if e.kind == "open"][:3]
# Your data: one dict per row. Shape depends on the task's scorer; for an
# extraction task each row carries the input plus the expected fields.
items = [
    {"input": "MASTER SERVICES AGREEMENT ... Term: 24 months ... Fee: $48,000",
     "expected": {"term_months": 24, "annual_fee_usd": 48000}},
    {"input": "STATEMENT OF WORK ... Term: 12 months ... Fee: $9,500",
     "expected": {"term_months": 12, "annual_fee_usd": 9500}},
    # ... more rows. More rows means tighter confidence intervals.
]
run = pa.evals.runs.create(
    task=task_id,
    items=items,
    models=candidates,       # open candidates to score
    frontier="benchmarked",  # baselines on this task's leaderboard, for context
    name="contracts shortlist v1",
    wait=True,               # block until terminal, then return the final run
)
TypeScript
// Shortlist: top open models off the leaderboard (these are deployable aliases).
const candidates = lb.models.filter((e) => e.kind === "open").slice(0, 3).map((e) => e.name);
// Your data: one object per row. Shape depends on the task's scorer; for an
// extraction task each row carries the input plus the expected fields.
const items = [
  { input: "MASTER SERVICES AGREEMENT ... Term: 24 months ... Fee: $48,000",
    expected: { term_months: 24, annual_fee_usd: 48000 } },
  { input: "STATEMENT OF WORK ... Term: 12 months ... Fee: $9,500",
    expected: { term_months: 12, annual_fee_usd: 9500 } },
  // ... more rows. More rows means tighter confidence intervals.
];
const run = await pa.evals.runs.create({
  task: taskId,
  items,
  models: candidates,       // open candidates to score
  frontier: "benchmarked",  // baselines on this task's leaderboard, for context
  name: "contracts shortlist v1",
  wait: true,               // block until terminal, then return the final run
});
evals.runs.create parameters:
- Provide either eval_set=<id> (an existing set) or task= + items= to create one inline.
- models= is required and is the list of open candidate aliases to score.
- frontier= controls the vendor baselines, resolved SDK-side:
  - None or "none" -> no baselines.
  - "all" -> every frontier model for the task.
  - "benchmarked" -> only the frontier models on the task's leaderboard (vision-filtered for document tasks).
  - an explicit list of frontier ids -> passed through as-is.
- wait=True polls until the run is terminal ("completed" or "failed"), every poll_interval seconds (default 3.0), up to timeout seconds (default 900.0), then returns the final EvalRun. It raises ParetaError if the timeout is hit.
- wait=False returns immediately with a "running"/queued run; poll it yourself with pa.evals.runs.wait(run.id) or pa.evals.runs.retrieve(run.id).
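As a mental model for the frontier= rules in the list above, the resolution can be pictured as a small pure function. This is a sketch under stated assumptions, not the SDK's implementation: resolve_frontier is a hypothetical name, the two id lists are supplied by the caller, the vision filtering for document tasks is left out, and raising ValueError on an unknown string is an assumption.

```python
from typing import List, Optional, Sequence, Union


def resolve_frontier(
    frontier: Union[None, str, Sequence[str]],
    all_ids: Sequence[str],
    benchmarked_ids: Sequence[str],
) -> List[str]:
    """Map a frontier= argument to the list of baseline ids to score."""
    if frontier is None or frontier == "none":
        return []                     # no baselines
    if frontier == "all":
        return list(all_ids)          # every frontier model for the task
    if frontier == "benchmarked":
        return list(benchmarked_ids)  # only those on the task's leaderboard
    if isinstance(frontier, str):
        # Assumption: an unrecognised mode is rejected rather than ignored.
        raise ValueError(f"unknown frontier mode: {frontier!r}")
    return list(frontier)             # explicit ids pass through as-is
```

The point of the sketch is that "benchmarked" is a subset of "all", so it is the cheaper way to get context on a metered run.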
create raises ValueError if neither eval_set nor task+items is given,
and ValueError if items is empty.
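The wait=True behaviour described above amounts to a poll loop with a deadline. The sketch below shows that shape in plain Python; it is an illustration, not the SDK's code. wait_until_terminal, the fetch callable, and the dict-shaped run are hypothetical stand-ins, and it raises the built-in TimeoutError where the SDK raises ParetaError.

```python
import time
from typing import Callable, Dict

TERMINAL = {"completed", "failed"}


def wait_until_terminal(
    fetch: Callable[[], Dict[str, str]],
    poll_interval: float = 3.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Call fetch() until the run reports a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        run = fetch()
        if run["status"] in TERMINAL:
            return run
        if time.monotonic() >= deadline:
            # Stand-in for the ParetaError the SDK raises on timeout.
            raise TimeoutError(f"run still {run['status']!r} after {timeout}s")
        sleep(poll_interval)
```

With the real client, fetch would wrap pa.evals.runs.retrieve(run.id). Note that a "failed" run is returned, not raised, so check the status before reading results.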
This call is metered. The org balance is debited for the open and frontier compute
the run used. If the balance is empty it raises InsufficientCreditsError.
Python
from pareta import InsufficientCreditsError
try:
    run = pa.evals.runs.create(task=task_id, items=items, models=candidates,
                               frontier="benchmarked", wait=True)
except InsufficientCreditsError:
    raise SystemExit("Org balance is empty. Top up in the dashboard (browser-only).")
TypeScript
import { InsufficientCreditsError } from "pareta";
try {
  const run = await pa.evals.runs.create({
    task: taskId, items, models: candidates, frontier: "benchmarked", wait: true,
  });
} catch (e) {
  if (e instanceof InsufficientCreditsError) {
    throw new Error("Org balance is empty. Top up in the dashboard (browser-only).");
  }
  throw e;
}
Document and image tasks
If task.has_blob_input is true, the rows reference binary documents (PDFs,
scans). Create the set first, attach each file to its row, then start the run
against the set id:
Python
es = pa.evals.sets.create(task=task_id, items=items, name="contracts with PDFs")
# Attach a PDF to row 0's "document" blob field. Files under 5 MiB go inline;
# larger ones use a signed-URL upload. The SDK picks the path for you.
pa.evals.sets.upload_document(es.id, "contracts/0001.pdf", idx=0, field_name="document")
run = pa.evals.runs.create(eval_set=es.id, models=candidates,
                           frontier="benchmarked", wait=True)
TypeScript
const es = await pa.evals.sets.create({ task: taskId, items, name: "contracts with PDFs" });
// Attach a PDF to row 0's "document" blob field. Files under 5 MiB go inline;
// larger ones use a signed-URL upload. The SDK picks the path for you.
await pa.evals.sets.uploadDocument(es.id, "contracts/0001.pdf", { idx: 0, fieldName: "document" });
const run = await pa.evals.runs.create({
  evalSet: es.id, models: candidates, frontier: "benchmarked", wait: true,
});
upload_document accepts a path (str/Path), raw bytes, or a binary
file-like object; the MIME type is guessed from the filename unless you pass
mime=. EvalSet exposes .id, .task_id, .name, .item_count, and
.scoring_strategy.
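The two decisions upload_document makes for you (inline versus signed URL, and which MIME type to send) can be sketched with the standard library. This is an illustration only: choose_upload_path and guess_mime are hypothetical names, treating exactly 5 MiB as "larger" is an assumption, and so is the application/octet-stream fallback when the filename gives no hint.

```python
import mimetypes
from typing import Optional

INLINE_LIMIT_BYTES = 5 * 1024 * 1024  # 5 MiB


def choose_upload_path(size_bytes: int) -> str:
    """Files under 5 MiB go inline; anything else uses a signed-URL upload."""
    return "inline" if size_bytes < INLINE_LIMIT_BYTES else "signed_url"


def guess_mime(filename: str, mime: Optional[str] = None) -> str:
    """An explicit mime wins; otherwise guess from the filename."""
    if mime:
        return mime
    guessed, _encoding = mimetypes.guess_type(filename)
    # Assumption: fall back to a generic binary type when the guess fails.
    return guessed or "application/octet-stream"
```

In practice this matters when you pass raw bytes or a file-like object with no useful name: there is nothing to guess from, so pass mime= yourself.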
Step 4: Read the results, pick the best open model
A terminal EvalRun carries per-model aggregates in .results. Each EvalResult
has .model_id (the per-task alias), .kind ("open" or "frontier"),
.quality_mean, .quality_ci_low / .quality_ci_high (95% CI),
.mean_cost_micro_usd (raw average cost per item), .n_succeeded, and
.error_count.
Python
print(f"run {run.id}: {run.status}")
print(f"billed: ${run.cost} ({run.cost_micro_usd} micro-USD)\n")
def q(v):
    # Quality fields may be None (hence the `or 0` guards below), so format defensively.
    return "n/a" if v is None else f"{v:.3f}"
for r in sorted(run.results, key=lambda r: (r.quality_mean or 0), reverse=True):
    cost_usd = (r.mean_cost_micro_usd or 0) / 1_000_000
    print(f"{r.model_id:24} {r.kind:8} "
          f"q={q(r.quality_mean)} [{q(r.quality_ci_low)}, {q(r.quality_ci_high)}] "
          f"${cost_usd:.6f}/item ({r.n_succeeded} ok, {r.error_count} err)")
# The winner: the highest-quality OPEN model. Frontier rows are baselines only:
# they are vendor APIs, not something you deploy here.
open_results = [r for r in run.results if r.kind == "open"]
if not open_results:
    raise SystemExit("No open candidates succeeded. Widen the shortlist.")
winner = max(open_results, key=lambda r: (r.quality_mean or 0))
print(f"\nwinner: {winner.model_id} (quality {q(winner.quality_mean)})")
TypeScript
console.log(`run ${run.id}: ${run.status}`);
console.log(`billed: $${run.cost} (${run.costMicroUsd} micro-USD)\n`);
for (const r of [...run.results].sort((a, b) => (b.qualityMean ?? 0) - (a.qualityMean ?? 0))) {
  const costUsd = (r.meanCostMicroUsd ?? 0) / 1_000_000;
  console.log(
    `${r.modelId} ${r.kind} ` +
    `q=${r.qualityMean?.toFixed(3)} [${r.qualityCiLow?.toFixed(3)}, ${r.qualityCiHigh?.toFixed(3)}] ` +
    `$${costUsd.toFixed(6)}/item (${r.nSucceeded} ok, ${r.errorCount} err)`,
  );
}
// The winner: the highest-quality OPEN model. Frontier rows are baselines only:
// they are vendor APIs, not something you deploy here.
const openResults = run.results.filter((r) => r.kind === "open");
if (openResults.length === 0) {
  throw new Error("No open candidates succeeded. Widen the shortlist.");
}
const winner = openResults.reduce((a, b) => ((b.qualityMean ?? 0) > (a.qualityMean ?? 0) ? b : a));
console.log(`\nwinner: ${winner.modelId} (quality ${winner.qualityMean?.toFixed(3)})`);
Two money fields, two purposes:
- run.cost is a Decimal, the billed total in dollars, floored to whole cents (the SDK never rounds a charge up). A run that cost 5 micro-USD reads Decimal("0.00").
- run.cost_micro_usd is the raw integer total in micro-USD when you need exact precision.
- Per-model mean_cost_micro_usd stays in raw micro-USD for the same reason the leaderboard rates do: flooring sub-cent unit costs would collapse the open-vs-frontier comparison.
The frontier rows are there to answer "how much quality am I giving up, and how much am I saving?" You deploy the open winner, not the frontier baseline.
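Picking by highest quality_mean is the simplest rule. If cost matters as much as quality, one alternative is to take the cheapest open model whose confidence interval still overlaps the best one's. The sketch below shows that rule; it is a rough heuristic and not a significance test, it is not something the SDK does for you, and Result is a stand-in dataclass mirroring the documented EvalResult fields. The model ids and numbers in the usage note are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """Stand-in for EvalResult; not the SDK class."""
    model_id: str
    kind: str                          # "open" or "frontier"
    quality_mean: Optional[float]
    quality_ci_low: Optional[float]
    quality_ci_high: Optional[float]
    mean_cost_micro_usd: Optional[int]


def cheapest_within_ci(results: List[Result]) -> Optional[Result]:
    """Cheapest open model whose 95% CI overlaps the best open model's CI."""
    scored = [
        r for r in results
        if r.kind == "open"
        and r.quality_mean is not None
        and r.quality_ci_low is not None
        and r.quality_ci_high is not None
    ]
    if not scored:
        return None
    best = max(scored, key=lambda r: r.quality_mean)
    # A contender's upper bound must reach the best model's lower bound.
    contenders = [r for r in scored if r.quality_ci_high >= best.quality_ci_low]

    def sort_key(r: Result):
        cost = r.mean_cost_micro_usd
        # Unknown cost sorts last; ties on cost go to the higher quality.
        return (cost if cost is not None else float("inf"), -r.quality_mean)

    return min(contenders, key=sort_key)
```

For example, given open models at 0.91 [0.88, 0.94] for 400 micro-USD and 0.90 [0.86, 0.93] for 150 micro-USD, the rule returns the second: the intervals overlap, so the data does not clearly separate them, and it costs less. With few rows the intervals are wide and almost everything overlaps, which is another reason to add more rows.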
Step 5: Deploy the winner with endpoints.deploy
Hand the winning alias straight to deploy. No hardware knob: Pareta resolves the
serving class for the task and model. With wait=True, the call blocks through the
deploy and returns the live Endpoint.
Python
ep = pa.endpoints.deploy(
    task=task_id,
    model=winner.model_id,  # the open alias from the eval, deployed as-is
    name="contracts-prod",  # optional; auto-generated if omitted
    wait=True,
)
print(f"endpoint {ep.id} status={ep.status} live={ep.is_live} url={ep.url}")
TypeScript
const ep = await pa.endpoints.deploy({
  task: taskId,
  model: winner.modelId,  // the open alias from the eval, deployed as-is
  name: "contracts-prod", // optional; auto-generated if omitted
  wait: true,
});
console.log(`endpoint ${ep.id} status=${ep.status} live=${ep.isLive} url=${ep.url}`);
Endpoint fields: .id (the name you pass to chat.completions.create(model=...)),
.name, .model (the per-task alias), .status ("live", "starting",
"stopped", ...), .task, .url, and .is_live (status == "live").
To deploy the leaderboard's recommended pick instead of an eval winner, pass
model="recommended", or omit the model argument entirely, since "recommended" is
the default.
Watching deploy progress
With wait=False, deploy returns an iterator of progress events. Each event is
a {"event": str, "data": dict} dict. The stream ends with a "complete" event
(its data carries the Endpoint) or an "error" event.
Python
ep = None
# wait=False is passed explicitly so the call returns the event iterator.
for event in pa.endpoints.deploy(task=task_id, model=winner.model_id, wait=False):
    if event["event"] == "progress":
        print("...", event["data"])
    elif event["event"] == "complete":
        ep = pa.endpoints.retrieve(event["data"]["endpoint"]["id"])
    elif event["event"] == "error":
        raise SystemExit(f"deploy failed: {event['data']}")
TypeScript
let ep = null;
// wait: false is passed explicitly so the call returns the event iterator.
for await (const event of pa.endpoints.deploy({ task: taskId, model: winner.modelId, wait: false })) {
  if (event.event === "progress") {
    console.log("...", event.data);
  } else if (event.event === "complete") {
    ep = await pa.endpoints.retrieve(event.data.endpoint.id);
  } else if (event.event === "error") {
    throw new Error(`deploy failed: ${JSON.stringify(event.data)}`);
  }
}
With wait=True the SDK consumes this stream internally and raises ParetaError
on an "error" event. deploy raises ValueError if task is missing.
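What wait=True does with the stream can be pictured as a small reducer over the events. The following is a sketch of that idea, not the SDK's code: consume_deploy_events is a hypothetical helper, DeployFailed stands in for ParetaError, and treating a stream that ends without a terminal event as a failure is an assumption. The event payloads in the usage note are invented apart from the documented event/data envelope.

```python
from typing import Any, Dict, Iterable


class DeployFailed(RuntimeError):
    """Stand-in for the SDK's ParetaError."""


def consume_deploy_events(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Drain a deploy event stream and return the endpoint payload."""
    for event in events:
        kind, data = event["event"], event["data"]
        if kind == "complete":
            return data["endpoint"]       # the final Endpoint payload
        if kind == "error":
            raise DeployFailed(str(data))
        # Any other event (for example "progress") is informational; keep reading.
    # Assumption: a stream that ends with neither terminal event is a failure.
    raise DeployFailed("stream ended without a 'complete' or 'error' event")
```

Fed a "progress" event followed by {"event": "complete", "data": {"endpoint": {"id": "ep_123"}}}, the function returns the dict with id "ep_123". Use the streaming form yourself only when you want to surface progress to a user or a log; otherwise wait=True is less code.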
Step 6: Inference with chat.completions.create
The deployed endpoint is OpenAI-compatible. Pass ep.id as the model:
Python
resp = pa.chat.completions.create(
    model=ep.id,
    messages=[
        {"role": "system", "content": "Extract term_months and annual_fee_usd as JSON."},
        {"role": "user", "content": "MASTER SERVICES AGREEMENT ... Term: 36 months ... Fee: $72,000"},
    ],
    temperature=0,  # any OpenAI chat param passes straight through
)
print(resp.choices[0].message.content)
print(resp.usage.total_tokens, "tokens")
TypeScript
const resp = await pa.chat.completions.create({
  model: ep.id,
  messages: [
    { role: "system", content: "Extract term_months and annual_fee_usd as JSON." },
    { role: "user", content: "MASTER SERVICES AGREEMENT ... Term: 36 months ... Fee: $72,000" },
  ],
  temperature: 0, // any OpenAI chat param passes straight through
});
console.log(resp.choices[0].message.content);
console.log(resp.usage.totalTokens, "tokens");
create returns a ChatCompletion with .id, .model, .created, .choices
(each Choice has .index, .finish_reason, .message), and .usage
(.prompt_tokens, .completion_tokens, .total_tokens). It raises ValueError
if model or messages is empty, and (like the eval run) debits the org balance
on success, raising InsufficientCreditsError if the balance is empty.
Streaming
Pass stream=True for an iterator of ChatCompletionChunk. Each chunk's
incremental text is at .choices[0].delta.content:
Python
for chunk in pa.chat.completions.create(model=ep.id, messages=[...], stream=True):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
TypeScript
for await (const chunk of pa.chat.completions.create({ model: ep.id, messages: [...], stream: true })) {
  process.stdout.write(chunk.choices[0].delta.content ?? "");
}
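If you need the full text at the end (for example to parse the extracted JSON), accumulate the deltas instead of printing them. A minimal sketch follows; collect_stream and fake_chunk are hypothetical helpers, and the fake chunks only imitate the .choices[0].delta.content shape described above so the example runs without a live endpoint.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Optional


def collect_stream(chunks: Iterable[Any]) -> str:
    """Join the incremental text from a stream of chat completion chunks."""
    parts: List[str] = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # defensive: skip chunks that carry no choices
        piece = chunk.choices[0].delta.content
        if piece:     # content can be None or empty on some chunks
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Test double shaped like a ChatCompletionChunk; not an SDK object."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


full = collect_stream([fake_chunk('{"term_months"'), fake_chunk(None), fake_chunk(": 36}")])
```

Against a real endpoint you would pass the iterator returned by create(..., stream=True) straight to collect_stream.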
You never need this SDK to call the endpoint. Point the openai client at the
same base_url + your pareta_sk_ key. The SDK's value is the control plane you
just walked: match, leaderboard, eval, deploy.
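Because the endpoint speaks the OpenAI chat protocol, the request is plain HTTP. The sketch below builds (but does not send) such a request with the standard library. It rests on assumptions borrowed from the OpenAI convention rather than anything confirmed on this page: the chat route is base_url + "/chat/completions" and the key travels as a Bearer token. build_chat_request is a hypothetical helper, and the base URL in the usage note is a placeholder, not a real Pareta host.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    base_url: str,
    api_key: str,
    endpoint_id: str,
    messages: List[Dict[str, str]],
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    # Assumption: the chat route lives at <base_url>/chat/completions.
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": endpoint_id, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            # Assumption: the pareta_sk_ key is sent as a Bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A call such as build_chat_request("https://pareta.example/v1", key, ep_id, messages) yields a request you could hand to urllib.request.urlopen. In practice the openai client does the same work for you once it is given the base URL and key.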
The whole funnel
Python
from pareta import Pareta, InsufficientCreditsError
pa = Pareta.from_env()
# 1. intent -> task
m = pa.tasks.match("extract the key fields from a contract")
assert m.matched, "no confident task match"
task_id = m.chosen.task_id
# 2. who leads this task
lb = pa.tasks.leaderboard(task_id)
candidates = [e.name for e in lb.models if e.kind == "open"][:3]
# 3. prove it on your data (open candidates + benchmarked frontier baselines)
items = [{"input": "...", "expected": {...}}] # your rows
try:
    run = pa.evals.runs.create(task=task_id, items=items, models=candidates,
                               frontier="benchmarked", wait=True)
except InsufficientCreditsError:
    raise SystemExit("Top up the org balance in the dashboard (browser-only).")
# 4. pick the best OPEN model
winner = max((r for r in run.results if r.kind == "open"),
             key=lambda r: (r.quality_mean or 0))
# 5. deploy it (Pareta resolves the hardware)
ep = pa.endpoints.deploy(task=task_id, model=winner.model_id, wait=True)
# 6. infer (OpenAI-compatible)
resp = pa.chat.completions.create(
    model=ep.id,
    messages=[{"role": "user", "content": "Extract fields from: ..."}],
)
print(resp.choices[0].message.content)
TypeScript
import { Pareta, InsufficientCreditsError } from "pareta";
const pa = Pareta.fromEnv();
// 1. intent -> task
const m = await pa.tasks.match("extract the key fields from a contract");
if (!m.matched) throw new Error("no confident task match");
const taskId = m.chosen.taskId;
// 2. who leads this task
const lb = await pa.tasks.leaderboard(taskId);
const candidates = lb.models.filter((e) => e.kind === "open").slice(0, 3).map((e) => e.name);
// 3. prove it on your data (open candidates + benchmarked frontier baselines)
const items = [{ input: "...", expected: {} }]; // your rows
let run;
try {
  run = await pa.evals.runs.create({
    task: taskId, items, models: candidates, frontier: "benchmarked", wait: true,
  });
} catch (e) {
  if (e instanceof InsufficientCreditsError) {
    throw new Error("Top up the org balance in the dashboard (browser-only).");
  }
  throw e;
}
// 4. pick the best OPEN model
const winner = run.results
  .filter((r) => r.kind === "open")
  .reduce((a, b) => ((b.qualityMean ?? 0) > (a.qualityMean ?? 0) ? b : a));
// 5. deploy it (Pareta resolves the hardware)
const ep = await pa.endpoints.deploy({ task: taskId, model: winner.modelId, wait: true });
// 6. infer (OpenAI-compatible)
const resp = await pa.chat.completions.create({
  model: ep.id,
  messages: [{ role: "user", content: "Extract fields from: ..." }],
});
console.log(resp.choices[0].message.content);
Operating and measuring the live endpoint
Once it is serving, operate it from code: pa.endpoints.list(),
pa.endpoints.retrieve(ep.id), pa.endpoints.stop(ep.id),
pa.endpoints.start(ep.id), pa.endpoints.delete(ep.id). Read its dimensions via
pa.endpoints.metrics(ep.id).performance() (and .uptime(), .cost(),
.quality(), .activity()). The .cost() dimension reports per-endpoint spend
and savings versus the frontier baseline.
Async
Every step has an async twin on AsyncPareta, with the same names and arguments,
all await-ed. Calls with wait=True return awaitables, and deploy(wait=False)
returns an async iterator (consume it with async for) rather than the sync
equivalents.
Python
import asyncio
from pareta import AsyncPareta
async def main():
    async with AsyncPareta.from_env() as pa:
        m = await pa.tasks.match("extract the key fields from a contract")
        task_id = m.chosen.task_id
        run = await pa.evals.runs.create(
            task=task_id, items=[...], models=[...], frontier="benchmarked", wait=True)
        winner = max((r for r in run.results if r.kind == "open"),
                     key=lambda r: (r.quality_mean or 0))
        ep = await pa.endpoints.deploy(task=task_id, model=winner.model_id, wait=True)
        resp = await pa.chat.completions.create(
            model=ep.id, messages=[{"role": "user", "content": "..."}])
        print(resp.choices[0].message.content)
asyncio.run(main())
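The usual reason to reach for the async client is fan-out: sending many inference calls at once instead of one after another. Here is a minimal sketch of bounded concurrency with asyncio. complete_all and max_concurrency are hypothetical names, and fake_complete stands in for a wrapper around the async chat.completions.create call so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, List


async def complete_all(
    complete: Callable[[str], Awaitable[str]],
    prompts: List[str],
    max_concurrency: int = 4,
) -> List[str]:
    """Run complete() over all prompts, at most max_concurrency at a time."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> str:
        async with sem:  # cap the number of in-flight calls
            return await complete(prompt)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_complete(prompt: str) -> str:
    """Offline stand-in for an awaited chat completion call."""
    await asyncio.sleep(0)
    return prompt.upper()


if __name__ == "__main__":
    print(asyncio.run(complete_all(fake_complete, ["a", "b", "c"], max_concurrency=2)))
```

Every successful completion is metered, so a cap on concurrency also puts a ceiling on how fast a runaway loop can spend.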
TypeScript
There is no AsyncPareta in TypeScript: the one Pareta client is already
async. Every I/O method returns a Promise you await, and leaderboard() /
recommended() are present on it (no sync-only gap to work around). So the whole
funnel above is the async path; just call it inside an async function.
import { Pareta } from "pareta";
async function main() {
  const pa = Pareta.fromEnv();
  const m = await pa.tasks.match("extract the key fields from a contract");
  const taskId = m.chosen.taskId;
  const run = await pa.evals.runs.create({
    task: taskId,
    items: [/* your rows; create rejects an empty list */],
    models: [/* open candidate aliases */],
    frontier: "benchmarked",
    wait: true,
  });
  const winner = run.results
    .filter((r) => r.kind === "open")
    .reduce((a, b) => ((b.qualityMean ?? 0) > (a.qualityMean ?? 0) ? b : a));
  const ep = await pa.endpoints.deploy({ task: taskId, model: winner.modelId, wait: true });
  const resp = await pa.chat.completions.create({
    model: ep.id, messages: [{ role: "user", content: "..." }],
  });
  console.log(resp.choices[0].message.content);
}
main();
Related
- Run an eval on your own data: the eval set and run surface in depth, including document uploads and confidence intervals.
- Deploy and operate an endpoint: start/stop, metrics, and the deploy progress stream.
- OpenAI-compatible inference: streaming, usage, and pointing the openai client at Pareta.
- Money and metering: how run.cost, micro-USD rates, and InsufficientCreditsError work.