Benchmark models on your own data

A public leaderboard tells you which model wins on someone else's data. It does not tell you which model wins on yours. This page shows how to take your own labeled rows, score a slate of open-weights candidates against a frontier baseline, and read back a ranked, cost-annotated result, all in one evals.runs.create(...) call.

The shape is always the same:

1. Pick a task (it carries the scorer and the input schema).
2. Build an eval set from your rows.
3. Run open candidates against frontier="benchmarked".
4. Read the ranked results and the dollar cost of the run.

Evals are metered: the org balance is debited for the compute you ran (open candidates and frontier baselines). run.cost is the billed total in dollars; an empty balance raises InsufficientCreditsError. Top-up is browser-only.

Setup

Python

from pareta import Pareta

pa = Pareta.from_env()  # reads PARETA_API_KEY (and optional PARETA_BASE_URL)

TypeScript

import { Pareta } from "pareta";

const pa = Pareta.fromEnv(); // reads PARETA_API_KEY (and optional PARETA_BASE_URL)

from_env() is the path you want; it keeps the key out of your source. See Authentication for the constructor form and key formats.

1. Pick a task

A task defines what gets scored and how. Every eval set, run, and result is anchored to one task id. The task also owns the default_scorer (the metric your candidates are judged on) and tells you, via has_blob_input, whether rows carry documents or images.

If you already know the id, skip ahead. Otherwise, match free text against the catalog:

Python

match = pa.tasks.match("extract key fields from a contract", top_k=5)

if match.matched:
    task_id = match.chosen.task_id          # best candidate
    print(task_id, match.chosen.confidence)  # e.g. "contract-key-fields" "high"
else:
    # nothing landed with confidence; inspect the ranked alternates
    for c in match.candidates:
        print(c.task_id, round(c.score, 3), c.confidence)
    raise SystemExit("refine the query")

TypeScript

const match = await pa.tasks.match("extract key fields from a contract", { topK: 5 });

let taskId: string;
if (match.matched) {
  taskId = match.chosen!.taskId!; // best candidate
  console.log(taskId, match.chosen!.confidence); // e.g. "contract-key-fields" "high"
} else {
  // nothing landed with confidence; inspect the ranked alternates
  for (const c of match.candidates) {
    console.log(c.taskId, c.score?.toFixed(3), c.confidence);
  }
  throw new Error("refine the query");
}

match.ambiguous is True when the top two scores are close; surface that case to a human before committing to a task. Confirm the scorer and input schema before you build a set:

Python

task = pa.tasks.retrieve(task_id)
print(task.default_scorer)   # the metric your run will report (e.g. "macro_joint_f1")
print(task.has_blob_input)   # True → rows attach PDFs/images (see step 2b)

TypeScript

const task = await pa.tasks.retrieve(taskId);
console.log(task.defaultScorer); // the metric your run will report (e.g. "macro_joint_f1")
console.log(task.hasBlobInput);  // true → rows attach PDFs/images (see step 2b)
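If you want your own definition of "close" in addition to match.ambiguous, the check can be sketched locally. This is a minimal sketch, not the SDK's logic: the Candidate tuple, the needs_review helper, and the 0.05 gap are assumptions for illustration. In practice you would fill it from the task_id and score of each entry in match.candidates.

```python
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    """Hypothetical stand-in for one entry of match.candidates."""
    task_id: str
    score: float


def needs_review(candidates: Sequence[Candidate], min_gap: float = 0.05) -> bool:
    """Return True when the two best scores are within min_gap of each other.

    min_gap is an illustrative threshold, not a value taken from the SDK.
    """
    if len(candidates) < 2:
        return False  # a single candidate has nothing to be confused with
    top, runner_up = sorted(candidates, key=lambda c: c.score, reverse=True)[:2]
    return (top.score - runner_up.score) < min_gap


# Made-up scores: the first pair is a near tie, the second is a clear winner.
close = [Candidate("contract-key-fields", 0.81), Candidate("invoice-extraction", 0.79)]
clear = [Candidate("contract-key-fields", 0.81), Candidate("invoice-extraction", 0.42)]
print(needs_review(close), needs_review(clear))  # True False
```

A stricter gap sends more matches to a human; a looser one trusts the top candidate more often.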

See Discover tasks for the full matching and catalog walkthrough.

2. Build an eval set from your rows

An eval set is your labeled data, stored server-side and reusable across runs. Each row is a dict whose fields match the task schema. The exact keys are task-specific, but the universal shape is inputs the model sees plus a target (the gold label the scorer compares against).

Python

items = [
    {
        "text": "This Agreement is made on 3 March 2026 between Acme Corp and Globex LLC...",
        "target": {"effective_date": "2026-03-03", "parties": ["Acme Corp", "Globex LLC"]},
    },
    {
        "text": "Master Services Agreement, dated January 12, 2026, by and between Initech and Hooli...",
        "target": {"effective_date": "2026-01-12", "parties": ["Initech", "Hooli"]},
    },
    # ... more rows. A few dozen labeled rows already give you a usable signal.
]

eval_set = pa.evals.sets.create(task=task_id, items=items)

print(eval_set.id)                # use this in runs.create(eval_set=...)
print(eval_set.item_count)        # 2
print(eval_set.scoring_strategy)  # e.g. "extraction"

TypeScript

const items = [
  {
    text: "This Agreement is made on 3 March 2026 between Acme Corp and Globex LLC...",
    target: { effective_date: "2026-03-03", parties: ["Acme Corp", "Globex LLC"] },
  },
  {
    text: "Master Services Agreement, dated January 12, 2026, by and between Initech and Hooli...",
    target: { effective_date: "2026-01-12", parties: ["Initech", "Hooli"] },
  },
  // ... more rows. A few dozen labeled rows already give you a usable signal.
];

const evalSet = await pa.evals.sets.create({ task: taskId, items });

console.log(evalSet.id);              // use this in runs.create({ evalSet: ... })
console.log(evalSet.itemCount);       // 2
console.log(evalSet.scoringStrategy); // e.g. "extraction"

items must be non-empty (an empty list raises ValueError before any request goes out). If you omit name, the set is labeled "sdk eval set (N items)".
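A local pre-flight check can catch malformed rows before you upload anything. The sketch below is an assumption-laden helper, not part of the SDK: check_items and its require_inputs flag are hypothetical names, and it validates only the universal shape described above (a "target" plus at least one input field), not the task-specific keys.

```python
from typing import Any, Mapping, Sequence


def check_items(items: Sequence[Mapping[str, Any]], require_inputs: bool = True) -> None:
    """Raise ValueError if the rows are obviously unusable.

    Pass require_inputs=False for document tasks, where the input file is
    attached to the row after the set has been created.
    """
    if not items:
        raise ValueError("items must be non-empty")
    for idx, row in enumerate(items):
        if "target" not in row:
            raise ValueError(f"row {idx} has no 'target'")
        if require_inputs and len(row) < 2:
            raise ValueError(f"row {idx} has a target but no input fields")


rows = [
    {
        "text": "This Agreement is made on 3 March 2026 between Acme Corp and Globex LLC...",
        "target": {"effective_date": "2026-03-03", "parties": ["Acme Corp", "Globex LLC"]},
    },
]
check_items(rows)  # passes silently
check_items([{"target": {"invoice_number": "INV-7781"}}], require_inputs=False)
```

Running this before pa.evals.sets.create(...) keeps a missing label from surfacing later as a scoring error.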

Reuse a set across many runs, or list and prune as you iterate:

Python

for s in pa.evals.sets.list():
    print(s.id, s.task_id, s.item_count, s.name)

# pa.evals.sets.delete(eval_set.id)   # when you are done with it

TypeScript

for (const s of await pa.evals.sets.list()) {
  console.log(s.id, s.taskId, s.itemCount, s.name);
}

// await pa.evals.sets.delete(evalSet.id);   // when you are done with it

2b. Document tasks: attach the file to each row

When task.has_blob_input is True, the row carries a binary document. Create the set with the row's text/label fields and a placeholder for the blob, then attach the file to that row by index:

Python

doc_task = "invoice-extraction"   # a has_blob_input task

eval_set = pa.evals.sets.create(
    task=doc_task,
    items=[
        {"target": {"invoice_number": "INV-7781", "total": "1240.00"}},
        {"target": {"invoice_number": "INV-7782", "total": "98.50"}},
    ],
)

# Attach one PDF per row. idx is the 0-based row; field_name is the blob input
# field from the task schema. MIME is auto-detected from the filename.
pa.evals.sets.upload_document(eval_set.id, "invoices/7781.pdf", idx=0, field_name="document")
pa.evals.sets.upload_document(eval_set.id, "invoices/7782.pdf", idx=1, field_name="document")

TypeScript

const docTask = "invoice-extraction"; // a hasBlobInput task

const evalSet = await pa.evals.sets.create({
  task: docTask,
  items: [
    { target: { invoice_number: "INV-7781", total: "1240.00" } },
    { target: { invoice_number: "INV-7782", total: "98.50" } },
  ],
});

// Attach one PDF per row. idx is the 0-based row; fieldName is the blob input
// field from the task schema. MIME is auto-detected from the filename.
await pa.evals.sets.uploadDocument(evalSet.id, "invoices/7781.pdf", { idx: 0, fieldName: "document" });
await pa.evals.sets.uploadDocument(evalSet.id, "invoices/7782.pdf", { idx: 1, fieldName: "document" });

upload_document accepts a path (str/Path), raw bytes, or any binary file-like object; anything else raises TypeError. Files under 5 MiB upload inline; larger ones go through a signed-URL direct-to-storage flow. Either way the call returns the completion response dict. Pass mime="application/pdf" to override detection.
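To predict which route a file will take before you upload it, you can mirror the two rules above locally. This is only a sketch: describe_upload is a hypothetical helper, the standard-library mimetypes module stands in for whatever detection the SDK performs, and treating a file of exactly 5 MiB as "not inline" is an assumption read from the phrase "under 5 MiB".

```python
import mimetypes

INLINE_LIMIT_BYTES = 5 * 1024 * 1024  # 5 MiB, the threshold quoted above


def describe_upload(filename: str, size_bytes: int) -> tuple[str, str]:
    """Return (mime, route) for a file, where route is "inline" or "signed-url".

    Falls back to application/octet-stream when the extension is unknown;
    that fallback is this sketch's choice, not documented SDK behaviour.
    """
    mime, _encoding = mimetypes.guess_type(filename)
    route = "inline" if size_bytes < INLINE_LIMIT_BYTES else "signed-url"
    return mime or "application/octet-stream", route


print(describe_upload("invoices/7781.pdf", 240_000))
print(describe_upload("scans/contract.png", 12 * 1024 * 1024))
```

If the guessed type is not what the task expects, that is the case where passing mime= explicitly is worth it.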

3. Run open candidates against a frontier baseline

This is the core call. You name the open-weights candidates (per-task public aliases) and let frontier="benchmarked" pull the vendor baselines that sit on this task's leaderboard. The run scores everything on the same rows with the same scorer, so the numbers are directly comparable.

Python

run = pa.evals.runs.create(
    eval_set=eval_set.id,
    models=["contract-kie-1", "contract-kie-2"],  # open candidates (aliases)
    frontier="benchmarked",                        # vendor baselines on this leaderboard
    wait=True,                                      # block until the run is terminal
)

print(run.status)  # "completed"

TypeScript

const run = await pa.evals.runs.create({
  evalSet: evalSet.id,
  models: ["contract-kie-1", "contract-kie-2"], // open candidates (aliases)
  frontier: "benchmarked",                       // vendor baselines on this leaderboard
  wait: true,                                    // block until the run is terminal
});

console.log(run.status); // "completed"

The models list is the open candidates you want to rank; it is required. frontier controls the baselines:

`frontier=`	Evaluates against
`None` or `"none"`	nothing (open candidates only)
`"benchmarked"`	frontier models on this task's leaderboard (vision-filtered for document tasks)
`"all"`	every frontier model in the eval pool for the task
`["gpt-4o", "claude-..."]`	exactly these frontier ids

The "benchmarked" and "all" keywords need to know the task. With eval_set=... the SDK looks it up from the set; if you pass an explicit list of ids it skips the lookup entirely.
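The four frontier= forms can be modelled as a small selection function over the roster. The sketch below is a local illustration of the table's semantics, not the SDK's implementation: FrontierModel, resolve_frontier, and the roster ids are hypothetical, and the function assumes only the vision and benchmarked flags described for roster entries.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class FrontierModel:
    """Hypothetical stand-in for one frontier roster entry."""
    id: str
    vision: bool
    benchmarked: bool


def resolve_frontier(
    frontier: Union[None, str, Sequence[str]],
    roster: Sequence[FrontierModel],
    has_blob_input: bool = False,
) -> list[str]:
    """Map a frontier= value to the list of frontier ids it selects."""
    if frontier is None or frontier == "none":
        return []  # open candidates only
    if frontier == "all":
        return [m.id for m in roster]
    if frontier == "benchmarked":
        # Document tasks keep only vision-capable baselines.
        return [
            m.id for m in roster
            if m.benchmarked and (m.vision or not has_blob_input)
        ]
    if isinstance(frontier, str):
        raise ValueError(f"unknown frontier keyword: {frontier!r}")
    return list(frontier)  # explicit ids pass through untouched


# Made-up roster for illustration.
roster = [
    FrontierModel("frontier-a", vision=True, benchmarked=True),
    FrontierModel("frontier-b", vision=False, benchmarked=True),
    FrontierModel("frontier-c", vision=True, benchmarked=False),
]
print(resolve_frontier("benchmarked", roster))
print(resolve_frontier("benchmarked", roster, has_blob_input=True))
```

Note that only the keyword forms read the roster; an explicit list never does, which matches the skipped task lookup described above.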

GPUs and serving hardware never enter this call. There is no GPU, quantization, or run-mode knob. You name a task and models; Pareta resolves the rest. Open-weights model ids are per-task aliases, and frontier ids are the vendor names in the clear.

Inline create (skip step 2)

If you do not need a reusable set, hand the rows straight to the run. Pass task= and items= instead of eval_set=, and the SDK creates the set for you:

Python

run = pa.evals.runs.create(
    task=task_id,
    items=items,
    models=["contract-kie-1", "contract-kie-2"],
    frontier="benchmarked",
    wait=True,
)

TypeScript

const run = await pa.evals.runs.create({
  task: taskId,
  items,
  models: ["contract-kie-1", "contract-kie-2"],
  frontier: "benchmarked",
  wait: true,
});

You must pass either eval_set=<id> or both task= and items=; anything else raises ValueError.
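The argument rule can be pictured as a two-way switch. This is a hedged sketch rather than the SDK's validation code: run_source is a hypothetical helper and "es_123" a made-up id, and rejecting a call that supplies both eval_set= and the inline pair is an assumption based on the phrase "anything else".

```python
from typing import Any, Optional, Sequence


def run_source(
    eval_set: Optional[str] = None,
    task: Optional[str] = None,
    items: Optional[Sequence[Any]] = None,
) -> str:
    """Return "stored" or "inline", or raise ValueError for any other mix."""
    if eval_set is not None and task is None and items is None:
        return "stored"   # reuse a set created earlier
    if eval_set is None and task is not None and items is not None:
        return "inline"   # the set is created as part of the run
    raise ValueError("pass eval_set=<id>, or both task= and items=")


print(run_source(eval_set="es_123"))                          # stored
print(run_source(task="contract-key-fields", items=[{}]))     # inline
```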

Picking candidates from the leaderboard

If you want the curated pick rather than hand-naming aliases, read the leaderboard and feed its recommended id into the run:

Python

lb = pa.tasks.leaderboard(task_id)
print(lb.recommended)                # the deployable alias Pareta curates for this task
print(lb.frontier.name)              # the savings baseline

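# dict.fromkeys drops a repeat if the recommended alias is also in the top two.
candidates = list(dict.fromkeys(
    [lb.recommended] + [m.name for m in lb.models[:2] if m.kind == "open"]
))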
run = pa.evals.runs.create(eval_set=eval_set.id, models=candidates,
                           frontier="benchmarked", wait=True)

TypeScript

const lb = await pa.tasks.leaderboard(taskId);
console.log(lb.recommended);   // the deployable alias Pareta curates for this task
console.log(lb.frontier!.name); // the savings baseline

// The Set drops a repeat if the recommended alias is also in the top two.
const candidates = [...new Set([
  lb.recommended!,
  ...lb.models.slice(0, 2).filter((m) => m.kind === "open").map((m) => m.name!),
])];
const run = await pa.evals.runs.create({
  evalSet: evalSet.id,
  models: candidates,
  frontier: "benchmarked",
  wait: true,
});

To enumerate the frontier roster directly (for example, to build an explicit frontier=[...] list), use pa.evals.frontier_models(task=task_id); each entry exposes .id, .vendor, .vision, and .benchmarked.

4. Read the ranked results

A terminal run carries one EvalResult per model. Sort by quality_mean to get the ranking, and read run.cost to see what the run cost you:

Python

# Rank only the models that produced a score: formatting None with ":.3f" raises TypeError.
scored = [r for r in run.results if r.quality_mean is not None]
ranked = sorted(scored, key=lambda r: r.quality_mean, reverse=True)

for r in ranked:
    cost_per_item = (r.mean_cost_micro_usd or 0) / 1_000_000  # micro-USD → dollars
    print(
        f"{r.model_id:24}  "
        f"quality={r.quality_mean:.3f}  "
        f"[{r.quality_ci_low:.3f}, {r.quality_ci_high:.3f}]  "
        f"${cost_per_item:.6f}/item  "
        f"ok={r.n_succeeded} err={r.error_count}"
    )

print(f"\nrun cost: ${run.cost}")          # Decimal dollars, floored to cents
print(f"raw micro-USD: {run.cost_micro_usd}")

TypeScript

// Rank only the models that produced a score, so the non-null assertions below hold.
const ranked = run.results
  .filter((r) => r.qualityMean != null)
  .sort((a, b) => b.qualityMean! - a.qualityMean!);

for (const r of ranked) {
  const costPerItem = (r.meanCostMicroUsd ?? 0) / 1_000_000; // micro-USD → dollars
  console.log(
    `${(r.modelId ?? "").padEnd(24)}  ` +
      `quality=${r.qualityMean!.toFixed(3)}  ` +
      `[${r.qualityCiLow!.toFixed(3)}, ${r.qualityCiHigh!.toFixed(3)}]  ` +
      `$${costPerItem.toFixed(6)}/item  ` +
      `ok=${r.nSucceeded} err=${r.errorCount}`,
  );
}

console.log(`\nrun cost: $${run.cost}`);     // dollar string, floored to cents
console.log(`raw micro-USD: ${run.costMicroUsd}`);

What the fields mean:

quality_mean: the model's mean score on the task's scorer, in [0, 1]. This is your ranking key.
quality_ci_low / quality_ci_high: the 95% confidence interval. If two models' intervals overlap heavily, your eval set is too small to separate them, so add rows.
mean_cost_micro_usd: average cost per item, kept in micro-USD (not floored). This is where the open-vs-frontier comparison lives, so sub-cent precision is preserved: a cheaper open model that matches frontier quality is the whole point.
n_succeeded / error_count: how many rows scored cleanly. A high error_count on one model usually means malformed output, not a bad model, so inspect before trusting its quality number.
model_id: the per-task alias (open) or vendor id (frontier). kind distinguishes "open" from "frontier" where the backend populates it.
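The interval check described above is easy to automate once you have the results in hand. The following is a minimal sketch under stated assumptions: Scored is a hypothetical stand-in that copies three fields from each result, the numbers are made up, and the test flags any intersection at all, which is a stricter reading than "overlap heavily".

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Scored:
    """Hypothetical stand-in carrying three fields of a result."""
    model_id: str
    quality_ci_low: float
    quality_ci_high: float


def overlapping_pairs(results: list[Scored]) -> list[tuple[str, str]]:
    """Return every pair of models whose confidence intervals intersect."""
    pairs = []
    for a, b in combinations(results, 2):
        if a.quality_ci_low <= b.quality_ci_high and b.quality_ci_low <= a.quality_ci_high:
            pairs.append((a.model_id, b.model_id))
    return pairs


# Made-up intervals for illustration.
results = [
    Scored("contract-kie-1", 0.84, 0.91),
    Scored("contract-kie-2", 0.71, 0.79),
    Scored("frontier-a", 0.88, 0.94),
]
print(overlapping_pairs(results))  # [('contract-kie-1', 'frontier-a')]
```

Any pair this returns is one you cannot yet rank with confidence; adding rows narrows the intervals until the pair either separates or turns out to be a genuine tie.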

A note on money

run.cost is a Decimal of dollars, floored to whole cents, so the SDK never overstates a charge and a sub-cent run reads Decimal("0.00"). For the exact figure use run.cost_micro_usd (an integer, where 1_000_000 micro-USD is $1.00). The same convention is why per-item rates like mean_cost_micro_usd stay in micro-USD: flooring them to cents would erase the open-vs-frontier difference you ran the eval to find.
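The flooring rule, and why per-item rates are not floored, can be seen with a few lines of standard-library arithmetic. This is an illustrative sketch, not SDK code: floor_to_cents is a hypothetical helper and the two per-item rates are made-up numbers.

```python
from decimal import Decimal, ROUND_FLOOR

MICRO_PER_DOLLAR = 1_000_000


def floor_to_cents(micro_usd: int) -> Decimal:
    """Convert integer micro-USD to dollars, rounding down to whole cents."""
    dollars = Decimal(micro_usd) / Decimal(MICRO_PER_DOLLAR)
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_FLOOR)


print(floor_to_cents(1_239_999))  # 1.23
print(floor_to_cents(4_200))      # 0.00

# Two made-up per-item rates: both floor to $0.00, yet one is 25x the other.
open_rate, frontier_rate = 180, 4_500
print(floor_to_cents(open_rate) == floor_to_cents(frontier_rate))  # True
print(frontier_rate / open_rate)                                    # 25.0
```

The last two lines are the argument in miniature: in cents the two models look identical, while in micro-USD the cost gap is obvious.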

Not blocking on the run

wait=True polls until the run reaches "completed" or "failed", then returns. For long sets, tune the cadence and ceiling:

Python

run = pa.evals.runs.create(
    eval_set=eval_set.id,
    models=["contract-kie-1", "contract-kie-2"],
    frontier="benchmarked",
    wait=True,
    poll_interval=5.0,   # seconds between polls (default 3.0)
    timeout=1800.0,      # give up after 30 min (default 900.0); raises ParetaError on timeout
)

TypeScript

const run = await pa.evals.runs.create({
  evalSet: evalSet.id,
  models: ["contract-kie-1", "contract-kie-2"],
  frontier: "benchmarked",
  wait: true,
  pollInterval: 5,   // seconds between polls (default 3)
  timeout: 1800,     // give up after 30 min (default 900); throws ParetaError on timeout
});

Or fire and poll yourself. wait=False returns immediately with a run you can retrieve later:

Python

run = pa.evals.runs.create(eval_set=eval_set.id,
                           models=["contract-kie-1"], frontier="benchmarked",
                           wait=False)  # return immediately instead of polling
run_id = run.id
# ... later, from anywhere ...
run = pa.evals.runs.retrieve(run_id)
if run.is_terminal:
    print(run.status, run.results)

# equivalently, block on an already-started run:
run = pa.evals.runs.wait(run_id, timeout=1800.0)

TypeScript

let run = await pa.evals.runs.create({
  evalSet: evalSet.id,
  models: ["contract-kie-1"],
  frontier: "benchmarked",
  wait: false, // return immediately instead of polling
});
const runId = run.id!;
// ... later, from anywhere ...
run = await pa.evals.runs.retrieve(runId);
if (run.isTerminal) {
  console.log(run.status, run.results);
}

// equivalently, block on an already-started run:
run = await pa.evals.runs.wait(runId, { timeout: 1800 });
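If you poll yourself, the loop has the same three ingredients as wait=True: a status fetch, an interval, and a deadline. The sketch below shows that shape with no network involved. It is an assumption-heavy illustration: poll_until_terminal is a hypothetical helper, the "queued" and "running" statuses are invented for the simulation, and it raises the built-in TimeoutError where the SDK raises ParetaError.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed"}  # the two terminal statuses named above


def poll_until_terminal(
    fetch_status: Callable[[], str],
    poll_interval: float = 3.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        # Stop early if the next poll would land past the deadline.
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"still {status!r} after {timeout}s")
        sleep(poll_interval)


# Simulated run that finishes on the third poll.
statuses = iter(["queued", "running", "completed"])
print(poll_until_terminal(lambda: next(statuses), poll_interval=0.01, timeout=1.0))
```

Against a real run you would pass something like lambda: pa.evals.runs.retrieve(run_id).status as fetch_status. Using time.monotonic() rather than wall-clock time keeps the deadline stable if the system clock is adjusted mid-run.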

Handling an empty balance

Both the open and frontier compute are metered. If the org balance cannot cover the run, create raises before any work is billed:

Python

from pareta import InsufficientCreditsError

try:
    run = pa.evals.runs.create(eval_set=eval_set.id,
                               models=["contract-kie-1"], frontier="benchmarked", wait=True)
except InsufficientCreditsError:
    print("Out of credit. Top up in the dashboard (billing is browser-only).")

TypeScript

import { InsufficientCreditsError } from "pareta";

try {
  const run = await pa.evals.runs.create({
    evalSet: evalSet.id,
    models: ["contract-kie-1"],
    frontier: "benchmarked",
    wait: true,
  });
} catch (e) {
  if (e instanceof InsufficientCreditsError) {
    console.log("Out of credit. Top up in the dashboard (billing is browser-only).");
  } else {
    throw e;
  }
}

InsufficientCreditsError is a subclass of APIStatusError (status 402). If you want one handler for every SDK failure, catch the broader ParetaError instead.

Full example

Python

from pareta import Pareta, InsufficientCreditsError

pa = Pareta.from_env()

# 1. Pick the task.
task_id = "contract-key-fields"
task = pa.tasks.retrieve(task_id)
print("scoring on:", task.default_scorer)

# 2. Build the eval set from your rows.
items = [
    {"text": "This Agreement is made on 3 March 2026 between Acme Corp and Globex LLC...",
     "target": {"effective_date": "2026-03-03", "parties": ["Acme Corp", "Globex LLC"]}},
    {"text": "Master Services Agreement, dated January 12, 2026, by and between Initech and Hooli...",
     "target": {"effective_date": "2026-01-12", "parties": ["Initech", "Hooli"]}},
]
eval_set = pa.evals.sets.create(task=task_id, items=items, name="contract fields v1")

# 3. Run open candidates against the benchmarked frontier baselines.
try:
    run = pa.evals.runs.create(
        eval_set=eval_set.id,
        models=["contract-kie-1", "contract-kie-2"],
        frontier="benchmarked",
        wait=True,
    )
except InsufficientCreditsError:
    raise SystemExit("Out of credit. Top up in the dashboard.")

# 4. Read the ranked results.
scored = [r for r in run.results if r.quality_mean is not None]  # skip models with no score
for r in sorted(scored, key=lambda r: r.quality_mean, reverse=True):
    print(f"{r.model_id:24} {r.quality_mean:.3f}  ${(r.mean_cost_micro_usd or 0)/1e6:.6f}/item")

print("run cost:", run.cost)  # Decimal dollars, floored to cents

TypeScript

import { Pareta, InsufficientCreditsError } from "pareta";

const pa = Pareta.fromEnv();

// 1. Pick the task.
const taskId = "contract-key-fields";
const task = await pa.tasks.retrieve(taskId);
console.log("scoring on:", task.defaultScorer);

// 2. Build the eval set from your rows.
const items = [
  { text: "This Agreement is made on 3 March 2026 between Acme Corp and Globex LLC...",
    target: { effective_date: "2026-03-03", parties: ["Acme Corp", "Globex LLC"] } },
  { text: "Master Services Agreement, dated January 12, 2026, by and between Initech and Hooli...",
    target: { effective_date: "2026-01-12", parties: ["Initech", "Hooli"] } },
];
const evalSet = await pa.evals.sets.create({ task: taskId, items, name: "contract fields v1" });

// 3. Run open candidates against the benchmarked frontier baselines.
let run;
try {
  run = await pa.evals.runs.create({
    evalSet: evalSet.id,
    models: ["contract-kie-1", "contract-kie-2"],
    frontier: "benchmarked",
    wait: true,
  });
} catch (e) {
  if (e instanceof InsufficientCreditsError) {
    throw new Error("Out of credit. Top up in the dashboard.");
  }
  throw e;
}

// 4. Read the ranked results.
const scored = run.results.filter((r) => r.qualityMean != null); // skip models with no score
for (const r of scored.sort((a, b) => b.qualityMean! - a.qualityMean!)) {
  console.log(`${(r.modelId ?? "").padEnd(24)} ${r.qualityMean!.toFixed(3)}  $${((r.meanCostMicroUsd ?? 0) / 1e6).toFixed(6)}/item`);
}

console.log("run cost:", run.cost); // dollar string, floored to cents

Async

Every call here has an async twin on AsyncPareta. The signatures match; the methods are coroutines (wait included).

Python

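import asyncio
from pareta import AsyncPareta

# Same row shape as step 2; swap in your own labeled rows.
items = [
    {"text": "This Agreement is made on 3 March 2026 between Acme Corp and Globex LLC...",
     "target": {"effective_date": "2026-03-03", "parties": ["Acme Corp", "Globex LLC"]}},
]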

async def main():
    async with AsyncPareta.from_env() as pa:
        eval_set = await pa.evals.sets.create(task="contract-key-fields", items=items)
        run = await pa.evals.runs.create(
            eval_set=eval_set.id,
            models=["contract-kie-1", "contract-kie-2"],
            frontier="benchmarked",
            wait=True,
        )
        for r in run.results:
            print(r.model_id, r.quality_mean)
        print("run cost:", run.cost)

asyncio.run(main())

TypeScript

TypeScript has no separate AsyncPareta: the one Pareta client is already async. Every I/O method returns a Promise, so you just await it; there is no sync/async split to mirror and no context manager to close.

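import { Pareta } from "pareta";

const pa = Pareta.fromEnv();

// Same row shape as step 2; swap in your own labeled rows.
const items = [
  { text: "This Agreement is made on 3 March 2026 between Acme Corp and Globex LLC...",
    target: { effective_date: "2026-03-03", parties: ["Acme Corp", "Globex LLC"] } },
];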

const evalSet = await pa.evals.sets.create({ task: "contract-key-fields", items });
const run = await pa.evals.runs.create({
  evalSet: evalSet.id,
  models: ["contract-kie-1", "contract-kie-2"],
  frontier: "benchmarked",
  wait: true,
});
for (const r of run.results) {
  console.log(r.modelId, r.qualityMean);
}
console.log("run cost:", run.cost);

Next steps

Deploy an endpoint: take the winner of your eval to a live, OpenAI-compatible endpoint.
Run inference: call your deployed model; inference is metered the same way evals are.
Discover tasks: match intent to tasks and read leaderboards in depth.
Errors and retries: the full exception hierarchy behind InsufficientCreditsError and friends.
