Guide
A start-to-finish path through the Pareta SDK, from your first install to running it async in production. Read it in order the first time; come back to any page on its own later.
The throughline: you deploy open-weights models as endpoints (Pareta hides the GPU), call them with OpenAI-compatible inference, and prove a cheaper open model wins on your data before you commit. Inference and evals are metered against your org balance; model ids are per-task aliases.
Almost every example builds the client with `Pareta.from_env()`, which reads `PARETA_API_KEY` and an optional `PARETA_BASE_URL`.
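As a rough illustration of what environment-based construction involves, the sketch below reads the two variables named above using only the standard library. `ClientSettings` and `settings_from_env` are hypothetical stand-ins, not part of the Pareta SDK; the SDK's own `Pareta.from_env()` may validate or default these values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ClientSettings:
    """Hypothetical container for illustration; not an SDK type."""
    api_key: str
    base_url: Optional[str] = None  # None means "use the SDK's default"


def settings_from_env(env: Mapping[str, str] = os.environ) -> ClientSettings:
    """Read PARETA_API_KEY (required) and PARETA_BASE_URL (optional)."""
    api_key = env.get("PARETA_API_KEY")
    if not api_key:
        raise RuntimeError("PARETA_API_KEY is not set")
    # Treat an empty PARETA_BASE_URL the same as an unset one.
    base_url = env.get("PARETA_BASE_URL") or None
    return ClientSettings(api_key=api_key, base_url=base_url)
```

Failing fast on a missing key, as this sketch does, surfaces a misconfigured environment at startup rather than on the first metered call.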
- Installation & authentication — install `pareta` (pip/uv/poetry), authenticate with a `pareta_sk_` key via `Pareta.from_env()` or `api_key=`, and make a first metered OpenAI-compatible call.
- Quickstart — deploy the recommended model for a task and run inference end to end in about a dozen lines, with streaming, metering, async, and cleanup.
- Core concepts — tasks, open vs frontier models, per-task aliases, hidden hardware, balance metering, and the match to leaderboard to eval to deploy funnel.
- Running inference — call deployed endpoints with `chat.completions.create`: completions, streaming chunks, passthrough params, `models.list`, async, metering errors, and pointing the `openai` SDK at `base_url`.
- Deploying & operating endpoints — the control plane: `deploy` (`wait=True` Endpoint vs `wait=False` progress-event stream), `list`/`retrieve`/`start`/`stop`/`delete`, and `metrics(id)`. No GPU knob.
- Finding the right model — the discovery loop: match intent to a task, rank models via `leaderboard`/`recommended`, and list frontier baselines to eval against.
- Evaluating on your own data — score open candidates and frontier baselines on your own rows with `evals.sets` and `evals.runs`, reading per-model quality/CIs/cost and the metered run total.
- Errors, retries & timeouts — the `ParetaError` hierarchy and status-to-class mapping, which errors to catch (402/404/503/429), automatic retries with backoff, and request vs eval-wait timeouts.
- Async usage — `AsyncPareta`: `async with`/`aclose` lifecycle, awaiting every method, `async for` on chat and deploy streams, and fanning out work concurrently with `asyncio.gather`.
- Configuration — building the client: `api_key`, `base_url` (prod vs staging), `timeout`, `max_retries`, injecting your own `httpx` client, env vars, and lifecycle.
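The fan-out pattern mentioned under Async usage is plain standard-library `asyncio`. A minimal sketch follows, with a hypothetical `fake_call` coroutine standing in for an awaited SDK method; no Pareta calls are made, and the `limit` parameter is an illustrative choice rather than an SDK setting.

```python
import asyncio


async def fake_call(prompt: str) -> str:
    """Hypothetical stand-in for an awaited SDK call; does no network I/O."""
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    return prompt.upper()


async def fan_out(prompts: list[str], limit: int = 4) -> list[str]:
    """Run fake_call for every prompt, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def bounded(prompt: str) -> str:
        async with sem:
            return await fake_call(prompt)

    # gather returns results in the same order as its arguments.
    return await asyncio.gather(*(bounded(p) for p in prompts))


results = asyncio.run(fan_out(["a", "b", "c"]))
```

Bounding concurrency with a semaphore is one way to keep a large batch from tripping rate limits (the 429 case listed under Errors); whether that is needed depends on your workload.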