Guide

A start-to-finish path through the Pareta SDK, from your first install to running it async in production. Read it in order the first time; come back to any page on its own later.

The throughline: you deploy open-weights models as endpoints (Pareta hides the GPU), call them through an OpenAI-compatible inference API, and show on your own data that a cheaper open model wins against frontier baselines before you commit. Inference and evals are metered against your org balance, and model ids are per-task aliases.

Almost every example builds the client with Pareta.from_env(), which reads PARETA_API_KEY and an optional PARETA_BASE_URL.
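If you want to fail fast before constructing the client, for example at process start-up, a preflight check over the same two variables is easy to write. The sketch below is hypothetical: check_pareta_env is not part of the SDK and only reads PARETA_API_KEY and PARETA_BASE_URL. The pareta_sk_ prefix check is an assumption based on the key format mentioned on the installation page.

```python
import os
from typing import Mapping, Optional, Tuple


def check_pareta_env(
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Hypothetical preflight check (NOT part of the Pareta SDK).

    Reads the two variables Pareta.from_env() is documented to use and
    returns (api_key, base_url). base_url is None when PARETA_BASE_URL is
    unset or empty, which leaves the SDK's own default in place.
    """
    env = os.environ if environ is None else environ

    # Required: the API key.
    api_key = env.get("PARETA_API_KEY", "")
    if not api_key:
        raise RuntimeError("PARETA_API_KEY is not set")

    # Assumption: Pareta keys start with the "pareta_sk_" prefix.
    if not api_key.startswith("pareta_sk_"):
        raise ValueError("PARETA_API_KEY does not look like a pareta_sk_ key")

    # Optional: a base URL override, e.g. for staging instead of prod.
    base_url = env.get("PARETA_BASE_URL") or None
    return api_key, base_url
```

Once the check passes, build the client with Pareta.from_env() as the examples do. Passing a plain dict as environ makes the helper easy to exercise in tests without touching the real environment.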

Installation & authentication: install pareta (pip, uv, or poetry), authenticate with a pareta_sk_ key via Pareta.from_env() or api_key=, and make a first metered OpenAI-compatible call.
Quickstart: deploy the recommended model for a task and run inference end to end in about a dozen lines, with streaming, metering, async, and cleanup.
Core concepts: tasks, open vs frontier models, per-task aliases, hidden hardware, balance metering, and the match → leaderboard → eval → deploy funnel.
Running inference: calling deployed endpoints with chat.completions.create, covering completions, streaming chunks, passthrough params, models.list, async, metering errors, and pointing the openai SDK at base_url.
Deploying & operating endpoints: the control plane, covering deploy (wait=True returns an Endpoint; wait=False yields a stream of progress events), list/retrieve/start/stop/delete, and metrics(id). There is no GPU knob.
Finding the right model: the discovery loop of matching intent to a task, ranking models via leaderboard/recommended, and listing frontier baselines to eval against.
Evaluating on your own data: scoring open candidates and frontier baselines on your own rows with evals.sets and evals.runs, then reading per-model quality, confidence intervals, and cost alongside the metered run total.
Errors, retries & timeouts: the ParetaError hierarchy and status-to-class mapping, which errors to catch (402/404/503/429), automatic retries with backoff, and request vs eval-wait timeouts.
Async usage: AsyncPareta, covering the async with/aclose lifecycle, awaiting every method, async for over chat and deploy streams, and fanning out work concurrently with asyncio.gather.
Configuration: building the client with api_key, base_url (prod vs staging), timeout, max_retries, your own injected httpx client, env vars, and lifecycle.

Where to go next

Working examples for specific jobs: Examples.
Field-by-field API docs: Reference.
