Skip to main content

endpoints

client.endpoints is the control plane for serving open-weights models. Hand it a task and a model; it deploys an OpenAI-compatible inference endpoint, hands you back a live Endpoint, and lets you start, stop, delete, and measure it from code. There is no infrastructure to reason about.

Three platform truths shape this whole namespace:

  • GPUs are hidden. deploy() takes a task and a model, nothing else. There is no GPU, tensor-parallel, quantization, or run-mode knob. Pareta resolves the serving class from its registry.
  • Models are per-task aliases. The model you deploy and the Endpoint.model you read back are public per-task aliases ({family}-{rank}), not raw open-weights ids. Real ids never cross into the SDK.
  • Inference is metered against your org balance. Once an endpoint is live, every chat.completions.create() debits your org balance and raises InsufficientCreditsError (402) on an empty balance. Top-up is browser-only; the SDK never exposes balance or payment methods.
from pareta import Pareta

pa = Pareta.from_env() # reads PARETA_API_KEY (+ optional PARETA_BASE_URL)

deploy()

endpoints.deploy(
*,
task: str,
model: str = "recommended",
name: str | None = None,
wait: bool = False,
**extra,
) -> Iterator[dict] | Endpoint

Route: POST /v1/endpoints (named-event SSE)

Deploys a model for a task and brings the endpoint live.

  • task (required) is a catalog subtask id, e.g. "contract-key-fields". Discover one with tasks.match() / tasks.list(). Passing an empty task raises ValueError before any request goes out.
  • model is a per-task public alias, an explicit real-id-equivalent alias, or "recommended" (the default — the task's curated pick, else the leaderboard's top open model). To see what "recommended" resolves to before you deploy, read pa.tasks.recommended(task).
  • name is optional. Leave it off and Pareta names the endpoint for you.
  • wait controls the return type (see below).
  • **extra is passed straight to the backend (e.g. cost_per_request_micro_usd, frontier_cost_per_request_micro_usd, region, provider, quality, run_mode, taskDisplay). You never pass hardware.

The return type depends entirely on wait.

wait=True — block and get the live Endpoint

The simplest path. deploy(wait=True) consumes the deploy stream internally, blocks until the endpoint is live, and returns the Endpoint. If the deploy emits an "error" event (or the stream ends without a "complete" event), it raises ParetaError.

ep = pa.endpoints.deploy(
task="contract-key-fields",
model="recommended", # Pareta picks the task's best open model
wait=True,
)

assert ep.is_live # status == "live"
print(ep.id) # pass this to chat.completions.create(model=…)
print(ep.model) # per-task public alias that got deployed
print(ep.url) # OpenAI-compatible inference URL

# Use it immediately — metered against your org balance.
resp = pa.chat.completions.create(
model=ep.id,
messages=[{"role": "user", "content": "Extract the parties and effective date."}],
)
print(resp.choices[0].message.content)

wait=False — stream deploy progress (default)

With wait=False (the default), deploy() returns an iterator of named progress events so you can render a deploy UI or log stages. Each event is a {"event": str, "data": dict} dict.

endpoint = None
for ev in pa.endpoints.deploy(task="contract-key-fields"):
if ev["event"] == "progress":
# data carries the deploy stage status, e.g. {"stage": "pulling weights", "pct": 45}
print("progress:", ev["data"])
elif ev["event"] == "complete":
endpoint = ev["data"]["endpoint"] # the live endpoint payload (dict)
print("live:", endpoint)
elif ev["event"] == "error":
# wait=True raises ParetaError for you; with wait=False you handle it.
raise RuntimeError(ev["data"].get("message", "deploy failed"))

The terminal event is "complete" (its data.endpoint is the live endpoint) or "error". Pick wait=True for scripts and notebooks; pick wait=False only when you want to surface live progress.

list()

endpoints.list() -> list[Endpoint]

Route: GET /v1/endpoints

Returns every endpoint your org can access, in any status.

for ep in pa.endpoints.list():
print(ep.id, ep.status, ep.task, ep.model)

For the OpenAI-compatible subset — only deployed, url-bearing endpoints, shaped as Model objects — use pa.models.list() instead.

retrieve()

endpoints.retrieve(endpoint_id: str) -> Endpoint

Route: GET /v1/endpoints/{endpoint_id}

Fetches one endpoint by id. Raises NotFoundError (404) for an unknown id.

ep = pa.endpoints.retrieve("ep_a1b2c3")
print(ep.is_live, ep.url)

start() / stop()

endpoints.start(endpoint_id: str) # POST /v1/endpoints/{id}/start
endpoints.stop(endpoint_id: str) # POST /v1/endpoints/{id}/stop

A stopped endpoint costs nothing to keep but cannot serve. stop() pauses spend; start() resumes a stopped endpoint.

pa.endpoints.stop("ep_a1b2c3") # pause a live endpoint
pa.endpoints.start("ep_a1b2c3") # resume a stopped one

While an endpoint is stopped or still cold, inference calls against it raise EndpointNotReadyError (503). After start(), poll retrieve(id).is_live before sending traffic.

import time

pa.endpoints.start("ep_a1b2c3")
while not pa.endpoints.retrieve("ep_a1b2c3").is_live:
time.sleep(3)

delete()

endpoints.delete(endpoint_id: str) -> None

Route: DELETE /v1/endpoints/{endpoint_id}

Removes an endpoint for good. Returns None.

pa.endpoints.delete("ep_a1b2c3")

metrics()

endpoints.metrics(endpoint_id: str) -> Metrics

Returns a Metrics handle for querying one endpoint's observability dimensions. This call does not hit the network — each dimension method does.

class Metrics:
def performance(self, **params) -> dict # GET /v1/endpoints/{id}/performance
def uptime(self, **params) -> dict # GET /v1/endpoints/{id}/uptime
def cost(self, **params) -> dict # GET /v1/endpoints/{id}/cost
def quality(self, **params) -> dict # GET /v1/endpoints/{id}/quality
def activity(self, **params) -> dict # GET /v1/endpoints/{id}/activity

Each method returns raw metric JSON (shapes vary by dimension; typed models arrive with the OpenAPI generation) and accepts arbitrary query params as keyword arguments, passed straight through to the query string.

m = pa.endpoints.metrics("ep_a1b2c3")

m.performance() # p50/p95/p99 latency
m.uptime() # availability
m.cost() # per-endpoint spend + vs-frontier savings
m.quality() # judge windows
m.activity() # usage stats

# Params pass through to the query string:
m.performance(window="24h")
m.cost(group_by="day")
MethodReturns
performance(**params)p50/p95/p99 latency
uptime(**params)availability metrics
cost(**params)per-endpoint spend and savings versus the frontier baseline
quality(**params)judge-window quality scores
activity(**params)usage stats

metrics(id).cost() is per-endpoint observability — it reports what this endpoint spent and how much you saved against a frontier vendor. It is not your account balance. Balance and top-up live in the dashboard only.

The Endpoint object

Returned by deploy(wait=True), retrieve(), and each element of list().

FieldTypeMeaning
idstr | NoneEndpoint id (== name) — pass as chat.completions.create(model=…)
namestr | NoneDisplay name
modelstr | NonePer-task public alias serving here
statusstr | None"live", "starting", "stopped", …
taskstr | NoneTask name
urlstr | NoneOpenAI-compatible inference URL
is_liveboolstatus == "live"

Every Endpoint keeps the full server record on to_dict(), so nothing the API returns is lost behind the typed fields.

End to end

Discover a task, deploy its recommended model, serve traffic, check cost, then tear down.

from pareta import Pareta

with Pareta.from_env() as pa:
# 1. Find a task for your intent.
match = pa.tasks.match("pull key fields out of contracts")
task_id = match.chosen.task_id # e.g. "contract-key-fields"

# 2. Deploy the recommended open model (no GPU knob).
ep = pa.endpoints.deploy(task=task_id, model="recommended", wait=True)
print(f"live: {ep.id} serving {ep.model}")

# 3. Run metered inference (debits the org balance).
resp = pa.chat.completions.create(
model=ep.id,
messages=[{"role": "user", "content": "Extract the governing-law clause."}],
)
print(resp.choices[0].message.content)

# 4. Check what it cost and how it performed.
print(pa.endpoints.metrics(ep.id).cost())

# 5. Stop it to pause spend (or delete it to remove it).
pa.endpoints.stop(ep.id)

Async

AsyncPareta mirrors the sync surface. deploy(), list(), retrieve(), start(), stop(), and delete() are async def. metrics(id) returns an AsyncMetrics handle synchronously (it is not a coroutine), and its dimension methods are awaitable.

from pareta import AsyncPareta

async with AsyncPareta.from_env() as pa:
# wait=True awaits the deploy and returns the live Endpoint.
ep = await pa.endpoints.deploy(task="contract-key-fields", wait=True)

# wait=False returns an async progress-event iterator.
async for ev in await pa.endpoints.deploy(task="contract-key-fields"):
if ev["event"] == "complete":
print("live:", ev["data"]["endpoint"])

m = pa.endpoints.metrics(ep.id) # NOT awaited — returns the handle
print(await m.performance()) # the dimension call IS awaited

await pa.endpoints.stop(ep.id)

Errors

ExceptionStatusWhen
BadRequestError400 / 422Unknown task, malformed deploy params
InsufficientCreditsError402Org balance empty when you run inference against the endpoint
NotFoundError404Unknown endpoint_id
ConflictError409Seed/legacy endpoint conflict, or transient contention
EndpointNotReadyError503Endpoint stopped, cold-starting, or provider down
ParetaErrorA deploy stream emitted an "error" event, or ended without "complete"

deploy(task="") raises ValueError before any request leaves the process.

  • tasks — find a task id and read its recommended model before you deploy.
  • chat — call chat.completions.create() against a deployed endpoint id.
  • models — list the OpenAI-compatible subset of deployed endpoints.
  • evals — pick the right open model for a task before you deploy it.
  • Deploying & operating endpoints — the narrative guide.
  • Errors & retries — the exception hierarchy and retry policy.