POST /v1/chat/completions

OpenAI-compatible chat completions. Drop-in replacement — point your existing OpenAI SDK at http://localhost:8080/v1 and it just works.

Source: src/api/routes/chat.ts · src/api/services/localRouter.ts

Endpoint

POST http://localhost:8080/v1/chat/completions

Auth

| Sent header | Treated as |
| --- | --- |
| (none) | Free tier — anonymous |
| `Authorization: Bearer qs_…` | Pro tier — token forwarded to LocoPilot Cloud on fallback |
| `Authorization: Bearer <anything else>` | Free tier (anonymous) — non-`qs_` tokens are ignored by the local auth middleware |
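
For example, a Pro key is sent as a standard bearer header; any HTTP client works, and the key value below is a placeholder:

```ts
// Plain fetch against the local endpoint. Omit the Authorization header
// entirely to be treated as Free tier.
const res = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${process.env.LOCOPILOT_KEY}`, // a qs_… Pro key
  },
  body: JSON.stringify({
    model: 'llama3:8b',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});
console.log(await res.json());
```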

Request body (validated by Fastify schema)

| Field | Required | Constraint |
| --- | --- | --- |
| `model` | yes | string, 1–256 chars |
| `messages` | yes | 1–500 items, each `{ role: "system" \| "user" \| "assistant", content: string ≤ 128 KB }` |
| `stream` | no | boolean (default `false`) |
| `temperature` | no | number, 0–2 |
| `max_tokens` | no | integer, 1–65536 |
Example:

```json
{
  "model": "llama3:8b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "stream": true,
  "temperature": 0.7,
  "max_tokens": 1024
}
```

Bodies that violate the schema return 400 from Fastify before reaching the handler.

> **Note:** The role enum is exactly `system | user | assistant`. Tool/function-calling roles are not part of v1.0 of the public CLI.
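
For reference, here is a sketch of a Fastify body schema that would enforce the documented constraints. It is a reconstruction from the limits above, not the literal schema in `src/api/routes/chat.ts`, and it treats 128 KB as 131072 characters:

```ts
// Hypothetical reconstruction of the validation schema; the real one lives in
// src/api/routes/chat.ts and may differ in detail.
const chatCompletionsBodySchema = {
  type: 'object',
  required: ['model', 'messages'],
  properties: {
    model: { type: 'string', minLength: 1, maxLength: 256 },
    messages: {
      type: 'array',
      minItems: 1,
      maxItems: 500,
      items: {
        type: 'object',
        required: ['role', 'content'],
        properties: {
          role: { enum: ['system', 'user', 'assistant'] },
          content: { type: 'string', maxLength: 131072 }, // ≤ 128 KB
        },
      },
    },
    stream: { type: 'boolean', default: false },
    temperature: { type: 'number', minimum: 0, maximum: 2 },
    max_tokens: { type: 'integer', minimum: 1, maximum: 65536 },
  },
} as const;
```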

Routing logic

Source: `localRouter.resolve()`:

  1. Fetch the full Ollama model list (`GET <OLLAMA_HOST>/api/tags`).
  2. Exact match — if any local model's `name === requestedModel` → `PROVIDERS.LOCAL`.
  3. Prefix match — if any local model's `name.startsWith(requestedModel.split(':')[0])` → `PROVIDERS.LOCAL`.
  4. No match — return `PROVIDERS.REMOTE` for Pro users, `PROVIDERS.NOT_FOUND` for Free users.
  5. If Ollama itself is unreachable, the same Pro/Free fall-through applies (see the sketch after this list).
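
A condensed sketch of that decision flow (simplified; the `PROVIDERS` values are illustrative and error handling is abbreviated):

```ts
// Simplified sketch of src/api/services/localRouter.ts; not the actual source.
const PROVIDERS = { LOCAL: 'local', REMOTE: 'remote', NOT_FOUND: 'not_found' } as const;

async function resolve(requestedModel: string, isPro: boolean): Promise<string> {
  try {
    // 1. Fetch the full local model list from Ollama.
    const res = await fetch(`${process.env.OLLAMA_HOST}/api/tags`);
    const { models } = (await res.json()) as { models: { name: string }[] };

    // 2. Exact match wins.
    if (models.some((m) => m.name === requestedModel)) return PROVIDERS.LOCAL;

    // 3. Prefix match on the base name (everything before the ':' tag).
    const base = requestedModel.split(':')[0];
    if (models.some((m) => m.name.startsWith(base))) return PROVIDERS.LOCAL;
  } catch {
    // 5. Ollama unreachable: same Pro/Free fall-through as "no match".
  }
  // 4. No local match: remote for Pro, not-found for Free.
  return isPro ? PROVIDERS.REMOTE : PROVIDERS.NOT_FOUND;
}
```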

When the resolved provider is LOCAL, the API streams from Ollama. When it's REMOTE, the request is proxied to POST /api/inference on LocoPilot Cloud (which routes to RunPod). When it's NOT_FOUND, the handler returns:

```http
HTTP/1.1 404 Not Found

{
  "error": "model_not_found",
  "message": "Model 'llama3:8b' is not available locally. Upgrade to Pro for remote GPU access."
}
```

Streaming response (SSE)

When "stream": true:

```
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"!"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
```

The stream is served with `Content-Type: text/event-stream`; `data: [DONE]` is the OpenAI-spec termination sentinel.
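
Without an SDK, the stream can be consumed with plain `fetch`. A minimal Node 18+ sketch that prints delta content and skips the `[DONE]` sentinel:

```ts
// Minimal SSE consumer; no SDK required (Node 18+ fetch and web streams).
const res = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3:8b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true,
  }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop()!; // keep a possibly incomplete trailing line
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') continue; // end-of-stream sentinel
    const delta = JSON.parse(payload).choices[0]?.delta?.content;
    if (delta) process.stdout.write(delta);
  }
}
```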

Non-streaming response

When "stream" is absent or false:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1746820000,
  "model": "llama3:8b",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "Hello! How can I help?" },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21 }
}
```

Error shapes

| Status | Body | When |
| --- | --- | --- |
| 400 | `{ "error": "<schema error>" }` | Body fails Fastify schema validation (missing field, type mismatch, content too long, …) |
| 403 | `{ "error": "pro_subscription_required", "message": "...", "upgrade_url": "..." }` | Pro fallback rejected by the cloud (subscription expired, past-due, or canceled) |
| 404 | `{ "error": "model_not_found", "message": "..." }` | Free tier and the requested model isn't pulled into Ollama |
| 429 | `{ "error": "Rate limit exceeded", "retryAfter": <seconds> }` | Rate limit exceeded (Pro only; the Free tier is not rate-limited locally, and the cloud enforces the real limit) |
| 503 | `{ "error": "Provider unavailable" }` | Local provider failed and no fallback succeeded |

All errors include Content-Type: application/json.
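
Because the `error` field is machine-readable, clients can branch on status and body rather than parsing prose. A sketch of that handling:

```ts
// Sketch of client-side handling keyed on the documented error shapes.
const res = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'llama3:8b', messages: [{ role: 'user', content: 'Hi' }] }),
});
if (!res.ok) {
  const body = await res.json();
  if (res.status === 404) {
    console.error(body.message); // pull the model locally or upgrade to Pro
  } else if (res.status === 429) {
    await new Promise((r) => setTimeout(r, body.retryAfter * 1000)); // then retry
  } else {
    throw new Error(body.message ?? body.error);
  }
}
```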

Rate limiting

`src/api/middleware/rateLimiter.ts` is a per-key in-memory token bucket:

  • Free tier (no API key) — the middleware short-circuits with `if (!apiKey) return;`, so there is no local rate limit.
  • Pro tier — the local stub assigns 9999 rpm. The real limit is enforced server-side by LocoPilot Cloud.

The headers `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` are set on every Pro response; a sketch of the middleware shape follows.
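
This is illustrative only: it simplifies the bucket to a fixed one-minute window, and the real middleware may differ in detail.

```ts
import type { FastifyReply, FastifyRequest } from 'fastify';

// Illustrative per-key limiter; numbers and shapes follow the docs, not the source.
const buckets = new Map<string, { tokens: number; resetAt: number }>();
const PRO_LIMIT = 9999; // rpm assigned by the local stub

export function rateLimiter(req: FastifyRequest, reply: FastifyReply, done: () => void) {
  // Assumes the auth middleware has already discarded non-qs_ tokens.
  const apiKey = (req.headers.authorization ?? '').replace('Bearer ', '');
  if (!apiKey) return done(); // Free tier: no local rate limit

  const now = Date.now();
  let bucket = buckets.get(apiKey);
  if (!bucket || now >= bucket.resetAt) {
    bucket = { tokens: PRO_LIMIT, resetAt: now + 60_000 };
    buckets.set(apiKey, bucket);
  }
  if (bucket.tokens <= 0) {
    reply.code(429).send({
      error: 'Rate limit exceeded',
      retryAfter: Math.ceil((bucket.resetAt - now) / 1000),
    });
    return;
  }
  bucket.tokens -= 1;
  reply.header('X-RateLimit-Limit', PRO_LIMIT);
  reply.header('X-RateLimit-Remaining', bucket.tokens);
  reply.header('X-RateLimit-Reset', Math.ceil(bucket.resetAt / 1000));
  done();
}
```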

SDK examples

Node.js

```ts
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: process.env.LOCOPILOT_KEY ?? 'not-needed',
});

const res = await client.chat.completions.create({
  model: 'llama3:8b',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(res.choices[0].message.content);
```
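
Streaming works through the same client: pass `stream: true` and iterate the returned async iterable of chunks.

```ts
// Continues the example above: print deltas instead of waiting for the full reply.
const stream = await client.chat.completions.create({
  model: 'llama3:8b',
  messages: [{ role: 'user', content: 'Hello!' }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```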

Python

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

res = client.chat.completions.create(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(res.choices[0].message.content)
```

LangChain

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    model="llama3:8b",
)
print(llm.invoke("Hello!").content)
```

curl

```bash
curl -N http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3:8b",
    "messages": [{ "role": "user", "content": "Hello!" }],
    "stream": true
  }'
```

Usage tracking

The local API computes per-request usage metrics (tokensIn, tokensOut, latencyMs, ttfbMs, status) but the v1.0 public CLI uses a no-op tracker locally — see src/api/services/localStubs.ts. The inference_logs SQLite table is created by locopilot init for forward compatibility but is not written to. Pro-tier usage is metered server-side by LocoPilot Cloud and reported by locopilot usage.