POST /v1/chat/completions
OpenAI-compatible chat completions. Drop-in replacement — point your existing OpenAI SDK at http://localhost:8080/v1 and it just works.
Source: src/api/routes/chat.ts · src/api/services/localRouter.ts
Endpoint
POST http://localhost:8080/v1/chat/completions
Auth
| Sent header | Treated as |
|---|---|
| (none) | Free tier — anonymous |
| `Authorization: Bearer qs_…` | Pro tier — token forwarded to LocoPilot Cloud on fallback |
| `Authorization: Bearer <anything else>` | Free tier (anonymous) — non-`qs_` tokens are ignored by the local auth middleware |
Request body (validated by Fastify schema)
| Field | Required | Constraint |
|---|---|---|
| `model` | ✅ | string, 1–256 chars |
| `messages` | ✅ | 1–500 items, each `{ role: "system" \| "user" \| "assistant", content: string ≤ 128 KB }` |
| `stream` | ❌ | boolean (default `false`) |
| `temperature` | ❌ | number, 0–2 |
| `max_tokens` | ❌ | integer, 1–65536 |
{
"model": "llama3:8b",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello!" }
],
"stream": true,
"temperature": 0.7,
"max_tokens": 1024
}
Bodies that violate the schema return 400 from Fastify before reaching the handler.
The role enum is exactly system | user | assistant. Tool/function-calling roles are not part of v1.0 of the public CLI.
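The constraints above can be expressed as a Fastify-style JSON Schema. The sketch below is illustrative — it mirrors the table, not the literal contents of `src/api/routes/chat.ts`:

```typescript
// Hypothetical body schema implied by the table above. Fastify validates the
// request against a schema like this and returns 400 before the handler runs.
const chatCompletionBodySchema = {
  type: "object",
  required: ["model", "messages"],
  properties: {
    model: { type: "string", minLength: 1, maxLength: 256 },
    messages: {
      type: "array",
      minItems: 1,
      maxItems: 500,
      items: {
        type: "object",
        required: ["role", "content"],
        properties: {
          role: { type: "string", enum: ["system", "user", "assistant"] },
          content: { type: "string", maxLength: 131072 }, // 128 KB
        },
      },
    },
    stream: { type: "boolean", default: false },
    temperature: { type: "number", minimum: 0, maximum: 2 },
    max_tokens: { type: "integer", minimum: 1, maximum: 65536 },
  },
};
```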
Routing logic
Source: localRouter.resolve():
- Fetch the full Ollama model list (`GET <OLLAMA_HOST>/api/tags`).
- Exact match — if any local model's `name === requestedModel` → `PROVIDERS.LOCAL`.
- Prefix match — if any local model's `name.startsWith(requestedModel.split(':')[0])` → `PROVIDERS.LOCAL`.
- No match — return `PROVIDERS.REMOTE` for Pro users, `PROVIDERS.NOT_FOUND` for Free users.
- If Ollama itself is unreachable, the same Pro/Free fall-through applies.
When the resolved provider is LOCAL, the API streams from Ollama. When it's REMOTE, the request is proxied to POST /api/inference on LocoPilot Cloud (which routes to RunPod). When it's NOT_FOUND, the handler returns:
HTTP/1.1 404 Not Found
{
"error": "model_not_found",
"message": "Model 'llama3:8b' is not available locally. Upgrade to Pro for remote GPU access."
}
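The decision tree above can be sketched as a pure function. Names here (`resolveProvider`, the `localModels` parameter) are illustrative — the real logic lives in `localRouter.resolve()` and fetches the model list itself:

```typescript
// Sketch of the routing decision, under the assumption that the caller has
// already fetched the Ollama model list (null = Ollama unreachable).
const PROVIDERS = { LOCAL: "LOCAL", REMOTE: "REMOTE", NOT_FOUND: "NOT_FOUND" } as const;
type Provider = (typeof PROVIDERS)[keyof typeof PROVIDERS];

function resolveProvider(
  requestedModel: string,
  localModels: string[] | null,
  isPro: boolean,
): Provider {
  if (localModels !== null) {
    // Exact match wins.
    if (localModels.some((name) => name === requestedModel)) return PROVIDERS.LOCAL;
    // Prefix match on the base name before the ":tag" suffix.
    const base = requestedModel.split(":")[0];
    if (localModels.some((name) => name.startsWith(base))) return PROVIDERS.LOCAL;
  }
  // No match, or Ollama is down: Pro falls back to the cloud, Free gets a 404.
  return isPro ? PROVIDERS.REMOTE : PROVIDERS.NOT_FOUND;
}
```

Note the prefix rule means requesting `llama3` resolves locally when any `llama3:*` tag is pulled.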
Streaming response (SSE)
When "stream": true:
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"!"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Content-Type: text/event-stream. data: [DONE] is the OpenAI-spec sentinel.
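A minimal consumer of this stream splits on `data: ` lines, stops at the `[DONE]` sentinel, and concatenates `delta.content`. The sketch below assumes chunk shapes exactly as shown and whole lines as input; a production client must also buffer partial lines across network reads:

```typescript
// Collect the assistant text from a sequence of SSE lines as emitted above.
function collectStream(sseLines: string[]): string {
  let text = "";
  for (const line of sseLines) {
    if (!line.startsWith("data: ")) continue; // skip blank/comment lines
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") break; // OpenAI-spec end-of-stream sentinel
    const chunk = JSON.parse(payload);
    text += chunk.choices?.[0]?.delta?.content ?? ""; // final chunk has an empty delta
  }
  return text;
}
```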
Non-streaming response
When "stream" is absent or false:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1746820000,
"model": "llama3:8b",
"choices": [{
"index": 0,
"message": { "role": "assistant", "content": "Hello! How can I help?" },
"finish_reason": "stop"
}],
"usage": { "prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21 }
}
Error shapes
| Status | Body | When |
|---|---|---|
| 400 | `{ "error": "<schema error>" }` | Body fails Fastify schema validation (missing field, type mismatch, content too long, …) |
| 403 | `{ "error": "pro_subscription_required", "message": "...", "upgrade_url": "..." }` | Pro fallback rejected by the cloud (subscription expired / past-due / canceled) |
| 404 | `{ "error": "model_not_found", "message": "..." }` | Free tier and the requested model isn't pulled into Ollama |
| 429 | `{ "error": "Rate limit exceeded", "retryAfter": <seconds> }` | Rate limit exceeded (Pro only — Free tier is currently unrate-limited locally; the cloud enforces the real limit) |
| 503 | `{ "error": "Provider unavailable" }` | Local provider failed and no fallback succeeded |
All errors include Content-Type: application/json.
Rate limiting
src/api/middleware/rateLimiter.ts is a per-key in-memory token bucket:
- Free tier (no API key) — the middleware short-circuits with `if (!apiKey) return;`, so there is no local rate limit.
- Pro tier — the local stub assigns 9999 rpm. The real limit is enforced server-side by LocoPilot Cloud.
Headers X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset are set on every Pro response.
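For reference, a per-key in-memory token bucket of the kind described above can be sketched as follows. This is illustrative only — the real middleware is `src/api/middleware/rateLimiter.ts` and its refill strategy may differ:

```typescript
// Illustrative token bucket: `capacity` requests per minute, refilled
// continuously in proportion to elapsed time.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private readonly capacity: number, // requests per minute
    now: number = Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns true if the request is allowed and consumes one token.
  allow(now: number = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * (this.capacity / 60));
    this.lastRefill = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```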
SDK examples
Node.js
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'http://localhost:8080/v1',
apiKey: process.env.LOCOPILOT_KEY ?? 'not-needed',
});
const res = await client.chat.completions.create({
model: 'llama3:8b',
messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(res.choices[0].message.content);
Python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed",
)
res = client.chat.completions.create(
model="llama3:8b",
messages=[{"role": "user", "content": "Hello!"}],
)
print(res.choices[0].message.content)
LangChain
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed",
model="llama3:8b",
)
print(llm.invoke("Hello!").content)
curl
curl -N http://localhost:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3:8b",
"messages": [{ "role": "user", "content": "Hello!" }],
"stream": true
}'
Usage tracking
The local API computes per-request usage metrics (tokensIn, tokensOut, latencyMs, ttfbMs, status) but the v1.0 public CLI uses a no-op tracker locally — see src/api/services/localStubs.ts. The inference_logs SQLite table is created by locopilot init for forward compatibility but is not written to. Pro-tier usage is metered server-side by LocoPilot Cloud and reported by locopilot usage.