POST /v1/chat/completions

OpenAI-compatible chat completions. Drop-in replacement — point your existing OpenAI SDK at http://localhost:8080/v1 and it just works.

Source: src/api/routes/chat.ts · src/api/services/localRouter.ts

Endpoint

POST http://localhost:8080/v1/chat/completions

Auth

| Sent header | Treated as |
| --- | --- |
| (none) | Free tier — anonymous |
| `Authorization: Bearer qs_…` | Pro tier — token forwarded to LocoPilot Cloud on fallback |
| `Authorization: Bearer <anything else>` | Free tier (anonymous) — non-`qs_` tokens are ignored by the local auth middleware |
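
For example, a Pro key is sent as a standard bearer header; any HTTP client works, and the key value below is a placeholder:

```ts
// Plain fetch against the local endpoint. Omit the Authorization header
// entirely to be treated as Free tier.
const res = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${process.env.LOCOPILOT_KEY}`, // a qs_… Pro key
  },
  body: JSON.stringify({
    model: 'llama3:8b',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});
console.log(await res.json());
```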

Request body (validated by Fastify schema)

| Field | Required | Constraint |
| --- | --- | --- |
| `model` | yes | string, 1–256 chars |
| `messages` | yes | 1–500 items, each `{ role: "system" \| "user" \| "assistant", content: string ≤ 128 KB }` |
| `stream` | no | boolean (default `false`) |
| `temperature` | no | number, 0–2 |
| `max_tokens` | no | integer, 1–65536 |
Example:

```json
{
  "model": "llama3:8b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "stream": true,
  "temperature": 0.7,
  "max_tokens": 1024
}
```

Bodies that violate the schema return 400 from Fastify before reaching the handler.

> **Note:** The role enum is exactly `system | user | assistant`. Tool/function-calling roles are not part of v1.0 of the public CLI.
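
For reference, here is a sketch of a Fastify body schema that would enforce the documented constraints. It is a reconstruction from the limits above, not the literal schema in `src/api/routes/chat.ts`, and it treats 128 KB as 131072 characters:

```ts
// Hypothetical reconstruction of the validation schema; the real one lives in
// src/api/routes/chat.ts and may differ in detail.
const chatCompletionsBodySchema = {
  type: 'object',
  required: ['model', 'messages'],
  properties: {
    model: { type: 'string', minLength: 1, maxLength: 256 },
    messages: {
      type: 'array',
      minItems: 1,
      maxItems: 500,
      items: {
        type: 'object',
        required: ['role', 'content'],
        properties: {
          role: { enum: ['system', 'user', 'assistant'] },
          content: { type: 'string', maxLength: 131072 }, // ≤ 128 KB
        },
      },
    },
    stream: { type: 'boolean', default: false },
    temperature: { type: 'number', minimum: 0, maximum: 2 },
    max_tokens: { type: 'integer', minimum: 1, maximum: 65536 },
  },
} as const;
```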

Routing logic

Source: `localRouter.resolve()`:

  1. Fetch the full Ollama model list (`GET <OLLAMA_HOST>/api/tags`).
  2. Exact match — if any local model's `name === requestedModel` → `PROVIDERS.LOCAL`.
  3. Prefix match — if any local model's `name.startsWith(requestedModel.split(':')[0])` → `PROVIDERS.LOCAL`.
  4. No match — return `PROVIDERS.REMOTE` for Pro users, `PROVIDERS.NOT_FOUND` for Free users.
  5. If Ollama itself is unreachable, the same Pro/Free fall-through applies (see the sketch after this list).
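
A condensed sketch of that decision flow (simplified; the `PROVIDERS` values are illustrative and error handling is abbreviated):

```ts
// Simplified sketch of src/api/services/localRouter.ts; not the actual source.
const PROVIDERS = { LOCAL: 'local', REMOTE: 'remote', NOT_FOUND: 'not_found' } as const;

async function resolve(requestedModel: string, isPro: boolean): Promise<string> {
  try {
    // 1. Fetch the full local model list from Ollama.
    const res = await fetch(`${process.env.OLLAMA_HOST}/api/tags`);
    const { models } = (await res.json()) as { models: { name: string }[] };

    // 2. Exact match wins.
    if (models.some((m) => m.name === requestedModel)) return PROVIDERS.LOCAL;

    // 3. Prefix match on the base name (everything before the ':' tag).
    const base = requestedModel.split(':')[0];
    if (models.some((m) => m.name.startsWith(base))) return PROVIDERS.LOCAL;
  } catch {
    // 5. Ollama unreachable: same Pro/Free fall-through as "no match".
  }
  // 4. No local match: remote for Pro, not-found for Free.
  return isPro ? PROVIDERS.REMOTE : PROVIDERS.NOT_FOUND;
}
```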

When the resolved provider is LOCAL, the API streams from Ollama. When it's REMOTE, the request is proxied to POST /api/inference on LocoPilot Cloud (which routes to RunPod). When it's NOT_FOUND, the handler returns:

```http
HTTP/1.1 404 Not Found

{
  "error": "model_not_found",
  "message": "Model 'llama3:8b' is not available locally. Upgrade to Pro for remote GPU access."
}
```

Streaming response (SSE)

When "stream": true:

```
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"!"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
```

The stream is served with `Content-Type: text/event-stream`; `data: [DONE]` is the OpenAI-spec termination sentinel.
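
Without an SDK, the stream can be consumed with plain `fetch`. A minimal Node 18+ sketch that prints delta content and skips the `[DONE]` sentinel:

```ts
// Minimal SSE consumer; no SDK required (Node 18+ fetch and web streams).
const res = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3:8b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true,
  }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop()!; // keep a possibly incomplete trailing line
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') continue; // end-of-stream sentinel
    const delta = JSON.parse(payload).choices[0]?.delta?.content;
    if (delta) process.stdout.write(delta);
  }
}
```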

Non-streaming response

When "stream" is absent or false:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1746820000,
  "model": "llama3:8b",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "Hello! How can I help?" },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21 }
}
```

Error shapes

| Status | Body | When |
| --- | --- | --- |
| 400 | `{ "error": "<schema error>" }` | Body fails Fastify schema validation (missing field, type mismatch, content too long, …) |
| 403 | `{ "error": "pro_subscription_required", "message": "...", "upgrade_url": "..." }` | Pro fallback rejected by the cloud (subscription expired, past-due, or canceled) |
| 404 | `{ "error": "model_not_found", "message": "..." }` | Free tier and the requested model isn't pulled into Ollama |
| 429 | `{ "error": "Rate limit exceeded", "retryAfter": <seconds> }` | Rate limit exceeded (Pro only; the Free tier is not rate-limited locally, and the cloud enforces the real limit) |
| 503 | `{ "error": "Provider unavailable" }` | Local provider failed and no fallback succeeded |

All errors include Content-Type: application/json.
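
Because the `error` field is machine-readable, clients can branch on status and body rather than parsing prose. A sketch of that handling:

```ts
// Sketch of client-side handling keyed on the documented error shapes.
const res = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'llama3:8b', messages: [{ role: 'user', content: 'Hi' }] }),
});
if (!res.ok) {
  const body = await res.json();
  if (res.status === 404) {
    console.error(body.message); // pull the model locally or upgrade to Pro
  } else if (res.status === 429) {
    await new Promise((r) => setTimeout(r, body.retryAfter * 1000)); // then retry
  } else {
    throw new Error(body.message ?? body.error);
  }
}
```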

Rate limiting

`src/api/middleware/rateLimiter.ts` is a per-key in-memory token bucket:

  • Free tier (no API key) — the middleware short-circuits with `if (!apiKey) return;`, so there is no local rate limit.
  • Pro tier — the local stub assigns 9999 rpm. The real limit is enforced server-side by LocoPilot Cloud.

The headers `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` are set on every Pro response; a sketch of the middleware shape follows.
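
This is illustrative only: it simplifies the bucket to a fixed one-minute window, and the real middleware may differ in detail.

```ts
import type { FastifyReply, FastifyRequest } from 'fastify';

// Illustrative per-key limiter; numbers and shapes follow the docs, not the source.
const buckets = new Map<string, { tokens: number; resetAt: number }>();
const PRO_LIMIT = 9999; // rpm assigned by the local stub

export function rateLimiter(req: FastifyRequest, reply: FastifyReply, done: () => void) {
  // Assumes the auth middleware has already discarded non-qs_ tokens.
  const apiKey = (req.headers.authorization ?? '').replace('Bearer ', '');
  if (!apiKey) return done(); // Free tier: no local rate limit

  const now = Date.now();
  let bucket = buckets.get(apiKey);
  if (!bucket || now >= bucket.resetAt) {
    bucket = { tokens: PRO_LIMIT, resetAt: now + 60_000 };
    buckets.set(apiKey, bucket);
  }
  if (bucket.tokens <= 0) {
    reply.code(429).send({
      error: 'Rate limit exceeded',
      retryAfter: Math.ceil((bucket.resetAt - now) / 1000),
    });
    return;
  }
  bucket.tokens -= 1;
  reply.header('X-RateLimit-Limit', PRO_LIMIT);
  reply.header('X-RateLimit-Remaining', bucket.tokens);
  reply.header('X-RateLimit-Reset', Math.ceil(bucket.resetAt / 1000));
  done();
}
```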

SDK examples

Node.js

```ts
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: process.env.LOCOPILOT_KEY ?? 'not-needed',
});

const res = await client.chat.completions.create({
  model: 'llama3:8b',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(res.choices[0].message.content);
```
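
Streaming works through the same client: pass `stream: true` and iterate the returned async iterable of chunks.

```ts
// Continues the example above: print deltas instead of waiting for the full reply.
const stream = await client.chat.completions.create({
  model: 'llama3:8b',
  messages: [{ role: 'user', content: 'Hello!' }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```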

Python

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

res = client.chat.completions.create(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(res.choices[0].message.content)
```

LangChain

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    model="llama3:8b",
)
print(llm.invoke("Hello!").content)
```

curl

```bash
curl -N http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3:8b",
    "messages": [{ "role": "user", "content": "Hello!" }],
    "stream": true
  }'
```

Usage tracking

The local API computes per-request usage metrics (tokensIn, tokensOut, latencyMs, ttfbMs, status) but the v1.0 public CLI uses a no-op tracker locally — see src/api/services/localStubs.ts. The inference_logs SQLite table is created by locopilot init for forward compatibility but is not written to. Pro-tier usage is metered server-side by LocoPilot Cloud and reported by locopilot usage.