Datasets

LocoPilot accepts two JSONL formats out of the box. Each line in the file is a single training example. The validator runs before the job is enqueued — a malformed dataset fails in milliseconds rather than after a long training run.

Source: src/training/validator.ts

Alpaca (instruction-tuning)

Single-turn instruction → response pairs.

{"instruction": "Translate to French.", "input": "Hello, how are you?", "output": "Bonjour, comment ça va ?"}
{"instruction": "Summarise this article in one sentence.", "input": "...", "output": "..."}
{"instruction": "Write a haiku about debugging.", "input": "", "output": "Bug found at midnight\nStack trace points to line forty\nSemicolon mocks me"}

A row is recognised as Alpaca when:

Field	Required	Constraint
`instruction`	✅	non-empty string
`output`	✅	non-empty string
`input`	❌	optional — empty string is fine
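The table above can be read as a type guard. The following is a minimal sketch, not the code in src/training/validator.ts: the names `AlpacaRow`, `isNonEmptyString`, and `isAlpacaRow` are hypothetical, and it assumes that "non-empty" means a length greater than zero (no trimming) and that a non-string `input` is rejected.

```typescript
// Hypothetical types and helpers for illustration; not LocoPilot's actual API.
interface AlpacaRow {
  instruction: string;
  output: string;
  input?: string;
}

function isNonEmptyString(v: unknown): v is string {
  // Assumption: whitespace-only strings still count as non-empty.
  return typeof v === "string" && v.length > 0;
}

function isAlpacaRow(row: unknown): row is AlpacaRow {
  if (typeof row !== "object" || row === null) return false;
  const r = row as Record<string, unknown>;
  // `instruction` and `output` are required and must be non-empty strings.
  if (!isNonEmptyString(r["instruction"]) || !isNonEmptyString(r["output"])) {
    return false;
  }
  // `input` is optional; when present, an empty string is acceptable.
  return r["input"] === undefined || typeof r["input"] === "string";
}
```

A guard like this can be run over every parsed line before submitting a job, which covers rows the built-in validator does not look at.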

ShareGPT (multi-turn chat)

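Used for chat / conversation fine-tuning. The examples below are wrapped across several lines for readability; in the dataset file each conversation must occupy a single line.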

{"conversations": [
  {"from": "system",  "value": "You are a helpful coding assistant."},
  {"from": "human",   "value": "How do I sort a list in Python?"},
  {"from": "gpt",     "value": "Use the sorted() built-in: `sorted(my_list)`."}
]}
{"conversations": [
  {"from": "human", "value": "What's 2+2?"},
  {"from": "gpt",   "value": "4."}
]}

A row is recognised as ShareGPT when:

Field	Required	Constraint
`conversations`	✅	non-empty array
`conversations[0].from`	✅	non-empty string (`system`, `human`, and `gpt` are the conventional values)
`conversations[0].value`	✅	non-empty string

The validator only inspects the first turn of a conversation. Downstream training adapters expect every turn to follow the same shape, so a malformed later turn passes validation and surfaces as a runtime error during training.
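Because only the first turn is inspected, it can be worth checking every turn yourself before submitting a job. The following is a minimal sketch under stated assumptions: `Turn`, `isTurn`, `looksLikeShareGpt`, and `findMalformedTurns` are hypothetical names, and "non-empty" is taken to mean a length greater than zero.

```typescript
// Hypothetical helpers for illustration; not LocoPilot's actual API.
interface Turn {
  from: string;
  value: string;
}

function isNonEmptyString(v: unknown): v is string {
  return typeof v === "string" && v.length > 0;
}

function isTurn(t: unknown): t is Turn {
  if (typeof t !== "object" || t === null) return false;
  const turn = t as Record<string, unknown>;
  return isNonEmptyString(turn["from"]) && isNonEmptyString(turn["value"]);
}

// Mirrors the documented recognition rule: only conversations[0] is inspected.
function looksLikeShareGpt(row: unknown): boolean {
  if (typeof row !== "object" || row === null) return false;
  const conv = (row as Record<string, unknown>)["conversations"];
  return Array.isArray(conv) && conv.length > 0 && isTurn(conv[0]);
}

// Stricter pre-flight check: returns the zero-based indices of malformed turns.
function findMalformedTurns(conversations: unknown[]): number[] {
  const bad: number[] = [];
  conversations.forEach((turn, index) => {
    if (!isTurn(turn)) bad.push(index);
  });
  return bad;
}
```

A row whose second turn lacks `value` is still recognised as ShareGPT by the first-turn rule, while `findMalformedTurns` reports index 1 for it.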

Validation rules

validateDataset() enforces:

Rule	Constant
Minimum 10 examples (lines)	`MIN_DATASET_EXAMPLES = 10`
Inspect first 5 lines for shape & format consistency	`DATASET_VALIDATION_LINES = 5`
First valid line decides the format; subsequent lines must use the same format	—

Rows after the first 5 count toward the minimum but are not inspected for shape. The assumption is that homogeneous JSONL is produced by tooling, not handwritten.
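The rules above can be sketched as a single function over the file contents. This is an illustration, not the real `validateDataset()`: the function names `detectFormat` and `validateJsonl` are hypothetical, the order of the checks is an assumption, blank lines are assumed to count toward the line total, and the error wording is abbreviated.

```typescript
// Constants as documented in the rules table.
const MIN_DATASET_EXAMPLES = 10;
const DATASET_VALIDATION_LINES = 5;

type Format = "alpaca" | "sharegpt";

function nonEmpty(v: unknown): boolean {
  return typeof v === "string" && v.length > 0;
}

// Hypothetical helper: classify one parsed row, or return null if neither shape matches.
function detectFormat(row: unknown): Format | null {
  if (typeof row !== "object" || row === null) return null;
  const r = row as Record<string, unknown>;
  if (nonEmpty(r["instruction"]) && nonEmpty(r["output"])) return "alpaca";
  const conv = r["conversations"];
  if (Array.isArray(conv) && conv.length > 0) {
    const first: unknown = conv[0];
    if (typeof first === "object" && first !== null) {
      const t = first as Record<string, unknown>;
      if (nonEmpty(t["from"]) && nonEmpty(t["value"])) return "sharegpt";
    }
  }
  return null;
}

// Hypothetical sketch of the validation flow; throws on the first problem found.
function validateJsonl(text: string): Format {
  if (text.trim() === "") throw new Error("Dataset file is empty");
  // Drop one trailing newline so it does not count as an extra line.
  const lines = text.replace(/\r?\n$/, "").split(/\r?\n/);
  if (lines.length < MIN_DATASET_EXAMPLES) {
    throw new Error(
      `Dataset too small: found ${lines.length} examples, minimum ${MIN_DATASET_EXAMPLES} required`,
    );
  }
  let format: Format | null = null;
  let lineNo = 0;
  // Only the first DATASET_VALIDATION_LINES lines are inspected.
  for (const line of lines.slice(0, DATASET_VALIDATION_LINES)) {
    lineNo += 1;
    if (line.trim() === "") throw new Error(`Line ${lineNo}: empty line in dataset`);
    let row: unknown;
    try {
      row = JSON.parse(line);
    } catch {
      throw new Error(`Line ${lineNo}: invalid JSON`);
    }
    const found = detectFormat(row);
    if (found === null) throw new Error(`Line ${lineNo}: unrecognised format`);
    if (format === null) {
      format = found; // the first valid line decides the format
    } else if (found !== format) {
      throw new Error(`Line ${lineNo}: mixed formats`);
    }
  }
  if (format === null) throw new Error("Dataset file is empty");
  return format;
}
```

In this sketch a broken row on line 8 goes unnoticed, which matches the documented behaviour of checking only the first 5 lines.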

Validation errors

Every error includes the offending line number when applicable:

Message	Cause
`Dataset not found: <path>`	File doesn't exist
`Dataset file is empty`	File is zero-byte or only whitespace
`Dataset too small — found N examples, minimum 10 required`	Fewer than 10 lines
`Line N: empty line in dataset`	Blank line within the first 5
`Line N: invalid JSON — <preview>...`	Line is not valid JSON
`Line N: unrecognised format. Expected alpaca (instruction/output) or sharegpt (conversations[])`	Object doesn't match either shape
`Line N: mixed formats — line 1 is alpaca but line N is sharegpt. Use a single format throughout.`	First line was Alpaca, later line was ShareGPT (or vice versa)

Recommended sizes

Goal	Rows
Quick sanity check	10 – 50
Format / tone tweak	200 – 2,000
Domain adaptation	5,000 – 50,000
New capability	50,000+

QLoRA-style training (which Unsloth and Axolotl both use under the hood with the default loraR: 8) is sample-efficient — you usually don't need anywhere near as much data as you would for full fine-tuning.

Build a dataset programmatically

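// TypeScript (Node.js)
import { mkdirSync, writeFileSync } from 'fs';

const rows = [
  { instruction: 'Translate to French.', input: 'Hello',   output: 'Bonjour'   },
  { instruction: 'Translate to French.', input: 'Goodbye', output: 'Au revoir' },
  // … at least 10 rows
];

// Create the output directory if it does not exist yet.
mkdirSync('data', { recursive: true });

// One JSON object per line, with a trailing newline.
writeFileSync(
  'data/translations.jsonl',
  rows.map(r => JSON.stringify(r)).join('\n') + '\n',
);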

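# Python
import json
import os

rows = [
    {"instruction": "Translate to French.", "input": "Hello",   "output": "Bonjour"},
    {"instruction": "Translate to French.", "input": "Goodbye", "output": "Au revoir"},
    # … at least 10 rows
]

# Create the output directory if it does not exist yet.
os.makedirs("data", exist_ok=True)

# Write UTF-8 explicitly; ensure_ascii=False keeps accented characters readable.
with open("data/translations.jsonl", "w", encoding="utf-8") as f:
    for r in rows:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")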
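The same approach works for ShareGPT data. The following is a minimal sketch, assuming the conventional `system` / `human` / `gpt` roles; the names `Turn`, `Conversation`, and `toJsonl` are hypothetical. It builds the file contents as a string so that each conversation lands on exactly one line.

```typescript
// Hypothetical types for illustration; the role names follow the ShareGPT convention.
interface Turn {
  from: "system" | "human" | "gpt";
  value: string;
}

interface Conversation {
  conversations: Turn[];
}

function toJsonl(rows: Conversation[]): string {
  // JSON.stringify without an indent argument emits one line per object;
  // newlines inside string values are escaped, so each row stays on its own line.
  return rows.map((r) => JSON.stringify(r)).join("\n") + "\n";
}

const chatRows: Conversation[] = [
  {
    conversations: [
      { from: "system", value: "You are a helpful coding assistant." },
      { from: "human", value: "How do I sort a list in Python?" },
      { from: "gpt", value: "Use the sorted() built-in:\nsorted(my_list)" },
    ],
  },
  {
    conversations: [
      { from: "human", value: "What's 2+2?" },
      { from: "gpt", value: "4." },
    ],
  },
  // … at least 10 rows
];

const jsonl = toJsonl(chatRows);
```

The resulting string can be written to disk with `writeFileSync`, as in the TypeScript example above.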
