Datasets

LocoPilot accepts two JSONL formats out of the box. Each line in the file is a single training example. The validator runs before the job is enqueued — a malformed dataset fails in milliseconds rather than after a long training run.

Source: src/training/validator.ts

Alpaca (instruction-tuning)

Single-turn instruction → response pairs.

{"instruction": "Translate to French.", "input": "Hello, how are you?", "output": "Bonjour, comment ça va ?"}
{"instruction": "Summarise this article in one sentence.", "input": "...", "output": "..."}
{"instruction": "Write a haiku about debugging.", "input": "", "output": "Bug found at midnight\nStack trace points to line forty\nSemicolon mocks me"}

A row is recognised as Alpaca when:

| Field | Required | Constraint |
| --- | --- | --- |
| instruction | yes | non-empty string |
| output | yes | non-empty string |
| input | no | optional; empty string is fine |
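
The table above can be read as a type guard. The following is a minimal sketch, not the code in src/training/validator.ts: the name `isAlpacaRow` is hypothetical, and it assumes that "non-empty" means a string of length greater than zero and that `input`, when present, must be a string.

```typescript
// Hypothetical shape and helper for illustration only.
interface AlpacaRow {
  instruction: string;
  output: string;
  input?: string;
}

function isAlpacaRow(value: unknown): value is AlpacaRow {
  if (typeof value !== 'object' || value === null) return false;
  const row = value as Record<string, unknown>;
  const isNonEmptyString = (v: unknown): boolean =>
    typeof v === 'string' && v.length > 0;
  // Assumption: `input` may be absent or empty, but must be a string if present.
  const inputOk = row.input === undefined || typeof row.input === 'string';
  return isNonEmptyString(row.instruction) && isNonEmptyString(row.output) && inputOk;
}
```

Running a check like this over every parsed line before submitting a job can catch rows with a missing `output` that sit beyond the lines the validator inspects.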

ShareGPT (multi-turn chat)

Used for chat / conversation fine-tuning.

{"conversations": [{"from": "system", "value": "You are a helpful coding assistant."}, {"from": "human", "value": "How do I sort a list in Python?"}, {"from": "gpt", "value": "Use the sorted() built-in: `sorted(my_list)`."}]}
{"conversations": [{"from": "human", "value": "What's 2+2?"}, {"from": "gpt", "value": "4."}]}

A row is recognised as ShareGPT when:

| Field | Required | Constraint |
| --- | --- | --- |
| conversations | yes | non-empty array |
| conversations[0].from | yes | non-empty string (system, human, gpt by convention) |
| conversations[0].value | yes | non-empty string |

The validator only inspects the first turn. Downstream training adapters expect every turn to follow the same shape — runtime errors there land on you.
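
Because only the first turn is inspected, it can be worth checking the remaining turns yourself before submitting a job. The sketch below is one way to do that; `findMalformedTurns` is a hypothetical helper, not part of LocoPilot, and it assumes every turn should satisfy the same `from` / `value` constraints as the first.

```typescript
// Hypothetical pre-flight check: returns the indices of turns that lack a
// non-empty string `from` or a non-empty string `value`.
function findMalformedTurns(conversations: unknown[]): number[] {
  const bad: number[] = [];
  conversations.forEach((turn, index) => {
    // Treat non-object turns (strings, numbers, null) as empty objects so they fail the check.
    const t = (typeof turn === 'object' && turn !== null ? turn : {}) as Record<string, unknown>;
    const ok =
      typeof t.from === 'string' && t.from.length > 0 &&
      typeof t.value === 'string' && t.value.length > 0;
    if (!ok) bad.push(index);
  });
  return bad;
}
```

An empty result means every turn has the expected shape; any other result lists the zero-based positions to fix.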

Validation rules

validateDataset() enforces:

| Rule | Constant |
| --- | --- |
| Minimum 10 examples (lines) | MIN_DATASET_EXAMPLES = 10 |
| Inspect first 5 lines for shape & format consistency | DATASET_VALIDATION_LINES = 5 |
| First valid line decides the format; subsequent lines must use the same format | |

Files larger than 5 rows are not exhaustively validated — the assumption is that homogeneous JSONL is produced by tooling, not handwritten.
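
To make the rules concrete, here is a rough sketch of how such a check could be structured. It is not the code in src/training/validator.ts: `sketchValidate`, `detectFormat`, the result shape, and the short error codes are all hypothetical, and it assumes the size check runs before the per-line checks and that a single trailing newline does not count as a line. Only the two constants come from the table above.

```typescript
// Constants as documented in the rules table.
const MIN_DATASET_EXAMPLES = 10;
const DATASET_VALIDATION_LINES = 5;

type Format = 'alpaca' | 'sharegpt';

// Hypothetical result shape with short codes instead of the real messages.
type Result =
  | { ok: true; format: Format }
  | { ok: false; code: string; line?: number };

const nonEmpty = (v: unknown): boolean => typeof v === 'string' && v.length > 0;

function detectFormat(row: unknown): Format | null {
  if (typeof row !== 'object' || row === null) return null;
  const r = row as Record<string, unknown>;
  if (nonEmpty(r.instruction) && nonEmpty(r.output)) return 'alpaca';
  if (Array.isArray(r.conversations) && r.conversations.length > 0) {
    // Only the first turn is inspected, as described for ShareGPT rows.
    const first: unknown = r.conversations[0];
    if (typeof first === 'object' && first !== null) {
      const t = first as Record<string, unknown>;
      if (nonEmpty(t.from) && nonEmpty(t.value)) return 'sharegpt';
    }
  }
  return null;
}

function sketchValidate(content: string): Result {
  if (content.trim() === '') return { ok: false, code: 'empty' };
  const lines = content.split('\n');
  if (lines[lines.length - 1] === '') lines.pop(); // ignore the final newline
  if (lines.length < MIN_DATASET_EXAMPLES) return { ok: false, code: 'too-small' };

  let format: Format | null = null;
  const limit = Math.min(DATASET_VALIDATION_LINES, lines.length);
  for (let i = 0; i < limit; i++) {
    const line = i + 1; // 1-based, matching the "Line N" convention
    if (lines[i].trim() === '') return { ok: false, code: 'empty-line', line };
    let row: unknown;
    try {
      row = JSON.parse(lines[i]);
    } catch {
      return { ok: false, code: 'invalid-json', line };
    }
    const detected = detectFormat(row);
    if (detected === null) return { ok: false, code: 'unrecognised-format', line };
    if (format === null) format = detected; // the first line decides the format
    else if (detected !== format) return { ok: false, code: 'mixed-formats', line };
  }
  return format === null ? { ok: false, code: 'empty' } : { ok: true, format };
}
```

Note that in this sketch, as in the rules above, a malformed row on line 6 or later passes unnoticed.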

Validation errors

Every error includes the offending line number when applicable:

| Message | Cause |
| --- | --- |
| Dataset not found: <path> | File doesn't exist |
| Dataset file is empty | File is zero-byte or only whitespace |
| Dataset too small — found N examples, minimum 10 required | Fewer than 10 lines |
| Line N: empty line in dataset | Blank line within the first 5 |
| Line N: invalid JSON — <preview>... | Line is not valid JSON |
| Line N: unrecognised format. Expected alpaca (instruction/output) or sharegpt (conversations[]) | Object doesn't match either shape |
| Line N: mixed formats — line 1 is alpaca but line N is sharegpt. Use a single format throughout. | First line was Alpaca, later line was ShareGPT (or vice versa) |

Approximate row counts by goal:

| Goal | Rows |
| --- | --- |
| Quick sanity check | 10 – 50 |
| Format / tone tweak | 200 – 2,000 |
| Domain adaptation | 5,000 – 50,000 |
| New capability | 50,000+ |

QLoRA-style training (which Unsloth and Axolotl both use under the hood with the default loraR: 8) is sample-efficient — you usually don't need anywhere near as much data as you would for full fine-tuning.

Build a dataset programmatically

import { writeFileSync } from 'fs';

const rows = [
  { instruction: 'Translate to French.', input: 'Hello', output: 'Bonjour' },
  { instruction: 'Translate to French.', input: 'Goodbye', output: 'Au revoir' },
  // … at least 10 rows
];

writeFileSync(
  'data/translations.jsonl',
  rows.map(r => JSON.stringify(r)).join('\n') + '\n',
);

The same dataset in Python:

import json

rows = [
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Translate to French.", "input": "Goodbye", "output": "Au revoir"},
    # … at least 10 rows
]

# One JSON object per line; UTF-8 is set explicitly so output does not depend on the platform default.
with open("data/translations.jsonl", "w", encoding="utf-8") as f:
    for r in rows:
        f.write(json.dumps(r) + "\n")
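
The examples above produce Alpaca rows. A ShareGPT dataset can be built the same way; the sketch below is illustrative only. The helpers `toChat` and `toJsonl` and the `Turn` / `Chat` types are hypothetical names, and the role values follow the system / human / gpt convention from the ShareGPT section.

```typescript
// Role names follow the documented convention; the types are illustrative.
type Turn = { from: 'system' | 'human' | 'gpt'; value: string };
type Chat = { conversations: Turn[] };

// Hypothetical helper: wraps one question/answer pair in a shared system prompt.
function toChat(system: string, question: string, answer: string): Chat {
  return {
    conversations: [
      { from: 'system', value: system },
      { from: 'human', value: question },
      { from: 'gpt', value: answer },
    ],
  };
}

// One JSON object per line, with a trailing newline.
function toJsonl(rows: Chat[]): string {
  return rows.map(r => JSON.stringify(r)).join('\n') + '\n';
}

const pairs: [string, string][] = [
  ['How do I sort a list in Python?', 'Use the sorted() built-in.'],
  ["What's 2+2?", '4.'],
  // … at least 10 pairs
];

const jsonl = toJsonl(
  pairs.map(([q, a]) => toChat('You are a helpful coding assistant.', q, a)),
);
```

The resulting `jsonl` string can be written to disk with `writeFileSync`, as in the TypeScript example above. Because `JSON.stringify` escapes newlines inside values, each conversation stays on a single line.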