Fine-Tuning Gemma 3 1B: My Actual Workflow + Costs
When fine‑tuning beats prompting
Fine‑tuning is the right tool when the marginal cost of a few‑shot prompt exceeds the one‑time cost of a LoRA adapter. In practice that means:
- High‑throughput pipelines where each request carries 5–10 example lines. A 50‑token example block on a $20/M‑token model costs about $0.001 per request: tiny per call, but roughly $1,000 per million calls, which quickly dwarfs a $10 training bill.
- Strict output contracts—JSON, CSV, or domain‑specific markup—where a single stray newline breaks downstream processing.
- Vocabulary gaps. If the base model consistently hallucinates a technical term (e.g., “hydroxy‑propyl‑methylcellulose”) you can inject the term with a handful of examples and lock it in.
The opposite regime—low volume, high tolerance for formatting errors, or a prompt that already hits target metrics—does not justify the engineering overhead. “I like the idea of fine‑tuning” is not a decision criterion.
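The break-even point behind that argument can be estimated with a few lines of arithmetic. This is a minimal sketch using the figures quoted above (50-token example block, $20 per million tokens, $10 training bill); the function name is hypothetical, and it deliberately ignores the cost of serving the fine-tuned model.

```python
def break_even_requests(example_tokens: int,
                        price_per_million_usd: float,
                        training_cost_usd: float) -> float:
    """Requests after which the few-shot block has cost as much as one training run."""
    # Cost of the example block per request = tokens * price / 1e6,
    # so break-even = training cost / that per-request cost.
    return training_cost_usd * 1_000_000 / (example_tokens * price_per_million_usd)


if __name__ == "__main__":
    n = break_even_requests(example_tokens=50,
                            price_per_million_usd=20.0,
                            training_cost_usd=10.0)
    print(f"Few-shot overhead matches the training bill after {n:,.0f} requests")
```

With these inputs the break-even lands at 10,000 requests, so volume rather than per-call cost is what decides the question.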
Data preparation for a 1 B‑parameter specialist
We keep the dataset as a flat .jsonl file, each line a {"prompt":..., "completion":...} dict. This avoids the extra <assistant>/<user> scaffolding that chat‑style formats impose, which in turn reduces token waste during LoRA updates.
Key numbers from our recent runs:
- Size. 1 000–5 000 pairs is the sweet spot. Beyond 5 k the validation loss flattens for a 1 B model, yet the marginal compute cost rises linearly.
- Deduplication. We run a two‑pass filter: first an exact‑hash check, then a fuzzy 90 % similarity sweep using rapidfuzz. In our experience, a single leaked duplicate can inflate test accuracy by 3–5 % because the model memorises the answer.
- Split. Reserve 10 % of the cleaned set for evaluation before any training iteration. Random stratified sampling preserves label distribution and prevents accidental leakage through ordering.
- Formatting sanity. Include at least one “bad‑format” example (missing brace, extra comma) per 200 pairs. The model learns to reject malformed inputs during inference.
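The dedup and split steps above can be sketched as follows. This is an illustration under stated assumptions, not our production script: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz, folds the exact and fuzzy checks into one loop, compares prompts only, and does a plain seeded shuffle rather than a stratified split. All function names are hypothetical.

```python
import hashlib
import random
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def dedup(pairs, threshold=0.90):
    """Drop exact duplicates (SHA-256 of the prompt) and near duplicates (ratio >= threshold)."""
    seen_hashes = set()
    kept = []
    for pair in pairs:
        text = normalize(pair["prompt"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        # Fuzzy check is O(n^2); acceptable for a few thousand pairs.
        if any(SequenceMatcher(None, text, normalize(k["prompt"])).ratio() >= threshold
               for k in kept):
            continue  # near duplicate
        seen_hashes.add(digest)
        kept.append(pair)
    return kept


def split(pairs, eval_fraction=0.10, seed=0):
    """Return (train, eval) after a seeded shuffle; not stratified."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

Fixing the seed keeps the held-out set identical between runs, which matters once the same split is reused for every evaluation.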
Training script (Unsloth + PEFT)
Unsloth’s FastLanguageModel wrapper patches the model with faster kernels and lowers memory use compared with a stock Hugging Face setup. The following is the script we run on a single A100 (40 GB) in a rented container:
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Load base model in 4-bit QLoRA mode
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapter; rank 16 works well for 1 B models
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

# Load cleaned JSONL (one {"prompt": ..., "completion": ...} object per line)
ds = load_dataset("json", data_files="train.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    args=SFTConfig(
        output_dir="./out",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        bf16=True,
        logging_steps=10,
        save_steps=100,
        optim="adamw_torch",
        # Reduce GPU memory pressure by recomputing activations
        gradient_checkpointing=True,
    ),
)
trainer.train()

# Save the adapter to output_dir ("./out"), then push it to the Hub.
# Trainer.push_to_hub takes a commit message as its first argument, not a
# repo id, so we push through the PEFT model and tokenizer instead.
trainer.save_model()
model.push_to_hub("your-handle/gemma-3-1b-mytask-v1")
tokenizer.push_to_hub("your-handle/gemma-3-1b-mytask-v1")
Key hyper‑parameters:
- Batch size. 4 × gradient accumulation = effective 16. Keeps VRAM under 30 GB.
- Learning rate. 2 × 10⁻⁴ is a robust starting point; we rarely need to decay further for three epochs.
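- Epoch count. Three passes over 2 k pairs means ~6 k examples seen, or about 375 optimizer steps at an effective batch size of 16 (2 000 / 16 = 125 steps per epoch). In our runs that is enough to converge without over‑fitting.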
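The step count implied by these settings is easy to check. Note that 3 × 2,000 = 6,000 counts examples seen, not optimizer steps. The helper below is a hypothetical sketch that approximates how a trainer derives the total; the exact rounding of a partial final batch can differ between library versions.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch: int,
                    grad_accum: int, epochs: int, num_devices: int = 1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


if __name__ == "__main__":
    batch, steps = optimizer_steps(num_examples=2000, per_device_batch=4,
                                   grad_accum=4, epochs=3)
    print(f"effective batch = {batch}, optimizer steps = {steps}")
```

For 2,000 pairs this gives an effective batch of 16 and 375 optimizer steps, so `save_steps=100` produces three intermediate checkpoints.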
Cost breakdown by stage
| Stage | Compute venue | Indicative cost (USD) |
|---|---|---|
| Data cleaning & deduplication | Local workstation | $0 + ≈ 2 h engineer time |
| LoRA training (3 epochs, 2 k pairs) | RunPod A100 spot (≈ $0.90 /hr) | $1.5 – $2.5 |
| Evaluation (held‑out set) | Local CPU or vLLM sandbox | $0 |
| Adapter push to HF Hub | HuggingFace (public repo) | $0 |
| Production serving (vLLM + adapter) | RunPod A100 spot, autoscaled | $0.12 /hr (≈ $86 /mo at 24/7) |
All numbers are rounded estimates from our last 12 months of rentals. Spot markets fluctuate ±30 %; on‑demand pricing would roughly double the training line.
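The table's figures can be cross-checked with simple arithmetic. This is a small sketch using only the rates quoted above; the function names are hypothetical and a 30-day month is assumed.

```python
def monthly_cost(hourly_usd: float, hours_per_day: float = 24.0, days: int = 30) -> float:
    """Cost of running an instance continuously for one (30-day) month."""
    return hourly_usd * hours_per_day * days


def implied_gpu_hours(cost_usd: float, hourly_usd: float) -> float:
    """GPU-hours that a given spend buys at a given hourly rate."""
    return cost_usd / hourly_usd


if __name__ == "__main__":
    print(f"serving: ${monthly_cost(0.12):.2f} per month")
    low, high = implied_gpu_hours(1.5, 0.90), implied_gpu_hours(2.5, 0.90)
    print(f"training: {low:.1f} to {high:.1f} GPU-hours at $0.90/hr")
```

At $0.90/hr the training line corresponds to roughly 1.7 to 2.8 GPU-hours, and $0.12/hr around the clock comes to about $86 per month.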
Evaluation methodology that surfaces regressions
We lock the same 10 % hold‑out across every iteration. Automation runs after each training job, writes a JSON report, and aborts deployment if either threshold is missed.
- Exact‑match on structured fields. For JSON output we parse with json.loads and count strict equality. A 95 % pass rate is our production gate.
- LLM‑judge pairwise comparison. We prompt a reference model (GPT‑4‑turbo) to rank the new model’s response against the previous version on a 0–1 scale. The average delta must stay above 0.05; otherwise we flag a regression.
Both metrics are computed on the same batch of 200 examples, ensuring statistical stability without excessive compute.
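The exact-match gate could be implemented along these lines. This is a minimal sketch assuming predictions and references are JSON strings; the function names and report fields are hypothetical, and the LLM-judge metric is left out because it depends on an external API.

```python
import json


def exact_match_rate(predictions, references) -> float:
    """Fraction of predictions that parse as JSON and equal the parsed reference."""
    hits = 0
    for pred, ref in zip(predictions, references):
        expected = json.loads(ref)  # a broken reference should fail loudly
        try:
            if json.loads(pred) == expected:
                hits += 1
        except json.JSONDecodeError:
            pass  # unparseable model output counts as a miss
    return hits / len(references)


def build_report(predictions, references, gate: float = 0.95) -> dict:
    """Summarise the run; 'passed' is False when the rate falls below the gate."""
    rate = exact_match_rate(predictions, references)
    return {"exact_match": rate, "gate": gate, "passed": rate >= gate}


if __name__ == "__main__":
    refs = ['{"a": 1}'] * 4
    preds = ['{"a": 1}', '{"a": 1, }', '{"a": 2}', '{ "a" : 1 }']
    print(json.dumps(build_report(preds, refs), indent=2))
```

Comparing parsed objects rather than raw strings means whitespace and key order do not cause false failures, while a stray trailing comma still does.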
Common failure modes and mitigations
- Catastrophic format loss. The adapter learns to ignore the JSON wrapper if it appears in less than ~20 % of training rows. Remedy: duplicate well‑formed examples to reach a 30 %+ representation.
- Over‑fitting to frequent patterns. Validation loss may continue to drop while exact‑match stalls. Solution: lower LoRA rank to 8, or truncate epochs to 2.
- Data contamination. Even a single leaked pair can inflate test scores. We now run a SHA‑256 hash check on the eval split before every training launch; any overlap aborts the run.
- Adapter drift after long idle periods. When a model sits unused for > 30 days, the underlying base may be patched (e.g., security hotfix). Re‑run a quick sanity check (5‑sample eval) before serving.
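The pre-launch contamination check from the list above can be sketched like this. It assumes both splits are lists of `{"prompt", "completion"}` dicts and only catches exact overlaps; the function names are hypothetical.

```python
import hashlib
import json


def pair_hash(pair: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the prompt/completion pair."""
    canonical = json.dumps(
        {"prompt": pair["prompt"], "completion": pair["completion"]},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_leaks(train_pairs, eval_pairs):
    """Return every eval pair whose hash also appears in the training set."""
    train_hashes = {pair_hash(p) for p in train_pairs}
    return [p for p in eval_pairs if pair_hash(p) in train_hashes]


def assert_no_leaks(train_pairs, eval_pairs) -> None:
    """Raise before training starts if any eval pair leaked into train."""
    leaks = find_leaks(train_pairs, eval_pairs)
    if leaks:
        raise ValueError(f"{len(leaks)} eval pair(s) also present in training data")
```

Calling `assert_no_leaks` at the top of the training job turns a silent metric inflation into a hard failure.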
Deploying the LoRA adapter at scale
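Our production stack uses vllm with LoRA support enabled. The adapter is registered under a name (mytask here) that clients pass as the model in their requests, and the path points at the adapter directory rather than a single weights file:
vllm serve unsloth/gemma-3-1b-it \
  --enable-lora \
  --lora-modules mytask=./out \
  --tensor-parallel-size 1 \
  --max-model-len 1024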
The base model occupies the same GPU memory as a vanilla inference run; the adapter adds roughly 20 MB or less, negligible for A100 cards. Autoscaling is handled by the container orchestrator (Kubernetes HPA) based on request latency, keeping cost proportional to traffic.
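On the client side, a request to vLLM's OpenAI-compatible endpoint selects a LoRA adapter by passing its registered name as the `model` field. The sketch below assumes the server runs locally on port 8000 with LoRA enabled and that the adapter was registered under the hypothetical name `mytask`; the helper names are also hypothetical.

```python
import json
import urllib.request


def build_payload(prompt: str, adapter_name: str = "mytask", max_tokens: int = 256) -> dict:
    """Request body for /v1/completions; 'model' selects the adapter by name."""
    return {
        "model": adapter_name,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def complete(prompt: str, base_url: str = "http://localhost:8000",
             adapter_name: str = "mytask") -> str:
    """Send one completion request and return the generated text."""
    body = json.dumps(build_payload(prompt, adapter_name)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["choices"][0]["text"]
```

Passing the base model's name instead of the adapter name sends the same prompt through the unmodified model, which is handy for side-by-side checks.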
Iterating fast: the weekly fine‑tuning loop
In production we close the loop every 7–14 days:
- Collect fresh logs, extract failed cases, and annotate 50–100 new pairs.
- Append to the master JSONL, re‑run deduplication script (python dedup.py master.jsonl).
- Kick off the training job via a GitHub Action that provisions a RunPod spot instance, runs the script above, and pushes the adapter to HF.
- Trigger the evaluation workflow; if both metrics clear, roll out the new adapter to the vLLM service with a zero‑downtime rolling update.
The entire pipeline—from data ingest to live serving—takes under two hours of wall‑clock time, and the monetary outlay stays below $5 per iteration. That cadence is what makes fine‑tuning a competitive alternative to “prompt‑engineering‑only” solutions for narrow, high‑volume use cases.
Scaling beyond a single task
When we need multiple specialists (e.g., separate adapters for finance, legal, and medical domains), we keep a mono‑repo of adapters and a routing layer that selects the appropriate LoRA at request time. The routing decision is a cheap keyword match on the incoming request’s metadata; the underlying vLLM process loads the chosen adapter on the fly, still within the same GPU process.
In our tests, serving five adapters concurrently adds ~0.03 seconds of latency per request, well within SLA limits for most APIs. Memory overhead remains under 150 MB total, which suggests that a 1 B base can host a dozen niche experts on a single GPU.
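A keyword router of the kind described above can be very small. This is an illustrative sketch only: the adapter names, keyword lists, and metadata fields are hypothetical, and the fallback is the base model name.

```python
# Hypothetical adapter names mapped to the keywords that select them.
ADAPTER_KEYWORDS = {
    "finance-adapter": ("invoice", "ledger", "portfolio"),
    "legal-adapter": ("contract", "clause", "litigation"),
    "medical-adapter": ("diagnosis", "dosage", "patient"),
}


def select_adapter(metadata: dict, default: str = "unsloth/gemma-3-1b-it") -> str:
    """Return the first adapter whose keyword appears in any metadata value."""
    haystack = " ".join(str(value) for value in metadata.values()).lower()
    for adapter, keywords in ADAPTER_KEYWORDS.items():
        if any(keyword in haystack for keyword in keywords):
            return adapter
    return default  # no match: fall back to the base model


if __name__ == "__main__":
    print(select_adapter({"team": "billing", "subject": "Invoice mismatch"}))
    print(select_adapter({"team": "support", "subject": "Password reset"}))
```

The returned name would then be used as the `model` value in the request sent to the serving process. Substring matching is crude, so first-match order in the dictionary decides ties.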
This is part of the Production Agents cornerstone series.