
Fine-Tuning Gemma 3 1B: My Actual Workflow + Costs

By OrionAI Build Editorial · Published 2026-05-10 · // build

When fine‑tuning beats prompting

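Fine‑tuning is the right tool when the marginal cost of a few‑shot prompt exceeds the one‑time cost of a LoRA adapter. In practice that means high request volume, low tolerance for formatting errors, or a prompt that cannot hit target metrics on its own.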

The opposite regime—low volume, high tolerance for formatting errors, or a prompt that already hits target metrics—does not justify the engineering overhead. “I like the idea of fine‑tuning” is not a decision criterion.

Data preparation for a 1 B‑parameter specialist

We keep the dataset as a flat .jsonl file, each line a {"prompt":..., "completion":...} dict. This avoids the extra <assistant>/<user> scaffolding that chat‑style formats impose, which in turn reduces token waste during LoRA updates.
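Before training, it can help to check that every line of the file really has this shape. The following is a minimal sketch, not our production tooling: the function name `validate_jsonl_lines` and the rule that both fields must be non-empty strings are assumptions made for illustration.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}


def validate_jsonl_lines(lines):
    """Return (valid_records, errors) for an iterable of JSONL lines.

    Hypothetical helper: errors are (line_number, message) tuples.
    """
    records, errors = [], []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # blank lines are skipped, not reported
        try:
            rec = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(rec, dict) or set(rec) != REQUIRED_KEYS:
            errors.append((lineno, "expected exactly the keys 'prompt' and 'completion'"))
            continue
        # Assumption: both fields must be non-empty strings
        if not all(isinstance(rec[k], str) and rec[k].strip() for k in REQUIRED_KEYS):
            errors.append((lineno, "both fields must be non-empty strings"))
            continue
        records.append(rec)
    return records, errors
```

A typical call would pass an open file handle, for example `validate_jsonl_lines(open("train.jsonl", encoding="utf-8"))`, and refuse to start training if the error list is non-empty.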

Key numbers from our recent runs:

Training script (Unsloth + PEFT)

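Unsloth’s FastLanguageModel wrapper patches the model with optimized kernels, cutting training time and memory use compared with a stock Hugging Face setup; the training loop itself still runs through TRL’s SFTTrainer. The following is the exact script we run on a single A100 (40 GB) in a rented container: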

from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Load base model in 4‑bit QLoRA mode
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapter – rank 16 works well for 1 B models
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj","k_proj","v_proj","o_proj",
        "gate_proj","up_proj","down_proj"
    ],
)

# Load cleaned JSONL
ds = load_dataset("json", data_files="train.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    args=SFTConfig(
        output_dir="./out",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        bf16=True,
        logging_steps=10,
        save_steps=100,
        optim="adamw_torch",
        # Recompute activations in the backward pass to cut GPU memory use
        gradient_checkpointing=True,
    ),
)

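trainer.train()

# Write the final adapter weights and config to ./out
trainer.save_model("./out")

# Push only the LoRA adapter. Trainer.push_to_hub() takes a commit message,
# not a repo id, as its first argument, so push from the model and tokenizer.
model.push_to_hub("your-handle/gemma-3-1b-mytask-v1")
tokenizer.push_to_hub("your-handle/gemma-3-1b-mytask-v1")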

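Key hyper‑parameters: LoRA rank 16 with alpha 32 on all attention and MLP projections; 3 epochs at an effective batch size of 16 (4 per device × 4 gradient‑accumulation steps); learning rate 2e-4 with 3 % warmup; bf16 precision; and a 1024‑token maximum sequence length.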

Cost breakdown by stage

Stage | Compute venue | Indicative cost (USD)
Data cleaning & deduplication | Local workstation | $0 + ≈ 2 h engineer time
LoRA training (3 epochs, 2 k pairs) | RunPod A100 spot (≈ $0.90 /hr) | $1.5 – $2.5
Evaluation (held‑out set) | Local CPU or vLLM sandbox | $0
Adapter push to HF Hub | HuggingFace (public repo) | $0
Production serving (vLLM + adapter) | RunPod A100 spot, autoscaled | $0.12 /hr (≈ $86 /mo at 24/7)

All numbers are rounded estimates from our last 12 months of rentals. Spot markets fluctuate ±30 %; on‑demand pricing would roughly double the training line.
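The arithmetic behind these figures is simple enough to script, which makes it easy to redo when spot prices move. This is a rough sketch: the helper names are made up for illustration, and the 2‑hour training duration in the usage note is an assumed input, not a measured value.

```python
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month running 24/7


def training_cost(hours, rate_per_hour):
    """Cost of one training job at a flat hourly rate."""
    return round(hours * rate_per_hour, 2)


def monthly_serving_cost(rate_per_hour):
    """Cost of keeping one instance up for a full 30-day month."""
    return round(rate_per_hour * HOURS_PER_MONTH, 2)


def spot_band(cost, swing=0.30):
    """Low and high estimates for a cost under a +/- swing in spot pricing."""
    return (round(cost * (1 - swing), 2), round(cost * (1 + swing), 2))
```

For example, `monthly_serving_cost(0.12)` reproduces the ≈ $86 per month serving figure, and `spot_band(training_cost(2.0, 0.90))` shows how far an assumed 2‑hour job at $0.90 per hour could drift under a ±30 % spot swing.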

Evaluation methodology that surfaces regressions

We lock the same 10 % hold‑out across every iteration. Automation runs after each training job, writes a JSON report, and aborts deployment if thresholds are crossed.

  1. Exact‑match on structured fields. For JSON output we parse with json.loads and count strict equality. A 95 % pass rate is our production gate.
  2. LLM‑judge pairwise comparison. We prompt a reference model (GPT‑4‑turbo) to score the new model’s response and the previous version’s response to the same prompt on a 0–1 scale. The average delta (new minus previous) must stay above 0.05; otherwise we flag a regression.

Both metrics are computed on the same 200 held‑out examples, so results stay comparable from one iteration to the next without excessive compute.
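The gate described above can be sketched in a few lines. This is an illustrative version, not our actual automation: the function names are hypothetical, "strict equality" is interpreted as equality of the parsed JSON values (so key order does not matter), and the judge delta is taken as an already computed number.

```python
import json


def exact_match_rate(predictions, references):
    """Fraction of predictions that parse as JSON and equal the parsed reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty prediction and reference lists")
    hits = 0
    for pred, ref in zip(predictions, references):
        try:
            if json.loads(pred) == json.loads(ref):
                hits += 1
        except json.JSONDecodeError:
            pass  # unparseable text counts as a miss
    return hits / len(references)


def passes_gate(em_rate, judge_delta, em_threshold=0.95, delta_threshold=0.05):
    """True only if exact match meets the gate and the judge delta stays above its threshold."""
    return em_rate >= em_threshold and judge_delta > delta_threshold
```

A deployment script could call `passes_gate(...)` after each training job and exit with a non-zero status when it returns `False`, which is enough to stop most CI pipelines.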

Common failure modes and mitigations

Deploying the LoRA adapter at scale

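Our production stack uses vLLM with LoRA support enabled. The launch command registers the adapter directory under a name, which requests then pass as the model:

# "mytask" is a placeholder adapter name; ./out must contain the saved adapter files
vllm serve unsloth/gemma-3-1b-it \
    --enable-lora \
    --lora-modules mytask=./out \
    --tensor-parallel-size 1 \
    --max-model-len 1024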

The base model occupies the same GPU memory as a vanilla inference run; the adapter adds roughly 20 MB, negligible on an A100. Autoscaling is handled by the container orchestrator (Kubernetes HPA) based on request latency, keeping cost proportional to traffic.
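A back-of-the-envelope estimate shows why the adapter is so small: each adapted weight matrix of shape d_in × d_out gains only rank × (d_in + d_out) extra parameters. The sketch below applies that formula; the layer shapes and layer count are assumptions about Gemma 3 1B that should be checked against the model's config before relying on the result.

```python
def lora_param_count(rank, layer_shapes, num_layers):
    """Total LoRA parameters: each (d_in, d_out) matrix adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers


# ASSUMED projection shapes (d_in, d_out) and depth for Gemma 3 1B; verify
# against the model's config.json, these values are not taken from the article.
ASSUMED_SHAPES = {
    "q_proj": (1152, 1024),
    "k_proj": (1152, 256),
    "v_proj": (1152, 256),
    "o_proj": (1024, 1152),
    "gate_proj": (1152, 6912),
    "up_proj": (1152, 6912),
    "down_proj": (6912, 1152),
}
ASSUMED_NUM_LAYERS = 26

params = lora_param_count(16, ASSUMED_SHAPES, ASSUMED_NUM_LAYERS)
size_mb = params * 2 / 1e6  # 2 bytes per parameter at 16-bit precision
```

Under these assumed shapes the estimate lands in the tens of megabytes, the same order of magnitude as the figure quoted above, and it scales linearly with the LoRA rank.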

Iterating fast: the weekly fine‑tuning loop

In production we close the loop every 7–14 days:

  1. Collect fresh logs, extract failed cases, and annotate 50–100 new pairs.
  2. Append to the master JSONL, then re‑run the deduplication script (python dedup.py master.jsonl).
  3. Kick off the training job via a GitHub Action that provisions a RunPod spot instance, runs the script above, and pushes the adapter to HF.
  4. Trigger the evaluation workflow; if both metrics clear, roll out the new adapter to the vLLM service with a zero‑downtime rolling update.

The entire pipeline—from data ingest to live serving—takes under two hours of wall‑clock time, and the monetary outlay stays below $5 per iteration. That cadence is what makes fine‑tuning a competitive alternative to “prompt‑engineering‑only” solutions for narrow, high‑volume use cases.
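The deduplication step in the loop above could look something like the following. This is a hypothetical sketch of what a script such as dedup.py might do, not its actual contents: it assumes exact-duplicate removal after lowercasing and whitespace normalization, and keeps the first occurrence of each pair.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse runs of whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dedup_records(records):
    """Keep the first occurrence of each normalized (prompt, completion) pair."""
    seen, unique = set(), []
    for rec in records:
        # The unit separator keeps prompt and completion from running together
        key_src = normalize(rec["prompt"]) + "\x1f" + normalize(rec["completion"])
        key = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Hashing keeps the `seen` set small even when completions are long; near-duplicate detection (for example, by edit distance or embeddings) would need a different approach.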

Scaling beyond a single task

When multiple specialists are needed (e.g., separate adapters for finance, legal, and medical domains), we keep a mono‑repo of adapters and a routing layer that selects the appropriate LoRA at request time. The routing decision is a cheap keyword match on the incoming request’s metadata; the underlying vLLM process loads the chosen adapter on the fly, still within the same GPU process.

In our tests, serving five adapters concurrently adds ~0.03 seconds of latency per request, well within SLA limits for most APIs. Memory overhead remains under 150 MB total, which suggests that a 1 B base can host a dozen niche experts on a single GPU.
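A keyword router of the kind described here can be very small. The sketch below is illustrative only: the metadata fields (`topic`, `tags`), the keyword lists, and the adapter names are all invented for the example and would need to match whatever the real request schema and adapter registry use.

```python
DEFAULT_ADAPTER = "general"  # hypothetical fallback adapter name

# Hypothetical keyword lists; first matching adapter wins
ROUTES = {
    "finance": ("invoice", "ledger", "portfolio"),
    "legal": ("contract", "clause", "statute"),
    "medical": ("diagnosis", "dosage", "patient"),
}


def select_adapter(metadata, routes=ROUTES, default=DEFAULT_ADAPTER):
    """Return the first adapter whose keywords appear in the request's topic or tags."""
    tags = metadata.get("tags", [])
    text = " ".join([str(metadata.get("topic", "")), *map(str, tags)]).lower()
    for adapter, keywords in routes.items():
        if any(kw in text for kw in keywords):
            return adapter
    return default
```

The returned name would then be sent as the `model` field of the request to vLLM, assuming each adapter was registered under that name when the server was launched.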

This is part of the Production Agents cornerstone series.