Fine-Tuning Gemma 3 1B: My Actual Workflow + Costs

By OrionAI Build Editorial · Published 2026-05-10 · // build

I fine-tune a small open-weight model every week or two for narrow tasks. Here's the actual workflow, including data prep, training script, evaluation and the costs at each stage.

When fine-tuning beats prompting

Fine-tuning wins when you have:

A high-volume, narrow task where prompt-with-examples gets expensive (each call carries the few-shot examples in input tokens).
A specific output format you need consistently.
Domain language a base model under-represents.

It loses when the task is small-volume or when prompting already passes your eval. "I'd prefer to fine-tune" is not a reason.
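To put a rough number on the first point above, the cost of re-sending few-shot examples can be estimated with simple arithmetic. This is a minimal sketch: the function name and every input value (call volume, example length, token price) are hypothetical placeholders, not figures from any vendor.

```python
def few_shot_overhead_usd(calls_per_day: int, example_tokens: int,
                          usd_per_million_input_tokens: float, days: int = 30) -> float:
    """Dollars spent re-sending the same few-shot examples over `days` days."""
    total_tokens = calls_per_day * example_tokens * days
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


# Hypothetical inputs: 50k calls/day, 800 tokens of examples per call,
# $0.50 per 1M input tokens. Swap in your own numbers.
monthly = few_shot_overhead_usd(50_000, 800, 0.50)
print(f"${monthly:,.2f} per month spent on repeated examples")
```

If that figure is small next to the time a fine-tune would take you, prompting is probably still the right call.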

Data prep

Format: JSONL, one object per line with a prompt field and a completion field. I find this cleaner than chat-template formatting for a small specialist model. Targets:

1,000–5,000 high-quality pairs. Going past 5,000 helps surprisingly little for a 1B model.
Strict deduplication (exact and near-dup), done before the split. Near-duplicates that land on both sides of the train/eval boundary are contamination, the #1 silent killer of fine-tunes.
Eval split held out before any training run. 10% is fine.
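The prep steps above can be sketched roughly as follows. This is an illustration under assumptions, not the exact script used here: the helper names, the normalization rule and the seed are all made up for the example. It only catches exact duplicates and trivial variants (case, punctuation, whitespace); fuzzier near-duplicates would need something like MinHash on top.

```python
import hashlib
import random
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace so trivial variants collide."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def pair_key(pair: dict) -> str:
    """Stable hash of the normalized prompt and completion."""
    joined = normalize(pair["prompt"]) + "\x1f" + normalize(pair["completion"])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe(pairs: list) -> list:
    """Keep the first occurrence of each normalized pair."""
    seen, kept = set(), []
    for pair in pairs:
        key = pair_key(pair)
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept


def split(pairs: list, eval_fraction: float = 0.10, seed: int = 13):
    """Shuffle deterministically, then hold out `eval_fraction` as the eval set."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Tiny in-memory demo; in practice the pairs come from your JSONL file.
raw = [
    {"prompt": "Hello, World!", "completion": "A"},
    {"prompt": "hello world", "completion": "a"},  # trivial variant of the first pair
    {"prompt": "Classify: refund request", "completion": "billing"},
]
clean = dedupe(raw)
train, held_out = split(clean)
print(len(raw), "->", len(clean), "pairs;", len(train), "train,", len(held_out), "eval")
```

Deduplicating before the split matters: splitting first can put one copy of a pair in train and its twin in eval.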

Training script (Unsloth)

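from unsloth import FastLanguageModel  # import unsloth first so its patches apply to trl/transformers
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

REPO_ID = "your-handle/gemma-3-1b-mytask-v1"

# Load the instruction-tuned 1B base in 4-bit so it fits on a small GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
)
# Attach LoRA adapters to every attention and MLP projection.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
)
# train.jsonl holds one prompt/completion object per line.
ds = load_dataset("json", data_files="train.jsonl")["train"]

trainer = SFTTrainer(
    # Newer TRL releases name the tokenizer argument processing_class.
    model=model, tokenizer=tokenizer, train_dataset=ds,
    args=SFTConfig(
        output_dir="./out", num_train_epochs=3,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,  # effective batch size 16
        learning_rate=2e-4, warmup_ratio=0.03,
        bf16=True,  # needs a GPU with bfloat16 support; use fp16=True otherwise
        logging_steps=10, save_steps=100,
    ),
)
trainer.train()

# Trainer.push_to_hub takes a commit message as its first argument, not a repo ID,
# so push the adapter and tokenizer to the target repo directly.
model.push_to_hub(REPO_ID)
tokenizer.push_to_hub(REPO_ID)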
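It helps to know how many optimizer steps a config like this produces before renting a GPU. The sketch below is plain arithmetic on the batch settings from the script and the 2k-pair dataset size used in the cost table; the function name is made up, and the trainer's own rounding may differ by a step.

```python
import math


def training_schedule(n_examples: int, per_device_batch: int, grad_accum: int,
                      epochs: int, warmup_ratio: float, n_gpus: int = 1):
    """Return (effective batch size, total optimizer steps, warmup steps)."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


effective_batch, total_steps, warmup_steps = training_schedule(
    n_examples=2_000, per_device_batch=4, grad_accum=4, epochs=3, warmup_ratio=0.03,
)
print(f"effective batch {effective_batch}, {total_steps} steps, {warmup_steps} warmup steps")
```

With a few hundred total steps and save_steps=100, you get only a handful of intermediate checkpoints, so lower save_steps if you want finer-grained rollback points.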

Cost shape

Stage	Where	Indicative cost
Data prep	Local	$0 + your time
Training (LoRA, 3 epochs, 2k pairs)	Single consumer or rental GPU, ~30-60 min	Single-digit dollars on a rental tier
Eval	Local or vLLM box	$0 locally; the serving rate below if run on the rental box
Push to HF	HuggingFace	$0
Serving (vLLM with adapter)	Same rental tier as base	$0.10–$0.20/hr while running

Numbers above are illustrative ranges, not vendor quotes — every rental marketplace has different spot/on-demand math.
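The serving row is the one that compounds, so it is worth converting the hourly range into a monthly figure. A minimal sketch using only the illustrative $0.10–$0.20/hr range from the table; the function name and the 8-hours-a-day scenario are assumptions added for comparison.

```python
def monthly_serving_cost(usd_per_hour: float, hours_per_day: float = 24.0, days: int = 30) -> float:
    """Cost of keeping one serving box up for `hours_per_day` over `days` days."""
    return usd_per_hour * hours_per_day * days


for rate in (0.10, 0.20):
    always_on = monthly_serving_cost(rate)
    part_time = monthly_serving_cost(rate, hours_per_day=8)
    print(f"${rate:.2f}/hr: ${always_on:.2f} always on, ${part_time:.2f} at 8 h/day")
```

At these rates an always-on box lands in the $72 to $144 per month range, which is usually far more than the training run itself.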

Eval that catches regressions

Same held-out set every time, scored automatically where possible. Two metrics:

Exact-match accuracy on structured outputs (formats, classifications).
Pairwise LLM-judge score against the previous version, on freeform outputs. Treat as a regression detector, not absolute truth.
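Both metrics reduce to a few lines once the model outputs and judge verdicts are collected. The sketch below assumes the judge has already returned one verdict per eval item as the string "new", "old" or "tie"; that verdict format, the function names and the half-credit rule for ties are illustrative choices, not a standard.

```python
def exact_match_accuracy(predictions: list, references: list) -> float:
    """Share of predictions that equal the reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def pairwise_win_rate(verdicts: list) -> float:
    """Win rate of the new model against the previous one; ties count as half a win."""
    if not verdicts:
        return 0.0
    score = sum(1.0 if v == "new" else 0.5 if v == "tie" else 0.0 for v in verdicts)
    return score / len(verdicts)


# Toy data standing in for real model outputs and judge verdicts.
acc = exact_match_accuracy(["billing", "shipping ", "other"], ["billing", "shipping", "billing"])
win = pairwise_win_rate(["new", "old", "tie", "new"])
print(f"exact match {acc:.3f}, win rate vs previous {win:.3f}")
```

A win rate near 0.5 means the judge sees no difference; a clear drop below that on the same held-out set is the regression signal. Randomizing which answer is shown first to the judge helps keep position bias out of the verdicts.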

What goes wrong

Catastrophic format loss. Model forgets to emit JSON. Fix: include format examples in >30% of training pairs.
Overfitting on common patterns. Loss looks great, eval set tanks. Fix: smaller LoRA rank, fewer epochs.
Contaminated training data. Eval pairs leaked into training. Fix: hash check before every run.
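The hash check in the last item can be as small as this. It is a sketch under assumptions: the function names are made up, pairs are dicts with a prompt field as in the data prep section, and matching is on whitespace- and case-normalized prompts only, so paraphrased leaks will slip through.

```python
import hashlib


def fingerprint(pair: dict) -> str:
    """Hash of the prompt with case and whitespace normalized."""
    normalized = " ".join(pair["prompt"].lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaks(train_pairs: list, eval_pairs: list) -> list:
    """Return every training pair whose prompt also appears in the eval set."""
    eval_hashes = {fingerprint(p) for p in eval_pairs}
    return [p for p in train_pairs if fingerprint(p) in eval_hashes]


def assert_no_contamination(train_pairs: list, eval_pairs: list) -> None:
    """Raise before training starts if any eval prompt leaked into train."""
    leaks = find_leaks(train_pairs, eval_pairs)
    if leaks:
        raise RuntimeError(f"{len(leaks)} eval prompt(s) found in training data")


# Toy data: the second training prompt is a whitespace/case variant of an eval prompt.
train = [{"prompt": "Summarize ticket 1"}, {"prompt": "classify:  Refund request"}]
held_out = [{"prompt": "Classify: refund request"}]
print(len(find_leaks(train, held_out)), "leaked pair(s)")
```

Calling assert_no_contamination at the top of the training script makes the check impossible to forget.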

Deployment

Push the LoRA adapter to HuggingFace and point vLLM at base + adapter. Inference cost stays close to the base model's: the adapter weights are small next to the base, so the GPU footprint barely changes.
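As a rough illustration of "base + adapter", the sketch below builds the vLLM launch command and a request body for its OpenAI-compatible completions endpoint. It assumes your vLLM version supports LoRA for this architecture and can resolve the adapter from the given path or Hub repo; the adapter name "mytask" and the helper functions are hypothetical, and the repo ID is the placeholder from the training script.

```python
import json
import shlex

BASE_MODEL = "unsloth/gemma-3-1b-it"                # base checkpoint from the training script
ADAPTER_REPO = "your-handle/gemma-3-1b-mytask-v1"   # placeholder adapter repo
ADAPTER_NAME = "mytask"                             # hypothetical name clients will request


def serve_command(base: str, adapter_name: str, adapter_path: str) -> list:
    """argv for a vLLM server that loads the base model and registers one LoRA adapter."""
    return ["vllm", "serve", base, "--enable-lora",
            "--lora-modules", f"{adapter_name}={adapter_path}"]


def completion_request(adapter_name: str, prompt: str, max_tokens: int = 128) -> dict:
    """Body for POST /v1/completions; the adapter is selected through the model field."""
    return {"model": adapter_name, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": 0.0}


print(shlex.join(serve_command(BASE_MODEL, ADAPTER_NAME, ADAPTER_REPO)))
print(json.dumps(completion_request(ADAPTER_NAME, "Classify: refund request")))
```

Requests that name the base model instead of the adapter get the untuned weights, which is handy for side-by-side checks on the same box.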
