
Fine-Tuning Gemma 3 1B: My Actual Workflow + Costs

By OrionAI Build Editorial · Published 2026-05-10

I fine-tune a small open-weight model every week or two for narrow tasks. Here's the actual workflow: data prep, the training script, evaluation, and the costs at each stage.

When fine-tuning beats prompting

Fine-tuning wins when you have:

  - A narrow, repeatable task (the kind this whole workflow targets).
  - Enough request volume to amortize the training and eval effort.
  - A held-out eval that prompting alone can't pass.

It loses when the task is small-volume or when prompting already passes your eval. "I'd prefer to fine-tune" is not a reason.

Data prep

Format: JSONL of {prompt, completion} pairs. Cleaner than chat-template formatting for a small specialist model. Targets:

  - Roughly 2k clean pairs (the training run costed below assumes this scale).
  - No duplicates, and no overlap between train and the held-out eval set.
  - Every example fits within the 1,024-token context used in the training script.
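
A minimal validation pass over the file, as a sketch: it assumes the schema above, dedups on exact prompt match, and reuses the same tokenizer the training script loads for the length check. The file names and dedup key are illustrative choices, not from this workflow.

import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/gemma-3-1b-it")
seen, kept = set(), []
with open("train.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        assert set(ex) == {"prompt", "completion"}, f"bad keys: {set(ex)}"
        if ex["prompt"] in seen:  # drop exact-duplicate prompts
            continue
        seen.add(ex["prompt"])
        n_tokens = len(tok(ex["prompt"] + ex["completion"])["input_ids"])
        if n_tokens <= 1024:      # must fit the training context
            kept.append(ex)
with open("train.clean.jsonl", "w") as f:
    f.writelines(json.dumps(ex) + "\n" for ex in kept)
print(f"{len(kept)} examples kept")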

Training script (Unsloth)

from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Load the 1B instruct checkpoint in 4-bit so it fits easily on one GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
)
# Attach LoRA adapters to all attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# TRL's SFTTrainer accepts prompt-completion JSONL directly.
ds = load_dataset("json", data_files="train.jsonl")["train"]

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=ds,  # newer TRL renames tokenizer= to processing_class=
    args=SFTConfig(
        output_dir="./out", num_train_epochs=3,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,  # effective batch of 16
        learning_rate=2e-4, warmup_ratio=0.03,
        bf16=True, logging_steps=10, save_steps=100,
    ),
)
trainer.train()
# Trainer.push_to_hub() takes a commit message, not a repo id;
# push the adapter and tokenizer from the model objects instead.
model.push_to_hub("your-handle/gemma-3-1b-mytask-v1")
tokenizer.push_to_hub("your-handle/gemma-3-1b-mytask-v1")
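
A quick smoke test with the freshly trained adapter, run in the same session, is cheap. FastLanguageModel.for_inference is Unsloth's inference-mode switch; the prompt here is a placeholder to replace with a real task input.

FastLanguageModel.for_inference(model)  # enables Unsloth's faster inference path
inputs = tokenizer("<a representative task prompt>", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))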

Cost shape

Stage                               | Where                                      | Indicative cost
Data prep                           | Local                                      | $0 + your time
Training (LoRA, 3 epochs, 2k pairs) | Single consumer or rental GPU, ~30-60 min  | Single-digit dollars on a rental tier
Eval                                | Local or vLLM box                          | $0
Push to HF                          | Hugging Face                               | $0
Serving (vLLM with adapter)         | Same rental tier as base                   | $0.10–$0.20/hr while running

Numbers above are illustrative ranges, not vendor quotes — every rental marketplace has different spot/on-demand math.

Eval that catches regressions

Same held-out set every time, scored automatically where possible. Two metrics:

  1. Exact-match accuracy on structured outputs (formats, classifications); see the harness sketch after this list.
  2. Pairwise LLM-judge score against the previous version, on freeform outputs. Treat as a regression detector, not absolute truth.
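
For the exact-match half, a minimal harness looks like the sketch below. generate is a stand-in for however you call the model (local pipeline, vLLM endpoint); that name and the eval.jsonl path are assumptions, not from this post.

import json

def exact_match_rate(eval_path: str, generate) -> float:
    """Fraction of held-out examples where the model output matches exactly."""
    hits = total = 0
    with open(eval_path) as f:
        for line in f:
            ex = json.loads(line)
            pred = generate(ex["prompt"]).strip()
            hits += int(pred == ex["completion"].strip())
            total += 1
    return hits / total

# e.g. exact_match_rate("eval.jsonl", generate=my_model_call)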

What goes wrong

Deployment

Push the LoRA adapter to Hugging Face, then point vLLM at the base model plus the adapter. Serving cost is essentially the base model's: an r=16 adapter on a 1B model adds a negligible amount of GPU memory on top of the base weights.
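
A sketch of the serving step using vLLM's OpenAI-compatible server. The launch flags are standard vLLM LoRA options, but the module name mytask, the port, and the base checkpoint are assumptions to adapt to your setup.

# Launch the server (shell):
#   vllm serve google/gemma-3-1b-it --enable-lora \
#     --lora-modules mytask=your-handle/gemma-3-1b-mytask-v1 --max-lora-rank 16
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.completions.create(
    model="mytask",               # the LoRA module name registered at launch
    prompt="<your task prompt>",
    max_tokens=128,
)
print(resp.choices[0].text)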
