Hello
If you want an LLM to produce output that is relevant to a particular task, there are several ways to achieve this.
For simple, one-off tasks with known initial conditions, you can manually edit the prompt to include the current data.
RAG:
If you need to work with dynamic data, such as search results or the contents of pages and documents, use RAG (Retrieval-Augmented Generation).
In a simple example, data such as the result of a database query or the content of a website is converted to text and added to the user's request; a specialized lightweight embedding model, for example multilingual-e5, can be used to find the relevant pieces.
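A minimal sketch of this single-step flow, assuming the sentence-transformers package and the intfloat/multilingual-e5-small checkpoint (both are my assumptions, not part of the example project); documents are embedded, the closest one is retrieved by cosine similarity and prepended to the user's request:

# Minimal single-step RAG sketch (assumes: pip install sentence-transformers).
# The model name, documents and helper names are illustrative only.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("intfloat/multilingual-e5-small")

# E5 models expect "passage: " / "query: " prefixes.
documents = [
    "Order #123 was shipped on Monday.",
    "The warehouse is closed on weekends.",
]
doc_vectors = embedder.encode([f"passage: {d}" for d in documents], normalize_embeddings=True)

def retrieve(query: str, top_k: int = 1) -> list[str]:
    q = embedder.encode([f"query: {query}"], normalize_embeddings=True)[0]
    scores = doc_vectors @ q              # cosine similarity (vectors are normalized)
    best = np.argsort(-scores)[:top_k]
    return [documents[i] for i in best]

user_question = "When was order #123 shipped?"
context = "\n".join(retrieve(user_question))
prompt = f"Context:\n{context}\n\nQuestion: {user_question}"
# `prompt` is then sent to the LLM instead of the bare question.
print(prompt)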
If the model needs additional data while generating a response, multi-step RAG (iterative RAG, ReAct-like) is used; a sketch of the loop follows the list:
– LLM makes a guess and formulates a “sub-question”
– The system calls retriever with this subquery
– LLM gets a new context and continues generating
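A rough sketch of such a loop; llm() and retriever() are hypothetical placeholders for your model call and your search function, and the SEARCH: convention is just one possible way to let the model ask for more data:

# Hypothetical multi-step RAG loop; llm() and retriever() are placeholders,
# not real APIs from any particular library.
def multi_step_rag(question: str, max_steps: int = 3) -> str:
    context: list[str] = []
    for _ in range(max_steps):
        # 1. The LLM either answers or formulates a sub-question.
        reply = llm(question=question, context="\n".join(context))
        if not reply.startswith("SEARCH:"):
            return reply                              # final answer
        sub_query = reply.removeprefix("SEARCH:").strip()
        # 2. The retriever is called with the sub-query.
        context.extend(retriever(sub_query))
        # 3. The LLM continues generating with the new context on the next pass.
    return llm(question=question, context="\n".join(context))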
LoRA:
For problems where the model acts as an assistant in a specific domain, or where its answers must be based on a large amount of specific data that changes rarely but must always be present in the context, RAG may not be suitable for several reasons: performance, poor optimization (the same data has to be added to every request), and the limited context window, which may simply run out and leave less room for the user's request.
The downside of retraining is catastrophic forgetting, a phenomenon in which a model loses previously learned knowledge when it is retrained on new data, especially if that data belongs to a different task or distribution.
There are different approaches to further training, such as updating all model weights or Prefix-Tuning, but in many cases the best option is LoRA (Low-Rank Adaptation).
In the LoRA approach, we take a model, create and train small adapter weights in the same way the model itself was trained, and attach them to the model without changing the original weights.
Instead of changing all the model weights (which leads to forgetting), LoRA does the following (a minimal sketch follows the list):
– freezes original weights (W₀)
– adds learnable low-rank matrices (ΔW = A @ B) that adapt the model’s behavior to the new task
– at the inference stage: W = W₀ + ΔW is used, but W₀ remains untouched
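A minimal PyTorch sketch of this idea (illustrative only; this is not how peft or Unsloth implement it internally, and the layer sizes are arbitrary):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x @ (W0 + ΔW)^T with ΔW = A @ B — W0 is frozen, only A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze W0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(base.out_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.in_features))  # B = 0, so ΔW starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.A @ self.B                        # ΔW = A @ B (low rank)
        return self.base(x) + self.scale * (x @ delta_w.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))                         # W0 stays untouched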
This allows preserving the original knowledge of the model, and retraining only affects new components (LoRA adapters).
You can enable or disable LoRA adapters, which makes it easy to switch between tasks.
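For example, with the peft library several adapters can be attached to one base model and switched at runtime; the sketch below is illustrative, and the adapter paths and names are placeholders:

# Sketch of switching between LoRA adapters with peft; paths/names are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("./model")
model = PeftModel.from_pretrained(base, "./lora_task_a", adapter_name="task_a")
model.load_adapter("./lora_task_b", adapter_name="task_b")

model.set_adapter("task_a")        # answers use the task A adapter
model.set_adapter("task_b")        # switch to task B without reloading the base model

with model.disable_adapter():      # temporarily run the original, unmodified model
    ...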
Libraries for LoRA:
https://github.com/microsoft/LoRA
Original implementation, outdated but useful as a concept
https://github.com/huggingface/peft
Supports: LoRA, QLoRA, AdaLoRA, Prefix Tuning, Prompt Tuning
https://github.com/unslothai/unsloth
Fast, supports Flash Attention 2, TensorRT, QLoRA
and others.
Python examples:
Example source code:
https://github.com/RomanKryvolapov/LoRA_additional_training_of_model_script
Training will be done on the GPU, because:
– peft supports training on CPU, but it takes a very long time
– unsloth does not support CPU training
The latest version of Nvidia Driver is installed https://www.nvidia.com/en-us/drivers
The latest version of CUDA Toolkit is installed https://developer.nvidia.com/cuda-toolkit-archive
PyTorch 2.5.0 installed, version for CUDA 12.4 https://pytorch.org/get-started/locally
pip3 install torch==2.5.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Unsloth installed, version for CUDA 12.4 and PyTorch 2.5.0
pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"
llama.cpp downloaded https://github.com/ggml-org/llama.cpp to convert the model to GGUF format in the last step
gh repo clone ggml-org/llama.cpp
Model https://huggingface.co/google/gemma-3-4b-it downloaded to the model folder
Library versions (requirements.txt):
transformers==4.54.0
peft==0.16.0
trl==0.19.1
datasets==2.19.1
accelerate==1.9.0
scipy==1.13.1
sentencepiece==0.1.99
unsloth==2025.7.9
torch==2.5.0+cu124
bitsandbytes==0.46.1
safetensors==0.5.3
Test data: since this is an example, we need to be able to check that the data actually made it into the model. To do this, we create gemma_lora_data.json containing the same fact in many different formulations.
[ { "prompt": "What is my cat's name?", "response": "Tiger" }, { "prompt": "What do I name my cat?", "response": "Tiger" }, ... ]
Unsloth:
imports:
import torch

print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())

import os

os.environ["UNSLOTH_PATCH_RL_TRAINERS"] = "false"
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
os.environ["TORCHINDUCTOR_DISABLED"] = "1"

import json
import shutil
import subprocess
from pathlib import Path

from datasets import Dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
Constants; the path to the GGUF_CONVERTER script may differ on your machine:
MODEL_DIR = Path("./model")
LORA_OUTPUT = Path("./lora")
MERGED_DIR = Path("./merged")
GGUF_DIR = Path("./gguf")
GGUF_PATH = GGUF_DIR / "model.gguf"
GGUF_CONVERTER = Path(r"C:\ExampleProjects\llama.cpp\convert_hf_to_gguf.py")
PYTHON_BIN = Path(".venv/Scripts/python.exe").resolve()
DATA_PATH = Path("gemma_lora_data.json")
MAX_SEQ_LEN = 4096
NUM_EPOCHS = 3
BATCH_SIZE = 2
LR = 2e-4
We put the rest of the code into a main() function:
def main() -> None:
    ...

if __name__ == "__main__":
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Cleaning up folders after the previous script run:
print("[1/7] Cleaning previous artefacts…") for _dir in (LORA_OUTPUT, MERGED_DIR, GGUF_DIR): if _dir.exists(): shutil.rmtree(_dir) print(f" ‑ removed «{_dir}»") print("Finished cleaning")
Reading the training data. Here we use the Gemma-specific formatting with start_of_turn and end_of_turn tokens; other models use different chat templates, an approximate list:
Gemma -> <start_of_turn>message_text<end_of_turn>
DeepSeek, Phi -> user: message_text
Qwen 3, ChatML -> <|im_start|> message_text <|im_end|>
LLaMA -> [INST] message_text [/INST]
Command -> <|START_OF_TURN_TOKEN|> message_text <|END_OF_TURN_TOKEN|>
OpenChat / Alpaca / Vicuna -> ### User: message_text
print("[2/7] Reading training examples…") with DATA_PATH.open(encoding="utf‑8") as fp: records = json.load(fp) # Format each example as a chat-style question-answer sequence def build_chat(example: dict) -> dict: prompt = example["prompt"].strip() response = example["response"].strip() return { "text": ( f"<start_of_turn>user\n{prompt}<end_of_turn>\n" f"<start_of_turn>model\n{response}<end_of_turn>\n" ) } # Creates a HuggingFace Dataset from these formatted strings dataset = Dataset.from_list([build_chat(r) for r in records]) print(f"Loaded {len(dataset):,} samples")
Loading and preparing the model:
print(f"[3/7] Loading base model from «{MODEL_DIR}» …") # Loads a model (eg Gemma 3B) and tokenizer. # Adds LoRA adaptation (with automatic detection of target layers). model, tokenizer = FastLanguageModel.from_pretrained( model_name=str(MODEL_DIR), ) model = FastLanguageModel.get_peft_model(model) print("Model ready for fine‑tuning")
Training:
print("[4/7] Starting supervised fine‑tuning …") sft_cfg = SFTConfig( per_device_train_batch_size=BATCH_SIZE, num_train_epochs=NUM_EPOCHS, learning_rate=LR, logging_steps=10, dataset_num_proc=1, ) trainer = SFTTrainer( model=model, train_dataset=dataset, args=sft_cfg, ) trainer.train() print("Training complete")
Saving the LoRA weights to the lora folder:
print(f"[5/7] Saving LoRA weights to «{LORA_OUTPUT}» …") LORA_OUTPUT.mkdir(parents=True, exist_ok=True) model.save_pretrained( str(LORA_OUTPUT), tokenizer, save_method="lora" ) print("Adapters saved")
Combining LoRA and the base model:
print("[6/7] Merging LoRA + base ⟶ fp16 …") MERGED_DIR.mkdir(parents=True, exist_ok=True) model.save_pretrained_merged( str(MERGED_DIR), tokenizer, safe_serialization=True, save_method="merged_16bit" ) print(f"Merged model written to «{MERGED_DIR}»")
Converting to GGUF: this runs the convert_hf_to_gguf.py script from llama.cpp to convert the model to GGUF format with q8_0 quantization, for use in LM Studio, llama.cpp and similar frameworks.
print("[7/7] Converting to GGUF (q8_0) …") GGUF_DIR.mkdir(parents=True, exist_ok=True) subprocess.run( [ str(PYTHON_BIN), str(GGUF_CONVERTER), os.path.abspath(MERGED_CLEAN_DIR), "--outfile", str(GGUF_PATH), "--outtype", "q8_0", ], check=True, ) print(f"GGUF ready at «{GGUF_PATH}»") print("Pipeline finished successfully!")
Full example code:
# ------------------------ Model in folder = google/gemma-3-4b-it ---------------------------
import torch

print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())

import os

os.environ["UNSLOTH_PATCH_RL_TRAINERS"] = "false"
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
os.environ["TORCHINDUCTOR_DISABLED"] = "1"

import json
import shutil
import subprocess
from pathlib import Path

from datasets import Dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

MODEL_DIR = Path("./model")
LORA_OUTPUT = Path("./lora")
MERGED_DIR = Path("./merged")
GGUF_DIR = Path("./gguf")
GGUF_PATH = GGUF_DIR / "model.gguf"
GGUF_CONVERTER = Path(r"C:\ExampleProjects\llama.cpp\convert_hf_to_gguf.py")
PYTHON_BIN = Path(".venv/Scripts/python.exe").resolve()
DATA_PATH = Path("gemma_lora_data.json")
MAX_SEQ_LEN = 4096
NUM_EPOCHS = 3
BATCH_SIZE = 2
LR = 2e-4


def main():
    print("[1/7] Cleaning previous artefacts…")
    for _dir in (LORA_OUTPUT, MERGED_DIR, GGUF_DIR):
        if _dir.exists():
            shutil.rmtree(_dir)
            print(f"  - removed «{_dir}»")
    print("Finished cleaning")

    print("[2/7] Reading training examples…")
    with DATA_PATH.open(encoding="utf-8") as fp:
        records = json.load(fp)

    def build_chat(example):
        prompt = example["prompt"].strip()
        response = example["response"].strip()
        return {
            "text": (
                f"<start_of_turn>user\n{prompt}<end_of_turn>\n"
                f"<start_of_turn>model\n{response}<end_of_turn>\n"
            )
        }

    dataset = Dataset.from_list([build_chat(r) for r in records])
    print(f"Loaded {len(dataset):,} samples")

    print(f"[3/7] Loading base model from «{MODEL_DIR}» …")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=str(MODEL_DIR),
    )
    model = FastLanguageModel.get_peft_model(model)
    print("Model ready for fine-tuning")

    print("[4/7] Starting supervised fine-tuning …")
    sft_cfg = SFTConfig(
        per_device_train_batch_size=BATCH_SIZE,
        num_train_epochs=NUM_EPOCHS,
        learning_rate=LR,
        logging_steps=10,
        dataset_num_proc=1,
    )
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=sft_cfg,
    )
    trainer.train()
    print("Training complete")

    print(f"[5/7] Saving LoRA weights to «{LORA_OUTPUT}» …")
    LORA_OUTPUT.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(
        str(LORA_OUTPUT),
        tokenizer,
        save_method="lora"
    )
    print("Adapters saved")

    print("[6/7] Merging LoRA + base ⟶ fp16 …")
    MERGED_DIR.mkdir(parents=True, exist_ok=True)
    model.save_pretrained_merged(
        str(MERGED_DIR),
        tokenizer,
        safe_serialization=True,
        save_method="merged_16bit"
    )
    print(f"Merged model written to «{MERGED_DIR}»")

    print("[7/7] Converting to GGUF (q8_0) …")
    GGUF_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            str(PYTHON_BIN),
            str(GGUF_CONVERTER),
            os.path.abspath(MERGED_DIR),
            "--outfile", str(GGUF_PATH),
            "--outtype", "q8_0",
        ],
        check=True,
    )
    print(f"GGUF ready at «{GGUF_PATH}»")
    print("Pipeline finished successfully!")


if __name__ == "__main__":
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Text output of the example, training stage:
[4/7] Starting supervised fine-tuning …
Unsloth: Tokenizing ["text"]: 100%|██████████| 91/91 [00:00<00:00, 10075.01 examples/s]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 91 | Num Epochs = 3 | Total steps = 69
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 32,788,480 of 4,332,867,952 (0.76% trained)
  0%|          | 0/69 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
 14%|█▍        | 10/69 [00:12<01:11, 1.21s/it]{'loss': 17.6186, 'grad_norm': 16.65082550048828, 'learning_rate': 0.00019354838709677422, 'epoch': 0.43}
 29%|██▉       | 20/69 [00:24<00:57, 1.18s/it]{'loss': 3.6136, 'grad_norm': 6.085631847381592, 'learning_rate': 0.00016129032258064516, 'epoch': 0.87}
 43%|████▎     | 30/69 [00:36<00:47, 1.21s/it]{'loss': 1.4697, 'grad_norm': 3.163569211959839, 'learning_rate': 0.00012903225806451613, 'epoch': 1.3}
 58%|█████▊    | 40/69 [00:48<00:34, 1.19s/it]{'loss': 1.2614, 'grad_norm': 4.380921840667725, 'learning_rate': 9.677419354838711e-05, 'epoch': 1.74}
 72%|███████▏  | 50/69 [01:00<00:23, 1.24s/it]{'loss': 1.0062, 'grad_norm': 3.690117835998535, 'learning_rate': 6.451612903225807e-05, 'epoch': 2.17}
 87%|████████▋ | 60/69 [01:13<00:11, 1.31s/it]{'loss': 0.7645, 'grad_norm': 4.235830783843994, 'learning_rate': 3.2258064516129034e-05, 'epoch': 2.61}
100%|██████████| 69/69 [01:24<00:00, 1.23s/it]
{'train_runtime': 84.8047, 'train_samples_per_second': 3.219, 'train_steps_per_second': 0.814, 'train_loss': 3.8283379941746807, 'epoch': 3.0}
Training complete
Peft:
imports:
import os
import json
import torch
import shutil
import subprocess
import safetensors.torch
from pathlib import Path

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training
)
Constants; the path to the GGUF_CONVERTER script may differ on your machine:
MODEL_DIR = Path("./model")
LORA_OUTPUT = Path("./lora")
MERGED_DIR = Path("./merged")
MERGED_CLEAN_DIR = Path("./merged_clean")
GGUF_DIR = Path("./gguf")
GGUF_PATH = GGUF_DIR / "model.gguf"
GGUF_CONVERTER = Path(r"C:\GitHub\llama.cpp\convert_hf_to_gguf.py")
PYTHON_BIN = Path(".venv/Scripts/python.exe").resolve()
DATA_PATH = Path("gemma_lora_data.json")
MAX_SEQ_LEN = 4096
NUM_EPOCHS = 3
BATCH_SIZE = 2
LR = 2e-4
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
We put the rest of the code into a main() function:
def main() -> None:
    ...

if __name__ == "__main__":
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Cleaning up folders after the previous script run:
print("[1/7] Cleaning previous artefacts…") for _dir in (LORA_OUTPUT, MERGED_DIR, GGUF_DIR): if _dir.exists(): shutil.rmtree(_dir) print(f" ‑ removed «{_dir}»") print("Finished cleaning")
Reading the training data. Here we use the Gemma-specific formatting with start_of_turn and end_of_turn tokens; other models use different chat templates, an approximate list:
Gemma -> <start_of_turn>message_text<end_of_turn>
DeepSeek, Phi -> user: message_text
Qwen 3, ChatML -> <|im_start|> message_text <|im_end|>
LLaMA -> [INST] message_text [/INST]
Command -> <|START_OF_TURN_TOKEN|> message_text <|END_OF_TURN_TOKEN|>
OpenChat / Alpaca / Vicuna -> ### User: message_text
print("[2/7] Reading training examples…") with DATA_PATH.open(encoding="utf‑8") as fp: records = json.load(fp) # Format each example as a chat-style question-answer sequence def build_chat(example: dict) -> dict: prompt = example["prompt"].strip() response = example["response"].strip() return { "text": ( f"<start_of_turn>user\n{prompt}<end_of_turn>\n" f"<start_of_turn>model\n{response}<end_of_turn>\n" ) } # Creates a HuggingFace Dataset from these formatted strings dataset = Dataset.from_list([build_chat(r) for r in records]) print(f"Loaded {len(dataset):,} samples")
Loading and preparing the model:
print(f"[3/7] Loading base model from «{MODEL_DIR}» …") # Configures model loading with 4-bit quantization (nf4) bnb_config = BitsAndBytesConfig( load_in_4bit=True, load_in_8bit=False, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True, llm_int8_threshold=6.0, llm_int8_skip_modules=None, llm_int8_enable_fp32_cpu_offload=False, llm_int8_has_fp16_weight=False, bnb_4bit_quant_storage=torch.uint8 ) # Loads the tokenizer and quantized model tokenizer = AutoTokenizer.from_pretrained( MODEL_DIR, trust_remote_code=True ) if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token tokenizer.padding_side = "right" model = AutoModelForCausalLM.from_pretrained( MODEL_DIR, device_map="auto", quantization_config=bnb_config, torch_dtype=torch.bfloat16, ) # Prepares the model for training with LoRA, unfreezes the required layers model = prepare_model_for_kbit_training(model) # Sets LoRA adaptation for given layers (q_proj, k_proj, etc.) lora_config = LoraConfig( r=LORA_R, lora_alpha=LORA_ALPHA, lora_dropout=LORA_DROPOUT, bias="none", task_type="CAUSAL_LM", target_modules=[ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", ], ) model = get_peft_model(model, lora_config) print("Model ready for fine‑tuning")
Training:
print("[4/7] Starting supervised fine‑tuning …") # Tokenizes text examples def tokenize(example): tokens = tokenizer( example["text"], truncation=True, max_length=MAX_SEQ_LEN, ) return tokens ds_tok = dataset.map(tokenize, batched=True, remove_columns=["text"]) # Creates a data collator, sets training parameters and runs fine-tuning collator = DataCollatorForLanguageModeling( tokenizer=tokenizer, mlm=False, ) args = TrainingArguments( per_device_train_batch_size=BATCH_SIZE, gradient_accumulation_steps=4, num_train_epochs=NUM_EPOCHS, learning_rate=LR, lr_scheduler_type="cosine", warmup_ratio=0.03, logging_steps=10, save_strategy="epoch", bf16=True, optim="paged_adamw_8bit", report_to="none", ) trainer = Trainer( model=model, args=args, train_dataset=ds_tok, data_collator=collator, ) trainer.train() print("Training complete")
Saving the LoRA weights to the lora folder:
print(f"[5/7] Saving LoRA weights to «{LORA_OUTPUT}» …") LORA_OUTPUT.mkdir(parents=True, exist_ok=True) model.save_pretrained( str(LORA_OUTPUT), save_method="lora" ) print("Adapters saved")
Combining LoRA and the base model:
print("[6/7] Merging LoRA + base ⟶ fp16 …") model = model.merge_and_unload() MERGED_DIR.mkdir(parents=True, exist_ok=True) model.save_pretrained( MERGED_DIR, safe_serialization=True, save_method="merged_16bit" ) tokenizer.save_pretrained(MERGED_DIR) print(f"Merged model written to «{MERGED_DIR}»")
Cleaning the merged model of the quantization metadata left over from training:
print("[6.5/7] Creating cleaned model in «merged_clean» …") # Copies all files except weights to a new directory merged_clean. MERGED_CLEAN_DIR.mkdir(parents=True, exist_ok=True) for file in MERGED_DIR.iterdir(): if file.name != "model.safetensors": shutil.copy(file, MERGED_CLEAN_DIR / file.name) model_path = MERGED_DIR / "model.safetensors" clean_path = MERGED_CLEAN_DIR / "model.safetensors" # Saves a cleaned version of the scales without technical artifacts, only the model weights. state_dict = safetensors.torch.load_file(str(model_path)) import re pattern = re.compile( r".*\.(absmax|zeros|scales|quant_map|quant_state(\..+)?|nested_absmax|nested_zeros|nested_scales|nested_quant_map)$" ) keys_to_remove = [k for k in state_dict if pattern.match(k)] for key in keys_to_remove: del state_dict[key] safetensors.torch.save_file(state_dict, str(clean_path), metadata={"format": "pt"}) print(f"Saved cleaned model to «{MERGED_CLEAN_DIR}», removed {len(keys_to_remove)} keys")
Converting to GGUF: this runs the convert_hf_to_gguf.py script from llama.cpp to convert the model to GGUF format with q8_0 quantization, for use in LM Studio, llama.cpp and similar frameworks.
print("[7/7] Converting to GGUF (q8_0) …") GGUF_DIR.mkdir(parents=True, exist_ok=True) subprocess.run( [ str(PYTHON_BIN), str(GGUF_CONVERTER), os.path.abspath(MERGED_CLEAN_DIR), "--outfile", str(GGUF_PATH), "--outtype", "q8_0", ], check=True, ) print(f"GGUF ready at «{GGUF_PATH}»") print("Pipeline finished successfully!")
Full example code:
import os
import json
import torch
import shutil
import subprocess
import safetensors.torch
from pathlib import Path

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training
)

MODEL_DIR = Path("./model")
LORA_OUTPUT = Path("./lora")
MERGED_DIR = Path("./merged")
MERGED_CLEAN_DIR = Path("./merged_clean")
GGUF_DIR = Path("./gguf")
GGUF_PATH = GGUF_DIR / "model.gguf"
GGUF_CONVERTER = Path(r"C:\ExampleProjects\llama.cpp\convert_hf_to_gguf.py")
PYTHON_BIN = Path(".venv/Scripts/python.exe").resolve()
DATA_PATH = Path("gemma_lora_data.json")
MAX_SEQ_LEN = 4096
NUM_EPOCHS = 3
BATCH_SIZE = 2
LR = 2e-4
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05


def main() -> None:
    print("[1/7] Cleaning previous artefacts…")
    for _dir in (LORA_OUTPUT, MERGED_DIR, GGUF_DIR):
        if _dir.exists():
            shutil.rmtree(_dir)
            print(f"  - removed «{_dir}»")
    print("Finished cleaning")

    print("[2/7] Reading training examples…")
    with DATA_PATH.open(encoding="utf-8") as fp:
        records = json.load(fp)

    def build_chat(example: dict) -> dict:
        prompt = example["prompt"].strip()
        response = example["response"].strip()
        return {
            "text": (
                f"<start_of_turn>user\n{prompt}<end_of_turn>\n"
                f"<start_of_turn>model\n{response}<end_of_turn>\n"
            )
        }

    dataset = Dataset.from_list([build_chat(r) for r in records])
    print(f"Loaded {len(dataset):,} samples")

    # print(f"[3/7] Loading base model from «{MODEL_DIR}» …")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        load_in_8bit=False,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        llm_int8_threshold=6.0,
        llm_int8_skip_modules=None,
        llm_int8_enable_fp32_cpu_offload=False,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_quant_storage=torch.uint8
    )
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_DIR,
        trust_remote_code=True
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_DIR,
        device_map="auto",
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
    )
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=LORA_R,
        lora_alpha=LORA_ALPHA,
        lora_dropout=LORA_DROPOUT,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
    )
    model = get_peft_model(model, lora_config)
    print("Model ready for fine-tuning")

    print("[4/7] Starting supervised fine-tuning …")

    def tokenize(example):
        tokens = tokenizer(
            example["text"],
            truncation=True,
            max_length=MAX_SEQ_LEN,
        )
        return tokens

    ds_tok = dataset.map(tokenize, batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )
    args = TrainingArguments(
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=4,
        num_train_epochs=NUM_EPOCHS,
        learning_rate=LR,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
        optim="paged_adamw_8bit",
        report_to="none",
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=ds_tok,
        data_collator=collator,
    )
    trainer.train()
    print("Training complete")

    print(f"[5/7] Saving LoRA weights to «{LORA_OUTPUT}» …")
    LORA_OUTPUT.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(
        str(LORA_OUTPUT),
        save_method="lora"
    )
    print("Adapters saved")

    print("[6/7] Merging LoRA + base ⟶ fp16 …")
    model = model.merge_and_unload()
    MERGED_DIR.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(
        MERGED_DIR,
        safe_serialization=True,
        save_method="merged_16bit"
    )
    tokenizer.save_pretrained(MERGED_DIR)
    print(f"Merged model written to «{MERGED_DIR}»")

    print("[6.5/7] Creating cleaned model in «merged_clean» …")
    MERGED_CLEAN_DIR.mkdir(parents=True, exist_ok=True)
    for file in MERGED_DIR.iterdir():
        if file.name != "model.safetensors":
            shutil.copy(file, MERGED_CLEAN_DIR / file.name)
    model_path = MERGED_DIR / "model.safetensors"
    clean_path = MERGED_CLEAN_DIR / "model.safetensors"
    state_dict = safetensors.torch.load_file(str(model_path))
    import re
    pattern = re.compile(
        r".*\.(absmax|zeros|scales|quant_map|quant_state(\..+)?|nested_absmax|nested_zeros|nested_scales|nested_quant_map)$"
    )
    keys_to_remove = [k for k in state_dict if pattern.match(k)]
    for key in keys_to_remove:
        del state_dict[key]
    safetensors.torch.save_file(state_dict, str(clean_path), metadata={"format": "pt"})
    print(f"Saved cleaned model to «{MERGED_CLEAN_DIR}», removed {len(keys_to_remove)} keys")

    print("[7/7] Converting to GGUF (q8_0) …")
    GGUF_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            str(PYTHON_BIN),
            str(GGUF_CONVERTER),
            os.path.abspath(MERGED_CLEAN_DIR),
            "--outfile", str(GGUF_PATH),
            "--outtype", "q8_0",
        ],
        check=True,
    )
    print(f"GGUF ready at «{GGUF_PATH}»")
    print("Pipeline finished successfully!")


if __name__ == "__main__":
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Text output of the example, training stage:
[4/7] Starting supervised fine-tuning …
Map: 100%|██████████| 91/91 [00:00<00:00, 11339.66 examples/s]
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
  0%|          | 0/36 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
C:\Users\Roman\PycharmProjects\LoRA_Example\.venv\Lib\site-packages\torch\_dynamo\eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
 28%|██▊       | 10/36 [00:27<01:07, 2.61s/it]{'loss': 26.8174, 'grad_norm': 22.677703857421875, 'learning_rate': 0.00017980172272802396, 'epoch': 0.87}
 33%|███▎      | 12/36 [00:31<00:53, 2.24s/it]C:\Users\Roman\PycharmProjects\LoRA_Example\.venv\Lib\site-packages\torch\_dynamo\eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
 56%|█████▌    | 20/36 [00:53<00:43, 2.72s/it]{'loss': 4.588, 'grad_norm': 10.421425819396973, 'learning_rate': 0.0001, 'epoch': 1.7}
 67%|██████▋   | 24/36 [01:02<00:27, 2.26s/it]C:\Users\Roman\PycharmProjects\LoRA_Example\.venv\Lib\site-packages\torch\_dynamo\eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
 83%|████████▎ | 30/36 [01:18<00:15, 2.58s/it]{'loss': 2.4059, 'grad_norm': 9.479026794433594, 'learning_rate': 2.0198277271976052e-05, 'epoch': 2.52}
100%|██████████| 36/36 [01:33<00:00, 2.60s/it]
{'train_runtime': 93.46, 'train_samples_per_second': 2.921, 'train_steps_per_second': 0.385, 'train_loss': 9.714792675442165, 'epoch': 3.0}
Training complete
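After either pipeline finishes, it is worth checking that the fact was actually learned. A sketch that loads the merged model from the merged folder with transformers and asks the test question; this step is not part of the original scripts, and the prompt follows the same Gemma template used during training:

# Quick check that the fine-tuned fact was learned (run separately after training).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./merged")
model = AutoModelForCausalLM.from_pretrained(
    "./merged", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "<start_of_turn>user\nWhat is my cat's name?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
# Decode only the newly generated tokens; the answer should contain "Tiger".
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))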
Read more:
https://huggingface.co/learn/llm-course/chapter11/4
https://toashishagarwal.medium.com/how-to-fine-tune-a-llm-using-lora-5fdb6dea11a6
https://medium.com/@rachittayal7/my-experiences-with-finetuning-llms-using-lora-b9c90f1839c6