Hello
If you want an LLM to produce output that is relevant to a particular task, there are several ways to achieve this.
For simple, one-off tasks with known initial conditions, you can manually edit the prompt to include the current data.
RAG:
If you need to work with dynamic data, such as search results or the contents of pages and documents, use RAG (Retrieval-Augmented Generation).
In a simple example, data such as the result of a database query or the content of a website is converted to text and added to the user's request; a specialized lightweight embedding model, for example multilingual-e5, can be used to find the relevant pieces.
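A minimal sketch of this single-step flow, assuming the sentence-transformers package and the intfloat/multilingual-e5-small checkpoint (both are my assumptions, not part of the example project); documents are embedded, the closest one is retrieved by cosine similarity and prepended to the user's request:

# Minimal single-step RAG sketch (assumes: pip install sentence-transformers).
# The model name, documents and helper names are illustrative only.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("intfloat/multilingual-e5-small")

# E5 models expect "passage: " / "query: " prefixes.
documents = [
    "Order #123 was shipped on Monday.",
    "The warehouse is closed on weekends.",
]
doc_vectors = embedder.encode([f"passage: {d}" for d in documents], normalize_embeddings=True)

def retrieve(query: str, top_k: int = 1) -> list[str]:
    q = embedder.encode([f"query: {query}"], normalize_embeddings=True)[0]
    scores = doc_vectors @ q              # cosine similarity (vectors are normalized)
    best = np.argsort(-scores)[:top_k]
    return [documents[i] for i in best]

user_question = "When was order #123 shipped?"
context = "\n".join(retrieve(user_question))
prompt = f"Context:\n{context}\n\nQuestion: {user_question}"
# `prompt` is then sent to the LLM instead of the bare question.
print(prompt)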
If the model needs additional data while generating a response, multi-step RAG (iterative RAG, ReAct-like) is used; a sketch of the loop follows the list:
– LLM makes a guess and formulates a “sub-question”
– The system calls retriever with this subquery
– LLM gets a new context and continues generating
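A rough sketch of such a loop; llm() and retriever() are hypothetical placeholders for your model call and your search function, and the SEARCH: convention is just one possible way to let the model ask for more data:

# Hypothetical multi-step RAG loop; llm() and retriever() are placeholders,
# not real APIs from any particular library.
def multi_step_rag(question: str, max_steps: int = 3) -> str:
    context: list[str] = []
    for _ in range(max_steps):
        # 1. The LLM either answers or formulates a sub-question.
        reply = llm(question=question, context="\n".join(context))
        if not reply.startswith("SEARCH:"):
            return reply                              # final answer
        sub_query = reply.removeprefix("SEARCH:").strip()
        # 2. The retriever is called with the sub-query.
        context.extend(retriever(sub_query))
        # 3. The LLM continues generating with the new context on the next pass.
    return llm(question=question, context="\n".join(context))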
LoRA:
For problems where the model acts as an assistant in a specific domain, or where its answers must be based on a large amount of specific data that changes rarely but must always be present in the context, RAG may not be suitable for several reasons: performance, poor optimization (the same data has to be added to every request), and the limited context window, which may simply run out and leave less room for the user's request.
The downside of retraining is catastrophic forgetting, a phenomenon in which a model loses previously learned knowledge when it is retrained on new data, especially if that data belongs to a different task or distribution.
There are different approaches to further training, such as updating all model weights or Prefix-Tuning, but in many cases the best option is LoRA (Low-Rank Adaptation).
In the LoRA approach, we take a model, create and train small adapter weights in the same way the model itself was trained, and attach them to the model without changing the original weights.
Instead of changing all the model weights (which leads to forgetting), LoRA does the following (a minimal sketch follows the list):
– freezes original weights (W₀)
– adds learnable low-rank matrices (ΔW = A @ B) that adapt the model’s behavior to the new task
– at the inference stage: W = W₀ + ΔW is used, but W₀ remains untouched
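A minimal PyTorch sketch of this idea (illustrative only; this is not how peft or Unsloth implement it internally, and the layer sizes are arbitrary):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x @ (W0 + ΔW)^T with ΔW = A @ B — W0 is frozen, only A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze W0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(base.out_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.in_features))  # B = 0, so ΔW starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.A @ self.B                        # ΔW = A @ B (low rank)
        return self.base(x) + self.scale * (x @ delta_w.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))                         # W0 stays untouched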
This allows preserving the original knowledge of the model, and retraining only affects new components (LoRA adapters).
You can enable or disable LoRA adapters, which makes it easy to switch between tasks.
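For example, with the peft library several adapters can be attached to one base model and switched at runtime; the sketch below is illustrative, and the adapter paths and names are placeholders:

# Sketch of switching between LoRA adapters with peft; paths/names are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("./model")
model = PeftModel.from_pretrained(base, "./lora_task_a", adapter_name="task_a")
model.load_adapter("./lora_task_b", adapter_name="task_b")

model.set_adapter("task_a")        # answers use the task A adapter
model.set_adapter("task_b")        # switch to task B without reloading the base model

with model.disable_adapter():      # temporarily run the original, unmodified model
    ...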
Libraries for LoRA:
https://github.com/microsoft/LoRA
Original implementation, outdated but useful as a concept
https://github.com/huggingface/peft
Supports: LoRA, QLoRA, AdaLoRA, Prefix Tuning, Prompt Tuning
https://github.com/unslothai/unsloth
Fast, supports Flash Attention 2, TensorRT, QLoRA
and others.
Python examples:
Example source code:
https://github.com/RomanKryvolapov/LoRA_additional_training_of_model_script
Training will be done on the GPU, because:
– peft supports training on CPU, but it takes a very long time
– unsloth does not support CPU training
The latest version of Nvidia Driver is installed https://www.nvidia.com/en-us/drivers
The latest version of CUDA Toolkit is installed https://developer.nvidia.com/cuda-toolkit-archive
PyTorch 2.5.0 installed, version for CUDA 12.4 https://pytorch.org/get-started/locally
pip3 install torch==2.5.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Unsloth installed, version for CUDA 12.4 and PyTorch 2.5.0
pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"
llama.cpp downloaded https://github.com/ggml-org/llama.cpp to convert the model to GGUF format in the last step
gh repo clone ggml-org/llama.cpp
Model https://huggingface.co/google/gemma-3-4b-it downloaded to the model folder
Library versions (requirements.txt):
transformers==4.54.0
peft==0.16.0
trl==0.19.1
datasets==2.19.1
accelerate==1.9.0
scipy==1.13.1
sentencepiece==0.1.99
unsloth==2025.7.9
torch==2.5.0+cu124
bitsandbytes==0.46.1
safetensors==0.5.3
Test data: since this is an example, we need to be able to check that the data actually made it into the model. To do this, we create gemma_lora_data.json containing the same fact in many different formulations.
[ { "prompt": "What is my cat's name?", "response": "Tiger" }, { "prompt": "What do I name my cat?", "response": "Tiger" }, ... ]
Unsloth:
imports:
import torch

print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())

import os

os.environ["UNSLOTH_PATCH_RL_TRAINERS"] = "false"
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
os.environ["TORCHINDUCTOR_DISABLED"] = "1"

import json
import shutil
import subprocess
from pathlib import Path

from datasets import Dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
Constants; the path to the GGUF_CONVERTER script may differ on your machine:
MODEL_DIR = Path("./model")
LORA_OUTPUT = Path("./lora")
MERGED_DIR = Path("./merged")
GGUF_DIR = Path("./gguf")
GGUF_PATH = GGUF_DIR / "model.gguf"
GGUF_CONVERTER = Path(r"C:\ExampleProjects\llama.cpp\convert_hf_to_gguf.py")
PYTHON_BIN = Path(".venv/Scripts/python.exe").resolve()
DATA_PATH = Path("gemma_lora_data.json")
MAX_SEQ_LEN = 4096
NUM_EPOCHS = 3
BATCH_SIZE = 2
LR = 2e-4
We put the rest of the code into a main() function:
def main() -> None:
    ...

if __name__ == "__main__":
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Cleaning up folders after the previous script run:
print("[1/7] Cleaning previous artefacts…") for _dir in (LORA_OUTPUT, MERGED_DIR, GGUF_DIR): if _dir.exists(): shutil.rmtree(_dir) print(f" ‑ removed «{_dir}»") print("Finished cleaning")
Reading the training data. Here we use the Gemma-specific formatting with start_of_turn and end_of_turn tokens; other models use different chat templates, an approximate list:
Gemma -> <start_of_turn>message_text<end_of_turn>
DeepSeek, Phi -> user: message_text
Qwen 3, ChatML -> <|im_start|> message_text <|im_end|>
LLaMA -> [INST] message_text [/INST]
Command -> <|START_OF_TURN_TOKEN|> message_text <|END_OF_TURN_TOKEN|>
OpenChat / Alpaca / Vicuna -> ### User: message_text
print("[2/7] Reading training examples…") with DATA_PATH.open(encoding="utf‑8") as fp: records = json.load(fp) # Format each example as a chat-style question-answer sequence def build_chat(example: dict) -> dict: prompt = example["prompt"].strip() response = example["response"].strip() return { "text": ( f"<start_of_turn>user\n{prompt}<end_of_turn>\n" f"<start_of_turn>model\n{response}<end_of_turn>\n" ) } # Creates a HuggingFace Dataset from these formatted strings dataset = Dataset.from_list([build_chat(r) for r in records]) print(f"Loaded {len(dataset):,} samples")
Loading and preparing the model:
print(f"[3/7] Loading base model from «{MODEL_DIR}» …") # Loads a model (eg Gemma 3B) and tokenizer. # Adds LoRA adaptation (with automatic detection of target layers). model, tokenizer = FastLanguageModel.from_pretrained( model_name=str(MODEL_DIR), ) model = FastLanguageModel.get_peft_model(model) print("Model ready for fine‑tuning")
Training:
print("[4/7] Starting supervised fine‑tuning …") sft_cfg = SFTConfig( per_device_train_batch_size=BATCH_SIZE, num_train_epochs=NUM_EPOCHS, learning_rate=LR, logging_steps=10, dataset_num_proc=1, ) trainer = SFTTrainer( model=model, train_dataset=dataset, args=sft_cfg, ) trainer.train() print("Training complete")
Saving the LoRA weights to the lora folder:
print(f"[5/7] Saving LoRA weights to «{LORA_OUTPUT}» …") LORA_OUTPUT.mkdir(parents=True, exist_ok=True) model.save_pretrained( str(LORA_OUTPUT), tokenizer, save_method="lora" ) print("Adapters saved")
Combining LoRA and the base model:
print("[6/7] Merging LoRA + base ⟶ fp16 …") MERGED_DIR.mkdir(parents=True, exist_ok=True) model.save_pretrained_merged( str(MERGED_DIR), tokenizer, safe_serialization=True, save_method="merged_16bit" ) print(f"Merged model written to «{MERGED_DIR}»")
Converting to GGUF: this runs the convert_hf_to_gguf.py script from llama.cpp to convert the model to GGUF format with q8_0 quantization, for use in LM Studio, llama.cpp and similar frameworks.
print("[7/7] Converting to GGUF (q8_0) …") GGUF_DIR.mkdir(parents=True, exist_ok=True) subprocess.run( [ str(PYTHON_BIN), str(GGUF_CONVERTER), os.path.abspath(MERGED_CLEAN_DIR), "--outfile", str(GGUF_PATH), "--outtype", "q8_0", ], check=True, ) print(f"GGUF ready at «{GGUF_PATH}»") print("Pipeline finished successfully!")
Full example code:
# ------------------------ Model in folder = google/gemma-3-4b-it ---------------------------
import torch

print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())

import os

os.environ["UNSLOTH_PATCH_RL_TRAINERS"] = "false"
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
os.environ["TORCHINDUCTOR_DISABLED"] = "1"

import json
import shutil
import subprocess
from pathlib import Path

from datasets import Dataset
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

MODEL_DIR = Path("./model")
LORA_OUTPUT = Path("./lora")
MERGED_DIR = Path("./merged")
GGUF_DIR = Path("./gguf")
GGUF_PATH = GGUF_DIR / "model.gguf"
GGUF_CONVERTER = Path(r"C:\ExampleProjects\llama.cpp\convert_hf_to_gguf.py")
PYTHON_BIN = Path(".venv/Scripts/python.exe").resolve()
DATA_PATH = Path("gemma_lora_data.json")
MAX_SEQ_LEN = 4096
NUM_EPOCHS = 3
BATCH_SIZE = 2
LR = 2e-4


def main():
    print("[1/7] Cleaning previous artefacts…")
    for _dir in (LORA_OUTPUT, MERGED_DIR, GGUF_DIR):
        if _dir.exists():
            shutil.rmtree(_dir)
            print(f"  - removed «{_dir}»")
    print("Finished cleaning")

    print("[2/7] Reading training examples…")
    with DATA_PATH.open(encoding="utf-8") as fp:
        records = json.load(fp)

    def build_chat(example):
        prompt = example["prompt"].strip()
        response = example["response"].strip()
        return {
            "text": (
                f"<start_of_turn>user\n{prompt}<end_of_turn>\n"
                f"<start_of_turn>model\n{response}<end_of_turn>\n"
            )
        }

    dataset = Dataset.from_list([build_chat(r) for r in records])
    print(f"Loaded {len(dataset):,} samples")

    print(f"[3/7] Loading base model from «{MODEL_DIR}» …")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=str(MODEL_DIR),
    )
    model = FastLanguageModel.get_peft_model(model)
    print("Model ready for fine-tuning")

    print("[4/7] Starting supervised fine-tuning …")
    sft_cfg = SFTConfig(
        per_device_train_batch_size=BATCH_SIZE,
        num_train_epochs=NUM_EPOCHS,
        learning_rate=LR,
        logging_steps=10,
        dataset_num_proc=1,
    )
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=sft_cfg,
    )
    trainer.train()
    print("Training complete")

    print(f"[5/7] Saving LoRA weights to «{LORA_OUTPUT}» …")
    LORA_OUTPUT.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(
        str(LORA_OUTPUT),
        tokenizer,
        save_method="lora"
    )
    print("Adapters saved")

    print("[6/7] Merging LoRA + base ⟶ fp16 …")
    MERGED_DIR.mkdir(parents=True, exist_ok=True)
    model.save_pretrained_merged(
        str(MERGED_DIR),
        tokenizer,
        safe_serialization=True,
        save_method="merged_16bit"
    )
    print(f"Merged model written to «{MERGED_DIR}»")

    print("[7/7] Converting to GGUF (q8_0) …")
    GGUF_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            str(PYTHON_BIN),
            str(GGUF_CONVERTER),
            os.path.abspath(MERGED_DIR),
            "--outfile", str(GGUF_PATH),
            "--outtype", "q8_0",
        ],
        check=True,
    )
    print(f"GGUF ready at «{GGUF_PATH}»")
    print("Pipeline finished successfully!")


if __name__ == "__main__":
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Text output of the example, training stage:
[4/7] Starting supervised fine-tuning …
Unsloth: Tokenizing ["text"]: 100%|██████████| 91/91 [00:00<00:00, 10075.01 examples/s]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 91 | Num Epochs = 3 | Total steps = 69
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 32,788,480 of 4,332,867,952 (0.76% trained)
  0%|          | 0/69 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
 14%|█▍        | 10/69 [00:12<01:11, 1.21s/it]{'loss': 17.6186, 'grad_norm': 16.65082550048828, 'learning_rate': 0.00019354838709677422, 'epoch': 0.43}
 29%|██▉       | 20/69 [00:24<00:57, 1.18s/it]{'loss': 3.6136, 'grad_norm': 6.085631847381592, 'learning_rate': 0.00016129032258064516, 'epoch': 0.87}
 43%|████▎     | 30/69 [00:36<00:47, 1.21s/it]{'loss': 1.4697, 'grad_norm': 3.163569211959839, 'learning_rate': 0.00012903225806451613, 'epoch': 1.3}
 58%|█████▊    | 40/69 [00:48<00:34, 1.19s/it]{'loss': 1.2614, 'grad_norm': 4.380921840667725, 'learning_rate': 9.677419354838711e-05, 'epoch': 1.74}
 72%|███████▏  | 50/69 [01:00<00:23, 1.24s/it]{'loss': 1.0062, 'grad_norm': 3.690117835998535, 'learning_rate': 6.451612903225807e-05, 'epoch': 2.17}
 87%|████████▋ | 60/69 [01:13<00:11, 1.31s/it]{'loss': 0.7645, 'grad_norm': 4.235830783843994, 'learning_rate': 3.2258064516129034e-05, 'epoch': 2.61}
100%|██████████| 69/69 [01:24<00:00, 1.23s/it]
{'train_runtime': 84.8047, 'train_samples_per_second': 3.219, 'train_steps_per_second': 0.814, 'train_loss': 3.8283379941746807, 'epoch': 3.0}
Training complete
Peft:
imports:
import os
import json
import torch
import shutil
import subprocess
import safetensors.torch
from pathlib import Path

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training
)
Constants; the path to the GGUF_CONVERTER script may differ on your machine:
MODEL_DIR = Path("./model")
LORA_OUTPUT = Path("./lora")
MERGED_DIR = Path("./merged")
MERGED_CLEAN_DIR = Path("./merged_clean")
GGUF_DIR = Path("./gguf")
GGUF_PATH = GGUF_DIR / "model.gguf"
GGUF_CONVERTER = Path(r"C:\GitHub\llama.cpp\convert_hf_to_gguf.py")
PYTHON_BIN = Path(".venv/Scripts/python.exe").resolve()
DATA_PATH = Path("gemma_lora_data.json")
MAX_SEQ_LEN = 4096
NUM_EPOCHS = 3
BATCH_SIZE = 2
LR = 2e-4
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
We put the rest of the code into a main() function:
def main() -> None:
    ...

if __name__ == "__main__":
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Cleaning up folders after the previous script run:
print("[1/7] Cleaning previous artefacts…") for _dir in (LORA_OUTPUT, MERGED_DIR, GGUF_DIR): if _dir.exists(): shutil.rmtree(_dir) print(f" ‑ removed «{_dir}»") print("Finished cleaning")
Reading the training data. Here we use the Gemma-specific formatting with start_of_turn and end_of_turn tokens; other models use different chat templates, an approximate list:
Gemma -> <start_of_turn>message_text<end_of_turn>
DeepSeek, Phi -> user: message_text
Qwen 3, ChatML -> <|im_start|> message_text <|im_end|>
LLaMA -> [INST] message_text [/INST]
Command -> <|START_OF_TURN_TOKEN|> message_text <|END_OF_TURN_TOKEN|>
OpenChat / Alpaca / Vicuna -> ### User: message_text
print("[2/7] Reading training examples…") with DATA_PATH.open(encoding="utf‑8") as fp: records = json.load(fp) # Format each example as a chat-style question-answer sequence def build_chat(example: dict) -> dict: prompt = example["prompt"].strip() response = example["response"].strip() return { "text": ( f"<start_of_turn>user\n{prompt}<end_of_turn>\n" f"<start_of_turn>model\n{response}<end_of_turn>\n" ) } # Creates a HuggingFace Dataset from these formatted strings dataset = Dataset.from_list([build_chat(r) for r in records]) print(f"Loaded {len(dataset):,} samples")
Loading and preparing the model:
print(f"[3/7] Loading base model from «{MODEL_DIR}» …") # Configures model loading with 4-bit quantization (nf4) bnb_config = BitsAndBytesConfig( load_in_4bit=True, load_in_8bit=False, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True, llm_int8_threshold=6.0, llm_int8_skip_modules=None, llm_int8_enable_fp32_cpu_offload=False, llm_int8_has_fp16_weight=False, bnb_4bit_quant_storage=torch.uint8 ) # Loads the tokenizer and quantized model tokenizer = AutoTokenizer.from_pretrained( MODEL_DIR, trust_remote_code=True ) if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token tokenizer.padding_side = "right" model = AutoModelForCausalLM.from_pretrained( MODEL_DIR, device_map="auto", quantization_config=bnb_config, torch_dtype=torch.bfloat16, ) # Prepares the model for training with LoRA, unfreezes the required layers model = prepare_model_for_kbit_training(model) # Sets LoRA adaptation for given layers (q_proj, k_proj, etc.) lora_config = LoraConfig( r=LORA_R, lora_alpha=LORA_ALPHA, lora_dropout=LORA_DROPOUT, bias="none", task_type="CAUSAL_LM", target_modules=[ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", ], ) model = get_peft_model(model, lora_config) print("Model ready for fine‑tuning")
Training:
print("[4/7] Starting supervised fine‑tuning …") # Tokenizes text examples def tokenize(example): tokens = tokenizer( example["text"], truncation=True, max_length=MAX_SEQ_LEN, ) return tokens ds_tok = dataset.map(tokenize, batched=True, remove_columns=["text"]) # Creates a data collator, sets training parameters and runs fine-tuning collator = DataCollatorForLanguageModeling( tokenizer=tokenizer, mlm=False, ) args = TrainingArguments( per_device_train_batch_size=BATCH_SIZE, gradient_accumulation_steps=4, num_train_epochs=NUM_EPOCHS, learning_rate=LR, lr_scheduler_type="cosine", warmup_ratio=0.03, logging_steps=10, save_strategy="epoch", bf16=True, optim="paged_adamw_8bit", report_to="none", ) trainer = Trainer( model=model, args=args, train_dataset=ds_tok, data_collator=collator, ) trainer.train() print("Training complete")
Saving the LoRA weights to the lora folder:
print(f"[5/7] Saving LoRA weights to «{LORA_OUTPUT}» …") LORA_OUTPUT.mkdir(parents=True, exist_ok=True) model.save_pretrained( str(LORA_OUTPUT), save_method="lora" ) print("Adapters saved")
Combining LoRA and the base model:
print("[6/7] Merging LoRA + base ⟶ fp16 …") model = model.merge_and_unload() MERGED_DIR.mkdir(parents=True, exist_ok=True) model.save_pretrained( MERGED_DIR, safe_serialization=True, save_method="merged_16bit" ) tokenizer.save_pretrained(MERGED_DIR) print(f"Merged model written to «{MERGED_DIR}»")
Cleaning the merged model of the quantization metadata left over from training:
print("[6.5/7] Creating cleaned model in «merged_clean» …") # Copies all files except weights to a new directory merged_clean. MERGED_CLEAN_DIR.mkdir(parents=True, exist_ok=True) for file in MERGED_DIR.iterdir(): if file.name != "model.safetensors": shutil.copy(file, MERGED_CLEAN_DIR / file.name) model_path = MERGED_DIR / "model.safetensors" clean_path = MERGED_CLEAN_DIR / "model.safetensors" # Saves a cleaned version of the scales without technical artifacts, only the model weights. state_dict = safetensors.torch.load_file(str(model_path)) import re pattern = re.compile( r".*\.(absmax|zeros|scales|quant_map|quant_state(\..+)?|nested_absmax|nested_zeros|nested_scales|nested_quant_map)$" ) keys_to_remove = [k for k in state_dict if pattern.match(k)] for key in keys_to_remove: del state_dict[key] safetensors.torch.save_file(state_dict, str(clean_path), metadata={"format": "pt"}) print(f"Saved cleaned model to «{MERGED_CLEAN_DIR}», removed {len(keys_to_remove)} keys")
Converting to GGUF: this runs the convert_hf_to_gguf.py script from llama.cpp to convert the model to GGUF format with q8_0 quantization, for use in LM Studio, llama.cpp and similar frameworks.
print("[7/7] Converting to GGUF (q8_0) …") GGUF_DIR.mkdir(parents=True, exist_ok=True) subprocess.run( [ str(PYTHON_BIN), str(GGUF_CONVERTER), os.path.abspath(MERGED_CLEAN_DIR), "--outfile", str(GGUF_PATH), "--outtype", "q8_0", ], check=True, ) print(f"GGUF ready at «{GGUF_PATH}»") print("Pipeline finished successfully!")
Full example code:
import os
import json
import torch
import shutil
import subprocess
import safetensors.torch
from pathlib import Path

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training
)

MODEL_DIR = Path("./model")
LORA_OUTPUT = Path("./lora")
MERGED_DIR = Path("./merged")
MERGED_CLEAN_DIR = Path("./merged_clean")
GGUF_DIR = Path("./gguf")
GGUF_PATH = GGUF_DIR / "model.gguf"
GGUF_CONVERTER = Path(r"C:\ExampleProjects\llama.cpp\convert_hf_to_gguf.py")
PYTHON_BIN = Path(".venv/Scripts/python.exe").resolve()
DATA_PATH = Path("gemma_lora_data.json")
MAX_SEQ_LEN = 4096
NUM_EPOCHS = 3
BATCH_SIZE = 2
LR = 2e-4
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05


def main() -> None:
    print("[1/7] Cleaning previous artefacts…")
    for _dir in (LORA_OUTPUT, MERGED_DIR, GGUF_DIR):
        if _dir.exists():
            shutil.rmtree(_dir)
            print(f"  - removed «{_dir}»")
    print("Finished cleaning")

    print("[2/7] Reading training examples…")
    with DATA_PATH.open(encoding="utf-8") as fp:
        records = json.load(fp)

    def build_chat(example: dict) -> dict:
        prompt = example["prompt"].strip()
        response = example["response"].strip()
        return {
            "text": (
                f"<start_of_turn>user\n{prompt}<end_of_turn>\n"
                f"<start_of_turn>model\n{response}<end_of_turn>\n"
            )
        }

    dataset = Dataset.from_list([build_chat(r) for r in records])
    print(f"Loaded {len(dataset):,} samples")

    # print(f"[3/7] Loading base model from «{MODEL_DIR}» …")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        load_in_8bit=False,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        llm_int8_threshold=6.0,
        llm_int8_skip_modules=None,
        llm_int8_enable_fp32_cpu_offload=False,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_quant_storage=torch.uint8
    )
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_DIR,
        trust_remote_code=True
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_DIR,
        device_map="auto",
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
    )
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=LORA_R,
        lora_alpha=LORA_ALPHA,
        lora_dropout=LORA_DROPOUT,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
    )
    model = get_peft_model(model, lora_config)
    print("Model ready for fine-tuning")

    print("[4/7] Starting supervised fine-tuning …")

    def tokenize(example):
        tokens = tokenizer(
            example["text"],
            truncation=True,
            max_length=MAX_SEQ_LEN,
        )
        return tokens

    ds_tok = dataset.map(tokenize, batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )
    args = TrainingArguments(
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=4,
        num_train_epochs=NUM_EPOCHS,
        learning_rate=LR,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        logging_steps=10,
        save_strategy="epoch",
        bf16=True,
        optim="paged_adamw_8bit",
        report_to="none",
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=ds_tok,
        data_collator=collator,
    )
    trainer.train()
    print("Training complete")

    print(f"[5/7] Saving LoRA weights to «{LORA_OUTPUT}» …")
    LORA_OUTPUT.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(
        str(LORA_OUTPUT),
        save_method="lora"
    )
    print("Adapters saved")

    print("[6/7] Merging LoRA + base ⟶ fp16 …")
    model = model.merge_and_unload()
    MERGED_DIR.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(
        MERGED_DIR,
        safe_serialization=True,
        save_method="merged_16bit"
    )
    tokenizer.save_pretrained(MERGED_DIR)
    print(f"Merged model written to «{MERGED_DIR}»")

    print("[6.5/7] Creating cleaned model in «merged_clean» …")
    MERGED_CLEAN_DIR.mkdir(parents=True, exist_ok=True)
    for file in MERGED_DIR.iterdir():
        if file.name != "model.safetensors":
            shutil.copy(file, MERGED_CLEAN_DIR / file.name)
    model_path = MERGED_DIR / "model.safetensors"
    clean_path = MERGED_CLEAN_DIR / "model.safetensors"
    state_dict = safetensors.torch.load_file(str(model_path))
    import re
    pattern = re.compile(
        r".*\.(absmax|zeros|scales|quant_map|quant_state(\..+)?|nested_absmax|nested_zeros|nested_scales|nested_quant_map)$"
    )
    keys_to_remove = [k for k in state_dict if pattern.match(k)]
    for key in keys_to_remove:
        del state_dict[key]
    safetensors.torch.save_file(state_dict, str(clean_path), metadata={"format": "pt"})
    print(f"Saved cleaned model to «{MERGED_CLEAN_DIR}», removed {len(keys_to_remove)} keys")

    print("[7/7] Converting to GGUF (q8_0) …")
    GGUF_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            str(PYTHON_BIN),
            str(GGUF_CONVERTER),
            os.path.abspath(MERGED_CLEAN_DIR),
            "--outfile", str(GGUF_PATH),
            "--outtype", "q8_0",
        ],
        check=True,
    )
    print(f"GGUF ready at «{GGUF_PATH}»")
    print("Pipeline finished successfully!")


if __name__ == "__main__":
    import multiprocessing
    multiprocessing.freeze_support()
    main()
Text output of the example, training stage:
[4/7] Starting supervised fine-tuning …
Map: 100%|██████████| 91/91 [00:00<00:00, 11339.66 examples/s]
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
  0%|          | 0/36 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
C:\Users\Roman\PycharmProjects\LoRA_Example\.venv\Lib\site-packages\torch\_dynamo\eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
 28%|██▊       | 10/36 [00:27<01:07, 2.61s/it]{'loss': 26.8174, 'grad_norm': 22.677703857421875, 'learning_rate': 0.00017980172272802396, 'epoch': 0.87}
 33%|███▎      | 12/36 [00:31<00:53, 2.24s/it]C:\Users\Roman\PycharmProjects\LoRA_Example\.venv\Lib\site-packages\torch\_dynamo\eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
 56%|█████▌    | 20/36 [00:53<00:43, 2.72s/it]{'loss': 4.588, 'grad_norm': 10.421425819396973, 'learning_rate': 0.0001, 'epoch': 1.7}
 67%|██████▋   | 24/36 [01:02<00:27, 2.26s/it]C:\Users\Roman\PycharmProjects\LoRA_Example\.venv\Lib\site-packages\torch\_dynamo\eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
 83%|████████▎ | 30/36 [01:18<00:15, 2.58s/it]{'loss': 2.4059, 'grad_norm': 9.479026794433594, 'learning_rate': 2.0198277271976052e-05, 'epoch': 2.52}
100%|██████████| 36/36 [01:33<00:00, 2.60s/it]
{'train_runtime': 93.46, 'train_samples_per_second': 2.921, 'train_steps_per_second': 0.385, 'train_loss': 9.714792675442165, 'epoch': 3.0}
Training complete
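After either pipeline finishes, it is worth checking that the fact was actually learned. A sketch that loads the merged model from the merged folder with transformers and asks the test question; this step is not part of the original scripts, and the prompt follows the same Gemma template used during training:

# Quick check that the fine-tuned fact was learned (run separately after training).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./merged")
model = AutoModelForCausalLM.from_pretrained(
    "./merged", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "<start_of_turn>user\nWhat is my cat's name?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
# Decode only the newly generated tokens; the answer should contain "Tiger".
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))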
Read more:
https://huggingface.co/learn/llm-course/chapter11/4
https://toashishagarwal.medium.com/how-to-fine-tune-a-llm-using-lora-5fdb6dea11a6
https://medium.com/@rachittayal7/my-experiences-with-finetuning-llms-using-lora-b9c90f1839c6