The No-BS Guide to Fine-Tuning an LLM

In this guide, I show you how easy it is to fine-tune LLaMA on relatively cheap hardware.

Until recently, only the likes of OpenAI and Google had the cash to train LLMs powerful enough to do anything useful. The open-source community was shut out of OpenAI's models in the name of "AI safety" and out of Google's for somewhat more bureaucratic reasons.

Thankfully, Mark Zuckerberg came to the rescue and gave us LLaMA, a small but potent LLM that has ushered in an era of small, customisable models trainable on cheaper hardware. In fact, it's so good that Google now thinks it won't be able to compete with open-source. Follow the steps below to learn how to fine-tune and run it...

1. Getting a GPU

You'll ideally need 4x 80GB A100s to train LLaMA. I'd recommend GCP, Lambda Labs or AWS, or in fact Brev, which connects you to all of the above clouds through one interface.
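
Once the instance is up, it's worth confirming that it actually exposes the GPUs you're paying for. Here's a minimal sketch that shells out to nvidia-smi and lists each GPU with its total memory. It assumes nvidia-smi is on your PATH (it normally is once the NVIDIA driver is installed); the helper names parse_gpu_list and list_gpus are mine, not part of any library.

```python
import shutil
import subprocess

# Query flags understood by nvidia-smi: one CSV row per GPU, memory in MiB.
QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"]

def parse_gpu_list(csv_text):
    """Turn nvidia-smi CSV output into a list of (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the last comma only, in case the GPU name contains one.
        name, mem_mib = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append((name, int(mem_mib)))
    return gpus

def list_gpus():
    """Return the GPUs visible on this machine, or [] if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_list(result.stdout)
```

Calling list_gpus() on the kind of machine recommended above should give you four entries, each with roughly 80 GB of memory. If you get fewer, set nproc_per_node accordingly when you reach the training step.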

1.1. Installing dependencies

1. Install Python 3.10

sudo apt install -y software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt update
sudo apt install -y python3.10

2. Clone the repo:

git clone https://github.com/tatsu-lab/stanford_alpaca.git

3. Get the Hugging Face converter script:

wget https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/models/llama/convert_llama_weights_to_hf.py

4. Create a Conda environment:

conda create --name py310 python=3.10 -y
conda activate py310
pip install -r stanford_alpaca/requirements.txt
pip install accelerate
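
Before moving on, you can quickly check that the environment is in the state the later steps expect. The sketch below only tests whether packages are importable; the list of names (torch, transformers, accelerate) is my assumption about what the training script needs, so adjust it to match your requirements.txt. The function names are hypothetical helpers, not part of any library.

```python
import importlib.util
import sys

def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

def environment_report(required=("torch", "transformers", "accelerate")):
    """Summarize whether Python is 3.10+ and which required packages are missing."""
    return {
        "python_ok": sys.version_info >= (3, 10),
        "missing": missing_packages(required),
    }
```

Run environment_report() from inside the activated py310 environment. An empty "missing" list and "python_ok" set to True mean you're good to go.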

2. Getting the LLaMA weights

This is the tricky part...and unfortunately I can't help you much. You'll need to request access to the LLaMA weights from Meta or get the weights through some other means 😉.

3. Converting to Hugging Face format

The first thing we need to do is convert the LLaMA weights to Hugging Face format. Make sure the Conda environment is active:

conda activate py310

Then:

python convert_llama_weights_to_hf.py \
 --input_dir llama --model_size 7B --output_dir llama_hf

If you hit an error mentioning a _pb2.py file, check your protobuf version and try running:

pip install protobuf==3.20.3

More info here.
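
If the conversion finishes without errors, a quick sanity check on the output directory can save you a failed training launch later. This sketch assumes the converter wrote a config.json at the top level of llama_hf, with weight files ending in .bin or .safetensors next to it. The exact layout has changed between transformers versions, so treat those expectations as assumptions. summarize_hf_dir is a hypothetical helper name.

```python
import json
from pathlib import Path

def summarize_hf_dir(model_dir):
    """Summarize a converted model directory; return None if config.json is missing."""
    root = Path(model_dir)
    config_path = root / "config.json"
    if not config_path.is_file():
        return None
    config = json.loads(config_path.read_text())
    # Assumption: weights are stored as .bin or .safetensors files in the same folder.
    weight_files = sorted(
        p.name for p in root.iterdir() if p.suffix in {".bin", ".safetensors"}
    )
    return {
        "model_type": config.get("model_type"),
        "num_hidden_layers": config.get("num_hidden_layers"),
        "weight_files": weight_files,
    }
```

If summarize_hf_dir("llama_hf") returns None or an empty weight_files list, look inside the directory by hand before pointing the training command at it. The files may have landed in a subfolder.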

4. Fine-tuning LLaMA

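Run the following command from inside the cloned stanford_alpaca directory, where train.py and alpaca_data.json live (if llama_hf sits somewhere else, pass its full path to --model_name_or_path):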

torchrun --nproc_per_node=4 --master_port=1024 train.py \
--model_name_or_path llama_hf \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir first_train_output \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap offload" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True

Adjust nproc_per_node to the number of GPUs you have.
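
Keep in mind that changing the GPU count also changes how many examples go into each optimizer step. With the Hugging Face Trainer, the effective batch size is the number of processes times the per-device batch size times the gradient accumulation steps. Here's a small sketch of that arithmetic using the values from the command above (the function name is just for illustration):

```python
def effective_batch_size(num_gpus, per_device_batch, grad_accum_steps):
    """Examples consumed per optimizer step across all GPUs."""
    return num_gpus * per_device_batch * grad_accum_steps

# The command above: 4 GPUs, batch size 4 per device, 8 accumulation steps.
print(effective_batch_size(4, 4, 8))   # 128

# Same flags on 2 GPUs: the effective batch size halves.
print(effective_batch_size(2, 4, 8))   # 64

# Doubling gradient_accumulation_steps restores it.
print(effective_batch_size(2, 4, 16))  # 128
```

So if you train on fewer GPUs and want to stay close to the original setup, raise gradient_accumulation_steps by the same factor you reduced nproc_per_node.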

Dealing with CUDA memory errors

If you hit CUDA out-of-memory errors, there are a couple of things you can do:

Drop the per-device batch size (per_device_train_batch_size).
Keep offload in the fsdp parameter (the command above already includes it): --fsdp "full_shard auto_wrap offload"
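
One caveat on the first option: lowering per_device_train_batch_size on its own also shrinks the number of examples per optimizer step, which changes training behaviour at the same learning rate. A common workaround is to raise gradient_accumulation_steps by the same factor. The sketch below lists the combinations that keep the product constant, starting from the values in the training command (4 and 8). smaller_batch_options is a hypothetical helper, not something the Alpaca repo provides.

```python
def smaller_batch_options(per_device_batch, grad_accum_steps):
    """Halve the per-device batch repeatedly, doubling accumulation to compensate."""
    options = []
    while per_device_batch > 1 and per_device_batch % 2 == 0:
        per_device_batch //= 2
        grad_accum_steps *= 2
        options.append((per_device_batch, grad_accum_steps))
    return options

for batch, accum in smaller_batch_options(4, 8):
    print(f"--per_device_train_batch_size {batch} --gradient_accumulation_steps {accum}")
```

Each printed pair uses less GPU memory per step than the one before it, at the cost of slower training, so try them in order until the out-of-memory error goes away.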

5. Running inference

Even though the Alpaca authors don't provide inference code, the community does! Download the file here. Then run the command inside your Conda environment:

python inference.py --model_name_or_path <path_to_checkpoint>

If you get an error about a mismatch between cuda:0 and cuda:1 tensors, try running the inference code on a single-GPU machine.
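
One more thing to watch for: a model fine-tuned this way responds best to prompts in the same format it saw during training. As far as I know, the Alpaca repo's train.py wraps every example in the template below (see PROMPT_DICT in your checkout to confirm the exact wording). Whether your inference script applies it for you is worth checking. build_prompt is a hypothetical helper showing how you'd format a prompt yourself.

```python
# Templates as I understand them from the Alpaca training script; verify against your copy.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_prompt(instruction, input_text=""):
    """Format an instruction (and optional input) the way the training data was formatted."""
    if input_text:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)

print(build_prompt("Name three primary colours."))
```

If the model's answers look rambling or off-topic, a prompt that doesn't match this format is one of the first things I'd rule out.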

As always, message us on Discord if you need any help...