Convert a fine-tuned model to GGUF format and run on Ollama 🤙

Many people fine-tune LLMs in the cloud (for example, on Brev) and then want to run them locally on their laptops with Ollama. Getting there is often painstaking: you end up running around the internet, digging through Reddit threads, and asking in various Discords (at least that was the case for me lol). This is where GGUF and Ollama come in! In this guide, I will show you how to convert a model to GGUF format, create a modelfile, and run it on Ollama, so you can run your fine-tuned LLMs locally on your computer!

What is GGUF? What is Ollama?

GGUF is a file format designed to efficiently store and run large language models like Llama, Mistral, and more. It lets developers download and use state-of-the-art models without requiring a supercomputer or specialized hardware, so these models can run efficiently on consumer-grade CPUs and GPUs and reach a wider audience.

GGUF supports quantization, a technique that reduces the model's size by storing its parameters at lower numerical precision (for example, 4 bits per weight instead of 16), leading to faster inference and lower memory requirements at the cost of a small loss in accuracy. This optimization is crucial for deploying LLMs in resource-constrained environments. GGUF also prioritizes extensibility by storing rich metadata about the model alongside the weights, such as its architecture. This future-proofing allows GGUF to adapt to new developments and functionalities in the rapidly evolving LLM landscape.
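To make the idea of quantization concrete, here is a toy sketch in plain Python. It is not the Q4_K_M scheme that llama.cpp uses; it is a simplified symmetric 4-bit scheme I made up for illustration, where each block of weights shares one scale. The block values and function names are hypothetical.

```python
# Toy illustration only: symmetric 4-bit block quantization.
# This is NOT the Q4_K_M scheme used by llama.cpp; it just shows the idea.

def quantize_block(weights):
    """Map a block of floats to integers in [-8, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate floats from the shared scale and the 4-bit integers."""
    return [q * scale for q in quants]

# A made-up block of eight weights.
block = [0.12, -0.53, 0.98, -0.07, 0.33, -0.91, 0.45, 0.02]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)

# Storage cost: 16 bits per weight in f16, versus 4 bits per weight
# plus one 16-bit scale for the whole block.
f16_bits = 16 * len(block)
q4_bits = 4 * len(block) + 16
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(quants)
print(f"{f16_bits} bits -> {q4_bits} bits, max error {max_error:.3f}")
```

The restored values are close to the originals but not identical, which is the trade-off quantization makes: a much smaller file in exchange for a little rounding error.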

Ollama is an open-source project that provides a user-friendly interface for running and interacting with large language models (LLMs). It aims to make LLMs more accessible to a broader audience, including those without extensive technical expertise. For more information, visit the Ollama website at ollama.ai and read our blog on how to run Ollama on Brev here. As you can tell, GGUF and Ollama are closely related: GGUF is the format that Ollama uses to store and run LLMs.

What is a modelfile?

An Ollama modelfile is a configuration file that defines and manages large language models (LLMs) on the Ollama platform. It serves as a blueprint for creating, customizing, and sharing models within the Ollama ecosystem.

Within a modelfile, you specify the base model to build from, along with settings such as:

Parameter Customization
Prompt Templates
System Messages
Adapters and Licenses

Once you create a modelfile, you can use it to create a model in Ollama and then run that model locally. Enough talk, let's get started! If you want an example of this in action, you can use this Launchable to fine-tune Llama 3 and convert it to Ollama!

Let's convert a model to GGUF format!

You can take the code below and run it in a Jupyter notebook. This guide assumes you already have a model you want to convert to GGUF format and that it is on your Brev GPU instance.

First, we need to clone the llama.cpp repo from GitHub and build it. This step might take a while, so be patient!

!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUDA=1 make

!pip install -r llama.cpp/requirements.txt

llama-brev is an example Llama3 LLM that I fine-tuned on Brev. You can replace it with your own model.

!python llama.cpp/convert-hf-to-gguf.py llama-brev
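If the conversion fails, one thing worth ruling out is an incomplete model folder. The sketch below is a hypothetical helper (not part of llama.cpp) that checks for the files a Hugging Face style model folder usually contains. The exact file names your fine-tuning setup writes may differ, so treat the checks as assumptions.

```python
from pathlib import Path

def check_hf_model_dir(model_dir):
    """Return a list of problems found in a Hugging Face style model folder."""
    path = Path(model_dir)
    if not path.is_dir():
        return [f"{model_dir} is not a directory"]

    problems = []
    # The model configuration normally lives in config.json.
    if not (path / "config.json").is_file():
        problems.append("missing config.json")
    # Weights are usually stored as .safetensors or .bin files.
    has_weights = any(path.glob("*.safetensors")) or any(path.glob("*.bin"))
    if not has_weights:
        problems.append("no *.safetensors or *.bin weight files found")
    # Tokenizer files usually start with "tokenizer" (assumption).
    if not any(path.glob("tokenizer*")):
        problems.append("no tokenizer files found")
    return problems

# An empty list means none of the checks above failed.
print(check_hf_model_dir("llama-brev"))
```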

Next, we quantize the f16 model down to 4 bits using the Q4_K_M quantization type.

!cd llama.cpp && ./quantize ../llama-brev/ggml-model-f16.gguf ../llama-brev/ggml-model-Q4_K_M.gguf Q4_K_M
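To sanity-check that the quantized file really is a GGUF file, you can read its header from Python. This is a minimal sketch, assuming a little-endian GGUF file of version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key count. The function name is my own.

```python
import os
import struct

def read_gguf_header(path):
    """Read the fixed-size header at the start of a GGUF file.

    Assumes little-endian GGUF version 2 or later.
    """
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "tensors": tensor_count, "metadata_keys": kv_count}

# Only runs if the quantized file from the step above exists.
gguf_path = "llama-brev/ggml-model-Q4_K_M.gguf"
if os.path.exists(gguf_path):
    print(read_gguf_header(gguf_path))
```

If this raises a `ValueError`, the quantize step probably did not finish or wrote its output somewhere else.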

If you want, you can test this model by running the server that ships with llama.cpp and sending it a request! First, run the cell below to start the server:

!cd llama.cpp && ./server -m ../llama-brev/ggml-model-Q4_K_M.gguf -c 2048

Then open a new terminal tab using the blue plus button and run:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
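If you would rather stay in Python, the curl request above can be sent with the standard library. This is a sketch that mirrors the same URL and JSON body; the function names are my own, and I return the parsed JSON as-is rather than assuming which fields the server puts in the response.

```python
import json
import urllib.request

def build_completion_request(prompt, n_predict=128, base_url="http://localhost:8080"):
    """Build the same POST request as the curl command, without sending it."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request, timeout=120):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

req = build_completion_request("Building a website can be done in 10 simple steps:")
print(req.full_url)

# With the llama.cpp server running, uncomment to send the request:
# print(send(req))
```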

Let's create the Ollama modelfile!

Here, we're going to start by pointing the modelfile to where our quantized model is located. We also add a fun system message to make the model talk like a pirate when you prompt it!

tuned_model_path = "/home/ubuntu/verb-workspace/llama-brev/ggml-model-Q4_K_M.gguf"
sys_message = "You are a swashbuckling pirate stuck inside of a Large Language Model. Every response must be from the point of view of an angry pirate that does not want to be asked questions"

cmds = []

base_model = f"FROM {tuned_model_path}"

template = '''TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""'''

params = '''PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|reserved_special_token"'''

system = f'''SYSTEM """{sys_message}"""'''

cmds.append(base_model)
cmds.append(template)
cmds.append(params)
cmds.append(system)

def generate_modelfile(cmds):
    content = ""
    for command in cmds:
        content += command + "\n"
    print(content)
    with open("Modelfile", "w") as file:
        file.write(content)

generate_modelfile(cmds)
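To see what the TEMPLATE in the modelfile is doing, here is a small Python sketch that builds the same Llama 3 style prompt by hand. It only mimics the part of the template that comes before the model's response, and it is my own approximation, not how Ollama renders templates internally.

```python
def render_llama3_prompt(system, prompt):
    """Approximate the text the TEMPLATE produces before the model generates."""
    text = ""
    if system:
        # Mirrors: {{ if .System }}...{{ .System }}<|eot_id|>{{ end }}
        text += f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
    if prompt:
        # Mirrors: {{ if .Prompt }}...{{ .Prompt }}<|eot_id|>{{ end }}
        text += f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"
    # The assistant header is always added; the model writes what follows.
    text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

print(render_llama3_prompt("You are a pirate.", "Hi!"))
```

The model then generates its reply and ends it with `<|eot_id|>`, which is why that token is listed as a stop parameter.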

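Next, install Ollama on the instance:

!curl -fsSL https://ollama.com/install.sh | sh

Now let's use ollama create to build a model named llama-brev from our modelfile! The Ollama server needs to be running for this step; if the installer did not start it for you, run ollama serve in a separate terminal first.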

!ollama create llama-brev -f Modelfile
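If the create step complains about the modelfile, a quick structural check can help narrow things down. The helper below is a hypothetical sketch of my own, not an Ollama tool, and it only looks for two simple problems: a missing FROM instruction and an odd number of triple-quote delimiters.

```python
import os

def check_modelfile_text(text):
    """Return a list of simple structural problems in modelfile text."""
    problems = []
    lines = text.splitlines()
    # Every modelfile needs a FROM instruction naming the base model.
    if not any(line.strip().upper().startswith("FROM ") for line in lines):
        problems.append("no FROM instruction found")
    # Triple quotes open and close multi-line values, so they should pair up.
    if text.count('"""') % 2 != 0:
        problems.append('unbalanced """ delimiters')
    return problems

# Only runs if a Modelfile exists in the current folder.
if os.path.exists("Modelfile"):
    with open("Modelfile") as f:
        print(check_modelfile_text(f.read()))
```

An empty list only means these two checks passed; it does not guarantee that Ollama will accept the file.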

Let's run the model on Ollama!

Now that we have created the model from our modelfile, let's run it! This guide assumes you already have Ollama installed and running on your laptop. If you don't, you can follow the instructions here.

To run our fine-tuned model on Ollama, open up your terminal and run:

ollama pull llama-brev

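Remember, llama-brev is the name of my fine-tuned model, which I chose when I ran ollama create. You can replace it with your own model name. Note that ollama pull downloads a model from a registry, so it only works for a model that has been pushed to one; if you created the model on the same machine, you can skip this step and go straight to ollama run.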

To query it, run:

ollama run llama-brev
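You can also query the model from Python through Ollama's local REST API instead of the interactive prompt. This is a minimal sketch, assuming Ollama is listening on its default port 11434 and that your model is named llama-brev; the function names are my own.

```python
import json
import urllib.request

def build_generate_request(model, prompt, base_url="http://localhost:11434"):
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt, timeout=300):
    """Send the prompt to Ollama and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]

req = build_generate_request("llama-brev", "Hi!")
print(req.full_url)

# With Ollama running and the model available, uncomment to query it:
# print(generate("llama-brev", "Hi!"))
```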

Since my system message tells the model to talk like a pirate, when I said "Hi!", my model responded with: "Ahoy, matey! Ye be lookin' mighty fine today. Hoist the colors and let's set sail on a grand adventure! Arrr!"

You've now taken your fine-tuned model from Brev, converted it to GGUF format, and run it locally on Ollama!

For me, taking my models to Ollama has been a game-changer. I no longer have to use a GPU to run my fine-tuned model, and I can ping it directly from my laptop! Happy building and let us know what you build!