Ollama on Brev

Convert a model to GGUF and deploy on Ollama!

Convert a model to GGUF format!

You can take the code below and run it in a Jupyter notebook.

This guide assumes you already have a model you want to convert to GGUF format and have it on your Brev GPU instance.

Make sure to fine-tune a model on Brev (or have a model handy that you want to convert to GGUF format) before you start!
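If your fine-tune produced a LoRA/PEFT adapter rather than full model weights, the llama.cpp converter below expects a regular Hugging Face model directory, so merge the adapter first. Here's a minimal sketch (assuming a Hugging Face PEFT fine-tune; llama-brev-adapter is a placeholder path):

# Sketch: merge a LoRA adapter into its base model and save the result in
# Hugging Face format so llama.cpp can convert it. Paths are placeholders.
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained("llama-brev-adapter")
merged = model.merge_and_unload()  # fold the adapter weights into the base model
merged.save_pretrained("llama-brev")
AutoTokenizer.from_pretrained("llama-brev-adapter").save_pretrained("llama-brev")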

First, we need to clone and build the llama.cpp repo from GitHub. The build step might take a while, so be patient!

!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUDA=1 make
!pip install -r llama.cpp/requirements.txt

In the following code block, llama-brev is an example Llama 3 LLM that I fine-tuned on Brev. You can replace it with your own model.

!python llama.cpp/convert-hf-to-gguf.py llama-brev

This will quantize your model to 4-bit precision using the Q4_K_M scheme:

!cd llama.cpp && ./quantize ../llama-brev/ggml-model-f16.gguf ../llama-brev/ggml-model-Q4_K_M.gguf Q4_K_M
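As a quick sanity check, you can list both GGUF files from the notebook's working directory; the quantized file should be much smaller than the f16 one:

!ls -lh llama-brev/*.gguf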

If you want, you can test the quantized model by running llama.cpp's built-in server and sending it a request! Start the server with the cell below:

!cd llama.cpp && ./server -m ../llama-brev/ggml-model-Q4_K_M.gguf -c 2048

Then open a new terminal tab using the blue plus button and send a request:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

Let's create the Ollama modelfile!

Here, we're going to start by pointing the modelfile to where our quantized model is located. We also add a fun system message to make the model talk like a pirate when you prompt it!

tuned_model_path = "/home/ubuntu/verb-workspace/llama-brev/ggml-model-Q4_K_M.gguf"
sys_message = "You are a swashbuckling pirate stuck inside of a Large Language Model. Every response must be from the point of view of an angry pirate that does not want to be asked questions"
cmds = []
base_model = f"FROM {tuned_model_path}"

template = '''TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""'''

params = '''PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|reserved_special_token"'''

system = f'''SYSTEM """{sys_message}"""'''
cmds.append(base_model)
cmds.append(template)
cmds.append(params)
cmds.append(system)
# Assemble the commands into a Modelfile and write it to disk
def generate_modelfile(cmds):
    content = ""
    for command in cmds:
        content += command + "\n"
    print(content)
    with open("Modelfile", "w") as file:
        file.write(content)
generate_modelfile(cmds)
Next, install Ollama on the instance:

!curl -fsSL https://ollama.com/install.sh | sh

Let's start the Ollama server, create the model from our Modelfile, and push it to the Ollama registry so you can run it locally!
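The Ollama install script usually starts the server for you. If it isn't running on your instance, you can start it in the background first (a sketch, not one of the original cells):

!nohup ollama serve > ollama.log 2>&1 &

With the server up, create the model from the Modelfile: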

!ollama create llama-brev -f Modelfile
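To be able to pull the model onto your laptop, it also needs to be pushed to the Ollama registry under your own namespace. Roughly (a sketch, assuming you have an ollama.com account and have added your instance's Ollama public key to it; your-username is a placeholder):

!ollama cp llama-brev your-username/llama-brev
!ollama push your-username/llama-brev

If you push under a namespace like this, use that same namespaced name when you pull and run the model on your laptop.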

Let's run the model on Ollama!

Now that we have our Modelfile and the Ollama server running, let's use it to run our fine-tuned model on Ollama! This guide assumes you have Ollama already installed and running on your laptop. If you don't, you can follow the installation instructions at https://ollama.com.

To run our fine-tuned model on Ollama, open up your terminal and run:

ollama pull llama-brev

Remember, llama-brev is the name of my fine-tuned model and the name I gave it when I pushed it to the Ollama registry. You can replace it with your own model name (including your namespace, if you pushed under one).

To query it, run:

ollama run llama-brev
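You can also pass a prompt directly instead of opening the interactive session:

ollama run llama-brev "Ahoy! Who are you?"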

Since my system message is a pirate, when I said Hi!, my model responded with: "Ahoy, matey! Ye be lookin' mighty fine today. Hoist the colors and let's set sail on a grand adventure! Arrr!"

You've now taken your fine-tuned model from Brev, converted it to GGUF format, and run it locally on Ollama!