Ollama on Brev

Convert a model to GGUF and deploy on Ollama!

Convert a model to GGUF format!

You can take the code below and run it in a Jupyter notebook.

This guide assumes you already have a model you want to convert to GGUF format and have it on your Brev GPU instance.

Make sure to fine-tune a model on Brev (or have a model handy that you want to convert to GGUF format) before you start!

We need to pull the llama.cpp repo from GitHub and build it with CUDA support. This step might take a while, so be patient!

!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && git pull && make clean && LLAMA_CUDA=1 make
!pip install -r llama.cpp/requirements.txt
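The build should leave the quantize and server binaries we'll use below at the top of the llama.cpp directory. A quick sanity check (a sketch, assuming the Makefile build above) is:

!ls llama.cpp/quantize llama.cpp/server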

In the following code block, llama-brev is an example Llama 3 model that I fine-tuned on Brev. You can replace it with your own model.

!python llama.cpp/convert-hf-to-gguf.py llama-brev
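The conversion script writes a 16-bit GGUF file, ggml-model-f16.gguf, into the llama-brev directory; that's the file the quantize step below reads. You can sanity-check that it exists:

!ls -lh llama-brev/ggml-model-f16.gguf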

Next, quantize that 16-bit GGUF down to 4 bits using the Q4_K_M method:

!cd llama.cpp && ./quantize ../llama-brev/ggml-model-f16.gguf ../llama-brev/ggml-model-Q4_K_M.gguf Q4_K_M

If you want, you can test the quantized model by running llama.cpp's built-in server and sending it a request! Run the cell below to start the server:

!cd llama.cpp && ./server -m ../llama-brev/ggml-model-Q4_K_M.gguf -c 2048

Then open a new terminal tab using the blue plus button and run:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
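The same request can also be sent from Python instead of curl. Here's a minimal sketch using the requests library (my choice of HTTP client; any will do), pointed at the server's default port 8080:

import requests

# Send a completion request to the llama.cpp server started above
response = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128},
)
print(response.json()["content"])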

Let's create the Ollama Modelfile!

Here, we start by pointing the Modelfile at the location of our quantized model. We also add a fun system message to make the model talk like a pirate when you prompt it!

tuned_model_path = "/home/ubuntu/verb-workspace/llama-brev/ggml-model-Q4_K_M.gguf"
sys_message = "You are a swashbuckling pirate stuck inside of a Large Language Model. Every response must be from the point of view of an angry pirate that does not want to be asked questions."
cmds = []
base_model = f"FROM {tuned_model_path}"

template = '''TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""'''

params = '''PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|reserved_special_token"'''

system = f'''SYSTEM """{sys_message}"""'''
cmds.append(base_model)
cmds.append(template)
cmds.append(params)
cmds.append(system)
def generate_modelfile(cmds):
    # Concatenate the Modelfile commands and write them to a file named "Modelfile"
    content = ""
    for command in cmds:
        content += command + "\n"
    print(content)
    with open("Modelfile", "w") as file:
        file.write(content)
generate_modelfile(cmds)

Next, install Ollama on your Brev instance:

!curl -fsSL https://ollama.com/install.sh | sh

Let's start the Ollama server, create our model from the Modelfile, and push it to the Ollama registry so you can run it locally!

!ollama create llama-brev -f Modelfile
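Note that ollama create needs the Ollama server running on the instance; the Linux install script normally starts it as a background service. If it isn't running, or if you want to push the model up to the Ollama registry so you can pull it from another machine, the steps in a terminal tab look roughly like this. This is a sketch: your-username is a placeholder for your own ollama.com namespace, and pushing requires an Ollama account with your instance's public key added.

# Start the server manually if it is not already running
ollama serve &

# Copy the model into your registry namespace, then push it
ollama cp llama-brev your-username/llama-brev
ollama push your-username/llama-brev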

Let's run the model on Ollama!

Now that we have our Modelfile and the Ollama server running, let's run our fine-tuned model on Ollama! This part of the guide assumes you have Ollama installed and running on your laptop. If you don't, you can follow the install instructions on the Ollama website (https://ollama.com/download).

To run our fine-tuned model on Ollama, open up your terminal and run:

ollama pull llama-brev

Remember, llama-brev is the name I gave my fine-tuned model when I created it and pushed it to the Ollama registry. You can replace it with your own model name.

To query it, run:

ollama run llama-brev

Since my system message tells the model to act like an angry pirate, when I said "Hi!", my model responded with: "Ahoy, matey! Ye be lookin' mighty fine today. Hoist the colors and let's set sail on a grand adventure! Arrr!"
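You can also query the model programmatically through Ollama's local REST API. Here's a minimal sketch using Python's requests library (my choice; any HTTP client works) against Ollama's default port, 11434, and the llama-brev model created above:

import requests

# Ask the local Ollama server for a single, non-streamed completion from our model
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama-brev", "prompt": "Hi!", "stream": False},
)
print(response.json()["response"])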

You've now taken your fine-tuned model from Brev, converted it to GGUF format, and run it locally on Ollama!