How to easily run Falcon (7B & 40B) with Brev

The latest and greatest open-source Large Language Model is Falcon-40B. Hugging Face ranks it number 1, and it's free for commercial use!

In this guide, we show you how to run it on Brev GPUs. Everything is pre-tested, so if you run into CUDA errors or any other issues, just let us know and we'll come running to fix them for you.

Launching a GPU machine

You can of course get your GPU machine anywhere, but we humbly recommend Brev - a pretty great way to access cloud GPUs.

1. Get your Brev instance

The following links will take you to creating a Brev instance with everything pre-configured. Just hit create!

7B model only - T4 16GB.

Either model - A100 40GB.

If you're OK spending a little more, you can size up your instance for either model to get faster runtimes!
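As a rough sanity check on the sizing above, the memory needed just to hold a model's weights is roughly the parameter count times the bytes per parameter. The sketch below assumes nominal parameter counts taken from the model names (7 billion and 40 billion) and 16-bit weights; the helper names are ours for illustration and are not part of any library. It ignores activations and the attention cache, so treat the numbers as a lower bound.

```python
# Rough, weights-only memory estimate.
# Assumption: parameter counts are the nominal values implied by the model
# names, not exact figures.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weights_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params: float, gpu_gb: float, dtype: str = "bfloat16") -> bool:
    """True if the weights alone fit in the given amount of GPU memory."""
    return weights_gb(num_params, dtype) <= gpu_gb


for name, params in [("falcon-7b", 7e9), ("falcon-40b", 40e9)]:
    print(f"{name}: ~{weights_gb(params):.0f} GB of weights in bfloat16")
```

When the estimate exceeds a single GPU's memory, as it does here for the 40B model at 16 bits on a 40 GB card, the whole model cannot sit on that GPU at that precision, which is one reason a larger instance can be worth it.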

2. Open Jupyter Lab

If you're running on Brev, go to your new instance's settings page and hit Open Notebook:


(It may be disabled for a few minutes while the instance is being started.)

3. Set up a notebook

You can either download the notebook from our GitHub repo or create a new notebook and follow along with the code below.


Running Falcon

If you haven't uploaded the notebook, create a new notebook and run the following code in it:

1. Install dependencies

!pip install torch==2.0.1
!pip install transformers
!pip install einops
!pip install accelerate
!pip install IProgress
!pip install ipywidgets
!pip install xformers

Falcon requires PyTorch 2.0 or later.
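If you want to confirm the requirement is met before loading anything, a quick check like the following can help. This is a minimal sketch: the helper names major_version and meets_requirement are ours, and the torch lookup is guarded so the cell still runs if PyTorch is missing.

```python
import importlib.util


def major_version(version: str) -> int:
    """Extract the major version from a string such as '2.0.1+cu118'."""
    return int(version.split(".")[0])


def meets_requirement(version: str, minimum_major: int = 2) -> bool:
    """True if the version's major number is at least `minimum_major`."""
    return major_version(version) >= minimum_major


# Only import torch if it is installed, so this cell never raises ImportError.
if importlib.util.find_spec("torch") is not None:
    import torch

    print("PyTorch version:", torch.__version__)
    print("PyTorch 2+:", meets_requirement(torch.__version__))
    print("CUDA available:", torch.cuda.is_available())
else:
    print("PyTorch is not installed in this environment.")
```

If the CUDA line prints False on a GPU instance, that is worth sorting out before moving on to loading the model.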

2. Import dependencies

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

3. Load the model

# If you want to load the instruct models, just append "-instruct" to the model name.
model7b = "tiiuae/falcon-7b"
model40b = "tiiuae/falcon-40b"

# This is where you can change to 40b!
model = model7b

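tokenizer = AutoTokenizer.from_pretrained(model)
# The arguments below follow the usage example on the Falcon model card.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# You can ignore the warning, "The model 'RWForCausalLM' is not supported for text-generation."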
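The naming rule for the base and instruct variants can be captured in a small helper. This is only a sketch: falcon_model_id is a hypothetical function of ours, and it does nothing more than build the Hugging Face model name string for the sizes covered in this guide.

```python
def falcon_model_id(size: str = "7b", instruct: bool = False) -> str:
    """Build a Falcon model name, e.g. 'tiiuae/falcon-7b-instruct'."""
    if size not in ("7b", "40b"):
        raise ValueError(f"Unsupported size {size!r}; expected '7b' or '40b'.")
    suffix = "-instruct" if instruct else ""
    return f"tiiuae/falcon-{size}{suffix}"


# Pick the model in one place, then pass it to the tokenizer and pipeline.
model = falcon_model_id("7b", instruct=True)
print(model)
```

Keeping the choice in one function call makes it harder to end up with a tokenizer from one variant and weights from another.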

4. Generate text

# Your prompt goes here!
prompt = "the top_k value in an autoregressive large language model means"

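sequences = pipeline(
    prompt,
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)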
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

# You can ignore the Warning, "You have modified..."

It'll take about 5 minutes to generate this on an A100. If you want to speed it up, try reducing the number of tokens the model generates with the max_length parameter.
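One way to experiment with shorter outputs is to keep the generation settings in a dictionary and change only max_length. The sketch below is an assumption-laden illustration: generation_settings is a hypothetical helper of ours, and the default values are placeholders you should tune. Note that in transformers, max_length counts the prompt tokens plus the generated tokens, while max_new_tokens counts only the generated ones.

```python
def generation_settings(max_length: int = 200, top_k: int = 10) -> dict:
    """Build keyword arguments for a text-generation pipeline call."""
    if max_length < 1:
        raise ValueError("max_length must be a positive integer.")
    return {
        "max_length": max_length,  # prompt tokens + generated tokens
        "do_sample": True,
        "top_k": top_k,
        "num_return_sequences": 1,
    }


# A shorter budget means fewer tokens to generate.
fast = generation_settings(max_length=50)
print(fast)

# Usage (assumes the `pipeline` and `prompt` objects from the steps above):
# sequences = pipeline(prompt, **fast)
```

Generation is token by token, so cutting the token budget is usually the most direct lever on runtime.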

And that's it! Falcon should now be completing your prompts. If you have any questions, reach out to us in the Discord!