We’re Fly.io and we transmute containers into VMs, running them on our hardware around the world. We have fast-booting VMs and GPUs, so why not take advantage of them?
A big barrier to getting started with local AI development is access to hardware. And by “local”, we mean having direct access to a GPU rather than going through AI-as-a-Service. Some of us are lucky enough to have a beefy Nvidia GPU; if so, good for you. For the rest of us, there are other ways.
- Llama.cpp - LLM inference in C/C++ which can run reduced models on CPUs.
- ChatGPT - represents the category of AI-as-a-Service. We can’t run our own models.
- Replicate - represents platforms that let you run open-source AI models.
But Elixir can run and host some pretty interesting open-source models directly through Bumblebee. For the big ones, you need access to a GPU, or you have to be really, really, …really… patient.
For those of us who don’t have the hardware locally, we can run a GPU on Fly.io while editing the app on our machine.
Sounds crazy? It’s actually really cool! And when it’s up and running, it genuinely feels like you have a fast GPU plugged directly into your laptop.
What exactly are we talking about?
In a nutshell:
- Take a ready-to-deploy minimal Elixir application and deploy it on Fly.io.
- Get a GPU and volume for persistent storage to cache the multi-gig AI models on disk.
- Load up our local Elixir application and cluster it to the app on Fly.io through the VPN, using Elixir’s Nx library.
- Start developing AI features in a local app where the AI part runs on a remote GPU. Nx makes the distributed calls transparent to us when we’re clustered.
We can visualize it like this:
Let’s get started!
An Elixir app to host our model
For Bumblebee to really shine, it needs a GPU attached. A major goal here is to keep the app on the server as simple and small as possible so we don’t need to re-deploy it.
This means all the active development of the app stays on our local machine! 🎉
The thin app on the server is just a harness for hosting our model with Bumblebee. It contains no business logic, has no UI, and basically does nothing but host the model and provide an Nx Serving we can talk to.
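For context, building a Bumblebee text-generation serving looks roughly like the sketch below. This is a simplified illustration, not the harness app’s exact code: the Hugging Face repo name, token handling, and compile options here are assumptions (Llama 2 repos require accepting Meta’s license and passing a Hugging Face auth token).

repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: System.fetch_env!("HF_TOKEN")}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# Builds an %Nx.Serving{} that gets started under a supervisor and can then be
# called from any connected node with Nx.Serving.batched_run/2.
serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1024],
    stream: true,
    defn_options: [compiler: EXLA]
  )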
Get and deploy the app
Getting the ready-to-deploy harness application is as simple as:
git clone git@github.com:fly-apps/bumblebee-model-harness.git
cd bumblebee-model-harness
At this point, you can open the included fly.toml file and change the name of the app to something unique that you like. Just hold on to that new name! We’ll need it shortly.
Then, continue with:
fly launch
...
? Would you like to copy its configuration to the new app? Yes
This builds the Dockerfile image, deploys it, and starts the selected serving. For me, downloading the new Llama 2 model and starting the serving took about 4 minutes.
You can watch the logs to see when it’s ready:
fly logs
The following log lines are part of a healthy startup. Note: other lines were excluded.
2024-03-27T02:36:12Z app[3d8d79d5b24068] ord [info]02:36:12.491 [info] Elixir has cuda GPU access! Starting serving Llama2ChatModel.
2024-03-27T02:40:12Z app[3d8d79d5b24068] ord [info]02:40:12.245 [info] Serving Llama2ChatModel started
2024-03-27T02:40:37Z app[3d8d79d5b24068] ord [info]02:40:37.157 [warning] Nx.Serving.start_link - {:ok, #PID<0.2164.0>}
At this point the server is ready and we need a client to use it!
A client Elixir application to use it
Let’s create a new Phoenix application for this. No database is needed.
mix phx.new local_ai --no-ecto
cd local_ai/
We’ll create an .envrc file with the following contents. It tells our local application which of our deployed Fly.io applications to cluster with. This is where your chosen app name comes in!
export CLUSTER_APP_NAME=my-harness-app-name
Next, add {:nx, "~> 0.5"} to mix.exs and run mix deps.get.
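If you’re curious where that lands, in mix.exs the new dependency simply joins the generated dependency list (other deps elided here):

# mix.exs
defp deps do
  [
    # ...deps generated by phx.new...
    {:nx, "~> 0.5"}
  ]
end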
Start and cluster the local app:
./cluster_with_remote
The linked guide walks you through setting up a VPN connection and creating your own local ./cluster_with_remote script file.
The following is logged when successfully starting the local Elixir project and clustering it with the harness app:
Attempting to connect to my-harness-app-name-01HSYVSNTG4XT8TPPGXFP6RJ66@fdaa:2:f664:a7b:215:875f:cb5d:2
Node Connected?: true
Connected Nodes: [:"my-harness-app-name-01HSYVSNTG4XT8TPPGXFP6RJ66@fdaa:2:f664:a7b:215:875f:cb5d:2"]
With the local application running and clustered to the server, it’s time to start coding AI features!
Running local code on the GPU
As a simple getting-started example, copy and paste this into the IEx terminal.
stream = Nx.Serving.batched_run(Llama2ChatModel, "Say hello.")
Enum.each(stream, fn
  {:done, _token_data} ->
    IO.puts("")
    IO.puts("DONE!")

  data ->
    IO.write(data)
end)
It talks to Llama2ChatModel, which is the name of the serving we enabled in the harness application. We ask it to generate text based on our "Say hello." initial prompt.
The serving returns a stream which we enumerate over and write the data out to the console as it’s received. It writes out something like the following in the terminal:
Through the grapevine.
Say hello to the person next to you.
DONE!
That’s all it takes to start working with a self-hosted LLM!
The locally running application is the one we actively develop. It has all our business logic, UI, tests, etc. When the application needs to run something on the LLM with a GPU, it transparently happens for us and the response is streamed back to our application!
We really do get to keep our normal local-first development workflow, but now with powerful GPU access!
What can I do next?
Check out the Bumblebee docs for examples and lots of models to start playing with.
If working with LLMs is your interest, check out the Elixir LangChain library and the ChatBumblebee model to add chat message structures. It makes Bumblebee-based conversations easy.
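To give a rough feel for the shape of that API, here’s a sketch of a chat run against the remote serving. Treat it as an illustration rather than the library’s definitive API: the exact options and return shapes vary between Elixir LangChain versions.

alias LangChain.Chains.LLMChain
alias LangChain.ChatModels.ChatBumblebee
alias LangChain.Message

# Point the chat model at the Nx.Serving hosted on the GPU machine.
chat_model = ChatBumblebee.new!(%{serving: Llama2ChatModel})

chain =
  %{llm: chat_model}
  |> LLMChain.new!()
  |> LLMChain.add_message(Message.new_user!("Say hello."))

# The result shape differs across library versions, so we don't pattern
# match on it here; it contains the assistant's reply.
result = LLMChain.run(chain)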
Here’s a quick demo of a project running locally on my dev machine but executing the chat completion through the harness app running on a Fly.io server with GPU.
Did you notice how rapidly the text is generated? That would NEVER be possible on my local machine.
Setting some Bumblebee expectations
Function support is required for getting the most out of an LLM; it’s what enables the LLM to interact directly with your application.
At the time of this writing, function support is not yet reliably possible with Bumblebee: it needs the ability to constrain the generated text to valid JSON only, and it can’t do that yet.
There are some ways around it, but that’s beyond the scope of this post. It might be fun to explore this further. Hmm. 🤔
Develop as you would normally
This is where you win. We keep the same basic workflow we’re already comfortable and productive with, except now we’re doing it with a remote Fly.io GPU too!
Tips and Tricks
Working with GPUs, Nx, Bumblebee, HuggingFace, and CUDA cores is still new to most of us. This section covers some tips I learned personally by hitting one wall after another. Take heed and keep your head unbruised!
Note
All the tips related to the server are already applied in the harness application. They’re explained here so you can make the same changes in your own projects.
Server: Dockerfile needs all the Nvidia dependencies
See the Fly GPU quickstart guide for Dockerfile examples. The harness GitHub repo includes a working Dockerfile as well.
Server: Turn off Bumblebee’s console progress logging
When using Bumblebee locally, it’s handy to have the progress reported in the console for large model file downloads. However, in a server environment, this breaks the IO stream for the application logs and crashes the server. Additionally, no one is watching the console on the server anyway.
To address this issue, we add the following to our config/prod.exs
config :bumblebee, progress_bar_enabled: false
Server: Delay starting the Nx.Serving
In the harness application, I created a Harness.DelayedServing GenServer that spawns a separate process to start the Nx.Serving for the desired module. Large models can take several minutes to download and process before they are available to the application. This GenServer, or something like it, can be added to the Application supervision tree. It detects and logs whether Elixir has CUDA access to the GPU and, if support is available, starts the serving asynchronously and makes it available to the application.
That is this module’s sole purpose. Without the delayed approach, the extended start-up time can result in the application being found “unhealthy” and killed before it ever becomes active.
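To make the pattern concrete, here’s a minimal sketch of a delayed-start GenServer. It is not the actual Harness.DelayedServing code; the :name and :build_fn options are invented for this example, and the GPU-detection step the real module performs is reduced to a comment.

defmodule MyApp.DelayedServing do
  use GenServer
  require Logger

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    # Return immediately so the rest of the supervision tree keeps booting;
    # the slow work happens in handle_continue/2.
    {:ok, opts, {:continue, :start_serving}}
  end

  @impl true
  def handle_continue(:start_serving, opts) do
    serving_name = Keyword.fetch!(opts, :name)
    # :build_fn is a 0-arity function that builds the Bumblebee serving,
    # downloading the model files on first run. The real module would first
    # check that Elixir actually has CUDA access to the GPU.
    build_fn = Keyword.fetch!(opts, :build_fn)

    Logger.info("Starting serving #{inspect(serving_name)} (this can take minutes)...")

    result =
      Nx.Serving.start_link(
        serving: build_fn.(),
        name: serving_name,
        batch_timeout: 100
      )

    Logger.info("Nx.Serving.start_link - #{inspect(result)}")
    {:noreply, opts}
  end
end

In the Application supervision tree it would appear as something like {MyApp.DelayedServing, name: Llama2ChatModel, build_fn: &MyApp.Models.build_serving/0}, where both of those names are placeholders.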
An alternative approach is to transfer the needed model file to the server in some other way.
Server: Enable clustering on the server
The Phoenix library dns_cluster is an easy option to enable clustering on the server. It’s built into new Phoenix applications.
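For reference, a freshly generated Phoenix app wires it up roughly like this, where :my_app stands in for your OTP app name; on Fly.io the query is typically the app’s <app-name>.internal private DNS name.

# application.ex — part of the generated supervision tree
children = [
  {DNSCluster, query: Application.get_env(:my_app, :dns_cluster_query) || :ignore},
  # ...other children...
]

# runtime.exs
config :my_app, :dns_cluster_query, System.get_env("DNS_CLUSTER_QUERY")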
Client: Make it easy to reconnect to the server
In order to take advantage of a GPU on Fly.io, we need to be clustered with an Elixir application running there.
The guide Easy Clustering from Home to Fly.io documents how to get that set up. To automate the process, we use the shared bash script to start our local application and cluster it with the server.
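Conceptually, the script boils down to starting the local node as a distributed node with the same release cookie as the server and calling Node.connect/1. The values below are placeholders, not real ones; the linked guide covers how to discover them and the Fly.io-specific networking details.

# Assumes the local node was started as a distributed node (e.g. `iex --name ...`).
Node.set_cookie(:"the-servers-release-cookie")
Node.connect(:"my-harness-app-name-<machine-id>@<fly-private-ipv6-address>")
#=> true
Node.list()
#=> [:"my-harness-app-name-<machine-id>@<fly-private-ipv6-address>"]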