We’re Fly.io. We run apps for our users on hardware we host around the world. Fly.io happens to be a great place to use GPUs. Check out how to get started!
Recently I started playing with the idea of using llama.cpp with Elixir as a NIF. Creating C/C++ NIFs in Erlang is a bit of a project, and you need to be especially careful not to introduce memory bugs. So I found a Rust wrapper around llama.cpp, and since I have some experience with Rustler I thought I'd give it a go. If this is your first time hearing about Elixir and Rust(ler), you might want to go back and read my initial experiences!
Let’s use LLama.cpp in Elixir and make a new Library!
If you are not familiar with llama.cpp, it's a project to build a native LLM application that happens to run many models just fine on a MacBook or any regular ole GPU. Today we're going to do the following:
- Setup a new Elixir Library!
- Use the Rust Library rust-llama-cpp with Rustler
- Build a proof of concept wrapper in Elixir
- And see what we can do!
Scaffolding
Let's start off like every great Elixir package with a mix new llama_cpp. We won't be needing a supervisor because this will be a thin shim over the Rust library, rust-llama-cpp. Opening up our mix.exs, let's add our single dep:
defp deps do
  [
    {:rustler, "~> 0.30.0", runtime: false}
  ]
end
Next up, install our deps with mix deps.get, run the Rustler scaffolding with mix rustler.new, and follow the prompts one by one (we'll take a quick look at the stub it generates below). Finally, add our last dependency to native/llamacpp/Cargo.toml:
[dependencies]
rustler = "0.30.0"
llama_cpp_rs = {version = "0.3.0", features = ["metal"]}
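For reference, mix rustler.new drops a minimal example NIF into native/llamacpp/src/lib.rs. It looks roughly like this (the exact contents depend on your Rustler version), and we'll replace it entirely in a moment:
#[rustler::nif]
fn add(a: i64, b: i64) -> i64 {
    a + b
}

rustler::init!("Elixir.LlamaCpp", [add]);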
We should be ready to go! Time to dive into some Rust!
A little Rust and Elixir
Looking over the docs for the rust-llama-cpp library, the two core functions we'll need to implement are LLama::new and llama.predict: new is the constructor for the LLama model and predict handles the actual text prediction. Below is our LlamaCpp module in Elixir with the stubs that Rustler requires:
defmodule LlamaCpp do
  use Rustler,
    otp_app: :llama_cpp,
    crate: :llamacpp

  def new(_path), do: :erlang.nif_error(:nif_not_loaded)
  def predict(_llama, _query), do: :erlang.nif_error(:nif_not_loaded)
end
A typical workflow for exercising this code looks like this:
{:ok, llama} = LlamaCpp.new("path_to_model.gguf")
LlamaCpp.predict(llama, "Write a poem about elixir and rust being a good mix.")
# Very good poem I definitely wrote..
Now let's stub out the Rustler code, replacing the body of native/llamacpp/src/lib.rs with:
use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};
use rustler::Encoder;
use rustler::{Env, LocalPid, NifStruct, ResourceArc, Term};

#[rustler::nif(schedule = "DirtyCpu")]
fn new(path: String) -> Result<LLama, ()> {
    LLama::new(path.into(), ModelOptions::default())
}

#[rustler::nif(schedule = "DirtyCpu")]
fn predict(llama: LLama, query: String) -> String {
    llama.predict(query.into(), PredictOptions::default()).unwrap()
}

rustler::init!(
    "Elixir.LlamaCpp",
    [predict, new]
);
Right now, if we try running this we will get multiple warnings and errors, but that's okay: we've got our scaffolding. You might notice that our Elixir code returns a llama resource that we're expecting to pass through to our LlamaCpp.predict function, and while it would be awesome to say this worked "automagically," it does not.
Rustler Resources
We're going to have to set up a Resource that tells the BEAM virtual machine that this type is something it can hold on to, and that it should clean up when the process goes away. To do this we need to make some changes in our lib.rs:
use std::ops::Deref;

use rustler::{NifStruct, ResourceArc, Term};

// Thin wrapper around the LLama handle that the BEAM can manage as a resource.
pub struct ExLLamaRef(pub LLama);

#[derive(NifStruct)]
#[module = "LlamaCpp.Model"]
pub struct ExLLama {
    pub resource: ResourceArc<ExLLamaRef>,
}

impl ExLLama {
    pub fn new(llama: LLama) -> Self {
        Self {
            resource: ResourceArc::new(ExLLamaRef::new(llama)),
        }
    }
}

impl ExLLamaRef {
    pub fn new(llama: LLama) -> Self {
        Self(llama)
    }
}

impl Deref for ExLLama {
    type Target = LLama;

    fn deref(&self) -> &Self::Target {
        &self.resource.0
    }
}

unsafe impl Send for ExLLamaRef {}
unsafe impl Sync for ExLLamaRef {}
Because we cannot alter the LLama library directly without vendoring, we need to wrap it and provide the various implementations that the Rustler ResourceArc type requires. I do not fully understand why we need two types, an ExLLama and an ExLLamaRef, but I was using the Explorer library as a reference. My understanding is that in order to have the BEAM handle garbage collection you need to wrap your Rust data in a ResourceArc, which requires that your type implement the Send and Sync traits. The Deref implementation is simply a nice-to-have for us so we don't need to dereference our reference type by hand.
The benefit for us is that the BEAM will handle our memory for us and we only need to give it a handle to clean up when it’s done.
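To make the Deref point concrete, here's a tiny sketch (the prompt helper is hypothetical, only here for illustration) of what that impl buys us:
// Hypothetical helper, only here to illustrate the Deref impl above.
fn prompt(llama: &ExLLama, query: String) -> String {
    // Without Deref we would have to write llama.resource.0.predict(...).
    // With Deref, &ExLLama coerces to &LLama, so LLama's methods are
    // callable directly on the wrapper.
    llama.predict(query, PredictOptions::default()).unwrap()
}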
Now we can update our functions to look like this:
fn on_load(env: Env, _info: Term) -> bool {
    // Register our resource type so the BEAM knows how to hold on to it
    // and clean it up when it is garbage collected.
    rustler::resource!(ExLLamaRef, env);
    true
}

#[rustler::nif(schedule = "DirtyCpu")]
fn new(path: String) -> Result<ExLLama, ()> {
    let model_options = ModelOptions::default();
    let llama = LLama::new(path.into(), &model_options).unwrap();
    Ok(ExLLama::new(llama))
}

// ...

rustler::init!(
    "Elixir.LlamaCpp",
    [predict, new],
    load = on_load
);
Notice that we added the on_load callback that registers our ExLLamaRef type, and now everything Just Works™. Also note that we are using the defaults for ModelOptions.
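If you ever want something other than the defaults, Rust's struct update syntax makes that a one-line change inside new. A sketch, with the caveat that n_gpu_layers is an assumed field name; check ModelOptions in llama_cpp_rs for what's actually available:
// A sketch: override one field (assumed name) and keep the rest of the defaults.
let model_options = ModelOptions {
    n_gpu_layers: 32,
    ..ModelOptions::default()
};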
Finally we can use that with our predict function updated like so:
#[rustler::nif(schedule = "DirtyCpu")]
fn predict(llama: ExLLama, query: String) -> String {
    let predict_options = PredictOptions::default();
    let result = llama.predict(query.into(), predict_options).unwrap();
    result
}
Here we accept the ExLLama as a parameter and simply call predict on it, unwrapping the result and returning a string!
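If you'd rather hand failures back to Elixir than unwrap in Rust, a sketch of an alternative (not what this post ships) could return a Result, which Rustler encodes as an {:ok, ...} or {:error, ...} tuple, just like new does:
#[rustler::nif(schedule = "DirtyCpu")]
fn predict(llama: ExLLama, query: String) -> Result<String, String> {
    // Map the library error to a String so Elixir sees {:error, reason}
    // instead of a raised NIF error.
    llama
        .predict(query, PredictOptions::default())
        .map_err(|e| e.to_string())
}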
Example output
Going back to our original example above, let's try it with a real model and see how it does. Please note this is running on an M1 MacBook Pro with 16GB of RAM:
{:ok, llama} = LlamaCpp.new("openzephyrchat.Q4_K_M.gguf")
query = "Write a poem about elixir and rust being a good mix."
LlamaCpp.predict(llama, "GPT4 User: Follow the instructions below to complete the task:\n #{query}<|end_of_turn|>GPT4 Assistant:")
# a bunch of llama-cpp logs then.. and ~10s later
"In the land of ancient lore, where myths entwine and roam,A tale of magic potion whispers and hums.\nElixir and rust, an unlikely pair,\nConjure up a story that leaves us in awe.\n\nBorn from the heart of alchemy's embrace,\nThe elixir, radiant as daylight's face,\nPromises youthful vitality and grace,\nA fountain of immortality to chase.\n\nYet rust, the nemesis of fine metal sheen,\nInfiltrates"
In the land of ancient lore, where myths entwine and roam, A tale of magic potion whispers and hums. Elixir and rust, an unlikely pair, Conjure up a story that leaves us in awe.
Born from the heart of alchemy’s embrace, The elixir, radiant as daylight’s face, Promises youthful vitality and grace, A fountain of immortality to chase. Yet rust, the nemesis of fine metal sheen, Infiltrates
I don’t think it picked up on our programming languages. Let’s try again with: “Write a poem about the programming languages elixir and rust being a good mix.”
In the world of code, where ones and zeros dance, There’s Elixir and Rust, a pair that romances. Their union is strong, like a fortress of steel, In harmony they stand, together they heal.
Elixir, with its elegance and grace, A functional language, in the Erlang base. Concurrency and fault tolerance are its strengths, Like an actor system, it never takes rests. Rust, a systems programming powerhouse, Safe and fast, like a stealthy fox. Zero-cost
Conclusion
So, as you can imagine, this is only the tip of the iceberg for what’s possible, and the rust-llama-cpp library has more API to implement. I’ve started work on that here: llama_cpp_ex, and as always contributions are very welcome! I hope today we got a taste of using long-lived structures in Rustler and how we might work with them, as well as had a little bit of fun with an LLM.
If you want to run these models with LLama.cpp on a Fly GPU you can now do that too!