Add Ollama

You can get an Ollama image running on a GPU in a few minutes. Get started by adapting the following fly.toml file:

app = '<your-app>'
primary_region = 'ams'

[build]
  image = 'ollama/ollama'

[[mounts]]
  source = 'models'
  destination = '/root/.ollama'
  initial_size = '10gb'

[http_service]
  internal_port = 11434
  force_https = false
  auto_stop_machines = 'stop'
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[[vm]]
  size = 'a100-80gb'

Modify the app name and region to suit your preferences and save the file as fly.ollama.toml. Then you can launch the app:

fly launch -c fly.ollama.toml --flycast

There are a few things to note here:

  • The ollama image can use a GPU, so you need to specify a GPU size in the [[vm]] section.
  • Not all regions have GPUs available; refer to region GPU availability for more info.
  • The [[mounts]] section mounts a volume to store the models. This keeps the pulled models on persistent storage, separate from the app, so they survive Machine restarts.
  • The volume size in this config file is 10gb, which is enough for small models. Change this value if you need more.
  • The --flycast flag allocates a private (Flycast) IPv6 address for the app instead of a public one, so it's only reachable from inside your organization's network.

Finally, this only starts the Ollama server; at this point you can't interact with any models yet. To do so, you'll have to pull a model with this one easy, short, intuitive command:

fly m run -e OLLAMA_HOST=http://<your-app>.flycast --shell --command "ollama pull llama3.1" ollama/ollama

This command pulls the llama3.1 model; change the model name to suit your needs. The model is now available to the internal network of the organization it's deployed in, and you can reach it over Flycast at http://<your-app>.flycast.
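As a quick sanity check, you can ask the server which models it has. Here's a minimal sketch using only the Python standard library; it assumes you're connected to your organization's private network (for example over WireGuard), since the .flycast address isn't reachable from the public internet:

import json
import urllib.request

# /api/tags lists the models the Ollama server has available locally
with urllib.request.urlopen('http://<your-app>.flycast/api/tags') as resp:
    models = json.load(resp)

# the model pulled above should show up here
print([m['name'] for m in models['models']])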

Now that we have a functioning Ollama server with a model, we need to make the Ollama host available to our app. One way to do this is to set the host as a secret:

fly secrets set OLLAMA_HOST=http://<your-app>.flycast

To interact with our new AI friend, we'll have to install the ollama Python package:

poetry add ollama

Now we can initialize the client:

import os
from ollama import AsyncClient

# OLLAMA_HOST is the secret we set above, e.g. http://<your-app>.flycast
OLLAMA_HOST = os.getenv('OLLAMA_HOST')
ollama_client = AsyncClient(host=OLLAMA_HOST)

From here we can start integrating it into our app:

@app.get("/")
async def read_root():
    # ask the model a question and return just the generated text
    resp = await ollama_client.generate(
        model="llama3.1",
        prompt="Why is the sky not green?",
    )
    return resp["response"]
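If you'd rather stream the answer back as it's generated instead of waiting for the full response, the client also supports streaming. Here's a minimal sketch building on the same app and ollama_client as above; the /stream route name is just an example:

from fastapi.responses import StreamingResponse

@app.get("/stream")
async def stream_answer():
    async def token_gen():
        # with stream=True the client yields partial responses as they arrive
        async for chunk in await ollama_client.generate(
            model="llama3.1",
            prompt="Why is the sky not green?",
            stream=True,
        ):
            yield chunk["response"]

    return StreamingResponse(token_gen(), media_type="text/plain")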

When you re-deploy your app, you should see llama’s answer:

fly deploy

You can check out this gist for the complete example app.