Getting Started with Fly GPUs
How do I run a GPU Machine on Fly.io?
A Fly GPU Machine works very similarly to a CPU-only (or “normal”) Fly Machine, and has access to the same platform features by default. It boots up with GPU drivers installed, and you can run `nvidia-smi` right away. So running a Fly App that uses CUDA compute is a slight variation on running any Fly App. In a nutshell:
- Make sure GPUs are enabled on your Fly Organization.
- Tell the Fly Platform to provision your Machines on Fly GPU hosts.
- Provide a Docker image or Dockerfile for your project that installs the NVIDIA libraries your project uses, alongside the rest of your project.
- Carry on as you would for any Fly App, tailoring it to the needs of your application.
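At its simplest, the whole flow looks roughly like this (a sketch; the `fly.toml` settings are covered in the sections below):

```shell
fly launch --no-deploy   # create the app and generate a fly.toml
# Edit fly.toml: set vm.size and primary_region for the GPU you want.
fly deploy               # build your image and provision GPU Machines
```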
Aside: Firecracker doesn’t do GPUs, so GPU-enabled Machines run under Cloud Hypervisor instead. This means you’ll see some different output in logs, but in general you don’t have to think about this.
Placing your Machines on GPU hosts
Not all of Fly.io’s physical servers have GPUs attached to them, so you have to tell the platform your requirements before the Machine is created on a host.
Configuration is done by different means depending on whether you’re using Fly Launch to manage your app, or running Machines individually using the API or the `fly machine run` command.
Specify the GPU model
Assuming that you’re going the Fly Launch way and configuring your Machines app-wide, you can include a line like the following in your `fly.toml` file before running `fly deploy` for the first time:

```toml
vm.size = "a100-40gb"
```
The `a100-40gb` preset is the `a100-pcie-40gb` GPU with the `performance-8x` CPU config and 32GB RAM. You can exert more granular control over resources using the `[[vm]]` table in `fly.toml`.
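For example, to start from a preset and then override individual resources (a sketch; check the current `fly.toml` reference for the exact `[[vm]]` keys):

```toml
[[vm]]
  size = "a100-40gb" # start from the GPU preset...
  memory = "64gb"    # ...then override just the RAM
```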
Run `fly platform vm-sizes` to get the list of presets, and check out the pricing page to help you plan ahead.
Specify the region
But you’re not done! For `fly deploy` to succeed in provisioning your Machine(s), the app’s `primary_region` must match a region that hosts the kind of GPU you’re asking for. At the time of writing, `a100-40gb` GPUs are only available in `ord`. Set the initial region in `fly.toml`:

```toml
primary_region = "ord" # If you change this, ensure it's to a region that offers the GPU model you want
```
Currently GPUs are available in the following regions:
- `a10`: `ord`
- `l40s`: `ord`
- `a100-40gb`: `ord`
- `a100-80gb`: `ams`, `iad`, `mia`, `sjc`, `syd`
Change GPU model
Different GPU models live in different regions, so switching models may also mean switching regions: you’ll likely have to destroy any existing Machines before redeploying with a different GPU spec, roughly as sketched below.
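A rough sketch of that flow, assuming an app named `my-gpu-app` (a placeholder):

```shell
fly machine list -a my-gpu-app                  # find existing Machine IDs
fly machine stop <machine-id> -a my-gpu-app     # stop it first if it's running
fly machine destroy <machine-id> -a my-gpu-app
# Edit fly.toml: update vm.size and primary_region for the new GPU model.
fly deploy
```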
Choosing a Docker base image
Many open-source ML projects have ready-made Docker images; for example: Ollama, Basaran, LocalAI. Check out GitHub `fly-apps` repos with the `gpu` topic for some examples.
If you’re building your own image, choose a base image that NVIDIA supports; `ubuntu:22.04` is a solid choice that’s compatible with the CUDA driver GPU Machines use. You can launch a vanilla Ubuntu image and run `nvidia-smi` with no further setup.
From there, install whatever packages your project needs. This will include some libraries from NVIDIA if you want to do something more than admire the output of `nvidia-smi`.
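As a quick smoke test, a minimal image like this (a sketch) does nothing but print the GPU status; the driver comes from the host, not the image:

```dockerfile
FROM ubuntu:22.04
# nvidia-smi is available without installing anything extra.
CMD ["nvidia-smi"]
```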
Installing NVIDIA libraries
You don’t need to install any `cuda-drivers` packages in the Docker image, but you’ll want some subset of NVIDIA’s GPU-accelerated libraries. `libcublas-12-2` (linear algebra) and `libcudnn8` (deep-learning primitives) are a common combination, along with `cuda-nvcc` for compiling stuff with CUDA support.
In general, you’ll install NVIDIA libs using your Linux package manager. In a Python environment, it’s possible to skip system-wide installation and use pip packages instead.
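For example, in a Python project (a sketch; these are NVIDIA’s CUDA 12 wheels on PyPI, so check for the versions your framework expects):

```shell
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
```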
Tips:
- Be deliberate about how much cruft you put in your Docker image. Avoid meta-packages like `cuda-runtime-*`. `cuda-libraries-12-2` is a convenient, but bulky, start. Once you know which libs are needed at build and run time, pick accordingly to optimize final image size.
- Use multi-stage Docker builds as much as possible.
To install packages from NVIDIA repos, you’ll need the `ca-certificates` and `cuda-keyring` packages first.
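Putting that together, here’s a sketch of a Dockerfile that installs `libcublas-12-2` and `libcudnn8` from NVIDIA’s apt repo (the keyring URL and version were current at the time of writing; check NVIDIA’s docs for the latest):

```dockerfile
FROM ubuntu:22.04

# Prerequisites for fetching and trusting NVIDIA's repo.
RUN apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates wget \
    && rm -rf /var/lib/apt/lists/*

# Register NVIDIA's CUDA apt repository via the cuda-keyring package.
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb \
    && dpkg -i cuda-keyring_1.1-1_all.deb \
    && rm cuda-keyring_1.1-1_all.deb

# Install only the libraries the project actually needs.
RUN apt-get update && apt-get install -y --no-install-recommends \
        libcublas-12-2 libcudnn8 \
    && rm -rf /var/lib/apt/lists/*
```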
Where to store data
Machine learning tends to involve large quantities of data. We’re working with a few constraints:
- Large Docker images (many gigabytes) can be very unwieldy to push and pull, particularly over large geographical distances.
- The root file system of a Fly Machine is ephemeral – it’s reset from its Docker image on every restart. It’s also limited to 50GB on GPU-enabled Machines.
- Fly Volumes are limited to 500GB, and are attached to a physical server. The Machine must run on the same hardware as the volume it mounts.
Unless you’ve got a constant workload, you’ll likely want to shut down GPU Machines when they’re not needed, to save money. You can do this manually with `fly machine stop`, have the main process exit when idle, or use the Fly Proxy autostop and autostart features. Saving data on a persistent Fly Volume means you don’t have to re-download large amounts of data, or reconstitute a large Docker image into a rootfs, whenever the Machine restarts. You’ll probably want to store models, at least, on a volume.
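A sketch of putting models on a volume (the volume name `models` and the mount path are placeholders):

```shell
fly volumes create models --region ord --size 100   # size is in GB
```

```toml
[mounts]
  source = "models"
  destination = "/models"
```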
Using swap
Designing your workload and provisioning appropriate resources for it are the first line of defense against losing work by running out of memory (system RAM or VRAM).
You can also enable swap for system RAM on a Fly Machine, simply by including a line like the following in `fly.toml`:

```toml
swap_size_mb = 8192 # This enables 8GB of swap
```
Keep in mind that this consumes a commensurate amount of space on the Machine’s root file system, leaving less capacity for whatever else you want to store there.
If you need more system RAM and faster performance, also scale up with `fly scale memory`.
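For example (a sketch; the argument is in megabytes):

```shell
fly scale memory 32768   # scale system RAM to 32GB
```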