App Availability and Resiliency
A resilient app is resistant to events like hardware failures or outages. Redundancy, supported by monitoring and scaling, is one of the ways to make your app more resilient, and we provide some built-in features to make this easier.
What your app needs to do
First, it’s important to understand what your app needs to take care of.
Since Fly Proxy does load balancing by default, you need to consider how your app handles:
- session storage, so that users stay logged in when their session moves to a different server
- persistent file storage and replication, as discussed in the following paragraphs
A note about Fly Proxy: Fly Proxy does a lot of work in the background, applying handlers, managing traffic and load balancing, and managing connections to services. Fly Proxy also automatically starts and stops Machines based on traffic to your app. The proxy works for process groups with services defined.
For databases, you’ll need to follow the specific recommendations for the database you’re using, including how to set up clusters for replication and failover.
If you’re deploying Fly Postgres, our un-managed database app, then you need to manage backups, recovery, replication, and scaling yourself. For information about high availability with Fly Postgres, refer to High Availability & Global Replication.
When you use Fly Volumes with your app, your app needs to implement replication and backups. Volumes are block devices that live on a single disk array and get mounted directly by your Machine. There’s no magic. And there’s no built-in replication between volumes. If your app provides a clustering mechanism with data replication, like most databases do, we recommend you take advantage of that and run multiple instances with attached volumes. When possible, we place your app’s volumes in different hardware zones within a region to mitigate hardware failures.
This brings us to disaster recovery and being prepared. You should have backups of your data and understand how to recover that data if needed. Fly.io takes daily snapshots of volumes and stores them for 5 days by default. You can restore a volume’s data from one of these daily snapshots. If your use case requires more frequent backups, then you’ll need to set up tooling and processes to back up your volumes on the required schedule.
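For reference, a volume is attached to a process group through the mounts section of fly.toml. The following is a minimal sketch, not a recommended setup: the volume name data and the path /data are hypothetical, and the snapshot_retention key (a number of days) is an assumption about the current fly.toml schema that you should confirm in the configuration reference.

```toml
# Sketch only: "data" and "/data" are hypothetical names.
[mounts]
  source = "data"          # name of the volume to attach
  destination = "/data"    # where the volume is mounted inside the Machine
  # Assumed key: number of days to keep the automatic daily snapshots.
  snapshot_retention = 14
```

Keeping snapshots longer does not add replication between volumes. Your app, or the database it runs, still has to replicate the data itself.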
Fly.io features for app resiliency
The Fly Platform has features to help make your app more resilient in case of outages or hardware failures.
The features for redundancy and availability are:
- Redundancy by default on first deploy
- Automatically start and stop Machines
- Health check-based routing
These features require or depend on your app configuration in the fly.toml file, including process groups. Every Fly App has at least one process group. An app with no extra process groups defined in fly.toml still uses the app process group by default. Learn more about app configuration.
Redundancy by default on first deploy
You get redundancy by default on first deploy and when you deploy after scaling down to zero Machines. This feature helps you deploy a recommended setup without having to think about it. If the defaults in the following sections aren’t what you’re looking for, then you can still configure your app and processes the way you want. Learn more about resilient apps and multiple Machines, scaling Machine CPU and RAM, and scaling the number of Machines.
Here’s what Fly Launch (the fly launch or fly deploy command) does, based on your app configuration:
- creates and starts two Machines for process groups with services, sets automatic start and stop to true and minimum machines running to zero
- creates and starts one Machine and creates one stopped standby Machine for process groups without services
- creates and starts only one Machine if the process group has volumes mounted, because there’s no built-in replication between volumes
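To illustrate how these defaults map to configuration, here is a sketch of a fly.toml with two process groups. The app name, commands, and port are hypothetical. The auto_stop_machines value is written as true to match the wording above, but some flyctl versions write it as a string such as "stop" instead.

```toml
app = "example-app"            # hypothetical app name

[processes]
  app = "bin/server"           # hypothetical command; has a service below
  worker = "bin/worker"        # hypothetical command; no service

# The "app" group has a service, so a first deploy creates two Machines for it.
[http_service]
  internal_port = 8080         # hypothetical port
  processes = ["app"]
  auto_start_machines = true
  auto_stop_machines = true
  min_machines_running = 0

# The "worker" group has no service, so a first deploy creates one running
# Machine and one stopped standby Machine for it.
```

Mounting a volume for either process group would reduce that group to a single Machine on first deploy, because volumes are not replicated.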
Two Machines for process groups with services
The most basic way to improve resiliency is to create more than one Machine per process group, and the fly launch and fly deploy commands do this by default for process groups with services. A process group with services has mappings to the external internet or to a private (Flycast) network, so that Fly Proxy can manage its connections and operation.
If a process group has services defined, two Machines are automatically created and started when:
- you deploy an app for the first time using fly launch or fly deploy,
- you redeploy an app using fly deploy after scaling down to zero, or
- you add a new process group with services in the fly.toml file and then run fly deploy
Standby Machines for process groups without services
Standby Machines provide low-effort redundancy for process groups with no services defined. A process group with no services has no mappings to the external internet or to a private (Flycast) network, so Fly Proxy can’t manage its connections and operation.
A standby Machine is a stopped Machine that’s essentially paired to and watching a running Machine. The standby starts only if the paired Machine becomes unavailable.
If a process group doesn’t have services defined, then a standby Machine is automatically created, along with a running Machine, when:
- you deploy an app for the first time using fly launch or fly deploy,
- you redeploy an app using fly deploy after scaling down to zero, or
- you add a new process group in fly.toml and then run fly deploy
The standby Machine doesn’t consume resources or start up until it’s needed.
Remove a standby Machine
To remove a standby Machine, run fly status and copy the ID of the Machine that displays the † symbol beside the process group name. For example:
PROCESS  ID              VERSION  REGION  STATE    ROLE  CHECKS  LAST UPDATED
task†    784e469a443e38  43       yul     stopped                2024-08-06T20:48:10Z
Then destroy the standby Machine with fly machine destroy <machine id>.
If you destroy a standby Machine, it is not recreated when you scale up or back down using fly scale count.
If you add services for a process group in fly.toml that previously had no services, then the standby designation is removed from the standby Machine on the next deploy.
Create a standby Machine
You can recreate the standby Machine if you scale down to 0 and then run fly deploy, which will create one Machine and one standby Machine again.
At a lower level, you can also use the fly machines update command with the --standby-for option to create a standby configuration using two existing Machines.
Turn off redundancy on deploy
What if you don’t want to keep Machines running all the time for an app with a small or variable workload? The automatic start and stop feature is enabled by default for Machines and makes it possible to run at least two Machines without wasting resources.
If you still don’t want fly launch or fly deploy to create two Machines, then you can use the --ha option to turn off this feature.
Create one Machine for a new app:
fly launch --ha=false
Create one Machine on first deploy or when your app is scaled down to zero Machines:
fly deploy --ha=false
Scale down an app to one Machine:
fly scale count 1
For apps with more than one process, refer to Scale a process group horizontally.
Automatically start and stop Machines
Automatic start and stop works for process groups with services defined. This feature enables you to have multiple Machines that only run when needed.
Fly Proxy can start and stop existing Machines based on incoming requests, so that your app can accommodate bursts in demand without keeping extra Machines running constantly. And if your app needs to have one or more Machines always running in your primary region, then you can set a minimum number of machines to keep running. Fly Proxy is also responsible for load balancing.
Default settings for new apps created using the fly launch command: automatically start and automatically stop Fly Machines, and minimum machines running is zero.
Default settings for some existing apps (or any apps that don’t have these settings in fly.toml): automatically start but don’t automatically stop Fly Machines.
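To keep at least one Machine running in the primary region while still letting the proxy stop the others, the settings might look like the following sketch. The region code and port are hypothetical, and depending on your flyctl version auto_stop_machines may need a string value such as "stop" rather than true.

```toml
primary_region = "yul"         # hypothetical region code

[http_service]
  internal_port = 8080         # hypothetical port
  auto_start_machines = true   # proxy starts stopped Machines on demand
  auto_stop_machines = true    # proxy stops Machines that are not needed
  min_machines_running = 1     # keep one Machine running in the primary region
```

With min_machines_running set to 0, as in the fly launch defaults, every Machine can be stopped when there is no traffic.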
Learn more about how Fly Proxy autostop/autostart works and how to configure it.
Health check-based routing
Fly Proxy routes network connections away from Machines that are failing health checks.
Health check-based routing works for process groups with services defined and is based on the configuration of services.tcp_checks or services.http_checks in your app’s fly.toml file.
If the configured health checks are failing for a Machine, then the proxy doesn’t route network connections to that Machine. The proxy routes connections to a healthy Machine instead. If there aren’t any healthy Machines, then the connection will block, waiting for a Machine to become healthy.
Health check-based routing doesn’t work for the top-level health checks defined in the checks section of fly.toml, because the top-level checks don’t apply to specific services.
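The difference between the two kinds of checks is easier to see side by side. The sketch below assumes a hypothetical app listening on port 8080 with a hypothetical /healthz endpoint. The intervals and timeouts are illustrative values, not recommendations.

```toml
[[services]]
  internal_port = 8080
  protocol = "tcp"

  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]

  # Service-level checks: Fly Proxy uses these for routing decisions.
  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    grace_period = "5s"

  [[services.http_checks]]
    interval = "10s"
    timeout = "2s"
    grace_period = "5s"
    method = "get"
    path = "/healthz"          # hypothetical endpoint
    protocol = "http"

# Top-level check: not tied to a service, so not used for routing.
[checks]
  [checks.status]
    type = "http"
    port = 8080
    method = "get"
    path = "/healthz"
    interval = "15s"
    timeout = "2s"
    grace_period = "5s"
```

If only the top-level checks section is present, a Machine that fails its checks can still receive connections from the proxy.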