---
title: Machine Suspend and Resume
layout: docs
nav: firecracker
author: kcmartin
date: 2025-08-15
---
**Machine suspend** lets you pause a running Fly Machine and save its complete state, including memory, to persistent storage. When resumed, the machine picks up exactly where it left off, without rebooting the OS or restarting your app. That can make startup take just **hundreds of milliseconds** instead of multiple seconds.
You can think of suspend as what a laptop does when you close the lid, except your “laptop” is a microVM running in, say, `dfw` or `fra` or `syd`.
## How it works
Suspend uses [Firecracker snapshots](https://firecracker-microvm.github.io/) to capture the entire VM state: CPU registers, memory contents, open file handles. When you start a suspended machine, Fly restores from this snapshot instead of cold booting.
**Typical performance:**
- Resume from suspend: a few hundred ms
- Cold start: ~2+ seconds for common apps
- TCP connections may survive if the remote side keeps them open
---
## Using Suspend
### Manually
```bash
# Suspend a machine
fly machine suspend <machine-id>
# Check status (running, suspending, suspended, etc.)
fly machine status <machine-id>
# Resume from snapshot
fly machine start <machine-id>
# Force a cold start (discard snapshot)
fly machine stop <machine-id>
fly machine start <machine-id>
```
### Automatically via Fly Proxy
Configure in `fly.toml`:
```toml
[http_service]
auto_stop_machines = "suspend" # or "stop"
auto_start_machines = true
[http_service.concurrency]
type = "requests"
soft_limit = 25
```
The proxy will automatically suspend machines during low traffic, checking for idle periods every few minutes, and resume them when requests arrive.
### Machines API
```
# Suspend
POST /v1/apps/{app_name}/machines/{machine_id}/suspend
# Wait for suspension to complete
GET /v1/apps/{app_name}/machines/{machine_id}/wait?state=suspended
# Resume (standard start endpoint)
POST /v1/apps/{app_name}/machines/{machine_id}/start
```
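The three calls above can be chained from a script. Here is a minimal Python sketch using only the standard library. It assumes the public Machines API base URL `https://api.machines.dev` and a bearer token supplied by the caller; `machine_url`, `call`, and `suspend_and_wait` are illustrative helper names, not part of any Fly SDK. A call such as `suspend_and_wait("my-app", "<machine-id>", token)` would request the suspend and then block on the `wait` endpoint.

```python
import urllib.request

# Assumption: public Machines API base URL; adjust if you reach the API another way.
API_BASE = "https://api.machines.dev"


def machine_url(app_name, machine_id, action, query=""):
    """Build a Machines API URL for one machine action (suspend, wait, start)."""
    url = f"{API_BASE}/v1/apps/{app_name}/machines/{machine_id}/{action}"
    return f"{url}?{query}" if query else url


def call(method, url, token):
    """Send one authenticated request and return (status, body text)."""
    req = urllib.request.Request(url, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.status, resp.read().decode()


def suspend_and_wait(app_name, machine_id, token):
    """Request a suspend, then block until the machine reports `suspended`."""
    call("POST", machine_url(app_name, machine_id, "suspend"), token)
    wait_url = machine_url(app_name, machine_id, "wait", "state=suspended")
    return call("GET", wait_url, token)
```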
Generally, you need an API token to use the Machines API. But if you're just suspending _your own_ machine, you can skip the token and hit the `/.fly/api` Unix socket directly:
```bash
curl --unix-socket /.fly/api -X POST \
  "http://flaps/v1/apps/$FLY_APP_NAME/machines/$FLY_MACHINE_ID/suspend"
```
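If your app would rather make that call itself than shell out to `curl`, the same request can be sketched in Python with the standard library. This is an illustration under assumptions: `UnixHTTPConnection`, `suspend_path`, and `suspend_self` are hypothetical names, and the response handling is a guess at what you might want, since the machine may be suspended before your code reads the reply.

```python
import http.client
import os
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=10):
        # "flaps" is only used for the Host header, matching the curl example.
        super().__init__("flaps", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def suspend_path(app_name, machine_id):
    """Request path for suspending one machine."""
    return f"/v1/apps/{app_name}/machines/{machine_id}/suspend"


def suspend_self(socket_path="/.fly/api"):
    """Ask the platform to suspend the machine this code is running on."""
    path = suspend_path(os.environ["FLY_APP_NAME"], os.environ["FLY_MACHINE_ID"])
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("POST", path)
        resp = conn.getresponse()
        return resp.status, resp.read().decode()
    finally:
        conn.close()
```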
---
## Requirements
A machine can use suspend if it has:
- **≤ 2 GB** memory (suspend is discouraged for larger memory sizes because suspending takes longer)
- **No** [**swap**](https://fly.io/docs/reference/configuration/#swap_size_mb-option) **configured**
- **No** [**schedule**](https://fly.io/docs/machines/flyctl/fly-machine-run/#start-a-machine-on-a-schedule) **configured**
- **No GPU configured** (GPUs will be deprecated as of 08/01/26)
- Been updated since **June 20, 2024 20:00 UTC**
If you have an older machine, or you’re not sure when it was last updated, you can bring it up to date with:
```bash
fly machine update <machine-id> --yes
```
This updates the machine in place to the latest supported configuration for suspend, without changing your app code or image.
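
The requirements listed above can be written down as a simple pre-flight check. The sketch below is illustrative only: `MachineInfo` and its fields are hypothetical and would need to be filled in from your own machine configuration; nothing here queries Fly.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

MAX_MEMORY_MB = 2048  # 2 GB limit from the requirements list
MIN_UPDATED_AT = datetime(2024, 6, 20, 20, 0, tzinfo=timezone.utc)


@dataclass
class MachineInfo:
    """Hypothetical summary of one machine's configuration."""
    memory_mb: int
    swap_size_mb: int
    has_schedule: bool
    has_gpu: bool
    updated_at: datetime


def suspend_blockers(machine):
    """Return the list of reasons this machine cannot use suspend (empty if none)."""
    reasons = []
    if machine.memory_mb > MAX_MEMORY_MB:
        reasons.append("more than 2 GB of memory")
    if machine.swap_size_mb > 0:
        reasons.append("swap is configured")
    if machine.has_schedule:
        reasons.append("a schedule is configured")
    if machine.has_gpu:
        reasons.append("a GPU is configured")
    if machine.updated_at < MIN_UPDATED_AT:
        reasons.append("not updated since June 20, 2024 20:00 UTC")
    return reasons
```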
---
## Limitations and considerations
- Suspend is not currently recommended for large machine memory sizes (> 2 GB)
- Suspending many machines at once is not recommended
- Some logs may be lost after resume
- Unlike stop, suspend **does not** reset the machine's `rootfs`
- On resume, the clock can lag a few seconds until NTP syncs
<div class="callout">
Always design for both resume and cold start paths.
</div>
---
## Snapshot behavior with suspend
<div class="warning icon">
Snapshots are tied to the exact code and state of the machine they were taken from. If you deploy new code, the old snapshot can’t be resumed safely and will be discarded.
</div>
**Snapshots aren’t guaranteed to persist.** Cold starts may happen if:
- **You deploy a new version of your app.** Deployments rebuild the machine image, which invalidates the old snapshot. A snapshot is a literal memory dump of the _old_ process, so resuming it after you’ve swapped in new code or dependencies would be unsafe and unpredictable.
- The machine is migrated to a different host
- The snapshot file is lost or corrupted. Hardware failures, space reclamation, or corruption can cause snapshots to be deleted.
- We perform system maintenance or updates
---
## Volume behavior with suspend
Suspend automatically saves a machine's memory state to persistent storage. You don't need to attach a [volume](/docs/volumes/) for suspend to work, and the snapshot isn't stored on a volume.
If your machine does have an attached volume, the volume and its data aren't affected by suspend and resume. It helps to think of the machine snapshot and the volume as separate things:
- The **snapshot** is the saved CPU and memory state, managed by Fly. It can be discarded, for example, during a deploy, which forces a [cold start](#snapshot-behavior-with-suspend).
- The **volume** is your persistent storage. Its data survives suspend, resume, and cold starts, just as it does across a normal stop and start.
Even if a snapshot is discarded and the machine cold starts, the data on your volume is still there.
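
Because the snapshot can be discarded while the volume persists, anything that must survive a cold start belongs on the volume rather than only in memory. Here is a minimal sketch, assuming a JSON-serializable state dict and a file path on your volume mount (for example a hypothetical `/data/state.json`). `save_state` and `load_state` are illustrative names, not part of any Fly API.

```python
import json
import os
import tempfile


def save_state(state, path):
    """Write state to `path` atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the finished file into place in a single step.
    os.replace(tmp_path, path)


def load_state(path):
    """Return the saved state, or an empty dict on the very first start."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```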
---
## Handling Network Connections After Resume
On resume, the machine thinks its network connections are still live. External systems (databases, APIs) may disagree.
Common symptoms:
- `ECONNRESET`
- "Connection closed"
- Timeouts on first request
- Database pool errors
**Fix:** Reconnect on failure.
Example (Python + DB):
```python
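# `db` and `query` stand in for your own database client and statement.
# OperationalError comes from your driver (for example, sqlalchemy.exc).
try:
    result = db.execute(query)
except (ConnectionError, OperationalError):
    # The connection likely died while the machine was suspended.
    db.reconnect()
    result = db.execute(query)  # retry once on the fresh connection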
```
Tips:
- Use connection pools with disconnect handling (see this excellent [SQLAlchemy guide](https://docs.sqlalchemy.org/en/20/core/pooling.html#dealing-with-disconnects))
- Shorten connection timeouts to fail fast
- Use retry/backoff for HTTP clients
- Test after long suspensions
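
The retry/backoff tip above can be sketched as a small helper. This is one possible shape, not a prescribed implementation: `retry_with_backoff` and its parameters are hypothetical, and the default exception types are a guess at what a stale connection raises in your client (Python's `ConnectionResetError`, the counterpart of `ECONNRESET`, is a subclass of `ConnectionError`). The `sleep` argument exists only so the delay can be replaced in tests.

```python
import time


def retry_with_backoff(operation, retries=3, base_delay=0.1,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call `operation`, retrying with exponential backoff on connection errors.

    Makes at most `retries + 1` attempts; the delay doubles after each failure.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the original error
            sleep(base_delay * (2 ** attempt))
```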
---
## Billing
Suspended machines cost the same as stopped machines: storage only. There are no CPU/RAM charges.
If a suspended machine has a volume, you continue paying for the volume for as long as it exists. [Volume storage](/docs/about/pricing/#persistent-storage-volumes) is billed whether the machine is running, stopped, or suspended.
<div class="note icon">
Suspending a machine lowers your costs, but it does not free capacity in a region. Suspended machines still reserve their resources, so suspension is not a way to fit more machines into a capacity-constrained region.
</div>
---
## Monitoring & Debugging
```bash
fly machine status <machine-id>
```
States:
- `running`
- `suspending`
- `suspended`
- `starting` (resume or cold start)
- `stopped`
If machines cold start unexpectedly:
- Check requirements
- Confirm no migrations or deployments occurred
- Check logs for suspend/resume events
Test cold start:
```bash
fly machine stop <machine-id>
fly machine start <machine-id>
```
---
## Availability
Suspend works in **all Fly.io regions** as of July 2024.
---
**Related reading:**
- [Autostop & Autostart](/docs/launch/autostop-autostart/)
- [Fly Proxy Config](/docs/reference/fly-proxy-autostop-autostart/)
- [Scaling Machines](/docs/apps/scale-count/)
- [Machines API](https://docs.machines.dev/)