30 Minutes With MCP and flyctl

Image by Annie Ruygt

I wrote this post on our internal message board, and then someone asked, “why is this an internal post and not on our blog”, so now it is.

I built (well, Cursor built) the most basic MCP server for flyctl I could think of. It took 30 minutes.

MCP, for those unaware, is the emerging standard protocol for connecting an LLM (or an app that drives an LLM in the cloud, like Claude Desktop) to, well, anything. The “client” in MCP is the LLM; the “server” is the MCP server and the “tools” it exports. It mostly just defines an exchange of JSON blobs; one of those JSON blobs enables the LLM to discover all the tools exported by the server.

A classic example of an MCP server is (yes, really) a Python shell. The server publishes to (say) Claude that it can run arbitrary Python code with a tool call; not only that, says the tool description, but you can use those Python tool calls to, say, scrape the web. When the LLM wants to scrape the web with Python, it uses MCP to send a JSON blob describing the Python tool call; the MCP server (yes, really) runs the Python and returns the result.
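If you haven’t seen it, the wire traffic is pretty mundane. Here’s a sketch of the exchange, not code from any real server: the method names and field shapes are trimmed down from my reading of the MCP spec, and `fly_status` is just a stand-in tool.

```go
// Illustrative only: roughly the JSON-RPC blobs an MCP client and server
// trade over stdio, trimmed to the interesting fields.
package main

import (
	"encoding/json"
	"fmt"
)

func show(v any) {
	b, _ := json.MarshalIndent(v, "", "  ")
	fmt.Println(string(b))
}

func main() {
	// Client -> server: what tools do you export?
	show(map[string]any{"jsonrpc": "2.0", "id": 1, "method": "tools/list"})

	// Server -> client: tool names, descriptions, and input schemas.
	// The descriptions are what the LLM reads when deciding to make a call.
	show(map[string]any{
		"jsonrpc": "2.0", "id": 1,
		"result": map[string]any{"tools": []map[string]any{{
			"name":        "fly_status",
			"description": "Show machines, regions, and health checks for a Fly.io app",
			"inputSchema": map[string]any{
				"type":       "object",
				"properties": map[string]any{"app": map[string]any{"type": "string"}},
				"required":   []string{"app"},
			},
		}}},
	})

	// Client -> server: the LLM decides to call the tool, with arguments.
	show(map[string]any{
		"jsonrpc": "2.0", "id": 2, "method": "tools/call",
		"params": map[string]any{
			"name":      "fly_status",
			"arguments": map[string]any{"app": "unpkg"},
		},
	})

	// Server -> client: whatever the tool produced, handed back as text
	// for the model to read.
	show(map[string]any{
		"jsonrpc": "2.0", "id": 2,
		"result": map[string]any{"content": []map[string]any{
			{"type": "text", "text": "...flyctl output goes here..."},
		}},
	})
}
```

The part that matters is the second blob: those tool descriptions are all the model has to go on when it decides what to call.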

Because I have not yet completely lost my mind, I chose to expose just two flyctl commands: fly logs and fly status. Because I’m lazy, I used MCP’s stdio mode; the MCP server just captures the output of flyctl. Because someone already wrote a really good Go MCP library, this whole thing is like 90 lines of code.

We may have gotten a little lucky, because we made a decision back in 2020 to give most of our flyctl commands a json mode to make them easier to drive from automation. I use that in my MCP server. I don’t know how much of a difference it made.
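For flavor, here’s roughly the shape of a server like that. To be clear, this is a sketch, not the actual 90 lines: it assumes the mark3labs/mcp-go library (one Go MCP library; its helper names shift a bit between versions), and the tool names, descriptions, and flyctl flags here are illustrative.

```go
// A sketch of a stdio MCP server wrapping two flyctl commands.
// Assumes github.com/mark3labs/mcp-go; not the actual server from this post.
package main

import (
	"context"
	"log"
	"os/exec"

	"github.com/mark3labs/mcp-go/mcp"
	"github.com/mark3labs/mcp-go/server"
)

// runFly shells out to flyctl and hands back whatever it printed.
func runFly(ctx context.Context, args ...string) (*mcp.CallToolResult, error) {
	out, err := exec.CommandContext(ctx, "fly", args...).CombinedOutput()
	if err != nil {
		return mcp.NewToolResultError(err.Error() + "\n" + string(out)), nil
	}
	return mcp.NewToolResultText(string(out)), nil
}

func main() {
	s := server.NewMCPServer("flyctl", "0.0.1")

	status := mcp.NewTool("fly_status",
		mcp.WithDescription("Show machines, regions, and health checks for a Fly.io app"),
		mcp.WithString("app", mcp.Required(), mcp.Description("Fly app name")),
	)
	s.AddTool(status, func(ctx context.Context, req mcp.CallToolRequest) (*mcp.CallToolResult, error) {
		app, err := req.RequireString("app")
		if err != nil {
			return mcp.NewToolResultError(err.Error()), nil
		}
		// --json: the machine-readable output mode mentioned above.
		return runFly(ctx, "status", "-a", app, "--json")
	})

	logs := mcp.NewTool("fly_logs",
		mcp.WithDescription("Fetch recent logs for a Fly.io app"),
		mcp.WithString("app", mcp.Required(), mcp.Description("Fly app name")),
	)
	s.AddTool(logs, func(ctx context.Context, req mcp.CallToolRequest) (*mcp.CallToolResult, error) {
		app, err := req.RequireString("app")
		if err != nil {
			return mcp.NewToolResultError(err.Error()), nil
		}
		return runFly(ctx, "logs", "-a", app, "--no-tail")
	})

	// stdio mode: the MCP client launches this binary and speaks JSON-RPC
	// to it over stdin/stdout; nothing listens on the network.
	if err := server.ServeStdio(s); err != nil {
		log.Fatal(err)
	}
}
```

An MCP client in stdio mode (Claude Desktop, Cursor, whatever) just launches the binary and talks to it over stdin/stdout.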

It’s, uh, annoyingly useful.

I loaded it up and pointed it at unpkg. I got this, from Claude:

This appears to be a globally distributed CDN service running the unpkg JavaScript package registry, with machines strategically placed around the world to provide low-latency access. There are some stability issues with a couple of the machines, and several have experienced memory-related crashes that triggered automatic restarts.

It then recited the complete configuration of the application:

- **App Name**: unpkg
- **Status**: Deployed and running
- **URL**: unpkg.fly.dev
- **Organization**: unpkg
- **Platform Version**: machines
- **Runtime**: Bun (JavaScript runtime)

The app is running on 10 machines distributed across global regions:

- North America: lax (Los Angeles), atl (Atlanta), ewr (Newark)
- Europe: lhr (London), cdg (Paris), ams (Amsterdam)
- Asia: sin (Singapore), nrt (Tokyo), hkg (Hong Kong)
- South America: bog (Bogotá)
- Oceania: syd (Sydney)

OK. But then:

- Most machines are in “passing” health check status
- 2 machines are in “critical” status:
  - One in ewr (Newark) with “context deadline exceeded” error
  - One in atl (Atlanta) with “gone” status

**Recent Events**

- Multiple machines have experienced restarts
- Several machines show “oom_killed: true” in their event history, indicating they ran out of memory

So I asked, “try getting logs for one of the critical machines”.

The root cause appears to be memory usage - the Bun process was consuming about 3.7GB of memory (out of the 4GB allocated to the machine), and there was likely a surge in requests or memory usage that pushed it over the limit. This matches what we saw in the status report, which showed multiple instances of “oom_killed: true” in the event history for several machines.

And then it dug in:

  1. The machine was actively serving content requests with 200 OK responses until 20:47:03 PM
  2. At that point, it ran out of memory: Out of memory: Killed process 641 (bun) total-vm:85950964kB, anon-rss:3744352kB, file-rss:12032kB, shmem-rss:0kB, UID:0 pgtables:16908kB oom_score_adj:0
  3. The process was killed with SIGKILL
  4. The system attempted to recover by restarting: reboot: Restarting system
  5. A health check failed shortly after: Health check on port 3000 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
  6. The machine then restarted successfully and the service came back up: Server listening on http://localhost:3000 ...
  7. The health check started passing again at 20:47:46 PM: Health check on port 3000 is now passing.
  8. The machine resumed normal operations, serving requests.

I’m honestly kind of surprised how fast this was useful. It finds problems in apps faster than I do. Just by giving some LLM some tools that do basically what I’d do.

Local MCP servers are scary. I don’t like that I’m giving a Claude instance in the cloud the ability to run a native program on my machine. I think fly logs and fly status are safe, but I’d rather know they’re safe. They would be, if I were running flyctl in an isolated environment and not on my local machine.