It’s been a hectic first couple of weeks at Fly, and I’m writing things up as I go along, because if I have to learn, so do you. This is going to be a bit of a meander; you’ll have to deal.
Let’s start with “what’s Fly?” Briefly: Fly is a content delivery network for Docker containers. Applications hosted on Fly are fast because they’re running on machines close to users. To do that, we run bare metal servers in a bunch of cities and host containers on them in Firecracker VMs. We proxy traffic from edge servers to containers through a global WireGuard mesh. It’s much easier to play with than ECS or K8s is, so signing up for a free account is probably the best way to get a feel for it, and a pleasant way to burn 5-10 minutes.
Obviously, to do stuff like this, you need to generate certificates. The reasonable way to do that in 2020 is with LetsEncrypt. We do that for our users automatically, but “it just works” makes for a pretty boring writeup, so let’s see how complicated and meandering I can make this.
It’s time to talk about certificate infrastructure.
ACME
Rather than verifying information from “Qualified Independent Information Sources”, LetsEncrypt does domain-validated certificates, based simply on proof of ownership of a domain, and is driven by a protocol called ACME. ACME is really simple. It’s been implemented in almost pure Bourne shell. The most complicated thing about it is JWS signatures, which are awful, but at least standardized. The ACME protocol is itself done over normal HTTP requests; the flow is roughly:
- You make an account and associate an ECDSA/EdDSA key with it, which subsequently authenticates all your requests.
- You then post an “order” for a certificate, specifying the DNS names you need it for. You’re given “authorization” and “finalization” URLs in return.
- You retrieve the authorizations and select challenges to prove you own the domains.
- You set up the challenges on your own side, and then poll a status URL to verify that the challenges have completed.
- You post the CSR for your certificate to the “finalization” URL.
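If it helps to see the shape of that flow, here is a toy, in-memory model of an order. It is a sketch, not a client: nothing here talks to LetsEncrypt, the class and method names are made up for illustration, and the only things borrowed from the protocol are the "pending", "ready" and "valid" status names.

```python
from dataclasses import dataclass, field


@dataclass
class Authorization:
    """One DNS name the account still has to prove it controls."""
    dns_name: str
    status: str = "pending"  # flips to "valid" once a challenge succeeds


@dataclass
class Order:
    """Toy stand-in for an ACME order (hypothetical, no network involved)."""
    dns_names: list
    authorizations: list = field(default_factory=list)
    certificate: str = ""

    def __post_init__(self):
        # One authorization per requested DNS name.
        self.authorizations = [Authorization(name) for name in self.dns_names]

    @property
    def status(self):
        if self.certificate:
            return "valid"
        if all(a.status == "valid" for a in self.authorizations):
            return "ready"
        return "pending"

    def complete_challenge(self, dns_name):
        """Pretend the CA checked a challenge for dns_name and it passed."""
        for auth in self.authorizations:
            if auth.dns_name == dns_name:
                auth.status = "valid"

    def finalize(self, csr):
        """Accept a CSR only after every authorization is valid."""
        if self.status != "ready":
            raise RuntimeError("order is not ready to finalize")
        self.certificate = f"certificate for {csr}"
        return self.certificate
```

The point of the model is the ordering: `finalize` refuses to run until every name on the order has a completed challenge, which is why the client polls before posting its CSR.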
ACME challenges are intended to verify your ownership of a domain. There are three of them (four, if you count preauthorization, which LetsEncrypt doesn’t do); originally, they were:
- http-01, in which you’re given a token to put on your server, under /.well-known/acme-challenge, and serve to LetsEncrypt’s client on 80/tcp. This is simple to describe and implement, but requires you to respond to HTTP requests on 80/tcp, which lots of people (sensibly) don’t want to do.
- dns-01, in which you’re given a token to put in a TXT record in your DNS zone. This directly proves control over a domain, but it can be hard for operators to do. In particular, especially in larger organizations, the people who need certificates are not necessarily given access to DNS configuration.
- tls-sni-01, in which you’re given a token to embed in the SAN of a certificate you serve to TLS clients who request it through TLS SNI, which is TLS’s equivalent of the HTTP “Host” header. This is more complicated to implement, but is the most seamless of the challenges: all you need to do it is to run the TLS server you were going to run anyways.
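For the HTTP and DNS challenges, the thing you actually publish is not the bare token but a "key authorization": the token joined to a thumbprint of your account key. A minimal sketch of that computation follows, assuming an EC or RSA account key already in JWK form; the function names are mine, and the token and key coordinates in the usage note are placeholders rather than real values.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without '=' padding, the encoding JOSE uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 thumbprint over the key's required members, sorted, no whitespace."""
    required = {"EC": ("crv", "kty", "x", "y"), "RSA": ("e", "kty", "n")}[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in required},
        sort_keys=True,
        separators=(",", ":"),
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, account_jwk: dict) -> str:
    """token + '.' + thumbprint of the account key."""
    return f"{token}.{jwk_thumbprint(account_jwk)}"


def http01_response(token: str, account_jwk: dict):
    """(path, body) to serve over plain HTTP on 80/tcp."""
    path = f"/.well-known/acme-challenge/{token}"
    return path, key_authorization(token, account_jwk)


def dns01_txt_record(domain: str, token: str, account_jwk: dict):
    """(record name, TXT value): the value is a digest of the key authorization."""
    digest = hashlib.sha256(key_authorization(token, account_jwk).encode("ascii")).digest()
    return f"_acme-challenge.{domain}", b64url(digest)
```

With a placeholder key such as `{"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}` and a token of `"example-token"`, `http01_response` gives you the path and body to serve, and `dns01_txt_record("example.com", ...)` gives you the record to create. Binding the token to the account key is what stops someone else from reusing a token they happened to observe.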
The Story Of tls-sni-01
But tls-sni-01 no longer exists, because it’s insecure. The problem with SNI challenges is shared hosting.
Because IP addresses are scarce, many hosting providers arrange for customers to share IP addresses. As requests arrive for customers, they’re routed based on SNI.
In the same way that you can configure a local nginx to respond to any Host header without breaking the Internet, hosting providers routinely allow people to “claim” arbitrary hostnames on their platforms. This ostensibly doesn’t matter, because without control of the DNS, you can’t get people to talk to your claimed hostname.
Similarly, hosting providers will often let you provide your own TLS certificates.
You may see where this is going already. Here’s what LetsEncrypt did to verify domain ownership using SNI:
- It generated a token for you to put in a self-signed certificate, in the form of an “.acme.invalid” hostname.
- It resolved the hostname you were generating a certificate for in the DNS and connected to it.
- It asked for the token via TLS SNI specifying the “.acme.invalid” name.
- It read the certificate generated and checked to make sure the token was present.
If a hosting provider let you claim names in the “.invalid” TLD, and upload your own certificate for them, you could get a certificate issued for all the customers hosted on your IP. Heroku let you do this, as did AWS CloudFront, and who knows who else.
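The attack is easier to see in miniature. What follows is a toy model, not anyone's real platform: `SharedHost` is a hypothetical front end that picks a certificate purely by SNI name, and `tls_sni_01_validates` imitates the check described in the steps above.

```python
class SharedHost:
    """Toy shared-hosting front end: one IP, certificates chosen by SNI name."""

    def __init__(self, allow_invalid_tld=True):
        self.allow_invalid_tld = allow_invalid_tld
        self.certs_by_sni = {}  # SNI name -> (customer, SAN names in their cert)

    def claim(self, customer, hostname, cert_sans):
        """A customer claims a hostname and uploads a certificate for it."""
        if hostname.endswith(".invalid") and not self.allow_invalid_tld:
            raise PermissionError("refusing to route .invalid names")
        self.certs_by_sni[hostname] = (customer, set(cert_sans))

    def handshake(self, sni):
        """Return the SAN names of whatever certificate is served for this SNI."""
        entry = self.certs_by_sni.get(sni)
        return entry[1] if entry else set()


def tls_sni_01_validates(host_at_victim_ip, challenge_name):
    """The CA connects to the IP the victim's name resolves to, asks for the
    challenge name via SNI, and looks for that name in the certificate."""
    return challenge_name in host_at_victim_ip.handshake(challenge_name)
```

Put a victim and an attacker on the same `SharedHost`, let the attacker claim `"deadbeef.acme.invalid"` with a self-signed certificate containing that name, and `tls_sni_01_validates` returns `True` even though the attacker never touched the victim's DNS. Construct the host with `allow_invalid_tld=False` and the claim is refused, which is roughly the fix hosting providers rolled out.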
LetsEncrypt quickly took the SNI challenges down while hosting providers deployed fixes. Ultimately, SNI was so widely used this way that CAs concluded SNI was fundamentally unsafe to use as a challenge, and the ACME SNI challenge was deprecated, and finally removed last year.
A Note About A Related Problem
This attack is an instance of a broader attack class called “subdomain takeover”, which is a mainstay among bug bounty hunters. HackerOne will tell you all about it, if you want to make $50 or so in an evening.
So, any time you’re hosting content for customer domains, you have the problem of what happens when the customer stops using your service. As you might expect, lots of times you’ll forget to stop forwarding DNS to old expired services. But your account on those services has lapsed, and that usually means that other people can claim the same names you were using. Since you’re still directing traffic to the service, the new claimant has now hijacked one of your subdomains.
Which is bad for all kinds of reasons; it allows the new claimant to steal cookies, violate CORS, bypass CSP; it even impacts OAuth2.
Fly mitigates this problem for ALPN challenges by not reusing IP addresses. Every application gets a unique, routable IPv6 address, and we won’t attempt LetsEncrypt validation unless the target hostname resolves via CNAME to that IPv6 address. (We do something similar for DNS challenges.)
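As a rough illustration of that gate (and only an illustration: this is not Fly's code, and `resolve` is a hypothetical callable standing in for a real DNS lookup), the check amounts to comparing what the hostname resolves to against the one address assigned to the app:

```python
import ipaddress


def should_attempt_validation(hostname, app_address, resolve):
    """Only kick off validation when hostname currently resolves to the
    app's own address.

    resolve is an injected callable (an assumption of this sketch) that
    returns a list of address strings for a hostname.
    """
    expected = ipaddress.ip_address(app_address)
    # Parse every answer so differently formatted addresses compare equal.
    resolved = {ipaddress.ip_address(addr) for addr in resolve(hostname)}
    return expected in resolved
```

Because the address is never shared between applications, a hostname pointing at it can only be pointing at one app, so there is no neighbor on the same IP to impersonate.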
ALPN
Recall the virtue of the tls-sni-01 challenge: it doesn’t require you to have access to your DNS configuration, nor do you need to open 80/tcp. You want a challenge that works this way. And there is one: the new third ACME challenge, tls-alpn-01.
To grok tls-alpn-01, you’ll of course need to know what ALPN is. It’s an easy concept: Imagine TLS was a transport protocol in its own right, alongside TCP and UDP; ALPN would be its port number. I mean, they’re strings, not numbers, but same idea.
Why does TLS need such a thing? Most things that use TLS have their own TCP ports already. The answer is, of course, HTTP/2. HTTP/2 isn’t wire-compatible with HTTP/1 (it’s a binary protocol optimized for pipelining). But it can’t have its own TCP port, because if it did, nobody would be able to speak it: huge chunks of the Internet are locked down to ports 80 and 443.
(We’re not, at Fly, by the way; you can run any TCP service you want here. But I digress from my digressions).
To solve this problem, when Google was designing SPDY (HTTP/2’s predecessor), they came up with NPN, “Next Protocol Negotiation”. The way NPN worked was:
- A TLS client added an NPN extension to their ClientHello, the message TLS clients send to open up a connection.
- A supportive TLS server would respond with a ServerHello that had an NPN extension populated with the protocols it supported.
- A key exchange having been completed, both sides of the connection would switch on encryption.
- The TLS client would send an encrypted NextProtocol message that chose a next protocol (which technically may or may not have been one listed by the server, if both sides were trying to be sneaky about things).
By doing this, Chrome could opt into SPDY when talking to Google servers without burning a round trip for the negotiation.
When SPDY turned into HTTP/2, something like NPN needed to get standardized, too. But the IETF tls-wg wasn’t a fan of NPN; in particular, it reversed the normal order of TLS negotiation, where the client proposes and the server chooses. So the IETF came up with ALPN, Application Layer Protocol Negotiation. ALPN works like this:
- A TLS client adds an ALPN extension to its ClientHello indicating all the protocols it supports.
- A TLS server indicates which protocol it selected in the ALPN extension in its ServerHello.
- That’s pretty much it.
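On the wire, the ALPN extension is just a length-prefixed list of protocol name strings. Here is a small sketch of encoding and decoding it, plus a server-side pick. The helper names are mine, and choosing by server preference order is an assumption about policy, not something the protocol forces.

```python
import struct

ALPN_EXTENSION_TYPE = 16  # application_layer_protocol_negotiation


def encode_alpn_extension(protocols):
    """Build the extension: type, length, then a length-prefixed name list."""
    names = b""
    for proto in protocols:
        raw = proto.encode("ascii")
        if not 1 <= len(raw) <= 255:
            raise ValueError("protocol names are 1 to 255 bytes")
        names += bytes([len(raw)]) + raw  # 1-byte length, then the name
    name_list = struct.pack("!H", len(names)) + names
    return struct.pack("!HH", ALPN_EXTENSION_TYPE, len(name_list)) + name_list


def decode_alpn_extension(data):
    """Inverse of encode_alpn_extension; returns the list of protocol names."""
    ext_type, _ext_len = struct.unpack("!HH", data[:4])
    if ext_type != ALPN_EXTENSION_TYPE:
        raise ValueError("not an ALPN extension")
    (list_len,) = struct.unpack("!H", data[4:6])
    names, pos, end = [], 6, 6 + list_len
    while pos < end:
        size = data[pos]
        names.append(data[pos + 1 : pos + 1 + size].decode("ascii"))
        pos += 1 + size
    return names


def server_select(client_offered, server_preferences):
    """Client proposes, server chooses: first server preference the client offered."""
    for proto in server_preferences:
        if proto in client_offered:
            return proto
    return None
```

A browser offering `["h2", "http/1.1"]` to a server that prefers `h2` ends up on HTTP/2; the same client talking to an HTTP/1.1-only server falls back without an extra round trip.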
There’s a clear privacy implication here, right? Because the ALPN protocol you might be asking for is “tor”. The IETF ruins everything. And that’s true, but it’s complicated.
First, the security offered by the encrypted NextProtocol frame was a little sketchy. Here’s an outline of an attack:
- Alice connects to Bob, and Mallory really wants to know what protocol Alice is going to ask for.
- Mallory MITMs the connection and downgrades its security. Remember that NPN is running /inside/ the handshake, not /after/ it, when the “Finished” message has cryptographically authenticated the handshake.
- Alice sends her NextProtocol frame to Mallory on the downgraded connection.
- Mallory drops the connection and reads the NextProtocol.
- Alice, meanwhile, re-connects, because that’s what you do, and repeats the exact same process with Bob directly, sending the same NextProtocol.
In practice, with Firefox, you could at one point do this simply by sending a bogus certificate; Firefox would complete the handshake, NPN included, even if the certificate didn’t validate.
(For what it’s worth, some of the privacy issues here got mooted in TLS 1.3).
The JPEG Cat Extension
Additionally, while privacy was doubtlessly on Adam Langley’s mind when he wrote the NPN spec, the more important problem was probably middlebox compatibility.
The way middleboxes work is, enterprises buy them. They’re quite expensive, and enterprises buy big ghastly bunches of them in one go, so vendors work really hard to win those deals. And one straightforward way to win a bakeoff is to come to it with more features than your competitors. Here’s a feature: “filter connections based on what application protocol the client selects”. The Chrome team, presumably seeing that dumb feature a mile away, took it off the table by encrypting NPN selections.
(This sounds paranoid, but only if you’ve never worked on real-world TLS. In the NPN vs. ALPN tls-wg thread, AGL cited an ISP they found in the UK that took it upon themselves to block all the ECDHE ciphersuites. Why? Who knows? People do stuff like this.)
Ultimately, ALPN beat out NPN in the tls-wg. But, just as they were wrapping up the standard, Brian Smith at Mozilla (and author of Rust’s ring crypto library) threw a wrench in the works.
It had been Mozilla’s experience that, in some cases, middleboxes would hang when they got a ClientHello that was more than 255 bytes long. Hanging is very bad, because Mozilla needed timeout logic to detect it and try a simpler handshake, but that logic would also fire for people on crappy Internet connections, and had the effect of preventing those people from using modern TLS at all.
Miraculously, a day later, Xiaoyong Wu at F5 jumped onto the thread to explain that older F5 software confused 256-byte ClientHello frames with SSLv2. TLS frame lengths are 2 bytes wide; once the ClientHello ticks past 255 bytes, the high length byte becomes 01h. That byte occupies the same point in the frame as the message type in SSLv2. To the F5, the frame could be a long-ish ClientHello… or a very long SSLV2MTCLIENTHELLO, which was also 01h. The F5 chose SSLv2.
The fix? Send /more/ bytes! At 512 bytes, the high length byte is no longer 01h. And thus was born the “jpeg-of-a-cat” extension, which AGL took the fun out of by renaming it “the TLS ClientHello Padding Extension”.
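The arithmetic is easy to check for yourself. This sketch builds a TLS record header and models the confused parser as a single-byte comparison; that model is my simplification of the behavior described above, not F5's actual logic, and `padding_needed` is a hypothetical helper that treats the frame length loosely.

```python
import struct

TLS_HANDSHAKE = 0x16       # TLS record content type for handshake messages
SSLV2_CLIENT_HELLO = 0x01  # SSLv2 message type for a client hello


def tls_record_header(payload_len, version=(3, 1)):
    """5-byte TLS record header: type, version major/minor, 2-byte big-endian length."""
    return struct.pack("!BBBH", TLS_HANDSHAKE, version[0], version[1], payload_len)


def misread_as_sslv2(header):
    """Simplified model of the bug: the high length byte (offset 3) is
    taken for an SSLv2 message type."""
    return header[3] == SSLV2_CLIENT_HELLO


def padding_needed(hello_len):
    """Bytes of padding that push a hello out of the 256..511 danger zone."""
    return 512 - hello_len if 256 <= hello_len < 512 else 0
```

Lengths 0 through 255 put 00h in the high byte, 256 through 511 put 01h there, and 512 puts 02h there, which is why padding up to 512 sidesteps the problem instead of making it worse.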
Back To ACME
This is a little anti-climactic, but we’ve come all this way, so you might as well understand how Fly (and other CDNs, and things like Caddy) generates certificates with ACME:
- We request a tls-alpn-01 challenge from LetsEncrypt for your hostname, using our ACME account.
- LetsEncrypt gives us a token, for which we generate a self-signed certificate with the token embedded, that we load into our distributed certificate storage.
- We say (in ACME) “go ahead”, and LetsEncrypt looks up the hostname we’re serving, connects to it, and sends a ClientHello with “acme-tls/1” set as the ALPN protocol.
- Our Rust proxy catches the ACME ALPN case, retrieves the challenge certificate, and feeds it to LetsEncrypt.
- LetsEncrypt drops the connection, sets the challenge to completed, and allows us to complete the certificate generation.
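The proxy's side of this boils down to one branch on the ALPN list. Below is a minimal sketch in Python rather than Rust, with plain dictionaries standing in for certificate storage. None of these names come from Fly's codebase, and the digest helper reflects my understanding that the value embedded in the challenge certificate is a SHA-256 hash of the key authorization.

```python
import hashlib

ACME_TLS_ALPN = "acme-tls/1"


def acme_identifier_digest(key_authorization: str) -> bytes:
    """Value embedded in the self-signed challenge certificate:
    SHA-256 of the key authorization string."""
    return hashlib.sha256(key_authorization.encode("ascii")).digest()


def pick_certificate(sni, client_alpn, challenge_certs, regular_certs):
    """Choose what to serve for an incoming ClientHello.

    challenge_certs and regular_certs are hypothetical dicts keyed by
    hostname; returning None means there is nothing to serve.
    """
    if ACME_TLS_ALPN in client_alpn:
        # Validation traffic: serve only a challenge certificate we
        # generated ourselves for this exact hostname.
        return challenge_certs.get(sni)
    # Ordinary traffic never sees challenge certificates.
    return regular_certs.get(sni)
```

The two lookups never cross: a customer-supplied certificate can't answer an `acme-tls/1` handshake, and a challenge certificate is never handed to a regular client.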
The ALPN challenge is more explicit than the SNI challenge; we had to specifically set up a subservice to complete ALPN challenges for customers, rather than doing it sort of implicitly based on our native SNI handling. (We wouldn’t have had the problem anyways based on how our certificate handling works, but this is the logic behind why ALPN is OK and SNI isn’t).
This process is pretty much seamless; all you have to do is say “yeah, I want a TLS certificate for my app’s custom domain”. It only works with individual hostnames, though, which may be fine, but if it isn’t, you can do a DNS challenge with us to generate a wildcard certificate.