Fly users are noticing faster, more reliable logs. In this article, find out what happened behind the scenes to make that possible.
Since Fly launched, we’ve been collecting and managing logs for all the applications running on the Fly platform. It’s a critical but rarely noted function of the platform. When you type `flyctl logs`, behind the scenes there is a lot of computing power and storage being brought to bear. Over the last few weeks, and transparently to users, the entire logging platform has been replaced with a new one, along with a new approach to working with logs. We talked to the person who drove the change, Steve Berryman.
Dj: How did this project begin?
Steve: The previous logging system was built around seven fairly large, centralized Graylog servers. Even with all of them running, the volume of logs we got at various times of the day couldn’t be processed fast enough, and that meant log messages were dropped.
Dj: What kind of volume are we talking about?
Steve: Between 20,000 and 30,000 logs per second.
In theory, we could have just expanded the servers and added more power to manage things. The way logs got to customers, though, was by polling the Graylog API, and polling isn’t really the nicest way to work with an API. But there wasn’t any other way to get that information out.
Dj: What were the options?
Steve: We considered Kafka and other message streaming services, but they would have been another big tool in the chain. Jerome pointed out to me that there was a new log processing tool called Vector. It’s written in Rust and it’s very efficient. Like Logstash, it does all the capture, processing, and transformation of logs. We initially thought of using it to send things to Graylog, with Vector running on each server, but we found that Vector didn’t support the various Graylog protocols. It was then we had an idea.
Dj: Which was?
Steve: Why even bother sending the logs to Graylog if we’re already processing them in Vector on each server? Send the logs straight to Elasticsearch and take out the Graylog middleman. All Elasticsearch has to do then is index the logs and retrieve them.
Vector runs on every server, and logs go straight from `journald`, or other applications, into Vector. There it parses them and runs a number of transforms on them. For example, that includes dropping journald fields we aren’t interested in, some regex parsing, and renaming things so they fit the new schema a little more nicely. It then ships the results to Elasticsearch, which happily takes them in.
Dj: A new schema?
Steve: Yes, although people can’t see it externally, I decided to move us to using ECS, the Elastic Common Schema. It’s a general log schema they’ve defined for various purposes, with a lot of common fields already defined - there are file fields, log fields, network source and destination fields, geo fields, HTTP fields, TLS fields and more. It also lets us add our own fields, so we have fields for Fly app names, Fly alloc IDs, Fly regions and other Fly-related things.
The good part is that, with all the apps feeding logs in according to the schema, searching across apps becomes much easier. Searching for, say, a source IP address across different apps may have meant searching for “src.IP”, “source-IP”, “IP.source” and any other variation. With a common schema, we know that if there is a source IP address, it’ll be in the `source` field. It’s nice to know what you’re looking for regardless of application.
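To make that concrete, here’s roughly what a log event shaped along those lines could look like once it’s indexed. The top-level ECS field names are real ones; the `fly.*` names and all the values here are just illustrative, not Fly’s actual document layout:

```json
{
  "@timestamp": "2020-03-20T12:34:56.789Z",
  "message": "GET /health 200",
  "log": { "level": "info" },
  "source": { "ip": "203.0.113.7" },
  "http": { "request": { "method": "GET" } },
  "fly": {
    "app": { "name": "example-app" },
    "alloc": { "id": "0f2e6e61" },
    "region": "ams"
  }
}
```

A query like `source.ip: 203.0.113.7` then matches this event no matter which app produced it.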
Obviously, not everything will follow this schema, but as soon as you find something you want to parse or transform, you can add a bit of config into the config management system for Vector. The Vector configuration language is pretty simple: it’s a bunch of TOML that defines sources, transforms, and sinks. Logs come in through the sources. Transforms then modify the log’s structure, adding or removing fields, or making other changes. The result is then sent on to one or more sinks. The important sink for us is Elasticsearch.
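To give a flavour of that, here is a minimal sketch of what such a pipeline could look like. The component names and fields are illustrative rather than our production config, and some option names have shifted between Vector releases:

```toml
# Source: read logs from the local journald journal.
[sources.journald_in]
  type = "journald"

# Transform: drop journald fields we aren't interested in.
[transforms.strip_journald]
  type = "remove_fields"
  inputs = ["journald_in"]
  fields = ["_CAP_EFFECTIVE", "_SYSTEMD_CGROUP"]

# Transform: rename fields so they line up with the ECS-style schema.
[transforms.rename_for_schema]
  type = "rename_fields"
  inputs = ["strip_journald"]
  [transforms.rename_for_schema.fields]
    SYSLOG_IDENTIFIER = "log.logger"

# Sink: ship the result straight to Elasticsearch for indexing.
[sinks.es_out]
  type = "elasticsearch"
  inputs = ["rename_for_schema"]
  host = "http://elasticsearch.internal:9200"
  index = "logs-%Y-%m-%d"
```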
Deploy those Vector config changes and the configuration management system will distribute them to all the servers; from then on, all the logs will reflect that change. That’s a bit more work than before with Graylog.
Dj: Why’s that?
Steve: Graylog’s rules and transformations are centralized on the Graylog servers, so it was one place to change things. I like Graylog a lot, but it is a big, heavy, enterprise Java app that does a lot, and for which centralization makes sense. But it also centralizes the work that needs to be performed on logs.
With Vector, we’ve distributed that work out to all the servers where it barely registers as load and we get to manage that with our own configuration system. We can also add the hardware that was servicing Graylog to the Elasticsearch fleet to make Elasticsearch perform better.
As an aside, one cool feature of Vector is that it allows us to unit test configurations locally. Along with the validate command for checking configurations, it means we have the tools to efficiently check configurations before deployment, giving us a lot more confidence.
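As a sketch of how that looks in practice - with the caveat that the exact test syntax depends on the Vector version, and the names here match the illustrative config above rather than anything real - a unit test lives alongside the pipeline in the same TOML:

```toml
# Feed a fake journald-style event into the strip_journald transform
# and check that the noisy field is gone but the message survives.
[[tests]]
  name = "strips noisy journald fields"

  [[tests.inputs]]
    insert_at = "strip_journald"
    type = "log"
    [tests.inputs.log_fields]
      message = "request completed"
      _SYSTEMD_CGROUP = "/system.slice/example.service"

  [[tests.outputs]]
    extract_from = "strip_journald"
    [[tests.outputs.conditions]]
      type = "check_fields"
      "message.eq" = "request completed"
      "_SYSTEMD_CGROUP.exists" = false
```

Running `vector test` against the config file executes those assertions without shipping a single log anywhere.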
Dj: How long did this take?
Steve: In all, about two weeks alongside other work. The actual implementation of getting Vector onto servers and feeding Elasticsearch didn’t take long at all. The bulk of the work was in the snagging: getting all the little issues handled, from configuring mappings and schemas on Elasticsearch and getting the encryption support working right, to fixing up various fields’ content and making it all run smoothly. The Vector Discord channel was very helpful too, with Timber.io devs participating in the chat.
Dj: So we’ve got a distributed log collection and processing platform with Vector. What about getting that data to users?
Steve: Currently, we have the API servers picking up the data from Elasticsearch using the same polling mechanism as before. Now, though, we can optimize that and make it more searchable and flexible.
How `flyctl` Displays Logs
We normally tail logs by polling a cursor against a REST endpoint. When a deployment fails, we also want to show logs for the failed allocations. So we query our GraphQL API for the failed allocations, and for each one of them, the GraphQL resolver gets the last N lines of log entries for that allocation’s id. The problem was that, with the old logging system, there was a good chance the logs had not been processed by the time this query was made. With the new system, it’s fast enough that the logs are already available.
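Conceptually, the failed-deployment path boils down to a query along these lines. The shape and field names here are purely illustrative - they are not Fly’s actual GraphQL schema:

```graphql
# Hypothetical query: names are illustrative, not the real Fly API.
query FailedAllocationLogs($appName: String!) {
  app(name: $appName) {
    latestDeployment {
      failedAllocations {
        id
        region
        # The resolver fetches the last N log lines for this
        # allocation id from Elasticsearch.
        recentLogs(limit: 25) {
          timestamp
          message
        }
      }
    }
  }
}
```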
One of the cool things we’ll be able to do with the new logging platform - we’ve not done it yet - is point logs at different log service endpoints. Eventually, we hope to be able to send logs to Papertrail, Honeycomb, Kafka, possibly anything with a Vector sink component.
Dj: Beyond that, any specific plans?
Steve: Probably incorporating internal metrics into the platform so we can see how efficiently we are handling logs, and optimizing all our feeds into Vector, especially with Firecracker logs.
We are better set up for the future with this new architecture, so who knows what’s next.