Imagine inviting random strangers from the internet to bring along their code and run it on your servers in a Rails app. Sounds like a security nightmare, doesn’t it? Where do you even start?
If you run into a person at Fly.io, they might be saying something about “Fast booting VMs”, but what does that mean outside of faster deployment times?
Turns out when an entire machine can be booted in 2 seconds or less, it becomes possible to boot a server via a Rails background job, analyze a stranger's code from within the confines of a virtual machine, and shut it down when the job is complete.
Sounds complicated, right? It is, but Fly.io built the Machines API to manage all that complexity so you can spend your time and energy sweating the details of your app.
The Problem
Inspecting or executing arbitrary code from third parties comes with a lot of risks.
First off, there's a threat to the application that has to run it: It could be targeted by malicious code that attackers introduce to bring the app down, extract passwords, etc.
Then, there's a risk for the customers of such an application: They could be exposed to malicious code from other, malevolent customers that aims to extract data or intellectual property.
The former is mainly a security issue for the application's operation, while the latter is business critical as it undermines trust between customer and SaaS provider.
Luckily, Fly.io boasts a solution that provides a safe environment to deploy such workloads and is simple to manage: Fly Machines.
The Context
Attractor is a code quality analysis tool that relies on churn and complexity metrics to measure how tech debt evolves for a typical Ruby (on Rails) or JavaScript app.
At its heart lies a GitHub app that clones, inspects, and optionally runs third-party code. A static analysis is conducted and the results are reported back to the main app.
Since any paying customer can connect any GitHub repository, it would be possible to compromise the application (and customer data) if user code were cloned onto the main app machine. A way to safely inspect and possibly run it had to be found.
The Solution
Attractor is a Ruby on Rails app deployed on Fly.io with
- One process running the app server (puma)
- Two worker (sidekiq) queues: default and sandbox
- A Fly Machines app to create and run machines on the fly. Important: make sure you run this app in a private network for true isolation, as pointed out here.
When new code changes come in via a pull request, it uses a SandboxRun model to encapsulate such a workload:
class SandboxRun < ApplicationRecord
  has_secure_token

  belongs_to :github_pull_request, class_name: "Github::PullRequest"

  after_create_commit :start

  def invalidate
    self.invalidated_at = Time.now.utc
  end

  def invalidate!
    invalidate
    save!
  end

  def invalidated?
    !!invalidated_at
  end

  private

  def start
    return if invalidated?

    SandboxRunJob.perform_later(self)
  end
end
After a SandboxRun record is created, it self-executes via a SandboxRunJob:
class SandboxRunJob < ApplicationJob
  queue_as :sandbox

  def perform(sandbox_run)
    @sandbox_run = sandbox_run
    boot_sandbox
  end

  private

  def boot_sandbox
    # Serialize a hash instead of hand-writing JSON, so the body is always valid
    res_create = conn.post(
      "apps/my-app-machines/machines",
      machine_params.to_json,
      {"Content-Type" => "application/json"}
    )

    # abort processing if machine start failed
    if res_create.status >= 400
      raise SandboxStartupError, res_create.body["error"]
    end

    @sandbox_run.fly_machine_id = res_create.body["id"]
    @sandbox_run.save!
  end

  def machine_params
    {
      name: "sandbox-machine-#{@sandbox_run.id}",
      config: {
        image: "my-sandbox-image:latest",
        guest: {memory_mb: 512, cpu_kind: "shared", cpus: 1},
        # never restart, so a finished or failed sandbox stays stopped
        restart: {policy: "no"},
        env: {
          SANDBOX_RUN_ID: @sandbox_run.id.to_s,
          SANDBOX_RUN_TOKEN: @sandbox_run.token
        }
      }
    }
  end

  def conn
    @conn ||= Faraday.new(
      url: ENV.fetch("FLY_API_URL", "http://_api.internal:4280/v1")
    ) do |faraday|
      faraday.request :authorization, "Bearer", ENV["FLY_API_TOKEN"]
      faraday.response :json
    end
  end
end
This job boots a sandbox by issuing a POST request to the Fly Machines app (my-app-machines). It spawns a machine from a Docker image (my-sandbox-image:latest) that has to be present in your organization's registry. The machine is also passed two environment variables (SANDBOX_RUN_ID and SANDBOX_RUN_TOKEN) to identify the sandbox run. Critically, the restart policy is set to no to avoid infinite loops.
The logic that runs in the actual sandbox is secondary; it simply sends a JSON payload in a POST request to an incoming webhooks controller:
class SandboxWebhooksController < ApplicationController
  # some details omitted
  before_action :authenticate_token!

  def create
    SandboxWebhook.create(data: JSON.parse(request.body.read)).process_async

    render json: {status: "OK"}, status: :created
  end

  private

  def authenticate_token!
    @sandbox_run ||= sandbox_run_from_token

    head :unauthorized unless @sandbox_run.present? && !@sandbox_run.invalidated?
  end

  def sandbox_run_from_token
    SandboxRun.find_by(token: token_from_header)
  end

  def token_from_header
    request.headers.fetch("Authorization", "").split(" ").last
  end
end
Note that the sandbox run is authenticated via a unique secure token that we passed to the sandbox machine as an environment variable (SANDBOX_RUN_TOKEN). Optionally, precautions can be taken to make this endpoint accessible only from the internal network.
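For illustration, the sandbox side of this exchange could look roughly like the sketch below. It uses only Ruby's standard library and builds the request without sending it. The helper name build_callback_request, the WEBHOOK_URL variable, the example URL, and the "results" key are assumptions made for this sketch; the Bearer header and the sandbox_run id field follow what the controller and the webhook model in this article read.

```ruby
require "json"
require "net/http"
require "uri"

# Builds the POST request a sandbox could send back to the main app.
# The webhook URL and the "results" key are assumptions for this sketch.
def build_callback_request(webhook_url:, run_id:, token:, results:)
  payload = {
    "sandbox_run" => {"id" => run_id},
    "results" => results
  }

  request = Net::HTTP::Post.new(URI(webhook_url))
  # The controller takes the last space-separated part of this header as the token
  request["Authorization"] = "Bearer #{token}"
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(payload)
  request
end

# Inside the machine, the run ID and token come from the environment
# variables set at boot; the fallback values here are placeholders.
request = build_callback_request(
  webhook_url: ENV.fetch("WEBHOOK_URL", "http://my-app.internal:3000/sandbox_webhooks"),
  run_id: ENV.fetch("SANDBOX_RUN_ID", "42"),
  token: ENV.fetch("SANDBOX_RUN_TOKEN", "example-token"),
  results: {"files_analyzed" => 0}
)
```

Sending the request is left out on purpose; in a real sandbox it would go through Net::HTTP or whichever HTTP client the image ships with.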
class SandboxWebhook < ApplicationRecord
  # module includes omitted
  # (`conn` is assumed to be the same Faraday connection as in SandboxRunJob)

  def process
    @sandbox_run = SandboxRun.find(data["sandbox_run"]["id"])

    return if @sandbox_run.invalidated?

    # process incoming payload

    @sandbox_run.invalidate!
  ensure
    # always clean up the machine, even if processing raised
    teardown_sandbox
  end

  private

  def teardown_sandbox
    # block until the machine has stopped, then destroy it
    _res_wait = conn.get "apps/my-app-machines/machines/#{@sandbox_run.fly_machine_id}/wait",
      {
        state: "stopped",
        instance_id: machine_instance_id
      },
      {
        "Content-Type" => "application/json"
      }

    res_delete = conn.delete "apps/my-app-machines/machines/#{@sandbox_run.fly_machine_id}"

    if res_delete.status >= 400
      raise SandboxShutdownError, res_delete.body["error"]
    end

    res_delete
  end

  # the /wait endpoint needs the instance ID, not the machine ID
  def machine_instance_id
    res_machine = conn.get("apps/my-app-machines/machines/#{@sandbox_run.fly_machine_id}")
    res_machine.body["instance_id"]
  end
end
In the created SandboxWebhook model the actual payload processing takes place, which isn't really of interest. We have to take care, though, that the corresponding sandbox run is invalidated so it doesn't get executed a second time.
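The invalidation guard can be tried out without Rails. In the minimal sketch below, FakeRun is a hypothetical stand-in for the SandboxRun record and process_once is a hypothetical helper that mirrors the guard in the process method; neither is part of the actual app.

```ruby
# Stand-in for the SandboxRun record (hypothetical, no database involved).
FakeRun = Struct.new(:invalidated_at) do
  def invalidated?
    !invalidated_at.nil?
  end

  def invalidate!
    self.invalidated_at = Time.now.utc
  end
end

# Handles a payload once, then marks the run as invalidated so that a
# repeated delivery for the same run is ignored.
def process_once(run, payload, results)
  return :skipped if run.invalidated?

  results << payload
  run.invalidate!
  :processed
end

run = FakeRun.new
results = []
process_once(run, {"score" => 1}, results) # first delivery is handled
process_once(run, {"score" => 1}, results) # duplicate delivery is skipped
```

After both calls, results holds the payload only once, because the second delivery hits the guard.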
The more salient part of this model for the purposes of this article is the teardown of the sandbox machine. We want to clean up after the sandbox has run; otherwise we would have dangling machines that add to our bill. To destroy a machine, though, we have to wait for it to become stopped. This is done via a special /wait endpoint, to which we pass the desired state and the machine's instance ID.
Beware: This is different from the machine’s ID, which is why we have to invoke another endpoint to obtain it.
The /wait call blocks until the machine reaches the desired state. Afterwards we can destroy it, raising an error if the deletion fails.
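The same wait-then-destroy sequence can be exercised without touching the real API. The sketch below accepts any client that responds to get and delete in the way the Faraday connection does; destroy_machine, FakeResponse, and FakeClient are hypothetical names. It also assumes that a wait call may come back with an error status before the machine has stopped (for example on a timeout), and retries a few times in that case. The 408 status used by the fake client is only an illustration.

```ruby
FakeResponse = Struct.new(:status, :body)

# Looks up the instance ID, waits for the "stopped" state, then deletes.
# `client` is anything that responds to get(path, params) and delete(path).
def destroy_machine(client, app:, machine_id:, max_waits: 3)
  base = "apps/#{app}/machines/#{machine_id}"
  instance_id = client.get(base).body["instance_id"]

  # Retry the wait call until it succeeds or the attempts run out
  stopped = max_waits.times.any? do
    client.get("#{base}/wait", {state: "stopped", instance_id: instance_id}).status < 400
  end
  raise "machine #{machine_id} did not stop in time" unless stopped

  response = client.delete(base)
  raise "delete failed: #{response.body["error"]}" if response.status >= 400
  response
end

# In-memory stand-in that records calls; its first wait call fails.
class FakeClient
  attr_reader :calls

  def initialize
    @calls = []
    @waits = 0
  end

  def get(path, params = nil)
    @calls << [:get, path]
    return FakeResponse.new(200, {"instance_id" => "inst-1"}) unless path.end_with?("/wait")

    @waits += 1
    FakeResponse.new(@waits == 1 ? 408 : 200, {})
  end

  def delete(path)
    @calls << [:delete, path]
    FakeResponse.new(200, {"ok" => true})
  end
end

client = FakeClient.new
destroy_machine(client, app: "my-app-machines", machine_id: "abc123")
```

Passing the client in as an argument keeps the sequencing logic testable; in the app itself the argument would be the existing connection.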
Wrap-up
To separate user code from our own application, we picked Fly Machines to run ephemeral, isolated workloads. We've shown a way to integrate these sandboxes and the results they produce into an idiomatic Rails workflow. In the future, hopefully the verbosity of the API integration will be replaced by an official Fly SDK to create, start, and destroy machines.