What we learned while migrating from Chef to Kubernetes

One exciting challenge of working on the infrastructure team at Front? Continuously adapting our infrastructure as our company grows. Here are the team’s learnings from migrating to Kubernetes.

At Front, we value transparency — in our company and our product. In this series, Front Engineering Stories, our engineering team will share insights into our engineering philosophy, our unique challenges, and visions for the future of our product.

Front Software Engineers Kevin Mo and Andrew Drake explain how the team created our highly scalable infrastructure with Kubernetes.

A year ago, we had to do a full migration of our production environment from Chef to Kubernetes to handle several major engineering pain points: deploying and rolling back our code and autoscaling our servers depending on load.

For some context, we used Chef to orchestrate our infrastructure for the first few years of our existence because there is very little overhead to set up, and it only takes one cookbook to get everything up and running on a node.

However, as Front grew, the size of our infrastructure quickly outpaced our ability to manage it at a high level. Workflows that used to be manageable on a few servers no longer scaled to 100. For deploys, this meant we had to run Chef manually on each node, which not only took a long time but also meant that we sometimes forgot to upgrade a node to the latest version.

Kubernetes for infrastructure management

Luckily, there is a tool called Kubernetes that could help us better handle infrastructure management. Kubernetes is open-source software that makes it easy to run and manage containerized applications across many servers.

It treats servers as interchangeable resources: application containers are automatically placed on servers wherever they fit and can be moved around as needed. For deployments and rollbacks, we can specify what images we want to deploy in our cluster, and Kubernetes will handle rolling out the latest version with no downtime and no intervention from an engineer.
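To make "placed wherever they fit" concrete, here is a deliberately simplified first-fit sketch in Python. It is not the Kubernetes scheduler, which weighs many more factors; the node names, container names, and CPU figures are made up for illustration.

```python
from typing import Dict, Optional


def place(containers: Dict[str, int], nodes: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Assign each container to the first node with enough free CPU (in millicores)."""
    free = dict(nodes)  # remaining capacity per node; the input is left untouched
    placement: Dict[str, Optional[str]] = {}
    for name, cpu in containers.items():
        chosen = next((node for node, cap in free.items() if cap >= cpu), None)
        if chosen is not None:
            free[chosen] -= cpu
        placement[name] = chosen  # None means nothing fits yet, so the container waits
    return placement


# Hypothetical cluster: two identical nodes and three containers.
print(place({"api": 600, "worker": 600, "cron": 300},
            {"node-a": 1000, "node-b": 1000}))
```

The point is only that no container is tied to a particular server: if a node disappears, the same function run against the remaining nodes yields a new valid placement.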

Building our first cluster with kops

We knew from the start that migrating our entire infrastructure was going to be a big project. Such projects are often extremely hard to scope. So, the first order of business was to research our options on how to get a Kubernetes cluster up, period.

One of the first findings of our research was that running Kubernetes is quite complex. Its awesome flexibility means there are over a hundred options in the official setup docs, and there are dozens of options for the stack of plugins necessary to get applications running and talking to each other. A lot of work can be saved by picking a hosted solution where these low-level decisions are abstracted away.

Unfortunately for us, hosted Kubernetes wasn’t an option. Amazon’s Elastic Container Service for Kubernetes (EKS) was announced only after this project was already in flight, and adopting it early would have meant going without feature parity for a while. Migrating to a different cloud provider was not in the cards. When we looked at third-party commercial Kubernetes vendors supporting AWS, we found that we were significantly larger than their typical customers, and we were not ready to be their at-scale guinea pigs.

In the end, we decided to manage our own cluster with kops. It was not a perfect solution (as evidenced by some of our ongoing efforts to improve the setup), but it provided flexibility and seemed to be the most mature way to self-manage a cluster at the time.

kops made it extremely easy to get a cluster up and running, but there were many issues with its defaults, especially with regard to security. The first problem was that the default networking provider, kubenet, does not support network policy, which is the only mechanism provided for restricting network traffic within a Kubernetes cluster. Luckily, there were many network providers to choose from, and we eventually settled on Calico because of its native support for network policy and relative maturity.
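For readers who have not seen a network policy, the Python sketch below builds a common "default deny" policy as a plain dictionary and prints it as JSON, which kubectl accepts alongside YAML. This is an illustration rather than one of our actual policies: the namespace name "front" and the policy name are assumptions.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod in a namespace and allows no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {},           # an empty selector matches all pods in the namespace
            "policyTypes": ["Ingress"],  # no ingress rules are listed, so all ingress is denied
        },
    }


# "front" is a hypothetical namespace name.
print(json.dumps(default_deny_ingress("front"), indent=2))
```

A policy like this only has an effect when the cluster's network provider enforces it. On a provider without network policy support, the API server accepts the object but traffic is not restricted.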

Another set of security blockers we encountered was around access to various parts of the Kubernetes cluster. By default, kops allowed all nodes to access etcd and left kubelet’s anonymous authentication enabled. We worked around these issues by patching our kops setup scripts to limit access to and within our cluster.

Running Front on Kubernetes

Now that we had a basic Kubernetes cluster running, we turned our attention to getting Front running on the cluster. While doing research on how other teams managed clusters, we were disappointed that the instructions were often along the lines of kubectl apply -f https://example.org/path/to/setup.yaml, the fancy Kubernetes equivalent of curl | bash!

We generally believe in Infrastructure as Code, so manual, non-reproducible installation procedures like that were not what we were looking for. Eventually, we found a couple of tools that met our needs.

Helm for package management

The first tool is Helm, which bills itself as “the package manager for Kubernetes.” The project maintains a public repository of versioned and parameterized packages called charts, which helps tremendously with reproducibility. Unfortunately, charts still have to be installed manually, so we needed a way to check a definition of what to install into Git.

Helmfile for configuring charts on a cluster

Enter Helmfile. It’s a relatively simple tool: you provide it with a list of versioned charts, and it ensures that all of them are installed and configured on the cluster. The model is reminiscent of Terraform, which we use extensively.
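The declarative model can be sketched in a few lines of Python. This is a simplified illustration of the idea, not Helmfile's implementation: the release names and versions are hypothetical, and the sketch ignores chart configuration and removal of releases that are no longer listed.

```python
from typing import Dict, List, Tuple


def plan(desired: Dict[str, str], installed: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (action, release, version) steps that bring `installed` in line with `desired`."""
    steps = []
    for release, version in sorted(desired.items()):
        current = installed.get(release)
        if current is None:
            steps.append(("install", release, version))
        elif current != version:
            steps.append(("upgrade", release, version))
        # A release already at the desired version needs no action.
    return steps


# Hypothetical state: one release is out of date and one is missing.
desired = {"ingress": "1.2.0", "metrics": "0.3.1"}
installed = {"ingress": "1.1.0"}
print(plan(desired, installed))
```

Because the desired list lives in version control, running the same plan against a fresh cluster and against an existing one converges on the same result.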

Getting Front itself running on Kubernetes ended up being fairly obstacle-free. Packaging Front into a Docker image took less than a day, and then putting together a Helm chart was straightforward.

Challenges with Helm

We ran into a few challenges, however, with upgrading. Helm charts specify one version of the underlying application: to upgrade, you change the image identifier in the chart and publish a new chart version. Automating that process as part of our deploy would mean automated commits to our version control, or storing the chart outside version control, neither of which is ideal.

We took a different approach to work around this issue. Our Front chart doesn’t embed the image identifier directly. Instead, the chart takes it as a parameter. Deploys of a new code version update that parameter and leave the chart alone. Helm tracks this like any other change, so history is correct and rollbacks still work (which was crucial!).
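As a rough sketch of what this looks like from the deploy system's side, the Python below builds the Helm command lines without running them. The release name "front", the chart path, and the value name image.tag are assumptions for illustration; the real chart may name its parameter differently.

```python
from typing import List


def deploy_command(release: str, chart: str, image_tag: str) -> List[str]:
    """Build a `helm upgrade` invocation that changes only the image tag value."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--set", f"image.tag={image_tag}",  # the chart itself is left unchanged
    ]


def rollback_command(release: str, revision: int) -> List[str]:
    """Build a `helm rollback` invocation to a previously recorded revision."""
    return ["helm", "rollback", release, str(revision)]


# Hypothetical release name, chart path, image tag, and revision number.
print(" ".join(deploy_command("front", "./charts/front", "abc123")))
print(" ".join(rollback_command("front", 7)))
```

Each upgrade creates a new revision in the release history even though the chart version did not change, which is what keeps rollbacks meaningful.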

The last detail we had to iron out was handling Helm’s server-side component, Tiller, which tracks the version of the chart that is running. This component has no internal authorization, which means anyone who can perform one type of operation against Tiller can do pretty much everything; it’s sometimes called a “giant sudo server.”

We weren’t exactly comfortable exposing that level of access so broadly, so we built a tiny service that runs alongside Tiller and exposes a more restricted set of actions. With the deploy system using that new service, we can leave Tiller isolated and keep granted privileges to a minimum.
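The core of such a service is an allowlist check. The Python sketch below shows the general pattern only; the specific actions and release names are hypothetical and are not a description of the service we actually run.

```python
from dataclasses import dataclass

# Hypothetical allowlists: the only operations and targets the deploy system may request.
ALLOWED_ACTIONS = {"upgrade", "rollback", "status"}
ALLOWED_RELEASES = {"front"}


@dataclass(frozen=True)
class Request:
    action: str
    release: str


def authorize(req: Request) -> bool:
    """Allow a request only if both its action and its target release are allowlisted."""
    return req.action in ALLOWED_ACTIONS and req.release in ALLOWED_RELEASES


print(authorize(Request("upgrade", "front")))  # an allowlisted deploy
print(authorize(Request("delete", "front")))   # an action outside the allowlist
```

Anything that passes the check is forwarded to Tiller, and everything else is rejected before it gets there, so the deploy system never needs direct access to Tiller.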

Preparing for Launch

Surprises are great in some contexts, but production isn’t one of them. With a functional development cluster handy, we had a great opportunity to rehearse most of the operations that we might need to perform on the production cluster.

Testing, 1, 2, 3...

We tested upgrades for pretty much all of the software we were using (including Kubernetes itself) and documented any sharp edges. We caused various sorts of system failures and watched what happened: for virtually all issues, Kubernetes could recover on its own; for the rest, we added alerts and playbooks.

We used the tooling we built for the development cluster to launch the production clusters. We deployed an instance of Front configured for production, but with all components set to zero scale. For each type of component, we verified that we could adjust the scale up and down using weighted DNS, and that the processes could handle traffic correctly from inside the cluster at low scale. This let us build up confidence that the migration process would go smoothly. It was a little bit like a dress rehearsal.
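Weighted DNS splits traffic between two sets of records in proportion to their weights, so a migration can be expressed as a series of weight pairs. The Python sketch below generates such a ramp in equal steps; the number of steps and the 0 to 100 weight scale are assumptions, and our actual schedule was not this regular.

```python
from typing import List, Tuple


def ramp(steps: int, total: int = 100) -> List[Tuple[int, int]]:
    """Return (old_weight, new_weight) pairs that move traffic to the new cluster in equal steps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    schedule = []
    for i in range(steps + 1):
        new = total * i // steps  # share of traffic sent to the new cluster
        schedule.append((total - new, new))
    return schedule


# Hypothetical four-step ramp from all-Chef to all-Kubernetes.
for old_weight, new_weight in ramp(4):
    print(f"chef={old_weight} kubernetes={new_weight}")
```

Starting each component at a small new-cluster weight is what makes the low-scale verification possible: if something misbehaves, setting the weight back to zero returns all traffic to the old nodes.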

Training the team

Concurrently, we started to communicate more broadly about the state of the migration. Most of our developers had little to no previous Kubernetes exposure, so we wanted to make sure they had ample opportunity to prepare before their code was running on it. We developed and ran several training sessions on Kubernetes fundamentals and the new workflow. We also held a workshop where we had people get Kubernetes tools set up and running on their development machines.

Launch and Aftermath

Finally, it was time to go live. With the process already planned and tested, we started scaling up components and shifting over traffic from the Chef nodes. The migration started off slowly, as there were still some DNS-related issues that we had to iron out. Once those were resolved, we were able to ramp up really quickly.

The overall migration took a month, but half of our components were migrated in the last week!

As soon as we finished, we saw the fruits of our labor. Our deployment times had been cut by 80%, and the whole process was now fully automated.

As a company that deploys ~20 times per day, this was a major win for our engineering team. Rollbacks were now super straightforward as they are versioned and managed by Helm, making an otherwise nerve-wracking process much simpler. On top of all this, we now had the foundation for building autoscaling for our cluster (stay tuned for a blog post on this).

This project came with challenges, and it left us with a couple of takeaways that apply more broadly.

1. Migrations of this size take a ton of research, and the best solution and implementation are seldom straightforward. As alluded to earlier, we spent a great deal of time evaluating options before landing on kops (which was far from perfect out of the box) for cluster management. It was nevertheless necessary to make an informed but quick decision so we could deliver tangible results for the entire team.

2. The Kubernetes migration really tested our ability to balance quality and practicality, a trade-off that will come up time and time again.

With many projects ahead of us — some of which build on top of the initial Kubernetes work — we’ll take the learnings from this launch to improve our execution for the future 💪

Looking to work on challenging engineering problems with a creative and driven team? Check out frontapp.com/jobs. We’re hiring in our San Francisco and Paris offices!

Written by Kevin Mo and Andrew Drake

Originally Published: 17 April 2020