How we adopted Kubernetes for our data science infrastructure


(Related posts: How we review code at Pew Research Center, How Pew Research Center uses git and GitHub for version control, How we built our data science infrastructure at Pew Research Center)

When we started Pew Research Center’s Data Labs unit in 2015, we faced a fundamental question: What infrastructure do we need to build to support the ongoing work of a computational social science team?

This was no easy task. There was no standard “data science stack” that we could simply copy and implement. We found lots of online resources that help individual researchers cobble one together, as well as technological stacks and engineering designs better suited for larger enterprises with predictable workflows and dedicated platform teams. But it was hard to get a clear picture of how other social science research teams like ours were addressing their broad engineering needs.

In a previous Decoded post, we told you about our use of JupyterHub, which is the primary “front-end” tool our researchers use to conduct their analyses and interact with our shared infrastructure. In this post, we’ll go one step deeper and delve into Kubernetes – the back-end tool that powers the systems our research team interacts with. We’ll discuss:

  • Problems Kubernetes solves for us
  • Challenges we’ve encountered along the way
  • How Kubernetes has changed the relationship between engineers and researchers in our shop

Laying the groundwork

Managing a data science team means addressing the technical needs of a group of people who are all doing work that’s too large or too computationally intensive to simply live on their laptops.

For an individual researcher, the solution to a dataset that no longer fits in local memory or a model that takes a very long time to run is simple: Just use a bigger computer. Modern cloud computing vendors offer easy access to servers with almost any specifications imaginable for as long as someone might need them.

Scaling up a personal machine can work reasonably well for an individual, but it’s not ideal for a team. For one thing, costs can quickly get out of hand if multiple people spin up machines whenever they need them. And when researchers need to collaborate and share access to resources, everyone on the team has to know how to secure and manage their own machine and be conscientious about using expensive resources.

This is also complicated by the fact that our research output covers a wide range of topics and methods. As a result, most of our computing needs are project specific. To be sure, some tasks – running inference with large pretrained machine learning models, transforming large datasets using as much memory as we can afford, or labeling data – are regular features of our work. But the exact mix of resources those tasks require is hard to predict in advance, and the tools and specifications that researchers need change frequently.

Ultimately, we needed to give our researchers and engineers the ability to try things out – whether it’s a new application, process or design – and then move on quickly to the next project. That led us to prioritize a common basic infrastructure that could expand and contract with the needs of any given project on our collective docket – flexible both in how we provision the right resources for a project and in which workflows and tools we can support.

The computing cluster

We achieved that flexibility by making a cloud-based computing cluster the centerpiece of our infrastructure and using an open-source tool called Kubernetes to organize and control it. The cluster lets us offer resources virtually on demand without having to configure or manage machines for researchers. Our researchers can access the resources they need without alerting our engineering team, except on rare occasions when they need special resources.

Kubernetes is the software layer that makes this design possible. It is a popular tool for managing a cluster, and integrates tasks like provisioning, networking and monitoring under a single framework. It also performs autoscaling – that is, giving more or fewer resources to different applications as needed, while making sure that all the components stay healthy. Kubernetes can integrate with a variety of other applications and offers the robust set of safeguards that our security team expects.
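
To make that concrete, here’s a minimal sketch – illustrative only, not our production tooling – that uses the official Kubernetes Python client (the kubernetes package) to read cluster state through one API. It assumes you already have access to a cluster via a local kubeconfig file.

```python
# Minimal, illustrative sketch using the official Kubernetes Python client
# ("pip install kubernetes"). It assumes an existing cluster and kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # authenticate using the current kubeconfig context
core = client.CoreV1Api()

# Nodes: the cloud machines that Kubernetes provisions and monitors.
for node in core.list_node().items:
    print("node:", node.metadata.name, node.status.node_info.kubelet_version)

# Pods: the containerized workloads (notebook servers, pipeline tasks, etc.)
# that Kubernetes schedules onto those nodes and keeps healthy.
for pod in core.list_pod_for_all_namespaces().items:
    print("pod:", pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```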

Beyond these criteria, Kubernetes is one of the recommended ways to run JupyterHub, the platform we use to offer R and Python to our researchers. That made it a natural choice for scaling up our analytical infrastructure. In fact, JupyterHub was the first major researcher-facing tool that we migrated to our new cluster – in part because JupyterHub is specifically designed to hide the technical complexity of whatever system it’s running on. We quickly began to explore what other parts of our infrastructure could benefit from migrating to the cluster.

One benefit of an infrastructure supported by Kubernetes is that it is easy to deploy new tools. When installing applications on a single machine, engineers have to worry about things like environment consistency or making sure that there are no conflicts with existing applications. On the cluster, applications are “containerized,” or packaged in a standalone unit with their own environment and dependencies. Applications on the cluster can also be replicated or moved between nodes without an engineer’s intervention. Put simply, Kubernetes reduces the overhead of installing new applications and makes it much easier to try things out.
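
As a rough illustration of what a containerized application looks like to the cluster, here’s a hedged sketch that describes a small internal tool as a Kubernetes Deployment using the Python client. The application name, container image and namespace below are placeholders, not references to our actual services.

```python
# Illustrative sketch only: names, image and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="annotation-tool"),
    spec=client.V1DeploymentSpec(
        replicas=2,  # Kubernetes keeps two copies running and replaces failed ones
        selector=client.V1LabelSelector(match_labels={"app": "annotation-tool"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "annotation-tool"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="annotation-tool",
                        # A self-contained image with the app and its dependencies
                        image="registry.example.org/annotation-tool:1.0",
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "500m", "memory": "1Gi"},
                        ),
                    )
                ]
            ),
        ),
    ),
)

apps.create_namespaced_deployment(namespace="tools", body=deployment)
```

Because everything the application needs ships inside its container image, the same description works no matter which node the cluster places it on – which is what makes replication and rescheduling automatic.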

Now that we have a Kubernetes cluster, we not only prefer to run our own applications there – we explicitly favor tools that are suitable for Kubernetes. Our cluster now runs workflow management tools like Airflow and Prefect, custom GitHub runners that we use to build internal packages, internal chatbots, data annotation tools, internal websites for documentation, and a large variety of smaller services.

At the same time, we’ve learned that Kubernetes is challenging to incorporate into a research team. Moving from a single-machine mindset to orchestrating services across multiple machines means getting used to new terminology, abstractions, and ways of defining, structuring and solving problems. It’s also a hefty piece of infrastructure: It adds overhead, requires sustained attention to maintain, and takes significant upfront work to deploy. There are clear benefits to that investment – starting with the consistency it brings to how we operate – but they come at the cost of the additional engineering resources needed to maintain the cluster.

Researchers and Kubernetes

As the cluster has become more central to how we do our work, we’ve faced a new question: How much, if at all, do we want researchers to interact with Kubernetes directly?

Computational social science is unique in the research world in that researchers frequently play the role of de facto software developers. This is especially true during the data collection phase of projects, when the research team may write many small, long-running applications that consume information from the internet. The cluster is an appealing way to run these applications, provided that our researchers all have exposure to software development workflows and are fully capable of deploying and managing their own resources.
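
For instance, a collection script that has already been packaged as a container image could be handed to the cluster as a Kubernetes Job, which runs it to completion and retries it if it fails. The sketch below is purely hypothetical – the script, image and namespace are placeholders – but it gives a sense of what researchers would need to write or understand.

```python
# Hypothetical sketch: "collect-headlines", its image and the namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="collect-headlines"),
    spec=client.V1JobSpec(
        backoff_limit=3,  # retry the pod up to three times before giving up
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="collector",
                        image="registry.example.org/collect-headlines:latest",
                        command=["python", "collect.py"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="research", body=job)
```

Even a small example like this presupposes familiarity with container images, pods, namespaces and restart policies.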

However, interacting with Kubernetes can be daunting for researchers, simply because Kubernetes is too new and too far down the engineering stack for most researchers to have even passing familiarity with it. Moreover, researchers’ interactions with it are intermittent enough that most will never become highly knowledgeable about the platform’s features and pitfalls. Social science researchers are not software developers, and that can create persistent friction: The more researchers need to interact with Kubernetes to carry out their everyday work, the more significant the friction will be. This can be mitigated by investing in staff training or in additional engineering to create interfaces for researchers, but it’s a cost that needs to be paid up front.

It’s tempting to put the cluster at the center of every researcher’s workflow, but for most social science shops, this imposes too much technical labor on the research team. Our current solution is to embed a data science engineer in our research projects. In addition to helping the research team in a more traditional platform support role, the engineer helps design, build and prepare research applications for deployment to the cluster. We’re also watching the rapid growth of the Kubernetes ecosystem itself and expect that tools in that space, such as coding assistants, will make interfacing with the cluster less onerous.

Conclusions

We’ve found significant value in using Kubernetes to run our data science platform. We recommend it if your team frequently tries out new tools and needs to manage substantial, shifting computing resources. And if your projects involve computation-heavy tasks that require scaling resources up and down, Kubernetes can make operations significantly smoother.

Still, other organizations should think carefully before adopting Kubernetes. Many don’t need to run dozens of different applications and supporting systems, rarely need to scale up to process large amounts of data, or lack the skills or dedicated staff to maintain and support a cluster. Had we been in that situation, we probably wouldn’t have invested in Kubernetes as heavily as we have. It’s crucial to factor in your team’s skill set and the investment required for a successful implementation.
