Building a Kubernetes infrastructure for CERN’s Content Management Systems

The infrastructure behind home.cern and 1000 other Drupal websites serves more than 15,000 unique visitors daily. To best serve the site owners, a small engineering team needs development speed to adapt to their evolving needs and operational velocity to troubleshoot emerging problems rapidly. We designed a new Web Frameworks platform by extending Kubernetes to replace the ageing physical infrastructure and reduce the dependency on homebrew components. The new platform is modular, built around standard components and thus less complex to operate. Some requirements are covered solely by upstream open source projects, whereas others are covered by components shared across CERN's web hosting platforms. We leverage the Operator framework and the Kubernetes API to get observability, policy enforcement, access control and auditing, and high availability for free. Thanks to containers and namespaces, websites are isolated. This isolation clarifies security boundaries and minimizes the attack surface, while empowering site owners. In this work we present the open-source design of the new system and contrast it with the one it replaces, demonstrating how we drastically reduced our technical debt.


Introduction
Google rewrites most of their software every few years [1]. This surprising activity consumes a large fraction of their resources each year! And yet, they consider it crucial to their agility and long-term success, because software requirements change as technologies evolve, and with them user expectations. This practice typically reduces unnecessary complexity in each iteration and transfers knowledge to the new generation of engineers.
All these factors apply as much to CERN as they apply to Google. Despite the much slower pace at which services evolve, CERN lives in the same dynamic technological environment. Without a constant input of effort, our software falls behind and fails to meet modern expectations in features and in aspects such as security, high availability, portability and isolation. At the same time, unnecessary complexity accumulates: yesterday's custom solutions can often be replaced with new upstream components, the product of continuous standardization of solutions to problems that affect entire industries.
The increasing divergence between the original requirements that specified a piece of software and the present requirements results in technical debt [2]. The main purpose of this work is to pay back technical debt in CERN's Content Management Systems by modernizing the software architecture and making the service more secure and flexible (see section 4.1).

Why Kubernetes?
Kubernetes is to cloud-native applications what the operating system is to traditional applications. It is becoming the de facto standard for Platform as a Service [3], abstracting computational infrastructure and standardizing deployment, so that an application can run unmodified on sites across the globe. Scientific applications are routinely deployed on Kubernetes [4][5][6], and even HPC use cases are being investigated [7].
At CERN the uptake is also evident. CERN IT has integrated Kubernetes in the Cloud Infrastructure and allows instant provisioning of new clusters. The ATLAS experiment is evaluating the replacement of all Grid computing services with Kubernetes clusters [8]. The Batch service, which comprises the largest portion of offline computational workloads at CERN, is prototyping a Kubernetes platform [9]. REANA, a system for Reproducible Analyses, is targeting Kubernetes as a main execution backend [10].
Many of these use cases are attracted by the promise of development velocity: rapidly shipping features, while maintaining highly available services [11]. A key element is to expand the pieces of a software stack that are immutable and versioned, declaratively configured, and self-healing.

Operator pattern
The Web Frameworks use case is not only about using Kubernetes as a deployment vehicle. Since we develop platforms and are concerned with their operational characteristics, we extend Kubernetes with custom APIs and controllers, building infrastructure management applications that are part of Kubernetes. Our applications range from integrating with other CERN systems to providing website management APIs and automating operations.
But what is included in the task of managing websites? The infrastructure, seen as an application, needs to have a concept of a website and let users define the website they need: its name, the technology, parameters, etc. Once a website is specified, the infrastructure needs to ensure that the website is automatically provisioned and set up. More than just a server, this task might include setting up storage and database, and integration with external CERN systems. After that, it needs to ensure that every component stays healthy and synchronized, propagating changes as requested by the user to every part.
In many cases, the solution has two components: a custom Kubernetes API and a controller that watches the API and ensures certain conditions. This pattern is called Operator, because it uses Kubernetes primitives to automate high-level operational workflows specific to the website technology.
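As a minimal sketch of this pattern, the loop below compares the declared state of a website with the observed state and emits the actions needed to converge them. It is framework-free Python for illustration only; the type and function names (`DesiredSite`, `reconcile`) are hypothetical, and a real controller built with the Operator framework would watch the Kubernetes API and apply the actions itself.

```python
from dataclasses import dataclass

@dataclass
class DesiredSite:      # what the user declared through the custom API
    name: str
    version: str
    replicas: int

@dataclass
class ObservedSite:     # what is actually running in the cluster
    version: str
    replicas: int

def reconcile(desired: DesiredSite, observed: dict) -> list:
    """One reconcile pass: return the actions needed to converge the
    observed state towards the declared (desired) state."""
    actions = []
    current = observed.get(desired.name)
    if current is None:
        actions.append(("create", desired.name))          # provision everything
    else:
        if current.version != desired.version:
            actions.append(("upgrade", desired.version))  # roll out new image
        if current.replicas != desired.replicas:
            actions.append(("scale", desired.replicas))   # adjust pod count
    return actions

# A controller runs reconcile() on every change event and periodically.
site = DesiredSite(name="home", version="9.1", replicas=3)
print(reconcile(site, {}))                                # site absent yet
print(reconcile(site, {"home": ObservedSite("8.9", 3)}))  # needs an upgrade
```

The key property is idempotence: running the pass again once the state has converged produces no actions.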

What is Drupal?
Drupal is an open-source Content Management System (CMS): a tool for site builders to organize and deliver content to their website visitors. It's used in 10% of the top-10k websites with the highest traffic [12,13].
The market leader in Drupal's niche is the CMS WordPress [12]. Drupal is often contrasted with it and, according to UX surveys conducted by a Drupal-focused company, seems to offer a complicated start for beginner users, but a powerful experience for experts [14].
Drupal is frequently embraced as an open source community-driven project, making it strategically attractive for enterprise sustainability [15]. Use cases range from simple blogs to professional newspaper publishing, from enterprise presentation to e-commerce, across government and private sector entities [16]. A frequently cited feature is the flexibility with which it adapts to bespoke requirements, while scaling to large amounts of content.

Drupal at CERN
Who can benefit from a dedicated infrastructure for CMS websites? An organization that needs a dynamic Drupal environment with a high turnover of sites: universities, or organizations comprising many departments and independent activities. Drupal was selected as Content Management System due to its active community and its extensibility with contributed modules. It has become the platform of choice for public outreach.

Article structure
Having laid down background information on the motivation, technologies and concepts used in this work, in the following sections we will describe:
Section 2 What are the requirements of Content Management as a Service at CERN?
Section 3 What system currently serves these requirements?
Section 4 What system did we design on Kubernetes to replace the one of section 3?
Section 5 An experimental investigation of the new platform's efficiency
Section 6 Reflections on this work and plans for the future

Requirements for Content Management as a Service
The service enables website admins to host and administer Drupal websites directed at the general public, such as experiment or departmental central websites. Some of the most popular sites based on this service are home.cern, atlas.cern, cms.cern, careers.cern and visit.cern. They form CERN's main outreach channel and are critical for the Organization's reputation. Users of the service range across a wide spectrum of professional profiles, and it's quite common that the responsibility of site building at CERN falls on administrative personnel, or personnel with little technical background in web technologies. This in turn shapes the kind of service we have to provide; it is, for example, impractical to rely on developer-centric workflows, like GitOps and CLI tools. A small fraction of our user base, however, does have web development experience.
The consequence is that the Content Management service has a dual mission:
1. to ensure the high availability and performance of these communication channels
2. to make site building and administration accessible to a wide-ranging user base, while remaining extensible for websites needing special features

Control vs. customization
Curating the Drupal distribution, and critically, the application of security updates, is the responsibility of the infrastructure team. However, many websites need extra features, and Drupal was selected exactly because of its extensibility. Website admins should be able to use community modules, thereby extending Drupal specifically for their website, while assuming limited responsibility to keep custom code secure.

The service load is heavily skewed: a very small number of critical websites receives most of the traffic, and this is an intrinsic characteristic of the service. Unique visitors over 1 month are taken as a measure of a site's popularity, or of how much impact it has on the Organization's reputation, but a measure more suitable for assessing an infrastructure is the rate of HTTP requests. In section 5 we will describe an experiment on resource optimization in the Kubernetes infrastructure by assigning websites to different Quality of Service classes.
The 10 websites with the highest traffic are the target of 60% of all requests, and they overlap strongly with the most popular sites (fig. 1a). The most popular websites therefore need, apart from the highest availability guarantees, also the highest throughput.
What sustained rate of requests should a website be able to handle with stable response time? To better understand how the load impacts a single website (and therefore estimate the required hosting resources), we performed the measurements of figure 1a.
These observations align with expectations and requirements: critical websites should be able to handle a throughput of 40 requests per second with stable response times.

Current implementation
The infrastructure that currently serves the Drupal websites can be seen in figure 2. It runs on CentOS 7 and uses Puppet for configuration management. All servers run the same environment with systemd services, some of which are:
• HAProxy load balancer: routes requests to worker nodes, with an affinity cookie
The journey of an HTTP request can be seen in figure 2.
Production websites respond to the load by spawning PHP workers, up to a maximum of 25. A worker process is always listening for requests, even without load. Test websites, on the other hand, spawn the first worker on demand and scale up to a maximum of 10 workers. The PHP memory limit for every website is 512MB.
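The worker behaviour described above maps onto standard PHP-FPM pool settings. The fragment below is an illustrative sketch: the directive names are standard PHP-FPM, but the pool names and spare-server values are examples, not the exact production configuration.

```ini
; Sketch of PHP-FPM pool settings matching the behaviour described above.

[production]
pm = dynamic                  ; keep workers listening even without load
pm.max_children = 25          ; hard cap per production website
pm.start_servers = 2          ; illustrative values
pm.min_spare_servers = 1
pm.max_spare_servers = 5
php_admin_value[memory_limit] = 512M

[test]
pm = ondemand                 ; first worker spawned on demand
pm.max_children = 10          ; lower cap for test websites
pm.process_idle_timeout = 10s
php_admin_value[memory_limit] = 512M
```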

Website isolation
Weak isolation is one of the biggest concerns of this infrastructure. Each website is assigned a Linux user. The website's directory is owned by that user and not accessible by the users of other websites. When Apache serves a request, it chroots the PHP process into the Drupal directory and sets the website's user. This distinction provides a basic isolation mechanism. Nevertheless, websites are coupled in many places. We've never detected a cross-site security incident, but there are no cgroup limits to resources, and not enough security layers to defend against privilege escalation exploits. This is a critical concern, given the vulnerability of CMS software [17] and the impact that defacing a high-traffic public site would have on CERN's reputation.
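For contrast, the missing controls listed above map directly onto standard Kubernetes pod-level primitives. The fragment below is an illustrative sketch (image name and resource values are hypothetical): cgroup-backed resource limits and a restricted security context per website container.

```yaml
# Illustrative pod spec fragment: per-website cgroup limits and
# privilege restrictions that the physical infrastructure lacks.
containers:
  - name: php-fpm
    image: drupal-site:v1            # hypothetical image name
    resources:
      requests: {cpu: 100m, memory: 256Mi}
      limits:   {cpu: "1",  memory: 512Mi}   # enforced via cgroups
    securityContext:
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```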
A fundamental security practice is to update rapidly upon security releases [18], but updating the multisite environment is fraught with dangers. All websites need to be updated in a massive, forced upgrade campaign, and if any has errors, it needs rapid debugging. Errors are supposed to be detected in a test environment, but not every website has one. Furthermore, even though websites can be customized with contributed modules, there is no workflow to version control the websites.

Development workflow
Two types of website are supported administratively, corresponding to different Quality of Service (QoS) expectations: official (production) and test (test and development). There is no concept of "environments" or branches in this infrastructure. Developing a website involves maintaining a production website and one or more independent development websites. Data can be cloned between websites, so that the development website reproduces the production one.
A site admin that wants to safely develop a new content type or view, add a new module, or even change configuration, should start by cloning the production website to a development website. They need to keep track of changes, then reproduce them on the production website. There is no GitOps.
Despite seeming inefficient from a software engineering perspective, this workflow is acceptable to most website admins. Those with development experience, though, would benefit significantly from version controlling configuration changes and extra modules.

Limitations of the current infrastructure
Reiterating the discussion, these are the major limitations of the current infrastructure. In section 4, the Kubernetes infrastructure lifts all of them.
• Hard to adapt resources, resulting in massive under-utilization

The new Kubernetes infrastructure
For each web framework there is currently a separate infrastructure, based on different technologies. This creates silos of operational expertise within the small engineering team that supports each one, which are costly and inefficient to keep alive in CERN's dynamic working environment. At the same time, a lot of software components were developed in-house to support a single use case at a time, generating a large technical debt. Many requirements, however, are shared, such as interfacing with external CERN systems.
We therefore developed a common platform on the OpenShift Kubernetes community distribution (OKD 4), leaving only a thin business logic layer specific to each use case. OpenShift was chosen firstly for its production-tested multitenancy support, which we relied on for the present PaaS infrastructure. OpenShift extends Kubernetes with tooling that simplifies our design [19]: a developer-focused console UI that we expose to end users, in-place upgrades of the control plane, a node (machine) management API, and monitoring and logging stacks. The first web framework to use the new infrastructure is EOS web hosting ("WebEOS"), which has been in production since November 2020.
At the same time, the Platform as a Service and Drupal use cases are in Pilot phase, due to enter production in 2021.

Serving perspective
The physical servers are replaced with virtual machines composing an OKD 4 cluster. Each website is served by one or more replicas of a pod with Nginx and PHP-FPM containers; in the case of critical websites, the replicas are spread across different cloud availability zones. The serving perspective can be seen by following an HTTP request in figure 3.

DrupalSite API: create and manage websites
The infrastructure can be seen as an application that offers its users functions to manage websites. It provides an API for website admins to specify the kind of website they need: what version of Drupal, what amount of resources, which git repository to fetch configuration from. Each Project (Kubernetes namespace) forms the administrative domain of a website: it serves a single production website. Website admins can similarly create different environments of their website in the same project for development or test purposes (which are full websites with independent data stores), clone data between websites, and take and restore backups. An overview of the components is in figure 4.
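As an illustration, such a declaration could look like the manifest below. The field names and API group are hypothetical stand-ins for the actual CRD schema; they only show the kind of information a website admin specifies.

```yaml
# Hypothetical DrupalSite manifest: field names and API group are
# illustrative, not the exact CRD schema.
apiVersion: drupal.cern.ch/v1alpha1
kind: DrupalSite
metadata:
  name: production
  namespace: my-website          # the Project owning this site
spec:
  drupalVersion: "9.1"           # version of the CERN Drupal distribution
  qosClass: standard             # drives resources and replica placement
  configurationRepo: https://gitlab.example.ch/my-website/config.git
```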
Following the operator pattern introduced in section 1.1, the business logic is implemented with the DrupalSite Custom Resource Definition (CRD) and the drupalSite operator. The DrupalSite, in turn, controls all the resources in the dashed box in the architecture diagram (figure 4).
Each website runs an immutable version of Drupal code, compiled from a version of the CERN Drupal distribution 1 (controlled by the Infrastructure team) with a source-to-image build that injects user dependencies and configuration from a git repository. Site administrators with technical background thus gain extra flexibility in customizing their website.
Two privilege modes are offered as different sets of RBAC permissions and policy rules: SaaS and PaaS. By default limited permissions are given (SaaS mode), limiting users to the functions offered by the DrupalSite CRD. To facilitate more advanced web development workflows and low-level control, such as deploying a different Drupal distribution, website administrators can ask for PaaS mode permissions, which come with limited support.
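The two modes can be sketched as RBAC Roles. The rule contents below are examples of the split, not the production policy: SaaS mode exposes only the high-level website API, while PaaS mode additionally opens the low-level building blocks.

```yaml
# Illustrative RBAC sketch of the two privilege modes.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: saas-website-admin
rules:
  # SaaS mode: only the high-level website API is exposed
  - apiGroups: ["drupal.cern.ch"]        # hypothetical API group
    resources: ["drupalsites"]
    verbs: ["get", "list", "create", "update", "delete"]
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: paas-website-admin
rules:
  # PaaS mode: low-level building blocks become editable as well
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "deployments", "configmaps"]
    verbs: ["get", "list", "create", "update", "delete"]
```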

Measuring baseline resource requirements
The Kubernetes infrastructure can easily adjust its size according to the expected load. Each workload is also adaptable to load peaks with the Horizontal Pod Autoscaler (although so far we have not found it necessary). Nevertheless, we performed an experiment to measure the expected baseline resource consumption 2 and compare against the physical infrastructure, which was massively over-provisioned.
We need a resource estimate to size the Kubernetes infrastructure so that it can handle the same baseline load as the physical infrastructure. To understand the baseline load and the resources needed to handle it, we define Quality of Service (QoS) classes based on the throughput that must be handled with stable response latency. We then perform a stress test emulating that throughput to determine the appropriate baseline resources.

Service level objectives
To define the load each QoS class needs to handle, we use the physical infrastructure as a starting point (fig. 1a). Throughput peaked on the most popular website at around 16,000 requests per hour, which averages to 4.4 requests per second. We define 3 QoS classes, which serve as Service Level Objectives (SLO) against the Service Level Indicator (SLI) of response latency under load. Because the latency depends on the complexity of each website, which is in the hands of the website admins and not the infrastructure, the SLO is met not by defining a set value of the SLI, but by asserting that the SLI be stable³. The 3 QoS classes are:
• Critical: the most popular websites and therefore the most important to have high availability and request throughput (see section 2.1). They need to handle 35 requests per second (around 8 times the average on peak usage).
• Standard: these websites usually don't handle as much traffic and therefore don't need high request throughput. They need to handle 10 requests per second.
• Test: as the name suggests, these websites are used by website managers, testers and developers to test new features. They only need to handle 2 requests per second.
The stress test consists of multiple simulated clients requesting URLs on the same website over a period of time.

¹ PHP Composer project, standard setup without multisite
² The resources required to satisfy each website's Quality of Service requirements
³ Stability is interpreted as the latency settling to a fixed value after some time with steady load

Stress test
We copied a few websites (nursery and fluka used as examples) with varying content complexity from the physical infrastructure to experiment with on the Kubernetes infrastructure. The websites are stressed with load appropriate for the QoS class under investigation, and we tweak their configuration (mainly the number of PHP workers) to find the lowest possible resource values that still give reasonable and stable response times. The simulated clients live on a dedicated Kubernetes cluster that deploys a custom tool based on Locust to make multiple requests to the targeted website on the new infrastructure. We spread the clients across multiple pods, each containing multiple processes that simulate users by requesting URLs at random.
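As a much-simplified, standard-library stand-in for that tool, the sketch below spawns concurrent simulated clients that request random URLs and reports throughput and mean response time. The function names are illustrative; the real setup uses Locust spread across multiple pods.

```python
import random
import threading
import time
from urllib.request import urlopen

def simulated_user(base_url, urls, stop, latencies):
    """One simulated client: request random URLs until told to stop."""
    while not stop.is_set():
        start = time.monotonic()
        with urlopen(base_url + random.choice(urls)) as resp:
            resp.read()
        latencies.append(time.monotonic() - start)

def stress(base_url, urls, clients, duration):
    """Run `clients` concurrent users for `duration` seconds and return
    (throughput in requests/s, mean response time in s)."""
    stop, latencies = threading.Event(), []
    threads = [threading.Thread(target=simulated_user,
                                args=(base_url, urls, stop, latencies))
               for _ in range(clients)]
    for t in threads:
        t.start()
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()
    return len(latencies) / duration, sum(latencies) / max(len(latencies), 1)
```

A run against a website would look like `stress("https://example.cern", ["/", "/about"], clients=20, duration=600)`, mirroring the 10-minute tests described below.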

Stress load
Multiple runs were made with different configurations in order to find, for each QoS class, a suitable one that processes the desired throughput with minimal resource consumption.
The throughput is affected by the load times on the website: a website with less content takes less time to handle a request, and therefore the user can issue a new request sooner than on a website with more content. Figure 5 shows the load and the response during the stress test for the evaluated websites. The stress tests ramp up during the first minute, after which they maintain the stress load for 9 more minutes. 10 minutes were sufficient for the response time to settle.
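This effect follows from Little's law for closed-loop load generators: each simulated user issues a new request as soon as the previous response arrives, so the achievable throughput is roughly the number of concurrent users divided by the mean response time. A short sketch (the numbers are purely illustrative, not measurements from our tests):

```python
def max_throughput(concurrent_users: int, mean_response_s: float) -> float:
    """Little's law for closed-loop clients: each user issues a new request
    as soon as the previous one completes, so
    throughput = users / mean response time."""
    return concurrent_users / mean_response_s

# A lighter site (0.25 s per page) vs a heavier one (1.0 s per page),
# both stressed by 20 simulated users:
print(max_throughput(20, 0.25))  # 80.0 requests per second
print(max_throughput(20, 1.0))   # 20.0 requests per second
```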

Measurements
The resource consumption was also monitored under load. The memory usage for nursery and fluka respectively can be seen in figure 6. Table 2 shows the highest response time under full stress and the lowest requests per second for each QoS. Table 3 summarizes the test results in terms of resources needed by each QoS class under load.

TotalMem = C · L_C + 2 · C · I + S · L_S + T · L_T
         = 20 × 960 MiB + 20 × 104 MiB + 600 × 309 MiB + 500 × 125 MiB
         = 269,180 MiB ≈ 282.3 GB

The physical infrastructure has 2TB of memory. To meet our SLO, the memory estimate for the Kubernetes infrastructure is 283GB. This is only 14.15% of the memory required in the physical infrastructure, resulting in significant potential cost savings, even disregarding cluster autoscaling.
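The arithmetic of the estimate can be double-checked in a few lines (instance counts and per-instance figures taken from the formula above; the MiB-to-GB conversion explains the ≈282.3 GB figure):

```python
MIB = 2**20  # bytes per MiB

# Terms of the estimate: (number of instances, MiB per instance)
terms = [
    (20, 960),    # critical-site term
    (20, 104),    # second critical-site term
    (600, 309),   # standard-site term
    (500, 125),   # test-site term
]
total_mib = sum(n * mib for n, mib in terms)
total_gb = total_mib * MIB / 1e9

print(total_mib)           # 269180
print(round(total_gb, 1))  # 282.3
print(round(100 * total_gb / 2000, 2))  # share of the 2 TB physical memory
```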

Reflections
It is striking how big a difference it makes to discuss a design with 10 engineers rather than 3, and to have the peace of mind in case of emergency that many colleagues can take part. This is the hidden benefit of sharing a common platform, which this team has already felt. Especially in CERN's dynamic environment where the turnover of people is high, knowledge silos can be ill afforded.
The Pilot phase of this new infrastructure is still too immature to provide significant operational insights. We fully expect a few minor adaptations of the presented design as we scale up to production, but the experiments of section 5 reassure us that the infrastructure is functional and that modifications should be limited to the details. On the other hand, all the limitations of the physical infrastructure of section 3.4 have been lifted.
The next big challenge will be the production migrations in Q2 2021. Development is ongoing, including critical features for production. It is important to keep the migrations as transparent to the website admins as possible and to minimize disruption. From the power users, however, we expect the new features to receive a warm welcome and to open new doors in their workflows.

Directions to explore
Develop once, run everywhere is a yet-to-be-materialized promise. Plans for disaster recovery from a catastrophic failure of the CERN data center hinge on keeping a public communication channel accessible. With Kubernetes cluster federation, using Public Cloud resources as a safety net becomes conceivable.
We will explore adding WordPress as a Service to the same infrastructure. Fundamentally, it should take no more than introducing a new build configuration.
The CNCF landscape provides a salient overview of industry-standard solutions that can take this project the proverbial extra "mile". We plan to experiment with runtime security, root cause analysis, chaos engineering and serverless for the non-production environments. Kubernetes turns a homebrew system into a cosmopolitan denizen of a brave new world of the Cloud.