Scaling our CI to 14k+ E2E Browser Tests

Since the beginning of Doctolib development, we’ve put a strong emphasis on Test-Driven Development (TDD). A big part of what that means to us is not just testing first and writing code to match, but also testing what is relevant to the user. To avoid the costs of Quality Assurance validation (both in terms of time 🏃🏻‍♂️ and 💸), it’s critical that our tests are extremely thorough. We achieve this by inverting the testing pyramid 🇪🇬 and putting end-to-end browser testing at priority numero uno. We have 14k tests (and growing) out of our 40k-test suite that require a PostgreSQL database, an Elasticsearch instance, a Redis instance, and a headless Chrome browser executing JavaScript and rendering content, reaching out to third-party APIs, uploading and downloading files, etc. These tests alone account for 80% of the total duration of our CI. It might sound completely crazy, and perhaps it is, but we’ve found a lot of ways to scale development to meet the demands. As a reward, we get super relevant tests that are a breeze to write (`click_on` this, `assert_text` that) without the eyesore of mocking and stubbing that could be simulating inaccurate behavior anyway.
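
To give a feel for what these tests look like, here is a minimal sketch in the style of a Rails system test with Capybara. The scenario, labels, and route are made up for illustration; they are not taken from our code base.

```
# Purely illustrative booking scenario written as a Rails system test.
require "application_system_test_case"

class BookAppointmentTest < ApplicationSystemTestCase
  test "a patient books an appointment" do
    visit root_path
    fill_in "Search", with: "General practitioner"
    click_on "Search"
    click_on "Book an appointment"
    assert_text "Your appointment is confirmed"
  end
end
```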

Despite its benefits, this strategy brings us two major challenges: CI duration/scalability and flakiness. They’re something we constantly have to catch up on, but it is working. On a 4-core machine, running all our tests sequentially would take 3 days, 15 hours, 10 minutes, and 3 seconds. It’s extremely difficult to run a subset of tests since we’ve built a pure monolith with deeply interwoven concerns, so we run every test on every push of every PR … in only 18 minutes (roughly a 290× speedup)! How do we do it? And how do we keep it cost effective? Today we’ll talk about CI duration and scalability, answering these two questions.

CI Architecture

We run a mostly custom CI comprising 4 major components.

  1. The developer (the most important, complex, and demanding component by far)
  2. GitHub actions
  3. A highly concurrent and resilient work distribution system we built specifically for our needs. We call it CIrrus (with a capital `I`; no relation to Cirrus CI).
  4. A K.I.S.S. test reporting interface (with some extremely practical bells and whistles) controlled by a simple REST API, also a custom solution. We call it test-failures.

The information flow is directed and cyclic: the workflow starts and stops at the developer, with information only traveling in one direction throughout.

GitHub Actions

GitHub Actions is responsible for triggering the run when a push is made to a PR. It builds a Docker image containing our code and its dependencies. This is extremely convenient because the image we build is very similar to the one we use in production, which keeps our testing environment relevant, and it lets us ship everything the rest of the CI run needs quickly and easily. After building the Docker image, it creates a CIrrus job and queues the work to be done. And that’s it: GitHub Actions doesn’t run any tests itself, as it would have a very hard time scaling to run them in a timely manner.
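
As a rough sketch of the “queues the work” step, a small script at the end of the workflow could publish one message per batch of test files to the RabbitMQ queue described in the next section. The queue name, batching, and payload shape here are assumptions, not our actual setup.

```
require "bunny"
require "json"

# Hypothetical batching: split the system tests into groups of 50 files.
test_batches = Dir.glob("test/system/**/*_test.rb").each_slice(50)

connection = Bunny.new(ENV.fetch("RABBITMQ_URL")).start
channel    = connection.create_channel
queue      = channel.queue("ci_jobs", durable: true)

test_batches.each do |batch|
  queue.publish(
    { image: ENV["DOCKER_IMAGE"], files: batch }.to_json,
    persistent: true
  )
end

connection.close
```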

Of course we use GitHub Actions for a large variety of other things, but not as part of our main CI workflow.

CIrrus

CIrrus is a Kubernetes-native solution written in pure Ruby for fast parallel job execution. In reality, CIrrus is a collection of scripts and services that combine into a highly scalable solution, relying heavily on Kubernetes mechanics rather than reinventing the wheel. Each test run is represented by a single job in Kubernetes with high parallelism (up to 832 as of today). All the database dependencies are implemented as sidecar containers directly in Kubernetes. This relieves the tests of the responsibility of starting them and gives each Kubernetes pod its own isolated set of services. There is no overarching manager for the cluster, because we don’t need one. Each pod is fully autonomous: it simply iterates over the jobs in a RabbitMQ queue, executes them, and sends the results to test-failures. By leveraging Kubernetes’ scheduling abilities together with the Kubernetes Cluster Autoscaler, we’re able to handle all of our scheduling needs with minimal overhead and almost zero code.
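
As a sketch of that loop (the queue name, payload, and helpers are hypothetical; the real runner does considerably more), each pod could consume from RabbitMQ with manual acknowledgements, roughly like this:

```
require "bunny"
require "json"

channel = Bunny.new(ENV.fetch("RABBITMQ_URL")).start.create_channel
channel.prefetch(1)                          # one job in flight per runner
queue = channel.queue("ci_jobs", durable: true)

queue.subscribe(manual_ack: true, block: true) do |delivery, _properties, payload|
  job = JSON.parse(payload)

  # Hypothetical helpers: run the batch against the sidecar services,
  # then push the results to test-failures over its REST API.
  results = run_test_batch(job["files"])
  report_to_test_failures(results)

  channel.ack(delivery.delivery_tag)         # only ACK once results are reported
rescue StandardError
  channel.nack(delivery.delivery_tag, false, true)  # requeue so another pod retries
end
```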

In order to keep costs down, we run on AWS EC2 Spot Instances. They’re cheaper, but they can be taken away from us at any time with only 2 minutes of notice. Many of our jobs take much longer than 2 minutes, so we need the ability to reschedule them. Fortunately, RabbitMQ lets us pull items from the queue without `ACK`ing them; this way, we can `NACK` them to put them back at the front of the queue. Even if we never get the chance (something goes horribly wrong, the Ruby VM crashes, or a hardware error is encountered on the machine), RabbitMQ will re-queue them automatically when it detects the CIrrus runner is no longer alive. Our system is so resilient that a typical build has at least one runner that either crashes, gets evicted by AWS from its Spot Instance, or encounters a hardware or networking error, without affecting the result of the build and with only a minimal performance impact.
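
Here is a sketch of how the preemption path could sit on top of the loop above (again, the names and helpers are hypothetical): Kubernetes delivers the Spot reclaim as a SIGTERM, the trap handler only flips a flag, and the consumer hands its in-flight job back with a requeueing `NACK`.

```
require "bunny"
require "json"

channel  = Bunny.new(ENV.fetch("RABBITMQ_URL")).start.create_channel
queue    = channel.queue("ci_jobs", durable: true)
draining = false

# Bunny calls are not safe inside a trap handler, so only set a flag here.
Signal.trap("TERM") { draining = true }

queue.subscribe(manual_ack: true, block: true) do |delivery, _properties, payload|
  if draining
    channel.nack(delivery.delivery_tag, false, true)  # requeue for another runner
    delivery.consumer.cancel                          # stop taking new work
  else
    run_test_batch(JSON.parse(payload)["files"])      # hypothetical helper
    channel.ack(delivery.delivery_tag)
  end
end
```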

By further distributing this work across multiple AWS Elastic Kubernetes Service clusters, we get almost unlimited scalability here: the only limits are the duration of our slowest work-queue item and the amount of cluster resources a single build requires.

test-failures

test-failures is a basic Rails app with two main goals: 1) ingest test results from the CIrrus runners, and 2) display those results to the developer. It simply displays the failing tests, their failure messages and stack traces, the screenshot taken at the time of failure, each test’s duration, and its historical failure rate.
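
To make the “ingest” side concrete, here is a hedged sketch of what a runner’s report could look like; the URL and payload fields are assumptions, not the real API.

```
require "net/http"
require "json"
require "uri"

# Hypothetical endpoint and payload shape for reporting one test result.
def report_result(result)
  Net::HTTP.post(
    URI("https://test-failures.internal/api/test_results"),
    {
      test_name:       result[:name],
      status:          result[:status],          # "passed" or "failed"
      failure_message: result[:message],
      backtrace:       result[:backtrace],
      screenshot_url:  result[:screenshot_url],
      duration_ms:     result[:duration_ms],
      build_id:        ENV["CI_BUILD_ID"]
    }.to_json,
    "Content-Type" => "application/json"
  )
end
```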

The Developer

The developer, for our team, is the user. And although the developer may be more technical than our main application’s users, they’re still a user. It’s important to us to have good UX and UI design, to help them avoid making mistakes, and to clearly highlight the errors that occurred during their builds. With frequent communication and a strong user-first perspective, we’re able to maintain this component well 😉

A deeper dive?

If you liked this, let us know. We’ll be happy to take a deeper dive into things like:

  • How do we manage our flaky tests?
  • How do we do data analysis of builds/test results?
  • How do we keep our Docker build fast?
  • How do the test suite and our CI scale together as they grow?
  • How do we manage the queueing of work?

Thanks 😊
