Published in Doctolib
Yes, Test-driven Development is useful in Data Science

Coming from an analytics & data science background, I used to see writing tests as something painful. I knew that it was important, but I never knew where to start and always procrastinated until the end of the project. What’s more boring than testing a piece of code that already works?

Recently, I started to see things the other way around. I discovered Test-Driven Development: writing your tests before you write the functional code itself. It’s a best practice in software engineering that deserves to be applied more often to data science projects.

What is Test-Driven Development (TDD)?

A simple way to describe it is the Red / Green / Refactor cycle. Every time you want to add functionality to the code, you follow three steps:

  1. Red. Write a test that fails, i.e. specify the code’s functional requirements
  2. Green. Write production code that makes the test pass, i.e. fulfill those requirements
  3. Refactor. Clean up the mess you just made, i.e. improve the code without changing its behavior

Real-life Example

Let’s illustrate this with a real-life example. As a post-processing step for a Named Entity Recognition project, we want to build a function that extracts the duration unit (day/week/month/…) and the duration value from a text.

1. Let’s write a unit test that captures the functional need, along with an empty function.

The test is RED 🔴 That’s expected: the function still returns an empty dictionary.

2. We write code that passes the test.
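A quick first implementation, again as a sketch, might rely on a simple regular expression and do just enough to turn the test green:

```python
import re


def extract_duration(text: str) -> dict:
    # quick-and-dirty version: just make the test pass
    match = re.search(r"(\d+)\s*(day|week|month|year)s?", text)
    if match is None:
        return {}
    return {"value": float(match.group(1)), "unit": match.group(2)}
```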

The test is GREEN 🟢 Hurray! But hold on: are we sure the code is DRY, SOLID, and PEP 8 compliant?

3. We refactor the function to follow coding best practices.

Here we add type annotations, extract a generic function that converts a number written in letters to a float (with its own unit test), and refactor how the dictionary is filled.
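A refactored version along those lines could look like this sketch (the word-to-number helper and its vocabulary are assumptions about what the original code did):

```python
import re
from typing import Dict, Union

# hypothetical helper vocabulary: numbers written in letters
_WORD_TO_NUMBER = {
    "one": 1.0, "two": 2.0, "three": 3.0, "four": 4.0, "five": 5.0,
}

_UNITS = ("day", "week", "month", "year")
_PATTERN = re.compile(
    r"\b(\d+|" + "|".join(_WORD_TO_NUMBER) + r")\s*("
    + "|".join(_UNITS) + r")s?\b",
    flags=re.IGNORECASE,
)


def to_float(token: str) -> float:
    """Convert a number written in digits ('3') or letters ('three') to a float."""
    token = token.lower()
    if token in _WORD_TO_NUMBER:
        return _WORD_TO_NUMBER[token]
    return float(token)


def test_to_float():
    assert to_float("three") == 3.0
    assert to_float("3") == 3.0


def extract_duration(text: str) -> Dict[str, Union[str, float]]:
    """Extract the duration value and unit mentioned in a text."""
    match = _PATTERN.search(text)
    if match is None:
        return {}
    value, unit = match.groups()
    return {"value": to_float(value), "unit": unit.lower()}
```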

Where can we apply Test-Driven Development in Data Science?

Test-Driven Development is not relevant at every step of a data science project. For instance, it is not worth it during data or model exploration, when you are not yet sure what you are looking for: writing tests without knowing the expected output is probably overkill.

It becomes very useful whenever you need to build robust production pipelines.

In this context, we need several types of tests:

  • Unit tests: check each piece of code of the project in isolation
  • Model tests: ensure that the model performs well and behaves correctly
  • Integration tests: ensure that the pieces of code work correctly together

Test-driven development applies very well to unit tests, especially for the processing parts of the pipeline (pre-processing and post-processing), where the code is deterministic.

For model tests, TDD should be used carefully. Predictive models come with uncertainty: many machine learning algorithms are inherently random, so multiple runs on the same inputs might produce slightly different results. This can lead to flaky tests: tests that sometimes pass and sometimes fail despite no changes to the code or the test itself. If the test cases are too specific in a first TDD iteration, it is highly likely that some of them will break on the next one: a new model could behave differently on a few cases while performing better globally. Therefore, model tests should only cover basic cases that are necessary for the project.
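For instance, a model test might check only invariants that any acceptable model version must satisfy. The model below is a trivial stand-in so this sketch can run; in a real project it would be your trained NER model:

```python
class StubDurationModel:
    """Stand-in for a trained NER model (illustrative only)."""

    def predict(self, text: str) -> str:
        # a real model would be loaded from disk and return entity labels
        units = ("day", "week", "month", "year")
        return "DURATION" if any(u in text for u in units) else "O"


def test_model_basic_cases():
    model = StubDurationModel()
    # only obvious cases every acceptable model must get right;
    # over-specific cases would make the test flaky across model versions
    assert model.predict("Take this medicine for 3 days") == "DURATION"
    assert model.predict("Hello, how are you?") == "O"
```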

Lastly, for integration tests, TDD applies well to code paths that do not involve model predictions. When a model prediction is involved, it is better to test the format of the output than its actual value.
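A format check on a prediction could look like this sketch (the expected keys are assumptions, matching the duration example above):

```python
def check_prediction_format(prediction: dict) -> None:
    """Validate the shape of a model output without pinning exact values."""
    assert isinstance(prediction, dict)
    assert set(prediction) == {"value", "unit"}
    assert isinstance(prediction["value"], float)
    assert prediction["unit"] in {"day", "week", "month", "year"}
```

The integration test then asserts that the pipeline produces a well-formed dictionary, regardless of the exact value the model predicts.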

A good practice is to run these tests in your project’s CI/CD pipeline: whenever a new functionality is proposed, you are guaranteed that existing functionality does not break.

Why is it game-changing?

Adopting Test-Driven Development really changes the way you organize your coding sessions, and it brings many benefits:

  • It instantly validates the business and technical specifications. The tests also serve as great documentation for a data scientist who discovers the project and needs to understand how it works.
  • It gives confidence in the code. Every use case is covered by a test, so you or your teammates can add functionality without fear of breaking anything that already works.
  • It saves time. TDD can look like it slows development down, but it doesn’t: it forces you to think about the functional needs beforehand and to anticipate edge cases. Believe me, at the end of the day, it saves a lot of debugging and iteration time.
  • It even makes development more fun! It breaks the code down into small problem-solving challenges, and it’s a perfect fit for pair programming.

Test-Driven Development is one of the many coding best practices I learned as a data scientist at Doctolib. Check out our open positions: https://about.doctolib.fr/jobs/.


Founded in 2013, Doctolib is the fastest growing e-health service in Europe. We provide healthcare professionals with services to improve the efficiency of their organization, transform their patients’ experience, and strengthen cooperation with other practitioners.
