Mutation testing: Too good to be true?

Piotr Kubowicz
January 14, 2021

Unit tests provide an additional layer of safety when refactoring code or pushing changes to production. However, they can also give a false sense of confidence if they don’t fail when errors are introduced. Code coverage reports can show places in code that are not reached in tests, but the fact that a place in code is covered by a test does not mean that a test will catch an error introduced there. A poorly written test may check too few details (and in extreme cases be ‘assertion-less’), providing coverage without providing safety.

Mutation testing is an approach to assuring code quality that has gained mainstream attention only relatively recently. The idea behind it is to automatically find code that is poorly tested by making small changes (‘mutations’) to the code and checking if there is a test that fails as a result. If nothing fails, a test is missing or an existing test needs to be rewritten to catch errors better. Well, that’s the theory – but how useful is mutation testing in practice?

When mutation testing first became popular, it was seen as the wonder cure for the problem of poor tests. It was a really hot topic around 2016. It appeared on ThoughtWorks Technology Radar and people gave it some spotlight at conferences – there is a great presentation Epic Battle: Zombies vs Mutants by Tomek Dubikowski. I remember at some point I felt guilty for not having mutation testing running in my project. It seemed like a glaring omission, comparable to not doing continuous integration or not being able to measure test coverage of your code. But once the hype faded, I noticed colleagues from other teams did not use mutation testing either and managed to survive without annihilating their production environments.

Now the popularity of mutation testing is on the rise again, as the October 2020 edition of Technology Radar promoted the Pitest tool, using the term ‘trial’ rather than ‘assess’. So what’s the truth – is it just hype or a must-have?

What’s in this article #

I will begin by briefly explaining how mutation testing works, how much time an analysis takes and which mistakes are typically found and missed. For the purpose of this article I will focus on Pitest, probably the most popular framework, written for Java and JVM, in a particular setup my team uses (a backend Kotlin project). I will give you a glimpse of what Pitest reports look like, how the two engines (Gregor and Descartes) differ, what execution time you should expect, and what the community around the tool looks like. Apart from the short code snippets used in the article, there is also an example project on GitHub in case you want to do some experimenting on your own. I will conclude the article with some general ideas for how teams can improve the quality of their tests.

An overview of mutation testing #

Let’s consider the following buggy code:
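A minimal sketch of such code, assuming a hypothetical `Account` class that keeps its balance in cents (the class and method names are illustrative and not taken from the example project):

```java
// Hypothetical illustration; the names are assumptions made for this sketch.
class Account {
    private final int balanceInCents;

    Account(int balanceInCents) {
        this.balanceInCents = balanceInCents;
    }

    // Buggy on purpose: paying exactly the whole balance should be allowed,
    // so the comparison ought to be >= rather than >.
    boolean canPay(int amountInCents) {
        return balanceInCents > amountInCents;
    }
}
```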

You can write a test with just 2 cases:

  • 10 cents on account, 1 to pay, should return true
  • 10 cents on account, 20 to pay, should return false

This test will pass and cover all branches of the code. As a result, the code coverage report won’t flag it as a poor test, even though it fails to specify what happens if both numbers are equal and thus accepts incorrect logic.

One of the most basic mutations involves changing the conditional boundary:
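As a sketch, assuming the balance check is a single comparison in a hypothetical `canPay` function, a conditional-boundary mutant turns `>` into `>=`. Both versions below are written by hand for illustration and are not actual Pitest output:

```java
// Hypothetical sketch of a conditional-boundary mutation.
class ConditionalBoundaryExample {
    // Original (buggy) check: rejects a payment equal to the balance.
    static boolean canPay(int balanceInCents, int amountInCents) {
        return balanceInCents > amountInCents;
    }

    // What the code effectively looks like after the mutation: '>' became '>='.
    static boolean canPayMutated(int balanceInCents, int amountInCents) {
        return balanceInCents >= amountInCents;
    }
}
```

The two versions disagree only when the balance equals the amount.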

If you run our simplistic test on the mutated code, it will still pass, which reveals this line is not properly tested. To satisfy mutation testing you will need to add a test case in which true is returned for equal account balance and amount (and fix the incorrect condition in the production code).
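The effect can be reproduced by hand. The sketch below assumes the check is modelled as a `BiPredicate` of balance and amount (a simplification chosen for this example) and runs the two-case test, and then a strengthened test, against both the original and the mutated condition:

```java
import java.util.function.BiPredicate;

// Hand-rolled illustration; a real mutation engine changes compiled code instead.
class MutantDemo {
    static final BiPredicate<Integer, Integer> ORIGINAL = (balance, amount) -> balance > amount;
    static final BiPredicate<Integer, Integer> MUTANT = (balance, amount) -> balance >= amount;

    // The two cases of the simplistic test: (10, 1) -> true, (10, 20) -> false.
    static boolean weakSuitePasses(BiPredicate<Integer, Integer> canPay) {
        return canPay.test(10, 1) && !canPay.test(10, 20);
    }

    // The same cases plus the boundary: equal balance and amount -> true.
    static boolean strongSuitePasses(BiPredicate<Integer, Integer> canPay) {
        return weakSuitePasses(canPay) && canPay.test(10, 10);
    }

    public static void main(String[] args) {
        // The weak suite cannot tell the two versions apart: the mutant survives.
        System.out.println(weakSuitePasses(ORIGINAL));   // true
        System.out.println(weakSuitePasses(MUTANT));     // true
        // The boundary case exposes the bug in the original condition.
        System.out.println(strongSuitePasses(ORIGINAL)); // false
        System.out.println(strongSuitePasses(MUTANT));   // true
    }
}
```

Once the production code is fixed to use `>=`, its conditional-boundary mutant is the `>` version, and the strengthened test fails on it, so that mutant is killed.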

There are many other mutations, for example removing a call to a void function or returning a fixed value (null/0/empty string). You can have a look at the list of mutators in the Pitest documentation.
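To make the ‘removed void call’ case more concrete, here is a hand-written sketch. The `Notifier` class and `placeOrder` method are hypothetical, and the mutant is typed out manually rather than generated by Pitest:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example of a mutant in which a call to a void method is removed.
class VoidCallMutationExample {
    // Illustrative collaborator that remembers every message it was asked to send.
    static class Notifier {
        final List<String> sent = new ArrayList<>();

        void send(String message) {
            sent.add(message);
        }
    }

    // Original code: stores the order and notifies about it.
    static int placeOrder(List<String> orders, Notifier notifier, String item) {
        orders.add(item);
        notifier.send("ordered " + item);
        return orders.size();
    }

    // Mutant: identical, except that the call to send() is gone.
    static int placeOrderMutated(List<String> orders, Notifier notifier, String item) {
        orders.add(item);
        return orders.size();
    }
}
```

A test that asserts only on the returned number of orders passes for both versions, so the mutant survives. A test that also checks `notifier.sent` kills it.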

Pitest #

The Pitest library implements mutation testing for JVM-based languages. It’s actually the only actively maintained tool for this ecosystem.

Pitest tries to avoid nonsense mutations by using test coverage data to decide what to mutate. If a statement is not covered by tests, mutating it is just a waste of time – no test reaches this code, so no test could fail after it’s mutated. Also, when a mutation is done, it makes sense to only run tests that reach the code instead of waiting for the whole test suite to finish. This means the tool works in three phases:

  • It runs all your tests once to calculate test coverage,
  • It determines what places in code to mutate and how,
  • For each mutation, a class modified by the mutation is loaded instead of the original one and tests covering the particular place are executed; if at least one test fails, the mutation is ‘killed’, if none fails, the mutation ‘survives’ and the framework can report an area where tests can possibly be improved.
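The three phases can be sketched as a toy model. Everything here is an assumption made for illustration: the code under test is a map of named functions, tests are predicates over that map, and mutants are written by hand. Real Pitest works on bytecode and tracks coverage much more precisely.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntBinaryOperator;
import java.util.function.Predicate;

// Toy model of a mutation run; not how Pitest is implemented internally.
class ToyMutationRun {
    // A test receives the code under test (functions by name) and returns true if it passes.
    interface ToyTest extends Predicate<Map<String, IntBinaryOperator>> {}

    static final class Mutant {
        final String description;
        final String target;
        final IntBinaryOperator replacement;

        Mutant(String description, String target, IntBinaryOperator replacement) {
            this.description = description;
            this.target = target;
            this.replacement = replacement;
        }
    }

    static Map<String, String> run(Map<String, IntBinaryOperator> code,
                                   Map<String, ToyTest> tests,
                                   List<Mutant> mutants) {
        // Phase 1: run every test once, recording which functions each test reaches.
        Map<String, Set<String>> coveredBy = new HashMap<>();
        for (Map.Entry<String, ToyTest> test : tests.entrySet()) {
            Map<String, IntBinaryOperator> recording = new HashMap<>();
            for (Map.Entry<String, IntBinaryOperator> fn : code.entrySet()) {
                recording.put(fn.getKey(), (a, b) -> {
                    coveredBy.computeIfAbsent(fn.getKey(), k -> new LinkedHashSet<>())
                             .add(test.getKey());
                    return fn.getValue().applyAsInt(a, b);
                });
            }
            test.getValue().test(recording);
        }

        // Phase 2 is faked here: the mutants are supplied by the caller.
        // Phase 3: swap in each mutant and run only the tests that cover its target.
        Map<String, String> report = new LinkedHashMap<>();
        for (Mutant mutant : mutants) {
            Set<String> covering = coveredBy.getOrDefault(mutant.target, Collections.emptySet());
            if (covering.isEmpty()) {
                report.put(mutant.description, "NO_COVERAGE"); // no test could fail, so nothing is run
                continue;
            }
            Map<String, IntBinaryOperator> mutated = new HashMap<>(code);
            mutated.put(mutant.target, mutant.replacement);
            boolean killed = false;
            for (String testName : covering) {
                if (!tests.get(testName).test(mutated)) {
                    killed = true;
                    break;
                }
            }
            report.put(mutant.description, killed ? "KILLED" : "SURVIVED");
        }
        return report;
    }

    static Map<String, String> demo() {
        Map<String, IntBinaryOperator> code = new LinkedHashMap<>();
        code.put("max", (a, b) -> a > b ? a : b);
        code.put("add", (a, b) -> a + b);
        code.put("multiply", (a, b) -> a * b); // not reached by any test

        Map<String, ToyTest> tests = new LinkedHashMap<>();
        tests.put("maxPicksLarger", c -> c.get("max").applyAsInt(1, 2) == 2);
        // Assertion-less test: it provides coverage but can never fail.
        tests.put("addRuns", c -> { c.get("add").applyAsInt(1, 2); return true; });

        List<Mutant> mutants = Arrays.asList(
            new Mutant("max: > replaced with <", "max", (a, b) -> a < b ? a : b),
            new Mutant("add: + replaced with -", "add", (a, b) -> a - b),
            new Mutant("multiply: * replaced with +", "multiply", (a, b) -> a + b));
        return run(code, tests, mutants);
    }

    public static void main(String[] args) {
        demo().forEach((mutant, status) -> System.out.println(mutant + " -> " + status));
    }
}
```

In this toy run the `max` mutant is killed, the `add` mutant survives because its only test asserts nothing, and the `multiply` mutant is never executed because no test covers it.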

An HTML report is created as a result.

Pitest HTML report showing packages

Pitest HTML report showing coverage per line and applied mutations

Speed #

As mentioned before, Pitest tries to be smart by avoiding testing mutations that make no sense. But in the end there will still be lots of cases to check. How much time does it take to execute mutation testing analysis?

I made some measurements when my team was considering integrating Pitest into our projects. We wanted to know if it’s something you can use like test coverage analysis – you start it, go and prepare yourself a cup of good tea, come back and it’s ready for you to read. Or is it like a hardcore set of UI tests, where you start it, go home, a CI server is tortured overnight, and in the morning the report is ready for you? Well, in our project it was somewhere in the middle, and it wasn’t good news.

Our test execution time for 4 different projects looks like this:

Test execution time

The time for Pitest execution, however, is an order of magnitude higher:

Pitest execution time

Gregor and Descartes are two Pitest ‘engines’, where the former is the built-in one and the latter is an independently developed one. Descartes is more ‘coarse-grained’, meaning it mutates by replacing the whole method body, hoping to achieve better performance.

Was something wrong with our project? We have high test coverage, and our tests are mostly integration tests, involving multiple classes instead of just one, which means Pitest coverage analysis will find that many tests need to be executed if a line of code is mutated. Our tests call high-level APIs, as this gives us great confidence and makes tests refactor-tolerant (Robert Daniel Moore wrote a comprehensive summary of this approach). But this means our tests require a Spring context and a database instance, so they take a while to start. All in all this makes mutation testing take quite a long time. However, the Descartes documentation shows results from some open-source projects and you can find some astronomical numbers there:

Gregor vs. Descartes execution time for different open-source projects

It depends on the type of project you have, but it’s safe to assume that mutation testing will be very slow.

Accuracy #

Seeing how dramatically Descartes and Gregor can differ in speed, you might wonder how they differ in accuracy.

Gregor, the default engine, wastes a lot of time testing trivial code like getters and setters. It is exceptionally bad with Kotlin as it attempts to break things implemented in the language itself instead of things implemented by you. For example, it will mutate your data classes and report that they are not covered properly. Also, the Kotlin compiler adds null checks to bytecode in places where Kotlin code interacts with non-Kotlin code, because such code cannot guarantee null-safety. Gregor will mutate each of these null checks and report many false positives, making Pitest reports completely unreadable, filled mostly with:

removed call to kotlin/jvm/internal/Intrinsics::checkExpressionValueIsNotNull → SURVIVED.

You can instruct Gregor to avoid mutating Kotlin null checks by passing the option:

avoidCallsTo = ["kotlin.jvm.internal"]

This, however, will stop Gregor from mutating any statement containing at least one null check – meaning Gregor will also not mutate business logic you wrote in such a statement. It is very disappointing, but false positives are so painful that we have chosen to always enable this option.

Descartes understands Kotlin much better and, by default, does not generate useless mutations for data classes and null checks. The approach of mutating the method body as a whole works quite well – in our code we found some cases where Gregor was better at finding poorly tested lines, but there were cases where Descartes found something Gregor missed. In general, Gregor finds more, but Descartes is still quite useful.

I cannot share the actual code I work on, but I’ve prepared a demonstration project you can download and run by yourself – https://github.com/pkubowicz/pitest-gradle-sandbox. The example project uses the ‘STRONGER’ set of Gregor mutators, as it found more issues in our real-life code than the default set.

There are cases where Gregor is clearly better, for example here:

Descartes fails to detect anything. Gregor, on the other hand, even running with basic settings, allows us to detect that the unit test is not trying hard enough to check the if condition.
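The kind of difference described here can be sketched with hand-written mutants. The `isAdult` method and its test are hypothetical, and the mutants only imitate what the two engines do: Descartes-style mutants replace the whole method body with a constant, while the Gregor-style mutant changes just the conditional boundary.

```java
import java.util.function.IntPredicate;

// Hand-written imitation of the two engines; neither mutant is real Pitest output.
class EngineComparisonSketch {
    // Hypothetical production code with an 'if' condition.
    static boolean isAdult(int age) {
        if (age >= 18) {
            return true;
        }
        return false;
    }

    // Descartes-style mutants: the whole body is replaced with a constant.
    static final IntPredicate ALWAYS_TRUE = age -> true;
    static final IntPredicate ALWAYS_FALSE = age -> false;
    // Gregor-style mutant: only the boundary changes ('>=' becomes '>').
    static final IntPredicate BOUNDARY = age -> age > 18;

    // A test that checks both outcomes but never the boundary value 18.
    static boolean testPasses(IntPredicate isAdult) {
        return isAdult.test(30) && !isAdult.test(5);
    }

    public static void main(String[] args) {
        System.out.println(testPasses(EngineComparisonSketch::isAdult)); // true
        System.out.println(testPasses(ALWAYS_TRUE));  // false: mutant killed
        System.out.println(testPasses(ALWAYS_FALSE)); // false: mutant killed
        System.out.println(testPasses(BOUNDARY));     // true: mutant survives
    }
}
```

Both coarse-grained mutants are killed, so an engine that generates only those has nothing to report. The surviving boundary mutant is what reveals that the condition is not tested at its edge.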

Other cases aren’t so easy. Let’s take a test written in a style that is, sadly, quite popular – where a misinterpretation of TDD leads to a test tightly coupled to the tested class:

And the production code:

The test for this class passes and reports 100% line coverage, although there are fundamental mistakes:

  1. The repository query should also exclude cancelled invoices
  2. The test only sets up overdue invoices, so if isOverdue is wrongly implemented or never called, the test won’t detect it
  3. Only the first result is taken
  4. Invoice status is used instead of invoice ID

The ‘STRONGER’ set of Gregor mutators does not detect anything. Descartes, in turn, catches the second problem. The ‘ALL’ set of Gregor mutators detects 2 and 4. Only a human can detect 1 and 3.

In general, the ‘ALL’ set detects more problems, but comes at a cost and I wouldn’t consider using it in real-life projects. The number of false positives in Kotlin data classes is even higher with this setting. The number of mutations increases tenfold in comparison to ‘STRONGER’. As execution time for ‘STRONGER’ was already too long, ‘ALL’ looks impractical to me, at least in most projects.

Also, even when ‘ALL’ detects a problem in Kotlin code, the message is often very cryptic. For example:

Pitest report says only:

1. Substituted 2 with 3 → SURVIVED
19. Substituted 2 with 3 → SURVIVED
26. equal to less or equal → SURVIVED

What condition should I additionally test? Imagine guessing the answer if there are 5 elements instead of 2 elements in the list above.

Lack of community #

The substandard Kotlin support is just the tip of the iceberg. The broader problem is that there is no force driving the development of Pitest and no capacity to respond to demand for new features or bug fixes. The main pitest repository, Descartes and the Pitest Gradle plugin all have just 1 active developer each (by which I mean a person with at least 10 commits since November 2018). These are, in fact, single-person hobby projects.

As a result, the issue for better Kotlin support in Gregor has remained open since February 2019. Until December 2020, there was no hint on how to pass configuration options for Descartes if you use Gradle. Pitest creators don’t provide support for Gradle. The plugin is developed independently and does not have feature parity with the Pitest tool itself.

Finding a place for mutation testing #

We wondered how to integrate mutation testing into our daily work. The analysis takes a really long time. Switching from Gregor to Descartes wasn’t a solution: it made the analysis 2 times faster at best, but it still took over 20 minutes. Moreover, Descartes detects fewer problems than the default engine. As mentioned above, we have high quality tests and we feared that with Descartes we would wait for 20 minutes and get nothing detected, so it seemed better to wait a few minutes more but get something.

Still, the problem remained: when was the best time to run Pitest? No one wants to wait 20 minutes to get a report on their own code or on code checked out locally for review. The incremental analysis option in Pitest can sometimes cut down this time to just a minute or two, but when people commit frequently, accumulating changes force Pitest to do a full analysis and we go back to 20 minutes at minimum.

We could generate the report on our CI servers when a merge request is opened. However, we doubted anyone would read those reports. There are lots of things to look for when you do a code review. Adding another thing to read won’t make people happier. The number of unavoidable false positives in Pitest reports is frustrating and we were afraid people would soon lose trust in these reports.

What about failing a build if mutation coverage drops as a result of a merge request? Our CI servers handle multiple projects and as a result build time is a bit longer than on developer machines. Looking at test execution time, we estimated that a Pitest run that takes 20 minutes locally will take about 40 minutes on CI. This means if someone pushes commits to a merge request, as a reviewer you won’t be able to merge it for 40 minutes, plus a CI runner will be occupied for 40 minutes. We rejected this idea as well. It would be just pumping CO₂ into the atmosphere for no good reason.

Having an automatic guard rejecting code changes if they reduce mutation coverage, or having developers constantly reading mutation coverage reports and analyzing weak spots, doesn’t look very sound from an economic point of view, at least to me. Usually your code has different parts: some are critically important, others are not. Some are changed frequently, others not so much. Investing equally in all of them is just a waste of money. Keep the quality bar high where it pays off. If you are a developer, it may be better for your project if, instead of looking for places in your code that can be improved, you spend more time talking to your Product Owner about forthcoming requirements or you clean up the issue backlog.

In the end we added Pitest to our repository to give developers an option to run the tool locally if they feel they really need some automated help analysing a particularly complicated piece of code. The Pitest task is completely separate from the rest of the build to avoid slowing down the other tasks. However, we found that people rarely use it.

There may be valid use cases for running mutation testing during Continuous Integration. Maybe your domain requires keeping the number of defects extremely low: you launch a rocket to the International Space Station or write an open-source library used by hundreds of projects. In these circumstances, tracking mutation testing coverage and ensuring it never drops makes sense.

I also suspect mutation testing may be useful when you take over a project and don’t know if you can trust automated tests. With a mutation coverage report you will be able to decide if you can safely change a place in code or need to write additional tests first.

Individuals and interactions over processes and tools #

As demonstrated above, there are fundamental defects that can be spotted by a human but not necessarily so by a machine. The law of diminishing returns is definitely at play with mutation testing – you can gain much more by investing in people.

Just as no one is born knowing how to write tests without misusing mocks, code review is also a skill that needs to be learned. A good review will reveal not just tests that are not thorough enough, it will also question if tests align with business requirements and are easy to maintain. It’s worth investing time into learning how to review well.

Reading Google’s Code Review Developer Guide is a good starting point. Another practice that I find extremely useful is starting a review by reading test classes and checking if you can make sense of them without any knowledge of the code being tested. This way it’s easier to spot unhelpful error messages, missed corner cases and various code smells, like failing to keep cause and effect clear. In my experience, knowing production code before reading a test makes you too tolerant of overcomplicated and otherwise poorly written tests.

Interestingly, I found that experimenting with mutation testing vastly improved my code review skills. I tend to think more about whether a test is able to detect an error in the code I assess. If I have doubts, I check out a local copy of the code I review and act as a manual mutation engine – I change the code slightly and see if any test fails. You might argue that this is a waste of time and I could do it automatically if I really wanted to. However, knowing how limited automated mutations are and how frequently you get false positives, I think in most cases I am better than the machine at changing the code in a way that has the highest chance of cheating the tests. See the case with the .next() call in the example above – it changes the type of the reactive stream to a different one (from Flux to Mono). In general, such a change will cause a compilation error, so I doubt any mutation engine will ever offer such an automated mutation. For a human developer, however, it’s easy to notice that adding or removing such a line will create code that has a different behaviour, but still compiles.

Keep people in your team willing to improve their skills and learn from each other. Find a person passionate about code review in your organization and borrow them for a week or two to interact with your team. If that’s not possible, hire such a person or see whether there is training that can teach your team code review skills.

If I were to recommend some kind of tool, I would suggest something much more basic instead. My observation is that a team’s test-writing skills improve quickly simply when writing test code requires less ceremony.

I am surprised by the positive influence Kotlin had on test quality in my team. With data classes, named parameters and built-in copy() function you no longer need to write boilerplate code to have good quality test data. You can easily have fields with valid and descriptive values and data that does not leak irrelevant details into test code. With a little help from a modern assertion library it’s convenient to extract just fields you want to check in the particular test case and ignore parts that are not relevant. When the same people in my company who write great tests come back to Java code, they suddenly tend to write much worse tests.

Conclusions #

For me, mutation testing promises too much. It’s not a tool that you simply plug into your project and that will then guard you against quality issues. This approach is sensitive to false positives and can be blind to problems an experienced developer can easily notice. Also, mutation testing can be very slow to execute.

You can consider this approach if your project has unusually high quality requirements. In such circumstances, think twice before using Kotlin, as it will work really poorly with the de facto only tool available for JVM – Pitest. The community for mutation testing is small, so don’t expect your problems to be solved for you – be prepared to invest time into investigating them by yourself and submitting patches to tool authors.

Mutation testing won’t make up for people in your team who don’t catch poor tests during code review. Those people will also miss tests that don’t meet business requirements and no machine will help you there. Help your team grow code review skills. Learn how mutation testing works to gain a new perspective on evaluating whether tests are useful or not. Appreciate people who do great code reviews and encourage everybody to learn from each other. Make sure you have solid foundations for writing descriptive, declarative and easy to maintain tests – if not, consider introducing modern tools. We found Kotlin’s expressiveness a great help; find out what works well for you.

Concentrate on things that give the highest return on investment. I like the quote by Kent Beck: “I get paid for code that works, not for tests, so my philosophy is to test as little as possible to reach a given level of confidence.” Understand what parts of code need to be tested really hard and which don’t.
