Mutation testing: Too good to be true?

Piotr Kubowicz - January 14, 2021

Unit tests provide an additional layer of safety when refactoring code or pushing changes to production. However, they can also give a false sense of confidence if they don’t fail when errors are introduced. Code coverage reports can show places in code that are not reached in tests, but the fact that a place in code is covered by a test does not mean that a test will catch an error introduced there. A poorly written test may check too few details (and in extreme cases be ‘assertion-less’), providing coverage without providing safety.

Mutation testing is a relatively new approach to assuring code quality. The idea behind it is to automatically find code that is poorly tested by making small changes (‘mutations’) to the code and checking if there is a test that fails as a result. If nothing fails, a test is missing or an existing test needs to be rewritten to catch errors better. Well, that’s the theory – but how useful is mutation testing in practice?

When mutation testing first appeared, it was seen as the wonder cure for the problem of poor tests. It was a really hot topic around 2016. It appeared on the ThoughtWorks Technology Radar and people gave it some spotlight at conferences – there is a great presentation, Epic Battle: Zombies vs Mutants, by Tomek Dubikowski. I remember at some point I felt guilty for not having mutation testing running in my project. It seemed like a glaring omission, comparable to not doing continuous integration or not being able to measure test coverage of your code. But once the hype faded, I noticed that colleagues from other teams did not use mutation testing either and managed to survive without annihilating their production environments.

Now the popularity of mutation testing is on the rise again, as the October edition of the Technology Radar promotes the Pitest tool, using the term ‘assess’ rather than ‘trial’. So what’s the truth – is it just hype or a must-have?

What’s in this article

I will begin by briefly explaining how mutation testing works, how much time an analysis takes, and which mistakes are typically found and missed. For the purposes of this article I will focus on Pitest, probably the most popular framework for Java and the JVM, in the particular setup my team uses (a backend Kotlin project). I will give you a glimpse of what Pitest reports look like, how the two engines (Gregor and Descartes) differ, what execution time you should expect, and what the community around the tool looks like. Apart from the short code snippets used in the article, there is also an example project on GitHub in case you want to do some experimenting on your own. I will conclude the article with some general ideas for how teams can improve the quality of their tests.

An overview of mutation testing

Let’s consider the following buggy code:
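A minimal sketch of such code (the function and parameter names below are mine, not the article’s original snippet) could be:

    // Decides whether there is enough money on the account to pay the given amount.
    // The intended behaviour is to accept an exactly matching balance, so the
    // comparison should be >=, not > – that is the bug.
    fun canPay(balanceInCents: Int, amountInCents: Int): Boolean =
        balanceInCents > amountInCents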

You can write a test with just 2 cases:

  • 10 cents on account, 1 to pay, should return true
  • 10 cents on account, 20 to pay, should return false
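A sketch of that test, assuming JUnit 5 and the hypothetical canPay function above:

    import org.junit.jupiter.api.Assertions.assertFalse
    import org.junit.jupiter.api.Assertions.assertTrue
    import org.junit.jupiter.api.Test

    class CanPayTest {

        @Test
        fun `10 cents on account, 1 to pay, returns true`() {
            assertTrue(canPay(balanceInCents = 10, amountInCents = 1))
        }

        @Test
        fun `10 cents on account, 20 to pay, returns false`() {
            assertFalse(canPay(balanceInCents = 10, amountInCents = 20))
        }
    }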

This test will pass and cover all branches of the code. As a result, the code coverage report won’t flag it as a poor test – one that fails to specify what happens when both numbers are equal and thus accepts incorrect logic.

One of the most basic mutations involves changing the conditional boundary:
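Applied to the sketch above, that mutation changes > into >=:

    // Mutated version – note that both test cases above still pass against it.
    fun canPay(balanceInCents: Int, amountInCents: Int): Boolean =
        balanceInCents >= amountInCents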

If you run this simplistic test on the mutated code, it will still pass, which reveals that this line is not properly tested. To satisfy mutation testing you will need to add a test case in which true is returned for an account balance equal to the amount (and fix the incorrect condition in the production code).
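In the sketch above, the missing case would be:

    @Test
    fun `10 cents on account, 10 to pay, returns true`() {
        // Fails against the original buggy code (>), forcing the fix to >=;
        // once the production code uses >=, this case also kills the boundary
        // mutant that flips the comparison back to >.
        assertTrue(canPay(balanceInCents = 10, amountInCents = 10))
    }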

There are many other mutations, for example removing a call to a void function or returning a fixed value (null/0/empty string). You can have a look at the list of mutators available in the Pitest documentation.

Pitest

The Pitest library implements mutation testing for JVM-based languages. It’s actually the only actively maintained tool for this ecosystem.

Pitest tries to avoid nonsense mutations by using test coverage data to decide what to mutate. If a statement is not covered by tests, mutating it is just a waste of time – no test reaches this code, so no test could fail after it’s mutated. Also, when a mutation is applied, it makes sense to run only the tests that reach the mutated code instead of waiting for the whole test suite to finish. This means the tool works in three phases:

  • It runs all your tests once to calculate test coverage,
  • It determines what places in code to mutate and how,
  • For each mutation, a class modified by the mutation is loaded instead of the original one and the tests covering that particular place are executed; if at least one test fails, the mutation is ‘killed’; if none fails, the mutation ‘survives’ and the framework reports an area where tests can possibly be improved.

An HTML report is created as a result.
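For a Gradle project, a minimal setup could look roughly like this (a sketch assuming the third-party gradle-pitest-plugin; versions and property names may differ between plugin releases):

    // build.gradle.kts – illustrative only
    plugins {
        kotlin("jvm") version "1.4.21"
        id("info.solidsoft.pitest") version "1.5.2"
    }

    pitest {
        junit5PluginVersion.set("0.12")            // needed when tests use JUnit 5
        targetClasses.set(setOf("com.example.*"))  // limit mutations to your own packages
    }

Running ./gradlew pitest then produces the report, by default under build/reports/pitest.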

Pitest HTML report showing packages

Pitest HTML report showing coverage per line and applied mutations

Speed

As mentioned before, Pitest tries to be smart by avoiding testing mutations that make no sense. But in the end there will still be lots of cases to check. How much time does it take to execute mutation testing analysis?

I made some measurements when my team was considering integrating Pitest into our projects. We wanted to know if it’s something you can use like test coverage analysis – you start it, go and prepare yourself a cup of good tea, come back and it’s ready for you to read. Or is it like a hardcore set of UI tests, where you start it, go home, a CI server is tortured overnight, and in the morning the report is ready for you? Well, in our project it was somewhere in the middle, and that wasn’t good news.

Our test execution time for 4 different projects looks like this:

Test execution time

The time for Pitest execution is an order of magnitude higher:

Pitest execution time

Gregor and Descartes are two Pitest ‘engines’, where the former is the built-in one and the latter is an independently developed one. Descartes is more ‘coarse-grained’, meaning it mutates by replacing the whole method body, hoping to achieve better performance.

Was something wrong with our project? We have high test coverage, and our tests are mostly integration tests, involving multiple classes instead of just one, which means Pitest coverage analysis will find that many tests need to be executed when a line of code is mutated. Our tests call high-level APIs, as this gives us great confidence and makes tests refactor-tolerant (Robert Daniel Moore wrote a comprehensive summary of this approach). But this means our tests require a Spring context and a database instance, so they take a while to start. All in all, this makes mutation testing take quite a long time. However, the Descartes documentation shows results from some open-source projects and you can find some astronomical numbers there:

Gregor vs. Descartes execution time for different open-source projects

It depends on the type of project you have, but it’s safe to assume that mutation testing will be very slow.

Accuracy

Seeing how dramatically Descartes and Gregor can differ in speed, you might wonder how they differ in accuracy.

Gregor, the default engine, wastes a lot of time testing trivial code like getters and setters. It is exceptionally bad with Kotlin, as it attempts to break things implemented in the language itself instead of things implemented by you. For example, it will mutate your data classes and report that they are not covered properly. Also, the Kotlin compiler adds null checks to the bytecode in places where Kotlin code calls non-Kotlin code, because such code cannot guarantee null safety. Gregor will mutate each of these null checks and produce many false positives, making Pitest reports completely unreadable, filled mostly with:

removed call to kotlin/jvm/internal/Intrinsics::checkExpressionValueIsNotNull → SURVIVED.
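To see where these checks come from, consider an assignment like the one below (an illustrative example – the check exists only in the compiled bytecode, not in your source):

    // System.getProperty is a Java API, so its result is a platform type.
    // Assigning it to a non-nullable String makes the Kotlin compiler emit an
    // Intrinsics null check in the bytecode – and Gregor will mutate that check.
    val userName: String = System.getProperty("user.name")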

You can instruct Gregor to avoid mutating Kotlin null checks by passing an option:

avoidCallsTo = ["kotlin.jvm.internal"]

This, however, will stop Gregor from mutating any statement containing at least one null check – meaning Gregor will also not mutate business logic you wrote in such a statement. This is very disappointing, but the false positives are so painful that we have chosen to always enable this option.
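With the Gradle plugin this can be done roughly as follows (a sketch; the exact syntax depends on the plugin version):

    pitest {
        // Skip mutations in statements that call into the Kotlin runtime – at the
        // price of also skipping your own logic inside those statements.
        avoidCallsTo.set(setOf("kotlin.jvm.internal"))
    }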

Descartes understands Kotlin much better and, by default, does not generate useless mutations for data classes and null checks. The approach of mutating the method body as a whole works quite well – in our code we found some cases where Gregor was better at finding poorly tested lines, but there were cases where Descartes found something Gregor missed. In general, Gregor finds more, but Descartes is still quite useful.

I cannot share the actual code I work on, but I’ve prepared a demonstration project you can download and run by yourself – https://github.com/pkubowicz/pitest-gradle-sandbox. The example project uses the ‘STRONGER’ set of Gregor mutators, as it found more issues in our real-life code than the default set.
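Selecting a mutator group is again a single option (a sketch for the Gradle plugin):

    pitest {
        mutators.set(setOf("STRONGER"))  // other groups include "DEFAULTS" and "ALL"
    }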

There are cases where Gregor is clearly better, for example here:
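The snippet from our code is not reproduced here, but a hypothetical example of the same shape is a method whose only logic is a boundary check (all names below are made up):

    const val FREE_SHIPPING_THRESHOLD_IN_CENTS = 10_000L

    fun qualifiesForFreeShipping(totalInCents: Long): Boolean {
        if (totalInCents >= FREE_SHIPPING_THRESHOLD_IN_CENTS) {
            return true
        }
        return false
    }

Suppose the tests check 20 000 (well above the threshold) and 100 (well below), but never the threshold itself.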

Descartes fails to detect anything. Gregor, on the other hand, even running with basic settings, allows us to detect that the unit test is not trying hard enough to check the if condition.

Other cases aren’t so easy. Let’s take a test written in a style that is, sadly, quite popular – where a misinterpretation of TDD leads to a test tightly coupled with the tested class:
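The original test is not reproduced here; a hypothetical test in the same spirit (using MockK, Reactor’s StepVerifier, and the classes shown in the production snippet below) might read:

    import io.mockk.every
    import io.mockk.mockk
    import org.junit.jupiter.api.Test
    import reactor.core.publisher.Flux
    import reactor.test.StepVerifier
    import java.time.LocalDate

    class OverdueInvoiceFinderTest {

        private val repository = mockk<InvoiceRepository>()
        private val finder = OverdueInvoiceFinder(repository)

        @Test
        fun `returns overdue invoice`() {
            // Only an overdue invoice is set up, so the isOverdue() filter is never
            // really exercised.
            val overdue = Invoice(
                id = "inv-1",
                status = InvoiceStatus.SENT,
                dueDate = LocalDate.now().minusDays(30),
            )
            every { repository.findByStatusNot(InvoiceStatus.PAID) } returns Flux.just(overdue)

            StepVerifier.create(finder.findOverdueInvoice())
                // Asserts on whatever the implementation happens to return today.
                .expectNext(overdue.status.name)
                .verifyComplete()
        }
    }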

And the production code:
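Again, a hypothetical stand-in rather than the original code (the .next() call plays a role later in the article):

    import reactor.core.publisher.Flux
    import java.time.LocalDate

    enum class InvoiceStatus { DRAFT, SENT, PAID, CANCELLED }

    data class Invoice(val id: String, val status: InvoiceStatus, val dueDate: LocalDate) {
        fun isOverdue(): Boolean = dueDate.isBefore(LocalDate.now())
    }

    interface InvoiceRepository {
        fun findByStatusNot(status: InvoiceStatus): Flux<Invoice>
    }

    class OverdueInvoiceFinder(private val repository: InvoiceRepository) {

        // Return type left to inference on purpose – see the remark about .next()
        // near the end of the article.
        fun findOverdueInvoice() =
            repository.findByStatusNot(InvoiceStatus.PAID)  // (1) cancelled invoices are not excluded
                .filter { it.isOverdue() }                  // (2) the test never exercises the negative case
                .next()                                     // (3) only the first matching invoice is taken
                .map { it.status.name }                     // (4) the status is returned instead of the invoice ID
    }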

The test for this class passes and reports 100% line coverage, although there are fundamental mistakes:

  1. The repository query should also exclude cancelled invoices
  2. The test only sets up overdue invoices, so if isOverdue is wrongly implemented or never called, the test won’t detect it
  3. Only the first result is taken
  4. Invoice status is used instead of invoice ID

The ‘STRONGER’ set of Gregor mutators does not detect anything. Descartes, in turn, catches the second problem. The ‘ALL’ set of Gregor mutators detects 2 and 4. Only a human can detect 1 and 3.

In general, the ‘ALL’ set detects more problems, but it comes at a cost and I wouldn’t consider using it in real-life projects. The number of false positives in Kotlin data classes is even higher with this setting, and the number of mutations increases tenfold in comparison to ‘STRONGER’. As the execution time for ‘STRONGER’ was already too long, ‘ALL’ looks impractical to me, at least in most projects.

Also, even when ‘ALL’ detects a problem in Kotlin code, the message is often very cryptic. For example:

The Pitest report says only:

1. Substituted 2 with 3 → SURVIVED
19. Substituted 2 with 3 → SURVIVED
26. equal to less or equal → SURVIVED

What condition should I additionally test? Imagine guessing the answer if there are 5 elements instead of 2 elements in the list above.

Lack of community

The substandard Kotlin support is just the tip of the iceberg. The broader problem is that there is no force driving the development of Pitest and no capacity to respond to demand for new features or bug fixes. The main Pitest repository, Descartes and the Pitest Gradle plugin all have just 1 active developer each (where I count as active a person with at least 10 commits since November 2018). These are, in fact, single-person hobby projects.

As a result, the issue for better Kotlin support in Gregor has remained open since February 2019. Until December 2020, there was no hint on how to pass configuration options to Descartes if you use Gradle. The Pitest creators don’t provide support for Gradle; the plugin is developed independently and does not have feature parity with the Pitest tool itself.

Finding a place for mutation testing

We wondered how to integrate mutation testing into our daily work. The analysis takes a really long time. Switching from Gregor to Descartes wasn’t a solution: it made the analysis at best 2 times faster, which still meant over 20 minutes, and Descartes detects fewer problems than the default engine. As mentioned above, we have high-quality tests, and we feared that with Descartes we would wait 20 minutes and get nothing detected – better to wait a few minutes more and actually get something.

Still, the problem remained: when is the best time to run Pitest? No one wants to wait 20 minutes for a report on their own code or on code checked out locally for review. The incremental analysis option in Pitest can sometimes cut this time down to just a minute or two, but when people commit frequently, the accumulating changes force Pitest to do a full analysis and we are back to 20 minutes at minimum.

We could generate the report on our CI servers when a merge request is opened. However, we doubted anyone would read those reports. There are lots of things to look for when you do a code review, and adding another thing to read won’t make people happier. The number of unavoidable false positives in Pitest reports is frustrating and we were afraid people would soon lose trust in these reports.

What about failing a build if mutation coverage drops as a result of a merge request? Our CI servers handle multiple projects and as a result build times are a bit longer than on developer machines. Looking at test execution time, we estimated that a Pitest run that takes 20 minutes locally would take about 40 minutes on CI. This means that if someone pushes commits to a merge request, as a reviewer you won’t be able to merge it for 40 minutes, plus a CI runner will be occupied for 40 minutes. We rejected this idea as well. It would be just pumping CO₂ into the atmosphere for no good reason.

Having an automatic guard that rejects code changes if they reduce mutation coverage, or having developers constantly read mutation coverage reports and analyse weak spots, doesn’t look very sound from an economic point of view, at least to me. Usually your code has different parts: some are critically important, others are not; some are changed frequently, others not so much. Investing equally in all of them is just a waste of money. Keep the quality bar high where it pays off. If you are a developer, it may be better for your project if, instead of looking for places in your code that can be improved, you spend more time talking to your Product Owner about forthcoming requirements or clean up the issue backlog.

In the end we added Pitest to our repository to give developers an option to run the tool locally if they feel they really need some automated help analysing a particularly complicated piece of code. The Pitest task is completely separate from the rest of the build to avoid slowing down the other tasks. However, we found that people rarely use it.

There may be valid use cases for running mutation testing during Continuous Integration. Maybe your domain requires keeping the number of defects extremely low: you launch rockets to the International Space Station or write an open-source library used by hundreds of projects. In these circumstances, tracking mutation testing coverage and making sure it never drops makes sense.

I also suspect mutation testing may be useful when you take over a project and don’t know if you can trust automated tests. With a coverage report you will be able to decide if you can safely change a place in code or need to write additional tests first.

Individuals and interactions over processes and tools

As demonstrated above, there are fundamental defects that can be spotted by a human but not necessarily by a machine. The law of diminishing returns is definitely at play with mutation testing – you can gain much more by investing in people.

Just as no one is born knowing how to write tests without misusing mocks, code review is a skill that needs to be learned. A good review will reveal not just tests that are not thorough enough; it will also question whether tests align with business requirements and are easy to maintain. It’s worth investing time into learning how to review well.

Reading Google’s Code Review Developer Guide is a good starting point. Another practice that I find extremely useful is starting a review by reading the test classes and checking if you can make sense of them without any knowledge of the code being tested. This way it’s easier to spot unhelpful error messages, missed corner cases, and various code smells like failing to keep cause and effect clear. For me, knowing the production code before reading a test makes you too tolerant of overcomplicated and otherwise poorly written tests.

Interestingly, I found that experimenting with mutation testing vastly improved my code review skills. I tend to think more about whether a test is able to detect an error in the code I assess. If I have doubts, I check out a local copy of the code under review and act as a manual mutation engine – I change the code slightly and see if any test fails. You might argue that this is a waste of time and I could do it automatically if I really wanted to. However, knowing how limited automated mutations are and how frequently you get false positives, I think in most cases I am better than the machine at changing the code in a way that has the highest chance of cheating the tests. See the case with the .next() call in the example above – it changes the reactive stream type to a different one (from Mono to Flux). In general, such a change will cause a compilation error, so I doubt any mutation engine will ever offer it as an automated mutation. For a human developer, however, it’s easy to notice that adding or removing such a line will create code that has a different behaviour but still compiles.

Keep people in your team willing to improve their skills and learn from each other. Find a person passionate about code review in your organization and borrow them for a week or two to interact with your team. If that’s not possible, hire such a person or check whether there is a training course that can teach your team code review skills.

If I were to recommend some kind of tool, I would suggest something much more basic instead. My observation is that a team’s test-writing skills improve quickly simply when writing test code requires less ceremony.

I am surprised by the positive influence Kotlin has had on test quality in my team. With data classes, named parameters and the built-in copy() function you no longer need to write boilerplate code to have good-quality test data. You can easily have fields with valid and descriptive values and test data that does not leak irrelevant details into test code. With a little help from a modern assertion library, it’s convenient to check just the fields relevant to a particular test case and ignore the parts that are not. When the same people in my company who write great tests go back to Java code, they suddenly tend to write much worse tests.
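A small illustration of that style, reusing the hypothetical Invoice class from earlier (the assertion comes from AssertJ; any modern assertion library will do):

    import org.assertj.core.api.Assertions.assertThat
    import org.junit.jupiter.api.Test
    import java.time.LocalDate

    // A reusable, descriptive default – no builders or setters needed.
    val anInvoice = Invoice(
        id = "inv-42",
        status = InvoiceStatus.SENT,
        dueDate = LocalDate.of(2021, 1, 31),
    )

    class InvoiceTest {

        @Test
        fun `invoice past its due date is overdue`() {
            // copy() changes only the detail relevant to this test case.
            val invoice = anInvoice.copy(dueDate = LocalDate.of(2020, 12, 1))

            assertThat(invoice.isOverdue()).isTrue()
        }
    }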

Conclusions

For me, mutation testing promises too much. It’s not a tool that you can simply plug into your project to have it guard you against quality issues. The approach is prone to false positives and can be blind to problems an experienced developer can easily notice. It can also be very slow to execute.

You can consider this approach if your project has unusually high quality requirements. In such circumstances, think twice before using Kotlin, as it works really poorly with the de facto only tool available for the JVM – Pitest. The community around mutation testing is small, so don’t expect your problems to be solved for you – be prepared to invest time into investigating them yourself and submitting patches to the tool authors.

Mutation testing won’t compensate for people in your team not catching poor tests during code review. Such people will also miss tests that don’t meet business requirements, and no machine will help you there. Help your team grow code review skills. Learn how mutation testing works to gain a new perspective on evaluating whether tests are useful or not. Appreciate people who do great code reviews and encourage everybody to learn from each other. Make sure you have solid foundations for writing descriptive, declarative and easy-to-maintain tests – if not, consider introducing modern tools. We found Kotlin’s expressiveness a great help; check out what will be good for you.

Concentrate on things that give the highest return on investment. I like this quote by Kent Beck: “I get paid for code that works, not for tests, so my philosophy is to test as little as possible to reach a given level of confidence.” Understand which parts of code need to be tested really hard and which don’t.

About the author

Piotr Kubowicz

Software Engineer

Piotr is a polyglot developer who has been coding in Java for over ten years. He has also tried many other languages, from C and Perl to Ruby.
During his time at nexocode, Piotr’s primary focus has been on evolving team culture and ongoing projects by developing build automation and systems architecture to ensure delivery stays smooth even as project codebases get bigger and more complex. An active developer in the community, he can be seen speaking at various meetups and conferences.
