xfail and code coverage

In previous posts, I've extolled the virtues of xfail and shown how I incorporate it into my workflow, but in this post I'd like to point out a downside of heavy use of xfail: it can artificially inflate your code coverage metrics.

The difference between xfail tests and skip tests is that xfail actually executes the test. This has the desirable property that you can be sure the test is still failing as expected, but also the undesirable property that any lines hit by the failing test count as covered.
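
To make the difference concrete, here is a minimal sketch of the two markers side by side; buggy_double is a made-up stand-in for code with a known bug:

import pytest


def buggy_double(x):
    # A stand-in for real code with a known bug (it triples instead of doubling).
    return x * 3


@pytest.mark.skip(reason="skip: the body never runs, so it covers nothing")
def test_skip_example():
    assert buggy_double(2) == 4


@pytest.mark.xfail(reason="xfail: the body runs, so every line it hits counts as covered")
def test_xfail_example():
    assert buggy_double(2) == 4

Run under coverage, the return line of buggy_double is reported as covered only because the xfailed test executes it.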

This is a problem because normally line coverage means that every line is covered by a test which made an assertion about the code, and because the test suite passed, it is assumed that each line under test did what it was supposed to do — at least in the context of the tests that covered it. However, xfail undermines this assumption by running code as part of a failing test that doesn't cause the test suite itself to fail. This means that some lines will be hit by well-executed passing tests, and some lines will be hit as collateral damage from the flailing death throes of an expected failure — and they both show up the same way in the coverage statistics!

Consider the following implementation of FizzBuzz[1]:

def do_fizzbuzz(n: int) -> str | int:
    if (n % 15 == 0):
        return "fizzbuzz"

    if (n % 3 == 0):
        return "fizz"

    if (n % 5 == 0):
        return "buzz"

    return n

Imagine that alongside this implementation you write the following test suite (test_xfail.py), which includes a set of passing tests, but also a test marked xfail: we want the function to raise ValueError when passed a negative number, but that behavior isn't implemented yet.

import pytest

from impl import do_fizzbuzz


@pytest.mark.parametrize("n, expected", [
    (3, "fizz"),
    (9, "fizz"),
    (1, 1),
    (2, 2),
    (14, 14),
    (15, "fizzbuzz"),
])
def test_fizzbuzz(n, expected):
    assert do_fizzbuzz(n) == expected


@pytest.mark.xfail(reason="Not implemented yet")
def test_fizzbuzz_failure():
    """Fizzbuzz should fail when passed a negative number."""
    with pytest.raises(ValueError):
        do_fizzbuzz(-5)  # Foreshadowing: this hits the n % 5 == 0 branch!

Now if I run this with pytest --cov (using the pytest-cov plugin), I'll see that the implementation has 100% code coverage:

====================== test session starts ======================
platform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /tmp/tmp.0w6Zpc3Rb2
plugins: cov-3.0.0
collected 7 items

test_xfail.py ......x                                     [100%]

---------- coverage: platform linux, python 3.10.0-final-0 -----------
Name            Stmts   Miss  Cover  Missing
--------------------------------------------
impl.py             8      0   100%
--------------------------------------------
TOTAL               8      0   100%


================= 6 passed, 1 xfailed in 0.04s ==================

But when we actually fix the code (and remove the corresponding xfail decorator):

def do_fizzbuzz(n: int) -> str | int:
    if n < 0:
        raise ValueError(f"Number must be non-negative, but got {n}")

    if (n % 15 == 0):
        return "fizzbuzz"

    if (n % 3 == 0):
        return "fizz"

    if (n % 5 == 0):
        return "buzz"  # This line is no longer covered!

    return n

You can see that the code coverage goes down:

====================== test session starts ======================
platform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /tmp/tmp.0w6Zpc3Rb2
plugins: cov-3.0.0
collected 7 items

test_xfail.py .......                                     [100%]

---------- coverage: platform linux, python 3.10.0-final-0 -----------
Name      Stmts   Miss  Cover   Missing
---------------------------------------
impl.py      10      1    90%   12
---------------------------------------
TOTAL        10      1    90%


======================= 7 passed in 0.04s =======================

The reason for this is that in my initial testing I failed to add a test case that would hit the (n % 5 == 0) branch, but because my xfailing test happened to hit that branch anyway, I got the erroneous signal that my function had 100% coverage.

What to do about it?

My original solution to this problem was to invoke the tests twice: once with coverage enabled, using pytest -m "not xfail" to run all the tests not marked with xfail, and a second time with coverage disabled, using pytest -m xfail to run all the xfailing tests. The problem with this approach is that for conditional xfails (e.g. "this is a known bug on Windows"), the test still carries the xfail marker even when the condition is not met, so it gets excluded from the coverage run even though it is actually expected to pass. To disable coverage only when the test is genuinely expected to fail, you can write a small pytest_collection_modifyitems hook in your conftest.py that marks actual expected failures as no_cover at collection time:[2]

import pytest


def pytest_collection_modifyitems(items):
    for item in items:
        marker = item.get_closest_marker("xfail")

        # Need to query the args because conditional xfail tests still have
        # the xfail mark even if they are not expected to fail
        if marker and (not marker.args or marker.args[0]):
            item.add_marker(pytest.mark.no_cover)
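
To see why the hook queries marker.args, consider a hypothetical conditional xfail like the one below (reusing the impl module from above): the xfail marker is attached on every platform, but the condition in marker.args[0] is only truthy on Windows, so the hook only disables coverage where the failure is actually expected.

import sys

import pytest

from impl import do_fizzbuzz


@pytest.mark.xfail(sys.platform == "win32", reason="Hypothetical Windows-only bug")
def test_fizzbuzz_conditional():
    # On other platforms this is an ordinary passing test and is measured normally.
    assert do_fizzbuzz(3) == "fizz"

With the hook in place, this test only picks up the no_cover marker on Windows, where its failure would otherwise pollute the coverage data.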

If you are using pytest-cov, the no_cover marker tells it not to measure coverage while that test runs. If you are using coverage.py directly, you can register no_cover as a custom marker and then run pytest with -m "not no_cover" when measuring coverage, as in my original scheme.
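
If you do go the coverage.py-only route, one way to register the custom marker (a sketch; you could equally declare it in your pytest configuration file) is a pytest_configure hook in the same conftest.py, after which something like coverage run -m pytest -m "not no_cover" gives you the filtered coverage run:

def pytest_configure(config):
    # Declare the custom marker so pytest doesn't warn about an unknown mark.
    # pytest-cov already registers no_cover, so this is only needed when
    # measuring coverage with coverage.py directly.
    config.addinivalue_line(
        "markers", "no_cover: exclude this test from coverage measurement"
    )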

On the Stack Overflow question from which this approach is derived, Ned Batchelder, the maintainer of coverage.py, noted that starting with version 5.0, coverage.py can track which tests are covering which lines, and suggested that in the future coverage.py could grow an option to automatically disable coverage of lines only hit during a failing test, though as of 2021-12-13, no one has implemented it.[3]
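
If you want to experiment with that per-test tracking yourself, the feature is called dynamic contexts; a minimal configuration (assuming a .coveragerc file; pytest-cov exposes the same thing via its --cov-context=test option) looks something like this:

# .coveragerc
[run]
dynamic_context = test_function

That per-test information is presumably what an "ignore lines only hit by failing tests" feature would be built on.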

Should I care about this?

In most cases, probably not. If you have an enormous volume of xfail tests and a relatively small test suite this could be a problem, but it actually took some work to contrive a situation that demonstrates the issue.[4] If I had implemented do_fizzbuzz slightly differently, for example, we could easily have had 100% code coverage both with and without the xfail test. Besides, in most cases you won't be writing passing tests and xfailing tests at the same time, so you'll be able to notice any gaps in coverage from the good tests before they can be papered over by failing tests hitting otherwise-uncovered lines.
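
For instance, an implementation along these lines (just one of many possible arrangements) builds the string up incrementally instead of checking n % 15 separately, so the six passing test cases above already execute every line and the xfail test contributes nothing to the coverage numbers:

def do_fizzbuzz(n: int) -> str | int:
    out = ""
    if n % 3 == 0:
        out += "fizz"
    if n % 5 == 0:
        out += "buzz"
    # Fall back to the number itself when neither branch added anything.
    return out or n

With this version there are no coverage gaps left for the xfail test to paper over.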

I imagine the most common scenario where this will make a difference is not in highlighting where new, passing tests could shore up your code coverage, but in finding code that is untested because any tests you added would fail, either because something isn't implemented yet or because there's a bug in it. I think this is valuable information, but you may not care about it.

Another scenario in which it may be useful to set up your coverage instrumentation this way would be if you are using mutation testing, where your code is randomly mutated and errors are reported if your tests don't fail after mutation. In my experience, it is difficult to make mutation testing useful because it tends to report a large number of false positives — which is consistent with what Ned Batchelder reported in his blog post on the subject. One thing I've found that makes it much more useful — particularly in code bases with less than 100% code coverage — is using coverage data to select which lines to mutate. In most cases, lines only covered by xfailing tests will not cause the test suite to fail no matter what you change them to, so they would be a great source of false positives.

In my opinion, setting up your code coverage to exclude xfailing tests is easy enough to do and to maintain that it's worth doing, even for marginal or nebulous benefits. If the maintenance burden were higher, or someone on my team had a strong preference for not doing this, I would probably just give in and let xfailing tests contribute to the code coverage. Hopefully, though, someone will implement this functionality natively in coverage.py and it'll be even easier to do than it is now.[3]

Footnotes

[1] The type hint here makes use of one of the newer features of Python: PEP 604, introduced in Python 3.10, allows writing typing.Union[str, int] as str | int.
[2] You can see how this is implemented in dateutil here.
[3] Maybe you'd like to be the one who implements it, dear reader? If you do, please let me know and I can update this post with a shout-out!
[4] This is not to say that real problems are always easy to boil down into simple examples; oftentimes serious problems require a lot of complex background that makes any minimal example of the behavior in question necessarily contrived. It is possible that this is actually likely to come up a lot in code that has a lot of platform-specific behavior, where adding platform-specific conditional xfails to your codebase might disable testing of large swaths of logic that could easily be tested without hitting whatever bug is causing the tests to fail.

I can say I've never seen a problem like this crop up "in the wild", but I also have not seen widespread adoption of xfail, so the baseline rate is very low.