Subtests in Python

I recently made an ill-advised post on Twitter, wherein I implied that I had a deeply-considered opinion by suggesting that more people should be using the pytest-subtests package, even going so far as to state that "They were the one thing I actually preferred about unittest before pytest added support." For my sins, Brian Okken has asked me to come on the Test and Code podcast to discuss subtests in more detail; I can only surmise he has done this as some sort of didactic exercise to teach me that I should not get all hopped up on Splenda and herbal tea and start spouting off ill-conceived opinions about software testing.

However, when Brian looks over to me with his knowing grin and says, "So, are you ready to talk about subtests?", I plan to say, "Yes, I am prepared - I have prepared extensive notes and references.", and when we are on stage together accepting the Daytime Emmy Award for Best Testing Podcast, I will whisper to him, "I called your bluff, and though I bested you, you truly taught me the value of humility" and he will cry a single tear.

Or, somewhat more likely, he just wanted someone to talk about this particular aspect of Python testing with, and I'm one of the few people he's seen express an opinion on it. Either way, this post will serve as my notes on the subtest feature of unittest introduced in Python 3.4 and some of its advantages, disadvantages and use cases. It is a companion piece to the podcast: Test and Code Episode 111.

Introduction

unittest.TestCase.subTest was originally introduced in Python 3.4 as a lightweight mechanism for test parameterization [1]; it allows you to mark a section of your test as a separate test in its own right using a context manager. The canonical example is testing something in a loop:

def test_loop(self):
    for i in range(5):
        with self.subTest("Message for this subtest", i=i):
            self.assertEqual(i % 2, 0)

Without the self.subTest context manager, this test would fail immediately when i=1 and execution would end, reporting that test_loop has failed. With the context manager, though, the failures in each subTest's context don't cause the test to exit and execution continues. The result of running this test is that you'll see successes reported for 0, 2 and 4 with failures reported for 1 and 3. You can pass any arbitrary keywords to subTest and they will be reported as part of the test failure, e.g.:

______ Test.test_loop [Message for this subtest] (i=1) ___

self = <test.Test testMethod=test_loop>

    def test_loop(self):
        for i in range(5):
            with self.subTest("Message for this subtest", i=i):
>               self.assertEqual(i % 2, 0)
E               AssertionError: 1 != 0

test.py:7: AssertionError

______ Test.test_loop [Message for this subtest] (i=3) ___

...

Why not `pytest.mark.parametrize` or equivalent?

For users of pytest, the parameterization use case of subtests is not terribly compelling, since we already have several mechanisms to do this, including pytest.mark.parametrize, parameterized fixtures, and the much more obscure pytest_generate_tests hook in conftest.py. Even Google's absltest framework (which is mainly minor extensions to unittest) has a parameterization decorator.

I would tend to agree that in general I do not use subtests for test parameterization — I mostly do it when I must use only unittest, such as when developing for the standard library. There are, however, occasionally situations where the subTest form factor offers some advantages even in parameterization. For example, if you have a number of tests you'd like to perform that have an expensive set-up function that builds or acquires an immutable resource that is used in common by all the subtests:

def test_expensive_setup(self):
    resource = self.get_expensive_resource()

    for query, expected_result in self.get_test_cases():
        with self.subTest(query=query):
            self.assertEqual(resource.make_query(query), expected_result)

I'm sure it's possible that you can write a pytest fixture that is scoped such that the resource is acquired only before this particular set of parameterized tests runs and is torn down immediately afterwards, but even if it is possible without any significant contortions, I doubt that the lifetime of the resource will be quite as straightforward to understand for a casual reader or reviewer as the subtest-based approach.

It is also possible for the two approaches to work together harmoniously: use a decorator-based approach to provide canonical "test cases", and then subtests to explore variations on that theme. For example, you could parameterize a test function by the value of its inputs and add in subtests to check multiple properties of the result, such as making sure that both the utcoffset() and tzname() methods of a tzinfo object are right at several datetimes: [2]

from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

import pytest

def datetime_test_cases():
    GMT = ("GMT", timedelta(0))
    BST = ("BST", timedelta(hours=1))
    zi = ZoneInfo("Europe/London")

    return [
        (datetime(2020, 3, 28, 12, tzinfo=zi), GMT),
        (datetime(2020, 3, 29, 12, tzinfo=zi), BST),
        (datetime(2020, 10, 24, 12, tzinfo=zi), BST),
        (datetime(2020, 10, 25, 12, tzinfo=zi), GMT),
    ]

@pytest.mark.parametrize("dt, offset", datetime_test_cases())
def test_europe_london(subtests, dt, offset):
    exp_tzname, exp_utcoffset = offset
    with subtests.test(msg="tzname", dt=dt):
        assert dt.tzname() == exp_tzname

    with subtests.test(msg="utcoffset", dt=dt):
        assert dt.utcoffset() == exp_utcoffset

Here the test is parameterized by value, but it tests each value in two different ways.
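
The same combination can be sketched with unittest alone. The parametrize_subtests decorator below is a hypothetical helper written for this example, not part of unittest or pytest; it runs the decorated method once per case, each inside its own subTest. Fixed-offset timezone objects are used as stand-ins for ZoneInfo so that the sketch does not depend on a time zone database being installed.

```python
import functools
import io
import unittest
from datetime import datetime, timedelta, timezone


def parametrize_subtests(argnames, cases):
    """Hypothetical helper: call the test once per case, each in a subTest."""
    names = [name.strip() for name in argnames.split(",")]

    def decorator(func):
        @functools.wraps(func)  # keeps the test_* name so the loader finds it
        def wrapper(self):
            for case in cases:
                with self.subTest(**dict(zip(names, case))):
                    func(self, *case)
        return wrapper
    return decorator


# Stand-ins for the two offsets used by Europe/London.
GMT = timezone(timedelta(0), "GMT")
BST = timezone(timedelta(hours=1), "BST")


class FixedOffsetTest(unittest.TestCase):
    @parametrize_subtests("dt, exp_tzname, exp_utcoffset", [
        (datetime(2020, 3, 28, 12, tzinfo=GMT), "GMT", timedelta(0)),
        (datetime(2020, 3, 29, 12, tzinfo=BST), "BST", timedelta(hours=1)),
    ])
    def test_offsets(self, dt, exp_tzname, exp_utcoffset):
        # Nested subtests: one per property, inside the per-case subtest.
        with self.subTest("tzname"):
            self.assertEqual(dt.tzname(), exp_tzname)

        with self.subTest("utcoffset"):
            self.assertEqual(dt.utcoffset(), exp_utcoffset)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FixedOffsetTest)
result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)

print(result.testsRun, result.wasSuccessful())
```

Unlike pytest.mark.parametrize, this does not generate separate test items up front; the cases only exist as subtests of a single test method.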

Beyond parameterization

Although there are a few cases where you want a lightweight parameterization mechanism, the times that I've felt that subtests were a missing feature in pytest have little to do with parameterization. One thing that subtests do very well is to let you keep to the spirit of "one assertion per test" when you'd like to explore multiple properties of the state of a system.

For example, let's take a look at this test I wrote for the reference implementation of PEP 615. The PEP describes the creation of a new zoneinfo.ZoneInfo object, which (to simplify the situation a bit) generates singleton objects: to a first approximation, any time you call zoneinfo.ZoneInfo(key) with the same value for key, you should get the same object you got in earlier calls. This also applies to ZoneInfo objects constructed from a pickle, so this is a test to ensure that if we pickle and unpickle a ZoneInfo object, we get the same object back. It starts out simple enough, and I could have written it this way:

def test_cache_hit(self):
    zi_in = ZoneInfo("Europe/Dublin")
    pkl = pickle.dumps(zi_in)
    zi_out = pickle.loads(pkl)

    self.assertIs(zi_in, zi_out)

However, this test inherently is about global state. First I've populated the ZoneInfo cache via the primary constructor, then I've hit it via whatever mechanism pickle.loads uses. What if doing so puts our cache in a weird state? What if only the first cache hit works, and subsequent cache hits do something else? To test for that, I could write a second test:

def test_cache_hit_twice(self):
    zi_in = ZoneInfo("Europe/Dublin")
    pkl = pickle.dumps(zi_in)
    zi_rt = pickle.loads(pkl)
    zi_rt2 = pickle.loads(pkl)

    self.assertIs(zi_rt, zi_rt2)

But you'll note that up until the second pickle.loads call, this is the same test: I'm establishing the same state! If I were to add a self.assertIs(zi_in, zi_rt) into the second test, I'd be able to do both tests at once, but that would violate the "one assertion per test" guideline — I am testing two different things, they should be two different tests. Subtests solve this conundrum by allowing you to mark off sections of your multi-assertion test as being logically separate tests:

def test_cache_hit(self):
    zi_in = ZoneInfo("Europe/Dublin")
    pkl = pickle.dumps(zi_in)

    zi_rt = pickle.loads(pkl)
    with self.subTest("Round-tripped is non-pickled zoneinfo"):
        self.assertIs(zi_in, zi_rt)

    zi_rt2 = pickle.loads(pkl)
    with self.subTest("Round-trip is second round-tripped zoneinfo"):
        self.assertIs(zi_rt, zi_rt2)

Note also that I've excluded the pickle.loads calls from my subtest contexts. This is because if something in a subtest fails, the rest of the test is executed; if zi_in and zi_rt are not identical objects, that doesn't preclude zi_rt and zi_rt2 from being identical objects, so it makes sense to run both tests, but if either zi_rt or zi_rt2 fails to be constructed, tests involving them will necessarily fail.

Downsides of subtests

I do not want to paint an overly rosy picture of the current subtest landscape; I think that there's great potential in the concepts, but the devil is in the details. As I've been using subtests more as part of implementing PEP 615, and more generally in researching this post, I've come across a few fairly reasonable objections to the use of subtests.

Counting tests is weird

When using a decorator-based mechanism for test parameterization, the total number of test cases to run is determined before the first test is run, and so unless you actually change the number of tests, the number of tests reported as having been run is the same every time. With subtests, the concept of what a single "test" is can be weird. Consider the following simple test:

def test_a_loop(subtests):
    for i in range(0, 6, 2):
        with subtests.test(i=i):
            assert i % 2 == 0

This could be considered 3 tests (one for each subtest), or it could be considered 4 tests (one for each subtest and one for the test case itself, which could fail outside of a subtest), or it could be considered a single test, which succeeds or fails based on whether all of the subtests have passed. It seems that pytest and unittest both count this as a single test, because when I run pytest I get 1 passed in 0.01s (though I'll note that I see 4 passing tests when I run pytest -v, so already things are getting complicated). What happens if I change the pattern of failures?

def test_a_loop(subtests):
    for i in range(3):
        with subtests.test(i=i):
            assert i % 2 == 0

Now I get a very strange result: 1 failed, 1 passed in 0.04s. Now we've gone from 1 test to 2 tests, with pytest -v reporting 3 tests passed and 1 failed. Similarly if I simply skip a subtest, they are reported separately from the failures and the passes:

def test_a_loop(subtests):
    for i in range(3):
        with subtests.test(i=i):
            if i == 2:
                pytest.skip()

            assert i % 2 == 0

This results in 1 failed, 1 passed, 1 skipped in 0.04s. Throughout this, pytest -v has consistently reported 4 tests, though, so that's something, but even that's not completely stable, because you could have a failure that occurs outside of subtest context, which would terminate the test early:

def test_a_loop(subtests):
    for i in range(3):
        if i == 1:
            pytest.fail()

        with subtests.test(i=i):
            assert i % 2 == 0

This case is reported in an interesting way; the output of pytest -v looks like this:

test.py::test_a_loop PASSED
test.py::test_a_loop FAILED

But the summary line says 1 failed, 0 passed in 0.04s. So we had a single passing subtest, but the overall test failed, so the "passed" number gets reported as 0.

I don't see this as a major issue because I don't do much with this information and, even if it's a bit confusing, it at least seems to be somewhat consistent in its set of rules, but I can see how this could be problematic for people writing software that presents test dashboards or whatnot. The strangest part of this, to me, is that the overall test is reported as passing or failing based only on the parts of the test not in a subtest, which means that a test consisting entirely of subtests where all subtests fail will still be considered to have passed; but again, this is a minor cosmetic blemish that won't affect most people.

If you have strong feelings about this, there's an open bug on the pytest-subtests repository for discussing what the right behavior is.

It can easily get spammy

Since I am planning for the PEP 615 [3] implementation to eventually be integrated into CPython (and thus I cannot use pytest for parameterization), I've been using subtests for simple parameterization in its test suite, and I have a lot of edge cases to test. This can be very annoying in the fairly common situation where I've introduced a bug that breaks everything, not just one or two edge cases.

This is compounded by the fact that pytest -x seems to stop the test suite only after all subtests have run, rather than on the first subtest failure. This problem is harder to solve than you'd hope because of the "counting tests is weird" problem: what definition of a "test" do you use for --maxfail?

That said, I do not see this as a fundamental problem with subtests; decorator-based parameterization schemes suffer from the same problem — they just have the UI advantage that it's easier to effectively communicate "stop after one test fails". In both cases, some self-restraint in the proliferation of test cases and a decent UI can make this mostly a non-issue.

Poor interactions with other features

I've already mentioned that pytest -x (and pytest --maxfail) have some basic UI issues with subtests, but there are many other testing tools and features that are not equipped to handle subtests. For example, another issue with pytest-subtests is that currently pytest --pdb doesn't seem to work when a subtest fails. Similarly, I'm finding that pytest.xfail() doesn't work in a subtest either.

I've also found that unittest.TestCase.subTest does not work with hypothesis, but (aside from a warning that I consider mostly erroneous), the subtests fixture provided by pytest-subtests seems to work OK (though don't run pytest -v, because it will generate a lot of subtests).

Even within the standard library, there are some creaky issues. For example, in Python 3.8.1:

def test_loop(self):
    for i in range(5):
        with self.subTest(i=i):
            if i % 2:
                self.skipTest("Skipping odd")
            self.assertEqual(i % 2, 0)

I would expect this to report some passing tests and some skipped tests, like I see with pytest-subtests, but in fact I see only skipped tests:

test.py::Test::test_loop SKIPPED    [100%]
test.py::Test::test_loop SKIPPED    [100%]

============ 2 skipped in 0.02s ==========

Clearly this feature needs work on further integration, but I see this mostly as a symptom of the fact that subtests are a somewhat little-known feature and not yet widely used, so bugs like this go unreported and unfixed. The pytest-subtests repo only has (at the time of this writing) 51 stars on GitHub and is still in early stages of development [4]. With greater adoption and more contributions, I would expect to see many of these wrinkles ironed out.

Summary

This is my longest and most rambling blog post to date, and it covers a lot of ground, so I thought that I should sum up the things I've covered in the post.

Some key use cases I've identified:

Allows test parametrization if you cannot use pytest, or if you want to generate some test cases in a way that is scoped to the test you are currently running.
When it is expensive to acquire a resource but you want to probe more than one of its properties, subtests provide an easy to use and easy to understand mechanism to logically separate out the tests.
If you would like to probe the state of a system as it evolves, subtests allow you to make non-fatal assertions, so you can get the benefits of multiple tests without the boilerplate of setting up a lot of redundant state.

The biggest downsides I've identified:

The concept of counting "a test" or "a failure" gets a lot fuzzier.
Easy to get spammed with a million results if the failures are correlated.
Subtests tend to present edge cases to a lot of other testing tools and features, and as a result there are still a decent number of bugs in the implementations.

I will say that I am optimistic about the future for subtests, whether it's the standard library implementation, pytest-subtests or other related non-fatal failure testing libraries I haven't even talked about like pytest-check.

There are real upsides to using subtests and none of the downsides seem terribly fundamental; with greater adoption of the concepts and some iteration on the tools providing these capabilities, I could see subtests becoming a normal part of one's testing arsenal.

It's worth noting that as of this writing pytest-subtests is still in early development (it's version 0.3.0 on PyPI), and with some additional work, a lot of the rough edges could be smoothed out. If this blog post or the corresponding podcast episode has gotten you interested in subtests, perhaps contributing to the plugin is a good way to start getting involved!

Footnotes

[1]	You can read the original discussion on Python's issue tracker at bpo-16997. A decent fraction of it is about some of the details of the implementation, but there's also some interesting discussion in there.

[2]	When using pytest.mark.parametrize, I have to switch from using unittest.TestCase to using the subtests fixture from pytest-subtests, because the way the parametrize mark works is incompatible with unittest-style test cases. It is possible to write a parameterization decorator that is compatible with unittest-style test cases, though, so please don't take this to be some sort of fundamental incompatibility between the approaches.

[3]	Have I mentioned I'm working on PEP 615 enough yet?

[4]	Note that the creator of the module, Bruno Oliveira, has mentioned on Twitter that it still "needs some love", and in his review of an early draft of this post he suggested that I add a disclaimer to that effect.