Unit Testing
Write tests until fear is transformed into boredom.
Kent Beck, Test-Driven Development: By Example
Unit tests should focus exclusively on ensuring either that complex code operates correctly or that high-level contracts (APIs) are adhered to. Avoid directly testing simple internal functions whose implementation may change.
Unit testing consists of tests that can execute within the same process space as the application. In particular, they are responsible for instantiating parts of the application, then executing those parts and evaluating the results. The ability to do this is powerful: it lets us exercise the internal functionality of the application without relying on the application’s external dependencies. Removing the need for databases, file systems, external application programming interfaces (APIs), and other dependencies greatly simplifies the setup required to verify the program, and allows that verification to run quickly.
Other forms of testing tend to treat the application as a black box: they start the application and then interact with it externally, evaluating its output, but never look at what is happening inside. Unit testing allows us to peek inside the program and ensure all the parts are operating as expected.
There is a caveat, however: no amount of unit testing can replace running an application in a real environment and performing external validation against its features.
As effective as unit testing is, additional forms of testing are still required.
Testable Code
Many of the structural rules we have covered will result in creating testable code: separating builder and business objects, using polymorphism, avoiding state mutation, avoiding if statements, etc.
When writing new tests, the value of these practices becomes apparent.
Imagine the scenario where we fail to follow these rules:
import requests

class Config:
    def __init__(self, url):
        res = requests.get(url)
        self.data = res.json()['config']

class Difficult:
    def __init__(self):
        self.config = Config("https://example.com/config")
When we try to test our Difficult class, how do we instantiate it?
difficult = Difficult() # We made an HTTP request here!
Without intending to, we have made an HTTP request just by trying to construct an object we want to evaluate.
Instead, we should follow the principles from Chapter 4, “New,” and inject our dependencies:
class Config:
    def __init__(self, data):
        self.data = data

    @classmethod
    def from_url(cls, url):
        res = requests.get(url)
        return cls(res.json()['config'])

class Easy:
    def __init__(self, config):
        self.config = config
Constructing the object to test is now trivial:
easy = Easy(Config({"value": 123}))
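With the dependency injected, a complete test needs no network access at all; a minimal sketch:

import unittest

class EasyTest(unittest.TestCase):
    def test_easy_reads_injected_config(self):
        # No HTTP request happens anywhere in this test.
        easy = Easy(Config({"value": 123}))
        self.assertEqual(easy.config.data["value"], 123)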
Test Doubles
Test doubles are simulated components within the application, which let the application function without potential side effects or dependencies on external services or processes. Because this decoupling of the application is required for unit testing, test doubles are a necessary element of writing unit tests.
Test doubles should be used carefully, as they replace real coupling with simulated coupling. This means that code using test doubles will no longer be able to detect if their dependencies have changed in a breaking way. This is exacerbated if the code being tested violates the LoD (Law of Demeter, see chapter 10, “Refactoring”), and is tightly coupled to the internal structure of its dependencies.
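For example, here is a sketch of code whose test double must mirror internal structure (db, connection, and cursor are hypothetical names used only for illustration):

def count_tasks(db):
    # Reaches through db's internals, so any test double for db
    # must also simulate .connection and .cursor(), not just db itself.
    cursor = db.connection.cursor()
    rows = cursor.execute("SELECT id FROM tasks").fetchall()
    return len(rows)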
There are four types of test doubles: dummies, stubs, mocks, and fakes. We will examine them using the following snippet of Python code, by injecting various test doubles as input for the api parameter:
def process(task, api):
    try:
        # Accessing 'values' validates the structure of the task
        values = task['values']
    except KeyError:
        raise MissingValuesException(task)
    api.send(task)
    return api.fetch(task['id'])
Dummies
Dummies are values required by the underlying code (generally through a function signature or class property), but not used by it. They are the simplest of the test doubles, and often a null/nil/None value is sufficient to allow the test to execute without error.
import unittest

class Test(unittest.TestCase):
    def test_process_exc(self):
        task = {}
        with self.assertRaises(MissingValuesException):
            process(task, None)
Let's Get Technical
Passing a null value is not possible in every language; in particular, in strongly typed languages such as TypeScript, the function signature must allow a null value for one to be passed. In such cases, do not modify the function signature only to accommodate the tests. Instead, construct a stub instance (discussed next), and document within the test that the stub is not used.
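In Python terms, a stub standing in as a dummy might look like the following sketch (UnusedAPI is a name we invent for the test):

class UnusedAPI:
    """Stub used as a dummy; this test never exercises it."""
    def send(self, task):
        return None

    def fetch(self, id):
        return []

class Test(unittest.TestCase):
    def test_process_exc(self):
        with self.assertRaises(MissingValuesException):
            process({}, UnusedAPI())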
Here, we pass None as our dummy value for api, since we should not even reach the call to api.send in the code. Using None communicates to the reader that it is an unnecessary value for the case we are testing.
Stubs
Stubs are dummy objects that implement the same methods as “real” objects, but those methods do nothing other than adhere to the required interface.
Assuming we expect api.fetch to return a list of numbers, we create a StubAPI with a fetch function, which adheres to the interface by returning an empty list:
class StubAPI:
    def send(self, task):
        return None

    def fetch(self, id):
        return []
class Test(unittest.TestCase):
    def test_process(self):
        task = {"values": [1, 2, 3], "id": 5}
        ret = process(task, StubAPI())
        self.assertEqual(ret, [])
In this case, the stub adheres to the contract: a list is required, so it returns an empty list, which is the simplest return value possible. Stubs are helpful when the code we are testing requires data that strictly adheres to the contract, but we do not need any introspection beyond that. Often, stubs are created for use in a single unit test, or perhaps a small subset of tests concerned with validating a specific feature.
Mocks
Mocks go one step further than stubs: they additionally record internally what has happened, such that they can be interrogated later by the test to ensure that everything proceeded as expected.
They are most commonly used for database, filesystem, and protocol-based API doubling. They may be used by smaller subsets of unit tests, or across all tests, depending on the generality of the behavior being mocked. In this example, the MockAPI records data about how it was used, such that the test can check that it was used as expected.
class MockAPI:
    def __init__(self):
        self.task_stack = []
        self.id_stack = []

    def send(self, task):
        self.task_stack.append(task)

    def fetch(self, id):
        self.id_stack.append(id)
        return []
class Test(unittest.TestCase):
    def test_process(self):
        mock = MockAPI()
        task = {"values": [1, 2, 3], "id": 5}
        ret = process(task, mock)
        self.assertEqual(ret, [])
        self.assertDictEqual(mock.task_stack[0], task)
        self.assertEqual(mock.id_stack[0], 5)
Notice how, although we can evaluate precisely what was called, and in what order, there is still a disconnect: the returned values will always be empty, regardless of what we pass in.
Fakes
Fakes are the most sophisticated of the test doubles. They simulate the doubled component entirely, such that interacting with them feels identical to interacting with the real thing. Data may be cached locally and then retrieved, updated properly, and internal representations maintained to create the illusion of the actual component.
In this example we create a FakeAPI that internally caches the data it receives, and can then respond with that same data:
class FakeAPI:
    def __init__(self):
        self.values = {}

    def send(self, task):
        self.values[task['id']] = task['values']

    def fetch(self, id):
        return self.values[id]
class Test(unittest.TestCase):
    def test_process(self):
        fake = FakeAPI()
        task = {"values": [1, 2, 3], "id": 5}
        ret = process(task, fake)
        self.assertEqual(ret, [1, 2, 3])
While dummies, stubs, and even occasionally mocks are created for use within a particular unit test, or maybe a small subset of unit tests, fakes are generally created for use in many different unit tests across the application. If you have critical interactions with external services (e.g. databases) that must be rigorously tested, fakes may be the best approach. They have a high initial cost to develop, though, as they must mimic the service precisely, and to remain valid they must also maintain behavioral consistency with the service they are faking.
Unit Test Placement
Unit tests should be placed as close to the code they are testing as possible, while still residing in a separate file (or, in the case of the language Smalltalk, within the same package). For many languages, this means suffixing the source code file name with either .test or _test, and placing it in the same directory as the source being tested. For example, if we have service/api.py, we may have service/api_test.py. There are multiple benefits to this.
- It is easy to find the related source code.
- The import structure for the source code into the test code tends to be simple.
- It is obvious which files have corresponding test files and which do not.
- For some languages, even the process of loading code from the same directory results in exercising the structure of the project, so there is validation that happens for “free.”
Test doubles should be placed within the module or package most closely related to the thing they are doubling. If they are so generic as to be application-wide, they should be placed in a common location accessible by all the tests.
Test Objectives: Temporary vs Permanent
Unit tests have the delightful ability to exercise code in a controlled, targeted manner. This allows programmers to use unit testing somewhat like a REPL (read-evaluate-print-loop); perhaps we could call this a TERL (test-execute-refactor-loop). We can write a test to exercise the behavior of a low-level part of our system, and observe that it behaves as expected.
Additionally, we can use these tests to ensure the behavior of the refactored code remains constant (see chapter 10, “Refactoring”).
The tests that we use to interrogate the low-level functions and behaviors of our application are temporary unit tests. They assist in the development and refactoring process, and must either be removed or converted to satisfy one of the permanent unit test objectives.
Permanent unit tests have only two potential objectives:
- Verify that abnormally complex logic behaves properly.
- Verify the code adheres to its externally-facing contracts.
Unless the unit tests fulfill one of these two objectives, they are extraneous. This means any tests that exclusively execute against internal contracts (private APIs which are not visible, accessible, or meant to be used by the consumers of the code) should not make it into the published, production version of the project. If they do, they will hamper development, increase the likelihood that the tests fail despite the application functioning properly, and cause testing to be considered unreliable.
After all, those sorts of tests will fail if we change the internal structure of the code, even if we still honor the external contract. This increases the work to refactor and improve code, since not only must the code be altered, but the corresponding tests must be updated as well. It also muddies the waters: readers may be unclear on which tests are vital and validate real, important functionality, and which are simple, rote tests that validate nothing related to the purpose of the application.
Here is an example of a Python test with an improper objective:
class Processor:
    def __init__(self, logger):
        self.logger = logger

    def process(self, task):
        self._log(task)
        # Continue on with processing

    def _log(self, task):
        self.logger.info(f"Processing task {task.name}")
class ProcessorTest(unittest.TestCase):
    def test_log(self):
        l = MockLogger()
        p = Processor(l)
        p._log(DummyTask("updateRecords"))
        self.assertEqual(
            l.get_message(0),
            "Processing task updateRecords"
        )
This test may have started out as a temporary test for internal development purposes, but was not removed or converted, and is now a liability.
Imagine we have realized we need to change the internal API of the processor, and decide to offload the logging to another class, TaskLogger. This change has no impact on the publicly available process function, yet our tests will break unexpectedly, because we are validating code that is purely internal. It is then unclear which tests have discovered a legitimate issue, and which have broken because they are unnecessary and testing the internals. Such situations can lead to “failure fatigue,” where test results are ignored because even small changes that are bug-free result in failures.
There is an additional downside to superfluous tests: most unit test frameworks have related coverage tools that indicate which lines of code have been executed and which have not. While not a perfect metric, it is useful to determine a rough level of confidence in the code, since you know how much was able to run without crashing and seemingly returns correct results.
However, writing bad unit tests skews the perception around coverage. They can inflate the coverage value by targeting internal functions and never validating the path that the code takes to actually use those internal functions. This can hide “dead” code, i.e. code that should be removed as it is not actually possible to execute when running the application normally. Alternatively, bad tests can give false confidence that the application has been rigorously tested, when in fact the connections and calling mechanisms are faulty.
The exception to the rule about testing internal contracts is Complexity Tests.
Complexity Tests
Complexity tests are used to verify a particularly complicated bit of internal logic that is difficult to reason about. This difficult logic should not be the result of the programmer introducing accidental complexity, but rather complexity that is inherent to the problem space. Most often, if a temporary test should become a permanent one, it will be converted to a complexity test.
A good rule of thumb to determine if a complexity test is required is to look at the data inputs and outputs. If the function is transforming data in ways that are not simple to reason about, it likely needs a complexity test.
A common example of this is protocols: the code may be implementing a particular protocol specification, and many tests are needed to validate that the protocol is being implemented properly. Parsing is another common example: often raw data needs to be consumed, and there is complicated logic to properly transform it into an internal representation.
from dataclasses import dataclass
from typing import Dict

@dataclass
class User:
    id: int
    name: str
    email: str
    labels: Dict[str, str]

    @classmethod
    def from_string(cls, raw):
        user_id, _, remainder = raw.partition(":")
        attributes = remainder.split(",")
        labels = {}
        for attribute in attributes:
            key, _, value = attribute.partition("=")
            labels[key] = value
        name = labels.pop("name")
        email = labels.pop("email")
        return cls(id=int(user_id), name=name, email=email, labels=labels)
The above code is not highly complex, but is complicated enough to justify the creation of a unit test. To ensure that the code handles corner cases, it would be prudent to call it with a variety of inputs, both valid and invalid, and check that it does the correct thing.
class UserTest(unittest.TestCase):
    def test_from_string(self):
        raw = ("123:name=John,[email protected]," +
               "job=programmer,education=stanford")
        user = User.from_string(raw)
        self.assertEqual(user.id, 123)
        self.assertEqual(user.name, "John")
        self.assertEqual(user.email, "[email protected]")
        self.assertDictEqual(user.labels, {
            "job": "programmer",
            "education": "stanford"
        })
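Invalid inputs deserve the same treatment. A sketch of one such case, assuming we are content with the KeyError that from_string currently raises when a required attribute is missing:

class UserInvalidInputTest(unittest.TestCase):
    def test_from_string_missing_email(self):
        # No "email" attribute, so labels.pop("email") raises KeyError.
        with self.assertRaises(KeyError):
            User.from_string("123:name=John,job=programmer")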
If possible, real-world inputs should be captured, and then incorporated as part of the tests.
Contract Testing
Let's Get Technical
Sometimes, when an application is large enough, it is developed with modules that behave as internal packages. These are high-level subsystems within the application that expose their own contracts, and are often owned by different groups. Despite those sections of code being embedded in the application, they should be treated as packages, and the rules for contract testing are applicable.
Contract testing is specifically about testing the externally facing contracts provided by the package or application.
To determine whether the code is externally facing, see if it is an entry point that will be used by the consumer of the codebase. Those entry points are external API endpoints, user-facing features, or functionality exposed by a package.
We will cover packages and libraries, as well as testing applications intended for machine use, and applications intended for human use.
Packages and Libraries
Packages and libraries are code projects that may be embedded within an application, and are responsible for some particular behavior. For example, we may provide a package with an API for interacting with our automated email service:
class Mailer:
    def __init__(self, server_url: str, sender: str):
        self.server_url = server_url
        self.sender = sender

    def send(self, to, email, cc=None, bcc=None):
        """
        Sends the specified email to the given addresses

        Args:
            to (:obj:`list` of :obj:`str`):
                List of email addresses
            email (:obj:`Email`):
                Email to send
            cc (:obj:`list` of :obj:`str`, optional):
                List of CC email addresses
            bcc (:obj:`list` of :obj:`str`, optional):
                List of BCC email addresses
        """
        # Code here...
Internally, send likely calls other functions, perhaps formatting the email object or validating the address structures, before actually making the SMTP calls to send the email. In this case, send is the contract we are providing. It is the function we expose to the consumers of our codebase, and is therefore the function we should be writing a test for. We should be able to reach any internal functions it calls by varying its input.
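A contract test therefore targets send directly. The following is only a sketch: it assumes, hypothetically, that send raises ValueError for an empty recipient list, and that Email is a simple type from the same package.

import unittest

class MailerContractTest(unittest.TestCase):
    def test_send_requires_recipients(self):
        mailer = Mailer("smtp://mail.example.com", "[email protected]")
        # Hypothetical contract: an empty "to" list is rejected.
        with self.assertRaises(ValueError):
            mailer.send([], Email(subject="Hello", body="..."))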
Of course, the example of a package included by another codebase may seem trivial: we have obvious entry points in the form of functions and classes that belong to the documented API. Applications, on the other hand, are more nuanced. There are two types of applications, each of which require different approaches: those intended for human input, and those intended for machine input.
Machine Applications
Applications intended for computer/automated input, or “machine applications”, often require input that is difficult or even impossible to understand, and produce similarly inscrutable output.
However, they tend to be straightforward to test. As with packages, we can target their provided contracts at the highest reasonable level.
Say we expose an API as JSON over HTTP. Some test frameworks allow us to start the application process, but hook it into internally faked TCP sockets, over which we can send our preconstructed data and evaluate the results. We can gain a lot of confidence and coverage by using this approach. If it is not available, or too cumbersome to implement, we should test the entry point functions we are exposing through the protocol.
For example, most frameworks let us expose endpoints by binding them to entry point functions through routes:
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello_world():
    return "<p>Hello, World!</p>"
In this case, the entry point functions are where we should test:
import unittest

import app

class Test(unittest.TestCase):
    def test_hello_world(self):
        response = app.hello_world()
        self.assertEqual(response, "<p>Hello, World!</p>")
We are assuming that Flask is functioning properly, and binding our route to the correct location. This might be a poor assumption, as frameworks do not always behave as we expect. As stated at the beginning of the chapter, unit tests by themselves are insufficient; functional tests and scenario tests are necessary to catch issues that escape unit tests.
Even so, this will allow us to validate the contracts provided by our own codebase.
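Where the framework supports it, we can also go one level higher. Flask, for instance, ships with a test client that exercises the routing layer over a simulated connection, along the lines of the faked-socket approach described earlier; a minimal sketch, assuming our Flask app lives in app.py:

import unittest

from app import app

class RouteTest(unittest.TestCase):
    def test_hello_world_route(self):
        # The test client sends the request over a simulated
        # connection; no real TCP socket is opened.
        client = app.test_client()
        response = client.get("/")
        self.assertEqual(response.data, b"<p>Hello, World!</p>")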
Human Applications
Let's Get Technical
Command Line Interfaces (CLIs) cross the boundary between machine and human applications, as they are often used by both. They can be tested similarly to machine APIs: by finding the highest entry points, and then invoking those in the unit tests and validating the output.
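A minimal sketch, assuming a hypothetical CLI whose highest entry point is a main(argv) function built on argparse:

import argparse
import unittest

def main(argv):
    # Hypothetical entry point: greets the user named by --name.
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", required=True)
    args = parser.parse_args(argv)
    return f"Hello, {args.name}!"

class CliTest(unittest.TestCase):
    def test_main(self):
        # Invoke the entry point directly and validate its output.
        self.assertEqual(main(["--name", "Ada"]), "Hello, Ada!")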
Human applications are, as the name suggests, intended to process input from humans. Often, they take the form of Graphical User Interfaces (GUIs). For these applications, validation often must happen through frameworks that help with emulating human-like interaction: pressing buttons, clicking the mouse, tapping on the screen, and the like.
Unit testing in these contexts will be highly specific to the framework being used. Consider the state of the application as the user interacts with it:
- What should be visible?
- What values should be updated?
- When can an element be interacted with?
- When are the inputs valid, and how is their invalidity reported?
- Are these interactions meant to be delayed? What amount of delay is acceptable?
- How responsive is this application?
When creating these tests, think about them from the perspective of the user. Ask yourself what sort of experience they are having when they use your application, and how you can ensure through the tests that it functions as they expect.
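As one sketch of what this can look like, consider a tkinter example (the Counter widget is hypothetical); tkinter's invoke() lets a test emulate the user pressing a button:

import tkinter as tk
import unittest

class Counter(tk.Frame):
    # Hypothetical widget: a button that increments a displayed value.
    def __init__(self, master=None):
        super().__init__(master)
        self.value = tk.IntVar(value=0)
        self.label = tk.Label(self, textvariable=self.value)
        self.button = tk.Button(
            self, text="+",
            command=lambda: self.value.set(self.value.get() + 1))

class CounterTest(unittest.TestCase):
    def test_button_press_updates_value(self):
        root = tk.Tk()  # requires a display; headless CI may need a virtual one
        counter = Counter(root)
        counter.button.invoke()  # emulate the user pressing the button
        self.assertEqual(counter.value.get(), 1)
        root.destroy()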
Conclusion
A project with a comprehensive, stable, and correctly-written suite of unit tests is the most relaxing environment in which to work. Every change can be quickly validated, and its impacts seen. Development for such a project is rapid, and can be picked up by even the inexperienced, as they have tests to guide them.
A project with inaccurate, buggy, incorrectly-applied tests is terrible. It is a soul-sucking experience to work in such an environment, as every change breaks some unknown test for some unrelated reason that never reflects a real problem. In those cases, remove the bad tests, write better tests, and move on. Do not let bad tests prevent you from striving for the first scenario.