The Elements of Code

State

The road to programming hell is paved with global variables

Steve McConnell, Code Complete

Rule: Reduce State and Avoid Mutation

Avoid changing the state of variables. Limit the scope of variables as much as possible.

The state of a program consists of all the values within the program at a given point in time. A practical way to think of state is as all the variables (fields, arguments, properties, etc.) and their values. Programs consist of only two things, state and behavior, so knowing how to manage state within a program is a prerequisite for writing clear, maintainable code. A program cannot be said to be understood unless its state, and how that state changes over time, is understood in minute detail. This chapter focuses on that knowledge: we will begin by covering essential terms and definitions, then consider which practices to avoid and which to embrace, and finally review the impacts that state can have on an application. MTC will remain a primary focus of our discussion. Additionally, we will walk through some example code and outline precisely which state changes are taking place.

First, we need to consider the two key concepts in understanding state: scope, which deals with where a variable is usable, and mutability, which deals with how variable values change.

Scope

Scope is the section of code, or context, in which a variable is usable, and is an important way to organize code. It allows us to reuse variable names and create modular, isolated pieces of code.

Let us look at an example in Python:

def log_hello(value):
    hello_msg = f"Hello {value}!"
    print(hello_msg)

In the above function, there are two variables, value and hello_msg, which are scoped to the function. This means if we tried to use those variables outside of that function, there would be an error.
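To make the error concrete, here is a minimal sketch that calls the function and then tries to read hello_msg from outside it. The try/except wrapper and the error_name variable are additions for illustration only; they let the script keep running after the failed lookup.

```python
def log_hello(value):
    hello_msg = f"Hello {value}!"
    print(hello_msg)


log_hello("reader")

# hello_msg exists only inside log_hello, so this lookup fails.
try:
    print(hello_msg)
    error_name = None
except NameError as error:
    # Record the kind of error so we can inspect it afterwards.
    error_name = type(error).__name__
    print(f"{error_name}: {error}")
```

In Python the failure is a NameError: the name is simply not defined in the scope where we tried to use it.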

Lexical and Dynamic

Python, and indeed most languages (Ruby, C, C#, Java, JavaScript, etc.), use lexical scope, also known as static scope. Lexical scope means that a variable’s value is determined by the value assigned in the closest containing block of code, based on where the code is written rather than on how it is called.

There is another type of scope: dynamic scope, where the variable’s value is determined at runtime based on the most recently assigned value within an executing block.

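Let’s examine the difference between lexical and dynamic scope by seeing how the following Perl code would run in each scenario. (Perl itself supports both: my declares a lexically scoped variable, while local, used here, gives a variable dynamic scope.)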

$x = "world!"; # creates a variable 'x'
sub hello {
   print "hello ".$x."\n"; # which 'x' does this refer to?
}

sub hello_universe {
   local $x = "universe!"; # creates another variable 'x'

   hello;
}

hello_universe;

In the above example, x is first defined as “world!”, and then referred to in the hello function.

In a lexically scoped program, this prints “hello world!”, as that is the closest assignment in a containing block of code.

However, in a dynamically scoped program, scope is inherited from the execution context. This means that when hello_universe is called, a new scoped variable x is created. When the hello_universe function calls the hello function, hello is given the scope of its caller, which contains a different x value: “universe!”. Then, when the hello function attempts to resolve x, it uses the closest one, in this case “universe!”, resulting in the output “hello universe!”.

A useful way to think about nested scope is as a map of values. As we work through the code, we’ll examine the values within that map representing the current scope.

Lexical scope:

  1. A variable x is created. Scope: {x: "world!"}
  2. Functions are defined but not yet called. Scope: {x: "world!", hello: {}, hello_universe: {}}
  3. hello_universe is called.
  4. Another variable x is created. Scope: {x: "world!", hello: {}, hello_universe: {x: "universe!"}}
  5. hello is called.
  6. hello attempts to resolve x. It looks in its own scope, and then moves up one level where it finds x is set to “world!”.
  7. hello prints “hello world!”

Dynamic scope:

  1. A variable x is created. Scope: {x: "world!"}
  2. Functions are defined but not yet called. As scope is determined when functions are called rather than defined, the scope does not change. Scope: {x: "world!"}
  3. hello_universe is called. Scope: {x: "world!", hello_universe: {}}
  4. Another variable x is created. Scope: {x: "world!", hello_universe: {x: "universe!"}}
  5. hello is called. Scope: {x: "world!", hello_universe: {x: "universe!", hello: {}}}
  6. hello attempts to resolve x. It looks in its own scope, and then moves up one level where it finds x is set to “universe!”.
  7. hello prints “hello universe!”
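The two walkthroughs can be approximated in Python by modeling each scope as a dictionary and each lookup as a search through a chain of dictionaries. This is only a sketch of the idea, not how Perl or Python resolve names internally; the variable names (global_scope, lexical_chain, and so on) are invented for illustration.

```python
from collections import ChainMap

# Top-level scope: x is bound to "world!".
global_scope = {"x": "world!"}

# Scope created while hello_universe runs: it binds its own x.
hello_universe_scope = {"x": "universe!"}

# Lexical: hello's (empty) scope is nested in the scope where hello was defined.
lexical_chain = ChainMap({}, global_scope)

# Dynamic: hello's (empty) scope is nested in the scope of its caller.
dynamic_chain = ChainMap({}, hello_universe_scope, global_scope)

# A lookup walks the chain from the innermost scope outwards.
lexical_output = "hello " + lexical_chain["x"]
dynamic_output = "hello " + dynamic_chain["x"]

print(lexical_output)  # hello world!
print(dynamic_output)  # hello universe!
```

The only difference between the two chains is which scope sits directly above hello: the one where it was written, or the one from which it was called.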

Dynamic scope is not as popular, since it is more difficult to look at the code and quickly understand which variable is being used. With lexical scope, it is often sufficient to look at the “blocks” surrounding a particular piece of code, and know which variables are within scope.

Global Variables

Variables that are accessible at any point in a program are said to be “global”; that is, they can be used regardless of the namespace, class, module, or function. They are defined in the program’s top-level scope.

Global variables are best avoided. This includes avoiding the “singleton” pattern, where a special, globally available constructor is used to ensure only a single object is ever created. There are three major reasons for this:

  • When global variables are used, it is difficult to determine where or how they are defined. While modern IDEs make this less of a concern, the required tooling is not always available, or sufficient.
  • If the variable needs to be changed, determining the impacts of that change can be quite costly and difficult to reason about, dramatically increasing MTC.
  • As more code refers to the same global variable, the code becomes tightly coupled and fragile. Code cannot be re-arranged, added or removed, since doing so may impact completely unrelated code.

Take for example the following Python, in main_module.py:

a = "hello world!"

def log():
    print(a)

What does the above log function print? You may think it clearly prints “hello world!”, but with further consideration you might get the feeling that this is a trap, since it appears to be using a global variable. You would be correct: it is a trap. In another module, far away inside the program, there is this insidious code:

import main_module

main_module.a = "unexpected value!"

This code has reached inside our module and modified the value of a, causing log to print “unexpected value!”. In fact, Python is better than many languages at managing global scope, because a variable labeled as global is still restricted to the module level. Even so, the value can be modified in unexpected ways, leaving subsequent programmers confused and frustrated.
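One way to sidestep this trap, sketched here under the assumption that the caller can supply the message, is to pass the value as an argument instead of reading a module-level variable. The main function and the message parameter are illustrative names, not part of the original module.

```python
def log(message):
    # The value arrives as an argument, so no other module can swap it out.
    print(message)


def main():
    # greeting is scoped to main; code outside this function cannot reach it.
    greeting = "hello world!"
    log(greeting)


main()
```

With no module-level variable left to reassign, what log prints depends only on what its caller passes in.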

Proximal Variables

Understanding the perils of global variables leads us to use the opposite: proximal variables, sometimes imprecisely referred to as “local” variables. Proximal variables are created closest to the location where they are used, while local variables are created in a given scope, and are considered “local” to that scope (they cannot be accessed outside of it).

For example, here is a use of local, but non-proximal, variables in Python:

def main():
    msg_1 = "count: "
    msg_2 = "value: "
    msg_3 = "index: "

    for i in range(10):
        print(msg_3 + str(i))

    for i in range(100):
        print(msg_1 + str(i))

    for i in range(1000):
        print(msg_2 + str(i))

Let's Get Technical

Often, this style is an artifact of languages like C, where variables needed to be declared at the beginning of the scope block (in this case, the top of the function). Declaring variables this way was required in C89 but not in C99, and the style should be avoided.

In the above example, the msg variables are defined at the top of the function, but they are not used until later in the code. While there is some benefit to grouping variables that are used together for reference, it tends to be clearer, and to reduce MTC, to define them more proximally, that is, close to where they are used. Note how in the first example it is easy to assume they will be used in the order they are declared and named: msg_1, then msg_2, then msg_3. However, defining them proximally to their use makes it obvious this is not the case:

def main():
    msg_3 = "index: "
    for i in range(10):
        print(msg_3 + str(i))

    msg_1 = "count: "
    for i in range(100):
        print(msg_1 + str(i))

    msg_2 = "value: "
    for i in range(1000):
        print(msg_2 + str(i))

In general, try to define variables as proximally as possible.

Mutability

Variables whose values can be changed after creation are called mutable variables. Variables whose values cannot be changed over the course of program execution are called immutable variables. Immutable variables are a form of invariant: a condition that does not vary (or change) throughout execution. Whenever possible, we should prefer immutable variables over mutable ones, because invariants make our program simpler to understand: we do not need to track changes over time. The following section focuses on immutability and how to write immutable code.

There are two general types of immutability: weak and strong.

Weak vs Strong Immutability

Weak immutability is when a reference cannot be changed (variables cannot be reassigned to new values), but data contained within the reference can be. The following is an example of weak immutability.

> const example = {value: 3};
undefined
> example = {id: 1}
Uncaught TypeError: Assignment to constant variable.
> example.value = 4
4

In this JavaScript example, we create a const variable, and then attempt to reassign it, which results in an error.

JavaScript will, however, allow us to mutate the internal data, such as when we set example.value to 4, making its const a form of weak immutability.
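Python has no direct equivalent of const, but a similar shallow guarantee can be sketched with a tuple: its slots cannot be reassigned, yet a mutable object stored in a slot can still change. The reassigned flag below is added purely to record what happened.

```python
# A tuple's slots cannot be reassigned, loosely like a const binding.
example = ({"value": 3},)

try:
    example[0] = {"id": 1}
    reassigned = True
except TypeError:
    # Tuples do not support item assignment.
    reassigned = False

# The dict stored inside the tuple is still mutable.
example[0]["value"] = 4

print(reassigned)           # False
print(example[0]["value"])  # 4
```

The analogy is imperfect, since the name example itself can still be rebound in Python, but it shows the same gap: the container is fixed while its contents are not.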

Strong immutability prevents any change to the data. In fact, languages with strong immutability often lack the syntax to express such changes. Instead, modifications return a new reference to the updated value, which must be rebound to a variable. This is effectively the inverse of the weak immutability we saw in JavaScript. Not all languages support both strong and weak immutability; often, they support either one or the other. JavaScript, for example, does not natively support strong immutability. Elixir, on the other hand, supports strong immutability but allows “rebinding” (similar to variable reassignment in most languages). Here is an example of strong immutability in Elixir: we cannot update our existing reference; we can only rebind it to a new value.

iex> example = Map.new
%{}
iex> Map.put(example, :value, 1)
%{value: 1}
iex> example
%{}
iex> example = Map.put(example, :value, 1)
%{value: 1}
iex> example
%{value: 1}

In the above example, we perform the following steps:

  1. We create a new, empty map and bind it to a label (similar to a variable in other languages), which we name example. The REPL indicates to us the map is empty.
  2. We attempt to put a value into example. Because data in Elixir is immutable, this returns a new map but does not modify example.
  3. We verify that example remains empty; it has not been modified.
  4. We perform the same operation again, but this time we rebind the label example to the return value.
  5. We verify that example now contains the data we added.

Because Elixir does not allow us to modify the internal data of example, and instead we must rebind the label to the new data, Elixir provides strong immutability.
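The same style can be approximated in Python with a frozen dataclass, where an "update" produces a new object that we then rebind. This is a sketch only: Python does not enforce immutability the way Elixir does, and the Example class and the mutated and original_value names are invented for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Example:
    value: int = 0


example = Example()

# Attempting to modify the existing object in place is rejected.
try:
    example.value = 1
    mutated = True
except FrozenInstanceError:
    mutated = False

# replace() builds a new object and leaves the original untouched.
updated = replace(example, value=1)
original_value = example.value

# Rebind the name to the new object, as in the Elixir session above.
example = updated

print(mutated)         # False
print(original_value)  # 0
print(example.value)   # 1
```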

Impact of Mutation

Mutating state is one of the most common sources of bugs. It is difficult to keep track of when and where state is being changed, and by whom.

This is why, even in languages that do not support immutability, it is best to write code that does not change variable state. Although this is not possible all the time, we can usually get close.

def build_request(hostname, duration, value):
    request = {}

    request['hostname'] = hostname
    request['duration'] = duration

    add_record(request, value)

    return request


def add_record(request, value):

    record_id = get_next_id()

    request['record'] = {}
    request['record']['id'] = record_id
    request['record']['value'] = value

In the above Python example, we initially create a dict, and then proceed to add keys into it after creation. Interacting with the variable in this way makes it hard to reason about the structure of the data. It would be simpler to create it all at once, rather than mutate it, as doing so makes the structure obvious. For example:

def build_request(hostname, duration, value):
    return {
        "hostname": hostname,
        "duration": duration,
        "record": build_record(value)
    }

def build_record(value):
    return {
        "id": get_next_id(),
        "value": value
    }

We may even want to combine these two functions, to give even more clarity to the structure we are dealing with:

def build_request(hostname, duration, value):
    return {
        "hostname": hostname,
        "duration": duration,
        "record": {
            "id": get_next_id(),
            "value": value
        }
    }

Let's Get Technical

Immutability can have a cost, usually in the form of performance. Even languages that support immutability as a first-class concept tend to be slower than their lower-level counterparts. For most projects, this difference is small enough as to not be a consideration. However, if performance is critical, the impacts of mutable variables and shared memory should be meticulously considered prior to their implementation. In that case, it should still be possible to avoid mutating the vast majority of the variables.

Of course, decreasing MTC is not the only reason to program this way. Immutability has a large impact on concurrency, as well. Functions that do not mutate state can easily be run in parallel, as they do not need to lock access to their variables to prevent race conditions (i.e. non-deterministic modification of a value by concurrent operations) and the resulting data corruption.
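As a small illustration, a function that only reads its arguments and returns a new value can be handed to a thread pool without any locking. In this sketch, build_record takes the record id as a parameter (a hypothetical variation on the earlier example, since a shared id counter would itself be mutable state), and the values list is made up.

```python
from concurrent.futures import ThreadPoolExecutor


def build_record(record_id, value):
    # Reads only its arguments and returns a new dict; no shared state.
    return {"id": record_id, "value": value}


values = ["a", "b", "c", "d"]

# Each call is independent, so the pool may run them in any order
# without locks; map() still returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(build_record, range(len(values)), values))

print(records)
```

Because no call can observe or change another call's data, the result is the same however the work is scheduled.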

Conclusion

State is a fundamental aspect of programs. If state management is done poorly, or is difficult to understand, it will be almost impossible to write bug-free programs. The unnecessary complexity and confusion created by poor state management inevitably result in the wrong value being updated, in values that should be updated being missed, or, in the case of concurrent operations, in race conditions and data corruption.

To properly manage state, take care to do two things:

  1. Ensure you understand precisely how the state in your program behaves.
  2. Write the program so that step one is simple.

The rules outlined throughout this chapter will help you accomplish this task.