CP: Concurrency and parallelism
Here, we articulate principles and rules for using the ISO standard C++ facilities for expressing basic concurrency and parallelism.
Threads are the machine-level foundation for concurrent and parallel programming.
- Threads allow running multiple sections of a program independently, while sharing the same memory.
- Concurrent programming is tricky, because protecting shared data between threads is easier said than done.
- Making existing single-threaded code execute concurrently can be as trivial as adding std::async or std::thread strategically, or it can necessitate a full rewrite, depending on whether the original code was written in a thread-friendly way.
The concurrency/parallelism rules in this document are designed with three goals in mind:
- To help in writing code that is amenable to being used in a threaded environment
- To show clean, safe ways to use the threading primitives offered by the standard library
- To offer guidance on what to do when concurrency and parallelism aren't giving the performance gains needed
It is also important to note that concurrency in C++ is an unfinished story.
- C++11 introduced many core concurrency primitives,
- C++14 and C++17 improved on them,
- and there is much interest in making the writing of concurrent programs in C++ even easier.
- We expect some of the library-related guidance here to change significantly over time.
CP.1: Assume that your code will run as part of a multi-threaded program
- It's hard to be certain that concurrency isn't used now or won't be used sometime in the future.
- Code gets reused. Libraries not using threads might be used from some other part of a program that does use threads.
- Note that this rule applies most urgently to library code and least urgently to stand-alone applications. However, over time, code fragments can turn up in unexpected places.
Example, bad

    double cached_computation(int x)
    {
        // bad: these statics cause data races in multi-threaded usage
        static int cached_x = 0;
        static double cached_result = COMPUTATION_OF_ZERO;
        if (cached_x != x) {
            cached_x = x;
            cached_result = computation(x);
        }
        return cached_result;
    }
- Although cached_computation works perfectly in a single-threaded environment, in a multi-threaded environment the two static variables result in data races and thus undefined behavior.
Example, good

    struct ComputationCache {
        int cached_x = 0;
        double cached_result = COMPUTATION_OF_ZERO;

        double compute(int x) {
            if (cached_x != x) {
                cached_x = x;
                cached_result = computation(x);
            }
            return cached_result;
        }
    };
Here the cache is stored as member data of a ComputationCache object, rather than as shared static state. This refactoring essentially delegates the concern upward to the caller: a single-threaded program might still choose to have one global ComputationCache, while a multi-threaded program might have one ComputationCache instance per thread, or one per "context" for any definition of "context."
The refactored function no longer attempts to manage the allocation of cached_x. In that sense, this is an application of the Single Responsibility Principle.
In this specific example, refactoring for thread-safety also improved reusability in single-threaded programs. It's not hard to imagine that a single-threaded program might want two ComputationCache instances for use in different parts of the program, without having them overwrite each other's cached data.
There are several other ways one might add thread-safety to code written for a standard multi-threaded environment (that is, one where the only form of concurrency is std::thread):
- Mark the state variables as thread_local instead of static.
- Implement concurrency control, for example, protecting access to the two static variables with a static std::mutex.
- Refuse to build and/or run in a multi-threaded environment.
- Provide two implementations: one for single-threaded environments and another for multi-threaded environments.
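As a sketch of the first two alternatives: computation and COMPUTATION_OF_ZERO are stand-ins invented here, since the example does not define them.

```cpp
#include <mutex>

double computation(int x) { return x * 2.0; }   // stand-in for the real computation
constexpr double COMPUTATION_OF_ZERO = 0.0;     // stand-in constant

// Alternative 1: thread_local -- each thread gets its own private cache,
// so there is nothing shared and nothing to synchronize.
double cached_computation_tl(int x)
{
    thread_local int cached_x = 0;
    thread_local double cached_result = COMPUTATION_OF_ZERO;
    if (cached_x != x) {
        cached_x = x;
        cached_result = computation(x);
    }
    return cached_result;
}

// Alternative 2: a static std::mutex serializing all access to the shared cache.
double cached_computation_mx(int x)
{
    static std::mutex m;
    static int cached_x = 0;
    static double cached_result = COMPUTATION_OF_ZERO;
    std::scoped_lock lock{m};                   // held for the whole check-then-update
    if (cached_x != x) {
        cached_x = x;
        cached_result = computation(x);
    }
    return cached_result;
}
```

The thread_local version trades memory (one cache per thread) for lock-free access; the mutex version keeps one cache but pays for the lock on every call.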
Exception: Code that is never run in a multi-threaded environment.
- Be careful: there are many examples where code that was "known" to never run in a multi-threaded program was run as part of a multi-threaded program, often years later.
- Typically, such programs lead to a painful effort to remove data races. Therefore, code that is never intended to run in a multi-threaded environment should be clearly labeled as such and ideally come with compile or run-time enforcement mechanisms to catch those usage bugs early.
CP.2: Avoid data races
- Unless you do, nothing is guaranteed to work and subtle errors will persist.
- In a nutshell, if two threads can access the same object concurrently (without synchronization), and at least one is a writer (performing a non-const operation), you have a data race.
- There are many examples of data races that exist, some of which are running in production software at this very moment. One very simple example:
    int get_id()
    {
        static int id = 1;
        return id++;
    }
- The increment here is an example of a data race. This can go wrong in many ways, including:
- Thread A loads the value of id, the OS context switches A out for some period, during which other threads create hundreds of IDs.
- Thread A is then allowed to run again, and id is written back to that location as A's read of id plus one.
- Thread A and B load id and increment it simultaneously. They both get the same ID.
- Local static variables are a common source of data races.
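One conventional fix for get_id is to make the counter atomic, so the read-increment-write happens as one indivisible operation; a minimal sketch:

```cpp
#include <atomic>

// The counter is now atomic: id++ is a single fetch-and-increment,
// so no two threads can observe the same value.
int get_id()
{
    static std::atomic<int> id{1};
    return id++;
}
```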
    void f(fstream& fs, regex pattern)
    {
        array<double, max> buf;
        int sz = read_vec(fs, buf, max); // read from fs into buf
        gsl::span<double> s {buf};
        // ...
        auto h1 = async([&] { sort(std::execution::par, s); }); // spawn a task to sort
        // ...
        auto h2 = async([&] { return find_all(buf, sz, pattern); }); // spawn a task to find matches
        // ...
    }
- Here, we have a (nasty) data race on the elements of buf (sort will both read and write).
- All data races are nasty. Here, we managed to get a data race on data on the stack. Not all data races are as easy to spot as this one.
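One way to remove this particular race is to make sure the reading task cannot overlap the writing task. A simplified sketch, where count_even is a stand-in for find_all invented here to keep the example self-contained:

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Stand-in for find_all: counts elements with an even integer part.
int count_even(const std::vector<double>& v)
{
    int n = 0;
    for (double d : v)
        if (static_cast<long long>(d) % 2 == 0) ++n;
    return n;
}

int f(std::vector<double> buf)
{
    auto h1 = std::async([&] { std::sort(buf.begin(), buf.end()); });
    h1.get();       // wait: no reader may run while sort is writing buf
    auto h2 = std::async([&] { return count_even(buf); });
    return h2.get();
}
```

Serializing the two tasks gives up some concurrency, of course; the alternative is to give each task its own copy of the data.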
    // code not controlled by a lock
    unsigned val;
    if (val < 5) {
        // ... other thread can change val here ...
        switch (val) {
        case 0: // ...
        case 1: // ...
        case 2: // ...
        case 3: // ...
        case 4: // ...
        }
    }
- Now, a compiler that does not know that val can change will most likely implement that switch using a jump table with five entries. Then, a val outside the [0..4] range will cause a jump to an address that could be anywhere in the program, and execution would proceed there.
- Really, "all bets are off" if you get a data race.
- Actually, it can be worse still: by looking at the generated code you might be able to determine where the stray jump will go for a given value; this can be a security risk.
- There are other ways you can mitigate the chance of data races:
- Avoid global data
- Avoid static variables
- More use of concrete types on the stack (and don't pass pointers around too much)
- More use of immutable data (literals, constexpr, and const)
CP.3: Minimize explicit sharing of writable data
- If you don't share writable data, you can't have a data race.
- The less sharing you do,
- the less chance you have to forget to synchronize access (and get data races)
- the less chance you have to wait on a lock (so performance can improve).
- Immutable data can be safely and efficiently shared. No locking is needed: You can't have a data race on a constant.
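For instance, several tasks can share one immutable object with no lock at all; a sketch (total_length is a name invented for illustration):

```cpp
#include <future>
#include <memory>
#include <string>

// Both tasks read through a pointer-to-const: neither can write,
// so concurrent access needs no synchronization.
std::size_t total_length(std::shared_ptr<const std::string> msg)
{
    auto h1 = std::async([msg] { return msg->size(); });
    auto h2 = std::async([msg] { return msg->size(); });
    return h1.get() + h2.get();
}
```

The shared_ptr also solves the lifetime question: the string stays alive until the last task drops its reference.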
CP.4: Think in terms of tasks, rather than threads
- A thread is an implementation concept, a way of thinking about the machine.
- A task is an application notion, something you'd like to do, preferably concurrently with other tasks.
- Application concepts are easier to reason about.
    void some_fun(const std::string& msg)
    {
        std::thread publisher([=] { std::cout << msg; }); // bad: less expressive
                                                          // and more error-prone
        auto pubtask = std::async([=] { std::cout << msg; }); // OK
        // ...
        publisher.join();
    }
- With the exception of async(), the standard-library facilities are low-level, machine-oriented, threads-and-lock level.
- This is a necessary foundation, but we have to try to raise the level of abstraction: for productivity, for reliability, and for performance.
- This is a potent argument for using higher level, more applications-oriented libraries (if possible, built on top of standard-library facilities).
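Tasks pay off when results flow back through futures rather than shared variables; a sketch (task_sum is a hypothetical example function):

```cpp
#include <future>

// Each task returns its result through a future: no shared variable,
// no explicit join, no lock.
int task_sum(int a, int b)
{
    auto f1 = std::async([a] { return a * a; });
    auto f2 = std::async([b] { return b * b; });
    return f1.get() + f2.get();   // get() waits for and retrieves each result
}
```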
CP.8: Don't try to use volatile for synchronization
- In C++, unlike some other languages, volatile does not provide atomicity, does not synchronize between threads, and does not prevent instruction reordering (neither compiler nor hardware).
- It simply has nothing to do with concurrency.
    int free_slots = max_slots; // current source of memory for objects

    Pool* use()
    {
        if (int n = free_slots--) return &pool[n];
    }
- Here we have a problem: This is perfectly good code in a single-threaded program, but have two threads execute this and there is a race condition on free_slots, so that two threads might get the same value of free_slots and thus the same slot.
- That's (obviously) a bad data race, so people trained in other languages might try to fix it like this:
    volatile int free_slots = max_slots; // current source of memory for objects

    Pool* use()
    {
        if (int n = free_slots--) return &pool[n];
    }
- This has no effect on synchronization: The data race is still there!
- The C++ mechanism for this is atomic types:
    atomic<int> free_slots = max_slots; // current source of memory for objects

    Pool* use()
    {
        if (int n = free_slots--) return &pool[n];
    }
- Now the -- operation is atomic, rather than a read-increment-write sequence where another thread might get in-between the individual operations.
- Alternative: Use atomic types where you might have used volatile in some other language. Use a mutex for more complicated examples.
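The example leaves Pool undefined; one plausible mutex-based sketch, where the test and the decrement form a single critical section (the class layout here is an assumption, not the original's):

```cpp
#include <mutex>
#include <vector>

// A mutex suits check-then-act sequences too complex for one atomic:
// the emptiness test and the decrement must happen as a unit.
class Pool {
public:
    explicit Pool(int n) : free_slots(n), slots(n) {}

    double* use()
    {
        std::scoped_lock lock{m};
        if (free_slots == 0) return nullptr;   // no slot left: report failure
        return &slots[--free_slots];           // claim exactly one slot
    }

private:
    std::mutex m;
    int free_slots;
    std::vector<double> slots;
};
```

With a lone atomic, a second thread could slip in between the test and the decrement; the lock makes the pair indivisible.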
CP.9: Whenever feasible use tools to validate your concurrent code
- Experience shows that concurrent code is exceptionally hard to get right and that compile-time checking, run-time checks, and testing are less effective at finding concurrency errors than they are at finding errors in sequential code.
- Subtle concurrency errors can have dramatically bad effects, including memory corruption, deadlocks, and security vulnerabilities.
- Thread safety is challenging, often getting the better of experienced programmers: tooling is an important strategy to mitigate those risks.
- There are many tools "out there", both commercial and open-source tools, both research and production tools.
- Static enforcement tools:
- both clang and some older versions of GCC have some support for static annotation of thread safety properties.
- Consistent use of this technique turns many classes of thread-safety errors into compile-time errors.
- The annotations are generally local (marking a particular member variable as guarded by a particular mutex), and are usually easy to learn.
- However, as with many static tools, it can often present false negatives; cases that should have been caught but were allowed.
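The conventional way to make these annotations portable is a macro that expands to nothing on compilers without the analysis; a sketch following the pattern from Clang's thread-safety documentation:

```cpp
#include <mutex>

// Clang's -Wthread-safety analysis reads attributes such as guarded_by;
// other compilers see an empty macro and ignore the annotation.
#if defined(__clang__)
#define GUARDED_BY(m) __attribute__((guarded_by(m)))
#else
#define GUARDED_BY(m)
#endif

class Counter {
public:
    void bump()
    {
        std::scoped_lock lock{m};   // touching n without holding m is a
        ++n;                        // compile-time diagnostic under clang
    }
    int value()
    {
        std::scoped_lock lock{m};
        return n;
    }
private:
    std::mutex m;
    int n GUARDED_BY(m) = 0;        // n may only be accessed while m is held
};
```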
- Dynamic enforcement tools:
- Clang's Thread Sanitizer (aka TSAN) is a powerful example of dynamic tools: it changes the build and execution of your program to add bookkeeping on memory access, absolutely identifying data races in a given execution of your binary.
- The cost for this is both memory (5-10x in most cases) and CPU slowdown (2-20x).
- Dynamic tools like this are best when applied to integration tests, canary pushes, or unit tests that operate on multiple threads.
- Workload matters: When TSAN identifies a problem, it is effectively always an actual data race, but it can only identify races seen in a given execution.
CP.par: Parallelism
- By "parallelism" we refer to performing a task (more or less) simultaneously ("in parallel with") on many data items.
- Parallelism rule summary:
- Where appropriate, prefer the standard-library parallel algorithms
- Use algorithms that are designed for parallelism, not algorithms with unnecessary dependency on linear evaluation
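For example, std::reduce, unlike std::accumulate, does not promise left-to-right evaluation, which is what lets an implementation combine elements in parallel; passing std::execution::par from <execution> requests that explicitly (omitted in this sketch so it stays dependency-free):

```cpp
#include <numeric>
#include <vector>

// std::reduce may combine elements in any order and grouping;
// with std::execution::par it may also do so on multiple threads.
double sum(const std::vector<double>& v)
{
    return std::reduce(v.begin(), v.end(), 0.0);
}
```

The flip side of the rule: this reordering freedom is only valid when the combining operation is associative and commutative, which is exactly what "no unnecessary dependency on linear evaluation" asks for.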
CP.vec: Vectorization
- Vectorization is a technique for executing a number of tasks concurrently without introducing explicit synchronization.
- An operation is simply applied to elements of a data structure (a vector, an array, etc.) in parallel.
- Vectorization has the interesting property of often requiring no non-local changes to a program.
- However, vectorization works best with simple data structures and with algorithms specifically crafted to enable it.
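A loop in the shape auto-vectorizers handle best has no cross-iteration dependency and walks contiguous data; a sketch (saxpy is a classic illustration, not taken from the text above):

```cpp
#include <cstddef>
#include <vector>

// Each iteration is independent of every other, the data is contiguous,
// and the trip count is known at loop entry: a compiler can map this
// loop directly onto SIMD instructions with no source changes.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y)
{
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] = a * x[i] + y[i];
}
```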