RC4 keystream bias, or parallel processing made easy with for_each/accumulate

RC4 is an obsolete cipher that was widely used just a decade ago, despite the fact that it has an exploitable statistical bias. This post shows how to compute the bias, using the neat modern C++ trick of parallel execution policies for the algorithms library, or for_each(std::execution::par). Complete code is available on GitHub.

std::for_each() applies an operation to each of the elements of a vector; since C++17, we can ask that this operation be multi-threaded by using the optional argument std::execution::par. The function std::accumulate() sums up the elements of a vector, using a left fold. Think map and reduce for for_each and accumulate, respectively. They provide a simple way to perform multi-threaded computations, especially for perfectly parallel problems like computing empirical probability distributions for cryptographic functions.

Motivation: computing RC4 biases

RC4 is a pseudorandom keystream generator whose output is biased. Designed in 1987, it was still widely used in 2011 (according to RFC 6229) because it performed well on low-end processors; it is beautiful in its simplicity. Several research results have characterized the bias in its output, and explored how to exploit them, but the research publications do not provide a complete distribution in machine readable form. To study the impact that this bias has on an application that uses RC4, we need to know the probability distribution of its output bytes. An effective way to do this is to generate many keys at random, compute the keystream output for each key, and track the number times that byte patterns of interest occur. It’s a perfect example of a problem begging for a multi-threaded solution, like the following clean C++17 one.


Suppose that we have a class distribution with an interface like this:

class distribution {
    void compute(size_t num_trials);
    void write_to_file(const char *filename);

    // functions needed by std::accumulate()
    distribution & operator+(const distribution &rhs);
    distribution(const distribution &rhs);

Using the constructor and the first two member functions, it is straightforward to create an empty distribution, compute a bunch of trials, and then write the output to a file. The class stores the byte distribution of all of the computed keystreams; that is, the number of times that the ith byte of the keystream matched each possible value. Let’s assume for now that this class computes the RC4 distribution. On a single core machine, we can’t really do much better than calling distribution::compute and then waiting. But on a 12 core machine, much less a 56 core machine, this single-threaded approach to the problem leaves the vast majority of the hardware’s performance potential untapped.

It is easy to use for_each to spin up multiple threads that will keep all of our cores busy. To do that, we need to create a vector of distributions, then invoke for_each() on that vector by passing in begin() and end() to denote its entire extent. That invocation also uses a lambda that will be applied to each of the elements. In our case, the lambda just invokes the compute() member function. Since our goal is to keep each core busy, we use the function std::thread::hardware_concurrency() to determine how many independent executors we have. Putting it all together, we have:

   unsigned int concurrency = std::thread::hardware_concurrency();
   std::vector<distribution> dist(concurrency);
                  std::begin(dist), std::end(dist),
                  [](distribution &d) { d.compute(num_trials); });


This is really easy; we didn’t need to write any member functions beyond those that we used in the single-threaded case, and we didn’t have to explicitly create and manage threads. However, we still need to combine elements of our vector of distributions into a single, comprehensive result. For that, we need to use std::accumulate(), like this:

    distribution empty{};
    distribution result = std::accumulate(std::begin(dist), 

For this to work, we need to define a operator+() member function that merges together two distributions and a copy constructor for our distribution class. This is no big deal, as we would need to define similar functions to combine our results anyway.


How does it work? Great! Timing tests show that, on a twelve core machine, for_each/accumulate computes keystream very nearly twelve times faster. On Linux, the Resources tab of the System Monitor utility clearly shows that core usage is maxed out. It is a thing of beauty to see all of the hardware getting utilized, while the system is still stable.

The System Monitor shows all eight of the cores on my Framework laptop are maxed out during a run of the keystream_generator utility with concurrency=8.

Compiling and Linking

To run for_each/accumulate, you will of course need the right #include files, which in this case are

#include <thread>     // std::thread::hardware_concurrency()
#include <execution>  // std::for_each()
#include <numeric>    // std::accumulate()    

If you are using g++, you will also need to install the Threading Building Blocks (TBB) library and link your application with -ltbb. G++ versions 9.1 and later support C++17 parallel algorithims through TBB. On Debian/Ubuntu, the development package you need is libtbb-dev, which is installed with the command sudo apt install libtbb-dev. Alternatively, you can install TBB from source.

Complete Code

Complete code that implements a keystream distribution generator utility, including the for_each/accumulate code used in this example, is on GitHub. An example run is captured in the screenshot below. For flexibility, the utility can read in a distribution on startup, and refine that data with new computations, before writing the distribution out into a file; it can also merge together the distributions in two or more files. For fun, I included a progress bar, which writes a colorized bar to the standard error output. To get the progress bar to work with for_each, I used the simple strategy of giving every keystream_distribution the ability to print out a progress bar if configured to do so, and then configuring only a single one to output its progress.


The characteristic RC4 bias shows up in the output, even after a fairly short computation time. The most biased bytes occur at keystream locations that are multiples of the key length, which is 16 in our case.

In cryptanalysis, it is rare to be able to get an useful result in a timescale short enough to be measured with a progress bar! RC4 is an interesting cipher, and a great introduction to cryptography, due to the simplicity of its design and its historical use (and misuse). But don’t use it to encrypt anything if you can avoid it – it has significant insecurities, and has long been disallowed in the TLS and SSH standards.

Copyright © 2022 David McGrew.