Optimizing the sieve of Eratosthenes

Algorithm description

The sieve of Eratosthenes is a simple algorithm to calculate all the primes up to a certain amount. While it's not the fastest existing algorithm for that purpose, it's very simple to implement, and much faster than resolving every individual prime separately.

The basic algorithm is as follows (implemented in C++, and assuming that the primality info is collected into a std::bitset or boost::dynamic_bitset named "prime"):

    prime.set();
    for(unsigned long n = 2; n*n < kMaxValue; ++n)
        if(prime[n])
            for(unsigned long index = n*n; index < kMaxValue; index += n)
                prime[index] = false;

The idea is that, starting from 2, if a number has been marked as prime, all of its multiples (up to the maximum value) are unmarked as such. This is quite a simple algorithm.

Note that the outer loop only has to go up to the square root of the maximum value. The reason for this is that every composite number has at least one prime factor that's smaller than its square root (which is a trivial thing to prove and intuitively understand), and therefore it's enough to mark multiples of primes up to the square root of the maximum value to unambiguously mark all composite numbers in the entire range.

It might not be equally trivial to understand why the inner loop can start from the square of the current prime number, but the reasoning is similar.

There's a small optimization that can be immediately applied to the above algorithm, and it's that it needlessly marks all even numbers as composite as its first step. Since this is basically useless work (checking for divisibility by 2 is trivial), the memory consumption and running time can be almost halved with a slight modification:

    prime.set();
    for(unsigned long n = 3; n*n < kMaxValue; n += 2)
        if(prime[n/2])
            for(unsigned long index = n*n; index < kMaxValue; index += n*2)
                prime[index/2] = false;

(Note that if the amount of set bits is now counted in the bitset, it will give the correct amount of primes because the first bit is also set. There's no need to compensate for the lack of even numbers in this case.)

Efficiency problems

There are two major problems with the basic algorithm above: It's not very cache-efficient, and it's hard to parallelize to multiple threads.

If the maximum value is so large that the bit array will not fit inside even the outermost processor cache, it will be quite inefficient. That's because the algorithm just traverses the entire array from beginning to end over and over. The cache locality is extremely poor, and there will be constant cache misses.

Also, the basic algorithm above is very hard to parallelize to multiple threads in an efficient manner. The outer loop cannot be distributed among threads because the inner loop modifies the same data over and over. The different threads would either mess up each other's work, or they would need locking, which would basically kill any performance benefit that could come from multithreading.

The inner loop can be parallelized (because it can work with independent sections of the bit array), but doing so has very little benefit. (According to my experiments, parallelizing the inner loop to 4 cores made the program approximately 25% faster, which is laughably small compared to the potential maximum speedup of 300% given by the three additional cores. It's basically just a huge waste of processor cores.)

Solving both problems with one single solution

In order to take advantage of processor caches, we need to maximize cache locality. This means in English that we need to do as much work as possible on an amount of data that fits the processor's caches, before moving on.

We start by noticing that there's nothing in the algorithm that requires all the multiples of a given prime to be marked on the entire array at once. This means that we can take a smaller section of the array, and mark all the multiples of all the primes on it (ie. we first mark all the multiples of 3 inside that section, then the multiples of 5, and so on). Only after the entire section has been thus marked, do we move to the next section, and do the same.

Of course for this to be possible, we need to first know what those primes are. However, as per the basic algorithm, we only need to know only the first primes up to the square root of the maximum value, we can calculate them in the normal way. In other words, we perform these steps:

Run the basic algorithm up to sqrt(maximum value).
Divide the rest of the array into approximately CPU cache sized sections (or preferably a bit less), and run the algorithm for each such section.

Since the square root of the maximum value is very small, the first step is usually very quick and doesn't need optimization. (For instance, if you were calculating all prime numbers up to 100 billion, its square root would be 316228, which in terms of a bitset with odd numbers only means that the working space is about 19 kilobytes, which is really small and will easily fit even inside the smallest of CPU caches.)

The difference in speed can be quite drastic. For instance, I tested the algorithm for a maximum value of 10 billion on an Intel i5, and the times were:

Basic algorithm: 44.7 seconds.
Modified algorithm: 17.5 seconds.

The modified algorithm is almost three times faster than the original basic algorithm, even though it does basically the exact same amount of work. (The basic working principle of the algorithm is identical. The only difference is the order in which things are done.) This demonstrates quite well how important cache locality can be.

There's another huge, huge advantage to the modified algorithm: The second step can be trivially distributed among threads. Since each segment is calculated independently of each other, it allows for different threads to calculate them without interfering with each other. No locking is needed.

When running on all four cores of the abovementioned computer, I got this result:

Basic algorithm: 44.7 seconds.
Modified algorithm: 17.5 seconds.
Modified algorithm on 4 cores: 5.2 seconds.

That's quite a significant improvement over the original.

Implementation details

(To be written...)