Fixed-Size Block Allocator suite for C++

(Last updated 9 Apr 2011.)

Table of contents:

Introduction
Allocators provided by this library

FSBAllocator
FSBAllocator2

Speed benchmark of FSBAllocator
Usage of FSBAllocator
Thread safety
FSBAllocator2 as a reference counter pool
License

Introduction

In most systems the default allocator used by C++ (and C) compilers is a very general memory allocator which supports allocating and freeing blocks of memory of any size. The advantage of this is that it's memory-efficient when the sizes of the allocated blocks are very varied throughout the execution of the program.

The disadvantage of this is, however, that the bookkeeping data needed by the default allocator is relatively complicated, so managing all this memory with free-sized blocks is relatively slow. Another problem is that the default allocator tends to be rather cache-unfriendly with repeated allocations and deallocations, as the list of freed blocks gets quite randomized (a very random list of free memory blocks will cause allocations to cause large amounts of cache misses, which can heavily penalize the speed of the allocator).

There are many situations, especially when programming in C++ using the STL data containers, where it would be enough to have an allocator for fixed-sized memory blocks. The advantage of this is that the allocator can be made considerably faster because the bookkeeping is much simpler and it's much easier to make it cache-friendly.

This library provides two allocators which are more optimized for speed and memory usage. The first one is especially suitable for the C++ STL data containers, specifically std::list, std::set and std::map.

Allocators provided by this library

This library presents two fixed-size block allocators:

FSBAllocator

This is a more generic fixed-size block allocator, which is very suitable for use with the C++ STL data containers which allocate one element at a time, ie. std::list, std::set and std::map. It cannot be used with std::vector, std::deque nor std::string (but on the other hand the memory usage of those containers is already very good, so they don't really need a specialized allocator).

The allocator works by allocating contiguous blocks of memory where the allocated elements are stored. The size of the element is determined automatically from the objects to be allocated. Basically an allocator is created for every element size that is used. (On the other hand, if two different data containers use elements of the same size, they will share the same allocator.)

If an entire block of elements is deallocated, then the block will be freed from the main memory as well. Besides the obvious advantage of memory being freed for other uses, this has also speed advantages (because if the block is allocated again, it will be cache-friendly and consecuently very fast).

Each allocated element has an overhead of 4 bytes in 32-bit systems (8 bytes in 64-bit systems), which is the same as with most default memory allocators. Thus this allocator does not consume any more memory than the default one. In fact, it can sometimes allocate even less memory than the default allocator because many such allocators align allocated elements to 8-byte boundaries, while FSBAllocator allocates to 4-byte boundaries (in 32-bit systems).

The disadvantages of using this allocator are the following:

The memory used by this allocator cannot be shared with anything else in the program which uses dynamically allocated memory.
This allocator supports allocating only individual elements. It cannot be used to allocate arrays (which is why it cannot be used with std::vector nor std::deque). On the other hand, this is usually not a problem.
If the memory allocated using this allocator becomes fragmented, it may in some cases cause the allocator to waste considerable amounts of memory. (This happens mainly if a lot of elements are allocated and then the majority of them are deallocated, but random individual elements here and there aren't.)
This allocator is not thread-safe by default. However, the library offers several possibilities for making it thread-safe with a precompiler constant. See the thread-safety section below.

FSBAllocator2

This is very similar to the FSBAllocator, with the difference that it never frees memory. The advantage of this is that no bookkeeping data is needed, and thus there's no overhead and allocated elements take only as much memory as the size of the element.

This can be especially efficient if lots of very small elements (such as individual integers) are allocated. For example the default memory allocator in Linux allocates a minimum of 16 bytes per element, so if you allocate an individual (4-byte) integer, 16 bytes will be allocated nevertheless. Using FSBAllocator2 only 4 bytes will be allocated (in 32-bit systems).

The disadvantage over FSBAllocator is that since memory is never freed, this memory cannot be used by anything else and, more importantly, FSBAllocator2 can be considerably slower than FSBAllocator when lots of elements are constantly being allocated and freed. (As mentioned earlier, if the list of freed elements gets very randomized, this will cause the allocator to become cache-unfriendly, causing lots of cache misses, which can penalize it in speed quite a lot.)

Note, however, that according to my tests FSBAllocator2 is basically never slower than the default allocator, and thus safe to use speedwise.

Only if all the allocated elements are freed, then FSBAllocator2 will release all of its memory, becoming (for the time being) cache-friendly and fast again.

FSBAllocator2 also has a special function for freeing blocks of memory which only contain deallocated elements. This function performs a linear sweep through all the blocks and deallocates the ones which do not contain allocated elements, as well as making the list of free elements linear. While this sweep can be slow, it will make subsequent allocations faster.

This allocator is especially efficient for allocating very small elements, such as for example smart pointer reference counters. Details about this are given later in this document.

Note that in simple circumstances FSBAllocator2 can be even faster than FSBAllocator. (For example if memory is only allocated during the execution of the program, and only freed all at the same time at the end.)

FSBAllocator2 can be used with the STL containers in the same way as FSBAllocator, but due to its speed issues this should only be done if the memory savings are important or in simple cases, like the above.

The same disadvantages apply as with FSBAllocator.

Speed benchmark of FSBAllocator

To test the speed of FSBAllocator compared to the default allocator used by the compiler, the following test was performed on a Linux system using gcc 4.1.2 on a Pentium4 3.4GHz:

A std::list<int> was used, and the following operations where done to it:

Add 10 million integers (all consecutive integers between 1 and 10 millions) with push_back().
Remove each 3rd element with erase().
Add again the 10 million integers with push_back().
Remove each 7th element with erase().
Add again the 10 million integers with push_back().
Destroy the list.

This process was repeated in a loop 10 times.

The benchmark was run for both the default allocator and the FSBallocator. For comparison, this test was also run using boost::fast_pool_allocator. All conditions were otherwise exactly identical. The results were:

Default allocator: 1 minute 12 seconds.
Boost pool allocator: 57 seconds.
FSBAllocator: 16 seconds.

The same test was repeated using std::set<int> (obviously using insert() instead of push_back()), and the results were:

Default allocator: 2 minutes 12 seconds.
Boost pool allocator: 1 minute 40 seconds.
FSBAllocator: 1 minute 10 seconds.

(Note that in this case there are far fewer element creations and deletions because of all the repeated values, which is the reason why the times are closer to each other.)

Usage of FSBAllocator

Using the allocator is rather simple. Simply #include "FSBAllocator.hh" in the source files where the allocator is used, and then simply give the allocator to the desired STL container as template parameter. For example, instead of writing this:

    std::list<int> aList;

write this:

    std::list<int, FSBAllocator<int> > aList;

That's it. No other modifications are needed.

The same for std::set would be:

    std::set<int, std::less<int>, FSBAllocator<int> > aSet;

If you want to use the allocator directly, as a replacement of new and delete, it can be done like this:

    FSBAllocator<YourType> alloc;
    YourType* anInstance = alloc.allocate(1);
    // alloc.construct(anInstance, YourType()); // if YourType is a class
    ...
    // alloc.destroy(anInstance); // if YourType is a class
    alloc.deallocate(anInstance, 1);

Note that allocate() only allocates memory for the object but doesn't initialize it in any way (similarly to how, for example, std::malloc() would do). If your type is a class, you'll have to initialize and destroy it yourself as demonstrated in the commented lines or by using placement new directly after allocating and by calling its destructor explicitly before deallocating.

(Also note that, as stated earlier, arrays cannot be allocated with this allocator, and thus the only valid parameter value is 1.)

FSBAllocator2

FSBAllocator2 can be used in the same way as FSBAllocator, although usually it's only recommended for special cases, like for example as a reference counter pool, described below.

Additionally, FSBAllocator2 offers a special function for freeing unused memory, which can be used like this:

    FSBAllocator2<YourType>().cleanSweep();

This will make the allocator to traverse all the allocated blocks and free the ones which do not contain any allocated elements. It will also make the list of freed elements linear.

Note, however, that there's a catch, which can make this function to malfunction, and thus it must be used with care: In order to distinguish which allocated elements are free and which aren't, it fills these freed elements with an "unused value", which it takes as parameter, and which by default is a size_t(-1). This value must indeed by unused by the application. If the application has written this value in some of the allocated elements, this function will malfunction (and free elements it shouldn't).

For example if this allocator is used as a reference counter pool, and all reference counts get values greater or equal to 0, then it's safe to call this function as it is. If the value size_t(-1) is used by the application for something, then it must provide some other unused value for this function to use, for example:

    FSBAllocator2<size_t>().cleanSweep(size_t(-2));

If unsure, it's better to not to call this function at all.

Note, however, that even though this function can be slow (depending on how much memory has been allocated), it can make subsequent allocations much faster, especially if the list of freed elements has become very randomized.

Thread safety

By default this library is not thread-safe, and thus cannot be used as an allocator for multiple threads. (As long as only one thread uses this allocator and the memory allocated by it, it should be ok.)

The library offers several possibilities for making itself thread-safe. This is done by defining one (and only one) of the following precompiler constants, either in your compiler settings, or before including this library. The possible precompiler constants are:

FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_BOOST: Use the Boost Thread library. This will probably require linking the boost thread library into the project (eg. with -lboost_thread-mt).
FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_OPENMP: Use the OpenMP compiler extension. The OpenMP extension requires specific compiler support. (With gcc it also requires linking with -fopenmp)
FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_PTHREAD: Use the posix pthread library. (Probably requires the -lpthread linker option.)
FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC: If you are using the gcc compiler, this uses an extension provided by it (more precisely, the __sync_bool_compare_and_swap() function).
FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC_WITH_SCHED: Like the previous, but additionally uses the posix <sched.h> library to better interact with the system task scheduler. This results in an even more efficient and CPU-friendly locking.

Usage example:

#define FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC_WITH_SCHED
#include "FSBAllocator.hh"

Note that enabling thread-safe locking will make the library slower. In fact, in many systems all the locking mechanisms except the last two (ie. the ones which use the GCC extension) may even make the library slower than the default allocator when using multiple threads. They are still offered, especially as a means to make FSBAllocator2 (which offers memory saving) thread-safe in most systems.

The following is a list of running times of the first benchmark mentioned before, using the different locking mechanisms (from fastest to slowest). Also the same benchmark was run using the default allocator.

No locking: 16 seconds.
GCC: 28 seconds. (The sched version makes no difference when only one thread is used.)
OpenMP: 46 seconds.
pthread: 55 seconds.
Boost: 70 seconds.
Default allocator: 72 seconds.

A different (but somewhat similar) benchmark, using two threads, both of them using the same FSBAllocator type, resulted in the following times (from fastest to slowest):

GCC with <sched.h>: 39 seconds.
GCC: 60 seconds.
Default allocator: 96 seconds.
OpenMP: 107 seconds.
pthread: 120 seconds.
Boost: 133 seconds.

Naturally FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC_WITH_SCHED should be preferred, if the target system supports its requirements.

FSBAllocator2 as a reference counter pool

On smart pointers

There are basically three types of smart pointers:

Non-intrusive reference-counting smart pointer.
Non-intrusive doubly-linked smart pointer.
Intrusive reference-counting smart pointer.

The third type of smart pointer is usually the most efficient: The size of the smart pointer object is that of one single pointer, and it's very fast to copy and assign. Its problem is that it requires for the allocated object to contain a reference counter as member variable, and thus this smart pointer cannot be used with existing types which do not contain such member variable (nor with internal types, such as ints or doubles, obviously).

The first type of smart pointer has the advantage that it can be used with any type of allocated object. In other words, the smart pointer does not require anything from the object in question. However, it has two main problems compared to the intrusive type: Its size is that of two pointers (rather than one), and most importantly, it needs to dynamically allocate the reference counter (which is just an integer type) separately. Such an extra allocation not only slows down the creation and destruction of such smart pointers, but it also wastes memory (typically each allocation consumes 16 bytes of memory).

The second type of smart pointer is a clever variation: Rather than allocating a separate reference counting integer, what it does is that it doubly-links itself with all the other smart pointer instances which are pointing to the same object, effectively creating a doubly-linked list of smart pointers pointing to the same object. This way it has all the advantages of the first smart pointer type, but it doesn't need to allocate anything. Its disadvantage is that its size is that of three pointers.

The doubly-linked smart pointer is more efficient than the non-intrusive reference-counting smart pointer when there are lots of allocated objects with only one (or a few) smart pointer pointing to them, and these smart pointers are not copied/assigned around a lot. In this case the doubly-linked smart pointers consume less memory and are faster to create and destroy.

However, the non-intrusive reference-counting smart pointers start being more efficient if there are lots of smart pointers pointing to the same object, and these smart pointers are copied/assigned around a lot. The more smart pointers point to the same object, the less memory is wasted on the smart pointers themselves, compared to the doubly-linked smart pointers. Also copying and assigning is slightly faster for the reference-counting ones.

The ideal solution (when intrusive smart pointers cannot be used) is to use the non-intrusive reference-counting smart pointers with a reference counter pool.

A reference counter pool solution

FSBAllocator2 is ideal for such a reference counter pool. This is because each allocated counter requires only as much memory as the size of the counter (4 bytes in 32-systems) and nothing more. Allocating and deallocating these counters is also faster (sans the problems with cache-friendliness discussed earlier).

When such a reference counter pool is used, the reference-counting smart pointer loses all of its disadvantages over the doubly-linked smart pointers: Each smart pointer instance allocates at most two pointers and one reference counter (which has the size of a pointer), which makes it equal in memory consumption to the doubly-linked smart pointer. However, if more than one reference-counting smart pointer points to the same object, it immediately is more memory-efficient than the doubly-linked smart pointer.

Also creating and destroying such smart pointers becomes faster, as the FSBAllocator2 is used for allocating and deallocating the reference counters.

The "FSBAllocator.hh" header defines an allocator name for this specific purpose: FSBRefCountAllocator (which is simply a shortcut for FSBAllocator2<size_t>).

Also an example smart pointer implementation is supplied, as the "SmartPtr.hh" file.

Benchmark

To test the memory consumption and speed of the supplied smart pointer using different allocators, a simple test was run, where 15 million elements of type double were dynamically allocated to that many smart pointers in a vector (and then simply destroyed when the program ends), with the following four different allocation strategies used:

Both the double and the reference counter were allocated using the default allocator of the compiler.
The double was allocated with the default allocator, but the reference counter was allocated with FSBRefCountAllocator.
Both the double and the reference counter were allocated using FSBAllocator.
The double was allocated with FSBAllocator, and the reference counter was allocated with FSBRefCountAllocator.

The peak memory usage and the execution time of the whole program were the following:

572 MB, 3.6 seconds.
401 MB, 2.2 seconds.
401 MB, 1.0 seconds.
344 MB, 0.8 seconds.

The same test was repeated, but allocating objects of type int rather than of type double. The results were:

572 MB, 3.6 seconds.
401 MB, 2.2 seconds.
344 MB, 0.9 seconds.
287 MB, 0.8 seconds.

The memory usage numbers in the above tests are caused by how the linux gcc allocator behaves:

The size of the allocation request plus 4 bytes are allocated, aligned to an 8-byte boundary, with a minimum allocation of 16 bytes. The size of a double element is 8 bytes and the size of an int element is 4 bytes, but the default allocator allocates 16 bytes in both cases.

FSBAllocator has a minimum allocation size of 8 bytes (in 32-bit systems), aligned to a 4-byte boundary. FSBAllocator2 has a minimum allocation size of 4 bytes (in 32-bit systems), aligned to a 4-byte boundary.

License

This library is released under the MIT license.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.