(Last updated 9 Apr 2011.)
Table of contents:
In most systems the default allocator used by C++ (and C) compilers is a very general memory allocator which supports allocating and freeing blocks of memory of any size. The advantage of this is that it's memory-efficient when the sizes of the allocated blocks are very varied throughout the execution of the program.
The disadvantage of this is, however, that the bookkeeping data needed by the default allocator is relatively complicated, so managing all this memory with free-sized blocks is relatively slow. Another problem is that the default allocator tends to be rather cache-unfriendly with repeated allocations and deallocations, as the list of freed blocks gets quite randomized (a very random list of free memory blocks will cause allocations to cause large amounts of cache misses, which can heavily penalize the speed of the allocator).
There are many situations, especially when programming in C++ using the STL data containers, where it would be enough to have an allocator for fixed-sized memory blocks. The advantage of this is that the allocator can be made considerably faster because the bookkeeping is much simpler and it's much easier to make it cache-friendly.
This library provides two allocators which are more optimized for
speed and memory usage. The first one is especially suitable for the
C++ STL data containers, specifically std::list
,
std::set
and std::map
.
This library presents two fixed-size block allocators:
This is a more generic fixed-size block allocator, which is very
suitable for use with the C++ STL data containers which allocate one
element at a time, ie. std::list
, std::set
and
std::map
. It cannot be used with std::vector
,
std::deque
nor std::string
(but on the other
hand the memory usage of those containers is already very good, so they
don't really need a specialized allocator).
The allocator works by allocating contiguous blocks of memory where the allocated elements are stored. The size of the element is determined automatically from the objects to be allocated. Basically an allocator is created for every element size that is used. (On the other hand, if two different data containers use elements of the same size, they will share the same allocator.)
If an entire block of elements is deallocated, then the block will be freed from the main memory as well. Besides the obvious advantage of memory being freed for other uses, this has also speed advantages (because if the block is allocated again, it will be cache-friendly and consecuently very fast).
Each allocated element has an overhead of 4 bytes in 32-bit systems
(8 bytes in 64-bit systems), which is the same as with most default
memory allocators. Thus this allocator does not consume any more memory
than the default one. In fact, it can sometimes allocate even less memory
than the default allocator because many such allocators align allocated
elements to 8-byte boundaries, while FSBAllocator
allocates
to 4-byte boundaries (in 32-bit systems).
The disadvantages of using this allocator are the following:
std::vector
nor std::deque
). On the other
hand, this is usually not a problem.This is very similar to the FSBAllocator
, with the difference
that it never frees memory. The advantage of this is that no bookkeeping data
is needed, and thus there's no overhead and allocated elements take only
as much memory as the size of the element.
This can be especially efficient if lots of very small elements
(such as individual integers) are allocated. For example the default
memory allocator in Linux allocates a minimum of 16 bytes per element,
so if you allocate an individual (4-byte) integer, 16 bytes will be
allocated nevertheless. Using FSBAllocator2
only 4 bytes will be
allocated (in 32-bit systems).
The disadvantage over FSBAllocator
is that since memory is
never freed, this memory cannot be used by anything else and, more
importantly, FSBAllocator2
can be considerably slower than
FSBAllocator
when lots of elements are constantly being
allocated and freed. (As mentioned earlier, if the list of freed elements
gets very randomized, this will cause the allocator to become
cache-unfriendly, causing lots of cache misses, which can penalize it
in speed quite a lot.)
Note, however, that according to my tests FSBAllocator2
is basically never slower than the default allocator, and thus safe to use
speedwise.
Only if all the allocated elements are freed, then
FSBAllocator2
will release all of its memory, becoming
(for the time being) cache-friendly and fast again.
FSBAllocator2
also has a special function for freeing
blocks of memory which only contain deallocated elements. This function
performs a linear sweep through all the blocks and deallocates the ones
which do not contain allocated elements, as well as making the list of
free elements linear. While this sweep can be slow, it will make
subsequent allocations faster.
This allocator is especially efficient for allocating very small elements, such as for example smart pointer reference counters. Details about this are given later in this document.
Note that in simple circumstances FSBAllocator2
can be
even faster than FSBAllocator
. (For example if memory is
only allocated during the execution of the program, and only freed all
at the same time at the end.)
FSBAllocator2
can be used with the STL containers in the
same way as FSBAllocator
, but due to its speed issues this
should only be done if the memory savings are important or in simple
cases, like the above.
The same disadvantages apply as with FSBAllocator
.
To test the speed of FSBAllocator
compared to the default
allocator used by the compiler, the following test was performed on a Linux
system using gcc 4.1.2 on a Pentium4 3.4GHz:
A std::list<int>
was used, and the following operations
where done to it:
push_back()
.erase()
.push_back()
.erase()
.push_back()
.This process was repeated in a loop 10 times.
The benchmark was run for both the default allocator and the FSBallocator.
For comparison, this test was also run using
boost::fast_pool_allocator
. All conditions were otherwise
exactly identical. The results were:
FSBAllocator
: 16 seconds.The same test was repeated using std::set<int>
(obviously using insert()
instead of push_back()
),
and the results were:
FSBAllocator
: 1 minute 10 seconds.(Note that in this case there are far fewer element creations and deletions because of all the repeated values, which is the reason why the times are closer to each other.)
Using the allocator is rather simple. Simply
#include "FSBAllocator.hh"
in the source files where
the allocator is used, and then simply give the allocator to the desired
STL container as template parameter. For example, instead of writing this:
std::list<int> aList;
write this:
std::list<int, FSBAllocator<int> > aList;
That's it. No other modifications are needed.
The same for std::set
would be:
std::set<int, std::less<int>, FSBAllocator<int> > aSet;
If you want to use the allocator directly, as a replacement of
new
and delete
, it can be done like this:
FSBAllocator<YourType> alloc; YourType* anInstance = alloc.allocate(1); // alloc.construct(anInstance, YourType()); // if YourType is a class ... // alloc.destroy(anInstance); // if YourType is a class alloc.deallocate(anInstance, 1);
Note that allocate()
only allocates memory for the object
but doesn't initialize it in any way (similarly to how, for example,
std::malloc()
would do). If your type is a class, you'll have
to initialize and destroy it yourself as demonstrated in the commented
lines or by using placement new directly after allocating and by
calling its destructor explicitly before deallocating.
(Also note that, as stated earlier, arrays cannot be allocated with this allocator, and thus the only valid parameter value is 1.)
FSBAllocator2
can be used in the same way as
FSBAllocator
, although usually it's only recommended for
special cases, like for example as a reference counter pool, described below.
Additionally, FSBAllocator2
offers a special function for
freeing unused memory, which can be used like this:
FSBAllocator2<YourType>().cleanSweep();
This will make the allocator to traverse all the allocated blocks and free the ones which do not contain any allocated elements. It will also make the list of freed elements linear.
Note, however, that there's a catch, which can make this function to
malfunction, and thus it must be used with care: In order to distinguish
which allocated elements are free and which aren't, it fills these freed
elements with an "unused value", which it takes as parameter, and which
by default is a size_t(-1)
. This value must indeed by
unused by the application. If the application has written this value
in some of the allocated elements, this function will malfunction (and
free elements it shouldn't).
For example if this allocator is used as a reference counter pool,
and all reference counts get values greater or equal to 0, then it's
safe to call this function as it is. If the value size_t(-1)
is used by the application for something, then it must provide some other
unused value for this function to use, for example:
FSBAllocator2<size_t>().cleanSweep(size_t(-2));
If unsure, it's better to not to call this function at all.
Note, however, that even though this function can be slow (depending on how much memory has been allocated), it can make subsequent allocations much faster, especially if the list of freed elements has become very randomized.
By default this library is not thread-safe, and thus cannot be used as an allocator for multiple threads. (As long as only one thread uses this allocator and the memory allocated by it, it should be ok.)
The library offers several possibilities for making itself thread-safe. This is done by defining one (and only one) of the following precompiler constants, either in your compiler settings, or before including this library. The possible precompiler constants are:
FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_BOOST
: Use the Boost
Thread library. This will probably require linking the boost thread library
into the project (eg. with -lboost_thread-mt
).
FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_OPENMP
: Use the
OpenMP compiler extension. The OpenMP extension requires specific compiler
support. (With gcc it also requires linking with -fopenmp
)
FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_PTHREAD
: Use the
posix pthread library. (Probably requires the -lpthread
linker
option.)
FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC
: If you are using
the gcc compiler, this uses an extension provided by it (more precisely, the
__sync_bool_compare_and_swap()
function).
FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC_WITH_SCHED
: Like
the previous, but additionally uses the posix <sched.h>
library to better interact with the system task scheduler. This results in
an even more efficient and CPU-friendly locking.
Usage example:
#define FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC_WITH_SCHED #include "FSBAllocator.hh"
Note that enabling thread-safe locking will make the library slower. In fact, in many systems all the locking mechanisms except the last two (ie. the ones which use the GCC extension) may even make the library slower than the default allocator when using multiple threads. They are still offered, especially as a means to make FSBAllocator2 (which offers memory saving) thread-safe in most systems.
The following is a list of running times of the first benchmark mentioned before, using the different locking mechanisms (from fastest to slowest). Also the same benchmark was run using the default allocator.
A different (but somewhat similar) benchmark, using two threads, both of them using the same FSBAllocator type, resulted in the following times (from fastest to slowest):
<sched.h>
: 39 seconds.
Naturally FSBALLOCATOR_USE_THREAD_SAFE_LOCKING_GCC_WITH_SCHED
should be preferred, if the target system supports its requirements.
There are basically three types of smart pointers:
The third type of smart pointer is usually the most efficient: The size of the smart pointer object is that of one single pointer, and it's very fast to copy and assign. Its problem is that it requires for the allocated object to contain a reference counter as member variable, and thus this smart pointer cannot be used with existing types which do not contain such member variable (nor with internal types, such as ints or doubles, obviously).
The first type of smart pointer has the advantage that it can be used with any type of allocated object. In other words, the smart pointer does not require anything from the object in question. However, it has two main problems compared to the intrusive type: Its size is that of two pointers (rather than one), and most importantly, it needs to dynamically allocate the reference counter (which is just an integer type) separately. Such an extra allocation not only slows down the creation and destruction of such smart pointers, but it also wastes memory (typically each allocation consumes 16 bytes of memory).
The second type of smart pointer is a clever variation: Rather than allocating a separate reference counting integer, what it does is that it doubly-links itself with all the other smart pointer instances which are pointing to the same object, effectively creating a doubly-linked list of smart pointers pointing to the same object. This way it has all the advantages of the first smart pointer type, but it doesn't need to allocate anything. Its disadvantage is that its size is that of three pointers.
The doubly-linked smart pointer is more efficient than the non-intrusive reference-counting smart pointer when there are lots of allocated objects with only one (or a few) smart pointer pointing to them, and these smart pointers are not copied/assigned around a lot. In this case the doubly-linked smart pointers consume less memory and are faster to create and destroy.
However, the non-intrusive reference-counting smart pointers start being more efficient if there are lots of smart pointers pointing to the same object, and these smart pointers are copied/assigned around a lot. The more smart pointers point to the same object, the less memory is wasted on the smart pointers themselves, compared to the doubly-linked smart pointers. Also copying and assigning is slightly faster for the reference-counting ones.
The ideal solution (when intrusive smart pointers cannot be used) is to use the non-intrusive reference-counting smart pointers with a reference counter pool.
FSBAllocator2
is ideal for such a reference counter pool.
This is because each allocated counter requires only as much memory as the size
of the counter (4 bytes in 32-systems) and nothing more. Allocating and
deallocating these counters is also faster (sans the problems with
cache-friendliness discussed earlier).
When such a reference counter pool is used, the reference-counting smart pointer loses all of its disadvantages over the doubly-linked smart pointers: Each smart pointer instance allocates at most two pointers and one reference counter (which has the size of a pointer), which makes it equal in memory consumption to the doubly-linked smart pointer. However, if more than one reference-counting smart pointer points to the same object, it immediately is more memory-efficient than the doubly-linked smart pointer.
Also creating and destroying such smart pointers becomes faster,
as the FSBAllocator2
is used for allocating and deallocating the
reference counters.
The "FSBAllocator.hh"
header defines an allocator name
for this specific purpose: FSBRefCountAllocator
(which is
simply a shortcut for FSBAllocator2<size_t>
).
Also an example smart pointer implementation is supplied, as the
"SmartPtr.hh"
file.
To test the memory consumption and speed of the supplied smart pointer
using different allocators, a simple test was run, where 15 million elements
of type double
were dynamically allocated to that many smart
pointers in a vector (and then simply destroyed when the program ends),
with the following four different allocation strategies used:
double
and the reference counter were allocated
using the default allocator of the compiler.
double
was allocated with the default allocator,
but the reference counter was allocated with
FSBRefCountAllocator
.
double
and the reference counter were allocated
using FSBAllocator
.
double
was allocated with FSBAllocator
,
and the reference counter was allocated with
FSBRefCountAllocator
.
The peak memory usage and the execution time of the whole program were the following:
The same test was repeated, but allocating objects of type int
rather than of type double
. The results were:
The memory usage numbers in the above tests are caused by how the linux gcc allocator behaves:
The size of the allocation request plus 4 bytes are
allocated, aligned to an 8-byte boundary, with a minimum allocation of
16 bytes. The size of a double
element is 8 bytes and the
size of an int
element is 4 bytes, but the default allocator
allocates 16 bytes in both cases.
FSBAllocator
has a minimum allocation size of 8 bytes (in
32-bit systems), aligned to a 4-byte boundary. FSBAllocator2
has a minimum allocation size of 4 bytes (in 32-bit systems), aligned to
a 4-byte boundary.
This library is released under the MIT license.
Copyright (c) 2008-2011 Juha Nieminen
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.