Double-Burst-Selection sort

Table of contents

1 Introduction

This page presents a O(n2) sorting algorithm which is faster than insertion sort in some cases (specifically: when there is a lot of repetition in the data to be sorted).

(Strictly speaking this is not an O(n2) algorithm, but an O(n*m) algorithm where n is the number of items in the array to be sorted and m is the number of distinct items in the array, but since the O-notation is used to denote the worst case scenario (which happens when m=n), O(n2) should be correct by all practical means.)

1.1 Why?

Why develop a O(n2) sorting algorithm since O(n log n) sorting algorithms are so much faster, specially with large amounts of data?

The answer is precisely in the "with large amounts of data" part. O(n log n) sorting algorithms usually perform very poorly when the amount of data to be sorted is very small (eg. 10-20 items). For example, with 10 items to sort, insertion sort can be several times faster than for example quicksort.

One could then ask why that matters. If there are so few items to sort, it doesn't take almost any time at all, no matter what the sorting algorithm. However, this does matter when the amount of small groups to sort is very large. For example, if you have a million groups of 10 items and each group needs to be sorted, it's much faster to use insertion sort on each group than eg. quicksort.

This is exactly why the fastest existing sorting methods do not use exclusively a O(n log n), but they usually sort the data only up to a point with that sorting algorithm (eg. when partitioning in quicksort, don't sort any partition which is smaller than 15 items) and then they run insertion sort on the data. This trick of running quicksort only up to a certain partition size and then running insertion sort can speed up a pure quicksort by even 20-40%.

1.2 The meaning of the name

Since there's seldom something truely new under the Sun, it's quite probable that a similar, if not even the exact same algorithm, has already been developed (and named) by someone somewhere. However, since I have never heard of this sorting algorithm or anything similar, I have named it "double-burst-selection sort" at least for the time being.

The reason for using "selection sort" in the name is that this algorithm is somewhat similar in idea.

The idea in selection sort is to find the smallest item in the array and swap the first array item with it. Then this is repeated again for the remaining of the array (ie. everything but the first item) until all items have been placed in this way. If there's more than one smallest item, only one of them will be placed at the beginning.

The idea came to me that this is actually quite an inefficient way of doing it. Why should just one of the smallest items be placed at the beginning of the array when it's possible to do so for all of them with one pass?

This idea made me develop what I call the "burst-selection sort": After each pass the smallest item, or most importantly, if there's more than one smallest item, all of them will end up at the beginning of the array. (For example, if the smallest value in the array is '1' and there are 10 of them in the array, after one pass all of them will be located at the beginning of the array.) When this is repeated for the rest of the array until all the smallest items have been translated to the beginning, the array is sorted.

After that it ocurred to me that why should I restrict the algorithm to just place all the smallest items to the beginning? Why not, in the same pass, also place all the largest items to the end of the array? I discovered that it is, in fact, possible to do so and that it speeds up the sorting by some amount. Hence I called this enhanced variant of the algorithm "double-burst-selection sort".

2 Efficiency comparison

2.1 Comparison with insertion sort

Since insertion sort is one of the fastest (and also simplest) O(n2) sorting algorithms, and usually the algorithm of choice to speed up other sorting algorithms like quicksort, it's logical to compare the double-burst-selection sort with it.

The greatest advantage of this sorting algorithm is that it takes advantage of repetition in the data to be sorted. The more repetitions there are in the data, the faster the algorithm is. As the table below shows, depending on the amount of repetition the double-burst-selection sort algorithm can be several orders of magnitude faster then insertion sort.

However, this algorithm starts losing to insertion sort when there's little or no repetition in the data. Another disadvantage of this algorithm is that it's much harder to understand and implement than insertion sort (a typical implementation of insertion sort takes about 7 lines of code while double-burst-selection sort takes more than 25).

The following table shows a speed comparison between the two algorithms. Both have been implemented as efficiently as possible (eg. the insertion sort does not swap items as the basic algorithm description says, but takes the item to be moved, moves bigger items one place up until the correct place for the current item is found, and the item is then placed there; this saves a lot of unnecessary reads and writes). The program used to make this speed comparison can be found later in this page.

The test program creates an array of random unsigned integer values with differing array sizes and maximum item values (the smaller the maximum item value, the more repetition there is) and measures how many times each sorting algorithm can sort the array within 10 seconds. Naturally the same array was given to both algorithms to sort for fairness. The number under the algorithm column is thus the amount of times the algorithm was able to sort the array (that is, the bigger the number, the better).

Max valueArray sizeInsertionD-B-Selection
10104 859 7334 498 987
10100296 2401 114 643
101 0003 802140 492
1010 0003613 200
100104 867 7413 994 877
100100283 263384 414
1001 0003 67531 487
10010 000322 915
1 000104 754 1454 054 443
1 000100308 434191 426
1 0001 0003 5515 070
1 00010 00033324
10 000104 777 4784 031 111
10 000100292 915183 846
10 0001 0003 6332 574
10 00010 0003246
100 000104 824 1574 222 801
100 000100318 131155 052
100 0001 0003 7132 138
100 00010 0003230

As we can see from the table, the larger the array, the more likely it is for repetitions to happen, which benefits double-burst-selection sort. For example, with an array of 10 000 items with the range of item values being between 0 and 10, double-burst-selection sort is more than 350 times faster than insertion sort. On the other hand, with smaller arrays and less repetitions the former can be quite much slower.

2.2 Speeding up quicksort

As already indicated in the introduction, insertion sort is useful to speed up other sorting algorithms such as quicksort.

In order to compare double-burst-selection sort with insertion sort in a more practical use, I made an efficient implementation of quicksort1 and compared it with the same implementation but enhanced with insertion sort and another enhanced with double-burst-selection sort.

The version of quicksort enhanced with insertion sort skips partitioning partitions with less than 15 items and after it is done runs insertion sort to the whole table. The version enhanced with double-burst-selection sort skips partitioning partitions with less than 30 items and runs double-burst-selection sort to each of these partitions. (The partition sizes were fine-tuned by hand to get the fastest average sorting speed in both cases.)

All three versions of the algorithm were used to sort an array of 100 000 items containing random unsigned integer values with differing maximum values. The value in each algorithm column tells how many times the algorithm was able to sort the table in 10 seconds (thus a bigger number is better):

Max valueQuicksortQS + InsertionQS + D-B-Selection
1 000275399425
10 000256348371
100 000256340308
1 000 000251331266

As we can see from the table, enhancing quicksort with a fast O(n2) algorithm (fast for very small arrays, of course) has a very considerable impact in performance. (As said earlier, this is because quicksort is not very efficient with small partitions.)

We can also see that double-burst-selection sort presents a slight improvement over insertion sort when there's a considerable amount of repetition. When the amount of repetition is minimal, insertion sort starts being considerably faster.

The (C++) source code of the program used to make these measurements can be found here.

1 The quicksort implementation used in the program chooses the pivot item by calculating the median of the first, last and middle items of the partition. This method of choosing the pivot item introduces a considerable average speed performance compared to the regular quicksort (which just takes the first value of the partition as pivot) and even to the randomized quicksort (which chooses a random item as pivot).

3 The algorithm

3.1 Burst-selection sort

First I'll explain the burst-selection sort algorithm, which is the simpler version of the double-burst-selection sort.

As explained in the introduction, the idea of this sorting algorithm is to make a pass over all the items in the array and swap items in such way that after the pass is done the smallest item has been placed at the beginning of the array. If there's more than one smallest item, all of them will be placed there. The same algorithm is repeated for the rest of the array (ie. all the items in the array except the smallest ones at the beginning) until the entire array is sorted.

In pseudocode, the algorithm is the following:

  1. Initialize a minIndex index variable which points to the beginning of the array (where the minimum values will be placed).
  2. Initialize a min variable of the array item type with the value at minIndex. This variable will contain the minimum item value in the array (it may not contain it at first, but will eventually do so).
  3. Initialize a newMinIndex index variable with minIndex+1. This variable contains the location where the found minimum items will be placed during the pass.
  4. For each item in the array from the item at minIndex+1 to the end of the array, if the item is less or equal to min do:
  5. Assign newMinIndex to minIndex.
  6. If minIndex is still inside the array, jump to step 2.

3.2 Double-burst-selection sort

If an item is not a smallest item, it's possible to check to see if it's the largest item and if it is, move it to the end of the array. This way we can achieve two goals at the same time and perform the job faster. This kind of addition to the burst-selection sort makes it somewhat faster (but not twice faster as one could hastily think because the amount of work done does not actually drop to half, but it drops somewhat nevertheless).

This addition makes the algorithm a bit more complex because swapping the current item with the item at the current end of the array may actually bring a smallest item at the current place which means we can't move to the next item in the array just yet (if we did, the algorithm would malfunction in some cases). Also, if a new minimum item is found (and is swapped to the start of the current array section), the value swapped to the current item place may be a largest item, which also means that we can't move to the next item yet. Otherwise the addition can be done basically by replicating the burst-selection sort code in a way that it makes the reverse job.

These additions need to be done to the burst-selection sort algorithm in order to make it a double-burst-selection sort:

The source code of the test program has an implementation of this algorithm in C++ at the beginning.

Copyright 2004 Juha Nieminen

Valid HTML 4.0!