As already noted in part 2, the most crucial component of the encoding algorithm, speed-wise, is the dictionary search, ie. step 4 of the algorithm. The implementation of this step determines whether encoding a hundred megabytes of input takes tens of minutes or just a few seconds. It is also the step with the least obvious "optimal" implementation.
Most fast implementations use a separate data structure for performing the search, usually a hash table, or in some cases perhaps a balanced binary tree. The problem with these solutions is that they don't take any advantage of the internal structure of the dictionary but handle it as a uniform block of data. They also require more memory than would be necessary (especially a balanced binary tree).
The solution presented here starts from a different approach: examining the structure of the dictionary itself and taking advantage of its properties.
This glossary is used:

<prefix index>
: The dictionary index of a string's prefix (ie. the string minus its last byte).

B
: The last byte value of a string.

empty string index
: A special index value denoting "no string" at all.
We start by noting that there can exist at most 256 strings in the dictionary which have the same prefix index. No more than that number of strings with a given prefix can be added to the dictionary (because the only differentiating element in them is the byte value).
From this it follows that to check if a string already exists in the dictionary it would be enough to simply check those strings with the same prefix index to see if any of them has the same byte value as the new string. Checking any other strings in the dictionary would be a useless waste of time.
But how to check only the strings with the same prefix?
The first candidate solution for this task is to link these strings to each other. In other words, each string has an index to the next string which uses the same prefix index. This way they will form a linked list. The last string in the list can have its "next" index point to empty to denote the end of the list.
But how to find the first string which uses the prefix index? The solution is easy: Put the index to this first string in the prefix string. Since we know what this prefix index is, we can get the first list element from there.
Thus in this first candidate solution each string in the dictionary would be formed of these values:

<prefix index>
: The prefix index.

B
: The byte value.

<first>
: The index to the first string using this string as prefix. Initialized to the empty string index.

<next>
: The index to the next string using the same prefix index as this one. Initialized to the empty string index.
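As a sketch, these fields could be expressed in C++ like this (the type name, the index type and the EMPTY sentinel are illustrative assumptions, not part of the original text):

```cpp
#include <cstdint>

// Illustrative index type and "empty string index" sentinel value.
using Index = std::uint32_t;
constexpr Index EMPTY = 0xFFFFFFFFu;

// One dictionary string element for the linked-list solution.
struct StringElement
{
    Index prefixIndex;   // <prefix index>: index of this string's prefix
    std::uint8_t byte;   // B: the byte value
    Index first;         // <first>: first string using this string as prefix
    Index next;          // <next>: next string with the same prefix index
};
```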
Thus searching a string from the dictionary would happen like this:

1. If the <prefix index> of the new string is the index to empty, the string is in the dictionary, and its index is B: thus return B. Else:

2. <index> ← <first> from the string at the <prefix index> of the new string.

3. If <index> is the empty index, or the string at <index> is equal to the new string (ie. their byte values are equal), return <index>.

4. <index> ← <next> from the string at <index>. Jump to step 3.
Thus a return value of the empty string index means that the string was not in the dictionary, else the index to the string in the dictionary is returned.
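The steps above could be sketched in C++ roughly as follows, assuming the dictionary is stored in a std::vector and using the same illustrative names as before (this is a sketch, not the article's actual code):

```cpp
#include <cstdint>
#include <vector>

using Index = std::uint32_t;
constexpr Index EMPTY = 0xFFFFFFFFu;

struct StringElement
{
    Index prefixIndex;
    std::uint8_t byte;
    Index first;  // first string using this string as prefix
    Index next;   // next string with the same prefix index
};

// Returns the dictionary index of the string <prefix, byte>, or
// EMPTY if the string is not in the dictionary.
Index search(const std::vector<StringElement>& dict,
             Index prefix, std::uint8_t byte)
{
    // One-byte strings are always in the dictionary; their index is B.
    if(prefix == EMPTY) return byte;

    // Walk the linked list of strings sharing this prefix index.
    Index index = dict[prefix].first;
    while(index != EMPTY && dict[index].byte != byte)
        index = dict[index].next;
    return index;
}
```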
Adding a new string to the list can be done by putting it at the beginning of the list. In other words, the <next> of the new string is set to point to the <first> of the prefix string, and the latter is set to point to the new string.
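A corresponding sketch of the add operation, with the same illustrative names as above:

```cpp
#include <cstdint>
#include <vector>

using Index = std::uint32_t;
constexpr Index EMPTY = 0xFFFFFFFFu;

struct StringElement
{
    Index prefixIndex;
    std::uint8_t byte;
    Index first;
    Index next;
};

// Appends the string <prefix, byte> to the dictionary, linking it to
// the beginning of its prefix's list. Returns the new string's index.
Index add(std::vector<StringElement>& dict, Index prefix, std::uint8_t byte)
{
    const Index newIndex = Index(dict.size());
    // The <next> of the new string points to the old <first> of the prefix:
    dict.push_back({prefix, byte, EMPTY, dict[prefix].first});
    // ...and the prefix's <first> is set to point to the new string:
    dict[prefix].first = newIndex;
    return newIndex;
}
```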
Although this is not the optimal solution, it already speeds up the searching a lot. This is because at most 256 comparisons will be made regardless of the size of the dictionary. Even if the dictionary had millions of elements, searching an element would still require at most 256 comparisons. (Technically speaking the searching has become an O(1) operation.)
The search can be made even faster by simply adding one additional index value to the string element. The idea is that instead of a linked list we use a (non-balanced) binary tree. Thus the string element will contain these values:
<prefix index>
B
<first>
<left>
: The index to the next string using the same prefix index as this one, and which has a smaller byte value than B. Initialized to the empty string index.

<right>
: The index to the next string using the same prefix index as this one, and which has a larger byte value than B. Initialized to the empty string index.
The idea is very similar to the linked list solution, but now we follow either the <left> or the <right> link depending on whether the B in the new string is smaller or larger (respectively) than the B in the current string. Naturally, if they are equal, then we have found what we were searching for and we can return the index of the current string.
Adding a new string to the tree is simply a question of finding its proper place in the tree (by doing the search above) and then setting the <left> or <right> index of the previous string to point to the new string.
Note that the searching and adding can be done in one single step. In other words, if the searched string is not found then we can add it right there. This way we avoid doing the same search twice.
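Such a combined search-and-add could be sketched in C++ like this (again with illustrative names; note that the parent link is updated through indices rather than a pointer, because push_back may reallocate the vector):

```cpp
#include <cstdint>
#include <vector>

using Index = std::uint32_t;
constexpr Index EMPTY = 0xFFFFFFFFu;

struct StringElement
{
    Index prefixIndex;
    std::uint8_t byte;
    Index first;         // root of the tree of strings using this prefix
    Index left, right;   // tree links, ordered by byte value
};

// Searches for <prefix, byte>. If found, returns its index and sets
// 'added' to false. Otherwise adds the string right there, returning
// its new index and setting 'added' to true.
Index searchOrAdd(std::vector<StringElement>& dict,
                  Index prefix, std::uint8_t byte, bool& added)
{
    added = false;
    if(prefix == EMPTY) return byte;  // one-byte strings always exist

    // Walk the tree, remembering the parent and which of its links
    // (first/left/right) led us onward.
    enum { FIRST, LEFT, RIGHT } link = FIRST;
    Index parent = prefix;
    Index current = dict[prefix].first;
    while(current != EMPTY)
    {
        if(byte == dict[current].byte) return current;  // found it
        parent = current;
        if(byte < dict[current].byte)
            { link = LEFT; current = dict[current].left; }
        else
            { link = RIGHT; current = dict[current].right; }
    }

    // Not found: add the new string where the search ended.
    const Index newIndex = Index(dict.size());
    dict.push_back({prefix, byte, EMPTY, EMPTY, EMPTY});
    if(link == FIRST)      dict[parent].first = newIndex;
    else if(link == LEFT)  dict[parent].left  = newIndex;
    else                   dict[parent].right = newIndex;
    added = true;
    return newIndex;
}
```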
Although this is an unbalanced binary tree the worst case scenario is still a maximum of 256 comparisons, as with the list solution. Practical tests of this algorithm have shown considerable speedups compared to the list solution (encoding usually becomes more than twice as fast).
Part 5: The basic LZW decoding algorithm
Copyright 2007: Juha Nieminen