The following terms are used in this text:
B
. A specific value of a byte will be denoted by
enclosing the value in parentheses, for example (25)
.[prefix]
, and thus an arbitrary string will
be denoted with [prefix]B
. A prefix can be empty (which
then means that the string has only one byte). An empty prefix is
denoted with [empty]
.<index>
.The dictionary is an array of strings, or in other words, prefix-byte pairs. Each string in the dictionary is unique, ie. no string appears in the dictionary twice.
The first 256 elements in the dictionary consist of pairs of empty prefixes
and the byte values corresponding to their index in the dictionary. In other
words, the first element of the dictionary is [empty](0)
, the
next one is [empty](1)
and so on, up to [empty](255)
.
(When optimizing the algorithm the dictionary can be initialized with less
entries if the input data uses less than 256 byte values, but for the sake
of simplicity we'll assume in this introductory section that all 256 values
are used.)
All the new strings added to the dictionary will be added to index positions from 256 upwards. Each new string is added to the next unused position in the dictionary.
This is the basic LZW algorithm in pseudocode:
[prefix]
← [empty]
B
← next byte in the input.
[prefix]B
in the dictionary?
[prefix]
← [prefix]B
[prefix]B
to the dictionary.
[prefix]
to the result.
[prefix]
← B
[prefix]
to the result.
Obviously the dictionary cannot grow forever (or else we would at some point run out of memory with very large inputs). For this reason we have to set a maximum size for the dictionary. In practice (as we will see in part 3) we should set a maximum bitsize for the index values of the dictionary (which thus automatically limits the maximum size of the dictionary).
For example, if we set this maximum bitsize to 16 that means that the maximum number of elements in the dictionary will be 65536.
Obviously this maximum bitsize must be larger than the minimum bitsize which, in this introduction, was 8 bits.
What happens when the dictionary gets full? What we have to do is that immediately when the dictionary gets full (ie. a string is stored in the last possible location of the dictionary) we just reset the dictionary and start over (or, in other words, jump to step 1 in the pseudocode above).
Part 2: Optimizing the dictionary
Copyright 2007: Juha Nieminen