An efficient LZW implementation

1. The basic LZW encoding algorithm

Glossary

The following terms are used in this text:

Byte: The fundamental input data element, having possible values between 0 and 255. An arbitrary byte will be denoted with B. A specific value of a byte will be denoted by enclosing the value in parentheses, for example (25).
String: One or more continuous bytes.
Prefix: All but the last byte in a string. A prefix will be denoted with [prefix], and thus an arbitrary string will be denoted with [prefix]B. A prefix can be empty (which then means that the string has only one byte). An empty prefix is denoted with [empty].
Dictionary: An array of strings (ie. each element of the array contains a string).
Index: The position of a given string in the dictionary. The first index is 0, the next one is 1, etc. An arbitrary index value will be denoted with <index>.
←: Assignment.

The dictionary

The dictionary is an array of strings, or in other words, prefix-byte pairs. Each string in the dictionary is unique, ie. no string appears in the dictionary twice.

The first 256 elements in the dictionary consist of pairs of empty prefixes and the byte values corresponding to their index in the dictionary. In other words, the first element of the dictionary is [empty](0), the next one is [empty](1) and so on, up to [empty](255). (When optimizing the algorithm the dictionary can be initialized with less entries if the input data uses less than 256 byte values, but for the sake of simplicity we'll assume in this introductory section that all 256 values are used.)

All the new strings added to the dictionary will be added to index positions from 256 upwards. Each new string is added to the next unused position in the dictionary.

The basic LZW algorithm

This is the basic LZW algorithm in pseudocode:

Initialize the dictionary (with the first 256 entries).
[prefix] ← [empty]
B ← next byte in the input.
Is the string [prefix]B in the dictionary?
- Yes:
  1. [prefix] ← [prefix]B
- No:
  1. Add the string [prefix]B to the dictionary.
  2. Output the index of [prefix] to the result.
  3. [prefix] ← B
If there are bytes left in the input, jump to step 3.
Else output the index of [prefix] to the result.

Resetting the dictionary

Obviously the dictionary cannot grow forever (or else we would at some point run out of memory with very large inputs). For this reason we have to set a maximum size for the dictionary. In practice (as we will see in part 3) we should set a maximum bitsize for the index values of the dictionary (which thus automatically limits the maximum size of the dictionary).

For example, if we set this maximum bitsize to 16 that means that the maximum number of elements in the dictionary will be 65536.

Obviously this maximum bitsize must be larger than the minimum bitsize which, in this introduction, was 8 bits.

What happens when the dictionary gets full? What we have to do is that immediately when the dictionary gets full (ie. a string is stored in the last possible location of the dictionary) we just reset the dictionary and start over (or, in other words, jump to step 1 in the pseudocode above).

Part 2: Optimizing the dictionary

Return to the main page.