"charstream"
a stream of characters, as in a data file.
"string" a number of continuous characters, anywhere
from one to very many characters in length. I can specify an
arbitrary string as "[...]K".
"prefix" almost the same as a string, but with the
implication that a prefix immediately precedes a character,
and a prefix can have a length of zero. So, a prefix and a
character make up a string. I will refer to an arbitrary prefix
as "[...]". "root"
a single-character string. For most purposes, this is a character,
but we may occasionally make a distinction. It is [...]K, where
[...] is empty. "code"
a number, specified by a known number of bits, which
maps to a string. "codestream"
the output stream of codes, as in the "raster data".
"entry"
a code and its string. "string
table" a list of entries; usually, but not necessarily,
unique. LZW is a way of compressing
data that takes advantage of repetition of strings in the data.
Since raster data usually contains a lot of this repetition,
LZW is a good way of compressing and decompressing
it. For the moment, lets consider normal LZW encoding
and decoding. GIF's variation on the concept is just an extension
from there. LZW manipulates three objects in both
compression and decompression: the charstream, the codestream,
and the string table. In compression, the charstream
is the input and the codestream is the output.
In decompression, the codestream is the input and
the charstream is the output. The string table
is a product of both compression
and decompression, but is never passed from one to the other.
The first thing we do in LZW compression is
initialize our string table. To do this, we need
to choose a code size (how many bits) and know how many
values our characters can possibly take. Let's say our
code size is 12 bits, meaning we can store 0->FFF, or 4096
entries in our string table. Lets also say
that we have 32 possible different characters. (This corresponds
to, say, a picture in which there are 32 different colors possible
for each pixel.) To initialize the table, we set code#0 to
character#0, code #1 to character#1, and so on, until
code#31 to character#31. Actually, we are specifying
that each code from 0 to 31 maps to a root.
There will be no more entries in the table
that have this property. Now we start compressing data.
Let's first define something called the "current prefix".
It's just a prefix that we'll store things in
and compare things to now and then. I
will refer to it as "[.c.]".
Initially, the current prefix has nothing in it. Let's
also define a "current string", which
will be the current prefix plus the next character
in the charstream. I will refer to the current
string as "[.c.]K", where K is some character.
OK, look at the first character in the charstream.
Call it P. Make [.c.]P the current string.
(At this point, of course, it's just the root P.) Now
search through the string table to see if [.c.]P appears
in it. Of course, it does now, because our string
table is initialized to have all roots. So we don't do anything.
Now make [.c.]P the current prefix. Look at the
next character in the charstream. Call it Q. Add it
to the current prefix to form [.c.]Q, the current string.
Now search through the string table to see
if [.c.]Q appears in it. In this case, of course,
it doesn't. Aha! Now we get to do something. Add [.c.]Q
(which is PQ in this case) to the string table
for code#32, and output the code for [.c.] to the codestream.
Now start over again with the current prefix being just
the root P. Keep adding characters to [.c.] to form [.c.]K,
until you can't find [.c.]K in the string table. Then output
the code for [.c.] and add [.c.]K to the string
table. In pseudo-code, the algorithm goes something
like this: [1] Initialize string
table; [2] [.c.] <- empty;
[3] K <- next character in charstream;
[4] Is [.c.]K in string table?
(yes: [.c.] <- [.c.]K;
go to [3]; )
(no: add [.c.]K to the string table;
output the code for [.c.] to the codestream;
[.c.] <- K;
go to [3]; ) It's as
simple as that! Of course, when you get to step [3] and there
aren't any more characters left, you just output
the code for [.c.] and throw the table away. You're done.
Wanna do an example? Let's pretend we have a four-character
alphabet: A,B,C,D. The charstream looks like ABACABA.
Let's compress it. First, we initialize our string table
to: #0=A, #1=B, #2=C, #3=D. The first character is A,
which is in the string table, so [.c.] becomes A. Next we get
AB, which is not in the table, so we
output code #0 (for [.c.]), and add AB to the string
table as code #4. [.c.] becomes B. Next we get [.c.]A
= BA, which is not in the string table, so output
code #1, and add BA to the string table as code #5.
[.c.] becomes A. Next we get AC, which is not in the string
table. Output code #0, and add AC to the string
table as code #6. Now [.c.] becomes C. Next we get
[.c.]A = CA, which is not in the table. Output #2 for
C, and add CA to table as code#7. Now [.c.] becomes A.
Next we get AB, which IS in the string table, so
[.c.] gets AB, and we look at ABA, which is not
in the string table, so output the code for AB, which
is #4, and add ABA to the string table as code #8. [.c.] becomes
A. We can't get any more characters, so we just
output #0 for the code for A, and we're
done. So, the codestream is #0#1#0#2#4#0. A few words (four)
should be said here about efficiency: use a hashing strategy.
The search through the string table can be computationally
intensive, and some hashing is well worth the effort. Also,
note that "straight LZW" compression runs
the risk of overflowing the string table - getting
to a code which can't be represented in the number of
bits you've set aside for codes. There are several
ways of dealing with this problem, and GIF implements a very
clever one, but we'll get to that. An important
thing to notice is that, at any point
during the compression, if [...]K is in the
string table, [...] is there also. This fact suggests
an efficient method for storing strings in the
table. Rather than store the entire string
of K's in the table, realize that any string
can be expressed as a prefix plus a character: [...]K. If
we're about to store [...]K in the table,
we know that [...] is already there, so we can just store
the code for [...] plus the final character K.
That takes care of compression.
Decompression is perhaps more difficult conceptually,
but it is really easier to program. We again have
to start with an initialized string table. This table comes
from what knowledge we have about the charstream
that we will eventually get, like what possible values the characters
can take. In GIF files, this information is in the
header as the number of possible pixel values.
The beauty of LZW, though, is that this is
all we need to know. We will build the rest of the string table
as we decompress the codestream. The compression is done
in such a way that we will never encounter a code
in the codestream that we can't translate
into a string. We need to define
something called a "current code", which I
will refer to as "<code>", and an
"old-code", which I will refer to as
"<old>". To start things off, look at
the first code. This is now <code>. This code
will be in the intialized string table as the code
for a root. Output the root to the charstream.
Make this code the old-code <old>. *Now look at the next
code, and make it <code>. It is possible that
this code will not be in the string table, but
let's assume for now that it is. Output the string
corresponding to <code> to the codestream. Now find the
first character in the string you just translated. Call this
K. Add this to the prefix [...] generated
by <old> to form a new string [...]K.
Add this string [...]K to the string table, and set the
old-code <old> to the current code <code>.
Repeat from where I typed the asterisk, and you're all set.
Read this paragraph again if you just skimmed it!!!
Now let's consider the possibility that <code>
is not in the string table. Think back to compression,
and try to understand what happens when you have a string
like P[...]P[...]PQ appear in the charstream. Suppose
P[...] is already in the string table, but P[...]P is not. The
compressor will parse out P[...], and find that P[...]P is
not in the string table. It will output the code for P[...],
and add P[...]P to the string table. Then it will get
up to P[...]P for the next string, and find that P[...]P is
in the table, as the code just added. So it will
output the code for P[...]P if it finds that P[...]PQ is
not in the table. The decompressor
is always "one step behind"
the compressor. When the decompressor sees the code
for P[...]P, it will not have added that code to
it's string table yet because it needed the beginning
character of P[...]P to add to the string for the last
code, P[...], to form the code
for P[...]P. However, when a decompressor
finds a code that it doesn't know yet, it will always be
the very next one to be added to the string table. So it can
guess at what the string for the code should be, and,
in fact, it will always be correct. If I am a decompressor,
and I see code#124, and yet my string
table has entries only up to code#123, I can figure
out what code#124 must be, add it to my string table, and output
the string. If code#123 generated the string, which I will refer
to here as a prefix, [...], then code#124, in this special
case, will be [...] plus the first character of
[...]. So just add the first character of [...] to the
end of itself. Not too bad. As an example (and a very common
one) of this special case, let's assume we have a raster image
in which the first three pixels have the same color value.
That is, my charstream looks like: QQQ.... For
the sake of argument, let's say we have 32 colors, and
Q is the color#12. The compressor will generate
the code sequence 12,32,.... (if you don't know why, take
a minute to understand it.) Remember that #32 is not in
the initial table, which goes from #0 to
#31. The decompressor will see #12 and translate it
just fine as color Q. Then it will see #32 and not yet
know what that means. But if it thinks about
it long enough, it can figure out that QQ should be entry#32
in the table and QQ should be
the next string output. So
the decompression pseudo-code goes something like:
[1] Initialize string table;
[2] get first code: <code>; [3]
output the string for <code> to the charstream;
[4] <old> = <code>; [5]
<code> <- next code in codestream;
[6] does <code> exist in the string table?
(yes: output the string for <code> to the charstream;
[...] <- translation for <old>;
K <- first character of translation for <code>;
add [...]K to the string table;
<old> <- <code>; )
(no: [...] <- translation for <old>;
K <- first character of [...];
output [...]K to charstream and add it to string table;
<old> <- <code>
) [7] go to [5]; Again,
when you get to step [5] and there are no more codes,
you're finished. Outputting of strings, and finding of
initial characters in strings are efficiency problems all
to themselves, but I'm not going to suggest ways
to do them here. Half the fun of programming
is figuring these things out! Now for the GIF
variations on the theme. In part of the header of a
GIF file, there is a field, in the Raster Data
stream, called "code size". This is a very
misleading name for the field, but we have to live with
it. What it is really is the "root size".
The actual size, in bits, of
the compression codes actually changes
during compression/decompression, and I will refer to
that size here as the "compression size".
The initial table is just the codes for all the
roots, as usual, but two special codes are
added on top of those. Suppose you have a "code
size", which is usually the number of bits per pixel
in the image, of N. If the number of bits/pixel is one, then
N must be 2: the roots take up slots #0 and #1 in the
initial table, and the two special codes will take up
slots #4 and #5. In any other case, N is the number of
bits per pixel, and the roots take up slots #0 through
#(2**N-1), and the special codes are (2**N) and (2**N + 1).
The initial compression size will be N+1 bits
per code. If you're encoding, you output the codes
(N+1) bits at a time to start with, and if you're decoding,
you grab (N+1) bits from the codestream at a time. As for
the special codes: <CC> or the clear
code, is (2**N), and <EOI>, or end-of-information,
is (2**N + 1). <CC> tells the compressor to re- initialize
the string table, and to reset the compression size to
(N+1). <EOI> means there's no more in the
codestream. If you're encoding or decoding,
you should start adding things to the string table
at <CC> + 2. If you're encoding, you should output
<CC> as the very first code, and then whenever
after that you reach code #4095 (hex FFF), because
GIF does not allow compression sizes to be greater than
12 bits. If you're decoding, you should reinitialize your
string table when you observe <CC>.
The variable compression sizes are really
no big deal. If you're encoding, you start with a compression
size of (N+1) bits, and, whenever you output the code (2**(compression
size)-1), you bump the compression size up one bit.
So the next code you output will be one
bit longer. Remember that the largest
compression size is 12 bits, corresponding to a code of
4095. If you get that far, you must output <CC> as the
next code, and start over. If you're decoding, you must
increase your compression size AS SOON AS YOU write entry #(2**(compression
size) - 1) to the string table. The next code you READ
will be one bit longer. Don't make the mistake of
waiting until you need to add the code (2**compression
size) to the table. You'll have already missed
a bit from the last code. The packaging
of codes into a bitsream for the raster
data is also a potential stumbling block for
the novice encoder or decoder. The lowest
order bit in the code should coincide with the lowest available
bit in the first available byte in the codestream. For
example, if you're starting with 5-bit compression
codes, and your first three codes are, say,
<abcde>, <fghij>, <klmno>, where e,
j, and o are bit#0, then your codestream will start off
like: byte#0: hijabcde
byte#1: .klmnofg so
the differences between straight LZW
and GIF LZW are: two additional special
codes and variable compression sizes.
If you understand LZW, and you understand those variations,
you understand it all! Just as sort of a P.S., you
may have noticed that a compressor has a little bit of
flexibility at compression time. I specified a "greedy"
approach to the compression, grabbing as many characters as
possible before outputting codes. This is, in fact,
the standard LZW way of doing things, and
it will yield the best compression
ratio. But there's no rule saying you can't stop anywhere along
the line and just output the code for the current
prefix, whether it's already in the table or not,
and add that string plus the next character
to the string table. There are various
reasons for wanting to do this, especially
if the strings get extremely long
and make hashing difficult. If you need to,
do it.
|