Text Compression with the Burrows-Wheeler Transform


Text compression is an important and interesting computer application. If you master the program given on this page you'll become familiar with several common ideas in compression, and become acquainted with an amazing and elegant algorithm: the Burrows-Wheeler Transform. This is not my algorithm (I wish!), though the source code implementation and some of the details are mine.

You can compile this source and get a genuine compressor and decompressor. It will be slower than gzip (though fast enough for most use) and have a slightly inferior compression ratio, but just getting in the same ``ballpark'' as gzip is already remarkable, given the simple (and unoptimized) nature of this code.

The heart of this compression method is something called the ``Burrows-Wheeler Transform (BWT)'' which is one of those remarkable inventions that makes the study of algorithms a pleasure. The BWT was discovered many years ago but has never been widely publicized. Perhaps a certain major technology company could never quite find a use for it, but didn't want competitors to share the secret. If so I can sympathize with the inventors thus denied their due fame. (Perhaps Burrows or Wheeler would care to comment.)

The main source reference to the Burrows-Wheeler Transform appears to be ``A Block--sorting Lossless Data Compression Algorithm'' (SRC Research Report 124) by M. Burrows and D. J. Wheeler.

The BWT can be defined with an example. Consider the 51-character text fragment:


in the jingle jangle morning I'll go following you 

Let's build a 51 x 51 array with all possible rotations of that string:


in the jingle jangle morning I'll go following you 
n the jingle jangle morning I'll go following you i
 the jingle jangle morning I'll go following you in
the jingle jangle morning I'll go following you in 
he jingle jangle morning I'll go following you in t
e jingle jangle morning I'll go following you in th
 jingle jangle morning I'll go following you in the
jingle jangle morning I'll go following you in the 
ingle jangle morning I'll go following you in the j
ngle jangle morning I'll go following you in the ji
gle jangle morning I'll go following you in the jin
le jangle morning I'll go following you in the jing
e jangle morning I'll go following you in the jingl
 jangle morning I'll go following you in the jingle
jangle morning I'll go following you in the jingle 
angle morning I'll go following you in the jingle j
ngle morning I'll go following you in the jingle ja
gle morning I'll go following you in the jingle jan
le morning I'll go following you in the jingle jang
e morning I'll go following you in the jingle jangl
 morning I'll go following you in the jingle jangle
morning I'll go following you in the jingle jangle 
orning I'll go following you in the jingle jangle m
rning I'll go following you in the jingle jangle mo
ning I'll go following you in the jingle jangle mor
ing I'll go following you in the jingle jangle morn
ng I'll go following you in the jingle jangle morni
g I'll go following you in the jingle jangle mornin
 I'll go following you in the jingle jangle morning
I'll go following you in the jingle jangle morning 
'll go following you in the jingle jangle morning I
ll go following you in the jingle jangle morning I'
l go following you in the jingle jangle morning I'l
 go following you in the jingle jangle morning I'll
go following you in the jingle jangle morning I'll 
o following you in the jingle jangle morning I'll g
 following you in the jingle jangle morning I'll go
following you in the jingle jangle morning I'll go 
ollowing you in the jingle jangle morning I'll go f
llowing you in the jingle jangle morning I'll go fo
lowing you in the jingle jangle morning I'll go fol
owing you in the jingle jangle morning I'll go foll
wing you in the jingle jangle morning I'll go follo
ing you in the jingle jangle morning I'll go follow
ng you in the jingle jangle morning I'll go followi
g you in the jingle jangle morning I'll go followin
 you in the jingle jangle morning I'll go following
you in the jingle jangle morning I'll go following 
ou in the jingle jangle morning I'll go following y
u in the jingle jangle morning I'll go following yo
 in the jingle jangle morning I'll go following you

Now let's reorder the rows so they are in sorted order:


 I'll go following you in the jingle jangle morning
 following you in the jingle jangle morning I'll go
 go following you in the jingle jangle morning I'll
 in the jingle jangle morning I'll go following you
 jangle morning I'll go following you in the jingle
 jingle jangle morning I'll go following you in the
 morning I'll go following you in the jingle jangle
 the jingle jangle morning I'll go following you in
 you in the jingle jangle morning I'll go following
'll go following you in the jingle jangle morning I
I'll go following you in the jingle jangle morning 
angle morning I'll go following you in the jingle j
e jangle morning I'll go following you in the jingl
e jingle jangle morning I'll go following you in th
e morning I'll go following you in the jingle jangl
following you in the jingle jangle morning I'll go 
g I'll go following you in the jingle jangle mornin
g you in the jingle jangle morning I'll go followin
gle jangle morning I'll go following you in the jin
gle morning I'll go following you in the jingle jan
go following you in the jingle jangle morning I'll 
he jingle jangle morning I'll go following you in t
in the jingle jangle morning I'll go following you 
ing I'll go following you in the jingle jangle morn
ing you in the jingle jangle morning I'll go follow
ingle jangle morning I'll go following you in the j
jangle morning I'll go following you in the jingle 
jingle jangle morning I'll go following you in the 
l go following you in the jingle jangle morning I'l
le jangle morning I'll go following you in the jing
le morning I'll go following you in the jingle jang
ll go following you in the jingle jangle morning I'
llowing you in the jingle jangle morning I'll go fo
lowing you in the jingle jangle morning I'll go fol
morning I'll go following you in the jingle jangle 
n the jingle jangle morning I'll go following you i
ng I'll go following you in the jingle jangle morni
ng you in the jingle jangle morning I'll go followi
ngle jangle morning I'll go following you in the ji
ngle morning I'll go following you in the jingle ja
ning I'll go following you in the jingle jangle mor
o following you in the jingle jangle morning I'll g
ollowing you in the jingle jangle morning I'll go f
orning I'll go following you in the jingle jangle m
ou in the jingle jangle morning I'll go following y
owing you in the jingle jangle morning I'll go foll
rning I'll go following you in the jingle jangle mo
the jingle jangle morning I'll go following you in 
u in the jingle jangle morning I'll go following yo
wing you in the jingle jangle morning I'll go follo
you in the jingle jangle morning I'll go following 

Each row of this array has as much information as the entire array and (if accompanied by a small integer telling us which rotation to use) provides the original text string.

Each column contains the same 51 characters as each row, but they seem to be in a scrambled useless order. Let's show the first, middle and last columns of this sorted array, left-to-right instead of top-to-bottom:

First column:


         'Iaeeefggggghiiiijjllllllmnnnnnnooooortuwy

Middle column:


 gjnolgg htlf'nlneowiIi g ll  ie j nilgouneyoao rnm

Final column:


golueeengI jlhl nnnn t nwj  lgg'ol iiiiargfmylo oo 

The first column contains the 51 characters but they are ``clumped.'' The way we constructed the array they will always be maximally clumped. Clumped data can be compressed better than unclumped data, but this sequence of the 51 characters won't tell us much: the original string could be any of its permutations.

The middle column doesn't exhibit clumping. It almost seems like a random permutation of the 51 letters (although of course it isn't really random at all).

The final column does exhibit clumping -- notice the four consecutive n's and four consecutive i's -- though not nearly as much as the first column. This clumping wasn't forced by the sorting, but arises from the repetition of digraphs in English. The clumps of n's and i's arise from ng and in digraphs in the words in, jingle, jangle, morning, and following.

The key idea of the Burrows-Wheeler invention is this: knowledge of just this final column allows us to reconstruct the entire 51 x 51 array (and hence the original text string).

Thus the Burrows-Wheeler Transform will replace the string


in the jingle jangle morning I'll go following you 
with

(22, golueeengI jlhl nnnn t nwj  lgg'ol iiiiargfmylo oo )

(The 22 tells the decompressor to output the 22nd row of the reconstructed sorted array.)

In the source code below, I use a 32768 x 32768 array instead of 51 x 51. This array would be far too big of course, but the preceding discussion was just conceptual and the software need not actually build such an array physically. To prove that this transform is reversible, and thus useful for compression, we must display an inverse coding procedure. You can undo the transform step by step and finish with an impossibly expensive nest of sorting steps; however Burrows and Wheeler also discovered a straightforward, if non-obvious, procedure to invert the transform in almost linear time.

Referring to the example again, the compressor replaced a 51-character string with 52 characters (the index 22 must be transmitted also). There's no compression yet, but the BWT will be just the first of several steps in the compression: once the characters are ``clumped'' we can apply other compression techniques.

Next let's replace each character with an integer index. poppy for example might become (16, 15, 16, 16, 25) since p is the 16'th letter of the alphabet and so on. (We'll suppose that space is the ``0'th letter.'') We apply a simple idea called Move to Front (MTF), where each character is moved, as it is encountered, to the front of the indexing table, and hence given an index of zero (which will then increase slowly as other characters are moved to the front). Now poppy becomes (16, 16, 1, 0, 25). The first p is still 16 since we start with default indexes. The o is now 16 instead of 15 since p was moved in front of it. The second p is 1 (not 0, because o was just moved in front of it), but that moves p to the very front again so the third p is 0.

We still haven't achieved compression, but at least many of these indexes will be very small integers now. In fact, since we applied the BWT earlier, many of the indexes will be zero.

The next compression step is called Zero Running (ZR). Each non-zero number is now preceded with a count of preceding zeros. With this step, our example (16, 16, 1, 0, 25) will become (0, 16, 0, 16, 0, 1, 1, 25). Our five numbers, four of which were non-zero, have been replaced with eight numbers, two for each non-zero. (For clarity I show the zero run-counts in an alternate color.) If this looks like bad compression, remember that fifty numbers, only four of which are non-zero, would also map to an eight-number output.

Finally, we want to output the list of these numbers to the compressed file. Most of the numbers will be very small so we want to use a code which represents them with few bits. A Huffman code would be appropriate, but even simpler and almost as good is to use a standard regular pattern. We'll use Elias's ``gamma code,'' a popular and elegant device which goes by many names. Perhaps the simplest definition of this code is:

Elias's gamma code maps the positive integer K to K's binary representation with just enough leading zeros to put the first one in the exact middle of the code output. 9, for example, is represented 9 = 0001001.

The gamma code handles positive integers, but we need to handle zero as well. Simply add 1 before the encode, and subtract 1 after the decode.

That sums up our compression algorithm, and the source code should be pretty self-explanatory. (That the marvelous Burrows-Wheeler decoding algorithm works may not be obvious at first glance!) Much of the code just deals with the tedium of actually reading or writing bits to/from a disk or pipe. The code is followed by remarks on a few implemenatation details.

For your convenience if you want to download this software, I have arranged the code into a single source file (and without the HTML ``<'' strings that would cause trouble). After you snip the source at the obvious place and name it bwt.c the commands


        cc -O -DCOMPRESSOR -o compressor bwt.c
        cc -O -o decompressor bwt.c
        compressor < bwt.c > bwt.c.ZZ
        decompressor < bwt.c.ZZ > bwt.c.reconstruction
        cmp bwt.c bwt.c.reconstruction
will compile, demonstrate, and validate this compression software.

(Use the "View Source" option to extract this source, as your HTML browser will have changed parts of it when rendering.)



#include	
#include	
#include	

/* Coder.h */

#define BIPSHO  8 * sizeof(short)

#define LOGSIZ  15
#define BSIZE   (1 << LOGSIZ)

#define BITBUFSIZE      2048
#define UNDOBUF         3       /* max unget in shorts */
unsigned short  Bitbuff[BITBUFSIZE + UNDOBUF];
int     inshctr = BITBUFSIZE + UNDOBUF - 1, inbitrem = 0;

long    getbits();

/* BSIZE must be divisible by sizeof(short) ! */

/*
 * Need to handle byte order or compressed files
 * won't be portable across hosts.
 */
int     Swabbing = 0;

void ungetbits(int cnt)
{
        inbitrem += cnt;
        while (inbitrem >= BIPSHO) {
                inbitrem -= BIPSHO;
                inshctr -= 1;
        }
}

void setswab()
{
        union {
                char    c[2];
                short   s;
        } u;

        if (u.s = 1, u.c[0] == 1)
                Swabbing ^= 1;
}

void cswab(short *sp, int cnt)
{
        register unsigned x;

        if (Swabbing) while (cnt--) {
                x = *sp;
                *sp++ = x >> 8 & 0xff | (x & 0xff) << 8;
        }
}

/* IO subroutines that work with pipes */
int p_write(int fd, char *buf, int cnt)
{
        int     acnt, tcnt;

        for (acnt = 0; acnt < cnt; acnt += tcnt) {
                tcnt = write(fd, buf + acnt, cnt - acnt);
                if (tcnt <= 0) {
                        perror("write");
                        exit(1);
                }
        }
        return acnt;
}

int p_read(int fd, char *buf, int cnt)
{
        int     acnt, tcnt = 1;

        for (acnt = 0; acnt < cnt && tcnt; acnt += tcnt) {
                tcnt = read(fd, buf + acnt, cnt - acnt);
                if (tcnt < 0) {
                        perror("read");
                        exit(1);
                }
        }
        return acnt;
}

#ifdef  COMPRESSOR

u_char  Decbuff[BSIZE + BSIZE];
u_char  Encbuff[BSIZE];
u_short Sindex, Leng;
u_char  *Ptab[BSIZE];

int ptcomp(const void *pp1, const void *pp2)
{
        register u_char *p1 = *(u_char **)pp1, *p2 = *(u_char **)pp2;
        register c, len = Leng;

        do
                c = *p1++ - (int)*p2++;
        while (c == 0 && --len);
        return c;
}

void bwt_encode()
{
        int     i;

        memcpy(Decbuff + Leng, Decbuff, Leng);
        for (i = 0; i < Leng; i++)
                Ptab[i] = Decbuff + i;
        qsort(Ptab, Leng, sizeof(u_char *), ptcomp);
        for (i = 0; i < Leng; i++) {
                Encbuff[i] = Ptab[i][Leng - 1];
                if (Ptab[i] == Decbuff)
                        Sindex = i;
        }
}

u_char  Mtftab[256];

int mtf_encode(u_char din)
{
        register        i, resu;

        for (resu = 0; Mtftab[resu] != din; resu++)
                ;
        for (i = resu; i; i--)
                Mtftab[i] = Mtftab[i - 1];
        Mtftab[0] = din;
        return resu;
}

void eg_encode(int datum)
{
        int     n, nn;

        ++datum;
        for (n = 1, nn = 2; nn <= datum; n += 2, nn += nn)
                ;
        putbits((long)datum, n);
}

int     Numzeros = 0;

/*
 * Modified Elias_gamma code for run-lengths is defined by the code-lengths
 *         1  2  4  4  6  6  6  6  8  8  8  8  8  8  8 ...
 * Modified Elias_gamma code for positive MTF is defined by the code-lengths
 *         2  2  4  4  4  4  6  6  6  6  6  6  6  6  8 ...
 */
void zr_encode(int datum)
{
        if (datum == 0)
                ++Numzeros;
        else {
                if (Numzeros) {
                        putbits(1L, 1);
                        eg_encode(Numzeros - 1);
                } else
                        putbits(0L, 1);
                Numzeros = 0;
                eg_encode(datum - 1 >> 1);
                putbits(!((long)datum & 1), 1);
        }
}

int     outshctr, outbitctr;

int putbits(unsigned long dval, int cnt)
{
        if (cnt < 0) {
                /* Termination code */
                putbits(0L, BIPSHO - 1);
                goto forcewrite;
        }
        if (cnt > BIPSHO) {
                putbits(dval >> BIPSHO, cnt - BIPSHO);
                cnt = BIPSHO;
                dval &= (1L << BIPSHO) - 1;
        }
        dval <<= BIPSHO * 2 - outbitctr - cnt;
        Bitbuff[outshctr] |= dval >> BIPSHO;
        outbitctr += cnt;
        if (outbitctr >= BIPSHO) {
                outbitctr -= BIPSHO;
                outshctr += 1;
                if (outshctr >= BITBUFSIZE) {
                forcewrite:
                        cswab((short *)Bitbuff, outshctr);
                        p_write(1, (char *)Bitbuff, outshctr * sizeof(short));
                        outshctr = 0;
                }
                Bitbuff[outshctr] = dval;
        }
}

int main(int argc, char **argv)
{
        int     i, ix;

        setswab();
        for (i = 0; i < 256; i++)
                Mtftab[i] = i;
        /* Loop through blocks */
        do {
                Leng = p_read(0, Decbuff, BSIZE);
                if (Leng == BSIZE)
                        putbits(1L, 1);
                else
                        putbits((long)Leng, LOGSIZ + 1);
                if (Leng == 0)
                        break;
                bwt_encode();
                putbits((long)Sindex, LOGSIZ);
                for (i = 0; i < Leng; i++)
                        zr_encode(mtf_encode(Encbuff[i]));
                zr_encode(1);
        } while (Leng == BSIZE);
        /* special Code to flush output buffer */
        putbits(0L, -1);
        exit(0);
}

#else

u_char  Decbuff[BSIZE + BSIZE];
u_char  Encbuff[BSIZE];
u_char  Sorted[BSIZE];
u_short Sindex, Leng;
u_short Dchart[256];
u_short Echart[256];
u_short Link[BSIZE];

int charcomp(const void *p1, const void *p2)
{
        return *(u_char *)p1 - (int)*(u_char *)p2;
}

/* This is the fabulous Burrows-Wheeler decoder. */
void bwt_decode()
{
        register int    i, m;
        register u_char b;

        memcpy(Sorted, Encbuff, Leng);
        qsort(Sorted, Leng, sizeof(u_char), charcomp);
        for (i = 0; i < 256; i++)
                Echart[i] = 0;
        for (m = -1, i = 0; i < Leng; i++) {
                if ((b = Sorted[i]) != m)
                        Dchart[m = b] = i;
        }
        for (i = 0; i < Leng; i++) {
                b = Encbuff[i];
                Link[Dchart[b] + Echart[b]++] = i;
        }
        for (m = Sindex, i = 0; i < Leng; i++) {
                Decbuff[i] = Sorted[m];
                m = Link[m];
        }
}

u_char  Mtftab[256];

int mtf_decode(int ix)
{
        u_char  resu;

        for (resu = Mtftab[ix]; ix > 0; ix--)
                Mtftab[ix] = Mtftab[ix-1];
        return Mtftab[0] = resu;
}

int     Numzeros = -1;

int zr_decode()
{
        int     res;

        if (Numzeros-- > 0)
                return 0;
        if (Numzeros != -1) {
                if (getbits(1)) {
                        Numzeros = eg_decode();
                        return 0;
                } else
                        Numzeros = -1;
        }
        res = (eg_decode() << 1) + 1;
        return res + getbits(1);
}

int eg_decode()
{
        int     n;

        switch (getbits(3)) {
        case 0:
                for (n = 3; getbits(1) == 0; n++)
                        ;
                return (1 << n) + getbits(n) - 1;
        case 1:
                return 3 + getbits(2);
        case 2:
                return 1;
        case 3:
                return 2;
        case 4: case 5: case 6: case 7:
                ungetbits(2);
                return 0;
        }
}

long getbits(int cnt)
{
        long    result;
        int     x;

        if (cnt > BIPSHO) {
                result = getbits(cnt - BIPSHO) << BIPSHO;
                return result | getbits(BIPSHO);
        } else if (cnt <= inbitrem) {
                result = Bitbuff[inshctr] >> inbitrem - cnt;
                inbitrem -= cnt;
        } else {
                result = (long)Bitbuff[inshctr] << cnt - inbitrem;
                if (++inshctr >= BITBUFSIZE + UNDOBUF) {
                        for (x = 0; x < UNDOBUF; x++)
                                Bitbuff[x] = Bitbuff[BITBUFSIZE + x];
                        p_read(0, (char *)(Bitbuff + UNDOBUF),
                                (inshctr - UNDOBUF) * sizeof(short));
                        /* We assume the input file is valid;
                         * Thus no need to check its length!
                         */
                        cswab((short *)Bitbuff + UNDOBUF,
                                inshctr - UNDOBUF);
                        inshctr = UNDOBUF;
                }
                result |= Bitbuff[inshctr] >> BIPSHO - cnt + inbitrem;
                inbitrem -= cnt - BIPSHO;
        }
        return result & (1L << cnt) - 1;
}

int main(int argc, char **argv)
{
        int     i;

        setswab();
        for (i = 0; i < 256; i++)
                Mtftab[i] = i;
        /* Loop through blocks */
        do {
                Leng = getbits(1) ? BSIZE : getbits(LOGSIZ);
                if (Leng == 0)
                        break;
                Sindex = getbits(LOGSIZ);
                for (i = 0; i < Leng; i++)
                        Encbuff[i] = mtf_decode(zr_decode());
                zr_decode();
                bwt_decode();
                p_write(1, Decbuff, Leng);
        } while (Leng == BSIZE);
        exit(0);
}
#endif


Miscellaneous remarks.

The ZR encoding only outputs for non-zero numbers, but the last number encoded might be zero (i.e., part of an ``unfinished run''). So we encode an arbitrary non-zero number, as a sort of ``placeholder,'' at the end (and remember to throw it away during decode). We do this not just at EOF, but whenever we need to insert non-ZR data into the code stream. Fortunately that is infrequent in this software so compression isn't degraded by these extra non-zeros. (Other solutions are possible if such degradation became an issue, but are more complicated.)

A slight variation of the Elias gamma (EG) code is used for the positive MTF codes; this yielded a 3% compression improvement. ( x/2 is submitted to the gamma code instead of x and the LSB of x is sent ``as is.'') We also use a variation of the gamma code for the zero-run codes, though this yielded only a further 1% improvement. One bit is spent indicating whether the run-length is zero; if not, the positive run-length is gamma coded normally.

ptcomp() could simply invoke memcmp(), but I found it significantly faster in my Linux environment to do the simple memcmp-like loop in-line. Here memcmp() will usually miscompare on an early character so in-line speed isn't surprising, but in my experience it's usually possible to outperform even large-sized memcpy()'s, just with the well-known device of partially ``unrolling'' the loop. It seems so worthwhile and important to optimize the library memcpy() and memcmp() routines that you may find it odd that vendors don't, but in my experience these routines have been consistently suboptimal. As just one example, I found I could get dramatic improvement on an imaging program for the Sun-3 workstation (68020 chip) by writing my own memcpy(): on investigation I found that the Sun-3's memcpy() routine had been optimized for the obsolete Sun-2 (68010 chip). (A related, and even more astounding, example is integer-multiply on Sun-4 workstations, where the library routine can be sped up by a large factor -- though surely Sun's fixed this by now! ?.)

In my experiments I found some other ways to improve this code, but I did not incorporate them because the improvements were modest and I wanted to keep this example simple. Nevertheless I will mention two of them:


 

Please send me some e-mail.
Go back to my home page.