How to remove significant amounts of redundancy.
This post took longer to write than I expected. How do you explain a topic so far outside most people’s everyday experience? I didn’t end up with a polished post, but I hope it’s easy to follow.
There are roughly two kinds of compression. In one kind you make the expected data small; in Morse code, for example, “e” is less than 25% the size of “q”. In the other kind you remove redundancy so that there is no wasted space and every bit carries information.
For years I’ve been writing this second kind of compression following a simple rule that I developed bit-by-bit. I have no idea if this rule has been published somewhere; very likely it has a name and someone will be upset I didn’t cite their work. But since I don’t know who they are and don’t care to plow through the literature of a field foreign to my usual focus to find them, I’ll just post it as is.
My rule for building a compressor and decompressor has four steps.
First, take whatever it is you are compressing and turn it into a sequence of tokens. Some sequences make the later steps much easier than others, but in this example I’m going to use text, which has a natural representation as a sequence of letters or words.
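For text this step is nearly free; as a trivial sketch:

```python
# For text, tokenization is almost a no-op: a word is already a
# sequence of letter tokens.
tokens = list("alps")   # ['a', 'l', 'p', 's']
```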
Second, come up with a way to determine, for every sequence, the set of tokens that could be appended to it. For example, in one of my dictionaries, if I have “alp” the next letter can be one of “a”, “e”, “h”, “i”, or “s”, or the word can end right there with “alp”. If I have “prandi”, the only next token allowed by that dictionary is “a”, and the word cannot stop at “prandi”.
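Here is a sketch of what such a function could look like, assuming the dictionary is a plain set of words; the handful of words below are stand-ins chosen to reproduce the “alp” and “prandi” answers:

```python
# A sketch of the successor function, assuming the dictionary is a
# plain set of words.
def successors(words, prefix):
    """Return (can_stop, tokens): whether the prefix is itself a word,
    and the sorted letters that extend it toward some longer word."""
    can_stop = prefix in words
    tokens = sorted({w[len(prefix)] for w in words
                     if len(w) > len(prefix) and w.startswith(prefix)})
    return can_stop, tokens

words = {"alp", "alpaca", "alpenhorn", "alpha", "alpine", "alps", "prandial"}
print(successors(words, "alp"))     # (True, ['a', 'e', 'h', 'i', 's'])
print(successors(words, "prandi"))  # (False, ['a'])
```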
Third, number the options at each step sequentially, starting with 0. It is important that, if the sequence cannot stop where it is, appending token 0 repeatedly will eventually produce a sequence that can stop. For the set of odd numbers, for example, token 0 should not be the digit “0”, because appending “0”s to an even number never creates an odd number.
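A hypothetical successor function for that odd-number example, just to show the ordering constraint:

```python
# Hypothetical numbering for digit strings that must spell an odd
# number: a string may only stop after an odd digit, so token 0 must
# itself be an odd digit ("1" here). If "0" were token 0, appending
# token 0 to "2" would give "20", "200", ... and never terminate.
def odd_successors(prefix):
    can_stop = bool(prefix) and prefix[-1] in "13579"
    return can_stop, list("1357902468")   # "0" is token 5, not token 0
```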
Fourth, I use the following two routines to actually do the compression and decompression. The result of compression is an integer between 0 and the number of things that can be represented; in general, shorter sequences get smaller numbers.
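Roughly, in Python, the pair looks like this. It is only a sketch, written against any successors(prefix) function that returns the stop flag and the tokens in their numbered order:

```python
def compress(successors, seq):
    """Turn a token sequence into a nonnegative integer."""
    n = 0
    # Undo one decompression step per token, walking from the end of
    # the sequence back to the start.
    for i in range(len(seq) - 1, -1, -1):
        can_stop, tokens = successors(seq[:i])
        n = n * len(tokens) + tokens.index(seq[i])
        if can_stop:
            n += 1   # one count was spent on "don't stop here"
    return n

def decompress(successors, n):
    """Turn a nonnegative integer back into a token sequence."""
    seq = []
    while True:
        can_stop, tokens = successors(seq)
        if can_stop:
            if n == 0:
                return seq   # count 0 at a stopping point means "stop"
            n -= 1           # every other count means "keep going"
        seq.append(tokens[n % len(tokens)])
        n //= len(tokens)
```

The modulus picks the next token and the division carries the rest of the count forward; the decrement is the choice between “stop here” and “keep going” whenever stopping is allowed.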
As an example, consider Dana’s nth-great-grandmothers. Valid descriptions begin with “Dana”, end with “’s mother”, and have an arbitrary number of “’s father”s and “’s mother”s in between. Since token 0 needs to move towards a terminating point, we’ll call “’s father” token 1 and “’s mother” token 0.
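In the same sketch form, that grammar is a two-line successor function (dana_successors is just my name for it here):

```python
def dana_successors(prefix):
    # A description may stop only after "'s mother"; making "'s mother"
    # token 0 means appending token 0 always reaches a stopping point.
    can_stop = bool(prefix) and prefix[-1] == "'s mother"
    return can_stop, ["'s mother", "'s father"]

seq = ["'s father", "'s mother", "'s mother"]
print(compress(dana_successors, seq))    # 5
print(decompress(dana_successors, 5))    # ["'s father", "'s mother", "'s mother"]
```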
To compress “Dana’s father’s mother’s mother”,
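following the sketch above I’d walk the tokens backwards, starting the count at 0 at the end of the description: the final “’s mother” is token 0 at a point where the description could already have stopped, giving 0 × 2 + 0 + 1 = 1; the middle “’s mother” is token 0 at a point that cannot stop, giving 1 × 2 + 0 = 2; and “’s father” is token 1 at the start, giving 2 × 2 + 1 = 5.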
To decompress 5 I’d do the reverse:
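starting with a count of 5 and an empty description, 5 mod 2 = 1 picks “’s father” and leaves 2; the description can’t stop yet, so 2 mod 2 = 0 picks “’s mother” and leaves 1; now the description could stop, so one count is spent on “keep going”, leaving 0, and 0 mod 2 = 0 picks “’s mother”; the description can stop and the count is 0, so the answer is “Dana’s father’s mother’s mother”.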
It should be noted that this method is not perfect. In particular, if some sequences have no possible successors, then some numbers are never generated by any encoding. Thus my dictionary of 173,540 words compresses the word “compression” as 58,889,155, far more than the 173,539 a perfect scheme would need. I expect I’ll continue to refine and shrink the process over the coming years. But for right now it’s the best I’ve got.