Text compression routines...

The place for codemasters or beginners to talk about programming any language for the Spectrum.
TMD2003
Rick Dangerous
Posts: 2047
Joined: Fri Apr 10, 2020 9:23 am
Location: Airstrip One

Text compression routines...

Post by TMD2003 »

If at first you don't succeed, cheat.

I opened a .Z80 snapshot of Fantasy World Dizzy: Classic Edition in an attempt to read the text displayed when Dizzy communicates with other characters, to see if that was a clue as to the very well hidden Easter Egg. I was surprised to find that there wasn't any... and then not surprised, because there's so much text that, logically, it would have to be held in compressed form. And I'd assume this explains why the game is all in capital letters. It made me start thinking about how sophisticated text compression routines need to be.

So, without digging into ZX0, which didn't exist in 1989, how does this sound for something that would have worked "back in the day"...

If we define a character set of 32 characters, then three of those characters can be compressed into two bytes, as each character needs five bits, and the bit manipulation required to unscramble the 15 bits into the correct characters is fairly straightforward. We'd need A-Z, then a space, full stop, apostrophe (which doubles as a quotation mark), comma, question mark, and the last character can be either a hyphen or an exclamation mark. This way, text takes up around two thirds of the space it usually would. I could write such a routine myself, though the compressor would most likely have to be in BASIC, as it'd be complicated and speed is irrelevant; the decompressor in machine code is much easier.
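
As a quick sketch (in Python rather than BASIC, and with the character order and the handling of the spare 16th bit just assumptions on my part), the packing and unpacking would look something like this:

Code:

# Minimal sketch: pack 3 five-bit characters into 2 bytes (15 of the 16 bits used).
# The 32-character alphabet and its order are assumptions based on the post above.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ .',?-"   # 32 characters -> 5 bits each

def pack(text):
    """Pack text (padded to a multiple of 3 characters) into pairs of bytes."""
    text = text.upper()
    while len(text) % 3:
        text += " "                              # pad with spaces
    out = bytearray()
    for i in range(0, len(text), 3):
        a, b, c = (ALPHABET.index(ch) for ch in text[i:i+3])
        word = (a << 10) | (b << 5) | c          # 15 bits, top bit left clear
        out += word.to_bytes(2, "big")
    return bytes(out)

def unpack(data):
    """Reverse of pack(): each pair of bytes back into three characters."""
    text = ""
    for i in range(0, len(data), 2):
        word = int.from_bytes(data[i:i+2], "big")
        text += ALPHABET[(word >> 10) & 31] + ALPHABET[(word >> 5) & 31] + ALPHABET[word & 31]
    return text

packed = pack("HELLO DIZZY")
print(len(packed), unpack(packed))               # 8 bytes for 12 characters (after padding)

On the Spectrum the unpack side is just shifts and masks, which is why the machine code decompressor stays small.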

But if that isn't compressed enough, how does this sound?

To compress two characters into one byte, only 16 values are possible for each - which are easier to write in hex. Hence: 0 is a space, 1 to D are the 13 most commonly used letters in the language required*, then E and F act like shift keys - each followed by a second value which determines which of 16 characters should be selected from an extra lookup table (and all the punctuation can be kept in one or the other). Overall, this allows 46 characters in total, which is good news for the Russians with their 33 Cyrillic characters.

Bear in mind I've never studied computer science and have no formal training in algorithms or any of that jazz. How does it stack up?

Meanwhile, I will find out myself. I'll take all the sample text from the arrays in Advanced Ben Shapiro Simulator and see how well they compress with the above method.

* English: ETAONRISHDLCM. For the second set of characters (the "E" set), FUGYPWBVKXJQZ with room for three more, and put the punctuation in the third ("F") set. Letters can be moved around as required: for Dizzy games it'd be more useful to have Y and Z in the "unshifted" set.
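
And a rough sketch of the two-nibbles-per-byte scheme in Python, using the sets above - anything in the "E" and "F" tables beyond the letters I listed is just a guess:

Code:

# Rough sketch of the two-nibbles-per-byte idea. COMMON covers nibbles 0-13,
# prefix nibble 0xE selects from SET_E and 0xF selects from SET_F.
COMMON = " ETAONRISHDLCM"                      # nibbles 0-13, 0 = space
SET_E  = "FUGYPWBVKXJQZ"                       # reached via prefix nibble 0xE (3 slots spare)
SET_F  = ".,'?!-"                              # punctuation via prefix nibble 0xF

def to_nibbles(text):
    nibs = []
    for ch in text.upper():
        if ch in COMMON:
            nibs.append(COMMON.index(ch))
        elif ch in SET_E:
            nibs += [0xE, SET_E.index(ch)]
        else:
            nibs += [0xF, SET_F.index(ch)]      # assumes anything else is in SET_F
    if len(nibs) % 2:
        nibs.append(0)                          # pad with a trailing space
    return nibs

def encode(text):
    nibs = to_nibbles(text)
    return bytes((nibs[i] << 4) | nibs[i + 1] for i in range(0, len(nibs), 2))

def decode(data):
    nibs = [n for b in data for n in (b >> 4, b & 0xF)]
    text, i = "", 0
    while i < len(nibs):
        if nibs[i] == 0xE:
            text += SET_E[nibs[i + 1]]; i += 2
        elif nibs[i] == 0xF:
            text += SET_F[nibs[i + 1]]; i += 2
        else:
            text += COMMON[nibs[i]]; i += 1
    return text

packed = encode("HELLO, IS THIS THE EASTER EGG?")
print(len(packed), decode(packed))

Anything written entirely with the unshifted set packs to exactly half size; every "shifted" character costs a whole byte.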
Spectribution: Dr. Jim's Sinclair computing pages.
Features my own programs, modified type-ins, RZXs, character sets & UDGs, and QL type-ins... so far!
grelf
Drutt
Posts: 23
Joined: Sat Apr 15, 2023 5:37 pm
Location: Tyneside, UK

Re: Text compression routines...

Post by grelf »

One of the first systems I worked on professionally, around 1980, was the PDP-11 from DEC. Our machines had less than 128 Kbytes of memory but were doing real-time engineering tasks. The operating systems, RT-11 and RSX-11M, used a system of 3 characters per 2 bytes as you described. Every program name was just 3 letters. I remember TKB for the task builder and PIP for the Peripheral Interchange Program (file copying, naming, etc.).
berarma
Microbot
Posts: 124
Joined: Thu Mar 09, 2023 10:55 am

Re: Text compression routines...

Post by berarma »

For text, I think you'd achieve much better compression by coding words instead of letters. Although you could mix both schemes and have indexed words and compressed chars.

I would calculate how much space would be saved by indexing a word and decide which ones to index based on that. Indexed words would go into a table with an index between 0 and 255. Compressed text would consist of indexed words alternating with non-indexed words, which would include spaces and punctuation.
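
A rough sketch of that "is this word worth indexing?" step in Python - the cost model (a one-byte index per use, plus the word itself and a length byte in the table) is just an assumption:

Code:

# Sketch: rank words by the net bytes saved if they were replaced by a 1-byte index.
# Cost model is an assumption: each use saves len(word)-1 bytes, and the table
# entry itself costs len(word)+1 bytes.
from collections import Counter

def words_worth_indexing(text, table_size=256):
    counts = Counter(text.split())
    def saving(word):
        return counts[word] * (len(word) - 1) - (len(word) + 1)
    ranked = sorted(counts, key=saving, reverse=True)
    return [w for w in ranked if saving(w) > 0][:table_size]

sample = "you enter the room . the room is dark . you light the lamp . " * 10
print(words_worth_indexing(sample))

Only words that actually pay for their own table entry make the cut, which automatically throws out anything that appears only once.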
AndyC
Dynamite Dan
Posts: 1455
Joined: Mon Nov 13, 2017 5:12 am

Re: Text compression routines...

Post by AndyC »

It's been a while, but from what I remember a lot of text compression revolves around common letter pairings. Things like Th and ea occur a lot in English for example so treating them as a single character can help.
uglifruit
Manic Miner
Posts: 705
Joined: Thu Jan 17, 2019 12:41 pm
Location: Leicester

Re: Text compression routines...

Post by uglifruit »

AndyC wrote: Tue May 16, 2023 10:20 am It's been a while, but from what I remember a lot of text compression revolves around common letter pairings. Things like Th and ea occur a lot in English for example so treating them as a single character can help.
Yes, isn't it something like: search the text for common byte pairings, and create a lookup table using unused bytes to represent common pairs. This can be done more than once. Probably "full stop, space" would be very common in plain text. In this paragraph "th" and "e space" occur more than once. And if both of those substitutions are used then the new two byte pair "th" "e space" could be further replaced on a subsequent pass. Thus common words get further compressed.
CLEAR 23855
8BitAG
Dynamite Dan
Posts: 1509
Joined: Sun Dec 17, 2017 9:25 pm

Re: Text compression routines...

Post by 8BitAG »

Tokenizing common words & letter pairing is certainly a fairly simple method of compression that has the advantage of being fairly quick to decompress. It's used in some text adventure systems, such as the PAW & DAAD. You can use tokens based on the best fit to the individual text or use ones that work well across a wide variety of common texts.

For example, these are the standard English tokens for PAW (CP/M)...
_the_
_you_
_are_
ing_
_to_
_and
_is_
You_
and_
The_
n't_
_of_
_you
ing
ed_
_a_
_op
ith
out
ent
_to
_in
all
_th
_it
ter
ave
_be
ver
her
and
ear
You
_on
en_
ose
no
ic
ap
_b
gh
__
ad
is
_c
ir
ay
ur
un
oo
_d
lo
ro
ac
se
ri
li
ti
om
bl
ck
I_
ed
ee
_f
ha
pe
e_
t_
in
s_
th
,_
er
d_
on
to
an
ar
en
ou
or
st
._
ow
le
at
al
re
y_
ch
am
el
_w
as
es
it
_s
ll
do
op
sh
me
he
bo
hi
ca
pl
il
cl
_a
of
_h
tt
mo
ke
ve
so
e.
d.
t.
vi
ly
id
sc
_p
em
r_
If you're using another language then you would adjust the tokens to more common pairings in those languages.
8-bit Text Adventure Gamer - games - research.
WhatHoSnorkers
Manic Miner
Posts: 256
Joined: Tue Dec 10, 2019 3:22 pm

Re: Text compression routines...

Post by WhatHoSnorkers »

S Robert Speel does some kind of tokenising/detokenising in his New Adventure System book, where words become individual characters.
I have a little YouTube channel of nonsense
https://www.youtube.com/c/JamesOGradyWhatHoSnorkers
+3code
Manic Miner
Posts: 473
Joined: Sat Mar 19, 2022 7:40 am

Re: Text compression routines...

Post by +3code »

8BitAG wrote: Tue May 16, 2023 10:33 am If you're using another language then you would adjust the tokens to more common pairings in those languages.
The head of the DAAD source codes has something like this, iirc.
Morkin
Bugaboo
Posts: 3327
Joined: Mon Nov 13, 2017 8:50 am
Location: Bristol, UK

Re: Text compression routines...

Post by Morkin »

8BitAG wrote: Tue May 16, 2023 10:33 am For example, these are the standard English tokens for PAW (CP/M)...
.
.
.
That's quite interesting - @Rorthron did a load of manual tokenizing for the text in Balachor's Revenge (saved 200+ bytes I think). Over 50% of the 'tokens' we used are in this list. Most of the most effective tokens were these sorts of 2- and 3-character strings (often including spaces) - I would've imagined there'd be more savings with longer words, but that wasn't the case, as they were used less frequently.

I guess it'd depend on the amount and nature of the text? Ideally you could feed your finished text through something that'd identify the best tokens to use for saving memory? (I'm sure things like this exist).

I do like the idea of a 'blanket' cramming-characters-uniformly-in-fewer-bytes approach though. The next challenge would be to write the smallest possible ASM routine to extract them... ;)
My Speccy site: thirdharmoniser.com
8BitAG
Dynamite Dan
Posts: 1509
Joined: Sun Dec 17, 2017 9:25 pm

Re: Text compression routines...

Post by 8BitAG »

Morkin wrote: Tue May 16, 2023 9:23 pm I guess it'd depend on the amount and nature of the text? Ideally you could feed your finished text through something that'd identify the best tokens to use for saving memory? (I'm sure things like this exist).
Yes, you can. PAW had fixed tokens but in DAAD (as +3code pointed out) you can actually define the tokens yourself; I copied and pasted the standard DAAD tokens, which are set to the PAW CP/M ones, but if you were writing a game in Spanish you'd have a different set and there's no reason why you couldn't customise them to a specific game.

For DAAD we have a tool that can help generate some suitable tokens...
https://github.com/daad-adventure-writer/DRT
...However the default set works so well in most cases that I've never used this, but apparently it can save you a few bytes.
8-bit Text Adventure Gamer - games - research.
Morkin
Bugaboo
Posts: 3327
Joined: Mon Nov 13, 2017 8:50 am
Location: Bristol, UK

Re: Text compression routines...

Post by Morkin »

8BitAG wrote: Tue May 16, 2023 9:28 pm For DAAD we have a tool that can help generate some suitable tokens...
https://github.com/daad-adventure-writer/DRT
...However the default set works so well in most cases that I've never used this, but apparently it can save you a few bytes.
Looks quite interesting...
My Speccy site: thirdharmoniser.com
Joefish
Rick Dangerous
Posts: 2093
Joined: Tue Nov 14, 2017 10:26 am

Re: Text compression routines...

Post by Joefish »

The hi-tech way of doing it is called 'Huffman Coding' https://en.wikipedia.org/wiki/Huffman_coding
It uses statistical analysis to split the list of possible tokens down into two lists, over and over until there is only one token at the end of each branch of a great big tree (the 'graph') of options. Then to decode your data you do it one bit at a time to work out which branch to take until you get to a token. The divisions are deliberately lop-sided, so that the number of steps to get to more common tokens is much shorter than the number of steps (bits) needed for a less common token. But it's a lot of faff!
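
For illustration, a compact Python sketch of the idea - build the tree from symbol frequencies, then decode one bit at a time by walking it (a real Z80 decoder would store the tree far more compactly):

Code:

# Huffman sketch: build the tree with a heap, assign shorter bit strings to more
# common symbols, then decode by walking the tree one bit at a time.
import heapq
from collections import Counter

def build_tree(text):
    # heap entries are (frequency, tie-breaker, subtree); a subtree is either a
    # single character or a (left, right) pair
    heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (t1, t2)))
        i += 1
    return heap[0][2]

def make_codes(tree, prefix=""):
    if isinstance(tree, str):
        return {tree: prefix or "0"}
    left, right = tree
    return {**make_codes(left, prefix + "0"), **make_codes(right, prefix + "1")}

def decode(bits, tree):
    out, node = "", tree
    for bit in bits:
        node = node[int(bit)]        # take the 0 or 1 branch
        if isinstance(node, str):    # reached a leaf: emit it and restart at the root
            out += node
            node = tree
    return out

text = "THE CAT SAT ON THE MAT"
tree = build_tree(text)
codes = make_codes(tree)
bits = "".join(codes[ch] for ch in text)
print(len(bits), "bits instead of", len(text) * 8)
print(decode(bits, tree))

The decoder itself is tiny (walk left or right per bit); the faff is in building and storing the tree.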

For reference, here are some sources of letter, pair and triplet frequencies in English, although these don't include spaces like those adventure game optimisations do. Bear in mind that the single letter frequencies are a percentage of 26 possibilities (so the average is 3.85%, round about where 'L' lies), whereas the digraphs/bigrams are from 26x26=676 possibilities (so the average is 0.15%) and the trigrams from 17576 possibilities (where the average would be about 0.0057%).

https://pi.math.cornell.edu/~mec/2003-2 ... ncies.html
https://pi.math.cornell.edu/~mec/2003-2 ... raphs.html

https://en.wikipedia.org/wiki/Bigram
https://en.wikipedia.org/wiki/Trigram

These sorts of tables were used in breaking simple letter-substitution codes, where you'd look for the most common replacement character/mark and assume it's an 'e', then any doubles are probably 'll', and repeated pairs most likely 'th', and then do a combination of guessing whole words and counting the rest of the letters.
Sol_HSA
Microbot
Posts: 163
Joined: Thu Feb 04, 2021 11:03 am
Location: .fi

Re: Text compression routines...

Post by Sol_HSA »

For mucho, I did the following:
First, zx7 allows for a "dictionary", or a block of data that sits in memory right before the decompression buffer, from which it can LZ data to get a better compression ratio right off the bat.

I analyze the input data for most common word pairs and then generate beatnik poetry by chaining these to form the dictionary.
Trainer data: 1024 bytes
the room in a wad of cash which you are back into your eyes and when it is no more before you, only by something to be - there seems like she says Vine. "I think I'm trying not getting out from his feet You open them with long as he says. "We should just about make it. A medal made me what you're going through The man turns towards you. Continue Vine on this hall at him go for our He then fades . I know it's all down but one thing has been that
would have some creature her head up against two light finally can take too door behind her. opponent gets woman shakes do so get my see they close begins or card "You must darkness. Your red light. As an effort voice that's find yourself its point Take voice. trees wife who we looks room. audience lights. few things will trophy while Bianca yellow hair hear off soldiers Surprise! card?" here Ask heavy replies if doors still continues Answer door. away opens face now over keep beast say girl their pink It answer mist sound soft rises seem pocket. pick jacket gun horn h
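
Something along these lines, heavily simplified - the real thing chains the pairs together into that "poetry" rather than just concatenating them, and the 1K limit matches the trainer size above:

Code:

# Very rough sketch: gather the most common adjacent word pairs from the game text
# and pack them into a ~1K "trainer" block for the LZ dictionary.
from collections import Counter

def build_trainer(text, limit=1024):
    words = text.split()
    pair_counts = Counter(zip(words, words[1:]))
    trainer = ""
    for (a, b), count in pair_counts.most_common():
        if count < 2:
            break                        # one-off pairs won't help the compressor
        trainer += a + " " + b + " "
        if len(trainer) >= limit:
            break
    return trainer[:limit]

sample = ("you are in a dark room . there is a door to the north . "
          "the door is locked . you are carrying a rusty key . ") * 3
print(build_trainer(sample))
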
Finally, since my decompression buffer is bigger than any single bit of text, I try all permutations of combining input text chunks and see which compress together the best.
p25 p80 p73 zx7: 3789 -> 1665 (43.943%), 0x5bce
p21 p77 zx7: 1781 -> 738 (41.437%), 0x624f
p38 p92 p58 zx7: 2708 -> 1005 (37.112%), 0x6531
p36 p74 p85 p71 zx7: 2816 -> 1117 (39.666%), 0x691e
p19 p94 p1 zx7: 3749 -> 1712 (45.666%), 0x6d7b
p69 p10 p4 p70 p50 zx7: 2994 -> 1068 (35.671%), 0x742b
p48 p62 zx7: 1419 -> 584 (41.156%), 0x7857
p3 p95 p12 zx7: 4022 -> 1972 (49.030%), 0x7a9f
p8 p83 p76 zx7: 2743 -> 1300 (47.393%), 0x8253
p47 p96 p13 p23 zx7: 3808 -> 1828 (48.004%), 0x8767
p14 p99 p33 p42 p72 p60 zx7: 3720 -> 1774 (47.688%), 0x8e8b
p35 p53 p45 p39 p81 zx7: 3961 -> 1955 (49.356%), 0x9579
p18 p52 p24 p78 p68 p87 zx7: 3329 -> 1470 (44.157%), 0x9d1c
p30 p61 p49 p90 p82 zx7: 3193 -> 1649 (51.644%), 0xa2da
p26 p79 p43 p97 p51 p16 p15 zx7: 3744 -> 1949 (52.057%), 0xa94b
p2 p41 p63 p88 p7 p66 p65 p46 zx7: 3827 -> 1998 (52.208%), 0xb0e8
p22 p89 p20 p9 p98 p29 p67 p55 p34 p11 p91 p75 p27 p44 zx7: 3912 -> 1870 (47.802%), 0xb8b6
p6 p86 p84 p93 p37 p31 p100 p28 p54 p57 p17 p5 zx7: 3902 -> 2068 (52.998%), 0xc004
p56 p64 p32 p40 p59 scorecheck start zx7: 1275 -> 730 (57.255%), 0xc818
(zx0 has come out since then and probably compresses better, but all of the above still applies)
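
A rough sketch of that chunk-grouping step, using zlib as a stand-in for zx7 and a greedy merge rather than literally trying every permutation:

Code:

# Greedily merge groups of text chunks while merging still shrinks the total packed
# size and the merged plain text still fits the decompression buffer.
# zlib stands in for zx7 here just to illustrate the idea.
import zlib
from itertools import combinations

def packed_size(group):
    return len(zlib.compress("".join(group).encode("utf-8"), 9))

def group_chunks(chunks, buffer_size=4096):
    groups = [[c] for c in chunks]
    while True:
        best = None
        for a, b in combinations(range(len(groups)), 2):
            merged = groups[a] + groups[b]
            if len("".join(merged)) > buffer_size:
                continue                 # merged text must still fit the buffer
            gain = packed_size(groups[a]) + packed_size(groups[b]) - packed_size(merged)
            if best is None or gain > best[0]:
                best = (gain, a, b)
        if best is None or best[0] <= 0:
            return groups
        _, a, b = best
        groups[a] += groups[b]           # keep the merge with the biggest saving
        del groups[b]

pages = ["You are in a dark cave. " * 20,
         "The cave walls drip with water. " * 20,
         "A troll blocks your way. " * 20]
for g in group_chunks(pages):
    print(len("".join(g)), "->", packed_size(g), "bytes packed")
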
catmeows
Manic Miner
Posts: 720
Joined: Tue May 28, 2019 12:02 pm
Location: Prague

Re: Text compression routines...

Post by catmeows »

Speaking of compressing short text, e.g. location descriptions or item descriptions, byte pair encoding is very useful (https://en.wikipedia.org/wiki/Byte_pair_encoding).
I did the compressor in Java and the decompressor in machine code, and it works very well. The compression ratio is in the range of 50%-60% of the original text.
Proud owner of Didaktik M
Rorthron
Dynamite Dan
Posts: 1697
Joined: Sun Nov 12, 2017 10:35 pm

Re: Text compression routines...

Post by Rorthron »

Morkin wrote: Tue May 16, 2023 9:23 pm @Rorthron did a load of manual tokenizing for the text in Balachor's Revenge (saved 200+ bytes I think).
Because we had a relatively small amount of text and really needed to squeeze as many bytes out as we could, a bespoke routine was, I think, the best way to go. If my memory is correct, the way I approached it was as follows. I put all the text into a Word file, replacing all the spaces and returns with symbols to make them more obvious. I also set up an Excel file into which I could type a token and its frequency in the text and the file would calculate the net memory saving on each token to see if it was worth while. I then went through the text token-by-token, using Word's find command, putting the tokens and their frequency into the spreadsheet. I think I started with two-letter tokens (including spaces and returns). Once I had identified the two-letter tokens I did a find-and-replace in the Word file and went on to do the three-letter tokens. This worked because I "compounded" tokens, incorporating two-letter tokens into my three-letter ones, then three-letter ones into my four-letter ones, etc. (I might identify first a token TH, then TH+E, then THE+R, etc.) At least, this is how I think I did it; it was very nearly ten years ago!

This wasn't based on any computer science. I just approached it as a word game. So there might well be a better way to do it.

Looking back at my emails, we saved 286 bytes. The longest token looks like it was "- UNLOCKED" (10 characters). Oddly enough, BALACHOR didn't get a token of its own, just the CH and OR.
Sol_HSA
Microbot
Posts: 163
Joined: Thu Feb 04, 2021 11:03 am
Location: .fi

Re: Text compression routines...

Post by Sol_HSA »

Rorthron wrote: Sat May 20, 2023 7:18 am This wasn't based on any computer science. I just approached it as a word game. So there might well be a better way to do it.
"Alter Ego", at least on PC, had a table of 128 common words and then text was compressed by using values 127-255 for lookup to those words and the lower values were literal ascii. I'm pretty sure it was a "word game" list as I would have written a routine that looked for common patterns instead of common words..
uglifruit
Manic Miner
Posts: 705
Joined: Thu Jan 17, 2019 12:41 pm
Location: Leicester

Re: Text compression routines...

Post by uglifruit »

So, I've been thinking about this... and I've had a go at it!

(my path to reinventing the wheel)
I decided that Python might be a good way to go, so I learnt enough Python to write a program that does multiple iterative Byte Pair Encoding (as described here).

I wanted to create a tokenization key for bytes 128-254 (I reserved 255 as an 'end of text' marker), and reasoned that 0-32 might be used for ZX Speccy control codes.

So I wrote some Python, which generated .asm defb statements that I could pop into an assembler, then wrote a decoder in Z80.


As a test piece of text, I used Chapter 1 of Harry Potter.


Stats

The 25806 bytes of input text gave 13828 bytes of output, plus a 254-byte token table.

The decompression/printing routine which I wrote took 114 bytes (just calling RST 10h, rather than anything fancy) - including a 12-byte 'buffer' (in case of tokens encoding tokens, which in turn encode tokens). It's really unlikely to ever be that far down before it gets to a real 'printable' character, but I wanted to be sure.

So 25,806 bytes of plain text compressing down to 13828+254+114 = 14,196 bytes.
Down to ~55% of its original size, which seems pretty good to me.


Tap file of the Harry Potter text being decoded to the screen: HERE

Code:

        ld     hl,bytepairmessage       ; 3  bytes
        call   printbytepairmessage     ; 3  bytes
        ret


;----------------------------------------------------------------------------------------------------
; print a byte-pair-encoded message, indexed by HL.
;----------------------------------------------------------------------------------------------------

printbytepairmessage:
        ld     (pointer),hl
nextcharacterinmessage:
        call   clearexpandablebuffer
        call   getcharactertobuffer     ; it is left in A, and the pointer is updated
        cp     255
        ret    z                        ; end if it is the end of the message
        call   lookatfirstcharacterinbuffer
        jr     nextcharacterinmessage
        ret

lookatfirstcharacterinbuffer:
        ld     a,(expandablebuffer)
        bit    7,a                      ; is it a token
        jr     z,emitfirstchar          ; if it isn't, then print it
        cp     255                      ; is the buffer empty
        ret    z                        ; return if so
        ;
        ld     hl,expandablebuffer+10   ; shift everything down the buffer to make room
        ld     de,expandablebuffer+11
        ld     bc,11
        lddr
                                        ; expand the token into two bytes
        and    %1111111                 ; ignore the top bit
        rla                             ; double it
        ld     e,a
        ld     d,0
        ld     hl,lookupbytepairs       ; find the place in the lookup table
        add    hl,de                    ; hl now points at the token pair
        ;
        ld     a,(hl)
        ld     (expandablebuffer),a
        inc    hl
        ld     a,(hl)
        ld     (expandablebuffer+1),a
        jr     lookatfirstcharacterinbuffer
emitfirstchar:
        rst    10h			; print the ascii character
        ld     hl,expandablebuffer+1	; shuffle down the buffer
        ld     de,expandablebuffer
        ld     bc,11
        ldir
        jr     lookatfirstcharacterinbuffer

getcharactertobuffer:
        ld     hl,(pointer)
        ld     a,(hl)
        ld     (expandablebuffer),a
        inc    hl
        ld     (pointer),hl
        ret

clearexpandablebuffer:
        ld     hl,expandablebuffer
        ld     de,expandablebuffer+1
        ld     (hl),255
        ld     bc,11
        ldir
        ret

pointer:
        defw   0

expandablebuffer:
        defb   0,0,0,0,0,0,0,0,0,0,0,0 ; 12 bytes


lookupbytepairs:					; byte pair substitutions, starting with byte 128 ... as needed
	defb 101,32 ;128
	defb 116,32 ;129
	defb 100,32 ;130
	defb 116,104 ;131
	...
bytepairmessage:					; encoded  text, terminated with a byte 255	
	defb 245,162,77,114,132,184...
	defb	255

My experiment with Python:

Code:

# Byte Pair Compression, using the bytes 128 to 254 as an index lookup for substitution

file = open(r"C:\Users\andyu\Desktop\ZX Spectrum\Programming Tests\BytePair\Chapter 1.txt", "rt")    #Obviously change as needed
phrase=file.read()

#Or just use
#phrase="the text I want to encode"

startlen=len(phrase)

print(phrase)
subs=""                                                                                                                             # string of the substitutions made
subcount=0                                                                                                                          # count the substitutions made
j=0                                                                                                                                 # counter to the end of the table (max=127 gives 254)    
while j<127:
    mostcommon = ""                                                                                                                 # for each pass, need the most common pair of bytes
    commonmax = 0                                                                                                                   # tally of how often
    i=0
    while i < len(phrase)-1:
        substring=(phrase[i])+(phrase[i+1])                                                                                         # get each pair of bytes a two character text string
        thisnum=phrase.count(substring)                     # find number of occurrences in the text
        if (thisnum>commonmax):
            if (thisnum>2):                                 # if fewer than 3 occurrences, don't make the substitution
                commonmax=thisnum
                mostcommon=(phrase[i])+(phrase[i+1])
        i = i + 1
    if (commonmax>2):
        print("%d-%d occurs %d times, replacing with character '%d'." % (ord(mostcommon[0]),ord(mostcommon[1]),commonmax,j+128))
        subs=subs +"defb "+str(ord(mostcommon[0]))+","+str(ord(mostcommon[1]))+" ;"+str(j+128)+"\n"                                 # note that the substitution has happened
        phrase=phrase.replace(mostcommon,chr(128+j))        # make the substitution
        subcount+=1
    j+=1                                                                                                                            
print ("lookupbytepairs:")
print(subs)
print ("bytepairmessage:")
defb="defb "
j=0
while j<len(phrase):
    defb=defb+str(ord(phrase[j]))+','
    j+=1
print(defb,'255')

print("\n; Length %d bytes became " %startlen, len(phrase), "bytes. Using",subcount,"substitutions (%d bytes)."%(subcount*2) )
print("; Compressed to %2d percent of original size, including lookup table." %( (subcount*2+len(phrase)+1)/startlen *100) ) 
The encoding doesn't bother encoding any pairs with fewer than 3 instances (as there is no data saving once you take the token table into consideration).
It just generates output that I can copy into a text document in Context for compiling. It'd be neater to write to a new file and add that as an include in the compilation, but I'm unlikely to use this at all frequently, so copy and paste was fine.



Obviously apologies to anyone who actually knows Python - I've never used it before this week, so it's probably crap.
CLEAR 23855
Bedazzle
Manic Miner
Posts: 310
Joined: Sun Mar 24, 2019 9:03 am

Re: Text compression routines...

Post by Bedazzle »

uglifruit wrote: Tue May 23, 2023 12:10 pm As a test piece of text, I used Chapter 1 of Harry Potter.
As a starting point, can you compress the same text with 7zip, and see how small it - theoretically - can be?
uglifruit
Manic Miner
Posts: 705
Joined: Thu Jan 17, 2019 12:41 pm
Location: Leicester

Re: Text compression routines...

Post by uglifruit »

Bedazzle wrote: Tue May 23, 2023 4:47 pm As starting point, can you compress same text with 7zip, and see how small it -theoretically- can be?
Drum roll...

My plain text file contains 25731 bytes (Chapter 1 of Harry Potter, no formatting).


The Byte-Pair-Encoding technique above takes that down to 14108 bytes. (54.83% of the original size)


7-Zip - on standard compression setting takes it down to : 10,054 bytes (39.07% of original size)
Fiddling with settings I can get this down to 9907 bytes (38.50%)

So it looks like 7-Zip wins by a substantial margin ... but the .zip files don't include the decoder. (And I bet it's not as quick to decode.)
CLEAR 23855
Joefish
Rick Dangerous
Posts: 2093
Joined: Tue Nov 14, 2017 10:26 am

Re: Text compression routines...

Post by Joefish »

The problem with ZIP files and that sort of data compression is that you need to decode everything at once, or at least decode continuously with a fairly big back buffer left open. If you want to be able to decode separate chunks, you typically have to encode each chunk on its own, and then you lose a lot of the efficiency. The encoding relies on being able to refer back to where a sequence of bytes has been used before, so for it to work, all the data up to the bit you want to decode has to be unpacked first.

Whereas with these text-encoding mechanisms, you can pick out short phrases from anywhere in the data and expand them on their own.
Bedazzle
Manic Miner
Posts: 310
Joined: Sun Mar 24, 2019 9:03 am

Re: Text compression routines...

Post by Bedazzle »

Joefish wrote: Tue May 23, 2023 11:10 pm Problem with ZIP files and that sort of data compression
It is not a problem, but a kind of estimation tool. If some algo can successfully beat the zip size, you rock. If the compressed size is much bigger than zip's, you need to think about a better algo.
catmeows
Manic Miner
Posts: 720
Joined: Tue May 28, 2019 12:02 pm
Location: Prague

Re: Text compression routines...

Post by catmeows »

If your estimation deliberately leaves out the size of the depacker and the memory resources needed to actually depack the data, you need to think about a better estimation.
Proud owner of Didaktik M