A new data compressor called ZX0

The place for codemasters or beginners to talk about programming any language for the Spectrum.
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

Awesome! You are welcome :)
Alone Coder
Manic Miner
Posts: 401
Joined: Fri Jan 03, 2020 10:00 am

Re: A new data compressor called ZX0

Post by Alone Coder »

Alone Coder wrote: Wed Jan 20, 2021 5:31 pm
Alone Coder wrote: Wed Jan 20, 2021 4:02 pm calgary - 98571 (lost to Exomizer3)
canterbury - 27939
graphics - 163173 (better than others)
music - 63721
misc - 254202

RAR headers included.
Decompressor size is 342 bytes (uses 1566 bytes for temporary buffer).

TR-DOS version of ZXRar optionally supports compression with 2-byte blocks (helpful for code), with 386-byte decompressor. It has to be tested too.
Temporarily added 2-byte blocks in NedoOS version:
calgary - 98525
canterbury - 27911
graphics - 163943
music - 62740
misc - 250341
ZXRar with 2-byte blocks TOTAL = 603460, including file headers (lost to exomizer and ZX0)

mRIP archiver for PC by Eugene77 (2003 format, depacker size is 218 bytes, fits in 1-sector TR-DOS BASIC loader, working area of #5C2 bytes):
calgary - 91734 (better than exomizer etc)
canterbury - 25800 (better than exomizer etc)
graphics - 155697 (better than exomizer etc)
music - 60160 (lost to exomizer and ZX0)
misc - 245702 (lost to exomizer and ZX0)
TOTAL - 579093 (better than exomizer etc)

RIP archiver for PC by Eugene77 (2000 format, depacker size is 228 bytes, working area of #5C2 bytes):
calgary - 90794 (the best)
canterbury - 24658 (the best)
graphics - 150981 (the best)
music - 57276 (the best)
misc - 240765 (the best)
TOTAL - 565474 (the best)

So, the old native formats were not bad!
User avatar
TMD2003
Rick Dangerous
Posts: 2032
Joined: Fri Apr 10, 2020 9:23 am
Location: Airstrip One
Contact:

Re: A new data compressor called ZX0

Post by TMD2003 »

I admit I have never had a reason to try ZX0 before. But now, I do: a relatively simple screen, with lots of empty space, which has absolutely no need to take up 6912 bytes. It requires compression.

I can't run it on my regular PC - it's incompatible with 64-bit Windows.
I tried it with DOSBox and it crashed - the window sits there doing nothing and can't be interrupted (and the .zx0 file never appears).
My 15-year-old laptop running Windows 7 returned the message "Program is too big to fit in memory", even though it's only 130-odd KB.
My 23-year-old laptop running Windows 2000 is flatly refusing to start up... even in Safe Mode.

Any suggestions?

I've already tried the screen compressor in BMP2SCR, and I've disassembled the initial machine code routine. For some reason that is several light years beyond my pay grade, the screen only displays if I load it from tape and RANDOMIZE USR 50000. If I copy the code into an .ASM file, along with lines and lines and lines of DEFBs to hold the screen data and assemble it with Spin, all I get is a screen full of flashing shash. The first lines are:
CALL 6034 (which is just a RET in the ROM and nothing more)
DEC SP
DEC SP
There is no reference whatsoever to the first line of the screen data - the compressor loads the code at 50000, and the screen data starts at 50114. Is it performing some mad trick with whatever the stack pointer points to at the end of a load?
I tried replacing these first lines with LD SP,50114 but it was to no available. As soon as any SP instructions are thrown my way I'm completely in the dark.

I've also taken to analysing the 6912 bytes, piece by piece, writing a load of LD HL,***** then LD (HL),*** and the odd LDIR routine to draw it effectively byte by byte, leaving out all the huge chunks of zeroes. Effectively I'm compressing the screen by hand, which is lengthy and tedious.
Spectribution: Dr. Jim's Sinclair computing pages.
Features my own programs, modified type-ins, RZXs, character sets & UDGs, and QL type-ins... so far!
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

TMD2003 wrote: Wed Sep 21, 2022 11:49 pm I can't run it on my regular PC - it's incompatible with 64-bit Windows.
Check if you downloaded it correctly. You are supposed to click on the "Download" button.

Direct link:

https://github.com/einar-saukas/ZX0/raw ... in/zx0.exe
User avatar
TMD2003
Rick Dangerous
Posts: 2032
Joined: Fri Apr 10, 2020 9:23 am
Location: Airstrip One
Contact:

Re: A new data compressor called ZX0

Post by TMD2003 »

...and that's working, on my main PC that outright rejected what I had yesterday. I was using the version found here:
https://github.com/einar-saukas/ZX0/tree/main/win

"Updated executables" seemed like the right place to go...

1,923 bytes! About 600 less than BMP2SCR required. I knew this would be the right option. Cheers!
Spectribution: Dr. Jim's Sinclair computing pages.
Features my own programs, modified type-ins, RZXs, character sets & UDGs, and QL type-ins... so far!
User avatar
Joefish
Rick Dangerous
Posts: 2042
Joined: Tue Nov 14, 2017 10:26 am

Re: A new data compressor called ZX0

Post by Joefish »

TMD2003 wrote: Wed Sep 21, 2022 11:49 pm I've also taken to analysing the 6912 bytes, piece by piece, writing a load of LD HL,***** then LD (HL),*** and the odd LDIR routine to draw it effectively byte by byte, leaving out all the huge chunks of zeroes. Effectively I'm compressing the screen by hand, which is lengthy and tedious.
An easily understood (though not hugely efficient) way to compress a screen that's got lots of empty character cells is to start with the attributes, and exploit the fact that no-one in their right mind would ever use the FLASH bit. So you scan your screen for empty character cells, and if you find one, set the FLASH bit of that attribute. Then you store your data as one byte of attribute, followed by eight bytes of pixel data, but you only store the pixel data if the FLASH bit is not set. Otherwise, you move straight on to the attribute for the next character cell.

When you come to unpack that stream of data, you first take a byte, check the top bit, and if it's set, you erase that bit, write the byte to the attribute, then write 0s into the pixel bytes for that character cell. If the top bit isn't set, then copy it up as an attribute, then copy the next eight bytes into the pixel bytes of that character cell.

Of course your screen data can't get any lower than 768 bytes (768 empty characters), but it's a start.

And then there are far more complex approaches you can take once you get the hang of it! But however elaborate the mechanism, decoding is the same. Read some signal bits or bytes from the compressed data, and depending on what you just read, either copy some more raw data from the stream or do something special like pad with blanks or copy something from earlier, or from a prepared 'dictionary' of common byte sequences.
User avatar
Lethargeek
Manic Miner
Posts: 735
Joined: Wed Dec 11, 2019 6:47 am

Re: A new data compressor called ZX0

Post by Lethargeek »

Joefish wrote: Thu Sep 22, 2022 11:28 am An easily understood (though not hugely efficient) way to compress a screen that's got lots of empty character cells is to start with the attributes, and exploit the fact that no-one in their right mind would ever use the FLASH bit. So you scan your screen for empty character cells, and if you find one, set the FLASH bit of that attribute. Then you store your data as one byte of attribute, followed by eight bytes of pixel data, but you only store the pixel data if the FLASH bit is not set. Otherwise, you move straight on to the attribute for the next character cell.

When you come to unpack that stream of data, you first take a byte, check the top bit, and if it's set, you erase that bit, write the byte to the attribute, then write 0s into the pixel bytes for that character cell. If the top bit isn't set, then copy it up as an attribute, then copy the next eight bytes into the pixel bytes of that character cell.

Of course your screen data can't get any lower than 768 bytes (768 empty characters), but it's a start.

And then there are far more complex approaches you can take once you get the hang of it! But however elaborate the mechanism, decoding is the same. Read some signal bits or bytes from the compressed data, and depending on what you just read, either copy some more raw data from the stream or do something special like pad with blanks or copy something from earlier, or from a prepared 'dictionary' of common byte sequences.
...aaand again the example is that screen compressor of mine, LgK aka Lethargeek Kompakt
first it filters the original screen to maximize the number of empty square chunks - 8x8 then 4x4 then 2x2
then the more clean filtered picture is compressed using the variable length prefix code
the idea is that squares generally have better correlation between all their pixels than byte strips
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

TMD2003 wrote: Thu Sep 22, 2022 9:25 amI was using the version found here:
https://github.com/einar-saukas/ZX0/tree/main/win

"Updated executables" seemed like the right place to go...
In this page, clicking on a filename will take you to another page, not directly to the actual file. So if you right clicked and saved it, you actually saved a webpage, not an executable file.

To get the actual file, first you need to left click on the file you want. Then you will see a button called "View raw", that you can use to download it.

The Github interface can be quite confusing, that's not my fault! :)
User avatar
TMD2003
Rick Dangerous
Posts: 2032
Joined: Fri Apr 10, 2020 9:23 am
Location: Airstrip One
Contact:

Re: A new data compressor called ZX0

Post by TMD2003 »

Einar Saukas wrote: Fri Sep 23, 2022 2:33 pm The Github interface can be quite confusing, that's not my fault! :)
I've always found it unnecessarily annoying, especially since a fair few projects of this type are hosted there. Sourceforge is a lot easier to navigate.

Anyway, the actual ZX0 has done some sterling work - four blocks of code (mostly screens and attributes) compressed to about a quarter of their size, maybe less... leaving me more room for a 12K wall of text. Of course, I have to fit the actual game in there somewhere...

Also, just a thought: how much stack space does ZX0 need when decompressing a full screen? Do tell me it's not 7K, otherwise I'll be in trouble. Does it decode a few bytes at a time and write them to their destination?
Spectribution: Dr. Jim's Sinclair computing pages.
Features my own programs, modified type-ins, RZXs, character sets & UDGs, and QL type-ins... so far!
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

TMD2003 wrote: Sun Sep 25, 2022 9:34 am Also, just a thought: how much stack space does ZX0 need when decompressing a full screen?
The standard version of the decompressor uses 4 bytes of stack.

The other 3 versions use 2 bytes of stack.

TMD2003 wrote: Sun Sep 25, 2022 9:34 amDoes it decode a few bytes at a time and write them to their destination?
Each decompressed byte is directly written to the destination.
User avatar
TMD2003
Rick Dangerous
Posts: 2032
Joined: Fri Apr 10, 2020 9:23 am
Location: Airstrip One
Contact:

Re: A new data compressor called ZX0

Post by TMD2003 »

No problems there, then. I can afford four bytes, come what may.
Spectribution: Dr. Jim's Sinclair computing pages.
Features my own programs, modified type-ins, RZXs, character sets & UDGs, and QL type-ins... so far!
User avatar
Prodatron
Drutt
Posts: 9
Joined: Tue Jun 06, 2023 3:21 pm

Re: A new data compressor called ZX0

Post by Prodatron »

Hi all, I am new in this forum, not (yet) a ZX Spectrum, but at least a Z80 guy, active on some other platforms like MSX, Enterprise and Amstrad CPC (my childhood system). Some weeks ago I discovered this fantastic ZX0 compressor and was so amazed, that I decided to fully integrate it into my project called "SymbOS", an operating system for Z80 systems (unfortunately currently not for the Spectrum).

Thanks so much to Einar Saukas for this fantastic piece of software!!

The next SymbOS version supports loading of ZX0 compressed executables (EXE, COM etc.) but also provides functions for applications for decompressing data. You can load compressed data from an opened file in a transparent way like if it would be uncompressed, which makes it very easy to update even existing applications for handling compressed data.

I added a little header to all kind of ZX0 compressed data...
- 1W: length of the compressed data
- 4B: last 4 bytes of uncompressed data
- 1W: number of bytes at the beginning, which should be skipped/shouldn't be compressed

The first word is necessary for the loading routine, which only knows the uncompressed size from the application.

The next 4 bytes are the last 4 bytes of the uncompressed data, which won't be included in the compressed ones. This makes it possible to load the compressed data exactly at the end of the area, where it will be decompressed. I never saw a delta>3, so to be safe I choosed to save 4 bytes, so there will never be a problem with the delta.

The last word is necessary for files, which have some meta-data at the beginning. These shouldn't be compressed, as you may want to load it separately without the need to decompress the whole thing. Therefore the prefix feature in ZX0 is a great thing for still using this part as dictionary.

Now all EXE (GUI executables) and COM (command line executables) files can be compressed, as well as some special executables "*.WDG" (desktop widgets) and "*.SAV" (screen saver).
Picture files (SGX) can be compressed, too, including desktop background pictures, as well as help files (HLP).
Last but not least I decided to add a header for compressed music files, which are supported in the SymAmp music player. These are PT3, ST2 (Amstrad Soundtrakker 128), SKM (Amstrad Starkos Tracker) and SA2 (Adlib Tracker 2), which can all be ZX0-compressed now.

The compression ratio is just fantastic, in most cases it's better than ZIP! Now it's possible again to place the whole operating system with all system apps, background picture and additional stuff which is loaded during booting on one Amstrad standard disc side (178K).
I am using the Turbo decompressor, which is still small and has a great speed. It's directly integrated into the SymbOS kernel, which makes it possible to decompress data with the full linear size of up to 63K!

I hope to be able to release the next SymbOS version later this year. I know that currently all this is not interesting for you Spectrum guys, but I just wanted to share this info, as I am using this awesome software now. Maybe/hopefully there will be a Spectrum Next version in the future.

Again a lot of thank you and respect to Einar Saukas (for the whole thing) and Introspec (for the Turbo decompressor)!
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

That's awesome! Thank you!!!

Now here's an idea. Your current format is as follows:

- 1W: total length of data after decompression
- last 4 bytes of uncompressed data
- 1W: number of uncompressed bytes at the beginning
- N uncompressed bytes at the beginning
- remaining compressed bytes

My suggestion is to use a simpler header like this:

- 1W: total length of data after decompression
- 1W: number of uncompressed bytes at the beginning (at least 4)
- N uncompressed bytes at the beginning (N>=4)
- remaining bytes, compressed backwards

This way, you will only need N uncompressed bytes total, instead of N+4.

To decompress, you copy into memory exactly N-4 uncompressed bytes, immediatelly followed by the compressed block. Afterwards decompress it backwards from the top of memory area. Finally copy the remaining 4 uncompressed bytes.

Compressing backwards may sound awkward, but you should get a slightly better compression in almost every file...
User avatar
Prodatron
Drutt
Posts: 9
Joined: Tue Jun 06, 2023 3:21 pm

Re: A new data compressor called ZX0

Post by Prodatron »

Hi Einar, thanks a lot for your answer and your ideas!

Using backwards compression to have the "4 bytes" (for avoiding the delta problem) at the beginning in the uncompressed area is a good idea!
Einar Saukas wrote: Wed Jun 07, 2023 2:42 am Compressing backwards may sound awkward, but you should get a slightly better compression in almost every file...
Is it really like this?
I made some quick and dirty tests with everal SymbOS EXE files, and the ratio with backward compression was always a little bit worse compared to normal compression. Then I remembered, that EXE files have the relocator table at the end, which may be a bad start for the dictionary. So I did the same with bitmap files, but here there is the same result, all backward compressed files are a little bit larger than normal compressed ones.
I don't really get this, as backward or forward is just swapping LDIR and LDDR, and why should the dictionary be somehow more optimal, when you start from the beginning or from the end.
Anyway I will make some more tests, especially I like to find out, if your suggestion will save some code on OS side, which is always a very good thing :)
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

Prodatron wrote: Wed Jun 07, 2023 5:26 pm Is it really like this?
Ops, I should have explained what I meant!

The compression itself shouldn't make much difference. Some files work slightly better with forward compression, others with backward compression.

However the new format I suggested (using backward compression) will allow compressing 4 more bytes per file.
User avatar
zara6502
Drutt
Posts: 7
Joined: Wed Nov 08, 2023 2:32 pm

Re: A new data compressor called ZX0

Post by zara6502 »

Hello friends. I do not know English, so I use a translator, please forgive me if he does not translate correctly.

Probably my question is addressed to Einar, but I will be glad if someone else answers.

I like just learning different algorithms and coding something in C# and ASM for ATARI. But my programming skills are very low.

I have read the description of the algorithm ZX0 on Github, but I cannot understand for you that "Copy from the last offset (repeat N bytes from the last offset)" - what is offset and last offset in this context, where do they come from? What is new offset and how is it formed? Why is the second block "0" immediately following Literal?

PS: I made my compression algorithm based on the LZSS and Gamma-Code Elias paradigm even before I met ZX0 and I liked its results, but ZX0 literally ruined my plans XD I would be interested to know the answers to my questions, maybe it will help me improve my algorithm. Thx
User avatar
PeterJ
Site Admin
Posts: 6858
Joined: Thu Nov 09, 2017 7:19 pm
Location: Surrey, UK

Re: A new data compressor called ZX0

Post by PeterJ »

Welcome @zara6502,

I have used the mention system to alert @Einar Saukas to your message.

Translation services are very good these days. No need to apologise for using one.
XoRRoX
Manic Miner
Posts: 231
Joined: Wed Jul 11, 2018 6:34 am

Re: A new data compressor called ZX0

Post by XoRRoX »

@Einar Saukas

In a new 128k game project, I am already using your zx0 de-compression for graphics in 128k banks to much satisfaction. But now, as it will be a physical tape game, to shorten the loading time I'd also like to compress the main-part. It occupies memory from 24832-65535=40703 which compresses down to 21.755 bytes (!!! :D )
I have read the documentation and think I should in this case use backward (de)compression but am until now unsuccessful in doing so.

Could you please explain how I should set this up? If needed, to create space for the loader & decompressor, I could shave necessary memory off the beginning of the file and load that last, for example.

Thank you in advance.
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

zara6502 wrote: Wed Nov 08, 2023 2:50 pm I have read the description of the algorithm ZX0 on Github, but I cannot understand for you that "Copy from the last offset (repeat N bytes from the last offset)" - what is offset and last offset in this context, where do they come from? What is new offset and how is it formed? Why is the second block "0" immediately following Literal?

PS: I made my compression algorithm based on the LZSS and Gamma-Code Elias paradigm even before I met ZX0 and I liked its results, but ZX0 literally ruined my plans XD I would be interested to know the answers to my questions, maybe it will help me improve my algorithm. Thx
"Offset" and "literal" are concepts from standard LZSS. If you have already implemented a compression algorithm based on LZSS, you certainly used them yourself!

When decompressing, you are basically reading bytes from a compressed file ("source") while writing bytes to the decompression memory area ("destination"):
  • If source indicates that a compressed block is a literal, it simply means to copy next byte from source to destination.
  • Otherwise compressed block is a repetition. It simply means to repeat a few (previously already decompressed) bytes again in destination. Basically copy a few bytes ("length") from a few positions back in destination ("offset") to destination again.
For instance, imagine you have a sequence "ABCDBC". In ZXSS, it will be compressed as follows:
  • Literal "A"
  • Literal "B"
  • Literal "C"
  • Literal "D"
  • Repeat 2 bytes from 3 positions behind (length=2 and offset=3), thus producing "BC".
There's a more detailed explanation in Wikipedia. You can also try Youtube, there are probably online LZSS tutorials in your native language.
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

XoRRoX wrote: Wed Nov 08, 2023 8:28 pm It occupies memory from 24832-65535=40703 which compresses down to 21.755 bytes (!!! :D )
I have read the documentation and think I should in this case use backward (de)compression but am until now unsuccessful in doing so.
When you compress it backwards, take note of the "delta" that it will provide.

Suppose you got delta=4. It means you must load the compressed file to address 24832-4=24828 (or lower). You also need a ZX0 backwards decoder, for instance you can load dzx0_standard_back (69 bytes) at address 24828-69=24759.

If you load a compressed file with size 21755 bytes to address 24828, then the last byte of this compressed file will be at address 24828+21755-1=46582. That's your initial decompression address to decompress backwards. Now you only need to execute something like this:

Code: Select all

LD    HL, 46582  ; source address
LD    DE, 65535  ; target address
CALL  24759      ; decompress backwards
User avatar
zara6502
Drutt
Posts: 7
Joined: Wed Nov 08, 2023 2:32 pm

Re: A new data compressor called ZX0

Post by zara6502 »

Einar, thank you so much for answering in this topic. I have great hope to figure out this algorithm.
I can't know in advance what intonation the translator will show in the narration, so I will say right away that I have no negativity, narcissism or claims. I treat all members of retro communities very carefully and with respect. I hope for a productive dialogue.
Einar Saukas wrote: Wed Nov 08, 2023 11:00 pm "Offset" and "literal" are concepts from standard LZSS. If you have already implemented a compression algorithm based on LZSS, you certainly used them yourself!
I just can't match LZSS with your description. For me, they differ quite a lot.
Einar Saukas wrote: Wed Nov 08, 2023 11:00 pm When decompressing, you are basically reading bytes from a compressed file ("source") while writing bytes to the decompression memory area ("destination"):
That's right, that's exactly the approach I use.
Einar Saukas wrote: Wed Nov 08, 2023 11:00 pm For instance, imagine you have a sequence "ABCDBC".
The example shown is clear to me and I am guided by this principle in my reasoning.
Einar Saukas wrote: Wed Nov 08, 2023 11:00 pm There's a more detailed explanation in Wikipedia.
I'm just relying on this article and my experiments are going on with the verse from this article "I am Sam", I literally have already become that Sam myself :lol:
My algorithm now compresses this verse to 92 bytes, although in the article it is 107 bytes and I use a break-even point of 1 byte, that is, I compress character by character.
Basically, I set out to make an algorithm for short phrases that does not use a dictionary, when I found out about ZX0 I was pleasantly surprised that such an idea had already been implemented.
Using ZX0.EXE for Windows, I compressed the wikipedia verse to 91 bytes (and using ZX5 as much as 89 bytes).
When viewing the result in a text editor, I see chunks of phrases, I didn't expect them to be there, since the offset is bitwise and a match with ASCII is unlikely, but what I see shows where you used literals.
I also read your source code in C and found no mention of any special "tricky" blocks, but when I try to repeat your algorithm with my hands, it is absolutely inefficient and compresses very poorly (at the level of the classic LZSS from the article, I think I got 102 bytes).
In my reasoning, I see that you do not encode individual characters in any way and do not compensate for the impossibility of compressing literals. I'll show you everything in more detail below.
Einar Saukas wrote: Wed Nov 08, 2023 11:00 pm You can also try Youtube, there are probably online LZSS tutorials in your native language.
Oh, believe me, I did it a long time ago, I read articles both in Russian and translated into English, but they all somehow correspond to the classical algorithm and I can't compare the description of your algorithm on Github with the classical algorithm.

Let's raise your example "ABCDBC" and suggest using your blocks:

Code: Select all

Literal (copy the following N bytes from the compressed file)
0 bytes Elias(length)[1] byte[2] ...  byte[N]
at the output we will get:

Code: Select all

A (the first character without bit 0)
0 3 BCD -> 0 011 BCD
and here I don't understand why the

Code: Select all

Copy from last offset (repeat N bytes from last offset)
0 Alias(length)
block is needed, what is the offset value now?
In my implementation there is a block

Code: Select all

1 2 2 -> 1 010 010
and we get the sequence:

Code: Select all

A 0 011 BCD 1 010 010
that is, 11 bits of codes and 32 bits of literals, in the end it is 43 bits.
But here I compressed half with your implementation option, and half with my own, since I can't deal with two blocks after the literal one.

Code: Select all

Copy from new offset (repeat N bytes from new offset)
1  Elias(MSK(offset)+1)  LSB(offset)  Elias(length-1)
If new offset is mentioned here, then last offset was set somehow in advance, but by what? If the same block, then it must always precede the second block "0 Elias(length)" otherwise how will the last offset be obtained?

In my implementation for short data, I use only two blocks:

Code: Select all

0 + sum
1 + Elias(lan) + Elias(offset)
I conducted experiments and the blocks "0 + sym" showed either the same or better encoding than "0 Alias(length) byte[1] ... byte[N]".
Here is a simple table with 10 different blocks of literals up to 8 bytes long and two columns of the table where I encode in the first case with your block, and in the second with mine:

Code: Select all

Block		Einar					zara6502	difference (Einar-zara6502)
A		0 1 A -> 0 1 +8 (10 bits)		1*9 (9 bits)	+1
AA		0 2 A -> 0 010 +16 (20 bits)		2*9 (18 bits)	+2
AAA		0 3 A -> 0 011 +24 (28 bits)		3*9 (27 bits)	+1
AAAA		0 4 A -> 0 00100 +32 (38 bits)		4*9 (36 bits)	+2
AAAAA		0 5 A -> 0 00101 +40 (46 bits)		5*9 (45 bits)	+1
AAAAAA		0 6 A -> 0 00110 +48 (54 bits)		6*9 (54 bits)	0
AAAAAAA		0 7 A -> 0 00111 +56 (62 bits)		7*9 (63 bits)	-1
AAAAAAAA	0 8 A -> 0 0001000 +64 (72 bits)	8*9 (72 bits)	0
As can be seen from the table, this approach to encoding short phrases is ineffective. Indirectly, I think this also reduces decoding performance.

Let's take an example for the verse "I am Sam" from wikipedia, let's take only the first line

Code: Select all

I am Sam\n
You get 80 bits per line

Code: Select all

$06 + I_am_Sam\n
you don't encode it at all (I'm looking at the bytes in the output file after ZX0 compression) and it turns out to be a block of literals, although it's unclear why you just don't have "I" as the first character, this line starts with $06, which does not correspond to any of the blocks.
I have this string encoded like this:

Code: Select all

len	command        Text bytes	Length (bits)   Block code      bits (len bytes)        bits (offset bytes)
4	>		I am		36
1	%		_		5		1		1 (1)			3 (3)
1	>		S		9
2	:		am		7		1		3 (2)			3 (3)
1	>		\n		9
-------------------------------------------------------------------------------------------------------------------
9					66
The blocks marked with ">" are literal blocks, they are encoded as "0 + sim", that is, in our case:

Code: Select all

0 I 0 <space> 0 a 0 m
Formally, and I learned this from you, it's logical, you can omit 0 from the first character:

Code: Select all

I 0 <space> 0 a 0 m
All characters are packed in 35 bits (in the diagram above 36).
The second line is marked with the symbol "%" - this is the encoding of a single character, it is no different from the code ":" in the line and uses the encoding mask "1 + Elias(len) + Elias(offset)", but for me to test and it was convenient for statistics to see where the encoding of single characters takes place.
Since there is a space in the previous text, we encode it in 5 bits - 1 1 3, that is, 1 bit of the block, length 1 in the Elias code and offset 3 in the Elias code.
The third string is literal - 9 bits (0 S).
The fourth line "am" has already been previously and is encoded in 7 bits - 1 3 3 - > 1 010 011
And the line feed is also a 9-bit literal - 0\n.
In total, on the same line, I get 35+5+9+7+9=65 bit (I exited the first 0 block).

Considering that the general trend towards encoding persists for both you and me, I am trying to understand at what stage you were not only able to return 15 bits of the first line, but also eventually bypass 8 bits (approximately, since here we are talking about a boundary transition to 1 full byte). And for the ZX5, you compress another 2 bytes harder.
In total, I have 43 characters encoded with literals in this verse, which gives an overabundance when encoding in 43 bits (1 bit per character), but the gain from using the block

Code: Select all

"0 Elias(length) byte[1]  byte[2] ... byte[N]"
I won't have, since there are no long literal blocks sufficient to compensate for 1 bit per character.
There are 24 single encoded characters, with a total encoded length of 144 bits, here we win by encoding 24*8-144= 48 bits.
Which together with the previous literals gives a gain of 5 bits on two types of blocks.
In your case, literal blocks are not encoded at all in a smaller way (at least on short phrases) and as I understand it in ZX0, single characters are not encoded at all. I found confirmation of this in some of your statements referring to the block "Elias(length-1)", that is, you consider only blocks of 2 bytes in size, that is, the Elias code 1 encoded in 1 bit is considered a code for the value 2.

In my case, there are several literal phrases with a size of 2 bytes and higher

Code: Select all

4  > I am		36	+2
2  > Th			18	+2
2  > -I			18	+2
2  > !\n		18	+2
2  > do			18	+2
5  > like\n		45	+1
2  > gr			18	+2
4  > m?\n\n		36	+2
2  > th			18	+2
2  > m,			18	+2
If we turn to the comparison table of your literal blocks and mine, then no such block will have an advantage in coding.
In fact, using your approach to encoding literal blocks, we would get a code increase of 19 bits (~2.5 bytes).
But if you look at your output file received after ZX0 compression, you can notice literal blocks in such a number:

Code: Select all

I am Sam
That_
-Ά-€!
dono
like\nt
D
y΅u
 green eggs λnd hΏ?
em,
Moreover, they are strangely sometimes torn by 1 byte, and for "dono" there is no gap at all, and then how to restore from this "do not" I do not understand :shock: :lol:

When I thought over the algorithm myself, I started everything with encoding the symbol table.
The fact is that graphic modes with reprogramming of the symbol table are very semi-binary for ATARI, this on the one hand gives tangible savings for the size of the screen area (20*24) characters = 480 bytes for the five-color mode, on the other - for static screens, such as Popeye, Mario Bros., Pacman, etc. this frees up a lot RAM.
And in such studies, I noticed that the frequency of bytes in the graph is in small blocks of 8-16-24-32 bytes for (offset), and (length) usually does not exceed 1-4 bytes.
So the idea came up to encode values with Elias codes for small numbers. In such a situation, we can even compress an 8*8 sprite to a few bits.
For example, as in the Einar game "Dreamwalker: Alter Ego 2", where the playing field is divided into familiar ones. (On occasion I would like to shake your hand for this game, it is an impressively pleasant and balanced project)

It also took me a while to come to the conclusion that I made (offset) indicating not the beginning of the block, but the end (in my example above, you might have noticed that I use an offset of 3 bytes for "am", not 5). Since this reduces the value for encoding with the Elias gamma code by (length), which on short values gave an increase of 10% in compression and I was able to cross the psychological threshold of 100 bytes of encoded text just for the verse "I am Sam".
Also, at the initial stage, I used fixed byte constructions, changed them by 3-4-5 bits, added additional block codes and blocks 0, 10, 11 worked very well, I think I stopped at 94 bytes for "I am Sam", but it worked poorly on short data (overhead with a static bit size is obvious offset and len blocks).

Now I want to figure out how everything works with you. Either this will give me new information for applying some tricks (I read your correspondence in this topic with Introspec and I didn't understand about the Elias codes, but I got the impression that the Alice codes can somehow be simplified) or I will understand that my coder using your techniques will turn into ZX0, then it's easier for me will stop.
I also considered the issue of creating only the optimal encoder, but due to the small size of the data, the search for the optimal code will not take much time in any case, that is, it would be possible to compress my code on the Z80 itself.

I hope my writing has not completely confused you.

PS: I will attach this text under the spoiler, it may be easier for Introspec to read it in Russian, rather than an automatic translation in English (If, of course, he deems it necessary to read my opus).
Spoiler

Code: Select all

Возьмём ваш пример "ABCDBC" и попробуем использовать ваши блоки:
Literal (copy next N bytes from compressed file), 0  Elias(length)  byte[1]  byte[2]  ...  byte[N]
на выходе мы получим:
A (первый символ без бита 0)
0 3 BCD -> 0 011 BCD

и вот сдесь я не понимаю зачем нужен блок Copy from last offset (repeat N bytes from last offset), 0  Elias(length)
какое сейчас значение offset?
В моем варианте реализации идёт блок
1 2 2 -> 1 010 010

и мы получаем последовательность: "A 0 011 BCD 1 010 010", то есть 11 бит кодов и 32 бита литералов, в итоге это 43 бита.
Но тут я сжимал половину вашим вариантом реализации, а половину своим, так как я не могу разобраться с двумя блоками после литерального.
Copy from new offset (repeat N bytes from new offset), 1  Elias(MSB(offset)+1)  LSB(offset)  Elias(length-1)
Если тут упоминается new offset, то значит last offset был установлен как-то заранее, но чем? Если этим же блоком, то он должен всегда предшествовать второму блоку "0 Elias(length)" иначе как будет получен last offset?

В своей реализации для коротких данных я использую только два блока:
0 + sym
1 + Elias(len) + Elias(offset)
Я проводил эксперименты и блоки "0 + sym" показывали или такое же или лучшее кодирование чем "0 Elias(length) byte[1] ... byte[N]".
Вот простая таблица на 10 разных блоков литералов длиной до 8 байт и два столбца таблицы где я кодирую в первом случае вашим блоком, а во стором моим:
Block		Einar					zara6502	разница Einar-zara6502
A		0 1 A -> 0 1 +8 (10 bits)		1*9 (9 bits)	+1
AA		0 2 A -> 0 010 +16 (20 bits)		2*9 (18 bits)	+2
AAA		0 3 A -> 0 011 +24 (28 bits)		3*9 (27 bits)	+1
AAAA		0 4 A -> 0 00100 +32 (38 bits)		4*9 (36 bits)	+2
AAAAA		0 5 A -> 0 00101 +40 (46 bits)		5*9 (45 bits)	+1
AAAAAA		0 6 A -> 0 00110 +48 (54 bits)		6*9 (54 bits)	0
AAAAAAA		0 7 A -> 0 00111 +56 (62 bits)		7*9 (63 bits)	-1
AAAAAAAA	0 8 A -> 0 0001000 +64 (72 bits)	8*9 (72 bits)	0

Как видно из таблицы, такой подход к кодированию коротких фраз неэффективен. Косвенно это, как я думаю, снижает и производительность при декодировании.

Рассмотрим пример для стиха "I am Sam" из википедии возьмём только первую строку
I am Sam\n
У вас получается 80 битов на эту строку
$06 + I_am_Sam\n
вы её совсем не кодируете (я смотрю на байты в выходном файле после сжатия ZX0) и она получается блоком литералов, хотя непонятно почему у вас первым символом просто не идёт "I", у вас эта строка начинается с $06, что не соотвествует ни одному из блоков.
У меня эта строка кодируется вот так:
len	действие        основноё текст	длина в битах   код блока       bits (len bytes)        bits (offset bytes)
4	>		I am		36
1	%		_		5		1		1 (1)			3 (3)
1	>		S		9
2	:		am		7		1		3 (2)			3 (3)
1	>		\n		9
-------------------------------------------------------------------------------------------------------------------
9					66

Блоки отмеченные ">" это литеральные блоки, они кодируются как "0 + sym", то есть в нашём случае:
0 I 0 <space> 0 a 0 m
Формально, и я это узнал от вас, это логично, у первого символа можно 0 опустить:
I 0 <space> 0 a 0 m
Все символы упакуются в 35 бит (на схеме выше 36).
Вторая строка отмечена символом "%" - это кодирование одиночного символа, оно ничем не отличается от кода ":" в строке и использует маску кодирования "1 + Elias(len) + Elias(offset)", но мне для тестирования и статистики было удобно видеть где происходит кодирование одиночных символов.
Так как пробел есть в предыдущем тексте, то мы кодируем его в 5 бит - 1 1 3, то есть 1 бит блока, длина 1 в коде Элиаса и смещение 3 в коде Элиаса.
Третья строка литерал - 9 бит (0 S).
Четвёртая строка "am" уже была ранее и кодируется в 7 бит - 1 3 3 -> 1 010 011
И перевод строки тоже литерал 9 бит - 0 \n.
Итого, на ту же строку, у меня получается 35+5+9+7+9=65 бит (я вычел первый 0 блока).
Учитывая что общая тенденция к кодированию и у меня и у вас сохраняется, то я пытаюсь понять на каком этапе вы не только смогли вернуть 15 бит первой строки, но и обойти в конечном итоге на 8 бит (примерно, так как тут речь о граничном переходе на 1 полный байт). А для ZX5 вы сжимаете еще на 2 байта сильнее.
Всего литералами в этом стихе у меня закодировано 43 символа, что даёт переизбыток при кодировании в 43 бита (1 бит на символ), но выигрыша от использования блока "0  Elias(length)  byte[1]  byte[2]  ...  byte[N]" у меня не будет, так как нет длинных литеральных блоков, достаточных, чтобы компенсировать 1 бит на символ.
Одиночных закодированных символов - 24, при общей закодированной длине 144 бита, здесь мы выигрываем за счет кодирования 24*8-144=48 бит.
Что вместе с предыдущими литералами даёт выигрыш в 5 бит на двух типах блоков.
У вас же литеральные блоки совсем не кодируются в меньшую сторону (во всяком случае на коротких фразах) и как я понял в ZX0 одиночные символы совсем не кодируются. Этому я нашёл подтверждение в некоторых ваших высказываниях с отсылкой на блок "Elias(length-1)", то есть вы рассматриваете только блоки размером от 2 байт, то есть код Элиаса 1, кодируемый в 1 бит у вас считается кодом для значения 2.

В моем случае литеральных фраз размером от 2 байт и выше несколько
4  > I am					36	+2
2  > Th						18	+2
2  > -I						18	+2
2  > !\n					18	+2
2  > do						18	+2
5  > like\n					45	+1
2  > gr						18	+2
4  > m?\n\n					36	+2
2  > th						18	+2
2  > m,						18	+2

Если обратиться к таблице сравнения ваших литеральных блоков и моих, то ни на одном таком блоке не будет получено преимущество при кодировании.
Фактически используя ваш подход к кодированию литеральных блоков мы получили бы увеличение кода на 19 бит (~2.5 байта).
Но если посмотреть на ваш выходной файл полученный после сжатия ZX0, то можно заметить литеральные блоки в таком количестве:
I am Sam
That_
-Ά-€!
dono
like\nt
D
y΅u
 green eggs λnd hΏ?
em,
Причем они странным образом иногда разрываются на 1 байт, а для "dono" вообще разрыва нет и как потом восстановить из этого "do not" я не понимаю XD

Я, когда продумывал сам алгоритм, начинал всё с кодирования таблицы символов.
Дело в том что для ATARI очень полудярны графические режимы с перепрограммированием таблицы символов, это с одной стороны даёт ощутимую экономию для размера экранной области (20*24) символа = 480 байт для пятицветного режима, с другой - для статичных экранов, как например Popeye, Mario Bros., Pacman и т.п. это освобождает массу оперативной памяти.
И в таких исследованиях я заметил что часто повторяемость байтов в графике идёт небольшими блоками 8-16-24-32 байта для (offset), а (length) обычно не превышает 1-4 байт.
Так и появилась идея кодировать значения кодами Элиаса для малых чисел. В такой ситуации мы даже можем сжимать спрайт 8*8 до нескольких бит.
Например как в игре Einar "Dreamwalker: Alter Ego 2", где игровое поле разбито на знакоместа.

Так же не сразу я пришёл к тому, что сделал (offset) указывающим не начало блока, а на конец (в моём примере выше вы могли заметить, что я использую смещение 3 байта для "am", а не 5). Так как это уменьшает значение для кодирование гамма-кодом Элиаса на (length), что на коротких значениях дало припрост в 10% по сжатию и я как раз для стиха "I am Sam" смог перейти психологический рубеж в 100 байт закодированного текста.
Так же на начальном этапе я использовал фиксированные байтовые конструкции, менял их по 3-4-5 бит, добавлял дополнительные коды блоков и весьма неплохо работали блоки 0, 10, 11, я кажется остановился на 94 байтах для "I am Sam", но на коротких данных это работало плохо (очевидны накладные расходы при статичном размере битовых блоков offset и len).

Сейчас же мне хочется разобраться с тем как всё устроено у вас. Либо это даст мне новую информацию для применения каких-то хитростей (я читал вашу преписку в этой теме с Introspec и мне было непонятно про коды Элиаса, но создалось впечатление что коды Элиса можно как-то упростить) либо я пойму, что мой кодер с применением ваших техник превратится в ZX0, тогда мне проще будет остановиться.
Я так же рассматривал вопрос по созданию только оптимального кодера, но из-за малых размеров данных поиск оптимального кода не будет занимать много времени в любом случае, то есть сжимать моим кодом было бы возможно и на самом Z80.

Надеюсь моя писанина вас не запутала окончательно. Я прикреплю этот текст на расском под спойлером, возможно Introspec будет проще читать это на русском, а не на автоматическом переводе на английском.
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

zara6502 wrote: Thu Nov 09, 2023 8:04 am Einar, thank you so much for answering in this topic. I have great hope to figure out this algorithm.
You are welcome!

zara6502 wrote: Thu Nov 09, 2023 8:04 am I just can't match LZSS with your description. For me, they differ quite a lot.
The concepts are exactly the same. But the implementation details are very different.

zara6502 wrote: Thu Nov 09, 2023 8:04 am Let's raise your example "ABCDBC" and suggest using your blocks:

Code: Select all

Literal (copy the following N bytes from the compressed file)
0 bytes Elias(length)[1] byte[2] ...  byte[N]
at the output we will get:

Code: Select all

A (the first character without bit 0)
0 3 BCD -> 0 011 BCD
No! This output will be:

Code: Select all

0 Elias(4)=00001 ABCD -> 0 00001 "ABCD"
Since it's the first literal, the initial 0 is omitted:

Code: Select all

Elias(4)=00001 ABCD -> 00001 "ABCD"
zara6502 wrote: Thu Nov 09, 2023 8:04 am When viewing the result in a text editor, I see chunks of phrases, I didn't expect them to be there, since the offset is bitwise and a match with ASCII is unlikely, but what I see shows where you used literals.
ASCII codes are stored as full bytes, not broken into bits. The above block will be saved as follows:

Code: Select all

Elias(4)=00001 ABCD -> 00001??? "ABCD"
The 3 bits temporarily marked as ? will be later filled with bits from the following block, until it completes a full byte.

zara6502 wrote: Thu Nov 09, 2023 8:04 am how will the last offset be obtained?
Imagine you have a sequence "ABCDBCEBCEB". In ZXSS, it will be compressed as follows:
  • Literals "ABCD" -> full sequence "ABCD"
  • Repeat 2 bytes from 3 positions behind (length=2 and offset=3), thus producing "BC" -> full sequence "ABCDBC"
  • Literal "E" -> full sequence "ABCDBCE"
  • Repeat 4 bytes from 3 positions behind (length=4 and offset=3), thus producing "BCEB" -> full sequence "ABCDBCEBCEB"
The last block above reuses the same offset=3 that was previously used last time. Therefore there's no need to store this offset again, you can use "copy from last offset" instead of "copy from new offset".

zara6502 wrote: Thu Nov 09, 2023 8:04 am It also took me a while to come to the conclusion that I made (offset) indicating not the beginning of the block
In my example above, notice that last block uses "length > offset". In your case, you won't be able to do it.

zara6502 wrote: Thu Nov 09, 2023 8:04 am I can't deal with two blocks after the literal one.
This is how ZX0 uses a single bit to distinguish between 3 different block types:
  • Immediately after a literal block, you cannot have another literal block (otherwise you would have merged both into a single block). Therefore after a literal block, bit 0 means "copy from last offset" and bit 1 means "copy from new offset".
  • Immediately after any copy block (either from last or new offset), you cannot have a "copy from last offset" block (otherwise you would have merged both into a single block). Therefore after after any copy block, bit 0 means "literal" and bit 1 means "copy from new offset".
zara6502 wrote: Thu Nov 09, 2023 8:04 am I conducted experiments and the blocks "0 + sym" showed either the same or better encoding than "0 Alias(length) byte[1] ... byte[N]".
Sure, your idea would reduce size of very small literal sequences. However it means the possibility of consecutive literal blocks, therefore a single bit wouldn't be enough to distinguish between 3 different block types. You would need to spend extra bits somewhere else. It's not worth it.
User avatar
zara6502
Drutt
Posts: 7
Joined: Wed Nov 08, 2023 2:32 pm

Re: A new data compressor called ZX0

Post by zara6502 »

Einar Saukas wrote: Thu Nov 09, 2023 1:09 pm You are welcome!
Thank you for your time and clarification, I really appreciate it. I read it briefly after work and it seems to me that I will be able to understand the topic.
Einar Saukas wrote: Thu Nov 09, 2023 1:09 pm Elias(4)=00001
and why do you have the code 00001 and not 00100?
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

zara6502 wrote: Thu Nov 09, 2023 2:18 pm and why do you have the code 00001 and not 00100?
Because ZX0 uses interlaced Elias Gamma Code instead of classic Elias Gamma Code, thus bit order is different. See here:

https://github.com/einar-saukas/Zeta-Xi-Code
User avatar
zara6502
Drutt
Posts: 7
Joined: Wed Nov 08, 2023 2:32 pm

Re: A new data compressor called ZX0

Post by zara6502 »

Einar Saukas wrote: Thu Nov 09, 2023 2:31 pm Because ZX0 uses interlaced Elias Gamma Code instead of classic Elias Gamma Code, thus bit order is different. See here:
https://github.com/einar-saukas/Zeta-Xi-Code
Thanks.
Uh, and now it's difficult. I didn't understand anything after reading it, but I will accept it as a constant.
Does it affect simplicity/speed? or is this representation capable of saving bits on small values?
Judging by the examples, this only affects large numbers?
Post Reply