A new data compressor called ZX0

The place for codemasters or beginners to talk about programming any language for the Spectrum.
User avatar
TMD2003
Rick Dangerous
Posts: 2045
Joined: Fri Apr 10, 2020 9:23 am
Location: Airstrip One
Contact:

Re: A new data compressor called ZX0

Post by TMD2003 »

No problems there, then. I can afford four bytes, come what may.
Spectribution: Dr. Jim's Sinclair computing pages.
Features my own programs, modified type-ins, RZXs, character sets & UDGs, and QL type-ins... so far!
User avatar
Prodatron
Drutt
Posts: 9
Joined: Tue Jun 06, 2023 3:21 pm

Re: A new data compressor called ZX0

Post by Prodatron »

Hi all, I am new in this forum, not (yet) a ZX Spectrum, but at least a Z80 guy, active on some other platforms like MSX, Enterprise and Amstrad CPC (my childhood system). Some weeks ago I discovered this fantastic ZX0 compressor and was so amazed, that I decided to fully integrate it into my project called "SymbOS", an operating system for Z80 systems (unfortunately currently not for the Spectrum).

Thanks so much to Einar Saukas for this fantastic piece of software!!

The next SymbOS version supports loading of ZX0 compressed executables (EXE, COM etc.) but also provides functions for applications for decompressing data. You can load compressed data from an opened file in a transparent way like if it would be uncompressed, which makes it very easy to update even existing applications for handling compressed data.

I added a little header to all kind of ZX0 compressed data...
- 1W: length of the compressed data
- 4B: last 4 bytes of uncompressed data
- 1W: number of bytes at the beginning, which should be skipped/shouldn't be compressed

The first word is necessary for the loading routine, which only knows the uncompressed size from the application.

The next 4 bytes are the last 4 bytes of the uncompressed data, which won't be included in the compressed ones. This makes it possible to load the compressed data exactly at the end of the area, where it will be decompressed. I never saw a delta>3, so to be safe I choosed to save 4 bytes, so there will never be a problem with the delta.

The last word is necessary for files, which have some meta-data at the beginning. These shouldn't be compressed, as you may want to load it separately without the need to decompress the whole thing. Therefore the prefix feature in ZX0 is a great thing for still using this part as dictionary.

Now all EXE (GUI executables) and COM (command line executables) files can be compressed, as well as some special executables "*.WDG" (desktop widgets) and "*.SAV" (screen saver).
Picture files (SGX) can be compressed, too, including desktop background pictures, as well as help files (HLP).
Last but not least I decided to add a header for compressed music files, which are supported in the SymAmp music player. These are PT3, ST2 (Amstrad Soundtrakker 128), SKM (Amstrad Starkos Tracker) and SA2 (Adlib Tracker 2), which can all be ZX0-compressed now.

The compression ratio is just fantastic, in most cases it's better than ZIP! Now it's possible again to place the whole operating system with all system apps, background picture and additional stuff which is loaded during booting on one Amstrad standard disc side (178K).
I am using the Turbo decompressor, which is still small and has a great speed. It's directly integrated into the SymbOS kernel, which makes it possible to decompress data with the full linear size of up to 63K!

I hope to be able to release the next SymbOS version later this year. I know that currently all this is not interesting for you Spectrum guys, but I just wanted to share this info, as I am using this awesome software now. Maybe/hopefully there will be a Spectrum Next version in the future.

Again a lot of thank you and respect to Einar Saukas (for the whole thing) and Introspec (for the Turbo decompressor)!
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

That's awesome! Thank you!!!

Now here's an idea. Your current format is as follows:

- 1W: total length of data after decompression
- last 4 bytes of uncompressed data
- 1W: number of uncompressed bytes at the beginning
- N uncompressed bytes at the beginning
- remaining compressed bytes

My suggestion is to use a simpler header like this:

- 1W: total length of data after decompression
- 1W: number of uncompressed bytes at the beginning (at least 4)
- N uncompressed bytes at the beginning (N>=4)
- remaining bytes, compressed backwards

This way, you will only need N uncompressed bytes total, instead of N+4.

To decompress, you copy into memory exactly N-4 uncompressed bytes, immediatelly followed by the compressed block. Afterwards decompress it backwards from the top of memory area. Finally copy the remaining 4 uncompressed bytes.

Compressing backwards may sound awkward, but you should get a slightly better compression in almost every file...
User avatar
Prodatron
Drutt
Posts: 9
Joined: Tue Jun 06, 2023 3:21 pm

Re: A new data compressor called ZX0

Post by Prodatron »

Hi Einar, thanks a lot for your answer and your ideas!

Using backwards compression to have the "4 bytes" (for avoiding the delta problem) at the beginning in the uncompressed area is a good idea!
Einar Saukas wrote: Wed Jun 07, 2023 2:42 am Compressing backwards may sound awkward, but you should get a slightly better compression in almost every file...
Is it really like this?
I made some quick and dirty tests with everal SymbOS EXE files, and the ratio with backward compression was always a little bit worse compared to normal compression. Then I remembered, that EXE files have the relocator table at the end, which may be a bad start for the dictionary. So I did the same with bitmap files, but here there is the same result, all backward compressed files are a little bit larger than normal compressed ones.
I don't really get this, as backward or forward is just swapping LDIR and LDDR, and why should the dictionary be somehow more optimal, when you start from the beginning or from the end.
Anyway I will make some more tests, especially I like to find out, if your suggestion will save some code on OS side, which is always a very good thing :)
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

Prodatron wrote: Wed Jun 07, 2023 5:26 pm Is it really like this?
Ops, I should have explained what I meant!

The compression itself shouldn't make much difference. Some files work slightly better with forward compression, others with backward compression.

However the new format I suggested (using backward compression) will allow compressing 4 more bytes per file.
User avatar
zara6502
Drutt
Posts: 8
Joined: Wed Nov 08, 2023 2:32 pm

Re: A new data compressor called ZX0

Post by zara6502 »

Hello friends. I do not know English, so I use a translator, please forgive me if he does not translate correctly.

Probably my question is addressed to Einar, but I will be glad if someone else answers.

I like just learning different algorithms and coding something in C# and ASM for ATARI. But my programming skills are very low.

I have read the description of the algorithm ZX0 on Github, but I cannot understand for you that "Copy from the last offset (repeat N bytes from the last offset)" - what is offset and last offset in this context, where do they come from? What is new offset and how is it formed? Why is the second block "0" immediately following Literal?

PS: I made my compression algorithm based on the LZSS and Gamma-Code Elias paradigm even before I met ZX0 and I liked its results, but ZX0 literally ruined my plans XD I would be interested to know the answers to my questions, maybe it will help me improve my algorithm. Thx
User avatar
PeterJ
Site Admin
Posts: 6879
Joined: Thu Nov 09, 2017 7:19 pm
Location: Surrey, UK

Re: A new data compressor called ZX0

Post by PeterJ »

Welcome @zara6502,

I have used the mention system to alert @Einar Saukas to your message.

Translation services are very good these days. No need to apologise for using one.
XoRRoX
Manic Miner
Posts: 233
Joined: Wed Jul 11, 2018 6:34 am

Re: A new data compressor called ZX0

Post by XoRRoX »

@Einar Saukas

In a new 128k game project, I am already using your zx0 de-compression for graphics in 128k banks to much satisfaction. But now, as it will be a physical tape game, to shorten the loading time I'd also like to compress the main-part. It occupies memory from 24832-65535=40703 which compresses down to 21.755 bytes (!!! :D )
I have read the documentation and think I should in this case use backward (de)compression but am until now unsuccessful in doing so.

Could you please explain how I should set this up? If needed, to create space for the loader & decompressor, I could shave necessary memory off the beginning of the file and load that last, for example.

Thank you in advance.
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

zara6502 wrote: Wed Nov 08, 2023 2:50 pm I have read the description of the algorithm ZX0 on Github, but I cannot understand for you that "Copy from the last offset (repeat N bytes from the last offset)" - what is offset and last offset in this context, where do they come from? What is new offset and how is it formed? Why is the second block "0" immediately following Literal?

PS: I made my compression algorithm based on the LZSS and Gamma-Code Elias paradigm even before I met ZX0 and I liked its results, but ZX0 literally ruined my plans XD I would be interested to know the answers to my questions, maybe it will help me improve my algorithm. Thx
"Offset" and "literal" are concepts from standard LZSS. If you have already implemented a compression algorithm based on LZSS, you certainly used them yourself!

When decompressing, you are basically reading bytes from a compressed file ("source") while writing bytes to the decompression memory area ("destination"):
  • If source indicates that a compressed block is a literal, it simply means to copy next byte from source to destination.
  • Otherwise compressed block is a repetition. It simply means to repeat a few (previously already decompressed) bytes again in destination. Basically copy a few bytes ("length") from a few positions back in destination ("offset") to destination again.
For instance, imagine you have a sequence "ABCDBC". In ZXSS, it will be compressed as follows:
  • Literal "A"
  • Literal "B"
  • Literal "C"
  • Literal "D"
  • Repeat 2 bytes from 3 positions behind (length=2 and offset=3), thus producing "BC".
There's a more detailed explanation in Wikipedia. You can also try Youtube, there are probably online LZSS tutorials in your native language.
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

XoRRoX wrote: Wed Nov 08, 2023 8:28 pm It occupies memory from 24832-65535=40703 which compresses down to 21.755 bytes (!!! :D )
I have read the documentation and think I should in this case use backward (de)compression but am until now unsuccessful in doing so.
When you compress it backwards, take note of the "delta" that it will provide.

Suppose you got delta=4. It means you must load the compressed file to address 24832-4=24828 (or lower). You also need a ZX0 backwards decoder, for instance you can load dzx0_standard_back (69 bytes) at address 24828-69=24759.

If you load a compressed file with size 21755 bytes to address 24828, then the last byte of this compressed file will be at address 24828+21755-1=46582. That's your initial decompression address to decompress backwards. Now you only need to execute something like this:

Code: Select all

LD    HL, 46582  ; source address
LD    DE, 65535  ; target address
CALL  24759      ; decompress backwards
User avatar
zara6502
Drutt
Posts: 8
Joined: Wed Nov 08, 2023 2:32 pm

Re: A new data compressor called ZX0

Post by zara6502 »

Einar, thank you so much for answering in this topic. I have great hope to figure out this algorithm.
I can't know in advance what intonation the translator will show in the narration, so I will say right away that I have no negativity, narcissism or claims. I treat all members of retro communities very carefully and with respect. I hope for a productive dialogue.
Einar Saukas wrote: Wed Nov 08, 2023 11:00 pm "Offset" and "literal" are concepts from standard LZSS. If you have already implemented a compression algorithm based on LZSS, you certainly used them yourself!
I just can't match LZSS with your description. For me, they differ quite a lot.
Einar Saukas wrote: Wed Nov 08, 2023 11:00 pm When decompressing, you are basically reading bytes from a compressed file ("source") while writing bytes to the decompression memory area ("destination"):
That's right, that's exactly the approach I use.
Einar Saukas wrote: Wed Nov 08, 2023 11:00 pm For instance, imagine you have a sequence "ABCDBC".
The example shown is clear to me and I am guided by this principle in my reasoning.
Einar Saukas wrote: Wed Nov 08, 2023 11:00 pm There's a more detailed explanation in Wikipedia.
I'm just relying on this article and my experiments are going on with the verse from this article "I am Sam", I literally have already become that Sam myself :lol:
My algorithm now compresses this verse to 92 bytes, although in the article it is 107 bytes and I use a break-even point of 1 byte, that is, I compress character by character.
Basically, I set out to make an algorithm for short phrases that does not use a dictionary, when I found out about ZX0 I was pleasantly surprised that such an idea had already been implemented.
Using ZX0.EXE for Windows, I compressed the wikipedia verse to 91 bytes (and using ZX5 as much as 89 bytes).
When viewing the result in a text editor, I see chunks of phrases, I didn't expect them to be there, since the offset is bitwise and a match with ASCII is unlikely, but what I see shows where you used literals.
I also read your source code in C and found no mention of any special "tricky" blocks, but when I try to repeat your algorithm with my hands, it is absolutely inefficient and compresses very poorly (at the level of the classic LZSS from the article, I think I got 102 bytes).
In my reasoning, I see that you do not encode individual characters in any way and do not compensate for the impossibility of compressing literals. I'll show you everything in more detail below.
Einar Saukas wrote: Wed Nov 08, 2023 11:00 pm You can also try Youtube, there are probably online LZSS tutorials in your native language.
Oh, believe me, I did it a long time ago, I read articles both in Russian and translated into English, but they all somehow correspond to the classical algorithm and I can't compare the description of your algorithm on Github with the classical algorithm.

Let's raise your example "ABCDBC" and suggest using your blocks:

Code: Select all

Literal (copy the following N bytes from the compressed file)
0 bytes Elias(length)[1] byte[2] ...  byte[N]
at the output we will get:

Code: Select all

A (the first character without bit 0)
0 3 BCD -> 0 011 BCD
and here I don't understand why the

Code: Select all

Copy from last offset (repeat N bytes from last offset)
0 Alias(length)
block is needed, what is the offset value now?
In my implementation there is a block

Code: Select all

1 2 2 -> 1 010 010
and we get the sequence:

Code: Select all

A 0 011 BCD 1 010 010
that is, 11 bits of codes and 32 bits of literals, in the end it is 43 bits.
But here I compressed half with your implementation option, and half with my own, since I can't deal with two blocks after the literal one.

Code: Select all

Copy from new offset (repeat N bytes from new offset)
1  Elias(MSK(offset)+1)  LSB(offset)  Elias(length-1)
If new offset is mentioned here, then last offset was set somehow in advance, but by what? If the same block, then it must always precede the second block "0 Elias(length)" otherwise how will the last offset be obtained?

In my implementation for short data, I use only two blocks:

Code: Select all

0 + sum
1 + Elias(lan) + Elias(offset)
I conducted experiments and the blocks "0 + sym" showed either the same or better encoding than "0 Alias(length) byte[1] ... byte[N]".
Here is a simple table with 10 different blocks of literals up to 8 bytes long and two columns of the table where I encode in the first case with your block, and in the second with mine:

Code: Select all

Block		Einar					zara6502	difference (Einar-zara6502)
A		0 1 A -> 0 1 +8 (10 bits)		1*9 (9 bits)	+1
AA		0 2 A -> 0 010 +16 (20 bits)		2*9 (18 bits)	+2
AAA		0 3 A -> 0 011 +24 (28 bits)		3*9 (27 bits)	+1
AAAA		0 4 A -> 0 00100 +32 (38 bits)		4*9 (36 bits)	+2
AAAAA		0 5 A -> 0 00101 +40 (46 bits)		5*9 (45 bits)	+1
AAAAAA		0 6 A -> 0 00110 +48 (54 bits)		6*9 (54 bits)	0
AAAAAAA		0 7 A -> 0 00111 +56 (62 bits)		7*9 (63 bits)	-1
AAAAAAAA	0 8 A -> 0 0001000 +64 (72 bits)	8*9 (72 bits)	0
As can be seen from the table, this approach to encoding short phrases is ineffective. Indirectly, I think this also reduces decoding performance.

Let's take an example for the verse "I am Sam" from wikipedia, let's take only the first line

Code: Select all

I am Sam\n
You get 80 bits per line

Code: Select all

$06 + I_am_Sam\n
you don't encode it at all (I'm looking at the bytes in the output file after ZX0 compression) and it turns out to be a block of literals, although it's unclear why you just don't have "I" as the first character, this line starts with $06, which does not correspond to any of the blocks.
I have this string encoded like this:

Code: Select all

len	command        Text bytes	Length (bits)   Block code      bits (len bytes)        bits (offset bytes)
4	>		I am		36
1	%		_		5		1		1 (1)			3 (3)
1	>		S		9
2	:		am		7		1		3 (2)			3 (3)
1	>		\n		9
-------------------------------------------------------------------------------------------------------------------
9					66
The blocks marked with ">" are literal blocks, they are encoded as "0 + sim", that is, in our case:

Code: Select all

0 I 0 <space> 0 a 0 m
Formally, and I learned this from you, it's logical, you can omit 0 from the first character:

Code: Select all

I 0 <space> 0 a 0 m
All characters are packed in 35 bits (in the diagram above 36).
The second line is marked with the symbol "%" - this is the encoding of a single character, it is no different from the code ":" in the line and uses the encoding mask "1 + Elias(len) + Elias(offset)", but for me to test and it was convenient for statistics to see where the encoding of single characters takes place.
Since there is a space in the previous text, we encode it in 5 bits - 1 1 3, that is, 1 bit of the block, length 1 in the Elias code and offset 3 in the Elias code.
The third string is literal - 9 bits (0 S).
The fourth line "am" has already been previously and is encoded in 7 bits - 1 3 3 - > 1 010 011
And the line feed is also a 9-bit literal - 0\n.
In total, on the same line, I get 35+5+9+7+9=65 bit (I exited the first 0 block).

Considering that the general trend towards encoding persists for both you and me, I am trying to understand at what stage you were not only able to return 15 bits of the first line, but also eventually bypass 8 bits (approximately, since here we are talking about a boundary transition to 1 full byte). And for the ZX5, you compress another 2 bytes harder.
In total, I have 43 characters encoded with literals in this verse, which gives an overabundance when encoding in 43 bits (1 bit per character), but the gain from using the block

Code: Select all

"0 Elias(length) byte[1]  byte[2] ... byte[N]"
I won't have, since there are no long literal blocks sufficient to compensate for 1 bit per character.
There are 24 single encoded characters, with a total encoded length of 144 bits, here we win by encoding 24*8-144= 48 bits.
Which together with the previous literals gives a gain of 5 bits on two types of blocks.
In your case, literal blocks are not encoded at all in a smaller way (at least on short phrases) and as I understand it in ZX0, single characters are not encoded at all. I found confirmation of this in some of your statements referring to the block "Elias(length-1)", that is, you consider only blocks of 2 bytes in size, that is, the Elias code 1 encoded in 1 bit is considered a code for the value 2.

In my case, there are several literal phrases with a size of 2 bytes and higher

Code: Select all

4  > I am		36	+2
2  > Th			18	+2
2  > -I			18	+2
2  > !\n		18	+2
2  > do			18	+2
5  > like\n		45	+1
2  > gr			18	+2
4  > m?\n\n		36	+2
2  > th			18	+2
2  > m,			18	+2
If we turn to the comparison table of your literal blocks and mine, then no such block will have an advantage in coding.
In fact, using your approach to encoding literal blocks, we would get a code increase of 19 bits (~2.5 bytes).
But if you look at your output file received after ZX0 compression, you can notice literal blocks in such a number:

Code: Select all

I am Sam
That_
-Ά-€!
dono
like\nt
D
y΅u
 green eggs λnd hΏ?
em,
Moreover, they are strangely sometimes torn by 1 byte, and for "dono" there is no gap at all, and then how to restore from this "do not" I do not understand :shock: :lol:

When I thought over the algorithm myself, I started everything with encoding the symbol table.
The fact is that graphic modes with reprogramming of the symbol table are very semi-binary for ATARI, this on the one hand gives tangible savings for the size of the screen area (20*24) characters = 480 bytes for the five-color mode, on the other - for static screens, such as Popeye, Mario Bros., Pacman, etc. this frees up a lot RAM.
And in such studies, I noticed that the frequency of bytes in the graph is in small blocks of 8-16-24-32 bytes for (offset), and (length) usually does not exceed 1-4 bytes.
So the idea came up to encode values with Elias codes for small numbers. In such a situation, we can even compress an 8*8 sprite to a few bits.
For example, as in the Einar game "Dreamwalker: Alter Ego 2", where the playing field is divided into familiar ones. (On occasion I would like to shake your hand for this game, it is an impressively pleasant and balanced project)

It also took me a while to come to the conclusion that I made (offset) indicating not the beginning of the block, but the end (in my example above, you might have noticed that I use an offset of 3 bytes for "am", not 5). Since this reduces the value for encoding with the Elias gamma code by (length), which on short values gave an increase of 10% in compression and I was able to cross the psychological threshold of 100 bytes of encoded text just for the verse "I am Sam".
Also, at the initial stage, I used fixed byte constructions, changed them by 3-4-5 bits, added additional block codes and blocks 0, 10, 11 worked very well, I think I stopped at 94 bytes for "I am Sam", but it worked poorly on short data (overhead with a static bit size is obvious offset and len blocks).

Now I want to figure out how everything works with you. Either this will give me new information for applying some tricks (I read your correspondence in this topic with Introspec and I didn't understand about the Elias codes, but I got the impression that the Alice codes can somehow be simplified) or I will understand that my coder using your techniques will turn into ZX0, then it's easier for me will stop.
I also considered the issue of creating only the optimal encoder, but due to the small size of the data, the search for the optimal code will not take much time in any case, that is, it would be possible to compress my code on the Z80 itself.

I hope my writing has not completely confused you.

PS: I will attach this text under the spoiler, it may be easier for Introspec to read it in Russian, rather than an automatic translation in English (If, of course, he deems it necessary to read my opus).
Spoiler

Code: Select all

Возьмём ваш пример "ABCDBC" и попробуем использовать ваши блоки:
Literal (copy next N bytes from compressed file), 0  Elias(length)  byte[1]  byte[2]  ...  byte[N]
на выходе мы получим:
A (первый символ без бита 0)
0 3 BCD -> 0 011 BCD

и вот сдесь я не понимаю зачем нужен блок Copy from last offset (repeat N bytes from last offset), 0  Elias(length)
какое сейчас значение offset?
В моем варианте реализации идёт блок
1 2 2 -> 1 010 010

и мы получаем последовательность: "A 0 011 BCD 1 010 010", то есть 11 бит кодов и 32 бита литералов, в итоге это 43 бита.
Но тут я сжимал половину вашим вариантом реализации, а половину своим, так как я не могу разобраться с двумя блоками после литерального.
Copy from new offset (repeat N bytes from new offset), 1  Elias(MSB(offset)+1)  LSB(offset)  Elias(length-1)
Если тут упоминается new offset, то значит last offset был установлен как-то заранее, но чем? Если этим же блоком, то он должен всегда предшествовать второму блоку "0 Elias(length)" иначе как будет получен last offset?

В своей реализации для коротких данных я использую только два блока:
0 + sym
1 + Elias(len) + Elias(offset)
Я проводил эксперименты и блоки "0 + sym" показывали или такое же или лучшее кодирование чем "0 Elias(length) byte[1] ... byte[N]".
Вот простая таблица на 10 разных блоков литералов длиной до 8 байт и два столбца таблицы где я кодирую в первом случае вашим блоком, а во стором моим:
Block		Einar					zara6502	разница Einar-zara6502
A		0 1 A -> 0 1 +8 (10 bits)		1*9 (9 bits)	+1
AA		0 2 A -> 0 010 +16 (20 bits)		2*9 (18 bits)	+2
AAA		0 3 A -> 0 011 +24 (28 bits)		3*9 (27 bits)	+1
AAAA		0 4 A -> 0 00100 +32 (38 bits)		4*9 (36 bits)	+2
AAAAA		0 5 A -> 0 00101 +40 (46 bits)		5*9 (45 bits)	+1
AAAAAA		0 6 A -> 0 00110 +48 (54 bits)		6*9 (54 bits)	0
AAAAAAA		0 7 A -> 0 00111 +56 (62 bits)		7*9 (63 bits)	-1
AAAAAAAA	0 8 A -> 0 0001000 +64 (72 bits)	8*9 (72 bits)	0

Как видно из таблицы, такой подход к кодированию коротких фраз неэффективен. Косвенно это, как я думаю, снижает и производительность при декодировании.

Рассмотрим пример для стиха "I am Sam" из википедии возьмём только первую строку
I am Sam\n
У вас получается 80 битов на эту строку
$06 + I_am_Sam\n
вы её совсем не кодируете (я смотрю на байты в выходном файле после сжатия ZX0) и она получается блоком литералов, хотя непонятно почему у вас первым символом просто не идёт "I", у вас эта строка начинается с $06, что не соотвествует ни одному из блоков.
У меня эта строка кодируется вот так:
len	действие        основноё текст	длина в битах   код блока       bits (len bytes)        bits (offset bytes)
4	>		I am		36
1	%		_		5		1		1 (1)			3 (3)
1	>		S		9
2	:		am		7		1		3 (2)			3 (3)
1	>		\n		9
-------------------------------------------------------------------------------------------------------------------
9					66

Блоки отмеченные ">" это литеральные блоки, они кодируются как "0 + sym", то есть в нашём случае:
0 I 0 <space> 0 a 0 m
Формально, и я это узнал от вас, это логично, у первого символа можно 0 опустить:
I 0 <space> 0 a 0 m
Все символы упакуются в 35 бит (на схеме выше 36).
Вторая строка отмечена символом "%" - это кодирование одиночного символа, оно ничем не отличается от кода ":" в строке и использует маску кодирования "1 + Elias(len) + Elias(offset)", но мне для тестирования и статистики было удобно видеть где происходит кодирование одиночных символов.
Так как пробел есть в предыдущем тексте, то мы кодируем его в 5 бит - 1 1 3, то есть 1 бит блока, длина 1 в коде Элиаса и смещение 3 в коде Элиаса.
Третья строка литерал - 9 бит (0 S).
Четвёртая строка "am" уже была ранее и кодируется в 7 бит - 1 3 3 -> 1 010 011
И перевод строки тоже литерал 9 бит - 0 \n.
Итого, на ту же строку, у меня получается 35+5+9+7+9=65 бит (я вычел первый 0 блока).
Учитывая что общая тенденция к кодированию и у меня и у вас сохраняется, то я пытаюсь понять на каком этапе вы не только смогли вернуть 15 бит первой строки, но и обойти в конечном итоге на 8 бит (примерно, так как тут речь о граничном переходе на 1 полный байт). А для ZX5 вы сжимаете еще на 2 байта сильнее.
Всего литералами в этом стихе у меня закодировано 43 символа, что даёт переизбыток при кодировании в 43 бита (1 бит на символ), но выигрыша от использования блока "0  Elias(length)  byte[1]  byte[2]  ...  byte[N]" у меня не будет, так как нет длинных литеральных блоков, достаточных, чтобы компенсировать 1 бит на символ.
Одиночных закодированных символов - 24, при общей закодированной длине 144 бита, здесь мы выигрываем за счет кодирования 24*8-144=48 бит.
Что вместе с предыдущими литералами даёт выигрыш в 5 бит на двух типах блоков.
У вас же литеральные блоки совсем не кодируются в меньшую сторону (во всяком случае на коротких фразах) и как я понял в ZX0 одиночные символы совсем не кодируются. Этому я нашёл подтверждение в некоторых ваших высказываниях с отсылкой на блок "Elias(length-1)", то есть вы рассматриваете только блоки размером от 2 байт, то есть код Элиаса 1, кодируемый в 1 бит у вас считается кодом для значения 2.

В моем случае литеральных фраз размером от 2 байт и выше несколько
4  > I am					36	+2
2  > Th						18	+2
2  > -I						18	+2
2  > !\n					18	+2
2  > do						18	+2
5  > like\n					45	+1
2  > gr						18	+2
4  > m?\n\n					36	+2
2  > th						18	+2
2  > m,						18	+2

Если обратиться к таблице сравнения ваших литеральных блоков и моих, то ни на одном таком блоке не будет получено преимущество при кодировании.
Фактически используя ваш подход к кодированию литеральных блоков мы получили бы увеличение кода на 19 бит (~2.5 байта).
Но если посмотреть на ваш выходной файл полученный после сжатия ZX0, то можно заметить литеральные блоки в таком количестве:
I am Sam
That_
-Ά-€!
dono
like\nt
D
y΅u
 green eggs λnd hΏ?
em,
Причем они странным образом иногда разрываются на 1 байт, а для "dono" вообще разрыва нет и как потом восстановить из этого "do not" я не понимаю XD

Я, когда продумывал сам алгоритм, начинал всё с кодирования таблицы символов.
Дело в том что для ATARI очень полудярны графические режимы с перепрограммированием таблицы символов, это с одной стороны даёт ощутимую экономию для размера экранной области (20*24) символа = 480 байт для пятицветного режима, с другой - для статичных экранов, как например Popeye, Mario Bros., Pacman и т.п. это освобождает массу оперативной памяти.
И в таких исследованиях я заметил что часто повторяемость байтов в графике идёт небольшими блоками 8-16-24-32 байта для (offset), а (length) обычно не превышает 1-4 байт.
Так и появилась идея кодировать значения кодами Элиаса для малых чисел. В такой ситуации мы даже можем сжимать спрайт 8*8 до нескольких бит.
Например как в игре Einar "Dreamwalker: Alter Ego 2", где игровое поле разбито на знакоместа.

Так же не сразу я пришёл к тому, что сделал (offset) указывающим не начало блока, а на конец (в моём примере выше вы могли заметить, что я использую смещение 3 байта для "am", а не 5). Так как это уменьшает значение для кодирование гамма-кодом Элиаса на (length), что на коротких значениях дало припрост в 10% по сжатию и я как раз для стиха "I am Sam" смог перейти психологический рубеж в 100 байт закодированного текста.
Так же на начальном этапе я использовал фиксированные байтовые конструкции, менял их по 3-4-5 бит, добавлял дополнительные коды блоков и весьма неплохо работали блоки 0, 10, 11, я кажется остановился на 94 байтах для "I am Sam", но на коротких данных это работало плохо (очевидны накладные расходы при статичном размере битовых блоков offset и len).

Сейчас же мне хочется разобраться с тем как всё устроено у вас. Либо это даст мне новую информацию для применения каких-то хитростей (я читал вашу преписку в этой теме с Introspec и мне было непонятно про коды Элиаса, но создалось впечатление что коды Элиса можно как-то упростить) либо я пойму, что мой кодер с применением ваших техник превратится в ZX0, тогда мне проще будет остановиться.
Я так же рассматривал вопрос по созданию только оптимального кодера, но из-за малых размеров данных поиск оптимального кода не будет занимать много времени в любом случае, то есть сжимать моим кодом было бы возможно и на самом Z80.

Надеюсь моя писанина вас не запутала окончательно. Я прикреплю этот текст на расском под спойлером, возможно Introspec будет проще читать это на русском, а не на автоматическом переводе на английском.
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

zara6502 wrote: Thu Nov 09, 2023 8:04 am Einar, thank you so much for answering in this topic. I have great hope to figure out this algorithm.
You are welcome!

zara6502 wrote: Thu Nov 09, 2023 8:04 am I just can't match LZSS with your description. For me, they differ quite a lot.
The concepts are exactly the same. But the implementation details are very different.

zara6502 wrote: Thu Nov 09, 2023 8:04 am Let's raise your example "ABCDBC" and suggest using your blocks:

Code: Select all

Literal (copy the following N bytes from the compressed file)
0 bytes Elias(length)[1] byte[2] ...  byte[N]
at the output we will get:

Code: Select all

A (the first character without bit 0)
0 3 BCD -> 0 011 BCD
No! This output will be:

Code: Select all

0 Elias(4)=00001 ABCD -> 0 00001 "ABCD"
Since it's the first literal, the initial 0 is omitted:

Code: Select all

Elias(4)=00001 ABCD -> 00001 "ABCD"
zara6502 wrote: Thu Nov 09, 2023 8:04 am When viewing the result in a text editor, I see chunks of phrases, I didn't expect them to be there, since the offset is bitwise and a match with ASCII is unlikely, but what I see shows where you used literals.
ASCII codes are stored as full bytes, not broken into bits. The above block will be saved as follows:

Code: Select all

Elias(4)=00001 ABCD -> 00001??? "ABCD"
The 3 bits temporarily marked as ? will be later filled with bits from the following block, until it completes a full byte.

zara6502 wrote: Thu Nov 09, 2023 8:04 am how will the last offset be obtained?
Imagine you have a sequence "ABCDBCEBCEB". In ZXSS, it will be compressed as follows:
  • Literals "ABCD" -> full sequence "ABCD"
  • Repeat 2 bytes from 3 positions behind (length=2 and offset=3), thus producing "BC" -> full sequence "ABCDBC"
  • Literal "E" -> full sequence "ABCDBCE"
  • Repeat 4 bytes from 3 positions behind (length=4 and offset=3), thus producing "BCEB" -> full sequence "ABCDBCEBCEB"
The last block above reuses the same offset=3 that was previously used last time. Therefore there's no need to store this offset again, you can use "copy from last offset" instead of "copy from new offset".

zara6502 wrote: Thu Nov 09, 2023 8:04 am It also took me a while to come to the conclusion that I made (offset) indicating not the beginning of the block
In my example above, notice that last block uses "length > offset". In your case, you won't be able to do it.

zara6502 wrote: Thu Nov 09, 2023 8:04 am I can't deal with two blocks after the literal one.
This is how ZX0 uses a single bit to distinguish between 3 different block types:
  • Immediately after a literal block, you cannot have another literal block (otherwise you would have merged both into a single block). Therefore after a literal block, bit 0 means "copy from last offset" and bit 1 means "copy from new offset".
  • Immediately after any copy block (either from last or new offset), you cannot have a "copy from last offset" block (otherwise you would have merged both into a single block). Therefore after after any copy block, bit 0 means "literal" and bit 1 means "copy from new offset".
zara6502 wrote: Thu Nov 09, 2023 8:04 am I conducted experiments and the blocks "0 + sym" showed either the same or better encoding than "0 Alias(length) byte[1] ... byte[N]".
Sure, your idea would reduce size of very small literal sequences. However it means the possibility of consecutive literal blocks, therefore a single bit wouldn't be enough to distinguish between 3 different block types. You would need to spend extra bits somewhere else. It's not worth it.
User avatar
zara6502
Drutt
Posts: 8
Joined: Wed Nov 08, 2023 2:32 pm

Re: A new data compressor called ZX0

Post by zara6502 »

Einar Saukas wrote: Thu Nov 09, 2023 1:09 pm You are welcome!
Thank you for your time and clarification, I really appreciate it. I read it briefly after work and it seems to me that I will be able to understand the topic.
Einar Saukas wrote: Thu Nov 09, 2023 1:09 pm Elias(4)=00001
and why do you have the code 00001 and not 00100?
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

zara6502 wrote: Thu Nov 09, 2023 2:18 pm and why do you have the code 00001 and not 00100?
Because ZX0 uses interlaced Elias Gamma Code instead of classic Elias Gamma Code, thus bit order is different. See here:

https://github.com/einar-saukas/Zeta-Xi-Code
User avatar
zara6502
Drutt
Posts: 8
Joined: Wed Nov 08, 2023 2:32 pm

Re: A new data compressor called ZX0

Post by zara6502 »

Einar Saukas wrote: Thu Nov 09, 2023 2:31 pm Because ZX0 uses interlaced Elias Gamma Code instead of classic Elias Gamma Code, thus bit order is different. See here:
https://github.com/einar-saukas/Zeta-Xi-Code
Thanks.
Uh, and now it's difficult. I didn't understand anything after reading it, but I will accept it as a constant.
Does it affect simplicity/speed? or is this representation capable of saving bits on small values?
Judging by the examples, this only affects large numbers?
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

Only simplicity/speed (exactly the same bits in a different order).
User avatar
zara6502
Drutt
Posts: 8
Joined: Wed Nov 08, 2023 2:32 pm

Re: A new data compressor called ZX0

Post by zara6502 »

Einar Saukas wrote: Thu Nov 09, 2023 1:09 pm In my example above, notice that last block uses "length > offset". In your case, you won't be able to do it.
But you don't have a "B" in the window yet, so you're just doing a ring reading repeatedly from the first offset character? It's more like a "nice convention" that might come in handy - why not. But this principle can be easily used in my version, because it's just a convention. Well, or I didn't understand at all how you know that you will have a "B" there.
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

zara6502 wrote: Thu Nov 09, 2023 3:24 pm But you don't have a "B" in the window yet, so you're just doing a ring reading repeatedly from the first offset character? It's more like a "nice convention" that might come in handy - why not. But this principle can be easily used in my version, because it's just a convention. Well, or I didn't understand at all how you know that you will have a "B" there.
It doesn't matter. You walk back 3 positions, then start copying 4 bytes from there. By the time you reach the 4th byte, you will already have a "B" there.
User avatar
zara6502
Drutt
Posts: 8
Joined: Wed Nov 08, 2023 2:32 pm

Re: A new data compressor called ZX0

Post by zara6502 »

Einar Saukas wrote: Thu Nov 09, 2023 3:52 pm It doesn't matter. You walk back 3 positions, then start copying 4 bytes from there. By the time you reach the 4th byte, you will already have a "B" there.
Bravo. Such simple things and how elegant everything is.

In general, I tried to compress the verse "I am Sam" and I get a code much longer than my own, which can't be true, I think in manual mode I probably won't be able to find the optimal solution, so I won't be able to correctly assess the coding quality.

Thanks again for the clarifications, in any case they are very useful for me and there is something to think about.
XoRRoX
Manic Miner
Posts: 233
Joined: Wed Jul 11, 2018 6:34 am

Re: A new data compressor called ZX0

Post by XoRRoX »

@Einar Saukas

By continuing with my issue, I hope I'm not uncomfortably interrupting the flow of interactions between you two.

Thank you for the clarifications, Einar.

I have followed them as follows:

Given:
The binary block is exported after compilation of code with SJAMSPlus with the instruction:

Code: Select all

savebin "_Main.bin",game_start, game_end - game_start
resulting in a binary file where game_start=24832 and game_end=65513 of (65513-24832) 40681 bytes.

I then compressed that file with:

Code: Select all

zx0.exe -b _Main.bin 
which compressed it down from 40681 to 21760 with a delta of 3. I renamed the file to _MainB_Delta3.bin.zx0

Then I use the following to load and decompress:

Code: Select all

	device zxspectrum48

	define filename "decompressTests.tap"


	org 16384

begin:
	di


	ld hl,46587		; 24828 + (filesize) 21760 -1
	ld de,65513		; highest destination address
	call dzx0_decompress	; backwards decompressor

; make border black to show decompression is finished
	xor a
	out (254),a


	jr $			; "stay forever..."

dzx0_decompress:
	include "dzx0_standard_back.asm"


	org 24829		; (lowest destination address) 24832 - (delta) 3
compressedData:
	incbin "_MainB_Delta3.bin.zx0"



end:
	savetap filename,begin
and run it in SpecEmu (48k machine).

When using the debugger, the entry point of the code doesn't look how it should.
So, after the above code ran, I exported a bin file from 24832 with a length of (65513-24832) 40681 with filename _MainB_Delta3.bin.zx0-uncompressed.bin
A filecompare (UltraCompare & Windows fc.exe /b) between _Main.bin and _MainB_Delta3.bin.zx0-uncompressed.bin show differences.

Could you please help me to show me the error of my ways? :)
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

XoRRoX wrote: Fri Nov 10, 2023 10:38 am

Code: Select all

	ld hl,46587		; 24828 + (filesize) 21760 -1
...
	org 24829		; (lowest destination address) 24832 - (delta) 3
It's either 24828 or 24829. One of them is wrong.

XoRRoX wrote: Fri Nov 10, 2023 10:38 am

Code: Select all

	ld de,65513		; highest destination address
If uncompressed data starts at 24832 and size is 40681, then last byte is located at address is 24832-40681-1=65512, not 65513.

XoRRoX wrote: Fri Nov 10, 2023 10:38 am When using the debugger, the entry point of the code doesn't look how it should.
You placed your code at address 16384 inside the screen. Are you sure that's what you want?

Try starting your code at addres 24700 instead. Don't forget to CLEAR 24699 first.


EDIT: Since your data doesn't occupy top of memory until address 65535 anymore, now you have enough space to decompress forward instead, if you want. The calculations would be simpler this way.
XoRRoX
Manic Miner
Posts: 233
Joined: Wed Jul 11, 2018 6:34 am

Re: A new data compressor called ZX0

Post by XoRRoX »

I see... With both hl & de changed it now works - thank you :)

As I had crashes before, I placed the code to the screen just to make sure decompressed data was not somehow overwriting the code.
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

XoRRoX wrote: Fri Nov 10, 2023 1:34 pm I see... With both hl & de changed it now works - thank you :)
Great!!!
C.Born
Manic Miner
Posts: 231
Joined: Sat Dec 09, 2017 4:09 pm

Re: A new data compressor called ZX0

Post by C.Born »

Hi
you use the watcom compiler if i see it correct,
https://github.com/einar-saukas/ZX0/blo ... c/Makefile

can i change it to gcc ? i found the new watcom1.9 but it has few linux remarks and probably i have to set/write severall things

Code: Select all

Win32 BAT file:
---------------------------------------------------------------------------
@ECHO OFF
SET WATCOM=C:\WATCOM
SET PATH=%WATCOM%\BINNT;%WATCOM%\BINW;%PATH%
SET EDPATH=%WATCOM%\EDDAT
SET INCLUDE=%WATCOM%\H;%WATCOM%\H\NT
REM SET LIB=
REM SET WWINHELP=D:\BINW
---------------------------------------------------------------------------
if watcom is needed do you have a link to a modern linux version?
User avatar
Einar Saukas
Bugaboo
Posts: 3145
Joined: Wed Nov 15, 2017 2:48 pm

Re: A new data compressor called ZX0

Post by Einar Saukas »

The provided Makefile generates the Windows executable that's available for download.

However the source code is just standard C. Thus you can ignore this Makefile and simply recompile it with any compiler on any platform you want (including gcc on Linux).
Post Reply