Issues using zx7

The place for codemasters or beginners to talk about programming any language for the Spectrum.
zup
Manic Miner
Posts: 209
Joined: Wed Jul 08, 2020 8:42 am

Issues using zx7

Post by zup »

I've been programming some tools to convert sna snapshots to "standard" tap files (where "standard" means they will work in any Spectrum emulator/device that supports tap files), trying to mimic the behaviour of some transfer interfaces.

After programming sna2transtape and the (not yet released) sna2phoenix, I was thinking about merging sna2phoenix with zx7 to make compressed games, but I've run into some issues.

1.- My "tape builder" is a C program compiled with mingw-64 (a variant of GCC) in windows, but it runs sloooow. I've found that it takes more than 50 seconds in compressing 6 128k snapshots (that is 48 chunks of 16K). I tested it on Debian 10 (using GCC to make native binaries) and it takes about the same time. Looking into the profiler (the only thing I can do, how it works zx7 seems magic to me), it shows that it spends most time on this functions:
- elias_gamma_bits (25 secs, about 1700 million calls).
- optimize (another 25 secs, 48 calls).
- count_bits (about 3 secs itself, but called about 1700 million times).
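
For reference, the only one of these functions simple enough for me to follow is elias_gamma_bits; if I read the source right, it just counts the bits of the Elias gamma code for a value, something like this:

Code: Select all

/* my reading of elias_gamma_bits from the ZX7 sources: returns the bit
   length of the Elias gamma code for "value" (1 stop bit, plus 2 bits
   per doubling) */
int elias_gamma_bits(int value) {
    int bits = 1;
    while (value > 1) {
        bits += 2;
        value >>= 1;
    }
    return bits;
}

So the function itself is trivial; it looks like the problem is the 1700 million calls, not the work done per call.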

Also, I've found that programs with a lot of repeated data (e.g. entire blocks filled with 0xE5) take more time to compress than programs with more random data.

Is this normal behaviour? How can I make it faster? (I looked into the original zx7.exe and it seems it was compiled using Watcom C++... I've tried compiling my code with OpenWatcom and it's still sloooow.)

2.- Somewhere I've read that it needs a buffer of (at least) 1.5 times the size of the block to compress. Is this correct, or does it need to be bigger/smaller?

3.- How much stack space will it take when uncompressing? Will it fit in a stack of about 16 bytes?

BTW, I made my own zx7.h header file, and these are the changes I've made to suit my program:
- Both compress.c and optimize.c have been folded into zx7.h (to provide everything in a single file).
- Every instance of unsigned char has been changed to uint8_t.
- compress() returns void and instead takes another parameter (a pointer to the output data), so the caller has to provide a valid buffer. The keep parameter and related routines have been deleted (see the sketch below).
- Because my program may have to process more than one file at a time, optimize() now frees the memory allocated for min, max, matches and match_slots.
- Also, "optimal" is freed before compress() finishes.

Some things I changed/tested trying to speed up the routine:
- There seems to be no difference between using unsigned char and uint8_t.
- I tried to avoid malloc/calloc by creating buffers with enough spare space (and using memset to zero them between uses), but it takes about the same time.
- I've tried different compilers (GCC on Linux, TDM-GCC for 32-bit builds, mingw-64 for 64-bit builds, OpenWatcom C++ 1.9, Bloodshed Dev-C++) but it's still slow (execution times vary by about 2 or 3 seconds between them). Borland C++ 5 (the free one) failed to compile (still working on it), and I have yet to try the free Visual C++.

I'm wondering why it is so slow... zx7.exe runs much faster. Also, keep in mind that (besides the sluggishness) everything works as intended: my tap files are created correctly, and on the emulator the data is expanded fine and runs without problems.

Any ideas?
utz
Microbot
Posts: 116
Joined: Wed Nov 15, 2017 9:04 am

Re: Issues using zx7

Post by utz »

Difficult to say without seeing the code. Perhaps you could post it somewhere so people on here can take a look?
Einar Saukas
Bugaboo
Posts: 3093
Joined: Wed Nov 15, 2017 2:48 pm

Re: Issues using zx7

Post by Einar Saukas »

Your project sounds interesting :)

A few ideas:
  • Profile how much time your program is spending allocating and freeing memory.
  • There's no need to reallocate buffers "min", "max", and "matches". These buffers have constant size, so you can simply reuse them.
  • Buffers "match_slots" and "optimal" depend on the input file size, but you can reuse them too. If the next file fits into the previously allocated buffers, you can simply reuse them without reallocating anything (see the sketch after this list).
  • Required memory is directly proportional to file input size. According to ZX7 documentation: "compressing n bytes of data requires approximately 17n bytes of free memory (plus a small constant overhead)". I could reduce it further, but for a cross-platform utility focused on small files, it makes no difference.
  • The ZX7 source is just standard C, there are no special requirements. Any compiler should work.
  • Open Watcom is probably faster than MinGW, give it a try. I have compiled "zx7.exe" using it.
  • It takes very little stack space when decompressing. I'm sure 16 bytes is enough.
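
For the buffer reuse, something along these lines would do (just a sketch, not the actual ZX7 code): a grow-only allocation, so you only pay for realloc when a bigger input shows up.

Code: Select all

#include <stdlib.h>

/* Grow-only scratch buffer: reallocate only when a larger input arrives,
   otherwise keep reusing the previous allocation. */
static void *scratch = NULL;
static size_t scratch_cap = 0;

void *get_scratch(size_t needed) {
    if (needed > scratch_cap) {
        void *p = realloc(scratch, needed);
        if (p == NULL)
            return NULL;              /* caller handles allocation failure */
        scratch = p;
        scratch_cap = needed;
    }
    return scratch;
}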
zup
Manic Miner
Posts: 209
Joined: Wed Jul 08, 2020 8:42 am

Re: Issues using zx7

Post by zup »

Ok, let's go to simpler things...

My program is big, so I made a fast (and simpler) scr to zx7 tap converter. The program reads a screen.scr file, compresses it 1000 times (so the compression can be timed) and then writes it to a tap file.
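
The timing is just a loop around the compression, roughly this sketch (one_pass stands for whatever compresses the screen once):

Code: Select all

#include <time.h>

/* benchmark sketch: run one compression pass "n" times so the total is
   long enough to measure with clock() */
double time_runs(void (*one_pass)(void), int n) {
    clock_t t0 = clock();
    for (int i = 0; i < n; i++)
        one_pass();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}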

I've uploaded this test to mediafire, including two scr test files and the executables (compiled with mingw-64).

zx7scr.c contains the main program, zx7mod.h contains the zx7 functions (modified to avoid mallocs/callocs) and zx7loadscr.h contains the first two blocks of the tap file (BASIC and zx7 decoder).

These test runs went faster; when I was compressing 128k snapshots it took longer (I guess because this data is smaller). As I said before, it seems that data with many repetitions (espada.scr) takes more time to compress than data with fewer repetitions (stunt.scr). The functions that appear to take the most time are optimize, elias_gamma_bits and count_bits.

With espada.scr, it was compressed to 1807 bytes and took about 77 seconds to execute.

Code: Select all

			%time	sec.	calls
optimize		75.53	31.11	1000
elias_gamma_bits	 1.97	 0.81	530416000
count_bits		 1.21	 0.50	530416000
With stunt.scr, it was compressed to 4428 bytes and took about 8 seconds.

Code: Select all

			%time	sec.	calls
optimize		89.75	 7.09	1000
count_bits		 0.76	 0.06	30588000
elias_gamma_bits	 0.13	 0.01	30588000
Other functions' times are negligible.
Einar Saukas
Bugaboo
Posts: 3093
Joined: Wed Nov 15, 2017 2:48 pm

Re: Issues using zx7

Post by Einar Saukas »

Did it take 8 seconds each execution on average, or 8 seconds total time to run 1000 times?
ketmar
Manic Miner
Posts: 696
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Issues using zx7

Post by ketmar »

it is quite fast on the example screens, but... create a file with 32KB of zeroes, for example, and it will slow down to a crawl, taking 5-8 seconds to process it.
Einar Saukas
Bugaboo
Posts: 3093
Joined: Wed Nov 15, 2017 2:48 pm

Re: Issues using zx7

Post by Einar Saukas »

ketmar wrote: Wed Jul 08, 2020 10:54 pm
it is quite fast on the example screens, but... create a file with 32KB of zeroes, for example, and it will slow down to a crawl, taking 5-8 seconds to process it.
Unfortunately, it makes sense. Lempel-Ziv is based on compressing repetitions. When there are repetitions everywhere, there are many more combinations to analyze. This is why the ZX7 documentation mentions expected time O(n) and worst time O(n*w).
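
Here is a toy illustration (nothing to do with the actual ZX7 code) of the effect: count how many earlier positions match each byte, which is roughly the number of candidates an exhaustive LZ parser has to consider.

Code: Select all

#include <stdio.h>
#include <string.h>

/* count pairs (j, i) with data[j] == data[i], j < i: each one is a match
   candidate the parser could try to extend */
static long count_candidates(const unsigned char *data, int n) {
    long pairs = 0;
    for (int i = 1; i < n; i++)
        for (int j = 0; j < i; j++)
            if (data[j] == data[i])
                pairs++;
    return pairs;
}

int main(void) {
    enum { N = 2048 };
    unsigned char zeros[N], mixed[N];
    memset(zeros, 0, sizeof zeros);
    for (int i = 0; i < N; i++)
        mixed[i] = (unsigned char)(i * 37 + (i >> 3));  /* arbitrary varied data */
    printf("all zeros : %ld candidates\n", count_candidates(zeros, N));
    printf("mixed data: %ld candidates\n", count_candidates(mixed, N));
    return 0;
}

With all bytes equal, every earlier position is a candidate for every byte; with varied data there are orders of magnitude fewer.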

I will take another look to see if I can do better, but it will take a while. I spent a very long time developing ZX7, any further improvements will certainly take more than a few days.
ketmar
Manic Miner
Posts: 696
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Issues using zx7

Post by ketmar »

i didn't look at the algo, but maybe you can simply put in some special-case hacks for repetitions of the same byte somehow?
Einar Saukas
Bugaboo
Posts: 3093
Joined: Wed Nov 15, 2017 2:48 pm

Re: Issues using zx7

Post by Einar Saukas »

ketmar wrote: Wed Jul 08, 2020 11:16 pm
i didn't look at the algo, but maybe you can simply put in some special-case hacks for repetitions of the same byte somehow?
That's what I thought. The problem is, handling single-byte repetitions at the beginning of a file should be trivial, but in the middle of a file it will require adjusting all the optimization data accordingly, otherwise it will screw up the rest of the compression. I will need time to design this.

Another alternative is to provide a "semi-optimal" command-line option that compresses single-byte repetitions very quickly, at the risk of increasing the compressed size by a few bytes. But I don't like this alternative because it places the burden of choice on the user.
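
The shortcut itself would be simple, since a run of n identical bytes can always be coded as one literal plus an offset-1 match of length n-1; detecting the run is the easy part (sketch below), adjusting the optimization data around it is not.

Code: Select all

#include <stddef.h>

/* sketch of the run detection: a run of n identical bytes starting at
   "pos" could be coded as one literal plus an offset-1 match of n-1
   bytes, skipping the expensive parse inside the run */
size_t run_length(const unsigned char *data, size_t pos, size_t len) {
    size_t n = 1;
    while (pos + n < len && data[pos + n] == data[pos])
        n++;
    return n;
}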
zup
Manic Miner
Posts: 209
Joined: Wed Jul 08, 2020 8:42 am

Re: Issues using zx7

Post by zup »

Time for more results...

Because the screens compressed way faster than expected, I've changed the test program to convert entire 48k snapshots (note that the program is stripped of important functions and will only work with the example snapshots). This time I've compressed the screen once and the main data block (41k) 100 times to get accurate results.

I've uploaded the file to mediafire, and I've added two snapshot examples: Teenage Mutant Hero Turtles (mixed data) and 3D Deathchase (16k, so there is a big empty block).

TMHT produced a 27772-byte tape, and took 4 seconds to execute.

Code: Select all

			%time	sec.	calls
optimize		52.76	 2.10	101
elias_gamma_bits	 4.52	 0.18	90051628
count_bits		 1.76	 0.07	90051628
3D Deathchase produced a 13154-byte tape, and took about 570 seconds.

Code: Select all

			%time	sec.	calls
optimize		18.08  103.12	101
elias_gamma_bits	 9.81	55.95	162844801
count_bits		 4.68	26.68	162844801
In 3D Deathchase, the profiler itself took most of the execution time (probably fuelled by those 162 million calls); that's why the total time was about 570 seconds while those three functions took only about 186 between them. Without profiling (using -O2 -s), TMHT took less than 3 seconds and 3D Deathchase took 180 seconds.

So, after reading your messages I guess I've hit two worst-case examples (and my program works as intended). I guess I'll work a little more before releasing...

Thank you for your help.
Einar Saukas
Bugaboo
Posts: 3093
Joined: Wed Nov 15, 2017 2:48 pm

Re: Issues using zx7

Post by Einar Saukas »

zup wrote: Thu Jul 09, 2020 7:45 am
3D Deathchase took 180 seconds
Again, did it take 180 seconds for each compression on average, or 180 seconds to compress 100 times?
zup
Manic Miner
Posts: 209
Joined: Wed Jul 08, 2020 8:42 am

Re: Issues using zx7

Post by zup »

180 seconds for the whole 100 compressions (plus the screen compression). The times were taken on a Ryzen 3 1200 with 8GB RAM.

So, with TMHT (mixed data) it took about 0.03 seconds per 41k block, and with 3D Deathchase (big empty chunk at the end) about 1.8 seconds per block.
Einar Saukas
Bugaboo
Posts: 3093
Joined: Wed Nov 15, 2017 2:48 pm

Re: Issues using zx7

Post by Einar Saukas »

zup wrote: Thu Jul 09, 2020 6:24 pm
180 seconds for the whole 100 compressions (plus the screen compression). The times were taken on a Ryzen 3 1200 with 8GB RAM.

So, with TMHT (mixed data) it took about 0.03 seconds per 41k block, and with 3D Deathchase (big empty chunk at the end) about 1.8 seconds per block.
OK, these results seem correct.

Give me a few days, I will try to improve it for Deathchase.
zup
Manic Miner
Posts: 209
Joined: Wed Jul 08, 2020 8:42 am

Re: Issues using zx7

Post by zup »

Thanks, but I guess my test files are a set of worst-case examples. I'm not sure it's worth making more changes to zx7.

To put this in context: I was making a tool to convert sna snapshots into tap files (and mimic transtape behaviour), and saving the entire 128k memory caused load times of 12 minutes (I didn't actually play it back; I converted my taps to wav and checked the length in Audacious). So the next step was trying to reduce the bytes stored, and I included a routine that looked at the beginning and end of each page so it could save only the data used (not really: only the data between the first non-zero byte and the last non-zero byte on each page).

That reduced load times, but I was tinkering with snapshots taken on 48k, +2 and +3 computers. While load times for +2 snapshots were reduced, +3 snapshots remained at 12 minutes (because the +2A/+3 fills RAM pages 1, 3, 4 and 6 with 0xE5). I made a simple RLE compression routine that solved the problem (or at least brought +3 snapshot loading times closer to +2 ones).
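
(The RLE scheme was the classic one, each run becoming a (count, value) pair; roughly this sketch, not the exact code I shipped:)

Code: Select all

#include <stddef.h>

/* classic byte-run RLE sketch: emit (count, value) pairs, count capped
   at 255; worst case output is 2*len for data with no runs */
size_t rle_encode(const unsigned char *in, size_t len, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < len; ) {
        unsigned char v = in[i];
        size_t run = 1;
        while (i + run < len && in[i + run] == v && run < 255)
            run++;
        out[o++] = (unsigned char)run;    /* run length, 1..255 */
        out[o++] = v;                     /* repeated byte value */
        i += run;
    }
    return o;                             /* bytes written */
}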

Later, someone suggested using better compression, and I chose zx7 because I had converted some games to +3 and used it to save space. In my tests (using a set of 19 128k snapshots), cutting unused areas saved about 40% space, using RLE saved about 50% and using zx7 saved about 60% (compared to saving the entire RAM).

So I'm using zx7 because it is small, fast and easier to use than the alternatives... and (given that problem with +3 snapshots) it seems I'm feeding it a set of worst cases for LZ compression. I even have a snapshot of Harrier Attack (16k) running on a +3 computer (WORST TEST FILE EVER)!

I was wondering if it was a fault in the changes I made to zx7, but now I'm wondering whether it is worth the effort to change zx7 at all. 16k games and early 128k games usually didn't use the entire RAM (and some 128k games, like Cybernoid II, only used 48k), while later games tend to fill every byte available. So I don't know if the games affected by this LZ77 "feature" are common or only a small percentage.

Looking at my tool, I don't even know if it would get used, for a variety of reasons (the original ones were made as a homage to those "transfer" interfaces; there are other tools that convert snapshots to tape; most emulators/modern devices are happy with sna files). I also don't know whether it would be run on single files (where users would spend more time on the command line than the time taken to convert the file) or thrown against sets of files (where the slow execution would be more evident).

SUGGESTION: In the first lines of compress(), output_size is calculated from the optimal array (and then the data is compressed). I guess that if that size is bigger than the actual data (I've seen it happen a few times) it should not compress anything and return an "error". I'm going to change it to return 65535 (bigger than any data I'm going to compress), because my routines check whether the compressed size is bigger than the uncompressed data and store the smaller one.
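
Something like this sketch (the 65535 sentinel and the function name are mine, not from the ZX7 sources):

Code: Select all

#include <stddef.h>

/* sketch of the suggested bail-out: if the size computed from "optimal"
   is no better than storing the raw bytes, skip the compression pass and
   return a sentinel the caller can test for */
#define INCOMPRESSIBLE 65535u

unsigned int worth_compressing(size_t output_size, size_t input_size) {
    if (output_size >= input_size)
        return INCOMPRESSIBLE;    /* caller stores the uncompressed block */
    return (unsigned int)output_size;
}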
zup
Manic Miner
Posts: 209
Joined: Wed Jul 08, 2020 8:42 am

Re: Issues using zx7

Post by zup »

Sorry, I spent too much time tinkering with other things and left this program unfinished...

Well, it's still unfinished but usable. This utility is part of a series, and I still have to backport some changes, reshuffle libraries to remove duplicate definitions, and so on.

Instructions: This is a command-line utility, so you'll have to run it from a cmd console. Just use sna2zx7 filename.sna to convert snapshots to tape files using zx7 compression. It accepts wildcards (so you can use *.sna), and it can also change the name shown after "Program:" (use -n; just remember that it will ask for a name for each file to convert).

Some details: The 7z file on mediafire contains the source code and two executables (for Windows x32 and x64). It also contains GCC-"friendly" source code (meaning it has been compiled with a variety of GCC clones).

The utility uses a Phoenix-like "boot" routine and tries to compress every bit of RAM (except the 256 bytes needed for the loader). In 48k, that means there are three blocks: the compressed screen, a compressed "main" block (41k starting from 23296) and the last 256 bytes of RAM (uncompressed).
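
(Each of those blocks goes into the file in the standard TAP container layout; roughly this sketch:)

Code: Select all

#include <stdio.h>

/* sketch of writing one block in the standard TAP layout: a 2-byte
   little-endian length, then flag byte + payload + XOR checksum */
int write_tap_block(FILE *f, unsigned char flag,
                    const unsigned char *data, size_t len) {
    size_t block_len = len + 2;           /* flag + payload + checksum */
    unsigned char hdr[2] = {
        (unsigned char)(block_len & 0xFF),
        (unsigned char)(block_len >> 8)
    };
    unsigned char sum = flag;
    for (size_t i = 0; i < len; i++)
        sum ^= data[i];                   /* checksum covers flag and data */
    return fwrite(hdr, 1, 2, f) == 2
        && fwrite(&flag, 1, 1, f) == 1
        && fwrite(data, 1, len, f) == len
        && fwrite(&sum, 1, 1, f) == 1;
}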

In a 128k... it's a little more complex. In the worst case, it stores a compressed screen, then compressed pages 0, 1, 3, 4, 6 and 7, then a compressed "main" block (25k starting from 23296, pages 5 and 2) and then a 256-byte uncompressed block. But that's not entirely true...
- There is a routine that detects whether RAM pages end in zeroes and avoids storing those zeroes (see the sketch below). So instead of compressing 16k, it might only compress 15000 bytes...
- ...but if a RAM page is entirely filled with zeroes, it won't be saved on tape at all (the loader will fill the RAM with zeroes to ensure everything is OK). So a 128k tap file may not have all its blocks.
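
The trimming check itself is trivial (sketch):

Code: Select all

#include <stddef.h>

/* sketch of the trailing-zero trim: only the bytes up to the last
   non-zero one need storing; the loader re-zeroes the rest of the page.
   A result of 0 means the whole page can be skipped. */
size_t used_length(const unsigned char *page, size_t len) {
    while (len > 0 && page[len - 1] == 0)
        len--;
    return len;
}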

Also, the program checks whether a compressed block is actually bigger than the uncompressed data (some games, like Cannon Bubble, have data that cannot be compressed) and chooses the smaller one.

Another interesting feature (not a bug!) is a self-destructing loader ;) In 128k snapshots, the loader uses the stack to store the start/length of each block, so new stack operations destroy data that has already been used. And (in both 48k and 128k snapshots) part of the loader is overwritten by the last compressed block, to provide enough clearance for the data (the delta thing in zx7).

Some errors you may encounter: Besides the normal ones (not a sna file, can't read/write a file), you may run into one of these:
- If you have compiled it yourself and it cannot convert any sna file... check compile.bat. Some compilers honor neither the __packed__ attribute nor #pragma pack directives, so they may need the -mno-ms-bitfields flag.
- If it complains about PC or SP, that sna cannot be converted. Because the Phoenix routines and stack are stored between 16384 and 17000, any snapshot that has PC or SP in that area will be rejected (if RAM5 is paged in, they may be pointing there). Also, a corrupt sna file with PC=0 will be rejected.
- If it complains about TR-DOS... the snapshot won't be converted (but I guess the routine at 16896 could be used to page in TR-DOS and make a TR-DOS-only compatible tap file).

-------------------

So, to make the zx7 routine faster, I've tried to avoid compressing repeated zeroes. It's not perfect, because:
- The +3 doesn't fill the upper RAM with zeroes, so a snapshot taken on a +2A/+3 will take longer to compress than a +2 one.
- The "main" blocks may have trailing zeroes, but I haven't checked for them because I need them to be overwritten by the compressed data (and to clean the RAM). On the 48k side, I may try to alter the loader to clean memory if the "main" block doesn't fill the entire 41k... but I'll do that later. On 128k that's probably not an issue (most 128k games fill the entire RAM), and it would be difficult to make space for a new clean routine in the loader (although it can be shortened by a few bytes).
Einar Saukas
Bugaboo
Posts: 3093
Joined: Wed Nov 15, 2017 2:48 pm

Re: Issues using zx7

Post by Einar Saukas »

Good work!