Beating the C compiler...
Hi everyone, you've probably seen my mario/sonic game: https://toastyfox.com/zx/sonic.html
I'm trying to get the attribute copy code as fast as possible. In that demo I just used:
address=0x5840-31;
for( p=scx; p<21+scx; p++ )
{
memcpy( address, bgi4+q+timestable[p], 30);
address+=32;
}
Yesterday I went back to write it in ASM, but I can't seem to beat the compiler. I was sure this:
ld (0x8002),sp ; store stack pointer.
ld sp, (0x8004) ; start of buffer line.
pop af
pop bc
pop de
pop hl
exx
pop bc
pop de
pop hl
ld sp,(0x8006) ; end of screen line.
push hl
push de
push bc
exx
push hl
push de
push bc
push af
ld sp,(0x8002)
ld hl,(0x8006) ; inc memory locations
add hl, 14
ld (0x8006) ,hl
ld hl,(0x8004)
add hl, 14
ld (0x8004) ,hl
repeat again 39 times...
would beat it, but it isn't faster, even when I unrolled it. I don't know if I'm just doing it wrong.
How would you guys write to the attribute space as fast as possible?
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
LD SP,(NN) is pretty slow (20T, against 10T for LD SP,NN with an immediate address), so I always use a macro to do a memcpy and only use a fixed address.
If you need a non-constant address you have to self-modify the code though (but you can do that at the end of the previous frame and not at the beginning of the frame i.e. start drawing immediately after the HALT assuming you have one).
You can probably get the C compiler to produce the ASM source as well so you can compare. (Looks like it is option -l if you use this compiler https://github.com/z88dk/z88dk/wiki/Too ... mmand-line)
My memcpy routines, for 30 bytes you can just do both of these after each other. Interrupts must be disabled of course.
Code: Select all
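MACRO MEMCPY16 dest, src
    ld sp, src      ; point SP at the source, POP reads upwards
    pop af          ; bytes 0-7 into the main register set
    pop bc
    pop de
    pop hl
    exx             ; switch to the alternate register set
    ex af, af'
    pop af          ; bytes 8-15 into the alternate set
    pop bc
    pop de
    pop hl
    ld sp, dest+16  ; PUSH writes downwards, so start at the end of dest
    push hl         ; bytes 15-8 from the alternate set
    push de
    push bc
    push af
    exx             ; back to the main register set
    ex af, af'
    push hl         ; bytes 7-0
    push de
    push bc
    push af
ENDM
MACRO MEMCPY14 dest, src
    ld sp, src      ; point SP at the source
    pop af          ; bytes 0-7 into the main register set
    pop bc
    pop de
    pop hl
    exx             ; alternate BC, DE, HL (AF is not swapped here)
    pop bc          ; bytes 8-13
    pop de
    pop hl
    ld sp, dest+14  ; start at the end of dest
    push hl         ; bytes 13-8
    push de
    push bc
    exx             ; back to the main register set
    push hl         ; bytes 7-0
    push de
    push bc
    push af
ENDM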
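For a rough idea of why the stack copy should win, here is a back-of-envelope T-state count sketched in C. The timings are the standard uncontended Z80 figures (POP 10, PUSH 11, EXX and EX AF,AF' 4 each, LD SP,nn 10, LDIR 21 per byte and 16 for the last byte). The function names are made up for this sketch, and it ignores memory contention, interrupts and LDIR's register setup.

```c
/* Rough T-state counts using uncontended Z80 timings.
   Function names are illustrative only, not from any library. */
enum { T_POP = 10, T_PUSH = 11, T_EXX = 4, T_EX_AF = 4, T_LD_SP_NN = 10 };

/* MEMCPY16: 2 x LD SP,nn, 8 x POP, 8 x PUSH, 2 x EXX, 2 x EX AF,AF' */
unsigned memcpy16_tstates(void) {
    return 2 * T_LD_SP_NN + 8 * T_POP + 8 * T_PUSH + 2 * T_EXX + 2 * T_EX_AF;
}

/* MEMCPY14: 2 x LD SP,nn, 7 x POP, 7 x PUSH, 2 x EXX */
unsigned memcpy14_tstates(void) {
    return 2 * T_LD_SP_NN + 7 * T_POP + 7 * T_PUSH + 2 * T_EXX;
}

/* LDIR: 21 T per byte except 16 T for the final byte (n >= 1). */
unsigned ldir_tstates(unsigned n) {
    return (n - 1) * 21 + 16;
}
```

With these numbers a 30-byte row costs 204 + 175 = 379 T-states through the two macros, against 625 for a 30-byte LDIR.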
EDIT: So my game loop goes
mainloop:
HALT
; disable interrupts - although if you can guarantee you will finish drawing within a frame you don't need to do that
; draw everything
; enable interrupts
; read keyboard/joystick
; update game logic
; prepare graphics for next frame
; goto mainloop
For the longest time I used to read the keyboard and do the game logic immediately after the HALT, and then draw everything. That is not a good idea if you draw directly to the screen. For the very first frame you can jump into the mainloop after the draw if you don't know what to draw on frame 1.
In my draw everything bit (also does erase everything first if you are not drawing the entire screen) I only stash the SP once at the beginning and restore after drawing. So my draw routines don't use the stack for subroutine calls etc. since SP is not available for normal use.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
Also make sure what you are copying is in non-contended RAM (it's fine to store level data in contended RAM but then copy the data for the current level into a current level buffer in non-contended RAM at level init time - you can compress the level data as well if you want and decompress into non-contended RAM).
Also set SP to non-contended RAM (I do LD SP, 0 so stack starts at FFFF).
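On a 48K Spectrum the contended region is the 16K from 0x4000 to 0x7FFF, so a quick sanity check on buffer and stack addresses could look like the sketch below. The helper name is invented for illustration, and 128K models contend different banks, so this only applies to the 48K layout.

```c
#include <stdint.h>
#include <stdbool.h>

/* True if addr lies in the 48K Spectrum's contended RAM (0x4000-0x7FFF).
   Hypothetical helper; it does not model 128K banking. */
bool is_contended_48k(uint16_t addr) {
    return addr >= 0x4000 && addr <= 0x7FFF;
}
```

The attribute area itself (0x5800 upwards) is always contended, so the check is only useful for the source buffers and the stack.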
Re: Beating the C compiler...
Ah I see, I didn't know a 16byte copy was possible! Full screen
I'm doing a 'wait vbl' by writing bright black to the bottom of the screen and watching for it, then making that my 'halt'. It gives me more time to do things (I'm told this is a better technique than halt, but maybe I'm wrong?).
I do need to change the shadow screen address every frame, and maybe that's what's been causing issues, because I have no register left to use.
Are you doing this all unrolled? If not could I see how you loop it?
Re: Beating the C compiler...
ParadigmShifter wrote: Mon Feb 12, 2024 10:34 am
"Also make sure what you are copying is in non-contended RAM (it's fine to store level data in contended RAM but then copy the data for the current level into a current level buffer in non-contended RAM at level init time - you can compress the level data as well if you want and decompress into non-contended RAM).
Also set SP to non-contended RAM (I do LD SP, 0 so stack starts at FFFF)."
Ah yeah, don't worry, I'm not using that for code. I inject the compressed levels into it after the compile and just use it for that.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
rothers wrote: Mon Feb 12, 2024 10:37 am
"Ah I see, I didn't know a 16byte copy was possible! Full screen
I'm doing a 'wait vbl' by writing bright black to the bottom of the screen and watching for it, then making that my 'halt' It gives me more time to do things (I'm told this is a better technique than halt, but maybe I'm wrong?).
I do need to change the shadow screen address every frame, and maybe that's what's been causing issues, because I have no register left to use.
Are you doing this all unrolled? If not could I see how you loop it?"
Yeah I unroll everything. I'm not using the floating bus since it's more complicated, but if you do that you have even more time available.
I'm not copying a full screen of attribs or anything I am only copying some graphic data from the screen to a buffer (so I can do a fast 8 pixel scroll downwards).
I'd unroll everything if I were you and self-modify the src, dst+N after drawing and game-logic update for the next frame. You need to modify 4 bytes per MEMCPY16/MEMCPY14.
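To make the "4 bytes per macro" concrete: LD SP,nn assembles to the opcode byte 0x31 followed by the 16-bit operand, low byte first, so each macro has two 2-byte operands to patch (src and dest+N). A minimal C sketch of patching one of them, assuming you have a pointer to the instruction in RAM (the function name is hypothetical):

```c
#include <stdint.h>

/* LD SP,nn is encoded as 0x31, low byte, high byte.
   Overwrite the 16-bit operand of the LD SP,nn that starts at code[0].
   Hypothetical helper for illustration. */
void patch_ld_sp(uint8_t *code, uint16_t addr) {
    code[1] = (uint8_t)(addr & 0xFF);  /* low byte first (little-endian) */
    code[2] = (uint8_t)(addr >> 8);    /* then the high byte */
}
```

Doing this after the draw and game logic, as suggested, keeps the patching cost out of the time-critical part of the frame.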
Re: Beating the C compiler...
Ah right, I suspected people were doing this unrolled. It just eats into the RAM, but I think I've got enough space left, plus I can move some of the non-in-game graphics into low RAM. I can probably free up 2kb or so, and I'll give this a go... here is where I find out this is exactly how the compiler is doing it.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
Have you tried the -l option to produce the assembler output from the C compiler (you are using z88dk right?).
I don't know anything about the z88dk compiler, but I have used the "output asm source" option many times on other platforms, and C compilers usually produce the assembly source if you ask for it.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
Since you are scrolling 4 pixels at a time do you have 2 copies of the map attributes per level (one set for offset by 0 pixels, another for 4 pixels)?
That would definitely help if you aren't already doing that and can spare the memory.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
There is a way to loop the memcpy though, you just need a draw list which has src, dest+16 pointers per 16 bytes, and end it with a single byte of zero (I've just realised you are always copying to the same place as well, so you only need to store the src pointers in fact). So you would have a list like this
drawlist dw srcline0left, line0right, line1left, line1right, ... , line20left, line20right
db 0
drawlistptr dw drawlist
(I think you need to store the words in opposite endian order to make the 0 terminator be the high byte rather than the low byte since it's a lot faster to check for a single byte of 0 rather than 2 bytes of 0)
You would modify the drawlist per frame depending on the scroll position obvs
Then for the drawloop (pseudocode)
Code: Select all
short* hl = drawlistptr
while (*(char*)hl != 0) { ; only need to check 1 byte for 0 not both.
    ; modify the src address for the following expanded macro using value in drawlist, update dest address for the current line
    hl++
    memcpy16 ; self modify the addresses as mentioned above
}
; done, reset drawlistptr
drawlistptr = drawlist
EDIT: Although that's very similar to what you were doing originally of course (may even be slower?). You can of course unroll more times if you like (since it's a line of contiguous attribs per 32 bytes you can do 2 easily and halve the size of the draw list). If you only do 30 bytes instead of 32 it's easy to change to do that as well.
EDIT2: And you don't need to terminate the list either since you are always drawing the same amount of attribs, which is even more similar to what you were originally doing.
I'd be interested to see what the C compiler is doing though if it was faster than your original approach! (I'm guessing it unrolled the loop and may use an inline stack based memcpy as well?). EDIT: If it did not unroll the loop, you want to rewrite your loops to count down from loop_max to 0 as well if possible, since it's much faster to check != 0 than < loopend
EDIT3: Full unroll and self-modify the addresses outside of draw time is going to be the fastest thing you can do of course, worth doing that if you have the memory to spare.
EDIT4: If you don't want to fully unroll I'd at least unroll 8 full rows and draw 1/3 of the screen at a time - so 3 loops total (then you only need to modify the high byte of dest instead of modifying both bytes outside the loop, low bytes are constant just increasing by 32 each line), you can also do an inc (highbyte) instead of a read/modify/store
EDIT5: If your assembler supports it (sjasmplus does) you can also add labels to the macro to make it easier to self modify the addresses
Code: Select all
MACRO MEMCPY16 dest, src, destmodaddr, srcmodaddr
srcmodaddr:
    ld sp, src
    pop af
    pop bc
    pop de
    pop hl
    exx
    ex af, af'
    pop af
    pop bc
    pop de
    pop hl
destmodaddr:
    ld sp, dest+16
    push hl
    push de
    push bc
    push af
    exx
    ex af, af'
    push hl
    push de
    push bc
    push af
ENDM
and do
MEMCPY16 dest, src, destlabelline0lhs, srclabelline0lhs
MEMCPY16 dest, src, destlabelline0rhs, srclabelline0rhs
MEMCPY16 dest, src, destlabelline1lhs, srclabelline1lhs
MEMCPY16 dest, src, destlabelline1rhs, srclabelline1rhs
;etc.
which will give you the addresses you need to modify (label+1) as convenient labels (obvs need to be unique)
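The high-byte trick from EDIT4 falls out of the attribute layout: the attributes start at 0x5800 and each row is 32 bytes, so 8 rows are exactly 256 bytes and every group of 8 rows shares one high byte. A small C sketch of the address arithmetic (the function name is made up for illustration):

```c
#include <stdint.h>

/* Address of an attribute cell: 0x5800 + row * 32 + col,
   with row in 0..23 and col in 0..31. Illustrative helper only. */
uint16_t attr_addr(uint8_t row, uint8_t col) {
    return (uint16_t)(0x5800 + row * 32 + col);
}
```

Rows 0-7 have high byte 0x58, rows 8-15 have 0x59 and rows 16-23 have 0x5A, while the low byte for a given row-within-third and column is the same in all three thirds.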
Re: Beating the C compiler...
ParadigmShifter wrote: Mon Feb 12, 2024 11:13 am
"Since you are scrolling 4 pixels at a time do you have 2 copies of the map attributes per level (one set for offset by 0 pixels, another for 4 pixels)?
That would definitely help if you aren't already doing that and can spare the memory."
Yes, there are 2 copies which are generated on level boot. It's a little more complex than that, as I have to account for brightness changes and mask them with blacks depending on the way they are 'facing'. But it works well.
It can also scroll upwards in 4px if I have room in memory to do that.
Re: Beating the C compiler...
rothers wrote: Mon Feb 12, 2024 10:51 am
"Ah right, I suspected people were doing this unrolled. It just eats in to the RAM, but I think I've got enough space left, plus I can move some of the non-in game graphics in to low RAM. I can probably free up 2kb or so, and I'll give this a go... here is where I find out this is exactly how the compiler is doing it"
By default the z88dk memcpy function is the compiler's builtin one, i.e. it uses a simple ldir.
You can rebuild the library with flags which switch in loop unrolling for memcpy, memset and others, but obviously you didn't do that.
Derek Fountain, author of the ZX Spectrum C Programmer's Getting Started Guide and various open source games, hardware and other projects, including an IF1 and ZX Microdrive emulator.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
rothers wrote: Mon Feb 12, 2024 1:23 pm
"Yes there are 2 copies which are generated on level boot, it's a little more complex than that as I have to account for brightness changes and mask them with blacks depending on the way they are 'facing'. But it works well.
It can also scroll upwards in 4px if I have room in memory to do that."
Do you have 2 copies of the tiles you need to draw as well then (I'm guessing so)?
How many different combinations are there for 2 tiles next to each other per frame? If there are only 8 possibilities you could do a Joffa Cobra style thing of emitting a series of push and exx instructions per row using all 8 available register pairs (BC, DE, HL, B'C', D'E', H'L', IX, IY) - assuming you are drawing all cells on the screen per frame. That might be slower than a series of LDI though if you have to use IX, IY or keep swapping between registers & shadow regs too often (but you may be able to preprocess the map somehow to give an optimal register allocation for different parts of the level minimising use of EXX and IX or IY? Complicated stuff that though lol).
Or since it seems you are only drawing vertical oblongs (4x8 pixel width, height) you could scroll 1 pixel row of each line into a buffer (so a buffer of 32 bytes is needed for a full screen scroll per line), and blit the same data 8 times?
You may even be able to use the mythical RLD or RRD instructions if you do that (although 4 rlca is faster than a single RRD or RLD, those seem to have the advantage of writing the result back to memory)? I tried once to use RRD or RLD but my code was bugged and I was drunk when I was debugging it so I went back to using 4xRLCA
Your approach seems fast enough already though, so don't over-think things like I have just done lol if speed is good enough already.
; W,X,Y,Z are the nybbles of register contents
ld A,$WX
ld (HL),$YZ
RLD
; A = $WY
; (HL) = $ZX
I think RLD and RRD are the only opcodes I haven't yet used (apart from the apparently bugged OUTI, INI instructions). Oh and IM 0 (also unusable on the speccy IIRC).
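The nibble shuffle in that RLD comment can be checked with a small C model. This is only a sketch of the data movement (flags are not modelled), and the function name is invented for the example:

```c
#include <stdint.h>

/* Model of the Z80 RLD data movement: the low nibble of A goes to the
   low nibble of (HL), the old low nibble of (HL) moves to its high
   nibble, and the old high nibble of (HL) goes to the low nibble of A.
   The high nibble of A is unchanged. Flags are not modelled. */
void rld(uint8_t *a, uint8_t *mem) {
    uint8_t new_mem = (uint8_t)((*mem << 4) | (*a & 0x0F));
    uint8_t new_a   = (uint8_t)((*a & 0xF0) | (*mem >> 4));
    *mem = new_mem;
    *a = new_a;
}
```

With A = $12 and (HL) = $34 this gives A = $13 and (HL) = $42, matching the $WY / $ZX pattern above.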
Re: Beating the C compiler...
ParadigmShifter wrote: Mon Feb 12, 2024 2:59 pm
"Do you have 2 copies of the tiles you need to draw as well then (I'm guessing so).
How many different combinations are there for 2 tiles next to each other per frame? If there are only 8 possibilities you could do a Joffa Cobra style thing of emitting a series of push and exx instructions per row using all 8 available register pairs (BC, DE, HL, B'C', D'E', H'L', IX, IY) - assuming you are drawing all cells on the screen per frame. That might be slower than a series of LDI though if you have to use IX, IY or keep swapping between registers & shadow regs too often (but you may be able to preprocess the map somehow to give an optimal register allocation for different parts of the level minimising use of EXX and IX or IY? Complicated stuff that though lol).
Or since it seems you are only drawing vertical oblongs (4x8 pixel width, height) you could scroll 1 pixel row of each line into a buffer (so a buffer of 32 bytes is needed for a full screen scroll per line), and blit the same data 8 times?
You may even be able to use the mythical RLD or RRD instructions if you do that (although 4 rlca is faster than a single RRD or RLD, those seem to have the advantage of writing the result back to memory)? I tried once to use RRD or RLD but my code was bugged and I was drunk when I was debugging it so I went back to using 4xRLCA
Your approach seems fast enough already though so don't over-think things like I have just done lol if speed is good enough already.
; W,X,Y,Z are the nybbles of register contents
ld A,$WX
ld (HL),$YZ
RLD
; A = $WY
; (HL) = $ZX
I think RLD and RRD are the only opcodes I haven't yet used (apart from the apparently bugged OUTI, INI instructions). Oh and IM 0 (also unusable on the speccy IIRC)."
I did try unrolling a very long list of LDI, but it takes up a lot of memory to do that. I think it was the best performing one. I'm going to write some basic benchmark routines tonight to test all this.
The reason I'm trying to push it to the speed limit is so I can have as many enemies in the game as I can, ideally at least 3 16x16 (24x16 with shifting) pixel based enemies plus the player character, plus the 4*8 based baddies while maintaining 50fps.
It is ALMOST there. It's just one good optimisation jump away from it.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
Scrolling what is on the screen already might be best if that would work? (You'd have to erase sprites then of course).
By which I mean (if your tiles are always 4x8 pixels) you can copy the first line of each character row into a 32 byte buffer, scroll that right or left 4 pixels (bring in new pixels at the edge that becomes visible) then blit that 8 times using those memcpy routines from the buffer.
Maybe that doesn't work with how you are doing the different bright levels though.
EDIT: Even if you don't scroll what is currently on screen if all your tiles are 4x8 pixels solid or blank that is probably the way to go - maybe that is what you are doing already though? That way you only need to have a screen buffer of 32 bytes per 8 pixels, so you could store the entire screen as 32x24 bytes of pixel data (you could reduce mem usage to just 32 bytes if you build the data before you blit it though -- that might be too slow to beat the raster of course). I think that would be fine on 128K where you can swap between 2 screen buffers very quickly (I've never done any 128K programming myself but that is the best way to do it IIRC).
That's similar to how I am doing my scroll down 8 pixels directly to the screen in my latest project though... I copy character row 7 to a buffer then copy row 6 to row 7, row 5 to row 6, ..., row 0 to row 1 then bring in what is needed at the top. The next screen third I use the buffer I copied into as the pixels to copy to row 8. That's what I am using my memcpy macros for anyway (copying "critical strips" i.e. row 7 and row 15 of the screen to a buffer).
I'm also only scrolling 16 pixel wide columns down as well (and there's an optimisation to only scroll what I need).
It's not quite finished yet what I am doing though (attribute scrolling is needed as well) so I haven't managed to see how well it performs (and I can think of some optimisations scrolling multiple columns at a time if they are next to each other - but that is more complicated still). Aim is to be less flickery than SJOE anyway, we will see how it works out. I got sidetracked by another good optimisation, which is not erasing any pixel data and just setting the ink and paper to the background colour like TLL and Cyclone do.
Scrolling up 8 pixels would be a great deal easier than scrolling down 8 pixels anyway, gravity is in the wrong direction lol.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
Hmm that's given me an idea now I wonder how fast I could draw an entire screen at half resolution. Would need 32x96 bytes = 3K of buffers for that.
Can compress the image data to 4 bits per 2x2 pixel as well then.
Might have a try of that later on (post beer o'clock that will be though lol).
Re: Beating the C compiler...
It only has to copy the attributes, there is no pixel data, that's all saved for the sprites. The screen is set up with a chessboard like image which is manipulated to get the scrolling.
So a copy of 570 bytes.
I've written my own sprite routines, and I'll come on to those later, as I know there is some black magic for speeding them up out there. The whole engine uses the attribute layer like a tile set for all functions, so the code for the game is pretty tiny.
I absolutely want this to run on a 48k machine as that was my machine (handed down to me as a kid) and I want to show what it can do! The engine actually fits in to the 16k spectrum, but there is no room for levels. The 16k could run levels at 8x8.
I know it would be far easier on the 128k, and I'll probably add 128k music if I can find a fast player (I did try one from GitHub, but it caused random performance drops all over the place), but 48k is my target and no multi-load. The whole game has to fit into 48k. Each level compresses down to about 1-2k, so I can get 10 big levels and far more once I code in some decent tile sets.
This game has been in my head since I was a little kid, and I'm just getting it out of my system. I'm also doing some work on the GBC and SMS.
Here is the screen copy routine in C:
address=0x5840-31;
if (i==0){
for( p=scx; p<21+scx; p++ )
{
memcpy( address, bgi4+q+timestable[p], 30);
address+=32;
}
}
else{
for( p=scx; p<21+scx; p++ )
{
memcpy( address, bgi5+q+timestable[p], 30);
address+=32;
}
}
That's it, it alternates depending on which 4x screen is showing.
I've tried all the ASM routines and none of them seem to beat it, bizarrely. It could be because I keep having to adjust the memory address and I don't know many of the Z80 tricks others do.
I call them with:
if (i==0) qpop(bgi4);
else qpop(bgi5);
As zcc allows you to insert data in to HL.
I have the memory address in RAM which I read out each write as there are no registers left, I've not coded in Z80 since about 2000 when I was a kid.
LD DE, (0x8000)
ADD HL, DE //inc HL to correct point in the shadow screen
ld (0x8004),hl //put bg location in ram
LD DE, 0x5830 //screen location
ld (0x8006),DE //screen location to ram
ld (0x8002),sp ; store stack pointer.
ld sp, (0x8004) ; start of buffer line.
pop af
pop bc
pop de
pop hl
exx
pop bc
pop de
pop hl
ld sp,(0x8006) ; load screen location from 8006
push hl
push de
push bc
exx
push hl
push de
push bc
push af
ld sp,(0x8002) //put stack back
ld hl,(0x8006) //load back in the settings and increase them
add hl, 14
ld (0x8006) ,hl
ld hl,(0x8004)
add hl, 14
ld (0x8004) ,hl
( repeat dozens of times etc etc)
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
Oh right I hadn't thought about doing it like that (surely stripes would be better than a checkerboard pattern though?), that's quite clever.
You should definitely be able to write 768 bytes to attribs very fast using the stack... probably before the scanline hits the top of the screen visible area if you start drawing immediately I would have thought.
I might have a go and see how fast I can do an unrolled attrib copy routine in ASM later on where each line of attribs can point at an arbitrary address.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
This code blits 768 bytes to the attribs while the border is red (the border is white while waiting for the vblank). It uses the funky sjasmplus REPT with an index argument, as I was just writing it as fast as possible. This is the fastest you can possibly do a blit immediately following a vblank.
It unrolls MEMCPY16 48 times, each unroll draws 16 bytes to the attribs area
I'll try it in a loop next with 48 pointers to the source addresses... I'm using the ROM for the attrib data
EDIT: Using an indirect src address pointer, amazing that this works with sjasmplus, the REPT with an index and macro expansion is quite powerful (not as good as C's #define though unfortunately), was surprised
so that is expanding this line of the macro
ld sp, src
with
ld sp, (attribptrs + 0)
next expansion it becomes
ld sp, (attribptrs + 2) ; attribptrs + 2 is evaluated at assembly time
etc.
So expanding the code like that means you can blit all the attribs from a table of 48 pointers to half-rows of attrib data before the raster reaches the top of the drawable area (update the pointers after drawing to change the addresses it draws from each frame)
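In C terms, the pointer-table blit amounts to the loop below. This is only a model of the data flow for following along (the asm unrolls it 48 times and moves the bytes through the stack); blit_attribs and src_table are names invented for the sketch.

```c
#include <stdint.h>
#include <string.h>

#define HALF_ROW  16  /* bytes per half-row of attributes */
#define HALF_ROWS 48  /* 24 attribute rows x 2 halves = 768 bytes */

/* Copy one 16-byte half-row per table entry into the attribute area.
   src_table[i] is the address half-row i is read from. */
static void blit_attribs(uint8_t *attribs,
                         const uint8_t *const src_table[HALF_ROWS])
{
    unsigned i;
    for (i = 0; i < HALF_ROWS; i++)
        memcpy(attribs + i * HALF_ROW, src_table[i], HALF_ROW);
}
```

Changing what appears on screen then only means rewriting entries of the table between frames, not moving the attribute data itself.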
Code:
; TEST
.testagain
ld a, 2
out (#fe), a
REPT 48, idx
MEMCPY16 ATTRIBS_ADDR + idx*16, idx*16
ENDR
ld a, 7
out (#fe), a
halt
jp .testagain
Code:
; TEST
.testagain
ld a, 2
out (#fe), a
REPT 48, idx
MEMCPY16 ATTRIBS_ADDR + idx*16, (attribptrs + idx*2)
ENDR
ld a, 7
out (#fe), a
halt
jp .testagain
attribptrs
REPT 48, idx
dw idx*16
ENDR
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
Super quick and dirty animation of the ROM with scrolling
Each frame it just increments the low byte of each of the 48 pointers in the table (which is why there's a glitch when they wrap around back to 0)
Code (I expect only sjasmplus will be able to assemble this):
ATTRIBS_ADDR EQU #5800
ORGADDR EQU #8000
; sjasmplus.exe --sym=out.sym --syntax=f --raw=out.bin attribtest.asm
ORG ORGADDR
MACRO MEMCPY16 dest, src
ld sp, src
pop af
pop bc
pop de
pop hl
exx
ex af, af'
pop af
pop bc
pop de
pop hl
ld sp, dest+16
push hl
push de
push bc
push af
exx
ex af, af'
push hl
push de
push bc
push af
ENDM
main:
; code which draws the ruler on the right hand side removed ;)
; TEST
.testagain
ld a, 2
out (#fe), a
REPT 48, idx
MEMCPY16 ATTRIBS_ADDR + idx*16, (attribptrs + idx*2)
ENDR
ld a, 7
out (#fe), a
; update lower bytes of all 48 src addresses for next frame
ld b, 48
ld hl, attribptrs
.updateptrs
inc (hl)
inc hl ; can save a jiffy by aligning attribptrs table so it does not cross a 256 byte boundary, then you can use inc l here
inc hl
djnz .updateptrs ; you can unroll this loop 48 times as well to save another microjiffy
halt
jp .testagain
attribptrs
REPT 24, idx
dw idx*256, idx*256 + 16
ENDR
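The wrap-around glitch comes from `inc (hl)` touching only the low byte of each little-endian `dw` entry. A small C model of that update next to a full 16-bit increment (both helper names are made up for illustration):

```c
#include <stdint.h>

/* Model of `inc (hl)` on the low byte of a 16-bit pointer: the high byte
   never changes, so the address wraps within its 256-byte page. */
static uint16_t bump_low_byte(uint16_t ptr)
{
    return (uint16_t)((ptr & 0xFF00u) | ((ptr + 1u) & 0x00FFu));
}

/* A full 16-bit increment, for comparison. */
static uint16_t bump_full(uint16_t ptr)
{
    return (uint16_t)(ptr + 1u);
}
```

With only the low byte bumped, 0x12FF goes to 0x1200 rather than 0x1300, so each source pointer cycles within its own 256-byte page.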
EDIT: Updating the pointers is quite slow lol (cyan border)
Code:
; TEST
.testagain
ld a, 2
out (#fe), a
REPT 48, idx
MEMCPY16 ATTRIBS_ADDR + idx*16, (attribptrs + idx*2)
ENDR
ld a, 5
out (#fe), a
; update lower bytes of all 48 src addresses for next frame
ld b, 48
ld hl, attribptrs
.updateptrs
inc (hl)
inc hl
inc hl
djnz .updateptrs
ld a, 7
out (#fe), a
halt
jp .testagain
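For a feel of the numbers, here is a rough T-state budget using the documented Z80 instruction timings. It assumes a 48K Spectrum (224 T per scanline, 64 lines between the interrupt and the first display line), and ignores the border OUTs and the ROM interrupt handler, so treat it as an estimate. Function names are invented for the sketch.

```c
enum {
    T_LD_SP_IMM = 10,  /* ld sp,nn */
    T_LD_SP_IND = 20,  /* ld sp,(nn) */
    T_POP = 10, T_PUSH = 11,
    T_EXX = 4, T_EX_AF = 4,
    T_INC_MEM = 11,    /* inc (hl) */
    T_INC_HL = 6,
    T_DJNZ_TAKEN = 13, T_DJNZ_EXIT = 8,
    T_PER_LINE = 224,          /* assumption: 48K timing */
    LINES_BEFORE_DISPLAY = 64  /* assumption: 48K timing */
};

/* One expansion of MEMCPY16: 8 pops, 8 pushes, two register-set swaps. */
static unsigned memcpy16_tstates(int indirect_src)
{
    return (indirect_src ? T_LD_SP_IND : T_LD_SP_IMM)
         + 8 * T_POP + T_LD_SP_IMM + 8 * T_PUSH
         + 2 * (T_EXX + T_EX_AF);
}

/* All 48 unrolled expansions. */
static unsigned blit_tstates(int indirect_src)
{
    return 48 * memcpy16_tstates(indirect_src);
}

/* The 48-iteration pointer update loop (the last DJNZ falls through). */
static unsigned update_loop_tstates(void)
{
    unsigned per_iter = T_INC_MEM + 2 * T_INC_HL + T_DJNZ_TAKEN;
    return 48 * per_iter - (T_DJNZ_TAKEN - T_DJNZ_EXIT);
}
```

That gives 214 T per MEMCPY16 with an indirect source (204 T with a fixed one), 10272 T for all 48, and 1723 T for the pointer update loop, against roughly 14336 T before the raster reaches the display area.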
Last edited by ParadigmShifter on Mon Feb 12, 2024 10:51 pm, edited 1 time in total.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
Anyway I hope you get the idea.
For sprites you may have to draw them at just the right time (i.e. draw them ordered according to where they are on the y axis, intermingled with drawing the attribute rows) to beat the raster which will complicate the code quite a bit I expect.
If you want to reduce the code size you can do it in a loop and self modify the code based on the src and dest pointers as I hinted at earlier... all of those methods will be slower than the code I posted though which is max unrolled.
EDIT: There should really be a DI and an EI around my code which abuses the stack (and a save/restore SP as well).
EDIT2: So this is safer. Stack would be pointing at attribute memory when the interrupt goes off (since I have not changed from the ROM interrupt). I guess that worked out ok since it was pointing at RAM not ROM and even if the attribs got corrupted I redraw them all before the raster reached any corruption
In my actual code I do more stuff with the SP (I erase and draw all my sprites without using SP for anything other than data transfer, i.e. I can't call any functions).
Code:
; TEST
.testagain
ld a, 2
out (#fe), a
di ; about to mess with SP, best disable interrupts
ld (.restoresp+1), sp
REPT 48, idx
MEMCPY16 ATTRIBS_ADDR + idx*16, (attribptrs + idx*2)
ENDR
.restoresp
ld sp, 0 ; operand is patched by ld (.restoresp+1), sp above; restore SP before turning on interrupts
ei ; we're done abusing the stack now. Safe to call subroutines and for interrupts to go off
ld a, 5
out (#fe), a
; update lower bytes of all 48 src addresses for next frame
ld b, 48
ld hl, attribptrs
.updateptrs
inc (hl)
inc hl
inc hl
djnz .updateptrs
ld a, 7
out (#fe), a
halt
jp .testagain
Re: Beating the C compiler...
Thank you! I've finished hard coding the sprites now with ASM lookup tables and I think they are as fast as they can be, so now it's just this attribute copy to get working as fast as possible.
The rest of the code is really fast, using the attributes as a tile map is speedy!
I'm also coding a super compressed way to store levels, and that should be it, ready to release. You can actually load Super Mario Bros levels into it, but I'm only using that to test speed vs the NES.
I'll probably then look at that C64 Sonic port and see what I can do on the 48k. Really enjoying this, it's like my morning crossword puzzle every day.
- ParadigmShifter
- Manic Miner
- Posts: 671
- Joined: Sat Sep 09, 2023 4:55 am
Re: Beating the C compiler...
You probably want to use Einar's ZX0 for compression unless your level format is really simple to run-length encode.
I run-length encoded the levels for my Manic Miner remake; here is an example so you get the idea of what I did. This was my first attempt at ASM programming, so it could probably be done a lot better. I could compress a lot more by packing the row/column into 9 bits and the repeat count into the high bits of a 16-bit number, rather than using a byte for each.
So I had horizontal and vertical repeating cells in the level. Could also add rectangles instead of just horizontal/vertical repeat.
Guardian sprite data starts on a 256 byte aligned boundary so I can just use 8 bits for those as well.
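The 16-bit packing idea could look something like this in C. The bit layout (column in bits 0-4, row in bits 5-8, repeat count in bits 9-15) is an assumption for illustration, based on a 32x16 cell level; it is not the format the remake actually uses.

```c
#include <stdint.h>

/* Hypothetical layout:
   bits 0-4  column (0-31)
   bits 5-8  row (0-15)
   bits 9-15 repeat count (0-127) */
static uint16_t pack_run(unsigned col, unsigned row, unsigned count)
{
    return (uint16_t)((col & 31u) | ((row & 15u) << 5) | ((count & 127u) << 9));
}

static unsigned run_col(uint16_t w)   { return w & 31u; }
static unsigned run_row(uint16_t w)   { return (w >> 5) & 15u; }
static unsigned run_count(uint16_t w) { return (unsigned)(w >> 9); }
```

A run such as column 1, row 5, repeated 30 times then takes two bytes instead of three, on top of the type byte.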
Code:
; some macros to pack pointers to graphics into 8 bits rather than 16
MACRO celltype gfx
db (gfx - gfx_platform0) / 8
ENDM
MACRO keytype gfx
db (gfx - gfx_key0) / 8
ENDM
; macro for ink, paper, bright
MACRO IPB ink, paper, bright
db ink|(paper<<3)|(bright<<6)
ENDM
; Central Cavern (Spectrum Version)
Central_Cavern:
dc "Central Cavern"
; border/paper
db 2
; cell graphics
celltype gfx_platform0
celltype gfx_wall0
celltype gfx_spiky0
celltype gfx_crumbly0
celltype gfx_platform0
celltype gfx_platform0
celltype gfx_spiky1
celltype gfx_conveyor0
celltype gfx_conveyor0
; cell attribs
IPB 2, 0, 1
IPB 6, 2, 0
IPB 4, 0, 1
IPB 2, 0, 0
IPB 0, 0, 0
IPB 0, 0, 0
IPB 5, 0, 0
IPB 4, 0, 0
IPB 4, 0, 0
; willy start position x, y. bit 4 of y is set if facing left
db 2, 13
; exit position
db 29, 13
; exit colour
IPB 6, 1, 0
; keytype
keytype gfx_key0
db 5 ; number of keys
; position of keys
db 9, 0
db 29, 0
db 16, 1
db 24, 4
db 30, 6
; guardians
db 1 ; number of guardians
; something like start position, end position of patrol path and some other stuff I can't remember ;) Obvs the attribs (64+6) here too.
; seems to be terminated with a 0 since some enemies need extra data
db gfx_robot0/256, 0, 8, 7, 8, 15, 64+6, 0
; single blocks
db SPIKY_A, 23, 4
db SPIKY_A, 27, 4
db SPIKY_A, 21, 8
db SPIKY_A, 12, 12
db SPIKY_B, 11, 0
db SPIKY_B, 16, 0
; repeat blocks: platforms
db HORZ_REPEAT|PLATFORM_A, 1, 5, 30
db HORZ_REPEAT|PLATFORM_A, 1, 7, 3
db HORZ_REPEAT|PLATFORM_A, 1, 9, 4
db HORZ_REPEAT|PLATFORM_A, 29, 10, 2
db HORZ_REPEAT|PLATFORM_A, 28, 12, 3
db HORZ_REPEAT|PLATFORM_A, 5, 13, 15
db HORZ_REPEAT|PLATFORM_A, 1, 15, 30
; crumbly platforms
db HORZ_REPEAT|CRUMBLY, 14, 5, 4
db HORZ_REPEAT|CRUMBLY, 19, 5, 4
db HORZ_REPEAT|CRUMBLY, 23, 12, 5
; walls
db HORZ_REPEAT|WALL_A, 17, 8, 3
db HORZ_REPEAT|WALL_A, 20, 12, 3
; conveyor
db HORZ_REPEAT|CONVEYOR_L, 8, 9, 20
db #ff ; terminator
ENDIF
; The Cold Room
The_Cold_Room:
dc "The Cold Room"
; border/paper
IPB 2, 1, 0
; cell graphics
celltype gfx_platform0
celltype gfx_wall0
celltype gfx_spiky0
celltype gfx_crumbly0
celltype gfx_platform0
celltype gfx_platform0
celltype gfx_spiky4
celltype gfx_conveyor0
celltype gfx_conveyor0
; cell attribs
IPB 3, 1, 1
IPB 6, 2, 0
IPB 0, 0, 0
IPB 3, 1, 0
IPB 0, 0, 0
IPB 0, 0, 0
IPB 5, 1, 0
IPB 6, 1, 0
IPB 6, 1, 0
; willy start position x, y. bit 4 of y is set if facing left
db 2, 13
; exit position
db 29, 13
; exit colour
IPB 3, 2, 1
; keytype
keytype gfx_key2
db 5|(1<<4) ; number of keys/paper colour
db 7, 1
db 25, 1
db 26, 7
db 3, 9
db 19, 12
; guardians
db 2 ; number of guardians
db gfx_penguin0/256, 7, 18, 3, 1, 18
IPB 6, 1, 0
db 0
db gfx_penguin0/256, 7, 29, 13, 12, 29
IPB 5, 1, 0
db 0
; single blocks
db SPIKY_B, 30, 1
db PLATFORM_A, 25, 3
db PLATFORM_A, 1, 7
; repeat blocks
db HORZ_REPEAT|WALL_A, 19, 0, 12
db HORZ_REPEAT|PLATFORM_A, 1, 5, 19
db HORZ_REPEAT|CRUMBLY, 21, 3, 4
db HORZ_REPEAT|PLATFORM_A, 21, 6, 4
db HORZ_REPEAT|CRUMBLY, 26, 6, 2
db HORZ_REPEAT|CRUMBLY, 2, 7, 5
db HORZ_REPEAT|PLATFORM_A, 9, 9, 7
db HORZ_REPEAT|CRUMBLY, 19, 10, 4
db HORZ_REPEAT|CONVEYOR_R, 3, 11, 4
db HORZ_REPEAT|PLATFORM_A, 14, 12, 4
db HORZ_REPEAT|CRUMBLY, 8, 13, 4
db HORZ_REPEAT|PLATFORM_A, 1, 15, 30
db VERT_REPEAT|WALL_A, 25, 6, 7
db VERT_REPEAT|WALL_A, 28, 5, 8
db VERT_REPEAT|CRUMBLY, 26, 8, 5
db VERT_REPEAT|CRUMBLY, 27, 8, 5
db #ff
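To show how records like these expand into a map, here is a small decoder sketch in C. The flag values for HORZ_REPEAT and VERT_REPEAT, the 6-bit cell index and the 32x16 map size are all assumptions (the post does not give them); ipb() mirrors the packing done by the IPB macro.

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_W 32
#define MAP_H 16

/* Hypothetical flag values; the post does not show the real ones. */
#define HORZ_REPEAT 0x40
#define VERT_REPEAT 0x80
#define END_MARK    0xFF

/* Same packing as the IPB macro: ink | paper<<3 | bright<<6. */
static uint8_t ipb(unsigned ink, unsigned paper, unsigned bright)
{
    return (uint8_t)(ink | (paper << 3) | (bright << 6));
}

/* Expand block records into map[y][x].
   Single block: type, x, y. Repeat block: type|flag, x, y, count.
   Returns the number of bytes consumed, including the terminator. */
static size_t decode_blocks(const uint8_t *rec, uint8_t map[MAP_H][MAP_W])
{
    const uint8_t *p = rec;
    while (*p != END_MARK) {
        uint8_t type = *p++;
        unsigned x = *p++;
        unsigned y = *p++;
        unsigned n = (type & (HORZ_REPEAT | VERT_REPEAT)) ? *p++ : 1u;
        uint8_t cell = (uint8_t)(type & 0x3F);
        while (n-- && x < MAP_W && y < MAP_H) {
            map[y][x] = cell;
            if (type & VERT_REPEAT) y++; else x++;
        }
    }
    return (size_t)(p - rec) + 1;
}
```

Adding rectangles, as suggested above, would just need a third flag and a second count byte in the same loop.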