improving the Speed of my push/pop screen routine
-
- Microbot
- Posts: 148
- Joined: Fri Nov 24, 2017 5:09 pm
- Location: Syracuse, NY, USA
- Contact:
improving the Speed of my push/pop screen routine
Hello everyone:
I'm trying to use the PUSH/POP method to fill a screen from a buffer and I need some help with optimizing for speed.
Without trying to take up too much space here, my code can be found at:
https://github.com/andydansby/zx_push_pop_screen
I'm pushing lines that are 26 characters (26 bytes) wide, which requires two passes of PUSH/POP per line. That seems to actually take longer than using LDI, which seems wrong to me.
By my current calculations, the routine takes 584 to 589 T-states per line, whereas the LDI routine takes 506 T-states per line.
To compile the code, just run the LUT batch file.
Any help would be appreciated.
Thanks
Andy
- Turtle_Quality
- Manic Miner
- Posts: 506
- Joined: Fri Dec 07, 2018 10:19 pm
Re: improving the Speed of my push/pop screen routine
Hi Andy,
PUSH/POP is quicker: for standard registers it's 21 T-states for 2 bytes (POP is 10, PUSH is 11), compared to 32 T-states for two LDIs. You also need to move the stack pointer and perform EXX commands, but it should still be faster than LDI.
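As a rough illustration of the raw transfer cost only, here is a small Python sketch using the standard uncontended Z80 timings. The helper names `pop_push_cost` and `ldi_cost` are made up for this example, and the figures deliberately leave out LD SP, EXX, pointer updates and contention.

```python
# Uncontended Z80 timings in T-states (standard documented values).
POP_RR = 10   # POP rr: read 2 bytes from (SP)
PUSH_RR = 11  # PUSH rr: write 2 bytes below SP
LDI = 16      # LDI: copy 1 byte from (HL) to (DE)


def pop_push_cost(n_bytes: int) -> int:
    """T-states to move n_bytes using one POP and one PUSH per byte pair."""
    if n_bytes % 2:
        raise ValueError("POP/PUSH moves bytes in pairs")
    return (n_bytes // 2) * (POP_RR + PUSH_RR)


def ldi_cost(n_bytes: int) -> int:
    """T-states to move n_bytes with one unrolled LDI per byte."""
    return n_bytes * LDI


# Raw cost for one 26-byte line, before any loop or pointer overhead.
print(pop_push_cost(26), ldi_cost(26))
```

The gap between the two numbers is the budget that any stack-pointer and pointer-arithmetic overhead has to fit into for PUSH/POP to stay ahead.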
But did you take into account memory contention when you were trying this? Instructions accessing memory from 16-32K will be delayed if the ULA is updating the screen. It's described here https://scratchpad.fandom.com/wiki/Cont ... y%20areas.
I posted an Excel spreadsheet here before to help with calculating contention delays.
Is this to achieve a multicolour effect or is this pixel data ? If multicolour, then it needs to be done repeatedly during the refresh, but you can minimise contention by doing the POPs in the border zone as much as possible.
If it's pixel data, hopefully you can try to complete it outside the screen refresh time
Definition of loop : see loop
Re: improving the Speed of my push/pop screen routine
Turtle_Quality wrote: ↑Fri Jul 15, 2022 9:35 am But did you take into account memory contention when you were trying this ?
But memory contention is the same regardless of which Z80 instructions are used to move data around, isn't it? So as long as Andy has tested using the same memory areas as source and destination for his stack and LDI test cases, then memory contention won't make any relative difference?
Genuine question, BTW. I never properly understood the concept.
Derek Fountain, author of the ZX Spectrum C Programmer's Getting Started Guide and various open source games, hardware and other projects, including an IF1 and ZX Microdrive emulator.
- Turtle_Quality
- Manic Miner
- Posts: 506
- Joined: Fri Dec 07, 2018 10:19 pm
Re: improving the Speed of my push/pop screen routine
Hi @dfzx
As I understand it there are two factors affecting contention -
Does the instruction involve contended memory? That means either reading from it, writing to it, or even the instruction itself being located in contended memory.
What is the beam doing when the instruction is being processed? Once the beam has finished row 192, there is no contention until the beam reaches column zero of row zero. And there is no contention while the beam is in the border area.
If you try an LDI instruction sitting at 24,000 copying from 30000 to 16384, you could get 3 contention delays - it's in the Examples in the link I sent.
The length of each delay depends on the number of T-states relative to column 0: there's an 8 T-state cycle, and (assuming your instruction only includes 1 access to contended memory) the delay to your instruction will be somewhere from 0 to 6 T-states.
So... you either avoid writing to screen while the beam is there, or estimate that each POP might take on average an extra 3 T states, or if you're trying to get a multicolour effect, find a way to reliably start the screen update on the same T state and ensure (with an accurate monitor debugger) that each row update takes the correct number of cycles.
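The 8 T-state cycle described above can be sketched in Python. This is only a model under stated assumptions: the constants are the commonly quoted 48K Spectrum figures (contention starting at T-state 14335 after the interrupt, 224 T-states per line, 128 contended T-states per line), and `contention_delay` is a hypothetical helper for a single contended access, not something from the linked page.

```python
# Assumed 48K Spectrum timing constants (other models differ).
FIRST_CONTENDED_T = 14335   # first contended T-state after the interrupt
T_PER_LINE = 224            # T-states per scanline
CONTENDED_T_PER_LINE = 128  # portion of each line where the ULA fetches pixels
SCREEN_LINES = 192
DELAY_PATTERN = (6, 5, 4, 3, 2, 1, 0, 0)  # repeats every 8 T-states


def contention_delay(t: int) -> int:
    """Extra T-states added to one contended memory access starting at time t."""
    offset = t - FIRST_CONTENDED_T
    if offset < 0:
        return 0  # top border: beam has not reached the pixel area yet
    line, column = divmod(offset, T_PER_LINE)
    if line >= SCREEN_LINES or column >= CONTENDED_T_PER_LINE:
        return 0  # bottom border, side border or retrace
    return DELAY_PATTERN[column % 8]


# Mean delay if the access lands at a random point in the 8 T-state cycle.
average_delay = sum(DELAY_PATTERN) / len(DELAY_PATTERN)
print(average_delay)
```

The mean of the pattern is close to the "extra 3 T-states on average" estimate in the post, but it only applies while the beam is inside the pixel area.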
Re: improving the Speed of my push/pop screen routine
It would help if you posted the relevant code snippet instead of giving a link to a repository with several files that need compiling.
-
- Microbot
- Posts: 148
- Joined: Fri Nov 24, 2017 5:09 pm
- Location: Syracuse, NY, USA
- Contact:
Re: improving the Speed of my push/pop screen routine
The routine is for pixel data only, I'm not trying to do anything more advanced than that.
I'm not sure if anyone has had much of a chance to look at the code, but the data is 26 bytes wide. It seems to me that once you go beyond 16 bytes, you start running into trouble. For the LDI, I adjust HL, DE and BC once and let it fly.
When I use the push/pop, I set IX as the screen pointer and IY as the buffer pointer. I then use AF, BC, DE and HL to perform the push/pop, and repeat the same for the shadow registers. I am using the I register for the loop counter.
The problem seems to stem from going beyond 16 bytes: I have to adjust IX and IY again, adding an additional 70 T-states to push the remaining 10 bytes of data. This seems to gobble up any savings gained by using push/pop as opposed to LDI.
At the end of pushing and popping the second group of pixels, I have to move down 1 line and change the buffer address again, taking an additional 133/138 T-states.
After an entire line is written, according to my calculations, I have spent 584/589 T-states using push/pop. Using LDI, I seem to take 506 T-states.
That's why I'm a bit puzzled. Beyond 16 bytes, the LDI seems to catch up pretty handily, and over the course of 192 lines push/pop comes out 14,976 T-states slower ((584 - 506) x 192).
-
- Microbot
- Posts: 148
- Joined: Fri Nov 24, 2017 5:09 pm
- Location: Syracuse, NY, USA
- Contact:
Re: improving the Speed of my push/pop screen routine
The reason I did that is so someone may download the entire folder and compile it completely using the PASMO assembler, whereas a snippet takes a bit of work. But here's the code if you are interested.
Code: Select all
org $8000
start:
di
call push_pop
return:
ei
endless:
jp endless
originalStack:
defw $0000
;;;;;;;;;;;;;;;;;;;;;;;;;;;
org $C000
data:
incbin "girl.bin"
push_pop:
;on entry stack is #5FE6
ld b, 0 ;the first entry in the LUT
push bc ;save to stack for when we update the loop
; sp = #5FE4
;call coords_to_address
; uses the B register to calculate pixel row
;results in ix
; sp = #5FE4
ld ix, $4010
ld iy, data
;iy holds our buffer data
;i of ir is going to be our loop data
;ld a, (loopsteps) ; temp to be replaced with below
ld a,192
ld i, a
;so at this point
;IX is holding the screen address
;IY is holding our image buffer
;I (of IR) is holding our loop info
;lets save our original stack
ld (originalStack), sp
loop:
; sp = #5FE4
ld sp, iy; 10t
;our stack is set to image buffer
;== 10 t-states
pop af ;2 10t
pop bc ;4 10t
pop de ;6 10t
pop hl ;8 10t
exx ; 4t
ex af, af'; 4t
pop af ;10 10t
pop bc ;12 10t
pop de ;14 10t
pop hl ;16 10t
;==88 t-states
ld sp, ix; 10t
;our stack is set to the screen
;== 10 t-states
push hl ;16 11t
push de ;14 11t
push bc ;12 11t
push af ;10 11t
exx ; 4t
ex af, af'; 4t
push hl ;8 11t
push de ;6 11t
push bc ;4 11t
push af ;2 11t
;==96 t-states
;adjust our screen
;ld sp, (originalStack) ;20t
ld bc, $0a ;10t
add ix, bc ;15t
;adjust our buffer
ld bc, $10 ;10t
add iy, bc ;15t
;==70 t-states
ld sp, iy; buffer
pop bc ;18 10t
pop de ;20 10t
pop hl ;22 10t
exx ; 4t
pop bc ;24 10t
pop de ;26 10t
;==54 t-states
ld sp, ix; 10t
;== 10 t-states
push de ;26 11t
push bc ;24 11t
exx ; 4t
push hl ;22 11t
push de ;20 11t
push bc ;18 11t
;==59 t-states
ld sp, (originalStack) ;20t
;https://worldofspectrum.org/forums/discussion/comment/315782/#Comment_315782
ld d, ixh; 8t
ld e, ixl; 8t
uphl:
inc d; 4t
ld a,d; 4t
and 7; 7t
jp nz, end_of_next_line; 10t
ld a,e; 4t
add a,32; 7t
ld e,a; 4t
jr c, end_of_next_line; 12/7t
ld a,d; 4t
sub 8; 7t
ld d,a; 4t
end_of_next_line:
ld a,e; 4t
ld l, $A; 7t
sub l; 4t
ld e,a; 4t
ld ixh, d; 8t
ld ixl, e; 8t
;==133 to 138
;iy holds our buffer data
ld bc, 10 ;10t
add iy, bc ;15t
ld a, i ;9t
dec a ;4t
ld i, a ;9t
jr nz, loop ;12/7 t
;==54/59 t-states
;;end loop
;; entire loop for 1 line is
; ==584 / 589 t-states, why oh why?
;LDI method 2 = 506T
ld sp, (originalStack)
;sp = #5FE4
pop hl; clear a little junk out of the stack
ret
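As a cross-check of the per-line total, here is a rough Python re-tally of the listing above, using the per-instruction figures from its own comments. It is only a sketch under assumptions: it follows the path taken on 7 of every 8 lines (the `jp nz, end_of_next_line` branch taken, the final `jr nz, loop` taken), it ignores contention, and the segment grouping is mine rather than the author's.

```python
# (description, T-states) for one pass of the loop on the common path.
LOOP_SEGMENTS = [
    ("ld sp,iy", 10),
    ("8 x pop, exx, ex af,af'", 8 * 10 + 4 + 4),
    ("ld sp,ix", 10),
    ("8 x push, exx, ex af,af'", 8 * 11 + 4 + 4),
    ("ld bc,$0a / add ix,bc / ld bc,$10 / add iy,bc", 10 + 15 + 10 + 15),
    ("ld sp,iy", 10),
    ("5 x pop, exx", 5 * 10 + 4),
    ("ld sp,ix", 10),
    ("5 x push, exx", 5 * 11 + 4),
    ("ld sp,(originalStack)", 20),
    ("ld d,ixh ... jp nz (taken)", 8 + 8 + 4 + 4 + 7 + 10),
    ("end_of_next_line ... ld ixl,e", 4 + 7 + 4 + 4 + 8 + 8),
    ("ld bc,10 / add iy,bc / ld a,i / dec a / ld i,a / jr nz (taken)",
     10 + 15 + 9 + 4 + 9 + 12),
]

common_path = sum(t for _, t in LOOP_SEGMENTS)
LDI_PER_LINE = 506  # figure quoted in the post for the LDI routine

for name, t in LOOP_SEGMENTS:
    print(f"{t:4d}  {name}")
print(f"{common_path:4d}  total, versus {LDI_PER_LINE} for LDI")
```

On this reading the common path is somewhat shorter than the 584 in the comments, since the posted subtotals add up both sides of the branch and include the commented-out `ld sp, (originalStack)`, but it still comes out above the LDI figure, so the conclusion in the post is unchanged.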
Re: improving the Speed of my push/pop screen routine
Turtle_Quality wrote: ↑Fri Jul 15, 2022 10:41 am As I understand it there are two factors affecting contention -
...
That was the best English language explanation of contention I've yet seen! Thanks!
Re: improving the Speed of my push/pop screen routine
andydansby wrote: ↑Fri Jul 15, 2022 12:06 pm
Code: Select all
;;end loop
;; entire loop for 1 line is
; ==584 / 589 t-states, why oh why?
;LDI method 2 = 506T
The general approach is good, but you spend too much time computing pointers for the next loop.
For example your code
Code: Select all
ld bc, $0a ;10t
add ix, bc ;15t
;adjust our buffer
ld bc, $10 ;10t
add iy, bc ;15t
could be faster
Code: Select all
ld bc, $0a ;10t
add ix, bc ;15t
;adjust our buffer
ld c, $10 ;7t because B is already set to 0
add iy, bc ;15
Code: Select all
ld sp, (originalStack) ;20t
;https://worldofspectrum.org/forums/discussion/comment/315782/#Comment_315782
ld d, ixh; 8t
ld e, ixl; 8t
uphl:
inc d; 4t
ld a,d; 4t
and 7; 7t
jp nz, end_of_next_line; 10t
ld a,e; 4t
add a,32; 7t
ld e,a; 4t
jr c, end_of_next_line; 12/7t
ld a,d; 4t
sub 8; 7t
ld d,a; 4t
end_of_next_line:
ld a,e; 4t
ld l, $A; 7t
sub l; 4t
ld e,a; 4t
ld ixh, d; 8t
ld ixl, e; 8t
;==133 to 138
;iy holds our buffer data
ld bc, 10 ;10t
add iy, bc ;15t
ld a, i ;9t
dec a ;4t
ld i, a ;9t
jr nz, loop ;12/7 t
;==54/59 t-states
Code: Select all
;don't restore the stack with ld sp, (xxxx) inside the loop, it is very expensive
;also don't copy IX to DE unless you need it
inc ixh
ld a, ixh
and 7
jr z, updateIx ;this happens only once in 8 runs
;so it is better to jump relative when it happens for 12 ticks
;otherwise do nothing for 7 ticks
ld a, ixl
sub $0A
ld ixl, a
continue:
;now update iy
ld bc, 10
add iy, bc
ld a, i
dec a
ld i, a
jp nz, loop
ld sp, (originalStack) ;only now is time to restore stack
ret
updateIx:
;update ix and go to continue
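The reason `inc ixh` followed by `and 7` works is the Spectrum's interleaved screen layout. The following Python sketch models that address arithmetic and mirrors the "next line down" logic from the original listing. `screen_addr` and `next_line` are illustrative names, and this is a model for checking the arithmetic, not a replacement for the Z80 routine.

```python
def screen_addr(y: int, x_byte: int = 0) -> int:
    """Address of byte column x_byte (0-31) on pixel line y (0-191)."""
    return (0x4000
            | ((y & 0xC0) << 5)   # screen third
            | ((y & 0x07) << 8)   # pixel line inside the character row
            | ((y & 0x38) << 2)   # character row inside the third
            | x_byte)


def next_line(addr: int) -> int:
    """Move one pixel line down, following the inc h / add 32 / sub 8 steps."""
    high, low = addr >> 8, addr & 0xFF
    high += 1                      # inc h: next line in the same character row
    if high & 7:
        return (high << 8) | low   # common case, 7 lines out of 8
    low += 32                      # crossed a character row: next row
    if low > 0xFF:
        return (high << 8) | (low & 0xFF)  # carry: moved into the next third
    return ((high - 8) << 8) | low         # same third: undo the high-byte step


# Every line's successor should match the directly computed address.
mismatches = [y for y in range(191)
              if next_line(screen_addr(y, 5)) != screen_addr(y + 1, 5)]
print(mismatches)
```

The common case is a single increment of the high byte, which is why handling the rare case out of line, as suggested above, keeps the per-line cost low.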
Proud owner of Didaktik M
Re: improving the Speed of my push/pop screen routine
Have you considered using the push code itself as the buffer? Something like this:
Code: Select all
ld sp,0000 ; pixel row end
ld hl,0000 ; last 2 pixel bytes
push hl
ld hl,0000 ; second to last 2 pixel bytes
push hl
...
The main downside of this is loading the data into the buffer in the first place, plus the huge size of the buffer memory-wise, but with this the average is just over 10.5 T-states per byte (not including contention).
TomD
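To make the trade-off concrete, here is a small Python sketch that generates such a row of code from pixel data and estimates its cost. The helper names `row_to_ld_push` and `ld_push_cost` are hypothetical, the timings are the standard uncontended values, and the output is plain assembler text rather than anything tied to a particular assembler.

```python
LD_SP_NN_T, LD_HL_NN_T, PUSH_HL_T = 10, 10, 11       # T-states
LD_SP_NN_SIZE, LD_HL_NN_SIZE, PUSH_HL_SIZE = 3, 3, 1  # bytes of code


def row_to_ld_push(row: bytes, row_end_addr: int) -> list[str]:
    """Emit LD SP / LD HL / PUSH HL source lines that draw one pixel row.

    row_end_addr is the address just past the last byte of the row,
    because PUSH writes downwards from SP.
    """
    if len(row) % 2:
        raise ValueError("row length must be even")
    lines = [f"    ld sp, ${row_end_addr:04X}"]
    for i in range(len(row) - 2, -1, -2):
        # PUSH HL stores H at SP-1 and L at SP-2, so L is the left-hand byte.
        word = row[i] | (row[i + 1] << 8)
        lines.append(f"    ld hl, ${word:04X}")
        lines.append("    push hl")
    return lines


def ld_push_cost(n_bytes: int) -> tuple[int, int]:
    """Return (T-states, code size in bytes) for one row of n_bytes."""
    pairs = n_bytes // 2
    t_states = LD_SP_NN_T + pairs * (LD_HL_NN_T + PUSH_HL_T)
    size = LD_SP_NN_SIZE + pairs * (LD_HL_NN_SIZE + PUSH_HL_SIZE)
    return t_states, size


print("\n".join(row_to_ld_push(bytes([0x11, 0x22, 0x33, 0x44]), 0x4004)))
print(ld_push_cost(26))
```

Each 2 bytes of pixel data cost 4 bytes of code, which is where the large buffer size mentioned above comes from.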
Retro enthusiast and author of Flynn's Adventure in Bombland, The Order of Mazes & Maze Death Rally-X. Check them out at http://tomdalby.com
Re: improving the Speed of my push/pop screen routine
IMHO the real reason why POP/PUSH is considered faster is somewhat better timing when accessing contended memory. Fact is, I don't know how much faster it is. @Einar Saukas or @Joefish should know, they have spent half of their lives making multicolour effects.
Anyway, something like this would be a little bit faster than LDI:
Code: Select all
loop:
ex af, af' ;counter in A'
ld sp, iy ;10
pop bc ;10
pop de ;10
pop af ;10
pop hl ;10
exx ;4
pop bc ;10
pop de ;10
pop ix ;14
pop hl ;10 -> 98
stack0:
ld sp, 0000
push hl
push ix
push de
push bc
exx
push hl
push af
push de
push bc ;->106
ld de, 16
add iy, de ;->25
;^^^^repeat 16 times -> 16*(98+106+25)=3664
;now it is time to update LSB of destinations
;each stackN label sits on the LD SP opcode, so +1 addresses the low byte of its operand
ld c, 32
ld a, (LSBdestination1)
add a, c
ld (LSBdestination1), a
ld (stack0+1), a ;13
ld (stack2+1), a ;13
ld (stack4+1), a ;13
ld (stack6+1), a ;13
ld (stack8+1), a ;13
ld (stack10+1), a ;13
ld (stack12+1), a ;13
ld (stack14+1), a ;13 ->10*13+7+4=141
ld a, (LSBdestination2)
add a, c
ld (LSBdestination2), a
ld (stack1+1), a ;13
ld (stack3+1), a ;13
ld (stack5+1), a ;13
ld (stack7+1), a ;13
ld (stack9+1), a ;13
ld (stack11+1), a ;13
ld (stack13+1), a ;13
ld (stack15+1), a ;13 ->10*13+4=134
;and then loop, af' is unused
ex af, af'
dec a
jp nz, loop
Except there is a bug: the PUSH instruction does DEC SP, write byte, DEC SP, write byte, so to copy the very last bytes of a screen third, the stack pointer needs to point to the start of the next third. And since I manipulate only the less significant byte, I would need special-case code for the very last 16-byte block in each third.
Which brings me to another topic. An LDI sequence is very hard to beat when you copy from a linear buffer to the screen using the full width of 32 bytes. Why? Because LDI always gives you the next source address for free. But many games don't actually use the full width of the screen, and in that case the source pointer has to be adjusted every line, exactly like when using the POP/PUSH method. So, if you don't use a full-width copy, not only does LDI lose the advantage of always having a correct source pointer, but the POP/PUSH method also has one or more registers available for loop management. And that is exactly the edge that POP/PUSH needs to be faster than LDI.
The second thing is that games using POP/PUSH often don't use a linear buffer at all. With a linear buffer, you copy the left side, update pointers, copy the right side, update pointers again and so on.
But what if you don't have a linear buffer but a buffer that mimics the screen organization? And what if you don't copy your byte blocks in the usual way: left, right, left, right?
So imagine you have a 4KB buffer that has the same layout as the upper two thirds of the screen. What if you copy 16 bytes on the left side and then just another 16 bytes right under them? It makes the pointer arithmetic incredibly simple: for the next 16 bytes you increase the high byte of the source by one and you increase the high byte of the destination by one.
Code: Select all
ld sp, ix
;do some POPs
ld sp, iy
;do some PUSHes
inc ixh
inc iyh
ld sp, ix
;do some POPs
ld sp, iy
;do some PUSHes
;.... do it for 8 pixel lines
When you are done with a chunk of 8 lines on the left side, you update your pointers to copy the right side. And because you do not do the ugly pointer arithmetic on every line (actually twice on every line) but only twice per group of 8 lines, it will save you time.
I hope this helps.
Anyway, for copying the full width of the screen from a linear buffer I would use good old
Code: Select all
pop de ;pop destination from table
ld c, d
ldi
ldi
ldi
;....repeat LDI 32 times
dec a
jp nz, loop
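The screen-shaped buffer idea above can be prototyped offline. This Python sketch rearranges a linear 32-byte-wide image into the Spectrum's interleaved order, so that stepping one pixel line down inside a character row is just +256 on both source and destination. `screen_offset` and `to_screen_layout` are illustrative names, and the 128-line default is chosen only to match the 4KB buffer mentioned in the post.

```python
def screen_offset(y: int, x: int) -> int:
    """Offset of byte column x on pixel line y within a screen-shaped buffer."""
    return ((y & 0xC0) << 5) | ((y & 0x07) << 8) | ((y & 0x38) << 2) | x


def to_screen_layout(linear: bytes, height: int = 128, width: int = 32) -> bytearray:
    """Rearrange a linear image (height x width bytes) into screen order."""
    if height % 64 or not 0 < width <= 32:
        raise ValueError("height must be whole screen thirds, width 1..32 bytes")
    if len(linear) != height * width:
        raise ValueError("linear buffer has the wrong size")
    out = bytearray(height * 32)  # 128 lines -> 4096 bytes
    for y in range(height):
        for x in range(width):
            out[screen_offset(y, x)] = linear[y * width + x]
    return out


# Dummy image data, just to exercise the conversion.
linear = bytes((y * 7 + x) & 0xFF for y in range(128) for x in range(32))
buffer = to_screen_layout(linear)
print(len(buffer), screen_offset(1, 0) - screen_offset(0, 0))
```

The conversion would normally be done when the image is prepared, not at run time, so its cost does not count against the copy loop.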
-
- Microbot
- Posts: 148
- Joined: Fri Nov 24, 2017 5:09 pm
- Location: Syracuse, NY, USA
- Contact:
Re: improving the Speed of my push/pop screen routine
catmeows wrote: ↑Fri Jul 15, 2022 4:07 pm When you are done with chunk of 8 lines on the left side, you update your pointers to copy right side. And because you do not do the ugly pointer arithmetic on every line (actually twice on every line) but only twice for group of 8 lines, it will save you time.
Actually that gives me a bit of an idea that I might try as a separate test. I might have 2 separate loops: a 1st loop that push/pops the 16 bytes on the left side of the screen, then a second loop that pushes and pops the 10 bytes on the right side. That should avoid my pointer math.
Might be worth a try.
-
- Microbot
- Posts: 148
- Joined: Fri Nov 24, 2017 5:09 pm
- Location: Syracuse, NY, USA
- Contact:
Re: improving the Speed of my push/pop screen routine
I made the optimization changes, and the code for it is now at
Code: Select all
https://github.com/andydansby/ZX_back_buffer/tree/main/pushpop3
According to my measurements, it takes about 476 T-states per line for each of the 8 lines in a character row. After the 8th line, the pointer math takes about 40 T-states, and moving to the next third of the screen takes an additional 98 T-states, but that only happens twice in the routine. So the entire 208x192 image takes 95076 T-states in total, measured with TICKS.
This is an additional 6% faster than the LDI method, which is pretty good.
I'm going to try the other suggestions to see where they take me. I'm interested in trying new techniques.
- Einar Saukas
- Bugaboo
- Posts: 3145
- Joined: Wed Nov 15, 2017 2:48 pm
Re: improving the Speed of my push/pop screen routine
A few ideas:
1. Instead of POP/PUSH AF twice in first block, use POP/PUSH AF once in both blocks. This way, you won't need EX AF,AF' anymore.
2. Now that you are not using EX AF,AF' anymore, you can use EX AF,AF' instead of register I to preserve the loop counter.
3. Instead of POP/PUSH BC twice in second block, use POP/PUSH HL twice in second block. This way, you can reuse the same value of BC that you have set between both POP/PUSH blocks.
4. Instead of calculating IX, store this value as an "extra column" in your data buffer. The first instruction in your first POP/PUSH block would be POP IX.
-
- Manic Miner
- Posts: 401
- Joined: Fri Jan 03, 2020 10:00 am
Re: improving the Speed of my push/pop screen routine
Have you tried LD:PUSH? Russian e-zines use this for 50 fps text scrolling.
- Einar Saukas
- Bugaboo
- Posts: 3145
- Joined: Wed Nov 15, 2017 2:48 pm
Re: improving the Speed of my push/pop screen routine
Alone Coder wrote: ↑Sun Jul 17, 2022 9:55 pm Have you tried LD:PUSH? Russian e-zines use this for 50 fps text scrolling.
I assume an unrolled loop is out of the question, since the OP specified "without taking too much white space". Otherwise I agree this would be a great option.
-
- Microbot
- Posts: 148
- Joined: Fri Nov 24, 2017 5:09 pm
- Location: Syracuse, NY, USA
- Contact:
Re: improving the Speed of my push/pop screen routine
Einar Saukas wrote: ↑Sun Jul 17, 2022 11:06 pm
Alone Coder wrote: ↑Sun Jul 17, 2022 9:55 pm Have you tried LD:PUSH? Russian e-zines use this for 50 fps text scrolling.
I assume unrolled loop is out of question, since OP specified "without taking too much white space". Otherwise I agree this would be a great option.
You are correct Einar, I didn't want to have an obscenely large routine to push graphics. I should clarify when I ask a question rather than drop slang. My experience at writing assembly is extremely limited, as I've only written very few routines compared to most people on these boards, so my attempts are rather clunky at times.
That being said, seeing other routines like this even the large ones would be good to see especially for learners like myself.
Re: improving the Speed of my push/pop screen routine
Alone Coder wrote: ↑Sun Jul 17, 2022 9:55 pm Have you tried LD:PUSH? Russian e-zines use this for 50 fps text scrolling.
What's that? Google is telling me nothing!
Re: improving the Speed of my push/pop screen routine
Fetch the data into registers using LD, then PUSH them to the display. The SP then keeps track of where you are, because you don't need to repoint it.
-
- Microbot
- Posts: 148
- Joined: Fri Nov 24, 2017 5:09 pm
- Location: Syracuse, NY, USA
- Contact:
Re: improving the Speed of my push/pop screen routine
I'm placing the various solutions as I finish them in the GIT
https://github.com/andydansby/ZX_back_buffer
long_LDI_26wide is a simple, but long LDI solution.
pushpop1 was my original push/pop solution that was slower than the Long_LDI
pushpop2 was my attempt at an optimization which, while working, made things slower. Ugh.
pushpop3 is similar to pushpop1, but with the optimizations suggested by catmeows
pushpop4 is with the first 3 optimizations suggested by Einar, but I'm still looking at the last optimization.
So far as best as I can calculate, there's a savings of 59 t-states during the loop and 1 t-state before the loop and 1 t-state during screen chunks 2 and 3.
I'm commenting the code as best as I can, perhaps some of them actually make sense.
I'll add a few more as time and brain power allows.
I'll keep looking for suggestions and various tricks.
- Einar Saukas
- Bugaboo
- Posts: 3145
- Joined: Wed Nov 15, 2017 2:48 pm
Re: improving the Speed of my push/pop screen routine
dfzx wrote: ↑Mon Jul 18, 2022 8:24 am
Alone Coder wrote: ↑Sun Jul 17, 2022 9:55 pm Have you tried LD:PUSH? Russian e-zines use this for 50 fps text scrolling.
What's that? Google is telling me nothing!
It was explained before in this same thread:
TomD wrote: ↑Fri Jul 15, 2022 2:21 pm Have you considered using the push code itself as the buffer? Something like this
Code: Select all
ld sp,0000 ; pixel row end
ld hl,0000 ; last 2 pixel bytes
push hl
ld hl,0000 ; second to last 2 pixel bytes
push hl
...
The main downside of this is loading the data into the buffer in the first place plus the huge size of the buffer memory wise but with this the average is just over 10.5 T-states per byte (not including contention).
TomD
- Einar Saukas
- Bugaboo
- Posts: 3145
- Joined: Wed Nov 15, 2017 2:48 pm
Re: improving the Speed of my push/pop screen routine
Instead of:
Code: Select all
ex af, af'
ld a,64
ex af, af'
loop1:
...
ex af, af'
dec a
jr nz, setup_next_pass
...
setup_next_pass:
ex af, af'
jp loop1
Use:
Code: Select all
ld a,64
loop1:
ex af, af'
...
ex af, af'
dec a
jp nz, loop1
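A quick tally of the loop-control overhead in the two versions, as a Python sketch using standard uncontended timings. It assumes the only difference per iteration is the counter handling shown above, and that the loop runs 64 times as in the snippet.

```python
EX_AF_AF = 4   # ex af,af'
DEC_A = 4      # dec a
JR_TAKEN = 12  # jr nz,... when the jump is taken
JP = 10        # jp / jp nz (same cost taken or not)

# Original: swap in counter, decrement, jump to setup, swap back, jump to loop.
original_overhead = EX_AF_AF + DEC_A + JR_TAKEN + EX_AF_AF + JP

# Suggested: swap at loop top, swap back at the bottom, decrement, conditional jump.
suggested_overhead = EX_AF_AF + EX_AF_AF + DEC_A + JP

ITERATIONS = 64
saving_per_iteration = original_overhead - suggested_overhead
total_saving = saving_per_iteration * ITERATIONS
print(original_overhead, suggested_overhead, total_saving)
```

The saving comes entirely from removing the extra jump through `setup_next_pass`.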
- Einar Saukas
- Bugaboo
- Posts: 3145
- Joined: Wed Nov 15, 2017 2:48 pm
Re: improving the Speed of my push/pop screen routine
Instead of:
Code: Select all
originalStack:
defw $0000
...
ld (originalStack), sp
...
finished_copy:
ld sp, (originalStack)
ret
Use:
Code: Select all
...
ld (finished_copy+1), sp
...
finished_copy:
ld sp, $0000
ret
EDIT: Fixed bug (thanks @Joefish!)
Last edited by Einar Saukas on Mon Jul 18, 2022 2:08 pm, edited 1 time in total.
Re: improving the Speed of my push/pop screen routine
(finished_copy+1), not +2
@andydansby the idea is to re-write the data part of the next LD SP,value instruction. The instruction itself is one byte, followed by two bytes of data.
Confusingly, LD SP,(address) really is two bytes of instruction followed by the address data in another two bytes. But LD SP,value is shorter, and quicker.
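The byte layout behind the +1 can be shown with a small Python model of the patch. The opcode values are the standard Z80 encodings (LD SP,nn is $31 followed by the low and high bytes; LD SP,(nn) is $ED $7B followed by the address); the helper names and the tiny code buffer are made up for the illustration.

```python
LD_SP_NN = 0x31               # LD SP,nn: 1 opcode byte + 2 operand bytes, 10 T-states
LD_SP_IND = (0xED, 0x7B)      # LD SP,(nn): 2 opcode bytes + 2 address bytes, 20 T-states
RET = 0xC9


def encode_ld_sp_imm(value: int) -> bytes:
    """Machine code for LD SP,value (operand is little-endian)."""
    return bytes([LD_SP_NN, value & 0xFF, value >> 8])


def patch_saved_sp(code: bytearray, instr_offset: int, sp: int) -> None:
    """Model of LD (finished_copy+1),SP: overwrite the operand, not the opcode."""
    code[instr_offset + 1] = sp & 0xFF
    code[instr_offset + 2] = sp >> 8


# finished_copy: ld sp,$0000 / ret
finished_copy = bytearray(encode_ld_sp_imm(0x0000) + bytes([RET]))
patch_saved_sp(finished_copy, 0, 0x5FE4)  # $5FE4 is the SP value noted in the listing
print(finished_copy.hex())
```

Patching at +2 instead would write the low byte of SP into the operand's high byte and the high byte over the following RET, which is the bug that was fixed.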