improving the Speed of my push/pop screen routine

andydansby · Post by **andydansby** » Fri Jul 15, 2022 1:15 am

Hello everyone:

I'm trying to use the PUSH/POP method to fill a screen from a buffer and I need some help with optimizing for speed.
Without trying to take too much white space, my code can be found at:

https://github.com/andydansby/zx_push_pop_screen

I'm pushing lines that are 26 characters wide, which requires two passes of push/pops per line. That seems to take longer than using LDI, which seems wrong to me.

As per my current calculations, the routine takes 584 to 589 T states per line whereas the LDI routine takes 506 T states per line.

To compile the code, just run the LUT batch file.

Any help would be appreciated.

Thanks
Andy

Turtle_Quality · Post by **Turtle_Quality** » Fri Jul 15, 2022 9:35 am

Hi Andy,
PUSH/POP is quicker: for standard registers it's 21 T states for 2 bytes (POP 10 + PUSH 11), compared to 32 T states with LDI. You also need to move the stack pointer and perform EXX commands, but it should still be faster than LDI.
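As a rough sanity check of those figures, here is a small Python sketch (not part of anyone's routine in this thread) that compares the raw copy cost of both methods using the standard uncontended Z80 timings: POP 10, PUSH 11, LDI 16. The function names are made up for illustration, and LDI's own per-line setup cost is deliberately ignored.

```python
# Uncontended Z80 timings in T-states (standard documented values).
POP_T, PUSH_T, LDI_T = 10, 11, 16


def ldi_cost(n_bytes: int) -> int:
    """Raw cost of copying n_bytes with unrolled LDI instructions."""
    return n_bytes * LDI_T


def push_pop_cost(n_bytes: int) -> int:
    """Raw cost of copying n_bytes with POP/PUSH pairs (2 bytes per pair)."""
    if n_bytes % 2:
        raise ValueError("POP/PUSH moves bytes in pairs")
    return (n_bytes // 2) * (POP_T + PUSH_T)


def overhead_budget(n_bytes: int) -> int:
    """T-states that POP/PUSH may spend per line on SP reloads, EXX and
    pointer maths before it becomes slower than the raw LDI copy."""
    return ldi_cost(n_bytes) - push_pop_cost(n_bytes)


if __name__ == "__main__":
    for width in (2, 16, 26, 32):
        print(width, ldi_cost(width), push_pop_cost(width), overhead_budget(width))
```

For a 26-byte line the raw copy is 416 T states with LDI against 273 with POP/PUSH, so the stack method only wins while its extra housekeeping stays under roughly 143 T states per line.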

But did you take memory contention into account when you were trying this? Commands accessing memory from 16-32K will be delayed if the ULA is updating the screen. It's described here https://scratchpad.fandom.com/wiki/Cont ... y%20areas.

I posted an Excel spreadsheet here before to help with calculating contention delays.

Is this to achieve a multicolour effect or is this pixel data? If multicolour, then it needs to be done repeatedly during the refresh, but you can minimise contention by doing the POPs in the border zone as much as possible.

If it's pixel data, hopefully you can try to complete it outside the screen refresh time.

dfzx · Post by **dfzx** » Fri Jul 15, 2022 10:00 am

Turtle_Quality wrote: ↑Fri Jul 15, 2022 9:35 am But did you take into account memory contention when you were trying this?

But memory contention is the same regardless of which Z80 instructions are used to move data around, isn't it? So as long as Andy has tested using the same memory areas as source and destination for his stack and LDI test cases, then memory contention won't make any relative difference?

Genuine question, BTW. I never properly understood the concept.

Turtle_Quality · Post by **Turtle_Quality** » Fri Jul 15, 2022 10:41 am

Hi @dfzx

As I understand it there are two factors affecting contention -

Does the instruction involve contended memory? That means reading from it, writing to it, or even the instruction itself being located in contended memory.

What is the beam doing when the instruction is being processed? Once the beam has finished row 192, there is no contention until the beam reaches column zero of row zero. And there is no contention while the beam is in the border area.

If you try an LDI instruction sitting at 24,000 copying from 30000 to 16384, you could get 3 contention delays - it's in Examples in the link I sent

The length of the delay depends again on the number of T states relative to column 0, there's an 8 T state cycle, and (assuming your command only includes 1 access to contended memory) the delay to your instruction will be somewhere from 0 to 6 T states.

So... you either avoid writing to screen while the beam is there, or estimate that each POP might take on average an extra 3 T states, or if you're trying to get a multicolour effect, find a way to reliably start the screen update on the same T state and ensure (with an accurate monitor debugger) that each row update takes the correct number of cycles.
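To make the 8 T state cycle concrete, here is a minimal Python model of the delay for a single contended access. The constants are assumptions taken from the commonly documented 48K Spectrum timings (first delayed T state 14335 after the interrupt, 224 T states per line, 128 contended T states per line); none of them are stated in this thread, and other models differ.

```python
# Assumed 48K Spectrum frame timings (not given in the thread).
FIRST_CONTENDED_T = 14335      # frame T-state where the first delay applies
T_PER_LINE = 224               # T-states per scan line
CONTENDED_T_PER_LINE = 128     # part of each line where the ULA fetches pixels
SCREEN_LINES = 192
DELAY_PATTERN = (6, 5, 4, 3, 2, 1, 0, 0)   # repeats every 8 T-states


def contention_delay(frame_t: int) -> int:
    """Extra T-states added to one contended memory access starting at frame_t."""
    offset = frame_t - FIRST_CONTENDED_T
    if offset < 0:
        return 0                           # still in the top border
    line, column = divmod(offset, T_PER_LINE)
    if line >= SCREEN_LINES or column >= CONTENDED_T_PER_LINE:
        return 0                           # bottom border, side border or retrace
    return DELAY_PATTERN[column % 8]


# Average delay if an access lands at a random point of the 8 T-state cycle.
AVERAGE_DELAY = sum(DELAY_PATTERN) / len(DELAY_PATTERN)

if __name__ == "__main__":
    print([contention_delay(t) for t in range(14335, 14345)])
    print(AVERAGE_DELAY)
```

The average over one cycle comes out at 2.625 T states, which is where the "about 3 extra T states per POP" estimate comes from.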

Ralf · Post by **Ralf** » Fri Jul 15, 2022 11:34 am

It would help if you posted the proper code snippet instead of giving a link to repository with several files that need compiling.

andydansby · Post by **andydansby** » Fri Jul 15, 2022 11:59 am

The routine is for Pixel data only, I'm not trying to do anything more advanced than that.

I'm not sure if anyone has had much of a chance to look at the code, but the data is 26 bytes wide. It seems to me that once you go beyond 16 bytes, you start running into trouble. For the LDI, I adjust HL, DE and BC once and let it fly.

When I use the push/pop, I set IX for the screen pointer and IY as the buffer pointer. I then use af, bc, de and hl to perform the push/pop. I repeat the same for the shadow registers. I am using the i register for the loop.

The problem seems to arise when you go beyond 16 bytes: I have to adjust IX and IY again, adding an additional 70 t-states to push the remaining 10 bytes of data. This seems to gobble up any savings gained by using push/pop as opposed to LDI.

At the end of pushing and popping the second group of pixels, I have to adjust down 1 line and change the buffer address again taking an additional 133/138 t-states.

After an entire line is written, according to my calculations, I have spent 584/589 t-states using push/pop. Using LDI, I seem to take 506 t-states.

That's why I'm a bit puzzled. Beyond 16 bytes, the LDI seems to catch up pretty handily, and over the course of 192 lines the push/pop routine is 14,976 t-states slower (78 t-states per line).
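The arithmetic behind these totals can be re-checked with a short Python tally. This is only a sketch: it takes the best-case subtotals written in the listing's own comments at face value, and the split into "moves pixel data" versus "housekeeping" is my own labelling. Note that the 70 t-state subtotal appears to include the commented-out `ld sp, (originalStack)`, and the second `ld sp, iy` carries no count in the listing, so treat the result as an estimate.

```python
# (section, T-states, moves_pixel_data) using the best-case subtotals
# from the comments in the posted listing.
SECTIONS = [
    ("ld sp, iy (point SP at buffer)",   10, False),
    ("8 x pop + exx + ex af, af'",       88, True),
    ("ld sp, ix (point SP at screen)",   10, False),
    ("8 x push + exx + ex af, af'",      96, True),
    ("mid-line adjust of IX and IY",     70, False),
    ("5 x pop + exx",                    54, True),
    ("ld sp, ix (point SP at screen)",   10, False),
    ("5 x push + exx",                   59, True),
    ("move IX down one pixel line",     133, False),
    ("advance IY, loop counter, jump",   54, False),
]
LDI_PER_LINE = 506   # figure quoted for the LDI routine
LINES = 192

total = sum(t for _, t, _ in SECTIONS)
moving = sum(t for _, t, is_data in SECTIONS if is_data)
overhead = total - moving
deficit = (total - LDI_PER_LINE) * LINES

if __name__ == "__main__":
    print("per line:", total, "moving data:", moving, "housekeeping:", overhead)
    print("behind LDI over the whole screen:", deficit)
```

On these numbers almost half of each line (287 of 584 t-states) is spent on housekeeping rather than on moving pixels, which is the part worth attacking.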

andydansby · Post by **andydansby** » Fri Jul 15, 2022 12:06 pm

Ralf wrote: ↑Fri Jul 15, 2022 11:34 am It would help if you posted the proper code snippet instead of giving a link to repository with several files that need compiling.

The reason I did that is so someone can download the entire folder and compile it completely using the PASMO compiler, whereas a snippet takes a bit of work. But here's the code if you are interested.

Code: Select all

	org $8000	
start:
	
	di
	call push_pop

return:
	ei
	
endless:
	jp endless

originalStack:
defw $0000
;;;;;;;;;;;;;;;;;;;;;;;;;;;
	org $C000
data:
incbin "girl.bin"

push_pop:

;on entry stack is #5FE6

	ld b, 0		;the first entry in the LUT
	push bc		;save to stack for when we update the loop
		
	; sp = #5FE4
	;call coords_to_address
	; uses the B register to calculate pixel row
	;results in ix
	; sp = #5FE4
	
	ld ix, $4010
		
	ld iy, data
	;iy holds our buffer data
	
	;i of ir is going to be our loop data
	;ld a, (loopsteps)	; temp to be replaced with below	
	ld a,192
	ld i, a
	
	;so at this point
	;IX is holding the screen address 
	;IY is holding our image buffer
	;I (of IR) is holding our loop info
	
	;lets save our original stack
	ld (originalStack), sp
	
loop:
; sp = #5FE4
	ld sp, iy; 		10t
	;our stack is set to image buffer
	;== 10 t-states
	
	pop af	;2		10t
	pop bc	;4		10t
	pop de	;6		10t
	pop hl	;8		10t
	exx		;	 	4t
	ex af, af';	 	4t
	pop af	;10		10t
	pop bc	;12		10t
	pop de	;14		10t
	pop hl	;16		10t
	;==88 t-states
		
	ld sp, ix; 		10t
	;our stack is set to the screen
	;== 10 t-states
	
	push hl	;16		11t
	push de	;14		11t
	push bc	;12		11t
	push af	;10		11t
	exx		;	 	4t
	ex af, af';	 	4t
	push hl	;8		11t
	push de	;6		11t
	push bc	;4		11t
	push af	;2		11t
	;==96 t-states
	
	;adjust our screen
	;ld sp, (originalStack)	;20t
	ld bc, $0a				;10t
	add ix, bc				;15t
	;adjust our buffer
	ld bc, $10				;10t
	add iy, bc				;15t
	;==70 t-states
	
	ld sp, iy; buffer
	pop bc	;18		10t
	pop de	;20		10t
	pop hl	;22		10t
	exx		;		4t
	pop bc	;24		10t
	pop de	;26		10t
	;==54 t-states
	
	ld sp, ix; 		10t
	;== 10 t-states

	push de	;26		11t
	push bc	;24		11t
	exx		;		4t
	push hl	;22		11t
	push de	;20		11t
	push bc	;18		11t
	;==59 t-states
	
	ld sp, (originalStack)	;20t
	
	;https://worldofspectrum.org/forums/discussion/comment/315782/#Comment_315782	
	ld d, ixh;		8t
	ld e, ixl;		8t
	uphl:
	inc d;			4t
	ld a,d;			4t
	and 7;			7t
	jp nz, end_of_next_line;	10t
	
	ld a,e;			4t
	add a,32;		7t
	ld e,a;			4t

	jr c, end_of_next_line;		12/7t

	ld a,d;			4t
	sub 8;			7t
	ld d,a;			4t

end_of_next_line:
	ld a,e;			4t
	ld l, $A;		7t
	sub l;			4t
	ld e,a;			4t
	
	ld ixh, d;		8t
	ld ixl, e;		8t
	;==133 to 138

	;iy holds our buffer data
	ld bc, 10				;10t
	
	add iy, bc				;15t
	
	ld a, i					;9t
	dec a					;4t
	ld i, a					;9t	
	
	jr nz, loop				;12/7 t
	;==54/59  t-states

;;end loop	
;; entire loop for 1 line is
;	==584 / 589 t-states, why oh why?
;LDI method 2 = 506T
	

ld sp, (originalStack)
;sp = #5FE4

	pop hl; clear a little junk out of the stack
ret

dfzx · Post by **dfzx** » Fri Jul 15, 2022 12:12 pm

Turtle_Quality wrote: ↑Fri Jul 15, 2022 10:41 am As I understand it there are two factors affecting contention -
...

That was the best English language explanation of contention I've yet seen! Thanks!

catmeows · Post by **catmeows** » Fri Jul 15, 2022 2:11 pm

andydansby wrote: ↑Fri Jul 15, 2022 12:06 pm
Code: Select all
;;end loop	
;; entire loop for 1 line is
;	==584 / 589 t-states, why oh why?
;LDI method 2 = 506T

The general approach is good, but you spend too much time on computing pointers for the next loop.

For example your code

Code: Select all

ld bc, $0a				;10t
add ix, bc				;15t
;adjust our buffer
ld bc, $10				;10t
add iy, bc				;15t

could be faster

Code: Select all

ld bc, $0a				;10t
add ix, bc				;15t
;adjust our buffer
ld c, $10				;7t because B is already set to 0
add iy, bc				;15

also the next part of loop management is too complicated

Code: Select all

	ld sp, (originalStack)	;20t
	
	;https://worldofspectrum.org/forums/discussion/comment/315782/#Comment_315782	
	ld d, ixh;		8t
	ld e, ixl;		8t
	uphl:
	inc d;			4t
	ld a,d;			4t
	and 7;			7t
	jp nz, end_of_next_line;	10t
	
	ld a,e;			4t
	add a,32;		7t
	ld e,a;			4t

	jr c, end_of_next_line;		12/7t

	ld a,d;			4t
	sub 8;			7t
	ld d,a;			4t

end_of_next_line:
	ld a,e;			4t
	ld l, $A;		7t
	sub l;			4t
	ld e,a;			4t
	
	ld ixh, d;		8t
	ld ixl, e;		8t
	;==133 to 138

	;iy holds out buffer data
	ld bc, 10				;10t
	
	add iy, bc				;15t
	
	ld a, i					;9t
	dec a					;4t
	ld i, a					;9t	
	
	jr nz, loop				;12/7 t
	;==54/59  t-states

could look like this

Code: Select all

;don't read the stack with ld sp, (xxxx), it is very expensive
;also don't copy IX to DE unless you need it
inc ixh
ld a, ixh
and 7
jr z, updateIx    ;this happens only once in 8 runs
                  ;so it is better to jump relative when it happens for 12 ticks
                  ;otherwise do nothing for 7 ticks
ld a, ixl
sub $0A
ld ixl, a
continue:
;now update iy
ld bc, 10
add iy, bc

ld a, i
dec a
ld i, a
jp nz, loop

ld sp, (originalStack)   ;only now is time to restore stack
ret

updateIx:
 ;update ix and go to continue

BUT it can be done much faster.

TomD · Post by **TomD** » Fri Jul 15, 2022 2:21 pm

Have you considered using the push code itself as the buffer? Something like this

Code: Select all

ld sp,0000 ; pixel row end
ld hl,0000 ; last 2 pixel bytes
push hl
ld hl,0000 ; second to last 2 pixel bytes
push hl
...

The main downside of this is loading the data into the buffer in the first place, plus the huge size of the buffer memory-wise, but with this the average is just over 10.5t per byte (not including contention).
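For anyone wondering what "loading the data into the buffer" could look like, below is a hedged Python sketch of a build-time generator. It emits the machine code for one row (`LD SP,nn` = $31, `LD HL,nn` = $21, `PUSH HL` = $E5) and includes a tiny interpreter for just those three opcodes to check the result. The names `compile_row` and `run` are hypothetical, and this is one possible way to do it rather than what any poster here actually uses.

```python
LD_SP_NN, LD_HL_NN, PUSH_HL = 0x31, 0x21, 0xE5   # Z80 opcodes


def compile_row(row_end_addr: int, data: bytes) -> bytes:
    """Emit code that writes `data` so that its last byte lands at row_end_addr - 1."""
    if len(data) % 2:
        raise ValueError("row width must be even, PUSH writes 2 bytes")
    code = bytearray([LD_SP_NN, row_end_addr & 0xFF, row_end_addr >> 8])
    # Work from the right-hand end of the row, because PUSH moves SP downwards.
    for i in range(len(data) - 2, -1, -2):
        # L receives data[i] (lower address), H receives data[i + 1].
        code += bytes([LD_HL_NN, data[i], data[i + 1], PUSH_HL])
    return bytes(code)


def run(code: bytes, memory: bytearray) -> int:
    """Interpret only the three opcodes above; returns uncontended T-states."""
    pc = sp = hl = t_states = 0
    while pc < len(code):
        op = code[pc]
        if op == LD_SP_NN:
            sp = code[pc + 1] | (code[pc + 2] << 8)
            pc, t_states = pc + 3, t_states + 10
        elif op == LD_HL_NN:
            hl = code[pc + 1] | (code[pc + 2] << 8)
            pc, t_states = pc + 3, t_states + 10
        elif op == PUSH_HL:
            sp = (sp - 1) & 0xFFFF
            memory[sp] = hl >> 8          # high byte is written first
            sp = (sp - 1) & 0xFFFF
            memory[sp] = hl & 0xFF
            pc, t_states = pc + 1, t_states + 11
        else:
            raise ValueError(f"unsupported opcode {op:#04x}")
    return t_states


if __name__ == "__main__":
    row = bytes(range(1, 27))                     # 26 dummy pixel bytes
    program = compile_row(0x4000 + 26, row)
    ram = bytearray(0x10000)
    print(len(program), "code bytes,", run(program, ram), "T-states")
```

One 26-byte row costs 55 bytes of code and 283 T states, so a 192-line image needs a bit over 10K of generated code, which is the size penalty mentioned above.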

TomD

catmeows · Post by **catmeows** » Fri Jul 15, 2022 4:07 pm

IMHO the real reason why POP/PUSH is considered faster is somewhat better timing when accessing contended memory. Fact is, I don't know how much faster it is. @Einar Saukas or @Joefish should know, they spent half of their lives making multicolors.

Anyway, something like this would be a little bit faster than LDI:

Code: Select all

loop:
 ex af, af'  ;counter in A'
 ld sp, iy   ;10
 pop bc      ;10 
 pop de      ;10 
 pop af      ;10
 pop hl      ;10
 exx         ;4
 pop bc      ;10
 pop de      ;10 
 pop ix      ;14
 pop hl      ;10 -> 98
stack0:
 ld sp, 0000
 push hl
 push ix
 push de
 push bc
 exx
 push hl
 push af
 push de
 push bc     ;->106
 ld de, 16
 add iy, de  ;->25
 ;^^^^repeat 16 times -> 16*(98+106+25)=3664

 ;now it is time to update LSB of destinations
 ld c, 32
 ld a, (LSBdestination1)
 add a, c
 ld (LSBdestination1), a
 ld (stack0+1), a  ;13, +1 targets the low operand byte of ld sp, nn
 ld (stack2+1), a  ;13
 ld (stack4+1), a  ;13
 ld (stack6+1), a ;13
 ld (stack8+1), a ;13
 ld (stack10+1), a ;13
 ld (stack12+1), a ;13
 ld (stack14+1), a ;13 ->10*13+7+4=141
 
 ld a, (LSBdestination2)
 add a, c
 ld (LSBdestination2), a
 ld (stack1+1), a  ;13
 ld (stack3+1), a  ;13
 ld (stack5+1), a  ;13
 ld (stack7+1), a ;13
 ld (stack9+1), a ;13
 ld (stack11+1), a ;13
 ld (stack13+1), a ;13
 ld (stack15+1), a ;13 ->10*13+4=134

 ;and then loop, af' is unused
 ex af, af'
 dec a
 jp nz, loop

Except there is a bug: the PUSH instruction does DEC SP, write byte, DEC SP, write byte, so to copy the very last byte of a screen third, the stack pointer needs to point into the next third of the screen. And since I manipulate only the less significant byte, I would need special-case code for the very last 16-byte block in each third.

Which brings me to another topic. An LDI sequence is very hard to beat when you copy from a linear buffer to the screen using the full width of 32 bytes. Why? Because LDI always gives you the next source address for free. But many games don't actually use the full width of the screen, and in that case the source pointer has to be adjusted every line, exactly like when using the POP/PUSH method. So if you don't use a full-width copy, not only does LDI lose the advantage of always having the correct source pointer, but the POP/PUSH method also has one or more registers available for loop management. And that is exactly the edge that POP/PUSH needs to be faster than LDI.

The second thing is that games using POP/PUSH often don't use a linear buffer at all. With a linear buffer, you copy the left side, update pointers, copy the right side, update pointers again and so on.
But what if you don't have a linear buffer but a buffer that mimics the screen organization? And what if you don't copy your byte blocks in the usual left, right, left, right order?
So imagine you have a 4KB buffer that has the same layout as the upper two thirds of the screen. What if you copy 16 bytes on the left side and then just another 16 bytes right under them? It makes pointer arithmetic incredibly simple: for the next 16 bytes you increase the high byte of the source by one and the high byte of the destination by one.

Code: Select all

ld sp, ix
;do some POPs
ld sp, iy
;do some PUSHes
inc ixh
inc iyh
ld sp, ix
;do some POPs
ld sp, iy
;do some PUSHes
;.... do it for 8 pixel lines

When you are done with a chunk of 8 lines on the left side, you update your pointers to copy the right side. And because you do not do the ugly pointer arithmetic on every line (actually twice on every line) but only twice for each group of 8 lines, it will save you time.
I hope this helps.
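The reason a single `inc ixh` / `inc iyh` is enough inside a character row follows from the Spectrum's screen layout. Here is a small Python sketch of the well-known address formula (the function name is mine) that shows the eight pixel lines of a character row sitting exactly 256 bytes apart.

```python
def screen_address(x_byte: int, y: int) -> int:
    """Address of the byte at column x_byte (0-31) on pixel line y (0-191)."""
    if not (0 <= x_byte < 32 and 0 <= y < 192):
        raise ValueError("coordinates outside the 256x192 pixel area")
    return (0x4000
            | ((y & 0xC0) << 5)    # screen third           -> address bits 11-12
            | ((y & 0x07) << 8)    # pixel line in char row -> address bits 8-10
            | ((y & 0x38) << 2)    # char row within third  -> address bits 5-7
            | x_byte)              # column                 -> address bits 0-4


def steps_within_char_rows() -> set:
    """Address difference between neighbouring pixel lines of the same character row."""
    return {
        screen_address(0, top + k + 1) - screen_address(0, top + k)
        for top in range(0, 192, 8)
        for k in range(7)
    }


if __name__ == "__main__":
    print([hex(screen_address(0, y)) for y in range(10)])
    print(steps_within_char_rows())
```

A buffer laid out the same way inherits the same property, so source and destination pointers both advance with one increment of their high bytes until a character row boundary is reached.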

Anyway, for copying full width of screen from linear buffer I would use good old

Code: Select all

pop de     ;pop destination from table
ld c, d
ldi
ldi
ldi
;....repeat LDI 32 times
dec a
jp nz, loop

andydansby · Post by **andydansby** » Fri Jul 15, 2022 10:57 pm

catmeows wrote: ↑Fri Jul 15, 2022 4:07 pm When you are done with a chunk of 8 lines on the left side, you update your pointers to copy the right side. And because you do not do the ugly pointer arithmetic on every line (actually twice on every line) but only twice for each group of 8 lines, it will save you time.

Actually that gives me a bit of an idea that I might try as a separate test. I might have 2 separate loops: a first loop that push/pops the first 16 bytes on the left side of the screen, then a second loop that pushes and pops the remaining 10 bytes on the right side. That should avoid my pointer math.

Might be worth a try.

catmeows · Post by **catmeows** » Sat Jul 16, 2022 1:46 am

Good luck

andydansby · Post by **andydansby** » Sun Jul 17, 2022 3:14 am

catmeows wrote: ↑Sat Jul 16, 2022 1:46 am Good luck

I made the optimization changes, and the code for it is now at

Code: Select all

https://github.com/andydansby/ZX_back_buffer/tree/main/pushpop3

Thanks for the suggestions @catmeows

According to my measurement, it takes about 476 T-states per line within each group of 8 lines. After the 8th line, the pointer math takes about 40 T-states, and every 1/3 of the screen takes an additional 98 T-states, but that only really happens twice in the routine. So the entire 208x192 image takes 95076 T-states in total, measured with TICKS.

This is an additional 6% faster than the LDI method, which is pretty good.

I'm going to try to set up the other suggestions to see where those pathways take me. I'm interested in trying new techniques.

Einar Saukas · Post by **Einar Saukas** » Sun Jul 17, 2022 3:11 pm

A few ideas:

1. Instead of POP/PUSH AF twice in first block, use POP/PUSH AF once in both blocks. This way, you won't need EX AF,AF' anymore.

2. Now that you are not using EX AF,AF' anymore, you can use EX AF,AF' instead of register I to preserve the loop counter.

3. Instead of POP/PUSH BC twice in second block, use POP/PUSH HL twice in second block. This way, you can reuse the same value of BC that you have set between both POP/PUSH blocks.

4. Instead of calculating IX, store this value as an "extra column" in your data buffer. The first instruction in your first POP/PUSH block would be POP IX.
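Idea 4 moves the screen address calculation out of the copy loop and into whatever tool prepares the buffer. A minimal Python sketch of such a preparation step is below. It assumes a 26-byte row, a 2-byte little-endian address placed in front of each row, and a start address of $4010 as in the posted listing; `build_rows` and `next_line` are hypothetical names, and `next_line` is simply a port of the "inc d / and 7 / add a,32 / sub 8" sequence from that listing.

```python
def next_line(addr: int) -> int:
    """Move a Spectrum screen address down by one pixel line."""
    hi, lo = addr >> 8, addr & 0xFF
    hi += 1                              # inc d
    if hi & 7:                           # and 7 / jp nz: same character row
        return (hi << 8) | lo
    lo += 32                             # add a,32: next character row
    if lo > 0xFF:                        # carry: crossed into the next third
        return (hi << 8) | (lo & 0xFF)
    return ((hi - 8) << 8) | lo          # sub 8: stay inside this third


def build_rows(image: bytes, width: int = 26, height: int = 192,
               first_addr: int = 0x4010) -> bytes:
    """Prefix every image row with the screen address the copy loop needs."""
    if len(image) != width * height:
        raise ValueError("image size does not match width * height")
    out = bytearray()
    addr = first_addr
    for y in range(height):
        out += addr.to_bytes(2, "little")        # extra column, read by POP IX
        out += image[y * width:(y + 1) * width]
        addr = next_line(addr)
    return bytes(out)


if __name__ == "__main__":
    table = build_rows(bytes(26 * 192))
    print(len(table), "bytes, first address word:", table[:2].hex())
```

The buffer grows by 2 bytes per line (384 bytes for 192 lines), and the loop pays 14 T-states for `POP IX` in place of the per-line address arithmetic.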

Alone Coder · Post by **Alone Coder** » Sun Jul 17, 2022 9:55 pm

Have you tried LD:PUSH? Russian e-zines use this for 50 fps text scrolling.

Einar Saukas · Post by **Einar Saukas** » Sun Jul 17, 2022 11:06 pm

Alone Coder wrote: ↑Sun Jul 17, 2022 9:55 pm Have you tried LD:PUSH? Russian e-zines use this for 50 fps text scrolling.

I assume an unrolled loop is out of the question, since the OP specified "without taking too much white space". Otherwise I agree this would be a great option.

andydansby · Post by **andydansby** » Mon Jul 18, 2022 12:13 am

Einar Saukas wrote: ↑Sun Jul 17, 2022 11:06 pm
Alone Coder wrote: ↑Sun Jul 17, 2022 9:55 pm Have you tried LD:PUSH? Russian e-zines use this for 50 fps text scrolling.
I assume unrolled loop is out of question, since OP specified "without taking too much white space". Otherwise I agree this would be a great option.

You are correct Einar, I didn't want to have an obscenely large routine to push graphics. I should clarify when I ask a question rather than drop slang. My experience at writing assembly is extremely limited, as I've only written very few routines compared to most people on these boards, so my attempts are rather clunky at times.

That being said, seeing other routines like this, even the large ones, would be good, especially for learners like myself.

dfzx · Post by **dfzx** » Mon Jul 18, 2022 8:24 am

Alone Coder wrote: ↑Sun Jul 17, 2022 9:55 pm Have you tried LD:PUSH? Russian e-zines use this for 50 fps text scrolling.

What's that? Google is telling me nothing!

AndyC · Post by **AndyC** » Mon Jul 18, 2022 9:45 am

Fetch the data into registers using LD, then PUSH them to the display. The SP then keeps track of where you are because you don't need to repoint it.

andydansby · Post by **andydansby** » Mon Jul 18, 2022 11:00 am

I'm placing the various solutions as I finish them in the GIT
https://github.com/andydansby/ZX_back_buffer

long_LDI_26wide is a simple but long LDI solution.
pushpop1 was my original push/pop solution that was slower than the Long_LDI.
pushpop2 was my attempt at an optimization which, while working, made things slower. Ugh.
pushpop3 is similar to pushpop1, but with the optimizations suggested by catmeows.
pushpop4 is with the first 3 optimizations suggested by Einar, but I'm still looking at the last optimization.
So far, as best as I can calculate, there's a savings of 59 t-states during the loop, 1 t-state before the loop and 1 t-state during screen chunks 2 and 3.

I'm commenting the code as best as I can; perhaps some of the comments actually make sense.

I'll add a few more as time and brain power allows.

I'll keep looking for suggestions and various tricks.

Einar Saukas · Post by **Einar Saukas** » Mon Jul 18, 2022 12:49 pm

dfzx wrote: ↑Mon Jul 18, 2022 8:24 am
Alone Coder wrote: ↑Sun Jul 17, 2022 9:55 pm Have you tried LD:PUSH? Russian e-zines use this for 50 fps text scrolling.
What's that? Google is telling me nothing!

It was explained before in this same thread:

TomD wrote: ↑Fri Jul 15, 2022 2:21 pm Have you considered using the push code itself as the buffer? Something like this
Code: Select all
ld sp,0000 ; pixel row end
ld hl,0000 ; last 2 pixel bytes
push hl
ld hl,0000 ; second to last 2 pixel bytes
push hl
...
The main downside of this is loading the data into the buffer in the first place, plus the huge size of the buffer memory-wise, but with this the average is just over 10.5t per byte (not including contention).

TomD

Einar Saukas · Post by **Einar Saukas** » Mon Jul 18, 2022 12:58 pm

Instead of:

Code: Select all

	ex af, af'
        ld a,64
	ex af, af'
loop1:
        ...
	ex af, af'
	dec a
	jr nz, setup_next_pass
        ...
setup_next_pass:
	ex af, af'
	jp loop1

Use:

Code: Select all

        ld a,64
loop1:
	ex af, af'
        ...
	ex af, af'
	dec a
	jp nz, loop1

Einar Saukas · Post by **Einar Saukas** » Mon Jul 18, 2022 1:06 pm

Instead of:

Code: Select all

originalStack:
        defw $0000
        ...
        ld (originalStack), sp
        ...
finished_copy:
        ld sp, (originalStack)
        ret

Use:

Code: Select all

        ...
        ld (finished_copy+1), sp
        ...
finished_copy:
        ld sp, $0000
        ret

EDIT: Fixed bug (thanks @Joefish!)

Joefish · Post by **Joefish** » Mon Jul 18, 2022 1:57 pm

(finished_copy+1), not +2
@andydansby the idea is to re-write the data part of the next LD SP,value instruction. The instruction itself is one byte, followed by two bytes of data.

Confusingly, LD SP,(address) really is two bytes of instruction followed by the address data in another two bytes. But LD SP,value is shorter, and quicker.
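The byte layout is easier to see written out. The Python sketch below builds both encodings (`LD SP,nn` is $31 followed by two data bytes, `LD SP,(nn)` is $ED $7B followed by two address bytes) and models what `ld (finished_copy+1), sp` does to the code in memory. The helper names are made up for this illustration, and the saved SP value $5FE4 is just the one shown in the listing's comments.

```python
RET = 0xC9


def ld_sp_immediate(value: int) -> bytes:
    """LD SP,nn: one opcode byte, then the value low byte first (10 T-states)."""
    return bytes([0x31]) + value.to_bytes(2, "little")


def ld_sp_indirect(address: int) -> bytes:
    """LD SP,(nn): two opcode bytes, then the address low byte first (20 T-states)."""
    return bytes([0xED, 0x7B]) + address.to_bytes(2, "little")


def store_sp_at(code: bytearray, offset: int, sp_value: int) -> None:
    """Model of LD (label+offset),SP: write SP as two bytes into the code."""
    code[offset:offset + 2] = sp_value.to_bytes(2, "little")


def finished_copy_block() -> bytearray:
    """The 'finished_copy: ld sp, $0000 / ret' fragment as raw bytes."""
    return bytearray(ld_sp_immediate(0x0000) + bytes([RET]))


if __name__ == "__main__":
    good = finished_copy_block()
    store_sp_at(good, 1, 0x5FE4)      # +1: lands on the two data bytes
    print("patched with +1:", good.hex())

    bad = finished_copy_block()
    store_sp_at(bad, 2, 0x5FE4)       # +2: overwrites the RET opcode
    print("patched with +2:", bad.hex())
```

With +1 the fragment becomes 31 E4 5F C9, which restores SP and returns. With +2 the low byte of SP lands in the high operand byte and the RET itself is overwritten, which is the bug that was fixed above.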
