improving the Speed of my push/pop screen routine

andydansby
Microbot
Posts: 148
Joined: Fri Nov 24, 2017 5:09 pm
Location: Syracuse, NY, USA

Re: improving the Speed of my push/pop screen routine

Post by andydansby »

The very last bit might not work for me in this particular case. I'm writing this not only with the 128K in mind but also as a routine for Z88dk, so saving the stack pointer to a named address is probably more applicable here; outside this routine, Z88dk will control the stack again.

I got pushpop ver 3 (with catmeows's optimizations) working well within the framework. I haven't had a chance to try version 4 yet. Freeing up the I register (the I of IR) also let me drop the code I used to save I before launching into this routine, which really optimized another part within Z88dk. The other optimizations are in; next in line is to integrate this within Z88dk.

I'll work on placing ver4 into Z88dk soon.

The LD:PUSH routine suggested by Alone Coder scrambled my brain when I looked into it. I'm pretty sure I tried to follow its logic within the "Dozen" demo. The problem I saw is that the routine was almost as large as the image itself, which seemed too unwieldy to work with.
Einar Saukas wrote: Mon Jul 18, 2022 1:06 pm Instead of:

Code:

originalStack:
        defw $0000
        ...
        ld (originalStack), sp
        ...
finished_copy:
        ld sp, (originalStack)
        ret
Use:

Code:

        ...
        ld (finished_copy+1), sp
        ...
finished_copy:
        ld sp, $0000
        ret

EDIT: Fixed bug (thanks @Joefish!)
Einar Saukas
Bugaboo
Posts: 3147
Joined: Wed Nov 15, 2017 2:48 pm

Re: improving the Speed of my push/pop screen routine

Post by Einar Saukas »

andydansby wrote: Mon Jul 18, 2022 9:52 pm The very last bit might not work for me in this particular case. I'm writing this not only with the 128K in mind but also as a routine for Z88dk, so saving the stack pointer to a named address is probably more applicable here; outside this routine, Z88dk will control the stack again.
Saving the stack to "finished_copy+1" instead of "originalStack" won't break Spectrum 128K or z88dk.
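A minimal Python model of the trick quoted above, assuming only the standard Z80 encoding of `ld sp, nn` (opcode $31 followed by a little-endian 16-bit operand). The function names and the example SP value are hypothetical. The point is that `finished_copy+1` is the operand of the restore instruction, so the saved value lives inside the code and no separate `originalStack` word is needed. The restore also becomes `ld sp, nn` (3 bytes, 10 T-states) instead of `ld sp, (nn)` (4 bytes, 20 T-states).

```python
# Illustrative model only: a 4-byte "program" holding  finished_copy: ld sp, $0000 / ret
LD_SP_NN = 0x31   # Z80 opcode for "ld sp, nn"
RET = 0xC9        # Z80 opcode for "ret"


def patch_saved_sp(code: bytearray, finished_copy: int, sp: int) -> None:
    """Emulate "ld (finished_copy+1), sp": write SP over the operand, low byte first."""
    code[finished_copy + 1] = sp & 0xFF
    code[finished_copy + 2] = (sp >> 8) & 0xFF


def restored_sp(code: bytearray, finished_copy: int) -> int:
    """Emulate executing the patched "ld sp, nn" and return the value loaded into SP."""
    assert code[finished_copy] == LD_SP_NN
    return code[finished_copy + 1] | (code[finished_copy + 2] << 8)


code = bytearray([LD_SP_NN, 0x00, 0x00, RET])
patch_saved_sp(code, 0, 0xFF40)      # 0xFF40 is an arbitrary example stack pointer
print(hex(restored_sp(code, 0)))     # prints 0xff40
```

The one requirement is that the routine runs from RAM, since the instruction is patched in place.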
presh
Manic Miner
Posts: 237
Joined: Tue Feb 25, 2020 8:52 pm
Location: York, UK

Re: improving the Speed of my push/pop screen routine

Post by presh »

This is a really interesting thread. I'd come to the conclusion that the ONLY way that PUSH/POP could beat LDI was with it completely unrolled, with just a LD SP, nn between every set of PUSHes/POPs. That put me off due to the enormous size of the code required to transfer the full play area. All of my attempts to roll it into a loop turned out slower than a POP / LDI x width.
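A rough back-of-the-envelope check of that conclusion, using the standard Z80 timings (LDI 16, POP 10, PUSH 11, LD SP,nn 10 T-states). This is a sketch under simplifying assumptions: fully unrolled code, no memory contention, and the 4 T-state EXX / EX AF,AF' needed to reach the alternate registers is ignored. The function name is hypothetical.

```python
T_LDI = 16        # one byte copied per LDI
T_POP = 10        # two bytes read per POP
T_PUSH = 11       # two bytes written per PUSH
T_LD_SP_NN = 10   # repositioning SP with an immediate address


def push_pop_t_per_byte(pairs: int) -> float:
    """T-states per byte for: ld sp,src / POP x pairs / ld sp,dst / PUSH x pairs."""
    total = 2 * T_LD_SP_NN + pairs * (T_POP + T_PUSH)
    return total / (2 * pairs)


for pairs in (1, 2, 4, 7):
    print(pairs, "register pairs:", round(push_pop_t_per_byte(pairs), 2), "T/byte vs LDI", T_LDI)
```

With a single register pair the two SP loads dominate and LDI wins; from two pairs upwards the PUSH/POP version comes out ahead in this model, and the gap widens as more pairs are moved per SP load.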
catmeows wrote: Fri Jul 15, 2022 4:07 pm Second thing is that games using POP/PUSH often don't use linear buffer at all. With linear buffer, you copy left side, update pointers, copy right side, update pointers again and so on.
But what if you don't have a linear buffer but a buffer that mimics the screen organization? And what if you don't copy your byte blocks in the usual way: left, right, left, right?
So imagine you have a 4KB buffer that has the same layout as the upper two thirds of the screen. What if you copy 16 bytes on the left side and then just another 16 bytes right under? It makes pointer arithmetic incredibly simple: for the next 16B you increase the high byte of the source by one and you increase the high byte of the destination by one.

Code:

ld sp, ix
do some POPs
ld sp, iy
do some PUSHes
inc ixh
inc iyh
ld sp, ix
do some POPs
ld sp, iy
do some PUSHes
.... do it for 8 pixel lines
When you are done with a chunk of 8 lines on the left side, you update your pointers to copy the right side. And because you do not do the ugly pointer arithmetic on every line (actually twice on every line) but only twice per group of 8 lines, it will save you time.
I hope this helps.
That's really neat & something I hadn't thought to consider! Makes the weird speccy screen layout work in your favour. 8-)

Something to add to the "must try" list!
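For reference, a small Python sketch of the Spectrum display-file address layout that makes the `inc ixh` / `inc iyh` step work. The function name is hypothetical; the bit layout is the standard one for the 256x192 bitmap at $4000.

```python
def screen_addr(col: int, y: int) -> int:
    """Address of the display byte at character column col (0-31), pixel row y (0-191)."""
    return (0x4000
            | ((y & 0xC0) << 5)   # which third of the screen -> address bits 11-12
            | ((y & 0x07) << 8)   # pixel line within the cell -> address bits 8-10
            | ((y & 0x38) << 2)   # cell row within the third  -> address bits 5-7
            | col)                # character column           -> address bits 0-4


# Within one character cell, the next pixel line is always exactly $0100 higher,
# so only the high byte of the pointer changes.
for y in range(192):
    if y % 8 != 7:
        assert screen_addr(2, y + 1) - screen_addr(2, y) == 0x0100

print(hex(screen_addr(2, 64)))   # prints 0x4802 (middle third, cell row 0, column 2)
```

A buffer laid out the same way keeps that property for the source pointer too, which is why both pointers can be advanced with a single high-byte increment per pixel line.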
presh
Manic Miner
Posts: 237
Joined: Tue Feb 25, 2020 8:52 pm
Location: York, UK

Re: improving the Speed of my push/pop screen routine

Post by presh »

catmeows wrote: Fri Jul 15, 2022 4:07 pm So imagine you have a 4KB buffer that has same layout as upper two thirds of screen. What if you copy 16 bytes on the left side and then just another 16 bytes right under ? It makes pointer arithmetic incredibly simple: for the next 16B you increase high byte of source by one and you increase high byte of destination by one.

Code:

ld sp, ix
do some POPs
ld sp, iy
do some PUSHes
inc ixh
inc iyh
ld sp, ix
do some POPs
ld sp, iy
do some PUSHES 
.... do it for 8 pixel lines 
OK, so I took @catmeows's idea and ran with it - this copies the middle block of the screen to the top block (28 x 8 cells, centered horizontally).

Following the Spectrum's screen layout - and also working DOWN the left-hand side, then UP the right-hand side of each half-row cell - lends itself to some *surprisingly* simple pointer calculations! :shock:

Hopefully everything is explained clearly in the comments, I think it's pretty nifty how it all comes together! :)

Code:

  ORG 32768

  ; Push alternate registers + IY
  EXX
  PUSH BC
  PUSH DE
  PUSH HL
  PUSH IY
  
  
do_forever:   

  ; --- Interrupts! --- ;
  
  ; Interrupts on! Reload the IY saved above first (the ROM interrupt routine uses it), keeping a copy on the stack
  POP IY
  PUSH IY
  EI 
  
  ; Wait for start of frame (to show border stripes)
  HALT
  
  ; Disable interrupts to prevent corruption
  DI
  
  ; --- Save SP --- ;
  LD (restore_sp), SP   ; [20]

  ; --- Set up src & dest --- ;
  
  LD IX, $4802     ; from block 1, row 0, col 2
  LD IY, $4002 +14 ; to   block 0, row 0, col 2
  

  ; ===== BEGIN! ===== ;

block_start:

  ; START OF BLOCK

  LD A, 8     ; [7] 8 cell rows in block

thr_loop:

  OUT (254), A    ; change border colour at start of each cell row, so we can see where beam is!
    
  EX AF, AF'  ; [4]  store loop counter
  
  
  ; --- Transfer left-hand side --- ;
  
  ; Transfer 8 pixel rows, INCrementally
  
; Do 8 pixel rows
REPT 8
  ; Get row from source
  LD SP, IX   ; [10]
  POP AF      ; [10]
  POP BC      ; [10]
  POP DE      ; [10]
  POP HL      ; [10]
  EXX         ; [4]
  POP BC      ; [10]
  POP DE      ; [10]
  POP HL      ; [10]
  ; Put row at destination
  LD SP, IY   ; [10]
  PUSH HL     ; [11]
  PUSH DE     ; [11]
  PUSH BC     ; [11]
  EXX         ; [4]
  PUSH HL     ; [11]
  PUSH DE     ; [11]
  PUSH BC     ; [11]
  PUSH AF     ; [11]
  ; Move to next pixel row
  INC IXH     ; [8]
  INC IYH     ; [8]
ENDM
  ORG $-4     ; skip back 4 bytes (the two INC instructions) to stay on bottom pixel row 
  ; [1512] TOTAL for this unrolled loop
  
  
  ; --- Move to RHS --- ;
  
  ; To save time, DON'T return to the top pixel of the cell row! There is no need to do this.
  ; Instead, for the right-hand side, continue from the bottom pixel row (where IXH & IYH already are) and work upwards (DEC IXH & IYH)
  ; (This also means that when we finish, IXH & IYH will be back where we started on the top pixel row,
  ; so to continue to the next cell row within the same block, we only have to adjust IXL & IYL) :)
  
  ; So we only need to move IXL & IYL across, and as a bonus,
  ; IYL is already where IXL wants to be! :)
  LD A, IYL       ; [8]
  LD IXL, A       ; [8]
  ; Then to calculate IYL, just +14 
  ADD A, 14       ; [7]
  LD IYL, A       ; [8]
  
  
  ; --- Transfer right-hand side --- ;
  
  ; Transfer 8 pixel rows DECrementally 
  
; Do 8 pixel rows
REPT 8
  ; Get row from source
  LD SP, IX   ; [10]
  POP AF      ; [10]
  POP BC      ; [10]
  POP DE      ; [10]
  POP HL      ; [10]
  EXX         ; [4]
  POP BC      ; [10]
  POP DE      ; [10]
  POP HL      ; [10]
  ; Put row at destination
  LD SP, IY   ; [10]
  PUSH HL     ; [11]
  PUSH DE     ; [11]
  PUSH BC     ; [11]
  EXX         ; [4]
  PUSH HL     ; [11]
  PUSH DE     ; [11]
  PUSH BC     ; [11]
  PUSH AF     ; [11]
  ; Move to previous row
  DEC IXH     ; [8]
  DEC IYH     ; [8]
ENDM
  ORG $-4     ; skip back 4 bytes (the two DEC instructions) to stay on top pixel row
  ; [1512] TOTAL for this unrolled loop
  
  
  ; --- Move down to next cell row --- ;
  
  ; IXH & IYH are back at top pixel of cell and therefore correct already :)
 
  ; Move IXL down/left
  LD A, IXL       ; [8]
  ADD A, 32-14    ; [7]
  LD IXL, A       ; [8]
  ; Move IYL relative to it - saves a LD A, IYL
  ADD A, 14       ; [7]
  LD IYL, A       ; [8]
  
  ; Loop
  EX AF, AF'      ; [4]  retrieve loop counter
  DEC A           ; [4]
  JP NZ, thr_loop ; [10] 
  
  
  ; END OF BLOCK
  
  
  ; Restore SP
restore_sp EQU $+1
  LD SP, 0    ; [10]

  ; === DONE === ;

  OUT (254), A    ; = 0
  
  JP do_forever   ; loops forever to watch the border stripes; remove this jump to reach the exit code below
  
  
  ; Get stored alt registers back
  POP IY
  POP HL
  POP DE
  POP BC
  EXX
  ; Enable interrupts
  EI
  ; DONE!
  RET        
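A Python model of the pointer walk in the listing above, as a sanity check rather than a replacement for running it. It is a sketch under stated assumptions: the function and variable names are hypothetical, only the IX/IY arithmetic is modelled (not the registers' contents), and the timing constants are the bracketed T-state figures from the comments.

```python
# Per pixel line: LD SP x2, POP x7, PUSH x7, EXX x2, INC/DEC IXH+IYH
T_ROW = 2 * 10 + 7 * 10 + 7 * 11 + 2 * 4 + 2 * 8
# Eight lines per half, minus the final INC/DEC pair that ORG $-4 overwrites
T_HALF = 8 * T_ROW - 2 * 8


def walk_28_columns(src=0x4802, dst_top=0x4002 + 14):
    """Return (source, destination) address pairs for every byte copied."""
    ixh, ixl = src >> 8, src & 0xFF
    iyh, iyl = dst_top >> 8, dst_top & 0xFF
    copies = []

    def copy_row():
        s = (ixh << 8) | ixl           # seven POPs read 14 bytes upwards from IX
        d = ((iyh << 8) | iyl) - 14    # seven PUSHes fill the 14 bytes just below IY
        copies.extend((s + i, d + i) for i in range(14))

    for _cell_row in range(8):
        for line in range(8):          # left half, top pixel line to bottom
            copy_row()
            if line < 7:
                ixh += 1
                iyh += 1
        ixl = iyl                      # LD A,IYL / LD IXL,A
        iyl = (iyl + 14) & 0xFF        # ADD A,14 / LD IYL,A
        for line in range(8):          # right half, bottom pixel line to top
            copy_row()
            if line < 7:
                ixh -= 1
                iyh -= 1
        ixl = (ixl + 32 - 14) & 0xFF   # ADD A,32-14 / LD IXL,A
        iyl = (ixl + 14) & 0xFF        # ADD A,14 / LD IYL,A
    return copies


copies = walk_28_columns()
print(T_ROW, T_HALF, len(copies))      # prints 191 1512 1792
```

In this model every destination byte is exactly $0800 below its source, and the destinations cover columns 2 to 29 of the top third once each, which matches the intent described in the comments.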
Dr beep
Manic Miner
Posts: 381
Joined: Mon Oct 01, 2018 8:53 pm

Re: improving the Speed of my push/pop screen routine

Post by Dr beep »

Instead of

EX AF,AF'
DEC A
JP NZ,

what is the value of IYL at the end?

a simple
CP n
JP NZ,

Edit: it is IXL, not IXH, so that won't work.
Edit 2: IXL is still increasing; in fact, doesn't the last ADD produce a carry, so JP NC is enough?

The OUT is for testing only isn't it?
presh
Manic Miner
Posts: 237
Joined: Tue Feb 25, 2020 8:52 pm
Location: York, UK

Re: improving the Speed of my push/pop screen routine

Post by presh »

Dr beep wrote: Fri Aug 19, 2022 11:09 am Edit 2: IXL is still increasing; in fact, doesn't the last ADD produce a carry, so JP NC is enough?
Ooh nice, that sounds like it could work too - hadn't considered that! Will test when I get time. Though I suspect the carry occurs while calculating the IXL value, i.e.

Code:

  LD A, IXL       ; [8]
  ADD A, 32-14    ; [7]  overflows after last line?
  JR C, done
  ... do IYL here, then loop back ...
The OUT is for testing only isn't it?
Yeah, just wanted to see how long each row takes
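The carry question can be checked on paper with a few lines of Python. This is a sketch that assumes the 28-column values from the listing (right-half IXL starting at $10, then `ADD A, 32-14` followed by `ADD A, 14`); the function names are hypothetical.

```python
def add8(a: int, n: int):
    """8-bit ADD A,n: return (result, carry flag)."""
    total = a + n
    return total & 0xFF, total > 0xFF


def carries(ixl_right_first: int = 0x10):
    """Carry flag after each of the two ADDs in the 'next cell row' step, for 8 cell rows."""
    first, second = [], []
    ixl = ixl_right_first            # IXL while copying the right half of cell row 0
    for _row in range(8):
        a, c1 = add8(ixl, 32 - 14)   # ADD A,32-14 -> IXL for the next cell row's left half
        first.append(c1)
        a2, c2 = add8(a, 14)         # ADD A,14    -> IYL for the next cell row
        second.append(c2)
        ixl = a2                     # the next right half starts where IYL was
    return first, second


first, second = carries()
print(first)    # carry after ADD A,32-14, one entry per cell row
print(second)   # carry after ADD A,14
```

In this model the carry is set only after the first ADD on the eighth cell row ($F0 + 18 wraps to $02), and the second ADD always clears it, so the test has to sit between the two ADDs as in the snippet above. `LD IXL, A` does not change the flags, so it can stay before the jump.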
Einar Saukas
Bugaboo
Posts: 3147
Joined: Wed Nov 15, 2017 2:48 pm

Re: improving the Speed of my push/pop screen routine

Post by Einar Saukas »

It's even simpler for 32 columns:

Code:

  ORG 32768

  ; Push alternate registers + IY
  EXX
  PUSH BC
  PUSH DE
  PUSH HL
  PUSH IY


do_forever:

  ; --- Interrupts! --- ;

  ; Interrupts on! Reload the IY saved above first (the ROM interrupt routine uses it), keeping a copy on the stack
  POP IY
  PUSH IY
  EI

  ; Wait for start of frame 
  HALT

  ; Disable interrupts to prevent corruption
  DI

  ; --- Save SP --- ;
  LD (restore_sp), SP   ; [20]

  ; --- Set up src & dest --- ;

  LD IX, $4800     ; from block 1, row 0, col 0
  LD IY, $4010     ; to   block 0, row 0, col 0


  ; ===== BEGIN! ===== ;

  ; START OF BLOCK


thr_loop:


  ; --- Transfer left-hand side --- ;

  ; Transfer 8 pixel rows, INCrementally

; Do 8 pixel rows
REPT 8
  ; Get row from source
  LD SP, IX   ; [10]
  POP AF      ; [10]
  POP BC      ; [10]
  POP DE      ; [10]
  POP HL      ; [10]
  EXX         ; [4]
  EX AF,AF'   ; [4]
  POP AF      ; [10]
  POP BC      ; [10]
  POP DE      ; [10]
  POP HL      ; [10]
  ; Put row at destination
  LD SP, IY   ; [10]
  PUSH HL     ; [11]
  PUSH DE     ; [11]
  PUSH BC     ; [11]
  PUSH AF     ; [11]
  EXX         ; [4]
  EX AF,AF'   ; [4]
  PUSH HL     ; [11]
  PUSH DE     ; [11]
  PUSH BC     ; [11]
  PUSH AF     ; [11]
  ; Move to next pixel row
  INC IXH     ; [8]
  INC IYH     ; [8]
ENDM
  ORG $-4     ; skip back 4 bytes (the two INC instructions) to stay on bottom pixel row


  ; --- Move to RHS --- ;

  ; To save time, DON'T return to the top pixel of the cell row! There is no need to do this.
  ; Instead, for the right-hand side, continue from the bottom pixel row (where IXH & IYH already are) and work upwards (DEC IXH & IYH)
  ; (This also means that when we finish, IXH & IYH will be back where we started on the top pixel row,
  ; so to continue to the next cell row within the same block, we only have to adjust IXL & IYL) :)

  ; So we only need to move IXL & IYL across, and as a bonus,
  ; IYL is already where IXL wants to be! :)
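  LD A, IYL       ; [8]
  LD IXL, A       ; [8]
  ; Then to calculate IYL, just +16
  ADD A, 16       ; [7]
  LD IYL, A       ; [8]  (does not change the Carry Flag)
  ; On the last cell row IYL wraps from $F0 to $00, so carry into IYH,
  ; otherwise the PUSHes land 256 bytes too low. (Untested fix; rhs_ok is a new label.)
  JR NC, rhs_ok   ; [12/7]
  INC IYH         ; [8]
rhs_ok: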


  ; --- Transfer right-hand side --- ;

  ; Transfer 8 pixel rows DECrementally
  
; Do 8 pixel rows
REPT 8
  ; Get row from source
  LD SP, IX   ; [10]
  POP AF      ; [10]
  POP BC      ; [10]
  POP DE      ; [10]
  POP HL      ; [10]
  EXX         ; [4]
  EX AF,AF'   ; [4]
  POP AF      ; [10]
  POP BC      ; [10]
  POP DE      ; [10]
  POP HL      ; [10]
  ; Put row at destination
  LD SP, IY   ; [10]
  PUSH HL     ; [11]
  PUSH DE     ; [11]
  PUSH BC     ; [11]
  PUSH AF     ; [11]
  EXX         ; [4]
  EX AF,AF'   ; [4]
  PUSH HL     ; [11]
  PUSH DE     ; [11]
  PUSH BC     ; [11]
  PUSH AF     ; [11]
  ; Move to previous row
  DEC IXH     ; [8]
  DEC IYH     ; [8]
ENDM
  ORG $-4     ; skip back 4 bytes (the two DEC instructions) to stay on top pixel row


  ; --- Move down to next cell row --- ;

  ; IXH & IYH are back at top pixel of cell and therefore correct already :)

  ; Move IXL down/left
  ; IYL is already where IXL wants to be! :)
  LD A, IYL       ; [8]
  LD IXL, A       ; [8]
  ; Then to calculate IYL, just +16 (except decrementing first so we can check Carry Flag)
  DEC A           ; [4]
  ADD A, 17       ; [7]
  LD IYL, A       ; [8]

  ; Loop
  JP NC, thr_loop ; [10]


  ; END OF BLOCK


  ; Restore SP
restore_sp EQU $+1
  LD SP, 0    ; [10]

  ; === DONE === ;

  JP do_forever   ; endless demo loop; remove this jump to reach the exit code below
  

  ; Get stored alt registers back
  POP IY
  POP HL
  POP DE
  POP BC
  EXX
  ; Enable interrupts
  EI
  ; DONE!
  RET        
It should work but I didn't test it :)
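Since the listing is untested, one detail worth checking on paper is the destination pointer for the right half. With 32 columns the PUSHes for the right half need SP to start one byte past the end of the line, and on the last cell row that address crosses a 256-byte page. The Python sketch below (hypothetical names, top pixel line of the top third only) compares an 8-bit add to IYL alone against one that also carries into IYH.

```python
def right_half_iy(cell_row: int, carry_into_high: bool) -> int:
    """IY used as the PUSH start address for the right half of a cell row (top pixel line)."""
    iyh = 0x40
    iyl_left = (cell_row * 32 + 16) & 0xFF   # IYL while copying the left half
    total = iyl_left + 16                    # ADD A,16
    iyl = total & 0xFF
    if carry_into_high and total > 0xFF:
        iyh += 1                             # propagate the carry into IYH
    return (iyh << 8) | iyl


def wanted(cell_row: int) -> int:
    """One byte past the last column of the line: where SP must be before the PUSHes."""
    return 0x4000 + cell_row * 32 + 32


wrong_rows = [r for r in range(8) if right_half_iy(r, False) != wanted(r)]
print(wrong_rows)                            # prints [7]
print(hex(right_half_iy(7, False)), hex(wanted(7)))   # prints 0x4000 0x4100
```

In this model only the last cell row is affected: without the carry, its right half would be written 256 bytes too low (one pixel line up, with the top line landing in ROM). The 28-column version does not hit this, because its largest IYL value is $FE.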