Optimisation Challenge, for the benefit of new programmers

The place for codemasters or beginners to talk about programming any language for the Spectrum.
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

Joefish wrote: Thu May 19, 2022 11:09 am Not a huge margin, and as you say, juggling the registers to make it work could be costly. I'm trying to work out how to do all the loops and counting in registers working back from that core routine. I still need to be able to use an LDI to do the attribute. After I do that and LD SP,HL (6t) I can do EXX (4t) to get the screen address from the alternate registers. You could say that's my 12T margin almost gone, but I was doing the EXX already, so maybe only half-gone! ;)

There's also an issue that the contents of either BC or DE needs to be empty so that the POP to fetch the pixel data doesn't trash anything...

Was it yours where the loop counters are in IXL etc.? Good choice, particularly for the row counter, as they're rarely accessed compared to everything else. I only recently noticed that DJNZ is hardly worth making the effort to fit in, saving only a single cycle over DEC B and JP NZ on all but the last loop, and that method works on any register.

Interleaving sprites does really force you to use all 256 though. If you leave out a few it still puts data onto 9 separate 256-byte page boundaries, leaving at least 2K fairly useless!

I also think a better test is to render a screen that's 50% actual UDGs, so will set something up. I didn't intend for everyone to copy my first run! :lol:
I don't have any loop counters - I optimised them away!
I just increase the screen address LSB and loop if it doesn't overflow.
When it overflows, I add 8 to MSB of screen ADDR and test for it being >80 - as that bit of code only runs 3 times it doesn't have to be tight.

I've managed to remove one PUSH/POP pair as C,E,L are usually equal so I don't need to save L and can store H in C. It's not made a huge difference as my code rarely actually prints the UDGs!

In terms of wasting some sprite space, it's only worse than the .5K buffer if you use less than about half the sprites. Also, you can neatly fit a 768 byte page-aligned buffer straight after the UDGs.

I've just managed a nice speedup of the routine that updates the tilemap by using JR for the jumps that usually won't be taken, and JP for those that usually will - that had a bigger impact on T-states per frame.

More UDGs onscreen won't impact my routine speed as all the static ones will be skipped, only changes are drawn. More sprites would have an impact, though.
User avatar
Joefish
Rick Dangerous
Posts: 2087
Joined: Tue Nov 14, 2017 10:26 am

Re: Optimisation Challenge, for the benefit of new programmers

Post by Joefish »

Pobulous wrote: Thu May 19, 2022 11:54 am I don't have any loop counters - I optimised them away!
I just increase the screen address LSB and loop if it doesn't overflow.
When it overflows, I add 8 to MSB of screen ADDR and test for it being >80 - as that bit of code only runs 3 times it doesn't have to be tight.
I can see you're relying on perimeter UDGs being 0 to stop your routine printing out the screen margins though. Maybe I should have made that rule a bit stricter! It's still a valuable function. But you might as well just present it doing the full 32x24 character screen, since the margins aren't built-in to the routine any more.

For a generic game that's over-printing the margins because it uses chunky background tiles or sprites and wants the margins clipped, you'd need an extra pass to force the border characters all to 0 before you do the render.
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

Joefish wrote: Thu May 19, 2022 12:20 pm
Pobulous wrote: Thu May 19, 2022 11:54 am I don't have any loop counters - I optimised them away!
I just increase the screen address LSB and loop if it doesn't overflow.
When it overflows, I add 8 to MSB of screen ADDR and test for it being >80 - as that bit of code only runs 3 times it doesn't have to be tight.
I can see you're relying on perimeter UDGs being 0 to stop your routine printing out the screen margins though. Maybe I should have made that rule a bit stricter! It's still a valuable function. But you might as well just present it doing the full 32x24 character screen, since the margins aren't built-in to the routine any more.

For a generic game that's over-printing the margins because it uses chunky background tiles or sprites and wants the margins clipped, you'd need an extra pass to force the border characters all to 0 before you do the render.
I implemented the non-updating border checks ages ago.

My current gameloop copies the level data to the CharMap sets all the borders to zero, prints the CharMap, sets the borders back to 2 all round, pauses for 1 second, updates the CharMap from the level data, adds the sprites to the CharMap, executes one HALT, displays the CharArray on screen, runs gamelogic, loops to updating the CharMap.

The only difference I have from the original description is to have updates to CharArray to zero bytes that haven't been updated, and the display routine then skips all the zeros.

Current code of that routine:

Code: Select all

DisplayScreen:				;Copy CharMap to screen
	LD D, ScreenAddr/256	;Initialise the main registers - Worthy of note is that the LSB of ScreenAddr, ATTRAddr and CharMap will be equal throughout the runtime.
	LD B, ATTRAddr/256

CMlab:
	LD H, CharMap/256		;Must be divisible by 256

TElab:
	LD E,TopEdge			;Skip non-printing margin		
	LD L,E					;ATTRAddr and CharMap LSB can be updated likewise
	LD C,E
	JP TestZero				;Can't be at bottom edge yet, so skip over test

PrintLoop:
	
			LD A,D
			CP 80				;Check if in bottom third
			JP NZ, TestZero		;Jump if not in bottom third of screen
			
TestBottomEdge:			
			LD A,E				;Get LSB in A for further tests
BElab:		CP BottomEdge		;Test if we are in bottom non-printing area
			JP C, TestZero		;Not at bottom edge yet
			RET NZ				;This test needed to prevent possibly missing printing very last tile on the screen

TestZero:
			LD A,(HL)				;Test if next char is unchanged from last time
			AND A
			JR NZ, TestLeftSide		;Character has changed from last print, so test if in border
			
MoreZeros:
				INC L				
				JR Z, SkipZeros		;End of screenthird?
				OR (HL)
				JP Z,MoreZeros
				LD C,L
				LD E,L
				
				LD A,D
				CP 80				
				JR Z,TestBottomEdge		;If in bottom third, re-enter printloop via testing bottomborder

TestLeftSide:
			LD A,E
			AND 31
LElab:		CP LeftEdge				;Are we in left non-printing border?
			JP NC, TestRightSide	;Jump to next test if not
			LD A,E					;Add border offset to LSB ScreenAddr
			AND 224
LElab2:		ADD A,LeftEdge
			LD L,A					;ATTRAddr and CharMap LSB can be updated likewise
			LD A,(HL)				
			AND A
			JP Z, MoreZeros
			LD C,L
			LD E,L
			JP PrintChar
			
TestRightSide:
RElab:		CP RightEdge			;compare screen X position with right-hand non-printing border
			JP C, PrintChar			;Jump into printing routine if not in border
			LD A, 31				;Set ScreenAddr position to x=31 on the current line
			OR E
			LD E,A
			JP SkipPrinting			;Skip Printing, via checking for end of third of screen, which will also update L and C.

PrintChar:
			
			LD A,H
			EX AF,AF'			;Store CharMap location in A'
			
			LD A,(HL)
			CP EmptyUDGs		;Test if will only be updating ATTR	
			
			LD H,UDGs/256		;Get base address of UDGs
			LD L,A				;Select UDG 1-255
			LD A,(HL)			;copy ATTR data
			LD (BC),A
			
			JP C,PrintATTROnly	;Jump if updating ATTR only
			
			INC H				;Get first line of chardata
			LD C,D				;Store current screen address MSB

			LD A,(HL)			;copy 8 lines of UDG data to screen
			LD (DE),A
			INC H
			INC D
			LD A,(HL)
			LD (DE),A
			INC H
			INC D
			LD A,(HL)
			LD (DE),A
			INC H
			INC D
			LD A,(HL)
			LD (DE),A
			INC H
			INC D
			LD A,(HL)
			LD (DE),A
			INC H
			INC D
			LD A,(HL)
			LD (DE),A
			INC H
			INC D
			LD A,(HL)
			LD (DE),A
			INC H
			INC D
			LD A,(HL)
			LD (DE),A

			LD D,C			;Restore screen address MSB
			
PrintATTROnly:
			EX AF,AF'			;Restore CharMap location from A'
			LD H,A
			
SkipPrinting:
			INC E			;Advance screen address
			LD L,E			;Advance to next CharMap character
			LD C,E			;Advance to next ATTR position

		JP NZ, PrintLoop	;Loop until E wraps to zero - signals a third of the screen is completed
		
NextThird:
		INC B
		INC H
		LD A,D				;Adjust DE to point to next screen third
		ADD A,8
		LD D,A
		
		CP 88				;If MSB of screen addr is now 88, then we are done.
	JP C, PrintLoop			;Do next screen third
	
	RET						;Return as finished printing
	
SkipZeros:
		LD C,L
		LD E,L
		JP NextThird
Edited with a few more optimisations. Using A' instead of PUSH/POP to store HL, plus some other bits and bobs. No PUSHes at all and no-use of IX or IH. I also tweaked the border and zero checks to prevent some wasted t-states.
User avatar
Joefish
Rick Dangerous
Posts: 2087
Joined: Tue Nov 14, 2017 10:26 am

Re: Optimisation Challenge, for the benefit of new programmers

Post by Joefish »

OK, I've upped the test screen to what I think (if I've counted correctly) will draw 50% of the UDGs in the 28x20 rendering window; the remaining 50% are attribute-only (codes 0 or 1). This code does 256 passes and seems to do it in 8 seconds, though maybe FUSE can capture the exact number of frames. I make that 31ms per refresh, or 1.56 frames. If I swap the test function around and synchronise it with a HALT, that's roughly verified by where the border colour swap happens.

My original routine took over 10s to do 256 repeats, or 39 ms, and that was with that original (reduced) screen layout. With the new test map it takes 12.5s to do 256 passes, or 49ms or 2.4 frames per update. Again, born out by watching border timing. So I think I've made quite an improvement on this one!

As another approximate test, my new routine still takes a little more than one frame to draw the original test screen, though that is still looking up and painting the attributes of all the UDGs.

Code: Select all

;================================================================
; Constants
;================================================================

ENTRY_POINT		EQU 32768

CHAR_MAP		EQU 40000		; 40000..40767, 768 bytes or more
UDGs			EQU 41000		; 41000..43304, 2304 bytes
EMPTY_UDGs		EQU 2

UDG_TABLE		EQU 43520		; 43520..44031, 512 bytes, 43520=170x256

SCREEN			EQU 16384
ATTRS			EQU 22528
SCREEN_X_OFFSET	EQU 2
SCREEN_Y_OFFSET	EQU 2
SCREEN_OFFSET	EQU SCREEN_X_OFFSET + (32 * SCREEN_Y_OFFSET)

MAP_COLS		EQU 32
MAP_ROWS		EQU 24
WINDOW_COLS		EQU 28
WINDOW_ROWS		EQU 20
WINDOW_X_OFFSET	EQU 2
WINDOW_Y_OFFSET	EQU 2
WINDOW_OFFSET	EQU WINDOW_X_OFFSET + (MAP_COLS * WINDOW_Y_OFFSET)


;================================================================
; Entry / Exit
;================================================================

org ENTRY_POINT
	exx									; Alternate registers
	push hl								; Save HL' then restore before exit
	push iy								; Save IY then restore before exit
		
		
	ld a,2
	out (254),a
		
		
	ld a,0
test_loop
	ex af,af'
			
	call do_render
			
	ex af,af'
	dec a
	jp nz,test_loop
		
		
	ld a,7
	out (254),a		
		
		
	pop iy								; Restore IY for BASIC
	pop hl								; Restore HL' for BASIC
	exx
	ret									; And return.
	
	
	
;================================================================
; THIS IS THE MAIN FUNCTION
;================================================================

do_render	
	di									; No interrupts please, we're messing with the stack
	ld (restore_stack+1),sp			; POKE the stack address directly into the code for later
	
	exx
		ld de,ATTRS + SCREEN_OFFSET		; Attribute address in DE'
	exx
	
	ld hl,SCREEN + SCREEN_OFFSET		; Screen pointer in HL
	ld bc,CHAR_MAP + WINDOW_OFFSET		; Character Map pointer in BC
	
	ld ixh,WINDOW_ROWS				; Row loop counter in IXH
	
rows_loop
	ld ixl,WINDOW_COLS				; Column loop counter in IXL
		
	columns_loop
		ld a,(bc)					; Get byte from Character Map (address in BC)
		inc bc
		
		cp EMPTY_UDGs
		jr nc,draw_pixels			; Don't do pixel drawing for low UDG numbers, only high ones
		
			
			
;----------------------------------------------------------------
; Code branches at this point, either doing skip_pixels or draw_pixels.
; Note that both branches end in the same looping behaviour.
;----------------------------------------------------------------
		
	skip_pixels
		exx							; Alternate registers
			ld h,UDG_TABLE / 256	
			ld l,a					; Form table address in HL'
			ld a,(hl)
			inc h
			ld h,(hl)
			ld l,a					; Get specific address of UDG from table, again in HL'
			
			ldi						; Copy one attribute, Hl'->DE', affects BC' but not used
			
	skip_pixels_next_row
		exx					; Normal registers
		inc l				; Screen pointer to next character
		
		dec ixl
		jp nz,columns_loop	; Repeat this loop early, column loop counter in IXL
		ld a,l
		add a,32-WINDOW_COLS			; Add right + left border skip to Screen pointer in HL
		ld l,a
		exx									; Alternate registers
			ld e,a							; Lo-byte of Attribute pointer DE' is the same value
			jr c,wrap_block					; If that ADD up there rolled over then the hi-bytes need adding to
					
	block_wrapped
		exx								; Back to regular registers
		
		ld a,c
		add a,32-WINDOW_COLS			; Add right + left border skip to Character Map pointer in BC'
		ld c,a
		jr c,wrap_map					; This will re-join the loop counter below
	map_wrapped

		dec ixh
		jp nz,rows_loop					; Repeat, row loop counter in IXH
		
	restore_stack
		ld sp,0000						; Stack restore address will be POKEd in here
		ei								; Interrupts back on, now stack is restored
		ret
				
				
;----------------------------------------------------------------
	draw_pixels
		exx							; Alternate registers
			ld h,UDG_TABLE / 256	
			ld l,a					; Form table address in HL'
			ld a,(hl)
			inc h
			ld h,(hl)
			ld l,a					; Get specific address of UDG from table, again in HL'
			
			ldi						; Copy one attribute, Hl'->DE', affects BC' but not used
			
			ld sp,hl				; Point stack at rest of UDG data
		exx							; Normal registers
		
		ld a,h						; Save screen pointer, Hi-byte
		
		pop de						; Fetch 2 bytes of UDG data
		ld (hl),e					; Write one
		inc h						; then
		ld (hl),d					; the other...
		inc h
		pop de
		ld (hl),e
		inc h
		ld (hl),d
		inc h
		pop de
		ld (hl),e
		inc h
		ld (hl),d
		inc h
		pop de
		ld (hl),e
		inc h
		ld (hl),d
		
		ld h,a						; Screen pointer back to start of character
		inc l						; Then on to next character
		
		dec ixl
		jp nz,columns_loop				; Repeat, column loop counter in IXL
		
		
	draw_pixels_next_row
		ld a,l
		add a,32-WINDOW_COLS			; Add right + left border skip to Screen pointer in HL
		ld l,a
		exx								; Alternate registers
			ld e,a						; Lo-byte of Attribute pointer DE' is the same value
			jr c,wrap_block				; If that ADD up there rolled over then the hi-bytes need adding to
		exx								; Back to regular registers
		
		ld a,c
		add a,32-WINDOW_COLS			; Add right + left border skip to Character Map pointer in BC'
		ld c,a
		jr c,wrap_map
		
		dec ixh
		jp nz,rows_loop					; Repeat, row loop counter in IXH
		
		jp restore_stack				; And exit
	
	
; We could do this in-line and skip over it with JP NC, but that alwasys takes 10 cycles to execute
; JR C only takes 7 cycles if it doesn't branch (12 if it does), and it only branches here twice in
; 20 rows. So even though there's an extra 10 cycles penalty at the end here to jump back again, we
; save (18 * 3) - ((5 + 10) * 2) = 24 cycles
wrap_block
		inc d							; Adjust Hi-byte of Attribute pointer DE'
		exx
			ld a,h
			add a,8
			ld h,a						; Adjust Hi-byte of Screen pointer HL
		exx
		jp block_wrapped				; Return to routine, with alternate registers still selected
	
	
wrap_map
		inc b							; Adjust Hi-byte of Character Map pointer BC'
		jp map_wrapped					; Return to routine
	
	
;================================================================
; Character map, anywhere in memory:
;================================================================

org CHAR_MAP	
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,0,0,0,0,0,0,0,  4,5,0,0,0,1,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,3,4,5,4,5,0,  1,0,3,4,5,1,0,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,0,1,1,1,1,7,3,  1,3,1,6,5,1,5,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,1,6,7,6,7,0,  1,0,1,6,7,1,1,0,  8,0,8,0,8,2,2,2
	
	defb	2,2,2,0,8,0,8,0,  8,6,7,0,0,0,0,6,  7,0,0,0,0,0,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,0,0,8,0,0,0,  0,0,0,0,0,0,0,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,0,0,0,0,0,0,0,  4,5,0,0,0,1,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,3,4,5,4,5,0,  1,0,3,4,5,1,0,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,0,1,1,1,1,7,3,  1,3,1,6,5,1,5,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,1,6,7,6,7,0,  1,0,1,6,7,1,1,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,6,7,0,0,0,0,6,  7,0,0,0,0,0,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,0,0,8,0,0,0,  0,0,0,0,0,0,0,0,  8,0,8,0,8,2,2,2
	
	defb	2,2,2,0,8,0,8,0,  8,0,0,0,0,0,0,0,  4,5,0,0,0,1,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,3,4,5,4,5,0,  1,0,3,4,5,1,0,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,0,1,1,1,1,7,3,  1,3,1,6,5,1,5,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,1,6,7,6,7,0,  1,0,1,6,7,1,1,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,6,7,0,0,0,0,6,  7,0,0,0,0,0,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	
	
;================================================================
; UDGs, anywhere in memory:
;================================================================

org UDGs
	; Attribute, then data.
	defb	00000000b,    00000000b,00000000b,00000000b,00000000b,00000000b,00000000b,00000000b,00000000b
	defb	01011011b,    00000000b,00000000b,00000000b,00000000b,00000000b,00000000b,00000000b,00000000b
	
	defb	01110000b,    11001100b,01100110b,00110011b,10011001b,11001100b,01100110b,00110011b,10011001b
	
	defb	01000011b,    00111100b,01111110b,11111111b,11111111b,11111111b,11111111b,01111110b,00111100b
	
	defb	01000011b,    00000000b,00000011b,00001111b,00011111b,00111111b,00111111b,01111111b,01111111b
	defb	01000011b,    00000000b,11000000b,11110000b,11111000b,11111100b,11111100b,11111110b,11111110b
	defb	01000011b,    01111111b,01111111b,00111111b,00111111b,00011111b,00001111b,00000011b,00000000b
	defb	01000011b,    11111110b,11111110b,11111100b,11111100b,11111000b,11110000b,11000000b,00000000b
	
	defb	00000001b,    00000000b,00000000b,00100100b,00011000b,00011000b,00100100b,00000000b,00000000b
	
	defs 9*(256-9),0
	
	
;================================================================
; UDG table, must be on a 256-byte boundary:
;================================================================

org UDG_TABLE
	; Addresses of the 256 UDGs, just the Lo-bytes first:
	REPT 256, udg_id
		defb	(UDGs + (udg_id * 9)) & 255
	ENDM
	
	; Followed by the Hi-bytes, 256b higher in memory:
	REPT 256, udg_id
		defb	(UDGs + (udg_id * 9)) / 256
	ENDM

	
;================================================================
end ENTRY_POINT
;================================================================
User avatar
ketmar
Manic Miner
Posts: 741
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Optimisation Challenge, for the benefit of new programmers

Post by ketmar »

sorry for hijacking the thread. ;-) offtopic tile engine.
Spoiler
download XTEngine0 (sources included).

just for fun, what could be done in a day… i started it yesterday, and finished it today.

totally unoptimised, but has some nice features.

* uses about 1kb of working area (including framebuffer)
* can use the whole 32x24 area
* doesn't use IY (so it is possible to run it in IM 1)
* sprites will not flicker even if interrupted (but may be teared)
* updates only changed tiles
* automatically erases sprites at old positions when the sprite is moved
* no need to do anything special for sprite moving, just change the coords
* supports masked sprites
* has IM 2-driven sound engine from Steve Turner as a bonus ;-)
* simple platforming demo included (only kempston joystick for now)
* 48K-compatible

it prolly could be made much faster by optimising the code, and using better
data align.

WARNING! you need my UrAsm to build it! https://repo.or.cz/urasm.git
32-bit windows binary is included. use it like this:

urasm.exe main.zas
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

Do you have a .tap file available, @Joefish?

Annoyingly I have to increase all CharMap values by one to test with my routine.


On another note, what build/test process is everyone using?
I'm coding in Programmer's Notepad and assembling with Basin, exporting to .tap and testing with Fuse.
Basin's emulation of timing isn't good enough for testing in Basin.
Basin's assembler is also a bit quirky, so I really ought to use a better process.
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

OK, I've got that running on mine and it's basically still around 1/150th of a second with 5 sprites on the screen (1/250th of a second with a static screen with no sprites - but then it's not actually doing anything useful!).

I suppose that I need to count the time taken for the CharMap update routine (which handles erasing sprites as well as identifying which parts of the screen are static), but that still takes well under a 50th of a second in total for the two routines.
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

Here's my version with the latest design, plus 5 sprites, pacman controlled with 6789, two moving ghosts, two stationary ghosts.

Previously I was drawing the whole playfield initially, then only updating the part inside the borders which may not have been clear that I was respecting the borders.

The white border is the draw to screen routine, the cyan border is the routine that updates the CharMap of what to draw to the screen, including erasing sprites. This may not need to be called each frame - a The green border is the sprites being added to the CharMap. The magenta border shows time leftover in the frame.

Animating the pacman sprite is up next.

User avatar
Joefish
Rick Dangerous
Posts: 2087
Joined: Tue Nov 14, 2017 10:26 am

Re: Optimisation Challenge, for the benefit of new programmers

Post by Joefish »

I've not done a .TAP as it's fiddly to upload files at times, whereas cutting and pasting source is easy enough. I assemble with PASMO and the --tapbas extension, which generates a .TAP file with a BASIC loader attached. You can launch the file straight away in ZXSpin from the command line.

I have a batch file called _PASMO_Build.bat that looks like this:

Code: Select all

C:\Program Files\PASMO\pasmo --tapbas --alocal %~f1 %~dpn1.tap
pause
start "" "C:\Program Files\ZXSpin.exe" "%~dpn1.tap"
exit
All I have to do is drag my assembly file over it and it builds, pauses to let me read the results, then (re-)launches ZXSpin with the .TAP file it just built.

I should have thought of your treatment of #0 specifically, sorry. Can you paste back your version of the map with everything +1 then? Or I can update mine and slip in another blank UDG. Make it easier to compare like-for-like.

I just realised the flaw in my own time counting. Drawing UDGs with a POP to fetch data may overall be quicker, but for the cells that are just attributes, it still has to go through the faff of looking up their address in the table, whereas interleaved data would jump straight to the attribute. Maybe I could simply copy the attributes for those first few UDGs as the first bytes in my address table and handle things differently at that point. That should speed things up.

I'm also trying to work out how well the drawing functions perform under contention. The LD / INC H method to copy a byte takes 22 T-cycles, so will be extended to 24t when running under contention. Or 48t to copy 2 bytes. The POP / LD method takes 21t for the first byte and 11t for the next one (32t total) but contention will extend these to 24t and 16t, so 40t to copy 2 bytes. Not sure whether than counts as a 3x8=24t saving or a 4x8=32t saving, but when you deduct the extra 22t it takes to do a table lookup for the UDG and the extra 6t for that LD SP,HL, it's a borderline case. But the uncontended drawing should still be faster.

I'm still seriously impressed with how fast yours skips all the empty spaces when running through the screen though. Do you think it's faster to separately force the margins to character #0, then skip them in the render, rather than having loop counters and ADDs to skip over the margins?

Also just realised the obvious and easy way to get 256 UDGs into an interleaved state - draw them in a paint package in the top 1/3 of the screen and then copy them straight from the screen memory!
User avatar
Joefish
Rick Dangerous
Posts: 2087
Joined: Tue Nov 14, 2017 10:26 am

Re: Optimisation Challenge, for the benefit of new programmers

Post by Joefish »

Pobulous wrote: Fri May 20, 2022 9:39 am OK, I've got that running on mine and it's basically still around 1/150th of a second with 5 sprites on the screen
But that's not drawing anything but the sprites each time, is it? And just skipping over everything else?
How long does it take if it's actually rendering all the characters I've defined, including setting the attributes of all the empty cells? As that's what you have to do first time it runs to paint the basic screen. I'm running it 256 times to drag it out and be able to measure the time it takes in human terms. Saying you only need to draw it once in the first pass of the loop is rather dodging the whole point of the test. This is a Sinclair machine, not a Volkswagen! :lol:

I'm not knocking how that method works in a game - obviously it's fast. But that's working in a static screen game. If you have a scrolling game - particularly a parallax one, which is how this all started - then you're going to be redrawing a lot more than a few 2x2 char sprites each game frame. That's why I'm looking to the full redraw as the performance marker.
User avatar
Joefish
Rick Dangerous
Posts: 2087
Joined: Tue Nov 14, 2017 10:26 am

Re: Optimisation Challenge, for the benefit of new programmers

Post by Joefish »

OK, I've fixed it so that for Attribute-only cells, the first byte in the UDG lookup table is actually the attribute, so it can be fetched and copied as quickly as it could from an interleaved store. There's an improvement in speed, but it's still around 1.4 frames to render the new case with 50% UDGs and 50% attributes only.
User avatar
ketmar
Manic Miner
Posts: 741
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Optimisation Challenge, for the benefit of new programmers

Post by ketmar »

by the way, my engine is not *so* offtopic. it is using 32x24 tilemap, non-interleaved 9-byte UDGs, draws tile #0 with a special code as "empty tile", automatically tracks dirty regions, and most of the time you need only to set the tile to the new value for it to appear (that's how Rex collecting those gold bars — by simply changing the tilemap bytes).

also, each tile row can be anywhere at the screen ([0..23]), you can render less rows by loading B with less than 24, and you can render less columns by simply using something less than 32 for a row.

i marked it "offtopic", because while it respects the "spirit" of the rules, it doesn't follow the "letter". but the idea is there. and most data structures are neatly packed instead of being interleaved, so the engine is using less than a kilobyte of RAM for its internal needs. (UDG and sprite graphics is not interleaved too, so you don't have to waste memory for unused UDGs.)

also, it is almost a full "game engine", you only need to add sprite logic (which is done by attaching "ai subroutine" to each sprite). it is missing coldet, tho, but i'm planning to add player-vs-monsters coldet in the next version. the code is commented (sometimes ;-), and doesn't use many tricks besides unrolled loops, so it should be easy to start with it.

it is quite slow with the initial paint, tho (about 4 frames or such), so it isn't suitable for scrollers. but flip-screen platformers should be ok.

also, hey, screenshot usually works fine to attract people's attention! so here it is (with sprite tearing, sorry ;-):
Spoiler
Image
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

ketmar wrote: Fri May 20, 2022 12:56 pm by the way, my engine is not *so* offtopic. it is using 32x24 tilemap, non-interleaved 9-byte UDGs, draws tile #0 with a special code as "empty tile", automatically tracks dirty regions, and most of the time you need only to set the tile to the new value for it to appear (that's how Rex collecting those gold bars — by simply changing the tilemap bytes).

also, each tile row can be anywhere at the screen ([0..23]), you can render less rows by loading B with less than 24, and you can render less columns by simply using something less than 32 for a row.

i marked it "offtopic", because while it respects the "spirit" of the rules, it doesn't follow the "letter". but the idea is there. and most data structures are neatly packed instead of being interleaved, so the engine is using less than a kilobyte of RAM for its internal needs. (UDG and sprite graphics is not interleaved too, so you don't have to waste memory for unused UDGs.)

also, it is almost a full "game engine", you only need to add sprite logic (which is done by attaching "ai subroutine" to each sprite). it is missing coldet, tho, but i'm planning to add player-vs-monsters coldet in the next version. the code is commented (sometimes ;-), and doesn't use many tricks besides unrolled loops, so it should be easy to start with it.

it is quite slow with the initial paint, tho (about 4 frames or such), so it isn't suitable for scrollers. but flip-screen platformers should be ok.

also, hey, screenshot usually works fine to attract people's attention! so here it is (with sprite tearing, sorry ;-):
Spoiler
Image
There's a dodgy cert on your image host and I can't view it, even when pasting the address into my browser.
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

Mine skips empty space by:

LD A,(HL)
AND A
JR NZ, notzero
skipzeros:
INC L
JR Z, endofscreenthird
OR (HL)
JP Z,skipmorezeros

It's well over a frame if it has to update all the graphics, so not suitable for scrolling.
On the plus side it's simple to use and fast enough to add game logic and still hit 50fps

The gameloop is:
Call routine to Copy LevelMap to CharMap
Call routine to DrawSprites
Halt
Call Display CharMap
Call GameLogic
Loop

User just needs to provide the GameLogic function

This is a version where the copy LevelMap to CharMap routine is replaced by an LDIR (so no zeroing is done)
It ends up running 1/3 the speed of the optimised version, so just over 2 frames for this routine.
Trouble is that the optimisations I made for testing for zeros will be slowing it down.

User avatar
ketmar
Manic Miner
Posts: 741
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Optimisation Challenge, for the benefit of new programmers

Post by ketmar »

mine has several optimisations: first, each row has "need work" flag, so if nothing was changed, the whole row is skipped. second, each tile has "copy" and "dirty" flags. "copy" means "just copy UDG directly to screen, bypassing backbuffer" — it is used to erase sprites when they moved out of tiles. and "dirty" means that there are some sprites at this row, so the engine copies UDG to backbuffer, then overlays all sprites, then copies backbuffer to the real screen.

internally, the engine creates a list of all visible sprites for each row (marking affected tiles as "dirty" by the way). then it renders the screen row-by-row. first it checks if the row has some sprites on it. in this case, it renders dirty tiles to backbuffer, then overlays the sprites to backbuffer too (so it can do properly masked sprites). then it scans the row, and blits tiles. it also changes "dirty" tiles to "copy", and "copy" to "skip". this way on the next frame the sprites will be automatically erased (if they weren't put into the row list again).

it could be done much faster if i'll remove masked sprites at all, and will simply use "sprite is just a tile" approach, but i wanted to have properly masked sprites that could be arbitrarily overlayed, and without flickering.

i deliberately avoided all stack copying tricks, so the engine could be used even in IM 1 mode with interrupts enabled. i can prolly make it 1.5-2 times faster by requiring a custom interrupt routine that knows how to restore things, and then heavily use stack transfers, and require the data to be created in the format better suitable for such blitting (possibly interspersed with machine code). but i wanted to avoid that, so the engine stays small and easily usable.

still, even with 10 24x16 sprites it is quite fast (around 20 FPS, i guess). could be made even faster with smaller sprites (it is easily doable, because each sprite in list has a pointer to "sprite printing routine", no need to change the core engine code for that).

it is quite easy to do coldet too (i just ran out of patience ;-). as each row has a list of sprites in it, with their horizontal tile positions recorded, you only need to check the corresponding lists, quickly rejecting all sprites that couldn't possibly touch, and do a precise check for "touchers". it is possible to do pixel-perfect coldet too, because each sprite in list has a pointer to the part of its graphics data, already calculated. so bounding box check is cheap, and pixel-perfect check is slightly more costly, but still not a big deal.


p.s.: sorry for the hosting. dunno what's wrong, it works for me without any problems…
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

My routine will "blit" the full CharMap (less the borders) to the screen in just under 2 frames, if I simply remove the checks for zeros. That sounds like it's faster than using the stack, @Joefish

There's a bit of optimisation to be made there, but doubtful it would free up enough time to prevent sprite handling and game logic dropping it to 17fps is doubtful.

That sounds pretty quick @ketmar - properly masked sprites, too!
User avatar
ketmar
Manic Miner
Posts: 741
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Optimisation Challenge, for the benefit of new programmers

Post by ketmar »

here's mediafire link to the archive with engine sources and .tap, and postimage link to the screenshot:
Spoiler
Image
p.s.: yes, you can actually walk there, climb ladders, drop down, and collect gold bars. even with a sound fx. and the enemies are moving too. use kempston joystick to control Rex. fire is "noclip" cheat.
User avatar
Joefish
Rick Dangerous
Posts: 2087
Joined: Tue Nov 14, 2017 10:26 am

Re: Optimisation Challenge, for the benefit of new programmers

Post by Joefish »

Pobulous wrote: Fri May 20, 2022 2:46 pm My routine will "blit" the full CharMap (less the borders) to the screen in just under 2 frames, if I simply remove the checks for zeros. That sounds like it's faster than using the stack, @Joefish
Mine does the same. If you set the border colour there's just a few lines at the bottom of the second frame left on a 48K. More so on a +2. Not much in it either way.
User avatar
Joefish
Rick Dangerous
Posts: 2087
Joined: Tue Nov 14, 2017 10:26 am

Re: Optimisation Challenge, for the benefit of new programmers

Post by Joefish »

OK, this is my fastest to render all 28x20 UDGs with no skips, but leaving the 2-character border. You can see from the border switch how many scan lines are left at the bottom of the second frame. The main row loop is fractured so that the last UDG on the row is done separately by a duplicate of the rendering code. And there are a few duplicates of the ends of the loops depending on branching, so the code doesn't have to jump back again to the main code. Can interleaved UDG data do it quicker?

Code: Select all

;================================================================
; Constants
;================================================================

ENTRY_POINT		EQU 32768

CHAR_MAP		EQU 40000		; 40000..40767, 768 bytes or more
UDGs			EQU 41000		; 41000..43304, 2304 bytes
;EMPTY_UDGs		EQU 0

UDG_TABLE		EQU 43520		; 43520..44031, 512 bytes, 43520=170x256

SCREEN			EQU 16384
ATTRS			EQU 22528
SCREEN_X_OFFSET	EQU 2
SCREEN_Y_OFFSET	EQU 2
SCREEN_OFFSET	EQU SCREEN_X_OFFSET + (32 * SCREEN_Y_OFFSET)

MAP_COLS		EQU 32
MAP_ROWS		EQU 24
WINDOW_COLS		EQU 28
WINDOW_ROWS		EQU 20
WINDOW_X_OFFSET	EQU 2
WINDOW_Y_OFFSET	EQU 2
WINDOW_OFFSET	EQU WINDOW_X_OFFSET + (MAP_COLS * WINDOW_Y_OFFSET)


;================================================================
; Entry / Exit
;================================================================

org ENTRY_POINT
	exx									; Alternate registers
	push hl								; Save HL' then restore before exit
	push iy								; Save IY then restore before exit
		
		
	ld a,0
test_loop
	ex af,af'
	
	halt
	
	ld a,2
	out (254),a
			
	call do_render
			
	ld a,7
	out (254),a		
	
	ex af,af'
	dec a
	jp nz,test_loop
		
		
	pop iy								; Restore IY for BASIC
	pop hl								; Restore HL' for BASIC
	exx
	ret									; And return.
	
	
	
;================================================================
; THIS IS THE MAIN FUNCTION
;================================================================

do_render	
	di									; No interrupts please, we're messing with the stack
	ld (restore_stack+1),sp			; POKE the stack address directly into the code for later
	
	exx
		ld de,ATTRS + SCREEN_OFFSET		; Attribute address in DE'
		ld b,UDG_TABLE / 256			; UDG Table base in B'
	exx
	
	ld hl,SCREEN + SCREEN_OFFSET		; Screen pointer in HL
	ld bc,CHAR_MAP + WINDOW_OFFSET		; Character Map pointer in BC
	
	ld ixh,WINDOW_ROWS					; Row loop counter in IXH
rows_loop

		ld a,(bc)					; Get first byte from Character Map (address in BC)
		
		exx		 					; Alternate registers	
			ld c,WINDOW_COLS*2-2	; Column loop counter in C'. Gets doubled-DECced by the LDI. Deliberately one short
	columns_loop
	
			ld h,b
			ld l,a					; Form table address in HL'
			ld a,(hl)
			inc h
			ld h,(hl)
			ld l,a					; Get specific address of UDG from table, again in HL'
			
			ldi						; Copy one attribute, Hl'->DE', affects BC'
			
			ld sp,hl				; Point stack at rest of UDG data
		exx							; Normal registers
		
		ld a,h						; Save screen pointer, Hi-byte
		
		pop de						; Fetch 2 bytes of UDG data
		ld (hl),e					; Write one
		inc h						; then
		ld (hl),d					; the other...
		inc h
		pop de
		ld (hl),e
		inc h
		ld (hl),d
		inc h
		pop de
		ld (hl),e
		inc h
		ld (hl),d
		inc h
		pop de
		ld (hl),e
		inc h
		ld (hl),d
		
		ld h,a						; Screen pointer back to start of character
		inc l						; Then on to next character
		
		inc bc						; Move to next byte in Character Map
		ld a,(bc)					; Get next byte from Character Map (redundant on last loop)
		
		exx							; Alternate registers
			dec c
			jp nz,columns_loop		; Repeat, column loop counter in C'
			inc c					; One DEC left in the tank for the LDI below...
			
			
		; After the loop, still one more UDG to render:
			
			ld h,b
			ld l,a					; Form table address in HL'
			ld a,(hl)
			inc h
			ld h,(hl)
			ld l,a					; Get specific address of UDG from table, again in HL'
			
			ldi						; Copy one attribute, Hl'->DE', affects BC' but not enough to affect B'
			
			ld sp,hl				; Point stack at rest of UDG data
		exx							; Normal registers
		
		ld a,h						; Save screen pointer, Hi-byte
		
		pop de						; Fetch 2 bytes of UDG data
		ld (hl),e					; Write one
		inc h						; then
		ld (hl),d					; the other...
		inc h
		pop de
		ld (hl),e
		inc h
		ld (hl),d
		inc h
		pop de
		ld (hl),e
		inc h
		ld (hl),d
		inc h
		pop de
		ld (hl),e
		inc h
		ld (hl),d
		
		ld h,a						; Screen pointer back to start of character
		
	draw_pixels_next_row
		ld a,l
		add a,32-WINDOW_COLS+1			; Add right + left border skip to Screen pointer in HL
		ld l,a
		exx								; Alternate registers
			ld e,a						; Lo-byte of Attribute pointer DE' is the same value
			jr c,wrap_block				; If that ADD up there rolled over then the hi-bytes need adding to
		exx								; Back to regular registers
		
		ld a,c
		add a,32-WINDOW_COLS+1			; Add right + left border skip to Character Map pointer in BC'
		ld c,a
		jr c,wrap_map
		
		dec ixh
		jp nz,rows_loop					; Repeat, row loop counter in IXH
		
	restore_stack
		ld sp,0000						; Stack restore address will be POKEd in here
		ei								; Interrupts back on, now stack is restored
		ret
		
		
; We could do this in-line and skip over it with JP NC, but that alwasys takes 10 cycles to execute
; JR C only takes 7 cycles if it doesn't branch (12 if it does), and it only branches here twice in
; 20 rows. So even though there's an extra 10 cycles penalty at the end here to jump back again, we
; save (18 * 3) - ((5 + 10) * 2) = 24 cycles
wrap_block
			inc d						; Adjust Hi-byte of Attribute pointer DE'
		exx								; Back to regular registers
		ld a,h
		add a,8
		ld h,a							; Adjust Hi-byte of Screen pointer HL
		
		ld a,c
		add a,32-WINDOW_COLS+1			; Add right + left border skip to Character Map pointer in BC'
		ld c,a
		jr c,wrap_map
		
		dec ixh
		jp nz,rows_loop					; Repeat, row loop counter in IXH
		
		jp restore_stack
	
	
wrap_map
		inc b							; Adjust Hi-byte of Character Map pointer BC'
		dec ixh
		jp nz,rows_loop					; Repeat, row loop counter in IXH
		
		jp restore_stack
	
	
;================================================================
; Character map, anywhere in memory:
;================================================================

org CHAR_MAP	
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,0,0,0,0,0,0,  0,0,0,0,0,0,0,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,0,0,0,0,0,0,0,  4,5,0,0,0,1,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,3,4,5,4,5,0,  1,0,3,4,5,1,0,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,0,1,1,1,1,7,3,  1,3,1,6,5,1,5,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,1,6,7,6,7,0,  1,0,1,6,7,1,1,0,  8,0,8,0,8,2,2,2
	
	defb	2,2,2,0,8,0,8,0,  8,6,7,0,0,0,0,6,  7,0,0,0,0,0,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,0,0,8,0,0,0,  0,0,0,0,0,0,0,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,0,0,0,0,0,0,0,  4,5,0,0,0,1,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,3,4,5,4,5,0,  1,0,3,4,5,1,0,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,0,1,1,1,1,7,3,  1,3,1,6,5,1,5,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,1,6,7,6,7,0,  1,0,1,6,7,1,1,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,6,7,0,0,0,0,6,  7,0,0,0,0,0,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,0,0,8,0,0,0,  0,0,0,0,0,0,0,0,  8,0,8,0,8,2,2,2
	
	defb	2,2,2,0,8,0,8,0,  8,0,0,0,0,0,0,0,  4,5,0,0,0,1,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,3,4,5,4,5,0,  1,0,3,4,5,1,0,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,0,1,1,1,1,7,3,  1,3,1,6,5,1,5,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,8,0,8,0,8,  0,0,1,6,7,6,7,0,  1,0,1,6,7,1,1,0,  8,0,8,0,8,2,2,2
	defb	2,2,2,0,8,0,8,0,  8,6,7,0,0,0,0,6,  7,0,0,0,0,0,0,8,  0,8,0,8,0,2,2,2
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	defb	2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2,  2,2,2,2,2,2,2,2
	
	
;================================================================
; UDGs, anywhere in memory:
;================================================================

org UDGs
	; Attribute, then data.
	defb	00000000b,    00000000b,00000000b,00000000b,00000000b,00000000b,00000000b,00000000b,00000000b
	defb	01011011b,    00000000b,00000000b,00000000b,00000000b,00000000b,00000000b,00000000b,00000000b
	
	defb	01110000b,    11001100b,01100110b,00110011b,10011001b,11001100b,01100110b,00110011b,10011001b
	
	defb	01000011b,    00111100b,01111110b,11111111b,11111111b,11111111b,11111111b,01111110b,00111100b
	
	defb	01000011b,    00000000b,00000011b,00001111b,00011111b,00111111b,00111111b,01111111b,01111111b
	defb	01000011b,    00000000b,11000000b,11110000b,11111000b,11111100b,11111100b,11111110b,11111110b
	defb	01000011b,    01111111b,01111111b,00111111b,00111111b,00011111b,00001111b,00000011b,00000000b
	defb	01000011b,    11111110b,11111110b,11111100b,11111100b,11111000b,11110000b,11000000b,00000000b
	
	defb	00000001b,    00000000b,00000000b,00100100b,00011000b,00011000b,00100100b,00000000b,00000000b
	
	defs 9*(256-9),0
	
	
;================================================================
; UDG table, must be on a 256-byte boundary:
;================================================================

org UDG_TABLE
	; Addresses of the remainder of the 256 UDGs, just the Lo-bytes first:
	REPT 256, udg_id
		defb	(UDGs + (udg_id * 9)) & 255
	ENDM
	
	; Followed by the Hi-bytes, 256b higher in memory:
	REPT 256, udg_id
		defb	(UDGs + (udg_id * 9)) / 256
	ENDM

	
;================================================================
end ENTRY_POINT
;================================================================
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

OK, I thought of a new optimisation....

Instead of copying the buffer, setting zeros, then "blitting" to screen...

How about:
Set CharMap to all 0
Set background UDGs as lower numbers
Put 0 in CharMap to delete sprites

If UDG in CharMap = 0 it's been wiped and needs pulling from background CharMap
If UDG is > background CharMap, then it's a sprite and needs printing
If UDG in CharMap = UDG in background CharMap, do nothing

I can quite cheaply compare CharMap UDG to backbuffer UDG (SET 2, H then CP (HL))

Sprites are added to CharMap
BackBuffer is updated by scrolling or whatever.

Should gain some speed on static screens, but can also gain speed on scrolling screens in some circumstances.
Does mean that sprites need to be erased by 0s when they move, though.

I just need to write the sprite erase code, but it looks promising.
User avatar
Joefish
Rick Dangerous
Posts: 2087
Joined: Tue Nov 14, 2017 10:26 am

Re: Optimisation Challenge, for the benefit of new programmers

Post by Joefish »

Again, it's optimising a game engine with a specific purpose in mind. I'm simply trying to get some speed for flat-out rendering to transform a character map into a screen display.
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

I get what you're saying but my new tweak will be more flexible than my previous routine and still provide a healthy speed boost in most cases.

Tweaks like this are exploitable because of the way I've built the initial routine, so I wouldn't say they're against the spirit of the challenge.

A bunch of routines that work best in different scenarios would still be useful for anyone wanting to get into coding without having to deal with the screen quirks.
User avatar
Joefish
Rick Dangerous
Posts: 2087
Joined: Tue Nov 14, 2017 10:26 am

Re: Optimisation Challenge, for the benefit of new programmers

Post by Joefish »

It's all useful. Although the more layers there are to it, the harder it gets to explain how to exploit it as a library function!

Having said that, I think perhaps my stack manipulation is a step too far as it means dictating that the interrupts have to be disabled. I'll have a go at the interleaved UDG data one too.

I was wondering if it was at all possible to make the actual drawing better able to synchronise with the memory contention, but there aren't really many options on how to do it anyway.
User avatar
Pobulous
Dynamite Dan
Posts: 1423
Joined: Wed Nov 15, 2017 12:51 pm

Re: Optimisation Challenge, for the benefit of new programmers

Post by Pobulous »

OK, so with a pure fill and no zeroing, I think it's taking around 1.5 frames - hard to be sure with nothing moving onscreen.
I had to use IXH and IXL as loop variables.
User avatar
Joefish
Rick Dangerous
Posts: 2087
Joined: Tue Nov 14, 2017 10:26 am

Re: Optimisation Challenge, for the benefit of new programmers

Post by Joefish »

Blimey, that's fast. That's filling every character, ATTRs and pixels? Can you post the listing that does that?

If I try interleaved UDGs and draw everything (28 x 20) and with interrupts left on I've initially got it to just over 2 frames. There are some optimisations I can pick at, but I don't see how I can save 1/2 a frame without skipping some characters, or at least ATTR-only drawing.
Post Reply