3D Experimenting

Heimdall · Post by **Heimdall** » Sun Jun 21, 2020 7:29 am

While Spectrum Next will be the target platform for me, it would be crazy not to do some initial work targeting Z80, as the majority of ASM ops are the same anyway. Plus, it might prove interesting to use the 28 MHz against the standard 256x192 resolution.

For quick prototyping, I have a DirectX/C++ project, where I can specify target resolution/bitdepth and a FrameBuffer in RAM. This is great for high-level scene and engine experimenting. Only what passes through this stage gets the ASM implementation.
Also, this project is where I have a core 6502 dev emulator implemented. I will use that codebase to expand it to Z80. This will allow me to use real Z80 code alongside the C++, which is extremely useful in figuring out if certain implementations are worth the effort, as I have a built-in cycle summary - hence I get a very precise benchmark, automatically.

Examining the game library on the Spectrum, I think a good reference candidate would be Hard Drivin'. I will cheat a bit here by not having a full-blown LookFrom->LookAt camera (with Roll), but I don't believe the added fun factor remotely compensates for the destroyed framerate.

I think it should be possible to get around 15 fps when there's only road in view.

I've written basic code that generates the road out to a specified view distance and alternates the color on each road segment.
The road can have generic XYZ coordinates, so hills/valleys/curves are possible. The camera can also be set at any desired position, so you can have a view from behind the car (if needed), down at bumper level, or anywhere you want.

Drawing is done at the byte level - i.e. the left/right edge of each scanline is written as a byte, and whole bytes are written for the rest of the scanline.
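A minimal C++ sketch of that byte-level span fill, for illustration only. The name `fillSpan`, the half-open pixel range and the use of OR for the edge bytes are assumptions, not the author's code; the only thing taken from the Spectrum is that bit 7 of a byte is its leftmost pixel.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: set pixels [x0, x1) on one 32-byte scanline.
// Bit 7 of each byte is the leftmost pixel of that byte.
void fillSpan(std::uint8_t* row, int x0, int x1)
{
    assert(0 <= x0 && x0 <= x1 && x1 <= 256);
    if (x0 == x1) return;                               // empty span

    const int firstByte = x0 >> 3;                      // byte holding the left edge
    const int lastByte  = (x1 - 1) >> 3;                // byte holding the right edge
    const std::uint8_t leftMask  =
        static_cast<std::uint8_t>(0xFF >> (x0 & 7));    // x0 to the end of its byte
    const std::uint8_t rightMask =
        static_cast<std::uint8_t>(0xFF << (7 - ((x1 - 1) & 7))); // start of byte to x1-1

    if (firstByte == lastByte) {                        // span lives inside one byte
        row[firstByte] |= leftMask & rightMask;
        return;
    }
    row[firstByte] |= leftMask;                         // left edge byte
    for (int b = firstByte + 1; b < lastByte; ++b)      // whole bytes in between
        row[b] = 0xFF;
    row[lastByte] |= rightMask;                         // right edge byte
}
```

Only the two edge bytes need masking; everything between them is a plain byte store, which is the part the later posts try to make as cheap as possible.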

Next step will be creating core ASM instructions and converting the code to Z80 Asm.

Post edited by PeterJ - 21/06/2020 - 09:30 BST

Heimdall · Post by **Heimdall** » Tue Jun 23, 2020 1:32 pm

I spent a day working on the core Z80 feature set:
- parsing all addressing modes
- two register banks
- simple timing syntax (both cycles and T-States) for each op

I now have:
- LD: all op modes
- PUSH/POP
- EXX, EX

Should be quite easy to add ADD/SUB/AND/OR/XOR tomorrow.

ketmar · Post by **ketmar** » Wed Jun 24, 2020 5:00 am

maybe you can simply take my Zymosis, for example? it is only 50 kb of self-contained source in C (should be standards-compliant), supports any number of CPUs, implements all Z80 instructions including undocumented ones, with precise-enough timings to emulate ZX Spectrum with contended memory. ah, and it is public domain, so no problems with licensing.

i fully understand that it is more fun to write your own, but if you will ever want to save some time... it also has Z80 disassembler included as a separate source file, and you can use liburasm as assembler (yet this one only works with GCC).

p.s.: there is also "ED trap callback", so you can implement additional Z80N instructions without modifying Zymosis sources at all.

Heimdall · Post by **Heimdall** » Wed Jun 24, 2020 6:41 am

ketmar wrote: ↑Wed Jun 24, 2020 5:00 am maybe you can simply take my Zymosis, for example? it is only 50 kb of self-contained source in C (should be standards-compliant), supports any number of CPUs, implements all Z80 instructions including undocumented ones, with precise-enough timings to emulate ZX Spectrum with contended memory. ah, and it is public domain, so no problems with licensing.

i fully understand that it is more fun to write your own, but if you will ever want to save some time... it also has Z80 disassembler included as a separate source file, and you can use liburasm as assembler (yet this one only works with GCC).

p.s.: there is also "ED trap callback", so you can implement additional Z80N instructions without modifying Zymosis sources at all.

Hey there. Cool project you got there, for sure!

But it's not just about saving time. I noticed in your thread you use Linux.

There's just no way to link it against Microsoft's DirectX/Win32 libraries and retain the Edit and Continue feature in Visual Studio.

Honestly, from vast experience with similar futile exercises in the past, it's so much frustration that I would rather spend 2 weeks in iterative daily development than deal with those library version conflicts.
You pull in one library, which requires certain environment flags that break other things down the Win32 pipeline (or affect Visual Studio functionality, like Edit and Continue).

Luckily, I don't have to spend 2 weeks to implement that.

It only took 2 days to implement Z80's ops (the ones I need) within my framework.

Also, would your library even allow me to step through each Z80 instruction within Visual Studio?
Would I be able to change an instruction, hit Continue, have just that one op recompiled and carry on debugging?

Heimdall · Post by **Heimdall** » Wed Jun 24, 2020 6:47 am

So, second day of coding the Z80 framework gave me all math and jump instructions. The ones I need anyway.

I did some additional refactoring of my framework, so if some interesting new retro platform pops up in future, it should be less work than 2 days to implement its Asm support.

I think I have everything to start converting 3d rasterizer to Z80 and do the detailed benchmarks in the target resolution.

ketmar · Post by **ketmar** » Wed Jun 24, 2020 6:52 am

ah... so you want a kind of recompiler? nope, Zymosis is a pure emulator. of course, it can execute instructions one by one, so you can simply use "step over" `zym_exec_step()`, which does exactly that -- steps over one Z80 instruction. but you have to assemble your Z80 code externally.

i guess you want something like a set of C functions to emulate Z80 instructions, so you will be able to write something like this directly in your C source?
ld(regA, 42);
push(regAF);

then sorry, you're right, Zymosis is not the right thing here.

as for OS -- Zymosis itself doesn't use anything OS-specific, and is fully self-contained. the repository is combined repo for Zymosis and ZXEmuT emulator, but you don't need ZXEmuT part, only src/libzymosis is relevant.

anyway, sorry for derailing your thread. i just didn't realise what you really need.

Heimdall · Post by **Heimdall** » Wed Jun 24, 2020 11:33 am

ketmar wrote: ↑Wed Jun 24, 2020 6:52 am ah... so you want a kind of recompiler? nope, Zymosis is a pure emulator. of course, it can execute instructions one by one, so you can simply use "step over" `zym_exec_step()`, which does exactly that -- steps over one Z80 instruction. but you have to assemble your Z80 code externally.

i guess you want something like a set of C functions to emulate Z80 instructions, so you will be able to write something like this directly in your C source?
ld(regA, 42);
push(regAF);

then sorry, you're right, Zymosis is not the right thing here.

as for OS -- Zymosis itself doesn't use anything OS-specific, and is fully self-contained. the repository is combined repo for Zymosis and ZXEmuT emulator, but you don't need ZXEmuT part, only src/libzymosis is relevant.

anyway, sorry for derailing your thread. i just didn't realise what you really need.

Correct. The Z80 backend implements Z80 instructions via C code, so I can interleave the 3D engine code with ASM implementation and debug it in place.

It's much faster this way than just trying to port the whole darn thing from C to ASM in one go.

Also, once you commit to the ASM effort, it's very hard to just discard some code. With this approach, when I see that a certain algorithm isn't performing as hoped, I simply discard it right there in Visual Studio, as the effort to get to that point is minimal compared to the classic ASM/Link/Deploy approach.
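To make the idea concrete, here is a small C++ sketch of Z80 ops implemented as host functions that also keep a running cost. It is not the author's framework: the `Z80` struct, its method names and `fillRowLoop` are hypothetical, and only the per-instruction costs (machine cycles and T-states) are taken from the Zilog timing tables.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical mini-backend: each op updates the emulated state and
// adds its documented cost to both counters.
struct Z80 {
    std::uint8_t  a = 0, b = 0;
    std::uint16_t hl = 0;
    std::vector<std::uint8_t> ram = std::vector<std::uint8_t>(65536, 0);
    long mCycles = 0;   // machine cycles
    long tStates = 0;   // clock periods

    void ldA(std::uint8_t n) { a = n;       mCycles += 2; tStates += 7; } // LD A,n
    void ldB(std::uint8_t n) { b = n;       mCycles += 2; tStates += 7; } // LD B,n
    void ldHLA()             { ram[hl] = a; mCycles += 2; tStates += 7; } // LD (HL),A
    void incHL()             { ++hl;        mCycles += 1; tStates += 6; } // INC HL
    bool djnz() {                                                         // DJNZ e
        --b;
        if (b != 0) { mCycles += 3; tStates += 13; return true; }  // branch taken
        mCycles += 2; tStates += 8;                                // falls through
        return false;
    }
};

// Fill 32 bytes starting at addr, in the looped form of the scanline fill.
void fillRowLoop(Z80& cpu, std::uint16_t addr, std::uint8_t value)
{
    cpu.hl = addr;      // set from the host side, so not costed here
    cpu.ldA(value);
    cpu.ldB(32);
    do {
        cpu.ldHLA();
        cpu.incHL();
    } while (cpu.djnz());
}
```

Because the emulated code is ordinary host code, it can be single-stepped in the debugger and the counters read after any line, which is the workflow described above.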

Heimdall · Post by **Heimdall** » Fri Jun 26, 2020 6:50 am

ketmar wrote: ↑Wed Jun 24, 2020 6:52 am i guess you want something like a set of C functions to emulate Z80 instructions, so you will be able to write something like this directly in your C source?
ld(regA, 42);
push(regAF);

Here's the best example of mixing C / Z80 from what I'm doing right now:

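	#pragma region ASM:v1		Inner Loop
		if (true)
		{
			// Two complementary dither patterns, alternated per scanline
			byte color1 = 128+32+8+2;	//	170 = %10101010
			byte color2 = 64+16+4+1;	//	85  = %01010101
			for (int cRow = 64; cRow < 128; cRow+=2)
			{
				HL = ypAddr [cRow];		// start of the even scanline (set from C, not costed)
			//	for (int cCol = 0; cCol < 32; cCol++)	RAM [HL++] = color1;
				LD (A,#170)			// A = color1
				LD (B,#32)			// B = 32 bytes per scanline
				v1rgInner1:
					LD ((HL),A)		// store pattern byte: LD (HL),A
					INC (HL)		// advance the pointer: INC HL
				DJNZ (v1rgInner1)		// loop until B reaches 0

				HL = ypAddr [cRow+1];		// start of the odd scanline
			//	for (int cCol = 0; cCol < 32; cCol++)	RAM [HL++] = color2;
				LD (A,#85)			// A = color2
				LD (B,#32)
				v1rgInner2:
					LD ((HL),A)
					INC (HL)
				DJNZ (v1rgInner2)
			}
		}
	#pragma endregion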
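The sample reads each scanline's start address from `ypAddr`. One way such a table could be built for real Spectrum display memory is sketched below; this assumes the standard interleaved layout starting at 0x4000, whereas the author's prototype may well use a linear frame buffer. `rowAddress` and `buildRowTable` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

// Address of the first byte of pixel row y (0..191) in Spectrum display memory.
constexpr std::uint16_t rowAddress(int y)
{
    return static_cast<std::uint16_t>(
        0x4000
        | ((y & 0xC0) << 5)     // screen third          -> address bits 12-11
        | ((y & 0x07) << 8)     // line within character -> address bits 10-8
        | ((y & 0x38) << 2));   // character row         -> address bits 7-5
}

// Precompute all 192 row addresses once, so the inner loops only index the table.
std::array<std::uint16_t, 192> buildRowTable()
{
    std::array<std::uint16_t, 192> table{};
    for (int y = 0; y < 192; ++y)
        table[y] = rowAddress(y);
    return table;
}
```

A table of this kind trades 384 bytes of RAM for not having to shuffle the bits of y on every scanline.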

Heimdall · Post by **Heimdall** » Sun Jun 28, 2020 10:23 am

I've worked on drawing the dithered 2-line pattern for the "Grass" section of the track. I'm currently assuming a height of up to 64 scanlines, so that we have a performance reference point when there's just a road in the viewport.

I wrote 9 versions today:

v1: ...C: vidPtr via full computation
v2: ...C: vidPtr via Array index (used for ASM)
v3: ASM: 14,794c: scanline drawn via Loop { LD (HL),A INC HL }
v4: ASM: 14,766c: v3 + registers taken out of loop
v5: ASM: 13,514c: Avoid thrashing IX, thus no need to reload it
v6: ASM: 11,914c: scanline drawn via LDIR
v7: ASM:. 9,354c: scanline drawn via unrolled LDI
v8: ASM:. 7,786c: scanline drawn via unrolled Stack operations (PUSH/POP/EXX)
v9: ASM:. 7,306c: scanline drawn via unrolled 32x { LD (HL),A INC HL }

It was surprising to see the Stack approach (v8) lose to v9, but the Stack approach does have additional overhead (EXX, thrashing all working outer loop registers, etc.) which is not present in the unrolled v9 approach.

Before I get to v10, the unrolled jump table approach I did on Atari 6502 (fastest possible, unrolled, yet generic scanline length), it would be interesting to see if rendering only visible "grass" pixels would be faster than the current brute-force redraw of all background pixels on those 64 scanlines (even if ~half is covered by road).

Theoretically, since the road occupies 924 full 8-pixel Bytes out of 2,048 (45%), the fastest loop approach (v6: LDIR) would then take (1.0 - 0.45) * 11,914 = ~6,552c. However, that's assuming a single LDIR register set-up per scanline. And we need two (one for the left side of the road, another for the right side).
And while the first one is guaranteed to start at the left-most xpos=0, the start of the second one (the right edge of the road) must be computed.
That's 64x. Even if it took just 16c (doubtful, that's barely ~4 ops), that's 1,024 additional cycles. Which then makes 6,552 + 1,024 = 7,576c.

Which is slower than v9. So, it doesn't make sense to even attempt that one, as it simply can't be faster than v9.

Since the v10 approach saved about 17% additional cycles on Atari, I don't think it's worth my time to spend a whole day on saving ~1,200c.
Even if we rounded the target framerate down to 25 fps, that's still 141,816c (2x 70,908).
1,200c is less than 1% of the frame budget, so totally irrelevant. v9 is good enough for now and totally worth the one-day effort to me.
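The estimate can be reproduced in a few lines of C++. This is only a sketch of the arithmetic: the constants are the figures quoted in the post, the function names are made up, and it uses the exact byte counts instead of the rounded percentage, so the grass-only total comes out as 7,562 rather than 7,576.

```cpp
// Figures quoted in the post (cycles as reported by the author's tool).
constexpr long kLdirFullFill = 11914;  // v6: all 2,048 bytes via LDIR
constexpr long kUnrolledFill = 7306;   // v9: unrolled LD (HL),A / INC HL
constexpr long kTotalBytes   = 2048;   // 32 bytes x 64 scanlines
constexpr long kRoadBytes    = 924;    // bytes fully covered by road
constexpr long kScanlines    = 64;

// LDIR cost if only the grass bytes are written, plus a per-scanline
// cost for working out where the right-hand run starts.
constexpr long grassOnlyCost(long edgeCostPerScanline)
{
    return kLdirFullFill * (kTotalBytes - kRoadBytes) / kTotalBytes
         + kScanlines * edgeCostPerScanline;
}

// Share of the frame budget in percent at a given frame rate.
// 70,908 per 50 Hz frame is the figure used in the post.
constexpr double percentOfFrame(long cycles, int fps)
{
    return 100.0 * cycles / (70908.0 * 50 / fps);
}
```

With these inputs `grassOnlyCost(16)` is still above `kUnrolledFill`, and `percentOfFrame(1200, 25)` stays below 1, which matches both conclusions drawn above.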

Einar Saukas · Post by **Einar Saukas** » Sun Jun 28, 2020 1:53 pm

Heimdall wrote: ↑Sun Jun 28, 2020 10:23 am v8: ASM:. 7,786c: scanline drawn via unrolled Stack operations (PUSH/POP/EXX)
v9: ASM:. 7,306c: scanline drawn via unrolled 32x { LD (HL),A INC HL }

You are doing completely different things in these 2 versions. What exactly are you trying to do?

If your goal is to fill one scanline with a single value (as shown in your C/Z80 code sample), then this should be v8:

LD L,$85
LD H,L
LD SP,address
{ PUSH HL } x16

And this should be v9:

LD A,$85
LD HL,address
{ LD (HL),A / INC L } x31
LD (HL),A

Heimdall · Post by **Heimdall** » Sun Jun 28, 2020 3:38 pm

Einar Saukas wrote: ↑Sun Jun 28, 2020 1:53 pm
Heimdall wrote: ↑Sun Jun 28, 2020 10:23 am v8: ASM:. 7,786c: scanline drawn via unrolled Stack operations (PUSH/POP/EXX)
v9: ASM:. 7,306c: scanline drawn via unrolled 32x { LD (HL),A INC HL }
You are doing completely different things in these 2 versions. What exactly are you trying to do?

If your goal is to fill one scanline with a single value (as shown in your C/Z80 code sample), then this should be v8:

LD L,$85
LD H,L
LD SP,address
{ PUSH HL } x16

And this should be v9:

LD A,$85
LD HL,address
{ LD (HL),A / INC L } x31
LD (HL),A

The Stack approach - I took it originally straight from that eBook from the section on clearing screen.
He uses the pop.push.exx combo twice, but as it's 14, not 16, I added 2 more push.

An hour later, during the walk, I realized that in theory, it should be possible to just do 16 pushes, as long as the value stays in the register, which I couldn't recall outside.

You mentioning 16x push confirms my theory that it should work, but I will double check the Z80 manual on the exact push 16-bit behavior.

As for the other approach, I noticed that you only updated the L register via INC L. On Atari, regardless of resolution, you always had to handle updating the Hi byte in the middle of a scanline.

But you made me realize that on Spectrum, since a scanline is exactly 32 Bytes long (not 40 like Atari), if I align the first byte, then it's impossible for the Hi byte to ever be updated in the middle!
I will reflect that in the implementation.
Also, good catch on the last inc! Not needed, but it sneaked in via copy-paste! Thanks!
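The alignment argument can be checked exhaustively. The C++ sketch below assumes the standard Spectrum display layout starting at 0x4000 (an assumption; the prototype's buffer may be laid out differently) and verifies that stepping 31 bytes from any row start never changes the high address byte. The function names are hypothetical.

```cpp
#include <cstdint>

// Address of the first byte of pixel row y (0..191) in Spectrum display memory.
constexpr std::uint16_t rowAddress(int y)
{
    return static_cast<std::uint16_t>(
        0x4000 | ((y & 0xC0) << 5) | ((y & 0x07) << 8) | ((y & 0x38) << 2));
}

// True if the first and last byte of row y share the same high address byte,
// i.e. INC L alone is enough to walk the whole 32-byte scanline.
constexpr bool rowStaysInPage(int y)
{
    return (rowAddress(y) >> 8) == ((rowAddress(y) + 31) >> 8);
}

// Check every displayed row.
constexpr bool allRowsStayInPage()
{
    for (int y = 0; y < 192; ++y)
        if (!rowStaysInPage(y)) return false;
    return true;
}
```

Every row start has a low byte that is a multiple of 32, so the largest low byte reached within a row is 0xE0 + 31 = 0xFF and no carry into H can occur.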

Heimdall · Post by **Heimdall** » Sun Jun 28, 2020 5:12 pm

I checked the Z80 manual and didn't see any nefarious shenanigans going on when doing PUSH DE, so I just went ahead and did 16x PUSH DE. I don't know what that guy in the eBook was doing with those PUSH/POP/EXX, but no point trying to understand his code right now. This is faster and that's all that matters.

Also a slight adjustment to v9 (7,306 vs 7,242) due to commenting out the last unneeded INC HL.

v1: ...C: vidPtr via full computation
v2: ...C: vidPtr via Array index (used for ASM)
v3: ASM: 14,794c: scanline drawn via Loop { LD (HL),A INC HL }
v4: ASM: 14,766c: v3 + registers taken out of loop
v5: ASM: 13,514c: Avoid thrashing IX, thus no need to reload it
v6: ASM: 11,914c: scanline drawn via LDIR
v7: ASM:. 9,354c: scanline drawn via unrolled LDI
v8: ASM:. 7,786c: scanline drawn via unrolled Stack operations (PUSH/POP/EXX)
v9: ASM:. 7,242c: scanline drawn via unrolled 32x { LD (HL),A INC HL }
v10:ASM:. 4,746c: scanline drawn via unrolled 16x { PUSH DE }

Now, I have to wonder, if one went through the trouble of implementing inner scanline fill using this PUSH approach, what the savings would be.
Within a scanline, it's not going to be as simple as here - as there are 2 bytes and edges will add some complexity.
But, in theory, it should be possible to create multiple codepaths (2 for left side, 2 for road, 2 for right side), based on 2-byte alignment, that would fill scanline using unrolled PUSH combined with jump table approach.

Probably an 8 KB function (but real fast)

ketmar · Post by **ketmar** » Sun Jun 28, 2020 5:16 pm

Heimdall wrote: ↑Sun Jun 28, 2020 5:12 pm I don't know what that guy in eBook was doing with those PUSH/POP/EXX

this is often used for fast screen copying (virtual screen -> screen$).

Einar Saukas · Post by **Einar Saukas** » Sun Jun 28, 2020 5:18 pm

Heimdall wrote: ↑Sun Jun 28, 2020 5:12 pm I don't know what that guy in eBook was doing with those PUSH/POP/EXX

It's probably copying an entire scanline from buffer to screen, instead of just filling a scanline with a single value.

andydansby · Post by **andydansby** » Sun Jun 28, 2020 5:28 pm

@Heimdall Which ebook is this?

Heimdall · Post by **Heimdall** » Sun Jun 28, 2020 5:51 pm

Yeah, it's probably best used for copying byte-aligned sprites.

The book is Jonathan Cauldwell: "How to write Spectrum games".

Thinking about this approach some more, this could be a real fast ClearScreen.
Clearing a third of screen would then take 64x16x3c = 3,072 c. We wouldn't even have to load pointer for current scanline, order of scanlines wouldn't matter.
Only 9,216c for the full screen. At that point, smart clearing suddenly becomes quite expensive. Probably still worthwhile, just much less...
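A small C++ model of the PUSH-based clear, to show both the write order and the cost. `PushFiller` and `clearWithPush` are hypothetical names, and the cost counts only the pushes at 3 machine cycles each (the unit used above), with no setup.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of a Z80 stack pointer writing into a 64 KB address space.
struct PushFiller {
    std::vector<std::uint8_t> ram = std::vector<std::uint8_t>(65536, 0);
    std::uint16_t sp = 0;
    long mCycles = 0;

    // PUSH qq: high byte goes to SP-1, low byte to SP-2, SP ends up lowered by 2.
    void push(std::uint16_t value) {
        ram[--sp] = static_cast<std::uint8_t>(value >> 8);
        ram[--sp] = static_cast<std::uint8_t>(value & 0xFF);
        mCycles += 3;
    }
};

// Fill 'bytes' bytes ending just below endAddr with a 16-bit pattern.
// Returns the accumulated cost in machine cycles.
long clearWithPush(PushFiller& cpu, std::uint16_t endAddr, int bytes,
                   std::uint16_t pattern)
{
    cpu.sp = endAddr;                       // SP starts one past the last byte
    for (int i = 0; i < bytes / 2; ++i)
        cpu.push(pattern);
    return cpu.mCycles;
}
```

For one screen third (2,048 bytes ending at 0x4800) this gives 1,024 pushes and 3,072 cycles, matching 64x16x3. On real hardware interrupts would need to be disabled while SP points into the screen, since an interrupt pushes its return address through the same SP.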

And it's just 1 Byte opcode ! Unrolled clearscreen on Atari took over half RAM (3 Bytes per byte of framebuffer). Not exactly useable, even on 128 KB 130 XE. Let alone double-buffered.

It's really nice - just 3 cycles and you store 16-bit value, and decrement register ! Compared to Atari, where you constantly had to burn cycles on Carry after each address adjustment (or unroll 256x if possible), this is quite incredible.

I always wondered, how the hell did Spectrum fill high resolution screen so fast. Double CPU clock and more than double writing speed !

Heimdall · Post by **Heimdall** » Sun Jun 28, 2020 6:09 pm

So, the next question is - Does it even make sense to only draw background where the road is not ? Or is the cost of overdraw irrelevant due to 3-cycle 16-bit write ? I admit it sounds crazy on 8-bit CPU...

The Grass section is 256x64 pixels, which is 32x64 = 2,048 Bytes = 1,024x PUSH
Road takes 924 Bytes = ~462 PUSH (certainly slightly more due to 16-bit alignment).

But, let's just go with an ideal case of exactly 462x PUSH. Skipping the road would save 45% of the 4,746 cycles = ~2,136 cycles. That's ~33c per scanline, and we need to compute:
1. The length of the first run (Left Screen edge to the left edge of road) + Set SP there
2. Start of the second run (Right Road Edge to Right Screen Edge) + length + Set SP at end of scanline

But, we now need a loop - which is 3 cycles per iteration, which is 3x562 (PUSH) = 1,686c
So, instead of 2,136c we now have just 2,136 - 1,686 = 450 cycles for 64 scanlines -> 7 cycles per scanline and we still need to compute the two things above. There's no way we can do the work above in 7c. In theory we could have 2 lookup tables, but one lookup (IX+d) is 5c and we need two, so just the two lookups take up 10c, which is already over budget before either run is drawn. Yeah, not happening.

So, Spectrum is actually fast enough to ignore ~50% overdraw and just brute-force through it like a GeForce. Damn, did not see that one coming.

Heimdall · Post by **Heimdall** » Sun Jun 28, 2020 6:23 pm

I'm realizing that it might be possible to actually draw a screen scanline in one go (from right to left), never adjusting SP (it's auto-decremented by each push):

1. We point SP to the end of scanline
2. Draw the Grass on the right (via loop:push).
3. Handle the boundary between Grass and Road: single push
4. Draw the Road (via loop:push)
5. Handle the boundary between Road and Grass: single push
6. Draw the Grass on the left (via loop:push).

Should be easy enough to implement in C++ first and see exactly how many conditions would be there (especially the grass-road boundary).

But, it definitely has a potential to be faster than a sum of { DrawGrass + DrawRoad }.
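As a C++ reference for the steps above, the sketch below emits one scanline as 16 PUSH-style word writes, walking from the right edge to the left. It is a plain model for checking the byte values, not the unrolled implementation: `columnByte` and `drawScanline` are hypothetical names, the road is assumed to be a single solid pattern byte, and every word is computed the same way instead of splitting into grass, boundary and road code paths.

```cpp
#include <array>
#include <cstdint>

// Byte for column col (0..31) of a scanline whose road covers pixels [roadL, roadR).
std::uint8_t columnByte(int col, int roadL, int roadR,
                        std::uint8_t grass, std::uint8_t road)
{
    std::uint8_t mask = 0;                          // 1 bits mark road pixels
    for (int bit = 0; bit < 8; ++bit) {
        const int x = col * 8 + bit;                // bit 7 is the leftmost pixel
        if (x >= roadL && x < roadR)
            mask |= static_cast<std::uint8_t>(0x80 >> bit);
    }
    return static_cast<std::uint8_t>((road & mask) | (grass & ~mask));
}

// One scanline written as 16 words, right to left, the way PUSH would store them.
std::array<std::uint8_t, 32> drawScanline(int roadL, int roadR,
                                          std::uint8_t grass, std::uint8_t road)
{
    std::array<std::uint8_t, 32> row{};
    int sp = 32;                                    // "SP" starts one past the end
    for (int word = 15; word >= 0; --word) {
        const std::uint8_t hi = columnByte(word * 2 + 1, roadL, roadR, grass, road);
        const std::uint8_t lo = columnByte(word * 2,     roadL, roadR, grass, road);
        row[--sp] = hi;                             // PUSH stores the high byte at SP-1
        row[--sp] = lo;                             // then the low byte at SP-2
    }
    return row;
}
```

The register's high byte ends up at the higher address, so the byte for the right-hand column of each pair has to sit in the high half of the pushed register. Only the words containing a road edge hold a mixed byte; all other words are pure grass or pure road, which is what makes the loop:push runs possible.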

Joefish · Post by **Joefish** » Sun Jun 28, 2020 10:25 pm

That's fine for a pseudo-3D road where the separations between shaded road segments are horizontal, but aren't you doing a real 3D model, so you could get a split between black and white segments on one scanline?

Einar Saukas · Post by **Einar Saukas** » Sun Jun 28, 2020 10:44 pm

Heimdall wrote: ↑Sun Jun 28, 2020 5:51 pm Clearing a third of screen would then take 64x16x3c = 3,072 c.

Does "c" mean cycle? What exactly is a cycle for you?

It seems you are measuring speed in terms of "machine cycles", where each machine cycle doesn't have a fixed number of T-states. If so, that's a terrible way to compare execution time between different implementations.

Heimdall · Post by **Heimdall** » Mon Jun 29, 2020 7:10 am

Einar Saukas wrote: ↑Sun Jun 28, 2020 10:44 pm
Heimdall wrote: ↑Sun Jun 28, 2020 5:51 pm Clearing a third of screen would then take 64x16x3c = 3,072 c.
Does "c" mean cycle? What exactly is a cycle for you?

I'm taking the machine cycles count from the Z80 manual. I inferred that 1 cycle = 4 T-States.

I store both values: cycles and T-States, for each addressing mode of each instruction.

Einar Saukas wrote: ↑Sun Jun 28, 2020 10:44 pm It seems you are measuring speed in terms of "machine cycles", where each machine cycle doesn't have a fixed number of T-states. If so, that's a terrible way to compare execution time between different implementations.

There are some scenarios when two ops list same number of cycles, but different T-states:

INC D : 1c, 4 T-States
INC DE:1c, 6 T-States

Now, I don't know the internal HW pipelining rules (decoding,etc.), but I would presume that 16-bit INC would hold execution prior to decoding next op for 2 more T-states, right ?

If two methods have similar number of cycles, I then look closely at T-States, from comparison standpoint.

It would have to be a deliberate benchmark that would focus only on such same-cycles-yet-different-T-states, otherwise it shouldn't happen a lot.

ketmar · Post by **ketmar** » Mon Jun 29, 2020 7:17 am

if you aren't creating a precise emulator (contended memory, multicolor, etc.) -- why do you need machine cycles at all? it is as easy to count t-states, and have much more precise result. and even if you want precise emulation, counting t-states is still better.

Heimdall · Post by **Heimdall** » Mon Jun 29, 2020 7:25 am

Joefish wrote: ↑Sun Jun 28, 2020 10:25 pm That's fine for a pseudo-3D road where the separations between shaded road segments are horizontal, but aren't you doing a real 3D model, so you could get a split between black and white segments on one scanline?

I'm not doing pseudo-3D road. All road vertices are full 3D (X,Y,Z) and camera position is also (X,Y,Z). Each road segment can have different width / height.

This is merely a separate codepath (simple condition : if (GenericMeshesInView == 0) ) that will be run if there are no generic 3D meshes in viewport (just road). So far I think I can fit all that into 128 KB...

I think it should be possible to run this at around ~20 fps, if there is just road in viewport.

Now, even on Atari, the actual pixel fill (even though it's much slower than on Z80) still took only around ~13% of cycles - i.e. the performance difference was barely noticeable, whether you had wireframe or filled polygons (but I did have the fastest possible inlined, unrolled, condition-less jump-table approach).

70-90% of frame time will be spent on scanline traversal and handling edges.

I will probably write at least 2 different scanline traversals, but since we have 16-bit math on Z80, it's waaaay too tempting not to use for scanline traversal, so that will probably be the first implementation.

Heimdall · Post by **Heimdall** » Mon Jun 29, 2020 7:37 am

ketmar wrote: ↑Mon Jun 29, 2020 7:17 am if you aren't creating a precise emulator (contended memory, multicolor, etc.)

yeah, this is merely dev "emulator" coding environment to figure out, as quickly as possible, which algorithm will be copied to actual emulator/HW, without having to burn a lot of time.

ketmar wrote: ↑Mon Jun 29, 2020 7:17 am -- why do you need machine cycles at all?

Because Z80 is only one of about a dozen retro HW targets where I run my 3D engine.
My Excel sheet has all kinds of benchmarks from multiple platforms.

I need the data to be comparable.

I can imagine only one scenario when the difference (cycles vs T-States) would be noticeable on ZX - the 60 fps game. Once we get down to target framerates of around 10 fps (still 500% faster than Hard Drivin'), it's not going to matter too much...

Also, I can very easily make a mistake of jotting down the T-States and not cycles and accidentally disregarding a perfectly fast implementation. That risk is very real.

When I'm debugging, the T-States are right below cycles in my Watch window. So, I see them all the time (if needed). I just make sure I don't write them down.

ketmar · Post by **ketmar** » Mon Jun 29, 2020 8:25 am

i still don't understand why do you need machine cycles. t-state counting is more precise, it can be easily converted to real/frame timings, any target 8-bit CPU is using t-state counting, and comparing real/frame timings gives much better idea of what the real/relative speeds will be.

note that you CANNOT convert cycles to frame timings! (due to different cycle lengths) so i cannot see how it is easier to do comparisons with cycles.
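The difference between the two units shows up directly on the scanline fills discussed in this thread. The C++ sketch below tabulates a few instructions with both their machine cycles and T-states, as listed in the Zilog timing tables, and derives the cost of one 32-byte scanline for each method. The type and constant names are made up for illustration, and setup instructions are ignored.

```cpp
struct OpCost { int mCycles; int tStates; };

// Per-instruction costs: machine cycles and T-states.
constexpr OpCost kPushQQ {3, 11};   // PUSH qq
constexpr OpCost kLdHLr  {2, 7};    // LD (HL),r
constexpr OpCost kIncR   {1, 4};    // INC r
constexpr OpCost kIncSS  {1, 6};    // INC ss
constexpr OpCost kLdi    {4, 16};   // LDI

constexpr OpCost operator+(OpCost a, OpCost b)
{
    return {a.mCycles + b.mCycles, a.tStates + b.tStates};
}
constexpr OpCost operator*(OpCost a, int n)
{
    return {a.mCycles * n, a.tStates * n};
}

// Cost of writing one 32-byte scanline with each method.
constexpr OpCost kRowPush  = kPushQQ * 16;              // 16x PUSH
constexpr OpCost kRowIncL  = (kLdHLr + kIncR) * 32;     // LD (HL),A / INC L
constexpr OpCost kRowIncHL = (kLdHLr + kIncSS) * 32;    // LD (HL),A / INC HL
constexpr OpCost kRowLdi   = kLdi * 32;                 // unrolled LDI
```

Counted in machine cycles, the INC L and INC HL variants are identical at 96 per scanline, yet they differ by 64 T-states (352 vs 416). Likewise the PUSH fill looks exactly 2x faster than the INC HL fill in machine cycles, while in T-states the ratio is about 2.4x (176 vs 416).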
