3D Experimenting

Heimdall · Post by **Heimdall** » Mon Jun 29, 2020 8:52 am

Of all the PDFs I have, Z80 is literally the only one that lists T-States alongside cycles, per instruction.

You could probably find a PDF that would list the T-states, if you exerted enough googling effort, I guess.

Still, unlike a 2D bitmap game with a fixed count of sprites and fixed sprite sizes (hence a fixed, pre-determined scene cost no matter what), 3D rendering suffers from a vastly disproportionate frame-to-frame cycle budget, even when the polycount is identical (for obvious reasons).

The discrepancy between T-States and cycles will be the least of your problems when a frame suddenly takes 50% more time than the previous one

ketmar · Post by **ketmar** » Mon Jun 29, 2020 9:02 am

Heimdall wrote: ↑Mon Jun 29, 2020 8:52 am Of all the PDFs I have, Z80 is literally the only one that lists T-States alongside cycles, per instruction.

http://www.z80.info/z80code.txt
you don't need cycles at all here, only t-states.

Heimdall wrote: ↑Mon Jun 29, 2020 8:52 am Still, unlike a 2D bitmap game with a fixed count of sprites and fixed sprite sizes (hence a fixed, pre-determined scene cost no matter what), 3D rendering suffers from a vastly disproportionate frame-to-frame cycle budget, even when the polycount is identical (for obvious reasons).

The discrepancy between T-States and cycles will be the least of your problems when a frame suddenly takes 50% more time than the previous one :)

still, you can't count your budget properly with cycles. a 2 t-state difference in an unrolled loop may ruin all your assumptions, for example.

p.s.: it is up to you, of course. i'm just trying to understand the reasons behind choosing cycle counting instead of t-state counting. it is not clear to me what advantages it has (so far i can see only disadvantages).

Einar Saukas · Post by **Einar Saukas** » Mon Jun 29, 2020 1:53 pm

Heimdall wrote: ↑Mon Jun 29, 2020 8:52 am You could probably find a PDF that would list the T-states, if you exerted enough googling effort, I guess.

https://www.ime.usp.br/~einar/z80table/

Heimdall · Post by **Heimdall** » Tue Jun 30, 2020 4:57 am

Einar Saukas wrote: ↑Mon Jun 29, 2020 1:53 pm
Heimdall wrote: ↑Mon Jun 29, 2020 8:52 am You could probably find a PDF that would list the T-states, if you exerted enough googling effort, I guess.
https://www.ime.usp.br/~einar/z80table/

Thanks, but I meant other (non-Z80) CPUs (like 6502, 65C02, 68000, RISC DSP, etc.), as it looks like you missed this:

Heimdall wrote: ↑ Of all the PDFs I have, Z80 is literally the only one that lists T-States alongside cycles, per instruction.

And this:

Heimdall wrote: ↑ When I'm debugging, the T-States are right below cycles in my Watch window. So, I see them all the time (if needed).

And this:

Heimdall wrote: ↑ I store both values: cycles and T-States, for each addressing mode of each instruction.

I count T-states automatically with cycles. Just need to keep the numbers comparable to other benchmarks from other platforms, as all numbers there are cycles.
I'm also indexing all CPUs - i.e. how many cycles it takes to process the exact same 3D scene (broken down to individual pipeline stages).

ketmar · Post by **ketmar** » Tue Jun 30, 2020 6:09 am

why?! please?! *every* 8-bit CPU out there has info about command lengths in ticks (see p.s.), and one tick has constant time. and machine cycle is not constant time (at least on Z80). there is no fixed number of machine cycles in a frame, there is no fixed timing for a machine cycle. what exactly are you comparing here? why? abstract machine cycles used? this info is totally useless, it says nothing about actual and relative code performance!

i definitely don't understand something here. what did i miss in the picture? why cycles, and not t-states?

p.s.: yes, i've seen "Of all the PDFs I have, Z80 is literally the only one that lists T-States alongside cycles, per instruction". found info about instruction timings for 6502 in several clicks; if some CPU has all machine cycles of constant time, then it can be considered as t-state info too. this is not top-secret info, any good real-time emulator has it, after all. and their authors got that info from somewhere.

p.p.s.: sorry, i'm not trying to attack you. maybe i am pushing this topic too hard. i am just genuinely puzzled. but i think we're running in circles here, so just don't answer this question if you don't want this to continue. i believe we are more interested to see the actual code implemented anyway, not nitpicking the way you prefer to do it. ;-)

Heimdall · Post by **Heimdall** » Thu Jul 02, 2020 7:10 pm

ketmar wrote: ↑Tue Jun 30, 2020 6:09 am why?! please?! *every* 8-bit CPU out there has info about command lengths in ticks (see p.s.), and one tick has constant time. and machine cycle is not constant time (at least on Z80). there is no fixed number of machine cycles in a frame, there is no fixed timing for a machine cycle. what exactly are you comparing here? why? abstract machine cycles used? this info is totally useless, it says nothing about actual and relative code performance!

I often implement 3-10 versions of the same algorithm and choose the final one based on the number of cycles/ticks/whatever you want to call it.

So, it's definitely useful. How else would I know which algorithm is fastest? It needs to be benchmarked.

Whether or not I count 1 cycle as 4 T-States, the result is still the same (the number is just 4x bigger).

Are you saying that the numbers in Z80 official PDF are imprecise ? For example it lists ADD A,r as 1 cycle and 4 T-States.

If what you say is true then ADD A,r would sometimes take less and sometimes more than 1 cycle = 4 T-States.

This would then be on top of RAM contention. That sounds like quite a clusterfuck, but I don't really have intimate knowledge of the platform.

Can you elaborate on why the Z80 has such a wide range of execution times for fixed simple ops?

Heimdall · Post by **Heimdall** » Thu Jul 02, 2020 7:24 pm

I have implemented scanline traversal. I can handle all combinations of Leftwards/Rightwards and Steep/NonSteep edges in the inner loop.
I also correctly compute LeftMost pixel on current scanline for LeftWards Non-Steep Line and RightMost pixel for RightWards Non-Steep Lines.

My first ASM implementation takes 76 cycles (264 T-States) per scanline (inner loop).

So, 64 scanlines in the view (for the road) would take 64*76 = 4,864 cycles. Since 128K should have roughly 60,192c (@60 FPS), that's barely 8% of frame time, which is fine. But, it's great to see that scanline traversal won't be a major hog.

Of course, this total number of cycles will go up with the Edge Set-Up and rasterizing (next on the To-Do List).

Looking into my notes, on 6502, this stage took 246 cycles per scanline (after about 8 refactors ! first version was about 379c), so as expected, having so many registers helped tremendously.
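
The per-scanline estimate above is easy to parametrise. A minimal sketch, using only the figures quoted in this post (76 cycles per scanline, 64 scanlines, a frame budget of 60,192); the function name `frame_share` is hypothetical and the units are whatever the caller passes in:

```python
def frame_share(cost_per_scanline, scanlines, frame_budget):
    """Return (total cost, fraction of the frame budget used).

    All three arguments must be in the same unit; the function
    does not know or care whether that unit is cycles or T-states.
    """
    total = cost_per_scanline * scanlines
    return total, total / frame_budget


# Figures as stated in the post above.
total, share = frame_share(76, 64, 60_192)
print(total)           # 4864
print(f"{share:.1%}")  # 8.1%
```

The result is only as good as the frame budget passed in, so the budget and the per-scanline cost have to be measured in the same unit.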

ketmar · Post by **ketmar** » Thu Jul 02, 2020 7:32 pm

machine cycle is not just a shorthand for 4 t-states. INC r8 and INC r16 both take the same number of machine cycles. yet INC r8 is 4 t-states, and INC r16 is 6 t-states. so they have different timings.

if you're writing your unrolled loop based on machine cycles, both instructions will be equivalent. yet irl INC r8 is faster, so you should choose it over INC r16 where it is possible.

now, with the Spectrum screen you can use both INC HL and INC L to move to the next horizontal position (for example). and you will never know which one to choose based purely on machine cycles, because they're the same. yet 24*INC L == 96 t-states == 1/50*(96/69888) seconds, and 24*INC HL == 144 t-states == 1/50*(144/69888) seconds. or, in other words, (96/69888) of the frame, and (144/69888) of the frame. now, if you want to draw 16 screen rows, it will be (96*16+n)/69888 and (144*16+n)/69888 respectively. 1536 vs 2304 t-states. 768 t-states were wasted due to measuring by cycles. this may not look like a lot, but those "wasted t-states" accumulate, and soon you may find that you just slightly missed the interrupt.

of course, the code can be optimised later, but then you'll have to go through all your code and think about whether you can change INC HL to INC L there or not, effectively doing twice as much work instead of looking at t-states and deciding it when you're writing it.

p.s.: if you wonder where 69888 came from -- it is the number of t-states between interrupts (full frame refresh time).
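
The arithmetic in this post can be reproduced with a few lines of Python. This is only a sketch: the t-state costs (4 for INC r8, 6 for INC r16) and the 69888 t-state frame are taken from the post, while the table and function names are made up for illustration, and the extra per-row cost `n` is left out because it is the same in both cases.

```python
# T-state costs as given in the post: INC r8 = 4, INC r16 = 6.
T_STATES = {"INC L": 4, "INC HL": 6}

# T-states between interrupts on a 48K Spectrum (from the post).
FRAME_T_STATES_48K = 69888


def unrolled_cost(instruction, columns=24, rows=16):
    """T-states spent on one instruction repeated columns * rows times."""
    return T_STATES[instruction] * columns * rows


cost_l = unrolled_cost("INC L")
cost_hl = unrolled_cost("INC HL")
wasted = cost_hl - cost_l

print(cost_l, cost_hl, wasted)  # 1536 2304 768
print(f"{wasted / FRAME_T_STATES_48K:.2%} of a 48K frame")  # 1.10% of a 48K frame
```

Counted in machine cycles, both variants would come out identical, which is exactly the blind spot described above.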

Heimdall · Post by **Heimdall** » Thu Jul 02, 2020 8:26 pm

ketmar wrote: machine cycle is not just a shorthand for 4 t-states. INC r8 and INC r16 both take the same number of machine cycles. yet INC r8 is 4 t-states, and INC r16 is 6 t-states. so they have different timings.

if you're writing your unrolled loop based on machine cycles, both instructions will be equivalent. yet irl INC r8 is faster, so you should choose it over INC r16 where it is possible.

You have definitely missed this post of mine, where I happened to mention the exact same instruction you are talking about

Heimdall wrote: ↑Mon Jun 29, 2020 7:10 am
I store both values: cycles and T-States, for each addressing mode of each instruction.

There are some scenarios when two ops list same number of cycles, but different T-states:

INC D : 1c, 4 T-States
INC DE: 1c, 6 T-States

If two methods have similar number of cycles, I then look closely at T-States, from comparison standpoint.

It would have to be a deliberate benchmark that would focus only on such same-cycles-yet-different-T-states, otherwise it shouldn't happen a lot.

I presumed that, since I wrote it on the previous page, we were on the same page. Alas, we weren't.

I would also recommend checking my second post on first page, where I mentioned I actually save both T-States and cycles while decoding instructions.

You made me think that Z80 has some fucked up decode/prefetch/load/scoreboarding issues in the HW pipeline, just like Atari Jaguar's RISC !

Fortunately, that doesn't seem to be the issue and Z80's execution time is more predictable than Jaguar.

ketmar · Post by **ketmar** » Thu Jul 02, 2020 8:33 pm

i've seen it. i just can't understand why you need machine cycles at all, because it adds no useful info. so i thought that maybe you have some troubles understanding the difference, and i tried to explain it (yet again). it is prolly misunderstanding between us both. i guess we can drop this issue now, i was just curious, and prolly little hard to talk with. sometimes it happens with me. i'm sorry.

AndyC · Post by **AndyC** » Thu Jul 02, 2020 10:06 pm

I think the problem here is that "machine cycles" in Z80 terminology have absolutely nothing at all to do with how long instructions take to execute and are merely an internal concept of the Z80, related to how instructions are internally constructed. They are absolutely not analogous to cycles on a 6502, and saying something takes X machine cycles is an entirely meaningless metric.

The equivalent of 6502 cycles on the Z80 is T-States, that is the number of clock ticks that an instruction will take to execute under perfect conditions. T-States do not have anything to do with contention or other stall conditions that might also be imposed by any given hardware platform. If you want to compare the performance of Z80 code under entirely idealised conditions you want to measure T-states.

Real world performance adds further complications. On the Speccy you have to worry about the contention model, on the Amstrad CPC you typically use an adjusted set of timings known as NOP timings (because all instructions there get stretched to multiples of 4 cycles). How difficult it is to predict this level of detail tends to be very machine specific.

ketmar · Post by **ketmar** » Thu Jul 02, 2020 10:10 pm

AndyC wrote: ↑Thu Jul 02, 2020 10:06 pm I think the problem here is that "machine cycles" in Z80 terminology have absolutely nothing at all to do with how long instructions take to execute and are merely an internal concept of the Z80, related to how instructions are internally constructed. They are absolutely not analogous to cycles on a 6502, and saying something takes X machine cycles is an entirely meaningless metric.

this. this is what i meant, but somehow completely forgot to write in clear text. stupid me.

Einar Saukas · Post by **Einar Saukas** » Thu Jul 02, 2020 11:48 pm

Heimdall wrote: ↑Thu Jul 02, 2020 7:10 pm For example it lists ADD A,r as 1 cycle and 4 T-States.

For example it also lists ADD A,n as 2 cycles and 7 T-states.

Therefore it's faster than doing ADD A,r twice.

The point is, you previously arrived at the wrong conclusion that your routine v9 was faster than v8. Because of this, developers here are trying to explain that your execution time calculations are not precise enough.

But at the end of the day, it's your project so it's up to you to decide how you want to do it. I hope these posts won't discourage you, we are just trying to help!

Heimdall · Post by **Heimdall** » Sat Jul 04, 2020 9:04 pm

Einar Saukas wrote: ↑Thu Jul 02, 2020 11:48 pm
Heimdall wrote: ↑Thu Jul 02, 2020 7:10 pm For example it lists ADD A,r as 1 cycle and 4 T-States.
For example it also lists ADD A,n as 2 cycles and 7 T-states.

Therefore it's faster than doing ADD A,r twice.

And my benchmarking will obviously catch that, because as I've said about twenty times by now, I am counting BOTH T-States and cycles

Technically, I have been counting them before I even wrote the thread, as I implemented that functionality before this thread, as I realized that additional precision shouldn't hurt...

So, executing 2x {1,4} will result in {2,8}
vs executing 1x {2,7} will result in {2,7}

Hence I will, correctly, find that {2,7} is faster than {2,8}
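
The bookkeeping described above can be sketched as follows. The {cycles, T-States} pairs are the ones quoted in this thread; the table layout and the names `TIMINGS`, `total` and `faster` are hypothetical, not the actual tool. The comparison deliberately looks at T-states only, since the machine cycle totals tie in this example.

```python
# (machine cycles, T-states) pairs as quoted in the thread.
TIMINGS = {
    "ADD A,r": (1, 4),
    "ADD A,n": (2, 7),
    "INC D": (1, 4),
    "INC DE": (1, 6),
}


def total(sequence):
    """Sum (cycles, T-states) over a list of mnemonics."""
    cycles = sum(TIMINGS[op][0] for op in sequence)
    t_states = sum(TIMINGS[op][1] for op in sequence)
    return cycles, t_states


def faster(seq_a, seq_b):
    """Return the sequence with fewer T-states; cycles are ignored."""
    return seq_a if total(seq_a)[1] <= total(seq_b)[1] else seq_b


two_adds = ["ADD A,r", "ADD A,r"]
one_add_imm = ["ADD A,n"]

print(total(two_adds))                # (2, 8)
print(total(one_add_imm))             # (2, 7)
print(faster(two_adds, one_add_imm))  # ['ADD A,n']
```

Ranking by the first element of each pair would call these two sequences equal; only the second element separates them.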

Einar Saukas wrote: ↑Thu Jul 02, 2020 11:48 pm But at the end of the day, it's your project so it's up to you to decide how you want to do it. I hope these posts won't discourage you, we are just trying to help!

Well, I was hoping for some code review, as I just finished the scanline traversal and pixel fill stage of the 3D engine and this is the first larger batch of Z80 code I wrote. So, I obviously had about a dozen questions.

But I'd have to be crazy to attempt that now

I could write some statement twenty times, but I can see it wouldn't matter, so in the end it would be a colossal waste of time...
I'm just gonna live with whatever code I write and be simply happy about it

The more code I write, the better it will get, anyway. That's the way experience works. Z80 surely isn't any special in this regard...

Heimdall · Post by **Heimdall** » Sat Jul 04, 2020 9:14 pm

AndyC wrote: ↑Thu Jul 02, 2020 10:06 pm I think the problem here is that "machine cycles" in Z80 terminology have absolutely nothing at all to do with how long instructions take to execute and are merely an internal concept of the Z80, related to how instructions are internally constructed. They are absolutely not analogous to cycles on a 6502, and saying something takes X machine cycles is an entirely meaningless metric.

The equivalent of 6502 cycles on the Z80 is T-States, that is the number of clock ticks that an instruction will take to execute under perfect conditions. T-States do not have anything to do with contention or other stall conditions that might also be imposed by any given hardware platform. If you want to compare the performance of Z80 code under entirely idealised conditions you want to measure T-states.

Real world performance adds further complications. On the Speccy you have to worry about the contention model, on the Amstrad CPC you typically use an adjusted set of timings known as NOP timings (because all instructions there get stretched to multiples of 4 cycles). How difficult it is to predict this level of detail tends to be very machine specific.

Except that the Z80 manual states a 4:1 ratio between cycles and TStates, with cycles being merely rounded (as TStates has 4 additional values per 1 cycle).

So, let's say we have a budget of 1,000 cycles which is 4,000 T-States.
Let's say we have a NOP (1 cycle, 4 TStates)

1,000 / 1 = 1,000 NOPs
4,000 / 4 = 1,000 NOPs

WTF ?!? Whether I counted cycles or TStates, it still came down to the same number of executions in the given performance budget !

The 50 Hz 128k is supposed to have 70908 cycles per frame, which is 70908*4 = 283632 TStates.

If my routine takes 50,000 cycles (and 200,000 TStates) it won't matter which metric I use, as it still will use the exact same percentage of frame time:
50,000 / 70908 = 70.51%
200,000 / 283632 = 70.51%

Since not all instructions have the exact 4:1 ratio due to rounding, the TStates will be very slightly off. But the final percentage will be very similar.
Unless, as I said a dozen times, somebody made a deliberately retarded test with only the instructions that aren't exactly 4:1....

Heimdall · Post by **Heimdall** » Sat Jul 04, 2020 9:18 pm

Here's a real-world metric taken from running scanline traversal and pixel fill:
27,642 cycles
104,394 TStates

104,394 / 27,642 = 3.78

3.78 !!! That's BRUTAL !

ketmar · Post by **ketmar** » Sat Jul 04, 2020 9:18 pm

Heimdall wrote: ↑Sat Jul 04, 2020 9:14 pm The 50 Hz 128k is supposed to have 70908 cycles per frame, which is 70908*4 = 283632 TStates.

that's why you should forget about "cycles" altogether. no, 128K frame has 70908 t-states. nobody in Speccy (and Z80, i believe) programming world ever counts anything in "cycles".

p.s.: the only exception is mentions of "even M1", which simply means that opcode fetching always starts at even t-state (not cycle!) in 128K.
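
To make the budget concrete, here is a small sketch that expresses a routine's cost as a fraction of one frame, with both sides measured in t-states. The frame lengths (69888 for 48K, 70908 for 128K) and the 104,394 t-state measurement are figures quoted in this thread; the function name `frames_used` is hypothetical, and contention is ignored.

```python
# T-states per frame, as quoted in this thread.
FRAME_T_STATES = {"48K": 69888, "128K": 70908}


def frames_used(routine_t_states, model="128K"):
    """Fraction of one frame a routine consumes, ignoring contention."""
    return routine_t_states / FRAME_T_STATES[model]


# T-states reported earlier for scanline traversal plus pixel fill.
measured = 104_394

print(f"{frames_used(measured):.2f} frames on 128K")        # 1.47 frames on 128K
print(f"{frames_used(measured, '48K'):.2f} frames on 48K")  # 1.49 frames on 48K
```

A value above 1.0 means the routine does not fit into a single frame on that machine.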

Heimdall · Post by **Heimdall** » Sat Jul 04, 2020 9:43 pm

ketmar wrote: ↑Sat Jul 04, 2020 9:18 pm
Heimdall wrote: ↑Sat Jul 04, 2020 9:14 pm The 50 Hz 128k is supposed to have 70908 cycles per frame, which is 70908*4 = 283632 TStates.
that's why you should forget about "cycles" altogether. no, 128K frame has 70908 t-states. nobody in Speccy (and Z80, i believe) programming world ever counts anything in "cycles".

p.s.: the only exception is mentions of "even M1", which simply means that opcode fetching always starts at even t-state (not cycle!) in 128K.

So, why THE f***, do I find this out, only now after 5 pages of bullsh*t about ~6% difference in precision between cycles/TStates ?

That would have been the piece of information to mention 4 pages ago, if you are aware of it.

That's a bloody huge difference. 4:1 for f***'s sake.

My frame budget just dropped from 283000 TStates to 70900 TStates. Fantastic...

ketmar · Post by **ketmar** » Sat Jul 04, 2020 9:56 pm

Heimdall wrote: ↑Sat Jul 04, 2020 9:43 pm So, why THE f***, do I find this out, only now after 5 pages of bullsh*t about ~6% difference in precision between cycles/TStates ?

we were telling you about that from the very beginning. we were telling you that nobody's using "cycles" in Z80, that "machine cycles" are used to mark logical states of Z80, not for timing. that's why we asked why you bother counting "cycles" at all. i even gave you the explanation with timings. this one. where i explicitly wrote:
"24*INC L == 96 t-states == 1/50*(96/69888) seconds
...
if you wonder where 69888 came from -- it is the number of **t-states** (bold is not in the original post) between interrupts (full frame refresh time)."

yes, 69888 is for 48K. but it is quite close to 70908 to not assume that 128K has 283632 t-states per frame, i believe.

Heimdall · Post by **Heimdall** » Sat Jul 04, 2020 10:57 pm

I am not upset because we argued for 5 pages about the ~6% difference, in precision, between TStates and cycles (or my misinterpretation of what a cycle actually is on this particular platform - that's academical waxing).

But can you see how ridiculous it is that there is a 300% difference which, if I hadn't accidentally mentioned the 283,000 number, would never have been brought up?

That equation of yours, I looked at it 5 times, till I saw it. Only now. Well hidden, for sure.

Again, 300% vs 6%. Awesome...

Now my framerate has dropped 4:1. Yeah, I'm pissed...

ketmar · Post by **ketmar** » Sat Jul 04, 2020 11:09 pm

ah, sorry, attacking you wasn't my intent. English is not my native language, and i am a very sarcastic person. sometimes my sarcastic, but not really rude comments look like rude attacks with my Engrish.

anyway, i can fully understand your frustration -- this sudden shrinking of frame budget is almost devastating, i guess. but it would be much worse to find this out when most of your code would be ready and tested, i think. ;-)

i hope we finally sorted this out. i myself, for example, am so used to "t-states", that i never even realised that for other CPUs it is called "cycles", and that you thought the same about Z80. i believe that others were confused too, so the whole thing was looking like we're bashing you for using the wrong term for the same thing. eh... sh*t happens. ;-)

Lethargeek · Post by **Lethargeek** » Sun Jul 05, 2020 10:32 am

ketmar wrote: ↑Sat Jul 04, 2020 9:18 pm p.s.: the only exception is mentions of "even M1", which simply means that opcode fetching always starts at even t-state (not cycle!) in 128K.

not in "128k", but in some clones (with any k)

ketmar · Post by **ketmar** » Sun Jul 05, 2020 10:35 am

Lethargeek wrote: ↑Sun Jul 05, 2020 10:32 am
ketmar wrote: ↑Sat Jul 04, 2020 9:18 pm p.s.: the only exception is mentions of "even M1", which simply means that opcode fetching always starts at even t-state (not cycle!) in 128K.
not in "128k", but in some clones (with any k)

(consulted ZXEmuT) yeah, my bad. somehow i was sure that +2/+3 have it too, but machine info in ZXEmuT says otherwise.

Alone Coder · Post by **Alone Coder** » Mon Sep 07, 2020 10:52 pm

I streamed a little about my chunky 3D engine for 48K that has gone unused for years:
https://www.youtube.com/watch?v=mDnyJ0o ... e=youtu.be
No design, no demo

Spectrum Computing
