Post-Contention CPU Cycles Available

Heimdall · Post by **Heimdall** » Sun Jun 14, 2020 6:16 am

I'm going through some routines I wrote for 6502 and looking over the cycle budget for some of those. Now, while the CPU was just 1.79 MHz, I mostly used resolution 160x96, so there was more CPU power available per pixel there.

But, given that 1 byte addresses 8 pixels on Spectrum, it partially offsets the scenario above.

So, how many CPU cycles are actually available after contention takes place?

3,500,000 / 60 (NTSC) = 58,333 cycles - this is the theoretical maximum if there was no contention.

What is the number after contention? Is it above 50,000? At a 10 fps target framerate, that would give me 6 full frames of CPU time (~300,000 cycles for the whole scene).

AndyC · Post by **AndyC** » Sun Jun 14, 2020 10:25 am

Contention on the Speccy is hard: exactly how many cycles are lost depends entirely on which instructions are running and when, so you're only ever going to get a ballpark figure. This is different from machines like the Amstrad CPC or C64, which follow a much more predictable pattern of delays.

Heimdall · Post by **Heimdall** » Sun Jun 14, 2020 10:39 pm

Well, that is unfortunate. But it explains why my forum search prior to the post came up empty.

So, what is the ballpark range here then? What's the worst and best case scenario?

Is there any cycle-exact emulator that would handle this as precisely as possible?

1024MAK · Post by **1024MAK** » Sun Jun 14, 2020 11:24 pm

So just to explain a little. If the Z80 CPU is accessing (fetching an instruction, or reading or writing data in) the ROM area (0x0000 to 0x3FFF) or the ‘upper’ RAM (0x8000 to 0xFFFF), it runs at full speed.

However, if the Z80 is accessing the ‘lower’ RAM (0x4000 to 0x7FFF), which contains the bitmapped screen data, and the ULA needs to get data from RAM to send to the TV or monitor, it will freeze the clock to the Z80 and hence cause it to pause. If the ULA does not need to grab any video data (because it is currently drawing the border, or the CRT beam is off screen during line or field flyback), the ULA will not mess with the Z80 CPU clock, so during this time it runs at full speed.

For this reason, some games programmers use various methods to try to synchronise updating the screen data to times when the ULA is not ‘drawing the screen’. In addition they try to have as much of their program code in ‘upper’ RAM. Using the rest of ‘lower’ RAM for data rather than code if they can.

So as you can see, there are a number of different variations on how much the Z80 gets delayed:
- Instruction fetched from ROM or from upper RAM and reading or writing data to/from either of these two areas - full speed.
- Instruction fetched from ROM or from upper RAM and reading or writing data to/from lower RAM - possibly affected by contention.
- Instruction fetched from lower RAM and reading or writing data to/from anywhere - possibly affected by contention.
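
The three cases above boil down to an address-range check. Here is a minimal Python sketch of that idea, assuming only the 48K memory map described in this post. The function names are made up for illustration, and models where other RAM banks may also be contended are not covered.

```python
# Address range of 'lower' RAM as described in the post (48K memory map).
CONTENDED_START = 0x4000  # first byte of lower RAM (holds the screen)
CONTENDED_END = 0x7FFF    # last byte of lower RAM


def is_contended(addr: int) -> bool:
    """True if addr lies in lower RAM, where the ULA may stall the CPU."""
    return CONTENDED_START <= addr <= CONTENDED_END


def may_be_delayed(code_addr: int, data_addr: int) -> bool:
    """An instruction can only be delayed if it is fetched from, or touches
    data in, lower RAM. Whether it really is delayed depends on where the
    beam is at that moment, which this sketch does not model."""
    return is_contended(code_addr) or is_contended(data_addr)


print(may_be_delayed(0x8000, 0xC000))  # code and data both in upper RAM
print(may_be_delayed(0x8000, 0x4000))  # upper-RAM code writing to the screen
print(may_be_delayed(0x6000, 0xC000))  # code running from lower RAM
```

The check only says whether a delay is possible, which matches the "possibly affected" wording in the list.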

Lots more details here (although keep in mind that these pages are now rather dated and in some areas fresh discoveries have meant that some of this information has been superseded).

Mark

1024MAK · Post by **1024MAK** » Sun Jun 14, 2020 11:36 pm

See also this page.

Mark

Heimdall · Post by **Heimdall** » Mon Jun 15, 2020 5:34 am

Thanks a lot for the links. I quickly skimmed through it, but it will [for sure] require some proper focus to digest, probably best in the morning along with coffee

But, just quickly, use case here would be low-framerate, double-buffered 3D scene. Like, ~8-10 fps, so it takes 6-8 full frames to render single frame.

So, even if I only used the upper RAM for my code/data, during those 8 frames, there would still be a contention for ULA reading framebuffer.

From a quick look into your link, it appears that CPU runs at full speed whenever the electron beam isn't drawing the screen (e.g. borders).
Also, it seems that every 6 cycles (during the 192 scanlines) there are 2 CPU cycles available.

Let's take NTSC (60 fps) -> 3,500,000 / 60 = 58,333 cycles [theoretical max] per frame

192 [scanlines] * 224 [cycles] = 43,008 -> 43,008 / 58,333 = 73.73%
So, the CPU is frozen during 73% of frame time - just not completely, because there are 96 cycles available in the border part of each scanline, and during the drawn part every 6 frozen cycles are followed by 2 free ones.

58,333 - 43,008 = 15,325 -> this is the time when beam isn't drawing screen (above/below screen)
96*192 = 18,432 -> this is the scanline border time
16*2*192 = 6,144 -> this is the 2c available every 6 during scanline being drawn

So, when we add everything together:
15,325 + 18,432 + 6,144 = 39,901 cycles out of 58,333 -> 68.41%

So, the cycle budget for NTSC SW rasterizer is:

30 fps: 79,802
20 fps: 119,703
15 fps: 159,604
12 fps: 199,505
10 fps: 239,406
7.5 fps: 319,208
6 fps: 399,010

I think I like the 10 fps best - ~250k budget...
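
For anyone who wants to re-run the numbers, the estimate above can be reproduced with a short Python sketch. It uses only the figures from this post (3.5 MHz, 60 Hz, 224 cycles per line, 96 border cycles, 2 free cycles in each of 16 groups per line). It models the assumption that the CPU loses every other cycle while the screen is drawn, which is a pessimistic baseline rather than a measurement. The constant and function names are mine.

```python
CPU_HZ = 3_500_000        # nominal clock used in the post
FRAME_HZ = 60             # NTSC rate assumed in the post
LINE_CYCLES = 224         # cycles per scanline
DISPLAY_LINES = 192       # scanlines that show screen data
BORDER_CYCLES = 96        # per-line cycles spent outside the drawn area
FREE_PER_LINE = 16 * 2    # 2 free cycles in each of 16 groups per line

frame = CPU_HZ // FRAME_HZ                        # cycles per frame
outside = frame - DISPLAY_LINES * LINE_CYCLES     # above/below the screen
usable = outside + DISPLAY_LINES * (BORDER_CYCLES + FREE_PER_LINE)


def budget(frames_per_render: int) -> int:
    """Cycle budget when one rendered image takes this many video frames."""
    return usable * frames_per_render


for n in (2, 3, 4, 5, 6, 8, 10):
    print(f"{FRAME_HZ / n:>4.1f} fps: {budget(n):>7,}")
```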

Patrik Rak · Post by **Patrik Rak** » Mon Jun 15, 2020 9:19 am

Why do you even consider NTSC? Spectrum is a PAL machine, the NTSC variants do exist but they are definitely not the primary target...

Besides, your computation that the CPU is frozen 73% of the time is bogus. It may be the theoretical limit, but you would have to try hard to write code that hits all the contention windows.

Heimdall · Post by **Heimdall** » Mon Jun 15, 2020 9:56 am

I live in NTSC territory now, so NTSC is very real for me, regardless of where Spectrum was dominant

The 73% was just the initial baseline estimate to get the cycles outside of visible screen.

The useable cycles turned out to be 68.41%, so the CPU is frozen at the very least 100-68.41 = 31.59% of the frame.

Since the clock isn't exactly 3,500,000 there will be a slight adjustment, but probably in the order of ~1% or so.

Important thing is I can now discard or entertain certain 3D scenes, based on this calculation, as I have a ballpark figure now.

Of course, I'm comparing this to the cycle count on 6502, which had only 3 registers and Z80 has 16 (plus 16-bit ops, plus DJNZ, LDIR, etc.), so it's obvious that the Z80 implementation of same algorithms will have to be [inevitably] significantly faster than on 6502.

Not to mention that at least 3 substages of the whole 3D pipeline will not have to store the temporary results to the RAM (loop for writing and another for reading) - like I had to on 6502, but can actually directly use the computed data, further removing thousands of cycles from the whole pipeline.

Metalbrain · Post by **Metalbrain** » Mon Jun 15, 2020 10:33 am

The contention only affects the moment when the CPU tries to read or write a contended memory cell. Just because slow memory accesses get delayed in 6 (or 7) out of every 8 states doesn't mean the CPU is totally frozen in those states if no slow memory access is needed.

As an example, if we're using the LDI instruction executed from fast memory to read a byte from fast memory and write it into the slow memory (such as writing from a buffer to the screen), the first 4 states read the prefix, the next 4 states read the instruction itself, the next 4 states read the byte to be written, and in the next cycle there's 1 state out of the last 4 where it will try to write to the screen memory. In the worst case scenario, we'll have to wait 6 states (or even 7 in the case of a +2A/+3 model), and if we're lucky we'll have to wait 0, so the instruction may take 16-22(23) states to be executed. But if a new LDI instruction follows that first one immediately, in this case the write state will be aligned with the previous write and there won't be any delay. So a bunch of LDIs will only get delayed a tiny bit, while using LDIR (which without contention takes 21 states per byte IIRC), the repeated write accesses will get delayed 3 states each, so it will be like executing it at 24 states per byte (that's 14.29% slower than normal).

PD: In reality it's a bit more complex than that, because LDI has some extra contended states for writing, so overall both LDI and LDIR will be slower.
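
A toy Python model of the simplified picture above shows why unrolled LDIs fall into step while LDIR settles at 3 extra states per byte. The per-state delay table (6, 5, 4, 3, 2, 1, 0, 0) is an assumption taken from the commonly documented 48K pattern, not from this post. The model treats the write as the only contended access, so it ignores the extra contended states mentioned in the PD and the uncontended border part of each line.

```python
# Delay (in T-states) added to a contended access, indexed by its position
# within each group of 8 states. Assumed 48K pattern, not from the post.
DELAYS = (6, 5, 4, 3, 2, 1, 0, 0)


def write_delays(states_per_byte: int, n_bytes: int, phase: int = 0) -> list:
    """Delay suffered by each screen write in a run of back-to-back copies.

    Simplification: one contended access (the write) per byte copied,
    attempted every `states_per_byte` states plus any earlier delays.
    """
    t = phase
    delays = []
    for _ in range(n_bytes):
        d = DELAYS[t % 8]
        delays.append(d)
        t += d + states_per_byte
    return delays


ldi = write_delays(16, 8)    # unrolled LDI, 16 states each
ldir = write_delays(21, 8)   # LDIR, 21 states per repeated byte
print(ldi)
print(ldir)
print(f"LDIR slowdown at 3 extra states per byte: {3 / 21:.2%}")
```

Because 16 is a multiple of 8, each LDI write lands on the same position in the pattern as the previous one. LDIR's 21 states shift the write by 5 positions each time, so it keeps landing on a 3-state wait.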

Lethargeek · Post by **Lethargeek** » Mon Jun 15, 2020 4:47 pm

Heimdall wrote: ↑Mon Jun 15, 2020 9:56 am I live in NTSC territory now, so NTSC is very real for me, regardless of where Spectrum was dominant

hmm, aren't all modern TV sets capable of different frame rates, or do you just wanna use a vintage NTSC CRT for some reason?

also, even in the worst possible artificial scenario when all the code and data belong to contended page and every contended instruction is either "ld hl, (nn)" or "ld (nn),hl" (making it 40 cycles instead of uncontended 16) less than 14800 cycles per frame are wasted on the contention (so ~21% for PAL and ~25% for NTSC frame) - and as other people here already said, it's very easy to avoid for critical code, so in practice it will be less than 10% (unless you're writing for 16k spectrum)
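
The percentages above can be checked with a few lines of Python. The 14,800 figure is the one quoted in this post. The PAL frame length of 69,888 cycles (312 lines of 224 cycles) is my assumption for a 48K machine, and the NTSC figure is the 3,500,000 / 60 value used earlier in the thread.

```python
WASTED = 14_800        # upper bound on wasted cycles per frame, from the post
PAL_FRAME = 69_888     # assumed: 312 lines x 224 cycles (48K PAL timing)
NTSC_FRAME = 58_333    # 3,500,000 / 60, as used earlier in the thread

for name, frame in (("PAL", PAL_FRAME), ("NTSC", NTSC_FRAME)):
    print(f"{name}: {WASTED / frame:.1%} of the frame lost at worst")
```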

Joefish · Post by **Joefish** » Mon Jun 15, 2020 5:06 pm

Heimdall wrote: ↑Mon Jun 15, 2020 5:34 am But, just quickly, use case here would be low-framerate, double-buffered 3D scene. Like, ~8-10 fps, so it takes 6-8 full frames to render single frame.
So, even if I only used the upper RAM for my code/data, during those 8 frames, there would still be a contention for ULA reading framebuffer.

No, that's wrong.

First, with the Spectrum, there's no 'frame buffer'. There's just the screen memory at a fixed position in RAM, which you can't change. If you want to use a frame buffer, you have to make your own elsewhere in memory, then when ready, copy it to the screen memory address.

Second, while the ULA is fetching data from lower RAM, if the CPU is only ever accessing upper RAM, then there is NO CONTENTION. The system does not slow down the processor if it does not need to. Contention only occurs when the ULA and the CPU try to access the same lower 16K of RAM at the same time.

While your code is running and reading/writing to a frame buffer in upper RAM there is NO CONTENTION. The code runs at full speed.
When you come to copy your buffer from upper RAM to lower RAM (the screen), then you MIGHT experience contention. Fetching the instructions of your code (assuming it's still in upper RAM) is not contended. The READ of data from upper RAM is not contended either. Only the WRITE phase of each byte you're copying to lower RAM is contended.

And as pointed out above, if you use repeated LDI instructions, the first one will be delayed by contention but subsequent ones will be in synch with the ULA fetch cycles. The code will actually automatically align itself and for a brief period will have ZERO contention! And this is only while the ULA is drawing pixels on the screen. If you can do your copying during the top/bottom border rendering time, again, there's no contention.

It is complicated, but it also allows the CPU to run as fast as possible. Other machines run slower by applying a false contention pattern constantly, even when it's not needed, to keep everything in step. The Spectrum doesn't do this. It only holds up the processor when there is a real memory-access contention.

Alone Coder · Post by **Alone Coder** » Mon Jun 15, 2020 5:57 pm

A universal 3D engine for 48K Speccy can't be faster than this semi-universal one: http://alonecoder.nedopc.com/3dengine_wirecube.zip

Heimdall · Post by **Heimdall** » Mon Jun 15, 2020 9:01 pm

Alone Coder wrote: ↑Mon Jun 15, 2020 5:57 pm A universal 3D engine for 48K Speccy can't be faster than this semi-universal one: http://alonecoder.nedopc.com/3dengine_wirecube.zip

I'm on mobile so will check the ZIP file later when on PC.

I wouldn't use "universal", as that would imply a generic 3D scene setup, the processing of which slows down even CPUs an order of magnitude faster, due to the sheer amount of work required.

In terms of ZX spectrum, let's just say something like StarStrike - predefined path indoors and free roaming outdoors in space.

Heimdall · Post by **Heimdall** » Mon Jun 15, 2020 9:05 pm

Joefish wrote: ↑Mon Jun 15, 2020 5:06 pm
Heimdall wrote: ↑Mon Jun 15, 2020 5:34 am But, just quickly, use case here would be low-framerate, double-buffered 3D scene. Like, ~8-10 fps, so it takes 6-8 full frames to render single frame.
So, even if I only used the upper RAM for my code/data, during those 8 frames, there would still be a contention for ULA reading framebuffer.
No, that's wrong.

First, with the Spectrum, there's no 'frame buffer'. There's just the screen memory at a fixed position in RAM, which you can't change. If you want to use a frame buffer, you have to make your own elsewhere in memory, then when ready, copy it to the screen memory address.

Second, while the ULA is fetching data from lower RAM, if the CPU is only ever accessing upper RAM, then there is NO CONTENTION. The system does not slow down the processor if it does not need to. Contention only occurs when the ULA and the CPU try to access the same lower 16K of RAM at the same time.

While your code is running and reading/writing to a frame buffer in upper RAM there is NO CONTENTION. The code runs at full speed.
When you come to copy your buffer from upper RAM to lower RAM (the screen), then you MIGHT experience contention. Fetching the instructions of your code (assuming it's still in upper RAM) is not contended. The READ of data from upper RAM is not contended either. Only the WRITE phase of each byte you're copying to lower RAM is contended.

And as pointed out above, if you use repeated LDI instructions, the first one will be delayed by contention but subsequent ones will be in synch with the ULA fetch cycles. The code will actually automatically align itself and for a brief period will have ZERO contention! And this is only while the ULA is drawing pixels on the screen. If you can do your copying during the top/bottom border rendering time, again, there's no contention.

It is complicated, but it also allows the CPU to run as fast as possible. Other machines run slower by applying a false contention pattern constantly, even when it's not needed, to keep everything in step. The Spectrum doesn't do this. It only holds up the processor when there is a real memory-access contention.

Huh, so that would mean, that I would actually have the full amount of CPU cycles for the duration of those 6-8 frames of CPU time! Like, instead of 39,901 cycles I would have 58,333 per each of those 6-8 frames.
That's up to 466,000 for whole scene.

Awesome
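
As a quick check of the revised budget, here is a small Python sketch comparing the full-speed figure with the earlier estimate. It assumes all rendering code and data stay in upper RAM, and it does not include the cost of the final copy to screen memory. The constant names are mine.

```python
FRAME_CYCLES = 3_500_000 // 60   # 58,333 per NTSC frame, as computed earlier
CONTENDED_ESTIMATE = 39_901      # earlier pessimistic per-frame figure

for frames in (6, 7, 8):
    full = FRAME_CYCLES * frames
    old = CONTENDED_ESTIMATE * frames
    print(f"{frames} frames: {full:,} cycles (earlier estimate: {old:,})")
```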

Joefish · Post by **Joefish** » Mon Jun 15, 2020 9:57 pm

Yes, all the complicated maths and pixel-filling can run at full speed if it all takes place above memory address 32768. Only when you copy from your image buffer to the real screen memory at 16384 will you have to worry about contention, and there are plenty of strategies for minimising it even then.

Heimdall · Post by **Heimdall** » Mon Jun 15, 2020 10:07 pm

Joefish wrote: ↑Mon Jun 15, 2020 9:57 pm Yes, all the complicated maths and pixel-filling can run at full speed if it all takes place above memory address 32768. Only when you copy from your image buffer to the real screen memory at 16384 will you have to worry about contention, and there are plenty of strategies for minimising it even then.

I suspect you mean DMA transfer?

I'm thinking of Beam Racing here:
One could set up an interrupt to trigger once beam crossed the first screen scanline and then start the copy process.

This would copy a dozen or so scanlines while ULA would contend CPU.
But once ULA reached the last scanline, then it would be copied at full speed.

How many KB of data can DMA transfer before beam reaches first scanline?

Tearing appears to be a normal feature on Spectrum so perhaps it's not even an issue...
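
To put a rough number on the "how much before the beam reaches the first scanline" question for a plain CPU copy, here is a Python sketch. It assumes 48K PAL timing, where roughly 64 lines of 224 cycles pass between the frame interrupt and the first displayed line. NTSC machines will differ, and I have not verified those timings. The 16 and 21 states per byte are the LDI and LDIR costs mentioned earlier in the thread, and the screen size is 6144 bitmap bytes plus 768 attribute bytes.

```python
TOP_BORDER_STATES = 64 * 224   # assumed 48K PAL: states before the display starts
SCREEN_BYTES = 6144 + 768      # bitmap plus attributes
LDI_STATES = 16                # per byte, unrolled LDI
LDIR_STATES = 21               # per byte, LDIR


def bytes_before_display(states_per_byte: int) -> int:
    """Bytes a CPU copy can move before the beam reaches the first line."""
    return TOP_BORDER_STATES // states_per_byte


print(bytes_before_display(LDI_STATES))    # unrolled LDI
print(bytes_before_display(LDIR_STATES))   # LDIR
print(SCREEN_BYTES * LDI_STATES)           # states for a whole-screen LDI copy
```

Under these assumptions only a fraction of the screen fits into the top border, so a full copy has to run alongside the beam for part of the time.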

Heimdall · Post by **Heimdall** » Mon Jun 15, 2020 10:17 pm

Alone Coder wrote: ↑Mon Jun 15, 2020 5:57 pm A universal 3D engine for 48K Speccy can't be faster than this semi-universal one...

To clarify further what I meant, the core rendering concepts are identical, regardless of whether the HW platform is 8-bit (Spectrum) or 64-bit (Jaguar):
- 3D transform
- 3D culling
- 2D clipping
- Triangle rasterization - scanline traversal

Especially in flatshading, the game can look almost identical across various platforms, as long as they retain same color schemes. Moreso in pure wireframe, which is most likely candidate on Spectrum, but I wouldn't rule out filled polygons, given that there is no contention and we have up to 0.5 Mil cycles budget.

From my benchmarks on 6502, the scanline fill consumed only moderate amount of cycles, as great majority was spent on the scanline traversal and handling the edges of the current scanline.

Seven.FFF · Post by **Seven.FFF** » Mon Jun 15, 2020 10:19 pm

Heimdall wrote: ↑Mon Jun 15, 2020 10:07 pm How many KB of data can DMA transfer before beam reaches first scanline?

Remember this is a ZX Spectrum forum. Although you started off by asking questions about the Spectrum Next in a different topic, the vast majority of the topics in this forum are about a ZX Spectrum unless you specifically post in the Next subforum or ask a question about the Next.

The standard Spectrum, released as various official Sinclair and Amstrad models between 1982 and 1987, has no DMA, and doesn't have the ability to turn contention off. Both those are features of the Next. Most people on this forum are going to be utterly confused if you start asking questions about DMA.

Heimdall · Post by **Heimdall** » Mon Jun 15, 2020 10:24 pm

Now, the challenge with Spectrum in flatshading is that we must use dithering and various other patterns effectively.

But that's nothing that a couple short tables couldn't handle.

StarStrike 1 seems to have been designed really smart as it separated various elements enough that they don't overlap on a byte boundary and can have a separate color attribute.

Is there some other 3D game that achieved the same or even a better effect?

Heimdall · Post by **Heimdall** » Mon Jun 15, 2020 10:33 pm

Seven.FFF wrote: ↑Mon Jun 15, 2020 10:19 pm
Heimdall wrote: ↑Mon Jun 15, 2020 10:07 pm How many KB of data can DMA transfer before beam reaches first scanline?
Remember this is a ZX Spectrum forum. Although you started off by asking questions about the Spectrum Next in a different topic, the vast majority of the topics in this forum are about a ZX Spectrum unless you specifically post in the Next subforum or ask a question about the Next.

The standard Spectrum, released as various official Sinclair and Amstrad models between 1982 and 1987, has no DMA, and doesn't have the ability to turn contention off. Both those are features of the Next. Most people on this forum are going to be utterly confused if you start asking questions about DMA.

Oh, didn't know that there was no DMA on standard Spectrum.

I'm currently considering both HW targets - 3.5 MHz and 28 MHz. The core code will be 95% identical, with couple separate codepaths via #ifdef blocks to have a special version using extended opcodes or to handle double buffering (be it via ULA or not).

To be more specific, my last 6502 code did the Star Wars catwalk scene and the 65c02 had a flatshaded stunrunner-style scene (though that was on a 4 MHz Lynx and resolution was just 160*102, not 256*192).
So, I got lots of reference benchmarks to compare...

Heimdall · Post by **Heimdall** » Tue Jun 16, 2020 8:54 am

Lethargeek wrote: ↑Mon Jun 15, 2020 4:47 pm
Heimdall wrote: ↑Mon Jun 15, 2020 9:56 am I live in NTSC territory now, so NTSC is very real for me, regardless of where Spectrum was dominant
hmm, aren't all modern TV sets capable of different frame rates, or do you just wanna use a vintage NTSC CRT for some reason?

also, even in the worst possible artificial scenario when all the code and data belong to contended page and every contended instruction is either "ld hl, (nn)" or "ld (nn),hl" (making it 40 cycles instead of uncontended 16) less than 14800 cycles per frame are wasted on the contention (so ~21% for PAL and ~25% for NTSC frame) - and as other people here already said, it's very easy to avoid for critical code, so in practice it will be less than 10% (unless you're writing for 16k spectrum)

I wish I actually had NTSC CRT, but I don't.

From reading forums for another platform, it became obvious [if surprising] that 60 Hz is apparently quite a problem in Europe. Regardless of what the HDMI spec says.

I personally haven't tried 50 Hz yet on my LCDs, but if they don't support 50 Hz, I will have to hardcode everything to 60 Hz. If the 50 Hz runs then I will target 50 Hz. As much as I am a huge fan of conditional compiling, I don't want to do that for NTSC/PAL. Just too many issues.

I wouldn't target 16K - the lowest non-Next Spectrum I would consider is 128K, as I need couple look up tables and 64KB version would be much slower on less than 128 KB.

Heimdall · Post by **Heimdall** » Tue Jun 16, 2020 8:59 am

Metalbrain wrote: ↑Mon Jun 15, 2020 10:33 am The contention only affects the moment when the CPU tries to read or write a contended memory cell. Just because slow memory accesses get delayed in 6 (or 7) out of every 8 states doesn't mean the CPU is totally frozen in those states if no slow memory access is needed.

As an example, if we're using the LDI instruction executed from fast memory to read a byte from fast memory and write it into the slow memory (such as writing from a buffer to the screen), the first 4 states read the prefix, the next 4 states read the instruction itself, the next 4 states read the byte to be written, and in the next cycle there's 1 state out of the last 4 where it will try to write to the screen memory. In the worst case scenario, we'll have to wait 6 states (or even 7 in the case of a +2A/+3 model), and if we're lucky we'll have to wait 0, so the instruction may take 16-22(23) states to be executed. But if a new LDI instruction follows that first one immediately, in this case the write state will be aligned with the previous write and there won't be any delay. So a bunch of LDIs will only get delayed a tiny bit, while using LDIR (which without contention takes 21 states per byte IIRC), the repeated write accesses will get delayed 3 states each, so it will be like executing it at 24 states per byte (that's 14.29% slower than normal).

PD: In reality it's a bit more complex than that, because LDI has some extra contended states for writing, so overall both LDI and LDIR will be slower.

Sorry that I didn't respond earlier. For some reason, I missed last two posts on this page and only noticed now, weird...

I will spend 6-8 full frames to draw the screen into off-screen framebuffer.

So, I will be copying my framebuffer to screen only about ~8-10 times per second, not every frame, so I don't expect much of a slowdown now that it was explained to me that there is no contention as long as I don't access contended RAM.

Reason I was thinking it would is because on Atari, the ANTIC chip, regardless of your framerate, will shut down 6502 during each of the 60 frames, even if your 3D app has a framerate of 1 fps.

So, I thought it logical, that same technology from that era would be in Spectrum too. Turns out Speccy is way more advanced than I thought...

Patrik Rak · Post by **Patrik Rak** » Tue Jun 16, 2020 11:27 am

Heimdall wrote: ↑Tue Jun 16, 2020 8:59 am So, I thought it logical, that same technology from that era would be in Spectrum too. Turns out Speccy is way more advanced than I thought...

You can also consider targeting Pentagon - it uses smarter memory access which needs no contention and gives you 71680 cycles per frame...

Patrik

Heimdall · Post by **Heimdall** » Tue Jun 16, 2020 1:25 pm

Patrik Rak wrote: ↑Tue Jun 16, 2020 11:27 am
Heimdall wrote: ↑Tue Jun 16, 2020 8:59 am So, I thought it logical, that same technology from that era would be in Spectrum too. Turns out Speccy is way more advanced than I thought...
You can also consider targeting Pentagon - it uses smarter memory access which needs no contention and gives you 71680 cycles per frame...

Patrik

Would I even have to do anything about Pentagon ? It's supposed to be fully compatible (plus it has RAM/audio/resolution extensions), so I presume any Spectrum code should just run there, or no ?

Some of them are even 7 MHz. Although, I would have to run physics/input based on timer, not framecounter, to have framerate-independent handling (which is not exactly cheap)...

Alone Coder · Post by **Alone Coder** » Tue Jun 16, 2020 6:46 pm

Heimdall wrote: ↑Mon Jun 15, 2020 10:17 pm
Alone Coder wrote: ↑Mon Jun 15, 2020 5:57 pm A universal 3D engine for 48K Speccy can't be faster than this semi-universal one...
To clarify further what I meant, the core rendering concepts are identical, regardless of whether the HW platform is 8-bit (Spectrum) or 64-bit (Jaguar):
- 3D transform
- 3D culling
- 2D clipping
- Triangle rasterization - scanline traversal

What precision for 3D transform and 3D culling will you use?
My engine transforms in 12 bit and draws object-wise, and there is no 3D culling, all the clipping is in 2D with 8 bit transform.
Because of that, you can't have bigger objects.

Spectrum Computing
