Re: CPI / CPIR instruction
Posted: Sat Sep 28, 2019 12:10 am
Uh . . . The short answer is: it’s complicated.
The long answer is itself complicated.
You see, when analyzing these combo instructions, it’s best not to rewrite them in pseudocode like you did. Your pseudocode correctly breaks down the logic of the instruction—it describes how the CPU arrives at the result—but that’s not what the CPU is actually doing.
A better way of breaking down any instruction is to think of it in machine cycles, not T states. Each machine cycle can take several T states, and each instruction takes at least one M cycle—the opcode fetch. The absolute minimum number of T states in a fetch M cycle is four. Some instructions take just that many T states (say, INC A). That’s how long it takes to place the PC register on the address bus and read the opcode. Extended instructions (prefixed by ED, CB, DD, and FD) take an additional 4 T states, because their opcodes are two bytes long. IX and IY bit instructions (prefixed by DDCB and FDCB) take even longer. Compare, for example, the regular LD HL,(**) instruction (opcode 2A; 16 T states) with its undocumented counterpart (opcode ED6B; 20 T states).
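To make that comparison concrete, here’s a small sketch (Python, using the documented Z80 T-state counts) showing how the ED prefix adds exactly one extra 4-T-state fetch cycle; the cycle labels in the comments are mine:

```python
# M-cycle breakdowns (T states per machine cycle) for the two forms of
# LD HL,(**), using the documented Z80 timings.
ld_hl_nn    = [4, 3, 3, 3, 3]     # opcode fetch, 2 operand reads, 2 memory reads
ld_hl_nn_ed = [4, 4, 3, 3, 3, 3]  # ED prefix fetch adds one more 4-T-state cycle

print(sum(ld_hl_nn))     # 16 T states
print(sum(ld_hl_nn_ed))  # 20 T states
```

Everything except the prefix fetch is identical between the two encodings—the extra 4 T states are pure decoding overhead.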
Now, each of the pseudocode instructions that you wrote out doesn’t need to be fetched and decoded individually; only one instruction fetch happens in either LDI or CPI. Since those are extended instructions (with the ED prefix), the opcode fetch for each takes two machine cycles—8 T states in total.
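Laid out per machine cycle (a sketch based on the documented timings; the cycle labels are mine), the 16 T states of LDI and CPI look like this:

```python
# Machine-cycle breakdown of two ED-prefixed block instructions,
# using the documented Z80 timings (labels are illustrative).
ldi = {"fetch ED": 4, "fetch A0": 4, "memory read": 3, "memory write": 5}
cpi = {"fetch ED": 4, "fetch A1": 4, "memory read": 3, "internal compare": 5}

print(sum(ldi.values()))  # 16 T states
print(sum(cpi.values()))  # 16 T states
```

Note that all the pointer increments/decrements and the BC countdown happen inside those same cycles—they never appear as separate fetches the way your pseudocode might suggest.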
The next machine cycles (if they exist at all) are for moving data between the CPU and the RAM/ROM or other devices (I/O). They can take anywhere from three to five T states. Some instructions don’t move any data (INC A) and thus take much less time. Incrementing an index register, however, will take longer, because, say, INC IXh is an extended instruction; it takes another 4 T states to fetch the second byte of its opcode. Yet something like EX (SP),IX can take as many as six machine cycles and 23 T states (!) (two fetches, two memory reads and two writes—one for each byte).
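Those six machine cycles of EX (SP),IX add up like this (again a sketch from the documented timing, 4+4+3+4+3+5; the per-cycle labels are my reading of it):

```python
# EX (SP),IX: two opcode fetches (DD, E3), two stack reads, two stack
# writes -- documented Z80 timing: 4+4+3+4+3+5 = 23 T states, six M cycles.
ex_sp_ix = [4, 4, 3, 4, 3, 5]

print(len(ex_sp_ix))  # 6 machine cycles
print(sum(ex_sp_ix))  # 23 T states
```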
The internal workings of the Z80 are not as easily broken down timing-wise, and they do depend on numerous factors, including, as you put it—“the wiring.” Suffice it to say that actually incrementing a register (or register pair) doesn’t take up 6 T states. Moreover, increments and decrements can be grouped together and impose little to no overhead when executed simultaneously—they’re not necessarily cumulative. The incrementer/decrementer circuitry in the Z80 is quite clever and can do various things. It can also pass a value through without incrementing or decrementing it; thus, similar to the WZ register pair, it can be used for storing data temporarily.
The HL/DE register pairs can be very easily swapped in hardware. In fact, they are not, strictly speaking, physically separate registers at all. Instructions like EX DE,HL don’t actually exchange data between DE and HL, but it sure looks like it to the programmer.
Some internal operations in the Z80 can be pipelined and thus overlap, but not all. For example, you can’t directly copy a value from one register to another (yes, even the LD B,C mnemonic is a lie). The operation must be done through the ALU. But the ALU in the Z80 is 4-bit, and using it for transferring data between 16-bit registers would be too slow. It’s much faster to use the incrementer/decrementer circuitry for that. Now, the ALU (and register) operations can finish while the CPU is fetching another instruction, but the fetch itself needs the incrementer/decrementer latch (to advance PC), so an instruction that uses that latch must be done with it before the next instruction can be fetched. This explains why INC A is faster than INC HL, for example. Block transfers (LDIR, CPIR, etc.) sure use the incrementer latch a lot.
Like I said, it’s complicated. Hopefully, I’ve now confused you beyond reason, and you have no desire to investigate the matter any further.