CPI / CPIR instruction

R-Tape · Post by **R-Tape** » Thu Sep 26, 2019 8:02 pm

Has anyone ever actually used these instructions?

I remember when I first read about them in 'Spectrum Machine Language for the Absolute Beginner', and it described them as being so powerful as to leap skyscrapers (or something). They really do sound like they have the potential to be as useful as LDI/LDIR, but every time I think "Ah! Finally - I can use CPI!*", there's a property of it that means it's not the best solution.

I think if it was a CP (HL) with (DE), increment, and DEC BC then I'd find a use for it.

*which was the case this week

Ast A. Moore · Post by **Ast A. Moore** » Thu Sep 26, 2019 9:41 pm

Yes, I use CPIR in my redefine keys routine. It’s pretty slow, so I’d reserve it for when code execution speed isn’t a concern.

I think its primary purpose is for use with data arrays, i.e. text, etc. I use it to ignore the already defined keys (something many redefine keys routines neglect, much to the chagrin of players).
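A minimal sketch of that kind of check, assuming the key codes chosen so far sit in a small table. `CheckKey`, `DefinedKeys` and `KEY_COUNT` are made-up names for illustration, not taken from any actual routine:

```z80
KEY_COUNT   EQU  5              ; hypothetical number of redefinable keys

; In:  A = key code the player has just pressed
; Out: Z set   = code is already in the table (reject it)
;      Z reset = code is not in use yet
CheckKey:
        LD   HL,DefinedKeys
        LD   BC,KEY_COUNT       ; must be non-zero, BC = 0 would scan 64K
        CPIR                    ; compare A with (HL), HL+1, BC-1, repeat
                                ; until a match is found or BC reaches 0
        RET

DefinedKeys:
        DEFS KEY_COUNT          ; assumption: unused entries hold 0, and 0
                                ; is never returned as a real key code
```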

Ralf · Post by **Ralf** » Thu Sep 26, 2019 10:33 pm

I very rarely use it. But I used it once in one of my games. Imagine a table of N bytes. I needed to count how many bytes in this table are 0.
Eventually I did:

Code:

 LD HL,TableStart
 LD DE,0                ;count of 0s
 LD BC,TableSize
 XOR A
Loop1:
 CPI                    ;does CP (HL) INC HL DEC BC
 JP NZ,Loop2
 INC DE                 ;increase count of 0s
Loop2:
 JP PE,Loop1

catmeows · Post by **catmeows** » Thu Sep 26, 2019 10:59 pm

R-Tape wrote: ↑Thu Sep 26, 2019 8:02 pm Has anyone ever actually used these instructions?

No, I usually try to find things in a faster way than with CP*(R).

Seriously, aren't the instructions rather slow ?

AFAIK native LZ packers use CPIR/CPDR as a fast way to find a match candidate.

Seven.FFF · Post by **Seven.FFF** » Fri Sep 27, 2019 4:08 am

It’s very useful for writing expression parsers. Finding the CR end of line markers, finding the equals in Key=Value constructs, reading data out of modem AT command responses, that kind of thing.
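As a rough sketch of the Key=Value case, assuming the line is already in a buffer and its length is known. `FindValue` is a hypothetical name, and the routine searches the whole length it is given rather than stopping at the CR:

```z80
; In:  HL -> start of text such as "NAME=VALUE"
;      BC =  number of bytes to search (non-zero)
; Out: Z set   = '=' found, HL -> first character of the value
;      Z reset = no '=' within BC bytes
FindValue:
        LD   A,'='
        CPIR                    ; stops with HL one past the matching byte
        RET
```

Loading A with 13 instead finds the CR end-of-line marker the same way, leaving HL at the start of the next line.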

R-Tape · Post by **R-Tape** » Fri Sep 27, 2019 2:21 pm

Thanks guys. You've made me realise there is somewhere I could (and should) have used this. I had a map of 18 bytes and I needed to find the first available empty place, so a new tile could be inserted.

For the first time ever I will use the CPIR instruction!
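A minimal sketch of that search, assuming an empty place is marked with 0. `Map`, `MAP_SIZE`, `EMPTY` and `FindEmpty` are illustrative names only:

```z80
MAP_SIZE    EQU  18
EMPTY       EQU  0              ; assumed marker for an empty place

; Out: Z set   = HL -> first empty place in the map
;      Z reset = map is full
FindEmpty:
        LD   HL,Map
        LD   BC,MAP_SIZE
        LD   A,EMPTY
        CPIR
        RET  NZ                 ; no empty place found
        DEC  HL                 ; CPIR leaves HL one past the match
                                ; (16-bit DEC does not touch the flags)
        RET

Map:
        DEFS MAP_SIZE
```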

catmeows wrote: ↑Thu Sep 26, 2019 10:59 pm Seriously, aren't the instructions rather slow ?

The speed looks quite good to me, same as LDI/LDIR. CPI/R seem less likely to be used in fast game loops anyway.
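For comparison, here is a sketch of a straightforward hand-written search with the standard Zilog timings (memory contention ignored). The label names are arbitrary:

```z80
; In:  HL -> data, B = length (1..255), A = byte to find
; Out: Z set = HL -> match, Z reset = not found
ScanLoop:
        CP   (HL)               ;  7 T
        JR   Z,ScanDone         ;  7 T when not taken (12 T when taken)
        INC  HL                 ;  6 T
        DJNZ ScanLoop           ; 13 T when taken (8 T on exit)
                                ; = 33 T per non-matching byte
ScanDone:
        RET

; CPIR costs 21 T per byte while it repeats and 16 T on its final
; iteration, the same figures as LDIR.
```

So against this simple loop CPIR comes out ahead per byte, though an unrolled or specialised search could still be quicker.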

Ast A. Moore · Post by **Ast A. Moore** » Fri Sep 27, 2019 2:23 pm

R-Tape wrote: ↑Fri Sep 27, 2019 2:21 pm I had a map of 18 bytes and I needed to find the first available empty place, so a new tile could be inserted.

Yup. That is precisely where you’d use CPI/CPIR.

catmeows · Post by **catmeows** » Fri Sep 27, 2019 3:35 pm

R-Tape wrote: ↑Fri Sep 27, 2019 2:21 pm The speed looks quite good to me, same as LDI/LDIR. CPI/R seem less likely to be used in fast game loops anyway.

Well, LDI decreases/increases 3 16bit registers, does one memory read, one memory write.
CPI decreases/increases 2 16bit registers, does one memory read and register comparison.
I was just thinking about that yesterday when I was trying to figure out why I don't use CPI etc.

Btw. Whenever I needed some buffer that would have empty slots, I kept a user stack with pointers to empty slots as a helper structure. That way an empty slot is always on top of the stack so I just pick it up. When I invalidate the data in a slot, I return the slot pointer to the top of the stack.
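One way such a helper stack could look, as a sketch only. `FreeTop`, `FreeStack`, `SLOTS`, `AllocSlot` and `FreeSlot` are hypothetical names, and the empty/full checks are left out:

```z80
SLOTS       EQU  18             ; hypothetical number of slots

; Out: DE = address of a free slot
; Assumption: the caller knows the stack is not empty
AllocSlot:
        LD   HL,(FreeTop)
        DEC  HL
        LD   D,(HL)             ; high byte of the slot pointer
        DEC  HL
        LD   E,(HL)             ; low byte of the slot pointer
        LD   (FreeTop),HL
        RET

; In:  DE = address of the slot being released
FreeSlot:
        LD   HL,(FreeTop)
        LD   (HL),E
        INC  HL
        LD   (HL),D
        INC  HL
        LD   (FreeTop),HL
        RET

FreeTop:
        DEFW FreeStack          ; next unused entry, the stack grows upward
FreeStack:
        DEFS SLOTS*2            ; one 16-bit pointer per slot
```

At start-up every slot address would be handed to `FreeSlot` once, so the stack begins full.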

Joefish · Post by **Joefish** » Fri Sep 27, 2019 4:37 pm

I could have used it in my compression routine, for looking to see if a data byte is already recorded in a dictionary of most common values. Except I want an 8-bit index into the dictionary, not an absolute address, so it's easier to just write my own compare routine than faff around with adjusting the answer from an absolute address to a relative value. And no, I'm not going to align everything to page boundaries!
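For what it's worth, the index can also be recovered from the counter instead of the address: after a match at index i, CPIR has decremented BC i+1 times. A sketch under that observation, with `Dict`, `DICT_SIZE` and `FindIndex` as made-up names:

```z80
DICT_SIZE   EQU  16             ; hypothetical dictionary length (1..256)

; In:  A = byte to look up
; Out: Z set   = found, A = 0-based index into Dict
;      Z reset = not in the dictionary
FindIndex:
        LD   HL,Dict
        LD   BC,DICT_SIZE
        CPIR
        RET  NZ                 ; not found
        LD   A,DICT_SIZE-1
        SUB  C                  ; BC = DICT_SIZE-1-index after the match
        CP   A                  ; force Z set again (SUB changed the flags)
        RET

Dict:
        DEFS DICT_SIZE
```

Whether this beats a custom compare loop depends on the rest of the routine.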

R-Tape · Post by **R-Tape** » Fri Sep 27, 2019 9:47 pm

catmeows wrote: ↑Fri Sep 27, 2019 3:35 pm Well, LDI decreases/increases 3 16bit registers, does one memory read, one memory write.
CPI decreases/increases 2 16bit registers, does one memory read and register comparison.
I was just thinking about that yesterday when I was trying to figure out why I don't use CPI etc.

Ah sorry, I see what you mean. I rarely think in terms of tstates and had to resort to Rodnay Zaks.

To summarise:

Code:

LDI	16t

equivalent to:
	ld a,(hl)	;7t
	ld (de),a	;7t
	inc hl		;6t
	inc de		;6t
	dec bc		;6t
	=32t

CPI	16t

equivalent to:
	cp (hl)		;7t
	inc hl		;6t
	dec bc		;6t
	=19t

I'm exposing my ignorance here, but can anyone explain why they have the same number of tstates, but one is clearly doing a lot more than the other?

(It might be easiest just to say "wiring".)

EDITED - removed stupid flowery prose

Ast A. Moore · Post by **Ast A. Moore** » Sat Sep 28, 2019 12:10 am

R-Tape wrote: ↑Fri Sep 27, 2019 9:47 pm can anyone explain why they have the same number of tstates, but one is clearly doing a lot more than the other?

Uh . . . The short answer is: it’s complicated.

The long answer is itself complicated.

You see, when analyzing these combo instructions, it’s best not to rewrite them in pseudocode like you did. Your pseudocode is correct, but only in breaking down the logic of the instruction. That is how the CPU arrives at the result, but that’s actually not what it’s doing.

A better way of breaking down any instruction is to think of it in machine cycles, not T states. Each machine cycle can take several T states, and each instruction takes at least one M cycle: the opcode fetch. The absolute minimum number of T states in a fetch M cycle is four. Some instructions take just that many T states (say, INC A). That’s how long it takes to place the PC register on the address bus and read the opcode. Extended instructions (prefixed by ED, CB, DD, and FD) take an additional 4 T states, because their opcodes are two bytes long. IX and IY bit instructions (prefixed by DDCB and FDCB) take even longer. Compare, for example, the regular LD HL,(**) instruction (opcode 2A; 16 T states) with its undocumented counterpart (opcode ED6B; 20 T states).

Now, each of the pseudocode instructions that you wrote out doesn’t need to be fetched and parsed individually; only one instruction fetch happens in either LDI or CPI. Since those are extended instructions (with the ED prefix), the fetch machine cycle for each takes 8 T states.
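Written out as machine cycles, using the per-cycle figures listed in the Zilog Z80 CPU User Manual (Spectrum memory contention ignored), the two instructions come out the same. The descriptions of the last cycle are a simplification:

```z80
; LDI  (opcode ED A0)   4 M cycles, 16 T states
;   M1  opcode fetch, ED                4 T
;   M2  opcode fetch, A0                4 T
;   M3  memory read from (HL)           3 T
;   M4  memory write to (DE)            5 T  (write plus internal work)
        LDI

; CPI  (opcode ED A1)   4 M cycles, 16 T states
;   M1  opcode fetch, ED                4 T
;   M2  opcode fetch, A1                4 T
;   M3  memory read from (HL)           3 T
;   M4  internal: compare and update    5 T
        CPI
```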

The next machine cycles (if they exist at all) are for moving data between the CPU and the RAM/ROM or other devices (I/O). They can take anywhere from three to five T states. Some instructions don’t move any data (INC A) and thus take much less time. Incrementing an index register, however, will take longer, because, say, INC IXh is an extended instruction; it takes another 4 T states to fetch the second byte of its opcode. Yet something like EX (SP),IX can take as many as six machine cycles and 23 T states (!) (two fetches, two memory reads and two writes, one for each byte).

The internal workings of the Z80 are not as easily broken down timing-wise and they do depend on numerous factors, including, as you put it, “the wiring.” Suffice it to say that actually incrementing a register (or register pair) doesn’t take up 6 T states. Moreover, increments and decrements can be grouped together and impose little to no overhead when executed simultaneously; they’re not necessarily cumulative. The incrementer/decrementer circuitry in the Z80 is quite clever and can do various things. It can also pass a value without incrementing or decrementing it; thus, similar to the WZ register pair, it can be used for storing data temporarily.

The HL/DE register pairs can be very easily swapped in hardware. In fact, they are not strictly speaking physically separate registers at all. Instructions like EX DE,HL don’t actually exchange data between DE and HL, but it sure looks like it to the programmer.

Some internal operations in the Z80 can be pipelined and thus overlap, but not all. For example, you can’t directly copy a value from one register to another (yes, even the LD B,C mnemonic is a lie). The operation must be done through the ALU. But the ALU in the Z80 is 4-bit, and using it for transferring data between 16-bit registers would be too slow. It’s much faster to use the incrementer/decrementer circuitry for that. Now, the ALU (and register) operations can finish while the CPU is fetching another instruction, but since that requires the incrementer/decrementer latch, if an instruction requires its use, it must be completed first before the next instruction can be fetched. This explains why INC A is faster than INC HL, for example. Block transfers (LDIR, CPIR, etc.) sure use the incrementer latch a lot.

Like I said, it’s complicated. Hopefully, I’ve now confused you beyond reason, and you have no desire to investigate the matter any further.

djnzx48 · Post by **djnzx48** » Sat Sep 28, 2019 12:17 am

Those instruction sequences aren't exactly equivalent, as CPI/LDI also set the P/V flag according to whether BC has reached zero.

If LDI and CPI take the same number of T-states, my guess is that CPI uses the same circuitry but doesn't output the value to memory, or maybe just writes it back to HL.

1024MAK · Post by **1024MAK** » Sat Sep 28, 2019 9:58 am

The Z80 MPU has a number of features (including clever ideas) that make it a bit unconventional.

Remember, the mnemonics are only there to help humans remember the effect of the instruction. They do not necessarily accurately indicate how the Z80 carries out the operation, as all sorts of hardware tricks take place. The exchange of alternate register sets is a good example. No copy/swap operation takes place, instead a single latch/flip-flop bit changes state to tell the Z80 which registers are the current in-use set.

The other thing to remember is that MPU/CPU design is closely tied in with memory performance. At the time that the Z80 was designed, DRAM and ROM memory chips were painfully slow (in fact, DRAM memory is still painfully slow, we have just come up with many more tricks to make it look a bit faster). Hence MPU/CPU designers avoided unnecessary memory accesses where they could. Memory was also very expensive. So again, instructions that did a lot of useful work for not many bytes of code were favoured, so that code could be compact.

One limiting factor with the Z80 design is that, because it was designed to run 8080 code, the flexibility of the instruction set was rather limited. Hence there are a lot of operations that take longer than they would need to with a clean-sheet design.

Mark

Juan F. Ramirez · Post by **Juan F. Ramirez** » Sat Sep 28, 2019 1:23 pm

I read the whole thread because of @R-Tape's meme, I don't usually wander around this kind of thread! No idea of coding!

Morkin · Post by **Morkin** » Sat Sep 28, 2019 2:26 pm

Juan F. Ramirez wrote: ↑Sat Sep 28, 2019 1:23 pm I read the whole thread because of @R-Tape's meme, I don't usually wander around this kind of thread! No idea of coding!

Heh - I'm the same with the hardware threads, enjoy reading them but no idea what anyone's talking about...

Spectrum Computing