Multiplying routine in assembler

sn3j · Post by **sn3j** » Sun Mar 10, 2024 3:44 pm

Here's a quick multiply routine. It utilizes a 256 byte lookup table.
I didn't find anything like this in the database, so I'm sharing it here.

The routine takes two numbers in the range 0..255 and returns the product in hl.
This takes 164 T states.
The algorithm decomposes the two numbers into 4 nibbles (half-bytes) and looks up 4 products according to the combinations (b.h,c.h), (b.h,c.l), (c.h,b.l) and (b.l,c.l).
The result is added up from all parts, meaning b*c = 256*b.h*c.h + 16*(b.h*c.l+c.h*b.l) + b.l*c.l where .h, .l address the respective nibble.

The attached tap file contains a Basic program with the code, the lookup table and a test.
https://drive.google.com/file/d/1xYcPam ... sp=sharing

Code: Select all

Mulu88                 ; multiply, hl := b*c  (changes: af, b, de, hl)
--
 4  ld a b
16  rlca x4
 4  xor c
 4  ld d a
 7  and 240
 4  xor c              ; (b.l,c.l)
 7  ld h MulTab.hi     ; 256-aligned table for xy -> x*y where x,y in 0..15
 4  ld l a
 7  ld e (hl)          ; e = b.l*c.l
 4  ld a d
 7  and 15
 4  xor c              ; (c.h,b.h)
 4  ld l a
 7  ld d (hl)          ; de = 256*b.h*c.h + b.l*c.l
 4  ld a b
 4  xor c
 7  and 15
 4  xor c              ; (c.h,b.l)
 4  ld l a
 4  ld a b
 4  xor c
 7  and 240
 4  xor c              ; (b.h,c.l)
 7  ld c (hl)          ; c = c.h*b.l
 4  ld l a
 7  ld a (hl)          ; a = b.h*c.l
 4  add c              ; (C,a) = b.h*c.l + c.h*b.l
16  rla x4
 4  ld l a
 4  rla
 7  and 31
 4  ld h a
 4  ld a l
 7  and 240
 4  ld l a             ; hl = 16 * (b.h*c.l + c.h*b.l)
11  add hl de          ; hl = 256*b.h*c.h + 16*(b.h*c.l+c.h*b.l) + b.l*c.l
    ret

ParadigmShifter · Post by **ParadigmShifter** » Sun Mar 10, 2024 4:37 pm

I saw this page today while I was looking for a fast multiply by various constants... wasn't what I was looking for but this uses a 512 byte lookup table and is 90T states by my reckoning, so very fast. But it only works for multiplying numbers in the range -64 to +63 looks like

(listing at the end, MULTROUT.ASC)

https://www.msxcomputermagazine.nl/mccw ... on/en.html

I didn't test it though.

I might try yours as well at some point... in SJOE I think I do some multiplies when doing score bonuses (so if you make 2 words at once you get x2 score, 3 words at once 3x score, etc., for each word in the chain) which I think I am doing by repeated adding (scoring doesn't need to be very fast for me though since it isn't doing much else at the time, it's animating piece removal and that's all). I use 24 bit maths at some point as well but I don't think I use that when doing that score multiplier, I think I only do that when making the total score, can't remember. Some other combo bonuses are x2, x4, x8, x16 though so I do those using shifts. I never multiply by more than 9 or so I think (most number of words you can make simultaneously) so the routine I linked to might be good, 512 bytes lookup table is a lot though (table is 16 bit lookup of (int)(x*x/4)).

Other useful maths stuff, they have an unrolled 8 bit multiply here

http://z80-heaven.wikidot.com/math#toc10

H_Times_E:
;Inputs:
; H,E
;Outputs:
; HL is the product
; D,B are 0
; A,E,C are preserved
;Size: 38 bytes
;Speed: 198+6b+9p-7s, b is the number of bits set in the input H, p is if it is odd, s is the upper bit of h
; average is 226.5 cycles (108.5 cycles saved)
; max required is 255 cycles (104 cycles saved)

which isn't bad for no lookup table.

You might be able to use the same technique as the first link is using if you store the squares of the nybbles/4 which uses the squares and differences of squares

(a+b)^2 - (a-b)^2

= a^2 + b^2 + 2ab - (a^2 + b^2 - 2ab)

= a^2 + b^2 + 2ab - a^2 - b^2 + 2ab

= 2ab + 2ab = 4ab

since 16^2/4 is 64 which fits into a byte.

PROSM · Post by **PROSM** » Mon Mar 11, 2024 12:42 am

ParadigmShifter wrote: ↑Sun Mar 10, 2024 4:37 pm I saw this page today while I was looking for a fast multiply by various constants... wasn't what I was looking for but this uses a 512 byte lookup table and is 90T states by my reckoning, so very fast. But it only works for multiplying numbers in the range -64 to +63 looks like

(listing at the end, MULTROUT.ASC)

https://www.msxcomputermagazine.nl/mccw ... on/en.html

If you don't need to support signed integers, that routine can be adapted to take any two unsigned numbers whose sum does not exceed 255, at the extra cost of 12 to 15 T-states and 4 bytes. It's something I figured out about a year after releasing a 3D game which used this routine quite heavily - prior to my realisation, I had been truncating my inputs to fit between 0 to 63 and then afterwards bodging on an extra binary digit of precision at an average cost of 55.5 T-states, but it was still faster than doing it without a table.

Code: Select all

    ld e,a
    sub d                       ; Work out (a-b)
    jr nc,$+4                   ; Jump forward if positive
    neg                         ; Negate to make this positive

    ld h,d_multab/256
    ld l,a                      ; Form address to quarter-square table
    ld c,(hl)
    inc h
    ld b,(hl)                   ; Load ((a-b)^2)/4 into BC

    ld a,e
    add a,d                     ; Work out (a+b)
    ld l,a
    ld e,(hl)
    dec h
    ld l,(hl)
    ld h,e                      ; Load ((a+b)^2)/4 into HL
    sbc hl,bc                   ; Result = ((a+b)^2)/4 - ((a-b)^2)/4
    
d_multab:
; Aligned to 256-byte page
    db $00, $00, $01, $02, $04, $06, $09, $0c, $10, $14, $19, $1e, $24, $2a, $31, $38
    db $40, $48, $51, $5a, $64, $6e, $79, $84, $90, $9c, $a9, $b6, $c4, $d2, $e1, $f0
    db $00, $10, $21, $32, $44, $56, $69, $7c, $90, $a4, $b9, $ce, $e4, $fa, $11, $28
    db $40, $58, $71, $8a, $a4, $be, $d9, $f4, $10, $2c, $49, $66, $84, $a2, $c1, $e0
    db $00, $20, $41, $62, $84, $a6, $c9, $ec, $10, $34, $59, $7e, $a4, $ca, $f1, $18
    db $40, $68, $91, $ba, $e4, $0e, $39, $64, $90, $bc, $e9, $16, $44, $72, $a1, $d0
    db $00, $30, $61, $92, $c4, $f6, $29, $5c, $90, $c4, $f9, $2e, $64, $9a, $d1, $08
    db $40, $78, $b1, $ea, $24, $5e, $99, $d4, $10, $4c, $89, $c6, $04, $42, $81, $c0
    db $00, $40, $81, $c2, $04, $46, $89, $cc, $10, $54, $99, $de, $24, $6a, $b1, $f8
    db $40, $88, $d1, $1a, $64, $ae, $f9, $44, $90, $dc, $29, $76, $c4, $12, $61, $b0
    db $00, $50, $a1, $f2, $44, $96, $e9, $3c, $90, $e4, $39, $8e, $e4, $3a, $91, $e8
    db $40, $98, $f1, $4a, $a4, $fe, $59, $b4, $10, $6c, $c9, $26, $84, $e2, $41, $a0
    db $00, $60, $c1, $22, $84, $e6, $49, $ac, $10, $74, $d9, $3e, $a4, $0a, $71, $d8
    db $40, $a8, $11, $7a, $e4, $4e, $b9, $24, $90, $fc, $69, $d6, $44, $b2, $21, $90
    db $00, $70, $e1, $52, $c4, $36, $a9, $1c, $90, $04, $79, $ee, $64, $da, $51, $c8
    db $40, $b8, $31, $aa, $24, $9e, $19, $94, $10, $8c, $09, $86, $04, $82, $01, $80
    
    db $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00
    db $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $00
    db $01, $01, $01, $01, $01, $01, $01, $01, $01, $01, $01, $01, $01, $01, $02, $02
    db $02, $02, $02, $02, $02, $02, $02, $02, $03, $03, $03, $03, $03, $03, $03, $03
    db $04, $04, $04, $04, $04, $04, $04, $04, $05, $05, $05, $05, $05, $05, $05, $06
    db $06, $06, $06, $06, $06, $07, $07, $07, $07, $07, $07, $08, $08, $08, $08, $08
    db $09, $09, $09, $09, $09, $09, $0a, $0a, $0a, $0a, $0a, $0b, $0b, $0b, $0b, $0c
    db $0c, $0c, $0c, $0c, $0d, $0d, $0d, $0d, $0e, $0e, $0e, $0e, $0f, $0f, $0f, $0f
    db $10, $10, $10, $10, $11, $11, $11, $11, $12, $12, $12, $12, $13, $13, $13, $13
    db $14, $14, $14, $15, $15, $15, $15, $16, $16, $16, $17, $17, $17, $18, $18, $18
    db $19, $19, $19, $19, $1a, $1a, $1a, $1b, $1b, $1b, $1c, $1c, $1c, $1d, $1d, $1d
    db $1e, $1e, $1e, $1f, $1f, $1f, $20, $20, $21, $21, $21, $22, $22, $22, $23, $23
    db $24, $24, $24, $25, $25, $25, $26, $26, $27, $27, $27, $28, $28, $29, $29, $29
    db $2a, $2a, $2b, $2b, $2b, $2c, $2c, $2d, $2d, $2d, $2e, $2e, $2f, $2f, $30, $30
    db $31, $31, $31, $32, $32, $33, $33, $34, $34, $35, $35, $35, $36, $36, $37, $37
    db $38, $38, $39, $39, $3a, $3a, $3b, $3b, $3c, $3c, $3d, $3d, $3e, $3e, $3f, $3f

ParadigmShifter · Post by **ParadigmShifter** » Mon Mar 11, 2024 1:33 am

Yeah supporting signed by default seems a bit weird. CPUs that support integer multiplication usually have 2 opcodes, one for signed and one for unsigned.

You should probably have 2 entry points one for where you don't need to do the check then. You don't need it if you always pass largest value in A either I suppose. EDIT: Hmm no it will still go wrong if the sum overflows as well as the difference so it doesn't make any odds.

I'm ok with doing multiplying by constants for now though and any small multiplies I do with a few repeated additions. The conversion of the score to ASCII is probably slower than the repeated additions. Was thinking of doing packed BCD instead I will have a look next time I work on SJOE.

I multiply by 11 quite a bit (width of my collision map) and I also might want to multiply by 24 if I reduce the size of my sprites (currently 16x16 pixels = 32 bytes which always have the same bottom and top 2 rows so I only really need to have 32 - 8 = 24 bytes of data for a 16x16 sprite then (which will save at least 26x4 bytes in the code for the letters A-Z and I have a few extra characters like a space, a cursor, icons for a few things, etc.) so it might be worth it for me to reduce the data size and take the speed hit of multiplying by 24 instead of 32.

I might use a lookup table instead for multiples of 11 since my collision map only has 16 rows (only 15 actually used, bottom row is all #FF for "this is a floor"), so 11x16 fits into a byte.

EDIT: There's also fast versions of a MOD k (for constant k, obviously not a power of 2, use AND for that obvs) which boil down to doing a multiply (and magic constants) so a fast multiply for that might be handy as well. I use a reverse lookup table I think for some things to go from collision map pointer to collision map row when I really just need offset MOD 11.

Not using that for my current engine though except for drawing the debug collision minimap.

Maybe I should have used a 16 wide collision map and packed the unused values with other data though

EDIT2: Hmm thinking about it I really should do that... always trying to save some microjiffies lol. Although a linear search is probably faster if the collision map is the correct width (not that it is, there's dummy bytes of #FF either side for "this is a wall" as well, but that's only 2 x INC L to skip walls instead of adding on a value to get to the next multiple of 16 so maybe it's not worth it... I do quite a lot of linear searches too). Having the walls in the collision map makes piece movement and collision a lot simpler (no need to check for going out of bounds).

Sometimes it would be nice to have access to the half carry flag I guess which detects a carry from bit 3 into bit 4, that's only used by DAA I think and you can't test it normally

EDIT3: Wow DAA is fast for the amount of stuff it does, only 4T. Maybe there is a way to use that to advance to the next multiple of 16 using DAA.

grelf · Post by **grelf** » Mon Mar 11, 2024 9:12 am

This is by the way but...

When I wrote The Forest in the early 80s (1983 for the second version, for ZX Spectrum) I used 3-byte fixed-point numbers (such as DE.B). This enabled players to move not just on grid squares but by arbitrary distances on any bearing, so they could be positioned to the nearest 1/256th metre. Movement by sines and cosines of the bearing involved look-up tables for those trig values. It also involved multiplication of course, so I wrote multiplication routines for the signed 3-byte numbers. The code can be found if you search for The Forest (publisher Phipps) on this site and then look in the details for TheForest_DesignMaterial.pdf . It's a big file but search for MUL212 and MULCO as the labels in the assembly source. A few years later when I got the ROM disassembly book from Melbourne House (which I still have) I could see that it was admitted that the floating point routines were very slow, so I was pleased to have done the best thing.

You can go straight to the PDF file at https://ia902601.us.archive.org/6/items ... terial.pdf

arjun · Post by **arjun** » Tue Mar 12, 2024 6:09 am

grelf wrote: ↑Mon Mar 11, 2024 9:12 am This is by the way but...

When I wrote The Forest in the early 80s (1983 for the second version, for ZX Spectrum)

Very cool! Always nice to run into an author of original software from the 80's. How was your development experience back then? Did you make enough money to buy a Ferrari?

I notice that there are only two games attributed to you in the archives. I gather making games for the speccy wasn't your full time occupation then?

grelf · Post by **grelf** » Tue Mar 12, 2024 8:05 am

arjun wrote: ↑Tue Mar 12, 2024 6:09 am I gather making games for the speccy wasn't your full time occupation then?

Your deduction is correct. It was just a hobby and made some pocket money. What I was doing was not mass-market, it was for my own interest and I learnt a lot by doing it - taught myself assembler for a start.

Spectrum Computing

Multiplying routine in assembler

Multiplying routine in assembler

Re: Multiplying routine in assembler

Re: Multiplying routine in assembler

Re: Multiplying routine in assembler

Re: Multiplying routine in assembler

Re: Multiplying routine in assembler

Re: Multiplying routine in assembler