Euclidean dist/Sqrt/Multiplication routines 16bit

sn3j · Post by **sn3j** » Sat Mar 30, 2024 7:56 pm

Hi again, I've spent some time to build a fast assembler routine which computes the Euclidean distance, that is Sqrt(x² + y²), where x and y are unsigned 8-bit integers.
The routine requires 3*256 byte lookup tables and runs at ~311 T on average.
As a byproduct, so to say, it features a fast square root routine which uses ~238 T on average (profiled with FUSE over all 64k inputs).
The result is returned in BC, fixed point, where B contains the integer part and C the fractional part (so e.g. in Basic the float result is obtained by USR EucDis/256).

I tried to keep the errors as small as possible without wasting too many cycles. For numbers <256 the fractional part contains four bits (.5, .25, .125, .0625) precision, for numbers >=256 there's three bits precision with the 4th being rounded into the 3rd.

The maximum absolute error is below 0.075 and the maximum relative error is ~1.6% at SQRT(2). The mean relative error is about 0.0235%.

I've also added a fast multiplication routine which takes two 8-bit unsigned values and gets the result in 136(139) T if operands x+y<256 and 149(152) T if x+y>=256 (+3T if x<y). It uses the trick where (x-y)² is subtracted from (x+y)² and works with all unsigned 8-bit inputs. So at least two of the bulky lookup tables can be reused here.

The code for EuclDist is around 170 bytes, but could be reduced by ~50 bytes for a lower-precision variant for values in the range 256..1023.

In the attached .tap file you'll find a Basic program which pokes the lookup tables and the code, and features a little test suite to check the results by calling the respective USR.
If you run the tests it is advised to speed up the emulator.

eucldist.tap
https://drive.google.com/file/d/1s-_zrK ... sp=sharing

ParadigmShifter · Post by **ParadigmShifter** » Sat Mar 30, 2024 8:35 pm

Depends on what you use it for I guess... 8 bits precision for x and y (and not having z for 3d) might limit its usefulness.

Lots of stuff in 2d doesn't require actual distance anyway or even square roots you can often just compare the values before you take the square root, that's because if a and b are positive numbers (as Euclidean distances are) then

a < b if and only if a^2 < b^2 (same for =, >, <=, >=, != as well).

You don't need any trig or floating/fixed point maths to draw circles either (you don't even need a multiplication function).

If you give people fast Euclidean distance they might start using it when they don't need to is basically what I am saying

Most geometry stuff can be done with dot products and no trig either.

What you do need distance for is if you display distance for ranges (but can probably precalculate a lot of stuff maybe) and the main use is for normalising vectors.

There was also discussion in the pathfinding thread about alternative metrics (distance functions) for stuff like tile based strategy games and pathfinding.

EDIT: How are you estimating your square root anyway? Back in the day it was a lot faster to estimate 1/sqrt(x) instead (which is also better for normalising) and if you need the square root you can multiply it by x. Newton's method converges very quickly if you estimate using first guess of x/2

sn3j · Post by **sn3j** » Sun Mar 31, 2024 11:26 am

ParadigmShifter wrote: ↑Sat Mar 30, 2024 8:35 pm Depends on what you use it for I guess... 8 bits precision for x and y (and not having z for 3d) might limit its usefulness.

Lots of stuff in 2d doesn't require actual distance anyway or even square roots you can often just compare the values before you take the square root, that's because if a and b are positive numbers (as Euclidean distances are) then

a < b if and only if a^2 < b^2 (same for =, >, <=, >=, != as well).

You don't need any trig or floating/fixed point maths to draw circles either (you don't even need a multiplication function).

If you give people fast Euclidean distance they might start using it when they don't need to is basically what I am saying

Most geometry stuff can be done with dot products and no trig either.

What you do need distance for is if you display distance for ranges (but can probably precalculate a lot of stuff maybe) and the main use is for normalising vectors.

There was also discussion in the pathfinding thread about alternative metrics (distance functions) for stuff like tile based strategy games and pathfinding.

Of course you can just use the squares and spare this operation, that is working with x²+y² and compare with d², but this won't work when building up paths in a cumulative way.
For example I'd need it for my A* algorithm as an admissive heuristic function. I did use the tile-distance (Chebyshev+Taxicab) as metric but the results are not good enough. Euclidean dist would be just the right metric.

ParadigmShifter wrote: ↑Sat Mar 30, 2024 8:35 pm EDIT: How are you estimating your square root anyway? Back in the day it was a lot faster to estimate 1/sqrt(x) instead (which is also better for normalising) and if you need the square root you can multiply it by x. Newton's method converges very quickly if you estimate using first guess of x/2

Given a 16-bit number num, I am looking up the nearest lower square with the help of a reverse lookup table, which is the illustration (num.hi) -> &Squares[r] where (r²).hi >= num.hi. (.hi denoting the hi-byte)
The table doesn't provide the exact entry, depending on the lo-byte, but in most cases the right entry is at most one step away.
Once I got the nearest lower square I use the distance to it to generate the fractional bits, as the distance is known to be in the range [0..2*r+1).
It's an interpolation with the tangent with slope 2(r+1) at this point, which works sufficiently well with roots >=32. For smaller numbers (hi-bytes 0..3) I can derive the root integer part and 3-4 fractional bits from the reverse lookup table without interpolation.
I didn't want to add any more approximation steps to keep the speed at a reasonable pace, and for roots <32 the solution cannot be improved anymore without using more frac bits.

I've read about the Inverse Root method https://en.wikipedia.org/wiki/Fast_inverse_square_root though and it might be an alternative to explore x*1/sqrt(x), like you sketched out. But I'm a bit in doubt this could be done sufficiently fast.

ParadigmShifter · Post by **ParadigmShifter** » Sun Mar 31, 2024 9:00 pm

I didn't realise it was you who was asking about the pathfinding metric. I gave you a (probably slow, but can precalc) way to find the exact tile distance using the octants thing in that thread which was of the form d = a + b sqrt(2) (but didn't use any multiplies or divides except multiplying by sqrt(2) I think).

Can't you just us a lookup table of the exact value (stored as an 8.8 bits precision fixed point number)? Your grid size is small 32x40. You'd only need a quarter of that (plus 1 extra row and column) since d(x ,y) = d(y, x).

You can also save nearly half the memory again by using a triangular table (makes the indexing harder) since the table is symmetric along the line x = y which I tried to do in BASIC but never got round to fixing the bug in that thread.

As in you don't need to store all the numbers for a symmetric lookup table e.g. you can store (this is just an example, not actual values)

Code: Select all

by only storing the upper diagonal part

Code: Select all

1 2 3 4
x 5 6 7
x x 8 9
x x x 10

where x are duplicate values by the symmetry so don't need to be stored

so you can store it as

1 2 3 4 5 6 7 8 9 10

If you do that the access function is more complicated but it's not hard to write (I did manage to fail to write a working one in BASIC though lol but that's because BASIC is annoying and harder to debug than ASM code lol (at least that was my excuse).

There's lots of C code about explaining how to do the accessor function (search for symmetric matrix indexing or something like that).

EDIT: Storing the lower diagonal part is probably easier for the accessor though on second thoughts. Or transposing the table, not had enough coffee to think about that at the moment

sn3j · Post by **sn3j** » Sun Mar 31, 2024 9:54 pm

Btw here's the listing for the multiplication routine with all the markings (there's some comma missing in some instructions, it's all done in a text editor...)

Code: Select all

MuluDE                 ; de = x*y, using square lookups
--                     ; changes: a,hl
 4  ld a d
 4  sub e
12  jr nc 2
 3   neg
 7  ld h Squares.hi    ; a 256-aligned table of squares from 0 to 255, big-endian, lo-bytes at Squares.hi+1
 4  ld l a
 4  ld a d
 4  add e              ; C,a = x+y
 7  ld d (hl)
 4  inc h              ; Squares lo-bytes
 7  ld e (hl)          ; de = (x-y)²
 4  ld l a
 7  ld a (hl)
12  jr c L1            ; if x+y > 255 go to L1
80|83
-1  sub e
 4  ld e a
 4  dec h              ; Squares hi-bytes
 7  ld a (hl)
 4  sbc d              ; ae = (x+y)²-(x-y)² = 4*x*y
 4  rra
 8  rr e
 4  rra
 8  rr e
 4  ld d a             ; de = ae / 4 = x*y
10  ret
56
(80|83) + 56 = 136|139
  L1:                  ; case x+y>255
(80|83)
                       ; (x+y)² = (256+k)² = 65536+512*k+k², k>=0
 4  sub e
 4  ld e a
 4  dec h              ; Squares hi-bytes
 7  ld a (hl)
 4  sbc d              ; ae = k² - (x-y)²
 4  ccf                ; C,ae = 65536 + k² - (x-y)²
 4  rra
 8  rr e               ; ae = (65536 + k² - (x-y)²) / 2
 4  add l              ; C,ae = (65536+k² - (x-y)²)/2 + 256*k
 4  rra
 8  rr e
 4  ld d a             ; de = (65536+512*k+k² - (x-y)²)/4 = x*y
10  ret
69
(80|83) + 69 = 149|152

sn3j · Post by **sn3j** » Sun Mar 31, 2024 9:58 pm

ParadigmShifter wrote: ↑Sun Mar 31, 2024 9:00 pm I didn't realise it was you who was asking about the pathfinding metric. I gave you a (probably slow, but can precalc) way to find the exact tile distance using the octants thing in that thread which was of the form d = a + b sqrt(2) (but didn't use any multiplies or divides except multiplying by sqrt(2) I think).

Can't you just us a lookup table of the exact value (stored as an 8.8 bits precision fixed point number)? Your grid size is small 32x40. You'd only need a quarter of that (plus 1 extra row and column) since d(x ,y) = d(y, x).

You can also save nearly half the memory again by using a triangular table (makes the indexing harder) since the table is symmetric along the line x = y which I tried to do in BASIC but never got round to fixing the bug in that thread.

As in you don't need to store all the numbers for a symmetric lookup table e.g. you can store (this is just an example, not actual values)
Code: Select all
1 2 3 4
2 5 6 7
3 6 8 9
4 7 9 10
by only storing the upper diagonal part
Code: Select all
1 2 3 4
x 5 6 7
x x 8 9
x x x 10

where x are duplicate values by the symmetry so don't need to be stored

so you can store it as

1 2 3 4 5 6 7 8 9 10
If you do that the access function is more complicated but it's not hard to write (I did manage to fail to write a working one in BASIC though lol but that's because BASIC is annoying and harder to debug than ASM code lol (at least that was my excuse).

There's lots of C code about explaining how to do the accessor function (search for symmetric matrix indexing or something like that).

EDIT: Storing the lower diagonal part is probably easier for the accessor though on second thoughts. Or transposing the table, not had enough coffee to think about that at the moment

Accessing a triangular table would require to store the row start index as well, I think. That's an additional LD A,(HL) and ADD X, no big deal.
Unfortunately my A* algorithm is planned for a big map that works with 8-bit unsigned coordinates. But I'll give this a thought.
EDIT:
A* requires you to build a set of distance estimates from goal node to all other nodes. Since the goal node may be arbitrary you won't be able to get the estimates from a precomputed lookup table, there's just too many combinations of (start,goal) tuples.

ParadigmShifter · Post by **ParadigmShifter** » Sun Mar 31, 2024 10:08 pm

It basically boils down to a triangular number which is easy to calculate, maybe have a look here (not had a good look at the code). EDIT: include the link lol https://stackoverflow.com/questions/191 ... c-matrices that was just first one I found

k = j + n*i - i*(i-1) /2

looks like it will work, although I think that can be factorised to only use 1 multiplication? You have to make sure i <= j (or may be other way around).

EDIT: Yeah you can take out a factor of i in the part after "j +" to reduce it to only 1 multiply (and a divide by 2 which is =just a shift obvs).

You can also work out the nth triangular number pretty fast by accumulating 1 + 2 + 3 + ... + n if multiplication is slower than a loop (there's probably a cut off point where multiplication outperforms i=1, loop:add i, inc i, loop though).

ParadigmShifter · Post by **ParadigmShifter** » Sun Mar 31, 2024 10:17 pm

sn3j wrote: ↑Sun Mar 31, 2024 9:58 pm
EDIT:
A* requires you to build a set of distance estimates from goal node to all other nodes. Since the goal node may be arbitrary you won't be able to get the estimates from a precomputed lookup table, there's just too many combinations of (start,goal) tuples.

There's not that many because of the symmetry though, you only need to store 1/8th of the values plus a few extra (along 1 axis and along 1 diagonal). EDIT: My 4x4 example reduces to 10 values but that can be used to lookup the answer for a 7x7 array because of symmetry so 49 entries the obvious way reduces to just 10 (and you can probably have 1 less lookup because you know dist(goal, goal) = 0). You also know the values along the axis without having to look them up (it is orthogonal distance * distance travelled = ortho dist * row), but that might not be a speedup (just a way to reduce the table size). You also know the values along the diagonal too since it's just diagonal dist * row).

EDIT2: Ok you need to store more than that since you need to take into account pathing from one corner to the opposite one, so it's about 1/4 of the size not 1/8th I suppose. It's worth considering for small map sizes though.

A* only needs an estimate it does not have to be exact (but I think you weren't checking nodes to see if they were already in the open list when you updated cost estimates though so it may miss a shorter route to a node you already visited if you do that I think).

It's always worth implementing stuff like that in C or C++ or C# or whatever since the algorithm isn't that hard and once you have a C version you are happy with it's pretty straightforward to port to Z80 ASM.

EDIT3: Another not bad optimisation may be to cache some of the recent results especially if you end up calculating the same distances over and over again.

I remember I used that for a PS1 game where we cached normal vectors we had recently calculated which gave a good speedup (although for some reason we didn't stash the normals of each poly in the map format, probably to save space...)

EDIT4: Although can't you just work out the distance once per goal and as you move orthogonally or diagonally each step just adjust the distance as you go? So if you move 1 tile orthogonally closer you reduce the distance by orthodist (moving orthogonally away adds orthodist) similarly for diagonal movement towards/away from the goal? EDIT5: Probably not thinking about it since if you move perpendicular to the goal there's a shorter distance by moving diagonally next step instead. Well it was just a thought

ParadigmShifter · Post by **ParadigmShifter** » Sun Mar 31, 2024 11:51 pm

sn3j wrote: ↑Sun Mar 31, 2024 9:54 pm Btw here's the listing for the multiplication routine with all the markings (there's some comma missing in some instructions, it's all done in a text editor...)

Code: Select all

MuluDE                 ; de = x*y, using square lookups
--                     ; changes: a,hl
 4  ld a d
 4  sub e
12  jr nc 2
 3   neg
 7  ld h Squares.hi    ; a 256-aligned table of squares from 0 to 255, big-endian, lo-bytes at Squares.hi+1
 4  ld l a
 4  ld a d
 4  add e              ; C,a = x+y
 7  ld d (hl)
 4  inc h              ; Squares lo-bytes
 7  ld e (hl)          ; de = (x-y)²
 4  ld l a
 7  ld a (hl)
12  jr c L1            ; if x+y > 255 go to L1
80|83
-1  sub e
 4  ld e a
 4  dec h              ; Squares hi-bytes
 7  ld a (hl)
 4  sbc d              ; ae = (x+y)²-(x-y)² = 4*x*y
 4  rra
 8  rr e
 4  rra
 8  rr e
 4  ld d a             ; de = ae / 4 = x*y
10  ret
56
(80|83) + 56 = 136|139
  L1:                  ; case x+y>255
(80|83)
                       ; (x+y)² = (256+k)² = 65536+512*k+k², k>=0
 4  sub e
 4  ld e a
 4  dec h              ; Squares hi-bytes
 7  ld a (hl)
 4  sbc d              ; ae = k² - (x-y)²
 4  ccf                ; C,ae = 65536 + k² - (x-y)²
 4  rra
 8  rr e               ; ae = (65536 + k² - (x-y)²) / 2
 4  add l              ; C,ae = (65536+k² - (x-y)²)/2 + 256*k
 4  rra
 8  rr e
 4  ld d a             ; de = (65536+512*k+k² - (x-y)²)/4 = x*y
10  ret
69
(80|83) + 69 = 149|152

That seems similar to the very fast multiplication I linked to in the multiply thread... you can store x^2/4 in the table instead to save some rolls I think?

see at the end here

https://www.msxcomputermagazine.nl/mccw ... on/en.html

although his remark "rest assured that this doesn't affect the final result" is a bit too hand-wavy/"we see immediately that this is the case" for my liking, would want to test it works for the entire domain of your function, or instead prove his claim using algebra.

sn3j · Post by **sn3j** » Mon Apr 01, 2024 12:38 pm

ParadigmShifter wrote: ↑Sun Mar 31, 2024 11:51 pm That seems similar to the very fast multiplication I linked to in the multiply thread... you can store x^2/4 in the table instead to save some rolls I think?

see at the end here

https://www.msxcomputermagazine.nl/mccw ... on/en.html

although his remark "rest assured that this doesn't affect the final result" is a bit too hand-wavy/"we see immediately that this is the case" for my liking, would want to test it works for the entire domain of your function, or instead prove his claim using algebra.

Yes, I was following this and another article which prominently pop up when searching for Z80 multiplication.
I was eager to take the lookup tables from the Euclidean dist algo to another good use, so I wound up the multiplication routine for unsigned 8-bit operands (instead of the signed ones with limited distance).
Storing n²/2 or n²/4 in the tables wouldn't work with the other algorithm so I kept it as is. (Can't build x²+y² for small odd numbers.)

Why it works with n²/2 is easily shown:
If n is even, then n² is even. Likewise if n is odd, then n² is odd.
Now when x+y is odd (even), then x-y is also odd (even).
So it's safe to subtract INT( (x-y)²/2 ) from INT( (x+y)²/2 ), we just don't need the lowest bit.
I can't show it for n²/4, though. :/
And I agree the explanation given in the article sounds a bit sloppy. I guess the author did run an exhaustive test instead of coming up with a solid proof, and then made some big words about it.

Timmy · Post by **Timmy** » Mon Apr 01, 2024 1:18 pm

Just wanted to say I like the idea. I'm glad you're writing it and publishing it.

Although, as PS already had said, after I used a sqrt function a few times, I realised I never actually needed the distance, as I always only needed a distance comparison. Perhaps that's because of the things I've been using.

After that, I only keep the original number instead of the root.

ParadigmShifter · Post by **ParadigmShifter** » Mon Apr 01, 2024 4:32 pm

sn3j wrote: ↑Mon Apr 01, 2024 12:38 pm Yes, I was following this and another article which prominently pop up when searching for Z80 multiplication.
I was eager to take the lookup tables from the Euclidean dist algo to another good use, so I wound up the multiplication routine for unsigned 8-bit operands (instead of the signed ones with limited distance).
Storing n²/2 or n²/4 in the tables wouldn't work with the other algorithm so I kept it as is. (Can't build x²+y² for small odd numbers.)

Why it works with n²/2 is easily shown:
If n is even, then n² is even. Likewise if n is odd, then n² is odd.
Now when x+y is odd (even), then x-y is also odd (even).
So it's safe to subtract INT( (x-y)²/2 ) from INT( (x+y)²/2 ), we just don't need the lowest bit.
I can't show it for n²/4, though. :/
And I agree the explanation given in the article sounds a bit sloppy. I guess the author did run an exhaustive test instead of coming up with a solid proof, and then made some big words about it.

If n is odd => n^2 ends with binary 01 might help I suppose? Regardless of whether n nod 4 = 1 or whether n mod 4 = 3

since
(2k+1)^2 = 4k^2 + 4k + 1 => 1 above a multiple of 4
(2k+3)^2 = (2k+3)(2k+3) = 4k^2 + 12k + 9 = 4(k^2 + 3k + 2) + 1 => 1 above a multiple of 4

EDIT: Also
https://math.stackexchange.com/question ... is-3-mod-4

EDIT2: Defo need more coffee lol

So all square numbers end with 00 in binary (if n is even) or 01 in binary (if n is odd). So x^2 + y^2 ends with either 00, 01 or 10 binary depending on parity of x and y

x even, y even => x^2 + y^2 == 0 mod 4
x even, y odd or vice versa => x^2 + y^2 == 1 mod 4
x, y both odd => x^2 + y^2 == 2 mod 4

sn3j · Post by **sn3j** » Mon Apr 01, 2024 8:26 pm

ParadigmShifter wrote: ↑Mon Apr 01, 2024 4:32 pm If n is odd => n^2 ends with binary 01 might help I suppose? Regardless of whether n nod 4 = 1 or whether n mod 4 = 3

since
(2k+1)^2 = 4k^2 + 4k + 1 => 1 above a multiple of 4
(2k+3)^2 = (2k+3)(2k+3) = 4k^2 + 12k + 9 = 4(k^2 + 3k + 2) + 1 => 1 above a multiple of 4

EDIT: Also
https://math.stackexchange.com/question ... is-3-mod-4

EDIT2: Defo need more coffee lol

So all square numbers end with 00 in binary (if n is even) or 01 in binary (if n is odd). So x^2 + y^2 ends with either 00, 01 or 10 binary depending on parity of x and y

x even, y even => x^2 + y^2 == 0 mod 4
x even, y odd or vice versa => x^2 + y^2 == 1 mod 4
x, y both odd => x^2 + y^2 == 2 mod 4

Great. If all squares end in 01 or 00 this would mean it's safe to store n²/4 in the lookup table, because bit 0 (odd/even) can be taken from n (when it comes to compute the Eucliden dist, where you need the exact square to make up the sum). This would leave 2 spare bits for each entry for something else.

ParadigmShifter · Post by **ParadigmShifter** » Mon Apr 01, 2024 8:31 pm

Well it doesn't prove his claim that the spare bits don't matter (for the multiplication algorithm) but I can't think of a way to prove that except by tediously working through all 16 cases (x, y = 4k, 4k+1, 4k+2, 4k+3). EDIT: 10 cases probably because you can assume without loss of generality that x mod 4 <= y mod 4. Floor isn't a nice operator to work with really.

But yes you can reconstruct the lower 2 bits by checking bit 0 of x. (You can also construct lower 2 bits of x^2 + y^2 if you need to as well as per my edit). It's well known that no sum of 2 squares is == 3 mod 4 I hinted at that in the thread about pathfinding. EDIT: It's a bit of a shame that BIT 0, r does not also set the carry flag cos you could then do an ADC instead of a conditional branch for odd/even. Note that BIT 0, r is faster than LD A, r \ AND 1 and it doesn't trash A either.

EDIT: In my proof that x^2 == 0 or 1 mod 4 I should really have used 4k+1 and 4k+3 (or 4k-1 == 4k+3 mod 3) but it doesn't really matter since both cases were covered (one of 2k+1 and 2k+3 is == 1 mod 4, the other one is == 3 mod 4). EDIT: Actually I only needed to prove (2k+1)^2 == 1 mod 4 since that covers all odd numbers anyway!

I also didn't prove that if n is even then n^2 == 0 mod 4 either but that is easy: n even => n = 2k => n^2 = 4k^2 == 0 mod 4

ParadigmShifter · Post by **ParadigmShifter** » Mon Apr 01, 2024 10:27 pm

If you've finished with the value in the register anyway you can use ADC to increment based on oddness to avoid a branch... if value was in D reg say

LD A, D
RRC D
ADC 0

Would do that. RLC D to get the value back though if you still need it. That only works for 8 bit maths though I suppose. Would be nice if there was an ADC HL, nn in the instruction set :/

EDIT: Not that that should matter since if it's the lower byte of a square number it will never be 255 since 255 mod 4 = 3 which is not possible from earlier results so the ADC 0 will never cause a carry into the high byte.

ParadigmShifter · Post by **ParadigmShifter** » Tue Apr 02, 2024 10:30 pm

Is your table of (int)x^2/4 the same as the one in the article now?

Code: Select all

MULTAB	DB	0, 0, 1, 2, 4, 6, 9, 12, 16, 20
	DB	25, 30, 36, 42, 49, 56, 64, 72, 81, 90
	DB	100, 110, 121, 132, 144, 156, 169, 182, 196, 210
	DB	225, 240, 0, 16, 33, 50, 68, 86, 105, 124
	DB	144, 164, 185, 206, 228, 250, 17, 40, 64, 88
	DB	113, 138, 164, 190, 217, 244, 16, 44, 73, 102
	DB	132, 162, 193, 224, 0, 32, 65, 98, 132, 166
	DB	201, 236, 16, 52, 89, 126, 164, 202, 241, 24
	DB	64, 104, 145, 186, 228, 14, 57, 100, 144, 188
	DB	233, 22, 68, 114, 161, 208, 0, 48, 97, 146
	DB	196, 246, 41, 92, 144, 196, 249, 46, 100, 154
	DB	209, 8, 64, 120, 177, 234, 36, 94, 153, 212
	DB	16, 76, 137, 198, 4, 66, 129, 192

	DB	0, 192, 129, 66, 4, 198, 137, 76, 16, 212
	DB	153, 94, 36, 234, 177, 120, 64, 8, 209, 154
	DB	100, 46, 249, 196, 144, 92, 41, 246, 196, 146
	DB	97, 48, 0, 208, 161, 114, 68, 22, 233, 188
	DB	144, 100, 57, 14, 228, 186, 145, 104, 64, 24
	DB	241, 202, 164, 126, 89, 52, 16, 236, 201, 166
	DB	132, 98, 65, 32, 0, 224, 193, 162, 132, 102
	DB	73, 44, 16, 244, 217, 190, 164, 138, 113, 88
	DB	64, 40, 17, 250, 228, 206, 185, 164, 144, 124
	DB	105, 86, 68, 50, 33, 16, 0, 240, 225, 210
	DB	196, 182, 169, 156, 144, 132, 121, 110, 100, 90
	DB	81, 72, 64, 56, 49, 42, 36, 30, 25, 20
	DB	16, 12, 9, 6, 4, 2, 1, 0

	DB	0, 0, 0, 0, 0, 0, 0, 0, 0, 0
	DB	0, 0, 0, 0, 0, 0, 0, 0, 0, 0
	DB	0, 0, 0, 0, 0, 0, 0, 0, 0, 0
	DB	0, 0, 1, 1, 1, 1, 1, 1, 1, 1
	DB	1, 1, 1, 1, 1, 1, 2, 2, 2, 2
	DB	2, 2, 2, 2, 2, 2, 3, 3, 3, 3
	DB	3, 3, 3, 3, 4, 4, 4, 4, 4, 4
	DB	4, 4, 5, 5, 5, 5, 5, 5, 5, 6
	DB	6, 6, 6, 6, 6, 7, 7, 7, 7, 7
	DB	7, 8, 8, 8, 8, 8, 9, 9, 9, 9
	DB	9, 9, 10, 10, 10, 10, 10, 11, 11, 11
	DB	11, 12, 12, 12, 12, 12, 13, 13, 13, 13
	DB	14, 14, 14, 14, 15, 15, 15, 15

	DB	16, 15, 15, 15, 15, 14, 14, 14, 14, 13
	DB	13, 13, 13, 12, 12, 12, 12, 12, 11, 11
	DB	11, 11, 10, 10, 10, 10, 10, 9, 9, 9
	DB	9, 9, 9, 8, 8, 8, 8, 8, 7, 7
	DB	7, 7, 7, 7, 6, 6, 6, 6, 6, 6
	DB	5, 5, 5, 5, 5, 5, 5, 4, 4, 4
	DB	4, 4, 4, 4, 4, 3, 3, 3, 3, 3
	DB	3, 3, 3, 2, 2, 2, 2, 2, 2, 2
	DB	2, 2, 2, 1, 1, 1, 1, 1, 1, 1
	DB	1, 1, 1, 1, 1, 1, 1, 0, 0, 0
	DB	0, 0, 0, 0, 0, 0, 0, 0, 0, 0
	DB	0, 0, 0, 0, 0, 0, 0, 0, 0, 0
	DB	0, 0, 0, 0, 0, 0, 0, 0

Because I noticed there's A LOT of redundancy in that table. The 128th entry is 0 and entries 129-255 seem to just be the values from index 127 to index 1? (i.e. entry 129+k = entry 127-k)

Same for the high byte... 128th entry is 16, and it's then the same entries in reverse order?

You could special case the 128th and 256+128th entry (hardcode them) and halve the size of the table (slight speed hit though)? Or you could fill the redundant data with other data or code instead if you want high byte to still be at offset base + 256 (it is a good idea to do that though)

I didn't check that the pattern definitely holds but it certainly looks like it to me. So it's kind of like how you only really need 1/4 of a sine lookup table since you can use symmetry for the remaining entries.

EDIT: Or is that because he is storing negative squares as well? In which case that is silly lol

It doesn't make sense for the table to be symmetric for x^2/4. Double the domain size instead lol. It looks like that is what he is doing. Noooooooooooooob alert. His routine is nice and fast though lol

EDIT2: So ignore what I said pre-edit, he's just done his table in a very wasteful manner there.

You also don't really need to store an entry for 0 either just special case that? Then you can use the full domain since 255^2 < 65536. You could also special case 1 since multiplying by 1 or 0 is pointless really the answer can be returned immediately (again, slight speed hit). (duh you can use the full domain anyway since you have 256 entries to play with... it's probably worth special casing/not even bothering calling the multiply function if one of the arguments is 0 or 1 though?). If you know either 0 or 1 or both will never be one of the arguments, even better, don't check. All you save is some bytes in the table you can use for other variables though.

sn3j · Post by **sn3j** » Wed Apr 03, 2024 8:17 pm

ParadigmShifter wrote: ↑Tue Apr 02, 2024 10:30 pm Is your table of (int)x^2/4 the same as the one in the article now?
...
You could special case the 128th and 256+128th entry (hardcode them) and halve the size of the table (slight speed hit though)? Or you could fill the redundant data with other data or code instead if you want high byte to still be at offset base + 256 (it is a good idea to do that though)

I was toying around with the idea from above to spare two bits, but found the savings are not worth to lower the Sqrt routine's performance just to gain 12T or so with the multiplication.
So the tables and the code are still like in the published .tap file.

ParadigmShifter wrote: ↑Tue Apr 02, 2024 10:30 pm EDIT: Or is that because he is storing negative squares as well? In which case that is silly lol It doesn't make sense for the table to be symmetric for x^2/4. Double the domain size instead lol. It looks like that is what he is doing. Noooooooooooooob alert. His routine is nice and fast though lol

Didn't have a look into the table before but It looks like it's indeed storing squares for negative values.

ParadigmShifter wrote: ↑Tue Apr 02, 2024 10:30 pm You also don't really need to store an entry for 0 either just special case that? Then you can use the full domain since 255^2 < 65536. You could also special case 1 since multiplying by 1 or 0 is pointless really the answer can be returned immediately (again, slight speed hit). (duh you can use the full domain anyway since you have 256 entries to play with... it's probably worth special casing/not even bothering calling the multiply function if one of the arguments is 0 or 1 though?). If you know either 0 or 1 or both will never be one of the arguments, even better, don't check. All you save is some bytes in the table you can use for other variables though.

Yes, I'd leave the handing of 0 or 1 to the caller instead of wasting cycles on a branch that's used with <1% probability.
The result is in any case computed correctly though.

ParadigmShifter · Post by **ParadigmShifter** » Wed Apr 03, 2024 8:36 pm

I think it would be more beneficial to speed up the multiplication rather than the sqrt? Optimise the parts that will be called the most more aggressively that ones that are called less often.

Multiplication is a lot more useful then sqrt. You also do 2 multiplies in the Euclidean distance algorithm and only 1 square root.

Unfortunately there's not really a Z80 profiler that I know of. Profile led optimisation is the best kind...

EDIT: I also provided a branchless way to increment A if another register is odd which should be a good optimisation anyway I think?

LD A, D ; 4T
RRC D ; 8T
ADC 0 ; 7T = 19T

is better than (note it's even better if you have a register you know contains 0 as well say B following a loop or something)

BIT 0, D; 8T
JR NZ, .even ; I always get this mixed up may need to be JR Z, .even instead ; 7T/12T
INC D ; so 7T + 4T
.even

which is either 19T or 20T (but if you need D back later on, you have to RLC D for another 8T - so try not to do that if you can avoid it). LD A, D\AND 1 at 11T is slower than BIT 0, D and destroys A as well (and you still need to do the test).

It's also definitely safe to only increment the lower byte of your 16 bit number, table will not contain #FF since #100 is divisible by 4 so it would have stored #100/4 in table instead.

DoctorRad · Post by **DoctorRad** » Wed Apr 03, 2024 9:50 pm

Did you consider using CORDIC square root? Or too slow / not accurate enough?

catmeows · Post by **catmeows** » Wed Apr 03, 2024 10:15 pm

Well, if you want really quick and dirty sqr you can also use identity log (sqrt(x)) = log(x)/2 which can be implemented as two reads of LUT and one shift. Probably best thing on this approach is that you can reuse log and antilog tables also for fast approximation of division which is also quite expensive on Z80. Multiply works too but the classic 4xy = (x + y)^2 - (x - y)^2 is accurate while log will be off little bit.

ParadigmShifter · Post by **ParadigmShifter** » Wed Apr 03, 2024 10:20 pm

Elite had something like that on the BBC version anyway. Obvs need to use multiply and divide a lot for 3d stuff.

https://www.bbcelite.com/deep_dives/mul ... ithms.html

Looks like those use a 768 byte lookup as well.

EDIT: Maybe that's some of the remakes on 6502 that do that? Sounds like original Elite used standard multiplication stuff, I dunno. (Looks like the 6502 2nd processor version uses that, no idea what that is though lol).

Also accuracy isn't as important when projecting stuff from 3d to 2d which is what it would have been used for.

sn3j · Post by **sn3j** » Wed Apr 03, 2024 10:36 pm

ParadigmShifter wrote: ↑Wed Apr 03, 2024 8:36 pm I think it would be more beneficial to speed up the multiplication rather than the sqrt? Optimise the parts that will be called the most more aggressively that ones that are called less often.

Multiplication is a lot more useful then sqrt. You also do 2 multiplies in the Euclidean distance algorithm and only 1 square root.

My intention was to have a fast metric for A*, that's the only scenario so far. I don't need any multiplication at all so far. The algorithm just looks up x² and y² in the table.
What bothers me though is that it can't compute sqrt(255² + 255²). The limit is somewhere along 180 for both x,y. (Which is, to be more precise, dx and dy, meaning no two nodes must be farther away than that.)

ParadigmShifter wrote: ↑Wed Apr 03, 2024 8:36 pm Unfortunately there's not really a Z80 profiler that I know of. Profile led optimisation is the best kind...

Fuse spills out a txt file with the cycles used per address. There's also Skool Kit but I don't know how to use it.

ParadigmShifter wrote: ↑Wed Apr 03, 2024 8:36 pm EDIT: I also provided a branchless way to increment A if another register is odd which should be a good optimisation anyway I think?

LD A, D ; 4T
RRC D ; 8T
ADC 0 ; 7T = 19T

is better than (note it's even better if you have a register you know contains 0 as well say B following a loop or something)

BIT 0, D; 8T
JR NZ, .even ; I always get this mixed up may need to be JR Z, .even instead ; 7T/12T
INC D ; so 7T + 4T
.even

which is either 19T or 20T (but if you need D back later on, you have to RLC D for another 8T - so try not to do that if you can avoid it). LD A, D\AND 1 at 11T is slower than BIT 0, D and destroys A as well (and you still need to do the test).

It's also definitely safe to only increment the lower byte of your 16 bit number, table will not contain #FF since #100 is divisible by 4 so it would have stored #100/4 in table instead.

Yops. I try to avoid branching with tricks like that.

MustardTiger · Post by **MustardTiger** » Wed Apr 03, 2024 10:40 pm

A hacky way of getting this distance between two 2d points is this

Code: Select all

determine largest and smallest values from x,y
length = largest + (smallest>>1) gives a worst case 12% error
if smallest >= (largest>>3) then length = largest + (smallest>>1) - (largest>>3) gives worst case ~6% error but mostly <2%

sn3j · Post by **sn3j** » Wed Apr 03, 2024 10:41 pm

DoctorRad wrote: ↑Wed Apr 03, 2024 9:50 pm Did you consider using CORDIC square root? Or too slow / not accurate enough?

There's something similar used on Z80 heaven.
The algorithms look quite similar to the division routines, the fastest afaik was some unrolled loop with 330T for a 16 bit integer root with remainder(?).
Problem is that anything that requires multiplication or division isn't going to work on a Z80.

ParadigmShifter · Post by **ParadigmShifter** » Wed Apr 03, 2024 10:41 pm

Is that total cycles per address? Cos that would be good. I use Spin atm mainly because I am used to it (but there's no documentation any more I can find for it).

Working out the exact tile-based metric distance might be faster though as per my last few posts in the A* thread... no multiplies just min and max and a scale by 1/sqrt(2) for the diagonal direction (which you can do using a lookup of floor 1/sqrt(2) anyway, then use the value with more diagonal steps as a tie breaker if 2 distances are equal).

Spectrum Computing

Euclidean dist/Sqrt/Multiplication routines 16bit

Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit

Re: Euclidean dist/Sqrt/Multiplication routines 16bit