Performance of Forth

animaal · Post by **animaal** » Mon Oct 05, 2020 10:23 pm

Lethargeek wrote: ↑Mon Oct 05, 2020 10:11 pm try this:
Code:
: FILLIN 255 23296 16384 DO DUP I C! LOOP DROP ;
might be a bit faster than using a literal inside the loop

or 16-bit version:
Code:
: FILLIN 65535 23296 16384 DO DUP I ! 2 +LOOP DROP ;

To be honest, my stopwatch skills aren't really accurate enough to do justice here... but the second version seems to be significantly faster. Maybe a little over 1 second on my Spectaculator. Makes sense when there are half the number of iterations.
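
A quick check of the iteration counts, sketched in Python rather than Forth. The helper name `loop_iterations` is made up for illustration, and it assumes a positive step with both bounds below 32768, so 16-bit signed arithmetic does not come into play.

```python
def loop_iterations(start, limit, step=1):
    """Passes made by a Forth `limit start DO ... step +LOOP` (hypothetical helper).

    Assumes a positive step: the loop runs while the index is below the limit.
    """
    return len(range(start, limit, step))

# Spectrum screen memory: 16384 up to, but not including, 23296.
byte_iters = loop_iterations(16384, 23296)      # C! version, one byte per pass
word_iters = loop_iterations(16384, 23296, 2)   # ! version with 2 +LOOP
print(byte_iters, word_iters)
```

The 16-bit loop makes half as many passes, which fits the rough timing above; each pass of the `!` version does a little more work, so the speedup need not be exactly 2x.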

Joefish · Post by **Joefish** » Mon Oct 05, 2020 10:31 pm

White Lightning included a copyright notice that anything written in it had to be offered up to Oasis Software to publish. I doubt anyone ever bothered. The graphics libraries were powerful enough (they resurfaced in Laser Basic) but the Forth editor they provided was a terrible thing to program in.

Alone Coder · Post by **Alone Coder** » Tue Oct 06, 2020 6:05 am

It is possible to make a Forth system that generates call:call:call instead of dw:dw:dw. This will be a lot faster.
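
A toy model of the two layouts, in Python purely to show the structural difference. Everything here (the `stack` list, `run_threaded`, the example word) is an assumption made for illustration; it says nothing about actual Z80 timings, which is what the rest of the thread argues about.

```python
# Toy model only: Python lists and functions stand in for Z80 memory and CALLs.
stack = []

def lit(n):
    # Returns a primitive that pushes the constant n.
    return lambda: stack.append(n)

def dup():
    stack.append(stack[-1])

def add():
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)

def run_threaded(body):
    """'dw:dw:dw' style: the body is a table of addresses, and an inner
    interpreter (NEXT) fetches each entry and jumps to it."""
    ip = 0
    while ip < len(body):
        word = body[ip]   # fetch the next address
        ip += 1
        word()            # dispatch to it

# The same word, "DUP + 1 +", in both layouts.
double_plus_one_threaded = [dup, add, lit(1), add]

def double_plus_one_calls():
    """'call:call:call' style: the body is itself code that calls each word,
    so there is no separate fetch-and-dispatch step."""
    dup()
    add()
    lit(1)()
    add()

stack.append(20)
run_threaded(double_plus_one_threaded)
threaded_result = stack.pop()

stack.append(20)
double_plus_one_calls()
calls_result = stack.pop()

print(threaded_result, calls_result)
```

Both versions compute the same thing; the only difference is whether a dispatch loop sits between the words. How much that loop costs relative to the work done inside each word is the question debated in the posts that follow.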

catmeows · Post by **catmeows** » Tue Oct 06, 2020 6:26 am

Alone Coder wrote: ↑Tue Oct 06, 2020 6:05 am It is possible to make a Forth system that generates call:call:call instead of dw:dw:dw. This will be a lot faster.

Beauty of Forth is that you can customize performance and code density and you can quickly build a domain language on top of it. I'm using tokenized (8-bit tokens) Forth-like language scripts in Black Flag. It is much easier to manage game logic in Forth than in asm.

ketmar · Post by **ketmar** » Tue Oct 06, 2020 6:36 am

Alone Coder wrote: ↑Tue Oct 06, 2020 6:05 am This will be a lot faster.

no, it won't.

Sokurah · Post by **Sokurah** » Tue Oct 06, 2020 10:03 am

Joefish wrote: ↑Mon Oct 05, 2020 10:31 pm The graphics libraries were powerful enough (they resurfaced in Laser Basic) but the Forth editor they provided was a terrible thing to program in.

Well, at least the manual was awesome

Joefish · Post by **Joefish** » Tue Oct 06, 2020 10:32 am

catmeows wrote: ↑Tue Oct 06, 2020 6:26 am Beauty of Forth is that you can customize performance and code density and you can quickly build a domain language on top of it. I'm using tokenized (8-bit tokens) Forth-like language scripts in Black Flag. It is much easier to manage game logic in Forth than in asm.

The real attraction of Forth is that it's very easy to write an interpreter for it on pretty much any machine-code architecture. It's then fairly easy to write simple scripted actions in it, so it's a good choice for something like this. I'm adding a script engine to Go-Go BunnyGun at the moment, although I'm not sure I'm past the tipping point of implementing a fully-featured language like Forth. Although if I needed any maths functions I probably would do it as a Forth stack calculator.

The problem comes when you subject some other poor sod to the thing you wrote!

Lethargeek · Post by **Lethargeek** » Tue Oct 06, 2020 7:17 pm

ketmar wrote: ↑Tue Oct 06, 2020 6:36 am
Alone Coder wrote: ↑Tue Oct 06, 2020 6:05 am This will be a lot faster.
no, it won't.

it might, if short primitives are inlined and the code is not too heavy on hi-level words

ketmar · Post by **ketmar** » Tue Oct 06, 2020 7:50 pm

Lethargeek wrote: ↑Tue Oct 06, 2020 7:17 pm
ketmar wrote: ↑Tue Oct 06, 2020 6:36 am no, it won't.
it might, if short primitives are inlined and the code is not too heavy on hi-level words

which will be a mostly useless microbenchmark code. ;-) real benefits (on real-world code) from STC (over DTC) are not even x2 most of the time. and to get close to x2 you have to have a quite sophisticated peephole optimiser, which will take most of the free RAM for itself. pretty straightforward x86 DTC UrForth is about 1.9 times slower than optimised STC BigForth (and about 1.2 times slower than unoptimised SP-Forth with branches/loops inlined). and this is on x86, with its rich choice of addressing modes. so for real apps it can be slightly faster (and the code is about 1.3 times bigger). there is simply no way to make it "a lot faster" if you won't revert to pure asm, or won't stick with specially crafted microbenchmarks.

Lethargeek · Post by **Lethargeek** » Tue Oct 06, 2020 8:20 pm

ketmar wrote: ↑Tue Oct 06, 2020 7:50 pm which will be a mostly useless microbenchmark code

no, which may well be an actual game code, as it tends to be low-level most of the time

ketmar wrote: ↑Tue Oct 06, 2020 7:50 pm which will be a mostly useless microbenchmark code. real benefits (on real-world code) from STC (over DTC) are not even x2 most of the time. and to get close to x2 you have to have a quite sophisticated peephole optimiser, which will take most of the free RAM for itself

nothing sophisticated and not much RAM for a simple optimizer doing a fusion of the last 2-3 primitives

ketmar wrote: ↑Tue Oct 06, 2020 7:50 pm pretty straightforward x86 DTC UrForth is about 1.9 times slower than optimised STC BigForth (and about 1.2 times slower than unoptimised SP-Forth with branches/loops inlined). and this is on x86, with its rich choice of addressing modes. so for real apps it can be slightly faster (and the code is about 1.3 times bigger). there is simply no way to make it "a lot faster" if you won't revert to pure asm, or won't stick with specially crafted microbenchmarks.

as i said, it all depends on a hi-level vs low-level ops ratio in the code
with z80 you have a choice of optimising either parameter stack ops OR return stack ops
but with things like 6809/6309 you can do both (and not just forth but any threaded code)
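
A minimal sketch of what "fusion of the last 2-3 primitives" could look like at compile time, written in Python for illustration. This is not the optimizer of any Forth mentioned in this thread: the rule table is hypothetical, `DUP-I-C!` is a made-up fused word, and the sketch only shows that the bookkeeping is a small lookup over the tail of the code being compiled. It does not attempt the register allocation discussed elsewhere in the thread.

```python
# Hypothetical fusion rules: a tail of primitives -> one fused primitive.
FUSION_RULES = {
    ("DUP", "+"): "2*",               # DUP + doubles the top of stack
    ("SWAP", "DROP"): "NIP",          # SWAP DROP removes the second item
    ("DUP", "I", "C!"): "DUP-I-C!",   # made-up fused word for a fill-loop body
}
MAX_WINDOW = max(len(pattern) for pattern in FUSION_RULES)

def compile_word(tokens):
    """Append primitives one by one, fusing the last 2-3 entries when a rule matches."""
    code = []
    for tok in tokens:
        code.append(tok)
        # Try the longest window first; only the tail of the code is examined.
        for n in range(min(MAX_WINDOW, len(code)), 1, -1):
            fused = FUSION_RULES.get(tuple(code[-n:]))
            if fused is not None:
                code[-n:] = [fused]
                break
    return code

print(compile_word(["DUP", "+", "SWAP", "DROP"]))
print(compile_word(["DUP", "I", "C!"]))
```

The memory cost is the rule table plus one fused routine per rule, so how far this scales depends on how many rules are needed to cover real code.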

ketmar · Post by **ketmar** » Tue Oct 06, 2020 8:42 pm

Lethargeek wrote: ↑Tue Oct 06, 2020 8:20 pm
ketmar wrote: ↑Tue Oct 06, 2020 7:50 pm which will be a mostly useless microbenchmark code
no, which may well be an actual game code, as it tends to be low-level most of the time

why? almost nobody's going to write gfx kernel (or something like that) in ZX Forth anyway, it makes little sense. and the more high-level code, where Forth really shines, calls a lot of high-level words (or the authors are doing it wrong, and creating a write-only mess even they won't be able to understand in a week).

Lethargeek wrote: ↑Tue Oct 06, 2020 8:20 pm
ketmar wrote: ↑Tue Oct 06, 2020 7:50 pm which will be a mostly useless microbenchmark code. ;-) real benefits (on real-world code) from STC (over DTC) are not even x2 most of the time. and to get close to x2 you have to have a quite sophisticated peephole optimiser, which will take most of the free RAM for itself
nothing sophisticated and not much RAM for a simple optimizer doing a fusion of the last 2-3 primitives

this doesn't make much sense without removing intermediate stack operations. that's mostly what unoptimised SP-Forth does, and it has very little impact. and on x86 we have "xchg esp,ebp" to quickly switch between our stacks. simply optimising away stack switches is not enough to get huge speedups, you have to use registers for intermediate values. so you need at least basic-block peephole optimiser and register allocator. simple optimiser will not give you even more-or-less stable x2. so to get more speed you need to hardcode a lot of special cases, and that quickly gets out of control. just take a look at any decent optimising STC compiler: it is either a mess of normal/inlineable code all over the place, or a huge list of peephole rules. or both. and it still cannot beat simple DTC even by x4 (which is not that huge after all).

that's why i abandoned optimising STC after some R&D: its complexity simply doesn't pay off.

p.s.: of course, comparing Z80 and x86 is kinda like comparing apples and oranges, but the basic numbers are very close, and x86 is what i had tested recently, so i can operate with real results i've seen, instead of trying to remember exact numbers for unfinished R&D compilers.

p.p.s.: that is, STC can be faster, of course, but it is not a lot faster, and creating good STC code is a much more complex task than simply using DTC. and its complexity is not worth the final speed gain, i think. of course, i'd be glad to be wrong here, because there cannot be enough speed. ;-)

Lethargeek · Post by **Lethargeek** » Tue Oct 06, 2020 9:27 pm

ketmar wrote: ↑Tue Oct 06, 2020 8:42 pm why? almost nobody's going to write gfx kernel (or something like that) in ZX Forth anyway, it makes little sense. and the more high-level code, where Forth really shines, calls a lot of high-level words (or the authors are doing it wrong, and creating a write-only mess even they won't be able to understand in a week).

who said gfx kernel? low-level forth is more than enough for game logic
just look at the Shaw brothers games that are compiled (integer and very low-level) BASIC

ketmar wrote: ↑Tue Oct 06, 2020 8:42 pm this doesn't make much sense without removing intermediate stack operations. that's mostly what unoptimised SP-Forth does, and it has very little impact. and on x86 we have "xchg esp,ebp" to quickly switch between our stacks. simply optimising away stack switches is not enough to get huge speedups, you have to use registers for intermediate values. so you need at least basic-block peephole optimiser and register allocator.

but this IS simple - even possible just using macros in the assembler

ketmar wrote: ↑Tue Oct 06, 2020 8:42 pm simple optimiser will not give you even more-or-less stable x2. so to get more speed you need to hardcode a lot of special cases, and that quickly gets out of control. just take a look at any decent optimising STC compiler: it is either a mess of normal/inlineable code all over the place, or a huge list of peephole rules. or both. and it still cannot beat simple DTC even by x4 (which is not that huge after all).

repeat - it all depends on the hi/lo level code ratio, in your use cases it won't, in my use cases it will

ketmar wrote: ↑Tue Oct 06, 2020 8:42 pm p.s.: of course, comparing Z80 and x86 is kinda like comparing apples and oranges, but the basic numbers are very close, and x86 is what i had tested recently, so i can operate with real results i've seen, instead of trying to remember exact numbers for unfinished R&D compilers.

NO, and don't even bring 16 bits here, these things are fundamentally different

ketmar wrote: ↑Tue Oct 06, 2020 8:42 pm p.p.s.: that is, STC can be faster, of course, but it is not a lot faster, and creating good STC code is a much more complex task than simply using DTC. and its complexity is not worth the final speed gain, i think. of course, i'd be glad to be wrong here, because there cannot be enough speed.

using hardware stack as parameter stack gives a boost
also don't forget that inlined primitives lack NEXT
for example, C! becomes just