Performance of Forth

animaal
Microbot
Posts: 101
Joined: Sat Mar 09, 2019 5:14 pm

Re: Performance of Forth

Post by animaal »

Lethargeek wrote: Mon Oct 05, 2020 10:11 pm try this:

Code: Select all

: FILLIN 255 23296 16384 DO DUP I C! LOOP DROP ;
might be a bit faster than literal

or 16-bit version:

Code: Select all

: FILLIN 65535 23296 16384 DO DUP I ! 2 +LOOP DROP ;
To be honest, my stopwatch skills aren't really accurate enough to do justice here... but the second version seems to be significantly faster. Maybe a little over 1 second on my Spectaculator. Makes sense when there are half the number of iterations.
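A rough check of the "half the number of iterations" point can be sketched in Python. This is an illustration only: the constant and function names are invented here, and it counts loop passes without modelling Forth timing.

```python
# Screen bitmap plus attributes, as used by the DO ... LOOP limits above.
START = 16384   # first screen address (0x4000)
LIMIT = 23296   # loop limit, one past the last attribute byte (0x5B00)

def iterations(step):
    """Number of DO ... +LOOP passes for a given positive step."""
    return len(range(START, LIMIT, step))

byte_passes = iterations(1)   # C! version, one byte per pass
word_passes = iterations(2)   # !  version with 2 +LOOP, two bytes per pass

print(byte_passes, word_passes)   # 6912 3456
```

Per-pass overhead (DO/LOOP bookkeeping, fetching I) is paid once per pass, so halving the passes roughly halves that overhead.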
Joefish
Rick Dangerous
Posts: 2058
Joined: Tue Nov 14, 2017 10:26 am

Re: Performance of Forth

Post by Joefish »

White Lightning included a copyright notice that anything written in it had to be offered up to Oasis Software to publish. I doubt anyone ever bothered. The graphics libraries were powerful enough (they resurfaced in Laser Basic) but the Forth editor they provided was a terrible thing to program in.
Alone Coder
Manic Miner
Posts: 401
Joined: Fri Jan 03, 2020 10:00 am

Re: Performance of Forth

Post by Alone Coder »

It is possible to make a Forth system that generates call:call:call instead of dw:dw:dw. This will be a lot faster.
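For readers unfamiliar with the notation: dw:dw:dw means a compiled word is a list of addresses walked by an inner interpreter (threaded code), while call:call:call means it is machine code made of subroutine calls. A toy Python model of the two shapes follows. All names are invented for illustration, and Python says nothing about Z80 timings.

```python
stack = []

def lit2(): stack.append(2)
def lit3(): stack.append(3)
def add():  stack.append(stack.pop() + stack.pop())

# "dw:dw:dw": the word is data, a list of primitive addresses.
THREAD = [lit2, lit3, add]

def run_threaded(thread):
    """Inner interpreter: fetch the next address, jump to it, repeat (NEXT)."""
    for primitive in thread:
        primitive()

# "call:call:call": the word is code that calls each primitive directly,
# so there is no fetch-and-dispatch step between primitives.
def run_subroutine():
    lit2()
    lit3()
    add()

run_threaded(THREAD)
threaded_result = stack.pop()
run_subroutine()
subroutine_result = stack.pop()
print(threaded_result, subroutine_result)   # 5 5
```

Both forms compute the same thing. The debate in this thread is about how much the dispatch step costs relative to the work done inside the primitives.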
catmeows
Manic Miner
Posts: 716
Joined: Tue May 28, 2019 12:02 pm
Location: Prague

Re: Performance of Forth

Post by catmeows »

Alone Coder wrote: Tue Oct 06, 2020 6:05 am It is possible to make a Forth system that generates call:call:call instead of dw:dw:dw. This will be a lot faster.
The beauty of Forth is that you can customize performance and code density, and you can quickly build a domain language on top of it. I'm using tokenized (8-bit tokens) Forth-like language scripts in Black Flag. It is much easier to manage game logic in Forth than in asm.
Proud owner of Didaktik M
ketmar
Manic Miner
Posts: 697
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Performance of Forth

Post by ketmar »

Alone Coder wrote: Tue Oct 06, 2020 6:05 am This will be a lot faster.
no, it won't.
Sokurah
Manic Miner
Posts: 286
Joined: Tue Nov 14, 2017 10:38 am
Contact:

Re: Performance of Forth

Post by Sokurah »

Joefish wrote: Mon Oct 05, 2020 10:31 pm The graphics libraries were powerful enough (they resurfaced in Laser Basic) but the Forth editor they provided was a terrible thing to program in.

Well, at least the manual was awesome :lol:
Website: Tardis Remakes / Mostly remakes of Arcade and ZX Spectrum games.
My games for the Spectrum: Dingo, The Speccies, The Speccies 2, Vallation & Sqij.
Twitter: Sokurah
Joefish
Rick Dangerous
Posts: 2058
Joined: Tue Nov 14, 2017 10:26 am

Re: Performance of Forth

Post by Joefish »

catmeows wrote: Tue Oct 06, 2020 6:26 am The beauty of Forth is that you can customize performance and code density, and you can quickly build a domain language on top of it. I'm using tokenized (8-bit tokens) Forth-like language scripts in Black Flag. It is much easier to manage game logic in Forth than in asm.
The real attraction of Forth is that it's very easy to write an interpreter for it on pretty much any machine-code architecture. It's then fairly easy to write simple scripted actions in it, so it's a good choice for something like this. I'm adding a script engine to Go-Go BunnyGun at the moment, although I'm not sure I'm past the tipping point of implementing a fully-featured language like Forth. Although if I needed any maths functions I probably would do it as a Forth stack calculator.

The problem comes when you subject some other poor sod to the thing you wrote! :lol:
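To give a feel for how little machinery a "stack calculator" needs, here is a minimal postfix evaluator in Python. It is a sketch under simple assumptions (space-separated tokens, integers only, four operators) and is not taken from any of the games mentioned.

```python
def rpn(source):
    """Evaluate a space-separated postfix expression with integer maths."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a // b,   # floor division, an arbitrary choice here
    }
    stack = []
    for token in source.split():
        if token in ops:
            b = stack.pop()          # top of stack is the right-hand operand
            a = stack.pop()
            stack.append(ops[token](a, b))
        else:
            stack.append(int(token))
    return stack

result = rpn("2 3 + 4 *")
print(result)   # [20]
```

The whole interpreter is one loop and one table of operations, which is why the approach ports so easily to machine code.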
Lethargeek
Manic Miner
Posts: 742
Joined: Wed Dec 11, 2019 6:47 am

Re: Performance of Forth

Post by Lethargeek »

ketmar wrote: Tue Oct 06, 2020 6:36 am
Alone Coder wrote: Tue Oct 06, 2020 6:05 am This will be a lot faster.
no, it won't.
it might, if short primitives are inlined and the code is not too heavy on hi-level words
ketmar
Manic Miner
Posts: 697
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Performance of Forth

Post by ketmar »

Lethargeek wrote: Tue Oct 06, 2020 7:17 pm
ketmar wrote: Tue Oct 06, 2020 6:36 am no, it won't.
it might, if short primitives are inlined and the code is not too heavy on hi-level words
which will be mostly useless microbenchmark code. ;-) the real benefit (on real-world code) of STC (over DTC) is not even x2 most of the time. and to get close to x2 you have to have a quite sophisticated peephole optimiser, which will take most of the free RAM for itself. the pretty straightforward x86 DTC UrForth is about 1.9 times slower than optimised STC BigForth (and about 1.2 times slower than unoptimised SP-Forth with branches/loops inlined). and this is on x86, with its rich choice of addressing modes. so for real apps it can be slightly faster (and the code is about 1.3 times bigger). there is simply no way to make it "a lot faster" unless you revert to pure asm or stick with specially crafted microbenchmarks.
Lethargeek
Manic Miner
Posts: 742
Joined: Wed Dec 11, 2019 6:47 am

Re: Performance of Forth

Post by Lethargeek »

ketmar wrote: Tue Oct 06, 2020 7:50 pm which will be a mostly useless microbenchmark code
no, which may well be actual game code, as it tends to be low-level most of the time
ketmar wrote: Tue Oct 06, 2020 7:50 pm which will be mostly useless microbenchmark code. ;-) the real benefit (on real-world code) of STC (over DTC) is not even x2 most of the time. and to get close to x2 you have to have a quite sophisticated peephole optimiser, which will take most of the free RAM for itself
nothing sophisticated and not much RAM for a simple optimizer doing a fusion of the last 2-3 primitives
ketmar wrote: Tue Oct 06, 2020 7:50 pm the pretty straightforward x86 DTC UrForth is about 1.9 times slower than optimised STC BigForth (and about 1.2 times slower than unoptimised SP-Forth with branches/loops inlined). and this is on x86, with its rich choice of addressing modes. so for real apps it can be slightly faster (and the code is about 1.3 times bigger). there is simply no way to make it "a lot faster" unless you revert to pure asm or stick with specially crafted microbenchmarks.
as i said, it all depends on the hi-level vs low-level ops ratio in the code
with z80 you have to choose between optimising parameter stack ops OR return stack ops
but with things like 6809/6309 you can do both (and not just forth but any threaded code)
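One way to picture the "ratio" argument is a simple weighted model: if part of the run time is spent in low-level primitives that inlining speeds up, and the rest in high-level calls that barely change, the overall gain is capped by the part that does not improve. The Python sketch below is an illustration only. The fractions and the per-part speedup are made-up inputs, not measurements from anyone in this thread.

```python
def overall_speedup(low_fraction, low_gain, high_gain=1.0):
    """Amdahl-style estimate.

    low_fraction: share of original run time spent in low-level primitives.
    low_gain / high_gain: how many times faster each part becomes.
    """
    high_fraction = 1.0 - low_fraction
    new_time = low_fraction / low_gain + high_fraction / high_gain
    return 1.0 / new_time

# Hypothetical inputs: primitives become 4x faster, high-level calls unchanged.
mostly_low = overall_speedup(0.9, 4.0)    # code dominated by primitives
mostly_high = overall_speedup(0.3, 4.0)   # code dominated by high-level words

print(round(mostly_low, 2), round(mostly_high, 2))   # 3.08 1.29
```

With the same compiler improvement, the two hypothetical workloads land on opposite sides of the "x2" line, which is consistent with both posters describing their own use cases accurately.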
ketmar
Manic Miner
Posts: 697
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Performance of Forth

Post by ketmar »

Lethargeek wrote: Tue Oct 06, 2020 8:20 pm
ketmar wrote: Tue Oct 06, 2020 7:50 pm which will be a mostly useless microbenchmark code
no, which may well be an actual game code, as it tends to be low-level most of the time
why? almost nobody's going to write a gfx kernel (or something like that) in ZX Forth anyway, it makes little sense. and the more high-level code, where Forth really shines, calls a lot of high-level words (or the authors are doing it wrong, and creating a write-only mess even they won't be able to understand in a week).
Lethargeek wrote: Tue Oct 06, 2020 8:20 pm
ketmar wrote: Tue Oct 06, 2020 7:50 pm which will be mostly useless microbenchmark code. ;-) the real benefit (on real-world code) of STC (over DTC) is not even x2 most of the time. and to get close to x2 you have to have a quite sophisticated peephole optimiser, which will take most of the free RAM for itself
nothing sophisticated and not much RAM for a simple optimizer doing a fusion of the last 2-3 primitives
this doesn't make much sense without removing intermediate stack operations. that's mostly what unoptimised SP-Forth does, and it has very little impact. and on x86 we have "xchg esp,ebp" to quickly switch between our stacks. simply optimising away stack switches is not enough to get huge speedups, you have to use registers for intermediate values. so you need at least a basic-block peephole optimiser and a register allocator. a simple optimiser will not give you even a more-or-less stable x2. so to get more speed you need to hardcode a lot of special cases, and that quickly gets out of control. just take a look at any decent optimising STC compiler: it is either a mess of normal/inlineable code all over the place, or a huge list of peephole rules. or both. and it still cannot beat simple DTC even by x4 (which is not that huge after all).

that's why i abandoned optimising STC after some R&D: its complexity simply doesn't pay off.

p.s.: of course, comparing Z80 and x86 is kinda like comparing apples and oranges, but the basic numbers are very close, and x86 is what i had tested recently, so i can operate with real results i've seen, instead of trying to remember exact numbers for unfinished R&D compilers.

p.p.s.: that is, STC can be faster, of course, but it is not a lot faster, and creating good STC code is a much more complex task than simply using DTC. and its complexity is not worth the final speed gain, i think. of course, i'd be glad to be wrong here, because there can never be enough speed. ;-)
Lethargeek
Manic Miner
Posts: 742
Joined: Wed Dec 11, 2019 6:47 am

Re: Performance of Forth

Post by Lethargeek »

ketmar wrote: Tue Oct 06, 2020 8:42 pm why? almost nobody's going to write a gfx kernel (or something like that) in ZX Forth anyway, it makes little sense. and the more high-level code, where Forth really shines, calls a lot of high-level words (or the authors are doing it wrong, and creating a write-only mess even they won't be able to understand in a week).
who said gfx kernel? low-level forth is more than enough for game logic
just look at the Shaw brothers games that are compiled (integer and very low-level) BASIC
ketmar wrote: Tue Oct 06, 2020 8:42 pm this doesn't make much sense without removing intermediate stack operations. that's mostly what unoptimised SP-Forth does, and it has very little impact. and on x86 we have "xchg esp,ebp" to quickly switch between our stacks. simply optimising away stack switches is not enough to get huge speedups, you have to use registers for intermediate values. so you need at least a basic-block peephole optimiser and a register allocator.
but this IS simple - even possible just using macros in the assembler
ketmar wrote: Tue Oct 06, 2020 8:42 pm a simple optimiser will not give you even a more-or-less stable x2. so to get more speed you need to hardcode a lot of special cases, and that quickly gets out of control. just take a look at any decent optimising STC compiler: it is either a mess of normal/inlineable code all over the place, or a huge list of peephole rules. or both. and it still cannot beat simple DTC even by x4 (which is not that huge after all).
repeat - it all depends on the hi/lo level code ratio, in your use cases it won't, in my use cases it will
ketmar wrote: Tue Oct 06, 2020 8:42 pm p.s.: of course, comparing Z80 and x86 is kinda like comparing apples and oranges, but the basic numbers are very close, and x86 is what i had tested recently, so i can operate with real results i've seen, instead of trying to remember exact numbers for unfinished R&D compilers.
NO, and don't even bring 16 bits here, these things are fundamentally different
ketmar wrote: Tue Oct 06, 2020 8:42 pm p.p.s.: that is, STC can be faster, of course, but it is not a lot faster, and creating good STC code is a much more complex task than simply using DTC. and its complexity is not worth the final speed gain, i think. of course, i'd be glad to be wrong here, because there can never be enough speed. ;-)
using hardware stack as parameter stack gives a boost
also don't forget that inlined primitives lack NEXT
for example, C! becomes just

Code: Select all

pop de
ld (hl), e
pop hl
...and then the fusion might eliminate "pop de" and/or "pop hl" checking just a few possible register states
(OTOH return stack handling becomes slower and uglier, possibly demanding alternative registers use)
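The simplest form of such a fusion is cancelling a push that is immediately undone by a pop of the same register pair. The Python sketch below shows only that one rule on a list of instruction strings. It is not Lethargeek's scheme or any real compiler's optimiser, and it assumes the common convention that the top of stack lives in HL, so DUP is "push hl" and DROP is "pop hl".

```python
def fuse(instructions):
    """Drop 'push R' immediately followed by 'pop R' (same register pair)."""
    out = []
    for ins in instructions:
        if out and ins.startswith("pop ") and out[-1] == "push " + ins[4:]:
            out.pop()            # the pair cancels: the register is unchanged
        else:
            out.append(ins)
    return out

# Contrived input: DUP DROP C! with each primitive inlined (hypothetical).
naive = ["push hl", "pop hl", "pop de", "ld (hl),e", "pop hl"]
fused = fuse(naive)
print(fused)   # ['pop de', 'ld (hl),e', 'pop hl']
```

Because the output list doubles as a stack of pending instructions, nested pairs such as push hl / push de / pop de / pop hl collapse as well. Tracking which registers already hold which stack items, as the post suggests, would need more state than this.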
catmeows
Manic Miner
Posts: 716
Joined: Tue May 28, 2019 12:02 pm
Location: Prague

Re: Performance of Forth

Post by catmeows »

Lethargeek wrote: Tue Oct 06, 2020 9:27 pm
who said gfx kernel? low-level forth is more than enough for game logic
I tend to agree here, my general vocabulary is quite simple. I may add or drop some definitions in future, but in general it is all I need.
Only MIN and MAX are compound words, rest is native implementation.

Code: Select all

uForthTable
	jr uForthDup	;00 DUP
	jr uForthDrop	;02 DROP
	jr uForthSwap	;04 SWAP
	jr uForthOver	;06 OVER
	jr uForthRot	;08 ROT
	jr uForthPlus	;0A PLUS
	jr uForthMinus	;0C MINUS
	jr uForthEqual	;0E EQUAL
	jr uForthNotEqu	;10 NEQUAL
	jr uForthGreat	;12 GREAT
	jr uForthLess	;14 LESS
	jr uForthEqLess	;16 EQLESS
	jr uForthEqGrt  ;18 EQGREAT
	jr uForthOne	;1A ONE
	jr uForthZero	;1C ZERO
	jr uForthInc	;1E INC
	jr uForthDec	;20 DEC
	jp uForthBranch ;22 BRANCH
	jp uForthBrTrue ;25 BRTRUE
	jp uForthBrFals	;28 BRFALSE
	jp uForthToR	;2B TOR
	jp uForthFromR	;2E FROMR
	jp uForthCpR	;31 COPYR
	jp uForthSubr	;34 SUBR
	jp uForthReturn ;37 RETURN
	jp uForthAbs	;3A ABS
	jp uForthNeg	;3D NEG
	jp uForthMin	;40 MIN
	jp uForthMax	;43 MAX
	jp uForthSgn	;46 SGN
	jp uForthAnd	;49 ANDOP
	jp uForthOr	;4C OROP
	jp uForthCpl	;4F CPLOP
	jp uForthMOne	;52 MONE
	jp uForthXor	;55 XOROP
	jp uForthCByte	;58 CBYTE
	jp uForthCInt	;5B CINT
	jp uForthBFetch ;5E BFETCH
	jp uForthIFetch ;61 IFETCH
	jp uForthBStore ;64 BSTORE
	jp uForthIStore ;67 ISTORE
	jp uForthMul	;6A MUL
	jp uForthDiv	;6D DIV
	jp uForthShl	;70 SHLOP
	jp uForthShr	;73 SHROP
	jp uForthShra	;76 SHRAOP
	jp uForthCUByte ;79 CUBYTE
	jp uForthCBFetch ;7C UBFETCH
	jp uForthExit	;7F EXIT
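The hex numbers in the comments are byte offsets into the table: a Z80 jr takes 2 bytes and a jp takes 3. The Python sketch below recomputes them, assuming the table is contiguous and contains exactly the entries listed, in that order. The helper name is invented for this example.

```python
JR_SIZE, JP_SIZE = 2, 3      # Z80 encodings: jr e is 2 bytes, jp nn is 3

JR_NAMES = ("DUP DROP SWAP OVER ROT PLUS MINUS EQUAL NEQUAL GREAT LESS "
            "EQLESS EQGREAT ONE ZERO INC DEC").split()
JP_NAMES = ("BRANCH BRTRUE BRFALSE TOR FROMR COPYR SUBR RETURN ABS NEG MIN "
            "MAX SGN ANDOP OROP CPLOP MONE XOROP CBYTE CINT BFETCH IFETCH "
            "BSTORE ISTORE MUL DIV SHLOP SHROP SHRAOP CUBYTE UBFETCH "
            "EXIT").split()

def table_offsets(jr_names, jp_names):
    """Return ({name: byte offset}, total size) for jr entries then jp entries."""
    offsets, position = {}, 0
    for name in jr_names:
        offsets[name] = position
        position += JR_SIZE
    for name in jp_names:
        offsets[name] = position
        position += JP_SIZE
    return offsets, position

offsets, table_size = table_offsets(JR_NAMES, JP_NAMES)
print(format(offsets["BRANCH"], "02X"), format(offsets["ISTORE"], "02X"), table_size)
# 22 67 130
```

Every offset fits in one byte, which is what lets an 8-bit token double as the jump-table offset.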

Proud owner of Didaktik M
ketmar
Manic Miner
Posts: 697
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Performance of Forth

Post by ketmar »

@Lethargeek it seems to me that we're talking about slightly different things here. it seems that you're talking about small and simple Forth-like scripts, and i am talking about writing everything except the very lowest parts in Forth. that can explain why you have mostly words calling primitives, and i have mostly words calling other high-level words. those are two very different things indeed, and STC is much faster in your case (but still not enough to call it "a lot" for me ;-). and that's prolly why i got mostly similar results for x86 and Z80 (as i said, the relative numbers were very close). (and that's why i started the cross-compiler project in the first place -- word header overhead became quite noticeable.)

so i agree with you that for "script-like" use cases there might be a good reason to switch to STC, and it may give some very noticeable speed boost.

also, i wonder if @animaal is silently writing some words to DROP us with our offtopic arguments.
programandala.net
Drutt
Posts: 2
Joined: Wed Jun 06, 2018 6:07 pm
Location: Spain
Contact:

Re: Performance of Forth

Post by programandala.net »

animaal wrote: Mon Oct 05, 2020 7:47 pm I was curious as to how the performance of Forth (Abersoft) relates to Sinclair BASIC and raw Z80. So I tested it. It's a very simple and far from comprehensive test, but perhaps somewhat indicative.

The aim is to fill the screen with pixels, and set all attributes to 255. the code is simple, and enclosed below for each language.
Spoiler: Forth is pretty good! I wonder if it was ever used to write commercial Spectrum software?


Results (Time taken):
basic : 72.50secs
forth : 1.50secs
assembly: 0.04secs

Code: Select all

( forth)
: FILLIN 23296 16384 DO 255 I ! LOOP ;
FILLIN
I couldn't resist the curiosity. I've added a new benchmark to Solo Forth (version 0.14.0-rc.124 was recently released) to try five ways of coding that in Forth. I copy the code and the results below, in ticks (system frames) and milliseconds:

Code: Select all

( scr-fill-bench )

  \ 2020-11-26: Benchmark written after the test found in the
  \ following forum thread, titled "Performance of Forth":
  \ https://spectrumcomputing.co.uk/forums/viewtopic.php?f=6&t=3487

need dticks need reset-dticks need do need +loop need dticks>ms

: end ( d ca len -- ) cr type ." : " 2dup d. ."  ticks (" 
                      dticks>ms d. ." ms)" key drop ;
  \ Display the result of a benchmark. _d_ are the ticks and
  \ _ca len_ is the name of the bench.

: fill-16b ( -- ) reset-dticks 23296 16384
  do 255 i ! loop dticks s" 16b-loop" end ;
  \ The original method used in the forum.

: fill-8b ( -- ) reset-dticks 23296 16384
  do 255 i c! loop dticks s" 8b-loop" end ;
  \ Simpler and a bit faster 8-bit variant.

: fill-16b+ ( -- ) reset-dticks 23296 16384
  do 65535 i ! 2 +loop dticks s" 16b-+loop" end ;
  \ Much faster 16-bit variant with a 2-byte loop step.

: fill-16b+dup ( -- ) reset-dticks 65535 23296 16384
  do dup i ! 2 +loop drop dticks s" 16b-dup-+loop" end ;
  \ A bit faster variant that duplicates the value.

: pure-fill ( -- ) reset-dticks 16384 6912 255
  fill dticks s" pure" end ;
  \ The fastest option by far, a pure Forth loop-less `fill`.

: scr-fill-bench ( -- ) fill-16b fill-8b fill-16b+ fill-16b+dup
  pure-fill ;  scr-fill-bench

  \ 2020-11-26: Results:

  \ Test            ticks    ms
  \ -----           -----  ----
  \ fill-16b           59  1180
  \ fill-8b            57  1140
  \ fill-16b+          36   720
  \ fill-16b+dup       35   700
  \ pure-fill           2    40
The test has been run on an emulated ZX Spectrum 128 with a Plus D disk interface, using the Fuse emulator on Debian.
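The millisecond column follows from the tick column at 20 ms per tick (one 50 Hz frame interrupt), which matches every row of the table. A short Python sketch of the conversion, with the ratio to the original 16-bit loop added for convenience. The variable names are invented for this example.

```python
MS_PER_TICK = 20   # one 50 Hz frame interrupt; matches 59 ticks -> 1180 ms

results = {        # ticks copied from the results table above
    "fill-16b": 59, "fill-8b": 57, "fill-16b+": 36,
    "fill-16b+dup": 35, "pure-fill": 2,
}
baseline = results["fill-16b"]
for name, ticks in results.items():
    ms = ticks * MS_PER_TICK
    print(f"{name:13} {ms:5} ms  x{baseline / ticks:.1f} vs fill-16b")
```

With a 20 ms timer resolution, the 2-tick figure for `fill` carries a large relative uncertainty, so its ratio is best read as "roughly 30 times".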

Note Solo Forth is DTC, while Abersoft Forth is an old ITC fig-Forth.
Marcos Cruz (programandala.net)
Alone Coder
Manic Miner
Posts: 401
Joined: Fri Jan 03, 2020 10:00 am

Re: Performance of Forth

Post by Alone Coder »

There is another option for keeping the program:

Code: Select all

jp operation
jp operation ;256 bytes below!
jp operation ;256 bytes below!
...
The operations should look like this subtraction:

Code: Select all

inc h
pop bc
ld a,c
sub e
ld e,a
ld a,b
sbc a,d
ld d,a
jp (hl)
The unrolling of hl will require extra code in the program.

inc hl:inc hl:inc hl instead of inc h (or ld [h]l directly in the threaded code) will not be so fast, but this way you can inline some code.

The keeping of return address for user operations that call other user operations:
a)

Code: Select all

ld (ix+127),h
dec lx
ld (ix-128),l
ld hl,...
...
ld l,(ix-128)
inc lx
ld h,(ix+127)
jp (hl) ;106t
b)
if we use de for call stack, with subtraction 12 t-states slower (pop ix:ld a,lx:sub c:ld c,a:ld a,hx:sbc a,b:ld b,a:jp (hl)), that is slower than in call:call:jp method:

Code: Select all

ex de,hl
ld (hl),d
dec l
ld (hl),e
dec l
ex de,hl
ld hl,...
...
ex de,hl
ld e,(hl)
inc l
ld d,(hl)
inc l
ex de,hl
jp (hl) ;74t
The following subtraction scheme is slower than pop ix:ld a,lx...:

Code: Select all

ld hl,$+6
jp sub ;20t
...
sub:
ex (sp),hl
ld a,l
sub c
ld c,a
ld a,h
sbc a,b
ld b,a
ret ;53t
Ret-Forth (with dw:dw:dw for primitives and dw:dw $+2 for user calls) is even slower, even if we keep the call stack in hl:

Code: Select all

Begin of a user call:
pop de
dec l
ld (hl),e
dec l
ld (hl),d
ld sp
ret
...
SEMI:
ld e,(hl)
inc l
ld d,(hl)
inc l
ex de,hl
ld sp,hl
ex de,hl
ret ;98
And we need to keep the checksums of the memory to restore the stack in interrupt.
Alone Coder
Manic Miner
Posts: 401
Joined: Fri Jan 03, 2020 10:00 am

Re: Performance of Forth

Post by Alone Coder »

Another method of interpreting the dw-forth (without COLON):

Code: Select all

SEMICOLON: ;10t
pop hl
NEXT: ;33t for SUBR ;44t for primitives
inc hl
ld a,(hl)
inc l
 add a,a
 jr c,SUBR
ld lx,a
jp (ix) ;primitive handler (has a copy of NEXT in the end)
SUBR: ;22t-6 = 16
;hl=next command address, (hl)a=subroutine address
push hl
ld l,a
ld h,(hl) ;hl=subroutine address
;followed by a copy of NEXT without inc hl ;run the first command of the subroutine
User subroutine call = 33(NEXT)+16(SUBR)+10(SEMICOLON) = 59t
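The T-state figures can be re-derived from the standard Z80 instruction timings. The Python sketch below sums them and converts to time at the 48K Spectrum's 3.5 MHz clock, ignoring contended memory and interrupts. The dictionary keys follow the post's notation (lx is the low half of IX).

```python
CPU_HZ = 3_500_000          # 48K Spectrum Z80 clock

T = {                       # standard Z80 T-states for the instructions used
    "inc hl": 6, "ld a,(hl)": 7, "inc l": 4, "add a,a": 4,
    "jr c (taken)": 12, "jr c (not taken)": 7,
    "ld lx,a": 8, "jp (ix)": 8,
    "push hl": 11, "ld l,a": 4, "ld h,(hl)": 7, "pop hl": 10,
}

def cost(*names):
    return sum(T[n] for n in names)

next_prim = cost("inc hl", "ld a,(hl)", "inc l", "add a,a",
                 "jr c (not taken)", "ld lx,a", "jp (ix)")
next_subr = cost("inc hl", "ld a,(hl)", "inc l", "add a,a", "jr c (taken)")
# The copy of NEXT that follows SUBR has no "inc hl", hence the -6.
subr = cost("push hl", "ld l,a", "ld h,(hl)") - T["inc hl"]
semicolon = cost("pop hl")
call_total = next_subr + subr + semicolon

print(next_prim, next_subr, subr, semicolon, call_total)   # 44 33 16 10 59
print(round(call_total / CPU_HZ * 1e6, 1), "microseconds per user call")
```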

subtraction (a b - = a-b):

Code: Select all

ld a,(bc)
sub e
ld e,a
inc c
ld a,(bc)
sbc a,d
ld d,a
inc bc ;40t
;followed by a copy of NEXT
_dw
Dizzy
Posts: 76
Joined: Thu Dec 07, 2023 1:52 am

Re: Performance of Forth

Post by _dw »

I just realized that I have a registration so I can dig some topics out of the grave. .)

https://codeberg.org/DW0RKiN/M4_FORTH/s ... ent-fillin

When writing the Forth compiler, I tried different ways to do it and measure the resulting time, because it simply depends on the implementation and there is no single correct choice (except FILL).

The time is measured externally using BASIC, so the real time is a little better.
And the size includes the minimal runtime part, which may not even be used.
The code is hidden in a subroutine, so there is also some small overhead regarding the calls.

Code: Select all

PUSH2(23296,16384)
DO
    PUSH(255)
    I
    STORE
LOOP
48 bytes 0.19s

Code: Select all

PUSH2(23296,16384)
DO
    PUSH(255)
    I
    CSTORE
LOOP
45 bytes 0.18s

Code: Select all

PUSH2(23296,16384)
DO
    PUSH(65535)
    I
    STORE
PUSH(2) ADDLOOP
52 bytes 0.14s

Code: Select all

PUSH2(23296,16384)
DO(S)
    PUSH(65535)
    I
    STORE
PUSH(2) ADDLOOP
53 bytes 0.07s

Code: Select all

PUSH2(65535,0)
PUSH2(23296,16384)
DO
    DROP
    I
    _2DUP
    STORE
LOOP
DROP
58 bytes 0.22s

Code: Select all

PUSH(65535,23296,16384)
DO(S)
    _2_PICK
    OVER
    STORE
LOOP
_DROP
51 bytes 0.18s

Code: Select all

PUSH(0,23296,16384)
DO
    DROP_I
    PUSH(255) OVER CSTORE
LOOP
DROP
53 bytes 0.18s

Code: Select all

PUSH(0,23296,16384)
DO
    DROP I
    PUSH(65535) OVER STORE
PUSH(2) ADDLOOP
DROP
58 bytes 0.14s

Code: Select all

PUSH(65535,0,23296,16384)
DO
    DROP I
    _2DUP STORE
PUSH(2) ADDLOOP
_2DROP
59 bytes 0.12s

Code: Select all

PUSH(0x4000)
BEGIN
    PUSH(255) OVER
    CSTORE _1ADD
    DUP
    PUSH(0x5B00)
    EQ
UNTIL
DROP
42 bytes 0.12s

Code: Select all

PUSH(0x4000)
BEGIN
    PUSH(255) OVER CSTORE _1ADD
    define({_TYP_SINGLE},{L_first})
DUP PUSH(0x5B00) EQ UNTIL
DROP
46 bytes 0.09s

Code: Select all

PUSH(0x4000)
BEGIN
    PUSH(65535) OVER STORE _2ADD
DUP PUSH(0x5B00) HEQ UNTIL
DROP
44 bytes 0.07s
Spoiler

Code: Select all

        ORG 32768

;   ===  b e g i n  ===
    ld  (Stop+1), SP    ; 4:20      init   storing the original SP value when the "bye" word is used
    ld    L, 0x1A       ; 2:7       init   Upper screen
    call 0x1605         ; 3:17      init   Open channel
    ld   HL, 0xEA60     ; 3:10      init   Return address stack = 60000
    exx                 ; 1:4       init
    call Fillin         ; 3:17      scall
Stop:                   ;           stop
    ld   SP, 0x0000     ; 3:10      stop   restoring the original SP value when the "bye" word is used
    ld   HL, 0x2758     ; 3:10      stop
    exx                 ; 1:4       stop
    ret                 ; 1:10      stop
;   =====  e n d  =====
;   ---  the beginning of a data stack function  ---
Fillin:                 ;           ( -- )
    push DE             ; 1:11      0x4000
    ex   DE, HL         ; 1:4       0x4000
    ld   HL, 16384      ; 3:10      0x4000
begin101:               ;           begin(101)
                        ;[5:26]     65535 over !   ( addr -- addr+1 )
    ld  [HL],low 65535  ; 2:10      65535 over !
    inc  HL             ; 1:6       65535 over !
    ld  [HL],high 65535 ; 2:10      65535 over !
    inc  HL             ; 1:6       2+
                        ;[6:21]     dup 0x5B00 h= until(101) 101
    ld    A, H          ; 1:4       dup 0x5B00 h= until(101) 101
    xor  0x5B           ; 2:7       dup 0x5B00 h= until(101) 101   hi(TOS) ^ hi(0x5B00)
    jp   nz, begin101   ; 3:10      dup 0x5B00 h= until(101) 101
break101:               ;           dup 0x5B00 h= until(101) 101
    ex   DE, HL         ; 1:4       drop
    pop  DE             ; 1:10      drop   ( a -- )
Fillin_end:
    ret                 ; 1:10      s;
;   ---------  end of data stack function  ---------

Code: Select all

PUSH2(65535,0x4000)
BEGIN
    _2DUP STORE _2ADD
    _2DUP STORE _2ADD
    DUP PUSH(0x5B00) HEQ
UNTIL
_2DROP
49 bytes 0.06s
Spoiler

Code: Select all

       ORG 32768

;   ===  b e g i n  ===
    ld  (Stop+1), SP    ; 4:20      init   storing the original SP value when the "bye" word is used
    ld    L, 0x1A       ; 2:7       init   Upper screen
    call 0x1605         ; 3:17      init   Open channel
    ld   HL, 0xEA60     ; 3:10      init   Return address stack = 60000
    exx                 ; 1:4       init
    call Fillin         ; 3:17      scall
Stop:                   ;           stop
    ld   SP, 0x0000     ; 3:10      stop   restoring the original SP value when the "bye" word is used
    ld   HL, 0x2758     ; 3:10      stop
    exx                 ; 1:4       stop
    ret                 ; 1:10      stop
;   =====  e n d  =====
;   ---  the beginning of a data stack function  ---
Fillin:                 ;           ( -- )
                        ;[8:42]     65535 0x4000   ( -- 65535 0x4000 )
    push DE             ; 1:11      65535 0x4000
    push HL             ; 1:11      65535 0x4000
    ld   DE, 0xFFFF     ; 3:10      65535 0x4000
    ld   HL, 0x4000     ; 3:10      65535 0x4000
begin101:               ;           begin(101)
                        ;[4:26]     2dup ! 2+   ( x addr -- x addr+2 )
    ld  [HL],E          ; 1:7       2dup ! 2+
    inc  HL             ; 1:6       2dup ! 2+
    ld  [HL],D          ; 1:7       2dup ! 2+
    inc  HL             ; 1:6       2dup ! 2+
                        ;[4:26]     2dup ! 2+   ( x addr -- x addr+2 )
    ld  [HL],E          ; 1:7       2dup ! 2+
    inc  HL             ; 1:6       2dup ! 2+
    ld  [HL],D          ; 1:7       2dup ! 2+
    inc  HL             ; 1:6       2dup ! 2+
    ld    A, H          ; 1:4       dup 0x5B00 h= until(101)   ( h1 -- h1 )  flag: hi(tos) == hi(23296)
    xor  0x5B           ; 2:7       dup 0x5B00 h= until(101)   hi(TOS) ^ hi(0x5B00)
    jp   nz, begin101   ; 3:10      dup 0x5B00 h= until(101)   variant: defalut
break101:               ;           dup 0x5B00 h= until(101)
    pop  HL             ; 1:10      2drop   ( b a -- )
    pop  DE             ; 1:10      2drop
Fillin_end:
    ret                 ; 1:10      s;
;   ---------  end of data stack function  ---------

Code: Select all

PUSH2(65535,0x4000)
BEGIN
    _2DUP_STORE _2ADD
DUP_PUSH_EQ_UNTIL(0x5B00)
_2DROP
50 bytes 0.07s
Spoiler

Code: Select all

 ORG 32768
    
;   ===  b e g i n  ===
    ld  (Stop+1), SP    ; 4:20      init   storing the original SP value when the "bye" word is used
    ld    L, 0x1A       ; 2:7       init   Upper screen
    call 0x1605         ; 3:17      init   Open channel
    ld   HL, 60000      ; 3:10      init   Init Return address stack
    exx                 ; 1:4       init
    
    call Fillin         ; 3:17      scall
    
Stop:                   ;           stop
    ld   SP, 0x0000     ; 3:10      stop   restoring the original SP value when the "bye" word is used
    ld   HL, 0x2758     ; 3:10      stop
    exx                 ; 1:4       stop
    ret                 ; 1:10      stop
;   =====  e n d  =====   
    
;   ---  the beginning of a data stack function  ---
Fillin:                 ;           ( -- )
        
    push DE             ; 1:11      push2(65535,0x4000)
    ld   DE, 65535      ; 3:10      push2(65535,0x4000)
    push HL             ; 1:11      push2(65535,0x4000)
    ld   HL, 0x4000     ; 3:10      push2(65535,0x4000) 
begin101:               ;           begin 101 
                        ;[4:26]     2dup ! 2+ _2dup_store_2add   ( x addr -- x addr+2 )
    ld  (HL),E          ; 1:7       2dup ! 2+ _2dup_store_2add
    inc  HL             ; 1:6       2dup ! 2+ _2dup_store_2add
    ld  (HL),D          ; 1:7       2dup ! 2+ _2dup_store_2add
    inc  HL             ; 1:6       2dup ! 2+ _2dup_store_2add 
                        ;[11:18/39] dup 0x5B00 eq until 101   variant: lo(0x5B00) = 0
    ld    A, L          ; 1:4       dup 0x5B00 eq until 101
    or    A             ; 1:4       dup 0x5B00 eq until 101
    jp   nz, begin101   ; 3:10      dup 0x5B00 eq until 101
    ld    A, high 0x5B00; 2:7       dup 0x5B00 eq until 101
    xor   H             ; 1:4       dup 0x5B00 eq until 101
    jp   nz, begin101   ; 3:10      dup 0x5B00 eq until 101
break101:               ;           dup 0x5B00 eq until 101 
    pop  HL             ; 1:10      2drop
    pop  DE             ; 1:10      2drop ( b a -- )
    
Fillin_end:
    ret                 ; 1:10      s;
;   ---------  end of data stack function  ---------

Code: Select all

PUSH2(0x5BFF,0x4000)
BEGIN
    _2DUP CSTORE _1CADD
    _2DUP CSTORE _1CADD
    _2DUP CSTORE _1CADD
    _2DUP CSTORE _1ADD
    _2DUP HEQ UNTIL
_2DROP
48 bytes 0.06s
Spoiler

Code: Select all

ifdef __ORG
    org __ORG
  else
    org 24576
  endif

;   ===  b e g i n  ===
    ld  (Stop+1), SP    ; 4:20      init   storing the original SP value when the "bye" word is used
    ld    L, 0x1A       ; 2:7       init   Upper screen
    call 0x1605         ; 3:17      init   Open channel
    ld   HL, 0xEA60     ; 3:10      init   Return address stack = 60000
    exx                 ; 1:4       init
    call Fillin         ; 3:17      scall
Stop:                   ;           stop
    ld   SP, 0x0000     ; 3:10      stop   restoring the original SP value when the "bye" word is used
    ld   HL, 0x2758     ; 3:10      stop
    exx                 ; 1:4       stop
    ret                 ; 1:10      stop
;   =====  e n d  =====
;   ---  the beginning of a data stack function  ---
Fillin:                 ;           ( -- )
                        ;[8:42]     0x5BFF 0x4000   ( -- 0x5BFF 0x4000 )
    push DE             ; 1:11      0x5BFF 0x4000
    push HL             ; 1:11      0x5BFF 0x4000
    ld   DE, 0x5BFF     ; 3:10      0x5BFF 0x4000
    ld   HL, 0x4000     ; 3:10      0x5BFF 0x4000
begin101:               ;           begin(101)
                        ;[1:7]      2dup c!   ( char addr -- char addr )  [addr]=lo8(x)
    ld  [HL],E          ; 1:7       2dup c!
    inc   L             ; 1:4       1c+   ( x1 -- x2 )   x2 = 256*hi(x1) + lo(x1 + 1)
                        ;[1:7]      2dup c!   ( char addr -- char addr )  [addr]=lo8(x)
    ld  [HL],E          ; 1:7       2dup c!
    inc   L             ; 1:4       1c+   ( x1 -- x2 )   x2 = 256*hi(x1) + lo(x1 + 1)
                        ;[1:7]      2dup c!   ( char addr -- char addr )  [addr]=lo8(x)
    ld  [HL],E          ; 1:7       2dup c!
    inc   L             ; 1:4       1c+   ( x1 -- x2 )   x2 = 256*hi(x1) + lo(x1 + 1)
                        ;[2:13]     2dup c! 1+   ( x addr -- x addr+1 )  [addr]=lo8(x)
    ld  [HL],E          ; 1:7       2dup c! 1+
    inc  HL             ; 1:6       2dup c! 1+
                        ;[5:18]     2dup h= until   ( h2 h1 -- h2 h1 )
    ld    A, H          ; 1:4       2dup h= until
    xor   D             ; 1:4       2dup h= until   hi(h2) ^ hi(h1)
    jp   nz, begin101   ; 3:10      2dup h= until
break101:               ;           2dup h= until
    pop  HL             ; 1:10      2drop   ( b a -- )
    pop  DE             ; 1:10      2drop
Fillin_end:
    ret                 ; 1:10      s;
;   ---------  end of data stack function  ---------

Code: Select all

PUSH3(0x4000,6912,255)
FILL
47 bytes 0.06s
Spoiler

Code: Select all

        ORG 32768

;   ===  b e g i n  ===
    ld  (Stop+1), SP    ; 4:20      init   storing the original SP value when the "bye" word is used
    ld    L, 0x1A       ; 2:7       init   Upper screen
    call 0x1605         ; 3:17      init   Open channel
    ld   HL, 0xEA60     ; 3:10      init   Return address stack = 60000
    exx                 ; 1:4       init
    call Fillin         ; 3:17      scall
Stop:                   ;           stop
    ld   SP, 0x0000     ; 3:10      stop   restoring the original SP value when the "bye" word is used
    ld   HL, 0x2758     ; 3:10      stop
    exx                 ; 1:4       stop
    ret                 ; 1:10      stop
;   =====  e n d  =====
;   ---  the beginning of a data stack function  ---
Fillin:                 ;           ( -- )
                       ;[22:93807]  0x4000 6912 255 fill   fill(addr,u,char)   variant >0: fill(no ptr,4*1728 (no limit),?)
    push HL             ; 1:11      0x4000 6912 255 fill
    ld   HL, 0x4000     ; 3:10      0x4000 6912 255 fill   HL = addr
    ld   BC, 0x1BFF     ; 3:10      0x4000 6912 255 fill   B = 27x, C = char
    ld  [HL],C          ; 1:7       0x4000 6912 255 fill
    inc   L             ; 1:4       0x4000 6912 255 fill
    ld  [HL],C          ; 1:7       0x4000 6912 255 fill
    inc   L             ; 1:4       0x4000 6912 255 fill
    ld  [HL],C          ; 1:7       0x4000 6912 255 fill
    inc   L             ; 1:4       0x4000 6912 255 fill
    ld  [HL],C          ; 1:7       0x4000 6912 255 fill
    inc   L             ; 1:4       0x4000 6912 255 fill
    jp   nz, $-8        ; 3:10      0x4000 6912 255 fill
    inc   H             ; 1:4       0x4000 6912 255 fill
    djnz $-12           ; 2:13/8    0x4000 6912 255 fill
    pop  HL             ; 1:10      0x4000 6912 255 fill
Fillin_end:
    ret                 ; 1:10      s;
;   ---------  end of data stack function  ---------
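
The `[22:93807]` in the header comment of the FILL listing can be reproduced from the per-instruction costs. This is only a sketch of that arithmetic in Python (the function name and parameters are mine, not part of M4 FORTH), and like the compiler's figure it counts uncontended T-states and leaves out the final `ret`.

```python
def fill_tstates(outer=27, stores_per_pass=4):
    """Uncontended T-states for the FILL body in the listing above (ret excluded)."""
    # "inc L" clears the zero flag until L wraps to 0, so "jp nz, $-8"
    # loops 256 / 4 = 64 times for every value of H.
    inner_passes = 256 // stores_per_pass
    inner = inner_passes * (stores_per_pass * (7 + 4) + 10)  # ld [HL],C + inc L, then jp nz
    per_page = inner + 4                                     # plus "inc H"
    djnz = (outer - 1) * 13 + 8                              # 13 when taken, 8 on the last pass
    overhead = 11 + 10 + 10 + 10                             # push HL, ld HL, ld BC, pop HL
    return overhead + outer * per_page + djnz


if __name__ == "__main__":
    print(fill_tstates())
```

With B = 27 pages of 256 bytes this lands on the same 93807 T-states the compiler prints in its comment.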

Code: Select all

__ASM({
    push HL
    ld HL, 0xFFFF
    ld B, 216
    di
    ld ($+7+16+3),SP
    ld SP, 0x5B00
  rept 16
    push HL
  endm
    djnz $-16
    ld SP, 0x0000
    ei
    pop HL})
58 bytes 0.01s
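
To see why `ld B, 216` and `rept 16` cover exactly the screen, here is a small Python model of the stack trick. It only mimics what `push HL` does to memory and SP (high byte first, SP decremented twice). The 64K `bytearray` and the function name are assumptions for the illustration, and it does not model timing or interrupts.

```python
def push_fill(memory, top=0x5B00, word=0xFFFF, outer=216, unrolled=16):
    """Model of the di / ld SP,top / 16 x push HL / djnz loop; returns the final SP."""
    sp = top
    for _ in range(outer):              # djnz $-16
        for _ in range(unrolled):       # rept 16 ... push HL ... endm
            sp = (sp - 1) & 0xFFFF
            memory[sp] = word >> 8      # push stores the high byte first
            sp = (sp - 1) & 0xFFFF
            memory[sp] = word & 0xFF    # then the low byte
    return sp


if __name__ == "__main__":
    memory = bytearray(0x10000)         # hypothetical flat 64K address space
    final_sp = push_fill(memory)
    print(hex(final_sp), memory[0x4000:0x5B00].count(0xFF))
```

216 passes of 16 pushes write 216 * 16 * 2 = 6912 bytes, working downwards from 0x5AFF and stopping with SP at 0x4000, which is why the original SP has to be saved and restored around the loop.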
Z80 Forth compiler (ZX Spectrum 48kb): https://codeberg.org/DW0RKiN/M4_FORTH
ketmar
Manic Miner
Posts: 697
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Performance of Forth

Post by ketmar »

welcome to the benchmarking club! ;-) it is mostly useless, but very fun. you are really tempting me to resurrect my optimising compiler project. ;-)
_dw
Dizzy
Posts: 76
Joined: Thu Dec 07, 2023 1:52 am

Re: Performance of Forth

Post by _dw »

I don't see it as pointless: writing a problem in Forth and then seeing how it translates to assembler reveals what can be improved and what is still too challenging. For me, it's an indicator of what I should be working on, and it often makes me think of another solution.

Also, once it is debugged, it looks good compared to the others in the tests you ran.

Parts of the code that have never actually been used or thought through may have problems or even bugs. If no one uses the suboptimal code, it doesn't matter; if it is used and turns out to be buggy or inefficient, there are many ways to improve the translation until it is good enough.

That's what I like most about it, the independence and total control.
ketmar
Manic Miner
Posts: 697
Joined: Tue Jun 16, 2020 5:25 pm
Location: Ukraine

Re: Performance of Forth

Post by ketmar »

i mean that most benchmarks don't reflect real-world performance anyway. UrForth/Beast, for example (being DTC), beats some native code Forth compilers in several benchmarks. it doesn't mean that DTC is faster than native code, it only means that my optimiser is good for those particular benchmarks. on real software, The Beast is ~1.5 to 2.5 times slower than native code (as expected). but hey, i won The Benchmark Game! ;-)

i wrote a fully featured Z80 assembler in UrForth, and even with full inlining optimisations turned on, DTC is still much slower than a quite simple STC with a very dumb peephole optimiser. yet the same system, with the same optimiser, is on par with STC in benchmarks (because it managed to reduce some benchmarks to several primitives).
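
For anyone not used to the abbreviations: DTC is direct threaded code, where a word's body is a list of addresses walked by an inner interpreter, and STC is subroutine threaded code, where the body is a run of real calls. The toy below shows only the structural difference, in Python. It is an analogy with made-up word names, not how UrForth or any Z80 Forth is implemented, and it says nothing about the actual speed gap.

```python
# Three primitive "words" working on a Python list used as the data stack.
def dup(stack):
    stack.append(stack[-1])


def mul(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)


def inc(stack):
    stack.append(stack.pop() + 1)


# Threaded style: the word body is data (a list of references) and a
# dispatch loop, standing in for NEXT, fetches and runs each entry.
SQUARE_PLUS_ONE_THREADED = [dup, mul, inc]


def run_threaded(thread, stack):
    for word in thread:
        word(stack)
    return stack


# Subroutine-threaded style: the word body is code, a plain sequence of
# calls with no dispatch loop between them.
def square_plus_one_stc(stack):
    dup(stack)
    mul(stack)
    inc(stack)
    return stack


if __name__ == "__main__":
    print(run_threaded(SQUARE_PLUS_ONE_THREADED, [7]))
    print(square_plus_one_stc([7]))
```

Both versions compute the same thing; the threaded one pays for the dispatch step on every word, which is the overhead an optimiser can hide in a small benchmark by collapsing it into a few primitives.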

that's what i meant: playing with benchmarks is fun, but most of the time, effort spent improving benchmark times means almost nothing for real apps. Forth is usually used for high-level logic, and low-level primitives are written in asm anyway, so optimising out several DUPs and ROTs doesn't have a huge impact on execution times.

still, i can understand the urge to add "just some more small optimisations", because i am guilty of it myself. ;-) please, don't take my words as me trying to be rude or something, i am simply joking. it is great to see you trying to improve the performance of your compiler.