Lethargeek wrote: ↑Mon Oct 05, 2020 10:11 pm try this:
Code: Select all
: FILLIN 255 23296 16384 DO DUP I C! LOOP DROP ;
might be a bit faster than literal
or 16-bit version:
Code: Select all
: FILLIN 65535 23296 16384 DO DUP I ! 2 +LOOP DROP ;
To be honest, my stopwatch skills aren't really accurate enough to do justice here... but the second version seems to be significantly faster. Maybe a little over 1 second on my Spectaculator. Makes sense when there are half the number of iterations.
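The halved iteration count can be checked with a little arithmetic. This is only a sketch: ITERATIONS is a hypothetical helper, not a word from Abersoft Forth or any other system in this thread, and it is written in plain uppercase Forth that should load on most systems (not tested on Abersoft itself).

```forth
( ITERATIONS is a hypothetical helper for illustration only )
: ITERATIONS ( limit start step -- n )  >R - R> / ;
23296 16384 1 ITERATIONS .  ( 8-bit C! loop, step 1: prints 6912 )
23296 16384 2 ITERATIONS .  ( 16-bit ! loop with 2 +LOOP: prints 3456 )
```

So the 2 +LOOP version runs the loop body 3456 times instead of 6912, which fits the observed speedup.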
Performance of Forth
Re: Performance of Forth
White Lightning included a copyright notice that anything written in it had to be offered up to Oasis Software to publish. I doubt anyone ever bothered. The graphics libraries were powerful enough (they resurfaced in Laser Basic) but the Forth editor they provided was a terrible thing to program in.
- Alone Coder
Re: Performance of Forth
It is possible to make a Forth system that generates call:call:call instead of dw:dw:dw. This will be a lot faster.
Re: Performance of Forth
Alone Coder wrote: ↑Tue Oct 06, 2020 6:05 am It is possible to make a Forth system that generates call:call:call instead of dw:dw:dw. This will be a lot faster.
Beauty of Forth is that you can customize performance and code density and you can quickly build a domain language on top of it. I'm using tokenized (8-bit tokens) Forth-like language scripts in Black Flag. It is much easier to manage game logic in Forth than in asm.
Proud owner of Didaktik M
Re: Performance of Forth
no, it won't.
Re: Performance of Forth
Well, at least the manual was awesome
Website: Tardis Remakes / Mostly remakes of Arcade and ZX Spectrum games.
My games for the Spectrum: Dingo, The Speccies, The Speccies 2, Vallation & Sqij.
Twitter: Sokurah
Re: Performance of Forth
catmeows wrote: ↑Tue Oct 06, 2020 6:26 am Beauty of Forth is that you can customize performance and code density and you can quickly build a domain language on top of it. I'm using tokenized (8-bit tokens) Forth-like language scripts in Black Flag. It is much easier to manage game logic in Forth than in asm.
The real attraction of Forth is that it's very easy to write an interpreter for it on pretty much any machine-code architecture. It's then fairly easy to write simple scripted actions in it, so it's a good choice for something like this. I'm adding a script engine to Go-Go BunnyGun at the moment, although I'm not sure I'm past the tipping point of implementing a fully-featured language like Forth. Although if I needed any maths functions I probably would do it as a Forth stack calculator.
The problem comes when you subject some other poor sod to the thing you wrote!
- Lethargeek
Re: Performance of Forth
Re: Performance of Forth
Lethargeek wrote: ↑Tue Oct 06, 2020 7:17 pm it might, if short primitives are inlined and the code is not too heavy on hi-level words
which will be a mostly useless microbenchmark code. ;-) real benefits (on real-world code) from STC (over DTC) is not even x2 most of the time. and to get close to x2 you have to have a quite sophisticated peephole optimiser, which will take most of the free RAM for itself. pretty straightforward x86 DTC UrForth is about 1.9 times slower than optimised STC BigForth (and about 1.2 times slower than unoptimised SP-Forth with branches/loops inlined). and this is on x86, with its rich choice of addressing modes. so for real apps it can be slightly faster (and the code is about 1.3 times bigger). there is simply no way to make it "a lot faster" if you won't revert to pure asm, or won't stick with specially crafted microbenchmarks.
- Lethargeek
Re: Performance of Forth
ketmar wrote: ↑Tue Oct 06, 2020 7:50 pm which will be a mostly useless microbenchmark code. real benefits (on real-world code) from STC (over DTC) is not even x2 most of the time. and to get close to x2 you have to have a quite sophisticated peephole optimiser, which will take most of the free RAM for itself
no, which may well be an actual game code, as it tends to be low-level most of the time
nothing sophisticated and not much RAM for a simple optimizer doing a fusion of the last 2-3 primitives
ketmar wrote: ↑Tue Oct 06, 2020 7:50 pm pretty straightforward x86 DTC UrForth is about 1.9 times slower than optimised STC BigForth (and about 1.2 times slower than unoptimised SP-Forth with branches/loops inlined). and this is on x86, with its rich choice of addressing modes. so for real apps it can be slightly faster (and the code is about 1.3 times bigger). there is simply no way to make it "a lot faster" if you won't revert to pure asm, or won't stick with specially crafted microbenchmarks.
as i said, it all depends on a hi-level vs low-level ops ratio in the code
with z80 you have a choice optimising either parameter stack ops OR return stack ops
but with things like 6809/6309 you can do both (and not just forth but any threaded code)
Re: Performance of Forth
Lethargeek wrote: ↑Tue Oct 06, 2020 8:20 pm no, which may well be an actual game code, as it tends to be low-level most of the time
why? almost nobody's going to write gfx kernel (or something like that) in ZX Forth anyway, it makes little sense. and the more high-level code, where Forth really shines, calls a lot of high-level words (or the authors are doing it wrong, and creating a write-only mess even they won't be able to understand in a week).
ketmar wrote: ↑Tue Oct 06, 2020 7:50 pm which will be a mostly useless microbenchmark code. ;-) real benefits (on real-world code) from STC (over DTC) is not even x2 most of the time. and to get close to x2 you have to have a quite sophisticated peephole optimiser, which will take most of the free RAM for itself
Lethargeek wrote: ↑Tue Oct 06, 2020 8:20 pm nothing sophisticated and not much RAM for a simple optimizer doing a fusion of the last 2-3 primitives
this doesn't have much sense without removing intermediate stack operations. that's mostly what unoptimised SP-Forth does, and it has very little impact. and on x86 we have "xchg esp,ebp" to quickly switch between our stacks. simply optimising away stack switches is not enough to get huge speedups, you have to use registers for intermediate values. so you need at least basic-block peephole optimiser and register allocator. simple optimiser will not give you even more-or-less stable x2. so to get more speed you need to hardcode a lot of special cases, and that quickly gets out of control. just take a look at any decent optimising STC compiler: it is either a mess of normal/inlineable code all over the place, or a huge list of peephole rules. or both. and it still cannot beat simple DTC even to x4 (which is not that huge after all).
that's why i abandoned optimising STC after some R&D: its complexity simply doesn't pay off.
p.s.: of course, comparing Z80 and x86 is kinda like comparing apples and oranges, but the basic numbers are very close, and x86 is what i had tested recently, so i can operate with real results i've seen, instead of trying to remember exact numbers for unfinished R&D compilers.
p.p.s.: that is, STC can be faster, of course, but it is not a lot faster, and creating good STC code is much more complex task than simply using DTC. and its complexity is not worth the final speed gain, i think. of course, i'd be glad to be wrong here, because there cannot be enough speed. ;-)
- Lethargeek
Re: Performance of Forth
ketmar wrote: ↑Tue Oct 06, 2020 8:42 pm why? almost nobody's going to write gfx kernel (or something like that) in ZX Forth anyway, it makes little sense. and the more high-level code, where Forth really shines, calls a lot of high-level words (or the authors are doing it wrong, and creating a write-only mess even they won't be able to understand in a week).
who said gfx kernel? low-level forth is more than enough for game logic
just look at the Shaw brothers games that are compiled (integer and very low-level) BASIC
ketmar wrote: ↑Tue Oct 06, 2020 8:42 pm this doesn't have much sense without removing intermediate stack operations. that's mostly what unoptimised SP-Forth does, and it has very little impact. and on x86 we have "xchg esp,ebp" to quickly switch between our stacks. simply optimising away stack switches is not enough to get huge speedups, you have to use registers for intermediate values. so you need at least basic-block peephole optimiser and register allocator.
but this IS simple - even possible just using macros in the assembler
ketmar wrote: ↑Tue Oct 06, 2020 8:42 pm simple optimiser will not give you even more-or-less stable x2. so to get more speed you need to hardcode a lot of special cases, and that quickly gets out of control. just take a look at any decent optimising STC compiler: it is either a mess of normal/inlineable code all over the place, or a huge list of peephole rules. or both. and it still cannot beat simple DTC even to x4 (which is not that huge after all).
repeat - it all depends on the hi/lo level code ratio, in your use cases it won't, in my use cases it will
ketmar wrote: ↑Tue Oct 06, 2020 8:42 pm p.s.: of course, comparing Z80 and x86 is kinda like comparing apples and oranges, but the basic numbers are very close, and x86 is what i had tested recently, so i can operate with real results i've seen, instead of trying to remember exact numbers for unfinished R&D compilers.
NO, and don't even bring 16 bits here, these things are fundamentally different
ketmar wrote: ↑Tue Oct 06, 2020 8:42 pm p.p.s.: that is, STC can be faster, of course, but it is not a lot faster, and creating good STC code is much more complex task than simply using DTC. and its complexity is not worth the final speed gain, i think. of course, i'd be glad to be wrong here, because there cannot be enough speed.
using hardware stack as parameter stack gives a boost
also don't forget that inlined primitives lack NEXT
for example, C! becomes just
Code: Select all
pop de
ld (hl), e
pop hl
(OTOH return stack handling becomes slower and uglier, possibly demanding alternative registers use)
Re: Performance of Forth
Lethargeek wrote: ↑Tue Oct 06, 2020 9:27 pm who said gfx kernel? low-level forth is more than enough for game logic
I tend to agree here, my general vocabulary is quite simple. I may add or drop some definitions in future, but in general it is all I need.
Only MIN and MAX are compound words, the rest are native implementations.
Code: Select all
uForthTable
jr uForthDup ;00 DUP
jr uForthDrop ;02 DROP
jr uForthSwap ;04 SWAP
jr uForthOver ;06 OVER
jr uForthRot ;08 ROT
jr uForthPlus ;0A PLUS
jr uForthMinus ;0C MINUS
jr uForthEqual ;0E EQUAL
jr uForthNotEqu ;10 NEQUAL
jr uForthGreat ;12 GREAT
jr uForthLess ;14 LESS
jr uForthEqLess ;16 EQLESS
jr uForthEqGrt ;18 EQGREAT
jr uForthOne ;1A ONE
jr uForthZero ;1C ZERO
jr uForthInc ;1E INC
jr uForthDec ;20 DEC
jp uForthBranch ;22 BRANCH
jp uForthBrTrue ;25 BRTRUE
jp uForthBrFals ;28 BRFALSE
jp uForthToR ;2B TOR
jp uForthFromR ;2E FROMR
jp uForthCpR ;31 COPYR
jp uForthSubr ;34 SUBR
jp uForthReturn ;37 RETURN
jp uForthAbs ;3A ABS
jp uForthNeg ;3D NEG
jp uForthMin ;40 MIN
jp uForthMax ;43 MAX
jp uForthSgn ;46 SGN
jp uForthAnd ;49 ANDOP
jp uForthOr ;4C OROP
jp uForthCpl ;4F CPLOP
jp uForthMOne ;52 MONE
jp uForthXor ;55 XOROP
jp uForthCByte ;58 CBYTE
jp uForthCInt ;5B CINT
jp uForthBFetch ;5E BFETCH
jp uForthIFetch ;61 IFETCH
jp uForthBStore ;64 BSTORE
jp uForthIStore ;67 ISTORE
jp uForthMul ;7A MUL
jp uForthDiv ;7D DIV
jp uForthShl ;80 SHLOP
jp uForthShr ;83 SHROP
jp uForthShra ;86 SHRAOP
jp uForthCUByte ;89 CUBYTE
jp uForthCBFetch ;8C UBFETCH
jp uForthExit ;8F EXIT
Re: Performance of Forth
@Lethargeek it seems to me that we're talking about slightly different things here. seems that you're talking about small and simple Forth-like scripts, and i am talking about writing everything except the very lowest parts in Forth. that can explain why you have mostly words calling primitives, and i have mostly words calling other high-level words. those are two very different things indeed, and STC is much faster in your case (but still not enough to call it "a lot" for me ;-). and that's prolly why i got mostly similar results for x86 and Z80 (as i said, the relative numbers were very close). (and that's why i started cross-compiler project in the first place -- word headers overhead became quite noticeable.)
so i agree with you that for "script-like" use cases there might be a good reason to switch to STC, and it may give some very noticeable speed boost.
also, i wonder if @animaal is silently writing some words to DROP us with our offtopic arguments.
- programandala.net
Re: Performance of Forth
animaal wrote: ↑Mon Oct 05, 2020 7:47 pm I was curious as to how the performance of Forth (Abersoft) relates to Sinclair BASIC and raw Z80. So I tested it. It's a very simple and far from comprehensive test, but perhaps somewhat indicative.
The aim is to fill the screen with pixels, and set all attributes to 255. The code is simple, and enclosed below for each language.
Spoiler: Forth is pretty good! I wonder if it was ever used to write commercial Spectrum software?
Results (Time taken):
basic : 72.50secs
forth : 1.50secs
assembly: 0.04secs
Code: Select all
( forth) : FILLIN 23296 16384 DO 255 I ! LOOP ; FILLIN
I couldn't resist the curiosity. I've added a new benchmark to Solo Forth (version 0.14.0-rc.124 was recently released) to try the 5 possible options to code that in Forth. I copy the code and the results below, in ticks (system frames) and milliseconds:
Code: Select all
( scr-fill-bench )
\ 2020-11-26: Benchmark written after the test found in the
\ following forum thread, titled "Performance of Forth":
\ https://spectrumcomputing.co.uk/forums/viewtopic.php?f=6&t=3487
need dticks need reset-dticks need do need +loop need dticks>ms
: end ( d ca len -- ) cr type ." : " 2dup d. ." ticks ("
dticks>ms d. ." ms)" key drop ;
\ Display the result of a benchmark. _d_ are the ticks and
\ _ca len_ is the name of the bench.
: fill-16b ( -- ) reset-dticks 23296 16384
do 255 i ! loop dticks s" 16b-loop" end ;
\ The original method used in the forum.
: fill-8b ( -- ) reset-dticks 23296 16384
do 255 i c! loop dticks s" 8b-loop" end ;
\ Simpler and a bit faster 8-bit variant.
: fill-16b+ ( -- ) reset-dticks 23296 16384
do 65535 i ! 2 +loop dticks s" 16b-+loop" end ;
\ Much faster 16-bit variant with a 2-byte loop step.
: fill-16b+dup ( -- ) reset-dticks 65535 23296 16384
do dup i ! 2 +loop drop dticks s" 16b-dup-+loop" end ;
\ A bit faster variant that duplicates the value.
: pure-fill ( -- ) reset-dticks 16384 6912 255
fill dticks s" pure" end ;
\ The fastest option by far, a pure Forth loop-less `fill`.
: scr-fill-bench ( -- ) fill-16b fill-8b fill-16b+ fill-16b+dup
pure-fill ; scr-fill-bench
\ 2020-11-26: Results:
\ Test ticks ms
\ ----- ----- ----
\ fill-16b 59 1180
\ fill-8b 57 1140
\ fill-16b+ 36 720
\ fill-16b+dup 35 700
\ pure-fill 2 40
Note Solo Forth is DTC, while Abersoft Forth is an old ITC fig-Forth.
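The ms column in the results follows from the ticks column if one tick is one 50 Hz frame, i.e. 20 ms. A minimal sketch of that conversion, assuming a 50 Hz frame interrupt; TICKS>MS is a hypothetical single-cell helper for illustration, not Solo Forth's own double-cell `dticks>ms`:

```forth
( TICKS>MS is a hypothetical helper; assumes 50 frames per second )
: TICKS>MS ( ticks -- ms )  20 * ;
59 TICKS>MS .  ( fill-16b: prints 1180 )
 2 TICKS>MS .  ( pure-fill: prints 40 )
```

With only 20 ms resolution, the 2-tick result for the loop-less fill is a coarse figure rather than a precise one.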
Marcos Cruz (programandala.net)
- Alone Coder
Re: Performance of Forth
There is another option for keeping the program:
Code: Select all
jp operation
jp operation ;256 bytes below!
jp operation ;256 bytes below!
...
The operations should look like this subtraction:
Code: Select all
inc h
pop bc
ld a,c
sub e
ld e,a
ld a,b
sbc a,d
ld d,a
jp (hl)
The unrolling of hl will require extra code in the program.
inc hl:inc hl:inc hl instead of inc h (or ld [h]l directly in the threaded code) will not be so fast, but this way you can inline some code.
The keeping of return address for user operations that call other user operations:
a)
Code: Select all
ld (ix+127),h
dec lx
ld (ix-128),l
ld hl,...
...
ld l,(ix-128)
inc lx
ld h,(ix+127)
jp (hl) ;106t
b)
if we use de for call stack, with subtraction 12 t-states slower (pop ix:ld a,lx:sub c:ld c,a:ld a,hx:sbc a,b:ld b,a:jp (hl)), that is slower than in call:call:jp method:
Code: Select all
ex de,hl
ld (hl),d
dec l
ld (hl),e
dec l
ex de,hl
ld hl,...
...
ex de,hl
ld e,(hl)
inc l
ld d,(hl)
inc l
ex de,hl
jp (hl) ;74t
The following subtraction scheme is slower than pop ix:ld a,lx...:
Code: Select all
ld hl,$+6
jp sub ;20t
...
sub:
ex (sp),hl
ld a,l
sub c
ld c,a
ld a,h
sbc a,b
ld b,a
ret ;53t
Ret-Forth (with dw:dw:dw for primitives and dw:dw $+2 for user calls) is even slower, even if we keep the call stack in hl:
Code: Select all
Begin of a user call:
pop de
dec l
ld (hl),e
dec l
ld (hl),d
ld sp
ret
...
SEMI:
ld e,(hl)
inc l
ld d,(hl)
inc l
ex de,hl
ld sp,hl
ex de,hl
ret ;98
And we need to keep the checksums of the memory to restore the stack in interrupt.
- Alone Coder
Re: Performance of Forth
Another method of interpreting the dw-forth (without COLON):
User subroutine call = 33(NEXT)+16(SUBR)+10(SEMICOLON) = 59t
Code: Select all
SEMICOLON: ;10t
pop hl
NEXT: ;33t for SUBR ;44t for primitives
inc hl
ld a,(hl)
inc l
add a,a
jr c,SUBR
ld lx,a
jp (ix) ;primitive handler (has a copy of NEXT in the end)
SUBR: ;22t-6 = 16
;hl=next command address, (hl)a=subroutine address
push hl
ld l,a
ld h,(hl) ;hl=subroutine address
;followed by a copy of NEXT without inc hl ;run the first command of the subroutine
subtraction (a b - = a-b):
Code: Select all
ld a,(bc)
sub e
ld e,a
inc c
ld a,(bc)
sbc a,d
ld d,a
inc bc ;40t
;followed by a copy of NEXT
Re: Performance of Forth
I just realized that I have a registration so I can dig some topics out of the grave. .)
https://codeberg.org/DW0RKiN/M4_FORTH/s ... ent-fillin
When writing the Forth compiler, I tried different ways to do it and measure the resulting time, because it simply depends on the implementation and there is no single correct choice (except FILL).
The time is measured externally using basic, so the real time is a little better.
And the size includes the minimal runtime part, which may not even be used.
The code is hidden in a subroutine, so there is also some small overhead regarding the calls.
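Size and time of each variant, in the order of the listings below:
Variant  Size      Time
1        48 bytes  0.19s
2        45 bytes  0.18s
3        52 bytes  0.14s
4        53 bytes  0.07s
5        58 bytes  0.22s
6        51 bytes  0.18s
7        53 bytes  0.18s
8        58 bytes  0.14s
9        59 bytes  0.12s
10       42 bytes  0.12s
11       46 bytes  0.09s
12       44 bytes  0.07s
13       49 bytes  0.06s
14       50 bytes  0.07s
15       48 bytes  0.06s
16       47 bytes  0.06s
17       58 bytes  0.01s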
Code: Select all
PUSH2(23296,16384)
DO
PUSH(255)
I
STORE
LOOP
Code: Select all
PUSH2(23296,16384)
DO
PUSH(255)
I
CSTORE
LOOP
Code: Select all
PUSH2(23296,16384)
DO
PUSH(65535)
I
STORE
PUSH(2) ADDLOOP
Code: Select all
PUSH2(23296,16384)
DO(S)
PUSH(65535)
I
STORE
PUSH(2) ADDLOOP
Code: Select all
PUSH2(65535,0)
PUSH2(23296,16384)
DO
DROP
I
_2DUP
STORE
LOOP
DROP
Code: Select all
PUSH(65535,23296,16384)
DO(S)
_2_PICK
OVER
STORE
LOOP
_DROP
Code: Select all
PUSH(0,23296,16384)
DO
DROP_I
PUSH(255) OVER CSTORE
LOOP
DROP
Code: Select all
PUSH(0,23296,16384)
DO
DROP I
PUSH(65535) OVER STORE
PUSH(2) ADDLOOP
DROP
Code: Select all
PUSH(65535,0,23296,16384)
DO
DROP I
_2DUP STORE
PUSH(2) ADDLOOP
_2DROP
Code: Select all
PUSH(0x4000)
BEGIN
PUSH(255) OVER
CSTORE _1ADD
DUP
PUSH(0x5B00)
EQ
UNTIL
DROP
Code: Select all
PUSH(0x4000)
BEGIN
PUSH(255) OVER CSTORE _1ADD
define({_TYP_SINGLE},{L_first})
DUP PUSH(0x5B00) EQ UNTIL
DROP
Code: Select all
PUSH(0x4000)
BEGIN
PUSH(65535) OVER STORE _2ADD
DUP PUSH(0x5B00) HEQ UNTIL
DROP
Spoiler
Code: Select all
ORG 32768
; === b e g i n ===
ld (Stop+1), SP ; 4:20 init storing the original SP value when the "bye" word is used
ld L, 0x1A ; 2:7 init Upper screen
call 0x1605 ; 3:17 init Open channel
ld HL, 0xEA60 ; 3:10 init Return address stack = 60000
exx ; 1:4 init
call Fillin ; 3:17 scall
Stop: ; stop
ld SP, 0x0000 ; 3:10 stop restoring the original SP value when the "bye" word is used
ld HL, 0x2758 ; 3:10 stop
exx ; 1:4 stop
ret ; 1:10 stop
; ===== e n d =====
; --- the beginning of a data stack function ---
Fillin: ; ( -- )
push DE ; 1:11 0x4000
ex DE, HL ; 1:4 0x4000
ld HL, 16384 ; 3:10 0x4000
begin101: ; begin(101)
;[5:26] 65535 over ! ( addr -- addr+1 )
ld [HL],low 65535 ; 2:10 65535 over !
inc HL ; 1:6 65535 over !
ld [HL],high 65535 ; 2:10 65535 over !
inc HL ; 1:6 2+
;[6:21] dup 0x5B00 h= until(101) 101
ld A, H ; 1:4 dup 0x5B00 h= until(101) 101
xor 0x5B ; 2:7 dup 0x5B00 h= until(101) 101 hi(TOS) ^ hi(0x5B00)
jp nz, begin101 ; 3:10 dup 0x5B00 h= until(101) 101
break101: ; dup 0x5B00 h= until(101) 101
ex DE, HL ; 1:4 drop
pop DE ; 1:10 drop ( a -- )
Fillin_end:
ret ; 1:10 s;
; --------- end of data stack function ---------
Code: Select all
PUSH2(65535,0x4000)
BEGIN
_2DUP STORE _2ADD
_2DUP STORE _2ADD
DUP PUSH(0x5B00) HEQ
UNTIL
_2DROP
Spoiler
Code: Select all
ORG 32768
; === b e g i n ===
ld (Stop+1), SP ; 4:20 init storing the original SP value when the "bye" word is used
ld L, 0x1A ; 2:7 init Upper screen
call 0x1605 ; 3:17 init Open channel
ld HL, 0xEA60 ; 3:10 init Return address stack = 60000
exx ; 1:4 init
call Fillin ; 3:17 scall
Stop: ; stop
ld SP, 0x0000 ; 3:10 stop restoring the original SP value when the "bye" word is used
ld HL, 0x2758 ; 3:10 stop
exx ; 1:4 stop
ret ; 1:10 stop
; ===== e n d =====
; --- the beginning of a data stack function ---
Fillin: ; ( -- )
;[8:42] 65535 0x4000 ( -- 65535 0x4000 )
push DE ; 1:11 65535 0x4000
push HL ; 1:11 65535 0x4000
ld DE, 0xFFFF ; 3:10 65535 0x4000
ld HL, 0x4000 ; 3:10 65535 0x4000
begin101: ; begin(101)
;[4:26] 2dup ! 2+ ( x addr -- x addr+2 )
ld [HL],E ; 1:7 2dup ! 2+
inc HL ; 1:6 2dup ! 2+
ld [HL],D ; 1:7 2dup ! 2+
inc HL ; 1:6 2dup ! 2+
;[4:26] 2dup ! 2+ ( x addr -- x addr+2 )
ld [HL],E ; 1:7 2dup ! 2+
inc HL ; 1:6 2dup ! 2+
ld [HL],D ; 1:7 2dup ! 2+
inc HL ; 1:6 2dup ! 2+
ld A, H ; 1:4 dup 0x5B00 h= until(101) ( h1 -- h1 ) flag: hi(tos) == hi(23296)
xor 0x5B ; 2:7 dup 0x5B00 h= until(101) hi(TOS) ^ hi(0x5B00)
jp nz, begin101 ; 3:10 dup 0x5B00 h= until(101) variant: defalut
break101: ; dup 0x5B00 h= until(101)
pop HL ; 1:10 2drop ( b a -- )
pop DE ; 1:10 2drop
Fillin_end:
ret ; 1:10 s;
; --------- end of data stack function ---------
Code: Select all
PUSH2(65535,0x4000)
BEGIN
_2DUP_STORE _2ADD
DUP_PUSH_EQ_UNTIL(0x5B00)
_2DROP
Spoiler
Code: Select all
ORG 32768
; === b e g i n ===
ld (Stop+1), SP ; 4:20 init storing the original SP value when the "bye" word is used
ld L, 0x1A ; 2:7 init Upper screen
call 0x1605 ; 3:17 init Open channel
ld HL, 60000 ; 3:10 init Init Return address stack
exx ; 1:4 init
call Fillin ; 3:17 scall
Stop: ; stop
ld SP, 0x0000 ; 3:10 stop restoring the original SP value when the "bye" word is used
ld HL, 0x2758 ; 3:10 stop
exx ; 1:4 stop
ret ; 1:10 stop
; ===== e n d =====
; --- the beginning of a data stack function ---
Fillin: ; ( -- )
push DE ; 1:11 push2(65535,0x4000)
ld DE, 65535 ; 3:10 push2(65535,0x4000)
push HL ; 1:11 push2(65535,0x4000)
ld HL, 0x4000 ; 3:10 push2(65535,0x4000)
begin101: ; begin 101
;[4:26] 2dup ! 2+ _2dup_store_2add ( x addr -- x addr+2 )
ld (HL),E ; 1:7 2dup ! 2+ _2dup_store_2add
inc HL ; 1:6 2dup ! 2+ _2dup_store_2add
ld (HL),D ; 1:7 2dup ! 2+ _2dup_store_2add
inc HL ; 1:6 2dup ! 2+ _2dup_store_2add
;[11:18/39] dup 0x5B00 eq until 101 variant: lo(0x5B00) = 0
ld A, L ; 1:4 dup 0x5B00 eq until 101
or A ; 1:4 dup 0x5B00 eq until 101
jp nz, begin101 ; 3:10 dup 0x5B00 eq until 101
ld A, high 0x5B00; 2:7 dup 0x5B00 eq until 101
xor H ; 1:4 dup 0x5B00 eq until 101
jp nz, begin101 ; 3:10 dup 0x5B00 eq until 101
break101: ; dup 0x5B00 eq until 101
pop HL ; 1:10 2drop
pop DE ; 1:10 2drop ( b a -- )
Fillin_end:
ret ; 1:10 s;
; --------- end of data stack function ---------
Code: Select all
PUSH2(0x5BFF,0x4000)
BEGIN
_2DUP CSTORE _1CADD
_2DUP CSTORE _1CADD
_2DUP CSTORE _1CADD
_2DUP CSTORE _1ADD
_2DUP HEQ UNTIL
_2DROP
Spoiler
Code: Select all
ifdef __ORG
org __ORG
else
org 24576
endif
; === b e g i n ===
ld (Stop+1), SP ; 4:20 init storing the original SP value when the "bye" word is used
ld L, 0x1A ; 2:7 init Upper screen
call 0x1605 ; 3:17 init Open channel
ld HL, 0xEA60 ; 3:10 init Return address stack = 60000
exx ; 1:4 init
call Fillin ; 3:17 scall
Stop: ; stop
ld SP, 0x0000 ; 3:10 stop restoring the original SP value when the "bye" word is used
ld HL, 0x2758 ; 3:10 stop
exx ; 1:4 stop
ret ; 1:10 stop
; ===== e n d =====
; --- the beginning of a data stack function ---
Fillin: ; ( -- )
;[8:42] 0x5BFF 0x4000 ( -- 0x5BFF 0x4000 )
push DE ; 1:11 0x5BFF 0x4000
push HL ; 1:11 0x5BFF 0x4000
ld DE, 0x5BFF ; 3:10 0x5BFF 0x4000
ld HL, 0x4000 ; 3:10 0x5BFF 0x4000
begin101: ; begin(101)
;[1:7] 2dup c! ( char addr -- char addr ) [addr]=lo8(x)
ld [HL],E ; 1:7 2dup c!
inc L ; 1:4 1c+ ( x1 -- x2 ) x2 = 256*hi(x1) + lo(x1 + 1)
;[1:7] 2dup c! ( char addr -- char addr ) [addr]=lo8(x)
ld [HL],E ; 1:7 2dup c!
inc L ; 1:4 1c+ ( x1 -- x2 ) x2 = 256*hi(x1) + lo(x1 + 1)
;[1:7] 2dup c! ( char addr -- char addr ) [addr]=lo8(x)
ld [HL],E ; 1:7 2dup c!
inc L ; 1:4 1c+ ( x1 -- x2 ) x2 = 256*hi(x1) + lo(x1 + 1)
;[2:13] 2dup c! 1+ ( x addr -- x addr+1 ) [addr]=lo8(x)
ld [HL],E ; 1:7 2dup c! 1+
inc HL ; 1:6 2dup c! 1+
;[5:18] 2dup h= until ( h2 h1 -- h2 h1 )
ld A, H ; 1:4 2dup h= until
xor D ; 1:4 2dup h= until hi(h2) ^ hi(h1)
jp nz, begin101 ; 3:10 2dup h= until
break101: ; 2dup h= until
pop HL ; 1:10 2drop ( b a -- )
pop DE ; 1:10 2drop
Fillin_end:
ret ; 1:10 s;
; --------- end of data stack function ---------
Code: Select all
PUSH3(0x4000,6912,255)
FILL
Spoiler
Code: Select all
ORG 32768
; === b e g i n ===
ld (Stop+1), SP ; 4:20 init storing the original SP value when the "bye" word is used
ld L, 0x1A ; 2:7 init Upper screen
call 0x1605 ; 3:17 init Open channel
ld HL, 0xEA60 ; 3:10 init Return address stack = 60000
exx ; 1:4 init
call Fillin ; 3:17 scall
Stop: ; stop
ld SP, 0x0000 ; 3:10 stop restoring the original SP value when the "bye" word is used
ld HL, 0x2758 ; 3:10 stop
exx ; 1:4 stop
ret ; 1:10 stop
; ===== e n d =====
; --- the beginning of a data stack function ---
Fillin: ; ( -- )
;[22:93807] 0x4000 6912 255 fill fill(addr,u,char) variant >0: fill(no ptr,4*1728 (no limit),?)
push HL ; 1:11 0x4000 6912 255 fill
ld HL, 0x4000 ; 3:10 0x4000 6912 255 fill HL = addr
ld BC, 0x1BFF ; 3:10 0x4000 6912 255 fill B = 27x, C = char
ld [HL],C ; 1:7 0x4000 6912 255 fill
inc L ; 1:4 0x4000 6912 255 fill
ld [HL],C ; 1:7 0x4000 6912 255 fill
inc L ; 1:4 0x4000 6912 255 fill
ld [HL],C ; 1:7 0x4000 6912 255 fill
inc L ; 1:4 0x4000 6912 255 fill
ld [HL],C ; 1:7 0x4000 6912 255 fill
inc L ; 1:4 0x4000 6912 255 fill
jp nz, $-8 ; 3:10 0x4000 6912 255 fill
inc H ; 1:4 0x4000 6912 255 fill
djnz $-12 ; 2:13/8 0x4000 6912 255 fill
pop HL ; 1:10 0x4000 6912 255 fill
Fillin_end:
ret ; 1:10 s;
; --------- end of data stack function ---------
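Reading the `[22:93807]` annotation in the listing above as size in bytes and T-states (as the per-instruction `bytes:T-states` comments suggest), the pure execution time of the FILL variant can be estimated. A rough sketch, assuming a 3.5 MHz Z80 (3500 T-states per millisecond) and ignoring contended memory; T>MS is a hypothetical helper, and the literal needs a Forth with cells wider than 16 bits (gforth, for example), so it will not work as written on a 16-bit Spectrum Forth:

```forth
( T>MS is a hypothetical helper; assumes 3500 T-states per millisecond )
( 93807 does not fit in a 16-bit cell, so this needs a wider-cell Forth )
: T>MS ( t-states -- ms )  3500 / ;
93807 T>MS .  ( prints 26 )
```

That is well under the externally measured figures, which fits the remark that timing from BASIC adds overhead.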
Code: Select all
__ASM({
push HL
ld HL, 0xFFFF
ld B, 216
di
ld ($+7+16+3),SP
ld SP, 0x5B00
rept 16
push HL
endm
djnz $-16
ld SP, 0x0000
ei
pop HL})
Z80 Forth compiler (ZX Spectrum 48kb): https://codeberg.org/DW0RKiN/M4_FORTH
Re: Performance of Forth
welcome to the benchmarking club! ;-) it is mostly useless, but very fun. you are really tempting me to resurrect my optimising compiler project. ;-)
Re: Performance of Forth
I don't see it as pointless because trying to program a problem in Forth and then seeing how it translates to assembler will reveal what can be improved or what is already too challenging. For me, it's an indicator of what I should be working on. It makes me think of another solution.
Also if you debug it in the tests you ran it looks good compared to others.
Real parts of the code that have never been used or thought about may have some problems or even bugs. If no one is using suboptimal code, then it doesn't matter, and if it is used and found to be buggy or inefficient, there are many ways to improve the translation until it is good enough.
That's what I like most about it, the independence and total control.
Re: Performance of Forth
i mean that most benchmarks don't reflect the real-world performance anyway. UrForth/Beast, for example (being DTC), beats some native code Forth compilers in several benchmarks. it doesn't mean that DTC is faster than native code, it only means that my optimiser is good for those particular benchmarks. on real software, The Beast is ~1.5/2.5 times slower than native code (as expected). but hey, i won The Benchmark Game! ;-)
i wrote fully featured Z80 assembler in UrForth, and even with full inlining optimisations turned on DTC is still much slower than quite simple STC, with very dumb peephole optimiser. yet the same system, with the same optimiser is on par with STC in benchmarks (because it managed to reduce some benchmarks to several primitives).
that's what i meant: playing with benchmarks is fun, but most of the time efforts spent to improve benchmark times mean almost nothing for real apps. Forth is usually used for high-level logic, and low-level primitives are written in asm anyway, so optimising out several DUPs and ROTs doesn't have a huge impact on execution times.
still, i can understand the urge to add "just some more small optimisations", because i am guilty of it myself. ;-) please, don't take my words as me trying to be rude or something, i am simply joking. it is great to see you trying to improve the performance of your compiler.