Forth compiler for ZX

_dw · Post by **_dw** » Thu Feb 22, 2024 10:54 pm

It's alright. I understand that it is not tested. I understood what it was doing and I didn't see the error there myself. So it was even harder to find a fault. Especially when I'm not entirely sure how gforth and the Linux terminal do it with utf8.

If you come across the word "chains" read it as "strings". Google translator does not understand the context and in Czech it is simply an overloaded/multiple meaning word.

With the non-definition of how a table with chain links should be created in Forth is problematic.
In Forth, it can be created once, even by hand (although perhaps Forth does not support strings other than words) at the beginning.
I can easily make it as "dw" items when compiling, but...
The compiler will not understand that it is a constant list. So he can substitute values for "dw", but at the same time he will try to overwrite them to these values somewhere at the beginning of the program (in the definition area in the source). Because he doesn't know that they will never be changed. And if something were to change there, the program would fail when restarted.

That's quite enough code for the initialization.

Code: Select all

ld hl, x1; 3:10 a.m
ld [addr_x1],x1 ; 3:16 a.m
ld hl, x2; 3:10 a.m
ld [addr_x2],x1 ; 3:16 a.m
...

and he can't even optimize it because xxx is the value assigned by PASMO. So even minimal optimizations like:

Code: Select all

ld hl, x1; 3:10 a.m
ld [addr_x1],x1 ; 3:16 a.m
ld l, x2; 2:7
ld [addr_x2],x1 ; 3:16 a.m
...

The code is not so bad, the more items, the better it gets. Because it will not store it according to how we send it to it (in order according to addresses) but sort it according to the stored values (because in ld [adr], hl you can have any address).
I actually tested it by setting my org to 0x4000 and splitting the image

into a long list of "num ," and the resulting code was able to copy it faster than the beam caught up with me. (So the compilation takes hours and I had to rewrite the code several times because it takes up gigabytes of RAM memory. Well, it's still just a macro, but the result is a code that a human cannot write by hand.)

I was just looking at how I got the non-standard word >," < and when I tested it again like gforth has it, it's wrong/different.

gforth creates a computer chain and I create a normal chain with the difference that instead of STRING_SECTION I store it in VARIABLE_SECTION. No initialization.

Code: Select all

dworkin@dw-A15:~/Programovani/ZX/Forth/M4$ ../check_word.sh 'COMMASTRING({"Hi"})'
                        ;           comma"string"    ,"Hi"

VARIABLE_SECTION:

                        ; +2 bytes?
    db "Hi"

;# ============================================================================
  if ($<0x0100)
    .error Overflow 64k! over 0..255 bytes
  endif
  if ($<0x0200)
    .error Overflow 64k! over 256..511 bytes
  endif
  if ($<0x0400)
    .error Overflow 64k! over 512..1023 bytes
  endif
  if ($<0x0800)
    .error Overflow 64k! over 1024..2047 bytes
  endif
  if ($<0x1000)
    .error Overflow 64k! over 2048..4095 bytes
  endif
  if ($<0x2000)
    .error Overflow 64k! over 4096..8191 bytes
  endif
  if ($<0x3000)
    .error Overflow 64k! over 8192..12287 bytes
  endif
  if ($<0x4000)
    .error Overflow 64k! over 12288..16383 bytes
  endif
  if ($>0xFF00)
    .warning Data ends at 0xFF00+ address!
  endif
; seconds: 0           ;[ 0:0]

Maybe it would be better to leave the string in STRING_SECTION and store two items returning STRING in VARIABLE_SECTION. The address of the first character and the length of the string. No initialization! Because it would apply to string constant tables.

Then I already tested different procedures and they are functional but a bit laborious to write.

The resource then looked something like this:

Code: Select all

CREATE(MonthTable)
STRING({"January"})   DCOMMA
STRING({"February"})  DCOMMA
STRING({"March"})     DCOMMA
STRING({"April"})     DCOMMA
STRING({"May"})       DCOMMA
STRING({"June"})      DCOMMA
STRING({"July"})      DCOMMA
STRING({"August"})    DCOMMA
STRING({"September"}) DCOMMA
STRING({"October"})   DCOMMA
STRING({"November"})  DCOMMA
STRING({"December"})  DCOMMA

or

Code: Select all

CREATE(_jan)
COMMASTRING({"January",0})
CREATE(_feb)
COMMASTRING({"February",0})
CREATE(_mar)
COMMASTRING({"March",0})
CREATE(_apr)
COMMASTRING({"April",0})
CREATE(_may)
COMMASTRING({"May",0})
CREATE(_jun)
COMMASTRING({"June",0})
CREATE(_jul)
COMMASTRING({"July",0})
CREATE(_aug)
COMMASTRING({"August",0})
CREATE(_sep)
COMMASTRING({"September",0})
CREATE(_oct)
COMMASTRING({"October",0})
CREATE(_nov)
COMMASTRING({"November",0})
CREATE(_dec)
COMMASTRING({"December",0})

CREATE(MonthIndexTable)
PUSH(_jan) COMMA 
PUSH(_feb) COMMA 
PUSH(_mar) COMMA 
PUSH(_apr) COMMA 
PUSH(_may) COMMA 
PUSH(_jun) COMMA 
PUSH(_jul) COMMA 
PUSH(_aug) COMMA 
PUSH(_sep) COMMA 
PUSH(_oct) COMMA 
PUSH(_nov) COMMA 
PUSH(_dec) COMMA

and all this for different variants of character strings.

So I created the word COMMA STRING (>,c" <)

Code: Select all

dworkin@dw-A15:~/Programovani/ZX/Forth/M4$ ../check_word.sh 'COMMACSTRING({"Hi"})'
                        ;           commac"string"    ,c"Hi"

STRING_SECTION:
string101:
    db 2,"Hi"
size101              EQU $ - string101
 if (size101 != 3)
   .error: Incorrectly calculated character string size, does not match the actual size!
 endif


VARIABLE_SECTION:

    dw string101        ; +2 bytes

;# ============================================================================
  if ($<0x0100)
    .error Overflow 64k! over 0..255 bytes
  endif
  if ($<0x0200)
    .error Overflow 64k! over 256..511 bytes
  endif
  if ($<0x0400)
    .error Overflow 64k! over 512..1023 bytes
  endif
  if ($<0x0800)
    .error Overflow 64k! over 1024..2047 bytes
  endif
  if ($<0x1000)
    .error Overflow 64k! over 2048..4095 bytes
  endif
  if ($<0x2000)
    .error Overflow 64k! over 4096..8191 bytes
  endif
  if ($<0x3000)
    .error Overflow 64k! over 8192..12287 bytes
  endif
  if ($<0x4000)
    .error Overflow 64k! over 12288..16383 bytes
  endif
  if ($>0xFF00)
    .warning Data ends at 0xFF00+ address!
  endif
; seconds: 0           ;[ 0:0]

Code: Select all

create ascii_table
c"  " ,      \ space
c" -.-.--" , \ ! ( or ---...-) (or +)
c" .-..-." , \ "
c" ..--.." , \ # --> ? --> "num"?
...

transcribed to

Code: Select all

create ascii_table
,c"  "       \ space
,c" -.-.--"  \ ! ( or ---...-) (or +)
,c" .-..-."  \ "
,c" ..--.."  \ # --> ? --> "num"?
...

and after compilation morse_code.bin shrunk from 1390 to 827 bytes! Initialization took up over 40% of the total length of the program.

I should probably hang this up for all kinds of chains. Because I can do this even for null-terminated string characters, etc. Then just one word COUNTZ is added and it can be worked with like a normal string.

Rewritable chains would simply have to be created with the help of initialization by copying MOVE etc. from the constant chain. This is probably the best way. I'm still not sure. Because it would really like to see some program/task that requires it.

In addition, it seems that the definition of the standard for computer chains is made in such a way that physically the chains do not have to look like Pascal. It can be combined byte_length + pointer_to_the_first_character.

In theory, this should make various operations on strings easier. Because the pointer to this (construction? hell, I forgot how to name it) would remain the same (it wouldn't change the address).

It's amazing how many problems can be caused by just a suggestion of how to design it. And not how to do it.

Make a translator (another wrong translation = compiler) is like playing on a children's playground that has no back border.

PS: Not to mention it completely. The difference between the proposal of how to divide words into primitive words before (khamurapi) and now (morse_code) is that before it was a code for real forth (gforth) and now it has to be translated for ZX Spectrum. So too many nested words will create inefficient code. And here again the question arises what to do with it? Disguise it and have transparent source code.
But which ones? Original Forth, so run the fth2m4.sh script on top of that first? So I should somehow test whether the words are "leaf" and in M4_FORTH is "inline".
Or should I solve it with M4_FORTH? By word size? Are there leaves? Inline just before the sheet too? Feel like using it? When one automatically mit inline? And then how to solve EXECUTE? Do not look at these words inline? But if the translator will solve it himself, then again I lose a little control over the result... too many problems... .)))

Especially who I'm targeting. For someone who can write an effective forth program, so why compile it anyway? Someone you don't like
of C translations are inefficient? He probably won't want to learn Forth... Hard to say. Especially when I come across some error that worked before, but stopped after some change... and I realize that it was a few months ago and NO ONE complained. So no one has probably tried it. I wonder if anyone ever does
really useful.

ketmar · Post by **ketmar** » Fri Feb 23, 2024 7:49 am

_dw wrote: ↑Thu Feb 22, 2024 10:54 pm Especially when I'm not entirely sure how gforth and the Linux terminal do it with utf8.

i don't know too. my system locale is KOI8-U. ;-)

_dw wrote: ↑Thu Feb 22, 2024 10:54 pm If you come across the word "chains" read it as "strings". Google translator does not understand the context and in Czech it is simply an overloaded/multiple meaning word.

lol, computer translation is still fun after all those years spent on "improving" it. not that my "manual" writing is much better, tho.

_dw wrote: ↑Thu Feb 22, 2024 10:54 pm PS: Not to mention it completely. The difference between the proposal of how to divide words into primitive words before (khamurapi) and now (morse_code) is that before it was a code for real forth (gforth) and now it has to be translated for ZX Spectrum. So too many nested words will create inefficient code. And here again the question arises what to do with it? Disguise it and have transparent source code.

inline. inline aggressively. it's so easy in Forth that there is no excuse to not doing it. ;-)

_dw wrote: ↑Thu Feb 22, 2024 10:54 pm But which ones? Original Forth, so run the fth2m4.sh script on top of that first? So I should somehow test whether the words are "leaf" and in M4_FORTH is "inline".

don't bother optimising for other systems, it's not your problem. if something is efficient in M4F and not efficient in other system, it's not your fault. (that's, btw, is why you'd better have your own Forth system for PC too.)

_dw wrote: ↑Thu Feb 22, 2024 10:54 pm Or should I solve it with M4_FORTH? By word size? Are there leaves? Inline just before the sheet too? Feel like using it? When one automatically mit inline? And then how to solve EXECUTE? Do not look at these words inline? But if the translator will solve it himself, then again I lose a little control over the result... too many problems... .)))

the usual solution is 2-pass compiler. you're doing cross-compilation anyway, so use one pass to count number of calls to some word, and then decide if it should be inlined. and let the user to fine-tune the heuristics. words which are called one or two times are obvious candidates for inlining. very small words, like ": fld@ 2+ @ ;" are obvious candidates for inlining regardless of call count. just keep fine-tuning it until you're happy.

in Beast, i am inlining anything smaller than 69 bytes of code (Very Scientific Number choosen because "69" and Succubus nicely fit together ;-). but with limited memory it's better to use 2-pass approach.

also, you can introduce more word flags to control the compiler. in beast, i have "{no-inline}", "(inline-blocker)", "(force-inline)". so i can write "{no-inline} : boo … ;", or ": foo … ; (force-inline)". (telling Succubus that the word will not be inlined before compiling it makes compilation slightly faster, that's why "no-inline" is prefix.)

_dw wrote: ↑Thu Feb 22, 2024 10:54 pm Especially who I'm targeting.

yourself. do not think about hypotetical users, think about yourself. make the system so good that you will want to use it unstead of raw asm most of the time.

and… don't expect it to get more users. in Forth land, any serious Forth programmer is using their own tools and compilers. we exchange ideas, not implementations.

that's also why you'd better stop trying to follow any Forth standard. standards sux. it is not possible to standardize Forth, because it can be implemented in a wild number of ways. the standard (any of them) tries to avoid anything implementation-dependent, and this makes it useless. there is no way to create string tables because different systems handle strings differently. there is no way to search for words in wordlists from the code because different systems… and so on. the whole concept of machine-independent data/address size is idiocity. requirement of control flow stack is idiocity. insisting that "DOES>" could be used with "CREATE" instead of having separate "<BUILDS DOES>" is idiocity. expecting that "DOES>" can replace previous "DOES>" is idiocity. and so on. the standard is both trying to be implementation-neutral, and dictates the particular implementation in the same time.

to make the long story short, it is not possible to write good and efficient code in "standard Forth". you will inevitably resort to non-standard tricks. and then, why trying to follow any standard at all?

just one more example of "standard idiocity": "/" is machine-dependent, it can be either symmetrical, or floored. so you can't expect that "2/" and "2 /" are the same. yay. they "fixed" it by removing "2/", which was in Forth for decades, breaking a lot of code. and in the same time they couldn't decide if "NOT" should be logical or bitwise, so they made it bitwise, while it was logical for decades too. so much for "not breaking the existing code", yay.

basically, the standard (ANS) was made to promote several commercial Forth systems, so they could bear "ANS standard" label. almost nobody else was interested in Yet Another Standard. and current "standards team" is a bunch of losers who haven't wrote any real working (and used) application in Forth for decades.

ketmar · Post by **ketmar** » Fri Feb 23, 2024 8:04 am

_dw wrote: ↑Thu Feb 22, 2024 10:54 pm It's amazing how many problems can be caused by just a suggestion of how to design it. And not how to do it.

Make a translator (another wrong translation = compiler) is like playing on a children's playground that has no back border.

and that's also why you should have your own compiler, written in Forth too. because if your compiler is written in Forth, you don't have to hardcode any particular implementation: you can extend the compiler on the fly. that's why Beast, for example, doesn't have any way to create string tables: i'd implement it when i'll need it, and in the way which is best suited for the task at hand. different tasks may need different approaches, so instead of hardcoding one of them, i will simply extend the compiler in a slightly different way each time.

(by the way, Beast herself was rewritten around 5 times until i became happy with the result. and Succubus is going to get retargetable SSA codegen in the future. but even now Succubus is on par (and sometimes better) than commercial optimising Forth systems. ;-)

it is time to create your own PC Forth system, and rewrite M4F in it. ;-) also, write Z80 asm in it too, and use it instead of other Z80 assemblers. this way you'll get a powerful extensible Z80 asm you can control.

ketmar · Post by **ketmar** » Fri Mar 08, 2024 5:57 pm

heh. your compiler is better than mine SSA poc.
test code:

Code: Select all

8192 constant size

create flags size allot \ create;

: do-prime  ( -- count )
  flags size -1 fill
  0 0 begin ( count index )
    dup flags + c@
    if dup 2* 3 + 2dup +
      begin dup size < while dup flags + 0 swap c! over + repeat 2drop
    swap 1+ swap then
  1+ dup size >= until
  drop ;


: xtest
  cr ." testing..." cr
  0 23672 !
  do-prime
  23672 @
  cr ." count=" swap .
  cr ." time: " dup 50 / . 46 emit 50 mod 20 * . cr
  begin again ;

xtest

mine is 1.060, yours is 0.940. excellent result, congratulations!

p.s.: the Forth code sux, of course, but it is good to test optimisers. ;-)

_dw · Post by **_dw** » Mon Mar 18, 2024 2:50 am

Today I finished the first public reworked version of the script which tries to convert the original forth source code to M4 FORTH source code.

In the previous version, I added DOES> support, but when I started to deal with it more deeply, I found out that I have a fundamental error there.

After several attempts and reading what DOES> could do, I came to the incomplete conclusion that the word/function with DOES> can be divided into two parts. The first one before DOES> which must (erroneously) contain CREATE and then the part after DOES> which is executed after NAME is used for CREATE in the next part of the code. With the fact that before calling the second part, the address of that name is inserted into the stack as the last parameter. (If necessary, it is inserted inside that word/function, but then a separate word/function must be created for each word).

Something like

Code: Select all

: test ... label ... does> ...second... ;

test name1
...
name1
...

can be replaced by

Code: Select all

: test ... label ... ;
: test_does ...second... ;

test name1
...
name1 test_does
...

I was puzzled that the parameter of that word/function is written AFTER the call of that word/function.

This applies not only to CREATE but also to VALUE, VARIABLE, etc.

Upon further consideration, it makes sense that it cannot be the first because FORTH has a benevolent approach to names and cannot distinguish a parameter from a word/function name. If it was the first, it would fail on an unknown word.

More surprising was that it is outside the body of the word/function.

Further investigation revealed that CREATE does not even have to be part of the DOES> word/function. As long as there is a way to find out the LAST new name.

This required a complete change of approach to the script. I had to completely rewrite it from the original, more or less one-lane. Before I came across the word WORD, which made me want to rewrite it all over again.

AWK reads a line and starts converting words to TOKENS. Either it knows the word or it detects that it is a NUMBER, DOUBLE, FLOAT or it doesn't know the word and stores "raw" as a token parameter. If it's a word requiring "string for/name" and we're inside a function, it doesn't try to get a "parameter", but leave it empty. And I will mark that function as INLINE. If it encounters an unknown word and the parameter is already the name of a function, it replaces it with CALL(name), or when it finds that it is an inline function, it changes the flow to the next token inside the function, and it gradually goes through its code and stores a COPY of the token at the end token queues. If it comes across something like CREATE, VALUE, etc. that doesn't have a parameter, it jumps out of the function with the fact that it reads the parameter from the input and returns. A special tray with indicators where the flow is located must be kept. After testing GFORTH, the name is always completely outside all embedded functions. WORD behaves similarly, except that it changes the type of separator/termination of the loaded word. PARSE is like WORD, only WORD creates a PASCAL-type computer string, and PARSE returns the address and length the same as STRING.

Since I added support for [IF] [ELSE] and [THEN] or [ENDIF], it got even more complicated. Especially the behavior for [IF] if he doesn't pay is a little different than I expected. If the branch is valid, [ELSE] can be inserted/hidden in the string or PARSE/NAME without having any effect. If the branch is not valid, [ELSE] is also searched inside the chain and it will be valid. In an invalid branch, everything is just a bunch of characters until [ELSE] or [THEN]/[ENDIF] is found. It's complicated by the fact that after [IF] it feels like another nesting... so the code after [ELSE] is still invalid.

In order not to confuse it, I now have a script that at least to some extent tries to be functional with parameters after the name.

The problem is that I don't have a test set, so what I tried manually before when I added some support is not tested now. So there must be an error or several errors somewhere.

_dw · Post by **_dw** » Mon Mar 18, 2024 2:54 am

ketmar wrote: ↑Fri Mar 08, 2024 5:57 pm heh. your compiler is better than mine SSA poc.
test code:
Code: Select all
8192 constant size

create flags size allot \ create;

: do-prime  ( -- count )
  flags size -1 fill
  0 0 begin ( count index )
    dup flags + c@
    if dup 2* 3 + 2dup +
      begin dup size < while dup flags + 0 swap c! over + repeat 2drop
    swap 1+ swap then
  1+ dup size >= until
  drop ;


: xtest
  cr ." testing..." cr
  0 23672 !
  do-prime
  23672 @
  cr ." count=" swap .
  cr ." time: " dup 50 / . 46 emit 50 mod 20 * . cr
  begin again ;

xtest
mine is 1.060, yours is 0.940. excellent result, congratulations!

p.s.: the Forth code sux, of course, but it is good to test optimisers.

What is SSA poc.?

If your interpreted code runs basically as fast as compiled, then it is a special case (problematic algorithm that compiles badly) or an incredible result of the interpreter.

SSA poc. is there any other hardware? If you use the same one, you are 6x faster than the best C compiler and 3.5x faster than Boriel Basic, unbelievable!

PS: By the way, just yesterday I was correcting the code for NEGATE so that it returns a sign and does not convert the number to unsigned, because ALLOT tended to understand the number as 63536 and not -1000. Whether a maximum of 32767 bytes can be allocated from the FORTH standard. I didn't like that very much, so if ALLOT receives 63536 as a parameter, it allocates (and fails) and if -1000, it de-allocates 1000 bytes. Both are the same value, just interpreted differently. Probably a better solution would be to give a word that has a double parameter.

PPS: In many places I automatically convert -1000 to 63536, because the word itself knows from the definition what kind of value it is. Whether signed or unsigned. And PASMO cannot deal with negative numbers. (An ugly mistake in priorities)

ketmar · Post by **ketmar** » Mon Mar 18, 2024 4:11 am

_dw wrote: ↑Mon Mar 18, 2024 2:54 am What is SSA poc.?

i meant my SSA-based proof-of-concept compiler. it is not ready to be published yet, but basically it works like this:
1. compiled code is represented by a special VM bytecode (primitives are encoded as one-byte VM instructions, calls to other high-level words are 2 bytes). this bytecode can be executed directly (with a simple interpreter), but it's just a side-effect. ;-)
2. VM code is converted to the traditional SSA representation, replacing stack operations with special pseudoinstructions.
3. the usual SSA optimisations applied (constant folding, dead code elimination, and so on).
4. for each basic block (which doesn't contain branch or call) the machine code is generated. codegen tries to use all registers available, track register values, data size and so on.

i'm not happy with implementation details yet, though.

the generated code looks more or less like yours (slightly better, because codegen used BC as stack slot — 8ts vs 10/11 ts), but my codegen is not smart enough to realise that comparison with 8192 could be done with 8-bit instructions, and emited 16-bit "sbc".

after adding comparison trick, runtime dropped to 0.800. ;-)

that's what my code was like:

Spoiler

Code: Select all

:do-prime-main  ;; ( -- count )
  call  # stc-to-dstack
  ;; lit stc-flags
  push  hl
  ld    hl, # stc-flags
  ;; lit stc-size
  push  hl
  ld    hl, # stc-size
  ;; lit -1
  push  hl
  ld    hl, # -1
  call  # stc-to-rstack
  call  # stc-fill
  call  # stc-to-dstack
  ;; lit 0
  push  hl
  ld    hl, # 0
  ;; lit 0
  push  hl
  ;; ( count index )
:.loop0
  ld    bc, hl
  ld    de, # stc-flags
  add   hl, de
  ld    a, (hl)
  ;; 0branch
  or    a
  ld    hl, bc
  jp    z, # .lifbrn-0
  ;; dup 2* 3 +
  ld    bc, hl
  add   hl, hl
  inc   hl
  inc   hl
  inc   hl
  ;; 2dup +
  push  bc
  push  hl
  add   hl, bc
:.loop1
  ;; dup size -
  ld    bc, hl
  ;;  lit size
  ld    de, # stc-size
  or    a
  sbc   hl, de
  ld    hl, bc
  ;; +0branch
  jp    p, # .loop1-break
  ;; dup flags +
  ld    bc, hl
  ld    de, # stc-flags
  add   hl, de
  ;; 0 swap c!
  ld    (hl), # 0
  ld    hl, bc
  ;; over +
  pop   de
  push  de
  add   hl, de
  jp    # .loop1
:.loop1-break
  ;; else| 2drop
  pop   hl
  pop   hl
  ;; swap 1+ swap
  pop   de
  inc   de
  push  de
:.lifbrn-0
  ;; 1+
  inc   hl
  ;; dup size -
  ld    bc, hl
  ld    de, # stc-size
  or    a
  sbc   hl, de
  ld    hl, bc
  ;; -branch
  jp    m, # .loop0
  ;; end of loop
  pop   hl
  call  # stc-to-rstack
  ret

and that's what it look like after more optimisation tricks:

Spoiler

Code: Select all

:do-prime-main  ;; ( -- count )
  call  # stc-to-dstack
  ;; lit stc-flags
  push  hl
  ld    hl, # stc-flags
  ;; lit stc-size
  push  hl
  ld    hl, # stc-size
  ;; lit -1
  push  hl
  ld    hl, # -1
  call  # stc-to-rstack
  call  # stc-fill
  call  # stc-to-dstack
  ;; lit 0
  push  hl
  ld    hl, # 0
  ;; lit 0
  push  hl
  ;; ( count index )
:.loop0
  ld    bc, hl
  ld    de, # stc-flags
  add   hl, de
  ld    a, (hl)
  ;; 0branch
  or    a
  ld    hl, bc
  jp    z, # .lifbrn-0
  ;; dup 2* 3 +
  ld    bc, hl
  add   hl, hl
  inc   hl
  inc   hl
  inc   hl
  ;; 2dup +
  push  bc
  push  hl
  add   hl, bc
:.loop1
  ;; dup size - +0branch
  ;; this is: dup size >= tbranch
  ld    a, h
  cp    # $20
  ;; loop exit condition should be JR if possible
  jr    nc, # .loop1-break
  ;; dup flags +
  ld    bc, hl
  ld    de, # stc-flags
  add   hl, de
  ;; 0 swap c!
  ld    (hl), # 0
  ld    hl, bc
  ;; over +
  pop   de
  push  de
  add   hl, de
  jp    # .loop1
:.loop1-break
  ;; else| 2drop
  pop   hl
  pop   hl
  ;; swap 1+ swap
  pop   de
  inc   de
  push  de
:.lifbrn-0
  ;; 1+
  inc   hl
  ;; dup size - -branch
  ;; this is: dup size < tbranch
  ld    a, h
  cp    a, # $20
  jp    c, # .loop0
  ;; end of loop
  pop   hl
  call  # stc-to-rstack
  ret

the compiler holds TOS in HL, all other registers are free to use. that's why it is able to use "BC" as temporary stack slot (slightly faster than push/pop).

_dw wrote: ↑Mon Mar 18, 2024 2:54 am PS: By the way, just yesterday I was correcting the code for NEGATE so that it returns a sign and does not convert the number to unsigned, because ALLOT tended to understand the number as 63536 and not -1000. Whether a maximum of 32767 bytes can be allocated from the FORTH standard. I didn't like that very much, so if ALLOT receives 63536 as a parameter, it allocates (and fails) and if -1000, it de-allocates 1000 bytes. Both are the same value, just interpreted differently. Probably a better solution would be to give a word that has a double parameter.

i usually have two words: "ALLOT" and "UNALLOT". "ALLOT" fails if allocation overflows, and "UNALLOT" should be used to "deallocate".

_dw wrote: ↑Mon Mar 18, 2024 2:54 am And PASMO cannot deal with negative numbers. (An ugly mistake in priorities)

that's why writing compiler in Forth itself is better! ;-) i am using my own Z80 assembler, written in Forth too.

ketmar · Post by **ketmar** » Mon Mar 18, 2024 4:23 am

as for "DOES>", the new standard did it wrong (yet again). in fig-FORTH there was a special word to create "doers": "<BUILDS … DOES>". i.e. "CREATE"d words cannot have "DOES>" part, only "builded".

Code: Select all

: a create … does> … ; — WRONG!
: b <builds … does> …; — RIGHT!

this way the compiler knows that you are going to create a "doer", and can insert any special code it needs to support proper "DOES>" semantics.

that's what all my compilers are doing: "<BUILDS" inserts a small prologue (basically, "push # dataaddr / jp # doer-addr"), and "DOES>" simply patches the "JP".

trying to allow "DOES>" for any "CREATE"d word simply makes everything more complex, and has no practical benefits at all.

and you are right: "DOES>" doesn't care what was written before it. it simply checks the "LATEST" (last defined words), and if it is a valid one (i.e. one created with "<BUILDS"), it patches "JP", and switches to the normal colon compilation mode. actually, it compiles special "(DOES>)" word, which… well, does this in runtime. ;-)

ketmar · Post by **ketmar** » Mon Mar 18, 2024 4:36 am

anyway, isn't it is the time to rewrite the compiler in Forth? ;-) it will be much, much faster, and it will be possible to mix target code and host Forth code (i.e. use host Forth to calculate some tables, for example).

you can use intermediate bytecode (as i did in my SSA compiler). i.e. compile everything to bytecode for VM. then you'll be able to easily trace it to find out which words are used, perform inlining (simply copy VM bytecode in place), and so on.

by the way. you can take a look at my Abersoft fig-FORTH mod. it is done with UrForth, as dual compiler (i.e. you can define both ZX words, and host words). it also has Z80 assembler (both prefix and postfix).

there is no intermediate compiler there, because i had to patch the existing system, not create a new one. but it's still easier than trying to write a patch in assembler, or something like that. ;-)

_dw · Post by **_dw** » Sat Mar 23, 2024 1:29 am

ketmar wrote: ↑Mon Mar 18, 2024 4:11 am i meant my SSA-based proof-of-concept compiler. it is not ready to be published yet, but basically it works like this:
1. compiled code is represented by a special VM bytecode (primitives are encoded as one-byte VM instructions, calls to other high-level words are 2 bytes). this bytecode can be executed directly (with a simple interpreter), but it's just a side-effect.
2. VM code is converted to the traditional SSA representation, replacing stack operations with special pseudoinstructions.
3. the usual SSA optimisations applied (constant folding, dead code elimination, and so on).
4. for each basic block (which doesn't contain branch or call) the machine code is generated. codegen tries to use all registers available, track register values, data size and so on.

i'm not happy with implementation details yet, though.

the generated code looks more or less like yours (slightly better, because codegen used BC as stack slot — 8ts vs 10/11 ts), but my codegen is not smart enough to realise that comparison with 8192 could be done with 8-bit instructions, and emited 16-bit "sbc".

after adding comparison trick, runtime dropped to 0.800.

that's what my code was like:
Spoiler
Code: Select all
:do-prime-main  ;; ( -- count )
  call  # stc-to-dstack
  ;; lit stc-flags
  push  hl
  ld    hl, # stc-flags
  ;; lit stc-size
  push  hl
  ld    hl, # stc-size
  ;; lit -1
  push  hl
  ld    hl, # -1
  call  # stc-to-rstack
  call  # stc-fill
  call  # stc-to-dstack
  ;; lit 0
  push  hl
  ld    hl, # 0
  ;; lit 0
  push  hl
  ;; ( count index )
:.loop0
  ld    bc, hl
  ld    de, # stc-flags
  add   hl, de
  ld    a, (hl)
  ;; 0branch
  or    a
  ld    hl, bc
  jp    z, # .lifbrn-0
  ;; dup 2* 3 +
  ld    bc, hl
  add   hl, hl
  inc   hl
  inc   hl
  inc   hl
  ;; 2dup +
  push  bc
  push  hl
  add   hl, bc
:.loop1
  ;; dup size -
  ld    bc, hl
  ;;  lit size
  ld    de, # stc-size
  or    a
  sbc   hl, de
  ld    hl, bc
  ;; +0branch
  jp    p, # .loop1-break
  ;; dup flags +
  ld    bc, hl
  ld    de, # stc-flags
  add   hl, de
  ;; 0 swap c!
  ld    (hl), # 0
  ld    hl, bc
  ;; over +
  pop   de
  push  de
  add   hl, de
  jp    # .loop1
:.loop1-break
  ;; else| 2drop
  pop   hl
  pop   hl
  ;; swap 1+ swap
  pop   de
  inc   de
  push  de
:.lifbrn-0
  ;; 1+
  inc   hl
  ;; dup size -
  ld    bc, hl
  ld    de, # stc-size
  or    a
  sbc   hl, de
  ld    hl, bc
  ;; -branch
  jp    m, # .loop0
  ;; end of loop
  pop   hl
  call  # stc-to-rstack
  ret
and that's what it look like after more optimisation tricks:
Spoiler
Code: Select all
:do-prime-main  ;; ( -- count )
  call  # stc-to-dstack
  ;; lit stc-flags
  push  hl
  ld    hl, # stc-flags
  ;; lit stc-size
  push  hl
  ld    hl, # stc-size
  ;; lit -1
  push  hl
  ld    hl, # -1
  call  # stc-to-rstack
  call  # stc-fill
  call  # stc-to-dstack
  ;; lit 0
  push  hl
  ld    hl, # 0
  ;; lit 0
  push  hl
  ;; ( count index )
:.loop0
  ld    bc, hl
  ld    de, # stc-flags
  add   hl, de
  ld    a, (hl)
  ;; 0branch
  or    a
  ld    hl, bc
  jp    z, # .lifbrn-0
  ;; dup 2* 3 +
  ld    bc, hl
  add   hl, hl
  inc   hl
  inc   hl
  inc   hl
  ;; 2dup +
  push  bc
  push  hl
  add   hl, bc
:.loop1
  ;; dup size - +0branch
  ;; this is: dup size >= tbranch
  ld    a, h
  cp    # $20
  ;; loop exit condition should be JR if possible
  jr    nc, # .loop1-break
  ;; dup flags +
  ld    bc, hl
  ld    de, # stc-flags
  add   hl, de
  ;; 0 swap c!
  ld    (hl), # 0
  ld    hl, bc
  ;; over +
  pop   de
  push  de
  add   hl, de
  jp    # .loop1
:.loop1-break
  ;; else| 2drop
  pop   hl
  pop   hl
  ;; swap 1+ swap
  pop   de
  inc   de
  push  de
:.lifbrn-0
  ;; 1+
  inc   hl
  ;; dup size - -branch
  ;; this is: dup size < tbranch
  ld    a, h
  cp    a, # $20
  jp    c, # .loop0
  ;; end of loop
  pop   hl
  call  # stc-to-rstack
  ret
the compiler holds TOS in HL, all other registers are free to use. that's why it is able to use "BC" as temporary stack slot (slightly faster than push/pop).

i usually have two words: "ALLOT" and "UNALLOT". "ALLOT" fails if allocation overflows, and "UNALLOT" should be used to "deallocate".

that's why writing compiler in Forth itself is better! i am using my own Z80 assembler, written in Forth too.

I've been trying for days (!!!) to understand that code. It seems that the transcription to the "original" Forth would look like this.

Spoiler

Code: Select all

8190 constant size 
create flags
size allot

: do-prime ( -- )
  flags size 1 fill
  0 0 
  begin                         \ prime_count i
    flags over + c@ if          \ flags[i] != 0 use_bc
      dup 2* 3 +                \ prime_count i 2*i+3  
      2dup +                    \ prime_count i 2*i+3 3*i+3
      begin
        dup size < while        \ use_bc
          dup flags +           \ use bc
          0 swap c!               
          over +                   
      repeat 
      2drop                     \ prime_count i
      swap 1+ swap              \ prime_count++ i
    then
    1+                         
  dup size >= until             \ use_bc
  drop . ." primes"
;

do-prime

You don't use do..loop loops, so I replaced them with the closest equivalent, namely begin..while..repeat and begin..until

Moving the ret from the physical (SP) stack that you probably use as a data stack is done manually via "call # stc-to-dstack"

The mystery is why the name of where it is stored looks like "data stack" (at the same time I understand it as a return address stack) and then for the word FILL you use "call # stc-to-rstack" which looks like "return address stack" but I have no idea what that's... the third stack? Maybe physically (SP) the auxiliary stack is used purely by Forth and the Data and Return address stack are emulated?

I have free BC, AF, AF', BC', DE' double register. With the fact that it is not guaranteed what is there because it is free to use and does not have to return to its original state.
I MOSTLY don't use BC as temporary storage for HL because it's not worth it.
You have it almost everywhere in five places. 4x it is easy to use when it comes back immediately, but once you store BC on the tray and then load it into HL. It's a mystery to me what kind of heuristics you have.

Note that, in addition to using the "ld BC,HL" pseudo-instruction, you have not defined the "cp HL,DE" or "cp HL,DE" pseudo-instructions.

ld C, L ; 1:4
ld B, H ; 1:4
or A
sbc HL, DE
ld L, C ; 1:4
ld H, B ; 1:4
are 4 bytes and 16 T-cycles

or A
sbc HL, DE
add HL, DE ; 1:11
is 1 byte and 11 T-cycles (add HL, DE not change flags)

The improved case has the problem that the session is different from the test file. It has a field size of 8190 bytes and not 8192, which is divisible by 256. In the end, it has no effect on the result because not a single prime number is hidden there. But this may not be true for other buffer sizes.

I tried rewriting the code manually in assembler. I also made some "illegal" edits because you proved yourself as a person to "transform" to a more optimal code. Take the computation of some variables out of the loop. Just change the design slightly, but still keep the original algorithm. The result is 0.39 seconds. And that's not even the end state for optimization. So there is room for improvement.

Aligning the field size to divisibility of 256 can be done by aligning it from the end. The beginning is also at the address divisible by 256, but the first bytes are not used. Just set via "fill".

It cannot be used to compare anything. Only for the limited time of the best possible result if the compiler was as smart as a human.

https://codeberg.org/DW0RKiN/M4_FORTH/s ... modify.asm

I only improved the translation of the original code by creating connections for "c@ if", because these were almost the only tokens that were not connected to anything. Acceleration of only "0.02" seconds.

The version without DO LOOP that you use is 0.9 seconds. I only fixed the "drop_over" (pop hl / push hl) and "add/sub/xor/and/or" connections. There it is more efficient to disconnect it and connect it with the following on "drop" and "over_add".

According to my asm version, I created a forth version that does not use any special M4 FORTH enhancements. Just written so that it could be more effective. And the result is 0.7 seconds... 30% speedup that I correctly chose what the variables will contain and that they will continuously increase in a loop (+1 and +2) instead of constantly recalculating.

Spoiler

Code: Select all

8190 constant size

create flags  

: do-prime ( -- )
  flags size 1+ 1 fill
  0 flags 3 ( sum_primes i+flags 2*i+3 )
  begin
    over C@ if
      2dup + ( sum_primes i+flags x=2*i+3 3*i+3+flags+n*x )
      begin
        dup flags size + 1+ u< 
        while
          0 over c!
          over +
      repeat
      drop
      rot 1+ -rot           ( sum_primes++ )
    then
    2 +                     ( sum_primes i+flags   2*[i+1]+3 )
    swap 1+ swap            ( sum_primes i+1+flags 2*[i+1]+3 )
    over flags size + <> while
  repeat
  drop drop
  . ." PRIMES" ;


do-prime

https://codeberg.org/DW0RKiN/M4_FORTH/s ... ve_edit.m4
https://codeberg.org/DW0RKiN/M4_FORTH/s ... e_edit.asm

My attempt to comment and understand your code.

Spoiler

Code: Select all

:do-prime-main  ;; ( -- count )
  call  # stc-to-dstack           ; save ret
  ;; lit stc-flags                ;
  push  hl                        ; flags 
  ld    hl, # stc-flags           ; flags 
  ;; lit stc-size                 ;
  push  hl                        ; size 1+ --> size
  ld    hl, # stc-size            ; size 1+ --> size
  ;; lit -1                       ;
  push  hl                        ; 1 --> -1
  ld    hl, # -1                  ; 1 --> -1 (true == -1)
  call  # stc-to-rstack           ;
  call  # stc-fill                ; fill
  call  # stc-to-dstack           ; --> save ???          flags size 1+ 1 fill
  ;; lit 0                        ;
  push  hl                        ; 0
  ld    hl, # 0                   ; 0
  ;; lit 0                        ;
  push  hl                        ; size 0 --> dup?       0 size 0 do
  ;; ( count index )              ; hl == i
:.loop0                           ;

    ld    bc, hl                  ; save hl  ---+
    ld    de, # stc-flags         ; flags       |
    add   hl, de                  ; +           |
    ld    a, (hl)                 ; c@          |
    ;; 0branch                    ;             |         flags I + C@ if
    or    a                       ; if          |
    ld    hl, bc                  ; load hl  ---+
    
    jp    z, # .lifbrn-0          ; if
    ;; dup 2* 3 +                 ;
    
    ld    bc, hl                  ; save hl
    add   hl, hl                  ; dup +
    inc   hl                      ; 3 +
    inc   hl                      ; 3 +
    inc   hl                      ; 3 +
    ;; 2dup +                     ;
    push  bc                      ; push original hl (i)
    push  hl                      ; push 2*i+3
    add   hl, bc                  ; 2*i+3+i               I dup + 3 + dup I +
:.loop1                           ;
      ;; dup size -               ;
      ld    bc, hl                ; save hl ----+   == i+x+n*x to bc, x = 2*i+3
      ;;  lit size                ;             |
      ld    de, # stc-size        ; size        |         --> size 1+
      or    a                     ;             |
      sbc   hl, de                ; <           |
      ld    hl, bc                ; load hl ----+
      
      ;; +0branch                 ;                       dup size 1+ < 
      jp    p, # .loop1-break     ; while
      ;; dup flags +              ;
      
      ld    bc, hl                ; save hl   ------+
      ld    de, # stc-flags       ; 0 over flags    |
      add   hl, de                ; +               |
      ;; 0 swap c!                ;                 |
      ld    (hl), # 0             ; c!              |
      ld    hl, bc                ; load hl   ------+
      
      ;; over +                   ;
      pop   de                    ; over  (2*i+3)
      push  de                    ; over
      add   hl, de                ; +                     0 over flags + C! over + 
      jp    # .loop1              ;                       repeat 
:.loop1-break                     ;
    ;; else| 2drop                ;
    pop   hl                      ; drop
    pop   hl                      ; drop == load hl (i)
    ;; swap 1+ swap               ;
    pop   de                      ; 
    inc   de                      ; sum of primes
    push  de                      ;                      drop drop 1+
:.lifbrn-0                        ;
    ;; 1+                         ;
    inc   hl                      ; hl = i++
    ;; dup size -                 ;
    
    ld    bc, hl                  ; save hl    --------+
    ld    de, # stc-size          ; ( 0 size do )      |
    or    a                       ;                    |
    sbc   hl, de                  ;                    |
    ld    hl, bc                  ; load hl    --------+
    
    ;; -branch                    ; loop                  loop
  jp    m, # .loop0               ;
  ;; end of loop                  ;
  pop   hl                        ;
  call  # stc-to-rstack           ; load ret
  ret                             ; ;

The original source is written in English, so I didn't quite understand it. There are too many versions. But I'm almost sure that 8192 probably doesn't have it. Even though I mention it there too...
http://www.forth.org/TM-10656.pdf

_dw · Post by **_dw** » Sat Mar 23, 2024 1:31 am

ketmar wrote: ↑Mon Mar 18, 2024 4:23 am as for "DOES>", the new standard did it wrong (yet again). in fig-FORTH there was a special word to create "doers": "<BUILDS … DOES>". i.e. "CREATE"d words cannot have "DOES>" part, only "builded".
Code: Select all
: a create … does> … ; — WRONG!
: b <builds … does> …; — RIGHT!
this way the compiler knows that you are going to create a "doer", and can insert any special code it needs to support proper "DOES>" semantics.

that's what all my compilers are doing: "<BUILDS" inserts a small prologue (basically, "push # dataaddr / jp # doer-addr"), and "DOES>" simply patches the "JP".

trying to allow "DOES>" for any "CREATE"d word simply makes everything more complex, and has no practical benefits at all.

and you are right: "DOES>" doesn't care what was written before it. it simply checks the "LATEST" (last defined words), and if it is a valid one (i.e. one created with "<BUILDS"), it patches "JP", and switches to the normal colon compilation mode. actually, it compiles special "(DOES>)" word, which… well, does this in runtime.

Thanks for the hint <BUILD. I also omitted this word and hoped that it was not essential for most of the program.

_dw · Post by **_dw** » Sat Mar 23, 2024 1:43 am

ketmar wrote: ↑Mon Mar 18, 2024 4:36 am anyway, isn't it is the time to rewrite the compiler in Forth? it will be much, much faster, and it will be possible to mix target code and host Forth code (i.e. use host Forth to calculate some tables, for example).

you can use intermediate bytecode (as i did in my SSA compiler). i.e. compile everything to bytecode for VM. then you'll be able to easily trace it to find out which words are used, perform inlining (simply copy VM bytecode in place), and so on.

by the way. you can take a look at my Abersoft fig-FORTH mod. it is done with UrForth, as dual compiler (i.e. you can define both ZX words, and host words). it also has Z80 assembler (both prefix and postfix).

there is no intermediate compiler there, because i had to patch the existing system, not create a new one. but it's still easier than trying to write a patch in assembler, or something like that.

My brain refuses to think! I can write assembler. They can easily understand what is wrong with the M4 FORTH and where it could be easily improved or repaired. But something new...argh! Understanding foreign code is already a problem. And I solved your code for days. I'm not even sure what I'm seeing there, whether it's a foreign language. How does it behave? Is it assembler? Or should I look at it as forth words emulating assembler? Every time after 10 minutes, my head went into the mode: You are not a programmer. It doesn't always matter anyway. What better way to watch an anime series online and kill time?
I think you overestimate me. Write a compiler in FORTH? My brain immediately says: No thanks! You can not do it.

They can only work in small steps. Always tiny steps. Just persevere. But this is not a step. That's a leap across the chasm.

ketmar · Post by **ketmar** » Sat Mar 23, 2024 8:44 am

_dw wrote: ↑Sat Mar 23, 2024 1:29 am I've been trying for days (!!!) to understand that code.

ah, it's mostly straightforward. i omited some helper routines, tho.

_dw wrote: ↑Sat Mar 23, 2024 1:29 am You don't use do..loop loops, so I replaced them with the closest equivalent, namely begin..while..repeat and begin..until

yeah, i am not a big fan of DO/FOR, and sometimes don't even bother to implement them.

_dw wrote: ↑Sat Mar 23, 2024 1:29 am Moving the ret from the physical (SP) stack that you probably use as a data stack is done manually via "call # stc-to-dstack"

this is the trick i did used in x86 UrForth: i am switching SP between data (parameters) stack and return stack. those calls simply save the old SP value, and load a new one. "to-dstack" means "SP is now in return stack mode, switch it to data stack mode". and "to-rstack" does the opposite.

so when i call "stc-to-dstack", i can use PUSH/POP to work with data (parameters) stack. and after "stc-to-rstack" i can use CALL/RET to call other words. this way both inlined and optimised primitive code can use PUSH/POP, and stream of high-level calls can be CALL/RET.

also note that HL holds TOS value.

_dw wrote: ↑Sat Mar 23, 2024 1:29 am I have free BC, AF, AF', BC', DE' double register.

i found that alternate register set is not of much use. if only Z80 had some way to exchange only one register pair between normal and alternate…

_dw wrote: ↑Sat Mar 23, 2024 1:29 am Note that, in addition to using the "ld BC,HL" pseudo-instruction, you have not defined the "cp HL,DE" or "cp HL,DE" pseudo-instructions.

ah, 16-bit move is just a handy macro left from my general-purpose Z80 assembler. i prefer to not define macros with side effects, though.

_dw wrote: ↑Sat Mar 23, 2024 1:29 am The improved case has the problem that the session is different from the test file. It has a field size of 8190 bytes and not 8192, which is divisible by 256.

yeah, i cheated with 8192, and forgot to mention it. i changed the constant in the Forth source to allow codegen to use 8-bit comparisons. ;-)

_dw wrote: ↑Sat Mar 23, 2024 1:29 am I tried rewriting the code manually in assembler. I also made some "illegal" edits because you proved yourself as a person to "transform" to a more optimal code.

not me. ;-) this is Succubus/Z80, she did all the transformations. yet she is limited by "basic blocks", and have to spill all registers back to parameter stack at the end of basic block. "basic block" ends at branch or call to another word.

so what you see is a direct ouput of cross-compiler, i didn't manually tuned it. Succubus is able to see some patterns, and even to move instructions around sometimes.

_dw wrote: ↑Sat Mar 23, 2024 1:29 am My attempt to comment and understand your code.

yeah, "LD BC, HL" and the opposite is what Succubus used instead of "PUSH/POP", because she has a free register there. she also moved some PUSH down the lane (around "2dup +").

Succubus works on SSA representation with infinite register supply, and she tries to optimise SSA code, not Forth code. there is no more Forth code at this stage, so the end result may not look like exact translation. and she may move some instructions around to better utilise CPU registers later. she also knows when some register holds 8-bit value, and tries to use this knowledge in codegen.

ketmar · Post by **ketmar** » Sat Mar 23, 2024 9:18 am

_dw wrote: ↑Sat Mar 23, 2024 1:43 am And I solved your code for days.

this is because you had tried to decipher the output of highly optimising compiler. take GCC -O3 output, for example, and try to "decompile" it back to C code. it will be very hard to do, because the compiler transformed the original code into something barely recognizable. Succubus did the same (more or less).

_dw wrote: ↑Sat Mar 23, 2024 1:43 am Every time after 10 minutes, my head went into the mode: You are not a programmer. It doesn't always matter anyway.

this is usually how i feel when i starting some new project. ;-) no, really. but i'm telling myself: "keep calm and carry on. things will sort themselves out eventually."

_dw wrote: ↑Sat Mar 23, 2024 1:43 am What better way to watch an anime series online and kill time?

you can do both! i love watching anime too! ;-)

_dw wrote: ↑Sat Mar 23, 2024 1:43 am I think you overestimate me. Write a compiler in FORTH? My brain immediately says: No thanks! You can not do it.

They can only work in small steps. Always tiny steps. Just persevere. But this is not a step. That's a leap across the chasm.

we talked enough for me to see that you can easily do it. ;-) the problem is that you never did it before, so it's hard to start: the task seems to be huge and complex.

but basic Forth compiler is very easy:

Code: Select all

: INTERPRET
  begin
    parse-name ( addr count ) — we read a space-delimited word from the input stream
    dup 0= if 2drop exit endif — assume that "parse-name" returns empty string when the input stream ends
    2dup find ( cfa TRUE // FALSE ) — dictionary search; and we need "addr count" for numeric parsing below
    if — this is a valid word, execute or compile it
      >r 2drop r> — drop "addr count", we don't need them anymore
      dup immediate? 0= state @ 0<> and if ( compiling) compile, ( compile call to CFA) else execute endif
    else
      convert-number  ( number TRUE // FALSE ) — this tries to perform numeric conversion
      not?error" wut?" — cannot convert a number
      state @ if compile-literal endif
    endif
  again ;

this is the whole Forth interpret/compile loop. ;-)

the real magic happens in "COMPILE," — it can be as simple as compiling a CALL to the given word, or it can do arbitrary code manipulations. you can try to write a "classical" threaded code compiler first, without any optimisations, just to get the idea of how it works. and then improve on it later.

just don't try to write an optimising compiler at the first try. start with the classical Indirect Threaded Code system, and make it work. then convert it to Direct Threaded Code. then to Subroutine Threaded Code. those are small steps, and in the end you'll have a compiler which generates native machine code. and it is still easy to understand. at this stage, you can start adding machine code optimisations.

the joy of Forth is that it is simple. not only the language itself, but the compiler too.

i am working on Abersoft fig-Forth recompilation project now (just4fun). i'll prolly publish it later, and it may help you to start. this is the classical ITC system, written mostly in Forth, with very low number of asm primitives. the usual source for fig-Forth is in assembler, but i am converting it to "asm + Forth" instead. it may show you how to do a Forth cross-compiler in Forth itself. of course, the cross-compiler is written in UrForth, but that's not really a big deal, UrForth is still Forth. ;-)

it may be hard to start writing "real" code in Forth, but once you start doing it, it will eventually "click", and it will become as easy as doing it in any other language you are used to. just remember The Rule: write short words. very short words. this way you won't have to perform stack acrobatics. and don't afraid of using global variables if necessary.

_dw · Post by **_dw** » Sun Mar 24, 2024 11:32 am

I didn't think of calling "stc-to-dstack" and calling "stc-to-rstack" to switch SP contents. When I initially tried different variants of how to use SP, I was told that changing SP is not worth it. Because mostly the data is moved from the "data stack" to the "return address stack" or vice versa. Only from a few cells it could be more interesting. Or some nested function calls in succession.

ketmar · Post by **ketmar** » Sun Mar 24, 2024 1:13 pm

_dw wrote: ↑Sun Mar 24, 2024 11:32 am I didn't think of calling "stc-to-dstack" and calling "stc-to-rstack" to switch SP contents. When I initially tried different variants of how to use SP, I was told that changing SP is not worth it.

it doesn't worth it with the traditional STC code. but Succubus does very aggressive inlining, so most of the code is either stream of CALLs to high-level words, or stream of inlined primitives. in this case switching the stacks gives a significant speedup.

if only Z80 had an alternative SP…

_dw · Post by **_dw** » Sun Mar 24, 2024 9:44 pm

ketmar wrote: ↑Sun Mar 24, 2024 1:13 pmif only Z80 had an alternative SP…

Then the Z80 would not only be a processor usable for humans, but also for compilers as we know them today.

A single instruction " ex SP, SP' " for one byte and 4 clocks would be enough and everything would change. The problem is not only physical (a more complex processor), but that the operating codes are already occupied. So one would have to look for an instruction that will be affected by moving beyond the prefix byte. For example, " ld SP, nn ", which is surprisingly in the "basic" code table.

It would also like to remove one of the IX or IY registers and replace it with the SP register. Then it would be possible to read or change not only the last item from the stack. That would make the authors of the C compiler happy.

Well, it's pointless to solve it.

The design of the processor instruction set is interesting. If you manage to reduce the size of the instruction to a multiple of 4 bits, you will have some basic instructions shortened, but the total length of the program will eventually increase... But the processor could be simpler. Interesting to look for the optimal solution.

ketmar · Post by **ketmar** » Sun Mar 24, 2024 10:37 pm

several Forth CPUs had various interesting… solutions, including minimalistic instruction sets, superscalar CPUs, VLIW CPUs… and most of them are even more minimalistic than minimalistic RISCs! ;-)

Spectrum Computing

Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX

Re: Forth compiler for ZX