F8 Forth cross-compiler; featuring Succubus, native Z80 code writer

ketmar · Post by **ketmar** » Mon Apr 29, 2024 9:15 am

THIS IS AN ANNOUNCEMENT THREAD!
the system exists, and it is working, but not public yet. i simply couldn't keep it to myself anymore. ;-)
so, in the good old tradition of making posts about projects i'll never finish, here is yet another one!

F8 Forth cross-compiler, featuring Succubus, native Z80 code writer

F8 is a FAST cross-compiler for the Forth language. instead of using an interpreter, as most Forth systems do, F8 generates native machine code, and inlines a lot of primitives along the way. my usual Eratosthenes benchmark is ~8 times faster than with Abersoft Forth Recompiled (which is DTC, and already faster than the original Abersoft Forth).

to give you some numbers: Abersoft Recompiled takes about 9 seconds (slightly less) to finish the benchmark, and F8 is able to do it in ~800 msecs.

of course, for real programs the results will not be as impressive, but i still expect x2-x5 speedup for most Forth code.

yet everything has its price: the code generated by F8 is bigger than the usual Forth threaded code. and you won't have an interactive interpreter to play with. but F8 is able to trace your program, and include only the words it uses. i.e. you tell F8 what the "main" word is, and F8 includes only the code which is really used, throwing away everything else. it also throws away all word headers. so most of the time, F8 will be able to produce both smaller and faster programs.
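
The tracing described above is essentially a reachability walk over the word call graph. A minimal Python sketch of that idea follows. F8 is not public, so the word table layout, the function name `reachable_words` and the sample words are all hypothetical and only illustrate the principle.

```python
# Hypothetical word table: each word maps to the words it calls.
# The layout and the sample names are assumptions made for illustration.
def reachable_words(calls, main):
    """Return the set of words reachable from `main` (iterative depth-first walk)."""
    seen = set()
    pending = [main]
    while pending:
        word = pending.pop()
        if word in seen:
            continue
        seen.add(word)
        # Primitives have no entry in the table, so they contribute no callees.
        pending.extend(calls.get(word, ()))
    return seen


calls = {
    "main": ["draw-sprite", "wait-key"],
    "draw-sprite": ["scr-addr", "c!"],
    "wait-key": ["inkey"],
    "sprite-editor": ["draw-sprite", "load-file"],  # never called from main
    "load-file": ["open", "read"],
}

used = reachable_words(calls, "main")
print(sorted(used))
```

Only the words in `used` would be emitted; `sprite-editor` and `load-file` are dropped because nothing reachable from `main` calls them.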

i wrote F8 because i am planning to use it for writing my own software, so it's not Yet Another Toy Project. also, F8 comes with a freebie: a native x86 Forth compiler for GNU/Linux (because this is the language F8 is written in).

(ok, i must confess that it was started as R&D project for Succubus/Z80, and it is still highly experimental. but it works, and works even better than i expected.)

both compilers feature Succubus, an optimising native code writer. for x86, Succubus is able to generate machine code comparable with SPF4, and she is prolly even able to beat some commercial Forth compilers. most of the time, x86 code generated by Succubus is ~2 times slower than "gcc -O2". and she is able to do that in milliseconds! rebuilding the whole x86 system takes ~60 msecs, and this is more than 500 KB of Forth code!

when Succubus is writing code for Z80, she is lightning fast too. even for huge Forth programs you will prolly not be able to notice the compiling itself. your Speccy emulator might take longer to start than Succubus takes to finish her job! ;-)

so you'll be able to instantly test your Forth code, almost as if you had an interactive interpreter!

F8 can generate standard .tap files, +3DOS .dsk images, and .sna snapshots. it doesn't support 128K, though. that is, the code will work, but there are no special tools to use the additional 128K memory. but you have a built-in Z80 assembler available, so you'll be able to write your own memory management system.

see also: M4Forth

you may also take a look at M4Forth.

this is another Forth cross-compiler, written in M4 macro language. it has similar goals, but completely different implementation. it's great, and you will not want to miss it if you are interested in Forth! besides, this is the only Forth compiler written in M4 (at least i haven't seen others ;-).

F8 vs M4Forth

while both compilers generate native Z80 machine code, there are two major differences.

first, F8 is doing aggressive inlining, which is quite important if you are writing real Forth code (with many small words) instead of "C with awkward syntax and manual stack manipulations". of course, M4Forth can be improved to perform inlining too, so... this one is not that major. ;-)

second, F8 is FAST. M4Forth compiling speed is limited by M4 implementation, and it could take several seconds to compile your code. F8 is able to do everything in less than half of a second; i don't think that you will be able to make it spend more than 300-600 msecs even on most complex source code.

yet, the code generated by M4Forth is usually better and faster. if it gets an inliner (or you write the inliner yourself), M4Forth may be better than F8 if you need to squeeze more tstates from our beloved Z80.

also, M4Forth is mostly compatible with Forth standards, and F8 is totally non-standard. it doesn't even have your usual "IF/BEGIN" words! (don't worry, conditions and loops are there, they are just slightly non-conventional.)

code generation: sample
source:

$9000 zx-constant flags
8192 zx-constant size

zx: do-prime  ( -- count )
  flags size -1 fill
  0 0 << ( count index )
    dup flags + c@
    ?< dup 2* 3 + 2dup +
      << dup size - -?^| dup flags + 0 swap c! over + |? else| 2drop >>
      swap 1+ swap
    >?
  1+ dup size - -?^||
  else| drop >> ;
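
For readers not used to F8's control words, here is a Python rendering of `do-prime`. It rests on one reading of the non-standard syntax (an assumption, since F8 is not documented publicly): `?< ... >?` is taken as "if", `<< ... >>` as a loop, and `-?^|` as "keep looping while the value is negative". The Python names are illustrative.

```python
FLAGS_SIZE = 8192  # same value as `size` in the F8 source


def do_prime(size=FLAGS_SIZE):
    """Classic byte sieve over odd numbers: flag i stands for the number 2*i + 3."""
    flags = [True] * size              # flags size -1 fill
    count = 0
    for index in range(size):          # outer << ... >> loop, ends when index - size >= 0
        if flags[index]:               # dup flags + c@  ?<
            prime = 2 * index + 3      # dup 2* 3 +
            k = index + prime          # 2dup +
            while k < size:            # dup size - -?^|
                flags[k] = False       # dup flags + 0 swap c!
                k += prime             # over +
            count += 1                 # swap 1+ swap
    return count


print(do_prime())  # 1899
```

The result counts the odd primes up to 2*8191 + 3 = 16385.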

generated machine code:

8016: E3           ex    (sp), hl
8017: 22 62 80     ld    $8062 (), hl
801A: 21 00 90     ld    hl, # $9000
801D: 11 01 90     ld    de, # $9001
8020: 01 FF 1F     ld    bc, # $1FFF
8023: 71           ld    (hl), c
8024: ED B0        ldir
8026: C5           push  bc
8027: 69 60        ld    hl, bc
8029: 5D 54        ld    de, hl
802B: 3E 90        ld    a, # $90
802D: 84           add   a, h
802E: 67           ld    h, a
802F: 7E           ld    a, (hl)
8030: B7           or    a
8031: EB           ex    de, hl
8032: CA 59 80     jp    z, # $8059
8035: 5D 54        ld    de, hl
8037: 29           add   hl, hl
8038: 23           inc   hl
8039: 23           inc   hl
803A: 23           inc   hl
803B: 4D 44        ld    bc, hl
803D: 19           add   hl, de
803E: D5           push  de
803F: C5           push  bc
8040: 7C           ld    a, h
8041: D6 20        sub   # $20
8043: F2 54 80     jp    p, # $8054
8046: 5D 54        ld    de, hl
8048: 3E 90        ld    a, # $90
804A: 84           add   a, h
804B: 67           ld    h, a
804C: 36 00        ld    (hl), # $00
804E: E1           pop   hl
804F: E5           push  hl
8050: 19           add   hl, de
8051: C3 40 80     jp    # $8040
8054: E1           pop   hl
8055: E1           pop   hl
8056: D1           pop   de
8057: 13           inc   de
8058: D5           push  de
8059: 23           inc   hl
805A: 7C           ld    a, h
805B: D6 20        sub   # $20
805D: FA 29 80     jp    m, # $8029
8060: E1           pop   hl
8061: C3 00 00     jp    # $0000

as you may see, the generated code looks pretty decent. and much, much better than what z88dk is able to produce. ;-)
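
Two details of the listing are worth spelling out: because `flags` ($9000) and `size` ($2000) both have a zero low byte, the code adds the base address and tests the loop bound by touching only the high byte (addresses 802B-802E and 805A-805D). The Python sketch below checks that these shortcuts agree with full 16-bit arithmetic. The function names are made up, and the sign test is only valid while the high byte stays below $A0, which holds for the index values this loop can produce.

```python
SIZE = 0x2000    # 8192, low byte is zero
FLAGS = 0x9000   # buffer address, low byte is zero


def below_size_hi_byte(hl):
    """Mirror `ld a, h / sub # $20 / jp m`: look only at the sign of (H - $20)."""
    a = ((hl >> 8) - (SIZE >> 8)) & 0xFF
    return bool(a & 0x80)          # sign flag set means "minus"


def add_flags_hi_byte(hl):
    """Mirror `ld a, # $90 / add a, h / ld h, a`: add $9000 by changing H only."""
    h = ((hl >> 8) + (FLAGS >> 8)) & 0xFF
    return (h << 8) | (hl & 0xFF)


# The inner loop index never exceeds size + largest prime = 8192 + 16385.
for x in range(8192 + 16385 + 1):
    assert below_size_hi_byte(x) == (x < SIZE)
for x in range(0x10000):
    assert add_flags_hi_byte(x) == ((x + FLAGS) & 0xFFFF)
print("high-byte shortcuts match 16-bit arithmetic")
```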

_dw · Post by **_dw** » Tue Apr 30, 2024 1:14 am

In fact, in the worst case scenario, compiling with M4 Forth can take hours.
Compile speed was never a priority, only the final product.
In certain cases, where the best solution has to be found by searching the state space, M4 Forth simply starts doing it. It does not care that it runs on M4, which is a macro processor and not the best tool for this.

In general, there are two cases. The first is too many loops: every loop is analyzed, every step of the loop is traversed, and statistics are collected. So at the end it knows at which value the index ends and how many times the hi or lo part can give a "false positive".

The second case is a run of consecutive data. If it has to be initialized at every start, M4 will analyze it and try to find the fastest way to initialize it.

Suppose we have something like the array 5,3,7,1,4,0,2,6.

M4 runs a macro to check whether it is something like 5,6,7,8,9,10,11, or another pattern that can be initialized in a loop. Otherwise it runs the data sorting macro and then initializes the values starting from the smallest, taking advantage of duplicated values and of the fact that stepping to the next value is cheap. The addresses in the array are inlined, so the stores jump around the array. (I'm not sure whether Google Translate got this right: the elements are initialized in an order that looks random in terms of position.)

dworkin@dw-A15:~/Programovani/ZX/Forth/Testing$ ../check_word.sh 'PUSH(5) COMMA PUSH(3) COMMA PUSH(7) COMMA PUSH(1) COMMA PUSH(4) COMMA PUSH(0) COMMA PUSH(2) COMMA PUSH(6) COMMA'
    push HL             ; 1:11      5 , 3 , ... 6 ,   default version
    ld   HL, 0          ; 3:10      0 ,
    ld  [10+VARIABLE_SECTION],HL; 3:16      0 ,
    inc   L             ; 1:4       1 ,
    ld  [6+VARIABLE_SECTION],HL; 3:16      1 ,
    inc   L             ; 1:4       2 ,
    ld  [12+VARIABLE_SECTION],HL; 3:16      2 ,
    inc   L             ; 1:4       3 ,
    ld  [2+VARIABLE_SECTION],HL; 3:16      3 ,
    inc   L             ; 1:4       4 ,
    ld  [8+VARIABLE_SECTION],HL; 3:16      4 ,
    inc   L             ; 1:4       5 ,
    ld  [VARIABLE_SECTION],HL; 3:16      5 ,
    inc   L             ; 1:4       6 ,
    ld  [14+VARIABLE_SECTION],HL; 3:16      6 ,
    inc   L             ; 1:4       7 ,
    ld  [4+VARIABLE_SECTION],HL; 3:16      7 ,
    pop  HL             ; 1:10      5 , 3 , ... 6 ,
                        ;[36:187]   5 , 3 , ... 6 ,

VARIABLE_SECTION:

    dw 0x0005           ;           5 ,   = 5
    dw 0x0003           ;           3 ,   = 3
    dw 0x0007           ;           7 ,   = 7
    dw 0x0001           ;           1 ,   = 1
    dw 0x0004           ;           4 ,   = 4
    dw 0x0000           ;           0 ,   = 0
    dw 0x0002           ;           2 ,   = 2
    dw 0x0006           ;           6 ,   = 6

;# ============================================================================
  if ($<0x0100)
    .error Overflow 64k! over 0..255 bytes
  endif
  if ($<0x0200)
    .error Overflow 64k! over 256..511 bytes
  endif
  if ($<0x0400)
    .error Overflow 64k! over 512..1023 bytes
  endif
  if ($<0x0800)
    .error Overflow 64k! over 1024..2047 bytes
  endif
  if ($<0x1000)
    .error Overflow 64k! over 2048..4095 bytes
  endif
  if ($<0x2000)
    .error Overflow 64k! over 4096..8191 bytes
  endif
  if ($<0x3000)
    .error Overflow 64k! over 8192..12287 bytes
  endif
  if ($<0x4000)
    .error Overflow 64k! over 12288..16383 bytes
  endif
  if ($>0xFF00)
    .warning Data ends at 0xFF00+ address!
  endif
; seconds: 6           ;[36:187]
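
The store order in the listing above can be reproduced with a small sort-by-value pass. The Python sketch below is not M4 Forth's macro code; it is a simplified reconstruction with made-up names (`init_order`, `emit`) that only handles the "next value is previous plus one" shortcut and leaves out the surrounding `push HL` / `pop HL`.

```python
def init_order(values, cell_size=2):
    """Pair every value with its byte offset, then sort by value."""
    return sorted((value, index * cell_size) for index, value in enumerate(values))


def emit(values):
    """Emit pseudo-assembly that stores the values in ascending value order."""
    lines = []
    prev = None
    for value, offset in init_order(values):
        if prev is None:
            lines.append(f"ld HL, {value}")
        elif value == prev + 1 and (prev & 0xFF) != 0xFF:
            lines.append("inc L")            # 4 t-states instead of a 10 t-state reload
        elif value != prev:
            lines.append(f"ld HL, {value}")
        # value == prev: HL already holds it, nothing to emit
        target = f"{offset}+VARIABLE_SECTION" if offset else "VARIABLE_SECTION"
        lines.append(f"ld [{target}],HL")
        prev = value
    return lines


for line in emit([5, 3, 7, 1, 4, 0, 2, 6]):
    print(line)
```

For the sample array this produces one `ld HL, 0`, seven `inc L` and eight stores at offsets 10, 6, 12, 2, 8, 0, 14, 4, the same order as in the listing.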

The same applies if we try to cram a lot of data onto the stack. The register contents are optimized by searching the state space. I've limited this search, so it won't always find the best solution, because that is really expensive to compute.

dworkin@dw-A15:~/Programovani/ZX/Forth/Testing$ ../check_word.sh 'PUSH(0x3322) PUSH(0x1111) PUSH(0x2222) PUSH(0x3333) PUSH(0x6644)'
    push DE             ; 1:11      0x3322 0x1111 0x2222 0x3333 0x6644
    push HL             ; 1:11      0x3322 0x1111 0x2222 0x3333 0x6644
    ld   DE, 0x3322     ; 3:10      0x3322 0x1111 0x2222 0x3333 0x6644
    push DE             ; 1:11      0x3322 0x1111 0x2222 0x3333 0x6644   = 13090
    ld   HL, 0x1111     ; 3:10      0x3322 0x1111 0x2222 0x3333 0x6644
    push HL             ; 1:11      0x3322 0x1111 0x2222 0x3333 0x6644   = 4369
    add  HL, HL         ; 1:11      0x3322 0x1111 0x2222 0x3333 0x6644   8738 = 2*4369
    push HL             ; 1:11      0x3322 0x1111 0x2222 0x3333 0x6644   = 8738
    ld    E, D          ; 1:4       0x3322 0x1111 0x2222 0x3333 0x6644   E = D = 0x33
    ld   HL, 0x6644     ; 3:10      0x3322 0x1111 0x2222 0x3333 0x6644
; seconds: 13          ;[16:100]

13 seconds for such a small input. And because the search grows exponentially, it only ever searches a limited range plus the last values. It finds the best one and moves the search window.

The same data as in the first example, only on the stack instead of in an array.

dworkin@dw-A15:~/Programovani/ZX/Forth/Testing$ ../check_word.sh 'PUSH(5) PUSH(3) PUSH(7) PUSH(1) PUSH(4) PUSH(0) PUSH(2) PUSH(6)'
    push DE             ; 1:11      5 3 7 1 4 0 2 6
    push HL             ; 1:11      5 3 7 1 4 0 2 6
    ld   DE, 0x0003     ; 3:10      5 3 7 1 4 0 2 6
    ld   HL, 0x0005     ; 3:10      5 3 7 1 4 0 2 6
    push HL             ; 1:11      5 3 7 1 4 0 2 6   = 5
    push DE             ; 1:11      5 3 7 1 4 0 2 6   = 3
    ld    L, 0x07       ; 2:7       5 3 7 1 4 0 2 6
    push HL             ; 1:11      5 3 7 1 4 0 2 6   = 7
    ld    L, 0x01       ; 2:7       5 3 7 1 4 0 2 6
    push HL             ; 1:11      5 3 7 1 4 0 2 6   = 1
    ld    L, 0x04       ; 2:7       5 3 7 1 4 0 2 6
    push HL             ; 1:11      5 3 7 1 4 0 2 6   = 4
    ld    L, H          ; 1:4       5 3 7 1 4 0 2 6   L = H = 0x00
    push HL             ; 1:11      5 3 7 1 4 0 2 6   = 0
    dec   E             ; 1:4       5 3 7 1 4 0 2 6
    ld    L, 0x06       ; 2:7       5 3 7 1 4 0 2 6
; seconds: 30          ;[24:144]
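
The individual steps in this listing follow a simple cost comparison. The Python sketch below shows only that per-step choice for HL, using the t-state costs printed in the listings (`ld HL,nn` 10, `ld L,n` 7, `ld L,H` 4, `inc L` 4). It is a greedy illustration with a made-up function name, not the windowed state-space search that M4 Forth actually performs.

```python
def cheapest_hl_load(current, target):
    """Pick a cheap way to change HL from `current` to `target`.

    Returns (instruction, t_states). Greedy and HL-only: an illustration of the
    cost model, not the real search.
    """
    if current == target:
        return ("", 0)                      # HL already holds the value
    cur_h, cur_l = current >> 8, current & 0xFF
    tgt_h, tgt_l = target >> 8, target & 0xFF
    if cur_h == tgt_h:                      # only the low byte differs
        if tgt_l == cur_h:
            return ("ld L, H", 4)
        if tgt_l == ((cur_l + 1) & 0xFF):
            return ("inc L", 4)
        return (f"ld L, 0x{tgt_l:02X}", 7)
    return (f"ld HL, 0x{target:04X}", 10)


# HL takes these values in turn in the listing above: 5, 7, 1, 4, 0, 6.
sequence = [5, 7, 1, 4, 0, 6]
steps = [cheapest_hl_load(a, b) for a, b in zip(sequence, sequence[1:])]
for instruction, cost in steps:
    print(f"{instruction:<12} ; {cost} t-states")
```

The output matches the HL instructions in the listing, including `ld L, H` for the step from 4 to 0, where H is already zero.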

It's all experimental and I haven't seen anything like it anywhere. In a C compiler you do not have access to the stack, and an array is initialized simply by copying a constant array to another location.

ketmar · Post by **ketmar** » Tue Apr 30, 2024 9:08 am

yeah, that's why i said that M4Forth does better code. because it does. ;-)

the huge problem is that it is impossible to program in Forth without either an interactive REPL or very fast turnaround times. we don't need a debugger because we are debugging our code as we're writing it. ;-) you're writing one/two/three small Forth words, and immediately test them. then you're writing another one or two, and test. and on, and on, and on. this way you're always building on a solid and tested foundation. also, testing small words is much easier than testing complex programs.

this was The Revolution in 1970. the revolution nobody noticed, and people rediscovered the joy of REPL decades later. and with much worse languages. (i'm not counting Symbolics and other LISP machines, because they existed in a separate realm, almost literally.)

i was able to write the usable sprite editor with Abersoft Forth in several hours only due to "instant" compile times. i invited ;-) Succubus to UrForth/Beast x86 because the system was too slow rebuilding herself (~200/300 msecs, and counting).

so the idea of F8 is to generate "good enough" code (which will inevitably beat z88dk in almost any case, this is low-hanging fruit ;-), and do it FAST. otherwise i won't be able to write anything usable with it. because it is literally "write the word, run ZXEmuT to test it, rinse and repeat". so my decade old i3 core2duo should be able to generate code faster than i can release Alt+F9 and press Ctrl+F9 to run the emulator. ;-)

yet Succubus actually does limited type inference, value range propagation and lifetime analysis, and i'm planning to improve this in future F8 versions. she almost never rewrites your code behind the scenes, though (except the usual constant folding).

and i am so keen on inlining because this is the only way to get any good machine code from Forth. Forth words are so small that there's usually nothing to optimise, and codegens like M4Forth or Succubus shine when they have a good chunk of code to work with. i am using Forth because Forth words are small, easy to write, and easy to debug. but at the same time i need Forth words to be huge to get more speed. the only way to solve this is aggressive inlining. i mean, when i have some data structure, i often have a set of tiny words for each field:

: field^  ( base^ -- field^ )  4 + ;  ;; get field address from the base address
: field@  ( base^ -- value )  field^ @ ;
: field!  ( value base^ -- )  field^ ! ;

and so on. yeah, the words are trivial (and automatically generated by struct builder), but they make the code more readable and maintainable. using direct offsets each time is ugly, and writing "field-ofs + @" is boring.

but there is nothing to optimise in such small words: they are basically meant to be inlined. ;-) especially considering that inlining in Forth is much easier than in "common" languages (it is literally zero cost ;-). inlining also allows me to factor my code, and not worry about slowdowns due to worse optimisations and more calls.

that's why inlining is vital for any good Forth → native code compiler.
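
To make the point concrete, here is a minimal Python sketch of what inlining does to the field accessors above: each call to a user-defined word is replaced by its body until only primitives remain. The dictionary layout and the `inline` function are assumptions for illustration and say nothing about how Succubus stores words internally.

```python
def inline(word, definitions, depth=0):
    """Expand `word` into primitives by substituting definitions recursively."""
    if word not in definitions:
        return [word]                  # primitive or literal: keep as is
    if depth > 32:
        raise RecursionError(f"recursive word cannot be fully inlined: {word}")
    expanded = []
    for token in definitions[word]:
        expanded.extend(inline(token, definitions, depth + 1))
    return expanded


definitions = {
    "field^": ["4", "+"],
    "field@": ["field^", "@"],
    "field!": ["field^", "!"],
}

print(inline("field@", definitions))  # ['4', '+', '@']
print(inline("field!", definitions))  # ['4', '+', '!']
```

After expansion the code generator sees `4 + @` as one chunk, which gives it something to optimise, for example by turning it into a single indexed load.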

and that's why M4Forth needs a supplemental compiler with instant turnarounds. sadly, F8 cannot be used as such, because it is not compatible with any existing standard (it is basically UrForth/Beast, ported to Z80).

p.s.: btw, Succubus tracks register values, so she can reuse registers (she did it in the example code: used BC instead of directly loading zeroes to HL), and she is able to generate value loads using register->register copy, if it is possible. also, she replaced "ld l, (hl) / ld h, # 0 / ld a, h / or a" with "ld a, (hl) / or a", because she knows that the loaded value is 8-bit, and it is not used after comparison; so she could load it directly to the accumulator. she will also try to perform similar optimisations for other comparisons and math, optimising loads, and using 8-bit math if the value is proven to be 8-bit.
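
A minimal sketch of the register tracking idea, in Python: remember which constants the registers are known to hold, and emit a register copy instead of an immediate load when some other register already has the value. The class and method names are hypothetical, and the emitted text only imitates the assembler syntax used in the listing earlier in the thread.

```python
class RegTracker:
    """Track known 16-bit constants in registers and reuse them for loads."""

    def __init__(self):
        self.known = {}   # register name -> known 16-bit constant
        self.code = []    # emitted instructions, as text

    def load_const(self, reg, value):
        value &= 0xFFFF
        if self.known.get(reg) == value:
            return                                   # value is already there
        for other, held in self.known.items():
            if other != reg and held == value:
                self.code.append(f"ld {reg}, {other}")        # register copy
                break
        else:
            self.code.append(f"ld {reg}, # ${value:04X}")    # immediate load
        self.known[reg] = value

    def clobber(self, reg):
        """Forget a register whose contents are no longer known."""
        self.known.pop(reg, None)


tracker = RegTracker()
tracker.known["bc"] = 0          # after LDIR, BC is known to be zero
tracker.load_const("hl", 0)      # reuses BC
tracker.load_const("de", 0x9001)
print(tracker.code)              # ['ld hl, bc', 'ld de, # $9001']
```

This corresponds to the `ld hl, bc` at address 8027 in the sample listing, where the zero left in BC by `ldir` is reused for the initial index.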

i will also add backward range propagation later, so code like "c@ + $FF and" will use only 8-bit math, without generating an intermediate 16-bit value.
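
The reason this is safe is that the low byte of a sum depends only on the low bytes of the operands. A quick Python check of that identity (function names are illustrative):

```python
def add_then_mask_16(a, b):
    """16-bit add followed by `$FF and`, as a naive code generator would do it."""
    return ((a + b) & 0xFFFF) & 0xFF


def add_8(a, b):
    """The same result using 8-bit math only."""
    return ((a & 0xFF) + (b & 0xFF)) & 0xFF


# `a` is any 16-bit value on the stack, `b` is the byte fetched by c@.
assert all(
    add_then_mask_16(a, b) == add_8(a, b)
    for a in range(0, 0x10000, 251)
    for b in range(256)
)
print("low byte of the sum needs only 8-bit math")
```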

ketmar · Post by **ketmar** » Sun May 12, 2024 6:51 pm

F8 is approaching beta. (hehe. i hope.) and it is a major rewrite too. i realised that what i actually have is a huge pattern matcher, disguised as Forth code. but hey, Forth is ideally suited for DSLs! so now codegen is driven by a rule database, like this (by the way: it is still Forth code, not something that should be preprocessed externally):

RULE - ( a b -- c )

VSTACK: HL HL
GEN:
  invite Succubus
  vdrop vdrop
  0 vpush#
  ir-drop-self ;

VSTACK: DE DE
GEN-DITTO

VSTACK: BC BC
GEN-DITTO


VSTACK: #n #n
GEN:
  invite Succubus
  ?vpop# ?vpop# swap - lo-word vpush#
  ir-drop-self ;


VSTACK: r16 #0
GEN:
  Succubus:vdrop
  ir-drop-self ;


;; negate
;; 8-bit negate?
VSTACK: #0 r8
GEN:
  invite Succubus
  ;; dec r16h     ;; 4 -- it is guaranteed to be 0
  ;; ld  a, r16l  ;; 4
  ;; neg          ;; 8
  ;; ld  r16l, a  ;; 4
  ;; 4+4+8+4=20
  vswap vdrop
  v-sp0-pick-r16
  ( reg-idx )
  ir-mc-removable-last? ?<
    dup last-removable-load-r16h-smth? ?<
      ir-mc-remove-last
      ;; we removed loading 0 to r16h, need to load $FF there
      $FF over gen-set-c#-r16h
    || dup gen:dec-r16h >?
  || dup gen:dec-r16h >?
  dup gen:r16l->a
  gen:neg
  dup gen:a->r16l
  r16-?! ;

VSTACK: #0 r16
REGS: HL/free
GEN:
  ;; yay, HL is free!
  ;; in this case, we don't need TOS to be unique.
  ;;   ld  hl, # 0  ;; 10
  ;;   or  a        ;; 4
  ;;   sbc hl, r16  ;; 15
  ;;   10+4+15=29
  invite Succubus
  f-sbc-16 tc-cpu-flags-effect:!
  0 reg-HL load-#-r16
  tc-last-cpu-flags? f-no-carry <> ?< gen:or-a-a >?
  ?vpop-r16 vdrop
  gen:sbc-hl-r16
  reg-HL r16-?!
  reg-HL vpush-r16 ;

;; HL is not free here
VSTACK: #0 r16
GEN:
  invite Succubus
  ;; general case -- 8-bit negate:
  ;;   ld  a, r16l  ;; 4
  ;;   cpl          ;; 4
  ;;   ld  r16l, a  ;; 4
  ;;   ld  a, r16h  ;; 4
  ;;   cpl          ;; 4
  ;;   ld  r16h, a  ;; 4
  ;;   inc r16      ;; 6
  ;; 4+4+4+4+4+4+6=30
  vswap vdrop
  v-sp0-pick-r16
  dup gen:r16l->a
  gen:cpl
  dup gen:a->r16l
  dup gen:r16h->a
  gen:cpl
  dup gen:a->r16h
  dup gen:inc-r16
  dup r16-?! ;


;; inc/dec
VSTACK: r16 #1
GEN:
  invite Succubus
  vdrop
  v-sp0-pick-r16
  dup gen:dec-r16
  r16-?! ;

VSTACK: r16 #2
GEN:
  invite Succubus
  vdrop
  v-sp0-pick-r16
  dup gen:dec-r16
  dup gen:dec-r16
  r16-?! ;

VSTACK: r16 #3
GEN:
  invite Succubus
  vdrop
  v-sp0-pick-r16
  dup gen:dec-r16
  dup gen:dec-r16
  dup gen:dec-r16
  r16-?! ;

VSTACK: r16 #-1
GEN:
  invite Succubus
  vdrop
  v-sp0-pick-r16
  dup gen:inc-r16
  r16-?! ;

VSTACK: r16 #-2
GEN:
  invite Succubus
  vdrop
  v-sp0-pick-r16
  dup gen:inc-r16
  dup gen:inc-r16
  r16-?! ;

VSTACK: r16 #-3
GEN:
  invite Succubus
  vdrop
  v-sp0-pick-r16
  dup gen:inc-r16
  dup gen:inc-r16
  dup gen:inc-r16
  r16-?! ;


;; convert "r16 #n" to "r16 -#n", and use "add"
VSTACK: r16 #n
GEN:
  invite Succubus
  ?vpop# negate lo-word vpush#
  [ss-pfa] + ir-pfa!
  ir-repeat-no-reset ;


VSTACK: #n r16
REGS: HL/free
GEN:
  invite Succubus
  ;; ld   hl, #n  ;; 10
  ;; or   a       ;; 4
  ;; sbc  hl, r16 ;; 15
  ;; 10+4+15=29
  ?vpop-r16
  ?vpop# reg-HL load-#-r16
  f-no-carry tc-last-cpu-flags? <> ?< gen:or-a-a >?
  gen:sbc-hl-r16
  f-sbc-16 tc-cpu-flags-effect:!
  reg-HL dup vpush-r16 r16-?! ;

;; HL is not free here
VSTACK: #n r16
GEN:
  invite Succubus
  ;; ld    a, # n     ;; 7
  ;; sub   a, r2r16l  ;; 4
  ;; ld    r1r16l, a  ;; 4
  ;; ld    a, # n     ;; 7
  ;; sbc   a, r2r16h  ;; 4
  ;; ld    r1r16h, a  ;; 4
  ;; 7+4+4+7+4+4=30
  v-sp0-pop-r16 ?vpop#
  ( reg-idx n )
  dup load-c#-a
  ( reg-idx n )
  over gen:sub-a-r16l
  over gen:a->r16l
  ( reg-idx n )
  hi-byte over load-c#-a-no-xor-no-r16l
  dup gen:sbc-a-r16h
  dup gen:a->r16h
  f-sbc-16 tc-cpu-flags-effect:!
  drop
  dup vpush-r16 r16-?! ;


;; if we have a free HL, use it
VSTACK: DE BC
REGS: HL/free
GEN:
  Succubus:vs-ex-de-hl
  ir-fallthrough ;

;; r16 cannot be HL here
VSTACK: HL r16
GEN:
  invite Succubus
  ?vpop-r16
  ( src-reg-idx )
  dup v-sp0-pick-r16-reserve-r16 reg-HL = not?error" Succubus expects main 16-bit gem"
  f-no-carry tc-last-cpu-flags? <> ?< gen:or-a-a >?
  gen:sbc-hl-r16
  f-sbc-16 tc-cpu-flags-effect:!
  reg-HL r16-?! ;


;; neither r16 can be HL here. also, they cannot be equal.
;; that is, it is either "BC DE" or "DE BC".
;; also, HL is not free. that is, we don't have free regs here.
VSTACK: r16/non-unique r16/non-unique
GEN:
  invite Succubus
  ;; we need to make one of the registers unique.
  ;; to do that, spill until we have a free reg.
  << spill-one get-r16 not?^|| else| drop >>
  ir-scan-again ;

;; r16 cannot be equal here.
;; use 8-bit math.
VSTACK: r16/unique r16
GEN:
  ;; ( r1 r2 ): put `r1-r2` to r1, drop r2
  invite Succubus
  ?vpop-r16 0 vs-r16@
  ( r2-reg-idx r1-reg-idx )
  dup r16-unique? not?error" Succubus expects unique 16-bit gem"
  ;; ld    a, r1r16l  ;; 4
  ;; sub   a, r2r16l  ;; 4
  ;; ld    r1r16l, a  ;; 4
  ;; ld    a, r1r16h  ;; 4
  ;; sbc   a, r2r16h  ;; 4
  ;; ld    r1r16h, a  ;; 4
  ;; 4+4+4+4+4+4=24
  dup gen:r16l->a
  over gen:sub-a-r16l
  dup gen:a->r16l
  dup gen:r16h->a
  over gen:sbc-a-r16h
  dup gen:a->r16h
  f-sbc-16 tc-cpu-flags-effect:!
  r16-?! drop ;

;; r16 cannot be equal here.
;; use 8-bit math.
VSTACK: r16 r16/unique
GEN:
  ;; ( r1 r2 ): put `r1-r2` to r2, drop r1
  invite Succubus
  ?vpop-r16 ?vpop-r16
  ( r2-reg-idx r1-reg-idx )
  dup r16-unique? not?error" Succubus expects unique 16-bit gem"
  ;; ld    a, r1r16l  ;; 4
  ;; sub   a, r2r16l  ;; 4
  ;; ld    r2r16l, a  ;; 4
  ;; ld    a, r1r16h  ;; 4
  ;; sbc   a, r2r16h  ;; 4
  ;; ld    r2r16h, a  ;; 4
  ;; 4+4+4+4+4+4=24
  dup gen:r16l->a
  over gen:sub-a-r16l
  over gen:a->r16l
  dup gen:r16h->a
  over gen:sbc-a-r16h
  over gen:a->r16h
  f-sbc-16 tc-cpu-flags-effect:!
  drop dup vpush-r16 r16-?! ;

END-RULE

Succubus simply evaluates all predicates, and executes the first subrule that matched. in "GEN:" bodies i can either generate machine code, or rewrite high-level IR code (this is mostly standard Forth threaded code, but each instruction has codegen state saved, so i can "rewind" everything back).
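
The "first matching subrule wins" dispatch can be sketched in a few lines of Python. The three specific cases mirror the `HL HL`, `#n #n` and `r16 #0` subrules of the `-` rule above. Everything else (the tuple encoding of virtual stack items, the function names, the placeholder general case) is an assumption made for illustration, not F8's implementation.

```python
def is_const(item):
    return item[0] == "const"


def is_reg(item):
    return item[0] == "reg"


# Each rule: (name, predicate(a, b), action(a, b) -> (new stack items, emitted code)).
RULES_MINUS = [
    ("same-reg",
     lambda a, b: is_reg(a) and a == b,
     lambda a, b: ([("const", 0)], [])),                          # x - x = 0
    ("const-const",
     lambda a, b: is_const(a) and is_const(b),
     lambda a, b: ([("const", (a[1] - b[1]) & 0xFFFF)], [])),     # fold at compile time
    ("reg-minus-zero",
     lambda a, b: is_reg(a) and b == ("const", 0),
     lambda a, b: ([a], [])),                                     # x - 0 = x
    ("general",
     lambda a, b: True,
     lambda a, b: ([("reg", "HL")], ["; general 16-bit subtraction goes here"])),
]


def compile_minus(vstack):
    """Pop two virtual stack items and apply the first rule whose predicate matches."""
    b = vstack.pop()
    a = vstack.pop()
    for name, predicate, action in RULES_MINUS:
        if predicate(a, b):
            items, code = action(a, b)
            vstack.extend(items)
            return name, code
    raise ValueError("no rule matched")


vstack = [("const", 10), ("const", 3)]
print(compile_minus(vstack), vstack)   # ('const-const', []) [('const', 7)]
```

Rule order matters: the specific cases must come before the catch-all, just as the cheap subrules precede the general ones in the rule database.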

there are also predicates like "PREV-IR: + [ r16 #n ]": they check already generated code, and the codegen state when the command was processed. this way i can fold things like "3 + 5 + 7 +" into one "15 +" instruction.
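
A small Python sketch of that folding, assuming a flat IR of `("lit", n)` and `("word", name)` items. The encoding and the function names are hypothetical, and the sketch looks back at the IR only, without the saved codegen state that the real predicates also inspect.

```python
def parse(source):
    """Turn a space-separated Forth fragment into a list of IR items."""
    ir = []
    for token in source.split():
        if token.lstrip("-").isdigit():
            ir.append(("lit", int(token)))
        else:
            ir.append(("word", token))
    return ir


def fold_add_chain(ir):
    """Merge `a + b +` into `(a+b) +`, looking back at what was already emitted."""
    plus = ("word", "+")
    out = []
    for op in ir:
        out.append(op)
        if (len(out) >= 4 and out[-1] == plus and out[-2][0] == "lit"
                and out[-3] == plus and out[-4][0] == "lit"):
            total = (out[-4][1] + out[-2][1]) & 0xFFFF   # 16-bit wrap-around
            out[-4:] = [("lit", total), plus]
    return out


print(fold_add_chain(parse("3 + 5 + 7 +")))  # [('lit', 15), ('word', '+')]
```

Because the merged pair is put back into the output, the next `n +` can fold with it again, so chains of any length collapse to a single addition.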

nice side-effect: predicate checkers are JIT-compiled into machine code. because predicate compiler basically just compiles Forth words to check various conditions, and UrForth turns that into native x86 code, as any other Forth words.

as you may see, there is nothing really fancy in F8: just a huge base of patterns. but with virtual stack and register tracking this rule database can generate surprisingly good code.

i must admit that the very same idea is used in commercial VFX Forth, for example, which is known to generate the fastest machine code. and i also must say that i invented this idea more than 20 years ago, just never had a good reason to implement it. so it is not stolen, i learned about VFX codegen only recently.

also, F8 is the rare example of something that would be backported from ZX to x86. because i am planning to replace current UrForth codegen with this new rule-based approach. UrForth is already among the fastest x86 Forth systems, but with this new CG it would be able to compete with the best commercial offerings. (and it is free!) yay.

there is a lot to do yet, but i definitely love the way things are going. and the fact that the fastest Forth system is using my ideas (c'mon, i invented this long before VFX guys did! this is my idea! ;-) makes me even more sure that i am not daydreaming.

and i REALLY want to finish and release the first version of this project. because i have at least two original game ideas (well, one stolen, but nobody played the original anyway! ;-), but i will never be able to do THREE projects simultaneously, lol. i can barely cope with one.

teaser: both games will not only feature destructible environment as a gimmick, but will be heavily based on that.
