Threaded Mode | Linear Mode

robve · (This post was last modified: 05-21-2023 12:41 PM by robve.)

Recently started a new Forth project. This time for the Sharp PC-850(V)(S). Perhaps this is interesting to some of you.

[Image: PC-G850-forth.png]

Repository: https://github.com/Robert-van-Engelen/Forth850

It's about 8K in size plus 8K free space or more, written from scratch in Z80 assembly and Forth itself. It's an ANS/standard compliant, directly threaded code (DTC) Forth implementation. Words are case insensitive.

No file IO and no floating point yet (no syscalls are published as far as I know), but may look into floating point later, e.g. to use a Z80 32 bit float library, if it fits. (EDIT: floating point is included with the latest version).

Why write from scratch? Well, firstly well-known Z80 Forth implementations like CamelForth, eForth and Jupiter Ace aren't ANS Forth compliant and lack features. Secondly, it's more fun to write from scratch. The Moving Forth article was helpful, but I've decided to use DE as TOS, BC as IP (instruction pointer) and RP (return stack pointer) in memory. It's faster that way, since DE (TOS) and HL (scratch) are exchangeable in only 4 cycles and HL is a versatile register pair. Even faster would be to implement subroutine threaded code (STC) to execute Z80 code directly without DTC fetch+execute, but that will require 50% more space on average for Forth programs. Nice for speed, but perhaps not so much for usability on a small 32K machine.

With a bit of time tinkering, I came up with efficient Z80 code for Forth words, more efficient in some cases than CamelForth, eForth and Jupiter Ace (the latter isn't particularly fast by the way). If someone can point to more efficient Z80 Forth DTC implementations, then please let me know.

For example, my signed 16x16->16 multiplication takes only 51 cycles per bit versus 62 cycles per bit published by Zilog. I'm using the ASZ80 macro assembler with the CODE macro defining a Forth word "*" in machine code with label "star":

Code:

                CODE *,star

                push de                 ; save TOS

                exx                     ; save bc with ip

                pop de                  ; n2->de

                pop bc                  ; n1->bc

                ld hl,0                 ; 0->hl

                ld a,c                  ; c->a low order byte of n1

                ld c,b                  ; b->c save high order byte of n1

                ld b,8                  ; 8->b loop counter

1$:             rra             ;  4    ; loop, a>>1->a set cf

                jr nc,2$        ;  7    ;   if cf=1 then

                add hl,de       ; 11    ;     hl+de->hl

2$:             sla e           ;  8    ;

                rl d            ;  8    ;   de<<1->de

                djnz 1$         ; 13(51); until --b=0

                ld a,c                  ; c->a high order byte of n1

                ld b,8                  ; 8->b loop counter

3$:             rra             ;  4    ; loop, a>>1->a set cf

                jr nc,4$        ;  7    ;   if cf=1 then

                add hl,de       ; 11    ;     hl+de->hl

4$:             sla e           ;  8    ;

                rl d            ;  8    ;   de<<1->de

                djnz 3$         ; 13(51); until --b=0

                push hl                 ; save hl with product as TOS

                exx                     ; restore bc with ip

                pop de                  ; pop new TOS

                JP_NEXT                 ; continue

EDIT: even faster multiplication is possible. To calculate the high order byte we do not need to iterate over all 8 bits of the high order multiplier stored in register c, but only over the nonzero bits. We also can ignore the lower order result stored in register e. This reduces the max loop iteration cycle time to 32 and 33 per bit. Furthermore, the second loop only runs until the last bit of register c is shifted out. If register c is zero, the second loop does not execute thereby saving hundreds of cycles. We also use jp instead of jr to improve and balance the cycle time per bit:

Code:

                CODE *,star

                push de                 ; save TOS

                exx                     ; save bc with ip

                pop de                  ; n2 -> de

                pop bc                  ; n1 -> bc

                ld hl,0                 ; 0 -> hl

                ld a,c                  ; c -> a low order byte of n1

                ld c,b                  ; b -> c save high order byte of n1

                ld b,8                  ; 8 -> b loop counter

1$:             rra             ;  4    ; loop, a >> 1 -> a set cf

                jr nc,2$        ;  7    ;   if cf = 1 then

                add hl,de       ; 11    ;     hl + de -> hl

2$:             sla e           ;  8    ;

                rl d            ;  8    ;   de << 1 -> de

                djnz 1$         ; 13(51); until --b = 0

                ld a,h                  ; h -> a do high order, low order is done

                jr 5$                   ; jump to shift c and loop

3$:             add d           ;  4    ; loop, a + d -> a

4$:             sla d           ;  8    ;   d << 1 -> d

5$:             srl c           ;  8    ;   c >> 1 -> c set cf and z if no bits left

                jr c,3$         ; 12/7(32); until cf = 0 repeat with addition

                jp nz,4$        ;   10(33); until c = 0 repeat without addition

                ld h,a                  ; a -> h

                push hl                 ; save hl with product as TOS

                exx                     ; restore bc with ip

                pop de                  ; pop new TOS

                JP_NEXT                 ; continue

At the heart of Forth parsing are my new CHOP and TRIM words in assembly that exploit special Z80 ops, which I found to be better suitable and more efficient than CamelForth's approach to parsing. The CHOP and TRIM words:

Code:

; CHOP          c-addr u1 char -- c-addr u2

;               truncate string up to matching char

;               leaves string if char not found

;               char=0x20 (bl) chops 0x00 to 0x20 (white space)

                CODE CHOP,chop

                ld a,e                  ; char->a

                exx                     ; save bc with ip

                ex af,af'               ; save a with char

                pop bc                  ; pop u1->bc

                ld e,c                  ;

                ld d,b                  ; u1->de

                ld a,c                  ;

                or b                    ; test bc=0, 0->cf

                jr z,2$                 ; if bc=0 then not found

                pop hl                  ;

                push hl                 ; c-addr->hl

                ex af,af'               ; restore a with char

                cp 0x20                 ;

                jr z,3$                 ; if a=0x20 then find white space

                or a                    ; 0->cf not found

;               find char in string

                cpir                    ; repeat until a=[hl++] or --bc=0

                jr nz,2$                ; if match then 

1$:             ccf                     ; complement cf when found, to correct cpi bc--

2$:             ex de,hl                ; u1->hl

                sbc hl,bc               ; u1-bc-cf->hl

                push hl                 ; save hl as TOS

                exx                     ; restore bc with ip

                pop de                  ; pop new TOS

                JP_NEXT                 ; continue

;               find white space in string

3$:             cp (hl)         ;  7    ; loop to compare a to [hl]

                cpi             ; 16    ;   hl++,bc--

                jr nc,1$        ;  7    ;   if [hl]<a then found

                jp pe,3$        ; 10    ; until bc=0

                jr 1$                   ; not found

; TRIM          c-addr1 u1 char -- c-addr2 u2

;               trim initial chars

;               char=0x20 (bl) trims 0x00 to 0x20 (white space)

                CODE TRIM,trim

                ld a,e                  ; char->a

                exx                     ; save bc with ip

                pop bc                  ; u1->bc

                pop hl                  ; c-addr1->hl

1$:             ex af,af'               ; save a

                ld a,c                  ;

                or b                    ;

                jr z,3$                 ; if bc<>0 then

                ex af,af'               ;   restore a

2$:             cpi             ; 16    ;   loop

                jr nz,4$        ;  7    ;     while a=[hl++],--bc

                jp pe,2$        ; 10    ;   until b=0

3$:             push hl                 ; save hl as 2OS

                push bc                 ; save bc as TOS

                exx                     ; restore bc with ip

                pop de                  ; pop new TOS

                JP_NEXT                 ; continue

4$:             cp 0x20                 ;

                jr nz,5$                ; if char=0x20 then

                dec hl                  ;

                cp (hl)                 ;

                inc hl                  ;

                jr nc,1$                ;   if [hl-1]<=0x20 then keep trimming

5$:             inc bc                  ; correct bc++ for cpi match

                dec hl                  ; correct hl-- for cpi match

                jr 3$                   ; finalize trimming

My case-insensitive dictionary search FIND-WORD should speed up Forth compilation by quite a bit:

Code:

; FIND-WORD     c-addr u -- c-addr 0 | xt 1 | xt -1

;               search dictionary for matching word

;               leaves execution token and 1=immediate or -1=not immediate

;               leaves c-addr and 0 when not found

                CODE FIND-WORD,findword

                ld a,d                  ;

                or a                    ; test d=0 high order byte of u

                jp nz,zero_next         ; if u is too large then set new TOS to 0

                sla e                   ; shift u to compare w/o immediate bit

                jp c,zero_next          ; if u is too large then set new TOS to 0

                jp z,zero_next          ; if u=0 then set new TOS to 0

                push de                 ; save de with 2*u

                exx                     ; save bc with ip

                pop bc                  ; pop 2*u->bc

                pop de                  ; pop c-addr->de

                ld hl,(context+3)       ; CONTEXT->hl

                jr 3$                   ; start searching

;               loop over dictionary

1$:             pop de                  ; restore de with c-addr

2$:             pop hl          ; 10    ; loop, restore hl with lfa

3$:             ld a,(hl)       ;  7    ;

                inc hl          ;  6    ;

                ld h,(hl)       ;  7    ;

                ld l,a          ;  4    ;   [hl]->hl follow link at hl=lfa

                or h            ;  4    ;

                jr z,6$         ;  7    ;   if hl=0 then not found

                push hl         ; 11    ;   save hl with lfa

                inc hl          ;  6    ;

                inc hl          ;  6    ;   hl+2->hl with nt (nfa)

                ld a,(hl)       ;  7    ;   word length

                add a           ;  4    ;   shift away immediate bit

                cp c            ;  4    ;   test a=c word length match (both shifted)

                jr nz,2$        ; 12(95);   if lengths differ then continue searching

;               compare string to word

                push de                 ;   save de with c-addr

                inc hl                  ;   hl++ point to nfa chars

                ld b,c                  ;   2*u->b

                srl b                   ;   u->b word length (nonzero)

;               loop over word chars

4$:             ld a,(de)       ;  7    ;   loop

                cp (hl)         ;  7    ;     compare [de]=[hl]

                jr z,5$         ; 12/7  ;     if mismatch then

                and 0xdf        ;    7  ;       make upper case

                cp 'A           ;    7  ;

                jr c,1$         ;    7  ;       if a<'A' then continue search

                cp 'Z+1         ;    7  ;

                jr nc,1$        ;    7  ;       if a>'Z' then continue search

                xor (hl)        ;    7  ;

                and 0xdf        ;    7  ;       case insensitive compare [de]=[hl]

                jr nz,1$        ;    7  ;       if mismatch then continue search

5$:             inc de          ;  6    ;     de++ point to next char of c-addr

                inc hl          ;  6    ;     hl++ point to next char of word

                djnz 4$         ; 13(51/102);until --b=0

;               found a matching word

                pop de                  ;   discard saved c-addr

                ex (sp),hl              ;   save hl with xt as 2OS, restore hl with lfa

                inc hl                  ;

                inc hl                  ;   hl+2->hl with nt (nfa)

                bit immediate_bit,(hl)  ;   test immediate bit of [hl] word length

                exx                     ;   restore bc with ip

                jp nz,one_next          ;   set new TOS to 1 if word is immediate

                jp mone_next            ;   set new TOS to -1

;               not found

6$:             push de                 ; save de with c-addr as 2OS

                exx                     ; restore bc with ip

                jp zero_next            ; set new TOS to 0

The Z80 is alright for Forth (I've used the Z80 back in the 80s and loved it), but not as good as the 6809 and the ESR-L (PC-E500) that have two stack registers. What a luxury compared to the Z80!

- Rob

robve · 11-08-2022, 09:23 PM

Currently testing the Forth850 for Forth standard compliance and performance.

The NQUEENS benchmark runs in 0.865 seconds.

This is measured in "fast mode", looping NQUEENS 100 times with a stopwatch.

In "slow mode", the NQUEENS benchmark runs in a little longer in 0.95 seconds.

This is the same Forth NQUEENS program as benchmarked on the PC-E500.

The PC-G850 runs at 8MHz, whereas the PC-E500 runs at 2.3MHz with NQUEENS taking 3.47 seconds. Since 8/2.3 = 3.48, I am happy to see that the performance of these implementations are comparable when adjusting for clock speed. The Z80 runs slightly faster, but that is attributable to the fact the the PC-E500 has a 20 bit address space, which requires 16<->20 bit conversions in Forth when dealing with addresses stored 16 bit Forth cells.

If it matters, the "slow mode" version of Forth850 uses ~300 bytes less of the 8K Forth core, which doesn't seem critical to me to save space. So the "fast" version is probably the best way to go to release once testing is completed.

Both slow and fast versions perform stack over/underflow detection and BREAK key interrupt detection. Without these, NQUEENS would run in 0.84 seconds. But that version won't be as usable to play with since it will crash when the stack over/underflows.

Will release Forth850 soon when testing is completed, so anyone can try it out and rerun the benchmark.

- Rob

Sukiari · 11-09-2022, 07:45 PM

This is incredible, thank you. It would be amazing to see a new generation of these kinds of machines, hopefully projects like yours will drum up more interest and eventually we see something maybe crowd funded.

robve · 11-10-2022, 09:36 PM

Forth850 is now available: https://github.com/Robert-van-Engelen/Forth850

It has 294 words in an 8K executable, including 2xx and Dxx words for double cells, 256 byte string buffers PAD, TIB and TMP, string operations, exceptions THROW and CATCH, VALUE, 2VALUE, FORGET, VOCABULARY, simple graphics with DRAW and VIEW, port IO with OUT and INP.

It's super fast by design and written from scratch in Z80 assembly (see the README in the repository for details). Not only does the nqueens benchmark run in only 0.865 seconds, compiling Forth source is speedy too without noticeable delays.

A nice feature of Forth850 is that the size of the free dictionary space is user-definable. This is done by setting USERxxxx in the Monitor to the size of the Forth system. At minimum the system requires 9K e.g. USER23FF to accommodate the 8K executable with minimal stacks and dictionary space. The largest is USER75FF with about 21K of free dictionary space.

I've included a forth850.wav file to load the Forth 850 binary via a cassette interface. I will look into serial later. I need to acquire a serial cable first.

Things that will be nice to add (in progress):

an additional extended "full version" with additional built-in Forth words
with an improved command line editor, like Forth500
ability to load Forth source code from the TEXT editor
serial IO to communicate via RS232 and load source code

- Rob

xerxes · 11-11-2022, 01:55 AM

(11-10-2022 09:36 PM)robve Wrote: It's super fast by design and written from scratch in Z80 assembly

Indeed! 5 times faster than C is really impressive. The only more efficient language on a pocket computer is the Pascal compiler
of the PB-2000C with 1.27 seconds at 0.91 MHz or clock speed adjusted 0.144 seconds at 8.0 MHz. Thank you for the test.

robve · (This post was last modified: 11-12-2022 06:55 PM by robve.)

Well, nothing can or will always go 100% according to plan! During a last cleanup effort of the source code to post on the repository, I accidentally copied a DROP line. As a result, entering a Forth command produced <ERR-13>.

I was pulling my hair out to figure this one out. I had spent three days to carefully test everything during my free hours before I ran into this today to test an addition of new GETKEY word.

It's fixed now Smile

The PC-G850 Monitor helped me a lot to check my code and set breakpoints. The PC-G850 is a great little machine!

Does anyone have suggestions for a small Z80 single precision (32 bit) floating point library? BCD or IEEE754 or something else?

- Rob

robve · (This post was last modified: 11-12-2022 06:55 PM by robve.)

With the DRAW word you can create new characters to display with text.

For example, define GREMIT (graphic emit) to display 8x6 pixel characters:

Code:

: GREMIT

  CREATE

  DOES>

    6 DRAW

    28 EMIT \ cursor right

;

To display \( \alpha \) and \( \beta \) we just need to define their pixels, which is simpler in binary with %:

Code:

GREMIT alpha

  %00111000 C,

  %01000100 C,

  %01000100 C,

  %00111000 C,

  %01000100 C,

  %00000000 C,

GREMIT beta

  %11111110 C,

  %00100001 C,

  %01001001 C,

  %01001001 C,

  %00110110 C,

  %00000000 C,

So ." result " alpha ." =" beta displays result α=β

The VIEW word extracts pixel patterns. With this word you can move pixels and OR- or XOR-pixels to DRAW.

The cursor's code uses VIEW and DRAW to invert the cursor on screen. KEY displays the cursor. I've written a version of EMIT to support control characters, which the PC-G850 does not support natively.

Lots of possibilities to enhance Forth programs.

- Rob

robve · (This post was last modified: 05-21-2023 12:47 PM by robve.)

Forth850 can be extended "on the fly" with your own Z80 machine code using the PC-G850(V)(S) built-in Z80 Assembler.

To clarify, here are the steps. A BEEP example follows below.

first define the word in Forth850 with NFA, <name>
HERE HEX . displays the ORG xxxx address to specify in the assembly code
write the assembly in the TEXT editor with this ORG xxxxH address
run the Assembler, which adds the code to the <name> you defined in Forth850
write down the number of bytes assembled, this is important
in Forth850, DECIMAL <number> ALLOT makes the code part of <name> so that HERE HEX . displays the next address after the code (use negative -nnnn ALLOT to undo ALLOT when making code changes, or pick a larger ALLOT value to reserve space for assembly code additions)
IMPORTANT: rerun the Assembler, because the first assembly was overwritten by Forth850 (the hold area is HERE) before you ALLOTed

A couple of (obvious) points to keep in mind when writing machine code for Forth850:

register BC holds the IP (instruction pointer) and should not be modified, i.e. save it somewhere to restore it before returning
register DE holds the TOS (top of stack) and can be used and set
the stack can be used to push/pop Forth parameters
register IY holds the address of "next" and should not be modified
return from the code with JP (IY), never use RET

Example BEEP word.

In Forth850:

Code:

NFA, BEEP

HERE HEX .

This displays the ORG address for the assembly of the new BEEP word. Suppose the address is 20FA, then in TEXT editor (line numbers not shown) we write:

Code:

      ORG 20FA

      di              ;

      di              ; disable interrupts

      ld hl,0000h     ;

      xor a           ;

loop: out (18h),a     ; loop, out audio port

wait: dec l           ;   loop

      jr nz,wait      ;   until --l=0

      cpl             ; switch on/off

      dec h           ;

      jr nz,loop      ; until --h=0

      ei              ; enable interrupts

      jp (iy)         ; next

Assemble the code, which is 18 bytes long. Return to Forth850 with BASIC and CALL256 and enter:

Code:

18 ALLOT

Important: run the Assembler again to save the code in the ALLOTed space.

Now BEEP works in Forth850.

Alternatively, the BEEP word can also be defined with HEX codes in Forth850 as follows, which takes more effort but with the same result:

Code:

NFA, BEEP       ( -- )

  HEX

  F3 C,         \       di              ;

  F3 C,         \       di              ; disable interrupts

  21 C, 0 ,     \       ld hl,0000h     ;

  AF C,         \       xor a           ;

  D3 C, 18 C,   \ loop: out (18h),a     ; loop, out audio port

  D2 C,         \ wait: dec l           ;   loop

  20 C, FD C,   \       jr nz,wait      ;   until --l=0

  2F C,         \       cpl             ; switch on/off

  25 C,         \       dec h           ;

  20 C, F7 C,   \       jr nz,loop      ; until --h=0

  FB C,         \       ei              ; enable interrupts

  FD C, E9 C,   \       jp (iy)         ; next

  DECIMAL       \ 18 bytes

A simple example that flips the bytes of the TOS stored in register DE, which takes 5 bytes of machine code:

Code:

NFA, FLIP

HERE HEX .

xxxx

5 ALLOT

Code:

      ORG xxxxH

      ld a,e

      ld e,d

      ld d,a

      jp (iy)

As always when writing assembly, if something is seriously wrong with it then we may crash and have to start over. In the worst case we have to install Forth850 again when the dictionary is damaged by "random POKEs". However, it is often not necessary to reset the machine when asked for MEMORY CLEAR (Y/N) just say N (NO) and give it another try after fixing the problem.

- Rob

robve · 11-14-2022, 07:35 PM

A really nice thing about the PC-G850(V)(S) is that the ROM and RAM addresses are known and included in the excellent PC-G850V(S) User Guide translated by Jack W. Hu.

So let's go forth and put our "Pokecon" to some good use by defining a new Forth850 word TEXT that reads the TEXT editor area to evaluate Forth. In this way we can write Forth code in the built-in TEXT editor and then read it back into Forth:

write Forth code with line numbers in the TEXT editor
in Forth enter TEXT to read the source code

The TEXT area start- and end-pointers are located in RAM at &H7973-4 and &H7975-6, respectively. The BASIC and TEXT code has a &HFF marker at TEXT start and end. The internal BASIC/TEXT line format is: hi lo len data... &H0D where

hi is high order byte of line number, hi is &HFF at end
lo is low order byte of line number
len is length of the line up to and including the &H0D byte

Keeping this in mind, the TEXT word is defined as follows, using EVALUATE to evaluate each line in the TEXT area:

Code:

: TEXT

  $7973 @ 1+ >R

  BEGIN

    R>                  \ -- addr

  DUP C@ $FF <> WHILE

    2+ DUP C@ SWAP 1+   \ -- len addr

    2DUP + >R

    SWAP 1- EVALUATE

  REPEAT

  DROP ;

A minor caveat of using EVALUATE for each line: .( and ( cannot span multiple lines, i.e. a ) must be on the same line.

- Rob

robve · 11-29-2022, 07:56 PM

I've written new IEEE 754 floating point routines from scratch in Z80 for Forth850. I wasn't happy with the Z80 floating point libraries I found on GitHub and elsewhere, which appear to have sizable code bases and are heavy on memory access and use. The new Z80 floating point routines I wrote operate with Z80 registers only, including the shadow registers, except for a one push-pop pair to pass a value from a shadow register pair to a register pair. The shadow registers are available for use with the PC-G850 (i.e. not used by interrupts). So let's put them to work.

The new math.asm Z80 library:

IEEE 754 single precision floating point addition, subtraction, negation, multiplication, division, integer <-> float conversion, string <-> float conversion
"memoryless" using registers only (+shadow), at most one push+pop per flop
optimized for speed and reduced code size (no loop unrolling)
extensively tested

But lacks (at this time):

no INF/NAN values; routines return error condition (cf set)
truncated 24 bit mantissa instead of IEEE 754 unbiased "banker's rounding" (floating point string parsing and output use rounding)

I may extend the library to support INF/NAN and banker's rounding to make it compliant with IEEE 754. Right now, the only observable difference is the rounding (lack thereof). Consistency of the numerical treatment is important.

The Forth code for the Sine plot shown in the picture:

Code:

3.14159265E0 2CONSTANT PI

PI 72E0 F/ 2VALUE SCALE \ free to change e.g.: PI 144E0 F/ TO SCALE

: SINE

  PAGE

  144 0 DO

    I DUP S>D D>F SCALE F*

    FSIN 24E0 F*

    F>D D>S 24 + POINT

  LOOP

  GETKEY DROP ;

Trig functions in Forth:

Code:

: FSIN  ( r1 -- r2 ) \ only valid for -2pi <= r1 <= 2pi

  2DUP 2DUP 2DUP F* FNEGATE

  2ROT 2ROT

  15 2 DO

    5 PICK 5 PICK

    F* I I 1+ *

    S>D D>F F/

    2DUP 2ROT F+ 2SWAP

  2 +LOOP

  2DROP 2SWAP 2DROP ;

: FCOS  ( r1 -- r2 ) 1.5707962E0 2SWAP F- FSIN ;

: FTAN  ( r1 -- r2 ) 2DUP FSIN 2SWAP FCOS F/ ;

I've updated the forth850-full version with the new features. The binary is about 10K, so there is plenty of space left on the machine. I've tested the implementation extensively and will continue to do so in the near future.

- Rob

robve · 12-07-2022, 03:07 AM

(11-29-2022 07:56 PM)robve Wrote: I may extend the library to support INF/NAN and banker's rounding to make it compliant with IEEE 754. Right now, the only observable difference is the rounding (lack thereof).

I rewrote part of the Z80 IEEE 754 math core today to add two proper IEEE 754 rounding modes to choose from (at compile time):

round to nearest, ties to even a.k.a. banker's rounding, typically used by libmath as the default rounding mode (C/C++, Python, Lua, Julia etc.)
round to nearest, ties to away

The Forth850 update will use round to nearest, ties to even. Correct rounding is applied to addition, subtraction, multiplication, division and all Forth floating point math functions (since they use these core routines.) The string input and output of floats was already properly rounding. The Z80 math library is 67 bytes longer for just a tad over 1K of portable Z80 code and does not suffer from performance degradation or other compromises, i.e. no memory accesses as everything is still kept in Z80 registers.

Adding INF/NAN is should actually be easy with some logic to test these special cases. I could make it an option. But I prefer to have exceptions in Forth to catch floating point overflow, division by zero etc. so signaling NaN and INF exceptions.

With internal proper round to nearest, ties to even the new sine plot is perfectly symmetric on close inspection:
[Image: IMG_5355.jpg]

A bit surprised myself that I was able to pull this off, as I was already using all Z80 registers Smile

- Rob

robve · 01-01-2023, 05:45 PM

Happy to share that Forth850 v1.0 is now available on GitHub.

The basic 8K version has 295 words. The more complete 11K version has 349 words, including IEEE 754 single precision and a more capable line editor (like Forth500). Both versions were extensively tested and verified *). The performance has not changed since the initial release. The n-queens benchmark is solved in 0.865 seconds.

I wrote three versions of a new IEEE 754 single precision math library with standard floating point operations in Z80 assembly solely using Z80 registers, i.e. "memoryless" **). All trig and other math functions are defined in Forth in MATH.FTH.
- math.asm: a truncating version that is small, 960 bytes of code;
- mathr.asm: a proper rounding version (IEEE 754 round to nearest, ties to even) 1085 bytes;
- mathri.asm: like mathr.asm extended with IEEE INF and NAN (Forth850 does not use INF/NAN) and signed zero, but no subnormals (which are evil), 1286 bytes.

It felt quite satisfying to put this all together, as everything fits in Z80 registers, including shadow, but not IX and IY which remain unused. Some effective use of "Z80 tricks" I learned in the 80s Smile

- Rob

*) while this version was extensively tested and verified, there is no guarantee that there are zero bugs. Please report issues so I can fix them.

**) "memoryless" but a PUSH+POP must be used to copy to Z80 shadow registers, so there is at most one PUSH+POP pair per flop (add, sub, mul, div).

F-73P · (This post was last modified: 01-03-2023 03:40 AM by F-73P.)

Great work!

Did you write routines to convert between strings and floats? I've started writing string-to-double and double-to-string routines and it's quite challenging, requiring extended-precision arithmetic and tables.

robve · (This post was last modified: 01-04-2023 02:30 PM by robve.)

(01-03-2023 03:32 AM)F-73P Wrote: Did you write routines to convert between strings and floats? I've started writing string-to-double and double-to-string routines and it's quite challenging, requiring extended-precision arithmetic and tables.

Yes. For single precision floating point, that is.

I evaluated three strategies with tables. Because single precision floating point decimal exponents are limited to the +/-38 range, a full powers of 10 table is logical to use but an overkill. I found that a table with 12 entries saves memory and is sufficiently accurate, almost as accurate as a full table.

I've documented this in the mathr.asm assembly source and also in mathri.asm (same, but with inf/nan and signed zero) see the fpow10 routine:

Code:

;                 method                                   min bits   mean bits

;                 iterative multiplication by 10:           20.2439     23.4988

;                 iterative multiplication by 10**4:        21.3502     24.4916

;                 iterative multiplication by 10**5:        21.3243     24.5500

;                 iterative multiplication by 10**6:        21.3297     24.5937

;                 iterative multiplication by 10**7:        21.3253     24.5619

;                 iterative multiplication by 10**8:        21.3004     24.5260

;                 iterative multiplication by 10**10:       21.0809     24.4483

;                 iterative multiplication by 10**16:       20.7694     24.0836

;                 table lookup with  4 powers of 10:        21.4136     24.6670

;                 table lookup with  8 powers of 10:        21.9131     25.2981

;                 table lookup with  9 powers of 10:        21.8907     25.4628

;                 table lookup with 10 powers of 10:        22.0177     25.5388

;               * table lookup with 12 powers of 10:        22.0482     25.7529

;                 table lookup with 16 powers of 10:        22.0004     25.4073

;                 exponentiation by squaring from bottom:   22.0004     25.3867

;                 exponentiation by squaring from top:      22.0023     25.3830

;                 table lookup with 38 powers of 10:        exact       exact

;

;               * = selected and enabled for small code size and high accuracy

where "min bits" means the minimal number of bits guaranteed to be accurate -log(d)/log(2.0) with minimum d of relative errors e between x and y = fabs(x - y)/fabs(y), and "mean bits" means the number of bits accurate on average (relative error) in 10 million random samples: -log(a/SAMPLES)/log(2.0) with a the sum of relative errors. The mantissa has 24 bits. Note that summing relative errors can get over 25 "mean bits", which is when more cases are exact than have errors. I ran the tests in C code (DM me if you want this code).

The method selected (table of 12) is enabled with .if-.endif, while the others are disabled, but could be used. I would believe that with double precision floating point you could use a small table, perhaps as small as 12 but probably larger. Exponentiation by squaring is another option to keep the table small. See my assembly source code for the Z80 routines.

As you also point out, the powers of 10 tables are used by atof (string to float) and ftoa (float to string). The mathr.asm ftoa routine computes ((exponent - bias - 1) * 77 + 77 + 82) / 256 to estimate the decimal exponent given the biased binary exponent. I've seen this elsewhere, but I made an improvement to adjust with +82 to prevent underestimation. With double precision this will need work and more tweaking.

In the atof routine I used a 32 bit mantissa to multiply by 10 to maintain accuracy and produce a 24 bit mantissa from a string of digits without loss or precision up to 31 digits, with rounding to nearest ties to even when over 10 digits or so. For double precision, this needs a 64 bit mantissa.

Beware of pitfalls when parsing floats. Floats can be represented in many different ways, such as 1234567890E-45 and 0.0000000000123456789E45 for example. Rather obvious to look at, but the code should handle it.

I wrote the math code in less than a week in the evenings, but testing took another week or two to make sure there are no bugs and to make some tweaks. For example, I found that the round to nearest ties to even with "sticky bits" didn't work correctly, so I came up with a better way to implement this mechanism without requiring much code or using Z80 registers which I can't, since they are already all in use.

There is a lot more I could comment on. If you have questions, let me know in your reply or DM me.

- Rob

F-73P · (This post was last modified: 01-06-2023 06:42 AM by F-73P.)

Brilliant!

I'm using a relatively simple but inefficient approach for converting strings to single-precision floating-point: use multiple-precision arithmetic and tables to find integers s,e such that s x 2^e = user input, where s element of [2^23,2^24) and is rounded up. Then use s,e to obtain the IEEE 754 single-precision encoding.

Its working for the few values I've tested (0.5 -> 0x3F000000, 2 -> 0x40000000 and 2.5 -> 0x40200000). Once I've added higher powers of 10 to the table and parsing exponents I'll try smaller and bigger values.

For floating-point to string I'll use the first algorithm in this paper.

robve · (This post was last modified: 05-21-2023 01:48 AM by robve.)

Forth850 v1.4 released

Forth850 GitHub repository

Here's a summary of updates since the first stable release v1.0 (January 9, 2023):
- v1.1 fix CASE OF not compiling
- v1.2 new words UD/MOD, D/MOD, DMOD and D/; fix TO for 2VALUE
- v1.3 permit floating point constants with no digits after E, such as 1E, as per Forth 2012 standard
- v1.4 new word F>S added to the full version with floating point; improve LOGS.FTH and MATH.FTH particularly the F** word see below

Exponentiation by squaring

The F** algorithm's range and accuracy is improved in v1.4, see also the HP forum thread by J-F Garnier small challenge. The exponentiation-by-squaring conditions are tightened to ensure accuracy of the floating point result within 1 ULP, i.e. the least significant bit may be off in some cases. Exponentiation-by-squaring is fast and accurate for many integer F** calculations where the standard exp-log algorithm may produce a 1/2 ULP error, i.e. not necessarily return an integer when an integer result is expected.

Code:

: F**       ( r1 r2 -- r3 )

  2DUP F0= IF \ r2 = 0

    2OVER F0= IF -46 THROW THEN \ error if r1 = 0 and r2 = 0

    2DROP 2DROP 1E0 EXIT \ return 1.0

  THEN

  2OVER F0= IF \ r1 = 0

    2DUP F0< IF -42 THROW THEN \ error if r1 = 0 and r2 < 0

    2DROP 2DROP 0E0 EXIT \ return 0.0

  THEN

  \ exponentiation by squaring r1^n when n is a small integer |n|<=16

  2DUP 2DUP FTRUNC F= IF \ r2 has no fractional part

    2DUP ['] F>D CATCH 0= IF \ r2 is convertable to a double n

      2DUP DABS 17. DU< IF \ |n| <= 16

        DROP \ drop high order of n

        DUP 0< >R \ save sign of n

        ABS >R \ save |n|

        2DROP \ drop old r2

        1E0 \ -- r1 1.0

        BEGIN

          R@ 1 AND IF 2OVER F* THEN

          R> 1 RSHIFT \ -- r1^n product u>>1

        DUP WHILE

          >R

          2SWAP 2DUP F* 2SWAP \ -- r1^n^2 product u>>1

        REPEAT

        DROP 2SWAP 2DROP \ -- product

        R> IF 1E0 2SWAP F/ THEN \ reciprocal when exponent was negative

        EXIT

      THEN

      OVER 1 AND IF \ n is odd

        2OVER F0< IF \ r1 is negative

          2DROP 2SWAP FABS 2SWAP F^ FNEGATE EXIT \ return -(|r1|^n)

        THEN

      THEN

      2DROP 2SWAP FABS 2SWAP \ we want to return |r1|^r2

    ELSE

      2DROP \ drop copy of r2

    THEN

  THEN

  F^ ;

Enjoy!

- Rob

iaincockburn · 07-30-2023, 10:53 AM

Hi Rob. Please help me. I have a G850VS and I would love to study FORTH using your code. However I keep getting ERR-13 when entering any forth word. I noticed you had similar problems in the past. Any advice please?

PS no issues when loading other IHX apps.

Thanks
Iain

robve · (This post was last modified: 07-31-2023 04:37 PM by robve.)

(07-30-2023 10:53 AM)iaincockburn Wrote: Hi Rob. Please help me. I have a G850VS and I would love to study FORTH using your code. However I keep getting ERR-13 when entering any forth word. I noticed you had similar problems in the past. Any advice please?

Have you added any definitions or is this a problem that occurs right after loading Forth850? I assume you're using the most recent version of Forth850.

I've re-tested the wav and ihx files in the repo. Before loading Forth850, please make sure to allocate sufficient free space in the Monitor first, see the README.

The old error -13 problem was not a big issue, it was caused by a simple typo that I corrected within a day or so to upload a fix.

- Rob

iaincockburn · 08-06-2023, 10:11 AM

Hi Rob - thanks again for your speedy response and my apologies for my tardiness.

Yes, I had read your Readme and set the USER to your recommendation and ran it with R100 as suggested (also tried G100). I also did a hard reset and cleared memory, all to no avail. I think I had loaded your latest version from GITHub (release 1.4).

Please don't worry about it!