
robve · (This post was last modified: 10-16-2021 01:04 AM by robve.)

(10-10-2021 01:36 PM)Helix Wrote:
(10-10-2021 09:11 AM)Klaus Overhage Wrote: When using WORDS, it happened to me that a BREAK via the ON key crashed the computer.

I have no crash with my Sharp. A Break just causes an exception error.

I believe what may have happened is that the missing SWAP in the strupper example overwrote the start of the dictionary, which contains the break logic, and that caused the instability. My mistake for leaving the SWAP out of the example.

I'll take this opportunity for a quick update.

I spent a bit of time redesigning the core Forth interpreter assembly to improve execution speed. It looks feasible to accelerate Forth500 as follows:

- colon call and return sequence (docol__xt + doret__xt): 22% faster
- fetch-execute (cont__): 13% faster
- deferred word vectoring (dodefer__xt): 23% faster
- constant fetch (docon__xt): 16% faster
- does> execution (does__xt): 17% faster

The redesign uses a RAM register to extend 16 bit addresses to 20 bit by presetting its 3rd byte (the high order byte, since the CPU is little endian) to $b, the 11th segment of the memory address space. This is cheaper than the current method of converting a 16 bit register to a 20 bit register. These 16 to 20 bit conversions happen a lot, because Forth500 cells are 16 bit while the machine's addresses are 20 bit.

The register assignments remain the same as before:
20 bit register X holds the IP (instruction pointer)
20 bit register U holds the SP (stack pointer)
20 bit register S holds the RP (return stack pointer)
16 bit registers BA (A low and B high) hold the TOS (top of stack)

Other registers available:
20 bit register Y
16 bit register I, assigning IL (I low) also sets IH (I high) to zero

Internal RAM is addressed as (N) with 8 bit N. Internal RAM can hold 8, 16 and 20 (24) bit values to load/store to/from registers and to/from external RAM.

To extend 16 bit addresses to 20 bit, we load a 16 bit address into a RAM "register", say (yi) and (yi+1) (two bytes of internal RAM). We set (yi+2) to $b (the segment) once and keep it there. To get the 20 bit address we simply load X from (yi).
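The idea can be checked with a small model. This is only a sketch in Python, not Forth500 code: the `iram` array and the helpers `boot`, `store16` and `load20` are hypothetical names, while the offset $36 for yi and the segment $b are taken from the outline in this post.

```python
SEGMENT = 0xB          # "ps" in the outline
YI = 0x36              # internal RAM offset of the (yi) "register", from the outline

iram = bytearray(256)  # internal RAM is addressed with an 8 bit N


def boot():
    """Preset the 3rd byte once, like mv (!yi+2),!ps in the outline."""
    iram[YI + 2] = SEGMENT


def store16(n, value):
    """Store a 16 bit cell in little endian order at (n) and (n+1)."""
    iram[n:n + 2] = (value & 0xFFFF).to_bytes(2, "little")


def load20(n):
    """Read three bytes starting at (n) and keep the low 20 bits."""
    return int.from_bytes(iram[n:n + 3], "little") & 0xFFFFF


boot()
store16(YI, 0x9123)            # a 16 bit Forth address
address = load20(YI)           # the same address extended to 20 bits
print(hex(address))
```

Storing a new 16 bit value never touches the third byte, so the segment only has to be written once at boot.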

The changes to the core Forth500 execution words are summarized in this outline:

Code:

yi:             equ     $36

zi:             equ     $39

ps:             equ     $b                      ; 11th segment

base_address:   equ     $b0000                  ; 11th segment address

;-------------------------------------------------------------------------------

                org     $b9000                  ; $b0000 or $b1000 or $b9000 ...

;-------------------------------------------------------------------------------

                pre_off

boot:           ;...

                mv      (!yi+2),!ps             ; Store segment in 3rd byte

                mv      (!zi+2),!ps             ; Store segment in 3rd byte

                ;...

;-------------------------------------------------------------------------------

docol__xt:      mv      i,x             ; 2     ; I holds the IP

                pushs   i               ; 6     ; Push IP (return address)

                pmdf    (!yi),3         ; 4     ; Set new IP

                mv      x,(!yi)         ; 5     = 17 cycles

;---------------

interp__:       pre_on                  ; cycles = 7 + 13 = 20

                test    ($ff),$08       ; 5     ; Is break pushed?

                pre_off

                jrnz    break__         ; 2/3   ; Break was pushed

;---------------                        ; cycles = 13

cont__:         mvw     (!yi),[x++]     ; 7     ; Set (yi) to new execution token

                jp      (!yi)           ; 6     ; Execute new token

;-------------------------------------------------------------------------------

break__:        ;...

;-------------------------------------------------------------------------------

doret__xt:      mvw     (!yi),[s++]     ; 7     ; Pop IP (return address)

                mv      x,(!yi)         ; 5     ; X holds the IP

                mvw     (!yi),[x++]     ; 7     ; Fetch new execution token

                jp      (!yi)           ; 6     = 25 cycles versus 30

;-------------------------------------------------------------------------------

dovar__xt:      pushu   ba              ; 4     ; Save old TOS

                pmdf    (!yi),3         ; 4     ; Set new TOS

                mv      ba,(!yi)        ; 4     ; to the address of the data

                jr      !cont__         ; 3     = 15+13 cycles versus 15+15

;-------------------------------------------------------------------------------

docon__xt:      pushu   ba              ; 4     ; Save TOS

                mv      ba,[(!yi)+3]    ; 12    ; Set new TOS

                jr      !cont__         ; 3     = 19+13 cycles versus 23+15

;-------------------------------------------------------------------------------

dodefer__xt:    mvw     (!yi),[(!yi)+3] ; 14

                jp      (!yi)           ; 6     = 20 cycles versus 26

;-------------------------------------------------------------------------------

does__xt:       pushu   ba              ; 4     ; Save TOS

                pmdf    (!yi),3         ; 4     ; Set new TOS

                mv      ba,(!yi)        ; 4     ; to the address of the data

                mvw     (!yi),[s]       ; 7     ; The CALL does__xt return short address is the execution token

                mv      i,x             ; 2     ; I holds the IP

                mv      [s],i           ; 5     ; Push old IP

                mv      x,(!yi)         ; 5     ; Set new IP

                jp      !cont__         ; 4     = 35+13 versus 43+15

The pieces of this puzzle fall nicely into place, which is satisfying. I've used some of the more exotic instructions, such as PMDF (pointer modify), which operates on 20 bit addresses in internal RAM, and JP (yi), which jumps to the 20 bit address held in (yi).
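For readers who want to follow the control flow of docol__xt, cont__ and doret__xt without the assembly, here is a toy model in Python. It is a sketch under loud assumptions: Python callables stand in for machine code, the class `ToyForth`, its methods, and the offsets $8000 and $9000 are all made up for illustration, the break test in interp__ is left out, and no cycle counts are modelled. The only details taken from the outline are that tokens are 16 bit, that a colon body starts 3 bytes after its token (past the JP), and that the old IP goes on the return stack.

```python
class ToyForth:
    """Toy threaded-code interpreter; all offsets are 16 bit, within one segment."""

    HALT_CELL = 0x8000                   # hypothetical scratch cell holding the halt token

    def __init__(self):
        self.mem = bytearray(0x10000)    # holds the 16 bit tokens of colon bodies
        self.code = {}                   # offset -> callable standing in for machine code
        self.here = 0x9000               # hypothetical next free offset
        self.ip = 0                      # plays the role of X (IP)
        self.rstack = []                 # plays the role of S (RP)
        self.stack = []                  # plays the role of U (SP) plus BA (TOS)
        self.running = False
        self._store16(self.HALT_CELL, self.primitive(self._halt))

    def _store16(self, addr, value):
        self.mem[addr:addr + 2] = value.to_bytes(2, "little")

    def _halt(self):
        self.running = False

    def primitive(self, fn):
        """Place 'machine code' at the next free offset and return its token."""
        token = self.here
        self.code[token] = fn
        self.here += 1
        return token

    def colon(self, body):
        """Reserve a 3 byte 'JP docol' slot, then lay out the 16 bit tokens."""
        token = self.here
        self.code[token] = lambda: self.docol(token)
        addr = token + 3
        for t in body:
            self._store16(addr, t)
            addr += 2
        self.here = addr
        return token

    def docol(self, token):
        self.rstack.append(self.ip)      # push IP (return address)
        self.ip = token + 3              # new IP points just past the JP

    def doret(self):
        self.ip = self.rstack.pop()      # pop IP (return address)

    def cont(self):
        token = int.from_bytes(self.mem[self.ip:self.ip + 2], "little")
        self.ip += 2                     # like [x++] on a 16 bit cell
        self.code[token]()               # like jp (!yi)

    def run(self, token):
        self.ip = self.HALT_CELL         # returning from the word fetches the halt token
        self.running = True
        self.code[token]()
        while self.running:
            self.cont()


vm = ToyForth()
dup = vm.primitive(lambda: vm.stack.append(vm.stack[-1]))
mul = vm.primitive(lambda: vm.stack.append(vm.stack.pop() * vm.stack.pop()))
ret = vm.primitive(vm.doret)
square = vm.colon([dup, mul, ret])       # : SQUARE DUP * ;
fourth = vm.colon([square, square, ret]) # : FOURTH SQUARE SQUARE ;

vm.stack.append(3)
vm.run(fourth)
print(vm.stack)
```

Every nested colon call goes through docol once and doret once, which is why shaving cycles off those two routines pays off across the whole system.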

A colon-return sequence is reduced to 66 cycles from 79: 4 (JP docol__xt) + 17 (docol__xt) + 20 (interp__) + 25 (doret__xt). This is the execution overhead of a word defined as a colon definition, and it includes the check for a BREAK key press to interrupt execution. Internally, a colon definition in the dictionary starts with a JP docol__xt. A constant starts with JP docon__xt, and a variable starts with JP dovar__xt.
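As a sanity check on the arithmetic, the per-instruction cycle counts in the outline can be added up in a few lines of Python. This is just a sketch: the dictionary layout and names are mine, the numbers are copied from the comments in the listing, and the jrnz is counted at 2 cycles (branch not taken).

```python
# Per-instruction cycle counts copied from the outline, with the stated totals.
routines = {
    "docol__xt":   ([2, 6, 4, 5], 17),
    "break test":  ([5, 2], 7),              # test + jrnz not taken
    "cont__":      ([7, 6], 13),
    "doret__xt":   ([7, 5, 7, 6], 25),
    "dovar__xt":   ([4, 4, 4, 3], 15),
    "docon__xt":   ([4, 12, 3], 19),
    "dodefer__xt": ([14, 6], 20),
    "does__xt":    ([4, 4, 4, 7, 2, 5, 5, 4], 35),
}

# True where the listed instruction counts add up to the stated total.
checks = {name: sum(cycles) == total for name, (cycles, total) in routines.items()}

jp_docol = 4                                 # the JP docol__xt at the start of the word
interp = routines["break test"][1] + routines["cont__"][1]
colon_return = jp_docol + routines["docol__xt"][1] + interp + routines["doret__xt"][1]

print(all(checks.values()), interp, colon_return)
```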

A word fetch-execute overhead is reduced to 13 cycles from 15. This is the overhead for words defined in assembly: each is fetched as a 16 bit address and executed by jumping to its machine code at the corresponding 20 bit address.
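Most of the percentages quoted near the top of this post can be reproduced from the "new versus old" cycle counts in the outline, taking "faster" to mean the fraction of cycles saved, (old - new) / old. That reading is an assumption on my part, and the helper name `percent_saved` is made up for this sketch in Python. The 22% figure for docol__xt + doret__xt is not included because the old cycle count for those two routines alone is not given here.

```python
def percent_saved(new, old):
    """Cycles saved as a rounded percentage of the old cycle count."""
    return round(100 * (old - new) / old)


# (new, old) totals from the outline; the 13 and 15 are the cont__ overhead.
counts = {
    "fetch-execute (cont__)": (13, 15),
    "dodefer__xt":            (20, 26),
    "docon__xt":              (19 + 13, 23 + 15),
    "does__xt":               (35 + 13, 43 + 15),
}

saved = {name: percent_saved(new, old) for name, (new, old) in counts.items()}
for name, pct in saved.items():
    print(f"{name}: {pct}% fewer cycles")
```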

I want to first roll out the floating-point addition, fully working and tested, for the next Forth500 update to the repo in two weeks (or so, because I need to make time for this). I will focus later on implementing further optimizations to speed up Forth500, e.g. using the outline above.

PS (edit): according to the details in the CPU technical manual, PMDF may not perform the operation on a 20 bit pointer stored in internal RAM, but rather on a single byte internal RAM pointer to internal RAM. Oops. Using inc x three times instead raises the colon-return cycle count from 66 to 71, compared with 79 now. Still a worthwhile speed improvement to consider.

- Rob