FORTH for the SHARP PC-E500 (S)
10-13-2021, 02:38 AM (This post was last modified: 10-16-2021 01:04 AM by robve.)
Post: #32
RE: FORTH for the SHARP PC-E500 (S)
(10-10-2021 01:36 PM)Helix Wrote:  
(10-10-2021 09:11 AM)Klaus Overhage Wrote:  When using WORDS, it happened to me that a BREAK via the ON key crashed the computer.

I have no crash with my Sharp. A Break just causes an exception error.

I believe what may have happened is that the missing SWAP in the strupper example overwrote the start of the dictionary, which contains the break logic. This caused instability. My bad for leaving out the SWAP in the example.

I'll take this opportunity for a quick update.

I spent a bit of time redesigning the core Forth interpreter assembly to improve execution speed. It looks feasible to accelerate Forth500 as follows:

- colon call and return sequence (docol__xt + doret__xt): 22% faster
- fetch-execute (cont__): 13% faster
- deferred word vectoring (dodefer__xt): 23% faster
- constant fetch (docon__xt): 16% faster
- does> execution (does__xt): 17% faster
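
Most of these percentages can be checked against the "versus" cycle counts in the outline further down in this post. The Python sketch below is only a bookkeeping aid: the table and the function name are made up for the example, and it assumes "faster" means the relative reduction in cycles. The colon figure is left out because the outline does not give the old docol__xt count.

```python
# (new, old) cycle counts taken from the comments in the outline.
# The dictionary and percent_fewer_cycles are illustrative names only.
COUNTS = {
    "cont__": (13, 15),
    "dodefer__xt": (20, 26),
    "docon__xt": (19 + 13, 23 + 15),   # routine plus cont__
    "does__xt": (35 + 13, 43 + 15),    # routine plus cont__
}


def percent_fewer_cycles(new, old):
    """Relative reduction in cycles, rounded to a whole percent."""
    return round(100 * (old - new) / old)


for name, (new, old) in COUNTS.items():
    print(f"{name}: {percent_fewer_cycles(new, old)}% fewer cycles")
```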

The redesign uses an internal RAM register to extend 16 bit addresses to 20 bit by presetting the 3rd byte (the high order byte, since the CPU is little endian) to the 11th segment $b of the memory address space. This is cheaper than the current method of converting a 16 bit register to a 20 bit register. These 16 to 20 bit conversions happen a lot, because Forth500 cells are 16 bit whereas machine addresses are 20 bit.

The register assignments remain the same as before:
20 bit register X holds the IP (instruction pointer)
20 bit register U holds the SP (stack pointer)
20 bit register S holds the RP (return stack pointer)
16 bit register BA (A low and B high) holds the TOS (top of stack)

Other registers available:
20 bit register Y
16 bit register I, assigning IL (I low) also sets IH (I high) to zero

Internal RAM is addressed as (N) with 8 bit N. Internal RAM can hold 8, 16 and 20 (24) bit values to load/store to/from registers and to/from external RAM.

To cover 16 bit to 20 bit addresses, we load a 16 bit address into a RAM "register", say (yi) and (yi+1) (two bytes internal RAM). We set and keep (yi+2) to $b (segment). To get the 20 bit address we simply load X from (yi).
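
The segment trick can be modeled in a few lines of Python. This is only an illustrative sketch, not Forth500 code: the InternalRam class and the names store16, load20 and extend_address are made up for the example. Only the offset $36 for yi and the segment value $b come from this post.

```python
# Illustrative model only; class and function names are hypothetical.
YI = 0x36        # internal RAM offset of the (yi) "register"
SEGMENT = 0x0B   # segment $b, kept in the 3rd (high order) byte


class InternalRam:
    """256 bytes of internal RAM, addressed as (N) with 8 bit N."""

    def __init__(self):
        self.bytes = bytearray(256)

    def store16(self, n, value):
        """Store a 16 bit value little endian at (n) and (n+1)."""
        self.bytes[n] = value & 0xFF
        self.bytes[n + 1] = (value >> 8) & 0xFF

    def load20(self, n):
        """Load three bytes little endian from (n) and keep the low 20 bits."""
        raw = self.bytes[n] | (self.bytes[n + 1] << 8) | (self.bytes[n + 2] << 16)
        return raw & 0xFFFFF


def extend_address(ram, cell):
    """Turn a 16 bit Forth cell into a 20 bit address in segment $b."""
    ram.store16(YI, cell)      # overwrite only the two low bytes
    return ram.load20(YI)      # the preset 3rd byte supplies the segment


ram = InternalRam()
ram.bytes[YI + 2] = SEGMENT    # done once at boot and left alone afterwards
print(hex(extend_address(ram, 0x9000)))
```

The point of the design is that the conversion itself costs nothing extra: the segment byte is written once, and every later 20 bit load picks it up for free.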

The changes to the core Forth500 execution words are summarized in this outline:
yi:             equ     $36
zi:             equ     $39
ps:             equ     $b                      ; 11th segment
base_address:   equ     $b0000                  ; 11th segment address
                org     $b9000                  ; $b0000 or $b1000 or $b9000 ...
boot:           ;...
                mv      (!yi+2),!ps             ; Store segment in 3rd byte
                mv      (!zi+2),!ps             ; Store segment in 3rd byte
docol__xt:      mv      i,x             ; 2     ; I holds the IP
                pushs   i               ; 6     ; Push IP (return address)
                pmdf    (!yi),3         ; 4     ; Set new IP
                mv      x,(!yi)         ; 5     = 17 cycles
interp__:       pre_on                  ; cycles = 7 + 13 = 20
                test    ($ff),$08       ; 5     ; Is break pushed?
                jrnz    break__         ; 2/3   ; Break was pushed
;---------------                        ; cycles = 13
cont__:         mvw     (!yi),[x++]     ; 7     ; Set (yi) to new execution token
                jp      (!yi)           ; 6     ; Execute new token
break__:        ;...
doret__xt:      mvw     (!yi),[s++]     ; 7     ; Pop IP (return address)
                mv      x,(!yi)         ; 5     ; X holds the IP
                mvw     (!yi),[x++]     ; 7     ; Fetch new execution token
                jp      (!yi)           ; 6     = 25 cycles versus 30
dovar__xt:      pushu   ba              ; 4     ; Save old TOS
                pmdf    (!yi),3         ; 4     ; Set new TOS
                mv      ba,(!yi)        ; 4     ; to the address of the data
                jr      !cont__         ; 3     = 15+13 cycles versus 15+15
docon__xt:      pushu   ba              ; 4     ; Save TOS
                mv      ba,[(!yi)+3]    ; 12    ; Set new TOS
                jr      !cont__         ; 3     = 19+13 cycles versus 23+15
dodefer__xt:    mvw     (!yi),[(!yi)+3] ; 14
                jp      (!yi)           ; 6     = 20 cycles versus 26
does__xt:       pushu   ba              ; 4     ; Save TOS
                pmdf    (!yi),3         ; 4     ; Set new TOS
                mv      ba,(!yi)        ; 4     ; to the address of the data
                mvw     (!yi),[s]       ; 7     ; The CALL does__xt return short address is the execution token
                mv      i,x             ; 2     ; I holds the IP
                mv      [s],i           ; 5     ; Push old IP
                mv      x,(!yi)         ; 5     ; Set new IP
                jp      !cont__         ; 4     = 35+13 versus 43+15

The pieces of this puzzle fall nicely into place, which is satisfying. I've used some of the more exotic instructions, such as PMDF (pointer modify), which operates on 20 bit addresses in internal RAM, and JP (yi) to jump to the 20 bit address in (yi).

A colon call and return sequence is reduced from 79 to 66 cycles: 4 (JP docol__xt) + 17 (docol__xt) + 20 (interp__) + 25 (doret__xt). This is the execution overhead of a word defined as a colon definition, including the check for a BREAK key press to interrupt execution. In the dictionary, a colon definition starts with a JP docol__xt, a constant starts with JP docon__xt, and a variable starts with JP dovar__xt.
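
The colon overhead total can be tallied as follows. This is just the arithmetic from the cycle comments in the outline; the dictionary and variable names are made up for the example.

```python
# Cycle counts copied from the outline; the names here are illustrative only.
NEW_CYCLES = {
    "JP docol__xt": 4,
    "docol__xt": 17,
    "interp__": 20,    # 7 for the BREAK test plus 13 for cont__
    "doret__xt": 25,
}
OLD_TOTAL = 79         # current colon call and return overhead

new_total = sum(NEW_CYCLES.values())
saved = OLD_TOTAL - new_total
print(f"{new_total} cycles, {saved} fewer than {OLD_TOTAL}")
```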

The fetch-execute overhead per word is reduced from 15 to 13 cycles. This is the cost of fetching the 16 bit execution token of a word defined in assembly and jumping to its machine code at a 20 bit address.
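
To make the roles of cont__, docol__xt and doret__xt concrete, here is a toy Python model of a threaded inner interpreter. It is a sketch for illustration only and not how Forth500 is implemented: tokens are Python objects rather than 16 bit addresses, there is no BREAK check, and the names run, EXIT, dup, mul, SQUARE and FOURTH are made up for this example.

```python
EXIT = object()    # stands in for the token that ends a colon definition


def dup(stack):
    """Primitive: duplicate the top of the data stack."""
    stack.append(stack[-1])


def mul(stack):
    """Primitive: multiply the two top items of the data stack."""
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)


def run(word, stack):
    """Execute `word` to completion using an explicit return stack."""
    rstack = []                  # return stack of saved (body, ip) pairs
    body, ip = [word], 0         # one-token thread that ends after `word`
    while ip < len(body):
        token = body[ip]         # like cont__: fetch the next token ...
        ip += 1                  # ... and advance the IP past it
        if token is EXIT:        # like doret__xt: pop the saved IP
            body, ip = rstack.pop()
        elif isinstance(token, list):    # like docol__xt: push IP, enter body
            rstack.append((body, ip))
            body, ip = token, 0
        else:                    # primitive: run its code directly
            token(stack)
    return stack


SQUARE = [dup, mul, EXIT]        # : SQUARE DUP * ;
FOURTH = [SQUARE, SQUARE, EXIT]  # : FOURTH SQUARE SQUARE ;

print(run(FOURTH, [3]))
```

Every nested colon word pays for one push and one pop of the IP plus a fetch per token, which is why shaving cycles off these three routines matters for overall speed.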

I want to first roll out the floating-point addition, fully working and tested, for the next Forth500 update to the repo in two weeks (or so, because I need to make time for this). I will focus later on implementing further optimizations to speed up Forth500, e.g. using the outline above.

PS (edit): according to the CPU technical manual, PMDF may not operate on a 20 bit pointer stored in internal RAM, but rather on a single byte internal RAM pointer to internal RAM. Oops. Using inc x three times instead raises the colon call and return count from 66 to 71 cycles, versus 79 currently. Still a worthwhile speed improvement to consider.

- Rob

"I count on old friends to remain rational"