Post Reply 
monic part 6: the mC programming language and compiler
12-11-2021, 12:29 AM (This post was last modified: 01-15-2022 07:00 AM by F-73P.)
Post: #1
monic part 6: the mC programming language and compiler
Previous: monic part 5: writing C programs on the calculator

In the previous thread I introduced the mC ("monic C") high-level programming language for the monic calculator. In this thread I'll document the development of a compiler that translates the language into bytecode for execution by a virtual machine.

mC does not conform to any C standard. Features include:
  • recursive functions
  • local variables
  • multi-dimensional arrays and pointers
Below is the EBNF grammar (v1.0) for mC (see the previous thread for information on grammars and EBNF notation):

program \(\rightarrow\) code ENDFILE

code \(\rightarrow\) { var-declaration } { function-declaration } main-function-declaration

var-declaration \(\rightarrow\) ARRAY ID dimensions SEMICOLON | [ "*" ] ID SEMICOLON

dimensions \(\rightarrow\) LEFTBRACKET NUM RIGHTBRACKET { LEFTBRACKET NUM RIGHTBRACKET }

function-declaration \(\rightarrow\) ID [ LEFTPARENTHESIS params RIGHTPARENTHESIS ] LEFTBRACE local-declarations statement-list RIGHTBRACE

main-function-declaration \(\rightarrow\) MAIN LEFTBRACE local-declarations statement-list RIGHTBRACE

params \(\rightarrow\) [ param { COMMA param } ]

param \(\rightarrow\) ID { LEFTBRACKET NUM RIGHTBRACKET } | "*" ID

local-declarations \(\rightarrow\) { LOCAL var-declaration }

statement-list \(\rightarrow\) { statement }

statement \(\rightarrow\) compound-stmt | if-stmt | while-stmt | do-stmt | for-stmt| switch-stmt | return-stmt | read-stmt | write-stmt | assignment-stmt | expression-stmt | goto-stmt | label-stmt | nop-stmt

compound-stmt\(\rightarrow\) LEFTBRACE statement-list RIGHTBRACE

if-stmt \(\rightarrow\) IF LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS statement [ ELSE statement ]

while-stmt \(\rightarrow\) WHILE LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS statement

do-stmt \(\rightarrow\) DO statement WHILE LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS SEMICOLON

for-stmt \(\rightarrow\) FOR LEFTPARENTHESIS [ assignment ] SEMICOLON Boolean-expression SEMICOLON [ addditive-expression ] RIGHTPARENTHESIS statement

switch-stmt \(\rightarrow\) SWITCH LEFTPARENTHESIS var RIGHTPARENTHESIS LEFTBRACE { CASE NUM COLON statement-list BREAK SEMICOLON } [ DEFAULT COLON statement-list ] RIGHTBRACE

var \(\rightarrow\) ID { LEFTBRACKET additive-expression RIGHTBRACKET } | "*" ID

return-stmt \(\rightarrow\) RETURN [ expression ] SEMICOLON

read-stmt \(\rightarrow\) READ var SEMICOLON

write-stmt \(\rightarrow\) WRITE expression SEMICOLON

assignment-stmt \(\rightarrow\) assignment SEMICOLON

expression-stmt \(\rightarrow\) [ expression ] SEMICOLON

goto-stmt \(\rightarrow\) GOTO ID SEMICOLON

label-stmt \(\rightarrow\) ID COLON statement

nop-stmt \(\rightarrow\) NOPR SEMICOLON

assignment \(\rightarrow\) var "=" expression

expression \(\rightarrow\) Boolean-expression | additive-expression

Boolean-expression \(\rightarrow\) logical-expression { boolop logical-expression }

logical-expression \(\rightarrow\) additive-expression relop additive-expression | var | TRUE | FALSE

additive-expression \(\rightarrow\) term { addop term }

term \(\rightarrow\) [ NEGATE ] factor { mulop [ NEGATE ] factor }

factor \(\rightarrow\) LEFTPARENTHESIS additive-expression RIGHTPARENTHESIS | NUM | var | ID LEFTPARENTHESIS args RIGHTPARENTHESIS | var "++" | var "--" | "&" ID { LEFTBRACKET additive-expression RIGHTBRACKET }

args \(\rightarrow\) [ expression { COMMA expression } ]

boolop \(\rightarrow\) "and" | "or" | "not"

relop \(\rightarrow\) "<=" | "<" | ">" | ">=" | "==" | "!="

addop \(\rightarrow\) "+" | "-"

mulop \(\rightarrow\) "*" | "/"
Find all posts by this user
Quote this message in a reply
12-12-2021, 06:57 PM
Post: #2
RE: monic part 6: the mC programming language and compiler
Nice work! I really like your project and I'm very much looking forward to learn more about it. Looks like you put a lot of effort into this to cover most of the C language.

I am curious how you plan to handle logical expressions as "short circuit" code or not. I noticed that RETURN, WRITE and a few other constructs allow 'expression' as an argument, but 'factor' for example uses 'additive_expression' instead of 'expression'. To implement RETURN and WRITE this way requires "short circuit logic" to return Boolean values without evaluating unnecessary AND and OR operands. This can be tricky to implement, but not too hard to do using backpatch lists to jump on "true" and "false" conditions to the right VM opcode targets.

I think I spotted a problem with the 'function_declaration' starting with an ID just like 'var_declaration', so there is no way of telling which is which with one token lookahead. Making the '(' and ')' optional in 'function_declaration' makes it even harder to distinguish from 'var_declaration' because you have to look ahead all the way to the first '(' to see if this is a function, not a variable declaration. Perhaps insist that functions should be declared with parenthesis?

Also 'expression' is defined to be either a 'Boolean_expression' or a 'additive_expression' but you have no way of telling which one applies by just looking at the lookahead token.

Why is SWITCH restricted to a 'var' argument and not an `additive_expression'?

I've converted the grammar to Yacc to give it a go with Bison and fixed 'function_declaration' to require parenthesis and MAIN with optional parameters for consistence and just in case if you'd like to pass arguments to the main program. However, this grammar has 47 shift/reduce conflicts which suggests that several (cascading) ambiguities exist:
Code:
%token ARRAY
%token ID
%token NUM
%token MAIN
%token LOCAL
%token RETURN
%token READ
%token WRITE
%token GOTO
%token NOPR
%token SWITCH
%token CASE
%token BREAK
%token DEFAULT
%token FOR
%token IF
%token ELSE
%token WHILE
%token DO
%token NEGATE
%token LE
%token GE
%token EQ
%token NE
%token AND
%token OR
%token NOT
%token INC       // ++
%token DEC       // --
%token TRUE
%token FALSE
%token ENDFILE
%%
program
: code ENDFILE
;
code
: var_declarations function_declarations main_function_declaration
;
var_declarations
: var_declaration var_declarations
|
;
var_declaration
: ARRAY ID dimensions ';'
| '*' ID ';'
| ID ';'
;
dimensions
: '[' NUM ']' dimensions
| '[' NUM ']'
;
function_declarations
: function_declaration function_declarations
|
;
function_declaration
: ID '(' params_opt ')' '{' local_declarations statement_list '}'
;
main_function_declaration
: MAIN '(' params_opt ')' '{' local_declarations statement_list '}'
;
params
: param ',' params
| param
;
params_opt
: params
|
;
param
: ID nums
| '*' ID
;
nums
: '[' NUM ']' nums
|
;
local_declarations
: LOCAL var_declaration local_declarations
|
;
statement_list
: statement statement_list
|
;
statement
: compound_stmt
| if_stmt
| while_stmt
| do_stmt
| for_stmt| switch_stmt
| return_stmt
| read_stmt
| write_stmt
| assignment_stmt
| expression_stmt
| goto_stmt
| label_stmt
| nop_stmt 
;
compound_stmt
: '{' statement_list '}'
;
if_stmt
: IF '(' Boolean_expression ')' statement
| IF '(' Boolean_expression ')' statement ELSE statement
;
while_stmt
: WHILE '(' Boolean_expression ')' statement
;
do_stmt
: DO statement WHILE '(' Boolean_expression ')' ';'
;
for_stmt
: FOR '(' assignment_opt ';' Boolean_expression ';' additive_expression_opt ')' statement 
;
switch_stmt
: SWITCH '(' var ')' '{' cases ';' default_opt '}' 
;
cases
: CASE NUM ':' statement_list BREAK ';' cases
|
;
default_opt
: DEFAULT ':' statement_list
|
;
var
: ID indices
| '*' ID 
;
indices
: '[' additive_expression ']' indices
|
;
return_stmt
: RETURN expression_opt ';' 
;
read_stmt
: READ var ';' 
;
write_stmt
: WRITE expression ';' 
;
assignment_stmt
: assignment ';'
;
expression_stmt
: expression_opt ';'
;
goto_stmt
: GOTO ID ';'
;
label_stmt
: ID ':' statement
;
nop_stmt
: NOPR ';'
;
assignment
: var '=' expression 
;
assignment_opt
: assignment
|
;
expression
: Boolean_expression
| additive_expression
;
expression_opt
: expression
|
;
Boolean_expression
: logical_expression boolop Boolean_expression
| logical_expression
;
logical_expression
: additive_expression relop additive_expression
| var
| TRUE
| FALSE 
;
additive_expression
: term addop additive_expression
| term
;
additive_expression_opt
: additive_expression
|
;
term
: neg_opt factor mulop term
| neg_opt factor
;
neg_opt
: NEGATE
|
;
factor
: '(' additive_expression ')'
| NUM
| var
| ID '(' args_opt ')'
| var INC
| var DEC
| '&' ID indices
;
args
: expression ',' args
| expression
;
args_opt
: args
|
;
boolop
: AND
| OR
| NOT
;
relop
: LE
| '<'
| '>'
| GE
| EQ
| NE
;
addop
: '+'
| '-'
;
mulop
: '*'
| '/'
;
%%
Run bison -v grammar.y to produce a grammar.output report.

I didn't have time to verify the conflicts to see what constructs might actually become problematic to implement as a parser based on these two conceptually identical grammar versions, but the grammar is not LR(1) and LALR. Note that there should only be one shift/reduce conflict for the IF-ELSE ambiguity that always arrises in C. I haven't checked the grammar with ANTLR, which may reveal more useful information about potential grammar ambiguity issues to implement a recursive-descent parser, since ANTLR is a LL(1) generator. ANTLR also accepts grammars beyond LL(1).

Hope this is helpful. Looks like a fun and useful project to work on.

- Rob

"I can count on my friends" -- HP 71B,Prime|Ti VOY200,Nspire CXII CAS|Casio fx-CG50...|Sharp PC-G850,E500,2500,1500,14xx,13xx,12xx...
Visit this user's website Find all posts by this user
Quote this message in a reply
12-13-2021, 03:39 AM (This post was last modified: 12-17-2021 04:11 AM by F-73P.)
Post: #3
RE: monic part 6: the mC programming language and compiler
(12-12-2021 06:57 PM)robve Wrote:  Nice work! I really like your project and I'm very much looking forward to learn more about it. Looks like you put a lot of effort into this to cover most of the C language.

Thanks, I've started using a development board with a 4.3" display and i.MX RT1062 processor (Cortex-M7, 600MHz, 4MB FLASH, 32MB RAM) and am rewriting the firmware, so I thought this would be a good time to document the project. Here's a screen shot of a test program:

[Image: 51743164643_c0f2edc8f2.jpg]

Some aspects have been very challenging but I'm learning a lot and it has been fun too. I would like to add support for a programming language similar to FORTRAN as well:

[Image: 51742101942_dc63d72959.jpg][Image: 51743812475_47cf6ec3c2.jpg]

(12-12-2021 06:57 PM)robve Wrote:  Why is SWITCH restricted to a 'var' argument and not an `additive_expression'?
Thanks, I'll fix that. I've only ever used a variable argument with SWITCH and just didn't think of it. Great to have a fresh pair of eyes go over the grammar Smile

(12-12-2021 06:57 PM)robve Wrote:  I am curious how you plan to handle logical expressions as "short circuit" code or not. I noticed that RETURN, WRITE and a few other constructs allow 'expression' as an argument, but 'factor' for example uses 'additive_expression' instead of 'expression'. To implement RETURN and WRITE this way requires "short circuit logic" to return Boolean values without evaluating unnecessary AND and OR operands. This can be tricky to implement, but not too hard to do using backpatch lists to jump on "true" and "false" conditions to the right VM opcode targets.
To be honest I never heard about "short circuit" code until you mentioned it Smile. I thought to just evaluate the entire Boolean expression, but I now see programs can run faster with "short circuit" code. Something to think about!

Thanks for pointing out inconsistencies in the use of 'expression' and 'additive-expression'. I'll amend the grammar and post it soon.

(12-12-2021 06:57 PM)robve Wrote:  I think I spotted a problem with the 'function_declaration' starting with an ID just like 'var_declaration', so there is no way of telling which is which with one token lookahead. Making the '(' and ')' optional in 'function_declaration' makes it even harder to distinguish from 'var_declaration' because you have to look ahead all the way to the first '(' to see if this is a function, not a variable declaration. Perhaps insist that functions should be declared with parenthesis?
Yes you are right. I wanted to make parentheses optional for functions without parameters to reduce keystrokes but your suggestion is better, and I'll employ an extra token lookahead to determine if it's a function or variable declaration.

(12-12-2021 06:57 PM)robve Wrote:  Also 'expression' is defined to be either a 'Boolean_expression' or a 'additive_expression' but you have no way of telling which one applies by just looking at the lookahead token.
Well spotted, I'll post an amended grammar soon.

(12-12-2021 06:57 PM)robve Wrote:  I've converted the grammar to Yacc to give it a go with Bison and fixed 'function_declaration' to require parenthesis and MAIN with optional parameters for consistence and just in case if you'd like to pass arguments to the main program
Great idea, I thought to only use "read" to bring in values from the stack. Your way allows one to write e.g. "main(a,b)" and save two read statements.

(12-12-2021 06:57 PM)robve Wrote:  Hope this is helpful. Looks like a fun and useful project to work on.
Thanks Rob, this is very helpful. I really appreciate you taking the time to check the grammar and provide valuable feedback, which I'll use to improve the project.
Find all posts by this user
Quote this message in a reply
12-16-2021, 10:14 AM (This post was last modified: 12-17-2021 03:25 AM by F-73P.)
Post: #4
RE: monic part 6: the mC programming language and compiler
Below is v1.1 of the EBNF grammar, which hopefully solves the problems Rob found in the previous version:

program \(\rightarrow\) code ENDFILE

code \(\rightarrow\) { var-declaration } { function-declaration } main-function-declaration

var-declaration \(\rightarrow\) ARRAY ID dimensions SEMICOLON | [ "*" ] ID SEMICOLON

dimensions \(\rightarrow\) LEFTBRACKET NUM RIGHTBRACKET { LEFTBRACKET NUM RIGHTBRACKET }

function-declaration \(\rightarrow\) ID LEFTPARENTHESIS params RIGHTPARENTHESIS LEFTBRACE local-declarations statement-list RIGHTBRACE

main-function-declaration \(\rightarrow\) MAIN LEFTPARENTHESIS params RIGHTPARENTHESIS LEFTBRACE local-declarations statement-list RIGHTBRACE

params \(\rightarrow\) [ param { COMMA param } ]

param \(\rightarrow\) ID { LEFTBRACKET NUM RIGHTBRACKET } | "*" ID

local-declarations \(\rightarrow\) { LOCAL var-declaration }

statement-list \(\rightarrow\) { statement }

statement \(\rightarrow\) compound-stmt | if-stmt | while-stmt | do-stmt | for-stmt| switch-stmt | return-stmt | read-stmt | write-stmt | assignment-stmt | expression-stmt | goto-stmt | label-stmt | nop-stmt

compound-stmt\(\rightarrow\) LEFTBRACE statement-list RIGHTBRACE

if-stmt \(\rightarrow\) IF LEFTPARENTHESIS expression RIGHTPARENTHESIS statement [ ELSE statement ]

while-stmt \(\rightarrow\) WHILE LEFTPARENTHESIS expression RIGHTPARENTHESIS statement

do-stmt \(\rightarrow\) DO statement WHILE LEFTPARENTHESIS expression RIGHTPARENTHESIS SEMICOLON

for-stmt \(\rightarrow\) FOR LEFTPARENTHESIS [ assignment ] SEMICOLON [ expression ] SEMICOLON [ expression ] RIGHTPARENTHESIS statement

switch-stmt \(\rightarrow\) SWITCH LEFTPARENTHESIS expression RIGHTPARENTHESIS LEFTBRACE { CASE NUM COLON statement-list BREAK SEMICOLON } [ DEFAULT COLON statement-list ] RIGHTBRACE

var \(\rightarrow\) ID { LEFTBRACKET expression RIGHTBRACKET } | "*" ID

return-stmt \(\rightarrow\) RETURN [ expression ] SEMICOLON

read-stmt \(\rightarrow\) READ var SEMICOLON

write-stmt \(\rightarrow\) WRITE expression SEMICOLON

assignment-stmt \(\rightarrow\) assignment SEMICOLON

expression-stmt \(\rightarrow\) [ expression ] SEMICOLON

goto-stmt \(\rightarrow\) GOTO ID SEMICOLON

label-stmt \(\rightarrow\) ID COLON statement-list

nop-stmt \(\rightarrow\) NOPR SEMICOLON

assignment \(\rightarrow\) var "=" expression

expression \(\rightarrow\) simple-expression { boolop simple-expression }

simple-expression \(\rightarrow\) additive-expression [ relop additive-expression ]

additive-expression \(\rightarrow\) term { addop term }

term \(\rightarrow\) [ NEGATE ] factor { mulop [ NEGATE ] factor }

factor \(\rightarrow\) LEFTPARENTHESIS expression RIGHTPARENTHESIS | NUM | var | var "++" | var "--" | "&" ID { LEFTBRACKET expression RIGHTBRACKET } | "true" | "false" | ID LEFTPARENTHESIS args RIGHTPARENTHESIS

args \(\rightarrow\) [ expression { COMMA expression } ]

boolop \(\rightarrow\) "and" | "or" | "not"

relop \(\rightarrow\) "<=" | "<" | ">" | ">=" | "==" | "!="

addop \(\rightarrow\) "+" | "-"

mulop \(\rightarrow\) "*" | "/"

I've written a parser recogniser for this grammar. Below are the C files ("display.c" is not required, it was used to test some display routines for the calculator):

Code:

//compiler.c v1.01

//includes

#include <stdint.h>

//external variables
extern char *source;
extern uint16_t sourceLines;

//prototypes

void parser_recogniser_mC(void);
void gen_sourceIndexArray(void);
void refresh_display(char *source, uint16_t sourceLines);

//functions

int main (void) {    
    parser_recogniser_mC();
  refresh_display(source,sourceLines);    
}

Code:

//scanner.c v1.01

//DFA scanner - read the source code and return the next token to the parser

//includes

#include <stdbool.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

#include "definitions.h"

//structs

struct { //lookup table of reserved words
char* str;
TokenType tok;
} reservedWords[RESERVED] = {{"main",MAINTOKEN},{"if",IF},{"else",ELSE},{"do",DO},{"while",WHILE},{"for",FOR}, {"return",RETURN},
                             {"read",READ},{"write",WRITE},{"and",AND},{"or",OR},{"not",NOT},{"true",TRUE},
                             {"false",FALSE},{"local",LOCAL},{"array",ARRAY},{"switch",SWITCH},
                             {"case",CASE},{"break",BREAK},{"default",DEFAULT},{"goto",GOTO},{"nop",NOPR}};
//global variables    
                                                         
char source1[SOURCE+1]= {
                         "/*factorial program*/\n" //22 characters, start=0, end=21 
                       "\n" //1 character (23), start=22, end=22
                                             "main() {\n" //9 characters (32), start=23, end=31
                                             "  read x;\n" //10 characters (42), start=32, end=41                            
                                             "  factorial=1;\n" //15 characters (57), start=42, end=56
                                             "  while(x>0) {\n" //15 characters (72), start=57, end=71
                                             "    factorial=factorial*x;\n" //27 characters (99), start=72, end=98
                                             "    x--;\n" //9 characters (108), start=99, end=107
                                             "  }\n" //4 characters (112), start=108, end=111
                                             "  write factorial;\n" //19 characters (131), start=112, end=130
                                             "}\n" //2 characters (133), start=131, end=132                              
                     };
uint16_t source1Lines=11;
                                            
char source2[SOURCE+1]= {
                         "/*factorial program*/\n" //22 characters, start=0, end=21 
                       "\n" //1 character (23), start=22, end=22
                                             "main() {\n" //9 characters (32), start=23, end=31
                                             "  read x;\n" //10 characters (42), start=32, end=41                            
                                             "  factorial=1;\n" //15 characters (57), start=42, end=56
                                             "  while(x>0) {\n" //15 characters (72), start=57, end=71
                                             "    factorial=factorial*x;\n" //27 characters (99), start=72, end=98
                                             "    x-;\n" //8 characters (107), start=99, end=106 /*****ERROR*****/
                                             "  }\n" //4 characters (111), start=107, end=110
                                             "  write factorial;\n" //19 characters (130), start=111, end=129
                                             "}\n" //2 characters (132), start=130, end=131                              
                     };
uint16_t source2Lines=11;

char *source;
uint16_t sourceLines;
uint32_t sourceIndex=0; //current position in source
uint16_t lineno=0; //current line number in source    
char lexemeString[MAXLEXEMELEN+1];
//uint32_t sourceIndexArray[MAXLINENO];

//prototypes
                                         
TokenType reservedLookup(char *lexemeString);

//functions
                                         
//void gen_sourceIndexArray(void) { //generate sourceIndexArray, where sourceIndexArray[i] = source index at beginning 
//                                    //of program line i
//    source=source1;    
//    
//    while(source[sourceIndex]) {
//        sourceIndexArray[lineno]=sourceIndex;        
//        while((source[sourceIndex] != '\n') && source[sourceIndex]) sourceIndex++;        
//        sourceIndex++;
//        lineno++;        
//    }
//    
//    sourceIndex=0;
//  lineno=0;    
//}

TokenType getToken(void) { //returns the next token in source    
    char c;    //current character in source
    
    uint16_t lexemeStringIndex = 0; //index for storing lexeme into lexemeString
    _Bool save; //flag to indicate save to lexemeString
    
    TokenType currentToken; //holds current token to be returned to the parser    
  
    StateType state = START; //current state - always begins at START
    
    source=source2;
    sourceLines=source2Lines;

  while (state!=DONE) {
        c=source[sourceIndex++]; //next character in program
        save = true;
        switch (state) { //begin switch (state)
            case START:
                if (isdigit(c))
                    state = INNUM;
         else if (isalpha(c))
           state = INID;
                 else if (c == '<')
                     state = INLESSTHAN;
                 else if (c == '>')
                     state = INGREATERTHAN;
                 else if (c == '=')
                     state = INEQUAL;
                 else if (c == '!')
                     state = INNOTEQUAL;
                 else if (c == '/') {
                     save = false;
                     state = INDIV;
                 }
                 else if (c =='+') {
                     state = INPLUS;
                 }
                 else if (c =='-') {
                     state = INMINUS;
                 }
         else if ((c == ' ') || (c == '\t') || (c == '\n')) {
                     save = false;
                     if (c == '\n') lineno++;
                 }
         else
         { state = DONE;
           switch (c) {
                         case ENDFILE:
               save = false;
               currentToken = ENDFILE;
               break;
             case '*':
               currentToken = TIMES;
               break;
                         case ':':
               currentToken = COLON;
               break;
             case ';':
               currentToken = SEMI;
               break;
             case ',':
                             currentToken = COMMA;
               break;
             case '(':
               currentToken = LPAREN;
               break;
             case ')':
               currentToken = RPAREN;
               break;
                         case '[':
               currentToken = LBRACKET;
               break;
             case ']':
               currentToken = RBRACKET;
               break;
                         case '{':
               currentToken = LBRACE;
               break;
             case '}':
               currentToken = RBRACE;
               break;
                         case '&':
               currentToken = ADDR;
               break;
             default:
                 //insert error code
                 state = DONE;
           }
         }
         break;
            case INNUM:
         if (!isdigit(c)) { //backup in the input
                     if(c == 'e') {
                         state=INEXP;
                     }
           else if(c != '.') {
                         sourceIndex--;
                         save = false;
                         state = DONE;
                         currentToken = NUM;
                     }
         }
         break;
            case INEXP:
                state=INNUM;
                break;
      case INID: //ID is letters/digits (cannot start with a digit)
         if (!isalpha(c)) { //backup in the input
                     if (!isdigit(c)) {
                         sourceIndex--;
                         save = false;
                         state = DONE;
                         currentToken = ID;
                     }
         }
         break;
            case INLESSTHAN:
              state = DONE;
              if (c == '=') currentToken = LTE;
              else {
          save = false;
                    sourceIndex--;
                    currentToken = LT;
        }
              break;
            case INGREATERTHAN:
              state = DONE;
              if (c == '=') currentToken = GTE;
              else {
          save = false;
                    sourceIndex--;
                    currentToken = GT;
        }
              break;
            case INEQUAL:
                state = DONE;
              if (c == '=') currentToken = EQ;
              else {
                    sourceIndex--;
                    save=false;
                    currentToken = ASSIGN;
                }
                break;
            case INNOTEQUAL:
                state = DONE;
              if (c == '=') currentToken = NEQ;
              else {
                    //insert error code
                    state = DONE;
                }
              break;
            case INDIV:
        save = false;
              if (c == '*') state = INCOMMENT;
              else {
                    state = DONE;
                    sourceIndex--;
                    currentToken = DIV;                        
                    if (lexemeStringIndex < MAXLEXEMELEN) {
                        lexemeString[lexemeStringIndex++] = '/';
                        lexemeString[lexemeStringIndex] = '\0';
                    }
                    else {
                        //insert error code
                    }    
        }
              break;
            case INPLUS:
        state = DONE;
              if (c == '+') currentToken = INC;
              else {
          save = false;
                    sourceIndex--;
                    currentToken = PLUS;
        }
              break;
            case INMINUS:
        state = DONE;
              if (c == '-') currentToken = DEC;
              else {
          save = false;
                    sourceIndex--;
                    currentToken = MINUS;
        }
              break;
      case INCOMMENT:
              save = false;
              if (c == ENDFILE) { 
                    state = DONE;
                  currentToken = ENDFILE;
              }
              else if (c == '*')
                  state = OUTCOMMENT;
                else if (c == '\n') lineno++;
              break;
            case OUTCOMMENT:
                save = false;
                if (c == '/') state = START;
              else { //backup in the input
           sourceIndex--;
           state = INCOMMENT;
        }
                break;
      case DONE:
      default: //should never happen
               //insert error code
                 state = DONE;
         break;
    } //end switch (state)

        if (save) {
            if (lexemeStringIndex < MAXLEXEMELEN) lexemeString[lexemeStringIndex++] = c;
      else {
                //insert error code
            }                
        }

    if (state == DONE) {
            lexemeString[lexemeStringIndex] = '\0';
       if (currentToken == ID) currentToken = reservedLookup(lexemeString);
        }
  }
    return currentToken;
}

TokenType reservedLookup (char *lexemeString) { //lookup an identifier to see if it is a reserved word (linear search)
    uint16_t i;
    for (i=0;i<RESERVED;i++) {
        if (!strcmp(lexemeString,reservedWords[i].str)) return reservedWords[i].tok;
    }
    return ID;
}

uint32_t get_sourceIndex(void) {
    return sourceIndex;
}

void set_sourceIndex(uint32_t sourceIndexStart) {
  sourceIndex=sourceIndexStart;    
}

Code:

//parser.c v1.01
//parser recogniser for mC

//includes

#include <stdint.h>
#include <stdbool.h>
#include "definitions.h"

//definitions

#define match_check_advance(token) match(token);if(error)return;advance();  

//external variables

extern char lexemeString[MAXLEXEMELEN+1];

//global variables

TokenType token;
_Bool error=false;
_Bool flag=true;
uint32_t sourceIndexPrevious; //source index at beginning of token
uint32_t sourceIndexNext; //source index after token if syntax error

//prototypes

void code(void);
void var_declaration(void);
void dimensions(void);
void function_declaration(void);
void main_function_declaration(void);
void params(void);
void param(void);
void local_declarations(void);
void statement_list(void);
void statement(void);
void compound_stmt(void);
void if_stmt(void);                
void while_stmt(void);
void do_stmt(void);
void for_stmt(void);
void switch_stmt(void);
void var(void);
void return_stmt(void);
void read_stmt(void);                
void write_stmt(void);
void assignment_stmt(void);
void expression_stmt(void);
void goto_stmt(void);
void label_stmt(void);
void nop_stmt(void);
void assignment(void);
void expression(void);
void simple_expression(void);
void additive_expression(void);
void term(void);
void factor(void);
void args(void);

void advance(void);
TokenType getToken(void);
void match(TokenType);

uint32_t get_sourceIndex(void);
void set_sourceIndex(uint32_t);

//functions

void parser_recogniser_mC(void) {    
    advance();
    code();
  if(error) { //display error message        
        sourceIndexNext=get_sourceIndex();        
    }  
}

void code(void) {    
    while((token==ARRAY||token==TIMES||token==ID) && flag) {
        var_declaration();if(error)return;
    }
    while(token==ID) {
        function_declaration();if(error)return;
    }        
    main_function_declaration();    
}

void var_declaration(void) {
    uint32_t sourceIndexStart;
    switch(token) {
        case ARRAY:
            advance();
          match_check_advance(ID);
          dimensions();if(error)return;
          match_check_advance(SEMI);
            break;
        case TIMES: //pointer
            advance();
          match_check_advance(ID);
          match_check_advance(SEMI);
            break;
        default: //ID - extra lookahead required to determine whether variable or function declaration 
            sourceIndexStart=get_sourceIndex(); //save current sourceIndex                      
          advance();
          switch(token) {                
            case SEMI: //variable declaration                
        advance();            
                break;            
            default: //function declaration
                match(LPAREN); if(error)return;
              token=ID;
              set_sourceIndex(sourceIndexStart);// restore sourceIndex
              flag=false;
          }
    }
}

void dimensions(void) {
    match_check_advance(LBRACKET);
    match_check_advance(NUM);
    match_check_advance(RBRACKET);
    while(token==LBRACKET){
        advance();
        match_check_advance(NUM);
        match_check_advance(RBRACKET);
    }
}

void function_declaration(void){    
    match_check_advance(ID);
    match_check_advance(LPAREN);
    params();if(error)return;
    match_check_advance(RPAREN);    
    match_check_advance(LBRACE);
    local_declarations();if(error)return;
    statement_list();if(error)return;    
  match_check_advance(RBRACE);    
}

void main_function_declaration(void){    
    match_check_advance(MAINTOKEN);
    match_check_advance(LPAREN);
    params();if(error)return;
    match_check_advance(RPAREN);    
    match_check_advance(LBRACE);
    local_declarations();if(error)return;
    statement_list();if(error)return;    
  match_check_advance(RBRACE);
}

void params(void) {
    if (token==TIMES || token==ID) {        
        param();if(error)return;    
        while (token==COMMA) {
            advance();            
            param();if(error)return;    
        }        
    }    
}

void param(void){
    switch(token) {
        case TIMES: //pointer
            advance();
          match_check_advance(ID);
            break;
        default: //ID
            match_check_advance(ID);
          while(token==LBRACKET) {
                advance();
                match_check_advance(NUM);
                match_check_advance(RBRACKET);
            }        
    }    
}

void local_declarations(void){
    while(token==LOCAL){
        advance();
        var_declaration();if(error)return; 
    }
}

void statement_list(void) { 
    while(token!=RBRACE) {
        statement();if(error)return;
    }        
}

void statement(void) {
  uint32_t sourceIndexStart;
  TokenType lookaheadToken;
    
    switch(token) {
        case LBRACE:
            compound_stmt();          
            break;
        case IF:
            if_stmt();    
            break;
        case WHILE:
            while_stmt();
          break;
    case DO:
            do_stmt();
          break;
    case FOR:
            for_stmt();
          break;
        case SWITCH:
            switch_stmt();
          break;
        case RETURN:
            return_stmt();
            break;
        case READ:
            read_stmt();    
            break;
        case WRITE:
            write_stmt();    
            break;        
        case GOTO:
            goto_stmt();    
            break;
        case NOPR:
            nop_stmt();    
            break;
        case ID: //extra lookahead required to determine whether assignment, label or expression statement
            sourceIndexStart=get_sourceIndex(); //save current sourceIndex                      
          advance();
          lookaheadToken=token;
          token=ID;
          set_sourceIndex(sourceIndexStart);// restore sourceIndex
          switch(lookaheadToken) {                
            case ASSIGN: //assignment statement                
                assignment_stmt();
                break;
            case COLON: //label statement            
                label_stmt();
            default: //expression                 
                expression_stmt();
          }            
            break;
        default: //expression statement            
      expression_stmt();    
    }
}

void compound_stmt(void) {
    match_check_advance(LBRACE);
    statement_list();if(error)return;
    match_check_advance(RBRACE);
}

void if_stmt(void) {
    match_check_advance(IF);
    match_check_advance(LPAREN);
    expression();if(error)return;
    match_check_advance(RPAREN);
    statement();if(error)return;
    if (token==ELSE) {
        advance();
        statement();
    }    
}

void while_stmt(void) {
    match_check_advance(WHILE);
    match_check_advance(LPAREN);
    expression();if(error)return;
    match_check_advance(RPAREN);
    statement();if(error)return;
}

void do_stmt(void) {
    match_check_advance(DO);
    statement();if(error)return;
    match_check_advance(WHILE);
    match_check_advance(LPAREN);
    expression();if(error)return;
    match_check_advance(RPAREN);    
}

void for_stmt(void){
    match_check_advance(FOR);
    match_check_advance(LPAREN);
    if(token==TIMES || token==ID) assignment();if(error)return;
    match_check_advance(SEMI);
    if(token==MINUS || token==LPAREN || token==NUM || token==TIMES || token==ID || token==ADDR || 
         token==TRUE || token==FALSE) expression();if(error)return;
    match_check_advance(SEMI);
    if(token==MINUS || token==LPAREN || token==NUM || token==TIMES || token==ID || token==ADDR || 
         token==TRUE || token==FALSE) expression();if(error)return;
    match_check_advance(RPAREN);
    statement();if(error)return;
}

void switch_stmt(void){
    match_check_advance(SWITCH);
    match_check_advance(LPAREN);
    expression();if(error)return;
    match_check_advance(RPAREN);
    match_check_advance(LBRACE);
    while(token==CASE) {
        advance();
        match_check_advance(NUM);
        match_check_advance(COLON);
        statement_list();if(error)return;
        match_check_advance(BREAK);
        match_check_advance(SEMI);    
    }
    if (token==DEFAULT) {
        advance();
        match_check_advance(COLON);
        statement_list();if(error)return;
    }    
    match_check_advance(RBRACE);    
}

void return_stmt(void){
    match_check_advance(RETURN);
    if(token==MINUS || token==LPAREN || token==NUM || token==TIMES || token==ID || token==ADDR || 
         token==TRUE || token==FALSE) expression();if(error)return;
  match_check_advance(SEMI);    
}

void read_stmt(void) {
    match_check_advance(READ);
    var();if(error)return;
    match_check_advance(SEMI);
}

void var(void){
    switch(token){
        case TIMES: //pointer
            break;
        default:
            match_check_advance(ID);
          while(token==LBRACKET){
                advance();
                expression();if(error)return;
                match_check_advance(RBRACKET);
            }
    }    
}

void write_stmt(void) {
    match_check_advance(WRITE);
    expression();if(error)return;
    match_check_advance(SEMI);
}

void assignment_stmt(void) {
    assignment();if(error)return;
    match_check_advance(SEMI);
}

void expression_stmt(void) {
    if(token==MINUS || token==LPAREN || token==NUM || token==TIMES || token==ID || token==ADDR || 
         token==TRUE || token==FALSE) expression();if(error)return;
    match_check_advance(SEMI);
}

void goto_stmt(void) {
    match_check_advance(SEMI);
    match_check_advance(ID);
    match_check_advance(SEMI);
}

void label_stmt(void) {
    match_check_advance(ID);
    match_check_advance(COLON);
    statement_list();
}

void nop_stmt(void) {
    match_check_advance(NOPR);
    match_check_advance(SEMI);
}

void assignment(void) {
    var();if(error)return;
    match_check_advance(ASSIGN);
    expression();
}

void expression(void) {
    simple_expression();if(error)return;
    while(token==AND || token==OR || token==NOT) {
        advance();
        simple_expression();if(error)return;
    }
}

void simple_expression(void){
    additive_expression();if(error)return;
    if (token==LTE || token==LT || token==GT || token==GTE || token==EQ || token==NEQ) {
        advance();
        additive_expression();
    }
}

void additive_expression(void) {
    term();if(error)return;
    while(token==PLUS || token==MINUS) {
        advance();
        term();if(error)return;
    }
}

void term(void) {
    if(token==MINUS) {
        advance();
    }
    factor();if(error)return;
    while(token==TIMES || token==DIV) {
        advance();
        if(token==MINUS) {
          advance();
      }
        factor();if(error)return;
    }
}

void factor(void) {
    uint32_t sourceIndexStart;
    
    switch(token) {
        case LPAREN:
            advance();
          expression();if(error)return;
          match_check_advance(RPAREN);
            break;
        case NUM:
            advance();
            break;
        case ID: //extra lookahead required to determine whether function call or one of var,var++,var--
            sourceIndexStart=get_sourceIndex(); //save current sourceIndex
            advance();
          if(token==LPAREN) { //function call
                args();if(error)return;
                match_check_advance(RPAREN);
            }
            else { //var, var++, var--
                set_sourceIndex(sourceIndexStart);// restore sourceIndex
                token=ID;
                var();if(error)return;
                switch(token) {
                    case INC:
                        advance();
                        break;
                    case DEC:
                        advance();
                        break;
                    default:
                        __asm volatile ("nop");
                }
            }                
            break;
        case TRUE:
            advance();
            break;
        case FALSE:
            advance();
            break;
        default: //address operator
            match_check_advance(ADDR);            
          match_check_advance(ID);
            while(token==LBRACKET){
                    advance();
                    expression();if(error)return;
                    match_check_advance(RBRACKET);
            }
    }    
}

void args(void) {
    if(token==MINUS || token==LPAREN || token==NUM || token==TIMES || token==ID || token==ADDR || 
    token==TRUE || token==FALSE) {
        expression();if(error)return;
        while(token==COMMA) {
            advance();
            expression();if(error)return;
        }
  }    
}

void advance(void) {
    sourceIndexPrevious=get_sourceIndex(); //save current sourceIndex    
    token=getToken();
}

void match(TokenType expected) {    
    if (expected!=token) error=true;
}

Code:

//display.c v1.00

//includes

#include "definitions.h"

//definitions

#define ADAFRUITFONTWIDTH 5
#define ADAFRUITFONTHEIGHT 7
#define CFONTSIZE 2
#define CPROGRAMXSTART 0
#define CPROGRAMYSTART 0

#include <stdint.h>

//external variables

extern _Bool error;
extern uint16_t lineno;
extern uint32_t sourceIndexPrevious; 
extern uint32_t sourceIndexNext; 

//global variables

uint32_t sourceIndexArray[MAXLINENO];

//prototypes

void gen_sourceIndexArray(char *source);

//functions

void gen_sourceIndexArray(char *source) { //generate sourceIndexArray, where sourceIndexArray[i] = source index at beginning
    //of program line i
    uint16_t lineNo = 0;
    uint16_t sourceIndex = 0;

    while(source[sourceIndex]) {
        sourceIndexArray[lineNo]=sourceIndex;
        while((source[sourceIndex] != '\n') && source[sourceIndex]) sourceIndex++;
        sourceIndex++;
        lineNo++;
    }
    sourceIndexArray[lineNo]=sourceIndex;
}

void refresh_display(char *source, uint16_t sourceLines) {
    
  uint16_t xCoord = CPROGRAMXSTART;
  uint16_t yCoord = CPROGRAMYSTART;
    uint16_t xCoordError;
    
    uint16_t lineNo=0;    
    uint16_t sourceIndexStart;
  char sourceChar;

  gen_sourceIndexArray(source);    
    
//    while(source[sourceIndex]) {
//        
//        //xCoord = displayValue(xCoord, yCoord, lineNo),colour);
//        xCoord += ADAFRUITFONTWIDTH*size; 
//        sourceIndexStart = sourceIndex;
//        while((source[sourceIndex] != '\n') && source[sourceIndex]) sourceIndex++;
//        source[sourceIndex] = 0;
//        //printString(xCoord, yCoord, &source[sourceIndexStart], AQUA, size);
//        if (error && lineNo==lineno+1) {
//            //determine x coordinate of offending token            
//            xCoordError=xCoord+(sourceIndexPrevious-sourceIndexStart)*ADAFRUITFONTWIDTH*size;
//            sourceChar = source[sourceIndexNext]; 
//            source[sourceIndexNext] = 0; //set to 0 to mark end of string to be displayed in red
//            //printString(xCoordError, yCoord, &source[sourceIndexPrevious], RED, size);
//            source[sourceIndexNext] = sourceChar; //restore original source code character                             
//        }
//        source[sourceIndex] = '\n';
//        sourceIndex++;
//        lineNo++;
//        xCoord = CPROGRAMXSTART;
//        yCoord += ADAFRUITFONTHEIGHT*size; 
//    }
  
    while(lineNo<sourceLines) {

        //xCoord = displayValue(xCoord, yCoord, lineNo+1, YELLOW);
        xCoord += ADAFRUITFONTWIDTH*CFONTSIZE; 
        sourceIndexStart=sourceIndexArray[lineNo];
        source[sourceIndexArray[lineNo+1]-1] = 0;
        //if(compileFlag && !error) printString(xCoord, yCoord, &source[sourceIndexStart], GREEN, CFONTSIZE);
        //else printString(xCoord, yCoord, &source[sourceIndexStart], AQUA, CFONTSIZE);
        if (error && lineNo==lineno) {
            //determine x coordinate of offending token
            xCoordError=xCoord+(sourceIndexPrevious-sourceIndexStart)*ADAFRUITFONTWIDTH*CFONTSIZE;
            sourceChar = source[sourceIndexNext];
            source[sourceIndexNext] = 0; //set to 0 to mark end of string to be displayed in red
            //printString(xCoordError, yCoord, &source[sourceIndexPrevious], RED, CFONTSIZE);
            source[sourceIndexNext] = sourceChar; //restore original source code character
        }

        source[sourceIndexArray[lineNo+1]-1] = '\n';
        lineNo++;
        xCoord = CPROGRAMXSTART;
        yCoord += ADAFRUITFONTHEIGHT*CFONTSIZE;
    }      
}

Code:

//definitions.h v1.01

//tokens
typedef enum { ENDFILE, //book-keeping tokens    
    MAINTOKEN,IF,ELSE,DO,WHILE,FOR,RETURN,READ,WRITE,AND,OR,NOT,TRUE,FALSE,LOCA​L,ARRAY,SWITCH, //reserved words
    CASE,BREAK,DEFAULT,GOTO,NOPR, //reserved words
    ID,NUM, //multicharacter tokens
    //special symbols
    PLUS,MINUS,TIMES,DIV,LT,LTE,GT,GTE,EQ,NEQ,ASSIGN,COLON,SEMI,COMMA,LPAREN,RP​AREN,LBRACKET,RBRACKET,
    LBRACE,RBRACE,INC,DEC,ADDR
} TokenType;

//states in scanner DFA
typedef enum {START,INNUM,INEXP,INID,INLESSTHAN,INGREATERTHAN,INEQUAL,INNOTEQUAL,INDIV,I​NPLUS,INMINUS,
              INCOMMENT,OUTCOMMENT,DONE} StateType;

//definitions
#define RESERVED 22 //the number of reserved words
#define SOURCE 1000 //maximum character length of source code
#define MAXLEXEMELEN 40 //maximum length of lexeme
#define MAXLINENO 200 //maximum number of progam lines

Below are some screen shots. The source code is displayed in green after "compile" is pressed if no syntax errors are found:

[Image: 51752732405_5b578c8ccc.jpg][Image: 51752503359_5901aebbbb.jpg]

otherwise the offending token is printed in red and an error message displayed at the bottom of the screen:

[Image: 51752731935_1e3857a15e.jpg][Image: 51751029642_31974f540b.jpg]
Find all posts by this user
Quote this message in a reply
01-16-2022, 10:01 PM
Post: #5
RE: monic part 6: the mC programming language and compiler
(12-13-2021 03:39 AM)F-73P Wrote:  I've started using a development board with a 4.3" display and i.MX RT1062 processor (Cortex-M7, 600MHz, 4MB FLASH, 32MB RAM) and am rewriting the firmware, so I thought this would be a good time to document the project. Here's a screen shot of a test program:

[Image: 51743164643_c0f2edc8f2.jpg]

Some aspects have been very challenging but I'm learning a lot and it has been fun too. I would like to add support for a programming language similar to FORTRAN as well:

[Image: 51742101942_dc63d72959.jpg][Image: 51743812475_47cf6ec3c2.jpg]

Looks like this is becoming a really nice small but capable system! I wonder how you handle keyboard input? With QWERTY keys or with a menu system to enter commands? Or both? I prefer QWERTY like the Sharp and Casio pocket computers, the HP 71B and Ti Voyage 200, especially the 200 because its keyboard and menu system helps to write programs quickly without getting bogged down by awkward keypress combinations to write code.

To implement interpreters from scratch, Crafting Interpreters is an indispensable resource. Personally, I'm more on the compiler-compiler side of the programming language implementation spectrum, but Crafting Compilers makes for a good reading to learn the trade without requiring compiler-compiler tools or a deep understanding of the theoretical foundations of these.

- Rob

"I can count on my friends" -- HP 71B,Prime|Ti VOY200,Nspire CXII CAS|Casio fx-CG50...|Sharp PC-G850,E500,2500,1500,14xx,13xx,12xx...
Visit this user's website Find all posts by this user
Quote this message in a reply
01-21-2022, 03:49 AM
Post: #6
RE: monic part 6: the mC programming language and compiler
(01-16-2022 10:01 PM)robve Wrote:  I wonder how you handle keyboard input? With QWERTY keys or with a menu system to enter commands? Or both?

Both, but without a QWERTY keypad:

[Image: 51626443852_cc1801b291_b.jpg]

I'll try to incorporate a QWERTY keypad on the next prototype:

[Image: 51833753930_b72eb48c1f_b.jpg]

(01-16-2022 10:01 PM)robve Wrote:  To implement interpreters from scratch, Crafting Interpreters is an indispensable resource.

Thanks, I'll check it out!
Find all posts by this user
Quote this message in a reply
Post Reply 




User(s) browsing this thread: 1 Guest(s)