Threaded Mode | Linear Mode

F-73P · (This post was last modified: 01-05-2023 07:22 AM by F-73P.)

Continuing on from the previous post, the second major component of the compiler is the parser. The parser performs three functions: syntax analysis, static semantic analysis and code generation.

It is interesting to note that early programmers feared it may not be possible to translate a high-level language into executable code, "or if it was, then the object code would be so inefficient as to be useless. The development of the FORTRAN language and its compiler by a team at IBM led by John Backus between 1954 and 1957 showed that both these fears were unfounded. Nevertheless, the success of this project came about only with a great deal of effort, since most of the processes involved in translating programming languages were not well understood at the time" (Kenneth C. Louden, "Compiler Construction: Principles and Practice", 1997, p.3).

Fortunately, "the study of the parsing problem (the determination of efficient algorithms for the recognition of context-free languages) was pursued in the 1960's and 1970's and led to a fairly complete solution of this problem, which today has become a standard part of compiler theory" (ibid).

There are several kinds of parsing algorithms. According to Louden, "recursive-descent parsing is quite versatile and is the most suitable method for a handwritten parser" (ibid, p.143) and is the type of parser I wrote. The first stage was to write a grammar for mC and use this to write a parser recogniser, which processes the tokens from the scanner and reports an error if an unexpected token is received.

A subset of mC (mC-) will be used to illustrate the process. mC- differs from mC in the following ways:

All variables are global.
Arrays are not supported.
Functions have no parameters.
There are no "do while", "for", "switch" or "return" statements.

Below is a program in mC- (for simplicity it is assumed the user has entered an integer on the stack):

Code:

/* factorial program */

fact {

 factorial=1;

 while (x>1) {

  factorial = factorial*x;

  x--;

 }

}

main { 

 read x; 

 if (x>1) fact;

 else factorial=1;

 write factorial;

}

The grammar of a programming language is a list of rules giving the syntax, or structure, of the language. Before writing a recursive-descent parser the grammar of the programming language should be written in extended Backus-Naur form (EBNF). EBNF notation evolved from the notation used by John Backus and Peter Naur in the description of the Algol60 programming language (the language that, according to C.A.R. Hoare, "was a great improvement on its successors"). "EBNF notation is designed to mirror closely the actual code of a recursive-descent parser, so a grammar should always be translated into EBNF if recursive-descent is to be used" (ibid, p.146).

As mentioned in the previous post, a program can be thought of as a string, i.e. an array of characters (the "code") terminated by the null character (the ENDFILE token). So the first rule of the grammar in EBNF notation is:

program \(\rightarrow\) code ENDFILE

The rules in a grammar are called productions. Each production is comprised of symbols belonging to one of three categories: nonterminals, terminals and metasymbols. Nonterminals are symbols denoting the names of the underlying structures in the programming language. In the production above "program" and "code" are nonterminals. Terminals are the tokens of the programming language and are written in uppercase letters. In this example ENDFILE is a terminal. The metasymbols are not part of the programming language; they provide information on the relationship between the structures and tokens in a programming language. Here \(\rightarrow\) is a metasymbol.

When writing a grammar for a programming language, the production for the most general structure (called the start symbol - "program" in this case) is listed first. Then productions for each of the nonterminals to the right of the arrow metasymbol are listed. This process is repeated until productions for all of the nonterminals have been written down.

For the first production there is only one nonterminal to the right of the arrow: "code". So the next step is to write a production for "code". In the mC- programming language the code is comprised of a sequence of function declarations, the last of which is the main function. The next production in the grammar is therefore:

code \(\rightarrow\) { function-declaration } main-function-declaration

In EBNF notation the braces metasymbols indicate repetition, allowing (in theory) as many function declarations as desired (it also allows for no function declarations).

The next step is to write the production for the "function-declaration" nonterminal:

function-declaration \(\rightarrow\) ID LEFTBRACE statement-list RIGHTBRACE

This says every function (other than the main function) must begin with an ID (a string made up of letters and numbers and not starting with a number), followed by a left brace, followed by a sequence of statements, followed by a right brace. This production can also be written:

function-declaration \(\rightarrow\) ID "{" statement-list "}"

Here the braces are in quotation marks to indicate they are tokens and not repetition metasymbols.

Similarly the production for the "main-function-declaration" nonterminal is:

main-function-declaration \(\rightarrow\) MAIN LEFTBRACE statement-list RIGHTBRACE

The MAIN token is a reserved word (i.e. it cannot be used for variable and label names) and is represented by the string "main".

Continuing in this way gives the EBNF grammar for mC-:

program \(\rightarrow\) code ENDFILE

code \(\rightarrow\) { function-declaration } main-function-declaration

function-declaration \(\rightarrow\) ID LEFTBRACE statement-list RIGHTBRACE

main-function-declaration \(\rightarrow\) MAIN LEFTBRACE statement-list RIGHTBRACE

statement-list \(\rightarrow\) { statement }

statement \(\rightarrow\) compound-stmt | if-stmt | while-stmt | read-stmt | write-stmt | assignment-stmt | expression-stmt | goto-stmt | label-stmt | nop-stmt

compound-stmt\(\rightarrow\) LEFTBRACE statement-list RIGHTBRACE

if-stmt \(\rightarrow\) IF LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS statement [ ELSE statement ]

while-stmt \(\rightarrow\) WHILE LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS statement

read-stmt \(\rightarrow\) READ ID SEMICOLON

write-stmt \(\rightarrow\) WRITE expression SEMICOLON

assignment-stmt \(\rightarrow\) assignment SEMICOLON

expression-stmt \(\rightarrow\) [ expression ] SEMICOLON

goto-stmt \(\rightarrow\) GOTO ID SEMICOLON

label-stmt \(\rightarrow\) ID COLON statement

nop-stmt \(\rightarrow\) NOPR SEMICOLON

assignment \(\rightarrow\) ID "=" expression

expression \(\rightarrow\) Boolean-expression | additive-expression

Boolean-expression \(\rightarrow\) logical-expression { boolop logical-expression }

logical-expression \(\rightarrow\) additive-expression relop additive-expression | ID | TRUE | FALSE

additive-expression \(\rightarrow\) term { addop term }

term \(\rightarrow\) [ NEGATE ] factor { mulop [ NEGATE ] factor }

factor \(\rightarrow\) LEFTPARENTHESIS additive-expression RIGHTPARENTHESIS | NUM | ID | ID "++" | ID "--"

boolop \(\rightarrow\) "and" | "or" | "not"

relop \(\rightarrow\) "<=" | "<" | ">" | ">=" | "==" | "!="

addop \(\rightarrow\) "+" | "-"

mulop \(\rightarrow\) "*" | "/"

Consider the production for the "statement" nonterminal:

statement \(\rightarrow\) compound-stmt | if-stmt | while-stmt | read-stmt | write-stmt | assignment-stmt | expression-stmt | goto-stmt | label-stmt | nop-stmt

The | metasymbol indicates choice, i.e. a "statement" is either a "compound", "if", "while", "read", "write", "assignment", "expression", "goto", "label" or "nop" statement.

Consider the production for the "if-stmt" nonterminal:

if-stmt \(\rightarrow\) IF LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS statement [ ELSE statement ]

The brackets metasymbols indicate an optional construct, so the "if" statement has one of the following two forms:

1) if (Boolean expression) statement
2) if (Boolean expression) statement else statement

A parser recogniser for mC- can now be written. The next stage is to add software that performs static semantic analysis and code generation. Finally a virtual machine is written, which executes the translated code. The next thread, which focusses on the mC compiler, will cover these steps in detail.