monic part 5: writing C programs on the calculator
|
12-08-2021, 01:15 AM
(This post was last modified: 01-05-2023 07:22 AM by F-73P.)
Post: #21
|
|||
|
|||
RE: monic part 5: writing C programs on the calculator
Continuing on from the previous post, the second major component of the compiler is the parser. The parser performs three functions: syntax analysis, static semantic analysis and code generation.
It is interesting to note that early programmers feared it may not be possible to translate a high-level language into executable code, "or if it was, then the object code would be so inefficient as to be useless. The development of the FORTRAN language and its compiler by a team at IBM led by John Backus between 1954 and 1957 showed that both these fears were unfounded. Nevertheless, the success of this project came about only with a great deal of effort, since most of the processes involved in translating programming languages were not well understood at the time" (Kenneth C. Louden, "Compiler Construction: Principles and Practice", 1997, p.3). Fortunately, "the study of the parsing problem (the determination of efficient algorithms for the recognition of context-free languages) was pursued in the 1960's and 1970's and led to a fairly complete solution of this problem, which today has become a standard part of compiler theory" (ibid). There are several kinds of parsing algorithms. According to Louden, "recursive-descent parsing is quite versatile and is the most suitable method for a handwritten parser" (ibid, p.143) and is the type of parser I wrote. The first stage was to write a grammar for mC and use this to write a parser recogniser, which processes the tokens from the scanner and reports an error if an unexpected token is received. A subset of mC (mC-) will be used to illustrate the process. mC- differs from mC in the following ways:
Below is a program in mC- (for simplicity it is assumed the user has entered an integer on the stack): Code:
The grammar of a programming language is a list of rules giving the syntax, or structure, of the language. Before writing a recursive-descent parser the grammar of the programming language should be written in extended Backus-Naur form (EBNF). EBNF notation evolved from the notation used by John Backus and Peter Naur in the description of the Algol60 programming language (the language that, according to C.A.R. Hoare, "was a great improvement on its successors"). "EBNF notation is designed to mirror closely the actual code of a recursive-descent parser, so a grammar should always be translated into EBNF if recursive-descent is to be used" (ibid, p.146). As mentioned in the previous post, a program can be thought of as a string, i.e. an array of characters (the "code") terminated by the null character (the ENDFILE token). So the first rule of the grammar in EBNF notation is: program \(\rightarrow\) code ENDFILE The rules in a grammar are called productions. Each production is comprised of symbols belonging to one of three categories: nonterminals, terminals and metasymbols. Nonterminals are symbols denoting the names of the underlying structures in the programming language. In the production above "program" and "code" are nonterminals. Terminals are the tokens of the programming language and are written in uppercase letters. In this example ENDFILE is a terminal. The metasymbols are not part of the programming language; they provide information on the relationship between the structures and tokens in a programming language. Here \(\rightarrow\) is a metasymbol. When writing a grammar for a programming language, the production for the most general structure (called the start symbol - "program" in this case) is listed first. Then productions for each of the nonterminals to the right of the arrow metasymbol are listed. This process is repeated until productions for all of the nonterminals have been written down. For the first production there is only one nonterminal to the right of the arrow: "code". So the next step is to write a production for "code". In the mC- programming language the code is comprised of a sequence of function declarations, the last of which is the main function. The next production in the grammar is therefore: code \(\rightarrow\) { function-declaration } main-function-declaration In EBNF notation the braces metasymbols indicate repetition, allowing (in theory) as many function declarations as desired (it also allows for no function declarations). The next step is to write the production for the "function-declaration" nonterminal: function-declaration \(\rightarrow\) ID LEFTBRACE statement-list RIGHTBRACE This says every function (other than the main function) must begin with an ID (a string made up of letters and numbers and not starting with a number), followed by a left brace, followed by a sequence of statements, followed by a right brace. This production can also be written: function-declaration \(\rightarrow\) ID "{" statement-list "}" Here the braces are in quotation marks to indicate they are tokens and not repetition metasymbols. Similarly the production for the "main-function-declaration" nonterminal is: main-function-declaration \(\rightarrow\) MAIN LEFTBRACE statement-list RIGHTBRACE The MAIN token is a reserved word (i.e. it cannot be used for variable and label names) and is represented by the string "main". Continuing in this way gives the EBNF grammar for mC-: program \(\rightarrow\) code ENDFILE code \(\rightarrow\) { function-declaration } main-function-declaration function-declaration \(\rightarrow\) ID LEFTBRACE statement-list RIGHTBRACE main-function-declaration \(\rightarrow\) MAIN LEFTBRACE statement-list RIGHTBRACE statement-list \(\rightarrow\) { statement } statement \(\rightarrow\) compound-stmt | if-stmt | while-stmt | read-stmt | write-stmt | assignment-stmt | expression-stmt | goto-stmt | label-stmt | nop-stmt compound-stmt\(\rightarrow\) LEFTBRACE statement-list RIGHTBRACE if-stmt \(\rightarrow\) IF LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS statement [ ELSE statement ] while-stmt \(\rightarrow\) WHILE LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS statement read-stmt \(\rightarrow\) READ ID SEMICOLON write-stmt \(\rightarrow\) WRITE expression SEMICOLON assignment-stmt \(\rightarrow\) assignment SEMICOLON expression-stmt \(\rightarrow\) [ expression ] SEMICOLON goto-stmt \(\rightarrow\) GOTO ID SEMICOLON label-stmt \(\rightarrow\) ID COLON statement nop-stmt \(\rightarrow\) NOPR SEMICOLON assignment \(\rightarrow\) ID "=" expression expression \(\rightarrow\) Boolean-expression | additive-expression Boolean-expression \(\rightarrow\) logical-expression { boolop logical-expression } logical-expression \(\rightarrow\) additive-expression relop additive-expression | ID | TRUE | FALSE additive-expression \(\rightarrow\) term { addop term } term \(\rightarrow\) [ NEGATE ] factor { mulop [ NEGATE ] factor } factor \(\rightarrow\) LEFTPARENTHESIS additive-expression RIGHTPARENTHESIS | NUM | ID | ID "++" | ID "--" boolop \(\rightarrow\) "and" | "or" | "not" relop \(\rightarrow\) "<=" | "<" | ">" | ">=" | "==" | "!=" addop \(\rightarrow\) "+" | "-" mulop \(\rightarrow\) "*" | "/" Consider the production for the "statement" nonterminal: statement \(\rightarrow\) compound-stmt | if-stmt | while-stmt | read-stmt | write-stmt | assignment-stmt | expression-stmt | goto-stmt | label-stmt | nop-stmt The | metasymbol indicates choice, i.e. a "statement" is either a "compound", "if", "while", "read", "write", "assignment", "expression", "goto", "label" or "nop" statement. Consider the production for the "if-stmt" nonterminal: if-stmt \(\rightarrow\) IF LEFTPARENTHESIS Boolean-expression RIGHTPARENTHESIS statement [ ELSE statement ] The brackets metasymbols indicate an optional construct, so the "if" statement has one of the following two forms: 1) if (Boolean expression) statement 2) if (Boolean expression) statement else statement A parser recogniser for mC- can now be written. The next stage is to add software that performs static semantic analysis and code generation. Finally a virtual machine is written, which executes the translated code. The next thread, which focusses on the mC compiler, will cover these steps in detail. The C language combines all the power of assembly language with all the ease-of-use of assembly language |
|||
« Next Oldest | Next Newest »
|
User(s) browsing this thread: 2 Guest(s)