-
Code Generation
07/20/2017 at 11:06 • 0 commentsCode Generation
I have been testing some fairly big programs. Another grammar fault I found is that you cannot assign negative constants. Fixed.
Here is the grammar at the moment:
PL/0 Grammar: program = block "." . block = [ "const" ident "=" ["-"] number {"," ident "=" number} ";"] // Add negative constants [ "var" ident {"," ident} ";"] [ "var" ident "[" number "]" {"," ident "[" number "]"} ";"] // Add arrays { "procedure" ident ";" block ";" } statement. statement = [ ident ":=" expression ident "[" expression "]" ":=" expression // Add arrays ident ":=" string // Add string assignmnet to ident ident "[" expression "]" ":=" string // Add string assignmnet to arrays | "call" ident | "?" ident | "write" ident | "putc" ident // Add "write" (int) and "putc" (char) | "!" expression | "read" expression | "getc" expression // Add "read" (int) and "getc" (char) | "begin" statement {";" statement } [";"] "end" // Add optional ";" before "end" | "if" condition "then" statement [ "else" statment ] // Add "else" | "while" condition "do" statement ]. condition = "odd" expression | expression ("="|"#"||"!="||"<>"|"<"|"<="|">"|">=") expression . // Add "!+" and "<>" expression = [ "+"|"-"] term { ("+"|"-") term}. term = factor {("*"|"/"|"%"|"mod") factor}. // Add "%" and "mod" factor = ident | ident "[" expression "]" | number | "(" expression ")". // Add arrays Add comments (to tokeniser): // ... EOL { ... } /* ... */ (* ... *)
Still have not fixed the unary minus fault but its not really an issue, just put parentheses around it.
The above grammar may make you eyes roll but you need it to code a compiler.
The Plan
The plan for building the code generator is to rework the interpreter:
- Change the stack to downward growth from upward growth (done).
- Swap out the P-Code instructions for pseudo CPU OpCode (done).
- Convert the interpreter into a OpCode lister (done).
- Covert the opcodes to Subleq.
The reason for doing this to the interpreter is that changes can be incrementally checked.
The CPU I am modelling is as minimal as practical. The current pseudo OpCodes are stack focused. Here are the registers:
Registers Comment Ax Primary Register Bc Secondary Register Tx Temporary Register (used for swap) PC Program Counter (?) SP Stack Pointer (downward growing) BP Base Pointer The program counter (PC) is not actually a register but a subroutine return address. It is set by the compiler at compile time, that is the real PC register is not actually accessed during execution.
The stack and register pseudo opcodes:
Stack operations: Register moves: FromStack (Ax=stack[Ax] mov Ax,Bx ToStack (stack[Ax]=Bx) mov Ax,SP push Ax mov SP,Ax pop Ax mov Ax,BP pop Bx mov BP,Ax mov PC,Ax Move Immediate (literal): mov Ax,PC mov Bx,Imm mov Tx,Ax mov Bx,Tx I have favoured register moves over stack operations because stage operations are expensive. Here are the remaining pseudo opcodes:
Increment/Decrement: Mathematical Operations: Boolean Operations: dec Ax sub Ax,Bx odd Ax inc Ax add Ax,Bx eq Ax,Bx mul Ax,Bx lt Ax,Bx Jump: div Ax,Bx le Ax,Bx jmp mod Ax,Bx ne Ax,Bx jz gt Ax,Bx ge Ax,Bx Add four input/outputs instructions:
Input/Output: Code: read (integer) read get (char) getc write (integer) write put (char) putc Recoded the pseudo opcode interpreter into a code lister (done). Called it PCode2OpCode.
I have been reviewing the OpCode output and I have used indirect addressing with the Ax register which of course is not a valid instruction for the x86 instruction set. It is pseudo opcodes after all. I also changed the format to make it easier to import, for example I use "mov_Ax,Bx" rather than mov Ax, Bx" and I dropped the "L" prefix for address labels.
AlanX
-
PL/0 Completed?
07/19/2017 at 01:39 • 0 commentsPL/0 Completed?
Brave words! Can it really be completed? There are still things I would like to add but I need to move on. The current state of play is:
- Recognises most PL/0 dialects.
- Tolerant to redundant ";"
- One dimensional arrays
- Pascal strings
To do:
Would like to add an ASM instruction (pass through to the target CPU).
Rework the error messages to provide better hints.
Here is an example program:
{ Program Test String } var Str[16],Prompt[16]; procedure ReadStr; var i,test; { Read a Pascal string, uses Str } begin i:=1; test:=1; while test=1 do begin getc Str[i]; if Str[i]<32 then begin Str:=i-1; { Str == Str[0] } test:=0 end else begin if i=16 then begin test:=0; Str:=i end else begin i:=i+1 end end end end; procedure WriteStr; var i; { Write a string, uses Str } begin i:=1; while i<=Str do begin putc Str[i]; i:=i+1 end end; procedure WritePrompt; var i; { Write Prompt } begin i:=1; while i<=Prompt do begin putc Prompt[i]; i:=i+1 end end; begin Prompt:="String Test"; call WritePrompt; putc 10; { new line } Prompt:="Enter a string: "; call WritePrompt; call ReadStr; Prompt:="You wrote: "; call WritePrompt; call WriteStr; putc 10; { new line } end.
You may notice I have used redundant ":" in this example.And here is the run:
C:\AlanX\pl0>.\PLZero_Tokeniser\PLZero_Tokeniser -i test18.pl0 -o test18.token -nl C:\AlanX\pl0>.\PLZero_Compiler\PLZero_Compiler -i test18.token -o test18.code -t test18.table C:\AlanX\pl0>.\PLZero_Interpreter\PLZero_Interpreter -i test18.code -o test18.run -s test18.stack String Test Enter a string: Hello World! You wrote: Hello World! C:\AlanX\pl0>pause Press any key to continue . . .
I think that is pretty cool!
Next
The next task is to write a P-Code to Subleq translator.
AlanX
-
PLZero Compiler
07/13/2017 at 16:12 • 0 commentsPLZero (PL/0) Compiler
The compiler is almost done. The Nikluas Wirth Pascal has been recoded in C. The structure completely rebuilt. I don't even think Mr Wirth would recognise his code! I removed the tokeniser and interpreter from the compiler and are now stand alone programs (part of the tool train).
Here are a couple of working programs.
Multiplication:
const m=7, n = 85; var print, x, y, z; { Assignment to print variable is printed } procedure multiply; var a, b; begin a := x; b := y; z := 0; while b > 0 do begin if odd b then z := z + a; a := 2 * a; b := b / 2; end; end; begin x := m; y := n; call multiply; print:=z; end.
Prime number search (http://www.dustyoldcomputers.com/pdp-8/software/pl0/):
{ calculate prime numbers } const pcmax = 101; { max on 8 is 2045 } var print, { memory mapped print routine memory location } pc, { prime candidate } d, { divisor } t1, { temp #1 } t2, { temp #2 } isnot; { isnot a prime flag } begin print := 1; print := 2; pc := 3; t1 := pc/2; while pc <= pcmax do begin { trial division to test for prime } d := 3; isnot := 0; { This keeps t1 >= to the sqrt(pc) } if (t1*t1) < pc then t1 := t1 + 1; while d <= t1 do begin t2 := pc / d; if (t2*d) = pc then { not prime } begin isnot := 1; { remember not prime } d := pc; { force loop exit because we don't have break } end; d := d + 2; end; if isnot = 0 then print := pc; { store at print will print it } pc := pc + 2; end; end.
Wirth's test program:CONST m = 7, n = 85; VAR print, putc, x, y, z, q, r; PROCEDURE multiply; VAR a, b; BEGIN a := x; b := y; z := 0; WHILE b > 0 DO BEGIN IF ODD b THEN z := z + a; a := 2 * a; b := b / 2 END END; PROCEDURE divide; VAR w; BEGIN r := x; q := 0; w := y; WHILE w <= r DO w := 2 * w; WHILE w > y DO BEGIN q := 2 * q; w := w / 2; IF w <= r THEN BEGIN r := r - w; q := q + 1 END END END; PROCEDURE gcd; VAR f, g; BEGIN f := x; g := y; WHILE f # g DO BEGIN IF f < g THEN g := g - f; IF g < f THEN f := f - g END; z := f END; BEGIN x := m; y := n; CALL multiply; print:=z; putc:=10; x := 25; y := 3; CALL divide; print:=q; print:=r; putc:=10; x := 84; y := 36; CALL gcd; print:=z END.
I have edited the tokeniser to recognise:
- any case for token words (i.e. BEGIN, begin or Begin etc.) but variables are still case sensitive
- "var" and "int" as an alternative
- "<>", "!=", "#" as alternatives
- "print" (lower case) is recognised at any level as the integer output.
- "putc" (lower case) is recognised at any level as the char output.
I rather like the upper case for keywords.
Here is the result from Wirth's program:
Begin PL/0: 595 8 1 12 End PL/0.
I need a final check of the code, add a execution list option before moving on to the Subleq translator.
AlanX
-
Fixing Unary Minus
07/08/2017 at 00:48 • 1 commentLanguage Structure
Programming languages can the presented/described in format called Backus-Naur Form (BNF). Here is a proper example from "Let's Build a Compiler!" Jack W. Crenshaw:
<b-expression> ::= <b-term> [<orop> <b-term>]* <b-term> ::= <not-factor> [AND <not-factor>]* <not-factor> ::= [NOT] <b-factor> <b-factor> ::= <b-literal> | <b-variable> | <relation> <relation> ::= | <expression> [<relop> <expression] <expression> ::= <term> [<addop> <term>]* <term> ::= <signed factor> [<mulop> factor]* <signed factor>::= [<addop>] <factor> <factor> ::= <integer> | <variable> | (<b-expression>)
Here is what I decoded from Simple Compiler in a "bnf" like form:Factor -> Identifier -> Integer -> ( BoolExpression ) Term -> Factor -> Factor * Factor+ -> Factor / Factor+ Expression -> Term -> + Term -> - Term -> Term + Term+ -> Term - Term+ BoolExpression -> Expressiom -> Expression = Expression+ -> Expression <> Expression+ -> Expression < Expression+ -> Expression <= Expression+ -> Expression > Expression+ -> Expression >= Expression+
(Note, my decoded format is as "coded" and is upside down compared to the bnf format.)
Okay what does all this mean? Basically the minimum number of precedence levels (from "Let's Build a Compiler!" Jack W. Crenshaw) is 8:
Level Syntax Element Operator 0 factor literal, variable 1 signed factor unary minus 2 term *, / 3 expression +, - 4 b-factor literal, variable, relop 5 not-factor NOT 6 b-term AND 7 b-expression OR, XOR The Simple compiler only has 4, so it was doomed from the start. Even without the bit operations it still needed 5 levels. What failed was the unary minus. This is why I had to put brackets around the "-5" the the code:
begin A=4; B=1; if A<B then write A*10<b; else begin write 1<B*10; write B*10, (a+b)*(-5); end C=-1; write C; end
The original coder did try with "Expression -> - Term" as the C=-1 does work, but a true unary minus needs a higher precedence than "*" for "(a+b)*-5" to work.
So I need at added "SignedFactor" between "Factor" and "Term" based on Crenshaw's bnf, and remove the "Expression -> + Term" and "Expression -> - Term":
Factor -> Identifier -> Integer -> ( BoolExpression ) SignedFactor -> Factor -> + Factor -> - Factor Term -> SignedFactor -> SignedFactor * SignedFactor+ -> SignedFactor / SignedFactor+ Expression -> Term -> Term + Term+ -> Term - Term+ BoolExpression -> Expressiom -> Expression = Expression+ -> Expression <> Expression+ -> Expression < Expression+ -> Expression <= Expression+ -> Expression > Expression+ -> Expression >= Expression+
It took a little while to get my head around the internals of the recursive procedure calls but eventually I got it to work:
void SignedFactor(void) { char op; op=tTok; if ((op==_plus)||(op==_minus)) { GetNextToken(); } Factor(); if (op==_plus) { // Nothing to do } else if (op==_minus) { Emit(_movBxAx); Emit(_xorAxAx); Emit(_subBx); } }
In the above code:- Th global "tTok" is saved to local "op" in case it is required for later use (other procedures may "eat it"). Basically parser procedures "eat" the token if it belongs to them.
- Next, if the token is for this procedure (i.e. it is a "+" or a "-") then "eat" the token by getting the next token (GetNextToken()).
- Okay, the procedure has to wait for the "Factor" that will be operated on to come back (i.e. pass through to Factor()).
- Factor() has come back and the result will be in "Ax". If "op" is a "+" or a "-" then do the operation (emit the code).
- Okay all done, return to the calling procedure (Term()).
Here is the current language structure for Simple Compiler:
Factor -> Identifier -> Integer -> ( BoolExpression ) SignedFactor -> Factor -> + Factor -> - Factor Term -> SignedFactor -> SignedFactor * SignedFactor+ -> SignedFactor / SignedFactor+ -> SignedFactor % SignedFactor+ Expression -> Term -> Term + Term+ -> Term - Term+ BoolExpression -> Expression -> Expression = Expression+ -> Expression <> Expression+ -> Expression < Expression+ -> Expression <= Expression+ -> Expression > Expression+ -> Expression >= Expression+ Statement -> Begin -> While -> If -> Write -> Read -> Assignment Begin -> Statement -> End While -> BoolExpression Statement If -> BoolExpression then Statement -> BoolExpression then Statement "else" Statement Write -> BoolExpression , BoolExpression+ Read -> Input Assignment -> Identifier = BoolExpression
The only thing not shown above is that the semicolon (i.e. ";") is treated as a white space. -
Subleq Efficiency
07/06/2017 at 04:29 • 0 commentsA Not Very Happy Interpreter
The interpreter is not very happy. I have used the OpCode jump address without considering the Subleq actual address (i.e. the OpCode instruction size does not match the Subleq macro size). This is a bit tricky as I do not calculate Subleq instructions lengths and therefore cannot calculate the Subleq address directly (at least in OpCode2Subleq). A good fix is to precede each Subleq instruction with a label identifying the OpCode Address, eg. "_OPCxxxxx". This has no cost to the final integer code size.
The use of the OpCode instruction pointer label "_OPCxxxxx" works very well. Very easy to find the OPC code in Subleq now.
Fixed a coding error, where I coded "jz" as if it was "jeqz"? Fixed and now the interpreter executes and presents the correct answer.
Subleq Efficiency
Here is the final test of my tool train:
C:\AlanX\SimpleC\OpCode2Subleq>SimpleC -o 1.opc 0<1.sc Tokenised Code: Line 1: BEGIN Line 2: A = 4 Line 3: B = 1 Line 4: IF A < B THEN Line 5: WRITE A * 10 < B Line 6: ELSE BEGIN Line 7: WRITE 1 < B * 10 Line 8: WRITE B * 10 , ( A + B ) * ( - 5 ) Line 9: END Line 10: C = - 1 Line 11: WRITE C Line 12: END Done. C:\AlanX\SimpleC\OpCode2Subleq>OpCode2Subleq -i 1.opc -o 1.sasm C:\AlanX\SimpleC\OpCode2Subleq>Subleq_Asm -i 1.sasm -o 1.code -l 1.list C:\AlanX\SimpleC\OpCode2Subleq>Subleq_Int -i 1.code -l 1.list -t 1.trace SUBLEQ Interpreter 1 10 -25 -1 Interpreter finished C:\AlanX\SimpleC\OpCode2Subleq>pause Press any key to continue . . .
Code Efficiency
12 lines of high level code became 95 words (73 OpCodes) which turned into
5530 words of Subleq code (1903 lines of uncommented code). That is about 26 lines of Subleq per OpCode. And the code takes a full 4 seconds to execute!
AlanX
-
Converting the Simple Compiler to a Subleq Compiler
06/24/2017 at 17:03 • 0 commentsCompiler Subleq Backend
In order to convert Simple Compiler to a Subleq compiler, I have to replace the Simple Compiler Assembler (OpCode) Lister and Assemble (OpCode) Interpreter with a Subleq OpCode exporter. I will use an external Subleq assembler and interpreter. The Subleq OpCode exporter is called the Subleq backend. The Subleq backend contains all the Subleq macros, but only composite macros (which takes an operand) are accessible. Here is the list of Subleq macros:
- jump
- clear
- not
- shl
- inc
- dec
- chs
- label
- return
- putc
- getc
- wrtInt
- rdInt
- sub
- add
- copy
- ncopy (negative copy)
- jlez
- jgez
- jeqz
- jgtz
- jltz
- jnez
- jmin
- store
- absolute
- xor
- and
- or
- nand
- nor
- new sub
- new add
- system
- movAxVar
- movVarAx
- movAxImm
- movAxBx
- movBxAx
- pushAx
- pushBx
- pushFx
- popAx
- popBx
- popFx
- addBx
- subBx
- mulBx
- divBx
- jmp
- jz
- wrtAx
- wrtLn
- rdAx
- halt
- cmpAx
- orAxAx
- xorAxAx
- setEq
- setNE
- setLT
- setLE
- setGT
- setGE
- HALT (0)
- OUTPUT (-1)
- INPUT (-2)
- Z (zero)
- T (temp)
- t (tmp)
- P (+1)
- N (-1)
- the CPU model:
- Ax
- Bx
- Fx
- SP
- DP
- and few useful constants:
- _SP (-32)
- _DP (-16384)
- _MIN (-32768)
- _CR (13)
- _LF (10)
- _SPACE (32)
- _MINUS (45)
- _ZERO (48)
- _NINE (57)
The Subleq system also has a point to pointer copy (PPC) routine (as it is assumed the monitor code will be stored in ROM and pointer to pointer copy requires self modifying code).
I will likely add many more constants to support string output later.
I have added the binary operation macros:
- XOR Ax,Bx
- AND Ax,Bx
- OR Ax,Bx
- NAND Ax,Bx
- NOR Ax,Bx
- NOT Ax
- SHL Ax
- SHR Ax (to be added)
These bit operations don't update the flag register.
Integer Minimum Value
The minimum integer value (MIN) for a 16 bit integer is -32768. Subleq has a major problem with MIN. It treats MIN as equal to 0 unless special precautions are taken. This is not usually a problem unless the code uses the sign bit (such as for bit operations etc.). In the end I rewrote all the test macros (i.e. jeqz, jnez, jlez, jltz, jgez and jgtz) to be MIN aware.
I also used MIN as the return value for multiplication overflow and attempt to divide by zero. There is a "jmin" macro if required.
Working Subleq Backend
I have got a working Subleq backend test-bed (TestMacro2Subleq), for example the following Subleq macros:
compositeMacro("system",0); compositeMacro("rdAx",0); compositeMacro("pushAx",0); compositeMacro("pushAx",0); compositeMacro("rdAx",0); compositeMacro("movBxAx",0); compositeMacro("popAx",0); compositeMacro("pushBx",0); // iMul compositeMacro("mulBx",0); compositeMacro("wrtInt",0); compositeMacro("wrtLn",0); // iDiv compositeMacro("popBx",0); compositeMacro("popAx",0); compositeMacro("divBx",0); compositeMacro("wrtInt",0); compositeMacro("wrtLn",0); compositeMacro("movAxBx",0); compositeMacro("wrtInt",0); compositeMacro("wrtLn",0);
Produces after assembly and interpretation:
C:\AlanX\TestSubleq>TestMacro2Subleq 1>Test.sasm C:\AlanX\TestSubleq>subleq_asm -i Test.sasm -o Test.code -l Test.list C:\AlanX\TestSubleq>Subleq_Int -i Test.code -l Test.list -t Test.trace SUBLEQ Interpreter 6 <- Entered integer 2 <- Entered integer 12 -> Integer multiplication (=6*2) 3 -> Integer division (=6/2) 0 -> Integer remainder (=6%2) Interpreter finished C:\AlanX\TestSubleq>pause Press any key to continue . . .
The Subleq code produced is too long to list here.
AlanX
-
Translated the Pascal Code to C Code
06/22/2017 at 16:18 • 0 commentsC Code
The code has been translated. Quite a bit of clean up:
- General regrouping of code.
- Fixed memory reference error (mixed up low/high byte order).
- Simplified the symbol table.
- Added code to export the code being tokenise to "stderr"
- Reworked the error reporting routine.
- Fixed up the EOF problem.
- Reordered the constants and pushed them out to a header file.
- Pushed out the assembler and interpreter code to include files.
- Changed the CPU model.
- Added some options to the command line.
The current version of the tool tranin has been uploaded.
Understanding the Code
The compiler code in most of the minimalist program codes I reviewed are very similar. Even to the point where missing instructions (i.e.">") and constant names are the same (i.e. SETLSS rather than SETLT and SETLEA instead of SETLE).
A good site for compiler construction (but not complete) is http://zserge.com/blog/cucu-part1.html. I will likely refer back to this site for when I add functions to my code.
A good pdf on compiler construction (by Jack W. Crenshaw) is http://compilers.iecc.com/crenshaw/ (I have uploaded is book compiler.pdf).
But the place to go is "Recursive-Descent Parsing" Chapter 6.6 of the AWK book pp 147-152. (I have uploaded a pdf of the book and the programs from Chapter 6).
The NewCPU Model
The CPU model for nearly all the compiler I have reviewed is:
- Ax: Accumulator register
- Bx: Secondary register
- Fx: Flag register
- SP: Stack Pointer
- DP: Data Pointer (Simple Compiler)
- 16 bit data and address word width
Usually all data is on the stack. The Simple Compiler uses a separate Data Pointer (DP) (i.e. modelling a Pascal style "heap memory"). I will keep this method.
The the assembler code therefore revolves around Ax and Bx:
- MOV Ax, Bx
- MOV Bx, Ax
- PUSH Ax
- PUSH Bx (added)
- PUSH Fx (added)
- POP Ax
- POP Bx
- POP Fx (added)
- MOV AX, IMM
- MOV AX, [ADDR] (uses DP)
- MOV [ADDR], AX (uses DP)
Two basic jumps:
- JMP [ADDR]
- JZ [ADDR]
The remainder maps the language symbols (i.e. '+', '-', '*', '/', '=',' <>','<','<=','>','>=') :
- ADD Ax, Bx
- SUB Ax, Bx
- IMUL Ax, Bx (software)
- IDIV Ax, Bx (software)
- SETEQ
- SETNE
- SETLT (added)
- SETLE (added)
- SETGT
- SETGE
The set commands set (for true) or reset (for false) the AX depending on the Flag register. True is defined as Ax=1 and false is defined as Ax=0.
The following commands set the flag register (Fx):
- OR Ax, Ax (Sets the Flags based on the Ax)
- XOR Ax, Ax (Clears the Ax and sets flags)
- CMP Ax, Bx (Sets flags based on Ax-Bx)
Finally there is a HALT and a couple of I/O commands:
- HALT (same as reset)
- wrtAx (converts an integer to characters before printing)
- wrtLn (write a new line)
- rdAx (added, reeads characters and converts to integer)
Language Construct
Languages can me modelled in Backus–Naur Form (BNF). Before extending the language I need to map the existing language. I use the syntax from the AWK handbook for the language construct:
// Factor -> Identifier // -> Integer // -> ( BoolExpression ) // Term -> Factor // -> Factor * Factor+ // -> Factor / Factor+ // Expression -> Term // -> + Term // -> - Term // -> Term + Term+ // -> Term - Term+ // BoolExpression -> Expressiom // -> Expression = Expression // -> Expression <> Expression // -> Expression < Expression // -> Expression <= Expression // -> Expression > Expression // -> Expression >= Expression // Statement -> Begin // -> While // -> If // -> Write // -> Read // -> Assignment // Begin -> Statement \\ This is where a statement separator (";") should added! // -> End // While -> BoolExpression Statement // If -> BoolExpression then Statement // -> BoolExpression then Statement "else" Statement // Write -> BoolExpression , BoolExpression+ // Read -> Input // Assignment -> Identifier = BoolExpression
Note: the "+" at the end of a line means it can be repeated.
Its okay but I think the program should begin with a "begin" and end with an "end". The Statement procedure needs to be modified for this. There probably needs to be a semi-colon (";") as a statement separator as well. I think it might get messy without a statement delimiter later.
Bit-wise operations (i.e. shl, shr, or, and, xor, not) are also missing.
The parser code is complicated because the procedures use the CPU stack to communicate between procedures. But once you realise this it get a bit easier to read.
Fixed the EOF bug in the "BeginState" procedure and some more code cleanup to allow export of the tokeniser data. Pretty happy with the code as it stands and I now understand it fully.
AlanX