# risc386 -- Restricted Instruction Set i386 simulator (C) 2013, Andreas Abel, Ludwig-Maximilians-University Munich The main purpose of this simulator is to test i386 code generated by a compiler before register allocation. Therefore, it supports temporaries, an potentially infinite amount of extra registers `t`. (Of course, it can also be used to execute symbolic assembler after register allocation.) The supported instruction set is very restricted but sufficient to write a compiler for MiniJava [Andrew Appel, Modern Compiler Implementation in Java]. I. System requirements: ------------------------------------------------------------------------ You need a recent version of GHC and Cabal (e.g. via the Haskell Platform). II. Installation: ------------------------------------------------------------------------ The executable `risc386` can be installed with `cabal install risc386`. Here are more manual instructions starting from the tarball: 1. Change to a temporary directory. 2. Unpack the tar ball tar xzf risc386-x.y.z.tar.gz 3. Change to the unpacked directory cd risc386-x.y.z 4. Install using Haskell's packet manager cabal cabal install III. Running the simulator: ------------------------------------------------------------------------ risc386 input-file.s IV. Format of the input file: ------------------------------------------------------------------------ The input file must be symbolic assembler in Intel format. Here is a small example: .intel_syntax .global Lmain .type Lmain, @function Lmain: #args enter 0, 0 L0: push 8 call L_halloc add %esp, 4 mov t1001, %eax push t1001 call LC$value add %esp, 4 mov t1002, %eax push t1002 call L_println_int add %esp, 4 L1: leave ret .global LC$value .type LC$value, @function LC$value: #args LOC 0 enter 0, 0 L2: mov t1004, DWORD PTR [%ebp+8] mov DWORD PTR [t1004+4], 555 mov t1003, DWORD PTR [%ebp+8] mov %eax, DWORD PTR [t1003+4] L3: leave ret Lexing rules: (If you want to be sure, read the `.x` file, the lexer specification.) * White space is ignored (except as separator for alphanumeric tokens). * Lines beginning with a dot `.` are skipped. These lines are pragmas for the symbolic assembler, which `risc386` ignores. * Lines beginning with a hash-symbol followed by a space `# ` are comments, which are ignored as well. * Lines beginning with a hash followed by a non-space character are `risc386` pragmas and not ignored. Currently, `risc386` only recognizes the pragma `#args`. * Valid tokens are: #args LOC REG [ ] : , . + - * dword ptr DWORD PTR mov lea MOV LEA add sub imul ADD SUB IMUL idiv inc dec neg IDIV INC DEC NEG shl shr sal sar SHL SHR SAL SAR and or xor AND OR XOR not NOT cmp CMP je jne jl jle jge JE JNE JL JLE JGE jmp call ret JMP CALL RET push pop enter leave PUSH POP ENTER LEAVE nop NOP eax ebx ecx edx esi edi ebp esp %eax %ebx %ecx %edx %esi %edi %ebp %esp (given by reg.ex [0-9]+) t (denoting a temporary register) (given by reg.ex. [a-zA-Z][a-zA-Z0-9_'$]*) Identifiers are used for labels. Parsing rules: (If you want to know all of them, read the `.y` file) 1. The input file must be a sequence of procedures. There must be one procedure whose name ends in `main`. This one is taken as the entry point. 2. Each procedure starts with a label and ends with a return instruction. Optionally, it can be preceded by a declaration of its arguments #args REG %eax, LOC 0, LOC 4 Lmyproc: ... RET `Lmyproc` expects its first argument in register `%eax`, its second at `[%esp+0]` and its third at `[%esp+4]`. The stack addresses are to be taken *before* the `CALL` is executed (which will put the return address on the stack and shift the relative location of the arguments by +4). 3. The body of each procedure is a list of i386 assembler instructions in Intel syntax. The supported instructions are listed above. Each instruction my be preceded by a label. Conditional and unconditional jumps are only allowed to a label, and only to one defined in the same procedure. Cross-procedure jumps or jumps to a calculated address are not supported. CALLs are only defined to a procedure label. `risc386` assumes the cdecl calling convention. 4. Restrictions for individual instructions: `RET` does accept arguments `ENTER` is only supported in the form `ENTER , 0` Runtime: - `risc386` knows a number of predefined procedures. They expect their arguments on the stack (cdecl calling convention) and return the result in %eax. `L_halloc` * 1 Argument: number of bytes to allocate on the heap * Result : pointer to first allocated byte. `L_println_int` * 1 Argument: signed 32bit integer value to print * Result : nothing `L_print_char` * 1 Argument: unicode char (32bit) to print * Result : nothing `L_raise` * 1 Argument: error code * Result : nothing, does not return, stops execution Execution specialties: - `risc386` supports 4 different types, all of size 32 bits: 1. Signed integers. 2. Heap addresses. Heap addresses consist of a base address which was obtained by `L_halloc` plus an offset. The offset must be a multiple of 4. 3. Stack addresses. `%esp` and `%ebp` may only be loaded with stack addresses. 4. Return addresses. Get pushed onto the stack by a `CALL`. `RET` checks that a return address lies on top of the stack before returning. The content of the return address is ignored, `RET` jumps back to the procedure where the matching `CALL` was issued. - `CMP` is the only command that sets flags. - `CALL` saves all temporary registers, `RET` restores them.