risc386 is a symbolic Intel(R) 386 assembler interpreter which
allows infinitely many registers (temporaries). Its purpose is
to debug the output of a MiniJava compiler (from Andrew Appel's
book, Modern Compiler Implementation in JAVA) before register
allocation has been performed.
risc386 supports only a small fragment of i386 instructions.
It expects its input to be a list of procedures in .intel_syntax,
each of which starts with a label and ends with a return
instruction.
Control flow is restricted: only jumps to procedure-local
labels are allowed. Reading from an uninitialized memory location
raises an exception.
risc386 -- Restricted Instruction Set i386 simulator
(C) 2013, Andreas Abel, Ludwig-Maximilians-University Munich
The main purpose of this simulator is to test i386 code generated by a
compiler before register allocation. Therefore, it supports
temporaries, a potentially infinite number of extra registers
t<number>. (Of course, it can also be used to execute symbolic
assembler after register allocation.)
The supported instruction set is very restricted but sufficient to
write a compiler for MiniJava [Andrew Appel, Modern Compiler
Implementation in Java].
I. System requirements:
You need a recent version of GHC and Cabal (e.g. via the Haskell Platform).
II. Installation:
The executable risc386 can be installed with cabal install risc386.
Here are more manual instructions starting from the tarball:
- Change to a temporary directory.
- Unpack the tarball:
  tar xzf risc386-x.y.z.tar.gz
- Change to the unpacked directory:
  cd risc386-x.y.z
- Install using Haskell's package manager cabal:
  cabal install
III. Running the simulator:
risc386 input-file.s
The input file must be symbolic assembler in Intel format.
Here is a small example:
.intel_syntax
.global Lmain
.type Lmain, @function
Lmain:
#args
enter 0, 0
L0: push 8
call L_halloc
add %esp, 4
mov t1001, %eax
push t1001
call LC$value
add %esp, 4
mov t1002, %eax
push t1002
call L_println_int
add %esp, 4
L1: leave
ret
.global LC$value
.type LC$value, @function
LC$value:
#args LOC 0
enter 0, 0
L2: mov t1004, DWORD PTR [%ebp+8]
mov DWORD PTR [t1004+4], 555
mov t1003, DWORD PTR [%ebp+8]
mov %eax, DWORD PTR [t1003+4]
L3: leave
ret
Lexing rules:
(If you want to be sure, read the .x file, the lexer specification.)
- White space is ignored (except as a separator for alphanumeric tokens).
- Lines beginning with a dot (.) are skipped.
  These lines are pragmas for the symbolic assembler, which risc386 ignores.
- Lines beginning with a hash symbol followed by a space ("# ")
  are comments, which are ignored as well.
- Lines beginning with a hash followed by a non-space character
  are risc386 pragmas and are not ignored.
  Currently, risc386 only recognizes the pragma #args.
- Valid tokens are:
#args LOC REG
[ ] : , . + - *
dword ptr DWORD PTR
mov lea MOV LEA
add sub imul ADD SUB IMUL
idiv inc dec neg IDIV INC DEC NEG
shl shr sal sar SHL SHR SAL SAR
and or xor AND OR XOR
not NOT
cmp CMP
je jne jl jle jge JE JNE JL JLE JGE
jmp call ret JMP CALL RET
push pop enter leave PUSH POP ENTER LEAVE
nop NOP
eax ebx ecx edx esi edi ebp esp
%eax %ebx %ecx %edx %esi %edi %ebp %esp
<number> (given by the regular expression [0-9]+)
t<number> (denoting a temporary register)
<ident> (given by the regular expression [a-zA-Z][a-zA-Z0-9_'$]*)
Identifiers are used for labels.
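The line-level and token-level rules above can be modeled in a few lines of Python. This is only an illustrative sketch, not the actual lexer (that is the .x specification). The names classify_line and classify_word are hypothetical, leading white space is assumed to be insignificant, a bare # is assumed to count as a comment, and keywords and register names are left out, so only the three open-ended token classes are distinguished.

```python
import re

# Regular expressions taken from the token list above.
NUMBER = re.compile(r"[0-9]+")
TEMP = re.compile(r"t[0-9]+")
IDENT = re.compile(r"[a-zA-Z][a-zA-Z0-9_'$]*")


def classify_line(line):
    """Return 'skipped', 'comment', 'pragma' or 'code' for one input line."""
    text = line.lstrip()  # assumption: indentation does not matter
    if text.startswith("."):
        return "skipped"  # assembler pragma, ignored by risc386
    if text.startswith("#"):
        rest = text[1:]
        if rest == "" or rest[0].isspace():
            return "comment"  # a bare '#' is treated as a comment (assumption)
        return "pragma"  # e.g. '#args'
    return "code"


def classify_word(word):
    """Classify a word as 'number', 'temporary', 'identifier' or None."""
    if NUMBER.fullmatch(word):
        return "number"
    if TEMP.fullmatch(word):
        return "temporary"  # checked before IDENT, which would also match
    if IDENT.fullmatch(word):
        return "identifier"
    return None
```

With these definitions, classify_line(".global Lmain") is 'skipped', classify_line("#args LOC 0") is 'pragma', classify_word("t1001") is 'temporary' and classify_word("LC$value") is 'identifier'.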
Parsing rules:
(If you want to know all of them, read the .y file.)
- The input file must be a sequence of procedures.
  There must be one procedure whose name ends in main.
  This one is taken as the entry point.
- Each procedure starts with a label and ends with a return
  instruction. Optionally, it can be preceded by a declaration
  of its arguments:
  #args REG %eax, LOC 0, LOC 4
  Lmyproc:
  ...
  RET
  Lmyproc expects its first argument in register %eax,
  its second at [%esp+0] and its third at [%esp+4].
  The stack addresses are to be taken before the CALL
  is executed (which will put the return address on the stack
  and shift the relative location of the arguments by +4).
- The body of each procedure is a list of i386 assembler
  instructions in Intel syntax. The supported instructions
  are listed above.
  Each instruction may be preceded by a label.
  Conditional and unconditional jumps are only allowed to
  a label, and only to one defined in the same procedure.
  Cross-procedure jumps or jumps to a calculated address
  are not supported.
  CALLs are only defined to a procedure label.
  risc386 assumes the cdecl calling convention.
- Restrictions for individual instructions:
  RET does not accept arguments.
  ENTER is only supported in the form ENTER <number>, 0.
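The #args rule above involves a small piece of offset arithmetic, which the following Python sketch spells out. It is an illustration under stated assumptions, not code from risc386. The names parse_args and callee_location are hypothetical, and the [%ebp+...] form assumes the callee starts with ENTER <number>, 0, which pushes the old %ebp (4 more bytes) before copying %esp into %ebp, as in the example program.

```python
def parse_args(pragma):
    """Parse '#args REG %eax, LOC 0, LOC 4' into a list of (kind, value)."""
    text = pragma.strip()
    if not text.startswith("#args"):
        raise ValueError("not an #args pragma: " + pragma)
    body = text[len("#args"):].strip()
    if not body:
        return []  # '#args' alone: no declared arguments
    args = []
    for item in body.split(","):
        kind, value = item.split()
        if kind == "REG":
            args.append(("REG", value))
        elif kind == "LOC":
            args.append(("LOC", int(value)))
        else:
            raise ValueError("unknown argument kind: " + kind)
    return args


def callee_location(arg, after_enter=True):
    """Where the callee finds an argument declared relative to the caller."""
    kind, value = arg
    if kind == "REG":
        return value  # CALL does not change general-purpose registers
    # CALL pushes the 4-byte return address: [%esp+n] becomes [%esp+n+4].
    if not after_enter:
        return "[%esp+{}]".format(value + 4)
    # ENTER then pushes the old %ebp and sets %ebp to %esp: [%ebp+n+8].
    return "[%ebp+{}]".format(value + 8)
```

For #args LOC 0, callee_location(("LOC", 0)) gives [%ebp+8], which matches the DWORD PTR [%ebp+8] access in LC$value in the example program.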
Runtime:
Execution specialties:
- risc386 supports 4 different types, all of size 32 bits:
  - Signed integers.
  - Heap addresses.
    Heap addresses consist of a base address, which was obtained
    by L_halloc, plus an offset. The offset must be a multiple of 4.
  - Stack addresses.
    %esp and %ebp may only be loaded with stack addresses.
  - Return addresses.
    These get pushed onto the stack by a CALL.
    RET checks that a return address lies on top of the stack
    before returning. The content of the return address is
    ignored; RET jumps back to the procedure where the matching
    CALL was issued.
- CMP is the only command that sets flags.
- CALL saves all temporary registers, RET restores them.
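The four value kinds and two of the checks listed above can be modeled as tagged values. The following Python sketch is an illustration under assumptions, not the simulator's actual data types. The class names, the load_register helper, and the choice of ValueError and TypeError for violations are all hypothetical; the text above does not say how risc386 reports these errors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Int:
    """A signed integer value."""
    value: int


@dataclass(frozen=True)
class HeapAddr:
    """A base obtained from L_halloc plus an offset."""
    base: int  # hypothetical identifier of the allocated block
    offset: int

    def __post_init__(self):
        # The offset must be a multiple of 4.
        if self.offset % 4 != 0:
            raise ValueError("heap offset must be a multiple of 4")


@dataclass(frozen=True)
class StackAddr:
    """An address on the stack."""
    offset: int


@dataclass(frozen=True)
class ReturnAddr:
    """Pushed by CALL; RET only checks that one is on top of the stack."""
    caller: str  # hypothetical: name of the calling procedure


def load_register(registers, name, value):
    """Store value in a register, enforcing the %esp/%ebp restriction."""
    if name in ("%esp", "%ebp") and not isinstance(value, StackAddr):
        raise TypeError(name + " may only hold a stack address")
    registers[name] = value
```

In this model, HeapAddr(0, 6) is rejected because 6 is not a multiple of 4, and load_register(regs, "%esp", Int(0)) is rejected because %esp may only hold a stack address.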