Nand2Tetris Project 6: The Assembler
Background
Low-level machine programs are rarely written by humans. Typically, they are
generated by compilers. Yet humans can inspect the translated code and
learn important lessons about how to write their high-level programs better, in
a way that avoids low-level pitfalls and exploits the underlying hardware better. One of the key players in this translation process is the assembler -- a
program designed to translate code written in a symbolic machine language
into code written in binary machine language. This project marks an exciting landmark in our Nand to Tetris odyssey: it
deals with building the first rung up the software hierarchy, which will
eventually end up in the construction of a compiler for a Java/C++ like highlevel
language. The relevant reading for this project is Chapter 6. Some of the
useful tools available include, the Hack Assembler, the CPU Emulator and
working versions of the three
programs, bin/parser, bin/translator and bin/disassembler. The CPU
Emulator uses its own builtin disassembler to display memory containing
instructions. Objective
The Hack assembler is a relatively simple program however, so that you can
gain experience with the tools used in other workshops and assignments, you
will build your assembler from three parts. This will involve using a
precompiled tokeniser for the Hack assembly language to implement
a parser that recognises labels, A-instructions and C-instructions using
tokens returned by the tokeniser. The parser will construct a tree
representation of the program that the translator will walk over in order to
assemble the final machine code. The disassembler will use a precompiled
tokeniser for Hack machine code to construct a Hack assembly
language program. Compiling and Running Your Programs
You programs can be compiled with the following command. % make notest
To compile all three programs and test your programs at the same time you
can use the following command:
% make
Eventually the output should look like this:
Notes: The reminder to run svn update and svn commit should appear every
hour. It is important to always run svn update when you start work and svn
commit when you make logbook entries or take a break. The reminder to write in your logbook should appear every half hour. If
you logbook does not record your progress as you work your final mark
for an assignment may be capped at 25 or in some cases reduced to 0. The Test column can be used to refer to a specific test if you wish to
inspect the actual output from the test. The Program column tells you what program was running including any
command line arguments it was passed. The Output column shows you whether or not your program produced
the expected standard output (file 1). If it did the column will have a green
background, if it did not it will have a red background and if the test does
not care, a ? will be displayed. The Errors column shows you whether or not your program produced
the expected standard error (file 2). The Status columns shows you the exit status returned when your
program terminated.
The Test Result column shows you whether or not your program
passed the test overall. This will either be Test Passed in green or Test
Failed in red. The Description column may tell you something about the nature of the
test or the input. The Tokeniser
You must use the precompiled tokeniser functions provided by the library in
the startup files. The tokens recognised and the tokeniser functions are
described in the file includes/tokeniser.h. All error handling is silent, if the
tokeniser encounters a character that cannot be part of a token or a character
that is not whitespace, it simply returns the tk_eoi token. Once
the tk_eoi token is returned all future attempts to read a token will also return
the tk_eoi token. A set of extra token kinds are also provided that can be used with
the mustbe() and have() functions to check for membership of a particular
group of tokens. For example, the token kind tk_dest can be used to check if
a token is a destination component of a C instruction, ie one
of tk_dest_A, tk_dest_M, tk_dest_D, tk_dest_MD, tk_dest_AM, tk_dest_A
D, tk_dest_AMD or tk_null. All of the available groups are described in the
file includes/tokeniser.h. The Assembly Language Grammar
The following table describes the structure of an assembly language program
that must be translated into machine code. All comments and whitespace, except for newline characters, will have been removed by the tokeniser. Therefore these rules are defined purely in terms of the tokens returned by
the tokeniser. Term Definition
program ::= line* tk_eoi
line ::= instruction? tk_eol
instruction ::= tk_label | A-instr | C-instr
A-instr ::= tk_name | tk_number
C-instr ::= tk_dest? tk_alu_op tk_jump?
Notes: in a definition the or operator | separates alternative components
in a definition the question mark character ? indicates that the
preceding component may appear 0 or one times
in a definition the star character * indicates that the preceding
component may appear 0 or more times
Abstract Syntax Tree
An abstract syntax tree is just a tree like data structure that holds a
representation of a program. In the case of Hack Assembly language, the
abstract syntax tree consists of a root node which has one sub-tree for every
label, a-instruction or c-instruction in the program. In addition to the ainstruction
node there is also an a-name node to represent a-instructions
where the value is still described by a variable or label name. Descriptions of
the functions required to create abstract syntax tree nodes and to access their
contents can be found in the file includes/abstract-syntax-tree.h. All abstract syntax tree nodes are immutable - they cannot be changed - so a
new node cannot be created until all of its sub-trees have been created. This
makes it impossible to construct loops in a tree which greatly simplifies their
implementation. All tree nodes are referenced by a value of type ast which
appears to be a pointer but is really an integer index into a private table. All
tree nodes remain in existence for the lifetime of the running program that
creates them. Parser
The skeleton parser.cpp file has most of the parsing structure already written, you just need to complete the functions that parse labels, A-instructions and
C-instructions. The same parsing technique used in workshop 05 is used here
but, it works with tokens rather than characters. The main() function of
the parser program prints out an XML representation of the abstract syntax
tree using the precompiled library function ast_print_as_xml(). You do not
need to know how to write out XML. XML is produced by the parser because
it is easy to read by humans.
If your parser encounters any errors, it must exit immediately and have not
produced any output. Examples of errors include, multiple definitions of a
label, an A-instruction with a number that is too large, or a C-instruction with
multiple destination components, multiple jump components or no ALU
component. In this two program structure, the parser will not be able to
directly detect the first two kinds of errors. Multiple definitions of a label will
not be detectable until the translator walks the abstract syntax tree and a
number that is too large will cause the tokeniser to report that it is at the end
of the input. The other errors can be detected by the parser and should result
in an immediate exit with no output. The iobuffer.h include file gives details of functions you can use to manage
when output is produced by your programs. Translator
The executable program, translator, will read the abstract syntax tree and
assemble a machine code representation of the original Hack Assembly
Language program. The assembled code will be formatted as sixteen zeros or
ones per line and it must be written to standard output using
the write_to_output() function described in includes/iobuffer.h.
The translator program uses the precompiled function ast_parse_xml(), to
read the output of the parser program. You do not need to know how to read
in and interpret XML. The later workshops and assignments will use this
technique to allow their compilers to be written as separate programs. One advantage of this two program approach is that walking over the tree
more than once is really simple. The skeleton translator.cpp file includes an
example of how to walk over the abstract syntax tree produced by the parser. This is important in the translator when it comes to resolving symbols. When
a program includes an A-instruction such as @somename you cannot tell
if somename is a label or a variable until the entire program has been
inspected because the label definition may come later. This will be true for
jumps to a later part of a program. We solve this problem by walking over the
abstract syntax tree and record all the labels that we can find. This is Pass 1. When we walk over the abstract syntax tree a second time, Pass 2, every
label name will already have been defined. Therefore, all undefined names
found in Pass 2 must be variable names.
If your translator encounters any errors, it must exit immediately and have
not produced any output. It may be interesting to reflect on why we did not
include missing label definitions amongst the list of errors to be detected. Lookup Tables
Your translator program will need a way of generating the correct set of bits
for each part of a C-instruction. For example, "JMP" needs to be mapped to
"111". To do this you will need some sort of lookup table to record these
mappings. The translator program also needs a lookup table so that it can
implement a symbol table to record label addresses and the memory locations
for variables. You should use the precompiled symbol table classes described
in the includes/symbols.h file to create any lookup tables that you require. The includes/symbols.h file includes examples of how to use these lookup
tables. You will need several lookup tables, one for the symbol table and one for
each component of a C-instruction. Although most parts of a C-instruction are
unique, the M, A and D registers need to be mapped to different sets of bits
depending on whether they are used as an alu operation or a destination. For
example, "A" as a destination would be "100" but as an alu operation it would
be "0110000". Disassembler
The disassembler program uses its own tokeniser, which returns one token
per instruction to parse a Hack machine code representation of a program. Details of the tokens, tokeniser functions and the grammar that is recognised
are all described in the includes/tokeniser-d.h file. The new Assembly language code that you generate must be formatted as
follows and must be written to standard output using
the write_to_output() function described in includes/iobuffer.h:
Labels are on their own line, the opening bracket, '(', must be at the
start of the line. A-instructions are on their own line prefixed by 8 spaces. C-instructions are on their own line prefixed by 8 spaces. If the jump or destination components of a C-instruction are null, they
are not written out. If the alu operation component of a C-instruction is not recognised, use
the string "< ** UNDEFINED ALU OPERATION ** >"
Apart from the 8 space prefixes and spaces within the alu error
message, no other white space can appear on a line of output. No symbols
If the disassembler program is passed the command line argument "N",
the disassemble_no_symbols() function will be called. This function will
output all A-instructions in numeric form and there will be no labels or variable
names in the output. Symbols
If no command line arguments are passed to the disassembler program,
the disassemble_symbols() function will be called and A-instructions must
be output in symbolic form where possible. If there is a choice, always output
the label name. The rules for generating label and variable names are as
follows:
Label Names
If an A-instruction is immediately followed by a C-instruction with a jump (the
jump bits are not "000") and the A-instruction contains the ROM address of
one of instructions in the program, the A-instruction must be written out using
the label name for that instruction. The first instruction in the program that is
the target of a jump is given the label "L0", the next instruction that is a target
is given the label "L1", and so on. This will require a list of label names to be
kept and two passes over the program's machine code. RAM Addresses
If an A-instruction is immediately followed by a C-instruction that reads or
writes RAM and the address in RAM can be given a name, the A-instruction
must be written out using that name.
If the RAM address has a predefined name in Assembly language, then that
name must be used. Note: addresses 0 to 4 must be named SP, LCL, ARG, THIS or THAT rather than R0, R1, R2, R3 or R4.
If the RAM address is in the range 16 to 255 it may be the address of a
variable. This will require knowing the address of the next free variable in
memory. Remember that in the Hack Assembly Language, the first variable is
allocated address 16, the next 17 and so on. So, if the address is at least 16
and it is less than or equal to the address of the next free variable, the
address refers to a variable and the A-instruction must be written out using
the variable's name. If the address is the address of the next free
variable, the address of the next free variable is also incremented by 1. Variable names can be calculated from the address by subtracting 16 and
prefixing the result with "v_". The first variable is at address 16 and is named
"v_0", the next is at address 17 and is named "v_1", and so on. Disassembler Lookup Tables
Your disassembler program will need a way of generating the correct strings
for each part of a C-instruction. For example, for a jump , "111" needs to be
mapped to "JMP". This can be easily done by copying the table initialisation
code from your translator program and swapping the order of the arguments. For the predefined symbols do not include the mappings for "R0" to "R4" so
that the names "SP", "LCL", "ARG", "THIS" and "THIS" will be returned
instead. Test Programs
The tests in the tests directory can run individually if necessary. If you fail a
test and want to see what was actually produced by your program you can run
the test in a number of ways. In the following examples
the translator program has an error when trying to assemble D+A:
You can run a single test and also see the live output too:
In this case the translator program has some trace writes added at the start
of each assembly pass. Trace writes are disabled when testing your programs
but the test script also runs your program with trace writes enabled to help
you with debugging. If there are trace writes a star will be displayed next to
the F* or P* in the Output and Error columns. To see the differences between the expected and actual output we
can show the test. To use less to view the output you can use less instead
of show. This example shows the instruction with the error, the + line is
output that is not expected and the - line is a line of output that is missing.
To see more detail such as as the names of the input files, names of the test
output files and any trace writes, we Show the test. To use less to view the
output you can use Less instead of Show. This example shows the
differences between the program output with and without the traces writes.
In addition to the tests in the tests directory, the web submission system will
run a number of secret tests that may contain various errors. Note: these
tests are secret, if your programs fail any of these secret tests you will
not receive any feedback about these secret tests, even if you ask!
The Pong program supplied in the tests directory was written in the Java-like
high-level Jack language and translated into the Hack assembly language by
the Jack compiler and a Hack Virtual Machine translator. (The Virtual Machine, Jack and the Jack compiler are described in Chapters 7-8, 9 and 10-11, respectively). Although the original Jack program is only about 300 lines of
Jack code, the executable Pong code is naturally much longer. Running this
interactive program in the supplied CPU Emulator is a slow affair, so don't
expect a high-powered Pong game. This slowness is actually a virtue, since it
enables your eye to track the graphical behavior of the program. And don't
worry! as we continue to build the software platform in the next few projects, Pong and and other games will run much faster.