,MIPS。
Introduction
Each semester I develop one or more projects that require you to implement many of the design patterns and APIs discussed in class. The project for this semester is an instruction-level simulator of a common CPU instruction set architecture called MIPS.
MIPS is a RISC (reduced-instruction set computer) architecture adaptable to a wide-variety of applications, ranging from embedded systems (e.g. PIC32) and game consoles (e.g. PlayStation), to high-end servers (e.g. ARMv8). We will be simulating the first version, MIPS Release I, without co-processors. This CPU is ideally suited to our purposes since there are (relatively) few instructions with only three instruction formats and two addressing modes. It uses a 32-bit word size and has 32 general-purpose registers.
Our simulator, which we will call simmips , will read a defined subset of MIPS assembly source code and simulate program execution, similar, but not identical to, SPIM and MARS. The code and tests will be developed over the first part of the semester in milestones 0-2. We will then write a graphical debugger and visualization tool in milestones 3-4 during the last part of the semester.
The milestones will be used for grading and thus are motivation to not procrastinate. They correspond closely to the schedule of material that we will be covering in class. Your progress will be tracked through a series of private git repositories on GitHub.
The goal of this first milestone is to develop the lexer for our simulator. This will allow you to get accustomed to the course workflow, gain experience reading specifications, and refresh your programming skills. You will write a lexer for MIPS assembly source files, producing a sequence of tokens.
This assignment requires very little C++ experience beyond basic data structures. If you struggle with this assignment you are woefully ill-prepared for this course. This means you should devote additional time to it and seek advice from the TAs and myself, or as a last resort drop.
Recommended Background Reading
- Section 4 of “MIPS Assembly Language Programming using QtSpim”, by Ed Jorgensen, 2016.
- Lexing
- CMake Workflow on Windows
Basics of MIPS Assembly
The role of a compiler is to translate from a higher-level representation of a computation, called the source language, to a lower one, called the target language. Traditionally, when you are programming in a high-level language like C or C++, the target language is assembly. Individual files, or modules, are then passed through another compiler, called an assembler to produce object (machine) code. Multiple module object code is combined (usually with pre-compiled library code) by the linker to produce the final executable machine code. Modern compilers typically subsume high-level compilation and assembly into a single program, although they can emit intermediate languages. Some high-level languages (e.g. Java, C#, Lua, PLT scheme) compile to a portable intermediate language called bytecode, which is then compiled to machine code as needed (so-called just-in-time or JIT).
At each level of the compilation stage, the code moves from syntax that is more human understandable to less, while becoming more efficient for hardware execution. Assembly is the last language in this process that is symbolic enough that humans can interpret it without serious effort (one can decode and write/understand machine code, but it is a lot of work). Assembly language is expressed in plain text files called assembly source files. Like all languages it has a specific syntax, what constitutes a well-formed sequence of characters, and semantics, what the program does (it’s side effects).
Our simulator will only simulate execution of a single module of code; thus, we will have no need for a linker. It will also not handle conditional assembly or macros.
It will be helpful to see an example first before diving into the details of the assembler source format in later milestones. So what follows is a simple program with some C-like translations.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
| # this is a comment, any characters after # up to newline are ignored .data # start the data section, where variables are declared
var1: .byte 1 # char var1 = 1; var2: .half 65526 # short var2 = 65526; var3: .word 1 # int var3 = 1; var4: .word 2 # int var4 = 2; var5: .word 0 # int var5 = 0; .text # start the code section, a sequence of instructions main: # compute var5 = var3 + var4;
lw $t0, var3 lw $t1, var4 add $t3, $t0, $t1 sw $t3, var5
|
Some initial things of note. Comments begin with the # character and are not semantically important to the execution of the program. The program contains several directives to the compiler. These begin with a . character. The most prominent are .data and .text which control if the assembler is laying out or initializing memory, or processing computations as instructions. Also notable are labels, strings that end in the character : .
Lexing the Assembly Source
MIPS assembly programs, like that above, are written in plain ASCII text. In the assembly file format there are several syntactically meaningful tokens:
- EOL (LF or ASCII code point 0x0a) indicates the end of a source line has been reached (no value)
- SEP (a comma or ASCII code point 0x2c) is the separator for lists (no value)
- OPEN_PAREN (an open parenthesis or ASCII code point 0x28) indicates the start of an indirect address (no value)
- CLOSE_PAREN (a close parenthesis or ASCII code point 0x29) indicates the end of an indirect address (no value)
- STRING_DELIM (double quote or ASCII code point 0x22) is used to wrap literal strings (no value)
- EQUAL (an equal sign or ASCII code point 0x3d) is used in a constant assignment (no value)
- STRING is any other white-space (ASCII code points 0x020, 0x0d, 0x09, 0x0b, or 0x0c) separated string (value is the string)
The job of the lexer is to convert the input text to a sequence of these tokens. It should discard any comments during this process by ignoring any input from the character # up to a newline. It should also record the line number from the file the token appeared on.
The interface is a C++ free function with the signature:
1 2 3 4
| into a sequence of tokens. */ TokenSequence lexer(std::istream amp;);
|
This function is declared in the file lexer.hpp and should be implemented in the file lexer.cpp . A TokenSequence is a list of type Token
1
| typedef std::listlt;Tokengt; TokenList;
|
where a Token is a type, a line number where the token originated, and the string value
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
| enum TokenType #123; EOL, SEP, EQUAL, OPEN_PAREN, CLOSE_PAREN, STRING_DELIM, STRING, ERROR #125;;
class Token #123; public: Token(TokenType type, std::size_t line);
Token(TokenType type, std::size_t line, const std::string amp;value);
TokenType type() const;
std::size_t line() const;
std::string value() const; #125;
|
The Token and TokenSequence types are defined and implemented in the module token.hpp / token.cpp .
Example
Consider the following contrived example of an assembly source file.
1 2 3 4 5 6 7
| .data LENGTH = 10 array: .space LENGTH str: .asciiz quot;the (end)quot; .text main: lw $t0, array lw $t1, ($t0)
|
Would produce the following sequence of tokens, written as (type, value), where the line breaks are not significant.
(STRING,quot;.dataquot;) (EOL,quot;quot;) (STRING,quot;LENGTHquot;) (EQUAL,quot;quot;) (STRING,quot;10quot;) (EOL,quot;quot;) (STRING,quot;array:quot;) (STRING,quot;.spacequot;) (STRING,quot;LENGTHquot;) (EOL,quot;quot;) (STRING,quot;str:quot;) (STRING,quot;.asciizquot;) (STRING_DELIMETER,quot;quot;) (STRING,quot;the (end)quot;) (STRING_DELIMETER,quot;quot;) (EOL,quot;quot;) (STRING,quot;.textquot;) (EOL,quot;quot;) (STRING,quot;main:quot;) (STRING,quot;lwquot;) (STRING,quot;$t0quot;) (SEP,quot;quot;) (STRING, quot;arrayquot;) (EOL,quot;quot;) (STRING,quot;lwquot;) (STRING,quot;$t1quot;) (SEP,quot;quot;) (OPEN_PAREN,quot;quot;) (STRING, quot;$t0quot;) (CLOSE_PAREN,quot;quot;) (EOL,quot;quot;)
Note in particular that any text other than ASCII code 0xa in between pairs of “ are treated literally (not as potential tokens).
The lexer should detect the following errors (only):
- A string must consist of matching “ and cannot span multiple lines.
- A pair of ( and ) must match and cannot span multiple lines.
For example the following inputs would produce lexing errors.
1 2 3 4 5 6
| .data var: .ascii quot;this is a stringquot; .text lw $t0, ($t1 )
|
On the first error detected the last token in the returned sequence should have the type ERROR and the value should be an informative error message that starts with the string Error: . For example the token sequence resulting from the second error case above might be
(STRING,quot;.textquot;) (EOL,quot;quot;) (STRING,quot;var:quot;) (STRING,quot;lwquot;) (STRING,quot;$t0quot;) (SEP,quot;quot;) (OPEN_PAREN,quot;quot;) (ERROR,quot;Error: unmatched paren on line 2quot;)
Instructions
Accept the GitHub invitation above and then clone the git repository to your local machine. The initial git repository contains several files. The types for TokenSequence are defined for you in the header lexer.hpp . Your task in this milestone is to implement the function lexer in the file lexer.cpp . You can define additional units of code (classes, helper functions) in lexer.cpp , but do not change the provided type definitions, nor the lexer function signature.
Unit tests for your function are provided in the file test_lexer.cpp using the Catch testing framework (see meeting 7). Do not modify them (other than to perhaps add temporary debugging output). The included CMakeLists.txt file sets up building these tests for you. Just configure the build, run the build, and then run the tests. You should use git to commit versions of your program source (only) as you go along, demonstrating good incremental programming technique.
Steps to build and run the tests in the reference environment (after vagrant ssh). To configure the build
cmake /vagrant
To run the build make or
cmake --build .
To run the unit tests make test or
cmake --build . --target test
The test output is placed in ``Testng/Temporary/LastTest.log. You can also just run the tests directly:
./unit_tests
To configure and run the build in strict mode (increased warnings, warnings become errors)
cmake -DSTRICT=True /vagrant make clean; make
Submission
To submit your assignment:
- Tag the git commit that you wish to be considered for grading as “final”.
- Push this change to GitHub
If you need to tag a different version of your code as final (before the due date) simply create and push a new tag appending a monotonically increasing number to final, e.g. final2, final3, etc.
Be sure you have committed all the changes you intend to. It is a good idea to re-clone your repository into a separate directory and double check it is what you intend to submit. Failure to complete these steps by the due date will result in a failed submission.