CSC 230 Project 3
Shell Pattern Matcher
Shell programs like the Bourne shell or bash (the default shell on the common platform) support a pattern matching syntax. This
is used for matching filenames or other syntax in shell commands. It lets us use a question mark to match any single character,
or an asterisk to match any sequence of zero or more characters. Other characters just match themselves. For this project, you're
going to write a program that uses the shell pattern syntax to match lines from a given file.
Our program will be called match. You can run it as follows to match lines from a file named file-d.txt that end with ".html".
The sample input file-d.txt is provided with the starter. It contains a bunch of randomly generated filenames.
$ ./match '*.html' file-d.txt
special-blue-monster.html
little-orange-shoe.html
old-pink-hat.html
special-red-shoe.html
first-gray-shoe.html
dirty-pink-car.html
big-blue-rock.html
special-pink-monster.html
dirty-gray-hat.html
special-green-rock.html
big-blue-car.html
first-orange-rock.html
old-red-car.html
special-red-rock.html
old-gray-monster.html
Or, you can run the match program as follows, to match all the lines from this file that have three characters between two dashes
and a 3-character sequence at the end after a dot. For this particular input file, the only lines that match have the word "red" in
the middle and end with either ".exe", ".txt", ".mp3" or ".jpg". The "-n" option tells the program to report line numbers with each
line it matches.
$ ./match -n '*-???-*.???' file-d.txt
3 little-red-shoe.exe
14 first-red-hat.mp3
16 first-red-monster.exe
19 first-red-shoe.txt
26 dirty-red-book.exe
29 little-red-rock.mp3
38 little-red-car.txt
40 old-red-hat.jpg
48 old-red-rock.exe
51 dirty-red-rock.jpg
58 special-red-monster.exe
64 dirty-red-car.txt
65 first-red-rock.txt
88 dirty-red-monster.txt
92 little-red-rock.jpg
100 first-red-book.mp3
As in project 2, you'll be developing this project using git for revision control. You should be able to just unpack the starter into
the p3 directory of your cloned repo to get started. See the Getting Started section for instructions.
This homework supports a number of our course objectives. See the Learning Outcomes section for a list.
Rules for Project 3
You get to complete this project individually. If you're unsure what's permitted, you can have a look at the academic integrity
guidelines in the course syllabus.
In the design section, you'll see some instructions for how your implementation is expected to work. Be sure you follow these
rules. It's not enough to just turn in a working program; your program needs to follow the design constraints we've asked you to
follow.
Requirements
You're going to write a program named match. It will let the user match lines from a text input file against a given pattern. It also
supports command-line options to include line numbers in the output and to invert the match (just printing lines that don't match
the pattern).
2021/10/11 CSC230 Project 3 –
https://www.csc2.ncsu.edu/courses/csc230/proj/p3/p3.html 2/13
Input and Output
The match program will print a report of matching lines to standard output. It will read input from a file given on the commandline.
The input file will always be the last command-line argument. For example, when run as follows, the program will be reading
input from a file named file-c.txt:
$ ./match 'a???w' file-c.txt
When reporting matching lines to standard output, each output line should be limited to at most 80 characters. This is a standard
width for terminal windows. If the output line requires more than 80 characters, you will print up to character 78, then print ".."
at the end to indicate that you weren't able to show the entire line. For example, consider an input file containing the following
two lines:
This is a sort line.
This is a much longer line, containing a bunch of extra characters that won't fit in just 80 columns of output.
If both of these lines matched the pattern, then they would be printed as follows. The entire first line will fit, but, for the second
line, we can only print the first 78 characters, then ".." to show there was more of this line that we weren't able to report.
This is a sort line.
This is a much longer line, containing a bunch of extra characters that won't ..
The 80-character limit is just a restriction on output length. The program can read longer lines from the input (it just can't report
the whole line). Input lines may be no more than 1023 characters long (not counting line termination and not counting the null
terminator that might be used to store a string). If the input file contains a line that's longer than this, the program will not print
anything to standard output. Instead, it will terminate with an exit status of 1 and print the following line to standard error:
Line too long
The program can report on up to 1000 matched lines. If there are too many matching lines, it will not print anything to standard
output. Instead, it will terminate with an exit status of 1 and print the following line to standard error:
Too many matches
Pattern Matching
The second-to-last command-line argument gives a pattern to match against lines from the input file. This pattern uses some of
the same syntax used by the shell when you match things like filenames. The pattern has to match the entire input line; it doesn't
count as a match if the pattern just matches a substring in the input line.
Inside a pattern, most characters are considered ordinary characters; they just match themselves. For example, the letter 'a' in a
pattern will just match the letter 'a' in a line from the input file. A pattern like "abc" will only match a line "abc" in the input file.
In a pattern, a couple of characters are considered special. These are sometimes called metacharacters in pattern matching.
These let you give patterns that can match multiple different lines from the input file:
The question mark can match any single character from a line. For example, a pattern like "a?c" would match any line that
started with 'a', ended with 'c' and with any single character in between. So, "a?c" would match the line "abc", "aac", "a-c",
etc. You can use multiple question marks to match arbitrary characters anywhere in the line. For example "?bc?" would
match any 4-character line with the letters "bc" in the middle.
*
The star will match any sequence of zero or more characters. For example, 'a*b' will match any line (of two characters or
more) that starts with 'a' and ends with 'b'. Or, the pattern "*x*" will match any sequence of characters that contains the
letter 'x'.
In our implementation, we won't permit more than one consecutive star in the pattern; a pattern with multiple stars in a row
will be considered invalid. Supporting multiple consecutive stars makes pattern matching a little more difficult (just a little
bit), but it doesn't give us any additional expressive power in the pattern. Two or more consecutive stars will match exactly
the same things as just one star. So we will reject patterns like this and simplify our pattern matching code a little bit.
There's also an optional extension to the pattern syntax that you can implement for up to 8 points of extra credit. It's described
below in the Extra Credit section below.
Since our program is using the same pattern matching syntax as the shell, we'll usually need to put quotes around a pattern when
we run our program. Otherwise, the shell will try to match the pattern against filenames in the current directory. For example, if
we tried to run our program as follows, the shell would replace the pattern with a list of all the files in the current directory. This
would happen before our program even started.
# This won't work.
$ match * input-file.txt
# This will protect the pattern from the shell.
$ match '*' input-file.txt
2021/10/11 CSC230 Project 3 –
https://www.csc2.ncsu.edu/courses/csc230/proj/p3/p3.html 3/13
Command-Line Options
The match program supports two optional command-line arguments, -n and -v. These may or may not be included on the
command line, and they may be given in any order (or, even, given multiple times). If given, they will occur before the pattern
argument. Giving either of these arguments multiple times is the same as giving that argument just once. So, "./match -n -v -n
'abc' file.txt" is the same as "./match -n -v 'abc' file.txt".
Line Number Report
If -n, is given on the command line, then output should give the line number where each matching line occurred in the input file.
Line numbers count from 1. The line number should be given right-justified at the start of each output line. It should be followed
by a space, then the contents of the matching line from the input.
In the output, the field width for the line number should be determined by the largest line number that actually occurs in the
output. In the following output, for example, we only need one space for the line number.
1 This is line one
5 This is line five
However, in the following output, we need to use a two-digit line number, so every output line will use two colums to report the
line number:
1 This is line one
5 This is line five
82 This is line eighty two
Or, if we had to report a line with a 3-digit line number, the output would look like:
1 This is line one
5 This is line five
82 This is line eighty two
902 This is line nine hundred and two
Notice that the width of the line number field depends on the lines that are actually reported, not just on the number of lines in
the input file. Also notice that the 80-character output limit applies to the entire output line (including the optional line-number
field). So, for example, if the line number takes three characters, then, with the space, you only have 76 characters remaining to
show the matching line from the input).
Match Inversion
If -v, is given on the command line, then the program should report lines that don't match the pattern, rather than lines that do.
Invalid Arguments
If the program is given an invalid pattern (one with two * characters in a row, or, for the extra credit, a pattern with a bad
character class), it will print the following line to standard error and terminate with an exit status of 1. In this error message, pat
is the pattern given on the command line:
Invalid pattern: pat
If the program is given invalid command line arguments, it will print the following usage message to standard error and then
terminate with an exit status of 1. Command line arguments would be invalid if some option other than -n or -v was given, or if
the pattern or filename arguments were not present:
usage: match [-n] [-v] pattern file
If the program is given an input file that can't be opened, it will print the following message to standard error and then terminate
with an exit status of 1. Here, filename is the name of the file given on the command line:
Can't open file: filename
Design
Your program will be implemented four components:
input.c and input.h
This component contains code to read lines of text from the input file.
list.c and list.h
This component contains code to save matching lines from the input file and report them when the program is done.
pattern.c and pattern.h
This is probably the most interesting component. It contains code to match lines of the input against the pattern.
2021/10/11 CSC230 Project 3 –
https://www.csc2.ncsu.edu/courses/csc230/proj/p3/p3.html 4/13
match.c
This is the top-level main component. It contains the main() function and uses the other components to implement the
program's functionality.
The following figure shows the dependency structure of our components. You can see we've tried to make sure there are no
dependency cycles among the components. The match component can use the other three, but pattern, list and input are all
independent. They don't need to use code provided by each other.
Figure: dependencies among the program's components
Functions
You'll implement and use the following functions, to break the program into smaller components and to help simplify your toplevel
match.c component. Remember that functions that are defined in one component and used in another should be prototyped
(and documented) in the header. You can add other functions as you like, to help simplify your implementation. Functions that
are for internal use in a component should be given internal linkage (static) and should not be prototyped in the header.
The input component will just provide one function that the main component will use:
bool readLine( FILE *fp, char line[], int capacity )
The readLine() function will attempt to read a single line of text from the given file and store it as a string in the given
character array. The capacity argument tells it how long the array is, so it can exit and print an appropriate error message if
one of the input lines is too long. It returns true if it is successful reading another line of text, or false if there are no more
lines to read.
The list component will provide the following two functions, for recording matching lines and reporting them at the end:
void addLine( int lno, char const line[] )
This function adds the given line to the list of matching lines. The line parameter gives the contents of the line and the lno
parameter gives the line number where this line occurred in the input.
void printList( bool numberFlag )
This function prints all the matching lines (the ones previously recorded via addLine). If numberFlag is true, it prints the line
number field at the start of each line. Otherwise, it just prints the text of the matching line.
The pattern component provides the following two functions for checking the validity of a pattern and for matching an input line
against a pattern.
bool validPattern( char const pat[] )
Given a pattern, this function will check to make sure the pattern is valid and return true or false accordingly. Unless you do
the extra credit, the only thing that will make a pattern invalid is if it has two star characters in a row.
bool matchPattern( char const pat[], char const line[] )
This function returns true if the given line of text matches the given pattern. The "Pattern Matching" section below gives you
some guidance about how to implement this.
Match List Representation
The list component is responsible for remembering all the matching lines. Inside, it will use (static) global arrays to remember the
text of all the matching lines, along with a line number for each matching line and any other state you need in order to remember
all the matching lines.
You can use arrays to record the text of the matching lines and their line numbers. The line number can just be stored in an array
of ints. The text of the lines themselves can be stored in a two-dimensional array of char, with a row for each line of text.
You won't need to use dynamically allocated memory for this project (we'll use that on project 4). Since there's a fixed limit to the
number of lines the program can store and the length of each line you'll have to report, you can make fixed-sized arrays to hold
all the information needed by the list component. If these limits are exceeded, you're just expected to print an error message and
terminate.
Global Variables
2021/10/11 CSC230 Project 3 –
https://www.csc2.ncsu.edu/courses/csc230/proj/p3/p3.html 5/13
Don't use any global variables other than the ones you'll use to represent the record of matching lines in the list component.
Functions should be able to get everything else they need from their parameters passed to them.
Pattern Matching
Matching an input line against a pattern will require working through each character of the input line and keeping up with how
much of the pattern has been matched so far. This will be kind of like matching two strings, but the special characters in the
pattern will make things a little more interesting. We'll really be using a type of finite state machine called a Nondeterministic
Finite Automaton (NFA). You'll learn all about these in CSC 333. For now, we can cover enough to be able to match simple
patterns like the ones we're working with.
State Machine Overview
We can model a pattern like "a?c" with the following finite state machine. The circles (states) represent how much of the pattern
we have matched so far. The arrows show what character has to occur in the input to get from one state to another. We'll match
a character from the input by following an arrow that's has the same symbol on it. A question mark on an arrow indicates that
you can get from one state to another by matching any single character.
NFA to match a?c
In this automaton, state zero indicates that you have not matched anything from the input line yet. You can get to state 1 after
matching a letter 'a'. From state 1, you can get to state 2 after matching any character. Getting from state 2 to state 3 requires
matching a 'c' from the input. The double circle on state 3 indicates that it is a final/accepting state. If we can match the whole
input line and get to this state, then the input line matches the pattern.
Imagine that we are matching a line containing "abc". Initially, if we haven't looked any characters in this line, the finite state
machine would be in state 0.
After matching the first character from "abc", the state machine would be in state 1.
After matching the next character from "abc", the state machine could transition to state 2, following the arrow with the question
mark on it to match the 'b' in "abc".
After matching the next character from "abc", the state machine would be in state 3. So, we reached a final state after matching
the whole input line. The pattern matches the input line, "abc".
State Machine Representation
2021/10/11 CSC230 Project 3 –
https://www.csc2.ncsu.edu/courses/csc230/proj/p3/p3.html 6/13
How can we implement a state machine line the one shown above? We don't really have to implement anything that looks like this
drawing, with circles and arrows. We can think of the text of the pattern itself as a representation of the state machine. This is
shown in the following figure.
We can think of each state as the index into a position in the pattern. In sate zero, we need to match the character 'a' (just like
state zero in the figure above). In state 1, we've already matched the "a" and we need match the question mark next. In state 2,
we've already matched the "a?" in the pattern and we need to match the 'c' next. In state 3, we've already matched the whole
pattern. Notice that we have one extra state at the end. The pattern is only three characters long, but we need states 0, 1, 2 and
3. From a given state, you can look at the corresponding above character of the pattern to see what you need to match next. For
example, if you're in state 2, you need to match a 'c' next in the input.
We will use an array of boolean flags to keep up with the state of the automaton. The picture above shows just the flag for state 2
set. This would indicate that we've matched the "a?" part of the pattern but we still need to match the 'c'. Why do we need an
array of flags for keeping up with the state? Couldn't we just use an integer to store the current sate? We'll need this to support
the '*' character in a pattern. To handle star, we will need to consider more than one possible state that the finite state machine
could reach. We'll use the array of flags to keep up with the set of states the machine could reach by matching a given sequence
of characters.
Matching With the Stars
Consider the pattern, "ab*d?f". The following figure shows how this pattern could be represented as a state machine. This figure
shows how to handle a star in a pattern.
After matching the 'b' right before the star, we can match any sequence of zero or more characters before matching the 'd' right
after the star. To match zero characters with the star, we can go straight from state 1 to state 3. This will require matching a 'b'
and then a 'd' immediately afterward. The arrow under state 2 lets us do that. To match exactly one character with the star, we
can go from state 1 to state 2, matching a 'b'. Then, we can match any one character going from state 2 to state 3, then we
would have to match a 'd' to get past state 3. To match more than one character with the star, we can take the loop above state
2 to match any number of characters before going to state 3 on the last character matched by the star.
When there's a star in the pattern, it means we will have choices as we traverse the state machine. You can see this in the figure
above. From state 1, there are two places we could go while matching a 'b'. From state 2, there are two places we could go while
matching any character. In general, after matching some part of an input line, there may be more than one state we could reach.
For example, on the input "abcdef", we could reach state 6 by going from through states 0 --> 1 --> 2 --> 3 --> 4 --> 5 --> 6.
Or, we could get to state 2 via 0 --> 1 --> 2 --> 2 --> 2 --> 2 --> 2. Or, we could reach state 3, by going 0 --> 1 --> 2 --> 2 --
> 2 --> 2 --> 3. As long as there's a way to reach the last state while matching all the characters of the input line, then the input
line matches the pattern.
To figure out if it's possible to reach the last state, we'll use flags to keep up with which states could be reached after matching
some of an input line.
Pattern Matching Example
The following figure shows how we'll represent the pattern "ab*d?f". Initially, before matching any characters from the input line,
we will be in state 0, so just the flag for state 0 is set:
2021/10/11 CSC230 Project 3 –
https://www.csc2.ncsu.edu/courses/csc230/proj/p3/p3.html 7/13
Let's say we're trying to match an input line, "abccdxcdxf". First we have to match the letter 'a'. From state 0 (the only flag that's
set in the picture above), matching an 'a' will take us to state 1. You can see this in the state machine figure; matching an 'a'
from state 0 takes you to state 1. You can also see this in the figure directly above. If you're in state 0, the next part of the
pattern to match is the letter 'a'. It's the character from the pattern right above the check mark. So after matching the first
character in the input line, state 1 is the only state we can reach:
Next, we have to match the 'b' from the input line. From state 1, matching a 'b' can take us to either state 2 or state 3. You an
see this in the finite state machine above. This is what will always happen when the next character in the pattern is a star. The
star can either match some characters from the input line (going to state 2 lets us do that), or it could match no characters from
the input line (going two states ahead, to state 3 lets us do that).
Next, we have to match a 'c' from the input line. From the picture above, we could get to either state 2 or state 3 after reading
just the first two characters of the input line. From state 2, matching the 'c' could take us to state 2 or state 3. From state 3, we
can't match the 'c'. So after matching the first three characters of the input line, we can still just get to state 2 or state 3:
The next character on the input line is another 'c'. Here, the same same happens as before. We could reach state 2 or 3 before
matching this 'c', so we can still reach states 2 or 3 after matching this 'c'.
Next, we have to match the 'd'. From state 2, matching a 'd' could take us to state 2 or state 3. From state 3, matching a 'd'
would take us to state 4. So, after matching the first five characters in the input line, "abccd", we could get to states 2, 3 or 4:
Next, we have to match an 'x'. From state 2, matching an 'x' could take us to state 2 or 3. From state 3, we can't match the 'x'.
From state 4, we could match an 'x' using the question mark and get to state 5. So, after matching "abccdx", we could get to
states 2, 3 or 5:
2021/10/11 CSC230 Project 3 –
https://www.csc2.ncsu.edu/courses/csc230/proj/p3/p3.html 8/13
Next, we have to match a 'c'. From state 2, matching a 'c' could take us to states 2 or 3. From state 3 and state 5, we can't match
a 'c', so after matching the first 7 characters from the input line, we can only get to state 2 or 3:
Next, we have to match a 'd'. From state 2, matching a 'd' can take us to state 2 or 3. From state 3, matching a 'd' takes us to
state 4, so we could reach either states 2, 3 or 4:
Next, we have to match an 'x'. From state 2, matching an 'x' could take us to states 2 or 3. From state 3, we can't match an 'x' .
From state 4, we can match the 'x' with the question mark to get to state 5. So, matching the first 9 characters of the input line
could get us to state 2, 3 or 5.
Finally, we have to match the 'f' at the end of the input line. From state 2, matching 'f' can take us to state 2 or 3. From state 3,
we can't match the 'f'. From state 5, matching the 'f' takes us to state 6, so matching the whole input line could let us reach state
2, 3 or 6:
Now, we're done with this input line. We've matched the whole line, and it's possible to reach the last state (state 6). So the input
line matches the pattern. If our program was printing lines that match this pattern, we would want to print this input line (or,
really, save it for printing later).
Pattern Matching Implementation
You'll implement your matchPattern() function like the example shown above. You'll use an array of flags (one element longer
than the pattern) to store the set of states you can reach as you match characters from the current input line. For each character
in the input line, you'll compute the new set of states you can reach after matching that character. You'll probably want two
arrays for this, one holding the set of states you could reach before the current character, the other holding the new set of states
you can reach after matching the current character.
The starter includes a reportFlags() function to help you see what's going on in your implementation as you're matching a pattern
against an input line. It's fairly simple. Given a string for a pattern and an array of state flags, it will print out out the flags below
the pattern, kind of like what's shown in the pictures above. For example, given the pattern and the state array for the last figure
in the example above, it would print the following output. The caret characters show the reachable states.
2021/10/11 CSC230 Project 3 –
https://www.csc2.ncsu.edu/courses/csc230/proj/p3/p3.html 9/13
ab*d?f
^^ ^
Patterns Starting with a Star
When you're matching a pattern, you normally start in state 0, before you've matched any characters from the input line. If the
pattern starts with a star, then the starting state will be either state 0 or state 1. Starting in state zero will let you match one or
more characters from the input line with the star. If you don't need to match anything with the star (it's OK for the star to match
zero characters), then starting in state 1 will let you do that. That way, you're not trying to modify an array while you're still
depending on the values it used to contain.
The following shows how we could match an input line of "12ab89" with the pattern "*ab*". Here, we're showing the output of the
reportFlags() function described above. The following shows the output of this function after matching each character from the
input line. I added a line above each call to reportFlags() to say how much of the input line had been matched.
Initial state(s)
*ab*
^^
After matching "1"
*ab*
^^
After matching "12"
*ab*
^^
After matching "12a"
*ab*
^^^
After matching "12ab"
*ab*
^^ ^^
After matching "12ab8"
*ab*
^^ ^^
After matching "12ab89"
*ab*
^^ ^^
At the bottom, we see that we can reach the last state after matching the whole input line, so the pattern matches the input line.
Magic Numbers
Be sure to avoid magic numbers in your source code. Use the preprocessor to give a meaningful name to all the important, nonobvious
values you need to use. For example, you can define constants for the maximum length of a line of the input file, or the
limit on the width of lines in the output. With some optional and some required arguments on the command line, it might be good
to have some constants for how many required arguments there need to be at the end. There are already standard constants you
should use for the exit status when your program terminates.
For constants that are explained right in the line of code where they occur, I wouldn't call these magic numbers. For example, if
you need to read two integers, you might write something like the following. Here, the value 2 wouldn't be considered a magic
number. It's explained right there in the format string, where we say we'd like to parse two integers. Defining a named constant
to use here would probably just make the program harder to maintain.
if ( fscanf( stream, "%d%d", &a, &b ) != 2 ) {
....
}
Extra Credit
There's some shell pattern syntax we aren't supporting in our application. For 8 points of extra credit, you can implement one
additional feature. In addition to the '?' and '*' special characters, the shell lets you match against sets of characters. The syntax
for this is like the character class syntax supported by printf and by regular expression. Inside a pattern, you can put any
sequence of characters inside square brackets. This will match any one of the given characters. So, for example, the pattern
"ab[cde]f' will match the strings "abcf", "abdf" or "abef". It won't match the string "abcdf", since the "[cde]" syntax only
matches a single character, not a sequence of characters.
The starter includes an extra test input file, file-ec-1.test for trying out extra credit test cases. It contains lines like
test-n.txt for numbers n from 1 to 100. If you've implemented the extra credit, you should be able to run your program as
follows to get all the strings for lines 1 .. 5.
2021/10/11 CSC230 Project 3 –
https://www.csc2.ncsu.edu/courses/csc230/proj/p3/p3.html 10/13
$ ./match 'test-[12345].txt' file-ec-1.txt
test-1.txt
test-2.txt
test-3.txt
test-4.txt
test-5.txt
Or, you can run it as follows, to match any numbers that start with 3 or 4 and end with 6, 7 or 8.
$ ./match 'test-[34][678].txt' file-ec-1.txt
test-36.txt
test-37.txt
test-38.txt
test-46.txt
test-47.txt
test-48.txt
It's OK for characters ? and * to occur inside a character class, but they just match occurrences of ? or * (i.e., these two
characters are not considered special when they occur inside a character class). A character class must start with a [ and end
with ] and it must not contain square brackets inside. Every occurrence of ] in the pattern much match a previous occurrence of [
(so, you can't nest character classes and the pattern can't end inside a character class). If a pattern violates these rules, your
program will handle it just like a pattern with two * characters in a row. For example:
$ ./match 'test-[34][678.txt' file-ec-1.txt
Invalid pattern: test-[34][678.txt
Build Automation
You get to create your own Makefile for this project (called Makefile with a capital 'M', no filename extension). Its default target
should build your match program, compiling each source file to an object file, then linking to produce an executable. Your Makefile
should correctly describe the project's dependencies, so if just one source file changes it rebuilds just the parts of the project that<