CMSC 421 Assignment One
Neural Networks and Optimization
September 12, 2023
General Instructions. Please submit TWO (2) files to ELMS:
(1) a PDF file that is the report of your experimental results and answers to the questions.
(2) a codebase submission in the form of a zip file including only the code folders/files you modified and
the Questions folder. Please do not submit the Data Folder we provided. The code should contain
your implementations of the experiments and code for producing visualizations of the results.
The project is due at 11:59 pm on September 26 (Monday), 2023.
Please read through this document before starting your implementation and experiments. Your score
will be mostly dependent on the completion of experiments, the effectiveness of the reported results,
visualizations, the consistency between the experimental results and analysis, and the clarity of the
report. Neatness and clarity count! Good visualization helps!
Because you will need PyTorch for the second half of the programming assignment (Section 4, Convolutional
Neural Networks - 15 Points), we have included links to some tutorials and documentation to help
you get started with PyTorch:
• Official Pytorch Documentation
• Quickstart Guide
• Tensors
• Data Loading
• Building models in Pytorch
Implementation Details
For each problem, you’ll need to code both the training and application phases of the neural network.
During training, you’ll adjust the network’s weights and biases using gradient descent. Use a single
parameter, η, to control the step size during gradient descent. The updated weights and biases will be
calculated as the old values minus the gradient multiplied by the step size.
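The update rule described above can be sketched in a few lines of numpy. This is only an illustration on a toy objective, not the provided template code; the helper name `gradient_step` and the toy function are assumptions made for the example.

```python
import numpy as np

def gradient_step(params, grads, eta):
    """Return updated parameters: old value minus the gradient times the step size eta."""
    return [p - eta * g for p, g in zip(params, grads)]

# Toy objective (hypothetical): f(w, b) = 0.5 * (w - 3)^2 + 0.5 * (b + 1)^2,
# whose minimum is at w = 3, b = -1.
w, b = np.array([0.0]), np.array([0.0])
eta = 0.1  # step size
for _ in range(200):
    grad_w, grad_b = w - 3.0, b + 1.0  # analytic gradients of the toy objective
    w, b = gradient_step([w, b], [grad_w, grad_b], eta)

print(w, b)
```

A larger η reaches the minimum in fewer iterations on this toy problem, but on real networks a step size that is too large can make the loss diverge, which is part of what the hyperparameter questions below ask you to explore.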
We will be providing code snippets and datasets for some parts of the assignment. You will be required
to read the comments in the code file and fill in the missing pieces in the code files to correctly execute
these files. Please ensure that you read through all the code files we provide. These will be available
in the CMSC421 - Fall2023 GitHub repository.
Part 1: Programming Task - (50 Points)
Objective
The goal of this assignment is to build a neural network from scratch, focusing on implementing the
backpropagation algorithm. You’ll apply your neural network to simple, synthetic datasets to gain
hands-on experience in tuning network parameters.
Language and Libraries
Python is mandatory for this assignment. Use numpy for all linear algebra operations. Do not use
machine learning libraries like PyTorch or TensorFlow for Questions 1, 2, and 3; only numpy, matplotlib,
and Python built-in libraries are permitted.
1 Simple Linear Regression Model - (10 Points)
1.1 Network Architecture
• The network consists of an input layer, a hidden layer with one unit, a bias layer, and an output
layer with one unit.
• The output is a linear combination of the input, represented as a1 = Xw0 + a0 + b1.
1.2 Loss Function
Use a regression loss for training, defined as
$$\frac{1}{2}\sum_{i=1}^{n} \left(y_i - a^1(x_i)\right)^2$$
where $a^1(x_i)$ denotes the network output a1 for input $x_i$.
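As a sanity check for your own implementation, the regression loss and its gradients for a purely linear model can be written directly in numpy and compared against a finite-difference estimate. The sketch below assumes a model of the form `X @ w + b`; the function names are hypothetical and are not part of the provided template.

```python
import numpy as np

def linear_forward(X, w, b):
    """Network output for each row of X: a linear combination of the inputs plus a bias."""
    return X @ w + b

def half_sse_loss(X, y, w, b):
    """Regression loss: 0.5 * sum_i (y_i - prediction_i)^2."""
    return 0.5 * np.sum((y - linear_forward(X, w, b)) ** 2)

def half_sse_grads(X, y, w, b):
    """Analytic gradients of the loss with respect to w and b."""
    residual = linear_forward(X, w, b) - y  # shape (n,)
    return X.T @ residual, np.sum(residual)

# Synthetic data, made up for this example only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3
w, b = np.zeros(3), 0.0
grad_w, grad_b = half_sse_grads(X, y, w, b)

# Central finite difference on the first weight as a gradient check.
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0] += eps
w_minus[0] -= eps
numeric_grad_w0 = (half_sse_loss(X, y, w_plus, b) - half_sse_loss(X, y, w_minus, b)) / (2 * eps)
print(grad_w[0], numeric_grad_w0)
```

The same finite-difference check is a useful debugging tool for the backpropagation code in the later problems.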
1.3 Implementation
Using the template_for_solitions file, write code to train this network and apply it to both
1D data (as q1_a) and higher-dimensional data (as q1_b).
• Data Preparation: Use the q1_ function from the Data.generator module to generate
training and testing data. The data module has both a and b variants, so use the appropriate function
call to fetch the right data for each experiment.
• Network Setup: Use the net_setup method in the Trainer class to initialize the network, loss
layer, and optimizer.
• Training: Use the train method in the Trainer class to train the network. Plot the training
loss over iterations.
• Testing: Use the test data to evaluate the model’s performance. Plot the actual vs. predicted
values and compute evaluation metrics.
Tests and Experiments
1.4 Hyperparameters
• The main hyperparameters are the step size (η) and the number of gradient descent iterations.
• You may also have implicit hyperparameters like weight and bias initialization.
Hyperparameter Tuning
Discuss the difficulty level in finding an appropriate set of hyperparameters.
2 A Shallow Network - (10 Points)
The goal of this assignment is to implement a fully connected neural network with a single hidden
layer and a ReLU (Rectified Linear Unit) activation function. The network should be flexible enough
to accommodate any number of units in the hidden layer and any size of input, while having just one
output unit.
2.1 Network Architecture
The network consists of an input layer, a hidden layer with an arbitrary number of units, a bias layer, and an output layer
with one unit.
• Input Layer: $a^0_1, a^0_2, \ldots, a^0_d$
• Hidden Layer: $z^1_j = \sum_{k=1}^{d} w^1_{jk} a^0_k + b^1_j$
• ReLU Activation: $a^1_j = \max(0, z^1_j)$
• Output Layer: $a^2 = \sum_{k=1}^{m} w^2_k a^1_k + b^2$, where $m$ is the number of hidden units
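The forward pass of this architecture can be sketched in numpy as follows. This is a minimal illustration that assumes the weights are stored as matrices `W1` (input to hidden) and `W2` (hidden to output); the dictionary layout, the function names, and the small random initialization are assumptions for the example and do not reflect the provided template classes.

```python
import numpy as np

def init_shallow(d, m, rng):
    """Create parameters for d inputs, m hidden units, and one output (hypothetical layout)."""
    return {
        "W1": rng.normal(scale=0.1, size=(d, m)),
        "b1": np.zeros(m),
        "W2": rng.normal(scale=0.1, size=(m, 1)),
        "b2": np.zeros(1),
    }

def shallow_forward(X, params):
    """Forward pass for a batch X of shape (n, d)."""
    z1 = X @ params["W1"] + params["b1"]   # hidden pre-activations, shape (n, m)
    a1 = np.maximum(0.0, z1)               # ReLU activation
    a2 = a1 @ params["W2"] + params["b2"]  # linear output, shape (n, 1)
    return z1, a1, a2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
params = init_shallow(d=3, m=4, rng=rng)
z1, a1, a2 = shallow_forward(X, params)
print(a2.shape)
```

Returning the intermediate values `z1` and `a1` alongside the output is convenient because the backward pass needs them to compute the gradients.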
2.2 Loss Function
Continue to use a regression loss for training the network, defined as
$$\sum_{i=1}^{n} \frac{1}{2}\left(y_i - a^2(x_i)\right)^2$$
where $a^2(x_i)$ is the network output for input $x_i$.
2.3 Implementation
Using the template_for_solitions file, write code to train this network and apply it to both
1D data (as q2_a.py) and higher-dimensional data (as q2_b.py).
• Data Preparation: Use the q2_ function from the Data.generator module to generate
training and testing data. The data module has both a and b variants, so use the appropriate function
call to fetch the right data for each experiment.
• Network Setup: Use the net_setup method in the Trainer class to initialize the network, loss
layer, and optimizer.
• Training: Use the train method in the Trainer class to train the network. Plot the training
loss over iterations.
• Testing: Use the test data to evaluate the model’s performance. Plot the actual vs. predicted
values and compute evaluation metrics.
Tests and Experiments
2.4 Hyperparameters
You now have an additional hyperparameter: the number of hidden units.
Hyperparameter Tuning:
• Discuss the difficulty in finding an appropriate set of hyperparameters.
• Compare the difficulty level between solving the 1D problem and the higher-dimensional problem.
3 General Deep Learning - (15 Points)
The goal of this section of the assignment is to extend your neural network to handle fully-connected
networks of arbitrary depth. It will be just like the network in Problem 2, but with more layers. Each
layer will use a ReLU activation function, except for the final layer.
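One way to organize a network of arbitrary depth is to describe it by a list of layer sizes and loop over the layers. The sketch below is only one possible design, not the required one: the names `init_mlp` and `mlp_forward` are hypothetical, and the `sqrt(2 / n_in)` weight scale is an assumed initialization choice that you are free to replace.

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    """layer_sizes = [d, h_1, ..., h_k, 1]; returns one (W, b) pair per layer."""
    return [
        (rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def mlp_forward(X, layers):
    """Forward pass; ReLU on every layer except the final one, which stays linear."""
    a = X
    activations = [a]
    for idx, (W, b) in enumerate(layers):
        z = a @ W + b
        is_last = idx == len(layers) - 1
        a = z if is_last else np.maximum(0.0, z)
        activations.append(a)
    return activations  # activations[0] is the input, activations[-1] the output

rng = np.random.default_rng(0)
layers = init_mlp([4, 8, 8, 8, 1], rng)  # 3 hidden layers with 8 units each
X = rng.normal(size=(6, 4))
activations = mlp_forward(X, layers)
print([a.shape for a in activations])
```

Changing the depth then only requires changing the list passed to `init_mlp`, for example `[4, 8, 8, 8, 8, 8, 1]` for 5 hidden layers.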
Tests and Experiments
• Test your network with the same training data that you used in Problem 2 (A Shallow Network),
using both 1D and higher-dimensional data. Experiment with using 3 and 5 hidden
layers. Evaluate the accuracy of your solutions in the same way as in Problem 2.
• Conduct and report on experiments to determine whether the depth of a network has any significant effect on how quickly your network can converge to a good solution. Include at least one
plot to justify your conclusions.
Again, ensure your files are saved as q3_a.py and q3_b.py.
EXTRA CREDIT (EC): - Cross Entropy Loss (10 Points) Modify your network from Problem 3 (General
Deep Learning) to perform classification tasks using a cross-entropy loss and a logistic
activation function in the output layer.
If you are submitting the EC, save the code files as qec_a.py and qec_b.py.
3.1 Network Architecture
• Input Layer: Arbitrary size
• Hidden Layers: ReLU activation, arbitrary depth
• Output Layer: Logistic activation function defined as $a^L_1 = \frac{1}{1 + e^{-z^L_1}}$
3.2 Loss Function
Use a cross-entropy loss defined as:
$$-\sum_{i=1}^{n} \left[ y_i \log\left(a^L_1(x_i)\right) + (1 - y_i)\log\left(1 - a^L_1(x_i)\right) \right]$$
Here, $y_i$ is assumed to be a binary value (0 or 1).
3.3 Note on Numerical Stability
Be cautious when exponentiating numbers in the sigmoid function to avoid overflow. Utilize np.maximum
and np.minimum for a concise implementation.
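One possible way to follow this advice is sketched below. It relies on the identities sigmoid(z) = e^{min(z, 0)} / (1 + e^{-|z|}) and -|z| = min(z, -z), so that the exponential is only ever applied to non-positive numbers. The function names are hypothetical, and computing the loss directly from the pre-activation z (rather than from the sigmoid output) is a design choice made for this example, not a requirement.

```python
import numpy as np

def stable_sigmoid(z):
    """Logistic function that never exponentiates a positive number."""
    z = np.asarray(z, dtype=float)
    neg_abs = np.minimum(z, -z)  # equals -|z|
    return np.exp(np.minimum(z, 0.0)) / (1.0 + np.exp(neg_abs))

def cross_entropy_from_logits(z, y):
    """Sum over samples of -[y log s(z) + (1 - y) log(1 - s(z))], with s the logistic function.

    Uses the equivalent form max(z, 0) - y * z + log(1 + exp(-|z|)) to avoid log(0).
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(np.maximum(z, 0.0) - y * z + np.log1p(np.exp(np.minimum(z, -z))))

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
probs = stable_sigmoid(z)
loss = cross_entropy_from_logits(z, y)
print(probs, loss)
```

A naive `1 / (1 + np.exp(-z))` would overflow for z = -1000, and taking the log of a probability that has rounded to exactly 0 or 1 would produce infinities in the loss.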
Tests and Experiments
3.4 Test Scenarios
1. 1D Data Tests:
• Linearly Separable Data:
– Vary the margin between points and the number of layers.
– Investigate the difficulty in finding hyperparameters based on the margin.
– Examine the speed of convergence based on the margin. Include plots.
• Non-Linearly Separable Data:
– Note the differences you observe when the data is not linearly separable.
2. Higher-Dimensional Data Tests:
• Repeat the experiments with higher-dimensional data.
• Use both linearly separable and non-linearly separable data sets.
• Include data to support your conclusions.
4 Convolutional Neural Networks - 15 Points
In this section, you are required to implement a Convolutional Neural Network (CNN) using PyTorch
to classify images from the CINIC-10 dataset provided.
Requirements
Your CNN model should meet the following criteria:
(A) Utilize dropout for regularization. Mathematically, dropout sets a fraction p of the input units
to 0 at each update during training time, which helps to prevent overfitting.
(B) Be trained with the RMSprop and ADAM optimizers separately. The update rule for
RMSprop is given by:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \cdot g_t$$
where $\theta$ are the parameters, $\eta$ is the learning rate, $v_t$ is the moving average of the squared
gradient, $\epsilon$ is a smoothing term to avoid division by zero, and $g_t$ is the gradient.
For ADAM, the update rule is:
$$\theta_{t+1} = \theta_t - \frac{\eta \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates of the first and second moments of the gradients.
Report on how each optimizer performed.
(C) Include at least 3 convolutional layers and 2 fully connected layers. The convolution operation
can be represented as:
$$(f * g)(t) = \sum_{\tau} f(\tau) \cdot g(t - \tau)$$
(D) Use wandb for visualization of the training loss L, which could be the cross-entropy loss for
classification:
$$L = -\sum_{i} y_i \log(\hat{y}_i)$$
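To make the two update rules in requirement (B) concrete, here is a small numpy sketch of a single RMSprop step and a single ADAM step, applied to a toy quadratic. In the assignment itself you would use PyTorch's built-in optimizers; the function names, the decay rates (`beta`, `beta1`, `beta2`), the default step size, and the placement of the smoothing term outside the square root are assumptions chosen for illustration.

```python
import numpy as np

def rmsprop_step(theta, grad, v, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update; v is the moving average of the squared gradient."""
    v = beta * v + (1.0 - beta) * grad ** 2
    theta = theta - eta * grad / (np.sqrt(v) + eps)
    return theta, v

def adam_step(theta, grad, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update at step t (t starts at 1), with bias-corrected moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (hypothetical): 0.5 * ||theta - target||^2, with gradient theta - target.
target = np.array([1.0, -2.0])
theta_rms, v_rms = np.zeros(2), np.zeros(2)
theta_adam, m_adam, v_adam = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    theta_rms, v_rms = rmsprop_step(theta_rms, theta_rms - target, v_rms)
    theta_adam, m_adam, v_adam = adam_step(theta_adam, theta_adam - target, m_adam, v_adam, t)

print(theta_rms, theta_adam)
```

Both rules divide the step by a running estimate of the gradient magnitude, so the effective step size adapts per parameter; ADAM additionally smooths the gradient itself through the first-moment estimate.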
Experimental Results
In addition to reporting the Test Accuracy and plotting the figure of Training Loss over iterations, the
following experimental results should also be reported for a comprehensive evaluation of the model’s
performance:
1. Validation Accuracy and Loss: Monitor and report the accuracy and loss on a separate
validation set to assess the model’s generalization capability.
2. Confusion Matrix: Include a confusion matrix to identify which classes the model is having
difficulty distinguishing between.
3. Precision, Recall, and F1-Score: Calculate and report these metrics to provide a more
nuanced view of the model’s performance. The F1-Score is the harmonic mean of Precision and
Recall and is defined as:
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
4. Model Size: Report the number of parameters and the memory footprint of the model.
5. Hyperparameter Tuning: If hyperparameter tuning is performed, report the performance
under different hyperparameter settings, such as learning rate, batch size, etc.
6. Class-wise Accuracy: Report the accuracy for each individual class to show how well the
model performs on different categories.
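The confusion matrix and the per-class metrics listed above can all be derived from arrays of true and predicted labels. The following numpy sketch shows one way to do so; the helper names are hypothetical, the labels are made-up toy data, and returning 0 for a metric whose denominator is zero is a convention assumed here.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] counts samples whose true class is i and predicted class is j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 computed from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # how often each class was predicted
    actual = cm.sum(axis=1)     # how often each class actually occurs
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2.0 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy labels for three classes (not real results).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
precision, recall, f1 = per_class_metrics(cm)
overall_accuracy = np.trace(cm) / cm.sum()
print(cm, precision, recall, f1, overall_accuracy)
```

If class-wise accuracy is taken to mean the fraction of samples of a class that are classified correctly, it coincides with the per-class recall computed here.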
Part 2: Theoretical Questions - (50 Points + 3 Bonus Points)
1. Please answer the following questions about the activation function: - (9 Points)
(A) Why do we need activation functions in neural networks? (1 point)
(B) Write down the formula of the Sigmoid function and its derivative. What are the pros and cons
of using the Sigmoid function in neural networks? (4 points)
(C) Write down the formula of the ReLU function and its derivative. What are the pros and cons of
using the ReLU function in neural networks? (4 points)
2. When we optimize the neural networks, we usually use gradient descent to update
the weights of neural networks. To obtain well-trained neural networks, one of the most
important hyperparameters is the learning rate. Please answer the following questions
about learning rate: - (6 Points)
(A) What is the role of the learning rate in the gradient descent algorithm? (2 points)
(B) What happens to the neural network if the Learning Rate is too low or too high? (4 points)
3. After we train a neural network, we need to evaluate the model performance by determining if the model is underfitting or overfitting. Please answer the following questions
about underfitting or overfitting: - (12 Points)
(A) Explain the concepts of underfitting and overfitting in your own words, and explain how to
determine whether a model is overfitting or underfitting based on the model performance on the
training set and validation set. (4 points)
(B) Please write down four methods that can be used to prevent the overfitting of a neural network.
(4 points)
(C) Please write down four methods that can be used to prevent the underfitting of a neural network.
(4 points)
4. Computer Vision (CV) and Natural Language Processing (NLP) are two primary application areas of neural networks. In CV, CNN models are often used to extract
information from images and videos, while RNNs and Transformers are often used in NLP
to handle text data. - (9 Points + 3 Bonus Points)
(A) The key components of a CNN architecture include convolutional layers, pooling layers, and fully
connected layers. Provide a brief description of the function of each component. (4.5 points)
(B) Explain the concept of Hidden State, Time Steps and Weight Sharing in the design of RNN. (4.5
points)
(C) Bonus Question: Batch Normalization (BN) is important in real-world practice. Please describe
what BN does and explain why we need BN in neural networks. (3 points)
5. Convolutional to Multi-layer Perceptron - (14 Points)
A convolution operation is a linear operation, and therefore convolutional layers can be represented
in the form of matrix multiplication, or in other words, represented by a multi-layer perceptron. More
precisely, if we denote the convolution operation as c(x, θw, θb, γ), where θw are the filter weights, θb
are the filter biases, and γ are the padding and stride parameters, we want to convert the filters to a
weight matrix so that
$$\text{flatten}(c(x, \theta_w, \theta_b, \gamma)) = W\,\text{flatten}(x) + b, \tag{1}$$
where flatten(·) takes in a tensor of size (d1, d2, d3) and outputs a 1-D vector of size (d1 × d2 × d3). For example,
$$\text{flatten}(\text{Filter 1}) = (i_{1,1}, i_{1,2}, i_{1,3}, i_{2,1}, i_{2,2}, i_{2,3}, i_{3,1}, i_{3,2}, i_{3,3}, j_{1,1}, j_{1,2}, j_{1,3}, j_{2,1}, j_{2,2}, j_{2,3}, j_{3,1}, j_{3,2}, j_{3,3})$$
The converted weights and biases W and b depend on the convolution filters θw, θb and also γ (paddings
and strides).
[Figure 1 (image not reproduced): the two-channel input image (1st Channel, 2nd Channel) with zero padding and a sliding window, shown next to Filter 1 (entries i and j) and Filter 2 (entries k and l).]
Figure 1: Input image and filters. Note that the sliding window slides in row-major order, i.e., it first
slides right and, once it reaches the end of the first row, moves to the first position of the second row.
The white region around the input image is the zero padding.
Suppose the input is a 2 × 2 × 3 (C × H × W) image, and we have a convolutional layer with
two filters as shown in Figure 1, where the filter size is 3 × 3, the padding is 1 (filled with zeros),
and the stride is 1. The bias terms for the two convolutional filters in Filter 1 (Filter 2) are b1 (b3)
and b2 (b4), respectively. For one filter, we convolve it with every sliding window of the input
image, and every such convolution over one sliding window generates one output of this
convolutional layer. For one filter, there are 6 sliding windows in total, which correspond to the
6 outputs of that filter. For every sliding window, we can think of the output as being generated by
a dot product of a weight vector and the flattened input image, where the non-zero entries of
the weight vector have exactly the same values as the filter, and their positions depend
on the sliding window. Once we have the weight vector for each sliding window, we can simply
stack them together to get the converted weight matrix W. The bias part is simple: for one
filter, we add the same bias to every sliding window output. Write out the weight matrix
W and bias b in terms of the filter weights and biases. Convince yourself that you get exactly
the same output (flattened) as the original convolution.
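To help you convince yourself numerically, the sketch below builds the weight matrix for a deliberately simplified case, a single input channel and a single filter, and compares it with a direct sliding-window computation. It assumes the deep-learning convention in which the filter is not flipped (cross-correlation) and a stride of 1; the function names and the random test values are hypothetical. Extending the idea to the two-channel, two-filter layer of this question is left to you.

```python
import numpy as np

def conv2d_direct(x, kernel, bias, pad=1):
    """Stride-1 sliding-window filter over a single-channel image with zero padding."""
    H, W = x.shape
    k = kernel.shape[0]
    xp = np.pad(x, pad)
    out_h, out_w = H + 2 * pad - k + 1, W + 2 * pad - k + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(xp[r:r + k, c:c + k] * kernel) + bias
    return out

def conv2d_as_matrix(in_shape, kernel, bias, pad=1):
    """Build (W_mat, b_vec) such that flatten(conv(x)) == W_mat @ flatten(x) + b_vec."""
    H, W = in_shape
    k = kernel.shape[0]
    out_h, out_w = H + 2 * pad - k + 1, W + 2 * pad - k + 1
    W_mat = np.zeros((out_h * out_w, H * W))
    for r in range(out_h):
        for c in range(out_w):
            row = r * out_w + c  # one row per sliding window, in row-major order
            for u in range(k):
                for v in range(k):
                    i, j = r + u - pad, c + v - pad  # input pixel under filter entry (u, v)
                    if 0 <= i < H and 0 <= j < W:    # entries over the padding stay zero
                        W_mat[row, i * W + j] = kernel[u, v]
    b_vec = np.full(out_h * out_w, bias)  # the same bias for every window
    return W_mat, b_vec

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3))       # H = 2, W = 3
kernel = rng.normal(size=(3, 3))
bias = 0.5
W_mat, b_vec = conv2d_as_matrix(x.shape, kernel, bias)
direct = conv2d_direct(x, kernel, bias).flatten()
via_matrix = W_mat @ x.flatten() + b_vec
print(np.allclose(direct, via_matrix))
```

Each row of `W_mat` corresponds to one sliding window, and filter entries that fall on the zero padding simply do not appear in that row.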