11-751/18-781 Fall 2020
Homework 4
OUT: 27th October, 11:59 PM ET
DUE: 16th November, 12:00 PM ET
Collaboration Policy
Homeworks must be completed individually. You are allowed to discuss the homework assignment with other
students and collaborate by discussing the problems at a conceptual level. However, your answers to the
questions and any code you submit must be entirely your own. If you do collaborate with other students (i.e.
discussing how to attack one of the programming problems at a conceptual level), you must report these
collaborations in Problem 1 - Question 6. Your grade for homework 4 will be reduced if it is determined that
any part of your homework submission is not your individual work. No collaborations are permitted
on Problem 1 of Homework 4.
Collaboration without full disclosure will be handled in compliance with CMU Policy on Cheating and
Plagiarism: https://www.cmu.edu/policies/student-and-student-life/academic-integrity.html
Late Day Policy
You have a total of 3 late days that you can use over the semester for the 4 homework assignments. For
homework 4, no submissions will be accepted after November 18th at 12:00pm (ET), 2 days after the
homework deadline. If you do need a one-time extension due to special circumstances, please contact the
instructor (Ian Lane) via Piazza.
Compute Resources for Homeworks
You can complete the course homeworks using your personal computer, other compute resources you have
access to, or one of the GHC machines that have been assigned for this course
(ghc50.ghc.andrew.cmu.edu - ghc69.ghc.andrew.cmu.edu). The GHC machines (ghc50 - ghc69) are Red
Hat Linux machines with 8-core i7-9700 CPUs, 16 GB of RAM, and a GeForce RTX 2080 GPU. You can log
into a GHC machine as shown below:
$ ssh <andrew-id>@ghc<XX>.ghc.andrew.cmu.edu
$ Password:
Note that one or more of the GHC machines could be offline at any time. If you are unable to log into a
specific machine, try one of the other machines in the cluster. You can also use the "w" command to see how
many other students are using a particular machine, e.g.:
$ ssh -t <andrew-id>@ghc<XX>.ghc.andrew.cmu.edu w
12:50:08 up 6:03, 1 user, load average: 0.01, 0.04, 0.05
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
pts/2 12:50 0.00s 0.05s 0.00s w
For students who are new to a Linux programming environment, here are some commonly used commands:
https://www.cmu.edu/computing/services/comm-collab/collaboration/afs/how-to/unix-commands.pdf
Problems:
Problem 1
Short Answer Questions (60 pts)
Complete the 5 questions on the online assignment in Gradescope. Answers must be entered directly within
Gradescope itself. Please be as precise and concise in your answers as possible.
Remember to answer all 5 questions.
If you do collaborate with other students on Problem 2 (i.e., discussing how to attack one of the problems
at a conceptual level), you must report these collaborations in Question 6.
Notes on Problem 1:
1. Question 5 is the Initial Submission question for Problem 2 and requires the upload of an image file
containing your solution. Make sure this is in PNG format.
Gradescope Assignment Link: https://www.gradescope.com/courses/163942/assignments/762434
Problem 2
Sequence Models with Attention
Speech recognition can be regarded as a sequence transduction task with speech feature frames as inputs
and language tokens as outputs. Sequence models have an encoder-decoder architecture, where the encoder
obtains hidden representations from the input audio features, and the decoder produces language tokens in
an autoregressive fashion. Attention can be used to obtain alignments between the encoded outputs and
the decoder outputs at every decoding time step. In this assignment, you will implement an attention based
approach to speech recognition.
The primary reference paper we recommend for this assignment is Listen, Attend and Spell (LAS). The code
for this assignment will be written using Pytorch, and we recommend the following resources to students
who are new to the toolkit: Pytorch Documentation, Pytorch Tutorials, and in particular, the tutorial for
Machine Translation using Sequence Models with Attention
Handout Structure and Coding Instructions
Data
The data is located in the asr_data directory of the handout. It contains three directories: train, dev, and
test. Data that is input to the model and dataloader is stored in JSON format. The JSON file contains
utterance dictionaries; each utterance has input and output keys, which contain the path to Kaldi ark files
and the character-level tokenized speech transcript, respectively. The JSON file for the unknown test set
only contains the input speech features and no transcription.
Code Template
The code template contains the following structure:
1. conf- The directory has configurations in YAML format for training and decoding
(a) base.yaml- The baseline configuration that can help you get the necessary scores for this assignment.
2. models
(a) encoder.py- Has the Neural Network Modules for the basic RNNLayer, pBLSTM, and Listener,
which is the LAS Encoder. You need to complete the forward method for the pBLSTM and
Listener Modules.
(b) decoder.py- Has the Neural Network Speller Module which is the LAS Decoder. It contains the
forward and greedy_decode methods that you need to fill. Optionally, to get better performance,
you could also implement the beam_decode method in this file.
(c) las_model.py- Has the Neural Network Module Wrapper for SpeechLAS, the sequence model with
attention.
(d) attention.py- Has Neural Network Modules that implement Location Based Attention
3. train.py- Interface for training the ASR Models. Has all the training options and default values listed.
4. decode.py - Interface for decoding the trained ASR Model. This will generate the decoded_test.txt
file that you need to submit to Gradescope.
5. utils.py- Contains Loss, Accuracy, and WER utilities and padding utilities for the neural network code.
6. trainer.py- Contains the Trainer object that performs ASR Training
7. requirements.txt- Contains the pip installable Python environment for this assignment
Your Gradescope Code Submission: decoded_test.txt
Setup on GHC Machines
The storage in your folder on the GHC machines is limited, so we have placed the data in a course folder.
You can create a softlink from that path to your working directory, so that you can use the speech data.
To avoid quota issues when installing packages, first comment out the torch line in requirements.txt; the
GHC machines have torch preinstalled, and we will use the preinstalled version. If you forget to comment
out torch and run into a disk quota issue, run rm -rf .cache/pip to clean your pip cache and rerun
the instructions.
$ ssh <andrew-id>@ghc<XX>.ghc.andrew.cmu.edu
$ ln -s /afs/andrew.cmu.edu/course/11/751/homework-4/asr_data .
$ python3 -m venv --system-site-packages <venv-dir>
$ source <venv-dir>/bin/activate
$ vim requirements.txt    # comment out the torch line
$ pip install --ignore-installed -r requirements.txt
To download the template code and get started:
$ wget https://www.andrew.cmu.edu/user/ianlane/11751/homework-4_code.zip
$ unzip homework-4_code.zip
$ cd homework-4_code/
$ python3 train.py --tag <exp-tag>
...
$ python3 decode.py --model-dir exp/train_<exp-tag>
...
Grading Scheme
This assignment will use the Gradescope leaderboard with WER on the unknown test set as the metric. For
this assignment, we have two deadlines:
1. Preliminary Deadline November 9th (10 points)- You will submit your output transcript for the
unknown test set to Gradescope, and submit an attention plot for utterance ID "4kac030n" to Part 1
of the Gradescope assignment by this deadline.
(a) Attention Plots in Part 1: 5 points
(b) "Reasonable" WER score ( <= 60) for the unknown test set on the Gradescope Leaderboard: 5
points
2. Full Deadline November 16th (40 points)- You will submit your decoded hypothesis file.
(a) If your test WER <= 20 : 20 points
(b) 10 <= Test WER <= 20 : Maximum of 20 points with your test WER being interpolated between
10 and 20
Bonus Points: +10 points for the top 5 leaderboard entries with WER < 10
Building a Sequence Model in Pytorch
To build a neural network model in Pytorch, we have the following important steps:
1. Data Preparation: Prepare the data by extracting speech features and tokenizing the text transcript
based on the unit you use to model speech. Then compile this data into an easily readable format, such as
JSON files. Partition the data for training, development, and evaluation. All the above steps have been
done for you. For training and development, you will have access to the ground truth text transcript
labels and speech features. For evaluation, you will use speech features to obtain predictions of the
text transcript from your model.
2. Data Loading and Batching: Build a Pytorch Dataset object that has the __getitem__() and
__len__() methods, which return a data sample given an element key and the total number of data
samples in a partition, respectively (a minimal sketch is shown after this list). Then we create a
DataLoader with a custom batching mechanism: batches are formed based on input size, so batches
can have different numbers of examples, but the total number of input feature floats per batch
(batch-bins) is capped.
3. Trainer: Then we need to implement a trainer that performs the training. Within each epoch of
training, we have training and validation steps. In the train step, we load data from the data loader,
run it through our model, compute the Cross Entropy Loss, and then perform backpropagation. Based
on the computed gradients, we update the model parameters by calling the step() method of the
optimizer. In the validation step, we load the validation data, do forward propagation through the
model, and compute statistics: loss, accuracy, perplexity, and Word Error Rate (WER). After the
training and validation steps, we log statistics for the epoch and save models.
4. Neural Network Building: This is the focus of the current assignment. Neural networks are written
in Pytorch using the torch.nn package, which contains many standard neural network layers including
Linear, Convolution, RNN, LSTM, etc. As we are building an attention-based sequence model, we
have three primary components: Encoder, Decoder, and Attention. Each of these components is
written as a torch.nn.Module, which has an __init__ and a forward method. The former defines the
important neural network layers within the module; the forward method describes how forward
propagation proceeds through those layers, mapping the module's inputs to its outputs.
5. Decoding and Search: After training the sequence model, you will use the trained model to produce
text transcriptions for an unknown test set. This necessitates writing a decode function within the
decoder. Typically, beam search is used; for this assignment, you will implement greedy search and,
optionally, beam search.
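To make the Dataset interface above concrete, here is a minimal sketch of an utterance-level Dataset over
the JSON files described earlier. The class name AsrJsonDataset, the exact JSON layout, and the load_feats
callable (e.g., kaldiio.load_mat) are illustrative assumptions; the template already provides its own data
loading and batching code.

import json
import torch
from torch.utils.data import Dataset

class AsrJsonDataset(Dataset):
    """Illustrative Dataset over the utterance-level JSON (hypothetical layout)."""

    def __init__(self, json_path, load_feats):
        # load_feats: callable that reads a Kaldi ark entry and returns a [T, feat_dim] array
        with open(json_path) as f:
            self.utts = json.load(f)      # assumed layout: {utt_id: {"input": ..., "output": ...}}
        self.utt_ids = sorted(self.utts.keys())
        self.load_feats = load_feats

    def __len__(self):
        # Total number of data samples (utterances) in this partition
        return len(self.utt_ids)

    def __getitem__(self, index):
        utt_id = self.utt_ids[index]
        entry = self.utts[utt_id]
        feats = torch.as_tensor(self.load_feats(entry["input"]))      # [T, feat_dim]
        tokens = torch.as_tensor(entry["output"], dtype=torch.long)   # token ids (assumed)
        return utt_id, feats, tokens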
Data and Experimental Setup
You will use the Wall Street Journal Corpus for training with an unknown test set. Characters have been
selected as the modelling units, and a dictionary file has been provided.
Batching, data-loading, and training code has been provided in the template. You can use the "batch-bins"
option to create batches as per your GPU.
Building the LAS Model
A LAS model with attention has three important components: the Listener (encoder), the Speller (decoder),
and the attention module. The Listener consists of a Pyramidal Bi-LSTM network structure that takes
in the given utterances and compresses them to produce high-level representations for the Speller network.
The Speller takes in the high-level feature output from the Listener network and uses it to compute a
probability distribution over sequences of characters using the attention mechanism. Intuitively, attention
can be understood as learning a mapping from a word vector to some areas of the utterance map.
The Listener produces a high-level representation of the given utterance and the Speller uses parts of the
representation (produced from the Listener) to predict the next word in the sequence.
Inputs, Batching and Padding
In sequence models that use text, videos, speech etc., each example input is two-dimensional, and has a
sequence length component and a frame feature dimension. So, a speech feature input of shape [50,83]
means we have 50 frames of audio inputs, and 83 dimensional features for each frame. In training a sequence
model, we can train and update model parameters using batches, or using individual examples. Batches are
the preferred means to train the model.
Batches can be created with a fixed batch size (all the batches have the same number of elements) or variable
batch sizes (all batches can have different numbers of elements). Variable batch sizes are more efficient for
sequence inputs and are used in this assignment. It is best to create batches of examples that have
the same or similar sequence lengths, as this maximizes compute efficiency. In this assignment, we create
batches based on bins, i.e., a cap on the total number of floating point values within a batch.
Within a batch, it is possible that the input elements have different sequence lengths. To be able to
create tensors of shape [batch_size,sequence_length,feature_dimension], we need to pad the inputs whose
length is less than the maximum sequence length of all elements of the batch along the sequence_length
dimension. We pad the audio features with zero (the parameter audio_pad in train.py) to create the audio
input xs, while remembering the actual lengths of the inputs in the parameter xlens.
The outputs for speech recognition are sequences of language tokens with one dimension representing
sequence_length. Different elements in a batch would have different output_lengths, so we also need
to pad the shorter output sequences to be able to create a tensor (ys_ref in las_model.py) of shape
[batch_size,max_sequence_length]. We pad with -1 ( the parameter text_pad in train.py) in the template
code. We need to remember the actual output lengths in the variable ylen in las_model.py. We
need to mask the encoder outputs while computing the attention to not include the padded input elements,
and mask the decoded outputs while computing the loss so as to not include the padded output elements.
All of this has been done for you in the provided code template.
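As an illustration of the padding described above, the following sketch shows how a collate step could build
the padded tensors and masks with torch.nn.utils.rnn.pad_sequence. The function name and return values
are illustrative; the provided dataloader already performs this.

import torch
from torch.nn.utils.rnn import pad_sequence

def pad_batch(feats_list, tokens_list, audio_pad=0.0, text_pad=-1):
    # True (unpadded) input and output lengths
    xlens = torch.tensor([f.size(0) for f in feats_list])
    ylens = torch.tensor([t.size(0) for t in tokens_list])
    # [batch, max_seq_len, feat_dim]; shorter utterances padded with audio_pad
    xs = pad_sequence(feats_list, batch_first=True, padding_value=audio_pad)
    # [batch, max_out_len]; shorter references padded with text_pad
    ys_ref = pad_sequence(tokens_list, batch_first=True, padding_value=text_pad)
    # Boolean mask over real (non-padded) input frames, e.g. for masking attention
    frame_mask = torch.arange(xs.size(1))[None, :] < xlens[:, None]
    return xs, xlens, ys_ref, ylens, frame_mask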
Pyramidal Encoder: Listener
The encoder code is to be written in models/encoder.py in the template.
The Listener contains an initial LSTM layer and Pyramidal BiLSTM layers. The Pyramidal BiLSTM
subsamples the feature input of size [Batch_size,Sequence_length,feature_dimensions] along the sequence
length axis by factor p = 2, while increasing the feature dimension by the same subsampling factor. In
order to write the code for the Listener, we need two basic components: the RNNLayer Module, and the
pBLSTM Module. The RNNLayer Module is the implementation of an RNN within Pytorch, and can
implement N layers of GRU/LSTM/RNN. While implementing an RNN Module in Pytorch over padded
variable length inputs, you can use the pack_padded_sequence and pad_packed_sequence utilities from
Pytorch to implement it efficiently. This has been done for you as an example.
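For reference, a minimal sketch of such an RNN layer over padded inputs is shown below. The class name
and argument names are illustrative; the template's RNNLayer is the version you should actually use.

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class RNNLayerSketch(nn.Module):
    """Illustrative RNN layer over padded, variable-length inputs."""

    def __init__(self, input_dim, hidden_dim, num_layers=1, bidirectional=True):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                           batch_first=True, bidirectional=bidirectional)

    def forward(self, xs, xlens):
        # Pack so the LSTM skips the padded frames
        packed = pack_padded_sequence(xs, xlens.cpu(), batch_first=True,
                                      enforce_sorted=False)
        packed_out, _ = self.rnn(packed)
        # Unpack back to a padded tensor [batch, max_len, hidden * num_directions]
        out, out_lens = pad_packed_sequence(packed_out, batch_first=True)
        return out, out_lens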
The pBLSTM Module implements a single layer of LSTM with Pyramidal Subsampling. The idea here is
that if you get a sequence input of shape [batch,inp_sequence_length,hidden_dim], then you reshape the
input to reduce the sequence_length by factor p = 2 and multiply the hidden_dim by the same factor p.
Next, you pass this modified input through an RNNLayer module to get the output of the pBLSTM. You
will observe that, with pyramidal subsampling and a bidirectional encoder, the resulting
output has four times the hidden dimension you provide for the pBLSTM module.
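A minimal sketch of the reshape step is given below. It follows the description above (concatenate adjacent
frames, then run a BiLSTM); in the template you should reuse your RNNLayer (with packing), and the exact
output dimension depends on how that internal layer is configured. Names here are illustrative.

import torch
import torch.nn as nn

class PBLSTMSketch(nn.Module):
    """Illustrative pBLSTM: pyramidal reshape followed by a single BiLSTM layer."""

    def __init__(self, input_dim, hidden_dim, p=2):
        super().__init__()
        self.p = p
        # The reshape concatenates p adjacent frames, so the RNN input is input_dim * p
        self.rnn = nn.LSTM(input_dim * p, hidden_dim, batch_first=True,
                           bidirectional=True)

    def forward(self, xs, xlens):
        batch, seq_len, dim = xs.size()
        # Drop at most p - 1 trailing frames so the length divides evenly by p
        seq_len = (seq_len // self.p) * self.p
        xs = xs[:, :seq_len, :].contiguous()
        # Halve the time axis and multiply the feature axis by p: [B, T/p, D*p]
        xs = xs.view(batch, seq_len // self.p, dim * self.p)
        xlens = xlens // self.p
        out, _ = self.rnn(xs)      # in practice, pack/pad as in RNNLayer
        return out, xlens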
The Listener Module can be implemented using (a) an initial RNNLayer that maps the input feature
dimension to a hidden dimension, and (b) a sequence of pBLSTM layers that subsample the sequence_length
dimension as mentioned.
This is what the Listener should do:
L_1, Llens_1 = RNNLayer(xs, xlens)
for i = 1, 2, ..., N_e; do
    L_{i+1}, Llens_{i+1} = pBLSTM(L_i, Llens_i)
done
hs, hlens = L_{N_e+1}, Llens_{N_e+1}
where N_e is the number of encoder pBLSTM layers (elayers in train.py), xs is the padded input feature
tensor, xlens contains the lengths of the inputs before padding, and the encoder outputs hs and hlens are
obtained from the last pBLSTM layer.
Number of Encoder Layers, Subsampling Factor: Generally, when training attention models on speech
features extracted with 25ms windows every 10ms, as in this case, a subsampling factor of up to 8 can be
used, which means a maximum of 3 pBLSTM layers.
Projection Layers in the Encoder: You can choose to add projection layers between the pBLSTMs that
reduce the hidden dimension. A pBLSTM layer increases the hidden dimension four-fold, so we can use a
projection layer to reduce the hidden dimension four-fold before using it as input to the next pBLSTM.
Attention Module
The attention code is provided in models/attention.py in the template.
Attention has many formulations: content-based, location-based, or hybrid methods. Content-based attention
was used in the LAS paper, but in this assignment we provide location-based attention,
proposed in Attention-Based Models for Speech Recognition. Refer to Section 2.1 in that paper for a detailed
mathematical treatment of location-based attention. Fundamentally, location-based attention computes the
attention weights based on a projection of the encoder output (key), a projection of the decoder state (query),
and a convolution followed by a projection of the previous attention weights. The code for location-based
attention has been provided in attention.py.
Alternative Attention methods: You could use different attention modules and parameters to possibly
improve your WER performance on the leaderboard.
Autoregressive Decoder: Speller
The decoder code is to be written in models/decoder.py in the template.
The Speller is an LSTM based auto-regressive decoder, i.e., we use the previously decoded outputs in the
current decoding step. At every decoding time step, we compute the Attention Context and weights based
on the current decoder state, the encoder output, and previous attention weight; then concatenate the
attention context with the character embedding of the previous output (ground-truth reference or model
output), and then use this concatenated vector as the input to the first decoder layer. Because we have to
compute a different attention context and embedding at each time step based on the previous time-step, we
cannot use torch.nn.LSTM/torch.nn.GRU as in the Listener to implement this. Instead, we have to use a
torch.nn.LSTMCell/torch.nn.GRUCell within Pytorch and loop over all decoding time-steps during training
and decoding.
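The sketch below shows what a single decoding step built around nn.LSTMCell could look like. The attention
interface (attention(state, hs, hlens, prev_attn) returning a context and weights), the class name, and the
choice of output projection are assumptions made for illustration; the template's Speller will differ in its
details.

import torch
import torch.nn as nn

class SpellerStepSketch(nn.Module):
    """Illustrative single decoding step built around nn.LSTMCell."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, enc_dim, attention):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The first decoder layer consumes [previous-token embedding ; attention context]
        self.cell = nn.LSTMCell(embed_dim + enc_dim, hidden_dim)
        self.attention = attention    # assumed interface, not the template's actual module
        # Combining decoder state and context in the output projection is one common choice
        self.output = nn.Linear(hidden_dim + enc_dim, vocab_size)

    def step(self, prev_token, hc, hs, hlens, prev_attn):
        h, c = hc
        # Attention context/weights from the current decoder state, the encoder
        # outputs, and the previous attention weights (location-based attention)
        context, attn = self.attention(h, hs, hlens, prev_attn)
        emb = self.embed(prev_token)                              # [B, embed_dim]
        h, c = self.cell(torch.cat([emb, context], dim=-1), (h, c))
        logits = self.output(torch.cat([h, context], dim=-1))     # [B, vocab]
        return logits, (h, c), attn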
The forward method of the Speller has a loop over the maximum target token sequence length in the batch.
This is what the Speller decoding loop should do:
for i = 1, 2, ..., maxlen; do
    Embedding_i = EmbeddingLayer(ys_{i-1})
    Context_i, AttnWeights_i = Attention(DecoderState_{i-1}, hs, hlens, AttnWeights_{i-1})
    DecoderState_i = RNNForward([Embedding_i ; Context_i], DecoderState_{i-1})
    Logits_i = OutputLayer(DecoderState_i)
done
The RNNForward method of the decoder goes through the d layers of the LSTM decoder, with the input to
the first layer being the concatenated attention context and token embedding.
The greedy_decode method performs batch greedy decoding using the encoder outputs and the decoding
parameters maxlenratio and minlenratio. In training, we know the maximum length of tokens that we can
output in each batch, but that is not the case for decoding. Therefore, we can either (a) use a fixed maximum
token length for decoding (say 200), or (b) assume that, due to the correlation between the length of the audio
and the length of the transcript, the maximum length we need for each batch is a fraction of the length of the
hidden states hlens. We recommend option (b); maxlenratio defines what fraction of the maximum value in
hlens is used as the maximum decoding length for the batch. In greedy decoding, we have a loop over decoding
timesteps similar to the forward method; the main difference is that while computing embeddings we can only
use the previously decoded token. We compute the logits as in the forward method, and after that, compute
the log-softmax of the logits and take the argmax over the token vocabulary. A decoded output is considered
complete once <eos> has been produced. This method returns a list of token indices corresponding to the
decoded tokens for all elements in a batch.
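As a rough sketch only, a batch greedy decoding loop in the spirit of the description above could look as
follows. It reuses the hypothetical SpellerStepSketch interface from the earlier snippet; the template's
greedy_decode has its own signature, state initialization, and attention handling.

import torch

def greedy_decode_sketch(speller, hs, hlens, sos_id, eos_id, maxlenratio=0.5):
    batch = hs.size(0)
    # Option (b) from the text: cap the decoding length at a fraction of max(hlens)
    maxlen = max(1, int(maxlenratio * hlens.max().item()))

    prev_token = hs.new_full((batch,), sos_id, dtype=torch.long)   # start from <sos>
    h = hs.new_zeros(batch, speller.cell.hidden_size)
    c = hs.new_zeros(batch, speller.cell.hidden_size)
    prev_attn = None        # assumed: the attention module handles the first step
    hyps = [[] for _ in range(batch)]
    finished = [False] * batch

    for _ in range(maxlen):
        logits, (h, c), prev_attn = speller.step(prev_token, (h, c), hs, hlens, prev_attn)
        # Greedy choice: argmax of the log-softmax over the token vocabulary
        prev_token = torch.log_softmax(logits, dim=-1).argmax(dim=-1)
        for b, tok in enumerate(prev_token.tolist()):
            if finished[b]:
                continue
            if tok == eos_id:
                finished[b] = True     # <eos> marks this hypothesis as complete
            else:
                hyps[b].append(tok)
        if all(finished):
            break
    return hyps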
Adding SOS and EOS to the reference: In the decoder, we use the start of sentence <sos> token to mark
the beginning of decoding and the end of sentence <eos> token to mark the end of decoding. Both are
represented by the same token in the template code. When performing decoding for the first time step, you
need to use the embedding for <sos>, which you treat as the previous output. When the model produces
the <eos> token, you consider decoding complete. Therefore, <eos> needs to be added to the reference
when computing the loss, because you want the model to produce <eos> to indicate the end of decoding.
In the decoder forward method, you will use the reference ys and ylen to create a new reference with the
<eos> padding and return that as the ground-truth reference for loss computation along with the logits.
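A minimal sketch of constructing the decoder input (with <sos>) and the loss reference (with <eos>) is
shown below; the function name and exact tensor layout are illustrative, and the template may organize this
differently.

import torch

def add_sos_eos_sketch(ys, ylens, sos_eos_id, text_pad=-1):
    # ys: [batch, max_out_len] padded with text_pad; ylens: true output lengths
    batch, maxlen = ys.size()
    ys_in = ys.new_full((batch, maxlen + 1), sos_eos_id)   # pad value stays embeddable
    ys_out = ys.new_full((batch, maxlen + 1), text_pad)    # text_pad is ignored by the loss
    for b in range(batch):
        n = int(ylens[b])
        ys_in[b, 0] = sos_eos_id          # <sos> is treated as the previous output at step 1
        ys_in[b, 1:n + 1] = ys[b, :n]
        ys_out[b, :n] = ys[b, :n]
        ys_out[b, n] = sos_eos_id         # <eos> appended to the reference for the loss
    return ys_in, ys_out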
Pre-computing Ground Truth Embeddings: In the decoder loop, at each time-step, you compute the
embedding of the previous ground truth token. It is recommended that you pre-compute these embeddings
outside the decoding loop and access the i-th element as needed within the loop. To pre-compute these
embeddings, you need the reference token list ys, but with the <sos> token prepended to the start.
Scheduled Sampling: In the Speller, the token embeddings that you feed in at training time are computed
from the ground truth reference tokens of the previous time-step (teacher forcing). However, at decoding time,
you use the predicted tokens from the previous time-step to compute the embeddings. This mismatch leads
to exposure bias, which can be addressed by scheduled sampling. The idea is to have a parameter
ssprob, where at every training-time decoding step, with probability ssprob you use the previous decoded
outputs, and with probability 1 − ssprob you use the ground truth reference tokens to compute
embeddings. Using scheduled sampling speeds up model convergence, and it is beneficial to (a) use the same
value of ssprob through all epochs, or (b) gradually increase the dependence on previously decoded tokens.
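The coin flip itself is simple; the sketch below shows one way the choice could be made at a single
training-time decoding step. Variable names are illustrative, and the template's Speller integrates this inside
its forward loop.

import torch

def choose_prev_tokens_sketch(step, ssprob, prev_logits, ys_in):
    # At step 0 there is no previous model output, so always use the reference (<sos>)
    if step > 0 and torch.rand(1).item() < ssprob:
        # With probability ssprob, feed back the model's own previous prediction
        return prev_logits.argmax(dim=-1).detach()
    # Otherwise use the ground truth reference token for this step
    return ys_in[:, step]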
Decoder Dropout: Dropout is a regularization strategy that mitigates the impact of overfitting, and we
recommend using dropout in the decoder.
Multiple Decoder Layers: It might be beneficial to add more layers to the decoder, but we recommend
that you get it working with a single decoder layer first.
Loss, Computing Statistics
In the utils file, we have a StatsCalculator function that computes the CrossEntropyLoss, the accuracy, the
perplexity, and the Word Error Rate on the validation set. The CrossEntropy loss takes in the logits from the
decoder and the modified reference with <eos>, and ignores the padded elements in the reference using
the ignore_id option.
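For intuition, the snippet below shows how CrossEntropyLoss with an ignore index skips the padded
reference positions. The text_pad value of -1 follows the template; the shapes and random tensors are just
placeholders.

import torch
import torch.nn as nn

text_pad = -1
criterion = nn.CrossEntropyLoss(ignore_index=text_pad)

batch, maxlen, vocab = 2, 5, 30                       # illustrative sizes
logits = torch.randn(batch, maxlen, vocab)            # decoder outputs
ys_out = torch.randint(0, vocab, (batch, maxlen))     # reference with <eos> appended
ys_out[1, 3:] = text_pad                              # padded positions are ignored by the loss

loss = criterion(logits.view(-1, vocab), ys_out.view(-1))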
We have support for Tensorboard logging of important values like loss, accuracy, and WER. We also provide
support for logging Tensorboard attention plots and gradient plots.
Decoding and Search
The file decode.py is the wrapper for decoding your trained model. It calls the decode_greedy method
in las_model.py, which in turn calls greedy_decode in decoder.py. The decode_greedy method in
las_model.py returns a list (of length batchsize) of lists, each of which contains the decoded tokens for one
element of the batch. Then, in decode.py, we convert these decoded tokens to their corresponding character
tokens and replace "<space>" with " ". Finally, we write the decoded outputs in the expected submission
format to decoded_hyp.txt, which is what you should submit on Gradescope.
Debugging Strategies and Common Issues
Initial Value of CELoss
Print out the cross entropy loss you compute for the first batch, before any training. For N-way classification,
the initial cross-entropy loss should be approximately ln(N).
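A quick way to check this: an untrained model should predict roughly uniformly over the vocabulary, so the
expected loss is ln(N). The vocabulary size below is only an example.

import math

vocab_size = 30                               # illustrative; use your actual token count
print(f"Expected initial CE loss ~= {math.log(vocab_size):.3f}")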
Gradient Plots
The gradient plot is an important debugging tool. The trainer file has code that outputs the gradient plots
via Tensorboard and saves them as a file. The gradient plot shows the min, max, and mean gradients. You
want to make sure that, after a few epochs of training, you have gradients across all decoder layers, through
the attention module, and in the encoder. If you don't have gradients for particular layers, check your forward
implementation to make sure you have written the code correctly.
Figure 1: Example of Reasonable Gradients during training
Attention Plots
In ASR, attention models should learn to produce monotonic attentions across training. Monitor your
attention plots to ensure that you have reasonably trained models.
Other Tips
1. It is challenging to debug in such large code bases. Please make sure that your code is highly modular,
and you have different functions to implement what you need. Make sure all variable names are
descriptive (e.g., using attention_wt rather than a), and that you have comments in your code
describing what you did. We would be able to help debug your code only if the code is clean, modular,
and descriptive.
2. Use small parameter sizes while building your models, as large models cannot be trained well with
limited data.
3. Use regularization strategies like weight decay and dropout to address overfitting.
4. You can also use the blog post by Andrej Karpathy here to understand how to build and tune neural
networks.
5. Monitor the training curves on Tensorboard to identify over-fitting and under-fitting, and modify
optimizer parameters as you see fit.
How to improve my WER and earn Bonus points?
Here, we list additional ways that could help you improve performance and target the bonus points for top
scores:
1. Add CTC Loss to Attention, and train with joint CTC Attention
2. Implement Beam Search on your own to improve decoding scores
3. Use your code from HW3 to train a character language model, as opposed to a word language model.
At each decoding step, get the probability distribution over the token vocabulary from the language
model, and from LAS, and weight both contributions to make a final decision on decoding.
4. Pretrain Decoder as a Language Model to produce character tokens, and then train LAS.
5. Use Transformer Models rather than LSTM based LAS
6. Modify the pyramidal subsampling to convolutional sub-sampling with LSTM layers
Good luck and enjoy the challenge!