COURSEWORK ASSIGNMENT
MODULE: CMP-6026A/CMP-7016A – Audio-visual processing
ASSIGNMENT TITLE: Design, implementation and evaluation of a speech recognition system
DATE SET: Week 1
PRACTICAL DEMONSTRATION: Week 7 Wednesday (slot to be advised) – 5th Nov 2025
RETURN DATE: Friday of Week 8
ASSIGNMENT VALUE: 50%
LEARNING OUTCOMES
• Explain how humans produce speech from audio and visual perspectives, how these differ across different speech sounds, and be able to give examples of how these are subject to noise and distortion
• Apply a range of tools to display and process audio and visual signals and be able to analyse these to find structure and identify sound or visual events
• Transfer knowledge learnt into code that extracts useful features from audio and visual data to provide robust and discriminative information in a compact format and apply this to machine learning methods
• Design and construct audio and visual speech recognisers and evaluate their performance under varying adverse operating conditions
• Work in a small team and organise work appropriately using simple project management techniques before demonstrating accomplishments within a professional setting
SPECIFICATION
Overview
This assignment involves the design, implementation and evaluation of a speaker-dependent speech recognition system to recognise the names of 20 students taken from the CMP-6026A/CMP-7016A modules in clean and noisy conditions.
Description
The task of building and testing a speech recogniser can be broken down into five stages:
i) Speech data collection and labelling
ii) Feature extraction
iii) Acoustic modelling
iv) Noise compensation
v) Evaluation
The speech recogniser is to be speaker-dependent, which means that it will be trained on speech from just a single speaker and should also be tested on speech from only that speaker. The vocabulary is a set of 20 names taken from the students studying CMP-6026A and CMP-7016A, which will be provided separately.
The twenty names have been selected so that some are distinctive, some are confusable with others and some are short. Your recogniser will perform isolated word recognition. This means that, during testing, you will provide it with the audio of a single name, and the recogniser will output a single label providing its classification of that speech.
The assignment is to be carried out in pairs, with marks awarded according to the mark scheme provided. The assignment will use Python and a variety of Python libraries such as TensorFlow, numpy, matplotlib and scikit-learn. These are standard libraries and give a good introduction to how such a task might be carried out in industry.
The second assignment (CW2) will be based closely on this assignment. This means that this assignment will form an important underpinning for the next coursework. Feedback and feedforward from this assignment should be useful when undertaking the second assignment.
1. Speech data collection and labelling
A speech recogniser must be trained on examples of the speech sounds that it is expected to recognise. For this assignment, the vocabulary of the speech recogniser comprises 20 names taken from students on CMP-6026A and CMP-7016A. Therefore, the first part of the assignment involves recording examples of each name in the vocabulary. Theoretically, the more examples of each name, the higher the accuracy of the speech recogniser. The minimum number of samples you should collect is 20 of each name. Each speech file can be stored as a separate WAV file (e.g. dave001.wav).
Next, each audio file requires an associated label. This could be stored in the filename (as above), or via the directory structure, or in a separate reference text file including the label for each file. You should choose a logical approach that you can easily interface with from your Python code.
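As a rough illustration of how the labelling might be interfaced with from Python, the minimal sketch below loads every WAV file in a directory and recovers the word label by stripping the trailing digits from the file stem. The directory name "recordings" and the filename pattern are assumptions based on the example above, not requirements; a reference text file or directory structure would work just as well.

import glob
import os
import re

from scipy.io import wavfile

def load_dataset(data_dir="recordings"):
    """Return a list of (samples, sample_rate, label) tuples."""
    dataset = []
    for path in sorted(glob.glob(os.path.join(data_dir, "*.wav"))):
        sample_rate, samples = wavfile.read(path)
        # Strip the trailing digits from the file stem, e.g. dave001 -> dave
        stem = os.path.splitext(os.path.basename(path))[0]
        label = re.sub(r"\d+$", "", stem)
        dataset.append((samples, sample_rate, label))
    return dataset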
You only need to collect audio for this first coursework. However, bear in mind that you will have to collect a dataset of audio-visual speech for CW2, and so it might be more efficient to collect audio-visual data from the outset and to put the video data to one side until required. We strongly recommend finding video recording software that can record video data at a fixed framerate (not a variable framerate). Also, consider that you will have to give a live demo of your work for CW1 and, ideally, for CW2.
2. Feature extraction
The task of feature extraction is to extract, from each speech utterance, the set of feature vectors that forms the input to the speech recogniser. This will involve designing and implementing in Python an algorithm to extract feature vectors from each speech signal. Many different feature extraction methods exist, but for this assignment you should consider only filterbank-derived cepstral features. You may first want to use a linear-frequency filterbank with rectangular channels as a simple starting point. This can be extended to a mel-scaled filterbank, and then to incorporate triangular-shaped channels, to ultimately produce mel-frequency cepstral coefficients (MFCCs). You should also consider augmenting the feature vector with energy and then with its temporal derivatives, as this should increase recognition accuracy. These different configurations should provide you with some interesting designs that you can test within your speech recogniser.
The feature extraction code should take as input a speech file (for example dave001.wav) and output a variable containing MFCC vectors.
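One possible shape for such a function is sketched below in NumPy: framing and windowing, a power spectrum, a triangular mel-spaced filterbank, log compression and a DCT. The frame length, frame step, number of channels and number of coefficients are illustrative assumptions to be varied in your own designs, and energy and temporal derivatives would still need to be appended.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_channels, nfft, sample_rate):
    """Triangular, mel-spaced filters spanning 0 Hz to the Nyquist frequency."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_channels + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_channels, nfft // 2 + 1))
    for ch in range(num_channels):
        left, centre, right = bins[ch], bins[ch + 1], bins[ch + 2]
        fbank[ch, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[ch, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fbank

def mfcc(samples, sample_rate, frame_len=0.025, frame_step=0.010,
         num_channels=26, num_ceps=13, nfft=512):
    """Return a (num_frames, num_ceps) array of MFCC vectors."""
    samples = np.asarray(samples, dtype=float)
    flen, fstep = int(frame_len * sample_rate), int(frame_step * sample_rate)
    num_frames = max(1, 1 + (len(samples) - flen) // fstep)
    # Zero-pad so that the final frame is complete
    pad = max(0, (num_frames - 1) * fstep + flen - len(samples))
    samples = np.append(samples, np.zeros(pad))
    window = np.hamming(flen)
    fbank = mel_filterbank(num_channels, nfft, sample_rate)
    features = []
    for i in range(num_frames):
        frame = samples[i * fstep: i * fstep + flen] * window
        power = np.abs(np.fft.rfft(frame, nfft)) ** 2 / nfft            # power spectrum
        log_energies = np.log(fbank @ power + 1e-10)                    # log filterbank energies
        features.append(dct(log_energies, norm="ortho")[:num_ceps])     # cepstral coefficients
    return np.array(features)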
3. Acoustic modelling
Acoustic modelling is where the acoustic properties (as represented by the feature vectors) are modelled. For this assignment, Deep Neural Networks (DNNs) will be used as the acoustic model. You will implement DNNs using TensorFlow in Python, and specifically using the higher-level Keras functions. The scripts from Lab 4 show how to partition a dataset into training/validation sets, and how to use those sets to train, evaluate and optimise a simple DNN. You should attempt to optimise the validation accuracy of your network by varying its hyperparameters (including but not limited to changing the number of hidden layers and filters). You should use 2D convolutional layers in your network, although you are free to explore different layer types if you have time.
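As a starting point, a small 2D-convolutional network in Keras might look like the sketch below. The fixed input shape (100 frames by 13 coefficients), the layer sizes and the use of integer class labels are illustrative assumptions; these are exactly the hyperparameters you should vary and evaluate against your validation set.

import tensorflow as tf

def build_model(num_frames=100, num_ceps=13, num_classes=20):
    """A small 2D-convolutional classifier over fixed-size MFCC 'images'."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_frames, num_ceps, 1)),
        tf.keras.layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    # Integer class labels are assumed; use categorical_crossentropy for one-hot labels
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Illustrative training call with a held-out validation set:
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=30, batch_size=32)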
4. Noise compensation
Noise can be added to the clean speech samples to create noisy speech which is more representative of real-world use of speech recognition systems. Adding noise to the speech will reduce the recognition accuracy and increase confusions. To mitigate this, some form of noise compensation may be needed. Different methods can be tested, such as applying spectral subtraction within the feature extraction process or training the speech models on noisy speech (matched models). The effect of different types of noise and of different signal-to-noise ratios (SNRs) can be investigated.
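A minimal sketch of mixing a noise recording into clean speech at a chosen SNR is given below; the noise gain is derived from the ratio of the speech and noise powers. The variable names, and the assumption that both signals are NumPy arrays sampled at the same rate, are illustrative.

import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix 'noise' into 'speech' at the requested signal-to-noise ratio (dB)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Tile or trim the noise so that it matches the speech length
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose a gain g so that 10*log10(speech_power / (g^2 * noise_power)) = snr_db
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise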
5. Testing and evaluation
Once you have evaluated and optimised your network using your validation data, training of the speech recogniser is complete and it can now be tested. Testing involves passing a new speech file (in the same feature format as the training data) to the speech recogniser and letting it recognise the speech. You should be able to pass up to 10 separate files, each containing an isolated name, and your network should produce a corresponding list of names recognised in those files.
Therefore, a new set of speech files should be collected (for example a further 10 or 20 examples of each word in the vocabulary) and input into the speech recogniser. You should use Python to compare the recogniser’s classifications to the true labels of the test files. You should be able to report the classification accuracy and present a confusion matrix that shows which word confusions took place.
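A minimal sketch of decoding a batch of test files is given below. It assumes the mfcc() function sketched earlier, a trained Keras model with a fixed input shape, and a class_names list mapping output indices back to names; all of these are illustrative assumptions rather than required interfaces.

import numpy as np
from scipy.io import wavfile

def recognise(file_paths, model, class_names, num_frames=100):
    """Return the recognised name for each test WAV file."""
    predictions = []
    for path in file_paths:
        sample_rate, samples = wavfile.read(path)
        feats = mfcc(samples, sample_rate)
        # Pad or truncate to the fixed number of frames the network expects
        padded = np.zeros((num_frames, feats.shape[1]))
        padded[:min(num_frames, len(feats))] = feats[:num_frames]
        probs = model.predict(padded[np.newaxis, :, :, np.newaxis], verbose=0)
        predictions.append(class_names[int(np.argmax(probs))])
    return predictions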
Within the evaluation you can examine the effects of different configurations of the feature extraction. This may include different numbers of filterbank channels, different spacing of the channels, etc. You should be able to explain the effects on training/validation data loss and accuracy of changing the neural network architecture and hyperparameters, as discussed in the Acoustic modelling section above. You may also want to test your speech recogniser in noisy conditions (for example, factory noise, babble noise, etc) and under different signal-to-noise ratios (SNRs) to examine how the noise affects recognition accuracy. For all tests, be prepared to explain what is happening and why you think this is the case.
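For scoring, scikit-learn provides the accuracy and confusion-matrix calculations directly; a minimal sketch is given below, assuming y_true and y_pred are lists of name labels for the test files (the values shown are illustrative).

import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# y_true: true labels of the test files; y_pred: recogniser output (illustrative values)
y_true = ["dave", "sarah", "dave"]
y_pred = ["dave", "dave", "dave"]

print("Accuracy:", accuracy_score(y_true, y_pred))
labels = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot()
plt.show()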
Group work
This work is to be undertaken in pairs. You must find your own partner and you should do this as soon as possible. In the coming weeks you will be asked to provide the names of both people in your pair.
In order to be successful in your pair, proper project planning and progress monitoring are vital. Good practice for undertaking a project such as this will be discussed in the lectures.
Relationship to formative assessment
Formative assessment takes place during all lab classes through discussion of your analysis, designs and implementations. These labs underpin the coursework and relate directly to the different parts.
Deliverables
The assessment comprises a single part, which forms one of the assessed components of CMP-6026A:
Practical demonstration of the recogniser, and discussion of your design decisions and results (CW1)
The practical demonstration will take place in a lab. In the practical demonstration you will be asked to say a sequence of names that you will then decode using your speech recogniser. You will also be expected to discuss your system and justify design decisions related to data collection, design, implementation, and evaluation of your speech recogniser. You will present, by way of a slideshow of no more than 10 minutes, an evaluation of the speech recogniser in terms of its performance with different configurations and test conditions (see point 5 above). One member of your group will submit your slideshow on Blackboard as a group submission. Each group will have up to 25 minutes for the demonstration, in total.
Both group members will also submit a document providing your opinions of how the marks should be shared between your group members. This should be expressed as a percentage share for each group member (e.g. 50% Dave, 50% Sarah). We will use this to determine the distribution of marks in your group, following CMP’s Policy on Group Work. You should ensure that both people in the pair make a roughly equal contribution to the work, so that marks are shared fairly.
Resources
You will need to use audio/visual recording equipment/software and Python, as used in the lab classes. These resources have been introduced in the lectures and lab classes.
There will be a briefing session for this coursework in Week 2.
Marking scheme
Marks will be allocated as follows for the assessed component:
CW1: Demonstration and discussion (100%)
• Speech collection and annotation methodology (10%)
• Design and justification of feature extraction (20%)
• Acoustic modelling and noise compensation (10%)
• Short presentation evaluating the performance of the speech recogniser under different conditions (30%)
• Discussion/question answering (30%)
Note – We will follow CMP’s Policy on Group Work (Made available to you on Blackboard) to allocate individual marks for this assignment. In summary, each group member’s estimation of individual contribution will be used to determine individual marks. It is expected that both people in the pairing will make an equal contribution to the work and the demonstration, so that marks can be awarded fairly.