
MATH38161 Multivariate Statistics and Machine Learning

Coursework

November 2024

Overview

The coursework is a data analysis project with a written report. You will apply skills and techniques acquired from Week 1 to Week 8 to analyse a subset of the FMNIST dataset.

In completing this coursework, you should primarily use the techniques and methods introduced during the course. The assessment will focus on your understanding and demonstration of these techniques in alignment with the learning outcomes, rather than the accuracy or exactness of the final results.

The project report will be marked out of 30. The marking scheme is detailed below.

You have twelve days to complete this coursework, with a total workload of approximately 10 hours (including preliminary coursework tasks).

Format

• Software: You should mainly use R to perform the data analysis. You may use built-in functions from R packages or implement the algorithms in your own code.

• Report: You may use any document preparation system of your choice but the final document must be a single PDF in A4 format. Ensure that the text in the PDF is machine-readable.

• Content: Your report must include the complete analysis in a reproducible format, integrating the computer code, figures, and text etc. in one document.

• Title Page: Show your full name and your University ID on the title page of your report.

• Length: Recommended length is 8 pages of content (single sided) plus title page. Maximum length is 10 pages of content plus the title page. Any content exceeding 10 pages will not be marked.

Submission process and deadline

• The deadline for submission is 11:59pm, Friday 29 November 2024.

• Submission is online on Blackboard (through Gradescope).

Coursework tasks

Analysis of the FMNIST data using principal component analysis (PCA) and Gaussian mixture models (GMMs)

The Fashion MNIST (FMNIST) dataset contains 70,000 grayscale images of fashion products categorised into 10 distinct classes. More information is available on Wikipedia and GitHub.

The data set to be analysed in this coursework is a subset of the full FMNIST data and contains 10,000 images, each with dimensions of 28 by 28 pixels, resulting in a total of 784 pixels per image. Each pixel is represented by an integer value ranging from 0 to 255. You can download this data subset as “fmnist.rda” (7.4 MB) from Blackboard.

load("fmnist.rda") # load sampled FMNIST data set
dim(fmnist$x) # dimension of features data matrix (10000, 784)

## [1] 10000 784

range(fmnist$x) # range of feature values (0 to 255)

## [1] 0 255

Here is a plot of the first 15 images:

par(mfrow = c(3, 5), mar = c(1, 1, 1, 1))
for (k in 1:15) { # first 15 images
  m <- matrix(fmnist$x[k, ], nrow = 28, byrow = TRUE)
  image(t(apply(m, 2, rev)), col = grey(seq(1, 0, length.out = 256)), axes = FALSE)
}

Each sample is assigned one label, represented by an integer from 0 to 9 (stored as an R factor with 10 levels):

fmnist$label[1:15] # first 15 labels

## [1] 7 1 4 8 1 4 7 1 2 0 7 0 8 1 6

## Levels: 0 1 2 3 4 5 6 7 8 9

Task 1: Dimension reduction for FMNIST data using principal components analysis (PCA)

The following steps are suggested guidelines to help structure your analysis but are not meant as assignment-style questions. Integrate your work as part of a cohesive report with a logical narrative.

• Do some research to learn more about the FMNIST data.

• Compute the 784 principal components from the 784 original pixel variables.

• Compute and plot the proportion of variation attributed to each principal component.

• Create a scatter plot of the first two principal components. Use the known labels to colour the scatter plot.

• Construct the correlation loadings plot.

• Interpret and discuss the result.

• Save the first 10 principal components of all 10,000 images to a data file for Task 2.
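The steps above could be sketched roughly as follows. This is only a minimal illustration, not a model answer: it assumes base R's prcomp() on centred, unscaled variables (a choice, since all pixels share the same units), and it runs on a small simulated stand-in so that it is self-contained. The objects X and labels, and the file name fmnist_pc10.rda, are hypothetical; for the coursework X and labels would be replaced by fmnist$x and fmnist$label.

```r
# Stand-in data so the sketch runs on its own (assumption: replace X and
# labels with fmnist$x and fmnist$label after load("fmnist.rda")).
set.seed(1)
n <- 300
d <- 20
labels <- factor(sample(0:9, n, replace = TRUE), levels = 0:9)
X <- matrix(rnorm(n * d), n, d) +
  as.integer(labels) %o% seq(1, 0, length.out = d)

# PCA on the centred (unscaled) variables
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of total variance attributed to each principal component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
plot(prop_var, type = "h", xlab = "Principal component",
     ylab = "Proportion of variance")

# Scatter plot of the first two principal components, coloured by label
plot(pca$x[, 1:2], col = rainbow(10)[as.integer(labels)], pch = 20,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(labels), col = rainbow(10),
       pch = 20, cex = 0.6)

# Correlation loadings: correlation of each original variable with PC1, PC2.
# Constant variables have undefined correlation, so they are dropped first.
keep <- apply(X, 2, sd) > 0
cor_load <- cor(X[, keep], pca$x[, 1:2])
plot(cor_load, xlim = c(-1, 1), ylim = c(-1, 1), asp = 1, pch = 20,
     xlab = "Correlation with PC1", ylab = "Correlation with PC2")
symbols(0, 0, circles = 1, inches = FALSE, add = TRUE)  # unit circle

# Save the scores of the first 10 principal components for Task 2
pc10 <- pca$x[, 1:10]
out_file <- "fmnist_pc10.rda"  # hypothetical file name
save(pc10, file = out_file)
```

With the real data, the points in the correlation loadings plot correspond to the 784 pixels, so it may be easier to read if the loadings are also displayed as 28 by 28 images.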

Task 2: Analysis of the FMNIST data set using Gaussian mixture models (GMMs)

Using all 784 pixel variables for cluster analysis is computationally impractical. In this task, use the 10 (or fewer) principal components instead of the original 784 pixel variables. Again, these steps serve as guidelines. Integrate this work into your report logically following from Task 1.
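One way to see why the full pixel space is impractical is to count the free parameters of a Gaussian mixture with unrestricted covariance matrices. The helper gmm_params() below is a hypothetical name introduced for illustration, and K = 10 components is only an assumed value (matching the number of labels), not a prescribed choice.

```r
# Free parameters of a K-component Gaussian mixture in d dimensions,
# assuming each component has its own unrestricted covariance matrix.
gmm_params <- function(d, K) {
  per_component <- d + d * (d + 1) / 2  # mean vector + symmetric covariance
  K * per_component + (K - 1)           # plus K - 1 free mixing weights
}

gmm_params(d = 784, K = 10)  # 3085049
gmm_params(d = 10, K = 10)   # 659
```

With only 10,000 images, roughly three million parameters cannot be estimated reliably, whereas a few hundred can.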

• Cluster the data using Gaussian mixture models (GMMs).

• Find out how many clusters can be identified.

• Interpret and discuss the results.
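A possible starting point for these steps is sketched below. It assumes the mclust package is installed (the brief does not prescribe a package) and uses its Mclust() function, which fits GMMs over a range of cluster numbers and selects one by BIC. The matrix pc and the vector truth are simulated stand-ins so that the example is self-contained; in the coursework they would be the saved principal component scores and fmnist$label.

```r
library(mclust)  # assumption: the mclust package is installed

# Stand-in for the principal component scores from Task 1 (assumption:
# replace pc with the saved scores and truth with fmnist$label).
set.seed(1)
pc <- rbind(matrix(rnorm(200 * 3, mean = 0), ncol = 3),
            matrix(rnorm(200 * 3, mean = 5), ncol = 3))
truth <- rep(1:2, each = 200)

# Fit GMMs with 1 to 6 components; the best model is chosen by BIC
fit <- Mclust(pc, G = 1:6)
summary(fit)
plot(fit, what = "BIC")  # BIC for each covariance model and cluster number

# Number of clusters selected, and comparison with the known labels
fit$G
table(cluster = fit$classification, label = truth)
adjustedRandIndex(fit$classification, truth)
```

The range given to G is an illustrative choice; for the real data a wider range would be needed to see where the BIC levels off.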

Structure of the report

Your report should be structured into the following sections:

1. Dataset

2. Methods

3. Results and Discussion

4. References

In Section 1 provide some background and describe the data set. In Section 2 briefly introduce the method(s) you are using to analyse the data. In Section 3 run the analyses and present and interpret the results. Show all your R code so that your results are fully reproducible. In Section 4 list all journal articles, books, Wikipedia entries, GitHub pages and other sources you refer to in your report.

Marking scheme

The project report will be assessed out of 30 points based on the following rubrics.

Criteria	Marks	Rubrics
Description of data	6	Excellent (5-6 marks): Provides a clear and thorough overview of the FMNIST dataset, detailing the image structure, pixel data, and its context within multivariate analysis. Good (3-4 marks): Provides a clear overview of the dataset with some context; minor details may be missing. Adequate (1-2 marks): Basic description of the dataset with limited context; lacks important details. Insufficient (0 marks): Little to no description provided.
Description of Methods	6	Excellent (5-6 marks): Clearly and thoroughly explains PCA and GMMs, their purposes, and how they apply to this dataset. Good (3-4 marks): Provides a clear explanation of PCA and GMMs, with minor gaps in clarity or relevance. Adequate (1-2 marks): Basic explanation of methods with limited detail or relevance to the course techniques. Insufficient (0 marks): Lacks clear explanations of the methods.
Results and Discussion	12	Excellent (10-12 marks): Correctly applies PCA and GMMs, presents clear and informative visualisations, and provides a coherent and insightful interpretation of the results. Good (7-9 marks): Accurately applies PCA and GMMs with mostly clear visuals and reasonable interpretation; minor improvements needed. Adequate (4-6 marks): Basic application of techniques, limited or unclear visuals, minimal interpretation. Insufficient (0-3 marks): Incorrect application of techniques, with little to no interpretation.
Overall Presentation of Report	6	Excellent (5-6 marks): Report is well-organised, clear, and professionally formatted, with a logical narrative and adherence to page limits. Good (3-4 marks): Report is generally clear and organised, with minor structural or formatting issues. Adequate (1-2 marks): Report lacks coherence or has significant formatting issues; may not meet all format requirements. Insufficient (0 marks): Report lacks structure and clarity, does not meet formatting requirements.