STATS 2DA3 FALL 2025

ASSIGNMENT 1

1. (4 MARKS)

Visualizing two categorical variables.

For the mpg dataset in the ggplot2 package, perform the following tasks:

(a) Create a Double Decker plot, displaying “drv” as a function of “class” (class should be on the x-axis). Make sure you colour the “drv” variable so that each level is a different colour.

(b) For “compact” class cars, which type of drive train is the most common?

(d) Using ggplot, make a bar chart (geom_bar) displaying “class”. Colour the “class” variable with respect to the “drv” variable.
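One possible starting point for parts (a) and (d) is sketched below. It assumes the vcd package is installed and uses its doubledecker() function; the assignment does not name a package, and the three fill colours are arbitrary placeholders.

```r
library(ggplot2)  # provides the mpg dataset and geom_bar()
library(vcd)      # assumption: doubledecker() comes from the vcd package

# (a) drv as a function of class: the predictor goes on the right of ~,
#     so class is laid out along the x-axis and drv splits each column.
drv_colours <- c("tomato", "steelblue", "gold")  # placeholder colours, one per drv level
doubledecker(drv ~ class, data = mpg,
             gp = grid::gpar(fill = drv_colours))

# (d) Bar chart of class, with each bar filled according to drv.
class_bar <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()
print(class_bar)
```

The counts behind both graphs can be checked with table(mpg$class, mpg$drv).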

2. (7 MARKS) Go to the following website, “UCI Machine Learning Repository”, and select a dataset that is suitable for visual data analysis. Downloading a .csv file is recommended for ease of use. Do not use a dataset that has been used in class or in your lab.

(a) Report the name of your dataset, the URL that you downloaded it from, and the date and time of downloading.

(b) Produce a table of statistics for your dataset, including variable names, number of observations, and variable types (categorical, continuous, etc.). That is, summarize what your dataset “looks like”.

i. a bar chart

ii. a histogram

iii. a double decker plot

iv. a scatter plot

v. a box plot

(d) Then create an image (like we did with the Titanic dataset in lecture 3), displaying the 2 graphs in one image. Use R code to carry this out (use R code to display your graphs side by side), do not screen grab images and combine.
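The sketch below illustrates one way to build the summary table in part (b) and the side-by-side image in part (d) using base R only. The built-in iris data stands in for the downloaded file, and the output file name is hypothetical; the approach shown in lecture 3 may differ.

```r
# Stand-in data: replace with e.g. read.csv("your_file.csv") (hypothetical file name).
dat <- iris

# (b) One row per variable: name, type, observation count, missing values.
overview <- data.frame(
  variable  = names(dat),
  type      = vapply(dat, function(col) class(col)[1], character(1)),
  n_obs     = nrow(dat),
  n_missing = vapply(dat, function(col) sum(is.na(col)), integer(1)),
  row.names = NULL
)
print(overview)

# (d) Two graphs in one image, drawn entirely from R code.
out_file <- file.path(tempdir(), "side_by_side.pdf")  # hypothetical output path
pdf(out_file, width = 10, height = 5)
par(mfrow = c(1, 2))  # 1 row, 2 columns of panels
hist(dat$Sepal.Length, main = "Histogram", xlab = "Sepal.Length")
boxplot(Sepal.Length ~ Species, data = dat, main = "Box plot")
dev.off()
```

In an R Markdown document the pdf() and dev.off() calls can be dropped, since chunk output is embedded directly. Note that par(mfrow) has no effect on ggplot2 graphs; gridExtra::grid.arrange() is one alternative for those.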

3. (11 MARKS) Go to the following website, “UCI Machine Learning Repository”, and select a dataset that is suitable for clustering analysis (you may not use the same dataset from question 2). Downloading a .csv file is recommended for ease of use. Do not use a dataset that has been used in class or in your lab.

(a) Report the name of your dataset, the URL that you downloaded it from, and the date and time of downloading.

(b) Perform basic preprocessing where appropriate: handle missing values, normalize/scale variables if necessary, etc.

(d) Produce Silhouette plots for the three different values of k.

(e) Compare the Silhouette results across the three different k values and discuss which choice seems best.

(f) For the value of k that you picked as the best choice, produce a table that compares the labels generated by the clustering algorithm to the true class labels.
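A rough outline of parts (b) and (d) to (f) is sketched below. Several things here are assumptions rather than requirements of the assignment: k-means is used as the clustering algorithm, silhouette() comes from the cluster package, the built-in iris data stands in for the chosen dataset, and the k values 2, 3 and 4 are placeholders.

```r
library(cluster)  # silhouette(); assumption: the cluster package is available

dat <- iris[complete.cases(iris), ]           # stand-in data; drop rows with missing values
x   <- scale(dat[, sapply(dat, is.numeric)])  # keep numeric columns, standardize each
d   <- dist(x)                                # Euclidean distances for the silhouettes

set.seed(2025)                                # k-means starting points are random
ks  <- c(2, 3, 4)                             # placeholder values of k
avg_width <- setNames(numeric(length(ks)), ks)

for (i in seq_along(ks)) {
  fit <- kmeans(x, centers = ks[i], nstart = 25)
  sil <- silhouette(fit$cluster, d)
  plot(sil, main = paste("Silhouette plot, k =", ks[i]))  # (d) one plot per k
  avg_width[i] <- mean(sil[, "sil_width"])                # (e) average silhouette width
}
print(round(avg_width, 3))

# (f) Cross-tabulate cluster labels against the true classes for the chosen k.
best_k      <- ks[which.max(avg_width)]
best_fit    <- kmeans(x, centers = best_k, nstart = 25)
label_table <- table(cluster = best_fit$cluster, truth = dat$Species)
print(label_table)
```

Picking the k with the largest average silhouette width is only one possible criterion; the shape of the individual silhouette plots is also worth discussing in part (e).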

4. (5 MARKS)

(a) Explain the goal of clustering.

(b) Describe the k-means algorithm step by step.

(d) Write down the objective function minimized by k-means.
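For orientation, the standard k-means iteration (Lloyd's algorithm) can be sketched in a few lines of R. This is a minimal illustration on simulated data, not a replacement for stats::kmeans(); the function name simple_kmeans and the simulated points are made up for the example. The value it reports as wss is the within-cluster sum of squared Euclidean distances to the cluster means, the quantity k-means tries to minimize.

```r
# Illustrative Lloyd's algorithm; simple_kmeans is a hypothetical helper name.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct observations as the initial centres.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared distance from every point to every centre (n x k matrix),
    #         then assign each point to its nearest centre.
    d2 <- vapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, centers[j, ])^2),
                 numeric(nrow(x)))
    new_assignment <- max.col(-d2, ties.method = "first")
    if (all(new_assignment == assignment)) break  # assignments stable: converged
    assignment <- new_assignment
    # Step 3: move each centre to the mean of its assigned points.
    for (j in seq_len(k)) {
      members <- assignment == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  # Objective: sum of squared distances from each point to its cluster centre.
  wss <- sum((x - centers[assignment, , drop = FALSE])^2)
  list(cluster = assignment, centers = centers, wss = wss)
}

# Simulated example: two groups of 20 points in two dimensions.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
fit <- simple_kmeans(x, k = 2)
print(fit$centers)
print(fit$wss)
```

Because the starting centres are random, different seeds can lead to different local minima, which is why stats::kmeans() offers the nstart argument.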

Assignment Standards

• Answer each question. Do not just provide code. Any graphs must be rendered and reproduced in the report.

• LaTeX or R Markdown in RStudio is recommended but not required.

• Submit your assignment as one .pdf document.

• All R code should be included and organized either at the end of the assignment or inline (if using R Markdown).
