FIT1043 Assignment 2: Description

Due date: Friday 27th September 2019, 11:55pm

Aim

The aim of this assignment is to investigate and visualise data using Python as a data science tool.

It will test your ability to:

1. read a data file in Python and extract related data from it;

2. use various graphical and non-graphical tools to perform data exploration, data wrangling

and data analysis;

3. use basic tools for managing and processing data; and

4. communicate your findings in your report.

Data

The dataset we will use comes from the TAO (Tropical Atmosphere Ocean) project, by the

Pacific Marine Environmental Lab of the U.S. National Oceanic and Atmospheric Administration.

This monitors the atmosphere in the tropical Pacific Ocean.

• The Tropical Atmosphere Ocean dataset we chose (TAO_2006.csv file) contains

atmosphere data from a specific monitoring site: (2°N, 165°E).

• We chose to investigate environment data from January until September 2006, where the

measurements were taken every 10 minutes.

• The dataset contains information about Timestamp, date (YYYYMMDD) and time

(HHMMSS) of measurements, Precipitation (PREC), Air Temperature (AIRT), Sea Surface

Temperature (SST), Relative Humidity (RH), and the Quality (Q) of measurements.  

• The file is available on Moodle and is publicly available from pmel.noaa.gov.

Hand-in Requirements

Please hand in a PDF file [1] containing your answers and a Jupyter notebook file (.ipynb)

containing your Python code for all the questions:

● The PDF file should contain:

1. Answers to the questions. In order to justify your answers to all the questions, make

sure to

a. Include screenshots/images of the graphs you generate (You will need to use

screen-capture functionality to create appropriate images.)

b. Include copy/paste of your Python code (not images of your code but the actual

text).

● The .ipynb file should contain:

1. A copy of your working Python code to answer the questions.

● You will need to submit two separate files (the PDF file and the .ipynb file). Zip, rar or any

other compressed archive formats are not acceptable and will incur a penalty of 10%.

[1] You can use Word or other word-processing software to format your submission. Just save the final

copy as a PDF before submitting.

Supportive Material/Code:

• Material: In order to complete your assignment, you may want to use the regressiondemo.py

code used in the week 5 tutorial. If you use this code, you do not need to upload the

regressiondemo.py file in your final submission.

• Code: If "YYYYMMDD" is in datetime format, you can extract the month and day from it using

the .dt accessor and create new columns for month and day as follows:

>>> your_dataframe['Month']=your_dataframe['YYYYMMDD'].dt.month

>>> your_dataframe['Day']=your_dataframe['YYYYMMDD'].dt.day
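
The .dt accessor is only available once the column holds datetime values. The following is a minimal sketch of that conversion, assuming the dates are read from the file as integers such as 20060115. The toy values and the name df are invented for illustration and are not taken from TAO_2006.csv.

```python
import pandas as pd

# Toy frame standing in for the real file; the values are made up.
df = pd.DataFrame({"YYYYMMDD": [20060115, 20060203, 20060930]})

# Parse the integer dates into datetime64 values so that .dt becomes available.
df["YYYYMMDD"] = pd.to_datetime(df["YYYYMMDD"].astype(str), format="%Y%m%d")

# Extract month and day into new columns, as in the hint above.
df["Month"] = df["YYYYMMDD"].dt.month
df["Day"] = df["YYYYMMDD"].dt.day
print(df)
```

If the column is already parsed as datetime when the file is read, the conversion step can be skipped.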

Python Availability

You will need to use Python to complete the assignment. You can do this by either:

1) running a Jupyter Notebook on a computer in the labs; or

2) installing Python (we recommend Anaconda) on your own machine.

Assignment Tasks:

There are two tasks that you need to complete for this assignment. Students that complete only

Tasks A1-A9 can only get a maximum of Distinction. Students that attempt tasks A10 and B

will be showing critical analysis skills and a deeper understanding of the task at hand and can

achieve the highest grade. You need to use Python to complete the tasks.

Task A: Data Wrangling and Analysis on TAO dataset

In this task, you are required to explore the dataset and do some data analysis on the Tropical

Atmosphere Ocean dataset. Have a look at the csv file (TAO_2006.csv) and then answer a series

of questions about the data using Python.

A1. Dataset size

How many rows and columns exist in this dataset?

A2. Min/Max values in each column

Find the maximum and minimum values for Precipitation (PREC), Air temperature (AIRT), Sea surface

temperature (SST) and Relative humidity (RH) in this dataset.
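
One possible way to obtain the size of a table and per-column extremes in pandas is sketched below. The frame is a three-row toy example with invented values; only the column names follow the dataset description.

```python
import pandas as pd

# Toy data; the column names follow the assignment, the values are invented.
df = pd.DataFrame({
    "PREC": [0.0, 1.2, 0.3],
    "AIRT": [27.5, 28.1, 26.9],
    "SST": [29.0, 29.4, 28.8],
    "RH": [80.1, 78.4, 85.0],
})

# shape is a (rows, columns) tuple.
print(df.shape)

# agg with a list of function names returns one row per statistic.
summary = df[["PREC", "AIRT", "SST", "RH"]].agg(["min", "max"])
print(summary)
```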

A3. Number of records in each month

List the number of records in each month. In which two months is the number of records

lowest? Why?

A4. Missing values

Some values are missing: they are recorded as -9.990000 or -99.900000.

1. How many rows contain missing values (-9.990000 or -99.900000) in this dataset?

2. List the months with no missing values in them.

3. Remove the records with missing values.

Note: Use the dataset with missing values removed from here onwards.
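
Sentinel codes like these can be turned into proper missing values before counting or dropping rows. The sketch below shows one way to do that on a four-row toy frame with invented values. The names sentinels, cleaned and complete are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data; the column names follow the assignment, the values are invented.
df = pd.DataFrame({
    "PREC": [0.0, -9.99, 1.2, 0.0],
    "AIRT": [27.5, 28.1, -99.9, 27.9],
    "SST": [29.0, 29.1, 29.2, 29.3],
    "RH": [80.1, -99.9, 78.4, 82.0],
})

sentinels = [-9.99, -99.9]

# Replace the sentinel codes with NaN so that pandas treats them as missing.
cleaned = df.replace(sentinels, np.nan)

# Count rows that have at least one missing measurement.
rows_with_missing = int(cleaned.isna().any(axis=1).sum())

# Drop those rows to obtain a frame with complete records only.
complete = cleaned.dropna().reset_index(drop=True)
print(rows_with_missing, len(complete))
```

Converting to NaN first keeps the counting and the removal consistent, since both then rely on the same definition of "missing".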

A5. Investigating Sea surface temperature (SST) in different months

Now look at the sea surface temperature (SST) column and answer the following questions:

1. Using a boxplot, visualize the distribution of SST over different months.

2. Describe the trend of median SST over different months.

3. Which month has the highest median SST? Which month has the lowest?

A6. Exploring precipitation measurements (PREC)

Now look at the Precipitation column and answer the following questions:

1. Precipitation values in this dataset show rain rates. Plot Precipitation values over different

timestamps.

2. Due to measurement error, there are some counter-intuitive values in the Precipitation

column. Identify those values and replace them with zero.

Note: Use the dataset from the previous task (Task A6) to complete Tasks A7-A9.

A7. Relationship between variables

1. Compute the pairwise correlation of the columns precipitation, air temperature and sea surface

temperature. Which two features have the least linear association?

2. Now let's look at the relationship between air temperature and relative humidity. Plot the

values of these features against each other. Is there any relationship between these two

features? Describe it.
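
Pairwise linear association is commonly measured with the Pearson correlation, which DataFrame.corr() computes by default. The sketch below uses synthetic data in which SST is deliberately constructed to track AIRT and PREC is generated independently, so the numbers say nothing about the real dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
airt = rng.normal(28.0, 1.0, n)
df = pd.DataFrame({
    "AIRT": airt,
    "SST": airt + rng.normal(1.0, 0.3, n),  # constructed to follow AIRT closely
    "PREC": rng.exponential(0.5, n),         # constructed independently
})

# Pearson correlation matrix of all numeric columns.
corr = df.corr()

# Absolute correlation for each unordered pair of distinct columns.
pairs = {(a, b): abs(corr.loc[a, b]) for a, b in combinations(corr.columns, 2)}

# The pair with the smallest absolute correlation has the least linear association.
weakest_pair = min(pairs, key=pairs.get)
print(corr.round(2))
print(weakest_pair)
```

The absolute value is used because a strong negative correlation is still a strong linear association.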

A8. Predicting quality of measurements (Q)

We now want to build a predictive model to predict the quality of measurements (Q) in the dataset

based on four features: Precipitation (PREC), Air temperature (AIRT), Sea surface temperature

(SST) and Relative humidity (RH).
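
As a rough illustration of this kind of workflow, the sketch below trains a decision tree with scikit-learn on synthetic data. The features, the binary label, the stratify option and the random_state values are all assumptions made for the sketch and are not taken from the TAO dataset or required by the task.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in: 400 rows, four numeric features, an imbalanced binary label.
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1.2).astype(int)

# 75% training / 25% testing; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(f"accuracy = {acc:.3f}")
```

With an imbalanced label, a model can reach high accuracy by mostly predicting the majority class, which is why the confusion matrix is printed alongside the accuracy.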

1. Divide the dataset into a 75% training set and a 25% testing set and train a decision tree

model.

2. Using the test set, compute the confusion matrix and accuracy.

3. Considering accuracy only, do you think that this is a good model? What other metric(s)

should we consider as well? Why? Elaborate on your answer.

A9. Investigating daily relative humidity (RH)

We will now investigate the trend in the daily relative humidity over time. For this, you will need to

aggregate the median relative humidity by day.

1. Fit a linear regression using Python to this data (i.e., relative humidity over different days)

and plot the linear fit.

2. Use the linear fit to predict median relative humidity on 2nd September 2006.

3. Can you think of a better model that fits all of the aggregated data to capture the trend in

relative humidity over time? Describe the model you suggested and explain why it is

better suited for this task.

4. Use your new model to predict median relative humidity on 2nd September 2006 and

compare with the prediction of your previous linear fit.
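
The aggregation and linear fit described above could be sketched as follows. This uses synthetic 10-minute readings over 30 days with an invented downward trend. The column name timestamp, the use of numpy.polyfit and the target date are choices made for the sketch, and the week 5 regressiondemo.py code may take a different approach.

```python
import numpy as np
import pandas as pd

# Synthetic 10-minute readings for 30 days; the values are invented.
rng = np.random.default_rng(1)
ts = pd.date_range("2006-01-01", periods=30 * 144, freq="10min")
day_index = (ts - ts[0]).days.to_numpy()
rh = 80.0 - 0.05 * day_index + rng.normal(0.0, 2.0, len(ts))
df = pd.DataFrame({"timestamp": ts, "RH": rh})

# Median RH per calendar day (normalize() truncates each timestamp to midnight).
daily = df.groupby(df["timestamp"].dt.normalize())["RH"].median()

# Use days since the first date as the numeric predictor.
x = (daily.index - daily.index[0]).days.to_numpy()
slope, intercept = np.polyfit(x, daily.to_numpy(), deg=1)

# Predict for a later date by converting it to the same day count.
target = pd.Timestamp("2006-02-15")
x_new = (target - daily.index[0]).days
prediction = slope * x_new + intercept
print(f"slope={slope:.3f}, prediction={prediction:.1f}")
```

Dates need a numeric encoding before a regression can use them. Counting days from the first observation is one simple option.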

A10. Filling in missing values

Rather than removing the missing values in task A4, fill in the missing values (for the RH column

only) using an appropriate regression model.

Task B: K-means Clustering on Other Data

We demonstrated the k-means clustering algorithm in week 7. Your task in this part is to find an

interesting dataset and apply k-means clustering to it using Python. Kaggle, a private

company which runs data science competitions, provides a list of their publicly available datasets:

https://www.kaggle.com/datasets

In particular you need to:

1. choose two numerical features in your dataset and use k-means clustering in Python to

partition your data into k clusters, where k >= 2.

2. visualise the data as well as the results of the k-means clustering. Ideally each cluster is

shown in a different colour.

3. describe your findings about the identified clusters.

4. investigate/suggest some appropriate measures to evaluate the quality of your clusters.

You can search online for this task.
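
The steps above could be sketched with scikit-learn as follows. The data are two synthetic, well-separated blobs generated for illustration. The choice of k = 2 and of the silhouette score as a quality measure are assumptions of the sketch, and other measures may suit a real dataset better.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)

# Two synthetic, well-separated blobs described by two numeric features.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
X = np.vstack([blob_a, blob_b])

# Fit k-means with k = 2 and get one cluster label per row.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
score = silhouette_score(X, labels)
print(kmeans.cluster_centers_.round(2))
print(f"inertia={kmeans.inertia_:.1f}, silhouette={score:.2f}")
```

For the visualisation, a scatter plot of the two features coloured by labels (for example matplotlib's plt.scatter(X[:, 0], X[:, 1], c=labels)) shows each cluster in a different colour.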

Please note that you cannot use a dataset that was used in this unit's tutorials.

Please include a link to your dataset in your report. You may wish to:

1. provide the direct link to the public dataset from the internet, or

2. place the data file in your Monash student Google Drive and provide its link in the

submission.

Good Luck!
