
FIT1043 Assignment 2: Description

 

Due date: Friday 27th September 2019, 11:55pm
Aim
The aim of this assignment is to investigate and visualise data using Python as a data science tool.
It will test your ability to:
1. read a data file in Python and extract related data from it;
2. use various graphical and non-graphical tools to perform data exploration, data wrangling
and data analysis;
3. use basic tools for managing and processing data; and
4. communicate your findings in your report.
Data
The dataset we will use comes from the TAO (Tropical Atmosphere Ocean) project, by the
Pacific Marine Environmental Lab of the U.S. National Oceanic and Atmospheric Administration.
This monitors the atmosphere in the tropical Pacific Ocean.
• The Tropical Atmosphere Ocean dataset we chose (TAO_2006.csv file) contains
atmosphere data from a specific monitoring site: (2◦N,165◦E).
• We chose to investigate environment data from January until September 2006, where the
measurements were taken every 10 minutes.
• The dataset contains information about Timestamp, date (YYYYMMDD) and time
(HHMMSS) of measurements, Precipitation (PREC), Air Temperature (AIRT), Sea Surface
Temperature (SST), Relative Humidity (RH), and the Quality (Q) of measurements.
• The file is available on Moodle and is publicly available from pmel.noaa.gov.
Hand-in Requirements
Please hand in a PDF file (see footnote 1) containing your answers and a Jupyter notebook file (.ipynb)
containing your Python code for all the questions:
The PDF file should contain:
1. Answers to the questions. In order to justify your answers to all the questions, make
sure to
a. Include screenshots/images of the graphs you generate (You will need to use
screen-capture functionality to create appropriate images.)
b. Include copy/paste of your Python code (not images of your code but the actual
text).
The .ipynb file should contain:
1. A copy of your working Python code to answer the questions.
You will need to submit two separate files (the PDF file and the .ipynb file). Zip, rar or any
other compressed archive formats are not acceptable and will incur a 10% penalty.
Footnote 1: You can use Word or other word processing software to format your submission. Just save the final
copy as a PDF before submitting.
Supportive Material/Code:
Material: In order to complete your assignment, you may want to use the regressiondemo.py
code used in the week 5 tutorial. If you use this code, you do not need to upload the
regressiondemo.py file in your final submission.
Code: If "YYYYMMDD" is in datetime format, you can extract the month and day from it using
the .dt accessor and create new columns for month and day as follows:
>>> your_dataframe['Month']=your_dataframe['YYYYMMDD'].dt.month
>>> your_dataframe['Day']=your_dataframe['YYYYMMDD'].dt.day
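The .dt accessor only works once the column really holds datetimes. If the CSV reader loads YYYYMMDD as integers or strings, a conversion step is needed first. The following is a minimal sketch, assuming values of the form 20060101; the three-row DataFrame named sample is made up for illustration and is not taken from TAO_2006.csv.

```python
import pandas as pd

# Hypothetical three-row sample; the real file has many more rows and columns.
sample = pd.DataFrame({"YYYYMMDD": [20060101, 20060215, 20060930]})

# Convert to strings first so the format is matched against "20060101",
# then parse with an explicit year-month-day format.
sample["YYYYMMDD"] = pd.to_datetime(sample["YYYYMMDD"].astype(str), format="%Y%m%d")

# Once the column is datetime, the .dt accessor exposes month and day.
sample["Month"] = sample["YYYYMMDD"].dt.month
sample["Day"] = sample["YYYYMMDD"].dt.day
print(sample)
```

Passing an explicit format avoids relying on pandas guessing how to read an eight-digit number.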
Python Availability
You will need to use Python to complete the assignment. You can do this by either:
1) running a Jupyter Notebook on a computer in the labs; or
2) installing Python (we recommend Anaconda) on your own machine.
Assignment Tasks:
There are two tasks that you need to complete for this assignment. Students that complete only
Tasks A1-A9 can only get a maximum of Distinction. Students that attempt tasks A10 and B
will be showing critical analysis skills and a deeper understanding of the task at hand and can
achieve the highest grade. You need to use Python to complete the tasks.
Task A: Data Wrangling and Analysis on TAO dataset
In this task, you are required to explore the dataset and do some data analysis on the Tropical
Atmosphere Ocean dataset. Have a look at the csv file (TAO_2006.csv) and then answer a series
of questions about the data using Python.
A1. Dataset size
How many rows and columns exist in this dataset?
A2. Min/Max values in each column
Find the maximum and minimum values for Precipitation (PREC), Air temperature (AIRT), Sea surface
temperature (SST) and Relative humidity (RH) in this dataset.
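As a rough illustration of A1 and A2, the size of a DataFrame and its per-column extremes can be read off in a couple of lines. This sketch uses made-up numbers in place of the real file; only the column names come from the Data section above.

```python
import pandas as pd

# Made-up values standing in for TAO_2006.csv.
df = pd.DataFrame({
    "PREC": [0.0, 1.2, 0.4],
    "AIRT": [27.5, 28.1, 26.9],
    "SST": [29.0, 29.3, 28.8],
    "RH": [80.0, 75.5, 83.2],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

# agg with a list of function names returns one row per statistic.
summary = df[["PREC", "AIRT", "SST", "RH"]].agg(["min", "max"])
print(n_rows, n_cols)
print(summary)
```

On the raw file, a minimum computed before the missing-value step in A4 may simply be one of the missing-value codes rather than a real measurement.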
A3. Number of records in each month
List the number of records in each month. In which two months is the number of records at its
lowest? Why?
A4. Missing values
Missing values in this dataset are encoded as -9.990000 or -99.900000.
1. How many rows contain missing values (-9.990000 or -99.900000) in this dataset?
2. List the months with no missing values in them.
3. Remove the records with missing values.
Note: Use the dataset with missing values removed from here onwards.
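One possible way to approach the three steps of A4 is to build a boolean mask of the missing-value codes and reuse it for counting, for listing months, and for filtering. This is a sketch on made-up rows; it assumes a Month column has already been created, and the names MISSING_CODES and value_cols are illustrative.

```python
import pandas as pd

MISSING_CODES = [-9.99, -99.9]

# Made-up rows; the second and fourth rows each contain a missing-value code.
df = pd.DataFrame({
    "Month": [1, 1, 2, 3],
    "PREC": [0.0, -9.99, 0.3, 0.1],
    "RH": [80.0, 78.0, 81.0, -99.9],
})

value_cols = ["PREC", "RH"]

# True wherever a cell equals one of the codes, then collapse to one flag per row.
is_missing = df[value_cols].isin(MISSING_CODES)
rows_with_missing = is_missing.any(axis=1)

n_missing_rows = int(rows_with_missing.sum())

# Months that never appear among the flagged rows.
months_with_missing = set(df.loc[rows_with_missing, "Month"])
months_without_missing = sorted(set(df["Month"]) - months_with_missing)

# Keep only the complete rows.
clean = df.loc[~rows_with_missing].reset_index(drop=True)
```

Exact comparison is used here on the assumption that the codes are stored exactly as written in the file; if they have passed through arithmetic, a tolerance-based check would be safer.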
A5. Investigating Sea surface temperature (SST) in different months
Now look at the sea surface temperature (SST) column and answer the following questions:
1. Using a boxplot, visualize the distribution of SST over different months.
2. Describe the trend of median SST over different months.
3. Which month has the highest median SST? Which month has the lowest?
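The medians that a per-month boxplot displays can also be computed directly, which helps when describing the trend in words. The sketch below uses made-up SST readings and assumes a Month column exists; the boxplot call is left as a comment so the example runs without a plotting backend.

```python
import pandas as pd

# Made-up readings: three per month for three months.
df = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "SST": [29.0, 29.2, 29.4, 29.5, 29.7, 29.9, 28.6, 28.8, 29.0],
})

# One median per month, indexed by month number.
median_sst = df.groupby("Month")["SST"].median()

# idxmax / idxmin return the index label (the month), not the value.
highest_month = median_sst.idxmax()
lowest_month = median_sst.idxmin()

# With matplotlib installed, the same grouping can be drawn as one box per month:
# df.boxplot(column="SST", by="Month")
```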
A6. Exploring precipitation measurements (PREC)
Now look at the Precipitation column and answer the following questions:
1. Precipitation values in this dataset show rain rates. Plot Precipitation values over different
timestamps.
2. Due to measurement error, there are some counter-intuitive values in the Precipitation
column. Identify those values and replace them with zero.
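The replacement step can be done with a boolean mask. The sketch below assumes, purely for illustration, that the counter-intuitive values are negative rain rates; the handout does not say which values are meant, so that condition should be checked against the plot from step 1.

```python
import pandas as pd

# Made-up rain rates, including two negative readings.
df = pd.DataFrame({"PREC": [0.0, 0.5, -0.2, 1.3, -0.1]})

# Assumption: "counter-intuitive" is read here as "below zero".
negative = df["PREC"] < 0
n_replaced = int(negative.sum())

# Overwrite only the flagged cells and leave the rest untouched.
df.loc[negative, "PREC"] = 0.0
```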
Note: Use the dataset produced in Task A6 to complete Tasks A7-A9.
A7. Relationship between variables
1. Compute the pairwise correlation of the precipitation, air temperature and sea surface
temperature columns. Which two features have the least linear association?
2. Now let's look at the relationship between air temperature and relative humidity. Plot the
values of these features against each other. Is there any relationship between these two
features? Describe it.
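For step 1, a correlation matrix can be computed with DataFrame.corr, and the weakest pair is the one whose coefficient is closest to zero in absolute value. This is a sketch on made-up numbers; the variable names are illustrative.

```python
from itertools import combinations

import pandas as pd

# Made-up values; only the column names follow the dataset description.
df = pd.DataFrame({
    "PREC": [0.0, 0.4, 0.1, 0.9, 0.2, 0.0],
    "AIRT": [28.0, 27.1, 27.9, 26.0, 27.5, 28.3],
    "SST": [29.0, 29.1, 29.3, 28.9, 29.2, 29.4],
})

cols = ["PREC", "AIRT", "SST"]

# Pearson correlation is the default method.
corr = df[cols].corr()

# Compare each unordered pair once and pick the smallest |r|.
weakest_pair = min(
    combinations(cols, 2),
    key=lambda pair: abs(corr.loc[pair[0], pair[1]]),
)
print(corr)
print(weakest_pair)
```

Ranking by absolute value matters because a strongly negative coefficient still indicates a strong linear association.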
A8. Predicting quality of measurements (Q)
We now want to build a predictive model to predict the quality of measurements (Q) in the dataset
based on four features: Precipitation (PREC), Air temperature (AIRT), Sea surface temperature
(SST) and Relative humidity (RH).
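A minimal sketch of the split, fit and evaluate workflow this task asks for, using scikit-learn on synthetic data. The feature matrix and the labels are randomly generated stand-ins, and the two label values 1 and 2 are hypothetical; the real Q column may use different codes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, four features in the order PREC, AIRT, SST, RH.
X = rng.normal(size=(200, 4))
# Hypothetical imbalanced labels: a minority of rows get class 2, the rest class 1.
y = np.where(X[:, 3] > 1.28, 2, 1)

# 75% training, 25% testing; stratify keeps the class proportions similar in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(cm)
print(acc)
```

Accuracy equals the diagonal of the confusion matrix divided by its total, so the matrix shows how each class contributes to that single number.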
1. Divide the dataset into a 75% training set and a 25% testing set and train a decision tree
model.
2. Using the test set, compute the confusion matrix and accuracy.
3. Considering accuracy only, do you think that this is a good model? What other metric(s)
should we consider as well? Why? Elaborate on your answer.
A9. Investigating daily relative humidity (RH)
We will now investigate the trend in the daily relative humidity over time. For this, you will need to
aggregate the median relative humidity by day.
1. Fit a linear regression using Python to this data (i.e., relative humidity over different days)
and plot the linear fit.
2. Use the linear fit to predict median relative humidity on 2nd September 2006.
3. Can you think of a better model that fits all of the aggregated data to capture the trend in
relative humidity over time? Describe the model you suggested and explain why it is
better suited for this task.
4. Use your new model to predict median relative humidity on 2nd September 2006 and
compare with the prediction of your previous linear fit.
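The aggregation and the linear fit from steps 1 and 2 can be sketched as follows. The readings are synthetic, np.polyfit is used as one convenient fitting routine (the week 5 tutorial code may do this differently), and predict_rh is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Synthetic data: three days with three readings each.
df = pd.DataFrame({
    "YYYYMMDD": pd.to_datetime(
        ["2006-01-01"] * 3 + ["2006-01-02"] * 3 + ["2006-01-03"] * 3
    ),
    "RH": [70.0, 72.0, 74.0, 73.0, 74.0, 75.0, 75.0, 76.0, 77.0],
})

# One median per calendar day.
daily = df.groupby("YYYYMMDD")["RH"].median()

# Regression needs a numeric predictor, so use days since the first observation.
day_index = (daily.index - daily.index[0]).days.to_numpy()
slope, intercept = np.polyfit(day_index, daily.to_numpy(), deg=1)


def predict_rh(date_string):
    """Evaluate the fitted line at the given date (hypothetical helper)."""
    days = (pd.Timestamp(date_string) - daily.index[0]).days
    return slope * days + intercept


print(predict_rh("2006-01-05"))
```

Converting dates to a day count keeps the slope interpretable as the change in median RH per day.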
A10. Filling in missing values
Rather than removing the missing values as in Task A4, fill in the missing values (for the RH
column only) using an appropriate regression model.
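One way to structure regression-based filling is to fit on the rows where RH was observed and predict for the rows where it was not. The sketch below assumes, for illustration only, a straight-line model with AIRT as the single predictor and that AIRT itself is present in the affected rows; choosing and justifying the model is part of the task.

```python
import numpy as np
import pandas as pd

# Made-up rows; the third RH value carries a missing-value code.
df = pd.DataFrame({
    "AIRT": [26.0, 27.0, 28.0, 29.0, 30.0],
    "RH": [88.0, 84.0, -99.9, 76.0, 72.0],
})

rh_missing = df["RH"].isin([-9.99, -99.9])
known = df.loc[~rh_missing]

# Fit RH = slope * AIRT + intercept using only the observed rows.
slope, intercept = np.polyfit(
    known["AIRT"].to_numpy(), known["RH"].to_numpy(), deg=1
)

# Replace the codes with the model's predictions for those rows.
df.loc[rh_missing, "RH"] = slope * df.loc[rh_missing, "AIRT"] + intercept
print(df)
```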
Task B: K-means Clustering on Other Data
We demonstrated the k-means clustering algorithm in week 7. Your task in this part is to find an
interesting dataset and apply k-means clustering to it using Python. Kaggle, a private
company which runs data science competitions, provides a list of their publicly available datasets:
https://www.kaggle.com/datasets
In particular you need to:
1. choose two numerical features in your dataset and use k-means clustering in Python to
partition your data into k clusters, where k>=2.
2. visualise the data as well as the results of the k-means clustering. Ideally each cluster is
shown in a different colour.
3. describe your findings about the identified clusters.
4. investigate/suggest some appropriate measures to evaluate the quality of your clusters.
You can search online for this task.
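The clustering and evaluation steps above can be sketched with scikit-learn. The data here are two synthetic blobs rather than a real dataset, and the silhouette score and inertia are shown only as examples of quality measures; other measures may suit your data better.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic two-feature data: two well-separated blobs of 50 points each.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# n_init is set explicitly so the result does not depend on the library default.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Inertia: within-cluster sum of squared distances (lower is tighter).
inertia = kmeans.inertia_
# Silhouette: between -1 and 1, higher means better-separated clusters.
silhouette = silhouette_score(X, labels)
print(inertia, silhouette)

# With matplotlib installed, colouring by label shows each cluster separately:
# plt.scatter(X[:, 0], X[:, 1], c=labels)
```

If the two features are on very different scales, standardising them before clustering is worth considering, since k-means relies on Euclidean distance.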
Please note that you cannot use any dataset already used in this unit's tutorials.
Please include a link to your dataset in your report. You may wish to:
1. provide the direct link to the public dataset from the internet, or
2. place the data file in your Monash student Google Drive and provide its link in the
submission.
Good Luck!