Homework #1
Fall 2018
AD699: Data Mining for Business Analytics
Topics: Data Exploration and Visualization
A note about submissions: Unlike Olympic figure skating and ski jumping, AD699 does not
award style points. There are some fancy tools within R for generating reports, such as
RMarkdown, but learning them is not within the scope of this course. The most important thing
here is to answer the questions that ask for written answers, and to show screenshots where
screenshots are asked for.
This assignment is due by 11:59 p.m. on Monday, September 17th.
Step 1:
Download this file from our course Blackboard site:
a) athlete_events.csv
Part I: Data Exploration
1. Bring this file into your R environment. Assign the name athletes to this file. Show the
code that you used to do this. (Remember to first set your working directory to the folder
that contains your files).
2. How many rows and how many columns does athletes contain? How do you know this?
3. Are there any missing values in the athletes data set? If so, how do you know this?
(Note: There are MANY ways that you could answer this question, and any valid way is
completely fine).
4. Remove all rows in the athletes data set that contain any missing values, and store the
results of this operation in a new variable called athletes2. What are the dimensions of
athletes2?
5. Based on the data in athletes2, what is the mean age of an Olympic medalist? What is
the median age? Show the code that you used to find this out, along with a screenshot
of your results.
6. How many Olympic medalists were male, and how many were female? (Hint: Use the
table function to help you with this). Show the code that you used to find this out, along
with a screenshot of your results.
7. How old was the youngest Olympic medal winner in the dataset? How old was the oldest
Olympic medal winner in the dataset? Show the code that you used to find this out,
along with a screenshot of your results.
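As a rough illustration of the kinds of base R calls Part I relies on, here is a minimal sketch run on a tiny made-up data frame called toy, not on athlete_events.csv. The column names (Age, Sex, Medal) and every value are assumptions for illustration only; check the columns in your own file with names() or str() before adapting anything.

```r
# Toy data standing in for a real file; every value here is made up.
toy <- data.frame(
  Age   = c(23, NA, 31, 19, 27),
  Sex   = c("F", "M", "M", "F", "M"),
  Medal = c("Gold", "Silver", NA, "Bronze", "Gold"),
  stringsAsFactors = FALSE
)

# Size of the data frame: number of rows, then number of columns.
dim(toy)

# Two of the many ways to check for missing values.
anyNA(toy)            # TRUE if at least one value is missing
colSums(is.na(toy))   # number of missing values in each column

# na.omit() drops every row that contains at least one NA.
toy_complete <- na.omit(toy)
dim(toy_complete)

# Summary statistics on a numeric column.
mean(toy_complete$Age)
median(toy_complete$Age)
range(toy_complete$Age)   # smallest and largest value

# table() counts how often each category appears.
table(toy_complete$Sex)
```

With a real file, read.csv() would produce a data frame that can be explored in the same way.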
Part II: Data Visualization
1. Filter the dataset so that it only contains information for your particular Olympiad.
Student Olympiad assignments can be found in Blackboard, in the same folder
that contains this assignment prompt. Assign a new variable name to this
dataset that only contains your Olympiad. Show the code that you used to find
this out, along with a screenshot of your results.
2. Using ggplot, create a histogram that depicts the distribution of medal winners for
your Olympiad by age. Show the code that you used to accomplish this, along
with a screenshot of your results.
3. Now, modify your histogram by specifying a number of binwidths that you chose
(i.e. not the default number). Specify a color for the bins in your histogram, and
specify another color to use for the borders of the bins. Give your histogram a
descriptive title. Show the code that you used to accomplish this, along with a
screenshot of your results.
4. Imagine that your boss is a smart person, but has no idea what a histogram is --
how would you explain this plot to your boss? Write a one or two sentence
description of what your histogram shows you.
5. Which six NOCs received the greatest numbers of medals? Show the code that
you used to find this out, along with a screenshot of your results. Create a
filtered dataset that only contains medalists from these six NOCs. Show the code
that you used to accomplish this, along with a screenshot of your results.
6. Using ggplot, create a scatterplot that depicts the heights (on the x-axis) and the
weights (on the y-axis) of the athletes from the six NOCs with the most medals.
Give your plot a descriptive title. Show the code that you used to accomplish
this, along with a screenshot of your results. Write a one or two sentence
description of what this scatterplot shows you (again, explain it to your boss).
7. Now, add to the scatterplot that you just created by including a categorical
variable (gender). Show the code that you used to accomplish this, along with a
screenshot of your results. Write a one or two sentence description of what this
scatterplot shows you (again, explain it to your boss).
8. Include yet another categorical variable on your scatterplot -- NOC. Use shape
to represent NOC. Show the code that you used to accomplish this, along with a
screenshot of your results. Write one or two sentences about something that this
plot tells you (you don’t need to summarize the entire plot for this -- you can just
pick a couple data points and describe them here).
9. Again using ggplot, create a barplot that compares the total number of bronze,
silver, and gold medals among the top six NOCs. What do you notice about
these totals? If every Olympic competition generates one gold, one silver, and
one bronze, why might your bars be different heights? (Hint: think about how you
created this subset of the original dataset).
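For orientation only, here is a minimal sketch of the ggplot2 building blocks that Part II draws on, again using a made-up data frame called toy. The column names, the NOC codes (AAA, BBB, CCC), the colors, and the bin width are all assumptions chosen for illustration, and the sketch assumes the ggplot2 package is installed. It keeps the top two groups rather than six so that the toy data stays small.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Toy data; every value and code below is made up.
toy <- data.frame(
  Age    = c(21, 24, 24, 28, 30, 33),
  Height = c(160, 172, 181, 168, 190, 177),
  Weight = c(52, 65, 80, 60, 92, 74),
  Sex    = c("F", "F", "M", "F", "M", "M"),
  NOC    = c("AAA", "BBB", "AAA", "CCC", "BBB", "AAA"),
  Medal  = c("Gold", "Silver", "Bronze", "Gold", "Gold", "Silver"),
  stringsAsFactors = FALSE
)

# Count rows per group, sort from most to fewest, keep the top two names.
top_nocs <- names(sort(table(toy$NOC), decreasing = TRUE))[1:2]
toy_top  <- toy[toy$NOC %in% top_nocs, ]

# Histogram with a chosen bin width, a fill color, and a border color.
p_hist <- ggplot(toy, aes(x = Age)) +
  geom_histogram(binwidth = 3, fill = "steelblue", color = "black") +
  ggtitle("Toy example: distribution of ages")

# Scatterplot with two categorical variables mapped to color and shape.
p_scatter <- ggplot(toy_top,
                    aes(x = Height, y = Weight, color = Sex, shape = NOC)) +
  geom_point(size = 3) +
  ggtitle("Toy example: height vs. weight")

# Bar plot of row counts per group, split by a second category.
p_bar <- ggplot(toy_top, aes(x = NOC, fill = Medal)) +
  geom_bar(position = "dodge") +
  ggtitle("Toy example: counts by group and medal")

print(p_hist)
print(p_scatter)
print(p_bar)
```

Mapping a variable inside aes() lets ggplot2 pick a distinct color or shape for each category and draw the legend, whereas setting fill or color outside aes() applies one fixed value to the whole layer.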