Exploratory Report
Due Friday by 10pm Points 100 Submitting a text entry box
For this part of the final project, you will be creating a data report exploring a pair of data sets of your
choosing. Your report will introduce this data, provide some preliminary/summary information, and
then answer some data science questions about those data sets.
Creating this report will require you to apply all of the data analysis and programming skills you've
learned in the course so far. In addition, this project will also require you to use git as a
collaboration tool: you'll write the report together, on different computers using git to share code.
This is a group assignment! You will be working closely with your project group to complete it!
You will also be required to fill out a peer evaluation to
provide us with some data about how well your team collaborated.
Objectives
By completing this assignment you will practice and master the following skills:
Synthesizing skills, tools, and concepts from across the course
Performing a complete data analysis project, from data gathering to presentation
Working with a team of programmers to analyze data
Using git to manage code that is being modified by several people at the same time
Using git branches to organize code development and publish reports as web pages
Setup
For the project you will be working in the same repo you created for the Project Proposal. You will be
generating new files (script and R Markdown) and putting them in this repo.
Make sure that everyone in your group has Admin access
(https://help.github.com/articles/inviting-collaborators-to-a-personal-repository/) to this
repository, and that they have all cloned a copy to their local machines so that you can code
collaboratively.
Collaborating on the Report
One of the goals of the project is to practice collaborative coding. Each person will need to
contribute code to the solution. This means that everyone will need to write some of the code that
is utilized in the report (we will be checking commit history to ensure that everyone has contributed).
iSchool Canvas Support
2021/2/25 Exploratory Report
https://canvas.uw.edu/courses/1434910/assignments/5937467 2/8
The report you will produce has multiple sections; some may be easy to "divide up" and let each
person take the lead, but others will require you to work together. This is not intended to be a project
with individual parts, but a collaborative team effort and will be graded as such.
We highly encourage pair-programming (https://en.wikipedia.org/wiki/Pair_programming) (two
people working at one computer). Just make sure you switch off who is the "driver", and that both
partners commit changes to the code.
You are also required to fill out a peer evaluation (https://forms.gle/Xy1D8knKGWezg8s87) to help
us evaluate group collaboration. Scores on this assignment will be adjusted to reflect unequal
contributions (if any), as discussed in lecture.
The project will be graded for the entire group; thus the entire group is responsible for all of the
content in the report.
Using Feature Branches
All of your development work on this report must be done using feature branches. This means that
for each feature (i.e., section) of the report, you will need to checkout a new branch, develop your
code on that branch, and then merge those changes back into main . This is intended to help you
learn to work with branches, as well as to help organize the work.
Note that you can push and pull feature branches to GitHub (with git push origin branch_name ) in
order to share them, and merge commits between feature branches. Thus if two people want to
"combine" their code, they could merge their branches together, and then merge that work back into
the main branch.
Pro tip: If there is a merge conflict in the .html file generated by R Markdown, the easiest way to
resolve it is to fix any errors in the .Rmd file and then simply "re-knit" the output. This will create a
new .html file that will overwrite the previous one... thereby giving you a correctly "updated" file
that you can mark as resolving the conflict.
We will be looking for evidence that you successfully used such branching (e.g., commit messages
reflecting the branch merges, or branches that have been pushed to GitHub).
Report Structure
For this part of the project, you will be collaboratively producing a report that presents an exploratory
analysis of your chosen data sets.
Because this project is open-ended (and it's the final project), we will not necessarily give
explicit instructions about how to complete it, or even the order in which you should do the
work. Instead we will describe the overall requirements for your report, and it will be up to you
to determine how to meet those specifications.
Your report will be a web page created using R Markdown and knitr . Your report must be written in
a file called index.Rmd located in the root of your project repository (you can create this file using
the RStudio wizard).
Your report must include an appropriate title, as well as your names as the author (including your
group number, found on Canvas under "People > Groups"—search for your name to find your
group number).
Your data wrangling ( R code) must be written in separate .R script files, and you must use
source() to load those scripts into the R Markdown. You are welcome to use multiple script files
(e.g., one per report section or one per data set) to better organize your code. However, be sure that
you don't duplicate code between script files.
In particular, don't have multiple scripts read/load the same data file—you should only call
read.csv() once per data set for your entire knitted document. One way to do this is to load a
data file in the setup chunk, and then pass that data frame as an argument to functions defined in
the .R scripts.
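
A minimal sketch of that pattern, assuming a hypothetical script named scripts/summary.R and made-up column names (none of these come from the assignment). To stay self-contained, the sketch writes a tiny temporary CSV and defines the function inline; in a real project the function would live in the .R script and be loaded with source() .

```r
# --- would live in a script such as scripts/summary.R (hypothetical name) ---
# The function takes a data frame as an argument; it never reads a file itself.
mean_fare_by_city <- function(rides) {
  aggregate(fare ~ city, data = rides, FUN = mean)
}

# --- would live in the setup chunk of index.Rmd ---
# source("scripts/summary.R")   # load the function definitions once

# Stand-in for a real data file so this sketch runs on its own.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(city = c("A", "A", "B"), fare = c(10, 14, 9)),
  csv_path,
  row.names = FALSE
)

# Read the data exactly once, then hand the data frame to every function.
rides <- read.csv(csv_path, stringsAsFactors = FALSE)
fare_summary <- mean_fare_by_city(rides)
```

Because the script only defines functions, several report sections can reuse the same rides data frame without any of them calling read.csv() again.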
Although it may be tempting, do not try to do "one .R file per person". Your code should be
organized based on the data wrangling that occurs, not based on who the author is! You will need
to collaborate on the programming, with multiple people editing the same code files.
Your report will be divided into distinct sections. Each section will need to include an appropriate second-level
heading. Additionally, make sure each section/subsection has at least a sentence of text
introducing it—don't just launch into data tables or graphics!
Note that each section/subsection can and should contain a code chunk used specifically by that
section, but any "shared calculations" (such as library() and source() calls) should be done in
the report's initial setup chunk.
The sections your report needs to include are described below.
Section 1. Problem Domain Description
The first section of your report will include a short text introduction to the problem domain that your
data sets are related to. After reading this section, we should understand the "what" and "why" of your
data set. You will explain the context and justification for your analysis. What domain knowledge is
needed to understand your report?
You can (and should!) copy this directly from your project proposal; you don't necessarily need
to do any "new" work for this section.
This section should be a paragraph or two in length (around 150-250 words, depending on the
complexity of the domain). Note that this section doesn't need to contain any R code, though you will
need to include Markdown formatting when appropriate. Be sure to include hyperlinks to relevant
resources.
Section 2. Data Description
This next section of your report will provide descriptions and examples of your chosen data sets. You
are explaining what data you have found that will be able to answer your data science questions,
below. This section will need to include the following content (in whatever order you choose):
It's fine to use a different subsection for each data set if that makes it easier to read or write.
1. A non-technical description of the data sets you will be using (what is the data?) This only needs
to be a sentence or two.
2. An explanation of where the data comes from, who originally collected the data, and any other
information we may need to know about how this data set came to be. Make sure you include
hyperlinks to the source—we must be able to follow the links and find your data sets ourselves.
Again, this only needs to be a couple sentences, and you can borrow from your project proposal.
3. A sample of the data set, so that we can see what raw data you'll be working with ("the data set
looks like this"). This means that you'll need to load the data set into R (e.g., with read.csv() )
and present it as a table (or multiple tables) in your report. Do not include the entire table; just
a few rows are sufficient. Think about the "user experience" of reading the report!
You don't need to include all columns of your data frames; only including the most
important/relevant ones is acceptable.
You are not required to do substantive data cleaning (or even rename columns), though it
wouldn't hurt to do some of that wrangling now instead of later.
Remember to do your data wrangling in the .R script file, not in the R Markdown file!
You will need to provide samples of both of your two data sets (so at least 2 tables!)
4. A written explanation of the sample data's structure. In particular, explain what each of the
features (columns) represents. You don't need to explain columns like "year" or "country", but any
columns that don't make sense on their own need to be explained. Be sure to note units for
different values (e.g., if a value is in dollars or kg).
A Markdown list is a nice format for providing this information.
Overall, this section will explain the data that you will be analyzing—while also ensuring that you can
both access and understand that data!
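
One possible way to produce such a sample is sketched below. The data frame and its column names are invented for illustration; in the report the data frame would come from the single read.csv() call. The sketch falls back to plain printing when knitr is not installed.

```r
# Hypothetical stand-in for a data frame loaded once with read.csv().
rides <- data.frame(
  ride_id = 1:6,
  city = c("A", "A", "B", "B", "C", "C"),
  fare_usd = c(10.5, 14, 9, 22.25, 7.75, 12),
  internal_code = c("x1", "x2", "x3", "x4", "x5", "x6"),
  stringsAsFactors = FALSE
)

# Keep only the columns worth showing, and only the first few rows.
sample_columns <- c("city", "fare_usd")
rides_sample <- head(rides[, sample_columns], n = 3)

# In the report, knitr::kable() renders the sample as a formatted table.
# The check keeps this sketch runnable where knitr is not installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(rides_sample, col.names = c("City", "Fare (USD)")))
} else {
  print(rides_sample)
}
```

Readable column labels that include units (here "Fare (USD)") also cover part of the written explanation of the data's structure.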
Subsection 2.2 Summary Analysis
As part of your data description, your report will also provide an overview summary analysis of the
two data sets. This will provide a more quantitative description of the data, helping the reader to
understand the range and distribution of the data and providing a baseline for your analysis.
This will be a subsection (with a third-level heading) of your data description. Your summary analysis
needs to include the following for each data set:
1. Provide summary descriptive statistics and central tendency measures (e.g., mean() , max() , etc.)
for all the major features of interest from each of your data sets. The built-in summary() function
can be helpful for this, as can other summarization functions.
Note that you are not required to include descriptive statistics for every feature/column, only
for the ones that will be relevant or interesting for your analysis.
You can present these statistics in the form of a table, but you'll need a sentence or two to
explain them. You can also use a list or inline expressions to provide this information.
This is a good point to determine and explain how you will handle missing or NA
values.
2. Include at least 3 graphics or plots illustrating the distribution or trends of the data. For example,
you could use a histogram, box plot, or violin plot to show how the values in your data are spread
out. You could also use a line or bar chart to show changes in values over time.
Each graphic can visualize a single variable (similar to the examples in Ch 15.2.1
(https://learning.oreilly.com/library/view/programming-skills-for/9780135159071/ch15.xhtml#ch15_2_1)
), or you can include multiple measures in a single
chart. Smooth geometry (https://ggplot2.tidyverse.org/reference/geom_smooth.html) can
also be good for showing trends among messy data.
Again, you don't need to illustrate the distribution of all of the data, only the most important
features. The goal is to get a general sense of the data's trends to then inform your more
specific analysis.
ALL graphics in your report should be well-designed. This means they have good use
of aesthetic mapping, labeling, etc.
3. Identify any significant outliers in your data sets. These are values that are particularly high or low
or missing. You'll want to note the outliers because of how they may skew your analysis—a single
very high value may throw off averages, and missing data can hinder the analysis.
You'll need to use prose to note the outliers (using inline R expressions to dynamically report
the values of course). If there are no significant outliers, you should mention that—and explain
why this might be the case!
Overall, this section should be a couple (2-4) paragraphs in length, in addition to the required
graphics.
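
The three requirements above could be sketched in base R as follows. The values are made up, and the 1.5 * IQR rule used to flag outliers is only one common convention, not something the assignment prescribes; choose whatever definition suits your data and explain it in the report.

```r
# Hypothetical feature with one missing value and one unusually high value.
fare_usd <- c(10.5, 14, 9, 22.25, 7.75, 12, NA, 95)

# Decide explicitly how to treat NA: here they are counted, then excluded.
n_missing <- sum(is.na(fare_usd))
fare_mean <- mean(fare_usd, na.rm = TRUE)
fare_median <- median(fare_usd, na.rm = TRUE)
fare_range <- range(fare_usd, na.rm = TRUE)

# Flag outliers with the 1.5 * IQR rule (an assumed convention, see above).
quartiles <- unname(quantile(fare_usd, probs = c(0.25, 0.75), na.rm = TRUE))
iqr <- quartiles[2] - quartiles[1]
lower_bound <- quartiles[1] - 1.5 * iqr
upper_bound <- quartiles[2] + 1.5 * iqr
is_outlier <- !is.na(fare_usd) &
  (fare_usd < lower_bound | fare_usd > upper_bound)
outliers <- fare_usd[is_outlier]

# Histogram of the distribution, with a title and a labeled axis.
# pdf(NULL) keeps this sketch from writing a plot file; omit it in a chunk.
pdf(NULL)
hist(fare_usd, main = "Distribution of fares", xlab = "Fare (USD)")
invisible(dev.off())
```

Values such as n_missing and outliers can then be reported in the prose with inline R expressions rather than typed in by hand, so the text stays correct if the data changes.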
Section 3. Specific Question Analyses
This last section will provide answers to specific data science questions (such as the ones you posed
in your Project Proposal). You will perform the data wrangling/analysis needed, explain your process,
and write up the results (answers).
You will need at least one good question per group member. Try to limit your analysis to no more
than two questions per member—anything more than that either means your project scope will be
too big or that your questions are not interesting enough.
At least 40% of your questions need to involve both data sets, such as comparing features from
each.
Each question will be addressed in its own subsection (with a third-level heading). This will help keep
your analysis organized, though you are welcome to refer to other subsections as appropriate. Each
subsection will need to provide the following:
1. Introduce and explain the question if necessary. For example, you may need to define
terminology you use, or otherwise scope the question. If the question is "are Uber rates greater
than taxi rates", then you might explain what you mean by "rate" (given variable pricing structures)
or identify which type of rides you'll be considering. This might only be a couple of sentences.
2. An explanation of your data analysis method. This can be a brief written description of your data
wrangling steps (e.g., "I took the average Uber rate and compared it to the average taxi rate for
each hour of the day"). You can even output specific code if it helps explain your process! The
goal is to enable the reader to know how to repeat your analysis so they can check your work.
This doesn't need to be lengthy; a sentence or two may be enough (though complex analysis may
require a whole paragraph).
The vast majority of your programming and data wrangling work will happen at this step!
3. Provide the results of your analysis. Results will need to be both quantitative (e.g., numbers and
tables) and graphical.
The quantitative results will be the relevant data that your wrangling produces—for example,
you might include a data table of the average Uber and taxi rates that you found. We don't
need the entire "raw" data frame, but there should be a table of values you produced and
considered.
Calculated values such as correlations are also considered quantitative results.
The graphical results will be at least one plot illustrating the results of your data wrangling. For
most questions these plots should be straightforward to design—in the Uber example, it might
be a line chart comparing average rates over time, or a scatterplot showing the distribution of
driver pay rates.
Remember that at least 40% of your questions need to involve both data sets. That
means that they'll almost certainly require joining the data sets together for your
analysis.
4. Finally, you must include an evaluation of your results. An evaluation is an interpretation of your
results—the information, not just the data. Your evaluation should draw a conclusion from the
results—it is the answer to your question! As (mostly) made-up examples: "Uber rates are
cheaper than taxis only in urban areas" or "Uber does not pay drivers a livable wage".
Note that your evaluation cannot rely purely on visual or anecdotal analysis (no "the line goes
up!" or "the measure for one state looks large!") Be sure to use descriptive statistics (e.g.,
mean/median) or measures of effect strength (e.g., correlations or predictive statistics) to
definitively state relationships among your data. You do not need to perform advanced
statistical analysis—this is not a stats class!—but your conclusions need to be grounded in the
data, not in the representation.
It's quite likely that the results may not provide the answer you expected, and that's okay! In
your evaluation, you can mention that, and offer a guess as to why your assumptions didn't
hold up.
Overall, we are looking for evidence of critical thinking—that you have thought about the
results of your data wrangling in addition to the programming. Considering how the answer to
your question would influence action is a good way to push yourself to think critically.
Overall, this part should present specific insights, using the data (results) itself as evidence
of your claims.
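
As a rough illustration of a question that involves both data sets, the sketch below aggregates one hypothetical data set, joins it to a second on a shared key, and computes a correlation to support the evaluation. All names and values are invented, and base R's merge() is just one option; a dplyr join would serve equally well.

```r
# Two hypothetical data sets that share a "city" key column.
rides <- data.frame(
  city = c("A", "A", "B", "B", "C", "C"),
  fare_usd = c(10, 14, 9, 11, 20, 24)
)
cities <- data.frame(
  city = c("A", "B", "C"),
  median_income_k = c(50, 45, 80)
)

# Step 1: aggregate one data set to the level of the other (one row per city).
avg_fares <- aggregate(fare_usd ~ city, data = rides, FUN = mean)

# Step 2: join on the shared key (merge() does an inner join by default).
joined <- merge(avg_fares, cities, by = "city")

# Step 3: back the evaluation with a statistic, not just a chart.
fare_income_cor <- cor(joined$fare_usd, joined$median_income_k)
```

The joined data frame is the kind of small results table the report should show, and the correlation gives the written evaluation something firmer than "the line goes up". With only a few rows, as here, any correlation should be interpreted cautiously.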
Publishing Your Report
As partially described in the course text (https://learning.oreilly.com/library/view/programming-skills-for/9780135159071/ch18.xhtml#sec18_4)
(and further detailed in class), you should publish your
knitted report online by deploying it to GitHub Pages (https://help.github.com/articles/user-organization-and-project-pages/)
(as a Project Pages site).
You must publish to GitHub pages by pushing to the gh-pages branch on GitHub (Do not use one of
the other publishing methods). The cleanest way to do this is to create a new gh-pages branch
immediately off of your main branch once you're ready to publish. Be sure to push this new gh-pages
branch to the gh-pages branch on GitHub.
In order to "activate" GitHub pages, the first push to the gh-pages branch needs to be made by
someone with administrative privileges (not just write privileges)—so make sure all of your
teammates are Admins!
Once you've pushed your report to the gh-pages branch, you should then be able to visit
https://info201b-wi21.github.io/project-YOUR_GITHUB_USER_NAME/index.html
in order to view your report online (specifically, the generated index.html file). Note that sometimes it
takes a few minutes for GitHub pages to update with the new report, so be patient if it doesn't show
up initially. (You can check that the code is pushed correctly by checking the gh-pages branch on the
GitHub web portal; if the code is online but the page isn't showing, check in with us).
Be sure that any future changes are made on the main branch, not on gh-pages (we'll be
grading the code on main ). NEVER EDIT CODE ON THE gh-pages BRANCH. If you make
additional changes (on main ), you will need to then merge them into the gh-pages branch
(watching out for any merge conflicts). If you hit any problems, ask for help!
Submit Your Project Report
In order to submit your project report:
1. Confirm that you've successfully completed the report (e.g., that your code is able to generate a
report that includes all of the required information).
Please proofread your report! Make sure there aren't any half finished sentences or
egregious typos, and that overall it is cleanly formatted and readable. It should be in better
condition than these assignment write-ups! Moreover, since there are multiple people on
your team to read it over, you should definitely have caught and fixed any mistakes.
2. Your group must merge the final version of the code (including the index.html file and any .R
files) to the main branch and push code to the GitHub repository (to the main branch).
3. Publish the report to GitHub Pages (push to the gh-pages branch). Make sure you publish the
final version of the knitted report!
4. Make sure that each group member has filled out the peer evaluation
(https://forms.gle/Xy1D8knKGWezg8s87) . All team members must complete this!
5. Submit the URL of your GitHub Repository AS WELL AS the URL of your published report (two
links!) as your assignment submission on Canvas (this page, at the top).
Since this is a group project, only one person needs to submit the link via Canvas.