BUSS6002 2020S1 1
BUSS6002 Group Assignment
Due Date: Wednesday 3 June 2020
Value: 25% of the total mark
Rationale
This group assignment has been designed to allow students to apply their data science
skills on a real-world problem in business domains, as well as to help students develop
collaborative skills when working in a team.
Instructions
1. Required submission items via Canvas:
1. ONE written report (PDF format).
• Assignments > Report Submission
2. ONE Jupyter Notebook .ipynb
• Assignments > Upload Your Code File
3. ONE csv file of test results
• Assignments > Submit Your Test Results
2. The assignment is due at 17:00 on Wednesday, 3 June 2020 AEST. The late
penalty for the assignment is 5% of the assigned mark per day, starting after
17:00 on the due date. The closing date, Wednesday, 10 June 2020, 17:00
AEST, is the last date on which an assessment will be accepted for marking.
3. As per anonymous marking policy, please include the Group ID and Student IDs of
all group members. Do NOT include names. The name of the report and code file
must follow: GroupID_BUSS6002_2020S1, and the name of test results must
follow: GroupID_Test_Results.csv.
4. Your analyses and answers should be provided as a final report that gives full
explanation and interpretation of any results you obtain. Output without
explanation will receive zero marks. You are required to also submit your code that
can reproduce your reported results, as reproducibility is a key component to data
science. Not submitting your code will lead to a loss of 50% of the mark.
5. Be warned that plagiarism between individuals is always obvious to the markers of
the assignment and can be easily detected by Turnitin.
6. Presentation of the assignment is part of the assignment. 10% of the marks are
allocated to the presentation of your final report and/or code.
7. Numbers with decimals should be reported to three decimal places.
Meeting Minutes and Peer Review
1. Each group is required to submit at least 3 sets of meeting minutes as an appendix
attached to the final report. A template will be provided for preparing meeting
minutes. You may use the template provided or a template of your choice.
2. We may ask for peer review from each student within a group. The instructions
about how to do this will be released later.
3. Each group will be awarded a group mark as per the marking criteria. Individual
adjustments to grades may be made if there is a dispute in a group or if the
quality or quantity of contributions differs significantly between individuals. In
such a case, the unit coordinator will seek meeting minutes and peer review reports
from individuals within a group to decide on individual marks.
4. If you encounter any issues with your group members, please report and discuss
them with your unit coordinator as early as possible.
Group Competition
A competition will be run among groups to rank the performance of your models on the
test data provided. The top 5 groups will be awarded bonus marks on top of their
overall assignment mark: the top 3 groups will receive an extra 5 marks, and the 4th and 5th
groups will receive an extra 3 marks.
Project Description and Dataset
Nowadays, e-commerce has revolutionized the way companies do business and consumers
make purchasing decisions. It has become common practice for consumers to use online
reviews to inform their decision making and give opinions about their buying experience.
Companies and individuals are increasingly using such data to better understand their
audience and make better decisions. Through analyzing consumer opinions towards their
products, companies can develop comprehensive insights into customers’ experience, and
use this to improve their offering, build a better brand and improve their business.
Individual consumers can check the opinions of existing users of a product to help them
make wiser purchase decisions.
Suppose you are now working in a Data Science Team for an online clothing retailer. The
company has noticed a recent decline in their net promoter score which measures the share
of customers who would recommend the company to a friend or colleague. Management
suspects that this is the result of a recent change in their procurement strategy for some of
their departments, and they have tasked you with understanding what customers think about the
current collection. To facilitate this, you have been provided with a dataset that consists of
detailed product descriptions and classifications of recently sold items and the reviews
written by customers. Your team is tasked to analyze this dataset and report your findings
to assist the company in improving its appeal to consumers, with the following research
objectives:
• Describe how recommendation and rating patterns are affected across departments
and product types.
• Understand the shopping behavior of consumers and assess how age would affect
the buying and reviewing behavior.
• Conduct an analysis and build a predictive model to understand what influences a
customer’s decision to recommend a product.
There are two data files provided: product_train.csv and product_test.csv.
Only product_train.csv contains the target variable: Recommended, where 1
indicates that the customer recommends the product and 0 indicates he/she does not
recommend the product. The details of the features presented in the above datasets are
given in dictionary.csv. As it may not be feasible to directly use some of these
features (in particular, reviews represented as raw text) to build a model, one of your tasks
is to carefully extract or construct meaningful features as input to your analysis.
Tasks
Data Understanding: Conduct a thorough EDA to gain a better understanding of the
given data and business objectives. This includes, but is not limited to: checking for and
dealing with missing data and outliers, if any; the most popular items sold and their
characteristics; recommendation and rating patterns across departments and product
types; and the buying and reviewing behavior of different age groups. Carefully present
your analysis and findings in your report.
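As a rough illustration of the checks listed above, the following sketch uses pandas on a small made-up table. The column names Department, Age and Rating, the toy values, and the age-band boundaries are assumptions for illustration only; the actual feature names are defined in dictionary.csv.

```python
import pandas as pd

# Toy stand-in for product_train.csv. Column names other than Recommended
# are hypothetical; check dictionary.csv for the real ones.
train = pd.DataFrame({
    "Department": ["Tops", "Tops", "Dresses", "Dresses", "Bottoms", None],
    "Age": [23, 41, 35, 67, 29, 52],
    "Rating": [5, 2, 4, 5, 3, 1],
    "Recommended": [1, 0, 1, 1, 1, 0],
})

# Count missing values per column.
missing_counts = train.isna().sum()

# Recommendation rate and mean rating per department, to three decimal places.
# Rows with a missing Department are dropped by groupby by default.
by_department = (
    train.groupby("Department")
    .agg(
        recommend_rate=("Recommended", "mean"),
        mean_rating=("Rating", "mean"),
        n_reviews=("Recommended", "size"),
    )
    .round(3)
)

# Recommendation rate per age band (band boundaries are arbitrary here).
train["AgeGroup"] = pd.cut(
    train["Age"], bins=[0, 30, 50, 120], labels=["<=30", "31-50", "51+"]
)
by_age = train.groupby("AgeGroup", observed=True)["Recommended"].mean().round(3)

print(missing_counts)
print(by_department)
print(by_age)
```

The same grouping pattern can be repeated for product types once the relevant column has been identified.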
Build a Benchmark Model to Predict Recommendation: Build a simple logistic
regression model to assess the feasibility of recommendation prediction and establish a
baseline model. For this task, you are required to build your baseline model using bag of
words of the review text only. Use scikit-learn’s logistic regression model with “solver”
set to ‘liblinear’ and all other parameters set to default. Use scikit-learn’s CountVectorizer
with “max_features” set to 500 and all other parameters set to default. You need to choose
appropriate evaluation metrics and model evaluation strategies to validate your model.
Present your analysis and discuss your findings.
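The configuration described above might be set up roughly as follows. This is a minimal sketch on made-up review text, not the actual dataset; stratified 3-fold cross-validation and ROC AUC are used only as one possible evaluation choice, and selecting and justifying the metric remains part of the task.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in for the review text and the Recommended labels.
reviews = [
    "love this dress fits perfectly",
    "great quality and very comfortable",
    "beautiful colour would buy again",
    "perfect fit and soft fabric",
    "so flattering and well made",
    "lovely top great for summer",
    "poor quality fell apart quickly",
    "too small and itchy fabric",
    "cheap material very disappointed",
    "returned it the fit was awful",
    "looks nothing like the photo",
    "runs large and unflattering shape",
]
labels = np.array([1] * 6 + [0] * 6)

# Bag of words limited to 500 features, then liblinear logistic regression;
# every other parameter is left at its default.
benchmark = make_pipeline(
    CountVectorizer(max_features=500),
    LogisticRegression(solver="liblinear"),
)

# Keeping the vectorizer inside the pipeline means the vocabulary is learned
# from the training folds only, which avoids leaking test-fold information.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(benchmark, reviews, labels, cv=cv, scoring="roc_auc")
print(f"Mean ROC AUC across folds: {scores.mean():.3f}")
```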
Improving Your Benchmark Model: You are required to make attempts to improve the
performance of your benchmark model as much as you can. You should consider using
more advanced feature engineering techniques and adding extra features to rebuild your
model. Your decisions should be justified based on the evidence from the data
and accompanied by detailed explanation. You must properly validate your model and
optimize appropriate hyperparameters that apply. Simply building a model without any
consideration of validation and optimisation does not meet the minimum requirements.
You should demonstrate evidence of your efforts and you will be assessed based on the
depth of your exploration. Provide a summary of what has worked and what has not.
Report on your improved models and make comparisons with the benchmark model.
Note: You must use logistic regression and no other models are allowed for this task.
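One way to combine validation and hyperparameter optimisation while staying with logistic regression is a cross-validated grid search, sketched below on made-up review text. The use of TF-IDF weighting, the n-gram ranges, and the candidate values of C are illustrative assumptions, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy stand-in for the review text and the Recommended labels.
reviews = [
    "love this dress fits perfectly",
    "great quality and very comfortable",
    "beautiful colour would buy again",
    "perfect fit and soft fabric",
    "so flattering and well made",
    "lovely top great for summer",
    "poor quality fell apart quickly",
    "too small and itchy fabric",
    "cheap material very disappointed",
    "returned it the fit was awful",
    "looks nothing like the photo",
    "runs large and unflattering shape",
]
labels = [1] * 6 + [0] * 6

# The model is still logistic regression; only the features and the
# regularisation strength are varied.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Illustrative search space: unigrams vs. unigrams plus bigrams, and three
# values of the inverse regularisation strength C.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="roc_auc",
)
search.fit(reviews, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated ROC AUC:", round(search.best_score_, 3))
```

The cross-validated score from the search can then be compared with the benchmark's score under the same folds and metric.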
Interpreting Results: Decide on your best model and provide analysis and interpretation
of its behavior. For example, you may report on the features associated with
positive/negative recommendation. For your interpretation, you should focus on
identifying general rules that might be useful for the company to improve its business in
the future.
Final Test Results: Finally, apply your best model to the test data. You are asked to
report the classification results on the test data. Save your results into a csv file containing
two columns, one for the Review Index (ID from product_test.csv) and the other
column Recommended for the predicted labels (1’s or 0’s). An example file of test results
test_results_example.csv is also provided. Name your file as
GroupID_Test_Results.csv. The results on the test data will be assessed to decide your
group performance among the entire class (group competition!).
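Writing the results file could look roughly like the sketch below. The IDs and predictions are placeholders, and the exact column headers are an assumption based on the description above; they should be checked against test_results_example.csv.

```python
import pandas as pd

# Placeholder values: in practice the IDs come from product_test.csv and
# the labels from the chosen model's predict() method.
test_ids = [101, 102, 103]
predictions = [1, 0, 1]

# Column headers are assumed; confirm them against test_results_example.csv.
results = pd.DataFrame({"Review Index": test_ids, "Recommended": predictions})
results["Recommended"] = results["Recommended"].astype(int)

# Replace "GroupID" with the group's actual ID.
output_name = "GroupID_Test_Results.csv"
results.to_csv(output_name, index=False)  # index=False avoids an extra column
```

Reading the file back with pd.read_csv before submitting is a quick way to confirm that it has exactly two columns and one row per test review.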
Presentation
• The assignment material to be submitted will consist of a final report that:
1) Takes a research article form in which you shall have a number of sections
such as introduction, methodology, experiment results,
findings/interpretation, and conclusion. All references should be properly
cited and take a full bibliographical format. Here are a few examples:
http://cs229.stanford.edu/proj2015/007_report.pdf
http://cs229.stanford.edu/proj2015/188_report.pdf
http://cs229.stanford.edu/proj2015/031_report.pdf
2) Details ALL steps and decisions taken by the group regarding the requirements
above.
3) Demonstrates an understanding of the problem being addressed and the
relevant principles of data science techniques used.
4) Clearly and appropriately presents any relevant graphs and tables.
• The report should be NO more than 20 pages with font size no smaller than
11pt, including everything like text, figures, tables, small sections of inserted code,
etc., but excluding the cover page and the appendix containing the meeting
minutes. Think about the best and most structured way to present your work,
summarise the procedures implemented, support your results/findings and prove
the originality of your work.
• Your code submission has no length limit; however, make sure your code is as
concise as possible and add comments where necessary to explain the functionality
of your code segments.
• Your group is required to submit at least 3 sets of meeting minutes. Your group may use
the provided template for preparing meeting minutes. Documentation should
include attendance, discussion points, actions decided, etc. You may use your own
form or find something online.
• You, as a member of a group, may also be required to submit your peer review.
Please use the provided criteria sheet for this purpose. You will be advised how to
use an online form when it becomes available.