Informatics MSc Programme Area
Henley Business School
University of Reading
Assessed Coursework Set Front Page
Module code: INMR77
Module name: Business Intelligence and Data Mining
Lecturer responsible: Dr Yin Leng Tan
Work to be handed in by:
Full time students: 26 May 2020
Part time students: 15 June 2020
Assignment Specification
The module is assessed 100% through this coursework assignment.
The aim of this coursework is to assess your understanding of business intelligence and ability
to perform data mining tasks by applying concepts, methods and techniques learned during
the lectures and practical sessions.
The coursework is carried out individually. Students are required to produce an individual
report for the tasks as set out below. The complete report should not exceed 20 pages of A4
(with a variation of 20%) with a minimum font size of 10, including tables and diagrams but
excluding references and appendices. An appendix can be used to include more detailed
materials to back up main body points but will not be assessed. In addition, you are also
required to submit the supplementary materials of your output from SAS Enterprise Miner via
Blackboard by the specified deadline.
Case Study - Airbnb and Inside Airbnb
Airbnb - Holiday Lets, Homes, Experiences & Places (airbnb.co.uk)
Airbnb is an online marketplace for arranging or offering lodging i.e. temporary
accommodation, primarily homestays, or tourism experiences. It was founded in August 2008
and has 12,736 employees as of 2019.
Service overview: Airbnb provides a platform for hosts to accommodate guests with short-term
lodging and tourism-related activities. Guests can search for accommodation using filters such
as location, price, and specific types of homes. Before booking, users must provide personal
and payment information. Some hosts also require a scan of government-issued identification
before accepting a reservation. Hosts provide prices and other details for their rental or listing
e.g. number of guests included in the price, type of property, type of room, number of
bathrooms, number of bedrooms, number of beds and type of bed, minimum number of nights
for a reservation, and amenities. In addition, Airbnb also provides a review system where hosts
and guests can leave reviews about their experience, and rate each other after a stay. By
October 2019, two million people were staying with Airbnb each night.
Cancellation policy: Airbnb allows hosts to choose between five types of cancellation policies,
made to protect both hosts and guests. Options include: strict_14_with_grace_period,
moderate, flexible, super_strict_30, super_strict_60.
(see https://www.airbnb.co.uk/home/cancellation_policies for the definition of each category)
Security Deposits: some reservations include a security deposit, which can be required by either
Airbnb or the host. This helps build trust for both guests and hosts. Some hosts require a
security deposit for their listing. If you are a guest and you are booking a listing with a host with
host-required security deposit, you will be shown the amount before you make your
reservation. The amount is set by the host, not Airbnb. In this case, no authorisation hold will
be placed, and you will only be charged if a host makes a claim on the security deposit.
(see https://www.airbnb.co.uk/help/article/140/how-does-airbnb-handle-security-deposits)
Sources: Wikipedia, Airbnb.co.uk
For further information on Airbnb, please visit: https://www.airbnb.co.uk/
Inside Airbnb – adding data to the debate (http://insideairbnb.com/index.html)
Inside Airbnb is an independent, non-commercial set of tools and data that allows an individual
to explore how Airbnb is really used in cities around the world. It was set up by Murray Cox and
John Morris in 2016.
Airbnb claims to be part of the “sharing economy” and disrupting the hotel industry. However,
data shows that the majority of Airbnb listings in most cities are entire homes, many of which
are rented all year round – disrupting housing and communities. For example, local residents
and governments are more concerned with people who are not present when the rental takes
place and those who have multiple listings on the site, as opposed to a user who is renting a
spare room.
By analysing publicly available information about a city’s Airbnb listings, Inside Airbnb
provides filters and key metrics so users can see how Airbnb is being used to compete with the
residential housing market. With Inside Airbnb, users can ask fundamental questions about
Airbnb in any neighbourhood, or across the city as a whole, such as:
• how many listings are in my neighbourhood and where are they?
• how many houses and apartments are being rented out frequently to tourists and not
to long-term residents?
• how much are hosts making from renting to tourists (compare that to long-term
rentals)?
• which hosts are running a business with multiple listings and where are they?
These questions (and the answers) get to the core of the debate for many cities around the
world, with Airbnb claiming that their hosts only occasionally rent the homes in which they live.
In addition, city or state legislation and ordinances that address residential housing, short-term
or vacation rentals, and zoning usually make reference to allowed use, including:
• how many nights a dwelling is rented per year
• minimum nights stay
• whether the host is present
• how many rooms are being rented in a building
• the number of occupants allowed in a rental
• whether the listing is licensed
The Inside Airbnb tool or data can be used to answer some of these questions. Some
understanding of how the Airbnb platform is being used will help clear up the laws as they
change.
Source: insideairbnb.com
For further information on Inside Airbnb, please visit: http://insideairbnb.com/index.html
Airbnb in Greater Manchester, UK
Dataset: Airbnb_man_reduced.csv (available to download on Blackboard). Two additional
datasets, man_reviews.csv and man_calander.csv, are also provided for information only.
Description of the dataset: The Airbnb data for Greater Manchester is made available by Inside
Airbnb. The original data set was downloaded from the website in November 2019. The
number of variables however is reduced from the original data set. There are 4,848 listings in
the data set with a total of 57 variables. Each row represents a single listing and contains
information about the host of the property, the property’s characteristics and overall rating of
the property, and its associated features by guests. Table 1 shows the name, description, and
type of the 57 variables.
Table 1: variable name and description of the variable for the dataset.
# | Variable Name | Description | Variable Type
1. | listing_id | Unique identifier for each Airbnb listing | Numeric
2. | listing_url | URL of the listing | Text
3. | description | Description of the listing | Text
4. | house_rule | Description of house rules | Text
5. | host_id | Unique identifier of the host | Numeric
6. | host_url | URL of the host | Text
7. | host_name | Name of the host | Text
8. | host_since | Date since which the host has been a member | Date
9. | host_about | Description of the host | Text
10. | host_response_time | How quickly the host responds to inquiries. 5 categories: within a day, within an hour, a few days or more, within a few hours, N/A | Categorical
11. | host_response_rate | Rate at which the host responded to inquiries (percentage value) | Numeric
12. | host_is_superhost | Whether the host is a superhost (1 = Yes, 0 = No) | Binary
13. | host_identity_verified | Whether the host is verified or not (1 = Yes, 0 = No) | Binary
14. | neighbourhood_cleased | Name of the neighbourhood (41 categories) | Categorical
15. | borough | Name of the borough (10 categories) | Categorical
16. | property_type | Type of the property (30 categories) | Categorical
17. | room_type | Type of the room. 4 categories: Entire home/apt, Private room, shared room, hotel room | Categorical
18. | accomodates | Number of people that can be accommodated | Numeric
19. | bathrooms | Number of bathrooms | Numeric
20. | bedrooms | Number of bedrooms | Numeric
21. | beds | Number of beds | Numeric
22. | bed_type | Type of bed. 6 categories: Real Bed, Pull-out Sofa, Futon, Couch, Airbed | Categorical
23. | amenities | List of amenities included | Text
24. | price | Price per night (in GBP) | Numeric
25. | weekly_price | Price per week (in GBP) | Numeric
26. | monthly_price | Price per month (in GBP) | Numeric
27. | Security_deposit | Amount of host-required security deposit | Numeric
28. | cleaning_fee | One-time fee charged by the host to cover the cost of cleaning their space | Numeric
29. | guest_included | Number of guests included in the price | Numeric
30. | extra_people | Additional charge per person (GBP) | Numeric
31. | minimum_nights | Minimum number of nights for a reservation | Numeric
32. | maximum_nights | Maximum number of nights for a reservation | Numeric
33. | calendar_updated | Calendar last updated by the host (70 categories) | Categorical
34. | has_availability | Whether the host has availability or not (1 = Yes, 0 = No) | Binary
35. | availability_30 | Number of days available for the next 30 days | Numeric
36. | availability_60 | Number of days available for the next 60 days | Numeric
37. | availability_90 | Number of days available for the next 90 days | Numeric
38. | availability_365 | Number of days available for the next 365 days | Numeric
39. | number_reviews | Number of reviews in total | Numeric
40. | first_review | Date of first review | Date/Time
41. | last_review | Date of last review | Date/Time
42. | review_scores_rating | Overall rating of the property (percentage value) | Numeric
43. | review_scores_accuracy | Rating for the accuracy of the description | Numeric
44. | review_scores_cleanliness | Rating for the cleanliness of the property | Numeric
45. | review_scores_checkin | Rating for the check-in experience | Numeric
46. | review_scores_communication | Rating for the host's communication with guests | Numeric
47. | review_scores_location | Rating for the location of the property | Numeric
48. | review_scores_value | Rating for the value of the property | Numeric
49. | instant_bookable | Whether the property can be booked instantly (1 = Yes, 0 = No) | Binary
50. | cancellation_policy | The cancellation policy for the host. 5 categories: strict_14_with_grace_period, moderate, flexible, super_strict_30, super_strict_60 | Categorical
51. | require_guest_profile_picture | Whether a guest profile picture is required or not (1 = Yes, 0 = No) | Binary
52. | require_guest_phone_verification | Whether guest phone verification is required or not (1 = Yes, 0 = No) | Binary
53. | host_listings_count | The number of listings of the host | Numeric
54. | host_listings_count_entire_homes | The number of the host's listings that are entire homes | Numeric
55. | host_listings_count_private_rooms | The number of the host's listings that are private rooms | Numeric
56. | host_listings_count_shared_rooms | The number of the host's listings that are shared rooms | Numeric
57. | reviews_per_month | Number of reviews per month for the property | Numeric
The local government and residents would like to know how Airbnb is used in the region and
seek your help on this. They would particularly like to know how many of the listings/hosts are
offering genuine lodging (i.e. temporary accommodation, primarily homestays, or tourism
experiences) rather than running a business, as opposed to hosts offering long-term lets across
multiple listings with no owner present (likely to be running a business), which could be illegal.
Your goals are to:
a) identify clusters of listings based on different sets (or a combination) of variables, e.g.
host characteristics, listing/property characteristics and availability, and reviews
from guests, so as to provide insights to the local government and residents.
Note: There are many measurements that could be used to differentiate the two, e.g. single
listing vs multiple listings, although a host may list separate rooms in the same
apartment, or multiple apartments or entire homes. Availability is another measure,
as is occupancy. You are asked to justify the variables/measurements used for your
clustering tasks. Greater Manchester uses the following parameters for the
measurements:
• a high availability metric and filter of 60 days per year
• a frequently rented filter of 60 days per year
• a review rate of 50% for the number of guests making a booking who leave a
review
• an average booking of 3 nights unless a higher minimum nights is configured
for a listing
• a maximum occupancy rate of 70% to ensure the occupancy model does not
produce artificially high results based on the available data (see
http://insideairbnb.com/greater-manchester/?neighbourhood=&filterEntireHomes=false&filterHighlyAvailable=false&filterRecentReviews=false&filterMultiListings=false)
b) select what you think is the best segmentation/clustering based on the results obtained
in a) and comment on its characteristics, e.g. clusters that best separate listings that
are genuine lodging from those that could be illegal, i.e. running as a business.
c) develop a classification model to identify which listings/hosts are genuine and which could
be considered illegal, based on your results obtained in b).
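The Greater Manchester parameters listed under goal a) can be combined into a rough occupancy estimate per listing. The following is a minimal Python sketch for illustration only; it is not part of the required SAS Enterprise Miner workflow. It assumes the field names reviews_per_month, minimum_nights, availability_365 and host_listings_count from Table 1. The `Listing` class, the function names, the way the parameters are combined, and the use of strict "greater than" thresholds are all assumptions made for this example.

```python
from dataclasses import dataclass

# Parameters quoted in the brief for Greater Manchester.
REVIEW_RATE = 0.50             # share of bookings assumed to leave a review
AVG_STAY_NIGHTS = 3            # assumed average booking length in nights
MAX_OCCUPANCY = 0.70           # cap on the estimated occupancy rate
HIGH_AVAILABILITY_DAYS = 60    # "highly available" threshold, days per year
FREQUENTLY_RENTED_NIGHTS = 60  # "frequently rented" threshold, nights per year


@dataclass
class Listing:
    """Hypothetical container holding four of the Table 1 variables."""
    reviews_per_month: float
    minimum_nights: int
    availability_365: int
    host_listings_count: int


def estimated_nights_per_year(listing):
    """Estimate booked nights per year from the review activity (assumed form)."""
    bookings_per_year = listing.reviews_per_month * 12 / REVIEW_RATE
    # Use the average stay unless the host enforces a longer minimum stay.
    stay = max(AVG_STAY_NIGHTS, listing.minimum_nights)
    # Cap the estimate at the maximum occupancy rate.
    return min(bookings_per_year * stay, MAX_OCCUPANCY * 365)


def flags(listing):
    """Derive simple indicators that could feed a clustering task."""
    return {
        "highly_available": listing.availability_365 > HIGH_AVAILABILITY_DAYS,
        "frequently_rented": estimated_nights_per_year(listing) > FREQUENTLY_RENTED_NIGHTS,
        "multi_listing_host": listing.host_listings_count > 1,
    }


# Example with made-up values: an occasionally let spare room.
spare_room = Listing(reviews_per_month=0.5, minimum_nights=2,
                     availability_365=30, host_listings_count=1)
print(estimated_nights_per_year(spare_room), flags(spare_room))
```

Derived indicators of this kind are only one possible input to the clustering; whichever variables are used still need to be justified in the report.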
Useful information/websites:
• Clampet (2014) Airbnb in NYC: The Real Numbers Behind the Sharing Story – available
at https://skift.com/2014/02/13/airbnb-in-nyc-the-real-numbers-behind-the-sharing-story/
• Inside Airbnb http://insideairbnb.com/index.html
What to deliver in the final report:
Your report should include the following sections:
1. Introduction: This should include background of Airbnb and Inside Airbnb,
opportunities and challenges of the sharing economy to the business (Airbnb), home
owners (hosts), local residents and governments, and guests/tourists, and how
business intelligence and data mining could be used to address the opportunities and
challenges for the various stakeholders. It should also outline how the report is
structured. Justify your answer with examples/data and findings from literature and
related work in this area.
2. Model building and Results Discussion
a) Identify clusters of listings
In this section, you should discuss the purpose of the data mining tasks, the data
mining process, including data exploration and data preparation/preprocessing,
and approaches taken e.g. variables used for the clustering. You are expected to
justify and discuss any action/decision you made during the data mining process
and model building, and make references to your output in SAS Enterprise Miner
within your report where necessary.
Note: In deciding what k to use (and also how many variables to include), the
following factors should be considered: How distinct are the clusters? Is good
separation achieved? How consistent are they? If cluster #1 shows low values on
one measure, does it also show low values on other measures? How simple are they
to describe? Simple clusters are more interpretable by domain knowledge experts,
easier to take action on, and are more likely to be statistically stable and not the
result of random chance.
b) Discuss what is the best segmentation/clustering based on the results obtained
from the process in a). You should discuss what you think is the best
segmentation and comment on the characteristics of these clusters. Consider how
this information could be used by local government and residents. Use
screenshots and/or make references to your output in SAS Enterprise Miner to
illustrate important and interesting findings where necessary.
c) Develop a classification model that classifies the data into these segments.
In this section, you should discuss the purpose of the data mining, including the
target segment/cluster, the data mining process, including data
preparation/preprocessing, and rationale and approaches taken e.g. variables used
for the model building. You are expected to justify and discuss any action/decision
you made during the data mining process and model building, as well as model
evaluation, and make references to your output in SAS Enterprise Miner within your
report where necessary.
3. Conclusion, critical evaluation and suggestions for improvement
In this section, you are required to conclude and provide a summary of your key
findings, and discuss the limitations of your data models/mining/analyses and
suggestions for improvement, taking into consideration current research issues in
data mining.
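The note in section 2 a) asks how distinct and well separated the clusters are when deciding what k to use. One common way to quantify separation is the silhouette coefficient, which the brief does not prescribe; it is shown here only as a minimal pure-Python sketch under that assumption, and it does not replace the cluster statistics reported by SAS Enterprise Miner. The function name and the toy data are hypothetical, and real variables would normally be standardised before distances are computed.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (ranges from -1 to 1)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is excluded via len(own) - 1).
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return mean(scores)


# Toy data: two tight groups of points that lie far apart.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette_score(points, [0, 0, 1, 1]))  # labels that follow the groups
print(silhouette_score(points, [0, 1, 0, 1]))  # labels that cut across the groups
```

A score close to 1 indicates compact, well-separated clusters, while a score near or below 0 suggests overlapping or poorly assigned clusters. Comparing the score across candidate values of k is one way to support the choice, alongside the interpretability questions raised in the note.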
The criteria used for grading the assignment:
Aspects/Criteria | % Range | Descriptors
Introduction (ILO1, ILO3, ILO5) | 70% and above | A highly effective introduction, setting context and indicating content that will follow. Wide background reading; novel examples and use of relevant literature/sources in supporting the arguments/viewpoints.
 | 60-69% | A very good introduction, setting context and indicating content that will follow. Good background reading; generally very good use of examples and relevant sources/literature in supporting the arguments/viewpoints.
 | 50-59% | Adequate introduction incorporating one or more of the above, yet lacking in clarity in some area(s). Good use of examples and sources/literature in supporting the arguments/viewpoints.
 | 49% and below | A basic introduction with a narrow or limited reference to defining the area, setting the context and indicating content that will follow. Little evidence of appropriate reading or ability to synthesise information. Few or no examples given.
Model Building, Results Discussion and Model Evaluation (ILO2, ILO3, ILO4, ILO6) | 70% and above | Novelty and originality. A coherent, well-focused, original approach to the model building, entirely relevant to the tasks, with excellent support and justification for the variables and techniques used for the modelling. Excellent discussion and interpretation of the obtained results/analysis with original insights. Excellent model evaluations and comparisons provided with clear evidence of critical analysis of findings.
 | 60-69% | A generally clear and coherent discussion with good support or justification for the model building, which is directly relevant to the tasks. Clear rationale for the approaches taken. Very good discussion and interpretation of the obtained results/analysis. Very good model evaluations and comparisons provided with some critical analysis of findings.
 | 50-59% | Reasonable attempt at the modelling but prone to being descriptive or narrative; little rationale for the approaches taken or justification of the variables used. Generally relevant to the stated tasks. Reasonable discussion and interpretation of the obtained results/analysis. Reasonable discussion of model evaluations and comparisons though with little evidence of critical analysis of findings.
 | 49% and below | Little discussion and evidence of model building. Failure to understand the purpose of the task. Little discussion and interpretation of the obtained results/analysis. Little or no discussion of model evaluations and comparisons.
Conclusion, critical evaluation and future improvements (ILO1, ILO5 and ILO6) | 70% and above | Comprehensive and extremely well discussed with original insights drawing from the analyses conducted and suggestions for future improvements.
 | 60-69% | Very well discussed with interesting insight, drawing from the results/analyses conducted. Very good critical evaluation and suggestions for future improvement.
 | 50-59% | Reasonably discussed but prone to being descriptive with little critical analysis based on the results/analyses conducted. Generally relevant to the stated tasks. Some critical analysis but prone to being descriptive or narrative; evidence supports the conclusion, but not always very directly/clearly. The question is not fully addressed.
 | 49% and below | Largely descriptive. The discussion is limited in scope and/or relevance. The question is only partially addressed.