TERM 3 2023
COMM1190: DATA, INSIGHTS, AND DECISIONS
FINAL EXAMINATION
QUESTION 1 30 MARKS
PART A – Data Communication 20 MARKS
You are working as a junior data analyst at CoinDesk, a financial service provider offering its customers data visualisation tools to monitor and compare cryptocurrencies. Your manager has asked you to write a business report that compares two of the major cryptocurrencies, BTC and ETH, unveils insights, and suggests actionable recommendations to stakeholders on how to invest effectively in cryptocurrencies. You generated the chart in Figure 1 to include in your report:
Figure 1. Comparison of BTC and ETH prices over time
You have received feedback from your manager mentioning that there are some issues in your visualisation that might mislead the stakeholders in their decision to invest. According to your manager, the visualisation suggests that BTC and ETH prices have evolved similarly over the last couple of weeks, which could mislead stakeholders into thinking that both cryptocurrencies generate a similar level of return. You decide to go back to the drawing board and figure out how to improve your visualisation.
Required:
Regarding the above case, please answer all the following questions:
a) Critically evaluate the visualisation in Figure 1 and identify two aspects you could improve to better address your manager’s feedback above. [max 200 words] (5 marks)
Your manager asked you to suggest more effective visualisations with a focus on (i) the evolution of prices, (ii) the Market Cap of each cryptocurrency, and (iii) whether the spikes/drops in prices are associated with positive or negative news (based on social media sentiment). You decide to generate three charts to address your manager’s requests.
b) Using the four chart types framework (conceptual vs data-driven / declarative vs exploratory), identify the type of chart that you would use to address each of the manager’s requests and justify your answers. [max 200 words] (6 marks)
c) Sketch three charts to address your manager’s requests. For each chart, provide a brief explanation of your design choices and explain how the chart can unveil interesting insights that could help stakeholders in their decisions. To sketch the chart, you can use any tool you want (e.g., you can use a software tool like Infogram, Excel, or R). Alternatively, you can sketch the chart using pencils, pens, or markers on paper, then take a picture of the charts and paste them into your solutions document. You can use illustrative data. [max 300 words] (9 marks)
PART B – Data Ethics 10 MARKS
While Artificial Intelligence (AI) has contributed remarkably to the success of Amazon's e-commerce (e.g. AI recommendation systems, AI for product sales and demand forecasting, AI for product delivery optimization, etc.), Amazon’s AI hiring system, adopted to recruit talented candidates for tech jobs, has encountered unprecedented criticism and backlash. The story begins in 2015, when a team of machine-learning (ML) specialists working for Amazon uncovered a bias in their new recruiting engine. They realised that Amazon's system was not rating candidates for software developer jobs and other technical posts in a gender-neutral way, and that it had taught itself that male candidates were preferable. This is because Amazon’s ML models were trained to vet applicants by observing patterns in candidates’ resumes submitted to the company over 10 years. Most came from men, a reflection of male dominance across the tech industry. According to experts in ML, the technology penalized resumes that included the word “women's,” as in “women's chess club captain”, and downgraded graduates of two all-women's colleges. Instead, it favoured candidates who described themselves using verbs more commonly found on male engineers’ resumes, such as “executed” and “captured”.
The story went viral, with critics pointing out that Amazon’s AI hiring algorithm secretly promoted discrimination because it showed bias against women. Gender bias was not highlighted as the only issue: problems with the data that underpinned the ML models’ judgments might also have led the system to recommend unqualified candidates for jobs. With the technology returning results almost at random, Amazon claimed they shut down the project. Indeed, employers have long dreamed of harnessing technology to analyse resumes and job applications to optimise the recruitment process and reduce reliance on the subjective opinions of human recruiters. But Nihar Shah, a computer scientist and researcher in ML at Carnegie Mellon University, claims that there is still much work to do to leverage technology in recruitment and figure out “how to ensure that the algorithm is fair, how to make sure the algorithm is interpretable and explainable - that’s still quite far off”.
Extracted/adapted from “Amazon scraps secret AI recruiting tool that showed bias against women” (Jeffrey Dastin)
https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G?fbclid=IwAR0KHgyudbhIotZTGQJtuvTx28_UevOVT1dRmtf_vsYh_AwYRj0ufpIpQxo
Required:
Regarding the above case, please answer all of the following questions:
a) Discuss TWO (2) ethical issues in Amazon's AI-hiring algorithm. [max 200 words] (5 marks)
b) Propose TWO (2) recommendations on how to address these ethical issues. Justify your arguments with examples relevant to the use of AI in the recruitment process. [max 200 words] (5 marks)
QUESTION 2 40 MARKS
PART A - Linear Regression Model 21 MARKS
Suppose you are given a dataset with information about a class of 30 students, where for each we can observe the marks (from 1 to 10) in Mathematics, Information Technology, Physics, Literature, and Arts and the average number of hours each spends on studying at home. You want to investigate the relationship between the mark in Mathematics and all other variables in the dataset using a linear regression model. The following output is obtained as shown in Figure 2.
Figure 2. Regression output
Required:
Regarding the above case, please answer all of the following questions:
a) Write down the regression equation for the relationship between the grade in Maths and all other variables. Please provide an interpretation of the regression coefficients, including the intercept. [max 200 words] (5 marks)
b) Use the output from the regression to discuss whether the marks in Arts and Literature are significant predictors of the mark in Maths. Enrich the discussion by describing the hypothesis testing procedure, and the decision-making process. [max 200 words] (5 marks)
c) Consider the following scatterplot, as shown in Figure 3:
Figure 3. Relationship between the Maths and Literature marks
How does it align with the regression output in Figure 2? Provide a justification to support your answer, and comment on the overall goodness of fit by discussing at least two relevant statistics from the output. [max 200 words] (5 marks)
d) What is the difference between the R² and the adjusted R²? How can you explain the difference between these two values in this linear regression? How can the model be improved to yield better predictions of future marks in Maths? [max 100 words] (2 marks)
e) Based on the QQ-plot shown in Figure 4, discuss the Normality assumption of the Maths grades. [max 200 words] (4 marks)
Figure 4. Normal Q-Q Plot
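As a reminder relevant to part d), the standard relationship between the two statistics can be sketched as follows. This is a minimal illustration only: the R² value used is hypothetical (the actual value must be read from Figure 2), and the count of five predictors assumes that all of the other variables described in the case are included in the model.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need n_obs > n_predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical R^2, NOT taken from Figure 2.
example_r2 = 0.60
# 30 students; 5 predictors assumed (IT, Physics, Literature, Arts, study hours).
example_adj = adjusted_r2(example_r2, n_obs=30, n_predictors=5)
print(round(example_adj, 4))
```

With a small sample and several predictors, the penalty term (n - 1) / (n - p - 1) is noticeably larger than 1, so the adjusted value falls below the unadjusted one.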
PART B - Logistic Regression Model 11 MARKS
Suppose you are a business analyst in an insurance company, and you want to assess the probability that a car policyholder makes a claim. For each policyholder, we can observe:
• Car age (car_age);
• Car value (car_value);
• Age of the policyholder (driver_age);
• Gender (gender);
• Type of area where the car is mostly used (driving_prevalence, which takes three values: “Urban” (1), “Rural” (2), “Highway” (3));
• Average speed (avg_speed).
The following output from the regression is obtained, as shown in Figure 5:
Figure 5. Regression output
Required:
Regarding the above case, please answer all of the following questions:
a) Please comment on the statistical significance of the average speed. [max 100 words] (3 marks)
b) Given the confusion matrix shown in Table 1 based on 1000 new policies, compute the accuracy rate (overall, for claiming policies and for non-claiming policies), and comment on the results. [max 100 words] (3 marks)
Table 1. Confusion Matrix
c) The boxplots shown in Figure 6 display the fitted probability of a claim by driving prevalence, and the average speed by driving prevalence. Please comment on these results and their relationship with the results from the logistic regression. [max 200 words] (5 marks)
Figure 6. Box plots
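The three accuracy rates requested in part b) can be computed from the four cells of a confusion matrix as in the sketch below. The counts used here are hypothetical and are not the values in Table 1; the function and argument names are also illustrative.

```python
def accuracy_rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy rates from a 2x2 confusion matrix.

    tp: actual claims predicted as claims
    fn: actual claims predicted as non-claims
    fp: actual non-claims predicted as claims
    tn: actual non-claims predicted as non-claims
    """
    total = tp + fn + fp + tn
    return {
        "overall": (tp + tn) / total,      # share of all policies classified correctly
        "claiming": tp / (tp + fn),        # accuracy among policies that claimed
        "non_claiming": tn / (tn + fp),    # accuracy among policies that did not claim
    }


# Hypothetical counts summing to 1000 policies (NOT the values in Table 1).
rates = accuracy_rates(tp=60, fn=40, fp=50, tn=850)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

When claims are rare, the overall rate is dominated by the non-claiming policies, so it can look high even if the model identifies claiming policies poorly.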
PART C - Data Analysis and Summary Statistics 8 MARKS
a) Suppose you would like to analyse the number of deaths by cause for each calendar year between 2000 and 2023 in Australia. Which graphical tool would you use to obtain a visual preliminary insight? [max 100 words] (3 marks)
The histogram shown in Figure 7 indicates the distribution of the life expectancy beyond age 70. The summary statistics of the data are as follows:
• Mean = 75.03
• Median = 74.35
• Skewness = 1.22
• Kurtosis = 1.94
Figure 7. Life expectancy beyond age 70
b) Please comment on the results from the summary statistics, and how these data can be processed to ease statistical analysis and modelling. [max 200 words] (5 marks)
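One common way to process right-skewed data of this kind is a log transformation. The sketch below is illustrative only: the ages are made-up values (not the data behind Figure 7), and taking the log of the years lived beyond 70 is one possible choice among several.

```python
import math


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical, right-skewed ages (NOT the data in Figure 7).
ages = [70.5, 71.0, 71.8, 72.5, 73.4, 74.3, 75.6, 77.9, 81.2, 88.0]

# Assumed transformation: log of the years lived beyond age 70.
log_excess = [math.log(a - 70.0) for a in ages]

raw_skew = sample_skewness(ages)
log_skew = sample_skewness(log_excess)
print(f"raw skewness: {raw_skew:.2f}, after log transform: {log_skew:.2f}")
```

On these illustrative values, the long right tail is compressed and the skewness moves much closer to zero, which is usually what makes the transformed variable easier to model.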
QUESTION 3 - Research Design and Experiments 30 MARKS
Genetically modified organism (GMO) labelling is mandatory in Australian grocery stores, but there is no legal requirement to specify it in restaurants. A local fast-food company is running an experiment that adds GMO labelling to some of the GMO ingredients in some products at eight stores in the Eastern suburbs of Sydney over the two quarters that ended 30 June 2023. A data scientist, Jack, was hired as a research consultant to design and implement this experiment. The Eastern suburbs stores where the GMO labelling has been trialled are classified as the treatment group, whilst stores in other suburbs are considered the control group. The stores’ quarterly sales data for the control and treatment groups are captured before and after the intervention. Model 1 compares the treatment and control groups after the intervention, whilst Model 2 compares the treatment and control groups before and after the intervention. Table 2 describes the data before the intervention.
Table 2. Summary Stores’ Statistics in Eastern (Treatment) and Other Suburbs (Control)
Jack analysed the experimental data by running the following two models, as displayed in the regressions below, and generated the results in Table 3. He used 52 observations, 2 quarters each for 26 stores.
Model 1: $y_i = \beta_0 + \beta_1 \mathit{Treated}_i + \varepsilon_i$
Model 2: $y_i = \beta_0 + \beta_1 \mathit{Treated}_i + \beta_2 \mathit{After}_i + \beta_3 (\mathit{Treated}_i \times \mathit{After}_i) + \varepsilon_i$
where,
$y_i$ is the quarterly sales at each store;
$\mathit{After}_i$ takes the value of 1 if it is after the intervention, and 0 otherwise;
$\mathit{Treated}_i$ is a binary variable that takes the value of 1 (treatment) if the store is based in an Eastern suburb, and 0 otherwise.
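Because Model 2 contains only two binary indicators and their interaction, its least-squares coefficients can be read directly from the four group means (treated or control, before or after). The sketch below illustrates this with made-up sales figures; the numbers, the function name, and the row layout are all hypothetical and are not taken from Tables 2 or 3.

```python
from statistics import mean


def did_coefficients(rows):
    """Coefficients of the Model 2 form from group means.

    rows: iterable of (sales, treated, after), with treated and after as 0/1.
    Returns (b0, b1, b2, b3) matching the intercept, treated, after and
    treated-by-after interaction terms.
    """
    cells = {}
    for sales, treated, after in rows:
        cells.setdefault((treated, after), []).append(sales)
    m = {key: mean(vals) for key, vals in cells.items()}

    b0 = m[(0, 0)]                                        # control, before
    b1 = m[(1, 0)] - m[(0, 0)]                            # pre-existing gap
    b2 = m[(0, 1)] - m[(0, 0)]                            # common time change
    b3 = (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])  # difference in differences
    return b0, b1, b2, b3


# Hypothetical quarterly sales (NOT the exam data): (sales, treated, after).
rows = [
    (100, 0, 0), (110, 0, 0),   # control stores, before
    (108, 0, 1), (118, 0, 1),   # control stores, after
    (120, 1, 0), (130, 1, 0),   # treated stores, before
    (122, 1, 1), (132, 1, 1),   # treated stores, after
]
b0, b1, b2, b3 = did_coefficients(rows)
print(b0, b1, b2, b3)
```

In this illustration, the treated stores' sales rise by 2 while the control stores' sales rise by 8, so the interaction coefficient is 2 - 8 = -6, which differs from the simple after-period gap between the two groups.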
Table 3. Regression results of two models
Required:
Regarding the above case, please answer all of the following questions:
a) Interpret β1 in model 1 (i.e. the first regression) and explain the problems associated with interpreting it as causal. [max 180 words] (6 marks)
b) Interpret each of the estimated coefficients in the difference in differences regression in model 2 (i.e. second regression) from Table 3. What is the new estimate for the treatment effect of the GMO labelling change? Provide an example of an unobservable that can lead to biased estimates of the treatment effect in model 2. [max 250 words] (8 marks)
c) Discuss two factors to consider in order to enhance the internal validity of the research experiment. [max 250 words] (8 marks)
d) Jack suggests revisiting his initial research design and has proposed using stores in another city, Brisbane, as the control group while keeping Sydney’s Eastern suburbs as the treatment group. Describe how he would do this and critically evaluate the revised design of the experiment. [max 250 words] (8 marks)