
TERM 3 2023

COMM1190: DATA, INSIGHTS, AND DECISIONS

FINAL EXAMINATION

QUESTION 1 30 MARKS

PART A – Data Communication 20 MARKS

You are working as a junior data analyst at CoinDesk, a financial service provider offering its customers data visualisation tools to monitor and compare cryptocurrencies. Your manager asked you to write a business report comparing two of the major cryptocurrencies, BTC and ETH, to unveil insights and suggest actionable recommendations to stakeholders on how to invest effectively in cryptocurrencies. You generated the chart in Figure 1 to include in your report:

Figure 1. Comparison of BTC and ETH prices over time

You have received feedback from your manager mentioning that there are some issues in your visualisation that might mislead the stakeholders in their decision to invest. According to your manager, the visualisation shows BTC and ETH prices are evolving similarly in the last couple of weeks, which could mislead stakeholders to think that both cryptocurrencies generate a similar level of return. You decide to go back to the drawing board and figure out how to improve your visualisation.

Required:

Regarding the above case, please answer all the following questions:

a)     Critically evaluate the visualisation in Figure 1 and identify two aspects you could improve to better address your manager’s feedback above. [max 200 words] (5 marks)

Your manager asked you to suggest more effective visualisations with a focus on (i) the evolution of prices, (ii) the Market Cap of each cryptocurrency, and (iii) whether the spikes/drops in prices are associated with positive or negative news (based on social media sentiment). You decide to generate three charts to address your manager’s requests.

b)     Using the four-chart-types framework (conceptual vs data-driven / declarative vs exploratory), identify the type of chart that you would use to address each of the manager’s requests and justify your answers. [max 200 words] (6 marks)

c)     Sketch three charts to address your manager’s requests. For each chart, provide a brief explanation of your design choices and explain how the chart can unveil interesting insights that could help stakeholders in their decisions. To sketch the chart, you can use any tool you want (e.g., you can use a software tool like Infogram, Excel, or R). Alternatively, you can sketch the chart using pencils, pens, or markers on paper, then take a picture of the charts and paste them into your solutions document. You can use illustrative data. [max 300 words] (9 marks)

PART B – Data Ethics 10 MARKS

While Artificial Intelligence (AI) has contributed remarkably to the success of Amazon's e-commerce (e.g. AI recommendation systems, AI for product sales and demand forecasting, AI for product delivery optimization, etc.), Amazon’s AI hiring system, adopted to recruit talented candidates for tech jobs, has encountered unprecedented criticism and backlash. The story begins in 2015, when a team of machine-learning (ML) specialists working for Amazon uncovered a bias in their new recruiting engine. They realised that Amazon's system was not rating candidates for software developer jobs and other technical posts in a gender-neutral way and had taught itself that male candidates were preferable. This is because Amazon’s ML models were trained to vet applicants by observing patterns in candidates’ resumes submitted to the company over 10 years. Most came from men, a reflection of male dominance across the tech industry. According to experts in ML, the technology penalized resumes that included the word “women's”, as in “women's chess club captain”, and downgraded graduates of two all-women's colleges. Instead, it favoured candidates who described themselves using verbs more commonly found on male engineers’ resumes, such as “executed” and “captured”.

The story went viral, with critics pointing out that Amazon’s AI-hiring algorithm appeared designed to secretly promote discrimination, as it shows bias against women. Gender bias was not the only issue highlighted: there may also have been a problem with the data that underpinned the ML models’ judgments, leading them to recommend unqualified candidates for jobs. With the technology returning results almost at random, Amazon claimed they shut down the project. Indeed, employers have long dreamed of harnessing technology to analyse resumes and job applications to optimise the recruitment process and reduce reliance on the subjective opinions of human recruiters. But Nihar Shah, a computer scientist and researcher in ML at Carnegie Mellon University, claims that there is still much work to do to leverage technology in recruitment and figure out “how to ensure that the algorithm is fair, how to make sure the algorithm is interpretable and explainable - that’s still quite far off”.

Extracted/adapted from “Amazon scraps secret AI recruiting tool that showed bias against women” (Jeffrey Dastin)

https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G?fbclid=IwAR0KHgyudbhIotZTGQJtuvTx28_UevOVT1dRmtf_vsYh_AwYRj0ufpIpQxo

Required:

Regarding the above case, please answer all of the following questions:

a)     Discuss TWO (2) ethical issues in Amazon's AI-hiring algorithm. [max 200 words] (5 marks)

b)     Propose TWO (2) recommendations on how to address these ethical issues. Justify your arguments with examples relevant to the use of AI in the recruitment process. [max 200 words] (5 marks)

QUESTION 2 40 MARKS

PART A - Linear Regression Model 21 MARKS

Suppose you are given a dataset with information about a class of 30 students, where for each student we can observe the marks (from 1 to 10) in Mathematics, Information Technology, Physics, Literature, and Arts, and the average number of hours each spends studying at home. You want to investigate the relationship between the mark in Mathematics and all other variables in the dataset using a linear regression model. The output obtained is shown in Figure 2.

Figure 2. Regression output

Required:

Regarding the above case, please answer all of the following questions:

a)     Write down the regression equation for the relationship between the grade in Maths and all other variables. Please provide an interpretation of the regression coefficients, including the intercept. [max 200 words] (5 marks)

b)     Use the output from the regression to discuss whether the marks in Arts and Literature are significant predictors of the mark in Maths. Enrich the discussion by describing the hypothesis testing procedure and the decision-making process. [max 200 words] (5 marks)

c)     Consider the following scatterplot, as shown in Figure 3:

Figure 3. Relationship between the Maths and Literature marks

How does it align with the regression output in Figure 2? Provide a justification to support your answer, and comment on the overall goodness of fit by discussing at least two relevant statistics from the output. [max 200 words] (5 marks)

d)     What is the difference between $R^2$ and the adjusted $R^2$? How can you explain the difference between these two values in this linear regression? How can the model be improved to yield better predictions of future marks in Maths? [max 100 words] (2 marks)

e)     Based on the QQ-plot shown in Figure 4, discuss the Normality assumption of the Maths grades. [max 200 words] (4 marks)

Figure 4. Normal Q-Q Plot
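
As background for Part d), the standard textbook relationship between $R^2$ and adjusted $R^2$ can be sketched as below. This is only an illustration: the function name `adjusted_r2` is hypothetical, the sample size of 30 and the count of 5 predictors are read from the case description, and the $R^2$ value of 0.60 is made up because the output in Figure 2 is not reproduced here.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof


# n = 30 students and p = 5 predictors follow the case description;
# R^2 = 0.60 is an illustrative value, NOT the one in Figure 2.
example = adjusted_r2(0.60, n_obs=30, n_predictors=5)
print(f"adjusted R^2 = {example:.4f}")
```

The penalty term $(n - 1)/(n - p - 1)$ grows with the number of predictors, so with a small sample the gap between the two statistics widens as weak predictors are added.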

PART B - Logistic Regression Model 11 MARKS

Suppose you are a business analyst in an insurance company, and you want to assess the probability that a car policyholder makes a claim. For each policyholder, we can observe:

•    Car age (car_age);

•    Car value (car_value);

•    Age of the policyholder (driver_age);

•    Gender (gender);

•    Type of area where the car is mostly used (driving_prevalence, which takes three values: “Urban” (1), “Rural” (2), “Highway” (3));

•    Average speed (avg_speed).

The following output from the regression is obtained, as shown in Figure 5:


Figure 5. Regression output
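
For orientation, a logistic regression turns a linear combination of the predictors into a claim probability through the logistic function. The sketch below shows that mapping under stated assumptions: the function name `claim_probability` and all coefficient values are hypothetical and are not the estimates reported in Figure 5, and only two of the listed predictors are used to keep it short (a categorical variable such as driving_prevalence would normally enter through dummy variables).

```python
import math


def claim_probability(intercept: float, coefs: dict, features: dict) -> float:
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_j b_j * x_j)))."""
    linear = intercept + sum(coefs[name] * features[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients, NOT the values reported in Figure 5.
coefs = {"avg_speed": 0.03, "car_age": 0.05}
slow = claim_probability(-4.0, coefs, {"avg_speed": 50, "car_age": 5})
fast = claim_probability(-4.0, coefs, {"avg_speed": 90, "car_age": 5})

# exp(coefficient) is the multiplicative change in the odds per unit of speed.
odds_ratio_speed = math.exp(coefs["avg_speed"])
print(f"p(slow)={slow:.3f}, p(fast)={fast:.3f}, odds ratio={odds_ratio_speed:.3f}")
```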

Required:

Regarding the above case, please answer all of the following questions:

a)     Please comment on the statistical significance of the average speed. [max 100 words] (3 marks)

b)     Given the confusion matrix shown in Table 1, based on 1000 new policies, compute the accuracy rate (overall, for claiming policies, and for non-claiming policies), and comment on the results. [max 100 words] (3 marks)

Table 1. Confusion Matrix

c)     The boxplots shown in Figure 6 indicate the fitted probability of claim by a policyholder by driving prevalence and the average speed by driving prevalence. Please comment on these results and their relationship with the results from the logistic regression. [max 200 words] (5 marks)


Figure 6. Box plots
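
The three accuracy rates requested in Part b) follow mechanically from the four cells of a confusion matrix. A minimal sketch is given below; the helper name `accuracy_rates` is hypothetical, and the counts are illustrative placeholders that sum to 1000, not the entries of Table 1.

```python
def accuracy_rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Overall and class-specific accuracy from confusion-matrix counts.

    tp: actual claims predicted as claims
    fn: actual claims predicted as non-claims
    fp: actual non-claims predicted as claims
    tn: actual non-claims predicted as non-claims
    """
    total = tp + fn + fp + tn
    return {
        "overall": (tp + tn) / total,
        "claims": tp / (tp + fn),      # share of actual claims correctly predicted
        "non_claims": tn / (tn + fp),  # share of actual non-claims correctly predicted
    }


# Illustrative counts only (they sum to 1000); NOT the values in Table 1.
rates = accuracy_rates(tp=120, fn=80, fp=50, tn=750)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, as in these placeholder counts, the overall rate can look high even when accuracy for the rarer class is much lower, which is why the class-specific rates are reported separately.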

PART C - Data Analysis and Summary Statistics 8 MARKS

a)     Suppose you would like to analyse the number of deaths by cause for each calendar year between 2000 and 2023 in Australia. Which graphical tool would you use to obtain a visual preliminary insight? [max 100 words] (3 marks)

The histogram shown in Figure 7 indicates the distribution of the life expectancy beyond age 70. The summary statistics of the data are as follows:

•    Mean = 75.03

•    Median = 74.35

•    Skewness = 1.22

•    Kurtosis = 1.94

Figure 7. Life expectancy beyond age 70
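
For reference, the moment-based definitions behind the skewness and kurtosis figures can be sketched as follows. The data here are synthetic (a lognormal sample, chosen only because it is right-skewed), the helper names are hypothetical, and it is an assumption that the reported kurtosis is excess kurtosis; the source does not say which convention is used. The log transform at the end is one common option for skewed positive data, shown purely as an illustration.

```python
import numpy as np


def skewness(x) -> float:
    """Moment-based sample skewness: m3 / m2**1.5."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return float(np.mean(d**3) / np.mean(d**2) ** 1.5)


def excess_kurtosis(x) -> float:
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return float(np.mean(d**4) / np.mean(d**2) ** 2 - 3.0)


# Synthetic right-skewed sample; NOT the life-expectancy data behind Figure 7.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=0.5, size=5000)

print("mean > median:", sample.mean() > np.median(sample))
print("skewness (raw):", round(skewness(sample), 3))
print("skewness (log):", round(skewness(np.log(sample)), 3))
print("excess kurtosis (raw):", round(excess_kurtosis(sample), 3))
```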

b)     Please comment on the results from the summary statistics, and how these data can be processed to ease statistical analysis and modelling. [max 200 words] (5 marks)

QUESTION 3 - Research Design and Experiments 30 MARKS

Genetically modified organism (GMO) labelling is mandatory in Australian grocery stores, but there is no legal requirement to specify it in restaurants. A local fast-food company is performing an experiment to implement GMO labelling on some of the GMO ingredients for some products in eight stores in the Eastern suburbs of Sydney over the two quarters that ended 30 June 2023. A data scientist, Jack, was hired as a research consultant to design and implement this experiment. The Eastern suburbs stores where the GMO labelling has been trialled for the products are classified as the treatment group, whilst stores in other suburbs are considered the control group. The stores’ quarterly sales data for the control and treatment groups are captured before and after the intervention. Model 1 compares the treatment and control groups after the intervention, whilst Model 2 compares the treatment and control groups before and after the intervention. Table 2 describes the data before the intervention.

Table 2. Summary Stores’ Statistics in Eastern (Treatment) and Other Suburbs (Control)

Jack analysed the experimental data by running the following two models, as displayed in the regressions below, and generated the results in Table 3. He used 52 observations, 2 quarters each for 26 stores.

Model 1: $y_i = \beta_0 + \beta_1 \,\text{Treated}_i + \varepsilon_i$

Model 2: $y_i = \beta_0 + \beta_1 \,\text{Treated}_i + \beta_2 \,\text{After}_i + \beta_3 \,(\text{Treated}_i \times \text{After}_i) + \varepsilon_i$

where,

$y_i$ is the quarterly sales at each store.

$\text{After}_i$ takes the value of 1 if it is after the intervention, and 0 otherwise.

$\text{Treated}_i$ is the binary variable that takes the value of 1 (for treatment) if the store is based in an Eastern suburb, and 0 otherwise.
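
To make the structure of Model 2 concrete, the sketch below fits the same specification by ordinary least squares on made-up, noise-free data. Everything numeric here is an assumption for illustration: the four cell means are invented, the layout of one quarter before and one quarter after the intervention is assumed, and the function name `did_fit` is hypothetical. Only the counts of 8 treated stores, 18 control stores, and 52 observations come from the case. The values are not those in Table 3.

```python
import numpy as np


def did_fit(sales, treated, after):
    """OLS for sales = b0 + b1*Treated + b2*After + b3*(Treated x After)."""
    treated = np.asarray(treated, dtype=float)
    after = np.asarray(after, dtype=float)
    X = np.column_stack([np.ones_like(treated), treated, after, treated * after])
    beta, *_ = np.linalg.lstsq(X, np.asarray(sales, dtype=float), rcond=None)
    return beta


# 8 treated + 18 control stores, each observed before (0) and after (1).
store_flag = np.concatenate([np.ones(8), np.zeros(18)])
treated = np.repeat(store_flag, 2)   # store-level flag repeated for both quarters
after = np.tile([0.0, 1.0], 26)      # before/after indicator per store

# Invented cell means keyed by (treated, after); NOT taken from Table 3.
cell_mean = {(0, 0): 100.0, (0, 1): 110.0, (1, 0): 120.0, (1, 1): 125.0}
sales = np.array([cell_mean[(int(t), int(a))] for t, a in zip(treated, after)])

beta = did_fit(sales, treated, after)

# The interaction coefficient equals the difference of the two before/after changes.
did_by_means = (cell_mean[(1, 1)] - cell_mean[(1, 0)]) - (cell_mean[(0, 1)] - cell_mean[(0, 0)])
print("coefficients:", np.round(beta, 6))
print("difference in differences from cell means:", did_by_means)
```

Because the design has exactly four cells, the fitted coefficients reproduce the cell means: the intercept is the control mean before, the Treated coefficient is the pre-existing gap, the After coefficient is the control group's change, and the interaction is the extra change in the treated group.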

Table 3. Regression results of two models

Required:

Regarding the above case, please answer all of the following questions:

a)     Interpret $\beta_1$ in Model 1 (i.e. the first regression) and explain the problems associated with interpreting it as causal. [max 180 words] (6 marks)

b)     Interpret each of the estimated coefficients in the difference-in-differences regression in Model 2 (i.e. the second regression) from Table 3. What is the new estimate for the treatment effect of the GMO labelling change? Provide an example of an unobservable that can lead to biased estimates of the treatment effect in Model 2. [max 250 words] (8 marks)

c)     Discuss two factors to consider in enhancing the internal validity of the research experiment model. [max 250 words] (8 marks)

d)     Jack suggests revisiting his initial research design and has proposed using Brisbane, another city, as the control group while keeping Sydney’s Eastern suburbs as the treatment group. Describe how he would do this and critically evaluate the revised design of the experiment. [max 250 words] (8 marks)


