ETF5952 Quantitative Methods for Risk Analysis
Semester 1, 2020
ASSIGNMENT 2
Deadline: 3PM, June 10, 2020
Important Instructions
• This assignment comprises 25% of the assessment for ETF5952. This is an individual, NOT a syndicate,
assignment. On the Assignment Cover Sheet, read the references to plagiarism and collusion from University
Statute 4.1. Part III-Academic Misconduct.
• Answer all questions, and start from a new page for each question. Your assignment must be typed and
you must submit a pdf file (A4 pages) with an Assignment Cover Sheet (from the ASSIGNMENTS section
of Moodle).
Name your assignment: Surname Initials AS.pdf and upload this file to Moodle as follows:
1. Go to the “ASSIGNMENTS” section.
2. Click on the “ASSIGNMENT 2” link to upload.
3. The following message will appear momentarily, “File uploaded successfully.”
(To confirm later that your upload was successful, go to the “ASSIGNMENTS” section and click on the
“Assignment 2” upload link. The uploaded file’s name will be shown.)
• If you have a valid reason for not meeting the deadline, you will be asked to submit what you have done by
the due date and will receive your grade relative to opportunity. Without a valid reason, 10% of the assignment’s
allocated marks will be deducted for each day that it is late.
• Submit one pdf file only. Do NOT submit/attach R scripts or output files. Do not submit your assignment
in a folder.
• You should summarize what you obtain to answer the questions, instead of providing all code and output.
If you provide too much output relative to the questions, we will consider that you may not understand
the questions, and your answers will be subject to point deductions.
• If you have questions regarding materials, you are encouraged to use our consultation. The course email
should be used only for pointing out typos and personal matters.
Question 1 (25 points: 5+5+5+5+5)
To answer this question, use a mobility data set for Australia, “move au.csv”. The data are extracted from Google
mobility data; see the Google site for more information (https://www.google.com/covid19/mobility/). The
data set contains 6 variables on mobility in 8 sub-regions of Australia from Feb 15 to May 7.
We consider a factor model for the jth variable $x_{i,t,j}$ for region i and time t, given by
$$E[x_{i,t,j}] = \phi_{j,1}\nu_{it,1} + \phi_{j,2}\nu_{it,2} + \cdots + \phi_{j,6}\nu_{it,6}.$$
Here, since each variable can vary over time and regions, the latent factors depend on both time and region (but the
analysis is similar).
1. To estimate the factor model, apply Principal Component Analysis (PCA). Use the scale option to standardize
the 6 variables. Report the plot of the variances of the PCs and explain which component is dominant (no
more than 30 words).
2. Report the estimated loadings in a table. From the loadings, explain the effect of the first factor on the 6 variables
(no more than 30 words).
3. Using the estimated factors, report a boxplot of 6 factors and explain whether the result is consistent with
the one in Question 1.1.
4. Add the estimated first factor to the original data set as a new variable. Also, set “date” as a date variable
by using the “as.Date” function. Report a scatter plot with date on the x-axis and the first factor on the y-axis. Draw
a horizontal line at y = 0. Interpret variations in the first factor over time (no more than 50 words).
5. Notice that the first factor, $\nu_{it,1}$, can vary across regions. To see regional variations, create a box plot
of $\nu_{it,1}$ for each region (the “boxplot” function may not work well without adjustment; if so, I suggest
using the ggplot2 package). Based on the first-factor variations in Victoria relative to those in the other
regions, explain whether human mobility in Victoria decreased (no more than 50 words).
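The PCA workflow described in the steps above can be sketched on simulated data. This is only an illustration under stated assumptions: the data frame `dat`, the variable names `var1` to `var6`, the region labels and all numbers are made up and are not taken from “move au.csv”. Only base R functions (`prcomp`, `screeplot`, `boxplot`, `as.Date`, `abline`) are used.

```r
# Simulated stand-in for the mobility data: 8 regions x 30 days, 6 variables.
# All names and values below are illustrative assumptions.
set.seed(1)
n_region <- 8
n_day <- 30
n <- n_region * n_day
common <- rnorm(n)  # one shared latent factor driving all 6 variables
X <- sapply(1:6, function(j) 0.8 * common + rnorm(n, sd = 0.5))
colnames(X) <- paste0("var", 1:6)
dat <- data.frame(
  region = rep(paste0("R", 1:n_region), each = n_day),
  date   = rep(format(as.Date("2020-02-15") + 0:(n_day - 1)), times = n_region),
  X
)

# PCA on the standardized variables (scale. = TRUE divides by the std. dev.)
pca <- prcomp(dat[, colnames(X)], scale. = TRUE)

screeplot(pca, main = "Variances of PCs")  # variance of each component
loadings <- round(pca$rotation, 2)         # estimated loadings (6 x 6 table)
scores <- pca$x                            # estimated factors, one column per PC
boxplot(scores, main = "Estimated factors")

# First factor over time, with a horizontal reference line at zero
dat$date <- as.Date(dat$date)              # character -> Date
dat$factor1 <- scores[, 1]
plot(dat$date, dat$factor1, xlab = "date", ylab = "first factor")
abline(h = 0)

# First factor by region
boxplot(factor1 ~ region, data = dat)
```

If the ggplot2 package is installed, the last plot could instead be drawn with `ggplot(dat, aes(region, factor1)) + geom_boxplot()`. Note that the sign of a principal component is arbitrary, so the loadings should be checked before interpreting the direction of the first factor.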
Question 2 (25 points: 5+5+5+10)
We will use a type of difference-in-differences estimation to estimate the effect of Napster on music sales. In this
question, use “cex basefile97 02.csv”, which is extracted from several data sets and downloaded from the Journal
of Applied Econometrics. Before the analysis, you have to clean the data set. The data set is provided with the
“readme.sh.txt” file. Check carefully what kind of variables are in the data set. We do not use “newid”,
“intno” and “firmth”. We use “cdall” and “weight” as the dependent variable and a weight, respectively. The
variables “year” and “nint” are the key variables; consider the other variables as control variables. When
you load the data set, notice that it has no variable names, so you have to use the
option for no header (check “?read.csv”). We do NOT use “weight” as weight for all regressions in this question.
Consequential marks will not be provided, and you are strongly encouraged to read the readme file and set up your
data carefully (it is easy to select variables by column number: DATA[,3] means the 3rd variable and
DATA[,3:7] means the 3rd to 7th variables). When you use gamlr, you do not need to report any hypothesis testing
result.
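As a small illustration of the loading step, the sketch below writes a tiny header-less CSV to a temporary file and reads it back with `header = FALSE`. The file contents, the column order and the placeholder names `id`, `a` and `b` are made up for the example; the real column order must be taken from the readme file.

```r
# Write a tiny header-less CSV to a temporary file (made-up values).
tmp <- tempfile(fileext = ".csv")
writeLines(c("101,1,3,1997,0,12.5",
             "102,1,3,1999,1,0.0",
             "103,2,6,2001,1,7.0"), tmp)

# header = FALSE: the first row is data; columns are auto-named V1, V2, ...
DATA <- read.csv(tmp, header = FALSE)

third <- DATA[, 3]    # the 3rd variable
block <- DATA[, 3:5]  # the 3rd to 5th variables

# Hypothetical column order, for illustration only.
names(DATA) <- c("id", "a", "b", "year", "nint", "cdall")
DATA <- DATA[, c("year", "nint", "cdall")]  # keep only the columns needed
```

Assigning names right after loading makes later model formulas easier to read than positional indexing, though selecting by column number works equally well.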
1. Let $y_{it}$ be music sales and let $d_{it}$ take 1 if the household HAS internet and 0 otherwise, for household i and year
t. Napster started in 1999; let $\text{t.nap}_t$ take 1 if t ≥ 1999 and 0 otherwise. Set $\text{d.nap}_{it} = d_{it} \times \text{t.nap}_t$.
Without internet access, people cannot use Napster. Thus, we consider the following model
$$y_{it} = \alpha + \beta_t + \gamma d_{it} + \delta\,\text{d.nap}_{it} + \epsilon_{it},$$
where α, γ and δ are parameters, $\beta_t$ is a year effect for t, and $\epsilon_{it}$ is the error. The parameter δ measures
the effect of Napster on music sales. Estimate this model using the data set. Report only the estimated
effect of Napster and provide an interpretation of Napster’s effect (20 words).
2. Estimate the model in Question 2.1 with all available control variables. Report only the estimated effect of
Napster and provide an interpretation of Napster’s effect (20 words).
3. Use lasso (gamlr) to estimate the model in Question 2.2 (single machine learning). Report only the estimated
effect of Napster and provide an interpretation of Napster’s effect (20 words).
4. Use double machine learning to estimate the effect of Napster. First, apply lasso (gamlr) to estimate
$$d_{it} = \alpha + \beta_t + x_{it}'\lambda + \eta_{it},$$
where $x_{it}$ is a vector of control variables. Note that you also have to include the time effects $\beta_t$ (time dummies).
Let $\hat{d}_{it}$ be the fitted values from this estimation.
Second, apply lasso (gamlr) to estimate
$$y_{it} = \alpha + \beta_t + \gamma d_{it} + \delta\,\text{d.nap}_{it} + \phi(\hat{d}_{it} \times \text{t.nap}_t) + x_{it}'\pi + \epsilon_{it}.$$
Here, always keep the term $(\hat{d}_{it} \times \text{t.nap}_t)$. Report only the estimated effect of Napster and provide an interpretation
of Napster’s effect (20 words).
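The mechanics of these estimation steps can be sketched on simulated data. This is a rough illustration, not a solution: the controls `x1` and `x2`, the sample size and the coefficients are all assumptions, and the real data must be prepared from the readme file. The sketch assumes the gamlr package is installed and reads “always keep the term” as leaving that column unpenalized through gamlr's `free` argument, which is one possible interpretation.

```r
library(gamlr)  # assumed to be installed

# Simulated stand-in data (all values are illustrative assumptions).
set.seed(2)
n <- 2000
year <- sample(1997:2002, n, replace = TRUE)
x1 <- rnorm(n); x2 <- rnorm(n)             # stand-in control variables
d <- rbinom(n, 1, plogis(0.5 * x1))        # internet-access dummy
t.nap <- as.numeric(year >= 1999)          # 1 from 1999 onwards
d.nap <- d * t.nap                         # interaction term
y <- 1 + 0.1 * (year - 1997) + 0.5 * d - 0.7 * d.nap + 0.3 * x1 + rnorm(n)

# Regression with year effects (factor(year) creates the year dummies)
fit_ols <- lm(y ~ factor(year) + d + d.nap)
delta_ols <- coef(fit_ols)["d.nap"]

# Design pieces for the lasso: year dummies and controls as matrices
yr <- model.matrix(~ factor(year))[, -1]   # drop the intercept column
ctrl <- cbind(x1, x2)

# Single lasso on the full model
single <- gamlr(cbind(d, d.nap, yr, ctrl), y)
delta_single <- coef(single)["d.nap", ]

# Double ML, stage 1: lasso of d on year dummies and controls
stage1 <- gamlr(cbind(yr, ctrl), d)
dhat <- as.vector(as.matrix(predict(stage1, cbind(yr, ctrl), type = "response")))

# Double ML, stage 2: column 1 (dhat x t.nap) is left unpenalized via free = 1
dhat.nap <- dhat * t.nap
X2 <- cbind(dhat.nap, d, d.nap, yr, ctrl)
stage2 <- gamlr(X2, y, free = 1)
delta_dml <- coef(stage2)["d.nap", ]
```

By default `coef` on a gamlr fit reports the coefficients at the AICc-selected penalty, so a coefficient of exactly zero means the lasso dropped that variable.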
Question 3 (25 points: 10+15)
In the lecture, the average treatment effect was introduced under a binary treatment status, but we often
encounter randomized control trials with multiple treatments. Consider the case of three treatment status,
where we have no treatment, treatment 1 and treatment 2. Let d1 be a dummy variable taking 1 for treatment
1 and 0 otherwise, and d2 be a dummy variable taking 1 for treatment 2 and 0 otherwise.
1. We consider the following regression
$$y = \alpha + \beta d_1 + \gamma d_2 + \epsilon,$$
where α, β and γ are parameters and ε is the error with E[ε] = 0. Using β and γ in this regression, explain
what you can estimate. First, express only the final outcomes mathematically (no derivations) and second,
explain each (no more than 10 words for each).
2. Suppose that we want to estimate the difference in average treatment effects between treatment 1 and treatment 2.
To this end, consider and express a regression specification when only y, d1 and d2 are available,
and denote the key parameter by δ. Express mathematically what δ measures.
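The three-arm setup can be explored with a small simulation. This is only a sketch under made-up assumptions: the arm labels, the sample size and the outcome equation are hypothetical and serve just to show how the two dummies are built and how the regression is fitted.

```r
# Simulated randomized trial with three arms (illustrative values only).
set.seed(3)
n <- 900
arm <- sample(c("none", "t1", "t2"), n, replace = TRUE)  # random assignment
d1 <- as.numeric(arm == "t1")   # dummy for treatment 1
d2 <- as.numeric(arm == "t2")   # dummy for treatment 2
y <- 2 + 1.0 * d1 + 1.5 * d2 + rnorm(n)  # made-up outcome equation

fit <- lm(y ~ d1 + d2)
est <- coef(fit)                    # estimated alpha, beta, gamma
grp <- tapply(y, arm, mean)         # sample mean of y in each arm
```

Comparing `est` with `grp` is a useful way to check an interpretation of the coefficients before writing it down.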
Question 4 (25 points: 5+5+5+5+5)
Use “Hitters” from the ISLR package.
1. The original data set contains some missing values, denoted by NA. Drop observations with the missing
values and then report the summary statistics of “Salary”, only.
2. Report a histogram of Salary and explain salary inequality in Major League Baseball (no more than 15
words).
3. To understand the sources of the salary inequality, we use a regression tree for Salary with the remaining variables
in the data set. Provide the estimation result and list all the characteristics of high-salary players (just
make a list of the conditions: no explanation is required).
4. Friend A argues that the regression tree analysis based on Salary may be influenced by outliers. Explain if
the argument is correct or not (no more than 30 words).
5. Given Friend A’s argument, we consider an alternative formulation by taking the log of Salary. Estimate a
regression tree for log Salary with the rest of the variables as regressors. Provide the estimation result and
explain the characteristics of high-salary players (just make a list of the conditions: no explanation is required).
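The steps in this question can be sketched as follows. The sketch assumes that the ISLR and tree packages are installed; using the `tree` package is an assumption, since the question does not name a tree package, and `rpart` would work along similar lines. The object names `hit`, `fit` and `fit_log` are arbitrary.

```r
library(ISLR)  # provides the Hitters data
library(tree)  # assumed choice of regression-tree package

# Drop observations with missing values, then summarize Salary only
hit <- na.omit(Hitters)
salary_summary <- summary(hit$Salary)

# Histogram of Salary
hist(hit$Salary, main = "Salary", xlab = "Salary")

# Regression tree for Salary on all remaining variables
fit <- tree(Salary ~ ., data = hit)
plot(fit)
text(fit, pretty = 0)  # label the splits on the plotted tree

# Alternative specification: regression tree for log Salary
fit_log <- tree(log(Salary) ~ ., data = hit)
plot(fit_log)
text(fit_log, pretty = 0)
```

Printing `fit` or `fit_log` lists each split condition and the mean response in every node, which can be easier to read than the plot when collecting the conditions along a branch.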