Econometrics 761
Please upload your solutions to A2L by December 19, 2022, 23:59 EST. The
submission should be a pdf document.
This document consists of two questions: Q1, which is a combination of theory and
coding, and Q2, which is an applied question asking you to replicate the results in a
published paper. Take the time to read all explanations in this document, and pay
attention to how you report your results after running the code. Please, use full sentences
when answering questions. Treat this as a report that you may have to write for your
future job.
Most of the R code is already written for you: all R code appears in blue in this pdf
(except for this line). This means that everything that is blue can be run in R directly,
i.e. you can just copy-paste the blue code and it will run (provided that you followed all
directions given to you in this document).
You have to pay attention to how you copy-paste from the pdf!¹
One of the questions asks you to use the gmm package. Those of you with
Macs may have issues loading it, so switch to working in RStudio Cloud,
https://login.rstudio.cloud/login?redirect=%2F.
Warning vs Error Messages in R: You may still get "warnings" in RStudio Cloud,
so pay attention to whether something is labeled an "error" or a "warning." If it is
a warning, it's fine; move ahead as if nothing happened, since a warning does not stop
execution and the code still produces correct answers here. If it's an error, then you
have to investigate.
¹ For example, when you copy-paste the code from the pdf into R Markdown, be careful with:
– the underscores: you will have to add them manually;
– the comments following ##: you may have to double check that everything that follows ##
in the pdf is commented in your R Markdown file, else you may get an error;
– the left arrow in R, which is <-: pay attention to how this gets copied from the pdf; you
may have to change it manually before you run the code.
You can suppress warning messages from your output by including warning=FALSE
in your {r} chunk header, i.e. {r warning=FALSE}. For more on this, take a look at
this document.
You can also suppress the output from a regression by including results = "hide" in
your {r} chunk header, i.e. {r echo = T, results = "hide"}. You may need this for
Q2(e). Please look at page 18 of PS.pdf.
– So you can write {r echo = T, warning = F, results = "hide"} which would
suppress warnings and the output.
This project should not take you more than 3 days to complete (this is a generous
estimate). However, you have a bit more than 2 weeks to work on this project. Given
the fact that you have plenty of extra time to work on this project:
1. you are strongly advised to start working on the assignment as soon as you get it;
2. you are strongly incentivised to start working on the project as soon as possible by
having limited contact with me and the TAs, in the sense that I will not
answer any questions about this project by email. If you have questions, you have
to sign up for my office hours and/or those of your TAs, or ask your questions on
the discussion board on A2L.
3. you are heavily penalized for late submissions. There will be a penalty of 25%
per 12 hours of being late, no matter the excuse.
Make sure that you start early and aim to submit at least 3 days before the deadline.
This should give you plenty of time to work through any troubleshooting that you may
need to do as well as to ask questions during office hours, and, in particular, it will help
you keep calm. As most code is provided to you (in blue), this project is mostly about
paying attention to what the problem is asking you, as well as understanding the code.
Please pay attention, read everything at least twice, and then double check that you are
answering the question that is asked.
Question 1
This question has a theoretical component and a coding (in R) component (99.9% of the
code is already written for you in this text). This exercise is based on Wooldridge (2001).²
You can find this paper on A2L, under the Individual PS module.
– Read pages 87 and 88 (at a minimum) in Wooldridge (2001) to remind yourselves
about the method of moments estimator (MM) and GMM.
Reminder: GMM is a method for constructing estimators that uses assumptions about
specific moments of the random variables. The assumptions are called "moment
conditions." Moment conditions are expected values that specify the model parameters in
terms of the true moments. The sample moment conditions are the sample equivalents to
the moment conditions. GMM finds the parameter values that are closest to satisfying
the sample moment conditions. When there are more moment conditions than parameters,
GMM finds the parameters that get as close as possible to solving weighted sample
moment conditions. Uniform weights and optimal weights are two ways of weighting the
sample moment conditions. The uniform weights use an identity matrix to weight the
moment conditions. The optimal weights use the inverse of the covariance matrix of the
moment conditions. [All this is in the videos and slides on GMM that you can find on
A2L.] GMM generalizes the method of moments (MM) by allowing the number of moment
conditions to be greater than the number of parameters. Using these extra moment
conditions makes GMM more efficient than MM. When there are more moment conditions
than parameters, the estimator is said to be overidentified. GMM can efficiently combine
the moment conditions when the estimator is overidentified.
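In symbols, the description above can be restated as follows (a sketch in standard GMM notation, not notation defined in this assignment): with sample moments g̅n(θ), the GMM estimator minimizes a weighted quadratic form,

```latex
\bar{g}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} g(y_i,\theta), \qquad
\hat{\theta}_{\mathrm{GMM}} = \arg\min_{\theta}\; \bar{g}_n(\theta)'\, W_n\, \bar{g}_n(\theta),
```

where Wn = I gives the uniform weights and Wn equal to the inverse of the (estimated) covariance matrix of the moment conditions gives the optimal weights. When the number of moment conditions equals the number of parameters, the minimizer sets g̅n(θ̂) = 0 exactly and the choice of Wn is irrelevant; that is the MM case.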
We will illustrate these points by estimating the mean of a normally distributed random
variable with mean θ0 and variance θ0². We will be estimating θ0 by MM, GMM,
and an efficient GMM estimator.
Theoretical part
Let y = (y1, y2, . . . , yn) be an iid random sample with yi ~ N(θ0, θ0²), that is, the yi are
normally distributed with mean θ0 and variance θ0². Note that there is one parameter
here (rather than two).
² Wooldridge, J. M. (2001). "Applications of Generalized Method of Moments Estimation."
Journal of Economic Perspectives, 15(4): 87-100.
(a) Write down the MM estimator for θ0 that only uses the moment condition E(yi) =
θ0. What is the asymptotic distribution of this estimator as n → ∞?
(b) Consider the GMM estimator for θ that uses the moment conditions E[g(yi, θ0)] = 0,
where g(yi, θ) = (yi − θ, yi² − 2θ²)′. These are two moment conditions for one
parameter, so the choice of GMM weight matrix matters. We choose the weight
matrix Wn.
Find an explicit expression for the resulting GMM estimator.
(c) What would have been the asymptotically optimal weight matrix choice in (b)?
What is the probability limit of the chosen Wn in (b) as n → ∞? Is the choice of
Wn in (b) asymptotically optimal?
(d) Calculate the asymptotic variance of the GMM estimator obtained in (b), and
compare it to the asymptotic variance of the estimator in (a).
Coding part
First, we generate our data. We draw a sample of size n = 500 from the population
random variable Y that is normally distributed with mean 2 and variance 4, i.e. Y ~
N(θ0, θ0²) = N(2, 4). So for this part we assume that we know the true parameter θ0 = 2.
Note that this is something specific to simulations; in reality, we do not know the true
value of θ0, but in simulations, we do. This will help us understand whether GMM is
indeed more efficient than MM. Without knowing the true θ0 we would not be able to
answer this question.
The lines below generate one sample of iid random variables drawn from this population
random variable.
set.seed(1234) ## setting the seed to be able to replicate our results
n = 500 ## sample size
y = rnorm(n,2,2) ## y is a normally distributed random variable with mean=2
and variance=4; there is no typo here; to convince yourselves type ?rnorm
in the R console and read the help file
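Since rnorm() is parameterized by the standard deviation rather than the variance, a quick sanity check (a side demonstration, not part of the assignment) confirms that sd = 2 corresponds to variance 4:

```r
set.seed(1)
z <- rnorm(100000, mean = 2, sd = 2)  ## third argument is the sd, not the variance
mean(z)  ## close to 2
var(z)   ## close to 4, since variance = sd^2 = 2^2
```

With 100,000 draws the sample mean and variance should be within a few hundredths of 2 and 4, respectively.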
Second, we compute the MM estimator in (a) based on this one sample. Using your
answer from (a) above, the MM estimator is given by
theta_hat_MM = mean(y)
When you copy-paste the code from the pdf into R Markdown, be careful with
– the underscores: you will have to add them manually;
– the comments following ##: you may have to double check that everything
that follows ## in the pdf is commented in your R Markdown file, else you
may get an error.
You will notice that there is a small bias by computing theta_hat_MM - 2.
The value of theta_hat_MM is based on just one draw from the population. What we
would like now is to draw S times from the population random variable Y, where S is a
large number. That is, we will generate j = 1, . . . , S samples of size n, where each sample
is drawn from Y. Each sample is used to compute the value of θ̂_{j,n}, j = 1, . . . , S; that is,
sample j = 1 produces a value for θ̂_{1,n}, ..., sample j = S produces a value for θ̂_{S,n}. We
will have a vector of size S × 1 of values.
The value of our estimator will be the average across simulations, i.e. θ̂ = (1/S) ∑_{j=1}^S θ̂_{j,n}.
(Note that n is the same number across simulations, i.e. n the sample size does not vary
with S; we always draw the same sample size but we draw it many times.)
If you remember, when we talk about inference, we say "if we were to repeat the
experiment many times, ...". What this means is that we would like to draw different
samples of size n repeatedly (and many times) from the same population random
variable Y. We would then compute the estimator based on each sample, call
it θ̂_{j,n}, j = 1, . . . , S, and then we would average across all draws, such that the
estimator that we report in a simulation study will be (1/S) ∑_{j=1}^S θ̂_{j,n}.
To do this, we add the lines:
S = 1000 ## number of draws
theta_hat_MM <- matrix(NA, nrow = S, ncol = 1) ## creates a vector of
NAs that has S rows and one column (row j is the value θ̂_{j,n}, j = 1, . . . , S)
for(j in 1:S){
y = rnorm(n,2,2) ## a new sample j is drawn from the population Y
theta_hat_MM[j] = mean(y) ## θ̂_{j,n} = mean of sample j
}
– pay attention to { and }!
So your code so far should look like this:
set.seed(1234)
n = 500
S = 1000
theta_hat_MM <- matrix(NA,nrow=S,ncol=1)
for(j in 1:S){
y = rnorm(n, mean=2, sd=2)
theta_hat_MM[j] <- mean(y)
}
Run this code.
The vector theta_hat_MM contains 1000 values, each computed on a sample drawn
from the population random variable.
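As a side note on the design, the same simulation can be written without an explicit for-loop using replicate(); this is just an equivalent sketch (the name theta_hat_MM_alt is illustrative), producing a plain vector of S values rather than an S × 1 matrix:

```r
set.seed(1234)
n <- 500   ## sample size, as above
S <- 1000  ## number of draws, as above
## each call draws a fresh sample of size n from N(2, 4) and records its mean
theta_hat_MM_alt <- replicate(S, mean(rnorm(n, mean = 2, sd = 2)))
```

Either form is fine for this assignment; the for-loop version in the text makes it more obvious where to add extra estimators inside the loop in parts (f) and (g).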
(e) As we learned, estimators have sampling distributions. The estimator theta_hat_MM
has a sampling distribution that can be plotted by writing (after running the code
above):
hist(theta_hat_MM, freq=F, ylim=c(0,7)) ## histogram
lines(density(theta_hat_MM), col="red") ## sampling density
The CLT says that this sampling distribution is approximately normal. To
see this, we add to this plot a normal density with mean = mean(theta_hat_MM)
and standard deviation = sd(theta_hat_MM) (recall that the standard deviation
is sqrt of the variance). So add to the code the following line:
lines(seq(1.1, 2.5, by=.01), dnorm(seq(1.1, 2.5, by=.01), mean(theta_hat_MM),
sd(theta_hat_MM)), col="black")
You should now have a plot with a histogram, a red curve, and a black curve.
The red curve is the sampling distribution. The normal density is the black
curve.
Does theta_hat_MM that you plotted look unbiased? How can you tell simply
by looking at the plot? Does this plot show us that the sampling distribution
of the MM estimator is approximately normal? How can you tell?
Report now mean(theta_hat_MM) and var(theta_hat_MM). What is the
bias of mean(theta_hat_MM)? Does this match your plot?
(f) We will now implement the GMM estimator in (b) using the formula for the
estimator that you derived in (b). To write this formula in R, we write the line below:
theta_hat_gmm <- ((1/2)*mean(y)*mean(y^2))^(1/3)
As before, we now want to compute this for each draw j = 1, 2, ..., 1000. So
we add this to our for-loop:
theta_hat_MM <- matrix(NA,nrow=S,ncol=1) ## vector that stores the MM
estimator after each draw
theta_hat_gmm <- matrix(NA,nrow=S,ncol=1) ## vector that stores the GMM
estimator from (b) after each draw
for(j in 1:S){
y = rnorm(n, mean=2, sd=2)
theta_hat_MM[j] <- mean(y) ## computes the MM estimator
theta_hat_gmm[j] <- ((1/2)*mean(y)*mean(y^2))^(1/3) ## computes the GMM
estimator from (b)
}
So your entire code should look like:
set.seed(1234)
n = 500
S = 1000
theta_hat_MM <- matrix(NA,nrow=S,ncol=1)
theta_hat_gmm <- matrix(NA,nrow=S,ncol=1)
for(j in 1:S){
y = rnorm(n, mean=2, sd=2)
theta_hat_MM[j] <- mean(y)
theta_hat_gmm[j] <- ((1/2)*mean(y)*mean(y^2))^(1/3)
}
Run this code.
To compare the MM and the GMM estimator, write
c(mean(theta_hat_MM), var(theta_hat_MM))
c(mean(theta_hat_gmm), var(theta_hat_gmm))
Which estimator has the smaller bias? Which one has the smaller variance? Is
this what you expected given the theory?
(g) First, install and load the gmm package, i.e.
install.packages("gmm")
library(gmm)
Then create the moment conditions function g(y, θ) from (b). The lines below
generate a function that takes as arguments x (which is your data) and tet
(which is your theta), and returns an n × 2 matrix:
g <- function(tet,x){
m1 <- (x - tet)
m2 <- (x^2 - 2*tet^2)
f <- cbind(m1,m2)
return(f)
}
Write the lines above and run the code. Then test the function. To do this,
write:
yy = c(1,2,3)
g(0,yy)
Print g(0,yy) on the screen. g(0,yy) should be a 3 × 2 matrix.
Compute g(y, θ) from (b), where y = yy (from above, yy = c(1,2,3)) and θ = 0,
by hand (again, you should get a 3 × 2 matrix). How do they compare,
the g computed with the code and the g that you calculated by hand? Are
they equal? Should they be?
Now, to compute the GMM estimator calculated with the optimal weight matrix,
write the following code:
init <- 2 ## initial value to be used in the GMM algorithm
theta_hat_optm_gmm <- gmm(g, y, init, wmatrix="optimal")$coefficient ##
computes the GMM estimator using the moments in g, the data y, the initial
starting value in init, and the optimal weight matrix; it stores only the
GMM coefficient estimate
We now want to compute S such GMM estimators. So we need to add two
lines to the code below. What are these two lines? You can use what we
did in (f) as an example. It is straightforward if you understand what
happened in (f). Your code, with the two lines (you need to replace ??), is:
set.seed(1234)
n = 500
S = 1000
theta_hat_MM <- matrix(NA,nrow=S,ncol=1)
theta_hat_gmm <- matrix(NA,nrow=S,ncol=1)
theta_hat_optm_gmm <- ??
init <- 2
g <- function(tet,x){
m1 <- (x - tet)
m2 <- (x^2 - 2*tet^2)
f <- cbind(m1,m2)
return(f)
}
for(j in 1:S){
y = rnorm(n, mean=2, sd=2)
theta_hat_MM[j] <- mean(y)
theta_hat_gmm[j] <- ((1/2)*mean(y)*mean(y^2))^(1/3)
theta_hat_optm_gmm[j] <- ??
}
Replace ?? above and then run the code. Then report the following:
c(mean(theta_hat_MM), var(theta_hat_MM))
c(mean(theta_hat_gmm), var(theta_hat_gmm))
c(mean(theta_hat_optm_gmm), var(theta_hat_optm_gmm))
Which estimator has the smallest variance, and which estimator is efficient?
Is this as you expected, given what you learned about these estimators in
this class?
(h) Take the plot in (e) above, and add curves for the sampling densities of theta_hat_gmm
and theta_hat_optm_gmm. Color the density of theta_hat_gmm blue and that of
theta_hat_optm_gmm green. (You have to write the code for this; you can follow
the code in (e) as an example of how to do this.) So your plot will have: the histogram
of the MM estimator, the sampling density of the MM estimator (in red), the sampling
density of the GMM estimator (in blue), and the sampling density of the GMM estimator
with the optimal weight matrix (in green). It should look like Figure 1 below:
Figure 1: Sampling densities for all three estimators.
Does this plot match your answer from (g) above? Explain what the red, green,
and blue curves are and how they support (or do not support) your answer in (g) above.
Question 2
This question is based on the paper of Bedard and Dhuey (“The Persistence of Early
Childhood Maturity: International Evidence of Long-Run Age Effects”, QJE, 2006).
The paper is available on A2L under the Individual PS module.
The data used in this paper is on A2L under the same module.
You are expected to reproduce some of the results in this paper. To do this, you
will have to write R code. I provide code below that is supposed to help you with
this. For parts (a) and (b), you pretty much have to copy-paste the code that I wrote
for you, while for (d) and (e) you have to adapt the code from lm regression to ivreg,
based on the code that I provided for part (b).
When you report your results for Q2, use the fixest package that Caleb
introduces in his video "Ordinary Least Squares in R", which you can find on
Eduflow under R Videos or on YouTube. You can also use the stargazer package
if you find it easier to use; directions are on the last page of the document whose
link you can find here, or in the description of the Individual Project module up
on A2L.
Pay particular attention to what the question is asking you! Look at the code,
look at what the question is asking you. This is an exercise in attention more than
anything, as you have the code written and you have to adapt it to the question.
The paper uses data from multiple sources. We only want to reproduce a small fraction
of the results in the paper. We only use the TIMSS (Trends in International Mathematics
and Science Study) data from Canada for 3rd and 4th graders (9 year olds in 1995), while
Bedard and Dhuey (2006) also consider many other countries and also another age group
(13 year olds). The dataset is available on A2L under the same module
("timss-canada-1995-pop1.dta").
Every entry in the dataset corresponds to one student. The dataset contains the
variables weight (sampling weights that can be used to obtain nationally representative
estimation results) and idschool (identifier for the school of the student). The
outcome variable is math_score (math test score, normalized to have mean 50 and standard
deviation 10). The main explanatory variable of interest, which is treated as endogenous,
is age (age in months), and the corresponding instrumental variable is r_age
(relative age in months, see paper for the definition). Additional control variables are
grade (either 3rd or 4th grade, since students are 9 years old), female (gender dummy),
mother_native (dummy for mother born in country), father_native (dummy for father
born in country), both_parents (dummy for both parents present in household),
calculator (dummy for calculator at home), computer (dummy for computer at home),
books100 (dummy for whether more than 100 books at home), hh_size (a measure for
household size), and rural (dummy for whether school is in rural area).
(a) Start by reading the paper. Explain – in your own words – why age is probably
endogenous here. Explain how the instrumental variable (relative age) is computed.
Explain why relative age is arguably exogenous. Explain why relative age is arguably
a relevant instrument for age.
(b) Reproduce the OLS and IV (i.e. 2SLS) results for math test scores in
Canada in Table III of Bedard and Dhuey (2006). So these would be
column 1 and column 4 for Math in Table III. Report your estimation results for
both unweighted OLS and IV and for population weighted OLS³ and
IV (weighted 2SLS). Code to help you with this is provided at the end of this
document; you just have to follow the directions given and then copy-paste the
code. Report your findings for non-weighted regressions and for weighted regressions,
both with heteroscedasticity robust standard errors, and compare to Bedard
and Dhuey (2006).
– You will not obtain the same answers as in the table, which is to be expected.
Think why this may be the case. Read the description of the data in the paper,
look at the data that you have imported in R and the exact regression that
the authors ran vs the one that you ran, read my note below. What could
justify the fact that the numbers are not exactly reproduced? But even if you
cannot reproduce their results numerically, can you reproduce their qualitative
findings, e.g. how does the statistical significance of your coefficient in front of
age compare to theirs?
³ However, the weights are not chosen as the inverse of the variance of the errors with
the goal of minimizing the variance, but instead are computed with the goal of making the
estimates nationally representative. The WLS standard errors will be larger here than the
OLS standard errors.
– Note that our normalization of the outcome variable is different from the one
used in Bedard and Dhuey (2006), because they normalize over their whole
sample, while we only normalize over the data from Canada. The household
size variable might also be defined slightly differently.
(c) For the estimation results in (b), are the signs of the coefficients on the control
variables as expected? Explain.
In the following we only perform unweighted estimation, i.e. we do not use the weight
variable anymore, which would allow us to obtain nationally representative results.⁴ It
also turns out to not matter much whether heteroscedasticity robust standard errors are
reported here, so it is OK to not use them in the answers to the following questions.
(d) Repeat the IV estimation in (b) for the following eight subpopulations: only female
students, only male students, only students whose mother is native to Canada, only
students whose mother is not native to Canada, only students who report to have
a calculator at home, only students who report not to have a calculator at home,
only students who report to have a computer at home, only students who report
not to have a computer at home. Code to help you with this is provided at the end
of the document. For this part, you have to adapt the code to answer the question
asked. The code that I provide is only for the subpopulation of female students, so
you have to adapt the code to the other subpopulations. Interpret your findings,
focusing on the differences in the age effect across the different subpopulations (e.g.
male vs. female age effect). Note that the calculator and computer dummies can
be seen as rough proxies for socioeconomic status, and mother_native is probably
also positively correlated with socioeconomic status.
– You have to run the code with all covariates for the different populations, but
you can report only the coefficient for age (with its standard error and
t-stat). An example of how you can do this is in the code at the end of the document.
(e) As a robustness check, repeat the analysis in (d), but this time including dummy
variables for each school as regressors (so-called "school specific fixed effects"). Code
to help you with this is provided at the end of the document. For this part, you
have to adapt the code to IV regression. The code I gave you is for OLS (it uses
the lm function) but you need to report results for IV regressions. How do the
differences in age effects (e.g. male vs. female age effects) change compared to
(d)? Explain why the inclusion of school dummies in the regression might change
the results.
⁴ The results using weights are less robust and have larger standard errors, because they are driven
by a relatively small subset of individuals with large weights. Not using weights also simplifies the
implementation somewhat, e.g. xtivreg does not work with individual specific weights.
– You have to run the code with all covariates for the different populations, but
you can report only the coefficient for age (with its standard error and
t-stat). An example of how you can do this is in the code at the end of the document.
Directions and code to help you with Q2
First you have to load your data into R and install a few packages.
Go to RStudio, File, Import Dataset, From Stata..., Browse, Select the data set
that you downloaded from A2L, called “timss-canada-1995-pop1.dta”, and click
“Import.” You have now imported the Stata data set in R.
Install and load the packages
– “lmtest”,
– “MASS”,
– and you may also need “sandwich”.
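Putting those directions together, a one-time setup sketch might look like the following (the commented haven line is an optional scripted alternative to the Import Dataset menu; the haven package is an assumption on my part and is not mentioned in the directions above):

```r
## one-time installation (gmm is for Q1, AER provides ivreg() for Q2)
install.packages(c("lmtest", "MASS", "sandwich", "AER", "gmm"))

## load at the top of every session / R Markdown file
library(lmtest)    ## coeftest()
library(MASS)      ## rlm()
library(sandwich)  ## vcovHC()
library(AER)       ## ivreg()

## optional scripted import of the Stata file (assumes the haven package):
## timss_canada_1995_pop1 <- haven::read_dta("timss-canada-1995-pop1.dta")
```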
Directions and code to help you with Q2(b).
To run the unweighted OLS regression (similar to that in Table III, first column for
Canada), just regress math_score on all the variables mentioned in the footnote of
that table that appear in your data set, i.e.
lmAPI <- lm(math_score~age+grade+female+mother_native+father_native+both_parents
+calculator+computer+books100+hh_size+rural,
data=timss_canada_1995_pop1)
and then correct for heteroskedasticity by running:
coeftest(lmAPI, vcov=vcovHC(lmAPI, "HC1")) ## this line produces
heteroskedasticity robust standard errors; you may need to install and load
"sandwich" for this depending on your R version
To produce the weighted OLS regression that appears in Table III, column 1, run
lmAPI_w <- rlm(math_score~age+grade+female+mother_native+father_native+both_parents
+calculator+computer+books100+hh_size+rural,
data=timss_canada_1995_pop1, weights=weight)
and then you correct for heteroskedasticity by running:
coeftest(lmAPI_w, vcov=vcovHC(lmAPI_w, "HC1"))
This is supposed to reproduce the weighted OLS regression with heteroskedasticity robust
standard errors in Table III, column 1 for Canada.⁵
To run an IV regression, install and load the AER package, and then run the
unweighted regression
lmAPI_iv <- ivreg(math_score~age+grade+female+mother_native+father_native
+both_parents+calculator+computer+books100+hh_size+rural | r_age+grade
+female+mother_native+father_native+both_parents+
calculator+computer+books100+hh_size+rural,
data=timss_canada_1995_pop1)
Pay attention to the code above. After the conditioning "|" we add the instrument
and ALL EXOGENOUS explanatory variables that we used in the regression. Since age
is endogenous, it is not added. Pay attention to how the code is written; if you want to
read more on this, look up ivreg in the R help or on Stack Exchange.
To run the weighted version of the IV regression above, add weights=weight after
data=timss_canada_1995_pop1, just as we did for the weighted OLS regression.
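For concreteness, a sketch of that weighted IV call (same formula as the unweighted ivreg call above; the object name lmAPI_iv_w is just an illustrative choice, and this assumes the data frame has already been imported):

```r
lmAPI_iv_w <- ivreg(math_score ~ age + grade + female + mother_native
                    + father_native + both_parents + calculator + computer
                    + books100 + hh_size + rural
                    | r_age + grade + female + mother_native + father_native
                    + both_parents + calculator + computer + books100
                    + hh_size + rural,
                    data = timss_canada_1995_pop1, weights = weight)
## heteroskedasticity robust standard errors, as for the OLS regressions
coeftest(lmAPI_iv_w, vcov = vcovHC(lmAPI_iv_w, "HC1"))
```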
⁵ You will not get the same numerical answers. But are they qualitatively the same?
Code to help you with Q2(d).
The code below is for lm; you have to adapt it for ivreg. To run an lm regression for
different subsets of the data, we subset the data as follows. For example, for female
students, we run
lmAPI_w_female <- lm(math_score~age+grade+mother_native+father_native
+both_parents+calculator+computer+books100+hh_size+rural,
data=subset(timss_canada_1995_pop1, female==1))
Pay attention to the line that runs the regression! There is no female as a covariate
anymore. If you include female as a covariate you will get an error saying that 'x' is
singular: singular fits are not implemented in lm. It is the same for ivreg: you must
eliminate female as an exogenous covariate when you run ivreg! Pay attention, else you
will get an error.
Code to help you with Q2(e).
The code below is for lm; you have to adapt it to ivreg. To generate school fixed
effects, you can add factor(idschool) as an exogenous covariate, and run the
following:
```{r echo = T, results = "hide"} ## this chunk option hides the output
(the output contains coefficients for all school IDs, which would make it too
difficult to read; run your chunk with and without results="hide" to see
what happens)
lmAPI_w_female_schoolFE <- rlm(math_score~age+factor(idschool)+grade
+mother_native+father_native+both_parents+calculator
+computer+books100+hh_size+rural,
data=subset(timss_canada_1995_pop1, female==1))
summary(lmAPI_w_female_schoolFE)$coef[2,] ## this line reports the result
for the coefficient on age only (row 2 of the coefficient table; row 1 is
the intercept)
When you run ivreg with fixed effects, you have to include factor(idschool) both to the
right of "|" and to the left of it, as it is an exogenous covariate!
