EC224 Stata commands for the empirical data analysis project Spring 2025
Suppose that in the Census data the variable denoting race has 10 categories (White, Pacific, Asian, Black, Hispanic, and so on). Since race represents a qualitative factor, not a quantitative one (the values corresponding to White or Hispanic carry no numerical meaning), we should convert the categorical variable into a set of binary regressors, one binary variable for each category of the original variable, race.
There are two commands that you can use to convert a categorical variable into a set of dummy regressors.
One of them is
tabulate race, generate(racebinary)
The new binary regressors generated by that command will be named racebinary1, racebinary2, and so on, one binary regressor for each category of the original categorical variable race.
The other incorporates the conversion directly into the regression command:
regress earning i.race
The former command has the advantage that you can manually choose, from the set of new binary regressors, the specific race categories you would like to include in your regression; the latter drops the first category by default and uses all the other binary racial regressors. Their coefficients are interpreted as the difference in earnings between their race and the reference (dropped) category.
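With the factor-variable notation you can also pick the reference category yourself using the ib#. prefix instead of accepting the default. A sketch, assuming race is numerically coded and that, say, category 3 is the group you want as the baseline:

```stata
* make category 3 the omitted (reference) group instead of the first category
regress earning ib3.race
```

The coefficients on the remaining race indicators are then earnings differences relative to category 3.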
You can also try the following useful steps to take care of the categorical regressors:
tab race //this command shows the frequency of each value in your sample of observations.
*If you want to see the frequencies of several variables at once, use the tab1 command
*– it will produce a separate, individual frequency distribution for each variable listed:
tab1 gender race
Suppose that you use secondary data (from the Internet), and you have a variable called GENDER.
When you tabulate the two values of that variable, the tab output may show only the value labels and their frequencies, without revealing which numeric code (0 or 1) underlies the Male category and which underlies the Female category.
You can supplement the command with the nol (nolabel) option, which displays the numeric codes of the categories of the gender variable:
tab gender, nol
Knowing the numeric codes of your categorical variable is helpful if you want to recode it. For example, suppose you would like to recode a binary variable named gender, which takes on the value of 1 for male, into a new binary variable named female that takes on the value of 1 for female. Then use the following command:
recode gender (1=0) (0=1), gen(female)
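Since gender takes on only the values 0 and 1, an equivalent one-line alternative (assuming those are the only nonmissing values) is:

```stata
* flip a 0/1 indicator; arithmetic with missing yields missing, so
* missing values of gender stay missing in female
gen female = 1 - gender
```
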
You can use the same command to create one binary variable out of the categorical variable:
Creating a dummy variable from the categorical variable:
Say you want to see the differences in demand for textbooks between the months of September to April and the period of May to August, but you only have a variable X recording the month of each observation (coded 1 to 12). Then use the recode command to generate a binary variable that equals 1 for the summer months only:
recode X (5/8=1) (1/4 9/12=0), gen(Xsummer)
***You could use any two numbers you want to represent each category. Assigning 0 and 1 to these indicator variables, however, is common practice: it makes the slope coefficients on binary variables easier to interpret.
Next, to check that the command did what you were intending, type
tab X Xsummer //generates a cross-tabulation of the two variables
Since the new variable does not have any value labels, we may want to attach the labels to it. We can do it in TWO steps:
lab def season 0 "academic year" 1 "summer break" //defines a new value label called season such that observations of the new binary variable Xsummer with a value of 0 will be labeled as academic-year observations and observations coded as 1 will be labeled as summer-break observations.
lab val Xsummer season //the defined value label season is attached to the binary variable Xsummer
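To verify that the labels were attached, tabulate the new variable again; the categories should now display as text (this assumes Xsummer and season were created as above):

```stata
* frequencies are now listed under "academic year" and "summer break"
tab Xsummer
```
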
Another way to create a dummy variable from the categorical variable
To create a separate dummy for each level of a categorical variable area (1- urban, 2- suburban, 3 - rural), use
tabulate area, gen(area) //it generates a series of 3 dummies, area1 to area3, one for each possible value of the original categorical variable area.
Creating dummy variables out of the continuous variables:
Suppose we have a continuous variable density measuring population density across cities. We will use the generate command and the logical operator & to generate a binary variable that takes the value of 1 if the population density is between 100 and 300 people per square mile, and 0 otherwise:
gen suburban = (density >= 100) & (density <= 300) if !missing(density)
There are three additional ways to create dummy variables: one is to use generate, which creates one dummy variable at a time; another is to use tabulate, which creates whole sets of dummies at once; and the third is to use xi, which may allow you to avoid the issue of dummy-creation altogether.
Answer 1 of 3: Use generate
You could type
gen young = 0
replace young = 1 if age<25
or
gen young = (age<25)
This statement does the same thing as the first two statements. age<25 is an expression, and Stata evaluates it, returning 1 if the statement is true and 0 if it is false.
If you have missing values in your data, it would be better if you type
gen young = 0
replace young = 1 if age<25
replace young = . if missing(age)
or
gen young = (age<25) if !missing(age)
Stata treats a missing value as positive infinity, so the expression age<25 evaluates to 0, not missing, when age is missing. (If the expression were age>25, the expression would evaluate to 1 when age is missing.)
You do not have to type the parentheses around the expression.
gen young = age<25 if !missing(age)
is good enough. Here are some more illustrations of generating dummy variables:
gen male = sex==1
gen top = answer=="very much"
gen eligible = sex=="male" & (age>55 | (age>40 & enrolled))
In the above line, enrolled is itself a dummy variable—a variable taking on values zero and one. We could have typed & enrolled==1 but typing & enrolled is good enough.
Summary stats
Once you have deleted missing observations and transformed all the categorical variables (such as education and marital status) into binary variables (also known as dummy variables, taking on only two values: 0 or 1), you can use the following commands to complete the analysis of your data for the project paper:
sum Y X1 X2 X3 X4 X5 X6
To learn which observations (individuals or countries, depending on your topic) represent the minimum and maximum values indicated in the output table of the summarize command, use the list command:
list country if gdppercapita>9000
To learn how many observations (individuals or countries, depending on your topic) satisfy a given condition (e.g., gdppercapita greater than $9,000 per year), use the count command; note that count takes no variable list, only an if condition:
count if gdppercapita>9000
Econometric analysis
regress Y X1 //X1 is your key regressor
If you would like to implement your regression analysis using a subset of observations in your sample (for example, excluding the country of Malta), type the following command in the Command window:
regress Y X1 if country != "Malta" //do not forget the quotes; they are necessary for string (not numeric) variables
*interpret the slope coefficient for X1 and comment on the high likelihood of the omitted variable bias in the single-regressor model
regress Y X1 if X1<2 //the regression will be implemented using only observations for which X1<2
Generate a scatterplot of Y against X1 with the fitted line superimposed on it:
twoway (scatter Y X1) (lfit Y X1)
*lfit stands for linear fit; if you use qfit instead, Stata will superimpose a quadratic curve on the scatterplot
Transforming variables to capture possible nonlinearities in the data
Use the command graph matrix Y X1 X2 X3 X4 X5 to visualize pairwise relationships between all your variables at once. This will help you determine whether there are nonlinear relationships between the dependent variable (Y) and the explanatory factors (X's), as well as detect possible linear correlation between pairs of the X variables (multicollinearity).
If the scatter plots from graph matrix command indicate nonlinearities, you can handle them by transforming your Y and (or) X variables using the following commands:
To generate the polynomial terms and natural logarithms of the variables use:
generate X1_sq=X1^2
generate lnGDP = log(gdp)
Implement multivariate regression model utilizing the regress command and robust option for robust standard errors of the coefficients:
regress Y X1 X2 X3 X4 X5 X6 X5X6, robust
Note that I added the interaction between X5 and X6 into the multiple regression model above.
I encourage interaction terms. You may interact any regressors except your key regressor (X1), because you will be using X1 on the left-hand side of the first-stage regression in the IV regression model.
To generate the interaction terms, use generate command, for example:
generate X5X6=X5*X6 //note that interactions are only meaningful between the explanatory factors; do not include the Y variable or the instrumental variables
Implement instrumental variable regression model utilizing the added instrumental variables, using ivregress command, for example:
ivregress 2sls Y X2 (X1=X3 X4) X5 X6 X5X6, first
Important: if the 2sls estimator does not work, use the gmm estimator instead:
ivregress gmm Y X2 (X1= X3 X4) X5 X6 X5X6, first
Request Stata to create summary stats of only the data that actually participate in your regression analysis:
sum Y X1 X2 X3 X4 X5 X6 if e(sample)
Note that you cannot omit the estimator in the ivregress command: Stata requires one of 2sls, liml, or gmm to be specified, so there is no default method to fall back on; for this project, 2sls is the standard choice.
It is useful to test the endogeneity of the key regressor (a Hausman-type test) in Stata by implementing the command estat endog right after the ivregress command has been executed.
Test for the relevance of the instruments in the context of the instrumental variable regression:
To test the joint significance of the slope coefficients on the instruments, implement a separate command for the first-stage regression (the F test reported in the top right corner of the regression output tests ALL slope coefficients for zero values; we need only the instruments):
regress X1 X2 X3 X4 X5 X6
test X3 X4
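Alternatively, after the ivregress command itself, Stata can report the first-stage diagnostics directly, including the F statistic for the excluded instruments. A sketch, using the same variable names as above:

```stata
ivregress 2sls Y X2 (X1 = X3 X4) X5 X6
estat firststage   // reports the first-stage F statistic for the excluded instruments X3 and X4
```
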
Next, we need to implement the test of instrument exogeneity (the overidentifying-restrictions test); note that this test must be run right after the ivregress command!
ivregress 2sls Y X2 (X1= X3 X4) X5 X6, first
estat overid
Interpretation of the test result: the null hypothesis is that all the instruments are exogenous, so a p-value below 0.05 would indicate that the instruments are not exogenous. If that happens, it is not good news; ideally you would search for other instrumental variables, but given the short-term nature of the empirical project in this class, simply recognize this limitation and settle on the chosen instruments due to the lack of time.
See the video by Stata's Chuck Huber on Blackboard for the details of the instrumental variable regression implementation in Stata.
A note on missing R2 in the ivregress command output:
For two-stage least squares, some of the regressors enter the model as instruments when the parameters are
estimated. However, since our goal is to estimate the structural model, the actual values, not the instruments for the endogenous right-hand-side variables, are used to determine the model sum of squares (MSS). The model’s residuals are computed over a set of regressors different from those used to fit the model. This means a constant-only model of the dependent variable is not nested within the two-stage least-squares model, even though the two-stage model estimates an intercept, and the residual sum of squares (RSS) is no longer constrained to be smaller than the total sum of squares (TSS). When RSS exceeds TSS, the MSS and the R2 will be negative and may not be reported in the output.
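In symbols, the R2 that Stata reports is defined from these sums of squares, which is why it can turn negative here:

```latex
R^2 = 1 - \frac{RSS}{TSS}, \qquad \text{so } RSS > TSS \;\implies\; R^2 < 0 .
```
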
Create the table of all your competing regression models using either the outreg2 command (search online for help on this user-written command) or the following series of commands:
quietly reg Y X1 //simple linear regression
estimates store SimpleLinear
quietly reg Y lnX1 //transformed (not linear) simple regression
estimates store SimpleNonlinear
quietly reg Y X1 X2 X3 X4 X5 X6 X5X6 //I encourage using the interactions of regressors (except X1) with other control variables
estimates store Multiple
quietly ivregress 2sls Y (X1=X3 X4) X2 X5 X6
estimates store IVreg
estimates table SimpleLinear SimpleNonlinear Multiple IVreg, b(%9.3f) se(%6.3f) ///
p(%9.3f) stats(N rmse r2 r2_a F) title("Regression Models in EC224 empirical project")