

POLI 171: A Summary 
Policy Making with Data 
Two big questions to address 
• Why do we need to use data in policy analysis and evaluation?
• How do we use data? 
Why do we do this? 
Turns out, there are a lot of things we don't know about the world
…and a lot of things we thought we knew about the world
Oh the things we don't know
In the past, many terrible policies have been made with arguably good 
intentions 
• The United States recruited and sent people who measured below
mental and medical standards to Vietnam, hoping to provide "training and
opportunity" to the uneducated and poor
• China exterminated rats, flies, mosquitoes, and sparrows
in an attempt to protect crops
• India and many countries instituted a ban on child labor
• Australia fought a war (and lost) against emus
Rigorous empirical research is the only way to put our
beliefs and intentions to the test
How we do this 
The backbone of our analysis is the Potential Outcome Framework 
• The potential outcome model 
• Causal effects 
• The fundamental problem of causal inference 
• Causal estimands: ATE, ATT 
• Omitted variable bias 
The potential outcome model 
• For every "treatment" (a policy, membership in an
organization/community/group, a given characteristic, etc.), and for
every outcome, each observation has two potential outcomes
• An outcome under treatment condition (Y1)
• An outcome under control condition (or, in the absence of the treatment) (Y0)
• Which of the two outcomes is exhibited depends on treatment status
• Let the observed outcome be Y
• If observation is treated -> we observe only treated outcome: Y=Y1
• If observation is untreated -> we observe only untreated outcome: Y=Y0
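The rule for which potential outcome gets observed can be sketched in R on simulated data. This is only an illustration: the variable names (y0, y1, d, y), the sample size, and the constant effect of 2 are assumptions, not part of the course material.

```r
set.seed(171)
n  <- 8
y0 <- rnorm(n, mean = 10)     # potential outcome under control (Y0)
y1 <- y0 + 2                  # potential outcome under treatment (Y1), assumed effect of 2
d  <- rbinom(n, 1, 0.5)       # treatment status: 1 = treated, 0 = untreated
y  <- ifelse(d == 1, y1, y0)  # observed outcome: Y = Y1 if treated, Y = Y0 otherwise
data.frame(d, y0, y1, y)      # in real data only d and y would be visible
```

In real data the columns y0 and y1 are never both available for the same row, which is what makes the simulation useful for building intuition.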
Causal Effect 
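In the notation of the potential outcome model, the causal effect for unit i is conventionally written as the difference between its two potential outcomes. The symbol \tau_i is an assumed label, used here only for illustration:

```latex
\tau_i = Y_{1i} - Y_{0i}
```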
Fundamental Problem of Causal Inference 
1. We can never observe both Y1i and Y0i simultaneously 
2. As a result, we can never know causal effect with certainty 
Causal Estimand: Average Treatment Effect (ATE)
Average Treatment Effect on the Treated (ATT) 
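Using the same potential outcome notation, the two estimands are usually defined as follows. The treatment indicator D_i (1 if unit i is treated, 0 otherwise) is an assumed symbol:

```latex
\text{ATE} = E[\,Y_{1i} - Y_{0i}\,]
\qquad
\text{ATT} = E[\,Y_{1i} - Y_{0i} \mid D_i = 1\,]
```

The ATE averages the individual effects over all units, while the ATT averages them only over the units that actually received the treatment.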
Omitted variable bias 
What is NOT omitted variable bias: 
• Variables that influence likelihood of getting treatment but have absolutely no
independent relationship with outcome
• e.g. A thunderstorm makes large-scale protests less likely to happen (treatment), but
(arguably) has no independent relationship with the federal government's willingness to
implement social change
• In practice, quite difficult to find examples of things that truly have no independent
relationship with outcome
• Variables that influence outcome but have no relationship with treatment
• e.g. The amount of sleep is correlated with adult height (outcome), but has no
relationship with the amount of milk consumed during childhood (treatment)
• Also similar: Variation in how treated observations take up a treatment
• e.g. Whether people wear masks correctly influences COVID-19 infection likelihood
(outcome), but does not influence likelihood of wearing a mask (treatment)
Omitted variable bias 
• Selection bias: 
• A characteristic of an individual that makes them systematically more or less 
likely to select themselves into the treatment condition AND exhibit 
systematically different outcome 
• e.g. Diligence. Diligent students are more likely to attend review session (the 
treatment) and also tend to score higher in exams (the outcome) 
• Endogeneity (aka reverse causality) 
• Where an individual's outcome influences their tendency to get treatment
• e.g. Healthy people tend to eat well and engage in regular exercise, which in
turn improves health
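The diligence example can be illustrated with a small R simulation. Everything here is assumed for the sketch: the variable names, the true effect of 3 points, and the fact that diligence is observable (in practice it usually is not, which is exactly the problem).

```r
set.seed(171)
n <- 5000
diligence <- rnorm(n)                                # trait that drives both selection and outcome
attend <- as.numeric(diligence + rnorm(n) > 0)       # diligent students attend the review session more
score <- 70 + 3 * attend + 5 * diligence + rnorm(n)  # assumed true effect of attending = 3

# Naive comparison: difference in mean scores between attendees and non-attendees
naive <- mean(score[attend == 1]) - mean(score[attend == 0])

# Comparison that holds diligence constant
adjusted <- unname(coef(lm(score ~ attend + diligence))["attend"])

c(naive = naive, adjusted = adjusted)
```

The naive difference mixes the effect of attending with the effect of being diligent, so it overstates the assumed effect; the adjusted estimate is close to it.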
Identification strategies 
1. Experimental methods: Randomized Controlled Trials
2. Non-experimental methods
• Matching 
• Regression 
• Difference-in-Differences 
Randomized experiments 
• What are the stages of an experiment? 
• What does random assignment do? 
• How to estimate the treatment effect in an experiment? 
• How to improve precision? 
• Assumptions? 
• Advanced designs? 
Stages of an experiment 
Random Assignment Prevents Omitted Variable Bias
Estimation in randomized experiments 
We use the difference in means estimator, and test for its statistical 
significance using a t-test. 
All of this is available in R through the lm() function:
model <- lm(Y ~ treat, data)
summary(model)
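A quick check on simulated data that lm() reproduces the difference in means estimator and its t-test. The data frame df, its columns, and the effect size of 0.5 are assumptions made for this sketch.

```r
set.seed(171)
n <- 200
treat <- sample(rep(c(0, 1), each = n / 2))  # random assignment: half treated, half control
Y <- 1 + 0.5 * treat + rnorm(n)              # assumed true effect of 0.5
df <- data.frame(Y, treat)

# Difference in means, computed by hand
diff_means <- mean(df$Y[df$treat == 1]) - mean(df$Y[df$treat == 0])

# The same quantity is the coefficient on treat in lm()
model <- lm(Y ~ treat, data = df)
lm_estimate <- unname(coef(model)["treat"])
lm_p <- coef(summary(model))["treat", "Pr(>|t|)"]

# Two-sample t-test with pooled variance gives the same p-value
tt <- t.test(Y ~ treat, data = df, var.equal = TRUE)

c(diff_means = diff_means, lm_estimate = lm_estimate, lm_p = lm_p, ttest_p = tt$p.value)
```

The match with t.test() is exact only with var.equal = TRUE, because lm() assumes a single error variance for both groups.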
Accuracy vs. Precision 
(Unbiasedness vs. Reliability) 
How to increase precision: 
Increase the size of our sample 
• Larger sample -> law of large numbers kicks in -> lower impact of extreme outliers
Make our treatment group smaller than control group
• Technically reduces precision, but allows you to have a much bigger sample size for the
same cost
Controlling for pre-treatment variables
• Reduces variation in the outcome that's not caused by variation in treatment status
Differencing our outcome variable
• Reduces variation in the outcome that's not caused by variation in treatment status
Blocking on pre-treatment variables
• Increases similarity between treated and control groups with regard to blocked variables
Clustering
• Actually decreases precision in exchange for less costly implementation AND a reduced
chance of spillover effects
How to increase precision: 
Precision is reflected in standard error 
Standard error: The standard deviation of a sampling 
distribution of an estimate 
Lower precision -> Larger standard error compared to the
estimated treatment effect -> higher p-value
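The definition of the standard error as the standard deviation of a sampling distribution can be made concrete by repeating a simulated experiment many times. The helper one_estimate(), the effect size, and the sample sizes are assumptions for this sketch.

```r
set.seed(171)

# Run one simulated experiment with n units and return the difference in means
one_estimate <- function(n) {
  treat <- sample(rep(c(0, 1), each = n / 2))
  Y <- 0.5 * treat + rnorm(n)
  mean(Y[treat == 1]) - mean(Y[treat == 0])
}

# Standard deviation of the estimates across many repeated experiments
se_small <- sd(replicate(2000, one_estimate(100)))
se_large <- sd(replicate(2000, one_estimate(400)))

c(se_small = se_small, se_large = se_large)
```

Quadrupling the sample size roughly halves the spread of the estimates, which is why a larger sample increases precision.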
Assumptions 
Excludability:
Only the treatments and nothing else outside the researcher's
control are "assigned" to the groups
Non-interference/No spillovers/SUTVA:
One unit's treatment status should not influence another unit's
outcome
Assumptions can never be tested! 
Advanced designs 
Multiple treatment arms 
• One group receives no treatment 
• One group receives treatment A1 
• One group receives treatment A1 + A2 
• One group receives treatment A1 + A2 + A3…
• Effect of each component estimated by comparing one group with the one
immediately before it
Factorial experiment (Interaction effects) 
• One group receives no treatment 
• One group receives treatment A 
• One group receives treatment B 
• One group receives treatment A + B 
• Interaction effect estimated by comparing the A+B effect with the sum of A's and B's
effects
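The factorial comparison can be sketched in R with simulated data. The effect sizes and the helper cell() are assumptions for illustration; the point is that the interaction computed from the four group means equals the A:B coefficient from lm().

```r
set.seed(171)
n <- 4000
A <- rbinom(n, 1, 0.5)                           # randomly assigned treatment A
B <- rbinom(n, 1, 0.5)                           # randomly assigned treatment B
Y <- 1 + 2 * A + 1 * B + 1.5 * A * B + rnorm(n)  # assumed interaction of 1.5

# Mean outcome in each of the four groups
cell <- function(a, b) mean(Y[A == a & B == b])
effect_A  <- cell(1, 0) - cell(0, 0)
effect_B  <- cell(0, 1) - cell(0, 0)
effect_AB <- cell(1, 1) - cell(0, 0)

# Interaction: the A+B effect minus the sum of the separate effects
interaction_by_hand <- effect_AB - (effect_A + effect_B)

# The same number is the coefficient on A:B in a regression with both treatments
model <- lm(Y ~ A * B)
interaction_lm <- unname(coef(model)["A:B"])

c(by_hand = interaction_by_hand, lm = interaction_lm)
```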
Non-experimental designs 
When do we do this?
• We have some treated and control units
• We didn't assign the treatment
Methods 
• Matching 
• Regression 
• Diff-in-diff 
What we covered 
• Intuition 
• Assumptions 
• Code 
Matching: Intuition 
• For each treated unit, find one control/untreated unit that resembles 
it the most in pre-treatment variables 
• Discard all control observations that have no match 
• Then, pretend we have an experiment and perform the same analysis 
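The steps above can be sketched in base R without any package. This is a hand-rolled one-to-one nearest-neighbour match on a single variable, with controls reused when they are the closest match for several treated units; it is an illustration under assumed data, not the Match() implementation.

```r
set.seed(171)
n <- 500
age <- rnorm(n, 40, 10)                         # pre-treatment variable
treat <- rbinom(n, 1, plogis((age - 40) / 10))  # older units are more likely to be treated
Y <- 2 * treat + 0.1 * age + rnorm(n)           # assumed true effect of 2

treated  <- which(treat == 1)
controls <- which(treat == 0)

# For each treated unit, find the control unit closest in age
match_idx <- sapply(treated, function(i) {
  controls[which.min(abs(age[controls] - age[i]))]
})

# Controls that were never picked are simply not used;
# the estimate is the mean difference within matched pairs
att_hat <- mean(Y[treated] - Y[match_idx])
naive   <- mean(Y[treated]) - mean(Y[controls])

c(naive = naive, matched = att_hat)
```

Because age drives both treatment and outcome here, the naive comparison is too large, while the matched comparison is close to the assumed effect.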
Matching: Assumptions 
• Selection on observables: 
• Whatever drives selection into treatment or control group has already been
observed and measured
• Two units that have the same observed pre-treatment variables have the
same likelihood of being in treated or control group.
• Their eventual treatment status is "as-if" random
Matching: Code 
Matching and estimation are performed through the Match() function in the Matching package
library(Matching)
match.model <- Match(Y, Tr, X, M = 1, exact = <TRUE/FALSE>, replace = <TRUE/FALSE>,
                     estimand = "<estimand>", BiasAdjust = <TRUE/FALSE>)
summary(match.model)
Y           A vector of outcomes. Example: df$outcome
Tr          A vector of treatment status. Example: df$treat
X           A vector or matrix of pre-treatment variables to match on. Example:
            df[,c("age","income","educ")]
M           Number of matches per treated unit
exact       Whether to do exact matching
replace     Whether to reuse matched control units
estimand    Which quantity to estimate (e.g. "ATE" or "ATT")
BiasAdjust  Whether to do extra regressions to adjust for remaining imbalances. Needs
            replace=TRUE to work.
Matching: Code 
• Exact matching: Set the argument exact=TRUE in Match() function 
• Tips: Try to use only categorical or binary variables 
• Distance matching: Set the argument exact=FALSE in Match() function 
• Default is normalized Euclidean distance, which is somewhat similar to Mahalanobis 
distance 
• Propensity score matching 
• Manually calculate propensity score:
model.prop <- lm(treat ~ x1 + x2 + x3, data = df)
prop <- model.prop$fitted.values
• Then put the vector of fitted values into the argument X=prop in Match() function
match.model <- Match(Y, Tr, X = prop, M = 1,
                     exact = <TRUE/FALSE>, replace = <TRUE/FALSE>,
                     estimand = "<estimand>", BiasAdjust = <TRUE/FALSE>)
summary(match.model)
Matching: Code 
Balance tests are performed through the MatchBalance() function
MatchBalance(formul, data, match.out)
formul     Treatment status variable on left, pre-treatment
           variables on right. Example: treat~x1+x2+x3
data       The dataset containing observations to match
match.out  Output of a Match() function. Include when you
           want to compare before vs. after match
Regression: Intuition 
• Do not discard any unit 
• Include all pre-treatment variables into a regression model, and take
advantage of its power to statistically "hold everything constant"
• We consider the coefficient of the treatment variable our estimated
treatment effect
• It's like magic, but cooler
Regression: Assumptions 
• Selection on observables 
• Linear relationships of variables on outcome 
• A bunch of other assumptions about the standard errors 
Regression: Code 
Simply use the lm() function
model <- lm(Y ~ treat + x1 + x2 + x3…, data)
summary(model)
Regression: Code 
If you include a categorical variable in the model, or convert a 
numerical variable into categorical using as.factor(variable), R will 
perform a fixed effects regression 
• Do this when you suspect observations from different groups behave 
differently in ways you cannot fully measure 
• When reading regression outcomes, focus on estimated treatment
effect and standard error of the treatment; don't worry too much
about the many estimates of the fixed effects
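A short R sketch of the as.factor() approach on simulated data. The grouping variable state, the group-level differences, and the true effect of 1 are assumptions for illustration.

```r
set.seed(171)
n <- 600
state <- sample(1:6, n, replace = TRUE)          # group identifier stored as a number
state_effect <- c(-2, -1, 0, 1, 2, 3)[state]     # group-level differences we cannot measure directly
treat <- rbinom(n, 1, plogis(state_effect / 2))  # some groups are treated more often than others
Y <- 1 * treat + state_effect + rnorm(n)         # assumed true effect of 1
df <- data.frame(Y, treat, state)

# Without fixed effects, group differences leak into the treatment estimate
pooled <- lm(Y ~ treat, data = df)

# as.factor() turns state into one indicator per group (fixed effects)
fe <- lm(Y ~ treat + as.factor(state), data = df)

# Read off the treatment row; the state indicators are not of interest
coef(summary(fe))["treat", c("Estimate", "Std. Error")]
```

With six groups the model reports one intercept, one treatment coefficient, and five group indicators; only the treatment row needs interpreting.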
Difference in differences: Intuition 
• Two groups, two time periods 
• In first period, no group receives treatment 
• In second period, one group receives treatment 
• We measure (1) how the first group's outcome changes between the 2
periods, and (2) how the second group's outcome changes between the 2
periods
• Take the difference between (2) and (1) to find the treatment effect
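The two-group, two-period calculation can be sketched in R on simulated long-format data. The column names, the helper m(), and the true effect of 3 are assumptions for illustration; here the treated group plays the role of the second group.

```r
set.seed(171)
n <- 2000
treat <- rep(c(0, 1), each = n / 2)   # group that eventually gets treatment
after <- rep(c(0, 1), times = n / 2)  # second period indicator
Y <- 5 + 1 * treat + 2 * after + 3 * treat * after + rnorm(n)  # assumed effect of 3
df <- data.frame(Y, treat, after)

# Mean outcome for a given group and period
m <- function(g, t) mean(df$Y[df$treat == g & df$after == t])

change_control <- m(0, 1) - m(0, 0)   # (1) change in the untreated group
change_treated <- m(1, 1) - m(1, 0)   # (2) change in the treated group
did_by_hand <- change_treated - change_control

# The same number is the interaction coefficient in lm()
model <- lm(Y ~ treat * after, data = df)
did_lm <- unname(coef(model)["treat:after"])

c(by_hand = did_by_hand, lm = did_lm)
```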
Difference in differences: Assumptions 
• Parallel trends: Outcomes of the treated group would have moved the
same way as those of the control group in the absence of treatment
• Stable Composition: Groups have same membership over time
• "Nothing else happens": Treatment is the only thing that happens to one
group and not the other after the treatment
Difference in differences: Code 
Estimation is performed through lm() function 
First, find out if your data is in the long or in the wide format 
Long format:
model <- lm(Y ~ treat + after + treat*after + x1 + x2, data)
summary(model)
treat        whether observation comes from group that eventually
             gets treatment
after        whether observation is in post-treatment period
treat:after  its coefficient is the difference-in-differences estimate
x1, x2       additional controls
Difference in differences: Code 
Wide format:
df$diff <- df$Y1 - df$Y0
model <- lm(diff ~ treat + x1 + x2, data = df)
summary(model)
Y1, Y0  outcome measured after and before treatment (assumed column names)
treat   whether observation comes from group that eventually
        gets treatment
x1, x2  additional controls
Difference in differences: Code 
Be aware that the standard errors of diff-in-diffs estimates are often 
wrong 
Solutions: Clustered standard errors, HC standard errors, 
bootstrapping, etc. 
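One of the listed fixes, bootstrapping, can be sketched in base R by resampling whole clusters. Everything here is assumed for illustration: the simulated data, 40 clusters with a shared shock per cluster and period, and 500 bootstrap draws. Packages exist for clustered and HC standard errors, but they are not used in this sketch.

```r
set.seed(171)
G <- 40                                      # number of clusters (assumed)
df <- expand.grid(unit = 1:10, after = 0:1, cluster = 1:G)
df$treat <- as.numeric(df$cluster <= G / 2)  # first half of the clusters is treated
shock <- matrix(rnorm(2 * G), nrow = G)      # one shared shock per cluster and period
df$Y <- 1 + df$treat + df$after + 2 * df$treat * df$after +
  shock[cbind(df$cluster, df$after + 1)] + rnorm(nrow(df))

# Diff-in-diff estimate from the long-format regression
did <- function(d) unname(coef(lm(Y ~ treat * after, data = d))["treat:after"])

# Default lm() standard error, which ignores the shared shocks
lm_se <- coef(summary(lm(Y ~ treat * after, data = df)))["treat:after", "Std. Error"]

# Cluster bootstrap: resample whole clusters with replacement, re-estimate each time
by_cluster <- split(df, df$cluster)
boot <- replicate(500, did(do.call(rbind, by_cluster[sample(G, G, replace = TRUE)])))
cluster_boot_se <- sd(boot)

c(lm_se = lm_se, cluster_boot_se = cluster_boot_se)
```

When observations in the same cluster and period move together, the default standard error is too small, and the cluster bootstrap gives a noticeably larger one.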
How can I remember all of this? 
The answer: No, you can't
… but that's alright
It's alright to forget stuff
Causal Inference 
• You're gonna forget all the Y1, Y0 stuff
• But you've seen how good research is done
Statistics
• You're gonna forget bias correction and clustered SEs
• But you know good statistical analysis is not scary
• You're gonna forget all the messy arguments or how to fix a for loop
• But hopefully you're not afraid of writing code anymore
Key take-aways 
Correlation is not causation 
• Mainly because of selection bias 
Compare like with like 
• Find methods to eliminate selection bias 
Think of the counterfactuals 
• Use statistics to predict counterfactuals 
No substitute for good on-the-ground research
• Assumptions are examined through intense detective work and "knowing the case"
Seek evidence to falsify your beliefs, not to confirm them
• Hypothesis testing matters in real life 
What can you do with this knowledge 
• Jobs in business analytics, government, or non-profit sector 
• Data analysis 
• Consulting 
• Field research 
• Bridge the ideological gap in debates on social issues 
• Support your causes 