ECMT1020 Introduction to Econometrics Week 2, 2023S1
Lecture 2: Distributions, Samples, and Estimators
Instructor: Ye Lu
Please read Chapters R.5–R.8 of the textbook.
Contents
1 Four Important Probability Distributions
1.1 Normal distribution
1.2 t distribution
1.3 Chi-squared distribution
1.4 F distribution
2 Samples and Estimators
2.1 Sampling and double structure of a sampled random variable
2.2 Estimators
2.3 Bias and Variance
2.4 Loss functions and mean squared error
3 Exercises
1 Four Important Probability Distributions
In the previous lecture, we discussed discrete and continuous random variables and their
probability distributions. Before today's review of sampling, estimators, and hypothesis
testing, we first review/introduce four probability distributions, all continuous, that turn
out to be important in the statistical inference we will use. They are the normal
distribution, the t distribution, the F distribution, and the chi-squared (χ²) distribution.
1.1 Normal distribution
The normal distribution is the most commonly used distribution in econometrics. The
probability density function (pdf) of the normal distribution is symmetric and beautifully
bell-shaped. It is fully determined by the mean/expectation μ ∈ ℝ and variance σ² > 0 of the
distribution, and has the form

f(x|μ, σ²) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)). (1)
The structure of the normal distribution is shown in Figure 1. Think about the questions below:
– What are the mean, median, and mode of the normal distribution?
– How will the shape of the normal pdf change when σ² becomes larger or smaller?
Figure 1: Structure of the normal distribution (Figure R.12 in the textbook)
When μ = 0 and σ = 1, the normal distribution is called the standard normal distribution,
and the pdf of the standard normal distribution is usually denoted φ(x):

φ(x) := f(x|0, 1) = (1/√(2π)) exp(−x²/2), (2)

where f(x|μ, σ²) is the normal pdf defined in (1). Note that we have the relationship

f(x|μ, σ²) = (1/σ) φ((x − μ)/σ).
Therefore, every normal pdf can be considered as derived from the standard normal pdf by
the following three steps:
1. relocate the center of the standard normal pdf from 0 to μ;
2. stretch/scale the whole domain of the standard normal pdf by a factor of σ;
3. multiply the pdf by 1/σ. (This is to ensure the pdf integrates to 1.)
For this reason (steps 1 and 2 above in particular), μ is called the location parameter and
σ is called the scale parameter of the normal distribution.
We call a random variable, say X, a 'normal random variable', or say it is 'normally
distributed', if it follows a normal distribution. Given the location parameter μ and scale
parameter σ, we write

X ∼ N(μ, σ²).

It is clear that E(X) = μ and Var(X) = σ².
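The relationship f(x|μ, σ²) = (1/σ)φ((x − μ)/σ) is easy to verify numerically. Below is a
minimal sketch, assuming NumPy and SciPy are available; the particular values of μ and σ are
arbitrary choices for the illustration.

    import numpy as np
    from scipy.stats import norm

    # Arbitrary location and scale parameters for the check
    mu, sigma = 2.0, 1.5
    x = np.linspace(-5.0, 10.0, 201)

    # Left-hand side: the N(mu, sigma^2) pdf evaluated directly
    lhs = norm.pdf(x, loc=mu, scale=sigma)

    # Right-hand side: relocate, rescale, and multiply by 1/sigma (steps 1-3 above)
    rhs = norm.pdf((x - mu) / sigma) / sigma

    print(np.allclose(lhs, rhs))  # True: the two expressions agree pointwise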
1.2 t distribution
The t distribution, or Student*s t distribution, arises in statistics when estimating the pop-
ulation mean of a normally distributed random variable in situations where the sample size
is small and the population variance is unknown. The Student*s t distribution was named
after English statistician William Sealy Gosset under the pseudonym of &Student*.
The pdf of the t distribution is¹

f(x|ν) = [Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))] (1 + x²/ν)^(−(ν+1)/2),

where Γ(·) is the gamma function, and the parameter ν > 0 is called the 'degrees of freedom'
of the t distribution.
The pdf of the t distribution is also symmetric and bell-shaped; however, it has 'fatter tails'
than the normal distribution. Note two special cases of the t distribution:
– When ν = 1, the t distribution becomes the well-known Cauchy distribution, whose
expectation/mean is not well defined (the defining integral diverges).
– As ν → ∞, the t distribution converges to the standard normal distribution.²
The notation for a random variable X following the t distribution with ν degrees of freedom
is

X ∼ t(ν) or X ∼ t_ν.
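To see the 'fatter tails' concretely, the following sketch (assuming SciPy is available)
compares the two-sided tail probability P(|X| > 3) under t distributions with increasing
degrees of freedom against the standard normal; the cutoff of 3 is an arbitrary choice for
illustration.

    from scipy.stats import t, norm

    # Two-sided tail probability P(|X| > 3) for several degrees of freedom
    for nu in (1, 5, 30, 100):
        print(f"t({nu}): P(|X| > 3) = {2 * t.sf(3, df=nu):.4f}")

    # The same tail probability under the standard normal, for comparison
    print(f"N(0,1): P(|X| > 3) = {2 * norm.sf(3):.4f}")

As ν grows, the t tail probability shrinks toward the normal one from above, consistent with
the second special case above.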
1.3 Chi-squared distribution
The chi-squared (or sometimes chi-square or χ²) distribution with k 'degrees of freedom',
denoted χ²(k) or χ²_k, is the distribution of a sum of the squares of k independent standard
normal random variables. Here the parameter k is a positive integer.
In other words, if Z_1, . . . , Z_k are independent standard normal random variables, then
the random variable

X := Σ_{i=1}^k Z_i² = Z_1² + ··· + Z_k²

follows the chi-squared distribution with k degrees of freedom, and the notation is

X ∼ χ²(k) or X ∼ χ²_k.
Clearly, a chi-squared random variable can only take values in [0, ∞). The chi-squared
distribution is commonly used to obtain critical values for 'asymptotic' tests.³
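A quick simulation makes the definition concrete. The sketch below (assuming NumPy is
available; k and the number of replications are arbitrary choices) sums the squares of k
independent standard normal draws and checks the resulting sample moments against the known
mean k and variance 2k of the χ²(k) distribution.

    import numpy as np

    rng = np.random.default_rng(0)
    k, n_sims = 4, 100_000

    # Each row holds k independent standard normal draws;
    # the sum of their squares is one chi-squared draw.
    z = rng.standard_normal((n_sims, k))
    x = (z ** 2).sum(axis=1)

    # Sample mean and variance should be close to k and 2k
    print(x.mean(), x.var())  # approximately 4 and 8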
¹ You don't need to remember this formula, but I include it here for completeness.
² This is because lim_{ν→∞} f(x|ν) = φ(x), where φ(x) is the pdf of the standard normal
distribution defined in (2). If you are curious why this is the case, first note the following
two results on limits of functions:
– As one way of defining the exponential function, we have lim_{ν→∞} (1 + x/ν)^ν = e^x for any x.
– By using Stirling's approximation of the gamma function, we have
lim_{ν→∞} Γ((ν + 1)/2) / (√ν Γ(ν/2)) = 1/√2.
Then we have, as ν → ∞,

f(x|ν) = [Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))] (1 + x²/ν)^(−(ν+1)/2) → (1/√2) (1/√π) e^(−x²/2) = φ(x).
³ 'Asymptotic' tests are hypothesis tests conducted when the sample size is large enough to
be treated as approximately infinite.
1.4 F distribution
The F distribution with two 'degrees of freedom' parameters, ν_1 and ν_2, is the distribution
of a random variable X defined as

X := (Q_1/ν_1) / (Q_2/ν_2),

where
– Q_1 ∼ χ²(ν_1) and Q_2 ∼ χ²(ν_2);
– Q_1 and Q_2 are independent.
The notation is

X ∼ F(ν_1, ν_2).
The F distribution was tabulated in a 1934 paper by Snedecor, who introduced the notation F
because the distribution is related to Sir Ronald Fisher's work on the analysis of variance.
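The defining ratio is easy to check by simulation. The sketch below (assuming NumPy and SciPy
are available; the degrees of freedom are arbitrary choices) builds F draws from two
independent chi-squared samples and compares a simulated quantile with the theoretical one.

    import numpy as np
    from scipy.stats import f

    rng = np.random.default_rng(0)
    nu1, nu2, n_sims = 3, 10, 200_000

    # Build F draws directly from the definition:
    # a ratio of independent chi-squared variables, each divided by its df.
    q1 = rng.chisquare(nu1, n_sims)
    q2 = rng.chisquare(nu2, n_sims)
    x = (q1 / nu1) / (q2 / nu2)

    # The simulated 95th percentile should match the theoretical F(3, 10) quantile
    print(np.quantile(x, 0.95), f.ppf(0.95, nu1, nu2))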
2 Samples and Estimators
The unifying methodology of modern econometrics was articulated by Trygve Haavelmo
in his seminal paper "The probability approach in econometrics" (1944, [link]). In this
paper, Haavelmo argued that quantitative economic models must necessarily be 'probability
models' instead of deterministic models, because the latter are blatantly inconsistent with
observed economic quantities. Once we acknowledge that
1. an economic model is a probability model, and
2. observational⁴ economic data are 'realizations' of some random variables whose population
distributions are not fully known,
it follows naturally that an appropriate way to quantify, estimate, and conduct inferences
about the economic phenomena should be through the powerful theory of mathematical
statistics.
2.1 Sampling and double structure of a sampled random variable
In a certain application, to infer some population characteristics of a random variable or
to infer the relationship among a set of random variables, an econometrician uses a set of
repeated measurements on these variables. For example, in a labor application the variables
could include weekly earnings, years of education, age, gender, among others. We call these
measurements the data, dataset, or sample. We use the term observations to refer to distinct
repeated measures on the variables.
– An individual observation may correspond to a specific economic unit, such as the
income of a person, household, firm, city, country, etc. → cross-sectional observations
⁴ Most economic data are 'observational' rather than 'experimental', the latter being more
common in the natural sciences. This is because conducting experiments in social and economic
studies is oftentimes condemned as immoral or is simply impossible. The constraint of having
only observational data makes the inference of 'causality' particularly hard in econometrics.
– An individual observation may also correspond to a measurement at a point in time,
such as quarterly GDP or a daily stock price. → time-series observations
Now let's formulate things mathematically. Let X be the random variable we are interested
in, and suppose we want to take a sample of n observations to infer, say, the population mean
of X. A subtle but important point here is how we understand these n observations in our
sample.
Before (Pre) the sample is generated, the n potential observations of X are considered
as a set of n random variables which follow the same distribution as that of X. Fol-
lowing our convention of using upper case Roman letters to denote random variables,
we denote the n observations of X in a sample as
{X1, X2, . . . , Xn}.
In particular, we call {Xi : i = 1, . . . , n} a random sample if they are (1) mutually
independent, and (2) identically distributed (i.i.d.) across i = 1, . . . , n. In the following,
unless mentioned otherwise, the samples we will discuss are random samples.
After (Post) the sample is generated, the observations of X become n specific numbers.
We denote these numbers as
{x1, x2, . . . , xn},
using lower case letters. A statistician would refer to {x1, . . . , xn} as a realization of
the random variable X.
Understanding this 'double structure' of a sampled random variable, before and after the
sample is generated, is crucial for understanding the (pre-sample) analysis of the properties
of estimators and the procedure of hypothesis testing.
2.2 Estimators
An estimator can, in general, be considered as a function of the sample. It takes all the
observations in the sample, X_1, . . . , X_n, as inputs, and produces an output quantity (based
on a particular rule) to estimate a certain population characteristic of the random variable
X. For example, suppose we have a random sample {X_1, . . . , X_n} of X and a random sample
{Y_1, . . . , Y_n} of Y:
– If we want to estimate the (population) mean μ_X of X, then we may consider the
'sample mean'

X̄ = (1/n) Σ_{i=1}^n X_i.

– If we want to estimate the (population) variance σ²_X of X, then we may consider the
'sample variance'

σ̂²_X = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)².

– If we want to estimate the (population) covariance σ_XY of X and Y, then we may
consider the 'sample covariance'

σ̂_XY = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ).

Question: why do we divide by n − 1 rather than n in the formulas of σ̂²_X and σ̂_XY?
(Read textbook R.7, pages 33–34.)

– If we want to estimate the (population) correlation coefficient ρ_XY of X and Y, then
we may consider the 'sample correlation coefficient'

ρ̂_XY = σ̂_XY / (σ̂_X σ̂_Y).
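As a concrete illustration, the sketch below (assuming NumPy is available; the data are
simulated purely for the example) computes all four estimators. Note that ddof=1 gives the
n − 1 divisor discussed in the question above.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x = rng.normal(loc=2.0, scale=1.0, size=n)
    y = 1.0 + 0.5 * x + rng.normal(size=n)  # y is correlated with x by construction

    x_bar = x.mean()                         # sample mean
    var_x = x.var(ddof=1)                    # sample variance (divides by n - 1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]      # sample covariance (divides by n - 1)
    rho_xy = cov_xy / np.sqrt(var_x * y.var(ddof=1))  # sample correlation

    print(x_bar, var_x, cov_xy, rho_xy)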
Note that all the quantities above, namely X̄, σ̂²_X, σ̂_XY, and ρ̂_XY, are estimators. Two questions:
1. Can we talk about the probability distributions of these estimators? Why?
2. If yes, then what are the mean and variance of X̄, for example? How do the mean and
variance of X̄ depend on the sample size n?
The key here is, again, the distinction between the potential distribution of the estimator
before the sample is generated, and the actual realization after the sample is generated.
Before the sample is generated, an estimator is a function of the observations in the
sample(s) which are considered as random variables. Therefore, an estimator is also a
random variable in the pre-sample analysis.
After the sample is generated, the random variables (Xi or Yi) in the formula of the
estimator can be replaced by their actual realizations (xi or yi). We describe the
realized value of the estimator as the estimate, which is just a specific number.
See Figure 2 for an illustration of this 'double structure' of an estimator, inherited from the
double structure of a sampled random variable.
Having fixed the idea that an estimator is a random variable which follows a certain
probability distribution in the pre-sample analysis, we can now talk in general about the
mean/expectation and the variance of an estimator. The study of these two characteristics
of the distribution of an estimator leads us to the analysis of two important properties
of an estimator, namely unbiasedness and efficiency.
2.3 Bias and Variance
Let's adopt some generic notation. Let Z = Z(X_1, . . . , X_n) be an estimator of the value of
a population characteristic (say, the mean or variance), denoted θ. We say Z is an unbiased
estimator if

E(Z) = E[Z(X_1, . . . , X_n)] = θ. (3)

If equation (3) does not hold, then we say Z is a biased estimator, and the bias is E(Z) − θ.
Figure 2: Sample and estimator (Table R.5 in the textbook)
– If the bias is negative, or E(Z) < θ, then there is an under-estimation bias.
– If the bias is positive, or E(Z) > θ, then there is an over-estimation bias.
Take the sample mean as an example. Suppose X̄ = (1/n) Σ_{i=1}^n X_i is used to estimate the
unknown population mean μ_X of X. We say X̄ is unbiased if E(X̄) = μ_X.
Questions:
1. Can you show that X̄ is unbiased for μ_X? What if the observations X_1, . . . , X_n are
not mutually independent? Does it matter?
2. Is X̄ the only unbiased estimator for μ_X?
If there is more than one unbiased estimator, how do we compare them? See Figure 3.
The idea of the efficiency comparison⁵ is that we prefer the estimator to have as high a
probability as possible of giving an estimate close to the population characteristic → a pdf
as concentrated as possible around the true value. Another way to put it is that we want the
variance of the estimator to be as small as possible.
⁵ Note that efficiency is a comparative concept; you should use the term only when comparing
different estimators, rather than when summarizing changes in the variance of a single estimator.
Figure 3: Two unbiased estimators and efficiency comparison (Figure R.8 in the textbook)
Mathematically, suppose we have two unbiased⁶ estimators Z_1 = Z_1(X_1, . . . , X_n) and
Z_2 = Z_2(X_1, . . . , X_n) of the population characteristic θ; we say Z_1 is more efficient
than Z_2 if

Var(Z_1) < Var(Z_2),

and vice versa. Note that in this definition we require Z_1 and Z_2 to use the same amount
of information: X_1, . . . , X_n as observations on the random variable X. This is for fair
comparison.
In (R.45)–(R.49), the textbook shows that the sample mean is the most efficient estimator
of the population mean among all estimators of the weighted-average kind, with a simple
illustration for sample size n = 2.
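A small Monte Carlo sketch (assuming NumPy is available; the sample size, weights, and
population parameters are arbitrary choices) illustrates this efficiency ranking: both the
equally weighted sample mean and an unequally weighted average are unbiased, but the former
has the smaller variance.

    import numpy as np

    rng = np.random.default_rng(0)
    n_sims = 100_000

    # Each row is one sample of n = 2 observations, population mean 5 and variance 1
    samples = rng.normal(loc=5.0, scale=1.0, size=(n_sims, 2))

    z_equal = samples.mean(axis=1)                         # weights (0.5, 0.5): the sample mean
    z_unequal = 0.8 * samples[:, 0] + 0.2 * samples[:, 1]  # weights (0.8, 0.2): still unbiased

    # Both averages are close to 5, but the theoretical variances are 0.5 vs 0.68
    print(z_equal.mean(), z_unequal.mean())
    print(z_equal.var(), z_unequal.var())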
2.4 Loss functions and mean squared error
Clearly, both unbiasedness and minimum variance are desirable properties of an estimator.
But sometimes there can be a conflict between these two properties when we choose among
estimators. See Figure 4.
There is no sure answer to the question of which estimator to choose. It all depends
on the circumstances and what criterion one would like to use. In the decision theory of
statistics, a 'loss function', denoted ℓ(Z, θ), is often introduced to quantify the cost of
using an estimator, say Z, to estimate a target parameter θ. The loss function can be very
general as long as it satisfies:
– ℓ(Z, θ) ≥ 0 for any Z;
– ℓ(Z, θ) = 0 if Z = θ.
Given the loss function, the 'optimal' estimator is considered as the one which minimizes the
expected loss E[ℓ(Z, θ)].
Just to name a few examples of loss functions:
– quadratic/squared loss: ℓ(Z, θ) = (Z − θ)²
– linear/absolute loss: ℓ(Z, θ) = |Z − θ|
– Huber loss: quadratic for small values of |Z − θ| and linear for large values of |Z − θ|

Figure 4: Which estimator to choose? (Figure R.9 in the textbook)

⁶ In the textbook, the comparisons of efficiency are mostly confined to unbiased estimators.
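For concreteness, here is a minimal sketch of the three loss functions (assuming NumPy is
available; the Huber threshold delta = 1.0 is an arbitrary choice, and this parametrization
of the Huber loss is one common convention):

    import numpy as np

    def quadratic_loss(z, theta):
        return (z - theta) ** 2

    def absolute_loss(z, theta):
        return np.abs(z - theta)

    def huber_loss(z, theta, delta=1.0):
        # Quadratic inside |z - theta| <= delta; linear outside,
        # with value and slope matched at the threshold.
        err = np.abs(z - theta)
        return np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))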
In particular, the quadratic loss function is the most commonly used. The expected loss
under the quadratic loss function is known as the mean squared error (MSE):

MSE of estimator Z = E[(Z − θ)²].

So the estimator that minimizes the expected loss under quadratic loss is the estimator
that has the smallest mean squared error.
Next we show a useful decomposition of the MSE:

MSE of an estimator = variance of the estimator + (bias of the estimator)². (4)

In mathematical form, let Z be the estimator of θ, and let μ_Z and σ²_Z denote the mean and
variance of Z, respectively. We decompose the MSE of Z as follows:

MSE(Z) = E[(Z − θ)²]
       = E[(Z − μ_Z + μ_Z − θ)²]
       = E[(Z − μ_Z)² + (μ_Z − θ)² + 2(Z − μ_Z)(μ_Z − θ)]
       = E[(Z − μ_Z)²] + (μ_Z − θ)² + 2(μ_Z − θ) E(Z − μ_Z)
       = Var(Z) + Bias²(Z),

where the last step uses E[(Z − μ_Z)²] = σ²_Z = Var(Z), the fact that μ_Z − θ = E(Z) − θ =
Bias(Z) is a constant, and E(Z − μ_Z) = E(Z) − μ_Z = 0.
Because of this decomposition, the MSE is sometimes used to generalize the concept of
efficiency to cover comparisons of biased as well as unbiased estimators.
Example: the MSE of the sample variance as an estimator of the population variance, and the
idea of shrinkage in statistics.
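The decomposition in (4), and the shrinkage idea, can be verified by simulation. The sketch
below (assuming NumPy is available; normal data with known population variance 1 are simulated
for the example) compares the usual n − 1 sample variance with the n-divisor version: the
latter is biased, yet here it attains the smaller MSE, and in both cases MSE ≈ variance + bias².

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_sims, true_var = 10, 200_000, 1.0

    samples = rng.normal(size=(n_sims, n))     # population variance is 1 by construction
    s2_unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1: unbiased
    s2_shrunk = samples.var(axis=1, ddof=0)    # divides by n: biased toward zero

    for name, est in [("n-1 divisor", s2_unbiased), ("n divisor", s2_shrunk)]:
        bias = est.mean() - true_var
        mse = ((est - true_var) ** 2).mean()
        # MSE should equal variance + bias^2 up to simulation noise
        print(name, mse, est.var() + bias ** 2)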
3 Exercises
The questions below are taken from Exercises R.15, R.19–23, and R.30–33 in the textbook. Note
that R.23 and R.30–33 are related to hypothesis testing, which you are supposed to have learned
in a prerequisite course. You should take these exercises as a review, together with reading
the corresponding textbook sections (Chapters R.9–R.13).
R.15 For the special case σ²_X = 1 and a sample of two observations X_1 and X_2, calculate the
variance of the generalized estimator Z = λ_1 X_1 + λ_2 X_2, with λ_1 + λ_2 = 1, of the
population mean. Using the fact that

λ_1² + λ_2² = λ_1² + (1 − λ_1)² = 2λ_1² − 2λ_1 + 1,

obtain the variance of Z for values of λ_1 from 0 to 1 in steps of 0.1, and plot it in a diagram.
Is it important that the weights λ_1 and λ_2 should be exactly equal?
R.19* In general, the variance of the distribution of an estimator decreases when the sample
size is increased. Is it correct to describe the estimator as becoming more efficient?
R.20 If you have two estimators of an unknown population parameter, is the one with the
smaller variance necessarily more efficient?
R.21* Suppose that you have observations on three variables X, Y, and Z, and suppose
that Y is an exact linear function of Z:

Y = λ + μZ,

where λ and μ are positive constants. Show that ρ̂_XZ = ρ̂_XY. (This is the counterpart of
Exercise R.14.)
R.22 A scalar multiple of a normally distributed random variable also has a normal
distribution. A random variable X has a normal distribution with mean 5 and variance 10.
Sketch the distribution of Z = X/2.
R.23 Suppose that a random variable with hypothetical mean 10 may be assumed to have
a normal distribution with variance 25. Given a sample of 100 observations, derive the
acceptance and rejection regions for X̄, (a) using a 5 percent significance test, (b) using a 1
percent test.
R.30 A drug company asserts that its course of treatment will, on average, reduce a person's
cholesterol level by 0.8 mmol/L. A researcher undertakes a trial with a sample of 30
individuals with the objective of evaluating the claim of the drug company. What should he
report if he obtains the following results:
(a) a mean decrease of 0.6 units, with standard error 0.2 units;
(b) a mean decrease of 0.4 units, with standard error 0.2 units;
(c) a mean increase of 0.4 units, with standard error 0.2 units?
R.31 When a local sales tax was abolished, a survey of 20 households showed that mean
household expenditure increased by $160 and the standard error of the increase was $60.
What is the 99 percent confidence interval for the effect of abolishing the sales tax?
R.32 Determine the 95 percent confidence interval for the effect of an increase in the min-
imum wage on employment, given the data in Exercise R.29, for each part of the exercise.
How do these confidence intervals relate to the results of the t tests in that exercise?
R.33* Demonstrate that the 95 percent confidence interval defined by the equation

X̄ − t_crit,2.5% × s.e.(X̄) ≤ μ ≤ X̄ + t_crit,2.5% × s.e.(X̄)

has a 95 percent probability of capturing μ_0 if H_0: μ = μ_0 is true.