
EC 223. Empirical Application in Mathematical Statistics.

Department of Economics.

Part A. Modeling economic phenomena using the concepts of probability distributions.

I. 1. Let X be a continuous random variable with probability density function f(x) = 2 – 2x, defined on the domain 0 ≤ X ≤ 1. Provide a graph of this function, with clearly labeled axes.

2. Show that this function satisfies the following requirements of probability density functions:

a) f(x) ≥ 0

b) $\int_0^1 f(x)\,dx = 1$

3. Using the property of the probability density function that states $P(a \le X \le b) = \int_a^b f(x)\,dx$, where a and b are constants within the domain of the function, compute the probability P(0.25 ≤ X ≤ 0.75).

4. What is the cumulative distribution function, F(X), for this random variable X?

5. Compute the probability P(0.25 ≤ X ≤ 0.75) using the cumulative distribution function, F(X), that you determined in part 4 above.

6. Show the probability P(0.25 ≤ X ≤ 0.75) on the graph of the p.d.f. of random variable X.

7. Compute the expected value of random variable X using calculus.

8. Compute the variance of random variable X, using the following property of the variance: $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$.

9. Provide an example of a possible application of this probability density function in economics.

Part B. Implementing statistical analysis in Stata.

Suppose that an airplane seat designer must consider the average hip size of passengers in order to allow adequate room for each person, while still designing the plane to carry the profit-maximizing number of passengers. What is the average hip size, or more precisely hip width, of U.S. flight passengers? If a seat 18 inches wide is planned, what percent of customers will not be able to fit? Questions like this must be faced by manufacturers of everything from golf carts to women’s jeans. How can we answer these questions? We certainly cannot take the measurements of every man, woman, and child in the U.S. population. This is a situation in which statistical inference is used. Infer means “to conclude by reasoning from something known or assumed.” Statistical inference means that we will draw conclusions about a population based on a sample of data.

To carry out statistical inference, we need data. The data should be obtained from the population in which we are interested. For the airplane seat designer this is essentially the entire U.S. population above the age of two, since small children can fly “free” on the laps of their suffering parents. A separate branch of statistics, called experimental design, is concerned with the question of how to actually collect a representative sample. How would you proceed if you were asked to obtain 50 measurements of hip size representative of the entire population? This is not such an easy task. Ideally the 50 individuals will be randomly chosen from the population, in such a way that there is no pattern of choices. Suppose we focus on only the population of adult flyers, since usually there are few children on planes. Our experimental design specialist draws a sample that is shown in the table below and stored in the data file on Blackboard under the empirical assignment folder.

Table 1. The airline data on hip measurements (in inches).

1. A first step when analyzing a sample of data is to examine it visually. Draw a histogram of the 50 data points. Based on this figure, “eyeball” the average hip size in this sample. What is it approximately?

Answer: the average hip size in this sample seems to be between … .. and … … inches:

Copy the histogram from the Stata output as a picture and paste it here:

2. For our profit-maximizing designer this casual estimate based on the histogram is not sufficiently precise. He wants to set up a statistical model and starts by considering the hip size data that were obtained by sampling. Sampling from a population is an experiment. The variable of interest in this experiment is an individual’s hip size. Before the experiment is performed, we do not know what the values will be, thus the hip size of a randomly chosen person is a random variable. Let us denote this random variable as Y. We choose a sample of N = 50 individuals, Y1, Y2, ..., YN, where each Yi represents the hip size of a different person. The data values in Table 1 are specific values of the variables, which we denote as y1, y2, ..., yN.

The designer assumes that the EC 223 students learned about the basics of experimental design and suggests that the population probability distribution of the values of hip size has a center, which we describe by the expected value of the random variable Y, E(Y) = µ. Recall that we use the Greek letter µ (“mu”) to denote the mean of the random variable Y, and also the mean of the population we are studying. Thus, if we knew µ we would have the answer to the question “What is the average hip size of adults in the United States?” To indicate its importance to us in describing the population we call µ a population parameter, or, more briefly, a parameter. Our objective is to use the sample of data in Table 1 to make inferences, or judgments, about the unknown population parameter µ.

The other random variable characteristic of interest is its variability, which we measure by its variance, σ², which is also an unknown population parameter: Var(Y) = E[Y - E(Y)]² = E[Y - µ]² = σ². In the context of the hip data, the variance tells us how much hip sizes can vary from one randomly chosen person to the next.

The statistical model is not complete. If our sample is drawn randomly, we can assume that Y1, Y2, ..., YN are statistically independent. The hip size of any one individual is independent of the hip size of another randomly drawn individual. Furthermore, we assume that each of the observations we collect is from the population of interest, so each random variable Yi has the same mean and variance, or Yi ~ (µ, σ²). The observations Yi constitute a random sample, in the statistical sense, because Y1, Y2, ..., YN are statistically independent with identical probability distributions (a random i.i.d. sample). It is sometimes reasonable to assume that population values are normally distributed, which we represent by Yi ~ N(µ, σ²).

How shall we estimate the population mean µ given our sample of data values in Table 1? The population mean is given by the expected value E(Y) = µ. The expected value of a random variable is its average value in the population. It seems reasonable, by analogy, to use the average value in the sample, or sample mean, to estimate the population mean, denoted y-bar:

$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$

Compute the sample mean hip size, using data in Table 1.

3. How good is the sample mean as an estimator of the mean of a population?

We do not know the value of the estimator Ȳ until a data sample is obtained, and different samples will lead to different values. To illustrate, we can collect 10 more samples of size N = 50 and calculate the average hip size in each. The estimates differ from sample to sample because Ȳ is a random variable. This variation, due to collection of different random samples, is called sampling variation.

For example,

Table 2. Sample means in the repeated sampling context.

It is an inescapable fact of statistical analysis that the estimator Ȳ, and indeed every statistical estimation procedure, is subject to sampling variability. Because of this terminology, an estimator’s probability density function is called its sampling distribution. We can determine how good the estimator Ȳ is by examining its expected value, variance, and sampling distribution.
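The repeated-sampling idea can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it draws 10 simulated samples of size 50 from a hypothetical normal population with mean 17 and standard deviation 1.8 (values chosen for the example, not taken from Table 1) and prints each sample mean, so that the sample-to-sample variation is visible.

```stata
* Illustrative sketch: simulated data, NOT the hip data from Table 1.
* Assumed (hypothetical) population: normal with mean 17 and sd 1.8.
clear all
set seed 12345
forvalues j = 1/10 {
    quietly {
        drop _all
        set obs 50
        generate double y = rnormal(17, 1.8)
        summarize y
    }
    * r(mean) holds the mean of the sample just drawn
    display "sample " `j' ": mean = " %6.3f r(mean)
}
```

Each run of the loop produces a different estimate of the same population mean, which is what the sampling distribution describes.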

a) Show that the expected value of the estimator Ȳ is the population mean µ that we are trying to estimate. What do we call this desirable property of a sample estimator?

b) Show that the variance of the sample mean is smaller than the population variance. Recall that the variance of Ȳ can be obtained using the procedure for finding the variance of a sum of uncorrelated (zero covariance) random variables. We can apply this rule if our data are obtained by random sampling, because with random sampling the observations are statistically independent, and thus are uncorrelated.

Furthermore, we have assumed that var(Yi) = σ² for all observations.

c) Illustrate the property of consistency of the sample mean by using the graph of its sampling distribution in large samples. Provide a brief comment on your graph.

Figure 1.

d) Suppose we want our estimate of the population hip size to be within 1 inch of the true value, µ. For the purpose of illustration assume that the population is normal, σ² = 10, and N = 40. Compute the probability of getting an estimate of the sample mean that is within ε = 1 inch of the population mean, µ, that is, within the interval [µ-1, µ+1]. Can you tell if this probability would be greater or smaller if the sample size increases from N = 40 to N = 80? Explain your reasoning.

e) We were able to carry out the above analysis because we assumed that the population of hip widths of U.S. adults has a normal distribution. A question we need to ask is “If the population is not normal, then what is the sampling distribution of the sample mean of hip width?” State the fundamental theorem that answers this question.

f) For example, the hip width distribution can be triangular, with the probability density given by f(y) = 2y for 0 < y < 1 and f(y) = 0 otherwise. Compute the mean and variance of the continuous random variable Y, hip width, that has the triangular distribution, and apply the central limit theorem to describe the sampling distribution of its sample mean. Also, draw a sketch of the triangular pdf to understand its name.
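A simulation can make the central limit theorem visible for this triangular density. The following is a hedged sketch, not part of the assignment: the number of replications (10000) and the sample size (N = 30) are arbitrary choices, and the draws use the inverse-CDF method, relying on the fact that F(y) = y² on (0, 1), so the square root of a uniform draw has density 2y.

```stata
* Illustrative simulation; 10000 replications and N = 30 are arbitrary
* choices made for this sketch, not values from the assignment.
clear all
set seed 2024
set obs 10000
* Each row represents one simulated sample. If U is uniform on (0,1),
* sqrt(U) has density f(y) = 2y on (0,1), because F(y) = y^2.
generate double total = 0
forvalues i = 1/30 {
    quietly replace total = total + sqrt(runiform())
}
generate double ybar = total/30
* Standardize the simulated sample means and compare them with N(0,1)
egen double z = std(ybar)
summarize z, detail
histogram z, normal
```

If the theorem applies, the skewness and kurtosis reported for z should be close to 0 and 3, and the histogram should track the overlaid normal curve.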

The graph of the triangular pdf defined on 0≤ y≤ 1 is:

g1) Clear memory and open the data on hip width measurements with the Stata command use filename, clear. Calculate the sample variance in the hip data (denoted y in the data file), using the Stata command sum y.

Indicate the formula for the sample variance estimator. How many degrees of freedom appear in the denominator of the sample variance formula that produces the unbiased estimator of σ²?

Answer: The unbiased estimator of the population variance for the hip data is computed as

Stata output for command summarize y:

g2) Calculate the standard error of the sample mean hip width using the computed variance of y and show the formula for the standard error of the sample mean.

h1) What are the method-of-moments (MoM) estimators of the sample skewness and kurtosis? Provide the MoM formulas of the sample skewness and kurtosis. Compute the skewness and kurtosis of the hip data using the Stata command summarize y, detail.

Answer: MoM formulas for skewness and kurtosis are given by:

The output of Stata command summarize y, detail:

h2) Based on your result in part h1), can you conclude that the hip data follow a normal distribution? Hint: in the normal distribution, skewness is zero (it is symmetric) and kurtosis is three.

h3) There is a more formal statistical test for checking normality of the data. The Jarque-Bera test is based on the statistical difference between the sample values of skewness and kurtosis and their theoretical counterparts in the normal distribution. In other words, the test evaluates whether sample data adhere to the skewness and kurtosis characteristic of a normal distribution. Using the skewness and kurtosis calculated in part h1), conduct the Jarque-Bera test at the 5% level of statistical significance. Hint: in the Jarque-Bera test, the null hypothesis is that the data are normally distributed versus the alternative hypothesis that the data are not normally distributed. Under the null hypothesis, the test statistic, JB, has the chi-squared distribution with two degrees of freedom and is equal to:

$JB = \frac{N}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$

where N is the sample size, S is skewness and K is kurtosis.

If the true distribution is symmetric and has kurtosis three, which includes the normal distribution, then the JB test statistic has a chi-square distribution with two degrees of freedom if the sample size is sufficiently large. If α = 0.05, then the critical value of the chi-square distribution with 2 d.f. is 5.99. We reject the null hypothesis and conclude that the data are nonnormal if JB ≥ 5.99. If we reject the null hypothesis, then we know the data have nonnormal characteristics, but we do not know what distribution the population might have.

h4) The p-value for this test is the tail area of a χ²(2) distribution to the right of the computed test statistic, JB. Compute the p-value for the JB test, using the following series of commands in Stata:

quietly summarize y, detail
scalar number_observations = r(N)
scalar s = r(skewness)
scalar k = r(kurtosis)
* Computing the test statistic JB:
scalar jb = (number_observations/6)*(s^2 + ((k-3)^2)/4)
* Computing the 95th percentile of the chi-square distribution:
scalar chi2_95 = invchi2(2, .95)
* Computing the p-value (chi2() is the cumulative chi-square distribution):
scalar pvalue = 1 - chi2(2, jb)
display "JB test statistic " jb
display "95th percentile chi2(2) " chi2_95
display "p-value is " pvalue

Answer: The p-value for this test is the tail area of a χ²(2) distribution to the right of the JB test statistic, as is computed here:

i) If the population of Y, from which the data are drawn, is normally distributed, then each sample random variable Yi follows a normal distribution. In this case the sample mean, Ȳ, also follows a normal distribution. Is this true or false?

j) Based on the information from the data on the hip size of U.S. adults, if an airplane seat is 18 inches wide, what percentage of customers will not be able to fit? Or, recasting this question in probabilistic terms: what is the probability that a randomly drawn person will have hips larger than 18 inches? Hint: calculate the standardized value of the sample mean (the z-score) and use the standard normal table to find the probability to the right of this z-score.

k) Use the standard normal table or Stata to determine how large a seat would have to be to fit 95% of the population.
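The two calculations in parts j) and k) can be sketched in Stata with the functions normal() and invnormal(). This is a minimal sketch under assumptions: it treats the population as normal and uses the sample estimates quoted in part l) (mean 17.158 and variance 3.265) as stand-ins for the unknown population parameters; the scalar names are chosen here for illustration.

```stata
* Sketch under an assumed normal population, using the sample estimates
* quoted in part l) as stand-ins for the unknown mu and sigma^2.
scalar mu_hat = 17.158
scalar sigma_hat = sqrt(3.265)
* j) probability that a randomly drawn person is wider than 18 inches
scalar z18 = (18 - mu_hat)/sigma_hat
scalar p_nofit = 1 - normal(z18)
display "P(Y > 18) = " p_nofit
* k) seat width that accommodates 95% of the population
scalar width95 = mu_hat + invnormal(0.95)*sigma_hat
display "95th percentile of hip width = " width95
```

Here normal(z) returns the standard normal cumulative probability and invnormal(p) its inverse, so the same numbers can be cross-checked against a printed standard normal table.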

l) We have introduced the empirical problem faced by an airplane seat design engineer. Given a random sample of size N = 50, we estimated the mean U.S. hip width to be ȳ = 17.158 inches. Furthermore, we estimated the population variance to be σ̂² = 3.265, the standard deviation to be σ̂ = √3.265 ≈ 1.807, and the standard error of the sample mean to be se(ȳ) = σ̂/√N ≈ 0.256.

Construct and interpret the 95% confidence interval for the population mean using the hip data. Since the population variance is unknown, use the critical value from the t-distribution with 50 - 1 = 49 degrees of freedom, using the Stata command: scalar tc975 = invttail(df, .025). Hint: You can check the constructed 95% confidence interval by using the command mean y or the command ci y in Stata.
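The construction can be sketched from the summary figures alone. The lines below are a hedged illustration: they hard-code the values quoted in part l) rather than reading the data file, and the scalar names are chosen here for the example. invttail(df, p) is Stata's function for the t value with upper-tail probability p.

```stata
* Sketch using the summary figures quoted in part l); with the data loaded,
* r(mean), r(Var) and r(N) from summarize could be used instead.
scalar ybar = 17.158
scalar nobs = 50
scalar se = sqrt(3.265/nobs)
scalar df = nobs - 1
* Critical value leaving 2.5% in the upper tail of t(49)
scalar tc975 = invttail(df, 0.025)
scalar lb = ybar - tc975*se
scalar ub = ybar + tc975*se
display "95% interval estimate: [" lb ", " ub "]"
```

The interval is centered on the sample mean and its half-width is the critical value times the standard error, so a larger sample narrows it through the standard error.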

m) Hypothesis testing procedures compare a conjecture, or a hypothesis, that we have about a population to the information contained in a sample of data. The conjectures we test here concern the mean of a normal population. In the context of the problem faced by the airplane seat designer, suppose that airplanes since 1970 have been designed assuming the mean population hip width is 16.5 inches. Is that figure still valid today? Implement the following one-tailed hypothesis test at the level of significance of α = 0.05, assuming that the test statistic has the Student t distribution with N-1 degrees of freedom (N is the sample size): the null hypothesis that the population mean hip size is 16.5 inches, against the right-sided alternative that it is greater than 16.5 inches.

Hint: note that the critical value of the test statistic of the right-sided test at the level of significance α = 0.05 is given by the 95th percentile of the Student t distribution. To compute it, use the following commands in Stata:

sum y, detail
scalar nobs = r(N)
scalar ybar = r(mean)
scalar df = nobs - 1
scalar sigma_hat = r(sd)
scalar se = sigma_hat / sqrt(nobs)
scalar t1 = (ybar - 16.5)/se
* invttail(df, p) returns the t value with upper-tail probability p
scalar tc95 = invttail(df, 0.05)
scalar p1 = ttail(df, t1)
display "right-tail test"
display "tstat = " t1
display "tc95 = " tc95
display "p-value = " p1

n) Implement the same hypothesis test in Stata, using the command ttest y == 16.5, place the output in the document, and briefly describe the output. Hint: lots of output is produced by this command. You will need to use the p-value on the right for the right-tail alternative in the given test.

Stata output:

Comment:

o) Test the null hypothesis that the population mean hip size is 17 inches, against the two-sided alternative that it is not equal to 17 inches, at the level of significance of α = 0.05. Assume that the test statistic has the Student t distribution with N-1 degrees of freedom (N is the sample size). Hint: the critical value of the test statistic from the Student t distribution with 49 d.f. for the two-tail test at the level of significance α = 0.05 is the 97.5th percentile of the Student t distribution and can be computed using the following commands in Stata:

sum y, detail
* ybar, se and df are the scalars defined in part m)
scalar t2 = (ybar - 17)/se
scalar tc975 = invttail(df, 0.025)
scalar p2 = 2*ttail(df, abs(t2))
display "two-tail test"
display "tstat = " t2
display "tc975 = " tc975
display "p-value = " p2

p) Calculate and interpret the p-value for the test of the population mean based on the hip data to test H0 : µ = 16.5 against H1 : µ > 16.5 (see part n above). Use Stata’s function ttail(n,t) to compute the upper-tail probability of the Student t distribution with n degrees of freedom to the right of the computed test statistic t (for details, see Stata’s help on functions). Sketch the p-value on the graph of the pdf of the sample mean.

The right-tail p-value is shown in the figure below:

r) Calculate and interpret the p-value for the test H0 : µ = 17 against H1 : µ ≠ 17 (the test from part o above). Show the p-value on the graph of the pdf of the sample mean. Use the following Stata commands to compute the two-tail p-value for this test (see also Stata’s help on functions):

scalar p_value_test = 2*ttail(49, 0.6191)
display "Prob(t(49) > 0.6191) + Prob(t(49) < -0.6191) = " p_value_test

The two-tail p-value is shown in the figure below:
