Statistical Modelling for Business
Q1 Why, in OLS for SLR, is the sample average error $\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = 0$?
(a) Because it is an error term, it has to average 0.
(b) Because each error ei is equal to 0, therefore the average of a set of 0s is 0.
(c) Because when we do OLS, we take the 1st derivative of the RSS, and the derivative
with respect to $\beta_1$ is −2 × the sum of the errors times X, i.e. $-2\sum_{i=1}^{n} e_i X_i$. We
set this sum equal to 0 to get the LS estimate. Thus $\bar{e} = 0$.
(d) Because when we do OLS, we take the 1st derivative of the RSS, and the
derivative with respect to $\beta_0$ is −2 × the sum of the errors, i.e. $-2\sum_{i=1}^{n} e_i$.
We set this sum equal to 0 to get the OLS estimate. Thus $\bar{e} = 0$.
(e) None of the above are correct.
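The two first-order conditions mentioned in options (c) and (d) can be checked numerically. The following is a minimal sketch on a small made-up dataset (the x and y values are hypothetical, chosen only for illustration); it fits the SLR by the usual closed-form OLS formulas and prints the sum of the residuals and the sum of the residuals times X.

```python
# Hypothetical toy data; any sample with non-constant x would do.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS estimates for simple linear regression.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# First-order condition for b0: the residuals sum to zero (up to rounding).
sum_e = sum(residuals)
# First-order condition for b1: the residuals times x sum to zero (up to rounding).
sum_ex = sum(e * xi for e, xi in zip(residuals, x))

print(f"sum of residuals     = {sum_e:.2e}")
print(f"sum of residuals * x = {sum_ex:.2e}")
```

Both sums should be zero apart from floating-point rounding, but only the first one is the statement that the average residual is 0.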
Q2 True or False? LSA 2 states that E(εi|Xi) = 0. This implies that the residual series ε
and X are uncorrelated. Answer: True
Q3 LSA 2 states that E(εi|Xi) = 0. This implies that the error series ε and X are uncorrelated,
because:
(a) Other factors always exist and are implicitly affecting Y through ε, thus ε and X
must be uncorrelated.
(b) ε is an i.i.d error series and hence must be uncorrelated with X.
(c) If they were correlated, then the slope of the regression of εi on Xi would
not be 0, i.e. we could write E(εi|Xi) = γ0 + γ1Xi and γ1 ≠ 0. Thus LSA 2
would not be correct.
(d) Other factors always exist and are implicitly affecting Y through ε, thus ε and X
must be correlated. Hence, LSA 2 does not imply they are uncorrelated.
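The step from E(εi|Xi) = 0 to zero correlation can also be written out directly with the law of iterated expectations. A short sketch, using the same symbols as LSA 2:

$$
\begin{aligned}
E(\varepsilon_i) &= E\big[E(\varepsilon_i \mid X_i)\big] = 0,\\
E(\varepsilon_i X_i) &= E\big[X_i\,E(\varepsilon_i \mid X_i)\big] = 0,\\
\mathrm{Cov}(\varepsilon_i, X_i) &= E(\varepsilon_i X_i) - E(\varepsilon_i)\,E(X_i) = 0.
\end{aligned}
$$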
Q4 Why does RSS always decrease when you add another X variable to the regression model?
(a) Because the new X variable is always significantly related to Y
(b) Because now there is one extra parameter with which to optimize RSS, meaning a
more optimal, hence lower, RSS can be found.
(c) Because the OLS estimate of the new X’s regression slope will not be exactly 0.
(d) It doesn’t, sometimes RSS increases or stays the same, e.g. if the new X variable is
not related to Y.
(e) Both (b) and (c) are true.
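The argument behind options (b) and (c) can be sketched as an inequality. The notation here is an assumption made for illustration: $b_0, \dots, b_{p+1}$ are candidate coefficient values and $X_{p+1}$ is the newly added variable.

$$
\mathrm{RSS}_{p+1}
= \min_{b_0,\dots,b_p,\,b_{p+1}} \sum_{i=1}^{n} \Big(Y_i - b_0 - \sum_{j=1}^{p} b_j X_{ji} - b_{p+1} X_{p+1,i}\Big)^2
\;\le\;
\min_{b_0,\dots,b_p} \sum_{i=1}^{n} \Big(Y_i - b_0 - \sum_{j=1}^{p} b_j X_{ji}\Big)^2
= \mathrm{RSS}_{p}.
$$

The right-hand side is the left-hand minimisation restricted to $b_{p+1} = 0$, so the larger model can never do worse. The inequality is strict whenever the OLS estimate of the new slope is not exactly 0, which is the point made in (c).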
Q5 Would the variable number of children cause OVB regarding the effect of Salary on
Amount Spent?
(a) Number of children would not be correlated with Salary, so: NO.
(b) Number of children is likely correlated with Salary, but it would not be a factor
determining Amount Spent, so: NO.
(c) Number of children is likely correlated with Salary. Also, number of children could
be a factor determining Amount Spent, so: YES.
(d) Even though number of children is a likely determinant of Amount Spent, it would
not be correlated with Salary, so NO.
(e) We should first look at the sample correlation between number of children
and Salary here. Then, decide whether number of children could be a
determinant of Amount Spent.
Q6 Would the variable IQ level cause OVB regarding the effect of Salary on Amount Spent?
(a) It is likely that IQ is correlated with Salary. It is unlikely that IQ is a
determinant of Amount Spent for a company like Direct Marketing which
sells clothing, books and sports gear. So: NO.
(b) IQ would not be correlated with Salary nor would it determine Amount Spent, so
NO.
(c) IQ would be correlated with Salary and thus also be correlated with Amount Spent,
since Salary is correlated with Amount Spent. Thus, YES.
(d) IQ would not be correlated with Salary, but it would help determine Amount Spent,
so NO.
Q7 The Mann-Whitney U test is preferred to the t-test whenever:
(a) The dataset in each group has a large enough sample size, ni, for the central limit
theorem to work, i.e. ni ≥ 30.
(b) The data has no outliers and has a symmetric shaped distribution in each group.
(c) The data has some outliers and it is unclear if $E(Y^4) < \infty$ in each group.
(d) The data are on the ordinal scale.
(e) Both (c) and (d) are correct.
Q8 The t-test for a mean (difference) is very popular and mostly used in practice because:
(a) It has comparatively high power and is also robust to outliers.
(b) Its properties are very well known under the LSA; e.g. BLUE, consistency,
etc.
(c) It has higher power than both the Mann-Whitney and median tests, for data with
infinite 4th moments.
(d) It has lower power than both the Mann-Whitney and median tests, for data with
infinite 4th moments.
(e) None of the above.
Q9 Consider a MLR model with p predictors, estimated by OLS. If another predictor was
added to the model and then the model was re-estimated by OLS, with those p+1 predictors,
then:
(a) $R^2$ would increase and SER would decrease.
(b) $R^2_{adj}$ would increase and SER would decrease.
(c) $R^2$ would increase and RSS would decrease.
(d) $R^2_{adj}$ would increase and RSS would decrease.
(e) None of the above would occur.
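To see why only some of these quantities are guaranteed to move in one direction, here is a small sketch. It assumes the usual textbook definitions, $R^2 = 1 - RSS/TSS$, $R^2_{adj} = 1 - \frac{n-1}{n-p-1}\,RSS/TSS$ and $SER = \sqrt{RSS/(n-p-1)}$; the sample size, TSS and RSS values are hypothetical, chosen so that the extra predictor lowers RSS only slightly.

```python
import math

def fit_summary(rss, tss, n, p):
    """Return (R2, adjusted R2, SER) under the assumed textbook definitions."""
    r2 = 1 - rss / tss
    r2_adj = 1 - (n - 1) / (n - p - 1) * rss / tss
    ser = math.sqrt(rss / (n - p - 1))
    return r2, r2_adj, ser

# Hypothetical numbers: n = 20 observations, TSS = 100.
# Adding a third predictor lowers RSS only slightly, from 40.0 to 39.5.
before = fit_summary(rss=40.0, tss=100.0, n=20, p=2)
after = fit_summary(rss=39.5, tss=100.0, n=20, p=3)

for name, b, a in zip(["R2", "adj R2", "SER"], before, after):
    print(f"{name}: {b:.4f} -> {a:.4f}")
```

In this made-up case $R^2$ rises because RSS falls, while adjusted $R^2$ falls and SER rises, because the drop in RSS is too small to offset the lost degree of freedom.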
Q10 Omitted variable bias usually occurs in an SLR of Y on X whenever:
(a) OLS is used, but not when LAD is used.
(b) we have observational-type data, but not with randomized experimental
data.
(c) we have experimental data, but not with observational data.
(d) we estimate the SLR.
(e) all of the above occur.
Q11 Web designers conducted an A/B test regarding a new "call to action" design on their
website. Visitors to the page were randomly assigned to see only one of either the Original,
or the New, call to action button on the website. It was then recorded whether each
visitor clicked on the call to action button or not. The main question of interest is: Is there
a difference in button click rates between the two designs?
The observed contingency table for this dataset is given below.
              Clicked
Button     Yes     No     Total
Old        351     ??     3642
New        485     ??     3556
Total      836     ??     ??
Fill in the missing values in the contingency table.
There are four cells missing: o12, o22, the "No" column total, and the grand total (the
total sample size). o12 = 3642 − 351 = 3291. o22 = 3556 − 485 = 3071. Sum of "No"
= 3291 + 3071 = 6362. Total sample size = 3642 + 3556 = 7198.
Q12 In the context of Q11, the expected values in the contingency table are calculated as:
              Clicked
Button     Yes        No
Old        ???        3219.01
New        413.01     3142.99
Calculate it for the (1,1) cell (i.e. "Yes" and "Old").
$e_{1,1} = \frac{R_1 \times C_1}{N} = \frac{3642 \times 836}{7198} = 422.99$.
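The same row-total times column-total rule gives all four expected counts at once. A minimal sketch, using the observed counts from the Q11 table (the variable names are illustrative only):

```python
# Observed counts from the Q11 table: rows = Old, New; columns = Yes, No.
observed = [[351, 3291], [485, 3071]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: e_ij = R_i * C_j / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(["Old", "New"], expected):
    print(label, [round(e, 2) for e in row])
```

The rounded values should agree with the expected-value table in Q12, including the (1,1) cell of 422.99.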
Q13 In the context of Q11, are the conditions for Pearson’s chi-squared test satisfied in this
data?
There are 4 cells here, thus we need all 4 cells to have expected values of at least
5. All expected values are ≥ 5 (the smallest is 413.01), as required. We also need
iid data, which could be achieved by a random sample. Whilst the people coming
to the website are not randomly chosen, they are randomly allocated to either
the new or old call to action buttons. Thus, there is a reasonable chance that
this is close to an iid sample. Thus the conditions for Pearson’s test seem well
satisfied.
Q14 In the context of Q11, Pearson’s chi-squared test is conducted, giving a test statistic of
27.67 and a p-value of $1.43 \times 10^{-7}$. What are the hypotheses and conclusion of the test?
The hypotheses are: H0: Type of call to action button and whether the customer
clicks the button are unrelated, or independent. H1: Type of call to action
button and whether the customer clicks the button are related, or dependent.
I choose α = 0.05 as standard. The test statistic is V = 27.67, which follows a $\chi^2$ distribution
with (2 − 1)(2 − 1) = 1 degree of freedom under the null hypothesis. The p-value
is $P(\chi^2_1 \ge 27.67) = 1.43 \times 10^{-7} \approx 0$. The p-value < α = 0.05, so we reject the null
hypothesis and conclude that the variables type of button and clicking the button
are significantly related to each other.
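The test statistic and p-value can be reproduced from the observed counts with the standard library alone. This is a sketch under one assumption that the question does not state: the quoted value of 27.67 appears to include Yates' continuity correction (the default for 2×2 tables in some software), since the uncorrected Pearson statistic comes out slightly larger. The function names are illustrative.

```python
import math

# Observed counts from the Q11 table: rows = Old, New; columns = Yes, No.
observed = [[351, 3291], [485, 3071]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

def chi2_stat(obs, exp, yates):
    # Yates' correction subtracts 0.5 from each |O - E| (used for 2x2 tables).
    adj = 0.5 if yates else 0.0
    return sum(
        (abs(o - e) - adj) ** 2 / e
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
    )

def chi2_sf_1df(v):
    # Upper tail of a chi-squared variable with 1 degree of freedom:
    # P(chi2_1 >= v) = P(|Z| >= sqrt(v)) = erfc(sqrt(v / 2)).
    return math.erfc(math.sqrt(v / 2))

stat_plain = chi2_stat(observed, expected, yates=False)
stat_yates = chi2_stat(observed, expected, yates=True)
p_value = chi2_sf_1df(stat_yates)

print(f"uncorrected statistic:     {stat_plain:.2f}")
print(f"Yates-corrected statistic: {stat_yates:.2f}")
print(f"p-value (1 df, corrected): {p_value:.3g}")
```

Either version of the statistic leads to the same conclusion here, since both p-values are far below α = 0.05.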
Q15 LSA 5 states that homoskedasticity is assumed. This assumption implies that:
(a) the residual series ε and X are uncorrelated.
(b) the conditional variance V (Y |X) is a constant.
(c) there is omitted variable bias in the OLS estimates.
(d) the residual series ε and Y are uncorrelated.
(e) none of the above are true
Q16 When there are omitted variables in the regression, which are determinants of the
dependent variable, then
(a) the OLS estimator is biased if the omitted variable is correlated with the
included variable.
(b) you cannot measure the effect of the omitted variable, but the estimator of your
included variable(s) is (are) unaffected.
(c) this has no effect on the estimator of your included variable because the other variable
is not included.
(d) this will always bias the OLS estimator of the included variable.
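The condition in option (a) can be made explicit with the standard omitted variable bias formula. As a sketch, suppose (purely for illustration) that the true model is $Y = \beta_0 + \beta_1 X + \beta_2 W + u$, where $W$ is the omitted variable, and that Y is regressed on X alone. Then the OLS slope satisfies

$$
\hat{\beta}_1 \;\xrightarrow{\;p\;}\; \beta_1 + \beta_2\,\frac{\mathrm{Cov}(X, W)}{\mathrm{Var}(X)}.
$$

The bias term vanishes if either $\beta_2 = 0$ (W is not a determinant of Y) or $\mathrm{Cov}(X, W) = 0$ (W is uncorrelated with the included variable), so both conditions are needed for omitted variable bias.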
Q17 A regression diagnostic tool used to study the possible effects of collinearity is
(a) the Variance Inflation Factor
(b) the slope
(c) the Durbin-Watson statistic
(d) the standard error of the estimate
Q18 Managed funds offer investors a convenient method for diversifying their portfolios.
However, there are many types of funds to choose from. The following table shows the level
of return over five years for a sample of investors for various categories of managed funds.
The observed contingency table for the data is given below:
Fund type               High Ret   Med Ret   Low Ret   Total
Maximum capital gain    108        46        71        225
Long-term growth        18         12        30        60
Balanced income         35         14        26        75
Common stock            25         7         8         40
Total                   186        79        135       400
The estimated conditional probability of high return for Maximum capital gain is
(a) 0.48
(b) 0.3
(c) 0.467
(d) 0.625
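An estimated conditional probability is a cell count divided by its row total. A minimal sketch that applies this to every fund type in the table (the dictionary layout is just one convenient way to hold the counts):

```python
# Observed counts from the Q18 table: (High, Med, Low) returns per fund type.
table = {
    "Maximum capital gain": (108, 46, 71),
    "Long-term growth": (18, 12, 30),
    "Balanced income": (35, 14, 26),
    "Common stock": (25, 7, 8),
}

# Estimated P(High return | fund type) = high count / row total.
p_high = {fund: counts[0] / sum(counts) for fund, counts in table.items()}

for fund, p in p_high.items():
    print(f"{fund}: {p:.3f}")
```

For Maximum capital gain this is 108/225 = 0.48; the other answer options correspond to the other fund types.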
Q19 Consider the following regression line: $\hat{Y} = 10 - 15X_1 + 20X_2$. You are told that the
t-statistic on the slope coefficient of X1 is −3. What is the value of the standard error of the
slope coefficient on X1?
(a) 5
(b) 20
(c) -15
(d) 1.96
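The calculation rearranges the definition of the t-statistic for testing a zero slope, using the estimated slope of −15 from the regression line:

$$
t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}
\quad\Longrightarrow\quad
SE(\hat{\beta}_1) = \frac{\hat{\beta}_1}{t} = \frac{-15}{-3} = 5.
$$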
Q20 Consider the following regression model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$. If any
X variable has $R^2_j > 0.80$ then
(a) VIFj < 20
(b) VIFj > 5
(c) VIFj > 20
(d) VIFj > 10
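The bound follows from the usual definition of the Variance Inflation Factor, where $R^2_j$ is the $R^2$ from regressing $X_j$ on the other predictors:

$$
VIF_j = \frac{1}{1 - R^2_j},
\qquad
R^2_j > 0.80 \;\Longrightarrow\; 1 - R^2_j < 0.20 \;\Longrightarrow\; VIF_j > \frac{1}{0.20} = 5.
$$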