STAT3017/7017 - Big Data Statistics - Assessment 3
Assessment 3
Due by Monday 19 September 2022 09:00
An important problem in multivariate analysis is the test of sphericity of the data, that is, the
null hypothesis $H_0 : \Sigma = \sigma^2 I_p$ where $\sigma^2$ is unspecified. This hypothesis expresses the fact that
the errors are cross-sectionally uncorrelated (independent if the data are normally distributed) and
have the same variance (homoscedasticity). Data sampled from a multivariate normal
$N_p(\mu, \Sigma)$ have a density function that is constant on the ellipsoids
$$(x - \mu)^T \Sigma^{-1} (x - \mu) = k$$
for every positive value of $k$ and $x \in \mathbb{R}^p$, and these ellipsoids are spheres centred at $\mu$
exactly when $\Sigma = \sigma^2 I_p$. A general class of distributions with this elliptical symmetry is
the class of elliptical distributions. A random vector $x$ with zero mean follows an elliptical
distribution if (and only if) it has the stochastic representation
$$x = \xi A u, \qquad (\star)$$
where the matrix $A \in \mathbb{R}^{p \times p}$ is nonrandom with $\mathrm{rank}(A) = p$, $\xi \ge 0$ is a random variable
representing the radius of $x$, and $u \in \mathbb{R}^p$ is the random direction, which is independent of $\xi$
and uniformly distributed on the unit sphere $S^{p-1}$ in $\mathbb{R}^p$, denoted by $u \sim \mathrm{Unif}(S^{p-1})$.
Question 1 [8 marks]
(a)[2] Write a function runifsphere(n,p) that samples $n$ observations from the distribution
$\mathrm{Unif}(S^{p-1})$ using the fact that if $z \sim N_p(0, I_p)$ then $z/\|z\| \sim \mathrm{Unif}(S^{p-1})$. Check
your results by (1) setting $p = 10$, $n = 100$ and showing that the (Euclidean) norm of each
observation is equal to 1, and (2) generating a scatter plot in the case $p = 2$, $n = 500$ to
show that the samples lie on a circle.
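The normalisation trick in (a) can be sketched as follows. This is only an illustration in Python using the standard library, not a model answer; the `rng` argument and the seed are assumptions added here for reproducibility.

```python
import math
import random


def runifsphere(n, p, rng=random):
    """Return n points (lists of length p) drawn uniformly from the unit sphere S^{p-1}."""
    samples = []
    for _ in range(n):
        # Draw z ~ N_p(0, I_p) coordinate by coordinate.
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in z))
        # Normalising z gives a direction uniformly distributed on the sphere.
        samples.append([v / norm for v in z])
    return samples


random.seed(3017)  # arbitrary seed, chosen only for reproducibility
points = runifsphere(100, 10)
# Largest deviation of any observation's Euclidean norm from 1.
max_dev = max(abs(math.sqrt(sum(v * v for v in x)) - 1.0) for x in points)
print(max_dev < 1e-12)
```

For the scatter-plot check, the pairs returned by `runifsphere(500, 2)` can be passed to any plotting tool; every pair should fall on the unit circle.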
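(b)[2] A classic statistic for testing sphericity (called John’s test) that is proposed in [A]
and [B] is
$$U = \frac{1}{p} \mathrm{tr}\left[ \left( \frac{S_n}{\frac{1}{p}\mathrm{tr}(S_n)} - I_p \right)^2 \right],$$
where $S_n$ is the sample covariance matrix. It is shown there that when $p$ is fixed and $n \to \infty$,
under the null hypothesis, it holds that $\frac{np}{2} U \xrightarrow{d} \chi^2_\rho$ with $\rho := \frac{1}{2} p(p+1) - 1$.
Perform a simulation to show that $\frac{np}{2} U$
is distributed like $\chi^2_\rho$ under the null hypothesis in the case of $n = 5000$ observations,
$p = 5$, and with data generated from $N_p(0, I_p)$.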
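A minimal sketch of such a simulation is given below, again in standard-library Python and under stated assumptions: John's statistic is taken in its usual form $U = \frac{1}{p}\mathrm{tr}[(S_n / (\frac{1}{p}\mathrm{tr} S_n) - I_p)^2]$, the sample covariance uses the divisor $n - 1$, and $n$ is reduced from 5000 to 500 so that pure Python stays fast. The helper names are hypothetical. It only compares the first two moments with those of $\chi^2_\rho$ (mean $\rho$, variance $2\rho$); a histogram or QQ-plot would be the fuller check.

```python
import random


def sample_cov(data):
    """Sample covariance matrix (divisor n - 1, an assumption) of p-dimensional rows."""
    n, p = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for row in data:
        d = [row[j] - mean[j] for j in range(p)]
        for i in range(p):
            for j in range(i, p):
                cov[i][j] += d[i] * d[j]
    for i in range(p):
        for j in range(i, p):
            cov[i][j] /= n - 1
            cov[j][i] = cov[i][j]
    return cov


def john_u(data):
    """John's statistic U = (1/p) tr[(S / ((1/p) tr S) - I_p)^2]."""
    s = sample_cov(data)
    p = len(s)
    scale = sum(s[i][i] for i in range(p)) / p
    # M = S / scale - I_p is symmetric, so tr(M^2) is the sum of its squared entries.
    total = 0.0
    for i in range(p):
        for j in range(p):
            m = s[i][j] / scale - (1.0 if i == j else 0.0)
            total += m * m
    return total / p


def simulate_null(n, p, reps, rng):
    """Replicates of (n p / 2) U for data drawn from N_p(0, I_p)."""
    stats = []
    for _ in range(reps):
        data = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        stats.append(n * p / 2.0 * john_u(data))
    return stats


rng = random.Random(3017)            # arbitrary seed
n, p, reps = 500, 5, 200             # smaller than n = 5000 to keep the run short
stats = simulate_null(n, p, reps, rng)
rho = p * (p + 1) // 2 - 1           # degrees of freedom of the limiting chi-square
mean_stat = sum(stats) / reps
var_stat = sum((s - mean_stat) ** 2 for s in stats) / (reps - 1)
# A chi-square with rho degrees of freedom has mean rho and variance 2 * rho.
print(rho, round(mean_stat, 2), round(var_stat, 2))
```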
(c)[2] Check the impact on the distribution of $\frac{np}{2} U$ when the data is sampled from a double
exponential distribution (i.e., a particular case of an elliptical distribution). This can
be generated using the stochastic representation $(\star)$, $x = \xi A u$, with $\xi \sim \mathrm{Gamma}(p, 1)$ and $A = I_p$.
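The sampling step in (c) might look like the sketch below (standard-library Python; the function name `relliptical_gamma` is hypothetical). Because $E[\xi^2] = p(p+1)$ for $\xi \sim \mathrm{Gamma}(p, 1)$ and $E[u u^T] = I_p / p$, the covariance of $x = \xi u$ is $(p+1) I_p$. The data are therefore still spherical, and any change in the distribution of $\frac{np}{2} U$ is not caused by a violation of $H_0$.

```python
import math
import random


def relliptical_gamma(n, p, rng):
    """Draw n observations x = xi * u with xi ~ Gamma(p, 1), u ~ Unif(S^{p-1}), A = I_p."""
    out = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in z))
        xi = rng.gammavariate(p, 1.0)   # shape p, scale 1
        out.append([xi * v / norm for v in z])
    return out


rng = random.Random(3017)   # arbitrary seed
p = 5
data = relliptical_gamma(20000, p, rng)
# Empirical second moments; theory gives Var(x_1) = p + 1 and Cov(x_1, x_2) = 0.
var_first = sum(x[0] ** 2 for x in data) / len(data)
cov_12 = sum(x[0] * x[1] for x in data) / len(data)
print(round(var_first, 2), round(cov_12, 2))
```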
(d)[2] Implement a hypothesis test for sphericity ($H_0 : \Sigma = \sigma^2 I_p$) using John’s test. Plot its
empirical size and power in the case that the data is normal (as per question (b)) and
in the case that the data is double exponential (as per question (c)).
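One way to organise the test in (d) is sketched below for the normal case only. Several choices here are assumptions rather than part of the assessment: the chi-square critical value is obtained from the Wilson-Hilferty approximation (the standard library has no exact chi-square quantile), the alternative is a diagonal covariance with one inflated standard deviation, and the sample sizes are kept small so that pure Python runs quickly.

```python
import random
from statistics import NormalDist


def john_stat(data):
    """(n p / 2) U with U = (1/p) tr[(S / ((1/p) tr S) - I_p)^2], S using divisor n - 1."""
    n, p = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(p)]
    s = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data) / (n - 1)
          for j in range(p)] for i in range(p)]
    scale = sum(s[i][i] for i in range(p)) / p
    u = sum((s[i][j] / scale - (i == j)) ** 2 for i in range(p) for j in range(p)) / p
    return n * p / 2.0 * u


def chi2_quantile_wh(prob, df):
    """Wilson-Hilferty approximation to the chi-square quantile (approximate, not exact)."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


def john_test(data, alpha=0.05):
    """Return True when H0 (sphericity) is rejected at nominal level alpha."""
    p = len(data[0])
    rho = p * (p + 1) // 2 - 1
    return john_stat(data) > chi2_quantile_wh(1.0 - alpha, rho)


def rejection_rate(sds, n, reps, rng, alpha=0.05):
    """Share of rejections for N_p(0, diag(sds^2)) data; the empirical size when all sds agree."""
    hits = 0
    for _ in range(reps):
        data = [[rng.gauss(0.0, sd) for sd in sds] for _ in range(n)]
        hits += john_test(data, alpha)
    return hits / reps


rng = random.Random(3017)   # arbitrary seed
size = rejection_rate([1.0] * 5, n=200, reps=300, rng=rng)                    # H0 true
power = rejection_rate([1.0, 1.0, 1.0, 1.0, 1.5], n=200, reps=300, rng=rng)   # H0 false
print(size, power)
```

Repeating `rejection_rate` over a grid of inflated standard deviations gives the points of a power curve; replacing the Gaussian draws by the double exponential sampler gives the second case.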
Dale Roberts - Australian National University
Last updated: September 2, 2022
Question 2 [6 marks]
Recently, a few research papers have considered high-dimensional
sample covariance matrices in the case where the data is sampled from an elliptical
distribution.
(a)[2] Have a look at the paper [C], and consider Theorem 2.2, Eq. (2.10), and the notation
used there (for all the following terms in this question). Perform a simulation experiment
to examine the fluctuations of $\hat\beta_{n1}$ and $\hat\beta_{n2}$. In the experiment, take
$H_p = \frac{1}{2}\delta_1 + \frac{1}{2}\delta_2$
and choose the distribution $\xi \sim k_1 \mathrm{Gamma}(p, 1)$ with $k_1 = 1/\sqrt{p + 1}$. Set the
dimensions to be $p = 200$ and $n = 400$. Choose the number of simulations based on
the computational power of your machine. Similar to Figure 1 in [C], use a QQ-plot
to show normality.
(b)[2] Unfortunately, the results of [C] do not cover all elliptical distributions due to a
moment condition on the distribution; see Table 1 in [C]. The results in [D] extend
these results to more general elliptical distributions such as multivariate Gaussian
mixtures¹. A p-dimensional vector $x \in \mathbb{R}^p$ is a multivariate Gaussian mixture with $k$
subpopulations if its density function has the form
$$f(x) = \sum_{j=1}^{k} p_j \phi(x; \mu_j, \Sigma_j),$$
where $(p_j)$ are the $k$ mixing weights and $\phi(\cdot; \mu_j, \Sigma_j)$ denotes the density function of
the $j$th subpopulation with mean vector $\mu_j$ and covariance $\Sigma_j$. In the case where
$\mu_1 = \mu_2 = \cdots = \mu_k = 0 \in \mathbb{R}^p$ and $\Sigma_j = v_j \Sigma$ for some $v_j > 0$ with $j = 1, \ldots, k$, write
an R function to sample from such a distribution using the representation from Eq.
(11) in [D].
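For orientation, a zero-mean Gaussian scale mixture with the density above can be sampled in two stages: pick a subpopulation $j$ with probability $p_j$, then draw from $N_p(0, v_j \Sigma)$. The sketch below does this in standard-library Python rather than R, and it has not been checked against the exact form of Eq. (11) in [D]. The function name `rscalemix`, the factor `a` with $A A^T = \Sigma$, and the example weights and scales are all hypothetical.

```python
import math
import random


def rscalemix(n, weights, scales, a, rng):
    """Draw n observations from the mixture sum_j p_j N_p(0, v_j * A A^T).

    weights: mixing weights p_j; scales: v_j > 0; a: p x p factor with A A^T = Sigma.
    """
    p = len(a)
    out = []
    for _ in range(n):
        # Stage 1: pick the scale v_j of subpopulation j with probability p_j.
        v = rng.choices(scales, weights=weights, k=1)[0]
        # Stage 2: draw sqrt(v_j) * A z with z ~ N_p(0, I_p).
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        out.append([math.sqrt(v) * sum(a[i][k] * z[k] for k in range(p))
                    for i in range(p)])
    return out


rng = random.Random(3017)           # arbitrary seed
a = [[1.0, 0.0], [0.5, 1.0]]        # Sigma = A A^T = [[1, 0.5], [0.5, 1.25]]
data = rscalemix(20000, [0.5, 0.5], [1.0, 4.0], a, rng)
# The mixture has covariance (sum_j p_j v_j) * Sigma = 2.5 * Sigma.
var_first = sum(x[0] ** 2 for x in data) / len(data)
cov_12 = sum(x[0] * x[1] for x in data) / len(data)
print(round(var_first, 2), round(cov_12, 2))
```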
(c)[2] Using your code from (b), perform a simulation experiment to simulate fluctuations
of $\hat\beta_2$ under a Gaussian scale mixture model where the variable $\xi$ has a discrete
distribution with two mass points $\mathbb{P}(\xi = 1.8\sqrt{p}) = 0.8$ and $\mathbb{P}(\xi = 1.5\sqrt{p}) = 0.2$.
Consider the cases: (i) $p = 100$, $n = 150$, (ii) $p = 600$, $n = 900$. In both cases,
plot a histogram of the distribution of $\hat\beta_2$ against the theoretical limiting density and
also a QQ-plot similar to Figure 1 in [D]. Note: this is the experiment just above
Section 3 in [D].
References
[A] John (1971). Some optimal multivariate tests. Biometrika.
[B] John (1972). The distribution of a statistic used for testing sphericity of normal distributions. Biometrika.
[C] Hu, Li, Liu, Zhou (2019). High-dimensional covariance matrices in elliptical distributions with application
to spherical test. Annals of Statistics.
[D] Zhang, Hu, Li (2022). CLT for linear spectral statistics of high-dimensional sample covariance matrices
in elliptical distributions. Journal of Multivariate Analysis.
¹ Recall I mentioned in Lecture 1 that one difficulty in big datasets is the presence of multiple subpopulations.