Final Project
Stat 428
I. Simulation Problem (50 points)
In lecture, we discussed nearest neighbor tests and the energy distance test for the two-sample testing problem.
We now consider two more tests: the two-sample Hotelling’s T-square test and the graph-based two-sample
test. Suppose we observe data $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$, where $X_i, Y_j \in \mathbb{R}^d$ are multivariate random
vectors. Here, $X_1, \dots, X_n$ are drawn from distribution $F$ and $Y_1, \dots, Y_m$ are drawn from distribution $G$. The
hypotheses of interest in the two-sample testing problem are
$$H_0 : F = G \quad \text{and} \quad H_1 : F \neq G.$$
The graph-based two-sample test is defined in the following way. We pool all the data together:
$$(Z_1, \dots, Z_{n+m}) = (X_1, \dots, X_n, Y_1, \dots, Y_m).$$
Based on these $n + m$ observations, we construct a graph $G = (V, E)$ whose vertex set is $V = \{1, \dots, n + m\}$,
with an edge between $i$ and $j$ if $\|Z_i - Z_j\| \le Q$, where $Q$ is a positive number. Let $E$
be the collection of edges. The graph-based two-sample test statistic is defined as
$$T = \frac{1}{|E|} \sum_{e \in E} I_e,$$
where $|E|$ is the number of edges in the edge set $E$. Here, $I_e = 1$ if the two vertices connected by $e$ have
the same label (both come from the $X$ sample or both from the $Y$ sample) and $I_e = 0$ otherwise.
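The construction above can be sketched in base R. This is a minimal sketch, not a reference implementation: it assumes the statistic is the proportion of edges whose endpoints share a label, uses Euclidean distance, and calibrates the test by permuting the labels. The function names `graph_stat` and `graph_test`, the permutation calibration, and the simulated data are all assumptions made for illustration.

```r
# Hypothetical helper: proportion of edges whose endpoints share a sample label.
graph_stat <- function(D, labels, Q) {
  # D: (n+m) x (n+m) matrix of pairwise distances; labels: sample membership
  adj <- D <= Q & upper.tri(D)          # count each unordered pair i < j once
  if (!any(adj)) return(NA_real_)       # no edges: statistic is undefined
  idx <- which(adj, arr.ind = TRUE)
  mean(labels[idx[, 1]] == labels[idx[, 2]])
}

# Hypothetical wrapper: permutation p-value for the statistic above.
graph_test <- function(x, y, Q, B = 199) {
  z <- rbind(x, y)
  labels <- rep(c(1L, 2L), c(nrow(x), nrow(y)))
  D <- as.matrix(dist(z))               # Euclidean distance (an assumption)
  t_obs <- graph_stat(D, labels, Q)
  if (is.na(t_obs)) stop("no edges in the graph; increase Q")
  # The graph does not depend on the labels, so only the labels are shuffled
  t_perm <- replicate(B, graph_stat(D, sample(labels), Q))
  # Large values mean connected points tend to share a label, evidence against H0
  p <- (1 + sum(t_perm >= t_obs)) / (B + 1)
  list(statistic = t_obs, p.value = p)
}

set.seed(428)
x <- matrix(rnorm(40 * 2), ncol = 2)             # sample from F (simulated)
y <- matrix(rnorm(40 * 2, mean = 3), ncol = 2)   # sample from G, shifted mean
res <- graph_test(x, y, Q = 1)
res
```

The choice of Q matters in this sketch: if Q is too small the graph has no edges, and if Q is very large every pair is connected and the statistic no longer depends on the data.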
Question 1 Report
A pharmaceutical company would like to test whether the effects of two treatments are similar. The
manager wants to choose one two-sample testing method from among nearest neighbor tests, the energy distance test,
Hotelling’s T-square test, and the graph-based two-sample test, and asks for your advice on the choice.
First, could you help the manager by implementing these four methods from scratch: nearest neighbor
tests, the energy distance test, Hotelling’s T-square test, and the graph-based two-sample test? Second, could you
prepare a report with suggestions for the manager? In this report, you need to address at least
four of the following points:
• Several parts of these tests can be customized, e.g., the threshold Q in the graph-based test, the
number of neighbors in the nearest neighbor test, and the specific form of distance in the energy distance test
and the graph-based test. Could you provide some suggestions on the choice of these customizable parts? You
need to show some numerical experiments as evidence.
• Are these tests sensitive to the dimension d of the data?
• Are these tests sensitive to the specific form of the distributions F and G?
• Which test has larger power, and under what conditions?
• Clearly, the power of a test depends on the sample sizes n and m and on how different F and G are under
the alternative hypothesis. Could you prepare a plot to show the effect of sample size on power? Could you
prepare another plot to show the effect of the difference between F and G on power?
• Are these methods able to control Type I error?
You need to submit both Rmd and pdf file of your report.
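The questions about Type I error and power can be explored by Monte Carlo simulation. The following is a minimal sketch, assuming multivariate normal data with identity covariance and a mean shift under the alternative. It uses the standard two-sample Hotelling T-square statistic with its F reference distribution, which relies on normality and equal covariances. The names `hotelling_test` and `rejection_rate`, the simulation settings, and the number of replications are assumptions chosen for illustration.

```r
# Two-sample Hotelling T-square test with pooled covariance.
hotelling_test <- function(x, y) {
  n <- nrow(x); m <- nrow(y); d <- ncol(x)
  delta <- colMeans(x) - colMeans(y)
  S <- ((n - 1) * cov(x) + (m - 1) * cov(y)) / (n + m - 2)  # pooled covariance
  T2 <- (n * m / (n + m)) * sum(delta * solve(S, delta))
  # Under H0 (normal data, equal covariances) this scaled T2 follows F(d, n+m-d-1)
  Fstat <- T2 * (n + m - d - 1) / (d * (n + m - 2))
  p <- pf(Fstat, d, n + m - d - 1, lower.tail = FALSE)
  list(statistic = T2, p.value = p)
}

# Hypothetical helper: fraction of simulated data sets on which `test` rejects.
# shift = 0 estimates the Type I error; shift != 0 estimates the power.
rejection_rate <- function(test, n, m, d, shift, reps = 500, alpha = 0.05) {
  mean(replicate(reps, {
    x <- matrix(rnorm(n * d), ncol = d)                 # F: N(0, I), assumed
    y <- matrix(rnorm(m * d, mean = shift), ncol = d)   # G: N(shift, I), assumed
    test(x, y)$p.value < alpha
  }))
}

set.seed(428)
type1 <- rejection_rate(hotelling_test, n = 30, m = 30, d = 3, shift = 0)
shifts <- c(0.25, 0.5, 1)
power <- sapply(shifts, function(s)
  rejection_rate(hotelling_test, n = 30, m = 30, d = 3, shift = s))
```

Any of the other three tests could be passed to `rejection_rate` in the same way, provided it returns a list with a `p.value` element. Plotting `power` against `shifts`, or against a grid of sample sizes, gives the kind of power curve the report asks for.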
Question 2 Presentation and Slides
Based on your report, could you prepare a 3-5 minute presentation to summarize your findings and suggestions?
Assume your audience is the manager from this pharmaceutical company, who has only a very limited statistics
background. For this question, you need to submit a video (I need to see you in this video) and your slides
(both Rmd and pdf).
Question 3 R package (Bonus question: extra 10 points for the final project)
Could you prepare an R package that includes all four of your two-sample testing methods, along with a manual that
explains how these methods can be used? To complete this question, you need to submit a compressed R
package.
II. Real Data Problem (50 points)
The data for this project describe payments for child support made to a government agency. A “case” refers
to a legal judgment that an absent parent (abbreviated in variable names as “AP”) must make child support
payments. The data are distributed in four CSV files, which can be downloaded from Compass2g. The data
are distributed “as is” as obtained from the agency (albeit anonymized). Most categorical variables are
self-explanatory.
The file cases.csv has six columns and one row for each case:
• CASE_NUM Unique case identifier
• CASE_STATUS ACV (active), IN_ (inactive), IC_ (closed), IO_ (legal), IS_ (suspend)
• CASE_SUBTYPE AO (arrears), EF (foster), MA (medical), NO (arrears), RA (regular), RN (regular)
• CASE_TYPE AF (AFDC), NA (non-afdc), NI (other)
• AP_ID Identifying number for absent parent
• LAST_PYMNT_DT Recorded date of last payment
The file parents.csv has 10 columns and one row for each parent:
• AP_ID Unique identifier for parent
• AP_ADDR_ZIP Coded na for missing, 0 for “known unknown”, 1 for city, 2 south state, 3 north state,
4 other
• AP_DECEASED_IND AP is deceased
• AP_CUR_INCAR_IND AP is incarcerated
• AP_APPROX_AGE
• MARITAL_STS_CD Self-explanatory
• SEX_CD
• RACE_CD Categorical
• PRIM_LANG_CD Language code
• CITIZENSHIP_CD Citizenship code
The file children.csv has 9 columns:
• CASE_NUM Case number
• ID Unique identifier for child
• SEX_CD
• RACE_CD
• MARITAL_STS_CD Marital status code
• PRIM_LANG_CD Primary language
• CITIZENSHIP_CD
• DATE_OF_BIRTH_DT
• DRUG_OFFNDR_IND Past drug offence
The file payments.csv has only six columns, but more than 1.5 million records:
• CASE_NUM Case number for the payment
• PYMNT_AMT Dollar amount of payment
• COLLECTION_DT Date of payment
• PYMNT_SRC A (regular), C (worker comp), F (tax offset), I (interstate), S (st tax), W (garnish)
• PYMNT_TYPE A (cash), B (bank), C (check), D (credit card), E (elec trans), M (money order)
• AP_ID Absent parent ID
Question 1 File linkage integrity
(a) Read the four CSV files into R, building four data frames with the names “Cases”, “Parents”, “Children”
and “Payments”. Show the dimensions of these data frames. (You may find it useful to save these data
frames as Rdata objects in a file using the save command. You can then recover them with the load
command more quickly than reading the CSV file.)
(b) What is the distribution of the number of children attached to a case? Show an appropriate plot of the
distribution, and mark the location of the average number in the plot.
(c) The file children.csv may have more than one record for each child. What is the largest number of
cases associated with a single child? Explain why you believe that these records refer to the same child.
(d) Does every absent parent (AP_ID) identified in the payments data have an identifying record in the
parents data file?
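Parts (a) and (d) can be sketched as follows. To keep the example runnable anywhere, it first writes two small made-up CSV files to a temporary directory. The toy values, the file name `child_support.RData`, and the variable names `unmatched` and `all_linked` are assumptions for illustration; with the real data you would point `read.csv` at the downloaded files.

```r
# Toy stand-ins for two of the four files (made-up values, not the real data).
data_dir <- tempdir()
write.csv(data.frame(AP_ID = c(1, 2, 3), SEX_CD = c("M", "F", "M")),
          file.path(data_dir, "parents.csv"), row.names = FALSE)
write.csv(data.frame(CASE_NUM = c(11, 12, 12, 13), AP_ID = c(1, 2, 2, 5)),
          file.path(data_dir, "payments.csv"), row.names = FALSE)

# Part (a): read the files into data frames and show their dimensions
Parents  <- read.csv(file.path(data_dir, "parents.csv"), stringsAsFactors = FALSE)
Payments <- read.csv(file.path(data_dir, "payments.csv"), stringsAsFactors = FALSE)
dim(Parents)
dim(Payments)

# Save once, then reload quickly in later sessions
rdata_file <- file.path(data_dir, "child_support.RData")
save(Parents, Payments, file = rdata_file)
rm(Parents, Payments)
load(rdata_file)

# Part (d): payment AP_IDs that have no matching record in the parents data
unmatched <- setdiff(unique(Payments$AP_ID), Parents$AP_ID)
all_linked <- length(unmatched) == 0
```

In this toy example `unmatched` contains the single ID 5, so `all_linked` is `FALSE`.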
Question 2 Recoding categories
Some categorical variables among these data frames are sparse (seldom observed). For example, the variable
PYMNT_SRC in Payments has category ‘M’ with 2 cases and category ‘R’ with 7. These are too few for
modeling in regression.
Write a function named “pool_categories” that recodes a categorical variable into a “simpler” factor with
fewer categories by pooling categories with counts below a threshold into a category labeled ‘Other’ (a factor
level which your function should check does not already exist!). You might find the R function %in% useful
for this exercise.
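One possible shape for such a function is sketched below. It is a minimal sketch under stated assumptions: the argument names `threshold` and `other`, the choice to stop with an error when the label already exists, and the toy data are all illustrative choices rather than requirements of the assignment.

```r
# Pool categories with fewer than `threshold` observations into one level.
pool_categories <- function(x, threshold, other = "Other") {
  x <- as.character(x)
  # Refuse to proceed if the pooled label would collide with an existing category
  if (other %in% x) stop("label '", other, "' already exists in the data")
  counts <- table(x)
  rare <- names(counts)[counts < threshold]
  x[x %in% rare] <- other          # missing values are left as NA
  factor(x)
}

# Made-up example: 'M' (2 cases) and 'R' (7 cases) fall below the threshold
src <- c(rep("A", 20), rep("W", 10), "M", "M", rep("R", 7))
pooled <- pool_categories(src, threshold = 10)
table(pooled)
```

An alternative design would pick a different label automatically when 'Other' is taken; stopping with an error is simply the most conservative option.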
Question 3 Payment counts and amounts
You must use ggplot2 for generating the plots asked for in this question.
(a) Make a variable Payments$DATE which is a viable R date by converting the COLLECTION_DT
variable. Use this variable to find (i) the range of dates of all payments and (ii) the percentage of the
total number of payments made before May 1, 2015.
(b) Show a sequence plot of the total number of payments made on each day from May 1, 2015 through the
end of the data.
(c) What explains the bimodal shape of the marginal distribution of the number of payments over this
time period? Explain with some evidence how you reached your opinion.
(d) Describe the distribution of the payment amounts. Do you have an explanation for its shape? (You
might find it useful to work with a sample for plotting. R takes a while to draw 1.5 million points.)
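The date handling in parts (a) and (b) can be sketched in base R. The format of COLLECTION_DT is not stated in this document, so the format string `"%m/%d/%Y"` below is an assumption that must be checked against the real file; the toy data and the names `pct_before`, `recent`, and `daily` are also illustrative.

```r
# Made-up payment dates; the real COLLECTION_DT format may differ.
Payments <- data.frame(
  COLLECTION_DT = c("04/15/2015", "05/01/2015", "05/01/2015", "06/30/2016"))

# Part (a): convert to a proper R Date (format string is an assumption)
Payments$DATE <- as.Date(Payments$COLLECTION_DT, format = "%m/%d/%Y")
date_range <- range(Payments$DATE)
pct_before <- 100 * mean(Payments$DATE < as.Date("2015-05-01"))

# Part (b): number of payments on each day from May 1, 2015 onward
recent <- Payments[Payments$DATE >= as.Date("2015-05-01"), , drop = FALSE]
daily <- as.data.frame(table(DATE = recent$DATE), stringsAsFactors = FALSE)
daily$DATE <- as.Date(daily$DATE)   # table() stores the dates as character labels
daily
```

With ggplot2 loaded, a call such as `ggplot(daily, aes(DATE, Freq)) + geom_line()` would then draw the sequence plot. Note that days with no payments do not appear in `daily` in this sketch.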
Question 4 Most common parent
(a) Identify the parent with the most cases.
(b) Identify all of the different children associated with the cases of the parent identified in (a).
(c) What is the average age of these children, in years? Use their age as of Jan 1, 2017. (Fractions of a
year are fine.)
(d) Show a plot of the payment history for this parent.
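Parts (a) through (c) can be sketched as below, using made-up data. The sketch assumes DATE_OF_BIRTH_DT is stored in ISO format (YYYY-MM-DD), which may not match the real file, and it approximates a year as 365.25 days. The names `top_parent`, `kids`, and `age_years` are illustrative.

```r
# Made-up data: parent 2 has three cases; child 101 appears in two of them.
Cases <- data.frame(CASE_NUM = c(11, 12, 13, 14), AP_ID = c(1, 2, 2, 2))
Children <- data.frame(
  CASE_NUM = c(12, 13, 13, 14),
  ID = c(101, 101, 102, 103),
  DATE_OF_BIRTH_DT = c("2010-01-01", "2010-01-01", "2016-07-01", "2013-01-01"))

# (a) parent with the most cases
case_counts <- table(Cases$AP_ID)
top_parent <- names(case_counts)[which.max(case_counts)]

# (b) distinct children attached to that parent's cases
top_cases <- Cases$CASE_NUM[Cases$AP_ID == top_parent]
kids <- Children[Children$CASE_NUM %in% top_cases, c("ID", "DATE_OF_BIRTH_DT")]
kids <- kids[!duplicated(kids$ID), ]

# (c) age in years as of Jan 1, 2017 (365.25 days per year, an approximation)
dob <- as.Date(kids$DATE_OF_BIRTH_DT)   # ISO date format assumed
age_years <- as.numeric(as.Date("2017-01-01") - dob) / 365.25
mean_age <- mean(age_years)
```

If several parents tie for the most cases, `which.max` returns only the first, so ties would need to be checked separately.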
Question 5 Payments for cases
The unit of analysis for this question is the payment behavior of an absent parent. Hence, if the parent is
involved in several cases, you will need to accumulate the relevant information. You may find it useful for
this and the next question to build a data frame for parents that collects the relevant information for each
parent. You may find dplyr useful here and elsewhere, but you don’t have to use it.
(a) It has been conjectured that parents deemed responsible for more children are more likely to make
either a larger number of payments or a larger total payment amount over this period. Is that true?
(b) It has been conjectured that parents responsible for younger children are more likely to make more
payments. Is the average age of the children of an absent parent associated with the total amount of
payments made by the absent parent? (Define a child’s age as the age on Jan 1, 2017.)
(c) Does the location of the parent (AP_ADDR_ZIP) anticipate the total amount of payments made by
the absent parent?
(d) Does the combination of attributes of the parent with the number and average age of the children
involved predict the total amount of payments made by a parent? Explain your results briefly. (Note:
It makes no sense to remove cases with missing values of a categorical variable. Missingness just defines
another category of the variable.)
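Building the per-parent data frame suggested above might look like the following base R sketch. The toy data and the names `ParentSummary`, `n_payments`, `total_paid`, and `n_children` are assumptions for illustration; dplyr would work equally well.

```r
# Made-up data for three absent parents.
Payments <- data.frame(AP_ID = c(1, 1, 2, 3, 3, 3),
                       PYMNT_AMT = c(100, 50, 200, 20, 30, 10))
Cases <- data.frame(CASE_NUM = c(11, 12, 13, 14), AP_ID = c(1, 1, 2, 3))
Children <- data.frame(CASE_NUM = c(11, 12, 12, 13, 14),
                       ID = c(101, 102, 103, 104, 105))

# Number and total amount of payments for each parent
n_pay <- tapply(Payments$PYMNT_AMT, Payments$AP_ID, length)
total <- tapply(Payments$PYMNT_AMT, Payments$AP_ID, sum)

# Distinct children per parent, accumulated across all of the parent's cases
kids <- merge(Children, Cases, by = "CASE_NUM")
kids <- unique(kids[, c("AP_ID", "ID")])
n_kids <- tapply(kids$ID, kids$AP_ID, length)

ParentSummary <- data.frame(
  AP_ID = as.numeric(names(total)),
  n_payments = as.vector(n_pay),
  total_paid = as.vector(total))
# Parents with no child records get NA here
ParentSummary$n_children <- as.vector(n_kids[as.character(ParentSummary$AP_ID)])
ParentSummary
```

A summary table of this form can be passed directly to `lm`, for example `lm(total_paid ~ n_children, data = ParentSummary)`, once the parent attributes have been merged in.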
Question 6 Consistency
Again, the unit of analysis for this question is an absent parent. An important aspect of payments is the
consistency of the payments over time. A steady income stream is, for many, preferable to a highly volatile,
unpredictable payment schedule, even if the latter has a higher average.
(a) Among all parents who made payments, is there any association between the SD of total daily payments
and the average of total daily payments?
(b) The coefficient of variation (CV) is the ratio of the SD of daily payments to the mean. Show time
sequence plots of the payments of 3 parents, with low, medium and high CV. That is, find three
representative parents who make payments. One of these three should have a high CV, another a
medium CV, and a third a low CV.
(c) Is the CV of payments associated with the total amount of payments over this time period?
(d) Do any attributes of the parent as revealed in these data anticipate that the parent will make consistent
payments, that is, have small CV?
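The CV defined in part (b) can be computed per parent as sketched below. This sketch assumes that "total daily payments" means the sum of a parent's payments on each day with at least one payment, so days without payments are not counted as zeros; that interpretation, the toy data, and the names `daily`, `cv`, and `stats` are assumptions.

```r
# Made-up payments: parent 1 pays steadily, parent 2 erratically.
Payments <- data.frame(
  AP_ID = c(1, 1, 1, 1, 2, 2, 2, 2),
  DATE = as.Date(c("2016-01-01", "2016-01-01", "2016-02-01", "2016-03-01",
                   "2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01")),
  PYMNT_AMT = c(50, 50, 100, 100, 10, 500, 20, 300))

# Total paid by each parent on each day with at least one payment
daily <- aggregate(PYMNT_AMT ~ AP_ID + DATE, data = Payments, FUN = sum)

# Coefficient of variation: SD of the daily totals divided by their mean
cv <- function(v) sd(v) / mean(v)
stats <- aggregate(PYMNT_AMT ~ AP_ID, data = daily, FUN = cv)
names(stats)[2] <- "CV"
stats
```

Parents with a single payment day have an undefined SD, so their CV is `NA` under this definition and they would need to be handled separately.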