THE UNIVERSITY OF AUCKLAND
SEMESTER 2, 2018
Campus: City
STATISTICS
Statistical Computing
(Time allowed: THREE Hours)
INSTRUCTIONS
• Attempt ALL questions.
• Total marks are 100.
• Calculators are permitted.
• An R Quick Reference is available in the Attachment.
Page 1 of 22
STATS 380
Part I: Programming
For questions in Part I, avoid using explicit loops or anything equivalent as much as
possible, unless the question asks to use them.
1. Write down the evaluation results of the following R expressions.
(a) 1:3 + c(T, F, T, T, F, T)
[2 marks]
(b) 2^1:5/10
[2 marks]
(c) matrix(1:10, 2, 5, b = TRUE)
[2 marks]
(d) {x = c(-0.5, 1, 0.8, 1.5); pmax(0, pmin(1, x))}
[2 marks]
(e) {x = 7; repeat {print(x); x = x + 1; if (x > 9) break}}
[2 marks]
(f) levels(factor(c("foo", "boo", "loo")))[2]
[2 marks]
(g) substring("Statistical computing", 9:12)
[2 marks]
(h) {x = c(5, 1, 4, -3, 2); ifelse(x > 3, mean(x), median(x))}
[2 marks]
[16 marks]
2. Use :, seq(), rep() and some other commonly used arithmetic operators/functions
to create the sequences given below.
Note: Do not use c() or any loop to create the sequences.
(a) 3 6 9 12 15
[2 marks]
(b) 10.000 5.000 2.500 1.250 0.625
[2 marks]
(c) 1 2 3 4 2 3 4 5 3 4 5 6
[2 marks]
[6 marks]
5. (a) Figure 1 shows 7 squares filled with 7 distinct colours.
Note: For black and white printing purposes of this paper 7 shades of gray
are used instead.
[Figure: a row of 7 squares, labelled 1 2 3 4 5 6 7 from left to right]
Figure 1: Colored squares.
Write R code to reproduce Figure 1. You need to
• display the number at the center of each square,
• generate the colors using hcl(), ranging from purple-ish for the leftmost
square to red-ish for the rightmost square.
Hint: You may have to set the aspect ratio of y- and x-axis to visualize the
rectangles as squares. The coordinates used in your code only need to be
roughly similar to Figure 1.
[8 marks]
Figure 2: A layout to display 5 plots.
[4 marks]
[12 marks]
Part II: Data Technology
6. Write down the evaluation results of the following R expressions:
(a) > text = "good and bad"
> gregexpr("d", text)
[5 marks]
(b) > df1 = data.frame(n = c("a", "b", "c"), x = 1:3)
> df2 = data.frame(n = c("a", "c"), y = c(6, 9))
> merge(df1, df2)
[5 marks]
(c) > breakText = function(t) {
strsplit(t, " ")[[1]]
}
> text = c("roses are red", "and so are you")
> lapply(text, breakText)
[5 marks]
(d) > text1 = "John (fishing, hunting), Paul (hiking, biking),"
> text2 = "Carol, Smith (fishing, swimming)"
> text = paste(text1, text2)
> newtext = gsub("(\\(.[^)]*?,) ", "\\1]", text)
> newtext = strsplit(newtext, ", ")[[1]]
> regmatches(newtext, regexpr(",]", newtext)) = ", "
> newtext
[5 marks]
(e) > text = c("Smith,(919)319-1677", "Ali, 800-899-2164",
"Richard, 7042982145")
> patt1 = "\\([2-9]\\d\\d\\) ?[2-9]\\d\\d-\\d\\d\\d\\d"
> patt2 = "[2-9]\\d\\d-[2-9]\\d\\d-\\d\\d\\d\\d"
> patt = paste0(c("(", patt1, ")|", "(", patt2, ")"), collapse = "")
> data = do.call("rbind", strsplit(text, ","))
> data = as.data.frame(data)
> colnames(data) = c("name", "phone")
> pos = grepl(patt, data[, 2])
> data$valid = ifelse(pos, "valid", "invalid")
> data
[5 marks]
[25 marks]
7. Suppose data is a data frame with 3 columns. It contains information on the
population of administrative divisions of 198 countries. Column “population”
contains the number of people in the given region. The first 6 rows of this data set
are shown below. Write R code that, for each country, extracts regions with the
population size greater than the average population size of all regions within the
country. To do so you may follow these steps:
• For each country calculate the average population size of the administrative
regions.
• Merge the results to data.
• Select the appropriate subset.
[10 marks]
> head(data)
admin_region country population
1 Badakhshan Afghanistan 805500
2 Badghis Afghanistan 420400
3 Berat Albania 193855
4 Adrar Algeria 311615
5 Dibre Albania 191035
6 Baghlan Afghanistan 762500
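The three steps listed in Question 7 can be sketched as follows. This is only one possible approach, not a model answer. The data frame regions and the column name avg_population are made up for illustration; regions simply copies the six rows printed by head(data).

```r
# Toy stand-in for `data`, using the six rows shown by head(data).
regions <- data.frame(
  admin_region = c("Badakhshan", "Badghis", "Berat", "Adrar", "Dibre", "Baghlan"),
  country      = c("Afghanistan", "Afghanistan", "Albania", "Algeria", "Albania",
                   "Afghanistan"),
  population   = c(805500, 420400, 193855, 311615, 191035, 762500),
  stringsAsFactors = FALSE
)

# Step 1: average population of the regions within each country.
avg <- tapply(regions$population, regions$country, mean)
avg_df <- data.frame(country = names(avg),
                     avg_population = as.vector(avg),  # hypothetical column name
                     stringsAsFactors = FALSE)

# Step 2: merge the averages back onto the region-level rows.
merged <- merge(regions, avg_df, by = "country")

# Step 3: keep regions whose population exceeds their country's average.
above <- subset(merged, population > avg_population)
above
```

With these six rows only Badakhshan, Baghlan and Berat are kept; Adrar is the sole Algerian region, so it equals its country's average and is dropped.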
8. Suppose Sales is a data frame in R, which stores the number of total sales for a
company in different months and days of the week during 1990 – 2017. The first
column is the year, the second column indicates the month of sale, and columns
Mon to Fri contain the number of total sales in each day of the week.
(a) Write R code which uses the function melt() in the reshape2 package to reshape
Sales to a long form. Assign the result to the symbol SalesLong. The first 6
rows of SalesLong are shown below.
[5 marks]
(b) Write R code which uses SalesLong to create a data frame which contains
the total number of Sales per month. Assign the result to the symbol result.
The first 6 rows of result are shown below.
[5 marks]
(c) Write an R expression which extracts the row from result with minimum
sale.
[5 marks]
[15 marks]
> head(Sales)
Year Month Mon Tue Wed Thu Fri
1 2017 January 23 20 14 29 25
2 2017 February 20 15 13 28 29
3 2017 March 26 21 15 30 30
4 2017 April 24 23 14 32 33
5 2017 May 23 25 16 27 27
6 2017 June 26 19 13 26 23
> head(SalesLong)
Year Month variable value
1 2017 January Mon 23
2 2017 February Mon 20
3 2017 March Mon 26
4 2017 April Mon 24
5 2017 May Mon 23
6 2017 June Mon 26
> head(result)
Month value
1 January 2953
2 February 3050
3 March 2883
4 April 3062
5 May 3147
6 June 2995
ATTACHMENT FOLLOWS
ATTACHMENT STATS 380
R QUICK REFERENCE
Basic Data Representation
TRUE, FALSE logical true and false
1, 2.5, 117.333 simple numbers
1.23e20 scientific notation, 1.23 × 10^20.
3+4i complex numbers
"hello, world" a character string
NA missing value (in any type of vector)
NULL missing value indicator in lists
NaN not a number
Inf positive infinity
-Inf negative infinity
"var" quotation for special variable name (e.g. +, %*%, etc.)
Creating Vectors
c(a1,...,an) combine into a vector
logical(n) logical vector of length n (containing falses)
numeric(n) numeric vector of length n (containing zeros)
complex(n) complex vector of length n (containing zeros)
character(n) character vector of length n (containing empty strings)
Creating Lists
list(e1,...,ek) combine as a list
vector("list", k) create a list of length k (the elements are all NULL)
Basic Vector and List Properties
length(x) the number of elements in x
mode(x) the mode or type of x
Tests for Types
is.logical(x) true for logical vectors
is.numeric(x) true for numeric vectors
is.complex(x) true for complex vectors
is.character(x) true for character vectors
is.list(x) true for lists
is.vector(x) true for both lists and vectors
Tests for Special Values
is.na(x) true for elements which are NA or NaN
is.nan(x) true for elements which are NaN
is.null(x) tests whether x is NULL
is.finite(x) true for finite elements (i.e. not NA, NaN, Inf or -Inf)
is.infinite(x) true for elements equal to Inf or -Inf
Explicit Type Coercion
as.logical(x) coerces to a logical vector
as.numeric(x) coerces to a numeric vector
as.complex(x) coerces to a complex vector
as.character(x) coerces to a character vector
as.list(x) coerces to a list
as.vector(x) coerces to a vector (lists remain lists)
unlist(x) converts a list to a vector
Vector and List Names
c(n1=e1,...,nk=ek) combine as a named vector
list(n1=e1,...,nk=ek) combine as a named list
names(x) extract the names of x
names(x) = v (re)set the names of x to v
names(x) = NULL remove the names from x
Vector Subsetting
x[1:5] select elements by index
x[-(1:5)] exclude elements by index
x[c(TRUE, FALSE)] select elements corresponding to TRUE
x[c("a", "b")] select elements by name
List Subsetting
x[1:5] extract a sublist of the list x
x[-(1:5)] extract a sublist by excluding elements
x[c(TRUE, FALSE)] extract a sublist with logical subscripts
x[c("a", "b")] extract a sublist by name
Extracting Elements from Lists
x[[2]] extract an element of the list x
x[["a"]] extract the element with name "a" from x
x$a extract the element with name "a" from x
Logical Selection
ifelse(cond, yes, no) conditionally select elements from yes and no
which(v) returns the indices of TRUE values in v
List Manipulation
lapply(X, FUN, ...) apply FUN to the elements of X
split(x, f) split x using the factor f
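As a small illustration of split() and lapply() from the list above, the sketch below splits a vector by a grouping vector and summarises each piece. The values of x and f are invented for the example.

```r
x <- c(1, 2, 3, 4, 5, 6)
f <- c("a", "b", "a", "b", "a", "b")

parts <- split(x, f)          # list with elements a = 1 3 5 and b = 2 4 6
sums  <- lapply(parts, sum)   # list with elements a = 9 and b = 12
unlist(sums)                  # named vector: a = 9, b = 12
```

Note that lapply() always returns a list; unlist() is one way to flatten it when every result has length one.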
Sequences and Repetition
a:b sequence from a to b in steps of size 1
seq(n) same as 1:n
seq(a,b) same as a:b
seq(a,b,by=s) a to b in steps of size s
seq(a,b,length=n) sequence of length n from a to b
seq(along=x) like 1:length(x), but works when x has zero length
rep(x,n) x, repeated n times
rep(x,v) elements of x with x[i] repeated v[i] times
rep(x,each=n) elements of x, each repeated n times
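The calls in this section may be easier to compare side by side. The sketch below uses arbitrary values and only the forms listed above.

```r
seq(2, 10, by = 2)       # 2 4 6 8 10
seq(0, 1, length = 5)    # 0.00 0.25 0.50 0.75 1.00
rep(1:2, 3)              # 1 2 1 2 1 2
rep(1:2, each = 3)       # 1 1 1 2 2 2
rep(1:3, 3:1)            # 1 1 1 2 2 3

x <- numeric(0)          # a zero-length vector
seq(along = x)           # integer(0): an empty sequence
1:length(x)              # 1 0: usually not what is wanted
```

The last two lines show why seq(along = x) is the safer choice when x might be empty.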
Sorting and Ordering
sort(x) sort into ascending order
sort(x, decreasing=TRUE) sort into descending order
rev(x) reverse the elements in x
order(x) get the ordering permutation for x
Basic Arithmetic Operations
x+y addition, “x plus y”
x-y subtraction, “x minus y”
x*y multiplication, “x times y”
x/y division, “x divided by y”
x^y exponentiation, “x raised to power y”
x %% y remainder, “x modulo y”
x %/% y integer division, “x divided by y, discard fractional part”
Rounding
round(x) round to nearest integer
round(x,d) round x to d decimal places
signif(x,d) round x to d significant digits
floor(x) round down to next lowest integer
ceiling(x) round up to next highest integer
Common Mathematical Functions
abs(x) absolute values
sqrt(x) square root
exp(x) exponential function
log(x) natural logarithms (base e)
log10(x) common logarithms (base 10)
log2(x) base 2 logarithms
log(x,base=b) base b logarithms
Trigonometric and Hyperbolic Functions
sin(x), cos(x), tan(x) trigonometric functions
asin(x), acos(x), atan(x) inverse trigonometric functions
atan2(x,y) arc tangent with two arguments
sinh(x), cosh(x), tanh(x) hyperbolic functions
asinh(x), acosh(x), atanh(x) inverse hyperbolic functions
Combinatorics
choose(n, k) binomial coefficients
lchoose(n, k) log binomial coefficients
factorial(x) factorials
lfactorial(x) log factorials
Special Mathematical Functions
beta(x,y) the beta function
lbeta(x,y) the log beta function
gamma(x) the gamma function
lgamma(x) the log gamma function
psigamma(x,deriv=0) the psigamma function
digamma(x) the digamma function
trigamma(x) the trigamma function
Bessel Functions
besselI(x,nu) modified Bessel Functions of the first kind
besselK(x,nu) modified Bessel Functions of the third kind
besselJ(x,nu) Bessel Functions of the first kind
besselY(x,nu) Bessel Functions of the second kind
Special Floating-Point Values
.Machine$double.xmax largest floating point value (1.797693 × 10^308)
.Machine$double.xmin smallest positive floating point value (2.225074 × 10^-308)
.Machine$double.eps machine epsilon (2.220446 × 10^-16)
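A brief sketch of what these constants mean in practice is given below. The particular comparisons are chosen for illustration only.

```r
eps <- .Machine$double.eps

1 + eps == 1                        # FALSE: eps is large enough to change 1
1 + eps / 2 == 1                    # TRUE: half of eps is lost in rounding
.Machine$double.xmax * 2            # Inf: overflow past the largest value

0.1 + 0.2 == 0.3                    # FALSE: binary rounding error
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: comparison with a tolerance
```

For this reason, tests of equality between computed floating point values are usually written with a tolerance, as all.equal() does.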
Basic Summaries
sum(x1,x2,...) sum of values in arguments
prod(x1,x2,...) product of values in arguments
min(x1,x2,...) minimum of values in arguments
max(x1,x2,...) maximum of values in arguments
range(x1,x2,...) range (minimum and maximum)
Cumulative Summaries
cumsum(x) cumulative sum
cumprod(x) cumulative product
cummin(x) cumulative minimum
cummax(x) cumulative maximum
Parallel Summaries
pmin(x1,x2,...) parallel minimum
pmax(x1,x2,...) parallel maximum
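The difference between the basic, cumulative and parallel summaries listed above can be seen on two short vectors. The values of a and b are arbitrary.

```r
a <- c(3, 8, 1)
b <- c(5, 2, 4)

min(a, b)     # 1: one overall minimum across all arguments
pmin(a, b)    # 3 2 1: element-by-element minimum
pmax(a, 4)    # 4 8 4: the single value 4 is recycled
cumsum(a)     # 3 11 12
cummax(a)     # 3 8 8
```

The parallel forms return a vector as long as their longest argument, whereas min() and max() always return a single value.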
Statistical Summaries
mean(x) mean of elements
sd(x) standard deviation of elements
var(x) variance of elements
median(x) median of elements
quantile(x) median, quartiles and extremes
quantile(x, p) specified quantiles
Uniform Distribution
runif(n) vector of n Uniform[0,1] random numbers
runif(n,a,b) vector of n Uniform[a,b] random numbers
punif(x,a,b) distribution function of Uniform[a,b]
qunif(x,a,b) inverse distribution function of Uniform[a,b]
dunif(x,a,b) density function of Uniform[a,b]
Binomial Distribution
rbinom(n,size,prob) a vector of n Bin(size,prob) random numbers
pbinom(x,size,prob) Bin(size,prob) distribution function
qbinom(x,size,prob) Bin(size,prob) inverse distribution function
dbinom(x,size,prob) Bin(size,prob) density function
Normal Distribution
rnorm(n) a vector of n N(0, 1) random numbers
pnorm(x) N(0, 1) distribution function
qnorm(x) N(0, 1) inverse distribution function
dnorm(x) N(0, 1) density function
rnorm(n,mean,sd) a vector of n normal random numbers with given mean and s.d.
pnorm(x,mean,sd) normal distribution function with given mean and s.d.
qnorm(x,mean,sd) normal inverse distribution function with given mean and s.d.
dnorm(x,mean,sd) normal density function with given mean and s.d.
Chi-Squared Distribution
rchisq(n,df) a vector of n χ² random numbers with degrees of freedom df
pchisq(x,df) χ² distribution function with degrees of freedom df
qchisq(x,df) χ² inverse distribution function with degrees of freedom df
dchisq(x,df) χ² density function with degrees of freedom df
t Distribution
rt(n,df) a vector of n t random numbers with degrees of freedom df
pt(x,df) t distribution function with degrees of freedom df
qt(x,df) t inverse distribution function with degrees of freedom df
dt(x,df) t density function with degrees of freedom df
F Distribution
rf(n,df1,df2) a vector of n F random numbers with degrees of freedom df1 & df2
pf(x,df1,df2) F distribution function with degrees of freedom df1 & df2
qf(x,df1,df2) F inverse distribution function with degrees of freedom df1 & df2
df(x,df1,df2) F density function with degrees of freedom df1 & df2
Matrices
matrix(x, nr=r, nc=c) create a matrix from x (column major order)
matrix(x, nr=r, nc=c, byrow=TRUE) create a matrix from x (row major order)
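The two fill orders can be compared directly. The sketch below uses the values 1:6 purely as an example and spells out the argument names in full.

```r
m_col <- matrix(1:6, nrow = 2, ncol = 3)                # filled down the columns
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # filled along the rows

m_col[1, ]   # 1 3 5
m_row[1, ]   # 1 2 3
```

Filling by row gives the same result as filling a 3 by 2 matrix by column and then transposing it with t().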
Matrix Dimensions
nrow(x) number of rows in x
ncol(x) number of columns in x
dim(x) vector containing nrow(x) and ncol(x)
Row and Column Indices
row(x) matrix of row indices for matrix x
col(x) matrix of column indices for matrix x
Naming Rows and Columns
rownames(x) get the row names of x
rownames(x) = v set the row names of x to v
colnames(x) get the column names of x
colnames(x) = v set the column names of x to v
dimnames(x) get both row and column names (in a list)
dimnames(x) = list(rn,cn) set both row and column names
Binding Rows and Columns
rbind(v1,v2,...) assemble a matrix from rows
cbind(v1,v2,...) assemble a matrix from columns
rbind(n1=v1,n2=v2,...) assemble by rows, specifying row names
cbind(n1=v1,n2=v2,...) assemble by columns, specifying column names
Matrix Subsets
x[i,j] submatrix, rows and columns specified by i and j
x[i,j] = v reset a submatrix, rows and columns specified by i and j
x[i,] submatrix, contains just the rows specified by i
x[i,] = v reset specified rows of a matrix
x[,j] submatrix, contains just the columns specified by j
x[,j] = v reset specified columns of a matrix
x[i] subset as a vector
x[i] = v reset elements (treated as a vector operation)
Matrix Diagonals
diag(A) extract the diagonal of the matrix A
diag(v) diagonal matrix with elements in the vector v
diag(n) the n×n identity matrix
Applying Summaries over Rows and Columns
apply(X,1,fun) apply fun to the rows of X
apply(X,2,fun) apply fun to the columns of X
Basic Matrix Manipulation
t(A) matrix transpose
A %*% B matrix product
outer(u, v) outer product of vectors
outer(u, v, f) generalised outer product
Linear Equations
solve(A, b) solve a system of linear equations
solve(A, B) same, with multiple right-hand sides
solve(A) invert the square matrix A
Matrix Decompositions
chol(A) the Choleski decomposition
qr(A) the QR decomposition
svd(A) the singular-value decomposition
eigen(A) eigenvalues and eigenvectors
Least-Squares Fitting
lsfit(X,y) least-squares fit with carriers X and response y
Factors and Ordered Factors
factor(x) create a factor from the values in x
factor(x,levels=l) create a factor with the given level set
ordered(x) create an ordered factor from the values in x
is.factor(x) true for factors and ordered factors
is.ordered(x) true for ordered factors
levels(x) the levels of a factor or ordered factor
levels(x) = v reset the levels of a factor or ordered factor
Tabulation and Cross-Tabulation
table(x) tabulate the values in x
table(f1,f2,...) cross tabulation of factors
Summary over Factor Levels
tapply(x,f,fun) apply summary fun to x broken down by f
tapply(x,list(f1,f2,...),fun) apply summary fun to x broken down by several factors
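A minimal sketch of tabulation and of summaries over factor levels follows. The vectors score and group are invented for the example.

```r
score <- c(10, 20, 30, 40, 50)
group <- factor(c("a", "b", "a", "b", "a"))

table(group)                 # counts per level: a 3, b 2
tapply(score, group, mean)   # a 30, b 30
tapply(score, group, sum)    # a 90, b 60
```

The result of tapply() is named by the factor levels, so individual groups can be picked out by name.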
Data Frames
data.frame(n1=x1,n2=x2,...) create a data frame
row.names(df) extract the observation names from a data frame
row.names(df) = v (re)set the observation names of a data frame
names(df) extract the variable names from a data frame
names(df) = v (re)set the variable names of a data frame
Subsetting and Transforming Data Frames
df[i,j] matrix subsetting of a data frame
df[i,j] = dfv reset a subset of a data frame
subset(df,subset=i) subset of the cases of a data frame
subset(df,select=i) subset of the variables of a data frame
subset(df,subset=i,select=j) subset of the cases and variables of a data frame
transform(df,n1=e1,n2=e2,...) transform variables in a data frame
merge(df1,df2,...) merge data frames based on common variables
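The following sketch chains subset(), transform() and merge() on two small data frames. The data frames people and weights and their columns are made up for illustration. The all.x argument of merge() is not listed above but is part of base R.

```r
people  <- data.frame(id = 1:4, height = c(150, 160, 170, 180))
weights <- data.frame(id = c(2, 4, 5), weight = c(55, 80, 70))

# Keep the cases taller than 155 and add a derived variable.
tall <- subset(people, subset = height > 155, select = c(id, height))
tall <- transform(tall, height_m = height / 100)

# merge() joins on the common variable id; unmatched rows are dropped.
both <- merge(tall, weights)                    # ids 2 and 4

# all.x = TRUE keeps every row of the first data frame, filling with NA.
all_tall <- merge(tall, weights, all.x = TRUE)  # ids 2, 3, 4; weight NA for id 3
```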
Reading Lines
readline(prompt="") read a line of input
readLines(file, n) read n lines from the specified file
readLines(file) read all lines from the specified file
Reading Vectors and Lists
scan(file, what = numeric()) read a vector or list from a file
Formatting and Printing
format(x) format a vector in a common format
sprintf(fmt, ...) formatted printing of R objects
cat(...) concatenate and print vectors
print(x) print an R object
Reading Data Frames
read.table(file, header=FALSE) read a data frame from a file
read.csv(file, header=FALSE) read a data frame from a csv file
Options for read.table and read.csv
header=TRUE/FALSE does first line contain variable names?
row.names=··· row names specification
col.names=··· variable names specification
na.strings="NA" entries indicating NA values
colClasses=NA the types associated with columns
nrows=··· the number of rows to be read
Writing Data Frames
write.table(x, file) write a data frame to a file
write.csv(x, file) write a data frame to a csv file
String Handling
paste(..., sep = " ", collapse = NULL) paste strings together
strsplit(x, split) split x on pattern split (returns a list)
grep(pattern, x) return subscripts of matching elements
grep(pattern, x, value = TRUE) return matching elements
sub(pattern, replacement, x) replace pattern with given replacement
gsub(pattern, replacement, x) globally replace
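The string functions above are illustrated below on an invented character vector s. The patterns are plain text here, although each of these functions also accepts regular expressions.

```r
s <- c("apple pie", "banana split", "cherry tart")

paste("a", "b", sep = "-")            # "a-b"
paste(c("x", "y"), collapse = "+")    # "x+y"
strsplit(s[1], " ")[[1]]              # "apple" "pie"
grep("an", s)                         # 2
grep("an", s, value = TRUE)           # "banana split"
sub("a", "A", s[2])                   # "bAnana split": first match only
gsub("a", "A", s[2])                  # "bAnAnA split": every match
```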
High-Level Graphics
plot(x, y) scatter plot
plot(x, y, type = "l") line plot
plot(x, y, type = "n") empty plot
Adding to Plots
abline(a, b) line in intercept/slope form
abline(h = yvals) horizontal lines
abline(v = xvals) vertical lines
points(x, y) add points
lines(x, y) add connected polyline
segments(x0, y0, x1, y1) add disconnected line segments
arrows(x0, y0, x1, y1, code) add arrows
rect(x0, y0, x1, y1, col) add rectangles filled with colours
polygon(x, y) add polygon(s)
Low-Level Graphics
plot.new() start a new plot/figure/panel
plot.window(xlim, ylim, ...) set up plot coordinates
Options to plot.window
xaxs="i" don’t expand x range by 8%
yaxs="i" don’t expand y range by 8%
asp=1 equal-scale x and y axes
Graphical Parameters
par(... ) set/get graphical parameters
Useful Graphical Parameters
mfrow = c(m,n) set up an m by n array of figures, filled by row
mfcol = c(m,n) set up an m by n array of figures, filled by column
mar=c(m1,m2,m3,m4) set the plot margins (in lines)
mai=c(m1,m2,m3,m4) set the plot margins (in inches)
cex=m set the basic font magnification to m
bg=col set the device background to col
Measuring Text Size
strwidth(x, "inches", font, cex) widths of text strings in inches
strheight(x, "inches", font, cex) heights of text strings in inches
Layouts
layout(mat,heights,widths) set up a layout
layout.show(n) show layout elements (up to n)
lcm(x) size specification in cm
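As a hedged sketch of how layout() is typically used, the code below sets up three figure regions: one wide panel on top and two panels side by side underneath. The layout matrix and the heights are arbitrary choices for this example, and pdf(NULL) is used only so that nothing is written to disk.

```r
pdf(NULL)                          # graphics device with no output file

# Each entry names the figure that occupies that cell of a 2 by 2 grid.
mat <- matrix(c(1, 1,
                2, 3), nrow = 2, byrow = TRUE)

n <- layout(mat, heights = c(1, 2))  # bottom row twice as tall as the top row
layout.show(n)                       # outline the n = 3 figure regions

invisible(dev.off())
```

Repeating a figure number in adjacent cells makes that figure span those cells, which is how panel 1 comes to cover the whole top row.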
Compound Expressions
{ expr1; ... ; exprn } compound expressions
Alternation
if (cond) expr1 else expr2 conditional execution
if (cond) expr conditional execution, no alternative
Iteration
for (var in vector) expr for loops
while (cond) expr while loops
repeat expr infinite repetition
next jump to end of enclosing loop
break break out of enclosing loop
Function Definition
function(args) expr function definition
var function argument with no default
var=expr function argument with default value
return(expr) return the given value from a function
missing(a) true if argument a was not supplied
Error Handling
stop(message) terminate a computation with an error message
warning(message) issue a warning message
on.exit(expr) save an expression for execution on function return
Language Computation
quote(expr) returns the expression expr unevaluated
substitute(arg) returns the expression passed as argument arg
substitute(expr,subs) make the specified substitutions in the given expression
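The sketch below shows quote() and both forms of substitute() in use. The helper function label is hypothetical, and eval() and deparse() are base R functions that do not appear in the list above.

```r
# quote() captures an expression without evaluating it.
e <- quote(a + b)
eval(e, list(a = 1, b = 2))                    # 3

# Inside a function, substitute(arg) recovers the expression the caller wrote.
label <- function(arg) deparse(substitute(arg))
label(x + y)                                   # "x + y"

# With a second argument, substitute() replaces symbols in an expression.
substitute(a + b, list(a = 1, b = quote(z)))   # 1 + z
```

This is the mechanism that lets plotting functions build default axis labels from the expressions passed to them.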
Interpolation
approx(x, y, xout) linear interpolation at xout using x and y
spline(x, y, xout) spline interpolation at xout using x and y
approxfun(x, y) interpolating linear function for x and y
splinefun(x, y) interpolating spline for x and y
Root-Finding and Optimisation
polyroot(coef) roots of polynomial with coefficients in coef
uniroot(f,interval) find a root of the function f in the given interval
optimize(f,interval) find an extreme of the function f in the given interval
optim(x,f) find an extreme of the function f starting at the point x
nlm(f,x) an alternative to optim
nlminb(x,f) optimization subject to constraints
Integration
integrate(f,lower,upper) integrate the function f from lower to upper
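A short sketch of the root-finding, optimisation and integration functions follows. The functions f and g are invented test cases with known answers, and the results are numerical approximations.

```r
# Root of x^2 - 2 on [0, 2]; the exact answer is sqrt(2).
f <- function(x) x^2 - 2
r <- uniroot(f, interval = c(0, 2))
r$root

# Minimum of (x - 1)^2 on [-3, 3]; the exact answer is 1.
g <- function(x) (x - 1)^2
m <- optimize(g, interval = c(-3, 3))
m$minimum

# Area under the standard normal density between -1.96 and 1.96 (about 0.95).
a <- integrate(dnorm, lower = -1.96, upper = 1.96)
a$value

# Roots of -2 + 0x + x^2, with coefficients given in increasing order of power.
p <- polyroot(c(-2, 0, 1))
p
```

Each function returns a list or vector of results rather than a single number, so the component of interest has to be extracted by name.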