STATS762 Regression for Data Science

Assignment 3

Due date: 10am, 1 June 2020

Instructions

• Please submit both your R Markdown document and a PDF file containing the document it generates. To create a PDF, start your R Markdown document with the following lines (having made the appropriate changes):

---
title: "STATS 762 Assignment 3"
author: "Your Name, ID 1234567"
date: "Due: 10am, 1 June 2020"
output: pdf_document
---

• Call set.seed() at the start of your R script so that the same output is obtained when any simulation is rerun.

• All answers should be written with corresponding question numbers.

• Working must be shown.

• Each answer should be stated explicitly; R code on its own does not constitute an answer.

For example, suppose the question asks for the average height of 6 trees: (1, 2, 1, 3, 1.5).

[Table: Good answer | Bad answer]

• If any of the above is not satisfied, a penalty may be applied.
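As an illustration of the points on reproducibility and explicit answers, the sketch below shows one way a written answer could accompany its working, using the five heights listed in the example. The seed value and the object names are arbitrary choices made here for illustration, not requirements of the assignment.

```r
# Fix the random seed first so that any simulation-based output is reproduced
# when the document is knitted again (762 is an arbitrary illustrative value).
set.seed(762)

# Working: the heights listed in the example above
heights <- c(1, 2, 1, 3, 1.5)
avg_height <- mean(heights)

# State the answer in words rather than leaving only the code
cat("The average height of the trees is", avg_height, "\n")
```

The final line is what turns the working into an answer: the number is reported in a sentence tied to the question.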

1. The spreadsheet avocado2.csv contains 338 historical records of avocado sales in various markets in California, US. The attributes are as follows:

Total.Volume - total number of avocados sold
AveragePrice - average price of a single avocado
type - production type (organic or conventional)

A researcher wants to investigate how the sales volume relates to the average price and the production type (organic/conventional). Total.Volume is log-transformed so that a linear regression model can be fitted with AveragePrice and type as predictors.
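The log-scale model described above can be written out as follows. This is only a sketch assuming an additive model without an interaction term; the symbols V (Total.Volume), p (AveragePrice) and the quantile-specific coefficients are notation introduced here, not taken from the assignment.

```latex
Q_\tau\bigl(\log V \mid p, \text{type}\bigr)
  = \beta_0(\tau) + \beta_1(\tau)\,p
    + \beta_2(\tau)\,\mathbb{1}\{\text{type} = \text{organic}\}
\quad\Longrightarrow\quad
Q_\tau\bigl(V \mid p, \text{type}\bigr)
  = \exp\bigl\{\beta_0(\tau) + \beta_1(\tau)\,p
    + \beta_2(\tau)\,\mathbb{1}\{\text{type} = \text{organic}\}\bigr\}
```

The implication holds because quantiles, unlike means, are preserved under increasing transformations such as the exponential.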

(a) Explain why log-transforming the total number of avocados sold is useful when modelling a quantile with linear regression. [2 marks]

(b) Find a suitable linear regression model for the 0.2 quantile of log(Total.Volume), and give an expression for the typical 0.2 quantile of the total number of avocados sold for a given price and production type. [5 marks]

(c) Find a suitable linear regression model for the 0.8 quantile of log(Total.Volume), and give an expression for the typical 0.8 quantile of the total number of avocados sold for a given price and production type. [5 marks]

(d) Using your model, predict the 0.2 quantile of total sales for conventional avocados priced at $1.2 and for organic avocados priced at $1.8. [1 mark]

(e) At what price of conventional avocados do 80% of markets sell at most 5.4 million avocados? [3 marks]
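The mechanics of fitting and back-transforming a quantile regression can be sketched as below. This is not a solution: it runs on simulated data whose coefficients are invented here, it assumes the quantreg package is installed, and it assumes an additive model, which may not be the most suitable one for the real avocado2.csv data.

```r
library(quantreg)  # assumed installed; provides rq() for quantile regression

set.seed(762)
n <- 338

# Simulated stand-in for avocado2.csv (column names follow the assignment,
# but all values and coefficients here are made up for illustration)
sim <- data.frame(
  AveragePrice = runif(n, 0.8, 2.2),
  type = factor(sample(c("conventional", "organic"), n, replace = TRUE))
)
sim$Total.Volume <- exp(14 - 1.2 * sim$AveragePrice -
                          2 * (sim$type == "organic") + rnorm(n, sd = 0.5))

# Linear model for the 0.2 quantile of log(Total.Volume)
fit20 <- rq(log(Total.Volume) ~ AveragePrice + type, tau = 0.2, data = sim)

# Predict on the log scale, then exponentiate to return to the count scale
new_markets <- data.frame(
  AveragePrice = c(1.2, 1.8),
  type = factor(c("conventional", "organic"), levels = levels(sim$type))
)
q20_log <- predict(fit20, newdata = new_markets)
q20_volume <- exp(q20_log)
```

Changing tau to 0.8 gives the corresponding upper-quantile fit; summary(fit20) reports the estimated coefficients.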

2. The spreadsheets banktrain.csv and banktest.csv relate to the direct marketing campaigns of a bank. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to or not. The aim is to predict whether a client will subscribe to a term deposit (variable y).

The attributes are as follows:

gender - gender (categorical: "male", "female")
age - age (numeric)
marital - marital status (categorical: "married", "divorced", "single")
education - education level of the client (categorical: "unknown", "secondary", "primary", "tertiary")
default - credit account status (categorical: "yes", "no")
balance - average yearly balance, in euros (numeric)
housing - housing loan status (categorical: "yes", "no")
loan - personal loan status (categorical: "yes", "no")
contact - contact communication type (categorical: "unknown", "telephone", "cellular")
duration - duration of the last contact, in seconds (numeric)
campaign - number of contacts performed during this campaign for this client (numeric)
previous - number of contacts performed before this campaign for this client (numeric)
poutcome - outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")
y - has the client subscribed to a term deposit? (categorical: "yes", "no")

We use the training data (banktrain.csv) to find a model and the test data (banktest.csv) to examine the predictive performance of the model. Note that the number of cross-validation folds is 10.

The function in make.r recodes a data set so that each categorical variable is replaced by indicator variables corresponding to its levels. It returns a list with two objects: the recoded data (data) and a vector of group memberships (gpname).
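To make the description above concrete, the following sketch shows the kind of recoding it refers to. The helper name make_indicators and the column naming scheme are hypothetical; the actual function in make.r may differ in its name, arguments and output details.

```r
# Hypothetical stand-in for the helper in make.r (the course file may differ).
# Each categorical column becomes one 0/1 indicator per level; gpname records
# which original variable every output column came from.
make_indicators <- function(df) {
  cols <- list()
  gpname <- character(0)
  for (v in names(df)) {
    x <- df[[v]]
    if (is.character(x) || is.factor(x)) {
      x <- factor(x)
      for (lev in levels(x)) {
        cols[[paste0(v, "_", lev)]] <- as.numeric(x == lev)
        gpname <- c(gpname, v)
      }
    } else {
      # Numeric columns are kept as they are and form their own group
      cols[[v]] <- x
      gpname <- c(gpname, v)
    }
  }
  list(data = as.data.frame(cols, check.names = FALSE), gpname = gpname)
}

# Toy input using two of the attribute names listed above
toy <- data.frame(age = c(30, 45, 52),
                  marital = c("married", "single", "married"),
                  stringsAsFactors = FALSE)
out <- make_indicators(toy)
```

Here out$data has the columns age, marital_married and marital_single, and out$gpname marks the last two as belonging to the same group, which is the information a grouped penalty needs.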

(a) Using the training data, complete the following.

i. Using an appropriate penalty on model complexity, find the model that minimizes the cross-validation error. Show how you found the model and describe it in terms of the client characteristics included. [4 marks]

ii. Using an appropriate penalty on model complexity, find a parsimonious model. Show how you found the model and describe it in terms of the client characteristics included. [4 marks]

(b) Estimate the predictive performance of each model using an appropriate measure, and compare the two. [3 marks]

(c) Using your parsimonious model, describe the type of client who is very likely to subscribe to a term deposit. [3 marks]

(d) If a marketing campaign were to focus on a single client characteristic, which feature would be most likely to make the campaign succeed? [3 marks]
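The workflow in (a) and (b) can be sketched as below, under several assumptions. It uses an ordinary lasso penalty from the glmnet package (assumed installed) rather than a grouped penalty, which the gpname vector suggests may be what the course intends. It also runs on simulated data with invented coefficients and uses the misclassification rate as one possible measure.

```r
library(glmnet)  # assumed installed; provides cv.glmnet()

set.seed(762)
n <- 400

# Simulated stand-in for indicator-coded bank data (all values are made up)
x <- cbind(duration = rexp(n, 1 / 250),
           age = round(runif(n, 18, 80)),
           housing_yes = rbinom(n, 1, 0.5),
           noise = rnorm(n))
p <- plogis(-2 + 0.006 * x[, "duration"] - 0.8 * x[, "housing_yes"])
y <- factor(ifelse(rbinom(n, 1, p) == 1, "yes", "no"))

train <- 1:300
test <- 301:400

# 10-fold cross-validation over the lasso path for a logistic model
cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial", nfolds = 10)

# lambda.min minimizes the CV error; lambda.1se is the largest penalty whose
# CV error is within one standard error of that minimum (a sparser model)
coef_min <- coef(cvfit, s = "lambda.min")
coef_1se <- coef(cvfit, s = "lambda.1se")

# Test-set misclassification rate for each choice of penalty
pred_min <- predict(cvfit, newx = x[test, ], s = "lambda.min", type = "class")
pred_1se <- predict(cvfit, newx = x[test, ], s = "lambda.1se", type = "class")
err_min <- mean(as.vector(pred_min) != as.character(y[test]))
err_1se <- mean(as.vector(pred_1se) != as.character(y[test]))
```

Printing coef_min and coef_1se shows which variables each penalty keeps; coefficients shrunk exactly to zero correspond to client characteristics dropped from the model.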
