
ENVX3002: Statistics in the Natural Sciences, 2020
Assessment 4
Updated 11th May 2020

DUE DATE: Week 13, 11:59 PM Friday 29th May 2020 – online submission via Canvas Turnitin.
WEIGHTING: 12.5%, mark is out of 20
1. The assessable exercises you submit must be your own work.
2. All answers are to be submitted via Turnitin as a Word file.
3. Combined answers should not be longer than 6 pages.
4. Marks will be deducted for sloppy presentation.
5. Include your R code either in situ in the document or as an Appendix.

Note: Anonymous marking has been introduced across the University. Do not put your name or SID on your submission.
The file “health-social-environment.csv” is a comma-separated text file containing health, social and environmental data from regional Australia. The data are grouped by Local Government Area (LGA), and each LGA has a single value for each health and socioeconomic variable. The data are from the Social Health Atlas of Australia. The aim is to link health (infectious disease rate) with socioeconomic and environmental factors.
Within each LGA, environmental and land attributes were sampled, typically at 20 locations per LGA. At each location, the climate, soil and land factors were recorded from maps of climate, soil and land use.
The dataset has 33 variables. They are:
1. LGAcode: (factor) Local Government Area code
2. Name: (character) Local Government Area name
Variables of interest
3. Pub_hosp_infect: (numeric) Infectious and parasitic disease public hospital admissions in 2011-2013. Calculated from the age-standardized rate of hospital admissions per 100,000 population.
4. Parasite Infection class: (factor) Class of parasitic disease public hospital admissions, from (1) Low to (5) Very high
Socioeconomic factors
5. IRSD: (numeric) Index of relative socioeconomic disadvantage; the higher the value, the greater the disadvantage
6. Population_dens: (numeric) Population density (persons/km²)
7. Overseas_born_NES: (numeric) Percentage of population born overseas (non-English speaking)
8. Age25-44: (numeric) Percentage of population in the age group 25-44 years
9. Age.gt.65: (numeric) Percentage of population in the age group > 65 years
10. Internet access: (numeric) Percentage of population with internet access from home
11. Financial stress: (numeric) Estimated number of people aged 18 years and over whose household could raise $2,000 within a week
12. Children developmentally vulnerable: (numeric) Percentage of children developmentally vulnerable on two or more domains
13. Fertility rate: (factor) a: medium; b: high
Environmental factors
14. rainfall_mean: (numeric) Mean rainfall (mm)
15. temp_mean: (numeric) Mean annual temperature (°C)
16. temp_diurnal: (numeric) Diurnal temperature range (°C)
17. Drought: (factor) Frequency of drought: (1) frequent, (2) less frequent, (3) not frequent
18. soil_cec: (numeric) Soil cation exchange capacity (meq/100 g)
19. soil_nitrogen: (numeric) Soil nitrogen content (g/kg)
20. clay: (numeric) Soil clay content (%)
21. gravel: (numeric) Soil gravel content (%)
22. soil hardness: (factor) Hardness of soil surface: (1) not hard, (2) hard
23. remoteness: (factor) Inner Regional Australia or Outer Regional Australia
24. prop_conservn: Proportion of conservation land, percentage (of 3 km radius area)
25. prop_natural: Proportion of natural land, percentage (of 3 km radius area)
26. prop_inland_waterbodies: Proportion of inland water bodies, percentage (of 3 km radius area)
27. prop_salt_lakes: Proportion of salt lakes, percentage (of 3 km radius area)

Questions:
The aim of this exercise is to find socioeconomic and environmental factors that may influence human health (infectious disease rate) in regional Australia.
Using a machine learning model (NOT linear regression), conduct an analysis on the data to answer the following questions:
(1) Perform an exploratory data analysis to examine the relationship between human health (infectious disease rate) and socioeconomic, environmental and land factors. (10%)
(2) Use a data split method to separate the data into calibration (70%) and validation (30%) datasets. Note: since each LGA is made up of 20 data rows (all with the same “health” values), you need to split the data based on LGA. (10%)
(3) Derive models to predict infectious and parasitic disease incidence (Pub_hosp_infect):
a. Using all socioeconomic and environmental factors.
b. Using environmental factors only.
Validate the models based on accuracy measures. (20%)
(4) According to your model, what are the top 5 socioeconomic and environmental factors that may affect infectious and parasitic disease incidence? Explain the likely relationships (e.g. via partial plots). Are they causative? (20%)
Hint: you can perform a sensitivity analysis by removing one predictor at a time to see whether it influences the overall validation statistic. Beware of correlated covariates: ML models can be agnostic to correlated variables. Also consider variables that appear in both IncMSE and NodePurity.
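
The LGA-based split in (2) and the accuracy measures in (3) can be illustrated with a minimal sketch. It is written in plain Python for illustration only (the assessment itself expects R), and everything in it is an assumption rather than a prescribed method: the helper names split_by_group and regression_metrics, the seed, the toy rows standing in for the CSV, and the choice of RMSE, bias and R² as accuracy measures.

```python
import math
import random


def split_by_group(rows, group_key, calib_frac=0.7, seed=42):
    """Split rows so that all rows of one group (here, one LGA) stay together."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(groups)
    n_calib = round(calib_frac * len(groups))
    calib_groups = set(groups[:n_calib])
    calib = [r for r in rows if r[group_key] in calib_groups]
    valid = [r for r in rows if r[group_key] not in calib_groups]
    return calib, valid


def regression_metrics(observed, predicted):
    """Return RMSE, mean error (bias) and R^2 for paired observed/predicted values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "rmse": math.sqrt(ss_res / n),
        "bias": sum(errors) / n,
        "r2": 1 - ss_res / ss_tot,     # undefined if all observed values are equal
    }


# Toy stand-in for the CSV: 10 made-up LGA codes with 20 rows each.
rows = [{"LGAcode": 100 + g, "location": i} for g in range(10) for i in range(20)]
calib, valid = split_by_group(rows, "LGAcode")

calib_lgas = {r["LGAcode"] for r in calib}
valid_lgas = {r["LGAcode"] for r in valid}
print(len(calib_lgas), len(valid_lgas), calib_lgas & valid_lgas)
```

Splitting on the LGA code rather than on individual rows keeps the 20 rows of an LGA, which share one health value, out of both datasets at once, so the validation statistics are not inflated by near-duplicate rows.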

(5) Using a categorical machine learning model, derive a model to predict the Parasite Infection class:
a. Using all socioeconomic, environmental and land factors.
b. Using only environmental and land factors.
Validate the model based on accuracy measures for categorical variables.
Calculate the sensitivity and specificity for the prediction of serious infection cases (combine classes 4 and 5) and discuss. (25%)
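
As a reminder of the definitions, sensitivity is the share of observed serious cases that are predicted as serious, and specificity is the share of observed non-serious cases that are predicted as non-serious. The sketch below shows the calculation in plain Python for illustration only (the assessment itself expects R). The function name, the integer coding of the classes as 1 to 5, and the observed and predicted vectors are all made-up assumptions, not data from the assessment file.

```python
def sensitivity_specificity(observed, predicted, positive_classes=(4, 5)):
    """Collapse the classes to serious / not serious, then compute both rates."""
    positive = set(positive_classes)
    tp = fn = tn = fp = 0
    for obs, pred in zip(observed, predicted):
        obs_pos = obs in positive
        pred_pos = pred in positive
        if obs_pos and pred_pos:
            tp += 1        # serious case correctly flagged
        elif obs_pos:
            fn += 1        # serious case missed
        elif pred_pos:
            fp += 1        # non-serious case flagged as serious
        else:
            tn += 1        # non-serious case correctly left unflagged
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical validation classes, coded 1 (Low) to 5 (Very high).
observed = [1, 2, 3, 4, 5, 5, 4, 2, 1, 3]
predicted = [1, 2, 4, 4, 4, 3, 5, 2, 2, 3]

sens, spec = sensitivity_specificity(observed, predicted)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Note that a prediction of class 4 for an observed class 5 counts as correct here, because both fall in the combined serious group, even though it would count as an error in the five-class accuracy.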

(6) Similar to (4), explain the top 5 socioeconomic and environmental factors that may affect parasite infection incidence. Are they causative? (15%)