
SOCS0055 Advanced Computational

Techniques for Data Science

Summative Assessment 100% - Autumn Term 2025-2026

1.   There are four tasks in this assessment that add up to 100 marks

2.   There is no word limit. We advise you to write concise, focused answers; otherwise, you may run out of time.

3.   To achieve full marks, no additional references/bibliography are required or expected. However, if you use them, list them at the end of the assessment following an accepted referencing style (check these UCL guidelines)

4.   AI can be used as an assistive tool. However, it is not permitted to generate any part of your exam. Use of AI needs to be acknowledged at the end of the exam.

5.   The assessment is split into two parts, A and B. For Part A, students must create their own wrangled version of the UKHLS dataset. For Part B, a cleaned version of the dataset is provided.

a.   Instructions to download the full USoc data from the UK Data Service are provided in an Appendix. Note, these instructions will download data for both USoc and its predecessor, the British Household Panel Survey (BHPS), but we only want you to use USoc data in this assessment.

b.   For Part B, a cleaned version of the dataset is provided in the Moodle assessment section.

The specific outputs required differ between Parts A and B. You must submit:

 

•    Coversheet.docx

Part A

•    “partA_task1.qmd” (Quarto file containing code)

•    “partA_task2.qmd” (Quarto file containing code)

•    “partA_task2_table.html” (Output table as an HTML file)

•    “partA_task3.qmd” (Quarto file containing code)

Part B

•    “partB_task4.html” (html file with your explanation, your code and your output)

•    “partB_task4.qmd” (Quarto file of your task)

The individual Quarto files need to be self-contained and must be capable of running in full, from scratch, in a new, ‘clean’ session (i.e., from loading packages and data to performing analyses). The code should also be well formatted, with comments, sections, sensible variable names, and so on.

Your final submission must be a zip folder containing the above files and everything necessary to run the code (except the data, which your examiners will have access to already). Please name this file “[CandidateNumber].zip”, where you replace [CandidateNumber] with your anonymous candidate number (not your student ID). You can use the course materials as an example.

Part A

Task 1 (25% of Marks)

Create a “long” dataset with one row per pidp x wave combination from all of the adult interview (*_indresp.dta) files. The dataset should have rows only for those participants who appeared in the a_indresp.dta file (i.e., completed an adult interview in Wave 1 of USoc).

The dataset should have cleaned versions of current smoking status (binary, derived from *_smever and *_smnow), obesity (binary, BMI ≥ 30 kg/m², derived from *_bmi_dv), SF-12 physical and mental component summary scores (*_sf12pcs_dv and *_sf12mcs_dv), overall life satisfaction, age at interview (*_age_dv), date of interview (month-year), sex (*_sex_dv), whether the participant has a degree or above level education (binary, derived from *_hiqual_dv), and employment status (binary, derived from *_jbhas and *_jboff).

Include a step to save the cleaned dataset using the saveRDS() function — you will need to load and use this dataset in Tasks 2 and 3.

The stubs of most of the variables needed to create these derived variables are provided above, but for overall life satisfaction and date of interview, you are required to find the relevant variables yourself. Note, some of the variables are only available in some waves but not all. See this website for a helpful tool which provides variable names and waves of collection. You can alternatively use the labelled::lookfor() function.
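
For example, a quick way to search variable names and labels within a loaded wave is sketched below. This is illustrative only: the file path and search term are assumptions, and recent versions of the labelled package spell the function look_for() (older versions used lookfor()).

```r
library(haven)     # read_dta()
library(labelled)  # look_for() (lookfor() in older versions)

# Illustrative only: path and search term are assumptions
a_indresp <- read_dta("UKDA-6614-stata/stata/ukhls/a_indresp.dta")

# Search variable names and labels for anything mentioning "satisf"
look_for(a_indresp, "satisf")
```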

All that is required for this question is the code used to complete this task, provided in a Quarto (.qmd) document. Marks will be awarded to students who reduce the amount of code written to complete this task, compared with writing out all instructions manually (e.g., using functions called repeatedly or combining data to clean variables in one fell swoop), and to students who include helpful comments explaining their code.
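
The sketch below illustrates the kind of compact, function-based approach that earns these marks. It is not the required solution: the file paths, the variable list (including the interview date variables), and the recoding rules are assumptions for illustration, and several derived variables are omitted for brevity.

```r
library(haven)    # read_dta(), zap_labels()
library(dplyr)
library(purrr)
library(stringr)

# Illustrative assumptions: data location and wave letters a-n (Waves 1-14)
data_dir <- "UKDA-6614-stata/stata/ukhls"
waves    <- letters[1:14]

# Read one wave, strip the wave prefix, and keep only the variables we need
read_wave <- function(w) {
  read_dta(file.path(data_dir, paste0(w, "_indresp.dta"))) |>
    rename_with(~ str_remove(.x, paste0("^", w, "_"))) |>
    select(any_of(c("pidp", "smever", "smnow", "bmi_dv",
                    "sf12pcs_dv", "sf12mcs_dv", "age_dv", "sex_dv",
                    "hiqual_dv", "jbhas", "jboff",
                    "istrtdaty", "istrtdatm"))) |>   # interview date: assumed variable names
    mutate(across(everything(), zap_labels),
           wave = match(w, letters))
}

usoc_long <- map(waves, read_wave) |> bind_rows()

# Keep only participants who completed an adult interview in Wave 1
wave1_ids <- usoc_long |> filter(wave == 1) |> distinct(pidp)
usoc_long <- usoc_long |> semi_join(wave1_ids, by = "pidp")

# Illustrative recodes only: negative USoc codes treated as missing
# (date of interview, life satisfaction, degree and employment recodes omitted for brevity)
usoc_long <- usoc_long |>
  mutate(across(-c(pidp, wave), ~ if_else(.x < 0, NA_real_, as.numeric(.x))),
         smoker = case_when(smnow == 1 ~ 1L,
                            smever == 2 | smnow == 2 ~ 0L),
         obese  = if_else(bmi_dv >= 30, 1L, 0L))

saveRDS(usoc_long, "usoc_long_clean.rds")   # reused in Tasks 2 and 3
```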

Task 2 (10% of marks)

Load the dataset from Task 1 and create a ‘Table 1’ which shows descriptive statistics for each variable in the dataset. This table should report statistics separated by study wave (i.e., in separate columns). For categorical variables, report sample sizes and proportions (as % of non-missing) in each category. For continuous variables, report mean and standard deviation.

Format the table so that variable names are descriptive and human interpretable rather than repeating the R column names. Also remove any variables that are not helpful to report.

Save the table as an .html file. You can use whatever package you wish to create the table (e.g., gt, gtsummary, flextable, and so on).

Along with the code used to complete this task (and helpful explanations) in a Quarto (.qmd) file, we require the saved table file(s). Marks will be awarded for completing the task successfully (e.g., displaying the correct descriptive statistics), but also for producing a table that is ‘publication-ready’ – i.e., well formatted, aesthetically pleasing, and understandable.
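
One possible approach is sketched below using gtsummary. The file name and variable labels follow the illustrative Task 1 sketch above and are assumptions, not requirements.

```r
library(dplyr)
library(gtsummary)
library(gt)

usoc_long <- readRDS("usoc_long_clean.rds")

tab1 <- usoc_long |>
  select(-pidp) |>                 # drop identifiers not worth reporting
  mutate(wave = factor(wave)) |>
  tbl_summary(
    by = wave,                     # one column of statistics per study wave
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    label = list(                  # illustrative, human-readable labels
      age_dv ~ "Age at interview (years)",
      smoker ~ "Current smoker"
    ),
    missing = "no"
  )

tab1 |>
  as_gt() |>
  gtsave("partA_task2_table.html")
```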

Task 3 (15% of marks)

Load the dataset from Task 1 and run a series of cross-sectional regressions for each (valid) combination of:

-    Outcome: smoking status, SF-12 MCS, SF-12 PCS, obesity, and life satisfaction

-    Exposure: employment status and degree qualification

-    Control variables: age, age-squared and date of interview

-    Wave: 1-14, separately.

-    Sex: males, females, and males and females combined

Store the results of the regressions in a single tibble containing coefficients, standard errors, and upper and lower confidence intervals for the exposure of interest, as well as details on the parameters of the regression (outcome, exposure, … , sex used). For each regression, use OLS regression (lm()), regardless of whether the outcome variable is continuous or binary.

As with Task 1, all that is required for this question is the code used to complete this task, provided in a Quarto (.qmd) document. Marks will again be awarded to students who reduce the amount of code written to complete this task, compared with writing out all instructions manually, and to students who include helpful comments explaining their code.
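
A minimal sketch of one way to run the full grid of models is shown below. The column names (smoker, obese, lifesat, employed, degree, int_date, sex_lab, and so on) follow the illustrative Task 1 sketch or are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

usoc <- readRDS("usoc_long_clean.rds")

# Every (outcome, exposure, wave, sex) combination required by the task
grid <- expand_grid(
  outcome  = c("smoker", "sf12mcs_dv", "sf12pcs_dv", "obese", "lifesat"),
  exposure = c("employed", "degree"),
  wave     = 1:14,
  sex      = c("male", "female", "all")
)

fit_one <- function(outcome, exposure, wave, sex) {
  dat <- usoc |> filter(wave == .env$wave)
  if (sex != "all") dat <- dat |> filter(sex_lab == .env$sex)  # sex_lab: assumed "male"/"female" column

  f   <- reformulate(c(exposure, "age_dv", "I(age_dv^2)", "int_date"), response = outcome)
  fit <- lm(f, data = dat)

  tidy(fit, conf.int = TRUE) |>
    filter(startsWith(term, exposure)) |>     # keep only the exposure coefficient
    transmute(outcome, exposure, wave, sex,
              estimate, std.error, conf.low, conf.high)
}

results <- grid |> pmap(fit_one) |> list_rbind()
```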

Part B

Task 4 (50% of marks)

The datafile “Brexit.RData” contains data from Understanding Society, wave 8. It includes approx. 25,000 individuals and information on the following characteristics:

•   pidp: Person ID

•   bornuk_dv: Born in UK

•   gor_dv: Region of residence

•   brexit_leave: Intention to vote “leave” (1 means intention to vote “leave”)

•   age_dv: Age

•   sex_male: Male

•   marstat_dv: Marital status

•   migback_gen: Migration background

•   ethn_dv: Ethnic group

•   hiqual_dv: Highest level of education

•   nkids_dv: Number of kids

•   jobstat: Employment status

•   unemployed: Unemployed

•   fihhmnnet1_dv: Household labour income

•   hh_inc_oecd: Household equivalence income

•   ind_lab_inc: Individual labour income

•   tenure_dv: Housing tenure

•   area_rural: Lives in rural area

•   lkmove: Would like to move to a new place of residence

•   financial_diff_fut: Subjective financial situation - future (Looking ahead, how do you think you will be financially a year from now, will you be)

•   financial_diff_now: Subjective financial situation - current (How well would you say you yourself are managing financially these days? Would you say you are)

•   health: General health

•   distress: Level of distress (0 indicating the least amount of distress; 36 indicating the greatest amount of distress)

•   lifesat: Satisfaction with life overall (1 Completely dissatisfied; 7 Completely satisfied)

•   deprivation: Household-level material deprivation (0 no deprivation - 100 highest level of deprivation)

•   problems_bills: Having trouble paying the bills

•   problems_counciltax: Having trouble paying the council tax

•   nbh_deprivation: Neighbourhood deprivation percentile (1 lowest level of deprivation, 10 highest level of deprivation)

•   nbh_foreign: Neighbourhood share of foreigners (1 lowest percent of foreign residents, 10 highest percent of foreign residents)

•   nbh_above65:  Neighbourhood share of residents above 65 (1 lowest percent of residents above 65, 10 highest percent of residents above 65)

Factors are labelled in the data. For binary indicators, 1 means yes and 0 means no.

Here is your task: Build a machine learning algorithm to predict the intention to vote “Leave” in the Brexit referendum (brexit_leave). Your main aim is to maximise out-of-sample accuracy. You can use any algorithm that we have covered in our module, and you can also use different algorithms if you can justify your decision.

Your submission should include an HTML document that clearly addresses the following:

A) Data Preparation

•  Describe any recoding or preprocessing steps you performed.

B)  Sample Size

•  Report the final sample size used in your analysis.

•  If it differs from the original sample, explain why.

C)  Train/Test Split

•  Explain how you split the data into training and test sets.

D) Hyperparameter Tuning

•  Describe how you selected or tuned hyperparameters for your model(s). If you used cross-validation, briefly explain what you did. The code for CV needs to be executable.

E)  Prediction Methods

•  Briefly explain which machine learning methods you used to generate predictions and why you selected those. You can also compare the performance of multiple algorithms to make your final choice — if so, include the comparison in your code.

F)  Performance Evaluation

•  Select a final model that is your preferred choice for predicting the outcome. Your final model must be recognizable in the code – so that we can use it for prediction with our hold-out sample.

•  Report your final performance metric (e.g., accuracy, AUC, MSE).

G) Feature Importance

•  Provide a summary of feature importance and a short explanation of the key findings from your preferred model.

Your document should make clear how you selected your final prediction model and provide evidence of its out-of-sample performance using the provided data.

Note: The primary goal is out-of-sample prediction accuracy.
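
As an illustration of the kind of workflow expected, the sketch below uses a random forest via tidymodels and the ranger engine — one of many acceptable choices. The object names, the assumption that Brexit.RData contains a data frame called brexit, and the preprocessing steps are all illustrative, not requirements.

```r
library(tidymodels)
library(vip)        # variable importance plots

load("Brexit.RData")                      # assumed to load a data frame called `brexit`
set.seed(2025)

brexit <- brexit |>
  mutate(brexit_leave = factor(brexit_leave)) |>
  drop_na(brexit_leave)                   # the outcome must be observed

# C) Train/test split, stratified on the outcome
split <- initial_split(brexit, prop = 0.8, strata = brexit_leave)
train <- training(split)
test  <- testing(split)

# D) 5-fold cross-validation for hyperparameter tuning
folds <- vfold_cv(train, v = 5, strata = brexit_leave)

rec <- recipe(brexit_leave ~ ., data = train) |>
  update_role(pidp, new_role = "id") |>   # keep the ID out of the predictors
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors())

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) |>
  set_mode("classification") |>
  set_engine("ranger", importance = "impurity")

wf <- workflow() |> add_recipe(rec) |> add_model(rf_spec)

tuned <- tune_grid(wf, resamples = folds, grid = 10,
                   metrics = metric_set(accuracy, roc_auc))

# F) Finalise on the best CV accuracy and evaluate once on the held-out test set
final_wf  <- finalize_workflow(wf, select_best(tuned, metric = "accuracy"))
final_fit <- last_fit(final_wf, split)
collect_metrics(final_fit)                # out-of-sample accuracy and AUC

# G) Feature importance from the fitted model
final_fit |> extract_fit_parsnip() |> vip(num_features = 15)
```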

For Part B of your coursework, please submit:

•  A document with a narrative explanation of your approach (HTML / DOC / PDF)

•  Your Quarto / R script with fully replicable code (must run on a different device)

This code should be self-contained and be capable of running in full from scratch – i.e., from a clean R session, to loading packages, wrangling the provided data, and producing and saving relevant outputs (e.g., plots or tables).

We will test your model’s performance on hold-out data that is not included in the dataset you received. Extra marks will be awarded to submissions that achieve a high prediction accuracy on this hold-out data (defined as being within a reasonable range of the best-performing model).
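
For example, one way to make the final model easy for examiners to reuse is sketched below, continuing the illustrative tidymodels objects from the sketch above; the hold-out file name is hypothetical.

```r
# Refit the finalised workflow on the full provided dataset and save it
final_model <- fit(final_wf, data = brexit)
saveRDS(final_model, "partB_final_model.rds")

# The saved workflow can then be applied directly to new data, e.g.:
# holdout <- readRDS("holdout.rds")                       # hypothetical hold-out file
# predict(final_model, new_data = holdout, type = "class")
```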


