
Inf2-FDS Coursework 1 - Data wrangling and visualisation

 

 
Released: Monday, 25 October 2021
 
Submission deadline: Friday, 5 November, 16:00 UK time
 
Late submission rules
This coursework uses the Informatics Late Submission of Coursework Rule 3, with a maximum 6-day extension:
 
Extensions, Extra Time Adjustment (ETA) for Extra Time and for Extra Time for Proof Reader/Interpreter are permitted, but cannot be combined. The maximum extension is up to 6 days, or fewer if specified.
 
Penalty: If assessed coursework is submitted late without an approved ETA extension, it will be recorded as late and a penalty of 5% per calendar day will be applied for up to the specified number of calendar days, after which a mark of zero will be given.
 
For electronic submissions, the last version that has been submitted by the deadline will be the one that is marked (late submission will only be accepted if no submission in time has been made).
 
If a student with an extension or either type of ETA submits late beyond the specified extended deadline a mark of zero will be given.  
 
It is very important that you read and follow the instructions below to the letter: you will be deducted marks for not adhering to the advice below.
 
Good Scholarly Practice
Please remember the University requirement as regards all assessed work for credit. Details about this can be found at: http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct
 
Specifically, this coursework must be your own work. We want you to be able to discuss the class material with each other, but the coursework you submit must be your own work. You are free to form study groups and discuss the concepts related to, and the high-level approach to the coursework. You may never share code or share write-ups. It is also not permitted to discuss this coursework on Piazza. The only exception is that if you believe there is an error in the coursework, you may ask a private question to the instructors, and if we feel that the issue is justified, we will send out an announcement.
 
Assessment information and criteria
This assignment accounts for 20% of the grade for this course.
It is graded based on the PDF export from a Jupyter notebook (see General instructions below), which you are to submit via Gradescope (see Assessment/Coursework 1 - Data wrangling and visualisation folder in Learn).
The assignment is marked out of 100 and the number of points is indicated by each question.
We will assess your work on the following criteria:
functional code that performs the computations asked for, as measured by verifying some of the numeric outputs from your processing and reading your code if there is doubt about what you have done
the quality of the visualisations, measured against the Visualisation principles and guidance handout in the S1 Week 5 workshop (PDF available in S1 Week 5 folder in Learn)
the quality of your textual comments - as measured by how accurate, complete and insightful they are.
General instructions
Read the instructions carefully, answering what is required and only that.
 
Fill in your answers in the cells indicated. You may delete text like "Your answer to Q1.2 goes here". Do not edit or delete any other cells.
 
Keep your answers brief and concise.
 
For answers involving visualisations, make sure to label them clearly and provide legends where necessary.
 
For answers involving numerical values, use correct units where appropriate and format floating point values to a reasonable number of decimal places.
 
Once you have finished a question, use a Jupyter notebook server to export your PDF, by selecting File->Download as->PDF via LaTeX (.pdf).
 
Check this PDF document looks as you expect it to. If changes are needed, update the notebook and export again.
 
Once you have finished all the questions, submit the final PDF using the submission instructions in the Assessment->Coursework 1 - Data wrangling and visualisation folder in Learn. Please allow enough time to upload your PDF before the deadline.
 
Coursework questions
This coursework consists of five questions, divided into three parts (A, B and C). Complete all questions.
 
We ask for multiple types of responses:
 
Numeric responses, which we will use to verify that the processing has been done correctly
Visualisations, which will be assessed using the parts of the Visualisation principles and guidance PDF that apply to the visualisation in question
Comments on your findings, which may ask you to describe, explain or interpret your results, and which will be assessed on how accurate, complete and insightful they are.
Throughout the coursework, a decade refers to ten years starting from a year divisible by ten, e.g. 1850-1859 (1850s), 1860-1869 (1860s), etc.
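As a minimal sketch of this decade convention, a year can be mapped to the start of its decade with integer division (the column and variable names below are illustrative only):

```python
import pandas as pd

# Map each year to the first year of its decade via integer division.
years = pd.Series([1850, 1859, 1860, 1999])
decades = (years // 10) * 10
print(decades.tolist())  # [1850, 1850, 1860, 1990]
```

Grouping by such a decade column is one way to compute per-decade averages later in the coursework.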
 
In order to understand the meaning of the datasets, you may need to follow the links citing the data.
 
Read through all of a question before starting, as some parts build on each other.
 
Good luck - we hope you enjoy it!
 
# Imports - run this cell first, and add your own imports here.
# You can use any package you want, but we suggest you stick to ones used in the labs
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.dpi'] = 300 # Make figures have reasonable resolution when exporting
Part A - Identifying and correcting bad visualisation practices
Designing a good visualisation can be time-consuming, but is important for clear communication. In this part we will ask you to identify and correct bad visualisation practices. (Note that we'd like you to continue using good visualisation practices in other parts of this coursework too!)
 
Question 1 (15 points)
The code in the cell below sets up a small data frame that relates cities, the population density of the city, the number of universities in the city and the percentage of the population that commutes by bicycle in the city.
 
# This dataframe holds the data. The columns explain the values.
data = pd.DataFrame(columns=['City', 
                              'pop. density [k/km^2]', 
                              '# universities',
                              '% commuting by bike'])
data.loc[0] = ['London', 5.701, 40, 3.62]
data.loc[1] = ['Edinburgh', 1.830, 6, 4.3]
data.loc[2] = ['Glasgow', 3.400, 5, 1.31]
 
data
The next cell contains code for a plot that is an example of bad visualisation practice. It shows the population density in 1000s per km², the number of universities and the percentage of people commuting by bicycle for London (red), Edinburgh (green) and Glasgow (yellow). The plot is also given here for reference (bad_example_plotting.png).
 
# This is the figure for you to improve on. 
fig, ax1 = plt.subplots(figsize=(20, 20))
ax1.plot(list(data.iloc[0])[1:], color='r')
ax1.plot(list(data.iloc[1])[1:], color='g')
ax1.plot(list(data.iloc[2])[1:], color='y')
Question 1.1 (8 points)
Run the code above to produce the plot. Now make an improved version of the plot, so someone presented with it could easily interpret it. There are at least five changes you can make to improve it. Remember that when you export this notebook to a PDF, it will have a text width of 6 inches.
 
# Your code for Q1.1 goes here
Question 1.2 (5 points)
List the changes you have made. Explain briefly why each change is an improvement.
 
Your answer to Q1.2 goes here
 
Question 1.3 (2 points)
Describe the trends that your improved plot shows.
 
Your answer to Q1.3 goes here
 
Part B - Cleaning and exploring UK weather data
Questions 2 and 3 consider historical weather data in the UK. The data comes from https://www.metoffice.gov.uk/research/climate/maps-and-data/historic-station-data, but has been processed to get you started. The data consists of separate files for each station, with each data file having a header with coordinates and information on how to interpret the data. We use only the columns giving the year, rainfall and sunshine. Some of the data is missing. Some values are marked to denote them being special in some way, as described in the header of each file.
 
Question 2 (25 points)
In question 2 we ask you to read in the weather data, explore it, and look for trends. To make trends clear, we will ask you to average across all sites. To make trends even clearer, we will ask you to smooth out the localised highs and lows of individual years by averaging over decades. We also want you to consider whether there are any limitations of the data that should be considered.
 
The data for this question is in the folder weather_sites, and will need some cleaning.
 
Questions 2.1 and 2.2 can be solved together, so make sure you read through both before starting.
 
Question 2.1 (10 points)
Read in and clean the data in the folder weather_sites. Then compute:
 
The mean rainfall in mm/month measured at UK weather stations for the decades 1850s and 1990s (one number per decade). Give the answers to 2 decimal places.
The mean sunlight in hours/month measured at UK weather stations for the years 1899 and 1999 (one number per year). Give the answers to 2 decimal places.
The number of data rows you are using (the number of rows in the Pandas DataFrame you used to compute the mean rainfall and mean sunlight).
To make your answers clear, please either write out the answer in the cell for "your written answer" below the code cell, or make sure that your code produces very clear output.
 
Hint: check the data type of each column to verify the processing. Include all values, also those marked as provisional, estimates (*), or special in some way ($ and #).
 
Hint: use .map(lambda xx: str(xx)) on a data series to convert all entries to strings, .rstrip(cc) on strings to remove the character cc from the right of a string, and .astype('float') to change the data type back to float.
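The hinted methods chain together like this; a minimal sketch on made-up values, with the marker characters and the column name 'rain' purely illustrative:

```python
import pandas as pd

# Hypothetical raw column: some entries carry trailing marker
# characters, as described in the data file headers.
df = pd.DataFrame({'rain': [57.3, '61.8*', '44.0#']})

# Convert every entry to a string, strip marker characters from the
# right, then convert the column back to float.
cleaned = (df['rain'].map(lambda xx: str(xx))
                     .map(lambda s: s.rstrip('*#$'))
                     .astype('float'))
print(cleaned.tolist())  # [57.3, 61.8, 44.0]
```

Note that the real files may use different markers; check each file's header before deciding which characters to strip.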
 
# Your code for Q2.1 goes here
Your written answer for Q2.1 goes here
 
Question 2.2 (9 points)
Plot the mean rainfall and sunlight per year and per decade, in mm/month and in hours/month, respectively. Make two plots, one for rainfall and one for sunlight, with both plots having both yearly and per-decade averages.
 
# Your code for Q2.2 goes here
Question 2.3 (6 points)
Describe the trends you see. Comment on limitations of the data. Are there any outliers you can explain?
 
Your answer to Q2.3 goes here
 
Question 3 (10 points)
Now we want to compare the data to a different data set on the UK climate. It comes from https://www.metoffice.gov.uk/research/climate/maps-and-data/uk-and-regional-series. Load it from the file UK.txt in the folder weather_uk-wide. It gives the sunlight for each month, season and year between 1919 and 2021.
 
Questions 3.1 and 3.2 can be solved together, so make sure you read through both before starting.
 
Question 3.1 (2 points)
Compute the mean sunlight for the 1910s decade and the 1990s decade of the UK-wide data (one number per decade), again in hours/month. Use the annual totals to calculate the average. Two decimal places are sufficient.
 
Hint: pd.read_fwf() can be used to read .txt files.
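pd.read_fwf() infers fixed-width column boundaries from the whitespace layout. A small in-memory sketch (the real UK.txt has more columns and header lines, which you may need to skip, e.g. with the skiprows argument):

```python
import io
import pandas as pd

# Toy fixed-width text standing in for a .txt data file; the column
# names 'year' and 'ann' here are illustrative only.
text = """year  ann
1919  1234.5
1920  1300.2
"""
df = pd.read_fwf(io.StringIO(text))
print(df.columns.tolist())  # ['year', 'ann']
```

For the real file you would pass the path instead of a StringIO object.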
 
# Your code for Q3.1 goes here
Your written answer for Q3.1 goes here
 
Question 3.2 (5 points)
Make a copy of the sunlight plot from question 2.2 and add the UK-wide values.
 
# Your code for Q3.2 goes here
Question 3.3 (3 points)
Describe the trends you see. How do you explain the differences, given that both data sets come from the UK Met Office?
 
Your answer to Q3.3 goes here
 
Part C - Exploring European migration patterns
Questions 4 and 5 focus on migration statistics in Europe between the years 1990 and 2019 from Eurostat. The Eurostat dataset we use contains tables for immigration and emigration across different countries for these three decades (https://ec.europa.eu/eurostat/databrowser/view/MIGR_IMM8__custom_1301560/default/table?lang=en and https://ec.europa.eu/eurostat/databrowser/view/MIGR_EMI2__custom_1301550/default/table?lang=en). In Question 5 we also use a dataset that provides more information about individual countries.
 
Question 4 (24 points)
In this question we will focus on migration trends throughout Europe. We have already merged the data for immigration and emigration for you, which is presented in the migration_data/EUROSTAT_migrants.csv file. Furthermore, variables not in use in this exercise were removed and columns were renamed for your convenience. The EUROSTAT_migrants.csv file includes the age group (at the time of migration), sex of the migrant, the country code, the year, and the number of immigrants (count_in) and emigrants (count_out).
 
Question 4.1 (3 points)
State the total number of male and female immigrants with known age present in the dataset. "Known age" here means belonging to the age groups Y_LT1, Y_GE100, and Y_{num}, where num is an integer such that 0 < num < 100.
 
# Your code for Q4.1 goes here
Your written answer for Q4.1 goes here
 
Question 4.2 (5 points)
State which known age group contains the highest number of immigrants, and the respective number of immigrants. Do this for each sex.
 
# Your code for Q4.2 goes here
Your written answer for Q4.2 goes here
 
Question 4.3 (10 points)
Plot the total number of male and female immigrants across different known-age groups.
 
# Your code for Q4.3 goes here
Question 4.4 (6 points)
Interpret the plot, identify interesting patterns within it, and comment on why these patterns might be present.
 
Your answer to Q4.4 goes here
 
Question 5 (26 points)
In this exercise we wish to compare how migration has changed in different European regions.
 
We provide further information about individual countries in the file migration_data/Countries.csv. This includes: the country's name; the country code; its population in 2011 stated in millions of inhabitants; and what larger region it belongs to according to the multidisciplinary thesaurus (controlled vocabulary) EuroVoc. The idea is that the four regions (North, West, South, East) have different population sizes, which directly affects the number of migrants associated with each region.
 
Firstly, we want to find the net migration of each region for each year, and scale it appropriately. By "appropriately" we mean that the scaling factor should be the sum of the populations of all contributing countries from that region for that year - e.g., Czechia does not provide any data for year 1991, so its population will not be considered in the scaling factor for Eastern Europe for that year. On the other hand, Slovakia and Croatia are countries classified in the Eastern Europe region with data available for 1991, hence their individual populations will contribute towards the sum that represents the region's scaling factor for that year.
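The per-region, per-year scaling factor described above can be computed with a groupby sum; a minimal sketch on toy rows, where all column names and numbers are illustrative rather than the real dataset's:

```python
import pandas as pd

# Toy merged frame: one row per country per year with data available.
# Populations are in millions and held constant per country (as in
# Countries.csv); CZ deliberately has no 1991 row.
df = pd.DataFrame({
    'region':     ['East', 'East', 'East', 'East'],
    'country':    ['CZ',   'SK',   'HR',   'SK'],
    'year':       [1992,   1991,   1991,   1992],
    'population': [10.5,   5.4,    4.3,    5.4],
})

# Scaling factor: total population of the countries that contributed
# data for that region in that year.
scale = df.groupby(['region', 'year'])['population'].sum()
print(scale.loc[('East', 1991)])  # SK + HR only, since CZ has no 1991 row
```

The net migration for a region-year can then be divided by the matching scaling factor.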
 
Once we have all the net migration values scaled for each region for each of the 30 years within our dataset, we want a plot of the distributions of the scaled net migration for the four regions over the three decades - one distribution per region per decade.
 
European Regions According to EuroVoc
 
Question 5.1 (3 points)
Load EUROSTAT_migrants.csv afresh. State the sum of net migration (migrants coming in minus migrants going out of each country) across Europe over the past three decades according to the data. Use the data with age group TOTAL.
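Net migration here is simply the difference of the two count columns; a one-line sketch on made-up numbers:

```python
import pandas as pd

# Toy rows with the dataset's count_in / count_out columns; the values
# are made up purely for illustration.
df = pd.DataFrame({'count_in': [100, 250], 'count_out': [40, 300]})

# Net migration per row: immigrants minus emigrants.
df['net'] = df['count_in'] - df['count_out']
print(df['net'].tolist())  # [60, -50]
```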
 
# Your code for Q5.1 goes here
Your written answer for Q5.1 goes here:
 
Question 5.2 (3 points)
Merge the migration and countries datasets. State how many countries represent each region in the merged dataset.
 
# Your code for Q5.2 goes here
Your written answer for Q5.2 goes here
 
Question 5.3 (5 points)
Compute the scaling factors for each region for each year. The scaling factor is the sum of the populations of the countries within the region that contributed data for that year. Assume the population of each country is constant across the years and is equal to the 2011 population data provided in Countries.csv. State:
 
the maximum and minimum scaling factor of net migration in Eastern Europe
the years in which the maximum and the minimum occur.
# Your code for Q5.3 goes here
Your written answer for Q5.3 goes here
 
Question 5.4 (9 points)
Compute the scaled net migration for each year for each region. Plot the distribution of scaled net migration for each region for each of the three decades (1990s, 2000s, 2010s).
 
# Your code for Q5.4 goes here
Question 5.5 (6 points)
Analyse and interpret the plot. Are there any interesting patterns? Can you link them to some economic or geopolitical events?
 
Your answer for Q5.5 goes here