receives as input a string s representing the text, and a dictionary d which contains
acronym: long_form key-value pairs as defined in Q1b). The function returns another string as
output. In this output, all acronyms in s have been replaced with their long forms. The
following rules apply:
- If an acronym has a long form, the sentence in which the long form was defined
remains unchanged. In any other sentence, the acronym is replaced by the long form.
- If an acronym has no long form, it is not replaced anywhere.
- If you add the long form at the beginning of a sentence, make sure that its first word is
capitalised.
For instance, in our example above the output of the function is the string:
"A GPU, which stands for graphics processing unit, is different from
CPUs, says the IT expert. For some operations, a graphics processing
unit is faster than a CPU. Graphics processing units are not always
faster though."
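The rules above could be sketched roughly as follows. This is only a minimal illustration, not the skeleton from Q1.py: the function name here is a placeholder (keep the name given in Q1.py), and plural forms such as 'GPUs' are not handled.

```python
import re

def replace_acronyms(s, d):  # placeholder name; use the name from Q1.py
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', s)
    out = []
    for sent in sentences:
        new_sent = sent
        for acronym, long_form in d.items():
            # The sentence that defines the long form stays unchanged.
            if long_form.lower() in sent.lower():
                continue
            # Replace whole-word occurrences of the acronym.
            new_sent = re.sub(r'\b%s\b' % re.escape(acronym), long_form, new_sent)
        # Capitalise the first letter if a long form now starts the sentence.
        if new_sent and new_sent[0].islower():
            new_sent = new_sent[0].upper() + new_sent[1:]
        out.append(new_sent)
    return ' '.join(out)
```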
As a starting point, use Q1.py from Learning Central. Do not rename the file or the function.
Q2 Statistics (Total 35 marks)
In this question, your task is to implement several statistical functions that perform t-tests,
linear regression, and variable selection.
Q2 a) Mass t-tests (10 marks)
In this question, your task is to implement two functions that perform dependent and
independent t-tests on input data. You can use the corresponding t-test functions in
scipy.stats.
Write a function mass_paired_ttest(X) that performs a series of paired-samples t-tests. It
receives as input a numpy array X with shape (n, m), where n is the number of rows and
m is the number of columns. Each of the m columns represents one sample. Your function
should find the pair of columns that yields the lowest p-value, i.e. the 'most significant' pair.
Then the function returns a tuple with three elements (index of the first column from the pair,
index of the second column from the pair, corresponding p-value).
Example: imagine your dataset has shape (100, 3), i.e. it has three columns. Assume the
p-values for the three pairs of columns are p = 0.4 (col 0 vs col 1), p = 0.12 (col 0 vs col 2),
and p = 0.08 (col 1 vs col 2). The lowest p-value is obtained for col 1 vs col 2 and its value
is 0.08, so the tuple returned by the function is t = (1, 2, 0.08).
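One possible sketch of mass_paired_ttest, using scipy.stats.ttest_rel and itertools.combinations to enumerate the column pairs (a minimal version of the approach described above, not the only acceptable implementation):

```python
from itertools import combinations
from scipy.stats import ttest_rel

def mass_paired_ttest(X):
    # X: numpy array of shape (n, m); each column is one sample.
    best = None
    for i, j in combinations(range(X.shape[1]), 2):
        p = ttest_rel(X[:, i], X[:, j]).pvalue
        if best is None or p < best[2]:
            best = (i, j, p)  # (first column index, second column index, p-value)
    return best
```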
Write a similar function mass_independent_ttest(*X) that performs a series of
independent t-tests. It takes multiple inputs: Each input is a vector (1-D array) representing
a single sample, so X is a list of Numpy arrays. The arrays can have different lengths. You
can access each array using its index, e.g. X[0] is the first array, X[1] is the second array etc.
Like for the paired-samples t-test, find the most significant pair of samples and return the
tuple of three elements.
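A matching sketch for the independent case, using scipy.stats.ttest_ind; since the samples can have different lengths, the pairs are taken over the input arrays rather than columns of a single matrix:

```python
from itertools import combinations
from scipy.stats import ttest_ind

def mass_independent_ttest(*X):
    # X is a tuple of 1-D numpy arrays, possibly of different lengths.
    best = None
    for i, j in combinations(range(len(X)), 2):
        p = ttest_ind(X[i], X[j]).pvalue
        if best is None or p < best[2]:
            best = (i, j, p)  # (first array index, second array index, p-value)
    return best
```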
Q2 b) Ridge regression (10 marks)
In this question your task is to implement ridge regression from scratch using Numpy. Do not
use statsmodels or scipy for this question. Ridge regression is a slightly modified version
of linear regression which is more stable for collinear data.
Let us first develop the theory behind linear regression. Assume you have a vector of
responses y ∈ ℝ^n, where n is the number of samples. Let x1, x2, x3, ..., xp ∈ ℝ^n be our
predictors, where p is the number of predictors. Then our linear regression model is

    ŷ = b0 + b1·x1 + b2·x2 + b3·x3 + ... + bp·xp

with b0 being the intercept and b1, ..., bp being the slopes for the predictors. For convenience,
we store our predictors in a matrix X ∈ ℝ^(n×(p+1)) = [x1, x2, x3, ..., xp, 1]. In other words,
each column of X represents one predictor. The last column consists entirely of ones; it
represents the intercept. We also store all b's in a vector b = [b1, b2, ..., bp, b0] ∈ ℝ^(p+1).
To calculate b we can use the equation

    b = (X^T X)^(-1) X^T y

where X^T is the matrix transpose of X and the superscript (·)^(-1) refers to the matrix
inverse. Unfortunately, the inverse can be unstable or even undefined if X^T X is not
well-conditioned. As a fix, we will use a different formula called ridge regression, which
adds a so-called regularization term aI:

    b = (X^T X + aI)^(-1) X^T y

where I ∈ ℝ^((p+1)×(p+1)) is an identity matrix and a is a positive number that represents the
regularization strength. The inverse then always exists as long as a > 0. The parameter a has
to be provided by the user.
Your task: Write a function fit_ridge(y, X, a) that implements ridge regression as
defined above. It receives the following inputs:
- The response vector y is a numpy array with shape (n,1).
- The matrix X is a numpy array of predictors with shape (n, p). Note that X does not
contain the column of 1's, so you need to add it yourself.
- The input a represents the strength of regularization. a can be either a single number
(e.g. a = 1) or a list with multiple numbers (e.g. a = [1, 5, 10]).
If a is a single number, the function returns b, the vector of ridge regression coefficients using
a for the regularization. If a is a list with multiple numbers, a separate ridge regression
solution should be calculated for each value of a. In this case, the function returns a Python
list of coefficient vectors [b0, b1, b2, ...], where b0 contains the regression coefficients using
the first value of a, b1 the regression coefficients using the second value of a, and so on.
Tip: remember that the * operator operates element-wise on Numpy arrays. If you want
proper matrix or vector multiplication like in linear algebra, you can use the @ operator.
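A minimal sketch of fit_ridge following the formula above (appending the column of ones as the last column, as the specification defines it):

```python
import numpy as np

def fit_ridge(y, X, a):
    # Append the column of 1's for the intercept (last column, as in the spec).
    Xd = np.hstack([X, np.ones((X.shape[0], 1))])
    I = np.eye(Xd.shape[1])

    def solve(alpha):
        # b = (X^T X + aI)^(-1) X^T y, using @ for matrix multiplication
        return np.linalg.inv(Xd.T @ Xd + alpha * I) @ Xd.T @ y

    if isinstance(a, (list, tuple)):
        return [solve(alpha) for alpha in a]
    return solve(a)
```

With a very small a this reduces to ordinary least squares, which makes it easy to sanity-check against data generated from a known linear model.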
Q2 c) Variable selection in linear regression (15 marks)
In this question, your task is to use statsmodels to implement two variable selection
functions for standard linear regression (a.k.a. OLS regression). The motivation is that
regression models can have dozens or even hundreds of predictors. This can make it difficult
to interpret the relationship between the predictors and the response variable y. Ideally, one
wants to identify a subset of the predictors that carries most of the information about y. A
possible approach is variable selection. In variable selection (‘variable’ means the same as
‘predictor’), variables get iteratively added or removed from the regression model. Once
finished, the model typically contains only a subset of the original variables.
In the following, we will call a predictor "significant" if the p-value of its coefficient is
smaller or equal to a given threshold. Your approach operates in two stages: In stage 1, you
iteratively remove predictors that are not significant. This leaves you with a subset of the
original predictors. In stage 2, you iteratively add interaction terms and keep them in the
model if they are significant. Remember what an interaction term is: if x1 and x2 are two
predictors, then the variable z = x1 · x2 (their element-wise product) is their corresponding
interaction term. We will split the two stages into two functions:
Stage 1 (remove variables)
Write a function remove_variables(y, X, threshold = 0.05, variable_names =
None). The function receives the following inputs:
• y and X are numpy arrays like in Q2b).
• threshold is the cut-off value that determines whether a p-value is significant. If a
p-value <= threshold, it counts as significant.
• variable_names is a Python list of variable names that a user can provide. This is the
names for the columns of X (e.g. ['TV', 'radio', 'newspaper'] for the advertisement
dataset discussed in the lecture). If no variable names are provided, your function
should create the variable names ['x1', 'x2', 'x3', ...] where 'x1' is the name for the
first column of X, 'x2' is the name for the second column of X, and so on.
The function returns a tuple (new_X, new_variable_names) containing two variables:
• new_X is the matrix of predictors after non-significant variables have been removed. It
should not include the column of 1’s corresponding to the intercept.
• new_variable_names is a list of strings containing the variable names for the
columns of new_X.
Use the statsmodels function add_constant to make sure that X contains a column of 1's for
the intercept, and use the intercept in all fits. Next, these are the details on how to implement
stage 1 of the variable selection:
• To start, fit an OLS model using all of the predictors in X.
• Identify the predictor whose coefficient has the largest p-value. If it is not significant,
remove it and fit the model again.
• Repeat this process until either all predictors have been removed or all predictors left are
significant.
• Never remove the intercept irrespective of whether or not it is significant.
• If no predictors are left after stage 1, return the tuple (None, None).
Tip: It might be useful to use Boolean arrays to select subsets of columns of X.
Stage 2 (add interaction terms)
Write a function add_interaction_terms(y, X, threshold = 0.05, variable_names
= None). The inputs have the same meaning as in remove_variables. The function
returns a tuple (new_X, new_variable_names) containing two variables:
• new_X is the matrix of predictors after the interaction terms have been added. Hence,
it contains the predictors in X plus the interaction terms that have been added as new
columns to the right. It should not contain the column of 1’s corresponding to the
intercept term.
• new_variable_names is a list of strings containing the variable names for the
columns of new_X. For interaction terms, use names that combine the two variable
names with a ‘*’ sign. For instance, if you add the interaction term for ‘tv’ and
‘radio’, then call their interaction variable ‘tv*radio’.
The function implements the following algorithm:
• To start, fit an OLS model using all of the predictors in X.
• Test whether it is useful to add interaction terms: For each pair of predictors, add their
interaction term into the model. If the interaction term is significant, keep it in the model.
If it is not significant, remove it again.
• Continue this until you checked every pair of predictors.
• Never add an interaction term involving the intercept.
• It can happen that when you add new interaction terms, predictors that you previously
added become non-significant. You can ignore this issue.
• Add the interaction terms in order, starting from the leftmost predictor in X. For instance,
if you have predictors with column indices 1, 2, 3, and 4, you first add the [1, 2]
interaction term, then [1, 3], [1, 4], [2, 3], [2, 4], and finally [3, 4].
• After you checked the interaction terms for all pairs of predictors, you are finished.
Return the new set of predictors and variable names as defined above.
Finally, note that it should be possible to run both functions one after the other. For instance,
given y and X, the following two lines of code
(new_X,new_variable_names)=remove_variables(y, X)
(new_X,new_variable_names)=add_interaction_terms(y, new_X, variable_names=new_variable_names)
should first perform removal of variables and then add interaction terms.
As a starting point, use Q2.py from Learning Central. Do not rename the file or the function.
Question 3 – Ethics (Total 30 Marks)
In this question you will investigate bias in text corpora (document collections). You are
provided with two datasets from a recent data science competition on Hyperpartisan News
Detection [1]. These datasets are
- bias_corpus.txt: a corpus of news articles from media that have been
classified as exhibiting right or left political bias.
- nobias_corpus.txt: a corpus of news articles that have been classified
as neutral.
These newspaper articles are mostly written in the context of US politics. They could be used
for building targeted political ads (reader of newspaper X will prefer to see ads of party Y),
user or community profiling, etc. However, some articles may depict certain protected
communities (women, immigrants or LGBT) in a negative way. This may bias any data
science model built on top of this data.
In this question you implement a 'pattern matching' procedure for investigating how protected
communities are depicted in both corpora (biased vs non-biased). As an inspiration, you can
start experimenting with Hearst patterns [2], which are often used to identify word pairs in
which a type-of relationship holds. An example for a Hearst pattern is ‘X is a type of Y’.
The slots X and Y will be filled with matches in corpora, e.g., ‘cat is a type of animal’
or ‘sofa is a type of furniture’. Such patterns can also be used to reveal how certain
communities are depicted. For example, ‘immigrants and other x’ would reveal how
immigrants are depicted in these media. A neutral example could be ‘immigrants and
other communities’, whereas a (negatively) biased example could be ‘immigrants and
other criminals’. An initial list of patterns is provided below (with actual examples from
the data). However, you are free and encouraged to experiment with text patterns of
your own design. You can experiment using only X, only Y, or both X and Y as empty slots
(regex groups).
Pattern                         Example occurrence
X is a Y                        Obama is a citizen
X is Y                          Trump is threatening
X and other Y                   Refugees and other criminals
marginalized Y, especially X    marginalized groups, especially gays
X works as a Y                  He works as a manager / She works as a hairdresser
Your tasks:
• Download and uncompress the text corpora from this url:
https://drive.google.com/drive/folders/1ATp_zALwRRG5-
rd9o0WEcP9IXKOGADSd?usp=sharing
• Decide on a person or community of interest. Define a pattern which you hypothesize
is likely to reveal how this person/community is depicted. This pattern could include
regular expressions and group matching for a slot x to fill. For example, the pattern
‘Trump is x’ is likely to match more verbs in non-biased media because they talk
more about what he does (‘Trump is speaking’ or ‘Trump is attending’). In
biased media, however, we could find more adjectives (‘Trump is arrogant’ or
‘Trump is great’).
• Retrieve and count the hits you get for each value of x in the biased and the non-biased
corpora separately and store those in the dbias and dnobias dictionaries. For
example, if ‘Trump is speaking’ occurs twice in the non-biased corpus, then the
dictionary entry dnobias['speaking'] has the value 2.
• Do this pattern extraction process for three different persons/communities to obtain a
total of three case studies. You can use the same or different patterns.
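A minimal sketch of the counting step (the pattern and filename here are illustrative; adapt them to the skeleton in Q3.py):

```python
import re
from collections import Counter

def count_pattern(pattern, text):
    # pattern is a regex whose group 1 is the empty slot x,
    # e.g. r'Trump is (\w+)'. Counter maps each match to its frequency.
    return Counter(m.group(1).lower() for m in re.finditer(pattern, text))

# Hypothetical usage once the corpora are downloaded:
# with open('nobias_corpus.txt', encoding='utf-8') as f:
#     dnobias = count_pattern(r'Trump is (\w+)', f.read())
```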
Then, report your results in the template document Q3.docx as follows: For each of the three
case studies, write a short justification (up to 300 words for each) with
1. Your initial hypothesis, why you chose the pattern and no other, how many patterns
you tried for the case you wanted to test, etc.
2. Provide the comparison match frequency table for the two corpora (see example,
provided as a comment, at the end of Q3.py).
3. Discuss differences, if any, between the results obtained from the two corpora, and
highlight the stereotypical or negative depictions you found.
As a starting point, use Q3.py and Q3.docx from Learning Central. Your submission should
only include the Word document, not the Python script.
References:
[1] Kiesel, J., Mestre, M., Shukla, R., Vincent, E., Adineh, P., Corney, D., ... & Potthast, M. (2019,
June). Semeval-2019 Task 4: Hyperpartisan news detection. In Proceedings of the 13th
International Workshop on Semantic Evaluation (pp. 829-839).
(Available at https://www.aclweb.org/anthology/S19-2145/)
[2] Hearst, M. A. (1992, August). Automatic acquisition of hyponyms from large text corpora.
In Proceedings of the 14th conference on Computational Linguistics-Volume 2 (pp. 539-545).
Association for Computational Linguistics.
(Available at https://www.aclweb.org/anthology/C92-2082/)
Learning Outcomes Assessed
• Carry out data analysis and statistical testing using code
• Critically analyse and discuss methods of data collection, management and storage
• Reflect upon the legal, ethical and social issues relating to data science and its
applications
Criteria for assessment
Credit will be awarded against the following criteria. The score in each implemented function
is judged by its functionality. For Q1 and Q2, the functions you have implemented will be
tested against different data sets to judge their functionality. Additionally, quality and
efficiency (Q1) will be assessed. For Q3, marks are based on the written report. The below
table explains the criteria.
[Assessment criteria table not reproduced in this copy; column headings: Criteria, Distinction.]
Feedback on your coursework will address the above criteria. Feedback and marks will be
returned within 4 weeks of your submission date via Learning Central. In case you require
further details, you are welcome to schedule a one-to-one meeting. Feedback from this
assignment will be useful for next year’s version of this module as well as the Python for Data
Analysis module.