554.488/688 Computing for Applied Mathematics
Spring 2023 - Final Project Assignment
The aim of this assignment is to give you a chance to exercise your prediction skills using
Python. You have been sent an email with a link to data collected on a random sample from some
population of Wikipedia pages, which you will use to develop prediction models for three
different web page attributes.
Each student is provided with their own data drawn from a Wikipedia page population unique
to that student, and this comes in the form of two files:
• A training set which is a pickled pandas data frame with 200,000 rows and 44 columns. Each
row corresponds to a distinct Wikipedia page/url drawn at random from a certain population
of Wikipedia pages. The columns are
– URLID in column 0, which gives a unique identifier for each url. You will not be able to
determine the url from the URLID or the rest of the data. (It would be a waste of time
to try, so the only information you have about this url is provided in the dataset itself.)
– 40 feature/predictor variable columns in columns 1,...,40, each associated with a particular
word (the word is in the header). For each url/Wikipedia page, the word column gives
the number of times that word appears in the associated page.
– Three response variables in columns 41, 42 and 43
* length = the length of the page, defined as the total number of characters in the
page
* date = the last date when the page was edited
* word_present = a binary variable indicating whether at least one of 5 possible words
(using a word list of 5 words specific to each student and not among the 40 feature
words) [1] appears in the page
• A test set which is also a pickled pandas data frame, with 50,000 rows but only 41 columns
since the response variables (length, date, word_present) are not available to you. The rows
of the test dataset also correspond to distinct url/pages drawn from the same Wikipedia
url/page population as the training dataset (with no pages in common with the training set
pages). The response variables have been removed so that the columns that are available are
– URLID in column 0
– the same 40 feature/predictor variable columns corresponding to word counts for the
same 40 words as in the training set
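As a minimal sketch of the column layout described above, the training frame could be split by position as follows. The helper name `split_training_columns` and the file names in the comments are hypothetical; only the column positions come from the description.

```python
import pandas as pd


def split_training_columns(train: pd.DataFrame):
    """Split a training frame into identifier, features and responses by position."""
    if train.shape[1] != 44:
        raise ValueError(f"expected 44 columns, got {train.shape[1]}")
    url_id = train.iloc[:, 0]          # column 0: URLID
    features = train.iloc[:, 1:41]     # columns 1..40: word counts
    responses = train.iloc[:, 41:44]   # columns 41..43: the three response variables
    return url_id, features, responses


# Hypothetical usage (the file names are assumptions, not given in the assignment):
# train = pd.read_pickle("train.pkl")
# test = pd.read_pickle("test.pkl")
# url_id, X, y = split_training_columns(train)
# print(X.isna().mean())  # fraction of missing values in each word column
```

Selecting by position rather than by name avoids depending on the 40 word headers, which differ from student to student.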
Your goal is to use the training data to
• predict the length variable for pages in the test dataset
• predict the mean absolute error you expect to achieve in your predictions of length in the test
dataset
• predict word_present for pages in the test dataset, attempting to make the false positive rate as
close as you can to .05 [2] and the true positive rate as high as you possibly can [3]
• predict your true positive rate for word_present in the test dataset
• predict edited_2023 for pages in the test dataset, attempting to make the false positive rate as
close as you can to .05 [4] and the true positive rate as high as you possibly can [5]
• predict your true positive rate for edited_2023 in the test dataset
[1] The list of 5 words will not be revealed to you, and it would be a waste of time trying to figure out
what it is.
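One possible way to aim for a false positive rate near .05 is to have a classifier produce a score for each page and then choose a cut-off on pages that were not used to fit it. The sketch below assumes such scores already exist; the function names `threshold_for_fpr` and `rates` are hypothetical, and this is an illustration rather than a prescribed method.

```python
import numpy as np


def threshold_for_fpr(scores, labels, target_fpr=0.05):
    """Return a cut-off such that roughly target_fpr of the true-0 pages score above it.

    Pages with score > cut-off are predicted to be 1.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    negatives = scores[~labels]
    # The (1 - target_fpr) quantile of the negative-class scores leaves about
    # a target_fpr share of the negatives above the cut-off.
    return float(np.quantile(negatives, 1.0 - target_fpr))


def rates(scores, labels, cutoff):
    """Return (false positive rate, true positive rate) for predictions score > cutoff."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    predicted = scores > cutoff
    fpr = float(predicted[~labels].mean())  # share of true-0 pages predicted 1
    tpr = float(predicted[labels].mean())   # share of true-1 pages predicted 1
    return fpr, tpr
```

If the cut-off is chosen and the rates are measured on the same pages, the reported rates will tend to look better than they will on the test set, so it is safer to measure them on a separate held-out portion.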
Since I have the response variable values (length, word_present, date) for the pages in your test
dataset, I can determine the performance of your predictions. Since you do not have those variables,
you will need to set aside some data in your training set or use cross-validation to estimate the
performance of your prediction models.
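A minimal sketch of estimating the mean absolute error by K-fold cross-validation is given below. It uses ordinary least squares purely as a placeholder model and fills missing word counts with medians taken from the fitting rows only; both choices, and the names `impute_with_medians` and `cross_validated_mae`, are assumptions made for illustration.

```python
import numpy as np


def impute_with_medians(X_fit, X_apply):
    """Fill NaNs in X_apply with column medians computed on X_fit only."""
    medians = np.nanmedian(X_fit, axis=0)
    filled = X_apply.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = medians[cols]
    return filled


def cross_validated_mae(X, y, n_folds=5, seed=0):
    """Estimate the mean absolute error of a least-squares fit by K-fold cross-validation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    fold_mae = []
    for k in range(n_folds):
        held_out = folds[k]
        fit_rows = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Imputation statistics come from the fitting rows only, so the
        # held-out fold does not leak into the model.
        X_fit = impute_with_medians(X[fit_rows], X[fit_rows])
        X_out = impute_with_medians(X[fit_rows], X[held_out])
        # Add an intercept column and solve ordinary least squares.
        A_fit = np.column_stack([np.ones(len(fit_rows)), X_fit])
        A_out = np.column_stack([np.ones(len(held_out)), X_out])
        coef, *_ = np.linalg.lstsq(A_fit, y[fit_rows], rcond=None)
        fold_mae.append(np.mean(np.abs(A_out @ coef - y[held_out])))
    return float(np.mean(fold_mae))
```

The same loop structure applies to any other model: every step that learns something from data (imputation, scaling, fitting, threshold selection) is repeated inside each fold so that the held-out rows play the role of the unseen test set.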
There are 3 different parts of this assignment, each requiring a submission:
• Part 1 (30 points) - a Jupyter notebook containing
– a description (in words, no code) of the steps you followed to arrive at your predictions
and your estimates of prediction quality - including a description of any separation of
your training data into training and testing data, the method you used for imputation,
and the methods you tried for making predictions (e.g. regression, logistic regression, ...)
followed by
– the code you used in your calculations
• Part 2 (60 points) - a csv file with your predictions - this file should consist of exactly 4
columns with [6]
– a header row with URLID, length, word_present, edited_2023
– 50,000 additional rows
– every URLID in your test dataset appearing in the URLID column - not altered in any
way!
– no missing values
– data type for the length column should be integer or float
– data type for the word_present column should be either integer (0 or 1), float (0. or 1.)
or Boolean (False/True)
– data type for the edited_2023 column should be either integer (0 or 1), float (0. or 1.)
or Boolean (False/True)
• Part 3 (30 points) - providing estimates of the following in a form:
– what do you predict the mean absolute error of your length predictions to be?
– what do you predict the true positive rate for your word_present predictions to be?
– what do you predict the true positive rate for your edited_2023 predictions to be?
[2] false positive rate = proportion of the pages for which word_present is 0 that are predicted to be 1
[3] true positive rate = proportion of the pages for which word_present is 1 that are predicted to be 1
[4] false positive rate = proportion of the pages for which edited_2023 is 0 that are predicted to be 1
[5] true positive rate = proportion of the pages for which edited_2023 is 1 that are predicted to be 1
[6] a notebook is provided to you for checking that your csv file is properly formatted
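The Part 2 file requirements can be checked mechanically before submission. The sketch below is not the checking notebook mentioned in the assignment; it is a rough stand-alone illustration using only the standard library. The function name `check_predictions_csv` is hypothetical, and the underscore spelling of the header names is an assumption.

```python
import csv
import io

# Header spelling is an assumption; use the exact names required by the assignment.
EXPECTED_HEADER = ["URLID", "length", "word_present", "edited_2023"]
BINARY_VALUES = {"0", "1", "0.0", "1.0", "0.", "1.", "False", "True"}


def check_predictions_csv(text, test_ids, expected_rows=50_000):
    """Return a list of problems found in the CSV text; an empty list means none were found."""
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header row does not match " + ",".join(EXPECTED_HEADER)]
    body = rows[1:]
    if len(body) != expected_rows:
        problems.append(f"expected {expected_rows} data rows, found {len(body)}")
    if any(len(r) != 4 for r in body):
        problems.append("some rows do not have exactly 4 columns")
        return problems
    if any(cell.strip() == "" for r in body for cell in r):
        problems.append("missing values present")
    # Every test URLID must appear, unaltered, exactly as often as in the test set.
    if sorted(r[0] for r in body) != sorted(str(i) for i in test_ids):
        problems.append("URLID column does not match the test set URLIDs")
    for r in body:
        try:
            float(r[1])
        except ValueError:
            problems.append(f"non-numeric length for URLID {r[0]}")
            break
    if any(r[2] not in BINARY_VALUES or r[3] not in BINARY_VALUES for r in body):
        problems.append("binary columns must hold 0/1, 0./1. or False/True")
    return problems
```

A typical call would pass the text of the submission file together with the URLID column of the test frame, for example `check_predictions_csv(open(path).read(), test_ids)` where `path` and `test_ids` are placeholders.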
Your score in this assignment will be based on
• Part 1 (30 points)
– evidence of how much effort you put into the assignment (how many different methods
did you try?)
– how well did you document what you did?
– was your method for predicting the quality of your performance prone to over-fitting?
• Part 2 (60 points)
– how good are your predictions of length, word_present, edited_2023 - I will do predictions
using your training data and I will compare
* your length mean absolute error to what I obtained in my predictions
* your true positive rate to what I obtained for the binary variables (assuming you
managed to appropriately control the false positive rate)
– how well did you meet specifications - did you get your false positive rate in predictions
of the binary variables close to .05 (again, compared to how well I was able to do this)
• Part 3 (30 points)
– how good is your prediction of the length mean absolute error
– how good is your prediction of the true positive rate for the word_present variable
– how good is your prediction of the true positive rate for the edited_2023 variable
How the datasets were produced
This is information that will not be of much help to you in completing the assignment, except
maybe to convince you that there would be no point in using one of the other students’ data in
completing this assignment.
• I crawled Wikipedia to arrive at a random sample of around 2,000,000 pages.
• I made a list of 100 random words and extracted the length, the word counts, and the last
date edited for each page.
• To create each student's personal dataset, I carried out the following steps:
Repeat
    Chose 10 random words w0,w1,...,w9 out of the 100 words in the list above
    Determined the subsample of pages having w0 and w1 but not w2, w3 or w4
    Used the words w5,w6,w7,w8 and w9 to create the word_present variable
Until
    the subsample has at least 250,000 pages
Randomly sampled 40 of the 90 unsampled words without replacement
Randomly sampled without replacement 250,000 pages out of the subsample
Retained only the 250,000 pages and
    word counts for the 40 words
    length
    word_present
    last date edited
Randomly assigned missing values in the feature (word count) data
Randomly separated the 250,000 pages into
    200,000 training pages
    50,000 test pages
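The steps above can be sketched at toy scale in Python. This is only an illustration of the sampling logic, not the code that produced the datasets: the function name, its parameters and the dictionary representation of pages are all assumptions, and the length, date and missing-value steps are left out.

```python
import random


def make_student_dataset(pages, word_list, n_pages, n_features, n_train, rng, max_tries=1000):
    """Toy re-creation of the sampling steps; `pages` maps page id -> {word: count}."""
    def has(pid, word):
        return pages[pid].get(word, 0) > 0

    for _ in range(max_tries):
        w = rng.sample(word_list, 10)  # w[0], ..., w[9]
        # Pages containing w0 and w1 but none of w2, w3, w4.
        subsample = [
            pid for pid in pages
            if has(pid, w[0]) and has(pid, w[1])
            and not any(has(pid, x) for x in w[2:5])
        ]
        if len(subsample) >= n_pages:
            break
    else:
        raise RuntimeError("no word choice produced a large enough subsample")

    unsampled = [x for x in word_list if x not in w]
    feature_words = rng.sample(unsampled, n_features)
    kept = rng.sample(subsample, n_pages)  # without replacement, in random order

    rows = []
    for pid in kept:
        row = {"URLID": pid}  # toy identifier; the real URLID does not reveal the page
        row.update({x: pages[pid].get(x, 0) for x in feature_words})
        # word_present is 1 if at least one of w5..w9 appears in the page.
        row["word_present"] = int(any(has(pid, x) for x in w[5:10]))
        rows.append(row)
    # `kept` is already in random order, so slicing gives a random split.
    return {"chosen_words": w, "feature_words": feature_words,
            "train": rows[:n_train], "test": rows[n_train:]}
```

Because the defining words w0 to w9 are drawn separately for each student and are excluded from the feature words, two students' page populations and response definitions will generally differ, which is why another student's data would be of little use.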