
TAT 7008 - Assignment 3

Note: A3 is 20% of the overall assessment. The 100 points in A3 will be rescaled to 20% in
the final score.

Web Scraping
1. (25 points) Crawl information from https://www.sciencedirect.com
(1) (13 points) Crawl key information about all articles published in 2022 from the
website https://www.sciencedirect.com/journal/journal-of-econometrics/issues, including
year, volume, article content (the article type, e.g. Research article), title, authors
and pages. Crawl only volumes 226 to 230.
(2) (6 points) Remove “\xa0” from volume_name and store the crawled data in a pandas
DataFrame.

(3) (6 points) Filter out the records whose author is null, then find the 10 authors who
published the most articles.
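Parts (2) and (3) can be sketched with pandas as follows. This is a minimal illustration, not a reference solution: the rows in `records` are made-up stand-ins for crawled data, and storing co-authors as one comma-separated string is an assumption about how the crawler saves them. The sketch replaces "\xa0" with an ordinary space; passing "" instead would delete it outright.

```python
import pandas as pd

# Hypothetical rows standing in for crawled records (illustrative values only).
records = [
    {"year": "2022", "volume_name": "Volume\xa0226, Issue\xa01",
     "title": "Paper A", "authors": "Author One, Author Two", "pages": "Pages 1-20"},
    {"year": "2022", "volume_name": "Volume\xa0226, Issue\xa01",
     "title": "Editorial Board", "authors": None, "pages": "Page ii"},
    {"year": "2022", "volume_name": "Volume\xa0227, Issue\xa02",
     "title": "Paper B", "authors": "Author One", "pages": "Pages 21-40"},
]

df = pd.DataFrame(records)

# Replace the non-breaking space (\xa0) with an ordinary space.
df["volume_name"] = df["volume_name"].str.replace("\xa0", " ", regex=False)

# Keep only rows that have an author value.
with_authors = df.dropna(subset=["authors"])

# Assumption: co-authors are stored as one comma-separated string per article.
author_counts = (
    with_authors["authors"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

# The ten most frequent authors (fewer if the data has fewer distinct authors).
top_authors = author_counts.head(10)
print(top_authors)
```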
Hint:
i. Click the button of the targeted item

ii. Pass the html to BeautifulSoup and get all links

iii. Use requests to get article content, title, authors and pages for each block

For this example:
article content: Research article
title: Identification in nonparametric models for dynamic treatment effects
authors: Sukjin Han
pages: Pages 132-147
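The parsing step in the hint can be sketched on an inline HTML snippet. The tag and class names below (`article-item`, `article-type`, and so on) are hypothetical placeholders, not the site's real markup, so they need to be replaced with whatever the browser's inspector shows for each article block. The second block is an invented example of an item without authors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: tag and class names are placeholders for illustration.
html = """
<ol>
  <li class="article-item">
    <span class="article-type">Research article</span>
    <h3 class="article-title">Identification in nonparametric models for dynamic treatment effects</h3>
    <div class="article-authors">Sukjin Han</div>
    <div class="article-pages">Pages 132-147</div>
  </li>
  <li class="article-item">
    <span class="article-type">Editorial</span>
    <h3 class="article-title">Editorial Board</h3>
    <div class="article-pages">Page ii</div>
  </li>
</ol>
"""


def text_or_none(block, css_class):
    """Return the stripped text of the first matching node, or None if absent."""
    node = block.find(class_=css_class)
    return node.get_text(strip=True) if node else None


soup = BeautifulSoup(html, "html.parser")

rows = []
for block in soup.find_all("li", class_="article-item"):
    rows.append({
        "article_content": text_or_none(block, "article-type"),
        "title": text_or_none(block, "article-title"),
        "authors": text_or_none(block, "article-authors"),
        "pages": text_or_none(block, "article-pages"),
    })

for row in rows:
    print(row)
```

With a live page, `html` would instead come from the response body of a `requests.get(...)` call for each issue link. Returning None for a missing field keeps items without authors in the data so they can be filtered later.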

Scikit-learn
2. (10 points) Handwritten digits dataset loading and preprocessing
(1) (2 points) Load the digits data by load_digits.
(2) (4 points) Use MinMaxScaler to normalize the covariates X.
(3) (4 points) Split the data into training and test sets with test_size=0.2 and
random_state=2020.
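The three steps above can be sketched as follows. The sketch follows the order of the sub-items, scaling all of X before splitting; fitting the scaler on the training split only is a common alternative. The variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# (1) Load the 8x8 handwritten digit images as a flat feature matrix.
X, y = load_digits(return_X_y=True)

# (2) Rescale every feature to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# (3) Hold out 20% of the samples as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=2020
)

print(X_train.shape, X_test.shape)
```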

3. (15 points) Following question 2, fit the model specified below with different
hyperparameters and report its performance.
(1) (7 points) Fit the naive Bayes model MultinomialNB on the digits training set for
each value of the smoothing parameter alpha, α ∈ {1, 2, …, 20}.
(2) (4 points) Record the accuracy scores on the test set for each α.
(3) (4 points) Draw a line plot of the accuracy scores against α.
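One possible shape for question 3 is a loop over α that fits, scores, and collects the results before plotting. This is a sketch under the question 2 setup, repeated here so the block runs on its own. The output file name `nb_alpha_accuracy.png` and the non-interactive `Agg` backend are assumptions for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Data preparation as in question 2.
X, y = load_digits(return_X_y=True)
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2020
)

alphas = list(range(1, 21))
scores = []
for alpha in alphas:
    # (1) Fit one model per smoothing value.
    model = MultinomialNB(alpha=float(alpha))
    model.fit(X_train, y_train)
    # (2) Record the test-set accuracy.
    scores.append(model.score(X_test, y_test))

# (3) Line plot of accuracy against alpha.
fig, ax = plt.subplots()
ax.plot(alphas, scores, marker="o")
ax.set_xlabel("alpha")
ax.set_ylabel("test accuracy")
ax.set_title("MultinomialNB on digits")
fig.savefig("nb_alpha_accuracy.png")  # hypothetical output file name
```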
4. (15 points) Following question 2, apply dimensionality reduction methods to the
digits dataset.
(1) (3 points) Fit a Principal Component Analysis (PCA, n_components=2) model to the
digits training set for dimensionality reduction.
(2) (3 points) Apply the model from (1) to the training and test sets to compute their
2-dimensional embeddings.
(3) (3 points) Fit a nearest neighbor classifier (KNN, n_neighbors=3) on the embedded
training set. Compute the nearest neighbor accuracy on the embedded test set, plot the
projected test set points and show the evaluation score.
(4) (6 points) Use Neighborhood Components Analysis (NCA, n_components=2) for
dimensionality reduction, repeat (1), (2) and (3).
Note: Present the results in the format of the following image; no output is needed for (1) and (2).
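Because PCA and NCA share the same fit/transform interface in scikit-learn, steps (1) to (3) can be written once and run for both reducers. The sketch below assumes the question 2 setup. The `random_state` passed to NCA, the side-by-side plot layout, and the output file name are added assumptions rather than requirements of the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.preprocessing import MinMaxScaler

# Data preparation as in question 2.
X, y = load_digits(return_X_y=True)
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2020
)

reducers = {
    "PCA": PCA(n_components=2),
    # random_state is an added assumption so the NCA result is repeatable.
    "NCA": NeighborhoodComponentsAnalysis(n_components=2, random_state=2020),
}

scores = {}
embeddings = {}
fig, axes = plt.subplots(1, len(reducers), figsize=(12, 5))
for ax, (name, reducer) in zip(axes, reducers.items()):
    # (1) Fit on the training set only; PCA ignores y, NCA requires it.
    reducer.fit(X_train, y_train)
    # (2) Project both sets into the 2-dimensional space.
    Z_train = reducer.transform(X_train)
    Z_test = reducer.transform(X_test)
    # (3) 3-nearest-neighbor classifier on the embedded data.
    knn = KNeighborsClassifier(n_neighbors=3).fit(Z_train, y_train)
    scores[name] = knn.score(Z_test, y_test)
    embeddings[name] = Z_test
    ax.scatter(Z_test[:, 0], Z_test[:, 1], c=y_test, cmap="tab10", s=15)
    ax.set_title(f"{name}, KNN (k=3) test accuracy = {scores[name]:.2f}")

fig.savefig("digits_embeddings.png")  # hypothetical output file name
print(scores)
```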

Computer vision
5. (18 points) Face and Eye Detection
(1) (12 points) Write code to detect the faces and eyes in face.jpg. Draw a red rectangle
around each face and a green rectangle around each eye.
(2) (6 points) How would you modify the above code to open the front camera, capture
video, and perform face and eye detection on it?
Hint: you may use the auxiliary .xml files and the detection algorithm based on Haar-like
features, both provided by OpenCV.

Natural language processing
6. (17 points) Word embedding (Skip-gram)
See the attached Jupyter notebook with partially finished code: wb_partial_code.ipynb
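As background for the notebook, a skip-gram model is trained on (center word, context word) pairs drawn from a sliding window over the text. The sketch below shows only that pair-generation step. The function name `skipgram_pairs`, the window size, and the sample sentence are illustrative assumptions and need not match the code in wb_partial_code.ipynb.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the start and end of the sequence.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Each pair becomes one training example: the model receives the center word and is trained to predict the context word.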