CIS 545 - Big Data Analytics - Fall 2019
Have you ever wondered about (1) what it takes to be a data scientist or "data person", and (2) how so
work?
This homework is focused on (1) working with hierarchical data stored in dataframes, (2) traversing re
data, (3) understanding a bit about performance.
We will focus on questions about data scientists from "our" crawl of the LinkedIn dataset, which was a
extended notebook.
!pip install pymongo[tls,srv]
!pip install swifter
!pip install lxml
import pandas as pd
import numpy as np
import json
import sqlite3
from lxml import etree
import urllib.request
import zipfile
import time
import swifter
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure
We need to pull the zip file with LinkedIn data from Amazon S3 (where it is shared) to your local machi
machine. Only when the data is local can we efficiently parse it (and we'll read directly out of a zip file)
The zip file contains three files with the same schema. You can start with the tiny instance to test yo
brave and have a lot of time feel free to use the full file.
Step 0: Acquire and load data
Due October 11, 2019 at 10pm
Homework 2: Querying Linked (LinkedIn) Data
We will grade your homework using small. Hidden test 0.0 will override your file selection, so as lon
in a cell that comes after that one, you will be fine.
linkedin.json (3M records)
linkedin_small.json (100K records)
linkedin_tiny.json (10K records)
The cell below will download a 3GB file to your Google Cloud. It may take a while. You do not need to m
#url = 'https://upenn-bigdataanalytics.s3.amazonaws.com/linkedin.zip'
#filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')
filehandle = 'local.zip'
# What's the zip file actually called locally?
filehandle
The cell below creates pointers to the three versions of our dataset. To switch between them, simply c
the cell below.
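def fetch_file(fname):
    # Return an open handle to the named member of the zip archive, or None if it is absent
    zip_file_object = zipfile.ZipFile(filehandle, 'r')
    for name in zip_file_object.namelist():
        # Compare names first so that only the requested member is opened
        if name == fname:
            return zip_file_object.open(name)
    return None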

linkedin_tiny = fetch_file('linkedin_tiny.json')
linkedin_small = fetch_file('linkedin_small.json')
linkedin_huge = fetch_file('linkedin.json')
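Reading records straight out of a zip member can be sketched as follows. This is a minimal illustration, assuming the JSON files hold one record per line; the member name `demo_tiny.json`, the `_id` field, and the helper `read_records` are made up for the example and are not part of the course code.

```python
import io
import json
import zipfile

# Build a tiny in-memory zip so the sketch runs without the 3GB download.
# The member name and record fields are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('demo_tiny.json', '{"_id": "p1"}\n{"_id": "p2"}\n{"_id": "p3"}\n')

def read_records(zip_source, member, limit):
    """Parse at most `limit` JSON records from one zip member, one record per line."""
    records = []
    with zipfile.ZipFile(zip_source, 'r') as zf:
        with zf.open(member) as handle:
            for line in handle:
                if len(records) >= limit:
                    break
                # json.loads accepts the raw bytes of each line
                records.append(json.loads(line))
    return records

demo_records = read_records(buffer, 'demo_tiny.json', limit=2)
print(demo_records)
```

Stopping at `limit` avoids decompressing the whole member when only a prefix of the data is needed.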
# CIS 545 Hidden Test 0.0 - please do not modify or delete this cell!
# Set the input file to process
file = linkedin_small
In the cell below, adapt the data loading code from the in-class notebook. You will need the function th
the function that converts relations to dataframes. Read in a maximum of 20000 people. Put the code
relations, removes the interval field, and stores the field
information with a try statement, just in case.
command to move on. At the end of the next cell, you should have nine dataframes with the following
Step 0.1: Store data in dataframes
11/3/2019 Homework_2.ipynb - Colaboratory
https://colab.research.google.com/drive/1K0hp-Y5R7FHa3AwfAj2tCw3ueXOlJhay#scrollTo=syxh_fwyTAVU 3/12
1. people_df
2. names_df
3. education_df
4. groups_df
5. skills_df
6. experience_df
7. honors_df
8. also_view_df
9. events_df
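The general idea of turning hierarchical records into separate relations can be sketched on toy data. This is not the in-class loading code; the field names (`_id`, `locality`, `skills`, `experience`) and the child column names (`person`, `value`) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical records: these field names are assumptions,
# not the confirmed schema of the LinkedIn crawl.
records = [
    {'_id': 'p1', 'locality': 'Philadelphia',
     'skills': ['SQL', 'Python'],
     'experience': [{'org': 'Acme', 'title': 'Data Analyst'}]},
    {'_id': 'p2', 'locality': 'Boston',
     'skills': ['Java'],
     'experience': []},
]

people_rows, skill_rows, experience_rows = [], [], []
for record in records:
    person_id = record['_id']
    # Scalar fields stay in the parent relation
    people_rows.append({'_id': person_id, 'locality': record.get('locality')})
    # Each list element becomes one row in a child relation, keyed by the person
    for skill in record.get('skills', []):
        skill_rows.append({'person': person_id, 'value': skill})
    for job in record.get('experience', []):
        experience_rows.append({'person': person_id, **job})

demo_people_df = pd.DataFrame(people_rows)
demo_skills_df = pd.DataFrame(skill_rows)
demo_experience_df = pd.DataFrame(experience_rows)
print(demo_skills_df)
```

Keeping the person ID in every child row is what later allows the relations to be joined back together.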
# TODO: Adapt the data loading code from class.
# YOUR CODE HERE
raise NotImplementedError()
# CIS 545 Sanity Check 0.1 - please do not modify or delete this cell!
display(experience_df)
# CIS 545 Hidden Test 0.1.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 0.1.2 - please do not modify or delete this cell!
# CIS 545 Hidden Test 0.1.3 - please do not modify or delete this cell!
Next save the data to SQLite... Again, using the same approach as in the sample notebook.
Step 0.2: Convert to SQL
conn = sqlite3.connect('linkedin.db')
# YOUR CODE HERE
raise NotImplementedError()
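A minimal sketch of writing a dataframe to SQLite and reading it back, using an in-memory database and a toy table. The table name `skills` and its columns are assumptions; the sample notebook may organize this differently.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')  # in-memory database, nothing written to disk
demo_df = pd.DataFrame({'person': ['p1', 'p1', 'p2'],
                        'value': ['SQL', 'Python', 'Java']})

# if_exists='replace' lets the cell be re-run without a "table already exists" error
demo_df.to_sql('skills', demo_conn, if_exists='replace', index=False)

# Read the table back to confirm the round trip
round_trip_df = pd.read_sql_query(
    'SELECT person, value FROM skills ORDER BY value', demo_conn)
print(round_trip_df)
```

Passing `index=False` keeps the dataframe index from being stored as an extra column.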
# CIS 545 Sanity Check 0.2.1 - please do not modify or delete this cell!
people_df.describe()
# CIS 545 Sanity Check 0.2.2 - please do not modify or delete this cell!
skills_df.describe()
# CIS 545 Sanity Check 0.2.3 - please do not modify or delete this cell!
experience_df.describe()
In this homework, we will use LinkedIn to analyze what it means to be a data scientist (as of a few yea
Step 1: What is a data scientist?
Our first question is: for anyone whose job revolves around data (database administrators, data curator
are the most common skills?
Step 1.1: What are common skills for data scientists?
Complete the collect_skills function below. This and the other functions in this homework allow u
queries even if your data do not match ours. The function should:
1. Using experience_df, find all people with a position containing "data" in the title. Remember upper versus lo
2. Using skills_df, find all people with "data science" as a skill. Again, remember to account for case.
3. For all of the unique people found in steps 1 and 2, find the rest of their skills
4. Return a dataframe of the top 15 skills, by frequency (see pandas.DataFrame.sort_values). The columns shou
scientists (the count of the number of data scientists with this skill).
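The four steps above can be sketched on toy data. This is an illustration under stated assumptions, not a reference solution: the column names `person`, `title`, and `value` are guesses at the schema and may not match your dataframes.

```python
import pandas as pd

# Toy relations; the column names person, title, and value are assumptions.
demo_experience_df = pd.DataFrame({
    'person': ['p1', 'p2', 'p3'],
    'title': ['Senior DATA Engineer', 'Chef', 'Database Administrator'],
})
demo_skills_df = pd.DataFrame({
    'person': ['p1', 'p1', 'p2', 'p2', 'p3', 'p4'],
    'value': ['SQL', 'Python', 'Data Science', 'SQL', 'SQL', 'Cooking'],
})

# Step 1: titles containing "data", ignoring case
title_mask = demo_experience_df['title'].str.contains('data', case=False, na=False)
by_title = demo_experience_df[title_mask]['person']
# Step 2: people listing "data science" as a skill, ignoring case
by_skill = demo_skills_df[demo_skills_df['value'].str.lower() == 'data science']['person']
# Step 3: all skills of the unique people found above
data_people = pd.concat([by_title, by_skill]).drop_duplicates()
their_skills = demo_skills_df[demo_skills_df['person'].isin(data_people)]
# Step 4: count distinct people per skill and keep the most frequent ones
demo_top_df = (their_skills.groupby('value')['person'].nunique()
               .reset_index()
               .rename(columns={'value': 'skill', 'person': 'scientists'})
               .sort_values('scientists', ascending=False)
               .head(15))
print(demo_top_df)
```

Note that a substring match on "data" also catches titles such as "Database Administrator", and that `na=False` treats missing titles as non-matches.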
Step 1.1.1: Collect skills (Pandas)
# TODO: Find the top 15 skills for data scientists (Pandas)
def collect_skills(experience_df, people_df, skills_df):
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 1.1.1 - please do not modify or delete this cell!
top_skills_df = collect_skills(experience_df, people_df, skills_df)
display(top_skills_df)
if "skill" not in top_skills_df:
    raise AssertionError("skill column not defined")
if "scientists" not in top_skills_df:
    raise AssertionError("scientists column not defined")
if len(top_skills_df) != 15:
    raise AssertionError("dataframe does not have top 15")
# CIS 545 Hidden Test 1.1.1.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.1.1.2 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.1.1.3 - please do not modify or delete this cell!
Compute the same table as in 1.1.1 using SQL. Store it as top_skills_sql but otherwise matching t
to also save the data to SQLite in a table called top_skills, as we will be testing to see if this table
Step 1.1.2: Top skills (SQL)
# TODO: Find the top 15 skills for data scientists (SQL)
# YOUR CODE HERE
raise NotImplementedError()
display(top_skills_sql)
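The same kind of aggregation can be expressed in SQL. The sketch below runs against an in-memory SQLite database with toy tables; the table names `experience` and `skills` and the columns `person`, `title`, and `value` are assumptions, so treat it as one possible shape of the query rather than the expected answer.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')
# Toy tables; names and columns are hypothetical.
pd.DataFrame({
    'person': ['p1', 'p2', 'p3'],
    'title': ['Senior DATA Engineer', 'Chef', 'Database Administrator'],
}).to_sql('experience', demo_conn, index=False)
pd.DataFrame({
    'person': ['p1', 'p1', 'p2', 'p2', 'p3', 'p4'],
    'value': ['SQL', 'Python', 'Data Science', 'SQL', 'SQL', 'Cooking'],
}).to_sql('skills', demo_conn, index=False)

# UNION removes duplicate people found by both criteria;
# LOWER() makes both comparisons case-insensitive.
query = """
    SELECT value AS skill, COUNT(DISTINCT person) AS scientists
    FROM skills
    WHERE person IN (
        SELECT person FROM experience WHERE LOWER(title) LIKE '%data%'
        UNION
        SELECT person FROM skills WHERE LOWER(value) = 'data science'
    )
    GROUP BY value
    ORDER BY scientists DESC, skill
    LIMIT 15
"""
demo_top_sql = pd.read_sql_query(query, demo_conn)
# Persist the result as its own table as well
demo_top_sql.to_sql('top_skills', demo_conn, if_exists='replace', index=False)
print(demo_top_sql)
```

The secondary sort on `skill` only makes ties deterministic in this toy example; it is not something the assignment text asks for.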
# CIS 545 Sanity Check 1.1.2 - please do not modify or delete this cell!
if "skill" not in top_skills_sql:
    raise AssertionError("skill column not defined")
if "scientists" not in top_skills_sql:
    raise AssertionError("scientists column not defined")
if len(top_skills_df) < 1:
    raise AssertionError("dataframe has no results")
if len(top_skills_sql.merge(top_skills_df)) != len(top_skills_sql):
    raise AssertionError("Pandas and SQL versions are not of the same length")
# CIS 545 Hidden Test 1.1.2 - please do not modify or delete this cell!
Complete the collect_titles function below that aggregates the most recent titles of people with d
use the given dataframes as input and return a two column dataframe: one column called title and
consider people who have at least min_skills of the top skills for a data scientist. You should also o
min_count times.
For extra practice, you can also do this in SQL, although we are not grading that.
Step 1.2: What are common titles for those with data science skills?
# TODO: Find the common titles (Pandas)
def collect_titles(top_skills_df, skills_df, people_df, experience_df, min_skills, min
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 1.2 - please do not modify or delete this cell!
ds_titles_df = collect_titles(top_skills_df, skills_df, people_df, experience_df, 6, 2
display(ds_titles_df)
if "title" not in ds_titles_df:
    raise AssertionError("title column not defined")
if "count" not in ds_titles_df:
    raise AssertionError("count column not defined")
if len(ds_titles_df) < 1:
    raise AssertionError("dataframe has no results")
# CIS 545 Hidden Test 1.2.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.2.2 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.2.3 - please do not modify or delete this cell!
Now let's find the list of companies that have employed people with the above titles, ranked by numbe
Step 1.3: Who employs "data people" based on title?
Complete the collect_employers function below that aggregates the employers with positions corr
people with data science skills. This function should use the given dataframes as input and return a tw
org and the other called people. Show the names of companies (in field org) with at least min_cou
(include that count in the people column). Order the dataframe by the count of data people in the com
Step 1.3.1: Data employers
# TODO: Find the data employers
def collect_employers(experience_df, ds_titles_df, min_count):
    # YOUR CODE HERE
    raise NotImplementedError()
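A filter-then-count pattern of the kind described for collect_employers might look like the sketch below. The toy data, the column names `person`, `org`, and `title`, and the descending sort order are all assumptions made for illustration.

```python
import pandas as pd

# Toy inputs; column names are hypothetical.
demo_experience_df = pd.DataFrame({
    'person': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6'],
    'org': ['IBM', 'IBM', 'Acme', 'IBM', 'Acme', 'Globex'],
    'title': ['Data Scientist', 'Data Analyst', 'Data Scientist',
              'Data Scientist', 'Data Analyst', 'Chef'],
})
demo_titles_df = pd.DataFrame({'title': ['Data Scientist', 'Data Analyst'],
                               'count': [3, 2]})
demo_min_count = 2

# Keep only positions whose title appears in the list of data titles
matching = demo_experience_df.merge(demo_titles_df[['title']], on='title')
# Count distinct people per employer
per_org = matching.groupby('org')['person'].nunique().reset_index(name='people')
# Apply the threshold and rank employers by head count (descending is assumed)
demo_employers_df = (per_org[per_org['people'] >= demo_min_count]
                     .sort_values('people', ascending=False)
                     .reset_index(drop=True))
print(demo_employers_df)
```

Counting distinct people rather than rows avoids counting someone twice when they held two matching positions at the same employer.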
# CIS 545 Sanity Check 1.3.1 - please do not modify or delete this cell!
employers_df = collect_employers(experience_df, ds_titles_df, 5)
display(employers_df)
if "IBM" not in employers_df['org'].tolist():
    raise AssertionError("Missing IBM")

if employers_df['people'].min() < 4:
    raise AssertionError("Not filtering properly")
# CIS 545 Hidden Test 1.3.1.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.3.1.2 - please do not modify or delete this cell!
Complete the collect_employees function below that aggregates the employees of employers with
recent titles of people with data science skills. In other words, who are the employees of the data emp
their titles? This function should use the given dataframes as input and return the org , family_name
person.
Step 1.3.2: Their employees
# TODO: Find the employees of the data employers
# YOUR CODE HERE
raise NotImplementedError()
# CIS 545 Sanity Check 1.3.2 - please do not modify or delete this cell!
title_people_df = collect_employees(people_df, experience_df, employers_df, names_df,
display(title_people_df)
if len(title_people_df.columns) != 4:
    raise AssertionError('Wrong number of columns. Check schema again')
# CIS 545 Hidden Test 1.3.2.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.3.2.2 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.3.2.3 - please do not modify or delete this cell!
Step 1.4: Find peers
In many common social graph settings, we can make recommendations to people based on their simi
define similarity in terms of the number of identical skills.
Suppose A and B have similar skills: A -> X1 and B -> X1, A -> X2 and B -> X2, etc. up to A -> Xk and B ->
Then given that A and B have similar skills, we might recommend A's employer to B, and vice versa.
Let's consider only the first 100 people in people_df.
Find, out of this set, the pairs of people with the most shared/common skills, and return the closest 20
this to make a recommendation for a potential employer and position to each person.
Step 1.4.0: Making the problem tractable in Pandas
Complete the collect_peers function below that finds the top num pairs of peers. In other words, co
person, counting the total set of skills in common. This function should use the given dataframes and
dataframe: person_1, person_2, and common_skills. The first two columns should be person IDs a
of skills that this pair of people shares.
Hint: Doing this requires a Cartesian product, i.e., every ID paired with every other ID. Think about how t
then add a field to this dataframe that will let us combine every record with every record.
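The hint about adding a field so that every record combines with every record can be sketched with a constant join key. The toy data, the `_id`, `person`, and `value` column names, and the choice to keep each unordered pair only once are assumptions for illustration, not the required solution.

```python
import pandas as pd

demo_people_df = pd.DataFrame({'_id': ['p1', 'p2', 'p3']})  # assumed ID column name
demo_skills_df = pd.DataFrame({
    'person': ['p1', 'p1', 'p2', 'p2', 'p3'],
    'value': ['SQL', 'Python', 'SQL', 'Python', 'SQL'],
})

# A constant key makes every row match every row when the frame is merged with itself
keyed = demo_people_df.assign(key=1)
pairs = keyed.merge(keyed, on='key', suffixes=('_1', '_2')).drop(columns='key')
pairs = pairs.rename(columns={'_id_1': 'person_1', '_id_2': 'person_2'})
# Keep each unordered pair once and drop self-pairs
pairs = pairs[pairs['person_1'] < pairs['person_2']]

# Attach person_1's skills, then keep rows where person_2 has the same skill
left = pairs.merge(demo_skills_df.rename(columns={'person': 'person_1'}),
                   on='person_1')
both = left.merge(demo_skills_df.rename(columns={'person': 'person_2'}),
                  on=['person_2', 'value'])
demo_recs_df = (both.groupby(['person_1', 'person_2']).size()
                .reset_index(name='common_skills')
                .sort_values('common_skills', ascending=False)
                .head(2))
print(demo_recs_df)
```

With 100 people the Cartesian product has 10,000 rows, which is why restricting to a subset keeps the problem tractable. Recent pandas versions also accept `how='cross'` in `merge`, which produces the same product without the helper column.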
Step 1.4.1: Compute the top pairs of peers
# TODO: Finish the collect_peers function
people_df_subset = people_df.head(100)
def collect_peers(people_df_subset, skills_df, num):
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 1.4.1 - please do not modify or delete this cell!
recs_df = collect_peers(people_df_subset, skills_df, 20)
display(recs_df)
if "person_1" not in recs_df:
    raise AssertionError("person_1 column not defined")
if "person_2" not in recs_df:
    raise AssertionError("person_2 column not defined")
if "common_skills" not in recs_df:
    raise AssertionError("common_skills column not defined")
if(len(recs_df) != 20):
    raise AssertionError('Wrong number of rows in recs_df')
# CIS 545 Hidden Test 1.4.1.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.4.1.2 - please do not modify or delete this cell!
Complete the last_job function below that takes experience_df as input and returns the person ,
person's last (most recent) employment experience (three column dataframe).
Step 1.4.2: Get the last jobs
# TODO: Complete the last_job function
def last_job(experience_df):
    # YOUR CODE HERE
    raise NotImplementedError()
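Picking the most recent row per person is a common sort-and-deduplicate pattern. The sketch below assumes a `start` column holding sortable date strings, which may not match how dates are stored in your experience_df; the data and column names are hypothetical.

```python
import pandas as pd

# Toy data; the start column and its format are assumptions about the schema.
demo_experience_df = pd.DataFrame({
    'person': ['p1', 'p1', 'p2'],
    'title': ['Analyst', 'Data Scientist', 'Engineer'],
    'org': ['Acme', 'Initech', 'Globex'],
    'start': ['2012-01', '2016-06', '2015-03'],
})

# Sort chronologically, then keep the final row seen for each person
demo_last_job_df = (demo_experience_df
                    .sort_values(['person', 'start'])
                    .drop_duplicates(subset='person', keep='last')
                    [['person', 'title', 'org']]
                    .reset_index(drop=True))
print(demo_last_job_df)
```

Year-month strings in this format sort correctly as text; other date formats would need to be parsed first, for example with `pd.to_datetime`.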
# CIS 545 Sanity Check 1.4.2 - please do not modify or delete this cell!
last_job_df = last_job(experience_df)
display(last_job_df)
if(len(last_job_df.columns) != 3):
    raise AssertionError('Wrong number of columns in last_job_df')
# CIS 545 Hidden Test 1.4.2.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.4.2.2 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.4.2.3 - please do not modify or delete this cell!
Complete the recommend_jobs function below that takes recs_df, names_df, and last_job_df as
person_2's most recent title and org.
Step 1.4.3: Recommend jobs
# TODO: Complete the recommend_jobs function
def recommend_jobs(recs_df, names_df, last_job_df):
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 1.4.3 - please do not modify or delete this cell!
recommended_df = recommend_jobs(recs_df, names_df, last_job_df)
display(recommended_df)
if "family_name" not in recommended_df:
    raise AssertionError("family_name column not defined")
if "given_name" not in recommended_df:
    raise AssertionError("given_name column not defined")
if "person_2" not in recommended_df:
    raise AssertionError("person_2 column not defined")
if "org" not in recommended_df:
    raise AssertionError("org column not defined")
if "title" not in recommended_df:
    raise AssertionError("title column not defined")
# CIS 545 Hidden Test 1.4.3 - please do not modify or delete this cell!
This last section relates to our discussions in lecture about computation efficiency with big data.
Step 2: Compare Evaluation Orders
Let's look at some computation and optimization tasks. We'll start with the code from our lecture note
dataframes.
Step 2.0: Load custom functions
# Join using nested loops
def merge(S,T,l_on,r_on):
    ret = pd.DataFrame()
    count = 0
    S_ = S.reset_index().drop(columns=['index'])
    T_ = T.reset_index().drop(columns=['index'])
    for s_index in range(0, len(S)):
        for t_index in range(0, len(T)):
            count = count + 1
            if S_.loc[s_index, l_on] == T_.loc[t_index, r_on]:
                ret = ret.append(S_.loc[s_index].append(T_.loc[t_index].drop(labels=r_
    print('Merge compared %d tuples'%count)
    return ret

# Join using a *map*, which is a kind of in-memory index
# from keys to (single) values
def merge_map(S,T,l_on,r_on):
    ret = pd.DataFrame()
    T_map = {}
    count = 0
    # Take each value in the r_on field, and
    # make a map entry for it
    T_ = T.reset_index().drop(columns=['index'])
    for t_index in range(0, len(T)):
        # Make sure we aren't overwriting an entry!
        assert (T_.loc[t_index,r_on] not in T_map)
        T_map[T_.loc[t_index,r_on]] = T_.loc[t_index]
        count = count + 1
    # Now find matches
    S_ = S.reset_index().drop(columns=['index'])
    for s_index in range(0, len(S)):
        count = count + 1
        if S_.loc[s_index, l_on] in T_map:
                ret = ret.append(S_.loc[s_index].append(T_map[S_.loc[s_index, l_on]].d
    print('Merge compared %d tuples'%count)
    return ret
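The difference between the two join strategies shows up clearly in the number of steps each performs. The sketch below restates both ideas on plain Python lists of (key, value) tuples rather than dataframes; `nested_loop_join` and `map_join` are illustrative names, not the course functions.

```python
def nested_loop_join(S, T):
    """Compare every pair of rows: len(S) * len(T) comparisons."""
    out, comparisons = [], 0
    for s_key, s_val in S:
        for t_key, t_val in T:
            comparisons += 1
            if s_key == t_key:
                out.append((s_key, s_val, t_val))
    return out, comparisons

def map_join(S, T):
    """Index T by key first, then probe once per row of S: len(T) + len(S) steps."""
    t_map, steps = {}, 0
    for t_key, t_val in T:
        t_map[t_key] = t_val  # assumes keys in T are unique, as merge_map asserts
        steps += 1
    out = []
    for s_key, s_val in S:
        steps += 1
        if s_key in t_map:
            out.append((s_key, s_val, t_map[s_key]))
    return out, steps

S = [(i, 's%d' % i) for i in range(100)]
T = [(i, 't%d' % i) for i in range(50, 250)]
nested_rows, nested_count = nested_loop_join(S, T)
map_rows, map_count = map_join(S, T)
print(nested_count, map_count)
```

Both joins return the same rows here, but the nested loop grows with the product of the input sizes while the map-based join grows with their sum. The map-based version only applies when the indexed side has unique keys.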
Reimplement recommend_jobs using the above merge or merge_map functions instead of Pandas' m
You should start with the dataframes recs_df , names_df , and last_job_df from above. Store your
Step 2.1: Find an optimal order of evaluation.
# TODO: Reimplement recommend jobs using our custom merge and merge_map functions
def recommend_jobs_new(recs_df, names_df, last_job_df):
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 2.1 - please do not modify or delete this cell!
%%time
recs_new_df = recommend_jobs_new(recs_df, names_df, last_job_df)
if(len(recs_new_df.columns) != 5):
    raise AssertionError('Wrong number of columns in recs_new_df')
1. When you are done, select “Edit” at the top of the window, under the filename, not the one that may appear ab
do this just before turning in your homework because it reduces the size of your file.
Step 3: Submitting Your Homework
2. In the same menu under the filename, select “File” and then “Download .ipynb”. It is very importa
of this downloaded notebook. Make sure that something like “(1)” did not get added to the filena
the .py version. Our autograder can only handle .ipynb files with the correct file name.
3. Compress the ipynb file into a Zip file hw2.zip.
4. Go to the submission site, and click on the Google icon. Log in using your Google@SEAS (if at al
student) GMail account.
5. Click on the Courses icon at the top, then select CIS 545 and Save. Select cis545-2019c-hw2 an
6. You should see a message on the submission site notifying you about whether your submission
necessary, but may have to withdraw your previous submission in OpenSubmit in order to do so.
If you have not already, please go to Settings and set your Student ID to your PennID (all numbers).