ST446 Assessment 2
Dataset
We use an English-language Wikipedia dump dataset in this assessment, similarly to Assignment 1. You must use the dump file available for download from here. This is a bzip2-compressed XML file.
Cluster configuration
Each problem requires a different cluster configuration (see below). You can submit separate notebooks for each question.
Remember to adjust the project name, bucket and other parameters to match your own setup.
For P1, use the following configuration:
gcloud dataproc clusters create st446-cluster --properties=^#^spark:spark.jars.packages=graphframes:graphfram
For P2, use the following configuration:
BUCKET="st446-w9-bucket"
gcloud beta dataproc clusters create st446-cluster --project st446-1t2023 --bucket ${BUCKET} --region europe-
P1 Graph data processing
In this exercise, the task is to perform graph data processing using the PySpark GraphFrames API. You should use graph queries, not dataframe/SQL queries.
P1.1 Creating a Vertex dataframe
You need to create a Vertex dataframe by first creating three Vertex dataframes and then creating the final Vertex dataframe as the union of the three Vertex dataframes (union by column name). The specification for the three Vertex dataframes is as follows.
• Collaborator Vertex dataframe vco
  • Vertex ID is the md5 hash of the concatenation of the username and contributor id strings
  • Attribute column name type, String type, column values = "contributor"
  • Attribute column name contributorID, String type, column values are contributor id values
  • Attribute column name name, String type, column values are username values
• Page Vertex dataframe vpa
  • Vertex ID is the md5 hash of the concatenation of page id and page title
  • Attribute column name type, String type, column values = "page"
  • Attribute column name pageID, String type, column values are page id values
  • Attribute column name title, String type, column values are page titles
• Category Vertex dataframe vca
  • Vertex ID is the md5 hash of the category name
  • Attribute column name type, String type, column values = "category"
  • Attribute column name category, String type, column values are category names
The final Vertex dataframe, v, must be the union of Vertex dataframes vco, vpa and vca (union by column name).
Show the schema and top 5 rows for each of the four Vertex dataframes.
Note: all the md5 hash values must be encoded in hexadecimal format.
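The ID scheme and the union-by-column-name step can be sketched in plain Python, which may help when spot-checking a few rows of the Spark output. This is a minimal sketch, not a solution: the sample records are invented for illustration, the helper names md5_hex and union_by_name are hypothetical, and the column name id is assumed because GraphFrames expects the vertex identifier in a column called id. In PySpark the corresponding building blocks are likely to be md5 and concat from pyspark.sql.functions, and DataFrame.unionByName with allowMissingColumns=True (Spark 3.1 or later) to fill columns that a dataframe lacks with nulls.

```python
import hashlib


def md5_hex(*parts):
    """Return the hexadecimal MD5 digest of the concatenated string parts."""
    joined = "".join(str(p) for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical sample records, invented for illustration only.
# Contributor ID: md5(username + contributor id).
vco = [{"id": md5_hex("alice", "42"), "type": "contributor",
        "contributorID": "42", "name": "alice"}]
# Page ID: md5(page id + page title).
vpa = [{"id": md5_hex("12", "Anarchism"), "type": "page",
        "pageID": "12", "title": "Anarchism"}]
# Category ID: md5(category name).
vca = [{"id": md5_hex("Political theories"), "type": "category",
        "category": "Political theories"}]


def union_by_name(*tables):
    """Union rows by column name, filling columns a table lacks with None."""
    columns = []
    for table in tables:
        for row in table:
            for col in row:
                if col not in columns:
                    columns.append(col)
    return [{col: row.get(col) for col in columns}
            for table in tables for row in table]


v = union_by_name(vco, vpa, vca)
for row in v:
    print(row)
```

Each row of v carries all seven columns, with None where the source dataframe had no such attribute, and every id is a 32-character lowercase hexadecimal string.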
P1.2 Creating an Edge dataframe
You need to create an Edge dataframe by first creating two Edge dataframes and then taking their union as the final Edge dataframe. The first (contributor-page) contains edges with a contributor as the source vertex and a page as the destination vertex; the second (page-category) contains edges with a page as the source vertex and a category as the destination vertex. The specification for the two Edge dataframes is as follows:
Contributor-page Edge dataframe ep