THE HONG KONG UNIVERSITY OF SCIENCE & TECHNOLOGY
Department of Computer Science and Engineering
MSBD5008: Introduction to Social Computing
Fall 2020 Assignment 1
IMPORTANT NOTES
Your grade will be based on correctness and clarity.
Late submission: 25 marks will be deducted for every 24 hours after the deadline.
ZERO-Tolerance on Plagiarism: All involved parties will get zero mark.
NetworkX
In this question, you are required to use NetworkX to do basic data analysis on a Wikipedia vote network dataset. It
contains 7,115 nodes and 103,689 (directed) edges. The dataset can be downloaded from http://snap.stanford.edu/data/wiki-Vote.html.
1. Use the function nx.read_edgelist() to load the dataset Wiki-Vote.txt.
2. Output the following information related to degree:
average degree, average in-degree, average out-degree;
degree distribution (plot both the degree and frequency in log scale);
density (E/N^2), where E is the number of edges and N is the number of nodes;
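The degree statistics in item 2 can be sketched as follows. Since Wiki-Vote.txt is not bundled here, a tiny directed graph stands in for it; the real file would be loaded with nx.read_edgelist("Wiki-Vote.txt", create_using=nx.DiGraph).

```python
import networkx as nx
from collections import Counter

# Toy stand-in for the Wiki-Vote graph.
G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (1, 3)])

N, E = G.number_of_nodes(), G.number_of_edges()
avg_in = sum(d for _, d in G.in_degree()) / N
avg_out = sum(d for _, d in G.out_degree()) / N
avg_deg = sum(d for _, d in G.degree()) / N   # in + out for a DiGraph

# degree -> frequency, for the log-log plot
freq = Counter(d for _, d in G.degree())
# e.g. plt.loglog(sorted(freq), [freq[d] for d in sorted(freq)], "o")

# density as defined in the handout (note: nx.density uses E / (N * (N - 1)))
density = E / N**2
print(avg_deg, avg_in, avg_out, density)
```

For the plot, both axes can be put in log scale with matplotlib's loglog, as hinted in the comment.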
3. Find the largest strongly connected component (giant component), and output the number of nodes in it;
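Item 3 can be sketched with nx.strongly_connected_components, again on a toy directed graph in place of the real dataset:

```python
import networkx as nx

# Toy stand-in: nodes {1, 2, 3} form a directed cycle; node 4 is a sink.
G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (3, 4)])

# strongly_connected_components yields node sets; keep the largest one
giant_nodes = max(nx.strongly_connected_components(G), key=len)
giant = G.subgraph(giant_nodes).copy()
print(giant.number_of_nodes())  # → 3 here ({1, 2, 3})
```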
4. Output the following information about this giant component related to distance and clustering:
distribution of path length;
average path length;
distribution of clustering coefficient;
average clustering coefficient.
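The distance and clustering quantities in item 4 can be sketched like this, assuming giant is the strongly connected subgraph found in item 3 (a directed 3-cycle is used as a stand-in):

```python
import networkx as nx
from collections import Counter

giant = nx.DiGraph([(1, 2), (2, 3), (3, 1)])  # toy strongly connected graph

# distribution of (directed) shortest-path lengths between distinct pairs
lengths = [l for src, dists in nx.shortest_path_length(giant)
           for tgt, l in dists.items() if src != tgt]
dist = Counter(lengths)
avg_len = sum(lengths) / len(lengths)  # == nx.average_shortest_path_length(giant)

# clustering coefficient per node, and its average
cc = nx.clustering(giant)
avg_cc = sum(cc.values()) / len(cc)    # == nx.average_clustering(giant)
```

Plotting dist and the histogram of cc.values() gives the two required distributions.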
5. Treat the network as undirected. Output the following information related to degree:
average degree;
degree distribution (plot both the degree and frequency in log scale);
density (E/N^2).
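For item 5, the directed graph can be converted with to_undirected(), after which the same degree statistics apply; a minimal sketch on a toy graph:

```python
import networkx as nx

G = nx.DiGraph([(1, 2), (2, 1), (2, 3)])
UG = G.to_undirected()  # a pair of opposite directed edges collapses into one

N, E = UG.number_of_nodes(), UG.number_of_edges()
avg_deg = 2 * E / N          # each undirected edge contributes two endpoints
density = E / N**2           # density as defined in the handout
```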
Deep Graph Library (DGL)
In this question, you are required to use DGL to build a graph neural network for node classification. The dataset
hw_dataset.pkl can be downloaded from https://drive.google.com/file/d/1cZo93mIX37kI0wBKxulWE8CwSvjfUGZH/view?usp=sharing
1. Load the dataset with the following command:
dataset = pkl.load(open("hw_dataset.pkl", "rb"))
This file contains a dictionary object with the following information of a directed graph:
nodes: a list containing the id’s of all the nodes in the graph;
labels: a list containing the label of each node;
num_classes: the total number of node labels;
features: a matrix of size number-of-nodes × feature-dimensionality;
source_nodes: a list containing the source node-id of each (directed) edge;
target_nodes: a list containing the target node-id of each (directed) edge;
train_mask: a list (of values True or False) indicating whether each node is in the training set;
val_mask: same format as train_mask, indicating whether each node is in the validation set.
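A mock of the dictionary structure described above may help clarify how the fields fit together. The field names follow the handout; the sizes and values here are invented for illustration, and the real file is of course loaded from hw_dataset.pkl instead.

```python
# Mock of the structure expected in hw_dataset.pkl (values are illustrative).
dataset = {
    "nodes": [0, 1, 2],
    "labels": [0, 1, 0],
    "num_classes": 2,
    "features": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # num_nodes x feat_dim
    "source_nodes": [0, 1],          # edge i goes source_nodes[i] -> target_nodes[i]
    "target_nodes": [1, 2],
    "train_mask": [True, True, False],
    "val_mask": [False, False, True],
}

# Sanity checks that should also hold for the real file:
assert len(dataset["labels"]) == len(dataset["nodes"])
assert len(dataset["features"]) == len(dataset["nodes"])
assert len(dataset["source_nodes"]) == len(dataset["target_nodes"])
# With DGL, the graph would then be built with something like
# g = dgl.graph((dataset["source_nodes"], dataset["target_nodes"]))
```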
2. You have to use the graph neural network model dgl.nn.pytorch.conv.GINConv in DGL. It implements the following neighborhood aggregation:
h_i^(l+1) = f_Θ( (1 + ε^(l)) · h_i^(l) + aggregate({ h_j^(l) : j ∈ N(i) }) )
This model includes the graph neural network model discussed in class, but is more general. For details, read
https://docs.dgl.ai/api/python/nn.pytorch.html#dgl.nn.pytorch.conv.GINConv.
Your task is to find a model with high node classification accuracy. Your grade will be based on your model’s node
classification accuracy on a test set (which is hidden from you). We will use the following code to test your model.
Your code should include a test function (with your model and a mask as inputs) so that we do not need to retrain
your model.
load_checkpoint("best_model.pth", model)
# the test_mask here is hidden from you; you can replace it with val_mask.
accuracy = test(model, test_mask)
print("Testing Acc {:.4}".format(accuracy))
Please also use the following functions. To save your final model:
def save_checkpoint(checkpoint_path, model):
    # state_dict: a Python dictionary that maps each layer of a model
    # to its parameter tensor
    state = {'state_dict': model.state_dict()}
    torch.save(state, checkpoint_path)
    print('model saved to %s' % checkpoint_path)

save_checkpoint("best_model.pth", model)
To reload your model for evaluation:
def load_checkpoint(checkpoint_path, model):
    state = torch.load(checkpoint_path)
    model.load_state_dict(state['state_dict'])
    print('model loaded from %s' % checkpoint_path)

load_checkpoint("best_model.pth", model)
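The save/load round-trip above can be sanity-checked on any PyTorch model. The sketch below uses a single nn.Linear layer as a stand-in for the actual GIN network, and verifies that a freshly initialised copy receives the saved parameters exactly:

```python
import torch
import torch.nn as nn

def save_checkpoint(checkpoint_path, model):
    # state_dict maps each layer to its parameter tensors
    torch.save({'state_dict': model.state_dict()}, checkpoint_path)

def load_checkpoint(checkpoint_path, model):
    state = torch.load(checkpoint_path)
    model.load_state_dict(state['state_dict'])

m1 = nn.Linear(4, 2)                     # stand-in for the GIN model
save_checkpoint("best_model.pth", m1)

m2 = nn.Linear(4, 2)                     # freshly initialised weights
load_checkpoint("best_model.pth", m2)
assert torch.equal(m1.weight, m2.weight)  # parameters restored exactly
```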
Submission Guidelines
Please submit two Python notebooks (A1.ipynb and A2.ipynb) and a report (report.pdf) with your results and conclusions.
Zip all the files into A1_awangab_12345678 (replace awangab with your UST account and 12345678 with your student ID), and submit the assignment by uploading the compressed file to Canvas.
Note that the assignment should be clearly legible; you may lose points if it is difficult to read. Plagiarism will lead to zero marks on this assignment.