FIT9133 Unit Guide

1 Introduction

This assignment is due on 24 May 2019 (Friday) by 17:00. It is worth 20% of the total unit mark. No late submissions are accepted. Refer to the FIT9133 Unit Guide for the policy on extensions or special considerations.

Note that this is an individual assignment and must be your own work. Please pay attention to Section 4.2 of this document on the university policies on Academic Integrity, Plagiarism and Collusion.

This second assignment consists of three tasks. Each task should be submitted as a separate Python 3 source code file, together with its output files and supporting documentation on its usage. All the Python files and any supporting documents should be compressed into a single zip file for submission. (The submission details are given in Section 4.)

© 2019, Faculty of IT, Monash University

2 The Assignment

In this assignment, you will implement a basic parser to investigate the natural-language posts from a Q&A (question and answer) site. The parser should perform basic data extraction, carry out statistical analysis on a number of linguistic features, and present the analysis results using some form of visualization.

For all three tasks, we provide template source code files. Please fill in your code within the template files following the instructions. Not following the rules in the template files may incur mark penalties. You may use only the following Python libraries:

• math
• re
• NumPy, SciPy, Matplotlib, and pandas

2.1 The Dataset: HardwareRecs

Before you get started with any of the programming tasks, you should read through the description of the dataset that we will be using for this assignment. The dataset is known as HardwareRecs [https://hardwarerecs.stackexchange.com], which is a Q&A site for people seeking specific hardware recommendations.
A Q&A site is a platform on which users exchange knowledge by asking and answering questions; examples include Quora, Zhihu, and Stack Overflow. Within HardwareRecs, users can ask for hardware recommendations, and other users can answer those questions with corresponding suggestions.

The data is written in XML (Extensible Markup Language) format. Apart from the first two lines and the last line, which are XML-specific formatting, each line in the dataset represents the record of one post on the Q&A site, i.e., each row beginning with “” is a piece of data in this assignment. As seen in Figure 1, each post contains four attributes:

• Id: the unique identifier of each post
• PostTypeId: the type of the post (1 = Question, 2 = Answer, 3 to 8 = Others)
• CreationDate: the creation date and time of the post (formatted as yyyy-mm-ddThh:mm:ss)
• Body: the content of the post

Figure 1: An excerpt from HardwareRecs

You should note that many different “PostTypeId” values are recorded in the dataset. However, for the purpose of this assignment, the data required for processing and analysis are the questions and answers on the site, which are the rows whose “PostTypeId” is 1 or 2.

Note: You should download the dataset from the FIT9133 S1 2019 Moodle site before attempting the following tasks. The dataset is named data.xml.

2.2 Task 1: Handling with File Contents and Preprocessing

In the first task, you will begin by reading in all the posts of the given dataset. You will then conduct a number of pre-processing tasks to clean the post content (Body) needed for analysis in the subsequent tasks (Tasks 2 and 3) of this assignment. Upon completing the pre-processing tasks, the content of questions and answers should be saved as two individual output files.
This is a more efficient approach whenever we need to manipulate the cleaned dataset, because the pre-processing does not have to be repeated, especially for large-scale data analysis.

For each post, you should first extract its content/body, i.e., the string embedded within “Body:"..."” in each row of the XML file. Then you need to carry out the following preprocessing steps on it:

(a) In HTML and XML documents, the logical constructs known as character data and attribute values consist of sequences of characters, in which each character can appear directly (representing itself) or can be represented by a series of characters called a character reference. We need to convert those character references back to their original form by following the rules in Table 1.

Table 1: Character Reference Transformation.

Character reference    Original form
&amp;                  &
&quot;                 "
&apos;                 '
&gt;                   >
&lt;                   <

Example:

Before filtering:

In $200 price range, should I be looking at cards from AMD or Nvidia?

After filtering:

In $200 price range, should I be looking at cards from AMD or Nvidia?
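
The conversion in step (a) can be sketched as follows. This is a minimal illustration, not the template's required implementation: it assumes that Table 1 refers to the five predefined XML entities (`&amp;`, `&quot;`, `&apos;`, `&gt;`, `&lt;`), and the names `CHAR_REFS` and `convert_char_refs` are hypothetical. It uses only the `re` module, which is on the list of permitted libraries.

```python
import re

# Assumed mapping for Table 1: character reference -> original form.
CHAR_REFS = {
    "&amp;": "&",
    "&quot;": '"',
    "&apos;": "'",
    "&gt;": ">",
    "&lt;": "<",
}

# One alternation pattern matching any of the references above.
_CHAR_REF_PATTERN = re.compile("|".join(re.escape(ref) for ref in CHAR_REFS))


def convert_char_refs(text):
    """Replace each character reference in text with its original form.

    The substitution is done in a single pass, so text produced by one
    replacement is never decoded a second time.
    """
    return _CHAR_REF_PATTERN.sub(lambda match: CHAR_REFS[match.group(0)], text)


if __name__ == "__main__":
    sample = "&lt;p&gt;AMD &amp; Nvidia&lt;/p&gt;"
    print(convert_char_refs(sample))
```

A single-pass substitution is chosen here over a chain of `str.replace` calls because the result then does not depend on the order in which the references are processed; for instance, `&amp;lt;` decodes to the literal text `&lt;` rather than to `<`.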

(b) Replace special characters including “ ”, “ ” with a single space.

(c) Remove all HTML tags. All data within the Body attribute is the content of a post on the HardwareRecs site, and it is rendered in HTML (Hypertext Markup Language) format. Within HTML, there are many tags to annotate the content such as