
FIT9133 Unit Guide

1 Introduction

This assignment is due on 24 May 2019 (Friday) by 17:00. It is worth 20% of the total unit mark. No late submissions are accepted. Refer to the FIT9133 Unit Guide for the policy on extensions or special considerations.

Note that this is an individual assignment and must be your own work. Please pay attention to Section 4.2 of this document on the university policies for Academic Integrity, Plagiarism and Collusion.

This second assignment consists of three tasks. Each task should be submitted as a separate Python 3 source code file, together with its output files and supporting documentation on its usage. All the Python files and any supporting documents should be compressed into one single zip file for submission. (The submission details are given in Section 4.)

© 2019, Faculty of IT, Monash University

2 The Assignment

In this assignment, you will implement a basic parser to investigate the natural-language posts from a Q&A (question and answer) site. The parser should perform basic data extraction, carry out statistical analysis on a number of linguistic features, and present the analysis results using some form of visualization.

For all three tasks, we provide template source code files; please fill in your code within the template files following the instructions. Not following the rules in the template file may incur mark penalties. You may only make use of the following Python libraries:

• math
• re
• NumPy, SciPy, Matplotlib, and pandas

2.1 The Dataset: HardwareRecs

Before you get started with any of the programming tasks, you should read through the description of the dataset that we will be using for this assignment. The dataset is known as HardwareRecs [https://hardwarerecs.stackexchange.com], which is a Q&A site for people seeking specific hardware recommendations.
A Q&A site is a platform for users to exchange knowledge by asking and answering questions; examples include Quora, Zhihu, and Stack Overflow. Within HardwareRecs, users can ask questions about hardware recommendations, and other users can answer those questions with corresponding suggestions.

The data is written in XML (Extensible Markup Language) format. Apart from the first two lines and the last line, which are XML-specific formatting, each line in the dataset represents a record of a post in the Q&A site, i.e., each such row is a piece of data in this assignment. As seen in Figure 1, each post contains four attributes:

• Id: the unique identifier of each post
• PostTypeId: the type of the post (1 = Question, 2 = Answer, 3 to 8 = Others)
• CreationDate: the creation date and time of the post (formatted as yyyy-mm-ddThh:mm:ss)
• Body: the content of the post

Figure 1: An excerpt from HardwareRecs

You should note that there are many different “PostTypeId” values recorded in the dataset. However, for the purpose of this assignment, the data required for processing and analysis are the questions and answers, which are the rows whose “PostTypeId” is 1 or 2.

Note: You should download the dataset from the FIT9133 S1 2019 Moodle site before attempting the following tasks. The dataset is named data.xml.

2.2 Task 1: Handling with File Contents and Preprocessing

In the first task, you will begin by reading in all the posts of the given dataset. You will then conduct a number of pre-processing tasks to clean the post content (Body) needed for analysis in the subsequent tasks (Tasks 2 and 3) of this assignment. Upon completing the pre-processing tasks, the content of questions and answers should be saved as two individual output files.
This is a more efficient approach whenever we need to manipulate the cleaned dataset, because the pre-processing does not have to be repeated, which matters especially for large-scale data analysis.

For each post, you should first extract its content/body, i.e., the string embedded within “Body:"..."” in each row of the XML file. Then carry out the following pre-processing steps:

(a) In HTML and XML documents, the logical constructs known as character data and attribute values consist of sequences of characters, in which each character can appear directly (representing itself) or be represented by a series of characters called a character reference. We need to convert those special character references back to their original form by following the rules in Table 1.

Table 1: Character Reference Transformation.

Character reference    Original form
&amp;                  &
&quot;                 "
&apos;                 '
&gt;                   >
&lt;                   <

Example: Before filtering:

&lt;p&gt;In $200 price range, should I be looking at cards from AMD or Nvidia?&lt;/p&gt;

After filtering:

<p>In $200 price range, should I be looking at cards from AMD or Nvidia?</p>
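
Step (a) could be sketched in Python roughly as follows. This is only an illustration, assuming the five predefined XML character references (&amp;, &quot;, &apos;, &gt;, &lt;); the helper name convert_references is hypothetical and is not part of the assignment template.

```python
# Hypothetical helper for step (a); not part of the provided template.
# Pairs of (character reference, original form).
REFERENCES = [
    ("&quot;", '"'),
    ("&apos;", "'"),
    ("&gt;", ">"),
    ("&lt;", "<"),
    ("&amp;", "&"),  # replaced last, so "&amp;lt;" becomes "&lt;" and not "<"
]


def convert_references(text):
    """Return text with each character reference replaced by its original form."""
    for reference, original in REFERENCES:
        text = text.replace(reference, original)
    return text


if __name__ == "__main__":
    sample = "&lt;p&gt;AMD &amp; Nvidia&lt;/p&gt;"
    print(convert_references(sample))  # <p>AMD & Nvidia</p>
```

Replacing “&amp;” last is a deliberate choice in this sketch: doing it first would turn text such as “&amp;lt;” into “&lt;” and then, incorrectly, into “<”.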

(b) Replace special characters including “&#xA;” (line feed) and “&#xD;” (carriage return) by a single space.

(c) Remove all HTML tags. All data within the Body attribute is the content of a post on the HardwareRecs site, and it is rendered in HTML (Hypertext Markup Language) format. Within HTML, there are many tags that annotate the content, such as <p>. Tags come as start tags like <p> and end tags like </p>. All of these tags are written in the format “<*>”, and some tags also carry attributes inside the angle brackets.

Figure 2: The example output of the first task: (a) question.txt, (b) answer.txt

You should remove all these HTML tags (including the attributes inside them). Note that we assume the content in the body contains complete tags, i.e., all start tags are accompanied by matching end tags.

Example: Before filtering:

<p>In $200 price range, should I be looking at cards from AMD or Nvidia?</p>

After filtering: In $200 price range, should I be looking at cards from AMD or Nvidia?

If you wish to read more about character references, Wikipedia has a page on the topic (https://en.wikipedia.org/wiki/Character_encodings_in_HTML#XML_character_references). More details about HTML tags can be found at https://www.w3schools.com/tags/. However, it is not necessary to read these or to understand HTML/XML to finish the assignment.

Finally, once you have completed the filtering process, you should identify whether the post is a question or an answer. You should then save the data into two different files, “question.txt” and “answer.txt”, according to the post type shown in the data. The cleaned body/content of each post needs to be saved on one line of the output file. Examples can be seen in Figure 2.

Note: You should write your code within the given template file “preprocessData_studentID.py”, and name the file with your own ID. There are two functions in the file: preprocessLine(inputLine), which deals with each valid data row from the file, and splitFile(inputFile, outputFile_question, outputFile_answer), which reads the input file, calls preprocessLine to process each line, and saves the cleaned questions and answers into the output files. All files should be saved in the same folder as the source code file, i.e., without using absolute paths.

2.3 Task 2: Building a Class for Data Analysis

The second task is about collating the required data for analysis. Apart from extracting the clean body as achieved in Task 1, the main task here is to further parse a given row of the data in XML format using object-oriented programming. Your class “Parser” should contain the following methods:

• __init__(self, inputString): This is the constructor required for creating instances of this class. The inputString will be a row of data from the XML file.
• __str__(self): Re-define this method to present your data (the instance variables) in a readable format. You should return a formatted string in this method. The order of output should be “ID, post type, creation date quarter, the cleaned content”.
• getID(self): Get the Id of the post (indicated by the “Id” attribute).
• getPostType(self): Get the type of the post (indicated by the “PostTypeId” attribute), with 1 as the question, 2 as the answer, and 3-8 as others.
• getDateQuarter(self): Get the quarter of the creation date (indicated by the “CreationDate” attribute). One year has four quarters: Q1 (Jan to Mar), Q2 (Apr to Jun), Q3 (Jul to Sep) and Q4 (Oct to Dec). For example, given “2016-04-07T18:11:33.793” as the CreationDate, your program should return the string “2016Q2”.
• getCleanedBody(self): Get the cleaned body of the post (indicated by the “Body” attribute), which is the same cleaned body as extracted in Task 1. You can import the function preprocessLine() from the Task 1 template to reuse its pre-processing functionality. Unlike Task 1, however, we do not require splitting the questions/answers or saving to a file.
• getVocabularySize(self): Get the number of unique words in the cleaned body after converting it to lower case. Note that we do not count spaces or punctuation as words. For example, given the sentence “Although I use Mac, I do not like Mac.”, there are 7 unique words: {“although”, “i”, “use”, “mac”, “do”, “not”, “like”}. The counting process may involve splitting the words from the cleaned body returned by the getCleanedBody() method. Note that just using str.split(" ") is not enough, as it may mistakenly recognize “mac,” as a word instead of “mac”.
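
The logic behind getDateQuarter and getVocabularySize could be sketched as two standalone functions, shown below as a rough illustration rather than the required implementation. The function names date_quarter and vocabulary_size are hypothetical, and the definition of a "word" used here (a run of letters or digits, optionally joined by apostrophes) is an assumption; the assignment only states that spaces and punctuation are not words.

```python
import re


def date_quarter(creation_date):
    """Map a CreationDate such as "2016-04-07T18:11:33.793" to "2016Q2"."""
    year = creation_date[0:4]
    month = int(creation_date[5:7])
    quarter = (month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
    return "{}Q{}".format(year, quarter)


def vocabulary_size(cleaned_body):
    """Count unique lower-cased words, ignoring spaces and punctuation."""
    # Assumption: a word is a run of letters/digits, with internal
    # apostrophes allowed so that "don't" stays a single word.
    words = re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", cleaned_body.lower())
    return len(set(words))


if __name__ == "__main__":
    print(date_quarter("2016-04-07T18:11:33.793"))  # 2016Q2
    print(vocabulary_size("Although I use Mac, I do not like Mac."))  # 7
```

Using a regular expression instead of str.split(" ") means that “Mac,” and “Mac.” both reduce to “mac”, which matches the worked example in the method description.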
When instantiating this class with the data row from the XML file "data.xml" as input, e.g.,