2.4 Task 3: Analyzing the File for Data Visualization
In this final task, building on the class defined in Section 2.3 (Task 2), you will implement two
functions that visualise the statistics as graphs. Both functions should make use of external
Python packages, including NumPy, SciPy, Pandas, and/or Matplotlib, to create suitable graphs
for comparing the statistics collected for posts.
The two functions should meet the following requirements:
• visualizeVocabularySizeDistribution(inputFile, outputImage):
Given the input file “data.xml”, you should count the vocabulary size of each post. Then
you should draw a bar chart in Python to visualize the distribution of vocabulary sizes
across all posts. The x-axis is the vocabulary size, and the y-axis is the number of posts
whose vocabulary size falls in each interval. Note that on the x-axis the interval width
is 10, and any vocabulary size greater than or equal to 100 should be put into “others”,
i.e., 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-100, others (each
interval is left inclusive). You should save your figure as a png file named
“vocabularySizeDistribution.png”.
• visualizePostNumberTrend(inputFile, outputImage):
This function displays the trend in the number of posts on the Q&A site. Given the input
file “data.xml”, you should first get the number of questions and answers in each quarter.
Then, in chronological order, you should draw a line chart showing the number of posts
in each quarter. Note that you should draw two lines, one for the number of questions
and one for the number of answers, and add a legend to the figure to show which line
corresponds to which type of post. You should save your figure as a png file named
“postNumberTrend.png”.
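The binning rule for the vocabulary size chart (left-inclusive intervals of width 10, with everything from 100 upward collected in “others”) can be sketched as follows. This is only an illustration under stated assumptions: the helper names bin_vocabulary_sizes and plot_vocabulary_distribution are hypothetical, and the list of per-post vocabulary sizes is assumed to have already been obtained from the Task 2 class, whose interface is not shown here.

```python
def bin_vocabulary_sizes(sizes, width=10, cap=100):
    """Count posts per left-inclusive bin [0, 10), [10, 20), ..., plus "others".

    `sizes` is assumed to be an iterable of non-negative integers,
    one vocabulary size per post (hypothetical input shape).
    """
    labels = [f"{lo}-{lo + width}" for lo in range(0, cap, width)] + ["others"]
    counts = [0] * len(labels)
    for size in sizes:
        # Sizes at or above the cap go into the final "others" bucket.
        index = size // width if size < cap else len(labels) - 1
        counts[index] += 1
    return labels, counts


def plot_vocabulary_distribution(sizes, output_image):
    """Draw the bar chart and save it to `output_image`."""
    # Imported here so the binning helper can be used without Matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, no display needed
    import matplotlib.pyplot as plt

    labels, counts = bin_vocabulary_sizes(sizes)
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(labels, counts)  # string labels are treated as categories
    ax.set_xlabel("Vocabulary size")
    ax.set_ylabel("Number of posts")
    ax.set_title("Vocabulary size distribution")
    fig.tight_layout()
    fig.savefig(output_image)
    plt.close(fig)
```

With this rule a post with vocabulary size 10 is counted in “10-20” rather than “0-10”, and a post with size 100 is counted in “others”. Calling plot_vocabulary_distribution(sizes, "vocabularySizeDistribution.png") would then write the figure.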
Note: Please import the class defined in Section 2.3 (Task 2). Apart from defining these
two functions, you should also call them to produce the png files. You should put your
code for this final task into the template file “dataVisualization_studentID.py”,
replacing “studentID” in the file name with your own ID.
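For the post number trend, the step that most often needs care is grouping posts by quarter and keeping the quarters in time order. The sketch below shows one possible way to do it. It assumes, purely for illustration, that each post is available as a (timestamp, post_type) pair with an ISO-8601 timestamp string and the labels "question" and "answer"; the actual fields in “data.xml” and the Task 2 class may differ, and all function names here are hypothetical.

```python
from collections import Counter
from datetime import datetime


def quarter_label(timestamp):
    """Map an ISO-8601 timestamp string to a label such as '2015Q3'."""
    moment = datetime.fromisoformat(timestamp)
    return f"{moment.year}Q{(moment.month - 1) // 3 + 1}"


def count_posts_per_quarter(posts):
    """Return (quarters, question_counts, answer_counts) in time order.

    `posts` is assumed to be an iterable of (timestamp, post_type) pairs.
    """
    questions, answers = Counter(), Counter()
    for timestamp, post_type in posts:
        label = quarter_label(timestamp)
        if post_type == "question":
            questions[label] += 1
        elif post_type == "answer":
            answers[label] += 1
    # Labels like '2015Q3' sort chronologically as plain strings.
    quarters = sorted(set(questions) | set(answers))
    return (quarters,
            [questions[q] for q in quarters],
            [answers[q] for q in quarters])


def plot_post_number_trend(posts, output_image):
    """Draw one line per post type, add a legend, and save the figure."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, no display needed
    import matplotlib.pyplot as plt

    quarters, question_counts, answer_counts = count_posts_per_quarter(posts)
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(quarters, question_counts, marker="o", label="Questions")
    ax.plot(quarters, answer_counts, marker="o", label="Answers")
    ax.set_xlabel("Quarter")
    ax.set_ylabel("Number of posts")
    ax.tick_params(axis="x", labelrotation=45)
    ax.legend()
    fig.tight_layout()
    fig.savefig(output_image)
    plt.close(fig)
```

Note that a quarter with no posts of either type does not appear on the x-axis in this sketch; whether such quarters should be shown with a count of zero is a design choice left open here. Both plotting functions would be called at the end of the script, for example inside an `if __name__ == "__main__":` block, so that running the file produces the png output.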
© 2019, Faculty of IT, Monash University