
FIT5196-S1-2025 assessment 1 (35%)

This is a group assessment and is worth 35% of your total mark for FIT5196.

Due date: 11:55 PM, Friday, 11 April 2025

Text documents, such as those derived from web crawling, typically consist of topically coherent content. Within each segment of topically coherent data, word usage exhibits more consistent lexical distributions than in the dataset as a whole. For text analysis tasks such as passage retrieval in information retrieval (IR), document summarization, recommender systems, and learning-to-rank methods, a linear partitioning of texts into topic segments is effective. In this assessment, your group is required to successfully complete all five tasks listed below to achieve full marks.

Task 1: Parsing Raw Files (7/35)

This task touches on the very first step of analysing textual data, i.e., extracting data from semi-structured text files.

Allowed libraries: re, json, pandas, datetime, os

Input Files:

●   group<group_number>.txt

●   group<group_number>.xlsx

(all input files are in the student_group<group_number> zip file)

Output Files (submission):

●   task1_<group_number>.json

●   task1_<group_number>.csv

●   task1_<group_number>.ipynb

●   task1_<group_number>.py

(the <group_number> has 0 padding, i.e. 001, 010, …)

Your group is provided with Amazon product review data (ratings, text, images, etc.). Please use the input data files matching your group number, i.e. student_group<group_number>.zip, in the Google Drive folder (student_data).

Note: Using the wrong input dataset will result in ZERO marks for ‘Output’ as per the A1 marking rubric. Please double check that you have the correct input data files!

Your dataset is a modified version of Amazon Product Reviews. Each review is encapsulated in a record that contains 11 attributes.  Please check with the sample input files (sample_input) for all the available attributes.

Your task is to extract the data from all of your input files (15 files in total: 1 excel file and 14 text (txt) files following a mis-structured xml format). You are asked to extract the data and transform it into a csv file and a JSON file. The format requirements are listed as follows.

For the csv file, you are required to produce an output with the following columns (a pandas sketch follows this list):

parent_product_id: Parent ID of the product. Output format: string

review_count: the number of total reviews for a parent product ID. Output format: int

review_text_count: the number of reviews that contain text (excluding 'none'). Output format: int
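For illustration, here is a minimal pandas sketch for producing these three columns, assuming the reviews have already been parsed into a DataFrame named reviews_df (a hypothetical name) with one row per review and empty review texts already replaced by 'none':

import pandas as pd

# Hypothetical: reviews_df has one row per review with columns
# 'parent_product_id' and 'review_text' (empty texts already set to 'none').
def build_task1_csv(reviews_df: pd.DataFrame, out_path: str) -> pd.DataFrame:
    summary = (
        reviews_df.groupby("parent_product_id")
        .agg(
            review_count=("review_text", "size"),  # every review counts
            review_text_count=("review_text", lambda s: int((s != "none").sum())),
        )
        .reset_index()
    )
    summary["parent_product_id"] = summary["parent_product_id"].astype(str)
    summary.to_csv(out_path, index=False, encoding="utf-8")
    return summary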

For the JSON file, you are required to produce an output file with the following fields (All fields in the output file should be of type ‘string’):

parent_product_id

reviews: a root element with one or more reviews, containing the fields:

category - Category of the product

reviewer_id - ID of the reviewer

rating - Rating of the product

review_title - Title of the user review

review_text - Text body of the user review.

attached_images - Images that users post after they have received the product

product_id - ID of the product

review_timestamp - Time of the review (unix time). Output format: UTC time as a string in 'YYYY-MM-DD HH:MM:SS' (a conversion sketch follows this list)

is_verified_purchase - User purchase verification

helpful_votes - Helpful votes of the review
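For illustration, here is a minimal sketch of the timestamp conversion and of serialising one record with every field as a string, using only the allowed libraries. The record values are made up, and the milliseconds heuristic is an assumption; verify the unit of review_timestamp against your own input files:

import json
from datetime import datetime, timezone

def to_utc_string(unix_ts):
    # Assumption: values above ~1e11 are milliseconds, not seconds;
    # confirm the unit against your input data.
    ts = float(unix_ts)
    if ts > 1e11:
        ts /= 1000.0
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Hypothetical record; field names follow the list above, all values as strings.
record = {
    "parent_product_id": "B0EXAMPLE",
    "reviews": [{
        "category": "toys",
        "reviewer_id": "A1EXAMPLE",
        "rating": "5.0",
        "review_title": "great",
        "review_text": "works as described",
        "attached_images": "none",
        "product_id": "B0EXAMPLE1",
        "review_timestamp": to_utc_string(1617181723000),
        "is_verified_purchase": "true",
        "helpful_votes": "3",
    }],
}
print(json.dumps(record, ensure_ascii=False, indent=2))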

VERY IMPORTANT NOTE:

1.  All the tag names are case-sensitive in the output json file. You can refer to the sample output for the correct json file structure. Your output csv and JSON files MUST follow the above attribute lists to avoid losing marks.

2.   The sample output files are only meant to help you understand the structure of the required output; the correctness of their content for task 1 is not guaranteed. Please do not try to reverse engineer the outputs, as that will not generate the correct content.

Task 1 Guidelines

To complete the above task, please follow the steps below:

Step 0: Study the sample files

●   Open and check your input .txt files and try to find any ‘potentially interesting’ patterns for different data elements

Step 1: Txt file parsing and excel file parsing

●    Load the input files

●    Use regular expression (Regex) to extract the required attributes and their values as listed from the txt files

●    Extract necessary data from the excel file

●    Combine all data together (a minimal parsing sketch is given below)
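As a starting point, here is a minimal regex sketch for pulling records out of one of the txt files. The <review> tag name and the assumption of matching open/close tag pairs are illustrative only; since the files follow a mis-structured xml format, study your own inputs and adapt the patterns to whatever is actually broken in them:

import re

def parse_txt(path):
    """Sketch only: extract field/value pairs from each <review>...</review> block."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    records = []
    # Non-greedy match so each review block is captured separately.
    for block in re.findall(r"<review>(.*?)</review>", raw, flags=re.DOTALL):
        # <(\w+)>(.*?)</\1> pairs each opening tag with its matching closing tag.
        fields = dict(re.findall(r"<(\w+)>(.*?)</\1>", block, flags=re.DOTALL))
        records.append(fields)
    return records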

Step 2: Further process the extracted text from Step 1

●    Remove any duplicates

●    Replace empty values with ‘none’ across all variables

●    Convert all text to lowercase

●    Further process the extracted data

●    Note for review_texts: they must be transformed into lowercase, with no HTML tags, no emojis, only valid UTF-8 characters, and be entirely in English. To ensure this (see the cleaning sketch after this list):

○    To remove emojis, make sure your text data is in UTF-8 format

○    Remove all HTML tags while keeping the content intact

○    Remove all emoji symbols and non-UTF-8 characters, including unreadable symbols (e.g., ◆, □) and invalid Unicode sequences

○    If a review text does not contain enough English letters (this determination is based on the proportion of English letters in the text, with a minimum threshold of 1), it must be labelled as 'none'
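Here is a minimal cleaning sketch for the rules above. The ASCII round-trip (which drops every non-ASCII character, including emojis and unreadable symbols) and the 0.5 letter-proportion placeholder are assumptions; the exact proportion rule and the order of the cleaning steps are yours to determine:

import re

HTML_TAG = re.compile(r"<[^>]+>")

def clean_review_text(text):
    text = HTML_TAG.sub(" ", text)  # drop tags, keep the content between them
    # Assumption: removing everything non-ASCII also removes emojis and
    # invalid sequences; adapt if your data keeps valid non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = text.lower().strip()
    letters = sum(ch.isalpha() for ch in text)
    # Placeholder 'enough English letters' check based on letter proportion.
    if not text or letters / len(text) < 0.5:
        return "none"
    return text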

Step 3: file output

●   Output the required files based on the specified structures provided above; make sure your data is UTF-8 encoded.

Submission Requirements

You need to submit the following four files:

●   A task1_<group_number>.json file containing the correct review information with all the elements listed above.

●   A task1_<group_number>.csv file containing the correct review information with all the elements listed above.

●   A Python notebook named task1_<group_number>.ipynb containing a well-documented report that demonstrates your solution to Task 1. You need to present the methodology clearly, i.e., including the entire step-by-step process of your solution with appropriate comments and explanations. You should follow the suggested steps in the guideline above. Please keep this notebook easy to read. You will lose marks if it is hard to read and understand (make sure you PRINT OUT your cell output).

●   A task1_<group_number>.py file. This file will be used for plagiarism check (make sure you clear your cell output before exporting).

In Google Colab:

Requirements on the Python notebook (report)

●    Methodology - 35%

○   You need to demonstrate your solution using correct regular expressions. Results from each step would help to demonstrate your solution better and be easier to understand.

○   You should present your solution in a proper way including all the required steps. Skipping any steps will cause a penalty on marks/grades.

○   You need to select and  use the appropriate Python functions for input, process and output.

○   Your solution should be computationally efficient without redundant operations, and without unnecessary data (read and write) operations.

●    Report organisation and writing - 15%

○   The report should have a clear, well-organised structure that allows you to present your Task 1 solutions. Make sure you include clear and meaningful titles for sections (or subsections/sub-subsections) if needed.

○    Each step in your solution should be clearly described. For example, you should explain your solution idea, any specific settings, and the reasons for using any particular functions, etc.

○    Explanation of your results, including all the intermediate steps, is required. This can help the marking team to understand your solution and give partial marks even if the final results are not fully correct.

○   All your code needs to be properly commented. Try to focus on writing concise and precise comments (but not excessive, lengthy, and inaccurate paragraphs).

○   You can refer to the notebook templates provided as a guideline for a properly formatted notebook report.

Task 2: Text Pre-Processing (10/35)

This task involves the next step  in textual data analysis: converting extracted text into a numerical representation for downstream modelling tasks. You are required to write Python code to preprocess Amazon product reviews text from Task 1 and transform it into numerical representations. These numerical representations are the standard format for text data, suitable for input into NLP systems such as recommender systems,  information retrieval algorithms, and machine translation. The most fundamental step in natural language processing (NLP) tasks is converting words into numbers to enable machines to understand and decode patterns within a language. This step, although iterative, is crucial in determining the features for your machine learning models and algorithms.

Allowed libraries: ALL

Input Files:

●   task1_<group_number>.json

Output Files (submission):

●   <group_number>_vocab.txt

●   <group_number>_countvec.txt

●   task2_<group_number>.ipynb

●   task2_<group_number>.py

In this task you are required to continue working with the data from Task 1.

You are asked to use the review text from all reviews of parent products that have at least 50 text reviews. Then pre-process the review text and generate a vocabulary list and a numerical representation for the corresponding text, which will be used in model training by your colleagues. The information regarding the output files is listed below:

<group_number>_vocab.txt comprises unique stemmed tokens sorted alphabetically, presented in the format token:token_index

<group_number>_countvec.txt includes numerical representations of all tokens, organised by parent_product_id and token index, following the format parent_product_id, token_index:frequency

Carefully examine the provided sample output files for detailed information about the output structure. For further details, please refer to the subsequent sections.

VERY IMPORTANT NOTE: The sample outputs are only meant to help you understand the structure of the required output; the correctness of their content for task 2 is not guaranteed. Please do not try to reverse engineer the outputs, as that will not generate the correct content.

Task 2 Guideline

To complete the above task, please follow the steps below:

Step 1:  Text extraction

You are required to extract the review text from the output of task 1.

●   You are only required to generate the vocab and countvec outputs for reviews from parent products that have at least 50 text reviews (excluding 'none'); a selection sketch follows below
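For illustration, here is a small sketch that selects the eligible review texts from the Task 1 JSON. The top-level structure (assumed here to be a list of parent-product objects with the fields described in Task 1) and the placeholder group number are assumptions:

import json

with open("task1_001.json", encoding="utf-8") as f:  # '001' is a placeholder
    data = json.load(f)

eligible = {}  # parent_product_id -> list of non-'none' review texts
for product in data:
    texts = [r["review_text"] for r in product["reviews"]
             if r["review_text"] != "none"]
    if len(texts) >= 50:
        eligible[product["parent_product_id"]] = texts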

Step 2: Generate the unigram and bigram lists and output as vocab.txt

●    The following steps must be performed (not necessarily in the same order) to complete the assessment. Please note that the order of preprocessing matters and will result in a different vocabulary and hence different count vectors. It is part of the assessment to figure out the correct order of preprocessing that makes the most sense, as we learned in the unit. You are encouraged to ask questions and discuss them with the teaching team if in doubt.

a.   The word tokenization must use the following regular expression, "[a-zA-Z]+"

b.   The context-independent and context-dependent stopwords must be removed from the vocabulary.

■    For context-independent stopwords, the provided stop words list (i.e., stopwords_en.txt) must be used.

■    For context-dependent stopwords, you must set the threshold to words that appear in more than 95% of the parent products that have at least 50 text reviews.

c.   Tokens should be stemmed using the Porter stemmer.

d.   Rare tokens must be removed from the vocab (with the threshold set to words that appear in less than 5% of the parent products that have at least 50 text reviews).

e.   Tokens with a length less than 3 should be removed from the vocab.

f.   The first 200 meaningful bigrams (i.e., collocations) must be included in the vocab using the PMI measure; make sure the collocations are collocated within the same review.

g.   Calculate the vocabulary containing both unigrams and bigrams.

●    Combine the unigrams and bigrams, sort the list alphabetically in ascending order, and output as vocab.txt (a vocabulary-building sketch is given below)
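The sketch below strings steps a-g together with NLTK, reusing the eligible mapping from the Step 1 sketch. The processing order shown, the word1_word2 bigram format, and applying the length filter after stemming are all assumptions; figuring out the correct order is explicitly part of the assessment:

from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokenizer = RegexpTokenizer(r"[a-zA-Z]+")
stemmer = PorterStemmer()

# Tokenise each review separately so bigrams never cross review boundaries.
tokenized = {pid: [tokenizer.tokenize(t.lower()) for t in texts]
             for pid, texts in eligible.items()}

with open("stopwords_en.txt", encoding="utf-8") as f:
    stop_words = set(f.read().split())

# Document frequency, where each parent product counts as one document.
n_products = len(tokenized)
df = {}
for reviews in tokenized.values():
    for tok in {t for review in reviews for t in review}:
        df[tok] = df.get(tok, 0) + 1
context_dep = {t for t, c in df.items() if c / n_products > 0.95}  # step b
rare = {t for t, c in df.items() if c / n_products < 0.05}         # step d

# Step f: top-200 bigrams by PMI; from_documents keeps review boundaries.
all_reviews = [r for reviews in tokenized.values() for r in reviews]
finder = BigramCollocationFinder.from_documents(all_reviews)
bigrams = finder.nbest(BigramAssocMeasures().pmi, 200)

# Steps b-e on unigrams, then step g: merge, sort, write.
unigrams = {stemmer.stem(t) for t in df
            if t not in stop_words and t not in context_dep and t not in rare}
unigrams = {t for t in unigrams if len(t) >= 3}                    # step e
vocab = sorted(unigrams | {"_".join(b) for b in bigrams})
with open("001_vocab.txt", "w", encoding="utf-8") as f:            # placeholder name
    for i, tok in enumerate(vocab):
        f.write(f"{tok}:{i}\n")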

Step 3: Generate the sparse numerical representation and output as countvec.txt

1.   Generate the sparse representation by using sklearn's CountVectorizer() OR by directly counting frequencies with NLTK's FreqDist().

2.   Output the sparse numerical representation into a txt file with the following format:

parent_product_id1,token1_index:token1_frequency, token2_index:token2_frequency, token3_index:token3_frequency,

parent_product_id2,token2_index:token2_frequency, token5_index:token5_frequency, token7_index:token7_frequency,

parent_product_id3,token6_index:token6_frequency, token9_index:token9_frequency, token12_index:token12_frequency,

Note: the token_index comes from vocab.txt, and make sure you are counting bigrams as well (a counting sketch follows below).
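Here is a counting sketch matching that format, reusing tokenized, vocab and stemmer from the vocabulary sketch above. Matching unigrams on stems and bigrams on adjacent raw-token pairs mirrors how that vocabulary was built; adapt this to your own pipeline:

from collections import Counter

index = {tok: i for i, tok in enumerate(vocab)}

with open("001_countvec.txt", "w", encoding="utf-8") as f:  # placeholder name
    for pid, reviews in tokenized.items():
        counts = Counter()
        for review in reviews:
            stems = [stemmer.stem(t) for t in review]
            counts.update(s for s in stems if s in index)           # unigrams
            counts.update("_".join(p) for p in zip(review, review[1:])
                          if "_".join(p) in index)                  # bigrams
        pairs = sorted((index[tok], n) for tok, n in counts.items())
        # Zero-count tokens never enter `counts`, so they are not written out.
        f.write(pid + "," + ",".join(f"{i}:{n}" for i, n in pairs) + "\n")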

Submission Requirements

You need to submit the following four files:

1.  A <group_number>_vocab.txt that contains the unigram and bigram tokens in the following format: token:token_index. Words in the vocabulary must be sorted in alphabetical order.

2.  A <group_number>_countvec.txt file, in which each line contains the sparse representation of one parent product id in the following format:

parent_product_id1,token1_index:token1_frequency,token2_index:token2_frequency, token3_index:token3_frequency,

Please note: the tokens with zero word count should NOT be included in the sparse representation.

3.  A task2_<group_number>.ipynb file that contains your report explaining the code and the methodology. (make sure you PRINT OUT your cell outputs)

4.  A task2_<group_number>.py file for plagiarism checks. (make sure you clear your cell outputs)

Requirements on the Python notebook (report)

●    Methodology - 35%

○   You need to demonstrate your solution using correct regular expressions.

○   You should present your solution in a proper way, including all required steps.

○   You need to select and use the appropriate Python functions for input, process and output.

○   Your solution should be computationally efficient, without redundant operations and unnecessary data read/write operations.

●    Report organisation and writing - 15%

○   The report should have a clear, well-organised structure that allows you to present your Task 2 solutions. Make sure you include clear and meaningful titles for sections (or subsections/sub-subsections) if needed.

○    Each step in your solution should be clearly described. For example, you should explain your solution idea, any specific settings, and the reasons for using any particular functions, etc.

○    Explanation of your results, including all the intermediate steps, is required. This can help the marking team to understand your solution and give partial marks even if the final results are not fully correct.

○   All your code needs to be properly commented. Try to focus on writing concise and precise comments (but not excessive, lengthy, and inaccurate paragraphs).

○   You can  refer to the notebook templates provided as a guideline for a properly formatted notebook report.

Task 3: Data Exploratory Analysis (15/35)

In this task, you are asked to conduct a comprehensive exploratory data analysis (EDA) on the provided Amazon product review data. The goal is to uncover interesting insights that can be useful for further analysis or decision-making.

Allowed libraries: ALL

Input Files:

●   task1_<group_number>.json

●   task1_<group_number>.csv

Output Files (submission):

●   task3_<group_number>.ipynb

●   task3_<group_number>.py

Task 3 Guideline

To complete the above task, please follow the steps below:

Step 1: Understand the Amazon product review data:

●    Review and try to understand the data.

●    Summarise the key features and variables included in the dataset.

●    Identify any initial patterns and trends

Step 2: Data Analysis:

●    Perform an exploratory data analysis to investigate and uncover interesting insights.

●   You are required to investigate and present at least 5 insights from your data analysis.

Example of a basic insight

Question: What is the distribution of ratings in the selected category?

●   Visualisation: A simple bar chart showing the percentage of each rating (1 to 5 stars) in the dataset.

●    Interpretation: This reveals general user satisfaction trends for products in this category … which means there could be potential chances to improve profits by … . For future suggestions, the owner could …

(A minimal plotting sketch for this example is given below.)
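As a minimal illustration of this example insight, assuming the Task 1 data has been loaded into a DataFrame named reviews_df (a hypothetical name) with a numeric rating column:

import matplotlib.pyplot as plt

# Percentage of reviews at each star level (1 to 5).
pct = (reviews_df["rating"].astype(float)
       .value_counts(normalize=True)
       .sort_index() * 100)
ax = pct.plot(kind="bar")
ax.set_xlabel("Rating (stars)")
ax.set_ylabel("% of reviews")
ax.set_title("Distribution of ratings")
plt.tight_layout()
plt.show()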

You are strongly recommended to read the detailed grading guidelines in the marking rubric.

Submission Requirements

You need to submit 2 files:

1.  A task3_<group_number>.ipynb file that contains your report explaining the code and the methodology. (make sure you PRINT OUT your cell outputs)

2.  A task3_<group_number>.py file for plagiarism check. (make sure you clear your cell outputs)

Task 4: Video presentation for Task 3 (2/35)

Create a video presentation (5-8 minutes) to effectively communicate the findings from your exploratory data analysis (EDA) on the Amazon product review data. The goal is to present your methodology and insights in a clear, concise, and engaging manner.

Output Files (submission)

task4_<group_number>.mp4

Submission Requirements

Here are the key components you need to include in your submission:

Introduction:

●    Please briefly introduce yourself, including your student ID, and provide context for the analysis.

●    Explain the purpose of the EDA and the datasets used (Amazon product review data).

Methodology:

●    Describe the steps taken during the data analysis process.

Insights:

●    Present at least 5 insights uncovered from the analysis.

●    Use visual aids such as charts, graphs, or tables to support your insights.

●    Explain the significance of each insight and how it can be applied or interpreted.

Conclusion:

●    Summarise the key findings and their potential implications.

●    Discuss any limitations of the analysis and suggest areas for further research.

