MKT 566 – Fall 2025
Homework 4: Predicting Yelp Review Ratings
Overview
In this assignment, you will use machine learning models to predict whether a Yelp review rating is greater than 3 (positive) or less than or equal to 3 (negative) based on review text and metadata features.
This project mirrors a real-world marketing analytics task — understanding how customersʼ language and review patterns relate to satisfaction. Youʼll practice data preprocessing, feature engineering, model training, evaluation, and interpretation.
Objective
● Practice end-to-end ML workflow: cleaning, feature engineering, training, and evaluating models.
● Interpret model results and communicate insights clearly.
● Understand the importance of hyperparameter tuning, model comparison, and feature importance.
Dataset
You will receive two files in JSONL (JSON Lines) format:
● train.jsonl: Contains review text, metadata, and ratings (this is the file you will use to train and test your model).
● test_no_stars.jsonl: Contains the same fields but without ratings (this dataset is for the bonus competition only).
Each observation represents a Yelp review and contains:
● Review-level fields: review_id, text, stars, date, useful, funny, cool
● User-level metadata: user_id, user_review_count, user_average_stars, user_fans
● Business-level metadata: business_id, business_name, business_city, business_state, business_stars, business_review_count
Your target variable will be binary: y = 1 if stars > 3, else 0.
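For illustration, here is a minimal Python sketch of loading the training data and constructing this target (assuming pandas; the file name is the one provided above):

import pandas as pd

# Load the labeled reviews: JSONL means one JSON object per line
train = pd.read_json("train.jsonl", lines=True)

# Binary target: 1 if the review has more than 3 stars, 0 otherwise
train["y"] = (train["stars"] > 3).astype(int)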
Tasks
1. Data Preprocessing
● Load and clean the dataset.
● Handle missing values and inconsistent entries.
● Prepare the data for modeling (e.g., tokenize text, encode categorical variables).
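For illustration, a minimal cleaning sketch in Python (continuing from the loading snippet above; the specific cleanup steps are assumptions about what the data may need, not requirements):

# Check for missing values column by column
print(train.isna().sum())

# Illustrative cleanup: fill missing vote counts with 0 and drop reviews without text
for col in ["useful", "funny", "cool"]:
    train[col] = train[col].fillna(0)
train = train.dropna(subset=["text"])

# Normalize a categorical field with potentially inconsistent entries
train["business_state"] = train["business_state"].str.upper().str.strip()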
2. Exploratory Data Analysis (EDA)
Perform EDA to understand the data before modeling.
Your EDA should include:
● At least five visualizations (e.g., distribution of ratings, word frequencies, category breakdowns).
● Summary statistics for key variables.
● Insights: discuss any interesting trends or relationships.
Make sure your code actually generates and displays the figures; do not submit the plotting code alone.
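For illustration, a minimal sketch of two possible plots in Python with matplotlib (these particular plots are just examples, not required figures):

import matplotlib.pyplot as plt

# Distribution of star ratings
train["stars"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("Stars")
plt.ylabel("Number of reviews")
plt.title("Distribution of review ratings")
plt.show()

# Review length by rating
train["review_length"] = train["text"].str.len()
train.boxplot(column="review_length", by="stars")
plt.show()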
3. Feature Engineering
● Create at least ten features, justified by your EDA findings: at least five derived from the text data and at least five from the metadata.
● Possible examples:
○ Average user rating or review count.
○ Text features (review length, sentiment, word embeddings, TF-IDF).
○ Business category or location variables.
● Explain the rationale behind each feature.
● Scale or encode features appropriately.
● Note: feature engineering is crucial for model performance, so be thorough and creative! Using more features is encouraged as long as they are justified.
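For illustration, a minimal feature-engineering sketch in Python (the choice of TF-IDF, the vocabulary size, and the metadata columns are illustrative assumptions, not the required feature set):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack, csr_matrix

# Text features: TF-IDF over the review text (vocabulary capped for speed)
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X_text = tfidf.fit_transform(train["text"])

# A simple hand-crafted text feature: review length in characters
train["review_length"] = train["text"].str.len()

# Metadata features, scaled to comparable ranges
meta_cols = ["user_review_count", "user_average_stars", "user_fans",
             "business_stars", "business_review_count", "review_length"]
scaler = StandardScaler()
X_meta = scaler.fit_transform(train[meta_cols])

# Combine sparse text features with dense metadata features
X = hstack([X_text, csr_matrix(X_meta)]).tocsr()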
4. Train/Test Split
● Split train.jsonl into 80% training and 20% testing.
● Explain why this split is relevant for model evaluation.
● Set a seed for reproducibility.
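For illustration, a minimal split sketch in Python (assuming the feature matrix X and target from the sketches above; the seed value is arbitrary):

from sklearn.model_selection import train_test_split

# 80/20 split with a fixed seed for reproducibility;
# stratifying keeps the positive/negative balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, train["y"], test_size=0.2, random_state=42, stratify=train["y"])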
5. Model Training (optimize for best AUC)
Train at least three classification models, including:
● Logistic Regression (mandatory benchmark)
● Two others (e.g., Random Forest, SVM, Gradient Boosting, XGBoost, etc.)
● As we discussed in class, tuning model parameters is important for improving performance. In R, the caret package can help with this (see the caret documentation for a description of the package). If you are using Python, you can use GridSearchCV from sklearn.model_selection for hyperparameter tuning; a minimal sketch appears after this list.
● Include a short explanation of why tuning matters for model performance.
● Report key parameters used to train the model.
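For illustration, a minimal tuning sketch in Python for the logistic regression benchmark (the parameter grid and cross-validation settings are illustrative assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tune the regularization strength C, scoring on AUC since that is the target metric
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)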
6. Model Performance and Evaluation
Evaluate the model performance on the test set:
● Report Accuracy, Precision, Recall, and AUC (ROC) of each model in a well-structured table. Explain what each metric indicates about model performance.
● Plot the ROC curve for each model and interpret the results.
● Report and discuss feature importance.
● Explain tradeoffs between models (e.g., interpretability vs. performance).
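For illustration, a minimal evaluation sketch in Python (assuming the tuned grid object and test split from the sketches above):

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score, RocCurveDisplay)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))

# ROC curve for this model; repeat for each model you train
RocCurveDisplay.from_predictions(y_test, y_prob)
plt.show()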
Deliverables
Submit the following to Brightspace:
1. Report (PDF): This file must include the figures and tables generated in your analysis, along with explanations and interpretations. The report should be well-organized and clearly written. Use the following structure:
1. Introduction: Describe the problem and dataset.
2. EDA Findings: Key patterns and visualizations.
3. Feature Engineering: The features you created, with descriptions and the rationale behind them.
4. Model Training: Algorithms used, parameter tuning, and justification.
5. Model Evaluation: Results, metrics, and interpretations.
6. Conclusion: Main takeaways and future improvements.
2. Code: This file should contain all the code used for data preprocessing, EDA, feature engineering, model training, and evaluation. Ensure that the code is well-commented and organized for readability. You can use a notebook format as we have been using during the course (Jupyter Notebook for Python or R Markdown for R), or a script format (.py or .R).
IMPORTANT: If you write the report in R Markdown or a Jupyter Notebook, make sure to knit/export it to PDF before submission; a PDF file is required in all cases.
3. Predictions File (for Bonus Competition only): A CSV file with your predictions on the test_no_stars.jsonl dataset, with columns review_id and predicted_probability (see details below).
Bonus Competition (up to +5 points)
After training your models, apply your best-performing model based on AUC to the provided test_no_stars.jsonl dataset (which does not include ratings) and generate predictions for each review.
How It Works
● Each student submits one predictions file named HW4_predictions.csv to Brightspace containing your predictions for test_no_stars.jsonl.
● After the deadline, these predictions will be evaluated against the true labels (which are held out).
● The evaluation metric is AUC (Area Under the ROC Curve), so make sure that your model is optimized for AUC during training and tuning.
● Recall that AUC is a number between 0 and 1, and a higher AUC means your model better distinguishes between positive and negative reviews.
Scoring and Bonus Points
● The top 5 students with the highest AUC scores receive bonus points:
○ 1st place: +5 points
○ 2nd place: +4 points
○ 3rd place: +3 points
○ 4th place: +2 points
○ 5th place: +1 point
● Bonus points are added to your HW4 grade but capped at 100% total.
● If thereʼs a tie, all students tied at that position receive the same bonus value.
Format:
review_id,predicted_probability
12345,0.87
67890,0.12
● If you do not submit this file, you will not be considered for the bonus competition.
● If your predictions file is incorrectly formatted, you will be disqualified from the bonus competition.
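For illustration, a minimal sketch in Python for producing HW4_predictions.csv in the required format (assuming the fitted tfidf, scaler, meta_cols, and best_model objects from the sketches above):

import pandas as pd
from scipy.sparse import hstack, csr_matrix

# Load the unlabeled reviews and apply the transformations fitted on the training data
test = pd.read_json("test_no_stars.jsonl", lines=True)
test["review_length"] = test["text"].str.len()
X_new = hstack([tfidf.transform(test["text"]),
                csr_matrix(scaler.transform(test[meta_cols]))]).tocsr()

# Predicted probability of the positive class (stars > 3)
probs = best_model.predict_proba(X_new)[:, 1]

pd.DataFrame({"review_id": test["review_id"],
              "predicted_probability": probs}).to_csv("HW4_predictions.csv", index=False)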