MKT 566 – Fall 2025
Homework 4: Predicting Yelp Review Ratings
Overview
In this assignment, you will use machine learning models to predict whether a Yelp review rating is greater than 3 (positive) or less than or equal to 3 (negative) based on review text and metadata features.
This project mirrors a real-world marketing analytics task — understanding how customersʼ language and review patterns relate to satisfaction. Youʼll practice data preprocessing, feature engineering, model training, evaluation, and interpretation.
Objective
● Practice end-to-end ML workflow: cleaning, feature engineering, training, and evaluating models.
● Interpret model results and communicate insights clearly.
● Understand the importance of hyperparameter tuning, model comparison, and feature importance.
Dataset
You will receive two files in JSONL (JSON Lines) format:
● train.jsonl: Contains review text, metadata, and ratings (this is the file you will use to train and test your model).
● test_no_stars.jsonl: Contains the same fields but without ratings (this dataset is for the bonus competition only).
Each observation represents a Yelp review and contains:
● Review-level fields: review_id, text, stars, date, useful, funny, cool
● User-level metadata: user_id, user_review_count, user_average_stars, user_fans
● Business-level metadata: business_id, business_name, business_city, business_state, business_stars, business_review_count
Your target variable will be binary: y = 1 if stars > 3, else 0.
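For illustration, here is a minimal Python sketch of loading the training data and constructing this target (assuming pandas; the file name is the one provided above):

import pandas as pd

# Load the labeled reviews: JSONL means one JSON object per line
train = pd.read_json("train.jsonl", lines=True)

# Binary target: 1 if the review has more than 3 stars, 0 otherwise
train["y"] = (train["stars"] > 3).astype(int)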
Tasks
1. Data Preprocessing
● Load and clean the dataset.
● Handle missing values and inconsistent entries.
● Prepare the data for modeling (e.g., tokenize text, encode categorical variables).
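For illustration, a minimal cleaning sketch in Python (continuing from the loading snippet above; the specific cleanup steps are assumptions about what the data may need, not requirements):

# Check for missing values column by column
print(train.isna().sum())

# Illustrative cleanup: fill missing vote counts with 0 and drop reviews without text
for col in ["useful", "funny", "cool"]:
    train[col] = train[col].fillna(0)
train = train.dropna(subset=["text"])

# Normalize a categorical field with potentially inconsistent entries
train["business_state"] = train["business_state"].str.upper().str.strip()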
2. Exploratory Data Analysis (EDA)
Perform EDA to understand the data before modeling.
Your EDA should include:
● At least five visualizations (e.g., distribution of ratings, word frequencies, category breakdowns).
● Summary statistics for key variables.
● Insights: discuss any interesting trends or relationships.
Make sure your code actually generates and displays the figures; do not submit the plotting code alone.
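For illustration, a minimal sketch of two possible plots in Python with matplotlib (these particular plots are just examples, not required figures):

import matplotlib.pyplot as plt

# Distribution of star ratings
train["stars"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("Stars")
plt.ylabel("Number of reviews")
plt.title("Distribution of review ratings")
plt.show()

# Review length by rating
train["review_length"] = train["text"].str.len()
train.boxplot(column="review_length", by="stars")
plt.show()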
3. Feature Engineering
● Create at least ten features, justified by your EDA findings: at least five derived from the text data and at least five from the metadata.
● Possible examples:
○ Average user rating or review count.
○ Text features (review length, sentiment, word embeddings, TF-IDF).
○ Business category or location variables.
● Explain the rationale behind each feature.
● Scale or encode features appropriately.
● Note: feature engineering is crucial for model performance, so be thorough and creative! Using more features is encouraged as long as they are justified.
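For illustration, a minimal feature-engineering sketch in Python (the choice of TF-IDF, the vocabulary size, and the metadata columns are illustrative assumptions, not the required feature set):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack, csr_matrix

# Text features: TF-IDF over the review text (vocabulary capped for speed)
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X_text = tfidf.fit_transform(train["text"])

# A simple hand-crafted text feature: review length in characters
train["review_length"] = train["text"].str.len()

# Metadata features, scaled to comparable ranges
meta_cols = ["user_review_count", "user_average_stars", "user_fans",
             "business_stars", "business_review_count", "review_length"]
scaler = StandardScaler()
X_meta = scaler.fit_transform(train[meta_cols])

# Combine sparse text features with dense metadata features
X = hstack([X_text, csr_matrix(X_meta)]).tocsr()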
4. Train/Test Split
● Split train.jsonl into 80% training and 20% testing.
● Explain why this split is relevant for model evaluation.
● Set a seed for reproducibility.
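For illustration, a minimal split sketch in Python (assuming the feature matrix X and target from the sketches above; the seed value is arbitrary):

from sklearn.model_selection import train_test_split

# 80/20 split with a fixed seed for reproducibility;
# stratifying keeps the positive/negative balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, train["y"], test_size=0.2, random_state=42, stratify=train["y"])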
5. Model Training (optimize for best AUC)
Train at least three classification models, including:
● Logistic Regression (mandatory benchmark)
● Two others (e.g., Random Forest, SVM, Gradient Boosting, XGBoost, etc.)
● As we discussed in class, tuning model parameters is important for improving performance. In R, the caret package can help with this (see the caret documentation for a description of the package). If you are using Python, you can use GridSearchCV from sklearn.model_selection for hyperparameter tuning; a minimal sketch appears after this list.
● Include a short explanation of why tuning matters for model performance.
● Report key parameters used to train the model.
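For illustration, a minimal tuning sketch in Python for the logistic regression benchmark (the parameter grid and cross-validation settings are illustrative assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tune the regularization strength C, scoring on AUC since that is the target metric
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)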
6. Model Performance and Evaluation
Evaluate the model performance on the test set:
● Report Accuracy, Precision, Recall, and AUC (ROC) of each model in a well-structured table. Explain what each metric indicates about model performance.
● Plot the ROC curve for each model and interpret the results.
● Report and discuss feature importance.
● Explain tradeoffs between models (e.g., interpretability vs. performance).
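For illustration, a minimal evaluation sketch in Python (assuming the tuned grid object and test split from the sketches above):

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score, RocCurveDisplay)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))

# ROC curve for this model; repeat for each model you train
RocCurveDisplay.from_predictions(y_test, y_prob)
plt.show()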
Deliverables
Submit the following to Brightspace:
1. Report (PDF): This file must include the figures and tables generated in your analysis, along with explanations and interpretations. The report should be well-organized and clearly written. Use the following structure:
1. Introduction: Describe the problem and dataset.
2. EDA Findings: Key patterns and visualizations.
3. Feature Engineering: The features you created, with descriptions and the rationale behind them.
4. Model Training: Algorithms used, parameter tuning, and justification.
5. Model Evaluation: Results, metrics, and interpretations.
6. Conclusion: Main takeaways and future improvements.
2. Code: This file should contain all the code used for data preprocessing, EDA, feature engineering, model training, and evaluation. Ensure that the code is well-commented and organized for readability. You can use a notebook format as we have been using during the course (Jupyter Notebook for Python or R Markdown for R), or a script format (.py or .R).
IMPORTANT: If you write the report in R Markdown or a Jupyter Notebook, make sure to knit/export it to PDF before submission; a PDF file is required in all cases.
3. Predictions File (for Bonus Competition only): A CSV file with your predictions on the test_no_stars.jsonl dataset, with columns review_id and predicted_probability (see details below).
Bonus Competition (up to +5 points)
After training your models, apply your best-performing model based on AUC to the provided test_no_stars.jsonl dataset (which does not include ratings) and generate predictions for each review.
How It Works
● Each student submits one predictions file named HW4_predictions.csv to Brightspace containing your predictions for test_no_stars.jsonl.
● After the deadline, these predictions will be evaluated against the true labels (which are held out).
● The evaluation metric is AUC (Area Under the ROC Curve), so make sure that your model is optimized for AUC during training and tuning.
● Recall that AUC is a number between 0 and 1, and a higher AUC means your model better distinguishes between positive and negative reviews.
Scoring and Bonus Points
● The top 5 students with the highest AUC scores receive bonus points:
○ 1st place: +5 points
○ 2nd place: +4 points
○ 3rd place: +3 points
○ 4th place: +2 points
○ 5th place: +1 point
● Bonus points are added to your HW4 grade but capped at 100% total.
● If thereʼs a tie, all students tied at that position receive the same bonus value.
Format:
review_id,predicted_probability
12345,0.87
67890,0.12
● If you do not submit this file, you will not be considered for the bonus competition.
● If your predictions file is incorrectly formatted, you will be disqualified from the bonus competition.
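For illustration, a minimal sketch in Python for producing HW4_predictions.csv in the required format (assuming the fitted tfidf, scaler, meta_cols, and best_model objects from the sketches above):

import pandas as pd
from scipy.sparse import hstack, csr_matrix

# Load the unlabeled reviews and apply the transformations fitted on the training data
test = pd.read_json("test_no_stars.jsonl", lines=True)
test["review_length"] = test["text"].str.len()
X_new = hstack([tfidf.transform(test["text"]),
                csr_matrix(scaler.transform(test[meta_cols]))]).tocsr()

# Predicted probability of the positive class (stars > 3)
probs = best_model.predict_proba(X_new)[:, 1]

pd.DataFrame({"review_id": test["review_id"],
              "predicted_probability": probs}).to_csv("HW4_predictions.csv", index=False)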