FOUNDATIONAL BUSINESS ANALYTICS
COURSEWORK 2025-2026
Release Date: 13th Oct 2025
Deadline Date: 11th Dec 2025, 3:00 pm
Submission: via Moodle coursework submission link on the FBA module web page
1. The Problem Definition:
To better protect public health, a major city’s Department of Health and Mental Hygiene has historically inspected restaurants on a fixed timetable. While this ensures broad coverage, it makes public health risk difficult to manage efficiently: truly problematic establishments may not be identified and inspected quickly enough. Traditionally, the department reacts passively to inspection cycles and only takes significant action after a poor inspection result is recorded. This passive strategy creates the potential for serious public health crises, as high-risk establishments may operate unchecked for extended periods between inspections.
To ensure public safety and enhance operational efficiency, the department plans to launch a proactive risk management programme. The goal is to enable the department to “see into the future” and know in advance which restaurant inspection is more likely to result in a significant failure. With the ability to predict a high-risk inspection, the department can intervene at the earliest opportunity, prioritizing resources to prevent potential foodborne illness outbreaks and avoiding public health crises.
Historical inspection data can shed light on the sorts of establishments that are likely to fail hygiene standards and those that are low-risk. Your task, as a consultant, is to analyse the historical dataset and build a model that can be used to predict whether any given inspection will result in a high score.
As well as robustly testing, justifying and unpacking your selected model (guided by the Director’s needs, as detailed below), you must also produce business recommendations: what you think the department should focus on as a result of your investigation. You will submit a formal business report (with a strict maximum of 8 pages and 3,000 words). Additionally, you will submit your model implementation, with instructions on how to use it to test new data (written in Python, Orange3, or some combination; formal specifications are detailed below). Good luck!
2. Important Message from the Director of Food Safety:
“Our mission is to keep the public safe. A critical violation, like improper food temperatures or signs of vermin, can lead to serious foodborne illness outbreaks. We need to find these problems before they hurt people. Our inspectors’ time is our most valuable asset. Sending an inspector to a restaurant that we know is always pristine is an inefficient use of their time. The real failure is when a high-risk restaurant goes uninspected for too long and a health crisis occurs.
Our model must help us find the potential problems early. If we use the model to inspect a restaurant and find nothing wrong, we’ve only lost a few hours of an inspector’s day. But if we fail to inspect a restaurant that’s a ticking time bomb, the consequences—for both public health and public trust—are severe. Your system’s primary goal must be to successfully identify the restaurants that are likely to have serious issues. ”
3. The Available Dataset:
You will be working with a real-world dataset of all restaurant inspections conducted in the City. The data follows the schema below:
4. Formal Task Specification
• You must provide a classification approach to predict which inspections are more likely to result in a high score (SCORE >= 14). This will require a stage of statistical analysis, a stage of model selection, a stage of final model training, and then an analysis of implications. You may use any software you desire for your analysis, but your model must be produced in either Python3 or Orange3 for this coursework (or some combination).
• Your submission will consist of a zip of the files for your model, and a report of a maximum of 8 pages. Your model will be tested on a hidden dataset (with the same schema as the training dataset, but without the “SCORE” and “Y” columns).
• Your report must strictly adhere to the following sections. Please take into account the marks available for each in structuring your submission:
Section A: Summarization [10 marks available]
□ In this section you must provide a summary statistical analysis of the dataset. Consider how each input feature relates to the output variable (“Y”); you may also want to examine how the features relate to each other. Feel free to use tables, bar charts, or scatter graphs depending on the feature; the choice is up to you. Note that the point of this section is to inform rather than to overload your client, so also summarize the key analytical points you have observed in the dataset. An illustrative sketch follows.
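As a purely illustrative starting point (not a required approach), the sketch below assumes the data loads into a pandas DataFrame; the filename and the INSPECTION_DATE and CUISINE column names are assumptions to be adapted to the actual schema, while “Y” and the SCORE >= 14 threshold come from the task specification above.

    import pandas as pd

    # Load the historical inspections (filename is illustrative).
    df = pd.read_csv("inspections.csv", parse_dates=["INSPECTION_DATE"])

    # "Y" marks high-score inspections (SCORE >= 14 per the task spec).
    print(df.describe())       # numeric summaries
    print(df["Y"].mean())      # base rate of high-score inspections

    # Rate of high-score inspections per cuisine type (column assumed).
    print(df.groupby("CUISINE")["Y"]
            .agg(["mean", "count"])
            .sort_values("mean", ascending=False)
            .head(10))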
Section B: Preparation and Feature Engineering [20 marks available]
□ Provide a comprehensive overview of your data preparation process. This should include a detailed description of how you curated and cleaned the dataset, discussing any methods employed to address missing values or outliers.
□ This section is the most critical for your model’s success. You must detail the advanced feature engineering you performed. The raw features alone are insufficient. You must create and justify new features that summarize a restaurant’s history prior to the inspection you are trying to predict. For each new feature, you must elucidate the rationale behind its creation and why you believe it will be valuable for predicting a high inspection score.
□ Examples of the types of feature engineering (you do not need to follow these examples exactly; an illustrative code sketch follows this list):
• Time-based features: time_since_last_inspection, days_since_last_grade_A.
• Historical performance features: average_score_in_last_3_inspections, number_of_past_high_score_failures, max_score_in_previous_year.
• Trend features: is_score_trending_upwards, change_in_score_from_last_inspection.
• Violation history features: count_of_past_critical_violations, frequency_of_specific_violation_codes (e.g., vermin-related codes).
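By way of a hedged illustration (RESTAURANT_ID and INSPECTION_DATE are assumed column names; adapt them to the real schema), the pandas sketch below shows how such history features can be computed so that each row only sees inspections strictly before it:

    import pandas as pd

    df = pd.read_csv("inspections.csv", parse_dates=["INSPECTION_DATE"])
    df = df.sort_values(["RESTAURANT_ID", "INSPECTION_DATE"])
    g = df.groupby("RESTAURANT_ID")

    # Days since the same restaurant's previous inspection.
    df["time_since_last_inspection"] = g["INSPECTION_DATE"].diff().dt.days

    # Mean of the previous three scores; shift(1) excludes the current
    # inspection, so only information known beforehand is used.
    df["average_score_in_last_3_inspections"] = g["SCORE"].transform(
        lambda s: s.shift(1).rolling(3, min_periods=1).mean())

    # Change between the two most recent *past* scores.
    df["change_in_score_from_last_inspection"] = g["SCORE"].transform(
        lambda s: s.shift(1).diff())

The shift(1) calls are the leakage guard: dropping them would silently let each inspection describe itself.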
Section C: Model Evaluation [20 marks available]
□ Select at least 3 different classification model classes (selecting only from those we cover in FBA lectures: Logistic Regression, Decision Trees, Random Forests, Naive Bayes Classifier and k-nearest neighbours), and assess their effectiveness in modelling your historical training dataset.
□ In your report, detail the models selected to test and why they were chosen. Detail the parameterizations you chose for each model, explaining why you have chosen the parameters that you have.
□ Describe the evaluation strategy you chose to compare models to each other. Your evaluation strategy MUST be based on a chronological, time-based split of the data. For example, you might train your model on all inspections occurring before January 1st, 2023, and test it on all inspections occurring after that date. A random shuffle-and-split of the data is inappropriate for this problem and will be penalized, as it does not reflect the real-world task of predicting future events and leads to data leakage.
□ Analyze the performance of each model using confusion matrices. Crucially, you must justify your choice of performance metric (e.g., Accuracy, Precision, Recall, F1-score) by explicitly linking it to the Director’s message in Section 2. A minimal illustration of a chronological split and metric choice follows this list.
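A minimal sketch of such an evaluation, reusing the illustrative features and column names from the Section B example (the cut-off date is the example given above):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, recall_score

    # Chronological split: train on the past, test on the "future".
    cutoff = pd.Timestamp("2023-01-01")
    features = ["time_since_last_inspection",
                "average_score_in_last_3_inspections"]
    train = df[df["INSPECTION_DATE"] < cutoff]
    test = df[df["INSPECTION_DATE"] >= cutoff]

    model = LogisticRegression()
    model.fit(train[features].fillna(0), train["Y"])
    pred = model.predict(test[features].fillna(0))

    # Given the Director's message, recall on the high-score class is a
    # natural headline metric: a missed risky restaurant costs far more
    # than a wasted inspection visit.
    print(confusion_matrix(test["Y"], pred))
    print("Recall:", recall_score(test["Y"], pred))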
Section D: Final Assessment [5 marks available]
□ Given the analysis in Section C, justify a ‘winning’ classifier and explain why you have selected it for your final model, paying close attention to the business case in your consideration of how success is measured.
Section E: Model Implementation [5 marks available for write-up]
□ Having selected the single best-performing model, train it on the whole training dataset, ready for deployment. This section should specify that choice and briefly describe the resulting code/project files attached to your submission. In particular, it should supply brief instructions on how the recipient can use your submitted model code/files to process a new test set and generate predictions; a minimal sketch of one possible shape follows below.
□ N.b., marks awarded here are only for your write-up/instructions; more marks are available for the model’s implementation code/files themselves (see “Further Available Marks”).
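One possible shape for the deployment step, sketched with joblib (the filenames and the build_features helper are hypothetical placeholders for your own pipeline):

    import joblib
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Train the winning model on the FULL historical dataset
    # (df and features as in the earlier sketches), then persist it.
    final_model = LogisticRegression()
    final_model.fit(df[features].fillna(0), df["Y"])
    joblib.dump(final_model, "final_model.joblib")

    # Your instructions should then tell the recipient to reload the
    # model and apply the SAME feature engineering to the new data:
    model = joblib.load("final_model.joblib")
    # new_df = build_features(pd.read_csv("new_inspections.csv"))
    # predictions = model.predict(new_df[features].fillna(0))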
Section F: Business Case Recommendations [10 marks available]
□ Summarize the business case for the Department of Health. Provide actionable recommendations based on your model and analysis. For example, what types of cuisines or neighborhoods warrant more attention? Could inspection schedules be dynamically allocated based on your model’s risk scores?
5. Further Available Marks:
□ Overall presentation of your report, its argument, and professionalism → [5 marks available]
□ The standard of your submitted Evaluation/Final modelling code/workflows. It is expected that this code/workflow also supplies some means for the user to load new data (in the same format as your supplied dataset) and make new predictions. → [20 marks available]
□ The effectiveness of your model as assessed against our held-back test dataset → [5 marks available]
Note that the models you submit will be tested on another external dataset that I have held out (and which you will not have access to, reflecting the fact that it represents “future” inspections). Thus, as well as receiving marks for your report, your model implementation, and how well you have tested, evaluated, and justified its construction, there are also additional marks for how well it predicts our hidden test set!
6. Submission
→ In your submission, please submit a zip of the following files:
1. Your Final Report (maximum 8 pages excluding the front page. Please indicate your word count at the end of your report).
2. Your Evaluation Code / Workflow files and Final Model Code / Workflow
→ Submissions must be made via the Moodle submission link
→ Submission must be received by:
11th Dec 2025, 3:00 pm
Potential Penalties:
→ Late submissions will lose 5% of their final mark per day.
→ Submitted reports over 8 pages will be received, but only the first 8 pages will be assessed. This is a strict rule.
7. Final Important Note on Plagiarism
→ While everyone is provided with the same foundational dataset, your model performance is expected to vary. These differences will primarily stem from your unique feature engineering strategies, which are a key component of this assessment.
→ This means that the set of features you design, create, and justify in your Section B table is a direct reflection of your individual analytical approach. It is expected to be unique to you. Consequently, it will be a primary point of scrutiny for academic integrity.
→ Submissions with identical or suspiciously similar sets of engineered features, justifications, or implementation logic will be considered strong evidence of plagiarism.
→ All code and workflows will also be examined to ensure there is no repetition between submissions. While you are able to discuss high-level ideas and strategies with peers, the final implementation and analysis must be 100% your own individual work. Any plagiarised work will immediately receive zero marks and be reported to the School.
8. Some Additional Tips!
• Throughout this coursework, showing your thought process and your understanding of how to assess a model in light of the business case is more important than the final predictive test result.
• AVOID DATA LEAKAGE AT ALL COSTS! This is the most common and most serious mistake in this type of project. Your features for predicting the outcome of an inspection on date ‘X’ may only use information known before date ‘X’. If you create features using any information from the inspection you are trying to predict (e.g., using the critical_flag count of an inspection to predict the score of that same inspection), your model’s performance will be artificially inflated and useless in practice. Your model must predict the future, not describe the present. A concrete leaky-versus-safe contrast appears after these tips.
• You may use any analysis tools to formulate your report, but your submitted model must be implemented in Python or Orange (or both). You can assume the recipient is using Python 3 or Orange as appropriate, and has sklearn, scipy, numpy, pandas, matplotlib and seaborn installed.
• Note the total page length available, and use the marks available for each section to judge how much time and effort to invest in each.
• Note that presentation of your work is also being assessed. This is a formal report directed to a business professional, and should be formatted and worded accordingly.
• Using Python rather than Orange will not necessarily gain you any extra marks. However, it will likely give you the opportunity to demonstrate more sophisticated analysis, increasing your potential to obtain higher marks in those areas.
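To make the leakage warning above concrete, the contrast below (CRITICAL_FLAG and RESTAURANT_ID are assumed column names) shows a leaky running count of critical violations next to a safe one:

    # Assumes df is sorted by restaurant and inspection date, with
    # CRITICAL_FLAG = 1 where an inspection found a critical violation.
    g = df.groupby("RESTAURANT_ID")["CRITICAL_FLAG"]

    # LEAKY: the running total includes the current inspection, i.e. it
    # uses the very outcome being predicted.
    df["leaky_critical_count"] = g.cumsum()

    # SAFE: shift(1) first, so only inspections strictly before date 'X'
    # contribute to the count.
    df["count_of_past_critical_violations"] = g.transform(
        lambda s: s.shift(1).cumsum())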