Individual Assignment (15%)
Assignment 1: Used Car Price Prediction
Objective:
In this assignment, you will take on the role of a data science consultant for a national used car resale company in America. The company wishes to expand its business by purchasing more used cars from the market. Your task is to develop a machine learning model to predict the selling prices of these cars. Accurate price predictions will help the company buy cars that can be sold for a higher price, thereby increasing profit margins and avoiding cars with lower resale value.
You are tasked with delivering the first sprint of this project within three weeks. This first sprint focuses on data understanding and preparation. Specifically, you need to produce the following deliverables presented in a report format at the end of this sprint.
• Data Quality Summary: Validate with the client that the data received is accurate, reliable, and representative.
• Data Exploration Summary: Provide insights on interesting patterns and relationships between variables.
• Data Preparation and Feature Engineering Plan: Detail the strategies to address data issues and outline the feature engineering process.
Suggested Steps to Follow:
1. Perform Data Profiling: Understand the data by examining its structure and contents (e.g. statistical summary, visual exploration, etc.)
2. Identify Insights: Analyze the data to uncover interesting patterns and relationships (e.g. correlation analysis, etc.)
3. Address Data Issues: Investigate ways to fix any data issues (e.g. outlier detection, data capping, etc.)
4. Feature Engineering: Transform existing features and derive new ones to improve the predictive power of the ML model (e.g. data standardization, feature transformation, feature construction, etc.)
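As an illustration of step 1, the following is a minimal profiling sketch, not a prescribed method. It assumes pandas is available and uses a small made-up frame in place of the real CSV; the helper name `profile` and all sample values are hypothetical.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each column: data type, missing count, distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })


# Hypothetical toy rows; the real dataset has 13 columns.
toy = pd.DataFrame({
    "Year_Manufactured": [2012, 2015, 2015, 2018],
    "Type_Of_Fuel": ["Petrol", "Diesel", None, "Petrol"],
    "Price": [4500.0, 7200.0, 6900.0, None],
})

summary = profile(toy)
print(summary)
print(toy.describe())  # basic statistics for the numeric columns
```

The same helper can be pointed at the full dataset once it is loaded; histograms or box plots would complement the table for the visual exploration part.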
This assignment will help you apply concepts learned in the first three lectures of this course, facilitating practical understanding and application of machine learning methodologies for financial market modelling opportunities.
Dataset Description
You will be provided with a CSV file (`car_data.csv`) containing historical resale data of used cars. Here's a description of the dataset:
Name: pre-owned_cars_data.csv (7,100 records, 13 variables)
| Column | Description |
| --- | --- |
| Name | Brand name and model |
| Year_Manufactured | Manufacturing year of the car |
| Mileage_Offered | The standard mileage offered by the car company in km per liter |
| Type_Of_Owner | Number of previous ownerships |
| Type_Of_Transmission | Automatic/Manual |
| Car_Engine | The displacement volume of the engine in CC |
| Distance_Driven | Total kilometers driven by the previous owner(s) |
| Type_Of_Fuel | Type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG) |
| Number_Of_Seats | The number of seats in the car |
| Car_Power | The maximum power of the engine in bhp (brake horsepower) |
| Price_When_New | The price of a new car of the same model in USD |
| Location | The location in which the car was sold |
| Price | The price of this car sold in USD |
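The column list above can double as a quick schema check after loading the file. The sketch below is one possible way to do that, assuming pandas; `check_schema` is a hypothetical helper, and the toy frame stands in for the real data.

```python
import pandas as pd

# Column names copied from the dataset description.
EXPECTED_COLUMNS = [
    "Name", "Year_Manufactured", "Mileage_Offered", "Type_Of_Owner",
    "Type_Of_Transmission", "Car_Engine", "Distance_Driven", "Type_Of_Fuel",
    "Number_Of_Seats", "Car_Power", "Price_When_New", "Location", "Price",
]


def check_schema(df, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to the description."""
    missing = [c for c in expected if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected]
    return missing, unexpected


# Hypothetical frame: "Price" was dropped and an index column leaked in.
toy = pd.DataFrame(columns=EXPECTED_COLUMNS[:-1] + ["Unnamed: 0"])
missing, unexpected = check_schema(toy)
print("missing:", missing)
print("unexpected:", unexpected)
```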
Submission Requirements:
• To successfully complete Assignment 1, please submit a client consultancy report in PowerPoint format. The report should include data profiling, insight generation, a data preparation strategy, the input dimensions of the data, and a description of the variables selected for machine learning model development.
• At the end of the report, you also have to include an appendix section with answers to the specific questions listed below. You should find it easy to answer these questions, as you will have considered them while preparing your consultancy report.
• Marks will be deducted if you do not include this appendix section.
Deadline:
• Submit your assignment by January 15, 2025, 23:59:00.
Answer the specific questions below and put your answers in the appendix section of your PowerPoint report.
Data Profiling Questions:
1. What are the basic statistics and distributions of values for each variable?
2. Are the values in the numeric variables normally distributed?
3. What are the cardinalities of the categorical variables?
4. Are there any variables with missing or invalid values?
5. Will you create any new variables from the “Name” column? If so, what are they?
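The profiling questions above can be explored with a few pandas calls. The sketch below is illustrative only: it uses made-up rows, checks skewness as a rough indicator of departure from normality, and assumes the first word of "Name" is the brand, which would not hold for multi-word brands.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; names and prices are invented for illustration.
toy = pd.DataFrame({
    "Name": ["Toyota Corolla Altis", "Honda City", "Toyota Camry", "Ford Figo",
             "Honda Civic", "Ford Endeavour", "BMW X5"],
    "Price": [2000, 3000, 4000, 6000, 9000, 15000, 40000],
})

# Skewness near 0 is consistent with symmetry; large positive values
# indicate a long right tail.
raw_skew = toy["Price"].skew()
log_skew = np.log1p(toy["Price"]).skew()

# Assumption: brand is the first whitespace-separated token of "Name".
name_parts = toy["Name"].str.split(n=1, expand=True)
toy["Brand"] = name_parts[0]
toy["Model"] = name_parts[1]

# Cardinality = number of distinct values in a categorical column.
brand_cardinality = toy["Brand"].nunique()

print(f"raw skew={raw_skew:.2f}, log skew={log_skew:.2f}")
print(toy[["Brand", "Model"]])
print("Brand cardinality:", brand_cardinality)
```

A histogram or Q-Q plot would give a fuller picture of each distribution than a single skewness number.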
Data Cleaning Questions:
6. Will you reformat values in any of the variables?
7. Excluding null or empty values, how many invalid values are present in the “Number_Of_Seats” variable prior to data cleaning?
8. Which method is more accurate for filling in missing values in the “Number_Of_Seats” variable: the mean, median, mode, or another method?
9. When imputing missing values in the “Car_Engine” column, is it better to use the median or mean? Please explain your reasoning.
10. For the “Mileage_Offered” variable, how would you handle a value of 0.0 kmpl? Would you convert it to a null value, change it to a float number of 0.0, or use another method? Please explain your reasoning.
11. Similarly, how would you address the value “null bhp” that appears in the “Car_Power” variable?
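The cleaning questions above involve values that carry unit suffixes, such as "0.0 kmpl" and "null bhp". The sketch below shows one possible treatment on made-up rows, not a recommended answer: strip the units, treat implausible zeros as missing, and impute a discrete column with its mode. The helper `to_number` is hypothetical.

```python
import numpy as np
import pandas as pd


def to_number(series: pd.Series) -> pd.Series:
    """Extract the first number from strings such as '58.16 bhp'.

    Strings without digits (for example 'null bhp') become NaN.
    """
    extracted = series.astype(str).str.extract(r"(\d+(?:\.\d+)?)")[0]
    return pd.to_numeric(extracted, errors="coerce")


# Hypothetical rows mirroring the formats mentioned in the questions.
toy = pd.DataFrame({
    "Car_Power": ["58.16 bhp", "null bhp", "126.2 bhp", "88.5 bhp"],
    "Mileage_Offered": ["26.6 kmpl", "0.0 kmpl", "19.67 kmpl", "18.2 kmpl"],
    "Number_Of_Seats": [5.0, 5.0, np.nan, 7.0],
})

toy["Car_Power"] = to_number(toy["Car_Power"])
# One option: a mileage of 0.0 is not physically plausible, so mark it missing.
toy["Mileage_Offered"] = to_number(toy["Mileage_Offered"]).replace(0.0, np.nan)
# One option for a discrete count: fill with the most frequent value.
seat_mode = toy["Number_Of_Seats"].mode().iloc[0]
toy["Number_Of_Seats"] = toy["Number_Of_Seats"].fillna(seat_mode)

print(toy)
```

Whether the mean, median, or mode is appropriate depends on the distribution of each column, which the profiling step should reveal.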
Feature Engineering Questions:
12. Will you discretize the “Car_Engine” variable? If yes, which discretization method is better for this variable: equal-width discretization or equal-frequency discretization? Please explain your choice.
13. If you plan to use linear machine learning prediction algorithms with the provided dataset to develop your model:
a) How would you encode the “Type_Of_Owner” variable, which is a categorical variable? Would you use one-hot encoding, label encoding, or ordinal encoding? Explain your choice.
b) Similarly, how would you encode the “Location” variable? Explain your choice.
c) What is one disadvantage of encoding the “Type_Of_Transmission” variable using two dummy variables with one-hot encoding instead of a single dummy variable?
14. For the “Mileage_Offered” variable, should you use min-max scaling or z-score standardization for feature scaling? Please explain your reasoning.
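To make the techniques named in the feature engineering questions concrete, the sketch below applies them to made-up values with pandas. It only demonstrates how each option behaves and does not settle which one suits the real data.

```python
import pandas as pd

# Hypothetical rows; the engine sizes are deliberately right-skewed.
toy = pd.DataFrame({
    "Car_Engine": [800, 1000, 1200, 1500, 2000, 5000],
    "Type_Of_Transmission": ["Manual", "Manual", "Automatic",
                             "Manual", "Automatic", "Automatic"],
    "Mileage_Offered": [25.0, 22.0, 20.0, 18.0, 15.0, 8.0],
})

# Equal-width: every bin spans the same range of values.
equal_width = pd.cut(toy["Car_Engine"], bins=3)
# Equal-frequency: every bin holds roughly the same number of rows.
equal_freq = pd.qcut(toy["Car_Engine"], q=3)
width_counts = equal_width.value_counts(sort=False).tolist()
freq_counts = equal_freq.value_counts(sort=False).tolist()

# Two dummies always sum to 1 per row, which duplicates the intercept
# of a linear model; dropping one level avoids that redundancy.
two_dummies = pd.get_dummies(toy["Type_Of_Transmission"], dtype=int)
one_dummy = pd.get_dummies(toy["Type_Of_Transmission"], drop_first=True, dtype=int)

# Min-max scaling maps to [0, 1]; z-score gives mean 0 and unit variance.
m = toy["Mileage_Offered"]
min_max = (m - m.min()) / (m.max() - m.min())
z_score = (m - m.mean()) / m.std(ddof=0)

print("equal-width counts:", width_counts)
print("equal-frequency counts:", freq_counts)
print(one_dummy.columns.tolist())
```

Min-max scaling is sensitive to extreme values because they define the range, while z-scores have no fixed bounds; which matters more depends on the outliers found during profiling.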
Feature Selection Questions:
15. Are there any independent variables that are highly correlated with each other? What are they?
16. To assess the correlation between the “Location” variable and the “Price” variable, should you use Pearson Correlation or Analysis of Variance (ANOVA)?
17. Are there any variables you would eliminate from the machine learning model used to predict the “Price”? If so, which variables would you eliminate, and why?
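The two association measures mentioned in the feature selection questions can be sketched as follows on made-up data: Pearson correlation for a pair of numeric variables, and a one-way ANOVA F statistic for a categorical variable against a numeric one. The helper `anova_f` is a hypothetical hand-rolled version written for illustration; in practice `scipy.stats.f_oneway` would also report a p-value.

```python
import pandas as pd


def anova_f(values: pd.Series, groups: pd.Series) -> float:
    """One-way ANOVA F statistic: between-group over within-group variance."""
    grand_mean = values.mean()
    grouped = values.groupby(groups)
    ss_between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_within = ((values - grouped.transform("mean")) ** 2).sum()
    k, n = grouped.ngroups, len(values)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical rows; "A" and "B" stand in for two locations.
toy = pd.DataFrame({
    "Car_Engine": [1000, 1200, 1500, 1600, 2000, 2500],
    "Car_Power": [60, 75, 100, 105, 140, 180],
    "Location": ["A", "A", "A", "B", "B", "B"],
    "Price": [5000, 5200, 4800, 9000, 9500, 8500],
})

# Pearson correlation applies to pairs of numeric variables.
engine_power_corr = toy["Car_Engine"].corr(toy["Car_Power"])

# ANOVA compares the mean of a numeric variable across category levels.
f_stat = anova_f(toy["Price"], toy["Location"])

print(f"corr(Car_Engine, Car_Power) = {engine_power_corr:.3f}")
print(f"F(Location -> Price) = {f_stat:.2f}")
```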