

Individual Assignment (15%)

Assignment 1: Used Car Price Prediction

Objective:

In this assignment, you will take on the role of a data science consultant for a national used car resale company in America. The company wishes to expand its business by purchasing more used cars from the market. Your task is to develop a machine learning model to predict the selling prices of these cars. Accurate price predictions will help the company buy cars that can be sold for a higher price, thereby increasing profit margins and avoiding cars with lower resale value.

You are tasked with delivering the first sprint of this project within three weeks. This first sprint focuses on data understanding and preparation. Specifically, you need to produce the following deliverables presented in a report format at the end of this sprint.

Data Quality Summary: Validate with the client that the data received is accurate, reliable, and representative.

Data Exploration Summary: Provide insights on interesting patterns and relationships between variables.

Data Preparation and Feature Engineering Plan: Detail the strategies to address data issues and outline the feature engineering process.

Suggested Steps to Follow:

1. Perform Data Profiling: Understand the data by examining its structure and contents (e.g. statistical summary, visual exploration, etc.)

2. Identify Insights: Analyze the data to uncover interesting patterns and relationships (e.g. correlation analysis, etc.)

3. Address Data Issues: Investigate ways to fix any data issues (e.g. outlier detection, data capping, etc.)

4. Feature Engineering: Transform existing features and derive new ones to improve the predictive power of the ML model (e.g. data standardization, feature transformation, feature construction, etc.)
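As an illustration of step 3, the following is a minimal sketch of outlier detection with data capping based on the interquartile range (IQR). The function name `iqr_cap`, the multiplier `k=1.5`, and the sample `Distance_Driven` values are assumptions made for this example; they are not taken from the dataset.

```python
import statistics

def iqr_cap(values, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    capped = [min(max(v, lower), upper) for v in values]
    return capped, (lower, upper)

# Hypothetical Distance_Driven values in km; the last one looks like a data-entry error.
distance_driven = [32000, 45000, 51000, 58000, 64000, 72000, 6500000]

capped, (lower, upper) = iqr_cap(distance_driven)
outliers = [v for v in distance_driven if v < lower or v > upper]
print("fences:", lower, upper)
print("outliers:", outliers)
print("capped:", capped)
```

Capping keeps the record while limiting its influence. Whether to cap, drop, or correct a flagged value is a judgment call that should be justified in the report.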

This assignment will help you apply concepts learned in the first three lectures of this course, facilitating practical understanding and application of machine learning methodologies for financial market modelling opportunities.

Dataset Description

You will be provided with a CSV file (`car_data.csv`) containing historical resale data of used cars. Here's a description of the dataset:

Name: pre-owned_cars_data.csv (7,100 records, 13 variables)

| Column | Description |
| --- | --- |
| Name | Brand name and model |
| Year_Manufactured | Manufacturing year of the car |
| Mileage_Offered | The standard mileage offered by the car company, in km per liter |
| Type_Of_Owner | Number of previous ownerships |
| Type_Of_Transmission | Automatic/Manual |
| Car_Engine | The displacement volume of the engine, in CC |
| Distance_Driven | Total kilometers driven by the previous owner(s) |
| Type_Of_Fuel | Type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG) |
| Number_Of_Seats | The number of seats in the car |
| Car_Power | The maximum power of the engine, in bhp (brake horsepower) |
| Price_When_New | The price of a new car of the same model, in USD |
| Location | The location in which the car was sold |
| Price | The price at which this car was sold, in USD |

Submission Requirements:

•    To successfully complete Assignment 1, please submit a client consultancy report in PowerPoint format. The report should include data profiling, insight generation, a data preparation strategy, the input dimensions of the data, and a description of the variables selected for machine learning model development.

•    At the end of the report, you also have to include an appendix section with answers to the specific questions listed below. You should find it easy to answer these questions, as you will have considered them while preparing your consultancy report.

Marks will be deducted if you do not include this appendix section.

Deadline:

•    Submit your assignment by January 15, 2025, 23:59:00.

Answer the specific questions below and put your answers in the appendix section of your PowerPoint report.

Data Profiling Questions:

1.    What are the basic statistics and distributions of values for each variable?

2.    Are the values in the numeric variables normally distributed?

3.    What are the cardinalities of the categorical variables?

4.    Are there any variables with missing or invalid values?

5.    Will you create any new variables from the “Name” column? If so, what are they?
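The profiling questions above can be explored with a few small helpers. This is a sketch only: the sample names and prices are hypothetical, `split_name` assumes the first word of "Name" is the brand (which would fail for multi-word brands), and `skewness` uses the population Fisher-Pearson formula as one rough indicator of departure from a normal shape.

```python
from collections import Counter
import statistics

def split_name(name):
    """Split 'Brand Model ...' into (brand, model); assumes the first word is the brand."""
    brand, _, model = name.strip().partition(" ")
    return brand, model

def skewness(values):
    """Population skewness; close to 0 for symmetric data, positive for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical "Name" values, not taken from the dataset.
names = ["Toyota Corolla Altis", "Honda City 1.5", "Toyota Camry", "Ford Figo"]
brand_counts = Counter(split_name(n)[0] for n in names)
cardinality = len(brand_counts)  # number of distinct brands

# Hypothetical prices in USD with one expensive car pulling the tail to the right.
prices = [4000, 5000, 5500, 6000, 30000]

print("brand counts:", dict(brand_counts))
print("cardinality:", cardinality)
print("price skewness:", round(skewness(prices), 2))
```

A skewness far from zero suggests the variable is not normally distributed; a histogram or Q-Q plot gives a more complete picture than a single number.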

Data Cleaning Questions:

6.    Will you reformat values in any of the variables?

7.    Excluding null or empty values, how many invalid values are present in the “Number_Of_Seats” variable prior to data cleaning?

8.    Which method fills in missing values in the “Number_Of_Seats” variable more accurately: mean, median, mode, or another method?

9.    When imputing missing values in the “Car_Engine” column, is it better to use the median or mean? Please explain your reasoning.

10.  For the “Mileage_Offered” variable, how would you handle a value of 0.0 kmpl? Would you convert it to a null value, change it to a float number of 0.0, or use another method? Please explain your reasoning.

11. Similarly, how would you address the value “null bhp” that appears in the “Car_Power” variable?
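One possible way to handle unit-suffixed strings such as “0.0 kmpl” and “null bhp”, and to compare imputation strategies, is sketched below. The helpers `parse_measure` and `impute`, the `zero_is_missing` flag, and all sample values are assumptions for illustration; treating a zero mileage as missing is one option among those the questions ask you to weigh, not a prescribed answer.

```python
import re
import statistics

_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

def parse_measure(raw, zero_is_missing=False):
    """Return the leading number of a string such as '74 bhp', or None if there is none."""
    if raw is None:
        return None
    match = _NUMBER.match(str(raw).strip())
    if match is None:  # e.g. 'null bhp' or an empty string
        return None
    value = float(match.group())
    if zero_is_missing and value == 0.0:
        return None  # a physically implausible zero is treated as missing
    return value

def impute(values, strategy):
    """Fill None entries with strategy(observed values), e.g. statistics.median."""
    fill = strategy([v for v in values if v is not None])
    return [fill if v is None else v for v in values]

# Hypothetical raw values.
power = [parse_measure(s) for s in ["74 bhp", "null bhp", "103.6 bhp"]]
mileage = [parse_measure(s, zero_is_missing=True) for s in ["18.9 kmpl", "0.0 kmpl"]]

# Hypothetical engine sizes in CC with one very large engine and one missing value.
engine_cc = [998, 1197, 1248, None, 5998]
by_median = impute(engine_cc, statistics.median)
by_mean = impute(engine_cc, statistics.fmean)

# Hypothetical seat counts; the mode keeps the imputed value a valid seat count.
seats = impute([5, 5, 7, None, 5, 4], statistics.mode)

print(power, mileage)
print("median fill:", by_median[3], "mean fill:", by_mean[3])
print("seats:", seats)
```

In this toy sample the single large engine pulls the mean well above the typical value while the median stays close to it, which is the kind of evidence the reasoning for question 9 can draw on.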

Feature Engineering Questions:

12. Will you discretize the “Car_Engine” variable? If yes, which discretization method is better for this variable: equal-width discretization or equal-frequency discretization? Please explain your choice.

13.  If you plan to use linear machine learning prediction algorithms with the provided dataset to develop your model:

a)    How would you encode the “Type_Of_Owner” variable, which is a categorical variable? Would you use one-hot encoding, label encoding, or ordinal encoding? Explain your choice.

b)   Similarly, how would you encode the “Location” variable? Explain your choice.

c)    What is one disadvantage of encoding the “Type_Of_Transmission” variable using two dummy variables with one-hot encoding instead of a single dummy variable?

14.  For the “Mileage_Offered” variable, should you use min-max scaling or z-score standardization for feature scaling? Please explain your reasoning.
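The techniques named in the feature engineering questions can be compared side by side on a small sample. The sketch below uses hypothetical values throughout; the owner labels in `OWNER_ORDER` are invented for illustration and may not match the labels in the dataset.

```python
import statistics

def equal_width_edges(values, bins):
    """Bin edges that split the value range into intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]

def equal_frequency_edges(values, bins):
    """Bin edges at quantiles, so each bin holds roughly the same number of rows."""
    cuts = statistics.quantiles(values, n=bins, method="inclusive")
    return [min(values), *cuts, max(values)]

def min_max_scale(values):
    """Rescale to [0, 1]; sensitive to extreme minimum and maximum values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mean, sd = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical engine sizes in CC, right-skewed by one large engine.
engine_cc = [800, 1000, 1200, 1200, 1500, 1600, 2000, 5000]
width_edges = equal_width_edges(engine_cc, 3)
freq_edges = equal_frequency_edges(engine_cc, 3)

# Ordinal encoding with hypothetical owner labels (assumed, not from the dataset).
OWNER_ORDER = {"First": 1, "Second": 2, "Third": 3}
owners = [OWNER_ORDER[o] for o in ["First", "Third", "Second"]]

# A single dummy is enough for a two-level variable such as Automatic/Manual.
is_automatic = [int(t == "Automatic") for t in ["Manual", "Automatic", "Manual"]]

print("equal-width edges:", width_edges)
print("equal-frequency edges:", [round(e, 1) for e in freq_edges])
print("min-max:", min_max_scale([10, 15, 20]))
print("z-score:", [round(z, 3) for z in z_score([10, 15, 20])])
```

With the skewed sample, the equal-width edges leave most cars in the first bin, while the equal-frequency edges spread them more evenly. For the two-level transmission variable, a second dummy column would always equal one minus the first, which introduces perfect collinearity in a linear model with an intercept.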

Feature Selection Questions:

15. Are there any independent (predictor) variables that are highly correlated with each other? What are they?

16. To assess the correlation between the “Location” variable and the “Price” variable, should you use Pearson Correlation or Analysis of Variance (ANOVA)?

17. Are there any variables you would eliminate from the machine learning model used to predict “Price”? If so, which variables would you eliminate, and why?
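The two measures mentioned in question 16 apply to different variable types, which a short sketch can make concrete: Pearson correlation relates two numeric variables, while a one-way ANOVA F statistic compares a numeric variable across the groups of a categorical one. The functions `pearson` and `anova_f`, the location labels, and every number below are assumptions for illustration only.

```python
import math
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences of equal length."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def anova_f(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical numeric predictors: engine size in CC and power in bhp.
engine_cc = [1000, 1200, 1500, 2000]
power_bhp = [60, 75, 100, 140]
r = pearson(engine_cc, power_bhp)

# Hypothetical prices (in thousands of USD) grouped by two made-up locations.
price_by_location = {"City A": [10, 12, 14], "City B": [20, 22, 24]}
f_stat = anova_f(list(price_by_location.values()))

print("Pearson r (engine vs power):", round(r, 4))
print("ANOVA F (price by location):", f_stat)
```

A large F statistic suggests that mean price differs across locations; turning it into a p-value requires the F distribution, which is outside this sketch.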

