

CIS 5450 Homework 3: Hypothesis Testing and Machine Learning

Due Date: October 28th at 10:00 PM EST

101 points total (= 85 autograded + 16 manually graded).

Welcome to CIS 5450 Homework 3! In this homework you will gain some familiarity with machine learning models for supervised learning. Over the next few days you will strengthen your understanding of hypothesis testing via simulation and ML concepts using baseball, insurance, and diabetes datasets. Some housekeeping below! 

Before you begin:

·   Be sure to click "Copy to Drive" to make sure you're working on your own personal version of the homework

·   Check the pinned FAQ post on Ed for updates! If you have been stuck, chances are other students have also faced similar problems.

Note: We will be manually checking your implementations and code for certain problems. If you incorrectly implement a procedure using Scikit-learn (e.g. creating predictions on the training dataset, incorrectly processing training data prior to running certain machine learning models, hardcoding values, etc.), we will enforce a penalty system up to the maximum number of points allocated to the problem (e.g. if your problem is worth 4 points, the maximum number of points that can be deducted is 4 points).

·  Note: If your plot is not run or not present after we open your notebook, we will deduct the entire manually graded point value of the plot (e.g. if your plot is worth 4 points, we will deduct 4 points).

·  Note: If your .py file is hidden because it's too large, that's ok! We only care about your .ipynb file.

Part 0. Import and Setup

Import necessary libraries (do not import anything else!)

%%capture
!pip3 install penngrader-client

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
import random
import math
from xgboost import XGBClassifier
from penngrader.grader import *

!apt  install  zstd

!wget  -nc  -O  diabetes_prediction_dataset.csv.zst  https://www.dropbox.com/scl/fi/p8qpv4eja0xp3

!unzstd  -f  diabetes_prediction_dataset.csv.zst

!wget  -nc  -O  games.csv.zst  https://www.dropbox.com/scl/fi/43au9nv0bty84pqg6aw64/games.csv.zst

!unzstd  -f  games.csv.zst

!wget  -nc  -O  medical_cost.csv.zst  https://www.dropbox.com/scl/fi/8nz07htxxi07xilddsulx/medica

!unzstd  -f  medical_cost.csv.zst

PennGrader Setup

#  PLEASE  ENSURE  YOUR  PENN-ID  IS  ENTERED  CORRECTLY.  IF  NOT,  THE  AUTOGRADER  WON'T  KNOW

#  TO  ASSIGN  POINTS  TO  YOU  IN  OUR  BACKEND

STUDENT_ID  =    #  YOUR  PENN-ID  GOES  HERE  AS  AN  INTEGER  #

SECRET  =  STUDENT_ID

%%writefile  config.yaml


grader_api_url:  'https://23whrwph9h.execute-api.us-east-1.amazonaws.com/default/Grader23

d       k     ' l k                    k               '

%set_env  HW_ID=cis5450_fall24_HW3

grader  =  PennGrader('config.yaml',  'cis5450_fall24_HW3',  STUDENT_ID,  SECRET)

Part 1: Hypothesis Testing via Simulation [17 Points Total]

1.1: Estimating Pi through Simulation [4 points]

Consider a circle with radius 1/2 inside of a unit square:

 

We could compute the area of the circle with a well-known formula and the value of π, but we can also compute both the area of the circle, and the mysterious value of π, via simulation!

If we randomly sample a point inside the unit square, the probability that the point falls within the circle is equal to the area of the circle divided by the area of the square. Thus, if we sample a total of $P_t$ points and $P_c$ of them are in the circle, we can write the area of the circle $A_c$ as:

$$A_c = \pi r^2 = \frac{\pi}{4} \approx \frac{P_c}{P_t}$$

Solving for π gives:

$$\pi \approx \frac{4 P_c}{P_t}$$

Below is some Python code that simulates picking a random point in the square, testing if that point is inside the circle, and keeping track of $P_c$ and $P_t$. Run this code to ensure it works, and see how long it takes. The simulation should sample 10 million points.


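%%time

def pt_in_circle(x, y):
    # True if (x, y) lies inside the circle of radius 0.5 centered at the origin
    return math.sqrt(x**2 + y**2) < 0.5

# Counters for points inside the circle (p_c) and total points sampled (p_t)
p_c = 0
p_t = 0

for i in range(10_000_000):
    # Sample x and y uniformly from -0.5 to 0.5
    x = random.uniform(-0.5, 0.5)
    y = random.uniform(-0.5, 0.5)
    if pt_in_circle(x, y):
        p_c += 1
    p_t += 1

# Estimate pi
pi_estimate = 4 * p_c / p_t
print(f"Estimated value of pi: {pi_estimate}")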
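For a sense of how close to π a run of this size should land, the binomial standard error of the estimator can be worked out directly. This is a minimal sketch and not part of the assignment; the names `n_points`, `p_inside`, and `std_error` are illustrative, and it assumes the sampled points are independent.

```python
import math

# Number of sampled points, matching the simulation above
n_points = 10_000_000

# Probability that a uniform point in the unit square lands in the circle:
# area of the circle (pi / 4) divided by area of the square (1)
p_inside = math.pi / 4

# Each point is a Bernoulli trial, so the fraction inside the circle has
# standard deviation sqrt(p * (1 - p) / n); the pi estimate is 4x that fraction
std_error = 4 * math.sqrt(p_inside * (1 - p_inside) / n_points)

print(f"Standard error of the pi estimate: {std_error:.6f}")
```

Under these assumptions the estimate typically differs from π by a few ten-thousandths, so agreement to roughly three decimal places is what to expect from 10 million points.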

Next, let's accelerate our simulation with vectorization! Using numpy, write a vectorized version of the simulation.

Your solution must:

·   Contain no loops (while, for, etc.) or list comprehensions

·   Contain no if statements or conditionals

·   Only use built-in or numpy (np.) functions

·   Make only a single call to np.random

Note: You will not get any credit if you violate any of the above instructions.
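One NumPy idiom that fits constraints like these is to replace a loop with an `if` inside it by a boolean array and an aggregate. The sketch below applies that idiom to an unrelated toy problem (estimating the probability that a uniform draw falls below 0.3), so it is not a solution to the TODO. The names `rng`, `draws`, and `below` are illustrative, and the seed is arbitrary.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (seed value is arbitrary)
rng = np.random.default_rng(0)

# One call produces all samples at once instead of one per loop iteration
draws = rng.uniform(0.0, 1.0, size=1_000_000)

# Comparing an array to a scalar yields a boolean array, one entry per sample
below = draws < 0.3

# True counts as 1 and False as 0, so the mean is the fraction of hits
estimate = below.mean()

print(f"Estimated P(U < 0.3): {estimate:.4f}")
```

The comparison plays the role of the `if`, and the mean plays the role of the running counters.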

%%time

#  TODO:  Sample  10,000,000  points  and  calculate  pi

pi_estimate  =

#  Do  NOT  change  anything  below  this  line

grader.grade(test_case_id  =  'test_estimate_pi',  answer  =  (pi_estimate,  _ih[-1]))

1.2 Hypothesis Testing [13 Points]

It is commonly believed that in many sports, the home team tends to have an advantage over the away team, often referred to as "home field advantage." In this part, we will perform a hypothesis test to determine, from a statistical standpoint, if such an advantage exists in Major League Baseball (MLB) games. We will guide you through each step of the process, using the MLB Games Dataset, and conduct the test through simulation. For this part, we will be using vectorization only, so no for loops should be used!

1.2.1 Load Data

Before diving into the simulation, we need to load in the data first.

TODO:

·   Load games.csv and save the data to a dataframe called games_df.

·   Inspect the first five rows. There are many columns in this dataframe, but think about which ones we will actually need for hypothesis testing.

# DO NOT CHANGE
# Import Data
games_df = pd.read_csv("games.csv")
games_df.head()

In lecture, you have learned that in hypothesis testing, we start by assuming a baseline called the null hypothesis. For this test, the null hypothesis (H0) is that home field advantage does not exist in MLB games. This means that, under the null hypothesis, the probability of the home team winning is equal to the probability of the away team winning (i.e., 50%).

As a brief review, to determine whether we can reject or fail to reject the null hypothesis, we will:

1. Set up the hypotheses:

o   H0: The probability of the home team winning is 50% (no home field advantage).

o   Ha (alternative hypothesis): The probability of the home team winning is greater than 50% (home field advantage exists).

2. Analyze the data: We will calculate the observed proportion of home team wins using the MLB Games Dataset.

3. Simulate random outcomes: We will simulate a large number of seasons in which the home team has exactly a 50% chance of winning each game, assuming the null hypothesis is true. In lecture you saw the Gaussian distribution. Here, the nature of the data requires us to draw from the binomial distribution.

4. Compare observed results to simulations: We will determine how often the simulated results produce home win rates as extreme or more extreme than the observed data. This will give us a p-value, which tells us the probability of observing data at least as extreme as ours under the null hypothesis.

5. Make a decision: Based on the p-value, we will decide whether to reject or fail to reject the null hypothesis:

o   If the p-value is below a threshold (in this case we'll use 0.05), we will reject the null hypothesis and conclude that home field advantage likely exists. Intuitively, a small p-value means that the data we observed is extremely unlikely to occur under the null hypothesis.

o   If the p-value is higher, we will fail to reject the null hypothesis, meaning the evidence is not strong enough to suggest home field advantage.
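The five steps can be sketched end to end on a toy problem. The example below is an illustration under made-up assumptions, not the assignment's data: it imagines a coin that came up heads 5,100 times in 10,000 flips and asks whether that is consistent with a fair coin. All names and numbers (`n_flips`, `observed_heads`, `n_sims`, the seed) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

# Step 2: observed test statistic (hypothetical data)
n_flips = 10_000
observed_heads = 5_100
observed_rate = observed_heads / n_flips

# Step 3: simulate the null world, where each flip is heads with p = 0.5.
# Each simulated experiment is one binomial draw: the heads count in n_flips.
n_sims = 200_000
simulated_heads = rng.binomial(n=n_flips, p=0.5, size=n_sims)
simulated_rates = simulated_heads / n_flips

# Step 4: one-sided p-value, the share of null experiments that are
# at least as extreme as the observed rate
p_value = np.mean(simulated_rates >= observed_rate)

# Step 5: compare against the 0.05 threshold
reject_null = p_value < 0.05
print(f"p-value: {p_value:.4f}, reject H0: {reject_null}")
```

With these made-up numbers the simulated p-value comes out around 0.02, so the fair-coin hypothesis would be rejected at the 0.05 level.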

1.2.2 Calculate Original Test Statistic [3 Points]

We will now move on to Step 2, which is to calculate the original test statistic, i.e., the home win rate of the given data.

TODO:

·  Under special circumstances, there can be a tie. For this part, just drop games that ended in a tie.

·   Create a column called home_win that is 1 if the home team won that game and 0 otherwise.

·   Calculate the proportion of times that the home team wins and store it in home_win_rate.

#  Drop  ties  and  reset  index

#  Create  a  column  that  is  1  if  the  home  team  won

#  Calculate  original  test  statistic

home_win_rate  =

#  Grader  Cell  (3  points)

grader.grade(test_case_id  =  'original_test_statistic',  answer  =  (games_df['home_win'],  home

1.2.3 Simulation and Plotting [7 points] (7 manually graded)

Now we will simulate the null world to get a distribution.

TODO:

·  Simulate 10,000,000 trials

·   Each trial should be drawn from a binomial distribution. If you are unfamiliar with the binomial distribution, take a look at the numpy documentation and pay careful attention to what your n and p should be in this case. Recall what it means when we are sampling from the null world.

·  Calculate the simulated proportion by dividing by total games

Note: You should be using numpy vectorization

#  Simulate  random  home  win  outcomes  for  each  game  across  all  simulations

simulated_home_wins  =


#  Calculate  the  simulated  proportion  of  home  wins

l   d

Now, let's visualize the distribution of our simulation.

Task:

·  Plot a histogram of the distribution of the test statistic in the null world. Use 50 bins.

·  Title the plot: Distribution of Simulated Home Winrate

·   Label the x-axis: Home Winrate

·   Label the y-axis: Frequency

·  Add a red vertical line to indicate the original test statistic. Label this vertical line "Observed Home Winrate: {home_win_rate}". Round to four decimal places.

·  Add a legend to your plot.

Hint: Take a look at matplotlib.pyplot.axvline
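A minimal sketch of the `axvline` pattern on synthetic data follows. The data, title, axis labels, and the `observed` value are placeholders rather than the assignment's expected output, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Synthetic stand-in for a simulated null distribution
values = rng.normal(loc=0.5, scale=0.01, size=100_000)
observed = 0.5234  # hypothetical observed statistic

fig, ax = plt.subplots()
ax.hist(values, bins=50)

# axvline draws a vertical line spanning the height of the axes;
# its label is what the legend displays
line = ax.axvline(observed, color="red", label=f"Observed: {round(observed, 4)}")

ax.set_title("Toy histogram with a reference line")
ax.set_xlabel("Statistic")
ax.set_ylabel("Frequency")
ax.legend()
```

Only artists that carry a `label` appear in the legend, which is why the histogram bars are not listed here.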

#  Plot  the  distribution  of  simulated  home  win  counts

1.2.4 Calculate p-value [3 points] (2 manually graded)

Finally, we can calculate the simulated p-value. Remember what the p-value represents, and use your simulated win rates to calculate it.

# Calculate the p-value
simulated_p_value =

# Grader Cell (1 point)
grader.grade(test_case_id = 'test_p_value', answer = simulated_p_value)

After calculating the p-value, briefly describe what it represents. Does the p-value represent exactly what you might get if you were to calculate it mathematically? State whether we should reject or fail to reject the null hypothesis.

 



