

Actuarial Data and Analysis, T2 2020 
Assignment Part A 
Due time: Week 5 Wednesday, 1 July 2020, 11.55 am (sharp) 
1 Skills developed 
This assignment provides you with an opportunity to get familiar with the given datasets before applying 
modeling techniques you are learning in the course lectures to a business task involving data. In addition, 
your skills in understanding/applying data manipulation and analysis methods (from the course materials 
and any additional reference material you consider) will be developed via this assignment. Communication 
of the results of your investigations and analysis is also an important skill developed. 
2 Task 
You are a fresh actuarial graduate who has just joined the US Medicare Fraud Department as an analyst. 
Your team is in charge of analyzing Medicare data to detect fraud committed by providers. 
Your manager has asked you to prepare a preliminary report on the attached datasets so that you 
become familiar with the data and the Medicare provider characteristics, and are ready for further analysis. 
Your main tasks involve data manipulation and analysis, as well as a report and a recommendation for 
further analysis (i.e. modeling). 
Note that all relevant steps in the data manipulation as well as data analysis results should be included in 
the report or appendix. 
3 Additional information and mark allocation 
3.1 Data manipulation and analysis (17 marks) 
For the data you have, you should manipulate the data to prepare for data analysis. This includes (but is 
not limited to): data exploration, data cleaning (if necessary), combining all the datasets and aggregating 
the data per provider (see the Resources section for documentation). 
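The combining and aggregating steps can be sketched as follows. This is a minimal illustration rather than a prescribed method: it assumes the column names listed in the Data section, the rows are invented toy values (not taken from the real files), and the helper name `aggregate_per_provider` is hypothetical. It uses only the Python standard library so that it runs as is; in practice a data-frame library (dplyr in R, or pandas in Python) would be the usual choice.

```python
from collections import defaultdict

# Toy rows standing in for the real CSV files (all values are invented).
claims = [
    {"ClaimID": "C1", "BeneID": "B1", "ProviderID": "P1", "InscClaimAmtReimbursed": 1000.0},
    {"ClaimID": "C2", "BeneID": "B2", "ProviderID": "P1", "InscClaimAmtReimbursed": 500.0},
    {"ClaimID": "C3", "BeneID": "B1", "ProviderID": "P2", "InscClaimAmtReimbursed": 200.0},
]
providers = [
    {"ProviderID": "P1", "Fraud": "yes"},
    {"ProviderID": "P2", "Fraud": "no"},
]


def aggregate_per_provider(claims, providers):
    """Join claims to providers on ProviderID and summarise each provider."""
    fraud_by_provider = {p["ProviderID"]: p["Fraud"] for p in providers}
    totals = defaultdict(
        lambda: {"n_claims": 0, "total_reimbursed": 0.0, "beneficiaries": set()}
    )
    for claim in claims:
        row = totals[claim["ProviderID"]]
        row["n_claims"] += 1
        row["total_reimbursed"] += claim["InscClaimAmtReimbursed"]
        row["beneficiaries"].add(claim["BeneID"])

    summary = {}
    for provider_id, row in totals.items():
        summary[provider_id] = {
            "Fraud": fraud_by_provider.get(provider_id),  # None if the provider is unknown
            "n_claims": row["n_claims"],
            "n_beneficiaries": len(row["beneficiaries"]),
            "mean_reimbursed": row["total_reimbursed"] / row["n_claims"],
        }
    return summary


summary = aggregate_per_provider(claims, providers)
for provider_id, stats in sorted(summary.items()):
    print(provider_id, stats)
```

The result has one row per provider, which is the level at which the fraud label is defined.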
The analysis of the data should provide a good sense of the datasets and insights on beneficiary, claim and 
provider characteristics, and should motivate further analysis. You may find interesting insights by 
analysing both the combined and aggregate datasets. 
This task does not consist of modeling but you should keep in mind that the question your team will 
ultimately be looking at is which providers are likely to have fraudulent claims. 
See the section on data for details. 
Mark allocation for the assignment can be found in the rubrics (on the course Moodle webpage). 
3.2 Presentation Format (3 marks) 
Communication of quantitative results in a concise and easy-to-read manner is a skill that is vital in practice. 
As such, marks will be given for the presentation of your results. In order to maximize your marks for 
presentation you may wish to consider issues such as: table size/readability, figure axes/formatting, ease 
of reading, grammar/spelling, and report structure. You may also wish to consider the use of executive 
summaries and appendices, where appropriate. Provide sufficient details to the reader so that they can 
judge what you are doing, using appendices for non-essential but useful results for the report as necessary. 
Note that sufficient detail must be provided (in either the report body and/or appendices) so that the 
reviewer can follow all the steps and derivations required in your work. 
Note that a maximum page limit of 2 pages (excluding tables and graphs) applies to the main body 
of the report (see footnote 1). You should also consider the rubric for the presentation component (on the 
course Moodle webpage). There is no limit to the length of the appendix. 
3.3 Software 
You may choose which software package to use (e.g. R, Python or other). However, nearly every function you 
will need for this task is available in R. Note also that code enabling you to perform most of 
the computing can be found in the learning activities of the course and the Resources section. Any 
assumptions you make must be clearly identified and justified. 
4 Data 
The data is related to US Medicare claims and beneficiary details of 4436 providers from 2008 to 2009 and 
consists of 4 datasets: 
1. Medicare_Provider.csv 
2. Medicare_Inpatient.csv 
3. Medicare_Outpatient.csv 
4. Medicare_Beneficiary.csv 
Similar (but not identical) datasets are provided here. You may wish to check that webpage for further 
information about the context, data and problem (see footnote 2). 
You may also wish to have a look at the following exploratory data analysis based on the Kaggle datasets 
to give you an idea of why and how to start the data analysis: Healthcare Fraud Detection With Python: 
The importance of exploratory data analysis (weblink here). This data analysis is just a brief example and 
is not based on your datasets. Different and more variables may be of interest for your analysis. 
4.1 Medicare_Provider.csv (Provider Data) 
This dataset gives each provider's ID and whether or not the provider is fraudulent. 
Variable Description 
ProviderID: A unique ID assigned to each provider (character) 
Fraud: Is fraudulent? (categorical: “no”,“yes”) 
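A quick first check on this table is the balance between the two levels of Fraud. The sketch below uses invented labels, so the share it prints says nothing about the real data; it only shows the counting and the boolean coding.

```python
from collections import Counter

# Hypothetical labels; the real ones come from the Fraud column.
fraud_labels = ["no", "no", "yes", "no", "yes", "no", "no", "no"]

counts = Counter(fraud_labels)
fraud_share = counts["yes"] / len(fraud_labels)

# Boolean coding is often more convenient than "yes"/"no" strings later on.
is_fraud = [label == "yes" for label in fraud_labels]

print(counts, f"share flagged: {fraud_share:.1%}")
```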
Footnote 1: Please note that this is a maximum; feel free to use fewer pages if that is sufficient. 
Footnote 2: Optional readings for extra information and context on Medicare fraud in the US can be found here: link 1 and link 2. 
4.2 Medicare_Inpatient.csv (Inpatient Data) 
This dataset provides insights about the claims filed for those patients who are admitted to hospital. It also 
provides additional details about the admission, discharge dates and diagnosis code. 
Variable Description 
BeneID: A unique ID assigned to each beneficiary (chr) 
ClaimID: A unique ID assigned to each claim (chr) 
ClaimStartDt: Start date of the claim (date) 
ClaimEndDt: End date of the claim (date) 
InscClaimAmtReimbursed: Claim amount reimbursed (num) 
AttendingPhysician: Attending physician (chr) 
OperatingPhysician: Operating physician (chr) 
OtherPhysician: Other physician (chr) 
AdmissionDt: Admission date (date) 
ClmAdmitDiagnosisCode: Claim admission diagnosis code (chr) 
DeductibleAmtPaid: Deductible amount paid (num) 
DischargeDt: Discharge date (date) 
DiagnosisGroupCode: Diagnosis group code (chr) 
ClmDiagnosisCode_1: Claim diagnosis code 1 (chr) 
ClmProcedureCode_1: Claim procedure code 1 (num) 
ProviderID: A unique ID assigned to each provider (chr) 
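The date variables allow derived quantities such as length of stay (DischargeDt minus AdmissionDt) and claim duration (ClaimEndDt minus ClaimStartDt). The sketch below is one possible way to compute them. The ISO date format, the helper names `parse_date` and `days_between`, and the example claim are all assumptions made for illustration, so check how dates are actually stored in the file.

```python
from datetime import datetime


def parse_date(text, fmt="%Y-%m-%d"):
    """Parse a date string; return None for an empty value.

    The ISO format is an assumption about the files, not a documented fact.
    """
    return datetime.strptime(text, fmt).date() if text else None


def days_between(start, end):
    """Whole days from start to end, or None if either date is missing."""
    if start is None or end is None:
        return None
    return (end - start).days


# Invented example claim, not taken from Medicare_Inpatient.csv.
claim = {
    "ClaimStartDt": "2009-03-01",
    "ClaimEndDt": "2009-03-06",
    "AdmissionDt": "2009-03-01",
    "DischargeDt": "2009-03-05",
}

length_of_stay = days_between(
    parse_date(claim["AdmissionDt"]), parse_date(claim["DischargeDt"])
)
claim_duration = days_between(
    parse_date(claim["ClaimStartDt"]), parse_date(claim["ClaimEndDt"])
)
print(length_of_stay, claim_duration)
```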
Important remark: Variables ClmAdmitDiagnosisCode, DiagnosisGroupCode, ClmDiagnosisCode_1 and 
ClmProcedureCode_1 correspond to specific international or national coding systems (see footnote 3). You don’t need to know 
or understand the details of the meaning of the codification. You can treat those variables as categorical 
and investigate only the most significant levels. 
• ClmAdmitDiagnosisCode represents the diagnosis code on the institutional encounter indicating the 
beneficiary’s initial diagnosis at admission. This diagnosis code may not be confirmed after the patient 
is evaluated; it may be different than the eventual diagnoses. 
• DiagnosisGroupCode represents the diagnostic group to which a hospital claim belongs. It is a unique 
identifier of a hospital case type that is based on similar clinical problems. 
• ClmDiagnosisCode_1 represents the diagnosis code in the 1st position identifying the condition(s) for 
which the beneficiary is receiving care. 
• ClmProcedureCode_1 indicates the principal procedure performed during the period covered by the 
institutional claim. 
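Treating a code variable as categorical and looking only at its main levels can be sketched as below. The interpretation of "most significant" as "most frequent" is an assumption, the helper name `top_levels` is hypothetical, and the code values are placeholders rather than real diagnosis codes.

```python
from collections import Counter


def top_levels(values, k=3, other_label="Other"):
    """Return the k most frequent levels and a copy with the rest lumped together.

    Missing values (None or empty string) are dropped before counting.
    """
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    top = counts.most_common(k)
    keep = {level for level, _ in top}
    lumped = [v if v in keep else other_label for v in present]
    return top, lumped


# Placeholder codes for illustration only.
codes = ["code_a", "code_b", "code_a", "code_c", "code_a", "code_b", "code_d", ""]

top, lumped = top_levels(codes, k=2)
print(top)
print(Counter(lumped))
```

Lumping rare levels into a single category keeps tables and plots readable when a variable has hundreds of distinct codes.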
4.3 Medicare_Outpatient.csv (Outpatient Data) 
This dataset provides details about the claims filed for those patients who visited hospitals as outpatients. 
Variable Description 
BeneID: A unique ID assigned to each beneficiary (chr) 
ClaimID: A unique ID assigned to each claim (chr) 
ClaimStartDt: Start date of the claim (date) 
ClaimEndDt: End date of the claim (date) 
InscClaimAmtReimbursed: Claim amount reimbursed (num) 
AttendingPhysician: Attending physician (chr) 
OperatingPhysician: Operating physician (chr) 
OtherPhysician: Other physician (chr) 
ClmDiagnosisCode_1: Claim diagnosis code 1 (chr) 
ClmProcedureCode_1: Claim procedure code 1 (num) 
DeductibleAmtPaid: Deductible amount paid (num) 
ClmAdmitDiagnosisCode: Claim admission diagnosis code (chr) 
ProviderID: A unique ID assigned to each provider (chr) 
Footnote 3: Reference: Research Data Assistance Center, weblink here. 
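According to the variable lists, the outpatient table has no AdmissionDt, DischargeDt or DiagnosisGroupCode, so the two claim tables cannot be stacked column for column. One possible approach, sketched below with invented rows, is to take the union of the columns, fill the gaps with missing values, and record where each claim came from. The `ClaimType` column and the helper `stack_claims` are hypothetical names introduced here.

```python
def stack_claims(inpatient, outpatient):
    """Stack both claim tables, filling columns absent from one side with None."""
    # Union of column names, in order of first appearance.
    columns = []
    for row in inpatient + outpatient:
        for name in row:
            if name not in columns:
                columns.append(name)

    stacked = []
    for claim_type, rows in (("inpatient", inpatient), ("outpatient", outpatient)):
        for row in rows:
            record = {name: row.get(name) for name in columns}
            record["ClaimType"] = claim_type  # hypothetical helper column
            stacked.append(record)
    return stacked


# Invented rows with only a few of the real columns.
inpatient = [
    {"ClaimID": "C1", "ProviderID": "P1", "InscClaimAmtReimbursed": 4000.0,
     "AdmissionDt": "2009-03-01", "DischargeDt": "2009-03-05"},
]
outpatient = [
    {"ClaimID": "C2", "ProviderID": "P1", "InscClaimAmtReimbursed": 60.0},
]

all_claims = stack_claims(inpatient, outpatient)
for record in all_claims:
    print(record)
```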
4.4 Medicare_Beneficiary.csv (Beneficiary Details Data) 
This dataset contains individual beneficiary details (e.g. date of birth, date of death, health conditions 
and state). 
Variable Description 
BeneID: A unique ID assigned to each beneficiary (chr) 
DOB: Date of birth (date) 
DOD: Date of death (date) 
Gender: Gender 1 or 2 (categorical) 
Race: Race 1 to 5 (categorical) 
RenalDiseaseIndicator: Renal disease indicator “0” (No) or “Y” (Yes) (chr) 
State: US state number (num) 
County: County (num) 
NoOfMonths_PartACov: Number of months Medicare Part A covered (num) 
NoOfMonths_PartBCov: Number of months Medicare Part B covered (num) 
ChronicCond_Alzheimer: Chronic condition Alzheimer 1 (Yes) or 2 (No) (num) 
ChronicCond_Heartfailure: Chronic condition Heart failure 1 (Yes) or 2 (No) (num) 
ChronicCond_KidneyDisease: Chronic condition Kidney Disease 1 (Yes) or 2 (No) (num) 
ChronicCond_Cancer: Chronic condition Cancer 1 (Yes) or 2 (No) (num) 
ChronicCond_ObstrPulmonary: Chronic condition Obstructive Pulmonary 1 (Yes) or 2 (No) (num) 
ChronicCond_Depression: Chronic condition Depression 1 (Yes) or 2 (No) (num) 
ChronicCond_Diabetes: Chronic condition Diabetes 1 (Yes) or 2 (No) (num) 
ChronicCond_IschemicHeart: Chronic condition Ischemic Heart 1 (Yes) or 2 (No) (num) 
ChronicCond_Osteoporasis: Chronic condition Osteoporosis 1 (Yes) or 2 (No) (num) 
ChronicCond_rheumatoidarthritis: Chronic condition Rheumatoid arthritis 1 (Yes) or 2 (No) (num) 
ChronicCond_stroke: Chronic condition Stroke 1 (Yes) or 2 (No) (num) 
IPAnnualReimbursementAmt: Inpatient annual reimbursement amount (num) 
IPAnnualDeductibleAmt: Inpatient annual deductible amount (num) 
OPAnnualReimbursementAmt: Outpatient annual reimbursement amount (num) 
OPAnnualDeductibleAmt: Outpatient annual deductible amount (num) 
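The coding of the indicator variables (1/2 for chronic conditions, "0"/"Y" for renal disease) is easy to misread, so recoding them to booleans early can help. The sketch below also derives age, a deceased flag and a count of chronic conditions. It rests on several assumptions that should be checked against the file: dates in ISO format, a missing DOD stored as an empty string, and age measured at date of death or at a reference date chosen by the analyst. The helper `recode_beneficiary` and the example row are invented.

```python
from datetime import date


def recode_beneficiary(row, as_of):
    """Return a cleaned copy of one beneficiary row."""
    clean = {"BeneID": row["BeneID"]}

    # Chronic conditions: 1 = Yes, 2 = No.
    for name, value in row.items():
        if name.startswith("ChronicCond_"):
            clean[name] = int(value) == 1

    # Renal disease: "Y" = Yes, "0" = No.
    clean["RenalDisease"] = row["RenalDiseaseIndicator"] == "Y"

    dob = date.fromisoformat(row["DOB"])
    dod = date.fromisoformat(row["DOD"]) if row["DOD"] else None
    end = dod or as_of
    # Subtract one year if the birthday has not yet occurred in the end year.
    clean["Age"] = end.year - dob.year - ((end.month, end.day) < (dob.month, dob.day))
    clean["Deceased"] = dod is not None

    n_chronic = sum(v for k, v in clean.items() if k.startswith("ChronicCond_"))
    clean["n_chronic"] = n_chronic
    return clean


# Invented row with only a few of the real columns.
row = {
    "BeneID": "B1",
    "DOB": "1940-06-15",
    "DOD": "",
    "RenalDiseaseIndicator": "0",
    "ChronicCond_Diabetes": 1,
    "ChronicCond_Cancer": 2,
}

clean = recode_beneficiary(row, as_of=date(2009, 12, 31))
print(clean)
```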
5 Resources 
• Data manipulation with R: dplyr (weblink here) 
• Merging with R (weblink here) 
• Tidy data in R (weblink here) 
• Exploratory Data Analysis with R (weblink here) 
• Data visualisation in R with ggplot2 for fancy plots (weblink here) 
• For any code-related question, google.com and stackoverflow.com are pretty helpful! 
• As usual you can ask your questions on the course Ed forum. 
6 Assignment submission procedure 
6.1 Turnitin submission 
Your assignment report must be uploaded as a single document and all parts must be in portrait 
format. As long as the due date is still in the future, you can resubmit your work; the previous version of your 
assignment will be replaced by the new version. 
Assignments must be submitted via the Turnitin submission box that is available on the course Moodle 
website. Turnitin reports on any similarities between your cohort’s assignments, and also with regard to 
other sources (such as the internet or all assignments submitted all around the world via Turnitin). More 
information is available at: [click]. Please read this page, as we will assume that you are familiar with its 
content. You can also find on the Moodle webpage the Turnitin Similarity Report Interpretation Guide 
(2019). 
Please also submit any programming code used in your analysis as a separate file in the dedicated 
“Code only” Moodle assignment box on the course webpage. These will be referred to by the marker only if 
needed, and in particular the report (with appendix) should be self-contained. 
You need to check your document once it is submitted (check it on-screen). We will not mark assignments 
that cannot be read on screen. 
Students are reminded of the risk that technical issues may delay or even prevent their submission (such 
as internet connection and/or computer breakdowns). Students should allow enough time (at least 24 
hours is recommended) between their submission and the due time. The Turnitin module will not 
let you submit a late report. Paper copies will not be accepted or graded. 
6.2 Late submission 
Please note that it is School policy that late submission of assignments will incur a penalty. 
The penalty is 25% of the mark the student would otherwise have obtained for each full (or part) day of 
lateness (e.g., 0 days 1 minute = 25% penalty, 2 days 21 hours = 75% penalty). Students who are late 
must submit their assignment to the LIC via e-mail. The LIC will then upload documents to the relevant 
submission boxes. The date and time of reception of the e-mail determines the submission time for the 
purposes of calculating the penalty. 
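The penalty rule above amounts to rounding the lateness up to whole days and multiplying by 25%. The sketch below reproduces the two examples given; the cap at 100% and the function name `late_penalty` are assumptions, since the policy text does not say what happens beyond four days.

```python
import math
from datetime import datetime, timedelta


def late_penalty(due, received, rate_per_day=0.25):
    """Fraction of the mark lost: 25% per full or part day late."""
    late_seconds = (received - due).total_seconds()
    if late_seconds <= 0:
        return 0.0
    days_late = math.ceil(late_seconds / 86400)  # any part of a day counts as a full day
    return min(1.0, rate_per_day * days_late)  # cap at 100% is an assumption


due = datetime(2020, 7, 1, 11, 55)  # due time stated in this brief

print(late_penalty(due, due + timedelta(minutes=1)))
print(late_penalty(due, due + timedelta(days=2, hours=21)))
```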
More information on Late submissions, extensions and special consideration is available in the Moodle course 
webpage section Additional resources from UNSW (at the bottom). 
6.3 Plagiarism awareness 
Students are reminded that the work they submit must be their own. While we have no problem with 
students working together on the assignment problems, the material students submit for assessment must 
be their own. 
Students should make sure they understand what plagiarism is; cases of plagiarism have a very high 
probability of being discovered. For issues of collective work, having different persons marking the assignment 
does not decrease this probability. 
More information on Academic integrity and plagiarism is available in the Moodle course webpage section 
Additional resources from UNSW (at the bottom). 
 