Stat 4914 Project 2
Project 2: Supervised and Unsupervised Learning
The main focus of this project will be supervised and unsupervised learning.
Using the ncaam2025 data files on Carmen, in addition to the bracket you submitted on March 20, reevaluate your picks to determine which picks seemed good and bad in hindsight.
Limit your report to 15 pages, including reproducible code. Hide unnecessary code output (library loading, ggplot code, etc.) but include relevant exploratory and modeling code (collinearity screening, model building, assumption checking, etc.). Caption figures and tables, cite all sources, and proofread your submission.
a) Data Selection
Use a subset of your choosing from the files provided on Carmen in the ncaam2025 folder:
list.files("Datasets\\ncaam2025")
##  [1] "coaches.csv"           "conf_team_mapping.csv" "kenpom_defense.csv"
##  [4] "kenpom_efficiency.csv" "kenpom_height.csv"     "kenpom_misc.csv"
##  [7] "kenpom_offense.csv"    "kenpom_pointdist.csv"  "kenpom_summary.csv"
## [10] "mm2002_2025.csv"       "postseason.csv"
Feel free to find and use additional, high-fidelity data of your choosing; if you do, please provide a reference.
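As a rough illustration of the data selection step, the sketch below reads several CSV files from a folder and joins them into one team-level table. It does not touch the real Carmen files: it writes two tiny stand-in files to a temporary directory. The file names follow the listing above, but the column names (Season, TeamName, AdjOE, AdjDE) and every value are assumptions made up for this example, so check the actual headers before adapting it.

```r
# Toy stand-in for the course folder: two tiny CSVs in a temporary directory.
# Column names and values are hypothetical, not taken from the real files.
data_dir <- file.path(tempdir(), "ncaam2025_demo")
dir.create(data_dir, showWarnings = FALSE)

write.csv(
  data.frame(Season = 2025,
             TeamName = c("Team A", "Team B", "Team C"),
             AdjOE = c(120.1, 112.4, 108.9)),
  file.path(data_dir, "kenpom_offense.csv"), row.names = FALSE
)
write.csv(
  data.frame(Season = 2025,
             TeamName = c("Team A", "Team B", "Team C"),
             AdjDE = c(92.3, 97.8, 101.5)),
  file.path(data_dir, "kenpom_defense.csv"), row.names = FALSE
)

# Read every CSV in the folder into a named list of data frames.
paths  <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(paths, read.csv, stringsAsFactors = FALSE)
names(tables) <- tools::file_path_sans_ext(basename(paths))

# Join the tables on the (assumed) shared team-season keys.
teams <- Reduce(function(x, y) merge(x, y, by = c("Season", "TeamName")), tables)
str(teams)
```

Joining on an explicit key pair, rather than relying on row order, guards against files that list teams in different orders or cover different seasons.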
b) Exploratory Analysis
Conduct an exploratory analysis of your data, including preliminary variable screening and exploratory plots. Your exploration should be thorough and provide high-level visualizations and interpretation in context. Consider several aspects of a successful college basketball team - offense, defense, experience, coaching, level of competition, etc.
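One possible form of preliminary variable screening is a pairwise correlation check that flags highly collinear predictors. The sketch below runs on synthetic data; the variable names (adj_off, efg_pct, adj_def, height) and the 0.8 cutoff are illustrative assumptions, not values taken from the course files.

```r
set.seed(4914)

# Synthetic team metrics; efg_pct is deliberately built to track adj_off.
n   <- 60
off <- rnorm(n, mean = 110, sd = 6)
toy <- data.frame(
  adj_off = off,
  efg_pct = 50 + 0.4 * (off - 110) + rnorm(n, sd = 0.8),
  adj_def = rnorm(n, mean = 100, sd = 5),
  height  = rnorm(n, mean = 77, sd = 1)
)

# Pairwise Pearson correlations among the candidate predictors.
cors <- cor(toy)

# Flag each pair once (upper triangle only) when |r| exceeds the cutoff.
cutoff <- 0.8
idx <- which(abs(cors) > cutoff & upper.tri(cors), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)
print(flagged)
```

A flagged pair is a prompt for judgment rather than an automatic deletion: which variable to keep depends on interpretability in the basketball context.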
c) Unsupervised Learning
Unsupervised learning can serve two purposes in this data set: first, to screen for important variables, and second, to serve as a proxy for determining which teams are more similar (and hence more plausibly successful in the tournament). Implement at least two unsupervised clustering methods: K-means, DBSCAN, t-SNE, hierarchical clustering, principal components analysis, or other methods of your choosing. Note that two is the minimum.
Create appropriate visualizations of your unsupervised clusters.
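As a minimal sketch of two of the methods named above, the example below runs K-means and hierarchical clustering on standardized synthetic data, then uses principal components to draw the clusters in two dimensions. The variables (adj_off, adj_def, tempo), the two-group structure, and the choice of k = 2 are assumptions for illustration only. With real data, the number of clusters needs its own justification.

```r
set.seed(4914)

# Synthetic "teams": two loose groups on three made-up metrics.
n <- 40
toy <- data.frame(
  adj_off = c(rnorm(n / 2, 118, 3), rnorm(n / 2, 104, 3)),
  adj_def = c(rnorm(n / 2, 94, 3),  rnorm(n / 2, 103, 3)),
  tempo   = rnorm(n, 68, 2)
)

# Standardize so no variable dominates the distance calculations.
x <- scale(toy)

# K-means with k = 2; nstart = 25 reruns from several random starts.
km <- kmeans(x, centers = 2, nstart = 25)

# Hierarchical clustering: Ward linkage on Euclidean distances, cut at 2 groups.
hc <- hclust(dist(x), method = "ward.D2")
hc_cluster <- cutree(hc, k = 2)

# Cross-tabulate the two labelings (cluster numbers themselves are arbitrary).
print(table(kmeans = km$cluster, hclust = hc_cluster))

# Project onto the first two principal components for a 2-D display.
pc <- prcomp(x)
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     main = "K-means clusters on the first two principal components")
```

Comparing labelings from two methods is a quick stability check: clusters that survive a change of algorithm are easier to defend in the write-up.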
d) Supervised Learning
Supervised learning can serve to predict teams’ success. Because you don’t have the full win-loss data for the season, you must use another variable to measure success, such as total number of wins, tournament seed, or another metric.
Implement at least three supervised learning methods. Methods like multiple regression, GLMs, and linear mixed models count as supervised learning, but we covered those in Project 1. You may implement them here for comparison, but please consider the following methods: nonlinear regression, spline regression, principal components regression, LASSO/Ridge, random forest, classification/regression trees, XGBoost, Naive Bayes, K-nearest neighbors, or other methods of your choosing. Note that three is the minimum.
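To illustrate how several supervised methods might be compared on held-out data, the sketch below fits a spline regression, a principal components regression, and a hand-rolled K-nearest neighbors regression using only packages that ship with R. Everything about the data is an assumption: the predictors, the simulated wins response, the 150/50 split, and the tuning values (df = 3, two components, k = 7) are placeholders, not recommendations.

```r
library(splines)  # ships with R; provides ns() for natural cubic splines

set.seed(4914)

# Synthetic predictors and a made-up "wins" response standing in for success.
n <- 200
toy <- data.frame(
  adj_off = rnorm(n, 110, 6),
  adj_def = rnorm(n, 100, 5),
  tempo   = rnorm(n, 68, 2)
)
toy$wins <- 18 + 0.6 * (toy$adj_off - 110) - 0.5 * (toy$adj_def - 100) +
  rnorm(n, sd = 2)

# Hold out 50 rows for evaluation.
train     <- sample(n, 150)
test_rows <- setdiff(seq_len(n), train)
preds     <- c("adj_off", "adj_def", "tempo")
rmse      <- function(obs, pred) sqrt(mean((obs - pred)^2))

# 1) Spline regression: natural cubic splines on two predictors.
fit_spline  <- lm(wins ~ ns(adj_off, df = 3) + ns(adj_def, df = 3),
                  data = toy[train, ])
pred_spline <- predict(fit_spline, newdata = toy[test_rows, ])

# 2) Principal components regression: PCA on training predictors only,
#    then regress the response on the first two component scores.
pca       <- prcomp(toy[train, preds], center = TRUE, scale. = TRUE)
train_pcs <- data.frame(pca$x[, 1:2], wins = toy$wins[train])
fit_pcr   <- lm(wins ~ PC1 + PC2, data = train_pcs)
test_pcs  <- data.frame(predict(pca, newdata = toy[test_rows, preds])[, 1:2])
pred_pcr  <- predict(fit_pcr, newdata = test_pcs)

# 3) K-nearest neighbors regression, written by hand: average the response
#    of the k closest training rows in standardized predictor space.
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}
x_train <- scale(toy[train, preds])
x_test  <- scale(toy[test_rows, preds],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))
pred_knn <- knn_predict(x_train, toy$wins[train], x_test, k = 7)

# Held-out RMSE for each method, with a mean-only baseline for reference.
obs <- toy$wins[test_rows]
results <- c(
  mean_only = rmse(obs, mean(toy$wins[train])),
  spline    = rmse(obs, pred_spline),
  pcr       = rmse(obs, pred_pcr),
  knn       = rmse(obs, pred_knn)
)
print(round(results, 2))
```

All preprocessing (PCA loadings, centering, scaling) is estimated from the training rows and then applied to the test rows, which keeps the held-out error honest.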
e) Analysis and Findings
Provide a professional write-up of your findings, describing both the statistical and practical findings of your analysis.