This page contains an in-depth daily schedule for our STA 295 course. Be sure to check back frequently for updates.
Lecture Notes: What is Stat Learning?
(These tasks should be completed before the start of class on Tuesday)
Complete the pre-class survey (link in the Welcome email sent to your Grinnell email address)
Read Chapter 1 and Chapter 4 in Happy Git and GitHub for the useR
Create a GitHub account (if you don’t already have one)
Version Control with Git and GitHub
~Coding in R/RStudio~
The listed reading assignments should be completed prior to class
Foundations of Statistical Learning, Part 1
Coding in R/RStudio
The listed reading assignments should be completed prior to class
Foundations of Statistical Learning, Part 2
Coding in R
Simple Linear Regression
Inference for Linear Models
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Building Linear Models in R
Diagnosing Problems with Linear Models
Functions in R
Live code from class
The listed reading assignments should be completed prior to class
Read Sub-Section 3.3.3 in ISLR 2e
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Multilinear Models
Model Accuracy for Multilinear Models
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer ONE of the following two questions:
Suppose we want to predict the value of \(Y\) based on three variables \(X_1, X_2, X_3\). In what ways is a single multiple regression model for \(Y\) based on \(X_1, X_2, X_3\) different from creating three separate simple regression models for \(Y\) based on each of \(X_1\), \(X_2\), and \(X_3\) individually?
Give an intuitive explanation for why training \(R^2\) will always increase as more predictors are added to a model, while the number of observations is held fixed. Then explain why adding more predictors will not necessarily lead to a more accurate model.
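The behavior of training \(R^2\) described in the second question can be checked with a small simulation. The sketch below is written in Python with only the standard library (an illustrative choice, not the course's R workflow). The helper names `solve` and `training_r2`, the simulated data, and the seed are all assumptions made for this example. It fits least squares models with an intercept, adds pure-noise predictors one at a time, and reports the training \(R^2\) of each nested model.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def training_r2(columns, y):
    """Fit least squares with an intercept and return the training R^2."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(design[i][j] * design[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(design[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - rss / tss


random.seed(1)
n = 30
x1 = [random.gauss(0, 1) for _ in range(n)]          # one real predictor
y = [2 + 3 * v + random.gauss(0, 1) for v in x1]     # response depends only on x1
noise = [[random.gauss(0, 1) for _ in range(n)] for _ in range(5)]  # unrelated to y

# Nested models: x1 alone, then x1 plus 1, 2, ..., 5 noise predictors.
r2_values = [training_r2([x1] + noise[:k], y) for k in range(6)]
for k, r2 in enumerate(r2_values):
    print(f"{k} noise predictors: training R^2 = {r2:.4f}")
```

Because each larger model contains the smaller one as a special case (set the new coefficient to zero), the training RSS cannot go up, so the printed values form a non-decreasing sequence even though the added predictors carry no information about the response.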
Extending Multilinear Models
Interaction Terms
Transformations
Qualitative Predictors
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
More on MLR extensions
Non-Parametric Models
K-Nearest Neighbors
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
More K-Nearest Neighbors
KNN in R
Data Wrangling with dplyr
The listed reading assignments should be completed prior to class
dplyr package for data wrangling.
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
dplyr tutorial. Write a solution to the task using R code. Then, in 1 or 2 sentences, describe what your code does.
Validation Sets
Cross-Validation
Bootstrapping
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer ONE of the following questions:
In your own words, describe one problem that cross-fold validation attempts to solve.
Compare and contrast bootstrapping and cross-validation. In what ways are they similar? In what ways are they different?
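The difference in how the two resampling schemes split the data can be seen from the row indices alone. The following is a minimal sketch in standard-library Python (used here only for illustration; the course itself uses the rsample package in R). The function names `k_fold_indices` and `bootstrap_indices`, the sample size, and the seed are assumptions made for this example.

```python
import random


def k_fold_indices(n, k, rng):
    """Shuffle the row indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; rows never drawn are 'out of bag'."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(295)
n = 20

folds = k_fold_indices(n, 5, rng)
for i, fold in enumerate(folds, start=1):
    print(f"fold {i} (held out once): {sorted(fold)}")

in_bag, out_of_bag = bootstrap_indices(n, rng)
print("bootstrap sample:", sorted(in_bag))
print("out-of-bag rows: ", out_of_bag)
```

In k-fold cross-validation every row is held out exactly once and the folds do not overlap. A bootstrap sample has the same size as the original data but is drawn with replacement, so some rows appear several times and others are left out entirely.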
rsample package for cross-validation and bootstrapping
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Live Code:
for loops in R
The map functions from purrr
Tuning Parameters
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer ONE of the following three questions:
What is a for loop and why is it useful?
What does the map function do? Why is this useful?
Compare and contrast iterating using for loops vs iteration using the map function.
Feature Selection
Feature Engineering
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer ONE of the following two questions:
What is one advantage and one disadvantage of the forward or backward selection algorithms, compared to the best subset algorithm?
Why is it advantageous to use a selection criterion like \(C_p\), \(AIC\), or \(BIC\), over a criterion like \(RSS\), \(R^2\) or adjusted \(R^2\)?
Penalized Regression
Ridge Regression
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer the following question
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer ONE of the following two questions:
Briefly explain one similarity and one difference between Ridge Regression and LASSO.
What is one benefit of using LASSO to perform feature selection compared to the best subset algorithm?
Classification Problems
Assessing Classification Accuracy
The Bayes Classifier
KNN for Classification
The listed reading assignments should be completed prior to class
Read Section 2.2.3 in ISLR 2e
Read Section 11.2 and 11.3 of Applied Predictive Modeling (a free .pdf copy of the text is available through SpringerLink using Grinnell College credentials)
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer the following question:
Logistic Regression
Simple logistic regression
Predicting classes using logistic regression
Multiple logistic regression
Multinomial logistic regression
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer ONE of the following two questions:
What is one reason linear regression is not often used for classification problems?
Describe a particular real-world binary classification problem in which it is more costly to make one type of misclassification mistake than the other (i.e., if the two levels of the response are coded as 0 and 1, it is more costly to classify a true 0 as a 1 than to classify a true 1 as a 0).
Extensions of Logistic Regression
Multinomial models
Data transformations
Penalized models
Practice with Classification Problems
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Generative Probability Models
Naive Bayes
The listed reading assignments should be completed prior to class
Read Section 13.6 of Applied Predictive Modeling (a free .pdf copy of the text is available through SpringerLink using Grinnell College credentials)
Read Section 4.7.5 in ISLR 2e
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer ONE of the following two questions:
What is the “naive” assumption that is made in the Naive Bayes model? In what ways does this simplify the model?
Explain what each of the following means: “posterior probability of the class”, “prior probability of the outcome”, “prior probability of the predictor” and “conditional probability”
Decision Trees for Regression
Trees in R
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Classification Trees
Practice with Trees
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer ONE of the following two questions:
Choose either the Gini Index or the Information Statistic (or entropy) and briefly explain in 2 to 3 sentences why it is a measurement of node purity. Be sure to include which values correspond to high impurity and which to low impurity.
What parameters need to be tuned for classification or regression trees? What are the possible consequences of leaving these parameters at their default values?
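Both purity measures named in the first question are simple functions of the class proportions in a node. The sketch below, in standard-library Python (an illustrative choice rather than the course's R code), computes the Gini index as \(1 - \sum_k p_k^2\) and the entropy as \(-\sum_k p_k \log_2 p_k\). The function names and the example nodes are assumptions for this illustration, and the base-2 logarithm is also a choice made here; some texts use the natural logarithm, which only rescales the values.

```python
from math import log2


def gini(proportions):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    return 1 - sum(p * p for p in proportions)


def entropy(proportions):
    """Entropy of a node in bits; classes with proportion 0 contribute nothing."""
    return -sum(p * log2(p) for p in proportions if p > 0)


# Three hypothetical two-class nodes, from perfectly pure to evenly mixed.
nodes = {
    "pure": [1.0, 0.0],
    "mostly one class": [0.9, 0.1],
    "evenly mixed": [0.5, 0.5],
}

for label, props in nodes.items():
    print(f"{label}: Gini = {gini(props):.3f}, entropy = {entropy(props):.3f}")
```

Both measures are zero for a node containing a single class and reach their largest value when the classes are evenly mixed, so smaller values indicate purer nodes.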
Ensemble Models
Bagging
Random Forests
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Answer ONE of the following two questions:
What is one advantage and one disadvantage of Bagged Trees, compared to simple decision trees?
In what ways do Random Forests differ from Bagged Trees? Then explain why this difference tends to lead to increases in performance.
Random Forests
Boosted Trees
~Bayesian Additive Trees~
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Boosted Trees
Beyond the gbm package
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
tidymodels framework, compared to the workflow we’ve used earlier in the term? What is one disadvantage?
Boosted Trees using tidymodels
Finish discussing tidymodels
Work on final project
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
None
Discuss elements of effective presentations
Work on final project
The listed reading assignments should be completed prior to class
Responses to questions should be added as an answer on the day’s post in the Discussions section on the class GitHub Organization Grinnell-STA-295-S24.
These responses should be submitted before the start of class
Reflect on academic presentations you’ve witnessed as well as participated in.
Describe two specific elements of an effective presentation, and briefly discuss why these elements are important.
Describe one thing that should be avoided in an effective presentation (alternatively, one element that is often included in an ineffective presentation). Explain why it should be avoided.
Describe one specific element of an effective presentation about data and statistical models (this element should be specific to presentations about data, and not presentations in general).