This page contains an in-depth daily schedule for our STA 295 course. Be sure to check back frequently for updates.

Week 1

Tuesday 1-23

Lecture Notes: What is Stat Learning?

Lecture Notes: Using GitHub

Topics

Due

(These tasks should be completed before the start of class on Tuesday)

Thursday 1-25

Lecture Notes

Topics

Reading Assignment

The listed reading assignments should be completed prior to class


Week 2

Tuesday 1-30

Lecture Slides

Topics

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 2.1, as well as 2.3.1, 2.3.3 and 2.3.4 in ISLR 2e

Due

  • HW 1 due on Tuesday 1/30 at 11:59pm (Submit assignment by making a final push with all commits to your .Rmd and .pdf file to your copy of the hw_1 repo on github)

Thursday 2-1

Lecture Slides

Topics

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 2.2 in ISLR 2e

  • Complete this tutorial introducing the ggplot2 graphics package.


Week 3

Tuesday 2-6

Lecture Notes

Topics

  • Simple Linear Regression

  • Inference for Linear Models

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 3.1, along with Sub-sections 3.6.1, 3.6.2, and 3.6.7, in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. Give one example of a real-world prediction question that could be answered using a simple linear model. Then give an example of a real-world inference question that could be answered using a simple linear model.

Due

  • HW 2 due on Tuesday 2/6 at 11:59pm

Thursday 2-8

Lecture Notes

Topics

  • Building Linear Models in R

  • Diagnosing Problems with Linear Models

  • Functions in R

  • Live code from class

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Sub-Section 3.3.3 in ISLR 2e

    • Review Sub-sections 3.6.1, 3.6.2, and 3.6.7 (from Tuesday’s reading)
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. Choose one of the 6 potential problems that can occur when fitting a linear regression model in Section 3.3.3. Explain what this problem is, why it represents a cause for concern, and how it might be corrected, using language that would be understandable to a non-statistical audience.

Week 4

Tuesday 2-13

Lecture Notes

Topics

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 3.2, along with Sub-sections 3.6.3, in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer ONE of the following two questions:

  1. Suppose we want to predict the value of \(Y\) based on three variables \(X_1, X_2, X_3\). In what ways is a single multiple regression model for \(Y\) based on \(X_1, X_2, X_3\) different from creating 3 separate simple regression models for \(Y\) based on each of \(X_1\), \(X_2\) and \(X_3\) individually.

  2. Give an intuitive explanation for why training \(R^2\) will always increase as more predictors are added to a model, while the number of observations are held fixed. Then explain why adding more predictors will not necessarily lead to a more accurate model.

Due

  • HW 3 due on Tuesday 2/13 at 11:59pm

Thursday 2-15

Lecture Notes

Topics

  • Extending Multilinear Models

    • Interaction Terms

    • Transformations

    • Qualitative Predictors

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Sub-Sections 3.3.1, 3.3.2, 3.6.4, 3.6.5, 3.6.6 in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. Give a real-world example of a response and pair of predictors that may have a synergy or interaction effect, and explain why we might expect this effect based on context. (Recall: an interaction occurs when the effect of one variable on the response is amplified or diminished as the values of the other variable change)

Week 5

Tuesday 2-20

Lecture Notes

Topics

  • More on MLR extensions

  • Non-Parametric Models

  • K-Nearest Neighbors

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 3.5 in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. Discuss one benefit, and one drawback, that non-parametric methods (like KNN) have compared to parametric methods (like linear regression).

Due

  • HW 4 due on Tuesday 2/20 at 11:59pm

Thursday 2-22

Lecture Notes

Topics

  • More K-Nearest Neighbors

  • KNN in R

  • Data Wrangling with dplyr

Reading Assignment

The listed reading assignments should be completed prior to class

  • Complete this tutorial introducing the dplyr package for data wrangling.
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. Choose 1 of the “On Your Own” tasks listed in the dplyr tutorial. Write a solution to the task using R code. Then, in 1 or 2 sentences, describe what your code does.

Week 6

Tuesday 2-27

Lecture Notes

Topics

  • Validation Sets

  • Cross-Validation

  • Bootstrapping

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 5.1 (skip 5.1.5) and 5.2 in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer ONE of the following three questions:

  1. In your own words, describe one problem that cross-fold validation attempts to solve.

  2. In your own words, describe one problem that cross-fold validation attempts to solve.

  3. Compare and contrast bootstrapping and cross-validation. In what ways are they similar? In what ways are they different?

Due

  • HW 5 due on Tuesday 2/27 at 11:59pm

Thursday 2-29

Live Coding .Rmd

Topics

  • The rsample package for cross-validation and bootstrapping

Reading Assignment

The listed reading assignments should be completed prior to class

  • No new reading assignment for today.
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. No reading questions for today.

Week 7

Tuesday 3-5

Live Code:

Topics

  • for loops in R

  • The map functions from purrr

  • Tuning Parameters

Reading Assignment

The listed reading assignments should be completed prior to class

Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer ONE of the following three questions:

  1. What is a for loop and why is it useful?

  2. What does the map function do? Why is this useful?

  3. Compare and contrast iterating using for loops vs iteration using the map function.

Due

  • HW 6 due on Tuesday 3/5 at 11:59pm

Thursday 3-7

Lecture Notes

Topics

  • Feature Selection

  • Feature Engineering

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 6.1 and 6.5.1 in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer ONE of the following two questions:

  1. What is one advantage and one disadvantage that the forward or backward selection algorithms have over the best subset algorithm?

  2. Why is it advantageous to use a selection criterion like \(C_p\), \(AIC\), or \(BIC\), over a criterion like \(RSS\), \(R^2\) or adjusted \(R^2\)?


Week 8

Tuesday 3-12

Lecture Notes

Topics

  • Penalized Regression

  • Ridge Regression

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 6.2.1, 6.2.3, 6.5.2 (just the subsection on Ridge Regression) in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer the following question

  1. Compare the method of ridge regression (introduced in Section 6.2) to the method of least squares (the classic method in section 3.1) for finding coefficients of a linear model: Which method is likely to have higher bias? Which method is likely to have higher variance? Why is Ridge Regression likely to find a model with lower test RMSE than least squares?

Due

  • None

Thursday 3-14

Lecture Notes

Topics

  • LASSO

  • LASSO and Ridge Regression Code Example (created by Class Mentor: Patrick Kim)

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 6.2.2 (skip the subsection “Bayesian Interpretation”) and 6.5.2 (the subsection on LASSO) in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer ONE of the following two questions:

  1. Briefly explain one similarly and one difference between Ridge Regression and LASSO.

  2. What is one benefit of using LASSO to perform feature selection compared to the best subset algorithm?


Week 9

Tuesday 4-2

Lecture Notes

Topics

  • Classification Problems

  • Assessing Classification Accuracy

  • The Bayes Classifier

  • KNN for Classification

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 2.2.3 in ISLR 2e

  • Read Section 11.2 and 11.3 of Applied Predictive Modeling (a free .pdf copy of the text is available through SpringerLink using Grinnell College credentials)

Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer the following question:

  1. Choose three of the model evaluation measurements discussed in Sections 11.2 and 11.3 of Applied Predictive Modeling (Accuracy, Sensitivity, Specificity, Kappa, PPV, NPV, PCF, ROC Curve, Lift Charts); briefly describe how the measurement is calculated and then explain why we would be interested in this measurement.

Due

  • None

Thursday 4-4

Lecture Notes

Topics

  • Logistic Regression

    • Simple logistic regression

    • Predicting classes using logistic regression

    • Multiple logistic regression

    • Multinomial logistic regression

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 4.1, 4.2, 4.3, 4.7.1 and 4.7.2 in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer ONE of the following two questions:

  1. What is one reason linear regression is not often used for classification problems?

  2. Describe a particular real-world binary classification problem in which it is more costly to make one type of misclassification mistake than the other (i.e. if the two levels of the response are coded as 0 and 1, it is more costly to classify a true 0 as a 1 than to classify a true 1 as a 0.)


Week 10

Tuesday 4-9

Lecture Notes

Topics

  • Extensions of Logistic Regression

    • Multinomial models

    • Data transformations

    • Penalized models

  • Practice with Classification Problems

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 12.5 of Applied Predictive Modeling (a free .pdf copy of the text is available through SpringerLink using Grinnell College credentials)
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  • In what ways is Penalized Logistic Regression similar to Penalized Linear Regression? In what ways is it different?

Due

  • None

Thursday 4-11

Lecture Notes

Topics

  • Generative Probability Models

  • Naive Bayes

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 13.6 of Applied Predictive Modeling (a free .pdf copy of the text is available through SpringerLink using Grinnell College credentials)

  • Read Section 4.7.5 in ISLR 2e

Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer ONE of the following two questions:

  1. What is the “naive” assumption that is made in the Naive Bayes model? In what ways does this simplify the model?

  2. Explain what each of the following means: “posterior probability of the class”, “prior probability of the outcome”, “prior probability of the predictor” and “conditional probability”


Week 11

Tuesday 4-16

Lecture Notes

Topics

  • Decision Trees for Regression

  • Trees in R

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 8.1 and 8.8 (just the part about Single Trees) of Applied Predictive Modeling (a free .pdf copy of the text is available through SpringerLink using Grinnell College credentials)
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. Suppose you want to predict the temperature in Grinnell at 2:30pm on Tuesday, APril 16th. Give an explicit example of a decision tree that uses two variables and at least 3 (but no more than 5) nodes that you could use to make this prediction. (Your tree doesn’t need to be the “best” tree, but should be plausible based on what you know about Grinnell weather.)

Due

  • HW 8

Thursday 4-18

Lecture Notes

Topics

  • Classification Trees

  • Practice with Trees

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 14.1 and 14.8 (just the part about Classification Trees) of Applied Predictive Modeling (a free .pdf copy of the text is available through SpringerLink using Grinnell College credentials)
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer ONE of the following two questions:

  1. Choose one of either Gini Index or the Information Statistic (or entropy) and briefly explain in 2 to 3 sentences why it is a measurement of node purity. Be sure to include what values correspond to high and to lower impurities.

  2. What parameters need to be tuned for classification or regression trees? What are the possible consequences of leaving these parameters at their default values?


Week 12

Tuesday 4-23

Lecture Notes

Topics

  • Ensemble Models

  • Bagging

  • Random Forests

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read Section 8.2.1, 8.2.2 and 8.3.3 in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

Answer ONE of the following two questions:

  1. What is one advantage and one disadvantage of Bagged Trees, compared to simple decision trees?

  2. In what ways do Random Forests differ from Bagged Trees? Then explain why this difference tends to lead to increases in performance.

Due

  • None

Thursday 4-25

Lecture Notes

Topics

  • Random Forests

  • Boosted Trees

  • ~Bayesian Additive Trees~

Reading Assignment

The listed reading assignments should be completed prior to class

  • Read 8.2.3, 8.2.4, 8.2.5, 8.3.4 and 8.3.5 in ISLR 2e
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. Compare and contrast two of the three tree-based methods: random forests, boosting, Bayesian additive trees

Week 13

Tuesday 4-30

Lecture Notes

Notes on Tuning Boosted Trees

Topics

  • Boosted Trees

  • Beyond the gbm package

Reading Assignment

The listed reading assignments should be completed prior to class

Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. Consider an occasion when you encountered a new challenge, puzzle or problem which required multiple attempts to solve. Describe the challenge, puzzle or problem, and detail the strategy you used to overcome it.

Due

  • None

Thursday 5-2

Lecture Notes

Topics

  • Tidymodels

Reading Assignment

The listed reading assignments should be completed prior to class

Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. Based on the reading for today, what is one advantage of the tidymodels framework, compared to the workflow we’ve used earlier in the term? What is one disadvantage?

Week 14

Tuesday 5-7

Boosted Trees using tidymodels

Topics

  • Finish discussing tidymodels

  • Work on final project

Reading Assignment

The listed reading assignments should be completed prior to class

  • None
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

None

Due

  • HW 10

Thursday 5-2

Topics

  • Discuss elements of effective presentations

  • Work on final project

Reading Assignment

The listed reading assignments should be completed prior to class

  • None
Discussion Question

Responses to questions should be added as an answer on the day’s post in the Discussions section on the class Github Organization Grinnell-STA-295-S24. These responses should be submitted before the start of class

  1. Reflect on academic presentations you’ve witnessed as well as participated in.

    • Describe two specific elements of an effective presentation, and briefly discuss why these elements are important.

    • Describe one thing that should be avoided in an effective presentation (alternatively, one element that is often included in an ineffective presentation). Explain why it should be avoided.

    • Describe one specific element of an effective presentation about data and statistical models (this element should be specific to presentations about data, and not presentations in general).