Instructor: Jonathan Wells
Email: wellsjon@grinnell.edu
Classroom: Noyce 2401
Class Time: T/Th 2:30 - 3:50pm
Office: Noyce 2249
Office Hours: Link to Office Hours Calendar (be sure to navigate on the calendar to the current week)
This course is an overview of modern approaches to analyzing and modeling large multivariate data sets across a variety of fields. Theory and implementation for common predictive techniques will be covered, including linear, penalized, and logistic regression, tree-based models, and ensemble models. Frameworks for model assessment, including the bias-variance trade-off, train-test splits, and resampling methods, will be discussed. This course will make extensive use of the R programming language.
Prerequisites: STA 209, or Instructor Consent.
Primary Textbook:
Secondary Textbooks:
Applied Predictive Modeling, 1st Edition by Kuhn and Johnson.
Tidy Modeling with R, 1st Edition by Kuhn and Silge
The following web-based resources will be used for communicating class information:
Course Website https://grinnell-sta-295-s24.github.io/ (announcements, documents, schedule, assignments)
GitHub Classroom https://classroom.github.com/classrooms/156477932-sta-295-s24 (homework submission).
You are encouraged to bring a personal computer to class each day for note-taking and live coding. Access to a computer with a web browser will be required for homework completion and submission.
We will make frequent use of the R programming language to perform calculations, sample random variables, and create probability models. Both R (the programming language) and RStudio (an editor and UI) are free to use, and can be accessed three ways:
Through the cloud on the Grinnell RStudio Server: https://rstudio.grinnell.edu/
On a classroom or library computer
On your own computer by downloading R (http://www.r-project.org/) and RStudio (http://www.rstudio.com/)
Throughout the term, we will use GitHub to manage and submit assignments. GitHub is a hosting service to house Git-based projects online, and is designed to assist with version control and collaboration on big projects. https://github.com/
If you would like to contact me, I can most easily be reached via email weekdays between 8am and 6pm. While I try to answer emails as soon as possible, in some cases, I may not be able to respond until the following school day. If you’d prefer to talk live, send me an email and we can schedule a time to chat on WebEx or Teams.
You are free and encouraged to attend any scheduled office hours without prior appointment. These are times I have specifically set aside for answering questions, discussing class material, and helping with other college business. If you have a matter you’d prefer to discuss one-on-one, or if none of the scheduled times fit your schedule, please email me and we can arrange another time to meet. On very rare occasions, I may need to reschedule office hours due to illness or other unavoidable conflict, and in these cases, I will notify the class via email.
By the end of the course, a student should be able to:
Articulate and compare the different philosophical approaches to prediction, statistical inference, classification, and clustering.
Create valid statistical models, perform data analysis using software, and communicate results in non-technical language using reproducible methods in order to answer a particular research question.
Implement simulation and randomization algorithms in order to demonstrate and assess properties of statistical models.
Assess and compare the performance of a variety of statistical models, and select appropriate models according to suitable criteria.
Apply statistical learning techniques to real-world data and problems.
Justify and describe properties of particular statistical learning methods by appealing to appropriate theory.
Use the R programming language to perform exploratory data analysis, create statistical models, and analyze model results.
Create and share reproducible reports using RMarkdown that include code, prose, and output.
Implement version control and perform data analysis project management using Git / GitHub.
A typical class day will involve the following:
Reading Assignment. Every class will have an assigned reading which should be completed prior to the start of class.
Active Lectures. Our 80-minute class meetings will include an interactive lecture by the instructor, with some time devoted to discussion either class-wide or in small groups.
Group Work. Typically during each class period, a portion of class time will be reserved for collaborative coding and group work with your peers.
A prepared student will attend class for 80 minutes per day, twice each week, and spend about three to four hours per day of class on work outside the classroom (reading, doing homework, working on projects, discussing, studying, etc.). Together, this represents a 9 - 11 hour per week commitment.
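The 9 - 11 hour figure can be checked with a little arithmetic, assuming the totals are rounded to the nearest whole hour:

\[
\underbrace{2 \times 80 \text{ min}}_{\text{in class}} = 160 \text{ min} \approx 2.7 \text{ hours}, \qquad \underbrace{2 \times (3 \text{ to } 4) \text{ hours}}_{\text{outside class}} = 6 \text{ to } 8 \text{ hours}
\]

\[
2.7 + 6 \approx 8.7 \text{ hours} \quad \text{to} \quad 2.7 + 8 \approx 10.7 \text{ hours per week}
\]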
Your grade in the class will be determined by your proficiency in each of the Course Outcomes, as demonstrated in the following assessments:
Letter grades will be assigned based on the following course percentages (with the upper and lower \(2\%\) of each division corresponding to \(+\) and \(-\), respectively).
Daily reading assignments will be posted on our course webpage and will list the specific section(s) to read for each day, along with a response question to be completed by 1pm each day of class (to give me time to review them before class). Responses should be posted as an Issue on the daily-reading repository of our GitHub organization. The questions are not intended to be overly difficult, but should help both you and me highlight topics that need further review. Responses will be assessed primarily on the basis of completion. No extensions on daily readings will be given, but up to three assignments may be missed without penalty.
A weekly problem set will be made available after class on Tuesday, due by 11:59pm the following Tuesday. Problem sets must be completed as a .Rmd file in RStudio and submitted via GitHub. Detailed submission instructions can be found on the course webpage. Up to two times throughout the term, you may request an extension of up to 5 days on an assignment. Except in extraordinary circumstances, requests must be made prior to an assignment’s due date.
The ability to immediately interrogate your beliefs and understanding through dialogue sets a live class apart from more passive means of education. For this reason, you are expected to attend class regularly and to actively participate by asking questions, responding to questions, and engaging in class discussion and collaborative coding.
Your participation score may be lowered for use of a phone, tablet or computer for non-class activities during class time, particularly if this behavior is disruptive to others in the classroom. It may also be lowered for failure to abide by the class Code of Conduct.
If you are unable to attend class, you should notify the instructor before class (or promptly after, if that’s not possible). You are responsible for independently catching up on the material missed, which you can do by:
Typically, you may miss up to two classes without penalty. However, prolonged or recurring illness, as well as other emergencies, may require individual adjustment, in which case you should contact the instructor as soon as possible to make appropriate arrangements.
Twice throughout the term, students will schedule an individual meeting with the instructor to review code the student submitted on a recent assignment, as well as discuss conceptual questions related to the topics covered in class.
Each meeting will last approximately 20 minutes.
The first review will take place during week 7 (3/4 - 3/8) or week 8 (3/11 - 3/15), and the second review will take place during week 11 (4/15 - 4/19) or week 12 (4/22 - 4/26).
Throughout the second half of the term, you will work in groups of 2 or 3 on a project that answers a significant research question using real-world data, by implementing the fundamental techniques developed in our class, as well as some more advanced methods from supplementary sources. The project will culminate in a 20-minute presentation during finals week and a 4-8 page reproducible technical report.
Grinnell College is committed to creating inclusive and accommodating learning environments. Please notify me as soon as possible if there are aspects of the instruction or design of this course that result in barriers to your participation. I also encourage you to have a conversation about and provide documentation of your disability to the Coordinator for Student Disability Resources, Jae Baldree baldreej@grinnell.edu. If you have already been approved for accommodations, please have Disability Resources provide a letter during the first week of classes, or as soon as possible after approval. I will then contact you to schedule a meeting during which we can discuss the particular implementation of your accommodations.
Grinnell College offers alternative options to complete academic work for students who observe religious holy days. Please contact me within the first three weeks of the semester if you would like to discuss how to meet the terms of your religious observance and also the requirements for this course.
Students are allowed and encouraged to collaborate on most in-class and homework assignments. However, any work that you turn in for grading must be your own. If you collaborate on homework, you should clearly indicate the names of your collaborators on the first page of your assignment.
You are welcome to use other paper or internet resources to supplement content we cover in this course, with the exception of existing solutions to homework problems. Copying or paraphrasing solutions from the internet or other sources is an example of academic dishonesty.
All written work that references material outside of the textbook or lecture should be accompanied by an appropriate citation. Because ChatGPT and other generative text models do not currently provide direct citations for sources used to generate responses, they should not be used when composing homework, project, or other class assignments.
I expect all members of the class to make participation a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
I expect everyone to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community of learners. Examples of unacceptable behavior include: using sexualized language or imagery, making insulting or derogatory comments, harassing someone publicly or privately, monopolizing discussion or otherwise preventing others from meaningfully participating. Instead, you can contribute to a positive learning environment by demonstrating empathy and kindness, being respectful of differing viewpoints and experiences, giving and gracefully accepting constructive feedback, and making space for everyone to contribute.
You will receive timely feedback on your homework via GitHub, either from me or the course grader (another Grinnell undergraduate). You are strongly encouraged to review comments on your solutions and rework missed problems. Each homework problem will be graded out of seven points, with five points for statistical content, and two points for the quality of writing and clarity of code. A general rubric for each problem can be found on the Homework page.
I strongly encourage you to attend my office hours each week. You are welcome to come either with specific questions, or just with general uncertainties about content we’ve discussed. If you are unable to attend scheduled office hours, please email me to schedule an alternative appointment (either in-person or virtual).
The Data Science and Social Inquiry Lab (DASIL) in HSSC S1310 is staffed by mentors who are experienced in R programming and may be able to troubleshoot coding problems you are having.
This is the schedule as of Day 1. A more up-to-date schedule can be found here.
Week | Sections Covered | Week | Sections Covered |
---|---|---|---|
1 | Intro to R | Spring Break | |
2 | Foundations | 9 | Classification |
3 | KNN, Linear Models | 10 | Tree-Based Models |
4 | Extending Linear Models | 11 | Tree-Based Models |
5 | Resampling Methods | 12 | SVM |
6 | Model Selection | 13 | Ensemble Models |
7 | Penalized Regression | 14 | Project Work |
8 | Beyond Linear Models | 15 | Finals Week; Presentations |