One of the most important functions of the working statistician is to investigate and answer significant research questions by analyzing real-world data, using a variety of elementary and advanced modeling techniques, and to distill the results into reports that are accessible to the non-statistician.
You will work in small groups to explore a topic of interest to you, build appropriate predictive models to answer a research question, and then summarize your results in a short presentation to the class as well as in a technical report submitted to the instructor.
Group Membership: 5pm Friday, March 15th (Week 8) via Email
Project Proposal: 5pm Monday, April 8th (Week 10) on GitHub
Draft of Data and Exploratory Data Analysis sections: 5pm Monday, April 22nd (Week 12) on GitHub
Technical Report Rough Draft: 5pm Friday, May 10th (Week 14) on GitHub
Presentation: Thursday, May 16th 9am - noon
Final Technical Report: Friday, May 17th by 5pm on GitHub
During your project, you will …
Collect multivariate data appropriate for regression or classification tasks.
Articulate clear and compelling research questions that can be answered by building predictive models on an appropriate data set.
Implement data wrangling in order to pre-process data for analysis.
Perform exploratory data analysis using data visualizations and descriptive statistics to understand the structure of your data.
Build and assess predictive statistical models in order to answer the research question.
Craft a clear, engaging narrative answering the research question in a technical report.
Share the results of your investigation with your peers through a presentation.
You will arrange yourselves into groups of 2 - 3 students each. One person in your group should submit a list of group members to me via email by 5pm on Friday, March 15th. On Friday evening, I will sort any students who didn’t express group preferences into groups.
For this project, you will need to collect or obtain a rich multivariate data set with many observations that can be used to build predictive models to answer a research question. Generally, this data will need to contain a categorical or quantitative response variable, at least 4 predictors, and at least 100 observations (although larger data sets are encouraged).
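As a rough illustration of the size guidelines above, a candidate data set can be screened with a few lines of base R. This is only a sketch: the built-in iris data and the choice of Species as the response are stand-ins for your own data, and the variable names (candidate, response, meets_guidelines) are hypothetical.

```r
# Illustrative only: iris stands in for a candidate data set,
# with Species playing the role of the response variable.
candidate <- iris
response <- "Species"

n_obs <- nrow(candidate)             # number of observations
n_predictors <- ncol(candidate) - 1  # every column other than the response
n_complete <- sum(complete.cases(candidate))  # rows with no missing values

# The response should be categorical (a factor) or quantitative (numeric).
response_ok <- is.factor(candidate[[response]]) || is.numeric(candidate[[response]])

meets_guidelines <- n_obs >= 100 && n_predictors >= 4 && response_ok
meets_guidelines
```

Passing this check does not by itself make a data set suitable; it only confirms the minimum counts listed above.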
Some resources for finding appropriate data can be found below:
Grinnell College Libraries Data Best Bets, a list of large, general-purpose, user-friendly aggregations of data covering a variety of topics.
Stat2Labs, a site providing project-based materials that emphasize real-world applications and conceptual understanding.
Grinnell Data Analysis and Social Inquiry Lab Downloadable Data, a page dedicated to several DASIL-affiliated data sets for download.
UC Irvine Machine Learning Repository, a collection of curated data sets widely used by the Machine / Statistical Learning community for model building.
Kaggle, an extremely large repository of data sets covering a wide variety of topics. Warning: The quality, usability, and authenticity of these data sets are not as thoroughly assessed as those from other sources; use data from this site with caution.
By 5pm Monday, April 8th, your group will draft a well-written project proposal outlining your project and upload the .pdf document to your group’s GitHub repository. The proposal should include the following information:
At least 1 paragraph of background information on the topic you wish to study.
A precise statement of the research question you wish to answer.
A candidate data set (or sets) that you can analyze.
A description of the type of data you will use to answer your question, and a list of variables you might include in your analysis.
At least 1 paragraph describing the utility of an answer to your research question, or a discussion of why an answer would be interesting or relevant to you.
A brief discussion of any obstacles you foresee either in data acquisition or analysis.
Before you build any models, you should perform appropriate exploratory (or descriptive) analysis. This might include:
Data wrangling, including joining two or more data sets into a single set, converting quantitative variables to categorical or collapsing categorical variables to ones with fewer levels, renaming variables and/or variable levels, and creating new variables from existing ones.
Descriptive statistics for all variables you intend to investigate. For quantitative variables, this includes: mean, standard deviation, 5-number summaries, and histograms and/or boxplots; and for categorical variables, this includes: a list of all factor levels, as well as counts and proportions within each level, and bar charts.
Exploration of the relationships between variables, both numerically and graphically. Consider not only the relationship between the response and explanatory variables, but also between two (or more) explanatory variables.
You should not focus on building statistical models at this stage.
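The exploratory steps listed above might look something like the following in base R. This is a minimal sketch, not a required workflow: the built-in mtcars data stands in for your own, and the new variable names (miles_per_gallon, transmission, cyl_group) are made up for illustration.

```r
# Illustrative only: mtcars stands in for your own project data.
cars <- mtcars

# Data wrangling: rename a variable and give a coded variable labeled levels.
names(cars)[names(cars) == "mpg"] <- "miles_per_gallon"
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Collapse a categorical variable to fewer levels (4, 6, 8 cylinders -> "4" vs "6+").
cars$cyl_group <- factor(ifelse(cars$cyl == 4, "4", "6+"))

# Quantitative variable: mean, standard deviation, 5-number summary.
mpg_mean <- mean(cars$miles_per_gallon)
mpg_sd <- sd(cars$miles_per_gallon)
mpg_five <- fivenum(cars$miles_per_gallon)

# Categorical variable: counts and proportions within each level.
trans_counts <- table(cars$transmission)
trans_props <- prop.table(trans_counts)

# Relationships: response vs. a predictor, and predictor vs. predictor.
cor_response <- cor(cars$miles_per_gallon, cars$wt)
cor_predictors <- cor(cars$wt, cars$hp)

# Graphical summaries.
hist(cars$miles_per_gallon, main = "Fuel efficiency", xlab = "Miles per gallon")
boxplot(miles_per_gallon ~ transmission, data = cars,
        xlab = "Transmission", ylab = "Miles per gallon")
barplot(trans_counts, main = "Transmission type", ylab = "Count")
plot(cars$wt, cars$miles_per_gallon,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

Packages such as dplyr and ggplot2 offer equivalent tools; base R is used here only to keep the sketch free of dependencies.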
You will summarize your exploratory data analysis in a 2 - 3 page exploratory analysis report, uploaded to your GitHub repo by 5pm Monday, April 22nd. This report should include:
A short paragraph introducing your data and the primary research question.
A description of the variables of interest to your investigation.
An overview of any data wrangling that you performed (you do not need to show the code or the code output, just describe what you did and why).
Graphs and summary tables from your exploratory analysis, along with discussion and interpretation of the results; you do not need to include every summary statistic you calculated or graph you made, but should focus on the most relevant or important ones.
Brief description of your plans for model building.
A final draft of your technical report should be between 3 and 5 pages in length, and is to be uploaded to GitHub by 5pm on Friday, May 17th. The technical report should be a .Rmd file with output either to .pdf or .html. Your report should contain the following sections:
An overview of the topic and relevant background information, a discussion of existing theories and models, a description of how your investigation differs from prior ones, and a precise statement of your research question.
A description of the data sets used, a discussion of where the data came from and how it was obtained, a summary of the data itself (including the number of observations and variables, and what each observational unit represents), and an explanation of the data processing implemented to prepare the data for analysis.
A presentation of graphical and numerical summaries of the data (along with a discussion of their relevance to modeling assumptions and further analysis), a description of the statistical methods used to analyze your data, and diagnostics of the appropriateness of any models or inference procedures you will apply in the Results section. You do not need to include every graph you created during your research, only those that are most relevant to your results.
A description of the tools and methods used to build your models, an overview of the models themselves and a summary of their attributes, a discussion of model comparisons and accuracy, and a presentation of model predictions, classifications, and/or parameter inference.
A review of the results generated from the model and synthesis with the context from which the data was generated or observed, a restatement of the research objective and an answer to the original research question, and a discussion of the limitations of the study as well as areas for further research.
A collection of code used to process data, perform analysis, and build models. To avoid excessive run-times when compiling the document, consider adding eval = F to the chunk options (which will force the code not to be run when compiling the document into .pdf or .html).
The citations for any data sets, literature or resources directly or indirectly referenced in your report, along with any sources you consulted during your investigation that had a significant impact on your analysis. Citations can be made either according to AP guidelines or [Chicago guidelines](https://owl.purdue.edu/owl/research_and_citation/chicago_manual_17th_edition/cmos_formatting_and_style_guide/chicago_manual_of_style_17th_edition.html).
During finals week, each group will give a 10 - 15 minute presentation to the class outlining their project and results. Fifteen minutes is not a lot of time, so groups should plan carefully what they will discuss. The structure of the talk should mirror the structure of the technical report (albeit greatly abbreviated). Groups should create slides or an .html page that can be projected in order to engage the audience.
The final technical report should…
Be composed as an .Rmd file, and then exported to .pdf.
Use the standard font and margin sizes.
Be between 3 and 5 pages in length, including graphics and tables.
Include a title page, with project title and author names. This page does not count towards the page limit.
Display the code used to perform your analysis ONLY if reading the code is necessary for understanding the output.
Include the output of the code (summary statistics, visualizations, and the results of any inference where appropriate).
Include graphics with appropriate axis labels and titles, and of reasonable size (i.e., that do not take up a half-page of the document unless absolutely necessary).
Include tables that are neatly formatted and legible.
A page break can be created in R Markdown using a page-break command (for .pdf output, LaTeX's \newpage is one option that works), which should be on its own line after the metadata (the block ending with --- at the top of the .Rmd) but before the start of your document prose.
To have code run when you knit, but not display, replace the chunk header {r} with {r echo = F}.
If you also don’t want the output of the code to display (for example, for data processing steps), use {r echo = F, include = F}.
You can control the size of included graphics by adding {r fig.width = ..., fig.height = ...} to your chunk options, where … is replaced with the desired width/height of the graphic in inches.
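The chunk options above are set one chunk at a time. As an alternative, knitr also lets you set defaults for the whole document with opts_chunk$set(), typically in a setup chunk near the top of the .Rmd. The sketch below is one possible configuration; the specific width and height values are arbitrary examples, not course requirements.

```r
# Assumes the knitr package is installed (it is needed to knit any .Rmd).
library(knitr)

# Defaults applied to every chunk that follows this call.
opts_chunk$set(
  echo = FALSE,    # run the code but do not display it
  fig.width = 5,   # figure width in inches (example value)
  fig.height = 3   # figure height in inches (example value)
)
```

Individual chunks can still override these defaults in their own headers, for instance to display code that is necessary for understanding the output.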
You can create nice tables for .pdf output by piping (%>%) a data frame into the kable() function from the knitr package. Some table formatting options can be controlled through inputs in the kable function. For a description, run ?kable in the console.
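As a small sketch of the kable() usage described above, the following builds a summary table from the built-in mtcars data (a stand-in for your own data). The column labels and caption are invented for illustration; digits, col.names, and caption are standard kable() arguments.

```r
# Assumes the knitr package is installed.
library(knitr)

# Illustrative only: summarize a built-in data set in place of your own.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Format the summary as a table with rounded values and readable headers.
tbl <- kable(
  mpg_by_cyl,
  digits = 1,
  col.names = c("Cylinders", "Mean MPG"),
  caption = "Mean fuel efficiency by number of cylinders"
)
tbl
```

Calling kable() directly, as here, is equivalent to piping the data frame into it; the direct call just avoids loading a package that provides %>%.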
The final technical report will be assessed on the following:
Requirements: The degree to which the report adhered to the Document Specifications above.
Content: The depth and accuracy of analysis, as well as the appropriateness of the methods used.
Style: The degree to which the text is well-organized, well-written, coherent and understandable.
The presentation should…
Be accompanied by a slideshow with appropriate narrative, data summaries and graphics
Incorporate each group member in a speaking role in a significant way
Be between 10 and 15 minutes in length
Be well-rehearsed prior to delivery
Be delivered without reading verbatim from the slides or from notes (notes can be briefly referenced during the presentation, but should not be extensively consulted)
The presentation will be assessed on the following:
Requirements: The degree to which the presentation adhered to the Presentation Specifications above.
Content: The clarity, depth and accuracy of analysis, as well as the appropriateness of the methods used.
Style: The degree to which the written slides are well-organized, well-written, coherent and understandable.
Delivery: The degree to which the oral presentation is well-organized, well-rehearsed, coherent, understandable and engaging.