Philosophy and Approach to 197R

Three Skills

In this six-week course, we will (ambitiously) cover three skills: elementary data analysis, elementary programming, and visualizing data. These are highly interrelated! Programming is much more than telling a machine what analyses we want: it's a different way of thinking about and approaching problems, one well suited to analysis. Visualization also goes hand-in-hand with exploring data, understanding and evaluating models, and presenting results to others. Thus, we will interweave all three activities throughout the 12 meetings.

R will be the particular language we learn and use (within the RStudio environment). It is open-source and popular in many academic domains, particularly the life sciences, as well as in data science more generally. We will spend some time learning the particular syntax, libraries, and quirks of R. Students with programming experience may thus take the course primarily to learn R, while others may be learning coding or data analysis for the first time. Primarily, though, this is a "skills course" focused on getting things done with data, and students with external data projects will readily be able to make progress on them.

But Not Statistics

One thing we will not attempt to cover is statistics. We will discuss how certain broad classes of models (including linear regressions, analysis of variance (ANOVA), and decision trees) are implemented in R, how to run them, and how to retrieve and visualize their results. But the bigger, more serious questions are outside our scope: which model to choose, what assumptions each model makes, what its limitations are, and how to interpret it statistically.
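To give a flavor of what "running a model and retrieving results" can look like, here is a minimal sketch in base R. It uses the built-in mtcars data set, which is my choice for illustration and not necessarily a data set used in the course; the variable names are likewise illustrative. Decision trees are left out because they need an additional package.

```r
# Illustrative only: mtcars ships with R; the course may use other data.

# Fit a linear regression of fuel economy (mpg) on car weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)

# Retrieve results: the fitted coefficients and the R-squared value
coefs <- coef(fit)
r2 <- summary(fit)$r.squared

# One-way ANOVA: does mpg differ across cylinder counts?
aov_fit <- aov(mpg ~ factor(cyl), data = mtcars)
aov_table <- summary(aov_fit)[[1]]  # the ANOVA table as a data frame

# Visualize: scatterplot with the fitted regression line
plot(mpg ~ wt, data = mtcars)
abline(fit)
```

Note that the code says nothing about whether a straight line is an appropriate model for these data; that is exactly the kind of question left to a statistics course.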

Students with more statistical background should find the explanations provided in the course sufficient to jump into using more sophisticated models in R. Those with less background may find the course whets their appetite to learn more.

Minimal Libraries

R is highly extensible, with large repositories of user-created packages. Many of these introduce tools for particular domains, covering everything from survival analysis to genetic sequence data. Others extend the basic functionality of R, for instance by creating web applets with R code or doing animations. We therefore face a choice in an introductory course about which packages to teach from the get-go.

In general, I wish to keep this course focused on base R, so that the emphasis is on learning programming in general, and solving problems in general, rather than on using particular libraries. This also allows students to adopt whatever packages they need later. However, we will adopt two significant code packages.

ggplot2

Base R has decent enough plotting functions, but almost everyone now uses a package called ggplot2, where the "gg" stands for Grammar of Graphics. (Personally, I do not use this package, but I don't need to inflict my idiosyncrasies on my students.) We will look at base R graphics, and use ggplot2 primarily through another package called ggformula.
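As a rough sketch of the difference, the same scatterplot can be drawn with base R graphics and with ggformula. This assumes the ggformula package is installed, and it again uses the built-in mtcars data purely for illustration.

```r
# Assumes ggformula is installed: install.packages("ggformula")
library(ggformula)

# Base R graphics: formula interface to plot()
plot(mpg ~ wt, data = mtcars)

# ggformula: the same formula style, but producing a ggplot2 plot object
p <- gf_point(mpg ~ wt, data = mtcars)
print(p)
```

Both calls describe the plot with the formula mpg ~ wt, which is one reason ggformula can be a gentle route into ggplot2 for people who already know R's model formulas.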

data.table

We will also use the powerful data.table package (not to be confused with DataTables) toward the end of the course. This extends R's main tabular data structure significantly, allowing fast data extraction, aggregation, and joining with concise syntax.
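A small sketch of the three operations mentioned above, assuming the data.table package is installed. The mtcars data and the labels lookup table are illustrative choices of mine, not course material.

```r
# Assumes data.table is installed: install.packages("data.table")
library(data.table)

# Convert a built-in data frame, keeping the car names as a column
cars <- as.data.table(mtcars, keep.rownames = "model")

# Extraction: rows with mpg above 25, keeping three columns
efficient <- cars[mpg > 25, .(model, mpg, cyl)]

# Aggregation: mean mpg and number of cars per cylinder count
by_cyl <- cars[, .(mean_mpg = mean(mpg), n = .N), by = cyl]

# Joining: attach a hypothetical lookup table of size labels by cyl
labels <- data.table(cyl = c(4, 6, 8),
                     size = c("small", "medium", "large"))
joined <- labels[cars, on = "cyl"]
```

All three operations use the same bracket form, roughly dt[rows, columns, by], which is where the concise syntax comes from.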

We will not generally adopt the "Tidyverse" of related packages or the "tidy approach" to R programming à la Hadley Wickham, though we may touch on some Tidyverse packages, such as stringr. (Why? See here.) You may well want to learn the Tidyverse on your own, if it suits your style, after you learn the R basics.