Meghan Hall
NEAIR
July 12, 2022
R is an open-source (free!) scripting language for working with data
My personal Excel nightmare
The magic of R is that it’s reproducible (by someone else or by yourself in six months)
Keeps data separate from code (data preparation steps)
You need the R language
And also the software for working with it (RStudio)

project files are here
imported data shows up here
code can go here

project files are here
imported data shows up here
code can also go here
You use R via packages
…which contain functions
…which are just verbs

faculty
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1005 | Lecturer | Chemistry | |
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1079 | Lecturer | Music | |
| 2021-22 | 1086 | Assistant Professor | Music | |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
courses
| semester | course_id | faculty_id | dept | enrollment | level |
|---|---|---|---|---|---|
| 20212202 | 10605 | 1772 | Physics | 7 | UG |
| 20212202 | 10605 | 1772 | Physics | 32 | GR |
| 20212202 | 11426 | 1820 | Political Science | 8 | UG |
| 20212202 | 12048 | 1914 | English | 24 | UG |
| 20212202 | 13269 | 1095 | Sociology | 48 | UG |
| 20212202 | 13517 | 1086 | Music | 17 | UG |
<-
“save as”
Option + - (Alt + - on Windows)
%>%
“and then”
Cmd + Shift + M (Ctrl + Shift + M on Windows)
filter keeps or discards rows (aka observations)
select keeps or discards columns (aka variables)
arrange sorts data set by certain variable(s)
count tallies data set by certain variable(s)
mutate creates new variables
group_by/summarize aggregates data (pivot tables!)
str_* functions work easily with text
function(data, argument(s))
is the same as
data %>%
function(argument(s))
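A minimal sketch of that equivalence, assuming dplyr is installed and using a few rows copied from the faculty table (the object names nested and piped are only illustrative):

```r
library(dplyr)

# A few rows copied from the faculty table shown on the slides
faculty <- tibble(
  id    = c(1005, 1022, 1059),
  rank  = c("Lecturer", "Professor", "Professor"),
  dept1 = c("Chemistry", "Physics", "Physics")
)

# Data passed as the first argument...
nested <- filter(faculty, rank == "Professor")

# ...or data piped into the function
piped <- faculty %>%
  filter(rank == "Professor")

# Both forms return the same rows
identical(nested, piped)
```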
filter keeps or discards rows (aka observations)
the == operator tests for equality
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
| 2021-22 | 1118 | Assistant Professor | Sociology | |
| 2021-22 | 1161 | Assistant Professor | Sociology | |
| 2021-22 | 1191 | Professor | Sociology | |
| 2021-22 | 1216 | Associate Professor | Sociology | American Studies |
| 2021-22 | 1273 | Assistant Professor | Sociology | |
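A table like the one above could come from a filter on dept1. This is a sketch on the first six faculty rows only, so it returns a single Sociology row rather than the full result:

```r
library(dplyr)

# First six rows of the faculty table shown on the slides
faculty <- tibble(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# Keep only rows where dept1 is exactly "Sociology"
sociology <- faculty %>%
  filter(dept1 == "Sociology")

sociology
```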
the | operator signifies “or”
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
| 2021-22 | 1118 | Assistant Professor | Sociology | |
| 2021-22 | 1161 | Assistant Professor | Sociology | |
| 2021-22 | 1191 | Professor | Sociology | |
the %in% operator allows for multiple options in a list
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1079 | Lecturer | Music | |
| 2021-22 | 1086 | Assistant Professor | Music | |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
| 2021-22 | 1118 | Assistant Professor | Sociology | |
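A sketch of both operators on a six-row subset of faculty. The department choices are guesses based on the tables shown, and either and in_set are illustrative names:

```r
library(dplyr)

# Subset of the faculty table shown on the slides
faculty <- tibble(
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology")
)

# | keeps rows that meet either condition
either <- faculty %>%
  filter(dept1 == "Physics" | dept1 == "Sociology")

# %in% checks dept1 against several options at once
in_set <- faculty %>%
  filter(dept1 %in% c("Physics", "Music", "Sociology"))

in_set
```

With more than two options, %in% is usually easier to read than a long chain of | conditions.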
the & operator combines conditions
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1191 | Professor | Sociology | |
| 2021-22 | 1201 | Professor | Physics | |
| 2021-22 | 1209 | Professor | Music | |
| 2021-22 | 1421 | Professor | Physics | Engineering |
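One way to get professors in a set of departments, sketched on a six-row subset of faculty (the exact conditions on the original slide are an assumption):

```r
library(dplyr)

# Subset of the faculty table shown on the slides
faculty <- tibble(
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology")
)

# & keeps rows only when both conditions are TRUE
professors <- faculty %>%
  filter(dept1 %in% c("Physics", "Music", "Sociology") & rank == "Professor")

professors
```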
select keeps or discards columns (aka variables)
can drop columns with -column
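A short sketch of keeping and dropping columns, using three rows of faculty (kept and dropped are illustrative names):

```r
library(dplyr)

# Three rows of the faculty table shown on the slides
faculty <- tibble(
  year  = "2021-22",
  id    = c(1005, 1022, 1059),
  rank  = c("Lecturer", "Professor", "Professor"),
  dept1 = c("Chemistry", "Physics", "Physics"),
  dept2 = c(NA, "Engineering", NA)
)

# Keep only the named columns
kept <- faculty %>%
  select(id, rank)

# Drop one column and keep everything else
dropped <- faculty %>%
  select(-dept2)

names(dropped)
```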
the pipe %>% chains multiple functions together
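For instance, a filter followed by a select, sketched on a six-row subset of faculty (music_ranks is an illustrative name):

```r
library(dplyr)

# Subset of the faculty table shown on the slides
faculty <- tibble(
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology")
)

# Read as: take faculty, and then filter, and then select
music_ranks <- faculty %>%
  filter(dept1 == "Music") %>%
  select(id, rank)

music_ranks
```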
arrange sorts data set by certain variable(s)
use desc() to get descending order
| semester | course_id | faculty_id | dept | enrollment | level |
|---|---|---|---|---|---|
| 20212201 | 10511 | 1005 | Chemistry | 50 | UG |
| 20212201 | 15934 | 1421 | Physics | 50 | UG |
| 20192002 | 13850 | 1105 | Chemistry | 50 | UG |
| 20181901 | 17773 | 1942 | Music | 50 | UG |
| 20212202 | 13269 | 1095 | Sociology | 48 | UG |
| 20202101 | 16202 | 1816 | Political Science | 48 | UG |
can sort by multiple variables
| semester | course_id | faculty_id | dept | enrollment | level |
|---|---|---|---|---|---|
| 20212201 | 10511 | 1005 | Chemistry | 50 | UG |
| 20192002 | 13850 | 1105 | Chemistry | 50 | UG |
| 20202102 | 13850 | 1258 | Chemistry | 39 | UG |
| 20202102 | 16606 | 1393 | Chemistry | 38 | UG |
| 20202101 | 16540 | 1784 | Chemistry | 38 | UG |
| 20181901 | 10511 | 1829 | Chemistry | 36 | UG |
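A sketch of both kinds of sort, using the six courses rows shown earlier (by_size and by_dept are illustrative names, and the sort columns are inferred from the tables):

```r
library(dplyr)

# Six rows of the courses table shown on the slides
courses <- tibble(
  course_id  = c(10605, 10605, 11426, 12048, 13269, 13517),
  dept       = c("Physics", "Physics", "Political Science",
                 "English", "Sociology", "Music"),
  enrollment = c(7, 32, 8, 24, 48, 17),
  level      = c("UG", "GR", "UG", "UG", "UG", "UG")
)

# Largest enrollment first
by_size <- courses %>%
  arrange(desc(enrollment))

# Alphabetical by department, then largest enrollment first within each
by_dept <- courses %>%
  arrange(dept, desc(enrollment))

by_dept
```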
count tallies data set by certain variable(s) (very useful for familiarizing yourself with data)
can use sort = TRUE to order results
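A sketch of count() with and without sorting, on a six-row subset of faculty (by_rank and by_rank_sorted are illustrative names):

```r
library(dplyr)

# Subset of the faculty table shown on the slides
faculty <- tibble(
  id   = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank = c("Lecturer", "Professor", "Professor", "Lecturer",
           "Assistant Professor", "Adjunct Instructor")
)

# One row per rank, with the tally in a column named n
by_rank <- faculty %>%
  count(rank)

# Same tally, largest groups first
by_rank_sorted <- faculty %>%
  count(rank, sort = TRUE)

by_rank_sorted
```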
mutate creates new variables (with a single =)
| year | id | rank | dept1 | dept2 | new |
|---|---|---|---|---|---|
| 2021-22 | 1005 | Lecturer | Chemistry | | hello! |
| 2021-22 | 1022 | Professor | Physics | Engineering | hello! |
| 2021-22 | 1059 | Professor | Physics | | hello! |
| 2021-22 | 1079 | Lecturer | Music | | hello! |
| 2021-22 | 1086 | Assistant Professor | Music | | hello! |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | | hello! |
much more useful with a conditional such as ifelse(), which has three arguments:
condition, value if true, value if false
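A sketch of mutate() with ifelse(). The column name prof and its two labels are hypothetical, chosen only to show the three arguments:

```r
library(dplyr)

# Three rows of the faculty table shown on the slides
faculty <- tibble(
  id   = c(1005, 1022, 1086),
  rank = c("Lecturer", "Professor", "Assistant Professor")
)

# ifelse(condition, value if true, value if false)
flagged <- faculty %>%
  mutate(prof = ifelse(rank == "Professor", "full professor", "other rank"))

flagged
```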
the ! operator means not
is.na() identifies missing (NA) values
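Combining the two, !is.na(dept2) is TRUE wherever a second department is recorded. A sketch, with joint as a hypothetical column name:

```r
library(dplyr)

# Three rows of the faculty table shown on the slides
faculty <- tibble(
  id    = c(1005, 1022, 1059),
  dept1 = c("Chemistry", "Physics", "Physics"),
  dept2 = c(NA, "Engineering", NA)
)

# Label faculty by whether dept2 is filled in
flagged <- faculty %>%
  mutate(joint = ifelse(!is.na(dept2), "joint", "single"))

flagged
```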
with multiple conditions, case_when() is much easier!
| dept1 | division |
|---|---|
| Chemistry | Sciences |
| Physics | Sciences |
| Physics | Sciences |
| Music | Humanities |
| Music | Humanities |
| Sociology | Social Sciences |
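The division column above could be built roughly like this. The mapping from department to division is read off the table, and the slide's actual code may differ:

```r
library(dplyr)

# dept1 values from the first six faculty rows
faculty <- tibble(
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology")
)

# Each line is condition ~ value; the first matching condition wins
divisions <- faculty %>%
  mutate(division = case_when(
    dept1 %in% c("Chemistry", "Physics") ~ "Sciences",
    dept1 == "Music" ~ "Humanities",
    dept1 == "Sociology" ~ "Social Sciences"
  ))

divisions
```

Any department that matches none of the conditions gets NA.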
group_by/summarize aggregates data (pivot tables!)
group_by() identifies the grouping variable(s) and summarize() specifies the aggregation
useful functions within summarize:
mean(), median(), sd(), min(), max(), n()
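A sketch of a grouped summary on the six courses rows shown earlier. The summary column names n_courses, avg_enr, and max_enr are illustrative:

```r
library(dplyr)

# Six rows of the courses table shown on the slides
courses <- tibble(
  dept       = c("Physics", "Physics", "Political Science",
                 "English", "Sociology", "Music"),
  enrollment = c(7, 32, 8, 24, 48, 17)
)

# One row per department, with three aggregations
by_dept <- courses %>%
  group_by(dept) %>%
  summarize(
    n_courses = n(),
    avg_enr   = mean(enrollment),
    max_enr   = max(enrollment)
  )

by_dept
```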

project files are here
imported data shows up here
code can also go here
Typing in the console
think of it like a post-it: useful for quick notes but disposable
actions are saved but code is not
one chunk of code is run at a time (Return)
Typing in a code file
script files have a .R extension
code is saved and sections of any size can be run (Cmd + Return)
do ~95% of your typing in a code file instead of the console!
packages need to be installed on each computer you use
packages need to be loaded/attached with library() at the beginning of every session
can access help files by typing ?mutate (opens that function's help page) or ??tidyverse (searches all installed help files) in the console
highly recommend using projects to stay organized
keeps code files and data files together, allowing for easier file path navigation and better reproducible work habits

project files are here
imported data shows up here
code can also go here

click big green Code button and select “Download ZIP”, then open neair.Rproj
use read_csv() to import a csv file
the readxl package is helpful for Excel files
view the data with View(faculty) or by clicking on the data name in the Environment pane
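A self-contained sketch of read_csv(). It first writes a two-row CSV to a temporary file so the example runs anywhere; in the workshop project you would point read_csv() at the real data file instead (the temporary path is purely an assumption of this example):

```r
library(readr)

# Write a tiny CSV to a temporary location (stand-in for a real file)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("year,id,rank,dept1,dept2",
    "2021-22,1005,Lecturer,Chemistry,",
    "2021-22,1022,Professor,Physics,Engineering"),
  csv_path
)

# Import it as a data frame; empty cells become NA
faculty <- read_csv(csv_path, show_col_types = FALSE)

faculty
```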
functions from stringr (which all start with str_) are useful for working with text data
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1086 | Assistant Professor | Music | |
| 2021-22 | 1118 | Assistant Professor | Sociology | |
| 2021-22 | 1158 | Assistant Professor | Political Science | |
| 2021-22 | 1161 | Assistant Professor | Sociology | |
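The table above looks like every rank containing the word "Professor". A sketch with str_detect(), which is a guess at the function used on the slide:

```r
library(dplyr)
library(stringr)

# Subset of the faculty table shown on the slides
faculty <- tibble(
  id   = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank = c("Lecturer", "Professor", "Professor", "Lecturer",
           "Assistant Professor", "Adjunct Instructor")
)

# str_detect() returns TRUE where the pattern appears anywhere in the text
professors <- faculty %>%
  filter(str_detect(rank, "Professor"))

professors
```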
cheat sheet of functions is here
existing faculty data has one row per faculty, some with multiple departments (sometimes known as wide data)
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1005 | Lecturer | Chemistry | |
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1079 | Lecturer | Music | |
| 2021-22 | 1086 | Assistant Professor | Music | |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
what if you instead want one row per faculty per department? (sometimes known as long data)
| year | id | rank | dept_no | dept |
|---|---|---|---|---|
| 2021-22 | 1005 | Lecturer | dept1 | Chemistry |
| 2021-22 | 1022 | Professor | dept1 | Physics |
| 2021-22 | 1022 | Professor | dept2 | Engineering |
| 2021-22 | 1059 | Professor | dept1 | Physics |
| 2021-22 | 1079 | Lecturer | dept1 | Music |
| 2021-22 | 1086 | Assistant Professor | dept1 | Music |
the pivot_longer function lengthens data
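A sketch that reproduces the long table's column names (dept_no, dept) from the wide faculty data. Dropping the empty dept2 rows with values_drop_na is an assumption based on the long table having no blank departments:

```r
library(dplyr)
library(tidyr)

# First six rows of the faculty table shown on the slides
faculty <- tibble(
  year  = "2021-22",
  id    = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank  = c("Lecturer", "Professor", "Professor", "Lecturer",
            "Assistant Professor", "Adjunct Instructor"),
  dept1 = c("Chemistry", "Physics", "Physics", "Music", "Music", "Sociology"),
  dept2 = c(NA, "Engineering", NA, NA, NA, NA)
)

# Stack dept1 and dept2 into one row per faculty member per department
faculty_long <- faculty %>%
  pivot_longer(
    cols = c(dept1, dept2),
    names_to = "dept_no",
    values_to = "dept",
    values_drop_na = TRUE
  )

faculty_long
```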
and pivot_wider does the opposite!
| semester | course_id | faculty_id | dept | enrollment | level |
|---|---|---|---|---|---|
| 20212202 | 10605 | 1772 | Physics | 7 | UG |
| 20212202 | 10605 | 1772 | Physics | 32 | GR |
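A sketch of widening those two rows so that each level gets its own enrollment column (the choice of level and enrollment as the pivoted columns is an assumption):

```r
library(dplyr)
library(tidyr)

# The two rows for course 10605 shown above
courses <- tibble(
  semester   = "20212202",
  course_id  = c(10605, 10605),
  faculty_id = c(1772, 1772),
  dept       = c("Physics", "Physics"),
  enrollment = c(7, 32),
  level      = c("UG", "GR")
)

# One row per course, with UG and GR enrollment side by side
courses_wide <- courses %>%
  pivot_wider(names_from = level, values_from = enrollment)

courses_wide
```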
R has many useful functions for handling relational data
all you need is at least one key variable that connects data sets
left_join is most common, but there are more
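A sketch of left_join() with differently named key columns, using subsets of the two tables (joined is an illustrative name):

```r
library(dplyr)

# Subset of the courses table shown on the slides
courses <- tibble(
  course_id  = c(10605, 11426, 13269, 13517),
  faculty_id = c(1772, 1820, 1095, 1086),
  enrollment = c(7, 8, 48, 17)
)

# Subset of the faculty table shown on the slides
faculty <- tibble(
  id   = c(1086, 1095),
  rank = c("Assistant Professor", "Adjunct Instructor")
)

# Keep every row of courses; add rank where faculty_id matches id
joined <- courses %>%
  left_join(faculty, by = c("faculty_id" = "id"))

joined
```

Courses taught by someone missing from faculty stay in the result with NA for rank.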
what’s the average UG enrollment per year, per faculty rank?
faculty
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1005 | Lecturer | Chemistry | |
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1079 | Lecturer | Music | |
courses
| semester | course_id | faculty_id | dept | enrollment | level |
|---|---|---|---|---|---|
| 20212202 | 10605 | 1772 | Physics | 7 | UG |
| 20212202 | 10605 | 1772 | Physics | 32 | GR |
| 20212202 | 11426 | 1820 | Political Science | 8 | UG |
| 20212202 | 12048 | 1914 | English | 24 | UG |
faculty$id is the same as courses$faculty_id
what’s the average UG enrollment per year, per faculty rank?
| semester | course_id | faculty_id | dept | enrollment | level |
|---|---|---|---|---|---|
| 20212202 | 10605 | 1772 | Physics | 7 | UG |
| 20212202 | 10605 | 1772 | Physics | 32 | GR |
| 20212202 | 11426 | 1820 | Political Science | 8 | UG |
| 20212202 | 12048 | 1914 | English | 24 | UG |
| 20212202 | 13269 | 1095 | Sociology | 48 | UG |
UG courses only
year variable again
enrollment by year and faculty_id
use the <- operator to create a new data frame courses_UG
filter to undergraduate courses only and mutate a new academic year variable
group_by year and faculty member; summarize enrollment
| year | faculty_id | enr |
|---|---|---|
| 2018-19 | 1059 | 35 |
| 2018-19 | 1086 | 14 |
| 2018-19 | 1102 | 37 |
| 2018-19 | 1203 | 25 |
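The three steps could be sketched as follows. Two things here are assumptions: that the first six digits of semester encode the academic year (20212202 becomes 2021-22), and that enrollment is summed per faculty member. Semester is stored as text to keep the string handling simple:

```r
library(dplyr)
library(stringr)

# A few courses rows taken from the tables on the slides
courses <- tibble(
  semester   = c("20212202", "20212202", "20212202", "20212202", "20212201"),
  faculty_id = c(1772, 1772, 1095, 1086, 1005),
  enrollment = c(7, 32, 48, 17, 50),
  level      = c("UG", "GR", "UG", "UG", "UG")
)

courses_UG <- courses %>%
  # undergraduate courses only
  filter(level == "UG") %>%
  # build an academic year such as "2021-22" from the semester code
  mutate(year = str_c(str_sub(semester, 1, 4), "-", str_sub(semester, 5, 6))) %>%
  # total enrollment per year and faculty member
  group_by(year, faculty_id) %>%
  summarize(enr = sum(enrollment), .groups = "drop")

courses_UG
```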
what’s the average UG enrollment per year, per faculty rank?
faculty
| year | id | rank | dept1 | dept2 |
|---|---|---|---|---|
| 2021-22 | 1005 | Lecturer | Chemistry | |
| 2021-22 | 1022 | Professor | Physics | Engineering |
| 2021-22 | 1059 | Professor | Physics | |
| 2021-22 | 1079 | Lecturer | Music | |
| 2021-22 | 1086 | Assistant Professor | Music | |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | |
courses_UG
| year | faculty_id | enr |
|---|---|---|
| 2021-22 | 1005 | 50 |
| 2021-22 | 1086 | 17 |
| 2021-22 | 1095 | 48 |
| 2021-22 | 1128 | 32 |
| 2021-22 | 1147 | 32 |
| 2021-22 | 1191 | 7 |
| year | id | rank | dept1 | dept2 | enr |
|---|---|---|---|---|---|
| 2021-22 | 1005 | Lecturer | Chemistry | | 50 |
| 2021-22 | 1022 | Professor | Physics | Engineering | |
| 2021-22 | 1059 | Professor | Physics | | |
| 2021-22 | 1079 | Lecturer | Music | | |
| 2021-22 | 1086 | Assistant Professor | Music | | 17 |
| 2021-22 | 1095 | Adjunct Instructor | Sociology | | 48 |
what’s the average UG enrollment per year, per faculty rank?
| year | rank | avg_enr |
|---|---|---|
| 2021-22 | Adjunct Instructor | 34.66667 |
| 2021-22 | Assistant Professor | 23.60000 |
| 2021-22 | Associate Professor | 17.25000 |
| 2021-22 | Lecturer | 31.83333 |
| 2021-22 | Professor | 32.16667 |
| 2021-22 | Visiting Researcher | |
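A sketch of the final step: join the per-faculty enrollment onto faculty, then average by year and rank. It uses the six-row subsets shown above, so the numbers differ from the full result; ignoring faculty without UG enrollment via na.rm = TRUE is an assumption:

```r
library(dplyr)

# First six rows of the faculty table shown on the slides
faculty <- tibble(
  year = "2021-22",
  id   = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank = c("Lecturer", "Professor", "Professor", "Lecturer",
           "Assistant Professor", "Adjunct Instructor")
)

# Three rows of the courses_UG table shown on the slides
courses_UG <- tibble(
  year       = "2021-22",
  faculty_id = c(1005, 1086, 1095),
  enr        = c(50, 17, 48)
)

fac_enr <- faculty %>%
  # match on both the year and the faculty identifier
  left_join(courses_UG, by = c("year" = "year", "id" = "faculty_id")) %>%
  group_by(year, rank) %>%
  # ranks with no UG enrollment at all end up with a missing average
  summarize(avg_enr = mean(enr, na.rm = TRUE), .groups = "drop")

fac_enr
```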
ggplot2 is the data visualization package that is loaded with the tidyverse
the grammar of graphics maps data to the aesthetic attributes (e.g., color, shape, size) of geometric objects (e.g., points, lines, bars)
encoding data into visual cues (e.g., length, color, position, size) is how we signify changes and comparisons
1 2 3 4 5 6
to combine ggplot2 layers into one code chunk, use + instead of %>%
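A bare-bones sketch of the switch from %>% to +, on a six-row subset of faculty. The choice of a bar chart of counts by rank is only an illustration:

```r
library(dplyr)
library(ggplot2)

# Subset of the faculty table shown on the slides
faculty <- tibble(
  id   = c(1005, 1022, 1059, 1079, 1086, 1095),
  rank = c("Lecturer", "Professor", "Professor", "Lecturer",
           "Assistant Professor", "Adjunct Instructor")
)

# %>% passes data between dplyr steps; + adds layers once ggplot() is called
p <- faculty %>%
  count(rank) %>%
  ggplot(aes(x = rank, y = n)) +
  geom_col()

p
```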

can create a prettier plot pretty easily
faculty %>%
  count(rank) %>%
  ggplot(aes(x = reorder(rank, -n), y = n)) +
  geom_bar(stat = "identity", fill = "#cc0000") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  geom_text(aes(label = n), vjust = -0.5) +
  labs(x = NULL, y = NULL,
       title = "Count of faculty by rank, 2018-2021") +
  theme_linedraw() +
  theme(panel.grid.major.x = element_blank(),
        axis.ticks = element_blank())

fac_enr %>%
  filter(!is.na(avg_enr)) %>%
  ggplot(aes(x = year, y = avg_enr, group = rank, color = rank)) +
  geom_line() +
  geom_point() +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  labs(x = NULL, y = "Average enrollment",
       title = "Average undergraduate enrollment per rank over time") +
  theme_linedraw() +
  theme(panel.grid.major.x = element_blank(),
        axis.ticks = element_blank(),
        legend.title = element_blank(),
        legend.background = element_rect(fill = NA),
        legend.key = element_rect(fill = NA),
        legend.position = c(0.85, 0.82))
from R for Data Science
Data Visualization: a practical introduction
creating custom themes
the ggplot2 book
the R graph gallery
with what we’ve done so far, your .R file could:
and that file would make it extremely easy for you or someone else to reproduce this analysis with new data in six months
using RStudio, create .Rmd documents that combine text, code, and graphics
many output formats: html, pdf, Word, slides
exceedingly useful for parameterized reporting: can create an R-based PDF report and generate it automatically for, say, each department
you can also create your own packages!
your package can hold:
ggplot2 themes
can be stored on a shared drive to facilitate collaboration
R Markdown
the official R Markdown website
R Markdown: The Definitive Guide
internal packages
a comprehensive theoretical explainer
a talk I gave earlier this year on the topic
R for Data Science: the ultimate guide
R for Excel users: a very useful workshop
STAT 545: an online book on reproducible data analysis in R
the RStudio Education site
the Learn tidyverse site