Streamlining with R

Meghan Hall
NEAIR
July 12, 2022

Housekeeping

Intro 👋
Workshop materials ⬇️
Break 🕘
By the end of today ✔️
Today’s plan 📋

Today’s plan

What is R? How can it ease the burden of repeated reporting?
Basic functions for manipulating data
Using R effectively
More data manipulation
Visualizing data
A peek at advanced topics

What is R?

1 2 3 4 5 6

What is R?

1 2 3 4 5 6

R is an open-source (free!) scripting language for working with data

The benefits of R

1 2 3 4 5 6

My personal Excel nightmare

The magic of R is that it’s reproducible (by someone else or by yourself in six months)

Keeps data separate from code (data preparation steps)

Getting R

1 2 3 4 5 6

You need the R language

And also the software

Navigating RStudio

1 2 3 4 5 6

project files are here

imported data shows up here

code can go here

Navigating RStudio

1 2 3 4 5 6

project files are here

imported data shows up here

code can also
go here

Using R

1 2 3 4 5 6

You use R via packages

…which contain functions

…which are just verbs

Today’s data

1 2 3 4 5 6

faculty

year	id	rank	dept1	dept2
2021-22	1005	Lecturer	Chemistry
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1079	Lecturer	Music
2021-22	1086	Assistant Professor	Music
2021-22	1095	Adjunct Instructor	Sociology

Today’s data

1 2 3 4 5 6

courses

semester	course_id	faculty_id	dept	enrollment	level
20212202	10605	1772	Physics	7	UG
20212202	10605	1772	Physics	32	GR
20212202	11426	1820	Political Science	8	UG
20212202	12048	1914	English	24	UG
20212202	13269	1095	Sociology	48	UG
20212202	13517	1086	Music	17	UG

Basic data manipulation

1 2 3 4 5 6

Useful operators

1 2 3 4 5 6

<-

“save as”

opt + -

%>%

“and then”

Cmd + shift + m

Common functions

1 2 3 4 5 6

filter keeps or discards rows (aka observations)

select keeps or discards columns (aka variables)

arrange sorts data set by certain variable(s)

count tallies data set by certain variable(s)

mutate creates new variables

group_by/summarize aggregates data (pivot tables!)

str_* functions work easily with text

Syntax of a function

1 2 3 4 5 6

function(data, argument(s))

is the same as

data %>%

function(argument(s))

Filter

1 2 3 4 5 6

filter keeps or discards rows (aka observations)

the == operator tests for equality

faculty %>% 
  filter(dept1 == "Sociology")

year	id	rank	dept1	dept2
2021-22	1095	Adjunct Instructor	Sociology
2021-22	1118	Assistant Professor	Sociology
2021-22	1161	Assistant Professor	Sociology
2021-22	1191	Professor	Sociology
2021-22	1216	Associate Professor	Sociology	American Studies
2021-22	1273	Assistant Professor	Sociology

Filter

1 2 3 4 5 6

the | operator signifies “or”

faculty %>% 
  filter(dept1 == "Sociology" | 
           dept1 == "Physics")

year	id	rank	dept1	dept2
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1095	Adjunct Instructor	Sociology
2021-22	1118	Assistant Professor	Sociology
2021-22	1161	Assistant Professor	Sociology
2021-22	1191	Professor	Sociology

Filter

1 2 3 4 5 6

the %in% operator allows for multiple options in a list

faculty %>% 
  filter(dept1 %in% c("Sociology",
                      "Physics",
                      "Music"))

year	id	rank	dept1	dept2
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1079	Lecturer	Music
2021-22	1086	Assistant Professor	Music
2021-22	1095	Adjunct Instructor	Sociology
2021-22	1118	Assistant Professor	Sociology

Filter

1 2 3 4 5 6

the & operator combines conditions

faculty %>% 
  filter(dept1 %in% c("Sociology",
                      "Physics",
                      "Music") &
         rank == "Professor")

year	id	rank	dept1	dept2
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1191	Professor	Sociology
2021-22	1201	Professor	Physics
2021-22	1209	Professor	Music
2021-22	1421	Professor	Physics	Engineering

Select

1 2 3 4 5 6

select keeps or discards columns (aka variables)

faculty %>% 
  select(id, dept1, rank)

id	dept1	rank
1005	Chemistry	Lecturer
1022	Physics	Professor
1059	Physics	Professor
1079	Music	Lecturer
1086	Music	Assistant Professor
1095	Sociology	Adjunct Instructor

Select

1 2 3 4 5 6

can drop columns with -column

faculty %>% 
  select(-dept2)

year	id	rank	dept1
2021-22	1005	Lecturer	Chemistry
2021-22	1022	Professor	Physics
2021-22	1059	Professor	Physics
2021-22	1079	Lecturer	Music
2021-22	1086	Assistant Professor	Music
2021-22	1095	Adjunct Instructor	Sociology

Select

1 2 3 4 5 6

the pipe %>% chains multiple functions together

faculty %>% 
  select(id, dept1, rank) %>% 
  filter(rank == "Professor")

id	dept1	rank
1022	Physics	Professor
1059	Physics	Professor
1191	Sociology	Professor
1201	Physics	Professor
1209	Music	Professor
1407	English	Professor

Arrange

1 2 3 4 5 6

arrange sorts data set by certain variable(s)

use desc() to get descending order

courses %>% 
  arrange(desc(enrollment))

semester	course_id	faculty_id	dept	enrollment	level
20212201	10511	1005	Chemistry	50	UG
20212201	15934	1421	Physics	50	UG
20192002	13850	1105	Chemistry	50	UG
20181901	17773	1942	Music	50	UG
20212202	13269	1095	Sociology	48	UG
20202101	16202	1816	Political Science	48	UG

Arrange

1 2 3 4 5 6

can sort by multiple variables

courses %>% 
  arrange(dept, desc(enrollment))

semester	course_id	faculty_id	dept	enrollment	level
20212201	10511	1005	Chemistry	50	UG
20192002	13850	1105	Chemistry	50	UG
20202102	13850	1258	Chemistry	39	UG
20202102	16606	1393	Chemistry	38	UG
20202101	16540	1784	Chemistry	38	UG
20181901	10511	1829	Chemistry	36	UG

Count

1 2 3 4 5 6

count tallies data set by certain variable(s) (very useful for familiarizing yourself with data)

courses %>% 
  count(dept)

dept	n
Chemistry	16
English	18
Music	17
Physics	19
Political Science	17
Sociology	17

Count

1 2 3 4 5 6

can use sort = TRUE to order results

courses %>% 
  count(dept, level, sort = TRUE)

dept	level	n
Chemistry	UG	16
English	UG	16
Music	UG	16
Physics	UG	16
Political Science	UG	16
Sociology	UG	16
Physics	GR	3
English	GR	2
Music	GR	1
Political Science	GR	1
Sociology	GR	1

Mutate

1 2 3 4 5 6

mutate creates new variables (with a single =)

faculty %>% 
  mutate(new = "hello!")

year	id	rank	dept1	dept2	new
2021-22	1005	Lecturer	Chemistry		hello!
2021-22	1022	Professor	Physics	Engineering	hello!
2021-22	1059	Professor	Physics		hello!
2021-22	1079	Lecturer	Music		hello!
2021-22	1086	Assistant Professor	Music		hello!
2021-22	1095	Adjunct Instructor	Sociology		hello!

Mutate

1 2 3 4 5 6

much more useful with a conditional such as ifelse(), which has three arguments:

condition, value if true, value if false

faculty %>% 
  mutate(prof = ifelse(rank == "Professor",
                       1, 0)) %>% 
  select(rank, prof)

rank	prof
Lecturer	0
Professor	1
Professor	1
Lecturer	0
Assistant Professor	0
Adjunct Instructor	0

Mutate

1 2 3 4 5 6

the ! operator means not

is.na() identifies null values

faculty %>% 
  mutate(joint = ifelse(!is.na(dept2),
                        "joint", NA)) %>% 
  select(dept1, dept2, joint)

dept1	dept2	joint
Chemistry
Physics	Engineering	joint
Physics
Music
Music
Sociology

Mutate

1 2 3 4 5 6

with multiple conditions, case_when() is much easier!

faculty %>% 
  mutate(division = case_when(dept1 %in% c("Sociology","Political Science") ~
                                "Social Sciences",
                              dept1 %in% c("Music","English") ~
                                "Humanities",
                              dept1 %in% c("Chemistry","Physics") ~
                                "Sciences")) %>% 
  select(dept1, division)

dept1	division
Chemistry	Sciences
Physics	Sciences
Physics	Sciences
Music	Humanities
Music	Humanities
Sociology	Social Sciences

Group by / summarize

1 2 3 4 5 6

group_by/summarize aggregates data (pivot tables!)

group_by() identifies the grouping variable(s) and summarize() specifies the aggregation

courses %>% 
  group_by(dept, semester) %>% 
  summarize(enr = sum(enrollment))

dept	semester	enr
Chemistry	20181901	59
Chemistry	20181902	44
Chemistry	20192001	47
Chemistry	20192002	68
Chemistry	20202101	69
Chemistry	20202102	77

Group by / summarize

1 2 3 4 5 6

useful arguments within summarize:

mean, median, sd, min, max, n

courses %>% 
  group_by(dept, semester) %>% 
  summarize(enr = sum(enrollment),
            count = n_distinct(course_id))

dept	semester	enr	courses
Chemistry	20181901	59	2
Chemistry	20181902	44	2
Chemistry	20192001	47	2
Chemistry	20192002	68	2
Chemistry	20202101	69	2
Chemistry	20202102	77	2

Using R effectively

1 2 3 4 5 6

Working in RStudio

1 2 3 4 5 6

project files are here

imported data shows up here

code can also
go here

Working in RStudio

1 2 3 4 5 6

Typing in the console

think of it like a post-it: useful for quick notes but disposable
actions are saved but code is not
one chunk of code is run at a time (Return)

Typing in a code file

script files have a .R extension
code is saved and sections of any size can be run (Cmd + Return)
do ~95% of your typing in a code file instead of the console!

Working with packages

1 2 3 4 5 6

packages need to be installed on each computer you use

# only need to do this once (per computer)
install.packages("tidyverse")

packages need to be loaded/attached with library() at the beginning of every session

# always put the necessary packages at the top of a code file
library(tidyverse)

can access help files by typing ??tidyverse or ??mutate in the console

Organizing with projects

1 2 3 4 5 6

highly recommend using projects to stay organized

keeps code files and data files together, allowing for easier file path navigation and better reproducible work habits

File -> New Project

more guidance: here and here

Organizing with projects

1 2 3 4 5 6

project files are here

imported data shows up here

code can also
go here

Accessing workshop materials

1 2 3 4 5 6

click big green Code button and select “Download ZIP”, then open neair.Rproj

Accessing data

1 2 3 4 5 6

use read_csv() to import a csv file

# the file path is this simple if you use projects!
# ?read_csv() in the console will bring up the help file with more options
faculty <- read_csv("faculty.csv")

the readxl package is helpful for Excel files

# needs to be loaded but not installed as it's part of the tidyverse
library(readxl)
faculty <- read_excel("faculty.xlsx", sheet = 2)

view the data with View(faculty) or by clicking on the data name in the Environment pane

More data manipulation

1 2 3 4 5 6

Stringr functions

1 2 3 4 5 6

functions from stringr (which all start with str_) are useful for working with text data

faculty %>% 
  filter(str_detect(rank, "Professor"))

year	id	rank	dept1	dept2
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1086	Assistant Professor	Music
2021-22	1118	Assistant Professor	Sociology
2021-22	1158	Assistant Professor	Political Science
2021-22	1161	Assistant Professor	Sociology

Stringr functions

1 2 3 4 5 6

cheat sheet of functions is here

courses %>% 
  mutate(year = str_c(str_sub(semester, 1, 4), 
                      "-",
                      str_sub(semester, 5, 6))) %>% 
  select(semester, year) %>% 
  unique()

semester	year
20212202	2021-22
20212201	2021-22
20202102	2020-21
20202101	2020-21
20192002	2019-20
20192001	2019-20
20181902	2018-19
20181901	2018-19

Pivoting data

1 2 3 4 5 6

existing faculty data has one row per faculty, some with multiple departments (sometimes known as wide data)

year	id	rank	dept1	dept2
2021-22	1005	Lecturer	Chemistry
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1079	Lecturer	Music
2021-22	1086	Assistant Professor	Music
2021-22	1095	Adjunct Instructor	Sociology

Pivoting data

1 2 3 4 5 6

what if you instead want one row per faculty per department? (sometimes known as long data)

year	id	rank	dept_no	dept
2021-22	1005	Lecturer	dept1	Chemistry
2021-22	1022	Professor	dept1	Physics
2021-22	1022	Professor	dept2	Engineering
2021-22	1059	Professor	dept1	Physics
2021-22	1079	Lecturer	dept1	Music
2021-22	1086	Assistant Professor	dept1	Music

Pivoting data

1 2 3 4 5 6

the pivot_longer function lengthens data

faculty %>% 
  pivot_longer(dept1:dept2,
               names_to = "dept_no",
               values_to = "dept",
               values_drop_na = TRUE) %>% 
  select(-year, -rank)

id	dept_no	dept
1005	dept1	Chemistry
1022	dept1	Physics
1022	dept2	Engineering
1059	dept1	Physics
1079	dept1	Music
1086	dept1	Music

Pivoting data

1 2 3 4 5 6

and pivot_wider does the opposite!

semester	course_id	faculty_id	dept	enrollment	level
20212202	10605	1772	Physics	7	UG
20212202	10605	1772	Physics	32	GR

courses %>% 
  pivot_wider(names_from = "level",
              values_from = "enrollment")

semester	course_id	faculty_id	dept	UG	GR
20212202	10605	1772	Physics	7	32
20212202	11426	1820	Political Science	8
20212202	12048	1914	English	24
20212202	13269	1095	Sociology	48

Joining data

1 2 3 4 5 6

R has many useful functions for handling relational data

all you need is at least one key variable that connects data sets

left_join is most common, but there are more

Joining data

1 2 3 4 5 6

what’s the average UG enrollment per year, per faculty rank?

faculty

year	id	rank	dept1	dept2
2021-22	1005	Lecturer	Chemistry
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1079	Lecturer	Music

courses

semester	course_id	faculty_id	dept	enrollment	level
20212202	10605	1772	Physics	7	UG
20212202	10605	1772	Physics	32	GR
20212202	11426	1820	Political Science	8	UG
20212202	12048	1914	English	24	UG

faculty$id is the same as courses$faculty_id

Joining data

1 2 3 4 5 6

what’s the average UG enrollment per year, per faculty rank?

semester	course_id	faculty_id	dept	enrollment	level
20212202	10605	1772	Physics	7	UG
20212202	10605	1772	Physics	32	GR
20212202	11426	1820	Political Science	8	UG
20212202	12048	1914	English	24	UG
20212202	13269	1095	Sociology	48	UG

filter to UG courses only
create our year variable again
summarize enrollment by year and faculty_id

Joining data

1 2 3 4 5 6

use the <- operator to create a new data frame courses_UG

courses_UG <- courses %>% 
  filter(level == "UG") %>% 
  mutate(year = str_c(str_sub(semester, 1, 4), 
                      "-",
                      str_sub(semester, 5, 6)))

Joining data

1 2 3 4 5 6

filter to undergraduate courses only and mutate a new academic year variable

courses_UG <- courses %>% 
  filter(level == "UG") %>% 
  mutate(year = str_c(str_sub(semester, 1, 4), 
                      "-",
                      str_sub(semester, 5, 6)))

Joining data

1 2 3 4 5 6

group_by year and faculty member; summarize enrollment

courses_UG <- courses %>% 
  filter(level == "UG") %>% 
  mutate(year = str_c(str_sub(semester, 1, 4), 
                      "-",
                      str_sub(semester, 5, 6))) %>% 
  group_by(year, faculty_id) %>% 
  summarize(enr = sum(enrollment))

year	faculty_id	enr
2018-19	1059	35
2018-19	1086	14
2018-19	1102	37
2018-19	1203	25

Joining data

1 2 3 4 5 6

what’s the average UG enrollment per year, per faculty rank?

faculty

year	id	rank	dept1	dept2
2021-22	1005	Lecturer	Chemistry
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1079	Lecturer	Music
2021-22	1086	Assistant Professor	Music
2021-22	1095	Adjunct Instructor	Sociology

courses_UG

year	faculty_id	enr
2021-22	1005	50
2021-22	1086	17
2021-22	1095	48
2021-22	1128	32
2021-22	1147	32
2021-22	1191	7

Joining data

1 2 3 4 5 6

fac_enr <- faculty %>% 
  left_join(courses_UG, by = c("id" = "faculty_id",
                               "year" = "year"))

new data frame
data frame you’re adding data to
data frame where the new data is coming from

year	id	rank	dept1	dept2	enr
2021-22	1005	Lecturer	Chemistry		50
2021-22	1022	Professor	Physics	Engineering
2021-22	1059	Professor	Physics
2021-22	1079	Lecturer	Music
2021-22	1086	Assistant Professor	Music		17
2021-22	1095	Adjunct Instructor	Sociology		48

Joining data

1 2 3 4 5 6

what’s the average UG enrollment per year, per faculty rank?

fac_enr <- faculty %>% 
  left_join(courses_UG, by = c("id" = "faculty_id",
                               "year" = "year")) %>% 
  group_by(year, rank) %>% 
  summarize(avg_enr = mean(enr, na.rm = TRUE))

year	rank	avg_enr
2021-22	Adjunct Instructor	34.66667
2021-22	Assistant Professor	23.60000
2021-22	Associate Professor	17.25000
2021-22	Lecturer	31.83333
2021-22	Professor	32.16667
2021-22	Visiting Researcher

Data visualization

1 2 3 4 5 6

ggplot2

1 2 3 4 5 6

ggplot2 is the data visualization package that is loaded with the tidyverse

the grammar of graphics maps data to the aesthetic attributes of geometric points

encoding data into visual cues (e.g., length, color, position, size) is how we signify changes and comparisons

Bar chart

1 2 3 4 5 6

faculty %>%
  count(rank) %>%
  ggplot(aes(x = rank, y = n)) +
  geom_bar(stat = "identity")

to combine lines into one code chunk, use + instead of %>%

Bar chart

1 2 3 4 5 6

can create a prettier plot pretty easily

expand for full code

faculty %>%
  count(rank) %>%
  ggplot(aes(x = reorder(rank, -n), y = n)) +
  geom_bar(stat = "identity", fill = "#cc0000") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  geom_text(aes(label = n), vjust = -0.5) +
  labs(x = NULL, y = NULL,
       title = "Count of faculty by rank, 2018-2021") +
  theme_linedraw() +
  theme(panel.grid.major.x = element_blank(),
        axis.ticks = element_blank())

Line graph

1 2 3 4 5 6

fac_enr %>% 
  filter(!is.na(avg_enr)) %>% 
  ggplot(aes(x = year, y = avg_enr, group = rank, color = rank)) +
  geom_line()

Line graph

1 2 3 4 5 6

expand for full code

fac_enr %>% 
  filter(!is.na(avg_enr)) %>% 
  ggplot(aes(x = year, y = avg_enr, group = rank, color = rank)) +
  geom_line() +
  geom_point() +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  labs(x = NULL, y = "Average enrollment",
       title = "Average undergraduate enrollment per rank over time") +
  theme_linedraw() +
  theme(panel.grid.major.x = element_blank(),
        axis.ticks = element_blank(),
        legend.title = element_blank(),
        legend.background = element_rect(fill = NA),
        legend.key = element_rect(fill = NA),
        legend.position = c(0.85, 0.82))