Research Methods: Open Science and Reproducible Research in Linguistics

Welcome to the tidyverse II: Tidying and descriptives

Joseph V. Casillas, PhD

Rutgers University
Spring 2019
Last update: 2025-02-16


select

columns of a dataframe

with

select()


select()

  • You can select consecutive columns using ":" (try)

  • You can rename columns directly (new_name = old_name)

  • Take a look at the mtcars dataset using glimpse()

  • Use the select() function to select any 3 variables

  • Use the select() function to select the last 3 variables

  • Use the select() function to rename mpg to hello_world

?select() if you need help


select(mtcars, mpg, disp, drat)

select(mtcars, am:carb)

select(mtcars, hello_world = mpg)

filter

rows of a dataframe

with

filter()


filter()

  • You can use logical operators in filter

    Operator   Meaning
    <          less than
    >          greater than
    <=         less than or equal to
    >=         greater than or equal to
    ==         equal to
    !=         not equal to
    |          or
    &          and
    %in%       in

  • Filter rows in which mpg is less than 20 and greater than 14
  • Filter rows in which cyl is equal to 6
  • Filter rows in which mpg is greater than 20 or disp is less than 200

filter(mtcars, mpg < 20 & mpg > 14)

filter(mtcars, cyl == 6)

filter(mtcars, mpg > 20 | disp < 200)
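
The operator table lists %in%, which none of the exercises above use. The following is a minimal sketch, assuming dplyr is installed; the object names and the choice of cyl values are only illustrative.

```r
library(dplyr)

# Keep rows whose cyl value matches any element of the vector
four_or_six <- filter(mtcars, cyl %in% c(4, 6))

# The same filter written with == and | (or); more typing, same rows
four_or_six_alt <- filter(mtcars, cyl == 4 | cyl == 6)

# Both objects should contain the same number of rows
nrow(four_or_six)
nrow(four_or_six_alt)
```

%in% becomes more convenient than chained | conditions as the list of values to match grows.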

arrange

rows of a dataframe

with

arrange()


arrange()

  • You probably won't use this very often

  • You can arrange using multiple variables

  • Arrange the mtcars dataset based on cyl and disp

  • Arrange the mtcars dataset based on mpg from highest to lowest


arrange(mtcars, cyl, disp)

arrange(mtcars, desc(mpg))

mutate

variables of a dataframe

with

mutate()

mutate()

  • Get comfortable using mutate()

  • In the mtcars dataset, select the mpg column and then...

    • create a new column called mpg_x2 that doubles every value of mpg
    • create a new column called mpg_c that centers the mpg data by subtracting the mean value of mpg from every value of mpg
    • CHALLENGE: create a new column called value that applies the label 'good' to cars that get over 18 mpg and the label 'bad' to cars that get 18 mpg or less

HINT:
Start every attempt in the same way...

mtcars |>
  select(mpg) |>
  mutate(???)

mtcars |>
  select(mpg) |>
  mutate(mpg_x2 = mpg * 2)

mtcars |>
  select(mpg) |>
  mutate(mpg_c = mpg - mean(mpg))

mtcars |>
  select(mpg) |>
  mutate(value = if_else(mpg <= 18, 'bad', 'good'))

Advanced mutations

mutate() + case_when()

  • Extremely useful when you need to create a new column based on multiple conditions of another column

  • Use this if you find yourself using nested if_else()

  • Syntax uses logical operators:
    condition ~ desired result

Conditions

  • if age of learning is less than 12 and L1 is Spanish, then heritage speaker
  • if age of learning is less than 12 and L1 is English, then early learner
  • if age of learning is greater than 12, then late learner
  • if age of learning is NA, then monolingual
id age_learn_l2 l1
101 3 sp
102 2 sp
103 3 sp
104 NA sp
105 18 en
106 17 en
107 3 en
108 2 en
109 NA en
110 3 sp

Code

case_when_df |>
  mutate(
    group = case_when(
      age_learn_l2 < 12 & l1 == 'sp' ~ 'heritage',
      age_learn_l2 < 12 & l1 == 'en' ~ 'early_learner',
      age_learn_l2 > 12 ~ 'late_learner',
      is.na(age_learn_l2) ~ 'monolingual'
    )
  ) |>
  knitr::kable()

id age_learn_l2 l1 group
101 3 sp heritage
102 2 sp heritage
103 3 sp heritage
104 NA sp monolingual
105 18 en late_learner
106 17 en late_learner
107 3 en early_learner
108 2 en early_learner
109 NA en monolingual
110 3 sp heritage
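
The slide's code relies on a case_when_df object that is not defined in the slides. The sketch below rebuilds it by hand from the table above (an assumption about how the data were stored) and contrasts case_when() with the nested if_else() it is meant to replace. The object names with_case_when and with_if_else are hypothetical.

```r
library(dplyr)

# Rebuild the example data by hand (values copied from the table above)
case_when_df <- tibble(
  id = 101:110,
  age_learn_l2 = c(3, 2, 3, NA, 18, 17, 3, 2, NA, 3),
  l1 = c("sp", "sp", "sp", "sp", "en", "en", "en", "en", "en", "sp")
)

# One line per condition, read from top to bottom
with_case_when <- case_when_df |>
  mutate(
    group = case_when(
      age_learn_l2 < 12 & l1 == "sp" ~ "heritage",
      age_learn_l2 < 12 & l1 == "en" ~ "early_learner",
      age_learn_l2 > 12 ~ "late_learner",
      is.na(age_learn_l2) ~ "monolingual"
    )
  )

# The same labels for these data with nested if_else(): harder to read.
# Note: the two versions would differ for a learner aged exactly 12,
# which does not occur in this data set.
with_if_else <- case_when_df |>
  mutate(
    group = if_else(
      is.na(age_learn_l2), "monolingual",
      if_else(
        age_learn_l2 > 12, "late_learner",
        if_else(l1 == "sp", "heritage", "early_learner")
      )
    )
  )

# Number of participants per group
count(with_case_when, group)
```

Any row that matches none of the conditions in case_when() receives NA, so it is worth checking the result with count() as done here.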

summarize

variables of a dataframe

with

summarize()


  • summarize() typically reduces the number of rows in your dataframe, returning one row per group (or a single row if the data are not grouped)
  • summarize() is often accompanied by the helper function group_by()

summarize()
group_by() |> summarize()

Note

  • Get accustomed to using these two functions; they are extremely useful
  • Remember that summarize() reduces the number of rows in your dataframe
  • Remember that mutate() adds a column to your dataframe
  • You can include more than one summary statistic inside summarize()

Practice

  • Calculate the mean value of mpg in the dataset mtcars
  • Calculate the mean value of mpg as a function of cyl
  • Calculate the mean, standard deviation, min, and max of mpg as a function of cyl

mtcars |>
  group_by(cyl) |>
  summarize(
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg)
  )
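
The code above answers only the third practice item. A possible sketch for the first two follows, together with a grouped mutate() to illustrate the row-count difference described in the notes. The object names are hypothetical.

```r
library(dplyr)

# Practice item 1: one overall mean, so the result has a single row
overall <- mtcars |>
  summarize(mean_mpg = mean(mpg))

# Practice item 2: one mean per level of cyl, so one row per group
by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg))

# Contrast: a grouped mutate() keeps every row and adds the group mean
# as a new column instead of collapsing the data
with_group_mean <- mtcars |>
  group_by(cyl) |>
  mutate(mean_mpg = mean(mpg)) |>
  ungroup()

nrow(overall)
nrow(by_cyl)
nrow(with_group_mean)
```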

Summary


tidyr

What is tidyr?

  • A package that is part of the tidyverse

  • Contains functions (verbs) that are helpful for tidying (cleaning, munging) data

What is tidy data?

  • Each variable must have its own column.

  • Each observation must have its own row.

  • Each value must have its own cell.

  • (Most) functions in R are designed to work with tidy data

  • It is imperative that you learn how to tidy your data


What does untidy data look like?


id spec test1 test2
span01 g1_lo 64.31 69.2
span02 g1_lo 59.81 63.7
span03 g1_hi 66.08 70.9
span04 g1_hi 72.78 79.2
span05 g2_lo 68.29 75.4
span06 g2_lo 69.22 76.7
span07 g2_hi 71.36 77.2
span08 g2_hi 80.37 88.9
cata01 g1_lo 75.63 83.6
cata02 g1_lo 71.25 78.8
cata03 g1_hi 69.09 74.6
cata04 g1_hi 72.35 80.7
cata05 g2_lo 71.66 77.9
cata06 g2_lo 69.01 75.0
cata07 g2_hi 69.86 76.0

pre_post

  • How many columns are there?

  • How many variables are there? What are they?

  • How many observations are there per row?


separate

elements of a variable

with

separate()

separate()

  • This is untidy data

  • How many variables does the column id contain?

id pre_test
101_m_ctr 75
102_m_ctr 70
103_m_ctr 65
104_m_exp 66
105_m_exp 68
106_m_exp 58
107_f_ctr 60
108_f_ctr 66
109_f_ctr 69
110_f_exp 54
111_f_exp 88
112_f_exp 44

my_data_wide |>
  separate(
    col = id,
    into = c('id', 'group', 'condition'),
    sep = "_"
  )

id group condition pre_test
101 m ctr 75
102 m ctr 70
103 m ctr 65
104 m exp 66
105 m exp 68
106 m exp 58
107 f ctr 60
108 f ctr 66
109 f ctr 69
110 f exp 54
111 f exp 88
112 f exp 44
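
The my_data_wide object is not defined in the slides. The sketch below builds a small stand-in from four rows of the table above (an assumption about its structure) so the separate() call can be run as is. The convert = TRUE argument is an addition not shown on the slide.

```r
library(tidyr)

# A few rows copied from the table above; id packs three variables together
my_data_wide <- data.frame(
  id = c("101_m_ctr", "104_m_exp", "107_f_ctr", "110_f_exp"),
  pre_test = c(75, 66, 60, 54)
)

separated <- my_data_wide |>
  separate(
    col = id,
    into = c("id", "group", "condition"),
    sep = "_",
    convert = TRUE  # store the new id column as integers rather than text
  )

# Inspect the column types of the result
str(separated)
```

Without convert = TRUE all three new columns would be character vectors, which is usually fine for grouping variables but not for numeric identifiers you want to sort.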

unite

columns into a variable

with

unite()

unite()

  • We will put id, group, and condition back into a single column

  • You probably won't use this often

id group condition pre_test
101 m ctr 75
102 m ctr 70
103 m ctr 65
104 m exp 66
105 m exp 68
106 m exp 58
107 f ctr 60
108 f ctr 66
109 f ctr 69
110 f exp 54
111 f exp 88
112 f exp 44

my_data_wide |>
  unite(
    col = id_group_condition,
    c('id', 'group', 'condition'),
    sep = "-"
  )

id_group_condition pre_test
101-m-ctr 75
102-m-ctr 70
103-m-ctr 65
104-m-exp 66
105-m-exp 68
106-m-exp 58
107-f-ctr 60
108-f-ctr 66
109-f-ctr 69
110-f-exp 54
111-f-exp 88
112-f-exp 44

What if we have more than one observation per row?


pivot_longer()

dataframes from wide to long

with

pivot_longer()

id pre_test post_test
101_m_ctr 75 85
102_m_ctr 70 80
103_m_ctr 65 75
104_m_exp 66 76
105_m_exp 68 78
106_m_exp 58 68
107_f_ctr 60 70
108_f_ctr 66 76
109_f_ctr 69 79
110_f_exp 54 64
111_f_exp 88 98
112_f_exp 44 54
  • What do the columns pre_test and post_test represent?

  • What is each numeric value?

my_data_wide |>
  pivot_longer(
    cols = c("pre_test", "post_test"),
    names_to = "test",
    values_to = "score"
  )

id test score
101_m_ctr pre_test 75
101_m_ctr post_test 85
102_m_ctr pre_test 70
102_m_ctr post_test 80
103_m_ctr pre_test 65
103_m_ctr post_test 75
104_m_exp pre_test 66
104_m_exp post_test 76
105_m_exp pre_test 68
105_m_exp post_test 78
106_m_exp pre_test 58
106_m_exp post_test 68
107_f_ctr pre_test 60
107_f_ctr post_test 70
108_f_ctr pre_test 66
108_f_ctr post_test 76
109_f_ctr pre_test 69
109_f_ctr post_test 79
110_f_exp pre_test 54
110_f_exp post_test 64
111_f_exp pre_test 88
111_f_exp post_test 98
112_f_exp pre_test 44
112_f_exp post_test 54

pivot_longer()

Note

  • You will have to do this often
  • Remember...
    • cols is a vector of names of the columns you want to pivot
    • names_to is the name you will give the new column that holds the old column names (the factor)
    • values_to is the name you will give the column of observations (numbers)
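
The three arguments can be seen at work in a small, self-contained sketch. The two-row data frame is a stand-in for my_data_wide, and ends_with() is used here as one possible alternative to typing the column names; neither comes from the slides.

```r
library(tidyr)

# Two participants, two test columns (values copied from the slides)
my_data_wide <- data.frame(
  id = c("101_m_ctr", "104_m_exp"),
  pre_test = c(75, 66),
  post_test = c(85, 76)
)

my_data_long <- my_data_wide |>
  pivot_longer(
    cols = ends_with("_test"),  # selects pre_test and post_test
    names_to = "test",          # new column holding the old column names
    values_to = "score"         # new column holding the numeric values
  )

# Each wide row becomes one long row per pivoted column
nrow(my_data_wide)
nrow(my_data_long)
```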

Practice

  • Install the untidydata package:
    remotes::install_github('jvcasillas/untidydata')
  • Load the package and convert the pre_post data set from wide to long
    • Either include the columns you want to pivot, or
    • Exclude the columns you do not want to pivot

1 pivot_longer() used to be called gather()

pre_post |>
  pivot_longer(cols = test1:test2, names_to = "test", values_to = "score")

pre_post |>
  pivot_longer(cols = -c("id", "spec"), names_to = "test", values_to = "score")

What if we want a wide data set?


pivot_wider

dataframes from long to wide

with

pivot_wider()

id test score
101_m_ctr pre_test 75
101_m_ctr post_test 85
102_m_ctr pre_test 70
102_m_ctr post_test 80
103_m_ctr pre_test 65
103_m_ctr post_test 75
104_m_exp pre_test 66
104_m_exp post_test 76
105_m_exp pre_test 68
105_m_exp post_test 78
106_m_exp pre_test 58
106_m_exp post_test 68
107_f_ctr pre_test 60
107_f_ctr post_test 70
108_f_ctr pre_test 66
108_f_ctr post_test 76
109_f_ctr pre_test 69
109_f_ctr post_test 79
110_f_exp pre_test 54
110_f_exp post_test 64
111_f_exp pre_test 88
111_f_exp post_test 98
112_f_exp pre_test 44
112_f_exp post_test 54






  • We need to pivot this dataframe back to wide format with pivot_wider() (formerly spread())

  • Why might we want to do this?

my_data_long |>
  pivot_wider(names_from = "test", values_from = "score") |>
  ggplot() +
  aes(x = pre_test, y = post_test) +
  geom_vline(xintercept = mean(my_data_wide$pre_test), lty = 3) +
  geom_hline(yintercept = mean(my_data_wide$post_test), lty = 3) +
  geom_point(size = 4) +
  theme_test(base_size = 18, base_family = "Palatino")


pivot_wider()

Note

  • You will probably use pivot_wider() less often than pivot_longer()

  • It can be useful for making scatterplots and doing data transformations using mutate()
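
As an illustration of the second point, the sketch below pivots a small long data frame to wide format and then uses mutate() to compute a difference score. The data are a stand-in built from two participants in the slides, and the column name gain is hypothetical.

```r
library(tidyr)
library(dplyr)

# Long format: one row per participant per test
my_data_long <- data.frame(
  id = rep(c("101_m_ctr", "104_m_exp"), each = 2),
  test = rep(c("pre_test", "post_test"), times = 2),
  score = c(75, 85, 66, 76)
)

# In wide format both scores sit in the same row,
# so they can be combined directly inside mutate()
gains <- my_data_long |>
  pivot_wider(names_from = "test", values_from = "score") |>
  mutate(gain = post_test - pre_test)

gains
```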

Exercise

  • Take a look at the language_diversity data set in untidydata

  • Spread the data set from long to wide using pivot_wider and create a plot

1 pivot_wider() used to be called spread()

language_diversity |>
  pivot_wider(names_from = "Measurement", values_from = "Value") |>
  ggplot() +
  aes(x = log(Area), y = log(Langs), label = Country) +
  geom_text() +
  geom_smooth(method = "glm", method.args = list(family = "poisson"))

Loading and saving data

read_csv()

  • Read .csv files into R using the read_csv() function
  • Ideally your data is stored in a folder of your project called data
  • If it is raw data you can use sub-directories, i.e., data > raw
  • You can pipe directly into any other verbs to tidy your data
  • Ex.
my_df <- read_csv("./data/raw/raw_data.csv")

write_csv()

  • Save dataframes as .csv files using write_csv() function
  • After tidying your data you can (should) save it
  • Keep this data separate from your raw data, i.e., data > tidy
  • You can pipe into write_csv() right after tidying your data
  • Ex.
my_df <- read_csv("./data/raw/raw_data.csv") |>
  mutate(
    new_var = var1 - var2,
    group_sum = if_else(group == "level", -1, 1)
  ) |>
  write_csv(file = "./data/tidy/tidy_data.csv")
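
The example above depends on files and columns that exist only in the author's project. A runnable sketch of the same read, tidy, write pattern is given below. It assumes readr 2.0 or later, uses a temporary folder in place of data/raw and data/tidy, and uses mtcars as placeholder "raw" data.

```r
library(readr)
library(dplyr)

# Temporary locations so the sketch runs anywhere; in a real project
# these would be ./data/raw/... and ./data/tidy/...
raw_path  <- file.path(tempdir(), "raw_data.csv")
tidy_path <- file.path(tempdir(), "tidy_data.csv")

# Stand-in for raw data (mtcars is only a placeholder)
write_csv(mtcars, file = raw_path)

# Read, tidy, and save in a single pipeline
tidy_df <- read_csv(raw_path, show_col_types = FALSE) |>
  mutate(mpg_c = mpg - mean(mpg)) |>
  write_csv(file = tidy_path)  # write_csv() returns its input invisibly

file.exists(tidy_path)
```

Because write_csv() returns the dataframe it was given, the assignment at the top of the pipeline still captures the tidied data.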

A note about paths


System paths

  • what are they?
  • how do they work?
  • relative paths
  • absolute paths
  • what problems do they create?
  • what are the solutions?

System paths

What are they?

  • Your computer is a hierarchical system of directories (folders) and files
  • You can think of it as a garden of forking paths
  • The top of this hierarchy is the root
  • The path from root to a given file is an absolute path

How do they work?

  • The user defines the system-specific path
  • Every time one "enters" a directory the path is marked with "/" (see note 1)
  • Ex.
    /Users/casillas/Desktop/new_proj
  • This absolute path goes from my system root to a directory on my desktop called new_proj

1 This is specific to the operating system. On Windows PCs the separator is "\", although R also accepts "/" there.







  • We can simplify the hierarchy by using relative paths
  • With a relative path the user specifies what root is and all paths are relative to that root
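
The difference between the two kinds of path can be sketched in base R. The folder and file names mirror the project layout used in these slides; whether such a file actually exists on your machine is not assumed.

```r
# Build a relative path from its parts; file.path() inserts the separators
rel_path <- file.path("data", "tidy", "tidy_data.csv")

# The same location written as an absolute path, starting from the
# current working directory (which differs from computer to computer)
abs_path <- file.path(getwd(), rel_path)

rel_path
abs_path
```

The relative path is identical on every machine, whereas the absolute path changes with each user's file system.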

new_proj as root

.
├── README.md
├── data
│   ├── raw
│   │   ├── EMD_2afc_template_2019-02-14_09h56.54.363.csv
│   │   └── ...
│   └── tidy
│       └── tidy_data.csv
├── my_proj.Rproj
└── scripts
    └── my_script.R

What problems do they create?

  • An absolute path can get long (and annoying) fast
  • Your file system will have different paths than my file system

What are the solutions?

  • Always use relative paths!
    • ./ = "here"
    • ../ = "Go up one directory"
    • ../../ = "Go up two directories"
  • Always use RStudio projects 👍
  • Use RStudio projects + here() 😍

Exercise I

  • Download this repo: https://github.com/jvcasillas/new_proj
  • Load the data using an absolute path
  • Reload the data using a relative path
  • Calculate a summary on the data (group_by() + summarize()) and save the output as a csv to the data folder
  • Load the new .csv
  • Move the .csv files to root and reload them

Exercise II

  • Install here
  • Load here
  • Run here(). What happens?
  • Load the data as you did before but use here() where previously you used a relative path

Getting help

If you have problems getting or tidying your data

ask for help in the slack channel

R, RStudio, RMarkdown, GitHub, and Slack here
