You can select consecutive columns using ":" (try it)
You can rename columns directly (new_name = old_name)
Take a look at the mtcars dataset using glimpse()
Use the select() function to select any 3 variables
Use the select() function to select the last 3 variables
Use the select() function to rename mpg to hello_world
?select() if you need help
select(mtcars, mpg, disp, drat)
select(mtcars, am:carb)
select(mtcars, hello_world = mpg)
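One detail the renaming solution can hide: select() keeps only the columns you name, even when you rename inside it. The sketch below contrasts it with dplyr's rename(), which is not covered on the slide and is shown here only for comparison.

```r
library(dplyr)

# select() with new_name = old_name renames AND drops every other column
only_renamed <- select(mtcars, hello_world = mpg)

# rename() changes the name but keeps all columns
all_cols <- rename(mtcars, hello_world = mpg)

ncol(only_renamed)
ncol(all_cols)
```

If you want to rename a column and keep the rest of the dataframe, rename() is usually the better fit.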
Operator | Meaning |
---|---|
< | less than |
> | greater than |
<= | less than or equal to |
>= | greater than or equal to |
== | equal to |
!= | not equal to |
\| | or |
& | and |
%in% | in |
Filter the mtcars dataset to keep the rows in which...
mpg is less than 20 and greater than 14
cyl is equal to 6
mpg is greater than 20 or disp is less than 200
filter(mtcars, mpg < 20 & mpg > 14)
filter(mtcars, cyl == 6)
filter(mtcars, mpg > 20 | disp < 200)
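The operator table lists %in%, but none of the exercises use it. A minimal sketch, assuming you want cars with either 4 or 6 cylinders (the object names are made up for illustration):

```r
library(dplyr)

# %in% checks each value of cyl against a set of allowed values
four_or_six <- filter(mtcars, cyl %in% c(4, 6))

# The same filter written with == and |
same_with_or <- filter(mtcars, cyl == 4 | cyl == 6)

nrow(four_or_six)
```

%in% tends to read better than a long chain of | once the set has more than two values.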
You probably won't use this very often
You can arrange using multiple variables
Arrange the mtcars dataset based on cyl and disp
Arrange the mtcars dataset based on mpg from highest to lowest
arrange(mtcars, cyl, disp)
arrange(mtcars, desc(mpg))
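The two solutions can also be combined: ascending and descending variables can be mixed in a single arrange() call. A small sketch (the object name sorted_cars is an assumption):

```r
library(dplyr)

# Sort by cyl (lowest first), then by mpg within each cyl (highest first)
sorted_cars <- arrange(mtcars, cyl, desc(mpg))

head(select(sorted_cars, cyl, mpg))
```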
Get comfortable using mutate()
In the mtcars dataset, select the mpg column and then...
create a new column called mpg_x2 that doubles every value in the dataframe
create a new column called mpg_c that centers the mpg data by subtracting the mean value of mpg from every value in the dataframe
create a new column called value that applies the label 'good' to cars that get over 18 mpg and the label 'bad' to cars that get 18 mpg or less
HINT: Start every attempt in the same way...
mtcars |> select(mpg) |> mutate(???)
mtcars |> select(mpg) |> mutate(mpg_x2 = mpg * 2)
mtcars |> select(mpg) |> mutate(mpg_c = mpg - mean(mpg))
mtcars |> select(mpg) |> mutate(value = if_else(mpg <= 18, 'bad', 'good'))
Extremely useful when you need to create a new column based on multiple conditions of another column
Use this if you find yourself using nested if_else()
Syntax uses logical operators: condition ~ desired result
If age_learn_l2 is NA, then monolingual
id | age_learn_l2 | l1 |
---|---|---|
101 | 3 | sp |
102 | 2 | sp |
103 | 3 | sp |
104 | NA | sp |
105 | 18 | en |
106 | 17 | en |
107 | 3 | en |
108 | 2 | en |
109 | NA | en |
110 | 3 | sp |
case_when_df |>
  mutate(
    group = case_when(
      age_learn_l2 < 12 & l1 == 'sp' ~ 'heritage',
      age_learn_l2 < 12 & l1 == 'en' ~ 'early_learner',
      age_learn_l2 > 12 ~ 'late_learner',
      is.na(age_learn_l2) ~ 'monolingual'
    )
  ) |>
  knitr::kable()
id | age_learn_l2 | l1 | group |
---|---|---|---|
101 | 3 | sp | heritage |
102 | 2 | sp | heritage |
103 | 3 | sp | heritage |
104 | NA | sp | monolingual |
105 | 18 | en | late_learner |
106 | 17 | en | late_learner |
107 | 3 | en | early_learner |
108 | 2 | en | early_learner |
109 | NA | en | monolingual |
110 | 3 | sp | heritage |
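To run the example yourself you need the data. The sketch below rebuilds case_when_df by hand from the table above (how the original dataframe was created is not shown, so this construction is an assumption) and applies the same conditions.

```r
library(dplyr)

# Hand-typed copy of the table shown above
case_when_df <- tibble(
  id = 101:110,
  age_learn_l2 = c(3, 2, 3, NA, 18, 17, 3, 2, NA, 3),
  l1 = c("sp", "sp", "sp", "sp", "en", "en", "en", "en", "en", "sp")
)

# Conditions are checked top to bottom; the first one that is TRUE wins
grouped <- case_when_df |>
  mutate(
    group = case_when(
      age_learn_l2 < 12 & l1 == "sp" ~ "heritage",
      age_learn_l2 < 12 & l1 == "en" ~ "early_learner",
      age_learn_l2 > 12 ~ "late_learner",
      is.na(age_learn_l2) ~ "monolingual"
    )
  )

count(grouped, group)
```

A row that matches none of the conditions gets NA. With these rules that would happen to anyone with age_learn_l2 equal to exactly 12, so check your boundaries when you write your own.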
summarize() collapses your dataframe to one row per group, so it reduces the number of rows
summarize() is often accompanied by the helper function group_by()
In contrast, mutate() adds a column to your dataframe and keeps every row
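The difference between the two verbs is easiest to see side by side. A minimal sketch using mtcars (the object names are assumptions):

```r
library(dplyr)

# summarize(): one row per group
by_cyl_summary <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg))

# mutate(): every row is kept, and the group mean is repeated within each group
by_cyl_mutate <- mtcars |>
  group_by(cyl) |>
  mutate(mean_mpg = mean(mpg)) |>
  ungroup()

nrow(by_cyl_summary)
nrow(by_cyl_mutate)
```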
Summarize mpg in the dataset mtcars
Summarize mpg as a function of cyl
Calculate the mean, standard deviation, minimum, and maximum of mpg as a function of cyl
mtcars |>
  group_by(cyl) |>
  summarize(
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg)
  )
What is tidyr?
A package that is part of the tidyverse
Contains functions (verbs) that are helpful for tidying (cleaning, munging) data
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
(most) functions in R are designed to work with tidy data
It is imperative that you learn how to tidy your data
id | spec | test1 | test2 |
---|---|---|---|
span01 | g1_lo | 64.31 | 69.2 |
span02 | g1_lo | 59.81 | 63.7 |
span03 | g1_hi | 66.08 | 70.9 |
span04 | g1_hi | 72.78 | 79.2 |
span05 | g2_lo | 68.29 | 75.4 |
span06 | g2_lo | 69.22 | 76.7 |
span07 | g2_hi | 71.36 | 77.2 |
span08 | g2_hi | 80.37 | 88.9 |
cata01 | g1_lo | 75.63 | 83.6 |
cata02 | g1_lo | 71.25 | 78.8 |
cata03 | g1_hi | 69.09 | 74.6 |
cata04 | g1_hi | 72.35 | 80.7 |
cata05 | g2_lo | 71.66 | 77.9 |
cata06 | g2_lo | 69.01 | 75.0 |
cata07 | g2_hi | 69.86 | 76.0 |
pre_post
How many columns are there?
How many variables are there? What are they?
How many observations are there per row?
This is untidy data
How many variables does the column id contain?
id | pre_test |
---|---|
101_m_ctr | 75 |
102_m_ctr | 70 |
103_m_ctr | 65 |
104_m_exp | 66 |
105_m_exp | 68 |
106_m_exp | 58 |
107_f_ctr | 60 |
108_f_ctr | 66 |
109_f_ctr | 69 |
110_f_exp | 54 |
111_f_exp | 88 |
112_f_exp | 44 |
my_data_wide |>
  separate(
    col = id,
    into = c('id', 'group', 'condition'),
    sep = "_"
  )
id | group | condition | pre_test |
---|---|---|---|
101 | m | ctr | 75 |
102 | m | ctr | 70 |
103 | m | ctr | 65 |
104 | m | exp | 66 |
105 | m | exp | 68 |
106 | m | exp | 58 |
107 | f | ctr | 60 |
108 | f | ctr | 66 |
109 | f | ctr | 69 |
110 | f | exp | 54 |
111 | f | exp | 88 |
112 | f | exp | 44 |
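After separate(), the new columns are character vectors by default, so id would be "101" rather than 101. The sketch below uses a hand-typed four-row stand-in for the data (an assumption, not the original object) and the convert argument of separate() to let R guess the column types.

```r
library(tidyr)

# Hypothetical stand-in for a few rows of the table above
small_df <- data.frame(
  id = c("101_m_ctr", "104_m_exp", "107_f_ctr", "110_f_exp"),
  pre_test = c(75, 66, 60, 54)
)

separated <- small_df |>
  separate(
    col = id,
    into = c("id", "group", "condition"),
    sep = "_",
    convert = TRUE  # turn "101" into the integer 101
  )

str(separated)
```

Recent versions of tidyr mark separate() as superseded and point to separate_wider_delim() for new code; separate() still works as shown here.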
We will put id, group, and condition back into a single column
You probably won't use this often
id | group | condition | pre_test |
---|---|---|---|
101 | m | ctr | 75 |
102 | m | ctr | 70 |
103 | m | ctr | 65 |
104 | m | exp | 66 |
105 | m | exp | 68 |
106 | m | exp | 58 |
107 | f | ctr | 60 |
108 | f | ctr | 66 |
109 | f | ctr | 69 |
110 | f | exp | 54 |
111 | f | exp | 88 |
112 | f | exp | 44 |
my_data_wide |>
  unite(
    col = id_group_condition,
    c('id', 'group', 'condition'),
    sep = "-"
  )
id_group_condition | pre_test |
---|---|
101-m-ctr | 75 |
102-m-ctr | 70 |
103-m-ctr | 65 |
104-m-exp | 66 |
105-m-exp | 68 |
106-m-exp | 58 |
107-f-ctr | 60 |
108-f-ctr | 66 |
109-f-ctr | 69 |
110-f-exp | 54 |
111-f-exp | 88 |
112-f-exp | 44 |
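A self-contained version of the same idea, assuming a small hand-typed dataframe that already has the three separate columns (the object names are made up for illustration):

```r
library(tidyr)

# Hypothetical input with id, group, and condition already separated
separated_df <- data.frame(
  id = c("101", "104"),
  group = c("m", "m"),
  condition = c("ctr", "exp"),
  pre_test = c(75, 66)
)

# unite() pastes the columns together and, by default, drops the originals
united <- separated_df |>
  unite(col = "id_group_condition", id, group, condition, sep = "-")

united
```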
id | pre_test | post_test |
---|---|---|
101_m_ctr | 75 | 85 |
102_m_ctr | 70 | 80 |
103_m_ctr | 65 | 75 |
104_m_exp | 66 | 76 |
105_m_exp | 68 | 78 |
106_m_exp | 58 | 68 |
107_f_ctr | 60 | 70 |
108_f_ctr | 66 | 76 |
109_f_ctr | 69 | 79 |
110_f_exp | 54 | 64 |
111_f_exp | 88 | 98 |
112_f_exp | 44 | 54 |
What do the columns pre_test and post_test represent?
What is each numeric value?
id | test | score |
---|---|---|
101_m_ctr | pre_test | 75 |
101_m_ctr | post_test | 85 |
102_m_ctr | pre_test | 70 |
102_m_ctr | post_test | 80 |
103_m_ctr | pre_test | 65 |
103_m_ctr | post_test | 75 |
104_m_exp | pre_test | 66 |
104_m_exp | post_test | 76 |
105_m_exp | pre_test | 68 |
105_m_exp | post_test | 78 |
106_m_exp | pre_test | 58 |
106_m_exp | post_test | 68 |
107_f_ctr | pre_test | 60 |
107_f_ctr | post_test | 70 |
108_f_ctr | pre_test | 66 |
108_f_ctr | post_test | 76 |
109_f_ctr | pre_test | 69 |
109_f_ctr | post_test | 79 |
110_f_exp | pre_test | 54 |
110_f_exp | post_test | 64 |
111_f_exp | pre_test | 88 |
111_f_exp | post_test | 98 |
112_f_exp | pre_test | 44 |
112_f_exp | post_test | 54 |
my_data_wide |>
  pivot_longer(
    cols = c("pre_test", "post_test"),
    names_to = "test",
    values_to = "score"
  )
cols is a vector of names of the columns you want to pivot
names_to is the name you will give the column of the factor
values_to is the name you will give the column of observations (numbers)
Install the untidydata package:
remotes::install_github('jvcasillas/untidydata')
Convert the pre_post data set from wide to long
Note: pivot_longer() used to be called gather()
pre_post |> pivot_longer(cols = test1:test2, names_to = "test", values_to = "score")
pre_post |> pivot_longer(cols = -c("id", "spec"), names_to = "test", values_to = "score")
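If you do not have untidydata installed, the three arguments described above (cols, names_to, values_to) can be tried on a tiny hand-typed dataframe. The data below is a two-row excerpt typed in for illustration.

```r
library(tidyr)

# Two participants in wide format: one row each, one column per test
wide_df <- data.frame(
  id = c("101_m_ctr", "102_m_ctr"),
  pre_test = c(75, 70),
  post_test = c(85, 80)
)

long_df <- wide_df |>
  pivot_longer(
    cols = c(pre_test, post_test),  # columns to stack
    names_to = "test",              # old column names go here
    values_to = "score"             # old cell values go here
  )

long_df
```

Each participant now has two rows, one per test, which is why the long version has twice as many rows as the wide one.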
id | test | score |
---|---|---|
101_m_ctr | pre_test | 75 |
101_m_ctr | post_test | 85 |
102_m_ctr | pre_test | 70 |
102_m_ctr | post_test | 80 |
103_m_ctr | pre_test | 65 |
103_m_ctr | post_test | 75 |
104_m_exp | pre_test | 66 |
104_m_exp | post_test | 76 |
105_m_exp | pre_test | 68 |
105_m_exp | post_test | 78 |
106_m_exp | pre_test | 58 |
106_m_exp | post_test | 68 |
107_f_ctr | pre_test | 60 |
107_f_ctr | post_test | 70 |
108_f_ctr | pre_test | 66 |
108_f_ctr | post_test | 76 |
109_f_ctr | pre_test | 69 |
109_f_ctr | post_test | 79 |
110_f_exp | pre_test | 54 |
110_f_exp | post_test | 64 |
111_f_exp | pre_test | 88 |
111_f_exp | post_test | 98 |
112_f_exp | pre_test | 44 |
112_f_exp | post_test | 54 |
We need to spread() this dataframe back to wide format (pivot_wider())
Why might we want to do this?
my_data_long |>
  pivot_wider(names_from = "test", values_from = "score") |>
  ggplot() +
  aes(x = pre_test, y = post_test) +
  geom_vline(xintercept = mean(my_data_wide$pre_test), lty = 3) +
  geom_hline(yintercept = mean(my_data_wide$post_test), lty = 3) +
  geom_point(size = 4) +
  theme_test(base_size = 18, base_family = "Palatino")
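Besides plotting, going wide makes it easy to compute one column from another. A minimal sketch with hand-typed long data; the gain column is a hypothetical example and not part of the original slides.

```r
library(dplyr)
library(tidyr)

# Two participants in long format: one row per participant per test
long_df <- data.frame(
  id = rep(c("101_m_ctr", "102_m_ctr"), each = 2),
  test = rep(c("pre_test", "post_test"), times = 2),
  score = c(75, 85, 70, 80)
)

# Back to wide, then use mutate() on the two test columns
wide_df <- long_df |>
  pivot_wider(names_from = "test", values_from = "score") |>
  mutate(gain = post_test - pre_test)

wide_df
```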
You will probably use pivot_wider() less
It can be useful for making scatterplots and doing data transformations using mutate()
Take a look at the language_diversity data set in untidydata
Spread the data set from long to wide using pivot_wider() and create a plot
Note: pivot_wider() used to be called spread()
language_diversity |>
  pivot_wider(names_from = "Measurement", values_from = "Value") |>
  ggplot() +
  aes(x = log(Area), y = log(Langs), label = Country) +
  geom_text() +
  geom_smooth(method = "glm", method.args = list(family = "poisson"))
Read data with the read_csv() function
Raw data lives in the data folder: data > raw
my_df <- read_csv("./data/raw/raw_data.csv")
Save data with the write_csv() function
Tidy data goes in data > tidy
Use write_csv() right after tidying your data
my_df <- read_csv("./data/raw/raw_data.csv") |>
  mutate(
    new_var = var1 - var2,
    group_sum = if_else(group == "level", -1, 1)
  ) |>
  write_csv(file = "./data/tidy/tidy_data.csv")
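The pipeline above needs a raw_data.csv with specific columns. To try the write and read round trip without that file, here is a sketch that writes a summary of mtcars to a temporary folder; the temporary location and object names are assumptions made so the example runs anywhere.

```r
library(dplyr)
library(readr)

# Mimic the data > tidy folder inside a temporary directory
tidy_dir <- file.path(tempdir(), "data", "tidy")
dir.create(tidy_dir, recursive = TRUE, showWarnings = FALSE)
out_file <- file.path(tidy_dir, "tidy_data.csv")

# Tidy (here: summarize) and save in one pipeline
mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg)) |>
  write_csv(file = out_file)

# Read the file back in to confirm what was saved
check_df <- read_csv(out_file, show_col_types = FALSE)
check_df
```

write_csv() returns its input invisibly, which is why it can sit at the end of a pipeline and still be assigned to an object, as in the slide's example.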
Folders in a path are separated by "/" (see note)
/Users/casillas/Desktop/new_proj
new_proj
Note: This is specific to the operating system. For PCs you use "\".
new_proj as root
.
├── README.md
├── data
│   ├── raw
│   │   ├── EMD_2afc_template_2019-02-14_09h56.54.363.csv
│   │   └── ...
│   └── tidy
│       └── tidy_data.csv
├── my_proj.Rproj
└── scripts
    └── my_script.R
./ = "here"
../ = "Go up one directory"
../../ = "Go up two directories"
here() 😍
Compute a summary (group_by() + summarize()) and save the output as a csv to the data folder
Install here
Load here
Run here(). What happens?
Use here() where previously you used a relative path
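A minimal sketch of what swapping a relative path for here() can look like, assuming the here package is installed and that your folders match the example project tree (data > raw and data > tidy):

```r
library(here)

# here() with no arguments returns the project root that the package detected
project_root <- here()

# Build paths from the root, one folder or file name per argument
raw_path <- here("data", "raw", "raw_data.csv")
tidy_path <- here("data", "tidy", "tidy_data.csv")

raw_path
```

The result is an ordinary character string, so it can be passed straight to read_csv() or write_csv() in place of "./data/raw/raw_data.csv". here() also inserts the path separator for you, which avoids the "/" versus "\" issue.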
you used a relative pathKeyboard shortcuts