class: center, middle, inverse, title-slide

.title[
# Research Methods: Open Science and Reproducible Research in Linguistics
]
.subtitle[
## Welcome to the tidyverse II: Tidying and descriptives
]
.author[
### Joseph V. Casillas, PhD
]
.date[
### Rutgers University<br>Spring 2019<br>Last update: 2025-02-16
]

---
background-color: black
background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/general/memes/sucking2.png)
background-size: contain

---
background-color: black
background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyverse.png), url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/dplyr.png), url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyr.png)
background-size: 500px, 250px, 250px
background-position: 50% 50%, 5% 50%, 95% 50%

---
class: title-slide-section-grey, middle, center
background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/dplyr.png)
background-size: contain

---
layout: true
class: middle

# **select**
# .blue[<u>columns</u>] .grey[of a dataframe]
# .grey[with]
# .big[select()]

---

---
background-image: url(./assets/img/select1.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/select2.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/select3.png)
background-size: 350px
background-position: 80% 50%

---
layout: false

# select()

.big[
- You can select consecutive columns using "`:`" (try)
- You can rename columns directly (`new_name = old_name`)
- Take a look at the `mtcars` dataset using `glimpse()`
- Use the `select()` function to select any 3 variables
- Use the `select()` function to select the last 3 variables
- Use the `select()` function to rename `mpg` to `hello_world`
]

.footnote[`?select()` if you need help]

???
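
A minimal sketch of the difference between renaming inside `select()` and renaming with `rename()`, assuming `dplyr` is installed. The object names `only_renamed`, `all_renamed`, and `last_three` are illustrative and not part of the exercise.

```r
library(dplyr)

# Renaming inside select() keeps only the columns you name
only_renamed <- select(mtcars, hello_world = mpg)

# rename() changes the name but keeps every other column
all_renamed <- rename(mtcars, hello_world = mpg)

# ":" selects a run of consecutive columns, here the last three
last_three <- select(mtcars, am:carb)

names(only_renamed)
names(last_three)
ncol(all_renamed)
```

If the goal is only to rename, `rename()` is usually the safer choice because no columns are dropped along the way.
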

`select(mtcars, mpg, disp, drat)`

`select(mtcars, am:carb)`

`select(mtcars, hello_world = mpg)`

---
layout: true
class: middle

# **filter**
# .blue[<u>rows</u>] .grey[of a dataframe]
# .grey[with]
# .big[filter()]

---

---
background-image: url(./assets/img/filter1.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/filter2.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/filter3.png)
background-size: 600px
background-position: 90% 40%

---
layout: false

# filter()

- You can use logical operators in `filter()`

| Operator   | Function                 |
| :--------: | :----------------------- |
| **<**      | less than                |
| **>**      | greater than             |
| **<=**     | less than or equal to    |
| **>=**     | greater than or equal to |
| **==**     | equal to                 |
| **!=**     | not equal to             |
| **&#124;** | or                       |
| **&**      | and                      |
| **%in%**   | in                       |

- Filter rows in which `mpg` is less than 20 and greater than 14
- Filter rows in which `cyl` is equal to 6
- Filter rows in which `mpg` is greater than 20 or `disp` is less than 200

???

`filter(mtcars, mpg < 20 & mpg > 14)`

`filter(mtcars, cyl == 6)`

`filter(mtcars, mpg > 20 | disp < 200)`

---
layout: true
class: middle

# **arrange**
# .blue[<u>rows</u>] .grey[of a dataframe]
# .grey[with]
# .big[arrange()]

---

---
background-image: url(./assets/img/arrange1.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/arrange2.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/arrange3.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/arrange4.png)
background-size: 600px
background-position: 90% 50%

---
layout: false

# arrange()

.big[
- You probably won't use this very often
- You can arrange using multiple variables
- Arrange the `mtcars` dataset based on `cyl` and `disp`
- Arrange the `mtcars` dataset based on `mpg` from highest to lowest
]

???
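
A short sketch of sorting on two variables at once, mixing ascending and descending order. It assumes `dplyr` is installed; `sorted` is an illustrative name.

```r
library(dplyr)

# Sort by cyl (ascending), then by mpg within each cyl (descending)
sorted <- arrange(mtcars, cyl, desc(mpg))

head(select(sorted, cyl, mpg))
```

Variables are applied left to right, so `mpg` only breaks ties among rows that share the same `cyl`.
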

`arrange(mtcars, cyl, disp)`

`arrange(mtcars, desc(mpg))`

---
layout: true
class: middle

# **mutate**
# .blue[<u>variables</u>] .grey[of a dataframe]
# .grey[with]
# .big[mutate()]

---

---
background-image: url(./assets/img/mutate1.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/mutate2.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/mutate3.png)
background-size: 600px
background-position: 90% 50%

---
layout: false
background-image: url(./assets/img/mutate4.png)
background-size: contain

---
background-image: url(./assets/img/mutate5.png)
background-size: contain

---

# mutate()

- Get comfortable using `mutate()`
- In the `mtcars` dataset, `select` the `mpg` column and then...
  - create a new column called `mpg_x2` that doubles every value of `mpg`

--

  - create a new column called `mpg_c` that centers the mpg data by subtracting the mean value of `mpg` from every value of `mpg`

--

  - **CHALLENGE**: create a new column called `value` that applies the label 'good' to cars that get over 18 mpg and the label 'bad' to cars that get 18 mpg or less

--

**HINT**: Start every attempt in the same way...

.pull-left[
``` r
mtcars |> 
  select(mpg) |> 
* mutate(???)
```
]

???
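
A sketch, assuming `dplyr` is installed, showing that a single `mutate()` call can create several columns and that later columns can use earlier ones. The column `above_mean` and the object `centered` are illustrative, not part of the exercise.

```r
library(dplyr)

centered <- mtcars |>
  select(mpg) |>
  mutate(
    mpg_c = mpg - mean(mpg),  # center mpg around its mean
    above_mean = mpg_c > 0    # uses mpg_c, created on the line above
  )

head(centered)
```
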

```
mtcars |> 
  select(mpg) |> 
  mutate(mpg_x2 = mpg * 2)
```

```
mtcars |> 
  select(mpg) |> 
  mutate(mpg_c = mpg - mean(mpg))
```

```
mtcars |> 
  select(mpg) |> 
  mutate(value = if_else(mpg <= 18, 'bad', 'good'))
```

---
background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/rstats/memes/rstats_case_when.png)
background-size: 600px
background-position: 100% 50%

# Advanced mutations

### mutate() + case_when()

.pull-left[
- Extremely useful when you need to create a new column based on multiple conditions of another column
- Use this if you find yourself using nested `if_else()`
- Each line pairs a logical condition with a result: `condition ~ desired result`
]

---

.pull-left[

### Conditions

- if **age of learning** is .blue[less than] 12 and **L1** is .blue[Spanish], then .green[heritage speaker]
- if **age of learning** is .blue[less than] 12 and **L1** is .blue[English], then .green[early learner]
- if **age of learning** is .blue[greater than] 12, then .green[late learner]
- if **age of learning** is `NA`, then .green[monolingual]

|  id| age_learn_l2|l1 |
|---:|------------:|:--|
| 101|            3|sp |
| 102|            2|sp |
| 103|            3|sp |
| 104|           NA|sp |
| 105|           18|en |
| 106|           17|en |
| 107|            3|en |
| 108|            2|en |
| 109|           NA|en |
| 110|            3|sp |

]

--

.pull-right[

### Code

``` r
case_when_df |> 
  mutate(
*   group = case_when(
      age_learn_l2 < 12 & l1 == 'sp' ~ 'heritage', 
      age_learn_l2 < 12 & l1 == 'en' ~ 'early_learner', 
      age_learn_l2 > 12 ~ 'late_learner', 
      is.na(age_learn_l2) ~ 'monolingual'
    )
  ) |> 
  knitr::kable()
```

|  id| age_learn_l2|l1 |group         |
|---:|------------:|:--|:-------------|
| 101|            3|sp |heritage      |
| 102|            2|sp |heritage      |
| 103|            3|sp |heritage      |
| 104|           NA|sp |monolingual   |
| 105|           18|en |late_learner  |
| 106|           17|en |late_learner  |
| 107|            3|en |early_learner |
| 108|            2|en |early_learner |
| 109|           NA|en |monolingual   |
| 110|            3|sp |heritage      |

]

---
layout: true
class: middle

.pull-left[
# **summarize**
# .blue[<u>variables</u>] .grey[of a dataframe]
# .grey[with]
# .big[summarize()]
]

---

---
background-image: url(./assets/img/summarize1.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/summarize2.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/summarize3.png)
background-size: 600px
background-position: 90% 50%

--

.pull-right[
.footnote[
- `summarize()` typically reduces the number of rows in your dataframe (one row per group)
- `summarize()` is often accompanied by the helper function `group_by()`
]
]

---
background-image: url(./assets/img/summarize4.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/summarize5.png)
background-size: 600px
background-position: 90% 50%

---
layout: false

# summarize() <br> group_by() |> summarize()

### Note

- Get accustomed to using these two functions; they are extremely useful
- Remember that `summarize()` reduces the number of rows in your dataframe
- Remember that `mutate()` adds a column to your dataframe
- You can include more than one summary statistic inside `summarize()`

--

### Practice

- Calculate the mean value of `mpg` in the dataset `mtcars`
- Calculate the mean value of `mpg` as a function of `cyl`
- Calculate the mean, standard deviation, min, and max of `mpg` as a function of `cyl`

???

```
mtcars |> 
  group_by(cyl) |> 
  summarize(
    mean_mpg = mean(mpg), 
    sd_mpg = sd(mpg), 
    min_mpg = min(mpg), 
    max_mpg = max(mpg)
  )
```

---
background-color: black
background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/rstats/memes/r_tidy_verbs.png)
background-size: contain
background-position: 100% 50%

# Summary

---
class: title-slide-section-grey, center, middle
background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyr.png)
background-size: contain

---
background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyr.png)
background-size: 200px
background-position: 95% 5%

# tidyr

### What is `tidyr`?

- A package that is part of the `tidyverse`
- Contains functions (verbs) that are helpful for tidying (cleaning, munging) data

### What is tidy data?

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
- (Most) functions in R are designed to work with tidy data
- It is imperative that you learn how to tidy your data

---
layout: true

# What does untidy data look like?

- https://www.jvcasillas.com/untidydata/

---

<iframe src="https://www.jvcasillas.com/untidydata" style="border:none;" height="400" width="100%"></iframe>

---

.pull-left[

|id     |spec  | test1| test2|
|:------|:-----|-----:|-----:|
|span01 |g1_lo | 64.31|  69.2|
|span02 |g1_lo | 59.81|  63.7|
|span03 |g1_hi | 66.08|  70.9|
|span04 |g1_hi | 72.78|  79.2|
|span05 |g2_lo | 68.29|  75.4|
|span06 |g2_lo | 69.22|  76.7|
|span07 |g2_hi | 71.36|  77.2|
|span08 |g2_hi | 80.37|  88.9|
|cata01 |g1_lo | 75.63|  83.6|
|cata02 |g1_lo | 71.25|  78.8|
|cata03 |g1_hi | 69.09|  74.6|
|cata04 |g1_hi | 72.35|  80.7|
|cata05 |g2_lo | 71.66|  77.9|
|cata06 |g2_lo | 69.01|  75.0|
|cata07 |g2_hi | 69.86|  76.0|

]

.pull-right[

### `pre_post`

- How many columns are there?
- How many variables are there? What are they?
- How many observations are there per row?

]

---
layout: false
class: middle, center

# **separate**
# .blue[<u>elements</u>] .grey[of a variable]
# .grey[with]
# .big[separate()]

---

# separate()

.pull-left[

- This is untidy data
- How many variables does the column `id` contain?

|id        | pre_test|
|:---------|--------:|
|101_m_ctr |       75|
|102_m_ctr |       70|
|103_m_ctr |       65|
|104_m_exp |       66|
|105_m_exp |       68|
|106_m_exp |       58|
|107_f_ctr |       60|
|108_f_ctr |       66|
|109_f_ctr |       69|
|110_f_exp |       54|
|111_f_exp |       88|
|112_f_exp |       44|

]

--

.pull-right[

``` r
my_data_wide |> 
  separate(
    col = id, 
    into = c('id', 'group', 'condition'), 
    sep = "_"
  )
```

|id  |group |condition | pre_test|
|:---|:-----|:---------|--------:|
|101 |m     |ctr       |       75|
|102 |m     |ctr       |       70|
|103 |m     |ctr       |       65|
|104 |m     |exp       |       66|
|105 |m     |exp       |       68|
|106 |m     |exp       |       58|
|107 |f     |ctr       |       60|
|108 |f     |ctr       |       66|
|109 |f     |ctr       |       69|
|110 |f     |exp       |       54|
|111 |f     |exp       |       88|
|112 |f     |exp       |       44|

]

---
class: middle, center

# **unite**
# .blue[<u>columns</u>] .grey[into a variable]
# .grey[with]
# .big[unite()]

---

# unite()

.pull-left[

- We will put `id`, `group`, and `condition` back into a single column
- You probably won't use this often

|id  |group |condition | pre_test|
|:---|:-----|:---------|--------:|
|101 |m     |ctr       |       75|
|102 |m     |ctr       |       70|
|103 |m     |ctr       |       65|
|104 |m     |exp       |       66|
|105 |m     |exp       |       68|
|106 |m     |exp       |       58|
|107 |f     |ctr       |       60|
|108 |f     |ctr       |       66|
|109 |f     |ctr       |       69|
|110 |f     |exp       |       54|
|111 |f     |exp       |       88|
|112 |f     |exp       |       44|

]

--

.pull-right[

``` r
my_data_wide |> 
  unite(
    col = id_group_condition, 
    c('id', 'group', 'condition'), 
    sep = "-"
  )
```

|id_group_condition | pre_test|
|:------------------|--------:|
|101-m-ctr          |       75|
|102-m-ctr          |       70|
|103-m-ctr          |       65|
|104-m-exp          |       66|
|105-m-exp          |       68|
|106-m-exp          |       58|
|107-f-ctr          |       60|
|108-f-ctr          |       66|
|109-f-ctr          |       69|
|110-f-exp          |       54|
|111-f-exp          |       88|
|112-f-exp          |       44|

]

---
class: title-slide-section-grey, center, middle

# What if we have more than one observation per row?
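
???

One way to answer the question on this slide is to reshape the data so that each row holds a single observation. A minimal sketch, assuming `tidyr` and `tibble` are installed; `wide` is a hypothetical two-participant subset of the pre/post data shown on the following slides, and `long` is an illustrative name.

```r
library(tidyr)
library(tibble)

# Hypothetical subset: two participants, two observations per row
wide <- tibble(
  id        = c("101_m_ctr", "102_m_ctr"),
  pre_test  = c(75, 70),
  post_test = c(85, 80)
)

# One row per participant per test: column names move into `test`,
# cell values move into `score`
long <- wide |>
  pivot_longer(
    cols = c(pre_test, post_test),
    names_to = "test",
    values_to = "score"
  )

long
```

Each participant now appears on two rows, one per test, which is the shape most modeling and plotting functions expect.
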
--- class: middle, center # **pivot_longer()** # .blue[<u>dataframes</u>] .grey[from] wide .grey[to] long # .grey[with] # .big[pivot_longer()] --- .pull-left[ <table class="table" style="color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> id </th> <th style="text-align:left;"> pre_test </th> <th style="text-align:left;"> post_test </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 101_m_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">75</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">85</span> </td> </tr> <tr> <td style="text-align:left;"> 102_m_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">70</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">80</span> </td> </tr> <tr> <td style="text-align:left;"> 103_m_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">65</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">75</span> </td> </tr> <tr> <td style="text-align:left;"> 104_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">66</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">76</span> </td> </tr> <tr> <td style="text-align:left;"> 105_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">68</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">78</span> </td> </tr> <tr> <td style="text-align:left;"> 106_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">58</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">68</span> </td> </tr> <tr> <td style="text-align:left;"> 107_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">60</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">70</span> </td> </tr> <tr> <td 
style="text-align:left;"> 108_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">66</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">76</span> </td> </tr> <tr> <td style="text-align:left;"> 109_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">69</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">79</span> </td> </tr> <tr> <td style="text-align:left;"> 110_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">54</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">64</span> </td> </tr> <tr> <td style="text-align:left;"> 111_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">88</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">98</span> </td> </tr> <tr> <td style="text-align:left;"> 112_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">44</span> </td> <td style="text-align:left;"> <span style=" color: blue !important;">54</span> </td> </tr> </tbody> </table> - What do the columns `pre_test` and `post_test` represent? - What is each numeric value? 
] -- .pull-right[ <table class="table" style="color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> id </th> <th style="text-align:left;"> test </th> <th style="text-align:right;"> score </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 101_m_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 75 </td> </tr> <tr> <td style="text-align:left;"> 101_m_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 85 </td> </tr> <tr> <td style="text-align:left;"> 102_m_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 70 </td> </tr> <tr> <td style="text-align:left;"> 102_m_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 80 </td> </tr> <tr> <td style="text-align:left;"> 103_m_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 65 </td> </tr> <tr> <td style="text-align:left;"> 103_m_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 75 </td> </tr> <tr> <td style="text-align:left;"> 104_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 66 </td> </tr> <tr> <td style="text-align:left;"> 104_m_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:left;"> 105_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> 105_m_exp 
</td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 78 </td> </tr> <tr> <td style="text-align:left;"> 106_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 58 </td> </tr> <tr> <td style="text-align:left;"> 106_m_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> 107_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 60 </td> </tr> <tr> <td style="text-align:left;"> 107_f_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 70 </td> </tr> <tr> <td style="text-align:left;"> 108_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 66 </td> </tr> <tr> <td style="text-align:left;"> 108_f_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:left;"> 109_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 69 </td> </tr> <tr> <td style="text-align:left;"> 109_f_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 79 </td> </tr> <tr> <td style="text-align:left;"> 110_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> 110_f_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td 
style="text-align:right;"> 64 </td> </tr> <tr> <td style="text-align:left;"> 111_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 88 </td> </tr> <tr> <td style="text-align:left;"> 111_f_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 98 </td> </tr> <tr> <td style="text-align:left;"> 112_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:left;"> 112_f_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 54 </td> </tr> </tbody> </table> ] --- class: middle ``` r my_data_wide |> pivot_longer( cols = c("pre_test", "post_test"), names_to = "test", values_to = "score" ) ``` <table class="table" style="color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> id </th> <th style="text-align:left;"> test </th> <th style="text-align:right;"> score </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 101_m_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 75 </td> </tr> <tr> <td style="text-align:left;"> 101_m_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 85 </td> </tr> <tr> <td style="text-align:left;"> 102_m_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 70 </td> </tr> <tr> <td style="text-align:left;"> 102_m_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 80 </td> </tr> <tr> <td style="text-align:left;"> 103_m_ctr </td> <td 
style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 65 </td> </tr> <tr> <td style="text-align:left;"> 103_m_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 75 </td> </tr> <tr> <td style="text-align:left;"> 104_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 66 </td> </tr> <tr> <td style="text-align:left;"> 104_m_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:left;"> 105_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> 105_m_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 78 </td> </tr> <tr> <td style="text-align:left;"> 106_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 58 </td> </tr> <tr> <td style="text-align:left;"> 106_m_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> 107_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 60 </td> </tr> <tr> <td style="text-align:left;"> 107_f_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 70 </td> </tr> <tr> <td style="text-align:left;"> 108_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 66 
</td> </tr> <tr> <td style="text-align:left;"> 108_f_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:left;"> 109_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 69 </td> </tr> <tr> <td style="text-align:left;"> 109_f_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 79 </td> </tr> <tr> <td style="text-align:left;"> 110_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> 110_f_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 64 </td> </tr> <tr> <td style="text-align:left;"> 111_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 88 </td> </tr> <tr> <td style="text-align:left;"> 111_f_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 98 </td> </tr> <tr> <td style="text-align:left;"> 112_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:left;"> 112_f_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 54 </td> </tr> </tbody> </table> --- background-image: url(./assets/img/gather.gif) background-size: contain .footnote[https://alison.rbind.io] --- # pivot_longer() ### Note - You will have to do this often - Remember... 
- `cols` is a vector of names of the columns you want to pivot - `names_to` is the name you will give the column of the factor - `values_to` is the name you will give the column of observations (numbers) ### Practice - Download the `untidydata` package: `remotes::install_github('jvcasillas/untidydata')` - Load the package and convert the `pre_post` data set from wide to long - Include the relevant variables or - Exclude the irrelevant variable .footnote[<sup>1</sup> This function used to be called `gather()`] ??? ``` pre_post |> pivot_longer(cols = test1:test2, names_to = "test", values_to = "score") ``` ``` pre_post |> pivot_longer(cols = -c("id", "spec"), names_to = "test", values_to = "score") ``` --- class: title-slide-section-grey, center, middle # What if we want a wide data set? --- class: middle, center # **pivot_wider** # .blue[<u>dataframes</u>] .grey[from] long .grey[to] wide # .grey[with] # .big[pivot_wider()] --- .pull-left[ <table class="table" style="color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> id </th> <th style="text-align:left;"> test </th> <th style="text-align:right;"> score </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 101_m_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 75 </td> </tr> <tr> <td style="text-align:left;"> 101_m_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 85 </td> </tr> <tr> <td style="text-align:left;"> 102_m_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 70 </td> </tr> <tr> <td style="text-align:left;"> 102_m_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 80 </td> </tr> <tr> <td style="text-align:left;"> 103_m_ctr </td> 
<td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 65 </td> </tr> <tr> <td style="text-align:left;"> 103_m_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 75 </td> </tr> <tr> <td style="text-align:left;"> 104_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 66 </td> </tr> <tr> <td style="text-align:left;"> 104_m_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:left;"> 105_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> 105_m_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 78 </td> </tr> <tr> <td style="text-align:left;"> 106_m_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 58 </td> </tr> <tr> <td style="text-align:left;"> 106_m_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 68 </td> </tr> <tr> <td style="text-align:left;"> 107_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 60 </td> </tr> <tr> <td style="text-align:left;"> 107_f_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 70 </td> </tr> <tr> <td style="text-align:left;"> 108_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td 
style="text-align:right;"> 66 </td> </tr> <tr> <td style="text-align:left;"> 108_f_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:left;"> 109_f_ctr </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 69 </td> </tr> <tr> <td style="text-align:left;"> 109_f_ctr </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 79 </td> </tr> <tr> <td style="text-align:left;"> 110_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 54 </td> </tr> <tr> <td style="text-align:left;"> 110_f_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 64 </td> </tr> <tr> <td style="text-align:left;"> 111_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 88 </td> </tr> <tr> <td style="text-align:left;"> 111_f_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 98 </td> </tr> <tr> <td style="text-align:left;"> 112_f_exp </td> <td style="text-align:left;"> <span style=" color: red !important;">pre_test</span> </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:left;"> 112_f_exp </td> <td style="text-align:left;"> <span style=" color: blue !important;">post_test</span> </td> <td style="text-align:right;"> 54 </td> </tr> </tbody> </table> ] .pull-right[ <br><br><br><br><br> - We need to `spread()` this dataframe back to wide format (`pivot_wider`) - Why might we want to do this? 
]

---

``` r
my_data_long |> 
* pivot_wider(names_from = "test", values_from = "score") |> 
  ggplot() + 
  aes(x = pre_test, y = post_test) + 
  geom_vline(xintercept = mean(my_data_wide$pre_test), lty = 3) + 
  geom_hline(yintercept = mean(my_data_wide$post_test), lty = 3) + 
  geom_point(size = 4) + 
  theme_test(base_size = 18, base_family = "Palatino")
```

<img src="index_files/figure-html/spread-ex2-1.png" width="936" style="display: block; margin: auto;" />

---

# pivot_wider()

### Note

- You will probably use `pivot_wider()` less often than `pivot_longer()`
- It can be useful for making scatterplots and doing data transformations using `mutate()`

### Exercise

- Take a look at the `language_diversity` data set in `untidydata`
- Spread the data set from long to wide using `pivot_wider()` and create a plot

.footnote[<sup>1</sup> This function used to be called `spread()`]

???

```
language_diversity |> 
  pivot_wider(names_from = "Measurement", values_from = "Value") |> 
  ggplot() + 
  aes(x = log(Area), y = log(Langs), label = Country) + 
  geom_text() + 
  geom_smooth(method = "glm", method.args = list(family = "poisson"))
```

---
background-image: url(./assets/img/spread_gather.gif)
background-size: contain

---

<!--
background-color: black
background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/rstats/memes/rstats_load_all1.png)
background-size: contain

background-color: black
background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/rstats/memes/rstats_load_all2.png)
background-size: contain
-->

# Loading and saving data

.pull-left[

### read_csv()

- Read .csv files into R using the `read_csv()` function
- Ideally your data is stored in a folder of your project called `data`
- If it is raw data you can use sub-directories, i.e., `data > raw`
- You can pipe directly into any other verbs to tidy your data
- Ex.
``` r
my_df <- read_csv("./data/raw/raw_data.csv")
```
]

.pull-right[
### write_csv()

- Save dataframes as .csv files using the `write_csv()` function
- After tidying your data you can (should) save it
- Keep this data separate from your raw data, i.e., `data > tidy`
- You can pipe into `write_csv()` right after tidying your data
- Ex.

``` r
my_df <- read_csv("./data/raw/raw_data.csv") |>
  mutate(
    new_var = var1 - var2,
    group_sum = if_else(group == "level", -1, 1)
  ) |>
  write_csv(file = "./data/tidy/tidy_data.csv")
```
]

---
class: title-slide-section-grey, center, middle

# A note about paths

---

# System paths

- what are they?
- how do they work?
  - relative paths
  - absolute paths
- what problems do they create?
- what are the solutions?

---

# System paths

### What are they?

- Your computer is a hierarchical system of directories (folders) and files
- You can think of it as a garden of forking paths
- The top of this hierarchy is the **root**
- The *path* from **root** to a given file is an .blue[absolute path]

---
background-image: url(https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSbaVWfe2qSnmcNes6NZADG-QI6o-3JaTjsc6-hC7Hv08turZmSgg)
background-size: 400px
background-position: 95% 10%

# System paths

### How do they work?

.pull-left[
- The user defines the system-specific path
- Every time one "enters" a directory the path is marked with "**`/`**"<sup>1</sup>
- Ex. `/Users/casillas/Desktop/new_proj`
- This **absolute path** goes from my system root to a directory on my desktop called `new_proj`

.footnote[<sup>1</sup>This is specific to the operating system. On Windows you use "**`\`**".]
]

--

<br><br><br><br><br><br>

.pull-right[
- We can simplify the hierarchy by using .blue[relative paths]
- With a .blue[relative path] the user specifies what the root is and all paths are *relative* to that root
]

---

## `new_proj` as root

.big[
**.**
├── README.md
├── data
│   ├── raw
│   │   ├── EMD_2afc_template_2019-02-14_09h56.54.363.csv
│   │   └── ...
│   └── tidy
│       └── tidy_data.csv
├── my_proj.Rproj
└── scripts
    └── my_script.R
]

---

# System paths

### What problems do they create?

- An **absolute path** can get long (and annoying) fast
- Your file system has different paths than mine, so absolute paths break when code is shared

--

### What are the solutions?

- Always use relative paths!
  - **`./`** = "here"
  - **`../`** = "Go up one directory"
  - **`../../`** = "Go up two directories"
- Always use RStudio projects 👍
- Use RStudio projects + `here()` 😍

---

# System paths

### Exercise I

- Download this repo: https://github.com/jvcasillas/new_proj
- Load the data using an absolute path
- Reload the data using a relative path
- Calculate a summary of the data (`group_by() + summarize()`) and save the output as a .csv in the data folder
- Load the new .csv
- Move the .csv files to the project root and reload them

--

### Exercise II

- Install `here`
- Load `here`
- Run `here()`. What happens?
- Load the data as you did before but use `here()` where previously you used a relative path

---
class: title-slide-final, middle
background-image: url(https://github.com/jvcasillas/ru_xaringan/raw/master/img/logo/ru_shield.png), url(https://www.r-project.org/Rlogo.png)
background-size: 55px, 100px
background-position: 9% 15%, 89% 15%

# Getting help

## If you have problems getting or tidying your data
## ask for help in the Slack channel

### You can find some very basic tutorials related to
### R, RStudio, RMarkdown, GitHub, and Slack [here][here]

[here]: http://www.jvcasillas.com/ru_teaching/ru_spanish_589/589_01_s2018/sources/tuts/index.html
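
---

# System paths

### Example

A minimal sketch of the difference between absolute and relative paths described in these slides. It rests on several assumptions: the project is a hypothetical stand-in for `new_proj` created in a temporary directory, `setwd()` is used only to imitate opening the RStudio project, and base R's `read.csv()` and `write.csv()` stand in for `read_csv()` and `write_csv()` so that nothing needs to be installed.

```r
# Hypothetical project layout, built in a temporary directory so that
# no real files are touched.
proj <- file.path(tempdir(), "new_proj")
dir.create(file.path(proj, "data", "tidy"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(proj, "scripts"), showWarnings = FALSE)

# Save a small built-in data set so there is something to load.
abs_path <- file.path(proj, "data", "tidy", "tidy_data.csv")
write.csv(head(mtcars), abs_path, row.names = FALSE)

# Absolute path: works from any working directory, but only on this machine.
df_abs <- read.csv(abs_path)

# Relative path with "./": resolved against the working directory,
# which here is the project root.
old_wd <- setwd(proj)
df_rel <- read.csv("./data/tidy/tidy_data.csv")

# Relative path with "../": go up one directory from scripts/ first.
setwd(file.path(proj, "scripts"))
df_up <- read.csv("../data/tidy/tidy_data.csv")

# Restore the original working directory.
setwd(old_wd)

# All three calls read the same file.
identical(df_abs, df_rel)
identical(df_abs, df_up)
```

In a real RStudio project the `setwd()` calls should not be needed: the project sets the working directory to the project root, and `here("data", "tidy", "tidy_data.csv")` should build the same path regardless of where the script lives.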