Research Methods: Open Science and Reproducible Research in Linguistics

class: center, middle, inverse, title-slide

.title[
# Research Methods: Open Science and Reproducible Research in Linguistics
]
.subtitle[
## Welcome to the tidyverse II: Tidying and descriptives
]
.author[
### Joseph V. Casillas, PhD
]
.date[
### Rutgers UniversitySpring 2019Last update: 2025-02-16
]

---

background-color: black
background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/general/memes/sucking2.png)
background-size: contain

---
background-color: black
background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyverse.png), url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/dplyr.png), url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyr.png)
background-size: 500px, 250px, 250px
background-position: 50% 50%, 5% 50%, 95% 50%

---
class: title-slide-section-grey, middle, center
background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/dplyr.png)
background-size: contain

---
layout: true
class: middle

# **select**

# .blue[columns] .grey[of a dataframe]

# .grey[with]

# .big[select()]

---

---
background-image: url(./assets/img/select1.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/select2.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/select3.png)
background-size: 350px
background-position: 80% 50%

---
layout: false

# select()

.big[

- You can select consecutive columns using "`:`" (try)

- You can rename columns directly (new_name = old_name)

- Take a look at the `mtcars` dataset using `glimpse()`

- Use the `select()` function to select any 3 variables

- Use the `select()` function to select the last 3 variables

- Use the `select()` function to rename `mpg` to `hello_world`

]

.footnote[`?select()` if you need help]

???

`select(mtcars, mpg, disp, drat)`

`select(mtcars, am:carb)`

`select(mtcars, hello_world = mpg)`

---
layout: true
class: middle

# **filter**

# .blue[rows] .grey[of a dataframe]

# .grey[with]

# .big[filter()]

---

---
background-image: url(./assets/img/filter1.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/filter2.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/filter3.png)
background-size: 600px
background-position: 90% 40%

---
layout: false

# filter()

- You can use logical operators in filter

| Operator | function |
| :--------: | :----------------------- |
| **<** | less than |
| **>** | greater than |
| **<=** | less than or equal to |
| **>=** | greater than or equal to |
| **==** | equal to |
| **!=** | not equal to |
| **&#124;** | or |
| **&** | and |
| **%in%** | in |

- Filter rows in which `mpg` is less than 20 and greater than 14
- Filter rows in which `cyl` is equal to 6
- Filter rows in which `mpg` is greater than 20 or `disp` is less than 
200

???

`filter(mtcars, mpg < 20 & mpg > 14)`

`filter(mtcars, cyl == 6)`

`filter(mtcars, mpg > 20 | disp < 200)`

---
layout: true
class: middle

# **arrange**

# .blue[rows] .grey[of a dataframe]

# .grey[with]

# .big[arrange()]

---

---
background-image: url(./assets/img/arrange1.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/arrange2.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/arrange3.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/arrange4.png)
background-size: 600px
background-position: 90% 50%

---
layout: false

# arrange()

.big[

- You probably won't use this very often

- You can arrange using multiple variables

- Arrange the `mtcars` dataset based on `cyl` and `disp`

- Arrange the `mtcars` dataset based on `mpg` from highest to lowest

]

???

`arrange(mtcars, cyl, disp)`

`arrange(mtcars, desc(mpg))`

---
layout: true
class: middle

# **mutate**

# .blue[variables] .grey[of a dataframe]

# .grey[with]

# .big[mutate()]

---

---
background-image: url(./assets/img/mutate1.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/mutate2.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/mutate3.png)
background-size: 600px
background-position: 90% 50%

---
layout: false
background-image: url(./assets/img/mutate4.png)
background-size: contain

---
background-image: url(./assets/img/mutate5.png)
background-size: contain

---

# mutate()

- Get comfortable using `mutate()`

- In the `mtcars` dataset, `select` the `mpg` column and then...
  - create a new column called `mpg_x2` that doubles every value in the dataframe
--

- create a new column called `mpg_c` that centers the mpg data by subtracting the mean value of `mpg` from every value in the dataframe
--

- **CHALLENGE**: create a new column called `value` that applies the label 'good' to cars that get over 18 mpg and the label 'bad' to cars that get 18 mpg or less

**HINT**:  
Start every attempt in the same way...

.pull-left[

``` r
mtcars |> 
  select(mpg) |> 
* mutate(???)
```

]

???

```
mtcars |> 
  select(mpg) |> 
  mutate(mpg_x2 = mpg * 2)
```

```
mtcars |> 
  select(mpg) |> 
  mutate(mpg_c = mpg - mean(mpg))
```

```
mtcars |> 
 select(mpg) |> 
 mutate(value = if_else(mpg <= 18, 'bad', 'good'))
```

---
background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/rstats/memes/rstats_case_when.png)
background-size: 600px
background-position: 100% 50%

# Advanced mutations

### mutate() + case_when()

.pull-left[

- Extremely useful when you need to create a new column based on multiple 
conditions of another column

- Use this if you find yourself using nested `if_else()`

- Syntax uses logical operators:  
`condition ~ desired result`

]

---

.pull-left[

### Conditions

- if **age of learning** is .blue[less than] 12 and **L1** is .blue[Spanish], then .green[heritage speaker]
- if **age of learning** is .blue[less than] 12 and **L1** is .blue[English], then .green[early learner]
- if **age of learning** is .blue[greater than] 12, then .green[late learner]
- if **age of learning** is `NA`, then .green[monolingual]

|  id| age_learn_l2|l1 |
|---:|------------:|:--|
| 101|            3|sp |
| 102|            2|sp |
| 103|            3|sp |
| 104|           NA|sp |
| 105|           18|en |
| 106|           17|en |
| 107|            3|en |
| 108|            2|en |
| 109|           NA|en |
| 110|            3|sp |

]

.pull-right[

### Code

``` r
case_when_df |> 
mutate(
* group = case_when(
 age_learn_l2 < 12 & l1 == 'sp' ~ 'heritage', 
 age_learn_l2 < 12 & l1 == 'en' ~ 'early_learner', 
 age_learn_l2 > 12 ~ 'late_learner', 
 is.na(age_learn_l2) ~ 'monolingual'
 )
) |> 
knitr::kable()
```

|  id| age_learn_l2|l1 |group         |
|---:|------------:|:--|:-------------|
| 101|            3|sp |heritage      |
| 102|            2|sp |heritage      |
| 103|            3|sp |heritage      |
| 104|           NA|sp |monolingual   |
| 105|           18|en |late_learner  |
| 106|           17|en |late_learner  |
| 107|            3|en |early_learner |
| 108|            2|en |early_learner |
| 109|           NA|en |monolingual   |
| 110|            3|sp |heritage      |

]

---
layout: true
class: middle

.pull-left[

# **summarize**

# .blue[variables] .grey[of a dataframe]

# .grey[with]

# .big[summarize()]

]

---

---
background-image: url(./assets/img/summarize1.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/summarize2.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/summarize3.png)
background-size: 600px
background-position: 90% 50%

.pull-right[
.footnote[
- `summarize()` will always reduce the number of rows in your dataframe
- `summarize()` is often accompanied by the helper function `group_by()`
]
]

---
background-image: url(./assets/img/summarize4.png)
background-size: 600px
background-position: 90% 50%

---
background-image: url(./assets/img/summarize5.png)
background-size: 600px
background-position: 90% 50%

---
layout: false

# summarize() group_by |> summarize()

### Note

- Get accustomed to using these two functions, they are extremely useful
- Remember that `summarize()` reduces the number of rows in your dataframe
- Remember that `mutate()` adds a column to your dataframe
- You can include more than one summary statistic inside `summarize()`

### Practice

- Calculate the mean value of `mpg` in the dataset `mtcars`
- Calculate the mean value of `mpg` as a function of `cyl`
- Calculate the mean, standard deviation, min, and max of `mpg` as 
a function of `cyl`

???

```
mtcars |> 
  group_by(cyl) |> 
  summarize(
    mean_mpg = mean(mpg), 
    sd_mpg = sd(mpg), 
    min_mpg = min(mpg), 
    max_mpg = max(mpg)
  )
```

---
background-color: black
background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/rstats/memes/r_tidy_verbs.png)
background-size: contain
background-position: 100% 50%

# Summary

---
class: title-slide-section-grey, center, middle
background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyr.png)
background-size: contain

---
background-image: url(https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidyr.png)
background-size: 200px
background-position: 95% 5%

# tidyr

### What is `tidyr`?

- A package that is part of the `tidyverse`

- Contains functions (verbs) that are helpful for tidying (cleaning, 
munging) data

### What is tidy data?

- Each variable must have its own column.

- Each observation must have its own row.

- Each value must have its own cell.

- (most) functions in R are designed to work with tidy data

- It is imperative that you learn how to tidy your data

---
layout: true

# What does untidy data look like?

- https://www.jvcasillas.com/untidydata/

---

---

.pull-left[

|id     |spec  | test1| test2|
|:------|:-----|-----:|-----:|
|span01 |g1_lo | 64.31|  69.2|
|span02 |g1_lo | 59.81|  63.7|
|span03 |g1_hi | 66.08|  70.9|
|span04 |g1_hi | 72.78|  79.2|
|span05 |g2_lo | 68.29|  75.4|
|span06 |g2_lo | 69.22|  76.7|
|span07 |g2_hi | 71.36|  77.2|
|span08 |g2_hi | 80.37|  88.9|
|cata01 |g1_lo | 75.63|  83.6|
|cata02 |g1_lo | 71.25|  78.8|
|cata03 |g1_hi | 69.09|  74.6|
|cata04 |g1_hi | 72.35|  80.7|
|cata05 |g2_lo | 71.66|  77.9|
|cata06 |g2_lo | 69.01|  75.0|
|cata07 |g2_hi | 69.86|  76.0|
]

.pull-right[

### `pre_post`

- How many columns are there?

- How many variables are there? What are they?

- How many observations are there per row?

]

---
layout: false
class: middle, center

# **separate**

# .blue[elements] .grey[of a variable]

# .grey[with]

# .big[separate()]

---

# separate()

.pull-left[

- This is untidy data

- How many variables does the column `id` contain?

|id        | pre_test|
|:---------|--------:|
|101_m_ctr |       75|
|102_m_ctr |       70|
|103_m_ctr |       65|
|104_m_exp |       66|
|105_m_exp |       68|
|106_m_exp |       58|
|107_f_ctr |       60|
|108_f_ctr |       66|
|109_f_ctr |       69|
|110_f_exp |       54|
|111_f_exp |       88|
|112_f_exp |       44|

]

.pull-right[

``` r
my_data_wide |> 
  separate(
    col = id, 
    into = c('id', 'group', 'condition'), 
    sep = "_"
  ) 
```

|id  |group |condition | pre_test|
|:---|:-----|:---------|--------:|
|101 |m     |ctr       |       75|
|102 |m     |ctr       |       70|
|103 |m     |ctr       |       65|
|104 |m     |exp       |       66|
|105 |m     |exp       |       68|
|106 |m     |exp       |       58|
|107 |f     |ctr       |       60|
|108 |f     |ctr       |       66|
|109 |f     |ctr       |       69|
|110 |f     |exp       |       54|
|111 |f     |exp       |       88|
|112 |f     |exp       |       44|
]

---
class: middle, center

# **unite**

# .blue[columns] .grey[into a variable]

# .grey[with]

# .big[unite()]

---

# unite()

.pull-left[

- We will put `id`, `group`, and `condition` back into a single column

- You probably won't use this often

]

.pull-right[

``` r
my_data_wide |> 
  unite(
    col = id_group_condition, 
    c('id', 'group', 'condition'), 
    sep = "-"
  )
```

|id_group_condition | pre_test|
|:------------------|--------:|
|101-m-ctr          |       75|
|102-m-ctr          |       70|
|103-m-ctr          |       65|
|104-m-exp          |       66|
|105-m-exp          |       68|
|106-m-exp          |       58|
|107-f-ctr          |       60|
|108-f-ctr          |       66|
|109-f-ctr          |       69|
|110-f-exp          |       54|
|111-f-exp          |       88|
|112-f-exp          |       44|

]

---
class: title-slide-section-grey, center, middle

# What if we have more than one observation per row?

---
class: middle, center

# **pivot_longer()**

# .blue[dataframes] .grey[from] wide .grey[to] long

# .grey[with]

# .big[pivot_longer()]

---

- What do the columns `pre_test` and `post_test` represent?

- What is each numeric value?

]

.pull-right[
<table class="table" style="color: black; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
 <th style="text-align:left;"> id </th>
 <th style="text-align:left;"> test </th>
 <th style="text-align:right;"> score </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> 101_m_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 75 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 101_m_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 85 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 102_m_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 70 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 102_m_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 80 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 103_m_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 65 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 103_m_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 75 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 104_m_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 66 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 104_m_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 76 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 105_m_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 68 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 105_m_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 78 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 106_m_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 58 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 106_m_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 68 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 107_f_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 60 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 107_f_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 70 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 108_f_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 66 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 108_f_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 76 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 109_f_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 69 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 109_f_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 79 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 110_f_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 54 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 110_f_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 64 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 111_f_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 88 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 111_f_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 98 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 112_f_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 44 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 112_f_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 54 </td>
 </tr>
</tbody>
</table>
]

---
class: middle

``` r
my_data_wide |> 
  pivot_longer(
    cols = c("pre_test", "post_test"), 
    names_to = "test", 
    values_to = "score"
  )
```

<table class="table" style="color: black; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
 <th style="text-align:left;"> id </th>
 <th style="text-align:left;"> test </th>
 <th style="text-align:right;"> score </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> 101_m_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 75 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 101_m_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 85 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 102_m_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 70 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 102_m_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 80 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 103_m_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 65 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 103_m_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 75 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 104_m_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 66 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 104_m_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 76 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 105_m_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 68 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 105_m_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 78 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 106_m_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 58 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 106_m_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 68 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 107_f_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 60 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 107_f_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 70 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 108_f_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 66 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 108_f_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 76 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 109_f_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 69 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 109_f_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 79 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 110_f_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 54 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 110_f_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 64 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 111_f_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 88 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 111_f_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 98 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 112_f_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 44 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 112_f_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 54 </td>
 </tr>
</tbody>
</table>

---
background-image: url(./assets/img/gather.gif)
background-size: contain

.footnote[https://alison.rbind.io]

---

# pivot_longer()

### Note

- You will have to do this often
- Remember... 
    - `cols` is a vector of names of the columns you want to pivot
    - `names_to` is the name you will give the column of the factor
    - `values_to` is the name you will give the column of observations (numbers)

### Practice

- Download the `untidydata` package:  
`remotes::install_github('jvcasillas/untidydata')`
- Load the package and convert the `pre_post` data set from wide to long
  - Include the relevant variables or
  - Exclude the irrelevant variable

.footnote[1 This function used to be called `gather()`]

???

```
pre_post |> 
  pivot_longer(cols = test1:test2, names_to = "test", values_to = "score")
```

```
pre_post |> 
  pivot_longer(cols = -c("id", "spec"), names_to = "test", values_to = "score")
```

---
class: title-slide-section-grey, center, middle

# What if we want a wide data set?

---
class: middle, center

# **pivot_wider**

# .blue[dataframes] .grey[from] long .grey[to] wide

# .grey[with]

# .big[pivot_wider()]

---

.pull-left[
<table class="table" style="color: black; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
 <th style="text-align:left;"> id </th>
 <th style="text-align:left;"> test </th>
 <th style="text-align:right;"> score </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> 101_m_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 75 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 101_m_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 85 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 102_m_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 70 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 102_m_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 80 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 103_m_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 65 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 103_m_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 75 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 104_m_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 66 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 104_m_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 76 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 105_m_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 68 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 105_m_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 78 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 106_m_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 58 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 106_m_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 68 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 107_f_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 60 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 107_f_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 70 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 108_f_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 66 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 108_f_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 76 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 109_f_ctr </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 69 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 109_f_ctr </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 79 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 110_f_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 54 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 110_f_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 64 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 111_f_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 88 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 111_f_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 98 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 112_f_exp </td>
 <td style="text-align:left;"> pre_test </td>
 <td style="text-align:right;"> 44 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 112_f_exp </td>
 <td style="text-align:left;"> post_test </td>
 <td style="text-align:right;"> 54 </td>
 </tr>
</tbody>
</table>
]

.pull-right[
 
- We need to `spread()` this dataframe back to wide format (`pivot_wider`)

- Why might we want to do this?

]

---

``` r
my_data_long |> 
* pivot_wider(names_from = "test", values_from = "score") |>
  ggplot() + 
  aes(x = pre_test, post_test) + 
  geom_vline(xintercept = mean(my_data_wide$pre_test), lty = 3) + 
  geom_hline(yintercept = mean(my_data_wide$post_test), lty = 3) + 
  geom_point(size = 4) + 
  theme_test(base_size = 18, base_family = "Palatino")
```

---

# pivot_wider()

### Note

- You will probably use `pivot_wider()` less

- It can be useful for making scatterplots and doing data transformations using `mutate()`

### Exercise

- Take a look at the `language_diversity` data set in `untidydata`

- Spread the data set from long to wide using `pivot_wider` and create a plot

.footnote[1 This function used to be called `spread()`]

???

```
language_diversity |> 
  pivot_wider(names_from = "Measurement", values_from = "Value") |> 
  ggplot() + 
  aes(x = log(Area), y = log(Langs), label = Country) + 
  geom_text() + 
  geom_smooth(method = "glm", method.args=list(family = "poisson"))
```

---

background-image: url(./assets/img/spread_gather.gif)
background-size: contain

---

<!--
background-color: black
background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/rstats/memes/rstats_load_all1.png)
background-size: contain

background-color: black
background-image: url(https://raw.githubusercontent.com/jvcasillas/media/master/rstats/memes/rstats_load_all2.png)
background-size: contain
-->

# Loading and saving data

.pull-left[
### read_csv()

- Read .csv files into R using the `read_csv()` function
- Ideally your data is stored in a folder of your project called `data`
- If it is raw data you can use sub-directories, i.e., `data > raw`
- You can pipe directly into any other verbs to tidy your data
- Ex.

``` r
my_df <- read_csv("./data/raw/raw_data.csv")
```
]

.pull-right[
### write_csv()

- Save dataframes as .csv files using `write_csv()` function
- After tidying your data you can (should) save it 
- Keep this data separate from your raw data, i.e., `data > tidy`
- You can pipe into `write_csv()` right after tidying your data
- Ex.

``` r
my_df <- read_csv("./data/raw/raw_data.csv") |> 
mutate(
 new_var = var1 - var2, 
 group_sum = if_else(group == "level", -1, 1)
) |> 
write_csv(path = "./data/tidy/tidy_data.csv")
```
]

---
class: title-slide-section-grey, center, middle

# A note about paths

---

# System paths

- what are they?
- how do they work?
- relative paths
- absolute paths
- what problems do they create?
- what are the solutions?

---

# System paths

### What are they?

- Your computer is a hierarchical system of directories (folders) 
and files
- You can think of it as a garden of forking paths
- The top of this hierarchy is the **root**
- The *path* from **root** to a given file is an .blue[absolute path]

![](https://www.researchgate.net/profile/Peter_Makovini/publication/309411137/figure/fig4/AS:667818014044161@1536231630016/Kanes-illustration-of-a-garden-of-forking-paths-2005.jpg)

---
background-image: url(https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSbaVWfe2qSnmcNes6NZADG-QI6o-3JaTjsc6-hC7Hv08turZmSgg)
background-size: 400px
background-position: 95% 10%

# System paths

### How do they work?

.pull-left[
- The user defines the system specific path
- Every time one "enters" a directory the path is marked with "**`/`**"1
- Ex. 
`/Users/casillas/Desktop/new_proj`
- This **absolute path** goes from my system root to a directory on 
my desktop called `new_proj`

.footnote[1This is specific to the operating system. For PCs you use "**`\`**".]
]

.pull-right[
- We can simplify the hierarchy by using .blue[relative paths]
- With a .blue[relative path] the user specifies what root is and all 
paths are *relative* to that root
]

---

## `new_proj` as root

.big[
&nbsp; **.**  
├── README.md  
├── data  
│ &nbsp; &nbsp; &nbsp; ├── raw  
│ &nbsp; &nbsp; &nbsp; │ &nbsp; &nbsp; &nbsp; ├── EMD_2afc_template_2019-02-14_09h56.54.363.csv  
│ &nbsp; &nbsp; &nbsp; │ &nbsp; &nbsp; &nbsp; └── ...  
│ &nbsp; &nbsp; &nbsp; └── tidy  
│ &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; └── tidy_data.csv  
├── my_proj.Rproj  
└── scripts  
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; └── my_script.R

]

---

# System paths

### What problems do they create?

- An **absolute path** can get long (and annoying) fast
- Your file system will have different paths than my file system

### What are the solutions?

- Always use relative paths! 
  - **`./`** = "here"
  - **`../`** = "Go up one directory"
  - **`../../`** = "Go up two directories"
- Always use RStudio projects 👍
- Use Rstudio projects + `here()` 😍

---

# System paths

### Exercise I

- Download this repo: https://github.com/jvcasillas/new_proj
- Load the data using an absolute path
- Reload the data using a relative path
- Calculate a summary on the data (`group_by() + summarize()`) 
and save the output as a csv to the data folder
- Load the new .csv
- Move the .csv files to root and reload them

### Exercise II

- Install `here`
- Load `here`
- Run `here()`. What happens?
- Load the data as you did before but use `here()` where previously 
you used a relative path

---
class: title-slide-final, middle
background-image: url(https://github.com/jvcasillas/ru_xaringan/raw/master/img/logo/ru_shield.png), url(https://www.r-project.org/Rlogo.png)
background-size: 55px, 100px
background-position: 9% 15%, 89% 15%

# Getting help

## If you have problems getting or tidying your data
## ask for help in the slack channel

### You can find some very basic tutorials related to 
### R, RStudio, RMarkdown, GitHub, and Slack [here][here]

[here]: http://www.jvcasillas.com/ru_teaching/ru_spanish_589/589_01_s2018/sources/tuts/index.html