Review

Packages

How do you install a package from the R prompt, like readxl?

Answer

How do you load a package from the R prompt, like dplyr?

Answer
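A minimal sketch of both answers. The install step is commented out so the block runs offline; it assumes an internet connection and a configured CRAN mirror when you do run it.

```r
# Install once per machine; this downloads readxl from CRAN
# install.packages("readxl")

# Load the package in every new R session before using its functions
library(dplyr)

# Check that the package is now attached to the search path
dplyr_attached <- "package:dplyr" %in% search()
dplyr_attached
```

Installing happens once; loading with library() happens at the start of each session.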

How can you use RStudio to load a package?

Answer

RStudio Packages Pane


Okay, but what is a package?

Answer

A package contains:

  1. Functions
  2. Documentation
  3. Vignettes
  4. Data

Data Types

What data type is each of the following?

Type Example
1L
3.14, 1.23e-4
"apple"
TRUE, FALSE
c(...)
list(...)
data.frame(...)
data_frame(...)
NA
NULL
factor(letters)
Answer: integer, double, character, logical, vector, list, data.frame, tibble, NA (missing), NULL, and factor
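One way to check these answers yourself is with typeof() and class(). The sketch below uses only base R; the particular values passed in are illustrative.

```r
typeof(1L)                 # "integer"
typeof(3.14)               # "double"
typeof("apple")            # "character"
typeof(TRUE)               # "logical"
is.vector(c(1, 2, 3))      # TRUE
class(list(1, "a"))        # "list"
class(data.frame(a = 1))   # "data.frame"
typeof(NA)                 # "logical" (a plain NA is a logical missing value)
is.null(NULL)              # TRUE
class(factor(letters))     # "factor"
typeof(factor(letters))    # "integer" (factors are stored as integer codes)
```

typeof() reports how R stores the object, while class() reports how functions will treat it, which is why a factor gives two different answers.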

Run the following command. It will create 3 variables: x, y, and z. Without printing the variables, how can you tell what data type they are?

Answer

Here’s the output from the code that was run:

  A   B   C   D   E   F   G   H   I   J 
"x" "z" "g" "t" "o" "k" "v" "c" "l" "r" 
 [1] 0.4577418        NA 0.9346722 0.2554288 0.4622928 0.9400145 0.9782264
 [8] 0.1174874 0.4749971        NA
[1] "integer"
[1] "character"
[1] "numeric"
[1] TRUE
[1] TRUE
 [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
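The original commands are not shown, so the following is only a sketch of how output like this could be produced. The values of y and z are copied from the printed output; x is hypothetical (the output shows only that it is an integer), and the choice of is.integer() and is.character() for the two TRUE results is an assumption.

```r
# x is hypothetical: the output shows only that it is an integer vector
x <- 1:10

# y and z are copied from the printed output above
y <- c(A = "x", B = "z", C = "g", D = "t", E = "o",
       F = "k", G = "v", H = "c", I = "l", J = "r")
z <- c(0.4577418, NA, 0.9346722, 0.2554288, 0.4622928,
       0.9400145, 0.9782264, 0.1174874, 0.4749971, NA)

# Inspect the type without printing the values
class(x)         # "integer"
class(y)         # "character"
class(z)         # "numeric"

# Type predicates return a single TRUE or FALSE
is.integer(x)    # TRUE
is.character(y)  # TRUE

# is.na() is vectorized: one result per element
is.na(z)         # TRUE at positions 2 and 10
```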

Workspaces & RStudio Projects

In the last session, we created an RStudio project and an example R script.

  1. Re-open the project we created (or create a new project).
  2. What is the current working directory?
  3. Use the File pane to navigate to your desktop (or another folder on your computer).
  4. How can you quickly navigate back to the working directory?

Open example_single_patient.R from our previous session and add the following lines.

Clear your workspace (quick refresher here) and then source the script.

View the tibble that is stored in example.

# A tibble: 5 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        54        9.5 C220     
2    5554321     54        55        9.5 C400     
3    5554321     54        56        9.7 C412     
4    5554321     54        57        9.9 C220     
5    5554321     54        58       10.1 C400     
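If you no longer have the script from the previous session, the tibble printed above can be rebuilt directly. This is a sketch based only on the printed values and column types; the original script may have constructed it differently.

```r
library(dplyr)

# Length-1 values (patient_id, age_dx) are recycled to match the 5 visits
example <- tibble(
  patient_id = 5554321,
  age_dx     = 54,
  age_visit  = 54:58,                              # integer column
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

example
```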

Functions

Functions that work with vectors

So far we’ve primarily seen functions that operate on single values or that take single-valued arguments.

But as we’ve seen, R is a vectorized language.

Try using the following functions on the variables we created in example_single_patient.R in Session 2.

[1] 9.5
[1] 10.1
[1] 9.74
[1] 9.7
[1] 0.068
[1] 0.2607681
[1] 0.4
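The calls behind this output are not shown. The sketch below reproduces the same seven values from the tumor_size vector, so it is a plausible reconstruction rather than the original code.

```r
# tumor_size values as printed in the example tibble
tumor_size <- c(9.5, 9.5, 9.7, 9.9, 10.1)

min(tumor_size)     # 9.5
max(tumor_size)     # 10.1
mean(tumor_size)    # 9.74
median(tumor_size)  # 9.7
var(tumor_size)     # 0.068
sd(tumor_size)      # 0.2607681
IQR(tumor_size)     # 0.4
```

Each function takes the whole vector and collapses it to one number.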

Those functions all come from base R (standard R library).

The following functions come from dplyr. We have dplyr loaded if we’ve run library(tidyverse), but I’ll include the dplyr:: prefix as a reminder of where these functions come from.

[1] "C220"
[1] 58
[1] "C400"
[1] 3
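A sketch of dplyr calls that would give these four results. Which vector each function was applied to is inferred from the output, so treat the pairing as an assumption.

```r
library(dplyr)

# Vectors as printed in the example tibble
age_visit <- 54:58
site_code <- c("C220", "C400", "C412", "C220", "C400")

dplyr::first(site_code)       # "C220"
dplyr::last(age_visit)        # 58
dplyr::nth(site_code, 2)      # "C400"
dplyr::n_distinct(site_code)  # 3
```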

All of these functions return a single value. Try the following. What happens and why?

[1]  95  95  97  99 101
[1] 0 1 2 3 4
[1] "Site Code: C220" "Site Code: C400" "Site Code: C412" "Site Code: C220"
[5] "Site Code: C400"

Because R is vectorized, operations are applied to the whole vector.
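The three results above are consistent with the following element-wise operations. This is a reconstruction from the output, not the original code.

```r
# Vectors as printed in the example tibble
age_dx     <- 54
age_visit  <- 54:58
tumor_size <- c(9.5, 9.5, 9.7, 9.9, 10.1)
site_code  <- c("C220", "C400", "C412", "C220", "C400")

# The scalar 10 is recycled across all five elements
size_mm <- tumor_size * 10

# age_dx (length 1) is subtracted from every visit age
years_since_dx <- age_visit - age_dx

# paste() combines the single label with each site code
labels <- paste("Site Code:", site_code)

size_mm
years_since_dx
labels
```

In each case the shorter operand is recycled, so the result has one element per element of the longer vector.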

Dot, dot, dot

R has an unusual feature for writing and using functions: the dot-dot-dot (...).

The ... is used in two ways:

  1. To allow you to include an unknown number of values.

    paste <- function (..., sep = " ", collapse = NULL) 
    [1] "a b c"
    [1] "a b c d"
  2. To allow you to pass arguments to an underlying function.

    rep <- function (x, ...)  .Primitive("rep")
    [1] 1 1 1 1
    [1] 1 1 1 1
    [1] 1 1 1 1
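A sketch of both uses. The paste() and rep() calls are guesses that match the printed results; count_args() and shout() are hypothetical helpers written only for this illustration.

```r
# Use 1: any number of values can be supplied through ...
paste("a", "b", "c")        # "a b c"
paste("a", "b", "c", "d")   # "a b c d"

# rep() accepts its extra arguments through ...; all three give 1 1 1 1
rep(1, 4)
rep(1, times = 4)
rep(1, length.out = 4)

# Hypothetical helper: collect ... into a list and count the values
count_args <- function(...) {
  length(list(...))
}
count_args("a", "b", "c")   # 3

# Use 2: hypothetical wrapper that forwards ... to paste()
shout <- function(...) {
  toupper(paste(...))
}
shout("a", "b", sep = "-")  # "A-B"
```

In shout(), the sep argument is never named in the wrapper's signature; it travels through ... and is handled by paste().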

Before We Begin


Whenever you’re learning a new tool, for a long time you’re going to suck. It’s going to be very frustrating. But the good news is that that is typical, it’s something that happens to everyone, and it’s only temporary.

Unfortunately, there is no way to go from knowing nothing about a subject to knowing something about the subject … without going through a period of great frustration and much suckiness.

But remember, when you’re getting frustrated, that’s a good thing, it’s typical, it’s temporary. Keep pushing through and in time it will become second nature.

Hadley Wickham, UseR!2014

dplyr Basics

dplyr provides a wide range of functions for data manipulation and transformation. In this session, we’re going to cover 5 key dplyr functions:

Function Action
filter() Pick out observations by their values
arrange() Reorder the rows
select() Pick out variables by their names
mutate() Create new variables using existing variables
summarize() Collapse many values into a single summary

All dplyr verbs work similarly:

  1. The first argument is a data frame.

  2. Subsequent arguments describe how the verb will transform the data frame, using unquoted column names (column_name, not "column_name").

  3. The output is a new data frame.

filter()

# A tibble: 2 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        55        9.5 C400     
2    5554321     54        58       10.1 C400     
# A tibble: 3 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        56        9.7 C412     
2    5554321     54        57        9.9 C220     
3    5554321     54        58       10.1 C400     
# A tibble: 0 x 5
# ... with 5 variables: patient_id <dbl>, age_dx <dbl>, age_visit <int>,
#   tumor_size <dbl>, site_code <chr>
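The filter() calls are not shown. The first two conditions below reproduce the first two tables; the third condition is hypothetical and simply shows a filter that matches no rows.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Keep rows where site_code equals "C400" (note == for comparison, not =)
at_c400 <- filter(example, site_code == "C400")

# Keep rows where the tumor is larger than 9.5
larger <- filter(example, tumor_size > 9.5)

# A condition no row satisfies returns a tibble with 0 rows (hypothetical)
none <- filter(example, age_visit > 60)

at_c400
larger
none
```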

Multiple arguments to filter() are combined with &:

# A tibble: 1 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        58       10.1 C400     
# A tibble: 1 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        58       10.1 C400     
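Two equivalent ways to get the single row above. The specific conditions are an assumption chosen to match the output.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Comma-separated conditions must all be TRUE for a row to be kept
with_commas <- filter(example, site_code == "C400", tumor_size > 9.5)

# The same filter written with an explicit &
with_and <- filter(example, site_code == "C400" & tumor_size > 9.5)

with_commas
with_and
```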

To build more complex filter combinations, use the logical operators:

Operation Symbol
and &
or |
not !

Filtering for an item in a group

Error in filter_impl(.data, quo): Evaluation error: operations are possible only for numeric, logical or complex types.
# A tibble: 3 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        54        9.5 C220     
2    5554321     54        56        9.7 C412     
3    5554321     54        57        9.9 C220     
# A tibble: 3 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        54        9.5 C220     
2    5554321     54        56        9.7 C412     
3    5554321     54        57        9.9 C220     
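A sketch of what likely produced the error and the two working alternatives. The failing line is commented out so the block runs; the exact original calls are an assumption.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Fails: | needs a logical value on both sides, and "C412" is a string
# filter(example, site_code == "C220" | "C412")

# Works: repeat the full comparison on each side of |
either_or <- filter(example, site_code == "C220" | site_code == "C412")

# Works and scales better: test membership in a set with %in%
in_set <- filter(example, site_code %in% c("C220", "C412"))

either_or
in_set
```

%in% is usually the more readable choice once there are more than two values to match.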

Missing values

Filtering can be tricky when there are missing values – NA. Or sometimes, you’re trying to find the missing values.

It’s important to keep in mind that NAs are “contagious” in R, meaning that the result of almost any operation involving an NA will be an NA.

[1] NA
[1] NA
[1] NA
[1] NA
[1] NA

Here’s an example that helps to illustrate why NA == NA isn’t TRUE.

[1] NA
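The expressions behind these NA results are not shown. The sketch below uses common examples that all evaluate to NA; the variable names age_patient_a and age_patient_b are hypothetical.

```r
# Almost any operation involving NA returns NA
NA > 5      # NA
10 == NA    # NA
NA + 10     # NA
NA / 2      # NA
NA == NA    # NA

# Why NA == NA is not TRUE: two unknown ages (hypothetical variables)
age_patient_a <- NA
age_patient_b <- NA

# Are the two patients the same age? We cannot know.
same_age <- age_patient_a == age_patient_b
same_age    # NA

# To test for missingness, use is.na() instead of ==
is.na(age_patient_a)  # TRUE
```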

filter() keeps only the rows where the condition is TRUE and drops the rows where it is FALSE or NA.

# A tibble: 1 x 1
      x
  <dbl>
1     3
# A tibble: 2 x 1
      x
  <dbl>
1    NA
2     3
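A sketch that reproduces the two tables above. The tibble df and the conditions are assumptions inferred from the output.

```r
library(dplyr)

# A small tibble with one missing value (assumed from the output)
df <- tibble(x = c(1, NA, 3))

# The NA row is dropped because NA > 1 is NA, not TRUE
without_na <- filter(df, x > 1)

# Ask for missing values explicitly to keep them
with_na <- filter(df, is.na(x) | x > 1)

without_na
with_na
```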

arrange()

To arrange, or sort, the rows according to values in a given column, use arrange().

# A tibble: 5 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        54        9.5 C220     
2    5554321     54        57        9.9 C220     
3    5554321     54        55        9.5 C400     
4    5554321     54        58       10.1 C400     
5    5554321     54        56        9.7 C412     
# A tibble: 5 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        56        9.7 C412     
2    5554321     54        55        9.5 C400     
3    5554321     54        58       10.1 C400     
4    5554321     54        54        9.5 C220     
5    5554321     54        57        9.9 C220     
# A tibble: 5 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        54        9.5 C220     
2    5554321     54        57        9.9 C220     
3    5554321     54        55        9.5 C400     
4    5554321     54        58       10.1 C400     
5    5554321     54        56        9.7 C412     
# A tibble: 5 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        58       10.1 C400     
2    5554321     54        57        9.9 C220     
3    5554321     54        56        9.7 C412     
4    5554321     54        54        9.5 C220     
5    5554321     54        55        9.5 C400     
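The arrange() calls are not shown. The four calls below give the same row orders as the four tables above, but other calls could give the same result, so treat them as one possible reconstruction.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Sort ascending by site_code
by_site <- arrange(example, site_code)

# Sort descending with desc()
by_site_desc <- arrange(example, desc(site_code))

# Additional columns break ties in the earlier ones
by_site_then_age <- arrange(example, site_code, age_visit)

# Largest tumors first; tied rows keep their original order
by_size_desc <- arrange(example, desc(tumor_size))

by_site
by_site_desc
by_site_then_age
by_size_desc
```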

select()

# A tibble: 5 x 2
  age_dx tumor_size
   <dbl>      <dbl>
1     54        9.5
2     54        9.5
3     54        9.7
4     54        9.9
5     54       10.1
# A tibble: 5 x 3
  age_dx age_visit tumor_size
   <dbl>     <int>      <dbl>
1     54        54        9.5
2     54        55        9.5
3     54        56        9.7
4     54        57        9.9
5     54        58       10.1
# A tibble: 5 x 2
  tumor_size site_code
       <dbl> <chr>    
1        9.5 C220     
2        9.5 C400     
3        9.7 C412     
4        9.9 C220     
5       10.1 C400     
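A sketch of select() calls that return the three sets of columns shown above. The original calls are not shown, so the exact forms are assumed.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Pick columns by name
two_cols <- select(example, age_dx, tumor_size)

# Pick a range of adjacent columns with :
col_range <- select(example, age_dx:tumor_size)

# Drop columns with a minus sign
dropped <- select(example, -(patient_id:age_visit))

names(two_cols)
names(col_range)
names(dropped)
```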

Helper functions for select()

# A tibble: 5 x 2
  age_dx age_visit
   <dbl>     <int>
1     54        54
2     54        55
3     54        56
4     54        57
5     54        58
# A tibble: 5 x 1
  patient_id
       <dbl>
1    5554321
2    5554321
3    5554321
4    5554321
5    5554321
# A tibble: 5 x 1
  site_code
  <chr>    
1 C220     
2 C400     
3 C412     
4 C220     
5 C400     
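The helper calls are not shown. The sketch below uses starts_with(), ends_with(), and contains(), which select the same columns as the three tables above; the patterns passed to them are assumptions.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Columns whose names begin with "age"
age_cols <- select(example, starts_with("age"))

# Columns whose names end with "id"
id_cols <- select(example, ends_with("id"))

# Columns whose names contain "code"
code_cols <- select(example, contains("code"))

names(age_cols)
names(id_cols)
names(code_cols)
```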

Renaming

# A tibble: 5 x 1
  code 
  <chr>
1 C220 
2 C400 
3 C412 
4 C220 
5 C400 
# A tibble: 5 x 5
  patient_id age_dx age_visit tumor_size code 
       <dbl>  <dbl>     <int>      <dbl> <chr>
1    5554321     54        54        9.5 C220 
2    5554321     54        55        9.5 C400 
3    5554321     54        56        9.7 C412 
4    5554321     54        57        9.9 C220 
5    5554321     54        58       10.1 C400 
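A sketch of the two ways to rename that match the output above: select() with new = old keeps only the named column, while rename() keeps every column.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# select() renames but drops everything not mentioned
only_code <- select(example, code = site_code)

# rename() renames and keeps all other columns
renamed <- rename(example, code = site_code)

names(only_code)
names(renamed)
```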

mutate()

# A tibble: 5 x 6
  patient_id age_dx age_visit tumor_size site_code follow_up
       <dbl>  <dbl>     <int>      <dbl> <chr>         <dbl>
1    5554321     54        54        9.5 C220              0
2    5554321     54        55        9.5 C400              1
3    5554321     54        56        9.7 C412              2
4    5554321     54        57        9.9 C220              3
5    5554321     54        58       10.1 C400              4
# A tibble: 5 x 6
  patient_id age_dx age_visit tumor_size site_code elapsed
       <dbl>  <dbl>     <int>      <dbl> <chr>       <dbl>
1    5554321     54        54        9.5 C220            0
2    5554321     54        55        9.5 C400            1
3    5554321     54        56        9.7 C412            2
4    5554321     54        57        9.9 C220            3
5    5554321     54        58       10.1 C400            4
# A tibble: 5 x 6
  patient_id age_dx age_visit tumor_size site_code tumor_size_mm
       <dbl>  <dbl>     <int>      <dbl> <chr>             <dbl>
1    5554321     54        54         95 C220               9500
2    5554321     54        55         95 C400               9500
3    5554321     54        56         97 C412               9700
4    5554321     54        57         99 C220               9900
5    5554321     54        58        101 C400              10100
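A reconstruction of mutate() calls consistent with the three tables above. The multipliers in the third call are inferred from the printed values (95 and 9500), so they are an assumption.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Add a new column computed from existing columns
with_follow_up <- mutate(example, follow_up = age_visit - age_dx)

# The new column can have any name
with_elapsed <- mutate(example, elapsed = age_visit - age_dx)

# Arguments are evaluated in order: tumor_size is overwritten first,
# so tumor_size_mm is computed from the already-multiplied values
sequential <- mutate(example,
                     tumor_size    = tumor_size * 10,
                     tumor_size_mm = tumor_size * 100)

with_follow_up
with_elapsed
sequential
```

The third call shows why the order of arguments to mutate() matters when a column is modified and then reused.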

Recoding or conditionally changing values

Often you’ll want to replace certain values of a variable with another value. There are several helpful functions provided by dplyr that let you do this, including:

  1. recode(): Replace character values with "old" = "new".

  2. if_else(): Use logical statements to change the value.

# A tibble: 5 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        54        9.5 C220     
2    5554321     54        55        9.5 C400     
3    5554321     54        56        9.7 C999     
4    5554321     54        57        9.9 C220     
5    5554321     54        58       10.1 C400     
# A tibble: 5 x 5
  patient_id age_dx age_visit tumor_size site_code
       <dbl>  <dbl>     <int>      <dbl> <chr>    
1    5554321     54        54        9.5 C220     
2    5554321     54        55      950   C400     
3    5554321     54        56        9.7 C412     
4    5554321     54        57        9.9 C220     
5    5554321     54        58     1010   C400     
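A sketch of recode() and if_else() calls that reproduce the two tables above. The replacement value "C999" and the multiplier 100 are read off the output; the original calls are not shown.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# recode(): replace "C412" with "C999", leave other values unchanged
recoded <- mutate(example, site_code = recode(site_code, "C412" = "C999"))

# if_else(): where the condition is TRUE use the first value, else the second
rescaled <- mutate(example,
                   tumor_size = if_else(site_code == "C400",
                                        tumor_size * 100,
                                        tumor_size))

recoded
rescaled
```

if_else() requires both outcomes to have the same type, which is why both branches here are numeric.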

summarize()

# A tibble: 1 x 1
  `min(age_dx)`
          <dbl>
1            54
# A tibble: 1 x 1
  max_tumor_size
           <dbl>
1           10.1
# A tibble: 1 x 3
  tumor_size_median tumor_size_mean age_visit
              <dbl>           <dbl>     <dbl>
1               9.7            9.74        58
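A sketch of summarize() calls matching the three tables above. The function used for the age_visit column in the third call is a guess (max() gives 58, but so would last()).

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Unnamed summaries get the expression as their column name
unnamed <- summarize(example, min(age_dx))

# Naming the summary gives a cleaner column name
named <- summarize(example, max_tumor_size = max(tumor_size))

# Several summaries can be computed in one call
several <- summarize(example,
                     tumor_size_median = median(tumor_size),
                     tumor_size_mean   = mean(tumor_size),
                     age_visit         = max(age_visit))

unnamed
named
several
```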

group_by() & summarize()

# A tibble: 3 x 3
  site_code tumor_size_mean age_mean
  <chr>               <dbl>    <dbl>
1 C220                  9.7       54
2 C400                  9.8       54
3 C412                  9.7       54
# A tibble: 3 x 4
# Groups:   site_code [?]
  site_code patient_id tumor_size_mean age_mean
  <chr>          <dbl>           <dbl>    <dbl>
1 C220         5554321             9.7       54
2 C400         5554321             9.8       54
3 C412         5554321             9.7       54
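A sketch of grouped summaries that reproduce the two tables above. It is written with nested calls, and the use of mean(age_dx) for age_mean is inferred from the output.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# One summary row per site_code
by_site <- summarize(group_by(example, site_code),
                     tumor_size_mean = mean(tumor_size),
                     age_mean        = mean(age_dx))

# Grouping by two variables: summarize() removes only the last grouping
# level, so the result is still grouped by site_code
by_site_patient <- summarize(group_by(example, site_code, patient_id),
                             tumor_size_mean = mean(tumor_size),
                             age_mean        = mean(age_dx))

by_site
by_site_patient
```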

group_by() & count()

# A tibble: 3 x 2
  site_code     n
  <chr>     <int>
1 C220          2
2 C400          2
3 C412          1
# A tibble: 1 x 2
  patient_id     n
       <dbl> <int>
1    5554321     5
# A tibble: 3 x 3
# Groups:   site_code, patient_id [3]
  site_code patient_id     n
  <chr>          <dbl> <int>
1 C220         5554321     2
2 C400         5554321     2
3 C412         5554321     1
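A sketch of count() calls that give the three tables above. The original calls are not shown, so these forms are assumed.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Number of rows for each site_code
n_by_site <- count(example, site_code)

# Number of rows for each patient
n_by_patient <- count(example, patient_id)

# count() on a grouped tibble counts rows per group and keeps the grouping
n_by_group <- count(group_by(example, site_code, patient_id))

n_by_site
n_by_patient
n_by_group
```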

Combining dplyr Verbs

Let’s say we want to calculate the average age and tumor size (in mm) by site code for each patient.

To do this we’ll take our example data and

  1. Group by site_code and patient_id

  2. Convert tumor size from cm to mm

  3. Summarize tumor_size and age_dx by their average.

In dplyr speak:

# A tibble: 3 x 4
# Groups:   site_code [?]
  site_code patient_id tumor_size_mean age_mean
  <chr>          <dbl>           <dbl>    <dbl>
1 C220         5554321              97       54
2 C400         5554321              98       54
3 C412         5554321              97       54
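A sketch of the three steps with intermediate variables, reconstructed from the list above and the printed result. The names ex1, ex2, and ex3 follow the text; the exact expressions are an assumption.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Step 1: group by site_code and patient_id
ex1 <- group_by(example, site_code, patient_id)

# Step 2: convert tumor size from cm to mm
ex2 <- mutate(ex1, tumor_size = tumor_size * 10)

# Step 3: average tumor size and age at diagnosis within each group
ex3 <- summarize(ex2,
                 tumor_size_mean = mean(tumor_size),
                 age_mean        = mean(age_dx))

ex3
```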

Notice that the output from each step is the input to the next step. Also, we don’t really need ex1 or ex2; we just want the output, which we’ve saved as ex3.

To make this much cleaner we can use the pipe operator.

# A tibble: 3 x 4
# Groups:   site_code [?]
  site_code patient_id tumor_size_mean age_mean
  <chr>          <dbl>           <dbl>    <dbl>
1 C220         5554321              97       54
2 C400         5554321              98       54
3 C412         5554321              97       54
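A sketch of the same calculation written as a single pipeline. The expressions are reconstructed from the steps described above and the printed result; piped_result is a hypothetical name.

```r
library(dplyr)

# Example tibble rebuilt from the values printed earlier
example <- tibble(
  patient_id = 5554321, age_dx = 54, age_visit = 54:58,
  tumor_size = c(9.5, 9.5, 9.7, 9.9, 10.1),
  site_code  = c("C220", "C400", "C412", "C220", "C400")
)

# Each %>% passes the result on its left as the first argument on its right
piped_result <- example %>%
  group_by(site_code, patient_id) %>%
  mutate(tumor_size = tumor_size * 10) %>%
  summarize(tumor_size_mean = mean(tumor_size),
            age_mean        = mean(age_dx))

piped_result
```

No intermediate variables are needed, and the steps read top to bottom in the order they run.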

The pipe operator looks like this

%>%

You can type it with

Ctrl + Shift + M (Windows)

Cmd + Shift + M (Mac)

You say it like

…and then…

Re-read the code above:

  1. Take example and then…

  2. Group by and then…

  3. Mutate and then…

  4. Summarize

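The pipe operator is now ubiquitous in modern R code, but %>% is not part of base R: it comes from the magrittr package and is made available when you load dplyr or tidyverse. (R 4.1.0 and later also provide a native pipe, |>.)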

Make sure that you load tidyverse or dplyr first!

Your turn!

Use the pipe operator and dplyr verbs to complete the following task:

  1. Use the example dataset

  2. Filter out tumors smaller than 9.7

  3. Calculate follow_up time as the number of years between diagnosis and the patient’s visit

  4. Rename the site_code column to code.