Session 6

Homework Review

Review the homework from Session 5.

You can check out the approach I took in this page. There is more than one way to accomplish each task, so it’s okay if your code looks different!

What parts of these tasks were difficult?
Which parts did you find easy?
What questions did you have as you worked?
Were you able to find the answers to any of these questions?

Overview

Tidying Data
- Group discussion: messy data sets
Joins
- Review Joins
- Group work joining data
Working with dates and times

Tidying Data

The three rules of tidy data

Columns represent separate variables
Rows represent individual observations
Observational units form tables

Why tidy data?

source("https://git.io/fATEK")

Each of the following tables contains the same five variables: mrn, sequence, age, tumor_size, tumor_unit.

How easy would it be to calculate average tumor size for each patient? What problems would you face if your data were formatted like the examples below?

Example 1

table_6b

# A tibble: 20 x 4
      mrn sequence type       value 
    <int>    <int> <chr>      <chr> 
 1 289034        1 age        47    
 2 289034        1 tumor_size 0.9 cm
 3 289034        2 age        47    
 4 289034        2 tumor_size 999 mm
 5 290660        1 age        49    
 6 290660        1 tumor_size 30 mm 
 7 290660        2 age        49    
 8 290660        2 tumor_size 999 mm
 9 341050        1 age        61    
10 341050        1 tumor_size 999 mm
11 341050        2 age        70    
12 341050        2 tumor_size 2.1 cm
13 385615        1 age        71    
14 385615        1 tumor_size 12 mm 
15 385615        2 age        71    
16 385615        2 tumor_size 2 cm  
17 550955        1 age        43    
18 550955        1 tumor_size 2.5 cm
19 550955        2 age        43    
20 550955        2 tumor_size 998 mm

Example 2

table_6c

# A tibble: 10 x 4
      mrn age    tumor_size tumor_unit
    <int> <chr>       <dbl> <chr>     
 1 289034 47 (1)        0.9 cm        
 2 289034 47 (2)      999   mm        
 3 290660 49 (1)       30   mm        
 4 290660 49 (2)      999   mm        
 5 341050 61 (1)      999   mm        
 6 341050 70 (2)        2.1 cm        
 7 385615 71 (1)       12   mm        
 8 385615 71 (2)        2   cm        
 9 550955 43 (1)        2.5 cm        
10 550955 43 (2)      998   mm

Example 3

table_6d1 # age

# A tibble: 5 x 3
     mrn   `1`   `2`
   <int> <int> <int>
1 289034    47    47
2 290660    49    49
3 341050    61    70
4 385615    71    71
5 550955    43    43

table_6d2 # tumor_size

# A tibble: 5 x 3
     mrn `1`    `2`   
   <int> <chr>  <chr> 
1 289034 0.9 cm 999 mm
2 290660 30 mm  999 mm
3 341050 999 mm 2.1 cm
4 385615 12 mm  2 cm  
5 550955 2.5 cm 998 mm

Example 4

table_6e

# A tibble: 4 x 11
  var   `289034` `289034` `290660` `290660` `341050` `341050` `385615`
  <chr> <chr>    <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
1 sequ… 1        2        1        2        1        2        1       
2 age   47       47       49       49       61       70       71      
3 tumo… 0.9      999.0    30.0     999.0    999.0    2.1      12.0    
4 tumo… cm       mm       mm       mm       mm       cm       mm      
# ... with 3 more variables: `385615` <chr>, `550955` <chr>,
#   `550955` <chr>

Tidying Data

Most data isn’t tidy: often data is organized for something other than analysis. For example, the data may be structured in a way that’s easiest for data entry or so that all of the data fits on a single page.

First, ask yourself (or the person who created the data):

What are the variables?
What are the observations?

Sometimes, a variable is spread across multiple columns. Othertimes, an observation is scattered across multiple rows.

The tidyr packages is part of the tidyverse and loaded with library(tidyverse).

`gather()`

gather(data, key, value, ...)

Use gather() to gather the values of a variable that are spread across columns.

The key is the name of the variable that will hold the values that are currently used as column names. The value is the name of the variables that holds the values in those columns.

table_6d1

# A tibble: 5 x 3
     mrn   `1`   `2`
   <int> <int> <int>
1 289034    47    47
2 290660    49    49
3 341050    61    70
4 385615    71    71
5 550955    43    43

gather(table_6d1, sequence, age, `1`, `2`)

# A tibble: 10 x 3
      mrn sequence   age
    <int> <chr>    <int>
 1 289034 1           47
 2 290660 1           49
 3 341050 1           61
 4 385615 1           71
 5 550955 1           43
 6 289034 2           47
 7 290660 2           49
 8 341050 2           70
 9 385615 2           71
10 550955 2           43

`spread()`

spread(data, key, value)

spread() is the inverse of gather(). The values of the variable given as key will be the new column names that will contain the values in value.

table_6b

# A tibble: 20 x 4
      mrn sequence type       value 
    <int>    <int> <chr>      <chr> 
 1 289034        1 age        47    
 2 289034        1 tumor_size 0.9 cm
 3 289034        2 age        47    
 4 289034        2 tumor_size 999 mm
 5 290660        1 age        49    
 6 290660        1 tumor_size 30 mm 
 7 290660        2 age        49    
 8 290660        2 tumor_size 999 mm
 9 341050        1 age        61    
10 341050        1 tumor_size 999 mm
11 341050        2 age        70    
12 341050        2 tumor_size 2.1 cm
13 385615        1 age        71    
14 385615        1 tumor_size 12 mm 
15 385615        2 age        71    
16 385615        2 tumor_size 2 cm  
17 550955        1 age        43    
18 550955        1 tumor_size 2.5 cm
19 550955        2 age        43    
20 550955        2 tumor_size 998 mm

spread(table_6b, type, value)

# A tibble: 10 x 4
      mrn sequence age   tumor_size
    <int>    <int> <chr> <chr>     
 1 289034        1 47    0.9 cm    
 2 289034        2 47    999 mm    
 3 290660        1 49    30 mm     
 4 290660        2 49    999 mm    
 5 341050        1 61    999 mm    
 6 341050        2 70    2.1 cm    
 7 385615        1 71    12 mm     
 8 385615        2 71    2 cm      
 9 550955        1 43    2.5 cm    
10 550955        2 43    998 mm

Animated

Quick Review

Which function would you use to transform this data into tidy format?

Pew Forum on Religious and Public Life

Working with Dates

To work with dates, we need another packges from the tidyverse called lubridate. This package isn’t loaded by default, so you’ll need to run

library(lubridate)


Attaching package: 'lubridate'

The following object is masked from 'package:base':

    date

About dates, times and date-times

2006-01-02 03:04:05

A date-time is a point in time, stored as the number of seconds since 1970-01-01 00:00:00 UTC. Note that a date-time also understands time zones.

as_datetime(1136171045)

[1] "2006-01-02 03:04:05 UTC"

A date is a day and is stored as the number of days since 1970-01-01.

as_date(13150)

[1] "2006-01-02"

An hms time is stored as the number of seconds since 00:00:00.

hms::as.hms(3 * 60^2 + 4 * 60 + 5)

03:04:05

Converting strings to date-times

lubridate has a number of functions that convert strings to date-times. The functions are named so that they match the order of the date components in the string. The the string doesn’t include time, a date is returned rather than a date-time.

element	letter
year	y
month	m
day	d
hour	h
minute	m
second	s

ymd_hms("2006-01-02 03:04:05")

[1] "2006-01-02 03:04:05 UTC"

ymd_hm("2006-01-02 03:04")

[1] "2006-01-02 03:04:00 UTC"

ymd("2006-01-02")

[1] "2006-01-02"

mdy_hm("1/2/06 3:04")

[1] "2006-01-02 03:04:00 UTC"

mdy_hm("January 2, 2006 3:04")

[1] "2006-01-02 03:04:00 UTC"

mdy("1/2/2006")

[1] "2006-01-02"

dmy_hm("2 Jan 2006 3:04")

[1] "2006-01-02 03:04:00 UTC"

dmy("2 Jan 2006")

[1] "2006-01-02"

Getting and Setting Date Components

You can get or set a date component with the singular noun describing the date element.

x <- ymd_hms("2006-01-02 03:04:05")

date(x)

[1] "2006-01-02"

year(x)

[1] 2006

week(x)

[1] 1

epiyear(x)

[1] 2006

epiweek(x)

[1] 1

month(x)

[1] 1

day(x)

[1] 2

wday(x)

[1] 2

wday(x, label = TRUE)

[1] Mon
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

wday(x, label = TRUE, abbr = FALSE)

[1] Monday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

hour(x)

[1] 3

minute(x)

[1] 4

second(x)

[1] 5

For all of the above you can also use this syntax to change the value.

y <- ymd_hms("2006-01-02 03:04:05")
day(y) <- 7
y

[1] "2006-01-07 03:04:05 UTC"

Moving the date-time

If you want to adjust a date-time on the timeline, you can add or subtract time periods, like 2 years. These functions are all named after the plural nouns.

x <- ymd_hms("2006-01-02 03:04:05")

x + years(4)

[1] "2010-01-02 03:04:05 UTC"

x + months(4)

[1] "2006-05-02 03:04:05 UTC"

x - days(4)

[1] "2005-12-29 03:04:05 UTC"

x - hours(1)

[1] "2006-01-02 02:04:05 UTC"

x + minutes(20)

[1] "2006-01-02 03:24:05 UTC"

Durations and Comparing date-times

x <- ymd_hms("2006-01-02 03:04:05")
y <- ymd_hms("2007-02-03 04:05:06")

x > y

[1] FALSE

x < y

[1] TRUE

y - x

Time difference of 397.0424 days

x + y #<< Error in `+.POSIXt`(x, y) `+` not defined...

Error in `+.POSIXt`(x, y): binary '+' is not defined for "POSIXt" objects

# Comparing date-time
z <- y - x
z

Time difference of 397.0424 days

as.integer(z)

[1] 397

# Compareing dates
z <- as_date(y) - as_date(x)
z

Time difference of 397 days

as.integer(z)

[1] 397

Rounding Dates

x <- ymd_hms("2006-01-02 03:04:05")
y <- ymd_hms("2007-02-03 04:05:06")

# Round to nearest unit
round_date(x, unit = "minute")

[1] "2006-01-02 03:04:00 UTC"

round_date(y, unit = "month")

[1] "2007-02-01 UTC"

# Round down to nearest unit
floor_date(x, unit = "hour")

[1] "2006-01-02 03:00:00 UTC"

floor_date(x, unit = "day")

[1] "2006-01-02 UTC"

# Round up to nearest unit
ceiling_date(y, unit = "hour")

[1] "2007-02-03 05:00:00 UTC"

ceiling_date(y, unit = "day")

[1] "2007-02-04 UTC"

# Rollback to last day of previous month
rollback(x)

[1] "2005-12-31 03:04:05 UTC"

rollback(x, roll_to_first = TRUE)

[1] "2006-01-01 03:04:05 UTC"

rollback(x, roll_to_first = TRUE, preserve_hms = FALSE)

[1] "2006-01-01 UTC"

Time Zones

x <- ymd_hms("2006-01-02 03:04:05")

with_tz(x, tzone = "US/Eastern")

[1] "2006-01-01 22:04:05 EST"

force_tz(x, tzone = "US/Eastern")

[1] "2006-01-02 03:04:05 EST"

timezones <- OlsonNames()
timezones[grepl("US", timezones)]

 [1] "US/Alaska"         "US/Aleutian"       "US/Arizona"       
 [4] "US/Central"        "US/East-Indiana"   "US/Eastern"       
 [7] "US/Hawaii"         "US/Indiana-Starke" "US/Michigan"      
[10] "US/Mountain"       "US/Pacific"        "US/Pacific-New"   
[13] "US/Samoa"

Assignment

Review: Two-Table dplyr verbs

Keys

What are the two types of keys?

Answer

A primary key uniquely identifies an observation in its own table.
A foreign key uniquely identifies an observation in another table.

Joins

I highly encourage you to read the chapter on Relational Data in the R4DS book.

You can also review all of the join animations by visiting my page on tidy animated verbs.

Joins Review

Mutating Joins

Filtering Joins

Homework

We’ll use the Breast Cancer Registry Data from Session 4. I’ve started writing the script with some tips and hints, which you can download from the Session 6 materials page. Or, if you’d like to start blind, you can use the code in Import Data below to get started.

Import Data

library(tidyverse)
# If you don't have the data in your working directory, you can download it by
# uncommenting and running the two lines below.
# download.file("https://git.io/fATNV", "CancerRegistryData.csv")
# download.file("https://git.io/fATNo", "CancerRegistryDataPatients.csv")

cr_patients <- read_csv(
  "CancerRegistryDataPatients.csv",
  col_types = cols(`Date Last Patient Contact/Dead` = col_date(format = "%m/%d/%y"))
)

cr <- read_csv(
  "CancerRegistryData.csv",
  col_types = cols(
    `Date Last Patient Contact/Dead` = col_date(format = "%m/%d/%y"),
    `Date of Diagnosis` = col_date(format = "%m/%d/%y")
  )
)

Task 1

Merge the two data sets to add the patient information to their diagnosis records.

Task 2

Find deceased patients at least 75 years or older. Create a data frame containing their patient information and all of their diagnosis records.

Task 3

Find the diagnosis records of the African American patients.

Bonus: Find the diagnosis records of married African American patients using two joins.

Task 4

First, verify that all patients in the diagnosis records data set cr appearin the patient information data set using anti_join

Then, find all patients who have had more than one cancer diagnosis and create a data frame with only their patient information, using semi_join().

Task 5

Find patients who were last contacted in 2017 and merge their patient information with their diagnosis records. Transform all dates to be the number of days since their earliest diagnosis.

Session 6

August 22, 2018

Links

Homework Review

Overview

Tidying Data

The three rules of tidy data

Why tidy data?

Example 1

Example 2

Example 3

Example 4

Tidying Data

`gather()`

`spread()`

Animated

Quick Review

Working with Dates

About dates, times and date-times

Converting strings to date-times

Getting and Setting Date Components

Moving the date-time

Durations and Comparing date-times

Rounding Dates

Time Zones

Assignment

Review: Two-Table dplyr verbs

Keys

Joins

Homework

Task 1

Task 2

Task 3

Task 4

Task 5