Homework Review

Review the homework from Session 6.

You can check out the approach I took in this page. There is more than one way to accomplish each task, so it’s okay if your code looks different!

  • What parts of these tasks were difficult?
  • Which parts did you find easy?
  • What questions did you have as you worked?
  • Were you able to find the answers to any of these questions?

Tidy Data

Question: What are the three rules of tidy data?

Question: Which function from tidyr takes data from wide to long format?

  1. gather()
  2. spread()

Working with Dates

Question: Which of the following functions is used to:

  1. extract the month from a date;
  2. add a number of months to a date;
  3. find the first day of the month to which a date belongs?
  1. months()
  2. floor_date()
  3. month()

Answer

  • month(x) extracts the month of the date x
  • x + months(2) adds two months to x
  • floor_date(x, unit = "month") rounds down to the nearest month

Question: Given the following dates

  • "15 March 2018 02pm"
  • "04/29/18 02:49"
  • "Fri Feb 9 2018 14:49:29"
  • "2018-09-16"

which functions can be used to convert the dates from strings to dates or date-times?

  1. mdy_hm()
  2. dmy_h()
  3. ymd()
  4. mdy_hms()

Answer

  • dmy_h(" 4 January 2018 01pm"): 2018-01-04 13:00:00

  • mdy_hm("02/18/18 01:54"): 2018-02-18 01:54:00

  • mdy_hms("Fri Dec 1 2017 13:54:01"): 2017-12-01 13:54:01

  • ymd("2018-07-08"): 2018-07-08

Overview

  1. Review Homework

  2. Working with strings (regex-free)
    • paste()
    • str_trim()
    • str_detect()
    • str_count()
    • str_extact()
    • str_remove()
    • str_replace() and str_replace_all()
    • str_to_lower(), str_to_title(), str_to_upper()
  3. Regular Expressions
    • regexplain
    • or |
    • groups (apple) (banana)
    • anything .
    • modifiers +, *, ?
    • anchors ^, $
    • backslashes

Package Spotlight: janitor

janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness.

To install janitor:

Clean names, remove empties, find duplicates

Three functions are useful on a daily basis:

  1. clean_names(): Makes nice, easy to use names

  2. remove_empty(): Removes rows or columns where everything is missing

  3. get_dupes(df, col1, col2): Get rows in df that have duplicate values in col1 and col2.

Parsed with column specification:
cols(
  .default = col_character(),
  `Medical Record Number` = col_integer(),
  `Sequence Number` = col_integer(),
  `Age at Diagnosis` = col_integer(),
  `Age, Current` = col_integer(),
  `Survival Time (months)` = col_integer(),
  `Clinical Tumor Size` = col_double()
)
See spec(...) for full column specifications.
 [1] "medical_record_number"          "sequence_number"               
 [3] "site_primary_icd_o_3"           "histology_behavior_icd_o_3"    
 [5] "date_of_diagnosis"              "age_at_diagnosis"              
 [7] "age_current"                    "sex"                           
 [9] "spanish_hispanic_origin"        "race"                          
[11] "vital_status"                   "date_last_patient_contact_dead"
[13] "survival_time_months"           "dc_cause_of_death"             
[15] "clinical_tumor_size"            "clinical_size_unit"            
[17] "tobacco_use_fcds"               "summary_of_rx_1st_course"      
[19] "payer_source_1_at_dx"           "patient_history_alcohol"       
[21] "marital_status_at_dx"          

Learn more at the janitor package website.

Doing things with strings

Quick String Review

Which of the following are valid R strings:

  1. "I am OZ, the Great and Terrible"

  2. "You're a humbug!"

  3. "There's no place like 'home'"

  4. "\"Can't you give me brains?,\" asked Scarecrow"

Load Packages and Data

This script loads the tidyverse, lubridate, and janitor and then loads the example Cancer Registry Data.

Work Directly with Strings

All of the stringr functions start with str_, like str_c(). This function takes parts of a string and concatenates them, just like c() takes parts of a vector and puts them together.

[1] "Somethinginteresting"
[1] "Something interesting"
[1] "Something funny" "Something odd"  
[1] "Something funny or Something odd"

Detect if a string contains another string

[1] TRUE
[1]  TRUE  TRUE FALSE

Visualize

apple

grape

banana

# A tibble: 151 x 21
   medical_record_… sequence_number site_primary_ic… histology_behav…
              <int>           <int> <chr>            <chr>           
 1           687391               1 C509 (BREAST NO… 85202 (LOBULAR …
 2           906008               0 C509 (BREAST NO… 85003 (INFILTRA…
 3           162232               0 C503 (BREAST LI… 85003 (INFILTRA…
 4           710246               0 C503 (BREAST LI… 85003 (INFILTRA…
 5           317251               0 C509 (BREAST NO… 85003 (INFILTRA…
 6           175897               0 C504 (BREAST UO… 85003 (INFILTRA…
 7           197932               0 C504 (BREAST UO… 85003 (INFILTRA…
 8           446531               1 C504 (BREAST UO… 85223 (INFILTRA…
 9           148456               0 C509 (BREAST NO… 85003 (INFILTRA…
10           334557               1 C504 (BREAST UO… 85003 (INFILTRA…
# ... with 141 more rows, and 17 more variables: date_of_diagnosis <date>,
#   age_at_diagnosis <int>, age_current <int>, sex <chr>,
#   spanish_hispanic_origin <chr>, race <chr>, vital_status <chr>,
#   date_last_patient_contact_dead <date>, survival_time_months <int>,
#   dc_cause_of_death <chr>, clinical_tumor_size <dbl>,
#   clinical_size_unit <chr>, tobacco_use_fcds <chr>,
#   summary_of_rx_1st_course <chr>, payer_source_1_at_dx <chr>,
#   patient_history_alcohol <chr>, marital_status_at_dx <chr>
# A tibble: 37 x 21
   medical_record_… sequence_number site_primary_ic… histology_behav…
              <int>           <int> <chr>            <chr>           
 1           687391               1 C509 (BREAST NO… 85202 (LOBULAR …
 2           341050               1 C569 (OVARY)     84603 (PAPILLAR…
 3           896356               2 C182 (COLON ASC… 81403 (ADENOCAR…
 4           896356               1 C509 (BREAST NO… 85203 (LOBULAR …
 5           644232               0 C508 (BREAST OV… 82113 (TUBULAR …
 6           218883               0 C502 (BREAST UI… 85202 (LOBULAR …
 7           558013               1 C649 (KIDNEY NO… 83123 (RENAL CE…
 8           563418               0 C509 (BREAST NO… 85203 (LOBULAR …
 9           883521               0 C509 (BREAST NO… 81403 (ADENOCAR…
10           289034               1 C504 (BREAST UO… 82113 (TUBULAR …
# ... with 27 more rows, and 17 more variables: date_of_diagnosis <date>,
#   age_at_diagnosis <int>, age_current <int>, sex <chr>,
#   spanish_hispanic_origin <chr>, race <chr>, vital_status <chr>,
#   date_last_patient_contact_dead <date>, survival_time_months <int>,
#   dc_cause_of_death <chr>, clinical_tumor_size <dbl>,
#   clinical_size_unit <chr>, tobacco_use_fcds <chr>,
#   summary_of_rx_1st_course <chr>, payer_source_1_at_dx <chr>,
#   patient_history_alcohol <chr>, marital_status_at_dx <chr>

Count the number of times a string is found in other string(s)

[1] 2
[1] 2 1 0

Extract part of a string

[1] "p" "p" NA 

Convert string to lower, upper or title case

[1] "stop shouting"
[1] "TALK LOUDLY"
[1] "Be A Little More Professional"

Exercises 7.1

Task 1

Find all BREAST cancer diagnosis records with a payer source that includes HMOs.

Answer

# A tibble: 69 x 3
   medical_record_number site_primary_icd_o_3 payer_source_1_at_dx   
                   <int> <chr>                <chr>                  
 1                197932 C504 (BREAST UOQ)    MANAGED CARE (HMO, PPO)
 2                446531 C504 (BREAST UOQ)    MANAGED CARE (HMO, PPO)
 3                766840 C504 (BREAST UOQ)    MANAGED CARE (HMO, PPO)
 4                591913 C504 (BREAST UOQ)    MANAGED CARE (HMO, PPO)
 5                182841 C504 (BREAST UOQ)    MANAGED CARE (HMO, PPO)
 6                896356 C509 (BREAST NOS)    MANAGED CARE (HMO, PPO)
 7                218883 C502 (BREAST UIQ)    MANAGED CARE (HMO, PPO)
 8                348966 C504 (BREAST UOQ)    MANAGED CARE (HMO, PPO)
 9                718924 C504 (BREAST UOQ)    MANAGED CARE (HMO, PPO)
10                458035 C504 (BREAST UOQ)    MANAGED CARE (HMO, PPO)
# ... with 59 more rows

Task 2

Count the total number of times surgery appears in the prescribed first course for each diagnosis record.

Answer

# A tibble: 150 x 2
   medical_record_number n_surgery
                   <int>     <int>
 1                101145         0
 2                101380         0
 3                112743         0
 4                124227         0
 5                140848         0
 6                145185         0
 7                148456         0
 8                149065         0
 9                151416         0
10                158774         1
# ... with 140 more rows

Task 3

I took another look at the data and improved the previous code. What improvement did I make to get this result?

# A tibble: 150 x 2
   medical_record_number n_surgery
                   <int>     <int>
 1                101145         1
 2                101380         1
 3                112743         1
 4                124227         1
 5                140848         1
 6                145185         1
 7                148456         1
 8                149065         1
 9                151416         1
10                158774         1
# ... with 140 more rows

Answer

Task 4

The cr_patients dataset also includes an email field, but some of the records weren’t filled out correctly. An email definitely has to have an @.

How many patients are missing a valid email address with this criteria?

Answer

email
550.245.1108x906
Okuneva, Okuneva and Okuneva
+92(5)4722895694
814.891.0875x7342
(610)647-6685x068
013-030-7545
Batz, Batz and Batz
Hammes, Hammes and Hammes
Mayer Ltd
268.986.9090x238
(947)787-6881x8309
Ratke-Ratke
O’Reilly, O’Reilly and O’Reilly
Brakus-Brakus
Spencer, Spencer and Spencer
397.981.2233x56350
790.197.5667x87625
Jerde, Jerde and Jerde
(847)437-7174
Durgan, Durgan and Durgan

Regular Expressions

What is a regular expression?

Our diagnosis records include an ICD code for the location of the cancer, but these values include additional text:

# A tibble: 167 x 1
   site_primary_icd_o_3
   <chr>               
 1 C509 (BREAST NOS)   
 2 C509 (BREAST NOS)   
 3 C503 (BREAST LIQ)   
 4 C503 (BREAST LIQ)   
 5 C509 (BREAST NOS)   
 6 C504 (BREAST UOQ)   
 7 C504 (BREAST UOQ)   
 8 C504 (BREAST UOQ)   
 9 C509 (BREAST NOS)   
10 C509 (BREAST NOS)   
# ... with 157 more rows

How would you describe in words the structure of a breast cancer related IDC code?

Pretend you’re writing instructions for a robot that can only look at one letter of input at a time. How would you instruct the robot to extract just the IDC code "C504" from the full string "C504 (BREAST UOQ)"?

Here are the ICD codes found in this example.

 [1] "C182" "C383" "C421" "C449" "C501" "C502" "C503" "C504" "C505" "C506"
[11] "C508" "C509" "C529" "C541" "C569" "C649" "C700" "C739" "C778" "C809"

Regular expressions are a language for describing patterns in strings.

An RStudio Addin to help you

RStudio Addin - RegExplain.

Basic Regular Expressions

"abc123"
RegExp Meaning Example Result
\w Word Character (a letter) \w abc123
\d Digit (a number) \d\d abc123
. Anything c. abc1123
+ One or more c.+ abc123
* Zero or more abc\d* abc123
? Maybe? abcd?1 abc123
| Or abcd|1 abc123
^ Start ^a abc123
$ End 3$ abc123
() Group abc(d|1) abc123

Backslashes!

As we saw with the examples above, there are characters that have special meaning in strings in R. For example, a string starts and ends with " or '.

There is no place like home

If we used " to open or close the string but we want to have another " inside the string, we need to escape the quote.

There is no place like "home"

This is called escaping because adding the backslash lets you escape the special semantics of the character. It indicates that you don’t want a special ", you want a regular literal ".

What if you want to add a backslash? The backslash is also special (it does the escaping), so to write a literal backslash you have to escape the backslash, too.

\

Some of the regular expressions above include backslashes. To write this in a string in R, you need to escape the backslash.

Error: '\w' is an unrecognized escape in character string starting ""\w"
[1] "a"
[1] "bc1"
\w\w\d

Escaping for R or for the regular expression When writing regular expressions, you’re operating at two levels. First, you are writing a code into R to store the strings. Second, when R is done processing that code we want the output to be a regular expression.

So above, we write "\\w" into R so that we get a regular expression like "\w".

But regular expressions also have special characters, like . and +. If we want to write literal versions of these in our regular expressions – for example to match the dot in 10.123 – we need to escape the dot at the regular expression level.

Adding one backslash escapes at the R level:

Error: '\.' is an unrecognized escape in character string starting ""0\."

But this doesn’t pass down to the regular expression level.

Error: '\.' is an unrecognized escape in character string starting ""0\."

To get there we need yet another backslash.

[1] "0.1"
0\.1

How can you write an actual backslash at the regular expression level?

Answer

\\

Regular expressions in data

Back to our example from before. How can we extract the ICD code from the diagnosis records?

# A tibble: 167 x 1
   site_primary_icd_o_3
   <chr>               
 1 C509 (BREAST NOS)   
 2 C509 (BREAST NOS)   
 3 C503 (BREAST LIQ)   
 4 C503 (BREAST LIQ)   
 5 C509 (BREAST NOS)   
 6 C504 (BREAST UOQ)   
 7 C504 (BREAST UOQ)   
 8 C504 (BREAST UOQ)   
 9 C509 (BREAST NOS)   
10 C509 (BREAST NOS)   
# ... with 157 more rows

Answer

# A tibble: 167 x 2
   site_primary_icd_o_3 icd  
   <chr>                <chr>
 1 C509 (BREAST NOS)    C509 
 2 C509 (BREAST NOS)    C509 
 3 C503 (BREAST LIQ)    C503 
 4 C503 (BREAST LIQ)    C503 
 5 C509 (BREAST NOS)    C509 
 6 C504 (BREAST UOQ)    C504 
 7 C504 (BREAST UOQ)    C504 
 8 C504 (BREAST UOQ)    C504 
 9 C509 (BREAST NOS)    C509 
10 C509 (BREAST NOS)    C509 
# ... with 157 more rows

How can we extract the ICD meaning? Note: it’s inside ( and ), but these are special characters in regular expressions.

Answer

# A tibble: 167 x 3
   site_primary_icd_o_3 icd   icd_info  
   <chr>                <chr> <chr>     
 1 C509 (BREAST NOS)    C509  BREAST NOS
 2 C509 (BREAST NOS)    C509  BREAST NOS
 3 C503 (BREAST LIQ)    C503  BREAST LIQ
 4 C503 (BREAST LIQ)    C503  BREAST LIQ
 5 C509 (BREAST NOS)    C509  BREAST NOS
 6 C504 (BREAST UOQ)    C504  BREAST UOQ
 7 C504 (BREAST UOQ)    C504  BREAST UOQ
 8 C504 (BREAST UOQ)    C504  BREAST UOQ
 9 C509 (BREAST NOS)    C509  BREAST NOS
10 C509 (BREAST NOS)    C509  BREAST NOS
# ... with 157 more rows