Review the homework from Session 6.
You can check out the approach I took in this page. There is more than one way to accomplish each task, so it’s okay if your code looks different!
Question: What are the three rules of tidy data?
Question: Which function from tidyr takes data from wide to long format?
gather()
spread()
Question: Which of the following functions is used to:
months()
floor_date()
month()
Answer
month(x)
extracts the month of the date x
x + months(2)
adds two months to x
floor_date(x, unit = "month")
rounds down to the nearest monthQuestion: Given the following dates
"15 March 2018 02pm"
"04/29/18 02:49"
"Fri Feb 9 2018 14:49:29"
"2018-09-16"
which functions can be used to convert the dates from strings to dates or date-times?
mdy_hm()
dmy_h()
ymd()
mdy_hms()
Answer
dmy_h(" 4 January 2018 01pm")
: 2018-01-04 13:00:00
mdy_hm("02/18/18 01:54")
: 2018-02-18 01:54:00
mdy_hms("Fri Dec 1 2017 13:54:01")
: 2017-12-01 13:54:01
ymd("2018-07-08")
: 2018-07-08Review Homework
paste()
str_trim()
str_detect()
str_count()
str_extact()
str_remove()
str_replace()
and str_replace_all()
str_to_lower()
, str_to_title()
, str_to_upper()
|
(apple) (banana)
.
+
, *
, ?
^
, $
janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness.
To install janitor:
Three functions are useful on a daily basis:
clean_names()
: Makes nice, easy to use names
remove_empty()
: Removes rows or columns where everything is missing
get_dupes(df, col1, col2)
: Get rows in df
that have duplicate values in col1
and col2
.
library(janitor)
library(readr)
library(lubridate)
cr <- read_csv("CancerRegistryData.csv") %>%
clean_names() %>%
remove_empty() %>%
mutate(
date_of_diagnosis = mdy(date_of_diagnosis),
date_last_patient_contact_dead = mdy(date_last_patient_contact_dead)
)
Parsed with column specification:
cols(
.default = col_character(),
`Medical Record Number` = col_integer(),
`Sequence Number` = col_integer(),
`Age at Diagnosis` = col_integer(),
`Age, Current` = col_integer(),
`Survival Time (months)` = col_integer(),
`Clinical Tumor Size` = col_double()
)
See spec(...) for full column specifications.
[1] "medical_record_number" "sequence_number"
[3] "site_primary_icd_o_3" "histology_behavior_icd_o_3"
[5] "date_of_diagnosis" "age_at_diagnosis"
[7] "age_current" "sex"
[9] "spanish_hispanic_origin" "race"
[11] "vital_status" "date_last_patient_contact_dead"
[13] "survival_time_months" "dc_cause_of_death"
[15] "clinical_tumor_size" "clinical_size_unit"
[17] "tobacco_use_fcds" "summary_of_rx_1st_course"
[19] "payer_source_1_at_dx" "patient_history_alcohol"
[21] "marital_status_at_dx"
Learn more at the janitor package website.
Which of the following are valid R strings:
"I am OZ, the Great and Terrible"
"You're a humbug!"
"There's no place like 'home'"
"\"Can't you give me brains?,\" asked Scarecrow"
This script loads the tidyverse, lubridate, and janitor and then loads the example Cancer Registry Data.
All of the stringr functions start with str_
, like str_c()
. This function takes parts of a string and concatenates them, just like c()
takes parts of a vector and puts them together.
[1] "Somethinginteresting"
[1] "Something interesting"
[1] "Something funny" "Something odd"
[1] "Something funny or Something odd"
[1] TRUE
[1] TRUE TRUE FALSE
Visualize
apple
grape
banana
# A tibble: 151 x 21
medical_record_… sequence_number site_primary_ic… histology_behav…
<int> <int> <chr> <chr>
1 687391 1 C509 (BREAST NO… 85202 (LOBULAR …
2 906008 0 C509 (BREAST NO… 85003 (INFILTRA…
3 162232 0 C503 (BREAST LI… 85003 (INFILTRA…
4 710246 0 C503 (BREAST LI… 85003 (INFILTRA…
5 317251 0 C509 (BREAST NO… 85003 (INFILTRA…
6 175897 0 C504 (BREAST UO… 85003 (INFILTRA…
7 197932 0 C504 (BREAST UO… 85003 (INFILTRA…
8 446531 1 C504 (BREAST UO… 85223 (INFILTRA…
9 148456 0 C509 (BREAST NO… 85003 (INFILTRA…
10 334557 1 C504 (BREAST UO… 85003 (INFILTRA…
# ... with 141 more rows, and 17 more variables: date_of_diagnosis <date>,
# age_at_diagnosis <int>, age_current <int>, sex <chr>,
# spanish_hispanic_origin <chr>, race <chr>, vital_status <chr>,
# date_last_patient_contact_dead <date>, survival_time_months <int>,
# dc_cause_of_death <chr>, clinical_tumor_size <dbl>,
# clinical_size_unit <chr>, tobacco_use_fcds <chr>,
# summary_of_rx_1st_course <chr>, payer_source_1_at_dx <chr>,
# patient_history_alcohol <chr>, marital_status_at_dx <chr>
cr %>% filter(
str_detect(histology_behavior_icd_o_3, "CARCINOMA"),
!str_detect(histology_behavior_icd_o_3, "INFILTRATING")
)
# A tibble: 37 x 21
medical_record_… sequence_number site_primary_ic… histology_behav…
<int> <int> <chr> <chr>
1 687391 1 C509 (BREAST NO… 85202 (LOBULAR …
2 341050 1 C569 (OVARY) 84603 (PAPILLAR…
3 896356 2 C182 (COLON ASC… 81403 (ADENOCAR…
4 896356 1 C509 (BREAST NO… 85203 (LOBULAR …
5 644232 0 C508 (BREAST OV… 82113 (TUBULAR …
6 218883 0 C502 (BREAST UI… 85202 (LOBULAR …
7 558013 1 C649 (KIDNEY NO… 83123 (RENAL CE…
8 563418 0 C509 (BREAST NO… 85203 (LOBULAR …
9 883521 0 C509 (BREAST NO… 81403 (ADENOCAR…
10 289034 1 C504 (BREAST UO… 82113 (TUBULAR …
# ... with 27 more rows, and 17 more variables: date_of_diagnosis <date>,
# age_at_diagnosis <int>, age_current <int>, sex <chr>,
# spanish_hispanic_origin <chr>, race <chr>, vital_status <chr>,
# date_last_patient_contact_dead <date>, survival_time_months <int>,
# dc_cause_of_death <chr>, clinical_tumor_size <dbl>,
# clinical_size_unit <chr>, tobacco_use_fcds <chr>,
# summary_of_rx_1st_course <chr>, payer_source_1_at_dx <chr>,
# patient_history_alcohol <chr>, marital_status_at_dx <chr>
[1] 2
[1] 2 1 0
[1] "apple " "grape" " banana"
[1] "apple FRUIT" "grapeFRUIT" "FRUIT banana"
[1] "apple and banana fruit"
[1] "apple and banana "
[1] "apple FRUIT and banana FRUIT"
[1] "stop shouting"
[1] "TALK LOUDLY"
[1] "Be A Little More Professional"
Find all BREAST
cancer diagnosis records with a payer source that includes HMO
s.
Answer
cr %>%
filter(
str_detect(site_primary_icd_o_3, "BREAST"),
str_detect(payer_source_1_at_dx, "HMO")
) %>%
select(medical_record_number,
site_primary_icd_o_3,
payer_source_1_at_dx)
# A tibble: 69 x 3
medical_record_number site_primary_icd_o_3 payer_source_1_at_dx
<int> <chr> <chr>
1 197932 C504 (BREAST UOQ) MANAGED CARE (HMO, PPO)
2 446531 C504 (BREAST UOQ) MANAGED CARE (HMO, PPO)
3 766840 C504 (BREAST UOQ) MANAGED CARE (HMO, PPO)
4 591913 C504 (BREAST UOQ) MANAGED CARE (HMO, PPO)
5 182841 C504 (BREAST UOQ) MANAGED CARE (HMO, PPO)
6 896356 C509 (BREAST NOS) MANAGED CARE (HMO, PPO)
7 218883 C502 (BREAST UIQ) MANAGED CARE (HMO, PPO)
8 348966 C504 (BREAST UOQ) MANAGED CARE (HMO, PPO)
9 718924 C504 (BREAST UOQ) MANAGED CARE (HMO, PPO)
10 458035 C504 (BREAST UOQ) MANAGED CARE (HMO, PPO)
# ... with 59 more rows
Count the total number of times surgery appears in the prescribed first course for each diagnosis record.
cr %>%
mutate(
________ = str_=====(summary_of_rx_1st_course, "SURGERY")
) %>%
group_by(medical_record_number) %>%
summarize(n_surgery = sum(________))
Answer
cr %>%
mutate(
surgery_count = str_count(summary_of_rx_1st_course, "SURGERY")
) %>%
group_by(medical_record_number) %>%
summarize(n_surgery = sum(surgery_count))
# A tibble: 150 x 2
medical_record_number n_surgery
<int> <int>
1 101145 0
2 101380 0
3 112743 0
4 124227 0
5 140848 0
6 145185 0
7 148456 0
8 149065 0
9 151416 0
10 158774 1
# ... with 140 more rows
I took another look at the data and improved the previous code. What improvement did I make to get this result?
# A tibble: 150 x 2
medical_record_number n_surgery
<int> <int>
1 101145 1
2 101380 1
3 112743 1
4 124227 1
5 140848 1
6 145185 1
7 148456 1
8 149065 1
9 151416 1
10 158774 1
# ... with 140 more rows
The cr_patients
dataset also includes an email
field, but some of the records weren’t filled out correctly. An email definitely has to have an @
.
How many patients are missing a valid email address with this criteria?
Answer
550.245.1108x906 |
Okuneva, Okuneva and Okuneva |
+92(5)4722895694 |
814.891.0875x7342 |
(610)647-6685x068 |
013-030-7545 |
Batz, Batz and Batz |
Hammes, Hammes and Hammes |
Mayer Ltd |
268.986.9090x238 |
(947)787-6881x8309 |
Ratke-Ratke |
O’Reilly, O’Reilly and O’Reilly |
Brakus-Brakus |
Spencer, Spencer and Spencer |
397.981.2233x56350 |
790.197.5667x87625 |
Jerde, Jerde and Jerde |
(847)437-7174 |
Durgan, Durgan and Durgan |
Our diagnosis records include an ICD code for the location of the cancer, but these values include additional text:
# A tibble: 167 x 1
site_primary_icd_o_3
<chr>
1 C509 (BREAST NOS)
2 C509 (BREAST NOS)
3 C503 (BREAST LIQ)
4 C503 (BREAST LIQ)
5 C509 (BREAST NOS)
6 C504 (BREAST UOQ)
7 C504 (BREAST UOQ)
8 C504 (BREAST UOQ)
9 C509 (BREAST NOS)
10 C509 (BREAST NOS)
# ... with 157 more rows
How would you describe in words the structure of a breast cancer related IDC code?
Pretend you’re writing instructions for a robot that can only look at one letter of input at a time. How would you instruct the robot to extract just the IDC code "C504"
from the full string "C504 (BREAST UOQ)"
?
Here are the ICD codes found in this example.
[1] "C182" "C383" "C421" "C449" "C501" "C502" "C503" "C504" "C505" "C506"
[11] "C508" "C509" "C529" "C541" "C569" "C649" "C700" "C739" "C778" "C809"
Regular expressions are a language for describing patterns in strings.
"abc123"
RegExp | Meaning | Example | Result |
---|---|---|---|
\w |
Word Character (a letter) | \w |
abc123 |
\d |
Digit (a number) | \d\d |
abc123 |
. |
Anything | c. |
abc1123 |
+ |
One or more | c.+ |
abc123 |
* |
Zero or more | abc\d* |
abc123 |
? |
Maybe? | abcd?1 |
abc123 |
| |
Or | abcd|1 |
abc123 |
^ |
Start | ^a |
abc123 |
$ |
End | 3$ |
abc123 |
() |
Group | abc(d|1) |
abc123 |
As we saw with the examples above, there are characters that have special meaning in strings in R. For example, a string starts and ends with "
or '
.
There is no place like home
If we used "
to open or close the string but we want to have another "
inside the string, we need to escape the quote.
There is no place like "home"
This is called escaping because adding the backslash lets you escape the special semantics of the character. It indicates that you don’t want a special "
, you want a regular literal "
.
What if you want to add a backslash? The backslash is also special (it does the escaping), so to write a literal backslash you have to escape the backslash, too.
\
Some of the regular expressions above include backslashes. To write this in a string in R, you need to escape the backslash.
Error: '\w' is an unrecognized escape in character string starting ""\w"
[1] "a"
[1] "bc1"
\w\w\d
Escaping for R or for the regular expression When writing regular expressions, you’re operating at two levels. First, you are writing a code into R to store the strings. Second, when R is done processing that code we want the output to be a regular expression.
So above, we write "\\w"
into R so that we get a regular expression like "\w"
.
But regular expressions also have special characters, like .
and +
. If we want to write literal versions of these in our regular expressions – for example to match the dot in 10.123
– we need to escape the dot at the regular expression level.
Adding one backslash escapes at the R level:
Error: '\.' is an unrecognized escape in character string starting ""0\."
But this doesn’t pass down to the regular expression level.
Error: '\.' is an unrecognized escape in character string starting ""0\."
To get there we need yet another backslash.
[1] "0.1"
0\.1
How can you write an actual backslash at the regular expression level?
Back to our example from before. How can we extract the ICD code from the diagnosis records?
# A tibble: 167 x 1
site_primary_icd_o_3
<chr>
1 C509 (BREAST NOS)
2 C509 (BREAST NOS)
3 C503 (BREAST LIQ)
4 C503 (BREAST LIQ)
5 C509 (BREAST NOS)
6 C504 (BREAST UOQ)
7 C504 (BREAST UOQ)
8 C504 (BREAST UOQ)
9 C509 (BREAST NOS)
10 C509 (BREAST NOS)
# ... with 157 more rows
Answer
# A tibble: 167 x 2
site_primary_icd_o_3 icd
<chr> <chr>
1 C509 (BREAST NOS) C509
2 C509 (BREAST NOS) C509
3 C503 (BREAST LIQ) C503
4 C503 (BREAST LIQ) C503
5 C509 (BREAST NOS) C509
6 C504 (BREAST UOQ) C504
7 C504 (BREAST UOQ) C504
8 C504 (BREAST UOQ) C504
9 C509 (BREAST NOS) C509
10 C509 (BREAST NOS) C509
# ... with 157 more rows
How can we extract the ICD meaning? Note: it’s inside (
and )
, but these are special characters in regular expressions.
Answer
cr %>%
select(site_primary_icd_o_3) %>%
mutate(
icd = str_extract(site_primary_icd_o_3, "C5\\d\\d"),
icd_info = str_extract(site_primary_icd_o_3, "\\(.+\\)"),
icd_info = str_remove_all(icd_info, "\\(|\\)")
)
# A tibble: 167 x 3
site_primary_icd_o_3 icd icd_info
<chr> <chr> <chr>
1 C509 (BREAST NOS) C509 BREAST NOS
2 C509 (BREAST NOS) C509 BREAST NOS
3 C503 (BREAST LIQ) C503 BREAST LIQ
4 C503 (BREAST LIQ) C503 BREAST LIQ
5 C509 (BREAST NOS) C509 BREAST NOS
6 C504 (BREAST UOQ) C504 BREAST UOQ
7 C504 (BREAST UOQ) C504 BREAST UOQ
8 C504 (BREAST UOQ) C504 BREAST UOQ
9 C509 (BREAST NOS) C509 BREAST NOS
10 C509 (BREAST NOS) C509 BREAST NOS
# ... with 157 more rows