This course will develop core proficiencies in data processing in the R programming language. Through a series of 8 interactive weekly sessions and 4 bi-weekly follow-up sessions, the Collaborative Data Services Core team will learn to:
Import data from varying formats (CSV, Excel, SAS, txt) into RStudio.
Perform common data transformation steps in R, including manipulating, processing, cleaning, de-identifying, and merging data sources.
Output data from R/RStudio in flexible formats (CSV, Excel, SAS, txt, etc.), as required by CDS clients.
Save and document data processing steps as an R script that can be repeated or modified later.
These objectives will be tailored and refined prior to the course offering through conversations with the CDS Core team about their requirements and current workflows.
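The four core objectives above can be sketched as a single short R script. This is a minimal illustration only: the file names, column names, and values are invented for the example and do not come from any CDS project, and it assumes the readr and dplyr packages are installed.

```r
# Sketch only: all file names, columns, and values below are hypothetical.
library(readr)
library(dplyr)

# Set up a small example input file in a temporary directory
work_dir <- tempdir()
in_path  <- file.path(work_dir, "visits_raw.csv")
out_path <- file.path(work_dir, "visits_clean.csv")

raw <- data.frame(
  patient_name = c("A. Smith", "B. Jones", "C. Lee"),
  site         = c("north", "south", "north"),
  visits       = c(3, NA, 5)
)
write_csv(raw, in_path)

# 1. Import: read the CSV (cols() lets readr guess the column types quietly)
visits <- read_csv(in_path, col_types = cols())

# 2. Transform: de-identify, clean, and merge with a lookup table
sites <- data.frame(
  site   = c("north", "south"),
  region = c("Region 1", "Region 2")
)

clean <- visits %>%
  select(-patient_name) %>%       # de-identify: drop the name column
  filter(!is.na(visits)) %>%      # clean: remove rows with missing counts
  left_join(sites, by = "site")   # merge: add region from the lookup table

# 3. Output: write the result for delivery
write_csv(clean, out_path)
```

Because every step lives in the script, rerunning it on a new input file repeats the same processing, which is the point of the fourth objective.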
Additional goals will depend on the needs of CDS Core and the scope of previous projects chosen by the team to serve as exemplary challenges.
Visualize data using ggplot2
String manipulation via stringr with regular expressions
Interacting with databases using dbplyr
Documenting data analyses and process with R Markdown
Version control with git
Looking forward (e.g. once core R proficiencies are incorporated into routine practice), a needs assessment could identify whether the following further goals would be useful in a follow-up series:
Organizing and collecting functions for common CDS data actions, processes and interface interactions into an R Package
The overall structure of the course will take the form of 8 weekly ~2 hour sessions. Each session will be structured to support a core CDS learning goal drawn from previously completed projects, with the intention that participants leave the session having learned a new core skill that can be applied to CDS workflows. Each session will build on the skills learned in the previous one, increasing the breadth of tasks that can be accomplished.
Overall, the first 2-4 sessions will focus on getting participants to “productive” as quickly as possible, relative to the goals and requirements of CDS workflows. These sessions will be about 70% presentation and 30% “try-it-yourself” interactive examples. Rather than teaching discrete units (e.g. data types, subsetting, control structures), concepts will be introduced “just-in-time” for the session goals. As the complexity of the targeted outcome skill grows, the level of the concepts explored can increase to include more difficult or comprehensive tasks.
After the initial learning period (8 sessions), the course will shift to more informal “workgroup” sessions centered around recent applications of R programming to actual CDS queries. These 4 bi-weekly follow-up sessions will be used to help troubleshoot problems or issues that arise while integrating R into CDS tasks. A goal in the design of this course is that, after each initial session, the participants will be able to use RStudio and R tools to fully replace at least one critical task. Along the way, participants will be able to recognize areas where R can generally be applied to streamline their daily tasks and these observations can be used to guide later content in conjunction with the list of topics in the additional goals section above.
From a technical perspective, the course will feature modern R packages and tools, namely RStudio and the tidyverse packages and programming styles, in favor of base R methods. Fundamental concepts from base R will be included as needed, but the course will favor productive data engineering over technical completeness.
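As a rough illustration of the difference in style, the sketch below performs the same task twice, once with base R indexing and once with dplyr verbs. It uses the built-in mtcars data set purely as a stand-in; the choice of columns and filter condition is arbitrary.

```r
# Sketch only: mtcars is a stand-in data set; the task itself is arbitrary.
library(dplyr)

# Base R: subset rows and columns with indexing, then sort with order()
base_result <- mtcars[mtcars$cyl == 4, c("mpg", "cyl", "wt")]
base_result <- base_result[order(base_result$mpg, decreasing = TRUE), ]

# tidyverse (dplyr): the same steps written as a pipeline of verbs
tidy_result <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, wt) %>%
  arrange(desc(mpg))
```

Both versions select the same rows and columns. The pipeline reads top to bottom as a sequence of named steps, which is one reason a course aimed at quick productivity might prefer it.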
library() and pkg::function()
<-
NA and NULL
mpg and flights
readr and readxl
ggplot() + geom_point()
aes() + geom_<GEOM>()
geom_point(), geom_line(), geom_smooth()
+ facet_wrap()
filter(), arrange(), select(), rename(), mutate()
select() helpers
mutate()
summarize(), count(), sample_n(), sample_frac(), top_n()
group_by() + summarize(), count(), mutate()
select(flights, year:dep_time) vs mutate(df, "year", 2)
spread() and gather()
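A few of the functions listed above can be combined in a short sketch. It assumes the ggplot2, dplyr, and tidyr packages are installed and uses the mpg data set that ships with ggplot2; the particular columns and summaries are chosen only for illustration. Note that current tidyr releases mark gather() and spread() as superseded by pivot_longer() and pivot_wider(), although both still work.

```r
# Sketch only: the columns and summaries are chosen for illustration.
library(ggplot2)  # also provides the mpg example data set
library(dplyr)
library(tidyr)

# Layered plot: data + aesthetic mapping + geom + facets
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)

# Grouped summary: one row per vehicle class
by_class <- mpg %>%
  group_by(class) %>%
  summarize(mean_cty = mean(cty), mean_hwy = mean(hwy))

# Reshape wide -> long: one row per class and measure
long <- by_class %>%
  gather(key = "measure", value = "value", mean_cty, mean_hwy)

# Reshape long -> wide: back to one row per class
wide <- long %>%
  spread(key = measure, value = value)
```

Printing p draws the faceted scatter plot; by_class, long, and wide can be inspected in the RStudio viewer to see how each reshape changes the layout of the table.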