Improve your data analysis workflow with the drake R package

data science R

A quick guide.

Miha Gazvoda https://mihagazvoda.com
10-30-2020

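drake is an R package by Will Landau that analyzes your workflow, skips steps whose results are already up to date, and orchestrates the rest, optionally with distributed computing.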

Perks of using drake.

Setup

Install drake. You can also load an example written by Kirill Müller. The example appears in a new main folder, and I use it below to illustrate some of the files.

# Install and load drake
install.packages("drake")
library(drake)

# Get an example in a new `main` folder
drake::drake_example("main")

# You can use drake::drake_examples() to see all examples

Project structure

It’s suggested that you start your project using this structure [2]:

make.R
R/
├── packages.R
├── functions.R
└── plan.R
data/

You can also use dflow::use_dflow() to create a nearly identical structure.
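
To make the layout more concrete, here is a minimal sketch of what R/packages.R and R/functions.R could contain, shown together in one block. The package list and the body of create_plot() are my assumptions for illustration (the plan in this post only tells us that a create_plot() function exists), and the quick check at the end uses the built-in iris data.

```r
# R/packages.R (assumed contents): attach the packages the project uses
library(drake)
library(dplyr)
library(ggplot2)

# R/functions.R (hypothetical): user-defined helpers used as plan commands
create_plot <- function(data) {
  # Histogram of sepal width, one fill colour per species
  ggplot(data, aes(x = Sepal.Width, fill = Species)) +
    geom_histogram(binwidth = 0.25)
}

# Quick interactive check on a built-in dataset
p <- create_plot(iris)
class(p)
```

Keeping helpers like this in functions rather than loose scripts is what lets the plan stay short: each command is just a function call.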

Make

make.R is a master script that loads your packages and functions, creates the drake plan, and calls make() to build it:

# make.R
source("R/packages.R")  # loads packages
source("R/functions.R") # loads user-defined functions
source("R/plan.R")      # creates drake plan

make(plan)              # builds the targets in `plan` (defined in R/plan.R)

Plan

A drake plan is the high-level catalog of data analysis steps (such as data cleaning, model fitting, visualization, and reporting) in a workflow.

# plan.R
# The workflow `plan` data frame outlines what you are going to do.
plan <- drake::drake_plan(
  # target, command
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

A drake plan is presented as a data frame with columns named target and command. Each row represents a step in the workflow. Each command is a concise expression that makes use of our functions, and each target is the return value of the command.

The plan object:

target    command
raw_data  readxl::read_excel(file_in("raw_data.xlsx"))
data      raw_data %>% mutate(Species = forcats::fct_inorder(Species))
hist      create_plot(data)
fit       lm(Sepal.Width ~ Petal.Width + Species, data)
report    rmarkdown::render(knitr_in("report.Rmd"), output_file = file_out("report.html"), quiet = TRUE)

[Figure: dependency graph of the plan.]
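
Because the plan is a data frame, you can inspect it like any other one. The sketch below uses a small, hypothetical plan (the targets small and fit and the mtcars data are my own choices, not part of the example project) so that it runs without any input files.

```r
library(drake)

# A small, hypothetical plan built on the mtcars dataset that ships with R
plan <- drake_plan(
  small = head(mtcars, 10),
  fit = lm(mpg ~ wt, data = small)
)

class(plan)         # includes "data.frame"
names(plan)         # the columns: target and command
plan$target         # target names, one per row
plan$command[[2]]   # each command is stored as unevaluated R code
```

Nothing is computed at this point. drake_plan() only records the commands, and the targets are built later by make().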

Choose good targets

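As Will Landau proposed in the drake user manual, a good target is long enough to eat up a decent chunk of runtime, small enough that make() frequently skips it, meaningful to your project, and a well-behaved R object compatible with saveRDS().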

Workflow

Even if you use drake, it makes sense to develop interactively. r_make("make.R") [4] builds your project. loadd() and readd() then bring built targets back into your session, so you can use them interactively to develop things further.
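
As a rough illustration of that loop, the sketch below builds a tiny plan and then pulls the results back into the session. The plan and its target names (small, fit) are hypothetical, and I call make() directly rather than r_make() so that everything fits in a single script. Note that make() stores its results in a hidden .drake/ folder in the working directory.

```r
library(drake)

# Hypothetical two-step plan on a built-in dataset
plan <- drake_plan(
  small = head(mtcars, 10),
  fit = lm(mpg ~ wt, data = small)
)

# Build all targets and store them in the cache
make(plan)

# loadd() creates an object named `small` in the current session
loadd(small)
nrow(small)

# readd() returns the value, so you choose the name yourself
fit_coefs <- coef(readd(fit))
names(fit_coefs)
```

In practice I would reach for loadd() when I want to keep working with a target under its own name, and for readd() when I only need the value once inside a larger expression.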

Basic commands

Here are the most useful commands.

function            description
r_make()            Build your project.
clean()             Force targets to be out of date and remove target names from the data in the cache.
vis_drake_graph()   Show an interactive visual network representation of your workflow.
code_to_function()  Create functions from scripts so you can pass them as commands in a drake plan.
loadd()             Load built target(s) into your R session.
readd()             Read and return a built target.

You can find more functions in the drake README.
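
To see how some of these commands interact, here is a minimal sketch that builds a hypothetical plan, invalidates one target with clean(), and rebuilds. It also uses outdated(), a drake function that is not in the table above, to list what the next build would redo. As before, I use make() instead of r_make() to keep the example in one script, and the targets small and fit are my own invention.

```r
library(drake)

# Hypothetical two-step plan on a built-in dataset
plan <- drake_plan(
  small = head(mtcars, 10),
  fit = lm(mpg ~ wt, data = small)
)

make(plan)                # first run builds the targets
make(plan)                # second run skips targets that are up to date

clean(fit)                # force `fit` to be out of date
stale <- outdated(plan)   # targets the next make() would rebuild
stale

make(plan)                # rebuilds what is stale and skips the rest
```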

• • •

For further reading, I suggest The drake R Package User Manual. The book also served as a resource for this post.


  1. More about functions in R.

  2. It’s almost the same as in the example.

  3. If there are many functions, split them up into multiple files.

  4. If you name the make file _drake.R instead, you can call r_make() without an argument.