Targets Pipeline Guide

Reproducible workflows with {targets}

Tip: Prerequisites

This page assumes familiarity with:

  - Basic R programming (R Style Guide)
  - Command line basics (running commands in a terminal)
  - Git fundamentals (Git Practices)

A comprehensive guide to using the {targets} package for reproducible research workflows in R.

Why Targets?

The {targets} package provides:

  1. Automatic dependency tracking - Only re-runs what changed
  2. Parallel execution - Distributes work across cores/nodes
  3. Caching - Stores intermediate results for fast iteration
  4. Reproducibility - Documents the entire computational workflow

Quick Start

Installation

install.packages("targets")
install.packages("tarchetypes")  # Useful extensions

# For cluster computing (Longleaf)
install.packages("crew")
install.packages("crew.cluster")

Basic Setup

Create _targets.R in your project root:

# _targets.R
library(targets)
library(tarchetypes)

# Source your functions
tar_source("R/")

# Define pipeline
list(
  # Data loading
  tar_target(raw_data, read_csv("data/raw/dataset.csv")),

  # Processing
  tar_target(clean_data, clean_dataset(raw_data)),

  # Analysis
  tar_target(model_fit, fit_model(clean_data)),

  # Results
  tar_target(results_table, summarize_results(model_fit)),

  # Figures
  tar_target(fig_main, create_main_figure(model_fit))
)

Running the Pipeline

# Run full pipeline
Rscript -e "targets::tar_make()"

# Or interactively in R
targets::tar_make()

# Visualize dependencies
targets::tar_visnetwork()

# Check status
targets::tar_progress()

Core Concepts

Targets

A target is a single unit of work with:

  - Name: Unique identifier
  - Command: R expression to execute
  - Dependencies: Automatically detected from the command

tar_target(
  name = model_results,
  command = fit_model(training_data, params)
  # Dependencies: training_data, params, fit_model()
)

Dependency Detection

{targets} automatically tracks:

  - Other targets referenced in commands
  - Functions called (and their source code)
  - Files declared with format = "file" or tarchetypes::tar_file()

When any dependency changes, the target is invalidated and re-runs.

Invalidation

A target re-runs when:

  1. Its command code changes
  2. Any upstream target changes
  3. Any function it uses changes
  4. Tracked files change
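Before re-running anything, you can check which targets a change has invalidated. A minimal sketch, assuming the pipeline from the Quick Start above (both functions are part of {targets}):

```r
library(targets)

# List targets whose code or upstream dependencies have changed
tar_outdated()

# Show target status in the dependency graph, hiding tracked
# functions so only targets appear
tar_visnetwork(targets_only = TRUE)
```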

Common Patterns

Dynamic Branching (Simulation Studies)

Run the same analysis across multiple scenarios:

# Define scenarios
tar_target(
  scenarios,
  data.frame(
    scenario_id = c("null", "alt_small", "alt_large"),
    effect_size = c(0, 0.2, 0.5),
    n_reps = 1000
  )
),

# Branch over scenarios
tar_target(
  sim_results,
  run_simulation(
    effect_size = scenarios$effect_size,
    n_reps = scenarios$n_reps,
    scenario_id = scenarios$scenario_id
  ),
  pattern = map(scenarios)  # Creates one branch per row
),

# Combine results
tar_target(
  combined_results,
  bind_rows(sim_results)
)

File Tracking

Track external files so pipeline re-runs when they change:

# Track input file
tar_target(data_file, "data/raw/dataset.csv", format = "file"),
tar_target(data, read_csv(data_file)),

# Track output file
tar_target(
  report_file,
  {
    rmarkdown::render("reports/analysis.Rmd")
    "reports/analysis.html"  # Return the path
  },
  format = "file"
)

Quarto/RMarkdown Integration

library(tarchetypes)

# Render Quarto document
tar_quarto(
  paper,
  path = "paper/main.qmd"
)

# Render with dependencies
tar_quarto(
  paper,
  path = "paper/main.qmd",
  extra_files = c("paper/references.bib", "paper/template.tex")
)

Configuration Management

Load configuration from YAML:

tar_target(config, yaml::read_yaml("config/settings.yml")),

tar_target(
  analysis_results,
  run_analysis(
    data = clean_data,
    alpha = config$analysis$alpha,
    method = config$analysis$method
  )
)

Slurm Integration (Longleaf)

Basic Setup

# _targets.R
library(targets)
library(crew)
library(crew.cluster)

# Configure Slurm controller
tar_option_set(
  controller = crew_controller_slurm(
    name = "slurm_workers",
    workers = 10,                    # Max concurrent jobs
    slurm_partition = "general",
    slurm_time_minutes = 60,
    slurm_cpus_per_task = 4,
    slurm_memory_gigabytes_per_cpu = 4,
    slurm_log_output = "logs/slurm_%j.out",
    slurm_log_error = "logs/slurm_%j.err"
  ),
  # Continue on individual failures
  error = "continue",
  # Store large objects on workers
  storage = "worker"
)

Longleaf-Specific Configuration

# Longleaf Slurm controller with UNC-specific settings
longleaf_controller <- crew_controller_slurm(
  name = "longleaf",
  workers = 50,                          # Adjust based on fairshare
  slurm_partition = "general",           # or "gpu", "bigmem"
  slurm_time_minutes = 1440,             # 24 hours max for general
  slurm_cpus_per_task = 4,
  slurm_memory_gigabytes_per_cpu = 4,    # 16GB total per job
  slurm_log_output = "logs/slurm_%A_%a.out",
  slurm_log_error = "logs/slurm_%A_%a.err"
)

# For GPU jobs
gpu_controller <- crew_controller_slurm(
  name = "gpu",
  workers = 4,
  slurm_partition = "gpu",
  slurm_time_minutes = 1440,
  slurm_cpus_per_task = 8,
  slurm_memory_gigabytes_per_cpu = 8,
  script_lines = "#SBATCH --gres=gpu:1"  # Request 1 GPU
)

Worker Deployment

Mark compute-heavy targets for worker execution:
tar_target(
  sim_results,
  run_simulation(scenario),
  pattern = map(scenarios),
  deployment = "worker"  # Run on Slurm
),

tar_target(
  summary_table,
  summarize(sim_results),
  deployment = "main"  # Run locally (fast)
)

Running on Cluster

# Submit controller job
sbatch run_pipeline.sh

run_pipeline.sh:

#!/bin/bash
#SBATCH --job-name=targets_controller
#SBATCH --time=24:00:00
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --output=logs/controller_%j.out

module load r/4.3.0
Rscript -e "targets::tar_make()"

Monitoring

# Check Slurm queue
squeue -u $USER

# Watch pipeline progress
watch -n 10 'Rscript -e "targets::tar_progress()"'

# View worker logs
tail -f logs/slurm_*.out

Project Structure

Recommended layout for targets projects:

project/
├── _targets.R           # Pipeline definition
├── _targets/            # Cache (gitignored)
├── R/                   # Functions (tar_source loads these)
│   ├── data_cleaning.R
│   ├── modeling.R
│   └── visualization.R
├── config/
│   └── settings.yml     # Configuration
├── data/
│   ├── raw/             # Input data
│   └── processed/       # Intermediate (or use targets cache)
├── results/             # Final outputs
├── paper/               # Manuscript
├── logs/                # Slurm logs
└── Makefile             # Convenience commands
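tar_source("R/") loads every script in R/, so the pipeline's functions live there rather than in _targets.R. A minimal sketch of what R/data_cleaning.R might contain (the clean_dataset() helper is hypothetical, matching the Quick Start pipeline):

```r
# R/data_cleaning.R -- loaded by tar_source("R/") in _targets.R

clean_dataset <- function(raw_data) {
  # Drop rows with missing values
  cleaned <- raw_data[stats::complete.cases(raw_data), ]
  # Standardize column names
  names(cleaned) <- tolower(names(cleaned))
  cleaned
}
```

Because {targets} hashes function bodies, editing clean_dataset() invalidates the clean_data target and everything downstream, but nothing else.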

Makefile Integration

# Run pipeline
run:
    Rscript -e "targets::tar_make()"

# Parallel execution (local): configure crew_controller_local(workers = 4)
# in _targets.R; tar_make() then uses it automatically
run-parallel:
    Rscript -e "targets::tar_make()"

# Visualize
visualize:
    Rscript -e "targets::tar_visnetwork()"

# Check outdated
status:
    Rscript -e "targets::tar_outdated()"

# Clean cache
clean:
    Rscript -e "targets::tar_destroy()"

# Validate pipeline syntax
validate:
    Rscript -e "targets::tar_validate()"

.PHONY: run run-parallel visualize status clean validate

Best Practices

1. Keep Targets Small

Each target should do one thing:

# Good: Separated concerns
tar_target(clean_data, clean_dataset(raw_data)),
tar_target(model, fit_model(clean_data)),
tar_target(predictions, predict(model, test_data)),

# Bad: Monolithic target
tar_target(everything, {
  clean <- clean_dataset(raw_data)
  model <- fit_model(clean)
  predict(model, test_data)
})

2. Use Functions, Not Scripts

Put logic in functions, not inline:

# Good: Function in R/modeling.R
tar_target(model, fit_bayesian_model(data, priors, config)),

# Bad: Inline code
tar_target(model, {
  library(brms)
  formula <- y ~ x1 + x2
  priors <- c(prior(normal(0, 1), class = "b"))
  brm(formula, data = data, prior = priors, ...)
})

3. Version Control

Add to .gitignore:

_targets/

Commit _targets.R and all R/*.R function files.

4. Seed Management

Set seeds for reproducibility:

tar_option_set(seed = 2024)

# Each target gets its own deterministic seed derived from the
# global seed and the target's name; inspect them with:
targets::tar_meta(fields = "seed")

5. Error Handling

Use error = "continue" for long-running pipelines:

tar_option_set(error = "continue")

Check for failures:

targets::tar_meta() |> dplyr::filter(!is.na(error))

Debugging

Inspect Target

# Load a target's value
tar_read(model_results)

# Load into environment
tar_load(model_results)

Debug Failed Target

# Get the error
tar_meta(names = "failed_target", fields = "error")

# Run interactively
tar_make(names = "failed_target", callr_function = NULL)

Workspace Recovery

# Save workspace on error
tar_option_set(workspace_on_error = TRUE)

# Load failed workspace
tar_workspace(failed_target)
# Now debug with all objects available

Common Issues

Memory Problems

# Use a faster serialization format for large objects
tar_target(
  big_result,
  compute_big_thing(data),
  format = "qs"  # Requires the qs package
)

# Or let workers save and load their own data, so large objects
# never pass through the main process
tar_option_set(storage = "worker", retrieval = "worker")

Slow Dependency Detection

# Limit function tracking: imports takes package names, not file paths
tar_option_set(imports = "mypackage")

# Run garbage collection between targets to help with memory
tar_option_set(garbage_collection = TRUE)

Cluster Job Failures

Check logs and adjust resources:

crew_controller_slurm(
  slurm_time_minutes = 120,  # Increase time
  slurm_memory_gigabytes_per_cpu = 8  # Increase memory
)

Resources

Lab-Specific Notes

Longleaf Partition Selection

Partition   Use Case        Time Limit
debug       Testing         4 hours
general     Standard jobs   7 days
gpu         GPU computing   7 days
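When a pipeline mixes CPU and GPU work, the two controllers from the Slurm section above can be registered as a group, with individual targets routed by controller name. A sketch, assuming the longleaf_controller and gpu_controller objects defined earlier (fit_gpu_model() is a hypothetical function):

```r
library(targets)

tar_option_set(
  controller = crew::crew_controller_group(
    longleaf_controller,  # name = "longleaf"
    gpu_controller        # name = "gpu"
  )
)

# Route a GPU-heavy target to the "gpu" controller by name
tar_target(
  gpu_fit,
  fit_gpu_model(clean_data),
  resources = tar_resources(
    crew = tar_resources_crew(controller = "gpu")
  )
)
```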

Shared Package Library

Consider using a shared package library for targets projects:

# In .Rprofile
.libPaths(c("/proj/rashidlab/R-packages", .libPaths()))

Project Templates

See lab templates for pre-configured targets setups:

  - template-methods-paper/ - Methodology papers
  - template-research-project/ - General research

Debugging with Claude Code

Claude Code understands targets pipelines and can help debug issues.

Common Tasks

> Explain the targets pipeline in _targets.R
> Run tar_outdated() to see what needs to run
> The sim_results target failed. Help me debug it.
> Configure this pipeline for Slurm with 20 workers

When Targets Fail

> Check tar_meta() for the error
> Load the workspace for the failed target
> What caused the simulation to fail?

Claude can examine error logs, check resource usage, and suggest fixes like increasing memory or time limits.

See Claude Code Lab Integration for more on working with targets.