# Targets Pipeline Guide

*Reproducible workflows with {targets}*
This page assumes familiarity with:

- Basic R programming (R Style Guide)
- Command line basics (running commands in a terminal)
- Git fundamentals (Git Practices)
A comprehensive guide to using the {targets} package for reproducible research workflows in R.
## Why Targets?
The {targets} package provides:
- **Automatic dependency tracking** - only re-runs what changed
- **Parallel execution** - distributes work across cores and nodes
- **Caching** - stores intermediate results for fast iteration
- **Reproducibility** - documents the entire computational workflow
## Quick Start

### Installation
```r
install.packages("targets")
install.packages("tarchetypes")  # useful extensions

# For cluster computing (Longleaf)
install.packages("crew")
install.packages("crew.cluster")
```

### Basic Setup

Create `_targets.R` in your project root:
```r
# _targets.R
library(targets)
library(tarchetypes)

# Source your functions
tar_source("R/")

# Define pipeline
list(
  # Data loading
  tar_target(raw_data, read_csv("data/raw/dataset.csv")),

  # Processing
  tar_target(clean_data, clean_dataset(raw_data)),

  # Analysis
  tar_target(model_fit, fit_model(clean_data)),

  # Results
  tar_target(results_table, summarize_results(model_fit)),

  # Figures
  tar_target(fig_main, create_main_figure(model_fit))
)
```

### Running the Pipeline
```r
# Run the full pipeline from the shell:
#   Rscript -e "targets::tar_make()"

# Or interactively in R
targets::tar_make()

# Visualize dependencies
targets::tar_visnetwork()

# Check status
targets::tar_progress()
```

## Core Concepts
### Targets
A target is a single unit of work with:

- **Name**: unique identifier
- **Command**: R expression to execute
- **Dependencies**: automatically detected from the command
```r
tar_target(
  name = model_results,
  command = fit_model(training_data, params)
  # Dependencies: training_data, params, fit_model()
)
```

### Dependency Detection
Targets automatically tracks:

- Other targets referenced in commands
- Functions called (and their source code)
- Files tracked with `format = "file"` or `tarchetypes::tar_file()`
When any dependency changes, the target is invalidated and re-runs.
### Invalidation

A target re-runs when:

1. Its command code changes
2. Any upstream target changes
3. Any function it uses changes
4. Tracked files change
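You can also inspect and trigger invalidation manually; a minimal sketch using {targets} helpers, assuming a pipeline with a `model_fit` target like the Quick Start example:

```r
library(targets)

# List targets whose code or dependencies changed since the last run
tar_outdated()

# Force model_fit (and everything downstream) to rebuild on the next run
tar_invalidate(model_fit)
tar_make()
```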
## Common Patterns

### Dynamic Branching (Simulation Studies)
Run the same analysis across multiple scenarios:
```r
# Define scenarios
tar_target(
  scenarios,
  data.frame(
    scenario_id = c("null", "alt_small", "alt_large"),
    effect_size = c(0, 0.2, 0.5),
    n_reps = 1000
  )
),

# Branch over scenarios
tar_target(
  sim_results,
  run_simulation(
    effect_size = scenarios$effect_size,
    n_reps = scenarios$n_reps,
    scenario_id = scenarios$scenario_id
  ),
  pattern = map(scenarios)  # creates one branch per row
),

# Combine results
tar_target(
  combined_results,
  bind_rows(sim_results)
)
```

### File Tracking
Track external files so pipeline re-runs when they change:
```r
# Track input file
tar_target(data_file, "data/raw/dataset.csv", format = "file"),
tar_target(data, read_csv(data_file)),

# Track output file
tar_target(
  report_file,
  {
    render("reports/analysis.Rmd")
    "reports/analysis.html"  # return the path
  },
  format = "file"
)
```

### Quarto/RMarkdown Integration
```r
library(tarchetypes)

# Render Quarto document
tar_quarto(
  paper,
  path = "paper/main.qmd"
)

# Render with tracked dependencies
tar_quarto(
  paper,
  path = "paper/main.qmd",
  extra_files = c("paper/references.bib", "paper/template.tex")
)
```

### Configuration Management
Load configuration from YAML:
```r
tar_target(config, yaml::read_yaml("config/settings.yml")),
tar_target(
  analysis_results,
  run_analysis(
    data = clean_data,
    alpha = config$analysis$alpha,
    method = config$analysis$method
  )
)
```

## Slurm Integration (Longleaf)
### Basic Setup
```r
# _targets.R
library(targets)
library(crew)
library(crew.cluster)

# Configure Slurm controller
tar_option_set(
  controller = crew_controller_slurm(
    name = "slurm_workers",
    workers = 10,  # max concurrent jobs
    slurm_partition = "general",
    slurm_time_minutes = 60,
    slurm_cpus_per_task = 4,
    slurm_memory_gigabytes_per_cpu = 4,
    slurm_log_output = "logs/slurm_%j.out",
    slurm_log_error = "logs/slurm_%j.err"
  ),
  # Continue on individual failures
  error = "continue",
  # Store large objects on workers
  storage = "worker"
)
```

### Multi-Controller Architecture (Recommended)
For pipelines with heterogeneous tasks (some memory-intensive, some CPU-intensive), use multiple controllers:
```r
library(crew)

# Controller for general tasks (more workers, less memory)
default_controller <- crew_controller_local(
  name = "default",
  workers = 8,
  seconds_idle = 120
)

# Controller for memory-intensive tasks (fewer workers, more memory)
heavy_controller <- crew_controller_local(
  name = "heavy",
  workers = 2,
  seconds_idle = 300
)

# Controller for I/O-bound tasks
io_controller <- crew_controller_local(
  name = "io",
  workers = 4,
  seconds_idle = 120
)

# Combine into a controller group
tar_option_set(
  controller = crew_controller_group(
    default_controller,
    heavy_controller,
    io_controller
  ),
  error = "continue"
)

# Then assign tasks to the appropriate controller:
tar_target(
  heavy_computation,
  run_memory_intensive_task(data),
  resources = tar_resources(crew = tar_resources_crew(controller = "heavy"))
)
```

### Longleaf-Specific Configuration
```r
# Longleaf Slurm controller with UNC-specific settings
longleaf_controller <- crew_controller_slurm(
  name = "longleaf",
  workers = 50,  # adjust based on fairshare
  slurm_partition = "general",  # or "gpu", "bigmem"
  slurm_time_minutes = 1440,  # 24 hours
  slurm_cpus_per_task = 4,
  slurm_memory_gigabytes_per_cpu = 4,  # 16 GB total per job
  slurm_log_output = "logs/slurm_%A_%a.out",
  slurm_log_error = "logs/slurm_%A_%a.err"
)

# For GPU jobs
gpu_controller <- crew_controller_slurm(
  name = "gpu",
  workers = 4,
  slurm_partition = "gpu",
  slurm_time_minutes = 1440,
  slurm_cpus_per_task = 8,
  slurm_memory_gigabytes_per_cpu = 8,
  slurm_gres = "gpu:1"  # request 1 GPU
)
```
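With a controller group that includes the `gpu` controller above, an individual target can be pinned to it the same way as in the multi-controller example (`train_network()` is a hypothetical function):

```r
tar_target(
  nn_fit,
  train_network(clean_data),  # hypothetical GPU-accelerated model fit
  resources = tar_resources(crew = tar_resources_crew(controller = "gpu"))
)
```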
### Worker Deployment
Mark compute-heavy targets for worker execution:
```r
tar_target(
  sim_results,
  run_simulation(scenario),
  pattern = map(scenarios),
  deployment = "worker"  # run on Slurm
),

tar_target(
  summary_table,
  summarize(sim_results),
  deployment = "main"    # run locally (fast)
)
```
### Running on the Cluster

```bash
# Submit the controller job
sbatch run_pipeline.sh
```

`run_pipeline.sh`:
```bash
#!/bin/bash
#SBATCH --job-name=targets_controller
#SBATCH --time=24:00:00
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --output=logs/controller_%j.out

module load r/4.3.0
Rscript -e "targets::tar_make()"
```

### Monitoring
```bash
# Check the Slurm queue
squeue -u $USER

# Watch pipeline progress
watch -n 10 'Rscript -e "targets::tar_progress()"'

# View worker logs
tail -f logs/slurm_*.out
```

## Project Structure
Recommended layout for targets projects:
```
project/
├── _targets.R        # Pipeline definition
├── _targets/         # Cache (gitignored)
├── R/                # Functions (tar_source() loads these)
│   ├── data_cleaning.R
│   ├── modeling.R
│   └── visualization.R
├── config/
│   └── settings.yml  # Configuration
├── data/
│   ├── raw/          # Input data
│   └── processed/    # Intermediate files (or use the targets cache)
├── results/          # Final outputs
├── paper/            # Manuscript
├── logs/             # Slurm logs
└── Makefile          # Convenience commands
```
## Makefile Integration
```make
# Run pipeline
run:
	Rscript -e "targets::tar_make()"

# Parallel execution (workers are configured via the crew controller in _targets.R)
run-parallel:
	Rscript -e "targets::tar_make()"

# Visualize
visualize:
	Rscript -e "targets::tar_visnetwork()"

# Check outdated targets
status:
	Rscript -e "targets::tar_outdated()"

# Clean cache
clean:
	Rscript -e "targets::tar_destroy()"

# Validate pipeline syntax
validate:
	Rscript -e "targets::tar_validate()"

.PHONY: run run-parallel visualize status clean validate
```

## Best Practices
### 1. Keep Targets Small
Each target should do one thing:
```r
# Good: separated concerns
tar_target(clean_data, clean_dataset(raw_data)),
tar_target(model, fit_model(clean_data)),
tar_target(predictions, predict(model, test_data)),

# Bad: monolithic target
tar_target(everything, {
  clean <- clean_dataset(raw_data)
  model <- fit_model(clean)
  predict(model, test_data)
})
```

### 2. Use Functions, Not Scripts
Put logic in functions, not inline:
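Concretely, the logic lives in a sourced file; a hypothetical, minimal `R/modeling.R` entry (using plain `lm()` for simplicity):

```r
# R/modeling.R -- hypothetical sketch; a real version might wrap brms::brm()
fit_linear_model <- function(data, formula = y ~ x1 + x2) {
  stopifnot(is.data.frame(data))  # fail fast on bad input
  lm(formula, data = data)        # fit and return the model object
}
```

Keeping the function under `R/` means `tar_source("R/")` loads it, and {targets} tracks its body for invalidation.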
```r
# Good: function defined in R/modeling.R
tar_target(model, fit_bayesian_model(data, priors, config)),

# Bad: inline code
tar_target(model, {
  library(brms)
  formula <- y ~ x1 + x2
  priors <- c(prior(normal(0, 1), class = "b"))
  brm(formula, data = data, prior = priors, ...)
})
```

### 3. Version Control
Add to `.gitignore`:

```
_targets/
```
Commit `_targets.R` and all `R/*.R` function files.
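With the project layout above, a slightly fuller `.gitignore` might also exclude logs and intermediate data (a sketch; adjust to your data-sharing policy):

```
_targets/
logs/
data/processed/
```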
### 4. Seed Management
Set seeds for reproducibility:
```r
tar_option_set(seed = 2024)

# Or per-target
tar_target(sim, run_sim(data), seed = 42)
```

### 5. Error Handling
Use `error = "continue"` for long-running pipelines:

```r
tar_option_set(error = "continue")
```

Check for failures:

```r
targets::tar_meta() |> dplyr::filter(!is.na(error))
```

## Debugging
### Inspect a Target
```r
# Load a target's value
tar_read(model_results)

# Load it into the global environment
tar_load(model_results)
```

### Debug a Failed Target
```r
# Get the error message
tar_meta(names = "failed_target", fields = "error")

# Run interactively in the current session
tar_make(names = "failed_target", callr_function = NULL)
```

### Workspace Recovery
```r
# Save a workspace when a target errors
tar_option_set(workspace_on_error = TRUE)

# Load the failed target's workspace
tar_workspace(failed_target)
# Now debug with all of its objects available
```

## Common Issues
### Memory Problems
```r
# Store large objects with faster serialization
tar_target(
  big_result,
  compute_big_thing(data),
  format = "qs"
)

# Or keep storage and retrieval on the workers
tar_option_set(storage = "worker", retrieval = "worker")
```

### Slow Dependency Detection
```r
# Restrict tracking to functions from specific packages
tar_option_set(
  imports = "mylabpkg"  # package names, not file paths ("mylabpkg" is hypothetical)
)

# Run garbage collection between targets to ease memory pressure
tar_option_set(garbage_collection = TRUE)
```

### Cluster Job Failures
Check the logs and adjust resources:

```r
crew_controller_slurm(
  slurm_time_minutes = 120,           # increase time
  slurm_memory_gigabytes_per_cpu = 8  # increase memory
)
```
## Lab-Specific Notes

### Longleaf Partition Selection
| Partition | Use Case | Time Limit |
|---|---|---|
| `debug` | Testing | 4 hours |
| `general` | Standard jobs | 7 days |
| `gpu` | GPU computing | 7 days |
### Project Templates

See the lab templates for pre-configured targets setups:

- `template-methods-paper/` - methodology papers
- `template-research-project/` - general research
## Debugging with Claude Code
Claude Code understands targets pipelines and can help debug issues.
### Common Tasks

```
> Explain the targets pipeline in _targets.R
> Run tar_outdated() to see what needs to run
> The sim_results target failed. Help me debug it.
> Configure this pipeline for Slurm with 20 workers
```
### When Targets Fail

```
> Check tar_meta() for the error
> Load the workspace for the failed target
> What caused the simulation to fail?
```
Claude can examine error logs, check resource usage, and suggest fixes like increasing memory or time limits.
See Claude Code Lab Integration for more on working with targets.