R Package Development

Standards for reusable R packages

Paper Repos vs Package Repos

In our lab, we maintain a clear separation between:

  1. Paper repositories - Analysis code for specific manuscripts
  2. Package repositories - Reusable R packages with exported functions

When to Create a Package

Create a separate R package when:

  • Functions will be reused across multiple projects
  • Code implements novel methodology
  • Other researchers may want to use your methods
  • Functions need formal documentation and testing

When to Keep Code in Paper Repo

Keep code in the paper repository when:

  • Analysis is project-specific
  • Code won’t be reused elsewhere
  • You’re doing exploratory analysis

Package Structure

Following the BATON package as a reference:

MyPackage/
├── DESCRIPTION          # Package metadata and dependencies
├── NAMESPACE            # Exports (auto-generated by roxygen2)
├── LICENSE              # License file
├── R/                   # R source files
│   ├── core_functions.R
│   ├── helpers.R
│   └── zzz.R            # Package hooks
├── man/                 # Documentation (auto-generated)
├── tests/
│   └── testthat/
│       ├── testthat.R
│       └── test-functions.R
├── vignettes/           # Long-form documentation
├── inst/                # Additional files
├── README.md
├── CLAUDE.md            # AI assistant guidance
├── .Rbuildignore
└── .gitignore

DESCRIPTION File

Package: MyPackage
Type: Package
Title: Short Title for the Package
Version: 0.1.0
Authors@R:
    person(given = "First", family = "Last",
           role = c("aut", "cre"),
           email = "email@unc.edu",
           comment = c(affiliation = "UNC Chapel Hill"))
Description: One paragraph describing what the package does.
License: MIT + file LICENSE
Depends: R (>= 4.2)
Imports:
    data.table,
    stats,
    utils
Suggests:
    testthat,
    knitr,
    rmarkdown
VignetteBuilder: knitr
URL: https://github.com/rashidlab/MyPackage
BugReports: https://github.com/rashidlab/MyPackage/issues
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.0

Function Documentation

Use roxygen2 comments above every exported function:

#' Calculate constrained optimization result
#'
#' @description
#' Performs constrained Bayesian optimization over the specified
#' parameter bounds.
#'
#' @param sim_fun Simulation function that takes parameter vector
#' @param bounds Named list of parameter bounds (min, max)
#' @param constraints List of constraint specifications
#' @param n_iter Number of optimization iterations
#' @param seed Random seed for reproducibility
#'
#' @return A list with components:
#' \describe{
#'   \item{best_params}{Optimal parameter values}
#'   \item{best_value}{Objective value at optimum}
#'   \item{history}{Data frame of all evaluations}
#' }
#'
#' @examples
#' \dontrun{
#' result <- bo_calibrate(
#'   sim_fun = my_simulator,
#'   bounds = list(theta = c(0, 1)),
#'   constraints = list(power = 0.80)
#' )
#' }
#'
#' @export
bo_calibrate <- function(sim_fun, bounds, constraints,
                         n_iter = 100, seed = NULL) {
  # Implementation
}

Testing

Create tests in tests/testthat/:

# tests/testthat/test-core.R

test_that("bo_calibrate returns expected structure", {
  result <- bo_calibrate(
    sim_fun = simple_sim,
    bounds = list(x = c(0, 1)),
    n_iter = 5
  )

  expect_type(result, "list")
  expect_named(result, c("best_params", "best_value", "history"))
  expect_s3_class(result$history, "data.frame")
})

test_that("bo_calibrate respects constraints", {
  result <- bo_calibrate(
    sim_fun = constrained_sim,
    bounds = list(x = c(0, 1)),
    constraints = list(g1 = 0),
    n_iter = 10
  )

  expect_true(result$best_value >= 0)
})

Run tests:

devtools::test()

# Or specific test file
testthat::test_file("tests/testthat/test-core.R")

Development Workflow

# Initial setup
usethis::create_package("MyPackage")
usethis::use_testthat()
usethis::use_mit_license()

# During development
devtools::load_all()      # Load package for testing
devtools::document()      # Generate documentation
devtools::test()          # Run tests
devtools::check()         # Full R CMD check

# Building
devtools::build()         # Build tarball
devtools::install()       # Install locally

Linking Paper to Package

In your paper repository, install the package from GitHub:

# In paper repo's setup
devtools::install_github("naimurashid/BATON")

# Or from local source during development
devtools::install("~/Downloads/BATON")

Document the package version in your paper repo’s README:

## Dependencies

This analysis requires:
- BATON >= 0.3.0 (`devtools::install_github("naimurashid/BATON@v0.3.0")`)

README Structure for Packages

A good package README should help users install and use your package quickly:

# PackageName

<!-- badges: start -->
[![R-CMD-check](https://github.com/rashidlab/PackageName/workflows/R-CMD-check/badge.svg)](https://github.com/rashidlab/PackageName/actions)
<!-- badges: end -->

Brief one-line description of what the package does.

## Installation

```r
# From GitHub
devtools::install_github("rashidlab/PackageName")

# From local source (during development)
devtools::install("path/to/PackageName")
```

## Quick Start

```r
library(PackageName)

# Basic usage example
result <- main_function(data, param1 = value)
```

## Main Functions

| Function | Purpose |
|----------|---------|
| `main_function()` | Primary analysis function |
| `helper_function()` | Supporting utility |
| `plot_results()` | Visualization |

## Example

```r
# Load example data
data <- example_dataset

# Run analysis
result <- main_function(data)

# Visualize
plot_results(result)
```

## Documentation

See the package vignettes for detailed tutorials:

```r
vignette("getting-started", package = "PackageName")
```

## Citation

If you use this package, please cite:

```
Author et al. (2024). "Paper Title." Journal Name.
```

## License

MIT

Version Control

Git Tags for Releases

# Tag a release
git tag -a v0.3.0 -m "Release version 0.3.0"
git push origin v0.3.0

CRAN Submission Checklist

Before CRAN submission:

  1. devtools::check() passes with 0 errors, 0 warnings, 0 notes
  2. All examples run without error
  3. Vignettes build successfully
  4. R CMD check --as-cran passes
  5. Test coverage is adequate

Best Practices for R Package Development

Code Organization

R/
├── core.R            # Main exported functions
├── internal.R        # Internal helper functions (not exported)
├── utils.R           # Small utility functions
├── validators.R      # Input validation functions
└── zzz.R             # .onLoad, .onAttach hooks

Guidelines:

  • One main function per file for large functions
  • Group related small functions together
  • Prefix internal functions with . or don’t export them
  • Keep files under ~500 lines

Naming Conventions

# Exported functions: verb_noun
calculate_power <- function(...) { }
fit_model <- function(...) { }
plot_results <- function(...) { }

# Internal functions: .prefix or no export
.validate_inputs <- function(...) { }
check_bounds <- function(...) { }  # Not in NAMESPACE

# S3 methods follow conventions
print.my_class <- function(x, ...) { }
summary.my_class <- function(object, ...) { }

Error Handling

#' Fit model with proper error handling
fit_model <- function(data, formula, ...) {
  # Validate inputs early
 if (!is.data.frame(data)) {
    stop("'data' must be a data frame", call. = FALSE)
  }

  if (nrow(data) == 0) {
    stop("'data' cannot be empty", call. = FALSE)
  }

  # Warnings for non-fatal issues
  if (any(is.na(data$outcome))) {
    warning("NA values in outcome will be excluded", call. = FALSE)
  }

  # Use tryCatch for operations that might fail
  result <- tryCatch(
    {
      # Model fitting code
    },
    error = function(e) {
      stop(sprintf("Model fitting failed: %s", e$message), call. = FALSE)
    }
  )

  result
}

Dependencies

DESCRIPTION best practices:

Imports:
    data.table,    # Essential functionality
    stats,         # Base R packages are fine
    utils
Suggests:
    testthat,      # Testing only
    knitr,         # Vignettes only
    ggplot2        # Optional visualization
  • Minimize dependencies
  • Use Imports: for required packages
  • Use Suggests: for optional/testing packages
  • Avoid depending on packages that depend on many others
  • Consider using base R where possible

Documentation Standards

Every exported function needs:

  1. Title - One line description
  2. Description - Detailed explanation
  3. Parameters - All arguments documented
  4. Return value - What the function returns
  5. Examples - Runnable code (use \dontrun{} for slow examples)
#' Calculate constrained optimization result
#'
#' Performs constrained Bayesian optimization over the specified
#' parameter bounds to find designs satisfying power and type I
#' error constraints.
#'
#' @param sim_fun Function that simulates trial outcomes. Must accept
#'   a named numeric vector of parameters and return a list with
#'   components `power` and `type1`.
#' @param bounds Named list where each element is a length-2 numeric
#'   vector giving (min, max) bounds for that parameter.
#' @param constraints Named list of constraint values. Currently
#'   supports `power` (minimum) and `type1` (maximum).
#' @param n_iter Integer. Number of optimization iterations.
#'   Default is 100.
#' @param seed Integer or NULL. Random seed for reproducibility.
#'
#' @return A list with class "bo_result" containing:
#' \describe{
#'   \item{best_params}{Named numeric vector of optimal parameters}
#'   \item{best_value}{Numeric. Objective value at optimum}
#'   \item{constraint_values}{Named numeric. Constraint values at optimum}
#'   \item{history}{Data frame of all evaluations}
#' }
#'
#' @examples
#' # Define a simple simulator
#' my_sim <- function(params) {
#'   list(power = 0.8 + params["x"] * 0.1,
#'        type1 = 0.05 - params["x"] * 0.02)
#' }
#'
#' # Run optimization
#' result <- bo_calibrate(
#'   sim_fun = my_sim,
#'   bounds = list(x = c(0, 1)),
#'   constraints = list(power = 0.80, type1 = 0.05),
#'   n_iter = 10
#' )
#'
#' @seealso [extract_best()] for extracting optimal parameters
#' @export

Testing Best Practices

# tests/testthat/test-core.R

# Group related tests
test_that("bo_calibrate validates inputs", {
  expect_error(bo_calibrate(NULL), "'sim_fun' must be a function")
  expect_error(bo_calibrate(identity, bounds = "x"),
               "'bounds' must be a list")
})

test_that("bo_calibrate returns expected structure", {
  result <- bo_calibrate(simple_sim, bounds = list(x = c(0, 1)))

  expect_s3_class(result, "bo_result")
  expect_named(result, c("best_params", "best_value", "history"))
})

test_that("bo_calibrate respects constraints", {
  result <- bo_calibrate(
    sim_fun = constrained_sim,
    bounds = list(x = c(0, 1)),
    constraints = list(power = 0.80),
    n_iter = 20
  )

  expect_gte(result$constraint_values["power"], 0.80)
})

# Test edge cases
test_that("bo_calibrate handles edge cases", {
  # Empty result
  expect_error(bo_calibrate(failing_sim, bounds = list(x = c(0, 1))))

  # Single parameter
  result <- bo_calibrate(simple_sim, bounds = list(x = c(0.5, 0.5)))
  expect_equal(result$best_params["x"], 0.5)
})

Performance Considerations

# Vectorize when possible
# BAD:
compute_values <- function(x) {
  result <- numeric(length(x))
  for (i in seq_along(x)) {
    result[i] <- expensive_operation(x[i])
  }
  result
}

# GOOD:
compute_values <- function(x) {
  expensive_operation_vectorized(x)
}

# Use data.table for large data operations
# GOOD:
process_large_data <- function(dt) {
  dt[, .(mean_val = mean(value)), by = group]
}

# Pre-allocate in loops
# GOOD:
results <- vector("list", n_iterations)
for (i in seq_len(n_iterations)) {
  results[[i]] <- compute(i)
}

For computationally intensive code, consider Rcpp integration below.

Rcpp Integration

When R code is too slow even after vectorization, Rcpp lets you write performance-critical sections in C++.

When to Use Rcpp

Scenario Use Rcpp?
Loops that can’t be vectorized Yes
Recursive algorithms Yes
Operations on millions of elements Consider
Code already fast enough No
One-time analysis script Probably no

Rule of thumb: Profile first. Only optimize bottlenecks that matter.

Setup

# Install Rcpp
install.packages("Rcpp")

# Add to your package
usethis::use_rcpp()

This creates: - src/ directory for C++ files - Updates DESCRIPTION with LinkingTo: Rcpp - Adds @useDynLib and @importFrom Rcpp sourceCpp to package documentation

Package Structure with Rcpp

MyPackage/
├── DESCRIPTION
├── NAMESPACE
├── R/
│   ├── RcppExports.R    # Auto-generated
│   └── wrapper.R        # R wrappers for C++ functions
├── src/
│   ├── RcppExports.cpp  # Auto-generated
│   ├── fast_functions.cpp
│   └── Makevars         # Compiler flags (optional)
└── ...

Basic Rcpp Example

src/fast_functions.cpp:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector cumsum_cpp(NumericVector x) {


  int n = x.size();
  NumericVector result(n);

  result[0] = x[0];
  for (int i = 1; i < n; i++) {
    result[i] = result[i-1] + x[i];
  }

  return result;
}

// [[Rcpp::export]]
double weighted_mean_cpp(NumericVector x, NumericVector w) {

  int n = x.size();
  double sum_xw = 0.0;
  double sum_w = 0.0;

  for (int i = 0; i < n; i++) {
    sum_xw += x[i] * w[i];
    sum_w += w[i];
  }

  return sum_xw / sum_w;
}

After running devtools::document(), these functions become available in R:

library(MyPackage)
cumsum_cpp(1:10)
weighted_mean_cpp(c(1, 2, 3), c(0.5, 0.3, 0.2))

Simulation-Specific Example

For adaptive trial simulations with many iterations:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List simulate_trial_cpp(int n_patients,
                        double true_rate,
                        double threshold,
                        int n_sims) {

  NumericVector successes(n_sims);
  NumericVector stopped_early(n_sims);

  for (int sim = 0; sim < n_sims; sim++) {
    int n_success = 0;
    bool stopped = false;

    for (int i = 0; i < n_patients; i++) {
      // Simulate patient outcome
      if (R::runif(0, 1) < true_rate) {
        n_success++;
      }

      // Interim analysis (example: futility check)
      if (i == n_patients / 2) {
        double current_rate = (double)n_success / (i + 1);
        if (current_rate < threshold * 0.5) {
          stopped = true;
          break;
        }
      }
    }

    successes[sim] = n_success;
    stopped_early[sim] = stopped ? 1.0 : 0.0;
  }

  return List::create(
    Named("successes") = successes,
    Named("stopped_early") = stopped_early
  );
}

R Wrapper Functions

Create user-friendly R wrappers with input validation:

R/simulate.R:

#' Simulate adaptive trial
#'
#' @param n_patients Total patients to enroll
#' @param true_rate True response rate
#' @param threshold Efficacy threshold
#' @param n_sims Number of simulations
#' @return List with simulation results
#' @export
simulate_trial <- function(n_patients, true_rate, threshold, n_sims = 10000) {
  # Validate inputs in R (easier error messages)
  stopifnot(
    is.numeric(n_patients), n_patients > 0,
    is.numeric(true_rate), true_rate >= 0, true_rate <= 1,
    is.numeric(threshold), threshold >= 0, threshold <= 1,
    is.numeric(n_sims), n_sims > 0
  )


  # Call C++ implementation
  result <- simulate_trial_cpp(
    as.integer(n_patients),
    as.double(true_rate),
    as.double(threshold),
    as.integer(n_sims)
  )

  # Post-process in R
  result$power <- mean(result$successes >= threshold * n_patients)
  result$early_stop_rate <- mean(result$stopped_early)

  class(result) <- "trial_sim"
  result
}

RcppArmadillo for Linear Algebra

For matrix operations, use RcppArmadillo:

usethis::use_rcpp_armadillo()
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
arma::mat fast_crossprod(arma::mat X) {
  return X.t() * X;
}

// [[Rcpp::export]]
arma::vec solve_system(arma::mat A, arma::vec b) {
  return arma::solve(A, b);
}

DESCRIPTION for Rcpp Packages

Package: MyPackage
Type: Package
Title: Fast Simulation Tools
Version: 0.1.0
Authors@R: person("First", "Last", email = "email@unc.edu", role = c("aut", "cre"))
Description: Efficient simulation tools using Rcpp.
License: MIT + file LICENSE
Imports:
    Rcpp (>= 1.0.0)
LinkingTo:
    Rcpp,
    RcppArmadillo
SystemRequirements: C++11
Encoding: UTF-8
RoxygenNote: 7.3.0

Compilation on Longleaf

On Longleaf, load the appropriate compiler:

module load gcc/11.2.0
module load r/4.3.0

# Then in R
devtools::install()

Debugging Rcpp Code

# Compile with debug symbols
Sys.setenv("PKG_CXXFLAGS" = "-g -O0")
devtools::clean_dll()
devtools::load_all()

# Print debugging (appears in R console)
# In C++: Rcpp::Rcout << "value: " << x << std::endl;

Common Pitfalls

Issue Solution
Segfault Check array bounds; R uses 1-indexing, C++ uses 0
Wrong results Ensure integer vs double types match
Slow compilation Split large files; use // [[Rcpp::export]] sparingly
Memory leaks Rcpp handles most memory; avoid raw new/delete
RNG issues Use R::runif(), R::rnorm() for reproducibility with set.seed()

Learning Resources

Reference Package

See BATON for a complete example of:

  • Package structure and DESCRIPTION
  • Roxygen2 documentation style
  • Test organization
  • Vignette setup
  • GitHub Actions for CI (optional)