Staying in the Driver’s Seat

Calibrating AI Use in Statistical Research, Training, and Practice

Naim Rashid, PhD

Department of Biostatistics · UNC Chapel Hill

April 2026
QR code: naimurashid.github.io

What’s at Stake

What’s the worry, and is it new?

A scenario

Imagine a sophisticated user. They read every line of AI output, run the diagnostics, and can defend the analysis that was produced.

Their dissertation defense goes well — until a committee member asks why they didn’t consider a different approach.

They have no answer. Not because they missed it, but because the AI never proposed it, and they had stopped generating alternatives of their own.

To the committee it looks like a simple oversight; neither side notices what’s actually missing. The defense still goes well.

Composite scenario, not a specific case.

The core tension

AI tools are most dangerous not when they give wrong answers, but when they give right answers you don’t understand.

The goal is not to use AI less — it’s to use it in ways that keep you learning.

Source: Rashid Lab Code Ownership Guide (March 2026)

This is not an AI-specific worry

Tools have been changing what we encode and what skills we maintain for decades.

  • Calculators changed mental arithmetic
  • GPS changed spatial memory
  • Search engines changed factual recall

Cognitive offloading is a well-established phenomenon: external tools improve performance while changing what we come to retain and practice internally. (Risko & Gilbert, Trends in Cognitive Sciences, 2016; Sparrow et al., Science, 2011.)

Bainbridge’s Ironies of Automation (1983) anticipated the central argument four decades before generative AI: as automation grows, the human’s job becomes monitoring and judgment — and those skills atrophy without practice.

And recently, in professionals: Lee et al. (CHI 2025, n=319 knowledge workers) — higher GenAI confidence predicts less critical thinking; outputs grow more homogeneous (mechanized convergence).

What is new is the scope, not the mechanism.

Why this matters

I use these tools extensively. This isn’t a talk about abstinence — it’s a talk about calibration.

By now, the surface concerns are familiar to most of us.

  • LLMs hallucinate.
  • LLM benchmarks rarely report uncertainty, dependence, or deployment shift.
  • Recent reports of fabricated citations surviving expert review.

I’m not here to repeat what you already discuss in your circles.

The quieter cost is what I want to focus on:

  • For our research — the cost is rigor.
  • For our trainees — the cost is the formation of judgment.
  • For our students — the cost is the literacy we owe them.

What Stays Yours

What must each of us actually hold onto, and why?

A part you don’t outsource

There’s a part of every project that you can’t outsource without the work going slack.

The same sentence applies across:

  • our own research
  • the doctoral students we train
  • the students we teach

The next three slides show what that part is in each.

In your own research

The part you don’t outsource is the chain of judgment: framing the question; choosing methods (over alternatives the AI didn’t suggest); verifying that what was produced is what you actually intended; interpreting what the result means and what it does not; defending the work to a hostile reader.

A frequent failure mode is plausibly wrong output that survives casual inspection.

Across 576,000 code samples from 16 LLMs (mostly Python/JavaScript): commercial GPT-class models referenced nonexistent packages 5.2% of the time; open-source models, 21.7%. 200,000+ unique fabricated names.

Spracklen et al., USENIX Security 2025. Generalizes in kind to R/SAS; magnitude less established.
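One cheap guard for this failure mode, sketched in R (the helper name and the regex are mine, not from the study; the CRAN lookup needs internet access):

# Flag library()/require() calls whose packages are neither installed
# locally nor listed on CRAN: a first pass over an AI-drafted script.
check_packages <- function(path) {
  code <- readLines(path, warn = FALSE)
  hits <- regmatches(code, regexpr("(library|require)\\(([A-Za-z0-9.]+)\\)", code))
  pkgs <- unique(gsub("^(library|require)\\(|\\)$", "", hits))
  data.frame(package   = pkgs,
             installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE),
             on_cran   = pkgs %in% rownames(available.packages()))
}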

For us, the dangerous output isn’t a missing package. It’s a model that runs, a p-value that prints, a sensitivity analysis answering the wrong estimand.

Without your judgment, plausibly wrong passes.

In doctoral training

The part you don’t outsource is the productive struggle.

“Conditions of instruction that make performance improve rapidly often fail to support long-term retention and transfer.”

Bjork & Bjork, 2011. From research on conventional instruction; the extension to AI-assisted learning is an extrapolation, not a finding.

Concretely: working through a derivation by hand the first time. Debugging by staring at residuals before asking for help. Building a small simulation from scratch before reaching for a library.
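One hypothetical example of such a from-scratch simulation, in base R: checking the empirical coverage of the nominal 95% t-interval before trusting any packaged version of the same calculation.

# Coverage of the 95% t-interval for a normal mean; no packages needed.
set.seed(1)
n <- 30; mu <- 2; reps <- 5000
covered <- replicate(reps, {
  x  <- rnorm(n, mean = mu)
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)  # should land near 0.95

Ten lines, no dependencies, and writing them once makes “coverage” concrete.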

The struggle is where calibrated intuition is laid down — the kind that lets each of us, today, catch a wrong answer in a paper, a colleague’s analysis, or an AI’s output.

We should be deliberate about whether AI lets the next generation skip the private wrong turns that build judgment.

In classroom teaching

The part you don’t outsource is the first encounter with a concept.

The schema gets built then, or it does not.

Concept formation is:

  • the most contextual phase of learning — it depends on what just happened in this room, with this cohort, this semester
  • the hardest phase to automate well, for that reason
  • where the broader statistical literacy of our students is determined

A student who outsources the first encounter with confounding, likelihood, or simulation often retains the answer without the schema.

The pattern across all three

What AI is excellent at: executing under decisions already made.
What stays ours: deciding — what’s worth doing, what counts as done, what the result means.

AI increasingly helps with deciding too — surfacing options, finding gaps in reasoning, stress-testing your plan. The responsibility for the decision stays with you.

The line moves over time. The distinction does not.

How We Keep It

What does keeping understanding alive look like in practice?

A few practices that travel

The source guide is built in layers: principles → daily habits → project-level practices. We won’t walk through all of them.

The ones that travel best across coding, research, and teaching share a common theme.

deliberate friction

Listen for it as we walk through the practices.

Read everything that ships

Treat AI output like a pull request from a junior developer.

Review every line. Question every design choice. Switch into reviewer mode:

  • Could I have written this myself? If not, what concept am I missing?
  • Is this the approach I would have chosen? Why or why not?
  • What does this assume about inputs, environment, or state?

Same posture for an AI-drafted methods sentence, simulation parameter, lecture example, or rubric.

If something doesn’t make sense — stop and learn the concept before accepting it.

The pre-prompt pause

Before prompting, spend 2–5 minutes writing out your own approach.

Answer four questions in a scratch file:

  1. What am I trying to accomplish? (the goal, not the implementation)
  2. What’s my best guess for how to do it? (rough or wrong is fine)
  3. What are the tricky parts? (edge cases, assumptions, unknowns)
  4. How will I know it worked? (what output am I expecting; how will I verify)
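A filled-in example (the study details are invented for illustration):

1. Goal: estimate the treatment effect on 90-day survival, adjusted for site.
2. Best guess: Cox model stratified by site; maybe a frailty term instead.
3. Tricky parts: censoring may be informative; one site changed protocols mid-study.
4. Verification: hazard ratios agree across the stratified and frailty fits; Schoenfeld residuals look flat.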

Then prompt with your plan as context. Compare the AI’s approach to yours.

The differences are where learning happens.

The decision log

A lightweight DECISIONS.md in every project.

## 2026-03-15: Switched simulator optimizer from BFGS to L-BFGS-B

Reason: BFGS hit memory limits for n_params > 5000.
Trade-off: ~10× memory reduction; slower per-iteration; needs warm start.
Affected: src/optim.R, sim/runner.R, methods section §3.2.

The point isn’t the artifact — it’s the act of writing.

Articulating the decision forces you to understand it. The log helps teammates, reviewers, and future-you reconstruct the reasoning.

Generalizes beyond code: research projects, course curricula, manuscripts.

The explain-it-back test

After AI helps you build something, close the tool.

Empirically: the self-explanation effect (Chi et al., 1989, 1994) and retrieval practice (Roediger & Karpicke, 2006) — well-evidenced mechanisms by which articulating cements understanding.

Try to explain the implementation to:

  • a rubber duck
  • a colleague
  • a voice memo on your phone

Where you get stuck is where your understanding has gaps.

The thinking required to explain is the point.

Architecture documentation

A living document describing your project at a high level.

  • major components and how they interact
  • the data model and how state flows
  • key design decisions and their rationale
  • external dependencies and why each was chosen
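A hypothetical skeleton (the section names are mine, not from the guide):

# ARCHITECTURE.md
## Components: simulation engine, estimation code, reporting pipeline
## Data model: how raw data becomes analysis-ready; where state lives
## Key decisions: one line each, linking to DECISIONS.md entries
## Dependencies: each external package, and why it earned its place

The headings are scaffolding; the prose under them is the part you write yourself.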

“Writing it yourself — not asking AI to generate it — is the point. The act of writing is the act of understanding.”

Rashid Lab Code Ownership Guide

For research, an analysis plan. For a course, a syllabus rationale. For a manuscript, a one-page summary.

What ties these practices together

deliberate friction

The kind of struggle a learner can resolve — not arbitrary difficulty.

When learning feels effortful and resolvable, it tends to stick.
When it feels smooth, the feeling can mislead us.

Bjork & Bjork, 2011 (desirable difficulties — resolvable, not arbitrary).
Kapur, 2008 / 2016 (productive failure, distinct from unproductive failure).
Deslauriers et al., PNAS 2019 (the feeling of learning ≠ learning).

The friction isn’t the problem AI solves.
It’s the asset AI is most likely to remove.

Across Our Roles

How does this differ for our research, our trainees, our students?

For educators and team leads

Four practices from the source guide that travel well:

  • The Two-Pass Assignment — attempt without AI; iterate with AI; submit both with reflection.
  • Explain-Your-Code Assessments — supplement code with an oral or written walkthrough.
  • Debugging Exercises — intentionally broken code; no AI tools allowed.
  • Code Review Culture — AI-generated code receives the same (or more) scrutiny as human-written code. “Can you walk me through this?” If the author can’t walk through it, the work isn’t ready.

The work isn’t to remove AI. It’s to design assignments and reviews where understanding still has to show.

Evidence that design matters

Bastani et al., PNAS 2025. Field experiment, ~1,000 high-school math students.

  • Vanilla GPT during practice: better in practice; 17% worse when access removed.
  • Tutor-prompted GPT (with guardrails): better in practice; no detectable harm when access removed.

One study, one population, one subject. Suggestive of a broader pattern; not yet established for graduate technical training.

Design matters more than the tool.

For doctoral training

The advisor’s standing question:

“Which struggles do I let happen, and which do I remove?”

The job isn’t to remove AI from your trainees’ work. It’s to design the struggles AI cannot remove — the parts where understanding has to be built, not produced.

Concrete moves:

  • Require the unplugged version first on key derivations and debugging.
  • Audit what the AI did on a project, not just whether it worked.
  • Make defending the decision — not producing the output — the bar for sign-off.

For your own research

Your own intuition and judgment have to be maintained, too.

The same practices apply to your own work:

  • A DECISIONS.md for your active research projects, not just your code.
  • Unplugged hours: 90 minutes a week without AI assistance, on real problems.
  • Periodic explain-it-back on your own analyses, before you teach them or submit them.

The trainees you mentor will calibrate to the practices you actually keep.

For regulated environments

For many in clinical trials, biopharma, and industry, the line isn’t personal — it’s drawn by SOPs, sponsor protocols, FDA, and QA.

The same practices apply, with an audit-trail layer:

  • Decision log → prompt logs. Log prompt, model version, and accepted diff (a sketch follows this list).
  • Pre-prompt pause → SAP pre-specification. Define permitted AI use before analysis begins.
  • Read everything → double programming. AI-assisted code gets independent re-derivation.
  • Architecture docs → reproducibility under model deprecation. Pin model versions.
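For the first item, a minimal sketch in R (the function name and columns are hypothetical; this is not a validated audit tool):

# Append one audit-trail row per accepted AI suggestion.
log_prompt <- function(prompt, model_version, accepted_diff,
                       file = "prompt_log.csv") {
  entry <- data.frame(timestamp     = format(Sys.time(), tz = "UTC"),
                      model_version = model_version,
                      prompt        = prompt,
                      accepted_diff = accepted_diff)
  exists <- file.exists(file)
  write.table(entry, file, sep = ",", row.names = FALSE,
              col.names = !exists, append = exists)
}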

The line is drawn by your sponsor, the FDA, and your QA group — not by you alone.

The spectrum of delegation

Where do you draw the line — and how does it shift across activities?

Usually safe to delegate
  • Boilerplate & scaffolding
  • Syntax & idioms
  • Format conversions
  • Search-query brainstorming
  • Prose smoothing
  • Slide formatting
Delegate with review
  • Analysis code
  • Refactoring
  • Database queries
  • Methods-section drafts
  • Literature summaries / source triage
  • Worked examples & rubric drafts
Stay hands-on
  • Architecture & design
  • Core algorithms
  • Methodological choices
  • Interpretation & defense
  • Curriculum judgment
  • Diagnosing the room

The columns are stable. The items shift by activity, by stakes, and by where you are in your career.

The line is personal

Where the columns sit on this grid is yours, not mine.

It will be different for you than for me.
Different for our trainees than for either of us.
Different for code than for analysis than for teaching.

The discipline isn’t where you draw the line.
It’s that you are deliberate about drawing one at all.

Calibration questions

An honest self-check — for yourself, for your trainees, and for yourself as advisor.

For yourself:

For your trainees:

For yourself, as advisor:

Diagnostic, not judgmental. If anything gets a “no” — the next slide is where to start.

A place to start

If something on the previous slide got a “no” — start here, in order.

  1. Open a DECISIONS.md in your active project this week.
  2. Before your next prompt, write your own approach first.
  3. Schedule one 90-minute unplugged session.
  4. Line-by-line review on the next AI-generated chunk.
  5. Explain-it-back on the last thing you built with AI help.

Adopt one. Add the next when the first feels routine.

A year in: you can defend your work without your notes; your trainees can defend theirs.

From the source guide’s quick-start checklist, lightly reordered.

The GPS analogy

GPS didn’t make people worse drivers,
but it did make many people worse at knowing where they are.

The people who still have a sense of direction are the ones who glance at the map occasionally, notice landmarks, and pay attention to the route — even while GPS is guiding them.

From the source guide.

Stay in the Driver’s Seat

Calibrating AI Use in Statistical Research, Training, and Practice

not because we don’t trust the GPS,
but because we still want to know where we are when we arrive.

Thank you · Questions welcome

naim@unc.edu  ·  naimurashid.github.io  ·  LinkedIn: Naim Rashid  ·  GitHub: @naimurashid