Calibrating AI Use in Statistical Research, Training, and Practice
Department of Biostatistics · UNC Chapel Hill
What’s the worry, and is it new?
Imagine a sophisticated user. They read every line of AI output, run the diagnostics, and can defend the analysis that was produced.
Their dissertation defense goes well — until a committee member asks why they didn’t consider a different approach.
They have no answer. Not because they missed it, but because the AI never proposed it, and they had stopped generating alternatives of their own.
Neither side notices what’s actually missing. The defense passes anyway.
Composite scenario, not a specific case.
AI tools are most dangerous not when they give wrong answers, but when they give right answers you don’t understand.
The goal is not to use AI less — it’s to use it in ways that keep you learning.
Source: Rashid Lab Code Ownership Guide (March 2026)
Tools have been changing what we encode and what skills we maintain for decades.
Cognitive offloading is a well-established phenomenon: external tools improve performance while changing what we come to retain and practice internally. (Risko & Gilbert, Trends in Cognitive Sciences, 2016; Sparrow et al., Science, 2011.)
Bainbridge’s Ironies of Automation (1983) anticipated the central argument 40 years before AI: as automation grows, the human’s job becomes monitoring and judgment — and those skills atrophy without practice.
And recently, in professionals: Lee et al. (CHI 2025, n=319 knowledge workers) — higher GenAI confidence predicts less critical thinking; outputs grow more homogeneous (mechanized convergence).
What is new is the scope, not the mechanism.
I use these tools extensively. This isn’t a talk about abstinence — it’s a talk about calibration.
By now, the surface concerns are familiar to most of us.
I’m not here to repeat what you already discuss in your circles.
The quieter cost is what I want to focus on:
What must each of us actually hold onto, and why?
Every project has a part you can’t outsource without the work going slack.
The same sentence applies across coding, research, and teaching.
The next three slides show what that part is in each.
The part you don’t outsource is the chain of judgment: framing the question; choosing methods (over alternatives the AI didn’t suggest); verifying that what was produced is what you actually intended; interpreting what the result means and what it does not; defending the work to a hostile reader.
A frequent failure mode is plausibly wrong output that survives casual inspection.
Across 576,000 code samples from 16 LLMs (mostly Python/JavaScript): commercial GPT-class models referenced nonexistent packages 5.2% of the time; open-source models, 21.7%. 200,000+ unique fabricated names.
Spracklen et al., USENIX Security 2025. Generalizes in kind to R/SAS; magnitude less established.
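One cheap guard against the package half of the problem, as a minimal R sketch (the suggested names below are made up for illustration, not real AI output):

# Before installing anything an AI suggested, confirm the names actually exist on CRAN.
suggested <- c("survival", "simrecoverr", "ggplot2")   # hypothetical suggestions; one is fabricated
on_cran   <- rownames(available.packages())            # current CRAN package index (base R)
suggested[!(suggested %in% on_cran)]                   # anything printed here deserves suspicion before install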
For us, the dangerous output isn’t a missing package. It’s a model that runs, a p-value that prints, a sensitivity analysis answering the wrong estimand.
Without your judgment, plausibly wrong passes.
The part you don’t outsource is the productive struggle.
“Conditions of instruction that make performance improve rapidly often fail to support long-term retention and transfer.”
Bjork & Bjork, 2011. From research on conventional instruction; the extension to AI-assisted learning is an extrapolation, not a finding.
Concretely: working through a derivation by hand the first time. Manually debugging by staring at residuals before reaching for help. Building a small simulation from scratch before reaching for a library.
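A minimal sketch of that last habit, in base R with illustrative settings (not taken from the guide): empirical coverage of the 95% t-interval for a normal mean, built with no packages at all.

set.seed(1)
n <- 30; mu <- 2; B <- 5000                                # illustrative settings
covered <- replicate(B, {
  x  <- rnorm(n, mean = mu, sd = 1)                        # simulate one dataset
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]                               # did the interval cover the truth?
})
mean(covered)                                              # should land near 0.95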
The struggle is where calibrated intuition is laid down — the kind that lets each of us, today, catch a wrong answer in a paper, a colleague’s analysis, or an AI’s output.
We should be deliberate about whether AI lets the next generation skip the private wrong turns that build judgment.
The part you don’t outsource is the first encounter with a concept.
The schema gets built then, or it does not.
Concept formation is:
A student who outsources the first encounter with confounding, likelihood, or simulation often retains the answer without the schema.
AI increasingly helps with deciding too — surfacing options, finding gaps in reasoning, stress-testing your plan. The responsibility for the decision stays with you.
The line moves over time. The distinction does not.
What does keeping understanding alive look like in practice?
The source guide is built in layers: principles → daily habits → project-level practices. We won’t walk through all of them.
The ones that travel best across coding, research, and teaching share a common theme.
deliberate friction
Listen for it as we walk through the practices.
Treat AI output like a pull request from a junior developer.
Review every line. Question every design choice. Switch into reviewer mode:
Same posture for an AI-drafted methods sentence, simulation parameter, lecture example, or rubric.
If something doesn’t make sense — stop and learn the concept before accepting it.
Before prompting: 2–5 minutes writing your own approach.
Answer four questions in a scratch file:
Then prompt with your plan as context. Compare the AI’s approach to yours.
The differences are where learning happens.
A lightweight DECISIONS.md in every project.
## 2026-03-15: Switched simulator optimizer from BFGS to L-BFGS-B
Reason: BFGS hit memory limits for n_params > 5000.
Trade-off: ~10× memory reduction; slower per-iteration; needs warm start.
Affected: src/optim.R, sim/runner.R, methods section §3.2.
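For concreteness, a minimal sketch of what the logged switch can look like with base R's optim(); the objective, gradient, and dimension are stand-ins, not the lab's code.

f  <- function(theta) sum((theta - 1)^2)                   # stand-in objective
gr <- function(theta) 2 * (theta - 1)                      # its gradient
theta0 <- rep(0, 1000)                                     # BFGS's dense p x p curvature matrix is what blows memory at large p
fit_bfgs  <- optim(theta0, f, gr, method = "BFGS")
fit_lbfgs <- optim(theta0, f, gr, method = "L-BFGS-B",
                   lower = -10, upper = 10)                 # limited-memory updates; box constraints allowed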
The point isn’t the artifact — it’s the act of writing.
Articulating the decision forces you to understand it. The log helps teammates, reviewers, and future-you reconstruct the reasoning.
Generalizes beyond code: research projects, course curricula, manuscripts.
After AI helps you build something, close the tool.
Empirically: the self-explanation effect (Chi et al., 1989, 1994) and retrieval practice (Roediger & Karpicke, 2006) — well-evidenced mechanisms by which articulating cements understanding.
Try to explain the implementation to:
Where you get stuck is where your understanding has gaps.
The thinking required to explain is the point.
A living document describing your project at a high level.
“Writing it yourself — not asking AI to generate it — is the point. The act of writing is the act of understanding.”
Rashid Lab Code Ownership Guide
For research, an analysis plan. For a course, a syllabus rationale. For a manuscript, a one-page summary.
deliberate friction
The kind of struggle a learner can resolve — not arbitrary difficulty.
When learning feels effortful and resolvable, it tends to stick.
When it feels smooth, the feeling can mislead us.
Bjork & Bjork, 2011 (desirable difficulties — resolvable, not arbitrary).
Kapur, 2008 / 2016 (productive failure, distinct from unproductive failure).
Deslauriers et al., PNAS 2019 (the feeling of learning ≠ learning).
The friction isn’t the problem AI solves.
It’s the asset AI is most likely to remove.
How does this differ for our research, our trainees, our students?
Four practices from the source guide that travel well:
The work isn’t to remove AI. It’s to design assignments and reviews where understanding still has to show.
Bastani et al., PNAS 2025. Field experiment, ~1,000 high-school math students.
One study, one population, one subject. Suggestive of a broader pattern; not yet established for graduate technical training.
Design matters more than the tool.
The advisor’s standing question:
“Which struggles do I let happen, and which do I remove?”
The job isn’t to remove AI from your trainees’ work. It’s to design the struggles AI cannot remove — the parts where understanding has to be built, not produced.
Concrete moves:
You are also someone whose intuition and judgment have to be maintained.
The same practices apply to your own work:
DECISIONS.md for your active research projects, not just your code.
The trainees you mentor will calibrate to the practices you actually keep.
For many in clinical trials, biopharma, and industry, the line isn’t personal — it’s drawn by SOPs, sponsor protocols, FDA, and QA.
The same practices apply, with an audit-trail layer:
The line is drawn by your sponsor, the FDA, and your QA group — not by you alone.
Where do you draw the line — and how does it shift across activities?
The columns are stable. The items shift by activity, by stakes, and by where you are in your career.
Where each item sits on this grid is yours to decide, not mine.
It will be different for you than for me.
Different for our trainees than for either of us.
Different for code than for analysis than for teaching.
The discipline isn’t where you draw the line.
It’s that you are deliberate about drawing one at all.
An honest self-check — and one for your group.
For yourself:
For your trainees:
For yourself, as advisor:
Diagnostic, not judgmental. If anything gets a “no” — the next slide is where to start.
If something on the previous slide got a “no” — start here, in order.
DECISIONS.md in your active project this week.
Adopt one. Add the next when the first feels routine.
A year in: you can defend your work without your notes; your trainees can defend theirs.
From the source guide’s quick-start checklist, lightly reordered.
GPS didn’t make people worse drivers,
but it did make many people worse at knowing where they are.
The people who still have a sense of direction are the ones who glance at the map occasionally, notice landmarks, and pay attention to the route — even while GPS is guiding them.
From the source guide.
Calibrating AI Use in Statistical Research, Training, and Practice
not because we don’t trust the GPS,
but because we still want to know where we are when we arrive.
Thank you · Questions welcome
naim@unc.edu · naimurashid.github.io · LinkedIn: Naim Rashid · GitHub: @naimurashid
StatsUp.AI Webinar · American Statistical Association