Staying in the Driver’s Seat

Calibrating AI Use in Statistical Research, Training, and Practice

Naim Rashid, PhD

Department of Biostatistics · UNC Chapel Hill

May 26, 2026

naimurashid.github.io

What’s at Stake

What’s the worry, and is it new?

A failure mode

Imagine a new hire - three months in - presenting their first analysis to senior leadership. They’ve used AI to help build it, reviewed every line, and can defend the strategy fully.

A senior colleague asks: “Why didn’t you consider [a different approach]?”

They are caught off guard. They had reviewed every line of the AI’s suggestion. They had not thought to question the line they started from.

The senior colleague notes the omission. The work moves forward anyway.

The deeper failure: scrutinized every detail, never the framing.

The core tension

The risk isn’t just AI being wrong. It’s AI being good enough that it quietly substitutes for the judgment we’d otherwise build.

The goal isn’t to use AI less - it’s to use it in ways that keep our judgment in the loop.

Without that judgment, the work might still ship - but the work stops being ours.

What I am not arguing

Before going further - the position I am taking, and three I am not.

Not that the productivity gains from AI are illusory.

Not that trainees have to learn the way we did.

Not that speed and learning are opposites - some AI use is faster and deeper.

The narrower claim: some forms of speed remove the struggle that builds judgment - and the question is which ones.

This has happened before, and it will happen again

Clues from prior tool shifts - to navigate this one with eyes open.

Calculator (mental arithmetic) → GPS (spatial memory) → Search (factual recall) → GenAI (judgment?)

Lee et al. (CHI 2025, survey of 319 knowledge workers): higher GenAI confidence was associated with less self-reported critical thinking; higher task self-confidence with more.

The pattern is old. What’s new is the surface area - and that the work being offloaded is judgment, not arithmetic.

Sound familiar?

Bainbridge’s Ironies of Automation (1983) anticipated several automation failure modes that offer a useful but imperfect map for AI use today. The four below are a synthesis.

Skill degradation in the takeover

The better the automation, the rarer the cases that need a human - and the less practiced the human is when those cases arrive.

The training paradox

We train people to follow instructions, then ask them to provide judgment. The two skills aren’t the same.

The monitoring difficulty

Sustained attention is what humans are worst at - and it becomes their primary job once automation handles the routine work.

The transparency cost

For humans to oversee, the system must operate in ways humans can follow - even when that’s technically suboptimal.

The ironies don’t disappear on their own. They’re things you design around.

Designing around them is drawing the line.

The quieter cost

I use these tools extensively. This isn’t a talk about abstinence - it’s a talk about calibration.

By now, the surface concerns are familiar to most of us.

LLMs hallucinate.
LLM benchmarks rarely report uncertainty, dependence, or deployment shift.
Fabricated citations can survive expert review.

I’m not here to repeat what you already discuss in your circles.

The quieter risks are different across our roles:

For research - rigor can become harder to verify.
For trainees - judgment may be practiced less often.
For students - statistical literacy can become more fragile.

Where the line sits - what has to stay yours - is what comes next.

What Stays Yours

What must each of us actually hold onto, and why?

A part you don’t outsource

Every project has a part you can’t outsource. Skip it, and the work might still ship - but the work stops being yours.

The same sentence applies across:

our own research
the doctoral students we train
the students we teach

The next three slides show - for each - where the line sits.

In your own research

The part you don’t outsource is the chain of judgment.

Frame the question → Choose methods → Verify output → Interpret result → Defend the work

AI accelerates execution at every stage. The judgment at every stage stays yours.

The failure mode that matters: plausibly wrong output that passes casual inspection.

576,000 code samples, 16 LLMs: commercial models reference nonexistent packages 5.2% of the time; open-source models, 21.7%.

Package hallucination is the visible version of a broader problem: output that looks executable but rests on a false premise. In statistics, the analogous failures are often conceptual rather than syntactic.

Spracklen et al., USENIX Security 2025.

Without your judgment, plausibly wrong passes.
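One narrow piece of "verify output" can be sketched in code. The example below is a minimal Python sketch, not a method from the cited study: it lists the modules an AI-drafted script imports and flags any that do not resolve in the current environment. The function names are hypothetical. A module that resolves locally is not thereby trustworthy, and a flagged name still has to be checked by hand against the real package index before anything is installed.

```python
import ast
import importlib.util


def imported_top_level_modules(source):
    """Collect top-level module names from the import statements in source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports; they refer to the project itself.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def unresolved_modules(source):
    """Return imported modules that cannot be found in this environment."""
    return sorted(
        name
        for name in imported_top_level_modules(source)
        if importlib.util.find_spec(name) is None
    )


# Hypothetical AI-drafted snippet; the second import is deliberately made up.
draft = "import json\nimport surely_not_a_real_package_xyz123\n"
print(unresolved_modules(draft))
```

This catches only the syntactic failure. The conceptual failures on the next slide have no equivalent check.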

Correct code, wrong question

Here’s what “plausibly wrong output” often looks like in statistics: a model that does what AI suggested but answers a different question.

AI can produce	Human must decide	Failure if outsourced
Regression code	What estimand the question requires	Correct model, wrong question
Covariate-adjusted regression	Which covariates are pre- vs. post-treatment	Adjusted away the treatment effect by conditioning on a mediator
Imputation code	Which missingness mechanism is plausible (MCAR / MAR / MNAR)	Imputed under MCAR when MAR/MNAR is more credible; estimates can be biased and inference understates uncertainty
Simulation	What data-generating process matters	Simulation validates the wrong world

The middle column is where the chain of judgment lives.
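The mediator row in the table can be made concrete with a small simulation. This is an illustrative Python sketch under assumed values, not an analysis from the talk: treatment is randomized, it shifts a mediator by 1, and the outcome depends on the mediator only (coefficient 2), so the true total effect is 2. The data-generating process, sample size, and function names are all assumptions chosen for the example.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x, with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals from regressing y on x, with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def simulate_mediator(n=20000, seed=1):
    rng = random.Random(seed)
    t = [rng.choice([0.0, 1.0]) for _ in range(n)]   # randomized treatment
    m = [ti + rng.gauss(0, 1) for ti in t]           # mediator, caused by treatment
    y = [2.0 * mi + rng.gauss(0, 1) for mi in m]     # outcome, affected only via m
    total = slope(t, y)                              # y ~ t: total effect
    # Coefficient on t in y ~ t + m, via the Frisch-Waugh-Lovell theorem:
    # partial m out of both t and y, then regress residual on residual.
    adjusted = slope(residuals(m, t), residuals(m, y))
    return total, adjusted


total, adjusted = simulate_mediator()
print(f"unadjusted: {total:.2f}   adjusted for mediator: {adjusted:.2f}")
```

Both regressions are correct code. Only the unadjusted one answers the total-effect question, because the adjusted coefficient targets a direct effect that is zero by construction here.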

In doctoral training

The part you don’t outsource is the productive struggle.

“Conditions of learning that make performance improve rapidly often fail to support long-term retention and transfer.”

Bjork & Bjork, 2011. Applied to AI-assisted learning: extrapolation, not finding.

The struggle preserves four kinds of effort that distinguish learning that lasts:

Varying conditions of practice

Practice across contexts, not one fixed setup. Default AI use can narrow the practice context.

Spacing

Practice over time, not cramming. Instant answer-seeking can collapse spacing.

Interleaving

Mixing topics, not blocked practice. Task-specific prompting can reinforce blocking.

Using tests as study events

Retrieving and generating, not reading. When AI supplies the answer before retrieval, neither happens.

Sources: Shea & Morgan 1979 (varied practice); Cepeda et al. 2006 (spacing); Rohrer & Taylor 2007 (interleaving); Roediger & Karpicke 2006 (testing). Extension to AI-assisted training in advanced statistics is plausible but not yet established empirically - treat as motivating, not proven.

Default AI use can bypass all four. Each one needs deliberate preservation.

In classroom teaching

The part you don’t outsource - as faculty or as a student - is the first encounter with a concept.

The first encounter strongly shapes the schema students build.

Concept formation is:

the most contextual phase of learning - it depends on what just happened in this room, with this cohort, this semester
the hardest to automate well, for that reason
where the broader statistical literacy of our students is determined

The first encounter should be designed, not defaulted. A student who outsources it may retain the answer without the schema. A course that lets generic AI output introduce the concept risks building the wrong schema before the instructor ever sees it.

Execution moves. Responsibility doesn’t.

What AI can assist with

Execution, options, critique, drafting, implementation.

What stays ours

Accountability - for what was chosen, why it was chosen, and what the result means.

AI increasingly helps with deciding too - surfacing options, finding gaps in reasoning, stress-testing your plan. The responsibility for the decision stays with you.

The line between them moves over time. The distinction does not.

A quick calibration check

A pause to take stock - after what we just covered.

Where do you feel most exposed in your AI-assisted work today?

A. Research rigor
B. The trainees I mentor
C. The students I teach
D. Still figuring out where the line should sit

slido.com/1454077

Scan or visit the URL - your pick.

How We Keep It

What does keeping understanding alive look like in practice?

Five practices ahead

Read everything that ships
The pre-prompt pause (with a worked example)
The decision log
The explain-it-back test
Red-team your analysis

Then: what ties them together.

1. Read everything that ships

Treat AI output like a pull request from a capable collaborator who is not accountable for the final product.

Review every line. Question every design choice. Switch into reviewer mode:

Could I have written this myself? If not, what concept am I missing?
Is this the approach I would have chosen? Why or why not?
What does this assume about inputs, environment, or state?

Same posture for an AI-drafted methods sentence, simulation parameter, lecture example, or rubric.

If something doesn’t make sense - stop and learn the concept before accepting it.

2. The pre-prompt pause

Before prompting: 2–5 minutes writing your own approach.

Answer four questions in a scratch file:

What am I trying to accomplish? (the goal, not the implementation)
What’s my best guess for how to do it? (rough or wrong is fine)
What are the tricky parts? (edge cases, assumptions, what the AI might not suggest)
How will I know it worked? (what output am I expecting; how will I verify)

Then prompt with your plan as context. Compare the AI’s approach to yours.

The differences are where learning happens.

The pause in action

An illustrative example - planning a power analysis for a follow-up study.

What I wrote (1 min)

Goal: sample size for a confirmatory follow-up.

My approach: use the pilot’s effect size in pwr.t.test().

Tricky: the pilot is small and may overestimate the effect. I want to plan for a deflated value too.

What AI suggested

pwr.t.test() with the pilot’s reported effect size, 80% power, two-sided α = 0.05.

A clean, correct answer to the literal question.

What I learned

AI answered the literal prompt - not the concern the third line raised.

The diff revealed that the real task wasn’t computing power. It was choosing the planning effect size: the pilot estimate, a clinically meaningful minimum, or a deflated or shrunk effect.

I left with a sensitivity analysis, not a single sample-size answer.

The diff is the lesson.
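The sensitivity analysis can be sketched as a short loop over candidate planning effect sizes. This minimal Python sketch uses the normal-approximation formula n per group ≈ 2((z_{1-α/2} + z_{power}) / d)² for a two-sided, two-sample comparison. That is an approximation, not the noncentral-t calculation pwr.t.test() performs, so it can come out about one participant per group lower. The two-sample design and the three effect sizes are hypothetical values for illustration, not numbers from the example above.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    d is the standardized effect size (Cohen's d). Normal approximation only.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical planning values, for illustration only.
scenarios = {
    "pilot estimate": 0.50,
    "deflated pilot estimate": 0.35,
    "clinically meaningful minimum": 0.25,
}

for label, d in scenarios.items():
    print(f"{label}: d = {d:.2f}, n per group ≈ {n_per_group(d)}")
```

Because the required n scales with 1/d², halving the planning effect size roughly quadruples the sample size. That is why the choice of d, not the power computation, carries the judgment.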

3. The decision log

A lightweight DECISIONS.md in every project.

## 2026-03-15: Switched simulator optimizer from BFGS to L-BFGS-B

Reason: BFGS hit memory limits for n_params > 5000.
Trade-off: ~10× memory reduction; slower per-iteration; needs warm start.
Affected: src/optim.R, sim/runner.R, methods section §3.2.

The point isn’t the artifact - it’s the act of writing.

Articulating the decision forces you to understand it. The log helps teammates, reviewers, and future-you reconstruct the reasoning.

Generalizes beyond code: research projects, analysis plans, course curricula, manuscripts. Write your own project map - not AI’s.
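To keep entries consistent without automating the thinking, the layout of the example entry can be wrapped in a small helper. This is a minimal Python sketch with hypothetical function names. It only formats and appends; the reason and trade-off text are still written by hand, which is where the value lies.

```python
from datetime import date
from pathlib import Path


def format_decision(day, title, reason, trade_off, affected):
    """Render one log entry in the same layout as the example entry."""
    return (
        f"## {day.isoformat()}: {title}\n\n"
        f"Reason: {reason}\n"
        f"Trade-off: {trade_off}\n"
        f"Affected: {', '.join(affected)}\n"
    )


def append_decision(path, entry):
    """Append an entry to the log file, creating the file if it is missing."""
    log = Path(path)
    # Separate entries with a blank line once the file has content.
    prefix = "\n" if log.exists() and log.stat().st_size > 0 else ""
    with log.open("a", encoding="utf-8") as fh:
        fh.write(prefix + entry)


entry = format_decision(
    date(2026, 3, 15),
    "Switched simulator optimizer from BFGS to L-BFGS-B",
    "BFGS hit memory limits for n_params > 5000.",
    "~10× memory reduction; slower per-iteration; needs warm start.",
    ["src/optim.R", "sim/runner.R", "methods section §3.2"],
)
print(entry)
```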

4. The explain-it-back test

After AI helps you build something, close the tool.

Mechanism: the self-explanation effect (Chi et al. 1989, 1994) and retrieval practice (Roediger & Karpicke 2006).

Try to explain the implementation to:

a rubber duck
a colleague
a voice memo on your phone

Where you get stuck is where your understanding has gaps.

Good explain-back, in practice - you can: (a) explain the goal in statistical terms, (b) name at least one reasonable alternative, (c) predict how the result could fail.

The thinking required to explain is the point.

5. Red-team your analysis

Before accepting an AI-assisted result, argue against it.

What estimand did this actually answer?
What diagnostic would make me distrust this?
What did the AI not suggest?
What would I do if the result were reversed?

The questions you ask before submitting are the questions a hostile reviewer asks first.

For regulated work - these become a sponsor-reviewable checklist.

What ties these practices together

1. Read everything that ships · 2. Pre-prompt pause · 3. Decision log · 4. Explain-it-back · 5. Red-team

deliberate friction

Some friction is where understanding gets built.
AI removes friction indiscriminately - the discipline is preserving the kind that matters.

Across Our Roles

How does the line differ between our research, our trainees, and our students?

For educators and team leads

Four practices from the source guide that travel well:

The Two-Pass Assignment - attempt without AI; iterate with AI; submit both with reflection.
Explain-Your-Work Assessments - supplement the submission with an oral or written walkthrough.
Debugging Exercises - intentionally broken analyses or code; no AI tools allowed.
Review Culture - AI-generated work receives the same (or more) scrutiny as work the trainee produced themselves. “Can you walk me through this?” If the author can’t walk through it, the work isn’t ready.

The work isn’t to remove AI. It’s to design assignments and reviews where understanding still has to show.

Evidence that design matters

Bastani et al., PNAS 2025. Field experiment, ~1,000 high-school math students.

Three randomized groups: vanilla GPT, tutor-prompted GPT, no-AI control. Same underlying model, different instructional design.

Vanilla GPT

Gave direct answers on request.

During practice: better than no-AI controls.
Post-test (no AI): 17% worse than no-AI controls.

Tutor-prompted GPT

Asked guiding questions; gave hints instead of answers.

During practice: better than no-AI controls.
Post-test (no AI): no detectable underperformance.

One study, one population, one subject. Suggestive of a broader pattern; not yet established for graduate technical training.

The same tool can have very different learning consequences depending on design.

For doctoral training

The advisor’s standing question:

“Which struggles do I let happen, and which do I remove?”

The job isn’t to remove AI from your trainees’ work. It’s to design the struggles AI should not remove - the parts where understanding has to be built, not produced.

Concrete moves:

Require the unplugged version first on key derivations and debugging.
Routine ask for any AI-assisted work: “show me one place AI helped, one suggestion you rejected, one thing you verified independently.” (Same conversation works senior-to-junior in industry teams.)
Make defending the decision - not producing the output - the bar for sign-off.

For your own research

You are also someone whose intuition and judgment have to be maintained.

The same practices apply to your own work:

A DECISIONS.md for your active research projects, not just your code.
Unplugged hours: 90 minutes a week without AI assistance, on real problems.
Periodic explain-it-back on your own analyses, before you teach them or submit them.

The trainees you mentor will calibrate to the practices you actually keep.

The spectrum of delegation

Where does the line sit - and which way is it shifting for you?

Usually safe to delegate

Boilerplate & scaffolding
Syntax & idioms
Format conversions
Search-query brainstorming

Delegate with review

Analysis code
Database queries
Methods-section drafts
Prose smoothing
Literature summaries
Worked examples & rubrics

Stay hands-on

Architecture & design
Methodological choices
Core algorithms
Interpretation & defense
Diagnosing the room

The columns are stable. The items shift - by activity, by stakes, by career stage.

Three modes of AI use

The spectrum tells you what to delegate. Modes tell you how to talk to AI about whatever you delegate.

Executor

Writes the work. Code, drafts, refactoring, format conversions.

"Write a function that…"

Tutor

Teaches through the work. Explains, asks questions, gives hints, quizzes.

"Quiz me on the method before helping me implement it."

Critic

Challenges the work. Argues against, names alternatives, stress-tests assumptions.

"Argue against this model choice. List two alternatives."

Default AI is executor. Tutor and critic require deliberate prompting.

Same question - handling heteroskedasticity in a regression:

Executor: "Fit OLS with robust standard errors and show GLS as a sensitivity analysis."

Tutor: "Quiz me on the assumptions behind OLS with robust SEs vs GLS before writing code."

Critic: "Argue against GLS here. What residual pattern would make OLS with robust SEs preferable?"

Critic-mode output is itself delegated judgment - it needs the same review as code.
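Whichever mode produced the suggestion, the claim behind it can be checked directly. The sketch below is an illustrative Python simulation with an assumed data-generating process, not an analysis from the talk. The error standard deviation grows with |x|, and the classical OLS standard error for the slope is compared with the heteroskedasticity-robust (HC0, "sandwich") version. The function names and all parameter values are hypothetical.

```python
import math
import random


def ols_slope_with_ses(x, y):
    """Simple-regression slope with classical and HC0 robust standard errors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [xi - mx for xi in x]
    sxx = sum(d * d for d in dx)
    b = sum(d * (yi - my) for d, yi in zip(dx, y)) / sxx
    a = my - b * mx
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    # Classical SE assumes one common error variance.
    classical = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # HC0 SE weights each squared residual by its squared centered x.
    robust = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return b, classical, robust


def simulate_heteroskedastic(n=5000, seed=7):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    # Error spread grows with |x|: the constant-variance assumption fails.
    y = [1.0 + 2.0 * xi + abs(xi) * rng.gauss(0, 1) for xi in x]
    return ols_slope_with_ses(x, y)


b, se_classical, se_robust = simulate_heteroskedastic()
print(f"slope {b:.2f}  classical SE {se_classical:.4f}  robust SE {se_robust:.4f}")
```

In this setup the slope estimate is fine, but the classical standard error is too small. Whether that pattern holds in a real dataset is what the residual diagnostics have to show.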

Three dials, one line

The line between “hands-off” and “hands-on” shifts along three dimensions.

Shifts with	Tightens (more hands-on)	Loosens (more hands-off)
Context	First encounter in teaching · method choice in research · productive struggle in training	Refinement, formatting, scaffolding
Career stage	New researcher building judgment from scratch	Senior researcher whose intuition is already load-bearing
Stakes	Regulated, defended, published	Exploratory, scratch work, drafts

Three dials. Adjust them for your actual situation.

The fundamental questions

When in doubt about where the line should sit, ask:

Can you still answer the fundamental questions about your own work?

What is the estimand or target of inference?
What assumptions make this method answer that question?
What diagnostic or sensitivity analysis would make me distrust the result?
If this piece fails, what breaks downstream?

Where you can still answer - without the model open - is where you’re still in the driver’s seat.

What this looks like in regulated work

For sponsor-reviewable analyses, the practices become artifacts.

AI-use disclosure in methods sections or SAPs:

“AI tools assisted with [code, drafts, literature triage]; methodological choices, estimands, assumptions, and result interpretation remain the analyst’s responsibility.”

Decision log alongside the SAP - choices tried, accepted, rejected, with AI-assisted steps flagged. Audit-trail evidence that judgment was applied, not delegated.

Sponsor-reviewable red-team: estimand specified before AI was consulted; pre- vs. post-treatment covariates verified; missingness mechanism named independently; simulation DGP validated.

In regulated work the line isn’t just where you draw it - it’s what you can show you drew.

Calibration questions

Now your turn. The check looks different depending on your role.

Diagnostic, not judgmental. If anything gets a “no” - the next slide is where to start.

A place to start

If something on the previous slide got a “no” - start here, in order.

Open a DECISIONS.md in your active project this week.
Before your next prompt, write your own approach first.
Schedule one 90-minute unplugged session.
Line-by-line review on the next AI-generated chunk.
Explain-it-back on the last thing you built with AI help.

Adopt one. Add the next when the first feels routine.

A year in: you can defend your work without your notes; your trainees can defend theirs.

From the source guide’s quick-start checklist, lightly reordered.

Still knowing where you are

GPS didn’t make people worse drivers, but it did make many people worse at knowing where they are.

The people who still have a sense of direction are the ones who glance at the map occasionally, notice landmarks, and pay attention to the route - even while GPS is guiding them.

From the source guide.

Stay in the Driver’s Seat

This week - pick one place where AI has made your work smoother.
Add back one piece of deliberate friction.

Not because we don’t trust the GPS, but because we still want to know where we are when we arrive.

Thank you · Questions welcome

Continue the discussion: linkedin.com/in/naimurashid · naim@unc.edu

Resources & references

Companion repo - standalone templates, checklists, prompts

github.com/naimurashid/staying-in-drivers-seat-resources

The repo includes the handout PDF (deep-dive guide), plus standalone templates and checklists:

DECISIONS.md template · pre-prompt pause + red-team checklists · green/yellow/red AI-use policy · tutor/critic prompts (incl. OLS/GLS example) · Red Flags self-assessment

Cited work

Shea & Morgan (1979). Variable practice. J Exp Psychol: Hum Learn Mem 5(2).
Bainbridge (1983). Ironies of Automation. Automatica 19(6).
Cepeda et al. (2006). Spacing meta-analysis. Psychological Bulletin 132(3).
Roediger & Karpicke (2006). Testing effect. Psychological Science 17(3).
Rohrer & Taylor (2007). Interleaving in math. Instructional Science 35.
Sparrow, Liu & Wegner (2011). Google effects on memory. Science 333.
Bjork & Bjork (2011). Desirable difficulties. In Psychology and the Real World.
Risko & Gilbert (2016). Cognitive offloading. Trends in Cognitive Sciences 20.
Deslauriers et al. (2019). Feeling of learning. PNAS 116(39).
Bastani et al. (2025). Generative AI guardrails. PNAS 122(26).
Spracklen et al. (2025). Package hallucinations. USENIX Security.
Lee et al. (2025). GenAI & critical thinking. CHI.