Calibrating AI Use in Statistical Research, Training, and Practice
Department of Biostatistics · UNC Chapel Hill
What’s the worry, and is it new?
Imagine a new hire - three months in - presenting their first analysis to senior leadership. They’ve used AI to help build it, reviewed every line, and can defend the strategy fully.
A senior colleague asks: “Why didn’t you consider [a different approach]?”
They are caught off guard. They reviewed every line of the AI’s suggestion, but never thought to question the line they started from.
The senior colleague notes the omission. The work moves forward anyway.
The deeper failure: they scrutinized every detail, but never the framing.
The risk isn’t just AI being wrong. It’s AI being good enough that it quietly substitutes for the judgment we’d otherwise build.
The goal isn’t to use AI less - it’s to use it in ways that keep our judgment in the loop.
Without that judgment, the work might still ship - but the work stops being ours.
Before going further - the position I am taking, and three I am not.
Not that the productivity gains from AI are illusory.
Not that trainees have to learn the way we did.
Not that speed and learning are opposites - some AI use is faster and deeper.
The narrower claim: some forms of speed remove the struggle that builds judgment - and the question is which ones.
Clues from prior tool shifts - to navigate this one with eyes open.
Lee et al. (CHI 2025, survey of 319 knowledge workers): higher GenAI confidence was associated with less self-reported critical thinking; higher task self-confidence with more.
The pattern is old. What’s new is the surface area - and that the work being offloaded is judgment, not arithmetic.
Bainbridge’s Ironies of Automation (1983) anticipated several automation failure modes that offer a useful but imperfect map for AI use today. The four below are a synthesis.
Skill degradation in the takeover
The better the automation, the rarer the cases that need a human - and the less practiced the human is when those cases arrive.
The training paradox
We train people to follow instructions, then ask them to provide judgment. The two skills aren’t the same.
The monitoring difficulty
Sustained attention is what humans are worst at - and it becomes their primary job once automation handles the routine work.
The transparency cost
For humans to oversee, the system must operate in ways humans can follow - even when that’s technically suboptimal.
The ironies don’t disappear on their own. They’re things you design around.
Designing around them is drawing the line.
I use these tools extensively. This isn’t a talk about abstinence - it’s a talk about calibration.
By now, the surface concerns are familiar to most of us.
I’m not here to repeat what you already discuss in your circles.
The quieter risks are different across our roles:
Where the line sits - what has to stay yours - is what comes next.
What must each of us actually hold onto, and why?
Every project has a part you can’t outsource. Skip it, and the work might still ship - but the work stops being yours.
The same sentence applies across:
The next three slides show - for each - where the line sits.
The part you don’t outsource is the chain of judgment.
AI accelerates execution at every stage. The judgment at every stage stays yours.
The failure mode that matters: plausibly wrong output that passes casual inspection.
Across 576,000 code samples from 16 LLMs, at least 5.2% of the packages referenced by commercial models did not exist; for open-source models, 21.7%.
Package hallucination is the visible version of a broader problem: output that looks executable but rests on a false premise. In statistics, the analogous failures are often conceptual rather than syntactic.
Spracklen et al., USENIX Security 2025.
Without your judgment, plausibly wrong passes.
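One cheap guard against the visible version of this failure is to check, before running AI-drafted code, that every import it names actually resolves. The sketch below is a minimal illustration in Python using only the standard library; the helper name `unresolved_imports` and the package names in the snippet are hypothetical. It only reports names that are not installed locally. It does not tell you whether a name exists on a package registry, and a name that does exist may still not be the package the model had in mind.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported in `source` that cannot be
    found in the current Python environment."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they refer to local files.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    # find_spec returns None when a top-level module is not installed.
    return sorted(n for n in names if importlib.util.find_spec(n) is None)


# Hypothetical AI-drafted snippet; the last two package names are invented.
ai_snippet = """
import json
import statistics
import totally_made_up_stats_pkg
from another_invented_module.core import fit
"""

print(unresolved_imports(ai_snippet))
```

A check like this catches the syntactic failure only. The conceptual failures described above pass it without complaint.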
Here’s what “plausibly wrong output” often looks like in statistics: a model that does what AI suggested but answers a different question.
| AI can produce | Human must decide | Failure if outsourced |
|---|---|---|
| Regression code | What estimand the question requires | Correct model, wrong question |
| Covariate-adjusted regression | Which covariates are pre- vs. post-treatment | Adjusted away the treatment effect by conditioning on a mediator |
| Imputation code | Which missingness mechanism is plausible (MCAR / MAR / MNAR) | Imputed under MCAR when MAR/MNAR is more credible; estimates can be biased and inference understates uncertainty |
| Simulation | What data-generating process matters | Simulation validates the wrong world |
The middle column is where the chain of judgment lives.
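The second row of the table can be made concrete with a small simulation. The sketch below is an illustrative toy in standard-library Python, not an analysis from the talk: the data-generating process, coefficients, sample size, and function names are all assumptions chosen for clarity. Treatment is randomized and affects the outcome only through a mediator, so the true total effect is 1.0 × 2.0 = 2.0.

```python
import random
from statistics import fmean


def cov(x, y):
    """Population covariance of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    return fmean([(a - mx) * (b - my) for a, b in zip(x, y)])


def simulate(n=20000, seed=1):
    """Assumed toy DGP: t -> m -> y, with no direct path from t to y."""
    rng = random.Random(seed)
    t = [rng.randint(0, 1) for _ in range(n)]      # randomized treatment
    m = [1.0 * ti + rng.gauss(0, 1) for ti in t]   # mediator (post-treatment)
    y = [2.0 * mi + rng.gauss(0, 1) for mi in m]   # outcome depends on m only
    return t, m, y


def unadjusted_effect(t, y):
    """Slope from regressing y on t alone."""
    return cov(t, y) / cov(t, t)


def effect_adjusted_for(t, m, y):
    """Coefficient on t from OLS of y on t and m (two-predictor closed form)."""
    num = cov(t, y) * cov(m, m) - cov(m, y) * cov(t, m)
    den = cov(t, t) * cov(m, m) - cov(t, m) ** 2
    return num / den


t, m, y = simulate()
total = unadjusted_effect(t, y)          # should land near the true total effect of 2.0
adjusted = effect_adjusted_for(t, m, y)  # should land near 0: the effect is adjusted away

print(f"unadjusted estimate:       {total:.2f}")
print(f"adjusted for the mediator: {adjusted:.2f}")
```

Both regressions run without error and both produce tidy output. Only the decision in the middle column, whether `m` is pre- or post-treatment, determines which one answers the question that was asked.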
The part you don’t outsource is the productive struggle.
“Conditions of learning that make performance improve rapidly often fail to support long-term retention and transfer.”
Bjork & Bjork, 2011. Applying this to AI-assisted learning is an extrapolation, not a finding.
The struggle preserves four kinds of effort that distinguish learning that lasts:
Varying conditions of practice
Practice across contexts, not one fixed setup. Default AI use can narrow the practice context.
Spacing
Practice over time, not cramming. Instant answer-seeking can collapse spacing.
Interleaving
Mixing topics, not blocked practice. Task-specific prompting can reinforce blocking.
Using tests as study events
Retrieving and generating, not reading. When AI supplies the answer before retrieval, neither happens.
Sources: Shea & Morgan 1979 (varied practice); Cepeda et al. 2006 (spacing); Rohrer & Taylor 2007 (interleaving); Roediger & Karpicke 2006 (testing). Extension to AI-assisted training in advanced statistics is plausible but not yet established empirically - treat as motivating, not proven.
Default AI use can bypass all four. Each one needs deliberate preservation.
The part you don’t outsource - as faculty or as a student - is the first encounter with a concept.
The first encounter strongly shapes the schema students build.
Concept formation is:
The first encounter should be designed, not defaulted. A student who outsources it may retain the answer without the schema. A course that lets generic AI output introduce the concept risks building the wrong schema before the instructor ever sees it.
AI increasingly helps with deciding too - surfacing options, finding gaps in reasoning, stress-testing your plan. The responsibility for the decision stays with you.
The line between them moves over time. The distinction does not.
A pause to take stock - after what we just covered.
Where do you feel most exposed in your AI-assisted work today?
What does keeping understanding alive look like in practice?
Then: what ties them together.
Treat AI output like a pull request from a capable collaborator who is not accountable for the final product.
Review every line. Question every design choice. Switch into reviewer mode:
Same posture for an AI-drafted methods sentence, simulation parameter, lecture example, or rubric.
If something doesn’t make sense - stop and learn the concept before accepting it.
Before prompting: 2–5 minutes writing your own approach.
Answer four questions in a scratch file:
Then prompt with your plan as context. Compare the AI’s approach to yours.
The differences are where learning happens.
An illustrative example - planning a power analysis for a follow-up study.
What I wrote (1 min)
Goal: sample size for a confirmatory follow-up.
My approach: use the pilot’s effect size in pwr.t.test().
Tricky: the pilot is small and may overestimate the effect. I want to plan for a deflated value too.
What AI suggested
pwr.t.test() with the pilot’s reported effect size, 80% power, two-sided α = 0.05.
A clean, correct answer to the literal question.
What I learned
AI answered the literal prompt - not the concern the third line raised.
The diff revealed that the real task wasn’t computing power. It was choosing the planning effect size: the pilot estimate, a clinically meaningful minimum, or a deflated (shrunk) estimate.
I left with a sensitivity analysis, not a single sample-size answer.
The diff is the lesson.
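The sensitivity analysis from this example might look like the sketch below. It is a rough illustration in standard-library Python, assuming a two-sample comparison with equal group sizes and the normal-approximation formula n per group = 2 × ((z₁₋α/₂ + z_power) / d)². The three effect sizes are hypothetical placeholders, not values from a real pilot, and `n_per_group` is an illustrative name. R’s pwr.t.test() works from the noncentral t distribution, so it will usually return a slightly larger n than this approximation.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison with standardized effect size d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical planning values, for illustration only.
planning_effects = {
    "pilot estimate": 0.60,
    "deflated pilot estimate": 0.40,
    "clinically meaningful minimum": 0.30,
}

for label, d in planning_effects.items():
    print(f"{label:32s} d = {d:.2f}  n per group = {n_per_group(d)}")
```

Because required n scales with 1/d², halving the planning effect size roughly quadruples the sample size. That is why the choice of d, not the power calculation itself, carries the judgment.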
A lightweight DECISIONS.md in every project.
## 2026-03-15: Switched simulator optimizer from BFGS to L-BFGS-B
Reason: BFGS hit memory limits for n_params > 5000.
Trade-off: ~10× memory reduction; slower per-iteration; needs warm start.
Affected: src/optim.R, sim/runner.R, methods section §3.2.
The point isn’t the artifact - it’s the act of writing.
Articulating the decision forces you to understand it. The log helps teammates, reviewers, and future-you reconstruct the reasoning.
Generalizes beyond code: research projects, analysis plans, course curricula, manuscripts. Write your own project map - not AI’s.
After AI helps you build something, close the tool.
Mechanism: the self-explanation effect (Chi et al. 1989, 1994) and retrieval practice (Roediger & Karpicke 2006).
Try to explain the implementation to:
Where you get stuck is where your understanding has gaps.
Good explain-back, in practice - you can: (a) explain the goal in statistical terms, (b) name at least one reasonable alternative, (c) predict how the result could fail.
The thinking required to explain is the point.
Before accepting an AI-assisted result, argue against it.
The questions you ask before submitting are the questions a hostile reviewer asks first.
For regulated work - these become a sponsor-reviewable checklist.
Deliberate friction
Some friction is where understanding gets built.
AI removes friction indiscriminately - the discipline is preserving the kind that matters.
How does the line differ between our research, our trainees, and our students?
Four practices from the source guide that travel well:
The work isn’t to remove AI. It’s to design assignments and reviews where understanding still has to show.
Bastani et al., PNAS 2025. Field experiment, ~1,000 high-school math students.
Three randomized groups: vanilla GPT, tutor-prompted GPT, no-AI control. Same underlying model, different instructional design.
One study, one population, one subject. Suggestive of a broader pattern; not yet established for graduate technical training.
The same tool can have very different learning consequences depending on design.
The advisor’s standing question:
“Which struggles do I let happen, and which do I remove?”
The job isn’t to remove AI from your trainees’ work. It’s to design the struggles AI should not remove - the parts where understanding has to be built, not produced.
Concrete moves:
You are also someone whose intuition and judgment have to be maintained.
The same practices apply to your own work:
DECISIONS.md for your active research projects, not just your code.
The trainees you mentor will calibrate to the practices you actually keep.
Where does the line sit - and which way is it shifting for you?
The columns are stable. The items shift - by activity, by stakes, by career stage.
The spectrum tells you what to delegate. Modes tell you how to talk to AI about whatever you delegate.
Default AI is executor. Tutor and critic require deliberate prompting.
Critic-mode output is itself delegated judgment - it needs the same review as code.
The line between “hands-off” and “hands-on” shifts along three dimensions.
| Shifts with | Tightens (more hands-on) | Loosens (more hands-off) |
|---|---|---|
| Context | First encounter in teaching · method choice in research · productive struggle in training | Refinement, formatting, scaffolding |
| Career stage | New researcher building judgment from scratch | Senior researcher whose intuition is already load-bearing |
| Stakes | Regulated, defended, published | Exploratory, scratch work, drafts |
Three dials. Adjust them for your actual situation.
When in doubt about where the line should sit, ask:
Can you still answer the fundamental questions about your own work?
Where you can still answer - without the model open - is where you’re still in the driver’s seat.
For sponsor-reviewable analyses, the practices become artifacts.
AI-use disclosure in methods sections or SAPs:
“AI tools assisted with [code, drafts, literature triage]; methodological choices, estimands, assumptions, and result interpretation remain the analyst’s responsibility.”
Decision log alongside the SAP - choices tried, accepted, rejected, with AI-assisted steps flagged. Audit-trail evidence that judgment was applied, not delegated.
Sponsor-reviewable red-team: estimand specified before AI was consulted; pre- vs. post-treatment covariates verified; missingness mechanism named independently; simulation DGP validated.
In regulated work the line isn’t just where you draw it - it’s what you can show you drew.
Now your turn. The check looks different depending on your role.
As advisor / senior researcher:
As trainee / early-career:
Diagnostic, not judgmental. If anything gets a “no” - the next slide is where to start.
If something on the previous slide got a “no” - start here, in order.
DECISIONS.md in your active project this week.
Adopt one. Add the next when the first feels routine.
A year in: you can defend your work without your notes; your trainees can defend theirs.
From the source guide’s quick-start checklist, lightly reordered.
GPS didn’t make people worse drivers, but it did make many people worse at knowing where they are.
The people who still have a sense of direction are the ones who glance at the map occasionally, notice landmarks, and pay attention to the route - even while GPS is guiding them.
From the source guide.
This week - pick one place where AI has made your work smoother.
Add back one piece of deliberate friction - not because we don’t trust the GPS, but because we still want to know where we are when we arrive.
Thank you · Questions welcome
Continue the discussion: linkedin.com/in/naimurashid · naim@unc.edu
Companion repo - standalone templates, checklists, prompts
The repo includes the handout PDF (deep-dive guide), plus standalone templates and checklists:
DECISIONS.md template · pre-prompt pause + red-team checklists · green/yellow/red AI-use policy · tutor/critic prompts (incl. OLS/GLS example) · Red Flags self-assessment
Cited work
StatsUp.AI Webinar · American Statistical Association