Tony Myers · Birmingham Newman University
A practical, secure, and methodologically sceptical guide for researchers who want AI to accelerate work without outsourcing judgement. This guide is designed to be useful whether or not you attended the accompanying workshop — every claim stands on its own sources and reasoning.
Use models for generation, critique, transformation, and organisation. Scholarly databases, primary sources, code, and expert judgement still carry the evidential weight.
A secure or local model can protect unpublished work, but it cannot make claims accountable. The researcher remains responsible for every interpretation, citation, and disclosure.
Hallucination, positional bias, and correlated model errors are not rare accidents. Build checks into prompts, workflows, and publication decisions from the beginning.
Research lifecycle
The question is not whether AI can be inserted into research. It is whether the contribution is defensible, auditable, and proportionate to the evidence. The stages below show where AI can add genuine value — and where the risks concentrate.
Each pipeline stage has a productive AI role and a clear boundary. The principle throughout: deploy AI for generation, critique, transformation, and organisation. Never use it as a final authority.
| Pipeline stage | Best AI role | Do not use for |
|---|---|---|
| Idea conception | Critical sparring partner | Sole creator of hypotheses |
| Literature search | Search-term planner | Citation source or factual database |
| Data collection | Wording and bias checker | Generating synthetic data |
| Data analysis | Code drafter and assumption checker | Analytical authority |
| Writing | Clarity and structure editor | Primary author |
| Peer review | Reviewer-simulation and response organiser | Final reviewer or evaluator |
A voice-enabled qualitative interview bot can support data collection by conducting semi-structured interviews using natural speech. The example below uses DeepSeek for completions, ElevenLabs for text-to-speech, and Groq Whisper for speech-to-text. This approach has been used in practice — for example, in a pilot study capturing participant experiences of rugby taster sessions, with full ethical approval and a participant information sheet (OPIS) in place. As with any data collection method, standard ethical requirements apply: institutional ethical approval, informed consent, and appropriate data handling must all be secured before deployment.
Demo: Voice-enabled interview bot using DeepSeek, ElevenLabs TTS, and Groq Whisper STT.
View the GitHub example
Use AI to draft deterministic R or Python code, not to calculate complex statistics inside a free-text chat window. A sandboxed environment runs the generated code in isolation, which protects the host system and makes the analysis reproducible. Julius AI provides one such sandbox, with model selection, a connected code environment, and R/Python support.
Demo: Using Julius AI's sandboxed R environment with model selection and connected data.
Julius AI for data workflows
Interpretative Phenomenological Analysis (IPA) and other qualitative frameworks can benefit from AI as an analytical companion, not as a replacement for the researcher's interpretive labour. The tool below accepts a research question, researcher reflexive statement, and interview transcripts, then generates exploratory themes using DeepSeek. The reflexive statement is threaded through the analysis to surface the researcher's own biases and assumptions, maintaining the double hermeneutic that IPA demands.
Demo: IPA Analysis Tool with reflexive statement support and DeepSeek completions.
View the GitHub repository
NotebookLM-style systems are strongest when the source set is curated and narrow, typically 5 to 15 closely related sources addressing a defined question. Avoid dumping an entire field into one mega-notebook and treating the summary as synthesis. The model cannot perform systematic review methodology; it can help you navigate and interrogate a pre-selected corpus.
NotebookLM help
A locally-hosted review pipeline can provide confidential feedback on draft manuscripts without exposing unpublished work to cloud APIs. The example below runs on Apple Silicon using MLX with quantised open-weight models (Qwen 3.6, Gemma 4), providing structured methodological critique via a FastAPI web interface with separate Chat and Review modes. This solves the confidentiality problem, because the manuscript never leaves the machine, but it does not solve the accountability problem. The researcher must still evaluate every critique the model produces.
Running multiple models against the same manuscript is instructive: each model tends to identify different genuine weaknesses while also introducing its own distinct factual errors. One model may correctly flag an inferential gap between large F-values and trivial adjusted R² differences; another may erroneously describe a flexible distribution method as "distribution-free." This pattern — broader critical coverage at the cost of model-specific errors — reinforces why multi-model comparison and researcher verification are both necessary.
Demo: Local LLM manuscript reviewer with model switching (Qwen 3.6 35B, Gemma 4 26B).
Architecture-driven prompting
Transformer-based systems generate likely continuations from the context window. That makes prompting a methodological act: you are constraining the probability distribution, not asking an oracle.
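Temperature: lower values (0.0–0.3) give more consistent, repeatable outputs; higher values (0.7–1.0) give more divergent idea generation. A low temperature makes the output more stable, not more accurate.
Top-p (nucleus sampling): limits sampling to the smallest set of most likely tokens whose cumulative probability reaches the threshold. A value of 0.9 means the model samples only from the tokens that together make up the top 90% of the probability mass.
Top-k: hard-cuts the candidate token pool to a fixed size. A value of 40 means only the 40 most probable next tokens are considered.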
These parameters interact. Setting temperature to 0 makes top-p and top-k irrelevant (greedy decoding). In practice, adjust temperature first and leave the others at defaults unless you have a specific reason to constrain further. Most chat platforms do not expose these controls.
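The interaction between the three controls can be made concrete with a toy next-token distribution. The sketch below is illustrative only: the function name `next_token_distribution` and the example logits are assumptions made for this guide, and real inference libraries differ in the order in which they apply the filters and in how they renormalise.

```python
import math

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return the renormalised sampling distribution as {token_index: probability}.

    `logits` is a list of raw scores, one per candidate token (toy values here).
    """
    if temperature == 0:
        # Greedy decoding: all probability goes to the highest-scoring token,
        # so top-k and top-p have nothing left to filter.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = {i: w / total for i, w in enumerate(weights)}

    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    if top_k is not None:
        # Keep only the k most probable tokens.
        ranked = ranked[:top_k]

    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalise the surviving tokens so they sum to 1.
    mass = sum(p for _, p in ranked)
    return {token: p / mass for token, p in ranked}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # invented scores for four candidate tokens
    for settings in ({"temperature": 0}, {"temperature": 0.3}, {"top_k": 2}, {"top_p": 0.9}):
        print(settings, next_token_distribution(toy_logits, **settings))
```

Running it with different settings shows why temperature is the first control to adjust: it reshapes the whole distribution, whereas top-k and top-p only trim its tail.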
This framework structures prompts so that the model receives enough context to generate useful output while the constraints prevent the most common failure modes: hallucinated citations, overclaiming, and unsupported causal language.
Suppose you paste a draft results paragraph that reads: "Sprint training significantly improved VO2max (p = 0.03), demonstrating that high-intensity intervals are superior to steady-state training for aerobic adaptation."
A well-configured model should return something like this:
| Field | Model output |
|---|---|
| Claim | Sprint training is "superior" for aerobic adaptation |
| Support status | Partially supported |
| Evidence | p = 0.03 for within-group change in VO2max |
| Concern | "Superior" implies a between-group comparison, but only a within-group p-value is supplied. No effect size, no confidence interval, no comparison condition reported. "Significantly" conflates statistical and practical significance. |
| Required check | Report the between-group comparison statistic and effect size. Check whether the study design supports a causal claim (randomised? controlled?). |
| Conservative revision | "Sprint training was associated with a pre-to-post increase in VO2max (mean difference = X, 95% CI [Y, Z], p = 0.03). Comparison with steady-state training requires the between-group analysis reported in Table N." |
The model catches the overclaiming, flags the missing effect size, and distinguishes within-group from between-group evidence. This is the kind of output the 7-part prompt is designed to elicit. If the model instead validates the original claim, the prompt constraints need tightening or the model is not suitable for this task.
Verification
Agreement between polished outputs is not the same thing as independent corroboration. Two models trained on overlapping data can produce the same confident, wrong answer. Verification needs its own workflow, separate from generation.
If a generative large language model cannot reliably distinguish valid from invalid statements of a given kind, it will, on statistical grounds, sometimes generate invalid ones. Hallucinations are not broken code: they are a natural consequence of the model doing exactly what it was trained to do, which is to make the best statistical guess possible based on its training distribution (Kalai et al., 2025). This means hallucination cannot be fully eliminated through prompt engineering alone; it must be managed through verification workflows, source checking, and multi-model comparison.
Never ask a generative model for "the correct interpretation." Instead, mandate that it provides a defined range of interpretations and evaluates the evidentiary weight for each.
Grounded firmly in the supplied evidence. Lowest risk of hallucination. Claims only what the data directly supports.
Ask: "What evidence supports this?"
Synthesises the supplied text with standard domain knowledge. Makes reasonable inferences but introduces additional assumptions.
Ask: "What additional evidence is needed?"
Extrapolates broader implications beyond the evidence. High uncertainty. Useful for hypothesis generation, not for claims.
Ask: "What facts would count against this?"
This checklist was developed for this guide as a mnemonic for the minimum verification steps a researcher should perform on any AI-generated content before it enters a manuscript or analysis pipeline.
Confirm that every citation exists and says what the model claims it says. Check DOIs, page numbers, and author lists against the original database entry.
Prioritise peer-reviewed and primary material over plausible grey literature. Models can generate convincing-sounding references to reports and working papers that do not exist.
Inspect what is omitted: methods, geographies, populations, theoretical positions, and languages not represented in the output. Models reflect training data distributions, not the full evidence base.
Separate supported findings from interpretation and speculation. If the model does not distinguish these itself, the output cannot be trusted without manual classification.
Record the model name and version, the full prompt, the source set provided, the date of generation, and the verification checks performed. This documentation enables reproducibility and audit.
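The record-keeping step above can be reduced to a small structured log entry. The following is a minimal sketch rather than a prescribed format: the class name `GenerationRecord`, its field names, and the example values are all assumptions introduced here. The DOI check uses a widely used syntax pattern based on one suggested by Crossref; it confirms only that a string looks like a DOI, not that the work exists or says what the model claims.

```python
import json
import re
from dataclasses import dataclass, field, asdict

# Syntax check only: a string that matches may still point to nothing.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Za-z0-9]+$")


def doi_is_well_formed(doi: str) -> bool:
    """Return True if the string has the shape of a modern DOI."""
    return bool(DOI_PATTERN.match(doi.strip()))


@dataclass
class GenerationRecord:
    """One auditable entry per piece of AI-generated content (hypothetical schema)."""
    model_name: str
    model_version: str
    prompt: str
    sources: list
    generated_on: str  # ISO 8601 date, e.g. "2026-05-01"
    checks_performed: list = field(default_factory=list)
    cited_dois: list = field(default_factory=list)

    def malformed_dois(self):
        """List cited DOIs that fail the syntax check and need manual attention first."""
        return [d for d in self.cited_dois if not doi_is_well_formed(d)]

    def to_json(self):
        """Serialise the record so it can be stored alongside the manuscript."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    record = GenerationRecord(
        model_name="example-model",      # placeholder, not a real product name
        model_version="unknown",
        prompt="Critique the causal language in the pasted results paragraph.",
        sources=["draft_results_v3.docx"],
        generated_on="2026-05-01",
        checks_performed=["citations checked against database entries"],
        cited_dois=["10.1000/xyz123", "doi pending"],
    )
    print(record.malformed_dois())
    print(record.to_json())
```

A well-formed DOI still has to be resolved and the source read; the syntax check simply surfaces the most obvious fabrications before the manual work begins.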
Retrieval-augmented generation (RAG) moves from closed-book pattern completion to open-book, source-grounded generation. It can reduce hallucination by anchoring answers in a trusted corpus, but it introduces its own failure modes: retrieval misses, context-window truncation, and false confidence from partial matches.
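The retrieval step, and the retrieval-miss failure mode, can be illustrated without any model at all. The sketch below is a deliberately simple stand-in: it scores passages by bag-of-words cosine similarity, whereas production systems typically use embedding models and a vector store. The function names, the `min_score` threshold, and the toy corpus are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, corpus, k=2, min_score=0.2):
    """Return up to k (source_id, score) pairs that clear the similarity threshold."""
    query = Counter(tokenize(question))
    scored = [(cosine(query, Counter(tokenize(text))), source_id)
              for source_id, text in corpus.items()]
    scored.sort(reverse=True)
    return [(source_id, round(score, 3)) for score, source_id in scored[:k] if score >= min_score]


def build_prompt(question, corpus, **kwargs):
    """Build a source-grounded prompt, or return None on a retrieval miss."""
    hits = retrieve(question, corpus, **kwargs)
    if not hits:
        # Stopping here is safer than letting the model answer closed-book.
        return None
    passages = "\n".join(f"[{source_id}] {corpus[source_id]}" for source_id, _ in hits)
    return ("Answer using only the passages below and cite their IDs. "
            "If they do not contain the answer, say so.\n\n"
            f"{passages}\n\nQuestion: {question}")


if __name__ == "__main__":
    toy_corpus = {  # invented passages standing in for a curated source set
        "smith_methods": "Sprint interval training and VO2max in adults",
        "pilot_notes": "Qualitative interviews about rugby taster sessions",
    }
    print(build_prompt("rugby taster sessions interviews", toy_corpus))
    print(build_prompt("lattice quantum chromodynamics", toy_corpus))
```

Returning `None` on a miss makes the failure visible to the researcher. A threshold set too low produces the opposite problem, where weak partial matches are passed to the model and lend the answer false confidence.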
The video below demonstrates a case where a model produces a confident but incorrect interpretation of statistical output. This is not a rare edge case — it is the default risk when statistical reasoning is delegated to a language model without independent verification. The model may identify the correct test, report plausible numbers, and still misinterpret what they mean.
Demo: An LLM producing confident but incorrect interpretation of statistical output.
Exploratory RAG — useful for orientation, question generation, and finding candidate passages in a curated source set. Acceptable for early-stage literature scanning.
Rigorous synthesis — requires defined inclusion criteria, paper-level extraction, traceable notes, and independent checking. RAG alone cannot perform this; it can assist with navigation within a pre-screened corpus.
StatsRAG pattern — a direct response to the kind of misinterpretation shown above. Build an auditable statistical specification, verify it against a trusted local reference library, then produce a verdict card covering compliance, metric integrity, direction, and source support. This is the approach used in tools like the StatsRAG project for Bayesian analysis specification.
Demo: StatsRAG — verifying LLM-generated statistical output against a trusted reference library.
IBM explainer on RAG
| Avoid | Instead |
|---|---|
| Relying on an LLM for direct calculation of complex statistics or numerical datasets. | Generate deterministic R or Python code and run it in a controlled environment. Inspect the code before execution. |
| Accepting test selection or model specification without manual verification of the design and assumptions. | Check normality, variance structure, outliers, dependence, units, sample size, and model assumptions against the study design. |
| Pasting AI-generated interpretation into a manuscript without independent checking. | Constrain the prompt, re-run with variations, compare outputs across models, and verify claims against the original data and published sources. |
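As an illustration of the "Instead" advice, the script below is the kind of short, inspectable code a model might be asked to draft in place of doing arithmetic in chat. It is a sketch under stated assumptions: the function name `between_group_summary` is hypothetical, the data in the example are invented, and it reports only the between-group mean difference, Cohen's d, and the Welch t statistic with its degrees of freedom. P-values and confidence intervals should come from a statistical package, and the assumption checks listed above still apply.

```python
import math
from statistics import mean, stdev


def between_group_summary(treatment, control):
    """Summarise a two-group comparison with transparent, checkable arithmetic."""
    n1, n2 = len(treatment), len(control)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")

    m1, m2 = mean(treatment), mean(control)
    s1, s2 = stdev(treatment), stdev(control)  # sample standard deviations
    diff = m1 - m2

    # Cohen's d uses the pooled standard deviation.
    pooled_sd = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

    # Welch's t does not assume equal variances.
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    standard_error = math.sqrt(v1 + v2)
    welch_df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

    return {
        "mean_difference": diff,
        "cohens_d": diff / pooled_sd,
        "welch_t": diff / standard_error,
        "welch_df": welch_df,
    }


if __name__ == "__main__":
    # Hypothetical VO2max change scores (ml/kg/min), invented purely for illustration.
    sprint = [3.1, 2.4, 4.0, 1.8, 2.9]
    steady = [1.2, 2.0, 0.9, 1.7, 1.1]
    for name, value in between_group_summary(sprint, steady).items():
        print(f"{name}: {value:.3f}")
```

Because every step is explicit, a reviewer can rerun the script on the same data and get the same numbers, which is not true of figures quoted from a chat transcript.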
Secure AI choices
The newest secure options offer considerably stronger data protection than public consumer chat, but "secure" still depends on your institution, licence terms, data classification policy, region, retention settings, and whether features like web grounding or third-party connectors are enabled. No single answer works for every institution.
Best for low-risk brainstorming, exploring public information, and learning how models behave. Do not upload unpublished manuscripts, sensitive participant data, or confidential grant material. Consumer tier data handling varies by provider and changes frequently — check the current terms.
Stronger contractual controls, typically with no-training clauses and regional data residency. However, local policy decides what data classifications are permitted. Check your institution's AI acceptable use policy and the specific enterprise agreement before uploading anything beyond public data.
Tenant-level governance, audit logging, region selection, and model-provider separation. Suitable for serious deployments with institutional data. Requires technical setup and ongoing administration — not a plug-and-play option for individual researchers.
Run models on your own hardware. Nothing leaves the machine. This maximises confidentiality but does not maximise accuracy — local models are typically smaller and less capable than frontier cloud models. Best for peer-review assistance, manuscript critique, and code generation where the researcher can verify every output.
Current platform guide
These are not endorsements. They are a researcher's map: what each platform is good for, what to check, and when to consider alternatives. All links and descriptions were checked in May 2026, but product details, model access, pricing, data retention and privacy settings change frequently. Check current provider documentation before using any tool with non-public research data.
Strong general-purpose model with web browsing, code execution, and image generation. Enterprise and Edu tiers offer no-training guarantees. Free/Plus tiers may use conversations for model improvement unless opted out.
Emphasis on careful reasoning, document work and long-context processing. Strong for manuscript critique, coding and structured analysis. Claude for Work provides enterprise data controls; check the current context limits and plan features.
Deep integration with Google Workspace. Gemini in Docs, Sheets, and Slides is useful for faculty already in the Google ecosystem. Workspace data protection policies apply to enterprise customers.
Grounded in your Microsoft 365 data (SharePoint, Teams, email). Useful for institutional knowledge retrieval. Commercial data protection means prompts and responses are not used for training.
AI-assisted literature review and data extraction. Searches Semantic Scholar, extracts structured data from papers, and supports screening workflows. Useful for scoping reviews and evidence mapping.
Shows how a paper has been cited — supporting, contrasting, or mentioning — across the literature. Useful for assessing the reception of a specific finding and identifying disputes.
Systematic review management with AI-assisted screening. Supports blind review, conflict resolution, and PRISMA-compatible export. Free for individual researchers.
Source-grounded chat over uploaded documents. Best with 5–15 curated sources on a focused topic. Generates audio overviews and summaries. Does not replace systematic search or formal synthesis.
Data analysis platform with sandboxed code execution in R and Python. Connects to data sources, generates code, and produces visualisations. Useful for exploratory analysis and teaching statistical workflows.
AI search with visible source links. Useful for quick orientation and finding recent publications. Paid tiers may offer more capable models and longer outputs. Not a substitute for systematic database searching.
Desktop application for running open models locally. Easy model discovery, download, and chat. Good entry point for researchers new to local AI. Supports GGUF-format models on CPU and GPU.
Open-source desktop AI with a clean interface. Supports local models, API connections, and extensions. Good for researchers who want offline chat without terminal commands.
Privacy-focused desktop client from Nomic. Runs quantised models locally with a simple GUI. Includes a local document Q&A feature for small corpora.
Command-line model server (Ollama) paired with a browser-based interface (Open WebUI). Supports model switching, RAG, tool calling and multi-user access. A flexible local setup for research teams and demonstrations.
Desktop and server application for local RAG. Upload documents, build a vector store, and chat with local or cloud models grounded in your own data. Good for building a private knowledge base.
Anonymous access to a rotating set of third-party models with no account required. DuckDuckGo describes requests as proxied to reduce identifying metadata and says it has contractual no-training arrangements with model providers. Model availability and limits change, so check the current Duck.ai documentation before relying on a specific model.
Privacy-focused AI assistant from the makers of Proton Mail. Proton describes Lumo as running on Proton-controlled infrastructure with zero-access encryption for saved conversations and a temporary mode for ephemeral chats. This may suit sensitive drafting or file review where the capability is sufficient, but check current terms, model options and institutional requirements.
AI assistant built into the Brave browser with a choice of hosted models. Brave describes requests as proxied and chat history as stored locally unless users choose otherwise. Useful for page-level tasks such as summarisation, translation and Q&A grounded in the current tab; verify current privacy claims and model availability before using it for research material.
Private multi-model AI workspace with encryption and collaboration features. It may be useful for structured research workflows where the provider's current terms match the data classification, but it is newer and less institutionally tested than the large enterprise platforms. Check the current documentation before uploading non-public data.
The primary hub for open-weight models. Inspect model cards, licences, benchmark results, and community Spaces before downloading. Essential for evaluating which model is appropriate for a given task.
Authorship and peer review
Some uses of AI in the peer-review and writing process are highly appropriate; some are risky; and submitting unverified generated text is academic misconduct under most institutional and journal policies, regardless of where the model runs. The spectrum below applies to any model, cloud or local.
| Use case | Rating | Researcher responsibility |
|---|---|---|
| Gap analysis and red teaming | Highly appropriate | The researcher must independently evaluate which critiques are valid and decide what to act on. |
| Grammar, wording, and structure | Highly appropriate | The meaning and argument must remain human-generated. All changes must be reviewed and approved. |
| Substantial drafting of text | Problematic | Risks false synthesis, fabricated citations, and authorship blur. If used at all, every claim must be independently verified and the contribution disclosed. |
| Unverified paste-in | Inappropriate | The author cannot vouch for accuracy, originality, or source integrity. This constitutes academic misconduct under most institutional and journal policies. |
Adapt these to the target journal. Always include the tool name, version where available, specific task, date range, and the human verification performed.
Paste your abstract and methods section into any capable model using the prompt below. Then evaluate the output: did the model identify a real weakness? Did it fabricate a concern? How does it compare to actual reviewer feedback you have received?
Debrief questions: Did the model identify a real weakness you had not noticed? Did it fabricate a concern that does not withstand scrutiny? Did different models produce different critiques? This exercise demonstrates both the power and the limits of AI-assisted review — and why the researcher must evaluate every critique independently.
See also: Nature AI policy · Elsevier AI policy
Future direction
The trajectory points towards a shift from back-and-forth prompting to agentic workflows, where the researcher becomes less a prompt typist and more a manager of objectives, constraints, tools, audit logs, and checkpoints. This raises the verification burden rather than removing it.
Coding and research agents (such as those in Claude Code, Cursor, and Windsurf) can execute multi-step tasks: searching literature, writing and running code, iterating on errors. The appeal is real, but so are the risks. Agents require explicit permission boundaries, sandboxed execution environments, source-boundary constraints (preventing the agent from citing material outside a defined corpus), and human review gates at each decision point. Unsupervised agent runs that modify data or submit outputs are not currently defensible in an academic context.
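The idea of permission boundaries and human review gates can be sketched in a few lines. The class below is a hypothetical illustration, not the mechanism used by any of the tools named above: the name `ReviewGate`, the action labels, and the approval callback are all assumptions. It shows the pattern only: an allowlist of actions, a subset that requires explicit human approval, and an audit log of every decision.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReviewGate:
    """Wrap agent actions in an allowlist, an approval step, and an audit log."""
    allowed_actions: set
    needs_approval: set
    approve: Callable[[str, str], bool]  # human decision: (action, description) -> bool
    audit_log: list = field(default_factory=list)

    def run(self, action, description, execute):
        """Execute `execute()` only if the action is permitted and, where required, approved."""
        if action not in self.allowed_actions:
            self.audit_log.append((action, description, "blocked"))
            raise PermissionError(f"action '{action}' is outside the permission boundary")

        if action in self.needs_approval and not self.approve(action, description):
            self.audit_log.append((action, description, "rejected"))
            return None

        result = execute()
        self.audit_log.append((action, description, "executed"))
        return result


def ask_researcher(action, description):
    """Console prompt standing in for a real review interface."""
    return input(f"Approve {action}: {description}? [y/N] ").strip().lower() == "y"


if __name__ == "__main__":
    gate = ReviewGate(
        allowed_actions={"read_file", "run_analysis", "write_file"},
        needs_approval={"write_file"},  # anything that modifies data waits for a human
        approve=ask_researcher,
    )
    print(gate.run("read_file", "load the codebook", lambda: "codebook loaded"))
    print(gate.run("write_file", "overwrite cleaned_data.csv", lambda: "file written"))
    print(gate.audit_log)
```

The design choice worth noting is that the gate fails closed: an action that is not explicitly allowed raises an error instead of being passed to the human for ad hoc approval, which keeps the permission boundary separate from the review step.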
Demo: Claude Code building SecurXamine — an agentic coding workflow with human review gates.
Demo: OpenAI Codex generating a Bayesian 3D visualisation as an agentic task.
LoRA allows specialisation of a foundation model for a narrow task — such as evaluating statistical claims in academic prose or classifying methodological frameworks — without the cost or data requirements of full fine-tuning. A LoRA adapter trained on 100 annotated examples can meaningfully shift model behaviour on a focused task. However, dataset quality determines everything: garbage in, confidently wrong garbage out. Licensing of the base model, evaluation against held-out test sets, and monitoring for distributional drift over time all matter. Fine-tuning is powerful and accessible, but it is not a shortcut to a reliable domain expert.
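The cost advantage comes from the structure of the adapter. LoRA freezes a weight matrix of shape d_out × d_in and learns a low-rank update formed from two much smaller matrices, of shapes d_out × r and r × d_in. The short calculation below compares the number of trainable parameters for a single layer. The function name and the layer dimensions are illustrative assumptions, not figures from any particular model.

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Compare trainable parameters for full fine-tuning and LoRA on one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix.
    LoRA trains only B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return full, adapter, adapter / full


if __name__ == "__main__":
    # Illustrative layer size and rank, chosen for round numbers.
    full, adapter, fraction = lora_parameter_counts(d_out=4096, d_in=4096, rank=8)
    print(f"full fine-tuning: {full:,} parameters")
    print(f"LoRA adapter:     {adapter:,} parameters ({fraction:.2%} of full)")
```

Fewer trainable parameters reduce memory and compute, but they do not reduce the need for a clean dataset or a held-out evaluation set.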
Demo: Fine-tuning a small thinking model (Ouro) with LoRA adapters for statistical claim evaluation.
Current transformer limitations — finite context windows, no persistent memory, limited planning — are active research frontiers. Future systems may combine symbolic reasoning, retrieval, planning modules, and model-based generation. Mixture-of-experts architectures (already deployed in models like Qwen and Gemini) improve efficiency by activating only relevant subnetworks for a given input. State-space models and recurrent alternatives may reduce the quadratic cost of attention on long sequences. None of these architectural advances will eliminate the need for researcher verification; they will change the shape of the errors rather than removing them.
Sources
All links were checked for this build in May 2026. Where a source is behind a paywall, the DOI or arXiv preprint is provided.