Generative AI for Research

A practical, secure, and methodologically sceptical guide for researchers who want AI to accelerate work without outsourcing judgement. This guide is designed to be useful whether or not you attended the accompanying workshop — every claim stands on its own sources and reasoning.

Updated 16 May 2026 · Companion to The AI-Powered Academic

AI is a collaborator, not evidence

Use models for generation, critique, transformation, and organisation. Scholarly databases, primary sources, code, and expert judgement still carry the evidential weight.

Security and accountability are separate

A secure or local model can protect unpublished work, but it cannot make claims accountable. The researcher remains responsible for every interpretation, citation, and disclosure.

Verification is structural

Hallucination, positional bias, and correlated model errors are not rare accidents. Build checks into prompts, workflows, and publication decisions from the beginning.

Match the model to the task

The question is not whether AI can be inserted into research. It is whether the contribution is defensible, auditable, and proportionate to the evidence. The stages below show where AI can add genuine value — and where the risks concentrate.

Conception
Data collection
Analysis
Interpretation
Writing
Peer review

Capability alignment: match the prompt to the research task

Each pipeline stage has a productive AI role and a clear boundary. The principle throughout: deploy AI for generation, critique, transformation, and organisation. Never use it as a final authority.

Pipeline stage | Best AI role | Do not use for
Idea conception | Critical sparring partner | Sole creator of hypotheses
Literature search | Search-term planner | Citation source or factual database
Data collection | Wording and bias checker | Generating synthetic data
Data analysis | Code drafter and assumption checker | Analytical authority
Writing | Clarity and structure editor | Primary author
Peer review | Reviewer-simulation and response organiser | Final reviewer or evaluator

Use AI to structure inquiry; use scholarly databases and expert judgement to establish evidence.

Voice interview bot

A voice-enabled qualitative interview bot can support data collection by conducting semi-structured interviews using natural speech. The example below uses DeepSeek for completions, ElevenLabs for text-to-speech, and Groq Whisper for speech-to-text. This approach has been used in practice — for example, in a pilot study capturing participant experiences of rugby taster sessions, with full ethical approval and a participant information sheet (OPIS) in place. As with any data collection method, standard ethical requirements apply: institutional ethical approval, informed consent, and appropriate data handling must all be secured before deployment.

Demo: Voice-enabled interview bot using DeepSeek, ElevenLabs TTS, and Groq Whisper STT.

View the GitHub example

Sandboxed analysis

Use AI to draft deterministic R or Python code, not to calculate complex statistics inside a free-text chat window. A sandboxed environment runs the generated code in isolation, which protects the host system and makes the analysis reproducible. Julius AI provides one such sandbox, with model selection, a connected code environment, and R/Python support.

Demo: Using Julius AI's sandboxed R environment with model selection and connected data.

Julius AI for data workflows

AI-assisted qualitative analysis

Interpretative Phenomenological Analysis (IPA) and other qualitative frameworks can benefit from AI as an analytical companion — not a replacement for the researcher's interpretive labour. The tool below accepts a research question, researcher reflexive statement, and interview transcripts, then generates exploratory themes using DeepSeek. The reflexive statement is threaded through the analysis to surface the researcher's own biases and assumptions, maintaining the double hermeneutic that IPA demands.

Demo: IPA Analysis Tool with reflexive statement support and DeepSeek completions.

View the GitHub repository

Focused notebooks

NotebookLM-style systems are strongest when the source set is curated and narrow — typically 5 to 15 closely related sources addressing a defined question. Avoid dumping an entire field into one mega-notebook and treating the summary as synthesis. The model cannot perform systematic review methodology; it can help you navigate and interrogate a pre-selected corpus.

NotebookLM help

Local manuscript review

A locally-hosted review pipeline can provide confidential feedback on draft manuscripts without exposing unpublished work to cloud APIs. The example below runs on Apple Silicon using MLX with quantised open-weight models (Qwen 3.6, Gemma 4), providing structured methodological critique via a FastAPI web interface with separate Chat and Review modes. This solves the confidentiality problem — the manuscript never leaves the machine — but it does not solve the accountability problem. The researcher must still evaluate every critique the model produces.

Running multiple models against the same manuscript is instructive: each model tends to identify different genuine weaknesses while also introducing its own distinct factual errors. One model may correctly flag an inferential gap between large F-values and trivial adjusted R² differences; another may erroneously describe a flexible distribution method as "distribution-free." This pattern — broader critical coverage at the cost of model-specific errors — reinforces why multi-model comparison and researcher verification are both necessary.

Demo: Local LLM manuscript reviewer with model switching (Qwen 3.6 35B, Gemma 4 26B).

Prompting is context design

Transformer-based systems generate likely continuations from the context window. That makes prompting a methodological act: you are constraining the probability distribution, not asking an oracle.

The sampling parameters below are relevant to researchers using API access or local models. If you are using a standard chat interface (ChatGPT, Claude, Gemini), the platform manages these settings and you can skip to the prompt framework.

temperature

Lower values (0.0–0.3) give more consistent, more reproducible outputs. Higher values (0.7–1.0) give more varied outputs and suit divergent idea generation. A low temperature makes answers more repeatable, not more accurate.

top_p

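Limits sampling to the smallest set of most likely tokens whose cumulative probability reaches the threshold (nucleus sampling). A value of 0.9 means the model samples only from the tokens that together make up the top 90% of probability mass.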

top_k

Hard-cuts the candidate token pool to a fixed size. A value of 40 means only the 40 most probable next tokens are considered.

These parameters interact. In most implementations, setting temperature to 0 selects the single most likely token at every step (greedy decoding), which makes top-p and top-k irrelevant. In practice, adjust temperature first and leave the others at defaults unless you have a specific reason to constrain further. Most chat platforms do not expose these controls.
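The interaction between the three parameters is easier to see on a toy distribution. The sketch below is illustrative only: the function names, the four-token vocabulary, and the logit values are invented for this example, and real inference libraries may apply the filters in a different order or with different edge-case rules.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        # Greedy decoding: all probability mass goes to the most likely token,
        # so top_k and top_p have nothing left to filter.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature rescales the logits before the softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # hard cut to the k most probable tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:  # smallest set reaching the threshold
                break
        ranked = kept

    mass = sum(prob for _, prob in ranked)
    return {tok: prob / mass for tok, prob in ranked}  # renormalise survivors


def sample_token(distribution, rng):
    """Draw one token from the filtered distribution."""
    tokens = list(distribution)
    return rng.choices(tokens, weights=[distribution[t] for t in tokens], k=1)[0]


# Hypothetical next-token logits, invented for illustration.
toy_logits = {"significant": 2.0, "associated": 1.5, "superior": 0.5, "proven": -1.0}

for label, kwargs in [("greedy", {"temperature": 0}),
                      ("top_k=2", {"top_k": 2}),
                      ("top_p=0.9", {"top_p": 0.9})]:
    print(label, sorted(filter_distribution(toy_logits, **kwargs)))
```

With these toy values, top_p=0.9 drops only the least likely token, while top_k=2 drops two. Lowering the temperature concentrates probability on the leading token; raising it spreads probability across the alternatives.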

The 7-part research prompt

This framework structures prompts so that the model receives enough context to generate useful output while the constraints prevent the most common failure modes: hallucinated citations, overclaiming, and unsupported causal language.

1. Role: define the expert lens the model should adopt.
2. Task: specify the action: evaluate, draft, compare, extract.
3. Context: state the study design, purpose, and stage of work.
4. Evidence: supply the data, text, or output to work with.
5. Constraints: forbid unsupported claims, invented citations, speculation.
6. Format: require tables, code blocks, headings, or structured output.
7. Verification: ask for uncertainty flags, missing evidence, and checks needed.

# Research critique prompt: paste into any capable model

Role: Act as a critical methodological adviser.

Task: Evaluate the supplied material for factual, statistical, and methodological reliability.

Context: I am preparing an academic research output and need conservative feedback before submission.

Evidence: [Paste the source text, table, output, transcript, or model results here.]

Constraints: Use only the supplied material unless I explicitly ask for external sources. Do not invent citations, missing facts, or statistical values. Distinguish clearly between fact, interpretation, and speculation.

Output format: Use a table with these columns: Claim | Support status | Evidence | Concern | Required check | Conservative revision.

Verification: Flag overclaiming, causal language without experimental design, missing uncertainty, mismatches between evidence and conclusion, and any claim that needs independent database or code verification.
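For researchers who send prompts through an API or a local model, the seven parts can be kept as separate fields so that none is silently left out. This is a minimal sketch: the class name `ResearchPrompt` and its `render` method are hypothetical helpers written for this guide, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class ResearchPrompt:
    """The seven parts of the framework, held as plain strings."""
    role: str
    task: str
    context: str
    evidence: str
    constraints: str
    output_format: str
    verification: str

    def render(self) -> str:
        """Join the parts in framework order; refuse to render if any is empty."""
        parts = [
            ("Role", self.role),
            ("Task", self.task),
            ("Context", self.context),
            ("Evidence", self.evidence),
            ("Constraints", self.constraints),
            ("Output format", self.output_format),
            ("Verification", self.verification),
        ]
        missing = [label for label, text in parts if not text.strip()]
        if missing:
            raise ValueError("Empty prompt parts: " + ", ".join(missing))
        return "\n\n".join(f"{label}: {text.strip()}" for label, text in parts)


prompt = ResearchPrompt(
    role="Act as a critical methodological adviser.",
    task="Evaluate the supplied material for methodological reliability.",
    context="I am preparing an academic research output before submission.",
    evidence="[Paste the source text, table, or output here.]",
    constraints="Use only the supplied material. Do not invent citations.",
    output_format="A table: Claim | Support status | Evidence | Concern.",
    verification="Flag overclaiming and any claim needing independent checks.",
)
print(prompt.render())
```

Raising an error on an empty part is a deliberate choice: a prompt that loses its Constraints or Verification section still runs, but it runs without the safeguards the framework exists to provide.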

Worked example — what good output looks like

Suppose you paste a draft results paragraph that reads: "Sprint training significantly improved VO2max (p = 0.03), demonstrating that high-intensity intervals are superior to steady-state training for aerobic adaptation."

A well-configured model should return something like this:

Claim: Sprint training is "superior" for aerobic adaptation
Support status: Partially supported
Evidence: p = 0.03 for within-group change in VO2max
Concern: "Superior" implies a between-group comparison, but only a within-group p-value is supplied. No effect size, no confidence interval, no comparison condition reported. "Significantly" conflates statistical and practical significance.
Required check: Report the between-group comparison statistic and effect size. Check whether the study design supports a causal claim (randomised? controlled?).
Conservative revision: "Sprint training was associated with a pre-to-post increase in VO2max (mean difference = X, 95% CI [Y, Z], p = 0.03). Comparison with steady-state training requires the between-group analysis reported in Table N."

The model catches the overclaiming, flags the missing effect size, and distinguishes within-group from between-group evidence. This is the kind of output the 7-part prompt is designed to elicit. If the model instead validates the original claim, the prompt constraints need tightening or the model is not suitable for this task.

Fluency is not evidence

Agreement between polished outputs is not the same thing as independent corroboration. Two models trained on overlapping data can produce the same confident, wrong answer. Verification needs its own workflow, separate from generation.

Why hallucination is structural, not accidental

Kalai et al. (2025) tie generative errors to a classification problem: if a model cannot reliably tell valid statements about a fact from invalid ones, it will sometimes generate the invalid ones. On this account, hallucinations are not broken code. They are a natural consequence of the model doing exactly what it was trained to do: make the best statistical guess possible based on its training distribution. This means hallucination cannot be fully eliminated through prompt engineering alone; it must be managed through verification workflows, source checking, and multi-model comparison.

From single answers to plausible answer sets

Never ask a generative model for "the correct interpretation." Instead, require it to provide a defined range of interpretations and to evaluate the evidentiary weight for each.

Conservative

Grounded firmly in the supplied evidence. Lowest risk of hallucination. Claims only what the data directly supports.

Ask: "What evidence supports this?"

Moderate

Synthesises the supplied text with standard domain knowledge. The inferences are reasonable, but they introduce additional assumptions.

Ask: "What additional evidence is needed?"

Speculative

Extrapolates broader implications beyond the evidence. High uncertainty. Useful for hypothesis generation, not for claims.

Ask: "What facts would count against this?"

The VALID-AI checklist

This checklist was developed for this guide as a mnemonic for the minimum verification steps a researcher should perform on any AI-generated content before it enters a manuscript or analysis pipeline.

V: Verify sources

Confirm that every citation exists and says what the model claims it says. Check DOIs, page numbers, and author lists against the original database entry.

A: Assess authority

Prioritise peer-reviewed and primary material over plausible grey literature. Models can generate convincing-sounding references to reports and working papers that do not exist.

L: Look for bias

Inspect what is omitted: methods, geographies, populations, theoretical positions, and languages not represented in the output. Models reflect training data distributions, not the full evidence base.

I: Identify limits

Separate supported findings from interpretation and speculation. If the model does not distinguish these itself, the output cannot be trusted without manual classification.

D: Document provenance

Record the model name and version, the full prompt, the source set provided, the date of generation, and the verification checks performed. This documentation enables reproducibility and audit.
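The provenance fields listed under "Document provenance" map naturally onto a small structured record that can be saved alongside the output it describes. The sketch below is one possible shape, not a standard: the function name, field names, and example values are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import date


def provenance_record(model, version, prompt, sources, checks, generated_on=None):
    """Bundle the provenance fields into a JSON-serialisable dictionary."""
    return {
        "model": model,
        "version": version,
        "generated_on": (generated_on or date.today()).isoformat(),
        "prompt": prompt,
        # A hash makes later edits to the stored prompt detectable.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "sources": sorted(sources),
        "verification_checks": list(checks),
    }


# Placeholder values, invented for the example.
record = provenance_record(
    model="example-local-model",
    version="version string as reported by the tool",
    prompt="Role: Act as a critical methodological adviser. ...",
    sources=["draft_results_section.docx", "table_2_output.csv"],
    checks=["All citations checked against the original database entry",
            "Statistical claims re-run from code"],
    generated_on=date(2026, 5, 16),
)
print(json.dumps(record, indent=2))
```

Storing the record as JSON next to the generated text keeps the audit trail in the same place as the material it documents.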

RAG helps, but it is not magic

Retrieval-augmented generation (RAG) moves from closed-book pattern completion to open-book, source-grounded generation. It can reduce hallucination by anchoring answers in a trusted corpus, but it introduces its own failure modes: retrieval misses, context-window truncation, and false confidence from partial matches.
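The retrieval step, and the retrieval miss it can produce, can be shown without any model at all. The sketch below uses plain word-count cosine similarity in place of the embedding search a production system would use. The three passages, the similarity threshold, and every function name are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, corpus, top_n=2, min_score=0.1):
    """Return (passage_id, score) pairs for the best-matching passages."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), pid) for pid, text in corpus.items()]
    scored.sort(reverse=True)
    return [(pid, round(score, 3)) for score, pid in scored[:top_n] if score >= min_score]


def grounded_prompt(question, corpus, hits):
    """Build a source-bounded prompt, or return None on a retrieval miss."""
    if not hits:
        return None  # better to stop than to let the model answer closed-book
    passages = "\n".join(f"[{pid}] {corpus[pid]}" for pid, _ in hits)
    return ("Answer using ONLY the passages below and cite passage ids.\n"
            f"{passages}\n\nQuestion: {question}")


# Toy corpus, invented for the example.
corpus = {
    "p1": "Sprint interval training increased VO2max over eight weeks in the pilot sample.",
    "p2": "Participants completed a semi-structured interview after each taster session.",
    "p3": "Adjusted R squared differed only trivially between the two regression models.",
}

hit_question = "Did sprint interval training change VO2max?"
miss_question = "Which dropout rate occurred?"
print(retrieve(hit_question, corpus))
print(grounded_prompt(miss_question, corpus, retrieve(miss_question, corpus)))
```

The second question matches nothing in the corpus, so the sketch returns no prompt rather than passing an ungrounded question to a model. Lowering `min_score` would let weak matches through, which is how a single shared word can produce false confidence.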

When AI gets statistics wrong

The video below demonstrates a case where a model produces a confident but incorrect interpretation of statistical output. This is not a rare edge case — it is the default risk when statistical reasoning is delegated to a language model without independent verification. The model may identify the correct test, report plausible numbers, and still misinterpret what they mean.

Demo: An LLM producing confident but incorrect interpretation of statistical output.

Three levels of RAG use

Exploratory RAG — useful for orientation, question generation, and finding candidate passages in a curated source set. Acceptable for early-stage literature scanning.

Rigorous synthesis — requires defined inclusion criteria, paper-level extraction, traceable notes, and independent checking. RAG alone cannot perform this; it can assist with navigation within a pre-screened corpus.

StatsRAG pattern — a direct response to the kind of misinterpretation shown above. Build an auditable statistical specification, verify it against a trusted local reference library, then produce a verdict card covering compliance, metric integrity, direction, and source support. This is the approach used in tools like the StatsRAG project for Bayesian analysis specification.

Demo: StatsRAG — verifying LLM-generated statistical output against a trusted reference library.

IBM explainer on RAG

Safety protocol for AI-assisted analysis

Avoid: Relying on an LLM to calculate complex statistics or process numerical datasets directly.
Instead: Generate deterministic R or Python code and run it in a controlled environment. Inspect the code before execution.
Avoid: Accepting test selection or model specification without manual verification of the design and assumptions.
Instead: Check normality, variance structure, outliers, dependence, units, sample size, and model assumptions against the study design.
Avoid: Pasting AI-generated interpretation into a manuscript without independent checking.
Instead: Constrain the prompt, re-run with variations, compare outputs across models, and verify claims against the original data and published sources.
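As an illustration of what "deterministic code" means in practice, the sketch below computes a paired mean difference, a standardised effect size, and a bootstrap 95% confidence interval with a fixed random seed, so every run gives the same numbers. The pre and post values are invented for the example and come from no study. The percentile bootstrap is one reasonable choice among several, and the function name is hypothetical.

```python
import random
import statistics


def paired_summary(pre, post, n_boot=2000, seed=20260516):
    """Mean paired difference, Cohen's d_z, and a percentile bootstrap 95% CI."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    diffs = [after - before for before, after in zip(pre, post)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)

    # A fixed seed makes the resampling, and so the interval, reproducible.
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lower = boot_means[n_boot * 25 // 1000]        # about the 2.5th percentile
    upper = boot_means[n_boot * 975 // 1000 - 1]   # about the 97.5th percentile

    return {
        "n": len(diffs),
        "mean_diff": mean_diff,
        "cohens_dz": mean_diff / sd_diff,
        "ci95": (lower, upper),
    }


# Illustrative values only, invented for this sketch.
pre = [41.2, 38.5, 44.0, 39.8, 42.1, 40.3, 37.9, 43.5]
post = [43.0, 39.1, 45.2, 41.5, 42.0, 42.4, 39.0, 44.1]

print(paired_summary(pre, post))
```

Because the seed is fixed, a second researcher running the same script on the same data gets the same interval, which a free-text chat answer cannot promise. The code still needs the checks listed above: it says nothing about whether a paired design, or this effect size, is right for the study.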

Choose the smallest exposure that fits the job

The newest secure options are considerably better than public consumer chat, but "secure" still depends on your institution, licence terms, data classification policy, region, retention settings, and whether features like web grounding or third-party connectors are enabled. No single answer works for every institution.

Public web AI

ChatGPT, Gemini, Claude (free/consumer tiers), Perplexity

Best for low-risk brainstorming, exploring public information, and learning how models behave. Do not upload unpublished manuscripts, sensitive participant data, or confidential grant material. Consumer tier data handling varies by provider and changes frequently — check the current terms.

Campus enterprise AI

ChatGPT Edu, Microsoft 365 Copilot Chat, Gemini for Workspace, Claude for Work

Stronger contractual controls, typically with no-training clauses and regional data residency. However, local policy decides what data classifications are permitted. Check your institution's AI acceptable use policy and the specific enterprise agreement before uploading anything beyond public data.

Managed secure cloud

Azure AI Foundry, AWS Bedrock, Google Vertex AI

Tenant-level governance, audit logging, region selection, and model-provider separation. Suitable for serious deployments with institutional data. Requires technical setup and ongoing administration — not a plug-and-play option for individual researchers.

Local or self-hosted AI

Ollama, LM Studio, Open WebUI, AnythingLLM, Jan, GPT4All

Run models on your own hardware. Nothing leaves the machine. This maximises confidentiality but does not maximise accuracy — local models are typically smaller and less capable than frontier cloud models. Best for peer-review assistance, manuscript critique, and code generation where the researcher can verify every output.

A simple classification rule

Public data: Use any appropriate tool, but verify and cite independently. The convenience of AI does not reduce the citation standard.
Internal or unpublished work: Use approved enterprise/campus AI or a managed secure cloud service. Check retention and training-exclusion clauses.
Sensitive, identifiable, or embargoed data: Use approved local, self-hosted, or institutionally governed platforms only. This includes participant data, clinical records, and pre-publication findings under embargo.
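A lab or research group can encode the classification rule as a small lookup so that the decision is explicit and reviewable. The sketch below is one conservative reading of the rule, not institutional policy. The enum names and the `institution_approved` flag are hypothetical. Two choices here are assumptions that go beyond the text: internal work is also allowed on local tools, and sensitive data is restricted to local tools unless a governed platform is added to the mapping explicitly.

```python
from enum import Enum


class DataClass(Enum):
    PUBLIC = "public data"
    INTERNAL = "internal or unpublished work"
    SENSITIVE = "sensitive, identifiable, or embargoed data"


class ToolTier(Enum):
    PUBLIC_WEB = "public web AI"
    CAMPUS_ENTERPRISE = "campus enterprise AI"
    MANAGED_CLOUD = "managed secure cloud"
    LOCAL = "local or self-hosted AI"


# Assumed mapping; adjust to the institution's own data classification policy.
PERMITTED = {
    DataClass.PUBLIC: set(ToolTier),
    DataClass.INTERNAL: {ToolTier.CAMPUS_ENTERPRISE, ToolTier.MANAGED_CLOUD, ToolTier.LOCAL},
    DataClass.SENSITIVE: {ToolTier.LOCAL},
}


def is_permitted(data_class, tier, institution_approved=False):
    """Return True only if the tier fits the data class and, for anything
    beyond public data, the specific tool has institutional approval."""
    if tier not in PERMITTED[data_class]:
        return False
    if data_class is DataClass.PUBLIC:
        return True
    return institution_approved


print(is_permitted(DataClass.INTERNAL, ToolTier.PUBLIC_WEB, institution_approved=True))
print(is_permitted(DataClass.SENSITIVE, ToolTier.LOCAL, institution_approved=True))
```

The default of `institution_approved=False` means that a forgotten argument fails closed: non-public data is refused unless approval is stated.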

Tools worth knowing in 2026

These are not endorsements. They are a researcher's map: what each platform is good for, what to check, and when to consider alternatives. All links and descriptions were checked in May 2026, but product details, model access, pricing, data retention and privacy settings change frequently. Check current provider documentation before using any tool with non-public research data.

Cloud · Enterprise

ChatGPT (OpenAI)

Strong general-purpose model with web browsing, code execution, and image generation. Enterprise and Edu tiers offer no-training guarantees. Free/Plus tiers may use conversations for model improvement unless opted out.

Secure tiers available
Cloud · Enterprise

Claude (Anthropic)

Emphasis on careful reasoning, document work and long-context processing. Strong for manuscript critique, coding and structured analysis. Claude for Work provides enterprise data controls; check the current context limits and plan features.

Secure tiers available
Cloud · Enterprise

Gemini (Google)

Deep integration with Google Workspace. Gemini in Docs, Sheets, and Slides is useful for faculty already in the Google ecosystem. Workspace data protection policies apply to enterprise customers.

Secure tiers available
Cloud · Enterprise

Microsoft 365 Copilot Chat

Grounded in your Microsoft 365 data (SharePoint, Teams, email). Useful for institutional knowledge retrieval. Commercial data protection means prompts and responses are not used for training.

Secure tiers available
Literature · Search

Elicit

AI-assisted literature review and data extraction. Searches Semantic Scholar, extracts structured data from papers, and supports screening workflows. Useful for scoping reviews and evidence mapping.

Literature
Literature · Citation

Scite

Shows how a paper has been cited — supporting, contrasting, or mentioning — across the literature. Useful for assessing the reception of a specific finding and identifying disputes.

Literature
Literature · Screening

Rayyan

Systematic review management with AI-assisted screening. Supports blind review, conflict resolution, and PRISMA-compatible export. Free for individual researchers.

Literature
Literature · Notebooks

NotebookLM (Google)

Source-grounded chat over uploaded documents. Best with 5–15 curated sources on a focused topic. Generates audio overviews and summaries. Does not replace systematic search or formal synthesis.

Literature · Secure (Workspace)
Analysis · Sandbox

Julius AI

Data analysis platform with sandboxed code execution in R and Python. Connects to data sources, generates code, and produces visualisations. Useful for exploratory analysis and teaching statistical workflows.

Analysis
Analysis · Search

Perplexity

AI search with visible source links. Useful for quick orientation and finding recent publications. Paid tiers may offer more capable models and longer outputs. Not a substitute for systematic database searching.

Analysis
Local · Desktop

LM Studio

Desktop application for running open models locally. Easy model discovery, download, and chat. Good entry point for researchers new to local AI. Supports GGUF-format models on CPU and GPU.

Local
Local · Desktop

Jan

Open-source desktop AI with a clean interface. Supports local models, API connections, and extensions. Good for researchers who want offline chat without terminal commands.

Local
Local · Desktop

GPT4All

Privacy-focused desktop client from Nomic. Runs quantised models locally with a simple GUI. Includes a local document Q&A feature for small corpora.

Local
Local · Research lab

Ollama + Open WebUI

Command-line model server (Ollama) paired with a browser-based interface (Open WebUI). Supports model switching, RAG, tool calling and multi-user access. A flexible local setup for research teams and demonstrations.

Local
Local · RAG

AnythingLLM

Desktop and server application for local RAG. Upload documents, build a vector store, and chat with local or cloud models grounded in your own data. Good for building a private knowledge base.

Local
Cloud · Privacy

Duck.ai (DuckDuckGo)

Anonymous access to a rotating set of third-party models with no account required. DuckDuckGo describes requests as proxied to reduce identifying metadata and says it has contractual no-training arrangements with model providers. Model availability and limits change, so check the current Duck.ai documentation before relying on a specific model.

Secure
Cloud · Privacy

Lumo (Proton)

Privacy-focused AI assistant from the makers of Proton Mail. Proton describes Lumo as running on Proton-controlled infrastructure with zero-access encryption for saved conversations and a temporary mode for ephemeral chats. This may suit sensitive drafting or file review where the capability is sufficient, but check current terms, model options and institutional requirements.

Secure
Browser · Privacy

Brave Leo

AI assistant built into the Brave browser with a choice of hosted models. Brave describes requests as proxied and chat history as stored locally unless users choose otherwise. Useful for page-level tasks such as summarisation, translation and Q&A grounded in the current tab; verify current privacy claims and model availability before using it for research material.

Secure
Cloud · Privacy

Okara

Private multi-model AI workspace with encryption and collaboration features. It may be useful for structured research workflows where the provider's current terms match the data classification, but it is newer and less institutionally tested than the large enterprise platforms. Check the current documentation before uploading non-public data.

Secure
Community · Models

Hugging Face

The primary hub for open-weight models. Inspect model cards, licences, benchmark results, and community Spaces before downloading. Essential for evaluating which model is appropriate for a given task.

Local · Literature

Local AI solves confidentiality. It does not solve accountability.

Some uses of AI in the peer-review and writing process are highly appropriate; some are risky; and unverified generation is academic misconduct regardless of where the model runs. The spectrum below applies to any model, cloud or local.

Use case | Rating | Researcher responsibility
Gap analysis and red teaming | Highly appropriate | The researcher must independently evaluate which critiques are valid and decide what to act on.
Grammar, wording, and structure | Highly appropriate | The meaning and argument must remain human-generated. All changes must be reviewed and approved.
Substantial drafting of text | Problematic | Risks false synthesis, fabricated citations, and authorship blur. If used at all, every claim must be independently verified and the contribution disclosed.
Unverified paste-in | Inappropriate | The author cannot vouch for accuracy, originality, or source integrity. This constitutes academic misconduct under most institutional and journal policies.

Disclosure statement templates

Adapt these to the target journal. Always include the tool name, version where available, specific task, date range, and the human verification performed.

Copy editing: I used [tool and version] to suggest grammar, wording, and structure improvements to human-authored text. I reviewed all changes and take full responsibility for the final manuscript.
Analysis support: I used [tool and version] to draft R/Python code and identify possible model assumptions. All analyses were executed in [software and version], checked against the dataset and relevant statistical references, and revised by the authors.
Red teaming: I used [local/approved tool and version] to identify possible limitations, missing literature, and overclaims. Suggestions were independently evaluated and only incorporated after author review and verification against primary sources.

Non-negotiable research rules

Try it yourself — AI peer review of your own paper

Paste your abstract and methods section into any capable model using the prompt below. Then evaluate the output: did the model identify a real weakness? Did it fabricate a concern? How does it compare to actual reviewer feedback you have received?

You are a sceptical methodological reviewer for an academic journal. Your job is to find weaknesses, not to validate. Read the material below and evaluate every substantive claim for evidential support.

Context: I am preparing a manuscript for peer-reviewed publication. I want conservative, evidence-grounded feedback before submission. I do not want encouragement; I want problems.

=== MATERIAL TO REVIEW ===
[PASTE YOUR ABSTRACT AND METHODS SECTION HERE]
=== END OF MATERIAL ===

Constraints:
- Use ONLY the supplied text.
- Do not invent missing information or fabricate citations.
- If a claim cannot be assessed from the material provided, say so explicitly and state what additional information would be needed.
- Do not soften your language.

Output format: A numbered table with these columns:
Claim | Support status | Evidence from text | Concern | What needs checking | Suggested revision

Support status must be one of: directly supported, partially supported, unsupported, not assessable.

After the table, write one paragraph summarising the three most serious methodological issues.

Finally: list any claims where you were uncertain about your own assessment and explain why. If you have no concerns about a claim, do not fabricate one.

Debrief questions: Did the model identify a real weakness you had not noticed? Did it fabricate a concern that does not withstand scrutiny? Did different models produce different critiques? This exercise demonstrates both the power and the limits of AI-assisted review — and why the researcher must evaluate every critique independently.

See also: Nature AI policy · Elsevier AI policy

From co-intelligence to managed agents

The trajectory points towards a shift from back-and-forth prompting to agentic workflows, where the researcher becomes less a prompt typist and more a manager of objectives, constraints, tools, audit logs, and checkpoints. This raises the verification burden rather than removing it.

Agentic research workflows

Coding and research agents (such as those in Claude Code, Cursor, and Windsurf) can execute multi-step tasks: searching literature, writing and running code, iterating on errors. The appeal is real, but so are the risks. Agents require explicit permission boundaries, sandboxed execution environments, source-boundary constraints (preventing the agent from citing material outside a defined corpus), and human review gates at each decision point. Unsupervised agent runs that modify data or submit outputs are not currently defensible in an academic context.

Practical example: A coding agent could be tasked with writing a Bayesian power analysis in R, running it, checking convergence diagnostics, and producing a summary table. The researcher must still review the model specification, prior choices, and interpretation before the output enters a protocol or manuscript.
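A human review gate can be as simple as a loop that refuses to continue past a flagged step without sign-off. The sketch below models only that control flow. It does not call any agent framework, and the `AgentStep` class, the `run_with_gates` function, and the stand-in reviewer are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AgentStep:
    description: str
    decision_point: bool = False  # True when a human must sign off first


def run_with_gates(steps: List[AgentStep],
                   approve: Callable[[AgentStep], bool]) -> List[Tuple[str, str]]:
    """Walk through planned steps, stopping at the first rejected gate."""
    log = []
    for step in steps:
        if step.decision_point and not approve(step):
            log.append(("blocked", step.description))
            break  # nothing after a rejected gate is executed
        status = "approved" if step.decision_point else "auto"
        log.append((status, step.description))
    return log


plan = [
    AgentStep("Draft R script for the power analysis"),
    AgentStep("Run script in sandbox"),
    AgentStep("Adopt prior choices in the protocol", decision_point=True),
    AgentStep("Write summary table into the manuscript", decision_point=True),
]


def cautious_reviewer(step):
    """Stand-in for a real person: rejects anything touching the manuscript."""
    return "manuscript" not in step.description


audit_log = run_with_gates(plan, cautious_reviewer)
for status, description in audit_log:
    print(f"{status:9s} {description}")
```

In real use the `approve` callable would pause and wait for a named researcher, and the returned log would be stored as part of the audit trail.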

Demo: Claude Code building SecurXamine — an agentic coding workflow with human review gates.

Demo: OpenAI Codex generating a Bayesian 3D visualisation as an agentic task.

Fine-tuning and low-rank adaptation (LoRA)

LoRA allows specialisation of a foundation model for a narrow task — such as evaluating statistical claims in academic prose or classifying methodological frameworks — without the cost or data requirements of full fine-tuning. A LoRA adapter trained on 100 annotated examples can meaningfully shift model behaviour on a focused task. However, dataset quality determines everything: garbage in, confidently wrong garbage out. Licensing of the base model, evaluation against held-out test sets, and monitoring for distributional drift over time all matter. Fine-tuning is powerful and accessible, but it is not a shortcut to a reliable domain expert.

Practical example: A LoRA adapter trained on annotated statistical claims (correct vs. overclaimed vs. under-reported) can be applied to a small open-weight model to produce a manuscript screening tool. The adapter adds domain specificity; the base model provides language capability. The researcher must still validate the tool against known-good and known-bad examples before trusting its output.
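The cost argument for LoRA comes down to simple arithmetic. In the standard formulation, a frozen weight matrix of shape d_out × d_in is left untouched, and the adapter learns two small matrices of shapes d_out × r and r × d_in. The sketch below counts trainable parameters for one such layer. The layer size and rank are hypothetical round numbers, not taken from any particular model.

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning versus a LoRA adapter
    on a single weight matrix of shape (d_out, d_in)."""
    full = d_out * d_in               # every weight is trainable
    adapter = rank * (d_out + d_in)   # B is d_out x r, A is r x d_in
    return full, adapter, adapter / full


# Hypothetical layer: a 4096 x 4096 projection with adapter rank 8.
full, adapter, ratio = lora_parameter_counts(4096, 4096, 8)
print(f"full fine-tuning: {full:,} trainable weights")
print(f"LoRA adapter:     {adapter:,} trainable weights ({ratio:.2%} of full)")
```

For this hypothetical layer the adapter trains 65,536 weights against 16,777,216, well under one percent. The saving is what makes adaptation feasible on modest hardware; it does nothing to reduce the need for good training data and held-out evaluation.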

Demo: Fine-tuning a small thinking model (Ouro) with LoRA adapters for statistical claim evaluation.

Beyond standard transformer architectures

Current transformer limitations — finite context windows, no persistent memory, limited planning — are active research frontiers. Future systems may combine symbolic reasoning, retrieval, planning modules, and model-based generation. Mixture-of-experts architectures (already deployed in models like Qwen and Gemini) improve efficiency by activating only relevant subnetworks for a given input. State-space models and recurrent alternatives may reduce the quadratic cost of attention on long sequences. None of these architectural advances will eliminate the need for researcher verification; they will change the shape of the errors rather than removing them.

Primary and official links

All links were checked for this build in May 2026. Where a source is behind a paywall, the DOI or arXiv preprint is provided.

Research method and risk

Tools and local AI

  • Elicit: AI-assisted literature review and evidence extraction.
  • Scite: Citation context analysis: supporting, contrasting, mentioning.
  • Rayyan: Systematic review screening and management.
  • Julius AI: Sandboxed data analysis with R and Python.
  • Hugging Face: Model hub, model cards, community Spaces, and datasets.
  • Duck.ai: DuckDuckGo's anonymous, proxied AI chat. No account required.
  • Lumo (Proton): Zero-access encrypted AI on Proton-controlled European servers.
  • Brave Leo: Browser-integrated AI with privacy-focused request handling.
  • Okara: Encrypted multi-model AI workspace with client-side key generation.
  • Ollama: Local model server for macOS, Linux, and Windows.
  • LM Studio: Desktop application for running quantised open models.
  • Open WebUI: Browser-based interface for Ollama and other backends.
  • AnythingLLM: Local RAG and document Q&A platform.