Methods and provenance

How the AI-style comparison profiles were constructed

The Writing Register Diagnostic Tool does not attempt to determine whether a document was written by AI. Instead, it compares selected measurable features of a submitted document with task-specific reference profiles derived from a documented corpus of AI-generated writing.

Purpose of the comparison

The purpose is descriptive rather than forensic: to help users reflect on how a text compares with recurrent patterns observed in a curated AI-output corpus.

Corpus construction

The corpus was constructed by prompting several AI systems to complete the same writing tasks. These tasks were grouped into four broad categories: academic explanatory or argumentative writing, empirical summary, feedback or evaluative writing, and professional email writing.

Outputs were collected from multiple systems, including frontier cloud models and a wider set of hosted and open models. The corpus was quality-checked before analysis. Obvious wrong-task responses, truncations and encoding artefacts were either excluded or documented within a cleaned derivative corpus, while the raw files were retained separately.

Metrics

The tool uses a small set of interpretable style metrics rather than opaque classification procedures. Current public metrics include article-opening sentence patterns, hedges per 1,000 words and sentence-length variability.
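
As an illustration of how two of these metrics might be computed, the sketch below counts hedges per 1,000 words and measures sentence-length variability. It is not the tool's implementation: the hedge list, the regular-expression sentence splitter and the choice of sample standard deviation as the variability measure are all assumptions made for the example.

```python
import re
import statistics

# Hypothetical hedge lexicon for illustration; the tool's actual list is not documented here.
HEDGES = {"may", "might", "could", "perhaps", "possibly", "suggests", "appears", "seems"}


def words(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[A-Za-z']+", text.lower())


def hedges_per_1000(text):
    """Hedge tokens per 1,000 words; returns 0.0 for empty input."""
    tokens = words(text)
    if not tokens:
        return 0.0
    count = sum(1 for token in tokens if token in HEDGES)
    return 1000.0 * count / len(tokens)


def sentence_lengths(text):
    """Word count of each sentence, splitting naively after '.', '!' or '?'."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(words(s)) for s in sentences]


def sentence_length_sd(text):
    """Sample standard deviation of sentence lengths (assumed variability measure)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths)
```

Normalising the hedge count per 1,000 words keeps the figure comparable across documents of different lengths. A naive splitter of this kind will miscount sentences that contain abbreviations or decimal numbers.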

Additional contextual metrics, such as type-token ratio and paragraph length, are retained cautiously because they are sensitive to document length and genre variation. These measures should therefore be interpreted as style indicators rather than evidence of authorship or writing quality.
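
The length sensitivity of type-token ratio can be seen in a deliberately artificial sketch: repeating the same passage adds tokens but no new word types, so the ratio falls even though the style is unchanged. The sample sentence and function name are hypothetical.

```python
def type_token_ratio(tokens):
    """Distinct word types divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical sample passage, already tokenised on whitespace.
sample = "the model writes the summary and the reader checks the summary".split()

short_ttr = type_token_ratio(sample)      # one copy of the passage
long_ttr = type_token_ratio(sample * 3)   # same vocabulary, three times the length
```

Real documents do not repeat verbatim, but the same tendency holds more gently: longer texts reuse vocabulary, so raw type-token ratios from documents of different lengths are not directly comparable.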

Bayesian modelling

The AI-style reference profiles were estimated using Bayesian regression models fitted with brms, an R package that provides an interface to Stan for Bayesian generalised, non-linear and multilevel modelling. The brms framework supports multilevel modelling through formula syntax similar to lme4, which made it possible to account for variation across both writing tasks and AI systems where appropriate.

Two reference scopes were retained. The default public comparison uses a frontier-model reference set because this produced cleaner and more differentiated task profiles. A broader all-models reference set is retained as an exploratory option, although it is not treated as a stronger benchmark merely because it includes a larger number of systems.

For each metric, several candidate model structures were compared, including fixed-additive models, partial-pooling models across AI systems and, where appropriate, length-adjusted specifications. The final public profiles favour simpler and more stable model structures. For example, sentence-length variability is modelled using a partial-pooling intercept specification rather than a more complex random-slope structure because the latter introduced avoidable diagnostic concerns within the broader model set.

The Bayesian summaries distinguish between the estimated profile centre and the plausible document-level band. The profile centre is based on posterior expected values, whereas document-level bands use posterior predictive summaries. This distinction matters because posterior_epred() returns draws of the expected value only, so its intervals reflect uncertainty about the profile centre, while posterior_predict() also incorporates residual or document-level variation. The latter is therefore more appropriate when assessing whether a submitted document appears typical or unusual relative to the reference corpus.
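
The difference between the two kinds of summary can be illustrated with simulated draws. The sketch below does not call brms or use the project's fitted models. It assumes a Gaussian likelihood and invents posterior draws for a mean and a residual standard deviation, purely to show why a predictive band is wider than an interval for the expected value.

```python
import random


def interval(draws, lower=0.025, upper=0.975):
    """Simple empirical quantile interval over a list of draws."""
    ordered = sorted(draws)
    n = len(ordered)
    return ordered[int(lower * (n - 1))], ordered[int(upper * (n - 1))]


rng = random.Random(1)
n_draws = 4000

# Hypothetical posterior draws (assumed values, not taken from the real profiles).
mu_draws = [rng.gauss(20.0, 0.5) for _ in range(n_draws)]          # expected value
sigma_draws = [abs(rng.gauss(4.0, 0.3)) for _ in range(n_draws)]   # residual SD

# Analogue of expected-value draws: uncertainty about the profile centre only.
epred_draws = mu_draws

# Analogue of predictive draws: one simulated document per posterior draw,
# adding residual variation on top of the expected value.
predict_draws = [rng.gauss(m, s) for m, s in zip(mu_draws, sigma_draws)]

centre_band = interval(epred_draws)
document_band = interval(predict_draws)
```

In this simulation the document-level band contains the centre interval and is several times wider. Judging a single document against the narrower centre interval would flag many ordinary documents as unusual.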

Models were evaluated using standard Bayesian workflow diagnostics, including convergence, effective sample size, divergent transitions, treedepth warnings and posterior predictive checks. Where leave-one-out cross-validation was used, Pareto-k diagnostics were interpreted cautiously. High Pareto-k values indicate that the PSIS-LOO approximation may be unreliable, so LOO comparisons affected by them should not be treated mechanically as a basis for model selection.
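
A cautious reading of Pareto-k values might begin by listing which observations exceed a warning threshold before any model comparison is trusted. The sketch below is an assumption-laden illustration: the function name and example values are hypothetical, and the 0.7 cut-off follows the convention commonly cited in the loo package documentation rather than a threshold confirmed for this project.

```python
def flag_high_pareto_k(k_values, threshold=0.7):
    """Return the indices of observations whose Pareto-k exceeds the threshold."""
    return [i for i, k in enumerate(k_values) if k > threshold]


# Hypothetical per-observation Pareto-k estimates.
example_k = [0.10, 0.75, 0.40, 1.20]
flagged = flag_high_pareto_k(example_k)
```

Flagged observations are a prompt to inspect the model or the data points concerned, not an automatic reason to prefer or reject a model.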

The final public tool therefore reports whether a submitted document falls within, somewhat outside, or substantially outside the document-level band for selected metrics. These labels describe similarity to a corpus-derived AI-style profile on specific measured features. They do not indicate the probability that a text was written by AI.
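
One way such a three-level label could be derived is sketched below. The rule used by the public tool is not specified here, so the margin-based cut-off, its default value and the function name are hypothetical choices made only to show the shape of the logic.

```python
def classify_against_band(value, band_low, band_high, margin_fraction=0.5):
    """Label a metric value relative to a document-level band.

    Assumed rule: values outside the band by no more than
    margin_fraction * band width count as 'somewhat outside'.
    """
    if band_low > band_high:
        raise ValueError("band_low must not exceed band_high")
    if band_low <= value <= band_high:
        return "within"
    width = band_high - band_low
    distance = band_low - value if value < band_low else value - band_high
    if distance <= margin_fraction * width:
        return "somewhat outside"
    return "substantially outside"
```

For example, with a band of 4 to 8, a value of 9 would be labelled "somewhat outside" and a value of 11 "substantially outside" under these assumed settings. The label describes distance from the band on one metric and carries no probability of authorship.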

What the tool does not do

The tool does not detect AI writing, classify authorship or detect plagiarism. It does not judge whether writing is good or bad. Nor does it provide universal averages of AI writing.

Model outputs vary across systems, prompts, tasks, genres, document lengths and subsequent human editing. The profiles should therefore be treated as provisional, task-specific comparison benchmarks rather than universal standards.

Privacy

The analysis is client-side. Text entered into the public tool is processed locally within the browser and is not sent to an external API by the application.

Suggested citation/provenance line

Profile version: curated public AI-style profiles, May 2026. Corpus-derived and Bayesian modelled; public app values are provisional.
