The Calculator Discipline: catching AI-assisted disclosure hallucinations before they reach a maintainer’s inbox
A four-class taxonomy of failure modes, a pre-send filter that catches the mechanical two, and two real withdrawals from the practice’s own work — including the one Theo de Raadt asked the right question about.
This piece is the practice’s summary of The Calculator Discipline: A Taxonomy and Pre-Send Filter for AI-Assisted Vulnerability Disclosure Hallucinations, published 26 May 2026. The full paper, including the per-claim verification table for the OpenSMTPD case study and the source-level discussion of the four verifiers, sits on the lead researcher’s personal site. The summary below is intended for triage teams, bug-bounty programme managers, and other independent researchers who want the headline argument without the full apparatus.
The field-level problem
AI assistance has made source-code review cheap and, like every productivity multiplier in the history of engineering, it has therefore made being wrong cheap. The most visible symptom is the open-source community’s response: Daniel Stenberg of the curl project coined “death by a thousand slops” in July 2025, and by January 2026 had ended curl’s HackerOne bug-bounty programme on the basis that AI-generated noise had made the triage workload unsustainable. Press coverage from BleepingComputer, Help Net Security, The New Stack and The Register has followed the same arc. Stenberg’s more recent posts note that the slop rate has fallen and the quality of AI-assisted reports has risen; the problem is not unsolvable, but the discipline-shaped hole in the methodology is real and worth naming.
The public conversation has been almost entirely from the receiving end. Maintainers complain about the volume; the researchers who produced it stay silent. That asymmetry is unhelpful. A failure mode that nobody owns publicly cannot be improved at the source. The practice’s methodology paper addresses that gap directly: the lead researcher is one of the people who shipped the slop, and the discipline described in the paper exists because the failure happened to him.
Two real withdrawals and one near-miss
The paper presents three cases from the practice’s 2026 OpenBSD work:
- bgpd community_ext_add (withdrawn). A report sent to bugs@openbsd.org on 2026-05-24 claiming a fixed-buffer overflow in a function that does not have a fixed buffer; it grows its array via reallocarray(). The report also cited “22 unique SIGSEGV crashes” from an AFL run that did not exist (the AFL output on the researcher’s system was zero crashes, against an unrelated target), and referenced a bgp_poc.py file that had never been written. Withdrawn the following day.
- OpenSMTPD six-claim chain (corrected). A disclosure sent to security@openbsd.org on 2026-05-23 listing six findings and framing them as a chain that bypassed ASLR and RETGUARD to achieve remote code execution. Theo de Raadt replied with a single pointed question: whether the researcher was actually claiming to have exploited the chain. The honest answer was no. Per-claim verification against current OpenBSD 7.8 source revealed two findings entirely fabricated, four real but with severity inflated by one to three steps, and zero RCE. The corrected reply went out 2026-05-26.
- rpki-client queue_add_from_cert (caught pre-send). An AI-assisted triage candidate flagged as a one-byte heap out-of-bounds read. A third read traced the populator of cert->mft back through cert.c and showed the length-equal collision the candidate required was structurally impossible: the parse-time validator rejects any cert that would have produced it. Also: the prior-art agent had hallucinated two commit hashes attributed to a real OpenBSD developer who had no involvement. Both failures caught before the report went out.
The taxonomy
Across the three cases, four distinct failure modes appear. Each has a corresponding catch mechanism, and the mechanisms divide cleanly into mechanical (toolable, deterministic) and judgement-shaped (requires human eye on the threat model). The full table from the paper:
| Class | Description | Catch mechanism |
|---|---|---|
| C1 Bug-shape fabrication | The bug pattern claimed does not exist in the code as it actually stands. Fixed buffer that is actually dynamically grown; OOB on a path the code guards; UAF on a pointer the code clears. | Mechanical. Grep for realloc/reallocarray/recallocarray in the cited function before drafting any “fixed-buffer overflow” claim. |
| C2 Evidence fabrication | The supporting evidence (AFL run, fuzz corpus, ASAN trace, PoC script, commit hash, prior-art reference) does not exist or does not match what is cited. | Mechanical. Resolve every cited artefact before citing it. Require afl_banner + exec count + timestamp for any “N crashes” claim. Require PoC paths to exist on disk. |
| C3 Severity inflation | The bug-shape and the evidence are real, but the chain-to-RCE, pre-auth-network, or otherwise headline-grade framing is fabricated. | Judgement-shaped. Per-claim CVSS with explicit threat-model fields. The phrase “chained to RCE” is reserved for chains demonstrated end-to-end, not merely posited. |
| C4 Trivial-as-critical | A real defect of negligible operational impact framed as critical, usually by appealing to a security boundary the trust model does not contain. | Judgement-shaped. Audience-aware severity calibration. Same-uid IPC is not a privsep boundary; trivial info-leaks within a single trust domain are hardening notes, not vulnerabilities. |
The classes are not exhaustive. Others (composability errors, misread atomicity guarantees, fabricated CVSS metric vectors) certainly exist. The four are offered as a starting vocabulary, not a closed system. Triage teams already track signal versus noise implicitly; making the noise classes explicit lets researchers self-check before sending.
The tool
The practice maintains hallucination_check.py (≈35 KB, BSD-2-Clause) as a pre-send filter against C1 and C2. Four verifiers were added on 2026-05-25 in direct response to the failures described above:
- bug_shape: flags “fixed-buffer” claims against functions that contain realloc/reallocarray/recallocarray.
- caller_bounds_gate: requires drafts citing size_t arithmetic and memcpy/memmove to contain a labelled caller-bounds analysis section. Structural, not semantic: it does not check whether the analysis is right, only that the author wrote one.
- afl_evidence: requires “N crashes” phrasing to be accompanied by afl_banner, exec count, and timestamp in the same paragraph; flags banner/target mismatches.
- poc_existence: resolves any cited PoC script path against the filesystem before allowing the draft to pass.
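A minimal sketch of how three of these verifiers might look, written for illustration and not taken from hallucination_check.py. The function names mirror the verifiers above; the regular expressions, the OK verdict, the execs_done field name (borrowed from AFL’s fuzzer_stats output) and the date-shaped timestamp pattern are all assumptions. The banner/target mismatch check is left out.

```python
import re
from pathlib import Path

# All patterns below are illustrative assumptions, not the tool's real rules.
FIXED_BUFFER_CLAIM = re.compile(r"fixed[- ]buffer", re.IGNORECASE)
GROWTH_CALLS = re.compile(r"\b(?:realloc|reallocarray|recallocarray)\s*\(")
CRASH_CLAIM = re.compile(
    r"\b\d+\s+(?:unique\s+)?(?:\w+\s+)?crash(?:es)?\b", re.IGNORECASE
)
EXEC_COUNT = re.compile(r"\bexecs_done\s*[:=]\s*\d+")
TIMESTAMP = re.compile(r"\b\d{4}-\d{2}-\d{2}")
POC_PATH = re.compile(r"[\w./-]+\.py\b")


def bug_shape(draft: str, function_source: str) -> str:
    """WRONG if the draft claims a fixed buffer but the cited function
    grows its storage with a realloc-family call."""
    if FIXED_BUFFER_CLAIM.search(draft) and GROWTH_CALLS.search(function_source):
        return "WRONG"
    return "OK"


def afl_evidence(draft: str) -> str:
    """WRONG if any paragraph makes an 'N crashes' claim without a banner,
    an exec count and a timestamp in that same paragraph."""
    for paragraph in re.split(r"\n\s*\n", draft):
        if not CRASH_CLAIM.search(paragraph):
            continue
        supported = (
            "afl_banner" in paragraph
            and EXEC_COUNT.search(paragraph)
            and TIMESTAMP.search(paragraph)
        )
        if not supported:
            return "WRONG"
    return "OK"


def poc_existence(draft: str, root: Path) -> str:
    """WRONG if the draft cites a .py PoC that is not a file under root."""
    for name in POC_PATH.findall(draft):
        if not (root / name).is_file():
            return "WRONG"
    return "OK"
```

Each check is deliberately cheap: a grep over the cited function, a paragraph-level co-occurrence test, and a filesystem lookup. None of them reads the code for meaning.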
The four were validated against three synthesised drafts and one known-good template. The withdrawn bgpd report, when reconstructed and run through the tool, produces WRONG verdicts on three of the four verifiers — which is the test that mattered.
The paper also reports a self-check: the tool was run against the paper itself and returned a global DIRTY verdict with four WRONG findings. On inspection, every finding is a true positive by the tool’s rules but a false positive in context, because the paper is describing failed claims rather than making them. That distinction is the tool’s honest limitation, named explicitly: future work will add a quoted-context flag that suppresses the verdict while preserving the audit trail.
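One way the planned quoted-context flag could behave is sketched below. This is a guess at a design that has not been built: the Finding record, the quoted field, the CLEAN verdict and the global_verdict function are hypothetical names introduced for illustration. The idea is that quoted findings stop contributing to the global verdict but are still returned in full.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    verifier: str
    verdict: str          # "WRONG" or "OK"
    quoted: bool = False  # hypothetical: claim is being described, not made


def global_verdict(findings: list[Finding]) -> tuple[str, list[Finding]]:
    """Return the global status plus the complete, unfiltered audit trail."""
    blocking = [f for f in findings if f.verdict == "WRONG" and not f.quoted]
    status = "DIRTY" if blocking else "CLEAN"
    # Quoted findings are suppressed from the status only; nothing is dropped.
    return status, list(findings)
```

Under this sketch, a paper that quotes four failed claims would come back CLEAN with all four WRONG findings still listed for a reviewer to inspect.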
What the tool does not catch
The rpki-client case in §4 of the paper is the explicit counter-example: a draft would have contained a fully-resolved file:line reference, no fixed-buffer claim, no AFL output, no PoC path, and a passing caller-bounds analysis that was wrong. The verifiers check structure, not correctness. The cross-function invariant trace that caught it was the work of a third human read with the explicit prompt “trace cert->mft back to its populator before claiming the OOB is reachable.”
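The limitation can be made concrete with a sketch of a purely structural gate. This is an illustration, not the code in hallucination_check.py: the section label “Caller-bounds analysis”, the OK verdict and the regular expressions are assumptions.

```python
import re

# Illustrative patterns; the real tool's rules may differ.
SIZE_T = re.compile(r"\bsize_t\b")
COPY_CALL = re.compile(r"\b(?:memcpy|memmove)\b")
SECTION = re.compile(r"^caller[- ]bounds analysis\b", re.IGNORECASE | re.MULTILINE)


def caller_bounds_gate(draft: str) -> str:
    """WRONG if the draft cites size_t arithmetic and a copy call but has no
    labelled caller-bounds section. The section's content is never read."""
    cites_risky_copy = SIZE_T.search(draft) and COPY_CALL.search(draft)
    if cites_risky_copy and not SECTION.search(draft):
        return "WRONG"
    return "OK"


claim = "The size_t length is passed to memcpy unchecked."
labelled = claim + "\n\nCaller-bounds analysis: no caller validates the length."

print(caller_bounds_gate(claim))     # WRONG: no analysis section present
print(caller_bounds_gate(labelled))  # OK, whether or not the analysis is true
```

The second draft passes on the presence of the heading alone. If a parse-time validator upstream already rejects the input, the analysis is false and the gate has no way of knowing.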
The wider gate in the practice’s workflow has five steps:
1. AI-assisted source review produces a draft candidate.
2. The candidate is independently checked against the actual current upstream source by a human read, not the original AI’s summary.
3. hallucination_check.py runs against the draft. WRONG verdicts block send.
4. A separate Council-of-LLMs review (multiple LLMs, fresh context, brief to disagree) reviews tone and per-claim severity.
5. Only after all four pass does the disclosure leave the outbox.
Step 3 is the only step the tool supplies. The other four are discipline. The slop rate will fall when the pre-send discipline becomes routine, not before.
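The gate can be written down as a small record, which makes the point about discipline visible: only one field is filled in by a tool. The GateResult type, its field names and may_send are hypothetical, introduced only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    candidate_drafted: bool       # step 1: AI-assisted review produced a draft
    human_source_check: bool      # step 2: human read against upstream source
    tool_verdicts: list[str] = field(default_factory=list)  # step 3: tool output
    council_review_passed: bool = False  # step 4: multi-LLM tone/severity review


def may_send(result: GateResult) -> bool:
    """Step 5: the disclosure leaves the outbox only if steps 1 to 4 all pass."""
    tool_clean = "WRONG" not in result.tool_verdicts
    return all(
        (
            result.candidate_drafted,
            result.human_source_check,
            tool_clean,
            result.council_review_passed,
        )
    )
```

A draft with clean tool verdicts still fails the gate if the human source check in step 2 was skipped, which is the rpki-client lesson in code form.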
Why the practice is publishing this
Three reasons.
First, the conversation needs senders, not just receivers. Maintainers complaining publicly is part of the picture; researchers admitting publicly is the part that has been missing. A taxonomy from someone who has shipped each failure class personally is, the practice believes, more useful than a taxonomy assembled from the receiving end.
Second, the tool is useful and cheap to adopt. The four verifiers are BSD-2-Clause and ship with the wider penfold/ tooling. Triage teams, bug-bounty programme managers, and independent researchers are welcome to adopt them as-is, extend them, or take the four-class taxonomy and build their own.
Third, the calculator analogy is the right one. AI is not going away. The corrective is not to ban it but to apply the discipline that every other productivity multiplier in engineering has eventually demanded. Check the units. Check the order of magnitude. Check the answer matches the real world. That is the case the paper is making, and the case TriageForge is making by publishing it.