CAPAS compiles supplied evidence into a claim admissibility decision. It does not determine truth — it decides, deterministically, whether the evidence licenses the claim: ACCEPT / REWRITE / REJECT / HOLD, with a re-derivable audit trail and no language model in the verdict. Your model proposes; CAPAS disposes. Built for regulatory reviewers, data & training-data engineers, and journal editors.
CAPAS checks whether your supplied evidence is internally consistent and re-derivable. It does not certify scientific truth, compliance, authorship, or the authenticity of the data — a consistent fabrication still passes (the GIGO ceiling).
python3 benchmarks/home_gate_decisions.py
CAPAS was designed from the ground up to make scientific claim evaluation auditable, repeatable, and integration-ready.
The same claim + evidence bundle always produces the same verdict. No probabilistic outputs, no model drift, no hallucinations. Schema v3 rules are the single source of truth.
No randomness
Every payload is validated against CAPAS Schema v3 before gating. Field types, required keys, and claim-type-specific constraints are enforced at the gate boundary.
CAPAS Schema v3
Every REJECT and HOLD verdict includes machine-readable provenance blockers (licensing flags, reproducibility gaps, attestation failures), exportable as structured JSON.
Audit-ready
CAPAS does not use language models for gate decisions. Verdicts are computed by deterministic rule functions (26 cross-domain invariant gates plus per-claim-type evidence contracts) against a structured contract, fully offline-capable. Every decision carries a re-derivable audit_hash.
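A minimal sketch of how a re-derivable hash of this kind could work, assuming it is a SHA-256 digest over a canonical JSON encoding of the payload and verdict. The function name `audit_hash` and the record fields are hypothetical illustrations, not the CAPAS implementation.

```python
import hashlib
import json


def audit_hash(payload: dict, verdict: str) -> str:
    """Hash a payload and verdict over a canonical JSON encoding (hypothetical scheme)."""
    record = {"payload": payload, "verdict": verdict}
    # sort_keys and fixed separators make the bytes independent of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = audit_hash({"claim": "X reduces Y", "n": 120}, "HOLD")
b = audit_hash({"n": 120, "claim": "X reduces Y"}, "HOLD")  # same content, different key order
c = audit_hash({"claim": "X reduces Y", "n": 121}, "HOLD")  # one field tampered
print(a == b, a == c)  # True False
```

Under this assumption, any party holding the same payload can recompute the digest, and a changed field produces a different one.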
From claim draft to admissibility decision in one deterministic pass.
Every gate run returns exactly one of these. Each carries a machine-readable reason trail.
ACCEPT: All schema constraints pass. Evidence fully supports the claim as stated. No licensing or reproducibility blockers found.
REWRITE: Evidence is present but the claim overreaches. Direction, scope, or causal language must be corrected before resubmission.
REJECT: Critical evidence is missing, contradicted, or irreproducible. The claim cannot be licensed in its current form.
HOLD: Evidence is incomplete or no external oracle is available; the gate fail-closes to HOLD instead of guessing.
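The four outcomes above can be sketched as a toy gate that returns one verdict plus a reason trail. This is an illustration only: the evidence keys (`data_available`, `supports_claim`, `overreaches`) and the reason strings are hypothetical, and the real rule set is defined by CAPAS Schema v3, not by this code.

```python
from __future__ import annotations

from enum import Enum


class Verdict(Enum):
    ACCEPT = "ACCEPT"
    REWRITE = "REWRITE"
    REJECT = "REJECT"
    HOLD = "HOLD"


# Hypothetical evidence keys, chosen only for this sketch.
REQUIRED = ("claim", "data_available", "supports_claim", "overreaches")


def gate(evidence: dict) -> tuple[Verdict, list[str]]:
    """Return exactly one verdict and a machine-readable reason trail."""
    missing = [key for key in REQUIRED if key not in evidence]
    if missing:
        # Fail-closed: incomplete evidence is never guessed into ACCEPT.
        return Verdict.HOLD, [f"missing_field:{key}" for key in missing]
    if not evidence["data_available"] or not evidence["supports_claim"]:
        return Verdict.REJECT, ["evidence_missing_or_contradicted"]
    if evidence["overreaches"]:
        return Verdict.REWRITE, ["claim_exceeds_evidence_scope"]
    return Verdict.ACCEPT, []


verdict, reasons = gate({"claim": "Drug A lowers LDL"})
print(verdict.value, reasons)
```

The function has no randomness and no external calls, so the same dictionary always yields the same verdict and reasons.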
CAPAS is structured, deterministic admissibility gating for scientific claims — a verdict you re-derive to the same hash, not a score from a model.
Not a fact-checker — CAPAS doesn't verify truth, it validates evidence licensing and schema compliance
Not a plagiarism detector — citation matching is not the concern; admissibility is
Not an LLM wrapper — zero language model calls in the gate decision path
Not a peer review replacement — CAPAS is a pre-gate boundary tool, not editorial judgment
A deterministic claim admissibility gate — same input always produces the same output
A schema enforcement layer — CAPAS Schema v3 defines exactly what a valid evidence package looks like
A provenance audit tool — every blocker is machine-readable and exportable
An integration-ready API — designed to sit inside research pipelines, publishing workflows, and audit systems
Frontier models and agents are built to generate claims. The systems next to them each gate only a slice: LLM judges assess truth (stochastically, not replayably), fact-checkers chase truth rather than the claim boundary, and domain validators like Pinnacle 21 cover one field. Across the systems we surveyed we found none that gate, deterministically and replayably, whether an arbitrary claim is licensed by its evidence before reuse. CAPAS starts there.
| System | Generates claims | Gates claims |
|---|---|---|
| GPT · Claude · Gemini | ✓ | ✗ |
| Deep-research agents | ✓ | ✗ |
| LLM-as-judge | ✗ | stochastic · no guarantee, not replayable |
| Fact-checkers | ✗ | truth, not boundary |
| CAPAS | ✗ | ✓ deterministic · replayable · fail-closed (test-proven) |
Without a claim-level gate, drift survives source review, metadata review, and provenance review — because none of them evaluate the claim boundary itself.
A fabricated claim still has to satisfy the conservation laws of its field, and that is re-derivable with no oracle. The same engine catches a balance sheet that doesn’t close, a survey mean that’s arithmetically impossible (GRIM), a 99%-sensitive test claimed to imply a 99% PPV for a rare disease (the base-rate fallacy), an unbalanced reaction, a dimensionally inconsistent equation, and a qubit with T2 > 2·T1. Any declared number that breaks a domain law forces REJECT. The rule is downgrade-only, so it can only make a verdict stricter. Fail-closed is a proven invariant (18/18 structurally-deficient claims rejected, locked by a test), not a number.
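Three of these checks are textbook arithmetic and can be sketched in a few lines. The function names, the 99% specificity, and the 0.1% prevalence below are assumptions made for illustration; the source gives only the 99% sensitivity and does not show how CAPAS implements its invariants.

```python
from __future__ import annotations


def grim_consistent(mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM: can any integer total over n integer responses round to the reported mean?"""
    target = round(mean, decimals)
    approx = int(mean * n)
    # For n < 10**decimals, only totals next to mean * n can match.
    # Caveat: Python's round() is not always half-up, so exact .5 ties may need care.
    return any(round(total / n, decimals) == target
               for total in range(max(approx - 1, 0), approx + 3))


def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def t2_physical(t1: float, t2: float) -> bool:
    """Relaxation bound: T2 cannot exceed 2 * T1."""
    return t2 <= 2.0 * t1


def apply_invariants(verdict: str, violations: list[str]) -> str:
    """Downgrade-only: a violation forces REJECT, otherwise the verdict is unchanged."""
    return "REJECT" if violations else verdict


violations = []
if not grim_consistent(5.19, 28):                 # no integer total over 28 gives 5.19
    violations.append("grim_impossible_mean")
if abs(ppv(0.99, 0.99, 0.001) - 0.99) > 0.005:    # claimed PPV of 0.99 at 0.1% prevalence
    violations.append("ppv_ignores_base_rate")
if not t2_physical(t1=50.0, t2=120.0):
    violations.append("t2_exceeds_2_t1")
print(apply_invariants("ACCEPT", violations), violations)
```

With the assumed specificity and prevalence, the recomputed PPV is roughly 9%, far from the claimed 99%, which is why the base-rate check fires.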
Above the schema gate, CAPAS re-computes the claimed result from its raw inputs and only GATEs — marks re-derived — what it can reproduce. What it cannot re-derive it ATTESTs: signed and bound, never marketed as verified. The boundary is explicit on every receipt. No language model is ever in the decision path.
GATE = re-derived and reproducible. ATTEST = signed, not verified. CAPAS certifies computational consistency of the re-derivable slice — not scientific truth, and not the authenticity of the raw data (the irreducible GIGO residual). It re-derives more than it trusts, and says exactly which is which.
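The GATE and ATTEST split can be illustrated with a toy re-derivation of a claimed mean. This is a sketch under stated assumptions: the function `classify`, the `MISMATCH` label, and the tolerance are hypothetical, and the real engine re-derives far more than a mean.

```python
from __future__ import annotations

import math


def classify(claimed_mean: float, raw_values: list[float] | None = None,
             tol: float = 1e-9) -> str:
    """Label a claimed mean by whether it can be recomputed from raw inputs."""
    if not raw_values:
        # Nothing to recompute from: the value can only be attested, not verified.
        return "ATTEST"
    recomputed = sum(raw_values) / len(raw_values)
    if math.isclose(recomputed, claimed_mean, abs_tol=tol):
        return "GATE"      # re-derived and reproducible
    return "MISMATCH"      # raw inputs contradict the claimed value


print(classify(2.5, [1, 2, 3, 4]))   # GATE
print(classify(2.5))                 # ATTEST
print(classify(3.0, [1, 2, 3, 4]))   # MISMATCH
```

Note that a GATE label here says only that the arithmetic reproduces from the supplied values; it says nothing about whether those values are authentic.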
Frontier models produce text; CAPAS produces an auditable evidence trail — and that is what a regulated buyer actually purchases. Every verdict is operational, not just a label: ACCEPT licenses reuse · REWRITE returns the corrected claim with an Original→Licensed diff · REJECT names the missing evidence fields · HOLD lists the obligations to resolve before reuse.
CAPAS is a fail-closed gate: it disposes a verdict on the structure of the evidence you submit, with no language model in the decision path. Every claim below carries the command that regenerates it — clone the repo and run them. A claim you can’t re-derive isn’t a claim, it’s a slogan.
python3 benchmarks/verify_fail_closed.py
python3 benchmarks/test_dynamic_fuzz.py
python3 benchmarks/verify_robustness.py
python3 benchmarks/verify_hold_has_resolution.py
python3 benchmarks/conformance.py
python3 benchmarks/demo_sequential_typeI.py
python3 benchmarks/verify_audit_hash_reproduces.py
python3 benchmarks/attest_conformance.py
python3 benchmarks/generate_capability_matrix.py
capas.py.
python3 benchmarks/verify_gate_contracts_match.py
python3 benchmarks/family_decision_mix.py
python3 benchmarks/pilot_real.py
python3 benchmarks/generate_pharma_corpus.py
The GIGO ceiling is real and we do not hide it. CAPAS gates the structure of evidence, not ground truth: a self-consistent, well-formed, fabricated payload can pass the gate. Our own fuzz and pedagogy-governance tests measure and report this residual (a disclosed false-admit on the GIGO-ceiling class). The gate raises the cost of lying and makes every verdict re-derivable; it does not detect a careful liar.
python3 benchmarks/study_assembly.py), but the n≥500 human-adjudicated comparison is pre-registered, not yet executed.
If any command above does not reproduce its stated result on your machine, that is a bug; tell us.
28 famous claims — every one of them passed peer review and was published in Nature, Science, or The Lancet. 14 were later retracted (Wakefield, Surgisphere, Schön, STAP…); 14 were independently replicated (LIGO, Higgs, RECOVERY dexamethasone…). Plausibility could not tell them apart — all 28 looked publishable. The gate separated them by structure.
Each fraud was gated for its actual structural deficiency — no controls, no independent reproduction, unauditable data — the same gaps it was retracted for. Honest scope: an illustrative retrospective whose corpus is coded from public retraction records (Retraction Watch, journal notices); it validates the gate’s structural logic, not fraud-detection from raw paper text. Partner pilots pending.
From academic publishing to enterprise AI governance — any workflow that touches scientific claims benefits from a deterministic admissibility gate.
Gate submitted manuscripts at the desk-review stage. Catch inadmissible claims before peer review consumes reviewer time.
Validate claim-evidence pairs before they enter training datasets. Prevent inadmissible or unlicensed scientific claims from corrupting model training.
Generate structured audit trails for claims in regulated industries. Every verdict is machine-readable, timestamped, and exportable.
Run batch evaluations across a corpus of claims. Identify systemic evidence gaps before a full production deployment.
Triage incoming claims automatically. Surface only those with sufficient evidence structure for human fact-checker review.
Embed CAPAS gate calls directly into your existing research pipeline, CMS, or editorial system. JSON in, structured verdict out.
An installable verification layer: your model proposes, CAPAS disposes. It never lets an unsupported claim through as ACCEPT — it re-derives what is re-derivable, grades the rest, and emits a verifiable reward your model can’t game by sounding right. No language model in the verdict.
python3 benchmarks/head_to_head_sota.py
The boost is reliability, not capability — the model doesn’t get smarter, its output becomes admissible-or-deferred. Honest scope: it grounds record↔text, not text↔reality — a source that lies about its methods and withholds its data passes (the GIGO ceiling), so CAPAS says exactly which slice it re-derived and which it could only attest.
The same deterministic engine, exposed three ways. CAPAS is never the language model — it is the fail-closed layer the model proposes into.
Wrap your LLM in code. gate · reward · certificate · invariants · gate_quantum.
A tool any agent calls — Claude Code, Desktop, Cursor. Zero dependencies. The agent proposes, CAPAS disposes.
Hosted, auth-gated issuance of a signed, persisted, tamper-evident admissibility certificate — the audit artifact a regulated buyer purchases.
The same pattern — invariant checks + threshold gates + a fail-closed verdict + a disclosed boundary — is what IBM’s production calibration system is. The architecture isn’t speculative: a hardware vendor runs it at scale.
IBM’s headline gate-error figure is an optimistic lower bound — for real circuits it under-states by 3–10× (Proctor, Nat. Phys. 2022). From the same published calibration fields, CAPAS re-derives the complete error budget — fully auditable, no hardware required.
Honest scope: the 1.9×10⁻² worst case shown is ibm_fez q9–q10, re-derived term-by-term from the vendor’s published fields (the 3–10× structured-circuit band is a cited Proctor range, not our finding). The live leg is a separate device: on ibm_kingston, CAPAS re-found the chip’s one anomalous qubit (Q121) and its bad-coupler cluster from calibration alone, and admitted a real Bell measurement — 0.045 vs a re-derived floor of 0.020, ≈2.2×, under two independent oracles. The arithmetic is textbook and copyable; the fail-closed discipline that refuses the optimistic headline as admissible is the part a third party can re-run. Full method →
IBM’s quantum stack will not run your circuit until it clears a calibration gate: every job checked against frozen, re-derived device invariants, fail-closed, with no model-of-the-day in the decision. That is exactly the CAPAS architecture: re-derive from declared evidence, refuse on violation, keep the verdict deterministic. It already runs in production, at the frontier of physics. CAPAS generalizes the same mechanism across ten domains, including finance, statistics, epidemiology, chemistry, physics, and quantum. Two independent systems converging on the same admissibility mechanism is consilience: evidence the design is structural, not a pitch.
Honest scope: the identities are textbook and the convergence is architectural, not a partnership claim. And the mechanism itself is the open Apache-2.0 engine — re-derivable, therefore copyable — so the consilience validates the design, it is not the moat. What is defensible is the cross-domain composition plus the self-run conformance mark and signed certificate; on an unchanging-then-reported-operational calibration value, CAPAS applies the more conservative, fail-closed disposition. (Independent of IBM — not affiliated or endorsed; observations from public open-plan metadata.)
Not a slide — re-derive it yourself: examples/kingston_live_audit.py audits the live device; benchmarks/kingston_real_bell_verdict.json gates a real Bell measurement against IBM’s own calibrated noise model (two independent oracles agree). Open engine, Apache-2.0: the method is fully inspectable and the verdict re-derivable by anyone. The defensible asset isn’t the copyable engine — it’s the self-run conformance mark and the signed certificate, under a governance charter that pre-commits the mark to neutral governance — binding in direction, drafted for irrevocability, not yet legally executed (the trustee is an open item).
Any one gate is copyable. What is hard to copy is trust you can audit — and CAPAS is built so trust is earnable, not asserted. Every verdict re-derives to a hash an independent party can reproduce, and tampering diverges; every headline claim is CLOSED / BACKED / SCOPED in a public ledger that discloses which numbers are synthetic; the mark attests only that an artifact passed a suite you run yourself for the same verdict and the same hash, issued as a signed, content-addressed certificate. A competitor can copy a gate in a week; it cannot copy a self-run conformance mark, a signed-certificate audit trail, and a relicensing posture renounced in writing. That is the moat — the auditable standard around the tool. Whether it becomes the reference standard depends on third parties actually challenging it: that adjudication is open, not yet run.
CAPAS ships open-core (Apache-2.0): the schema, calculus, reference gate, CLI, tests, and benchmark corpus are yours to run and fork. The defensible asset is not the code — it’s the certification mark, and a mark is only worth trusting if it can’t be pulled. So the mark is reserved and pre-committed to neutral governance before adoption, not after — the one move that let Open Policy Agent survive its sponsor’s acquisition while MongoDB, Elastic, HashiCorp, and Redis each triggered a fork by relicensing their core to capture value. We renounce that move in writing. Governance charter →
Conformance is self-runnable and deterministic — python3 benchmarks/conformance.py runs the exact suite the certifier runs and returns the same verdict and the same hash. No private process to trust. The mark attests an artifact passed that. Certification & how to certify →
Each badge links to a verifiable source — a passing workflow, a published score, a signed release, or a re-derivable hash. We don’t display a certification we haven’t earned.
Quantum-calibration credential: CAPAS gates reported quantum-device claims against textbook invariants. Run live over IBM’s 156-qubit ibm_kingston calibration, it re-found only the genuine anomalies (Q121, the bad-coupler cluster) with 0 false flags. This is architectural consilience with IBM’s own admissibility engine, not a partnership or endorsement. The IBM consilience →
OpenSSF Best Practices Badge: passing (100%) — self-cert + static/dynamic-analysis criteria met. The badge hotlinks the live bestpractices.dev project, so it verifies independently and updates itself.
Pinnacle 21 already checks whether a trial dataset is structurally well-formed (CDISC conformance). It does not check whether the reported statistic is licensed by its evidence. CAPAS does: significance versus alpha, multiplicity, confidence-interval-excludes-null, effect direction, endpoint pre-specification — re-derivably, beside the submission, not as a replacement. Validated on a 3,024-case synthetic admissibility corpus, 0 deficient claims accepted (fail-closed) — contract coverage of the space P21 skips, not a production false-accept rate on real submissions. Market validation →
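The checks named above can be sketched as a function that returns a list of blockers for a reported statistic. Everything here is illustrative: the `StatClaim` fields and blocker strings are hypothetical, and Bonferroni is used only as a simple stand-in, since the source does not say which multiplicity rule CAPAS applies.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StatClaim:
    p_value: float
    alpha: float
    n_comparisons: int       # number of endpoints tested (for multiplicity)
    ci_low: float
    ci_high: float
    null_value: float        # e.g. 0 for a difference, 1 for a ratio
    effect: float
    claimed_direction: str   # "increase" or "decrease"
    prespecified: bool


def stat_blockers(c: StatClaim) -> list[str]:
    """Return the reasons a reported result is not licensed by its own numbers."""
    blockers = []
    if c.p_value >= c.alpha / c.n_comparisons:   # Bonferroni-adjusted threshold
        blockers.append("not_significant_after_multiplicity")
    if c.ci_low <= c.null_value <= c.ci_high:
        blockers.append("ci_includes_null")
    observed = "increase" if c.effect > c.null_value else "decrease"
    if observed != c.claimed_direction:
        blockers.append("direction_mismatch")
    if not c.prespecified:
        blockers.append("endpoint_not_prespecified")
    return blockers


claim = StatClaim(p_value=0.03, alpha=0.05, n_comparisons=3, ci_low=0.1, ci_high=0.9,
                  null_value=0.0, effect=0.5, claimed_direction="increase",
                  prespecified=True)
print(stat_blockers(claim))  # ['not_significant_after_multiplicity']
```

In this example the result clears the unadjusted alpha of 0.05 but not the adjusted threshold of about 0.0167, so a gate built this way would not accept it as stated.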
Load a sample payload or build your own evidence contract in under two minutes. No account required for the pilot.