Governed-memory evidence spine.
These four core studies serve as the evidence spine for the FieldHash governed-memory architecture. They measure whether the pre-answer control plane governs what reaches the model before generation: promoted authoritative memory, state that survives lifecycle changes, seeded reviewed state under conflict, and audit controls that catch broken governance.
Automatic memory promotion
90/90
The authoritative memory is selected, promoted, and governed before generation.
Internal synthetic adversarial benchmark, not external validation. The strongest independence check is the Claude-authored disjoint n=30 corpus; the n=100 provider replications use a same-family Gemini-authored corpus and frozen semantic-label artifact. A later same-budget Gemini two-pass smart diagnostic on the n=30 corpus selected the current record 30/30 and answered 28/30 with zero stale substitutions, so the public claim should not be framed as beating every same-budget selector. The benchmark measures governed answer-path control under singleton-current memory conflict, not broad reasoning superiority, universal memory safety, model-weight learning, provider-invariant fact extraction, or perfect source-span extraction. Claude Opus 4.7 n=100 misses were empty provider responses rather than stale substitutions.
Governed learning lifecycle
2400/2400
Governed state survives update, rollback, repair, compaction, and repeated reads.
Internal diagnostic on a FieldHash-authored semantic-authority lifecycle corpus, not external validation, not base-model weight learning, and not a claim that FieldHash out-reasons frontier LLMs. Authority metadata is written by governed operations, so this does not prove authority inference. The claim is bounded by two controls: visible rollback is recoverable when present as retrievable text, and a full-operation-log baseline can replay the lifecycle. The supported claim is durable materialized governed state versus bounded read-time reconstruction.
Governed memory pressure
600/600
Seeded reviewed/current state is enforced under plausible stale context.
Internal seeded-authority memory-pressure benchmark; the scenario was generated by an external base model, then executed through the FieldHash harness with deterministic exact-substring scoring (binary match against the approved codeword, same matching code across all complete comparison runs). The current record carries pre-written FieldHash governance metadata such as reviewed/current authority state; retrieval-only and prompt-only baselines receive retrieved records as ordinary context without that structured metadata, approved-current precedence, supersession handling, Bayesian arbitration, or hub compression. This is a supported governed-stack enforcement claim, not a pure single-variable ablation and not proof that FieldHash infers authority from raw text better than a same-model selector. It measures approved-context preservation and what memory was allowed to shape the answer under seeded conflict, not broad reasoning superiority, universal memory safety, independent external validation, billing-token reduction, or database storage compression. The original n=200 run predates strict provider metadata; the refreshed Gemini, Claude Opus, and GPT provider comparisons include row-level provider/model guards and direct answer-path exposure telemetry.
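The deterministic exact-substring scoring described above can be sketched as follows. This is a minimal illustration rather than the FieldHash harness: the function names `exact_substring_score` and `pass_count`, the sample codewords, and the choice of case-sensitive matching with no normalization are all assumptions.

```python
def exact_substring_score(answer: str, approved_codeword: str) -> int:
    """Binary score: 1 if the approved codeword appears verbatim in the answer, else 0.

    Assumption: matching is case-sensitive and applies no normalization.
    """
    return 1 if approved_codeword in answer else 0


def pass_count(answers: list[str], approved_codeword: str) -> tuple[int, int]:
    """Apply the same matcher to every answer and return (passes, total)."""
    passes = sum(exact_substring_score(a, approved_codeword) for a in answers)
    return passes, len(answers)


# Hypothetical data: one approved codeword competing with a stale one.
answers = [
    "The current codeword is AMBER-17.",  # contains the approved codeword
    "Use the codeword COBALT-04.",        # stale substitution
    "",                                   # empty provider response
]
print(pass_count(answers, "AMBER-17"))  # (1, 3)
```

Because the matcher is a pure function of the answer text and the approved codeword, the same code can be reused unchanged across every comparison run, which is the property the caveat above relies on.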
Auditability diagnostic
437/437
Deliberate broken-governance controls are detected.
Internal deterministic code-path diagnostic with zero LLM calls. It verifies governed memory state, falsifiability checks, and audit telemetry behavior; it is not external validation, not a named memory-system comparison, and not evidence of base-model weight learning or general reasoning improvement.
How these fit together
Automatic promotion is upstream of memory pressure: first identify the authoritative memory, then govern what survives stale-context pressure. The memory-pressure benchmark supports the second half after reviewed/current state exists; it should not be read as an authority-inference superiority claim. The lifecycle pillar tests whether governed state remains durable through rollback, repair, compaction, and repeated reads. The auditability diagnostic checks whether broken governance is caught instead of silently steering answers. The same-budget two-pass diagnostic narrows the promotion claim to governed answer-path control rather than basic semantic selection.
Evidence package
The original public-safe methods reports, the governed-learning loop synthesis report, figures, source-artifact hashes, and the newer lifecycle diagnostics are retained as the FieldHash governed-context evidence set. Row-level prompts, answers, and implementation-sensitive traces are retained for qualified private review. DOI: 10.5281/zenodo.20401670.
How to read the rest of this registry
Core governed-memory studies
Public-safe methods reports, source-hash manifests, aggregate artifacts, and case-study audits for the four governed-memory proof points.
Workflow diagnostics
Same-provider reasoning and selection checks that measure scaffolding quality, not provider substitution.
Mechanism checks
Deterministic controls for memory gates, audience scope, compression, telemetry, and routing behavior.
Research evidence
Deep Synthesis and Research Lab outputs that show hypothesis generation, validation plans, symbolic regression, and pipeline constraints.
The category claim must demonstrate both adaptation and restraint.
FieldHash uses “governed continual learning” narrowly: validated outcomes can change future retrieval, routing, and memory influence, while scope, confidence, compression, and telemetry gates decide what is allowed to carry forward.
It is not base-model fine-tuning, preference-data training, plain retrieval, or context compaction. The benchmarks below separate the two halves of the claim: what gets reused, and what is blocked from becoming a future prior.
Reviewed context persists
98.96%
In a 32-case live governed-memory benchmark, FieldHash recovered current approved project context with a 98.96% mean recall score, a 100% seeded memory-retrieval rate, and 0% control-user leakage across continuity, rejected-noise, superseding-update, and topic-isolation tasks.
Scoped memory retrieval plus correction precedence.
Bad priors are blocked
5/5
A deterministic governed-learning controls benchmark passed 5/5 mechanism checks covering memory arbitration, organization-scoped collective insight and heuristic write gates, audience-scoped retrieval within an organization, hub compression, and audit telemetry.
Write gates, audience scope, hub compression, and telemetry checks.
Audit trail is falsifiable
437/437
The governed memory auditability diagnostic passed 437/437 deterministic checks across 36 lifecycle scenarios. That includes 257 positive governance invariants and 180/180 negative controls that deliberately disabled governance or corrupted state, confirming the suite catches stale exposure, rejected-context promotion, missing superseded_by links, scope leakage, and stale re-promotion.
State invariants plus deliberate broken-governance controls.
Workflow quality improves
+48.6%
Against the same-provider direct baseline, FieldHash improved average composite reasoning quality by 48.6% on a 56-prompt live paired benchmark while improving grounding fit from 94.64% to 100%.
The same-provider baseline isolates the FieldHash layer.
Rejected-learning control example
The governed-memory suite includes later turns that introduce discarded brainstorming, superseding corrections, and unrelated project context. Passing behavior means the system can preserve reviewed work while preventing stale, noisy, or out-of-scope material from becoming the answer’s hidden prior. The newer auditability diagnostic adds deliberate breakages so the suite must also catch disabled governance, missing supersession links, rejected-context promotion, scope leakage, and stale re-promotion.
Workflow-quality suite.
These measurements test open-form reasoning quality, exactness, task selection, and semantic grounding against a same-provider direct baseline. They support the product story, but the flagship governed-context studies above carry the strongest public evidence.
Reasoning quality
+48.6%
0.3740 to 0.5556
Grounding fit
100%
0.9464 to 1.0000
Task selection
+0.40
0.00 to 0.40
Exact correctness
100%
100% to 100%
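The headline percentage on the reasoning-quality card is a relative lift over the baseline mean, which can be reproduced from the two scores shown. The sketch below is only a worked check of that arithmetic; the helper name `relative_lift_pct` is hypothetical.

```python
def relative_lift_pct(baseline: float, treated: float) -> float:
    """Relative change of `treated` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("relative lift is undefined for a non-positive baseline")
    return (treated - baseline) / baseline * 100.0


# Reasoning quality: 0.3740 to 0.5556
print(round(relative_lift_pct(0.3740, 0.5556), 1))  # 48.6

# Grounding fit: 0.9464 to 1.0000, expressed as an absolute change in points
print(round((1.0000 - 0.9464) * 100, 2))  # 5.36
```

Task selection starts from a baseline of 0.00, where a relative lift is undefined, which is presumably why that card reports the absolute change (+0.40) instead.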
The broader diagnostic suite tests whether the reasoning lift holds at a larger sample size.
In the broader diagnostic suite (v4), FieldHash measured a 52.2% relative reasoning lift with 155 wins, 5 losses, and 43 ties; exact correctness held at 100% on 24 deterministic tasks, and semantic grounding held at 100% across 32 ambiguity-control cases.
Broader diagnostic suite, not yet the promoted public headline; it preserves the reasoning lift at larger sample size while showing task-selection remains an active reliability target.
Reasoning lift
+52.2%
0.3509 to 0.5340
Paired wins
155/5/43
wins / losses / ties
Mean-delta CI95
[0.1668, 0.1976]
Exact correctness
100%
24 deterministic tasks
Semantic grounding
100%
32 ambiguity-control cases
Task selection
38.98%
0.0000 to 0.3898
What this tells us
The diagnostic suite is useful precisely because it is broader: the paired reasoning lift persisted at larger sample size, exactness held, and semantic grounding held after repair. The remaining softness is narrow and visible: task-selection accuracy remains an active optimization target.
Workload lift concentrates in the quick-lite reasoning path.
A router-conditioned ablation of the broader diagnostic suite (v4) shows the lift is concentrated in quick-lite-routed rows: 164 quick-lite prompts improved from 0.3507 to 0.5777 with 155 wins, 2 losses, and 7 ties, while 39 direct-high-retained rows were essentially flat.
Router-conditioned ablation, not a randomized forced-routing experiment; prompt difficulty may differ between quick-lite-eligible and direct-high-retained rows.
Quick-lite rows
164
0.3507 to 0.5777
Quick-lite wins
155/2/7
wins / losses / ties
Quick-lite mean delta
+0.2270
Direct-high rows
39
0.3516 to 0.3500
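The router-conditioned figures can be cross-checked against the overall v4 numbers reported earlier (0.3509 to 0.5340, 155/5/43). The sketch below pools the two strata by row count; it uses only values published on this page, and the variable names are illustrative. The direct-high win/loss/tie split is not stated above, so it is derived here by subtraction and should be read as an inference.

```python
# Published per-stratum figures: row count, baseline mean, FieldHash mean.
strata = {
    "quick_lite": {"n": 164, "baseline": 0.3507, "fieldhash": 0.5777},
    "direct_high": {"n": 39, "baseline": 0.3516, "fieldhash": 0.3500},
}
overall_wlt = (155, 5, 43)    # wins / losses / ties across all rows
quick_lite_wlt = (155, 2, 7)  # wins / losses / ties in quick-lite rows


def pooled_mean(key: str) -> float:
    """Row-count-weighted mean of a per-stratum score."""
    total = sum(s["n"] for s in strata.values())
    return sum(s["n"] * s[key] for s in strata.values()) / total


baseline = pooled_mean("baseline")
fieldhash = pooled_mean("fieldhash")
print(f"{baseline:.4f} to {fieldhash:.4f}")            # 0.3509 to 0.5340
print(f"{(fieldhash - baseline) / baseline:+.1%}")     # +52.2%

# Inferred (not published): direct-high outcomes as overall minus quick-lite.
direct_high_wlt = tuple(o - q for o, q in zip(overall_wlt, quick_lite_wlt))
print(direct_high_wlt, sum(direct_high_wlt))           # (0, 3, 36) 39
```

The pooled means reproduce the overall scores to four decimals, and the residual outcomes sum to the 39 direct-high rows, so the stratified and overall tables are arithmetically consistent.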
Memory governance, not just recall.
In a 32-case live governed-memory benchmark, FieldHash recovered current approved project context with a 98.96% mean recall score, a 100% seeded memory-retrieval rate, and 0% control-user leakage across continuity, rejected-noise, superseding-update, and topic-isolation tasks.
The diagnostic scenario is strictly bounded: it tests whether governed context stays aligned with verified project facts when later turns introduce discarded concepts, explicit overrides, or cross-project context. Unlike a static recall test, this design exercises contextual stability across changing turns.
Focused internal live benchmark of governed project-memory behavior; it measures update precedence, noise suppression, topic isolation, and user isolation, not general model quality or universal memory performance.
Mean governed recall
98.96%
32 live seeded cases
Memory retrieval
100%
seeded recall turns
Control leakage
0%
32 unseeded controls
Task families
4
continuity, noise suppression, updates, isolation
Procedure
Each diagnostic iteration utilizes an isolated state session. The evaluation sequence seeds a verified baseline context, then introduces systematic interventions during subsequent turns: baseline retrieval, rejected/noisy exploratory content, explicit superseding corrections, or cross-project data. State recall is quantified directly against the generated response, with unseeded control runs executed in parallel to establish the baseline guessing boundary and rule out leakage.
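The procedure above pairs each seeded case with an unseeded control. A minimal sketch of how recall and leakage might be scored from response text is shown below. It assumes case-insensitive substring matching, and every name and fact in it (`recall_score`, `control_leaked`, the sample project facts) is hypothetical rather than taken from the benchmark.

```python
def recall_score(response: str, expected_facts: list[str]) -> float:
    """Fraction of expected facts found in the response.

    Assumption: a fact counts as recalled if it appears as a
    case-insensitive substring of the response.
    """
    if not expected_facts:
        raise ValueError("expected_facts must not be empty")
    text = response.lower()
    return sum(fact.lower() in text for fact in expected_facts) / len(expected_facts)


def control_leaked(control_response: str, seeded_facts: list[str]) -> bool:
    """True if an unseeded control session reproduces any seeded fact."""
    text = control_response.lower()
    return any(fact.lower() in text for fact in seeded_facts)


# Hypothetical case: the seeded session should recall both facts,
# and the unseeded control session should know neither.
seeded_facts = ["launch date is March 3", "owner is Priya"]
seeded_response = "The launch date is March 3 and the owner is Priya."
control_response = "I don't have any project details for this user."

print(recall_score(seeded_response, seeded_facts))     # 1.0
print(control_leaked(control_response, seeded_facts))  # False
```

Scoring the control with the same fact list is what lets a nonzero control score be read as either guessing or leakage, rather than as recall.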
The gates are measured separately from the model.
A deterministic governed-learning controls benchmark passed 5/5 mechanism checks covering memory arbitration, organization-scoped collective insight and heuristic write gates, audience-scoped retrieval within an organization, hub compression, and audit telemetry.
This benchmark is fully deterministic, verifying that logic-layer constraints and control-plane gates operate exactly as specified prior to and independent of model generation.
Deterministic mechanism benchmark; it verifies code-path controls and telemetry, not open-ended model answer quality.
Control checks
5/5
deterministic code-path benchmark
Pass rate
100%
all mechanism checks passed
Mechanisms
5
arbitration, gates, scope, compression, telemetry
Procedure
The suite runs direct code-path checks for personal-memory supersession, collective organizational insight and heuristic write gates, audience-scoped retrieval within an organization, hub-compression representative selection, and organization-scoped collective-ingestion telemetry.
The audit state has a falsifiability check.
The governed memory auditability diagnostic passed 437/437 deterministic checks across 36 lifecycle scenarios. That includes 257 positive governance invariants and 180/180 negative controls that deliberately disabled governance or corrupted state, confirming the suite catches stale exposure, rejected-context promotion, missing superseded_by links, scope leakage, and stale re-promotion.
Internal deterministic code-path diagnostic with zero LLM calls. It verifies governed memory state, falsifiability checks, and audit telemetry behavior; it is not external validation, not a named memory-system comparison, and not evidence of base-model weight learning or general reasoning improvement.
Total checks
437/437
deterministic memory-state and audit controls
Negative controls
180/180
broken governance states detected
Lifecycle scenarios
36
partial supersession and replacement updates
LLM calls
0
code-path diagnostic, not answer-quality benchmark
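The pairing of positive invariants with negative controls can be illustrated with a small sketch. It is not the FieldHash suite: the `MemoryRecord` fields other than `superseded_by`, the status values, and the `audit` function are assumptions chosen to show the pattern of deliberately corrupting state and requiring the audit to fail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryRecord:
    record_id: str
    status: str                          # "current", "superseded", or "rejected"
    superseded_by: Optional[str] = None  # id of the replacing record, if any
    exposed: bool = False                # True if the record may reach the answer path


def audit(records: list[MemoryRecord]) -> list[str]:
    """Return violation messages; an empty list means every invariant holds."""
    ids = {r.record_id for r in records}
    violations = []
    for r in records:
        # Invariant 1: a superseded record must link to an existing successor.
        if r.status == "superseded" and r.superseded_by not in ids:
            violations.append(f"{r.record_id}: missing superseded_by link")
        # Invariant 2: superseded or rejected records must not be exposed.
        if r.status in ("superseded", "rejected") and r.exposed:
            violations.append(f"{r.record_id}: {r.status} record exposed")
    return violations


healthy = [
    MemoryRecord("m1", "superseded", superseded_by="m2"),
    MemoryRecord("m2", "current", exposed=True),
]
# Negative control: break the link and expose the stale record on purpose.
broken = [
    MemoryRecord("m1", "superseded", superseded_by=None, exposed=True),
    MemoryRecord("m2", "current", exposed=True),
]

assert audit(healthy) == []      # positive invariants hold on healthy state
assert len(audit(broken)) == 2   # both injected faults are detected
```

A suite that only ran the healthy case could pass with a checker that never fails; the broken case is what makes the audit falsifiable.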
How we measured it
Live reasoning benchmark
Against the same-provider direct baseline, FieldHash improved average composite reasoning quality by 48.6% on a 56-prompt live paired benchmark while improving grounding fit from 94.64% to 100%.
Reasoning quality
+48.6%
0.3740 to 0.5556
Steering usefulness
0.3857
0.0125 to 0.3857
Grounding fit
100%
0.9464 to 1.0000
Sample: 56 live prompts
Measures whether the companion chooses a more useful line of thought under live conditions while staying grounded.
Average uplift across a 56-prompt suite; not a claim that every prompt improves equally.
Evidence package
Website Benchmark Suite v2
Paired same-provider direct-baseline benchmark covering 56 live reasoning prompts, with methodology and component reports retained for technical review.
Exact correctness floor
Exact correctness held at 100% on 24 callable-backed deterministic tasks, used as a deterministic safety-floor alongside the broader reasoning benchmark.
Exact correctness
100%
100% to 100%
Sample: 24 deterministic tasks
Checks that the system preserves or improves closed-form correctness while adding reasoning scaffolding.
This is a deterministic callable-backed multiple-choice safety-floor metric, not an open-form reasoning benchmark.
Evidence package
Website Benchmark Suite v2 exactness slice
Callable-backed deterministic task slice used as a correctness safety-floor check alongside the broader reasoning suite.
Task selection
On the approved gold lens-family set, task-selection accuracy improved from 0.00 to 0.40 across 30 prompts.
Task selection
+0.40
0.00 to 0.40
Sample: 30 approved-gold prompts
Measures whether the system chooses a more useful reasoning family before answering.
Curated approved-gold lens-family benchmark; not an open-world classifier claim.
Evidence package
Website Benchmark Suite v2 task-selection slice
Approved-gold lens-family selection benchmark measuring whether the system chooses a useful reasoning family before answering.
Semantic grounding proxy
On a 16-case ambiguity-control proxy benchmark, semantic grounding landed at 1.00 artifact-class accuracy and 1.00 prompt-family accuracy.
Artifact class accuracy
1.00
Prompt-family accuracy
1.00
Sample: 16 ambiguity-control cases
Checks that ambiguous prompts stay in the right semantic universe instead of collapsing into generic or literalized readings.
Narrow ambiguity-control proxy benchmark; supporting evidence, not the headline reasoning claim.
Evidence package
Website Benchmark Suite v2 semantic-grounding slice
Focused ambiguity-control proxy benchmark measuring artifact-class and prompt-family grounding under ambiguous prompts.
Where it wins
Ambiguous Named Concept
0.3411 → 0.5872
When a prompt uses a poetic or metaphorical name, the system keeps it in the right conceptual frame instead of interpreting it literally.
Mathematical Strategy
0.3793 → 0.5536
On math problems, the system picks stronger proof strategies and gives clearer next-step guidance.
Operational Tradeoff
0.4115 → 0.5754
For real-world trade-off decisions, the system identifies the variable that actually matters instead of listing generic pros and cons.
UI System Design
0.3689 → 0.5782
On design problems, the system finds the real implementation decision point instead of writing a generic architecture overview.
High-Stakes Advisory
0.3370 → 0.4843
Under high-stakes or ambiguous pressure, the system provides grounded, practical strategies instead of vague reassurance.
Scientific Mechanism
0.3799 → 0.5809
On science questions, the system frames mechanisms more precisely and distinguishes between competing experimental approaches.
Ambiguous Abstract
0.4003 → 0.5293
For abstract or philosophical prompts, the system gives substantive framing instead of decorative language.
Example judgments
Clear win
Ambiguous concept framing
With highly figurative prompts (such as "Glass Field"), unguided baselines frequently interpreted the metaphor literally. FieldHash identified the underlying functional constraints and prioritized structural implementation variables in its response.
Typical win
Operational tradeoff
Rather than presenting balanced lists of generic advantages and disadvantages, the FieldHash control layer identified the governing trade-off variable, the critical factor that dictates real-world viability.
Clear miss
Scientific mechanism prompt
A diagnostic stress-suite miss revealed instances where the response drifted into narrative-conversational language rather than maintaining strict, quantitative mechanism analysis. This boundary illustrates why task-routing precision and semantic grounding are tracked as explicit architectural optimization targets.
Infrastructure & Governance
Tools manifold routing
Tools manifold routing improved top-1 selection by 3.77 percentage points on real paired events and by 5.34 percentage points on the broader combined benchmark.
Real paired events
+3.77 pp
50.94% to 54.72%
Combined benchmark
+5.34 pp
Broad benchmark-scale evaluation
Sample: 53 real paired events; combined benchmark-scale evaluation
Measures whether learned routing improves tool choice compared with a fixed baseline policy.
Real-event significance remains underpowered at n=53; strongest support comes from the broader combined benchmark.
Evidence package
Tools manifold routing significance package
Paired real-event and broader combined routing benchmark measuring learned tool selection against a fixed baseline policy.
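Subtracting the two displayed real-event percentages gives 3.78 rather than the stated 3.77 points. One reading that reconciles them, shown below as a worked check, is that the accuracies are whole-event counts out of 53. The counts 27 and 29 are inferred from the rounded percentages and are an assumption, not published figures.

```python
N = 53                                # real paired events (published)
baseline_hits, learned_hits = 27, 29  # inferred counts (assumption)

baseline_acc = baseline_hits / N * 100
learned_acc = learned_hits / N * 100

print(f"{baseline_acc:.2f}% to {learned_acc:.2f}%")  # 50.94% to 54.72%

# Difference of the unrounded accuracies: 2 / 53 events.
print(f"{learned_acc - baseline_acc:+.2f} pp")        # +3.77 pp

# Difference of the already-rounded percentages.
print(f"{round(learned_acc, 2) - round(baseline_acc, 2):+.2f} pp")  # +3.78 pp
```

Under this reading the improvement corresponds to two additional correct selections out of 53, which is consistent with the note that real-event significance is underpowered at this sample size.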
Manifold stability
Production manifolds validated above 91% accuracy while monitored training drift stayed within a bounded L2 range of 0.014 to 0.121.
Validation accuracy
>91%
L2 drift band
0.014–0.121
Convergence
1–13 epochs
Sample: Nine trained manifolds
Supports the claim that learning components remain stable enough to deploy under governance.
Training and validation stability evidence for manifolds, not a live companion benchmark.
Evidence package
Manifold validation and drift report
Training and validation stability evidence for production manifolds, including validation accuracy and bounded L2 drift ranges.
Mesh sharding speed
On a sharded synthesis workload, mesh-parallel execution achieved a 2.74x mean speedup over local execution across 10 queries.
Mean speedup
2.74x
CI95
2.66x–2.83x
Queries
10
Sample: 10 benchmark queries
Shows that the mesh can materially reduce wall-clock time for sharded synthesis workloads.
Measures orchestration and distributed execution speed for a specific sharded synthesis workload, not model quality.
Evidence package
Mesh synthesis sharding benchmark
Sharded synthesis workload comparison measuring wall-clock speedup for mesh-parallel execution versus local execution.
Deep Synthesis & Research Lab
PIMA Diabetes
On the PIMA Diabetes benchmark, the research pipeline reached 85.3% AUC on 768 rows while retaining the safe local-data path when a borrowed configuration would hurt performance.
AUC
85.3%
Rows
768
Sample: 768 rows
Shows parity-level performance on a clean medical classification benchmark with governance preventing a harmful borrowed configuration.
Dataset-task benchmark for the research platform, not a live companion benchmark.
Evidence package
QARIN Research Lab tabular benchmark report
Dataset-task benchmark evidence for the research pipeline on PIMA Diabetes, with governance retaining the safe local-data path when a borrowed configuration would hurt performance.
Non-linear stress test
On the non-linear stress benchmark, the research pipeline reached 90.8% AUC, outperformed the linear baseline by 10.5%, and filtered 87% of noise columns.
AUC
90.8%
Lift vs linear baseline
+10.5%
Noise filtered
87%
Sample: 1,000 rows, 23 features
Shows autonomous signal detection and noise filtering on a deliberately difficult synthetic benchmark.
Synthetic signal-vs-noise benchmark; illustrates autonomous feature selection, not a production customer metric.
Evidence package
QARIN Research Lab non-linear stress benchmark
Synthetic signal-versus-noise benchmark measuring autonomous signal detection and noise filtering under controlled conditions.
Adult Census
On Adult Census, the research pipeline reached 91.1% AUC on 30,162 rows and 96 features while degrading gracefully when dynamic grouping timed out.
AUC
91.1%
Rows
30,162
Features
96
Sample: 30,162 rows, 96 features
Shows robustness on high-dimensional, messy, real-world tabular data.
Dataset-task benchmark for robustness and fallback behavior, not a live companion benchmark.
Evidence package
QARIN Research Lab Adult Census benchmark
High-dimensional tabular benchmark measuring robustness and graceful degradation under dynamic grouping timeouts.
Symbolic regression
The symbolic-regression stack recovered Kepler’s Third Law and the Rydberg Formula with perfect fit on standard benchmark tasks.
Kepler fit
R² = 1.0
Kepler complexity
4 nodes
Rydberg fit
R² = 1.0
Sample: Standard physics benchmark tasks
Shows interpretable equation discovery rather than black-box prediction alone.
Physics symbolic-regression benchmark; demonstrates the research pipeline, not the consumer companion.
Evidence package
QARIN symbolic-regression benchmark report
Physics-law recovery benchmark demonstrating interpretable equation discovery on standard symbolic-regression tasks.
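As an illustration of what the fit metric measures for a recovered law, the sketch below scores the closed form of Kepler's Third Law, T = a^1.5 (period in years, semi-major axis in AU), against rounded textbook values for six planets. It evaluates a known expression and does not perform symbolic regression; the dataset and the helper `r_squared` are assumptions for illustration, not the benchmark's data or code.

```python
# (semi-major axis in AU, orbital period in years); rounded textbook values
planets = {
    "Mercury": (0.387, 0.2408),
    "Venus": (0.723, 0.6152),
    "Earth": (1.000, 1.0000),
    "Mars": (1.524, 1.8808),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def r_squared(observed: list[float], predicted: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


observed_periods = [period for _, period in planets.values()]
predicted_periods = [axis ** 1.5 for axis, _ in planets.values()]

fit = r_squared(observed_periods, predicted_periods)
print(fit > 0.9999)  # True
```

With rounded inputs the score falls just short of exactly 1.0; a value that displays as 1.0 indicates the residuals are negligible relative to the variance in the data.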
Alzheimer’s biomarker discovery
On the GSE84422 Alzheimer’s candidate-marker task, the research pipeline produced AUC 0.855 on an internally processed evaluation matrix across 19 brain regions.
Validation AUC
0.855
Data matrix
processed
Brain regions
19
Sample: Internally processed GSE84422 evaluation matrix, 19 regions
Shows structured hypothesis generation on a real biological dataset with literature-grounded marker interpretation.
Scientific discovery benchmark on curated transcriptomics data; not a live companion eval.
Evidence package
QARIN biomedical discovery benchmark report
Curated transcriptomics benchmark and literature-grounded marker interpretation for the GSE84422 Alzheimer’s task.
FieldHash & Provenance
These benchmarks validate the integrity and auditability of the provenance layer governing verified artifacts and policy actions. The accompanying FieldHash documentation outlines the underlying cryptographic registry and state-verification sequence.
FieldHash hardening closure
On the measured adversarial synthesis benchmark, a standard-profile uniform-blend attack passed in 15 of 800 trials while the hardened profile closed that gap to 0 of 800.
Standard profile
15/800
1.875%
Hardened profile
0/800
Sample: 800 trials per profile
Shows that hardening materially closed a measured attack family rather than relying on a generic security narrative.
Attack-family measurement on a specific adversarial synthesis benchmark; not a universal security guarantee.
Evidence package
FieldHash adversarial hardening package
Measured adversarial synthesis benchmark comparing standard and hardened profiles against a uniform-blend attack family.
FieldHash production-gated adaptive campaign
In the calibration-conditioned adaptive ML campaign, production-gated verification measured 0 of 5,000 successful forgeries per tested model, with a Wilson 95% upper bound of 0.0768%.
Production-gated acceptance
0/5000
Wilson 95% upper bound
0.0768%
Sample: 5,000 trials per tested model
Shows that the production-gated path held under stronger adaptive attacks than the policy-only path.
Per-tested-model result under the documented production-gated verifier and no-signing-key assumption; not an absolute impossibility claim.
Evidence package
FieldHash adaptive spoofing campaign
Calibration-conditioned adaptive ML spoofing campaign under the documented production-gated verifier and no-signing-key assumption.
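The reported upper bound can be reproduced with the standard Wilson score interval. The sketch below assumes a two-sided 95% interval (z of about 1.96), which matches the published 0.0768% for 0 successes in 5,000 trials; the function name `wilson_upper` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper(successes: int, trials: int, confidence: float = 0.95) -> float:
    """Upper limit of the two-sided Wilson score interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    p = successes / trials
    centre = p + z * z / (2 * trials)
    margin = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2))
    return (centre + margin) / (1 + z * z / trials)


# 0 successful forgeries in 5,000 trials per tested model
print(f"{wilson_upper(0, 5000) * 100:.4f}%")  # 0.0768%
```

With zero observed successes the expression reduces to z²/(n + z²), so the bound shrinks roughly in proportion to 1/n. This is why the result is stated as a per-model upper bound rather than as an impossibility claim.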
Scientific Boundaries & Caveats
Defined reasoning boundaries: These metrics isolate cognitive alignment, instruction adherence, and state-governed retrieval. They are not evidence of general consciousness or unconstrained artificial general intelligence.
Evaluation footprint: The test sets cover distinct governed-context scenarios; they are not an exhaustive sample of multi-modal enterprise operations.
Workflow diagnostics: The measured +48.6% reasoning lift represents a statistical mean across the 56-prompt diagnostic set. This diagnostic metric serves as secondary validation rather than the primary governed-context proof.
Broader diagnostic suite (v4): The May 2026 diagnostic run measured a +52.2% relative reasoning lift across 203 live test prompts. Exact correctness and semantic grounding both held at 100%, though dynamic task selection remains an active focus for optimization.
Scope of metrics: The exact correctness metrics establish a deterministic safety-floor, while semantic grounding functions as a high-fidelity control proxy rather than the primary architectural benchmark.
Review the evidence.
This registry documents verified benchmark milestones and active verification runs. Institutional partners may request secure access to row-level traces, the core architectural whitepaper, and advanced evaluation datasets.
Ready to build?
These quantitative results establish the empirical foundation. The architectural whitepaper details the underlying mechanics, and our case studies demonstrate these governed behaviors in active environments.