Governed Continual Learning Loop

How FieldHash turns memory promotion, rollback, governed enforcement, and influence control into system-level learning.

The base LLM stays fixed. The system's behavior changes through governed memory: authoritative state, supersession, rollback, repair, influence control, and an audit layer that catches sabotage deliberately injected into its own governance. The latest lifecycle diagnostic scales stateful compounding to 200 internal projects and compares it with stateless Gemini and GPT selectors.

Behavior change

2400/2400

Paired with 1198/1200 recency-baseline failures.

Promotion

90/90

Authoritative memory found before the model answers.

Enforcement

600/600

Reviewed/current state is enforced under stale context.

Rollback restored

200/200

Prior approved state governs again after rollback.

Audit controls

437/437

437 checks passed. 180 sabotage controls caught.

A project decision changes. The old notes still exist — deleting history would break auditability. FieldHash identifies which version is current and stops the earlier ones from steering the next answer. No retraining. A governed answer path.

The causal order

Memory pressure depends on promotion.

Memory pressure asks the downstream question: does the authoritative fact survive when stale context competes for the next answer?

Promotion is the upstream question: which record is authoritative in the first place? The loop fails if promotion misfires. The flagship studies test both.

Loop statement

1. Observe a new decision, correction, conflict, or tool requirement.

2. Infer which record is authoritative, superseded, rejected, or ordinary.

3. Promote only the state that passes scope, confidence, and authority checks.

4. Suppress stale context before answer construction while retaining audit history.

5. Condition the next model call on the governed context rather than raw memory clutter.
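As a rough illustration of the five steps above, the Python sketch below keeps every observation for audit, promotes only records that pass scope, confidence, and authority checks, and builds the prompt from authoritative records alone. The `Record` fields, the `min_conf` threshold, the project name, and every function name are hypothetical; this is not FieldHash's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str           # what the record is about (hypothetical field)
    value: str
    project: str       # scope
    confidence: float  # hypothetical inference confidence in [0, 1]
    approved: bool     # hypothetical authority flag
    status: str = "ordinary"  # ordinary | authoritative | superseded | rejected


def promote(store, new, project, min_conf=0.8):
    """Steps 1-3: keep the observation, then promote it only if all checks pass."""
    store.append(new)  # the observation is always retained for audit
    if new.project != project or new.confidence < min_conf or not new.approved:
        return False   # stays "ordinary"; it never governs answers
    for r in store:
        same_slot = r.key == new.key and r.project == project
        if r is not new and same_slot and r.status == "authoritative":
            r.status = "superseded"
    new.status = "authoritative"
    return True


def governed_context(store, project):
    """Step 4: suppress stale context; the full history stays in `store`."""
    return [r for r in store if r.project == project and r.status == "authoritative"]


def build_prompt(question, store, project):
    """Step 5: condition the model call on governed context only."""
    facts = "\n".join(f"{r.key} = {r.value}" for r in governed_context(store, project))
    return f"Current governed facts:\n{facts}\n\nQuestion: {question}"


store = []
promote(store, Record("deploy_region", "us-east-1", "atlas", 0.95, True), "atlas")
promote(store, Record("deploy_region", "eu-west-1", "atlas", 0.92, True), "atlas")
promote(store, Record("deploy_region", "ap-south-1", "atlas", 0.40, True), "atlas")
prompt = build_prompt("Which region do we deploy to?", store, "atlas")
```

All three records remain in `store`, but only the second reaches the prompt: the first was superseded and the third failed the confidence check.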

New lifecycle pillar

The continual part is stateful, not just semantic.

The strongest new evidence is the lifecycle diagnostic. Across 200 projects, FieldHash carried governed state through update, stale re-entry, rejection, scope collision, drift repair, rollback, re-apply, compaction, and repeated reads — and the audit layer caught and repaired its own injected sabotage. Stateless read-time selectors shown only the record text missed rollback, but that is an information asymmetry, not a capability gap: a model given the full operation log replays it and ties.

Internal n=200 lifecycle diagnostic · Gemini Flash Lite and GPT-5.5 stateless selector controls

FieldHash normal lifecycle reads

2400/2400

Rollback restored governed state

200/200

Drift detected and repaired

200/200

Compaction retained governed state

200/200

Why it matters.

A read-time selector shown only record text can pick the record that looks current, but it has no way to know a rollback happened — that is information withheld, not a capability gap. Given the full operation log, a frontier model replays it and ties.

So the claim is not superior selection intelligence. It is persistent, auditable, reversible state: the control works as specified — rollback reverts, rejected and superseded state is suppressed, and the audit layer catches its own sabotage. That governance correctness, not beating a model, is the wedge.

Read the lifecycle pillar

Internal diagnostic, not external validation. Authority metadata is written by governed operations, so this does not prove authority inference. Two controls bound the claim: a visible-rollback control (GPT-5.5 recovers rollback when it is ordinary retrievable text), and a structural analysis showing that a model given the full operation log replays it and ties. The supported claim is therefore architectural — durable, materialized governed state versus reconstruction under a bounded retrieval budget — not superiority over frontier LLM selectors.
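The information-asymmetry point can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not the diagnostic's code: `GovernedState`, the operation names, and `latest_looking` are all hypothetical. It shows a materialized state that honors rollback, a replay of the full operation log that reaches the same answer, and a selector that sees only approved record text and therefore cannot know the rollback happened.

```python
class GovernedState:
    """Toy materialized state: an append-only log plus a stack of approved values."""

    def __init__(self):
        self.log = []      # audit history; nothing is ever deleted
        self.history = []  # approved values, newest last

    def apply(self, op, value=None):
        self.log.append((op, value))
        if op == "approve":
            self.history.append(value)
        elif op == "rollback" and len(self.history) > 1:
            self.history.pop()  # the prior approved value governs again
        # "reject" is logged but never changes the governing value

    def current(self):
        return self.history[-1] if self.history else None


def replay(log):
    """A reader given the full operation log can reconstruct the same state."""
    state = GovernedState()
    for op, value in log:
        state.apply(op, value)
    return state.current()


def latest_looking(log):
    """A selector shown only approved record text picks the newest-looking one."""
    approved = [value for op, value in log if op == "approve"]
    return approved[-1] if approved else None


state = GovernedState()
for op, value in [("approve", "v1"), ("approve", "v2"), ("reject", "v3"), ("rollback", None)]:
    state.apply(op, value)

materialized = state.current()
replayed = replay(state.log)
text_only = latest_looking(state.log)
```

Here `materialized` and `replayed` agree on `v1`, while `text_only` returns `v2`. The difference comes from what each reader is shown, not from how well it selects.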

Evidence chain

Four benchmarks. One control plane.

Each study tests one part of the loop. Read together, they support a precise claim: future model behavior changes through governed memory and answer-path control, not through hidden weight updates.

1. Promotion

Find the record that should win.

In internal automatic-promotion diagnostics, FieldHash identified the authoritative memory and recovered 90/90 exact current tokens across Gemini 3.5 Flash, Claude Opus 4.7, and GPT-5.5 on a Claude-authored disjoint n=30 corpus, while retrieval-only and prompt-only smart-memory controls recovered 36/90 and 40/90. A same-budget Gemini two-pass smart diagnostic on the same corpus selected the current record 30/30 and answered 28/30 with zero stale substitutions, narrowing the claim to governed, auditable answer-path control rather than basic semantic selection. On same-family n=100 provider replications, Gemini 3.5 Flash and GPT-5.5 each reached 100/100 with zero stale-token mentions; Claude Opus 4.7 reached 95/100, with the remaining misses caused by empty provider responses rather than stale substitutions. In a provider-sensitivity fact-extraction audit on the same n=100 corpus, Gemini reached 99/100 role-equivalent current facts and 68/100 exact spans, while GPT-5.5 reached 95/100 and 76/100; strict source-span fidelity and provider-invariant extraction are not claimed as solved.

2. Enforcement

Enforce governed state under stale pressure.

In the refreshed May 23 internal seeded-authority memory-pressure benchmark, the same frontier LLMs with FieldHash governed memory enforced reviewed/current state and recovered the approved-current fact in 600/600 cases across Gemini, Claude Opus, and GPT provider paths. Retrieval-only memory without FieldHash governance metadata recovered 415/600. The governed path reduced mean memory context exposed to the LLM to 2.00 of 10 retrieved candidates before answer construction, versus all 10 candidates in the retrieval-only baseline. Across the same three provider paths, adding a prompt-only instruction to prefer current/reviewed records improved the baseline to 464/600, but still left 136 stale-context failures and exposed all 10 retrieved memories. This supports governed-state enforcement under stale-context pressure, not a claim of superior authority inference from raw text.
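A minimal sketch of the difference between the two paths, assuming authority metadata already exists on each retrieved candidate. The `Candidate` fields, scores, and texts are invented for illustration and are not benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    score: float    # retrieval similarity (hypothetical)
    reviewed: bool  # hypothetical governance metadata
    current: bool   # hypothetical governance metadata


def retrieval_only(cands, k=10):
    """Baseline: expose the top-k candidates by similarity, stale or not."""
    return sorted(cands, key=lambda c: c.score, reverse=True)[:k]


def governed(cands, k=10):
    """Governed path: same retrieval, then keep only reviewed and current records."""
    return [c for c in retrieval_only(cands, k) if c.reviewed and c.current]


# Eight stale notes that outrank two approved records on similarity alone.
stale = [Candidate(f"stale note {i}", 0.90 - i * 0.01, False, False) for i in range(8)]
approved = [
    Candidate("approved-current fact", 0.70, True, True),
    Candidate("approved supporting note", 0.65, True, True),
]
cands = stale + approved

exposed_baseline = retrieval_only(cands)
exposed_governed = governed(cands)
```

In this toy case the baseline exposes all 10 candidates with a stale note ranked first, while the governed path exposes 2. The filter relies entirely on metadata written earlier, which is why this illustrates enforcement rather than authority inference.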

3. Stateful lifecycle

Keep state durable as it changes.

The lifecycle diagnostic tests whether governed state survives update, rollback, repair, compaction, and repeated reads.

4. Auditability

Catch broken governance.

The governed memory auditability diagnostic passed 437/437 deterministic checks across 36 lifecycle scenarios. That includes 257 positive governance invariants and 180/180 negative controls that deliberately disabled governance or corrupted state, confirming the suite catches stale exposure, rejected-context promotion, missing superseded_by links, scope leakage, and stale re-promotion.
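The pairing of positive invariants with negative controls can be sketched as follows. This is a hypothetical miniature, not the diagnostic suite: only the `superseded_by` field name comes from the text above, and the record layout and `audit` function are assumptions. A healthy state must pass, and a deliberately corrupted copy must make the audit fire.

```python
from copy import deepcopy


def audit(records, answer_context):
    """Return the list of violated invariants; an empty list means the state passes."""
    violations = []
    by_id = {r["id"]: r for r in records}
    for r in records:
        # Invariant: every superseded record links to an existing successor.
        if r["status"] == "superseded" and r.get("superseded_by") not in by_id:
            violations.append(f"missing superseded_by link: {r['id']}")
    for rid in answer_context:
        # Invariant: only authoritative records may reach the answer path.
        if by_id[rid]["status"] != "authoritative":
            violations.append(f"non-authoritative record exposed: {rid}")
    return violations


healthy = [
    {"id": "r1", "status": "superseded", "superseded_by": "r2"},
    {"id": "r2", "status": "authoritative", "superseded_by": None},
]

# Positive check: the healthy state passes.
positive = audit(healthy, ["r2"])

# Negative control 1: remove the supersession link on a copy.
sabotaged = deepcopy(healthy)
sabotaged[0]["superseded_by"] = None
caught_link = audit(sabotaged, ["r2"])

# Negative control 2: leak the superseded record into the answer context.
caught_exposure = audit(healthy, ["r1", "r2"])
```

A negative control only counts as passed when the audit reports a violation. A suite that stays silent on a sabotaged state is itself broken.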

Internal proof map

What's proven. What's partial. What's next.

Promotion and supersession

Automatic promotion tests unlabeled conflict records and routes inferred authoritative memory into governed state before answer construction. A same-budget smart-prompt two-pass baseline narrowed the claim (see the falsification panel below); the surviving claim is governed answer-path control, not raw selection.

Internal support

Stale-context suppression

Memory pressure tests whether authoritative memory survives plausible stale context once authority metadata exists.

Internal support

Context hygiene

Tool-context compression remains a supporting diagnostic for smaller answer surfaces, but it is not a core wedge until matched top-k retrieval baselines are included.

Partial support

Auditability

The auditability diagnostic passed 437/437 deterministic checks across 36 lifecycle scenarios: 257 positive governance invariants plus 180/180 negative controls that deliberately disabled governance or corrupted state.

Internal support

State-conditioned behavior change

The lifecycle diagnostic tests whether persisted governed state, not model weights, changes future behavior while controls keep stale, rejected, out-of-scope, and rolled-back records from steering the answer.

Internal support

Stateful compounding across rollback

In the n=200 compounding diagnostic, FieldHash held 2400/2400 normal lifecycle reads, reverted rollback 200/200, suppressed rejected and superseded state, and repaired injected drift 200/200. Stateless selectors shown only record text missed rollback, but a model given the full operation log replays it and ties — the advantage is materialized governed state, not selector superiority.

Internal support

Strict fact-span fidelity

Automatic promotion reached 99/100 role-equivalent current facts with Gemini and 95/100 with GPT-5.5 under field-role normalization. Strict exact source-span extraction remained provider-dependent at 68/100 and 76/100 respectively, and is not claimed as solved.

Partial support

Cross-session synthesis continuity

In a follow-up Deep Synthesis diagnostic, FieldHash reused scoped hypotheses from the prior synthesis, filtered unrelated priors, and narrowed the next investigation into new testable diagnostics. This is a runtime observation, not yet a repeatable benchmark.

Partial support

Public knowledge-update subset

On LongMemEval-S knowledge-update using Gemini 3.1 Flash Lite throughout, FieldHash recovered 69/78 answers versus 64/78 for retrieval-only and 60/78 for a compressed same-model smart two-pass selector. The retrieval-only margin was directional but not significant (McNemar p=0.26685); the smart-selector comparison was significant (57 both correct, 12 FieldHash-only, 3 smart-only, 6 neither; p=0.03515625) but used a much smaller answer context. This is external-dataset diagnostic evidence, not third-party validation or a broad continual-learning proof.
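The smart-selector p-value can be checked from the discordant counts quoted above. The sketch assumes the reported figure is the exact two-sided binomial form of McNemar's test; the text does not name the variant, but that form reproduces the value exactly. The function name is ours.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial tail P(X <= k) with X ~ Binomial(n, 0.5), doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 12 FieldHash-only and 3 smart-only pairs, as reported above.
p_smart = mcnemar_exact(12, 3)

# The four paired cells also recover the headline accuracies.
both, fh_only, smart_only, neither = 57, 12, 3, 6
fieldhash_correct = both + fh_only
smart_correct = both + smart_only
total = both + fh_only + smart_only + neither
```

With 15 discordant pairs, the tail sums to 576/32768, so `p_smart` comes out to 0.03515625. The concordant cells (57 and 6) do not enter the test, which is why the paired breakdown matters more than the raw 69 versus 60.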

Partial support

Public conversational-memory boundary

LoCoMo-MC10 is currently a boundary result, not a win claim. With Gemini 3.1 Flash Lite, FieldHash scored 41/50 on choice-record memory versus 40/50 retrieval-only and 39/50 recency-aware; on turn-record memory it scored 37/50 versus 36/50 retrieval-only and 35/50 recency-aware. Paired comparisons against retrieval-only were non-significant (McNemar p=1.0 in both artifacts), and no same-model smart two-pass selector was included.

Partial support

Longitudinal external compounding

Broader validation still needs a public, third-party, multi-session knowledge-update benchmark with native scoring, competent memory baselines, and row-level traces before claiming broad external continual-learning validation.

Next proof bar

Runtime observation

Follow-up synthesis can inherit the previous investigation.

Deep Synthesis is not just a one-shot report generator. In a follow-up diagnostic, the next run began from the prior thread's strongest hypotheses, ignored unrelated priors, and refined the work into a narrower test plan.

Prior reuse

It carried forward the useful state.

The follow-up run reused the earlier question-shape, prompt-position, provider-parity, hub-compression, authority-density, and option-conditioning hypotheses instead of rediscovering the same ground.

Scoped filtering

It did not accept every memory.

The learning layer applied project-scope filtering: relevant priors were promoted into the new synthesis path, while unrelated experiment shapes were left out.

Narrower output

It produced sharper next tests.

The broad failure hypothesis narrowed into concrete diagnostics: operator-erasure, gold-rank elasticity, and turn-polarity mismatch. That is the practical learning behavior: keep the useful prior, refine the next experiment.

This is a qualitative runtime trace, not a public benchmark headline. It supports the governed-learning story by showing scoped prior reuse across synthesis sessions, but it does not prove broad external continual learning, weight-level adaptation, or statistically repeatable performance until run as a formal multi-session benchmark.

Supporting diagnostic

Retrieval recall is not authority resolution.

A later named-framework diagnostic tested the same failure mode against hosted memory SDK paths. The chart below keeps the framework names off the synthesis page, but preserves the point: the current record can appear in candidates and still lose to stale context before the answer.

Anonymized aggregate · two hosted memory-framework SDK paths · 60 SDK rows plus 60 paired governed rows

Current record surfaced in SDK candidates: 60/60
SDK direct paths selected current record: 1/60
SDK direct paths selected stale context: 59/60
FieldHash paired governed path selected current record: 60/60

Why it supports the loop.

The flagship promotion benchmark shows FieldHash can infer which memory should become authoritative. The framework diagnostic asks a narrower follow-up: after retrieval, does the authoritative memory actually control the answer path?

This remains an internal diagnostic on our corpus, not external validation or a homepage claim. It is linked here because it clarifies the loop's central distinction: retrieving the right memory is necessary, but not sufficient. One SDK row is a hosted graph-search/top-result configuration, not a claim about that framework's best possible temporal setup.

Read the diagnostic methods

Internal diagnostic, not external validation or a homepage superiority claim. It uses our semantic-authority corpus and deterministic authoritative-value scoring; the linked methods page discloses the named SDK configurations, extraction-mode boundary control, corrected ingestion gate, and full caveats. The SDK counts above aggregate two hosted memory-framework SDK paths from the latest Flash Lite diagnostic; the paired FieldHash governed path held 30/30 in each run.
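The two quantities in the chart are scored separately, which a short sketch makes explicit. The rows below are invented toy data, not the diagnostic's rows, and the field names are hypothetical.

```python
def score_rows(rows):
    """Count rows where the current record was surfaced and rows where it was selected."""
    surfaced = sum(r["current_id"] in r["candidate_ids"] for r in rows)
    selected = sum(r["selected_id"] == r["current_id"] for r in rows)
    return surfaced, selected, len(rows)


# Toy rows: the current record is always retrieved but rarely wins the answer.
rows = [
    {"current_id": "m2", "candidate_ids": ["m1", "m2"], "selected_id": "m1"},
    {"current_id": "m5", "candidate_ids": ["m5", "m4"], "selected_id": "m5"},
    {"current_id": "m9", "candidate_ids": ["m8", "m9"], "selected_id": "m8"},
]

surfaced, selected, total = score_rows(rows)
```

Here recall is 3 of 3 while selection is 1 of 3. A benchmark that reports only the first number would miss the failure the loop is designed to prevent.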

What this supports

Governed, non-parametric learning can be decomposed into auditable steps.

FieldHash identifies the authoritative memory, preserves it under conflict, keeps governed state durable through rollback and compaction, and verifies that stale, rejected, or corrupted state cannot silently steer answers. Future answers change through governed memory and answer-path conditioning.

The base model is not retrained. The visible parts of the loop — promotion, supersession, rollback, stale suppression, and audit — are where the learning lives.

Caveats

What this does not prove.

This is a synthesis of internal benchmarks plus public-dataset diagnostics, not a third-party validation package. It does not prove weight-level continual learning, universal memory safety, customer deployment outcomes, or superiority over named third-party memory frameworks.

The remaining work is broader and external: reproduce the loop on a public multi-session knowledge-update benchmark with native scoring, competent recency-aware and memory-system baselines, and row-level traces.

Falsification tests

Two experiments we ran to try to break our own claims.

A claim only survives if a fair test could have killed it. The first test asked whether governance was actually doing the work, or whether a strong same-budget prompt could do the same job. The second sabotaged the governance machinery on purpose and asked whether the audit layer would notice.

Test 1 — Same-budget baseline

We ran the prompt that could have killed the wedge.

A two-pass smart-prompt Gemini baseline at the same budget as the governed path selected the current record 30/30 on the blind n=30 corpus. It answered 28/30 with zero stale substitutions. Selection is matchable by a careful prompt. What survives is governed, auditable answer-path control — not the claim that frontier models cannot find the current memory.

Selection

30/30

smart baseline found the current record

Answer

28/30

two unstructured misses; no stale tokens

Test 2 — Deliberate sabotage

We broke the governance on purpose. The audit caught every break.

Across 36 lifecycle scenarios, we disabled governance, removed supersession links, re-promoted rejected memory, leaked wrong-project scope, and re-promoted stale memory to the answer path. The auditability diagnostic detected every sabotage. All 437/437 deterministic checks passed: 257 positive governance invariants held and 180/180 negative controls fired. Zero LLM calls: this is the deterministic layer.

Sabotage caught

180/180

five breakage modes × 36 scenarios

Checks passed

437/437

deterministic, no LLM calls

Internal deterministic code-path diagnostic with zero LLM calls. It verifies governed memory state, falsifiability checks, and audit telemetry behavior; it is not external validation, not a named memory-system comparison, and not evidence of base-model weight learning or general reasoning improvement.

Read the evidence behind the loop.

Start with the public-safe synthesis report, then follow the component studies behind each loop step.