Memory control plane for small-model long-memory inference.
True north: prove that a 4,096-token-class ~3B local model can use substrate-curated memory to handle long-memory pressure, isolated recall, and provenance-controlled retrieval at materially lower serving cost than brute-force long context.
What This System Is
A controller-mediated memory layer around a small local model. The model is not simply given an infinite prompt; the controller selects, scopes, and verifies memory payloads so the model can answer from long-memory state without brute-force long-context serving.
- Serving target: local/self-hosted Granite GGUF through the isolated direct-IP lane.
- Security target: per-tenant memory isolation, active binding precedence, revoked/stale memory rejection, and empty-result honesty.
- Business target: reduce the cost and reliability penalty of long-memory inference across many users.
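The security targets above can be sketched as a small selection function. This is a minimal illustration, not the system's controller: `MemoryRecord`, `select_payload`, the integer `version` field, and the `key@vN` handle format are all hypothetical names chosen for this example. It assumes the newest version of a key is the active binding and that a revoked active binding yields an empty result rather than a fallback to older memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MemoryRecord:
    tenant_id: str
    key: str
    version: int          # assumption: a higher version is a newer binding
    body: str
    revoked: bool = False

    @property
    def handle(self) -> str:
        # Hypothetical provenance handle format.
        return f"{self.key}@v{self.version}"


def select_payload(records: List[MemoryRecord], tenant_id: str, key: str) -> dict:
    # Per-tenant isolation: foreign-tenant records never enter the candidate set.
    scoped = [r for r in records if r.tenant_id == tenant_id and r.key == key]
    if not scoped:
        # Empty-result honesty: report that nothing was selected.
        return {"selected": None, "rejected": [], "reason": "no_match"}

    # Active binding precedence: the newest version wins.
    newest = max(scoped, key=lambda r: r.version)
    stale = sorted(r.handle for r in scoped if r is not newest)

    if newest.revoked:
        # Assumption: do not fall back to stale versions when the active one is revoked.
        return {
            "selected": None,
            "rejected": sorted(stale + [newest.handle]),
            "reason": "revoked",
        }
    return {"selected": newest.body, "rejected": stale, "reason": "active_binding"}
```

In this sketch the model only ever sees the `selected` body; the `rejected` handles stay on the controller side for audit.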
What Is Actually Strong
- Pressure handling: successful prompts have exceeded the native context class by tens of times; the largest measured live figure is 198,888 prompt tokens, 48.6x the native class.
- Structured output: strict JSON / digest-handle patterns are working in prepared evals.
- Control-plane safety: empty result, duplicate conflict, and ambiguous candidate abstention tests are now explicit repeatable harnesses.
- Product-shaped eval: the v0.51 live run covered research, story canon, relationship boundaries, personal preferences, and long-running agent continuity.
- Cost guard: current prompt-injection runs are dry-run or direct self-hosted; frontier generation endpoints are blocked by policy audit.
Current Verdict, In Human Terms
Evidence Ladder
| Layer | Status | Evidence | Meaning |
|---|---|---|---|
| Transport under pressure | measured live | 198,888 prompt tokens, 48.6x native class | The endpoint path can accept and complete much larger prompt pressure than the model's native context class. |
| Quality floor | measured live | Semantic 100.0%; provenance 80.0% | Basic structured recall can work under controlled pressure. |
| Pocket generalization | measured mixed | 825,958 prompt tokens total; exact provenance 28.6% | There is a real exactness pocket, but the system drifts outside that pocket. |
| Empty-result contract | prepared dry-run | v0.47 verifier 100.0% | The harness now tests that no memory selected means no invented placeholder memory. |
| Duplicate conflict rejection | prepared dry-run | v0.48 page safe 100.0%; bundle safe 83.3% | The harness now tests stale/shadow duplicate binding conflicts before live launch. |
| Ambiguous candidate abstention | prepared dry-run | v0.49 abstention modes 100%; aggregate 95.8% | The harness now tests that candidate hints do not become recalled memory without controller selection. |
| Personal entity coherence | measured live | v0.51 true-north 84.6%; coherence 81.2%; rows 80/80; prompt tokens 1,199,692 | First complete live eval shaped like single-user long-term memory across research, story, relationship, psychology, and agent workflows. |
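The 48.6x figure in the transport row follows from dividing prompt tokens by the native context size. The sketch below assumes the "4,096-token class" means exactly 4,096 tokens; `pressure_multiplier` is a hypothetical helper name.

```python
NATIVE_CONTEXT_TOKENS = 4096  # assumption: "4,096-token class" taken literally


def pressure_multiplier(prompt_tokens: int, native: int = NATIVE_CONTEXT_TOKENS) -> float:
    """Return how many times the prompt exceeds the native context size."""
    return prompt_tokens / native


print(f"{pressure_multiplier(198_888):.1f}x")  # 48.6x
```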
v0.51 Live Findings
- Transport: 80/80 rows returned HTTP 200; no non-200 rows recorded.
- Format control: parse success 100.0%; entity id exactness 92.5%.
- Memory safety: forbidden-fact absence 100.0%; no stale/forbidden phrase leakage in scored answers.
- Recall: required fact recall 95.4%; coherence pass 81.2%.
- Focused audit: rerunning the original failed slices produced 25.0% coherence, while parse success and forbidden-fact absence both stayed at 100.0%.
v0.51 Constraints
- Latency: p50 2.4586s; p90 89.9753s; max 93.018s across completed rows.
- Pressure behavior: 1024-band rows were slow but scored best: coherence 100.0%.
- Weakest domain: long-running agent task at 56.2%; failures were mostly entity-id exactness and temporal update handling, not fact leakage.
- Temporal updates: some rows remembered core facts but missed the newest update condition; V3 needs explicit update precedence.
Scenario Scorecard
This is the part that should drive product and CTO decisions. It shows where the memory system is already useful, and where V3 should focus.
| Scenario | Coherence | Meaning | Decision |
|---|---|---|---|
| Relationship boundary memory | 100.0% | Best current product fit: safety-sensitive stated preferences and boundaries stayed clean. | candidate demo lane |
| Story-world canon | 87.5% | Strong for creative continuity; still needs explicit supersession rules for canon changes. | keep testing |
| Personal preference / psychology | 81.2% | Useful but must be conservative because wrong continuity can feel personally invasive. | guard heavily |
| Research program continuity | 81.2% | Promising for long-running research, but update precedence is the blocker. | V3 target |
| Long-running agent task | 56.2% | The main failure cluster: facts are often present, but entity binding and task identity drift. | do not demo yet |
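As a consistency check, the five scenario scores can be averaged and compared with the overall v0.51 coherence pass of 81.2%. The sketch assumes every scenario has the same number of rows, so an unweighted mean is valid; the board does not state per-scenario row counts.

```python
# Scenario coherence percentages as published in the scorecard above.
scenario_coherence = {
    "relationship_boundary": 100.0,
    "story_world_canon": 87.5,
    "personal_preference": 81.2,
    "research_continuity": 81.2,
    "long_running_agent": 56.2,
}

# Assumption: equal row counts per scenario, so a plain mean is appropriate.
mean_coherence = sum(scenario_coherence.values()) / len(scenario_coherence)
print(f"{mean_coherence:.1f}%")
```

Under that assumption the mean lands within rounding of the reported 81.2% overall coherence.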
v0.52 Targeted Follow-Up
v0.52 is live evidence now. It targeted the v0.51 failure cluster: update precedence and stable memory-key/entity binding under stale and foreign near-duplicate pressure.
- Composite live result: 24 rows, 24 HTTP 200 after focused replacement, 74.1% true-north score.
- Safety preserved: stale-fact absence 100.0%; foreign-fact absence 100.0%.
- Unexpected result: prose memory coherence 91.7%; structured control-plane coherence 66.7%.
- Pressure cliff: 1024-pressure coherence 50.0%; 64-pressure coherence 100.0%.
Convergence Standard
Future pages should publish only after the run has a named hypothesis, stored scores, and a convergence/review artifact. Raw visual output without this context is not a research result.
- Every page must state the research question and why it matters to true north.
- Every chart/table must have a decision attached to it.
- Every institutional claim must distinguish measured fact, inference, and open question.
- Dry-runs may validate harness logic; they must not be framed as live model capability.
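The publishing rules above could be enforced with a simple gate. This is a sketch under assumed names: `RunRecord`, `publish_blockers`, and `evidence_label` are hypothetical and not part of any existing harness.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RunRecord:
    version: str
    hypothesis: Optional[str] = None          # named research question
    scores: Dict[str, float] = field(default_factory=dict)
    review_artifact: Optional[str] = None     # path or id of the convergence/review artifact
    is_dry_run: bool = False


def publish_blockers(run: RunRecord) -> List[str]:
    """Return the reasons a page for this run must not be published yet."""
    blockers = []
    if not run.hypothesis:
        blockers.append("missing named hypothesis")
    if not run.scores:
        blockers.append("missing stored scores")
    if not run.review_artifact:
        blockers.append("missing convergence/review artifact")
    return blockers


def evidence_label(run: RunRecord) -> str:
    # Dry-runs validate harness logic only; never label them as live capability.
    return "prepared dry-run" if run.is_dry_run else "measured live"
```

A page would be publishable only when `publish_blockers(run)` returns an empty list, and its status column would come from `evidence_label(run)`.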
v0.53 Format / Schema / Pressure Diagnostic
v0.53 is live evidence. It tested whether the v0.52 failure cluster was caused by record format, output schema, or pressure budget. The result is constraining: every row returned HTTP 200, but quality fell sharply under the diagnostic matrix.
| Slice | Measured result | CTO meaning |
|---|---|---|
| All rows | 36 rows, 36 HTTP 200, 811,194 prompt tokens | Transport is not the blocker in this run; quality and binding are. |
| Strict true north | 25.9% | The strict machine-contract path is not yet robust enough for production memory claims. |
| Semantic true north | 46.3% | Even allowing field recovery, many rows lose rejected-key completeness or admit stale/foreign facts. |
| Best memory format | prose coherence 50.0%; compact KV 25.0%; verbose JSONL 8.3% | More explicit structure did not solve the problem. Prose remains the current reliable baseline. |
| Output contract | strict JSON coherence 55.6%; field-tagged coherence 0.0% | Relaxing the answer envelope did not recover coherence. Use strict JSON plus verifier/repair, not looser text. |
| Pressure cliff | 0-pressure semantic pass 72.2%; 1024-pressure semantic pass 33.3% | Pressure still materially degrades reliable memory behavior. |
v0.54 Runtime-Selected Payload Diagnostic
v0.54 is live evidence. It tested whether the system improves when the controller sends selected memory instead of asking the model to sort through stale and foreign fact bodies. This is the strongest post-v0.53 result and should drive V3.
| Slice | Measured result | CTO meaning |
|---|---|---|
| All rows | 36 rows, 36 HTTP 200, 822,190 prompt tokens | Transport stayed clean; this is a quality/control-plane signal. |
| Strict true north | 77.8% | Recovered sharply from v0.53 strict true-north 25.9%. |
| Semantic true north | 85.2% | Recovered sharply from v0.53 semantic true-north 46.3%. |
| Noisy full payload | coherence 58.3%; semantic 58.3% | Letting the model see stale and foreign fact bodies is worse. |
| Selected current only | coherence 100.0%; semantic 100.0% | Best quality result. Runtime selection works when audit handles are not required. |
| Selected + rejected handles | coherence 66.7%; semantic 91.7% | Promising audit envelope, but exact rejected-key set formatting still needs repair. |
| Verifier repair | repair attempted 13.9%; repair improved 0.0% | The current repair prompt does not fix failures. V3 needs deterministic repair or a stronger constrained decoder. |
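One way to read "deterministic repair" for the rejected-key set is to let the controller, which already knows what it rejected, overwrite that field instead of re-prompting the model. The sketch below is an assumption about how that could look, not the current verifier: the `rejected_keys` field name and all function names are hypothetical.

```python
import json
from typing import Iterable, List


def canonical_rejected(handles: Iterable[str]) -> List[str]:
    # Deduplicate and sort so the same set always serializes the same way.
    return sorted(set(handles))


def verify(raw: str, expected_rejected: Iterable[str]) -> bool:
    """Strict check: valid JSON object whose rejected_keys match exactly."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return obj.get("rejected_keys") == canonical_rejected(expected_rejected)


def deterministic_repair(raw: str, expected_rejected: Iterable[str]) -> str:
    """Replace the model-written rejected_keys with the controller's own set."""
    obj = json.loads(raw)  # malformed JSON is not repaired here and will raise
    if not isinstance(obj, dict):
        raise ValueError("answer envelope must be a JSON object")
    obj["rejected_keys"] = canonical_rejected(expected_rejected)
    return json.dumps(obj, sort_keys=True)
```

The model's answer fields are left untouched; only the audit field is rewritten from controller state, so the repair cannot introduce stale or foreign fact bodies.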
What Is Not Proven
- Not proven: general random-access exact recall across arbitrary positions.
- Not proven: production multi-tenant safety under live traffic.
- Not proven: transfer of the same multiplier to frontier-scale models.
- Not proven: stable correction on known failed slices; focused rerun confirmed the weak cluster.
- Not claimed: this board does not report a leaderboard-certified MTRAG result.
V3 Optimization Targets
- Make controller-selected payloads first-class: selected, candidate, stale, revoked, and foreign-tenant states must be impossible to confuse.
- Improve arbitrary-position exactness, especially non-tail provenance recovery.
- Promote focused-row/resume execution into the standard harness so hard suite walls do not create partial matrices.
- Promote update precedence to a first-class memory field: original fact, superseding fact, effective timestamp, and scope.
- Add stable entity-id constraints for agent-task memories; current failures often recall facts but return the wrong entity id.
- Use runtime-selected payloads as the V3 default; v0.54 shows selected-current-only hit 100% coherence in this matrix.
- Keep rejected handles, but make their exact set deterministic; selected-with-handles reached high semantic pass but lower strict coherence.
- Do not rely on the current repair prompt; v0.54 repair attempts produced 0% measured improvement.
- Expose substrate observability per run: what was selected, what was rejected, and why.
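The update-precedence target (original fact, superseding fact, effective timestamp, scope) can be illustrated with a small resolver. This is a sketch under assumptions: `FactVersion` and `resolve_current` are hypothetical names, and precedence is assumed to mean "latest effective timestamp not after the query time, within the same entity and scope".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class FactVersion:
    entity_id: str          # stable entity id, never rewritten on update
    scope: str              # e.g. a task or project the fact applies to
    value: str
    effective_at: datetime  # when this version starts to apply


def resolve_current(
    facts: List[FactVersion], entity_id: str, scope: str, as_of: datetime
) -> Optional[FactVersion]:
    """Return the version in force at `as_of`, or None if nothing applies."""
    in_force = [
        f for f in facts
        if f.entity_id == entity_id and f.scope == scope and f.effective_at <= as_of
    ]
    if not in_force:
        return None
    # The superseding fact is simply the one with the latest effective timestamp.
    return max(in_force, key=lambda f: f.effective_at)


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)
```

With an original fact effective in January and a superseding fact effective in June, a query dated July resolves to the superseding fact while a query dated March still resolves to the original.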
Next Decision
- Continue: build V3 update-precedence and stable-entity binding tests from the confirmed failure cluster.
- Tighten: replace broad dashboards with decision boards after major rounds only.
- Measure: add latency/cost normalization to the true-north score so slow 1024-band wins do not hide serving constraints.
- Do not claim: "infinite perfect memory" or "general exact recall" yet.
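One possible shape for latency normalization is to discount the quality score whenever a latency percentile exceeds a serving budget. The formula, the function name, and the 10-second budget below are purely illustrative assumptions; the board does not define a normalization yet.

```python
def latency_normalized(score: float, latency_s: float, budget_s: float) -> float:
    """Scale a 0..1 quality score down when latency exceeds the budget.

    Illustrative formula only: score * min(1, budget / latency).
    """
    if latency_s <= 0 or budget_s <= 0:
        raise ValueError("latency and budget must be positive")
    return score * min(1.0, budget_s / latency_s)


# v0.51 figures from this board; the 10 s budget is a hypothetical choice.
TRUE_NORTH = 0.846
print(latency_normalized(TRUE_NORTH, 2.4586, 10.0))   # p50 is within budget
print(latency_normalized(TRUE_NORTH, 89.9753, 10.0))  # p90 is far over budget
```

Under this sketch a run keeps its full score when it meets the budget and is penalized in proportion to how far the chosen percentile overshoots it, which makes slow high-pressure wins visible as serving constraints.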
Deck-Safe One-Liner
Hypernym Infinite Memory is a memory control plane for model fleets: per-tenant memory stores, controller-curated recall, provenance-handle verification, and lower long-memory serving cost for small local models under extreme context pressure.
Deck-safe caveat: current evidence supports pressure handling and prepared control-plane safety tests; general exact arbitrary recall remains the optimization target.