Hypernym Infinite Memory / Librarian · CTO research board · Direct IP only

Memory control plane for small-model long-memory inference.

True north: prove that a 4,096-token-class ~3B local model can use substrate-curated memory to handle long-memory pressure, isolated recall, and provenance-controlled retrieval at materially lower serving cost than brute-force long context.

Current decision: continue research and build V3 around the confirmed weak cluster. The complete live v0.51 run shows strong structured memory behavior with zero forbidden-fact leakage, while temporal updates and agent-identity continuity remain optimization targets.

Why this page exists: it is not a pretty dashboard. It is the current decision record: what we can say, what we cannot say, and what the next eval must prove before a CTO should change direction.

Question: Can a 4,096-context-class small model behave like a useful long-memory system when a controller curates memory?

Evidence: 80 live rows, 1,199,692 prompt tokens, 84.6% true-north score.

Constraint: Failed-slice rerun coherence stayed at 25.0%; the weak cluster is real.

Current test: v0.54 live found the V3 lever: controller-selected current payloads beat noisy payloads; repair did not improve failed rows.

Largest observed pressure

198,888

48.6x a 4,096-token native context class. Measured successful pressure, not a base-context increase.
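
The multiplier is plain division of the two figures on this card. A quick check, with constant names chosen here for illustration:

```python
# Reproduce the pressure multiplier from the two figures quoted on this board.
NATIVE_CONTEXT_TOKENS = 4_096     # native context class
LARGEST_PROMPT_TOKENS = 198_888   # largest observed successful prompt pressure

pressure_multiplier = LARGEST_PROMPT_TOKENS / NATIVE_CONTEXT_TOKENS
print(f"{pressure_multiplier:.1f}x")
```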

Transport liveness

3 / 4

Known compact transport probes that completed under the request-path wall.

Pocket semantic score

83.3%

Latest pocket-generalization semantic mean across completed rows.

Personal-memory score

84.6%

v0.51 live true-north score across the complete 80-row matrix.

v0.53 semantic score

46.3%

Format/schema/pressure diagnostic. This is a constraint finding, not a demo score.

v0.54 semantic score

85.2%

Runtime-selected minimal-payload diagnostic. This is the current V3 direction signal.

What This System Is

A controller-mediated memory layer around a small local model. The model is not simply given an infinite prompt; the controller selects, scopes, and verifies memory payloads so the model can answer from long-memory state without brute-force long-context serving.

Serving target: local/self-hosted Granite GGUF through the isolated direct-IP lane.
Security target: per-tenant memory isolation, active binding precedence, revoked/stale memory rejection, and empty-result honesty.
Business target: reduce the cost and reliability penalty of long-memory inference across many users.
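
The select / scope / verify idea can be sketched as below. This is a minimal illustration, not the Librarian implementation: the record fields, the state names, the keyword match, and the `NO_MEMORY_SELECTED` marker are all assumptions made for the example, and active binding precedence is not modeled.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryState(Enum):
    ACTIVE = "active"
    STALE = "stale"
    REVOKED = "revoked"


@dataclass(frozen=True)
class MemoryRecord:
    handle: str          # provenance handle (hypothetical format)
    tenant_id: str
    state: MemoryState
    text: str


def select_payload(records, tenant_id, query_terms):
    """Return (selected records, rejected handles) for a single tenant."""
    terms = [t.lower() for t in query_terms]
    selected, rejected_handles = [], []
    for record in records:
        if record.tenant_id != tenant_id:
            continue  # per-tenant isolation: foreign memory is never considered
        if not any(term in record.text.lower() for term in terms):
            continue  # naive keyword match stands in for real retrieval
        if record.state is MemoryState.ACTIVE:
            selected.append(record)
        else:
            rejected_handles.append(record.handle)  # stale or revoked
    return selected, sorted(rejected_handles)


def build_prompt_memory(selected):
    """Render the payload, or an explicit empty marker when nothing was selected."""
    if not selected:
        return "NO_MEMORY_SELECTED"  # empty-result honesty: no placeholder memory
    return "\n".join(f"[{r.handle}] {r.text}" for r in selected)
```

In this sketch the model only ever sees the rendered selected records; stale and revoked bodies stay with the controller and only their handles are reported.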

What Is Actually Strong

Pressure handling: successful prompts have exceeded the native context class by tens of times, up to 48.6x (198,888 prompt tokens).
Structured output: strict JSON / digest-handle patterns are working in prepared evals.
Control-plane safety: empty result, duplicate conflict, and ambiguous candidate abstention tests are now explicit repeatable harnesses.
Product-shaped eval: v0.51 live covered research, story canon, relationship boundaries, personal preferences, and long-running agent continuity.
Cost guard: current prompt-injection runs are dry-run or direct self-hosted; frontier generation endpoints are blocked by policy audit.

Current Verdict, In Human Terms

Deck-safe claim

Small local models can be made materially more useful for long-memory workflows when memory selection, rejection, and provenance handling are moved into a controller layer instead of being left to raw prompt text.

Hard proof

v0.51 completed 80/80 live rows with 80 HTTP 200 responses, 100.0% parse success, 100.0% forbidden-fact absence, and 95.4% required-fact recall.

Hard weakness

The system is not yet robust enough for arbitrary exact long-term recall: failed-slice rerun coherence was 25.0%, especially in update precedence and long-running agent entity binding.

Research action

Do fewer visuals. After each major run, publish only this form: question, evidence, limits, next decision. v0.52 live is now complete after focused replacement of transient 503 rows.

Evidence Ladder

Layer	Status	Evidence	Meaning
Transport under pressure	measured live	198,888 prompt tokens, 48.6x native class	The endpoint path can accept and complete much larger prompt pressure than the model's native context class.
Quality floor	measured live	Semantic 100.0%; provenance 80.0%	Basic structured recall can work under controlled pressure.
Pocket generalization	measured mixed	825,958 prompt tokens total; exact provenance 28.6%	There is a real exactness pocket, but the system drifts outside that pocket.
Empty-result contract	prepared dry-run	v0.47 verifier 100.0%	The harness now tests that no memory selected means no invented placeholder memory.
Duplicate conflict rejection	prepared dry-run	v0.48 page safe 100.0%; bundle safe 83.3%	The harness now tests stale/shadow duplicate binding conflicts before live launch.
Ambiguous candidate abstention	prepared dry-run	v0.49 abstention modes 100%; aggregate 95.8%	The harness now tests that candidate hints do not become recalled memory without controller selection.
Personal entity coherence	measured live	v0.51 true-north 84.6%; coherence 81.2%; rows 80/80; prompt tokens 1,199,692	First complete live eval shaped like single-user long-term memory across research, story, relationship, psychology, and agent workflows.

v0.51 Live Findings

Transport: 80/80 rows returned HTTP 200; no non-200 rows recorded.
Format control: parse success 100.0%; entity id exactness 92.5%.
Memory safety: forbidden-fact absence 100.0%; no stale/forbidden phrase leakage in scored answers.
Recall: required fact recall 95.4%; coherence pass 81.2%.
Focused audit: rerunning the original failed slices produced 25.0% coherence, while parse success and forbidden-fact absence both stayed at 100.0%.

v0.51 Constraints

Latency: p50 2.4586s; p90 89.9753s; max 93.018s across completed rows.
Pressure behavior: 1024-band rows were slow but scored best: coherence 100.0%.
Weakest domain: long-running agent task at 56.2%; failures were mostly entity-id exactness and temporal update handling, not fact leakage.
Temporal updates: some rows remembered core facts but missed the newest update condition; V3 needs explicit update precedence.
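
One way to make update precedence explicit is to resolve the current fact in the controller before the model sees anything. The sketch below assumes a record shape with a memory key, scope, effective timestamp, and a pointer to the superseded value; these field names and the sample data are illustrative, not the V3 schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class FactVersion:
    memory_key: str                   # stable key for the fact (hypothetical)
    scope: str                        # where the fact applies (hypothetical)
    value: str
    effective_at: datetime            # when this version starts to apply
    supersedes: Optional[str] = None  # value of the version it replaces, if any


def current_fact(versions: Iterable[FactVersion], memory_key: str,
                 scope: str, as_of: datetime) -> Optional[FactVersion]:
    """Return the newest version in effect at `as_of`, or None if there is none."""
    in_effect = [
        v for v in versions
        if v.memory_key == memory_key
        and v.scope == scope
        and v.effective_at <= as_of
    ]
    return max(in_effect, key=lambda v: v.effective_at, default=None)


# Illustrative data only: a deadline that was later moved.
versions = [
    FactVersion("deadline", "project-a", "Friday", datetime(2024, 1, 1)),
    FactVersion("deadline", "project-a", "Monday", datetime(2024, 2, 1),
                supersedes="Friday"),
]
winner = current_fact(versions, "deadline", "project-a", datetime(2024, 3, 1))
print(winner.value)
```

Resolving precedence deterministically means the model is handed only the winning version, rather than being asked to decide which of two conflicting facts is newer.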

Scenario Scorecard

This is the part that should drive product and CTO decisions. It shows where the memory system is already useful, and where V3 should focus.

Scenario	Coherence	Meaning	Decision
Relationship boundary memory	100.0%	Best current product fit: safety-sensitive stated preferences and boundaries stayed clean.	candidate demo lane
Story-world canon	87.5%	Strong for creative continuity; still needs explicit supersession rules for canon changes.	keep testing
Personal preference / psychology	81.2%	Useful but must be conservative because wrong continuity can feel personally invasive.	guard heavily
Research program continuity	81.2%	Promising for long-running research, but update precedence is the blocker.	V3 target
Long-running agent task	56.2%	The main failure cluster: facts are often present, but entity binding and task identity drift.	do not demo yet
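
As a consistency check, the scenario figures can be averaged and compared with the 81.2% overall coherence reported for v0.51. The sketch assumes each scenario carries equal weight; the board does not state per-scenario row counts.

```python
# Coherence per scenario, as shown in the scorecard (percent).
scenario_coherence = {
    "Relationship boundary memory": 100.0,
    "Story-world canon": 87.5,
    "Personal preference / psychology": 81.2,
    "Research program continuity": 81.2,
    "Long-running agent task": 56.2,
}

# Assumption: equal weight per scenario.
mean_coherence = sum(scenario_coherence.values()) / len(scenario_coherence)
weakest = min(scenario_coherence, key=scenario_coherence.get)
print(f"mean {mean_coherence:.1f}%, weakest: {weakest}")
```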

v0.52 Targeted Follow-Up

v0.52 is live evidence now. It targeted the v0.51 failure cluster: update precedence and stable memory-key/entity binding under stale and foreign near-duplicate pressure.

Composite live result: 24 rows, 24 HTTP 200 after focused replacement, 74.1% true-north score.
Safety preserved: stale-fact absence 100.0%; foreign-fact absence 100.0%.
Unexpected result: prose memory coherence 91.7%; structured control-plane coherence 66.7%.
Pressure cliff: 1024-pressure coherence 50.0%; 64-pressure coherence 100.0%.

Convergence Standard

Future pages should only publish after the run has a named hypothesis, stored scores, and a convergence/review artifact. Raw visual output without this context is not a research result.

Every page must state the research question and why it matters to true north.
Every chart/table must have a decision attached to it.
Every institutional claim must distinguish measured fact, inference, and open question.
Dry-runs may validate harness logic; they must not be framed as live model capability.

v0.53 Format / Schema / Pressure Diagnostic

v0.53 is live evidence. It tested whether the v0.52 failure cluster was caused by record format, output schema, or pressure budget. The result is constraining: every row returned HTTP 200, but quality fell sharply under the diagnostic matrix.

Slice	Measured result	CTO meaning
All rows	36 rows, 36 HTTP 200, 811,194 prompt tokens	Transport is not the blocker in this run; quality and binding are.
Strict true north	25.9%	The strict machine-contract path is not yet robust enough for production memory claims.
Semantic true north	46.3%	Even allowing field recovery, many rows lose rejected-key completeness or admit stale/foreign facts.
Best memory format	prose coherence 50.0%; compact KV 25.0%; verbose JSONL 8.3%	More explicit structure did not solve the problem. Prose remains the current reliable baseline.
Output contract	strict JSON coherence 55.6%; field-tagged coherence 0.0%	Relaxing the answer envelope did not recover coherence. Use strict JSON plus verifier/repair, not looser text.
Pressure cliff	0-pressure semantic pass 72.2%; 1024-pressure semantic pass 33.3%	Pressure still materially degrades reliable memory behavior.

v0.54 Runtime-Selected Payload Diagnostic

v0.54 is live evidence. It tested whether the system improves when the controller sends selected memory instead of asking the model to sort through stale and foreign fact bodies. This is the strongest post-v0.53 result and should drive V3.

Slice	Measured result	CTO meaning
All rows	36 rows, 36 HTTP 200, 822,190 prompt tokens	Transport stayed clean; this is a quality/control-plane signal.
Strict true north	77.8%	Recovered sharply from v0.53 strict true-north 25.9%.
Semantic true north	85.2%	Recovered sharply from v0.53 semantic true-north 46.3%.
Noisy full payload	coherence 58.3%; semantic 58.3%	Letting the model see stale and foreign fact bodies is worse.
Selected current only	coherence 100.0%; semantic 100.0%	Best quality result. Runtime selection works when audit handles are not required.
Selected + rejected handles	coherence 66.7%; semantic 91.7%	Promising audit envelope, but exact rejected-key set formatting still needs repair.
Verifier repair	repair attempted 13.9%; repair improved 0.0%	The current repair prompt does not fix failures. V3 needs deterministic repair or a stronger constrained decoder.
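
Since the controller already knows which handles it rejected, one form of deterministic repair is to overwrite that field instead of re-prompting the model. A minimal sketch, assuming the answer is a JSON object and that the field is called `rejected_handles` (a hypothetical name, not the board's schema):

```python
import json


def deterministic_repair(model_output: str, rejected_handles) -> str:
    """Replace the model's rejected-handle list with the controller's own set.

    Raises ValueError if the model output is not valid JSON; that case still
    needs a separate failure path.
    """
    answer = json.loads(model_output)
    # Canonical form: de-duplicated and sorted, so the exact set is reproducible.
    answer["rejected_handles"] = sorted(set(rejected_handles))
    return json.dumps(answer, sort_keys=True)


repaired = deterministic_repair(
    '{"answer": "tea", "rejected_handles": ["m9"]}',
    ["m3", "m2", "m2"],
)
print(repaired)
```

This only repairs the audit envelope. It does not correct a wrong answer body, so it addresses the strict-coherence gap on handle formatting rather than the recall failures.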

What Is Not Proven

Not proven: general random-access exact recall across arbitrary positions.
Not proven: production multi-tenant safety under live traffic.
Not proven: transfer of the same multiplier to frontier-scale models.
Not proven: stable correction on known failed slices; focused rerun confirmed the weak cluster.
Not claimed: this board does not report a leaderboard-certified MTRAG result.

V3 Optimization Targets

Make controller-selected payloads first-class: selected, candidate, stale, revoked, and foreign-tenant states must be impossible to confuse.
Improve arbitrary-position exactness, especially non-tail provenance recovery.
Promote focused-row/resume execution into the standard harness so hard suite walls do not create partial matrices.
Promote update precedence to a first-class memory field: original fact, superseding fact, effective timestamp, and scope.
Add stable entity-id constraints for agent-task memories; current failures often recall facts but return the wrong entity id.
Use runtime-selected payloads as the V3 default; v0.54 shows selected-current-only hit 100% coherence in this matrix.
Keep rejected handles, but make their exact set deterministic; selected-with-handles reached high semantic pass but lower strict coherence.
Do not rely on the current repair prompt; v0.54 repair attempts produced 0% measured improvement.
Expose substrate observability per run: what was selected, what was rejected, and why.

Next Decision

Continue: build V3 update-precedence and stable-entity binding tests from the confirmed failure cluster.
Tighten: replace broad dashboards with decision boards after major rounds only.
Measure: add latency/cost normalization to true-north score so slow 1024-band wins do not hide serving constraints.
Do not claim “infinite perfect memory” or “general exact recall” yet.
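
The latency normalization has no agreed formula yet. One possible form, shown purely as a sketch, scales the score by how far p90 latency exceeds a budget; the function name, the penalty shape, and the 10-second budget are all placeholders, not decisions.

```python
def latency_adjusted_score(score_pct: float, p90_latency_s: float,
                           latency_budget_s: float) -> float:
    """Scale a score down in proportion to how far p90 latency exceeds the budget."""
    if p90_latency_s <= 0 or latency_budget_s <= 0:
        raise ValueError("latencies must be positive")
    # Within budget: no penalty. Over budget: score shrinks by budget / latency.
    penalty = min(1.0, latency_budget_s / p90_latency_s)
    return score_pct * penalty


# v0.51 aggregate figures from this board; the budget is a placeholder value.
adjusted = latency_adjusted_score(84.6, 89.9753, latency_budget_s=10.0)
print(f"{adjusted:.1f}")
```

A linear penalty like this is deliberately harsh on slow rows; a per-band version would need per-band latency figures, which this board reports only in aggregate.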

Deck-Safe One-Liner

Hypernym Infinite Memory is a memory control plane for model fleets: per-tenant memory stores, controller-curated recall, provenance-handle verification, and lower long-memory serving cost for small local models under extreme context pressure.

Deck-safe caveat: current evidence supports pressure handling and prepared control-plane safety tests; general exact arbitrary recall remains the optimization target.