PLMR
Pre-tokenizer Latent Memory Routing for byte-level language models.
We introduce Pre-tokenizer Latent Memory Routing (PLMR), a mechanism that injects external persistent memory into a byte-level language model before the model segments its input into patches.
PLMR retrieves from a FAISS index of evidence-span keys and uses retrieval similarity to modulate the entropy threshold that drives BLT-style boundary placement.
θ_t = θ_base − α · max-sim(local-repr_t, top-k retrieved keys) · 𝟙[max-sim > τ]

When retrieved memory is highly similar to the local representation, the threshold drops, making a boundary more likely at that byte position. When retrieval is weak, PLMR collapses to standard BLT. Clean attribution, cheap to ablate, and a new design slot in byte-level language modeling that no published work has occupied.
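A minimal sketch of the retrieval-gated threshold, assuming a faiss.IndexFlatIP over unit-normalized evidence-span keys so that inner product equals cosine similarity; the function name and default hyperparameter values are illustrative, not the trained settings.

```python
import numpy as np
import faiss

# build once over unit-normalized evidence-span keys (d-dimensional):
#   index = faiss.IndexFlatIP(d); index.add(evidence_keys.astype(np.float32))

def modulated_threshold(local_repr, index, theta_base=0.5, alpha=0.3, tau=0.6, k=4):
    """Lower the boundary-entropy threshold at byte position t when retrieved
    memory is similar to the local representation (the theta_t rule above)."""
    # local_repr: (d,) unit-normalized query for byte position t
    sims, _ = index.search(local_repr.reshape(1, -1).astype(np.float32), k)
    max_sim = float(sims[0].max())
    if max_sim > tau:                        # gate 1[max_sim > tau]
        return theta_base - alpha * max_sim  # strong retrieval: boundary more likely
    return theta_base                        # weak retrieval: standard BLT threshold
```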
Four claims, all validated at toy scale.
Mechanism validation
Pre-segment retrieval (condition C) reliably beats post-segment retrieval (CLaRa-style control B), yielding a 7–10% relative reduction in evidence-region next-byte loss on paraphrased and perturbed evidence (p < 5×10⁻⁵). The effect holds across two independent encoder geometries.
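The metric, sketched under an assumption about its form: mean next-byte cross-entropy restricted to evidence-bearing positions, with the relative reduction taken between conditions B and C; tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def evidence_region_loss(logits, targets, evidence_mask):
    """Mean next-byte cross-entropy over evidence-bearing byte positions only.

    logits        : (T, 256) per-byte logits
    targets       : (T,)     next-byte targets
    evidence_mask : (T,)     bool, True inside evidence spans
    """
    per_byte = F.cross_entropy(logits, targets, reduction="none")
    return per_byte[evidence_mask].mean()

# relative reduction of pre-segment retrieval (C) vs. post-segment control (B):
#   rel_reduction = (loss_B - loss_C) / loss_B
```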
Encoder-geometry trade-off + resolution
Naive contrastive bi-encoder training rejects paraphrases along with OOD noise. A multi-positive contrastive recipe (paraphrase + perturbation as positives, random filler as negatives) resolves the trade-off cleanly.
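One way to instantiate the multi-positive recipe, sketched as a multi-positive InfoNCE over normalized embeddings: each anchor treats its paraphrase and its perturbation as positives and random-filler keys as negatives. The batching layout, temperature, and exact loss form are assumptions, not the training code.

```python
import torch
import torch.nn.functional as F

def multi_positive_info_nce(query, keys, positive_mask, temperature=0.07):
    """Multi-positive contrastive loss.

    query         : (B, d)    anchor embeddings
    keys          : (B, M, d) candidates per anchor (paraphrase, perturbation, fillers)
    positive_mask : (B, M)    bool, True for paraphrase/perturbation keys
    """
    q = F.normalize(query, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = torch.einsum("bd,bmd->bm", q, k) / temperature
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    pos = positive_mask.float()
    # average log-likelihood over all positives per anchor, negated
    return -((log_prob * pos).sum(-1) / pos.sum(-1).clamp(min=1)).mean()
```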
Multi-seed M3 methodology
At toy scale, single-seed plain-perplexity comparisons against tight thresholds are structurally meaningless: retraining identical conditions with identical seeds still produces ±50% swings from BF16 + cuDNN nondeterminism. We apply PyTorch's reproducibility hygiene and report a CI-aware multi-seed verdict.
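The hygiene plus verdict, sketched; the determinism flags follow PyTorch's reproducibility notes, and the Welch t-test (via SciPy) is one reasonable choice for the CI-aware comparison rather than the exact procedure used.

```python
import os, random
import numpy as np
import torch
from scipy import stats

def set_reproducible(seed: int):
    """Fixed seeds plus deterministic kernels (PyTorch reproducibility hygiene)."""
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # deterministic cuBLAS workspace
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)

def ci_verdict(losses_a, losses_b, alpha=0.05):
    """Compare per-seed losses of two conditions instead of a single-seed delta."""
    _, p = stats.ttest_ind(losses_a, losses_b, equal_var=False)  # Welch t-test
    return {"mean_a": float(np.mean(losses_a)),
            "mean_b": float(np.mean(losses_b)),
            "p": float(p),
            "significant": p < alpha}
```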
Boundary-response measurement
PLMR places measurably different boundaries in evidence-bearing windows (edit distance 1.327, p = 4×10⁻¹³) but only marginally different in plain-filler windows (0.280) — confirming the modulation is selective on memory presence.
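A plausible reading of the boundary-response measurement: Levenshtein distance between the boundary-offset sequences produced with and without PLMR, averaged per window, then contrasted between evidence-bearing and plain-filler windows. Whether the reported edit distance is computed over offsets or over full boundary-indicator sequences is an assumption.

```python
def boundary_edit_distance(a, b):
    """Levenshtein distance between two ordered lists of boundary byte offsets."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# mean over evidence-bearing windows vs. mean over plain-filler windows tests
# whether the modulation is selective on memory presence
```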
Continual pretraining of a BLT-1B-class model with PLMR.
The toy-scale results justify the architectural intervention. Phase 1 takes the same mechanism — same multi-positive encoder, same θ_t rule, same FAISS index — to the BLT-1B scale on a real corpus with paraphrased evidence. That is the gated next step before claiming the architecture generalizes.
Compute, dataset, and ablation budget are already in flight. Preprint imminent; the full training write-up follows the BLT-1B continuation.