PLMR
Pre-tokenizer Latent Memory Routing for byte-level language models.
We introduce Pre-tokenizer Latent Memory Routing (PLMR), a mechanism that injects external persistent memory into a byte-level language model before the model segments its input into patches.
PLMR retrieves from a FAISS index of evidence-span keys and uses retrieval similarity to modulate the entropy threshold that drives BLT-style boundary placement.
θ_t = θ_base − α · max-sim(local-repr_t, top-k retrieved keys) · 𝟙[max-sim > τ]

When retrieved memory is highly similar to the local representation, the threshold drops, making a boundary more likely at that byte position. When retrieval is weak, PLMR collapses to standard BLT. Clean attribution, cheap to ablate, and a new design slot in byte-level language modeling that no published work has occupied.
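A minimal sketch of the retrieval-gated threshold, assuming a faiss.IndexFlatIP over unit-normalized evidence-span keys so that inner product equals cosine similarity; the function name and default hyperparameter values are illustrative, not the trained settings.

```python
import numpy as np
import faiss

# build once over unit-normalized evidence-span keys (d-dimensional):
#   index = faiss.IndexFlatIP(d); index.add(evidence_keys.astype(np.float32))

def modulated_threshold(local_repr, index, theta_base=0.5, alpha=0.3, tau=0.6, k=4):
    """Lower the boundary-entropy threshold at byte position t when retrieved
    memory is similar to the local representation (the theta_t rule above)."""
    # local_repr: (d,) unit-normalized query for byte position t
    sims, _ = index.search(local_repr.reshape(1, -1).astype(np.float32), k)
    max_sim = float(sims[0].max())
    if max_sim > tau:                        # gate 1[max_sim > tau]
        return theta_base - alpha * max_sim  # strong retrieval: boundary more likely
    return theta_base                        # weak retrieval: standard BLT threshold
```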
Four claims, all validated at toy scale.
Mechanism validation
Pre-segment retrieval (condition C) reliably beats post-segment retrieval (CLaRa-style control B), yielding a 7–10% relative reduction in evidence-region next-byte loss on paraphrased and perturbed evidence (p < 5×10⁻⁵). The effect holds across two independent encoder geometries.
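The metric, sketched under an assumption about its form: mean next-byte cross-entropy restricted to evidence-bearing positions, with the relative reduction taken between conditions B and C; tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def evidence_region_loss(logits, targets, evidence_mask):
    """Mean next-byte cross-entropy over evidence-bearing byte positions only.

    logits        : (T, 256) per-byte logits
    targets       : (T,)     next-byte targets
    evidence_mask : (T,)     bool, True inside evidence spans
    """
    per_byte = F.cross_entropy(logits, targets, reduction="none")
    return per_byte[evidence_mask].mean()

# relative reduction of pre-segment retrieval (C) vs. post-segment control (B):
#   rel_reduction = (loss_B - loss_C) / loss_B
```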
Encoder-geometry trade-off + resolution
Naive contrastive bi-encoder training rejects paraphrases along with OOD noise. A multi-positive contrastive recipe (paraphrase + perturbation as positives, random filler as negatives) resolves the trade-off cleanly.
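One way to instantiate the multi-positive recipe, sketched as a multi-positive InfoNCE over normalized embeddings: each anchor treats its paraphrase and its perturbation as positives and random-filler keys as negatives. The batching layout, temperature, and exact loss form are assumptions, not the training code.

```python
import torch
import torch.nn.functional as F

def multi_positive_info_nce(query, keys, positive_mask, temperature=0.07):
    """Multi-positive contrastive loss.

    query         : (B, d)    anchor embeddings
    keys          : (B, M, d) candidates per anchor (paraphrase, perturbation, fillers)
    positive_mask : (B, M)    bool, True for paraphrase/perturbation keys
    """
    q = F.normalize(query, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = torch.einsum("bd,bmd->bm", q, k) / temperature
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    pos = positive_mask.float()
    # average log-likelihood over all positives per anchor, negated
    return -((log_prob * pos).sum(-1) / pos.sum(-1).clamp(min=1)).mean()
```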
Multi-seed M3 methodology
At toy scale, single-seed plain-perplexity comparisons against tight thresholds are structurally meaningless: retraining identical conditions with identical seeds still produces ±50% swings from BF16 + cuDNN nondeterminism. We apply PyTorch's reproducibility hygiene and report a CI-aware multi-seed verdict.
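The hygiene plus verdict, sketched; the determinism flags follow PyTorch's reproducibility notes, and the Welch t-test (via SciPy) is one reasonable choice for the CI-aware comparison rather than the exact procedure used.

```python
import os, random
import numpy as np
import torch
from scipy import stats

def set_reproducible(seed: int):
    """Fixed seeds plus deterministic kernels (PyTorch reproducibility hygiene)."""
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # deterministic cuBLAS workspace
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)

def ci_verdict(losses_a, losses_b, alpha=0.05):
    """Compare per-seed losses of two conditions instead of a single-seed delta."""
    _, p = stats.ttest_ind(losses_a, losses_b, equal_var=False)  # Welch t-test
    return {"mean_a": float(np.mean(losses_a)),
            "mean_b": float(np.mean(losses_b)),
            "p": float(p),
            "significant": p < alpha}
```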
Boundary-response measurement
PLMR places measurably different boundaries in evidence-bearing windows (edit distance 1.327, p = 4×10⁻¹³) but only marginally different in plain-filler windows (0.280) — confirming the modulation is selective on memory presence.
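A plausible reading of the boundary-response measurement: Levenshtein distance between the boundary-offset sequences produced with and without PLMR, averaged per window, then contrasted between evidence-bearing and plain-filler windows. Whether the reported edit distance is computed over offsets or over full boundary-indicator sequences is an assumption.

```python
def boundary_edit_distance(a, b):
    """Levenshtein distance between two ordered lists of boundary byte offsets."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

# mean over evidence-bearing windows vs. mean over plain-filler windows tests
# whether the modulation is selective on memory presence
```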
Continual pretraining of a BLT-1B-class model with PLMR.
The toy-scale results justify the architectural intervention. Phase 1 takes the same mechanism — same multi-positive encoder, same θ_t rule, same FAISS index — to the BLT-1B scale on a real corpus with paraphrased evidence. That is the gated next step before claiming the architecture generalizes.
Compute, dataset, and ablation budget are already in flight. Preprint imminent; the full training write-up follows the BLT-1B continuation.