Runs on the device
AVALON-2B is sub-3B parameters and ships as GGUF quants: 40 tok/s on Apple M3, comfortable on flagship phones, free on every desktop. Speed you can feel.
Personal intelligence isn't luxury infrastructure. It should run on the phone in your pocket, on the laptop in front of you, in the apps you actually use every day. AVALON-2B is the on-device Self-RAG runtime; Hypersave is the cognitive memory layer that makes the app remember you. Together they're the substrate for the next generation of consumer AI.
Hypersave gives every consumer app a persistent cognitive layer. The user's preferences, history, voice — captured once, recalled everywhere. Users stop having to introduce themselves to every new app.
AVALON’s reflection vocabulary lets a small on-device model decide when it needs to consult memory or external sources — and admit when it doesn’t know. No silent confabulation.
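In practice, this means the app can branch on explicit tokens the model emits instead of guessing intent from prose. A minimal sketch of that routing, assuming placeholder token names ([RETRIEVE], [UNSURE]) that are illustrative, not AVALON's documented vocabulary:

```typescript
// Sketch of branching on Self-RAG-style reflection tokens.
// The token strings below are placeholders, not AVALON's actual vocabulary.
type Action = "answer" | "consult_memory" | "abstain";

function routeOnReflection(generation: string): Action {
  if (generation.includes("[RETRIEVE]")) return "consult_memory"; // model asked for context
  if (generation.includes("[UNSURE]")) return "abstain";          // model admits uncertainty
  return "answer";                                                // confident direct answer
}
```

The app, not the model, decides what "consult memory" means in context: a Hypersave recall, a web search, or a clarifying question back to the user.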
Inference stays local. Memory is keyed per device. No mandatory call-home. Build consumer apps where the privacy story is the product story.
Nuro Chat is multi-model chat with persistent memory built on Hypersave — switch models without losing context, recall conversations from last week, run AVALON-2B locally when you don't want to send a turn to the cloud. The Nuro stack (Chat, Studio, One) is the consumer surface and the reference implementation.
The fastest way from open-weights model to consumer app with a memory. Run AVALON through Ollama on the user's machine, route memory through Hypersave (managed or self-hosted). Nothing else to glue.
# On the user's device
ollama pull nuroai/avalon-2b
# In your app
npm install @hypersave/sdk ollama

import { Hypersave } from "@hypersave/sdk";
import ollama from "ollama";

const memory = new Hypersave({ apiKey: process.env.HYPERSAVE_KEY });

async function reply(userId: string, prompt: string) {
  // Recall what Hypersave knows about this user, scoped to the current prompt
  const { answer: context } = await memory.recall({ userId, query: prompt });
  const response = await ollama.chat({
    model: "nuroai/avalon-2b", // local — never hits the cloud
    messages: [
      { role: "system", content: `What you remember about the user: ${context}` },
      { role: "user", content: prompt },
    ],
  });
  // Store the turn so future sessions can recall it
  await memory.remember({ userId, text: prompt, sector: "episodic" });
  return response.message.content;
}

Sub-3B Self-Reflective RAG model. Apache 2.0. GGUF quants on Hugging Face. Ollama-ready. 40 tok/s on Apple M3. Beats Qwen 3.5 2B, Gemma 4 E2B, SmolLM3 3B on its target benchmarks.
Read the paper →

Five sectors, Ebbinghaus decay, RRF hybrid retrieval. Embed via TS or Python SDK. Use the managed cloud or self-host on the user's own infrastructure.
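Reciprocal rank fusion (RRF), the technique behind that hybrid retrieval, is simple to sketch. The following is an illustrative standalone implementation, not Hypersave's actual code; the `rrfFuse` name and the k = 60 default are assumptions (60 is a common default for RRF):

```typescript
// Illustrative reciprocal rank fusion over ranked lists of document IDs.
// Not Hypersave's actual code; names and the k = 60 default are assumptions.
type Ranking = string[]; // document IDs, best first

function rrfFuse(rankings: Ranking[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, i) => {
      // Each retriever contributes 1 / (k + rank), with rank 1-based
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + i + 1));
    });
  }
  // Highest fused score first
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([doc]) => doc);
}

// "d1" ranks highly with both retrievers, so it wins even though
// each retriever put a different document first.
const dense = ["d2", "d1", "d3"];  // e.g. vector search
const sparse = ["d1", "d4", "d2"]; // e.g. keyword search
console.log(rrfFuse([dense, sparse])); // d1, d2, d4, d3
```

The point of RRF is that it needs only ranks, never raw scores, so it fuses a vector index and a keyword index without any score calibration between them.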
Read the docs →

A 2B model that knows when it doesn't know, paired with a memory layer that actually fuses retrieval — that is the substrate consumer AI has been waiting for.
Free tier on Hypersave. AVALON-2B free forever (Apache 2.0). If you're shipping consumer AI and want to compare notes, write to press@nuroailabs.com.