Look, LLMs are great at generating text but terrible at remembering what you told them five minutes ago. Every session starts from scratch. You repeat your preferences, your project context, your API keys, and the model still drifts off-topic by turn 15.

Most “AI memory” tools handle this by keeping everything in RAM or shipping your data to a cloud API. Neither scales well when you’re running multi-session agent workflows.

Mnemo takes a different approach. It’s a sidecar service written in Rust: single static binary, persistent SQLite-backed knowledge graph, sub-5ms retrieval, zero cloud dependency. I spun up a test instance with Docker Compose, hit every API endpoint with curl, and ran through the ingestion-retrieval cycle to see how it actually performs. Here’s what I found.

Quick Verdict

Mnemo is not a ready-to-use chatbot or a managed agent harness. But if you’re building custom LLM pipelines and need persistent, structured, local memory that survives restarts and scales to thousands of sessions, it’s one of the most solid options I’ve seen at this stage. The 193 GitHub stars in five days tell part of the story; the architecture and API design tell the rest.

The knowledge graph layer is the real differentiator. Most tools dump raw conversation history back into your prompt and let the LLM figure out what’s relevant. Mnemo extracts entities, weights relationships, does multi-hop graph traversal, and scores results before injection. That’s a fundamentally better approach.

What Is Mnemo?

Mnemo is a local memory sidecar for LLM applications. You run it alongside your app, on the same machine or a VPS, and it exposes a REST API for storing and retrieving memories.

Here’s how it works: instead of stuffing your LLM prompts with flat chat history, you feed raw text to Mnemo’s /ingest endpoint. It extracts named entities and their relationships using an LLM (Ollama, OpenAI, or Anthropic, your choice) and builds a persistent knowledge graph in SQLite, with petgraph handling in-memory traversal. When you call /retrieve, it returns a ranked, scored context prompt you inject directly into your system message.

The key features:

  • Entities are deduplicated across sessions — same person, tool, or concept gets merged automatically
  • Relationships are weighted — frequently co-occurring entities rank higher
  • Graph expansion finds indirect connections (two hops away, at default settings)
  • Results are scored: direct matches score up to 2× higher than graph-expanded ones (1.0 vs. 0.5 at two hops), so the signal doesn’t drown in noise
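
To make the deduplication and weighting bullets concrete, here is a minimal Python sketch. It is not Mnemo's code: the `MemoryGraph` class, the casefold-based name normalization, and the co-occurrence counter are all assumptions made for illustration, and the real merge logic in mnemo-core may work quite differently.

```python
from collections import Counter
from itertools import combinations


class MemoryGraph:
    """Toy illustration of entity dedup and co-occurrence weighting (hypothetical)."""

    def __init__(self):
        self.entities = {}        # normalized key -> display name first seen
        self.weights = Counter()  # frozenset({key_a, key_b}) -> co-occurrence count

    @staticmethod
    def _key(name):
        # Assumption: names differing only by case or whitespace are one entity.
        return " ".join(name.split()).casefold()

    def ingest(self, names):
        """Record one chunk's entity mentions: merge duplicates, bump edge weights."""
        for name in names:
            self.entities.setdefault(self._key(name), name.strip())
        keys = sorted({self._key(n) for n in names})
        for a, b in combinations(keys, 2):
            self.weights[frozenset((a, b))] += 1

    def weight(self, a, b):
        return self.weights[frozenset((self._key(a), self._key(b)))]


graph = MemoryGraph()
graph.ingest(["Alice", "PostgreSQL"])           # session 1
graph.ingest(["alice ", "postgresql", "Rust"])  # session 2, same entities, new casing
print(len(graph.entities))                  # 3
print(graph.weight("Alice", "PostgreSQL"))  # 2
print(graph.weight("Alice", "Rust"))        # 1
```

The pair mentioned in both sessions ends up with the heavier edge, which is the property the "frequently co-occurring entities rank higher" bullet describes.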

How Mnemo Works (Architecture Deep Dive)

Mnemo ships as four Rust crates in a clean layered architecture:

| Crate | Type | What It Does |
| --- | --- | --- |
| mnemo-core | Library | Entity extraction, graph ops (petgraph), retrieval engine, SQLite DB layer |
| mnemo-api | Binary | Axum-based REST API; thin handler layer over core |
| mnemo-cli | Binary | CLI tool; blocking reqwest calls against the API |
| mnemo-bench | Binary | 12 performance benchmark suites |

I spent most of my time testing mnemo-core and mnemo-api because those are where the real engineering lives. The retrieval pipeline has six stages:

  1. Full-text chunk search — SQLite FTS5 over stored memory chunks
  2. Entity name search — exact and fuzzy match on entity names
  3. Graph expansion — BFS traversal over the petgraph knowledge graph (configurable depth, default 2)
  4. Relation filter — keeps only entities connected by a relationship with weight above threshold
  5. Score + rank: multiplies match quality by a graph-distance weight (direct = 1.0, 1 hop = 0.7, 2 hops = 0.5)
  6. Assemble context prompt — returns a ready-to-inject string with the top-K results

What stood out to me during testing: the scoring math isn’t arbitrary. Weighting direct matches at 1.0× and two-hop results at 0.5× means the signal-to-noise ratio degrades gracefully as you broaden the search. Most naive context dumpers don’t even try to rank.
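
The graph-expansion and scoring stages can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the mnemo-core implementation: the adjacency-dict graph shape, the `expand_and_score` name, and the rule that a neighbor inherits the match quality of the seed that reached it first are all hypothetical. Only the three distance weights come from the pipeline description, and the relation-weight filter (stage 4) is left out.

```python
from collections import deque

# Distance weights quoted in the pipeline description: direct, 1 hop, 2 hops.
HOP_WEIGHT = {0: 1.0, 1: 0.7, 2: 0.5}


def expand_and_score(graph, seeds, max_depth=2):
    """BFS from directly matched entities; score = match quality x hop weight.

    graph: dict mapping entity -> iterable of neighbors (hypothetical shape).
    seeds: dict mapping directly matched entity -> match quality in [0, 1].
    """
    # No weights are published beyond two hops, so clamp the depth.
    max_depth = min(max_depth, max(HOP_WEIGHT))
    scores = dict(seeds)  # direct matches keep full quality (x 1.0)
    seen = set(seeds)
    queue = deque((entity, 0, quality) for entity, quality in seeds.items())
    while queue:
        entity, depth, quality = queue.popleft()
        if depth == max_depth:
            continue
        for neighbor in graph.get(entity, ()):
            if neighbor in seen:
                continue  # BFS already reached it by a path at least as short
            seen.add(neighbor)
            scores[neighbor] = quality * HOP_WEIGHT[depth + 1]
            queue.append((neighbor, depth + 1, quality))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


graph = {
    "mnemo": ["sqlite", "petgraph"],
    "sqlite": ["fts5"],
    "fts5": ["tokenizer"],  # three hops from the seed, never reached
}
ranked = expand_and_score(graph, {"mnemo": 1.0})
print(ranked)
# [('mnemo', 1.0), ('sqlite', 0.7), ('petgraph', 0.7), ('fts5', 0.5)]
```

Because the weight shrinks with every hop, widening the search adds candidates without letting them outrank the direct hits.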

API Walkthrough — 14 Endpoints I Actually Hit With curl

I started the container and ran curl http://localhost:8080/health to confirm the service was alive. It returned server status, DB health, and active LLM backend config, all as clean JSON. That gave me confidence to test the full API surface.

Here’s the endpoint map I worked through (the table lists ten of them):

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /health | Server + DB + LLM status check |
| POST | /ingest | Store text and extract entities |
| POST | /retrieve | Get ranked memory context for a query |
| GET | /entities | List all known entities (paginated) |
| GET | /entities/:id | Get entity detail by UUID |
| DELETE | /entities/:id | Delete entity (cascading) |
| GET | /entities/:id/neighbors | Knowledge graph neighbors (depth max 5) |
| GET | /chunks | List memory chunks (paginated) |
| POST | /search | Full-text search across entities and chunks |
| DELETE | /wipe | Delete everything (irreversible) |

Honestly, these are the two I found most useful for real-world workflows:

POST /ingest takes content (required), source (required — “chat”, “email”, “cli”), an optional session_id, and arbitrary metadata JSON. That metadata field is a small touch that makes a big difference — you can tag memories by project, priority level, or any custom taxonomy your app needs. I tested this by sending a support ticket transcript tagged with "priority": "high" and saw it correctly classified in the entity graph.
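
As a rough sketch of what an /ingest call could look like from Python's standard library: the four field names come from the description above, while the sample text, the `support-42` session id, and the `build_ingest_request` helper are made up for illustration. The request is built but not sent, and the shape of the response is not shown because the article does not document it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # same host and port as the health check


def build_ingest_request(content, source, session_id=None, metadata=None):
    """Build (but do not send) a POST /ingest request."""
    payload = {"content": content, "source": source}
    if session_id is not None:
        payload["session_id"] = session_id
    if metadata is not None:
        payload["metadata"] = metadata
    return urllib.request.Request(
        f"{BASE_URL}/ingest",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ingest_request(
    "Customer reports login failures after the latest upgrade.",
    source="chat",
    session_id="support-42",         # hypothetical session id
    metadata={"priority": "high"},   # free-form tags, as described above
)
print(req.get_method(), req.full_url)

# To send it against a running instance:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```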

POST /retrieve takes text, an optional session_id filter, max_chunks (default 10), max_entities (default 20), min_confidence (default 0.5), and, critically, include_graph (default true) and graph_depth (default 2). Being able to turn graph expansion off when you want exact recall only is the kind of control I appreciate after using other memory tools that force you into one mode.
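
The /retrieve parameters map naturally onto a small dataclass. The sketch below mirrors the field names and defaults listed above; the `RetrieveParams` class itself and the sample query are hypothetical, and whether the server expects defaults to be sent explicitly or omitted is an assumption you should check against the README.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RetrieveParams:
    """Request body for POST /retrieve, using the defaults listed above."""
    text: str
    session_id: Optional[str] = None
    max_chunks: int = 10
    max_entities: int = 20
    min_confidence: float = 0.5
    include_graph: bool = True
    graph_depth: int = 2

    def to_json(self):
        # Leave session_id out entirely when no session filter is wanted.
        body = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(body)


# Exact recall only: switch graph expansion off.
exact_only = RetrieveParams(
    text="What database does the billing service use?",
    include_graph=False,
)
print(exact_only.to_json())
```

Keeping the defaults in one place makes it obvious which knob a given call is overriding, here only `include_graph`.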

Performance That Actually Matters

Mnemo includes 12 benchmark suites. The README publishes results from an Apple M2 (debug build — release is 3–5× faster):

| Operation | Average Latency | Throughput |
| --- | --- | --- |
| Entity insert (SQLite) | 0.12 ms | 8,300 ops/s |
| Entity lookup by ID | 0.08 ms | 12,500 ops/s |
| Chunk insert | 0.14 ms | 7,100 ops/s |
| Full-text chunk search | 0.28 ms | 3,500 ops/s |
| Graph neighbor (depth=1) | 0.21 ms | 4,700 ops/s |
| Graph neighbor (depth=2) | 0.89 ms | 1,100 ops/s |
| Full retrieval pipeline | 4.2 ms | 238 ops/s |

Sub-millisecond graph traversal at depth 2 is impressive for a pure Rust implementation. The full pipeline at 4.2 ms means even your most latency-sensitive LLM calls won’t notice the memory injection step. To me, the 4.2 ms figure is the most important number here: it tells you Mnemo can sit in the hot path of a real-time agent loop without becoming a bottleneck.
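
The throughput column in the table above is consistent with a single sequential worker, where ops/s is simply 1000 divided by the average latency in milliseconds. A quick check in Python, using only the published numbers (the single-worker reading is my assumption; the README may measure throughput differently):

```python
# Average latency (ms) and reported throughput (ops/s) from the table above.
published = {
    "Entity insert (SQLite)": (0.12, 8300),
    "Entity lookup by ID": (0.08, 12500),
    "Chunk insert": (0.14, 7100),
    "Full-text chunk search": (0.28, 3500),
    "Graph neighbor (depth=1)": (0.21, 4700),
    "Graph neighbor (depth=2)": (0.89, 1100),
    "Full retrieval pipeline": (4.2, 238),
}


def ops_per_second(latency_ms):
    """Throughput of one sequential worker: 1000 ms divided by latency."""
    return 1000.0 / latency_ms


deviations = {}
for name, (latency_ms, reported) in published.items():
    derived = ops_per_second(latency_ms)
    deviations[name] = abs(derived - reported) / reported
    print(f"{name:<26} derived {derived:>8.0f} ops/s   reported {reported:>6} ops/s")
```

Every derived value lands within about 3% of the reported one, so the two columns look like the same measurement expressed two ways, with rounding.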

Mnemo vs. The Alternatives

I compared Mnemo against the two most common approaches to AI memory, in-memory context windows and cloud-based memory services:

Feature Mnemo In-Memory (Flat Context) Cloud Memory Services
Runtime Single Rust binary — (lives in app memory) Python daemon
Storage SQLite (persistent) RAM (lost on restart) Cloud DB (vendor lock)
Graph layer petgraph, multi-hop BFS None Sometimes basic
Entity dedup ✅ Auto across sessions ❌ Manual or none
Scored ranking ✅ 6-stage pipeline ❌ Dumps everything Partial
Cloud dependency Zero Zero Required
LLM backend Any OpenAI-compatible Your app’s LLM Locked to provider
Latency ~4.2 ms full pipeline ~0 ms (pre-built) 50–200 ms (network)

The tradeoff is clear: Mnemo gives up the zero latency of flat in-memory context in exchange for structured, persistent, deduplicated memory. For anything beyond a single-session chatbot, that trade is worth making. At 4.2 ms, you barely feel the latency anyway.

Who Should Use Mnemo

That said, Mnemo is not for everyone. Here’s my honest breakdown:

Use it if:

  • You’re building a custom AI agent or LLM pipeline and need memory that survives restarts
  • You want structured entity extraction, not raw log dumping
  • You’re comfortable with Docker or have a Rust toolchain installed
  • You’d rather run memory locally than pay per-token for cloud memory

Skip it if:

  • You use a managed agent harness (Claude Code, Cursor, etc.) — those handle memory internally
  • You need a one-command chatbot that remembers — this is a sidecar service, not an app
  • Your project is a single-session script — flat context is simpler

Here’s the thing: I think Mnemo pairs beautifully with self-hosted agent environments. If you’re running Agent-Reach or similar tooling that gives your agents web access, adding Mnemo means they can both remember what they learned and recall it across sessions. That’s where this gets interesting.

What I Like

The architecture is clean. Four crates, clear separation of concerns, Axum for the API layer. The README even explains why the scoring uses 0.5× for graph-expanded results, so the number is documented, not arbitrary.

Configuration is flexible. Environment variables, TOML config file, or both (env vars take precedence). The active config source is reported in /health. That’s a small detail, but it saves debugging time.
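
The "env vars take precedence" rule is a common layering pattern, sketched below in Python. Everything specific here is hypothetical: the `MNEMO_` prefix, the `port` and `llm_backend` keys, and the `resolve_config` helper are placeholders, not Mnemo's actual setting names, so check the README for the real ones.

```python
import os


def resolve_config(file_values, env=None, prefix="MNEMO_"):
    """Merge settings so environment variables override file values.

    file_values: dict already parsed from a TOML file.
    prefix: hypothetical env-var prefix used to pick out relevant variables.
    Returns (settings, sources); sources records where each key came from.
    """
    env = os.environ if env is None else env
    settings, sources = {}, {}
    for key, value in file_values.items():
        settings[key] = value
        sources[key] = "file"
    for name, value in env.items():
        if name.startswith(prefix):
            key = name[len(prefix):].lower()
            settings[key] = value   # env wins over the file value
            sources[key] = "env"
    return settings, sources


settings, sources = resolve_config(
    {"port": "8080", "llm_backend": "ollama"},               # from the TOML file
    env={"MNEMO_LLM_BACKEND": "openai", "HOME": "/root"},    # process environment
)
print(settings)  # {'port': '8080', 'llm_backend': 'openai'}
print(sources)   # {'port': 'file', 'llm_backend': 'env'}
```

Tracking the source of each value alongside the value itself is what makes a health endpoint able to report where the active configuration came from.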

The Python SDK is a nice bonus. Not everyone writes Rust. The mnemo-sdk pip package ships both a sync client and AsyncMnemoClient, so Python-based agent frameworks can plug in without wrapping the REST API manually.

122 Rust tests + 21 Python tests + 12 benchmarks. For a project that’s been public for 5 days, that’s a strong signal the author cares about correctness.

What Could Be Better

No pre-built release binaries yet. You have to compile from source or use Docker. For a Rust binary that promises “single static binary deployment,” shipping pre-built binaries for Linux x86_64 and ARM64 would cut the setup friction in half. For now, Docker is the smoothest path; I had it running in about three minutes.

Entity extraction quality depends entirely on your LLM. Mnemo doesn’t do its own NER; it delegates entity extraction to whatever LLM you configure. Feed it a weak model and you’ll get weak entities. In short, the system is only as smart as the LLM behind it.

The project is 5 days old. 193 stars is legit for a Rust project under a week old, but there’s no community, no plugin ecosystem, and no mature documentation beyond the README and a handful of markdown docs. You’d be an early adopter, and that comes with tradeoffs.

My take after using it: none of these are dealbreakers for the right use case.

Self-Hosted Mnemo Deployment

If you want Mnemo running 24/7 as a memory backend for your agents, you’ll deploy it on a VPS. Here’s the setup I used:

  1. Spin up a Linux VM (the cheapest tier on any cloud provider works — 1 vCPU, 1 GB RAM is plenty for the Mnemo binary itself; you’ll want more if you run Ollama on the same machine)
  2. Install Docker (or compile from source)
  3. Run docker compose up -d from the cloned repo
  4. Optionally add Ollama on the same machine for fully local entity extraction

Disclosure: Some of the links below are affiliate links. If you sign up through them, I may earn a commission at no extra cost to you.

For the VPS, I recommend DigitalOcean: new users get $200 in free credit (valid for 60 days), which easily covers a small instance for that whole period. The $6/month basic Droplet is enough for Mnemo itself; size up if you also run Ollama on the same machine:

→ DigitalOcean: Get $200 Free Credit

Prefer a provider with more global regions or better Asia-Pacific coverage? Vultr offers datacenters worldwide and new accounts receive $50–100 in credit. Their $6/month cloud instances are equally suitable:

→ Vultr: Start with Free Credit

Either provider’s $6–12/month instances handle the Mnemo workload easily. If you need GPU instances for running larger LLM extraction models locally, AWS has spot GPU instances that work well for batch processing.

If you prefer to run LLM extraction on your own hardware rather than renting cloud GPU instances, a dedicated GPU is the way to go. The NVIDIA GeForce RTX 4090 is currently one of the best consumer cards for local LLM inference — 24 GB VRAM handles models up to ~13B parameters comfortably:

→ NVIDIA RTX 4090 on Amazon (check current price)

For a more budget-friendly option, the RTX 4070 Super (12 GB VRAM) works well for 7B-parameter models:

→ NVIDIA RTX 4070 Super on Amazon

The Docker Compose setup is the easiest path: the repo includes a docker-compose.yml that wires Mnemo to a bundled Ollama instance. One command gets you a fully local, persistent AI memory layer.

Final Verdict

| Dimension | Rating | Notes |
| --- | --- | --- |
| Architecture | ⭐⭐⭐⭐½ | Clean crate layering, petgraph-based graph engine, 6-stage retrieval pipeline |
| Performance | ⭐⭐⭐⭐⭐ | 4.2 ms full pipeline on M2, sub-millisecond graph ops |
| Ease of use | ⭐⭐⭐ | Docker is easy; no pre-built binaries yet |
| Documentation | ⭐⭐⭐⭐ | README is thorough, API docs are clear, could use more deployment guides |
| Maturity | ⭐⭐⭐ | 5 days old, solid foundations but early |
| Value | ⭐⭐⭐⭐½ | Free + MIT + zero cloud dependency = hard to beat |

Mnemo solves a real problem, LLM memory, with genuinely good architecture. It’s not a mass-market product; it’s a developer tool written in Rust, designed to be self-hosted and fully controlled.

If you’re building custom LLM pipelines and you’ve been hacking together flat context dumps or paying for cloud memory APIs, give Mnemo a look. The knowledge graph approach to memory is the direction the space needs to go. At 193 stars and climbing, I suspect I’m not the only one who thinks so.
