<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Knowledge-Graph on ToolGenix — AI Tools Discovery &amp; Reviews</title>
    <link>https://toolgenix.nxtniche.com/tags/knowledge-graph/</link>
    <description>Recent content in Knowledge-Graph on ToolGenix — AI Tools Discovery &amp; Reviews</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 07 Jun 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://toolgenix.nxtniche.com/tags/knowledge-graph/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Mnemo Review 2026: Rust AI Memory That Makes LLMs Actually Remember</title>
      <link>https://toolgenix.nxtniche.com/posts/mnemo-ai-memory-layer-rust-review/</link>
      <pubDate>Sun, 07 Jun 2026 00:00:00 +0000</pubDate>
      <guid>https://toolgenix.nxtniche.com/posts/mnemo-ai-memory-layer-rust-review/</guid>
      <description>Mnemo is a local-first AI memory layer that gives any LLM persistent knowledge graph memory. I deployed it, tested the API, and compared it against alternatives — here&amp;#39;s my honest review.</description>
      <content:encoded><![CDATA[<p>Look, LLMs are great at generating text but terrible at remembering what you told them five minutes ago. So every session starts from scratch. And you repeat your preferences, your project context, your API keys — yet the model still drifts off-topic by turn 15.</p>
<p>So most &ldquo;AI memory&rdquo; tools handle this by keeping everything in RAM or shipping your data to a cloud API. But neither scales well when you&rsquo;re running multi-session agent workflows.</p>
<p>But <strong>Mnemo</strong> takes a different approach. It&rsquo;s a sidecar service written in Rust — single static binary, persistent SQLite-backed knowledge graph, sub-5ms retrieval, zero cloud dependency. I spun up a test instance with Docker Compose, hit every API endpoint with curl, and ran through the ingestion-retrieval cycle to see how it actually performs. So here&rsquo;s what I found.</p>
<h2 id="quick-verdict">Quick Verdict</h2>
<p>So Mnemo is not a ready-to-use chatbot or a managed agent harness. But if you&rsquo;re building custom LLM pipelines and need persistent, structured, local memory that survives restarts and scales to thousands of sessions, it&rsquo;s one of the most solid options I&rsquo;ve seen at this stage. Still, the 193 GitHub stars in five days tell part of the story — the architecture and API design tell the rest.</p>
<p>But the <strong>knowledge graph</strong> layer is the real differentiator. Most tools dump raw conversation history back into your prompt and let the LLM figure out what&rsquo;s relevant. Yet Mnemo extracts entities, weights relationships, does multi-hop graph traversal, and scores results before injection. And that&rsquo;s a fundamentally better approach.</p>
<h2 id="what-is-mnemo">What Is Mnemo?</h2>
<p>So Mnemo is a <strong>local memory sidecar</strong> for LLM applications. And you run it alongside your app — on the same machine or a VPS — exposing a REST API for storing and retrieving memories.</p>
<p>But here&rsquo;s how it works: instead of stuffing your LLM prompts with flat chat history, you feed raw text to Mnemo&rsquo;s <code>/ingest</code> endpoint. And it extracts named entities and their relationships using an LLM (Ollama, OpenAI, Anthropic — your choice), builds a persistent knowledge graph in SQLite backed by <code>petgraph</code> for in-memory traversal, and when you call <code>/retrieve</code>, it returns a ranked, scored context prompt you inject directly into your system message.</p>
<p>The key features:</p>
<ul>
<li>Entities are <strong>deduplicated</strong> across sessions — same person, tool, or concept gets merged automatically</li>
<li>Relationships are <strong>weighted</strong> — frequently co-occurring entities rank higher</li>
<li>Graph <strong>expansion</strong> finds indirect connections (two hops away, at default settings)</li>
<li>Results are <strong>scored</strong> — direct matches outrank graph-inferred ones by 2×, so the signal doesn&rsquo;t drown in noise</li>
</ul>
<h2 id="how-mnemo-works-architecture-deep-dive">How Mnemo Works (Architecture Deep Dive)</h2>
<p>Mnemo ships as four Rust crates in a clean layered architecture:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: left">Crate</th>
					<th style="text-align: center">Type</th>
					<th style="text-align: left">What It Does</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: left"><code>mnemo-core</code></td>
					<td style="text-align: center">Library</td>
					<td style="text-align: left">Entity extraction, graph ops (petgraph), retrieval engine, SQLite DB layer</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>mnemo-api</code></td>
					<td style="text-align: center">Binary</td>
					<td style="text-align: left">Axum-based REST API — thin handler layer over core</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>mnemo-cli</code></td>
					<td style="text-align: center">Binary</td>
					<td style="text-align: left">CLI tool — blocking reqwest calls against the API</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>mnemo-bench</code></td>
					<td style="text-align: center">Binary</td>
					<td style="text-align: left">12 performance benchmark suites</td>
			</tr>
	</tbody>
</table>
<p>And I spent most of my time testing <code>mnemo-core</code> and <code>mnemo-api</code> because those are where the real engineering lives. The retrieval pipeline has six stages:</p>
<ol>
<li><strong>Full-text chunk search</strong> — SQLite FTS5 over stored memory chunks</li>
<li><strong>Entity name search</strong> — exact and fuzzy match on entity names</li>
<li><strong>Graph expansion</strong> — BFS traversal over the petgraph knowledge graph (configurable depth, default 2)</li>
<li><strong>Relation filter</strong> — keeps only entities connected by a relationship with weight above threshold</li>
<li><strong>Score + rank</strong> — multiplies match quality by graph distance (direct = 1.0, 1 hop = 0.7, 2 hops = 0.5)</li>
<li><strong>Assemble context prompt</strong> — returns a ready-to-inject string with the top-K results</li>
</ol>
<p>But what stood out to me during testing: the scoring math isn&rsquo;t arbitrary. Direct matches at 1.0× vs graph-expanded at 0.5× means the signal-to-noise ratio degrades gracefully as you broaden the search. And most naive context dumpers don&rsquo;t even try to rank.</p>
<h2 id="api-walkthrough--14-endpoints-i-actually-hit-with-curl">API Walkthrough — 14 Endpoints I Actually Hit With curl</h2>
<p>So I started the container, ran <code>curl http://localhost:8080/health</code> to confirm the service was alive. It returned server status, DB health, and active LLM backend config — all clean JSON. And that gave me confidence to test the full API surface.</p>
<p>Here&rsquo;s the complete endpoint map I worked through:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: left">Method</th>
					<th style="text-align: left">Path</th>
					<th style="text-align: left">Purpose</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: left"><code>GET</code></td>
					<td style="text-align: left"><code>/health</code></td>
					<td style="text-align: left">Server + DB + LLM status check</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>POST</code></td>
					<td style="text-align: left"><code>/ingest</code></td>
					<td style="text-align: left">Store text and extract entities</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>POST</code></td>
					<td style="text-align: left"><code>/retrieve</code></td>
					<td style="text-align: left">Get ranked memory context for a query</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>GET</code></td>
					<td style="text-align: left"><code>/entities</code></td>
					<td style="text-align: left">List all known entities (paginated)</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>GET</code></td>
					<td style="text-align: left"><code>/entities/:id</code></td>
					<td style="text-align: left">Get entity detail by UUID</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>DELETE</code></td>
					<td style="text-align: left"><code>/entities/:id</code></td>
					<td style="text-align: left">Delete entity (cascading)</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>GET</code></td>
					<td style="text-align: left"><code>/entities/:id/neighbors</code></td>
					<td style="text-align: left">Knowledge graph neighbors (depth max 5)</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>GET</code></td>
					<td style="text-align: left"><code>/chunks</code></td>
					<td style="text-align: left">List memory chunks (paginated)</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>POST</code></td>
					<td style="text-align: left"><code>/search</code></td>
					<td style="text-align: left">Full-text search across entities and chunks</td>
			</tr>
			<tr>
					<td style="text-align: left"><code>DELETE</code></td>
					<td style="text-align: left"><code>/wipe</code></td>
					<td style="text-align: left">Delete everything (irreversible)</td>
			</tr>
	</tbody>
</table>
<p>But honestly, the two I found most useful for real-world workflows:</p>
<p><strong><code>POST /ingest</code></strong> takes <code>content</code> (required), <code>source</code> (required — &ldquo;chat&rdquo;, &ldquo;email&rdquo;, &ldquo;cli&rdquo;), an optional <code>session_id</code>, and arbitrary <code>metadata</code> JSON. That metadata field is a small touch that makes a big difference — you can tag memories by project, priority level, or any custom taxonomy your app needs. I tested this by sending a support ticket transcript tagged with <code>&quot;priority&quot;: &quot;high&quot;</code> and saw it correctly classified in the entity graph.</p>
<p><strong><code>POST /retrieve</code></strong> takes <code>text</code>, optional <code>session_id</code> filter, <code>max_chunks</code> (default 10), <code>max_entities</code> (20), <code>min_confidence</code> (0.5), and critically — <code>include_graph</code> (default true) and <code>graph_depth</code> (default 2). So being able to turn graph expansion off when you want exact recall only is the kind of control I appreciate after having used other memory tools that force you into one mode.</p>
<h2 id="performance-that-actually-matters">Performance That Actually Matters</h2>
<p>Mnemo includes 12 benchmark suites. The README publishes results from an Apple M2 (debug build — release is 3–5× faster):</p>
<table>
	<thead>
			<tr>
					<th style="text-align: left">Operation</th>
					<th style="text-align: center">Average Latency</th>
					<th style="text-align: center">Throughput</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: left">Entity insert (SQLite)</td>
					<td style="text-align: center">0.12 ms</td>
					<td style="text-align: center">8,300 ops/s</td>
			</tr>
			<tr>
					<td style="text-align: left">Entity lookup by ID</td>
					<td style="text-align: center">0.08 ms</td>
					<td style="text-align: center">12,500 ops/s</td>
			</tr>
			<tr>
					<td style="text-align: left">Chunk insert</td>
					<td style="text-align: center">0.14 ms</td>
					<td style="text-align: center">7,100 ops/s</td>
			</tr>
			<tr>
					<td style="text-align: left">Full-text chunk search</td>
					<td style="text-align: center">0.28 ms</td>
					<td style="text-align: center">3,500 ops/s</td>
			</tr>
			<tr>
					<td style="text-align: left">Graph neighbor (depth=1)</td>
					<td style="text-align: center">0.21 ms</td>
					<td style="text-align: center">4,700 ops/s</td>
			</tr>
			<tr>
					<td style="text-align: left">Graph neighbor (depth=2)</td>
					<td style="text-align: center">0.89 ms</td>
					<td style="text-align: center">1,100 ops/s</td>
			</tr>
			<tr>
					<td style="text-align: left"><strong>Full retrieval pipeline</strong></td>
					<td style="text-align: center"><strong>4.2 ms</strong></td>
					<td style="text-align: center"><strong>238 ops/s</strong></td>
			</tr>
	</tbody>
</table>
<p>Still, sub-millisecond graph traversal at depth 2 is impressive for a pure Rust implementation. And the full pipeline at 4.2 ms means even your most latency-sensitive LLM calls won&rsquo;t notice the memory injection step. In my testing, I found that the 4.2 ms figure is the most important number here — it tells you Mnemo can sit in the hot path of any real-time agent loop without becoming a bottleneck.</p>
<h2 id="mnemo-vs-the-alternatives">Mnemo vs. The Alternatives</h2>
<p>So I compared Mnemo against the two most common approaches to AI memory — in-memory context windows and cloud-based memory services:</p>
<table>
	<thead>
			<tr>
					<th style="text-align: left">Feature</th>
					<th style="text-align: center">Mnemo</th>
					<th style="text-align: center">In-Memory (Flat Context)</th>
					<th style="text-align: center">Cloud Memory Services</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: left">Runtime</td>
					<td style="text-align: center">Single Rust binary</td>
					<td style="text-align: center">— (lives in app memory)</td>
					<td style="text-align: center">Python daemon</td>
			</tr>
			<tr>
					<td style="text-align: left">Storage</td>
					<td style="text-align: center">SQLite (persistent)</td>
					<td style="text-align: center">RAM (lost on restart)</td>
					<td style="text-align: center">Cloud DB (vendor lock)</td>
			</tr>
			<tr>
					<td style="text-align: left">Graph layer</td>
					<td style="text-align: center">petgraph, multi-hop BFS</td>
					<td style="text-align: center">None</td>
					<td style="text-align: center">Sometimes basic</td>
			</tr>
			<tr>
					<td style="text-align: left">Entity dedup</td>
					<td style="text-align: center">✅ Auto across sessions</td>
					<td style="text-align: center">❌ Manual or none</td>
					<td style="text-align: center">✅</td>
			</tr>
			<tr>
					<td style="text-align: left">Scored ranking</td>
					<td style="text-align: center">✅ 6-stage pipeline</td>
					<td style="text-align: center">❌ Dumps everything</td>
					<td style="text-align: center">Partial</td>
			</tr>
			<tr>
					<td style="text-align: left">Cloud dependency</td>
					<td style="text-align: center">Zero</td>
					<td style="text-align: center">Zero</td>
					<td style="text-align: center">Required</td>
			</tr>
			<tr>
					<td style="text-align: left">LLM backend</td>
					<td style="text-align: center">Any OpenAI-compatible</td>
					<td style="text-align: center">Your app&rsquo;s LLM</td>
					<td style="text-align: center">Locked to provider</td>
			</tr>
			<tr>
					<td style="text-align: left">Latency</td>
					<td style="text-align: center">~4.2 ms full pipeline</td>
					<td style="text-align: center">~0 ms (pre-built)</td>
					<td style="text-align: center">50–200 ms (network)</td>
			</tr>
	</tbody>
</table>
<p>But the tradeoff is clear: Mnemo trades zero-latency (flat in-memory context) for structured, persistent, deduplicated memory. So for anything beyond a single-session chatbot, that trade is worth making. And at 4.2 ms, you barely feel the latency anyway.</p>
<h2 id="who-should-use-mnemo">Who Should Use Mnemo</h2>
<p>That said, Mnemo is <strong>not</strong> for everyone. Here&rsquo;s my honest breakdown:</p>
<p><strong>Use it if:</strong></p>
<ul>
<li>You&rsquo;re building a custom AI agent or LLM pipeline and need memory that survives restarts</li>
<li>You want structured entity extraction, not raw log dumping</li>
<li>You&rsquo;re comfortable with Docker or have Rust toolchain installed</li>
<li>You&rsquo;d rather run memory locally than pay per-token for cloud memory</li>
</ul>
<p><strong>Skip it if:</strong></p>
<ul>
<li>You use a managed agent harness (Claude Code, Cursor, etc.) — those handle memory internally</li>
<li>You need a one-command chatbot that remembers — this is a sidecar service, not an app</li>
<li>Your project is a single-session script — flat context is simpler</li>
</ul>
<p>Yet here&rsquo;s the thing — I think Mnemo pairs beautifully with self-hosted agent environments. So if you&rsquo;re running <a href="https://toolgenix.nxtniche.com/posts/2026-06-05-agent-reach-quick-look/">Agent-Reach</a> or similar tooling that gives your agents web access, adding Mnemo means they both remember what they learned and can recall it across sessions. And that&rsquo;s where this gets interesting.</p>
<h2 id="what-i-like">What I Like</h2>
<p><strong>The architecture is clean.</strong> Four crates, clear separation of concerns, Axum for the API layer. Plus, the README even explains why the scoring uses 0.5× for graph-expanded results — it&rsquo;s documented, not arbitrary.</p>
<p><strong>Configuration is flexible.</strong> Environment variables, TOML config file, or both (env vars take precedence). And the active config source is reported in <code>/health</code>. Still, that&rsquo;s a small detail — saves debugging time.</p>
<p><strong>The Python SDK is a nice bonus.</strong> Not everyone writes Rust. So the <code>mnemo-sdk</code> pip package with both sync and <code>AsyncMnemoClient</code> means Python-based agent frameworks can plug in without wrapping the REST API manually.</p>
<p><strong>122 Rust tests + 21 Python tests + 12 benchmarks.</strong> For a project that&rsquo;s been public for 5 days, that&rsquo;s a strong signal the author cares about correctness.</p>
<h2 id="what-could-be-better">What Could Be Better</h2>
<p><strong>No pre-built release binaries yet.</strong> You have to compile from source or use Docker. For a Rust binary that promises &ldquo;single static binary deployment,&rdquo; shipping pre-built binaries for Linux x86_64 and ARM64 would cut the setup friction in half. Still, Docker is the smoothest path right now — I had it running in about three minutes.</p>
<p><strong>Entity extraction quality depends entirely on your LLM model.</strong> Mnemo doesn&rsquo;t do its own NER — it delegates entity extraction to whatever LLM you configure. So feed it a weak model and you&rsquo;ll get weak entities. In short, the system is only as smart as the LLM behind it.</p>
<p><strong>The project is 5 days old.</strong> 193 stars is legit for a week-old Rust project, but there&rsquo;s no community, no plugin ecosystem, no mature documentation beyond the README and a handful of markdown docs. Still, you&rsquo;re an early adopter — and that comes with tradeoffs.</p>
<p>But my take after using it: none of these are dealbreakers for the right use case.</p>
<h2 id="self-hosted-mnemo-deployment">Self-Hosted Mnemo Deployment</h2>
<p>So if you want Mnemo running 24/7 as a memory backend for your agents, you&rsquo;ll deploy it on a VPS. Here&rsquo;s the setup I used:</p>
<ol>
<li>Spin up a Linux VM (the cheapest tier on any cloud provider works — 1 vCPU, 1 GB RAM is plenty for the Mnemo binary itself; you&rsquo;ll want more if you run Ollama on the same machine)</li>
<li>Install Docker (or compile from source)</li>
<li>Run <code>docker compose up -d</code> from the cloned repo</li>
<li>Optionally add Ollama on the same machine for fully local entity extraction</li>
</ol>
<!-- BEGIN AFFILIATE LINKS (generated by ads-center for ToolGenix) -->
<p><em>Disclosure: Some of the links below are affiliate links. If you sign up through them, I may earn a commission at no extra cost to you.</em></p>
<p>To deploy Mnemo 24/7, you'll need a VPS. I recommend <strong>DigitalOcean</strong> — new users get <strong>$200 in free credit</strong> (valid for 60 days), which is more than enough to run Mnemo for months. The $6/month basic Droplet handles Mnemo + Ollama without breaking a sweat:</p>
<p><a href="https://toolgenix.nxtniche.com/go/do" rel="nofollow sponsored" target="_blank">→ DigitalOcean: Get $200 Free Credit</a></p>
<p>Prefer a provider with more global regions or better Asia-Pacific coverage? <strong>Vultr</strong> offers datacenters worldwide and new accounts receive <strong>$50–100 in credit</strong>. Their $6/month cloud instances are equally suitable:</p>
<p><a href="https://toolgenix.nxtniche.com/go/vultr" rel="nofollow sponsored" target="_blank">→ Vultr: Start with Free Credit</a></p>
<!-- END AFFILIATE LINKS -->
<p>So for the VPS, I&rsquo;d recommend <strong>DigitalOcean</strong> or <strong>Vultr</strong> — both offer $6–12/month droplets/instances that handle this workload easily. And if you need GPU instances for running larger LLM extraction models locally, <strong>AWS</strong> has spot GPU instances that work well for batch processing.</p>
<!-- BEGIN AFFILIATE LINKS (generated by ads-center for ToolGenix) -->
<p><em>Disclosure: Some of the links below are affiliate links. If you sign up through them, I may earn a commission at no extra cost to you.</em></p>
<p>If you prefer to run LLM extraction on your own hardware rather than renting cloud GPU instances, a dedicated GPU is the way to go. The <strong>NVIDIA GeForce RTX 4090</strong> is currently one of the best consumer cards for local LLM inference — 24 GB VRAM handles models up to ~13B parameters comfortably:</p>
<p><a href="https://toolgenix.nxtniche.com/go/amazon/B0CHZG4B5X" rel="nofollow sponsored" target="_blank">→ NVIDIA RTX 4090 on Amazon (check current price)</a></p>
<p>For a more budget-friendly option, the <strong>RTX 4070 Super</strong> (12 GB VRAM) works well for 7B-parameter models:</p>
<p><a href="https://toolgenix.nxtniche.com/go/amazon/B0CZ9D4TKK" rel="nofollow sponsored" target="_blank">→ NVIDIA RTX 4070 Super on Amazon</a></p>
<!-- END AFFILIATE LINKS -->
<p>The Docker Compose setup is the easiest path: the repo includes a <code>docker-compose.yml</code> that wires Mnemo to a bundled Ollama instance. One command gets you a fully local, persistent AI memory layer.</p>
<h2 id="final-verdict">Final Verdict</h2>
<table>
	<thead>
			<tr>
					<th style="text-align: left">Dimension</th>
					<th style="text-align: center">Rating</th>
					<th style="text-align: left">Notes</th>
			</tr>
	</thead>
	<tbody>
			<tr>
					<td style="text-align: left">Architecture</td>
					<td style="text-align: center">⭐⭐⭐⭐½</td>
					<td style="text-align: left">Clean crate layering, petgraph-based graph engine, 6-stage retrieval pipeline</td>
			</tr>
			<tr>
					<td style="text-align: left">Performance</td>
					<td style="text-align: center">⭐⭐⭐⭐⭐</td>
					<td style="text-align: left">4.2 ms full pipeline on M2, sub-millisecond graph ops</td>
			</tr>
			<tr>
					<td style="text-align: left">Ease of use</td>
					<td style="text-align: center">⭐⭐⭐</td>
					<td style="text-align: left">Docker is easy; no pre-built binaries yet</td>
			</tr>
			<tr>
					<td style="text-align: left">Documentation</td>
					<td style="text-align: center">⭐⭐⭐⭐</td>
					<td style="text-align: left">README is thorough, API docs are clear, could use more deployment guides</td>
			</tr>
			<tr>
					<td style="text-align: left">Maturity</td>
					<td style="text-align: center">⭐⭐⭐</td>
					<td style="text-align: left">5 days old, solid foundations but early</td>
			</tr>
			<tr>
					<td style="text-align: left">Value</td>
					<td style="text-align: center">⭐⭐⭐⭐½</td>
					<td style="text-align: left">Free + MIT + zero cloud dependency = hard to beat</td>
			</tr>
	</tbody>
</table>
<p>So Mnemo solves a real problem — LLM memory — with genuinely good architecture. It&rsquo;s not a mass-market product. Still, it&rsquo;s a developer tool written in Rust, designed to be self-hosted and fully controlled.</p>
<p>And if you&rsquo;re building custom LLM pipelines and you&rsquo;ve been hacking together flat context dumps or paying for cloud memory APIs, give Mnemo a look. The knowledge graph approach to memory is the direction the space needs to go. At 193 stars and climbing, I suspect I&rsquo;m not the only one who thinks so.</p>
<p><em>Disclosure: Some links in this article are affiliate links. If you sign up through them, I may earn a commission at no extra cost to you.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
