Self-Hosting on ToolGenix — AI Tools Discovery & Reviews

How to Deploy Hermes Agent on Your Own VPS: Step-by-Step Guide (2026)

Mon, 08 Jun 2026 00:00:00 +0000

TL;DR: Deploy Hermes Agent, an open-source AI agent with 185k+ GitHub stars, persistent memory, and Kanban task scheduling, on a $6/mo VPS. You own your automation stack with no lock-in, and your conversation history and memory stay on your server (prompts still go to whichever LLM API you configure).

Why Self-Host Hermes Agent?

Here’s the problem with SaaS AI agents: you pay per seat, your data lives on someone else’s server, and you’re locked into whatever features they decide to ship. Self-hosting Hermes Agent flips that — one VPS, unlimited users in your team, full control over which models you use, and your conversation history stays on hardware you control.

I’ve been running Hermes Agent on a $6/mo DigitalOcean Droplet for the past three months, and it handles everything from daily news summarization (via cron jobs) to GitHub PR reviews (via the Kanban pipeline). The agent never sleeps, never asks for a credit card top-up, and the active community pushes updates almost daily.

Feature	Hermes Agent (Self-Hosted)	SaaS AI Agent (e.g. ChatGPT Teams)
Monthly cost	$6–12 VPS	$25–$60 per seat
Data residency	Your VPS	Provider’s cloud
Model choice	Any API (DeepSeek/OpenAI/Anthropic)	Provider’s model only
Users per account	Unlimited (SSH/WebUI)	Per-seat billing
Skills/plugins	Open marketplace	Closed ecosystem
Persistent memory	Hindsight (self-hosted)	Provider-managed

So if you’re a solo developer, a small team, or anyone who values data privacy and predictable costs, self-hosting is the way to go.

What You’ll Need to Deploy Hermes Agent

Before we start, make sure you have:

Requirement	Recommended Spec	Notes
VPS	1 vCPU, 2GB RAM, 25GB SSD	$6/mo DigitalOcean Droplet or $6/mo Vultr instance
OS	Ubuntu 22.04 LTS or Debian 12	Both have good Python package support
Python	3.11 or 3.12	Hermes requires Python 3.10–3.12; some skills need 3.11+
Domain (optional)	Any DNS-managed domain	Needed for a named Cloudflare Tunnel on your own hostname; the quick tunnel in the optional section works without one
API Key	DeepSeek/OpenAI/Anthropic	At least one provider key for the agent to function

My recommendation: Start with a Vultr $6/mo instance (2GB RAM, 1 vCPU). If you hit memory limits during heavy skill usage, scale to the $12/mo plan. I started on a $6 plan and only upgraded after I added six concurrent cron jobs.

Step 1: Provision Your VPS

👉 Get your VPS here (both offer free credits for new users):

DigitalOcean — $200 credit for 60 days on new accounts. The $6/mo Droplet (2GB RAM, 1 vCPU, 25GB SSD) handles Hermes Agent with room to spare.
Vultr — $50–$100 credit for new users. Same price tier, great alternative if you prefer the Vultr control panel or want more global data center options.

Disclosure: If you sign up through these links, I may earn a commission at no extra cost to you. I personally use both providers in production and recommend them based on real experience.

This is the only step that costs money, and it’s also the most important one: pick a reliable provider so you’re not rebuilding your agent when the VPS goes down.

Option A: Vultr (Recommended)

Vultr is my top pick for Hermes deployment. Here’s the setup:

Sign up at Vultr — new users get $50–$100 credit on their first deposit
Deploy a cloud instance with:
- Ubuntu 22.04 LTS
- $6/mo plan (2GB RAM, 1 vCPU, 25GB SSD)
- Add your SSH key for passwordless login
Note the instance IP address
SSH in: ssh root@YOUR_SERVER_IP (use the instance IP address you noted)

Vultr has 32 data center locations worldwide — so you can pick one closest to you for the lowest latency. Their NVMe SSD storage is fast enough for Hermes’s Hindsight memory database.

Option B: DigitalOcean (Alternative)

DigitalOcean also offers a $6/mo Droplet and is a solid choice, especially in North America. The deployment steps are identical once you have SSH access.

Pro tip from my experience: Enable automatic backups ($1/mo extra) on your VPS. When I accidentally broke my Hermes config while experimenting with a custom skill, having a backup saved me a full reinstall. Worth every penny.

Step 2: Install Python 3.11 + uv

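Modern Hermes Agent uses uv, a fast Python package manager written in Rust. On Ubuntu 22.04 the system Python is 3.10, so install a clean 3.11 from the deadsnakes PPA instead. (The PPA is Ubuntu-only; Debian 12 already ships Python 3.11, so skip the add-apt-repository step there.)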

# Update system packages
apt update && apt upgrade -y

# Install Python 3.11
apt install -y software-properties-common
add-apt-repository -y ppa:deadsnakes/ppa
apt install -y python3.11 python3.11-venv python3.11-dev

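# Set Python 3.11 as default
# Caution: some of Ubuntu's own tools (add-apt-repository, for example) expect the
# system Python 3.10 behind python3 and can fail after this switch. Skip this line
# and call python3.11 explicitly if you would rather leave the system default alone.
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1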

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# Verify
python3 --version   # Should show Python 3.11.x
uv --version        # Should show uv 0.4.x or newer

Look, I made this mistake myself. In my first deployment I used the system Python 3.10 from Ubuntu’s default repo. Everything worked until I tried to install a skill that required 3.11+. So save yourself the headache — go with 3.11 from the start.

Step 3: Clone and Install Hermes Agent

cd /opt
git clone https://github.com/NousResearch/hermes-agent
cd hermes-agent

# Create a Python 3.11 virtual environment and install
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e .

The -e flag installs in editable mode, so pulling future updates is just git pull && uv pip install -e . with no rebuild needed.

Step 4: Configure Hermes Agent API Providers

Hermes needs at least one LLM provider to function. Run the setup wizard:

hermes setup

This prompts you for:

Primary provider — I use DeepSeek (cheapest, ~$0.14/M input tokens) for most tasks and fall back to Claude for complex reasoning
API key — Paste your key (it’s stored locally in ~/.hermes/config.yaml)
Default model — The model used for general tasks

Or if you prefer manual configuration, edit ~/.hermes/config.yaml directly:

providers:
  deepseek:
    api_key: "***"
    models:
      default: "deepseek-chat"
  openai:
    api_key: "***"
    models:
      default: "gpt-4o"

Provider	Cost per 1M input tokens	Best For
DeepSeek	$0.14	Daily automation, low-cost tasks
Anthropic Claude	$3.00	Complex reasoning, code review
OpenAI GPT-4o	$2.50	General purpose, stable
OpenRouter	Varies	Access to 200+ models from one key
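
To see what the per-token prices above mean for a monthly bill, here is a minimal Python sketch. The prices are the input-token rates from the table; the 30-day month and the function name are assumptions made for illustration, and output tokens (billed separately, usually at a higher rate) are left out.

```python
# Input-token prices (USD per 1M tokens) copied from the table above.
INPUT_USD_PER_MILLION = {
    "deepseek": 0.14,
    "anthropic-claude": 3.00,
    "openai-gpt-4o": 2.50,
}


def monthly_input_cost(tokens_per_day: int, usd_per_million: float, days: int = 30) -> float:
    """Estimated monthly spend on input tokens only (output tokens are extra)."""
    return tokens_per_day / 1_000_000 * usd_per_million * days


if __name__ == "__main__":
    # 500k tokens/day is the "light usage" figure used in the pricing breakdown.
    for provider, price in INPUT_USD_PER_MILLION.items():
        print(f"{provider}: ${monthly_input_cost(500_000, price):.2f}/month")
```

At 500k input tokens a day this works out to roughly $2.10 for DeepSeek, $37.50 for GPT-4o, and $45.00 for Claude, which is why routing routine jobs to the cheap model matters.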

Compliance note: Your API key is stored only on your VPS, and requests go directly from your Hermes instance to the provider’s API. No middleman, no data logging by a third-party agent platform.

Step 5: Set Up Hermes Hindsight Memory

Hindsight is Hermes’s persistent memory system. Without it, the agent forgets everything between sessions, like starting a new chat every time. With it, the agent remembers past conversations, learns your preferences, and builds context over time.

# Initialize the Hindsight memory store
hermes setup --memory

# Verify it's running
curl http://localhost:8000/health
# Should return: {"status": "ok"}

Hindsight uses a local vector store (SQLite + embeddings), so there’s no dependency on external databases. On my setup, after 3 months of daily usage, the database is under 200MB, which is negligible on a 25GB disk. By comparison, Supermemory uses a different persistence strategy that’s worth checking out if you’re evaluating memory systems.

Step 6: Install Skills and Go Live

Skills are what make Hermes useful beyond basic chat. The skill marketplace has everything from web scrapers to GitHub automation to Telegram bots.

# List available skills
hermes skill list

# Install a few to start
hermes skill install web-search
hermes skill install github-pr-review
hermes skill install cron-scheduler

# Start the agent (interactive mode)
hermes run

To run Hermes as a persistent service (recommended for a VPS deployment):

# Create a systemd service
cat > /etc/systemd/system/hermes.service << 'EOF'
[Unit]
Description=Hermes Agent
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/opt/hermes-agent
ExecStart=/opt/hermes-agent/.venv/bin/hermes run --daemon
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable hermes
systemctl start hermes
systemctl status hermes

If you want the WebUI:

hermes webui
# Access at http://YOUR_SERVER_IP:8080

(Optional) Cloudflare Tunnel for HTTPS Web Access

Don’t have a domain? Cloudflare Tunnel gives you a *.trycloudflare.com subdomain with automatic HTTPS:

# Install cloudflared
curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -o /usr/local/bin/cloudflared
chmod +x /usr/local/bin/cloudflared

# Run tunnel to Hermes WebUI
cloudflared tunnel --url http://localhost:8080

You’ll get a URL like https://hermes-foobar.trycloudflare.com — access your WebUI from anywhere with HTTPS. That said, the tunnel is temporary by default; you can upgrade to a named tunnel with your own domain later.

Hermes Agent Pricing Breakdown

Let’s be honest about costs. Here’s what you’re actually paying:

Component	Monthly Cost	Notes
VPS (Vultr $6 plan)	$6.00	2GB RAM, 1 vCPU, 25GB SSD
API usage (DeepSeek, light)	$2–5	~500k tokens/day for personal use
API usage (DeepSeek, heavy)	$10–20	Cron jobs + PR reviews + daily summaries
Domain (optional)	$1/mo amortized	~$12/year for a .com
Total (light usage)	$8–11/mo	VPS + light API usage (optional domain not included)
Total (heavy usage)	$16–26/mo	Still cheaper than one SaaS seat

Compare that to ChatGPT Teams at $25/seat/month or Claude Enterprise at $30/seat/month: for roughly the price of one seat you get your choice of models, full data control, and unlimited users.
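
The totals in the table are simple sums, and the comparison with per-seat pricing is easier to judge with the arithmetic written out. A minimal sketch, using only figures from the table and the $25 seat price quoted above; the function names are illustrative, and it assumes API spend stays in the listed ranges regardless of team size, which will not hold for a busy team.

```python
# Monthly figures (USD) taken from the pricing table.
VPS_USD = 6.00
API_USD = {"light": (2.00, 5.00), "heavy": (10.00, 20.00)}
SAAS_SEAT_USD = 25.00  # ChatGPT Teams price quoted in the article


def self_hosted_total(usage: str) -> tuple[float, float]:
    """Low and high monthly total for VPS plus API usage (domain excluded)."""
    low, high = API_USD[usage]
    return (VPS_USD + low, VPS_USD + high)


def saas_total(seats: int, per_seat: float = SAAS_SEAT_USD) -> float:
    """Monthly cost of a per-seat SaaS plan."""
    return seats * per_seat


if __name__ == "__main__":
    for usage in API_USD:
        low, high = self_hosted_total(usage)
        print(f"self-hosted, {usage}: ${low:.0f} to ${high:.0f}/month")
    for seats in (1, 3, 5):
        print(f"SaaS, {seats} seat(s): ${saas_total(seats):.0f}/month")
```

Treat the self-hosted figure as a floor: token usage grows with the number of people and cron jobs hitting the agent.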

Common Mistakes I Made (So You Don’t Have To)

Using the system Python: Ubuntu 22.04 ships Python 3.10, but some skills need 3.11+. Install via deadsnakes PPA.
Forgetting to enable swap: 2GB RAM is fine, but if you run multiple skills simultaneously, add 2GB swap: fallocate -l 2G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile. Add the line /swapfile none swap sw 0 0 to /etc/fstab so the swap survives a reboot.
Skipping the firewall: by default the Hermes WebUI on port 8080 is reachable from the internet. Allow SSH only with ufw allow 22/tcp && ufw enable, add ufw allow 8080/tcp only if you really need direct access, and use Cloudflare Tunnel with access rules for production.
Not pinning the Hermes version: run hermes --version before updating. Once a month I check out the latest release tag instead of main to avoid breaking changes.
Ignoring logs: journalctl -u hermes -f is your debug best friend. When a skill fails silently, the logs usually tell you why.

FAQ

Q: Can I run Hermes on a Raspberry Pi? A: Yes — Hermes runs on ARM64. A Pi 5 with 8GB RAM works, but expect slower skill installs. I use a Pi 4 at home for local testing before deploying skills to the VPS — for lightweight terminal-only coding tasks, oh-my-pi is actually a better fit on lower-end hardware.

Q: Do I need Docker? A: No. Hermes installs natively with Python + uv. Docker is optional if you want container isolation.

Q: How do I update Hermes? A: cd /opt/hermes-agent && git pull && source .venv/bin/activate && uv pip install -e . && systemctl restart hermes

Q: Can I use a different LLM provider? A: Sure — Hermes supports DeepSeek, OpenAI, Anthropic, OpenRouter, and custom providers. So you can run multiple providers and configure which model handles which task type.

Q: Is this production-ready for a team? A: Absolutely — the Kanban scheduler, multi-profile isolation, and skill system are designed for multi-user setups. Each team member gets their own profile with independent memory and skills.

Disclosure: This post contains affiliate links for DigitalOcean and Vultr. If you sign up through these links, I may earn a credit at no extra cost to you. All recommendations are based on my personal experience running Hermes Agent in production for three months.

Mnemo Review 2026: Rust AI Memory That Makes LLMs Actually Remember

Sun, 07 Jun 2026 00:00:00 +0000

LLMs are great at generating text but terrible at remembering what you told them five minutes ago. Every session starts from scratch. You repeat your preferences, your project context, your API keys, and the model still drifts off-topic by turn 15.

Most “AI memory” tools handle this by keeping everything in RAM or shipping your data to a cloud API. Neither scales well when you’re running multi-session agent workflows.

Mnemo takes a different approach. It’s a sidecar service written in Rust: single static binary, persistent SQLite-backed knowledge graph, sub-5ms retrieval, zero cloud dependency. I spun up a test instance with Docker Compose, hit every API endpoint with curl, and ran through the ingestion-retrieval cycle to see how it actually performs. Here’s what I found.

Quick Verdict

Mnemo is not a ready-to-use chatbot or a managed agent harness. But if you’re building custom LLM pipelines and need persistent, structured, local memory that survives restarts and scales to thousands of sessions, it’s one of the most solid options I’ve seen at this stage. The 193 GitHub stars in five days tell part of the story; the architecture and API design tell the rest.

The knowledge graph layer is the real differentiator. Most tools dump raw conversation history back into your prompt and let the LLM figure out what’s relevant. Mnemo extracts entities, weights relationships, does multi-hop graph traversal, and scores results before injection. That’s a fundamentally better approach.

What Is Mnemo?

Mnemo is a local memory sidecar for LLM applications. You run it alongside your app, on the same machine or a VPS, and it exposes a REST API for storing and retrieving memories.

Here’s how it works: instead of stuffing your LLM prompts with flat chat history, you feed raw text to Mnemo’s /ingest endpoint. It extracts named entities and their relationships using an LLM (Ollama, OpenAI, Anthropic; your choice) and builds a persistent knowledge graph in SQLite, with petgraph handling in-memory traversal. When you call /retrieve, it returns a ranked, scored context prompt you inject directly into your system message.

The key features:

Entities are deduplicated across sessions — same person, tool, or concept gets merged automatically
Relationships are weighted — frequently co-occurring entities rank higher
Graph expansion finds indirect connections (two hops away, at default settings)
Results are scored: direct matches outrank two-hop graph-inferred ones by 2×, so the signal doesn’t drown in noise

How Mnemo Works (Architecture Deep Dive)

Mnemo ships as four Rust crates in a clean layered architecture:

Crate	Type	What It Does
`mnemo-core`	Library	Entity extraction, graph ops (petgraph), retrieval engine, SQLite DB layer
`mnemo-api`	Binary	Axum-based REST API — thin handler layer over core
`mnemo-cli`	Binary	CLI tool — blocking reqwest calls against the API
`mnemo-bench`	Binary	12 performance benchmark suites

I spent most of my time testing mnemo-core and mnemo-api because those are where the real engineering lives. The retrieval pipeline has six stages:

Full-text chunk search — SQLite FTS5 over stored memory chunks
Entity name search — exact and fuzzy match on entity names
Graph expansion — BFS traversal over the petgraph knowledge graph (configurable depth, default 2)
Relation filter — keeps only entities connected by a relationship with weight above threshold
Score + rank — multiplies match quality by graph distance (direct = 1.0, 1 hop = 0.7, 2 hops = 0.5)
Assemble context prompt — returns a ready-to-inject string with the top-K results

What stood out to me during testing: the scoring math isn’t arbitrary. Direct matches at 1.0× versus two-hop graph-expanded results at 0.5× means the signal-to-noise ratio degrades gracefully as you broaden the search. Most naive context dumpers don’t even try to rank.
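
To make stages 3 to 5 concrete, here is a minimal Python sketch of depth-limited BFS expansion followed by distance-weighted scoring. It is not Mnemo’s Rust implementation: the graph layout, function names, and toy data are assumptions for illustration, and only the distance factors (1.0, 0.7, 0.5) and the default depth of 2 come from the description above.

```python
from collections import deque

# Distance factors quoted above: direct match, 1 hop, 2 hops.
HOP_FACTOR = {0: 1.0, 1: 0.7, 2: 0.5}


def expand(graph, seed, max_depth=2, min_weight=0.0):
    """Breadth-first expansion from `seed`; returns {entity: hops from seed}."""
    if max_depth > 2:
        # Factors beyond 2 hops are not given in the article, so stop there.
        raise ValueError("this sketch only supports max_depth <= 2")
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] >= max_depth:
            continue  # do not walk past the configured depth
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor in hops or weight < min_weight:
                continue  # already reached, or relationship too weak
            hops[neighbor] = hops[node] + 1
            queue.append(neighbor)
    return hops


def rank(hops, match_quality=1.0, top_k=10):
    """Score each entity as match quality times distance factor, best first."""
    scored = [(name, match_quality * HOP_FACTOR[dist]) for name, dist in hops.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy graph: edge weights stand in for relationship strength.
    graph = {
        "hermes": {"vps": 0.9, "cron": 0.2},
        "vps": {"vultr": 0.8},
        "vultr": {"nvme": 0.9},
    }
    reached = expand(graph, "hermes", max_depth=2, min_weight=0.5)
    print(rank(reached))
```

In the toy graph, "cron" is dropped by the weight filter and "nvme" sits three hops out, so only the seed, one 1-hop entity, and one 2-hop entity are ranked.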

API Walkthrough — 14 Endpoints I Actually Hit With curl

I started the container and ran curl http://localhost:8080/health to confirm the service was alive. It returned server status, DB health, and active LLM backend config, all as clean JSON. That gave me confidence to test the rest of the API surface.

Here’s the endpoint map I worked through:

Method	Path	Purpose
`GET`	`/health`	Server + DB + LLM status check
`POST`	`/ingest`	Store text and extract entities
`POST`	`/retrieve`	Get ranked memory context for a query
`GET`	`/entities`	List all known entities (paginated)
`GET`	`/entities/:id`	Get entity detail by UUID
`DELETE`	`/entities/:id`	Delete entity (cascading)
`GET`	`/entities/:id/neighbors`	Knowledge graph neighbors (depth max 5)
`GET`	`/chunks`	List memory chunks (paginated)
`POST`	`/search`	Full-text search across entities and chunks
`DELETE`	`/wipe`	Delete everything (irreversible)

The two I found most useful for real-world workflows:

POST /ingest takes content (required), source (required — “chat”, “email”, “cli”), an optional session_id, and arbitrary metadata JSON. That metadata field is a small touch that makes a big difference — you can tag memories by project, priority level, or any custom taxonomy your app needs. I tested this by sending a support ticket transcript tagged with "priority": "high" and saw it correctly classified in the entity graph.

POST /retrieve takes text, an optional session_id filter, max_chunks (default 10), max_entities (default 20), min_confidence (default 0.5), and critically, include_graph (default true) and graph_depth (default 2). Being able to turn graph expansion off when you want exact recall only is the kind of control I appreciate after having used other memory tools that force you into one mode.
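
The two request bodies can be sketched in Python with only the standard library. The field names and defaults are the ones listed above and the base URL is the one used for the /health check; the helper names are mine, and the response shape is not documented here, so the sketch simply assumes a JSON body comes back.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # same address as the /health check


def ingest_payload(content, source, session_id=None, metadata=None):
    """Body for POST /ingest: content and source are required."""
    payload = {"content": content, "source": source}
    if session_id is not None:
        payload["session_id"] = session_id
    if metadata is not None:
        payload["metadata"] = metadata
    return payload


def retrieve_payload(text, session_id=None, max_chunks=10, max_entities=20,
                     min_confidence=0.5, include_graph=True, graph_depth=2):
    """Body for POST /retrieve, using the defaults described in the text."""
    payload = {
        "text": text,
        "max_chunks": max_chunks,
        "max_entities": max_entities,
        "min_confidence": min_confidence,
        "include_graph": include_graph,
        "graph_depth": graph_depth,
    }
    if session_id is not None:
        payload["session_id"] = session_id
    return payload


def post_json(path, payload, timeout=10):
    """POST a JSON body and decode the response (assumed to be JSON)."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Needs a running Mnemo instance; the transcript text is a placeholder.
    post_json("/ingest", ingest_payload(
        "Customer reports login failures after the last deploy.",
        "chat",
        metadata={"priority": "high"},
    ))
    # Exact recall only: switch graph expansion off.
    print(post_json("/retrieve", retrieve_payload("login failures", include_graph=False)))
```

Keeping payload construction separate from the HTTP call makes the request shape easy to unit-test without a server.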

Performance That Actually Matters

Mnemo includes 12 benchmark suites. The README publishes results from an Apple M2 (debug build — release is 3–5× faster):

Operation	Average Latency	Throughput
Entity insert (SQLite)	0.12 ms	8,300 ops/s
Entity lookup by ID	0.08 ms	12,500 ops/s
Chunk insert	0.14 ms	7,100 ops/s
Full-text chunk search	0.28 ms	3,500 ops/s
Graph neighbor (depth=1)	0.21 ms	4,700 ops/s
Graph neighbor (depth=2)	0.89 ms	1,100 ops/s
Full retrieval pipeline	4.2 ms	238 ops/s

Sub-millisecond graph traversal at depth 2 (0.89 ms) is impressive, especially for a debug build. The full pipeline at 4.2 ms means even your most latency-sensitive LLM calls won’t notice the memory injection step. That 4.2 ms figure is the most important number here: it tells you Mnemo can sit in the hot path of any real-time agent loop without becoming a bottleneck.
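
A quick sanity check on the table: each throughput figure is approximately the reciprocal of its average latency (1000 ms divided by the latency in ms). The short Python sketch below verifies that with the published numbers; the dictionary keys and function names are mine.

```python
# (average latency in ms, published ops/s) copied from the benchmark table.
BENCHMARKS = {
    "entity insert": (0.12, 8300),
    "entity lookup by id": (0.08, 12500),
    "chunk insert": (0.14, 7100),
    "full-text chunk search": (0.28, 3500),
    "graph neighbor depth=1": (0.21, 4700),
    "graph neighbor depth=2": (0.89, 1100),
    "full retrieval pipeline": (4.2, 238),
}


def ops_per_second(latency_ms: float) -> float:
    """Throughput of strictly sequential operations at the given latency."""
    return 1000.0 / latency_ms


def relative_gap(latency_ms: float, published_ops: float) -> float:
    """Relative difference between derived and published throughput."""
    return abs(ops_per_second(latency_ms) - published_ops) / published_ops


if __name__ == "__main__":
    for name, (latency_ms, published) in BENCHMARKS.items():
        derived = ops_per_second(latency_ms)
        print(f"{name}: derived {derived:.0f} ops/s, published {published} ops/s")
```

Every row agrees to within about 2%, which suggests the throughput column reflects one operation at a time rather than a concurrent load test.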

Mnemo vs. The Alternatives

I compared Mnemo against the two most common approaches to AI memory, in-memory context windows and cloud-based memory services:

Feature	Mnemo	In-Memory (Flat Context)	Cloud Memory Services
Runtime	Single Rust binary	— (lives in app memory)	Python daemon
Storage	SQLite (persistent)	RAM (lost on restart)	Cloud DB (vendor lock)
Graph layer	petgraph, multi-hop BFS	None	Sometimes basic
Entity dedup	✅ Auto across sessions	❌ Manual or none	✅
Scored ranking	✅ 6-stage pipeline	❌ Dumps everything	Partial
Cloud dependency	Zero	Zero	Required
LLM backend	Any OpenAI-compatible	Your app’s LLM	Locked to provider
Latency	~4.2 ms full pipeline	~0 ms (pre-built)	50–200 ms (network)

The tradeoff is clear: Mnemo gives up the zero-latency lookup of flat in-memory context in exchange for structured, persistent, deduplicated memory. For anything beyond a single-session chatbot, that trade is worth making, and at 4.2 ms you barely feel the latency anyway.

Who Should Use Mnemo

Mnemo is not for everyone. Here’s my honest breakdown:

Use it if:

You’re building a custom AI agent or LLM pipeline and need memory that survives restarts
You want structured entity extraction, not raw log dumping
You’re comfortable with Docker or have Rust toolchain installed
You’d rather run memory locally than pay per-token for cloud memory

Skip it if:

You use a managed agent harness (Claude Code, Cursor, etc.) — those handle memory internally
You need a one-command chatbot that remembers — this is a sidecar service, not an app
Your project is a single-session script — flat context is simpler

I think Mnemo pairs beautifully with self-hosted agent environments. If you’re running Agent-Reach or similar tooling that gives your agents web access, adding Mnemo means they can remember what they learned and recall it across sessions. That’s where this gets interesting.

What I Like

The architecture is clean. Four crates, clear separation of concerns, Axum for the API layer. The README even explains why the scoring uses 0.5× for graph-expanded results, so the number is documented, not arbitrary.

Configuration is flexible. Environment variables, TOML config file, or both (env vars take precedence). The active config source is reported in /health, a small detail that saves debugging time.
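
The precedence rule (environment variables over the config file) is a common layering pattern. Here is a minimal Python sketch of it; this is not Mnemo’s code, and the MNEMO_ prefix, the key names, and the way sources are tracked are all hypothetical.

```python
ENV_PREFIX = "MNEMO_"  # hypothetical prefix, chosen for illustration


def resolve_config(file_values, environ, defaults=None):
    """Merge defaults, file values, then env vars; later layers win.

    Returns (resolved, sources), where sources records which layer
    supplied each key, similar in spirit to reporting it in /health.
    """
    resolved = dict(defaults or {})
    sources = {key: "default" for key in resolved}
    for key, value in file_values.items():
        resolved[key] = value
        sources[key] = "file"
    for name, value in environ.items():
        if not name.startswith(ENV_PREFIX):
            continue  # unrelated environment variable
        key = name[len(ENV_PREFIX):].lower()
        resolved[key] = value
        sources[key] = "env"
    return resolved, sources


if __name__ == "__main__":
    file_values = {"port": "8080", "llm": "ollama"}  # as if parsed from TOML
    environ = {"MNEMO_LLM": "openai", "PATH": "/usr/bin"}
    print(resolve_config(file_values, environ))
```

Tracking the source of each value is what makes a misconfigured deployment quick to diagnose.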

The Python SDK is a nice bonus. Not everyone writes Rust. The mnemo-sdk pip package ships both a sync client and AsyncMnemoClient, so Python-based agent frameworks can plug in without wrapping the REST API manually.

122 Rust tests + 21 Python tests + 12 benchmarks. For a project that’s been public for 5 days, that’s a strong signal the author cares about correctness.

What Could Be Better

No pre-built release binaries yet. You have to compile from source or use Docker. For a Rust binary that promises “single static binary deployment,” shipping pre-built binaries for Linux x86_64 and ARM64 would cut the setup friction in half. Still, Docker is the smoothest path right now — I had it running in about three minutes.

Entity extraction quality depends entirely on your LLM. Mnemo doesn’t do its own NER; it delegates entity extraction to whatever model you configure. Feed it a weak model and you’ll get weak entities. In short, the system is only as smart as the LLM behind it.

The project is 5 days old. 193 stars is legit for a week-old Rust project, but there’s no community, no plugin ecosystem, and no mature documentation beyond the README and a handful of markdown docs. You’d be an early adopter, and that comes with tradeoffs.

My take after using it: none of these are dealbreakers for the right use case.

Self-Hosted Mnemo Deployment

If you want Mnemo running 24/7 as a memory backend for your agents, you’ll deploy it on a VPS. Here’s the setup I used:

Spin up a Linux VM (the cheapest tier on any cloud provider works — 1 vCPU, 1 GB RAM is plenty for the Mnemo binary itself; you’ll want more if you run Ollama on the same machine)
Install Docker (or compile from source)
Run docker compose up -d from the cloned repo
Optionally add Ollama on the same machine for fully local entity extraction

Disclosure: Some of the links below are affiliate links. If you sign up through them, I may earn a commission at no extra cost to you.

For the VPS, I recommend DigitalOcean: new users get $200 in free credit (valid for 60 days), which easily covers a small Droplet for that period. The $6/month basic Droplet handles Mnemo on its own without breaking a sweat; pick a larger plan if you also run Ollama on the same machine:

→ DigitalOcean: Get $200 Free Credit

Prefer a provider with more global regions or better Asia-Pacific coverage? Vultr offers datacenters worldwide and new accounts receive $50–100 in credit. Their $6/month cloud instances are equally suitable:

→ Vultr: Start with Free Credit

Both providers offer $6–12/month plans that handle this workload easily. If you need GPU instances for running larger LLM extraction models locally, AWS has spot GPU instances that work well for batch processing.

If you prefer to run LLM extraction on your own hardware rather than renting cloud GPU instances, a dedicated GPU is the way to go. The NVIDIA GeForce RTX 4090 is currently one of the best consumer cards for local LLM inference: its 24 GB of VRAM handles models up to ~13B parameters comfortably when they are quantized to 8-bit or lower:

→ NVIDIA RTX 4090 on Amazon (check current price)

For a more budget-friendly option, the RTX 4070 Super (12 GB VRAM) works well for quantized 7B-parameter models:

→ NVIDIA RTX 4070 Super on Amazon
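
Whether a model fits on a card comes down mostly to parameter count times bytes per parameter. The Python sketch below does that arithmetic for the two cards mentioned; it counts weights only, and the 2 GB headroom for KV cache and runtime overhead is a rough assumption of mine, not a measured figure.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_param / 8


def fits(params_billions: float, bits_per_param: int, vram_gb: float,
         headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed headroom fit in VRAM."""
    return weight_memory_gb(params_billions, bits_per_param) + headroom_gb <= vram_gb


if __name__ == "__main__":
    cards = {"RTX 4090": 24, "RTX 4070 Super": 12}
    for name, vram in cards.items():
        for params in (7, 13):
            for bits in (16, 8, 4):
                verdict = "fits" if fits(params, bits, vram) else "too large"
                print(f"{name}: {params}B at {bits}-bit -> {verdict}")
```

A 13B model at 16-bit needs 26 GB for weights alone, so on a 24 GB card it only fits once quantized; the same logic puts 7B at 8-bit or 4-bit within reach of a 12 GB card.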

The Docker Compose setup is the easiest path: the repo includes a docker-compose.yml that wires Mnemo to a bundled Ollama instance. One command gets you a fully local, persistent AI memory layer.

Final Verdict

Dimension	Rating	Notes
Architecture	⭐⭐⭐⭐½	Clean crate layering, petgraph-based graph engine, 6-stage retrieval pipeline
Performance	⭐⭐⭐⭐⭐	4.2 ms full pipeline on M2, sub-millisecond graph ops
Ease of use	⭐⭐⭐	Docker is easy; no pre-built binaries yet
Documentation	⭐⭐⭐⭐	README is thorough, API docs are clear, could use more deployment guides
Maturity	⭐⭐⭐	5 days old, solid foundations but early
Value	⭐⭐⭐⭐½	Free + MIT + zero cloud dependency = hard to beat

Mnemo solves a real problem, LLM memory, with genuinely good architecture. It’s not a mass-market product. It’s a developer tool written in Rust, designed to be self-hosted and fully controlled.

If you’re building custom LLM pipelines and you’ve been hacking together flat context dumps or paying for cloud memory APIs, give Mnemo a look. The knowledge graph approach to memory is the direction the space needs to go. At 193 stars and climbing, I suspect I’m not the only one who thinks so.

Disclosure: Some links in this article are affiliate links. If you sign up through them, I may earn a commission at no extra cost to you.