Local-first AI memory layer for any LLM. Persistent knowledge graph,
entity extraction, semantic retrieval — no cloud required.
Most LLMs forget everything the moment a conversation ends. mnemo fixes that.
mnemo is a sidecar service that watches every conversation you feed it, extracts named entities and relationships using an LLM, builds a persistent knowledge graph in SQLite, and injects relevant context back into future prompts — automatically, in under 50ms. It works with Ollama (fully local, free), OpenAI, Anthropic, or any OpenAI-compatible API. It ships as a single static binary with zero cloud dependency.
your app
│
▼
POST /ingest ──► entity extraction (LLM) ──► knowledge graph (SQLite + petgraph)
│
POST /retrieve ◄── scoring + ranking ◄── graph traversal + full-text search
│
▼
context_prompt ──► inject into your LLM prompt
- You POST raw text to
/ingest(a conversation turn, a document, a note). - mnemo sends it to your configured LLM and extracts entities (people, tools, places, concepts) and the relationships between them.
- Entities are deduplicated by name+type, aliases are merged, and everything is written to SQLite. The in-memory petgraph is updated atomically.
- On POST
/retrieve, mnemo runs a 6-stage pipeline: full-text chunk search → entity name search → graph expansion (BFS over the knowledge graph) → relation filter → score+rank → assemble acontext_promptstring. - You inject
context_promptinto your LLM’s system prompt. Done.
git clone https://github.com/zaydmulani09/mnemo
cd mnemo
docker compose up -d
# Pull the llama3 model the first time (~4 GB)
docker exec mnemo-ollama ollama pull llama3
# Verify everything is healthy
curl http://localhost:8080/health
cargo install --path crates/mnemo-api
# With Ollama
export MNEMO_LLM_BASE_URL=http://localhost:11434/v1
mnemo-api
# With OpenAI
export MNEMO_LLM_BASE_URL=https://api.openai.com/v1
export MNEMO_LLM_API_KEY=sk-...
export MNEMO_LLM_MODEL=gpt-4o-mini
export MNEMO_LLM_PROVIDER=openai
mnemo-api
from mnemo import MnemoClient
client = MnemoClient() # server at http://localhost:8080
# Store a memory
client.ingest("I'm building a Rust vector database called vecdb")
# Get context for injection into your next LLM prompt
print(client.get_context("what am I working on?"))
All endpoints accept and return application/json. Base URL: http://localhost:8080.
| Method | Path | Description | Request body | Response |
|---|---|---|---|---|
GET |
/health |
Server + DB + LLM status | — | HealthResponse |
POST |
/ingest |
Store text, extract entities | IngestRequest |
IngestResponse |
POST |
/retrieve |
Retrieve ranked memory context | RetrievalQuery |
RetrievalResult |
GET |
/entities |
List entities (paginated) | ?limit&offset |
Entity[] |
GET |
/entities/:id |
Get entity by UUID | — | Entity |
DELETE |
/entities/:id |
Delete entity (cascades) | — | {"deleted":true} |
GET |
/entities/:id/neighbors |
Knowledge graph neighbors | ?depth (max 5) |
GraphNode[] |
GET |
/chunks |
List memory chunks (paginated) | ?limit&offset&session_id |
MemoryChunk[] |
GET |
/chunks/:id |
Get chunk by UUID | — | MemoryChunk |
DELETE |
/chunks/:id |
Delete chunk | — | {"deleted":true} |
POST |
/search |
Full-text search entities + chunks | {"query","limit"} |
{"entities","chunks"} |
DELETE |
/wipe |
Delete all memory (irreversible) | header: X-Confirm-Wipe: true |
{"wiped":true} |
GET |
/stats |
Entity/chunk/graph counts + uptime | — | StatsResponse |
Key request/response types:
Full endpoint documentation with curl examples: docs/api.md
| Variable | Default | Description |
|---|---|---|
MNEMO_DB_PATH |
mnemo.db |
SQLite database file path |
MNEMO_PORT |
8080 |
API server port |
MNEMO_LLM_BASE_URL |
http://localhost:11434/v1 |
OpenAI-compatible LLM base URL |
MNEMO_LLM_MODEL |
llama3 |
Model name for entity extraction |
MNEMO_LLM_API_KEY |
ollama |
API key (any value works for Ollama) |
MNEMO_LLM_PROVIDER |
ollama |
Provider type: ollama, openai, anthropic, custom |
Pass --config path/to/config.toml to mnemo-api. See mnemo.example.toml:
db_path = "mnemo.db"
port = 8080
[llm]
provider = "ollama"
base_url = "http://localhost:11434/v1"
model = "llama3"
api_key = "ollama"
timeout_secs = 30
max_retries = 3
max_tokens = 2048
temperature = 0.1
Environment variables take precedence over TOML values. The active config source is reported in GET /health → config_source.
Install:
cargo install --path crates/mnemo-cli
Usage:
# Store a memory
mnemo ingest "I use Neovim and prefer dark mode"
# Retrieve relevant context
mnemo search "what editor do I use?"
# List all extracted entities
mnemo entities
# Show entity detail + graph neighbors
mnemo entity <uuid> --neighbors
# List memory chunks
mnemo chunks
# Server health
mnemo health
# Memory statistics
mnemo stats
# Delete everything (prompts for confirmation)
mnemo wipe
# Skip confirmation prompt
mnemo wipe --yes
# Point at a non-default server
mnemo --server http://192.168.1.10:8080 stats
Install:
See sdk/python/README.md for the full API reference.
Async example:
import asyncio
from mnemo import AsyncMnemoClient
async def main():
async with AsyncMnemoClient() as client:
await client.ingest(
"Alice is a principal engineer at Stripe working on payment infrastructure.",
session_id="session-001",
)
context = await client.get_context(
"what does Alice work on?",
session_id="session-001",
)
print(context)
asyncio.run(main())
A working standalone example: examples/basic_usage.py
Four Rust crates wired together:
| Crate | Type | Role |
|---|---|---|
mnemo-core |
lib | Entity extraction, graph ops, retrieval engine, DB layer |
mnemo-api |
bin | Axum REST API — thin handler layer over mnemo-core |
mnemo-cli |
bin | CLI tool using blocking reqwest against the API |
mnemo-bench |
bin | Performance benchmarks (12 suites) |
Full architecture documentation: docs/architecture.md
Benchmarked on Apple M2, SQLite WAL mode, in-memory petgraph. Debug build numbers — release build (--release) is 3–5× faster.
| Operation | Avg latency | Throughput |
|---|---|---|
| Entity insert (SQLite) | ~0.12 ms | ~8,300 ops/s |
| Entity lookup by ID | ~0.08 ms | ~12,500 ops/s |
| Chunk insert | ~0.14 ms | ~7,100 ops/s |
| Full-text chunk search | ~0.28 ms | ~3,500 ops/s |
| Graph neighbor (depth=1) | ~0.21 ms | ~4,700 ops/s |
| Graph neighbor (depth=2) | ~0.89 ms | ~1,100 ops/s |
| Full retrieval pipeline | ~4.2 ms | ~238 ops/s |
Run cargo run -p mnemo-bench to benchmark on your hardware.
cargo test --workspace # run all 122 tests
make coverage # HTML coverage report (requires cargo-llvm-cov)
make coverage-summary # summary to stdout
cd sdk/python && pytest tests/ -v
cargo run -p mnemo-bench # all 12 benchmarks
cargo run -p mnemo-bench -- --filter graph # graph benchmarks only
cargo run -p mnemo-bench -- --json out.json # save results to JSON
Current test counts: 122 Rust tests · 21 Python tests · 12 benchmarks
PRs welcome. Please run make fmt && make lint before submitting.
Open an issue first for large changes.
See CONTRIBUTING.md for full setup instructions, code style guide, and how to add a new LLM provider.
MIT — see LICENSE



