Research & architecture
Korely's memory stack is paper-grounded. Every non-trivial mechanism in the product (how facts get extracted, how contradictions get resolved, how context gets ordered before it reaches the model) traces back to a specific published result. This page maps each mechanism to its source, so you can audit our design decisions instead of taking our word for them.
The framing we find most useful is the Forms, Functions, Dynamics taxonomy from Memory in the Age of AI Agents (arXiv 2512.13564). In that vocabulary, Korely's three layers decompose cleanly:
| Korely layer | Taxonomy position | What lives there |
|---|---|---|
| Layer 1: memory store + entity graph | Factual / token-level memory | Notes in a managed cloud store (Postgres + pgvector, EU-hosted), auto-extracted entities, typed relationships |
| Layer 2: session memory | Working memory | Per-conversation notes with evolving links (A-MEM pattern) |
| Layer 3: typed facts | Factual memory with temporal dynamics | A canonical vocabulary of predicates across 9 families, bi-temporal validity, contradiction handling |
We didn't pick papers and build features around them. We hit problems while running Korely on our own work: stale facts winning over fresh ones, the model trusting its training data over retrieved sources, contradiction spam in the graph. Then we went looking for who had already solved them.
What we implement
Each entry: what the paper proposes, what Korely implements, and the concept page that documents the mechanism in API-level detail.
| Paper | What it proposes | In Korely |
|---|---|---|
| Graphiti / Zep (arXiv 2501.13956) | A bi-temporal knowledge graph for agent memory: when something became true and when the system learned it are separate axes | Every fact carries valid_from and invalid_at, so point-in-time queries stay correct. See temporal facts |
| ConflictBank, Su et al. 2024 (arXiv 2408.12076, NeurIPS 2024 D&B) | A systematic taxonomy of knowledge conflicts, including temporal contradictions between sources | Two-stage contradiction detection at write time: cheap candidate filter first, model-judged resolution second. See temporal facts |
| A-MEM, Xu et al. 2025 (arXiv 2502.12110) | Agentic memory as structured notes whose links evolve as the conversation progresses | Layer 2 session memory: structured note construction plus link evolution, scoped to run_id. See memory model |
| Mem0, Chhikara et al. 2025 (arXiv 2504.19413) | Extracted, structured memories outperform replaying full conversation history into the context window (the §3.2 result) | A dedicated facts surface (GET /v1/facts), separate from document search. See memory model |
| SELF-RAG, Asai et al. 2023 (arXiv 2310.11511) | Critique tokens: the model verifies that a claim is supported by a retrieved passage before asserting it | Grounding rules in chat: a claim must be backed by a retrieved source before it earns a citation. See architecture |
| Lost in the Middle, Liu et al. 2024 (arXiv 2307.03172) | Models attend most to the beginning and end of the context window, and lose information buried in the middle | Context ordering: the most decision-relevant sources are placed where models actually attend, not mid-prompt. See architecture |
| Toolformer, Schick et al. 2023 (arXiv 2302.04761) | Language models can teach themselves to call external tools through simple APIs | We keep the tool surface simple and predictable to match: GET /v1/facts is a deterministic filter and sort, not a semantic search. Details below and in the API reference |
| DPR, Karpukhin et al. 2020 (arXiv 2004.04906) | Dense retrieval with asymmetric encoders for queries and passages | Queries and documents are embedded with different prefixes, because a three-word question and a thousand-word note are not the same kind of object. See architecture |
| FrugalGPT, Chen, Zaharia, Zou 2023 (arXiv 2305.05176) | A model cascade: route easy steps to small models, reserve strong models for the hard step | Classification and query rewriting run on small models; only the final synthesis step in chat sees a large one. Reads never invoke a generative model at all. See architecture |
| GLiNER, Zaratiana et al. 2023 (arXiv 2311.08526) | Zero-shot named entity recognition with a compact bidirectional model instead of a generative LLM | Entity extraction runs on our own infrastructure as part of the write path; the entity graph builds itself without per-note generative calls. See the graph |
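To make the context-ordering row concrete, here is a minimal sketch of one way to keep strong sources away from the middle of a prompt. It assumes the sources arrive already ranked from most to least relevant; the function name `order_for_context` and the alternating placement rule are illustrative assumptions, not Korely's actual ordering policy.

```python
def order_for_context(sources_by_relevance: list[str]) -> list[str]:
    """Place the strongest sources at the edges of the prompt, the weakest in the middle.

    Hypothetical helper: input is ranked best-first. Even-ranked items fill the
    front, odd-ranked items fill the back in reverse, so rank 1 opens the
    context and rank 2 closes it.
    """
    front: list[str] = []
    back: list[str] = []
    for rank, source in enumerate(sources_by_relevance):
        if rank % 2 == 0:
            front.append(source)
        else:
            back.append(source)
    return front + back[::-1]


ordered = order_for_context(["a", "b", "c", "d", "e"])
print(ordered)  # ['a', 'c', 'e', 'd', 'b']
```

The least relevant source ("e") ends up mid-prompt, which is where the Lost in the Middle result says attention is weakest.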
Bi-temporal facts, concretely
The Graphiti insight is that "when something became true" and "when we
learned about it" are different timestamps, and conflating them corrupts
point-in-time reasoning. A real example from our own store: the fact
"EU server costs 40" was superseded on 2026-06-07 by
"EU server costs 50 euro per month". The old fact wasn't
deleted. Its invalid_at was set, so an agent asking "what
does hosting cost now?" gets the current answer, while the history stays
queryable. The full mechanics are on the
temporal facts page.
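The supersession above can be sketched in a few lines of Python. This is an illustration of the validity axis only, under stated assumptions: the `Fact` class and the `as_of` helper are hypothetical names for this example, not the Korely SDK, and the half-open interval (a fact stops applying on its invalid_at day) is an assumption about boundary handling.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: date                   # when the fact became true
    invalid_at: Optional[date] = None  # when it stopped being true; None means still active


def as_of(facts: list[Fact], when: date) -> list[Fact]:
    """Return the facts that were valid on the given day (hypothetical helper)."""
    return [
        f for f in facts
        if f.valid_from <= when and (f.invalid_at is None or when < f.invalid_at)
    ]


facts = [
    Fact("EU server", "costs", "40", date(2026, 5, 21), invalid_at=date(2026, 6, 7)),
    Fact("EU server", "costs", "50 euro per month", date(2026, 6, 7)),
]

current = as_of(facts, date(2026, 6, 10))  # only the 50 euro fact
earlier = as_of(facts, date(2026, 6, 1))   # only the superseded fact
```

Nothing is deleted in either query: the old row is still in `facts`, it just falls outside the validity window for later dates.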
Why the facts endpoint is deterministic
Toolformer showed that language models can teach themselves to call
external tools through simple APIs. The lesson we take from it is that
models use tools well when the tool's behavior is predictable. So
GET /v1/facts is a filter and a sort (by subject, predicate family, and
validity window), not a semantic search. Same inputs, same outputs, every
time. That is also why facts reads are fast, typically under 50 ms: there
is no embedding lookup and no reranker in the hot path. Semantic retrieval
exists too (POST /v1/memories/search), but it is a separate endpoint with
separate semantics, and the agent knows which is which.
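As a rough illustration of what "a filter and a sort" means, the sketch below reads facts from an in-memory list. The row shape, the `read_facts` name, and the sort key are assumptions made for this example; the real endpoint's parameters and ordering are documented in the API reference.

```python
from typing import Optional


def read_facts(
    rows: list[dict],
    subject: Optional[str] = None,
    family: Optional[str] = None,
    include_invalidated: bool = False,
) -> list[dict]:
    """Deterministic read: exact-match filters, then a total order.

    Hypothetical helper. No embeddings and no scoring, so the same
    inputs always produce the same output.
    """
    selected = [
        r for r in rows
        if (subject is None or r["subject"] == subject)
        and (family is None or r["family"] == family)
        and (include_invalidated or r["invalid_at"] is None)
    ]
    # Newest valid_from first; id breaks ties so the order never depends on input order.
    return sorted(selected, key=lambda r: (r["valid_from"], r["id"]), reverse=True)


rows = [
    {"id": 1, "subject": "EU server", "family": "financial", "object": "40",
     "valid_from": "2026-05-21", "invalid_at": "2026-06-07"},
    {"id": 2, "subject": "EU server", "family": "financial", "object": "50 euro per month",
     "valid_from": "2026-06-07", "invalid_at": None},
]

active = read_facts(rows, subject="EU server")
history = read_facts(rows, subject="EU server", include_invalidated=True)
```

Because the sort key is a total order over the selected rows, shuffling `rows` does not change either result.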
Both papers, in one CLI call:
# deterministic read: same inputs, same Markdown, every time
› korely facts --entity "EU server" --include-invalidated
## Financial (2 facts)
| Fact | Valid from | Status |
|----------------------------------------|------------|-------------------------|
| EU server — costs — 50 euro per month | 2026-06-07 | active |
| ~~EU server — costs — 40~~ | 2026-05-21 | invalidated 2026-06-07 |
The bi-temporal chain (Graphiti) is visible in the output, and the read itself is a pure lookup (Toolformer). No generative model ever composes output on the read path. Your agent's own model does the reasoning, which is why read quotas are an order of magnitude more generous than write quotas. The facts read is available on every tier, including the free Hobby plan, gated only by your monthly query quota.
One stack, two consumers. The mechanisms above serve both the human-facing chat and the agent-facing REST API and SDK. We deliberately did not build a separate "agent memory" pipeline. Agents read the same graph, the same facts, the same temporal model that the end user sees and edits in the Memory Panel.
Related work we study
Two papers that don't map to a single mechanism but shape how we think about the stack:
| Paper | Why it matters here |
|---|---|
| HippoRAG 2 (arXiv 2502.14802) | Personalized PageRank over an entity graph as a retrieval primitive. The strongest published treatment of multi-hop neighbor ranking, the mechanism Korely uses for shared-entity traversal in the graph. See the graph |
| Sleep-time compute, Lin et al. 2025 (arXiv 2504.13171) | Consolidation work belongs off the critical path, when no one is waiting on a response. Korely follows the same principle: extraction, embedding and contradiction checking run at write time and in background maintenance, never while an agent waits on a read. See architecture |
Forgetting
Korely's temporal decay is scoring decay, never hard deletion. Older facts lose retrieval weight on a half-life schedule, but the rows stay in the graph. Erasure is a separate, user-initiated surface. When an end user deletes a fact from the Memory Panel, it is actually gone, but the ranking algorithm never silently destroys data on its own.
For the design spectrum here: MemoryBank (arXiv 2305.10250) introduced Ebbinghaus-inspired forgetting curves for agent memory, and more recent variants (FadeMem, FSFM) explore learned and selective forgetting policies. We sit at the conservative end of that spectrum on purpose: a stale fact that ranks too high is visible and correctable by the user; data that disappears silently is not.
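Scoring decay on a half-life schedule can be written as a single multiplier. The sketch below is an assumption-laden illustration: the function name `decayed_score` and the 30-day default half-life are invented for the example, and Korely's actual schedule is not specified on this page.

```python
def decayed_score(base_score: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Scale a retrieval score by exponential decay; the row itself is never touched.

    Hypothetical helper. The 30-day default is an illustrative value only.
    """
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    # After one half-life the weight is 0.5, after two it is 0.25, and so on.
    return base_score * 0.5 ** (max(age_days, 0.0) / half_life_days)


fresh = decayed_score(1.0, age_days=0)        # 1.0
one_half_life = decayed_score(1.0, age_days=30)   # 0.5
two_half_lives = decayed_score(1.0, age_days=60)  # 0.25
```

The weight approaches zero but never reaches it, so an old fact can still surface when nothing fresher competes with it.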
Benchmarks
We evaluate on benchmarks that probe the properties this page describes, rather than generic retrieval leaderboards.
The first is LongMemEval
(arXiv 2410.10813): 500
questions over timestamped multi-session haystacks. Its
temporal-reasoning subset is exactly what Layer 3 is built for. Korely
preserves event time (valid_from) rather than only
ingestion time, which is the property those questions probe.
The second is MemoryAgentBench (arXiv 2507.05257), which scores selective forgetting: updating or invalidating a memory when new information contradicts it. That is what Korely's two-stage invalidation does on every write, so it measures the contradiction pipeline directly.
Our publishing principle: numbers ship with methodology, including the question categories where the system loses, so you can judge the result instead of a headline score.
How the write path works
Every write-side mechanism in the paper table above runs exactly once: when something is written. The read path never invokes a generative model. Here is the full write pipeline, annotated by the paper it implements:
| Stage | What happens | Paper grounding |
|---|---|---|
| Embed | Content is embedded with asymmetric prefix routing: queries and documents carry different prefixes because a three-word question and a thousand-word note are not the same kind of object | DPR, Karpukhin et al. 2020 |
| Extract entities | Named entities are extracted by GLiNER, a compact bidirectional model that runs on EU infrastructure. No generative LLM call per document | GLiNER, Zaratiana et al. 2023 |
| Extract facts | (subject, predicate, object) triples are typed against a canonical vocabulary of 9 predicate families and stamped with valid_from | Graphiti / Zep, arXiv 2501.13956 |
| Contradiction check | Candidate conflict filter first; model-judged resolution second. Opposing fact gets invalid_at set, never deleted | ConflictBank, Su et al. 2024 |
| Persist | Vectors, graph, and facts committed in one transaction. Extraction, embedding, and contradiction checking happen here, off the read critical path | Sleep-time compute, Lin et al. 2025 |
Every subsequent read is a pure data retrieval. The intelligence ran once, at write time. For a detailed diagram of the two paths, see Architecture: reads without an LLM.
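The contradiction-check stage in the table above can be sketched as a cheap filter followed by a pluggable judge. Everything named here is an assumption for illustration: the `Fact` class, `candidate_conflicts`, `write_fact`, and the `judge` callable that stands in for the model-judged step. The candidate rule (same subject and predicate, different object) is one plausible filter, not a description of Korely's.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Callable, Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: date
    invalid_at: Optional[date] = None


def candidate_conflicts(store: list[Fact], new: Fact) -> list[Fact]:
    """Stage 1: cheap filter. Active facts with the same subject and predicate but a different object."""
    return [
        f for f in store
        if f.invalid_at is None
        and f.subject == new.subject
        and f.predicate == new.predicate
        and f.obj != new.obj
    ]


def write_fact(store: list[Fact], new: Fact, judge: Callable[[Fact, Fact], bool]) -> list[Fact]:
    """Stage 2: the judge confirms each candidate; confirmed conflicts get invalid_at set, never deleted."""
    confirmed = [f for f in candidate_conflicts(store, new) if judge(f, new)]
    updated = [replace(f, invalid_at=new.valid_from) if f in confirmed else f for f in store]
    return updated + [new]


def always_conflicts(old: Fact, new: Fact) -> bool:
    # Stub for the model-judged step; a real judge would compare meaning, not just return True.
    return True


store = [Fact("EU server", "costs", "40", date(2026, 5, 21))]
incoming = Fact("EU server", "costs", "50 euro per month", date(2026, 6, 7))
store = write_fact(store, incoming, judge=always_conflicts)
```

Running the filter first means the expensive judge only sees facts that could plausibly conflict, which is the point of splitting the check into two stages.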
How we evaluate: connecting claims to results
The paper table maps mechanisms to sources. The benchmark page maps those mechanisms to measurable outcomes on public datasets, using an open harness so you can reproduce the numbers independently.
Our evaluation principle: numbers ship with their method, including the axes where the system is weak, so you can judge a result rather than a headline score.
| What we claim | How it is tested | Where to read |
|---|---|---|
| Facts stay current after a change (bi-temporal model) | LongMemEval knowledge-update axis: returns the new value after a supersession, not the stale one. Korely score: 75% | Benchmark |
| Temporal reasoning over when things happened | LongMemEval temporal-reasoning axis: recency, before/after, point-in-time. Korely score: 48%. An open problem across the field | Benchmark |
| Contradiction detection at write time | MemoryAgentBench selective-forgetting score: tests whether the system invalidates a stale fact when a contradicting one arrives, which is exactly what the two-stage pipeline does | Benchmark |
| Single-session recall | LongMemEval single-session-user (82%) and single-session-assistant (92%) axes | Benchmark |
The harness is open: clone korely-benchmarks, supply a
kor_live_ key and a Groq judge key, and run the same
evaluation we did. Scores are self-reported from our own runs; the open
procedure is what makes them inspectable.
Going deeper
The concept pages walk through each mechanism with API-level detail:
- Temporal facts: the bi-temporal model, contradiction resolution, and point-in-time queries
- Memory model: the three layers, scoping (user_id/agent_id/run_id), and how reads compose
- The graph: typed entities, relationships, and how shared-entity traversal connects them
- Architecture: the read path versus the write path, and why reads are retrieval, not generation
Read a paper we should know about? We mean it. If there is published work on agent memory, temporal knowledge graphs, or forgetting policies that you think beats what's described here, open an issue on GitHub or email [email protected]. Several rows in the tables above started as exactly that kind of pointer.
See also
- Benchmark: LongMemEval per-axis scores, the open harness, and how to reproduce the evaluation
- Temporal facts: bi-temporal model, valid_from/invalid_at, the as_of parameter, contradiction pipeline in full
- Architecture: the write pipeline vs the read path, why reads carry no model cost
- The graph: GLiNER entity extraction, typed relationships, shared-entity traversal
- Memory model: three layers, scoping, how facts and memories compose in GET /context
- API reference: every endpoint with request and response shapes