Korely

Research & architecture

Korely's memory stack is paper-grounded. Every non-trivial mechanism in the product (how facts get extracted, how contradictions get resolved, how context gets ordered before it reaches the model) traces back to a specific published result. This page maps each mechanism to its source, so you can audit our design decisions instead of taking our word for them.

The framing we find most useful is the Forms, Functions, Dynamics taxonomy from Memory in the Age of AI Agents (arXiv 2512.13564). In that vocabulary, Korely's three layers decompose cleanly:

| Korely layer | Taxonomy position | What lives there |
|---|---|---|
| Layer 1: Memory store + entity graph | Factual / token-level memory | Notes in a managed cloud store (Postgres + pgvector, EU-hosted), auto-extracted entities, typed relationships |
| Layer 2: Session memory | Working memory | Per-conversation notes with evolving links (A-MEM pattern) |
| Layer 3: Typed facts | Factual memory with temporal dynamics | A canonical vocabulary of predicates across 9 families, bi-temporal validity, contradiction handling |

We didn't pick papers and build features around them. We hit problems while running Korely on our own work: stale facts winning over fresh ones, the model trusting its training data over retrieved sources, contradiction spam in the graph. Then we went looking for who had already solved them.

What we implement

Each entry: what the paper proposes, what Korely implements, and the concept page that documents the mechanism in API-level detail.

| Paper | What it proposes | In Korely |
|---|---|---|
| Graphiti / Zep (arXiv 2501.13956) | A bi-temporal knowledge graph for agent memory: when something became true and when the system learned it are separate axes | Every fact carries valid_from and invalid_at, so point-in-time queries stay correct. See temporal facts |
| ConflictBank, Su et al. 2024 (arXiv 2408.12076, NeurIPS 2024 D&B) | A systematic taxonomy of knowledge conflicts, including temporal contradictions between sources | Two-stage contradiction detection at write time: cheap candidate filter first, model-judged resolution second. See temporal facts |
| A-MEM, Xu et al. 2025 (arXiv 2502.12110) | Agentic memory as structured notes whose links evolve as the conversation progresses | Layer 2 session memory: structured note construction plus link evolution, scoped to run_id. See memory model |
| Mem0, Chhikara et al. 2025 (arXiv 2504.19413) | Extracted, structured memories outperform replaying full conversation history into the context window (the §3.2 result) | A dedicated facts surface (GET /v1/facts), separate from document search. See memory model |
| SELF-RAG, Asai et al. 2023 (arXiv 2310.11511) | Critique tokens: the model verifies that a claim is supported by a retrieved passage before asserting it | Grounding rules in chat: a claim must be backed by a retrieved source before it earns a citation. See architecture |
| Lost in the Middle, Liu et al. 2024 (arXiv 2307.03172) | Models attend most to the beginning and end of the context window, and lose information buried in the middle | Context ordering: the most decision-relevant sources are placed where models actually attend, not mid-prompt. See architecture |
| Toolformer, Schick et al. 2023 (arXiv 2302.04761) | Models use tools well when the tool's behavior is predictable | GET /v1/facts is a deterministic filter and sort, not a semantic search. Details below and in the API reference |
| DPR, Karpukhin et al. 2020 (arXiv 2004.04906) | Dense retrieval with asymmetric encoders for queries and passages | Queries and documents are embedded with different prefixes, because a three-word question and a thousand-word note are not the same kind of object. See architecture |
| FrugalGPT, Chen, Zaharia, Zou 2023 (arXiv 2305.05176) | A model cascade: route easy steps to small models, reserve strong models for the hard step | Classification and query rewriting run on small models; only the final synthesis step in chat sees a large one. Reads never invoke a generative model at all. See architecture |
| GLiNER, Zaratiana et al. 2023 (arXiv 2311.08526) | Zero-shot named entity recognition with a compact bidirectional model instead of a generative LLM | Entity extraction runs on our own infrastructure as part of the write path; the entity graph builds itself without per-note generative calls. See the graph |
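The context-ordering row (Lost in the Middle) can be illustrated with a short sketch. This is one common way to keep strong sources out of the middle of a prompt, not a description of Korely's actual placement policy, which this page does not specify. The function name `order_for_context` is hypothetical, and the input is assumed to be already ranked best-first.

```python
def order_for_context(ranked):
    """Place the strongest sources at the edges of the prompt, the weakest in the middle.

    `ranked` is a list sorted best-first. Even positions fill the front,
    odd positions fill the back, and the back half is reversed so the
    second-best source ends up last.
    """
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


# Illustrative source ids, ranked from most to least relevant.
ordered = order_for_context(["s1", "s2", "s3", "s4", "s5"])
print(ordered)  # ['s1', 's3', 's5', 's4', 's2']
```

The best source opens the context, the second best closes it, and the least relevant one sits in the middle, where the paper reports that models attend least.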

Bi-temporal facts, concretely

The Graphiti insight is that "when something became true" and "when we learned about it" are different timestamps, and conflating them corrupts point-in-time reasoning. A real example from our own store: the fact "EU server costs 40" was superseded on 2026-06-07 by "EU server costs 50 euro per month". The old fact wasn't deleted. Its invalid_at was set, so an agent asking "what does hosting cost now?" gets the current answer, while the history stays queryable. The full mechanics are on the temporal facts page.
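A minimal Python sketch of that supersession, under stated assumptions: the `Fact` class and the `supersede` and `facts_as_of` helpers are hypothetical names for illustration, not Korely's implementation or SDK. Only the `valid_from` and `invalid_at` field names come from this page, and the sketch models the validity axis alone. The 2026-05-21 start date of the old fact is taken from the CLI output further down this page.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: date
    invalid_at: Optional[date] = None  # None means the fact is still active


def supersede(facts: list[Fact], new: Fact) -> list[Fact]:
    """Close the active fact in the same subject/predicate slot, then append the new one."""
    updated = []
    for f in facts:
        same_slot = (f.subject, f.predicate) == (new.subject, new.predicate)
        if same_slot and f.invalid_at is None:
            f = replace(f, invalid_at=new.valid_from)  # invalidate, never delete
        updated.append(f)
    updated.append(new)
    return updated


def facts_as_of(facts: list[Fact], when: date) -> list[Fact]:
    """Return the facts whose validity window contains the given day."""
    return [
        f for f in facts
        if f.valid_from <= when and (f.invalid_at is None or when < f.invalid_at)
    ]


store = [Fact("EU server", "costs", "40", date(2026, 5, 21))]
store = supersede(store, Fact("EU server", "costs", "50 euro per month", date(2026, 6, 7)))

current = facts_as_of(store, date(2026, 6, 10))  # after the change
earlier = facts_as_of(store, date(2026, 6, 1))   # before the change
```

Both rows remain in `store`. The query date alone decides which one is returned, which is what keeps point-in-time questions answerable after an update.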

Why the facts endpoint is deterministic

The lesson we take from Toolformer is that models use tools well when the tool's behavior is predictable. So GET /v1/facts is a filter and a sort (by subject, predicate family, and validity window), not a semantic search. Same inputs, same outputs, every time. That is also why facts reads are fast, typically under 50 ms: there is no embedding lookup and no reranker in the hot path. Semantic retrieval exists too (POST /v1/memories/search), but it is a separate endpoint with separate semantics, and the agent knows which is which.
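To make "a filter and a sort" concrete, here is a small sketch of a deterministic read. Everything in it is an assumption for illustration: `FactRow`, `read_facts`, the parameter names and the newest-first ordering are hypothetical and do not describe the real request parameters or sort order of GET /v1/facts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FactRow:
    subject: str
    family: str                      # predicate family, e.g. "financial"
    predicate: str
    obj: str
    valid_from: str                  # ISO date, so string order is chronological order
    invalid_at: Optional[str] = None


def read_facts(rows, subject=None, family=None, include_invalidated=False):
    """Pure filter and sort: no embeddings, no reranker, no randomness."""
    selected = [
        r for r in rows
        if (subject is None or r.subject == subject)
        and (family is None or r.family == family)
        and (include_invalidated or r.invalid_at is None)
    ]
    # Sorting on a key that is unique per row makes the result independent of storage order.
    return sorted(
        selected,
        key=lambda r: (r.valid_from, r.subject, r.predicate, r.obj),
        reverse=True,  # newest first
    )


rows = [
    FactRow("EU server", "financial", "costs", "40", "2026-05-21", "2026-06-07"),
    FactRow("EU server", "financial", "costs", "50 euro per month", "2026-06-07"),
]
first = read_facts(rows, subject="EU server", include_invalidated=True)
again = read_facts(list(reversed(rows)), subject="EU server", include_invalidated=True)
```

`first` and `again` are identical even though the input rows arrive in a different order. That is the property an agent can rely on when it plans around the tool.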

Both papers, in one CLI call:

# deterministic read: same inputs, same Markdown, every time
korely facts --entity "EU server" --include-invalidated

## Financial (2 facts)

| Fact                                   | Valid from | Status                  |
|----------------------------------------|------------|-------------------------|
| EU server — costs — 50 euro per month  | 2026-06-07 | active                  |
| ~~EU server — costs — 40~~             | 2026-05-21 | invalidated 2026-06-07  |

The bi-temporal chain (Graphiti) is visible in the output, and the read itself is a pure lookup (Toolformer). No generative model ever composes output on the read path. Your agent's own model does the reasoning, which is why read quotas are an order of magnitude more generous than write quotas. The facts read is available on every tier, including the free Hobby plan, gated only by your monthly query quota.

One stack, two consumers. The mechanisms above serve both the human-facing chat and the agent-facing REST API and SDK. We deliberately did not build a separate "agent memory" pipeline. Agents read the same graph, the same facts, the same temporal model that the end user sees and edits in the Memory Panel.

Related work we study

Two papers that don't map to a single mechanism but shape how we think about the stack:

| Paper | Why it matters here |
|---|---|
| HippoRAG 2 (arXiv 2502.14802) | Personalized PageRank over an entity graph as a retrieval primitive. The strongest published treatment of multi-hop neighbor ranking, the mechanism Korely uses for shared-entity traversal in the graph. See the graph |
| Sleep-time compute, Lin et al. 2025 (arXiv 2504.13171) | Consolidation work belongs off the critical path, when no one is waiting on a response. Korely follows the same principle: extraction, embedding and contradiction checking run at write time and in background maintenance, never while an agent waits on a read. See architecture |
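For readers unfamiliar with Personalized PageRank, the sketch below shows the textbook power-iteration form on a toy graph. It illustrates the primitive named in the HippoRAG 2 row only; it is not HippoRAG 2's pipeline and not Korely's traversal code. The function name, the damping value of 0.85, the iteration count and the four-node graph are all assumptions chosen for illustration.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank whose restart mass returns to the seed nodes.

    graph: dict mapping node -> list of neighbor nodes (every neighbor must be a key).
    seeds: nodes the query mentions; restart probability is spread evenly over them.
    """
    seeds = set(seeds)
    if not seeds or not seeds.issubset(graph):
        raise ValueError("seeds must be a non-empty subset of the graph's nodes")

    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in graph}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for nb in neighbors:
                    nxt[nb] += share
            else:
                # Dangling node: return its mass to the seeds so the total stays 1.
                for n in graph:
                    nxt[n] += damping * rank[node] * restart[n]
        rank = nxt
    return rank


# Toy entity graph: a - b - c form a chain, d is disconnected.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
scores = personalized_pagerank(graph, seeds=["a"])
```

Nodes reachable from the seed receive a share of the score that falls off with distance, and the disconnected node `d` receives none. That is the sense in which the method ranks multi-hop neighbors.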

Forgetting

Korely's temporal decay is scoring decay, never hard deletion. Older facts lose retrieval weight on a half-life schedule, but the rows stay in the graph. Erasure is a separate, user-initiated surface. When an end user deletes a fact from the Memory Panel, it is actually gone, but the ranking algorithm never silently destroys data on its own.
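A half-life schedule can be written as score = base × 0.5^(age / half-life). The sketch below applies it as a ranking weight only. The names `ScoredFact`, `decayed_score` and `rank`, the 30-day half-life and the ages are assumptions for illustration; this page does not state Korely's actual half-life or scoring formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredFact:
    text: str
    base_score: float  # relevance before decay
    age_days: float


def decayed_score(base_score: float, age_days: float, half_life_days: float) -> float:
    """Exponential decay: the score halves every `half_life_days`."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return base_score * 0.5 ** (age_days / half_life_days)


def rank(facts, half_life_days):
    """Sort by decayed score. Nothing is filtered out: old rows are demoted, not deleted."""
    return sorted(
        facts,
        key=lambda f: decayed_score(f.base_score, f.age_days, half_life_days),
        reverse=True,
    )


facts = [
    ScoredFact("EU server costs 40", base_score=1.0, age_days=60),
    ScoredFact("EU server costs 50 euro per month", base_score=1.0, age_days=0),
]
ranked = rank(facts, half_life_days=30)  # 30 days is an arbitrary illustrative value
```

After two half-lives the older fact scores 0.25 against 1.0 for the fresh one, so it drops in the ranking but stays in the returned list.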

For the design spectrum here: MemoryBank (arXiv 2305.10250) introduced Ebbinghaus-inspired forgetting curves for agent memory, and more recent variants (FadeMem, FSFM) explore learned and selective forgetting policies. We sit at the conservative end of that spectrum on purpose: a stale fact that ranks too high is visible and correctable by the user; data that disappears silently is not.

Benchmarks

We evaluate on benchmarks that probe the properties this page describes, rather than generic retrieval leaderboards.

The first is LongMemEval (arXiv 2410.10813): 500 questions over timestamped multi-session haystacks. Its temporal-reasoning subset is exactly what Layer 3 is built for. Korely preserves event time (valid_from) rather than only ingestion time, which is the property those questions probe.

The second is MemoryAgentBench (arXiv 2507.05257), which scores selective forgetting: updating or invalidating a memory when new information contradicts it. That is what Korely's two-stage invalidation does on every write, so it measures the contradiction pipeline directly.

Our publishing principle: numbers ship with methodology, including the question categories where the system loses, so you can judge the result instead of a headline score.

How the write path works

The write-side mechanisms in the paper table above (embedding, entity extraction, fact extraction, contradiction checking) run exactly once: when something is written. The read path never invokes a generative model. Here is the full write pipeline, annotated by the paper it implements:

| Stage | What happens | Paper grounding |
|---|---|---|
| Embed | Content is embedded with asymmetric prefix routing: queries and documents carry different prefixes because a three-word question and a thousand-word note are not the same kind of object | DPR, Karpukhin et al. 2020 |
| Extract entities | Named entities are extracted by GLiNER, a compact bidirectional model that runs on EU infrastructure. No generative LLM call per document | GLiNER, Zaratiana et al. 2023 |
| Extract facts | (subject, predicate, object) triples are typed against a canonical vocabulary of 9 predicate families and stamped with valid_from | Graphiti / Zep, arXiv 2501.13956 |
| Contradiction check | Candidate conflict filter first; model-judged resolution second. Opposing fact gets invalid_at set, never deleted | ConflictBank, Su et al. 2024 |
| Persist | Vectors, graph, and facts committed in one transaction. Extraction, embedding, and contradiction checking happen here, off the read critical path | Sleep-time compute, Lin et al. 2025 |

Every subsequent read is a pure data retrieval. The intelligence ran once, at write time. For a detailed diagram of the two paths, see Architecture: reads without an LLM.
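The contradiction-check stage in the table above can be sketched as a two-step function. This is a hedged illustration of the shape described on this page (cheap candidate filter, then a judged decision, then invalidation instead of deletion), not Korely's code. `Fact`, `candidates`, `write_fact` and `stub_judge` are hypothetical names, the `hosted_in` row is made-up sample data, and the judge here is a trivial stand-in for the model-judged step.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: str                  # ISO date
    invalid_at: Optional[str] = None


def candidates(store, new):
    """Stage 1, cheap filter: active facts in the same slot that state a different object."""
    return [
        f for f in store
        if f.invalid_at is None
        and (f.subject, f.predicate) == (new.subject, new.predicate)
        and f.obj != new.obj
    ]


def write_fact(store, new, judge: Callable[[Fact, Fact], bool]):
    """Stage 2: consult the judge only for stage-1 candidates, then invalidate, never delete."""
    conflicts = {f for f in candidates(store, new) if judge(f, new)}
    closed = [
        replace(f, invalid_at=new.valid_from) if f in conflicts else f
        for f in store
    ]
    return closed + [new]


judge_calls = []


def stub_judge(old: Fact, new: Fact) -> bool:
    """Stand-in for the model-judged step: records the call and always reports a conflict."""
    judge_calls.append((old, new))
    return True


store = [
    Fact("EU server", "costs", "40", "2026-05-21"),
    Fact("EU server", "hosted_in", "EU", "2026-05-21"),  # unrelated slot, never judged
]
store = write_fact(store, Fact("EU server", "costs", "50 euro per month", "2026-06-07"), stub_judge)
```

Only one of the two stored facts reaches the judge, which is the point of the first stage: the expensive step runs on a short candidate list, not on the whole store.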

How we evaluate: connecting claims to results

The paper table maps mechanisms to sources. The benchmark page maps those mechanisms to measurable outcomes on public datasets, using an open harness so you can reproduce the numbers independently.

Our evaluation principle: numbers ship with their method, including the axes where the system is weak, so you can judge a result rather than a headline score.

| What we claim | How it is tested | Where to read |
|---|---|---|
| Facts stay current after a change (bi-temporal model) | LongMemEval knowledge-update axis: returns the new value after a supersession, not the stale one. Korely score: 75% | Benchmark |
| Temporal reasoning over when things happened | LongMemEval temporal-reasoning axis: recency, before/after, point-in-time. Korely score: 48%. An open problem across the field | Benchmark |
| Contradiction detection at write time | MemoryAgentBench selective-forgetting score: tests whether the system invalidates a stale fact when a contradicting one arrives, which is exactly what the two-stage pipeline does | Benchmark |
| Single-session recall | LongMemEval single-session-user (82%) and single-session-assistant (92%) axes | Benchmark |

The harness is open: clone korely-benchmarks, supply a kor_live_ key and a Groq judge key, and run the same evaluation we did. Scores are self-reported from our own runs; the open procedure is what makes them inspectable.

Going deeper

The concept pages walk through each mechanism with API-level detail:

  • Temporal facts: the bi-temporal model, contradiction resolution, and point-in-time queries
  • Memory model: the three layers, scoping (user_id / agent_id / run_id), and how reads compose
  • The graph: typed entities, relationships, and how shared-entity traversal connects them
  • Architecture: the read path versus the write path, and why reads are retrieval, not generation

Read a paper we should know about? We mean it. If there is published work on agent memory, temporal knowledge graphs, or forgetting policies that you think beats what's described here, open an issue on GitHub or email [email protected]. Several rows in the tables above started as exactly that kind of pointer.


See also

  • Benchmark — LongMemEval per-axis scores, the open harness, and how to reproduce the evaluation
  • Temporal facts — bi-temporal model, valid_from / invalid_at, the as_of parameter, contradiction pipeline in full
  • Architecture — the write pipeline vs the read path, why reads carry no model cost
  • The graph — GLiNER entity extraction, typed relationships, shared-entity traversal
  • Memory model — three layers, scoping, how facts and memories compose in GET /context
  • API reference — every endpoint with request and response shapes