How we measured Korely's memory — and how you can check it yourself
We ran Korely against LongMemEval, the public benchmark for long-term agent memory. This page documents the dataset, the harness, our per-axis scores in plain language, and the open repo so you can reproduce it. Where we have gaps, we say so.
The dataset: LongMemEval
LongMemEval (Wu et al., 2024 — arXiv:2410.10813, ICLR 2025) is a public benchmark built to stress-test the long-term memory of chat assistants. It constructs long, multi-session conversation histories and then asks questions that can only be answered correctly if the system genuinely remembered — and correctly reasoned over — facts spread across that history.
It groups questions into six ability axes. We report all six. We lead with the two where memory most often fails in production:
| Axis | In plain language | The failure it catches |
|---|---|---|
| knowledge-update | A fact changed; use the new value. | Telling a Pro customer they're on Free. |
| temporal-reasoning | Reason about when things happened. | "Latest invoice" returns last year's. |
| multi-session | Join facts across many past chats. | Re-asking what the user already answered. |
| single-session-user | Recall what the user said earlier this chat. | Forgetting a constraint stated minutes ago. |
| single-session-assistant | Recall what the agent itself said. | "Do what you suggested" confuses it. |
| single-session-preference | Honor a stated style/format preference. | Drifting back to long answers after "be brief." |
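As a minimal sketch of how questions can be bucketed into the six axes above, the snippet below counts records by an axis label. The `question_type` field name and the sample records are assumptions for illustration, not taken from the harness; check them against the dataset file you actually download.

```python
# The six ability axes named in the table above.
AXES = [
    "knowledge-update",
    "temporal-reasoning",
    "multi-session",
    "single-session-user",
    "single-session-assistant",
    "single-session-preference",
]


def count_by_axis(questions):
    """Count questions per axis and reject any label outside the six axes."""
    counts = {axis: 0 for axis in AXES}
    for q in questions:
        axis = q["question_type"]  # assumed field name
        if axis not in counts:
            raise ValueError(f"unknown axis: {axis!r}")
        counts[axis] += 1
    return counts


# Hypothetical sample records, for illustration only.
sample = [
    {"question_type": "knowledge-update", "question": "Which plan am I on now?"},
    {"question_type": "knowledge-update", "question": "What is my current address?"},
    {"question_type": "temporal-reasoning", "question": "Which invoice is the latest?"},
]

axis_counts = count_by_axis(sample)
```

Rejecting unknown labels early keeps a mislabeled question from silently disappearing from the per-axis totals.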
A fair test: only the memory changes
A benchmark is only meaningful if the only variable is the thing being measured. Our harness holds everything else constant:
- Same questions. Every system answers the identical LongMemEval question set.
- Same reader model. The model that reads the retrieved memory and writes the final answer is the same across runs. Memory quality is what differs — not the LLM doing the talking.
- Same neutral judge. Answers are graded by Llama-3.3-70B, a third-party open model with no stake in the result, served through Groq. We do not grade our own answers with our own model.
- Only the memory differs. The single variable is which memory system supplies the context.
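The fixed-variable setup above can be sketched as a loop in which only the retrieval function is swapped between runs. This is an illustrative sketch, not the code in the repo: `run_benchmark`, the record fields, and the stub reader and judge are all hypothetical stand-ins for the real reader LLM and the Groq-hosted judge.

```python
from collections import defaultdict


def run_benchmark(questions, retrieve, reader, judge):
    """Return per-axis accuracy. Only `retrieve` changes between systems."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        context = retrieve(q["question"])                # memory system under test
        answer = reader(q["question"], context)          # same reader for every run
        ok = judge(q["question"], q["answer"], answer)   # same judge for every run
        total[q["axis"]] += 1
        correct[q["axis"]] += int(ok)
    return {axis: correct[axis] / total[axis] for axis in total}


# Stub components (hypothetical): a real run would call an LLM here.
def stub_reader(question, context):
    return context if context else "I don't know."


def stub_judge(question, gold, answer):
    return gold.lower() in answer.lower()


questions = [
    {"axis": "knowledge-update", "question": "Which plan is the customer on?", "answer": "Pro"},
    {"axis": "temporal-reasoning", "question": "Which month is the latest invoice from?", "answer": "March"},
]

FACTS = {
    "Which plan is the customer on?": "The customer is on Pro.",
    "Which month is the latest invoice from?": "The latest invoice is from March.",
}


def good_memory(question):
    return FACTS.get(question, "")


def empty_memory(question):
    return ""


scores_good = run_benchmark(questions, good_memory, stub_reader, stub_judge)
scores_empty = run_benchmark(questions, empty_memory, stub_reader, stub_judge)
```

Because the reader and judge are passed in once and reused, any difference between `scores_good` and `scores_empty` can only come from the memory system.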
The pipeline, end to end:
LongMemEval questions
-> [ memory system under test ] (the one variable)
-> same reader LLM (writes the answer)
-> neutral Groq judge (scores vs the gold answer)
-> score
Our results
These results are self-reported: they come from our own runs of the harness above on the LongMemEval oracle corpus (each question's evidence sessions only, with no distractor haystack). The two axes we lead with are where the field is weakest and Korely is strongest.
| Axis | What it tests | Korely |
|---|---|---|
| knowledge-update | Returns the current fact after a change | 75% |
| temporal-reasoning | Reasons about when (recency, before/after) | 48% |
| single-session-assistant | Recalls what the agent itself said | 92% |
| single-session-user | Recalls what the user said this session | 82% |
| multi-session | Joins facts across many sessions | 32% |
| single-session-preference | Honors stated formatting preferences | 20% |
We don't cherry-pick. Korely is exceptional where it matters most for production agents — keeping facts current and time-aware — and competitive on plain recall. The two lower axes (multi-session join, in-session preference adherence) are where we're actively improving. We publish them rather than hide them.
Why we win on the first two: every Korely fact is a typed
(subject, predicate, object) triple with a valid_from and an
invalid_at. When a fact changes, the new one supersedes the old
— the old is marked invalid, not deleted — so a query returns the
current truth, and you can still ask what was true on any past date with
as_of.
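To make the supersession idea concrete, here is a minimal in-memory sketch of typed triples with `valid_from` and `invalid_at`, plus an `as_of` lookup. It is not Korely's implementation or SDK: the `Fact` and `FactStore` classes and their method names are hypothetical, and the sketch assumes facts are asserted in chronological order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    object: str
    valid_from: date
    invalid_at: Optional[date] = None  # None means the fact is still current


class FactStore:
    """Hypothetical store; assumes facts arrive in chronological order."""

    def __init__(self):
        self._facts = []

    def assert_fact(self, subject, predicate, obj, valid_from):
        # Supersede the current fact for this (subject, predicate):
        # it is marked invalid, not deleted.
        for f in self._facts:
            if (f.subject, f.predicate) == (subject, predicate) and f.invalid_at is None:
                f.invalid_at = valid_from
        self._facts.append(Fact(subject, predicate, obj, valid_from))

    def query(self, subject, predicate, as_of=None):
        """Return the current object, or the one valid on `as_of` if given."""
        for f in self._facts:
            if (f.subject, f.predicate) != (subject, predicate):
                continue
            if as_of is None:
                if f.invalid_at is None:
                    return f.object
            elif f.valid_from <= as_of and (f.invalid_at is None or as_of < f.invalid_at):
                return f.object
        return None


store = FactStore()
store.assert_fact("customer:42", "plan", "Free", date(2024, 1, 1))
store.assert_fact("customer:42", "plan", "Pro", date(2024, 6, 1))

current_plan = store.query("customer:42", "plan")
plan_in_march = store.query("customer:42", "plan", as_of=date(2024, 3, 1))
```

Keeping the superseded row is what lets the same store answer both "what is true now" and "what was true then" without a separate history table.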
Reproduce it yourself
The harness is open. Clone it, plug in your own API key, and run the same evaluation we did.
- Repo: korely-benchmarks, containing the harness runner, the reader and Groq-judge prompt templates, the axis mapping, and a one-command run script.
- What you bring: a kor_live_ key (from early access) and a Groq API key for the judge.
git clone https://github.com/korely/korely-benchmarks
cd korely-benchmarks
export KORELY_API_KEY="kor_live_..."
export GROQ_API_KEY="gsk_..."
python run.py --benchmark longmemeval
What's published and what's coming
We believe numbers should ship with their method, and that includes admitting the limits of what we've published so far.
What we have today: aggregate per-axis scores, produced by the harness above, with the dataset, reader model, and neutral judge all fixed. The harness is open-source, so the procedure is fully inspectable.
What we don't have published yet: a per-question transcript — every individual LongMemEval question, the exact context Korely retrieved, the answer the reader produced, and the judge's verdict, item by item. We currently store the aggregate scores, not the saved per-question predictions.
What a full transparency release needs: re-running the harness
with prediction logging on, then publishing the complete
question → retrieved-context → answer → judge-verdict
record as a downloadable JSONL alongside the scores. That's on the roadmap; when
it ships, this page will link the raw file so anyone can audit every item, not
just the totals.
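One way such a per-question record could be laid out is sketched below, with one JSON object per line. The field names (`question_id`, `retrieved_context`, `judge_verdict`, and so on) and the sample values are hypothetical; the eventual published file may use a different schema.

```python
import json
import os
import tempfile


def write_predictions(records, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_predictions(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical record covering question, retrieved context, answer, and verdict.
record = {
    "question_id": "example-001",
    "axis": "knowledge-update",
    "question": "Which plan is the customer on?",
    "retrieved_context": ["The customer upgraded from Free to Pro."],
    "answer": "The customer is on Pro.",
    "gold_answer": "Pro",
    "judge_verdict": True,
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "predictions.jsonl")
    write_predictions([record], path)
    loaded = read_predictions(path)
```

A line-per-item file like this can be filtered by axis or re-scored with a different judge without re-running retrieval.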
Caveats
- These results use the oracle corpus (only each question's evidence sessions). The full-distractor "haystack" corpus is a harder, separate test that we plan to run next.
- Scores are an internal, directional signal from our own runs. The harness is open precisely so you don't have to take them on faith.
Want a key to run this yourself? Join early access.
Reading the docs first? The korely-memory SDK reference.