How we measured Korely's memory — and how you can check it yourself

We ran Korely against LongMemEval, the public benchmark for long-term agent memory. This page documents the dataset, the harness, our per-axis scores in plain language, and the open repo so you can reproduce it. Where we have gaps, we say so.

The dataset: LongMemEval

LongMemEval (Wu et al., 2024 — arXiv:2410.10813, ICLR 2025) is a public benchmark built to stress-test the long-term memory of chat assistants. It constructs long, multi-session conversation histories and then asks questions that can only be answered correctly if the system genuinely remembered — and correctly reasoned over — facts spread across that history.

Its questions fall into six types, which we treat as ability axes. We report all six, leading with the two where memory most often fails in production:

Axis	In plain language	The failure it catches
knowledge-update	A fact changed; use the new value.	Telling a Pro customer they're on Free.
temporal-reasoning	Reason about when things happened.	"Latest invoice" returns last year's.
multi-session	Join facts across many past chats.	Re-asking what the user already answered.
single-session-user	Recall what the user said earlier in this chat.	Forgetting a constraint stated minutes ago.
single-session-assistant	Recall what the agent itself said.	"Do what you suggested" confuses it.
single-session-preference	Honor a stated style/format preference.	Drifting back to long answers after "be brief."

A fair test: only the memory changes

A benchmark is only meaningful if the only variable is the thing being measured. Our harness holds everything else constant:

Same questions. Every system answers the identical LongMemEval question set.
Same reader model. The model that reads the retrieved memory and writes the final answer is the same across runs. Memory quality is what differs — not the LLM doing the talking.
Same neutral judge. Answers are graded by Llama-3.3-70B served on Groq, a third-party open-weights model with no stake in the result. We do not grade our own answers with our own model.
Only the memory differs. The single variable is which memory system supplies the context.

The pipeline, end to end:

LongMemEval questions
   -> [ memory system under test ]   (the one variable)
   -> same reader LLM               (writes the answer)
   -> neutral Groq judge            (scores vs the gold answer)
   -> score
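The loop behind that diagram can be sketched in a few lines. This is a minimal illustration, not the code in korely-benchmarks: the names Question, MemorySystem, evaluate, DictMemory, toy_reader, and toy_judge are all hypothetical, and the toy reader and judge stand in for the real LLM calls. The point it shows is that the reader and judge are passed in once and held fixed, while the memory argument is the only thing swapped between runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Protocol


@dataclass(frozen=True)
class Question:
    question_id: str
    axis: str          # e.g. "knowledge-update"
    text: str
    gold_answer: str


class MemorySystem(Protocol):
    """Anything that can supply context for a question (the one variable)."""

    def retrieve(self, question: Question) -> str: ...


Reader = Callable[[str, str], str]        # (question, context) -> answer
Judge = Callable[[str, str, str], bool]   # (question, gold, answer) -> correct?


def evaluate(questions: Iterable[Question], memory: MemorySystem,
             reader: Reader, judge: Judge) -> Dict[str, float]:
    """Return accuracy per axis. Only `memory` should differ between runs."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for q in questions:
        context = memory.retrieve(q)            # memory system under test
        answer = reader(q.text, context)        # same reader for every system
        ok = judge(q.text, q.gold_answer, answer)  # same judge for every system
        total[q.axis] = total.get(q.axis, 0) + 1
        correct[q.axis] = correct.get(q.axis, 0) + int(ok)
    return {axis: correct[axis] / total[axis] for axis in total}


# Toy stand-ins so the sketch runs without any API keys (hypothetical).
class DictMemory:
    def __init__(self, notes: Dict[str, str]):
        self.notes = notes

    def retrieve(self, question: Question) -> str:
        return self.notes.get(question.question_id, "")


def toy_reader(question: str, context: str) -> str:
    return context  # a real reader would prompt an LLM with the context


def toy_judge(question: str, gold: str, answer: str) -> bool:
    return gold.lower() in answer.lower()  # a real judge would be an LLM call
```

To compare two memory systems under this sketch, call evaluate twice with the same questions, reader, and judge, changing only the memory argument.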

Our results

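These scores are self-reported, from our own runs of the harness above on the LongMemEval oracle corpus (each question's evidence sessions only, with no distractor haystack). We lead with knowledge-update and temporal-reasoning because those failures hurt most in production; Korely's highest raw scores are on the two single-session recall axes.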

Axis	What it tests	Korely
knowledge-update	Returns the current fact after a change	75%
temporal-reasoning	Reasons about when (recency, before/after)	48%
single-session-assistant	Recalls what the agent itself said	92%
single-session-user	Recalls what the user said this session	82%
multi-session	Joins facts across many sessions	32%
single-session-preference	Honors stated formatting preferences	20%

We don't cherry-pick. Korely scores highest on plain recall (92% and 82% on the single-session recall axes) and reaches 75% on knowledge-update, the axis about keeping facts current. Temporal reasoning sits at 48%, and the two lowest axes (multi-session join at 32%, in-session preference adherence at 20%) are where we're actively improving. We publish them rather than hide them.

Why the design targets the first two: every Korely fact is a typed (subject, predicate, object) triple with a valid_from and an invalid_at. When a fact changes, the new one supersedes the old. The old fact is marked invalid, not deleted, so a query returns the current truth, and you can still ask what was true on any past date with as_of.
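The supersede-and-query behavior described above can be illustrated with a small in-memory store. This is a sketch under stated assumptions, not Korely's implementation or SDK: the Fact and FactStore classes and the assert_fact and query methods are hypothetical names, only valid_from, invalid_at, and as_of come from the text, and facts are assumed to arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: date
    invalid_at: Optional[date] = None  # None means the fact is still current


class FactStore:
    """Hypothetical store: old facts are marked invalid, never deleted."""

    def __init__(self):
        self._facts = []

    def assert_fact(self, subject, predicate, obj, valid_from):
        # Supersede the current fact for this (subject, predicate), if any.
        for fact in self._facts:
            same_key = (fact.subject, fact.predicate) == (subject, predicate)
            if same_key and fact.invalid_at is None:
                fact.invalid_at = valid_from
        self._facts.append(Fact(subject, predicate, obj, valid_from))

    def query(self, subject, predicate, as_of=None):
        """Return the object valid on `as_of` (default: the current fact)."""
        for fact in self._facts:
            if (fact.subject, fact.predicate) != (subject, predicate):
                continue
            if as_of is None:
                if fact.invalid_at is None:
                    return fact.obj
            elif fact.valid_from <= as_of and (
                fact.invalid_at is None or as_of < fact.invalid_at
            ):
                return fact.obj
        return None


store = FactStore()
store.assert_fact("acme", "plan", "Free", date(2024, 1, 1))
store.assert_fact("acme", "plan", "Pro", date(2024, 6, 1))

current_plan = store.query("acme", "plan")                        # "Pro"
plan_in_march = store.query("acme", "plan", as_of=date(2024, 3, 1))  # "Free"
```

Because the Free fact is kept with an invalid_at rather than removed, the default query avoids the "Pro customer told they're on Free" failure while the historical question remains answerable.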

Reproduce it yourself

The harness is open. Clone it, plug in your own API key, and run the same evaluation we did.

Repo: korely-benchmarks — the harness runner, the reader and Groq-judge prompt templates, the axis mapping, and a one-command run script.
What you bring: a kor_live_ key (from early access) and a Groq API key for the judge.

git clone https://github.com/korely/korely-benchmarks
cd korely-benchmarks
export KORELY_API_KEY="kor_live_..."
export GROQ_API_KEY="gsk_..."
python run.py --benchmark longmemeval

What's published — and what's coming

We believe numbers should ship with their method, and that includes admitting the limits of what we've published so far.

What we have today: aggregate per-axis scores, produced by the harness above, with the dataset, reader model, and neutral judge all fixed. The harness is open-source, so the procedure is fully inspectable.

What we haven't published yet: a per-question transcript covering every individual LongMemEval question, the exact context Korely retrieved, the answer the reader produced, and the judge's verdict, item by item. We currently store only the aggregate scores, not the per-question predictions.

What a full transparency release needs: re-running the harness with prediction logging on, then publishing the complete question → retrieved-context → answer → judge-verdict record as a downloadable JSONL alongside the scores. That's on the roadmap; when it ships, this page will link the raw file so anyone can audit every item, not just the totals.
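For a sense of what such a file could look like, here is a minimal sketch of writing one record per question as JSONL and recomputing the per-axis scores from it. The field names (question_id, axis, question, retrieved_context, answer, gold_answer, judge_verdict) and both function names are assumptions for illustration; the eventual release may use a different schema.

```python
import json
from collections import defaultdict


def write_predictions(path, records):
    """Write one JSON object per line (JSONL), one line per question."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def per_axis_accuracy(path):
    """Recompute aggregate scores from the per-question log."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            total[record["axis"]] += 1
            correct[record["axis"]] += int(record["judge_verdict"])
    return {axis: correct[axis] / total[axis] for axis in total}


# Hypothetical example record; every value here is made up for illustration.
example_record = {
    "question_id": "example-001",
    "axis": "knowledge-update",
    "question": "Which plan is the customer on?",
    "retrieved_context": "plan changed from Free to Pro",
    "answer": "Pro",
    "gold_answer": "Pro",
    "judge_verdict": True,
}
```

With a log in this shape, anyone could check that the published totals match the item-level verdicts by running per_axis_accuracy on the downloaded file.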

Caveats

This is the oracle corpus (only each question's evidence sessions). The full-distractor "haystack" corpus is a harder, separate test that we plan to run next.
Scores are an internal, directional signal from our own runs. The harness is open precisely so you don't have to take them on faith.

Want a key to run this yourself? Join early access. Prefer to read the docs first? See the korely-memory SDK reference.