Architecture: reads without an LLM
Korely is a write-once, read-many memory store: all model
work runs exactly once when you call add(), and every
subsequent read is pure data retrieval with no generation step. The store
holds three layers (vectors, a typed entity graph, and bi-temporal
subject-predicate-object facts), all indexed at write time so that
get_context(), get_facts(), and
search() return structured data with no LLM in the loop.
The primary recall path is get_context(): it assembles the
active typed facts and the most relevant memories for a query into a
ready-to-prompt block, which is the differentiator over a plain vector
store. For an agent builder this means predictable latency, no
surprise model charges on the read path, and a read-to-write quota ratio
that matches how agents actually behave: many retrievals for every new
memory.
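To make "ready-to-prompt block" concrete, here is a minimal sketch of how an agent might flatten typed facts and memory snippets into prompt text. The function `build_context_block` and its layout are hypothetical, chosen for illustration only; the actual shape of what get_context() returns is not specified on this page.

```python
def build_context_block(facts, memories):
    """Render facts and memory snippets as plain text for a prompt.

    Hypothetical helper: the layout is an assumption, not Korely's format.
    """
    lines = ["Known facts:"]
    for subject, predicate, obj in facts:
        lines.append(f"- {subject} {predicate} {obj}")
    lines.append("Relevant memories:")
    for mem_id, snippet in memories:
        lines.append(f"- [{mem_id}] {snippet}")
    return "\n".join(lines)


# Sample data reused from the worked examples later on this page.
facts = [
    ("Dana Reyes", "works_at", "Acme"),
    ("Dana Reyes", "role_is", "VP Ops"),
]
memories = [("mem_b774e0", "Renewal window opens July 1.")]

block = build_context_block(facts, memories)

# The agent's own model receives the block plus its question.
prompt = f"{block}\n\nQuestion: Who is the champion for the Acme renewal?"
print(prompt)
```

Nothing in this step calls a model: the block is string assembly over data that already exists, which is the property the read path relies on.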
Korely never puts a model between your agent and its memory. No read pays for a generation, adds a second or two of latency, or returns prose that another model wrote. The principle is simple: reads are retrieval, not generation. Your agent already has a model. It does not need ours to read a database.
This page explains exactly what runs on the read path, what runs on the write path, and why that split makes read quotas an order of magnitude more generous than write quotas.
flowchart TD
A["your agent calls add<br/>POST /v1/memories"] --> B["chunk + embed<br/>document and chunks"]
A --> C["entity extraction<br/>into typed graph"]
A --> D["fact extraction<br/>two-stage contradiction check<br/>supersede, never delete<br/>valid_from / invalid_at stamped"]
B --> P[("persisted store<br/>vectors · graph · facts")]
C --> P
D --> P
P --> R1
P --> R2
P --> R3
subgraph READ ["READ PATH — no model"]
R1["get_context · get_facts<br/>typed facts assembled from store<br/>deterministic · zero AI"]
R2["search<br/>query embedding then<br/>vector similarity"]
R3["get / get_all / users / agents<br/>pure SQL<br/>zero AI"]
end
R1 --> OUT["structured JSON<br/>returned to your agent"]
R2 --> OUT
R3 --> OUT
style READ fill:#0f172a,stroke:#334155,color:#e2e8f0
style P fill:#1e293b,stroke:#475569,color:#e2e8f0
style OUT fill:#052e16,stroke:#166534,color:#bbf7d0
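The flow in the diagram can be sketched as a toy write-once, read-many store. `TinyStore` and all of its methods are hypothetical stand-ins written for this illustration, not the Korely SDK, and the word-splitting `_extract` merely simulates write-time model work. The sketch only shows that derived data is computed in add() and that reads touch nothing but the precomputed index.

```python
from dataclasses import dataclass, field


@dataclass
class TinyStore:
    """Hypothetical in-memory stand-in, not the Korely SDK."""
    memories: dict = field(default_factory=dict)  # id -> original text
    index: dict = field(default_factory=dict)     # id -> data derived at write time
    model_calls: int = 0                          # counts simulated model work

    def _extract(self, text: str) -> set:
        # Stand-in for write-time model work (embedding, extraction).
        self.model_calls += 1
        return {word.strip(".,").lower() for word in text.split()}

    def add(self, memory_id: str, text: str) -> None:
        # Write path: all derived data is computed once, here.
        self.memories[memory_id] = text
        self.index[memory_id] = self._extract(text)

    def get(self, memory_id: str) -> str:
        # Read path: a key lookup, no model work.
        return self.memories[memory_id]

    def find(self, term: str) -> list:
        # Read path: scans the precomputed index only.
        return sorted(
            mem_id for mem_id, terms in self.index.items() if term.lower() in terms
        )


store = TinyStore()
store.add("mem_1", "Renewal window opens July 1.")
store.add("mem_2", "Dana asked for a usage-based quote.")
calls_after_writes = store.model_calls

# One hundred reads later, the simulated model-call counter has not moved.
for _ in range(100):
    store.get("mem_1")
    store.find("renewal")

print(calls_after_writes, store.model_calls)
```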
The read path: SQL, vectors you already paid for, and one tiny exception
Korely exposes read operations over the REST API (GET /v1/context,
GET /v1/facts, GET /v1/memories,
POST /v1/memories/search) and mirrors them 1:1 in the SDK. Of
these, all except search are pure SQL lookups with
zero AI calls:
| Operation | What runs |
|---|---|
| GET /v1/context (SDK: korely.get_context()) | The primary recall path. Assembles the active typed facts and the relevant memories for a query into a ready-to-prompt block. Pure retrieval, no generation. |
| GET /v1/facts (SDK: korely.get_facts()) | Deterministic filter and sort over the typed fact store. Typically under 50 ms. |
| GET /v1/memories/{id} (SDK: korely.get(id)) | Primary-key lookup. One SQL query. |
| GET /v1/memories/{id}/history | Event log for a single memory. Pure SQL over the history table. |
| GET /v1/memories (SDK: korely.get_all()) | Paginated SQL with filters. |
| GET /v1/users (SDK: korely.users()) | List end-user IDs seen in your namespace. Pure SQL. |
| GET /v1/agents | List active agent IDs. Pure SQL. |
The one exception is POST /v1/memories/search (SDK:
korely.search(query, ...)). It embeds your query, which costs
a fraction of a hundredth of a cent. Retrieval is semantic vector
similarity (cosine) over the embeddings written at add() time.
That query embedding is the only model call on the entire read surface.
No generative model ever composes output on the read path. There is no reranker LLM, no answer synthesis, no summarization layer. Every read endpoint returns structured JSON assembled directly from the store. Your agent's own model does the reasoning over what comes back, which means you choose the model quality, the tone, and the prompt, not us.
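As an illustration of what "semantic vector similarity (cosine)" means once the query has been embedded, here is a minimal pure-Python sketch. The three-dimensional vectors and the helper names `cosine` and `rank` are made up for the example; a production store would use a vector index rather than a linear scan, and nothing here reflects Korely's internal implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, stored, limit=3):
    """Score every stored vector against the query and keep the best `limit`."""
    scored = [(mem_id, cosine(query_vec, vec)) for mem_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy vectors standing in for embeddings written at add() time.
stored_vectors = {
    "mem_a": [1.0, 0.0, 0.0],
    "mem_b": [0.6, 0.8, 0.0],
    "mem_c": [0.0, 0.0, 1.0],
}

# In a real system this vector would come from the single query-embedding call.
query_vector = [1.0, 0.0, 0.0]

top_hits = rank(query_vector, stored_vectors, limit=2)
for mem_id, score in top_hits:
    print(mem_id, round(score, 2))
```

Everything after the query vector exists is arithmetic over stored numbers, which is why the embedding is the only model call in a search.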
Request flow
          your agent (its own model)
               |           ^
     tool call |           | structured JSON
               v           | (no LLM composed this)
+---------------------------------------------+
|                  READ PATH                  |
|                                             |
|  POST /v1/memories/search (korely.search)   |
|    query embedding, then semantic vector    |
|    similarity (cosine) over stored vectors  |
|                                             |
|  get_context / get_facts and all other      |
|  read endpoints                             |
|    pure SQL and fact assembly, zero AI      |
+---------------------------------------------+
                      ^
                      | reads only what writes indexed
+---------------------------------------------+
|                 WRITE PATH                  |
|                                             |
|  chunking -> document + chunk embeddings    |
|  -> entity extraction -> typed-fact         |
|  extraction with contradiction checking     |
|  and bi-temporal validity stamps            |
+---------------------------------------------+
                      ^
                      | korely.add() / POST /v1/memories
               your application
Worked round trip: a search
Your agent asks about a customer. The call, and the typed objects that come back:
# the call your agent makes
hits = korely.search("acme renewal objections", limit=3)

# List[SearchHit]: structured objects, no generation
[
    SearchHit(id="mem_9f21ac", score=0.91,
              snippet="Main objection is the per-seat price after the team grew to 40. "
                      "Dana asked for a usage-based quote before the board meeting..."),
    SearchHit(id="mem_b774e0", score=0.86,
              snippet="Renewal window opens July 1. Champion: Dana Reyes (VP Ops). "
                      "Risk flagged: procurement now requires a security review upfront..."),
    SearchHit(id="mem_3c10d8", score=0.79,
              snippet="When a customer cites per-seat cost after headcount growth, "
                      "lead with the usage tier comparison, not a discount..."),
]
Latency note: the query embedding is the only model call in this round
trip. Everything after it is a semantic vector similarity lookup over the
stored embeddings, returned as structured data. No reranker model runs, no
answer is synthesized. Your agent reads the three snippets, decides that
mem_9f21ac matters, and follows up with
korely.get("mem_9f21ac") (REST: GET /v1/memories/mem_9f21ac),
which is a single SQL query with zero AI calls.
Worked round trip: facts
Typed facts are the fastest read in the system because contradiction
checking already happened at write time. A facts read is a deterministic
filter and sort, typically under 50 ms even with thousands
of facts in the store. GET /v1/facts is available on every
plan, including the free Hobby tier.
# the call your agent makes
facts = korely.get_facts(entity="Dana Reyes")

# List[Fact]: deterministic filter+sort, zero AI, typically <50 ms
[
    Fact(subject="Dana Reyes", predicate="works_at", object="Acme",
         valid_from="2026-02-12", invalid_at=None),
    Fact(subject="Dana Reyes", predicate="role_is", object="VP Ops",
         valid_from="2026-04-03", invalid_at=None),
    # predicate is normalized; the raw verb is kept in predicate_raw
    Fact(subject="Dana Reyes", predicate="likes", predicate_raw="prefers",
         object="usage-based pricing",
         valid_from="2026-06-04", invalid_at=None),
]
Latency note: no embedding, no model, no semantic search. The read is deterministic, so two agents issuing the same call get byte-identical output, which makes facts safe to use inside tool chains that need reproducibility. How facts get their validity intervals is covered in temporal facts.
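A facts read of this kind can be sketched as a plain filter and sort. The `Fact` dataclass below mirrors the fields shown above, but the sort key, the helper names `active_facts` and `to_json`, and the extra sample rows (the superseded "Director Ops" role and the Acme headcount fact) are assumptions added for illustration, not Korely's actual ordering or data.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    object: str
    valid_from: str                   # ISO date, so string order equals date order
    invalid_at: Optional[str] = None  # None means the fact is still active


def active_facts(facts, entity):
    """Keep active facts about `entity`, in a stable, input-independent order."""
    selected = [f for f in facts if f.subject == entity and f.invalid_at is None]
    return sorted(selected, key=lambda f: (f.valid_from, f.predicate, f.object))


def to_json(facts):
    # sort_keys keeps the serialized output independent of field order.
    return json.dumps([asdict(f) for f in facts], sort_keys=True)


# Illustrative rows; the last two are invented to exercise the filter.
FACTS = [
    Fact("Dana Reyes", "role_is", "VP Ops", "2026-04-03"),
    Fact("Dana Reyes", "works_at", "Acme", "2026-02-12"),
    Fact("Dana Reyes", "role_is", "Director Ops", "2026-02-12", invalid_at="2026-04-03"),
    Fact("Acme", "headcount_is", "40", "2026-05-01"),
]

first = to_json(active_facts(FACTS, "Dana Reyes"))
second = to_json(active_facts(list(reversed(FACTS)), "Dana Reyes"))
print(first == second)
```

Because the result depends only on the stored rows and a total sort order, the same call against the same store state serializes to the same bytes regardless of the order in which rows are scanned.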
The write path: where the intelligence runs
Apart from the query embedding at search time, all of the model work happens once, when a memory is written. A single
korely.add() (REST: POST /v1/memories) triggers the
full pipeline:
- Chunking and embeddings. The document is split, and both document-level and chunk-level embeddings are computed and stored. These are the same vectors that POST /v1/memories/search reuses on every later retrieval, free of charge.
- Entity extraction. People, companies, places, and concepts are extracted on our own infrastructure and wired into the typed knowledge graph. See the graph.
- Typed-fact extraction. Subject-predicate-object triples are extracted, checked against existing facts for contradictions, and stamped with bi-temporal validity. Conflicts are resolved here, at write time, so reads never have to arbitrate between a stale fact and a fresh one.
All of this costs about a tenth of a cent per memory, and it is all included in the per-memory write price. There are no surprise model charges on top.
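The "supersede, never delete" behavior of the fact step can be sketched as follows. This is a deliberately simplified stand-in: the real pipeline is described as a two-stage contradiction check, whereas this sketch uses a single rule (same subject and predicate, different object) and a hypothetical `SINGLE_VALUED` set of predicates. The earlier "Director Ops" role is invented sample data.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    object: str
    valid_from: str
    invalid_at: Optional[str] = None


# Assumption for this sketch: these predicates hold one value at a time.
SINGLE_VALUED = {"works_at", "role_is"}


def write_fact(store: List[Fact], new: Fact) -> List[Fact]:
    """Return a new store where conflicting active facts are closed, not removed."""
    updated = []
    for old in store:
        conflicts = (
            old.invalid_at is None
            and old.subject == new.subject
            and old.predicate == new.predicate
            and old.predicate in SINGLE_VALUED
            and old.object != new.object
        )
        # Supersede: stamp invalid_at on the old fact instead of deleting it.
        updated.append(replace(old, invalid_at=new.valid_from) if conflicts else old)
    updated.append(new)
    return updated


store: List[Fact] = []
store = write_fact(store, Fact("Dana Reyes", "role_is", "Director Ops", "2026-02-12"))
store = write_fact(store, Fact("Dana Reyes", "role_is", "VP Ops", "2026-04-03"))

# A read only has to filter on invalid_at; the conflict is already resolved.
active = [f for f in store if f.invalid_at is None]
print(len(store), [f.object for f in active])
```

Closing the old fact at write time keeps the history intact while leaving exactly one active value per predicate, so the read side never has to compare a stale fact against a fresh one.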
Why read quotas are an order of magnitude more generous
The economics follow directly from the architecture. A write runs embeddings, entity extraction, and fact extraction. A read runs SQL, plus at most one tiny query embedding. So the pricing mirrors reality: write quotas are sized around the work writes actually do, and read quotas are an order of magnitude more generous, because reads are nearly free to serve.
For agent workloads this is the shape you want. A typical agent reads its memory many times for every time it writes, often dozens of retrievals per new memory. With Korely that ratio costs you almost nothing, and it never pushes you toward caching reads or batching lookups to stay inside a quota. See pricing for the exact numbers per plan.
Everything described on this page, the API, the vector store, the graph, and the write-time models, runs on our own infrastructure, EU-hosted.
See also
- Memory model: how memories, entities, and facts map to user_id, agent_id, and run_id, and when to scope reads to a single user versus the whole agent namespace.
- Temporal facts: the bi-temporal validity model, how contradiction detection runs at write time, and how to query facts as_of a past date.
- The graph: the typed entity graph behind the bi-temporal facts that get_context() and get_facts() assemble, and how entity extraction wires new memories into it automatically.