A voice companion that remembers, fast

A voice loop is brutal about latency. The user finishes speaking, and every millisecond until the first synthesized word is a pause they can hear. Drop a memory read into that gap (a vector-DB round-trip, then an LLM re-rank to decide what's relevant) and you've added a beat of silence to every turn. The assistant sounds like it's thinking when it should already know.

Korely's read path has no model on it. get_context is a deterministic assembly of the user's active facts plus their most relevant memories, fitted to a token budget: a SQL-shaped retrieval, not a generation. That's the whole trick. Recall is fast enough to sit inside a voice turn, before the first spoken token, instead of stretching it. This cookbook walks the pattern at the call level.

The snippets use the Python SDK (pip install korely-memory). The same calls work over the REST API and the Node SDK.

The shape of a voice turn

Three stages, each from a different vendor. Speech-to-text (Whisper, Deepgram) turns the mic into a transcript. Your agent reasons and replies. Text-to-speech (ElevenLabs, and the rest) turns the reply into audio. Korely is none of those. It is only the memory, and it has exactly one job on the critical path: hand the agent everything it knows about this user before the model runs, fast enough that nobody hears the lookup.

from korely_memory import Korely

korely = Korely(api_key="kor_live_...", region="eu")

# STT + TTS are your choice — placeholders here, Korely is only the memory.
transcript = stt.listen()        # Whisper / Deepgram -> text
USER = "user-amelia"             # the end user this companion serves

1. On the critical path: read context before the model runs

The moment the transcript lands, assemble what's known about the user into a prompt-ready block. One call. get_context returns the active facts plus the most relevant memories, already trimmed to a token budget, and because it's deterministic retrieval, there's no second LLM hop deciding relevance. This is the call you can afford to put before time-to-first-token.

ctx = korely.get_context(
    query=transcript,            # "remind me what my doctor said about coffee"
    user_id=USER,
    token_budget=600,            # keep it tight — this rides the latency budget
)

# ctx.context -> a Markdown block to prepend to the system prompt
# ctx.tokens  -> how many tokens it actually used
# ctx.sources -> the memory ids it drew from

reply = agent.respond(system=ctx.context, user_turn=transcript)
tts.speak(reply)                 # ElevenLabs / your TTS -> audio out

{
  "context": "Known facts about user-amelia:\n- user-amelia avoids \"caffeine after 2pm\" (doctor's advice)\n- user-amelia prefers \"decaf oat flat white\"\n- user-amelia lives_in \"Lisbon\"\n\nRelevant memories:\n- Last week: mentioned the new cafe on Rua do Carmo does a good decaf...",
  "tokens": 188,
  "sources": ["mem_a91c", "mem_4f20"]
}
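
The comment above describes ctx.context as a block to prepend to the system prompt, while the snippet passes it as the entire system prompt. If your agent already carries its own instructions, a minimal sketch of the prepend might look like the following. BASE_PROMPT and build_system_prompt are hypothetical names for illustration, not part of the SDK.

```python
from typing import Optional

# Hypothetical base instructions for the voice agent.
BASE_PROMPT = "You are a warm, concise voice companion. Keep replies short."

def build_system_prompt(context: Optional[str], base: str = BASE_PROMPT) -> str:
    """Prepend the memory block to the agent's own instructions.

    Falls back to the base prompt alone when there is no context
    (a new user, or a turn where the read was skipped).
    """
    context = (context or "").strip()
    if not context:
        return base
    return f"{context}\n\n{base}"

# Example with a literal block; in the voice loop this would be ctx.context.
system = build_system_prompt('Known facts about user-amelia:\n- lives_in "Lisbon"')
```

Keeping the memory block first and the standing instructions last is a design choice, not a requirement; swap the order if your model weighs the start of the prompt more heavily.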

The agent now speaks with the user's history already in hand (the doctor's advice, the decaf preference) without a single round-trip the listener can perceive. No "let me look that up."

Why this is the moat. No model runs on the read path. get_context is a deterministic assembly of facts plus memories, not a generation step. That's what makes it fast enough to live inside a voice turn, where a vector-DB round-trip plus an LLM re-rank would be an audible pause. The contradictions are already resolved (the stale "drinks espresso all day" fact was superseded long before this turn), so the model is handed truth, not a pile of candidates to sort through out loud.
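
To make "deterministic assembly fitted to a token budget" concrete, here is a toy sketch. It is not Korely's implementation: the data shapes, the assemble_context name, and the whitespace word count standing in for a tokenizer are all assumptions. The only point is that the same inputs always produce the same block, with no model call anywhere.

```python
from typing import List, Tuple

def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(text.split())

def assemble_context(
    facts: List[str],
    memories: List[Tuple[float, str]],   # (relevance score, text)
    token_budget: int,
) -> Tuple[str, int]:
    """Pack facts first, then memories by descending score, within the budget."""
    lines: List[str] = []
    used = 0
    # Facts keep their given order. Memories sort by score, with ties broken
    # by text, so the output is identical on every run.
    ranked = [text for _, text in sorted(memories, key=lambda m: (-m[0], m[1]))]
    for line in [f"- {f}" for f in facts] + [f"- {m}" for m in ranked]:
        cost = count_tokens(line)
        if used + cost > token_budget:
            break                        # stop at the first line that does not fit
        lines.append(line)
        used += cost
    return "\n".join(lines), used

block, used = assemble_context(
    facts=['avoids "caffeine after 2pm"', 'lives_in "Lisbon"'],
    memories=[(0.41, "asked about tea"), (0.83, "new cafe does a good decaf")],
    token_budget=12,
)
```

With a budget of 12 the two facts fit and the memories are dropped, which mirrors why a tight token_budget keeps the read small on the critical path.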

STT (your choice): mic -> transcript, via Whisper / Deepgram. The user stops speaking.

Korely: get_context(query=transcript) returns active facts + relevant memories. Deterministic, no model on the read.

Agent + TTS: first token, no pause. The model speaks already grounded, and the lookup is inaudible.
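
The flow above assumes the read returns in time. A hedged sketch of keeping the turn on budget when it doesn't: run the read under a deadline and fall back to an empty context if it overruns or fails. read_with_deadline and the 150 ms default are hypothetical, and fetch is any zero-argument callable you supply, for example a lambda wrapping the get_context call from step 1.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

_pool = ThreadPoolExecutor(max_workers=4)   # shared and reused across turns

def read_with_deadline(fetch: Callable[[], str], deadline_s: float = 0.15) -> str:
    """Return fetch()'s result, or "" if it fails or misses the deadline."""
    future = _pool.submit(fetch)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        # The call keeps running in its worker thread; we only stop waiting.
        return ""
    except Exception:
        # A failed read should degrade the reply, not break the turn.
        return ""

# Stand-ins for a fast and a slow memory read.
fast = read_with_deadline(lambda: "Known facts: ...")
slow = read_with_deadline(lambda: time.sleep(0.5) or "too late", deadline_s=0.05)
```

An ungrounded reply is usually a better failure mode for voice than an audible stall, but that trade-off is yours to make.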

2. Mid-turn: pull one specific memory

When the agent needs a particular past detail rather than the whole context block, search the user's memories directly. Scoped to the user, ranked by relevance, no model in the path beyond the query embedding.

hits = korely.search(
    "cafe Rua do Carmo decaf",
    user_id=USER,
    limit=3,
)
# search() returns the list of hits directly — not a page object.
# each hit: {id, score, snippet, user_id, agent_id, metadata}

[
  {
    "id": "mem_4f20",
    "score": 0.83,
    "snippet": "mentioned the new cafe on Rua do Carmo does a good decaf oat flat white",
    "user_id": "user-amelia",
    "agent_id": null,
    "metadata": {"channel": "voice"}
  }
]
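
Because search() returns a plain list of hits, deciding what to use is ordinary list handling. A small sketch, assuming hits behave like dicts with the keys shown above (if the SDK hands back objects, swap the key lookups for attribute access). The confident_snippets helper and the 0.6 threshold are illustrative, not SDK features.

```python
from typing import Dict, List

def confident_snippets(hits: List[Dict], min_score: float = 0.6) -> List[str]:
    """Keep snippets whose score clears the threshold, best first."""
    kept = [h for h in hits if h.get("score", 0.0) >= min_score]
    kept.sort(key=lambda h: h["score"], reverse=True)
    return [h["snippet"] for h in kept]

sample_hits = [
    {"id": "mem_4f20", "score": 0.83,
     "snippet": "mentioned the new cafe on Rua do Carmo does a good decaf oat flat white"},
    # Made-up low-score hit, included only to show the filter working.
    {"id": "mem_0000", "score": 0.31, "snippet": "unrelated low-score hit"},
]
snippets = confident_snippets(sample_hits)
```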

3. Off the critical path: capture the turn

After the assistant has spoken, once the latency-critical work is done, write the turn down. This is fire-and-forget: it doesn't block the next spoken word, and the user is already hearing the reply while it runs.

korely.add(
    "User said her doctor moved the caffeine cutoff earlier — "
    "no coffee after noon now, not 2pm.",
    user_id=USER,
    metadata={"channel": "voice"},
)
# add() accepts a plain string or a chat-message list [{role, content}, ...].
# Returns immediately; fact extraction runs server-side, off this path.

{
  "id": "mem_77e1",
  "content": "User said her doctor moved the caffeine cutoff earlier — no coffee after noon now, not 2pm.",
  "user_id": "user-amelia",
  "agent_id": null,
  "run_id": null,
  "metadata": {"channel": "voice"},
  "created_at": "2026-06-16T18:40:11Z",
  "updated_at": "2026-06-16T18:40:11Z",
  "facts": []
}

Note that facts comes back empty here because extraction hasn't run yet. The new (subject, predicate, object) triple (user-amelia avoids "caffeine after noon", superseding the old "after 2pm" fact) lands a few seconds later, server-side. When you do need to confirm what was extracted, read it back explicitly rather than expecting it inline:

# moments later, off the voice loop — read the extracted facts back
mem = korely.get("mem_77e1")
# mem.facts is now populated; the old "after 2pm" fact has been superseded
# you can also query the typed facts directly:
facts = korely.get_facts(entity="caffeine", user_id=USER)
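
If a background job needs the extracted facts as soon as they exist, one option is to poll with a short backoff. This is a sketch under assumptions: wait_for_facts is a hypothetical helper, fetch_facts is any callable you supply that returns the current list of facts (for example, one that reads the memory back as above), and the timings are placeholders.

```python
import time
from typing import Callable, List

def wait_for_facts(
    fetch_facts: Callable[[], List],
    timeout_s: float = 10.0,
    first_delay_s: float = 0.5,
) -> List:
    """Poll until fetch_facts() returns a non-empty list or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    delay = first_delay_s
    while True:
        found = fetch_facts()
        if found:
            return found
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return []                    # extraction still pending; try again later
        time.sleep(min(delay, remaining))
        delay *= 2                       # back off between attempts

# Stand-in for a memory whose facts appear on the third read.
_reads = iter([[], [], ["avoids caffeine after noon"]])
extracted = wait_for_facts(lambda: next(_reads), timeout_s=1.0, first_delay_s=0.01)
```

Keep this off the voice loop: the whole point of the asynchronous write is that nothing on the spoken path waits for extraction.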

By the next turn, and certainly by the next conversation, get_context already serves the corrected cutoff. The agent never repeats the stale "2pm" advice, because the contradiction was resolved before it ever reached the read path again.

One honest caveat: capture is the asynchronous half. korely.add returns in under a second and runs fact extraction off the latency-critical path, so a turn's typed facts land a few seconds after the write. They are ready for the next turn's get_context, not inside the same call. Write turns fire-and-forget once the assistant has spoken. And to be clear about the division of labour: STT and TTS are yours (Whisper, Deepgram, ElevenLabs and the rest); Korely is only the memory that loads in time.
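
One way to keep the write off the loop even when the network is slow is to hand it to a background worker. A minimal sketch: capture_later is a hypothetical helper, and write is any zero-argument callable you supply, such as a lambda wrapping the korely.add call from step 3.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

log = logging.getLogger("voice.capture")
_writer = ThreadPoolExecutor(max_workers=2, thread_name_prefix="capture")

def capture_later(write: Callable[[], object]) -> Future:
    """Queue a memory write and return at once; log failures, never raise."""
    def _log_failure(done: Future) -> None:
        if done.cancelled():
            return
        err = done.exception()
        if err is not None:
            log.warning("memory write failed: %r", err)

    future = _writer.submit(write)
    future.add_done_callback(_log_failure)
    return future

# Stand-in write; in the voice loop this would wrap the add() call.
saved = []
pending = capture_later(lambda: saved.append("no coffee after noon now, not 2pm"))
pending.result(timeout=1)   # demo only; the voice loop would not wait here
```

Logging rather than raising is deliberate: a lost write costs one remembered turn, while an exception on the voice loop costs the conversation.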

Where to go next

Get context covers the call on the critical path in step 1, including its full options and budgeting behaviour. Temporal facts explains how the "after 2pm -> after noon" supersession happens, so the read path only ever serves what's true now. The multi-session research cookbook shows the same read/write split when latency isn't the constraint and the emphasis is point-in-time replay.