Despite appearances, memory for conversational LLMs remains an unsolved problem.
The dream is: the model remembers what you said before and draws meaning across it over time. Not just recall, but interpretation, narrative, the kind of memory that makes a conversation feel continuous and cumulative across months or years.
Today, you can achieve an illusion of this dream. For days, or weeks if you're lucky. Until the illusion breaks when the LLM starts forgetting.
As your conversation history grows, the memory system must decide what to capture, how to represent it, and what to surface on any given conversation turn. Every one of those decisions is lossy, opinionated, and non-deterministic.
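Those three decisions can be made concrete with a toy sketch. Everything here is illustrative and hypothetical, not any real product's API: the capture filter, the truncation-as-representation, and the keyword-overlap retrieval are all crude stand-ins for the real (and equally lossy) versions.

```python
# Hypothetical sketch of the three lossy decisions a memory system
# makes: what to capture, how to represent it, what to surface.
# All names and heuristics are illustrative, not a real system's API.

from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    entries: list[str] = field(default_factory=list)

    def capture(self, turn: str) -> None:
        # Decision 1: what to keep. A crude heuristic here —
        # only turns long enough to "matter". Lossy by design.
        if len(turn.split()) > 5:
            self.entries.append(turn)

    def represent(self, turn: str) -> str:
        # Decision 2: how to store it. Naive truncation stands in
        # for summarization or embedding. Also lossy.
        return turn[:100]

    def surface(self, query: str, k: int = 3) -> list[str]:
        # Decision 3: what to retrieve this turn. A toy keyword-
        # overlap score stands in for vector search.
        def score(entry: str) -> int:
            return len(set(query.lower().split()) & set(entry.lower().split()))
        return sorted(self.entries, key=score, reverse=True)[:k]

store = MemoryStore()
store.capture(store.represent("I adopted a dog named Biscuit last spring"))
store.capture(store.represent("ok"))  # dropped: fails the capture filter
retrieved = store.surface("what did I tell you about my dog")
```

Every one of those three methods throws information away, and each makes a judgment call that another reasonable implementation would make differently — which is the non-determinism the paragraph above describes.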
Over time, either the corpus of information becomes too large to reliably search, or what the system remembers starts to drift from what was actually said due to repeated summarization. The model forgets because the system either can't hold a complete picture, or the picture becomes distorted.
In an ideal world, the LLM would have perfect historic context on the conversation turns that matter. Infinite attention across every word you've ever exchanged, with none of the cost or latency that would actually entail.
Since that's not possible, every memory system is an attempt to approximate it. Each with its own drawbacks.
There are ultimately only two ways to preserve information from a conversation: keep it raw (store the transcripts verbatim), or keep something derived from it (summaries, extracted facts, embeddings).
Every memory system is choosing a position on this spectrum. And neither extreme works.
Raw is lossless but inert. A pile of transcripts isn't understanding. The information is all there, but nothing is connected, prioritized, or interpreted. It's just buried in the source material.
Derived is compact and usable, but repeated derivation drifts from the source the way a photocopy of a photocopy degrades. You don't lose the information all at once. You lose it gradually, and can't tell exactly when it stopped being accurate.
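The photocopy effect is easy to simulate. In this sketch the "summarizer" is just truncation, which is a deliberately visible stand-in: a real LLM summarizer loses detail less obviously, but the loss compounds across generations in exactly the same way.

```python
# Toy illustration of derivation drift: each "summarization" pass
# keeps only part of the text, so repeated passes compound the loss —
# the photocopy-of-a-photocopy effect. Truncation stands in for a
# real summarizer, which degrades less visibly but just as cumulatively.

def summarize(text: str, keep_ratio: float = 0.7) -> str:
    words = text.split()
    return " ".join(words[: max(1, int(len(words) * keep_ratio))])

original = ("User prefers morning meetings, is allergic to peanuts, "
            "lives in Lisbon, and is planning a move to Berlin in June")

derived = original
for generation in range(4):
    derived = summarize(derived)
    print(f"gen {generation + 1}: {derived}")

# Within a few generations the Berlin move — arguably the most
# important fact — is gone, and nothing records when it was lost.
```

The last point is the insidious one: the system has no marker for the moment a fact dropped out, so there is no way to tell, from the derived state alone, when it stopped being accurate.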
This is the most natural objection. Context windows keep getting bigger. Won't they eventually get big enough that we can just skip the memory system entirely and feed in the full history?
Not anytime soon. For two reasons:
Cost. Even if you could fit two years of conversation history into a context window, you'd be paying to process all of it on every single turn. The economics are brutal: per-turn cost scales linearly with the length of the history, so cumulative cost grows quadratically over the conversation's life. No consumer product survives that margin structure.
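A back-of-envelope calculation makes the point. Both numbers here are assumptions, not any provider's actual rates: ~2,000 tokens of conversation per day, and $3 per million input tokens.

```python
# Back-of-envelope cost of "just feed the full history" on every turn.
# TOKENS_PER_DAY and PRICE_PER_M_INPUT are assumed stand-in numbers,
# not any specific provider's rates.

TOKENS_PER_DAY = 2_000
DAYS = 2 * 365
PRICE_PER_M_INPUT = 3.00  # USD per 1M input tokens, assumed

history_tokens = TOKENS_PER_DAY * DAYS  # 1,460,000 tokens after 2 years
cost_per_turn = history_tokens / 1_000_000 * PRICE_PER_M_INPUT

print(f"{history_tokens:,} tokens of history")
print(f"${cost_per_turn:.2f} of input cost on every single turn")
```

Under these assumptions that's about $4.38 of input processing per turn. At even ten turns a day, you're past $40 per day for a single user, and the figure keeps climbing as the history grows.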
Degradation. Models get worse as the context window fills. Attention drops on information in the middle, overall reasoning quality declines, instruction following gets sloppier. You're paying more for worse performance.