AI Agent Memory: Short-Term, Long-Term, and How Agents Remember What Matters

Q: What does external agent memory cost on top of everything else?

Direct costs are low — a vector database like `Qdrant` is open-source and a reasonably priced server handles typical volumes (hundreds of thousands of records) without difficulty. The bigger cost impact comes from retrieval latency (adds 50–200 ms per query) and developer time to design the memory schema correctly. Poor memory management (storing everything instead of storing selectively) can increase token costs at retrieval time — the relevance of a record directly affects how many tokens are loaded back into context.

Q: What is the most common mistake when implementing agent memory?

From experience: **forgetting about summarisation and cleanup.** Teams implement writes to memory but do not address what happens when memory fills up with irrelevant entries or stale facts. After a few weeks the agent retrieves noise that degrades the quality of its responses. The memory layer needs an equally well-considered "forgetting strategy" as a storage strategy. *Memory architecture is one of those system layers that projects most often underestimate in the prototype phase — and where the price is paid when scaling into production. If you are designing an agent that needs to operate in a real environment with real users across days and weeks, we are happy to discuss what memory layer makes sense for your specific case.*

A client from the financial sector once showed us an agent that handled simple questions brilliantly — but when a conversation ran long, or they came back the next day, the agent behaved as if they had never met. It forgot context. It forgot decisions they had reached together. It forgot what the client said about their priorities two hours earlier. "It's AI — why doesn't it remember?" was their frustrated question.

This is an architecture question, not a model question. An AI agent's memory is not an automatic feature — it is a designed layer of the system. Design it poorly and the agent repeats questions, loses context on every new call, and cannot learn from previous interactions. This article explains how agent memory works, where its limits lie, and how to address those limits in practice.

What agent memory is — and why it is not the same as RAG

Before going further, it is important to distinguish two things that are often confused in practice: agent memory and RAG (Retrieval-Augmented Generation).

RAG is a mechanism for loading knowledge — when an agent needs to answer a question, it reaches into a vector database and retrieves relevant documents (technical documentation, company policies, product catalogues). RAG is about accessing knowledge from outside.

Agent memory is about the agent's state — what it remembers from the ongoing interaction, from session history, or about a specific user or entity. Memory is about conversational continuity and context.

The practical difference: when an agent asks "what format do you want the output in?" and you say "Excel", it should remember that for the rest of the session. That is not RAG — that is the agent's working memory. You would use RAG when the agent needs to find out which output formats your system supports.

Conflating these two concepts leads to poor architectural decisions. Teams deploy RAG when they need memory, or vice versa.

The four layers of agent memory

In practice we distinguish four distinct types of memory, each solving a different problem:

1. In-context memory (short-term)

This is the agent's primary working memory — the conversational history that sits directly inside the model's context window. The agent "sees" the full conversation history as part of its input.

The advantages are obvious: zero-latency access, the agent always "sees" what was said, simple to implement.

The limits are equally obvious:

Context is not infinite. Most production models offer a context of 128k–1M tokens, but that does not mean the agent processes information equally well at the start and at the end of a long conversation.
The "lost in the middle" effect — a well-documented phenomenon: models pay significantly less attention to information in the middle of a long context. A long history does not guarantee reliable processing of all of it.
Costs grow linearly with conversation length — every token in the history is sent again on every subsequent call.
Memory disappears when the session ends. In-context memory is ephemeral — in a new conversation the agent starts with a blank slate.

In-context memory is excellent for short, self-contained interactions. For long or persistent scenarios it is not sufficient on its own.

2. External (long-term) memory

When an interaction spans more than one session or exceeds the context window, we need external memory — persistent storage that the agent writes to and reads from.

The most common implementations:

Vector database (e.g. Qdrant, pgvector) — the agent stores key facts or whole conversational fragments as vectors. In a later session it retrieves them via semantic search. This resembles RAG, but over memory history rather than a knowledge base.
Relational database / key-value store — for structured facts about entities (user prefers English, works in department X, holds role Y). Fast and deterministic.
Episodic memory — records of past interactions stored as "what happened in session Z" — the agent can retrieve them, learn from them, or reference them.

The key difference from RAG over a knowledge base: in external memory the agent writes for itself — it is the agent's own state, not an external document.

External memory brings its own challenges: when to write (after each exchange? at session end?), what is worth storing and what is not, how to handle conflicting or stale records.

3. Summarisation memory

For long conversations, rolling summarisation is a practical solution. Instead of keeping the full history, the agent (or a dedicated summariser step) periodically compresses older parts of the conversation into a concise summary.

A typical implementation pattern in LangGraph:

Keep the last N messages in full inside the context (e.g. the last 20 exchanges)
Compress everything older into 2–3 paragraphs of summary
The summary sits at the top of the context as the "state so far"

Advantages: context stays manageable, costs do not grow without bound, the agent still has access to the substance of prior conversations.

Disadvantages: summarisation is lossy — detail can disappear. In legal or compliance scenarios where exact wording matters, this is a risk. In technical conversations where one specific configuration parameter was mentioned an hour ago, summarisation may omit it.

4. Working memory (procedural / state memory)

This is the most frequently overlooked layer. Working memory is the structured state the agent maintains while executing a task — not the conversation, but the *operational context*.

Examples: - The list of steps the agent has completed and those still pending - Intermediate results from tool calls - Decisions made in previous steps (e.g. "we chose XML format") - The state of a multi-step pipeline (which phase we are in)

In LangGraph, working memory maps naturally onto the State of the graph — each node reads from and writes to a shared state object. This is one of the main advantages of explicit state management over simpler frameworks.

Working memory is critical for agents executing long pipeline tasks — without it the agent "forgets" what it did in earlier steps and may duplicate work or contradict itself.

How these layers work together

In practice, a production agent combines all four layers. Here is what that looks like for a technical support agent at a manufacturing company:

1.User opens a new session → the agent loads from external memory (vector DB) relevant information about this customer: their machines, previous tickets, preferred contact method. This goes into the beginning of the context.

1.During the conversation → the full message history is in-context. The agent maintains working memory: "customer described a fault with motor M3, I have already checked steps A and B, waiting for the diagnostic tool output."

1.The conversation exceeds 20 exchanges → a summarisation step compresses older parts. Key facts (machine type, serial number, fault description) are extracted and stored in the summary.

1.Session ends → the agent writes to external memory: "ticket #1234, customer X, problem Y, resolution Z, satisfaction 4/5". This will be available when the customer next opens a conversation.

1.Customer writes again a week later → the cycle repeats, but this time with that history as a foundation.

This is not science fiction — it is a standard production architecture we implement for clients. The complexity is not in the technology; it is in deciding what to store, when, and in what format.

Designing external memory: what to store and what not to

One of the most common problems we encounter when deploying agents with long-term memory is memory explosion — the agent stores too much, the relevance of entries declines, and retrieval surfaces irrelevant noise.

Several rules that work in practice:

Store facts, not text. Instead of a full paragraph of conversation, the agent stores an extracted fact: {"user_id": "X", "preference": "output_format: Excel", "confidence": "explicit"}. A structured fact is easier to retrieve, easier to update, and takes up less space.

Distinguish permanence. Some information is permanent (customer runs Siemens machines), some is transient (in this session we are diagnosing a specific fault). These should live in separate stores with different TTLs (time-to-live).

Memory is versioned. When a customer changes a preference, the old value should not be overwritten — it should be archived with a timestamp. This matters for auditability.

Explicit vs. implicit memory. When the user explicitly says "remember that…", that is unambiguous. When the agent infers from context (the customer asked about the same topic three times → it is probably a priority), that is implicit memory — less reliable, requiring a higher confidence threshold.

When a long context window is not enough

It may seem that with million-token context models the memory problem is solved — just include the full history. In practice it is not that simple.

First, the "lost in the middle" effect — models pay disproportionately less attention to information in the middle of a long context. A critical fact from the beginning of a long conversation can be effectively invisible.

Second, cost — the full history is sent on every call. For a conversation with thousands of exchanges this is unnecessary token overhead that directly affects agent costs in production.

Third, scalability — a system with 10,000 active agents each holding full history in context is not realistic. Persistent external memory enables scaling because state lives outside the process.

Fourth, multi-session continuity — the context window always resets at the start of a new session. Without external memory every new conversation is a blank slate regardless of how many times the customer has interacted before.

A long context window is a powerful tool — but it is a tool for processing long documents and complex one-shot tasks, not a substitute for a designed memory architecture.

Summarisation vs. selective storage: where to draw the line

The two main strategies for managing long conversational history are summarisation and selective storage of key facts. Each has its place.

Summarisation is appropriate when: - The conversation is narrative, contextual, hard to structure - Relationships and flow of thought matter more than isolated facts - Speed of implementation is a priority

Selective fact storage is appropriate when: - The conversation contains clearly identifiable entities and attributes - Facts need to be retrieved, updated, or aggregated later - Auditability is a requirement (finance, compliance)

In production systems we typically combine both: summarisation for the "story" of the session, structured facts for entities and decisions.

A direct connection to how an agent maintains state between steps can be found in the article on AI agent architectures — particularly the section on checkpointing in LangGraph.

Technical implementation: LangGraph as a reference pattern

For production deployments we recommend LangGraph as the reference framework for memory patterns — not because it is fashionable, but because it models state as a first-class citizen of the architecture.

Key concepts in the context of memory:

State schema — you define what is in the agent's working memory: which fields, which types, which default values. This eliminates hidden global state.
Checkpointing — LangGraph natively supports saving graph state to persistent storage (SQLite, PostgreSQL). This enables resume after interruption, HITL (human-in-the-loop interrupt), and post-mortem debugging.
Memory store — newer versions of LangGraph include a dedicated memory store for cross-thread memory (shared across different conversations of the same user).

Checkpointing in LangGraph is also the foundational technique for human-in-the-loop with agents — the agent pauses before a critical action, state is saved, and after human approval it resumes from exactly the same point.

Alternative: mem0 is a dedicated open-source library for agent memory that provides an abstraction over various storage backends (vector DB, KV store, graph). Suitable if you do not want to be tied to a single framework.

Security aspects of agent memory

This rarely surfaces in discussions about agent memory, but it is critical: agent memory is an attack surface.

If an agent loads memory from external storage and injects it into context, an attacker with access to that storage can insert malicious instructions that become silent components of every future session. This is a variant of a prompt injection attack — and it is more insidious than a direct injection because it arrives from a trusted internal source.

Practical mitigations:

Memory records should be sanitised before being injected into context
Distinguish between trusted memory (written by the system) and untrusted memory (written by the customer or an external service)
Log what the agent retrieves from memory — this is part of AI agent observability
For high-risk agents (financial transactions, system access) consider memory with explicit human review before activation

Frequently asked questions

Is RAG the same as agent memory?

No. RAG is a mechanism for retrieving knowledge from an external knowledge base — for example, company documentation or a product database. Agent memory is about the agent's own state: what it remembers from the conversation, about the user, or about previous interactions. In the architecture these are two separate layers, even though both may technically use a vector database for storage.

Is a long context window sufficient instead of external memory?

For one-off, self-contained tasks, yes — a model with a million-token context can process even a very long document. For multi-session agents with history, for scalable systems, or for scenarios where the agent must remember facts across days and weeks, a long context window is not enough. In addition, the "lost in the middle" effect limits reliability with very long inputs.

How do you prevent an agent from "forgetting" important information from a long conversation?

A combination of three measures: (1) history summarisation with explicit extraction of key facts into structured format, (2) storing facts to external memory as soon as they are identified — not at session end, (3) working memory as an explicit state object in a graph-based agent, where important decisions are written directly as fields — not buried in conversational text.

What does external agent memory cost on top of everything else?

Direct costs are low — a vector database like Qdrant is open-source and a reasonably priced server handles typical volumes (hundreds of thousands of records) without difficulty. The bigger cost impact comes from retrieval latency (adds 50–200 ms per query) and developer time to design the memory schema correctly. Poor memory management (storing everything instead of storing selectively) can increase token costs at retrieval time — the relevance of a record directly affects how many tokens are loaded back into context.

What is the most common mistake when implementing agent memory?

From experience: forgetting about summarisation and cleanup. Teams implement writes to memory but do not address what happens when memory fills up with irrelevant entries or stale facts. After a few weeks the agent retrieves noise that degrades the quality of its responses. The memory layer needs an equally well-considered "forgetting strategy" as a storage strategy.

*Memory architecture is one of those system layers that projects most often underestimate in the prototype phase — and where the price is paid when scaling into production. If you are designing an agent that needs to operate in a real environment with real users across days and weeks, we are happy to discuss what memory layer makes sense for your specific case.*