LLMs (Large Language Models) are the base of all GPT models. Memory for an LLM is always handled outside the model itself: by including extra information in the prompt (the input context), the LLM can use that information to generate its output. What sits in this context window determines the behavior of our application. For instance, a chatbot needs conversational memory: it may have to remember the user's name over time, keep track of tasks, or share information between different LLMs by retrieving relevant facts from an external data source. But do you know what the actual problem is? It is the size of this context window. Yes, adding more context can improve the output, but a larger context directly increases cost, response time, and processing.
Managing what should be in the context window is therefore really important. Different companies follow different approaches to manage this memory, but recently I came across the paper MemGPT: Towards LLMs as Operating Systems.
Why is this paper interesting?
Before going into this paper, let me explain an important concept from computing called virtual memory.
Virtual memory is not physical memory, but it makes the computer appear to have more memory than it physically has. When a program references a virtual address that is not present in physical memory, the operating system first makes room by moving a block of information from physical memory out to disk, preserving any changes in that block, and then fetches the requested block from disk back into physical memory.
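This page-in/page-out cycle can be sketched in a few lines of Python (a toy model for intuition only, not real OS code; all names here are illustrative):

```python
# Toy model of demand paging: a small "physical memory" backed by a larger "disk".
PHYSICAL_SLOTS = 2

memory = {}                       # page_id -> data currently "in RAM"
disk = {"a": 1, "b": 2, "c": 3}   # the larger backing store

def access(page_id):
    if page_id not in memory:                  # "page fault"
        if len(memory) >= PHYSICAL_SLOTS:
            evicted, data = memory.popitem()   # make room in physical memory
            disk[evicted] = data               # preserve changes on disk
        memory[page_id] = disk[page_id]        # fetch the block back in
    return memory[page_id]

access("a"); access("b"); access("c")          # accessing "c" forces an eviction
```

The program "sees" all three pages, even though only two fit in physical memory at once; that gap is exactly what the OS hides, and what MemGPT mimics for the context window.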

Similarly, we can think of the LLM's context window as analogous to this physical memory. In our system, an LLM agent takes on the role of the operating system and decides what information should be in the context window.
AI agents can use an LLM for planning, use tools, and make decisions such as whether to stop or continue a task. The same approach lets an AI agent manage its own memory. To support memory management, the agent is given a section of the context window for long-term memory, which the agent can write to, plus tools to access external storage such as databases, creating a large memory store. Combining tools that write to both in-context and external memory with tools that search external memory and place the results back into the LLM context gives the agent full control over what it remembers.
Key Ideas behind MemGPT
- Self-editing memory — Instructions and tools are usually static, but MemGPT can edit its own memory or prompt based on things it learns during the chat.
- Inner thoughts — MemGPT agents are always thinking to themselves, even when they don't reply.
- Output is a tool call — Everything a MemGPT agent outputs is a tool call; when the agent wants to communicate, it calls a tool.
- Looping via heartbeats — MemGPT agents are designed to run many LLM steps off of a single user input.
Together, these features make MemGPT agents autonomous (they can take actions by themselves) and self-improving (they can learn over time using long-term memory).
MemGPT agents use heartbeats to chain function calls. For example, suppose you tell the agent your name. The agent updates its memory with a function call that saves the name to the database, but this call does not respond to you, so the heartbeat is set to true, which loops into another function call that replies "Hi, nice meeting you Shilpa!!"
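The heartbeat chaining described above can be sketched roughly like this (a simplified, self-contained model; plan_next_tool and call_tool are stand-ins, not MemGPT APIs):

```python
# Sketch of heartbeat chaining: the agent keeps stepping as long as
# its tool calls request a heartbeat. Tool names are illustrative.
def run_agent_step(state, user_message):
    transcript = [user_message]
    keep_looping = True
    while keep_looping:
        tool_name, args, request_heartbeat = plan_next_tool(state, transcript)
        result = call_tool(state, tool_name, args)
        transcript.append((tool_name, result))
        keep_looping = request_heartbeat  # True -> chain another LLM step

    return transcript

# Fake planner and tool executor so the sketch is runnable:
def plan_next_tool(state, transcript):
    if not state["saved_name"]:
        # first step: save the name, and request a heartbeat to continue
        return "core_memory_append", {"value": "Name: Shilpa"}, True
    # second step: reply to the user, no further heartbeat needed
    return "send_message", {"text": "Hi, nice meeting you Shilpa!!"}, False

def call_tool(state, name, args):
    if name == "core_memory_append":
        state["saved_name"] = True
    return f"{name} ok"

transcript = run_agent_step({"saved_name": False}, "My name is Shilpa")
```

One user input here produces two chained LLM steps: a memory write followed by the actual reply.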
MemGPT stores all your messages, tools, and memories. Hence, even if you close the Python script and rerun your MemGPT agent code, it still remembers your conversation. We just have to decide how to turn this stored state into a prompt. This is context compilation, which turns the agent state into a prompt, and it largely decides the performance of the agent.
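As an illustration of context compilation, here is a minimal sketch that assembles stored state into a single prompt (not Letta's actual implementation; the section layout is an assumption):

```python
# Sketch: turn persisted agent state into the prompt sent to the LLM.
def compile_context(system_prompt, core_memory, recent_messages):
    sections = [
        system_prompt,
        "### Core memory",
        *[f"<{label}> {value}" for label, value in core_memory.items()],
        "### Conversation",
        *[f"{m['role']}: {m['content']}" for m in recent_messages],
    ]
    return "\n".join(sections)

prompt = compile_context(
    "You are a helpful assistant.",
    {"human": "My name is Shilpa", "persona": "Always use emojis"},
    [{"role": "user", "content": "hows it going????"}],
)
```

The key point is that what the model "remembers" is entirely determined by what this compilation step chooses to include.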
Breaking down Context Window

In LLM chat-completion APIs, the input is divided into the system prompt, which determines how the model's behavior differs from the base LLM, and the chat history, the memory of the previous conversation. The chat completion produces output with respect to that chat history.
In MemGPT, we create a special section called core memory.

Core memory stores important information about the user to personalize the conversation.
The agent also knows about the tools that store important context for long-term memory, and it understands that it has the power to edit this memory when it sees fit. Core memory also stores information about the agent itself, and we can plug in a customized memory module: depending on what your agent needs to do, the memory module can be as simple or as complex as you want. Core memory is what lets the agent learn over time, and it is the special section of the context window that is visible to the agent at all times.
What happens if the context window is out of memory?
When we run out of context, many agents flush the conversation by truncating it and storing a summary in a few sentences, but with that technique the original messages are permanently deleted. MemGPT instead stores all messages on disk and never permanently deletes a message. Old messages are moved out of chat memory and saved to recall memory. This frees up space in the chat history while keeping the messages available for the agent to recall whenever required.
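The eviction step can be sketched like this (illustrative code, not MemGPT internals): old messages are summarized in the chat context, but full copies are kept in an external recall store rather than deleted:

```python
# Sketch of MemGPT-style eviction: summarize old messages in context,
# but keep full copies in an external "recall" store for later search.
recall_storage = []   # stands in for messages persisted on disk

def flush_old_messages(chat_history, keep_last, summarize):
    old, recent = chat_history[:-keep_last], chat_history[-keep_last:]
    recall_storage.extend(old)            # nothing is permanently deleted
    summary = summarize(old)
    return [{"role": "system", "content": f"Summary: {summary}"}] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(6)]
history = flush_old_messages(history, keep_last=2,
                             summarize=lambda msgs: f"{len(msgs)} older messages")
```

After the flush, the in-context history is short (summary plus the most recent messages), while the recall store still holds every evicted message for later search.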
Similar to core memory, recall memory is also accessed through tools: when the agent needs old information, it uses the conversation search tool, which searches the database of old messages.
Core memory is limited in size too, with a limit of 2,000 characters. But what happens if the agent runs out of space in core memory? Just as chat history has recall memory, core memory has a backup called archival memory. MemGPT decides what information is important enough to stay in the context window and what should be moved to archival memory. Archival memory is general-purpose and can be pinned into the context window at any time; the data source can be stored, for example, as a PDF. When a question relates to it, the MemGPT agent searches the data source and retrieves enough relevant information. Because archival memory and recall memory are both external, they are effectively unlimited.
When there is a huge amount of external data, MemGPT decides which external source to search based on statistics called A/R stats, which tell MemGPT whether a source is likely to contain information related to the question before it uses a search tool.
Using the Letta tool to build a MemGPT agent
Let us create an agent and see the above concepts in action.
First, we create the Letta client. We can download the Letta server and run it locally.
from letta_client import Letta
client = Letta(base_url="http://localhost:8283")
Then create the agent. Here I use OpenAI for both the LLM and the embedding model.
agent_state = client.agents.create(
    name="simple_agent",
    memory_blocks=[
        {
            "label": "human",
            "value": "My name is Shilpa",
            "limit": 10000  # character limit
        },
        {
            "label": "persona",
            "value": "You are a helpful assistant and you always use emojis"
        }
    ],
    model="openai/gpt-4o-mini-2024-07-18",
    embedding="openai/text-embedding-3-small"
)
To send a message to the agent, we use the messages method:
# send a message to the agent
response = client.agents.messages.create(
    agent_id=agent_state.id,
    messages=[
        {
            "role": "user",
            "content": "hows it going????"
        }
    ]
)

# if we want to print the messages
for message in response.messages:
    print_message(message)
We can print the usage statistics from the response object:
# if we want to print the usage stats
print(response.usage.completion_tokens)
print(response.usage.prompt_tokens)
print(response.usage.step_count)
If you want to look at what tools the agent uses

Now let's give the agent new information:
# send a message to the agent
response = client.agents.messages.create(
    agent_id=agent_state.id,
    messages=[
        {
            "role": "user",
            "content": "my name actually is Sid"
        }
    ]
)

# if we want to print the messages
for message in response.messages:
    print_message(message)

Here is the output reasoning
To retrieve the updated value:
client.agents.blocks.retrieve(
    agent_id=agent_state.id,
    block_label="human"
).value
Now I would like to save some new information from the chat, so I send a message to the agent in the conversation:
response = client.agents.messages.create(
    agent_id=agent_state.id,
    messages=[
        {
            "role": "user",
            "content": "Save the information that 'shilpa loves machine learning' to archival"
        }
    ]
)

# if we want to print the messages
for message in response.messages:
    print_message(message)
You can see what is stored in the passages.

We can also explicitly create the memories
client.agents.passages.create(
    agent_id=agent_state.id,
    text="Shilpa loves Hersheys",
)
Now let us test it
# send a message to the agent
response = client.agents.messages.create(
    agent_id=agent_state.id,
    messages=[
        {
            "role": "user",
            "content": "What chocolates do I like? Search archival."
        }
    ]
)

for message in response.messages:
    print_message(message)

Cool, right! I can look at the memory that is stored and update it as per my needs.
How is Core memory Designed?
Core memory is defined by memory blocks and memory tools. Each block has a character limit. A block also has a label, such as human or persona, used to reference the block, and a value that holds the actual data.
We also have tools like core_memory_replace(), which replaces existing content in core memory. The block data is compiled into the context window at inference time to make up the core memory context.
Memory blocks are synced to the database and have unique IDs, so a block can be shared across multiple agents' context windows by syncing its value.
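As an illustration, a core_memory_replace-style edit can be sketched as a string replacement on a block's value with the character limit enforced (the block dict shape here is illustrative; the real tool is provided by MemGPT/Letta):

```python
# Sketch of a core_memory_replace-style edit on a memory block,
# enforcing the block's character limit. Block shape is illustrative.
def core_memory_replace(block, old_content, new_content):
    if old_content not in block["value"]:
        raise ValueError("old_content not found in block")
    updated = block["value"].replace(old_content, new_content)
    if len(updated) > block["limit"]:
        raise ValueError("edit would exceed block character limit")
    block["value"] = updated
    return block

human = {"label": "human", "value": "My name is Shilpa", "limit": 2000}
core_memory_replace(human, "Shilpa", "Sid")
```

This is the kind of self-edit the agent performs when you correct it ("my name actually is Sid"): a targeted rewrite of one block's value, not a wholesale prompt rewrite.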
Let us walk through a practical example and see how this works.
As before, we need the Letta server running on the local machine. Now let us create the agent.
agent_state = client.agents.create(
    memory_blocks=[
        {
            "label": "human",
            "value": "The human's name is Winnie the Pooh."
        },
        {
            "label": "persona",
            "value": "My name is Rabbit, the all-knowing sentient AI."
        }
    ],
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small"
)
Now list the agent's blocks to access them:
blocks = client.agents.blocks.list(
    agent_id=agent_state.id,
)
If you print it, you can see 2 blocks created

We can also retrieve the block using block ID or the label

Now let us see how to create tools that access the AgentState.
def get_agent_id(agent_state: "AgentState"):
    """
    Query your agent ID field
    """
    return agent_state.id

get_id_tool = client.tools.upsert_from_function(func=get_agent_id)
This function returns the agent ID.
Now create an agent that uses this tool by passing its ID to the tool_ids field:
agent_state = client.agents.create(
    memory_blocks=[],
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    tool_ids=[get_id_tool.id]
)
Let us test that it works:
response_stream = client.agents.messages.create_stream(
    agent_id=agent_state.id,
    messages=[
        {
            "role": "user",
            "content": "What is your agent id?"
        }
    ]
)

for chunk in response_stream:
    print_message(chunk)

Great!! Now let's move on to the next part and create custom tools for a task queue: one to push tasks onto the queue and one to pop them.
def task_queue_push(agent_state: "AgentState", task_description: str):
    """
    Push to a task queue stored in core memory.

    Args:
        task_description (str): A description of the next task you must accomplish.

    Returns:
        Optional[str]: None is always returned as this function
        does not produce a response.
    """
    import json
    from letta_client import Letta

    client = Letta(base_url="http://localhost:8283")
    block = client.agents.blocks.retrieve(
        agent_id=agent_state.id,
        block_label="tasks",
    )
    tasks = json.loads(block.value)
    tasks.append(task_description)

    # update the block value
    client.agents.blocks.modify(
        agent_id=agent_state.id,
        value=json.dumps(tasks),
        block_label="tasks"
    )
    return None
def task_queue_pop(agent_state: "AgentState"):
    """
    Get the next task from the task queue

    Returns:
        Optional[str]: Remaining tasks in the queue
    """
    import json
    from letta_client import Letta

    client = Letta(base_url="http://localhost:8283")
    # get the block
    block = client.agents.blocks.retrieve(
        agent_id=agent_state.id,
        block_label="tasks",
    )
    tasks = json.loads(block.value)
    if len(tasks) == 0:
        return None
    task = tasks[0]

    # update the block value with the remaining tasks
    remaining_tasks = json.dumps(tasks[1:])
    client.agents.blocks.modify(
        agent_id=agent_state.id,
        value=remaining_tasks,
        block_label="tasks"
    )
    return f"Remaining tasks {remaining_tasks}"
We can now upsert the tools into Letta
task_queue_pop_tool = client.tools.upsert_from_function(
    func=task_queue_pop
)
task_queue_push_tool = client.tools.upsert_from_function(
    func=task_queue_push
)
Create a Letta agent that uses these task tools:
import json

task_agent = client.agents.create(
    system=open("task_queue_system_prompt.txt", "r").read(),
    memory_blocks=[
        {
            "label": "tasks",
            "value": json.dumps([])
        }
    ],
    model="openai/gpt-4o-mini-2024-07-18",
    embedding="openai/text-embedding-3-small",
    tool_ids=[task_queue_pop_tool.id, task_queue_push_tool.id],
    include_base_tools=False,
    tools=["send_message"]
)

Good work!! The tools are now added to the agent successfully.
Let us use them by sending some tasks:
response_stream = client.agents.messages.create_stream(
    agent_id=task_agent.id,
    messages=[
        {
            "role": "user",
            "content": "Add 'start calling me Mickey' and "
                       + "'tell me a haiku about my name' as two separate tasks."
        }
    ]
)

for chunk in response_stream:
    print_message(chunk)

We can also retrieve the task list

I can also see the usage statistics attached to it. You can also observe that request_heartbeat is true until the agent returns a response to the user.
How do we use External Memory with an example of RAG?
So far, the agents have used memory that is stored and pushed into internal memory. Let us see how to use an external source; in this case, a PDF file.
For this, don't forget to run Letta locally.
I would like to create employee handbook embeddings as a data source:
source = client.sources.create(
    name="employee_handbook",
    embedding="openai/text-embedding-3-small"
)
source
Next, upload the file into the data source:
job = client.sources.files.upload(
    source_id=source.id,
    file=open("handbook.pdf", "rb")
)
Let us check the status of the job
import time

while job.status != 'completed':
    job = client.jobs.retrieve(job.id)
    print(job.status)
    time.sleep(1)
The final output is 'completed', which means the data has been added to our data source.

Let us create the agent and attach the data source
agent_state = client.agents.create(
    memory_blocks=[
        {
            "label": "human",
            "value": "My name is Sarah"
        },
        {
            "label": "persona",
            "value": "You are a helpful assistant"
        }
    ],
    model="openai/gpt-4o-mini-2024-07-18",
    embedding="openai/text-embedding-3-small"
)
Now attach the data source
agent_state = client.agents.sources.attach(
    agent_id=agent_state.id,
    source_id=source.id
)
Checking the data attached to the agent

Now it's time to test the agent on the handbook we attached:
response = client.agents.messages.create(
    agent_id=agent_state.id,
    messages=[
        {
            "role": "user",
            "content": "Search archival for our company's vacation policies"
        }
    ]
)

for message in response.messages:
    print_message(message)

Wonderful!! We have now created a RAG setup that can retrieve information from the data sources.
How can we run this as a service and share memory across agents?
In the Letta framework, agents are designed to run as a service, so that real applications can communicate with them via REST APIs.
Let us learn this using a multi-agent recruiting workflow where one agent evaluates candidates and makes the decision, and the other agent drafts personalized outreach emails. Both share a memory block about the company they are recruiting for.
First, make sure the Letta server is running on the local system. For this use case, we create a block with information about the company that both agents can use:
company_description = "The company is called AgentOS " \
    + "and is building AI tools to make it easier to create " \
    + "and deploy LLM agents."

company_block = client.blocks.create(
    value=company_description,
    label="company",
    limit=10000  # character limit
)
Then we create the tool the outreach agent will use to draft emails:
def draft_candidate_email(content: str):
    """
    Draft an email to reach out to a candidate.

    Args:
        content (str): Content of the email
    """
    return f"Here is a draft email: {content}"

draft_email_tool = client.tools.upsert_from_function(func=draft_candidate_email)
Now that the tool is ready, we can create the agent that uses it:
outreach_persona = (
    "You are responsible for drafting emails "
    "on behalf of a company with the draft_candidate_email tool. "
    "Candidates to email will be messaged to you. "
)

outreach_agent = client.agents.create(
    name="outreach_agent",
    memory_blocks=[
        {"label": "persona", "value": outreach_persona}
    ],
    model="openai/gpt-4o-mini-2024-07-18",
    embedding="openai/text-embedding-ada-002",
    tools=[draft_email_tool.name],
    block_ids=[company_block.id]
)
Now let us create the tool for the evaluation agent
def reject(candidate_name: str):
    """
    Reject a candidate.

    Args:
        candidate_name (str): The name of the candidate
    """
    return

reject_tool = client.tools.upsert_from_function(func=reject)
It is time to create the persona for the evaluation agent:
skills = "Front-end (React, Typescript) or software engineering skills"
eval_persona = (
    f"You are responsible for evaluating candidates. "
    f"Ideal candidates have skills: {skills}. "
    "Reject bad candidates with your reject tool. "
    f"Send strong candidates to agent ID {outreach_agent.id}. "
    "You must either reject or send candidates to the other agent. "
)
Now build the agent, as we have all the necessary tools and blocks:
eval_agent = client.agents.create(
    name="eval_agent",
    memory_blocks=[
        {"label": "persona", "value": eval_persona}
    ],
    model="openai/gpt-4o-mini-2024-07-18",
    embedding="openai/text-embedding-ada-002",
    tool_ids=[reject_tool.id],
    tools=['send_message_to_agent_and_wait_for_reply'],
    include_base_tools=False,
    block_ids=[company_block.id],
    tool_rules=[
        {
            "type": "exit_loop",
            "tool_name": "send_message_to_agent_and_wait_for_reply"
        }
    ]
)
In both agents above, we pass the block ID so each agent can use the company information.

As you can see, the 2 tools are created for the agent.
Now it is time to test it. Send a candidate's resume so the evaluator can decide whether to proceed with the candidate:
resume = open("resumes/tony_stark.txt", "r").read()
Send the data to the agent
response = client.agents.messages.create_stream(
    agent_id=eval_agent.id,
    messages=[
        {
            "role": "user",
            "content": f"Evaluate: {resume}"
        }
    ]
)

for message in response:
    print_message(message)
Here you go with the output

We can also update the shared memory
response = client.agents.messages.create_stream(
    agent_id=outreach_agent.id,
    messages=[
        {
            "role": "user",
            "content": "The company has rebranded to Letta"
        }
    ]
)

for message in response:
    print_message(message)

Hurray!! We have used all the functionalities of MemGPT through the Letta client.