GenAI Series — A way to optimize your Agentic Apps


In the GenAI world, everything revolves around prompts. Two users asking for the same information, phrased in different ways, often get different results from the LLM. If the output is not what we want, we tweak the prompt multiple times to get optimal results, and the process starts over when we change the underlying LLM. But is there an optimal way to write the prompt?

DSPy streamlines the whole prompt engineering process: you define the inputs your model needs and the outputs it returns, and provide a dataset of inputs paired with desired outputs; DSPy takes care of the prompt itself.

Instead of tweaking prompts by hand, let's reduce the manual work, use DSPy, and see how it achieves optimal results.

DSPy has two core abstractions: Signature and Module.

When building an application, we specify a signature that defines the input the LLM component expects and the output it should return. For example, if we have to predict whether a mail is spam or not, the input is the text and the output is 1 or 0, representing spam or not spam.

A module uses these signatures to actually call the LLM and get results.

DSPy is a GenAI authoring framework that simplifies the development of GenAI applications. We can use it for multiple use cases, such as improving RAG (Retrieval Augmented Generation), improving chain-of-thought prompting, generating structured output, or improving LLM pipelines for agents and tools.

As we know, prompts decide the quality of the output, but there is no clean or straightforward way to say which phrasing gives better results. We often resort to trial and error and tweak prompts repeatedly, which increases operational cost. Even then, we are not sure which changes actually helped. And it gets worse if we have to switch the underlying model.

What does DSPy bring?

DSPy is a lightweight and flexible framework that supports automatic prompt optimization and fine-tuning of language model weights through DSPy optimizers. It also supports streaming and async functions.

LM-agnostic programming, not LM-biased programming — the same program works with any underlying LLM, treating the LM as just a RESTful API

MLFlow Integration — makes it easy to debug and trace our applications

Automatic Program Optimization — create a DSPy optimizer, apply it to a program, and get quality improvements automatically

Use Case to build Sentiment Analysis using DSPy built-in modules

Let’s get a little more technical with the Python classes — DSPy has two classes: Signature and Module.

Signature (dspy.Signature) defines the input and output of an LLM interaction. Fields are declared with InputField and OutputField. This is called a class-based Signature.

There is another way of defining the same thing using a string argument. In the string, an arrow separates the fields: the left side is the input and the right side is the output. This is called a string-based Signature.
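Purely to illustrate the string format (this sketch is not DSPy's internal parser), the arrow splits field names like so:

```python
def split_signature(spec: str):
    """Split an "inputs -> outputs" spec into input and output name lists (illustration only)."""
    inputs, outputs = spec.split("->")
    return [s.strip() for s in inputs.split(",")], [s.strip() for s in outputs.split(",")]

print(split_signature("text -> sentiment"))            # (['text'], ['sentiment'])
print(split_signature("question, context -> answer"))  # (['question', 'context'], ['answer'])
```

Multiple fields on either side are comma-separated, as in the second call.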

Module (dspy.Module) is the core unit of DSPy, encapsulating the logic for interacting with LMs. Modules have attributes like lm (which language model to use) and demos (few-shot examples).

Some of the built-in modules are:

  • dspy.Predict — the basic module that formats the user input and parses the LM output based on the signature
  • dspy.ChainOfThought — asks the LM to include reasoning along with the answer
  • dspy.ReAct — a common abstraction for iterative tool calling
  • dspy.ProgramOfThought — a special tool-calling module similar to ReAct, but the tool call is generated code
  • dspy.Refine — users set a reward function and threshold; if the threshold is not met, it retries

We can also write custom modules as per our needs.

Sentiment Analysis with built-in Modules

  • Step: 1 — choose the LLM
import dspy

# GEMINI_API_KEY must be defined beforehand (e.g., loaded from an environment variable)
dspy.settings.configure(lm=dspy.LM("gemini/gemini-2.5-flash", api_key=GEMINI_API_KEY))
  • Step: 2 — Create a class-based signature with input and output fields. Make sure to put the task description as a docstring on the first line inside the class; DSPy uses it as part of the prompt.
class SentimentClassifier(dspy.Signature):
    """Classify the sentiment of a text."""

    text: str = dspy.InputField(desc="input text to classify sentiment")
    sentiment: int = dspy.OutputField(desc="sentiment, the higher the more positive", ge=0, le=10)
  • Step: 3 — Alternatively, create a string-based signature:
str_signature = dspy.make_signature("text -> sentiment")
  • Step: 4 — Create the module and predict the sentiment
predict = dspy.Predict(SentimentClassifier)

output = predict(text="I am feeling pretty happy!")
print(output)

In the code above, you never saw the prompt being created. So how did DSPy actually build one? To view it, we can call dspy.inspect_history(n=1).

Here is the output showing what happened behind the scenes and how DSPy created the prompt.

  • Step: 5 — The ChainOfThought module
cot = dspy.ChainOfThought(SentimentClassifier)

output = cot(text="I am liking this product!")
print(output)

This is the output which shows the chain of thought reasoning.

Behind the scenes —

dspy.Predict composes a multi-turn prompt using the user input (formatted by the signature and adapter) and demos (examples that guide the language model), then outputs a dict-like object containing the extracted output fields.

Input flow — the adapter takes in the signature, the user input, and the module's attributes, and transforms them into the familiar multi-turn messages with different roles.

dspy.LM — A thin wrapper that unifies the experience of different LM providers.

After receiving all the information, including the signature, user queries, and other attributes, the adapter formats the actual prompt and sends it to the LM. The prompt tells the LM the expected response format according to the adapter type in use. We can change the adapter to suit our language model, though DSPy normally selects one automatically.

Output Flow — The adapter takes in signature and LM response to parse the response into the user-specified fields.

This flow is just the reverse. Because the LM returns a response in the format we defined in the prompt, the adapter knows how to parse it into the required fields and send it back to the module. The parsed result is wrapped in a dspy.Prediction, which is similar to a dict but allows both attribute access and key access.
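To illustrate the idea (this is not DSPy's actual Prediction class, just a minimal sketch), an object that supports both access styles can be built like this:

```python
class PredictionLike(dict):
    """A dict whose keys are also readable as attributes (illustration only)."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

p = PredictionLike(sentiment=9, reasoning="positive wording")
print(p.sentiment)      # 9, attribute access
print(p["sentiment"])   # 9, key access
```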

Customizing DSPy module

Using a different adapter

dspy.configure(adapter=dspy.JSONAdapter())

print(cot(text="I am feeling pretty happy!"))
dspy.inspect_history(n=1)

Here we are asking for JSON output, so instead of a section header followed by a value, the response is formatted as a JSON object that the adapter can parse.
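To see why JSON output is convenient to parse, here is a minimal illustration (the response string below is a made-up example, not a real LM reply):

```python
import json

# A response shaped the way a JSON adapter would request it (hypothetical example)
lm_response = '{"reasoning": "The text expresses happiness.", "sentiment": 9}'

fields = json.loads(lm_response)
print(fields["sentiment"])  # 9
```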

Building a Name the Celebrity Game with a custom DSPy module

The game: Player 1 (the user) thinks of a celebrity name, and Player 2 (the LM) asks yes-or-no questions until it finds the name or uses up all its tries. Here is the code for the module.

class QuestionGenerator(dspy.Signature):
    """Generate a yes or no question in order to guess the celebrity name in users' mind. You can ask in general or directly guess the name if you think the signal is enough. You should never ask the same question in the past_questions."""

    past_questions: list[str] = dspy.InputField(desc="past questions asked")
    past_answers: list[bool] = dspy.InputField(desc="past answers")
    new_question: str = dspy.OutputField(desc="new question that can help narrow down the celebrity name")
    guess_made: bool = dspy.OutputField(desc="If the new_question is the celebrity name guess, set to True, if it is still a general question set to False")


class Reflection(dspy.Signature):
    """Provide reflection on the guessing process"""

    correct_celebrity_name: str = dspy.InputField(desc="the celebrity name in user's mind")
    final_guessor_question: str = dspy.InputField(desc="the final guess or question LM made")
    past_questions: list[str] = dspy.InputField(desc="past questions asked")
    past_answers: list[bool] = dspy.InputField(desc="past answers")

    reflection: str = dspy.OutputField(
        desc="reflection on the guessing process, including what was done well and what can be improved"
    )


def ask(prompt, valid_responses=("y", "n")):
    while True:
        response = input(f"{prompt} ({'/'.join(valid_responses)}): ").strip().lower()
        if response in valid_responses:
            return response
        print(f"Please enter one of: {', '.join(valid_responses)}")


class CelebrityGuess(dspy.Module):
    def __init__(self, max_tries=10):
        super().__init__()

        self.question_generator = dspy.ChainOfThought(QuestionGenerator)
        self.reflection = dspy.ChainOfThought(Reflection)

        self.max_tries = max_tries

    def forward(self):
        celebrity_name = input("Please think of a celebrity name, once you are ready, type the name and press enter...")
        past_questions = []
        past_answers = []

        correct_guess = False

        for i in range(self.max_tries):
            question = self.question_generator(
                past_questions=past_questions,
                past_answers=past_answers,
            )
            answer = ask(f"{question.new_question}") == "y"
            past_questions.append(question.new_question)
            past_answers.append(answer)

            if question.guess_made and answer:
                correct_guess = True
                break

        if correct_guess:
            print("Yay! I got it right!")
        else:
            print("Oops, I couldn't guess it right.")

        reflection = self.reflection(
            correct_celebrity_name=celebrity_name,
            final_guessor_question=question.new_question,
            past_questions=past_questions,
            past_answers=past_answers,
        )
        print(reflection.reflection)

Then we create an object of the CelebrityGuess class and run it:

celebrity_guess = CelebrityGuess()
celebrity_guess()

In the code above, QuestionGenerator is a signature used with a ChainOfThought module to generate a yes-or-no question. The signature has two input fields, past_questions and past_answers, which start as empty lists, and two output fields: new_question, the next question to ask, and guess_made, which indicates whether the question is a general one or a direct guess at the name.

The second signature is Reflection. After the game wraps up, we want a self-reflection, which takes in the correct celebrity name, the final guess, and the past questions and answers. The output is a single string reflecting on what went well and what went wrong in the guessing process.

We define a custom forward method: after the user enters the name, it enters a loop and asks a series of questions until the correct guess is made or all tries are used up. When the loop finishes, it runs the self-reflection and prints the result. We can put any logic inside forward, which keeps the module flexible. And since we have a signature, we need not worry about parsing the LM response — the adapter handles it.

We can also save and load the DSPy module.

celebrity_guess.save("dspy_program/celebrity.json", save_program=False)
celebrity_guess.load("dspy_program/celebrity.json")

# or save the whole program to a folder and load it back
celebrity_guess.save("dspy_program/celebrity/", save_program=True)
loaded = dspy.load("dspy_program/celebrity/")
loaded()

Debug with MLFlow

Tracing records the inputs and outputs of intermediate steps and captures the hierarchical call stack. This helps us debug GenAI applications, which are complex inside. If a call to the LM fails, it blocks the whole program, and even though DSPy provides inspect_history to find the LM calls, it is hard to trace back. Tracing provides an easy path to interpretability and debugging.

MLFlow is an open-source AI ops package that streamlines GenAI app development. MLFlow helps with the full lifecycle of building GenAI applications, ensuring that each phase is traceable and reproducible. Both the MLFlow server and client are fully open source. We only need to add one line of code:

import mlflow

mlflow.dspy.autolog()

After this, our program is automatically traced, and the traces are saved in the MLFlow server, which you can access at any time after they are generated. The following DSPy components are traced:

  • Module — all modules and submodules are traced
  • Adapter — see how the adapter formats the user query and processes the LM response
  • LM — trace the calls to the LM and inspect the actual prompt and the actual LM response
  • Tool — tool calls wrapped by DSPy modules are traced
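A fuller setup might look like the following sketch, assuming a local MLFlow tracking server is running; the URI and experiment name here are placeholders, not values from this article:

```python
import mlflow

# Point to the tracking server and group traces under an experiment (placeholder values)
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("dspy-demo")

# One line enables automatic tracing of DSPy modules, adapters, LM calls, and tools
mlflow.dspy.autolog()
```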

Build an Airline Customer Service Agent with dspy.ReAct

from pydantic import BaseModel

class Date(BaseModel):
    # Somehow LLM is bad at specifying `datetime.datetime`
    year: int
    month: int
    day: int
    hour: int

class UserProfile(BaseModel):
    user_id: str
    name: str
    email: str

class Flight(BaseModel):
    flight_id: str
    date_time: Date
    origin: str
    destination: str
    duration: float
    price: float

class Itinerary(BaseModel):
    confirmation_number: str
    user_profile: UserProfile
    flight: Flight

class Ticket(BaseModel):
    user_request: str
    user_profile: UserProfile

The above code defines the Pydantic model classes for the objects we need.

user_database = {
    "Adam": UserProfile(user_id="1", name="Adam", email="adam@gmail.com"),
    "Bob": UserProfile(user_id="2", name="Bob", email="bob@gmail.com"),
    "Chelsie": UserProfile(user_id="3", name="Chelsie", email="chelsie@gmail.com"),
    "David": UserProfile(user_id="4", name="David", email="david@gmail.com"),
}

flight_database = {
    "DA123": Flight(
        flight_id="DA123",
        origin="SFO",
        destination="JFK",
        date_time=Date(year=2025, month=9, day=1, hour=1),
        duration=3,
        price=200,
    ),
    "DA125": Flight(
        flight_id="DA125",
        origin="SFO",
        destination="JFK",
        date_time=Date(year=2025, month=9, day=1, hour=7),
        duration=9,
        price=500,
    ),
    "DA456": Flight(
        flight_id="DA456",
        origin="SFO",
        destination="SNA",
        date_time=Date(year=2025, month=10, day=1, hour=1),
        duration=2,
        price=100,
    ),
    "DA460": Flight(
        flight_id="DA460",
        origin="SFO",
        destination="SNA",
        date_time=Date(year=2025, month=10, day=1, hour=9),
        duration=2,
        price=120,
    ),
}

itinery_database = {}
ticket_database = {}

The above code sets up the dummy data.

import random
import string


def fetch_flight_info(date: Date, origin: str, destination: str):
    """Fetch flight information from origin to destination on the given date"""
    flights = []

    for flight_id, flight in flight_database.items():
        if (
            flight.date_time.year == date.year
            and flight.date_time.month == date.month
            and flight.date_time.day == date.day
            and flight.origin == origin
            and flight.destination == destination
        ):
            flights.append(flight)
    return flights


def fetch_itinerary(confirmation_number: str):
    """Fetch a booked itinerary information from database"""
    return itinery_database.get(confirmation_number)


def pick_flight(flights: list[Flight]):
    """Pick up the best flight that matches users' request."""
    sorted_flights = sorted(
        flights,
        key=lambda x: (
            x.get("duration") if isinstance(x, dict) else x.duration,
            x.get("price") if isinstance(x, dict) else x.price,
        ),
    )
    return sorted_flights[0]


def generate_id(length=8):
    chars = string.ascii_lowercase + string.digits
    return "".join(random.choices(chars, k=length))


def book_itinerary(flight: Flight, user_profile: UserProfile):
    """Book a flight on behalf of the user."""
    confirmation_number = generate_id()
    while confirmation_number in itinery_database:
        confirmation_number = generate_id()
    itinery_database[confirmation_number] = Itinerary(
        confirmation_number=confirmation_number,
        user_profile=user_profile,
        flight=flight,
    )
    return confirmation_number, itinery_database[confirmation_number]


def cancel_itinerary(confirmation_number: str, user_profile: UserProfile):
    """Cancel an itinerary on behalf of the user."""
    if confirmation_number in itinery_database:
        del itinery_database[confirmation_number]
        return
    raise ValueError("Cannot find the itinerary, please check your confirmation number.")


def get_user_info(name: str):
    """Fetch the user profile from database with given name."""
    return user_database.get(name)


def file_ticket(user_request: str, user_profile: UserProfile):
    """File a customer support ticket if this is something the agent cannot handle."""
    ticket_id = generate_id(length=6)
    ticket_database[ticket_id] = Ticket(
        user_request=user_request,
        user_profile=user_profile,
    )
    return ticket_id

These are the tools implementing the required functionality.
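As a quick, self-contained sanity check of the tie-breaking logic in pick_flight (sort by duration, then by price), here is a sketch using a plain dataclass stand-in; FlightStub and pick_best are illustrative names, not part of the agent code:

```python
from dataclasses import dataclass

@dataclass
class FlightStub:
    flight_id: str
    duration: float
    price: float

def pick_best(flights):
    """Shortest flight wins; price breaks ties (same ordering as pick_flight)."""
    return sorted(flights, key=lambda f: (f.duration, f.price))[0]

flights = [
    FlightStub("DA125", duration=9, price=500),
    FlightStub("DA456", duration=2, price=100),
    FlightStub("DA460", duration=2, price=120),
]
print(pick_best(flights).flight_id)  # DA456: ties DA460 on duration but is cheaper
```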

class DSPyAirlineCustomerSerice(dspy.Signature):
    """You are an airline customer service agent. You are given a list of tools to handle user requests. You should decide the right tool to use in order to fulfill the user's request."""

    user_request: str = dspy.InputField()
    process_result: str = dspy.OutputField(desc="Message that summarizes the process result, and the information users need, e.g., the confirmation_number if it's a flight booking request.")

Here is the signature class with its input and output fields.

react = dspy.ReAct(
    DSPyAirlineCustomerSerice,
    tools=[
        fetch_flight_info,
        fetch_itinerary,
        pick_flight,
        book_itinerary,
        cancel_itinerary,
        get_user_info,
        file_ticket,
    ],
)

This ReAct module combines all the tools defined earlier.

result = react(user_request="please help me book a flight from SFO to JFK on 09/01/2025, my name is Adam")

Here is the result in the MLFlow traces.

Using DSPy Optimizer

We first enable MLFlow for the application as explained above. Now let us build a RAG agent to see the optimizer in action.

Let us define a Wikipedia search tool and use it with the ReAct module:

def search_wikipedia(query: str) -> list[str]:
    results = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")(query, k=3)
    return [x["text"] for x in results]

react = dspy.ReAct("question -> answer", tools=[search_wikipedia])

Parse the JSONL files into a train set and a validation set:

import json

# Load trainset
trainset = []
with open("trainset.jsonl", "r") as f:
    for line in f:
        trainset.append(dspy.Example(**json.loads(line)).with_inputs("question"))

# Load valset
valset = []
with open("valset.jsonl", "r") as f:
    for line in f:
        valset.append(dspy.Example(**json.loads(line)).with_inputs("question"))
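For reference, each line in these files is one JSON object whose keys become Example fields. Given the exact-match metric used below, a line presumably looks like this (hypothetical sample, not taken from the actual dataset):

```json
{"question": "Which country hosted the 2016 Summer Olympics?", "answer": "Brazil"}
```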

Use the optimizer

tp = dspy.MIPROv2(
    metric=dspy.evaluate.answer_exact_match,
    auto="light",
    num_threads=16
)

# Load a cached memory to avoid repeated LM calls
dspy.cache.load_memory_cache("./memory_cache.pkl")

# Compile the optimizer
optimized_react = tp.compile(
    react,
    trainset=trainset,
    valset=valset,
    requires_permission_to_run=False,
)

Here is the output of this

Let us evaluate the optimized program:

evaluator = dspy.Evaluate(
    metric=dspy.evaluate.answer_exact_match,
    devset=valset,
    display_table=True,
    display_progress=True,
    num_threads=24,
)

original_score = evaluator(react)
print(f"Original score: {original_score}")

Optimized score

optimized_score = evaluator(optimized_react)
print(f"Optimized score: {optimized_score}")

As you can see, the score improved from 31 to 54 percent.

That wraps up DSPy, a framework that simplifies interaction with LLMs and agent development. It provides automatic program optimization through DSPy optimizers, along with native MLFlow integration for easy debugging. We use DSPy signatures to define inputs and outputs, and DSPy modules to wrap custom logic.

Hope you had a good learning experience!