An In-Depth Guide to Evaluating LLMs and RAG

LLM evaluation is a crucial process used to assess the performance and capabilities of large language models. It can take the form of multiple-choice question answering, open-ended instructions, or feedback from real users.

While general-purpose evaluations are the most popular, with benchmarks like Massive Multitask Language Understanding (MMLU) or the LMSYS Chatbot Arena, domain- and task-specific models benefit from narrower approaches. In the case of a RAG system, we may have to expand the evaluation framework to the entire pipeline, including modules like retrievers and post-processors.

Model Evaluation

The objective here is to assess the capabilities of a single model without any prompt engineering, RAG pipeline, and so on. This matters both for selecting the most relevant LLM and for making sure that fine-tuning actually improves the model.

Difference between ML Evaluation and LLM Evaluation

ML evaluation is centered on assessing the performance of models designed for tasks like prediction, classification, and regression. LLM evaluation focuses on how well the model understands and generates language, whereas ML evaluation measures how accurately and efficiently a model processes structured data to produce specific outcomes.

ML models are often designed for specific tasks such as predicting stock prices or detecting outliers in numerical and categorical data, where evaluation is straightforward. An LLM's task is to interpret and generate language, which adds a layer of subjectivity. Along with numerical benchmarks, LLM evaluation therefore requires qualitative assessments, examining how well the model produces coherent, relevant, and contextually accurate responses in natural language.

  • ML can rely on performance metrics like accuracy, precision, recall, or mean squared error, but for an LLM that handles many different tasks we cannot rely on numerical metrics alone
  • A critical part of ML is manually selecting or transforming the relevant data features before training the model. LLMs handle raw text directly, reducing the need for manual feature engineering
  • In ML, we can often interpret why a model made certain predictions or classifications, and this can be at the core of evaluation; with an LLM, we may have to request an explanation during the generation process to gain insight into the model's decision-making

General Purpose LLM Evaluations

This refers to metrics dedicated to base and general-purpose fine-tuned models. They cover a breadth of capabilities correlated with knowledge and usefulness, without focusing on specific tasks or domains. Based on the strengths and weaknesses of the model, it is possible to tweak the dataset and hyperparameters, or even modify the architecture.

We can look at three categories of evaluation:

Pre-training — During pre-training we monitor how well the model learns. These are low-level metrics and straightforward to compute.

  • Training loss — Based on cross-entropy loss, measures the difference between the model’s predicted probability distribution and the true distribution of the next token
  • Validation loss — The same loss as training loss but on a held-out validation set, to assess generalization
  • Perplexity — The exponential of the cross-entropy loss, representing how surprised the model is by the data (see the sketch after this list)
  • Gradient norm — Monitors the magnitude of gradients during training to detect potential instabilities or vanishing/exploding gradients
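
As a quick illustration of how these quantities relate, here is a minimal PyTorch sketch with toy logits and targets; the shapes mirror what a causal LM produces for next-token prediction:

import torch
import torch.nn.functional as F

# Toy next-token prediction: 4 positions, 5-token vocabulary
logits = torch.randn(4, 5)               # model outputs (sequence_length, vocab_size)
targets = torch.tensor([1, 3, 0, 2])     # true next-token ids

cross_entropy = F.cross_entropy(logits, targets)  # training/validation loss
perplexity = torch.exp(cross_entropy)             # how "surprised" the model is

print(f"loss={cross_entropy.item():.3f}  perplexity={perplexity.item():.3f}")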

We can also introduce benchmarks like HellaSwag at this stage, but there is a risk of overfitting to these evaluations.

After pre-training — A suite of internal and public benchmarks is used, and the list is long. A few examples:

  • MMLU (Knowledge) — Tests models on multiple-choice questions across 57 subjects, from elementary to professional levels
  • HellaSwag (Reasoning) — Challenges models to complete a given situation with the most plausible ending from multiple choices
  • ARC-C (Reasoning) — Evaluates models on grade-school-level multiple-choice science questions that require causal reasoning

Many of these datasets are also used to evaluate general-purpose fine-tuned models, where we can focus on the difference in score between the base and the fine-tuned model. This can also reveal whether the fine-tuning data was too close to the test set.

After fine-tuning — In addition to the above benchmarks, fine-tuned models have their own benchmarks, designed for models trained with supervised fine-tuning (SFT) and preference alignment.

  • IFEval (Instruction Following) — Assesses a model’s ability to follow instructions with particular constraints, like not outputting any commas in the answer
  • Chatbot Arena (Conversation) — A framework where humans vote for the best answer to an instruction, comparing two models in head-to-head conversations
  • AlpacaEval (Instruction Following) — An automatic evaluation for fine-tuned models that is highly correlated with Chatbot Arena
  • MT-Bench (Conversation) — Evaluates models on multi-turn conversations, testing their ability to maintain context and provide coherent responses
  • GAIA (Agentic) — Tests a wide range of abilities, such as tool use and web browsing, in a multi-step fashion

Understanding how these evaluations are designed and used is important for choosing the best LLM for your application. For example, if you want to build a chatbot, pick a model that scores well on Chatbot Arena.

Domain Specific LLM Evaluations

These benchmarks target more fine-grained capabilities, with more depth than the previous ones. The choice of benchmark depends entirely on the domain in question. Some domain-specific evaluations available as leaderboards on the Hugging Face Hub are:

  • Open Medical-LLM Leaderboard — Evaluates the performance of LLMs on medical question answering tasks
  • BigCodeBench Leaderboard — Evaluates the performance of code LLMs, featuring two main categories: code completion based on structured docstrings and code generation from natural language instructions
  • Hallucinations Leaderboard — Evaluates LLMs’ tendency to produce false or unsupported information across 16 diverse tasks spanning 5 categories
  • Enterprise Scenarios Leaderboard — Evaluates the performance of LLMs on 6 real-world enterprise use cases, covering diverse tasks relevant to business applications

There are also translations of general-purpose benchmarks for specific languages such as Korean or Portuguese. Whatever the domain, good evaluations should meet three goals:

  • They should be complex enough to challenge models and distinguish good outputs from bad ones
  • They should be diverse and cover as many topics and scenarios as possible
  • They should be practical and easy to run

Task-Specific LLM Evaluations

While general-purpose and domain-specific evaluations indicate strong base or instruct models, they cannot provide insight into how well these models work for a given task. Because of their narrow focus, task-specific LLMs can rarely rely on pre-existing evaluation datasets. This can be advantageous, though, because their outputs tend to be more structured and easier to evaluate using traditional ML metrics.

  • Summarization tasks can leverage the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric, which measures the overlap between the generated text and the reference text using n-grams (a minimal sketch follows this list)
  • Classification tasks benefit from metrics like accuracy (proportion of correctly predicted instances out of all instances), precision (ratio of true positive predictions to all positive predictions), recall (ratio of true positive predictions to all actual positive instances), and F1 score (harmonic mean of precision and recall, used to balance both metrics)
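
For instance, ROUGE-1 is simply unigram overlap between the generated and reference texts. Below is a minimal pure-Python sketch of ROUGE-1 F1; in practice you would typically use a library such as rouge_score:

from collections import Counter

def rouge_1_f1(generated: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between generated and reference text."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # clipped counts of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_1_f1("the cat sat on the mat", "a cat sat on a mat"))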

If none of these fit, we can reuse the pattern of existing benchmarks, such as multiple-choice question answering. There are two main ways of evaluating models with this scheme.

  • Text generation — having the model generate free-text responses and comparing them to the predefined answer choices
  • Log-likelihood evaluation — looking at the probability the model assigns to each answer option, without requiring text generation. This allows a finer-grained assessment of model understanding, as it captures the relative confidence the model has in the different options (see the sketch after this list)
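
Here is a minimal sketch of log-likelihood evaluation, assuming a small Hugging Face causal LM (distilgpt2 purely for illustration): each answer option is scored by the total log-probability the model assigns to its tokens given the question, and the highest-scoring option is taken as the model's choice. In practice the scores are often normalized by option length.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "distilgpt2"  # tiny model, illustration only
tok = AutoTokenizer.from_pretrained(model_id)
mdl = AutoModelForCausalLM.from_pretrained(model_id)
mdl.eval()

def option_log_likelihood(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens given the question."""
    q_ids = tok.encode(question)
    o_ids = tok.encode(" " + option)
    input_ids = torch.tensor([q_ids + o_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(mdl(input_ids).logits, dim=-1)
    # The token at position p is predicted by the logits at position p - 1
    return sum(
        log_probs[0, len(q_ids) + i - 1, tok_id].item()
        for i, tok_id in enumerate(o_ids)
    )

question = "The capital of France is"
scores = {opt: option_log_likelihood(question, opt) for opt in ["Paris", "Berlin", "Madrid"]}
print(max(scores, key=scores.get), scores)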

LLM-as-a-judge

If the task is too open-ended, traditional ML metrics and multiple-choice question answering might not be relevant. In this scenario, the LLM-as-a-judge technique can be used to evaluate the quality of answers. If you have ground-truth answers, providing them as additional context improves the accuracy of the evaluation. It is also recommended to use large models as judges and to iteratively refine your prompt: understanding the judge's reasoning and fixing failure modes through additional prompt instructions is important. To parse the output easily, it is good practice to use structured generation.
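
A minimal sketch of the idea is shown below, assuming judge_llm is any LangChain-style chat model with an .invoke() method; the prompt and helper are illustrative, not a fixed API:

import json

# Hypothetical judge prompt; the ground-truth reference is passed as extra context
JUDGE_PROMPT = """You are an impartial judge. Rate the candidate answer on a 1-5 scale
for correctness and relevance, using the reference answer as ground truth.
Question: {question}
Reference answer: {reference}
Candidate answer: {answer}
Reply with JSON only: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def judge(judge_llm, question: str, reference: str, answer: str) -> dict:
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    resp = judge_llm.invoke(prompt)
    # Structured (JSON) output keeps parsing trivial; retry or repair on failure
    return json.loads(resp.content)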

A judge LLM can also exhibit bias toward assertive or verbose responses and may overrate answers that sound more confident. Another drawback is that it can lack domain expertise for specialized topics, leading to misjudgment. Consistency is also a concern, as an LLM might score similar responses differently. Finally, it can prefer particular writing styles unrelated to actual answer quality.

To mitigate these issues, we can combine LLM evaluations with other metrics, use multiple judges and carefully design prompts to address biases.

Once the model is properly evaluated, it can be included in a broader system.

RAG Evaluation

While traditional LLM evaluation focuses on the model’s inherent capabilities, RAG evaluation requires a more comprehensive approach that considers both the model’s generative abilities and its interaction with external information sources.

A RAG system combines the strengths of LLMs with information retrieval mechanisms, allowing it to generate responses that are not only coherent and contextually appropriate but also grounded in up-to-date, externally sourced information.

RAG system evaluation therefore goes beyond assessing a standalone LLM:

  • Retrieval accuracy — How well does the system fetch relevant information? The key metrics are retrieval precision and recall, which measure the accuracy and comprehensiveness of the retrieved information (a minimal sketch follows this list)
  • Integration quality — How effectively is the retrieved information incorporated into the generated response? Measuring the quality of the integration between retrieved data and generated text is crucial
  • Factuality and relevance — Does the final output address the query appropriately while seamlessly blending retrieved and generated content? Measuring the overall factuality and coherence of the output is important
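
For retrieval accuracy in particular, precision and recall reduce to simple set arithmetic over retrieved and ground-truth-relevant chunk IDs, as in this minimal sketch:

def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Precision: share of retrieved chunks that are relevant.
    Recall: share of relevant chunks that were retrieved."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 chunks retrieved, 3 ground-truth relevant chunks, 2 in common
print(retrieval_precision_recall(["d1", "d2", "d3", "d7"], ["d1", "d3", "d9"]))  # (0.5, 0.666...)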

Two popular frameworks for evaluating how well RAG systems incorporate external information into their responses are Ragas and ARES.

Ragas — Retrieval Augmented Generation Assessment

Ragas is an open-source toolkit designed to provide developers with a comprehensive set of tools for RAG evaluation and optimization. It is built around the idea of Metrics-Driven Development (MDD), which relies on data to make well-informed decisions and involves the ongoing monitoring of essential metrics over time to gain valuable insights into an application’s performance.

It enables developers to objectively assess their RAG systems, identify areas for improvement, and track the impact of changes over time.

Ragas also has the ability to synthetically generate diverse and complex test datasets.

This removes the pain of manually creating questions, answers, and contexts, which is time-consuming and labor-intensive. It uses an evolutionary paradigm inspired by works like Evol-Instruct to craft questions with varying characteristics such as reasoning complexity, conditional elements, and multi-context requirements.

Ragas can also generate conversational samples that simulate chat-based questions and follow-up interactions, allowing developers to evaluate their systems in more realistic scenarios.

Ragas Architecture

It provides a suite of LLM-assisted evaluation metrics designed to objectively measure different aspects of RAG system performance.

Faithfulness — Measures the factual consistency of the generated answer against the given context. It works by breaking the answer down into individual claims and verifying whether each claim can be inferred from the provided context. The faithfulness score is the ratio of verifiable claims to the total number of claims in the answer; for example, an answer with four claims of which three are supported by the context scores 3/4 = 0.75.

Answer Relevancy — Evaluates how relevant the generated answer is to the given prompt. An LLM is prompted to generate multiple questions based on the answer, and the metric is the mean cosine similarity between these generated questions and the original question. It helps identify answers that may be factually correct but off-topic or incomplete.

Context Precision — Evaluates whether all the ground-truth relevant items present in the contexts are ranked appropriately. It considers the position of relevant information within the retrieved context, rewarding systems that place the most relevant information at the top.

Context Recall — Measures the extent to which the retrieved context aligns with the annotated ground truth. It analyses each claim in the ground-truth answer to determine whether it can be attributed to the retrieved context, providing insight into the completeness of the retrieved information.

Ragas also provides building blocks for monitoring RAG quality in production environments. By leveraging evaluation results from test datasets and insights gathered from production monitoring, developers can iteratively enhance their applications, whether by fine-tuning retrieval algorithms, adjusting prompt-engineering strategies, or optimizing the balance between retrieved context and LLM generation.

ARES — An Automated Evaluation Framework for RAG

ARES is a tool designed to evaluate RAG systems. It offers an automated process that combines synthetic data generation with fine-tuned classifiers to assess various aspects of RAG performance, including context relevance, answer faithfulness, and answer relevance.

It has three main stages:

Synthetic data generation — Creates datasets that closely mimic real-world scenarios for robust RAG testing. Users configure this process by specifying document file paths, few-shot prompt files, and output locations for the synthetic queries. It supports pre-trained LMs for this task, with the default being google/flan-t5-xxl. Users can control the number of documents sampled and other parameters to balance comprehensive coverage against computational efficiency.

Classifier training — Creates high-precision classifiers to determine the relevance and faithfulness of RAG outputs. Users specify the classification dataset, the test set for evaluation, label columns, and the model choice. ARES uses microsoft/deberta-v3-large as the default model but supports other Hugging Face models. Training parameters such as the number of epochs, the patience value for early stopping, and the learning rate can be tuned to optimize classifier performance.

RAG evaluation — Leverages the trained classifiers and synthetic data to assess the RAG model’s performance. Users provide evaluation datasets, few-shot examples to guide the evaluation, classifier checkpoints, and gold-label paths. It supports various evaluation metrics and can generate confidence intervals for its assessments.

It also supports cloud-based and local runs through vLLM integration.

Ragas and ARES complement each other through distinct approaches to evaluation and dataset generation. Ragas’s strengths are production monitoring and LLM-assisted metrics, which can be combined with ARES’s highly configurable evaluation process and classifier-based assessments.

Implementation of Ragas evaluation for RAG using LangChain

This is a simple workflow: extract the documents, chunk them, create embeddings, and store them in a FAISS vector store used as a retriever, then evaluate the pipeline with Ragas.

Import the libraries

from typing import List, Dict

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.document import Document
from langchain_community.embeddings import HuggingFaceEmbeddings

# ----- Ragas -----
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,    # Does the answer address the question?
    faithfulness,        # Are the claims supported by the retrieved context?
    context_precision,   # Are the retrieved chunks useful for the answer?
    context_recall,      # Do the retrieved chunks cover what was needed?
    answer_correctness,  # Used if you supply ground_truth answers
)
from ragas.run_config import RunConfig
from datasets import Dataset  # Ragas evaluates a Hugging Face Dataset

# Flags used later to pick the answering LLM (ChatGroq/ChatTogether are imported lazily)
USE_GROQ = False
USE_TOGETHER = False

Create a dummy corpus

corpus = [
    "LangChain is a framework for developing applications powered by language models.",
    "FAISS is a library for efficient similarity search and clustering of dense vectors.",
    "Ragas is a library that provides evaluation metrics for Retrieval-Augmented Generation systems.",
    "SentenceTransformers provides easy-to-use sentence embeddings.",
]

docs = [Document(page_content=t) for t in corpus]

Chunk the documents

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
chunks: List[Document] = splitter.split_documents(docs)

Build embeddings and the FAISS retriever

from google.colab import userdata
hf_token = userdata.get("HF_TOKEN")
emb_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = HuggingFaceEmbeddings(
    model_name=emb_model_name,
    model_kwargs={"use_auth_token": hf_token},
)
vectordb = FAISS.from_documents(chunks, embedder)
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

Choose an LLM to answer the questions

def get_answer_llm():
    """
    Returns a LangChain ChatModel-like object with .invoke() support.
    You can switch among Groq, Together, or a tiny local HF model.
    """
    if USE_GROQ:
        # Requires the langchain-groq package and the GROQ_API_KEY env var
        from langchain_groq import ChatGroq
        return ChatGroq(model_name="llama-3.1-8b-instant", temperature=0.2)
    if USE_TOGETHER:
        # Requires the langchain-together package and the TOGETHER_API_KEY env var
        from langchain_together import ChatTogether
        return ChatTogether(model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", temperature=0.2)

    # Fallback: tiny local model (demo-quality only)
    model_id = "distilgpt2"
    tok = AutoTokenizer.from_pretrained(model_id)
    mdl = AutoModelForCausalLM.from_pretrained(model_id)
    mdl.eval()

    class TinyLocal:
        def invoke(self, messages, **kwargs):
            # messages is list[{"role": "user"/"system"/"assistant", "content": "..."}]
            prompt = messages[-1]["content"]
            input_ids = tok.encode(prompt, return_tensors="pt")
            with torch.no_grad():
                out = mdl.generate(input_ids, max_new_tokens=120, do_sample=False)
            text = tok.decode(out[0], skip_special_tokens=True)
            # Return only the segment generated after the prompt
            text = text[len(prompt):].strip()
            return type("Resp", (), {"content": text})()

    return TinyLocal()

answer_llm = get_answer_llm()

Create an evaluation set with questions and ground_truth answers

eval_items = [
    {
        "question": "What is Ragas and what is it used for?",
        "ground_truth": "Ragas is a library for evaluating RAG systems, providing metrics like faithfulness, context precision and recall.",
    },
    {
        "question": "What library can I use for dense vector similarity search?",
        "ground_truth": "FAISS is a library for efficient similarity search and clustering of dense vectors.",
    },
    {
        "question": "How can I create sentence embeddings easily?",
        "ground_truth": "Use SentenceTransformers to get sentence embeddings easily.",
    },
]

Run RAG for each question: retrieve the contexts and generate an answer

def make_rag_answer(q: str) -> Dict:
    ctx_docs = retriever.invoke(q)
    context_texts = [d.page_content for d in ctx_docs]

    # Simple prompt template
    system = (
        "You are a helpful assistant. Use only the provided context when possible. "
        "If the context is insufficient, say you are unsure."
    )
    user = f"Question: {q}\n\nContext:\n" + "\n- ".join([""] + context_texts) + "\n\nAnswer clearly:"

    # LangChain ChatModels accept messages like below; the TinyLocal wrapper also adheres
    resp = answer_llm.invoke([
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ])

    answer_text = getattr(resp, "content", str(resp))
    return {
        "question": q,
        "contexts": context_texts,
        "answer": answer_text,
    }

pred_rows = []
for item in eval_items:
    out = make_rag_answer(item["question"])
    out["ground_truth"] = item["ground_truth"]
    pred_rows.append(out)

Build the Ragas dataset and choose the metrics to evaluate


dataset = Dataset.from_list(pred_rows)

metrics = [
    answer_relevancy,    # needs question + answer
    faithfulness,        # needs contexts + answer
    context_precision,   # needs contexts + question + answer
    context_recall,      # needs contexts + ground_truth + question
    answer_correctness,  # needs ground_truth + answer
]

Provide the judge LLM and embeddings for Ragas. Here we reuse the same embedder; for the judge LLM, Ragas supports LangChain chat models or OpenAI-compatible ones.

ragas_embeddings = HuggingFaceEmbeddings(
    model_name=emb_model_name,
    model_kwargs={"use_auth_token": hf_token},
)
# Note: Ragas needs a real LangChain chat model as judge; the TinyLocal fallback above will not work here
ragas_llm = answer_llm

Run the Ragas evaluation

run_cfg = RunConfig(
    timeout=120,     # per task
    max_retries=5,   # retry transient provider failures
    max_wait=30,     # backoff upper bound
    max_workers=2,   # keep parallelism modest to avoid rate limits/timeouts
)

result = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=ragas_llm,               # judge LLM configured above
    embeddings=ragas_embeddings,
    run_config=run_cfg,
    raise_exceptions=False,      # keep going; failed rows get NaN
)

df = result.to_pandas()  # per-row metric scores
agg = (
    df.select_dtypes(include=["number"])
    .mean(numeric_only=True)
    .to_frame(name="mean")
    .T
)

Now it is time to look at the results.
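
A quick way to inspect them:

print(df.round(3))   # per-sample metric scores
print(agg.round(3))  # mean of each metric across the samples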

From the observations, most of the scores are good except faithfulness for sample 2, likely because the answer invented details not found in the retrieved documents.

Action Items

  • Sort df by faithfulness or answer_correctness to review the worst cases
  • Inspect the question, contexts, and answer fields for those rows to debug
  • Tune the retriever if context_recall is low, the prompt/generator if faithfulness or answer_correctness is low, and the filtering/reranking if context_precision is low
