Task-Specific Evaluation Metrics


Task-specific metrics are essential for understanding how well a model performs on specific tasks. These metrics allow us to quantify and analyze the model’s output across various dimensions, whether it’s generating text, answering questions, or translating languages. In this section, we’ll explain some common evaluation metrics used in LLM tasks.

Perplexity

Perplexity is a measure of how well a probability model predicts a sample. It is often used to evaluate language models by calculating how surprised the model is by a given sequence of words. A low perplexity suggests that the model is confident in its predictions and understands the language well. The higher the perplexity, the less confident the model is in its predictions.

Formula:

PPL(W) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1}) \right)

That is, perplexity is the exponential of the average negative log-likelihood of the words in the sequence, given their previous context.

Implementation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pretrained (or fine-tuned) model and tokenizer
model_name = "gpt2"  # replace with the name or path of your fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."  # replace with your evaluation text
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():  # ensures that only the forward pass is computed
    outputs = model(**inputs, labels=inputs["input_ids"])
    loss = outputs.loss  # average negative log-likelihood per token
    perplexity = torch.exp(loss)  # exponential of the average loss = perplexity

print(f"Perplexity: {perplexity.item()}")

BLEU (Bilingual Evaluation Understudy)

The BLEU score evaluates the output of your LLM application against annotated ground-truth references. It compares n-grams (sequences of n consecutive words) in the model's output with those in the reference text. BLEU is widely used in machine translation and text generation tasks and ranges from 0 (no overlap with the references) to 1 (perfect overlap).

Formula:

\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right), \qquad BP = \min\left(1,\; e^{\,1 - r/c}\right)

where p_n is the modified n-gram precision, w_n are the n-gram weights (typically uniform, 1/N), c is the candidate length, r is the reference length, and BP is the brevity penalty.

Implementation:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference(s) and candidate
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'test']

# Smoothing avoids a zero score when higher-order n-grams have no matches
bleu_score = sentence_bleu(reference, candidate,
                           smoothing_function=SmoothingFunction().method1)
print(f"BLEU Score: {bleu_score}")

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

The ROUGE score is used to evaluate text summaries by comparing the overlap of n-grams, word sequences, and word pairs between a model-generated summary and a reference summary. ROUGE-N measures the proportion (0–1) of n-grams in the reference that also appear in the LLM output, while ROUGE-L is based on the longest common subsequence.

Formula:

\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{References}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}

i.e., the fraction of reference n-grams that also appear in the generated text (n-gram recall).

Implementation:

from rouge_score import rouge_scorer

# ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L (longest common subsequence)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# score(target, prediction): the first argument is the reference, the second the model output
scores = scorer.score('The cat sat on the mat.', 'The cat sat on the rug.')
print(f"ROUGE Scores: {scores}")

METEOR (Metric for Evaluation of Translation with Explicit Ordering)

METEOR is a comprehensive evaluation metric designed to improve upon BLEU by taking both unigram precision and unigram recall into account, as well as word-order differences (through a fragmentation penalty). Unlike BLEU, METEOR also considers stem and synonym matching using external linguistic resources such as WordNet, making it more adaptable to variations in phrasing.

Formula:

\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty}), \qquad F_{\text{mean}} = \frac{10\,P\,R}{R + 9P}, \qquad \text{Penalty} = 0.5 \left( \frac{\text{chunks}}{\text{matched unigrams}} \right)^{3}

where P and R are the unigram precision and recall over the matched words, and the penalty grows as the matches become more fragmented (spread over more chunks).

Implementation:

import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download('wordnet')  # METEOR uses WordNet for stem/synonym matching

# Recent NLTK versions expect pre-tokenized (list-of-tokens) input
reference = "The cat is on the mat.".split()
candidate = "The cat sits on the mat.".split()

meteor = meteor_score([reference], candidate)
print(f"METEOR Score: {meteor}")

Levenshtein Distance (Edit Distance)

The Levenshtein distance, or edit distance, calculates the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one string into another. This metric is particularly useful for tasks where the exact alignment of characters is crucial, such as in spelling correction, OCR (optical character recognition) output evaluation, or comparing short text strings.

Formula: The Levenshtein distance is calculated using dynamic programming:

\mathrm{lev}_{a,b}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0 \\
\min\big( \mathrm{lev}_{a,b}(i-1, j) + 1,\; \mathrm{lev}_{a,b}(i, j-1) + 1,\; \mathrm{lev}_{a,b}(i-1, j-1) + \mathbf{1}_{(a_i \neq b_j)} \big) & \text{otherwise}
\end{cases}

where \mathbf{1}_{(a_i \neq b_j)} is 1 if the i-th character of a and the j-th character of b differ, and 0 if they match.

Implementation:

import Levenshtein  # provided by the "Levenshtein" (python-Levenshtein) package

string1 = "hello"
string2 = "hallo"

# Minimum number of single-character insertions, deletions, or substitutions
levenshtein_distance = Levenshtein.distance(string1, string2)
print(f"Levenshtein Distance: {levenshtein_distance}")  # 1 ("e" -> "a")

Accuracy

Accuracy measures the proportion of correct predictions out of the total number of predictions. It is commonly used in classification tasks, such as sentiment analysis, where the model's prediction is compared against a true label.

Formula:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}

Implementation:

from sklearn.metrics import accuracy_score

true_labels = [1, 0, 1, 1]
predictions = [1, 0, 1, 0]
accuracy = accuracy_score(true_labels, predictions)  # 3 of 4 predictions correct
print(f"Accuracy: {accuracy}")  # 0.75

F1 Score

The F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics. It is especially useful when the classes are imbalanced. F1 score is widely used in tasks where both precision (correctness) and recall (completeness) are crucial, such as in classification or information retrieval.

Formula:

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Implementation:

from sklearn.metrics import f1_score

true_labels = [1, 0, 1, 1]
predictions = [1, 0, 0, 1]
f1 = f1_score(true_labels, predictions)  # precision = 1.0, recall = 2/3
print(f"F1 Score: {f1}")  # 0.8

Precision

Precision measures how many of the predicted positive instances are actually correct. It is used when it is important to avoid false positives, such as in spam detection or medical diagnosis.

Formula:

\text{Precision} = \frac{TP}{TP + FP}

Implementation:

from sklearn.metrics import precision_score

true_labels = [1, 0, 1, 1]
predictions = [1, 0, 1, 0]
precision = precision_score(true_labels, predictions)  # TP = 2, FP = 0
print(f"Precision: {precision}")  # 1.0

Recall

Recall (or Sensitivity) measures how many of the actual positive instances are correctly identified by the model. Recall is essential in tasks where missing positive instances (false negatives) is costly, such as in medical diagnostics.

Formula:

\text{Recall} = \frac{TP}{TP + FN}

Implementation:

from sklearn.metrics import recall_score

true_labels = [1, 0, 1, 1]
predictions = [1, 0, 1, 0]
recall = recall_score(true_labels, predictions)  # TP = 2, FN = 1
print(f"Recall: {recall}")  # ≈ 0.67

Confusion Matrix

A confusion matrix summarizes the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives. It is useful for understanding the types of errors a model makes, especially in classification tasks (e.g., sentiment analysis, topic classification).

The matrix presents four key values that represent the outcomes of a classification task:

  • True Positive (TP): The number of instances that were correctly predicted as positive.

  • False Positive (FP): The number of instances that were incorrectly predicted as positive.

  • True Negative (TN): The number of instances that were correctly predicted as negative.

  • False Negative (FN): The number of instances that were incorrectly predicted as negative.

Implementation:

from sklearn.metrics import confusion_matrix

true_labels = [1, 0, 1, 1]
predictions = [1, 0, 0, 1]

# With binary labels sorted as [0, 1], scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(true_labels, predictions)
print(f"Confusion Matrix:\n{cm}")  # [[1 0]
                                   #  [1 2]]