Task-Specific Evaluation Metrics
Task-specific metrics are essential for understanding how well a model performs on specific tasks. These metrics allow us to quantify and analyze the model’s output across various dimensions, whether it’s generating text, answering questions, or translating languages. In this section, we’ll explain some common evaluation metrics used in LLM tasks.
Perplexity is a measure of how well a probability model predicts a sample. It is often used to evaluate language models by calculating how surprised the model is by a given sequence of words. A low perplexity suggests that the model is confident in its predictions and understands the language well. The higher the perplexity, the less confident the model is in its predictions.
Formula:
$$\text{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)$$
This formula means that perplexity is the exponential of the average negative log-likelihood of the words in the sequence, given their previous context.
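As a minimal sketch, the snippet below computes perplexity directly from a hypothetical list of per-token probabilities (the `token_probs` values are invented; in practice they would come from the model's predicted distribution):

```python
import math

# Hypothetical probabilities P(w_i | w_1, ..., w_{i-1}) that a language
# model assigned to each token in a sequence (values are invented).
token_probs = [0.25, 0.10, 0.60, 0.05, 0.30]

# Average negative log-likelihood over the sequence.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is the exponential of the average negative log-likelihood.
perplexity = math.exp(avg_nll)
print(f"Perplexity: {perplexity:.2f}")
```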
BLEU score evaluates the output of your LLM application against annotated ground truths. It compares n-grams (n consecutive words) in the model’s output with reference text. BLEU is widely used in machine translation and text generation tasks. It ranges from 0 (no overlap with references) to 1 (perfect overlap).
Formula:
$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$
where $p_n$ is the modified n-gram precision, $w_n$ is the weight for each n-gram order (typically $1/N$ with $N = 4$), $c$ is the candidate length, and $r$ is the reference length.
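One common way to compute a sentence-level BLEU score is with NLTK, assuming the `nltk` package is installed (the example sentences below are invented):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference(s) and model output (invented example sentences).
references = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "sits", "on", "the", "mat"]

# Smoothing avoids a zero score when higher-order n-grams have no match.
smoothing = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smoothing)
print(f"BLEU: {score:.3f}")
```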
ROUGE score is used to evaluate text summaries by comparing the overlap of n-grams, word sequences, and word pairs between a model-generated summary and a reference summary. It determines the proportion (0–1) of n-grams in the reference that are present in the LLM output.
Formula (ROUGE-N):
$$\text{ROUGE-N} = \frac{\sum_{\text{gram}_n \in \text{Reference}} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{\text{gram}_n \in \text{Reference}} \text{Count}(\text{gram}_n)}$$
where $\text{Count}_{\text{match}}$ counts the n-grams of the reference that also appear in the model output.
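A sketch using the `rouge-score` package (an assumption; other ROUGE implementations exist), with invented summary strings:

```python
from rouge_score import rouge_scorer

# Invented reference summary and model-generated summary.
reference = "the cat was found under the bed"
candidate = "the cat was under the bed"

# ROUGE-1 compares unigrams; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, s in scores.items():
    print(f"{name}: recall={s.recall:.3f}, precision={s.precision:.3f}, f1={s.fmeasure:.3f}")
```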
METEOR is a comprehensive evaluation metric designed to improve upon BLEU by taking into account both precision and recall of unigram matches, as well as differences in word order. Unlike BLEU, METEOR also considers synonym matching using external linguistic resources such as WordNet, making it more adaptable to variations in phrasing.
Formula:
$$\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty}), \qquad F_{\text{mean}} = \frac{10\,P\,R}{R + 9P}, \qquad \text{Penalty} = 0.5\left(\frac{\#\text{chunks}}{\#\text{matched unigrams}}\right)^{3}$$
where $P$ and $R$ are the unigram precision and recall of the candidate against the reference.
Implementation:
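A minimal sketch using NLTK's `meteor_score`, assuming `nltk` is installed; recent NLTK versions expect pre-tokenized input, and the WordNet data used for synonym matching must be downloaded first (the sentences are invented):

```python
import nltk
from nltk.translate.meteor_score import meteor_score

# METEOR's synonym matching relies on WordNet.
nltk.download("wordnet", quiet=True)

# Tokenized reference and model output (invented example sentences).
reference = ["the", "cat", "sat", "on", "the", "mat"]
hypothesis = ["the", "cat", "was", "sitting", "on", "the", "mat"]

score = meteor_score([reference], hypothesis)
print(f"METEOR: {score:.3f}")
```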
The Levenshtein distance, or edit distance, calculates the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one string into another. This metric is particularly useful for tasks where the exact alignment of characters is crucial, such as in spelling correction, OCR (optical character recognition) output evaluation, or comparing short text strings.
Formula: The Levenshtein distance between strings $a$ and $b$ is calculated using dynamic programming:
$$\mathrm{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0, \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1, j-1) + \mathbf{1}_{(a_i \neq b_j)} \end{cases} & \text{otherwise,} \end{cases}$$
where the three cases correspond to a deletion, an insertion, and a substitution (which costs nothing when $a_i = b_j$).
Implementation:
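A self-contained dynamic-programming sketch, with no external dependencies:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # dp[i][j] holds the distance between a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all i characters of a
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[len(a)][len(b)]


print(levenshtein_distance("kitten", "sitting"))  # 3
```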
Accuracy measures the proportion of correct predictions out of total predictions. Accuracy is commonly used in classification tasks, like sentiment analysis, where the model’s prediction is compared against a true label.
Formula:
$$\text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$
Implementation:
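A small sketch, computed both by direct counting and with scikit-learn's `accuracy_score` (the labels are invented):

```python
from sklearn.metrics import accuracy_score

# Invented ground-truth and predicted labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Direct counting: correct predictions divided by total predictions.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy (manual):  {manual:.3f}")
print(f"Accuracy (sklearn): {accuracy_score(y_true, y_pred):.3f}")
```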
The F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics. It is especially useful when the classes are imbalanced. F1 score is widely used in tasks where both precision (correctness) and recall (completeness) are crucial, such as in classification or information retrieval.
Formula:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Implementation:
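A sketch using scikit-learn's `f1_score` on the same invented labels:

```python
from sklearn.metrics import f1_score

# Invented binary labels (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# f1_score computes the harmonic mean of precision and recall.
print(f"F1: {f1_score(y_true, y_pred):.3f}")

# For multi-class or imbalanced problems, pass an averaging strategy,
# e.g. f1_score(y_true, y_pred, average="macro").
```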
Precision measures how many of the predicted positive instances are actually correct. It is used when it is important to avoid false positives, such as in spam detection or medical diagnosis.
Formula:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Implementation:
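A sketch computing precision from the raw TP/FP counts and with scikit-learn (invented labels):

```python
from sklearn.metrics import precision_score

# Invented binary labels (1 = positive class, e.g. "spam").
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Of everything predicted positive, how much was actually positive?
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
print(f"Precision (manual):  {tp / (tp + fp):.3f}")
print(f"Precision (sklearn): {precision_score(y_true, y_pred):.3f}")
```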
Recall (or Sensitivity) measures how many of the actual positive instances are correctly identified by the model. Recall is essential in tasks where missing positive instances (false negatives) is costly, such as in medical diagnostics.
Formula:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Implementation:
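A sketch computing recall from the raw TP/FN counts and with scikit-learn (invented labels):

```python
from sklearn.metrics import recall_score

# Invented binary labels (1 = positive class, e.g. "disease present").
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Of all actual positives, how many did the model catch?
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
print(f"Recall (manual):  {tp / (tp + fn):.3f}")
print(f"Recall (sklearn): {recall_score(y_true, y_pred):.3f}")
```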
A confusion matrix summarizes the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives. It is useful for understanding the types of errors a model makes, especially in classification tasks (e.g., sentiment analysis, topic classification).
The matrix presents four key values that represent the outcomes of a classification task:
True Positive (TP): The number of instances that were correctly predicted as positive.
False Positive (FP): The number of instances that were incorrectly predicted as positive.
True Negative (TN): The number of instances that were correctly predicted as negative.
False Negative (FN): The number of instances that were incorrectly predicted as negative.
Implementation:
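A sketch using scikit-learn's `confusion_matrix` on invented binary labels:

```python
from sklearn.metrics import confusion_matrix

# Invented binary labels (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")
```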