LLM Guide

Directory of Links By Section


Introduction

  • Attention Is All You Need – The foundational paper on Transformers (see the attention sketch below).
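
For orientation, the core operation the paper introduces is scaled dot-product attention, softmax(QKᵀ/√d_k)V. A minimal single-head PyTorch sketch (tensor shapes and names are illustrative, not taken from any reference implementation):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k); illustrative single-head version
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)             # attention distribution
    return weights @ v                              # weighted sum of the values

q = k = v = torch.randn(2, 5, 64)                   # toy batch
out = scaled_dot_product_attention(q, k, v)         # -> shape (2, 5, 64)
```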

Prompt Engineering

  • Language Models are Few-Shot Learners – Describes LLMs' ability to perform a wide range of tasks with few-shot learning.

  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models – Explores how chain-of-thought prompting improves reasoning in LLMs (see the prompt sketch after this list).

  • Self-Consistency Improves Chain of Thought Reasoning in Language Models – Shows how sampling multiple responses can improve reasoning consistency.

  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models – Introduces the Tree of Thoughts approach to enhance reasoning capabilities in LLMs.

  • Logic-of-Thought – Proposes injecting logic into contexts for full reasoning in large language models.
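
Most of the techniques above come down to how the input text is assembled. Below is a minimal sketch of a few-shot, chain-of-thought style prompt; the worked example, the final question, and the `call_llm` placeholder are invented for illustration and are not taken from the papers.

```python
# Hypothetical few-shot chain-of-thought prompt; the worked example and
# the final question are made up for illustration.
FEW_SHOT_COT = """Q: A pack has 12 pens. Ana buys 3 packs and gives away 7 pens. How many are left?
A: Let's think step by step. 3 packs contain 3 * 12 = 36 pens. 36 - 7 = 29. The answer is 29.

Q: {question}
A: Let's think step by step."""

def build_prompt(question: str) -> str:
    return FEW_SHOT_COT.format(question=question)

prompt = build_prompt("A box holds 8 apples. How many apples are in 5 boxes?")
# response = call_llm(prompt)  # call_llm is a placeholder for whichever LLM API you use
print(prompt)
```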

Neuro-Symbolic Methods

  • Large Language Models Are Neurosymbolic Reasoners – Investigates the potential application of LLMs as symbolic reasoners.

Retrieval-Augmented Generation (RAG)

  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks – Introduces RAG, a method combining retrieval-based and generation-based models to enhance performance on knowledge-intensive tasks (a retrieve-then-generate sketch follows this list).
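
As a rough sketch of the retrieve-then-generate pattern RAG describes: documents are ranked against the query and the top hits are prepended to the prompt. TF-IDF stands in for a real dense retriever here, and the toy corpus, prompt template, and `call_llm` placeholder are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for an external knowledge base.
docs = [
    "LoRA adds trainable low-rank matrices to frozen weights.",
    "RAG retrieves documents and conditions generation on them.",
    "BLEU compares n-gram overlap between candidate and reference text.",
]

vectorizer = TfidfVectorizer().fit(docs)
doc_vecs = vectorizer.transform(docs)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank documents by cosine similarity to the query and keep the top k.
    sims = cosine_similarity(vectorizer.transform([query]), doc_vecs)[0]
    return [docs[i] for i in sims.argsort()[::-1][:k]]

question = "What does RAG do?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
# answer = call_llm(prompt)  # placeholder for an actual generator model
print(prompt)
```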

Honorable Mentions

  • Distilling the Knowledge in a Neural Network – Foundational paper on knowledge distillation (see the distillation loss sketch below).

  • Popular Ensemble Methods – Overview paper on ensemble methods and voting systems.
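
To make the distillation idea concrete, here is a minimal PyTorch sketch of the loss from the Hinton et al. paper: a KL term between temperature-softened teacher and student distributions, mixed with ordinary cross-entropy on the labels. The toy models, temperature, and mixing weight are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

temperature, alpha = 2.0, 0.5            # illustrative hyperparameters
teacher = nn.Linear(16, 4)               # toy "teacher" and "student" models
student = nn.Linear(16, 4)

x = torch.randn(8, 16)                   # dummy batch
labels = torch.randint(0, 4, (8,))

with torch.no_grad():
    teacher_logits = teacher(x)          # teacher is not updated
student_logits = student(x)

# Soft targets: KL between temperature-softened distributions,
# scaled by T^2 as in the original paper.
soft = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature ** 2

# Hard targets: standard cross-entropy on the true labels.
hard = F.cross_entropy(student_logits, labels)

loss = alpha * soft + (1 - alpha) * hard
loss.backward()                          # gradients flow only into the student
```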

Fine-Tuning

  • Stochastic Gradient Descent – Discusses the use of Stochastic Gradient Descent (SGD) for machine learning.

  • Learning to Summarize with Human Feedback – Demonstrates Reinforcement Learning from Human Feedback (RLHF) for fine-tuning language models.

Full Parameter Fine-Tuning

  • AdamW Documentation – Official documentation for the AdamW optimizer in PyTorch, widely used for fine-tuning models.

  • LIFT – The paper introducing Layer-wise Fine-Tuning (LIFT) for model adaptation.

  • Hugging Face Transformers Library – A popular library for working with pre-trained transformer models, including support for fine-tuning tasks (a minimal full fine-tuning sketch follows this list).
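
A compact sketch of what full parameter fine-tuning amounts to: nothing is frozen, and the optimizer (AdamW here) updates every weight. The tiny model and random batch below are placeholders for a real pre-trained model and dataset.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained model; in practice this would be loaded
# from a checkpoint (e.g. via the Hugging Face Transformers library).
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

# Full fine-tuning: no parameters are frozen, all of them receive updates.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 32)                  # dummy mini-batch
y = torch.randint(0, 2, (16,))

for step in range(3):                    # a few illustrative training steps
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```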

Parameter-Efficient Fine-Tuning (PEFT)

  • LoRA (Low-Rank Adaptation) Paper – The original paper on Low-Rank Adaptation (LoRA), a technique for efficient fine-tuning of large pre-trained models (a hand-rolled LoRA sketch follows this list).

  • PEFT Documentation – Documentation for PEFT (Parameter-Efficient Fine-Tuning), including implementation details and configurations for LoRA and other efficient fine-tuning approaches.

  • Rotten Tomatoes Dataset – The Rotten Tomatoes dataset, used for sentiment analysis and the fine-tuning example in this guide.

  • Hugging Face Transformers Library – A popular library for working with pre-trained transformer models, including support for fine-tuning tasks.
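
To illustrate the idea behind LoRA, here is a hand-rolled PyTorch sketch rather than the PEFT library's actual implementation: the pre-trained weight is frozen and a low-rank update B·A, scaled by alpha/r, is learned on top of it.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank correction B @ A applied to x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(128, 128), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # only A and B are trained: 2 * 8 * 128 = 2048 values
```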

Task-Specific Evaluation Metrics

  • PyTorch Documentation – Official documentation for PyTorch, including the tools needed to implement perplexity calculations.

  • NLTK BLEU Documentation – Official documentation for the NLTK library's BLEU score implementation.

  • Rouge Score GitHub – GitHub repository for the ROUGE metric, used for evaluating text summaries.

  • NLTK Meteor Documentation – NLTK's official documentation for the METEOR score, an alternative to BLEU.

  • python-Levenshtein GitHub – GitHub repository for the python-Levenshtein package, which implements Levenshtein distance.

  • Scikit-learn GitHub Repository – GitHub repository for scikit-learn, which provides implementations of the F1 score and other classification metrics (a short usage sketch follows this list).
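
A short sketch of how these libraries are typically called for BLEU, ROUGE, and F1. The example sentences and labels are made up, and the `rouge_score` import assumes the rouge-score package from the repository linked above.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sklearn.metrics import f1_score

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

# BLEU: n-gram overlap between candidate and reference (smoothing helps short texts).
bleu = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)

# ROUGE-L: longest-common-subsequence overlap, common for evaluating summaries.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score("the cat sat on the mat", "the cat is on the mat")["rougeL"].fmeasure

# F1: classification metric over predicted vs. true labels.
f1 = f1_score([1, 0, 1, 1], [1, 0, 0, 1], average="binary")

print(bleu, rouge_l, f1)
```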

Popular Benchmarks

  • GLUE (General Language Understanding Evaluation) – A widely used NLP benchmark that tests models on tasks such as sentiment analysis, textual entailment, and question answering to assess general-purpose language understanding (a dataset-loading sketch follows this list).

  • SuperGLUE – An advanced version of GLUE with more difficult tasks, such as commonsense reasoning, multi-hop inference, and complex question answering, designed to push the boundaries of LLM capabilities.

  • MMLU (Massive Multitask Language Understanding) – A benchmark that evaluates LLMs on 57 diverse tasks, from elementary-school math to advanced subjects such as law and medicine, testing reasoning and domain-specific knowledge.

  • SQuAD (Stanford Question Answering Dataset) – A benchmark for reading comprehension and question answering, with two versions: SQuAD 1.1 (answer extraction) and SQuAD 2.0 (which adds unanswerable questions).

  • HellaSwag – A benchmark for commonsense reasoning and contextual understanding in which models pick the most likely continuation of an incomplete sentence or narrative in a multiple-choice format.

  • WinoGrande – A benchmark focused on coreference resolution, particularly resolving pronouns in complex sentences, to assess a model's ability to understand relationships between entities.
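
Many of these benchmarks can be pulled from the Hugging Face datasets hub for local evaluation; below is a minimal sketch, assuming the standard `glue`/`sst2` identifiers on the hub.

```python
from datasets import load_dataset

# SST-2, the sentiment-analysis task within GLUE.
sst2 = load_dataset("glue", "sst2", split="validation")
print(sst2[0])   # e.g. a dict with 'sentence', 'label', and 'idx' fields

# Predictions from the fine-tuned model would then be compared against
# the 'label' column with a metric such as accuracy or F1.
```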
