Directory of Links By Section
Introduction
Transformer Paper: Attention Is All You Need – The foundational paper on Transformers.
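The core operation introduced in that paper is scaled dot-product attention, softmax(QKᵀ/√d_k)V. Below is a minimal PyTorch sketch of that formula for illustration only; the function name and tensor shapes are placeholders, not taken from any linked resource.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k)
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ v                              # (batch, seq_q, d_v)

# Example with random tensors: batch of 2, sequence length 4, dimension 8
q = k = v = torch.randn(2, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 4, 8])
```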
Prompt Engineering
Language Models are Few-Shot Learners – Describes the ability of LLMs to perform a wide range of tasks with few-shot learning.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models – Explores how chain-of-thought prompting improves reasoning in LLMs.
Self-Consistency Improves Chain of Thought Reasoning in Language Models – Shows how sampling multiple reasoning paths and majority-voting over their answers improves reasoning consistency (see the sketch after this list).
Tree of Thoughts: Deliberate Problem Solving with Large Language Models – Introduces the Tree of Thoughts (ToT) approach to enhance reasoning capabilities in LLMs.
Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models – Proposes injecting logical information derived from the input into the prompt context to strengthen LLM reasoning.
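As referenced above, here is a minimal sketch of how self-consistency can be layered on chain-of-thought prompting: sample several reasoning paths and take a majority vote over their final answers. The `sample_completion` function is a hypothetical stand-in for whatever LLM call you use; it is not part of any linked paper or library, and the worked example in the prompt is illustrative.

```python
from collections import Counter

COT_PROMPT = (
    "Q: A pen costs $2 and a notebook costs $3. How much do 2 pens and 1 notebook cost?\n"
    "A: Let's think step by step. 2 pens cost 2 * $2 = $4. Adding the notebook gives "
    "$4 + $3 = $7. The answer is 7.\n"
    "Q: {question}\n"
    "A: Let's think step by step."
)

def self_consistent_answer(question, sample_completion, n_samples=5):
    """Sample several chain-of-thought completions and majority-vote on the final answer.

    `sample_completion(prompt) -> (reasoning, answer)` is a hypothetical wrapper around
    an LLM call that returns the sampled reasoning and its extracted final answer.
    """
    prompt = COT_PROMPT.format(question=question)
    answers = [sample_completion(prompt)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```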
Neuro-Symbolic Methods
Large Language Models Are Neurosymbolic Reasoners – Investigates the potential of LLMs to act as symbolic reasoners.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks – Introduces RAG, a method combining retrieval-based and generation-based models to enhance the performance on knowledge-intensive tasks.
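A minimal sketch of the retrieve-then-generate pattern that the RAG paper formalizes, using TF-IDF retrieval from scikit-learn purely for illustration (the paper itself uses dense retrieval over Wikipedia); the document list is toy data and `generate` stands in for any LLM call.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The Transformer architecture relies entirely on attention mechanisms.",
    "LoRA adds trainable low-rank matrices to frozen pre-trained weights.",
    "BLEU compares machine translations against reference translations.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query, k=2):
    """Return the k documents most similar to the query (TF-IDF cosine similarity)."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors)[0]
    return [documents[i] for i in scores.argsort()[::-1][:k]]

def answer(query, generate):
    """Assemble retrieved context into the prompt; `generate` is a hypothetical LLM call."""
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)
```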
Honorable Mentions
Distilling the Knowledge in a Neural Network – Foundational paper on knowledge distillation (a minimal loss sketch follows this list).
Popular Ensemble Methods – Overview paper on ensemble methods and voting systems.
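The distillation paper's central idea is training a small student to match a large teacher's softened output distribution. Below is a minimal PyTorch sketch of that loss; the temperature and mixing weight are illustrative defaults rather than values prescribed by the paper, and the logits are random placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher vs. student at temperature T) with hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale soft-target gradients, as in the paper
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example with random logits for a 10-class problem
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```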
Fine-Tuning
Stochastic Gradient Descent – Discusses the use of Stochastic Gradient Descent (SGD) for machine learning (a minimal training-step sketch follows this list).
Learning to Summarize with Human Feedback – Applies Reinforcement Learning from Human Feedback (RLHF) to abstractive summarization.
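As noted above, here is a minimal sketch of a single SGD update step in PyTorch; the model and batch are toy placeholders used only to show the optimizer mechanics.

```python
import torch
from torch import nn

model = nn.Linear(10, 2)                       # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 10)                        # toy batch of 32 examples
y = torch.randint(0, 2, (32,))

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                                # compute gradients
optimizer.step()                               # SGD update: w <- w - lr * grad
print(loss.item())
```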
Full Parameter Fine-Tuning
AdamW Documentation – Official documentation for the AdamW optimizer in PyTorch, widely used for fine-tuning models (see the sketch after this list).
LIFT – Introduces Layer-wise Fine-Tuning (LIFT) for model adaptation.
Hugging Face Transformers Library – A popular library for working with pre-trained transformer models, including support for fine-tuning tasks.
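As referenced above, a minimal sketch of a full-parameter fine-tuning step with AdamW and the Hugging Face Transformers library. The checkpoint name and two-sentence batch are placeholders, and scheduling, batching, and evaluation are omitted.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# In full-parameter fine-tuning, every weight in the model is trainable
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

batch = tokenizer(["great movie", "terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)         # forward pass returns the loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```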
Parameter-Efficient Fine-Tuning (PEFT)
LoRA (Low-Rank Adaptation) Paper – The original paper discussing Low-Rank Adaptation (LoRA), a technique for efficient fine-tuning of large pre-trained models.
PEFT Documentation – Documentation for the Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library, including implementation details and configurations for LoRA and other efficient fine-tuning approaches (see the sketch after this list).
Rotten Tomatoes Dataset – The Rotten Tomatoes dataset, used for sentiment analysis and fine-tuning tasks in this example.
Hugging Face Transformers Library – A popular library for working with pre-trained transformer models, including support for fine-tuning tasks.
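As referenced above, a minimal sketch of wrapping a pre-trained model with a LoRA adapter via the PEFT library and loading the Rotten Tomatoes dataset. The checkpoint and LoRA hyperparameters are illustrative choices, not tuned values from any linked resource.

```python
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# LoRA: freeze the base weights and train small low-rank update matrices instead
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                    # rank of the update matrices
    lora_alpha=16,          # scaling factor
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()              # only a small fraction is trainable

dataset = load_dataset("rotten_tomatoes")       # sentiment dataset referenced above
print(dataset["train"][0])
```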
Task-Specific Evaluation Metrics
PyTorch Documentation – Official documentation for PyTorch, including tools for implementing perplexity calculations.
NLTK BLEU Documentation – Official documentation for the NLTK library, which includes BLEU score calculation.
Rouge Score GitHub – GitHub repository for the ROUGE metric, used for evaluating text summaries.
NLTK Meteor Documentation – NLTK's official documentation for the METEOR score, an alternative to BLEU.
python-Levenshtein GitHub – GitHub repository for the python-Levenshtein package, which implements Levenshtein distance.
Scikit-learn GitHub Repository – GitHub repository for scikit-learn, which provides implementations of the F1 score and other classification metrics. A combined usage sketch for these metrics follows this list.
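As noted above, a minimal combined sketch of these metrics on toy strings and labels; it assumes `nltk`, `rouge-score`, `python-Levenshtein`, `scikit-learn`, and `torch` are installed, and the reference/candidate pair is made up for illustration.

```python
import torch
import Levenshtein
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from sklearn.metrics import f1_score

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

# Perplexity: exponential of the average cross-entropy loss (toy loss value here)
perplexity = torch.exp(torch.tensor(2.0))

# BLEU: n-gram precision of the candidate against the reference
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)

# ROUGE: recall-oriented overlap, commonly used for summaries
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(reference, candidate)

# METEOR: alignment-based alternative to BLEU (requires nltk.download("wordnet"))
meteor = meteor_score([reference.split()], candidate.split())

# Levenshtein: edit distance between the two strings
edit_distance = Levenshtein.distance(reference, candidate)

# F1: classification metric over true vs. predicted labels
f1 = f1_score([1, 0, 1, 1], [1, 0, 0, 1])

print(perplexity.item(), bleu, rouge["rougeL"].fmeasure, meteor, edit_distance, f1)
```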
Popular Benchmarks
GLUE (General Language Understanding Evaluation) – A widely used NLP benchmark that tests models on tasks such as sentiment analysis, textual entailment, and sentence similarity to assess general-purpose language understanding.
SuperGLUE – A more challenging successor to GLUE that includes harder tasks such as commonsense reasoning, multi-sentence reading comprehension, and coreference resolution to push the boundaries of language model capabilities.
MMLU (Massive Multitask Language Understanding) – A benchmark designed to evaluate LLMs on 57 diverse tasks, from elementary-school-level math to advanced subjects like law and medicine, testing reasoning and domain-specific knowledge.
SQuAD (Stanford Question Answering Dataset) – A benchmark for evaluating a model's reading comprehension and question-answering capabilities, with two versions: SQuAD 1.1 (answer extraction) and SQuAD 2.0 (which adds unanswerable questions).
HellaSwag – A benchmark for testing commonsense reasoning and contextual understanding by predicting the most likely continuation of incomplete sentences or narratives in multiple-choice format.
WinoGrande – A large-scale Winograd Schema-style benchmark that tests commonsense reasoning by asking models to resolve ambiguous pronouns in carefully constructed sentences. A loading sketch for some of these benchmarks follows this list.
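As noted above, a minimal sketch of pulling a few of these benchmarks through the Hugging Face `datasets` library; the dataset identifiers shown are the ones commonly used on the Hub, and all evaluation logic is omitted.

```python
from datasets import load_dataset

# GLUE: pick a task configuration, e.g. SST-2 for sentiment analysis
sst2 = load_dataset("glue", "sst2")

# SQuAD v2 includes unanswerable questions
squad_v2 = load_dataset("squad_v2")

# HellaSwag: multiple-choice sentence continuations
hellaswag = load_dataset("hellaswag")

print(sst2["train"][0])
print(squad_v2["validation"][0]["question"])
```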