# LLM Research Themes — Evolution Map (2017 → May 2026)

Comparative reference for TSFM.ai vs LLM research trajectory study. For each theme: when it appeared as *novel research*, when it became *standard practice*, and the typical lag between the two.

---

## A. Scaling Laws

- **Seminal paper**: Kaplan et al., *Scaling Laws for Neural Language Models*, arXiv:2001.08361, Jan 23 2020 (OpenAI).
- **Compute-optimal correction**: Hoffmann et al., *Training Compute-Optimal Large Language Models* (Chinchilla), arXiv:2203.15556, Mar 29 2022 (DeepMind). Found tokens ≈ 20× params is compute-optimal; trained 70B Chinchilla on 1.4T tokens, beating 280B Gopher.
- **Novel research window**: 2020 – 2022.
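- **Standard practice from**: ~mid-2022 (Chinchilla immediately reshaped training budgets); fully load-bearing by 2023 (LLaMA-1 takes Chinchilla as its reference point, then deliberately trains smaller models on more tokens than compute-optimal to cut inference cost; every major lab fits a scaling law before launch).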
- **Lag (novel → standard)**: ~2 years (Jan 2020 paper → ~mid-2022 industry-wide).
- **Evolution**: In 2024–2025 the framing shifted: inference-time / test-time compute scaling (o1, Sep 2024) opened a *second* scaling axis, while MoE-specific, data-quality-aware, and inference-aware scaling laws (e.g. *Beyond Chinchilla-Optimal*, 2024, which folds inference cost into the optimum) extended the concept. By 2025–2026, "fit a scaling law" is table stakes: model releases are pitched as "we picked size X because the law said so."
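
The Chinchilla arithmetic above can be checked in a few lines. This is a minimal sketch, assuming the ~20 tokens-per-parameter heuristic quoted above and the common C ≈ 6·N·D approximation for training FLOPs; the function names are illustrative, and neither constant is an exact fit.

```python
# Rule-of-thumb constants; both are approximations, not exact fits.
TOKENS_PER_PARAM = 20      # Chinchilla heuristic quoted above
FLOPS_PER_PARAM_TOKEN = 6  # common C ~ 6 * N * D training-cost approximation


def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal training tokens for a dense model."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs for n_params trained on n_tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


chinchilla_params = 70e9
chinchilla_tokens = compute_optimal_tokens(chinchilla_params)
chinchilla_flops = training_flops(chinchilla_params, chinchilla_tokens)

print(f"tokens: {chinchilla_tokens:.2e}")
print(f"FLOPs:  {chinchilla_flops:.2e}")
```

For 70B parameters the heuristic reproduces the 1.4T-token figure cited for Chinchilla, with an estimated budget near 5.9e23 FLOPs.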

---

## B. Emergent Abilities Debate

- **Pro-emergence**: Wei et al., *Emergent Abilities of Large Language Models*, arXiv:2206.07682, Jun 15 2022.
- **Counter (mirage)**: Schaeffer, Miranda, Koyejo, *Are Emergent Abilities of Large Language Models a Mirage?*, arXiv:2304.15004, Apr 28 2023 — NeurIPS 2023 Outstanding Paper Award. Argues "emergence" is an artifact of non-linear/discontinuous metrics.
- **Novel research window**: 2022 – 2024.
- **Standard practice**: Never settled into a single recipe; instead, the *evaluation methodology* it triggered became standard. By 2024–2025 model cards routinely report per-task scaling curves rather than threshold claims. The 2025 survey (arXiv:2503.05788, *Emergent Abilities in LLMs: A Survey*) treats the debate as ongoing but methodologically incorporated.
- **Lag**: ~10 months from claim to formal counter (Jun 2022 → Apr 2023); ~2–3 years to "incorporated into evaluation hygiene" (2024–2025).
- **Evolution**: Pure "emergence claims" cooled by 2024; the debate now informs how scaling curves are reported.
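
The "mirage" argument can be illustrated with a toy calculation. This sketch assumes a hypothetical 10-token answer, independent per-token errors, and made-up accuracy values; it is not the experimental setup of Schaeffer et al., only the shape of the argument: a smooth per-token metric can look like a sudden jump under an all-or-nothing metric.

```python
SEQUENCE_LENGTH = 10  # hypothetical answer length in tokens


def exact_match_rate(per_token_accuracy: float, length: int = SEQUENCE_LENGTH) -> float:
    """Probability that all `length` tokens are right, assuming independence."""
    return per_token_accuracy ** length


# Hypothetical per-token accuracies that rise smoothly with scale.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
exact = [exact_match_rate(p) for p in per_token]

for p, em in zip(per_token, exact):
    print(f"per-token {p:.2f} -> exact-match {em:.3f}")
```

Exact-match stays below 1% at 0.50 per-token accuracy and passes 50% by 0.95, even though the underlying per-token curve never jumps.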

---

## C. Instruction Tuning / RLHF / Alignment

- **InstructGPT**: Ouyang et al., arXiv:2203.02155, Mar 4 2022 — established the SFT → reward model → PPO recipe; 1.3B InstructGPT preferred over 175B GPT-3.
- **Constitutional AI / RLAIF**: Bai et al. (Anthropic), arXiv:2212.08073, Dec 15 2022 — replaces human harmlessness labels with model self-critique against a constitution.
- **DPO**: Rafailov et al., arXiv:2305.18290, May 29 2023 — removes RL loop; binary cross-entropy directly on preference pairs. Became the *de facto* open-source post-training method by 2024.
- **Novel research window**: 2022 – mid-2023.
- **Standard practice from**: late 2022 for instruction tuning (ChatGPT launch Nov 2022 cemented it); 2023 for RLHF; 2024 for DPO/preference-tuning as table stakes.
- **Lag**: ~9–12 months for instruction-tuning (Mar 2022 → ChatGPT Nov 2022 → every release by 2023); ~1.5 years for DPO (May 2023 → universal in open-source by 2024).
- **Evolution**: 2024–2026 the alignment stack is multi-stage (SFT + DPO/IPO/KTO + RLAIF + constitutional methods); Tülu 3 (Allen AI, 2024) is the open canonical example. RLHF is no longer "one company's safety research" — it's a *required* post-training stage.
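
The DPO objective mentioned above reduces to a logistic loss on a log-probability margin. A minimal sketch for a single preference pair, assuming sequence-level log-probabilities are already computed; the argument names and the `beta=0.1` default are illustrative, and a training implementation would batch this and use a numerically stable log-sigmoid.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # How much the policy has moved relative to the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as log(1 + exp(-logits)).
    # Fine for small toy values; large negative logits would overflow exp.
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: loss is log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy shifted toward the chosen answer: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(neutral, improved)
```

No reward model or sampling loop appears anywhere in the loss, which is the practical difference from the SFT → reward model → PPO recipe.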

---

## D. In-Context Learning / Few-Shot / Chain-of-Thought

- **In-context learning**: Brown et al., *Language Models are Few-Shot Learners* (GPT-3), arXiv:2005.14165, May 28 2020.
- **CoT prompting**: Wei et al., *Chain-of-Thought Prompting Elicits Reasoning in LLMs*, arXiv:2201.11903, Jan 28 2022.
- **Novel research window**: 2020 (ICL) – 2022 (CoT).
- **Standard practice from**: ICL was standard by GPT-3 release (immediate, ~3 months); CoT became default for reasoning benchmarks by late 2022 / early 2023 (Self-Consistency, PoT, ToT all build on it within a year).
- **Lag**: ~3 months for ICL; ~6–9 months for CoT.
- **Evolution**: CoT became the *substrate* for o1-style reasoning models (Sep 2024) — RL-trained internal CoT plus test-time-compute scaling. By 2026 every flagship has a "thinking mode" descended from CoT.
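
A sketch of how few-shot CoT prompting and Self-Consistency fit together. The demonstration text, the "The answer is" marker, and the sampled completions are all hypothetical stand-ins; no model is called, and real pipelines parse answers more defensively.

```python
from collections import Counter

# Hypothetical worked example used as the few-shot demonstration.
COT_DEMO = (
    "Q: A shelf holds 3 boxes with 4 books each. How many books?\n"
    "A: Each box has 4 books and there are 3 boxes, so 3 * 4 = 12. "
    "The answer is 12.\n"
)


def build_cot_prompt(question: str) -> str:
    """Prepend one reasoning demonstration to a new question."""
    return f"{COT_DEMO}\nQ: {question}\nA:"


def extract_answer(completion: str) -> str:
    """Take the text after the final 'The answer is' marker."""
    return completion.rsplit("The answer is", 1)[-1].strip(" .\n")


def self_consistent_answer(completions: list[str]) -> str:
    """Majority vote over final answers from several sampled chains."""
    votes = Counter(extract_answer(c) for c in completions)
    return votes.most_common(1)[0][0]


# Stand-ins for sampled model outputs; no model is called here.
samples = [
    "5 bags of 6 is 5 * 6 = 30. The answer is 30.",
    "6 + 6 + 6 + 6 + 6 = 30. The answer is 30.",
    "5 + 6 = 11. The answer is 11.",
]
print(build_cot_prompt("5 bags hold 6 apples each. How many apples?"))
print(self_consistent_answer(samples))
```

Both steps are pure prompt and post-processing logic, which is consistent with the short adoption lag: nothing here requires retraining.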

---

## E. Retrieval-Augmented Generation (RAG)

- **REALM**: Guu et al., arXiv:2002.08909, Feb 10 2020 — first joint training of retriever + LM.
- **RAG (canonical)**: Lewis et al., arXiv:2005.11401, May 22 2020 — BART encoder-decoder retrieving Wikipedia for open-domain QA.
- **Novel research window**: 2020 – 2022.
- **Standard practice from**: 2023 (LangChain explosion, vector DB market formed); 2024 widely called "the year of RAG" — vector DB market hit $2.2B with 21.9% CAGR.
- **Lag**: ~3 years (May 2020 paper → 2023 enterprise default).
- **Evolution**: 2023 basic RAG → 2024 advanced (RAG-Fusion, GraphRAG, agentic RAG, hybrid retrieval) → 2025+ "context engineering" displaces RAG as the umbrella term (long-context models eat naive single-shot RAG).
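
The basic retrieve-then-generate pattern can be sketched without any external service. This toy version assumes a three-document in-memory corpus and bag-of-words cosine similarity in place of learned dense embeddings and a vector database; the prompt template is illustrative, and the generation step is left out.

```python
import math
from collections import Counter

# Tiny stand-in corpus; real systems index far larger collections.
DOCS = [
    "Chinchilla was trained on 1.4T tokens with 70B parameters.",
    "Mixtral 8x7B routes each token to 2 of 8 experts.",
    "CLIP learns joint image and text embeddings contrastively.",
]


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude substitute for learned embeddings."""
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    return sorted(DOCS, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)[:k]


def build_rag_prompt(query: str, k: int = 1) -> str:
    """Put retrieved passages in front of the question for the generator."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


print(build_rag_prompt("How many experts does Mixtral use per token?"))
```

The "advanced RAG" variants listed above mostly change `retrieve` (hybrid scoring, graph traversal, multiple agent-driven calls) while keeping the same prompt-assembly step.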

---

## F. Mixture of Experts (MoE)

- **Sparse-gated MoE**: Shazeer et al., arXiv:1701.06538, Jan 23 2017 — 137B-param MoE between LSTM layers; >1000× capacity at marginal compute cost.
- **GShard**: Lepikhin et al., arXiv:2006.16668, Jun 30 2020 — 600B Transformer MoE, automatic sharding across 2048 TPU v3.
- **Switch Transformer**: Fedus et al., arXiv:2101.03961, Jan 11 2021 — top-1 routing; trillion-parameter scale.
- **GLaM**: Du et al., arXiv:2112.06905, Dec 13 2021 — 1.2T params, 64 experts, top-2 routing.
- **Mixtral 8x7B**: Mistral AI, arXiv:2401.04088, Jan 8 2024 (released Dec 8 2023) — 46.7B total / 12.9B active; matched/beat GPT-3.5; *the* open-weight MoE moment.
- **Novel research window**: 2017 – 2021.
- **Standard practice from**: rumored in GPT-4 (Mar 2023, never confirmed); explicit/open from Mixtral (Dec 2023); standard for flagships by 2024–2025 (DeepSeek-V2/V3, Qwen 2.5-MoE, Gemini 1.5 confirmed MoE in 2024; Llama 4 followed in Apr 2025).
- **Lag**: ~6–7 years from Shazeer 2017 to "common in flagships" (~2024). MoE is the longest-incubated theme in this list.
- **Evolution**: 2024–2026 — MoE-specific scaling laws emerged; fine-grained / shared-experts (DeepSeek-MoE); MoE is now the default architecture for any model > ~100B params.
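
Top-k routing, the mechanism shared by the papers above, can be sketched with scalar "experts". Assumptions: the router scores are made up, the experts are toy functions standing in for feed-forward blocks, and gate weights are renormalized with a softmax over the selected experts only, which is one common convention (Switch, for instance, scales its single expert by the full-softmax probability instead).

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a short list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return {i: w for i, w in zip(chosen, weights)}


def moe_layer(x: float, router_logits: list[float], experts, k: int) -> float:
    """Weighted sum of the chosen experts' outputs; other experts never run."""
    routes = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in routes.items())


# Hypothetical scalar "experts" standing in for feed-forward blocks.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.0]  # made-up router scores for one token

top1 = top_k_route(logits, k=1)  # single-expert selection (top-1)
top2 = top_k_route(logits, k=2)  # two-expert selection (top-2)
print(top1)
print(top2)
print(moe_layer(1.0, logits, experts, k=2))
```

Only `k` of the experts execute per token, which is why total and active parameter counts diverge: Mixtral's 12.9B active out of 46.7B total is roughly 28% of the weights per token.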

---

## G. Multimodal LLMs

- **CLIP**: Radford et al., arXiv:2103.00020, Feb 26 2021 — contrastive image-text pretraining; zero-shot classification.
- **DALL-E**: Ramesh et al., arXiv:2102.12092, Feb 24 2021 — text-to-image via discrete VAE + autoregressive transformer.
- **Flamingo**: Alayrac et al., arXiv:2204.14198, Apr 29 2022 — 80B VLM with gated cross-attention; few-shot vision-language.
- **DALL-E 2**: Ramesh et al., arXiv:2204.06125, Apr 13 2022 — CLIP-latent diffusion.
- **GPT-4V**: OpenAI, Sep 2023 (system card) — first widely deployed multimodal frontier model.
- **Gemini 1.0**: Google, Dec 6 2023 — first *natively* multimodal (text/image/audio/video) frontier model.
- **GPT-4o**: OpenAI, May 13 2024 — unified text/image/audio neural architecture.
- **Novel research window**: 2021 – 2023.
- **Standard practice from**: 2024 — multimodal became table stakes for flagship LLM releases (Claude 3 Mar 2024, GPT-4o May 2024, Gemini 1.5 Feb 2024). Pure text-only flagship launches stopped.
- **Lag**: ~3 years (CLIP Feb 2021 → multimodal standard by 2024).
- **Evolution**: Text → vision → audio → video → unified omni-models within 3 years. Native multimodality (Gemini 1.0, Dec 2023) displaced bolt-on adapters.

---

## H. Tool Use / Agents

- **ReAct**: Yao et al., arXiv:2210.03629, Oct 6 2022 — interleaved reasoning + acting traces.
- **Toolformer**: Schick et al., arXiv:2302.04761, Feb 9 2023 — self-supervised teaching of API calls.
- **Novel research window**: late 2022 – 2024.
- **Standard practice from**: 2024 — "agent" became a recognized subfield with dedicated workshops, benchmarks (SWE-bench, WebArena, AgentBench), surveys (arXiv:2412.17481, arXiv:2503.16416). Function-calling APIs (OpenAI Jun 2023, Anthropic tool use Apr 2024) standardized the interface.
- **Lag**: ~1.5 years (Oct 2022 ReAct → mid-2024 mainstream).
- **Evolution**: 2024 single-agent → 2025 multi-agent systems → 2025–2026 "agentic" frontier models trained explicitly for tool use, browsing, code execution (Claude computer use Oct 2024, OpenAI o3/Operator, Anthropic Claude Code). By 2026, agent benchmarks drive more attention than static QA.
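
The interleaved reasoning + acting pattern from ReAct can be sketched as a small loop. Everything here is a hypothetical stand-in: the tool registry, the scripted policy that replaces a model call, and the trace format. It shows the control flow only, not any vendor's function-calling API.

```python
# Hypothetical tool registry; a real agent would expose search, code, etc.
TOOLS = {"multiply": lambda a, b: a * b}

# Scripted stand-in for model outputs so the loop runs without an LLM.
SCRIPT = [
    {"thought": "I need 12 * 7.", "action": "multiply", "args": (12, 7)},
    {"thought": "The tool returned the product.", "action": "finish", "args": ()},
]


def scripted_policy(step: int, observations: list) -> dict:
    """Return the next thought/action; a real policy would call a model."""
    return SCRIPT[step]


def react_loop(max_steps: int = 5):
    """Alternate thought -> action -> observation until 'finish'."""
    trace, observations = [], []
    for step in range(max_steps):
        decision = scripted_policy(step, observations)
        trace.append(("thought", decision["thought"]))
        if decision["action"] == "finish":
            return (observations[-1] if observations else None), trace
        result = TOOLS[decision["action"]](*decision["args"])
        trace.append(("action", decision["action"], decision["args"]))
        trace.append(("observation", result))
        observations.append(result)
    return None, trace


answer, trace = react_loop()
for entry in trace:
    print(entry)
print("answer:", answer)
```

Function-calling APIs standardize the `action` / `args` part of this loop as structured output, so the surrounding harness can dispatch tools without parsing free text.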

---

## I. Benchmark Evolution

- **GLUE**: Wang et al., arXiv:1804.07461, Apr 20 2018.
- **SuperGLUE**: Wang et al., arXiv:1905.00537, May 2 2019 — needed because BERT/RoBERTa already surpassed humans on GLUE within a year.
- **MMLU**: Hendrycks et al., arXiv:2009.03300, Sep 7 2020 — 57 subjects, 15,908 questions.
- **BIG-bench**: Srivastava et al., arXiv:2206.04615, Jun 9 2022 — 204 tasks, 442 authors, 132 institutions.
- **HELM**: Liang et al., arXiv:2211.09110, Nov 16 2022 — 16 scenarios × 7 metrics standardized across 30 models.
- **Novel research window**: continuous since 2018.
- **Standard practice**: every benchmark obsoletes within 18–24 months. MMLU saturated ~2023 → MMLU-Pro (2024). GSM8K saturated ~2024. New benchmarks (GPQA, ARC-AGI, SWE-bench, Humanity's Last Exam) cycle in.
- **Lag**: benchmarks ARE the standard practice — there's no "novel" gap. But by 2023 *benchmark-release papers* arguably drove more downstream papers than individual *model-release* papers (HELM, BIG-bench each have 1000+ citations within 2 years).
- **Evolution**: 2018 GLUE (NLU) → 2020 MMLU (knowledge) → 2022 BIG-bench + HELM (holistic) → 2024 reasoning/agent benchmarks (SWE-bench, AgentBench, GPQA) → 2025 LMSys Arena / human-preference-as-benchmark.

---

## J. Open-Weights Wave

- **LLaMA-1**: Touvron et al., arXiv:2302.13971, Feb 27 2023 — leaked on 4chan Mar 3 2023; reset the open-weight frontier.
- **LLaMA-2**: Touvron et al., arXiv:2307.09288, Jul 18 2023 — first openly licensed (with restrictions) frontier model.
- **Mistral 7B**: Jiang et al., arXiv:2310.06825, Oct 10 2023 — Apache 2.0; beat Llama 2 13B on every benchmark.
- **Mixtral 8x7B**: arXiv:2401.04088, Jan 8 2024 (weights released Dec 8 2023); first open MoE matching GPT-3.5.
- **Llama 3.1 405B**: Jul 2024 — first open-weight model competitive with GPT-4 / Claude 3.5 on reasoning benchmarks (ARC 96.9, GSM8K 96.8).
- **DeepSeek V3 / R1**: Dec 2024 / Jan 2025; R1, built on the V3 base model, is the open-weight reasoning model matching o1-level performance.
- **Novel research window**: 2023.
- **Standard practice from**: 2024 — open-weights became *competitive* (not just a fallback). The expectation flipped: every quarter has a major open release.
- **Lag**: ~1.5 years from LLaMA-1 (Feb 2023) to open-weights-as-credible-frontier (Jul 2024 with Llama 3.1 405B); ~2 years to open-weight reasoning parity (Jan 2025 DeepSeek-R1).
- **Evolution**: 2023 "good but a tier below" → 2024 "competitive on benchmarks" → 2025 "leading on cost/perf and some capabilities." Chinese open releases (DeepSeek, Qwen, GLM, Kimi) became dominant by 2025–2026.

---

## Summary Table: Theme Lag Times

| Theme | Seminal paper date | "Standard practice" date | Lag |
|---|---|---|---|
| Scaling laws | Jan 2020 | mid-2022 | ~2 yr |
| Emergent abilities (methodology) | Jun 2022 | ~2024 | ~2 yr |
| Instruction tuning / RLHF | Mar 2022 | late 2022 – 2023 | ~9 mo |
| DPO / preference tuning | May 2023 | 2024 | ~1.5 yr |
| In-context learning | May 2020 | mid-2020 (GPT-3 release) | ~3 mo |
| Chain-of-thought | Jan 2022 | late 2022 | ~6–9 mo |
| RAG | May 2020 | 2023–2024 | ~3 yr |
| MoE | Jan 2017 | 2024 | **~6–7 yr** |
| Multimodal (CLIP-era) | Feb 2021 | 2024 | ~3 yr |
| Tool use / agents | Oct 2022 | mid-2024 | ~1.5 yr |
| Open-weights frontier | Feb 2023 | Jul 2024 | ~1.5 yr |

- **Median lag**: ~1.5 years across the 11 rows above (ranged lags taken at their midpoints).
- **Mode (most common)**: ~1.5–2 years from arXiv preprint to "in every flagship."
- **Outliers**:
  - *Faster*: in-context learning (~3 mo) and CoT (~6–9 mo); both are prompting techniques that need no retraining.
  - *Slower*: MoE (~6–7 yr); it required infrastructure (sharding, all-to-all kernels) and only paid off once dense scaling started to plateau.
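
The summary statistics can be recomputed directly from the table. In this sketch the lag values are transcribed from the "Lag" column, with ranges replaced by their midpoints (for example ~6–9 mo becomes 0.625 yr); that midpoint convention is an assumption of the sketch, not something the table specifies.

```python
from statistics import median, multimode

# Lags in years, transcribed from the summary table; ranges use midpoints.
LAGS_YEARS = {
    "Scaling laws": 2.0,
    "Emergent abilities (methodology)": 2.0,
    "Instruction tuning / RLHF": 0.75,
    "DPO / preference tuning": 1.5,
    "In-context learning": 0.25,
    "Chain-of-thought": 0.625,
    "RAG": 3.0,
    "MoE": 6.5,
    "Multimodal (CLIP-era)": 3.0,
    "Tool use / agents": 1.5,
    "Open-weights frontier": 1.5,
}

values = sorted(LAGS_YEARS.values())
lag_median = median(values)
lag_modes = multimode(values)
slowest = max(LAGS_YEARS, key=LAGS_YEARS.get)
fastest = min(LAGS_YEARS, key=LAGS_YEARS.get)

print("median:", lag_median)
print("mode(s):", lag_modes)
print("slowest:", slowest)
print("fastest:", fastest)
```

Under these assumptions the median and the single most common value both come out at 1.5 years, with MoE and in-context learning as the two extremes.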

---

## Bibliometric Studies of LLM Publication Trends

1. **Fan, Li, Ma, Lee, Yu, Hemphill** (arXiv 2023; published in *ACM Transactions on Intelligent Systems and Technology*, 2024). *A Bibliometric Review of Large Language Models Research from 2017 to 2023*. arXiv:2304.02020. ~5,000 publications analyzed. Documents the inflection point in 2022–2023 publication volume and identifies core algorithmic vs applied subfields.

2. **Healthcare-focused bibliometric**: 500+ articles 2021–2024 (Journal of Multidisciplinary Healthcare, 2025) — top contributing countries US, Germany, UK.

3. **Environmental science / civil engineering bibliometric** covering 2018–2024 (EarthArXiv 2024) — documents diffusion of LLMs into adjacent fields.

4. **Bibliometric Analysis of Generative AI and LLMs in Scopus** — Mesopotamian Press *Applied Data Science and Analysis* (2024).

## "Where the Field Is Going" Roadmap Papers from Major Labs

1. **Stanford CRFM**: Bommasani et al. (Aug 2021), *On the Opportunities and Risks of Foundation Models*, arXiv:2108.07258. 100+ authors. The canonical agenda-setting paper for the "foundation model" framing.

2. **Comprehensive survey**: Zhao et al. (Mar 2023), *A Survey of Large Language Models*, arXiv:2303.18223. Organized around pre-training / adaptation / utilization / evaluation. Among the most-cited LLM surveys.

3. **Anthropic**: *Responsible Scaling Policy v3.0* (Feb 2026). Frontier Safety Roadmap with concrete capability thresholds; published recurring Risk Reports every 3–6 months. Less a research roadmap, more a deployment/safety one — but it explicitly defines what capabilities they're watching for.

4. **Emergent Abilities Survey**: arXiv:2503.05788 (Mar 2025) — synthesizes the post-Wei/Schaeffer literature.

5. **Stanford HAI**: Foundation Model Transparency Index (2023, ongoing) — tracks the maturity of practices across labs.

6. **DeepMind / Google**: No single roadmap paper, but Gemini technical reports (1.0 Dec 2023, 1.5 Feb 2024, 2.0 Dec 2024) collectively serve as a public roadmap for native-multimodal + long-context + agentic capability.
