AI prompt engineering glossary

Working definitions for the prompt engineering, LLM, RAG, diffusion, and AI agent terms you need when you're actually shipping. 201 entries, each written to be useful in one paragraph and accurate in five.

Concepts

  • Act-One (Runway) — Act-One is Runway's performance-capture feature that takes a webcam recording of a person's face and retargets the performance onto a generated character — making AI-generated characters convincingly act.
  • Agent handoff — Agent handoff is the multi-agent pattern where one agent decides another specialised agent should take over the task — transferring the conversation state to the new agent's context.
  • Agent loop — An agent loop is the repeating cycle of an AI agent — observe state, decide on an action (usually a tool call), execute, observe the result, and repeat — until a goal is reached or a stop condition fires.
  • Agent marketplace — An agent marketplace is a platform where users discover, distribute, and sometimes monetise pre-built AI agents — examples include the GPT Store, Claude Skills directory, Poe Bots, Hugging Chat Assistants.
  • Agent OS — Agent OS is the 2026 informal term for the layer of infrastructure — agent runtime, tool registry, state store, tracing, evals — that production agentic systems share, comparable to a traditional OS in role.
  • Agent rollback — Agent rollback is the pattern of restoring an agent's state to a previous checkpoint when the current trajectory has gone wrong — letting it try a different approach without starting over.
  • Agent sandbox — An agent sandbox is the isolated execution environment where an LLM-driven agent runs code, browses, or controls a desktop — the safety boundary that contains prompt-injection blast radius.
  • Agent tracing — Agent tracing captures the full execution graph of an agent run — every step, every tool call, every model output — so engineers can debug, audit, and improve the agent over time.
  • Agentic workflow — An agentic workflow is a multi-step business process orchestrated by AI agents — where one or more LLM-driven agents make decisions, call tools, and adapt to inputs rather than following a fixed automation script.
  • AI agent — An AI agent is a system where a language model autonomously plans and executes a sequence of tool calls to accomplish a goal.
  • AI companion — An AI companion is a chat-app category built around long-term emotional relationship rather than task completion — Replika, Character.AI, Talkie are the genre leaders in 2026.
  • AI Overview (Google) — AI Overview is Google's generative answer panel that appears at the top of search results for many queries — synthesised by Gemini from web sources with inline citations.
  • AI pair programming — AI pair programming is the practice of working alongside an AI coding assistant continuously, sharing intent, reviewing suggestions, and accepting, rejecting, or refining them, instead of using AI for occasional one-off tasks.
  • AI roleplay (LLM) — AI roleplay is the use of language models for interactive fiction, character chat, and collaborative storytelling — usually with a persistent character persona maintained across many turns.
  • AI search engine — An AI search engine answers a user's query by retrieving relevant web sources and synthesising a cited answer with a language model — the category that includes Perplexity, ChatGPT Search, Claude with web, and Gemini AI Overviews.
  • AI terminal — An AI terminal is a modern terminal emulator with built-in AI features — natural-language command search, agent modes, AI explanations of output — like Warp's Agent Mode in 2026.
  • AI watermarking — AI watermarking embeds invisible-to-humans signals in model output (text, image, audio, video) so the content can later be detected as AI-generated.
  • Async inference (Batch API) — Async inference (also called Batch API) submits LLM jobs that complete within 24 hours instead of seconds — used for non-interactive workloads at half the per-token price or less.
  • Attention mechanism — The attention mechanism is the transformer building block that lets each token in an input weight the importance of every other token when computing its representation — the core technique that made modern LLMs possible.
  • Auto-eval (LLM) — Auto-eval is the automated grading of LLM output — usually by an LLM judge with a rubric — that replaces or supplements human grading in eval suites.
  • Barge-in — Barge-in is the voice-agent feature where the user can interrupt the assistant mid-response — the assistant detects the speech and stops talking — making conversations feel natural instead of robotic turn-taking.
  • Batched inference — Batched inference packs multiple prompts into a single GPU forward pass, dramatically improving throughput and lowering unit cost at the expense of per-request latency.
  • BM25 — BM25 is the classic lexical retrieval algorithm — a tuned TF-IDF variant that scores documents by query-term frequency and inverse document frequency, still essential as part of hybrid search in 2026.
  • Bring-your-own-LLM (BYO-LLM) — Bring-your-own-LLM (BYO-LLM) is the developer pattern where a tool or product lets users configure their own model and API key — instead of locking them into the product's bundled LLM.
  • Browser agent — A browser agent is an LLM-driven system that controls a real or headless web browser to navigate sites, fill forms, click, and extract data — automating tasks that require interacting with web UIs.
  • Character card — A character card is a structured description of a character used in AI roleplay — name, appearance, personality, backstory, speaking style — loaded as the system prompt at the start of a conversation.
  • Cheap-tier model — A cheap-tier model is the small, fast, low-cost LLM that each major provider ships alongside its frontier model, such as Claude Haiku, GPT-4o-mini, Gemini Flash, or Mistral Small. The tier is used for routing, classification, extraction, and bulk inference; low-priced but larger models such as DeepSeek V3 often fill the same role.
  • Code completion (AI) — AI code completion is the inline suggestion of code as a developer types — the autocomplete category dominated by Cursor Tab, GitHub Copilot, Windsurf, Tabnine, and Codeium in 2026.
  • Cold start (inference) — Cold start is the delay incurred when a serverless inference function loads its model into GPU memory for the first time after being idle — typically 5-60 seconds for large LLMs.
  • Content provenance — Content provenance is cryptographic metadata attached to media that records how it was created, by whom or what model, and what edits it has been through.
  • Context window — The context window is the maximum number of tokens — system prompt, conversation history, retrieved documents, and the response — that a language model can process in a single turn.
  • CRDT (Conflict-free Replicated Data Type) — A CRDT is a data structure designed so that concurrent edits from multiple clients automatically merge into a consistent result — used to power realtime collaboration without server-side conflict resolution.
  • Danbooru tags — Danbooru tags are the structured tagging vocabulary inherited from the Danbooru imageboard — character, style, scene, expression tags — used heavily as the prompt language for anime-style diffusion models in 2026.
  • Deepfake — A deepfake is synthetic media — image, audio, or video — that depicts a real person doing or saying something they did not actually do, produced by AI generation or face/voice swap.
  • Dev loop (AI-assisted) — The AI-assisted dev loop is the inner-loop pattern engineers use in 2026 — plan with AI, edit in IDE, run tests, iterate — where AI participates at every step rather than just code generation.
  • Edge function — An edge function is a small serverless function that runs at the network edge — geographically close to the user — typically on Cloudflare Workers, Deno Deploy, Vercel Edge, or AWS Lambda@Edge.
  • Embedded bot (LLM) — An embedded bot is an LLM-powered assistant that lives inside a product's surface — a chat widget on a website, a sidebar in a SaaS app, a CLI helper — rather than as a standalone chat platform.
  • Embeddings — Embeddings are dense numeric vectors that represent the meaning of text, images, or other data, allowing similarity to be measured as vector distance.
  • Evals (LLM evaluations) — Evals are systematic tests that measure how well a language model or LLM-powered system performs on a defined task using a golden set of inputs and reference outputs.
  • Evals-driven development — Evals-driven development is the discipline of writing the eval suite first, then iterating prompts and models against it — borrowing test-driven development for LLM work.
  • Extended thinking — Extended thinking is Anthropic's flag on Claude that allocates a configurable budget of internal reasoning tokens before the user-visible answer — enabling deeper reasoning on hard problems for a higher cost.
  • Function call validation — Function call validation is the server-side check that ensures an LLM-emitted function call has a valid name, schema-compliant arguments, and acceptable value ranges — before the tool actually runs.
  • Function router LLM — A function router LLM is a small fast model whose only job is to classify the incoming request and emit a structured tool call — pre-deciding which function the downstream pipeline should run.
  • Generative UI — Generative UI is the pattern where a language model dynamically chooses, configures, and renders UI components in real time based on the user's intent — instead of returning text, it returns a UI.
  • Grounding — Grounding is any technique that ties a language model's output to verifiable sources — retrieved documents, tool results, structured data — instead of pure memory.
  • Guardrails — Guardrails are deterministic checks layered around a language model to prevent unsafe, off-topic, or non-compliant outputs from reaching the user.
  • Human-in-the-loop — Human-in-the-loop is the design pattern of placing human approval checkpoints inside an AI workflow — gating destructive actions, low-confidence outputs, or high-stakes decisions on explicit human review.
  • IDE agent — An IDE agent is an AI coding assistant embedded inside an integrated development environment — Cursor, Cline, Windsurf, Roo Code, GitHub Copilot agent mode — that can edit multiple files, run commands, and run tests autonomously from inside the editor.
  • In-context learning — In-context learning is when a language model adapts its behaviour from examples shown in the prompt — no weights change, no fine-tuning.
  • IndexNow — IndexNow is a publisher-side protocol that lets a website notify search engines instantly when content is added or changed — supported by Bing, Yandex, Yep, Naver, and Seznam with a single ping.
  • Instruction hierarchy — Instruction hierarchy is a model's trained ordering of trust — system prompt outranks user message which outranks retrieved content — used to resist prompt injection and jailbreak attempts.
  • Knowledge cutoff — The knowledge cutoff is the date after which a language model has no training data — anything that happened after is unknown to it unless supplied at inference time.
  • KV cache — The KV (key-value) cache stores the attention keys and values for tokens already processed, so generating each new token reuses them instead of recomputing keys and values for the entire history.
  • Local LLM — A local LLM is a language model that runs entirely on the user's own machine — laptop, desktop, or self-hosted server — rather than via a cloud API, trading some quality for privacy, offline access, and zero per-token cost.
  • LoRA hot-swapping — LoRA hot-swapping is the serving pattern where many fine-tuned LoRA adapters share a single base model on GPU — the appropriate adapter is loaded per request without reloading the base model.
  • Managed service — A managed service is a cloud-hosted offering where the provider runs the infrastructure — Supabase, Pinecone, n8n Cloud, Anthropic API — and the user pays for usage rather than operating the underlying systems.
  • MCP (Model Context Protocol) — MCP is an open protocol from Anthropic that standardises how language models connect to external tools, data sources, and prompts — the USB-C of LLM integrations.
  • Model card — A model card is structured documentation accompanying a released model — what it does, what it was trained on, its evaluation results, intended uses, and known limitations.
  • Model Context Protocol (MCP) — Model Context Protocol (MCP) is Anthropic's open standard for connecting AI assistants to external data sources and tools — letting any compliant client use any compliant server's capabilities.
  • Model personality — Model personality is the consistent voice, tone, and value set baked into a frontier LLM through training and system prompts. Claude's careful helpfulness, GPT's task-completion focus, and Grok's irreverence are distinct personalities in 2026.
  • Model router — A model router picks which language model handles each request based on cost, latency, or task type — the standard production pattern in 2026.
  • Model router policy — A model router policy is the rule set that decides which model handles each request — usually as a chain of conditions (intent, latency budget, cost ceiling, quality required) over the available model set.
  • MoE routing — MoE routing is the per-token gating function inside a Mixture-of-Experts model that selects which expert sub-networks process each token. It is the critical detail that determines both the quality and the efficiency of an MoE model.
  • Motion brush — Motion brush is the AI video tool that lets a user paint motion onto specific regions of an image — telling the model where motion should happen and which direction it should go — instead of relying purely on text prompts.
  • Multi-agent system — A multi-agent system is a coordinated set of specialised AI agents that delegate to each other — each agent has a focused role, tool set, and system prompt rather than one mega-agent doing everything.
  • Multimodal model — A multimodal model accepts more than one input type — text plus images, audio, or video — and reasons across them in a single forward pass.
  • Node graph workflow — A node graph workflow is a visual programming pattern — most prominently ComfyUI, n8n, Make, LangGraph — where a pipeline is built as connected nodes that pass data along edges.
  • Output guard — An output guard is a deterministic check applied to a language model's response before it reaches the user — validating JSON shape, blocking unsafe content, refusing if confidence is low, or rewriting failures.
  • PagedAttention — PagedAttention is vLLM's memory-management technique that partitions the KV cache into fixed-size pages — borrowed from OS virtual memory — to eliminate fragmentation and enable efficient KV-cache sharing.
  • Personal memory (AI assistant) — Personal memory in AI assistants is the per-user persistent context — preferences, facts, history — that the assistant retains across sessions to personalise its behaviour and answers.
  • Plan-first workflow — Plan-first workflow is the agent pattern of explicitly drafting and (sometimes) confirming the plan before executing — catching misunderstandings before code is changed, instead of after.
  • Prefix caching — Prefix caching reuses the KV-cache state computed for a shared prompt prefix across many requests, so the prefix is processed once and amortised over all subsequent calls.
  • Prompt caching — Prompt caching reuses the model's internal state for a repeated prompt prefix, so the prefix is computed once and later calls that reuse it are typically billed at a reduced cached-input rate rather than the full price.
  • Prompt engineering — Prompt engineering is the practice of designing input text that reliably steers a large language model toward a specific output.
  • Prompt orchestration — Prompt orchestration is the discipline of coordinating multiple LLM calls — routing, chaining, branching, retrying — to compose a reliable end-to-end workflow from individually less-reliable steps.
  • Prompt rewriter — A prompt rewriter is a layer — often a small LLM — that takes the user's raw query and rewrites it into a form that downstream retrieval or generation handles better.
  • Prompt template — A prompt template is a parameterised prompt — a string with named slots (e.g. {{name}}, {{context}}) that get filled at runtime so the same skeleton can serve many requests.
  • Prompt versioning — Prompt versioning is the discipline of treating prompts as source-controlled artefacts — each prompt has a versioned ID, a deploy history, and a regression-tested change log.
  • Rate limit — A rate limit is a hard cap on how many requests or tokens an API will accept from a single client in a given time window, and one of the most common production failure modes for LLM apps.
  • Real-time knowledge — Real-time knowledge is an LLM's access to information from the past minutes, hours, or days via live data feeds such as Grok's X firehose, Perplexity's web search, or ChatGPT's browsing, as opposed to the static knowledge fixed at the model's training cutoff.
  • Realtime sync — Realtime sync is the pattern where every connected client receives updates within milliseconds of a database change — without polling — via WebSockets, server-sent events, or similar live-channel mechanisms.
  • Reasoning tokens — Reasoning tokens (or thinking tokens) are the internal chain-of-thought tokens that reasoning models produce before the user-visible answer. They are typically billed as output tokens and are often hidden from, or only summarised for, the end user.
  • Regression suite (LLM) — A regression suite is the standing set of evals that runs on every prompt change, model upgrade, or pipeline modification — designed to catch quality regressions on previously-working cases.
  • Response streaming — Response streaming pipes the model's output token-by-token to the client as it's generated, so users see text appearing in real time instead of waiting for the full answer.
  • Response validation — Response validation is the post-generation check that ensures a language model's output meets schema, content, and quality constraints before it's used downstream — distinct from guardrails which gate the call itself.
  • Retrieval evals — Retrieval evals measure how well a RAG system's retrieval stage performs — Recall@K, nDCG@K, coverage — separately from the generation quality of the answer.
  • Router fallback — A router fallback is a chain of model providers that the application tries in order — failing over from primary to secondary to tertiary on 429s, 500s, or quality thresholds.
  • Router LLM — A router LLM is a small fast language model whose only job is to classify or rewrite an incoming request — deciding which downstream model, agent, or tool should handle it.
  • Scaffolded agent — A scaffolded agent is built on top of a strong general-purpose framework (LangGraph, OpenAI Agents SDK) that provides the agent loop, tracing, and tool-use plumbing — letting developers focus on the domain logic.
  • Scaling law (LLM) — A scaling law is an empirical relationship — typically a power law — between a language model's loss and inputs like parameter count, training compute, or training data size.
  • Schema.org — Schema.org is a shared vocabulary of structured-data types — Article, Product, Person, FAQPage, DefinedTerm — embedded as JSON-LD in HTML so search engines and AI answer engines can extract structured meaning from a page.
  • Self-hosted LLM — A self-hosted LLM runs entirely on infrastructure you control — your GPUs, your servers, your data residency — versus calling a cloud API.
  • Semantic cache (LLM) — A semantic cache stores LLM responses keyed by the meaning of the request — embedding-based lookup returns a cached answer when a new query is semantically close enough.
  • Semantic search — Semantic search finds documents by meaning rather than keyword match, using embedding similarity in a vector space.
  • Serverless database — A serverless database scales compute and storage independently and bills based on actual use — no fixed instance provisioning — typical of Neon, PlanetScale, Supabase, Convex in 2026.
  • Shadow deployment (LLM) — Shadow deployment runs a new model or prompt alongside the production one — receiving the same traffic but never showing output to users — to measure quality, latency, and cost before flipping live.
  • Side-by-side eval — Side-by-side eval presents two model or prompt outputs to a rater (human or LLM judge) for direct comparison — "which one is better?" — instead of grading each on an absolute scale.
  • Skill (Claude / GPT) — A Skill is a packaged capability — instructions + tools + files — that an AI assistant can selectively load to handle a specific task type, distinct from a Custom GPT in distribution and surface.
  • Speculative execution (agents) — Speculative execution in agents launches multiple plausible tool calls in parallel before knowing which the user wants — accepting the winning result and discarding the others — to cut perceived latency.
  • Spot instance (AI training) — A spot instance is a cloud GPU rented at a discount (often 50-90% off) on the condition that the provider can reclaim it on short notice — used for cost-sensitive training that can checkpoint and resume.
  • Stateful agent — A stateful agent persists state — memory, learned facts, long-running context — across sessions, in contrast to stateless agents that start fresh on every conversation.
  • Streaming STT — Streaming STT (speech-to-text) emits partial transcriptions as the user speaks — instead of waiting for end-of-utterance — enabling sub-second response from a voice assistant.
  • Structured prompt — A structured prompt has explicit sections (role, task, constraints, format, examples, input) instead of free-form prose — the dominant pattern for production LLM prompts in 2026.
  • Synthetic data — Synthetic data is training or evaluation data generated by a model rather than collected from humans — increasingly used to fine-tune smaller models and to fill gaps in real datasets.
  • System card — A system card extends the model card concept to cover an entire AI system — the model plus prompts, tools, retrieval, guardrails, and intended deployment context.
  • System message — A system message is the highest-priority instruction message in a chat-style API call — used to set role, constraints, and behaviour for the entire conversation.
  • System prompt — A system prompt is the high-priority instruction block that defines a model's role, constraints, and default behaviors for an entire conversation.
  • Terminal agent — A terminal agent is an AI coding assistant that runs in the user's terminal — Claude Code, Aider, OpenAI Codex CLI — interacting with the file system and shell rather than an IDE editor.
  • Test-time scaling — Test-time scaling is the trend of allocating more inference compute — longer reasoning traces, more samples, more verification — to get better answers from the same trained model.
  • Text-to-3D — Text-to-3D is generative AI that produces 3D models — meshes, textures, sometimes rigged animation — from a natural-language prompt or a single image.
  • Throughput per dollar — Throughput per dollar is the production metric for LLM inference cost: tokens served per dollar of GPU spend, that is, tokens per second divided by the hardware cost per second. It is used to compare inference engines, serving platforms, and hardware in 2026.
  • Token — A token is the smallest unit a language model reads or writes — typically a sub-word fragment, with one English word averaging about 1.3 tokens.
  • Token budget — A token budget is the maximum number of tokens an application allows for a single LLM call (or an agent loop) — enforced to control cost, latency, and runaway behaviour.
  • Tokenizer — A tokenizer is the algorithm that splits text into the tokens a language model actually reads. In 2026 this is most often byte-pair encoding (BPE) or a similar subword scheme, implemented by libraries such as SentencePiece or tiktoken.
  • Tool bundling — Tool bundling groups related tools behind a single high-level interface — so the agent calls one tool that does the right thing internally rather than choosing between many narrow tools.
  • Tool router — A tool router is a layer in an agent that decides which tool to call (or which sub-agent to delegate to) for a given step — distinct from a model router which picks the underlying LLM.
  • Tool use (LLM) — Tool use is the umbrella term for any LLM mechanism that lets the model invoke external functions, APIs, or services — function calling, code interpreter, MCP servers, browser actions.
  • User intent classification — User intent classification is the layer that determines what the user actually wants from a query — used to route to the right agent, tool, or response strategy before generation.
  • Verified knowledge — Verified knowledge is a curated corpus of facts that have been confirmed by humans or trusted sources — used to ground LLM answers and to detect hallucinations against a known-good baseline.
  • Vibe coding — Vibe coding is the 2024-2026 idiom for building software primarily by describing intent to an AI coding assistant — coding by feel and outcome rather than line-by-line authorship.
  • Vibe eval — Vibe eval is the pejorative for unsystematic eyeball-grading of LLM output — "it feels better" rather than measurable rubric-based comparison. The opposite of proper evals.
  • Virtual context (LLM) — Virtual context is the agent-memory pattern introduced by MemGPT (now Letta) — a small in-context working memory plus an external archival memory the model can read from and write to as the conversation grows.
  • Voice (LLM apps) — Voice in LLM apps refers to the full speech pipeline — speech-to-text (STT), language model, text-to-speech (TTS) — that lets users converse with an AI assistant in spoken language.
  • Voice activity detection (VAD) — Voice activity detection is the lightweight signal-processing step that determines whether incoming audio contains speech — used to start STT, trigger barge-in, and gate microphone use in voice agents.
  • Workflow engine — A workflow engine is the orchestration runtime — n8n, Make.com, Zapier, Temporal, Airflow — that executes multi-step business processes, handles retries, manages state, and integrates with external systems.
  • World model — A world model is an AI model that learns a representation of how an environment evolves over time — enabling planning, simulation, and prediction inside the model itself rather than in real environments.
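
The agent loop entry above (observe state, decide on an action, execute, observe the result, repeat until a goal or stop condition) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production agent: `TOOLS`, `choose_action`, and `run_agent` are hypothetical names, the "state" is just an integer, and the decision step is a hand-written rule standing in for a real model call.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: Dict[str, Callable[[int], int]] = {
    "add_one": lambda x: x + 1,
    "double": lambda x: x * 2,
}


def choose_action(state: int, goal: int) -> Optional[str]:
    """Stand-in for the model's decision step (a real agent would call an LLM)."""
    if state == goal:
        return None  # goal reached, nothing left to do
    if state > 0 and state * 2 <= goal:
        return "double"
    return "add_one"


def run_agent(start: int, goal: int, max_steps: int = 20) -> Tuple[int, List[str]]:
    """Run the loop until the goal is reached or the step budget is spent."""
    state = start
    trace: List[str] = []
    for _ in range(max_steps):               # stop condition: step budget
        action = choose_action(state, goal)  # observe state, decide on an action
        if action is None:                   # stop condition: goal reached
            break
        state = TOOLS[action](state)         # execute the tool call
        trace.append(action)                 # keep a trace of what was done
    return state, trace


final_state, trace = run_agent(start=1, goal=10)
print(final_state, trace)
```

The `max_steps` argument plays the role of the stop condition (and of a crude token budget), and the `trace` list is the simplest possible form of agent tracing.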

Techniques

  • A/B testing prompts — A/B testing prompts runs two prompt variants against the same input distribution and compares scored outputs, attributing quality differences to the prompt change.
  • Approval workflow — An approval workflow is the agent pattern where high-impact actions (send email, make purchase, delete data) pause for human approval before executing — the production-safe alternative to fully autonomous agents.
  • Chain-of-density — Chain-of-density (CoD) is a prompting technique for summarisation that asks the model to iteratively produce denser summaries — each rewrite keeps the length but adds more entities.
  • Chain-of-thought prompting — Chain-of-thought (CoT) prompting tells a language model to write its reasoning steps before its final answer, increasing accuracy on multi-step problems.
  • Chain-of-verification — Chain-of-verification (CoVe) is a prompting technique where the model first drafts an answer, then generates verification questions for each claim, answers them independently, and revises the draft accordingly.
  • Computer use — Computer use is the agent capability where an LLM controls a real desktop or browser via screenshots plus mouse and keyboard primitives. Anthropic introduced it in 2024, and it is mainstream across Claude, GPT, and Gemini in 2026.
  • Constitutional AI — Constitutional AI is Anthropic's alignment method where a model is trained to follow a written constitution — a set of principles applied during self-critique and revision — without per-task human preference labels at every step.
  • Context distillation — Context distillation summarises an agent's growing conversation history into a compact representation, so each step's input stays small while preserving the relevant signal.
  • Context pinning — Context pinning explicitly keeps critical pieces of information at the head or tail of an agent's prompt across many turns — defending against the lost-in-the-middle recall problem on long contexts.
  • Contextual retrieval — Contextual retrieval prepends a chunk's surrounding context (document title, section, summary) to each chunk before embedding, dramatically improving retrieval relevance on long documents.
  • ControlNet — ControlNet is a neural-network architecture that conditions a diffusion image model on extra spatial inputs — edges, depth, pose, segmentation — for precise control over output structure.
  • Conversation compaction — Conversation compaction summarises a long agent or chat history into a tight representation that preserves the relevant signal — used when the conversation approaches the model's context window.
  • Direct preference optimisation (DPO) — Direct preference optimisation is a fine-tuning method that aligns a model to human preferences directly from preference pairs — without training an explicit reward model first.
  • Distillation — Distillation trains a smaller "student" model to mimic a larger "teacher" model's outputs, capturing most of the quality at a fraction of the inference cost.
  • Embedding clustering — Embedding clustering groups documents, queries, or users by embedding similarity — used for topic discovery, deduplication, semantic indexing, and personalisation.
  • Embedding fine-tuning — Embedding fine-tuning adapts a pretrained embedding model to your domain by training on (anchor, positive, negative) triplets — improving retrieval recall on domain-specific terminology that off-the-shelf models miss.
  • Ensemble prompting — Ensemble prompting runs the same task with multiple different prompts (or models) and aggregates the responses — typically majority vote, weighted average, or a final reconciliation model.
  • Few-shot prompting — Few-shot prompting supplies 2–10 input–output examples inside the prompt so the model imitates the pattern on a new input.
  • Fine-tuning — Fine-tuning updates a pretrained model's weights on task-specific data, baking the new behaviour into the model rather than relying on prompts.
  • Function calling (tool use) — Function calling lets a language model emit a structured request to invoke a developer-defined tool, enabling reliable JSON output and agent workflows.
  • Graph RAG — Graph RAG builds a knowledge graph from the corpus during ingestion — entities, relationships, facts — and retrieves via graph traversal alongside vector search, improving recall on relational queries.
  • Hybrid search (retrieval) — Hybrid search combines dense vector retrieval with sparse keyword (BM25) retrieval, then fuses the two ranked lists — the production retrieval default for RAG in 2026.
  • Image-to-video — Image-to-video is the AI generation pattern where a static image is the starting frame of a generated video — combined with text prompts and optionally motion brush + camera controls — for precise creative control.
  • In-context RAG — In-context RAG skips a vector index entirely and stuffs the whole knowledge base into the prompt. It is only viable when the corpus fits in the model's context window and is small enough that building and maintaining a retrieval pipeline would cost more than the extra tokens sent on every call.
  • Instruction tuning — Instruction tuning is the post-training stage where a base language model is fine-tuned on examples of (instruction, ideal response) pairs to follow human instructions reliably.
  • LLM jury — An LLM jury is an evaluation pattern where multiple LLM judges score the same output, and their scores are aggregated to reduce single-judge bias.
  • Long-context prompting — Long-context prompting is the discipline of writing prompts that exploit 200K-1M+ token windows effectively — chunk ordering, head-and-tail anchoring, summarisation, and recall-aware structure.
  • LoRA (Low-Rank Adaptation) — LoRA is a fine-tuning method that trains a small set of low-rank adapter weights on top of a frozen base model — cheaper to train and store than full fine-tuning.
  • LoRA stacking — LoRA stacking applies multiple LoRA adapters simultaneously to a diffusion model — combining a character LoRA, a style LoRA, and a quality LoRA — to compose effects without retraining.
  • Mixture of agents — Mixture of agents is an inference pattern where multiple specialised LLM agents run in parallel and an aggregator model combines their outputs into a single answer, often with higher quality than any single agent but at higher cost and latency.
  • Multimodal RAG — Multimodal RAG retrieves images, audio, video, or tables alongside (or instead of) text, embedding each modality with a compatible encoder so they can be searched and ranked together.
  • Negative prompt — A negative prompt is text that tells an image, video, or audio generator what to avoid producing — the opposite of the main prompt.
  • Persona prompting — Persona prompting is the practice of assigning the model a specific identity, expertise, and audience in the system prompt to steer voice, tone, and answer depth.
  • Prompt chaining — Prompt chaining splits a complex task into a sequence of smaller LLM calls — each step's output feeds the next — improving reliability over a single mega-prompt.
  • Prompt tuning — Prompt tuning trains a small set of "soft prompt" tokens — continuous vectors that prepend to the model input — to specialise a frozen LLM for a task with minimal parameters.
  • RAG fusion — RAG fusion runs multiple query rewrites in parallel against the retrieval index and fuses the ranked results — improving recall on ambiguous or multi-aspect queries.
  • ReAct pattern — ReAct interleaves Reasoning + Acting in an agent loop — the model writes a thought, then decides to call a tool, then observes the result, then thinks again.
  • Reciprocal rank fusion (RRF) — RRF is the standard way to combine multiple retrieval rankings (e.g., BM25 + vector) into one final score — sum `1 / (k + rank)` across all rankings for each document, sort by total.
  • Retrieval-augmented generation (RAG) — Retrieval-augmented generation (RAG) injects relevant documents into the prompt at query time so the model answers from your data instead of its training memory.
  • Self-consistency — Self-consistency runs the same prompt multiple times at non-zero temperature and picks the most common final answer, raising accuracy on reasoning tasks.
  • Self-correction (LLM) — Self-correction is a prompting pattern where the model reviews its own initial answer, identifies errors, and produces a revised answer — a cheap reliability boost for many tasks.
  • Semantic routing — Semantic routing classifies an incoming query by meaning — via embedding similarity to predefined route prototypes — and dispatches it to the right model, agent, or sub-system.
  • Speaker diarisation — Speaker diarisation is the technique of segmenting an audio recording by who-spoke-when — answering "who said what" rather than just "what was said" — used heavily in meeting transcription, podcasts, and call analytics.
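  • Speculative decoding — Speculative decoding is an inference technique where a small "draft" model proposes several tokens at once and a large "verifier" model checks them in a single parallel pass, accepting or rejecting each one. It often cuts latency by 2-4x, depending on how frequently the drafts are accepted.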
  • Speculative RAG — Speculative RAG runs a small fast model to draft an answer + identify what's uncertain, then retrieves and verifies only the uncertain claims with the strong model — saving cost on confident parts.
  • Vector RAG — Vector RAG is the classic retrieval-augmented generation pattern (embed documents, store them in a vector DB, retrieve by query-embedding similarity, inject the top-K into the prompt), as distinct from graph RAG, in-context RAG, or hybrid RAG.
  • Voice cloning — Voice cloning takes a sample of someone speaking — sometimes as little as 30 seconds — and produces a model that can synthesise new speech in that voice.
  • Zero-shot prompting — Zero-shot prompting asks the model to perform a task with no examples — only the instruction and the input.
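
The reciprocal rank fusion entry above gives the formula `1 / (k + rank)`; a minimal sketch of it follows. The function name, the sample document ids, and the choice of `k = 60` are illustrative assumptions, not values taken from this glossary.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of document ids into one ranking.

    k is a smoothing constant; 60 is a commonly used default and is an
    assumption here, not something the glossary specifies.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from a keyword retriever and a vector retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale, which is the main reason this fusion is convenient for hybrid search.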

Parameters

  • CFG scale (classifier-free guidance) — CFG scale controls how strongly a diffusion image model follows its text prompt — higher values stick closer to the prompt, lower values explore more.
  • Seed — A seed is an integer that initialises the random number generator inside an image, video, or audio model, making generation reproducible when every other setting is unchanged.
  • Temperature — Temperature is a sampling parameter that controls randomness in a language model's output, where 0 means greedy decoding (near-deterministic in practice) and higher values introduce more variety.
  • Top-p (nucleus sampling) — Top-p (nucleus sampling) restricts the model to the smallest set of tokens whose cumulative probability is at least p, then samples from that set.
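
The temperature and top-p entries above can be sketched together as one toy sampler. This is a simplified illustration over a plain list of logits; the function names are hypothetical and real inference stacks apply these steps to tensors, often with additional filters.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability is at least p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalise so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Sample one token index; temperature 0 is treated as greedy argmax."""
    rng = rng or random.Random()
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` as `rng` makes the draw repeatable, which mirrors the role the seed entry describes for image and audio models.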

Models

  • Diffusion model — A diffusion model is a generative neural network that creates images, video, or audio by iteratively denoising random noise toward a learned target distribution.
  • Mixture of Depths — Mixture of Depths (MoD) is an efficiency technique where the model learns to skip some layers for some tokens — applying compute selectively based on token importance.
  • Mixture of Experts (MoE) — Mixture of Experts is an architecture where a router activates only a subset of the model's parameters per token, so total parameter count is huge but per-token compute stays low (all experts still have to be held in memory).
  • Reasoning model — A reasoning model is an LLM trained to produce extensive internal chain-of-thought before its final answer, trading latency for higher accuracy on hard problems.
  • Reranker — A reranker is a small cross-encoder model that takes a query + a candidate document and outputs a relevance score — used as the second stage after embedding retrieval to push the right answer to the top.
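
The router in the Mixture of Experts entry above can be illustrated with a toy top-k gate. This sketch assumes scalar inputs, plain Python callables as hypothetical experts, and a softmax taken over the selected experts only; real architectures differ in where the softmax is applied and in how routing load is balanced.

```python
import math


def top_k_route(router_logits, k=2):
    """Select the k highest-scoring experts and softmax their logits into weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    weights = top_k_route(router_logits, k)
    # Experts that were not selected are never called, which is where
    # the per-token compute saving comes from.
    return sum(w * experts[i](x) for i, w in weights.items())


# Four hypothetical experts; in a real model each would be a feed-forward block.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
gates = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
mixed = moe_forward(3.0, experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

Here two of the four experts run for the input, so roughly half of the expert compute is skipped even though all four must remain loaded.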

Tools

  • Code interpreter (LLM tool) — A code interpreter is a sandboxed execution environment that lets a language model run code (usually Python) it generates, inspect the results, and iterate — turning the model into a data analyst.
  • Graphiti — Graphiti is Zep's open-source temporal knowledge graph framework that ingests agent conversations + structured data, extracts facts with timestamps, and supports time-aware queries over the graph.
  • OpenRouter — OpenRouter is a unified API that lets you call 200+ language models through one endpoint with one API key — the de-facto model-router infrastructure layer in 2026.
  • Prompt marketplace — A prompt marketplace is a platform where creators publish, distribute, and sometimes sell prompt artefacts — recipes, templates, full agents — for image, video, text, and code generation tools.
  • Safety classifier — A safety classifier is a small specialised model that scores LLM input or output for unsafe categories — toxicity, PII, prompt injection, jailbreak, NSFW — so the application can refuse, rewrite, or escalate.
  • Vector database — A vector database stores embeddings and performs approximate nearest-neighbor search at scale, the persistence layer behind RAG and semantic search.

Output formats

  • JSON mode (structured output) — JSON mode forces a language model to emit only syntactically valid JSON. Conformance to a schema you supply generally requires a schema-constrained mode or validation in your application.
  • JSON Schema mode (strict) — JSON Schema mode is the API feature — OpenAI strict mode, Anthropic tool-use schemas, Gemini response_schema — that constrains a language model's output to match a supplied JSON Schema at decode time.
  • OpenAI strict schema mode — OpenAI strict schema mode is the API flag that constrains GPT output to exactly match the supplied JSON schema — required fields, types, enums — making the model unable to emit invalid output.
  • Strict mode (structured output) — Strict mode is OpenAI's constrained-decoding flag that guarantees model output conforms to a supplied JSON schema — the model literally cannot emit invalid syntax or violate the schema's shape.
  • Structured output — Structured output is any production prompt pattern that forces a language model to return data in a deterministic, machine-parseable form (JSON, XML, custom).
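
Where output is not constrained at decode time, structured output usually still needs a check on the application side. The sketch below uses only the standard library and a hypothetical mini-schema (field name mapped to a Python type); it is not JSON Schema validation and not any provider's API.

```python
import json


def parse_structured_output(raw, required):
    """Parse a model reply as JSON and check required fields and their types.

    required maps field name -> expected Python type. This mini-schema is a
    stand-in for illustration, not a JSON Schema implementation.
    Returns (data, errors); errors is empty when the reply is acceptable.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]
    errors = []
    for field, expected_type in required.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}")
    return data, errors


# Hypothetical schema and replies.
schema = {"title": str, "tags": list}
good, good_errors = parse_structured_output('{"title": "x", "tags": ["a"]}', schema)
bad, bad_errors = parse_structured_output('{"title": 3}', schema)
broken, broken_errors = parse_structured_output("not json", schema)
```

A non-empty error list is a natural trigger for a retry or a repair prompt, while an empty one means the data can be passed downstream.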

Failure modes

  • Context window stuffing — Context window stuffing is the antipattern of putting everything you might need into a single LLM call's context — degrading reasoning, blowing up cost, and obscuring which piece of context actually mattered.
  • Hallucination — A hallucination is when a language model produces output that is factually wrong, fabricated, or unsupported, while sounding confident.
  • Jailbreak (LLM) — A jailbreak is a prompt-level attack that bypasses a language model's safety guardrails, causing it to produce content the model was trained to refuse.
  • Model collapse — Model collapse is what happens when a model is trained or fine-tuned on its own outputs across generations — quality degrades, diversity shrinks, and tail knowledge is forgotten.
  • Prompt injection — Prompt injection is an attack where hostile content in a model's input (a webpage, a retrieved document, a user message) overrides the system prompt's instructions.
  • Prompt leakage — Prompt leakage is when a language model reveals its hidden system prompt, tool definitions, or other proprietary context to the user — usually under prompt-injection attack.
  • Tool shadowing — Tool shadowing is when two or more tools in an agent's toolkit overlap in purpose enough that the model routes ambiguously — usually picking the worse one or oscillating between them.