49+ essential terms for GPU compute and AI infrastructure
API: Application Programming Interface - A set of protocols for building and integrating software applications.
Batching: Processing multiple requests together to improve efficiency.
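To make the idea concrete, here is a small NumPy sketch (all shapes and names are illustrative) showing how batching collapses several matrix-vector products into one larger matrix-matrix product that keeps the hardware busier:

```python
import numpy as np

# Illustrative example: one linear layer applied to requests
# individually vs. as a batch.
hidden = 1024
weights = np.random.randn(hidden, hidden)
requests = [np.random.randn(hidden) for _ in range(8)]

# Unbatched: eight separate matrix-vector products.
unbatched = [request @ weights for request in requests]

# Batched: stack the requests and issue a single, larger
# matrix-matrix product.
batch = np.stack(requests)   # shape (8, 1024)
batched = batch @ weights    # shape (8, 1024)

assert all(np.allclose(a, b) for a, b in zip(unbatched, batched))
```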
BF16 (bfloat16): Google's 16-bit floating-point format optimized for deep learning, offering the same dynamic range as FP32 with reduced precision.
Checkpoint: Saved state of a model during training, allowing training to resume or roll back from that point.
Cold Start: Initial delay when starting a service or loading a model for the first time.
Context Window: Maximum number of tokens an LLM can process in a single request.
CUDA: NVIDIA's parallel computing platform and programming model for general-purpose GPU computing.
Decentralized: Distributed system without a single point of control or failure.
Diffusion Model: Generative AI architecture that creates images or video by iteratively denoising random noise.
Docker: Platform for developing, shipping, and running applications in containers.
Egress: Data transfer out of a cloud service, often subject to additional fees.
Epoch: One complete pass through the entire training dataset.
Fine-Tuning: Adapting a pre-trained model to specific tasks or domains.
FP16: Half-precision floating-point format using 16 bits, common in AI training and inference.
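The difference between the two 16-bit formats above comes down to how the bits are split between exponent and mantissa. A short NumPy illustration (NumPy supports FP16 natively but has no built-in BF16, so BF16 is described only in the comments):

```python
import numpy as np

# FP16 (IEEE half): 1 sign bit, 5 exponent bits, 10 mantissa bits.
#   Finer precision, but the largest finite value is only 65504.
# BF16 (bfloat16):  1 sign bit, 8 exponent bits, 7 mantissa bits.
#   Same exponent range as FP32, so large values survive, at the
#   cost of coarser precision.
print(np.finfo(np.float16).max)   # largest finite FP16 value (65504)
print(np.float16(70000.0))        # inf: overflows FP16 (NumPy may warn)
print(np.float16(1.0) + np.float16(0.0004))  # 1.0: below FP16 precision
```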
GGUF: File format for quantized LLM models, optimized for CPU and mixed CPU/GPU inference (used by llama.cpp).
GPU: Graphics Processing Unit - Specialized hardware designed for massively parallel computation.
HBM: High Bandwidth Memory - Stacked memory technology used in data center GPUs (A100, H100), offering much higher bandwidth than conventional GDDR memory.
Hot Reload: Updating code or models without restarting the service.
Hugging Face: Platform and library ecosystem for sharing and deploying ML models and datasets.
Inference: The process of using a trained AI model to make predictions or generate outputs from new data.
Jupyter Notebook: Interactive computing environment for data science and machine learning.
Kubernetes: Container orchestration platform for automating deployment, scaling, and management of containerized applications.
Latency: The time delay between sending a request and receiving a response.
LLM: Large Language Model - AI models trained on vast amounts of text data to understand and generate human language.
LoRA: Low-Rank Adaptation - Parameter-efficient fine-tuning method that adds small trainable low-rank matrices to a frozen pre-trained model.
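A minimal NumPy sketch of the idea, with purely illustrative shapes: the pretrained weights stay frozen while only two small matrices train. With d = 1024 and rank r = 8, the adapter has ~16K trainable parameters instead of ~1M:

```python
import numpy as np

d, r = 1024, 8
W = np.random.randn(d, d)         # frozen pretrained weights
A = np.random.randn(r, d) * 0.01  # trainable, shape (r, d)
B = np.zeros((d, r))              # trainable, shape (d, r); zero-init
                                  # so the adapted model starts
                                  # identical to the base model

x = np.random.randn(d)
y = x @ (W + B @ A).T             # adapted forward pass

full_params = d * d               # 1,048,576
lora_params = d * r + r * d       # 16,384
```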
Markup: Percentage added to base cost to determine final pricing.
Mixture of Experts (MoE): Architecture where only a subset of model parameters activates for each input token, reducing compute cost relative to total parameter count.
NVLink: High-bandwidth GPU interconnect (up to 900 GB/s on H100), enabling fast multi-GPU communication.
ONNX: Open Neural Network Exchange - Open format for representing ML models, enabling portability across frameworks and runtimes.
OpenAI-Compatible API: APIs that follow OpenAI's interface standards, allowing easy migration between providers with minimal code changes.
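For example, the official openai Python client can target any compatible server by overriding base_url; the URL, key, and model name below are placeholders, not real endpoints:

```python
from openai import OpenAI

# Point the standard OpenAI client at a compatible server.
client = OpenAI(
    base_url="http://localhost:8000/v1",   # placeholder server
    api_key="not-needed-for-local-servers",
)

response = client.chat.completions.create(
    model="my-hosted-model",               # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```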
P95 Latency: 95th percentile response time - 95% of requests complete faster than this value.
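Computing it takes one line; the toy numbers below are made up:

```python
import numpy as np

# Example latencies in milliseconds for 10 requests.
latencies_ms = np.array([102, 98, 110, 95, 400, 105, 99, 101, 97, 103])

p95 = np.percentile(latencies_ms, 95)
print(f"p95 latency: {p95:.0f} ms")  # one slow request pulls p95
                                     # far above the ~100 ms median
```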
Pod: A containerized GPU instance that provides isolated compute resources.
Prompt Engineering: Crafting input instructions to optimize LLM outputs without changing the model's weights.
Quantization: Reducing model precision to decrease memory usage and increase inference speed.
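A toy symmetric int8 scheme shows the basic trade: 4x less memory for a small, bounded rounding error. This is an illustration, not a production quantizer:

```python
import numpy as np

# Map floats in [-max, +max] onto integers in [-127, 127].
weights = np.random.randn(4096).astype(np.float32)      # 4 bytes/value

scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 1 byte/value

# Dequantize to approximate the originals; the gap is the
# quantization error.
restored = quantized.astype(np.float32) * scale
error = np.abs(weights - restored).max()
print(f"4x smaller, max error ~{error:.4f}")
```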
RAG: Retrieval-Augmented Generation - Technique combining LLMs with external knowledge retrieval to ground responses in up-to-date, domain-specific information.
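A toy sketch of the retrieval step, with a deliberately crude stand-in for the embedding model and vector database a real system would use:

```python
import numpy as np

documents = [
    "The A100 has 80 GB of HBM2e memory.",
    "Tokens are roughly four characters of English text.",
    "NVLink connects GPUs at up to 900 GB/s.",
]

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: a normalized character histogram,
    # for illustration only.
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1
    return vec / (np.linalg.norm(vec) + 1e-9)

query = "How much memory does the A100 have?"
scores = [float(embed(doc) @ embed(query)) for doc in documents]
best = documents[int(np.argmax(scores))]

# The retrieved passage is prepended so the model answers from it.
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```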
RLHF: Reinforcement Learning from Human Feedback - Training technique using human preferences to align AI outputs with human values and expectations.
SLA: Service Level Agreement - Guaranteed performance and availability commitments from a provider.
SSH: Secure Shell - Protocol for secure remote access to computing resources.
Tensor: Multi-dimensional array used in deep learning computations.
Tensor Parallelism: Splitting a single model across multiple GPUs for inference when it is too large to fit in a single GPU's memory.
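A NumPy sketch of the column-parallel case: each "GPU" holds half the weight columns and the result shards are concatenated afterward. Real frameworks do the gather with collective communication over NVLink or a similar interconnect:

```python
import numpy as np

hidden, out = 1024, 4096
x = np.random.randn(hidden)
W = np.random.randn(hidden, out)

W_gpu0, W_gpu1 = np.split(W, 2, axis=1)  # each device stores half

y_gpu0 = x @ W_gpu0                      # computed on device 0
y_gpu1 = x @ W_gpu1                      # computed on device 1
y = np.concatenate([y_gpu0, y_gpu1])     # gather the output shards

assert np.allclose(y, x @ W)
```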
TFLOPS: Trillion floating-point operations per second. The standard metric for raw GPU compute performance.
Throughput: The amount of data processed per unit of time.
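Throughput and latency are related but distinct, as a quick back-of-the-envelope calculation with assumed numbers shows. Larger batches raise throughput, but each individual request can still wait just as long, which is why serving systems tune both together:

```python
# Assumed numbers: a server processes batches of 16 requests,
# each batch taking 2 seconds end to end.
batch_size = 16
batch_time_s = 2.0

throughput = batch_size / batch_time_s  # 8 requests/second
latency_s = batch_time_s                # each request still waits 2 s

print(f"throughput: {throughput} req/s, latency: {latency_s} s")
```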
Token: Basic unit of text processed by language models, roughly equivalent to four characters or three-quarters of an English word.
Tokens per Second: Measure of LLM inference speed - how many tokens the model generates per second.
Training: The process of teaching an AI model by feeding it data and adjusting its parameters.
TTFT: Time to First Token - Latency before an LLM starts generating output. Critical for perceived responsiveness in chat and other interactive applications.
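Both speed metrics above (TTFT and tokens per second) can be measured from any streaming response. The generator below is a stand-in for a real model:

```python
import time

def fake_stream():
    # Stand-in for a streaming LLM response: one yield per token.
    for _ in range(100):
        time.sleep(0.01)
        yield "tok"

start = time.perf_counter()
ttft = None
n_tokens = 0
for _ in fake_stream():
    if ttft is None:
        ttft = time.perf_counter() - start  # time to first token
    n_tokens += 1
elapsed = time.perf_counter() - start

print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Generation speed: {n_tokens / elapsed:.0f} tokens/sec")
```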
vLLM: High-throughput LLM serving engine using PagedAttention for efficient KV-cache memory management.
Production-grade inference server supporting continuous batching of incoming requests.
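As a concrete example of such a serving engine, vLLM's offline-inference quickstart looks roughly like this; the model name is only an example, and exact APIs can vary between versions:

```python
from vllm import LLM, SamplingParams

# Load a model and generate with PagedAttention-backed batching.
llm = LLM(model="facebook/opt-125m")  # example model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key benefit of PagedAttention is"], params)
print(outputs[0].outputs[0].text)
```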
VRAM: Video Random Access Memory - Dedicated memory on GPUs used to store model weights, activations, and intermediate data during computation.