49+ essential terms for GPU compute and AI infrastructure
API: Application Programming Interface - A set of protocols for building and integrating software applications.
Batching: Processing multiple requests together to improve efficiency.
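To make the idea concrete, here is a small NumPy sketch (all shapes and names are illustrative) showing how batching collapses several matrix-vector products into one larger matrix-matrix product that keeps the hardware busier:

```python
import numpy as np

# Illustrative example: one linear layer applied to requests
# individually vs. as a batch.
hidden = 1024
weights = np.random.randn(hidden, hidden)
requests = [np.random.randn(hidden) for _ in range(8)]

# Unbatched: eight separate matrix-vector products.
unbatched = [request @ weights for request in requests]

# Batched: stack the requests and issue a single, larger
# matrix-matrix product.
batch = np.stack(requests)   # shape (8, 1024)
batched = batch @ weights    # shape (8, 1024)

assert all(np.allclose(a, b) for a, b in zip(unbatched, batched))
```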
BF16 (bfloat16): Google's 16-bit floating-point format optimized for deep learning, offering the same dynamic range as FP32 with reduced precision.
Checkpoint: Saved state of a model during training, allowing training to resume or roll back from that point.
Cold Start: Initial delay when starting a service or loading a model for the first time.
Context Window: Maximum number of tokens an LLM can process in a single request.
CUDA: NVIDIA's parallel computing platform and programming model for general-purpose GPU computing.
Decentralized: Distributed system without a single point of control or failure.
Diffusion Model: Generative AI architecture that creates images or video by iteratively denoising random noise.
Docker: Platform for developing, shipping, and running applications in containers.
Egress: Data transfer out of a cloud service, often subject to additional fees.
Epoch: One complete pass through the entire training dataset.
Fine-Tuning: Adapting a pre-trained model to specific tasks or domains.
FP16: Half-precision floating-point format using 16 bits, common in AI training and inference.
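The difference between the two 16-bit formats above comes down to how the bits are split between exponent and mantissa. A short NumPy illustration (NumPy supports FP16 natively but has no built-in BF16, so BF16 is described only in the comments):

```python
import numpy as np

# FP16 (IEEE half): 1 sign bit, 5 exponent bits, 10 mantissa bits.
#   Finer precision, but the largest finite value is only 65504.
# BF16 (bfloat16):  1 sign bit, 8 exponent bits, 7 mantissa bits.
#   Same exponent range as FP32, so large values survive, at the
#   cost of coarser precision.
print(np.finfo(np.float16).max)   # largest finite FP16 value (65504)
print(np.float16(70000.0))        # inf: overflows FP16 (NumPy may warn)
print(np.float16(1.0) + np.float16(0.0004))  # 1.0: below FP16 precision
```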
GGUF: File format for quantized LLM models, optimized for CPU and mixed CPU/GPU inference (used by llama.cpp).
GPU: Graphics Processing Unit - Specialized hardware designed for massively parallel computation.
HBM: High Bandwidth Memory - Stacked memory technology used in data center GPUs (A100, H100), offering much higher bandwidth than conventional GDDR memory.
Hot Reload: Updating code or models without restarting the service.
Hugging Face: Platform and library ecosystem for sharing and deploying ML models and datasets.
Inference: The process of using a trained AI model to make predictions or generate outputs from new data.
Jupyter Notebook: Interactive computing environment for data science and machine learning.
Kubernetes: Container orchestration platform for automating deployment, scaling, and management of containerized applications.
Latency: The time delay between sending a request and receiving a response.
LLM: Large Language Model - AI models trained on vast amounts of text data to understand and generate human language.
LoRA: Low-Rank Adaptation - Parameter-efficient fine-tuning method that adds small trainable low-rank matrices to a frozen pre-trained model.
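A minimal NumPy sketch of the idea, with purely illustrative shapes: the pretrained weights stay frozen while only two small matrices train. With d = 1024 and rank r = 8, the adapter has ~16K trainable parameters instead of ~1M:

```python
import numpy as np

d, r = 1024, 8
W = np.random.randn(d, d)         # frozen pretrained weights
A = np.random.randn(r, d) * 0.01  # trainable, shape (r, d)
B = np.zeros((d, r))              # trainable, shape (d, r); zero-init
                                  # so the adapted model starts
                                  # identical to the base model

x = np.random.randn(d)
y = x @ (W + B @ A).T             # adapted forward pass

full_params = d * d               # 1,048,576
lora_params = d * r + r * d       # 16,384
```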
Markup: Percentage added to base cost to determine final pricing.
Mixture of Experts (MoE): Architecture where only a subset of model parameters activates for each input token, reducing compute cost relative to total parameter count.
NVLink: High-bandwidth GPU interconnect (up to 900 GB/s on H100), enabling fast multi-GPU communication.
ONNX: Open Neural Network Exchange - Open format for representing ML models, enabling portability across frameworks and runtimes.
OpenAI-Compatible API: APIs that follow OpenAI's interface standards, allowing easy migration between providers with minimal code changes.
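For example, the official openai Python client can target any compatible server by overriding base_url; the URL, key, and model name below are placeholders, not real endpoints:

```python
from openai import OpenAI

# Point the standard OpenAI client at a compatible server.
client = OpenAI(
    base_url="http://localhost:8000/v1",   # placeholder server
    api_key="not-needed-for-local-servers",
)

response = client.chat.completions.create(
    model="my-hosted-model",               # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```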
P95 Latency: 95th percentile response time - 95% of requests complete faster than this value.
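Computing it takes one line; the toy numbers below are made up:

```python
import numpy as np

# Example latencies in milliseconds for 10 requests.
latencies_ms = np.array([102, 98, 110, 95, 400, 105, 99, 101, 97, 103])

p95 = np.percentile(latencies_ms, 95)
print(f"p95 latency: {p95:.0f} ms")  # one slow request pulls p95
                                     # far above the ~100 ms median
```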
Pod: A containerized GPU instance that provides isolated compute resources.
Prompt Engineering: Crafting input instructions to optimize LLM outputs without changing the model's weights.
Quantization: Reducing model precision to decrease memory usage and increase inference speed.
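A toy symmetric int8 scheme shows the basic trade: 4x less memory for a small, bounded rounding error. This is an illustration, not a production quantizer:

```python
import numpy as np

# Map floats in [-max, +max] onto integers in [-127, 127].
weights = np.random.randn(4096).astype(np.float32)      # 4 bytes/value

scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 1 byte/value

# Dequantize to approximate the originals; the gap is the
# quantization error.
restored = quantized.astype(np.float32) * scale
error = np.abs(weights - restored).max()
print(f"4x smaller, max error ~{error:.4f}")
```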
RAG: Retrieval-Augmented Generation - Technique combining LLMs with external knowledge retrieval to ground responses in up-to-date, domain-specific information.
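A toy sketch of the retrieval step, with a deliberately crude stand-in for the embedding model and vector database a real system would use:

```python
import numpy as np

documents = [
    "The A100 has 80 GB of HBM2e memory.",
    "Tokens are roughly four characters of English text.",
    "NVLink connects GPUs at up to 900 GB/s.",
]

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: a normalized character histogram,
    # for illustration only.
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1
    return vec / (np.linalg.norm(vec) + 1e-9)

query = "How much memory does the A100 have?"
scores = [float(embed(doc) @ embed(query)) for doc in documents]
best = documents[int(np.argmax(scores))]

# The retrieved passage is prepended so the model answers from it.
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```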
RLHF: Reinforcement Learning from Human Feedback - Training technique using human preferences to align AI outputs with human values and expectations.
SLA: Service Level Agreement - Guaranteed performance and availability commitments from a provider.
SSH: Secure Shell - Protocol for secure remote access to computing resources.
Tensor: Multi-dimensional array used in deep learning computations.
Tensor Parallelism: Splitting a single model across multiple GPUs for inference when it is too large to fit in a single GPU's memory.
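A NumPy sketch of the column-parallel case: each "GPU" holds half the weight columns and the result shards are concatenated afterward. Real frameworks do the gather with collective communication over NVLink or a similar interconnect:

```python
import numpy as np

hidden, out = 1024, 4096
x = np.random.randn(hidden)
W = np.random.randn(hidden, out)

W_gpu0, W_gpu1 = np.split(W, 2, axis=1)  # each device stores half

y_gpu0 = x @ W_gpu0                      # computed on device 0
y_gpu1 = x @ W_gpu1                      # computed on device 1
y = np.concatenate([y_gpu0, y_gpu1])     # gather the output shards

assert np.allclose(y, x @ W)
```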
TFLOPS: Trillion floating-point operations per second. The standard metric for raw GPU compute performance.
Throughput: The amount of data processed per unit of time.
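Throughput and latency are related but distinct, as a quick back-of-the-envelope calculation with assumed numbers shows. Larger batches raise throughput, but each individual request can still wait just as long, which is why serving systems tune both together:

```python
# Assumed numbers: a server processes batches of 16 requests,
# each batch taking 2 seconds end to end.
batch_size = 16
batch_time_s = 2.0

throughput = batch_size / batch_time_s  # 8 requests/second
latency_s = batch_time_s                # each request still waits 2 s

print(f"throughput: {throughput} req/s, latency: {latency_s} s")
```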
Token: Basic unit of text processed by language models, roughly equivalent to four characters or three-quarters of an English word.
Tokens per Second: Measure of LLM inference speed - how many tokens the model generates per second.
Training: The process of teaching an AI model by feeding it data and adjusting its parameters.
TTFT: Time to First Token - Latency before an LLM starts generating output. Critical for perceived responsiveness in chat and other interactive applications.
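Both speed metrics above (TTFT and tokens per second) can be measured from any streaming response. The generator below is a stand-in for a real model:

```python
import time

def fake_stream():
    # Stand-in for a streaming LLM response: one yield per token.
    for _ in range(100):
        time.sleep(0.01)
        yield "tok"

start = time.perf_counter()
ttft = None
n_tokens = 0
for _ in fake_stream():
    if ttft is None:
        ttft = time.perf_counter() - start  # time to first token
    n_tokens += 1
elapsed = time.perf_counter() - start

print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Generation speed: {n_tokens / elapsed:.0f} tokens/sec")
```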
vLLM: High-throughput LLM serving engine using PagedAttention for efficient KV-cache memory management.
Production-grade inference server supporting continuous batching of incoming requests.
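As a concrete example of such a serving engine, vLLM's offline-inference quickstart looks roughly like this; the model name is only an example, and exact APIs can vary between versions:

```python
from vllm import LLM, SamplingParams

# Load a model and generate with PagedAttention-backed batching.
llm = LLM(model="facebook/opt-125m")  # example model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key benefit of PagedAttention is"], params)
print(outputs[0].outputs[0].text)
```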
VRAM: Video Random Access Memory - Dedicated memory on GPUs used to store model weights, activations, and intermediate data during computation.