LLM inference is the process of using a trained large language model to turn a new input prompt into output such as answers, summaries, code, or tool calls. Training has already happened at this point; inference is the application phase, where the model uses what it has learned.

Most modern LLMs use a transformer architecture. Inference usually happens in two phases: prefill and decode.


In the prefill phase, the user prompt is tokenized and embedded. The model processes all input tokens in parallel and builds the key and value (KV) cache that represents the context. This phase is compute-bound: performance is mostly limited by raw matrix-multiplication throughput on the GPU. The key metric here is time to first token, the delay before the model starts answering.
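As a rough illustration, here is a minimal sketch of the prefill phase using the Hugging Face transformers library. The small "gpt2" checkpoint and the prompt are just placeholders; any causal language model exposes the same interface.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Explain LLM inference in one sentence."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

start = time.perf_counter()
with torch.no_grad():
    # Prefill: every prompt token goes through the model in one parallel
    # forward pass; use_cache=True returns the key/value cache for reuse.
    out = model(input_ids, use_cache=True)
past_key_values = out.past_key_values                 # KV cache built during prefill
first_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
print("time to first token (s):", time.perf_counter() - start)
```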
Once the first token is produced, the model switches to the decode phase, an autoregressive loop. It generates one new token at a time and reuses the existing KV cache, so it only needs to push the new token through the layers. This phase is memory-bound rather than compute-bound, because it constantly reads and writes the growing cache in GPU memory. The key metric here is time per output token.
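Continuing the prefill sketch above, a minimal greedy decode loop might look like the following; the 32-token budget and greedy argmax sampling are arbitrary choices for illustration.

```python
# Decode: feed only the newest token and reuse the growing KV cache.
generated = [first_token]
next_token = first_token
with torch.no_grad():
    for _ in range(32):                               # arbitrary output budget
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values         # cache grows one position per step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```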
LLM inference is expensive: it needs large amounts of GPU memory and sustained compute, so optimizing this stage directly affects both user experience and cost.
Latency. Lower time to first token and lower inter-token delay make chat interfaces feel responsive; the timing sketch after this list shows how these metrics can be measured.
Throughput. Higher throughput lets a single cluster serve many users and agents.
Cost. Better utilization means fewer idle GPUs and lower cost per token.
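To make the latency and throughput terms concrete, here is a small sketch of how they could be derived from per-request timestamps. The helper function, its name, and the example numbers are hypothetical; the timestamps themselves would come from whatever serving stack is in use.

```python
# Hypothetical helper: derives serving metrics from per-request timestamps (seconds).
def serving_metrics(request_start: float, first_token_at: float,
                    finished_at: float, num_output_tokens: int) -> dict:
    ttft = first_token_at - request_start                # time to first token
    decode_time = finished_at - first_token_at
    tpot = decode_time / max(num_output_tokens - 1, 1)   # time per output token
    tokens_per_s = num_output_tokens / (finished_at - request_start)
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens_per_s": tokens_per_s}

# Example: first token after 0.4 s, then 63 more tokens over the next 2.1 s.
print(serving_metrics(0.0, 0.4, 2.5, 64))
# -> roughly 0.4 s TTFT, ~33 ms per output token, ~25.6 tokens/s
```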


Inference is only one of the workloads that run on GPUs. Training and fine-tuning are the phases where models actually learn.


Generative media workloads create new visual, audio, or 3D content.


Simulation and research workloads include reinforcement learning, scientific simulations, and large evaluation sweeps.


