AI Workloads on OpenGPU
LLM inference
LLM inference is the process of using a trained large language model to turn a new input prompt into an output, for example answers, summaries, code, or tool calls. Training has already happened at this point. Inference is the application phase where the model uses what it has learned.

How LLM inference works
Most modern LLMs use a transformer architecture. Inference usually happens in two phases.

1. Prefill phase
The user prompt is tokenized and embedded. The model processes all input tokens in parallel and builds the key-value cache that represents the context. This phase is compute bound: performance is limited mostly by the GPU's matrix multiplication throughput. The key metric here is time to first token, the delay before the model starts answering.
2. Decode phase
Once the first token is produced, the model switches to an autoregressive loop. It generates one token at a time and reuses the existing key-value cache, so only the newest token has to pass through the layers. This phase is memory bound: every step reads the model weights and the growing cache from GPU memory while doing relatively little arithmetic per byte read. The key metric here is time per output token.
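The two phases can be sketched as a small generation loop. This is a toy illustration in Python, not a real model and not an OpenGPU API: `ToyModel`, `prefill`, `decode_step` and `generate` are hypothetical names, and the "cache" is a plain list with one entry per token so the growth pattern is easy to see.

```python
import time


class ToyModel:
    """Stand-in for a transformer; the cache holds one entry per processed token."""

    def prefill(self, prompt_tokens):
        # Process every prompt token in one pass and build the cache.
        kv_cache = [("k", "v", tok) for tok in prompt_tokens]
        first_token = sum(prompt_tokens) % 100
        return first_token, kv_cache

    def decode_step(self, last_token, kv_cache):
        # Process only the newest token; reuse and extend the existing cache.
        kv_cache.append(("k", "v", last_token))
        return (last_token + len(kv_cache)) % 100


def generate(model, prompt_tokens, max_new_tokens):
    start = time.perf_counter()
    token, kv_cache = model.prefill(prompt_tokens)
    ttft = time.perf_counter() - start  # time to first token

    output = [token]
    decode_start = time.perf_counter()
    for _ in range(max_new_tokens - 1):
        token = model.decode_step(token, kv_cache)
        output.append(token)
    decode_time = time.perf_counter() - decode_start

    steps = max(len(output) - 1, 1)
    tpot = decode_time / steps  # average time per output token
    return output, kv_cache, ttft, tpot
```

With a three-token prompt and five requested tokens, the cache starts at three entries after prefill and gains one entry per decode step. In a real model, this growth is what drives the memory traffic in the decode phase.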
Why optimization matters
LLM inference is expensive. It needs large amounts of GPU memory and steady compute. Optimizing this stage directly affects user experience and cost.
- Latency. Lower time to first token and lower inter-token delay make chat interfaces feel responsive.
- Throughput. Higher throughput lets a single cluster serve more users and agents.
- Cost. Better utilization means fewer idle GPUs and a lower cost per token.
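The link between throughput, utilization and cost per token is simple arithmetic. A minimal sketch in Python follows; the function name and every number in the usage note are made up for illustration and are not OpenGPU prices or benchmarks.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second, utilization=1.0):
    """Cost in USD to generate one million tokens on a single GPU.

    utilization is the fraction of each paid hour spent serving requests.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000
```

For a hypothetical GPU at 2 USD per hour producing 1000 tokens per second, the cost is about 0.56 USD per million tokens at full utilization and about 1.11 USD at 50% utilization. Halving utilization doubles the cost per token.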

How OpenGPU supports LLM inference
- Route inference jobs to GPUs that match each job's memory and speed requirements
- Scale horizontally during traffic spikes
- Keep costs predictable by placing jobs on efficient nodes
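To make the routing idea concrete, here is a minimal sketch in Python of matching a job to a node by memory, speed and cost. It is an assumption about how such matching could work, not OpenGPU's actual scheduler: `Node`, `Job`, `pick_node` and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    name: str
    vram_gb: int
    tokens_per_second: float
    hourly_usd: float


@dataclass(frozen=True)
class Job:
    min_vram_gb: int
    min_tokens_per_second: float


def pick_node(job: Job, nodes: Sequence[Node]) -> Optional[Node]:
    # Keep only nodes that satisfy the job's memory and speed requirements.
    eligible = [
        n for n in nodes
        if n.vram_gb >= job.min_vram_gb
        and n.tokens_per_second >= job.min_tokens_per_second
    ]
    if not eligible:
        return None
    # Among eligible nodes, prefer the lowest cost per unit of throughput.
    return min(eligible, key=lambda n: n.hourly_usd / n.tokens_per_second)
```

Filtering first and ranking second keeps hard constraints (the model must fit in memory) separate from preferences (cheaper is better), so a cheap but undersized node is never chosen.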

Training and fine tuning
Training and fine tuning are the phases where models actually learn.

Types of training workloads
- Full pre training. Training a model from scratch on a large corpus.
- Fine tuning. Adapting an existing model to a specific task or domain.
- Continual training. Periodic updates to an existing model as new data arrives.
Resource profile and challenges
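- GPU memory. Weights, gradients, optimizer state and activations all have to fit in VRAM.
- Distributed training. Multi-GPU runs need fast interconnects to exchange gradients and parameters.
- I/O and storage. Training data has to stream continuously to keep GPUs busy.
- Reliability. Long runs must survive interruptions.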
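A rough sizing calculation shows why GPU memory dominates. The sketch below, in Python, uses a common rule of thumb for mixed-precision training with Adam: about 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for full-precision master weights and two optimizer moments). It ignores activations and assumes state can be sharded evenly across GPUs, so treat the result as a lower bound. The function names are illustrative.

```python
import math


def training_state_gb(num_params, bytes_per_param=16):
    """Approximate memory for weights, gradients and optimizer state, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus(num_params, vram_gb_per_gpu, bytes_per_param=16):
    """Smallest GPU count whose combined VRAM holds the training state.

    Assumes perfectly even sharding and no activation memory.
    """
    return math.ceil(training_state_gb(num_params, bytes_per_param) / vram_gb_per_gpu)
```

Under these assumptions, a 7-billion-parameter model needs about 112 GB of training state, which is at least two 80 GB GPUs. Serving the same weights in half precision takes about 14 GB (`bytes_per_param=2`).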
How OpenGPU supports training and fine tuning
- Match long jobs to stable providers
- Group GPUs for distributed runs
- Use scheduling policies that minimize churn
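Even on stable providers, a long job can be interrupted, so the job itself usually checkpoints and resumes. Here is a minimal sketch of that pattern in Python. It stores a small JSON state instead of real model weights, and `save_checkpoint`, `load_checkpoint`, `train` and the `stop_at` parameter (which simulates an interruption) are hypothetical names, not part of any OpenGPU interface.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    # Write to a temp file, then rename, so an interruption mid-write
    # never leaves a half-written checkpoint at `path`.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    # Start from step 0 when no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=10, stop_at=None):
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulated interruption; unsaved progress is lost
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}  # placeholder for a real training step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If a run is interrupted at step 25 with a checkpoint every 10 steps, the next call to `train` resumes from step 20, so at most `checkpoint_every` steps of work are repeated.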

Generative media and 3D
Generative media workloads create new visual, audio or 3D content.

Types of generative media workloads
- Image generation.
- Video generation and editing.
- Audio and speech.
- 3D and scene generation.
Resource patterns and routing
These jobs vary from bursty interactive tasks to heavy batch runs.
- Throughput. Needed for batch runs.
- Latency. Key for interactive tools.
- Memory and storage. Large models plus heavy assets such as video frames, audio and 3D scenes.
How OpenGPU supports generative media and 3D
- Route bursty workloads across providers
- Place latency-sensitive jobs on fast nodes
- Scale horizontally without pipeline redesign
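One simple way to keep interactive work responsive while batch work fills the remaining capacity is a priority queue. The sketch below is in Python and is an illustration only: `JobQueue` and the two job kinds are hypothetical, not OpenGPU's routing logic.

```python
import heapq
from itertools import count


class JobQueue:
    """Dispatch interactive jobs before batch jobs, first in first out within each kind."""

    PRIORITY = {"interactive": 0, "batch": 1}

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker that preserves submission order

    def submit(self, name, kind):
        heapq.heappush(self._heap, (self.PRIORITY[kind], next(self._order), name))

    def next_job(self):
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        return name
```

A strict priority like this can starve batch jobs under constant interactive load, so a production scheduler would typically add aging or reserved capacity for batch work.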

Simulation and research
Simulation and research workloads include reinforcement learning, scientific simulations and large evaluation sweeps.

Types of simulation and research workloads
- Reinforcement learning.
- Scientific simulations.
- Search and evaluation.
Scaling experiments with OpenGPU
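- Horizontal scale. Sweeps and rollouts split into many independent jobs that run in parallel.
- Cost control. Capacity follows demand instead of sitting idle between experiments.
- Fault tolerance. A failed job can be retried without restarting the whole experiment.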
How OpenGPU supports simulation and research
- Launch many small jobs instead of giant clusters
- Scale capacity up or down on demand
- Automatically recycle idle GPUs into new tasks
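The pattern of many small, independently retried jobs can be sketched with Python's standard library. This is a local stand-in that uses threads, not an OpenGPU client: `sweep`, `run_with_retry` and `toy_task` are hypothetical names, and `toy_task` fails once for odd seeds to mimic an interrupted node.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_with_retry(task, config, max_attempts=3):
    # Retry a single small job without touching the rest of the sweep.
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(config, attempt)
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"gave up on {config}") from last_error


def sweep(task, grid, max_workers=4):
    # Expand the grid into one config per combination of values.
    keys = sorted(grid)
    configs = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda c: run_with_retry(task, c), configs))
    return list(zip(configs, results))


def toy_task(config, attempt):
    # Hypothetical experiment: the first attempt fails for odd seeds.
    if config["seed"] % 2 == 1 and attempt == 1:
        raise RuntimeError("simulated interruption")
    return config["lr"] * config["seed"]
```

Calling `sweep(toy_task, {"lr": [0.1, 0.01], "seed": [0, 1, 2]})` runs six independent jobs. The two with an odd seed fail once and succeed on retry, while the other four are unaffected.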
