In the pre-LLM era, deploying machine learning models was an entirely different experience. You’d wrap them in a Flask server, containerize them with Docker, and deploy to your chosen infrastructure, handling details like CUDA versions yourself (or choose from the handful of inference providers that existed at the time).
Today, inference providers have made it easy for users, but that ease is built on top of a multi-layered infrastructure. It was only after I joined Together that I realized running an inference service for open-source models is a complex challenge, blending traditional distributed systems with new, unique components.
If you’re already familiar with LLM inference, this post might be too basic for you, but if not, read on.
Routing
At the top of the stack is typically a routing layer that directs user requests to the correct models. I am oversimplifying, but load balancing (latency-aware strategies), multi-tenancy (avoiding starvation), and QoS (prioritization across user tiers, rate limiting) are all built into this layer.
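To make the routing layer concrete, here is a minimal sketch of latency-aware load balancing combined with tier-based prioritization. All names (`Router`, the tier numbering, the smoothing constant) are illustrative assumptions, not any provider's real API:

```python
import heapq

class Router:
    """Toy router: send each request to the replica with the lowest
    observed latency, and serve waiting requests in tier order."""

    def __init__(self, replicas):
        # replica name -> exponentially smoothed latency (seconds)
        self.latency = {r: 0.0 for r in replicas}
        self.queue = []  # min-heap of (tier, seq, request)
        self.seq = 0     # tie-breaker: FIFO within the same tier

    def submit(self, request, tier):
        # Lower tier value = higher priority (e.g. 0 = paid, 2 = free)
        heapq.heappush(self.queue, (tier, self.seq, request))
        self.seq += 1

    def dispatch(self):
        # Pop the highest-priority request, pick the least-loaded replica
        tier, _, request = heapq.heappop(self.queue)
        replica = min(self.latency, key=self.latency.get)
        return replica, request

    def observe(self, replica, seconds, alpha=0.3):
        # Feed completed-request latency back into the estimate
        self.latency[replica] = (1 - alpha) * self.latency[replica] + alpha * seconds
```

A real routing layer also handles rate limiting, health checks, and starvation avoidance (so low-tier requests eventually run), but the core idea is the same: a priority queue in front of a load-balancing policy.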
Inference Engine
Requests then hit the inference engine, which is responsible for actually running the model. It holds the model weights and architecture. Popular open-source engines include vLLM, SGLang, and TRT-LLM. Inference providers may also build their own custom inference engines.
This layer also holds specific features and optimizations that different inference providers implement differently. These choices are why the same open models can behave differently across inference platforms. Tool calling in gpt-oss is a recent example of this.
Features and optimizations include:
- Batching: merging requests into a single forward pass for throughput.
- Plugins/extensions: function calling, structured outputs, safety layers, guardrails.
- Memory optimizations: tensor parallelism, expert parallelism, quantization.
- Speculative decoding and prefix caching: faster token generation.
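Batching is the easiest of these to illustrate. The sketch below shows naive static batching; production engines like vLLM use continuous batching, where new requests join a running batch between decode steps. The function names and the stand-in forward pass are hypothetical:

```python
def forward_pass(batch):
    """Stand-in for one batched model forward pass. In a real engine,
    the prompts are packed into a single tensor and run together on
    the GPU, amortizing the cost of one kernel launch across requests."""
    return [f"token_for({prompt})" for prompt in batch]

def serve(requests, max_batch_size=8):
    """Toy static batching: run one forward pass per group of requests
    instead of one pass per request, which is what drives throughput."""
    outputs = []
    for i in range(0, len(requests), max_batch_size):
        batch = requests[i:i + max_batch_size]
        outputs.extend(forward_pass(batch))  # one launch for the whole batch
    return outputs
```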
The inference engine decides what operations the model needs: for instance, run attention, apply a layer norm, and generate the next token. It then calls into the kernel layer, which provides highly optimized implementations of those operations (FlashAttention, GEMM, activation functions).
Kernel
I was pretty dumb when I first started and thought this kernel was akin to something like a Linux kernel. Spoiler alert: it is not. GPU kernels are low-level, highly optimized math routines, typically written in CUDA and compiled into code that runs directly on the GPU.
- Standard libraries: cuBLAS, cuDNN, CUTLASS (NVIDIA)
- FlashAttention and fused kernels: Accelerate attention by reducing reads and writes to GPU memory.
- Bespoke kernels: For example, Together’s Kernel Collection (TKC) includes custom CUDA kernels, tuned for training and inference. This offers serious benefits in speeding up inference and making it cost-effective.
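Why does fusing kernels help? Each separate kernel reads its inputs from GPU memory and writes its outputs back, so chaining two kernels means the intermediate result makes a full round trip through memory. The Python sketch below simulates this with a bias-add followed by a GELU activation; the function names are mine, but the pattern mirrors real fused CUDA kernels:

```python
import math

def gelu(x):
    """Exact GELU activation, for illustration."""
    return 0.5 * x * (1 + math.erf(x / math.sqrt(2)))

def bias_gelu_unfused(xs, bias):
    """Two separate 'kernels': the intermediate list ys is the analogue
    of an intermediate tensor round-tripping through GPU memory."""
    ys = [x + bias for x in xs]   # kernel 1: add bias, write result
    return [gelu(y) for y in ys]  # kernel 2: read it back, apply GELU

def bias_gelu_fused(xs, bias):
    """One fused 'kernel': bias add and GELU in a single pass, with no
    intermediate array; this is the saving fused kernels provide."""
    return [gelu(x + bias) for x in xs]
```

Both functions compute the same values; the fused version just touches memory half as often, which on a GPU is frequently the bottleneck.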
The inference engine maps model operations to be executed by the kernel.
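That mapping can be pictured as a dispatch table: the engine names each operation, and the kernel layer supplies the implementation. In this toy version the "kernels" are naive Python functions; a real engine binds the same op names to compiled CUDA kernels (a cuBLAS GEMM, FlashAttention, and so on). All names here are illustrative:

```python
def gemm(a, b):
    """Naive matrix multiply: stand-in for a cuBLAS/CUTLASS GEMM kernel."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add_bias(m, bias):
    """Stand-in for a fused elementwise bias kernel."""
    return [[v + b for v, b in zip(row, bias)] for row in m]

# The engine's view: operation names mapped to kernel implementations.
KERNELS = {"gemm": gemm, "add_bias": add_bias}

def run_layer(ops, x):
    """Toy engine loop: execute a layer described as named ops."""
    for op, arg in ops:
        x = KERNELS[op](x, arg)
    return x
```

Swapping a faster kernel in for an op name is exactly how bespoke collections like TKC plug into the stack without changing the engine's view of the model.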
Hardware Layer
At the bottom sits the hardware. In Together’s case, we operate large-scale GPU clusters across NVIDIA’s HGX H100, H200, and Blackwell B200 platforms. Inference can run on bare-metal GPUs or on a thin virtualization layer to enable the use of Kubernetes or custom schedulers.
Closing Thoughts
Running inference as a service, particularly for LLMs, is a complex stack, from routing at the top to custom GPU kernels at the bottom. Each layer introduces unique challenges and optimizations, and together they make it possible to serve large models efficiently at scale.
I conveniently skipped the nuances of other modalities like audio and video in this blog post, but I’ll save that for another day!