Will's Inference Calculations

Things everyone should know about inference

Will Arnold — LinkedIn · GitHub · will@swaglu.com · warnold@nvidia.com

Roofline Calculations
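
The roofline model bounds attainable throughput by min(peak compute, arithmetic intensity × memory bandwidth); the ridge point where the two meet separates memory-bound from compute-bound kernels. A minimal sketch, using assumed H100-class peak numbers (not values from this page):

```python
# Roofline sketch. PEAK_FLOPS and PEAK_BW are assumed H100 SXM-class
# figures (BF16 dense compute, HBM3 bandwidth), not values from this page.
PEAK_FLOPS = 989e12  # FLOP/s
PEAK_BW = 3.35e12    # bytes/s

def attainable_flops(intensity: float) -> float:
    """Attainable FLOP/s at a given arithmetic intensity (FLOP/byte):
    memory-bound below the ridge point, compute-bound above it."""
    return min(PEAK_FLOPS, intensity * PEAK_BW)

ridge = PEAK_FLOPS / PEAK_BW  # ~295 FLOP/byte for these numbers
print(f"ridge point: {ridge:.0f} FLOP/byte")
```

Single-request decode streams every weight byte per generated token at roughly 2 FLOPs per parameter, so its intensity sits far below the ridge point: decode stays memory-bandwidth-bound until batching raises the intensity.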

KV Cache Bytes per Token

MLA = Multi-head Latent Attention | MHA = Multi-Head Attention | SWA = Sliding Window Attention
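
For MHA and GQA, the per-token KV footprint is 2 (K and V) × layers × kv_heads × head_dim × bytes per element; MLA instead caches one compressed latent plus a decoupled RoPE key per layer. A sketch of both, where the MLA dims (512 + 64, 61 layers) follow DeepSeek-V3's published config and the GQA example numbers are hypothetical:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """MHA/GQA: K and V (factor of 2) for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def mla_kv_bytes_per_token(layers: int, latent_dim: int, rope_dim: int,
                           bytes_per_elem: int) -> int:
    """MLA: one shared compressed latent plus a decoupled RoPE key per
    layer, independent of the number of attention heads."""
    return layers * (latent_dim + rope_dim) * bytes_per_elem

# Hypothetical 32-layer GQA model with 8 KV heads, head_dim 128, BF16:
print(kv_bytes_per_token(32, 8, 128, 2))       # 131072 B = 128 KiB/token
# DeepSeek-V3-style MLA: 61 layers, latent 512, RoPE key 64, BF16:
print(mla_kv_bytes_per_token(61, 512, 64, 2))  # 70272 B ≈ 69 KiB/token
```

Sliding-window attention does not change the per-token bytes; it caps how many tokens a layer keeps cached, which shows up in the per-request totals below.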

Chart: Model Size vs KV Cache Bytes per Token (model size plotted as either Active Parameters or Total Parameters)

KV Cache per Request at Sequence Length
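
Per-request KV cache is just bytes per token × cached tokens; for sliding-window layers the cached-token count is capped at the window size. A sketch (the 128 KiB/token figure carries over from the hypothetical GQA model above):

```python
def kv_bytes_per_request(kv_per_token: int, seq_len: int,
                         window: int | None = None) -> int:
    """Total KV cache for one request; a sliding window caps the
    number of tokens kept in cache."""
    cached = min(seq_len, window) if window is not None else seq_len
    return kv_per_token * cached

# 128 KiB/token at a 128K context: 16 GiB of KV cache for one request.
print(kv_bytes_per_request(131072, 128 * 1024) / 2**30)        # 16.0
# Same model with a 4K sliding window: capped at 0.5 GiB.
print(kv_bytes_per_request(131072, 128 * 1024, 4096) / 2**30)  # 0.5
```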

Max Concurrent Requests at Sequence Length

max_reqs = ⌊(GPU_mem × 0.9 − weights / N) / (KV/tok × seq_len + fixed)⌋

where GPU_mem is per-GPU memory, 0.9 is the usable-memory fraction, weights / N is each GPU's share of the model weights when sharded across N GPUs, KV/tok is the KV cache bytes per token, and fixed is any fixed per-request overhead.
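
A direct transcription of the formula above into code; the example numbers are assumed (one 80 GB GPU holding 16 GB of weights, 128 KiB/token KV, 8K context, no fixed overhead):

```python
import math

def max_concurrent_requests(gpu_bytes: float, weight_bytes: float,
                            n_gpus: int, kv_per_token: float, seq_len: int,
                            fixed_per_req: float = 0.0,
                            mem_util: float = 0.9) -> int:
    """max_reqs = floor((GPU_mem * mem_util - weights / N)
                        / (KV/tok * seq_len + fixed))."""
    free = gpu_bytes * mem_util - weight_bytes / n_gpus
    return math.floor(free / (kv_per_token * seq_len + fixed_per_req))

print(max_concurrent_requests(80e9, 16e9, 1, 131072, 8192))  # 52
```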

Model Details

Model | Type | Layers | KV Heads | Head Dim | BF16 B/tok | FP8 B/tok | 128K BF16 | 128K FP8

(B/tok = KV cache bytes per token; the 128K columns give the total KV cache for a single request at a 128K-token context.)
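
The derived columns all follow from the first four. A sketch that fills one row for a hypothetical GQA config (the model name and dims below are made up, not entries from the original table):

```python
def table_row(model: str, attn: str, layers: int, kv_heads: int,
              head_dim: int) -> str:
    """Derive the B/tok and 128K-context columns for an MHA/GQA model."""
    bf16 = 2 * layers * kv_heads * head_dim * 2  # 2-byte elements
    fp8 = bf16 // 2                              # 1-byte elements
    ctx, gib = 128 * 1024, 2**30
    return (f"{model} | {attn} | {layers} | {kv_heads} | {head_dim} | "
            f"{bf16} | {fp8} | {bf16 * ctx / gib:.1f} GiB | "
            f"{fp8 * ctx / gib:.1f} GiB")

# Hypothetical row:
print(table_row("example-8b", "GQA", 32, 8, 128))
# example-8b | GQA | 32 | 8 | 128 | 131072 | 65536 | 16.0 GiB | 8.0 GiB
```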