How the estimates work

Local LLM generation is dominated by memory bandwidth: to produce each token, the hardware streams the model's active weights (and the growing KV cache) out of memory. But raw bandwidth isn't the whole story — there's also a fixed per-token overhead that puts a ceiling on speed, which is why a chip with twice the bandwidth isn't twice as fast. We model the time for each token directly:

tokens/sec = 1 ÷ ( weight_read_time + kv_cache_read_time + fixed_overhead )
weight_read_time = active_model_size ÷ (memory_bandwidth × efficiency)
kv_cache_read_time = kv_cache_size ÷ (memory_bandwidth × efficiency)

Active model size = active parameters × bits-per-weight ÷ 8. For mixture-of-experts models only a fraction of parameters run per token, so they are faster than their download size suggests.
KV cache grows with your context length, so longer contexts mean slower generation: the context selector affects the speed estimate, not just whether a model fits. By default we assume an fp16 cache (2 bytes per element); the KV-cache selector models Q8_0 (≈½ the size) and Q4_0 (≈¼) quantization, a runtime setting (e.g. llama.cpp --cache-type-k / --cache-type-v) that shrinks the cache and speeds up long-context generation at a small cost in quality. It can't be read from a model on Hugging Face, because it's your choice at inference time, so we let you pick it here.
Efficiency and fixed overhead are calibrated per hardware class (Apple unified memory, discrete GPU, CPU).
Concurrent streams (serving several requests at once) share one copy of the weights but each needs its own KV cache, so memory use grows with the stream count and large models may stop fitting. Because the weight read is amortized across the batch, the speed we show is per stream — aggregate throughput is roughly the stream count times that, with diminishing returns as batched decode becomes compute-bound. Real batching efficiency varies by runtime (vLLM is built for it; llama.cpp less so), so treat the multi-stream numbers as a first-order guide.
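
As a minimal sketch of the per-token formula above (not the site's actual code), the Python below shows why doubling bandwidth does not double speed. The efficiency and fixed-overhead values are made-up placeholders, not our calibrated constants, and the names Hardware, active_model_gb and tokens_per_sec are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    bandwidth_gbps: float     # peak memory bandwidth in GB/s
    efficiency: float         # fraction of peak actually achieved (placeholder)
    fixed_overhead_s: float   # fixed per-token overhead in seconds (placeholder)


def active_model_gb(active_params_b: float, bits_per_weight: float) -> float:
    """Active model size in GB: active parameters (billions) x bits-per-weight / 8."""
    return active_params_b * bits_per_weight / 8


def tokens_per_sec(hw: Hardware, active_gb: float, kv_cache_gb: float) -> float:
    """1 / (weight_read_time + kv_cache_read_time + fixed_overhead)."""
    effective_bw = hw.bandwidth_gbps * hw.efficiency
    weight_read_time = active_gb / effective_bw
    kv_cache_read_time = kv_cache_gb / effective_bw
    return 1.0 / (weight_read_time + kv_cache_read_time + hw.fixed_overhead_s)


# Illustrative only: an 8B dense model at 4 bits per weight, 0.5 GB of KV cache.
slow = Hardware(bandwidth_gbps=400, efficiency=0.7, fixed_overhead_s=0.004)
fast = Hardware(bandwidth_gbps=800, efficiency=0.7, fixed_overhead_s=0.004)
weights = active_model_gb(8, 4)

slow_tps = tokens_per_sec(slow, weights, 0.5)
fast_tps = tokens_per_sec(fast, weights, 0.5)
print(f"{slow_tps:.1f} tok/s vs {fast_tps:.1f} tok/s, ratio {fast_tps / slow_tps:.2f}")
```

With these placeholder numbers the second chip has twice the bandwidth but comes out well under twice as fast, because the fixed overhead does not shrink with bandwidth.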

Time to first token is a separate story. Before any token is generated the model must read your whole prompt (“prefill”), which is compute-bound (≈ 2 FLOPs per active parameter per token), not bandwidth-bound like generation. So a chip with great memory bandwidth but modest compute can stream tokens quickly yet still take many seconds to start at long context. We estimate prefill from each device's approximate fp16 compute and show the time to first token at your selected context length. These compute figures aren't benchmark-calibrated, so treat the time-to-first-token as a rough ballpark.
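
A rough sketch of that prefill arithmetic, assuming the ≈ 2 FLOPs per active parameter per token rule and nothing else (attention cost is ignored here). The function name and the utilization parameter are hypothetical, and the 30 TFLOPS figure is an illustrative input, not a spec for any particular device.

```python
def time_to_first_token_s(active_params: float, prompt_tokens: int,
                          compute_tflops: float, utilization: float = 1.0) -> float:
    """Prefill time: about 2 FLOPs per active parameter per prompt token."""
    prefill_flops = 2.0 * active_params * prompt_tokens
    achieved_flops_per_s = compute_tflops * 1e12 * utilization
    return prefill_flops / achieved_flops_per_s


# Illustrative: 8B active parameters, 32k-token prompt, 30 TFLOPS of fp16 compute.
ttft = time_to_first_token_s(8e9, 32_000, 30.0)
print(f"{ttft:.1f} s")
```

Doubling the prompt length doubles the estimate, which is why a device that decodes quickly can still take many seconds to start at long context.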

Number format (NVFP4, INT8/FP8, and quantization) matters for prefill but barely for decode, and the reason is the same split between compute-bound and bandwidth-bound work. Decode is bandwidth-bound, so what counts is how many bits each weight occupies: a 4-bit weight streams in the same time whether it's an integer K-quant (Q4) or NVIDIA's NVFP4, a hardware 4-bit float on Blackwell. NVFP4's advantage over a plain 4-bit integer is quality per bit (a shared scale and a floating-point mantissa preserve more of the model), not raw decode speed, so it doesn't move the tokens/sec we show, which are calibrated to llama.cpp/Ollama integer quants. Prefill is compute-bound, and that's where a card's tensor-core formats decide throughput. Each NVIDIA generation added a lower-precision tensor path that roughly doubles peak math: INT8 on Ampere (A100), FP8 on Ada and Hopper (L40S, H100/H200), and FP4/NVFP4 on Blackwell (RTX Pro 6000, B200, GB300). The prefill compute figures for the workstation and datacenter cards below already reflect that tensor-core advantage, which is why they sit well ahead of a consumer card at the same memory bandwidth. NVFP4 itself is consumed today by TensorRT-LLM and vLLM, not GGUF/llama.cpp, so on this site it shows up as faster prefill on Blackwell rather than a new decode number.

Whether a model fits is decided by the total weights at a given quantization, plus the KV cache for your chosen context length, plus runtime overhead — compared against the usable portion of your VRAM or unified memory.
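
The fit test can be sketched in a few lines. This is an illustration under stated assumptions: the usable_fraction default of 0.9 is a placeholder rather than the value we use, and the function name is hypothetical.

```python
def fits(total_weights_gb: float, kv_cache_gb: float, runtime_overhead_gb: float,
         memory_gb: float, usable_fraction: float = 0.9) -> bool:
    """True if weights + KV cache + overhead fit in the usable part of memory.

    Uses TOTAL weights (every expert of a mixture-of-experts model must be
    resident), unlike the speed estimate, which uses only the active weights.
    """
    needed_gb = total_weights_gb + kv_cache_gb + runtime_overhead_gb
    return needed_gb <= memory_gb * usable_fraction


print(fits(4.5, 1.0, 1.0, memory_gb=8))     # small model on an 8 GB card
print(fits(40.0, 2.0, 1.0, memory_gb=24))   # large model on a 24 GB card
```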

On a discrete GPU, a model that overflows VRAM isn't necessarily out of reach: llama.cpp can keep some layers on the GPU and offload the rest to system RAM, which runs — slowly, because the RAM-resident weights are read at a fraction of VRAM bandwidth each token. When that applies we show the offload speed and how much spills to RAM, assuming a typical desktop with ~64 GB of DDR5. Apple unified memory has no VRAM/RAM split, so offload doesn't apply there.

Unified memory & multiple GPUs

Some machines have more than one memory tier. A Grace-Blackwell part like the GB300 pairs ~288 GB of fast HBM (≈8 TB/s) with the Grace CPU's ~480 GB of LPDDR (≈0.38 TB/s), coherently unified, so a model far bigger than the HBM can run by spilling into the LPDDR tier, slowly. We model this directly: weights fill the fast tier first (after the KV cache and overhead) and spill to slower tiers, and the per-token read time is the sum across tiers, so the portion living in the slow tier sets the pace. That's why a ~600B model fits a single GB300 but generates at only a few tokens a second.
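
The tiered read time can be sketched as a greedy fill, fastest tier first. This is a simplified illustration: it leaves out the efficiency factor and fixed overhead, the 400 GB of weights read per token and the 20 GB reserved for KV cache and overhead are invented inputs, and the function name is hypothetical.

```python
def tiered_read_time_s(weights_gb: float, tiers: list[tuple[float, float]],
                       reserved_fast_gb: float = 0.0) -> float:
    """Per-token weight read time summed across memory tiers.

    tiers: (capacity_gb, bandwidth_gbps) pairs, fastest first.
    reserved_fast_gb: space in the first tier already taken by KV cache and overhead.
    """
    remaining = weights_gb
    total_s = 0.0
    for index, (capacity_gb, bandwidth_gbps) in enumerate(tiers):
        free_gb = capacity_gb - (reserved_fast_gb if index == 0 else 0.0)
        placed_gb = min(remaining, max(free_gb, 0.0))
        total_s += placed_gb / bandwidth_gbps
        remaining -= placed_gb
    if remaining > 1e-9:
        raise ValueError("weights do not fit in the available tiers")
    return total_s


# The GB300 figures quoted above: ~288 GB at ~8 TB/s, ~480 GB at ~0.38 TB/s.
GB300_TIERS = [(288.0, 8000.0), (480.0, 380.0)]
read_time = tiered_read_time_s(400.0, GB300_TIERS, reserved_fast_gb=20.0)
slow_share = (132.0 / 380.0) / read_time
print(f"{read_time:.3f} s per token, {slow_share:.0%} of it in the slow tier")
```

In this example about a third of the weights live in LPDDR, yet that portion accounts for roughly nine tenths of the read time.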

Multiple GPUs are the same idea in reverse: pooling N discrete cards scales the usable memory by N (weights shard across them) and the memory bandwidth by roughly N as well, minus an interconnect penalty, so the realized speedup is sub-linear (we assume ~0.9× per GPU at 2, falling toward ~0.7× at 8). Both the tiered-memory and multi-GPU models reduce to the single-tier, single-GPU model when they don't apply, so ordinary machines are unaffected.
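
A sketch of the pooling arithmetic. The two anchor points (~0.9× at 2 GPUs, ~0.7× at 8) come from the text above; the straight-line interpolation between them is purely an assumption made for this illustration, and both function names are hypothetical.

```python
def per_gpu_efficiency(n_gpus: int) -> float:
    """Per-GPU bandwidth efficiency: 1.0 for one GPU, ~0.9 at 2, ~0.7 at 8.

    The linear interpolation between 2 and 8 is an assumption for illustration.
    """
    if n_gpus <= 1:
        return 1.0
    n = min(n_gpus, 8)
    return 0.9 - (n - 2) * (0.9 - 0.7) / (8 - 2)


def pooled(n_gpus: int, memory_gb: float, bandwidth_gbps: float) -> tuple[float, float]:
    """Return (pooled memory in GB, effective pooled bandwidth in GB/s)."""
    pooled_memory = n_gpus * memory_gb
    pooled_bandwidth = n_gpus * bandwidth_gbps * per_gpu_efficiency(n_gpus)
    return pooled_memory, pooled_bandwidth


memory, bandwidth = pooled(4, 24.0, 1000.0)
print(f"{memory:.0f} GB pooled, {bandwidth:.0f} GB/s effective")
```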

Fine-tuning memory

Fine-tuning is a different memory problem from inference, which is why we give it its own page. Running a model needs the weights plus a KV cache; training also has to hold a gradient and optimizer state for every parameter it updates. With mixed-precision AdamW that's a bf16 gradient (2 bytes), an fp32 master copy (4 bytes), and two fp32 optimizer moments (8 bytes): 14 bytes per trainable parameter on top of the 2-byte bf16 weights, or about 16 bytes in all. So a full fine-tune of a 7–8B model needs well over 100 GB and only fits datacenter GPUs.

Full fine-tune trains every weight — the 16-bytes-per-param cost applies to the whole model.
LoRA freezes the base weights (kept in bf16) and trains only a small low-rank adapter, so gradients and optimizer state apply to a fraction of a percent of the parameters — collapsing that cost to near zero.
QLoRA goes further and holds the frozen base in 4-bit (≈¼ the weight memory) while still training the adapter — which is how a 7B fine-tune fits in single-digit gigabytes.
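
The three modes can be compared with a small calculation. This sketch counts only weights, gradients and optimizer state (no activations), uses the byte counts listed above, and assumes an adapter of 0.5% of the parameters, which is an illustrative guess at "a fraction of a percent". The function name is hypothetical.

```python
GRAD_BYTES = 2       # bf16 gradient
MASTER_BYTES = 4     # fp32 master copy
MOMENTS_BYTES = 8    # two fp32 AdamW moments
TRAIN_STATE_BYTES = GRAD_BYTES + MASTER_BYTES + MOMENTS_BYTES   # 14 per trainable param

BASE_BYTES_PER_PARAM = {"full": 2.0, "lora": 2.0, "qlora": 0.5}  # bf16, bf16, 4-bit


def finetune_memory_gb(params: float, method: str,
                       adapter_fraction: float = 0.005) -> float:
    """Weights + gradient + optimizer state in GB, excluding activations."""
    base_bytes = params * BASE_BYTES_PER_PARAM[method]
    if method == "full":
        trainable_bytes = params * TRAIN_STATE_BYTES
    else:
        adapter_params = params * adapter_fraction
        # The adapter keeps its own bf16 weights plus the full training state.
        trainable_bytes = adapter_params * (2.0 + TRAIN_STATE_BYTES)
    return (base_bytes + trainable_bytes) / 1e9


for method in ("full", "lora", "qlora"):
    print(method, round(finetune_memory_gb(7e9, method), 2), "GB")
```

For a 7B model this gives roughly 112 GB for a full fine-tune, about 14.6 GB for LoRA and about 4 GB for QLoRA before activations, in line with the reference points quoted on this page.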

On top of that sits activation memory, which grows with batch size and sequence length; gradient checkpointing trades extra compute to store only a fraction of it, and an 8-bit optimizer halves the optimizer state. We estimate all of these from the model's shape and your chosen settings. The training figures are calibrated to standard references (a QLoRA-7B run in single-digit GB, a full 7B fine-tune north of 100 GB) but, like the speed numbers, are ballpark estimates: real usage varies with framework, kernel, and config.

Vision-language models

A vision-language model costs more memory than its text size suggests, for two reasons. First, it carries a vision encoder (a ~0.3–0.7B ViT) that stays resident in ~fp16 even when the language weights are quantized. Second — and this is the one people miss — every image becomes hundreds to thousands of tokens that are prepended into the context, so they consume KV cache exactly like a long prompt. We add both terms to the normal weight + KV-cache math: the vision encoder to the weights, and images × tokens-per-image to the context.

Tokens-per-image varies by model and resolution — LLaVA's CLIP encoder emits a fixed 576, while Qwen2-VL and Pixtral scale from a few hundred to a couple thousand — so the calculator has a resolution control, and the image KV cost is shown as its own line. Because that cost scales with the attention shape, it's far heavier on a multi-head model (LLaVA) than a grouped-query one (Qwen-VL) for the same token count. We model the common decoder-only VLMs that prepend visual tokens; cross-attention designs (e.g. Llama 3.2 Vision) condition differently and aren't covered yet. Specs are hand-curated and approximate.
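
To make the attention-shape point concrete, here is a sketch of the image KV cost using the standard KV-cache size formula (keys and values, per layer, per KV head). The two model shapes are illustrative placeholders chosen to contrast multi-head with grouped-query attention; they are not the curated specs of any model on this site, and the function name is hypothetical.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_element: float = 2.0) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element * tokens


images = 1
tokens_per_image = 576            # the fixed count quoted above for LLaVA's CLIP encoder
image_tokens = images * tokens_per_image

# Illustrative shapes: same depth and head size, different numbers of KV heads.
multi_head = kv_cache_bytes(image_tokens, layers=32, kv_heads=32, head_dim=128)
grouped_query = kv_cache_bytes(image_tokens, layers=32, kv_heads=8, head_dim=128)

print(f"multi-head:    {multi_head / 1e6:.0f} MB per image")
print(f"grouped-query: {grouped_query / 1e6:.0f} MB per image")
```

With an fp16 cache the multi-head shape spends four times as much KV memory on the same 576 image tokens, simply because it stores four times as many KV heads.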

Image generation

Diffusion models are a different machine again: no KV cache, nothing autoregressive. A denoiser (a UNet or a DiT) is run for a number of steps over an image latent, with a VAE and one or more text encoders resident alongside. So memory is just the sum of those weights (at your chosen precision) plus activations, which scale with the image resolution. The text encoder is the surprise: FLUX and SD3.5 ship a ~4.7B T5-XXL that's bigger than many language models, which is why a 12B model like FLUX needs ~33 GB in fp16 and is usually run quantized.
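
The ~33 GB figure can be checked with a short sum over the resident components. This sketch counts only the denoiser and the T5-XXL text encoder from the paragraph above; the VAE and any smaller text encoder are left out on the assumption that they are comparatively small, and activations are not included. The function name is hypothetical.

```python
def weights_gb(component_params: dict[str, float], bytes_per_param: float) -> float:
    """Total weight memory in GB for all resident components at one precision."""
    return sum(component_params.values()) * bytes_per_param / 1e9


flux_components = {
    "denoiser": 12e9,   # the 12B DiT
    "t5_xxl": 4.7e9,    # the ~4.7B text encoder
}

fp16_gb = weights_gb(flux_components, bytes_per_param=2.0)
print(f"{fp16_gb:.1f} GB in fp16")
```

Passing a smaller bytes_per_param (for example 1.0 for an 8-bit format) shows how quantization brings the same model within reach of a consumer card.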

Speed is reported as seconds per image = steps × per-step compute. Each denoise step is compute-bound (a forward pass over the latent), so we estimate it from the denoiser size, the latent resolution, and the device's compute (the same TFLOPS figures used for prompt prefill), calibrated so SDXL at 1024px/30 steps lands near ~12 s on an RTX 4090. That makes the levers visible: a distilled 4-step model (FLUX schnell) is far faster than its 28-step sibling, higher resolution costs both memory and time, and a faster GPU scales the per-step time down. Ballpark estimates, like the rest.

Calibration

The speed model is fitted against real measured token-generation benchmarks from the llama.cpp benchmark threads, XiongjieDai's GPU-Benchmarks-on-LLM-Inference, and LocalScore. The Apple-silicon fit explains 98% of the variance in the measured data; the discrete-GPU fit, 90%.

As the crowdsourced reports below accumulate, we periodically re-fit the same constants against the accepted submissions and update them when the data warrants — so every benchmark you contribute directly sharpens the estimates everyone sees.

These are estimates, shown as ranges. They're calibrated to the mainstream llama.cpp / Ollama setup, which is the default; the runtime selector adjusts the estimate for faster backends — MLX on Apple silicon, vLLM or ExLlamaV2 on discrete GPUs — using approximate per-runtime factors (themselves refined over time by the crowdsourced reports, which record the engine used). Real numbers also vary with OS, thermal state, and build. The goal is a reliable ballpark for every machine, not a benchmark. CPU-only estimates are not yet benchmark-calibrated and are rougher.

Measured speeds

Alongside our estimates, we show crowdsourced measured speeds when people report them. On the contribute page anyone can paste the raw timing output from their own llama.cpp or Ollama run; we parse the tokens-per-second from it (never a self-typed number), sanity-check it against the estimate, and store it anonymously. Once a given device, model, and quantization has at least three accepted reports, cards and the chart show the median of them, with a count of how many back it — so no single submission can move the number. A measured median is a real number from real hardware, so trust it over the estimate when both are present — the estimate is the prediction, the measured value is the ground truth filling it in. Submissions are rate-limited and gated by a lightweight proof-of-work check (no third-party CAPTCHA); we keep only the one parsed benchmark line, not your full paste.
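
The median rule is simple enough to sketch directly. This is an illustration of the behaviour described above, not our production code; the function name is hypothetical and the inputs are invented tokens-per-second values.

```python
from statistics import median


def measured_speed(accepted_reports: list[float], min_reports: int = 3):
    """Return (median tokens/sec, report count), or None below the threshold."""
    if len(accepted_reports) < min_reports:
        return None
    return median(accepted_reports), len(accepted_reports)


print(measured_speed([40.0, 42.0]))          # too few reports to show anything
print(measured_speed([40.0, 42.0, 95.0]))    # the 95.0 outlier does not move the median
```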

Per-device calibration

Once a device has enough confirmed community benchmarks (three or more reports on at least two model/quant combinations), we compare those measured speeds to our estimate and apply a single correction factor to that device's decode numbers, clamped to a sane band. It's recomputed weekly and merged by a human before it goes live, so crowdsourced data never silently changes the model. Calibrated devices are marked calibrated; everything else is purely modeled. Prefill / time-to-first-token is compute-bound and is never adjusted this way.
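
One way such a correction could be computed is sketched below. Several details are assumptions made only for this illustration: the factor is taken as the median of measured-to-estimated ratios, the clamp band of 0.7 to 1.3 is a placeholder, and the eligibility rule is read as "at least two combinations, each with three or more reports". The function name is hypothetical.

```python
from statistics import median


def device_correction(reports: dict[str, list[float]], estimates: dict[str, float],
                      low: float = 0.7, high: float = 1.3):
    """Return a clamped correction factor for a device, or None if not eligible.

    reports:   measured tokens/sec per model/quant combination
    estimates: our estimated tokens/sec for the same combinations
    """
    confirmed = {combo: speeds for combo, speeds in reports.items() if len(speeds) >= 3}
    if len(confirmed) < 2:
        return None
    ratios = [median(speeds) / estimates[combo] for combo, speeds in confirmed.items()]
    return min(high, max(low, median(ratios)))


reports = {"model-a/Q4": [55.0, 55.0, 55.0], "model-b/Q8": [24.0, 24.0, 24.0]}
estimates = {"model-a/Q4": 50.0, "model-b/Q8": 20.0}
factor = device_correction(reports, estimates)
print(factor)
```

The resulting factor would multiply only the decode estimate; prefill stays untouched, as noted above.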

Capability score

The capability score (0–100) lets you pick the strongest model you can actually run. Its grounding varies by model, and we say which on every card:

Benchmark-anchored — where a model has a public LMArena Elo, the score is anchored to it and the card shows the Elo. These are grounded in a real, independent benchmark.
Editorial estimate — brand-new open models often have no clean, machine-readable public benchmark yet, so their score is our estimate from published results and size class, clearly labeled as such. A score upgrades to benchmark-anchored automatically once the model is rated.

Scores reflect full-precision weights; heavy quantization (e.g. Q4) may run a few percent weaker on math and reasoning. Quality data credit: LMArena (CC BY 4.0).