
Context Window & GPU Settings

How the setup wizard picks your context size, GPU layers, and why these values matter for performance and VRAM usage.

What is context size?

Context size (--ctx-size) is the maximum number of tokens in a single conversation: prompt, history, and response combined. A larger context lets the model "remember" more of the conversation; a smaller one uses less VRAM and speeds up inference.

💡 Most real-world use cases are fine with 8k–32k tokens. 128k is useful for long documents but needs a lot of VRAM.

How the wizard picks context size

The wizard reads the model's binary (GGUF) metadata directly and measures your available GPU memory, then calculates the largest context that fits comfortably.

Step 1: Read the model's native max context

GGUF files embed the model's training-time context limit in their header metadata. The wizard reads this without loading the model, which takes only milliseconds.
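To illustrate, the fixed part of a GGUF header can be parsed with nothing but the standard library. This is a minimal, hypothetical sketch covering only the fixed header (magic, version, tensor count, metadata key-value count); the actual context-length key lives in the metadata key-value section, which needs a full GGUF parser.

```python
import struct

def read_gguf_header(data: bytes):
    # Fixed GGUF header: 4-byte magic "GGUF", uint32 version,
    # uint64 tensor count, uint64 metadata key-value count.
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return version, n_tensors, n_kv

# Synthetic header bytes for illustration (version 3, 0 tensors, 5 metadata keys):
header = struct.pack("<4sIQQ", b"GGUF", 3, 0, 5)
print(read_gguf_header(header))  # (3, 0, 5)
```

Because only the 24-byte header is read, this check is effectively free compared to loading the model weights.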

Setting --ctx-size above this number degrades output quality: the model was never trained on longer sequences.

Step 2: Estimate KV cache VRAM cost

Every token in the context window requires memory in the KV (key-value) cache, the structure that lets the model attend to prior tokens. The formula:

bytes_per_token = 2 × n_layers × n_kv_heads × head_dim × 2 bytes (FP16)

For a typical 7–8B model (32 layers, 8 KV heads, 128 head dim):

bytes_per_token = 2 × 32 × 8 × 128 × 2 = 131,072 bytes ≈ 128 KB per token

At 32,768 ctx: 32,768 × 128 KB = ~4 GB KV cache
At 8,192 ctx:   8,192 × 128 KB = ~1 GB KV cache

⚠️ Bigger models (70B+) have more layers and heads, so their KV cache cost per token is proportionally larger.

Step 3: Reserve VRAM headroom

The wizard reserves 20% of total VRAM for the KV cache; the rest is assumed to be used by model weights. This is a conservative estimate, since actual weight usage depends on quantization (Q4 models use ~4 bits/weight vs. Q8 at 8 bits).

usable_kv_vram = total_vram × 0.20
max_ctx_from_vram = usable_kv_vram ÷ bytes_per_token

Step 4: Pick the largest standard size that fits

The wizard snaps to a standard power-of-two size (to avoid fragmentation and match model training checkpoints):

Standard size   Tokens   ~KV cache (7B model)
4096            4k       ~512 MB
8192            8k       ~1 GB
16384           16k      ~2 GB
32768           32k      ~4 GB
65536           64k      ~8 GB
131072          128k     ~16 GB
💡 The wizard caps at the model's native max even if you have VRAM to spare; there's no benefit to going beyond the training context.
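Steps 2–4 combine into a simple selection routine. The sketch below is illustrative: the names and the 20% headroom constant mirror the description above, not the wizard's actual source.

```python
STANDARD_SIZES = [4096, 8192, 16384, 32768, 65536, 131072]

def pick_ctx_size(total_vram_bytes: int, bytes_per_token: int,
                  native_max_ctx: int) -> int:
    usable_kv_vram = total_vram_bytes * 0.20          # Step 3: 20% headroom
    max_ctx_from_vram = usable_kv_vram // bytes_per_token
    # Step 4: largest standard size that fits, capped at the native max.
    fits = [s for s in STANDARD_SIZES
            if s <= max_ctx_from_vram and s <= native_max_ctx]
    return max(fits) if fits else STANDARD_SIZES[0]

# 24 GB GPU, 128 KB per token (7B-class model), native max 131072:
print(pick_ctx_size(24 * 2**30, 131072, 131072))  # 32768
```

Note how the 24 GB case lands on 32k rather than 64k: 20% of 24 GB is ~4.8 GB, enough for the ~4 GB cache at 32k but not the ~8 GB needed at 64k.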

GPU layers (--n-gpu-layers)

llama.cpp loads model layers individually onto the GPU. Setting --n-gpu-layers 99 tells it to load all layers onto GPU (99 is safely larger than any current model's layer count). This gives maximum speed.

If your VRAM is too small for the full model, set --n-gpu-layers to a lower number: the remaining layers run on the CPU, which is slower but still works.

Changing context size after setup

Edit docker-compose.yml and restart the stack:

# Edit docker-compose.yml
nano docker-compose.yml
# Change: LLAMA_ARG_CTX_SIZE=32768

docker compose up -d
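For reference, the variable lives in the service's environment block. A hypothetical fragment (the service name and the second variable are illustrative; llama.cpp's server reads LLAMA_ARG_* environment variables as equivalents of its CLI flags):

```yaml
services:
  llama:                              # service name is illustrative
    environment:
      - LLAMA_ARG_CTX_SIZE=32768      # equivalent of --ctx-size
      - LLAMA_ARG_N_GPU_LAYERS=99     # equivalent of --n-gpu-layers
```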

Rule of thumb by VRAM

GPU VRAM   7B model fits?       Recommended ctx
4 GB       Q4 only (tight)      4k–8k
8 GB       Q4/Q5 comfortable    8k–16k
12 GB      Q6/Q8 comfortable    16k–32k
16 GB      Q8 or small FP16     32k–64k
24 GB      Yes, all quants      64k–128k
48 GB+     70B Q4 fits          Full native ctx

Further reading