Context Window & GPU Settings
How the setup wizard picks your context size, GPU layers, and why these values matter for performance and VRAM usage.
What is context size?
Context size (--ctx-size) is the maximum number of tokens that can be in a single conversation: prompt + history + response combined. Larger = the model can "remember" more of the conversation. Smaller = less VRAM, faster inference.
How the wizard picks context size
The wizard reads the model's binary (GGUF) metadata directly and measures your available GPU memory, then calculates the largest context that fits comfortably.
Step 1: Read the model's native max context
GGUF files embed the model's training-time context limit in their header. The wizard reads this without loading the model (takes milliseconds). For example:
- Llama 3.1 8B → 131,072 tokens
- Qwen 2.5 7B → 131,072 tokens
- Gemma 3 9B → 131,072 tokens
- Phi-3 Mini → 128,000 tokens
Setting --ctx-size above this number causes degraded output quality: the model was never trained for longer sequences.
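That clamp can be expressed in a couple of lines (a minimal sketch; the function name is illustrative, not the wizard's actual code):

```python
def clamp_ctx(requested: int, native_max: int) -> int:
    """Never exceed the model's training-time context limit."""
    return min(requested, native_max)

# Llama 3.1 8B advertises 131,072 tokens in its GGUF header:
print(clamp_ctx(262_144, 131_072))  # asking for 256k is clamped to 131072
print(clamp_ctx(32_768, 131_072))   # within the limit, kept as-is
```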
Step 2: Estimate KV cache VRAM cost
Every token in the context window requires memory in the KV (key-value) cache โ the structure that lets the model attend to prior tokens. The formula:
bytes_per_token = 2 × n_layers × n_kv_heads × head_dim × 2 bytes (FP16)
For a typical 7–8B model (32 layers, 8 KV heads, 128 head dim):
bytes_per_token = 2 × 32 × 8 × 128 × 2 = 131,072 bytes ≈ 128 KB per token
At 32,768 ctx: 32,768 × 128 KB = ~4 GB KV cache
At 8,192 ctx: 8,192 × 128 KB = ~1 GB KV cache
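The arithmetic above can be checked with a short sketch (the function name is illustrative; the leading 2 covers the separate K and V tensors, the trailing factor is bytes per FP16 element):

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache cost per token: K and V tensors, one per layer, FP16."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Typical 7-8B model: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_token = kv_bytes_per_token(32, 8, 128)
print(per_token)                   # 131072 bytes = 128 KiB per token
print(32_768 * per_token / 2**30)  # 4.0 GiB of KV cache at 32k context
```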
Step 3: Reserve VRAM headroom
The wizard reserves 20% of total VRAM for the KV cache. The rest is assumed to be used by model weights. This is a conservative estimate: actual weight usage depends on quantization (Q4 models use ~4 bits/weight vs Q8 at 8 bits).
usable_kv_vram = total_vram × 0.20
max_ctx_from_vram = usable_kv_vram ÷ bytes_per_token
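Putting the two formulas together (a sketch under the same assumptions; the 12 GB card and function name are illustrative):

```python
def max_ctx_from_vram(total_vram_bytes: int, bytes_per_token: int,
                      kv_fraction: float = 0.20) -> int:
    """Reserve a fraction of VRAM for the KV cache; the rest is left for weights."""
    usable_kv_vram = total_vram_bytes * kv_fraction
    return int(usable_kv_vram // bytes_per_token)

# 12 GB card, 128 KiB per token (typical 7-8B model):
print(max_ctx_from_vram(12 * 10**9, 131_072))  # 18310 tokens before snapping
```

Note the raw result (18,310 here) is not a standard size; the next step snaps it down.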
Step 4: Pick the largest standard size that fits
The wizard snaps to a standard power-of-two size (to avoid fragmentation and match model training checkpoints):
| Standard size | Tokens | ~KV cache (7B model) |
|---|---|---|
| 4096 | 4k | ~512 MB |
| 8192 | 8k | ~1 GB |
| 16384 | 16k | ~2 GB |
| 32768 | 32k | ~4 GB |
| 65536 | 64k | ~8 GB |
| 131072 | 128k | ~16 GB |
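The snapping step amounts to picking the largest entry from that table that fits both the VRAM budget and the model's native limit (a minimal sketch; the function name is illustrative):

```python
STANDARD_SIZES = [4096, 8192, 16384, 32768, 65536, 131072]

def snap_down(max_ctx_from_vram: int, native_max: int) -> int:
    """Largest standard power-of-two size within both limits."""
    limit = min(max_ctx_from_vram, native_max)
    fitting = [s for s in STANDARD_SIZES if s <= limit]
    return fitting[-1] if fitting else STANDARD_SIZES[0]

# 12 GB card example: VRAM budget allows ~18,310 tokens, model allows 131,072.
print(snap_down(18_310, 131_072))  # 16384
```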
GPU layers (--n-gpu-layers)
llama.cpp loads model layers individually onto the GPU. Setting --n-gpu-layers 99 tells it to load all layers onto GPU (99 is safely larger than any current model's layer count). This gives maximum speed.
If your VRAM is too small for the full model:
- llama.cpp will automatically fall back to CPU for overflow layers
- Speed degrades proportionally to how many layers end up on CPU
- You can tune this manually: e.g. --n-gpu-layers 20 for a partial offload
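A rough way to estimate a partial-offload value is to split the quantized file size evenly across layers and offload as many whole layers as fit (a heuristic sketch, not llama.cpp's own accounting; the numbers are illustrative):

```python
def gpu_layers_that_fit(model_file_bytes: int, n_layers: int,
                        free_vram_bytes: int) -> int:
    """Heuristic: assume each layer takes an equal share of the model file."""
    bytes_per_layer = model_file_bytes / n_layers
    return min(n_layers, int(free_vram_bytes // bytes_per_layer))

# ~4.7 GB Q4 file, 32 layers, 3 GB of VRAM left after the KV cache:
print(gpu_layers_that_fit(4_700_000_000, 32, 3_000_000_000))  # 20
```

This ignores per-layer variation and runtime buffers, so treat the result as a starting point and adjust downward if you hit out-of-memory errors.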
Changing context size after setup
Edit docker-compose.yml and restart the stack:
```sh
# Edit docker-compose.yml
nano docker-compose.yml
# Change: LLAMA_ARG_CTX_SIZE=32768
docker compose up -d
```
Rule of thumb by VRAM
| GPU VRAM | 7B model fits? | Recommended ctx |
|---|---|---|
| 4 GB | Q4 only (tight) | 4k–8k |
| 8 GB | Q4/Q5 comfortable | 8k–16k |
| 12 GB | Q6/Q8 comfortable | 16k–32k |
| 16 GB | Q8 or small FP16 | 32k–64k |
| 24 GB | Yes, all quants | 64k–128k |
| 48 GB+ | 70B Q4 fits | Full native ctx |