Context Window & GPU Settings
How the setup wizard picks your context size, GPU layers, and why these values matter for performance and VRAM usage.
What is context size?
Context size (--ctx-size) is the maximum number of tokens that can be in a single conversation: prompt + history + response combined. Larger = the model can "remember" more of the conversation. Smaller = less VRAM, faster inference.
How the wizard picks context size
The wizard reads the model's binary (GGUF) metadata directly and measures your available GPU memory, then calculates the largest context that fits comfortably.
Step 1: Read the model's native max context
GGUF files embed the model's training-time context limit in their header. The wizard reads this without loading the model (takes milliseconds). For example:
- Llama 3.1 8B → 131,072 tokens
- Qwen 2.5 7B → 131,072 tokens
- Gemma 3 9B → 131,072 tokens
- Phi-3 Mini → 128,000 tokens
Setting --ctx-size above this number causes degraded output quality: the model was never trained for longer sequences.
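That clamp can be expressed in a couple of lines (a minimal sketch; the function name is illustrative, not the wizard's actual code):

```python
def clamp_ctx(requested: int, native_max: int) -> int:
    """Never exceed the model's training-time context limit."""
    return min(requested, native_max)

# Llama 3.1 8B advertises 131,072 tokens in its GGUF header:
print(clamp_ctx(262_144, 131_072))  # asking for 256k is clamped to 131072
print(clamp_ctx(32_768, 131_072))   # within the limit, kept as-is
```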
Step 2: Estimate KV cache VRAM cost
Every token in the context window requires memory in the KV (key-value) cache โ the structure that lets the model attend to prior tokens. The formula:
bytes_per_token = 2 × n_layers × n_kv_heads × head_dim × 2 bytes (FP16)
For a typical 7–8B model (32 layers, 8 KV heads, 128 head dim):
bytes_per_token = 2 × 32 × 8 × 128 × 2 = 131,072 bytes ≈ 128 KB per token
At 32,768 ctx: 32,768 × 128 KB = ~4 GB KV cache
At 8,192 ctx: 8,192 × 128 KB = ~1 GB KV cache
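The arithmetic above can be checked with a short sketch (the function name is illustrative; the leading 2 covers the separate K and V tensors, the trailing factor is bytes per FP16 element):

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache cost per token: K and V tensors, one per layer, FP16."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Typical 7-8B model: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_token = kv_bytes_per_token(32, 8, 128)
print(per_token)                   # 131072 bytes = 128 KiB per token
print(32_768 * per_token / 2**30)  # 4.0 GiB of KV cache at 32k context
```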
Step 3: Reserve VRAM headroom
The wizard reserves 20% of total VRAM for the KV cache. The rest is assumed to be used by model weights. This is a conservative estimate: actual weight usage depends on quantization (Q4 models use ~4 bits/weight vs Q8 at 8 bits).
usable_kv_vram = total_vram × 0.20
max_ctx_from_vram = usable_kv_vram ÷ bytes_per_token
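Putting the two formulas together (a sketch under the same assumptions; the 12 GB card and function name are illustrative):

```python
def max_ctx_from_vram(total_vram_bytes: int, bytes_per_token: int,
                      kv_fraction: float = 0.20) -> int:
    """Reserve a fraction of VRAM for the KV cache; the rest is left for weights."""
    usable_kv_vram = total_vram_bytes * kv_fraction
    return int(usable_kv_vram // bytes_per_token)

# 12 GB card, 128 KiB per token (typical 7-8B model):
print(max_ctx_from_vram(12 * 10**9, 131_072))  # 18310 tokens before snapping
```

Note the raw result (18,310 here) is not a standard size; the next step snaps it down.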
Step 4: Pick the largest standard size that fits
The wizard snaps to a standard power-of-two size (to avoid fragmentation and match model training checkpoints):
| Standard size | Tokens | ~KV cache (7B model) |
|---|---|---|
| 4096 | 4k | ~512 MB |
| 8192 | 8k | ~1 GB |
| 16384 | 16k | ~2 GB |
| 32768 | 32k | ~4 GB |
| 65536 | 64k | ~8 GB |
| 131072 | 128k | ~16 GB |
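The snapping step amounts to picking the largest entry from that table that fits both the VRAM budget and the model's native limit (a minimal sketch; the function name is illustrative):

```python
STANDARD_SIZES = [4096, 8192, 16384, 32768, 65536, 131072]

def snap_down(max_ctx_from_vram: int, native_max: int) -> int:
    """Largest standard power-of-two size within both limits."""
    limit = min(max_ctx_from_vram, native_max)
    fitting = [s for s in STANDARD_SIZES if s <= limit]
    return fitting[-1] if fitting else STANDARD_SIZES[0]

# 12 GB card example: VRAM budget allows ~18,310 tokens, model allows 131,072.
print(snap_down(18_310, 131_072))  # 16384
```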
GPU layers (--n-gpu-layers)
llama.cpp loads model layers individually onto the GPU. Setting --n-gpu-layers 99 tells it to load all layers onto GPU (99 is safely larger than any current model's layer count). This gives maximum speed.
If your VRAM is too small for the full model:
- llama.cpp will automatically fall back to CPU for overflow layers
- Speed degrades proportionally to how many layers end up on CPU
- You can tune this manually: e.g. --n-gpu-layers 20 for a partial offload
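A rough way to estimate a partial-offload value is to split the quantized file size evenly across layers and offload as many whole layers as fit (a heuristic sketch, not llama.cpp's own accounting; the numbers are illustrative):

```python
def gpu_layers_that_fit(model_file_bytes: int, n_layers: int,
                        free_vram_bytes: int) -> int:
    """Heuristic: assume each layer takes an equal share of the model file."""
    bytes_per_layer = model_file_bytes / n_layers
    return min(n_layers, int(free_vram_bytes // bytes_per_layer))

# ~4.7 GB Q4 file, 32 layers, 3 GB of VRAM left after the KV cache:
print(gpu_layers_that_fit(4_700_000_000, 32, 3_000_000_000))  # 20
```

This ignores per-layer variation and runtime buffers, so treat the result as a starting point and adjust downward if you hit out-of-memory errors.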
Changing context size after setup
Edit docker-compose.yml and restart the stack:
```sh
# Edit docker-compose.yml
nano docker-compose.yml
# Change: LLAMA_ARG_CTX_SIZE=32768
docker compose up -d
```
Rule of thumb by VRAM
| GPU VRAM | 7B model fits? | Recommended ctx |
|---|---|---|
| 4 GB | Q4 only (tight) | 4k–8k |
| 8 GB | Q4/Q5 comfortable | 8k–16k |
| 12 GB | Q6/Q8 comfortable | 16k–32k |
| 16 GB | Q8 or small FP16 | 32k–64k |
| 24 GB | Yes, all quants | 64k–128k |
| 48 GB+ | 70B Q4 fits | Full native ctx |