llama-server configurations optimized for coding on Vulkan.
27B-parameter IQ4 model with Multi-Token Prediction (MTP) for speculative decoding. Uses a Q8 nextn draft model for fast token generation. Best all-round config for coding. Runs on a single 16GB AMD 6800.
IQ4_XS-quantized 27B model with light ngram-based speculative decoding instead of MTP. Runs on a single 16GB AMD 6800. It uses less VRAM and compute than the MTP version and delivers about 1/3 of its prediction performance, which leaves more room for context. Better suited to domains where prediction doesn't matter much.
9B coder model (Qwopus3.5) with MTP speculative decoding. Smaller and faster, good for quick coding tasks, but slower than the MoE configs.
Mixture-of-Experts model with 35B total / 3B active parameters. Runs on a single 16GB AMD 6800. Multiple configs, ranging from a fast IQ3_M quant to a larger Q4_K_XL.
Base configuration for the 35B-A3B UD model with Q4_K_S quantization and no MTP heads. Good reference config with 80K context. Runs on a single 16GB AMD 6800.
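
Since every config above targets a single 16GB card, a rough weight-size estimate helps explain why the quant choice limits how much context is left. The sketch below is only an approximation under stated assumptions: it uses the nominal bits-per-weight figures that llama.cpp's quantize tool lists for IQ4_XS (4.25) and IQ3_M (3.66), treats 16GB as 16 GiB, and ignores the KV cache, draft model, and compute buffers. The function names are hypothetical, and real GGUF files mix tensor types, so actual sizes will differ.

```python
# Rough estimate of model weight size: parameters x bits-per-weight / 8.
# Assumption: nominal bpw values from llama.cpp's quantize tool; real GGUF
# files mix tensor types, so actual file sizes will differ.
GIB = 1024 ** 3

BPW = {
    "IQ4_XS": 4.25,
    "IQ3_M": 3.66,
}


def weight_gib(params_billion: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bpw / 8 / GIB


def headroom_gib(params_billion: float, quant: str, vram_gib: float = 16.0) -> float:
    """VRAM left after weights, before KV cache and compute buffers.

    A negative value means the weights alone would not fit fully on the GPU.
    """
    return vram_gib - weight_gib(params_billion, BPW[quant])


if __name__ == "__main__":
    # Hypothetical labels for two of the model sizes described above.
    for label, params, quant in [
        ("27B dense", 27, "IQ4_XS"),
        ("35B-A3B MoE", 35, "IQ3_M"),
    ]:
        size = weight_gib(params, BPW[quant])
        left = headroom_gib(params, quant)
        print(f"{label} {quant}: ~{size:.1f} GiB weights, ~{left:.1f} GiB left")
```

Whatever headroom remains has to cover the KV cache for the chosen context length plus any draft model, which is consistent with the lighter ngram-based config leaving more context available than the MTP one.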