User Guide
Basic Usage
List Available Models
To see all available models in the database:
python main.py --list-models
This shows model name, size, architecture, and KV cache requirements for each model.
Check Specific Model
To check VRAM requirements for a specific model:
python main.py --model 7 --context 8192
python main.py -m 70 -c 16384 -q int4
The output includes:
VRAM Breakdown: Parameters, overhead, and KV cache memory
Total VRAM Required: Minimum and recommended GPU VRAM
GPU Compatibility: List of GPUs that can run the model, with free VRAM percentage
All Combinations
To see all model × GPU combinations:
python main.py --context 4096
Or with a larger context:
python main.py -c 8192
Command-Line Options
python main.py [OPTIONS]
Basic Options
-h, --help Show help message and exit
-c, --context CONTEXT Context size in tokens (default: 4096)
--list-models List all available models
-m SIZE, --model SIZE Model size in billions (e.g., 0.6, 7, 13, 70)
--gpu-type TYPE GPU type: consumer, datacenter, all (default: all)
--only-runs Show only running combinations
--group-gpu Group results by GPU instead of model
--summary TYPE Summary: model, gpu, both, none (default: both)
--export-csv FILE Export results to CSV
--export-json FILE Export results to JSON
-q, --quantization Q Model precision (fp32, fp16, int8, int4)
--mode MODE Calculation mode (theoretical, conservative, production)
Advanced Options (v0.2.0)
Layer Offload Optimization
--optimize-config Show optimal layer offload configuration
Calculates how many transformer layers can fit in GPU VRAM, displaying:
Layers on GPU vs CPU
Recommended --gpu-layers parameter for llama.cpp
Performance impact estimation
Offload options for all available GPUs
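The layer-split estimate can be sketched as a simple greedy fit: divide the free VRAM by the per-layer weight size. This is an illustrative sketch, not the tool's actual code; the function name and the example layer size are assumptions.

```python
def split_layers(total_layers: int, layer_size_gb: float, free_vram_gb: float):
    """Return (layers_on_gpu, layers_on_cpu) for a simple greedy split."""
    layers_on_gpu = min(total_layers, int(free_vram_gb // layer_size_gb))
    return layers_on_gpu, total_layers - layers_on_gpu

# e.g. a model with 80 layers of ~1.75 GB each (FP16) on a 24 GB GPU
gpu, cpu = split_layers(80, 1.75, 24.0)
print(f"--gpu-layers {gpu}  # {cpu} layers fall back to CPU")
```

The remainder after the integer division is what spills to system RAM, which is why larger contexts (which shrink free VRAM) push more layers onto the CPU.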
CPU Offload Analysis
--cpu-offload Enable CPU offload calculations
--system-ram GB System RAM available in GB (default: 32.0)
--pcie-gen GEN PCIe generation: 3.0, 4.0, or 5.0 (default: 4.0)
Calculates hybrid GPU+CPU inference configuration:
System RAM requirements
PCIe bandwidth impact on performance
Estimated tokens/second
Layer distribution between GPU and CPU
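The tokens/second estimate for hybrid inference is dominated by PCIe transfers: each generated token must stream the CPU-resident weights across the bus. The sketch below is a rough upper bound under that assumption; the bandwidth figures are nominal x16 link rates, and the tool's actual model may account for more factors.

```python
# Nominal x16 PCIe bandwidth per generation (GB/s); illustrative values.
PCIE_BANDWIDTH_GBS = {"3.0": 16.0, "4.0": 32.0, "5.0": 64.0}

def offload_tokens_per_second(cpu_weights_gb: float, pcie_gen: str) -> float:
    """Ceiling on tokens/s when cpu_weights_gb must cross PCIe per token."""
    bandwidth = PCIE_BANDWIDTH_GBS[pcie_gen]
    return bandwidth / cpu_weights_gb  # (GB/s) / (GB per token) = tokens/s

# e.g. 20 GB of weights left on the CPU side over PCIe 4.0
print(round(offload_tokens_per_second(20.0, "4.0"), 1))  # 1.6
```

This is why `--pcie-gen` matters so much: moving from 3.0 to 5.0 roughly quadruples the ceiling for the same layer split.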
Multi-GPU Support
--multi-gpu Enable multi-GPU mode
--gpu-config CONFIG Multi-GPU configuration (e.g., "2x4090,1x3090")
--multi-gpu-mode MODE Parallelism mode: tensor or pipeline (default: tensor)
Configuration format:
Homogeneous: 3x4090 (3 identical GPUs)
Heterogeneous: 2x4090,1x3090 (mixed GPUs)
Partial names: 2x3090,1x4090
Modes:
tensor - Model weights split across GPUs (same layers, different shards)
pipeline - Different layers on different GPUs
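The `--gpu-config` string format above can be parsed with a few lines of Python. This is a hypothetical helper to show the format's grammar (count, literal `x`, model name, comma-separated), not the tool's internal parser.

```python
def parse_gpu_config(config: str) -> list[tuple[int, str]]:
    """Parse "2x4090,1x3090" into [(2, "4090"), (1, "3090")]."""
    groups = []
    for part in config.split(","):
        count, model = part.strip().split("x", 1)  # split on the first 'x'
        groups.append((int(count), model))
    return groups

print(parse_gpu_config("2x4090,1x3090"))  # [(2, '4090'), (1, '3090')]
```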
Model Format Support
--gguf-file FILENAME GGUF filename to auto-detect quantization
--format FORMAT Model format: fp16, gguf, exl2, gptq, awq (default: fp16)
GGUF auto-detection supports: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, F16, F32
Format overhead multipliers:
FP16: 1.0x (baseline)
GGUF: 1.15x (+15% for metadata structure)
EXL2: 1.05x (+5% optimized layout)
GPTQ: 1.10x (+10% quantization metadata)
AWQ: 1.08x (+8% activation-aware quantization)
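Applying the overhead table above is a single multiplication. The multipliers come from this guide; the helper name is illustrative.

```python
# Format overhead multipliers from the table above.
FORMAT_OVERHEAD = {"fp16": 1.00, "gguf": 1.15, "exl2": 1.05,
                   "gptq": 1.10, "awq": 1.08}

def apply_format_overhead(base_gb: float, fmt: str) -> float:
    """Scale a base weight size by the per-format overhead multiplier."""
    return base_gb * FORMAT_OVERHEAD[fmt.lower()]

# e.g. 35 GB of weights stored as GGUF: 35 x 1.15
print(round(apply_format_overhead(35.0, "gguf"), 2))  # 40.25
```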
Usage Examples
Consumer GPUs only:
python main.py -c 8192 --gpu-type consumer
Show only viable combinations:
python main.py -c 4096 --only-runs
Export results to JSON:
python main.py -c 8192 --export-json results.json
Use INT4 quantization for larger models:
python main.py -c 8192 -q int4 --only-runs
Check if a 70B model fits on RTX 4090:
python main.py --model 70 --context 8192
Supported Precisions
- FP32 (Float32)
4 bytes per parameter. Highest precision and highest memory footprint; rarely used for LLM inference because of the VRAM cost.
- FP16 (Float16)
2 bytes per parameter. Industry standard for inference. Excellent precision with half the VRAM usage of FP32.
- INT8 (8-bit Integer)
1 byte per parameter. Aggressive quantization. Small quality loss with significant VRAM savings.
- INT4 (4-bit Integer)
0.5 byte per parameter. Very aggressive quantization. Highest savings but more noticeable quality loss.
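The bytes-per-parameter figures above translate directly into weight memory. A minimal lookup, mirroring the guide's numbers (the function name is illustrative):

```python
# Bytes per parameter for each supported precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def params_memory_gb(params_billion: float, precision: str) -> float:
    # billions of parameters x bytes/param ~= GB (1e9 bytes per GB)
    return params_billion * BYTES_PER_PARAM[precision]

print(params_memory_gb(70, "int4"))  # 35.0
```

This is why `-q int4` brings a 70B model (140 GB at FP16) within reach of a pair of 24 GB cards.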
Interpreting Results
RUNS (green): The GPU has sufficient VRAM to run the model with the specified context.
DOESN'T RUN (red): The GPU doesn't have enough VRAM.
- Low safety margin warning
⚠️ Low safety margin indicates that the combination runs, but with less than 10% VRAM free. This may cause out-of-memory (OOM) errors in real scenarios due to implementation variations.
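The 10% threshold can be checked with a one-line ratio. A sketch with illustrative names and an example requirement figure:

```python
def safety_margin(gpu_vram_gb: float, required_gb: float) -> float:
    """Free VRAM after loading, as a fraction of the GPU's total."""
    return (gpu_vram_gb - required_gb) / gpu_vram_gb

margin = safety_margin(24.0, 22.5)  # e.g. a 24 GB card needing 22.5 GB
if 0 < margin < 0.10:
    print(f"Low safety margin: only {margin:.0%} VRAM free")
```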
VRAM Calculation Formula
The tool uses a conservative model with four components:
1. Model Parameters
Base memory for the model weights. Because params_billion is expressed in billions, multiplying by bytes per parameter yields gigabytes directly:
params_memory_gb = params_billion × bytes_per_param
Example: 70B model in FP16 (2 bytes/param): 70 × 2 = 140 GB
2. Overhead
overhead_gb = params_memory_gb × 0.30 # 30%
3. KV Cache
kv_cache_gb = (kv_cache_mb_per_token × context_tokens × multiplier) / 1024
4. Total
total_vram_gb = model_with_overhead_gb + kv_cache_gb
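The four components above compose end to end. A minimal sketch using the guide's constants (30% overhead); kv_cache_mb_per_token is model-specific, and the 0.5 MB/token value in the example is purely illustrative.

```python
def total_vram_gb(params_billion: float, bytes_per_param: float,
                  kv_cache_mb_per_token: float, context_tokens: int,
                  multiplier: float = 1.0) -> float:
    params_gb = params_billion * bytes_per_param          # 1. model parameters
    overhead_gb = params_gb * 0.30                        # 2. 30% overhead
    kv_cache_gb = (kv_cache_mb_per_token * context_tokens
                   * multiplier) / 1024                   # 3. KV cache
    return params_gb + overhead_gb + kv_cache_gb          # 4. total

# 70B model, FP16 (2 bytes/param), 0.5 MB/token KV cache, 8192-token context:
# 140 GB weights + 42 GB overhead + 4 GB KV cache
print(total_vram_gb(70, 2.0, 0.5, 8192))  # 186.0
```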
Technical Notes
The calculations assume PyTorch-style memory allocation (HF Transformers, vLLM). Different backends may have varying memory behavior:
TensorRT-LLM: Custom allocators, ~10-20% less memory
llama.cpp (GGUF): Memory-mapped files, ~20-30% less memory
EXL2: Optimized allocation, minimal overhead
For detailed explanations of technical terms used in VRAM calculations, see Glossary / Glossário.