User Guide

Basic Usage

List Available Models

To see all available models in the database:

python main.py --list-models

This shows model name, size, architecture, and KV cache requirements for each model.

Check Specific Model

To check VRAM requirements for a specific model:

python main.py --model 7 --context 8192
python main.py -m 70 -c 16384 -q int4

The output includes:

  • VRAM Breakdown: Parameters, overhead, and KV cache memory

  • Total VRAM Required: Minimum and recommended GPU VRAM

  • GPU Compatibility: List of GPUs that can run the model, with free VRAM percentage

All Combinations

To see all model × GPU combinations:

python main.py --context 4096

Or with a larger context:

python main.py -c 8192

Command-Line Options

python main.py [OPTIONS]

Basic Options

-h, --help            Show help message and exit
-c, --context CONTEXT Context size in tokens (default: 4096)
--list-models         List all available models
-m, --model SIZE      Model size in billions (e.g., 0.6, 7, 13, 70)
--gpu-type TYPE       GPU type: consumer, datacenter, all (default: all)
--only-runs           Show only combinations that fit in VRAM
--group-gpu           Group results by GPU instead of model
--summary TYPE        Summary: model, gpu, both, none (default: both)
--export-csv FILE     Export results to CSV
--export-json FILE    Export results to JSON
-q, --quantization Q  Model precision (fp32, fp16, int8, int4)
--mode MODE           Calculation mode (theoretical, conservative, production)

Advanced Options (v0.2.0)

Layer Offload Optimization

--optimize-config      Show optimal layer offload configuration

Calculates how many transformer layers can fit in GPU VRAM, displaying:

  • Layers on GPU vs CPU

  • Recommended --gpu-layers parameter for llama.cpp

  • Performance impact estimation

  • Offload options for all available GPUs
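The layer-split calculation can be sketched in a few lines. This is an illustrative approximation, not the tool's actual internals: it assumes each transformer layer takes an equal share of the model's total VRAM footprint and that some VRAM must stay free for activations and the CUDA context (the function name and reserved_gb value are assumptions).

```python
def layers_on_gpu(model_vram_gb: float, num_layers: int,
                  gpu_vram_gb: float, reserved_gb: float = 1.5) -> int:
    """Estimate how many of num_layers fit on a GPU with gpu_vram_gb,
    keeping reserved_gb free for activations and runtime overhead."""
    per_layer_gb = model_vram_gb / num_layers
    usable_gb = max(gpu_vram_gb - reserved_gb, 0.0)
    return min(num_layers, int(usable_gb / per_layer_gb))

# 13B model in int4 (~8.5 GB with overhead), 40 layers, on a 12 GB GPU:
print(layers_on_gpu(8.5, 40, 12.0))   # all 40 layers fit
# 80-layer model needing 40 GB on a 24 GB GPU:
print(layers_on_gpu(40.0, 80, 24.0))  # partial offload
```

The first result would be passed to llama.cpp as --gpu-layers 40; the second case requires offloading the remaining layers to CPU.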

CPU Offload Analysis

--cpu-offload          Enable CPU offload calculations
--system-ram GB        System RAM available in GB (default: 32.0)
--pcie-gen GEN         PCIe generation: 3.0, 4.0, or 5.0 (default: 4.0)

Calculates hybrid GPU+CPU inference configuration:

  • System RAM requirements

  • PCIe bandwidth impact on performance

  • Estimated tokens/second

  • Layer distribution between GPU and CPU
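Why PCIe bandwidth dominates hybrid inference can be shown with a rough upper-bound estimate (a sketch under simplifying assumptions, not the tool's exact model): if the CPU-resident weights must stream over the bus for every generated token, the bus bandwidth caps throughput. The x16 bandwidth figures are approximate unidirectional values.

```python
PCIE_GBPS = {"3.0": 16.0, "4.0": 32.0, "5.0": 64.0}  # x16, approx. unidirectional

def pcie_bound_tokens_per_sec(cpu_weights_gb: float, pcie_gen: str) -> float:
    """Upper bound on tokens/s if each token re-reads cpu_weights_gb
    over PCIe (ignores caching, compute time, and GPU-resident layers)."""
    return PCIE_GBPS[pcie_gen] / cpu_weights_gb

# 20 GB of weights offloaded to system RAM on PCIe 4.0:
print(round(pcie_bound_tokens_per_sec(20.0, "4.0"), 2))  # 1.6 tokens/s at best
```

This is why the tool reports PCIe generation alongside the layer split: moving from PCIe 3.0 to 5.0 can quadruple the offload-bound throughput ceiling.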

Multi-GPU Support

--multi-gpu            Enable multi-GPU mode
--gpu-config CONFIG    Multi-GPU configuration (e.g., "2x4090,1x3090")
--multi-gpu-mode MODE  Parallelism mode: tensor or pipeline (default: tensor)

Configuration format:

  • Homogeneous: 3x4090 (3 identical GPUs)

  • Heterogeneous: 2x4090,1x3090 (mixed GPUs)

  • Partial names: 2x3090,1x4090
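Configuration strings in this format can be parsed with a few lines of Python (an illustrative sketch; the tool's own parser may differ):

```python
def parse_gpu_config(config: str) -> list[tuple[int, str]]:
    """Parse strings like '2x4090,1x3090' into [(count, model), ...]."""
    groups = []
    for part in config.split(","):
        count, model = part.strip().split("x", 1)
        groups.append((int(count), model))
    return groups

print(parse_gpu_config("2x4090,1x3090"))  # [(2, '4090'), (1, '3090')]
```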

Modes:

  • tensor - Model weights split across GPUs (same layers, different shards)

  • pipeline - Different layers on different GPUs

Model Format Support

--gguf-file FILENAME  GGUF filename to auto-detect quantization
--format FORMAT       Model format: fp16, gguf, exl2, gptq, awq (default: fp16)

GGUF auto-detection supports: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, F16, F32

Format overhead multipliers:

  • FP16: 1.0x (baseline)

  • GGUF: 1.15x (+15% for metadata structure)

  • EXL2: 1.05x (+5% optimized layout)

  • GPTQ: 1.10x (+10% quantization metadata)

  • AWQ: 1.08x (+8% activation-aware quantization)
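Applying the multipliers above is a single scale of the raw weight size (the constants mirror the table; the function name is illustrative):

```python
FORMAT_OVERHEAD = {"fp16": 1.00, "gguf": 1.15, "exl2": 1.05,
                   "gptq": 1.10, "awq": 1.08}

def weights_with_format_overhead(weights_gb: float, fmt: str) -> float:
    """Scale raw weight memory by the storage format's overhead."""
    return weights_gb * FORMAT_OVERHEAD[fmt]

# 7B model in int4 (~3.5 GB raw weights) stored as GGUF:
print(round(weights_with_format_overhead(3.5, "gguf"), 3))  # 4.025
```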

Usage Examples

Consumer GPUs only:

python main.py -c 8192 --gpu-type consumer

Show only viable combinations:

python main.py -c 4096 --only-runs

Export results to JSON:

python main.py -c 8192 --export-json results.json

Use INT4 quantization for larger models:

python main.py -c 8192 -q int4 --only-runs

Check if a 70B model fits on RTX 4090:

python main.py --model 70 --context 8192

Supported Precisions

FP32 (Float32)

4 bytes per parameter. Highest precision and highest memory footprint; rarely used for LLM inference.

FP16 (Float16)

2 bytes per parameter. Industry standard for inference. Excellent precision with half the VRAM usage of FP32.

INT8 (8-bit Integer)

1 byte per parameter. Aggressive quantization. Small quality loss with significant VRAM savings.

INT4 (4-bit Integer)

0.5 byte per parameter. Very aggressive quantization. Highest savings but more noticeable quality loss.
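The bytes-per-parameter values above translate directly into weight memory for a given model size (a sketch; names are illustrative):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]

# A 7B model at each precision:
for p in ("fp32", "fp16", "int8", "int4"):
    print(p, weights_gb(7, p), "GB")  # 28.0, 14.0, 7.0, 3.5
```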

Interpreting Results

RUNS (green)

The GPU has sufficient VRAM to run the model with the specified context.

DOESN'T RUN (red)

The GPU doesn't have enough VRAM for the model at the specified context.

Low safety margin warning

⚠️ Low safety margin indicates the combination runs but with less than 10% VRAM free. This may cause out-of-memory (OOM) errors in practice, since actual memory use varies between implementations.

VRAM Calculation Formula

The tool uses a conservative model with three memory components and their total:

1. Model Parameters

Base memory for the model weights. Because params_billion counts billions of parameters, multiplying by bytes per parameter yields gigabytes directly:

params_memory_gb = params_billion × bytes_per_param

Example: 70B model in FP16 (2 bytes/param): 70 × 2 = 140 GB

2. Overhead

overhead_gb = params_memory_gb × 0.30  # 30%

3. KV Cache

kv_cache_gb = (kv_cache_mb_per_token × context_tokens × multiplier) / 1024

4. Total

model_with_overhead_gb = params_memory_gb + overhead_gb
total_vram_gb = model_with_overhead_gb + kv_cache_gb
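The four steps above can be combined into one function. Variable names mirror the formulas; kv_cache_mb_per_token and multiplier are model-specific inputs, and the function itself is a sketch rather than the tool's actual code:

```python
def total_vram_gb(params_billion: float, bytes_per_param: float,
                  kv_cache_mb_per_token: float, context_tokens: int,
                  multiplier: float = 1.0) -> float:
    """Conservative VRAM estimate following the four-step formula."""
    params_memory_gb = params_billion * bytes_per_param               # 1. parameters
    overhead_gb = params_memory_gb * 0.30                             # 2. 30% overhead
    kv_cache_gb = (kv_cache_mb_per_token * context_tokens
                   * multiplier) / 1024                               # 3. KV cache
    return params_memory_gb + overhead_gb + kv_cache_gb               # 4. total

# 70B model, FP16 (2 bytes/param), 0.5 MB/token KV cache, 8192-token context:
print(total_vram_gb(70, 2.0, 0.5, 8192))  # 140 + 42 + 4 = 186.0 GB
```

This matches the worked example above: 140 GB of weights plus 30% overhead, plus the KV cache for the chosen context.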

Technical Notes

The calculations assume PyTorch-style memory allocation (HF Transformers, vLLM). Different backends may have varying memory behavior:

  • TensorRT-LLM: Custom allocators, ~10-20% less memory

  • llama.cpp (GGUF): Memory-mapped files, ~20-30% less memory

  • EXL2: Optimized allocation, minimal overhead

For detailed explanations of technical terms used in VRAM calculations, see Glossary / Glossário.