Examples
========

Example 1: List Available Models
--------------------------------

::

    python main.py --list-models

Output::

    AVAILABLE MODELS
    ======================================================================
    [7B] LLaMA 2 / Mistral / Qwen 7B
        Architecture: decoder-only
        Default precision: fp16
        KV cache: 0.6 MB/token (FP16 baseline)
    [13B] LLaMA 2 13B
        Architecture: decoder-only
        Default precision: fp16
        KV cache: 0.9 MB/token (FP16 baseline)
    ...
Example 2: How Much VRAM Do I Need for a 70B Model?
---------------------------------------------------

::

    python main.py --model 70 --context 8192

Output::

    VRAM BREAKDOWN: LLaMA 2 70B / LLaMA 3.1 70B
    ======================================================================
    Configuration:
        Context: 8,192 tokens
        Quantization: FP16
    Memory Breakdown:
        Model parameters: 13.67 GB
        Overhead (30%): 4.10 GB
        Model + overhead: 17.77 GB
        KV cache: 28.67 GB
        ----------------------------------------
        TOTAL VRAM: 46.44 GB
    Minimum GPU VRAM required: 46.4 GB
    Recommended (with margin): 51.1 GB
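The breakdown above follows a simple recipe: weight size, a fixed overhead fraction, and a KV cache that grows linearly with context. A back-of-envelope sketch of that recipe (the bytes-per-parameter, 30% overhead factor, and MB/token figures below are rule-of-thumb assumptions, not the tool's exact constants):

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float,
                     context_tokens: int, kv_mb_per_token: float,
                     overhead: float = 0.30) -> float:
    """Back-of-envelope VRAM estimate: weights + fixed overhead + KV cache."""
    weights_gb = params_b * bytes_per_param          # 1e9 params * bytes ~ GB
    kv_gb = context_tokens * kv_mb_per_token / 1024  # KV cache scales with context
    return weights_gb * (1 + overhead) + kv_gb

# e.g. a 7B model at FP16 (2 bytes/param) with the 0.6 MB/token figure above
print(f"{estimate_vram_gb(7, 2.0, 8192, 0.6):.1f} GB")
```

The same function reproduces the qualitative shape of any of the breakdowns in this document once you plug in the right constants.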
Example 3: Small Model with Fractional Size (0.6B)
--------------------------------------------------

Small models like Phi-3 Mini (3.8B) or Qwen2-0.5B use fractional model sizes::

    python main.py --model 0.6 --context 8192
    python main.py -m 3.8 -c 4096 -q int4

These models are designed for edge devices and can run on GPUs with as little as 4-6 GB of VRAM.
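A quick sanity check on that 4-6 GB claim, using rough rule-of-thumb figures (the 30% overhead factor and the ~1 GB KV-cache allowance are assumptions, not the tool's output):

```python
# Rough INT4 footprint for a 3.8B model (~0.5 bytes/param)
weights_gb = 3.8 * 0.5             # ~1.9 GB of weights
total_gb = weights_gb * 1.3 + 1.0  # + ~30% overhead + ~1 GB KV cache at short context
print(f"~{total_gb:.1f} GB")       # comfortably inside a 4-6 GB budget
```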
Example 4: Can I Run a 70B Model with INT4 on an RTX 4090?
----------------------------------------------------------

::

    python main.py -m 70 -c 16384 -q int4

Even with INT4 quantization, a 70B model at 16k context requires ~34 GB, so it does not fit on a single RTX 4090 (24 GB). You would need a shorter context, CPU offload, or a GPU with more VRAM.
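The verdict is easy to check by hand: at roughly 0.5 bytes per parameter, the INT4 weights alone already exceed the card's VRAM, before any KV cache or overhead is counted:

```python
weights_gb = 70 * 0.5            # 70B params at ~0.5 bytes/param (INT4) = 35 GB
rtx_4090_gb = 24
print(weights_gb > rtx_4090_gb)  # True: KV cache and overhead only widen the gap
```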
Example 5: Which GPU to Buy for a 34B Model?
--------------------------------------------

::

    python main.py --model 34 --context 8192

The output lists all GPUs that can run the model, ranked by free VRAM percentage.
Example 6: Compare All Quantizations
------------------------------------

::

    for q in fp32 fp16 int8 int4; do
        echo "=== $q ==="
        python main.py -m 7 -c 8192 -q $q | grep "TOTAL VRAM"
    done

Running the same model at every precision makes the trade-off obvious: INT4 needs roughly an eighth of the FP32 footprint, which is why it lets much larger models fit on the same GPU.
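The pattern behind those totals is just bytes per parameter. A sketch with the usual figures (these are common rules of thumb, not necessarily the tool's exact table):

```python
# Approximate bytes/param for each precision (rule-of-thumb figures)
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

for name, bpp in BYTES_PER_PARAM.items():
    print(f"{name}: 7B weights ~ {7 * bpp:.1f} GB")
```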
Example 7: Check Google Colab Compatibility
-------------------------------------------

::

    python main.py --model 13 --context 16384 -q int4 --gpu-type datacenter | grep -i colab

This quickly shows which Colab GPU tier can run your desired model.
Example 8: Export Results
-------------------------

::

    python main.py -c 8192 --export-json results.json
    python main.py -c 8192 --export-csv results.csv

The exported files contain all model/GPU combinations for further analysis.
Programmatic Usage
------------------

You can also use the library directly from Python::

    from calculator import VRAMCalculator, Quantization
    from models import get_all_models
    from gpus import get_all_gpus

    # Create a calculator with INT4 quantization
    calc = VRAMCalculator(quantization=Quantization.INT4)

    # Evaluate one model/GPU pair at a context of 8,192 tokens
    model = get_all_models()[0]  # 7B
    gpu = get_all_gpus()[0]      # RTX 3060
    result = calc.evaluate_pair(model, gpu, context_tokens=8192)

    print(f"Status: {result.status.value}")
    print(f"VRAM required: {result.required_vram_gb:.1f} GB")
    print(f"VRAM available: {result.gpu_vram_gb} GB")
Advanced Examples (v0.2.0)
--------------------------
Example 9: Layer Offload Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Calculate the optimal GPU layer distribution for a model that doesn't fully fit in VRAM::

    python main.py --model 70 --context 8192 --optimize-config --quantization int4

Output::

    OPTIMAL LAYER OFFLOAD CONFIGURATION
    ======================================================================
    Model: LLaMA 2 70B / LLaMA 3.1 70B (70B parameters)
    GPU: RTX 3090 (24 GB VRAM)
    Total Layers: 80
    Layer Distribution:
        Layers on GPU: 0
        Layers on CPU: 80
        Offload Ratio: 0.0%
    Memory Usage:
        GPU VRAM used: 32.50 GB / 24 GB
        CPU RAM used: 45.00 GB
    Performance Impact:
        Estimated slowdown: 1000%
    Recommended Configuration:
        llama.cpp: --gpu-layers 0
        AutoGPTQ: --gpu-memory 32.5G

For GPUs with more VRAM, you'll see partial offload options.
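The core of such an optimizer can be sketched in a few lines: reserve room for the KV cache and runtime overhead, then place as many whole layers on the GPU as still fit. The per-layer size and reserve below are made-up illustrative figures, not the tool's values:

```python
def split_layers(total_layers: int, layer_gb: float, vram_gb: float,
                 reserved_gb: float) -> tuple[int, int]:
    """Greedy split: put as many whole layers on the GPU as fit after
    reserving room for the KV cache, activations, and runtime overhead."""
    budget = max(vram_gb - reserved_gb, 0.0)
    on_gpu = min(total_layers, int(budget // layer_gb))
    return on_gpu, total_layers - on_gpu

# 80 layers at an assumed ~0.44 GB/layer (70B, INT4), 24 GB card, 6 GB reserved
print(split_layers(80, 0.44, 24.0, 6.0))
```

Increasing the reserve (for example, for a longer context) pushes more layers to the CPU, which is exactly the behaviour the report above describes.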
Example 10: CPU Offload Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Analyze hybrid GPU+CPU inference, taking PCIe bandwidth into account::

    python main.py --model 13 --context 8192 --cpu-offload --system-ram 64 --pcie-gen 4.0

Output::

    CPU OFFLOAD ANALYSIS
    ======================================================================
    System Requirements:
        System RAM required: 8.50 GB
        System RAM available: 64.00 GB
        Status: ✓ Fits in system RAM
    PCIe Configuration:
        Generation: PCIe 4.0
        Bandwidth: ~24 GB/s effective
    Performance Estimate:
        Token speed: ~35.0 tokens/second
        Speed ratio: 100.0% of full GPU
    Layer Distribution:
        Layers on GPU: 40
        Layers on CPU: 0
        Offload Ratio: 100.0%

PCIe generation impact:

- PCIe 3.0: ~12 GB/s effective (slower data transfer)
- PCIe 4.0: ~24 GB/s effective (standard for modern GPUs)
- PCIe 5.0: ~48 GB/s effective (best for CPU offload)
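Why PCIe bandwidth matters so much: in a naive hybrid setup, every generated token has to stream the CPU-resident weights across the bus. A simplified latency model (all figures assumed; real runtimes overlap transfer with compute and will do better):

```python
def hybrid_tokens_per_sec(full_gpu_tps: float, cpu_weight_gb: float,
                          pcie_gb_per_s: float) -> float:
    """Naive model: per-token latency = GPU compute time + time to stream the
    CPU-resident weights over PCIe (ignores overlap and CPU compute time)."""
    per_token_s = 1.0 / full_gpu_tps + cpu_weight_gb / pcie_gb_per_s
    return 1.0 / per_token_s

# 35 tok/s on a full GPU, 3 GB of weights left in system RAM, PCIe 4.0 (~24 GB/s)
print(f"{hybrid_tokens_per_sec(35.0, 3.0, 24.0):.1f} tok/s")
```

Doubling the bus bandwidth (PCIe 5.0) roughly halves the transfer term, which is why the generations are ranked as above.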
Example 11: Multi-GPU Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Calculate tensor parallelism across multiple GPUs::

    python main.py --params-b 405 --context 8192 --quantization int4 --multi-gpu --gpu-config "2x4090,1x3090"

Output::

    MULTI-GPU CONFIGURATION
    ======================================================================
    Model: Custom Model 405B (405B parameters)
    Status: DOESN'T RUN
    Bottleneck: RTX 3090
    Communication overhead: 5.28 GB
    Per-GPU Allocation:
        ✗ RTX 4090    25.32 GB / 24 GB
          Shard: 50.0%
    Framework Configuration:
        tensor_parallel_size: 3
        mode: tensor_parallel
        vllm: --tensor-parallel-size 3

Heterogeneous configurations (mixed GPU models) are supported; each GPU's allocation is proportional to its VRAM capacity.
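The proportional allocation rule can be sketched directly (the even split of communication overhead is an assumed, simplified policy, not necessarily what the tool does):

```python
def shard_by_vram(model_gb: float, gpu_vram_gb: list[float],
                  comm_overhead_gb: float = 0.0) -> list[float]:
    """Split a model across GPUs proportionally to each card's VRAM,
    spreading communication overhead evenly across the cards."""
    total_vram = sum(gpu_vram_gb)
    per_gpu_comm = comm_overhead_gb / len(gpu_vram_gb)
    return [model_gb * v / total_vram + per_gpu_comm for v in gpu_vram_gb]

# A 24 GB card takes 60% of a 100 GB model when paired with a 16 GB card
print(shard_by_vram(100.0, [24.0, 16.0]))  # [60.0, 40.0]
```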
Example 12: GGUF Auto-Detection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Auto-detect the quantization from a GGUF filename::

    python main.py --gguf-file "llama-2-7b.Q4_K_M.gguf" --context 4096

Output::

    GGUF file detected: llama-2-7b.Q4_K_M.gguf
    Quantization: Q4_K_M
    Effective bits: 5.50 bits/param
    Using quantization: int4

Supported GGUF quantizations:

- Q2_K, Q2_K_S, Q2_K_M, Q2_K_L (~3-4 bits effective)
- Q3_K, Q3_K_S, Q3_K_M, Q3_K_L, Q3_K_XS (~4-5 bits)
- Q4_K, Q4_K_S, Q4_K_M, Q4_0, Q4_1 (~5 bits)
- Q5_K, Q5_K_S, Q5_K_M, Q5_0, Q5_1 (~6 bits)
- Q6_K (~7 bits)
- Q8_0 (8 bits)
- F16, F32 (floating point)
Example 13: Model Format Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compare VRAM requirements across different model formats::

    python main.py --model 7 --context 8192 --format fp16
    python main.py --model 7 --context 8192 --format gguf
    python main.py --model 7 --context 8192 --format exl2

Format overhead is applied to the calculations automatically.
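A per-format overhead is simplest to think of as a multiplier on the raw weight size. The multipliers below are invented for illustration; the tool's actual values may differ:

```python
# Illustrative per-format overhead multipliers on raw weight size (assumed)
FORMAT_OVERHEAD = {"fp16": 1.00, "gguf": 1.05, "exl2": 1.03}

weights_gb = 7 * 2.0  # 7B model at 2 bytes/param
for fmt, mult in FORMAT_OVERHEAD.items():
    print(f"{fmt}: {weights_gb * mult:.2f} GB")
```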
Advanced Programmatic Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the advanced calculators programmatically::

    from calculator import LayerOffloadCalculator, CPUOffloadCalculator, Quantization
    from multi_gpu import MultiGPUCalculator, MultiGPUConfig, MultiGPUMode
    from formats import detect_gguf_quantization
    from models import get_model_by_size
    from gpus import get_all_gpus

    # Layer offload calculation
    layer_calc = LayerOffloadCalculator(quantization=Quantization.INT4)
    model = get_model_by_size(70)
    gpu = get_all_gpus()[0]  # RTX 3060
    result = layer_calc.calculate_optimal_offload(model, gpu, context_tokens=8192)
    print(f"Layers on GPU: {result.layers_on_gpu}/{result.total_layers}")
    print(f"Recommended: --gpu-layers {result.layers_on_gpu}")

    # Multi-GPU calculation (pick three GPUs from the catalog)
    gpu1, gpu2, gpu3 = get_all_gpus()[:3]
    multi_calc = MultiGPUCalculator(quantization=Quantization.INT4)
    config = MultiGPUConfig(
        gpus=[gpu1, gpu2, gpu3],
        mode=MultiGPUMode.TENSOR_PARALLEL,
    )
    result = multi_calc.calculate(model, config, context_tokens=8192)

    # GGUF detection
    gguf_info = detect_gguf_quantization("llama-2-7b.Q4_K_M.gguf")
    print(f"Quantization: {gguf_info.quant_name}")
    print(f"Effective bits: {gguf_info.bits_per_param}")