Examples

Example 1: List Available Models

python main.py --list-models

Output:

AVAILABLE MODELS
======================================================================
  [7B] LLaMA 2 / Mistral / Qwen 7B
       Architecture: decoder-only
       Default precision: fp16
       KV cache: 0.6 MB/token (FP16 baseline)

  [13B] LLaMA 2 13B
       Architecture: decoder-only
       Default precision: fp16
       KV cache: 0.9 MB/token (FP16 baseline)

  ...
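
The per-token KV-cache figures above scale linearly with context length, so you can estimate cache size for any context by simple multiplication. A minimal sketch (the 0.6 and 0.9 MB/token figures are taken from the listing above; binary GB is an assumption, the tool may use decimal units):

```python
# KV cache grows linearly with context: size = MB/token * tokens.
def kv_cache_gb(mb_per_token: float, context_tokens: int) -> float:
    return mb_per_token * context_tokens / 1024  # MB -> GB (binary)

# 7B model at FP16 with an 8,192-token context
print(f"{kv_cache_gb(0.6, 8192):.1f} GB")  # 4.8 GB
```

Doubling the context doubles this number, which is why long contexts dominate the budget for small models.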

Example 2: How Much VRAM Do I Need for a 70B Model?

python main.py --model 70 --context 8192

Output:

VRAM BREAKDOWN: LLaMA 2 70B / LLaMA 3.1 70B
======================================================================
Configuration:
  Context: 8,192 tokens
  Quantization: FP16

Memory Breakdown:
  Model parameters: 13.67 GB
  Overhead (30%):      4.10 GB
  Model + overhead:    17.77 GB
  KV cache:            28.67 GB
  ----------------------------------------
  TOTAL VRAM:          46.44 GB

Minimum GPU VRAM required: 46.4 GB
Recommended (with margin): 51.1 GB
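
The breakdown is plain arithmetic: weights, plus a 30% runtime overhead on the weights, plus the KV cache; the recommendation adds a safety margin on top. A sketch that reproduces the numbers above (the two input figures are copied from the output; the ~10% margin is an assumption inferred from the 51.1 GB line):

```python
params_gb = 13.67    # model parameters (from the output above)
kv_cache_gb = 28.67  # KV cache at 8,192 tokens (from the output above)

overhead_gb = params_gb * 0.30               # 30% runtime overhead on weights
total_gb = params_gb + overhead_gb + kv_cache_gb
recommended_gb = total_gb * 1.10             # assumed ~10% safety margin

print(f"Overhead:    {overhead_gb:.2f} GB")     # 4.10
print(f"Total:       {total_gb:.2f} GB")        # 46.44
print(f"Recommended: {recommended_gb:.1f} GB")  # 51.1
```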

Example 3: Small Model with Fractional Size (0.6B)

Small models like Phi-3 Mini (3.8B) or Qwen2-0.5B use fractional sizes:

python main.py --model 0.6 --context 8192
python main.py -m 3.8 -c 4096 -q int4

These models are designed for edge devices and can run on GPUs with as little as 4-6 GB VRAM.
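
To see why 4-6 GB is enough, consider the weight footprint alone (decimal GB; overhead and KV cache are excluded here, so real usage is somewhat higher):

```python
def weight_gb(params_b: float, bits_per_param: float) -> float:
    """Weight footprint in decimal GB for a given parameter count and precision."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

print(f"0.6B @ INT4: {weight_gb(0.6, 4):.2f} GB")   # 0.30
print(f"3.8B @ INT4: {weight_gb(3.8, 4):.2f} GB")   # 1.90
print(f"3.8B @ FP16: {weight_gb(3.8, 16):.2f} GB")  # 7.60
```

Even a 3.8B model at FP16 leaves little headroom on a 6 GB card, which is why INT4/INT8 quantization is the usual choice on edge hardware.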

Example 4: Can I Run a 70B Model with INT4 on RTX 4090?

python main.py -m 70 -c 16384 -q int4

This shows that with INT4 quantization, a 70B model with 16k context still requires ~34 GB, so it does not fit on an RTX 4090 (24 GB); you would need a shorter context, partial CPU offload, or a larger GPU.

Example 5: Which GPU to Buy for a 34B Model?

python main.py --model 34 --context 8192

The output will show all GPUs that can run the model, ranked by free VRAM percentage.
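
The ranking logic is straightforward: keep only GPUs whose VRAM covers the requirement, then sort by the fraction left free. A sketch (the GPU list and the 22 GB requirement are hypothetical placeholders, not the tool's actual numbers):

```python
# Rank GPUs that fit a given requirement by free-VRAM percentage.
gpus = {"RTX 4090": 24, "RTX 3090": 24, "A100 40GB": 40, "A100 80GB": 80}
required_gb = 22.0  # hypothetical requirement for a 34B model

fits = [(name, (vram - required_gb) / vram * 100)
        for name, vram in gpus.items() if vram >= required_gb]
for name, free_pct in sorted(fits, key=lambda x: -x[1]):
    print(f"{name}: {free_pct:.1f}% free")
```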

Example 6: Compare All Quantizations

for q in fp32 fp16 int8 int4; do
    echo "=== $q ==="
    python main.py -m 7 -c 8192 -q $q | grep "TOTAL VRAM"
done

Running the loop makes the trend obvious: each step down in precision shrinks the footprint, which is why INT4 lets much larger models fit on the same GPU.
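
The same comparison in pure Python, using the weight footprint only (decimal GB; the real tool also adds overhead and KV cache on top):

```python
# Weight footprint of a 7B model at each precision: params * bits / 8.
BITS = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}
params_b = 7  # 7B model, as in the loop above

for name, bits in BITS.items():
    gb = params_b * 1e9 * bits / 8 / 1e9
    print(f"{name}: {gb:.1f} GB weights")
# fp32: 28.0, fp16: 14.0, int8: 7.0, int4: 3.5
```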

Example 7: Check Google Colab Compatibility

python main.py --model 13 --context 16384 -q int4 --gpu-type datacenter | grep -i colab

This quickly shows which Colab GPU tier can run your desired model.

Example 8: Export Results

python main.py -c 8192 --export-json results.json
python main.py -c 8192 --export-csv results.csv

The exported files contain all combinations for further analysis.
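
Once exported, the JSON is easy to post-process, for example to filter combinations that fit a given card. A sketch, assuming each record carries model, GPU, and total-VRAM fields (the field names below are hypothetical; check the actual export for the real schema):

```python
import json

# Hypothetical sample of the exported structure
data = json.loads("""[
  {"model": "7B",  "gpu": "RTX 3060", "total_vram_gb": 9.5},
  {"model": "13B", "gpu": "RTX 3060", "total_vram_gb": 16.2}
]""")

fits_12gb = [r for r in data if r["total_vram_gb"] <= 12]
print([r["model"] for r in fits_12gb])  # ['7B']
```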

Programmatic Usage

You can also use the library directly in Python:

from calculator import VRAMCalculator, Quantization
from models import get_all_models
from gpus import get_all_gpus

# Create calculator with INT4 quantization
calc = VRAMCalculator(quantization=Quantization.INT4)

# Calculate for context 8192
model = get_all_models()[0]  # 7B
gpu = get_all_gpus()[0]  # RTX 3060

result = calc.evaluate_pair(model, gpu, context_tokens=8192)

print(f"Status: {result.status.value}")
print(f"VRAM required: {result.required_vram_gb:.1f} GB")
print(f"VRAM available: {result.gpu_vram_gb} GB")

Advanced Examples (v0.2.0)

Example 9: Layer Offload Optimization

Calculate the optimal GPU layer distribution for a model that doesn't fully fit in VRAM:

python main.py --model 70 --context 8192 --optimize-config --quantization int4

Output:

OPTIMAL LAYER OFFLOAD CONFIGURATION
======================================================================
Model: LLaMA 2 70B / LLaMA 3.1 70B (70B parameters)
GPU:   RTX 3090 (24 GB VRAM)
Total Layers: 80

Layer Distribution:
  Layers on GPU:  0
  Layers on CPU:  80
  Offload Ratio:  0.0%

Memory Usage:
  GPU VRAM used:  32.50 GB / 24 GB
  CPU RAM used:   45.00 GB

Performance Impact:
  Estimated slowdown: 1000%

Recommended Configuration:
  llama.cpp:  --gpu-layers 0
  AutoGPTQ:   --gpu-memory 32.5G

For GPUs with more VRAM, you'll see partial offload options.
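
The layer-split idea itself can be sketched independently of the tool: reserve VRAM for fixed costs (KV cache, activations), then fit as many uniform layer slices as the remainder allows. All numbers below are illustrative assumptions, not the tool's output:

```python
import math

def gpu_layers(total_layers: int, weights_gb: float,
               vram_gb: float, fixed_gb: float) -> int:
    """Max whole layers that fit on the GPU after reserving fixed costs."""
    per_layer_gb = weights_gb / total_layers
    budget = max(0.0, vram_gb - fixed_gb)
    return min(total_layers, math.floor(budget / per_layer_gb))

# 70B @ INT4: ~35 GB of weights over 80 layers; 24 GB GPU, ~6 GB fixed costs
print(gpu_layers(80, 35.0, 24.0, 6.0))  # 41
```

The result maps directly onto llama.cpp's --gpu-layers flag: everything that doesn't fit stays in system RAM.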

Example 10: CPU Offload Analysis

Analyze hybrid GPU+CPU inference with PCIe bandwidth considerations:

python main.py --model 13 --context 8192 --cpu-offload --system-ram 64 --pcie-gen 4.0

Output:

CPU OFFLOAD ANALYSIS
======================================================================
System Requirements:
  System RAM required: 8.50 GB
  System RAM available: 64.00 GB
  Status: ✓ Fits in system RAM

PCIe Configuration:
  Generation: PCIe 4.0
  Bandwidth: ~24 GB/s effective

Performance Estimate:
  Token speed: ~35.0 tokens/second
  Speed ratio: 100.0% of full GPU

Layer Distribution:
  Layers on GPU:  40
  Layers on CPU:  0
  Offload Ratio:  100.0%

PCIe generation impact:

  • PCIe 3.0: ~12 GB/s effective (slower data transfer)

  • PCIe 4.0: ~24 GB/s effective (standard for modern GPUs)

  • PCIe 5.0: ~48 GB/s effective (best for CPU offload)
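
A toy model of why the bus matters: every CPU-resident layer's weights must cross PCIe each token, so transfer time adds directly to per-token latency. The base speed and the 2 GB of CPU-held weights below are illustrative assumptions:

```python
def tokens_per_sec(base_tps: float, cpu_weights_gb: float, pcie_gbps: float) -> float:
    """Per-token time = base compute time + time to stream CPU-held weights."""
    xfer_s = cpu_weights_gb / pcie_gbps  # per-token transfer time
    return 1.0 / (1.0 / base_tps + xfer_s)

for gen, bw in [("3.0", 12), ("4.0", 24), ("5.0", 48)]:
    print(f"PCIe {gen}: {tokens_per_sec(35.0, 2.0, bw):.1f} tok/s")
```

Even this crude model shows diminishing returns: doubling bandwidth helps, but the compute floor (base_tps) caps the gain.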

Example 11: Multi-GPU Configuration

Calculate tensor parallelism across multiple GPUs:

python main.py --params-b 405 --context 8192 --quantization int4 --multi-gpu --gpu-config "2x4090,1x3090"

Output:

MULTI-GPU CONFIGURATION
======================================================================
Model: Custom Model 405B (405B parameters)
Status: DOESN'T RUN
  Bottleneck: RTX 3090
  Communication overhead: 5.28 GB

Per-GPU Allocation:
  ✗ RTX 4090              25.32 GB / 24 GB
    Shard: 50.0%

Framework Configuration:
  tensor_parallel_size: 3
  mode: tensor_parallel
  vllm: --tensor-parallel-size 3

Heterogeneous configurations (mixed GPU models) are supported. Each GPU's allocation is proportional to its VRAM capacity.
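
Proportional sharding can be sketched as: each GPU's share of the model equals its fraction of total VRAM, plus a per-GPU communication cost. The 1.76 GB per-GPU overhead below is derived from the 5.28 GB total over three GPUs shown above; the model size is illustrative:

```python
def shard_plan(gpus, model_gb, comm_gb_per_gpu):
    """gpus: list of (name, vram_gb). Returns (name, shard_frac, need_gb, fits)."""
    total_vram = sum(v for _, v in gpus)
    plan = []
    for name, vram in gpus:
        frac = vram / total_vram                 # share proportional to VRAM
        need = model_gb * frac + comm_gb_per_gpu # shard + communication buffers
        plan.append((name, frac, need, need <= vram))
    return plan

plan = shard_plan([("RTX 4090", 24), ("RTX 4090", 24), ("RTX 3090", 24)],
                  model_gb=70.0, comm_gb_per_gpu=1.76)
for name, frac, need, fits in plan:
    print(f"{'ok' if fits else 'X '} {name}: {frac:.1%} shard, {need:.2f} GB needed")
```

With equal-VRAM cards each takes a third of the model; a single over-budget GPU makes the whole configuration fail, which is the "bottleneck" reported above.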

Example 12: GGUF Auto-Detection

Auto-detect quantization from GGUF filename:

python main.py --gguf-file "llama-2-7b.Q4_K_M.gguf" --context 4096

Output:

GGUF file detected: llama-2-7b.Q4_K_M.gguf
  Quantization: Q4_K_M
  Effective bits: 5.50 bits/param
  Using quantization: int4

Supported GGUF quantizations:

  • Q2_K, Q2_K_S, Q2_K_M, Q2_K_L (~3-4 bits effective)

  • Q3_K, Q3_K_S, Q3_K_M, Q3_K_L, Q3_K_XS (~4-5 bits)

  • Q4_K, Q4_K_S, Q4_K_M, Q4_0, Q4_1 (~5 bits)

  • Q5_K, Q5_K_S, Q5_K_M, Q5_0, Q5_1 (~6 bits)

  • Q6_K (~7 bits)

  • Q8_0 (8 bits)

  • F16, F32 (floating point)
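
The detection idea is simple string matching on the filename's dotted segments. A standalone sketch covering a small subset of the tags listed above (the 5.5-bit figure for Q4_K_M matches the example output; the other values are approximations from the ranges above):

```python
# Map a few GGUF quant tags to approximate effective bits per parameter.
EFFECTIVE_BITS = {"Q4_K_M": 5.5, "Q8_0": 8.0, "F16": 16.0}

def detect_quant(filename: str):
    """Return (tag, effective_bits) for the first recognized dotted segment."""
    for part in filename.split("."):
        tag = part.upper()
        if tag in EFFECTIVE_BITS:
            return tag, EFFECTIVE_BITS[tag]
    return None

print(detect_quant("llama-2-7b.Q4_K_M.gguf"))  # ('Q4_K_M', 5.5)
```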

Example 13: Model Format Comparison

Compare VRAM requirements across different model formats:

python main.py --model 7 --context 8192 --format fp16
python main.py --model 7 --context 8192 --format gguf
python main.py --model 7 --context 8192 --format exl2

Format overhead is automatically applied to calculations.
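
Format overhead can be modelled as a multiplier on the base footprint. The multipliers below are purely hypothetical placeholders to show the mechanism; the tool's actual values may differ:

```python
# Hypothetical per-format overhead multipliers applied to the base footprint.
FORMAT_OVERHEAD = {"fp16": 1.00, "gguf": 1.05, "exl2": 1.03}
base_gb = 14.0  # 7B @ FP16 weights, decimal GB

for fmt, mult in FORMAT_OVERHEAD.items():
    print(f"{fmt}: {base_gb * mult:.2f} GB")
```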

Advanced Programmatic Usage

Using the advanced calculators programmatically:

from calculator import LayerOffloadCalculator, CPUOffloadCalculator, Quantization
from multi_gpu import MultiGPUCalculator, MultiGPUConfig, MultiGPUMode
from formats import detect_gguf_quantization
from models import get_model_by_size
from gpus import get_all_gpus

# Layer offload calculation
layer_calc = LayerOffloadCalculator(quantization=Quantization.INT4)
model = get_model_by_size(70)
gpu = get_all_gpus()[0]  # RTX 3060

result = layer_calc.calculate_optimal_offload(model, gpu, context_tokens=8192)
print(f"Layers on GPU: {result.layers_on_gpu}/{result.total_layers}")
print(f"Recommended: --gpu-layers {result.layers_on_gpu}")

# Multi-GPU calculation
gpus = get_all_gpus()
multi_calc = MultiGPUCalculator(quantization=Quantization.INT4)
config = MultiGPUConfig(
    gpus=gpus[:3],  # e.g. the first three GPUs from the database
    mode=MultiGPUMode.TENSOR_PARALLEL
)
result = multi_calc.calculate(model, config, context_tokens=8192)

# GGUF detection
gguf_info = detect_gguf_quantization("llama-2-7b.Q4_K_M.gguf")
print(f"Quantization: {gguf_info.quant_name}")
print(f"Effective bits: {gguf_info.bits_per_param}")