Examples
========

Example 1: List Available Models
--------------------------------

.. code-block:: bash

   python main.py --list-models

Output:

.. code-block:: text

   AVAILABLE MODELS
   ======================================================================

   [7B] LLaMA 2 / Mistral / Qwen 7B
       Architecture: decoder-only
       Default precision: fp16
       KV cache: 0.6 MB/token (FP16 baseline)

   [13B] LLaMA 2 13B
       Architecture: decoder-only
       Default precision: fp16
       KV cache: 0.9 MB/token (FP16 baseline)

   ...

Example 2: How Much VRAM Do I Need for a 70B Model?
---------------------------------------------------

.. code-block:: bash

   python main.py --model 70 --context 8192

Output:

.. code-block:: text

   VRAM BREAKDOWN: LLaMA 2 70B / LLaMA 3.1 70B
   ======================================================================

   Configuration:
       Context: 8,192 tokens
       Quantization: FP16

   Memory Breakdown:
       Model parameters: 13.67 GB
       Overhead (30%): 4.10 GB
       Model + overhead: 17.77 GB
       KV cache: 28.67 GB
       ----------------------------------------
       TOTAL VRAM: 46.44 GB

   Minimum GPU VRAM required: 46.4 GB
   Recommended (with margin): 51.1 GB

Example 3: Small Model with Fractional Size (0.6B)
--------------------------------------------------

Small models like Phi-3 Mini (3.8B) or Qwen2-0.5B use fractional sizes:

.. code-block:: bash

   python main.py --model 0.6 --context 8192
   python main.py -m 3.8 -c 4096 -q int4

These models are designed for edge devices and can run on GPUs with as
little as 4-6 GB of VRAM.

Example 4: Can I Run a 70B Model with INT4 on an RTX 4090?
----------------------------------------------------------

.. code-block:: bash

   python main.py -m 70 -c 16384 -q int4

This shows that even with INT4 quantization, a 70B model with a 16k context
requires ~34 GB, so it does not fit on a single RTX 4090 (24 GB); you would
need a shorter context, layer offloading, or a larger GPU.

Example 5: Which GPU to Buy for a 34B Model?
--------------------------------------------

.. code-block:: bash

   python main.py --model 34 --context 8192

The output lists all GPUs that can run the model, ranked by free VRAM
percentage.

Example 6: Compare All Quantizations
------------------------------------

.. code-block:: bash

   for q in fp32 fp16 int8 int4; do
       echo "=== $q ==="
       python main.py -m 7 -c 8192 -q $q | grep "TOTAL VRAM"
   done

This makes it easy to see how much VRAM each quantization level saves, and
why INT4 allows much larger models on the same GPU.

Example 7: Check Google Colab Compatibility
-------------------------------------------

.. code-block:: bash

   python main.py --model 13 --context 16384 -q int4 --gpu-type datacenter | grep -i colab

This quickly shows which Colab GPU tier can run your desired model.

Example 8: Export Results
-------------------------

.. code-block:: bash

   python main.py -c 8192 --export-json results.json

.. code-block:: bash

   python main.py -c 8192 --export-csv results.csv

The exported files contain all model/GPU combinations for further analysis.

Programmatic Usage
------------------

You can also use the library directly in Python:

.. code-block:: python

   from calculator import VRAMCalculator, Quantization
   from models import get_all_models
   from gpus import get_all_gpus

   # Create a calculator with INT4 quantization
   calc = VRAMCalculator(quantization=Quantization.INT4)

   # Evaluate one model/GPU pair at a context of 8,192 tokens
   model = get_all_models()[0]  # 7B
   gpu = get_all_gpus()[0]      # RTX 3060
   result = calc.evaluate_pair(model, gpu, context_tokens=8192)

   print(f"Status: {result.status.value}")
   print(f"VRAM required: {result.required_vram_gb:.1f} GB")
   print(f"VRAM available: {result.gpu_vram_gb} GB")

Advanced Examples (v0.2.0)
--------------------------

Example 9: Layer Offload Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Calculate the optimal GPU layer distribution for a model that doesn't fully
fit in VRAM:

.. code-block:: bash

   python main.py --model 70 --context 8192 --optimize-config --quantization int4

Output:

.. code-block:: text

   OPTIMAL LAYER OFFLOAD CONFIGURATION
   ======================================================================

   Model: LLaMA 2 70B / LLaMA 3.1 70B (70B parameters)
   GPU: RTX 3090 (24 GB VRAM)
   Total Layers: 80

   Layer Distribution:
       Layers on GPU: 0
       Layers on CPU: 80
       Offload Ratio: 0.0%

   Memory Usage:
       GPU VRAM used: 32.50 GB / 24 GB
       CPU RAM used: 45.00 GB

   Performance Impact:
       Estimated slowdown: 1000%

   Recommended Configuration:
       llama.cpp: --gpu-layers 0
       AutoGPTQ: --gpu-memory 32.5G

For GPUs with more VRAM, you'll see partial offload options.

Example 10: CPU Offload Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Analyze hybrid GPU+CPU inference with PCIe bandwidth taken into account:

.. code-block:: bash

   python main.py --model 13 --context 8192 --cpu-offload --system-ram 64 --pcie-gen 4.0

Output:

.. code-block:: text

   CPU OFFLOAD ANALYSIS
   ======================================================================

   System Requirements:
       System RAM required: 8.50 GB
       System RAM available: 64.00 GB
       Status: ✓ Fits in system RAM

   PCIe Configuration:
       Generation: PCIe 4.0
       Bandwidth: ~24 GB/s effective

   Performance Estimate:
       Token speed: ~35.0 tokens/second
       Speed ratio: 100.0% of full GPU

   Layer Distribution:
       Layers on GPU: 40
       Layers on CPU: 0
       Offload Ratio: 100.0%

PCIe generation impact:

* **PCIe 3.0**: ~12 GB/s effective (slower data transfer)
* **PCIe 4.0**: ~24 GB/s effective (standard for modern GPUs)
* **PCIe 5.0**: ~48 GB/s effective (best for CPU offload)

Example 11: Multi-GPU Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Calculate tensor parallelism across multiple GPUs:

.. code-block:: bash

   python main.py --params-b 405 --context 8192 --quantization int4 --multi-gpu --gpu-config "2x4090,1x3090"

Output:

.. code-block:: text

   MULTI-GPU CONFIGURATION
   ======================================================================

   Model: Custom Model 405B (405B parameters)
   Status: DOESN'T RUN
   Bottleneck: RTX 3090
   Communication overhead: 5.28 GB

   Per-GPU Allocation:
       ✗ RTX 4090    25.32 GB / 24 GB    Shard: 50.0%

   Framework Configuration:
       tensor_parallel_size: 3
       mode: tensor_parallel
       vllm: --tensor-parallel-size 3

Heterogeneous configurations (mixed GPU models) are supported. Each GPU's
allocation is proportional to its VRAM capacity.

Example 12: GGUF Auto-Detection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Auto-detect the quantization from a GGUF filename:

.. code-block:: bash

   python main.py --gguf-file "llama-2-7b.Q4_K_M.gguf" --context 4096

Output:

.. code-block:: text

   GGUF file detected: llama-2-7b.Q4_K_M.gguf
   Quantization: Q4_K_M
   Effective bits: 5.50 bits/param
   Using quantization: int4

Supported GGUF quantizations:

* Q2_K, Q2_K_S, Q2_K_M, Q2_K_L (~3-4 bits effective)
* Q3_K, Q3_K_S, Q3_K_M, Q3_K_L, Q3_K_XS (~4-5 bits)
* Q4_K, Q4_K_S, Q4_K_M, Q4_0, Q4_1 (~5 bits)
* Q5_K, Q5_K_S, Q5_K_M, Q5_0, Q5_1 (~6 bits)
* Q6_K (~7 bits)
* Q8_0 (8 bits)
* F16, F32 (floating point)

Example 13: Model Format Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compare VRAM requirements across different model formats:

.. code-block:: bash

   python main.py --model 7 --context 8192 --format fp16
   python main.py --model 7 --context 8192 --format gguf
   python main.py --model 7 --context 8192 --format exl2

Format-specific overhead is applied to the calculations automatically.

Advanced Programmatic Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Using the advanced calculators programmatically:

.. code-block:: python

   from calculator import (
       CPUOffloadCalculator,
       LayerOffloadCalculator,
       Quantization,
   )
   from multi_gpu import MultiGPUCalculator, MultiGPUConfig, MultiGPUMode
   from formats import detect_gguf_quantization
   from models import get_model_by_size
   from gpus import get_all_gpus

   # Layer offload calculation
   layer_calc = LayerOffloadCalculator(quantization=Quantization.INT4)
   model = get_model_by_size(70)
   gpu = get_all_gpus()[0]  # RTX 3060
   result = layer_calc.calculate_optimal_offload(model, gpu, context_tokens=8192)
   print(f"Layers on GPU: {result.layers_on_gpu}/{result.total_layers}")
   print(f"Recommended: --gpu-layers {result.layers_on_gpu}")

   # Multi-GPU calculation across three GPUs
   gpu1, gpu2, gpu3 = get_all_gpus()[:3]
   multi_calc = MultiGPUCalculator(quantization=Quantization.INT4)
   config = MultiGPUConfig(
       gpus=[gpu1, gpu2, gpu3],
       mode=MultiGPUMode.TENSOR_PARALLEL,
   )
   result = multi_calc.calculate(model, config, context_tokens=8192)

   # GGUF detection
   gguf_info = detect_gguf_quantization("llama-2-7b.Q4_K_M.gguf")
   print(f"Quantization: {gguf_info.quant_name}")
   print(f"Effective bits: {gguf_info.bits_per_param}")
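The VRAM breakdowns shown in the examples above follow a simple pattern:
weights, a fixed overhead fraction, and a per-token KV cache. The sketch
below is a back-of-the-envelope reconstruction of that arithmetic, not the
tool's actual code; ``estimate_vram_gb`` and the 30% overhead default are
assumptions based on the output shown in Example 2:

.. code-block:: python

   def estimate_vram_gb(params_b: float, bits_per_param: float,
                        kv_mb_per_token: float, context_tokens: int,
                        overhead: float = 0.30) -> float:
       """Rough VRAM estimate: weights + overhead + KV cache (decimal GB)."""
       weights_gb = params_b * 1e9 * bits_per_param / 8 / 1e9
       kv_gb = kv_mb_per_token * context_tokens / 1e3
       return weights_gb * (1 + overhead) + kv_gb

   # 7B model at FP16 with the 0.6 MB/token KV figure from --list-models
   total = estimate_vram_gb(7, 16, 0.6, 8192)
   print(f"{total:.1f} GB")  # → 23.1 GB

Dropping to INT4 (4 bits/param) shrinks the weights term by 4x while the
KV cache term stays the same, which is why long contexts dominate the
budget for heavily quantized models.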
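GGUF auto-detection (Example 12) works purely on the filename. As an
illustration of how such detection can be done, here is a minimal sketch;
``detect_quant_from_filename`` is hypothetical and is not the library's
``detect_gguf_quantization`` implementation:

.. code-block:: python

   import re

   def detect_quant_from_filename(filename: str) -> str | None:
       """Pull the quantization tag out of a GGUF filename,
       e.g. 'llama-2-7b.Q4_K_M.gguf' -> 'Q4_K_M'."""
       m = re.search(r"\.((?:Q\d|F16|F32)[A-Za-z0-9_]*)\.gguf$",
                     filename, re.IGNORECASE)
       return m.group(1).upper() if m else None

   print(detect_quant_from_filename("llama-2-7b.Q4_K_M.gguf"))  # Q4_K_M
   print(detect_quant_from_filename("model.safetensors"))       # None

The detected tag can then be mapped to an effective bits-per-parameter
figure (as in the table in Example 12) before running the usual VRAM
calculation.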