User Guide ========== Basic Usage ------------ List Available Models ~~~~~~~~~~~~~~~~~~~~ To see all available models in the database: .. code-block:: bash python main.py --list-models This shows model name, size, architecture, and KV cache requirements for each model. Check Specific Model ~~~~~~~~~~~~~~~~~~~~~ To check VRAM requirements for a specific model: .. code-block:: bash python main.py --model 7 --context 8192 python main.py -m 70 -c 16384 -q int4 The output includes: * **VRAM Breakdown**: Parameters, overhead, and KV cache memory * **Total VRAM Required**: Minimum and recommended GPU VRAM * **GPU Compatibility**: List of GPUs that can run the model, with free VRAM percentage All Combinations ~~~~~~~~~~~~~~~~ To see all model × GPU combinations: .. code-block:: bash python main.py --context 4096 Or with a larger context: .. code-block:: bash python main.py -c 8192 Command-Line Options -------------------- .. code-block:: bash python main.py [OPTIONS] Basic Options ~~~~~~~~~~~~~ .. code-block:: bash -h, --help Show help message and exit -c, --context CONTEXT Context size in tokens (default: 4096) --list-models List all available models -m SIZE, --model SIZE Model size in billions (e.g., 0.6, 7, 13, 70) --gpu-type TYPE GPU type: consumer, datacenter, all (default: all) --only-runs Show only running combinations --group-gpu Group results by GPU instead of model --summary TYPE Summary: model, gpu, both, none (default: both) --export-csv FILE Export results to CSV --export-json FILE Export results to JSON -q, --quantization Q Model precision (fp32, fp16, int8, int4) --mode MODE Calculation mode (theoretical, conservative, production) Advanced Options (v0.2.0) ~~~~~~~~~~~~~~~~~~~~~~~~ Layer Offload Optimization ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. 
code-block:: bash --optimize-config Show optimal layer offload configuration Calculates how many transformer layers can fit in GPU VRAM, displaying: * Layers on GPU vs CPU * Recommended ``--gpu-layers`` parameter for llama.cpp * Performance impact estimation * Offload options for all available GPUs CPU Offload Analysis ^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash --cpu-offload Enable CPU offload calculations --system-ram GB System RAM available in GB (default: 32.0) --pcie-gen GEN PCIe generation: 3.0, 4.0, or 5.0 (default: 4.0) Calculates hybrid GPU+CPU inference configuration: * System RAM requirements * PCIe bandwidth impact on performance * Estimated tokens/second * Layer distribution between GPU and CPU Multi-GPU Support ^^^^^^^^^^^^^^^^^ .. code-block:: bash --multi-gpu Enable multi-GPU mode --gpu-config CONFIG Multi-GPU configuration (e.g., "2x4090,1x3090") --multi-gpu-mode MODE Parallelism mode: tensor or pipeline (default: tensor) Configuration format: * Homogeneous: ``3x4090`` (3 identical GPUs) * Heterogeneous: ``2x4090,1x3090`` (mixed GPUs) * Partial names: ``2x3090,1x4090`` Modes: * **tensor** - Model weights split across GPUs (same layers, different shards) * **pipeline** - Different layers on different GPUs Model Format Support ^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash --gguf-file FILENAME GGUF filename to auto-detect quantization --format FORMAT Model format: fp16, gguf, exl2, gptq, awq (default: fp16) GGUF auto-detection supports: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, F16, F32 Format overhead multipliers: * FP16: 1.0x (baseline) * GGUF: 1.15x (+15% for metadata structure) * EXL2: 1.05x (+5% optimized layout) * GPTQ: 1.10x (+10% quantization metadata) * AWQ: 1.08x (+8% activation-aware quantization) Usage Examples -------------- Consumer GPUs only: .. code-block:: bash python main.py -c 8192 --gpu-type consumer Show only viable combinations: .. code-block:: bash python main.py -c 4096 --only-runs Export results to JSON: .. 
code-block:: bash python main.py -c 8192 --export-json results.json Use INT4 quantization for larger models: .. code-block:: bash python main.py -c 8192 -q int4 --only-runs Check if a 70B model fits on RTX 4090: .. code-block:: bash python main.py --model 70 --context 8192 Supported Precisions -------------------- FP32 (Float32) 4 bytes per parameter. Highest precision, highest VRAM usage. Rarely used for LLM inference due to high VRAM usage. FP16 (Float16) 2 bytes per parameter. Industry standard for inference. Excellent precision with half the VRAM usage of FP32. INT8 (8-bit Integer) 1 byte per parameter. Aggressive quantization. Small quality loss with significant VRAM savings. INT4 (4-bit Integer) 0.5 byte per parameter. Very aggressive quantization. Highest savings but more noticeable quality loss. Interpreting Results -------------------- ``RUNS`` (green) The GPU has sufficient VRAM to run the model with the specified context. ``DOESN'T RUN`` (red) The GPU doesn't have enough VRAM. Low safety margin warning ``⚠️ Low safety margin`` indicates the combination runs but with less than 10% VRAM free. This may cause OOM in real scenarios due to implementation variations. VRAM Calculation Formula ----------------------- The tool uses a conservative model with four components: **1. Model Parameters** Base memory for model weights. Since params_billion is in billions: .. code-block:: python params_memory_gb = params_billion × bytes_per_param Example: 70B model in FP16 (2 bytes/param): 70 × 2 = 140 GB **2. Overhead** .. code-block:: python overhead_gb = params_memory_gb × 0.30 # 30% **3. KV Cache** .. code-block:: python kv_cache_gb = (kv_cache_mb_per_token × context_tokens × multiplier) / 1024 **4. Total** .. code-block:: python total_vram_gb = model_with_overhead_gb + kv_cache_gb Technical Notes --------------- The calculations assume PyTorch-style memory allocation (HF Transformers, vLLM). 
Different backends may have varying memory behavior:

* **TensorRT-LLM**: Custom allocators, ~10-20% less memory
* **llama.cpp (GGUF)**: Memory-mapped files, ~20-30% less memory
* **EXL2**: Optimized allocation, minimal overhead

For detailed explanations of the technical terms used in VRAM calculations, see :doc:`glossary`.
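The layer-split arithmetic behind ``--optimize-config`` can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation; the layer count, per-layer size, and base memory figures in the example are hypothetical:

.. code-block:: python

   def split_layers(total_layers, per_layer_gb, base_gb, gpu_vram_gb):
       """Greedy split: fit as many transformer layers on the GPU as VRAM allows.

       base_gb covers non-layer memory (embeddings, KV cache, overhead).
       Returns (gpu_layers, cpu_layers); gpu_layers is what you would pass
       to llama.cpp's --gpu-layers flag.
       """
       free_gb = gpu_vram_gb - base_gb
       gpu_layers = max(0, min(total_layers, int(free_gb / per_layer_gb)))
       return gpu_layers, total_layers - gpu_layers

   # Hypothetical 80-layer model at ~1.7 GB/layer on a 24 GB GPU,
   # reserving 4 GB for everything that is not a layer:
   print(split_layers(80, 1.7, 4.0, 24.0))  # → (11, 69)

When the free VRAM is negative (the base memory alone exceeds the card), the ``max(0, ...)`` clamp keeps the GPU layer count at zero, i.e. full CPU offload.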
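A ``--gpu-config`` string of the form described under Multi-GPU Support could be parsed with a sketch like this. The function name and the ``(count, model)`` return shape are illustrative assumptions, not the tool's actual API:

.. code-block:: python

   def parse_gpu_config(config):
       """Parse a --gpu-config string like "2x4090,1x3090" into (count, model) pairs."""
       gpus = []
       for part in config.split(","):
           count, model = part.strip().split("x", 1)
           gpus.append((int(count), model))
       return gpus

   print(parse_gpu_config("2x4090,1x3090"))  # → [(2, '4090'), (1, '3090')]

Both the homogeneous (``3x4090``) and heterogeneous (``2x4090,1x3090``) formats reduce to the same comma-separated ``NxMODEL`` grammar, so one loop handles both.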
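GGUF auto-detection can be approximated by scanning the filename for one of the supported tags. The regex and the bits-per-weight figures below are rough illustrative assumptions (K-quants mix block sizes, so real averages differ), not the tool's exact values:

.. code-block:: python

   import re

   # Approximate average bits per weight for each supported tag (illustrative).
   BITS_PER_WEIGHT = {
       "Q2_K": 2.5, "Q3_K": 3.4, "Q4_K": 4.5, "Q5_K": 5.5,
       "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0, "F32": 32.0,
   }

   def detect_quantization(filename):
       """Return the first supported quantization tag found in a GGUF filename."""
       match = re.search(r"(Q2_K|Q3_K|Q4_K|Q5_K|Q6_K|Q8_0|F16|F32)", filename.upper())
       return match.group(1) if match else None

   def gguf_weights_gb(params_billion, tag):
       """Rough weight size in GB for params_billion parameters at the given tag."""
       return params_billion * BITS_PER_WEIGHT[tag] / 8

   tag = detect_quantization("llama-2-70b.Q4_K_M.gguf")
   print(tag)                        # → Q4_K
   print(gguf_weights_gb(70, tag))   # → 39.375

Matching against the uppercased filename makes the detection case-insensitive, which covers the common ``q4_k_m`` / ``Q4_K_M`` naming variants seen in the wild.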
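The overhead multipliers listed under Model Format Support apply as a simple scale factor on the weight memory. ``apply_format_overhead`` is an illustrative helper, not part of the tool:

.. code-block:: python

   # Multipliers from the Model Format Support section.
   FORMAT_OVERHEAD = {
       "fp16": 1.00,  # baseline
       "gguf": 1.15,  # +15% metadata structure
       "exl2": 1.05,  # +5% optimized layout
       "gptq": 1.10,  # +10% quantization metadata
       "awq":  1.08,  # +8% activation-aware quantization
   }

   def apply_format_overhead(weights_gb, fmt):
       """Scale the raw weight memory by the format's overhead multiplier."""
       return weights_gb * FORMAT_OVERHEAD[fmt]

   # 140 GB of FP16 weights repacked as GGUF: 140 × 1.15 ≈ 161 GB
   print(apply_format_overhead(140.0, "gguf"))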
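The four components of the VRAM Calculation Formula section can be combined into a single worked sketch. The formula itself comes from that section; the 0.5 MB/token KV cache figure in the example is an illustrative placeholder, not a value from the tool's model database:

.. code-block:: python

   def estimate_total_vram_gb(params_billion, bytes_per_param,
                              kv_cache_mb_per_token, context_tokens,
                              multiplier=1.0):
       """Combine the four formula components into one VRAM estimate (GB)."""
       params_memory_gb = params_billion * bytes_per_param           # 1. weights
       overhead_gb = params_memory_gb * 0.30                         # 2. 30% overhead
       kv_cache_gb = (kv_cache_mb_per_token * context_tokens
                      * multiplier) / 1024                           # 3. KV cache
       return params_memory_gb + overhead_gb + kv_cache_gb           # 4. total

   # 70B model in FP16 at 8192 tokens with an assumed 0.5 MB/token KV cache:
   # 140 GB weights + 42 GB overhead + 4 GB KV cache = 186 GB
   print(estimate_total_vram_gb(70, 2, 0.5, 8192))  # → 186.0

This also makes the ``RUNS`` / ``DOESN'T RUN`` verdict explicit: a GPU qualifies when its VRAM meets or exceeds this estimate, and the low safety margin warning corresponds to less than 10% headroom above it.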