API Reference

This document describes the Local Inference Calculator API.

Modules

calculator

VRAM usage calculator for LLM inference.

Implements the logic for estimating the memory required to run language models on specific GPUs.

class calculator.Quantization(value)[source]

Bases: Enum

Supported precision/quantization types for inference.

FP32 = 'fp32'
FP16 = 'fp16'
INT8 = 'int8'
INT4 = 'int4'
property bytes_per_param: float

Returns bytes per parameter for this precision.

property kv_cache_multiplier: float

KV cache multiplier based on precision.

Note: KV cache usually remains in FP16 even with quantized weights, but some frameworks support quantized KV cache with INT8/INT4.
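
A minimal usage sketch (assuming the flat calculator module shown in this reference; the per-precision byte size in the comment follows the usual convention and is illustrative, not read from the source):

   from calculator import Quantization

   q = Quantization.INT4
   print(q.value)                 # 'int4'
   print(q.bytes_per_param)       # bytes per parameter (0.5 for INT4 under the usual convention)
   print(q.kv_cache_multiplier)   # multiplier applied to the FP16 KV-cache baseline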

class calculator.Status(value)[source]

Bases: Enum

Inference feasibility status.

RUNS = 'RUNS'
NOT_RUNS = "DOESN'T RUN"
class calculator.CalculationMode(value)[source]

Bases: Enum

VRAM calculation mode.

THEORETICAL = 'theoretical'
CONSERVATIVE = 'conservative'
PRODUCTION = 'production'
class calculator.InferenceResult(model_name: str, model_params_billion: int, gpu_name: str, gpu_vram_gb: int, required_vram_gb: float, status: Status, vram_free_percent: float, quantization: Quantization, warning: str | None = None)[source]

Bases: object

Feasibility analysis result for a model × GPU pair.

model_name

Name of the LLM model

Type:

str

model_params_billion

Model size in billions of parameters

Type:

int

gpu_name

Name of the GPU

Type:

str

gpu_vram_gb

GPU VRAM capacity in GB

Type:

int

required_vram_gb

VRAM required for inference

Type:

float

status

Feasibility status (RUNS or DOESN’T RUN)

Type:

calculator.Status

vram_free_percent

Percentage of VRAM remaining

Type:

float

quantization

Quantization type used

Type:

calculator.Quantization

warning

Optional warning message

Type:

str | None

model_name: str
model_params_billion: int
gpu_name: str
gpu_vram_gb: int
required_vram_gb: float
status: Status
vram_free_percent: float
quantization: Quantization
warning: str | None = None
to_dict() → dict[source]

Converts to serializable dictionary.

__init__(model_name: str, model_params_billion: int, gpu_name: str, gpu_vram_gb: int, required_vram_gb: float, status: Status, vram_free_percent: float, quantization: Quantization, warning: str | None = None) → None
class calculator.CalculationBreakdown(params_memory_gb: float, overhead_gb: float, model_with_overhead_gb: float, kv_cache_gb: float, total_vram_gb: float)[source]

Bases: object

Detailed VRAM calculation breakdown.

params_memory_gb

Memory for model parameters in GB

Type:

float

overhead_gb

Memory overhead in GB

Type:

float

model_with_overhead_gb

Parameters + overhead in GB

Type:

float

kv_cache_gb

KV cache memory in GB

Type:

float

total_vram_gb

Total VRAM required in GB

Type:

float

params_memory_gb: float
overhead_gb: float
model_with_overhead_gb: float
kv_cache_gb: float
total_vram_gb: float
to_dict() → dict[source]

Converts to serializable dictionary.

__init__(params_memory_gb: float, overhead_gb: float, model_with_overhead_gb: float, kv_cache_gb: float, total_vram_gb: float) → None
class calculator.VRAMCalculator(quantization: Quantization = Quantization.FP16, overhead_factor: float = 0.3, calculation_mode: CalculationMode = CalculationMode.CONSERVATIVE)[source]

Bases: object

VRAM calculator for LLM inference.

__init__(quantization: Quantization = Quantization.FP16, overhead_factor: float = 0.3, calculation_mode: CalculationMode = CalculationMode.CONSERVATIVE)[source]

Initialize the calculator.

Parameters:
  • quantization – Quantization type (default: FP16)

  • overhead_factor – Overhead factor (0.30 = 30%)

  • calculation_mode – Calculation mode for VRAM estimation

calculate_params_memory(params_billion: int) → float[source]

Calculate base memory for model parameters.

Formula: params_memory_gb = params_billion * bytes_per_param

Note: params_billion is in billions, and 1 billion bytes = 1 GB. So for FP16 (2 bytes/param): 70B model = 70 × 2 = 140 GB.

Parameters:

params_billion – Model size in billions of parameters

Returns:

Memory in GB
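
A short sketch of the formula, reproducing the 70B/FP16 figure from the note above (import path assumes the flat calculator module):

   from calculator import VRAMCalculator, Quantization

   calc = VRAMCalculator(quantization=Quantization.FP16)
   params_memory = calc.calculate_params_memory(70)   # 70B × 2 bytes/param ≈ 140 GB
   print(params_memory)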

calculate_overhead(params_memory_gb: float) → float[source]

Calculate memory overhead (runtime, activations, etc.).

Formula: overhead = params_memory * overhead_factor

Parameters:

params_memory_gb – Base parameter memory in GB

Returns:

Overhead in GB
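
A worked instance of the formula with the default overhead_factor of 0.3 (the 140 GB input is illustrative, continuing the FP16 70B example above):

   from calculator import VRAMCalculator, Quantization

   calc = VRAMCalculator(quantization=Quantization.FP16, overhead_factor=0.3)
   overhead = calc.calculate_overhead(140.0)   # 140 GB × 0.3 = 42 GB of runtime/activation overhead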

calculate_kv_cache(kv_cache_mb_per_token: float, context_tokens: int) → float[source]

Calculate memory required for KV cache.

Formula: kv_cache_gb = (kv_cache_mb_per_token * context_tokens * multiplier * mode_buffer) / 1024

The base KV cache is defined for FP16. For other precisions, a multiplier is applied: FP32 uses 2x, while INT8/INT4 may use less depending on the framework.

The calculation mode adds a buffer for production scenarios:

- THEORETICAL: no buffer (ideal minimum, batch=1, no padding)
- CONSERVATIVE: 10% buffer (minimal overhead)
- PRODUCTION: 25% buffer (batch>1, fragmentation, real-world serving)

Parameters:
  • kv_cache_mb_per_token – MB per token for the model (FP16 baseline)

  • context_tokens – Context size in tokens

Returns:

KV cache in GB
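
A sketch applying the formula; the 0.5 MB/token figure is an assumed illustrative value, not one taken from the model database:

   from calculator import VRAMCalculator, Quantization, CalculationMode

   calc = VRAMCalculator(quantization=Quantization.FP16,
                         calculation_mode=CalculationMode.PRODUCTION)
   # (0.5 MB/token × 8192 tokens × 1.0 FP16 multiplier × 1.25 production buffer) / 1024 = 5.0 GB
   kv_cache_gb = calc.calculate_kv_cache(kv_cache_mb_per_token=0.5, context_tokens=8192)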

calculate_total_vram(model: LLMModel, context_tokens: int) → CalculationBreakdown[source]

Calculate total VRAM required for a model with given context.

Parameters:
  • model – LLM model to evaluate

  • context_tokens – Context size in tokens

Returns:

CalculationBreakdown with calculation details
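
A sketch of a full breakdown for one model (assumes a 7B entry exists in the bundled models database and the flat module layout shown in this reference):

   from calculator import VRAMCalculator, Quantization
   from models import get_model_by_size

   calc = VRAMCalculator(quantization=Quantization.FP16)
   model = get_model_by_size(7)              # None if no 7B model is registered
   if model is not None:
       breakdown = calc.calculate_total_vram(model, context_tokens=4096)
       print(breakdown.params_memory_gb, breakdown.kv_cache_gb, breakdown.total_vram_gb)
       print(breakdown.to_dict())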

evaluate_pair(model: LLMModel, gpu: GPU, context_tokens: int, quantization: Quantization | None = None) → InferenceResult[source]

Evaluate if a model × GPU pair is viable for the given context.

Parameters:
  • model – LLM model to evaluate

  • gpu – GPU to evaluate

  • context_tokens – Context size in tokens

  • quantization – Override quantization (uses instance default if None)

Returns:

InferenceResult with status and details
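
A sketch of a single model × GPU check; the GPU name and model size are illustrative and must exist in the bundled databases:

   from calculator import VRAMCalculator, Quantization, Status
   from models import get_model_by_size
   from gpus import get_gpu_by_name

   calc = VRAMCalculator(quantization=Quantization.INT4)
   model = get_model_by_size(13)
   gpu = get_gpu_by_name("RTX 4090")         # hypothetical name; None if not in the database
   if model and gpu:
       result = calc.evaluate_pair(model, gpu, context_tokens=8192)
       print(result.status.value, result.required_vram_gb, result.vram_free_percent)
       if result.status is Status.RUNS and result.warning:
           print(result.warning)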

calculate_all_combinations(context_tokens: int, models: List[LLMModel] | None = None, gpus: List[GPU] | None = None) → List[InferenceResult][source]

Calculate feasibility for all model × GPU combinations.

Parameters:
  • context_tokens – Context size in tokens

  • models – List of models (uses all if None)

  • gpus – List of GPUs (uses all if None)

Returns:

List of InferenceResult for all combinations
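
A sketch that sweeps every registered model against every registered GPU and keeps only the viable pairs:

   from calculator import VRAMCalculator, Status

   calc = VRAMCalculator()
   results = calc.calculate_all_combinations(context_tokens=4096)
   viable = [r for r in results if r.status is Status.RUNS]
   for r in viable:
       print(f"{r.model_name} on {r.gpu_name}: {r.required_vram_gb:.1f} GB needed")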

calculator.calculate_inference(context_tokens: int) → dict[source]

Main calculation function (simplified interface).

Parameters:

context_tokens – Context size in tokens

Returns:

Dictionary with structured results
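
A minimal sketch of the simplified interface (the exact dictionary keys are not documented here, so none are assumed):

   from calculator import calculate_inference

   report = calculate_inference(context_tokens=4096)
   print(type(report), len(report))      # a dict of structured results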

class calculator.LayerOffloadResult(total_layers: int, layers_on_gpu: int, layers_on_cpu: int, gpu_vram_used: float, cpu_ram_used: float, offload_ratio: float, performance_impact: float, recommended_gpu_split: str, status: str)[source]

Bases: object

Result for optimal layer offload calculation.

total_layers

Total number of layers in the model

Type:

int

layers_on_gpu

Number of layers that fit on GPU

Type:

int

layers_on_cpu

Number of layers that must remain on CPU

Type:

int

gpu_vram_used

VRAM used for GPU layers

Type:

float

cpu_ram_used

System RAM used for CPU layers

Type:

float

offload_ratio

Ratio of layers on GPU (0.0 to 1.0)

Type:

float

performance_impact

Estimated performance impact (0-100% slower)

Type:

float

recommended_gpu_split

Recommended --gpu-layers parameter for llama.cpp

Type:

str

total_layers: int
layers_on_gpu: int
layers_on_cpu: int
gpu_vram_used: float
cpu_ram_used: float
offload_ratio: float
performance_impact: float
recommended_gpu_split: str
status: str
__init__(total_layers: int, layers_on_gpu: int, layers_on_cpu: int, gpu_vram_used: float, cpu_ram_used: float, offload_ratio: float, performance_impact: float, recommended_gpu_split: str, status: str) → None
class calculator.LayerOffloadCalculator(quantization: Quantization = Quantization.FP16, safety_margin_gb: float = 1.0, model_format: ModelFormat = ModelFormat.FP16)[source]

Bases: object

Calculates optimal layer offload configuration for hybrid GPU+CPU inference.

This calculator helps determine how many transformer layers can fit in GPU VRAM for scenarios where the full model doesn’t fit, enabling partial offloading strategies used by llama.cpp, AutoGPTQ, and other frameworks.

__init__(quantization: Quantization = Quantization.FP16, safety_margin_gb: float = 1.0, model_format: ModelFormat = ModelFormat.FP16)[source]

Initialize the layer offload calculator.

Parameters:
  • quantization – Quantization type for layer size calculation

  • safety_margin_gb – Safety margin in GB to reserve

  • model_format – Model format for overhead calculation

estimate_layer_size_gb(model: LLMModel) → float[source]

Estimate the VRAM size of a single transformer layer.

Parameters:

model – LLM model to analyze

Returns:

Estimated size of one layer in GB

Formula:

layer_size = (model_params / num_layers) * bytes_per_param * format_overhead
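
A worked instance of the formula with illustrative numbers: a 7B model with 32 layers at FP16 (2 bytes/param) and a format overhead of 1.0 gives (7 / 32) × 2 × 1.0 ≈ 0.44 GB per layer. As a sketch (assumes a 7B model exists in the database):

   from calculator import LayerOffloadCalculator, Quantization
   from models import get_model_by_size

   offload_calc = LayerOffloadCalculator(quantization=Quantization.FP16)
   model = get_model_by_size(7)
   if model is not None:
       print(offload_calc.estimate_layer_size_gb(model))   # roughly 0.4–0.5 GB per layer for a 7B FP16 model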

calculate_kv_cache_memory(model: LLMModel, context_tokens: int) → float[source]

Calculate KV cache memory requirement.

Parameters:
  • model – LLM model

  • context_tokens – Context size in tokens

Returns:

KV cache memory in GB

calculate_optimal_offload(model: LLMModel, gpu: GPU, context_tokens: int) → LayerOffloadResult[source]

Calculate optimal layer offload configuration.

Parameters:
  • model – LLM model to analyze

  • gpu – GPU to use for offload

  • context_tokens – Context size in tokens

Returns:

LayerOffloadResult with optimal configuration
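
A sketch for a model that does not fully fit on one GPU; the model size and GPU name are illustrative and must exist in the bundled databases:

   from calculator import LayerOffloadCalculator, Quantization
   from models import get_model_by_size
   from gpus import get_gpu_by_name

   offload_calc = LayerOffloadCalculator(quantization=Quantization.INT4, safety_margin_gb=1.0)
   model = get_model_by_size(70)
   gpu = get_gpu_by_name("RTX 3090")         # hypothetical name; None if not in the database
   if model and gpu:
       cfg = offload_calc.calculate_optimal_offload(model, gpu, context_tokens=4096)
       print(f"{cfg.layers_on_gpu}/{cfg.total_layers} layers on GPU ({cfg.offload_ratio:.0%})")
       print(f"llama.cpp hint: {cfg.recommended_gpu_split}")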

class calculator.PCIeConfig(generation: str, bandwidth_gb_s: float, lanes: int = 16)[source]

Bases: object

PCIe bandwidth configuration.

generation

PCIe generation (3.0, 4.0, 5.0)

Type:

str

bandwidth_gb_s

Theoretical bandwidth in GB/s (for an x16 link)

Type:

float

lanes

Number of lanes (typically x16 for GPUs)

Type:

int

generation: str
bandwidth_gb_s: float
lanes: int = 16
property effective_bandwidth_gb_s: float

Effective bandwidth accounting for protocol overhead.

__init__(generation: str, bandwidth_gb_s: float, lanes: int = 16) → None
class calculator.CPUOffloadResult(system_ram_required: float, system_ram_available: float, fits_in_ram: bool, offload_config: LayerOffloadResult, pcie_generation: str, estimated_token_speed: float, speed_vs_full_gpu: float)[source]

Bases: object

Result for CPU offload calculation.

system_ram_required

Total system RAM required in GB

Type:

float

system_ram_available

System RAM available in GB

Type:

float

fits_in_ram

Whether the model fits in system RAM

Type:

bool

offload_config

Layer offload configuration

Type:

calculator.LayerOffloadResult

pcie_generation

PCIe generation used

Type:

str

estimated_token_speed

Estimated tokens/second with offload

Type:

float

speed_vs_full_gpu

Speed ratio vs full GPU (0.0 to 1.0)

Type:

float

system_ram_required: float
system_ram_available: float
fits_in_ram: bool
offload_config: LayerOffloadResult
pcie_generation: str
estimated_token_speed: float
speed_vs_full_gpu: float
__init__(system_ram_required: float, system_ram_available: float, fits_in_ram: bool, offload_config: LayerOffloadResult, pcie_generation: str, estimated_token_speed: float, speed_vs_full_gpu: float) → None
class calculator.CPUOffloadCalculator(quantization: Quantization = Quantization.FP16, system_ram_gb: float = 32.0, pcie_generation: str = '4.0', model_format: ModelFormat = ModelFormat.FP16)[source]

Bases: object

Calculates hybrid GPU+CPU inference configurations.

This calculator helps determine:

- How much system RAM is needed for CPU offload
- The performance impact based on PCIe bandwidth
- The optimal layer distribution between GPU and CPU

__init__(quantization: Quantization = Quantization.FP16, system_ram_gb: float = 32.0, pcie_generation: str = '4.0', model_format: ModelFormat = ModelFormat.FP16)[source]

Initialize the CPU offload calculator.

Parameters:
  • quantization – Quantization type

  • system_ram_gb – Available system RAM in GB

  • pcie_generation – PCIe generation (3.0, 4.0, 5.0)

  • model_format – Model format

calculate_total_model_memory(model: LLMModel) → float[source]

Calculate total model memory requirement.

Parameters:

model – LLM model

Returns:

Total memory in GB

estimate_offload_performance(offload_result: LayerOffloadResult, model: LLMModel) → tuple[float, float][source]

Estimate performance with offload configuration.

Parameters:
  • offload_result – Layer offload configuration result

  • model – LLM model

Returns:

Tuple of (estimated_tokens_per_second, speed_ratio_vs_full_gpu)

calculate_offload(model: LLMModel, gpu: GPU, context_tokens: int) → CPUOffloadResult[source]

Calculate CPU offload configuration.

Parameters:
  • model – LLM model

  • gpu – GPU to use

  • context_tokens – Context size in tokens

Returns:

CPUOffloadResult with configuration details
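
A sketch of an end-to-end CPU offload estimate; the model size and GPU name are illustrative and must exist in the bundled databases:

   from calculator import CPUOffloadCalculator, Quantization
   from models import get_model_by_size
   from gpus import get_gpu_by_name

   cpu_calc = CPUOffloadCalculator(quantization=Quantization.INT4,
                                   system_ram_gb=64.0,
                                   pcie_generation="4.0")
   model = get_model_by_size(70)
   gpu = get_gpu_by_name("RTX 4090")         # hypothetical name; None if not in the database
   if model and gpu:
       res = cpu_calc.calculate_offload(model, gpu, context_tokens=4096)
       print(res.fits_in_ram, res.system_ram_required)
       print(res.estimated_token_speed, res.speed_vs_full_gpu)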

gpus

Database of GPUs available in the market.

Includes consumer and datacenter GPUs with their VRAM capacities.

class gpus.GPUType(value)[source]

Bases: Enum

GPU type: consumer or datacenter.

CONSUMER = 'consumer'
DATACENTER = 'datacenter'
class gpus.GPU(name: str, vram_gb: int, type: GPUType, memory_bandwidth_gb_s: int | None = None, architecture: str | None = None, pcie_gen: str = '4.0')[source]

Bases: object

Represents a GPU with its VRAM capacity.

name

GPU model name

Type:

str

vram_gb

VRAM capacity in GB

Type:

int

type

GPU type (consumer or datacenter)

Type:

gpus.GPUType

memory_bandwidth_gb_s

Memory bandwidth in GB/s (optional, for future calculations)

Type:

int | None

architecture

GPU architecture name (optional, for information)

Type:

str | None

pcie_gen

Default PCIe generation (for CPU offload calculations)

Type:

str

name: str
vram_gb: int
type: GPUType
memory_bandwidth_gb_s: int | None = None
architecture: str | None = None
pcie_gen: str = '4.0'
property vram_label: str

Returns formatted VRAM label (e.g., ‘24GB’).

__init__(name: str, vram_gb: int, type: GPUType, memory_bandwidth_gb_s: int | None = None, architecture: str | None = None, pcie_gen: str = '4.0') → None
gpus.get_gpu_by_name(name: str) → GPU | None[source]

Returns a GPU by exact name.

Parameters:

name – GPU name to search for

Returns:

GPU if found, None otherwise

gpus.get_consumer_gpus() → List[GPU][source]

Returns all consumer GPUs.

Returns:

List of consumer GPUs

gpus.get_datacenter_gpus() → List[GPU][source]

Returns all datacenter GPUs.

Returns:

List of datacenter GPUs

gpus.get_all_gpus() → List[GPU][source]

Returns all available GPUs.

Returns:

List of all GPUs

gpus.get_gpus_by_vram_min(min_vram_gb: int) → List[GPU][source]

Returns GPUs with at least the specified VRAM.

Parameters:

min_vram_gb – Minimum VRAM in GB

Returns:

List of GPUs with VRAM >= min_vram_gb
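
A short sketch of the query helpers in this module (the "H100" lookup is a hypothetical name; it returns None if that GPU is not in the database):

   from gpus import get_gpus_by_vram_min, get_gpu_by_name, GPUType

   for gpu in get_gpus_by_vram_min(24):                 # every GPU with 24 GB of VRAM or more
       print(gpu.name, gpu.vram_label, gpu.type is GPUType.DATACENTER)

   h100 = get_gpu_by_name("H100")                       # hypothetical name; None if absent
   if h100 is not None:
       print(h100.vram_gb, h100.memory_bandwidth_gb_s)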

models

Database of LLM models available in the market.

Each model contains metadata for VRAM usage calculation during inference. Includes Ollama library models for comprehensive coverage.

class models.LLMModel(name: str, params_billion: int, architecture: str, precision_default: str, kv_cache_mb_per_token: float, format: ModelFormat = ModelFormat.FP16, context_length_max: int | None = None, num_layers: int | None = None)[source]

Bases: object

Represents an LLM model with metadata for inference.

name

Model name

Type:

str

params_billion

Number of parameters in billions

Type:

int

architecture

Model architecture (e.g., “decoder-only”)

Type:

str

precision_default

Default precision (e.g., “fp16”)

Type:

str

kv_cache_mb_per_token

KV cache in MB per token (conservative FP16 estimate)

Type:

float

format

Model format (defaults to FP16 for base models)

Type:

formats.ModelFormat

context_length_max

Maximum context length in tokens (None if unlimited)

Type:

int | None

num_layers

Number of transformer layers (for layer offload calculations)

Type:

int | None

name: str
params_billion: int
architecture: str
precision_default: str
kv_cache_mb_per_token: float
format: ModelFormat = 'fp16'
context_length_max: int | None = None
num_layers: int | None = None
property size_label: str

Returns simplified size label (e.g., ‘7B’, ‘13B’).

property estimated_layers: int

Estimate number of layers based on model size if not specified.

Uses typical layer counts for decoder-only models:

- 7B models: ~32 layers
- 13B models: ~40 layers
- 30B+ models: ~60+ layers

__init__(name: str, params_billion: int, architecture: str, precision_default: str, kv_cache_mb_per_token: float, format: ModelFormat = ModelFormat.FP16, context_length_max: int | None = None, num_layers: int | None = None) → None
models.get_model_by_size(size_billion: float) → LLMModel | None[source]

Returns a model by its size in billions of parameters.

Parameters:

size_billion – Model size in billions of parameters

Returns:

LLMModel if found, None otherwise

models.get_all_models() → List[LLMModel][source]

Returns all available models.

Returns:

List of all LLMModel instances

models.get_models_by_size_range(min_billion: int, max_billion: int) → List[LLMModel][source]

Returns models within a size range.

Parameters:
  • min_billion – Minimum size in billions

  • max_billion – Maximum size in billions

Returns:

List of LLMModel instances within the range

models.search_models_by_name(query: str) → List[LLMModel][source]

Search models by name (case-insensitive partial match).

Parameters:

query – Search query string

Returns:

List of matching LLMModel instances
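
A short sketch of the lookup helpers in this module (the "llama" query is illustrative; results depend on what the bundled database contains):

   from models import get_models_by_size_range, search_models_by_name

   mid_size = get_models_by_size_range(7, 13)           # models between 7B and 13B parameters
   for m in search_models_by_name("llama"):             # case-insensitive partial match
       print(m.name, m.size_label, m.estimated_layers)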

main

CLI for LLM local inference viability calculator.

Allows quickly discovering which models run on which GPU for a given context size.

class main.Colors[source]

Bases: object

ANSI color codes for terminal output.

RESET = '\x1b[0m'
BOLD = '\x1b[1m'
DIM = '\x1b[2m'
BLACK = '\x1b[30m'
RED = '\x1b[31m'
GREEN = '\x1b[32m'
YELLOW = '\x1b[33m'
BLUE = '\x1b[34m'
MAGENTA = '\x1b[35m'
CYAN = '\x1b[36m'
WHITE = '\x1b[37m'
BG_RED = '\x1b[41m'
BG_GREEN = '\x1b[42m'
BG_YELLOW = '\x1b[43m'
BG_BLUE = '\x1b[44m'
static ok(text: str) → str[source]

Green text for success/OK messages.

static warning(text: str) → str[source]

Yellow text for warnings.

static warn(text: str) → str[source]

Yellow text for warnings (alias).

static error(text: str) → str[source]

Red text for errors.

static info(text: str) → str[source]

Blue text for info.

static dim(text: str) → str[source]

Dim text for less emphasis.

static bold(text: str) → str[source]

Bold text.

main.print_table(results: List[InferenceResult], group_by_gpu: bool = False, show_only_runs: bool = False)[source]

Prints results in ASCII table format.

Parameters:
  • results – List of results

  • group_by_gpu – Group by GPU instead of model

  • show_only_runs – Show only combinations that run

main.print_summary_by_model(results: List[InferenceResult])[source]

Prints summary grouped by model size.

Shows, for each model, which GPUs support it.

main.print_layer_offload_result(result: LayerOffloadResult, model: LLMModel, gpu: GPU)[source]

Prints layer offload configuration result.

Parameters:
  • result – Layer offload calculation result

  • model – LLM model

  • gpu – GPU being used

main.print_cpu_offload_result(result: CPUOffloadResult, model: LLMModel)[source]

Prints CPU offload configuration result.

Parameters:
  • result – CPU offload calculation result

  • model – LLM model

main.print_multi_gpu_result(result, model: LLMModel)[source]

Prints multi-GPU configuration result.

Parameters:
  • result – MultiGPUResult from MultiGPUCalculator

  • model – LLM model

main.print_summary_by_gpu(results: List[InferenceResult])[source]

Prints summary grouped by GPU.

Shows, for each GPU, which models it supports.

main.export_csv(results: List[InferenceResult], filepath: str)[source]

Exports results to CSV.

main.export_json(results: List[InferenceResult], filepath: str, context_tokens: int, quantization: Quantization)[source]

Exports results to JSON.
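
A sketch combining the calculator with the export helpers; the output file names are illustrative:

   from calculator import VRAMCalculator, Quantization
   from main import export_csv, export_json

   calc = VRAMCalculator(quantization=Quantization.FP16)
   results = calc.calculate_all_combinations(context_tokens=4096)
   export_csv(results, "results.csv")
   export_json(results, "results.json", context_tokens=4096, quantization=Quantization.FP16)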

main.list_models()[source]

Prints all available models.

main.print_model_vram_breakdown(model: LLMModel, context_tokens: int, quantization: Quantization, calculation_mode: CalculationMode = CalculationMode.CONSERVATIVE)[source]

Prints detailed VRAM breakdown for a specific model.

Parameters:
  • model – LLM model to analyze

  • context_tokens – Context size in tokens

  • quantization – Quantization type

  • calculation_mode – Calculation mode

main.estimate_kv_cache(params_billion: int) → float[source]

Estimate KV cache per token based on model size.

Uses a conservative formula based on a decoder-only architecture.

Parameters:

params_billion – Model size in billions of parameters

Returns:

Estimated KV cache in MB per token (FP16)

main.parse_args()[source]

Parse CLI arguments.

main.main()[source]

Main CLI function.