Local Inference Calculator
Welcome to the Local Inference Calculator documentation - a capacity planning tool for local Large Language Model (LLM) inference.
This tool allows you to quickly estimate which language models can run on specific GPUs, considering context size and model precision/quantization.
Contents:
- Installation
- User Guide
- Glossary
- API Reference
- Examples
- Example 1: List Available Models
- Example 2: How Much VRAM Do I Need for a 70B Model?
- Example 3: Small Model with Fractional Size (0.6B)
- Example 4: Can I Run a 70B Model with INT4 on RTX 4090?
- Example 5: Which GPU to Buy for a 34B Model?
- Example 6: Compare All Quantizations
- Example 7: Check Google Colab Compatibility
- Example 8: Export Results
- Programmatic Usage
Overview
The Local Inference Calculator was developed to answer a simple question:
“With this GPU and this context size, which LLMs can I run?”
The tool considers:
- Model parameters: base memory required to store the weights
- Overhead: additional memory for the runtime, activations, etc.
- KV cache: memory for the attention cache during inference
- Precision/Quantization: FP32, FP16, INT8, or INT4
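The factors above can be combined into a rough estimate: weights scale with parameter count and bytes per parameter, the KV cache scales with context length, and a fixed fraction covers runtime overhead. The sketch below illustrates the idea only; the function name, default constants, and overhead fraction are illustrative assumptions, not the tool's actual API or formula.

```python
# Illustrative VRAM estimate sketch; constants and names are assumptions,
# not the Local Inference Calculator's real implementation.

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def estimate_vram_gb(params_billion, precision="FP16",
                     context_tokens=4096, layers=80, kv_heads=8,
                     head_dim=128, overhead_frac=0.10):
    """Rough VRAM estimate: weights + KV cache + an overhead fraction."""
    # Weights: parameter count x bytes per parameter at the chosen precision.
    weights_gb = params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9
    # KV cache: 2 tensors (K and V) x layers x kv_heads x head_dim
    # x context length x 2 bytes, assuming an FP16 cache.
    kv_gb = 2 * layers * kv_heads * head_dim * context_tokens * 2 / 1e9
    # Overhead: runtime buffers, activations, fragmentation.
    return (weights_gb + kv_gb) * (1 + overhead_frac)

# Example: a 70B model quantized to INT4 with a 4096-token context.
print(round(estimate_vram_gb(70, "INT4"), 1))  # → 40.0
```

With these assumed defaults, a 70B model at INT4 needs roughly 40 GB, which is why such models typically do not fit on a single 24 GB consumer GPU at this context length.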
Features
- Support for models from 7B to 180B parameters
- Database with 38 GPUs (consumer + datacenter)
- Conservative calculations to ensure real-world viability
- Export of results to JSON and CSV
- Command-line interface (CLI)
Other Languages
Português (Brazilian Portuguese): run `make html LANG=pt_BR` and open `docs/_build/pt_BR/html/index.html`.