quantization
Python module providing APIs to quantize graph tensors.
This package includes a comprehensive set of tools for working with quantized models in MAX Graph. It defines supported quantization encodings, configuration parameters that control the quantization process, and block parameter specifications for different quantization formats.
The module supports various quantization formats including 4-bit, 5-bit, and 6-bit precision with different encoding schemes. It also provides support for GGUF-compatible formats for interoperability with other frameworks.
BlockParameters
class max.graph.quantization.BlockParameters(elements_per_block, block_size)
Parameters describing the structure of a quantization block.
Block-based quantization stores elements in fixed-size blocks. Each block contains a specific number of elements in a compressed format.
block_size
block_size: int
elements_per_block
elements_per_block: int
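Block parameters fully determine the storage cost of a quantized tensor. The sketch below shows the arithmetic; the `(elements_per_block, block_size)` pair used for Q4_0 follows the GGUF/GGML block layout and is stated here as an illustrative assumption, not a constant read from MAX.

```python
def compression_ratio(elements_per_block: int, block_size: int,
                      unquantized_bytes_per_element: int = 4) -> float:
    """Ratio of float32 storage to quantized block storage."""
    original_bytes = elements_per_block * unquantized_bytes_per_element
    return original_bytes / block_size

# Q4_0 (assumed GGUF layout): 32 elements packed into an 18-byte block
# (a 2-byte scale followed by 16 bytes of packed 4-bit values), so
# 128 float32 bytes shrink to 18 bytes.
print(round(compression_ratio(32, 18), 1))  # 7.1
```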
QuantizationConfig
class max.graph.quantization.QuantizationConfig(quant_method, bits, group_size, desc_act=False, sym=False)
Configuration for specifying quantization parameters that affect inference.
These parameters control how tensor values are quantized, including the method, bit precision, grouping, and other characteristics that affect the trade-off between model size, inference speed, and accuracy.
bits
bits: int
desc_act
desc_act: bool = False
group_size
group_size: int
quant_method
quant_method: str
sym
sym: bool = False
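The sketch below shows how these fields interact, using a plain dataclass that mirrors the documented constructor signature. The GPTQ-style values are illustrative assumptions, not defaults taken from MAX.

```python
from dataclasses import dataclass

@dataclass
class QuantConfigSketch:
    """Pure-Python mirror of the documented QuantizationConfig fields."""
    quant_method: str
    bits: int
    group_size: int
    desc_act: bool = False
    sym: bool = False

# A typical GPTQ-style configuration (illustrative values).
cfg = QuantConfigSketch(quant_method="gptq", bits=4, group_size=128)

# Each group of `group_size` consecutive weights shares one scale (and,
# for asymmetric quantization, one zero point), so a 4096-wide weight
# row splits into 4096 / 128 = 32 independently scaled groups.
groups_per_row = 4096 // cfg.group_size
print(groups_per_row)  # 32
```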
QuantizationEncoding
class max.graph.quantization.QuantizationEncoding(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Quantization encodings supported by MAX Graph.
Quantization reduces the precision of neural network weights to decrease memory usage and potentially improve inference speed. Each encoding represents a different compression method with specific trade-offs between model size, accuracy, and computational efficiency. These encodings are commonly used with pre-quantized model checkpoints (especially GGUF format) or can be applied during weight allocation.
The following example shows how to create a quantized weight using the Q4_K encoding:
from max.dtype import DType
from max.graph import DeviceRef, Weight
from max.graph.quantization import QuantizationEncoding

encoding = QuantizationEncoding.Q4_K
quantized_weight = Weight(
    name="linear.weight",
    dtype=DType.uint8,
    shape=[4096, 4096],
    device=DeviceRef.GPU(0),
    quantization_encoding=encoding,
)

MAX supports several quantization formats optimized for different use cases.
Q4_0
Basic 4-bit quantization with 32 elements per block.
Q4_K
4-bit K-quantization with 256 elements per block.
Q5_K
5-bit K-quantization with 256 elements per block.
Q6_K
6-bit K-quantization with 256 elements per block.
GPTQ
Group-wise Post-Training Quantization for large language models.
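The effective bits per weight for the block-based encodings above follows directly from each block layout. The element counts match the descriptions in this list; the byte sizes are the GGUF/GGML block sizes for these encodings and are stated here as assumptions for illustration.

```python
# (elements_per_block, encoded block size in bytes); byte sizes are
# assumed GGUF/GGML layouts, including per-block scale metadata.
BLOCKS = {
    "Q4_0": (32, 18),
    "Q4_K": (256, 144),
    "Q5_K": (256, 176),
    "Q6_K": (256, 210),
}

bits_per_weight = {
    name: nbytes * 8 / elems for name, (elems, nbytes) in BLOCKS.items()
}
for name, bpw in bits_per_weight.items():
    print(f"{name}: {bpw:.2f} bits/weight")
```

Note that the effective rate is slightly above the nominal bit width (e.g. 4.50 rather than 4 bits for Q4_K) because each block also stores scale metadata.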
block_parameters
property block_parameters: BlockParameters
Gets the block parameters for this quantization encoding.
Returns:
The parameters describing how elements are organized and encoded in blocks for this quantization encoding.
Return type:
BlockParameters
block_size
property block_size: int
Number of bytes in the encoded representation of a block.
All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of bytes that result from encoding a single block.
Returns:
Size in bytes of each encoded quantization block.
Return type:
int
elements_per_block
property elements_per_block: int
Number of elements per block.
All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of elements gathered into a block.
Returns:
Number of original tensor elements in each quantized block.
Return type:
int
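Together, `elements_per_block` and `block_size` translate a logical weight shape into the packed uint8 buffer that actually holds the quantized data. The sketch below uses Q4_K's documented 256 elements per block together with an assumed GGUF-style 144-byte encoded block size.

```python
# Assumed Q4_K layout: 256 elements -> one 144-byte encoded block.
ELEMENTS_PER_BLOCK = 256
BLOCK_SIZE_BYTES = 144

def packed_row_bytes(row_elements: int) -> int:
    """uint8 bytes needed to store one row of a block-quantized weight."""
    assert row_elements % ELEMENTS_PER_BLOCK == 0, "rows must be block-aligned"
    return (row_elements // ELEMENTS_PER_BLOCK) * BLOCK_SIZE_BYTES

# A [4096, 4096] weight quantized with Q4_K: each 4096-element row becomes
# 16 blocks of 144 bytes, i.e. 2304 uint8 values per row.
print(packed_row_bytes(4096))  # 2304
```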
is_gguf
property is_gguf: bool
Checks whether this quantization encoding is compatible with the GGUF format.
GGUF is a file format for storing large language models and their quantized weights.
Returns:
True if this encoding is compatible with GGUF, False otherwise.
Return type:
bool
name
property name: str
Gets the lowercase name of the quantization encoding.
Returns:
Lowercase string representation of the quantization encoding.
Return type:
str