Quantization is the process of converting floating point weights and activations to lower bitwidth tensors by multiplying the floating point values by a scale factor and rounding the results to whole numbers. Dynamic quantization determines the scale factor for activations dynamically, based on the data range observed at runtime. We support W8A8 (quantizing weights and activations into 8 bits) dynamic quantization by leveraging torch's `X86InductorQuantizer`.
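
For intuition, below is a minimal sketch of the scale-and-round arithmetic described above. It is illustrative only and not the library's internal implementation; the names `x`, `scale`, and `x_int8` are placeholders.

```python
import torch

# Illustrative symmetric int8 quantization of a float tensor.
# For dynamic quantization, the scale factor is derived from the
# data range observed at runtime.
x = torch.randn(4)
scale = 127.0 / x.abs().max()                       # scale factor from the observed range
x_int8 = torch.clamp(torch.round(x * scale), -128, 127).to(torch.int8)
x_dequant = x_int8.float() / scale                  # approximate reconstruction of x
```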
There are four steps to perform W8A8 dynamic quantization: `export`, `prepare`, `convert` and `compile`.
```python
import torch
from neural_compressor.torch.export import export
from neural_compressor.torch.quantization import DynamicQuantConfig, prepare, convert

# Prepare the float model and example inputs for exporting the model
model = UserFloatModel()
example_inputs = ...

# Export the eager model into an FX graph model
exported_model = export(model=model, example_inputs=example_inputs)

# Quantize the model
quant_config = DynamicQuantConfig()
prepared_model = prepare(exported_model, quant_config=quant_config)
q_model = convert(prepared_model)

# Compile the quantized model and replace the Q/DQ pattern with the Q-operator
from torch._inductor import config
config.freezing = True
opt_model = torch.compile(q_model)
```
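
After compilation, the optimized model can be called like the original module. A minimal sketch, assuming `example_inputs` is a tuple of tensors:

```python
# The first call triggers Inductor compilation of the quantized graph;
# subsequent calls reuse the compiled artifact.
with torch.no_grad():
    output = opt_model(*example_inputs)
```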
Note: The `set_local` API of `DynamicQuantConfig` will be supported after the torch 2.4 release. An example will be added later.