V2 Documentation¶
V2 Architecture¶
MRT V2 is a superset of MRT that supports additional features such as zero
point quantization, channel-slice and groupwise quantization for
convolution. The core implementation is in the module
tfm_types.py, whose architecture is as follows:
Quantizer¶
| | Parameter (Uniform Symmetric Quantizer) | Input (Uniform Symmetric Quantizer) | Parameter (Uniform Affine Quantizer) | Input (Uniform Affine Quantizer) |
|---|---|---|---|---|
| Scale | \(sc_{w} = \frac{2^{PREC-1}-1}{\max{\lvert Wr \rvert}}\) | \(sc_{x} = \frac{2^{PREC-1}-1}{\max{\lvert Xr \rvert}}\) | \(sc_{w} = \frac{2^{PREC}-1}{\max{Wr} - \min{Wr}}\) | \(sc_{x} = \frac{2^{PREC}-1}{\max{Xr} - \min{Xr}}\) |
| Zero Point | | | \(zp_{wr} = \min Wr\) | \(zp_{xe} = \text{round}\Big(\min Xr \cdot sc_{xe}\Big) \\ rzp_{x} = \text{round}\Big(\min Xr\Big)\) |
| Quantization | \(Wq = \text{round}\Big(sc_{w} \cdot Wr\Big)\) | \(frac, exp = \text{cvm_float}\bigg(\frac{sc_{x}}{sc_{xe}}\bigg) \\ Xq = \text{realize}(Xe, frac, exp)\) | \(Wq = \text{round}\Big[sc_{w} (Wr - zp_{wr})\Big]\) | \(frac, exp = \text{cvm_float}\bigg(\frac{sc_{x}}{sc_{xe}}\bigg) \\ Xq = \text{realize}(Xe - zp_{xe}, frac, exp)\) |
| Requantization | \(Wr = \frac{Wq}{sc_{w}}\) | \(Xr = \frac{Xq}{sc_{x}}\) | \(Wr = \frac{Wq}{sc_{w}} + zp_{wr}\) | \(Xr = \frac{Xq}{sc_{x}} + rzp_{x}\) |
Variables whose names end with ‘q’ denote integer quantized operators, those ending with ‘r’ denote floating point operators, and those ending with ‘e’ denote integer expanded operators.
Re-quantization into expansion is elaborated in the operator expansion sections: see NN Operator Expansion, Broadcast Operator Expansion, Elemwise Operator Expansion and Transform Operator Expansion.
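The parameter columns of the table can be illustrated with a small NumPy sketch. It follows the scale, quantization and requantization formulas for weights only; the function names are hypothetical and are not taken from tfm_types.py, and the input path (cvm_float and realize) is not modelled.

```python
import numpy as np

def quantize_symmetric(wr, prec):
    # sc_w = (2^(PREC-1) - 1) / max|Wr|, Wq = round(sc_w * Wr)
    sc = (2 ** (prec - 1) - 1) / np.abs(wr).max()
    return np.round(sc * wr).astype(np.int64), sc

def quantize_affine(wr, prec):
    # sc_w = (2^PREC - 1) / (max Wr - min Wr), zp_wr = min Wr,
    # Wq = round(sc_w * (Wr - zp_wr))
    sc = (2 ** prec - 1) / (wr.max() - wr.min())
    zp = wr.min()
    return np.round(sc * (wr - zp)).astype(np.int64), sc, zp

wr = np.array([-0.5, 0.1, 0.25, 1.5])

wq_s, sc_s = quantize_symmetric(wr, prec=8)
wq_a, sc_a, zp_a = quantize_affine(wr, prec=8)

# Requantization back to the real domain
wr_s = wq_s / sc_s            # Wr = Wq / sc_w
wr_a = wq_a / sc_a + zp_a     # Wr = Wq / sc_w + zp_wr

print("symmetric:", wq_s, "max error", np.abs(wr_s - wr).max())
print("affine:   ", wq_a, "max error", np.abs(wr_a - wr).max())
```

The symmetric quantizer spends its range on \([-\max\lvert Wr\rvert, \max\lvert Wr\rvert]\), while the affine quantizer spends it on \([\min Wr, \max Wr]\), which is why the affine form tends to help when the data is strongly one-sided.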
Granularity¶
MRT supports both layer-wise and channel-wise quantization. Channel-wise quantization is implemented by graph-level channel split and channel merge.
Channel Split¶
To balance precision against the number of calculation operations, MRT supports
quantization with respect to channel features. slice is used to split
the channels in the MRT rewrite process:
If \(X\) has channel-wise features and \(W\) has layer-wise features, or vice versa, \(W\) (or \(X\)) is also split so that it stays compatible with \(X\) (or \(W\)).
Take Convolution as an example. For simplicity, only point-wise
convolution (i.e. num_group=1) and only uniform symmetric
quantization are considered here. Layer-wise
Convolution can be rewritten as:
Channel Merge¶
Merge the channel symbol components into the equivalent symbol.
For operators like pad, relu, Pooling, merge as follows:
For operators like Convolution (num_group=1), merge as follows:
For operators like Convolution (num_group>1), the channel slicing
is performed for each output channel, and the results are concatenated
along the output channel axis.
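As a rough illustration of split and merge for the num_group=1 case, the NumPy sketch below slices \(X\) and \(W\) along the input channel axis, quantizes each slice symmetrically with its own scale, and merges the per-slice results by summation. It is a sketch under assumptions: the helper names are hypothetical, a 1x1 kernel is used, and the merge is done in the real domain, whereas MRT merges at graph level on integer symbols.

```python
import numpy as np

def sym_quantize(data, prec=8):
    # Uniform symmetric quantization: returns integer data and its scale
    sc = (2 ** (prec - 1) - 1) / np.abs(data).max()
    return np.round(sc * data).astype(np.int64), sc

def pointwise_conv(x, w):
    # x: (N, C, H, W), w: (O, C); a 1x1 convolution with num_group=1
    return np.einsum("nchw,oc->nohw", x, w)

def channel_split_conv(x, w, step=1, prec=8):
    # Split: slice X and W along the input channel axis, one scale per slice.
    # Merge: sum the per-slice outputs (here after dividing out the scales).
    out = 0.0
    for start in range(0, x.shape[1], step):
        xq, sc_x = sym_quantize(x[:, start:start + step], prec)
        wq, sc_w = sym_quantize(w[:, start:start + step], prec)
        ye = pointwise_conv(xq, wq)        # integer result, scale sc_x * sc_w
        out = out + ye / (sc_x * sc_w)
    return out

rng = np.random.default_rng(0)
# Input channels with very different value ranges
ranges = np.array([0.01, 0.1, 1.0, 10.0]).reshape(1, 4, 1, 1)
x = rng.normal(size=(2, 4, 3, 3)) * ranges
w = rng.normal(size=(5, 4))

y_ref = pointwise_conv(x, w)

# Layer-wise: one scale for all of X and one for all of W
xq, sc_x = sym_quantize(x)
wq, sc_w = sym_quantize(w)
y_layer = pointwise_conv(xq, wq) / (sc_x * sc_w)

y_channel = channel_split_conv(x, w, step=1)

err_layer = np.abs(y_layer - y_ref).max()
err_channel = np.abs(y_channel - y_ref).max()
print("layer-wise error:  ", err_layer)
print("channel-wise error:", err_channel)
```

The cost of the finer granularity is one convolution per slice plus the merge, which is the precision versus operation count trade-off described above.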
NN Operator Expansion¶
Convolution¶
Limitations
Only 2-D case is considered
num_group is asserted to be 1
bias is fused in MRT rewrite
Inputs
Input data \(X\), of shape \((N,C,H,W)\)
Kernel weight \(W\), of shape \((O,C,KH,KW)\)
Attributes
\(\text{padding} = (PH,PW)\)
\(\text{stride} = (SH,SW)\)
\(\text{dilation} = (DH,DW)\)
Real Formalization
Note that if num_group is not 1, the convolution generalizes to
Groupwise Convolution.
Specifically, suppose kernel weight \(W\) is of shape \((O,IC,KH,KW)\) and input data \(X\) is of shape \((N,C,H,W)\).
For simplicity, the notation of groupwise convolution is not included here.
Given \(Xe\) and \(We\), MRT quantizes them into \(Xq\) and \(Wq\) respectively.
Expansion Formalization 1: Symmetric Quantized X and W
where the scale of \(Ye\) is \(sc_{x} \ sc_{w}\).
Expansion Formalization 2: Zero Point Quantized X and Symmetric Quantized W
where the scale of \(Ye1\) is \(sc_{x} \ sc_{w}\) and the scale of \(Ye2\) is \(sc_{w}\). Using quantize_scale, MRT quantizes them into \(Yq1\) and \(Yq2\) respectively, from which the final expansion is obtained.
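The two scales can be traced back to the requantization rules \(Xr = \frac{Xq}{sc_{x}} + rzp_{x}\) and \(Wr = \frac{Wq}{sc_{w}}\). The derivation below is a sketch of one decomposition that is consistent with those scales; it assumes no padding (or padding handled by a separate pad operator), and \(\mathbf{1}\), an all-ones tensor with the shape of \(X\), is notation introduced here for illustration.

\[
\begin{aligned}
Yr &= \text{Convolution}(Xr, Wr) \\
   &= \text{Convolution}\Big(\frac{Xq}{sc_{x}} + rzp_{x} \cdot \mathbf{1},\ \frac{Wq}{sc_{w}}\Big) \\
   &= \frac{1}{sc_{x} \ sc_{w}} \underbrace{\text{Convolution}(Xq, Wq)}_{Ye1}
    + \frac{1}{sc_{w}} \underbrace{rzp_{x} \cdot \text{Convolution}(\mathbf{1}, Wq)}_{Ye2}
\end{aligned}
\]

Because the two terms carry different scales, they cannot be added as integers directly, which is why both are first brought to a common scale.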
Expansion Formalization 3: Symmetric Quantized X and Zero Point Quantized W
where the scale of \(Ye1\) is \(sc_{x} \ sc_{w}\) and the scale of \(Ye2\) is \(sc_{x}\). Using quantize_scale, MRT quantizes them into \(Yq1\) and \(Yq2\) respectively, from which the final expansion is obtained.
Expansion Formalization 4: Zero Point Quantized X and W (Deprecated)
Ye = Convolution(Xq, Wq, **attrs) + wzp * Convolution(Xq, W1, **attrs) + C2 + C3
infer_prec1 = get_bit_cnt(C*KH*KW) + xprec + wprec + 2
infer_prec2 = get_bit_cnt(abs(wzp)*C*KH*KW) + xprec + 1
infer_prec3 = get_bit_cnt(abs(xzp)*C*KH*KW) + wprec + 1
infer_prec4 = get_bit_cnt(abs(wzp)*abs(xzp)*C*KH*KW)
infer_prec = max(infer_prec1, infer_prec2, infer_prec3, infer_prec4) + 2
pad¶
Limitations
Only constant mode is supported
\(\text{constant_value} = 0\)
Only padding of the \(H\) and \(W\) dimensions is supported
Inputs
Input data \(X\), of shape \((N,C,H,W)\)
Attributes
\(\text{pad_width} = (0,0,0,0,PH_1,PH_2,PW_1,PW_2)\)
Real Formalization
Expansion Scale
Expansion Formalization
Ye = pad(Xe, **attrs)
relu¶
Inputs
Input data \(X\), of shape \((M_0,M_1,...,M_{N-1})\)
Real Formalization
Expansion Scale
Expansion Formalization
Ye = relu(Xe)
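The reason relu can be applied directly to the expanded integer input is that it commutes with multiplication by a positive scale, so the output keeps the scale of the input. A minimal NumPy check, where the scale value is an arbitrary assumption:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

sc_x = 12.7                                  # hypothetical positive scale
xr = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])
xe = np.round(sc_x * xr).astype(np.int64)    # expanded integer input

ye = relu(xe)                  # integer relu, Ye = relu(Xe)
yr_from_int = ye / sc_x        # dequantize with the unchanged scale
yr_direct = relu(xe / sc_x)    # relu applied after dequantization

print(ye, np.allclose(yr_from_int, yr_direct))
```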
Pooling¶
Limitations
Only 2-D case is considered
avg pooling will be rewritten into Convolution or broadcast_mul
Only max pooling will be considered
Inputs
Input data \(X\), of shape \((N,C,H,W)\)
Attributes
\(\text{stride} = (SH,SW)\)
\(\text{kernel} = (KH, KW)\)
\(\text{padding} = (PH, PW)\)
Real Formalization
Padding beforehand
Xe = pad(Xe, mode="constant", pad_width=(0,0,0,0,PH,PH,PW,PW), constant_value=INT_MIN)
Expansion Scale
Expansion Formalization
Ye = Pooling(Xe, stride=stride, kernel=kernel)
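The padding value matters for max pooling on integer data: padding with zero would override negative inputs at the border, whereas INT_MIN never wins a maximum. The sketch below is a plain NumPy stand-in for the pad plus Pooling pair; the function name and the choice of int32 for INT_MIN are assumptions.

```python
import numpy as np

INT_MIN = int(np.iinfo(np.int32).min)

def max_pool2d(xe, kernel, stride, padding):
    kh, kw = kernel
    sh, sw = stride
    ph, pw = padding
    # Padding beforehand, only on the H and W axes, with INT_MIN
    xe = np.pad(xe, ((0, 0), (0, 0), (ph, ph), (pw, pw)),
                mode="constant", constant_values=INT_MIN)
    n, c, h, w = xe.shape
    oh = (h - kh) // sh + 1
    ow = (w - kw) // sw + 1
    ye = np.empty((n, c, oh, ow), dtype=xe.dtype)
    for i in range(oh):
        for j in range(ow):
            window = xe[:, :, i * sh:i * sh + kh, j * sw:j * sw + kw]
            ye[:, :, i, j] = window.max(axis=(2, 3))
    return ye

# All-negative input: zero padding would turn every border output into 0
xe = np.array([[[[-5, -3], [-7, -1]]]], dtype=np.int32)
ye = max_pool2d(xe, kernel=(2, 2), stride=(1, 1), padding=(1, 1))
print(ye[0, 0])
```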
FullyConnected¶
Limitations
The input only supports layer-wise quantization
bias is fused in MRT rewrite
Inputs
Input data \(X\), of shape \((N,K)\)
Weight \(W\), of shape \((M, K)\)
Real Formalization
Expansion Scale
Expansion Formalization 1: Symmetric Quantized X and W
Ye = FullyConnected(Xq, Wq)
infer_prec = get_bit_cnt(K) + xprec + wprec
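The precision rule can be sanity-checked against the worst case of a dot product of length \(K\). The sketch below assumes that get_bit_cnt returns the number of bits needed to write a positive integer in binary; that definition is an assumption made for illustration and may differ from the real helper.

```python
def get_bit_cnt(cnt):
    # Assumed definition: bit length of a positive integer
    assert isinstance(cnt, int) and cnt > 0
    return cnt.bit_length()

def infer_prec_fc(k, xprec, wprec):
    # Rule from Expansion Formalization 1
    return get_bit_cnt(k) + xprec + wprec

def worst_case_bits(k, xprec, wprec):
    # Signed bits needed for the largest |sum| of K products of
    # symmetric xprec-bit and wprec-bit integers
    max_abs = k * (2 ** (xprec - 1) - 1) * (2 ** (wprec - 1) - 1)
    return max_abs.bit_length() + 1          # +1 for the sign bit

k, xprec, wprec = 512, 8, 8
print(infer_prec_fc(k, xprec, wprec), worst_case_bits(k, xprec, wprec))
```

Under this assumed definition the inferred precision is a conservative upper bound, so the accumulated result always fits.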
Expansion Formalization 2: Zero Point Quantized X and Symmetric Quantized W
Ye = FullyConnected(Xq, Wq) + C
infer_prec1 = get_bit_cnt(K) + xprec + wprec + 1
infer_prec2 = get_bit_cnt(abs(C))
infer_prec = max(infer_prec1, infer_prec2) + 1
Expansion Formalization 3: Symmetric Quantized X and Zero Point Quantized W
Ye = FullyConnected(Xq, Wq) + C * sum(Xq, axis=1, keep_dims=True)
infer_prec1 = get_bit_cnt(K) + xprec + wprec + 1
infer_prec2 = get_bit_cnt(abs(C)*K) + xprec
infer_prec = max(infer_prec1, infer_prec2) + 1
Expansion Formalization 4: Zero Point Quantized X and W
Ye = FullyConnected(Xq, Wq) + C1 * sum(Xq, axis=1, keep_dims=True) + C2 + C3
infer_prec1 = get_bit_cnt(K) + xprec + wprec + 2
infer_prec2 = get_bit_cnt(abs(C1)*K) + xprec + 1
infer_prec3 = get_bit_cnt(abs(max(C2)))
infer_prec4 = get_bit_cnt(abs(C3))
infer_prec = max(infer_prec1, infer_prec2, infer_prec3, infer_prec4) + 2
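The four-term form of Expansion Formalization 4 follows from multiplying out \((\frac{Xq}{sc_{x}} + zp_{x})(\frac{Wq}{sc_{w}} + zp_{w})\). The NumPy sketch below checks that identity numerically. The meanings given to C1, C2 and C3 are derived here rather than taken from the source, and the zero points and constants are kept as floats, whereas an integer implementation would round them.

```python
import numpy as np

def affine_quantize(data, prec=8):
    sc = (2 ** prec - 1) / (data.max() - data.min())
    zp = data.min()
    return np.round(sc * (data - zp)), sc, zp

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 4.0, size=(3, 16))      # (N, K)
w = rng.uniform(-1.0, 2.0, size=(5, 16))     # (M, K)
K = x.shape[1]

xq, sc_x, zp_x = affine_quantize(x)
wq, sc_w, zp_w = affine_quantize(w)

# Reference: FullyConnected on the dequantized operands
x_deq = xq / sc_x + zp_x
w_deq = wq / sc_w + zp_w
y_deq = x_deq @ w_deq.T

# Assumed constants, all expressed at the output scale sc_x * sc_w
c1 = sc_w * zp_w                             # multiplies the row sums of Xq
c2 = sc_x * zp_x * wq.sum(axis=1)            # shape (M,), one per output unit
c3 = K * sc_x * zp_x * sc_w * zp_w           # scalar

ye = xq @ wq.T + c1 * xq.sum(axis=1, keepdims=True) + c2 + c3

print(np.allclose(ye / (sc_x * sc_w), y_deq))
```

Setting the weight zero point to 0 removes the C1 and C3 terms and leaves the shape of Formalization 2; setting the input zero point to 0 removes C2 and C3 and leaves the shape of Formalization 3.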
Broadcast Operator Expansion¶
broadcast_add¶
Uses Quantize Scale.
Elemwise Operator Expansion¶
elemwise_add¶
Uses Quantize Scale.
add_n¶
Uses Quantize Scale.
Transform Operator Expansion¶
concat¶
Uses Quantize Scale.
flatten¶
Inputs
Input data \(X\), of shape \((M_0,M_1,...,M_{N-1})\)
Real Formalization
Expansion Scale
Expansion Formalization
Ye = flatten(Xe)
Generalized Expansion Function¶
Quantize Scale¶
Limitations
All inputs support only symmetric quantization
Expansion Scale
Expansion Formalization
Ye = quantize_scale(**Xqs, **attrs)
infer_prec = max(xprecs) if op_name == "Concat" else max(xprecs)+1
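The idea behind a shared-scale expansion can be sketched for an addition of two symmetrically quantized inputs: both are rescaled to one common scale and then added as integers, which costs one extra bit. The helper names are hypothetical, and taking the smallest input scale as the common scale is an assumption of this sketch, not a statement about how quantize_scale chooses it.

```python
import numpy as np

def requantize(xq, sc_in, sc_out):
    # Rescale an integer tensor from scale sc_in to scale sc_out
    return np.round(xq * (sc_out / sc_in)).astype(np.int64)

def quantize_scale_add(xqs, scales):
    # Bring every input to the smallest scale, then add as integers
    sc_out = min(scales)
    ye = sum(requantize(xq, sc, sc_out) for xq, sc in zip(xqs, scales))
    return ye, sc_out

a = np.array([0.5, -1.0, 0.25])
b = np.array([3.0, 2.0, -4.0])
sc_a, sc_b = 127 / 1.0, 127 / 4.0            # 8-bit symmetric scales
aq = np.round(sc_a * a).astype(np.int64)
bq = np.round(sc_b * b).astype(np.int64)

ye, sc_y = quantize_scale_add([aq, bq], [sc_a, sc_b])
yr = ye / sc_y
print(ye, yr)
```

With two 8-bit inputs the integer sum is bounded by \(2 \cdot 127\) in magnitude, which fits in max(xprecs)+1 = 9 signed bits; Concat only places values side by side, so it needs no extra bit.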
Op-level Configuration Tutorial¶
Optimized Quantization¶
MRT GEN supports op-level channel-slice and zero point specification to optimize the quantization process for better integer inference results. The optimization takes two steps.
Step 1. Find out potential layers to optimize
A simple method is to print out the names of all the layers as follows. See mrt.V2.Transformer.py
# print out zero point quantization candidates
# (runs inside a method, hence the use of self)
from mrt.sym_utils import is_params, sym_iter, topo_sort
sym, params = self.current_model.symbol, self.current_model.params
for s in topo_sort(sym):
    name, op_name = s.attr('name'), s.attr('op_name')
    if op_name in ["broadcast_add"]:
        # list the parameter inputs of every broadcast_add
        childs = sym_iter(s.get_children())
        for c in childs:
            cname = c.attr('name')
            if is_params(c, params):
                weight = params[cname]
                maxv = weight.max().asscalar()
                minv = weight.min().asscalar()
                print(cname)
    elif op_name in ["Convolution"]:
        # list the parameter inputs of every Convolution with their value range
        childs = sym_iter(s.get_children())
        for c in childs:
            cname = c.attr('name')
            if is_params(c, params):
                weight = params[cname]
                maxv = weight.max().asscalar()
                minv = weight.min().asscalar()
                print(maxv, minv, cname)
# stop after listing the candidates
exit()
Step 2. Set up Cfg_groups in the configuration file
See Configuration Example. An example configuration is provided:
...
[CALIBRATION]
# [Optional] Calibration batch size, 16 by default.
Batch=16
# [Optional] Iterator numbers of calibration, 1 by default.
Calibrate_num=1
# [Optional] Granularity, Quantizer and Optimizor Configuration for specified nodes
Cfg_groups=
  ["mrt_sym_separate_bias_alexnet0_conv0_fwd_0"]: gn_info: {"gn_type"; "channel-wise". "ichannel"; 1. "step"; 1},
  ["alexnet0_conv0_weight"]: quant_type: UniformAffine,
  ["mrt_sym_separate_bias_alexnet0_dense1_bias_0"]: quant_type: UniformAffine
...
Layer-wise Restoration¶
Both MRT and MRT GEN also support layer-wise restoration, which is
useful for debugging when MRT itself is being developed or extended. See
Configuration Example.
The symbols to restore are specified by name in Restore_name:
[QUANTIZATION]
...
# [Optional] Debug usage
Restore_name=
...
Model Test¶
The comparison between the original float model, the MRT quantized model and the non-tuned MRT GEN quantized model is listed below.
| model name | Original Float Model | MRT Quantized Model (Tuned) | MRT Quantized Model | MRT GEN Quantized Model |
|---|---|---|---|---|
| resnet50_v1 | top1=77.39% top5=93.59% | top1=76.47% top5=93.28% | top1=76.41% top5=93.18% | top1=75.66% top5=92.79% |
| resnet50_v2 | top1=77.15% top5=93.44% | top1=70.76% top5=89.56% | top1=69.89% top5=88.84% | top1=69.83% top5=88.84% |
| resnet18_v1 | top1=70.96% top5=89.93% | top1=70.11% top5=89.60% | top1=69.90% top5=89.50% | top1=69.97% top5=88.84% |
| resnet18v1_b_0.89 | top1=67.21% top5=87.45% | | top1=63.75% top5=85.67% | top1=69.97% top5=89.53% |
| quickdraw | top1=81.90% top5=98.26% | | top1=81.83% top5=98.24% | top1=81.89% top5=98.22% |
| qd10_resnetv1_20 | top1=85.79% top5=98.73% | | top1=85.79% top5=98.73% | top1=85.94% top5=98.73% |
| densenet161 | top1=77.62% top5=93.82% | | top1=77.32% top5=93.63% | top1=76.90% top5=93.49% |
| alexnet | top1=55.91% top5=78.75% | | top1=51.69% top5=77.99% | top1=51.82% top5=78.09% (channel sliced) |
| cifar_resnet20_v1 | top1=92.88% top5=99.78% | | top1=92.82% top5=99.75% | top1=92.52% top5=99.79% |
| mobilenet1_0 | top1=70.77% top5=89.97% | top1=66.11% top5=87.35% | top1=63.07% top5=85.02% | top1=61.30% top5=83.88% |
| mobilenetv2_1.0 | top1=71.51% top5=90.10% | top1=69.39% top5=89.30% | top1=66.93% top5=87.39% | top1=62.15% top5=84.00% |
| shufflenet_v1 | top1=63.48% top5=85.12% | top1=60.45% top5=82.95% | top1=60.40% top5=82.91% | top1=60.13% top5=82.70% |
| squeezenet1.0 | top1=57.20% top5=80.04% | top1=55.16% top5=78.67% | top1=52.46% top5=77.10% | top1=54.36% top5=78.19% |
| tf_inception_v3 | top1=55.58% top5=77.56% | top1=55.54% top5=83.03% | top1=53.79% top5=75.99% | top1=52.13% top5=73.96% |
| vgg19 | top1=74.14% top5=91.78% | top1=73.75% top5=91.67% | top1=73.70% top5=91.66% | top1=73.58% top5=91.60% |
| mnist | top1=99.18% top5=100.00% | | top1=99.17% top5=100.00% | top1=98.16% top5=100.00% |
| trec | 97.84% | 97.63% | 97.20% | 97.38% |
| yolo3_darknet53_voc | 81.37% | | 82.08% | 80.85% |
| yolo3_mobilenet1.0_voc | 75.98% | | 71.53% | 70.70% |
| ssd_512_resnet50_v1_voc | 80.27% | | 80.01% | 79.57% |
| ssd_512_mobilenet1.0_voc | 75.57% | 71.32% | 70.21% | 69.68% |
| tf_mobilenet_v1_0.25_224_lite | top1=34.68% top5=59.32% | | top1=3.39% top5=9.93% | top1=31.71% top5=55.83% (slice channel) |
The MRT GEN module applies pad separation for Convolution and bias
separation for both Convolution and FullyConnected. The results
above show that some models, such as quickdraw,
qd10_resnetv1_20, resnet18v1_b_0.89, trec, squeezenet1.0
and alexnet, reach observably better accuracy with MRT GEN than with the MRT module.
