V2 Documentation¶

[TOC]

V2 Architecture¶

MRT V2 is a superset of MRT that adds features such as zero point quantization, channel slicing, and groupwise quantization for convolution. The core implementation resides in the module tfm_types.py, whose architecture is shown below:

Quantizer¶

	Parameter (Uniform Symmetric Quantizer)	Input (Uniform Symmetric Quantizer)	Parameter (Uniform Affine Quantizer)	Input (Uniform Affine Quantizer)
Scale	\(sc_{w} = \frac{2^{PREC-1} - 1}{\max{\|Wr\|}}\)	\(sc_{x} = \frac{2^{PREC-1} - 1}{\max{\|Xr\|}}\)	\(sc_{w} = \frac{2^{PREC} - 1}{\max{Wr} - \min{Wr}}\)	\(sc_{x} = \frac{2^{PREC} - 1}{\max{Xr} - \min{Xr}}\)
Zero Point			\(zp_{wr} = \min Wr\)	\(zp_{xe} = \text{round}\Big(\min Xr \cdot sc_{xe}\Big) \\ rzp_{x} = \text{round}\Big(\min Xr\Big)\)
Quantization	\(Wq = \text{round}\Big(sc_{w} \cdot Wr\Big)\)	\(frac, exp = \text{cvm_float}\bigg(\frac{sc_{x}}{sc_{xe}}\bigg) \\ Xq = \text{realize}(Xe, frac, exp)\)	\(Wq = \text{round}\Big[sc_{w} (Wr - zp_{wr})\Big]\)	\(frac, exp = \text{cvm_float}\bigg(\frac{sc_{x}}{sc_{xe}}\bigg) \\ Xq = \text{realize}(Xe - zp_{xe}, frac, exp)\)
Requantization	\(Wr = \frac{Wq}{sc_{w}}\)	\(Xr = \frac{Xq}{sc_{x}}\)	\(Wr = \frac{Wq}{sc_{w}} + zp_{wr}\)	\(Xr = \frac{Xq}{sc_{x}} + rzp_{x}\)

Variables whose names end with ‘q’ denote integer quantized operators, those ending with ‘r’ denote floating-point operators, and those ending with ‘e’ denote integer expanded operators.
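
The parameter columns of the table can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical and not part of MRT, Python's built-in round (which rounds halves to even) stands in for whatever rounding MRT applies, and the input-side cvm_float/realize step is left out.

```python
def symmetric_quantize(values, prec):
    """Uniform symmetric: sc = (2^(prec-1) - 1) / max|v|, q = round(sc * v)."""
    scale = (2 ** (prec - 1) - 1) / max(abs(v) for v in values)
    return [round(scale * v) for v in values], scale


def symmetric_dequantize(quantized, scale):
    """Recover approximate real values: r = q / sc."""
    return [q / scale for q in quantized]


def affine_quantize(values, prec):
    """Uniform affine: sc = (2^prec - 1) / (max - min), zp = min, q = round(sc * (v - zp))."""
    zero_point = min(values)
    scale = (2 ** prec - 1) / (max(values) - zero_point)
    return [round(scale * (v - zero_point)) for v in values], scale, zero_point


def affine_dequantize(quantized, scale, zero_point):
    """Recover approximate real values: r = q / sc + zp."""
    return [q / scale + zero_point for q in quantized]


# Hypothetical all-positive toy weights (not taken from any model).
weights = [0.1, 0.4, 0.55, 1.0]
wq_sym, sc_sym = symmetric_quantize(weights, prec=8)
wq_aff, sc_aff, zp = affine_quantize(weights, prec=8)
```

For this toy input the affine scale is larger than the symmetric one, so the affine quantizer uses a finer step over the same data; the price is the extra zero point term that the operator expansions have to carry.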

Re-quantization into expansion is elaborated in the operator expansion sections; see NN Operator Expansion, Broadcast Operator Expansion, Elemwise Operator Expansion, and Transform Operator Expansion.

Granularity¶

MRT supports both layer-wise and channel-wise quantization. Channel-wise quantization is implemented by graph-level channel split and channel merge.

Channel Split¶

To trade off precision against the number of calculation operations, MRT supports quantization with respect to channel features. During the MRT rewrite process, slice is used to split the channels:

\[\begin{split}\forall i \text{ in } [0, C, \text{step}), \\ Xi = \text{slice}\Big(X, \\ \text{begin=(None,)*ichannel+(i,)+(None,)*(ndims-ichannel-1)}, \\ \text{end=(None,)*ichannel+(i+step,)+(None,)*(ndims-ichannel-1)}, \\ \text{step=(-1,)*ichannel+(step,)+(-1,)*(ndims-ichannel-1)}\Big)\end{split}\]

If \(X\) is quantized channel-wise and \(W\) layer-wise, or vice versa, the layer-wise operand \(W\) (or \(X\)) is also split so that it stays compatible with \(X\) (or \(W\)).

Take Convolution as an example. For simplicity, only non-grouped convolution (num_group=1) and uniform symmetric quantization are considered here. Layer-wise Convolution can be rewritten as:

\[ \begin{align}\begin{aligned}\begin{split}Ye[n,o,p,q], \\\end{split}\\\begin{split}= \sum_{i=0}^{C-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Xq[n, i, p \cdot SH + ki \cdot DH, q \cdot SW + kj \cdot DW] \cdot Wq[o,i,ki,kj], \\\end{split}\\\begin{split}= \sum_{i=0}^{C/step-1} \sum_{j=i*step}^{(i+1)*step-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Xq[n, j, p \cdot SH + ki \cdot DH, q \cdot SW + kj \cdot DW] \cdot Wq[o,j,ki,kj], \\\end{split}\\\begin{split}= \sum_{i=0}^{C/step-1} Convolution(Xq[:,i*step:(i+1)*step,:,:], Wq[:,i*step:(i+1)*step,:,:]), \\\end{split}\\= \sum_{i=0}^{C/step-1} Yei[n,o,p,q]\end{aligned}\end{align} \]
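
The identity above can be checked numerically with a small sketch. The helper names conv2d and add_n below are hypothetical pure-Python stand-ins (not MRT or MXNet calls), and the example assumes stride 1, dilation 1, no padding and num_group=1.

```python
import random


def conv2d(x, w):
    """Naive NCHW convolution: stride 1, dilation 1, no padding, num_group=1."""
    n_batch, n_chan = len(x), len(x[0])
    height, width = len(x[0][0]), len(x[0][0][0])
    n_out, kh, kw = len(w), len(w[0][0]), len(w[0][0][0])
    return [[[[sum(x[n][i][p + ki][q + kj] * w[o][i][ki][kj]
                   for i in range(n_chan)
                   for ki in range(kh)
                   for kj in range(kw))
               for q in range(width - kw + 1)]
              for p in range(height - kh + 1)]
             for o in range(n_out)]
            for n in range(n_batch)]


def add_n(tensors):
    """Element-wise sum of equally shaped nested lists."""
    if isinstance(tensors[0], list):
        return [add_n(list(parts)) for parts in zip(*tensors)]
    return sum(tensors)


rng = random.Random(0)
N, C, H, W, O, KH, KW, step = 1, 4, 5, 5, 3, 3, 3, 2
xq = [[[[rng.randint(-127, 127) for _ in range(W)] for _ in range(H)]
       for _ in range(C)] for _ in range(N)]
wq = [[[[rng.randint(-127, 127) for _ in range(KW)] for _ in range(KH)]
       for _ in range(C)] for _ in range(O)]

# Layer-wise convolution over all input channels at once.
ye_full = conv2d(xq, wq)

# Channel split: slice X and W along the input-channel axis, convolve each
# slice, then merge the partial results with add_n.
ye_parts = [conv2d([sample[i:i + step] for sample in xq],
                   [kernel[i:i + step] for kernel in wq])
            for i in range(0, C, step)]
ye_merged = add_n(ye_parts)
```

Because all values are integers, the merged result equals the layer-wise result exactly. In a real channel-wise setup each slice would carry its own scale, which this sketch deliberately leaves out.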

Channel Merge¶

Merge the per-channel symbol components back into a single equivalent symbol.

For operators like pad, relu, Pooling, merge as follows:

\[X = \text{concat}\Big( \big[Xi, \forall i \in [0, C) \big], \text{dim=ich} \Big) \tag{concat}\]

For operators like Convolution (num_group=1), merge as follows:

\[X = \sum_{i=0}^{C-1} Xi \tag{add_n}\]

For operators like Convolution (num_group>1), the channel slicing process is performed for each output channel, and the results are concatenated along the output channel axis.
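
For the concat case, a minimal sketch with relu shows that applying the operator per channel slice and concatenating along the channel axis reproduces the unsplit result. The relu helper is a hypothetical pure-Python stand-in, and per-slice scales are ignored here.

```python
def relu(x):
    """Element-wise max(0, v) over a nested list."""
    if isinstance(x, list):
        return [relu(v) for v in x]
    return max(0, x)


# Toy integer tensor of shape (C=3, H=2, W=2); the channel axis is axis 0.
x = [[[-3, 2], [5, -1]],
     [[0, -7], [4, 4]],
     [[-2, 9], [-6, 1]]]
step = 1

# Channel split, per-slice operator, then concat along the channel axis.
slices = [x[i:i + step] for i in range(0, len(x), step)]
merged = [channel for part in slices for channel in relu(part)]
```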

NN Operator Expansion¶

Convolution¶

Limitations

Only 2-D case is considered
num_group is asserted to be 1
bias is fused in MRT rewrite

Inputs

Input data \(X\), of shape \((N,C,H,W)\)
Kernel weight \(W\), of shape \((O,C,KH,KW)\)

Attributes

\(\text{padding} = (PH,PW)\)
\(\text{stride} = (SH,SW)\)
\(\text{dilation} = (DH,DW)\)

Real Formalization

\[\forall n \in [0, N), \quad \forall o \in [0, O), \quad \forall p \in \Bigg{[} 0, \bigg\lfloor \frac{H - DH \cdot (KH-1) - 1}{SH} \bigg\rfloor + 1 \Bigg{)}, \quad \forall q \in \Bigg{[} 0, \bigg\lfloor \frac{W - DW \cdot (KW-1) - 1}{SW} \bigg\rfloor + 1 \Bigg{)}\]

\[Yr[n,o,p,q] = \sum_{i=0}^{C-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Xr[n, i, p \cdot SH + ki \cdot DH, q \cdot SW + kj \cdot DW] \cdot Wr[o,i,ki,kj]\]

Note that if num_group is not 1, convolution is generalized to Groupwise Convolution.

Specifically, suppose kernel weight \(W\) is of shape \((O,IC,KH,KW)\) and input data \(X\) is of shape \((N,C,H,W)\).

\[ \begin{align}\begin{aligned}\begin{split}Yr[n,o,p,q], \\\end{split}\\= \sum_{i=0}^{IC-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Xr \Bigg[n, \bigg\lfloor\frac{o}{OPG}\bigg\rfloor IC + i, p \cdot SH + ki \cdot DH, q \cdot SW + kj \cdot DW\Bigg] \cdot Wr[o,i,ki,kj]\end{aligned}\end{align} \]

For simplicity, the notation of groupwise convolution is not included here.

Given \(Xe\) and \(We\), MRT quantizes them into \(Xq\) and \(Wq\), respectively.

Expansion Formalization 1: Symmetric Quantized X and W

\[Ye[n,o,p,q] = \sum_{i=0}^{C-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Xq[n, i, p \cdot SH + ki \cdot DH, q \cdot SW + kj \cdot DW] \cdot Wq[o,i,ki,kj]\]

where the scale of \(Ye\) is \(sc_{x} \cdot sc_{w}\).

Expansion Formalization 2: Zero Point Quantized X and Symmetric Quantized W

\[Ye1[n,o,p,q] = \sum_{i=0}^{C-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Xq[n, i, p \cdot SH + ki \cdot DH, q \cdot SW + kj \cdot DW] \cdot Wq[o,i,ki,kj]\]

\[Ye2[n,o,p,q] = zp_{xe} \sum_{i=0}^{C-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Wq[o,i,ki,kj]\]

where the scale of \(Ye1\) is \(sc_{x} \cdot sc_{w}\) and the scale of \(Ye2\) is \(sc_{w}\). MRT uses quantize_scale to quantize them into \(Yq1\) and \(Yq2\), respectively, and the final expansion is their sum:

\[Ye = Yq1 + Yq2\]

Expansion Formalization 3: Symmetric Quantized X and Zero Point Quantized W

\[Ye1[n,o,p,q] = \sum_{i=0}^{C-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Xq[n, i, p \cdot SH + ki \cdot DH, q \cdot SW + kj \cdot DW] \cdot Wq[o,i,ki,kj]\]

\[Ye2[n,o,p,q] = zp_{we} \sum_{i=0}^{C-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Xq[n, i, p \cdot SH + ki \cdot DH, q \cdot SW + kj \cdot DW]\]

where the scale of \(Ye1\) is \(sc_{x} \cdot sc_{w}\) and the scale of \(Ye2\) is \(sc_{x}\). MRT uses quantize_scale to quantize them into \(Yq1\) and \(Yq2\), respectively, and the final expansion is their sum:

\[Ye = Yq1 + Yq2\]

Expansion Formalization 4: Zero Point Quantized X and W (Deprecated)

\[\begin{split}\begin{equation} \begin{split} Ye[n,o,p,q] = \sum_{i=0}^{C-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Xq[n, i, p \cdot SH + ki \cdot DH, q \cdot SW + kj \cdot DW] \cdot Wq[o,i,ki,kj] \\ + w_{zp} \sum_{i=0}^{C-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Xq[n, i, p \cdot SH + ki \cdot DH, q \cdot SW + kj \cdot DW] \\ + x_{zp} \sum_{i=0}^{C-1} \sum_{ki=0}^{KH-1} \sum_{kj=0}^{KW-1} Wq[o,i,ki,kj] + C \cdot KH \cdot KW \cdot x_{zp} \cdot w_{zp} \end{split} \end{equation}\end{split}\]

Ye = Convolution(Xq, Wq, **attrs) + wzp * Convolution(Xq, W1, **attrs) + C2 + C3
infer_prec1 = get_bit_cnt(C*KH*KW) + xprec + wprec + 2
infer_prec2 = get_bit_cnt(abs(wzp)*C*KH*KW) + xprec + 1
infer_prec3 = get_bit_cnt(abs(xzp)*C*KH*KW) + wprec + 1
infer_prec4 = get_bit_cnt(abs(wzp)*abs(xzp)*C*KH*KW)
infer_prec = max(infer_prec1, infer_prec2, infer_prec3, infer_prec4) + 2

pad¶

Limitations

Only support constant mode
\(\text{constant_value} = 0\)
Only support pad of \(H\) dimension and \(W\) dimension

Inputs

Input data \(X\), of shape \((N,C,H,W)\)

Attributes

\(\text{pad_width} = (0,0,0,0,PH_1,PH_2,PW_1,PW_2)\)

Real Formalization

\[\forall n \in [0, N), \quad \forall c \in [0, C), \quad \forall h \in [0, PH_1 + H + PH_2), \quad \forall w \in [0, PW_1 + W + PW_2)\]

\[\begin{split}\begin{equation} Yr[n,c,h,w] = \begin{cases} Xr[n, c, h-PH_1, w-PW_1] & PH_1 \leq h<H+PH_1 \wedge PW_1 \leq w < W+PW_1 \\ 0 & \text{otherwise} \end{cases} \end{equation}\end{split}\]

Expansion Scale

\[sc_{ye} = sc_{xe}\]

Expansion Formalization

\[\begin{split}\begin{equation} \begin{split} Ye[n,c,h,w] = \begin{cases} Xe[n, c, h-PH_1, w-PW_1] & PH_1 \leq h<H+PH_1 \wedge PW_1 \leq w < W+PW_1 \\ 0 & \text{otherwise} \end{cases} \end{split} \end{equation}\end{split}\]

Ye = pad(Xe, **attrs)
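
The expansion scale is unchanged because the constant value 0 maps to 0 at any scale, so padding and quantization commute. A minimal sketch, assuming hypothetical pure-Python helpers pad_hw and quantize (stand-ins, not the MRT or MXNet operators) and an arbitrary scale of 16:

```python
def pad_hw(x, ph1, ph2, pw1, pw2):
    """Zero-pad the H and W axes of an NCHW nested list (constant mode, value 0)."""
    padded = []
    for sample in x:
        padded_sample = []
        for channel in sample:
            new_width = pw1 + len(channel[0]) + pw2
            rows = [[0] * new_width for _ in range(ph1)]
            rows += [[0] * pw1 + list(row) + [0] * pw2 for row in channel]
            rows += [[0] * new_width for _ in range(ph2)]
            padded_sample.append(rows)
        padded.append(padded_sample)
    return padded


def quantize(x, scale):
    """Apply round(scale * v) to every element of a nested list."""
    if isinstance(x, list):
        return [quantize(v, scale) for v in x]
    return round(scale * x)


xr = [[[[0.5, -1.25], [2.0, 0.75]]]]   # real input, shape (1, 1, 2, 2)
sc_xe = 16                              # assumed scale for illustration
xe = quantize(xr, sc_xe)

# Pad one row on top and two columns on the right.
ye = pad_hw(xe, ph1=1, ph2=0, pw1=0, pw2=2)
# ye == [[[[0, 0, 0, 0], [8, -20, 0, 0], [32, 12, 0, 0]]]]
```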

relu¶

Inputs

Input data \(X\), of shape \((M_0,M_1,...,M_{N-1})\)

Real Formalization

\[\forall i \in [0, N), \quad \forall m_i \in [0, M_i)\]

\[Yr[m_0,m_1,...,m_{N-1}] = \max{\big( 0, Xr[m_0,m_1,...,m_{N-1}] \big)}\]

Expansion Scale

\[sc_{ye} = sc_{xe}\]

Expansion Formalization

\[Ye[m_0,m_1,...,m_{N-1}] = \max{\big(0, Xe[m_0,m_1,...,m_{N-1}]\big)}\]

Ye = relu(Xe)

Pooling¶

Limitations

Only 2-D case is considered
avg pooling will be rewritten into Convolution or broadcast_mul
Only max pooling will be considered

Inputs

Input data \(X\), of shape \((N,C,H,W)\)

Attributes

\(\text{stride} = (SH,SW)\)
\(\text{kernel} = (KH, KW)\)
\(\text{padding} = (PH, PW)\)

Real Formalization

\[\forall n \in [0, N), \quad \forall c \in [0, C), \quad \forall p \in \Bigg{[} 0, \bigg\lfloor \frac{H - KH}{SH} \bigg\rfloor + 1 \Bigg{)}, \quad \forall q \in \Bigg{[} 0, \bigg\lfloor \frac{W - KW}{SW} \bigg\rfloor + 1 \Bigg{)}\]

\[\begin{split}Yr[n,c,p,q] = \max_{p \cdot SH \leq p' < p \cdot SH + KH \\ q \cdot SW \leq q' < q \cdot SW + KW} Xr[n,c,p',q']\end{split}\]

Padding beforehand

Xe = pad(Xe, mode="constant", pad_width=(0,0,0,0,PH,PH,PW,PW), constant_value=INT_MIN)

Expansion Scale

\[sc_{ye} = sc_{xe}\]

Expansion Formalization

\[\begin{split}Ye[n,c,p,q] = \max_{p' \in [p \cdot SH, p \cdot SH + KH) \\ q' \in [q \cdot SW, q \cdot SW + KW)} Xe[n,c,p',q']\end{split}\]

Ye = Pooling(Xe, stride=stride, kernel=kernel)
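
The reason for padding with INT_MIN rather than 0 shows up when the input is negative: a zero border would win every border window. The sketch below is a hypothetical single-channel max pooling in pure Python, with INT_MIN assumed to be the 32-bit minimum.

```python
INT_MIN = -(2 ** 31)  # assumed 32-bit minimum, used as the padding value


def max_pool2d(channel, kernel, stride, padding):
    """Max pooling over one (H, W) channel, padding with INT_MIN beforehand."""
    kh, kw = kernel
    sh, sw = stride
    ph, pw = padding
    new_width = len(channel[0]) + 2 * pw
    padded = ([[INT_MIN] * new_width for _ in range(ph)]
              + [[INT_MIN] * pw + list(row) + [INT_MIN] * pw for row in channel]
              + [[INT_MIN] * new_width for _ in range(ph)])
    height, width = len(padded), len(padded[0])
    return [[max(padded[pp][qq]
                 for pp in range(p * sh, p * sh + kh)
                 for qq in range(q * sw, q * sw + kw))
             for q in range((width - kw) // sw + 1)]
            for p in range((height - kh) // sh + 1)]


# All-negative toy input: the padding value must never be selected.
xe = [[-5, -3],
      [-8, -1]]
ye = max_pool2d(xe, kernel=(2, 2), stride=(1, 1), padding=(1, 1))
# ye == [[-5, -3, -3], [-5, -1, -1], [-8, -1, -1]]
```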

FullyConnected¶

Limitations

The input only supports layer-wise quantization
bias is fused in MRT rewrite

Inputs

Input data \(X\), of shape \((N,K)\)
Weight \(W\), of shape \((M, K)\)

Real Formalization

\[\forall m \in [0, M), \quad \forall n \in [0, N)\]

\[Yr[n,m] = \sum_{i=0}^{K-1} Xr[n,i] \cdot Wr[m,i]\]

Expansion Scale

\[sc_{ye} = sc_{x} \cdot sc_{w}\]

Expansion Formalization 1: Symmetric Quantized X and W

\[Ye[n,m] = \sum_{i=0}^{K-1} Xq[n,i] \cdot Wq[m,i]\]

Ye = FullyConnected(Xq, Wq)
infer_prec = get_bit_cnt(K) + xprec + wprec
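
The precision rule can be sanity-checked against the worst-case accumulator value. In the sketch below, get_bit_cnt is a stand-in built on int.bit_length (the real MRT helper may differ), and a symmetric signed range of ±(2^(prec-1) - 1) is assumed for both operands.

```python
def get_bit_cnt(cnt):
    """Stand-in for MRT's get_bit_cnt: number of bits needed to represent cnt."""
    return int(cnt).bit_length()


def signed_max(prec):
    """Largest magnitude in the assumed symmetric signed range of `prec` bits."""
    return 2 ** (prec - 1) - 1


def worst_case_accumulator(k, xprec, wprec):
    """Largest possible |sum| of k products of xprec-bit and wprec-bit values."""
    return k * signed_max(xprec) * signed_max(wprec)


K, xprec, wprec = 512, 8, 8
infer_prec = get_bit_cnt(K) + xprec + wprec
worst = worst_case_accumulator(K, xprec, wprec)

# The worst-case sum fits in infer_prec signed bits.
fits = worst <= signed_max(infer_prec)
```

With these assumptions the bound holds for any K, since K is below 2^get_bit_cnt(K) and each product is below 2^(xprec+wprec-2).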

Expansion Formalization 2: Zero Point Quantized X and Symmetric Quantized W

\[Ye[n,m] = \sum_{i=0}^{K-1} Xq[n,i] \cdot Wq[m,i] + zp_x \sum_{i=0}^{K-1} Wq[m,i]\]

Ye = FullyConnected(Xq, Wq) + C
infer_prec1 = get_bit_cnt(K) + xprec + wprec + 1
infer_prec2 = get_bit_cnt(abs(C))
infer_prec = max(infer_prec1, infer_prec2) + 1

Expansion Formalization 3: Symmetric Quantized X and Zero Point Quantized W

\[Ye[n,m] = \sum_{i=0}^{K-1} Xq[n,i] \cdot Wq[m,i] + zp_{w} \sum_{i=0}^{K-1} Xq[n,i]\]

Ye = FullyConnected(Xq, Wq) + C * sum(Xq, axis=1, keep_dims=True)
infer_prec1 = get_bit_cnt(K) + xprec + wprec + 1
infer_prec2 = get_bit_cnt(abs(C)*K) + xprec
infer_prec = max(infer_prec1, infer_prec2) + 1

Expansion Formalization 4: Zero Point Quantized X and W

\[Ye[n,m] = \sum_{i=0}^{K-1} Xq[n,i] \cdot Wq[m,i] + zp_{w} \sum_{i=0}^{K-1} Xq[n,i] + zp_{x} \sum_{i=0}^{K-1} Wq[m,i] + zp_{w} \cdot zp_{x} \cdot K\]

Ye = FullyConnected(Xq, Wq) + C1 * sum(Xq, axis=1, keep_dims=True) + C2 + C3
infer_prec1 = get_bit_cnt(K) + xprec + wprec + 2
infer_prec2 = get_bit_cnt(abs(C1)*K) + xprec + 1
infer_prec3 = get_bit_cnt(abs(max(C2)))
infer_prec4 = get_bit_cnt(abs(C3))
infer_prec = max(infer_prec1, infer_prec2, infer_prec3, infer_prec4) + 2
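
The four-term expansion is the product of the shifted operands multiplied out. The sketch below checks it on toy integers; reading C1 as zp_w, C2 as the per-output vector zp_x · Σ Wq[m,i], and C3 as zp_w · zp_x · K is an interpretation of the code line above, and fully_connected is a hypothetical pure-Python helper.

```python
def fully_connected(x, w):
    """Y[n][m] = sum_i X[n][i] * W[m][i] for nested lists."""
    return [[sum(xi * wi for xi, wi in zip(x_row, w_row)) for w_row in w]
            for x_row in x]


# Toy quantized operands and zero points (hypothetical values).
xq = [[3, -2, 7], [0, 5, -4]]    # shape (N=2, K=3)
wq = [[1, 4, -6], [2, -3, 5]]    # shape (M=2, K=3)
zp_x, zp_w = 9, -11
K = len(xq[0])

# Reference: add the zero points back first, then multiply.
reference = fully_connected([[v + zp_x for v in row] for row in xq],
                            [[v + zp_w for v in row] for row in wq])

# Expansion: one integer matmul plus three correction terms.
c1 = zp_w
c2 = [zp_x * sum(w_row) for w_row in wq]   # depends only on weights, one value per m
c3 = zp_w * zp_x * K                       # a single constant
base = fully_connected(xq, wq)
ye = [[base[n][m] + c1 * sum(xq[n]) + c2[m] + c3 for m in range(len(wq))]
      for n in range(len(xq))]
```

Only the c1 term depends on the input data at inference time; c2 and c3 depend only on weights and zero points, so they can be precomputed once.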

Broadcast Operator Expansion¶

broadcast_add¶

Uses Quantize Scale.

Elemwise Operator Expansion¶

elemwise_add¶

Uses Quantize Scale.

add_n¶

Uses Quantize Scale.

Transform Operator Expansion¶

concat¶

Uses Quantize Scale.

flatten¶

Inputs

Input data \(X\), of shape \((M_0,M_1,...,M_{N-1})\)

Real Formalization

\[\forall i \in [0, N), \quad \forall m_i \in [0, M_i)\]

\[Yr\Bigg[\sum_{j=0}^{N-2} m_j \prod_{i=j+1}^{N-1}M_i + m_{N-1}\Bigg] = Xr[m_0,m_1,...,m_{N-1}]\]

Expansion Scale

\[sc_{ye} = sc_{xe}\]

Expansion Formalization

\[Ye\Bigg[\sum_{j=0}^{N-2} m_j \prod_{i=j+1}^{N-1}M_i + m_{N-1}\Bigg] = Xe[m_0,m_1,...,m_{N-1}]\]

Ye = flatten(Xe)
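
The index mapping in the formalization is the usual row-major offset. A minimal sketch, with hypothetical helper names, that evaluates the offset by Horner's rule and checks it against a recursive flatten:

```python
from itertools import product


def flat_index(index, shape):
    """Row-major offset: sum_j m_j * prod_{i>j} M_i, evaluated by Horner's rule."""
    offset = 0
    for m, size in zip(index, shape):
        offset = offset * size + m
    return offset


def flatten(x):
    """Flatten a nested list into a 1-D list in row-major order."""
    if isinstance(x, list):
        return [leaf for sub in x for leaf in flatten(sub)]
    return [x]


shape = (2, 3, 4)
# Each element encodes its own index so mismatches are easy to spot.
xe = [[[100 * a + 10 * b + c for c in range(4)] for b in range(3)]
      for a in range(2)]
ye = flatten(xe)

# Every multi-index lands on the offset given by the formula.
consistent = all(ye[flat_index((a, b, c), shape)] == xe[a][b][c]
                 for a, b, c in product(range(2), range(3), range(4)))
```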

Generalized Expansion Function¶

Quantize Scale¶

Limitations

All inputs support only symmetric quantization

Expansion Scale

\[sc_{ye} = sc_{xi}\]

Expansion Formalization

Ye = quantize_scale(**Xqs, **attrs)
infer_prec = max(xprecs) if op_name == "Concat" else max(xprecs)+1
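
The precision rule reflects that concatenation leaves values unchanged while addition can grow them. The sketch below checks only the two-input case and assumes the inputs have already been brought to a common scale and use a symmetric signed range; the helper names are hypothetical.

```python
def infer_prec(op_name, xprecs):
    """Precision rule quoted above: Concat keeps the widest input, other ops add one bit."""
    return max(xprecs) if op_name == "Concat" else max(xprecs) + 1


def signed_max(prec):
    """Largest magnitude in the assumed symmetric signed range of `prec` bits."""
    return 2 ** (prec - 1) - 1


xprecs = [8, 6]                                       # two inputs at a common scale
worst_sum = sum(signed_max(p) for p in xprecs)        # 127 + 31
add_prec = infer_prec("elemwise_add", xprecs)         # 9
concat_prec = infer_prec("Concat", xprecs)            # 8

add_fits = worst_sum <= signed_max(add_prec)
concat_fits = max(signed_max(p) for p in xprecs) <= signed_max(concat_prec)
```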

Op-level Configuration Tutorial¶

Optimized Quantization¶

MRT GEN supports op-level channel-slice and zero point specification to optimize the quantization process for a better integer inference result. The optimization process takes two steps.

Step 1. Find out potential layers to optimize

A simple method is to print out the names of all candidate layers, as follows. See mrt.V2.Transformer.py.

# Print out zero point quantization candidates.
# topo_sort is assumed to be importable from mrt.sym_utils like the other helpers.
from mrt.sym_utils import is_params, sym_iter, topo_sort
sym, params = self.current_model.symbol, self.current_model.params
for s in topo_sort(sym):
    name, op_name = s.attr('name'), s.attr('op_name')
    if op_name in ["broadcast_add"]:
        childs = sym_iter(s.get_children())
        for c in childs:
            cname = c.attr('name')
            if is_params(c, params):
                weight = params[cname]
                maxv = weight.max().asscalar()
                minv = weight.min().asscalar()
                print(cname)
    elif op_name in ["Convolution"]:
        childs = sym_iter(s.get_children())
        for c in childs:
            cname = c.attr('name')
            if is_params(c, params):
                weight = params[cname]
                maxv = weight.max().asscalar()
                minv = weight.min().asscalar()
                print(maxv, minv, cname)
exit()

Step 2. Set up Cfg_groups in the configuration file

See Configuration Example. An exemplary configuration is provided below.

...
[CALIBRATION]
# [Optional] Calibration batch size, 16 by default.
Batch=16
# [Optional] Iterator numbers of calibration, 1 by default.
Calibrate_num=1
# [Optional] Granularity, Quantizer and Optimizer Configuration for specified nodes
Cfg_groups=
  ["mrt_sym_separate_bias_alexnet0_conv0_fwd_0"]: gn_info: {"gn_type"; "channel-wise". "ichannel"; 1. "step"; 1},
  ["alexnet0_conv0_weight"]: quant_type: UniformAffine,
    ["mrt_sym_separate_bias_alexnet0_dense1_bias_0"]: quant_type: UniformAffine
...

Layer-wise Restoration¶

Both MRT and MRT GEN also support layer-wise restoration, which is useful for debugging when MRT itself is being developed or extended. See Configuration Example. The layers to restore are specified by symbol name in Restore_name:

[QUANTIZATION]
...
# [Optional] Debug usage
Restore_name=
...

Model Test¶

The comparison between the original float model, the MRT quantized model, and the non-tuned MRT GEN quantized model is listed below.

model name	Original Float Model	MRT Quantized Model (Tuned)	MRT Quantized Model	MRT GEN Quantized Model
resnet50_v1	top1=77.39%, top5=93.59%	top1=76.47%, top5=93.28%	top1=76.41%, top5=93.18%	top1=75.66%, top5=92.79%
resnet50_v2	top1=77.15%, top5=93.44%	top1=70.76%, top5=89.56%	top1=69.89%, top5=88.84%	top1=69.83%, top5=88.84%
resnet18_v1	top1=70.96%, top5=89.93%	top1=70.11%, top5=89.60%	top1=69.90%, top5=89.50%	top1=69.97%, top5=88.84%
resnet18v1_b_0.89	top1=67.21%, top5=87.45%		top1=63.75%, top5=85.67%	top1=69.97%, top5=89.53%
quickdraw	top1=81.90%, top5=98.26%		top1=81.83%, top5=98.24%	top1=81.89%, top5=98.22%
qd10_resnetv1_20	top1=85.79%, top5=98.73%		top1=85.79%, top5=98.73%	top1=85.94%, top5=98.73%
densenet161	top1=77.62%, top5=93.82%		top1=77.32%, top5=93.63%	top1=76.90%, top5=93.49%
alexnet	top1=55.91%, top5=78.75%		top1=51.69%, top5=77.99%	top1=51.82%, top5=78.09% (channel sliced)
cifar_resnet20_v1	top1=92.88%, top5=99.78%		top1=92.82%, top5=99.75%	top1=92.52%, top5=99.79%
mobilenet1_0	top1=70.77%, top5=89.97%	top1=66.11%, top5=87.35%	top1=63.07%, top5=85.02%	top1=61.30%, top5=83.88%
mobilenetv2_1.0	top1=71.51%, top5=90.10%	top1=69.39%, top5=89.30%	top1=66.93%, top5=87.39%	top1=62.15%, top5=84.00%
shufflenet_v1	top1=63.48%, top5=85.12%	top1=60.45%, top5=82.95%	top1=60.40%, top5=82.91%	top1=60.13%, top5=82.70%
squeezenet1.0	top1=57.20%, top5=80.04%	top1=55.16%, top5=78.67%	top1=52.46%, top5=77.10%	top1=54.36%, top5=78.19%
tf_inception_v3	top1=55.58%, top5=77.56%	top1=55.54%, top5=83.03%	top1=53.79%, top5=75.99%	top1=52.13%, top5=73.96%
vgg19	top1=74.14%, top5=91.78%	top1=73.75%, top5=91.67%	top1=73.70%, top5=91.66%	top1=73.58%, top5=91.60%
mnist	top1=99.18%, top5=100.00%		top1=99.17%, top5=100.00%	top1=98.16%, top5=100.00%
trec	97.84%	97.63%	97.20%	97.38%
yolo3_darknet53_voc	81.37%		82.08%	80.85%
yolo3_mobilenet1.0_voc	75.98%		71.53%	70.70%
ssd_512_resnet50_v1_voc	80.27%		80.01%	79.57%
ssd_512_mobilenet1.0_voc	75.57%	71.32%	70.21%	69.68%
tf_mobilenet_v1_0.25_224_lite	top1=34.68%, top5=59.32%		top1=3.39%, top5=9.93%	top1=31.71%, top5=55.83% (slice channel)

The MRT GEN module applies pad separation for Convolution and bias separation for both Convolution and FullyConnected. From the table above we can see that some models, such as quickdraw, qd10_resnetv1_20, resnet18v1_b_0.89, trec, squeezenet1.0 and alexnet, have observably better top-1 accuracy than with the untuned MRT module.