paper: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
code: pocketflow
Overview
- The first and last layers of the network connect directly to the input and output and are arguably more sensitive to quantization. For accuracy, these two layers can be left un-quantized, which slightly improves performance; but to accelerate inference, all layers should be quantized.
- When both weights and activations are quantized, 8 bits does not lead to an apparent drop in performance and can sometimes even increase classification accuracy, probably due to better generalization. However, when both weights and activations are quantized to lower bit-widths, performance suffers more than with weight-only quantization.
- Bucket granularity trades accuracy against storage. In general, a smaller bucket size gives more fine-grained quantization, but requires more storage, since the full-precision statistics (α and β) of each bucket must be kept.
- Buckets can be formed in either 'split' or 'channel' mode. The 'split' mode requires an explicit bucket size; in 'channel' mode the bucket size is c_in × ksize × ksize, i.e. one bucket per output channel (see the sketch after this list).
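A minimal NumPy sketch of the two bucketing modes, assuming a conv kernel of shape [ksize, ksize, c_in, c_out]; the function names, the zero-padding of the last bucket, and the exact memory layout are assumptions for illustration, not PocketFlow's actual implementation.

import numpy as np

def split_bucket(w, bucket_size):
  """'split' mode: flatten the kernel and cut it into fixed-size buckets,
  zero-padding the tail so the last bucket is full."""
  flat = w.reshape(-1)
  padded_num = (-flat.size) % bucket_size
  flat = np.concatenate([flat, np.zeros(padded_num, dtype=flat.dtype)])
  bucket_num = flat.size // bucket_size
  # returned shape: [bucket_size, bucket_num]; alpha/beta are then computed per column
  return flat.reshape(bucket_num, bucket_size).T, bucket_num, padded_num

def channel_bucket(w):
  """'channel' mode: one bucket per output channel,
  so each bucket holds c_in * ksize * ksize weights and no padding is needed."""
  ksize, _, c_in, c_out = w.shape
  return w.reshape(ksize * ksize * c_in, c_out), c_out, 0

w = np.random.randn(3, 3, 64, 128).astype(np.float32)
wb, bucket_num, _ = channel_bucket(w)
print(wb.shape, bucket_num)   # (576, 128) 128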
Uniform quantization
Affine transformation
r = S(q − Z)
Concretely, the quantization functions are:
clamp(r; a, b) := min(max(r, a), b)
s(a,b,n):=\frac{b-a}{n-1}
q(r; a, b, n) := round\left(\frac{clamp(r; a, b)-a}{s(a, b, n)}\right)s(a, b, n) + a
a := \min w, \quad b := \max w
n := q_{max} - q_{min} + 1 (the number of quantization levels, e.g. 2^{8} = 256 for 8-bit quantization)
S is the scale and Z is the zero-point; they can be computed from either of the pairs (\min w, q_{min}) or (\max w, q_{max}):
Z = round\left(q_{min} - \frac{a}{s(a, b, n)}\right)
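As a sketch of this scheme in NumPy (assuming unsigned 8-bit with q_min = 0, q_max = 255; the function names are illustrative):

import numpy as np

def affine_quantize(r, q_min=0, q_max=255):
  """Map float values r to integers q such that r ≈ S * (q - Z)."""
  a, b = float(r.min()), float(r.max())
  n = q_max - q_min + 1                 # number of quantization levels
  S = (b - a) / (n - 1)                 # scale, s(a, b, n)
  Z = int(round(q_min - a / S))         # zero-point
  q = np.clip(np.round(r / S) + Z, q_min, q_max).astype(np.uint8)
  return q, S, Z

def affine_dequantize(q, S, Z):
  return S * (q.astype(np.float32) - Z)

r = np.random.randn(4, 4).astype(np.float32)
q, S, Z = affine_quantize(r)
print(np.abs(r - affine_dequantize(q, S, Z)).max())   # small, on the order of the step size S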
pocketflow:
sc(x)=\frac{x-\beta}{\alpha}, \quad \alpha = \omega_{max}-\omega_{min}, \quad \beta = \omega_{min}
\hat{x}=\frac{1}{2^{k}-1}round((2^{k}-1)sc(x))
Q(x)=\alpha \hat{x}+\beta
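As a quick numeric illustration (values chosen here for clarity, not taken from the paper or code): with k = 2 and w = (-1, 0, 0.5, 2), we get \alpha = 3, \beta = -1, sc(w) = (0, 1/3, 0.5, 1), \hat{w} = (0, 1/3, 2/3, 1) and Q(w) = (-1, 0, 1, 2); every weight is snapped to one of the 2^{k} = 4 levels \beta + \alpha\cdot\{0, 1/3, 2/3, 1\}.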
pocketflow code
def __scale(self, w, mode):
  """Linear scale function.

  Args:
  * w: A Tensor (weights or activation output),
       the shape is [bucket_size, bucket_num] if use_buckets else the original size.
  * mode: A string, 'weight' or 'activation'

  Returns:
  * A Tensor, the normalized weights
  * A Tensor, alpha, scalar if activation mode else a vector [bucket_num]
  * A Tensor, beta, scalar if activation mode else a vector [bucket_num]
  """
  if mode == 'weight':
    if self.use_buckets:
      axis = 0
    else:
      axis = None
  elif mode == 'activation':
    axis = None
  else:
    raise ValueError("Unknown mode for scaling")

  # per-bucket (or global) min/max; stop_gradient keeps them out of the backward pass
  w_max = tf.stop_gradient(tf.reduce_max(w, axis=axis))
  w_min = tf.stop_gradient(tf.reduce_min(w, axis=axis))
  eps = tf.constant(value=1e-10, dtype=tf.float32)
  alpha = w_max - w_min + eps
  beta = w_min
  w = (w - beta) / alpha
  return w, alpha, beta

def __inv_scale(self, w, alpha, beta):
  """Inverse linear scale function.

  Args:
  * w: A Tensor (weights or activation output)
  * alpha: A float value, scale factor
  * beta: A float value, scale bias

  Returns:
  * A Tensor, the inverse-scaled value
  """
  return alpha * w + beta

def __uniform_quantize(self, x, mbits, mode, prefix=''):
  """Uniform quantization function.

  Args:
  * x: A Tensor (weights or activation output)
  * mbits: A scalar Tensor, tf.int64, specifying the number of bits for quantization
  * mode: A string, 'weight' or 'activation', where to quantize
  * prefix: A string, the prefix of the scope name

  Returns:
  * A Tensor, the uniformly quantized value
  """
  with tf.variable_scope(prefix + '/quantize'):
    if self.use_buckets and mode == 'weight':
      orig_shape = x.get_shape()
      if self.bucket_type == 'split':
        x, bucket_num, padded_num = self.__split_bucket(x)
      elif self.bucket_type == 'channel':
        x, bucket_num, padded_num = self.__channel_bucket(x)

    x_normalized, alpha, beta = self.__scale(x, mode)
    g = self.sess.graph
    k = tf.cast(2 ** mbits - 1, tf.float32)
    # straight-through estimator: treat round() as identity in the backward pass
    with g.gradient_override_map({'Round': 'Identity'}):
      qw = tf.round(x_normalized * k) / k
    qw = self.__inv_scale(qw, alpha, beta)

    if self.use_buckets and mode == 'weight':
      # Reshape w back to the original shape
      qw = tf.reshape(qw, [-1])
      if padded_num != 0:
        qw = tf.reshape(qw[:-padded_num], orig_shape)
      else:
        qw = tf.reshape(qw, orig_shape)

      # Update bucket storage if buckets are used.
      self.__updt_bucket_storage(bucket_num)

    print("Quantized: " + tf.get_variable_scope().name)
    return qw
BN layer
y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta
PyTorch implementation:
momentum: the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1
Cumulative moving average:
\textit{CMA}_{n} = \frac{x_{1}+\cdots+x_{n}}{n}
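A short Python sketch of the corresponding running-mean update, assuming PyTorch's semantics as described above (momentum=None means cumulative averaging); this is a simplified illustration, not the library's actual code.

def update_running_mean(running_mean, batch_mean, momentum, num_batches_tracked):
  """momentum is None -> cumulative moving average over all batches seen so far;
  otherwise -> exponential moving average with the given momentum."""
  if momentum is None:
    n = num_batches_tracked + 1
    return running_mean + (batch_mean - running_mean) / n   # CMA_n from CMA_{n-1}
  return (1.0 - momentum) * running_mean + momentum * batch_mean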
At inference time, the BN layer is usually folded into the preceding Conv or FC layer:
w_{fold} := \frac{\gamma w}{\sqrt{EMA(\sigma_{B}^{2})+\varepsilon}}
Here γ is the batch normalization's scale parameter, EMA(σ_B^2) is the moving-average estimate of the variance of the convolution results across the batch, and ε is just a small constant for numerical stability.
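A minimal NumPy sketch of BN folding, assuming a kernel layout of [ksize, ksize, c_in, c_out] and a conv bias b; the folded bias b_fold is not spelled out in the note above but follows from the same algebra, and mean / var stand for the EMA statistics used at inference time.

import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
  """Fold y = gamma * (conv(x, w) + b - mean) / sqrt(var + eps) + beta
  into a single conv with weights w_fold and bias b_fold."""
  scale = gamma / np.sqrt(var + eps)        # one factor per output channel
  w_fold = w * scale.reshape(1, 1, 1, -1)   # scale each output channel of the kernel
  b_fold = beta + (b - mean) * scale
  return w_fold, b_fold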

Fixed-point arithmetic
Retraining
During training, both the float weights and the quantized weights are kept: the gradients from the backward pass update the float weights, which are then re-quantized into the quant weights used in the forward pass; all other steps are the same as in normal training. A minimal sketch of this scheme follows.
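A minimal TensorFlow 1.x-style sketch of this training scheme, in the same spirit as the PocketFlow code above: the float weight is the trainable variable, the forward pass uses its quantized copy, and the Round op's gradient is overridden so gradients flow back to the float weight. The names fake_quant and quantized_dense are illustrative, not PocketFlow's API.

import tensorflow as tf

def fake_quant(w, num_bits=8):
  """Quantize-dequantize w to num_bits; the straight-through estimator
  (Round's gradient overridden to Identity) lets gradients reach the float weight."""
  w_min = tf.stop_gradient(tf.reduce_min(w))
  w_max = tf.stop_gradient(tf.reduce_max(w))
  alpha, beta = w_max - w_min + 1e-10, w_min
  k = float(2 ** num_bits - 1)
  w_norm = (w - beta) / alpha
  with tf.get_default_graph().gradient_override_map({'Round': 'Identity'}):
    w_hat = tf.round(w_norm * k) / k
  return alpha * w_hat + beta

def quantized_dense(x, units):
  # the float weight is what the optimizer updates; the layer computes with its quantized copy
  w = tf.get_variable('w', shape=[int(x.shape[-1]), units])
  return tf.matmul(x, fake_quant(w))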
