paper: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
code: pocketflow
Overview
- The first and last layers of the network connect directly to the input and output and are arguably more sensitive to quantization. For accuracy, these two layers can be left un-quantized, which slightly improves performance; but to accelerate inference, all layers should be quantized.
- When both weights and activations are quantized, 8 bits does not lead to an apparent drop in performance and can sometimes even increase classification accuracy, probably due to better generalization. However, when both weights and activations are quantized to lower bit-widths, performance suffers more than with weight-only quantization.
- Bucket granularity trades accuracy against storage. In general, a smaller bucket size gives more fine-grained quantization, but requires more storage, since the full-precision statistics (α and β) of each bucket must be kept.
- Buckets can be formed in either 'split' or 'channel' mode. The 'split' mode requires an explicit bucket size; in 'channel' mode the bucket size is c_in × ksize × ksize, i.e. one bucket per output channel (see the sketch after this list).
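A minimal NumPy sketch of the two bucketing modes, assuming a conv kernel of shape [ksize, ksize, c_in, c_out]; the function names, the zero-padding of the last bucket, and the exact memory layout are assumptions for illustration, not PocketFlow's actual implementation.

import numpy as np

def split_bucket(w, bucket_size):
  """'split' mode: flatten the kernel and cut it into fixed-size buckets,
  zero-padding the tail so the last bucket is full."""
  flat = w.reshape(-1)
  padded_num = (-flat.size) % bucket_size
  flat = np.concatenate([flat, np.zeros(padded_num, dtype=flat.dtype)])
  bucket_num = flat.size // bucket_size
  # returned shape: [bucket_size, bucket_num]; alpha/beta are then computed per column
  return flat.reshape(bucket_num, bucket_size).T, bucket_num, padded_num

def channel_bucket(w):
  """'channel' mode: one bucket per output channel,
  so each bucket holds c_in * ksize * ksize weights and no padding is needed."""
  ksize, _, c_in, c_out = w.shape
  return w.reshape(ksize * ksize * c_in, c_out), c_out, 0

w = np.random.randn(3, 3, 64, 128).astype(np.float32)
wb, bucket_num, _ = channel_bucket(w)
print(wb.shape, bucket_num)   # (576, 128) 128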
Uniform quantization
Affine transformation
r = S(q − Z)
Concretely, the quantization functions are:
clamp(r; a, b) := min(max(r, a), b)
s(a,b,n):=\frac{b-a}{n-1}
q(r; a, b, n) := round\left(\frac{clamp(r; a, b)-a}{s(a, b, n)}\right)s(a, b, n) + a
a := \min w, \quad b := \max w
n := q_{max} - q_{min} + 1 (the number of quantization levels, e.g. 2^{8} = 256 for 8-bit quantization)
S is the scale and Z is the zero-point; they can be computed from either of the pairs (\min w, q_{min}) or (\max w, q_{max}):
Z = round\left(q_{min} - \frac{a}{s(a, b, n)}\right)
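As a sketch of this scheme in NumPy (assuming unsigned 8-bit with q_min = 0, q_max = 255; the function names are illustrative):

import numpy as np

def affine_quantize(r, q_min=0, q_max=255):
  """Map float values r to integers q such that r ≈ S * (q - Z)."""
  a, b = float(r.min()), float(r.max())
  n = q_max - q_min + 1                 # number of quantization levels
  S = (b - a) / (n - 1)                 # scale, s(a, b, n)
  Z = int(round(q_min - a / S))         # zero-point
  q = np.clip(np.round(r / S) + Z, q_min, q_max).astype(np.uint8)
  return q, S, Z

def affine_dequantize(q, S, Z):
  return S * (q.astype(np.float32) - Z)

r = np.random.randn(4, 4).astype(np.float32)
q, S, Z = affine_quantize(r)
print(np.abs(r - affine_dequantize(q, S, Z)).max())   # small, on the order of the step size S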
pocketflow:
sc(x)=\frac{x-\beta}{\alpha}, \quad \alpha = \omega_{max}-\omega_{min}, \quad \beta = \omega_{min}
\hat{x}=\frac{1}{2^{k}-1}round((2^{k}-1)sc(x))
Q(x)=\alpha \hat{x}+\beta
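As a quick numeric illustration (values chosen here for clarity, not taken from the paper or code): with k = 2 and w = (-1, 0, 0.5, 2), we get \alpha = 3, \beta = -1, sc(w) = (0, 1/3, 0.5, 1), \hat{w} = (0, 1/3, 2/3, 1) and Q(w) = (-1, 0, 1, 2); every weight is snapped to one of the 2^{k} = 4 levels \beta + \alpha\cdot\{0, 1/3, 2/3, 1\}.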
pocketflow code
def __scale(self, w, mode):
  """Linear scale function.

  Args:
  * w: A Tensor (weights or activation output),
       the shape is [bucket_size, bucket_num] if use_buckets else the original size.
  * mode: A string, 'weight' or 'activation'

  Returns:
  * A Tensor, the normalized weights
  * A Tensor, alpha, scalar if activation mode else a vector [bucket_num]
  * A Tensor, beta, scalar if activation mode else a vector [bucket_num]
  """
  if mode == 'weight':
    if self.use_buckets:
      axis = 0
    else:
      axis = None
  elif mode == 'activation':
    axis = None
  else:
    raise ValueError("Unknown mode for scaling")

  # per-bucket (or global) min/max; stop_gradient keeps them out of the backward pass
  w_max = tf.stop_gradient(tf.reduce_max(w, axis=axis))
  w_min = tf.stop_gradient(tf.reduce_min(w, axis=axis))
  eps = tf.constant(value=1e-10, dtype=tf.float32)
  alpha = w_max - w_min + eps
  beta = w_min
  w = (w - beta) / alpha
  return w, alpha, beta

def __inv_scale(self, w, alpha, beta):
  """Inverse linear scale function.

  Args:
  * w: A Tensor (weights or activation output)
  * alpha: A float value, scale factor
  * beta: A float value, scale bias

  Returns:
  * A Tensor, the inverse-scaled value
  """
  return alpha * w + beta

def __uniform_quantize(self, x, mbits, mode, prefix=''):
  """Uniform quantization function.

  Args:
  * x: A Tensor (weights or activation output)
  * mbits: A scalar Tensor, tf.int64, specifying the number of bits for quantization
  * mode: A string, 'weight' or 'activation', where to quantize
  * prefix: A string, the prefix of the scope name

  Returns:
  * A Tensor, the uniformly quantized value
  """
  with tf.variable_scope(prefix + '/quantize'):
    if self.use_buckets and mode == 'weight':
      orig_shape = x.get_shape()
      if self.bucket_type == 'split':
        x, bucket_num, padded_num = self.__split_bucket(x)
      elif self.bucket_type == 'channel':
        x, bucket_num, padded_num = self.__channel_bucket(x)

    x_normalized, alpha, beta = self.__scale(x, mode)
    g = self.sess.graph
    k = tf.cast(2 ** mbits - 1, tf.float32)
    # straight-through estimator: treat round() as identity in the backward pass
    with g.gradient_override_map({'Round': 'Identity'}):
      qw = tf.round(x_normalized * k) / k
    qw = self.__inv_scale(qw, alpha, beta)

    if self.use_buckets and mode == 'weight':
      # Reshape w back to the original shape
      qw = tf.reshape(qw, [-1])
      if padded_num != 0:
        qw = tf.reshape(qw[:-padded_num], orig_shape)
      else:
        qw = tf.reshape(qw, orig_shape)

      # Update bucket storage if buckets are used.
      self.__updt_bucket_storage(bucket_num)

    print("Quantized: " + tf.get_variable_scope().name)
    return qw
BN layer
y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta
PyTorch implementation:
momentum: the value used for the running_mean and running_var computation. Can be set to None for cumulative moving average (i.e. simple average). Default: 0.1
Cumulative moving average:
\textit{CMA}_{n} = \frac{x_{1}+\cdots+x_{n}}{n}
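A short Python sketch of the corresponding running-mean update, assuming PyTorch's semantics as described above (momentum=None means cumulative averaging); this is a simplified illustration, not the library's actual code.

def update_running_mean(running_mean, batch_mean, momentum, num_batches_tracked):
  """momentum is None -> cumulative moving average over all batches seen so far;
  otherwise -> exponential moving average with the given momentum."""
  if momentum is None:
    n = num_batches_tracked + 1
    return running_mean + (batch_mean - running_mean) / n   # CMA_n from CMA_{n-1}
  return (1.0 - momentum) * running_mean + momentum * batch_mean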
At inference time, the BN layer is usually folded into the preceding Conv or FC layer:
w_{fold} := \frac{\gamma w}{\sqrt{EMA(\sigma_{B}^{2})+\varepsilon}}
Here γ is the batch normalization's scale parameter, EMA(σ_B^2) is the moving-average estimate of the variance of the convolution results across the batch, and ε is just a small constant for numerical stability.
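A minimal NumPy sketch of BN folding, assuming a kernel layout of [ksize, ksize, c_in, c_out] and a conv bias b; the folded bias b_fold is not spelled out in the note above but follows from the same algebra, and mean / var stand for the EMA statistics used at inference time.

import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
  """Fold y = gamma * (conv(x, w) + b - mean) / sqrt(var + eps) + beta
  into a single conv with weights w_fold and bias b_fold."""
  scale = gamma / np.sqrt(var + eps)        # one factor per output channel
  w_fold = w * scale.reshape(1, 1, 1, -1)   # scale each output channel of the kernel
  b_fold = beta + (b - mean) * scale
  return w_fold, b_fold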

Fixed-point arithmetic
Retraining
During training, both the float weights and the quantized weights are kept: the gradients from the backward pass update the float weights, which are then re-quantized into the quant weights used in the forward pass; all other steps are the same as in normal training. A minimal sketch of this scheme follows.
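A minimal TensorFlow 1.x-style sketch of this training scheme, in the same spirit as the PocketFlow code above: the float weight is the trainable variable, the forward pass uses its quantized copy, and the Round op's gradient is overridden so gradients flow back to the float weight. The names fake_quant and quantized_dense are illustrative, not PocketFlow's API.

import tensorflow as tf

def fake_quant(w, num_bits=8):
  """Quantize-dequantize w to num_bits; the straight-through estimator
  (Round's gradient overridden to Identity) lets gradients reach the float weight."""
  w_min = tf.stop_gradient(tf.reduce_min(w))
  w_max = tf.stop_gradient(tf.reduce_max(w))
  alpha, beta = w_max - w_min + 1e-10, w_min
  k = float(2 ** num_bits - 1)
  w_norm = (w - beta) / alpha
  with tf.get_default_graph().gradient_override_map({'Round': 'Identity'}):
    w_hat = tf.round(w_norm * k) / k
  return alpha * w_hat + beta

def quantized_dense(x, units):
  # the float weight is what the optimizer updates; the layer computes with its quantized copy
  w = tf.get_variable('w', shape=[int(x.shape[-1]), units])
  return tf.matmul(x, fake_quant(w))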
