import tensorflow as tf
import numpy as np
slim = tf.contrib.slim
def combined_static_and_dynamic_shape(tensor):
    """Returns a list containing static and dynamic values for the dimensions.

    Returns a list of static and dynamic values for shape dimensions. This is
    useful to preserve static shapes when available in reshape operation.

    Args:
        tensor: A tensor of any type.

    Returns:
        A list of size tensor.shape.ndims containing integers or a scalar tensor.
    """
    static_tensor_shape = tensor.shape.as_list()
    dynamic_tensor_shape = tf.shape(tensor)
    combined_shape = []
    for index, dim in enumerate(static_tensor_shape):
        if dim is not None:
            combined_shape.append(dim)
        else:
            combined_shape.append(dynamic_tensor_shape[index])
    return combined_shape
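# Illustrative note (not part of the original listing): for an input such as
#   x = tf.placeholder(tf.float32, [None, 8, 8, 32])
# combined_static_and_dynamic_shape(x) returns [tf.shape(x)[0], 8, 8, 32],
# i.e. plain Python ints where a dimension is statically known and scalar
# tensors where it is not, so the reshapes below keep static shape information.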
def convolutional_block_attention_module(feature_map, index, inner_units_ratio=0.5):
    """Applies CBAM: channel attention followed by spatial attention.

    :param feature_map: input feature map, shape [batch, height, width, channels]
    :param index: index of this convolutional block attention module (used for variable scoping)
    :param inner_units_ratio: hidden units of the shared MLP as a fraction of the channel count,
        i.e. inner_units_ratio * feature_map_channel
    :return: feature map refined by channel and spatial attention
    """
    with tf.variable_scope("cbam_%s" % (index)):
        feature_map_shape = combined_static_and_dynamic_shape(feature_map)

        # Channel attention: global average- and max-pooling over the spatial dimensions.
        channel_avg_weights = tf.nn.avg_pool(
            value=feature_map,
            ksize=[1, feature_map_shape[1], feature_map_shape[2], 1],
            strides=[1, 1, 1, 1],
            padding='VALID'
        )
        channel_max_weights = tf.nn.max_pool(
            value=feature_map,
            ksize=[1, feature_map_shape[1], feature_map_shape[2], 1],
            strides=[1, 1, 1, 1],
            padding='VALID'
        )
        channel_avg_reshape = tf.reshape(channel_avg_weights,
                                         [feature_map_shape[0], 1, feature_map_shape[3]])
        channel_max_reshape = tf.reshape(channel_max_weights,
                                         [feature_map_shape[0], 1, feature_map_shape[3]])
        channel_w_reshape = tf.concat([channel_avg_reshape, channel_max_reshape], axis=1)

        # Shared two-layer MLP applied to both pooled descriptors at once.
        fc_1 = tf.layers.dense(
            inputs=channel_w_reshape,
            units=int(feature_map_shape[3] * inner_units_ratio),
            name="fc_1",
            activation=tf.nn.relu
        )
        fc_2 = tf.layers.dense(
            inputs=fc_1,
            units=feature_map_shape[3],
            name="fc_2",
            activation=None
        )
        # Sum the avg- and max-pooled branches, then squash to [0, 1] channel weights.
        channel_attention = tf.reduce_sum(fc_2, axis=1, name="channel_attention_sum")
        channel_attention = tf.nn.sigmoid(channel_attention, name="channel_attention_sum_sigmoid")
        channel_attention = tf.reshape(channel_attention,
                                       shape=[feature_map_shape[0], 1, 1, feature_map_shape[3]])
        feature_map_with_channel_attention = tf.multiply(feature_map, channel_attention)

        # Spatial attention: pool across channels, then a 7x7 conv yields per-pixel weights.
        channel_wise_avg_pooling = tf.reduce_mean(feature_map_with_channel_attention, axis=3)
        channel_wise_max_pooling = tf.reduce_max(feature_map_with_channel_attention, axis=3)
        channel_wise_avg_pooling = tf.reshape(channel_wise_avg_pooling,
                                              shape=[feature_map_shape[0], feature_map_shape[1],
                                                     feature_map_shape[2], 1])
        channel_wise_max_pooling = tf.reshape(channel_wise_max_pooling,
                                              shape=[feature_map_shape[0], feature_map_shape[1],
                                                     feature_map_shape[2], 1])
        channel_wise_pooling = tf.concat([channel_wise_avg_pooling, channel_wise_max_pooling], axis=3)
        spatial_attention = slim.conv2d(
            channel_wise_pooling,
            1, [7, 7],
            padding='SAME',
            activation_fn=tf.nn.sigmoid,
            scope="spatial_attention_conv"
        )
        feature_map_with_attention = tf.multiply(feature_map_with_channel_attention, spatial_attention)
        return feature_map_with_attention
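# Shape walk-through for the [2, 8, 8, 32] toy input used below (illustrative, not in the
# original listing):
#   spatial avg/max pooling   -> [2, 1, 1, 32] each, reshaped and concatenated to [2, 2, 32]
#   shared MLP + reduce_sum   -> [2, 32] channel weights, reshaped to [2, 1, 1, 32]
#   channel-wise avg/max pool -> [2, 8, 8, 1] each, concatenated to [2, 8, 8, 2]
#   7x7 conv + sigmoid        -> [2, 8, 8, 1] spatial weights
# The returned tensor keeps the input shape [2, 8, 8, 32].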
feature_map = tf.constant(np.random.rand(2, 8, 8, 32), dtype=tf.float16)
feature_map_with_attention = convolutional_block_attention_module(feature_map, 1)

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    result = sess.run(feature_map_with_attention)
    print(result.shape)
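# Minimal usage sketch (an assumption, not part of the original listing): dropping the module
# between two slim conv layers of a toy backbone. Layer sizes and scope names are hypothetical;
# each call needs a unique index (hence 2 and 3 here) so variable scopes do not collide with
# the "cbam_1" built above.
def toy_backbone_with_cbam(images):
    net = slim.conv2d(images, 32, [3, 3], scope="conv1")
    net = convolutional_block_attention_module(net, index=2)  # refine conv1 features
    net = slim.conv2d(net, 64, [3, 3], scope="conv2")
    net = convolutional_block_attention_module(net, index=3)  # refine conv2 features
    return net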