MyDLNote - Attention: RAM: Residual Attention Module for Single Image Super-Resolution
Published: 2021-06-23

RAM: Residual Attention Module for Single Image Super-Resolution

In these notes I try to extract the main message of each paper — neither a full translation nor a rough skim. The paper's motivation and the details of the network design are what I focus on.

Correction: I originally thought this paper was published at CVPR, but later found that it was never accepted by a formal conference. Is it that the paper's viewpoint runs against the mainstream, or that the idea is too simple for CVPR? I would appreciate pointers from experts — comments are welcome, thanks!

Abstract

Attention mechanisms are a design trend of deep neural networks that stands out in various computer vision tasks. Recently, some works have attempted to apply attention mechanisms to single image super-resolution (SR) tasks. However, they apply the mechanisms to SR in the same or similar ways used for high-level computer vision problems without much consideration of the different nature between SR and other problems. In this paper, we propose a new attention method, which is composed of new channelwise and spatial attention mechanisms optimized for SR and a new fused attention to combine them. Based on this, we propose a new residual attention module (RAM) and a SR network using RAM (SRRAM). We provide in-depth experimental analysis of different attention mechanisms in SR. It is shown that the proposed method can construct both deep and lightweight SR networks showing improved performance in comparison to existing state-of-the-art methods.

Existing attention-based SR networks have not proposed attention designs that are tailored to SR and differ from those used for high-level tasks.

The residual attention module (RAM) proposed in this paper is designed for SR, and a new attention-fusion method is proposed to better combine channel and spatial attention.

 

Introduction

Stacking an extensive amount of layers is a common practice to improve performance of deep networks [22].

While a huge size of SR networks tends to yield improved performance, this also has a limitation. Typically, most CNN-based methods internally treat all types of information equally, which may not effectively distinguish the detailed characteristics of the content (e.g., low and high frequency information). In other words, the networks have limited ability to selectively use informative features.

In the early days of deep-learning-based SR, good performance was achieved mostly by stacking deeper networks, e.g., ResNet- or DenseNet-style architectures.

But such large-scale SR networks have limitations. Typically, CNN-based SR networks treat all types of information equally, and thus cannot effectively distinguish the detailed characteristics of the content (low- vs. high-frequency information). In other words, it is hard for the network to select the most informative features out of a huge amount of information. (The less important information acts like noise and inevitably hurts performance; an attention mechanism distills the more useful information and suppresses the less useful.)

Attention mechanism allows the network to recalibrate the extracted feature maps, so that more adaptive and efficient training is possible. A few recent SR methods also employ attention mechanisms. 

It should be noted that the attention mechanisms applied to SR in these works are borrowed from other vision problems such as classification, and thus they may not be optimal for SR. Furthermore, how to combine the channel and spatial attention mechanisms effectively also remains unresolved.

Recent SR methods have started to adopt attention mechanisms. The problem is that the attention methods used in these works are borrowed directly from high-level vision tasks and may not be optimal for SR; how to effectively fuse channel and spatial attention also remains an open question.

We propose a new attention-based SR method that effectively integrates two new attention mechanisms, i.e., channel attention (CA) and spatial attention (SA). These mechanisms, which are optimized for SR, are attached to a ResNet-based structure, resulting in our proposed residual attention module (RAM) and consequently our proposed SR using RAM model (SRRAM). The proposed RAM exploits both inter- and intrachannel relationship by using the proposed CA and SA, respectively. 

The proposed RAM is designed for the SR task; through the proposed CA and SA, the network can better exploit inter-channel and intra-channel (spatial) relationships, respectively.

 

Related works

This section is summarized particularly clearly.

For mathematical formulation, we denote the input and output feature maps of an attention mechanism as X ∈ R^{H×W×C} and Xˆ ∈ R^{H×W×C} , where H, W, and C are the height, width, and number of channels of X. We also denote the sigmoid and ReLU functions as σ(·) and δ(·). For simplicity, bias terms are omitted.

This paragraph defines the notation.

The attention mechanisms can be divided into two types depending on the dimension to which they are applied: channel attention (CA) and spatial attention (SA). CA and SA can be further divided into three processes:

squeeze: it is a process to extract one or more statistics S by the channel (CA) or spatial region (SA) from X. The statistics are extracted by using pooling methods, and 1 × 1 convolution can be used for SA.

excitation: using the extracted statistics, the excitation process captures the interrelationship between channels (CA) or spatial regions (SA) and generates an attention map M, having a size of 1 × 1 × C (CA) or H × W × 1 (SA). Two fully connected (FC) layers are used for CA in all methods, which has a bottleneck structure with a reduction ratio of r. For SA, one or two convolutions are used.

scaling: to recalibrate the input feature maps, the generated attention map is normalized through a sigmoid function between a range from 0 to 1, and then used for channel or spatial-wise multiplication with X. The same scaling process is applied to all methods.

In general, CA and SA consist of the following three steps (a minimal code sketch follows this list):

1. squeeze: CA compresses each feature map into a single value (H×W×C → 1×1×C), while SA compresses the C feature maps into a single map (H×W×C → H×W×1).

2. excitation: fully connected layers capture the inter-channel relationships and produce a 1×1×C vector (CA), or convolution layers capture the spatial relationships and produce an H×W×1 map (SA).

3. scaling: a sigmoid maps the attention map into the range 0-1, and the result is multiplied element-wise with the input.
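
To make the three steps concrete, here is a minimal PyTorch sketch of a generic SE-style channel attention block following the squeeze / excitation / scaling pipeline above; the reduction ratio and channel count are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # excitation: two FC layers (implemented as 1x1 convs) with a bottleneck of ratio r
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):
        # squeeze: one statistic per channel (H x W x C -> 1 x 1 x C)
        s = x.mean(dim=(2, 3), keepdim=True)
        # excitation: capture inter-channel relationships
        m = self.excite(s)
        # scaling: sigmoid to [0, 1], then channel-wise multiplication with x
        return x * torch.sigmoid(m)

x = torch.randn(1, 64, 32, 32)
print(ChannelAttention(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```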

The following mainly compares RCAB, CBAM, and the CSAR block.

RCAB: The mechanism aims to recalibrate filter responses by exploiting inter-channel correlation. Average pooling is applied in the squeeze process.

CBAM: exploits both the inter-channel and inter-spatial relationships of feature maps through its CA and SA modules. In the CA module, the difference from RCAB is that both max pooling and average pooling are performed in the squeeze process, and the two kinds of statistics are used in the excitation process. For the SA module, the results of average and max pooling are also used in the squeeze process, generating two 2D statistics. These are concatenated and undergo the excitation process using a single 7 × 7 convolution. To combine the two attention mechanisms, CBAM performs CA and then SA sequentially.

CSAR block: includes both CA and SA. The CA is equal to that of RCAB. For SA, in contrast to CBAM, the input feature map proceeds to the excitation process without going through the squeeze process. The excitation process employs two 1 × 1 convolutions, where the first one has C × γ filters and the second one has a single filter. Here, γ is the increase ratio. While CBAM combines the two attention mechanisms sequentially, the CSAR block combines them in a parallel manner using concatenation and 1 × 1 convolution.

The figure in the paper already explains these blocks very clearly; note the position of the ReLU layers.
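
As a rough illustration of the difference in the SA designs summarized above, the sketch below contrasts a CBAM-style SA (channel squeeze via average/max pooling followed by a 7×7 convolution) with a CSAR-style SA (no squeeze, two 1×1 convolutions); module names and the increase ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CBAMSpatialAttention(nn.Module):
    """CBAM-style SA: squeeze channels with avg + max pooling, then one 7x7 conv."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)           # H x W x 1 statistic
        mx, _ = x.max(dim=1, keepdim=True)          # H x W x 1 statistic
        m = self.conv(torch.cat([avg, mx], dim=1))  # excitation on the two statistics
        return x * torch.sigmoid(m)

class CSARSpatialAttention(nn.Module):
    """CSAR-style SA: no squeeze; two 1x1 convs (C -> C*gamma -> 1)."""
    def __init__(self, channels, gamma=2):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels * gamma, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels * gamma, 1, kernel_size=1),
        )

    def forward(self, x):
        return x * torch.sigmoid(self.excite(x))

x = torch.randn(1, 64, 32, 32)
print(CBAMSpatialAttention()(x).shape, CSARSpatialAttention(64)(x).shape)
```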

Proposed methods

Network architecture

1. The first conv extracts the initial feature maps.

2. A stack of residual attention modules (RAMs).

3. A conv after the last RAM, which appears to serve as feature-map fusion.

4. A global skip connection, to preserve low-level features (i.e., details).

5. Upscaling uses sub-pixel convolution layers [CVPR 2016, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network] (shown in the right half of the corresponding figure).

6. A final conv for reconstruction. (A rough layout sketch follows this list.)
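
Below is a rough PyTorch sketch of this overall layout, with a placeholder residual block standing in for the real RAM (detailed in the next subsection); the number of blocks, channel width, and scale factor are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class RAMBlock(nn.Module):
    # placeholder residual block standing in for the real RAM (see next subsection)
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class SRRAM(nn.Module):
    def __init__(self, channels=64, num_blocks=16, scale=2):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)        # 1. initial feature extraction
        self.body = nn.Sequential(
            *[RAMBlock(channels) for _ in range(num_blocks)],   # 2. stacked RAMs
            nn.Conv2d(channels, channels, 3, padding=1),        # 3. feature-fusion conv
        )
        self.upscale = nn.Sequential(                           # 5. sub-pixel convolution
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)        # 6. reconstruction conv

    def forward(self, x):
        f = self.head(x)
        f = f + self.body(f)                                    # 4. global skip connection
        return self.tail(self.upscale(f))

sr = SRRAM()(torch.randn(1, 3, 24, 24))
print(sr.shape)  # torch.Size([1, 3, 48, 48])
```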

Residual attention module

Channel attention (CA)

The average pooling method is used in high-level computer vision problems such as image classification and object detection without modification. However, SR ultimately aims at restoring the high-frequency components of images, so a pooling statistic that reflects such detail is preferable.

When existing SR methods use CA, they directly adopt average pooling, which is not reasonable: average pooling smooths away or ignores detail information, making it suitable for high-level tasks but not for SR.

To this end, we choose to use the variance pooling rather than the average for the pooling method.

Therefore, this paper adopts variance pooling.
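
A tiny sketch contrasting the two squeeze statistics: average pooling, as used by the methods above, versus the variance pooling the paper argues for (tensor sizes are illustrative).

```python
import torch

x = torch.randn(1, 64, 32, 32)                          # B x C x H x W feature maps

avg = x.mean(dim=(2, 3), keepdim=True)                  # 1 x 64 x 1 x 1: smooths away detail
var = ((x - avg) ** 2).mean(dim=(2, 3), keepdim=True)   # 1 x 64 x 1 x 1: large for channels
                                                        # with strong edge/texture responses
```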

Spatial attention (SA)

Each channel in the feature maps has a different meaning depending on the role of the filter used. For example, some filters will extract the edge components in the horizontal direction, and some filters will extract the edge components in the vertical direction.

Each feature map reflects the role of the filter that produced it; for example, some filters extract horizontal edge components, while others extract vertical ones.

From the viewpoint of SR, the importance of the channels varies by the spatial region. For example, in the case of edges or complex textures, more detailed information, i.e., that from complex filters, is more important. On the other hand, in the case of regions having almost no high-frequency components, such as the sky or homogeneous areas of comic images, relatively less detailed information is more important and needs to be attended. In this regard, the SA map for each channel needs to be different.

For the SR task, the importance of a feature depends on the spatial region. For edges or textured regions, the complex filters that capture detail matter more, whereas for smooth regions, such as the sky or homogeneous areas in comics, relatively low-frequency information matters more. Therefore, the SA map should be different for each channel.

Therefore, unlike CBAM [30], which performs the squeeze process in its SA module, our proposed method does not squeeze information per channel, in order to preserve channel-specific characteristics. In addition, for the excitation process, in contrast to other SA mechanisms [30, 9] generating a single 2D SA map, we obtain a different SA map for each channel, M^{SA} ∈ R^{H×W×C}, using depth-wise convolution [7].

Therefore, unlike CBAM, which uses average/max pooling to squeeze all feature maps into a single map, this paper uses a depth-wise convolution to compute a separate attention map for each feature channel.
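
A minimal sketch of this idea using a depth-wise convolution (groups = C), which yields a separate H×W attention map for every channel; the kernel size is an assumption for illustration.

```python
import torch
import torch.nn as nn

channels = 64
# groups = channels makes the convolution depth-wise: one filter per channel
depthwise_sa = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)

x = torch.randn(1, channels, 32, 32)
m_sa = depthwise_sa(x)     # a separate spatial map per channel (H x W x C)
print(m_sa.shape)          # torch.Size([1, 64, 32, 32])
```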

Fused attention (FA)

The proposed CA and SA mechanisms exploit information from inter-channel and intra-channel relationships, respectively. Therefore, in order to exploit the benefits of both mechanisms simultaneously, we combine them by adding the CA and SA maps, and then the scaling process is performed using the sigmoid function, whose result is used for recalibrating the feature maps.

The FA process adds the CA and SA attention maps, passes the sum through a sigmoid to map it into 0-1, and finally multiplies the resulting C maps element-wise with the original C feature maps.

(The paper gives M^{SA} but not M^{CA}. I believe M^{CA} is the map obtained by broadcasting each element of the 1×1×C output of CA to H×W×C. I will come back and revise this after reading the code.)
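
Putting the pieces together, below is a hedged sketch of a residual block with the fused attention: the CA branch uses variance pooling, the SA branch uses a depth-wise convolution, the two maps are added (the 1×1×C CA map broadcasts over H×W, matching the note above), passed through a sigmoid, and used to rescale the features before the residual addition. The layer configuration and reduction ratio are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class RAM(nn.Module):
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # CA branch: variance pooling + bottlenecked 1x1 convs -> 1 x 1 x C map
        self.ca = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # SA branch: depth-wise conv -> one H x W map per channel
        self.sa = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x):
        f = self.body(x)
        mu = f.mean(dim=(2, 3), keepdim=True)
        var = ((f - mu) ** 2).mean(dim=(2, 3), keepdim=True)  # variance pooling
        m_ca = self.ca(var)                                   # 1 x 1 x C
        m_sa = self.sa(f)                                     # H x W x C
        att = torch.sigmoid(m_ca + m_sa)                      # fused attention (broadcast add)
        return x + f * att                                    # residual connection

y = RAM()(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```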

 

MyNote

The motivation of this paper: previous attention-based SR methods directly adopt attention models from high-level vision tasks, which is not well suited to the low-level SR problem, so the paper proposes CA and SA mechanisms designed specifically for SR.

Key characteristics of the network design:

1. CA uses variance pooling, so that the aggregated channel vector reflects edge/detail information.

2. SA uses depth-wise convolution, producing an attention map of the same spatial size for each input feature channel.

3. FA adds the CA and SA maps and then applies a sigmoid.

4. Upsampling uses sub-pixel convolution.

This kind of design should suit low-level tasks well; it is worth drawing on in my own future work.

 

Source: https://blog.csdn.net/u014546828/article/details/101415379
