A Detailed, Annotated Introduction to the PyTorch Seq2Seq Translation Model (Part 1) - 白红宇's Personal Blog

Published: 2021-05-14 17:11:51. Category: Featured Articles


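Seq2Seq is currently one of the mainstream deep-learning translation models. It works well for natural-language translation and even for cross-modal knowledge mapping. In recent years it has also been widely applied in software engineering, for example: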

Jiang, Siyuan, Ameer Armaly, and Collin McMillan. "Automatically generating commit messages from diffs using neural machine translation." In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, pp. 135-146. IEEE Press, 2017.

Hu, Xing, Ge Li, Xin Xia, David Lo, and Zhi Jin. "Deep code comment generation." In Proceedings of the 26th Conference on Program Comprehension, pp. 200-210. ACM, 2018.

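Here I use the Seq2Seq example code provided by PyTorch to briefly summarize the implementation details of this model and the PyTorch APIs involved. PyTorch has a tutorial on its official website, with corresponding code on GitHub, and I use that code as the running example.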

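The official tutorial page also gives a download link for the corresponding data. Many tutorials online are in fact translations of this official tutorial, and I consulted some of them as well.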

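You can use those tutorials as a starting point. I only add some supplements and explanations on top of them, so I will not give a complete walkthrough as they do, only a summary of the points I consider important. I will skip data initialization and encoding, since the existing tutorials cover that well enough, and start from the Encoder: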

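import torch
import torch.nn as nn

# device is defined earlier in the tutorial code; repeated here so the snippet runs on its own.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()  # initialize the attributes inherited from the parent class
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)  # embedding lookup table for the input words
        self.gru = nn.GRU(hidden_size, hidden_size)  # applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)  # view reshapes the existing tensor
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):
        # all-zero tensor of shape (1, 1, hidden_size), i.e. (1, 1, 256) in the tutorial
        return torch.zeros(1, 1, self.hidden_size, device=device)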

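Although it is only a few lines long, a few points deserve discussion. nn.Embedding creates the initial embedding. This embedding is randomly initialized rather than pretrained, so the vectors carry no real meaning at first. The weights are, however, ordinary learnable parameters that get updated together with the rest of the model during training. I think some articles online do not get even this point right. As for its arguments: in nn.Embedding(2, 5), the 2 means there are 2 words and the 5 means each word has dimension 5, so it is really just a 2x5 matrix. If you have 1000 words and want 100 dimensions per word, you can build the word embedding as nn.Embedding(1000, 100). You can also run the example code I put together below: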

import torch
import torch.nn as nn

word_to_ix = {'hello': 0, 'world': 1}
embeds = nn.Embedding(2, 5)  # 2 words, 5 dimensions each

hello_idx = torch.LongTensor([word_to_ix['hello']])
world_idx = torch.LongTensor([word_to_ix['world']])

hello_embed = embeds(hello_idx)
print(hello_embed)
world_embed = embeds(world_idx)
print(world_embed)

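The meaning should be clear at a glance, and it is worth running yourself. The printed values differ on every run because the initialization is random, and they have no real meaning.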
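To make the point about random initialization versus training concrete, here is a minimal sketch. The seed, the learning rate, and the toy loss are my own assumptions for illustration and do not come from the tutorial. It checks that the lookup table is a learnable parameter, that a lookup simply returns one row of that table, and that one gradient step only changes the row that was looked up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumed seed, only to make the random initialization repeatable

embeds = nn.Embedding(2, 5)  # 2 words, 5 dimensions each

# The lookup table is a learnable parameter of shape [2, 5].
print(embeds.weight.shape)
print(embeds.weight.requires_grad)

# Looking up index 1 returns row 1 of the weight matrix.
world_idx = torch.tensor([1])
world_embed = embeds(world_idx)
lookup_is_row = torch.equal(world_embed[0], embeds.weight[1])
print(lookup_is_row)

# One SGD step on a toy loss: only the row that was looked up changes.
before = embeds.weight.detach().clone()
optimizer = torch.optim.SGD(embeds.parameters(), lr=0.1)
loss = world_embed.sum()
loss.backward()
optimizer.step()

row0_unchanged = torch.equal(embeds.weight[0], before[0])
row1_changed = not torch.equal(embeds.weight[1], before[1])
print(row0_unchanged, row1_changed)
```

In the real model the loss comes from the translation objective rather than a plain sum, but the mechanism is the same: the embedding rows for words that appear in the batch are updated by backpropagation.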

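The other point is the meaning of .view(1, 1, -1). To be honest, I never fully figured it out myself, but the question has already been discussed on Stack Overflow. In short, view returns a tensor with the same data in a new shape, and -1 tells PyTorch to infer that dimension from the remaining ones.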

Here is the example given in that discussion:

import torch

a = torch.arange(1, 17)  # 16 elements, 1..16 (torch.range is deprecated)
print(a)
a = a.view(4, 4)  # reshape the 16 elements into a 4x4 matrix
print(a)
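As for why the encoder needs view(1, 1, -1) in the first place, the following sketch may help. It assumes a toy vocabulary of 10 words and an arbitrary word index 3, both made up for illustration. By default nn.GRU expects its input in the shape (seq_len, batch, input_size), so the [1, 256] result of the embedding lookup is reshaped into [1, 1, 256]: one time step, a batch of one, 256 features.

```python
import torch
import torch.nn as nn

hidden_size = 256
embedding = nn.Embedding(10, hidden_size)  # toy vocabulary of 10 words (assumed)
gru = nn.GRU(hidden_size, hidden_size)

word = torch.tensor([3])             # a single word index, shape [1]
looked_up = embedding(word)          # shape [1, 256]
embedded = looked_up.view(1, 1, -1)  # shape [1, 1, 256]; -1 is inferred as 256

hidden = torch.zeros(1, 1, hidden_size)  # same shape as initHidden() returns
output, hidden = gru(embedded, hidden)

print(looked_up.shape, embedded.shape)
print(output.shape, hidden.shape)
```

Both output and hidden come back as [1, 1, 256], which is why the hidden state can be fed straight into the next step.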

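That is all for the Encoder. Next comes the decoder with the attention mechanism. To aid understanding, I have added comments below that state the tensor dimensions at each step, which I personally find helpful: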

import torch
import torch.nn as nn
import torch.nn.functional as F

# device and MAX_LENGTH are defined earlier in the tutorial code; repeated here so the snippet runs on its own.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
MAX_LENGTH = 10  # defined as 10 in the translation task


class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size  # output_size is output_lang.n_words
        self.dropout_p = dropout_p  # dropout probability
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        # Linear transformation back to the required dimension.
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        print(input)
        print('size of input: ' + str(input.size()))
        print('size of self.embedding(input): ' + str(self.embedding(input).size()))

        embedded = self.embedding(input).view(1, 1, -1)
        print('size of embedded: ' + str(embedded.size()))

        embedded = self.dropout(embedded)
        print('size of embedded[0]: ' + str(embedded[0].size()))
        print('size of torch.cat((embedded[0], hidden[0]), 1): ' + str(torch.cat((embedded[0], hidden[0]), 1).size()))
        print('size of self.attn(torch.cat((embedded[0], hidden[0]), 1)): ' + str(self.attn(torch.cat((embedded[0], hidden[0]), 1)).size()))

        # Size of embedded: [1, 1, 256]
        # Size of embedded[0]: [1, 256]
        # Size of torch.cat((embedded[0], hidden[0]), 1): [1, 512]

        # This is where the attention weights are computed.
        # Note that torch.cat concatenates along an existing dimension; as written
        # here, it concatenates along the second dimension (dim=1).
        # torch.stack, by contrast, creates a new dimension and joins along it.
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)  # F is torch.nn.functional

        # Size of attn_weights: [1, 10]
        # Size of attn_weights.unsqueeze(0): [1, 1, 10]
        # Size of encoder_outputs: [10, 256]
        # Size of encoder_outputs.unsqueeze(0): [1, 10, 256]

        # unsqueeze: "Returns a new tensor with a dimension of size one inserted at the specified position."
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))  # bmm is essentially a batched matrix multiplication

        # Size of attn_applied: [1, 1, 256]
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        # Size of output here: [1, 512]
        print('size of output (at this location): ' + str(output.size()))
        output = self.attn_combine(output).unsqueeze(0)
        # Size of output here: [1, 1, 256]
        # print(output)
        output = F.relu(output)  # rectified linear unit, applied element-wise
        # print(output)
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        print('')
        print('------------')
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
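The attention step in forward can also be tried in isolation. The sketch below is not the tutorial's code: it replaces the embedded word, the hidden state, and the encoder outputs with random stand-in tensors (an assumption made purely to check shapes) and repeats the same three operations. It also checks that, with a batch of one, the bmm call is just the attention weights multiplied by the encoder outputs, i.e. a weighted sum over the 10 encoder positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, max_length = 256, 10

attn = nn.Linear(hidden_size * 2, max_length)
attn_combine = nn.Linear(hidden_size * 2, hidden_size)

# Random stand-ins (assumed) for the real tensors inside AttnDecoderRNN.forward.
embedded = torch.randn(1, 1, hidden_size)               # embedded input word
hidden = torch.randn(1, 1, hidden_size)                 # previous hidden state
encoder_outputs = torch.randn(max_length, hidden_size)  # one row per encoder step

# [1, 512] -> [1, 10]; softmax makes the 10 weights sum to 1.
attn_weights = F.softmax(attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

# [1, 1, 10] x [1, 10, 256] -> [1, 1, 256]
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

# [1, 512] -> [1, 256] -> [1, 1, 256]
combined = attn_combine(torch.cat((embedded[0], attn_applied[0]), 1)).unsqueeze(0)

print(attn_weights.shape, attn_weights.sum().item())
print(attn_applied.shape, combined.shape)

# With a batch of one, bmm is just a weighted sum of the encoder output rows.
manual = attn_weights @ encoder_outputs  # [1, 256]
same_as_weighted_sum = torch.allclose(attn_applied[0], manual, atol=1e-5)
print(same_as_weighted_sum)
```

Note that the attention weights here depend only on the current input word and the previous hidden state, and their length is fixed at max_length, which is why the tutorial caps sentence length at 10.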

First, dropout. The official PyTorch documentation is the best place to start.

In short: During training, randomly zeroes some of the elements of the input tensor with probability p using samples from a Bernoulli distribution. Others have also written detailed discussions and explanations of it.
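A small sketch of this behaviour, assuming p=0.5 and an all-ones input chosen only to make the effect easy to see. In training mode, each element is either zeroed or scaled by 1 / (1 - p). In evaluation mode, dropout does nothing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumed seed, only for repeatability

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

dropout.train()  # training mode: elements are randomly zeroed
y_train = dropout(x)
# Surviving elements are scaled by 1 / (1 - p) = 2, so the expected value stays the same.
train_values = set(y_train.unique().tolist())
print(train_values)
print((y_train == 0).float().mean().item())  # fraction of zeroed elements, roughly p

dropout.eval()  # evaluation mode: dropout is the identity
y_eval = dropout(x)
print(torch.equal(y_eval, x))
```

This is also why the model should be switched to eval mode before translating with it: otherwise the decoder's embedding would still be randomly zeroed at inference time.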

Next, note the meaning and role of nn.Linear. Again quoting the official documentation: Applies a linear transformation to the incoming data. As before, see my example code below:

import torch
import torch.nn as nn

m = nn.Linear(2, 3)  # maps 2 input features to 3 output features
x = torch.randn(2, 2)  # 2 samples, 2 features each
print(x)
y = m(x)  # shape [2, 3]
print(y)
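To see what the linear transformation actually computes, the following sketch compares the layer's output against the formula y = x W^T + b written out by hand. The batch size of 4 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

m = nn.Linear(2, 3)
x = torch.randn(4, 2)  # 4 samples (assumed), 2 features each

y = m(x)

# nn.Linear stores weight with shape [out_features, in_features] and bias with shape [out_features].
print(m.weight.shape, m.bias.shape)

# The same computation written out by hand.
manual = x @ m.weight.T + m.bias
matches = torch.allclose(y, manual, atol=1e-6)
print(y.shape, matches)
```

In the decoder, self.attn maps the 512-dimensional concatenation to 10 attention scores, and self.attn_combine maps another 512-dimensional concatenation back to 256 dimensions, in exactly this way.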

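Next, torch.bmm. According to the official PyTorch documentation, torch.bmm "Performs a batch matrix-matrix product of matrices stored in batch1 and batch2". That explanation is still rather abstract, but an example makes it easy to understand. It is simply a batched matrix multiplication: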

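import torch

batch1 = torch.randn(2, 3, 4)  # batch of 2 matrices, each 3x4
print(batch1)
batch2 = torch.randn(2, 4, 5)  # batch of 2 matrices, each 4x5
print(batch2)
res = torch.bmm(batch1, batch2)  # batch of 2 matrices, each 3x5
print(res)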

The multiplication rule is: If batch1 is a (b×n×m) tensor, batch2 is a (b×m×p) tensor, out will be a (b×n×p) tensor.
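Put differently, bmm multiplies the i-th matrix of batch1 with the i-th matrix of batch2, for every i. A short sketch that checks this against an explicit loop:

```python
import torch

batch1 = torch.randn(2, 3, 4)
batch2 = torch.randn(2, 4, 5)

res = torch.bmm(batch1, batch2)

# Multiply each pair of matrices separately and stack the results.
looped = torch.stack([batch1[i] @ batch2[i] for i in range(batch1.size(0))])

same = torch.allclose(res, looped, atol=1e-6)
print(res.shape, same)
```

In the decoder above the batch size b is 1, which is why both arguments are first given a leading dimension with unsqueeze(0).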

As for torch.cat, here is a brief illustration using the example from the official PyTorch documentation:

Concatenates the given sequence of seq tensors in the given dimension. The example:

import torch

x = torch.randn(2, 3)
print(x)
print(torch.cat((x, x, x), 0))  # concatenate along rows: shape [6, 3]
print(torch.cat((x, x, x), 1))  # concatenate along columns: shape [2, 9]
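The decoder comments contrast torch.cat with torch.stack. Looking at the shapes alone makes the difference clear. This is a minimal sketch using the same 2x3 tensor:

```python
import torch

x = torch.randn(2, 3)

cat0 = torch.cat((x, x, x), 0)      # joins along existing dim 0
cat1 = torch.cat((x, x, x), 1)      # joins along existing dim 1
stacked = torch.stack((x, x, x), 0) # creates a new leading dimension

print(cat0.shape)     # torch.Size([6, 3])
print(cat1.shape)     # torch.Size([2, 9])
print(stacked.shape)  # torch.Size([3, 2, 3])
```

torch.cat keeps the number of dimensions and grows one of them, while torch.stack adds a dimension whose size is the number of tensors.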

I will stop here for now and continue the summary in the next post.
