Named Entity Recognition with a Bidirectional LSTM + CRF (with TensorFlow Code): NER Series (Part 4)

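This article covers how to do named entity recognition with deep learning (neural networks). The most mainstream and effective approach at present is essentially LSTM+CRF. The CRF part merely adds a transition matrix, while the extraction of all other features is left to the neural network. The feature-extraction part could of course also use a CNN, or incorporate some attention mechanism.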

Next, drawing on the blog post "Sequence Tagging with Tensorflow" and its TensorFlow code, I will walk through named entity recognition with a bidirectional LSTM + CRF.

1. Named entity recognition in brief

Named entity recognition is essentially a sequence labeling task. Here is an example:

John  lives in New   York  and works for the European Union
B-PER O     O  B-LOC I-LOC O   O     O   O   B-ORG    I-ORG

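In the CoNLL2003 task, the entity types are LOC, PER, ORG and MISC, standing for locations, persons, organizations and miscellaneous entities; all other words are tagged O. Because some entities (such as New York) consist of several words, we use a simple tagging scheme:

B- marks the beginning of an entity, and I- marks the remaining parts of it.

In the end, all we want is to assign one tag to every word in the sentence.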
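To make the B-/I- scheme concrete, here is a minimal sketch that groups a tagged sentence back into entities. The helper name `extract_entities` is made up for illustration and is not part of the original code; a stray I- tag with no matching B- is simply dropped, which is one possible convention among several.

```python
def extract_entities(words, tags):
    """Group BIO tags into (entity text, entity type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the previous one, if any.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_words.append(word)
        else:
            # "O" (or an I- tag that does not match): close the open entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


words = "John lives in New York and works for the European Union".split()
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG"]
entities = extract_entities(words, tags)
print(entities)
# [('John', 'PER'), ('New York', 'LOC'), ('European Union', 'ORG')]
```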

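2. Model

The main component of the whole model is an RNN. We break the explanation into three parts:

Word vector representation

Contextual word representation

Decoding

2.1 Word vector representation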

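Each word is represented by a word vector $w \in \mathbb{R}^n$ that captures information about the word itself. This vector is the concatenation of two parts: a GloVe-trained word vector $w_{glove} \in \mathbb{R}^{d_1}$ and a character-level vector $w_{chars} \in \mathbb{R}^{d_2}$.

In the past we would hand-craft features, for example a 1/0 flag for whether a word starts with a capital letter. In this model no manual feature engineering is needed: running a bidirectional LSTM at the character level is enough to pick up spelling-level features. A CNN or another kind of RNN could do a similar job.

For each character (case-sensitive) of a word $w = [c_1, \ldots, c_p]$, we use a vector $c_i \in \mathbb{R}^{d_3}$. We run a bi-LSTM over the character embeddings and concatenate the final hidden states (there are two of them because the network is bidirectional, as in the figure above) to get a fixed-length representation $w_{chars} \in \mathbb{R}^{d_2}$. Intuitively, this vector captures character-level features such as capitalization and spelling patterns. We then concatenate $w_{chars}$ with the pretrained GloVe vector $w_{glove}$ to obtain the final representation of the word: $w = [w_{glove}, w_{chars}] \in \mathbb{R}^n$, where $n = d_1 + d_2$.

Here is the corresponding TensorFlow implementation.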

import tensorflow as tf  # the snippets in this post target the TensorFlow 1.x API

# shape = (batch size, max length of sentence in batch)
word_ids = tf.placeholder(tf.int32, shape=[None, None])

# shape = (batch size)
sequence_lengths = tf.placeholder(tf.int32, shape=[None])

Now let's use TensorFlow's built-in functions to load the word embeddings. Assume embeddings is a numpy array of GloVe vectors, so that embeddings[i] is the vector of the i-th word.

L = tf.Variable(embeddings, dtype=tf.float32, trainable=False)
# shape = (batch, sentence, word_vector_size)
pretrained_embeddings = tf.nn.embedding_lookup(L, word_ids)

Here you should use tf.Variable with trainable=False rather than tf.constant; otherwise you may run into memory problems.

Next, let's build vectors for the characters.

# shape = (batch size, max length of sentence, max length of word)
char_ids = tf.placeholder(tf.int32, shape=[None, None, None])

# shape = (batch_size, max_length of sentence)
word_lengths = tf.placeholder(tf.int32, shape=[None, None])

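Why so many None values here?

That is up to us. In our implementation the padding is dynamic, i.e. aligned to the maximum length within each batch, so the sentence length and the word length depend on the batch.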
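As an illustration of what dynamic padding means for the feed data, here is a small pure-Python sketch. The helpers `pad_batch` and `pad_char_batch` and the use of 0 as the padding id are assumptions made for this example; they are not taken from the original code.

```python
def pad_batch(sentences, pad_id=0):
    """Pad word-id lists to the longest sentence in this batch only."""
    max_len = max(len(s) for s in sentences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sentences]
    lengths = [len(s) for s in sentences]
    return padded, lengths


def pad_char_batch(char_sentences, pad_id=0):
    """Pad char ids to (batch, max sentence length, max word length)."""
    max_sent = max(len(s) for s in char_sentences)
    max_word = max(len(w) for s in char_sentences for w in s)
    padded, lengths = [], []
    for sent in char_sentences:
        # Pad every word to the longest word in the batch.
        rows = [w + [pad_id] * (max_word - len(w)) for w in sent]
        # Add all-padding "words" so every sentence has max_sent rows.
        rows += [[pad_id] * max_word for _ in range(max_sent - len(sent))]
        padded.append(rows)
        lengths.append([len(w) for w in sent] + [0] * (max_sent - len(sent)))
    return padded, lengths


word_ids, sequence_lengths = pad_batch([[4, 7, 2], [5, 1]])
print(word_ids, sequence_lengths)
# [[4, 7, 2], [5, 1, 0]] [3, 2]

char_ids, word_lengths = pad_char_batch([[[1, 2], [3]], [[4, 5, 6]]])
print(char_ids, word_lengths)
# [[[1, 2, 0], [3, 0, 0]], [[4, 5, 6], [0, 0, 0]]] [[2, 1], [3, 0]]
```

The four returned values have the same layout as the word_ids, sequence_lengths, char_ids and word_lengths placeholders, which is why those placeholders leave the sentence and word dimensions as None.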

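Let's continue. We have no pretrained character vectors here, so we call tf.get_variable to initialize them. We also need to reshape the 4-D tensor to match the input that bidirectional_dynamic_rnn expects. The code is as follows: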

# 1. get character embeddings
K = tf.get_variable(name="char_embeddings", dtype=tf.float32,
    shape=[nchars, dim_char])
# shape = (batch, sentence, word, dim of char embeddings)
char_embeddings = tf.nn.embedding_lookup(K, char_ids)

# 2. put the time dimension on axis=1 for dynamic_rnn
s = tf.shape(char_embeddings) # store old shape
# shape = (batch x sentence, word, dim of char embeddings)
char_embeddings = tf.reshape(char_embeddings, shape=[-1, s[-2], s[-1]])
word_lengths_flat = tf.reshape(word_lengths, shape=[-1])

# 3. bi lstm on chars
cell_fw = tf.contrib.rnn.LSTMCell(char_hidden_size, state_is_tuple=True)
cell_bw = tf.contrib.rnn.LSTMCell(char_hidden_size, state_is_tuple=True)

# keep only the final hidden state h of each direction (LSTM state = (c, h))
_, ((_, output_fw), (_, output_bw)) = tf.nn.bidirectional_dynamic_rnn(cell_fw,
    cell_bw, char_embeddings, sequence_length=word_lengths_flat,
    dtype=tf.float32)
# shape = (batch x sentence, 2 x char_hidden_size)
output = tf.concat([output_fw, output_bw], axis=-1)

# shape = (batch, sentence, 2 x char_hidden_size)
char_rep = tf.reshape(output, shape=[-1, s[1], 2*char_hidden_size])

# shape = (batch, sentence, 2 x char_hidden_size + word_vector_size)
word_embeddings = tf.concat([pretrained_embeddings, char_rep], axis=-1)

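Note the use of the sequence_length argument: it lets us obtain the last valid state. For invalid (padded) time steps, dynamic_rnn passes the state through unchanged and outputs a zero vector.

2.2 Contextual word representation

Once we have the word vector $w$, we can run an LSTM or a bidirectional LSTM over every word in the sentence and obtain another vector representation $h \in \mathbb{R}^k$, as shown in the figure below:

The corresponding TensorFlow code is straightforward. This time we use the hidden-state output at every time step rather than only the final one. So for an input sentence of m words $w_1, \ldots, w_m \in \mathbb{R}^n$ we get m outputs $h_1, \ldots, h_m \in \mathbb{R}^k$, and these outputs now carry contextual information: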

cell_fw = tf.contrib.rnn.LSTMCell(hidden_size)
cell_bw = tf.contrib.rnn.LSTMCell(hidden_size)

(output_fw, output_bw), _ = tf.nn.bidirectional_dynamic_rnn(cell_fw,
    cell_bw, word_embeddings, sequence_length=sequence_lengths,
    dtype=tf.float32)

context_rep = tf.concat([output_fw, output_bw], axis=-1)

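2.3 Decoding

Finally, we assign a tag to every word. A fully connected layer does the job.

Suppose there are 9 tags in total. We then have a weight matrix $W \in \mathbb{R}^{9 \times k}$ and a bias vector $b \in \mathbb{R}^9$, and compute for each word the score vector $s = W \cdot h + b \in \mathbb{R}^9$, where $s[i]$ can be read as the score of tagging the word with the $i$-th tag. The TensorFlow implementation looks like this: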

W = tf.get_variable("W", shape=[2*hidden_size, ntags],
                dtype=tf.float32)

b = tf.get_variable("b", shape=[ntags], dtype=tf.float32,
                initializer=tf.zeros_initializer())

ntime_steps = tf.shape(context_rep)[1]
context_rep_flat = tf.reshape(context_rep, [-1, 2*hidden_size])
pred = tf.matmul(context_rep_flat, W) + b
scores = tf.reshape(pred, [-1, ntime_steps, ntags])

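Here we use zeros_initializer to initialize the bias.

Given the scores, there are two options for computing the final tags:

softmax: normalize the scores into probabilities.
Linear-chain CRF: the first option, softmax, only makes local decisions, i.e. the tag of the current word is not influenced by any other tag. In reality, the tag of a word does depend on the tags of its neighbours. Given a sequence of words $w_1, \ldots, w_m$, a sequence of score vectors $s_1, \ldots, s_m$ and a sequence of tags $y_1, \ldots, y_m$, the linear-chain CRF score is defined as: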

$\begin{aligned}C(y_1, \ldots, y_m) &= b[y_1] &+ \sum_{t=1}^{m} s_t [y_t] &+ \sum_{t=1}^{m-1} T[y_{t}, y_{t+1}] &+ e[y_m]\\&= \text{begin} &+ \text{scores} &+ \text{transitions} &+ \text{end}\end{aligned}$

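In the formula above, $T \in \mathbb{R}^{9 \times 9}$ is the transition matrix, which captures the dependencies and transitions between adjacent tags; $e, b \in \mathbb{R}^9$ are the cost vectors for ending and beginning with a given tag (this $b$ is not the bias of the fully connected layer). Below is a worked example: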
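The score formula can be checked on a toy case. The sketch below uses 2 tags and 3 words with made-up integer values (not taken from the original post), stores everything as plain Python lists, and uses 0-based indices.

```python
def crf_score(scores, tags, T, b, e):
    """C(y_1..y_m) = begin + per-word scores + transitions + end."""
    total = b[tags[0]] + e[tags[-1]]
    total += sum(scores[t][y] for t, y in enumerate(tags))
    total += sum(T[tags[t]][tags[t + 1]] for t in range(len(tags) - 1))
    return total


# Toy values, chosen only for illustration.
scores = [[1, 3], [2, 0], [0, 4]]   # scores[t][y]: score of tag y for word t
T = [[1, -1], [0, 2]]               # T[y][y2]: cost of moving from tag y to y2
b = [0, 1]                          # cost of beginning with each tag
e = [1, 0]                          # cost of ending with each tag

# Path 1 -> 0 -> 1: begin 1 + scores (3 + 2 + 4) + transitions (0 - 1) + end 0
print(crf_score(scores, [1, 0, 1], T, b, e))  # 9
# Path 1 -> 1 -> 1: begin 1 + scores (3 + 0 + 4) + transitions (2 + 2) + end 0
print(crf_score(scores, [1, 1, 1], T, b, e))  # 12
```

The second path collects less from the per-word scores but more from the transitions, which is exactly the kind of interaction between neighbouring tags that a local softmax cannot express.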

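With the CRF score in hand, two things remain:

Find the tag sequence with the highest score.
Compute the probability distribution over tag sequences for the sentence.

“Think about it: isn't that far too much computation?”

Indeed, it is a lot. In the example above, with 9 tags and a sentence of m words, there are $9^m$ possible sequences, which is far too expensive to enumerate.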

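Fortunately, the formula has a recursive structure, so dynamic programming applies. Suppose $\tilde{s}_{t+1}(y_{t+1})$ is the solution for time steps $t+1, \ldots, m$, i.e. the best score of a sequence that starts with tag $y_{t+1}$ (there are 9 possibilities at each time step). Going one step back, the solution for time steps $t, \ldots, m$ is given by:

$\begin{aligned}\tilde{s}_t(y_t) &= \max_{y_{t+1}, \ldots, y_m} C(y_t, \ldots, y_m)\\&= \max_{y_{t+1}} s_t [y_t] + T[y_{t}, y_{t+1}] + \tilde{s}_{t+1}(y_{t+1})\end{aligned}$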

Each recursive step costs $O(9 \times 9)$, and there are $m$ steps, so the total complexity is $O(9 \times 9 \times m)$.
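The backward recursion above can be sketched in a few lines of pure Python. This is an illustration under assumed conventions (plain lists, 0-based indices, toy values invented for the example), not the implementation used by tf.contrib.crf.

```python
def viterbi(scores, T, b, e):
    """Return (best tag path, best score) using the backward recursion."""
    m, n = len(scores), len(scores[0])
    # best[t][y]: best score of words t..m-1 when word t has tag y
    best = [[0] * n for _ in range(m)]
    # nxt[t][y]: tag of word t+1 on that best continuation
    nxt = [[0] * n for _ in range(m)]
    for y in range(n):
        best[m - 1][y] = scores[m - 1][y] + e[y]
    for t in range(m - 2, -1, -1):
        for y in range(n):
            cand = [T[y][y2] + best[t + 1][y2] for y2 in range(n)]
            y_best = max(range(n), key=lambda k: cand[k])
            nxt[t][y] = y_best
            best[t][y] = scores[t][y] + cand[y_best]
    # Add the begin cost and pick the best first tag.
    first = [b[y] + best[0][y] for y in range(n)]
    y = max(range(n), key=lambda k: first[k])
    path = [y]
    for t in range(m - 1):
        y = nxt[t][y]
        path.append(y)
    return path, first[path[0]]


# Toy values, chosen only for illustration: 2 tags, 3 words.
scores = [[1, 3], [2, 0], [0, 4]]
T = [[1, -1], [0, 2]]
b = [0, 1]
e = [1, 0]

path, best_score = viterbi(scores, T, b, e)
print(path, best_score)  # [1, 1, 1] 12
```

Each of the m steps fills n values and looks at n candidates for each, which is the $O(9 \times 9 \times m)$ cost from the text with n = 9.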

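Finally, we need to apply a softmax at the CRF layer to turn the scores into a probability distribution over sequences. This requires summing over all possible sequences: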

$\begin{aligned}Z = \sum_{y_1, \ldots, y_m} e^{C(y_1, \ldots, y_m)}\end{aligned}$

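The recursive idea above applies here as well. Define $Z_t(y_t)$ as the sum of the exponentiated scores of all sequences that start at time step $t$ with tag $y_t$; it satisfies: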

$\begin{aligned}Z_t(y_t) &= \sum_{y_{t+1}} e^{s_t[y_t] + T[y_{t}, y_{t+1}]} \sum_{y_{t+2}, \ldots, y_m} e^{C(y_{t+1}, \ldots, y_m)} \\&= \sum_{y_{t+1}} e^{s_t[y_t] + T[y_{t}, y_{t+1}]} \ Z_{t+1}(y_{t+1})\\\log Z_t(y_t) &= \log \sum_{y_{t+1}} e^{s_t [y_t] + T[y_{t}, y_{t+1}] + \log Z_{t+1}(y_{t+1})}\end{aligned}$

Finally, the probability of a sequence is:

$\begin{aligned}\mathbb{P}(y_1, \ldots, y_m) = \frac{e^{C(y_1, \ldots, y_m)}}{Z}\end{aligned}$
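Here is a minimal sketch of the $\log Z_t$ recursion, checked against brute-force enumeration on a toy case. The toy values, the list-based layout and the helper names are assumptions made for this example; the begin and end vectors are folded in at the first and last step.

```python
import math
from itertools import product


def crf_score(scores, tags, T, b, e):
    """C(y_1..y_m) = begin + per-word scores + transitions + end."""
    total = b[tags[0]] + e[tags[-1]]
    total += sum(scores[t][y] for t, y in enumerate(tags))
    total += sum(T[tags[t]][tags[t + 1]] for t in range(len(tags) - 1))
    return total


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    mx = max(xs)
    return mx + math.log(sum(math.exp(x - mx) for x in xs))


def log_partition(scores, T, b, e):
    """log Z computed with the backward recursion on log Z_t(y_t)."""
    m, n = len(scores), len(scores[0])
    log_z = [scores[m - 1][y] + e[y] for y in range(n)]
    for t in range(m - 2, -1, -1):
        log_z = [
            logsumexp([scores[t][y] + T[y][y2] + log_z[y2] for y2 in range(n)])
            for y in range(n)
        ]
    return logsumexp([b[y] + log_z[y] for y in range(n)])


# Toy values, chosen only for illustration: 2 tags, 3 words.
scores = [[1, 3], [2, 0], [0, 4]]
T = [[1, -1], [0, 2]]
b = [0, 1]
e = [1, 0]

log_z = log_partition(scores, T, b, e)
all_paths = list(product(range(2), repeat=3))
brute = math.log(sum(math.exp(crf_score(scores, p, T, b, e)) for p in all_paths))
probs = {p: math.exp(crf_score(scores, p, T, b, e) - log_z) for p in all_paths}
print(math.isclose(log_z, brute), math.isclose(sum(probs.values()), 1.0))
# True True
```

Working in log space avoids overflow when scores are large, which is presumably why the text gives the recursion for $\log Z_t(y_t)$ as well.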

2.4 Training

Finally, the training part. The loss function is cross-entropy, computed as:

$\begin{aligned}- \log (\mathbb{P}(\tilde{y}))\end{aligned}$

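where $\tilde{y}$ is the correct tag sequence, whose probability $\mathbb{P}$ is computed as:

CRF: $\mathbb{P}(\tilde{y}) = \frac{e^{C(\tilde{y})}}{Z}$
local softmax: $\mathbb{P}(\tilde{y}) = \prod_{t=1}^{m} p_t[\tilde{y}_t]$

“Er.. the CRF loss must be hard to compute, right..?”

It is, but someone has already done the work for you. In TensorFlow it takes a single call. The code below computes the CRF loss and also returns the matrix T, which we need for prediction: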

# shape = (batch, sentence)
labels = tf.placeholder(tf.int32, shape=[None, None], name="labels")

log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    scores, labels, sequence_lengths)

loss = tf.reduce_mean(-log_likelihood)

Computing the loss for local softmax is the classic procedure, but we need tf.sequence_mask to turn the sequence lengths into a boolean mask:

# shape = (batch, sentence)
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=scores, labels=labels)
# shape = (batch, sentence), True at real tokens and False at padding
mask = tf.sequence_mask(sequence_lengths)
# apply mask: keep only the losses of real tokens
losses = tf.boolean_mask(losses, mask)

loss = tf.reduce_mean(losses)
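To see why the mask matters, here is a pure-Python sketch of the same computation. The helpers `sequence_mask` and `masked_mean` are illustrative stand-ins written for this example, and the loss values are made up.

```python
def sequence_mask(lengths, maxlen):
    """Boolean mask: True for real tokens, False for padding."""
    return [[i < n for i in range(maxlen)] for n in lengths]


def masked_mean(losses, mask):
    """Average only the entries where the mask is True."""
    kept = [l for row, mrow in zip(losses, mask) for l, keep in zip(row, mrow) if keep]
    return sum(kept) / len(kept)


# Per-token losses for a batch of 2 sentences padded to length 3.
# The second sentence has one real token; its last two entries are padding.
losses = [[0.5, 1.5, 2.0], [1.0, 9.0, 9.0]]
mask = sequence_mask([3, 1], maxlen=3)
print(mask)                       # [[True, True, True], [True, False, False]]
print(masked_mean(losses, mask))  # 1.25
```

Without the mask, the two padded positions would enter the average and pull the loss towards values the model has no way to improve.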

Finally, define the train op:

# lr is the learning rate; loss is the CRF or softmax loss defined above
optimizer = tf.train.AdamOptimizer(lr)
train_op = optimizer.minimize(loss)

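2.5 Using the model

The final prediction step is straightforward. With local softmax, take the highest-scoring tag at each position: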

labels_pred = tf.cast(tf.argmax(scores, axis=-1), tf.int32)

For the CRF layer, we again use the dynamic programming idea described above.

# shape = (sentence, nclasses)
score = ...
viterbi_sequence, viterbi_score = tf.contrib.crf.viterbi_decode(
                                score, transition_params)

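With this code, the F1 score reaches between 90% and 91%.

3. Afterword

Most neural approaches to NER follow the same recipe: use a basic RNN or CNN model for feature extraction and put a CRF layer on top. Adding some attention can improve results slightly, and that is more or less where things plateau.

In June 2017, the Google team's paper "Attention Is All You Need" made quite an impression: without any RNN or CNN, using attention alone, it set a new state of the art on translation. So could this kind of architecture be applied to named entity recognition?

Sure enough, people have started working on this. The paper "Deep Semantic Role Labeling with Self-Attention", published in December 2017, implements a model similar to the Google one just mentioned for the SRL task, with good results. The authors have also released their code: https://github.com/XMUNLP/Tagger

It is worth studying.

Multimodal entity recognition is another direction, especially for Weibo-like corpora (which come with images), where it works better.

Code and corpus:
https://www.lookfor404.com/命名实体识别的语料和代码/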