LEA: Improving Sentence Similarity Robustness to Typos Using Lexical Attention Bias (Paper Reading Notes)


KDD 2023, original paper

Introduction#

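Textual noise, such as typos, misspellings, and abbreviations, hurts Transformer-based models. There are two main reasons:
1. The Transformer architecture does not use character-level information.
2. The token distribution shift caused by noise makes it harder to associate tokens that refer to the same concept.
Previous work on handling noise relies mainly on data augmentation, that is, training on a training set into which similar typos and misspellings have been injected.
Data augmentation does make models more robust on corrupted (noisy) samples.
Although this strategy has been shown to mitigate the token distribution shift to some extent, all of these methods remain limited by the loss of character information during tokenization.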
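
The typo-based data augmentation mentioned above can be sketched roughly as follows. This is only a minimal illustration, not the augmentation used in the paper or in prior work: the helper names `swap_typo` and `augment`, the choice of an adjacent-character swap as the typo type, and the per-word probability `prob` are all assumptions made here.

```python
import random


def swap_typo(word, rng):
    """Swap two adjacent characters at a random position (one kind of typo)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(sentence, prob, rng):
    """Corrupt each whitespace-separated word with probability `prob`."""
    return " ".join(
        swap_typo(w, rng) if rng.random() < prob else w
        for w in sentence.split()
    )


rng = random.Random(0)  # fixed seed so the corruption is reproducible
clean = "wireless bluetooth keyboard"
noisy = augment(clean, 1.0, rng)  # prob=1.0 corrupts every word
```

Each corrupted word keeps the same characters in a different order, which is exactly the kind of change that can alter the tokenization of a word while leaving its character content untouched.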

Approach#

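The paper adds a Lexical-aware Attention module (LEA) to the self-attention mechanism.
LEA takes into account the character-level relations between words across the two sentences, which the paper argues is key for sentence similarity tasks, especially in the presence of typos.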

Self-attention#

Let the input to self-attention be $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ and the output be $Z = \{z_{1}, z_{2}, \dots, z_{n}\}$. The representation of each output token is computed as:

$$z_{i} = \sum_{j = 1}^{n} a_{i j} (x_{j} W^{V}), \quad z_{i} \in \mathbb{R}^{d_{h}}. \tag{1}$$

The attention weights $a_{i j}$ are computed as:

$$a_{i j} = \frac{\exp (e_{i j})}{\sum_{k = 1}^{n} \exp (e_{i k})}, \tag{2}$$

where

$$e_{i j} = \frac{(x_{i} W^{Q}) (x_{j} W^{K})^{T}}{\sqrt{d_{h}}}. \tag{3}$$
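
Equations (1) to (3) can be traced step by step in a small single-head sketch. It uses plain Python lists rather than a tensor library, and the function names (`matvec`, `softmax`, `self_attention`) and the toy identity weights are assumptions chosen for illustration, not code from the paper.

```python
import math


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[k] * W[k][c] for k in range(len(x))) for c in range(len(W[0]))]


def softmax(row):
    """Numerically stable softmax over one row of scores (Eq. 2)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def self_attention(X, Wq, Wk, Wv):
    """Return scores E (Eq. 3), weights A (Eq. 2) and outputs Z (Eq. 1)."""
    Q = [matvec(x, Wq) for x in X]
    K = [matvec(x, Wk) for x in X]
    V = [matvec(x, Wv) for x in X]
    n, d_h = len(X), len(Q[0])
    E = [
        [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_h) for j in range(n)]
        for i in range(n)
    ]
    A = [softmax(row) for row in E]
    Z = [
        [sum(A[i][j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        for i in range(n)
    ]
    return E, A, Z


# Toy input: two tokens with 2-dimensional features and identity projections.
X = [[1.0, 0.0], [0.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
E, A, Z = self_attention(X, I2, I2, I2)
```

The score matrix `E` is the quantity that the lexical attention bias modifies, before the softmax is applied.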
Lexical attention bias#

For textual similarity, the two sentences are concatenated:

$$X_{c} = X_{l} \,|\, X_{r} \tag{4}$$

Following the approach of relative position embeddings, the self-attention score $e_{i j}$ is modified as:

$$\tilde{e}_{i j} = e_{i j} + \alpha \, l_{i j} W^{L}, \tag{5}$$

where the second term is the lexical bias. $W^{L} \in \mathbb{R}^{d^{L} \times 1}$ is a trainable parameter, $l_{i j} \in \mathbb{R}^{1 \times d^{L}}$ is the pairwise lexical attention embedding, and $\alpha$ is a fixed scale factor that is computed automatically once at the start of training from the magnitudes of the two terms.
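
As a rough sketch of Eq. (5), the bias is a scalar per token pair that is added to the attention score before the softmax. The function name `add_lexical_bias` and all numeric values below are made up for illustration, and `alpha` is simply passed in, since the notes do not spell out how the paper computes it.

```python
def add_lexical_bias(E, L, W_L, alpha):
    """Return E~ with E~[i][j] = E[i][j] + alpha * (L[i][j] . W_L), as in Eq. (5).

    E     : n x n attention scores e_ij
    L     : n x n x d_L pairwise lexical embeddings l_ij
    W_L   : length-d_L projection (trainable in the paper, fixed here)
    alpha : fixed scale factor (assumed to be given)
    """
    n = len(E)
    return [
        [E[i][j] + alpha * sum(l * w for l, w in zip(L[i][j], W_L)) for j in range(n)]
        for i in range(n)
    ]


# Toy values, not taken from the paper.
E = [[0.5, 0.1], [0.2, 0.4]]
L = [
    [[0.0, 0.0], [1.0, 2.0]],
    [[1.0, 2.0], [0.0, 0.0]],
]
W_L = [0.5, 0.25]
E_tilde = add_lexical_bias(E, L, W_L, alpha=2.0)
```

Because the bias is added to the scores and not to the token embeddings, it changes only how much attention token $i$ pays to token $j$.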

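To compute the pairwise lexical attention embedding, the similarity between words is first computed across the two sentences, while the similarity between words within the same sentence is set to 0:

$$s_{i j} = \begin{cases} 0, & \text{if } x_{i}, x_{j} \in X_{l} \text{ or } x_{i}, x_{j} \in X_{r}, \\ \mathrm{Sim}(w(x_{i}), w(x_{j})), & \text{otherwise,} \end{cases} \tag{6}$$

where Sim is a metric that measures the string similarity between two words.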
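
The construction of $s_{i j}$ in Eq. (6) can be sketched as below. Two simplifications are assumed here and are not taken from the paper: every word is treated as a single token (so $w(x_{i})$ is just the word itself), and Sim is a Jaccard coefficient over the sets of characters of the two words. The function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient over the character sets of two words (assumed Sim)."""
    A, B = set(a), set(b)
    if not A and not B:
        return 1.0  # convention chosen here for two empty strings
    return len(A & B) / len(A | B)


def pairwise_similarity(words_l, words_r, sim):
    """Build s_ij over the concatenation X_l | X_r following Eq. (6)."""
    words = words_l + words_r
    n_l, n = len(words_l), len(words)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_sentence = (i < n_l) == (j < n_l)
            if not same_sentence:
                # Only cross-sentence pairs receive a similarity value.
                S[i][j] = sim(words[i], words[j])
    return S


left = ["wireless", "keyboard"]
right = ["wirelses", "keybord"]  # the same words with typos
S = pairwise_similarity(left, right, jaccard)
```

In this toy pair, the entries that link each word to its misspelled counterpart are close to 1, while all within-sentence entries stay at 0.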

Then $s_{i j}$ is plugged into the sinusoidal functions used in the Transformer to obtain an embedding that represents the word similarity:

$$\begin{aligned} l_{i j}^{(s_{i j}, 2 p)} &= \sin \left( \frac{2 \pi \cdot s_{i j}}{\beta^{2 p / d_{h}}} \right), \\ l_{i j}^{(s_{i j}, 2 p + 1)} &= \cos \left( \frac{2 \pi \cdot s_{i j}}{\beta^{2 p / d_{h}}} \right). \end{aligned} \tag{7}$$

The final lexical similarity embedding $l_{i j}$ is the concatenation of the two vectors above.
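
A minimal sketch of Eq. (7) is given below, under assumptions that these notes do not confirm: the index $p$ runs over `d` frequencies, `beta` defaults to 10000 as in the original Transformer positional encoding, and the sine and cosine vectors are concatenated rather than interleaved. The function name `lexical_embedding` is hypothetical.

```python
import math


def lexical_embedding(s, d, beta=10000.0):
    """Map a similarity s in [0, 1] to a vector of length 2 * d.

    The first d entries are the sine terms of Eq. (7) and the last d entries
    are the cosine terms; d and beta are assumed values.
    """
    angles = [2 * math.pi * s / beta ** (2 * p / d) for p in range(d)]
    sines = [math.sin(a) for a in angles]
    cosines = [math.cos(a) for a in angles]
    return sines + cosines


emb_zero = lexical_embedding(0.0, 4)     # pair with no lexical similarity
emb_quarter = lexical_embedding(0.25, 4)
```

Note that a similarity of 0 does not map to the zero vector, because the cosine terms equal 1. The bias for such pairs is therefore determined by the learned projection $W^{L}$.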

Implementation details#

The paper uses the Jaccard coefficient as the similarity metric.
The lexical attention bias is added only in the second half of the layers of the architecture.

Experiment#

Performance#

Impact of the lexical similarity choice#

This experiment analyzes the performance of BERT-Medium on the Abt-Buy dataset with different similarity metrics.
The metrics compared are Jaccard (Jac.), Smith-Waterman (Smith), Longest Common Subsequence (LCS), Levenshtein (Lev.), and Jaro–Winkler (Jaro).

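The Jaccard similarity coefficient is order-agnostic, so it is more robust to character swaps.
Jaccard provides higher separability between word pairs with and without typos, which is beneficial for short texts.
However, as sentence length increases, the probability that the compared words share similar characters but differ in meaning also increases, which reduces the benefit of swap invariance.

Jaccard similarity coefficient: the ratio of the size of the intersection of sets A and B to the size of their union, $J(A, B) = |A \cap B| / |A \cup B|$.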
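
The order-agnostic behaviour can be checked with a small comparison between a character-set Jaccard coefficient and a normalized Levenshtein similarity. Both definitions are assumptions for illustration: the notes do not say whether the paper applies Jaccard to single characters or to n-grams, and the normalization `1 - distance / max_length` is one common choice among several.

```python
def jaccard(a, b):
    """Jaccard coefficient over character sets: |A & B| / |A | B|."""
    A, B = set(a), set(b)
    if not A and not B:
        return 1.0  # convention chosen here for two empty strings
    return len(A & B) / len(A | B)


def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_sim(a, b):
    """Normalized similarity in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


swap_jac = jaccard("keyboard", "kyeboard")          # 1.0: the swap is invisible
swap_lev = levenshtein_sim("keyboard", "kyeboard")  # 0.75: the swap costs two edits
anagram_jac = jaccard("listen", "silent")           # 1.0 although the words differ
```

The last line illustrates the downside noted above: words with the same characters but different meanings also get the maximum score.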

LEA on different layers and sharing strategy#

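The paper argues that the character-level similarity provided by LEA can be regarded as high-level interaction information.
It therefore complements the deeper Transformer layers with high-level features.
The paper does not verify this hypothesis.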

Impact of the noise strength#

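Intuitively, since the character-level similarities used by LEA are not learned during training, the information they give the model depends less on the amount of noise.

Figure 3 (bottom) shows that as the number of typos increases, the gap between LEA and the plain data augmentation model widens, which suggests that LEA generalizes better to different noise strengths.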

Additional experiments#

Larger model#

1. BERT-Large

2. GPT-like models

Larger dataset#

BERT-M + DA outperforms LEA on WDC-Comp.XL, but with a larger standard deviation.
Original post: https://www.cnblogs.com/jjvv/p/17547762.html
