<style>
.red {
color: red;
}
.blue{
color: blue;
}
.green{
color: green;
}
</style>
# [Unbalanced Optimal Transport for Unbalanced Word Alignment](https://arxiv.org/abs/2306.04116)
:::danger
**Github:** https://github.com/yukiar/OTAlign
:::
## 1. Introduction
1. Monolingual word alignment identifies semantically corresponding words in a sentence pair.
2. Its ability to identify redundant information in sentences also makes it useful for summarisation and sentence fusion.
3. In addition, alignment information is valuable for the interpretability of model predictions and for interactive document exploration.
:::info
"Monolingual word alignment can identify redundant information in sentences, which is also useful for summarisation and sentence fusion."
Concretely:
1. The goal of monolingual word alignment is to find semantically corresponding words in a pair of sentences.
2. Words that have no counterpart in the other sentence are marked as "null alignments".
3. Null-aligned words signal information that is redundant or superfluous for the meaning shared by the pair.
4. This ability to automatically flag redundant information is useful for summarisation and sentence fusion.
5. In summarisation, we want to distil the core content of a document and drop superfluous modifiers and filler words.
6. By inspecting the alignment between each sentence and the summary, we can find which words or phrases are redundant and remove them to produce a more concise summary.
7. In sentence fusion, several semantically similar sentences must be merged into one concise, complete sentence.
8. Word alignments across these sentences reveal which parts are redundant and which form the core, guiding what to delete and how to fuse.
In short, beyond its direct applications, monolingual word alignment's ability to flag redundant information makes it valuable for summarisation, sentence fusion, and related tasks.
:::
### The challenges of monolingual word alignment:

**1. Null Alignment**
- where <span class='red'>words may not have corresponding counterparts</span>, which causes alignment asymmetry.
- Null alignment is prevalent in semantically divergent sentences; indeed, the null alignment ratio reaches 63.8% in entailment sentence pairs used in our experiments.
:::info
This means that null alignment is prevalent between semantically divergent sentences; indeed, the null alignment ratio reaches 63.8% in the entailment sentence pairs used in the experiments.
Here:
1. Null alignment refers to a source word having no corresponding target word in the alignment matrix, i.e., the word aligns to the null word.
2. Semantically divergent sentences are pairs whose meanings differ substantially, such as entailment pairs.
3. <span class='red'>Entailment sentence pairs</span> consist of a premise and a hypothesis; such pairs often diverge semantically.
4. The null alignment ratio is the proportion of null-aligned words among all words.
This ratio reaches 63.8% in the entailment dataset used in the experiments, showing that such pairs contain a large amount of null alignment and highlighting the asymmetry between the sentences. This is exactly the situation monolingual word alignment must handle, and optimal transport addresses it by softening the constraints to permit null alignment.
:::
**2. Alignment beyond one-to-one mapping needs to be addressed**
- These challenges constitute an unbalanced word alignment problem, where both word-by-word and null alignment should be fully identified.
:::info
Beyond null alignment, word alignment must also handle one-to-many and many-to-many alignments, which are likewise common in semantically divergent sentence pairs. To identify all word correspondences, the paper treats word alignment as an "unbalanced alignment" problem.
Specifically:
1. Alignment beyond one-to-one mapping means one-to-many or many-to-many alignment, e.g., one source word corresponding to several target words, or several source words corresponding to the same set of target words.
2. Such complex alignments also appear frequently in semantically divergent sentences, adding to the difficulty of word alignment.
3. <span class='red'>Fully identifying all word correspondences requires a method that handles null alignment and complex (one-to-many, many-to-many) alignment at the same time; the paper calls this need the unbalanced word alignment problem.</span>
4. By adjusting its constraints, optimal transport can permit both null alignment and one-to-many / many-to-many alignment, naturally solving this unbalanced alignment problem.
:::
### Family of Optimal Transport (OT)
- This study reveals that a family of optimal transport (OT) problems are suitable tools for unbalanced word alignment.
- Among the OT problems, <span class='red'>balanced OT (BOT)</span> is the most prominent in natural language processing (NLP); it can handle many-to-many alignment.
- In contrast to BOT, which is unable to deal with null alignment, <span class='red'>partial OT (POT)</span> and unbalanced OT (UOT) can handle the asymmetry desired in unbalanced word alignment, and have accordingly attracted applications where null alignment is non-negligible.
- This is the first study that connects the two: unbalanced word alignment and the family of OT problems that can naturally address null alignment as well as many-to-many alignment.
- We empirically:
1. Demonstrate that the OT-based methods are natural and sufficiently powerful approaches to unbalanced word alignment without tailor-made techniques.
2. Deliver a comprehensive picture that unveils the characteristics of BOT, POT, and UOT on unbalanced word alignment with different null alignment ratios.
## 2. Problem Definition
Suppose we have a source and target sentence pair $s = \{s_1, s_2, \dots, s_n\}$ and $t = \{t_1, t_2, \dots, t_m\}$ with their word embeddings $\{\mathbf{s}_1, \mathbf{s}_2, \dots, \mathbf{s}_n\}$ and $\{\mathbf{t}_1, \mathbf{t}_2, \dots, \mathbf{t}_m\}$, respectively, where $\mathbf{s}_i, \mathbf{t}_j \in \mathbb{R}^d$. <span class='red'>The goal of monolingual word alignment is to identify an alignment $P \in \mathbb{R}^{n \times m}_+$ between semantically corresponding words, where $P_{i,j}$ indicates a likelihood or binary indicator of aligning $s_i$ and $t_j$.</span>
:::info
The problem definition, step by step:
1. We are given a source sentence $s = \{s_1, s_2, \dots, s_n\}$ and a target sentence $t = \{t_1, t_2, \dots, t_m\}$.
2. Each word $s_i$ and $t_j$ comes with a word embedding, $\{\mathbf{s}_1, \dots, \mathbf{s}_n\}$ and $\{\mathbf{t}_1, \dots, \mathbf{t}_m\}$.
3. The goal of monolingual word alignment is to identify an alignment matrix $P \in \mathbb{R}^{n \times m}_+$ between semantically corresponding words.
4. Each entry $P_{i,j}$ is a likelihood, or a binary indicator (0 or 1), that $s_i$ aligns with $t_j$.
5. The higher $P_{i,j}$ (the closer to 1), the more $s_i$ and $t_j$ correspond semantically and the more they should be aligned; the lower it is (the closer to 0), the less they should be aligned.
In short, the task is to learn an alignment matrix $P$ whose entries correctly reflect the semantic correspondence between source and target words.
:::
### Evaluation Metrics for Unbalanced Alignment

:::info
source sentence: s = {s1, s2, s3, s4}
target sentence: t = {t1, t2, t3}
Human-annotated ground-truth alignments:
- s1 aligns with t1 (word-word alignment)
- s2 aligns with t2 (word-word alignment)
- s3 is null-aligned
- s4 is null-aligned
Model predictions:
- s1 aligns with t1 (correct)
- s2 is null-aligned (wrong)
- s3 aligns with t3 (wrong)
- s4 is null-aligned (correct)
Then:
- gold word-word alignments = 2, gold null alignments = 2
- predicted word-word alignments = 2 (s1–t1 and s3–t3), predicted null alignments = 2
- correct word-word alignments = 1 (s1–t1), correct null alignments = 1 (s4)
precision = (correct word-word + correct null) / (predicted word-word + predicted null)
= (1 + 1) / (2 + 2) = 2/4 = 50%
recall = (correct word-word + correct null) / (gold word-word + gold null)
= (1 + 1) / (2 + 2) = 2/4 = 50%
So under this metric, the example alignment scores 50% precision and 50% recall.
:::
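A minimal sketch of this metric, assuming alignments are encoded as sets of (source, target) pairs with `None` standing in for the null word; the encoding and function name are illustrative, not the paper's evaluation code.
```python
# Precision/recall over word-word and null links, as in the example above.
def alignment_pr(gold, pred):
    correct = len(gold & pred)        # counts word-word and null links alike
    return correct / len(pred), correct / len(gold)

gold = {("s1", "t1"), ("s2", "t2"), ("s3", None), ("s4", None)}
pred = {("s1", "t1"), ("s2", None), ("s3", "t3"), ("s4", None)}
print(alignment_pr(gold, pred))       # (0.5, 0.5)
```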
## 3. Background: Optimal Transport
- OT seeks the most efficient way to move mass from one measure to another. Remarkably, the OT problem induces the OT mapping that indicates correspondences between the samples.
:::info
This explains the goal and the key property of optimal transport (OT):
*OT seeks the most efficient way to move mass from one measure to another.*
Here "mass" and "measure" describe how the data samples are distributed. Picture two piles of goods; OT finds the least-effort way to move one pile onto the other.
*Remarkably, the OT problem induces the OT mapping that indicates correspondences between the samples.*
Solving the OT problem simultaneously yields a mapping that tells us how the samples in one pile correspond to the samples in the other, which is very useful for analysing the relations between samples.
In short, OT not only finds the cheapest transport plan but also produces a correspondence mapping between samples; this is the distinctive, practical property exploited here.
:::
:::info
- Word alignment can be viewed as optimally transporting the "word mass" of the source sentence onto the words of the target sentence.
- Optimal transport studies exactly this problem: given two distributions (the word distributions of the source and target sentences) and a cost function (the semantic distance between words), find the transport plan with the minimum total cost.
- Treating the source words as one distribution, the target words as another, and the semantic distance between them as the cost function, monolingual word alignment can thus be modelled as an optimal transport problem.
:::
- Formally, the inputs to the optimal transport problem are a **cost function** and **a pair of measures**.
1. A <span class='red'>cost means a dissimilarity between $s_i$ and $t_j$ (source and target words)</span> computed by a distance metric $c : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$, such as the Euclidean and cosine distances.
2. The **cost matrix** $C \in \mathbb{R}^{n \times m}_+$ summarises the costs of all word pairs, that is, $C_{i,j} = c(\mathbf{s}_i, \mathbf{t}_j)$.
3. A **measure means a weight each word has**. The concept of measure corresponds to **fertility** introduced in IBM Model 3, <span class='red'>which defines how many target (source) words a source (target) word can align with.</span>
:::success
- In summary, the mass of words in s and t is represented as arbitrary measures $a \in \mathbb{R}^n_+$ and $b \in \mathbb{R}^m_+$, respectively. Finally, the OT problem identifies an alignment matrix $P$ with which the sum of alignment costs is minimised under the cost matrix $C$:
$$\min_P \ \langle C, P \rangle := \sum_{i,j} C_{i,j} P_{i,j},$$
where the feasible set for $P$ depends on the OT variant (§3.1–3.2).
:::
:::info
*In summary, the mass of words in s and t is represented as arbitrary measures a ∈ R$^n_+$ and b ∈ R$^m_+$, respectively.*
This summarises the notion of "measure": the importance of each word in the source sentence *s* is represented by a non-negative real vector a of length n (the number of source words), and likewise the importance of each word in the target sentence *t* by a non-negative vector b of length m. This "importance" (mass) can be viewed as a word weight.
*Finally, the OT problem identifies an alignment matrix P with which the sum of alignment costs is minimised under the cost matrix C.*
The OT problem seeks an alignment matrix P that minimises the total alignment cost under the cost matrix C. P records how each word in *s* corresponds to each word in *t*, while C gives the cost of aligning each source word to each target word (typically a vector distance serving as a semantic distance).
So, given the word embeddings of the two sentences and the importance of each word, the OT problem finds the word-to-word correspondence with the minimum total cost, recorded in the alignment matrix P.
:::
### 3.1 Balanced Optimal Transport
1. BOT assumes that $a$ and $b$ are probability distributions, i.e., $a \in \Sigma_n$ and $b \in \Sigma_m$, respectively,
2. where $\Sigma_n$ is the probability simplex: $\Sigma_n := \{p \in \mathbb{R}^n_+ : \sum_i p_i = 1\}$. The constraint set and objective are:
$$U_b(a, b) := \{P \in \mathbb{R}^{n \times m}_+ : P\mathbb{1}_m = a,\ P^\top \mathbb{1}_n = b\}, \qquad \min_{P \in U_b(a,b)} \langle C, P \rangle.$$
:::info
The mathematics of §3.1 Balanced Optimal Transport, step by step.
Notation:
- $\mathbb{R}^n_+$: the set of n-dimensional non-negative real vectors
- $\Sigma_n$: the n-dimensional probability simplex (all entries non-negative, summing to 1)
- $\mathbb{1}_n$: an n-dimensional all-ones vector
Balanced OT (BOT) assumes a and b are probability distributions, i.e., $a \in \Sigma_n$ and $b \in \Sigma_m$.
In this case, the alignment matrix P must satisfy
$$U_b(a, b) := \{P \in \mathbb{R}^{n \times m}_+ : P\mathbb{1}_m = a,\ P^\top \mathbb{1}_n = b\}.$$
Interpretation:
- $P\mathbb{1}_m = a$ means each row of P sums to the corresponding entry of a: the total alignment mass of each source word equals its "importance".
- $P^\top\mathbb{1}_n = b$ means each column of P sums to the corresponding entry of b: the total mass received by each target word equals its "importance".
This guarantees that the importance of every source and target word is fully used, with nothing left over.
Under this constraint set, BOT is the linear programme
$$\min_{P \in U_b(a,b)} \langle C, P \rangle, \quad \text{where } \langle C, P \rangle := \sum_{i,j} P_{i,j} C_{i,j}$$
is the total cost over all aligned word pairs. BOT thus finds the alignment that minimises the total cost while exactly matching every word's importance.
:::
:::info
Expanding the constraint $P\mathbb{1}_m = a$: let the source sentence *s* contain n words $\{s_1, \dots, s_n\}$ with importance vector $a = (a_1, \dots, a_n)$, and let P be the $n \times m$ alignment matrix (n source words, m target words). The i-th row $P_i = (P_{i1}, P_{i2}, \dots, P_{im})$ gives the alignment mass between the source word $s_i$ and every target word. Then $P\mathbb{1}_m = a$ expands to
$$P_{i1} + P_{i2} + \cdots + P_{im} = a_i \quad \text{for every } i,$$
i.e., the i-th row of P sums to the i-th entry of a. This guarantees that each source word $s_i$ distributes exactly its own importance $a_i$ over the target words: nothing is left over and nothing is wasted.
:::
3. Under this constraint set, the BOT problem is a linear programming (LP) problem and can be solved by standard LP solvers.
4. However, <span class='red'>the non-differentiable nature makes it challenging to integrate into neural models</span>.
### Regularised BOT
- The **entropy-regularised optimal transport**, initially aimed at improving the computational speed of BOT, is a differentiable alternative and thus can be directly integrated into neural models:
$$L^\varepsilon_C(a, b) := \min_{P \in U_b(a,b)} \langle C, P \rangle + \varepsilon H(P), \qquad H(P) := \sum_{i,j} P_{i,j} (\log P_{i,j} - 1).$$
:::info
1. Function H is the negative entropy of alignment matrix P.
2. ε controls the strength of the regularisation; with sufficiently small ε, $L^\varepsilon_C$ well approximates the exact BOT.
3. The optimisation problem can be efficiently solved using the Sinkhorn algorithm.
:::
:::success
Section 3.1 covers balanced optimal transport (BOT). Main points:
1. BOT assumes the source mass distribution a and the target mass distribution b are both probability distributions: all entries non-negative and summing to 1.
2. The optimal alignment matrix P sought by BOT must satisfy:
- Each row of P sums to the corresponding entry of a: all the mass of each source word is transported to the target sentence.
- Each column of P sums to the corresponding entry of b: all the mass of each target word comes from the source sentence.
3. Under these constraints, BOT is a linear programming problem solvable by standard LP solvers.
A simple example:
The source sentence s has two words with mass distribution a = [0.4, 0.6]; the target sentence t has three words with mass distribution b = [0.3, 0.5, 0.2].
One feasible BOT alignment is
P = [[0.3, 0.1, 0.0],
[0.0, 0.4, 0.2]]
whose row sums equal a and whose column sums equal b.
Here the first source word aligns with the first and second target words, and the second source word aligns with the second and third target words, so the mass of both sentences is fully balanced.
:::
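As a concrete illustration, here is a minimal sketch of exact and entropy-regularised BOT on the toy example above, using the POT library (`pip install pot`). The cost matrix values are made up for illustration; this is not the paper's code.
```python
import numpy as np
import ot  # POT: Python Optimal Transport

a = np.array([0.4, 0.6])            # source word masses (probability distribution)
b = np.array([0.3, 0.5, 0.2])       # target word masses (probability distribution)
C = np.array([[0.1, 0.6, 0.8],      # hypothetical word-pair costs in [0, 1]
              [0.7, 0.2, 0.3]])

P_exact = ot.emd(a, b, C)                  # exact BOT via linear programming
P_reg = ot.sinkhorn(a, b, C, reg=0.1)      # regularised BOT via Sinkhorn (differentiable)

# Both plans satisfy the BOT marginal constraints P @ 1 = a and P.T @ 1 = b.
assert np.allclose(P_exact.sum(axis=1), a) and np.allclose(P_exact.sum(axis=0), b)
print(P_exact)  # sparse: a vertex of the LP polytope
print(P_reg)    # dense: every entry non-zero, motivating the thresholding of Sec. 4.2
```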
### 3.2 Relaxation of BOT Constraint
1. Despite its success, <span class='red'>the hard constraint $U_b$ of the BOT that aligns all words often makes it sub-optimal for some alignment problems</span> where null alignment is more or less pervasive, such as unbalanced word alignment.
2. POT and UOT introduced subsequently relax this constraint to allow certain words to be left unaligned.
### Partial Optimal Transport
- POT relaxes BOT by **aligning only a fraction m of the fertility**. A standard formulation (consistent with the constraints described below) is:
$$\min_{P \in U_p(a,b,m)} \langle C, P \rangle, \qquad U_p(a,b,m) := \{P \in \mathbb{R}^{n \times m}_+ : P\mathbb{1}_m \le a,\ P^\top\mathbb{1}_n \le b,\ \mathbb{1}_n^\top P \mathbb{1}_m = m\}.$$
:::info
1. The fraction m is bound as m ≤ min(∥a∥1, ∥b∥1), where ∥ · ∥1 represents the L1 norm.
2. While POT can be solved by standard LP solvers, it can also be regularised as in the BOT and solved with the Sinkhorn algorithm.
:::
:::info
*The fraction m is bound as m ≤ min(∥a∥1, ∥b∥1)*: the fraction m is constrained by
$$m \le \min(\|a\|_1, \|b\|_1),$$
where min(x, y) takes the smaller of x and y, and $\|x\|_1$ is the L1 norm of x, the sum of the absolute values of its entries: $\|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|$ for an n-dimensional vector x.
Here $\|a\|_1$ is the total importance of all source words and $\|b\|_1$ the total importance of all target words, so the bound says the transported mass m cannot exceed the smaller of the two totals. Because partial OT allows some words to remain unaligned, the total aligned mass m must not exceed the smaller of the available totals; this guarantees that no word's importance is over-used during alignment.
:::
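A minimal sketch of POT with the POT library, respecting the bound above; the masses and costs are hypothetical, and the `m_e` parameterisation mirrors the one used later in §6.
```python
import numpy as np
import ot

a = np.array([0.4, 0.6])
b = np.array([0.5, 0.3, 0.2])
C = np.array([[0.1, 0.6, 0.8],
              [0.7, 0.2, 0.3]])

m_e = 0.7                               # fraction of transportable mass
m = m_e * min(a.sum(), b.sum())         # respects m <= min(||a||_1, ||b||_1)
P = ot.partial.partial_wasserstein(a, b, C, m=m)

# Marginals become inequality constraints: some mass may stay unaligned.
print(P.sum(axis=1), "<=", a)
print(P.sum(axis=0), "<=", b)
print(P.sum())                          # total transported mass = m
```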
### Unbalanced Optimal Transport
- UOT relaxes BOT by introducing soft constraints that **penalise marginal deviation**:
$$\min_{P \in \mathbb{R}^{n \times m}_+} \ \langle C, P \rangle + \tau_a D(P\mathbb{1}_m \,\|\, a) + \tau_b D(P^\top\mathbb{1}_n \,\|\, b).$$
:::info
1. where $D$ is a divergence and $\tau_a$ and $\tau_b$ control how much mass deviations are penalised
2. The unbalanced formulation seeks alignment matrices in the entire $\mathbb{R}^{n \times m}_+$.
3. In this study, we adopt the Kullback–Leibler divergence as D because of its simplicity in computation.
:::
:::info
*where D is a divergence*
D is a divergence function, which measures how different two distributions are; common choices include the KL divergence and the total variation distance.
*and τa and τb control how much mass deviations are penalised.*
$\tau_a$ and $\tau_b$ set the penalty strength for deviations of the marginals from the word importances:
- $\tau_a$ penalises the deviation between the row sums of P and the source importances a.
- $\tau_b$ penalises the deviation between the column sums of P and the target importances b.
*Notice that the unbalanced formulation seeks alignment matrices in the entire R$^{n×m}$.*
In UOT, the alignment matrix P is no longer restricted by the hard constraints of BOT or POT: any non-negative entries are allowed. Some of each word's importance may thus go unused or be over-used, but the divergence D penalises such deviations, with strength controlled by $\tau_a$ and $\tau_b$.
This makes the alignment more flexible, trading off alignment quality against fully consuming the word importances; it is especially suitable for semantically divergent sentence pairs containing many unalignable words.
:::
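A minimal sketch of UOT with KL-penalised marginals via the POT library; `reg` is the entropic ε and `reg_m` plays the role of the marginal penalties τ, with all values purely illustrative.
```python
import numpy as np
import ot

a = np.array([0.4, 0.6])
b = np.array([0.5, 0.3, 0.2])
C = np.array([[0.1, 0.6, 0.8],
              [0.7, 0.2, 0.3]])

# Entropic UOT: KL penalties replace the hard marginal constraints.
P = ot.unbalanced.sinkhorn_unbalanced(a, b, C, reg=0.1, reg_m=0.5)

# Marginals only approximately match a and b: deviations (including null
# alignment) are traded off against the transport cost.
print(P.sum(axis=1), "vs", a)
print(P.sum(axis=0), "vs", b)
```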
:::success
Section 3.2 is about relaxing the BOT constraints. Main points:
- BOT forces every word to align, but in many applications (such as word alignment) null alignment is common, so BOT is not optimal there.
- Partial optimal transport (POT) and unbalanced optimal transport (UOT) relax BOT's constraints, **allowing some words to be left unaligned, which handles null alignment**.
- An example of POT and UOT:
1. Source sentence s: two words s1, s2 with mass distribution a = [0.4, 0.6]
2. Target sentence t: three words t1, t2, t3 with mass distribution b = [0.5, 0.3, 0.2]
3. POT allows some words to stay unaligned; one feasible alignment matrix is
P = [[0.3, 0.1, 0],
[0.2, 0, 0]]
- The first row sums to 0.3 + 0.1 = 0.4 ≤ a1 = 0.4.
- The second row sums to 0.2 ≤ a2 = 0.6: part of the mass remains unaligned.
- Each column sum is at most the corresponding entry of b.
4. UOT allows the mass to be unbalanced; one possible P is
P = [[0.4, 0.1, 0],
[0.4, 0, 0.1]]
- The first column sums to 0.8 > b1 = 0.5: unbalanced mass.
- The second column sums to 0.1 < b2 = 0.3: unbalanced mass.
- UOT optimises globally via the soft penalties in its objective.
:::
## 4. Proposal: Unbalanced Word Alignment as Optimal Transport Problem
- In this study, we aim to formulate monolingual word alignment with high null-alignment frequencies in a natural way. For this purpose, we leverage OT with different constraints and reveal their behaviour on this problem.
- We adopt the basic and generic **cost matrices** and **measures** instead of engineering them to avoid obfuscating the difference between BOT, POT, and UOT.
### 4.1 Cost Function and Measures
- We obtain contextualised word embeddings from a pre-trained language model as adopted in the previous word alignment methods.
- We concatenate the source and target sentences and input them to the language model to obtain the l-th layer's hidden outputs.
- We then compute a word embedding by mean pooling the hidden outputs of its subwords.
- As a **cost function**, we use the cosine and Euclidean distances to obtain the cost matrix C, which are the most commonly used semantic dissimilarity metrics in NLP.
- We employ the **distortion** introduced in IBM Model 2 when computing a cost, to discourage aligning words that appear in distant positions.
- Jalili Sabet et al. modelled the distortion between $s_i$ and $t_j$ as $\kappa\,(i/n - j/m)^2$, where $\kappa$ scales the value to range in $[0, \kappa]$. The value becomes larger as the relative positions $i/n$ and $j/m$ grow further apart.
- <span class='red'>Each entry of the cost matrix is then modulated by the corresponding distortion value.</span> Note that the cost matrix is min-max normalised during its computation so that all entries lie in $[0, 1]$.
:::info
Distortion is a factor in the word-pair cost that penalises pairs whose relative positions in their sentences differ greatly.
Concretely, the distortion is defined as $\kappa (i/n - j/m)^2$, where $i/n$ and $j/m$ are the relative positions of the i-th source word and the j-th target word, n and m are the sentence lengths, and $\kappa$ (kappa) scales the value range.
The further apart the two words' relative positions, the larger the distortion, and so the more heavily that word pair's cost is penalised; this discourages aligning words that sit far apart in their sentences.
For example, aligning a word near the beginning of the source sentence to a word near the end of the target sentence (or vice versa) is usually implausible and should be penalised; introducing distortion achieves exactly this.
In short, **distortion is a positional-gap penalty** applied when computing word-pair costs.
:::
:::info
*We employ the distortion introduced in IBM Model 2 when computing a cost to discourage aligning words appearing in distant positions.*
*Jalili Sabet et al. modelled the distortion between s$_i$ and t$_j$ as κ (i/n - j/m)$^2$.*
Reading the formula:
1) $i/n$ is the relative position of $s_i$ in the source sentence;
2) $j/m$ is the relative position of $t_j$ in the target sentence;
3) $(i/n - j/m)^2$ squares the relative position difference, amplifying the penalty for distant pairs.
*where κ scales the value to range in [0, κ]*
κ is a tuning parameter that scales $(i/n - j/m)^2$ into $[0, \kappa]$:
1) with κ = 0, distortion is ignored and every word pair gets zero positional penalty;
2) larger κ gives distortion more influence;
3) for any pair $s_i$, $t_j$, the distortion value lies in $[0, \kappa]$.
By tuning κ we control how strongly distortion affects the final alignment cost; the distortion term enters the word-pair cost, penalising distant pairs and encouraging nearby ones.
:::
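A minimal sketch of the cost matrix construction, assuming embeddings are given as NumPy arrays; the additive way distortion modulates the cost here is an assumption for illustration.
```python
import numpy as np

def cost_matrix(S, T, kappa=0.1):
    """S: (n, d) source embeddings; T: (m, d) target embeddings."""
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    C = 1.0 - Sn @ Tn.T                        # cosine distance
    n, m = C.shape
    i = np.arange(n)[:, None] / n              # relative source positions
    j = np.arange(m)[None, :] / m              # relative target positions
    C = C + kappa * (i - j) ** 2               # distortion penalty in [0, kappa]
    return (C - C.min()) / (C.max() - C.min() + 1e-12)  # min-max to [0, 1]
```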
- **Fertility** determines the likelihood of words having alignment links.
- We use two standard measures adopted in previous studies that used OT in NLP tasks:
1. The uniform distribution
2. L2-norms of word embeddings
- On POT and UOT, <span class='red'>we directly use these measures without scaling</span>, for which $P_{i,j}$ may have an arbitrary positive value.
- We scale $P$ by min-max normalisation so that the alignment matrices of BOT, POT, and UOT can be handled in a unified manner.
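A small sketch of the two fertility measures, assuming embeddings `S`, `T` as above; normalisation onto the probability simplex is applied only for BOT.
```python
import numpy as np

def fertility(S, T, kind="uniform", simplex=False):
    """Uniform or embedding-norm fertility for source/target words."""
    if kind == "uniform":
        a, b = np.ones(len(S)), np.ones(len(T))
    else:                                   # "l2": L2 norms of embeddings
        a, b = np.linalg.norm(S, axis=1), np.linalg.norm(T, axis=1)
    if simplex:                             # BOT requires probability distributions
        a, b = a / a.sum(), b / b.sum()
    return a, b                             # POT/UOT use these unscaled
```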
:::success
Section 4.1 covers the two key ingredients OT needs for word alignment: the cost function and the measures.
Highlights:
1. Cost function: the dissimilarity between source and target words (cosine or Euclidean distance), computed from pre-trained language model embeddings, plus a word-order-based distortion cost.
2. Measures: the fertility of the source and target words, represented by a uniform distribution or the L2 norms of word embeddings, reflecting how likely each word is to align. This corresponds to the fertility concept in the IBM models.
3. Cost matrix: the cost function applied to all word pairs; the cost matrix and the mass distributions are the inputs to the OT model.
4. Centring: subtracting the corpus-wide mean word vector, to mitigate the uniformly high embedding similarities of pre-trained language models.
These choices keep the OT model as generic as possible for the word alignment problem.
:::
### 4.2 Heuristics for Sparse Alignment
- While the Sinkhorn algorithm is a powerful tool for solving OT problems, one drawback is that a resultant alignment matrix becomes dense, i.e., <span class='red'>each element has a non-zero weight</span>. It is not straightforward to interpret a dense solution as an alignment matrix and thus dense matrices are better avoided.
- As an empirical remedy, **simple heuristics** have been commonly used to <span class='red'>make alignment matrices sparse</span>:
1. Aligning the top-k elements based on their mass
2. Aligning the elements whose mass is larger than a threshold

- We take the latter approach to avoid introducing an arbitrary assumption on fertility, i.e., the number of alignment links that a word can have.
- Our experiments reveal that this simple ‘patch’ to obtain a sparse alignment can produce alignment that is unbalanced, rather than just sparse.
:::info
*As an empirical remedy, simple heuristics have been commonly used to make alignment matrices sparse*: two common heuristics are keeping only the top-k elements by mass, or keeping only the elements whose mass exceeds a threshold.
*Our experiments reveal that this simple 'patch' to obtain a sparse alignment can produce unbalanced alignment, rather than just sparse.*
Step by step:
1) The alignment matrix P obtained by solving the OT problem may be dense, i.e., most entries are non-zero.
2) To obtain a sparse alignment, we set a threshold λ.
3) For each entry $P_{i,j}$: keep its value if $P_{i,j} > \lambda$; otherwise set it to 0.
4) This yields a new sparse matrix $\hat{P}$, whose non-zero entries correspond to aligned word pairs.
5) Beyond sparsity, the thresholding also produces an *unbalanced* effect: in $\hat{P}$, the total alignment mass of some source or target words may fall below their original importance.
6) This differs from the hard constraints of BOT, which require the word importances to be fully consumed, and offers extra flexibility.
In short, this simple thresholding trick not only sparsifies the alignment but also automatically yields the unbalanced alignment required by semantically divergent sentence pairs, sidestepping the limitations of BOT.
:::
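A sketch of the thresholding heuristic described above: min-max scale the dense plan, then zero out weak entries; λ is the threshold tuned on validation F1.
```python
import numpy as np

def sparsify(P, lam):
    """Min-max scale a dense OT plan and keep only entries above lam."""
    P = (P - P.min()) / (P.max() - P.min() + 1e-12)
    return np.where(P > lam, P, 0.0)   # rows/columns may end up empty: null alignment
```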
### 4.3 Application to Unsupervised Alignment
- We obtain contextualised word embeddings from a pre-trained **masked language model** without fine-tuning in the unsupervised setting.
- Such word embeddings are known to show relatively high cosine similarity between any random words, which blurs the actual similarity of semantically corresponding words.
- We alleviate this phenomenon with the simple technique of <span class='red'>centring the word embedding distribution</span>. We apply corpus mean centring, which subtracts the mean word vector of the entire corpus from each word embedding.
:::success
Section 4.3 explains how to apply the proposed OT-based monolingual word alignment in the unsupervised setting.
In unsupervised word alignment there is no labelled training data. A pre-trained masked language model provides contextualised word embeddings: the source and target sentences are concatenated, fed into the model, and the hidden states of the corresponding words are taken as embeddings.
However, with such embeddings any two random words tend to have high cosine similarity, which blurs the true semantic correspondences. To fix this, the paper applies mean centring: the mean vector over all word embeddings in the corpus is subtracted from each embedding.
In short, §4.3 shows how OTAlign works without human-annotated training data, via contextualised embeddings plus mean centring.
:::
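A sketch of corpus mean centring, assuming the corpus embeddings are held as a list of per-sentence arrays.
```python
import numpy as np

def mean_centre(sentence_embs):
    """sentence_embs: list of (len_i, d) arrays over the whole corpus."""
    mu = np.concatenate(sentence_embs).mean(axis=0)  # corpus-wide mean vector
    return [E - mu for E in sentence_embs]           # subtract it from every word
```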
### 4.4 Application to Supervised Alignment
- We adopt linear metric learning that learns a matrix $W \in \mathbb{R}^{d \times d}$ defining a generalised distance between two embeddings: $c(\mathbf{s}_i, \mathbf{t}_j) = c(W\mathbf{s}_i, W\mathbf{t}_j)$.
- We train the entire model, learning the parameters of $W$ and the pre-trained language model, by minimising the binary cross-entropy loss:
$$\mathcal{L} = -\sum_{i,j} \left[ Y_{i,j} \log P_{i,j} + (1 - Y_{i,j}) \log (1 - P_{i,j}) \right],$$
- where $P$ and $Y$ are the predicted and ground-truth alignment matrices, respectively. Specifically, $Y_{i,j} \in \{0, 1\}$ indicates the ground-truth alignment between $s_i$ and $t_j$: 1 means the alignment exists, while 0 means no alignment.
:::success
Section 4.4 describes how to apply the OT-based alignment in the supervised setting.
With labelled data, the model can learn a metric matrix W for word distances. The paper adopts linear metric learning, turning the word distance into c(s_i, t_j) = c(Ws_i, Wt_j), where s and t are the source and target word embeddings and c is a base distance such as the Euclidean distance.
The whole model is then trained end to end, jointly optimising W and the pre-trained language model by minimising the binary cross-entropy loss.
Compared with the unsupervised method, this adds the learned distance matrix W and loss-driven parameter updates on labelled data, strengthening the model's ability to capture word relations and improving supervised alignment performance.
:::
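A minimal PyTorch sketch of the supervised objective: a learned linear metric W reshapes the embeddings before the cosine cost, a differentiable OT layer produces P (plain balanced Sinkhorn iterations here, as an assumption; the paper also uses the POT/UOT variants), and binary cross-entropy against the gold matrix Y trains everything end to end.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OTAligner(nn.Module):
    def __init__(self, d, eps=0.1, iters=50):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)   # metric-learning matrix W
        self.eps, self.iters = eps, iters

    def forward(self, S, T, a, b):
        """S: (n, d), T: (m, d) embeddings; a: (n,), b: (m,) measures."""
        Sw, Tw = self.W(S), self.W(T)          # generalised distance c(Ws, Wt)
        C = 1 - F.cosine_similarity(Sw.unsqueeze(1), Tw.unsqueeze(0), dim=-1)
        K = torch.exp(-C / self.eps)           # Sinkhorn kernel
        u, v = torch.ones_like(a), torch.ones_like(b)
        for _ in range(self.iters):            # balanced Sinkhorn iterations
            u = a / (K @ v)
            v = b / (K.T @ u)
        P = u[:, None] * K * v[None, :]
        return (P - P.min()) / (P.max() - P.min() + 1e-12)  # min-max to [0, 1]

# Training step (sketch): loss = F.binary_cross_entropy(P, Y)
# with Y the {0, 1} ground-truth alignment matrix.
```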
## 5. Experiment Settings
- We empirically investigate the characteristics of BOT, POT, and UOT on word alignment in both unsupervised and supervised settings.
- To alleviate performance variations due to neural model initialisation, all the experiments were conducted 5 times with different random seeds, and the means of the scores are reported.
### Dataset
- As datasets that provide human-annotated word alignment, we used:
1. Microsoft Research Recognizing Textual Entailment (**MSR-RTE**)
2. **Edinburgh++**
3. Multi-Genre Monolingual Word Alignment (**MultiMWA**)

- **MultiMWA** consists of four subsets according to the sources of sentence pairs: **MTRef**, **Newsela**, **ArXiv**, and **Wiki**. Among them, **Newsela** and **ArXiv** are intended for a transfer-learning experiment, on which models trained using MTRef should be tested.
- MSR-RTE and Edinburgh++ do not have an official validation split. Hence, we subsampled a validation set from the training split and excluded it from training.
- Following the tradition of word alignment, there are two types of alignment links:
1. Sure
2. Possible.
- The former indicates that an alignment was agreed upon by multiple annotators and thus carries high confidence. Experiments were conducted in both ‘Sure and Possible’ and ‘Sure Only’ settings.
### Pre-trained Language Model
- OTAlign as well as the recent powerful methods (§2.1) use contextualised word embeddings obtained from pre-trained (masked) language models.
- As the basic and standard model, we used BERT-base-uncased for all the methods compared to directly observe the capabilities of different alignment mechanisms and to exclude performance differences owing to the underlying pre-trained models.
- For unsupervised word alignment, we used the 10th layer of BERT, which performs strongly in unsupervised textual similarity estimation. For supervised alignment, we used the last hidden layer, following convention.
### OTAlign
- Our OT-based alignment methods require a cost function and fertility. As discussed in §4.1, we experiment with the **cosine** and **Euclidean distances** as cost functions and the **uniform distribution** and **L2 norm** of word embeddings as the fertility.
- Due to the space limitation, the main paper only discusses the results of the cost function and fertility that performed consistently strongly on the validation sets.
- We fixed the regularisation penalty ε to 0.1 throughout all experiments. The threshold λ for sparsifying alignment matrices was searched in [0, 1] with a 0.01 interval to maximise the validation F1.
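A sketch of that threshold search, reusing the `sparsify` helper from §4.2; `f1` and the data handling are placeholders.
```python
import numpy as np

def tune_threshold(plans, golds, f1):
    """Pick lambda in [0, 1] (0.01 steps) maximising mean validation F1."""
    grid = np.arange(0.0, 1.01, 0.01)
    scores = [np.mean([f1(sparsify(P, lam), Y) for P, Y in zip(plans, golds)])
              for lam in grid]
    return grid[int(np.argmax(scores))]
```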
:::info
*As the tradition of word alignment, there are two types of alignment links: sure and possible.*
Following word alignment tradition, links come in two types: "sure" and "possible". *The former indicates that an alignment was agreed upon by multiple annotators and thus preserves high confidence*: sure alignments were agreed upon by multiple annotators and are highly reliable, while possible alignments were marked by only some annotators and carry lower confidence.
*We used BERT-base-uncased for all the methods compared to directly observe the capabilities of different alignment mechanisms and to exclude performance differences owing to the underlying pre-trained models.*
All compared methods use the same BERT-base-uncased backbone. This isolates the alignment mechanisms themselves: performance differences cannot be attributed to different pre-trained models, so the comparison is fair and the strengths and weaknesses of each alignment mechanism can be assessed objectively.
:::
## 6. Unsupervised Word Alignment
- We reveal features of OT-based methods on monolingual word alignment under unsupervised learning to segregate the effects of supervision on the pre-trained language model.
:::info
"We reveal the features of OT-based monolingual word alignment methods under unsupervised learning in order to separate out the effects of supervision on the pre-trained language model."
Explanation:
1. The authors run unsupervised word alignment experiments, i.e., without any human-annotated alignment data, using the OT variants (BOT, POT, UOT).
2. The unsupervised setting is chosen to remove the influence of supervision on the pre-trained language model: in supervised learning, fine-tuning on alignment-annotated data updates the language model's parameters, which changes how it captures lexical semantics and hence affects alignment performance.
3. Evaluating the OT methods without supervision therefore exposes the abilities of the OT alignment mechanisms themselves, operating on the raw semantic representations of the pre-trained model.
4. This makes it possible to objectively analyse how the different OT variants behave in challenging scenarios such as high null alignment ratios, and provides a basis for targeted improvements.
:::
- Note that we use the validation sets for hyper-parameter tuning, which may not be strictly ‘unsupervised.’ We still call this scenario ‘unsupervised’ for convenience, believing that the preparation of small-scale annotation datasets should be feasible in most practical cases.
### 6.1 Settings
- <span class='red'>As a conventional word alignment method</span>, we compared OTAlign to **fast-align**, which implements IBM Model 2.
- For **fast-align**, the training, validation, and test sets were concatenated and used to gather enough statistics for alignment.
- <span class='red'>As the state-of-the-art unsupervised word alignment method</span>, we compared OTAlign to **SimAlign**.
- While **SimAlign** was initially proposed for crosslingual word alignment, it is directly applicable to the monolingual setting.
- **SimAlign** computes a similarity matrix in the same manner as ours, and aligns words using simple heuristics on the matrix.
- Specifically, we used the ‘**IterMax**’ that performed best across many language pairs.
- **IterMax** conducts ‘ArgMax’ alignment iteratively (Jalili Sabet et al. (2020) set the number of iterations to two), which aligns a word pair if each word is the other's most similar counterpart.
- **SimAlign** has two hyper-parameters:
1. one for the distortion (κ)
2. another for forcing null alignment if a word is not particularly similar to any target words. These values were tuned using the validation sets.
:::success
- For **OTAlign**, **POT** has a hyper-parameter m representing the fraction of fertility to be aligned, which we parameterise as $m = m_e \min(\|a\|_1, \|b\|_1)$.
- **UOT** has marginal deviation penalties $\tau_a$ and $\tau_b$. All of these hyper-parameters were searched in the range [0, 1] with a 0.02 interval to maximise the validation F1.
- For the distortion strength κ, we applied the same values tuned on SimAlign.
:::
### 6.2 Results: Primary Observations

- **F1 scores** were measured on the test sets.
- We have the following observations:
1. The best OT problem depends on null alignment ratios.
- On datasets with higher null alignment ratios, i.e., Edinburgh++ and MTRef, regularised BOT, regularised POT, and UOT largely outperformed SimAlign.
- Regularised POT performed the best on the datasets with significantly high null alignment ratios.
- On these datasets, SimAlign also performed strongly thanks to the BERT representations.
2. Thresholding on the alignment matrix makes it unbalanced.
- The unregularised BOT showed the worst performance because its constraint forces all words to align and prohibits null alignment.
- We observe that the performance of unregularised BOT and POT was significantly boosted by regularisation and thresholding on resultant alignment matrices.
:::info
From the experimental results we observe:
(i) *The best OT problem depends on null alignment ratios.*
1) Which OT variant works best depends on the proportion of null alignments.
2) With few null alignments, balanced OT (BOT) does well.
3) With many null alignments, unbalanced OT (UOT) and partial OT (POT) do better.
4) This is because BOT requires every word to align and forbids null alignment, so it is heavily constrained when the null ratio is high.
5) UOT and POT relax this hard constraint and adapt to high-null scenarios.
(ii) *Thresholding on the alignment matrix makes it unbalanced.*
1) The raw OT solution is a dense matrix with mostly non-zero entries.
2) The heuristic keeps entries above a threshold λ and zeroes out the rest, making the matrix sparse and discarding small alignment masses.
3) More importantly, it also introduces unbalancedness: in the thresholded matrix, some words' total alignment mass may fall below their original importance.
4) This adaptivity avoids forcibly aligning semantically divergent word pairs in high-null settings.
In short, the best OT variant depends on the data's null alignment ratio, and thresholding makes the alignment both sparse and unbalanced, better suited to high-null scenarios; both the OT formulation and the post-processing matter for the final performance.
:::
- For further investigation, we binned the test samples across datasets (under the ‘Sure Only’ setting) according to their null alignment ratios and observed the performance of the methods.
:::info
*For further investigation, we binned the test samples across datasets (under the 'Sure Only' setting) according to their null alignment ratios and observed the performance of the methods.*
Step by step:
1) Only the "sure" alignments are considered (the 'Sure Only' setting), excluding "possible" links.
2) For every test sample, the null alignment ratio (the proportion of words with no counterpart) is computed.
3) Test samples from all datasets are grouped into bins by this ratio (e.g., 0–10%, 10–20%, and so on).
4) For each bin, the F1 score of every alignment method is computed.
5) Comparing bins shows how strongly the null alignment ratio affects each method, revealing their strengths and weaknesses across different degrees of semantic divergence.
:::

- Although the F1 score of BOT naturally degrades as null alignment rates increase, the F1 score of regularised BOT stays closer to that of UOT owing to thresholding on the regularised solutions.
- Regularised POT outperforms the unregularised POT.
- Therefore, we argue that <span class='red'>thresholding is vital to obtain not only a sparse but also unbalanced alignment matrix</span>.
- Interestingly, most methods show a v-shaped curve, i.e., the F1 score decreases first and then increases again. We identified that this trend is due to the characteristics of MSR-RTE: most sentence pairs with a high (> 45%) null alignment ratio come from MSR-RTE, which consists of the ‘hypothesis’ and ‘text’ of entailment pairs that tend to have very different lengths.
:::info
Interestingly, most methods show a "V"-shaped curve: the F1 score first decreases and then rises again. This trend stems from the characteristics of the MSR-RTE dataset.
Step by step:
1. Most methods show the V-shaped trend in Figure 2(a).
2. In MSR-RTE, most sentence pairs have a high null alignment ratio (> 45%).
3. MSR-RTE consists of "hypothesis" and "text" pairs whose lengths usually differ greatly.
4. In such pairs, most words of the shorter sentence align with words of the longer one, while the remaining words of the longer sentence are left unaligned.
5. This combination of high null ratio and large length difference is pervasive in MSR-RTE and shapes the alignment methods' performance there.
:::
- In these entailment pairs, most words in the (shorter) sentence align with words in the (longer) one, while the remaining words of the longer sentence have to be left unaligned. This characteristic is unique to MSR-RTE, which we conjecture affected the performance.
- When MSR-RTE is removed from the trends, the alignment F1 becomes approximately inversely proportional to the null alignment ratio, as expected.
## 7. Supervised Word Alignment
### 7.1 Settings
- We compared our OT-based alignment methods to the state-of-the-art supervised methods proposed by **Lan et al. (2021)** and **Nagata et al. (2020)**.
- The method of **Lan et al. (2021)** uses a semi-Markov conditional random field combined with neural models to simultaneously conduct word and chunk alignment.
- Its high computational cost limits chunks to at most 3-grams.
- The method proposed by **Nagata et al. (2020)** models word alignment as span prediction in the same formulation as the SQuAD style question answering.
- <span class='red'>These previous methods explicitly incorporate chunks in the alignment process</span>.
- In contrast, our OT-based alignment is purely based on word-level distances represented as a cost matrix.
- We trained the alignment methods of the regularised BOT, regularised POT, and UOT <span class='red'>in an end-to-end fashion using the Adafactor optimiser</span>.
- The batch size was set to 64 for datasets except for Wiki, for which 32 was used due to longer sentence lengths.
- The training was stopped early, with 3 patience epochs and a minimum delta of $1.0 \times 10^{-4}$ based on the validation F1 score.
- Before evaluation, we searched the learning rate in $[5.0 \times 10^{-5}, 2.5 \times 10^{-4}]$ with a $2.0 \times 10^{-5}$ interval using the validation sets.
- Other hyper-parameters were set to the same as the unsupervised settings.
### 7.2 Results

- The **UOT-based** alignment exhibits competitive performance against these state-of-the-art methods on the datasets with higher null alignment ratios.
- Notably, despite the simple and generic alignment mechanism based on UOT, it performed on par with Lan et al. (2021) who specially designed the model for monolingual word alignment.
- The **UOT-based** alignment also shows consistently higher alignment F1 scores compared to the regularised BOT and POT alignment.
- These results confirm that <span class='red'>UOT well-captures the unbalanced word alignment problem</span>.
- OT-based alignment, like Lan et al. (2021), showed better transferability than Nagata et al. (2020), as demonstrated by the results on Newsela and ArXiv.
- We conjecture this is because our cost matrix has less inductive bias due to its simplicity. In contrast, Nagata et al. (2020) directly learn to predict alignment spans.
:::info
On the Newsela and ArXiv datasets, the OT-based word alignment methods transfer better than Nagata et al. (2020) and perform on par with Lan et al. (2021). Step by step:
1. Newsela and ArXiv serve as transfer-learning test sets; the models are trained on MTRef.
2. OT-based alignment shows good transferability there, performing on par with the state-of-the-art method of Lan et al. (2021).
3. The method of Nagata et al. (2020) transfers less well to Newsela and ArXiv.
4. The authors conjecture that the OT methods transfer well because their cost matrix is simple and carries little inductive bias, whereas Nagata et al. (2020) directly learn to predict alignment spans, which may over-adapt to the training domain and transfer less well.
:::

- These results demonstrate the robustness of OT-based methods against null alignment.
- OT-based alignment outperformed Nagata et al. (2020) on sentences whose null alignment ratios are higher than 15%.
- The performances of regularised BOT and UOT reverse around a 20% null alignment ratio:
- Regularised BOT outperformed UOT on the lower side (0.7% higher F1 on average, except at the lowest edge), and UOT outperformed BOT on the higher side (1.0% higher F1 on average).
- As in unsupervised word alignment, all methods show the v-shaped trend. Figure 3(b) shows the alignment F1 on the datasets excluding MSR-RTE; the performance of all methods becomes inversely proportional to the null alignment ratio, as observed in the unsupervised case.

- While UOT and previous methods are competitive on the datasets with high null alignment frequencies, we found that UOT has an advantage in longer phrase alignment.
- The chunk size constraint in **Lan et al. (2021)**, 3-gram at maximum, hindered the many-to-many alignment.
- **Nagata et al. (2020)** also failed to complete this alignment because their method conducts one-to-many alignment bidirectionally and merges the results.
- **UOT** can align such a long phrase pair if the cost matrix is sufficiently reliable, as demonstrated here.
## 8. Summary and Future Work
- First, in unsupervised alignment, the best OT problem depends on null alignment ratios.
- Second, simple thresholding on regularised BOT can produce unbalanced alignment.
- Third, in supervised alignment, simple and generic OT-based alignment shows competitive performance to the state-of-the-art models specially designed for word alignment.