# Translation Edit Rate
> Akash Chakrabarti
> Roll: 001610501037
## Objective
- Calculate the Translation Edit Rate between a given hypothesis and a reference
- Display the shifts, insertions, deletions and substitutions that need to be performed to convert the hypothesis into the reference
## Introduction
TER is defined as the minimum number of edits needed to change a hypothesis so that it exactly matches one of the references, normalized by the average length of the references.
Possible edits include the following:
- insertion of one word
- deletion of one word
- substitution of one word by another
- shift of a sequence of words of any length to another location within the hypothesis
Each of these edits have unit cost associated with them. The number of edits for TER is calculated in two phases.
- A greedy search is used to find the set of shifts, by repeatedly selecting the shift that most reduces the number of insertions, deletions and substitutions, until no more beneficial shifts remain.
- Then dynamic programming is used to optimally calculate the remaining edit distance using a minimum-edit-distance (where insertions, deletions and substitutions all have cost 1).
## Constraints for shifting
The space of all possible shifts being huge, reducing that space becomes necessary for efficient computation of the TER score. To achieve this reduction the following constraints have been put in place:
1. The shifted words must match the reference words in the destination position exactly.
2. The word sequence of the hypothesis in the original position and the corresponding reference words must not exactly match.
3. The word sequence of the reference that corresponds to the destination position must be misaligned before the shift.
For very big sentences several other constraints such as **limiting the maximum length of a sequence that can be shifted**, or **limiting the maximum distance over which a sequence can be shifted**, etc. can be imposed for more efficient computations.
## Outline of the algorithm
Let the hypothesis be denoted by `h`
Let the reference be denoted by `r`
Let the shifted hypothesis be denoted by `h'`
1. Find the shift that most reduces the Minimum-Edit-Distance(h', r)
2. If no such shift found goto `Step 6`
3. Perform the shift on `h` to get `h'`
4. Increment the count of edits needed
5. Goto `Step 1`
6. Edit-Count = Edit-Count + Minimum-Edit-Distance(h', r)
7. Analyze the alignment between the hypothesis and the reference and display results
8. Return Edit-Count
## Description of files in the project
| File Name | Description |
| --------- | ----------- |
| `TERCostFunction.cs` | This contains the functions the compute the costs associated with the various operations. In this case all of them return 1. |
| `TERShiftOperation.cs` | This encapsulates a shift operation. It contains information about the words being the shifted like the starting index, ending index and the final location in the shifted hypothesis. |
| `TERAlignment.cs` | This encapsulates an alignment between a hypothesis and the reference. It contains information like the shifts needed to arrive at that hypothesis. The alignment between the hyp and the ref is also contained in terms of the number of how many edits are needed and where in the hypothesis. |
| `MEDCalculator.cs` | This contains methods to calculate Minimum Edit Distance between two strings, calculating the alignment between them and display the edits needed to change the first string to the second |
| `TERCalculator.cs` | This contains all methods to calculate the TER between the hypothesis and the reference. This is the class where all the main logic is contained. |
| `Program.cs` | This the main file where the algorithm is tested. |
## Implementation Details
- This project has been implemented in C# as a console application.
- The main parts of the project have been provided below as follows:
### Cost function
This class contains the methods to compute the costs associated with the various edit operations.
```csharp
public class TERCostFunction
{
// The cost of substituting a
// hypothesis word for a reference word
public double SubstitutionCost()
{
return 1.0;
}
// The cost of inserting a
// hypothesis word
public double InsertionCost()
{
return 1.0;
}
// The cost of deleting a
// hypothesis word
public double DeletionCost()
{
return 1.0;
}
// The cost of shifting a contiguous
// sequence of one or more words
public double ShiftCost()
{
return 1.0;
}
}
```
### Main logic to calculate the TER
This is method that calculates the TER between the original hypothesis and reference. It returns a `TERAlignment` to let the main method analyze the results.
```csharp
public static TERAlignment CalculateTER(string[] hypothesisArr, string[] referenceArr, TERCostFunction costFunction)
{
Dictionary<string, SortedSet<int>> phraseMatches = BuildPhraseMatches(hypothesisArr, referenceArr);
TERAlignment currentAlignment = MEDCalculator.CalculateMinimumEditDistance(hypothesisArr, referenceArr, costFunction, ref MinEditDistMatrix);
string[] currentHypothesisArr = hypothesisArr;
currentAlignment.Hypothesis = hypothesisArr;
currentAlignment.Reference = referenceArr;
currentAlignment.ShiftedHypothesis = hypothesisArr;
var numberOfEdits = 0;
var allShifts = new List<TERShiftOperation>();
while (true)
{
(TERShiftOperation bestShift, TERAlignment resultingAlignment) =
PerformBestShift(currentHypothesisArr, hypothesisArr, referenceArr,
phraseMatches, currentAlignment, costFunction);
// break if no shift found
if (bestShift == null)
{
break;
}
// Add the cost of a shift to the number of edits
numberOfEdits += Convert.ToInt32(new TERCostFunction().ShiftCost());
// Add this shift to the list of shifts to be done
allShifts.Add(bestShift);
// Update the alignment after shift
currentAlignment = resultingAlignment;
// Update the hypothesis array after shift
currentHypothesisArr = currentAlignment.ShiftedHypothesis;
}
var finalAlignment = currentAlignment;
finalAlignment.AllShiftsPerformed = allShifts.ToArray();
finalAlignment.NumEdits += numberOfEdits;
return finalAlignment;
}
```
### Calculating the MED and alignment
```csharp
class MEDCalculator
{
public static TERAlignment CalculateMinimumEditDistance(string[] hypothesisArr, string[] referenceArr, TERCostFunction costFunction, ref int[][] MinEditDistMatrix)
{
// Allocate space for matrix of edit distances
MinEditDistMatrix = new int[referenceArr.Length + 1][];
for (var i = 0; i < MinEditDistMatrix.Length; i++)
{
MinEditDistMatrix[i] = new int[hypothesisArr.Length + 1];
}
MinEditDistMatrix[0][0] = 0;
// Need i insertions to convert from empty hyp to ref[0..i]
for (var i = 0; i < referenceArr.Length; i++)
{
MinEditDistMatrix[i + 1][0] = i + 1;
}
// Need j deletions to convert from hyp[0..j] to empty ref
for (var j = 0; j < hypothesisArr.Length; j++)
{
MinEditDistMatrix[0][j + 1] = j + 1;
}
for (var i = 0; i < referenceArr.Length; i++)
{
for (var j = 0; j < hypothesisArr.Length; j++)
{
if (referenceArr[i] == hypothesisArr[j])
{
MinEditDistMatrix[i + 1][j + 1] = MinEditDistMatrix[i][j];
}
else
{
MinEditDistMatrix[i + 1][j + 1] = 1 + Math.Min(Math.Min(MinEditDistMatrix[i][j + 1],
MinEditDistMatrix[i + 1][j]),
MinEditDistMatrix[i][j]);
}
}
}
//Console.WriteLine(MinEditDistMatrix[referenceArr.Length][hypothesisArr.Length]);
TERAlignment resultingAlignment = CalculateAlignment(hypothesisArr, referenceArr, ref MinEditDistMatrix);
return resultingAlignment;
}
public static TERAlignment CalculateAlignment(string[] hypothesisArr, string[] referenceArr, ref int[][] MinEditDistMatrix)
{
var alignment = new TERAlignment();
alignment.Alignment = new List<char>();
int i = MinEditDistMatrix.Length - 1;
int j = MinEditDistMatrix[i].Length - 1;
while (true)
{
if (i == 0 || j == 0)
{
break;
}
if (referenceArr[i - 1] == hypothesisArr[j - 1])
{
alignment.Alignment.Insert(0, ' ');
i = i - 1;
j = j - 1;
}
else if (MinEditDistMatrix[i][j] == MinEditDistMatrix[i - 1][j - 1] + 1)
{
alignment.Alignment.Insert(0, 'S');
i = i - 1;
j = j - 1;
}
else if (MinEditDistMatrix[i][j] == MinEditDistMatrix[i][j - 1] + 1)
{
alignment.Alignment.Insert(0, 'D');
j = j - 1;
}
else if (MinEditDistMatrix[i][j] == MinEditDistMatrix[i - 1][j] + 1)
{
alignment.Alignment.Insert(0, 'I');
i = i - 1;
}
}
alignment.Hypothesis = hypothesisArr;
alignment.Reference = referenceArr;
alignment.NumWords = referenceArr.Length;
alignment.NumEdits = MinEditDistMatrix[referenceArr.Length][hypothesisArr.Length];
return alignment;
}
}
```
### Performing the best shift that reduces MED
This method first finds all possible shifts that can be performed on the current hypothesis. It then goes through the shifts and checks which shift most reduces the MED between the shifted hypothesis and the reference. It performs that shift and returns that shift and the resulting alignment
```csharp
private static (TERShiftOperation bestShift,
TERAlignment resultingAlignment)
PerformBestShift(string[] currentHypothesisArr,
string[] hypothesisArr,
string[] referenceArr,
Dictionary<string, SortedSet<int>> phraseMatches,
TERAlignment currentAlignment,
TERCostFunction costFunction)
{
// Record which hypothesis and reference words are currently not aligned
bool[] hypErr = new bool[hypothesisArr.Length];
bool[] refErr = new bool[referenceArr.Length];
// Record which ref words are aligned to which hyp words
int[] refAlign = new int[referenceArr.Length];
PopulateAlignmentInformation(currentAlignment, hypErr, refErr, refAlign);
// Find all possible shifts
TERShiftOperation[][] allPossibleShifts =
FindAllPossibleShifts(currentHypothesisArr,
referenceArr,
phraseMatches,
currentAlignment,
hypErr,
refErr,
refAlign,
costFunction);
var isShiftGood = false;
var currentBestShiftCost = 0;
var currentBestShift = new TERShiftOperation();
var currentBestAlignment = currentAlignment;
var currentBestShiftedHypothesisArr = currentHypothesisArr;
for (var i = allPossibleShifts.Length - 1; i >= 0; i--)
{
for (var s = 0; s < allPossibleShifts[i].Length; s++)
{
var currentShift = allPossibleShifts[i][s];
var shiftedHypothesisArr = PerformCurrentShift(currentHypothesisArr, currentShift);
var shiftedHypothesisAlignment = MEDCalculator.CalculateMinimumEditDistance(shiftedHypothesisArr, referenceArr,
costFunction, ref MinEditDistMatrix);
shiftedHypothesisAlignment.Hypothesis = hypothesisArr;
shiftedHypothesisAlignment.Reference = referenceArr;
shiftedHypothesisAlignment.ShiftedHypothesis = shiftedHypothesisArr;
shiftedHypothesisAlignment.AllShiftsPerformed = new TERShiftOperation[] { currentShift };
var gain = (currentBestShiftCost + currentBestAlignment.NumEdits) -
(1 + shiftedHypothesisAlignment.NumEdits);
if (gain > 0 || (gain == 0 && currentBestShiftCost == 0))
{
isShiftGood = true;
currentBestShift = currentShift;
currentBestShiftCost = 1;
currentBestAlignment = shiftedHypothesisAlignment;
currentBestShiftedHypothesisArr = shiftedHypothesisArr;
}
}
}
TERShiftOperation bestShift = null;
TERAlignment resultingAlignment = null;
if (isShiftGood)
{
bestShift = currentBestShift;
resultingAlignment = currentBestAlignment;
var s = new StringBuilder();
for (var i = bestShift.Start; i <= bestShift.End; i++)
{
if (i == bestShift.Start)
s.Append(currentHypothesisArr[i]);
else
s.Append(" " + currentHypothesisArr[i]);
}
Console.WriteLine($"Shift : \"{s.ToString()}\" at Index Range [{bestShift.Start + 1}, {bestShift.End + 1}]" +
$" to Index Range [{bestShift.FinalLoc + 1}, {bestShift.FinalLoc + bestShift.End - bestShift.Start + 1}]" +
$" by {Math.Abs(bestShift.FinalLoc - bestShift.Start)} indexes");
Console.WriteLine($"Result : {String.Join(" ", currentBestShiftedHypothesisArr)}\n");
}
return (bestShift, resultingAlignment);
}
```
### Analyse the alignment information
This class ecapsulates an alignment between a shifted hypothesis and a reference. It also contains a method to analyse the data and print the results accordingly.
```csharp
public class TERAlignment
{
public string[] Reference { get; set; }
public string[] Hypothesis { get; set; }
public string[] ShiftedHypothesis { get; set; }
public TERShiftOperation[] AllShiftsPerformed { get; set; }
public int NumEdits { get; set; }
public int NumWords { get; set; }
public List<char> Alignment { get; set; }
public override string ToString()
{
var s = new StringBuilder();
Console.WriteLine($"Final shifted hypothesis: {String.Join(" ", ShiftedHypothesis)}\n");
if (Alignment != null)
{
MEDCalculator.ShowEdits(ShiftedHypothesis, Reference, ref TERCalculator.MinEditDistMatrix);
}
var numShifts = AllShiftsPerformed == null ? 0 : AllShiftsPerformed.Length;
var otherEdits = new Dictionary<char, int>();
foreach (var c in Alignment)
{
if (otherEdits.ContainsKey(c))
otherEdits[c]++;
else
otherEdits.Add(c, 1);
}
var v = 0;
var deletes = otherEdits.TryGetValue('D', out v) ? v : 0;
var inserts = otherEdits.TryGetValue('I', out v) ? v : 0;
var subs = otherEdits.TryGetValue('S', out v) ? v : 0;
Console.WriteLine();
Console.WriteLine($"Number of shifts performed: {numShifts}");
Console.WriteLine($"Number of insertions performed: {inserts}");
Console.WriteLine($"Number of deletions performed: {deletes}");
Console.WriteLine($"Number of substiturions performed: {subs}");
Console.WriteLine();
Console.WriteLine($"Total number of edits performed: {inserts + deletes + subs + numShifts}");
Console.WriteLine($"Number of words in reference: {Reference.Length}");
s.Append($"\nTER Score = {(double)NumEdits / NumWords} ({NumEdits}/{NumWords})");
return s.ToString();
}
}
```
## Sample Output
### When there is no shift that can reduce MED
```
Reference: the boy went to school in the afternoon yesterday
Hypothesis: the boy returned from school in the afternoon
Final shifted hypothesis: the boy returned from school in the afternoon
Insert "yesterday" at index 9 in Hyp
Substitute "from" at index 4 in Hyp by "to" in Ref
Substitute "returned" at index 3 in Hyp by "went" in Ref
Number of shifts performed: 0
Number of insertions performed: 1
Number of deletions performed: 0
Number of substiturions performed: 2
Total number of edits performed: 3
Number of words in reference: 9
TER Score = 0.333333333333333 (3/9)
```
### When there is a shift that can reduce the MED
```
Reference: saudi arabia denied this week information published in the american new york times
Hypothesis: this week the saudis denied information published in the new york times
Shift : "this week" at Index Range [1, 2] to Index Range [4, 5] by 3 indexes
Result : the saudis denied this week information published in the new york times
Final shifted hypothesis: the saudis denied this week information published in the new york times
Insert "american" at index 10 in Hyp
Substitute "saudis" at index 2 in Hyp by "arabia" in Ref
Substitute "the" at index 1 in Hyp by "saudi" in Ref
Number of shifts performed: 1
Number of insertions performed: 1
Number of deletions performed: 0
Number of substiturions performed: 2
Total number of edits performed: 4
Number of words in reference: 13
TER Score = 0.307692307692308 (4/13)
```
## Conclusion
We have implemented a program to calculate the Translation Edit Rate between a hypothesis and a reference. As can be seen from the results, the TER measure correlates more with human judgments of Machine Translation quality as compared to other measures like Word Error Rate and BLEU. Therefore it can be adopted as a an automatic evaluation metric of Machine Translation quality.