PID-848

Bicleaner is a parallel corpus cleaner that identifies and removes noisy sentence pairs from a parallel corpus. It takes a supervised learning approach: a machine learning model is trained on a labeled dataset of parallel sentence pairs to predict whether a given pair is noisy. Once trained, the model classifies new sentence pairs. To do this, it extracts a set of features from each pair, such as the length of the sentences, the number of shared words, and the number of unique words, and uses them to predict whether the pair is noisy. Pairs predicted to be noisy are removed from the parallel corpus, and the process continues until every sentence pair in the corpus has been classified.

Bicleaner is implemented in Python and uses the following machine learning libraries:

* scikit-learn
* PyTorch
* TensorFlow

Bicleaner is also open source and can be found on GitHub [2].

The tool consists of several components:

1. **Hard rules**: Bicleaner uses a set of hard rules to pre-filter noisy sentence pairs. These rules are implemented in the `bicleaner-hardrules` package [8]. Some of the rules are language-dependent and rely on CLD2-based language identification.
2. **Language models**: Bicleaner uses n-gram language models for fluency scoring. It requires the KenLM Python bindings built with support for 7-gram language models [2].
3. **Classifiers**: Bicleaner uses classifiers to produce a probability score for each sentence pair, indicating whether the two sentences are mutual translations. The original Bicleaner implementation uses a classifier based on Extremely Randomized Trees. A more recent version, Bicleaner AI, uses a neural classifier based on fine-tuned XLM-RoBERTa models [4][5].
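To make the hard-rule and feature-extraction ideas concrete, here is a minimal Python sketch. It is not Bicleaner's code: the function names, the whitespace tokenizer, the `max_ratio` threshold, and the two rules shown are assumptions chosen for illustration, and the features are only the three named in the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairFeatures:
    src_len: int         # number of tokens in the source sentence
    tgt_len: int         # number of tokens in the target sentence
    length_ratio: float  # longer side divided by shorter side
    shared_words: int    # distinct tokens that appear on both sides
    unique_words: int    # distinct tokens across both sides


def tokenize(sentence: str) -> List[str]:
    # Assumption: lowercased whitespace tokenization, for brevity only.
    return sentence.lower().split()


def passes_hard_rules(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Illustrative pre-filter; the real rules live in bicleaner-hardrules."""
    src_tok, tgt_tok = tokenize(src), tokenize(tgt)
    if not src_tok or not tgt_tok:
        return False  # one side is empty
    if src_tok == tgt_tok:
        return False  # identical sides suggest untranslated text
    ratio = max(len(src_tok), len(tgt_tok)) / min(len(src_tok), len(tgt_tok))
    return ratio <= max_ratio  # reject very unbalanced lengths


def extract_features(src: str, tgt: str) -> PairFeatures:
    """Compute the simple features a classifier could consume."""
    src_tok, tgt_tok = tokenize(src), tokenize(tgt)
    shorter = max(1, min(len(src_tok), len(tgt_tok)))  # avoid division by zero
    return PairFeatures(
        src_len=len(src_tok),
        tgt_len=len(tgt_tok),
        length_ratio=max(len(src_tok), len(tgt_tok)) / shorter,
        shared_words=len(set(src_tok) & set(tgt_tok)),
        unique_words=len(set(src_tok) | set(tgt_tok)),
    )


if __name__ == "__main__":
    pair = ("The cat sleeps .", "El gato duerme .")
    if passes_hard_rules(*pair):
        print(extract_features(*pair))
```

In this sketch, pairs rejected by `passes_hard_rules` never reach the classifier, which mirrors the pre-filtering role of the hard rules described above.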
To install Bicleaner, you can use the following commands:

```
pip install bicleaner
pip install --config-settings="--build-option=--max_order=7" https://github.com/kpu/kenlm/archive/master.zip
```

For Bicleaner AI, you can install it using:

```
pip install bicleaner-ai
pip install --config-settings="--build-option=--max_order=7" https://github.com/kpu/kenlm/archive/master.zip
```

After installation, you can use Bicleaner by running the `bicleaner-classify` command, or Bicleaner AI by running the `bicleaner-ai-classify` command [3][6]. For more detailed examples and instructions on how to train your own models, see the Bicleaner GitHub repository [2] and the Bicleaner AI GitHub repository [5].

Here is a simplified overview of the Bicleaner implementation:

1. Load the labeled dataset of parallel sentence pairs.
2. Split the dataset into training and testing sets.
3. Train a machine learning model on the training set.
4. Evaluate the model on the testing set.
5. Use the trained model to classify new sentence pairs as noisy or not.
6. Remove noisy sentence pairs from the parallel corpus.

Here are some of the benefits of using Bicleaner to clean parallel corpora:

1. Improved machine translation quality: removing noisy sentence pairs from the training corpus can improve the quality of the resulting translations.
2. Reduced training time: a corpus with the noisy pairs removed is smaller, so machine translation models can take less time to train on it.
3. Improved model robustness: Bicleaner can help to improve the robustness of machine translation models to noise in the input text.
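The six-step overview above can be sketched end to end in plain Python. This is a toy under stated assumptions, not Bicleaner's implementation: the labeled pairs are invented, all function names are hypothetical, and the classifier is a single-threshold rule on the length ratio, swapped in for the Extremely Randomized Trees and XLM-RoBERTa classifiers so that the example needs only the standard library.

```python
import random

# Step 1: hypothetical labeled pairs (source, target, label); 1 = clean, 0 = noisy.
LABELED = [
    ("the cat sleeps", "el gato duerme", 1),
    ("i like green apples", "me gustan las manzanas verdes", 1),
    ("the house is big", "la casa es grande", 1),
    ("i have a black dog", "tengo un perro negro", 1),
    ("click here", "compre ahora y reciba un descuento en todos los productos", 0),
    ("the weather is nice today in the city", "hola", 0),
    ("home", "bienvenido a la tienda oficial de la marca", 0),
    ("terms and conditions apply to all orders", "vale", 0),
]


def length_ratio(src, tgt):
    a, b = len(src.split()), len(tgt.split())
    return max(a, b) / max(1, min(a, b))


def split_dataset(rows, test_fraction=0.25, seed=0):
    # Step 2: shuffle a copy and hold out a fraction for testing.
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def predict(threshold, src, tgt):
    # A pair counts as clean (1) when its length ratio is at most the threshold.
    return 1 if length_ratio(src, tgt) <= threshold else 0


def accuracy(threshold, rows):
    # Step 4: fraction of labeled pairs classified correctly.
    return sum(predict(threshold, s, t) == y for s, t, y in rows) / len(rows)


def train_threshold(rows):
    # Step 3: try midpoints between observed ratios, keep the most accurate one.
    ratios = sorted({length_ratio(s, t) for s, t, _ in rows})
    candidates = [(lo + hi) / 2 for lo, hi in zip(ratios, ratios[1:])]
    best, best_acc = candidates[0], -1.0
    for threshold in candidates:
        acc = accuracy(threshold, rows)
        if acc > best_acc:
            best, best_acc = threshold, acc
    return best


def clean_corpus(threshold, pairs):
    # Steps 5 and 6: classify new pairs and drop the ones predicted noisy.
    return [(s, t) for s, t in pairs if predict(threshold, s, t) == 1]


if __name__ == "__main__":
    train, test = split_dataset(LABELED)
    threshold = train_threshold(train)
    print("test accuracy:", accuracy(threshold, test))
    new_pairs = [
        ("thank you", "gracias"),
        ("ok", "este es un texto muy largo que no corresponde"),
    ]
    print(clean_corpus(threshold, new_pairs))
```

A real classifier would use many more features and a far larger labeled set; the single threshold only keeps the train, evaluate, classify, and filter steps easy to follow.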