# LanguageTool GSoC proposal

---

## About me

* name: Oleg Serikov
* email: srkvoa@gmail.com
* github: [oserikov](https://github.com/oserikov/)

I'm a 4th-year undergraduate student at Moscow Aviation Institute in Moscow, Russia. My interests include AI and high-load systems, and I'm experienced in backend development. Here is my developer [portfolio page](https://oserikov.github.io/).

As a LanguageTool contributor I've fixed two bugs ([issue #373](https://github.com/languagetool-org/languagetool/issues/373), [issue #652](https://github.com/languagetool-org/languagetool/issues/652)). Currently I'm working on the ideas listed below.

---

## Project info

During GSoC I'm going to complete the following tasks:

1. Improve the suggestion sorting algorithm using a machine learning approach inspired by the one used in [After the Deadline](https://open.afterthedeadline.com/about/technology-overview) (section "The Spelling Corrector")
2. Migrate the server side of LanguageTool to a modern lightweight framework
3. Migrate LT from Maven to Gradle

The listed tasks and their advantages are described in detail below.

---

### 1. Suggestions sorting improvement

The task was [discussed on the forum](https://forum.languagetool.org/t/spellchecker-improvement-discussion/2677/3). The main benefit of solving it is a better user experience: having the needed replacement on the first line of the suggestion box is handy, much as it usually is with Google search suggestions.

The task is composed of the following subtasks:

* implement a first version of the machine-learning-based suggestion sorting algorithm, to be used in the next step
* add support for machine-learning-based sorters to LT
* continuously improve the suggestion sorting algorithm until it reaches satisfactory quality

#### The starting version implementation

In this part of the task I will develop and train a neural network that sorts the suggestions, using n-gram probabilities and the edit distance as features. At this stage there will be only one model for all languages.

#### Adding the ml-based algorithms support to the LT

Here I will make LT use the model developed in the previous step to sort the suggestions.

#### Continuous enhancement of the corrections sorting algorithm

Here are the ideas I'd like to try:

* Use the keyboard distance between letter keys as a feature
* Use separate models for different languages (the model used for a language is trained on that language's data)
* Use separate models for groups of languages (the model used for a language is trained on data from its group of languages)
* A deep learning approach (an RNN, trained on powerful machines with GPUs)
* Take the phonetic form of words into account (e.g. one could use the Metaphone codes of the words when estimating the probability of a correction)
* A non-ML approach: assuming that most errors are at an edit distance of 1 or 2 from the correct spelling, we could simply sort the corrections first by edit distance, then by probability
* Mix the non-ML approach with the ML approach

**The models' comparison**. Two ways to compare model quality on the test data were [discussed on the forum](https://forum.languagetool.org/t/spellchecker-improvement-discussion/2677/16). The first way is to count how often the top-1 suggestion predicted by the model was the one selected by the user.
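To make the first metric concrete, here is a minimal Python sketch that ranks candidates with the non-ML baseline from the list of ideas (edit distance first, then probability) and measures how often the top suggestion is the one the user picked. It is only an illustration under stated assumptions: the function names, the unigram probabilities and the sample data are hypothetical and are not part of LT.

```python
from typing import Callable, Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep on a match
            ))
        prev = cur
    return prev[-1]


def baseline_sort(misspelled: str, candidates: List[str],
                  probability: Dict[str, float]) -> List[str]:
    """Non-ML baseline: smaller edit distance first, then higher probability."""
    return sorted(
        candidates,
        key=lambda c: (edit_distance(misspelled, c), -probability.get(c, 0.0)),
    )


Sorter = Callable[[str, List[str]], List[str]]


def top1_hit_rate(sorter: Sorter,
                  samples: List[Tuple[str, List[str], str]]) -> float:
    """Share of samples where the first suggestion is the one the user chose."""
    hits = sum(1 for typo, cands, chosen in samples
               if sorter(typo, cands)[0] == chosen)
    return hits / len(samples)


# Hypothetical word probabilities and user choices, for illustration only.
PROBABILITY = {"the": 0.05, "then": 0.004, "than": 0.003, "ten": 0.001}
SAMPLES = [
    ("thw", ["ten", "then", "the"], "the"),    # (typo, candidates, user's pick)
    ("thn", ["than", "the", "then"], "then"),
]


def sorter(typo: str, cands: List[str]) -> List[str]:
    return baseline_sort(typo, cands, PROBABILITY)


rate = top1_hit_rate(sorter, SAMPLES)
print(f"top-1 hit rate: {rate:.2f}")
```

Any other sorter, including a trained model, can be passed to `top1_hit_rate` as long as it takes the misspelled word and the candidate list and returns the candidates in ranked order, so different models can be compared on the same samples.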
The second approach is to score the suggestion selected by the user with a convex function of its position in the suggestions list.

---

### 2. Switching to lightweight framework

The task was [discussed on the forum](https://forum.languagetool.org/t/use-a-modern-framework-for-embedded-http-server-discussion/2687). The benefits of solving it are better performance and simpler maintenance of the LT server. Modern frameworks make it possible to write easy-to-read code and provide many useful features out of the box.

While comparing the different frameworks (see the "Choice of the framework" section below) I had to implement most of the LT server functionality using these frameworks, so I already have a sort of **prototype ready**. It is written so that migrating to another framework is easy, because the logic is encapsulated in a framework-independent part of the code.

The task is composed of the following subtasks:

* choose the framework to switch to
* migrate the LT server side to that framework
* fine-tune the implementation to improve the overall performance

#### Choice of the framework

There are at least three solutions to choose from: SparkJava, SpringBoot, and Spring WebFlux (the reactive approach suggested on the [forum](https://forum.languagetool.org/t/use-a-modern-framework-for-embedded-http-server-discussion/2687/12)). Their performance will be compared with load tests, and the most efficient framework will be chosen. I've already compared the SparkJava and SpringBoot implementations of the LT server, and SpringBoot seems to be more efficient. The detailed report is posted on [the forum](https://forum.languagetool.org/t/use-a-modern-framework-for-embedded-http-server-discussion/2687/23).

#### LT migration

Most of the functionality is migrated in the previous step; here I'll migrate the rest of the functions.
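To check that the migrated server does not regress, and to drive the load-test comparison described under "Choice of the framework", a small script along the following lines could be pointed at each prototype in turn. This is a rough sketch rather than the harness used for the report on the forum: it assumes a server listening locally on port 8081 and exposing the `/v2/check` endpoint, and the helper names are hypothetical.

```python
import statistics
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Assumed endpoint of a locally running LT server; adjust for each candidate.
CHECK_URL = "http://localhost:8081/v2/check"


def send_check_request(text: str = "This is an exmaple.",
                       language: str = "en-US") -> None:
    """POST one check request and read the whole response."""
    body = urllib.parse.urlencode({"text": text, "language": language}).encode()
    with urllib.request.urlopen(CHECK_URL, data=body, timeout=30) as response:
        response.read()


def timed(send: Callable[[], None]) -> float:
    """Return the wall-clock duration of one call, in seconds."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def run_load_test(send: Callable[[], None], n_requests: int,
                  concurrency: int) -> Dict[str, float]:
    """Fire n_requests calls from a thread pool and summarise the latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies: List[float] = sorted(
            pool.map(lambda _: timed(send), range(n_requests)))
    return {
        "requests": float(n_requests),
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": latencies[-1],
    }
```

With a server running, `run_load_test(send_check_request, n_requests=200, concurrency=8)` returns a latency summary. Running it with identical parameters against each implementation gives comparable numbers. A failed request raises an exception instead of being counted, which keeps errors from silently improving the averages.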

#### Fine-tuning the performance

During this stage I'll finish configuring the developed solution.

---

### 3. Replacing Maven with Gradle

The task was [discussed on the forum](https://forum.languagetool.org/t/use-a-modern-framework-for-embedded-http-server-discussion/2687/28). The benefit of solving it derives from Gradle's advantages, explained below. The work is predominantly routine: I will rewrite the build logic module by module.

#### Gradle advantages explained

Gradle

* offers a less verbose syntax that is more human-readable
* has good native support for unit testing (so the Surefire plugin is no longer needed)
* is [faster](https://gradle.org/gradle-vs-maven-performance/) than Maven, which is especially helpful when running tests

Since the build logic of LT is not very complex, Gradle's well-known flexibility is not the key feature for now, but it could be useful in the future.

---

## The timeline

#### Now – May 14

Choosing the framework to migrate to, exploring the popular approaches to sorting spellchecker suggestions, and implementing the first ML-based sorting solution.

#### May 14 – 18

Integration of the sorting solution with LT.

#### May 21 – 25

Testing.

#### May 28 – June 1, June 4 – 8, June 11 – 15, June 18 – 22

Continuous enhancement of the corrections sorting solution.

#### June 25 – 29

Ensure that everything works fine and fix any bugs found.

#### July 2 – 6, July 9 – 13

Migration to the modern framework.

#### July 16 – 20

Testing the framework-based implementation of the server side (and adding tests), tuning the performance.

#### July 23 – 27

Migration from Maven to Gradle.

#### July 30 – August 3

Migration from Maven to Gradle, updating the documentation, testing.

#### August 6 – 14, final week

Ensure that everything works fine, update the documentation if needed, submit the final work.