Current Streams

# Current Streams

**In order of importance:**

- Helping Bot Dev team wherever they need help
- Translation pipeline (bugfixes)
- Number Parser Bot v1 (issues / bug fixes)
- digits-parser (issues / bugfixes)
- Jenkins Model training pipeline
- Boilerplate repo additions (bugfixes, enhancements)
- MLFlow Monitoring pipeline
- Transliteration pipeline (fixes - includes current experiments)
- Conversation Service
  - Core logic
  - Sentence splitter
  - Sentence splitter decider
- Number Parser Bot v2

**SPOCs**

- [@Akash](@Akash) : Number parser bot
- [@Nihit](@Nihit) : IndiaFirst, Number parser bot
- [@Nitin](@Nitin) : Transliteration Experiment
- [@Srinath](@Srinath) : Everything else & for planning across the board (all streams)

If someone approaches you about things that are not under you, you can either do the work yourself or hand them over to the SPOC, who should be able to handle it.

# MLFlow

**[@Srinath](@Srinath)**

- [ ] Addition of API Gateway and integration with BitBucket
- [ ] Final round of testing
- [ ] A small walk-through session with Bot Dev team & Bharath in case they're not able to understand it themselves by watching that video
- [ ] Documentation & PR

# Number Parser Bot v2

[@Akash](@Akash), [@Nihit](@Nihit) & [@Srinath](@Srinath)

- [ ] Make a list of all cases that pass currently, and all that don't
- [ ] Also add a list of cases which we would like to support in v2
- [ ] Share the list with the group (maybe even the bot-dev group and Nishant & Saransh & Siddharth) -- get their inputs too (keep format simplified : `user says > bot understood > bot says > user says . . . `)
- [ ] Try to accommodate (or at least think about how the v2 bot would easily extend to have) alphanumeric inputs
- [ ] Come up with possible solutions to it (**everyone's task**)
- [ ] Brainstorming discussion to discuss everyone's solution
  - [ ] If needed, have a second meeting
- [ ] Document the design we've come up with

# Conversation Service

> Milestone : Aim for deployment by Wednesday for English.


[@Akash](@Akash)

- [ ] Flows
- [ ] Test suite

[@Nihit](@Nihit)

- [ ] Test suites
- [ ] Code Review
- [ ] Integration & Deployment
- [ ] Add modes to make it a pass-through service (if not done so far)

[@Srinath](@Srinath)

- [ ] Review what's been done so far & update the plan accordingly
- [ ] Help out if need be in any parts of core logic
- [ ] Sentence Splitter
  - [ ] Go through dataset lists, and select the datasets you'd use. Get the decision reviewed by *@Nitin*
  - [ ] Download and cleaning of datasets
    - [ ] Remember Indian English can contain simple normal English sentences too. So, add some pure English sentences there as well
  - [ ] Understand the pre-processing needed for datasets
  - [ ] Pre-processing for datasets (make sure to use the same kind of pre-processing as used by DeepSegment folks)
  - [ ] Training an initial model for Indian English
  - [ ] Training an initial model for Hindi

# Transliteration Experiments

## Common

[@Bhavul](@Bhavul)

- [ ] Generate testing data by writing a script to programmatically extract the text via Rasa-X APIs
  - [ ] Must contain sentences in one sheet tagged by ID
  - [ ] Must contain conversation ID -- so we can link back
  - [ ] Must contain a separate word list, with links to the sentences in which each word occurs

[@Srinath](@Srinath) + [@Nitin](@Nitin)

- [ ] Manually assign correct intents wherever they're wrong, and/or get it reviewed by Nishant

[@Nitin](@Nitin)

- [ ] Run Nihit's script to get Redis hits / misses for words
- [ ] Analyse Redis misses in this testing data

## 1. TTS to generate variations

[@Nitin](@Nitin)

- [ ] Build Devanagari data for the testing data so we can feed it as input to TTS, for Redis misses and also for some of the words that were Redis hits (just to see whether variations come from those or not)
- [ ] Speak to Pranav on what else we can vary other than speed, whether using any SSML tags would make sense to generate variations, and whether any other options exist which are not exposed to other teams yet
- [ ] For every word we're going to use this script for, generate 2-3 sentences in which it is used (Google search, or search in our training data?), as well as the word itself
- [ ] Build a script to pass an input text to TTS with varying features (speed, etc.) and pass through the above words (as well as their sentences)
- [ ] Write a script to pass the generated audio WAV files through Google ASR
- [ ] Add the ASR transcriptions to the testing data and compare if variations got generated
- [ ] Calculate variation score (ask @srinath if you didn't understand the metric), and other metrics you feel are necessary
- [ ] If variations get generated (for the following tasks, you may need to take some KT / help from Srinath):
  - [ ] Discuss with Srinath how to add variations into a new Redis instance (don't change the old one)
  - [ ] Launch a new Redis and add to it all words from the previous Redis and the new word variations we generated
  - [ ] Launch and deploy a new Rasa-X instance with the latest Meesho changes, which connects to this new Redis we've launched (and uses the transliteration pipeline) -- take help from Srinath to understand how to get this done, but it is highly recommended that you try to do it once yourself -- it gives a good idea of how to launch & deploy Rasa-X.
  - [ ] Make API calls with our testing data sentences (not words) to get their intents and compare them with
    - [ ] current production (translation pipeline)
    - [ ] transliteration pipeline pointing to the old Redis, which doesn't have variations
  - [ ] Document results
- [ ] If no variations were generated:
  - [ ] Move to one of the other experiment threads

## 2. Heuristic generation

- [ ] Understand what's happening in these two repositories:
  - [ ] Pho-hinix (Java) : https://github.com/Priiyam/Pho-hinix/blob/master/src/soundexNew.java
  - [ ] Masala Merge (Python) : https://github.com/paulnov/masala-merge/blob/master/lev.py
- [ ] See if any other resources also come up on this, or make a set of rules which we will code
- [ ] Write a script to generate variations of words
- [ ] Do the following tasks (take KT / help from Srinath):
  - [ ] Discuss with Srinath how to add variations into a new Redis instance (don't change the old one)
  - [ ] Launch a new Redis and add to it all words from the previous Redis and the new word variations we generated
  - [ ] Launch and deploy a new Rasa-X instance with the latest Meesho changes, which connects to this new Redis we've launched (and uses the transliteration pipeline) -- take help from Srinath to understand how to get this done, but it is highly recommended that you try to do it once yourself -- it gives a good idea of how to launch & deploy Rasa-X.
  - [ ] Make API calls with our testing data sentences (not words) to get their intents and compare them with
    - [ ] current production (translation pipeline)
    - [ ] transliteration pipeline pointing to the old Redis, which doesn't have variations
  - [ ] Document results
- [ ] Move to next experiment

## 3. ASR Mixed Script output Experiment

- [ ] Take some KT / help from Srinath and set up a clone of Rasa-X with the transliteration pipeline and the latest code
- [ ] Ask Tarun to set up a new Google ASR with Hindi as the primary language and English as the secondary
- [ ] Send through the audio files of the conversation IDs that we have in our testing data, and get the new transcriptions in mixed script
- [ ] Map these transcriptions in the testing data
- [ ] Through APIs, get intent classification results for these mixed-script transcriptions from our new Rasa-X deployment with the transliteration pipeline (still connected to the old Redis only)
- [ ] Move to next experiment

## 4. Phonetic Matching

[@Bhavul](@Bhavul)

- [ ] Understand the PhoneticMatching library from Microsoft -- various possible inputs and variables
- [ ] Understand the inexactsearch library, its inputs, and how it does the matching
- [ ] Research how we would implement nearest neighbour over Redis, which has 1.8M words
- [ ] Write code / script / DB function to add the functionality of using these libraries to find the nearest word from Redis
- [ ] Write it as a custom component? And deploy a new Rasa-X instance with the latest code
- [ ] Generate intent classification results on our testing dataset and compare with others
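
As a starting point for the nearest-neighbour research above, here is a minimal sketch of one possible approach: bucket words by a coarse phonetic key so that a lookup only compares against words sharing the key, instead of scanning all 1.8M entries. Everything here is an assumption for illustration: `phonetic_key`, `build_index` and `nearest_word` are hypothetical names, the key function is a toy consonant skeleton (not the PhoneticMatching or inexactsearch algorithm), and a plain dict stands in for Redis, where each bucket could perhaps be stored as a set under its key.

```python
from collections import defaultdict


def phonetic_key(word: str) -> str:
    """Toy phonetic key (assumption): first letter plus consonant skeleton.

    Vowels after the first letter are dropped and repeated consonants are
    collapsed, so close spelling variants tend to share a key.
    """
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiou" or not ch.isalpha():
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_index(words):
    """Group words by phonetic key (a dict standing in for Redis sets)."""
    index = defaultdict(set)
    for w in words:
        index[phonetic_key(w)].add(w)
    return index


def nearest_word(query, index, max_distance=2):
    """Return the closest word in the query's bucket, or None if none is near."""
    best, best_dist = None, max_distance + 1
    # Sorted so that ties are broken deterministically.
    for cand in sorted(index.get(phonetic_key(query), ())):
        d = edit_distance(query, cand)
        if d < best_dist:
            best, best_dist = cand, d
    return best


index = build_index(["nahi", "chahiye", "paisa"])
print(nearest_word("nahee", index))
print(nearest_word("chaiye", index))
print(nearest_word("order", index))
```

The trade-off in this sketch is that words whose keys differ are never compared, so recall depends entirely on the key function; swapping in the real libraries' matching would change that behaviour.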