# Labeling Tool Product Design

The labeling tool is used to manage models and data for machine learning classification.

## Terminology

- **Model**: a model is a machine learning algorithm paired with a set of parameters. These parameters are usually learned/trained from a labeled dataset. Given an input, the model can predict the corresponding output even if the input does not appear in the labeled training dataset.
- **Conflicted Example**: a conflicted example is an example whose labeled output is not the same as the predicted output. The conflict may be due to either a model error or a labeling error. A conflicted example is resolved when 1) its label is corrected so that it matches the prediction; or 2) the model is improved so that the prediction matches the label.

## Editor Account

- An editor* has an account
- An editor is able to register to create an account
- An editor is able to log in and log out of the account

\* Editor: an editor is a person (engineer or non-engineer) who can log in to the labeling tool system and do data and model management work.

## Editor User Profile

- An editor can find the list of campaigns* she is working on
- An editor can find her daily work progress: how many examples** she works on every day
- An editor can find other editors' work progress (TBD)

\* Campaign: a campaign is associated with a list of categories the editor wants to add labels for. One campaign can have one or more battles.

\** Example: an example contains at least an input (e.g., a sentence). An unlabeled example has only an input. A labeled example has an input and an output (e.g., the category of the sentence).

- **Unlabeled Dataset**: an unlabeled dataset is a set of examples. Each example has only an input (e.g., a sentence) and no output.

## Leaderboard

- The leaderboard displays the overall performance of the system, including:
  - the number of categories that perform better than a list of thresholds
  - the performance of each category
  - the number of available examples* for each category
  - the existing live campaigns** associated with each category
- An editor is able to select a list of categories from the leaderboard to start a campaign

\* Available examples: once there is a model, the model runs prediction on a random sample of the unlabeled dataset; the examples predicted as a particular category are the available examples for that category.

\** Live campaign: after a campaign starts, an editor can stop it. A live campaign is a campaign that has not been stopped yet.

## Campaign

- creation
  - An editor can create a campaign by selecting the list of categories she wants to associate with this campaign.
- access
  - An editor can find a campaign (live or stopped) by searching for its campaign ID
  - An editor can find a live campaign by clicking its link on the leaderboard, where live campaigns are listed per category
- display: the campaign display page should include the following information
  - meta: campaign ID, owner, start/stop time
  - number of categories whose performance is better than the thresholds
  - performance per category
  - battles* in this campaign
- interaction:
  - an editor can stop a campaign
  - an editor can deactivate/activate categories in a campaign (once a category is deactivated, it is temporarily not associated with the campaign; an editor might want to do this once the performance of that category is good enough)
  - an editor can start a battle in a campaign

\* Battle: a battle is associated with a set of examples. When an editor starts a battle, the system prepares a list of unlabeled examples. The editor then labels these examples; after the labeling, all these unlabeled examples become labeled examples. (A rough data-model sketch of examples, campaigns, and battles follows below.)
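The spec above describes examples, campaigns, and battles only in prose. As a point of reference, here is a minimal data-model sketch, assuming a Python backend; all class and field names (`Example`, `Battle`, `Campaign`, `input_text`, `prediction`, and so on) are illustrative assumptions, not part of this design.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical data model; names and fields are assumptions, not part of this spec.

@dataclass
class Example:
    input_text: str                    # e.g., a sentence
    label: Optional[str] = None        # category assigned by an editor (None if unlabeled)
    prediction: Optional[str] = None   # category predicted by the current model

    def is_conflicted(self) -> bool:
        """A conflicted example has both a label and a prediction, and they disagree."""
        return (
            self.label is not None
            and self.prediction is not None
            and self.label != self.prediction
        )

@dataclass
class Battle:
    battle_id: str
    examples: list = field(default_factory=list)   # Example objects to be labeled in this battle

@dataclass
class Campaign:
    campaign_id: str
    owner: str                                      # editor who created the campaign
    categories: dict = field(default_factory=dict)  # category name -> active flag (bool)
    battles: list = field(default_factory=list)     # Battle objects started from this campaign
```

Under this sketch, resolving a conflicted example means either correcting its label or retraining the model so its prediction changes; both paths make `is_conflicted()` return `False`.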
## Battle

- creation
  - an editor can start a battle from a campaign page
    - the system finds the examples predicted as active categories in the campaign and adds them into the battle (see the battle-preparation sketch after this section)
    - the editor can specify the following parameters for the battle
      - the maximal number of examples to be labeled in the battle
      - the maximal number of examples displayed on one page
      - the maximal number of suggested categories (options) provided for each example
  - an editor can start a battle from a list of failed examples*: the system finds examples similar to the failed examples and adds them to the battle
- access
  - an editor can find a battle by directly searching for its ID
  - an editor can click the battle link on the campaign display page to find a battle
- display: the battle page contains the following information
  - battle meta: battle ID, editor, start/end time, number of examples, etc.
  - overall performance
  - performance per category
  - examples to be labeled (if available)
- example label display
  - it can show one or more examples to be labeled on one page
  - it can show one or more questions (suggested categories) for each example
  - for each question, it provides a few options for the editor to label, e.g., "correct", "incorrect", and "not sure"
  - if none of the suggested categories is correct, the editor is able to add a new category; when the editor adds a new category, she should get autocomplete suggestions that match the partial category name typed
  - when the editor submits

\* Failed Example: a failed example is an example whose prediction is inconsistent with its label, and whose label has been confirmed by an editor. A failed example can be recovered when the model is improved and the prediction matches the label.

[example UI in spreadsheet](https://docs.google.com/spreadsheets/d/1pFFYTnoQh4ktEKbLtsueIOuM_P_WC36y6JLhNKr-OpA/edit?usp=sharing)
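The document says the system gathers unlabeled examples predicted as the campaign's active categories and honors the limits the editor sets, but it does not prescribe how. The sketch below is one plausible implementation under those assumptions, reusing the hypothetical `Example` and `Campaign` classes from the data-model sketch above; the function name and signature are illustrative, not part of the spec.

```python
from __future__ import annotations

import random

# `Example` and `Campaign` refer to the hypothetical data-model sketch above.

def prepare_battle_examples(
    campaign: Campaign,
    unlabeled_pool: list[Example],
    max_examples: int,
    page_size: int,
) -> list[list[Example]]:
    """Collect unlabeled examples predicted as an active campaign category,
    cap them at max_examples, and chunk them into pages of page_size."""
    active = {name for name, is_active in campaign.categories.items() if is_active}
    candidates = [
        ex for ex in unlabeled_pool
        if ex.label is None and ex.prediction in active
    ]
    random.shuffle(candidates)              # avoid bias toward insertion order
    selected = candidates[:max_examples]
    return [selected[i:i + page_size] for i in range(0, len(selected), page_size)]
```

A battle started from failed examples would draw from a different candidate pool (examples similar to the failed ones) but could share the same capping and pagination step.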
## Data Cleaning Feature

- Goal: users are able to navigate/search conflicts between label and prediction on the training and validation datasets

### Data Cleaning: Main UI

- show overall conflict information on the training data
  - total number of training examples
  - total number of conflicts
- category-level stats in a table, including the following columns (a stats-computation sketch appears at the end of this document)
  - name of the category
  - number of examples
  - number of conflicts
  - number of conflicts that are not resolved (not yet added into the queries to be corrected)
  - percentage of all conflicts
  - precision/recall/f1
- all the categories can be sorted by one of the following
  - percentage of all conflicts
  - f1 score
- one category can be selected for cleanup

### Data Cleaning: Example List UI

- show a list of editable examples (with pagination)
- each example includes the following information
  - sentence
  - predicted categories
  - label categories
  - editor of the label
  - editor timestamp
- you can delete an example if its input is too ambiguous or confusing
- you can add correct categories from the label or prediction column: if the correct categories are inconsistent with the prediction, this query is added to the failed query list
- show the top n (e.g., 10) pairs of conflicted categories and the number of unresolved conflicts
  - clicking on a pair of conflicted categories displays only the examples of that pair
- there is a tab where you can see all examples that have been added to the failed query list

## Model Correction

- Goal: users are able to
  - add new examples to be corrected
  - correct the examples from the data cleaning

### Acknowledged Query Navigation UI

- show stats of failed queries
  - the total number of failed queries: fixed, not fixed
- show per-category stats of failed queries
  - category name
  - number/percentage of failed queries
  - clicking on the category name shows the list of failed queries associated with this category
- you can search for failed queries by
  - text search
  - label category name
  - prediction category name
  - editor

### Failed Query Example List UI

- when you click on a category in the per-category stats or run a failed query search, you get a list of failed queries
- each failed query has a checkbox, and you can select queries to start a battle
  - the backend then runs a round of related query acquisition
  - the battle shows up on the user's profile page
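The Data Cleaning main UI above lists per-category example counts, conflict counts, and precision/recall/f1. These are standard classification metrics; the sketch below shows one way to derive them from examples that have both a label and a prediction, again reusing the hypothetical `Example` class from the data-model sketch. The function name and the returned field names are assumptions, not part of this spec.

```python
from collections import defaultdict

# `Example` refers to the hypothetical data-model sketch above.

def category_stats(examples: list) -> dict:
    """Per-category example counts, conflict counts, and precision/recall/F1,
    computed over examples that have both a label and a prediction."""
    tp = defaultdict(int)   # label == prediction == category
    fp = defaultdict(int)   # predicted as the category, labeled as something else
    fn = defaultdict(int)   # labeled as the category, predicted as something else

    for ex in examples:
        if ex.label is None or ex.prediction is None:
            continue
        if ex.label == ex.prediction:
            tp[ex.label] += 1
        else:
            fp[ex.prediction] += 1
            fn[ex.label] += 1

    total_conflicts = sum(fp.values())      # each conflicted example is counted once
    stats = {}
    for cat in set(tp) | set(fp) | set(fn):
        precision = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        recall = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        conflicts = fp[cat] + fn[cat]       # conflicts involving this category
        stats[cat] = {
            "examples": tp[cat] + fn[cat],  # labeled examples of this category
            "conflicts": conflicts,
            "conflict_share": conflicts / total_conflicts if total_conflicts else 0.0,
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return stats
```

Sorting this table by `f1` or by `conflict_share` corresponds to the two sort options listed for the main UI.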