# Label Studio review

## Initial questions

- How to deploy the app?
- Does the app communicate externally?
- Is there any data collection by the software producer?
- How to add a model for pre-annotation? Where does it take place in the code?

## Modules

There are two main modules:

- label-studio: https://github.com/heartexlabs/label-studio
- label-studio-ml-backend: https://github.com/heartexlabs/label-studio-ml-backend

but also:

- label-studio-converter: https://github.com/heartexlabs/label-studio-converter
    - It converts annotated data exports from Label Studio to the desired machine learning data format (e.g. YOLO, COCO, etc.).
    - It might be useful later in the project.

### Label-studio module

It is a data annotation tool containing:

- backend
    - Django and Flask
    - server of the application:
        - storing data
        - managing user accounts
        - processing requests
- frontend
    - ReactJS
    - user interface:
        - managing labeling
        - creating projects
        - importing data
        - etc.
- database
    - PostgreSQL
    - stores data
    - low-level CRUD operations
    - transparent to the user

### Label-studio-ml-backend module

This is a tool for deploying and training machine learning models inside Label Studio.

- You can import models for many tasks: object detection, text classification, image classification, etc.
- The model is wrapped in a web server API.
- A web service is an application that exposes a standardized programming interface (API) so that other applications can communicate with it over the network; here Label Studio communicates with label-studio-ml.
- Model wrapping is done with a Python initialization script containing predict() and fit() functions (see the sketch after the deployment notes below).
- These functions handle the conversion between Label Studio and the model framework (e.g. PyTorch, TensorFlow).

## How to deploy?

### Label-studio

```bash
# Clone the repository
git clone https://github.com/heartexlabs/label-studio.git
cd label-studio

# Install all package dependencies
pip install -e .

# Run database migrations
python label_studio/manage.py migrate
python label_studio/manage.py collectstatic

# Start the server in development mode at http://localhost:8080
python label_studio/manage.py runserver
```

### Label-studio-ml-backend

```bash
git clone https://github.com/heartexlabs/label-studio-ml-backend
cd label-studio-ml-backend
pip install -U -e .
pip install -r label_studio_ml/examples/requirements.txt

# This command creates a folder my_ml_backend using the simple_text_classifier.py script as entry point.
# The script contains functions to train, predict and evaluate the model.
# The init argument creates the files necessary to launch the model as a web service.
label-studio-ml init my_ml_backend --script label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py

# Launch the service (on localhost:9090 by default)
label-studio-ml start my_ml_backend
```

### Linking both servers

Prerequisite: have a project created.

- Go to "Settings" in the project interface.
- Click on the "Machine Learning" tab.
- Add a model.
- Enter a name and a description.
- In the URL field, enter the ML-backend server address:
    - you can find it in the command line output,
    - or in the config file of the model,
    - it is localhost:9090 by default.

### Notes

This deployment is for development only. Concretely, this means there is only 1 worker with 8 threads, so the application cannot handle the workload of several simultaneous users. For the moment this setup is just a quick fix for deploying the application for personal use.
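As noted in the module description above, the ML backend wraps a model in a Python script exposing predict() and fit(). Here is a minimal sketch of such a wrapper, modeled on the bundled examples: the class and method names come from the label_studio_ml SDK (the exact fit() signature varies between versions), while the model loading and returned values are placeholder assumptions, not the actual example code.

```python
from label_studio_ml.model import LabelStudioMLBase


class MyModel(LabelStudioMLBase):
    """Minimal ML backend sketch: Label Studio calls predict() to
    pre-annotate tasks and fit() to retrain on completed annotations."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Placeholder: load the framework-specific model here
        # (e.g. torch.load(...) or tf.keras.models.load_model(...)).
        self.model = None

    def predict(self, tasks, **kwargs):
        predictions = []
        for task in tasks:
            # Placeholder inference: convert task['data'] to model input,
            # run the model, then convert the output to Label Studio's
            # result format. from_name/to_name must match the labeling config.
            predictions.append({
                'result': [{
                    'from_name': 'sentiment',
                    'to_name': 'text',
                    'type': 'choices',
                    'value': {'choices': ['positive']},
                }],
                'score': 0.5,  # model confidence attached to the prediction
            })
        return predictions

    def fit(self, completions, workdir=None, **kwargs):
        # Placeholder training: extract (input, label) pairs from the
        # completed annotations and retrain; the returned dict is handed
        # back to the next instance of the backend as train_output.
        return {'model_path': 'path/to/saved/model'}
```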
## Wrapping a model

A list of example scripts is available in label-studio-ml-backend; they cover tasks like OCR, text classification and image object detection.

## Active Learning

Active learning is a machine learning approach in which an algorithm selects the most informative examples to learn from among a large pool of unlabeled data, then presents these examples to a human expert for labeling to improve the algorithm's performance.

![](https://i.imgur.com/a0P3THp.png)

### Learning loop

The learning loop allows training the model automatically after a batch of annotations, in order to pre-annotate the data more quickly and more accurately.

![](https://i.imgur.com/QxR5SG7.png)

Sequence diagram:

```
+--------------+                 +-----------------+
| Label Studio |                 | Label Studio ML |
+--------------+                 +-----------------+
       | Create Labeling Project        |
       |                                |
       | Create ML Backend              |
       |                                |
       | Link ML Backend                |
       |------------------------------->|
       |                                |
       | Train Model                    |
       |<-------------------------------|
       |                                |
       | Predict with Model             |
       |------------------------------->|
       |                                |
       | View Results                   |
       |<-------------------------------|
       |                                |
       | Update Model                   |
       |------------------------------->|
```

## Community vs Enterprise versions

See the comparison table here: https://labelstud.io/guide/label_studio_compare.html

The main difference is that the paid version offers technical support from Heartex, the company at the origin of this software. The paid version also offers more features to manage teams of annotators (performance statistics, accounts, etc.). These functionalities are not useful for our purpose.

### Useful functionalities we don't have

- Learning loop:
    - the model does not retrain with each new annotation;
    - you have to retrain it and reload the backend with the new model.
- Sorting data by score:
    - this is very useful to correct the most erroneous annotations first and increase the performance of the model more quickly.

## External requests

### According to Label Studio

"*Label Studio collects anonymous usage statistics about the number of page visits and data types being used in labeling configurations that you set up. No sensitive information is included in the information we collect. The information we collect helps us improve the experience of labeling data in Label Studio and helps us plan future data types and labeling configurations to support.*"

Link here: https://labelstud.io/guide/get_started.html#:~:text=Label%20Studio%20is%20an%20open,exploring%20multiple%20types%20of%20data.

### According to my tests

#### 4 tests

- Running the apps without an internet connection.
- Stack trace from the command line.
- Network tab in the Chrome inspector.
- Wireshark packet tracing:
    - this is very low-level information, difficult to understand and filter;
    - I need to investigate further to get the desired information.

#### Conclusion

No external requests regarding the **data**. The apps run fine without internet access. The web dev inspector and the stack trace show that connections are attempted for JavaScript libraries.

web dev inspector:
![](https://i.imgur.com/Gqp87GD.png)

stack trace:
![](https://i.imgur.com/Gc0767D.png)

Failure to load this external content does not interrupt the software or cause any disturbance. **But** I recommend working offline for now, because unsolicited requests remain possible (after a certain time? when loading the app? when shutting down?...). Some requests may not be logged in the stack trace nor in the web dev inspector.
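One more test that could complement the four above: block outbound connections at the Python level before starting the server, so that any hidden request fails loudly instead of silently. A minimal sketch, assuming the requests go through Python's socket module (requests made by the browser frontend or by C extensions would not be caught):

```python
import socket

ALLOWED_HOSTS = {"127.0.0.1", "localhost", "::1"}

_original_connect = socket.socket.connect

def guarded_connect(self, address):
    # AF_INET / AF_INET6 addresses are (host, port, ...) tuples
    host = address[0] if isinstance(address, tuple) else address
    if host not in ALLOWED_HOSTS:
        raise ConnectionError(f"blocked outbound connection to {host!r}")
    return _original_connect(self, address)

socket.socket.connect = guarded_connect
```

Dropping this into a sitecustomize.py (or importing it at the top of manage.py) would apply the guard to the whole server process.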
The explanation above is not sufficient to understand the application, because it does not capture the details of the deployment; one should try the two following examples to get deeper into the code.

## Project example: annotating fish videos

This example project does not involve the ml-backend to pre-annotate data. It is a 'hack' to accelerate the annotation of images in an object detection task, using the very good video annotation tool. Indeed, why annotate images one by one when they form a sequence? We might as well use video annotation.

#### Video annotation tool

- There is a good demo here: https://www.youtube.com/watch?v=Grp6UB_zB0Y&t=1872s
- The principle is to annotate at different points of the video called 'keyframes'.
- The frames in the interval are automatically annotated in a coherent way.
- Label Studio implements a linear interpolation of bboxes from one keyframe to another.
- Video formats accepted by Label Studio: MPEG-4/H.264, webp, webm.

#### Data

- https://alzayats.github.io/DeepFish/
- A dataset consisting of a series of images of fish in natural environments.
- I made a quick jpg-to-mp4 conversion script, below. It has to be reworked to adapt the output name, which for now is just 'out.mp4'.

```bash
#!/bin/bash

# Set the path to the first-level directory
FIRST_LEVEL="/Users/benjamin/Developments/label_studio/label-studio-ml-backend-master/fish_dataset/DeepFish/Classification"

# Iterate through each second-level directory
for dir in "${FIRST_LEVEL}"/*/; do
    if [ -d "$dir" ]; then
        echo "Processing ${dir}"
        for dir2 in "${dir}"*/; do
            if [ -d "$dir2" ]; then
                echo "Processing ${dir2}"
                # Convert all jpg files to MPEG-4/H.264 video at 25 fps with ffmpeg
                ffmpeg -framerate 25 -pattern_type glob -i "${dir2}"'*.jpg' -c:v libx264 -pix_fmt yuv420p "${dir2}out.mp4"
            fi
        done
    fi
done
```

- Be careful to use a web browser compatible with the chosen video format: https://caniuse.com/?search=video%20format

#### Launch label-studio

```bash
python label_studio/manage.py runserver
# or
label-studio start

# now reach http://localhost:8080 in your web browser
```

#### Sign in

- Create an account.
- The account is stored locally in the PostgreSQL database.
- Credentials, project information and data are stored locally.

#### Project creation

These steps can also be scripted with the Label Studio SDK; see the sketch after the annotation steps below.

- Click on the 'Create' button.
- Give the project a name.
- Data import: import the recently made video.
- Labeling setup:
    - open 'Templates';
    - choose 'Videos';
    - choose 'Video Object Tracking';
    - add labels: write the labels separated by line breaks;
    - switch to 'Code' mode and edit the 'video' tag: its 'framerate' attribute should match the one chosen in the ffmpeg conversion (25 in our example).
- Save.

#### Annotation

- The 'Label All Tasks' button pops up unannotated videos in random order.
- Otherwise, manually choose the videos to annotate.
- Define 'keyframes', i.e. frames on which we place the bboxes and the label.
- From these keyframes, Label Studio does a linear interpolation.
- The result is bboxes which 'follow' the object from frame to frame.
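As mentioned in the project creation steps, the same setup can be scripted. A sketch using the label-studio-sdk package, untested here: the API token is found under 'Account & Settings' in the UI, and the labeling config is my approximation of the 'Video Object Tracking' template (the label 'poisson1' matches the export shown in the next section).

```python
from label_studio_sdk import Client

# Approximation of the 'Video Object Tracking' template; 'framerate'
# must match the ffmpeg conversion (25 in our example).
LABEL_CONFIG = """
<View>
  <Labels name="videoLabels" toName="video">
    <Label value="poisson1"/>
  </Labels>
  <Video name="video" value="$video" framerate="25.0"/>
  <VideoRectangle name="box" toName="video"/>
</View>
"""

# API token from 'Account & Settings' in the Label Studio UI (placeholder)
ls = Client(url="http://localhost:8080", api_key="YOUR_API_TOKEN")

project = ls.start_project(title="Fish videos", label_config=LABEL_CONFIG)

# Import one task per video; the 'video' key matches $video in the config
project.import_tasks([{"video": "http://localhost:8080/data/upload/out.mp4"}])
```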
#### Results

The export of the results is a JSON file; here is an example:

```json
[
  {
    "id": 47,
    "annotations": [
      {
        "id": 10,
        "completed_by": 2,
        "result": [
          {
            "value": {
              "framesCount": 1142,
              "duration": 47.541667,
              "sequence": [
                {
                  "frame": 1,
                  "enabled": true,
                  "rotation": 0,
                  "x": 16.587677725118482,
                  "y": 9.557661927330173,
                  "width": 3.554502369668245,
                  "height": 8.636124275934703,
                  "time": 0.041666666666666664
                },
                {
                  "x": 16.113744075829384,
                  "y": 10.400210637177457,
                  "width": 3.554502369668245,
                  "height": 8.636124275934703,
                  "rotation": 0,
                  "frame": 2,
                  "enabled": true,
                  "time": 0.08333333333333333
                },
                ...
              ],
              "labels": [
                "poisson1"
              ]
            },
            "id": "0TUS4LF4hB",
            "from_name": "box",
            "to_name": "video",
            "type": "videorectangle",
            "origin": "manual"
          },
          ...
        ]
      }
    ],
    "file_upload": "e1917952-out.mp4",
    "drafts": [],
    "predictions": [],
    "data": {
      "video": "\/data\/upload\/9\/e1917952-out.mp4"
    },
    "meta": {},
    "created_at": "2023-05-02T14:55:46.864882Z",
    "updated_at": "2023-05-02T15:03:10.980673Z",
    "inner_id": 1,
    "total_annotations": 1,
    "cancelled_annotations": 0,
    "total_predictions": 0,
    "comment_count": 0,
    "unresolved_comment_count": 0,
    "last_comment_updated_at": null,
    "project": 9,
    "updated_by": 2,
    "comment_authors": []
  }
]
```

- We can find the coordinates of the bboxes (x, y, w, h) as well as the label of each bbox, but **only for keyframes**.
- In order to label all the frames, it would be interesting to write a Python script with:
    - input: the export JSON;
    - output: the JSON augmented with bbox and label data for every frame.
- Algorithm to label all frames (see the sketch below):
    - take two neighboring keyframes, keyframe n and keyframe n+x;
    - for each one, take the values x, y, w, h;
    - measure the difference between the two keyframes for each value;
    - divide by the number of intermediate frames: we obtain an alpha value;
    - increment each value by its alpha to reconstruct the intermediate frames.
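A minimal sketch of that script, assuming the export format above. It linearly interpolates x, y, width and height between consecutive keyframes; a complete version would also handle 'rotation', 'time' and the 'enabled' flag (a disabled keyframe means the box disappears).

```python
import json

KEYS = ("x", "y", "width", "height")

def interpolate_sequence(sequence):
    """Linearly interpolate bbox values between consecutive keyframes."""
    full = []
    for k1, k2 in zip(sequence, sequence[1:]):
        gap = k2["frame"] - k1["frame"]
        full.append(k1)
        for i in range(1, gap):
            frame = dict(k1, frame=k1["frame"] + i)
            for key in KEYS:
                # difference divided by the number of steps = alpha
                alpha = (k2[key] - k1[key]) / gap
                frame[key] = k1[key] + i * alpha
            full.append(frame)
    full.append(sequence[-1])
    return full

with open("export.json") as f:
    tasks = json.load(f)

# Augment every videorectangle sequence with the intermediate frames
for task in tasks:
    for annotation in task["annotations"]:
        for result in annotation["result"]:
            seq = result["value"].get("sequence")
            if seq:
                result["value"]["sequence"] = interpolate_sequence(seq)

with open("export_interpolated.json", "w") as f:
    json.dump(tasks, f, indent=2)
```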
## Example project: image object detection

- Install MMDetection: https://mmdetection.readthedocs.io/en/v1.2.0/INSTALL.html
- Run in the CLI (adapt the paths to your setup):

```bash
label-studio-ml init coco-detector --from /Users/benjamin/Developments/label_studio/label-studio-ml-backend-master/label_studio_ml/examples/mmdetection-3/mmdetection.py
```

- Download the model weights here: https://github.com/open-mmlab/mmdetection/tree/main/configs/faster_rcnn
- Save them, for example, in the mmdetection folder: /Users/benjamin/Developments/label_studio/mmdetection/mmdetection/checkpoint/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
- The config file is in the recently installed mmdetection folder: mmdetection/mmdetection/configs/faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py
- Now you can launch the ml-backend server from the CLI:

```bash
label-studio-ml start coco-detector --with \
config_file=/Users/benjamin/Developments/label_studio/mmdetection/mmdetection/configs/faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py \
checkpoint_file=/Users/benjamin/Developments/label_studio/mmdetection/mmdetection/checkpoint/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
device=cpu \
--port 8003
```

- Download example data:

```bash
cd path/to/coco-detector
mkdir data && cd data
wget https://download.openmmlab.com/mmyolo/data/cat_dataset.zip && unzip cat_dataset.zip
```

- Launch label-studio:

```bash
label-studio start
```

- Create a project.
- Choose the template in Computer Vision > Object Detection with Bounding Boxes.
- In the labeling configuration, paste all the COCO dataset labels (the Faster R-CNN model you downloaded was trained on them):

```
airplane apple backpack banana baseball_bat baseball_glove bear bed bench
bicycle bird boat book bottle bowl broccoli bus cake car carrot cat
cell_phone chair clock couch cow cup dining_table dog donut elephant
fire_hydrant fork frisbee giraffe hair_drier handbag horse hot_dog keyboard
kite knife laptop microwave motorcycle mouse orange oven parking_meter
person pizza potted_plant refrigerator remote sandwich scissors sheep sink
skateboard skis snowboard spoon sports_ball stop_sign suitcase surfboard
teddy_bear tennis_racket tie toaster toilet toothbrush traffic_light train
truck tv umbrella vase wine_glass zebra
```

- Import the example data.
- Connect label-studio with the ml-backend:
    - Settings;
    - add model;
    - enter the backend URL;
    - save.
- Now you can run the annotation loop:
    - click on 'Label All Tasks': the model will predict bboxes and labels on all images;
    - correct the annotations;
    - re-train the model;
    - relaunch the ml-backend;
    - etc., until reaching good-quality data labeling.

## Possible future works

- Implementing Kilian's model.
- External requests:
    - checking the external requests more deeply with Wireshark;
    - identifying the responsible code;
    - trying to remove it without breaking the app;
    - potentially time-consuming.
- Implement a learning loop.
- Add the score sorting functionality (a sketch follows below).
- Adapt the application for production.
- Deploy on a server.
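For the score-sorting item, a hedged sketch of what it could look like outside the UI, using the label-studio-sdk: tasks are ranked by the 'score' the ML backend attaches to its predictions, so the least confident (most likely erroneous) annotations are corrected first. The project id and token are placeholders; this is an untested illustration, not a drop-in feature.

```python
from label_studio_sdk import Client

ls = Client(url="http://localhost:8080", api_key="YOUR_API_TOKEN")
project = ls.get_project(1)  # project id as shown in the URL

def min_prediction_score(task):
    """Lowest confidence among a task's predictions (0.0 if none)."""
    scores = [p.get("score", 0.0) for p in task.get("predictions", [])]
    return min(scores) if scores else 0.0

# Review the least confident tasks first: correcting these should
# improve the model fastest.
tasks = sorted(project.get_tasks(), key=min_prediction_score)
for task in tasks[:20]:
    print(task["id"], min_prediction_score(task))
```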