# TALLIP Dataset Papers Review v1

### 1. [Semantic Tagging for the Urdu Language: Annotated Corpus and Multi-Target Classification Methods](https://dl.acm.org/doi/10.1145/3582496)

> There is a dearth of semantically annotated corpora for the Urdu language, which is widely spoken and used around the world. To fill this gap, this study presents a large benchmark corpus and methods for the semantic tagging task for the Urdu language. The proposed corpus contains 8,000 tokens in the following domains or genres: news, social media, Wikipedia, and historical text (each domain having 2K tokens). The corpus has been manually annotated with 21 major semantic fields and 232 sub-fields with the USAS (UCREL Semantic Analysis System) semantic taxonomy which provides a comprehensive set of semantic fields for coarse-grained annotation. Each word in our proposed corpus has been annotated with at least one and up to nine semantic field tags to provide a detailed semantic analysis of the language data, which allowed us to treat the problem of semantic tagging as a supervised multi-target classification task.

So this is simply a token-level dataset paper, and it seems they annotated the whole dataset. However, since we are working on a classification dataset, I think this work is not relevant to us.

### 2. [KenSwQuAD – A Question Answering Dataset for Swahili Low Resource Language](https://dl.acm.org/doi/10.1145/3578553)

> The final set of shortlisted Swahili texts based on the selection criteria, were therefore 2,168 texts, of which 1,660 texts (76.6%) were provided to the QA annotators. The method of data allocation to annotators was by equal number of texts in each time duration (monthly, at the start of the month), then allowing individual annotators to access the next set of a fixed number of texts upon finalization of their targets. These subsequent sets were allocated weekly and replenished weekly upon confirmed completion. All the work was done in the 2 months total project duration.

### 3. [Detection and cross-domain evaluation of cyberbullying in Facebook activity contents for Turkish](https://dl.acm.org/doi/10.1145/3580393)

![](https://i.imgur.com/pOIxVP5.png)

They annotated all 60k (100%) Facebook content items (posts, comments, etc.), as shown in the figure above.
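
The coverage figures quoted in these notes can be cross-checked with a few lines of Python. This is only a sketch: the `coverage` helper is a hypothetical name introduced here, and the numbers are copied from the quoted passages rather than taken from the datasets themselves.

```python
def coverage(annotated: int, total: int) -> float:
    """Return the annotated share of a dataset as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * annotated / total


# KenSwQuAD: 1,660 of the 2,168 shortlisted texts went to the QA annotators.
kenswquad = coverage(1660, 2168)

# Urdu semantic tagging corpus: four domains with 2K tokens each.
urdu_tokens = 4 * 2000

print(f"KenSwQuAD coverage: {kenswquad:.1f}%")
print(f"Urdu corpus tokens: {urdu_tokens}")
```

Both results agree with the figures in the quotes (76.6% and 8,000 tokens).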