<style>
.container {
  background-color: #696969;
}
</style>

# Outline of a fake news detection mesh classifier: Data gathering and submodules

---

## Aim

- make trust measurable
- pipeline
- baseline news database

---

## Browsing experience

A human-based expert system™ for judging the trustworthiness of an article
--> model that decision process with machine learning models and heuristics

---

## Corpus Building

- methodology of MNSZ3
- classifiers are built from Common Crawl data processed with news-please

---

## Submodules for trustworthiness

![image](https://hackmd.io/_uploads/SyRUIhzsT.png)

---

## Corpus + crawling

- baseline corpus from processed articles
- daily crawl pipeline for trend analysis

---

## Integration

```plantuml
@startuml
User -> pipeline: Article URL
pipeline --> heuristics: look up cached article
heuristics --> db: find trend correlation
pipeline --> "request module": download and render page
pipeline --> emagyar: tokenization
pipeline --> classifiers: run models
pipeline --> User: trust score
@enduml
```

---

## Current work

- misinformation classifier
- toxic comment classifier
- next up: personal style classifier for authors (anomaly detection)

---

## Misinformation classifier

- huBERT-based
- 7 million tokens
- F1 = 0.9875, loss = 0.0662
- tested against articles from news sites

---

## Toxic comment classification

- huBERT-based
- 655 comments
- like toxic-BERT
- 14,000 tokens
- F1 = 0.8735
- Twitter sentiment corpus

---

## Personal style (work in progress)

- LoRA-based learning
- diffusion model

---

# Thank you for your attention

## Do you have any questions?

---

## Citations

- Oravecz Csaba, Váradi Tamás, Sass Bálint: The Hungarian Gigaword Corpus. In: Proceedings of LREC 2014, 2014.
- Common Crawl: https://commoncrawl.org/
- Felix Hamborg, Norman Meuschke, Corinna Breitinger, Bela Gipp: news-please: A Generic News Crawler and Extractor. In: Proceedings of the 15th International Symposium of Information Science, Berlin, March 2017, pp. 218-223. doi:10.5281/zenodo.4120316
- Dávid Márk Nemeskey: Introducing huBERT. In: XVII. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY2021), Szeged, 2021, pages TBA.
- Tong Wu, Zhihao Fan, Xiao Liu, Yeyun Gong, Yelong Shen, Jian Jiao, Hai-Tao Zheng, Juntao Li, Zhongyu Wei, Jian Guo, Nan Duan, Weizhu Chen: AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation. arXiv:2305.09515 [cs.CL], 2023.
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen: LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL], 2021.

---

```markmap
#
## Attention-seeking behaviour
- Yellow pages detection
- Exaggeration
## Secrecy
- Not profile of medium
- No citation
- Anonymous author
## Falsity
- Lying
- AI written
- Title and main text mismatch
- Suppression of information
## Incompetence
- Factually incorrect
- Small vocabulary
- Conspiracy theory vocabulary
## NLP
- Irrelevant extra information
- Disconnected from trend
- Incoherent text
- Anomaly detection
```

---
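
## Appendix: integration flow sketch

The sequence on the Integration slide (cache lookup, download, tokenization, classifiers, trust score) could be wired together roughly as below. This is a minimal sketch, not the project's implementation: `TrustPipeline`, `fetch`, `tokenize` and the classifier callables are hypothetical stand-ins for the request module, emagyar and the models. The trend-correlation lookup is left out, and averaging the classifier outputs into one trust score is an assumption, since the slides do not say how the scores are combined.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical interfaces; none of these names come from the project.
Fetcher = Callable[[str], str]             # "request module": URL -> article text
Tokenizer = Callable[[str], List[str]]     # stand-in for emagyar tokenization
Classifier = Callable[[List[str]], float]  # tokens -> risk score in [0, 1]


@dataclass
class TrustPipeline:
    fetch: Fetcher
    tokenize: Tokenizer
    classifiers: Dict[str, Classifier]
    cache: Dict[str, str] = field(default_factory=dict)  # URL -> article text

    def score(self, url: str) -> float:
        if not self.classifiers:
            raise ValueError("at least one classifier is required")

        # Look up the cached article first; download only on a miss.
        text = self.cache.get(url)
        if text is None:
            text = self.fetch(url)
            self.cache[url] = text

        # Tokenize once, then run every submodule on the same tokens.
        tokens = self.tokenize(text)
        risks = [clf(tokens) for clf in self.classifiers.values()]

        # Assumption: trust is one minus the unweighted mean risk.
        return 1.0 - mean(risks)


# Usage with dummy components standing in for the real services.
pipeline = TrustPipeline(
    fetch=lambda url: "placeholder article text",
    tokenize=str.split,
    classifiers={
        "misinformation": lambda tokens: 0.75,
        "toxicity": lambda tokens: 0.25,
    },
)
print(pipeline.score("https://example.org/article"))
```

Passing the components in as plain callables keeps each submodule replaceable, so a further classifier such as the planned personal style model could be added to `classifiers` without changing `score`.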