<style>
.container {
background-color: #696969;
}
</style>
# Outline of a fake news detection mesh classifier:
Data gathering and submodules
---
## Aim
- make trust measurable
- pipeline
- baseline news database
---
## Browsing experience
Human-based expert system™ for judging the trustworthiness of an article
--> model the decision process with machine learning models and heuristics
---
## Corpus Building
- follows the methodology of MNSZ3
- classifiers are built from Common Crawl data
- articles are processed with news-please
---
## Submodules for trustworthiness

---
## corpus + crawling
- baseline corpus from processed articles
- daily crawl pipeline for trend analysis
---
## Integration
```plantuml
@startuml
User -> pipeline: Article url
pipeline --> heuristics: look up cached article
heuristics --> db: find trend correlation
pipeline --> "request module": download and render page
pipeline --> emagyar: tokenization
pipeline --> classifiers: run models
pipeline --> User: trust score
@enduml
```
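The flow in the diagram can be sketched in Python. This is only an illustration under stated assumptions: `TrustPipeline`, its fields, and the averaging of classifier scores are hypothetical names and choices, not the project's actual code, and the trend-correlation lookup is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TrustPipeline:
    """Hypothetical orchestration of the steps in the sequence diagram."""
    download: Callable[[str], str]                        # stands in for the request module
    tokenize: Callable[[str], List[str]]                  # stands in for emagyar
    classifiers: Dict[str, Callable[[List[str]], float]]  # each returns a score in [0, 1]
    cache: Dict[str, float] = field(default_factory=dict)

    def score(self, url: str) -> float:
        # 1. Return the cached result if this article was already scored.
        if url in self.cache:
            return self.cache[url]
        # 2. Download and render the page, then tokenize the text.
        tokens = self.tokenize(self.download(url))
        # 3. Run every classifier; averaging is an assumed aggregation rule.
        scores = [clf(tokens) for clf in self.classifiers.values()]
        trust = sum(scores) / len(scores) if scores else 0.0
        self.cache[url] = trust
        return trust


# Stub components with made-up scores, for illustration only.
pipeline = TrustPipeline(
    download=lambda url: "Example article text",
    tokenize=str.split,
    classifiers={"misinformation": lambda t: 0.75, "toxicity": lambda t: 0.25},
)
print(pipeline.score("https://example.org/article"))
```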
---
## Current work
- misinformation classifier
- toxic comment classifier
- next up: personal style classifier for authors (anomaly detection)
---
## Misinformation classifier
- huBERT-based
- 7 million tokens
- F1=0.9875, loss=0.0662
- tested against articles from news sites
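
For reference, F1 is the harmonic mean of precision and recall. A minimal sketch of the computation follows; the function name and the confusion counts are made up for illustration and are not the evaluation data behind the score above.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only.
print(round(f1_score(tp=90, fp=10, fn=10), 4))
```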
---
## Toxic comment classification
- huBERT-based
- 655 comments like toxic-BERT
- 14000 tokens
- F1=0.8735
- Twitter sentiment corpus
---
## Personal style (work in progress)
- LoRA-based learning
- diffusion model
---
# Thank you for your attention
## Do you have any questions?
---
## Citations
- Oravecz Csaba, Váradi Tamás, Sass Bálint: The Hungarian Gigaword Corpus. In: Proceedings of LREC 2014, 2014.
- Common Crawl. https://commoncrawl.org/
- Felix Hamborg, Norman Meuschke, Corinna Breitinger, Bela Gipp: news-please: A Generic News Crawler and Extractor. In: Proceedings of the 15th International Symposium of Information Science, Berlin, March 2017, pp. 218–223. doi:10.5281/zenodo.4120316
- Dávid Márk Nemeskey: Introducing huBERT. In: XVII. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY2021), Szeged, 2021, pages TBA.
- Tong Wu, Zhihao Fan, Xiao Liu, Yeyun Gong, Yelong Shen, Jian Jiao, Hai-Tao Zheng, Juntao Li, Zhongyu Wei, Jian Guo, Nan Duan, Weizhu Chen: AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation. 2023. arXiv:2305.09515 [cs.CL]
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen: LoRA: Low-Rank Adaptation of Large Language Models. 2021. arXiv:2106.09685 [cs.CL]
---
```markmap
#
## Attention-seeking behaviour
- Yellow press detection
- Exaggeration
## Secrecy
- Outside the medium's usual profile
- No citation
- Anonymous author
## Falsity
- Lying
- AI written
- Title and main text mismatch
- Suppression of information
## Incompetence
- Factually incorrect
- Small vocabulary
- Conspiracy theory vocabulary
## NLP
- Irrelevant extra information
- Disconnected from trend
- Incoherent text
- Anomaly detection
```
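Two of the signals in the map, small vocabulary and title and main text mismatch, can be approximated with simple heuristics. The sketch below is an assumption about how such submodules might look, not the project's implementation: the function names are hypothetical, type-token ratio stands in for vocabulary size, and word overlap stands in for title/body agreement.

```python
import re
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercased word tokens; \\w is Unicode-aware, so accented letters are kept."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words; low values suggest a small vocabulary."""
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0


def title_body_overlap(title: str, body: str) -> float:
    """Share of title words that also occur in the body; low values suggest a mismatch."""
    title_words = set(tokens(title))
    if not title_words:
        return 0.0
    return len(title_words & set(tokens(body))) / len(title_words)


print(type_token_ratio("a a b b"))
print(title_body_overlap("Storm hits city", "A storm hit the city today"))
```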