
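# Fake News
###### tags:`NCTU Project`

---

### 2020/09/23
## Meeting Notes and Assignments
* Use video imagery (matching the interviewees' faces) as the main basis for judgment
* Crawl news videos, focusing on single news clips (about three minutes each)
* Preprocess the videos by sampling at about 1 frame/s
* Build a database of interviewee faces, processed with an existing pre-trained model
* Find a face detection model
  * 靜
* Investigate whether pytube can sample frames without saving the whole video
  * 峰
* Decide which fields to crawl, analyze the HTML of YouTube video pages, write the information into a pandas DataFrame, and finally save it as a csv file
  * 華
* Obtain the list of video URLs, e.g. how to crawl a channel's video list. This may mean deciding which channel to crawl, how to crawl news on a specific topic, or searching by keyword and then restricting the list by time and length
  * 柔

### Crawler data format (csv)
* ID(url)
* Title
* Author
* Length
* Hashtags
* Upload_date
* Description

| ID | Title | Author | Length | Hashtags | Upload_date | Description |
| -------- | -------- | -------- | -------- | -------- | ---------- | ----------- |
| f6MrNLCyxW8 | 縣市競爭力首奪冠! 新竹市跨越"台北障礙" | TVBS NEWS | 79 | ['新竹', '台北', '林志堅', '新竹', '台北', '林志堅'] | 2020-09-28 | #新竹#台北#林志堅\n\n●訂閱【TVBSNEWS】最新資訊馬上接收👉https://tv... |

### YouTube video paths
```
Data/2020-09-30/Title0.mp4
Data/2020-09-30/吃安眠藥後開車撞死人！孝女慘死輪下.mp4
Data/2020-10-01/嘉義「共匪餅」登香港TVB引爭議店家錯愕：是暱稱.mp4
...
```

---

### 2020/09/30
## Meeting Notes and Assignments
* Add Description to the csv
  * 柔
* Compare FaceNet + Adaptive Threshold with the latest OpenCV + dlib
  * 靜, 華
* Store the videos directly instead of sampling (the image files are too large)
* Two csv files (crawler, face recognition)
## Progress Report
### 柔:
* Wrote a crawler that takes a YouTube playlist, extracts each video's information (id, title, author, length), and downloads the videos. Hashtags and description are not handled yet.
### 靜:
* Studying face recognition models; the model found so far is Google's FaceNet (2015)

---

### 2020/10/7
## Meeting Notes and Assignments
* Side track: video similarity comparison (Dynamic time warping)
  * 華
* A front-end interface will be needed later
* Build the database
  * 柔
* Get FaceNet running
  * 靜
* Check whether FaceNet supports fine-tuning
## Progress Report
### 華:
* Comparison of FaceNet and dlib
  * [Comparison of face detection](https://www.kaggle.com/timesler/comparison-of-face-detection-packages)
  * [Face detection algorithms comparison](http://datahacker.rs/017-face-detection-algorithms-comparison/)
### 柔:
* Finished the crawler: added the hashtags and description fields, worked around IP blocking, handled exceptions (inconsistent field formats, connection errors), and surveyed the playlists of mainstream news media YouTube channels.
### 靜:
* FaceNet: A Unified Embedding for Face Recognition and Clustering <br> https://arxiv.org/pdf/1503.03832.pdf
* Data-specific Adaptive Threshold for Face Recognition and Authentication <br> https://arxiv.org/pdf/1810.11160.pdf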
* Studied the network architectures of these two papers

---

### 2020/10/14
## Meeting Notes and Assignments
* Improve the crawler's update efficiency
* Search for related news by keyword
  * 柔
* Use FaceNet to extract the faces that appear in video frames
* Decide how to build the face and person tables
  * 靜
* Run more video similarity experiments to understand what the model can do
  * 華
## Progress Report
### 柔:
* Database: MySQL
* ER Model ![](https://i.imgur.com/i5JKE4v.png)
* Migrated the data and rewrote the crawler
### 華:
* [ICCV_2019_ViSiL: Fine-grained Spatio-Temporal Video Similarity Learning](https://openaccess.thecvf.com/content_ICCV_2019/papers/Kordopatis-Zilos_ViSiL_Fine-Grained_Spatio-Temporal_Video_Similarity_Learning_ICCV_2019_paper.pdf)
* [github](https://github.com/MKLab-ITI/visil)
* [demo](https://drive.google.com/drive/folders/1HGaCni74lcgESHs6U3S_Q45B2iQAFvF1?usp=sharing)
* ![](https://i.imgur.com/cKrxz21.png)

---

### 2020/10/21
## Meeting Notes and Assignments
* ViSiL slides
  * 華
* Data visualization, similarity matrix
* Scheduled crawling
  * 柔
* Find a way to label face data
  * 靜
## Progress Report
### 柔:
* The crawler can now search by keyword
### 華:
* Gold Sample: 300 * 300, 1 hr
* Output data cleanup
* Same video similarity
  * Avg: 0.96
  * Max: 1.0
  * Min: 0.796
  * Min video: 1288-159
* [Video](https://drive.google.com/drive/folders/1uWGevMymtFrDh6gRcRtlWLEgQWAMddko?usp=sharing)
* Use a threshold? Or take the top K most similar videos, excluding the query itself?
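The open question in 華's notes (a fixed threshold versus taking the top K most similar videos, excluding the query itself) can be sketched with NumPy. This is only an illustration under stated assumptions: the helper name `top_k_similar`, the toy similarity matrix, and the 0.5 cut-off are all hypothetical and are not taken from the ViSiL code or from the experiment output.

```python
import numpy as np

def top_k_similar(sim, query_idx, k=3, threshold=None):
    """Return (index, score) pairs for the k videos most similar to query_idx.

    `sim` is a square video-to-video similarity matrix (hypothetical data).
    The query itself is excluded; an optional threshold drops weak matches.
    """
    scores = np.asarray(sim, dtype=float)[query_idx].copy()
    scores[query_idx] = -np.inf                 # never return the query itself
    order = np.argsort(-scores)[:k]             # indices by descending score
    results = [(int(i), float(scores[i])) for i in order if np.isfinite(scores[i])]
    if threshold is not None:
        results = [(i, s) for i, s in results if s >= threshold]
    return results

# Toy 4x4 similarity matrix (made-up values, not experiment output).
sim = np.array([
    [1.00, 0.82, 0.10, 0.40],
    [0.82, 1.00, 0.05, 0.35],
    [0.10, 0.05, 1.00, 0.20],
    [0.40, 0.35, 0.20, 1.00],
])

ranked = top_k_similar(sim, query_idx=0, k=2)                    # top K only
filtered = top_k_similar(sim, query_idx=0, k=3, threshold=0.5)   # top K plus cut-off
print(ranked, filtered)
```

Combining both criteria, as in the second call, keeps the result list short while still rejecting queries that have no close match at all.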
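### 靜:
* Computed embeddings for all faces and clustered them with DBSCAN; different faces ended up in the same cluster, while the same face was split across clusters

---

### 2020/10/28
## Meeting Notes and Assignments
* Use the t-SNE dimensionality reduction results to build an interactive interface that helps with face labeling
  * 柔
* Clarify the input details and the overall workflow for fine-tuning FaceNet
  * 靜
* How the ViSiL model converts the triplet loss (distance) into a similarity output
* Preprocessing interface (easy to automate so that it can connect to the existing system)
## Progress Report:
### 柔
* Finished the YouTube crawler; a batch program run by the Windows scheduler performs the update every day at 23:58
* Visualized the face embeddings after t-SNE dimensionality reduction ![](https://i.imgur.com/JsMHaoI.png) ![](https://i.imgur.com/8f9h2nZ.jpg)
### 華:
* Paper Study
* ![](https://i.imgur.com/gGUQ5Py.png)
* Spatial: frame to frame (Tensor Dot + Chamfer Similarity)
* Temporal: video to video (Chamfer Similarity)
* Triplet Loss

---

### 2020/11/03
## Meeting Notes and Assignments
* Trace the ViSiL code
  * 華
* Revise the labeling interface
* Add embedding and label columns to the face table
* Let users label data through a drop-down list
* Test the database
  * 柔 or 靜
## Progress Report:
### 柔
* Wrote the labeling interface
* Uses Flask + D3.js + mysql.connector
* Still in testing; not yet connected to the database
* Integrating zoom with brush (selection) and removing list items are not finished yet ![](https://i.imgur.com/iWta8MX.png)
### 華:
* Use the hard tanh activation function to evaluate the output
* Video-to-video similarity to loss function
* ![](https://i.imgur.com/LvpcPWd.png)
* Loss function
* ![](https://i.imgur.com/Fa6JYbP.png)
* ![](https://i.imgur.com/M8FhSjN.png)
* Evaluation
  * Near-Duplicate Video Retrieval
  * Fine-grained Incident Video Retrieval
  * Event Video Retrieval
  * Action Video Retrieval

---

### 2020/11/11
## Meeting Notes and Assignments
* Add a delete-label function
* Integrate the labeling interface
* Allow external access to the interface
  * 柔
## Progress Report:
### 柔
* Finished the preprocessing program
* Changed the data directory structure
```
# data
|__video
|   |__<video_upload_date>
|       |__<video_url>.mp4
|
|__face
|   |__<video_upload_date>
|       |__<video_url>
|           |__<video_id>_<face_frame>_<face_no>.jpg
|
|__labeled_face
    |__<face_label>
        |__<video_id>_<face_frame>_<face_no>.jpg
```
* Added columns to the face table: face_embedding, face_label, face_no
* Wrote the labeling interface
* Zoom and brush (selection) are integrated, but the selection is misaligned
* The faces shown can be filtered by video upload time
* Items can be removed from the selection list
* Connected to the database
* Labeled face images are moved ![](https://i.imgur.com/yjW66AX.png)
### 華:
* Found a faulty video ![](https://i.imgur.com/kiX3muP.jpg)

---

### 2020/11/18
## Meeting Notes and Assignments
## Progress Report:
### 柔
* Labels can be deleted
* Interface integration complete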
* The interface is reachable at [http://140.113.210.9:5000](http://140.113.210.9:5000)
### 華:
* Preprocessing -> visil -> Outputprocessing
  * Preprocessing: error handling finished
  * Outputprocessing: keep only the results with positive similarity and sort them
* Trace code
  * The code is not training code; it is an API for using the already-trained model

---

### 2020/11/25
## Meeting Notes and Assignments
* Audio similarity?
* First find out how many videos contain only text and audio
* Visualize the video similarity results
* Confirm whether ViSiL runs on CPU or GPU
## Progress Report:
### 靜:
* Labeled the images downloaded from Google Images
### 柔:
* Added a mouseover event
### 華:
* Experiment
  * Queries: TempVideo (881)
  * Database: Gold Sample (299)
  * Cost: 3 hr
  * Output: some similar videos
  * CPU/GPU ??
  * Identical videos have a similarity of at least 0.75

---

### 2020/12/02
## Meeting Notes and Assignments
* Model fine-tuning (fix layers or not)
* Plot the validation accuracy curve
* Sample label data
* Try clustering
* How to choose a representative feature
## Progress Report:
### 柔
* Added news thumbnail display ![](https://i.imgur.com/ZRh4KTq.png)
### 華:
* Demo web page prototype [http://140.113.210.7:5000]
* ![](https://i.imgur.com/TX5pUfL.png)
### 靜
* Fine-tuned the model
* (train 0.98, val 0.90) ![](https://i.imgur.com/6q7BLLJ.png) ![](https://i.imgur.com/in8FXsa.png) Red: Chen Shih-chung (陳時中); blue: Chiang Wan-an (蔣萬安) (but accuracy is still the main thing to look at) ![](https://i.imgur.com/bYmpJEq.png)

---

### 2020/12/09
## Meeting Notes and Assignments
* visil site
  * User uploads a video to the server
  * Server responds with the results
## Progress Report:
### 柔
* Labeling (2020/08/24-11/10)
  * 1,129 videos
  * 101 politicians
  * 15,078 labels
### 華:
* Video display on the web page finished
### 靜:
- Checked for mislabeled data
- Crawled more data
- fine-tune (20 epochs)
  - WeightedRandomSampler
    - Each class gets a weight based on its size: weight = 1/class_num
    - The weight serves as the probability that data of each class is sampled
  - Update all parameters
    - Train accuracy: 1.0000
    - Validation accuracy: 0.9826
  - Update only the parameters of the last two layers
    - Train accuracy: 0.9887
    - Validation accuracy: 0.9650
  - Update only the parameters of the last layer
    - Train accuracy: 0.9853
    - Validation accuracy: 0.9607
- fine-tune
  - update all layers
    - acc ![](https://i.imgur.com/92cM9r2.png)
    - fps ![](https://i.imgur.com/oOl0GTt.png)
    - loss ![](https://i.imgur.com/23xAVVK.png)
  - update last layer
    - acc ![](https://i.imgur.com/MSedRpW.png)
    - fps ![](https://i.imgur.com/mxwC8XI.png)
    - loss ![](https://i.imgur.com/3ZdUwYg.png)

---

### 2020/12/16
## Meeting Notes and Assignments
* visil
  * Would like the comparison process to report back in real time
## Progress Report:
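### 柔
* Labeling (2020/08/24-12/01)
  * 1,152 videos
  * 137 politicians
  * 25,340 labels
### 華:
* visil
  * Users can upload videos to the server
  * The server feeds them into visil for comparison
  * Users can view the results through a menu
  * (Unresolved)
    * The page redirect after upload leaves the server's command execution incomplete
    * Show execution progress?

---

### 2020/12/23
## Meeting Notes and Assignments
## Progress Report:
### 柔
* Labeling (2020/08/24-12/21)
  * 1,183 videos
  * 141 politicians
  * 26,649 labels
### 華:
* (Worked on the badminton project proposal this week)

---

### 2020/12/30
## Final exams

---

### 2021/01/06
## Meeting Notes and Assignments
* face
  * Use triplet loss for fine-tuning
  * Use head pose estimation to filter the face data
* visil
  * Confirm how many features are stored per frame and per video
  * Change the storage format (binary? npz?)
  * Confirm the similarity comparison speed (1:300 / 150 min, 300:300 / 5 min)
  * Whether to use all features
## Progress Report:
### 柔:
* Read the FaceNet paper
### 華:
* visil
  * Saved the golden sample features in advance
  * Loading the file is still slow (27G, about 6 minutes)
  * Computing the similarity still takes about 5 minutes
  * GPU issue not yet resolved (under discussion with 玉米)

---

### 2021/01/13
## Meeting Notes and Assignments
* face
  * Re-examine the project proposal
* visil
  * Wire everything together into something that can be demoed
  * Find ways to speed it up further
## Progress Report
### 柔:
* Head Pose Estimation
  * [FSA-Net](https://www.csie.ntu.edu.tw/~cyy/publications/papers/Yang2019FSA.pdf)
  * [shamangary/FSA-Net](https://github.com/shamangary/FSA-Net#for-lazy-people-just-like-me)
  * [omasaht/headpose-fsanet-pytorch](https://github.com/omasaht/headpose-fsanet-pytorch)
  * Tested with omasaht's FSA-Net ![](https://i.imgur.com/4OI98hi.jpg) ![](https://i.imgur.com/yB6JXSG.jpg)
### 華:
* visil
  * feature size: sec (per frame) * 9 * 3840
    * ex: 180s video has 180 * 9 * 3840
  * data
    * json to binary (per video per file)
    * 6 min to 5 s
  * cal similarity
    * 6 min to 6 s
  * extract query video feature
    * 6 s

---

### 2021/01/20
## Meeting Notes and Assignments
* This week's meeting was cancelled
## Progress Report
### 華:
* visil
  * Pipeline wired together
  * About 30 s when the server is idle
  * About 1 min when other processes are running
### 2021/01/27
## Meeting Notes and Assignments
* face
  * triplet loss
* visil
  * Record a demo video?
## Progress Report
### 柔:
* Re-read the fake news project proposal to clarify the system's functions
* News collection
  * Write the crawler
* Database
  * MySQL
  * News table
    * News keyframe extraction (avoid storing highly similar data)
  * Face table (intermediate table)
    * Face detection
      * MTCNN
    * Embedding extraction
      * FaceNet
    * Pose estimation
      * FSA-Net
  * Person table
    * Representative embedding
    * Public figure recognition (learned by the system itself)
  * Statement table
    * Speech-to-text
    * Article summarization
  * Person relationship table
* Preprocessing
  * Extract the people who appear in the news (to make person matching more efficient)
* User interface
  * Face labeling
  * Linking news and people
    * Face detection
    * Select a person
    * Face verification and recognition (matched against the person database)
    * Assisted face recognition (depending on the system's recognition confidence score)
    * Show the news a person has appeared in
    * News content recognition (whether news items belong to the same event)
    * Actor warning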
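      (a non-public figure who nevertheless appears in several different news events)
  * Semi-automatic public figure recognition
  * Public figure quotations
    * Visualized on a timeline
      * Dates of news appearances
      * Statements made
      * News sources
  * Public figure relationship network (helps the public understand the relationships between public figures)
    * Select a public figure
    * Select a time range
    * Show the relationship graph
      * Appearing in the same frame
      * Appearing in the same news story
      * (Define which relationships can be represented)
* Extracted face embeddings with the fine-tuned model ![](https://i.imgur.com/gz2ZEiw.jpg)
* Added a head pose feature column to the database
* Added a pose threshold to the labeling interface ![](https://i.imgur.com/D10TodV.png)
### 華:
* visil
  * preload model and read data
  * A 1:30 video returns results in about 15 seconds ![](https://i.imgur.com/lW5WEDS.png)
### 靜:
* head pose estimation + fine-tuned model
* Experiments
  * Training
    * all layers
    * last linear layer and logit layer
    * threshold
      * 40, 30, 20
  * Testing
    * test set
    * test set + head pose estimation (threshold 20)
  * Each configuration was trained independently five times
  * The model with the highest validation accuracy is saved ![](https://i.imgur.com/zLzldet.png) ![](https://i.imgur.com/exXmkSg.png) ![](https://i.imgur.com/srpUgoR.png)
### 2021/02/02
## Meeting Notes and Assignments
## Progress Report
### 柔:
* Clustered with DBSCAN
* Read the ArcFace paper
* Tried generating embeddings with ArcFace
### 2021/02/24
## Meeting Notes and Assignments
* Fake news
  * Adjust the embedding dimension (256, 128), then cluster
  * Visualize the results after fine-tuning with triplet loss
* Dimension reduction
  * Run experiments on the other datasets mentioned in the paper
  * Run the senior labmate's [code](https://github.com/jxcodetw/Parametric-DR?fbclid=IwAR3hdlwaduI6jo1vcFEmWsziIj5jWWf4Qx3F9k70faMSQMqLJuO3kJtYDa4), starting with the MNIST experiment
* Unity toolkit
  * See whether it can be programmed (degree of flexibility)
  * Present a summary report
## Progress Report
### 柔:
* [Deep Learning Multidimensional Projections](https://arxiv.org/pdf/1902.07958v1.pdf)
  * Read the paper
  * Ran the MNIST experiment ![](https://i.imgur.com/rdH7Yy3.jpg)
* Clustered with mean shift
### 靜:
* Fine-tuned the model (triplet loss)
### 2021/03/03
## Meeting Notes and Assignments
* Dimension reduction
  * Check the t-SNE experiment parameters
  * Skip IMDB; CIFAR10 alone is enough
  * Implement Neighborhood Hit
  * Run experiment 2 with the senior labmate's code
* Fake news
  * Extract embeddings with the fine-tuned model
  * Integrate clustering.py into the UI
* Unity ML
  * Try more examples (tennis, soccer)
  * Turn them into a demo video
## Progress Report
### 柔:
* Dimension reduction
  * Facilitate the Parametric Dimension Reduction by Gradient Clipping
    * Read the paper
    * Ran the batch size and network capacity experiments
  * Deep Learning Multidimensional Projections
    * MNIST and Fashion MNIST experiments
    * Processed the Cats vs Dogs dataset (feature extraction)
* Fake news
  * clustering result UI
### 2021/03/10
## Meeting Notes and Assignments
* Dimension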
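  reduction
  * Switch the feature extraction model (Keras to PyTorch)
* Fake news
  * Implement the comparison UI
  * Experiment with several comparison methods
* Unity ML
  * Try training on the server with a GPU
## Progress Report
### 柔:
* Dimension reduction
  * Implemented Neighborhood Hit
  * Comparison for experiment 2
* Fake news
  * Extracted embeddings with the fine-tuned model
  * Integrated it into the UI
### 2021/03/17
## Meeting Notes and Assignments
* Dimension reduction
  * Test the [GitHub code](https://github.com/mespadoto/proj-quant-eval/tree/master/code/01_data_collection)
  * Run the experiments
  * Compare the methods AE, PCA, UMAP, t-SNE, DLMP
* Fake news
  * Develop a UI for comparing faces across news stories
  * Proposed workflow:
    * Preprocessing
      * Cluster the faces in each news story
      * Find a representative embedding for each person who appears (the face closest to the mean)
    * UI
      * The user chooses a video to watch
      * The system draws boxes around the faces in the video
      * The user selects a specific face
      * Compare it against the representative embeddings
      * Show related news (sorted by similarity)
* Unity ML
  * Try training on the server with a GPU
## Progress Report
### 柔:
* Read Towards A Quantitative Survey of Dimension Reduction Techniques
* Downloaded the 18 datasets needed for the experiments
* Tested the [GitHub code](https://github.com/mespadoto/proj-quant-eval/tree/master/code/01_data_collection) (still debugging)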
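
The face comparison workflow proposed in the 2021/03/17 meeting notes (for each person in a news story, pick the face embedding closest to the cluster mean, then rank news stories by similarity to a selected face) can be sketched as follows. This is a minimal sketch and not the project's code: the helper names, the use of Euclidean distance to the mean, cosine similarity for ranking, and the toy 2-D vectors are all assumptions made for illustration.

```python
import numpy as np

def representative_embedding(embeddings):
    """Return the member embedding closest (Euclidean) to the cluster mean."""
    embeddings = np.asarray(embeddings, dtype=float)
    mean = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - mean, axis=1)
    return embeddings[int(np.argmin(dists))]

def rank_by_similarity(query, representatives):
    """Rank news ids by cosine similarity between the query face and each representative."""
    query = np.asarray(query, dtype=float)
    scored = []
    for news_id, rep in representatives.items():
        rep = np.asarray(rep, dtype=float)
        cos = float(query @ rep / (np.linalg.norm(query) * np.linalg.norm(rep)))
        scored.append((news_id, cos))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy data: one face cluster per news story (made-up 2-D vectors; real
# face embeddings would have far more dimensions).
clusters = {
    "news_a": [[1.0, 0.0], [0.9, 0.1], [0.6, 0.4]],
    "news_b": [[0.0, 1.0], [0.1, 0.9], [0.2, 1.0]],
}
representatives = {nid: representative_embedding(v) for nid, v in clusters.items()}

# The face the user selected in the UI (hypothetical query vector).
ranking = rank_by_similarity([1.0, 0.05], representatives)
print(ranking)
```

Picking an actual member of the cluster, rather than the mean itself, keeps the representative tied to a real face crop that the UI can display next to the result.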