
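# AIOT/AI project notes

## Principle

By analyzing discussions in online communities, this project tries to find novels worth reading.
The discussion data comes from the CFantasy (CF) board on PTT.
>link: https://www.ptt.cc/bbs/CFantasy/index.html

## Architecture

Planned architecture
>Crawler (all data) -> NER -> book titles -> comments -> sentiment analysis & keywords

Actual architecture
>Crawler (all data) -> ~~NER~~ -> book titles -> comments -> sentiment analysis & keywords

>NER could not find the book titles, so they were identified manually instead.
>This may be due to the nature of novel titles: NER often treats a single title as several separate words.

## Crawler

Uses the Python requests and BeautifulSoup packages.
```python
import requests
from bs4 import BeautifulSoup
# i: page number, q: search keyword; rs is defined elsewhere (assumed to be a requests session)
url='https://www.ptt.cc/bbs/CFantasy/search?page=%d&q=%s'%(i,q)
res=rs.get(url,verify=False)
soup=BeautifulSoup(res.text,features="html.parser")
```
>Code excerpt

Each post is then saved as JSON for later use.
```python
import json
# get_a_post(url) is defined elsewhere in the project; its result is written to one JSON file per post
for idx,url in enumerate(urls):
    with open('./post_data_byname/%s/%d.json'%(target,idx),'w') as f:
        f.write(json.dumps(get_a_post(url)))
```
>Code excerpt

## Analysis

### Sentiment analysis

Uses BERT followed by a classifier.
0 = negative, 1 = positive
>->[Introduction to BERT](https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html)

The trained model was kindly provided by a classmate.
```python
# Import shown for completeness; assumes the Hugging Face transformers package
from transformers import BertTokenizer, BertConfig, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
config = BertConfig.from_pretrained('trained_model/config.json')
model = BertForSequenceClassification.from_pretrained('trained_model/pytorch_model.bin',config=config)
```
>Loading the model

```python
import torch
# ...
# (input formatting omitted)
# ...
input_ids = torch.LongTensor(input_id).unsqueeze(0)  # add a batch dimension
outputs = model(input_ids)
# ...
# (output formatting omitted)
# ...
```
>Code excerpt

### Keywords

Uses jieba.analyse.
```python
import jieba.analyse
ks = jieba.analyse.extract_tags(string, topK=4)  # the 4 highest-weighted keywords
```
It is based on TF-IDF.
In other words, words that are distinctive to an article tend to become its keywords.

## Web page

### Location

http://104.43.19.221:8088
Hosted on an Azure machine.
For cost reasons, no domain name was purchased.

### Frontend

Sends requests to the server.
```python
from flask import request
column=request.args.get('column')  # column to sort by
order=request.args.get('order')    # sort direction
```
>Receiving the request

### Backend

Flask queries the database according to the request.
```python
# column and order are concatenated directly into the SQL, so validate them against known values first
c.execute("SELECT * FROM novel ORDER BY "+column+" "+order)
```

### Database

MySQL
```python
# Pass the values as query parameters so the driver handles quoting and escaping
insert=r'''
INSERT INTO novel (name, ccount, pcount, ncount, rate, keyword)
VALUES (%s,%s,%s,%s,%s,%s);
'''
cursor.execute(insert,(bn,json_data['ccount'],json_data['pcount'],json_data['ncount'],json_data['rate'],kw))
```

## Conclusion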
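
As a supplement to the keyword step, which relies on TF-IDF: the idea can be illustrated with a small pure-Python sketch. This is not how jieba works internally (jieba.analyse ships with a prebuilt IDF table rather than computing IDF from your own documents). The function `tf_idf` and the toy, already-tokenized posts are hypothetical and only show why a word that is frequent in one post but rare elsewhere ends up with the highest score.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency times log of inverse document frequency.
        scores.append({w: (c / len(doc)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores

# Hypothetical, already-tokenized toy posts.
docs = [
    ["novel", "good", "dragon", "dragon"],
    ["novel", "good", "sword"],
    ["novel", "boring", "plot"],
]
scores = tf_idf(docs)
# The word with the highest score in the first post.
top = max(scores[0], key=scores[0].get)
```

Here "novel" appears in every post, so its score is zero, while "dragon" appears only in the first post and becomes its top keyword.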
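
As a supplement to the backend query, which builds its ORDER BY clause from the `column` and `order` request parameters: column names and sort directions cannot be passed as bound query parameters, so one common approach is to check them against a fixed list before building the SQL. The sketch below makes several assumptions: it uses the standard-library sqlite3 in place of MySQL so that it runs on its own, the helper `build_order_query` and the sample rows are hypothetical, and the column names are taken from the INSERT statement in the notes.

```python
import sqlite3

# Column names taken from the INSERT statement in the notes.
ALLOWED_COLUMNS = {"name", "ccount", "pcount", "ncount", "rate", "keyword"}
ALLOWED_ORDERS = {"ASC", "DESC"}

def build_order_query(column, order):
    """Hypothetical helper: build a SELECT with a validated ORDER BY clause."""
    order = (order or "").upper()
    if column not in ALLOWED_COLUMNS or order not in ALLOWED_ORDERS:
        raise ValueError("unsupported sort request")
    return "SELECT * FROM novel ORDER BY %s %s" % (column, order)

# In-memory SQLite stands in for the MySQL database (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE novel (name TEXT, ccount INTEGER, pcount INTEGER, "
    "ncount INTEGER, rate REAL, keyword TEXT)"
)
rows = [("Book A", 10, 7, 3, 0.7, "[]"), ("Book B", 4, 1, 3, 0.25, "[]")]
# Values go in as bound parameters, never through string formatting.
conn.executemany("INSERT INTO novel VALUES (?, ?, ?, ?, ?, ?)", rows)

result = conn.execute(build_order_query("rate", "desc")).fetchall()
names = [r[0] for r in result]
```

In the Flask handler, the values returned by `request.args.get` would be passed through such a check before the query is executed, and a rejected request could be answered with an error response.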