# Digital Medicine Paper Notes
###### tags: `class`
### Use of Semantic Features to Classify Patient Smoking Status
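* Uses MedLEE (Medical Language Extraction and Encoding). Of the MedLEE-based, rule-based, and text-trained approaches, MedLEE performed best.
* https://github.com/WGLab/EHR-Phenolyzer : this GitHub repository contains the code that converts EHR text into MedLEE output
* Records that contain no match for "smok|cig|tob|pack[^e]|nico" are treated as unknown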
* Three types of classifiers were used: lexical supervised, semantic supervised and semantic symbolic (i.e., rule-based).
* #### Feature Selection
* MedLEE example
* Three MedLEE attributes are selected: problem, certainty, and status
1. "Problem" corresponded to the smoking-related action or object
2. "certainty" contained negation information
3. "status" contained temporal information.
* If the generated MedLEE output has no status tag, the data tag is used instead
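
The three attributes above lend themselves to a small rule-based mapping. The following is only a sketch under stated assumptions: the attribute values (`"no"`, `"past"`, `"current"`), the function name `classify_smoking`, and the output labels are all hypothetical and are not taken from the paper or from real MedLEE output.

```python
from typing import Optional

# Hypothetical attribute values; real MedLEE output may use different strings.
NEGATED = {"no", "negative"}
PAST = {"past", "previous"}
CURRENT = {"current"}


def classify_smoking(finding: Optional[dict]) -> str:
    """Map one smoking-related finding to a smoking-status label (assumed labels)."""
    if not finding or not finding.get("problem"):
        return "unknown"          # no smoking-related problem was extracted
    if finding.get("certainty") in NEGATED:
        return "non-smoker"       # certainty carries the negation information
    # Fall back to the data tag when status is absent, as the notes describe.
    temporal = finding.get("status") or finding.get("data")
    if temporal in PAST:
        return "past smoker"
    if temporal in CURRENT:
        return "current smoker"
    return "smoker"               # smoking asserted, timing unclear
```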
* #### Classifier algorithms
* The BoosTexter classifier performed best, roughly on par with the WEKA classifiers
* BoosTexter GitHub link: https://github.com/benob/icsiboost
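
The keyword pre-filter mentioned in this section (records without a match are treated as unknown) can be sketched with Python's `re` module. The pattern is the one quoted in these notes. Case-insensitive matching, the function name `prefilter`, and the `"candidate"` label for records that go on to classification are assumptions. Note that `pack[^e]` matches "packs" or "pack/day" but skips "packed" and "packet".

```python
import re

# Pattern quoted in the notes; case-insensitive matching is an assumption.
SMOKING_PATTERN = re.compile(r"smok|cig|tob|pack[^e]|nico", re.IGNORECASE)


def prefilter(note: str) -> str:
    """Return 'unknown' when no smoking-related keyword appears in the note.

    Notes that do match are labelled 'candidate' (a hypothetical label)
    and would be passed on to a real classifier.
    """
    return "candidate" if SMOKING_PATTERN.search(note) else "unknown"
```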
---
### Identifying Patient Smoking Status from Medical Discharge
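* Uses textual features (e.g., “smok”, “tobac”, “cigar”, Social History, etc.) to determine smoking status
* An overview journal paper that uses a range of statistical metrics to discuss how each approach performs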
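
As a rough illustration of the textual features listed above, the sketch below turns a discharge summary into binary keyword indicators. The feature list comes from the examples in these notes. The function name `extract_features` and the choice of simple substring matching are assumptions, not the method of any particular challenge system.

```python
# Feature terms taken from the examples in the notes (lower-cased).
FEATURE_TERMS = ("smok", "tobac", "cigar", "social history")


def extract_features(text: str) -> dict:
    """Return a 0/1 indicator per term, using case-insensitive substring matching."""
    lowered = text.lower()
    return {term: int(term in lowered) for term in FEATURE_TERMS}
```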
---
### Using implicit information to identify smoking status in smoke-blind medical discharge summaries
* The figures and tables could not be found, even on the JAMIA website
---
### Medical i2b2 NLP Smoking Challenge: The A-Life System Architecture and Methodology
* Uses LifeCode (LifeCode is a natural language processing (NLP) and medical coding expert system that extracts information from free-text clinical records)
> LifeCode comes from the paper "LIFECODE A Deployed Application for Automated Medical Coding"
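* **Document segmenter** : splits and classifies the document according to its section headings
* **Lexical Analyzer** : a processor that converts the text into strings from the KB vocabulary; its functions include abbreviation expansion and morphological reduction (e.g., "dogs" and "dog" refer to the same term)
* **phrase parser** : bottom-up syntactic analysis that turns the input into phrases; the parser tolerates grammatical errors and unknown words. A text chunk ranges from two words up to a full sentence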
* **Concept Matcher** : uses vector analysis to assign meanings (KB concept labels) to each phrase
* #### Method
* CM-Extractor (from the paper "CM-Extractor: An Application for Automating Medical Quality Measures Abstraction in a Hospital Setting")
* LifeCode is used for CM-Extractor
* CM-Extractor works in two steps. Step 1: convert the document into its LifeCode representation
* Step 2: apply a rule-based approach (figure not included)
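
The Lexical Analyzer step listed above (abbreviation expansion plus morphological reduction) can be illustrated with a toy sketch. This is not LifeCode's implementation: the abbreviation table, the plural-stripping rule, and the names `ABBREVIATIONS`, `reduce_plural`, and `lexical_analyze` are all hypothetical and serve only to show the idea that "dogs" and "dog" end up as the same token.

```python
import re

# Hypothetical abbreviation table; LifeCode's real knowledge base is not described here.
ABBREVIATIONS = {"pt": "patient", "hx": "history", "tob": "tobacco"}


def reduce_plural(word: str) -> str:
    """Crude morphological reduction: strip a plain plural 's' (dogs -> dog)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def lexical_analyze(text: str) -> list:
    """Lower-case, tokenize, expand abbreviations, then reduce plurals."""
    tokens = re.findall(r"[a-z]+", text.lower())
    expanded = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    return [reduce_plural(tok) for tok in expanded]
```

A real system would use a proper lemmatizer and a curated vocabulary; the suffix rule here mishandles many English words.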

---
### Natural language processing and machine learning to enable automatic extraction and classification of patients’ smoking status from electronic medical records
* Uses Weka for classification

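* Dataset: split into two training sets
* The first ("NO-Freq"): column 1 identifies the smoking-status section, column 2 contains the actual text typed into that field, and column 3 gives the number of times the same sentence appears in the dataset (sentence frequency)
* The second ("Freq"): generated from the same study data, with each sentence repeated according to its frequency
* The two datasets were built to test whether sentence frequency affects classification, but the results showed that neither adding sentence frequency nor applying attribute selection improved model performance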
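
The relationship between the two training sets can be sketched as follows: each "NO-Freq" row carries a frequency count, and the "Freq" set repeats the row that many times. The tuple layout `(section, text, frequency)` mirrors the three columns described above. The function name `expand_by_frequency` and the sample rows in the test are made up for illustration.

```python
def expand_by_frequency(rows):
    """Build a 'Freq'-style list from 'NO-Freq'-style rows.

    rows: iterable of (section, text, frequency) tuples.
    Returns (section, text) pairs, each repeated `frequency` times.
    """
    expanded = []
    for section, text, frequency in rows:
        expanded.extend([(section, text)] * frequency)
    return expanded
```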
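* Preprocessing: the data is stored in Excel, and the Excel file is then converted to the ARFF format used by Weka
* The SMO classifier performed best in the smoking challenge, and it was also the best-performing classifier in this paper
* Tokenizing into unigrams and bigrams performed best
* Weka reference: http://blog.pulipuli.info/2017/04/weka-make-predictions-with-saved.html?fbclid=IwAR2ust4nqSx3aNHih1QH6pU5e_7IKTwbc-aKXbOCxhxeE_wX4hi_EzLSdyo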
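
For the preprocessing step above, a minimal sketch of writing labelled sentences to ARFF text is shown below. It assumes the rows have already been read out of the spreadsheet into `(text, label)` pairs. The relation name, the attribute names, the class labels in the test, and the function name `to_arff` are assumptions, and labels are assumed to contain no spaces or commas so they need no quoting.

```python
def to_arff(rows, relation="smoking_status"):
    """Serialize (text, label) pairs as an ARFF document with a string and a nominal attribute."""
    classes = sorted({label for _, label in rows})
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute class {" + ",".join(classes) + "}",
        "",
        "@data",
    ]
    for text, label in rows:
        # Quote the string value and escape backslashes and single quotes.
        escaped = text.replace("\\", "\\\\").replace("'", "\\'")
        lines.append(f"'{escaped}',{label}")
    return "\n".join(lines) + "\n"
```

The string attribute would still need to be turned into word features inside Weka before a classifier such as SMO can use it.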