# Digital Medicine Paper Notes

###### tags: `class`

### Use of Semantic Features to Classify Patient Smoking Status

* Uses MedLEE (Medical Language Extraction and Encoding). Of the three approaches compared (MedLEE, rule-based, text-trained), MedLEE performed best.
* https://github.com/WGLab/EHR-Phenolyzer : this GitHub repository contains the code for converting EHR text to MedLEE output.
* Records with no match for `smok|cig|tob|pack[^e]|nico` are treated as unknown.
* Three types of classifiers were used: lexical supervised, semantic supervised and semantic symbolic (i.e., rule-based).

#### Feature Selection

* MedLEE example ![](https://i.imgur.com/tdeESDj.png)
* Three attributes are selected from the MedLEE output: problem, certainty, and status.
    1. "problem" corresponded to the smoking-related action or object.
    2. "certainty" contained negation information.
    3. "status" contained temporal information.
* If the generated MedLEE output has no status tag, the data tag is used instead.

#### Classifier algorithms

* The BoosTexter classifier performed best, roughly on par with the WEKA classifiers.
* BoosTexter GitHub link: https://github.com/benob/icsiboost
* ![](https://i.imgur.com/rRhMj6l.png)

---

### Identifying Patient Smoking Status from Medical Discharge

* Uses textual features (e.g., “smok”, “tobac”, “cigar”, Social History, etc.) to determine smoking status.
* An overview journal paper that uses various statistical metrics to discuss how each approach performed.

---

### Using implicit information to identify smoking status in smoke-blind medical discharge summaries

* The figures and tables could not be found, even on the JAMIA site.

---

### Medical i2b2 NLP Smoking Challenge: The A-Life System Architecture and Methodology

* Uses LifeCode (LifeCode is a natural language processing (NLP) and medical coding expert system that extracts information from free-text clinical records).
  > LifeCode comes from the paper "LIFECODE A Deployed Application for Automated Medical Coding".
* ![](https://i.imgur.com/5B4Otyg.png)
* **Document segmenter**: splits and classifies the document by section heading.
* **Lexical Analyzer**: a processor that converts the text into strings from the KB vocabulary. Its functions include abbreviation expansion and morphological reduction (e.g., "dogs" and "dog" map to the same term).
* **Phrase parser**: a bottom-up syntactic parser that turns the input into phrases. The parse tolerates grammatical errors and unknown words. A text chunk ranges from two words up to a full sentence.
* **Concept Matcher**: uses vector analysis to assign meanings (KB concept labels) to each phrase.

#### Method

* CM-Extractor (from the paper "CM-Extractor: An Application for Automating Medical Quality Measures Abstraction in a Hospital Setting")
* LifeCode is used for CM-Extractor.
* CM-Extractor works in two steps. Step 1: convert the document into LifeCode output.
* Step 2: a rule-based approach, shown in the figure below. ![](https://i.imgur.com/82mtNJE.png)

---

### Natural language processing and machine learning to enable automatic extraction and classification of patients’ smoking status from electronic medical records

* Classification is done with Weka. ![](https://i.imgur.com/zICtnC0.png)
* Dataset: split into two training sets.
    * The first (NO-Freq): column 1 is the smoking-status field, column 2 holds the actual text typed into that field, and column 3 gives how many times the same sentence appears in the dataset (sentence frequency).
    * The second (Freq): generated from the same study data, with each sentence repeated according to its frequency.
    * The two training sets were meant to test whether sentence frequency affects the classification results. In the end, adding sentence frequency and attribute selection did not improve model performance.
* Preprocessing: store the data in Excel, then convert the Excel file to the ARFF format used by Weka.
* The SMO classifier performed best in the smoking challenge, and it is also the best-performing classifier in this paper.
* For tokens, unigrams and bigrams performed best.
* Weka reference: http://blog.pulipuli.info/2017/04/weka-make-predictions-with-saved.html?fbclid=IwAR2ust4nqSx3aNHih1QH6pU5e_7IKTwbc-aKXbOCxhxeE_wX4hi_EzLSdyo
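
The Freq training set and the ARFF conversion from the last paper's notes can be sketched in Python as follows. This is only a minimal sketch, not the paper's actual pipeline: the row layout `(label, text, frequency)`, the class labels, the relation name, and all helper names are assumptions made up for illustration, and the Excel-reading step is replaced by a few in-memory rows.

```python
from typing import Iterable, List, Tuple

# Hypothetical NO-Freq row: (status label, sentence text, sentence frequency)
Row = Tuple[str, str, int]


def expand_by_frequency(rows: Iterable[Row]) -> List[Tuple[str, str]]:
    """Build a Freq-style set by repeating each sentence `frequency` times."""
    expanded: List[Tuple[str, str]] = []
    for label, text, freq in rows:
        expanded.extend([(label, text)] * freq)
    return expanded


def quote_arff(value: str) -> str:
    """Single-quote a string value, escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def to_arff(pairs: List[Tuple[str, str]], relation: str = "smoking_status") -> str:
    """Render (label, text) pairs as ARFF text with a string and a nominal attribute."""
    labels = sorted({label for label, _ in pairs})
    lines = [
        f"@relation {relation}",
        "@attribute text string",
        "@attribute status {" + ",".join(labels) + "}",
        "@data",
    ]
    lines += [f"{quote_arff(text)},{label}" for label, text in pairs]
    return "\n".join(lines) + "\n"


# Made-up example rows; the labels are placeholders, not the paper's categories.
no_freq = [
    ("smoker", "smokes 1 pack per day", 3),
    ("nonsmoker", "denies tobacco use", 2),
    ("unknown", "patient's history not recorded", 1),
]

freq = expand_by_frequency(no_freq)
arff_text = to_arff(freq)
print(arff_text)
```

The printed text can be saved with an `.arff` extension and opened in Weka. Before training a classifier such as SMO, the string attribute would still need to be turned into word features, for example with Weka's `StringToWordVector` filter.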