--- title: 'Tâm - Setup MIMIC - 03/01/20' tags: BKHCMUT, Research, Cao Hoang Tru, Vo Thi Ngoc Chau --- Setup MIMIC - 03/01/20 === ## Table of Contents [TOC] ## Downloading MIMIC data to local PostGre - Data files Link https://physionet.org/content/mimiciii/1.4/ - Download via command line & web browser -> too slow - Download via Amazon s3: Access permission error (https://registry.opendata.aws/mimiciii/) - Install https://docs.aws.amazon.com/cli/latest/userguide/install-macos.html - configure: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration ``` AKIATUNWTO2R77CZSHNM p2RnhKyc9zJndRa8Z+nmVOoxqedLXkSIjCrRxgbc ``` - Run: ``` /Users/macmac/bin/aws s3 sync s3://mimic-iii-physionet ~/../../../Volumes/2Drive/MIMICiii-1.4/AWS ``` error: `fatal error: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied` - Using Amazon Athena (we don't use) https://aws.amazon.com/blogs/big-data/perform-biomedical-informatics-without-a-database-using-mimic-iii-data-and-amazon-athena/ - **Access via cloud** (https://mimic.physionet.org/gettingstarted/cloud/) granting access to Google Cloud account, then BigQuery can be used directly on the data on the cloud [demo here](https://mimic.physionet.org/tutorials/intro-to-mimic-iii-bq/). go to bigquery [here](https://console.cloud.google.com/bigquery?authuser=2&folder=&project=mimic-iii-hcmut&j=bq:US:bquxjob_22bfaa47_16f6bf3307e&page=queryresults) - Download via [Google Cloud Storage Browser:](https://console.cloud.google.com/storage/browser/mimiciii-1.4.physionet.org?authuser=2) ~30 mins for large files (~4GB), download concurrently ~60 mins - Đối với file downloads của Tâm, csv files download trực tiếp bị corrupted. Phải dùng .zip file và gunzip thì mới có file csv dùng được. ## Migrate PostGreSQL to SSD for more storage - **finding current data directory:** `psql -U postgres` - to move data to new directory in SSD ```BASH= sudo rsync -av /Library/PostgreSQL/12/data ~/../../../Volumes/APFS/postgres-data ``` - Edit postgres.conf file in `/var/postgres/`. Change `data_directory` to new value - Then stop the old server and start the new data server: ```BASH= # Stop current data server sudo -u postgres PG_CTL -D /Library/PostgreSQL/12/data stop # Start new server sudo -u postgres PG_CTL -D ~/../../../Volumes/APFS/postgres-data/data start # See the status of old and new servers sudo -u postgres PG_CTL -D /Library/PostgreSQL/12/data status sudo -u postgres PG_CTL -D ~/../../../Volumes/APFS/postgres-data/data status ``` **We need to stop old server and start new server everytime machine restarts in order to use new data directory** ## Create MIMIC PostGres database - Follow https://mimic.physionet.org/tutorials/install-mimic-locally-ubuntu/ - Create user mimicuser `CREATE USER mimicuser WITH PASSWORD 'jw8s0F4' CREATEDB SUPERUSER;` - `\dt *.*` allows to see the relations whereas just `\dt` will not show any relations because we are using mimic schema which is not public. - Because there was an cpu error when loading the data into tables, my computer restarts and some tables are still empty. I created a custom script based on setup scripts to load remaining data into the tables. - **Problems with load_data because of not UTF-8 encoding:** >github issue link >https://github.com/MIT-LCP/mimic-code/issues/662 > >**I fixed the problem by running load data in Mac Safe Mode and prevent crashing** ```bash= #Try to convert encoding using terminal iconv -f utf-8 -t ascii INPUTEVENTS_CV_1.csv > INPUTEVENTS_CV.csv iconv -f ascii -t utf-8 D_ICD_DIAGNOSES_2.csv > D_ICD_DIAGNOSES.csv iconv -f ascii -t utf-8 D_ICD_PROCEDURES_2.csv > D_ICD_PROCEDURES.csv iconv -f ascii -t utf-8 INPUTEVENTS_CV_2.csv > INPUTEVENTS_CV.csv iconv -f ascii -t utf-8 INPUTEVENTS_MV_2.csv > INPUTEVENTS_MV.csv iconv -f ascii -t utf-8 LABEVENTS_2.csv > LABEVENTS.csv iconv -f ascii -t utf-8 MICROBIOLOGYEVENTS_2.csv > MICROBIOLOGYEVENTS.csv iconv -f ascii -t utf-8 NOTEEVENTS_2.csv > NOTEEVENTS.csv iconv -f ascii -t utf-8 OUTPUTEVENTS_2.csv > OUTPUTEVENTS.csv iconv -f ascii -t utf-8 PATIENTS_2.csv > PATIENTS.csv iconv -f ascii -t utf-8 PRESCRIPTIONS_2.csv > PRESCRIPTIONS.csv iconv -f ascii -t utf-8 PROCEDUREEVENTS_MV_2.csv > PROCEDUREEVENTS_MV.csv iconv -f ascii -t utf-8 PROCEDURES_ICD_2.csv > PROCEDURES_ICD.csv iconv -f ascii -t utf-8 SERVICES_2.csv > SERVICES.csv iconv -f ascii -t utf-8 TRANSFERS_2.csv > TRANSFERS.csv mv D_ICD_DIAGNOSES.csv D_ICD_DIAGNOSES_3.csv mv D_ICD_PROCEDURES.csv D_ICD_PROCEDURES_3.csv mv INPUTEVENTS_CV.csv INPUTEVENTS_CV_3.csv mv INPUTEVENTS_MV.csv INPUTEVENTS_MV_3.csv mv LABEVENTS.csv LABEVENTS_3.csv mv MICROBIOLOGYEVENTS.csv MICROBIOLOGYEVENTS_3.csv mv NOTEEVENTS.csv NOTEEVENTS_3.csv mv OUTPUTEVENTS.csv OUTPUTEVENTS_3.csv mv PATIENTS.csv PATIENTS_3.csv mv PRESCRIPTIONS.csv PRESCRIPTIONS_3.csv mv PROCEDUREEVENTS_MV.csv PROCEDUREEVENTS_MV_3.csv mv PROCEDURES_ICD.csv PROCEDURES_ICD_3.csv mv SERVICES.csv SERVICES_3.csv mv TRANSFERS.csv TRANSFERS_3.csv mv INPUTEVENTS_CV.csv INPUTEVENTS_CV_3.txt #Check file encoding file D_ICD_DIAGNOSES.csv file -I D_ICD_DIAGNOSES.csv mv INPUTEVENTS_CV_3.csv INPUTEVENTS_CV.csv mv INPUTEVENTS_MV_3.csv INPUTEVENTS_MV.csv mv LABEVENTS_3.csv LABEVENTS.csv mv MICROBIOLOGYEVENTS_3.csv MICROBIOLOGYEVENTS.csv mv NOTEEVENTS_3.csv NOTEEVENTS.csv mv OUTPUTEVENTS_3.csv OUTPUTEVENTS.csv mv PATIENTS_3.csv PATIENTS.csv mv PRESCRIPTIONS_3.csv PRESCRIPTIONS.csv mv PROCEDUREEVENTS_MV_3.csv PROCEDUREEVENTS_MV.csv mv PROCEDURES_ICD_3.csv PROCEDURES_ICD.csv mv SERVICES_3.csv SERVICES.csv mv TRANSFERS_3.csv TRANSFERS.csv ``` > Server character set https://www.postgresql.org/docs/current/multibyte.html#id-1.6.10.5.7 > https://www.postgresql.org/docs/9.3/multibyte.html > https://stackoverflow.com/questions/4867272/invalid-byte-sequence-for-encoding-utf8 > copy with custom encoding > https://stackoverflow.com/questions/4867272/invalid-byte-sequence-for-encoding-utf8 > http://pgdocptbr.sourceforge.net/pg82/multibyte.html > >Converting iconv >https://www.tecmint.com/convert-files-to-utf-8-encoding-in-linux/ > fix_latin --strict-utf8 > https://www.endpoint.com/blog/2017/07/21/postgres-migrating-sqlascii-to-utf-8 > > us-ascii to utf-16 to utf-8 > iso-8859-1 > https://stackoverflow.com/questions/11303405/force-encode-from-us-ascii-to-utf-8-iconv > `iconv -f us-ascii -t utf-16 TRANSFERS_2.csv > TRANSFERS_4.csv ` `iconv -f utf-16be -t utf-8 TRANSFERS_4.csv > TRANSFER.csv ` - Seeing non-ASCII character `pcregrep --color='auto' -n "[\x80-\xFF]" TRANSFERS.csv ` - **Deleting non-ASCII character** > https://unix.stackexchange.com/questions/475548/removing-all-non-ascii-characters-from-a-workflow-file > `LC_ALL=C tr -dc '\0-\177' <TRANSFERS_5.csv >TRANSFERS.csv ` - **Converting to UTF-8 using Java** https://dzone.com/articles/the-programmers-way-to-convert-excel-to-csv https://www.mkyong.com/java/how-to-write-utf-8-encoded-data-into-a-file-java/ - **Convertin using C++** https://www.example-code.com/vcpp/charset_convert_file_from_utf8_to_ansi.asp https://rupinderjeetkaur.wordpress.com/2014/06/20/run-a-cc-program-on-terminal-using-gcc-compiler/ ```bash= sudo g++ -o converter chilkatConverter.cpp ``` Errors: `chilkatConverter.cpp:1:10: error: 'CkCharset.h' file not found with <angled> include; use "quotes" instead` `ld: symbol(s) not found for architecture x86_64` https://medium.com/@m.muizzsuddin_25037/error-ld-symbol-not-found-for-architecture-x86-64-a5e5b648ffc ================================================== ```bash= ./converter ``` ================================================== unlocking global: `sudo gcc –o unlock unlock.c` error: ``` clang: error: no such file or directory: '–o' clang: error: no such file or directory: 'unlock' ``` **I'm creating a new db named 'mimiciii' with schema 'mimic' and load data again** `psql 'dbname=mimiciii user=mimicuser options=--search_path=mimic' -f postgres_load_data.sql -v mimic_data_dir='mimic-iii-clinical-database-1.4'` but my computer crashed and restarted again. Then I ran into the same error of UTF-8 encoding when trying to fill the remaining tables **I fixed the problem by running load data in Mac Safe Mode and prevent crashing** ## Tìm hiểu MIMIC text - **Tổng quát MIMIC database:** - MIMIC-III v1.4 - Gồm 40 tables và tổng cộng 728,556,685 rows và 534 columns - Thông tin chi tiết về schema tại [đây](https://mit-lcp.github.io/mimic-schema-spy/index.html) - Thông tin thật của hơn 40 ngàn bệnh nhân tại Beth-Isreal Deaconess Medical Center critical care unit - Thông tin bao gồm demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital). - **MIMIC text containing caregivers' notes:** - [mimic.mimiciii.noteevents](https://mit-lcp.github.io/mimic-schema-spy/tables/noteevents.html): column text contains note as text string - Sampling command for mimiciii.noteevents text column - all columns with header:`\COPY (SELECT * FROM mimiciii.noteevents LIMIT 1 OFFSET 107) TO '~/../../Volumes/APFS/MIMIC/sample_noteeventstext_107.csv' csv header;` - only text column with header : `\COPY (SELECT text FROM mimiciii.noteevents LIMIT 5 OFFSET 107) TO '~/../../Volumes/APFS/MIMIC/sample_noteeventstext_107.csv' csv header;` - exporting all column text ` \COPY (SELECT text FROM mimiciii.noteevents) TO '~/../../Volumes/APFS/MIMIC/all_noteeventstext.csv' csv header;` - Table mimic.mimiciii.chartevents - - **Label của MIMIC text:** ## References :::info **Find this document incomplete?** Leave a comment! ::: ###### tags: `Sentiment Analysis` `MIMIC clinical text` `BKHCMUT`