# MDS Capstone OraQ: Dental Records

# 2022-04-19 Initial Capstone Project Meeting With Partner

https://github.ubc.ca/mds-2021-22/DSCI_591_capstone-proj_students/blob/master/proposals/Using_NLP_to_untangle_the_complex_web_of_dental_conditions.md

## Attendance

- Junghoo Kim, OraQ AI, Calgary, AB
    - data scientist
    - mds 2021
    - dentistry in calgary
- Edgar
    - ml scientist
- Daniel Chen
- Arlin
- Valli
- Gloria
- Doris

## Intros (10 min)

## From the initial proposal

Using NLP to untangle the complex web of dental conditions

We seek MDS students to help with the task of (1) identifying key terms associated with the conditions, (2) gaining insight about patients’ risk profiles for each of the conditions based on frequencies of key terms appearing in exam notes, (3) discovering any trends between the risk profiles for different conditions, and, if time permits, (4) identifying potential outliers that do not appear to follow these trends.

Students will have access to de-identified text notes from dental exams. There are currently text notes for 1400 patients, and more notes can be acquired and de-identified on an ongoing basis. Students will be given access to appropriate computing resources on Google Cloud Platform.

Depending on the scope of the project as determined by the students, one or more of the following would be submitted as data product: (1) method of estimating patient’s risk for conditions based on exam notes; (2) visualization of noticeable trends between risks for different conditions, and (3) method of detecting anomalies from discovered trends

## Project Overview

- motivation: OraQ is diagnostic software with new patients.
- multiple sources of info (intake, recorded findings)
    - patient medical, images, dental notes
    - would like to make use of all these 3 parts if possible
    - images would likely have a lot of information
    - but notes would as well (primary source of data)
- Every dentist may have different ways of describing the same things
- There are no labeled datasets available
    - given unlabeled data, use ML or a heuristic approach to extract information from notes
- data in dropbox right now (for access)
    - closer to May -> Google Cloud Platform
- data
    - 1400 patients
    - scraped data for training (for nlp / embedded models)
- Edgar worked on the dataset (most familiar with its content)
    - General distributions/types of features for risk prediction
    - Happy to talk about approaches he has followed
    - But also do not want to impose/direct potential solutions
    - Feel free to follow any leads
    - MDS for second pair of eyes
- UA student has done topic modeling project + publication already (similar to what has already been suggested in proposal)
    - unsupervised learning technique
    - This paper can be used as a starting point
    - Exploratory approach

## Questions

1. How is this specific project building on the current methodologies and solutions used at OraQ AI?
2. Among the four tasks you mentioned in the proposal, the first one is to “identify key terms associated with the conditions (sleep disorder, joint/muscular disorder etc)”. Does this mean we will need to conduct some initial research prior to building an ML model? Or have these terms been identified previously?
3. Will there be any training session provided in the beginning of the capstone to familiarize with the work that’s being done or on any new applications we will be using?
    - We have a few hackathons at the start of capstone, will there be any training sessions provided on things?
    - A: they can provide access to Google Cloud Platform, make an account for gcloud platform
    - A: resources will be provided
4. What are some biases, ethical dilemmas that we need to be aware of in the dataset?
    - A: there most likely will be biases. specific city. exam notes can share similar terms.
    - A: re: ethical/fairness: I (Junghoo) do not foresee any, but data+results may not generalize because of the smaller subset of data
    - A: things will have to be vetted before the project is commercially available
5. What processes are in place to handle appeals or mistakes?

- Q: will the initial part of capstone be identifying terms? or have terms already been identified?
    - A: joint disorder is the specific problem (sleep apnea). articles in batches of 100 have been downloaded.
- Q: journal articles?
    - A: exam notes themselves: other than the 1400, not really possible to run NLP analysis; research papers are used to figure out the important terms

1. What does the dataset look like? Are the text notes labeled?
2. Can we know how and where the data was collected? Shall we worry about the underlying biases?
3. Will there be training sessions to familiarize us with the domain knowledge needed for this project?

- Q: google cloud?
    - A: data is de-identified, data being on dropbox + local should be okay.
        - ideally limit that as much as possible
        - no safeguards with people downloading data
        - can use local setup to explore + train data
        - delete the data when you're done
    - google is there for compute resources
    - dropbox is there for data access sooner
    - Articles are not on google cloud platform yet (you can download via search terms)
- Q: OraQ GH org?
    - A: Yes. there is an org where we can put code in a repo
    - Dan: this is probably the best way moving forward, if possible, since the end of capstone will be very abrupt
- Q: text part of data in text form? or images of Rx of document?
    - A: there are images and photos, we will not stop you from accessing them, but there might be easier means to get the data we want
    - More preference for using the text data
    - potential for data leakage
    - research articles are (machine generated) PDF so easier to convert to text
- Q: in terms of what you have done already, where do you think we will be adding value?
    - A: Edgar: interesting dataset and many open ended questions and unexplored areas
        - there can be many different ways you can provide value
        - may be premature to tell you what those areas would be right now
        - Best to spend some time to explore data first
        - E.g., if you find some embedding to transform clinical notes into, might give you feature for prediction
        - they have explored in the past, but did not take it all the way
- Q: who will be at meetings?
    - A: Junghoo at almost all of the weekly meetings, Edgar (some weeks?)

## Action items

### MDS

- Read the paper from UA student (how data and model were used)
    - https://doi.org/10.3389/fdmed.2022.833191
- students provide dropbox emails
- 1400 patients text notes during intake

### OraQ

- Dropbox link for data
- Students provide email addresses
- GitHub Repo for OraQ
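
Tasks (1) and (2) in the initial proposal rest on counting how often key terms appear in a patient's exam notes. As a possible starting point for exploration, here is a minimal sketch of that counting step in plain Python. The condition names, term lists, and sample note are invented placeholders (not OraQ data or a vetted vocabulary), and `term_frequencies` and `condition_scores` are hypothetical helper names.

```python
import re
from collections import Counter

# Hypothetical key terms per condition (placeholders, not a vetted vocabulary).
KEY_TERMS = {
    "sleep_disorder": ["snoring", "apnea", "fatigue"],
    "joint_disorder": ["clicking", "jaw pain", "limited opening"],
}


def term_frequencies(note, key_terms=KEY_TERMS):
    """Count case-insensitive whole-phrase matches of each key term in one note."""
    text = note.lower()
    counts = {}
    for condition, terms in key_terms.items():
        per_term = Counter()
        for term in terms:
            # \b keeps "apnea" from matching inside a longer word such as "apneas".
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            per_term[term] = len(re.findall(pattern, text))
        counts[condition] = per_term
    return counts


def condition_scores(note, key_terms=KEY_TERMS):
    """Sum the term counts per condition as a crude, unnormalized score."""
    freqs = term_frequencies(note, key_terms)
    return {condition: sum(tc.values()) for condition, tc in freqs.items()}


# Made-up example note, for illustration only.
note = "Patient reports snoring and daytime fatigue. Jaw pain on opening; clicking noted."
print(condition_scores(note))
```

Raw counts like these ignore negation ("no snoring"), spelling variants, and note length, so they are only a baseline to compare against embedding or topic-model approaches.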