[![hackmd-github-sync-badge](https://hackmd.io/1XuD8eoQSs6iofAgLSRndw/badge)](https://hackmd.io/1XuD8eoQSs6iofAgLSRndw)

Biosurveillance Meeting Minutes, 8/3/2023
===

:::info
- **Date:** August 3, 2023, 11:00am ET
- **Agenda**
    1. Catch-up from last meeting
    2. Other updates
- **Participants:**
    - Shannon
    - Nathan
- **Host:** Shannon
:::

🔗 https://github.com/quinngroup/Twitter-Embedding-Analysis

Project Status
---

**No way to get the AAAI paper in at this point.**

Albert and Nathan met a few times to work on BERT embeddings
- will get GPU processing running within the week
- will then re-train with one-month time intervals
- **all within about a 2-week time frame**

Can confirm all the Scott County data with geo-tags are located in Indiana
- unclear what the non-geo-tagged data is doing

Drugs dataset is basically finished

Politics dataset is tougher to handle
- will simply run it and go with whatever results we get

:newspaper: Paper Submission
---

Need a new venue.

Biosurveillance Meeting Minutes, 6/12/2023
===

:::info
- **Date:** June 12, 2023, 11:00am ET
- **Agenda**
    1. Catch-up from last meeting
    2. Other updates
- **Participants:**
    - Shannon
    - Nathan
- **Host:** Shannon
:::

🔗 https://github.com/quinngroup/Twitter-Embedding-Analysis

Project Status
---

Drugs dataset has finished running through the model
- far outperformed the Scott County dataset (interesting! why?)
- each tweet has *something* related to drugs/alcohol

Scott County dataset
- problematic from a content perspective -- many tweets (the majority) seem generic
- should all have geo-tags, though, and dates within the relevant time frame
- re-training at one-month intervals to see if that improves model performance

Politics dataset
- tougher to preprocess

Temporal topic modeling
- two-week intervals for Scott County
- a bunch of results
- still need to run the other two datasets

How to validate the models?
- some "modern-day" datasets?
- hold-out sets?

:newspaper: Paper Submission
---

AAAI deadline is **August 8**! This seems like a perfect option
- get the AAAI LaTeX template
- start filling out headers
- fill out references
- assign sections

Biosurveillance Meeting Minutes, 4/25/2023
===

:::info
- **Date:** April 25, 2023, 11:00am ET
- **Agenda**
    1. Catch-up from last meeting
    2. Other updates
- **Participants:**
    - Shannon
    - Nathan
- **Host:** Shannon
:::

🔗 https://github.com/quinngroup/Twitter-Embedding-Analysis

:books: Catch-up from last meeting
---

- Zika data seems problematic
- PR "Clean code" is large and unwieldy
    - Let's merge as quickly as possible
    - Then open up new and smaller PRs for remaining issues
- Still unsure as to what BERT is doing with the sentence pairs
    - If creating a word embedding within tweets, why not just discard the concept of a tweet entirely and treat all the text within a bucket as one single (large) document?
- **Need to figure out what BERT is doing with the sentence pairs / 1s and 0s** (see the note below)
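A hedged note on the 1s and 0s: in standard BERT pre-training, sentence pairs carry a next-sentence-prediction (NSP) label, 1 when the second sentence actually follows the first in the corpus and 0 when it was sampled at random. A minimal sketch of that pairing, assuming our pipeline follows the standard recipe (the function name is a placeholder; worth confirming against the actual code):

```python
import random

def make_nsp_pairs(sentences):
    """Build BERT-style next-sentence-prediction pairs.

    Label 1: the second sentence really follows the first.
    Label 0: the second sentence is drawn at random from the corpus.
    """
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            pairs.append((sentences[i], random.choice(sentences), 0))
    return pairs
```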
Project status
---

- Get preprocessing going for the new dataset
- HIV dataset currently at 2-month buckets; want to re-run at 2-week buckets (to better match the new data), but that may be a runtime bottleneck
- GitHub repo
    - Other housekeeping tasks
        - README update https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/41
        - Branch naming https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/47
    - Want to split up and create new and smaller PRs
        - Merge [#50](https://github.com/quinngroup/Twitter-Embedding-Analysis/pull/50) as soon as possible
        - Open up new PRs for continuing work
    - Add bash scripts for each step

:dart: Goals for this week
---

- Figure out what BERT is doing
- Calculate distance metric across datasets
- Repo overhaul
- **Need a report from Albert**

Biosurveillance Meeting Minutes, 4/19/2023
===

:::info
- **Date:** April 19, 2023, 11:15am ET
- **Agenda**
    1. Catch-up from last meeting
    2. Other updates
- **Participants:**
    - Shannon
    - Albert
    - Nathan
- **Host:** Shannon
:::

🔗 https://github.com/quinngroup/Twitter-Embedding-Analysis

:books: Catch-up from last meeting
---

- Action items
    - Status of `daniel` ✅
    - Other twitter datasets:
        - zika
            - March 2, 2016 - December 30, 2016 (some gaps)
            - ~2.2GB
            - `daniel:/data/zika`
        - other, focused on drugs & alcohol mentions
            - March 27, 2014 - May 1, 2014
            - ~22GB
            - `daniel:/data/drugsalcohol`
- Developing a temporal distance metric to determine how "far" a search term is from its top similarity hits in the searched dataset
    - A distance metric has been made. When tested on the initial top terms, the distances don't appear to converge.
    - Re-ran on our preselected HIV-related terms, calculating distances between all terms from the list
        - Further experimentation needed
    - Also ran a comparison of each term to just the target "hiv"
        - Converged at a slightly higher rate than the average
    - Issue: due to the alignment part of the code, our words will converge.
        - Adjust for this average convergence in our calculations; look for more sensitive ways to calculate the shifts

Project status
---

- GitHub repo
    - Multiple stale issues
        - https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/45
        - https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/43
        - https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/39
        - https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/18
        - https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/10
    - Multiple stale branches
        - https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/46
    - Other housekeeping tasks
        - README update https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/41
        - Branch naming https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/47
- Albert and Nathan split up the work
    - Albert
        - Worked on getting BERT running
- Distance metric
    - Alignment will intrinsically converge over time
    - As a result, it's difficult to sort out what's useful and what's not
    - Try out different datasets? i.e., run an HIV-related query term against the zika or drugs & alcohol dataset; the top results should hopefully be much "further" away than the top results from the HIV dataset
- BERT
    - Have to do tweet-level embeddings
    - For the first time bucket, take the average position of any tweet tagged with HIV-related terms (done manually)
        - That's our **baseline** (get its standard deviation) -- see the sketch below
    - For any subsequent bucket, take distances of tweets containing HIV-related terms from that baseline point
    - What is the label (0 or 1) doing for the sentence pairings?
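A minimal sketch of the baseline-and-drift measurement described above (the cosine-distance choice and the function names are assumptions for illustration, not settled decisions):

```python
import numpy as np

def centroid_baseline(first_bucket):
    """Mean position (and spread) of HIV-tagged tweet embeddings in the
    first time bucket; rows are tweets, columns are embedding dimensions."""
    baseline = first_bucket.mean(axis=0)
    spread = first_bucket.std(axis=0).mean()
    return baseline, spread

def distances_to_baseline(bucket, baseline):
    """Cosine distance of each HIV-tagged tweet in a later bucket
    to the first-bucket baseline point."""
    sims = (bucket @ baseline) / (
        np.linalg.norm(bucket, axis=1) * np.linalg.norm(baseline))
    return 1.0 - sims
```

Comparing each later bucket's distance distribution against the first-bucket spread would flag buckets that drift beyond the baseline.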
:dart: Goals for this week
---

- Figure out what BERT is doing
- Calculate distance metric across datasets
- Repo overhaul

Biosurveillance Meeting Minutes, 3/24/2023
===

:::info
- **Date:** March 24, 2023, 1:00pm ET
- **Agenda**
    1. Catch-up from last meeting
    2. Other updates
- **Participants:**
    - Shannon
    - Albert
    - Nathan
    - Ivan
- **Host:** Shannon
:::

:books: Catch-up from last meeting
---

Ivan
- Back at UGA, graduating in the fall
- Going to Capital One this summer

Nathan
- Graduating this spring
- Working on this and one other research project

Albert
- Graduating this spring
- Applying to graduate school (so possibly coming back to UGA)

Project status
---

One of the reviewers' biggest complaints was that we didn't use BERT
- they felt our methodology was from 2017 (not compelling on its own)
- at least when it comes to creating embeddings (after training), BERT only works on small corpora
- could also use doc2vec to do the same

Idea for moving forward
- **Develop a temporal metric to "measure" how far the search term is from the top related terms** (see the sketch after this list)
- Discretize the model (if not pre-trained) over time
- For each time point, measure the distance (lots of wiggle room here) between the search term and the top (X?) related terms according to the embedding strategy of the model
- Show this distance over time (for the whole duration of the dataset) and see if there is a "convergence" where the distance shrinks, suggesting related terms are appearing much more frequently in the data
- Once this has been done for our SciPy model, look into other models
    - BERT
    - doc2vec
- Even if we don't have a variety of _data_, we can show that variations on the approach work consistently
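One way the per-time-point measurement could look, sketched against a gensim-style `most_similar` API (the `models` dict keyed by time bucket, the gensim interface, and the top-10 cutoff are all assumptions for illustration):

```python
import numpy as np

def term_drift(models, query, topn=10):
    """For each time bucket's embedding model, the mean cosine
    distance from `query` to its current top-`topn` neighbours."""
    drift = []
    for bucket, model in sorted(models.items()):
        neighbours = model.wv.most_similar(query, topn=topn)  # [(word, sim)]
        mean_dist = float(np.mean([1.0 - sim for _, sim in neighbours]))
        drift.append((bucket, mean_dist))
    return drift
```

A shrinking mean distance over successive buckets would be the "convergence" signal described above.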
:dart: Goals for this week
---

- [x] **Shannon**: look into other twitter datasets I have lying around
    - [x] See what time frames they cover
    - [x] Any historical events we could target to see how our model generalizes
- [x] **Shannon**: see what the status of `daniel` is from Piotr

:notebook: Notes
---

<!-- Other important details discussed during the meeting can be entered here. -->

---

Biosurveillance Meeting Minutes
===

:::info
- **Date:** April 22, 2022, 3:00pm ET
- **Agenda**
    1. Catch-up from last meeting
    2. SciPy progress
    3. Other updates
- **Participants:**
    - Shannon
    - Albert
    - Nathan
- **Host:** Shannon
:::

:books: Catch-up from last meeting
---

- mamba instead of conda
- Attendance of July 13 SciPy poster session
- Start an Overleaf document for the main 8-page paper
- Set up 2hrs/wk sync meetings for code reviews, debugging, paper writing, etc.
- Code profiling
    - Ivan started with Palanteer
    - Dask?
- glove, word2vec
- Metrics for evaluating model
- Debugging Albert's autorun
    - Doing SVD wrong :P -- a very expensive identity function

```python
import scipy.sparse.linalg as sla

dim_limit = 10  # e.g., keep only the top 10 dimensions

# scipy.sparse.linalg provides svds (truncated SVD), not a dense-style
# svd with full_matrices; it computes just the k largest singular
# triplets of the sparse matrix X, so there is no full-rank
# identity-function round trip.
U_hat, s_hat, Vt_hat = sla.svds(X, k=dim_limit)
# Note: svds returns the singular values in ascending order.
```

Move results from preprocessing somewhere else
- impossible to share, currently
- uploading usable stuff to GitHub

[Sparse matrix SVD](https://docs.scipy.org/doc/scipy/reference/sparse.html)

:snake: SciPy progress
---

Need to decide whether to use Overleaf or just reStructuredText

:dart: Goals for this week
---

- [x] debug Albert's autorun script
- [ ] start Overleaf or rst document
- [ ] refactor json data structure so everything fits in a single json file?
- [ ] do some research into how to align vector subspaces, e.g., word2vec with PPMI

:notebook: Notes
---

<!-- Other important details discussed during the meeting can be entered here. -->

---

:::info
- **Date:** April 8, 2022, 3:00pm ET
- **Agenda**
    1. Catch-up from last meeting
    2. SciPy acceptance
    3. Other updates
- **Participants:**
    - Shannon
    - Albert
    - Ivan
    - Nathan
- **Host:** Shannon
:::

:books: Catch-up from last meeting
---

- Checked out glove?
    - Still working on word2vec being useful
    - Both (word2vec and glove) have the same issues regarding dimensionality
- Dimensionality matching issues
    - :heavy_check_mark:
    - Now the roadblock is making the right words appear in the embedding
- Code profiling and performance testing
    - [Details](https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/28)
    - 6 nested for loops, [yikes](https://github.com/quinngroup/Twitter-Embedding-Analysis/blob/master/model/populate_cooccurrence.py#L11) (see the sketch at the end of this section)
    - Ticket on dask repo for getting examples working
    - conda resolving very, very slowly
        - Apparently just running the Jupyter notebook directly instead of setting up the conda environment works better
- Dynamic word embedding metrics
    - Quantitative evaluation requires the full dataset
    - Qualitative evaluation has been implemented

![](https://i.imgur.com/W2cEmWJ.png)

This is kinda fun!
- The model is definitely doing SOMETHING :+1:
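On the nested-loop co-occurrence construction flagged above: a sketch of building the counts in a single pass over tokenized tweets with a sliding window (the window size and function name are assumptions; the repo's `populate_cooccurrence.py` may be organized differently):

```python
from collections import Counter

def cooccurrence_counts(tokenized_tweets, window=5):
    """Symmetric co-occurrence counts within a sliding window,
    accumulated in one pass over the corpus rather than loops
    over the whole vocabulary."""
    counts = Counter()
    for tokens in tokenized_tweets:
        for i, word in enumerate(tokens):
            for context in tokens[i + 1 : i + 1 + window]:
                counts[(word, context)] += 1
                counts[(context, word)] += 1
    return counts
```

A `Counter` keyed by word pairs maps directly onto a `scipy.sparse` COO construction afterwards.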
:snake: SciPy acceptance
---

**Abstract accepted!** Now what?

Need to indicate whether we will be able to present the poster by **April 15**.
- [Response form](https://forms.gle/5BBiuZB21vSU9JQG7)
- If time is an issue, the bare minimum is that someone attends the poster session portion of the conference, which is **Wednesday, July 13 at 5:30pm CT**.
- Our submission number is **209**.

Would be good to start writing. Things we can already talk about:
- natural language processing, word embeddings specifically
- strengths of dynamic word embeddings
- what biosurveillance is and why it's important
- how dynamic word embeddings could potentially act as an early-warning system (our hypothesis)
- the data we're working with
- the technology stack we're using

## :dart: Goals for this week

- Determine in-person availability for July 13 (poster session)
- Someone start an Overleaf document for the main 8-page paper
- Set up 2hrs/wk sync meetings for code reviews, debugging, paper writing, etc.
- Find the dependencies for Palanteer and send them to me (Dr. Quinn)

## Notes

<!-- Other important details discussed during the meeting can be entered here. -->

[Mamba](https://mamba.readthedocs.io/en/latest/) instead of conda
- uninstall conda
- install mamba
- start all commands with `mamba` instead of `conda`, e.g., `mamba install python=3.9`

---

:::info
- **Date:** March 25, 2022, 3:00pm ET
- **Agenda**
    1. Catch-up from last meeting
    2. Other updates
- **Participants:**
    - Shannon
    - Nathan
    - Ivan
- **Host:** Shannon
:::

:books: Catch-up from last meeting
---

- Progressing well
    - slowest part of the pipeline is still the PPMI matrix construction
- Dimensionality matching is still a problem
    - two vectors aren't aligning
    - may need to increase the number of words included in our model
- word2vec working well, but glove has not been looked at yet
    - qualitatively, word vectors seem about right
- [updated the README](https://docs.google.com/document/d/1XWV6OWbYVt7--ID-Fw2Jg1q6ToOGIBm0cHJW02ay0ak/edit)

:snake: SciPy submission
---

- Ivan wants to move the document to Overleaf but hasn't done that yet

## :dart: Goals for this week

- start looking into dask
    - [ ] submit a ticket on the dask repo regarding the examples not working
- start identifying performance bottlenecks in the code
    - code profiling: time or timeit (Jupyter magic), [cProfile](https://docs.python.org/3/library/profile.html), [Palanteer](https://github.com/dfeneyrou/palanteer)
    - also start identifying sections of old code that need to be updated
- performance testing
    - use `%timeit` Jupyter magic with incremental increases in data size to determine roughly how fast the runtime grows with the data (linear? logarithmic? quadratic?)
- move the updated README to the repo
- implement the metrics from the dynamic word embeddings paper to measure the quality of the learned embeddings

## Notes

<!-- Other important details discussed during the meeting can be entered here. -->

---

:::info
- **Date:** Feb 11, 2022, 3:00pm ET
- **Agenda**
    1. Catch-up from last meeting
    2. State of alignment script
    3. Other updates
- **Participants:**
    - Shannon
    - Ivan
    - Nathan
    - Albert
- **Host:** Shannon
:::

:books: Catch-up from last meeting
---

- Didn't need to contact Zane; figured it out!
- No data on GitHub yet
- We have a primitive prototype model!
    - Dynamic temporal word embedding model
- Weekly team meetings?
- `scipy.linalg.eigh` seems to be working (at least with a small dataset)

:snake: SciPy submission
---

- draft: https://docs.google.com/document/d/1J3e3QMuUNPs3W6aoENlXfXe6oYlGpI_yNk_2Dt7H0wM/edit

:dart: Goals for next meeting
---

- Submit abstract to SciPy (due Feb 18)
    - Send Shannon the next complete draft for editing
- Establish baseline using existing word embedding methods
    - [word2vec](https://en.wikipedia.org/wiki/Word2vec)
    - [glove](https://nlp.stanford.edu/projects/glove/)

## Goals for this week

- All: edit SciPy document and submit
- Ivan and Albert: generate baseline embeddings using word2vec and glove
    - Operate on each time bucket independently

:closed_book: Tasks
--

==Importance== (1 *most* - 5 *least*) / Task / **Estimate** (# of hours)

- [ ] ==1== Finish editing abstract and submit
- [ ] ==1== Schedule an independent team meeting time
- [ ] ==2== Put testing dataset on the repo
- [ ] ==3== Update the README in the repo to more accurately reflect the goals of the project

## Notes

Many, many words occur only 5x or fewer
- The "long tail" of word count distributions
- Check if a word occurs at least once in all clusters? Or once in at least two clusters? (see the sketch below)
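A small sketch covering both filtering options from the note above ("clusters" read as time buckets; the thresholds and names are placeholders):

```python
from collections import Counter

def filter_vocab(bucketed_tokens, min_count=5, min_buckets=2):
    """Drop long-tail words: keep a word only if it occurs more than
    `min_count` times overall and appears in at least `min_buckets`
    time buckets (set min_buckets=len(bucketed_tokens) to require
    presence in every bucket)."""
    total = Counter()
    presence = Counter()
    for tokens in bucketed_tokens:  # one flat token list per time bucket
        bucket_counts = Counter(tokens)
        total.update(bucket_counts)
        presence.update(bucket_counts.keys())
    return {w for w in total
            if total[w] > min_count and presence[w] >= min_buckets}
```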
We can include some mention of the Trump twitter dataset as a second, unrelated validation dataset to make sure we still see the same phenomena.

---

:::info
- **Date:** Feb 3, 2022, 9:45am ET
- **Agenda**
    1. Catch-up from fall 2021
    2. Plans for spring 2022
- **Participants:**
    - Shannon
    - Ivan
    - Nathan
- **Host:** Shannon
:::

:books: Fall 2021 Review
---

- Backlog
- Personnel
- Codespaces

:dart: Spring 2022 Goals
---

- Weekly meetings of the team
- Submission to SciPy https://www.scipy2022.scipy.org/participate
    - Abstract due: **Feb** ~~11~~ **18**: https://www.scipy2022.scipy.org/talk-poster-presentations
    - Final paper: ~April/May
- Have a prototype of temporal word embeddings for Scott County, IN

## Goals for this week

- Put [some] data on GitHub
    - Repo: https://github.com/quinngroup/Twitter-Embedding-Analysis
    - Write a program to downsample the data for GitHub
- ~~Contact Zane about code he wrote to train model from PPMI data~~
- Start testing out SVD/eigh
    - Estimate of the full vocabulary: 722,591 words
    - Be careful about the data structure(s) you choose for this
- Update project README in the repo

:closed_book: Tasks
--

==Importance== (1 *most* - 5 *least*) / Task / **Estimate** (# of hours)

- [ ] ==1== Testing dataset on the repo
- [ ] ==1== word2vec/GloVe on each time bucket individually to form a baseline against the temporal word embeddings
- [ ] ==1== Split each tweet into a list of words to run the w2v model on
- [ ] ==5== Update the README in the repo to more accurately reflect the goals of the project

## Notes

Turning PPMI into the actual model
- reviewing the steps
- code in the repo doesn't quite match up with the steps in the paper

When does SVD happen?
- After the PPMI matrix is computed for a given time column
- Decompose using https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.eigh.html#scipy.linalg.eigh (see the sketch at the end of these notes)
    - eigenvectors `v` are the U(t)
    - eigenvalues `w` aren't used in the subsequent analysis but can still be useful as debugging information

Dask will most likely be needed at some point in the analysis pipeline, probably in the SVD/eigh step.
- "in-core": analysis can happen entirely in memory on one machine
- "out-of-core": analysis happens in chunks, in parallel, or distributed across nodes

Paper: https://arxiv.org/pdf/1703.00607.pdf
Temporal compass: https://arxiv.org/pdf/1906.02376.pdf

Ivan/Albert:
- autorun scripts for preprocessing AND model to get up to the SVD step
- will need to update the for loop (0-17) and uncomment chunks of code in the scripts
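A minimal sketch of that decomposition step, assuming each time column's PPMI matrix is a dense symmetric array (the function name and `dim` cutoff are placeholders):

```python
from scipy.linalg import eigh

def embed_from_ppmi(ppmi, dim=100):
    """Embed one time column: eigendecompose its symmetric PPMI
    matrix and keep the top-`dim` eigenvectors as U(t).

    eigh returns eigenvalues in ascending order, so the last `dim`
    columns of v correspond to the largest eigenvalues; w is returned
    only as debugging information."""
    w, v = eigh(ppmi)
    return v[:, -dim:], w[-dim:]
```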