Biosurveillance Meeting Minutes, 8/3/2023
===
:::info
- **Date:** August 3, 2023, 11:00am ET
- **Agenda**
1. Catch-up from last meeting
2. Other updates
- **Participants:**
- Shannon
- Nathan
- **Host:** Shannon
:::
🔗 https://github.com/quinngroup/Twitter-Embedding-Analysis
Project Status
---
**No way to get the AAAI paper in at this point**
Albert and Nathan met a few times to work on BERT embeddings
- will get GPU processing running within the week
- will then re-train with one month time intervals
- **all within about a 2-week time frame**
Can confirm all the Scott County data with geo-tags are located in Indiana
- unclear what all the non-geo-tagged data is doing
Drugs dataset is basically finished
Politics dataset is tougher to handle
- will simply run it and go with whatever results we get
:newspaper: Paper Submission
---
Need a new venue
Biosurveillance Meeting Minutes, 6/12/2023
===
:::info
- **Date:** June 12, 2023, 11:00am ET
- **Agenda**
1. Catch-up from last meeting
2. Other updates
- **Participants:**
- Shannon
- Nathan
- **Host:** Shannon
:::
🔗 https://github.com/quinngroup/Twitter-Embedding-Analysis
Project Status
---
Drugs dataset has finished running through the model
- way outperformed the Scott County dataset (interesting! why?)
- each tweet has *something* related to drugs/alcohol
Scott County dataset
- problematic from a content perspective -- many tweets (the majority) seem generic
- should all have geo-tags though, and dates within the relevant time frame
- re-training at one-month intervals, see if that improves model performance
Politics dataset
- tougher to preprocess
Temporal topic modeling
Two week intervals for Scott County
Bunch of results
Still need to run the other two datasets
How to validate the models?
- some "modern-day" datasets?
- hold-out sets?
:newspaper: Paper Submission
---
AAAI deadline is **August 8**! This seems like a perfect option
- get the AAAI Latex template
- start filling out headers
- fill out references
- assign sections
Biosurveillance Meeting Minutes, 4/25/2023
===
:::info
- **Date:** April 25, 2023, 11:00am ET
- **Agenda**
1. Catch-up from last meeting
2. Other updates
- **Participants:**
- Shannon
- Nathan
- **Host:** Shannon
:::
🔗 https://github.com/quinngroup/Twitter-Embedding-Analysis
:books: Catch-up from last meeting
---
- Zika data seems problematic
- PR "Clean code" is large and unwieldy
- Let's merge as quickly as possible
- Then open up new and smaller PRs for remaining issues
- Still unsure as to what BERT is doing with the sentence pairs
- If creating a word embedding within tweets, why not just discard the concept of a tweet entirely and have all the text within a bucket as one single (large) document?
- **Need to figure out what BERT is doing with the sentence pairs / 1s and 0s**
Project status
---
- Get preprocessing going for the new dataset
- HIV dataset currently at 2-month buckets; want to re-run to be 2-week buckets (to better match with new data) but may be a runtime bottleneck
- GitHub repo
- Other housekeeping tasks
- README update https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/41
- Branch naming https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/47
- Want to split up and create new and smaller PRs
- Merge [#50](https://github.com/quinngroup/Twitter-Embedding-Analysis/pull/50) as soon as possible
- Open up new PRs for continuing work
- Add bash scripts for each step
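Switching the HIV dataset from 2-month to 2-week buckets is just a change of bucket width. A minimal sketch of fixed-width bucket assignment (function and variable names are illustrative, not the repo's actual API):

```python
from datetime import datetime, timedelta

def assign_bucket(tweet_time, start, width=timedelta(weeks=2)):
    """Index of the fixed-width time bucket a tweet falls into,
    counting from the dataset's start date."""
    return (tweet_time - start) // width

# e.g., with a 2-week width, April 16 lands in bucket 1 of an April 1 start
assign_bucket(datetime(2014, 4, 16), datetime(2014, 4, 1))
```

The runtime concern is real: halving the bucket width roughly doubles the number of models to train, so total training time scales inversely with bucket width.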
:dart: Goals for this week
---
- Figure out what BERT is doing
- Calculate distance metric across datasets
- Repo overhaul
- **Need a report from Albert**
Biosurveillance Meeting Minutes, 4/19/2023
===
:::info
- **Date:** April 19, 2023, 11:15am ET
- **Agenda**
1. Catch-up from last meeting
2. Other updates
- **Participants:**
- Shannon
- Albert
- Nathan
- **Host:** Shannon
:::
🔗 https://github.com/quinngroup/Twitter-Embedding-Analysis
:books: Catch-up from last meeting
---
- Action items
- Status of `daniel` ✅
- Other twitter datasets:
- zika
- March 2, 2016 - December 30, 2016 (some gaps)
- ~2.2GB
- `daniel:/data/zika`
- other, focused on drugs & alcohol mentions
- March 27, 2014 - May 1, 2014
- ~22GB
- `daniel:/data/drugsalcohol`
- Developing a temporal distance metric to determine how "far" a search term is from its top similarity hits in the searched dataset
- A distance metric has been made. When tested on the initial top terms, the distances don't appear to converge.
- Re-ran on our preselected HIV related terms, calculating distances between all terms from the list
- Further experimentation needed
- Also ran on comparing each term to just the target "hiv"
- Converged at a slightly higher rate than the average
- Issue: the alignment step in the code causes words to converge on its own
- Adjust for this average convergence in our calculations, and look for more sensitive ways to measure the shifts
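The per-bucket distance tracking described above can be sketched in pure Python (`embeddings_by_bucket`, `query`, and `terms` are illustrative names, not the repo's actual API):

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def term_distances_over_time(embeddings_by_bucket, query, terms):
    """Average distance from `query` to each term, per time bucket.
    Buckets missing all of the terms yield None."""
    series = []
    for bucket in embeddings_by_bucket:
        dists = [cosine_distance(bucket[query], bucket[t])
                 for t in terms if t in bucket]
        series.append(sum(dists) / len(dists) if dists else None)
    return series
```

Any alignment-induced drift would have to be subtracted from this series before reading convergence into it.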
Project status
---
- GitHub repo
- Multiple stale issues
- https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/45
- https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/43
- https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/39
- https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/18
- https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/10
- Multiple stale branches
- https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/46
- Other housekeeping tasks
- README update https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/41
- Branch naming https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/47
- Albert and Nathan split up the work
- Albert
- Worked on getting BERT running
- Distance metric
- Alignment will intrinsically converge over time
- As a result, it's difficult to sort out what's useful and what's not
- Try out different datasets? i.e., run an HIV-related query term against the zika or drugs&alcohol dataset; hopefully the top results should be much "further" away than the top results from the HIV dataset
- BERT
- Have to do tweet-level embeddings
- For the first time bucket, take the average position for any tweet tagged with HIV-related terms (done manually)
- That's our **baseline** -- (Get standard deviation)
- For any subsequent bucket, take distances of tweets containing HIV-related terms compared to that baseline point
- What is the label (0 or 1) doing for the sentence pairings?
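The baseline-plus-distance idea above could look roughly like this (pure Python sketch; bucket structure and function names are illustrative):

```python
import math
from statistics import mean, stdev

def centroid(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    return [mean(dim) for dim in zip(*vectors)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def drift_from_baseline(bucket_embeddings):
    """Baseline = centroid of the first bucket's tweet-level embeddings
    (tweets tagged with HIV-related terms); returns the baseline, the
    standard deviation of distances around it, and the mean distance of
    each subsequent bucket from it."""
    first = bucket_embeddings[0]
    baseline = centroid(first)
    spread = stdev(euclidean(v, baseline) for v in first)
    drifts = [mean(euclidean(v, baseline) for v in bucket)
              for bucket in bucket_embeddings[1:]]
    return baseline, spread, drifts
```

A later bucket drifting more than a couple of `spread`s from the baseline would be the kind of shift worth flagging.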
:dart: Goals for this week
---
- Figure out what BERT is doing
- Calculate distance metric across datasets
- Repo overhaul
Biosurveillance Meeting Minutes, 3/24/2023
===
:::info
- **Date:** March 24, 2023, 1:00pm ET
- **Agenda**
1. Catch-up from last meeting
2. Other updates
- **Participants:**
- Shannon
- Albert
- Nathan
- Ivan
- **Host:** Shannon
:::
:books: Catch-up from last meeting
---
Ivan
- Back at UGA, graduating in the Fall
- Going to Capital One this summer
Nathan
- Graduating this spring
- Working on this and one other research project
Albert
- Graduating this spring
- Applying to graduate school (so possibly coming back to UGA)
Project status
---
One of the biggest suggestions from the reviewers was to use BERT
- complained that our methodology was from 2017 (not compelling on its own)
- at least when it comes to creating embeddings (after training), BERT only works on small corpora
- could also use doc2vec to do the same
Idea for moving forward
- **Develop a temporal metric to "measure" how far the search term is from the top related terms**
- Discretize the model (if not pre-trained) over time
- For each time point, measure the distance (lots of wiggle room here) between the search term and the top (X?) related terms according to the embedding strategy of the model
- Show this distance over time (for the whole duration of the dataset) and see if there is a "convergence" where the distance shrinks, suggesting there are related terms appearing much more frequently in the data
- Once this has been done for our SciPy model, look into other models
- BERT
- doc2vec
- Even if we don't have a variety of _data_, we can show variations on the approach work consistently
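One crude way to operationalize the "convergence" check above (the window size is an arbitrary choice, not something we settled on):

```python
def is_converging(series, window=3):
    """True if the mean search-term distance over the last `window` time
    buckets is smaller than over the first `window` buckets, i.e., related
    terms have moved closer to the search term over the dataset."""
    head, tail = series[:window], series[-window:]
    return sum(tail) / len(tail) < sum(head) / len(head)
```

A more sensitive version might fit a slope to the series instead of comparing endpoint windows.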
:dart: Goals for this week
---
- [x] **Shannon**: look into other twitter datasets I have laying around
- [x] See what time frames they cover
- [x] Any historical events we could target to see how our model generalizes
- [x] **Shannon**: see what the status of `daniel` is from Piotr
:notebook: Notes
---
<!-- Other important details discussed during the meeting can be entered here. -->
---
Biosurveillance Meeting Minutes
===
:::info
- **Date:** April 22, 2022, 3:00pm ET
- **Agenda**
1. Catch-up from last meeting
2. SciPy progress
3. Other updates
- **Participants:**
- Shannon
- Albert
- Nathan
- **Host:** Shannon
:::
:books: Catch-up from last meeting
---
- mamba instead of conda
- Attendance of July 13 SciPy poster session
- Start an Overleaf document for the main 8-page paper
- Set up 2hrs/wk sync meetings for code reviews, debugging, paper writing, etc
- Code profiling
- Ivan started with Palanteer
- Dask?
- glove, word2vec
- Metrics for evaluating model
- Debugging Albert's autorun
- Doing SVD wrong :P a very expensive identity function
```python
import scipy.sparse.linalg as sla

dim_limit = 10  # e.g., top 10 dimensions
# svds computes only the k requested singular triplets, so there's no need
# to take the full SVD and slice it down afterward (scipy.sparse.linalg has
# no `svd`, only `svds`; note it returns singular values in ascending order)
U_hat, s_hat, Vt_hat = sla.svds(X, k=dim_limit)
```
Move results from preprocessing somewhere else
- impossible to share, currently
Uploading usable stuff to GitHub
[Sparse matrix SVD](https://docs.scipy.org/doc/scipy/reference/sparse.html)
:snake: SciPy progress
---
Need to decide whether to use Overleaf or just reStructuredText
:dart: Goals for this week
---
- [x] debug Albert's autorun script
- [ ] start Overleaf or rst document
- [ ] refactor json data structure so everything fits in a single json file?
- [ ] do some research into how to align vector subspaces, e.g. word2vec with PPMI
:notebook: Notes
---
<!-- Other important details discussed during the meeting can be entered here. -->
---
:::info
- **Date:** April 8, 2022, 3:00pm ET
- **Agenda**
1. Catch-up from last meeting
2. SciPy acceptance
3. Other updates
- **Participants:**
- Shannon
- Albert
- Ivan
- Nathan
- **Host:** Shannon
:::
:books: Catch-up from last meeting
---
- Checked out glove?
- Still working on word2vec being useful
- Both (word2vec and glove) have the same issues regarding dimensionality
- Dimensionality matching issues
- :heavy_check_mark:
- Now the roadblock is making the right words appear in the embedding
- Code profiling and performance testing
- [Details](https://github.com/quinngroup/Twitter-Embedding-Analysis/issues/28)
- 6 nested for loops, [yikes](https://github.com/quinngroup/Twitter-Embedding-Analysis/blob/master/model/populate_cooccurrence.py#L11)
- Ticket on dask repo for getting examples working
- conda resolving very, very slowly
- Apparently just running the Jupyter notebook directly instead of setting up the conda environment works better
- Dynamic word embedding metrics
- Quantitative evaluation requires the full dataset
- Qualitative evaluation has been implemented

This is kinda fun!
- The model is definitely doing SOMETHING :+1:
:snake: SciPy acceptance
---
**Abstract accepted!** Now what?
Need to indicate whether we will be able to present the poster by **April 15**.
- [Response form](https://forms.gle/5BBiuZB21vSU9JQG7)
- If time is an issue, bare minimum is we need someone to attend the poster session portion of the conference, which is **Wednesday, July 13 at 5:30pm CT**.
- Our submission number is **209**.
Would be good to start writing. Things we can already talk about:
- natural language processing, word embeddings specifically
- strengths of dynamic word embeddings
- what biosurveillance is and why it's important
- how dynamic word embeddings could potentially act as an early-warning system (our hypothesis)
- the data we're working with
- the technology stack we're using
## :dart: Goals for this week
- Determine in-person availability for July 13 (poster session)
- Someone start an Overleaf document for the main 8-page paper
- Set up 2hrs/wk sync meetings for code reviews, debugging, paper writing, etc
- Find the dependencies for Palanteer and send them to me (Dr. Quinn)
## Notes
<!-- Other important details discussed during the meeting can be entered here. -->
[Mamba](https://mamba.readthedocs.io/en/latest/) instead of conda
- uninstall conda
- install mamba
- do everything starting commands with `mamba` instead of `conda`, e.g., `mamba install python=3.9`
---
###### tags: `Templates` `Meeting`
:::info
- **Date:** March 25, 2022, 3:00pm ET
- **Agenda**
1. Catch-up from last meeting
2. Other updates
- **Participants:**
- Shannon
- Nathan
- Ivan
- **Host:** Shannon
:::
:books: Catch-up from last meeting
---
- Progressing well
- slowest part of the pipeline is still the PPMI matrix construction
- Dimensionality matching is still a problem
- two vectors aren't aligning
- may need to increase the number of words included in our model
- word2vec working well, but glove has not been looked at yet
- qualitatively, word vectors seem about right
- [updated the README](https://docs.google.com/document/d/1XWV6OWbYVt7--ID-Fw2Jg1q6ToOGIBm0cHJW02ay0ak/edit)
:snake: SciPy submission
---
- Ivan wants to move the document to Overleaf but has not done that yet
## :dart: Goals for this week
- start looking into dask
- [ ] submit a ticket on the dask repo regarding the examples not working
- start identifying performance bottlenecks in the code
- code profiling: time or timeit (Jupyter magic), [cProfile](https://docs.python.org/3/library/profile.html), [Palanteer](https://github.com/dfeneyrou/palanteer)
- also start identifying sections of old code that need to be updated
- performance testing
- use `%timeit` Jupyter magics with incremental increases in data size to determine roughly how fast the runtime increases with the data (linear? logarithmic? quadratic?)
- move the updated README to the repo
- implement the metrics in the dynamic word embeddings paper to measure the quality of the learned embeddings
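The `%timeit`-at-increasing-sizes idea above also works outside Jupyter with the stdlib `timeit` module. A minimal sketch (`build_counts` is a hypothetical stand-in for a real pipeline step, not code from the repo):

```python
import timeit

def build_counts(tokens):
    """Hypothetical stand-in for one pipeline step (e.g., word counting)."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts

# Time the step at 10x-increasing input sizes; if the runtime also grows
# ~10x per step, the step is roughly linear in the data size
for n in (1_000, 10_000, 100_000):
    data = ["tok%d" % (i % 100) for i in range(n)]
    print(n, timeit.timeit(lambda: build_counts(data), number=5))
```

The same loop wrapped around the PPMI construction step would show whether it scales linearly or worse.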
## Notes
<!-- Other important details discussed during the meeting can be entered here. -->
---
:::info
- **Date:** Feb 11, 2022, 3:00pm ET
- **Agenda**
1. Catch-up from last meeting
2. State of alignment script
3. Other updates
- **Participants:**
- Shannon
- Ivan
- Nathan
- Albert
- **Host:** Shannon
:::
:books: Catch-up from last meeting
---
- Didn't need to contact Zane; figured it out!
- No data on GitHub yet
- We have a primitive prototype model!
- Dynamic temporal word embedding model
- Weekly team meetings?
- `scipy.linalg.eigh` seems to be working (at least with small dataset)
:snake: SciPy submission
---
- draft: https://docs.google.com/document/d/1J3e3QMuUNPs3W6aoENlXfXe6oYlGpI_yNk_2Dt7H0wM/edit
:dart: Goals for next meeting
---
- Submit abstract to SciPy (due Feb 18)
- Send Shannon the next complete draft for editing
- Establish baseline using existing word embedding methods
- [word2vec](https://en.wikipedia.org/wiki/Word2vec)
- [glove](https://nlp.stanford.edu/projects/glove/)
## Goals for this week
- All: edit SciPy document and submit
- Ivan and Albert: generate baseline embeddings using word2vec and glove
- Operate on each time bucket independently
:closed_book: Tasks
--
==Importance== (1 *most* - 5 *least*) / Task / **Estimate** (# of hours)
- [ ] ==1== Finish editing abstract and submit
- [ ] ==1== Schedule an independent team meeting time
- [ ] ==2== Put testing dataset on the repo
- [ ] ==3== Update the README in the repo to more accurately reflect the goals of the project
## Notes
<!-- Other important details discussed during the meeting can be entered here. -->
Many, many words that only occur 5x or fewer
- The "long tail" of word count distributions
- Check if word occurs at least once in all clusters? Or once in at least two clusters?
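The long-tail pruning discussed above could look something like this (thresholds and names are illustrative, matching the "5x or fewer" and "at least two clusters" options floated):

```python
from collections import Counter

def prune_long_tail(buckets, min_total=5, min_buckets=2):
    """Keep words that occur more than `min_total` times overall AND
    appear in at least `min_buckets` time buckets/clusters."""
    total, presence = Counter(), Counter()
    for tokens in buckets:
        counts = Counter(tokens)
        total.update(counts)
        presence.update(counts.keys())
    return {w for w in total
            if total[w] > min_total and presence[w] >= min_buckets}
```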
We can include some mention of Trump twitter dataset as a second, unrelated validation dataset to make sure we still see the same phenomena.
---
:::info
- **Date:** Feb 3, 2022, 9:45am ET
- **Agenda**
1. Catch-up from fall 2021
2. Plans for spring 2022
- **Participants:**
- Shannon
- Ivan
- Nathan
- **Host:** Shannon
:::
:books: Fall 2021 Review
---
- Backlog
- Personnel
- Codespaces
:dart: Spring 2022 Goals
---
- Weekly meetings of the team
- Submission to SciPy https://www.scipy2022.scipy.org/participate
- Abstract due: **Feb** ~~11~~ **18**: https://www.scipy2022.scipy.org/talk-poster-presentations
- Final paper: ~April/May
- Have prototype of temporal word embeddings for Scott County, IN
## Goals for this week
- Put [some] data on GitHub
- Repo: https://github.com/quinngroup/Twitter-Embedding-Analysis
- Write a program to downsample data to put on GitHub
- ~~Contact Zane about code he wrote to train model from PPMI data~~
- Start testing out SVD/eigh
- Estimate of full vocabulary set: 722,591 words
- Be careful about the data structure(s) you choose for this
- Update project README in the repo
:closed_book: Tasks
--
==Importance== (1 *most* - 5 *least*) / Task / **Estimate** (# of hours)
- [ ] ==1== Testing dataset on the repo
- [ ] ==1== word2vec/GloVe on each time bucket individually to form a baseline against temporal word embeddings
- [ ] ==1== split each Tweet into a list of words to run the w2v model on
- [ ] ==5== Update the README in the repo to more accurately reflect the goals of the project
## Notes
<!-- Other important details discussed during the meeting can be entered here. -->
Turning PPMI into the actual model
- reviewing the steps
- code in the repo doesn't quite match up with steps in the paper
When does SVD happen?
- After the PPMI matrix is computed for a given time column
- Decompose using https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.eigh.html#scipy.linalg.eigh
- eigenvectors `v` are the U(t)
- eigenvalues `w` are not used in the subsequent analysis but can still be useful as debugging information
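The eigh-per-time-column step above, sketched on a toy matrix (the PPMI values are made up; numpy's `eigh` follows the same convention as the linked `scipy.linalg.eigh`):

```python
import numpy as np

# Toy symmetric "PPMI" matrix for one time bucket (values are illustrative);
# eigh returns eigenvalues w in ascending order, eigenvectors as columns of v
ppmi = np.array([[2.0, 1.0, 0.0],
                 [1.0, 3.0, 0.5],
                 [0.0, 0.5, 1.0]])
w, v = np.linalg.eigh(ppmi)

d = 2             # embedding dimensionality
U_t = v[:, -d:]   # columns for the d largest eigenvalues -> this bucket's U(t)
```

Because `w` comes back ascending, the top-d eigenvectors are the *last* d columns, not the first.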
Dask will most likely be needed at some point in the analysis pipeline, most likely in the SVD/eigh step.
- "in-core": analysis can happen entirely in-memory on one machine
- "out-of-core": analysis happens either in parallel or chunks or distributed across nodes
Paper: https://arxiv.org/pdf/1703.00607.pdf
Temporal compass: https://arxiv.org/pdf/1906.02376.pdf
Ivan/Albert:
- autorun scripts for preprocessing AND model to get up to SVD step
- will need to update for loop (0-17) and uncomment chunks of code in scripts