# Measuring faith in technology and science in historical newspapers
###### tags: `1. Main`
[TOC]
## This document
Here's a working document for text mining for the Deep Transitions / Industrial Modernity doc. I expect everyone here knows the project basics.
I'm using markdown here because I find Google Docs harder to manage; HackMD may work well for this task. Editable doc link: https://hackmd.io/@deeptransitions/SJWJla5Dv/edit. Markdown offers bare-bones editing with the essentials: headers, lists, images (upload via img), comments, and easy online editing. This document is currently public but unlisted; if we keep using it, I'll restrict sharing to just us later.
> [name=Peeter Tinits] [color=#35ff57] Example comment: It is also possible to comment here.
## Docs
### Techno/Science optimism
- [Data: texts on tech /science](/iWvUrfEIQZG6hD0KnA4__g)
- [Custom sentiment lexicons, wordsets 1,2,3](https://hackmd.io/lhi_lghjTom2OXBp-QIPzA)
- [Ngram search (Ruben)](https://hackmd.io/@zoJes2XdQUe1xc1pGikQ-w/Hy5oA3KtD)
- [Simple 3-gram searches (Peeter)](https://hackmd.io/4uveBz_6T62iIlqSm8AKZQ)
- [Future-directed verbs](/ijJw1UBITty-SO_ZSPqeBw)
- [Just the word future](/7_EkvG-zQY-7sDBIOlg8hA)
### Technocracy
- [Data: texts on government]()
- [Technocracy experiment: theory-driven cowords](https://hackmd.io/CtMcY3vlRfG2Tg9iwPSp2w)
## Status
### Now
#### What we discussed:
1) Going for adjectives seems a good approach, clustering and/or annotating
2) Possibly focus on the extremes - or another way to get higher precision
3) Simple thematic co-words work surprisingly well for technocracy; we can try them on other corpora
#### Here's what's next:
1) Build clusters of adjectives in context, hoping to get relevant clusters. (Ruben)
2) Will look for extreme (+ high precision) adjectives from the frequency list. (will share with all)
3) Will rerun models lowercased and most punctuation removed. (Peeter)
4) Will run models with keywords with decade-tags. (Peeter)
5) Will extend technocracy results to AUS-1995+, NYT, India.
6) Can use events for validation
---
### Meeting 1: brainstorm on method, discuss and formulate pipeline.
1. First, technooptimism / faith in science.
- PRIORITY 1:
- **Brainstorm** research strategies
- Necessary steps for each, sequence to create test workflows
- Tasks for the workflows available to check and validate
- GRADUALLY:
- Discuss distributing tasks
- Figure out deadlines to stay on track
2. If time, also thoughts on technocracy.
### Meeting 2: Status, next steps.
## Aim
We are trying to measure the level of faith in science / technology, very approximately, in historical newspapers. A technically simple solution is preferred.
[Examples and scale on measuring techno-science-optimism](/pqn3ToQ_SRKYv89LCTEWdA)
#### Indicative words
- Optimism: marvel, remarkable, accomplishments, material progress (era-specific), tremendous
- Pessimism: endanger, error
- Realist: destruction of jobs, creation of jobs, jobs, cheapening, salary, environmental, skills, experience, human problems, trade
Then use word embeddings (trained across all texts, not only the search results) to expand this vocabulary, and look for the presence of the expanded vocabulary in the search results.
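
A minimal sketch of the expansion step, assuming we already have word vectors as a plain dictionary. The toy 3-dimensional vectors and the function names `expand_vocabulary` and `count_vocabulary` are made up for illustration; real vectors would come from whatever embedding model we end up training.

```python
import math
import re

# Toy vectors, purely illustrative (NOT trained on any corpus).
TOY_VECTORS = {
    "marvel": [0.9, 0.1, 0.0],
    "remarkable": [0.8, 0.2, 0.1],
    "tremendous": [0.85, 0.15, 0.05],
    "endanger": [0.0, 0.9, 0.1],
    "error": [0.1, 0.8, 0.2],
    "salary": [0.1, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_vocabulary(seeds, vectors, top_n=2):
    """Return the top_n non-seed words closest to the average seed vector."""
    known = [vectors[w] for w in seeds if w in vectors]
    if not known:
        return []
    dims = len(known[0])
    centroid = [sum(v[i] for v in known) / len(known) for i in range(dims)]
    candidates = [
        (cosine(centroid, vec), word)
        for word, vec in vectors.items()
        if word not in seeds
    ]
    candidates.sort(reverse=True)
    return [word for _, word in candidates[:top_n]]


def count_vocabulary(snippet, vocabulary):
    """Count tokens of a search-result snippet that are in the vocabulary."""
    tokens = re.findall(r"[a-z]+", snippet.lower())
    return sum(1 for t in tokens if t in vocabulary)


optimism = {"marvel"} | set(expand_vocabulary(["marvel"], TOY_VECTORS))
hits = count_vocabulary("A remarkable and tremendous marvel of the age", optimism)
```

The expanded lists would still need a manual check before use, since nearest neighbours often include antonyms and OCR variants.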
## Strategy
General strategy alternatives to pursue. Mainly, the optimism/realism could be found in the words cooccurring with science/technology/progress. Simple sentiment detection may not work best here, though a custom list of meaningful cowords could do the job.
There are some different strategy ideas here, but none really solid or settled. Gathering different strategy ideas will be the first task at the brainstorm.
### S1 Limited cowords (progress)
Naive belief in science and technology. We discussed three possibilities here:
1. Make a list of synonyms for material/industrial/technical/technological & scientific progress/advancement/improvement/development. Measure the share of these expressions from the overall corpus (more progress = more continuity in industrial modernity).
2. Make a list of expressions containing one word + synonyms for progress/advancement/improvement/development. We discussed the possibility of focusing on the adjectives, but it has its drawbacks (e.g. "human progress" would go missing). Cluster the expressions in topics and measure the relative share of each (more "material etc. progress" = more continuity in industrial modernity).
3. Extract the snippets for all items specified in point 1. Measure the share of critical discussion over time (more critical discussion = more attempts to balance excessive techno-enthusiasm = possible rupture in industrial modernity).
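
Option 1 above could be prototyped roughly like this. It is only a sketch: the modifier and head lists are taken from the wording of point 1, while the `(year, text)` input format and the per-1,000-tokens normalisation are assumptions, not something we have agreed on.

```python
import re
from collections import Counter

MODIFIERS = ["material", "industrial", "technical", "technological", "scientific"]
HEADS = ["progress", "advancement", "improvement", "development"]

# Matches e.g. "material progress" or "scientific advancement".
PATTERN = re.compile(
    r"\b(?:%s)\s+(?:%s)\b" % ("|".join(MODIFIERS), "|".join(HEADS)),
    re.IGNORECASE,
)


def share_per_year(docs):
    """docs: iterable of (year, text). Returns {year: matches per 1,000 tokens}."""
    hits, tokens = Counter(), Counter()
    for year, text in docs:
        hits[year] += len(PATTERN.findall(text))
        tokens[year] += len(text.split())  # crude whitespace tokenisation
    return {y: 1000 * hits[y] / tokens[y] for y in tokens if tokens[y]}


example_docs = [
    (1900, "Material progress and scientific advancement marked the year just ended"),
    (1950, "No news of note today"),
]
shares = share_per_year(example_docs)
```

Normalising by token count per year matters here because the dataset is unbalanced over years.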
### S2 Word embeddings (progress)
1. Construct semantic “dimensions”: Gather two groups of words -> average their vectors -> subtract one average vector from the other -> use the resulting vector as a scale.
E.g. hard vs soft words - technol*, industr* vs human*, relig*
E.g. pos vs neg words - good, wonderful vs awful, disaster.

2. Tracking “natural” semantic change with PCA
The idea is to visualize multi-dimensional space of semantic change in 2D with PCA.
1. We get vectors (100 dimensions) for each of the “progress” words;
2. We get some closest neighbors for each of the “progress” words;
3. We exclude other “progress” words from that list + all the “development” stuff that was tagged before;
4. We also use only unique words (progress words share a lot of neighbors)
5. Combine the vectors together;
6. Run the PCA;
7. Visualize the top 2 PCs.
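
Both ideas in this section can be sketched with numpy alone. This is a toy illustration, not our actual workflow: the 4-dimensional vectors stand in for the 100-dimensional ones, and the function names `semantic_axis`, `project` and `pca_2d` are made up here.

```python
import numpy as np


def semantic_axis(vectors, pole_a, pole_b):
    """Average each word group, subtract one mean from the other, normalise."""
    mean_a = np.mean([vectors[w] for w in pole_a], axis=0)
    mean_b = np.mean([vectors[w] for w in pole_b], axis=0)
    axis = mean_a - mean_b
    return axis / np.linalg.norm(axis)


def project(vectors, word, axis):
    """Cosine of a word vector with the axis: > 0 leans to pole_a, < 0 to pole_b."""
    v = np.asarray(vectors[word], dtype=float)
    return float(v @ axis / np.linalg.norm(v))


def pca_2d(matrix):
    """Project rows onto the top two principal components via SVD."""
    centred = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T


# Toy vectors, purely illustrative (NOT trained on any corpus).
toy = {
    "good": np.array([1.0, 0.0, 0.2, 0.0]),
    "wonderful": np.array([0.9, 0.1, 0.1, 0.0]),
    "awful": np.array([-1.0, 0.0, 0.2, 0.1]),
    "disaster": np.array([-0.8, 0.2, 0.1, 0.0]),
    "progress": np.array([0.6, 0.5, 0.3, 0.1]),
}

axis = semantic_axis(toy, ["good", "wonderful"], ["awful", "disaster"])
score = project(toy, "progress", axis)          # position of "progress" on the pos/neg scale
coords = pca_2d(np.stack(list(toy.values())))   # one 2D point per word, ready for plotting
```

With per-decade vectors, `project` would give one score per decade for each target word, which is the timeline we are after.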

### S3 Coword sentiments (science/technology)
1. Measure sentiment around the words science / technology / progress. However, newspaper text does not seem to be well suited to this: first attempts with lexicon-based approaches did not give very appropriate results.
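
For reference, a lexicon-based window score could look roughly like the sketch below. The tiny word lists reuse the indicative and example words from this document, and the window size and the scoring rule (positive minus negative cowords) are assumptions; this is not the setup used in the first attempts.

```python
import re

# Illustrative mini-lexicons; a real run would use the custom sentiment lexicons.
POSITIVE = {"marvel", "remarkable", "tremendous", "wonderful"}
NEGATIVE = {"endanger", "error", "awful", "disaster"}
TARGETS = {"science", "technology", "progress"}


def window_scores(text, window=5):
    """For each target word, count positive minus negative words within +/- window tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = []
    for i, tok in enumerate(tokens):
        if tok in TARGETS:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            pos = sum(t in POSITIVE for t in context)
            neg = sum(t in NEGATIVE for t in context)
            scores.append((tok, pos - neg))
    return scores


example = "The remarkable progress of the age is a marvel. An error may endanger science."
result = window_scores(example, window=3)
```

Exact-match lexicons like this miss inflected forms and OCR errors, which may be part of why the early results looked off.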
## Useful materials
Useful data, plots, example papers
### Progress syns + contexts
Example search results "progress" - [explore here](https://peetertinits.github.io/DT_temp/concs.html)!
The set contains only matches for progress + synonyms (507,188 matches) that also have at least one of material|industrial|technical|technological|industry|technolog (12,077 matches), scientific|science|research (3,725 matches), or human|spiritual|mankind (2,205 matches) within +/-30 chars of the match.
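
The filter described above can be approximated as follows. This is a sketch, not the script that produced the linked results: the progress synonym list is assumed from section S1, and the category names `tech`, `sci` and `human` are labels invented here.

```python
import re

# Assumed synonym list for "progress" (taken from S1, may differ from the real search).
PROGRESS = re.compile(
    r"\b(?:progress|advancement|improvement|development)\b", re.IGNORECASE
)
CATEGORIES = {
    "tech": re.compile(
        r"material|industrial|technical|technological|industry|technolog", re.IGNORECASE
    ),
    "sci": re.compile(r"scientific|science|research", re.IGNORECASE),
    "human": re.compile(r"human|spiritual|mankind", re.IGNORECASE),
}


def tag_matches(text, span=30):
    """Keep progress matches with a category word within +/- span characters."""
    results = []
    for m in PROGRESS.finditer(text):
        left = text[max(0, m.start() - span):m.start()]
        right = text[m.end():m.end() + span]
        context = left + " " + right
        cats = sorted(name for name, pat in CATEGORIES.items() if pat.search(context))
        if cats:
            results.append((m.group(0).lower(), cats))
    return results


sentence = (
    "The industrial progress of the nation was praised by all who came "
    "to see it, and human improvement followed."
)
tagged = tag_matches(sentence)
```

A match can fall into more than one category, so the per-category counts do not have to add up to the number of kept matches.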

Data
Progress + tech/sci/human data over time (n = 14,476). General dataset is pretty unbalanced over years here.

Example search results "technology"
N/A
Example search results "science"
N/A
### Topic model results for science, technology, environment
Will add.
### Topic model results for progress?
Should try?
## Pipeline
Practical pipeline (to be formulated).
Word embeddings
1. POS-tag the texts? (script available, done for AUS) [name=Peeter]
2. Timestamping the word we're interested in - progress, technology, science [name=Peeter, Artjom]
3. Train word embeddings on a random sample of the whole corpus, maybe can simply update the trained embeddings from other workflow. Can use diachronic word embeddings instead? [name=Peeter?, Artjom?]
4. Give paragraphs for annotation [name=Peeter]
5. Annotate [name=Laur, Anna-Kati, Aro]
6. Get seed words from the annotations [name=Peeter?, Artjom?]
7. CHECK: query the seed words and ask for annotations [name=Peeter?, Artjom?]
8. Annotations again [name=Laur, Anna-Kati, Aro]
9. Use seed words to build the vectors [name=???]
10. See position of progress, science, technology over time... [name=anyone]
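
The timestamping step could be as simple as appending a decade tag to each target word before training, so that each decade gets its own vector for the word. A minimal sketch, assuming each text comes with a publication year; the `word_decade` tag format is an arbitrary choice here.

```python
import re

TARGETS = ("progress", "technology", "science")
TARGET_RE = re.compile(r"\b(%s)\b" % "|".join(TARGETS), re.IGNORECASE)


def decade_tag(text, year):
    """Replace each target word with a lowercased, decade-tagged token, e.g. progress_1890."""
    decade = year // 10 * 10
    return TARGET_RE.sub(lambda m: f"{m.group(1).lower()}_{decade}", text)


tagged_text = decade_tag("Science and progress march on.", 1897)
```

All other words stay shared across decades, so the tagged targets can be compared within one embedding space.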
Look into OCR correction.
Bigram-trigram clusters
1. IN PARALLEL: Train word embeddings on the corpus, diachronic embeddings.
2. IN PARALLEL: Find bigrams-trigrams that include science/technology, [name=Ruben, Peeter]
3. CHECK: phrase + context [name=Laur, Anna-Kati, Aro]
4. cluster them by distance in the embeddings of single words... [name=Ruben]
5. CHECK: do the clusters make sense [name=Peeter, Ruben, Laur, Anna-Kati, Aro]
6. Evaluate relevance for optimist vs realist for each cluster [name=Peeter, Ruben, Laur, Anna-Kati, Aro]
7. see the timelines [name=anyone]
First search:
technology, science, progress bigrams, trigrams (if there are determiners in the middle...)
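
A first search along these lines might look like the sketch below. It is only a prototype idea: the determiner list and the rule "keep a trigram only when its middle word is a determiner" are assumptions about what we want, and the tokenisation is deliberately crude.

```python
import re
from collections import Counter

TARGETS = {"technology", "science", "progress"}
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}


def target_ngrams(text):
    """Count bigrams containing a target word, plus trigrams of the form
    word + determiner + word where one of the outer words is a target."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i in range(len(tokens)):
        bigram = tokens[i:i + 2]
        if len(bigram) == 2 and TARGETS & set(bigram):
            counts[" ".join(bigram)] += 1
        trigram = tokens[i:i + 3]
        if (
            len(trigram) == 3
            and trigram[1] in DETERMINERS
            and TARGETS & {trigram[0], trigram[2]}
        ):
            counts[" ".join(trigram)] += 1
    return counts


ngrams = target_ngrams("Faith in the progress of modern science")
```

The resulting frequency list is what would then go to the CHECK steps (phrase + context) before clustering.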
Prototype both approaches:
- we do have prototype on word embeddings,
- need prototype on bigrams-trigrams...
Will do
+ add timeline
+ tasks with names
---
OCR correction, 2 workflows using word embeddings:
- https://github.com/mikahama/natas, https://www.aclweb.org/anthology/R19-1051.pdf
- https://dl.acm.org/doi/pdf/10.1145/3078081.3078107
---
TODO
- Will send a heads-up note to the team members doing annotation about what's coming. [name=Peeter]
- Will make doodle to meet in a few weeks. [name=Peeter]
PLANS
- [name=Peeter] Will download the newspapers to TDMStudio servers.
Names and working times
NOT THE ACTUAL SCHEDULE BUT EXPERIMENTING WITH FORMAT (just placeholder dates and tasks)
---
```mermaid
gantt
title NOT THE ACTUAL SCHEDULE BUT EXPERIMENTING WITH FORMAT
excludes Sunday, Saturday
section Diagram Syntax
Completed task :crit, done, des1, 2020-11-06,2020-11-08
Active task :active, des2, 2020-11-09, 3d
Future task : des3, after des2, 5d
Future task2 : crit, des4, after des3, 5d
section Download
Download source files to TDM active :a1, 2020-11-02, 14d
POS-tag the sections :after a1 , 10d
section Critical discourse
Train word2vec :2020-11-02 , 12d
Annotate paragraphs : 5d
section Bigrams-trigrams
Extract n-grams :done, 2020-10-30 , 3d
Find interesting grams : 2020-11-02 , 10d
Build word embeddings : 2020-11-09 , 23d
section OCR-correction
Build workflow :2020-11-02 , 10d
Use workflow on materials : 12d
```
---
Old notes
1. Extract relevant articles
2. POS-tag (spacy?)
- Need to decide if we need OCR-preprocessing.
- Maybe get overview of POS-tag success
3. Get keyword match snippets.
4. Manually annotate some of the snippets.
- Idealist vs Realist
5. Coword analysis towards classifying a larger set of examples.
## Validation
What do we need to verify to make sure it works?