# Developing Artificial Intelligence Policy: Contrasting Transatlantic Perspectives and Sources #
## Review ##
In this review, we only assess the methods, specifically the use of neural embeddings and topic modelling (STM). From this perspective, the paper offers nothing original or rigorous, only out-of-the-box applications of NLP libraries in _R_. It should be noted, however, that this is quite common in the social sciences and humanities and need not be a problem, provided the knowledge derived from the techniques is useful.
## Neural Embeddings ##
Sentence embeddings were extracted for the two corpora using a RoBERTa model via the sentence-transformers library, followed by a classification of the sentences into categories using a trained neural classifier. It is assumed here that the documents are segmented into sentences, but how this is done is not specified. Similarly, RoBERTa models come in different sizes and versions; while the size can be inferred from the embedding dimension, the exact model used should be stated clearly, as this can make a meaningful difference.
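For illustration, a minimal sketch of this step (in Python, since sentence-transformers is a Python library); the checkpoint name is an assumption, precisely the detail the paper should report:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical checkpoint: the paper does not state which RoBERTa-based
# sentence-transformers model was used, so this name is an assumption.
model = SentenceTransformer("all-roberta-large-v1")

sentences = ["Example sentence from the US corpus.",
             "Example sentence from the EU corpus."]
embeddings = model.encode(sentences)  # shape: (n_sentences, 1024) for this model
print(embeddings.shape)
```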
The sentence segmentation results in a new corpus; however, the number of sentences in each class is not reported, leaving the reader with little information about this new corpus. This information is only partly supplied by the naïve baseline model.
Formula (1) describes a two-layer neural network with only one activation function, applied in the last layer. Without a non-linearity between the layers, it collapses into a one-layer (linear) network: $W_0 \cdot W_1 \cdot \vec{x} = W \vec{x}$. I assume this is a mistake and that there is in fact an activation function between the two layers?
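Presumably the intended form is something like the following, where the choice of non-linearity $\sigma$ is my assumption and should be stated by the authors:

$$
f(\vec{x}) = W_1\,\sigma\!\left(W_0\,\vec{x}\right), \qquad \sigma \in \{\mathrm{ReLU}, \tanh, \dots\}
$$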
While the authors report the performance metrics in a boxplot (Figure 4), I would additionally include the mean accuracy and its standard deviation, as exact values cannot be extracted from the boxplot. While Figure 4 is helpful, it should be supported by reported scores. Given that the authors performed ten-fold bootstrapping, the boxplot, as I understand it, aggregates only 10 data points. I therefore suggest adding these points to the plot as well to increase transparency, as the boxplot might hide meaningful information.
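A minimal sketch of what I have in mind, using placeholder accuracy values (the real fold scores would come from the authors' experiments):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder accuracies for the ten bootstrap folds (not the authors' results).
accuracies = np.array([0.97, 0.98, 0.96, 0.99, 0.97, 0.98, 0.97, 0.96, 0.99, 0.98])

print(f"mean = {accuracies.mean():.3f}, sd = {accuracies.std(ddof=1):.3f}")

fig, ax = plt.subplots()
ax.boxplot(accuracies, positions=[1])
# Overlay the individual fold scores with slight horizontal jitter so all
# 10 points remain visible alongside the boxplot summary.
ax.scatter(np.random.normal(1, 0.02, size=len(accuracies)), accuracies, alpha=0.7)
ax.set_ylabel("Accuracy")
plt.show()
```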
The authors split a small number of documents into many sentences per document. While this is reasonable for a corpus of many documents, it is unclear whether the model is in fact predicting which document a sentence belongs to rather than distinguishing between US and EU rhetoric. I am therefore afraid that the near-perfect accuracy reflects that the sentences stem from relatively few documents. Training a classifier to predict the origin document from the sentence embedding would reveal whether this is the case.
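A sketch of this control, assuming `embeddings` (as above) and per-sentence `doc_ids` arrays, both standing in for the authors' data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# embeddings: sentence embeddings as computed above; doc_ids: the source
# document of each sentence (both placeholders for the authors' data).
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, embeddings, doc_ids, cv=5)
print(f"document-origin accuracy: {scores.mean():.2f}")
# If this is far above chance (1 / number of documents), the US-vs-EU classifier
# may be exploiting document identity rather than rhetorical differences.
```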
The use of the word "probing" might be misleading when predicting the document category from a sentence embedding; I suggest rephrasing section 4.2.1 to avoid it. Probing in the context of NLP, although often used inconsistently, typically refers to exploring the underlying representation of the representation learner (i.e., the language model). In this case the sentence embedding is simply used to differentiate between categories, and I do not believe the use of the term "probing" is justified in this context.
Hyperparameters for the training of the neural classifier, for instance the learning rate, are missing entirely, which does not allow for replication. I suggest supplying this information in an appendix or providing a code base that can be used for replication.
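As an illustration of the minimum that a replication would need, something like the following could go in the appendix; every value here is a placeholder, not taken from the paper:

```python
# Hypothetical training configuration for the neural classifier.
# None of these values are reported in the paper; they only show what to document.
hyperparameters = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "batch_size": 32,
    "epochs": 20,
    "hidden_size": 256,
    "activation": "ReLU",
    "random_seed": 42,
}
```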
## Topic Modeling ##
The section reflects an out-of-the-box application of STM, which is neither an original nor a rigorous application of the technique.
The primary issue is the rationale for using topic modeling on a very small data set. The authors themselves cite Blei's 2012 paper, which states that the technique is useful for 'large and otherwise unstructured collection[s] of documents.' If the goal is 'just' to automatically discriminate between a few texts, why not use class-based TF-IDF scoring or word co-occurrences (see the sketch below)? It is true that the number of tokens can be used to justify the model, but why do so when simpler and better-suited techniques exist?
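A sketch of the class-based TF-IDF alternative, in Python for illustration (the paper's pipeline is in _R_); `us_documents` and `eu_documents` are placeholders for the two corpora:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Class-based TF-IDF: concatenate all documents per class and score terms
# across the two resulting "class documents".
class_docs = {
    "US": " ".join(us_documents),   # us_documents / eu_documents are placeholders
    "EU": " ".join(eu_documents),
}
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(class_docs.values())
terms = np.array(vectorizer.get_feature_names_out())

# Print the most class-distinctive terms for each side.
for label, row in zip(class_docs.keys(), tfidf.toarray()):
    top = terms[row.argsort()[::-1][:15]]
    print(label, ", ".join(top))
```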
How do the authors deal with the between-nation difference in document length? They should either justify why it has no effect or normalize for it. This brings us to the issue of very long documents in an extremely small data set. Models of very long documents become generic: a BoW model takes all co-occurrences within a document into account (word pairs distributed over the first and last page are as relevant as word pairs within the same paragraph). Normally I would suggest that the ideal document size is a couple of paragraphs (to realistically encode co-occurrences as we perceive them). More importantly, for a small data set with long documents I would suggest segmenting the texts into smaller document units of normalized length, as sketched below.
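A minimal segmentation sketch; the chunk size (in whitespace tokens) is an assumption and could instead be chosen to approximate a couple of paragraphs:

```python
def segment(text, chunk_size=200):
    """Split a long document into pseudo-documents of roughly equal length."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

# Each original document becomes several length-normalized units
# (`documents` is a placeholder for the authors' corpus).
segments = [chunk for doc in documents for chunk in segment(doc)]
```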
The pipeline is mentioned, but not justified (with the exception of pruning). From a linguistic perspective, I would (almost) always suggest more advanced techniques for language normalization (POS tagging to extract nouns, lemmatization) if the aim is to capture lexical semantics. This is not a trivial issue, because topic modeling is known to be highly sensitive to preprocessing (Schofield et al., 2017). A rigorous application would have tested the effects of normalization to justify the chosen pipeline.
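For concreteness, a sketch of the kind of normalization I mean, using spaCy for illustration; the model name and the noun-only filter are my assumptions, not the authors' pipeline:

```python
import spacy

# Noun extraction plus lemmatization as one candidate normalization step.
nlp = spacy.load("en_core_web_sm")

def normalize(text):
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if tok.pos_ in {"NOUN", "PROPN"} and not tok.is_stop]
```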
On a more technical note, I do not see the advantage of the procedure used for parameter optimization. In particular, I worry about coherence (the authors do not report which coherence measure they use). While there are several applications of topic models, this particular application favours interpretation of topic content, and coherence, in contrast to log-likelihood, is meant to maximize human interpretability (Röder et al., 2015; Chang et al., 2009). Coherence drops dramatically from $K = 2$ to $K = 4$; the reason for this is the number of topics per document. If you have few documents, you only have a couple of samples for representing the topics, and learning stable distributions across those topics becomes hard; smaller $K$ values will therefore yield more 'coherent' topics. My suggestion here is the same as before: segment the texts to increase the number of documents and obtain more stable document-topic representations. A sketch of reporting a named coherence measure per $K$ follows below.
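An illustrative sketch only: the paper uses STM in _R_, so this gensim/LDA example merely shows how a named coherence measure (here $C_V$) could be reported for each $K$; `documents` and `normalize` are placeholders from the sketches above:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [normalize(doc) for doc in documents]   # tokenized documents (placeholder)
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Report the coherence measure by name, for every candidate K.
for k in range(2, 11):
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    print(k, round(cm.get_coherence(), 3))
```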
A related issue: how were the $\alpha$ and $\beta$ parameters (i.e., document-topic and topic-word density) optimized for the models?
---
- Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading Tea Leaves: How Humans Interpret Topic Models. Advances in Neural Information Processing Systems, 22.
- Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 399–408.
- Schofield, A., Magnusson, M., Thompson, L., & Mimno, D. (2017). Pre-Processing for Latent Dirichlet Allocation.