# Architectures
## Overall thoughts
- attention
- LSTM, GRU
- CNN
- Recurrent NN
## Non-classical NN-based studies
### CNN:
- 40, 2, 19, 56, 4, 67, 15, 20
### RNN:
- 90 (not included in SLR now)
#### LSTM/biLSTM:
- 56, 20, 31
#### GRU:
- 2, 56, 4
### Other:
- 6 (Hierarchical), 8 (Attention, paths), 61 (BERT)
<!--
RF - 8
SVM - 7
NB - 3
Others - 18
NN - 7+8+3+(6) - 24
Classical NN - 7
CNN - 8
Hierarchical - 1
Attention, path - 1
BERT - 1
LSTM + GRU + RNN - 6
Classical:
32, 39, 18, 38, 3, 29, 87
-->
| Year | Study | Algorithm | NN type |
| ---- | ----- |:--------- | ---------------- |
| 2021 | 40 | NN | CNN |
| 2021 | 6 | NN | Hierarchical |
| 2020 | 2 | NN | GRU, CNN |
| 2020 | 55 | RF | - |
| 2020 | 32 | NN | Classical |
| 2020 | 19 | NN | CNN |
| 2020 | 8 | NN | Attention, Paths |
| 2020 | 90 | NN | RNN |
| 2019 | 17 | RF | - |
| 2019 | 39 | NN | Classical |
| 2019 | 39 | SVM | - |
| 2019 | 56 | NN | CNN |
| 2019 | 56 | NN | LSTM, RNN |
| 2019 | 56 | NN | GRU |
| 2019 | 4 | NN | CNN |
| 2019 | 4 | NN | GRU |
| 2019 | 18 | NN | Classical |
| 2019 | 75 | Other | - |
| 2019 | 66 | SVM | - |
| 2019 | 66 | RF | - |
| 2019 | 64 | RF | - |
| 2019 | 62 | NB | - |
| 2019 | 62 | RF | - |
| 2019 | 61 | NN | BERT |
| 2019 | 67 | NN | CNN |
| 2019 | 68 | RF | - |
| 2019 | 15 | NN | CNN |
| 2018 | 20 | NN | LSTM |
| 2018 | 20 | NN | CNN |
| 2018 | 38 | NN | Classical |
| 2018 | 69 | RF | - |
| 2017 | 37 | SVM | - |
| 2017 | 3 | NN | Classical |
| 2017 | 73 | SVM | - |
| 2017 | 73 | NB | - |
| 2017 | 73 | KNN | - |
| 2017 | 31 | NN | LSTM |
| 2015 | 58 | Other | - |
| 2015 | 13 | SVM | - |
| 2015 | 7 | RF | - |
| 2014 | 25 | Other | - |
| 2014 | 28 | Other | - |
| 2014 | 36 | Other | - |
| 2014 | 36 | SVM | - |
| 2014 | 78 | Other | - |
| 2013 | 29 | NN | Classical |
| 2012 | 57 | SVM | - |
| 2009 | 26 | Other | - |
| 2009 | 41 | Other | - |
| 2009 | 81 | Other | - |
| 2008 | 51 | Other | - |
| 2007 | 27 | Other | - |
| 2007 | 84 | Other | - |
| 2007 | 85 | Other | - |
| 2006 | 52 | Other | - |
| 2006 | 30 | Other | - |
| 2006 | 86 | Other | - |
| 2005 | 53 | Other | - |
| 2004 | 5 | CDA | - |
| 2000 | 87 | NN | Classical |
## 2021
### 40 [^40]


### 6 [^6]
> We use two encoders, one for the token level and one for the function level...

#### Token level encoder
> The input for the entire network is a three-dimensional matrix: package, functions, and tokens.
> ...
> The first layer is an embedding-layer mapping tokens to a low dimensional vector. The token-embeddings are learned as part of the authorship training, similarly to standard embeddings. Unlike the large lexicons in NLP tasks, source code vocabulary usually contains only a few dozen keywords and a few hundred API calls, as is shown in Section 6.3; thus, we can encode each token $t_{ij}$ with low dimensional vectors using much less data than commonly required in NLP tasks.
> Once the token-embeddings are learned, we individually average the vectors of tokens within each function without losing meaningful ones. This averaging technique is common in the representation of textual units (sentences, paragraphs, documents) in NLP tasks.
> For a sequence of token-embedding $t_{ij}$ in a function $f_i$, we apply a composition function $g$, averaging the token-embeddings that are different than zero (we apply dropout regularization between layers). $K_i$ represents the total number of tokens $t_{ij}$, while $K_{i}'$ represents the number of tokens that are different than zero in the ith function.
> $$g(t_{ij} \in f_i)= \sum_{j=1,t_{ij}\ne0}^{K_i} \frac{t_{ij}}{ K_{i}'}$$
The output of $g$ is an embedding vector $f_i$ representing the corresponding function.
Translation:
1. embedding layer
2. averaging
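A rough tf.keras sketch of this token-level step (not the authors' code); vocabulary size, embedding width, and sequence length below are placeholders. Masked average pooling stands in for the composition function $g$ over non-zero tokens:
```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMB_DIM, MAX_TOKENS = 500, 32, 200   # placeholders, not the paper's values

token_ids = layers.Input(shape=(MAX_TOKENS,), dtype="int32")        # tokens of one function, zero-padded
embedded = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(token_ids)
# The mask excludes the zero padding, so the pooling averages only real tokens,
# mirroring g(t_ij in f_i) = sum_{t_ij != 0} t_ij / K_i'.
function_vector = layers.GlobalAveragePooling1D()(embedded)
token_encoder = tf.keras.Model(token_ids, function_vector)
```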
#### Function level encoder
Translation:
1. attention layer
2. softmax
3. aggregation
4. softmax
> The main component in the function encoder is the attention layer. Attention mechanisms were proven useful in handling (memorizing) long sentences in machine translation [36], as well as in many other NLP tasks. Here, we simplify the additive attention layers and define the scoring annotation without the output, since it does not serve as an alignment for the input and output, but only as an importance mechanism over the functions [13]. Since not all functions contribute equally to the encoding of a package, we apply the attention mechanism using a vector of importance weights. We define the attention mechanism in the following way: first, we feed the function $f_i$ vector through a dense fully-connected layer to get $u_i$ — a hidden representation vector of $f_i$,
$$u_i = \tanh(W_f f_i + b_f)$$
We then compute $a_i$ — the normalized weight, by multiplying $u_i$ and the context vector importance $u_f$. The context vector $u_f$ is randomly initialized and is learned during the training process. A softmax function is then used to get $a_i$ — the normalized importance weight.
> $$a_i = \frac{exp(u^T_i u_f )}{\sum_i exp(u^T_i u_f )}$$
The function vectors $f_i$ are weighted by $a_i$ and summed, yielding a package vector that encodes the information of all functions. We denote this package vector as $v_p$.
$$v_p = \sum_i a_i f_i$$
Finally, we feed the package encoded vector $v_p$ to a fully-connected layer and activate softmax function to get the distribution $p$ for the classes.
$$p = softmax(W_p v_p + b_p)$$
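Sketch of the function-level encoder under the equations above, again assuming tf.keras; `N_FUNCTIONS`, `D`, and `N_AUTHORS` are placeholders, and the single-unit dense layer plays the role of the learned context vector $u_f$:
```python
import tensorflow as tf
from tensorflow.keras import layers

N_FUNCTIONS, D, N_AUTHORS = 50, 32, 10                    # placeholders, not the paper's values

f = layers.Input(shape=(N_FUNCTIONS, D))                  # function embeddings from the token-level encoder
u = layers.Dense(D, activation="tanh")(f)                 # u_i = tanh(W_f f_i + b_f)
scores = layers.Dense(1, use_bias=False)(u)               # u_i^T u_f (kernel acts as the context vector u_f)
a = layers.Softmax(axis=1)(scores)                        # a_i, normalized importance weights
v_p = layers.Flatten()(layers.Dot(axes=1)([a, f]))        # v_p = sum_i a_i f_i
p = layers.Dense(N_AUTHORS, activation="softmax")(v_p)    # p = softmax(W_p v_p + b_p)

package_encoder = tf.keras.Model(f, p)
```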
## 2020
### 2 [^2]

### 55 [^55] (keystroke)
Random forest
### 32 [^32]

> The deep learning algorithm is optimized with fine-tune configuration in the context of drop out layer, activation and loss function, optimizer method, and learning error rate. The deep learning model is designed with dense layers with 150, 100, and 50 neurons, respectively. The drop out layer is also configured with each dense layer to solve the overfitting problem...
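A hedged tf.keras sketch of the described stack; only the 150/100/50 dense widths and the per-layer dropout come from the quote, while the input size, dropout rate, optimizer, and author count are assumptions:
```python
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES, N_AUTHORS = 100, 20          # placeholders
model = keras.Sequential([
    layers.Input(shape=(N_FEATURES,)),
    layers.Dense(150, activation="relu"),
    layers.Dropout(0.2),                 # "drop out layer ... configured with each dense layer"; rate assumed
    layers.Dense(100, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(50, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(N_AUTHORS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```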
### 19 [^19] (baseline arch)

### 8 [^8] (Kovalenko)

- plus Random forest classifier (parallel algorithm with competitive results)
> Figure 2 shows the architecture of the network. The network takes a bag of path-contexts as an input. The number of path-contexts, even with restrictions on path length and width in place, might be tremendous. To speed up the computations, at each training iteration we only take up to 500 random path-contexts for each sample.
Further, we transform path-contexts into numerical form that can be passed to the network. We embed a path and both tokens into $R^d$ and concatenate these vectors to form a 3d-dimensional context vector. Embeddings for tokens and paths are matrices of size $R^d × T$ and $R^d × P$ , respectively. At first, the matrices are random, their values adjust during the network training process. The size of the embeddings vector might be set separately for paths and tokens, but, as these numbers are of roughly the same order of magnitude, it is easier to tune one hyperparameter instead of two, so we set both of them to $d$.
Then, a fully-connected layer with $tanh$ activation function transforms raw path-context vectors of size 3d into context vectors of size $d$. This step is not obligatory, but it speeds up convergence of the model [32]. After that, a piece of code is represented by a set of d-dimensional vectors corresponding to path-contexts.
At the next step, we aggregate vectors of individual path-contexts into a representation of a code snippet through an attention mechanism [36]. We use a simple version of attention, represented by a single trainable vector $a_{att}$ . For the path-context vectors $ctx_k$ , the attention weights $w_k$ are computed as follows:
$$att_k = ctx_k \cdot a_{att}$$
$$w_k = \frac{e^{att_k}}{\sum_{j=1}^{|ctx|} e^{att_j}}$$
Weights for the context vectors are a softmax of attention values. Then, the representation of the code snippet is:
$$r = \sum_{k=1}^{|ctx|} w_k \cdot ctx_k$$
Finally, a fully-connected layer with softmax activation outputs author predictions.
The number of the PbNN’s parameters is $O((T + P )d)$. Since the value of $(T + P )$ is usually large, tens of thousands to millions, the number of required samples for the model to train is also significant.
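A hedged tf.keras sketch of the PbNN described above; `T`, `P`, `D`, the number of sampled path-contexts, and the author count are placeholders, and padding/masking of missing path-contexts is omitted:
```python
import tensorflow as tf
from tensorflow.keras import layers

T, P, D, N_CTX, N_AUTHORS = 10_000, 20_000, 128, 500, 50     # placeholders

start_tok = layers.Input(shape=(N_CTX,), dtype="int32")      # start token ids of each path-context
path_id   = layers.Input(shape=(N_CTX,), dtype="int32")      # path ids
end_tok   = layers.Input(shape=(N_CTX,), dtype="int32")      # end token ids

tok_emb  = layers.Embedding(T, D)                            # shared token embedding matrix
path_emb = layers.Embedding(P, D)                            # path embedding matrix

# 3d-dimensional raw context vectors -> d-dimensional context vectors
ctx = layers.Concatenate()([tok_emb(start_tok), path_emb(path_id), tok_emb(end_tok)])
ctx = layers.Dense(D, activation="tanh")(ctx)                # fully-connected layer with tanh

att = layers.Dense(1, use_bias=False)(ctx)                   # att_k = ctx_k . a_att (kernel is a_att)
w = layers.Softmax(axis=1)(att)                              # w_k = softmax over attention values
r = layers.Flatten()(layers.Dot(axes=1)([w, ctx]))           # r = sum_k w_k ctx_k
out = layers.Dense(N_AUTHORS, activation="softmax")(r)       # author prediction

pbnn = tf.keras.Model([start_tok, path_id, end_tok], out)
```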
### multi-x - ==TODO==: read in depth

## 2019
### Code Authorship Attribution: Methods and Challenges ==TODO==
### 17 [^17] Git blame who. Random forest
### 39 [^39] manual features (simple NN, SVM)

### 56 [^56]
> Within this study, it was decided to model the following CRNNs:
> - CNN-SimpleRNN;
> - SeparableCNN-SimpleRNN;
> - CNN-GRU/LSTM;
> - CNN-BiGRU/BiLSTM.
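One of the listed variants (CNN-BiGRU) as a minimal tf.keras sketch; the hyperparameters below are assumptions, and the SeparableCNN variants would swap `Conv1D` for `SeparableConv1D`:
```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN, N_AUTHORS = 5000, 400, 20   # placeholders
model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, 64),
    layers.Conv1D(64, kernel_size=3, activation="relu"),   # convolutional front end
    layers.MaxPooling1D(pool_size=2),
    layers.Bidirectional(layers.GRU(64)),                  # recurrent back end (BiGRU variant)
    layers.Dense(N_AUTHORS, activation="softmax"),
])
```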
### 4 [^4]

### 18 [^18]
> We have configured seven layers to train the features with 100, 80, 80, 60, 60, 40 neurons, respectively. The 7th layer is configured for output variable, i.e., programmers. First is the input layer, then five are hidden layers, and last is the output layer. The Relu activation function is used in input and hidden layers. The softmax function is used for the target variable. The dropout layer is used to fine-tune the deep learning algorithm to remove the overfitting problem. There are 750 parameters trained on layer 1, 15100 parameters on layer 2, 5050 on layer 3, and 5049 on layer 4. Total 25,949 parameters are trained for the designed experiment. For better accuracy, the deep learning algorithm is optimized with fine-tune configuration in the context of drop out layer, activation and loss function, optimizer method, and learning error rate. The softmax activation function is also called soft-argmax or normalized exponential function, which is used in the output layer to handle multi-class problems [42]. It takes a vector of K real numbers and transforms it into the normalized probability distribution of K probabilities. The output is proportional to the input of K numbers. Some input may not be in proper distribution of numbers. Softmax is applied to convert K numbers in a range of [0,1]. It is often used in multi-class neural networks to convert the non-normalized output to a probability distribution over predicted classes...
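Hedged sketch of that stack (six dense widths plus a softmax output over programmers); the input dimension, dropout rate, and class count are assumptions, so the parameter counts will not reproduce the quoted 25,949:
```python
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES, N_PROGRAMMERS = 150, 9     # placeholders
model = keras.Sequential([
    layers.Input(shape=(N_FEATURES,)),
    layers.Dense(100, activation="relu"),   # input layer
    layers.Dropout(0.2),                    # dropout "to remove the overfitting problem"; rate assumed
    layers.Dense(80, activation="relu"),    # hidden layers
    layers.Dense(80, activation="relu"),
    layers.Dense(60, activation="relu"),
    layers.Dense(60, activation="relu"),
    layers.Dense(40, activation="relu"),
    layers.Dense(N_PROGRAMMERS, activation="softmax"),   # output over programmers
])
```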
### 75 [^75] - not a DNN, but still a distance-based approach
### 66 [^66] - no arch (SVM + RF + ...)
### 60 [^60] - no arch
### 65 [^65] - no arch
### 64 [^64] - seems to be RF

### 62 [^62] - RF + NB
### 61 [^61] - BERT!

### 67 [^67]

### 68 [^68] - RF
### 15 [^15] - understood
Two approaches, each of which is implemented in two archs:
- Stacked / Concatenated: Vertical or Horizontal location of the convolutions
- with/without embedding layer
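One plausible reading of the two layouts (my interpretation, not the paper's exact configuration): "stacked" chains convolutions sequentially, "concatenated" runs them in parallel and concatenates their pooled outputs; kernel sizes, filter counts, and the embedding input are assumptions:
```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN, N_AUTHORS = 5000, 400, 20     # placeholders

inp = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(VOCAB_SIZE, 64)(inp)          # the "without embedding" variants would feed feature vectors directly

# Stacked: convolutions applied one after another (vertical layout)
stacked = layers.Conv1D(64, 3, activation="relu")(x)
stacked = layers.Conv1D(64, 3, activation="relu")(stacked)
stacked = layers.GlobalMaxPooling1D()(stacked)

# Concatenated: parallel convolutions with different kernel sizes, pooled outputs concatenated (horizontal layout)
branches = [layers.GlobalMaxPooling1D()(layers.Conv1D(64, k, activation="relu")(x)) for k in (3, 4, 5)]
concatenated = layers.Concatenate()(branches)

stacked_model = keras.Model(inp, layers.Dense(N_AUTHORS, activation="softmax")(stacked))
concatenated_model = keras.Model(inp, layers.Dense(N_AUTHORS, activation="softmax")(concatenated))
```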




## 2018
### 20 - LSTM
0. Embedding layer
1. 1D convolution (3 tokens, 32 channels)
2. Max-Pooling (size 2)
3. Dropout (30%)
4. LSTM (bidirectional) - 64 units
5. Dropout (30%)
6. Dense
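Direct tf.keras reading of the layer list above; vocabulary size, sequence length, embedding width, and class count are assumptions:
```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN, EMB_DIM, N_AUTHORS = 5000, 400, 64, 20   # placeholders
model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMB_DIM),                 # 0. embedding layer
    layers.Conv1D(32, kernel_size=3, activation="relu"),   # 1. 1D convolution (3 tokens, 32 channels)
    layers.MaxPooling1D(pool_size=2),                      # 2. max pooling (size 2)
    layers.Dropout(0.3),                                   # 3. dropout 30%
    layers.Bidirectional(layers.LSTM(64)),                 # 4. bidirectional LSTM, 64 units
    layers.Dropout(0.3),                                   # 5. dropout 30%
    layers.Dense(N_AUTHORS, activation="softmax"),         # 6. dense output
])
```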
### 38 - classical

### 69 - RF
## 2017
### 37 - SVM
### 3 - Classical NN with PSO (particle swarm optimization, whatever it is)
### 73 - SVM, KNN, Naive Bayes
### 31 - LSTM (+biLSTM)
traverse the tree with a DFS algorithm
1. Embedding layer
2. Subtree layer = LSTM/biLSTM - ==LOOK AT==
3. Softmax
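Minimal tf.keras sketch under this reading: the AST is linearized by the DFS traversal into a sequence of node ids, embedded, and fed to an LSTM/biLSTM; sizes are placeholders and the exact subtree handling still needs checking (see ==LOOK AT== above):
```python
from tensorflow import keras
from tensorflow.keras import layers

N_NODE_TYPES, SEQ_LEN, EMB_DIM, N_AUTHORS = 300, 500, 64, 10   # placeholders
model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(N_NODE_TYPES, EMB_DIM, mask_zero=True),   # 1. embedding layer over DFS-ordered AST nodes
    layers.Bidirectional(layers.LSTM(64)),                     # 2. subtree layer (LSTM / biLSTM)
    layers.Dense(N_AUTHORS, activation="softmax"),             # 3. softmax
])
```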
## 2015
### 58 - ~clustering~ algorithms
### 13 - LDA, SVM...
### 7 - RF
## 2014
### 25 - SCAP, querying
### 28 - SCAP, Bayes
### 36 - Okapi, SVM (Burrows)
### 78 - not NN
## 2013
### 29 - NN, classical (?) - combinations of the Restricted Boltzmann Machines (RBMs)
## 2012
### 57 - SVM and similarity measures
## 2009
### 26 - MRR, MAP scores (similarity measures)
### 41 - Burrows, Zettair search engine (~similarity measures~)
### 81 - MRR, MAP (similarity measures), Okapi ...
## 2008
### 51 - SCAP, querying
## 2007
### 27 - Burrows, similarity scores
### 84 - distance measures and ranking
### 85 - VFI (voting feature intervals) classifier
## 2006
### 52 - SCAP...
### 30 - SCAP...
### 86 - SCAP...
## 2005
### 53 - SCAP...
## 2004
### 5 - CDA (Canonical Discriminant Analysis), SDA (Stepwise DA)
## 2000
### 87 - NN (!26-9-7), classical + MDA, CBR, etc.
## Want to implement:
- 31 (LSTM)
- 20 (LSTM)
- 15 (CNN)
- 4 (CNN)
- 8 - trees
- 19 (CNN)
- 6 - hierarchical (token + function encoders with attention)
- 40 - CNN
## References:
[^15]: Mohammed Abuhamad et al. “Code authorship identification using convolutional neural networks”. In: Future Generation Computer Systems 95 (2019), pp. 104–115.
[^40]: Pranali Bora et al. “ICodeNet – A Hierarchical Neural Network Approach for Source Code Author Identification”. In: arXiv preprint arXiv:2102.00230 (2021).
[^6]: Roni Mateless, Oren Tsur, and Robert Moskovitch. “Pkg2Vec: Hierarchical package embedding for code authorship attribution”. In: Future Generation Computer Systems 116 (2021), pp. 49–60.
[^2]: Anna Kurtukova, Aleksandr Romanov, and Alexander Shelupanov. “Source Code Authorship Identification Using Deep Neural Networks”. In: Symmetry 12.12 (2020), p. 2044.
[^55]: Jeongmin Byun, Jungkook Park, and Alice Oh. “Detecting Contract Cheaters in Online Programming Classes with Keystroke Dynamics”. In: Proceedings of the Seventh ACM Conference on Learning@Scale. 2020, pp. 273–276.
[^32]: Farhan Ullah, Sohail Jabbar, and Fadi Al-Turjman. “Programmers’ de-anonymization using a hybrid approach of abstract syntax tree and deep learning”. In: Technological Forecasting and Social Change 159 (2020), p. 120186.
[^19]: Sarim Zafar et al. “Language and Obfuscation Oblivious Source Code Authorship Attribution”. In: IEEE Access 8 (2020), pp. 197581–197596.
[^8]: Egor Bogomolov et al. “Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering”. In: arXiv preprint arXiv:2001.11593 (2020).
[^17]: Edwin Dauber et al. “Git blame who?: Stylistic authorship attribution of small, incomplete source code fragments”. In: Proceedings on Privacy Enhancing Technologies 2019.3 (2019), pp. 389–408.
[^39]: Parvez Mahbub, Naz Zarreen Oishie, and S. M. Rafizul Haque. “Authorship identification of source code segments written by multiple authors using stacking ensemble method”. In: 2019 22nd International Conference on Computer and Information Technology (ICCIT). IEEE. 2019, pp. 1–6.
[^56]: A. V. Kurtukova and A. S. Romanov. “Modeling a neural network architecture for the source code author identification task” (in Russian). In: Proceedings of TUSUR University 22.3 (2019).
[^4]: Anna Vladimirovna Kurtukova and Aleksandr Sergeevich Romanov. “Identification author of source code by machine learning methods”. In: Trudy SPIIRAN 18.3 (2019), pp. 742–766.
[^18]: Farhan Ullah et al. “Source code authorship attribution using hybrid approach of program dependence graph and deep learning model”. In: IEEE Access 7 (2019), pp. 141987–141999.
[^75]: Sergey Gorshkov et al. “Source Code Authorship Identification Using Tokenization and Boosting Algorithms”. In: International Conference on Modern Information Technology and IT Education. Springer. 2018, pp. 295–308.
[^66]: Chanchal Suman et al. “Source Code Authorship Attribution using Stacked classifier.” In: FIRE (Working Notes). 2020, pp. 732–737.
[^60]: José Antonio García-Díaz and Rafael Valencia-García. “UMUTeam at AI-SOCO’2020: Source Code Authorship Identification based on Character N-Grams and Author’s Traits.” In: FIRE (Working Notes). 2020, pp. 717–726.
[^65]: Yves Bestgen. “Boosting a kNN Classifier by improving Feature Extraction for Authorship Identification of Source Code.” In: FIRE (Working Notes). 2020, pp. 705–712.
[^64]: Yunpeng Yang et al. “N-gram-based Authorship Identification of Source Code.” In: FIRE (Working Notes). 2020, pp. 694–698.
[^62]: Nitin Nikamanth Appiah Balaji and B. Bharathi. “SSNCSENLP@ Authorship Identification of SOurce COde (AI-SOCO) 2020.” In: FIRE (Working Notes). 2020, pp. 746–750.
[^61]: Mutaz Bni Younes and Nour Al-Khdour. “Team Alexa at Authorship Identification of SOurce Code (AI-SOCO).” In: FIRE (Working Notes). 2020, pp. 699–704.
[^67]: Asrita Venkata Mandalam and Abhishek. “Embedding-based Authorship Identification of Source Code.” In: FIRE (Working Notes). 2020, pp. 727–731.
[^68]: Daniel Watson. “Source Code Stylometry and Authorship Attribution for Open Source”. MA thesis. University of Waterloo, 2019.