<!-- .slide: data-background="#cd0" -->
## A new approach to analyzing AES data with deep learning
SMILES 2020
---
<!-- .slide: data-background="#cd0" -->
PHI 680 Scanning Auger Microprobe

Note:
Auger electron spectroscopy (AES) is a common analytical technique used specifically in the study of surfaces and, more generally, in the area of materials science. Underlying the spectroscopic technique is the Auger effect, as it has come to be called, which is based on the analysis of energetic electrons emitted from an excited atom after a series of internal relaxation events.
---
<!-- .slide: data-background="#cd0" -->

Note:
The Auger effect is an electronic process at the heart of AES resulting from the inter- and intrastate transitions of electrons in an excited atom. When an atom is probed by an external mechanism, such as a photon or a beam of electrons with energies in the range of several eV to 50 keV, a core-state electron can be removed, leaving behind a hole. As this is an unstable state, the core hole can be filled by an outer-shell electron, whereby the electron moving to the lower energy level loses an amount of energy equal to the difference in orbital energies. The transition energy can be coupled to a second outer-shell electron, which will be emitted from the atom if the transferred energy is greater than the orbital binding energy.
---
<!-- .slide: data-background="#cd0" -->
AES Spectrum
X-axis: electron energy (eV); Y-axis: intensity (counts per second, c/s)

Note:
The approach is also applicable to X-ray photoelectron spectroscopy,
secondary ion mass spectrometry,
and the majority of surface analysis techniques.
---
<!-- .slide: data-background="#cd0" -->
# Data Collection
----
* MULTIPAK Database
* MISIS Nanotechnology Lab
---
<!-- .slide: data-background="#cd0" -->
Peak Annotation Tools

---
<!-- .slide: data-background="#cd0" -->
#### Dataset is published at
abdalazizrashid.github.io/open-aes

---
<!-- .slide: data-background="#cd0" -->
## Data Augmentation
----
$$
\begin{aligned}
i_1, \ldots, i_n &\sim \mathcal{U} \{1, \ldots, K\}\\
(w_1, \ldots, w_n) &\sim \mathrm{Dir}(\alpha_1 = \cdots = \alpha_n = 1)\\
S_w &= \sum_{k=1}^{n}{w_k S_{i_k}}
\end{aligned}
$$
Note:
Data Augmentation
The main problem we faced during this research was the lack of data. For pure elements the spectra were available, but for compounds the labels were not very accurate. This is because MULTIPAK does not support batch processing of samples, and manually labelling the data by opening each file would be time-consuming and infeasible. We wrote a script that does batch processing by detecting peaks with the peak-detection algorithm available in SciPy. We then compared the detected peaks with the reference peaks of the elements by measuring the Euclidean distance between the peak location in the signal and the tabulated peak energy of each element. The closest element is chosen. This method provides some labels, but it is not very accurate, mainly because many elements have very close peak energies, which makes it hard to determine the actual element without human experience to make the correct judgement.
All of this boiled down to the fact that we needed to augment the data, so we mixed spectra with weights drawn from a Dirichlet distribution. In the formula above, $i_1, \ldots, i_n$ index the elements we want to mix, $K$ is the total number of elements, $w_k$ is the weight (concentration) of each element, and $S_w$ is the resulting non-physical compound spectrum.
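As a rough illustration of both steps, here is a minimal sketch assuming NumPy/SciPy arrays; the `prominence` threshold, the reference-peak dictionary layout, and the `n_mix` value are illustrative assumptions, not the exact script used in the study.

```python
import numpy as np
from scipy.signal import find_peaks

def label_spectrum(energy, counts, reference_peaks, prominence=50):
    """Tentatively label detected peaks by the nearest tabulated element energy.

    energy, counts  : 1-D arrays of the measured spectrum
    reference_peaks : dict mapping element symbol -> known Auger peak energy (eV)
    """
    energy = np.asarray(energy, dtype=float)
    peak_idx, _ = find_peaks(np.asarray(counts, dtype=float), prominence=prominence)
    labels = []
    for e in energy[peak_idx]:
        # In 1-D the Euclidean distance is just the absolute energy difference.
        element = min(reference_peaks, key=lambda el: abs(reference_peaks[el] - e))
        labels.append((float(e), element))
    return labels

def augment_spectrum(spectra, n_mix=3, rng=None):
    """Mix n_mix randomly chosen element spectra with Dirichlet weights.

    spectra : array of shape (K, L) holding K reference spectra of length L.
    Returns the synthetic compound S_w together with the chosen indices and weights.
    """
    if rng is None:
        rng = np.random.default_rng()
    spectra = np.asarray(spectra, dtype=float)
    K = spectra.shape[0]
    idx = rng.integers(0, K, size=n_mix)            # i_1, ..., i_n ~ U{1, ..., K}
    w = rng.dirichlet(np.ones(n_mix))               # (w_1, ..., w_n) ~ Dir(1, ..., 1)
    s_w = (w[:, None] * spectra[idx]).sum(axis=0)   # S_w = sum_k w_k * S_{i_k}
    return s_w, idx, w
```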
---
<!-- .slide: data-background="#cd0" -->
# Data Representation
---
<!-- .slide: data-background="#cd0" -->
Using Recurrence Plots (RP)
$$
\mathcal{R}(i, j) =
\begin{cases}
1 & \text{if $\vert\vert\vec{x}(i) - \vec{x}(j)\vert\vert \leq \epsilon$} \\
0 & \text{otherwise,}
\end{cases}
$$

Note:
> In descriptive statistics and chaos theory, a recurrence plot (RP) is a plot showing, for each moment $i$ in time, the times $j$ at which a phase-space trajectory visits roughly the same area in the phase space as at time $i$.
>
> In dynamical system theory, a phase space is a space in which all possible states of a system are represented, with each possible state corresponding to one unique point in the phase space. For mechanical systems, the phase space usually consists of all possible values of position and momentum variables. The concept of phase space was developed in the late 19th century by Ludwig Boltzmann, Henri Poincaré, and Josiah Willard Gibbs
>
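A small illustrative sketch: the recurrence matrix for a 1-D spectrum follows directly from the definition above (the threshold `eps` is an assumed value).

```python
import numpy as np

def recurrence_plot(x, eps=0.1):
    """Binary recurrence matrix R(i, j) = 1 if ||x(i) - x(j)|| <= eps, else 0.

    For a 1-D spectrum the phase-space vectors reduce to scalar samples,
    so the norm is just the absolute difference.
    """
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])   # pairwise distances ||x(i) - x(j)||
    return (d <= eps).astype(np.uint8)
```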
---
<!-- .slide: data-background="#cd0" -->
Using Gramian Angular Field (GAF)
$$
\mathcal{P}(r_i, \theta_i):
\begin{cases}
\theta_i = \arccos(\tilde{x}_i), & -1 \leq \tilde{x}_i \leq 1,\ \tilde{x}_i \in \tilde{X} \\
r_i = \dfrac{t_i}{N}, & t_i \in \mathbb{N}
\end{cases}
$$
$$
\mathcal{G} =
\begin{bmatrix}
\cos(\theta_1 + \theta_1) & \cdots & \cos(\theta_1 + \theta_n) \\
\cos(\theta_2 + \theta_1) & \cdots & \cos(\theta_2 + \theta_n) \\
\vdots & \ddots & \vdots \\
\cos(\theta_n + \theta_1) & \cdots & \cos(\theta_n + \theta_n)
\end{bmatrix}
$$
Note:
> Why not the inner product of the polar-encoded values?
> The inner product in the 2D polar space has several limitations because the norm of each vector has been adjusted for the time dependency. More precisely:
> An inner product between two distinct observations will be biased in favor of the most recent one (because the norm increases with time).
> When computing the inner product of an observation with itself, the resulting norm is biased as well.
> Therefore, if an inner-product-like operation were to exist, it should depend solely on the angle.
>
> Advantages
> The diagonal is made of the original values of the scaled time series (the time series can be approximately reconstructed from the high-level features learned by the deep neural network).
> Temporal correlations are accounted for through the relative correlation by superposition of directions with respect to time interval k.
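A minimal NumPy sketch of the transform; the min-max rescaling to [-1, 1] and the choice of the summation field are assumptions about the exact preprocessing used.

```python
import numpy as np

def gramian_angular_field(x):
    """Gramian Angular (Summation) Field of a 1-D series.

    The series is rescaled to [-1, 1], mapped to angles theta_i = arccos(x_i),
    and the field is G[i, j] = cos(theta_i + theta_j).
    """
    x = np.asarray(x, dtype=float)
    x_tilde = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # rescale to [-1, 1]
    theta = np.arccos(np.clip(x_tilde, -1.0, 1.0))
    return np.cos(theta[:, None] + theta[None, :])
```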
---
<!-- .slide: data-background="#cd0" -->
###### The newly constructed operation corresponds to a penalized version of the conventional inner product
$$
\cos(\theta_1 + \theta_2) = \langle x, y \rangle - \sqrt{1-x^2}\cdot\sqrt{1-y^2}
$$
Note:
> The penalty shifts the mean output towards -1.
> The closer x and y are to 0, the larger is the penalty. The main consequence is that points which were closer to the Gaussian noise … (with the dot) are (with the penalized dot)
> For x = y: it is casted to -1
> The outputs are easily distinguishable from Gaussian Noise.
<!-- ###### https://arxiv.org/pdf/1506.00327.pdf -->
---
<!-- .slide: data-background="#cd0" -->


---
<!-- .slide: data-background="#cd0" -->
Continuous wavelet transformation (CWT)
$$
X_w(a,b) = \frac{1}{\lvert a \rvert^{1/2}} \int_{-\infty}^{\infty} x(t) \bar\psi\left(\frac{t-b}{a}\right)dt
$$
##### $$ a \in \mathbb{R}^{+}_{*} $$
##### $$ b \in \mathbb{R} $$
Note:
The continuous wavelet transform (CWT) is a formal (i.e., non-numerical) tool that provides an overcomplete representation of a signal by letting the translation and scale parameters of the wavelets vary continuously. The continuous wavelet transform of a function x(t) at a scale a > 0 and translation b is given by the integral above, where \psi(t) is a continuous function in both the time domain and the frequency domain called the mother wavelet and the overline represents complex conjugation. The main purpose of the mother wavelet is to provide a source function for generating the daughter wavelets, which are simply the translated and scaled versions of the mother wavelet. To recover the original signal x(t), the inverse continuous wavelet transform can be used.
#### Mexican hat wavelet (the Ricker wavelet)
In mathematics and numerical analysis, the Ricker wavelet is the negative normalized second derivative of a Gaussian function, i.e., up to scale and normalization, the second Hermite function. It is a special case of the family of continuous wavelets (wavelets used in a continuous wavelet transform) known as Hermitian wavelets. The Ricker wavelet is frequently employed to model seismic data and as a broad-spectrum source term in computational electrodynamics. It is usually referred to as the Mexican hat wavelet only when used as a 2D image-processing kernel. It is also known as the Marr wavelet after David Marr.
#### Morlet wavelet
The Morlet wavelet (or Gabor wavelet) is a wavelet composed of a complex exponential (carrier) multiplied by a Gaussian window (envelope). This wavelet is closely related to human perception, both hearing and vision.
https://en.wikipedia.org/wiki/Continuous_wavelet_transform
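A minimal NumPy sketch of a discretized CWT with a Mexican hat mother wavelet; the scale grid, sampling step, and the convolution-based discretization are assumptions, not the transform settings used in the study.

```python
import numpy as np

def mexican_hat(u):
    """Mother wavelet psi(u): the Mexican hat (Ricker) wavelet, i.e. the
    negative normalized second derivative of a Gaussian."""
    return (2 / (np.sqrt(3) * np.pi ** 0.25)) * (1 - u ** 2) * np.exp(-u ** 2 / 2)

def cwt(x, scales, dt=1.0):
    """Discretized CWT: X_w(a, b) ~ |a|^(-1/2) * sum_t x(t) psi((t - b)/a) dt.

    Returns an array of shape (len(scales), len(x)), one row per scale a.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = (np.arange(n) - n // 2) * dt
    out = np.empty((len(scales), n))
    for k, a in enumerate(scales):
        psi = mexican_hat(t / a)                    # daughter wavelet at scale a (real, so conjugation is a no-op)
        out[k] = np.convolve(x, psi[::-1], mode="same") * dt / np.sqrt(abs(a))
    return out
```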
---
<!-- .slide: data-background="#cd0" -->

Note:
Morlet wavelet
The Morlet wavelet (or Gabor wavelet) is a wavelet composed of a complex exponential (carrier) multiplied by a Gaussian window (envelope). This wavelet is closely related to human perception, both hearing and vision.
---
<!-- .slide: data-background="#cd0" -->

---
<!-- .slide: data-background="#cd0" -->
Spline Convolution
$$
g_l(u)= \sum_{\mathbf{p}\in \mathcal{P}}{w_{\mathbf{p}, l} \cdot B_{\mathbf{p}}(u)}
$$
where $B_{\mathbf{p}}$ is the product of the basis functions
indexed by $\mathbf{p}$
Note:
Spline-based Convolutional Neural Networks (SplineCNNs) are a variant of deep neural networks for irregularly structured and geometric input, e.g., graphs or meshes.
The main contribution is a novel convolution operator based on B-splines that makes the computation time independent of the kernel size, thanks to the local support property of the B-spline basis functions. The result is a generalization of the traditional CNN convolution operator using continuous kernel functions parametrized by a fixed number of trainable weights. In contrast to related approaches that filter in the spectral domain, this method aggregates features purely in the spatial domain.
In addition, SplineCNN allows entire end-to-end training of deep architectures, using only the geometric structure as input instead of handcrafted feature descriptors.
For validation, the method was applied to tasks from the fields of image graph classification, shape correspondence, and graph node classification, where it matches or outperforms state-of-the-art approaches while being significantly faster and having favorable properties such as domain independence. The source code is available on GitHub.
---
<!-- .slide: data-background="#cd0" -->
$$
B_\mathbf{p}(u)=\prod_{i=1}^{d}{N_{i,p_i}(u_i)}
$$
$$
(f \star g)(i) = \dfrac{1}{| \mathcal{N}(i)|}
\sum_{l=1}^M{ \sum_{j\in \mathcal{N}(i)} {f_l(j)\cdot g_l( \mathbf{u} (i, j))}}
$$
Note:
An interpretation of this kernel is to see the trainable
parameters $w_{\mathbf{p}, l}$ as control values for the height of a $(d+1)$-dimensional
B-spline surface, from which a weight is sampled for each neighboring point $j$, depending
on $\mathbf{u}(i, j)$.
Given the kernel functions $g=(g_1, \cdots, g_M)$ and input node features $f$, the spatial convolution operator for a node is defined as above.
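A hypothetical usage sketch with the SplineConv layer from PyTorch Geometric (assuming torch_geometric and torch-spline-conv are installed), treating the spectrum as a chain graph over the energy axis; the graph construction, channel sizes, and kernel size are assumptions, not the study's actual architecture.

```python
import torch
from torch_geometric.nn import SplineConv

# One input feature per node (e.g. counts at each energy channel), 16 output features,
# 1-D pseudo-coordinates u(i, j), and 5 B-spline control points per kernel dimension.
conv = SplineConv(in_channels=1, out_channels=16, dim=1, kernel_size=5, degree=1)

x = torch.rand(100, 1)                                               # node features f
edge_index = torch.stack([torch.arange(99), torch.arange(1, 100)])   # chain graph over the energy axis
edge_attr = torch.rand(99, 1)                                        # pseudo-coordinates u(i, j) in [0, 1]

out = conv(x, edge_index, edge_attr)                                 # spline-weighted, neighborhood-aggregated features
```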
---
<!-- .slide: data-background="#cd0" -->
# Architecture
---
<!-- .slide: data-background="#cd0" -->

SpecNet Architecture
---
<!-- .slide: data-background="#cd0" -->
### Activation
---
<!-- .slide: data-background="#cd0" -->
#### Gaussian Error Linear Unit (GELU)
$$
\begin{aligned}
\mathrm{GELU}(x) &:= x\,P(X \leq x) = x\,\Phi(x)\\
&= \dfrac{x}{2} \left( 1+\mathrm{erf} \left( \dfrac{x}{\sqrt{2}}\right)\right)
\end{aligned}
$$
Note:
The Gaussian Error Linear Unit (GELU) is a high-performing neural network activation function. The GELU nonlinearity is the expected transformation of a stochastic regularizer that randomly applies the identity or zero map to a neuron's input. The GELU nonlinearity weights inputs by their magnitude, rather than gating inputs by their sign as ReLUs do.
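For reference, a direct NumPy/SciPy sketch of the exact form above:

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    """Exact GELU: x * Phi(x), with the Gaussian CDF written via the error function."""
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))
```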
---
<!-- .slide: data-background="#cd0" -->

---
<!-- .slide: data-background="#cd0" -->
$$
D_{\mathrm{KL}}\left(P \,\|\, Q\right) = \sum_{i} p_i \log_2 \dfrac{p_i}{q_i}
$$
Note:
Since this task can be considered a multi-label classification problem, there are many
loss functions suitable for it, such as cross-entropy loss,
weighted cross-entropy loss, and the Kullback-Leibler divergence.
The best-suited loss function for this task is the Kullback-Leibler divergence,
also called relative entropy, which is a measure of the difference between two probability distributions.
The Kullback-Leibler divergence is not a proper metric because it is not symmetric.
Here $P$ is taken to be the target distribution (the true element distribution in the sample),
and $Q$ is the approximated distribution
inferred by the neural network (the probability distribution of elements in the sample as predicted by the neural
network).
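A small sketch of the loss for two discrete element distributions; the `eps` smoothing term is an assumption added to avoid division by zero.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in bits for two discrete distributions.

    p : target distribution (true element fractions in the sample)
    q : distribution predicted by the network
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))
```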
---
<!-- .slide: data-background="#cd0" -->
# Training
$$
\eta_t = \eta_{min}^i + \dfrac{1}{2} \left(\eta_{max}^i - \eta_{min}^i\right)
\left(1+\cos\left(\dfrac{T_{cur}}{T_i}\pi\right)\right)
$$
Note:
Where $\eta_{min}^i$ and $\eta_{max}^i$ are the ranges for the learning rate; in
this study $\eta_{min}^i = 10^{-5}$ and $\eta_{max}^i = 10^{-1}$. $T_{cur}$
tracks how many epochs have been performed since the last restart; it
is updated at each batch iteration $t$, so it can take discretized (fractional) values.
To increase accuracy, the initial period was set to a small value ($T_i = 1$) and increased
by a factor $T_{mult} = 2$ at every restart; $\eta_{max}^i$ and $\eta_{min}^i$ were also decreased
at every new restart.
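A hedged sketch of this schedule using PyTorch's built-in CosineAnnealingWarmRestarts scheduler; the model and optimizer are placeholders, and only `T_0=1`, `T_mult=2`, and the learning-rate range follow the values quoted above.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(256, 10)                            # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)    # eta_max
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=1, T_mult=2, eta_min=1e-5)

for epoch in range(10):
    # ... training loop over batches would go here ...
    scheduler.step()   # advance T_cur; a restart occurs when T_i is reached
```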
---
<!-- .slide: data-background="#cd0" -->

---
<!-- .slide: data-background="#cd0" -->
# Results
---
<!-- .slide: data-background="#cd0" -->

---
<!-- .slide: data-background="#cd0" -->

---
<!-- .slide: data-background="#cd0" -->
The accuracy of the trained neural network is around 88% for pure elements, 75% for
augmented compounds, and 63% for real compounds.
---
<!-- .slide: data-background="#cd0" -->
Thank you!
For collaboration:
abdalaziz.rashid@outlook.com
---
{"metaMigratedAt":"2023-06-15T11:55:01.127Z","metaMigratedFrom":"YAML","title":"A new approach of analyzing AES data with deep learning","breaks":true,"slideOptions":"{\"theme\":\"League\",\"transition\":\"fade\",\"spotlight\":{\"enabled\":false}}","contributors":"[{\"id\":\"52640336-5fa3-4bc6-b5c2-4983c7f30a57\",\"add\":14940,\"del\":39}]"}