---
tags: Mathematical modeling and optimization
title: 1.1. Data handling and modeling
---

## 1.1. Data handling and modeling

Numbers are the most objective way of presenting phenomena and regularities in nature and engineering. Data, as a collection of numbers, reveals physical and system insights. In various scientific disciplines, data often presents itself in a discrete form, manifesting as distinct points or values. Understanding and modeling this discrete data is pivotal for drawing meaningful insights and making accurate predictions.

### 1.1.1. Interpolation of data

Interpolation is a fundamental concept in mathematics and data analysis, serving as a powerful tool for estimating values between known data points. It involves constructing new data points within the range of existing data, bridging the gap between discrete data points to achieve a continuous representation. This enables the creation of a smooth and continuous model, which is particularly valuable when we seek to describe the behavior of a system using mathematical equations. The discovery of [Boyle's Law](https://en.wikipedia.org/wiki/Boyle%27s_law) is a typical example: the relationship between the pressure and volume of a gas became a law after the measured data were interpolated.

<img src="https://live.staticflickr.com/65535/54287815580_d4f21c336a_o.png" width="80%">

The most commonly used techniques include **linear interpolation**, **polynomial interpolation** and **spline interpolation**.

#### Linear interpolation

Linear interpolation is a method for estimating values between two known data points by assuming a linear relationship between them. Let's denote the two known data points as $(x_1, y_1)$ and $(x_2, y_2)$, where $x_1$ and $x_2$ are the independent variable values, and $y_1$ and $y_2$ are the corresponding dependent variable values. The linear interpolation formula can be expressed as follows:

$$ \displaystyle y=y_1+\frac{(x-x_1)(y_2-y_1)}{(x_2-x_1)}$$

$x$: The independent variable value at which we want to interpolate the dependent variable.

$x_1$, $y_1$: The coordinates of the first known data point.

$x_2$, $y_2$: The coordinates of the second known data point.

$y$: The interpolated value of the dependent variable at the specified $x$.

#### Polynomial interpolation

Polynomial interpolation involves fitting a polynomial function to a set of known data points. Let's consider a scenario where we have $n+1$ data points $(x_0,y_0),(x_1,y_1),...,(x_n,y_n)$, where $x_i$ and $y_i$ are the coordinates of the $i$-th data point. The goal of polynomial interpolation is to find a polynomial of degree at most $n$, denoted as $P_n(x)$, that passes through all these data points. The general form of $P_n(x)$ is given by:

$$\displaystyle P_n(x)=a_0+a_1 x+a_2x^2+...+a_nx^n$$

Here, $a_0,a_1,...,a_n$ are the coefficients of the polynomial, and $n$ is the degree of the polynomial.

#### Spline interpolation

Spline interpolation involves fitting piecewise-defined polynomials (splines) to a set of data points, creating a smooth and continuous curve. One of the most commonly used types of splines is the cubic spline, which uses cubic polynomials for each interval between data points. Let's consider a set of $n+1$ data points $(x_0,y_0),(x_1,y_1),...,(x_n,y_n)$, where $x_i$ and $y_i$ are the coordinates of the $i$-th data point.

The __**cubic spline**__ is a piecewise-defined function $S(x)$ that consists of cubic polynomials on each interval $[x_i,x_{i+1}]$. The general form of the cubic spline on the $i$-th interval is given by:

$$\displaystyle S_i(x)=a_i+b_i(x-x_i)+c_i(x-x_i)^2+d_i(x-x_i)^3$$

Here, $a_i$, $b_i$, $c_i$, and $d_i$ are coefficients that need to be determined for each interval. The goal is to find these coefficients in a way that ensures a smooth and continuous curve.

An illustration of these three interpolation methods is shown below. Since _linear_ and _spline_ interpolation fit the data locally, over two or three points at a time, the interpolated values do not exceed the range of the data; the values computed from a single global polynomial, in contrast, can escalate rapidly between points.

<img src="https://live.staticflickr.com/65535/54287443971_4815a0ece8_z.jpg" width="80%">

#### Resources in Python

[__Scipy.interpolate__](https://docs.scipy.org/doc/scipy/tutorial/interpolate.html) provides the corresponding functions, which can be used directly in data analysis.
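A minimal sketch of how the three techniques compare in practice, assuming some illustrative sample points (the data values and the evaluation grid below are chosen for demonstration only, not taken from the text):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Illustrative sample points (assumed data, e.g. sparse measurements)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])

# Fine grid on which to evaluate the interpolants
x_fine = np.linspace(x.min(), x.max(), 200)

# Linear interpolation: straight segments between neighboring points
y_linear = np.interp(x_fine, x, y)

# Polynomial interpolation: one global polynomial of degree n through n+1 points
coeffs = np.polyfit(x, y, deg=len(x) - 1)
y_poly = np.polyval(coeffs, x_fine)

# Cubic spline: piecewise cubic polynomials, smooth across the joints
spline = CubicSpline(x, y)
y_spline = spline(x_fine)

# All three curves pass through the data points but differ in between;
# the global polynomial may oscillate strongly near the interval ends.
print(y_linear[:3], y_poly[:3], y_spline[:3])
```

Plotting the three curves against the raw points reproduces the qualitative behavior in the figure above: the local methods stay bounded by the data, while the global polynomial can overshoot.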
### 1.1.2. Statistical representation and Gaussian distribution

When dealing with measurement data, consideration of deviation is necessary. Data deviation may be due to measurement uncertainties, noise, or occasional outliers. Understanding the variability within a dataset, and especially distinguishing the unphysical from the physical data, is the only way towards meaningful analysis. One key measure that illuminates this variability is the standard deviation, a statistical measure that quantifies the amount of variation or dispersion in a dataset. For a dataset with $n$ data points $(x_1,x_2,...,x_n)$ and a mean $\bar{x}$, the standard deviation ($\sigma$) is calculated as follows:

$$\large \displaystyle \sigma=\sqrt{\sum_{i=1}^n\frac{(x_i-\bar{x})^2} {n}}$$

Here, $x_i$ are the individual data points and $\bar{x}$ is the mean of the dataset. Based on this idea, realistic data and its trend may evolve as shown below. In the figure, three data evolutions are plotted in the same diagram. Accounting for the measurement deviation, the trend can be modeled by the dashed line.

<img src="https://live.staticflickr.com/65535/54292607421_b7299b64cc_z.jpg" width="80%">

The deviation itself also reveals certain information. Combined with the __Gaussian distribution__, the deviation can serve as a quality criterion to evaluate the plausibility of the data, e.g. the $\pm\ 3\ \sigma$ principle.

#### Gaussian distribution

The *Gaussian distribution*, also known as the normal distribution, is a fundamental concept and the most widely applied model in statistics and probability theory. It is used to model random variables whose values cluster around a mean value, as shown in the figure.

<img src="https://miro.medium.com/max/1200/1*IdGgdrY_n_9_YfkaCh-dag.png" width=80%>

The probability density function (PDF) of a Gaussian distribution with mean ($\bar{x}$) and standard deviation ($\sigma$) is given by:

$$\displaystyle f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\bar{x}}{\sigma}\right)^2\right)$$

Here:

$x$ represents the variable for which we are calculating the probability density.

$\bar{x}$ is the mean of the distribution.

$\sigma$ is the standard deviation.

From a data-modeling point of view, the Gaussian distribution has widespread applications in various practical assignments, including:

- Evaluation of event probabilities.
- Assessing the similarity of data sets, e.g. with the [t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)
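A minimal numerical sketch of these ideas, assuming a synthetic measurement sample with one injected outlier (the sample size, random seed, and $3\sigma$ cutoff are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Illustrative measurement sample (assumed data): values clustered around 5.0
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=5.0, scale=0.3, size=1000)
data[42] = 9.0  # inject one unphysical outlier

mean = data.mean()
sigma = data.std()  # divides by n, matching the formula above

# Gaussian probability density f(x), here evaluated at the mean
print(norm.pdf(mean, loc=mean, scale=sigma))

# +-3 sigma principle: flag points outside mean +- 3*sigma as suspicious
outliers = data[np.abs(data - mean) > 3 * sigma]
print(f"mean = {mean:.3f}, sigma = {sigma:.3f}, flagged: {outliers}")
```

For a truly Gaussian sample, roughly 99.7% of the points fall inside the $\pm 3\sigma$ band, so the flagged points are strong candidates for unphysical data.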
### 1.1.3. Spectrum analysis and Fourier transformation

Signal processing is a crucial field that enables us to analyze and manipulate signals, extracting valuable information from diverse data sources. One powerful tool in this domain is the Fourier transform, which allows us to represent signals in the frequency domain, providing insights into their frequency components. An efficient algorithm for computing it on sampled data is the Fast Fourier Transform (FFT), which significantly enhances computational speed.

The FFT allows us to decompose a signal into a superposition of different waves, each characterized by a specific wavelength and amplitude. Imagine a musical chord played on a guitar or piano. This composite sound is not a single, uniform wave but rather a combination of individual waves, each corresponding to a particular musical note. The FFT, in essence, disentangles this amalgamation, unveiling the distinct frequencies that contribute to the overall sound.

Mathematically, the FFT takes a time- or space-domain signal and transforms it into the frequency or wavelength domain, expressing the signal as a sum of sinusoidal waves. Each component of the FFT spectrum corresponds to a specific frequency or wavelength, revealing the frequency and amplitude of the underlying waves. An example appears when analyzing the patterns of turbulent flows, where the chaotic structure reveals the $-5/3$ law in Fourier space.

<img src="https://www.researchgate.net/publication/348426509/figure/fig2/AS:979353382952961@1610507450688/Velocity-magnitude-of-the-decaying-homogeneous-isotropic-turbulence-at-M-a-t-01-Re-l.ppm" width="60%"> <img src="https://d3i71xaburhd42.cloudfront.net/bad09807a652834618fdbe39e77d67ac92d152f3/8-Figure1-1.png" width="39%">

The Fourier transform is a mathematical tool that decomposes a signal into its constituent frequencies. For a continuous-time signal $x(t)$, the Fourier transform $X(f)$ is defined as:

$$\displaystyle X(f)=\int^{\infty}_{-\infty}x(t)\cdot\exp\left(-j\ 2\pi\ f\ t\right) dt$$

This transformation yields a representation of the signal in the frequency domain, providing information about its frequency components and their respective magnitudes and phases.

#### Discrete Fourier Transform (DFT)

In practical applications, signals are often discrete. The Discrete Fourier Transform (DFT) is used to analyze such discrete signals. For a discrete signal $x[n]$, the DFT $X[\kappa]$ is defined as:

$$\displaystyle X[\kappa]=\sum_{n=0}^{N-1}x[n]\cdot\exp\left(-j\frac{2\pi}{N} \ \kappa n \right)$$

Here, $N$ is the number of samples in the signal.

When dealing with noisy data, working in Fourier space helps to filter the noise. By diminishing the high-frequency noise, the data retains only its large-scale evolution, which can be recovered by the inverse Fourier transform. In this way, the central information becomes clean and manageable; a short sketch follows the resources below.

#### Resources in Python

*fft* and *ifft* are the common names for these operations in the relevant applications. [__Scipy.fft__](https://docs.scipy.org/doc/scipy/tutorial/fft.html) provides the corresponding functions up to the N-dimensional level, which can be used directly in data analysis.
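A minimal sketch of noise filtering in Fourier space, assuming a synthetic 2 Hz signal with additive noise (the sampling rate, 10 Hz cutoff, and random seed are illustrative assumptions, not values from the text):

```python
import numpy as np
from scipy.fft import fft, ifft, fftfreq

# Illustrative noisy signal (assumed data): slow 2 Hz oscillation plus noise
N, dt = 1024, 1e-3                 # number of samples, sampling interval [s]
t = np.arange(N) * dt
rng = np.random.default_rng(seed=0)
signal = np.sin(2 * np.pi * 2.0 * t) + 0.5 * rng.standard_normal(N)

# Forward transform into the frequency domain
spectrum = fft(signal)
freqs = fftfreq(N, d=dt)           # frequency of each spectral component [Hz]

# Diminish the high-frequency noise: zero all components above the cutoff
spectrum[np.abs(freqs) > 10.0] = 0.0

# Inverse (backwards) transform recovers the large-scale evolution
filtered = np.real(ifft(spectrum))
print(filtered[:5])
```

The same low-pass idea applies to the scenario below: filtering a noisy velocity signal leaves the main movement, which can then be carried over into the trajectory.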
### <ins>Scenario -- bringing data to a function</ins>

A sensor captures the velocity of one particle, together with time information, into a data set. The evolution presents itself in a very scattered manner.

<img src="https://live.staticflickr.com/65535/54292607426_3858c1cfb7_z.jpg">

Considering that measurement noise is involved, how could you

- estimate the standard deviation of the measured data?
- apply the main movement information to the trajectory?

[__Sample solution__](https://cocalc.com/share/public_paths/b19cb186328fb5fa8e195cdf317ad27b993d9cc0)

---

[__BACK TO CONTENT__](https://hackmd.io/@SamuelChang/H1LvI_eYn)