# From High Dimensional Data to Insights
*BlockScience, June 2020*
## Problem Definition and Approach Overview
In this case, we have an extremely high-dimensional space in which to evaluate complex relationships between a set of controlled and uncontrolled parameters. Our interest is in the effects of variations in the controlled parameters while accounting for the uncertainty in the uncontrolled parameters.
Let the set of controlled parameters be denoted $\mathcal{P}_c$ and the set of uncontrolled or "environmental" parameters be denoted $\mathcal{P}_e$. Furthermore, the set of metrics or KPIs of interest is $\mathcal{K}$.
A naive analysis would attempt to characterize the sensitivity of each parameter $p \in \mathcal{P} = \mathcal{P}_c \cup \mathcal{P}_e$ on each of the KPIs $k \in \mathcal{K}$, which is already a $\left(|\mathcal{P}_c| + |\mathcal{P}_e| \right) \times |\mathcal{K}|$-dimensional statistical analysis. However, our work has indicated that interaction effects significantly influence the scores, which causes a further combinatorial explosion in the potential sensitivities to account for.
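To give a sense of scale for this combinatorial explosion, the following sketch counts the interaction terms for a hypothetical parameter count of 20 (the actual $|\mathcal{P}_c| + |\mathcal{P}_e|$ is not stated here):

```python
from math import comb

# Hypothetical total parameter count |P_c| + |P_e|; illustrative only.
n_params = 20

# Main effects grow linearly, but k-way interaction terms grow
# combinatorially with the number of parameters.
main_effects = n_params            # 20
pairwise = comb(n_params, 2)       # 190 two-way interactions
threeway = comb(n_params, 3)       # 1140 three-way interactions
```

Even at modest parameter counts, exhaustively characterizing interactions per KPI is infeasible, which motivates the regression-based approach below.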
Recognizing this, we must move to more advanced computational statistical approaches. We can frame the analysis as a machine learning problem and focus on regression techniques capable of reading out sensitivities (including interaction effects). To accomplish this, we structure our simulation results as a data set containing features and labels.
Our feature vector is $x = [\, p \in \mathcal{P}_c \mid p \in \mathcal{P}_e \,]$ and the associated label is $y = [\, k \in \mathcal{K} \,]$. Since we are using Monte Carlo experiments, there will be many records of the form $r = [x \mid y]$ with the same $x$ and potentially different values of $y$. In practice, this enhances the reliability of the technique, so the experimental approach needs to determine the range of values of $x$ we wish to explore, ensure there is a multiplicity of experiments for each $x$, and compute the values of $y$ from the resulting simulation experiments.

Upon constructing the data set $\mathcal{R}$ containing all of the records $r$, we will train a machine learning model to predict $y$ given $x$. The art of training such a model depends heavily on the data scientist's capacity to uncover the subspaces of the feature space that contain the most predictive power. Our effort will leverage these dimensionality reduction techniques and the ML model's input-sensitivity read-out functions to identify the combinations of parameters to which the KPIs are most sensitive, and to discover the subsets of controlled parameter choices that drive the best outcomes while accounting for the worst cases of the environmental parameters.
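A minimal sketch of this pipeline, using synthetic data in place of simulation output: the feature matrix is augmented with pairwise interaction columns so that an ordinary least-squares fit can read out interaction sensitivities directly. All parameter names, coefficients, and the choice of OLS are illustrative assumptions, not the actual simulation model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two controlled parameters (p1, p2), one
# environmental parameter (e1). The synthetic KPI depends on p1, e1,
# and the interaction p1*p2.
n = 2000
p1 = rng.uniform(0, 1, n)
p2 = rng.uniform(0, 1, n)
e1 = rng.normal(0, 1, n)                       # Monte Carlo environmental draw
y = 2.0 * p1 + 0.5 * e1 + 3.0 * p1 * p2 + rng.normal(0, 0.1, n)

# Feature vector x = [p_c | p_e], augmented with pairwise interactions
# so the regression exposes interaction effects as ordinary coefficients.
X = np.column_stack([p1, p2, e1, p1 * p2, p1 * e1, p2 * e1])
names = ["p1", "p2", "e1", "p1*p2", "p1*e1", "p2*e1"]

# Fit with an intercept; the fitted coefficients are the sensitivity read-out.
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)
sensitivities = dict(zip(names, coef[1:]))
```

In practice a nonlinear model (e.g. gradient-boosted trees with importance scores) would replace the linear fit, but the record structure $r = [x \mid y]$ and the sensitivity read-out step are the same.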
The two pieces of information we need are:
1. What is the pre-image in the feature space $x$ of the points $y$ that we liked, with a particular focus on $x_c$?
2. What are the levels of sensitivity of the outputs $y$ to particular subsets (or subspaces) of the feature space $x$?
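Both questions can be illustrated on a toy record set. Here the KPI, thresholds, and the correlation-based sensitivity proxy are all assumptions for illustration; the real analysis would use the trained model's read-out functions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical records r = [x_c | x_e | y]: one controlled parameter,
# one environmental parameter, one synthetic KPI that favors large x_c.
x_c = rng.uniform(0, 1, 5000)
x_e = rng.normal(0, 1, 5000)
y = (x_c > 0.7).astype(float) + 0.1 * x_e

# Question 1: pre-image of "good" outcomes, restricted to the
# controlled subspace. It concentrates where x_c drives good KPIs.
good = y >= 1.0
preimage_xc = x_c[good]

# Question 2: a crude sensitivity read-out comparing how strongly each
# feature subspace relates to the output.
sens_xc = abs(np.corrcoef(x_c, y)[0, 1])
sens_xe = abs(np.corrcoef(x_e, y)[0, 1])
```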
### Additional Thoughts
Another technique that should help with the selection of the final parameters is to intentionally exclude the features $x_e$ and restrict ourselves to predicting $y$ using only our choices of $x_c$. With this approach it is critically important that the range of scenarios for different $x_e$ remain in the data set, and in particular that the frequency of those records be representative of our best guess of the probabilities of the environmental conditions denoted by $x_e$.
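A sketch of why the scenario frequencies matter: when $x_e$ is excluded from the features, the best prediction for each controlled choice is the scenario-weighted expectation of $y$, so the weights in the data set directly determine the answer. The scenario values and probabilities below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical: KPI depends on a controlled choice x_c and an
# environmental scenario x_e; records include x_e scenarios at their
# believed frequencies (70% / 30%), but x_e is NOT used as a feature.
x_c = rng.choice([0.2, 0.5, 0.8], size=6000)            # swept choices
x_e = rng.choice([0.0, 1.0], size=6000, p=[0.7, 0.3])   # scenario weights
y = 3.0 * x_c + 2.0 * x_e

# Predicting y from x_c alone: the per-choice mean is the
# scenario-weighted expectation E[y | x_c] = 3*x_c + 2*E[x_e].
pred = {c: y[x_c == c].mean() for c in (0.2, 0.5, 0.8)}
```

If the environmental scenarios were over- or under-represented relative to their believed probabilities, every such prediction would be biased accordingly.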
Another important consideration is that the RMSE or other loss functions used to assess predictive power for this model will serve a dual purpose. To a limited extent they fulfill their standard function of validating the quality of the ML model, and lower loss implies a stronger model. However, due to the complexity involved, it is expected that these models will still have relatively high error rates. We are more interested in examining where variance remains versus where it has been reduced than in obtaining a model with an extremely low error.
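The point about locating residual variance, rather than minimizing a single global loss, can be made concrete. In this synthetic example (the regional structure is an assumption for illustration), the KPI is deterministic in one region of the controlled space and noise-dominated in another, so the global RMSE hides exactly the information we care about:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic KPI: fully determined by x_c below 0.5, dominated by
# environmental noise above it.
x_c = rng.uniform(0, 1, 4000)
noise = rng.normal(0, 1.0, 4000)
y = np.where(x_c < 0.5, 2.0 * x_c, 2.0 * x_c + noise)

# A model predicting from x_c only (here, its best possible prediction).
y_hat = 2.0 * x_c
residual = y - y_hat

# Ask *where* variance remains, not just what the global loss is.
rmse_low = np.sqrt(np.mean(residual[x_c < 0.5] ** 2))
rmse_high = np.sqrt(np.mean(residual[x_c >= 0.5] ** 2))
```

The region-wise losses reveal that the remaining error is concentrated where the environment dominates, which is itself an actionable insight about the parameter space.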
## Methodology
### Setup
#### Controlled Parameters
Enumerate the set $\mathcal{P}_c$:
* P1 (e.g. token issuance initial value)
* P2 (e.g. token issuance inflection point)
* P3 (e.g. lock-up fraction)
* Others that may be relevant
#### Environmental Parameters
Enumerate the set $\mathcal{P}_e$:
* Token Price (parameters of the generator functions)
    * Associated dimensionality: $F$
* Probability distribution
* Mean
* Standard deviation
* Number of aggregators
* Token Supply
* Associated dimensions: $S \times T$
* Demand for Token-associated Utility
* Associated dimensions: $S \times T$
* Personas
* Associated dimensions: $S \times R$
* Operating cost
* Type
* Common Farmer
* Staking Farmer
* Verified Farmer
#### Performance Metrics
Enumerate the set $\mathcal{K}$ (each must be cast as either a binary output or a scalar output computable from a simulation trajectory).
Decided on XX-MMMM:
* Is the lowest profitable farming output after year 5 greater than or equal to a given number?
* `(min(farming_output[5:]) >= MIN_THRESHOLD)`
* Tokens stay >= 10% more than 80% of the time
* Cost of 33% staked >= 100M USD more than 80% of the time
* Network profitable output in staked tokens
* Dimensionality: $S \times T$
* Profitable Utility per persona
* Dimensionality: $S \times R \times T$
* Cost of production per unit of token
* Dimensionality: $S \times T$
* Staked to Supply Ratio
* Dimensionality: $S \times T$
* 33% Staking Fraction
* Dimensionality: $S \times T$
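A sketch of how trajectory-valued outputs become binary KPIs, using the first two criteria above. The trajectory values and the indexing convention (yearly entries, with `[5:]` meaning "after year 5") are illustrative assumptions; `MIN_THRESHOLD` stands in for the as-yet-unspecified floor.

```python
import numpy as np

# Hypothetical simulation trajectories, one entry per year / timestep.
farming_output = np.array([5.0, 4.0, 3.5, 3.0, 2.8, 2.5, 2.6, 2.7])
staked_ratio = np.array([0.05, 0.12, 0.15, 0.11, 0.09, 0.13, 0.14, 0.12])

MIN_THRESHOLD = 2.0  # placeholder for the actual agreed-upon floor

# Binary KPI: lowest profitable farming output after year 5 meets the floor.
kpi_output_floor = bool(farming_output[5:].min() >= MIN_THRESHOLD)

# Binary KPI: ratio stays >= 10% more than 80% of the time.
kpi_staked = bool((staked_ratio >= 0.10).mean() > 0.8)
```

Each such KPI collapses a full trajectory into a single label suitable for the regression data set described above.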
### Data structure
The simulation data structure is, in general, a 5-tensor $D = (\nu, S, R, T, t)$, where each dimension is described as follows:
* $\nu$: A measurement associated with a given variable. For the current simulation, $\dim(\nu) \approx 30$
* $S$: The parameter space of the simulations. This includes all swept parameters ($P$) as well as the pricing scenarios ($F$), such that $S = P \times F$. Currently, $\dim(F) = 3$ and $\dim(P) \approx 10000$, so $\dim(S) \approx 30000$
* $R$: The space of the personas existing in the simulations. Currently, $\dim(R) = 4$
* $T$: The time dimension associated with the measurements. Currently, $\dim(T) \approx 180$
* $t$: The 'imaginary' time dimension: a time series attached to each point in time, with the concrete interpretation of the future and past expectancies around that time. Currently, $\dim(t) \approx 200$
Multiplying all the dimensions gives a tensor with roughly 130 billion entries. In practice, this tensor is sparse, and we can decompose it into four dense tensors, each with a characteristic dimension:
* ST tensor, associated with network measurements
* Associated variables: staked tokens, demand for utility, etc
* SRT tensor, associated with measurements on each persona
* Associated variables: staked tokens associated with each persona
* STt tensor, associated with time-series on each point on time
* Associated variables:
* SRTt tensor, associated with time-series for each persona and on each point on time.
* Associated variables: rewards for unlocking at a given time, for each persona
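A minimal sketch of this decomposition as a data structure, with toy sizes standing in for the real ones ($S \approx 30000$, $R = 4$, $T \approx 180$, $t \approx 200$) and illustrative variable names:

```python
import numpy as np

# Toy dimension sizes; the real simulation is several orders larger.
S, R, T, t = 10, 4, 6, 5

# Rather than one huge sparse 5-tensor, store four dense tensors keyed
# by their characteristic dimensions (variable names are illustrative).
tensors = {
    "ST":   np.zeros((S, T)),        # e.g. staked tokens, demand for utility
    "SRT":  np.zeros((S, R, T)),     # e.g. staked tokens per persona
    "STt":  np.zeros((S, T, t)),     # time series at each point in time
    "SRTt": np.zeros((S, R, T, t)),  # e.g. unlocking rewards, per persona
}
```

Each measurement in $\nu$ then lives in whichever of the four tensors matches its characteristic dimensions.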
#### Linking the data with the formalism
Each swept parameter is associated with a subspace of the $S$ dimension. Specifically, $S = \mathcal{P}_c \times \mathcal{P}_e \times F$.
Additionally, $\mathcal{K}$ is defined as a subset of the measurement space generated by $\nu$; that is, $\mathcal{K} \subset (\nu \times S \times R \times T \times t)$. This opens the possibility that the KPIs have different dimensionalities, and as such they must be either aggregated or expanded when making comparisons in order to have measurable outputs.
One example of aggregation would be to sum the profitable utility of the personas, which has $SRT$ dimensions, into a total profitable utility of the network, which has $ST$ dimensions. Essentially, we apply a transformation rule such that $R \rightarrow 1$.
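The $R \rightarrow 1$ aggregation rule is a single axis reduction; with toy sizes:

```python
import numpy as np

# Toy SRT tensor of profitable utility per persona (S=3, R=4, T=5);
# the values are illustrative.
profitable_utility_srt = np.ones((3, 4, 5))

# Aggregation rule R -> 1: sum over the persona axis (axis=1) to obtain
# the network-level profitable utility with ST dimensions.
profitable_utility_st = profitable_utility_srt.sum(axis=1)
```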
Expansion works in the other direction, increasing the variable dimensions of the feature space by the associated dimensions. It is analogous to generating a new column for each element of the expanded dimension: if four personas each have a profitable utility, expanding produces four columns, one per persona's profitable utility.
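A sketch of expansion with toy sizes; the persona names below are illustrative labels, not the simulation's actual persona identifiers:

```python
import numpy as np

# Toy SRT tensor (S=3, R=4, T=5) of profitable utility per persona.
personas = ["common_farmer", "staking_farmer", "verified_farmer", "other"]
utility_srt = np.arange(3 * 4 * 5, dtype=float).reshape(3, 4, 5)

# Expansion along R: each persona becomes its own ST-shaped "column",
# one feature column per element of the expanded dimension.
expanded = {
    f"profitable_utility_{name}": utility_srt[:, i, :]
    for i, name in enumerate(personas)
}
```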