---
tags: scikit-learn, proposal
---
# scikit-learn Sensitivity Analysis Proposal
## Proposal
Add a Sensitivity Analysis (SA) function.
The function would compute _Sobol'_ indices [1,2]. Consider a function `f` with parameters `x1`, `x2`, and `x3`, so that `y = f(x1, x2, x3)`. We want to know which parameter has the most impact, in terms of variance, on the value of `y`.
I believe scikit-learn, and the wider scientific community, would greatly benefit from having such a tool. scikit-learn already has something related with `feature_importances_` in some regressors.
As an expert in the field, I propose to develop the initial functionality and to provide long-term support.
## Background
_Sobol'_ indices are the cornerstone of SA and uncertainty quantification (UQ), and their application is not restricted to any particular field. The method is non-intrusive and only assumes that the variables are independent (a constraint that can be alleviated).
Being able to compute sensitivity indices makes it possible to reduce the dimensionality of a problem, to better understand the importance of each factor, and to see how parameters interact with each other. As such, it is an important engineering tool. If you have two or more variables and only the budget to improve your knowledge of one of them, sensitivity indices help make an informed choice. There are many successful uses of SA in the literature and in real-world applications. The EU (through the JRC) now requires an uncertainty analysis when evaluating a system, and it recommends the use of _Sobol'_ indices.
## Example
Let’s take an example with the _Ishigami_ function:
`y = sin(x1) + 7*sin(x2)^2 + 0.1*x3^4*sin(x1)`
It is not obvious which variable impacts `y` the most.
Sobol' indices are bounded between 0 and 1; the closer to 1, the more important the variable. Here they would be:
| Variable | First order Sobol’ | Total Sobol’ |
| -------- | ------------------ | ------------ |
| x1 | 0.31 | 0.56 |
| x2 | 0.41 | 0.44 |
| x3 | 0.00 | 0.24 |
A difference between the first-order and total indices indicates an interaction between variables. The total indices allow ranking the variables by importance: `x1` is the most important. Looking at the first-order indices, `x3` by itself has no impact on the variance of the output; it is its combination with another variable that gives it a total impact of 0.24. `x1` also shows a difference between its first-order and total indices, while for `x2` the two are about the same. We can conclude that `x1` and `x3` have a second-order interaction.
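The table above can be reproduced with a minimal Monte Carlo sketch. This is only one possible estimator, not a proposed API: two independent sample matrices `A` and `B` are drawn, and for each variable the function is evaluated on `A` with its i-th column swapped for `B`'s (the pick-freeze scheme, using the Saltelli estimator for first-order indices and the Jansen estimator for total indices).

```python
import numpy as np

rng = np.random.default_rng(0)

def ishigami(x):
    """Ishigami function; inputs are uniform on [-pi, pi]."""
    return np.sin(x[:, 0]) + 7 * np.sin(x[:, 1])**2 + 0.1 * x[:, 2]**4 * np.sin(x[:, 0])

d, n = 3, 2**15
# Two independent sample matrices (pick-freeze scheme)
A = rng.uniform(-np.pi, np.pi, (n, d))
B = rng.uniform(-np.pi, np.pi, (n, d))
f_A, f_B = ishigami(A), ishigami(B)
var = np.var(np.concatenate([f_A, f_B]))

S, ST = [], []
for i in range(d):
    AB = A.copy()
    AB[:, i] = B[:, i]  # freeze every column except the i-th
    f_AB = ishigami(AB)
    S.append(np.mean(f_B * (f_AB - f_A)) / var)      # first order (Saltelli)
    ST.append(np.mean((f_A - f_AB)**2) / (2 * var))  # total (Jansen)

for i in range(d):
    print(f"x{i + 1}: S = {S[i]:.2f}, ST = {ST[i]:.2f}")
```

With a large enough sample the estimates converge to the analytical values (S = 0.31, 0.44, 0.00 and ST = 0.56, 0.44, 0.24).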
Computing the indices requires a large sample size. To alleviate this constraint, a common approach is to construct a surrogate model, with Gaussian Processes or Polynomial Chaos to name the most used strategies. This allows conducting a SA at a low computational budget (many engineering applications with expensive HPC codes take advantage of this strategy).
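As a sketch of the surrogate strategy, using scikit-learn's existing `GaussianProcessRegressor` (the training size and kernel choice here are arbitrary assumptions): the expensive model is evaluated only a few hundred times to fit the surrogate, and the large pick-freeze sample needed for the indices is then evaluated on the surrogate instead.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(42)

def ishigami(x):
    """Stand-in for an expensive model; inputs uniform on [-pi, pi]."""
    return np.sin(x[:, 0]) + 7 * np.sin(x[:, 1])**2 + 0.1 * x[:, 2]**4 * np.sin(x[:, 0])

# Small training set: only 300 runs of the expensive model
X_train = rng.uniform(-np.pi, np.pi, (300, 3))
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_train, ishigami(X_train))

# Large pick-freeze sample, evaluated on the cheap surrogate
n = 2**13
A = rng.uniform(-np.pi, np.pi, (n, 3))
B = rng.uniform(-np.pi, np.pi, (n, 3))
r2 = gp.score(A, ishigami(A))  # surrogate quality on fresh points

f_A, f_B = gp.predict(A), gp.predict(B)
var = np.var(np.concatenate([f_A, f_B]))
ST = []
for i in range(3):
    AB = A.copy()
    AB[:, i] = B[:, i]
    ST.append(np.mean((f_A - gp.predict(AB))**2) / (2 * var))  # Jansen total index

print(f"R^2 = {r2:.3f}, ST = {np.round(ST, 2)}")
```

The surrogate only needs to be accurate enough to preserve the variance decomposition; the ranking `x1 > x2 > x3` is recovered at a fraction of the cost of direct Monte Carlo.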
## Implementation
Here is an example implementation I did a while back:
https://gist.github.com/tupui/09f065d6afc923d4c2f5d6d430e11696
Note that this implementation lacks a few things, such as higher-order indices, other methods, input validation, documentation, tests, and other wrappings. It is just meant to show how it works.
## Alternatives
Practitioners might be more familiar with gradient-based techniques (keywords: gradient, adjoint). These are _local_ sensitivity analysis methods, and as such they only capture the sensitivity of the model in specific areas of the parameter space.
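To make the contrast concrete, here is a central finite-difference sketch of a local method on the same Ishigami function (the nominal point is an arbitrary assumption for illustration):

```python
import numpy as np

def ishigami(x1, x2, x3):
    return np.sin(x1) + 7 * np.sin(x2)**2 + 0.1 * x3**4 * np.sin(x1)

def local_gradient(f, x, h=1e-5):
    """Central finite differences: only valid in the neighbourhood of x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(*(x + e)) - f(*(x - e))) / (2 * h)
    return grad

# At the origin the gradient is ~[1, 0, 0]: x2 and x3 look irrelevant,
# even though globally x2 has the largest first-order Sobol' index.
g = local_gradient(ishigami, [0.0, 0.0, 0.0])
print(g)
```

This is exactly the failure mode of local methods: the conclusion depends entirely on the chosen nominal point, whereas Sobol' indices integrate over the whole parameter space.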
_Sobol'_ indices are variance-based. There are other indices using higher moments, namely _moment-independent_ sensitivity analysis methods. These methods are very attractive and provide a lot of information while being simple to compute and analyse. They have been studied less, but there is increasing interest in the community. I believe they would make a good second step.
## References
Upon request, I can provide more information. Following are some foundational references; the first two have each been cited a few thousand times.
* [1] Sobol', I. M. (2001), _Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates_, Mathematics and Computers in Simulation, 55(1–3), 271-280, [doi:10.1016/S0378-4754(00)00270-6](http://dx.doi.org/10.1016/S0378-4754(00)00270-6)
* [2] Saltelli, A. et al., (2008), _Global Sensitivity Analysis. The Primer_, John Wiley & Sons, [doi:10.1002/9780470725184](http://dx.doi.org/10.1002/9780470725184)
* [3] Saltelli, A. et al., (2020), _The Future of Sensitivity Analysis: An essential discipline for systems modeling and policy support_, Environmental Modelling & Software, [doi:10.1016/j.envsoft.2020.104954](http://dx.doi.org/10.1016/j.envsoft.2020.104954)
For some examples of industrial applications (Part 3), a visual explanation of some methods (Chap. 2.2), or more bibliography, see my thesis:
https://github.com/tupui/PHD-Thesis/blob/master/phd_thesis_pamphile_roy.pdf