---
tags: scikit-learn, proposal
---

# scikit-learn Sensitivity Analysis Proposal

## Proposal

Add a Sensitivity Analysis (SA) function that computes _Sobol'_ indices [1,2]. Consider a function `f` with parameters `x1`, `x2`, and `x3`, so that `y = f(x1, x2, x3)`. We want to know which parameter has the most impact, in terms of variance, on the value `y`. I believe scikit-learn, and the wider scientific community, would greatly benefit from such a tool. I believe scikit-learn already has something related with `feature_importances_` in some regressors. As an expert in the field, I propose to develop the initial functionality and provide long-term support.

## Background

_Sobol'_ indices are a cornerstone of SA and uncertainty quantification (UQ), and their application is not restricted to any particular field. The method is non-intrusive and makes the single assumption that the variables are independent (a constraint which can be alleviated). Being able to compute sensitivity indices makes it possible to reduce the dimensionality of a problem, to better understand the importance of each factor, and to see how parameters interact with each other. As such, it is an important engineering tool: if you have two or more variables and only the budget to improve your knowledge of one of them, it helps you make an informed choice.

There are many successful uses of SA in the literature and in real-world applications. The EU (through the JRC) now requires an uncertainty analysis when evaluating a system, and recommends the use of _Sobol'_ indices.

## Example

Let's take the _Ishigami_ function as an example:

`y = sin(x1) + 7*sin(x2)^2 + 0.1*x3^4*sin(x1)`

It is not obvious which variable impacts `y` the most. Sobol' indices are bounded between 0 and 1, with values closer to 1 meaning more important.
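As a rough sketch of what the proposed function could do internally, the indices can be estimated with plain NumPy using the standard pick-and-freeze estimators of Saltelli et al. (2010); the sample size is an arbitrary choice here, and `[-pi, pi]^3` is the usual domain for the Ishigami function:

```python
import numpy as np


def ishigami(x):
    """Ishigami test function: y = sin(x1) + 7*sin(x2)^2 + 0.1*x3^4*sin(x1)."""
    return (np.sin(x[:, 0]) + 7 * np.sin(x[:, 1]) ** 2
            + 0.1 * x[:, 2] ** 4 * np.sin(x[:, 0]))


rng = np.random.default_rng(42)
n, d = 2 ** 14, 3

# Two independent sample matrices over the usual Ishigami domain [-pi, pi]^3
A = rng.uniform(-np.pi, np.pi, (n, d))
B = rng.uniform(-np.pi, np.pi, (n, d))
f_A, f_B = ishigami(A), ishigami(B)
var_y = np.var(np.concatenate([f_A, f_B]))

S1, ST = np.empty(d), np.empty(d)
for i in range(d):
    # "Pick-and-freeze": matrix A with its i-th column taken from B
    A_Bi = A.copy()
    A_Bi[:, i] = B[:, i]
    f_ABi = ishigami(A_Bi)
    # Saltelli et al. (2010) estimators for first-order and total indices
    S1[i] = np.mean(f_B * (f_ABi - f_A)) / var_y
    ST[i] = 0.5 * np.mean((f_A - f_ABi) ** 2) / var_y
    print(f"x{i + 1}: first order = {S1[i]:.2f}, total = {ST[i]:.2f}")
```

In practice a quasi-random sequence (e.g. `scipy.stats.qmc.Sobol`) would typically replace the plain uniform draws to improve convergence.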
Here they would be:

| Variable | First order Sobol' | Total Sobol' |
| -------- | ------------------ | ------------ |
| x1       | 0.31               | 0.56         |
| x2       | 0.41               | 0.44         |
| x3       | 0.00               | 0.24         |

A difference between the first-order and total indices indicates an interaction between variables, and the total indices allow ranking the variables by importance: `x1` is the most important. Looking at the first-order indices, `x3` by itself has no impact on the variance of the output; it is its combination with another variable that gives it a total impact of 0.24. `x1` also shows a gap between its first-order and total indices, while for `x2` the two are nearly equal, so we can say that `x1` and `x3` have a second-order interaction.

Computing the indices requires a large sample size. To alleviate this constraint, a common approach is to construct a surrogate model with Gaussian Processes or Polynomial Chaos (to name the most used strategies). This allows conducting an SA with a low computational budget, and many engineering applications with expensive HPC codes take advantage of this strategy.

## Implementation

Here is an example implementation I did a while back: https://gist.github.com/tupui/09f065d6afc923d4c2f5d6d430e11696

Note that this implementation lacks a few things such as higher-order indices, other methods, input validation, documentation, tests, and other wrappings. It is just to show how it works.

## Alternatives

Practitioners might be more familiar with gradient-based techniques (keywords: gradient, adjoint). These are _local_ sensitivity analysis methods and, as such, only capture the sensitivity of the model in specific areas of the parameter space. _Sobol'_ indices, by contrast, are variance-based. There are other indices using higher moments, namely _moment-independent_ sensitivity analysis methods. These are very attractive and provide a lot of information while being simple to compute and analyse.
They have been less studied, but there is increasing interest in the community; I believe they would make a good second step.

## References

Upon request, I can provide more information. The following are some foundational references; the first two have been cited a few thousand times.

* [1] Sobol', I.M. (2001), _Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates_, Mathematics and Computers in Simulation, 55(1–3), 271–280, [doi:10.1016/S0378-4754(00)00270-6](http://dx.doi.org/10.1016/S0378-4754(00)00270-6)
* [2] Saltelli, A. et al. (2008), _Global Sensitivity Analysis. The Primer_, John Wiley & Sons, [doi:10.1002/9780470725184](http://dx.doi.org/10.1002/9780470725184)
* [3] Saltelli, A. et al. (2020), _The Future of Sensitivity Analysis: An essential discipline for systems modeling and policy support_, Environmental Modelling & Software, [doi:10.1016/j.envsoft.2020.104954](http://dx.doi.org/10.1016/j.envsoft.2020.104954)

For some examples of industrial applications (Part 3), a visual explanation of some methods (Chap. 2.2), or more bibliography, you can also see my thesis: https://github.com/tupui/PHD-Thesis/blob/master/phd_thesis_pamphile_roy.pdf
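Finally, to illustrate how the surrogate strategy mentioned in the example section fits with scikit-learn's existing estimators, here is a rough sketch (the kernel choice, training-set size, and sample sizes are illustrative assumptions, not a proposed API): a `GaussianProcessRegressor` is fitted on a small number of evaluations of the "expensive" model, and the first-order indices are then estimated by sampling the cheap surrogate instead.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def ishigami(x):
    """Stand-in for an expensive model: the Ishigami function."""
    return (np.sin(x[:, 0]) + 7 * np.sin(x[:, 1]) ** 2
            + 0.1 * x[:, 2] ** 4 * np.sin(x[:, 0]))


rng = np.random.default_rng(42)

# Pretend each evaluation of the true model is costly: train on 200 points only
X_train = rng.uniform(-np.pi, np.pi, (200, 3))
gp = GaussianProcessRegressor(kernel=RBF(length_scale=[1.0, 1.0, 1.0]),
                              normalize_y=True, alpha=1e-6)
gp.fit(X_train, ishigami(X_train))

# Large pick-and-freeze sample, evaluated on the cheap surrogate instead
n, d = 2 ** 13, 3
A = rng.uniform(-np.pi, np.pi, (n, d))
B = rng.uniform(-np.pi, np.pi, (n, d))
f_A, f_B = gp.predict(A), gp.predict(B)
var_y = np.var(np.concatenate([f_A, f_B]))

s1 = []
for i in range(d):
    A_Bi = A.copy()
    A_Bi[:, i] = B[:, i]  # freeze all columns but the i-th
    s1.append(np.mean(f_B * (gp.predict(A_Bi) - f_A)) / var_y)
    print(f"x{i + 1}: first order ~ {s1[i]:.2f}")
```

The estimates carry the surrogate's approximation error on top of the Monte Carlo error, but the variable ranking is recovered at a fraction of the number of true-model evaluations.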