Distribution Infrastructure Notes

## Some review of the branch `new_distribution[2]` ### First pass - You can update parameters with the update_parameters method. (e.g. dist.update_parameters(a=[1.0, 2.0]). - As a user, if I can do `dist.a`, that could be surprising to not be able to do `dist.a = ...`. - plot: really cool to include that feature! I am not sure though about the list. I think I prefer composing by adding another layer to an axis. i.e. we would not allow passing a list and this simplifies the API. - sample/qmc_sample: I know we have the distinction in UNU.RAN. But from a user perspective is this really needed? I would use a setting for that. And consider as well making QMC the default ;) This is a more "tricky" point OC. - Should we make a separate folder for all that? - enums: subclass from enum and str to avoid values to be 1, 2, etc. - oo = np.inf. I am not sure about that for variables (I see you use some unicode in strings which I am ok with). If we wanted to go that route, maybe another unicode from this list would fit? https://www.asmeurer.com/python-unicode-variable-names/start-characters.html Questions: a lot of these sounds a bit too early to discuss no? Are you sure about the whole infra and API before diving into optims and tweaks? - 1. You want an immutable object? Typically `__slots__ = []` can be used for that - _Domain should be an ABC. Things like endpoints and inclusive should be in the spec. Why str is not defined there? I don't think I understand why the base class does not implement anything. - get_numerical_endpoints: sounds like we could have properties instead - draw: why is this method needed? This seems a bit out of place to me. Also from the name I intuitively thought about a plotting method. OC I prefer that over RVS, alternatively what about sample or random? Seems like you use sample on the distribution side already. - parameter_values={} careful about that, it's mutable. Use None or something else instead - _validate_rng/order methods look like these don't really need to be methods - decorators: I am a bit afraid to see that many. I suppose you already had to debug a few things. How was the experience? Was that an issue or it was still "easy" to navigate and get to the issues? - seeing all the methods, did you explore the idea to have a smaller API and have external functions instead of embedding everything into the class? e.g. if you would want the kurtosis, you have `dist.kurthosis(...)`. An alternative could be `kurthosis(dist, ...)`. - log vs exp: argument to only provide log interface. - Another variant, only provide a single function and have a parameter/[global] config --- ### Second pass * Extra distribution base classes ``` # implement symmetric distribution # implement composite distribution # implement wrapped distribution # implement folded distribution # implement double distribution ``` What if we only implemented one (like Composite) and left out intentionally the rest for users to do? To me it's a bit like this endless list of distributions we support. If the mechanism to create something is easy, then we can let users do the magic they want on their side. OC, I am assuming these are not that fundamental as I've never really been expose to such distributions. * The tips section. That's a good idea to communicate somewhere these. In general I think some of these could be useful in other places. (I still hope that one day we have a validation system that we can toggle off if we know what we are doing. Since I am doing a lot of backend right now, I would be tempted to call it "debug=True") * Domain: * I am really wondering if it's helping us to have 3 levels. In the end if we add a `DiscontinuousRealDomain`, will that hierarchy be helpful? If yes, then the middle layer (`_SimpleDomain`) should probably require some specific methods to be implemented to be clear on the API. * rng and default parameters: why pass them in the `draw` method and not at the init? * `cached_property`: a FYI, you can invalidate cache by using `del obj.the_property` (see https://stackoverflow.com/questions/23489548/how-do-i-invalidate-cached-property-in-django) * `_dispatch` mechanism: ok to have the decorator. I am not sure I understand the need to have a normal function and then call the dispatched one. * `fit` and `plot`: maybe these should be separate functions taking a class in. That would really restrict the responsibility of the class to be the distribution itself. Otherwise I cannot think of another way right now to limit the number of methods besides creating classes with specific responsibilities like `PDF`. I suppose that's ok. * `sample`: why not having a single `rng` and using an `isinstance` check? Otherwise I am still not a fan of passing the `rng` at all but that ship has sailed for me :wink: * What is your take on implementing things like `__add__` on the distributions? Just throwing this there. It would be close to an independent Copula (yes I still like these). #### Questions * Question 1: as discussed, a solution could be to allow accessing values `dist.a` but only allow overwritting using `dist.set_params(a=...)`. Scikit-Learn is doing that (not the attribute part though) so that could be a good idea as (some) people are used to that, e.g. https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html Speaking of sklearn, Thomas J. Fan could be good to consult for the API discussion. * Question 2: Why not a bool? (Same for `iv_policy`) Do you envision other states? I would keep it as simple as possible to start with. * Question 3: `iv_policy`. I think we have to check by default or that will surprise people. But back to my `debug: bool` if we were to frame the option like that, then we could actually think about a default were no check is done. We could have a global try/except with a message saying: "An error occured, use debug=True for more info". If it's a validation error, we would see it. If not, we let the error bubble up. * Question 4: `tol`: did you think about having a tolerance class (`Tolerance(*, tol, abs=None, rel=None)`)? In general, I find playing with tolerances cumbersome. I think having an object where you can define one tol and the others automatically adjust would be very handy. That would help unify the way tolerances work in general: for maintainers we don't need to think about that anymore, same for users. In terms of API, we could allow a single value (used to initialize the class, like `tol = Tolerance(tol)`) or a `Tolerance` object to be passed to `tol`, then you retrieve what you need (`tol.abs` or `tol.rel`). * Question 5: I am starting to feel ok about subclassing arrays. You make some good arguments. As long as we document this really well, why not. * Question 6: `kwargs`. I would like us to find a way to not have `args, kwargs` in the public API. Without using any magic, I think we should just implement a proper `__init__` for the distributions. i.e. do something similar from what is in `reference_distribution.py`. I liked that API from a user perspective (and the code was really simple to follow too). I think this is really important otherwise it's hard to know what are the possible parameters, the parametrization. The IDE cannot help, it's hard for the inline doc and if you look at the variable explorer of the IDE, you don't see the parameters of the object: right now it only shows `cache_policy`, `iv_policy` and `tol`, which are details to me. I also struggled to follow the implementation due to the `__getattr__` magic. ``` sp.stats.LogUniform() ValueError: The `LogUniform` distribution family requires parameters, but none were provided. sp.stats.LogUniform(a=1) ValueError: The provided parameters `{a}` do not match a supported parameterization of the `LogUniform` distribution family. ``` We could find a way to show the parametrization options in the errors (in any case a good idea), but that would not solve the lack of auto-completion, etc. For the working case, the `str/repr` could show the values of the parameters. Right now it's only showing the parametrization. Without going to the extent of my Jupyter notebook print idea, we could possibly add a prettier output with the domain information, etc. * Question 7: I am ok with `method`. I don't think it's much of an issue that we also call functions of object methods. We have such a parameter in a lot of places.