# SciPy Distribution Infrastructure

Effort to get `scipy.stats._distribution_infrastructure` into SciPy.

Working PR: https://github.com/mdhaber/scipy/pull/110

## Meeting 1: Hello! Presentation of the project (2024-01-31 20 CET)

### Agenda

* Cheers together, it’s been a while since we had a coffee talk :)
* Go over the newly proposed API
* Raise concerns, draft solutions, distribute workload.
* Matt shared a notebook with a demo: ~~https://www.dropbox.com/scl/fi/c0r9jbe78sq94owal9267/distribution_infrastructure.ipynb?rlkey=hyqrgkpcb1qfog84e4wiguerv&dl=0~~ https://nbviewer.org/github/mdhaber/scipy/blob/rv_infrastructure/scipy/stats/tests/distribution_infrastructure.ipynb
* Update parameter: we are not certain this is needed.
* Array operations with syntactic sugar work
* Tests are strong: based on hypothesis
* How do we parametrize: many methods vs. one method with arguments, or even separate functions that are not part of the class
* size vs shape: all in
* Limit the scope of distributions so that we are not the limiting factor in the community. People interested in exotic distributions should be able to implement them easily on their side so we don't have to support them

## Tasks

- [ ] Review sample [`lmoment`](https://github.com/scipy/scipy/pull/19475) function if we eventually want to support the method of L-moments for fitting distributions

## Meeting 2: parameters, moments and tests (2024-02-14 20 CET)

* [name=Matt] presents how to create a distribution
    * Setting ranges of parameters with special objects as class attributes.
    * These are also used to automatically generate the tests with hypothesis.
    * Be clear about what we support in terms of precision. E.g. show in the code, like NumPy, up to which precision we support and provide guarantees.
    * Minimal overriding needed for basic distributions (e.g. just the PDF)
    * Does not yet handle constraints between parameters: see if needed in the first PR, as it depends on the initial set of distributions we will support.
Still, the infrastructure should allow this even if we don't do it ourselves.
* Transformations: basic infra will be there for the first PR, with more left for future improvements
* Speed. TL;DR: it's fast. Vectorization is used and we will add Array API support
* Quad: we have other nice tools now that Matt added (reviewed mostly by Albert and a bit by Pamphile), e.g. root finder, bracketer, etc.
* Moment
    * Now we have central, standard and raw moments
    * Difficult to propose a single function, as people usually want a different type of moment depending on the order: raw for 1, standard for 2, etc. Open question whether to do something else.
* Visualization
    * Convenient functions to plot things like the PDF
* Presentation of the thought process on how the whole logic works in terms of defining which functions need to be implemented and how inversion or complement work.
    * Flow chart must end up in the doc!
* Tests: extensive testing using hypothesis. At some point this needs to be deeply reviewed. Should cover every case, from NaN to out of bounds, etc.

## Meeting 3 (2024-02-29 20 CET)

* Discussion around `fit`
    * [name=Evgeni] Have a `_fit`, do the minimum in `fit`
    * Failure mode: be explicit that results are not guaranteed. Provide reasonable failure modes by propagating messages from the underlying optimizers. We can intercept the status and make suggestions or give better messages. This gives more transparency and forces users to be more mindful about what they are doing.
* Documentation: number of files concern
    * Maybe not that important to start with, as we will start with a limited number of distributions either way
* API surface: a lot of methods
    * `[i]ccdf`: maybe we don't need these, for instance. Maybe make a poll
    * log functions: maybe use args like in R. But for things like inverse, complementary, etc., the order matters and arguments might change, making things more complicated.
    * Let's not do magic such as automatically generating methods like `cdfc` and allowing people to order the letters as they want.
* Moments: raw, central, std seems an OK tradeoff vs. offering a single function.
* RFC content: just mention `fit` and transformations, no fancy transform, plot (to say it's coming), moment (maybe), no loc/scale.

## Meeting 4 (2024-03-13 20 CET)

https://meet.google.com/tce-uegh-gnp

### Agenda

* Everyone please have a look at the code and play with the new infra (`rv_infrastructure` on [name=Matt]’s fork).
* [name=Tirth] and [name=Matt] wanted to show something else I think?
* [name=Pamphile] presents the result of the poll
* RFC discussion
* [name=Christoph] sampling / rvs: can we use tools from `stats.sampling`?

### Poll results

* Moment: consensus seems to be on having a single method. Matt will keep the 3 methods under the hood and have a go at making a higher-level API.
* Methods: keep as is, with all the log, complement, inverse, etc.
* Update params: the discussion is around input validation and speed; the other argument is around being Pythonic. We will leave the update feature out at first.
* Globally disabling input validation would be nice to have in general, but this would generate some discussion. For now we will keep that as a parameter on class instantiation.
* Methods which will be there: moments, sample. Fitting to data is wanted but not coded yet; it depends on whether Matt gets some time. Plot is not entirely finished; it will be used to show things but will be a quick follow-up. Plot is very handy for checking your data and distribution, which might even help reduce the number of bug reports.
* Single Normal distribution
* Only a limited set of distributions will be integrated. People like the idea of having a separate repo for distributions.
* Duplication in the doc is not a concern
* Support for iv_policy: keep the current behaviour
* Support for method (strategy to compute distribution): leave alone

### Consistency for axis, nan_policy, keepdims, masked array

Matt shared [a document](https://colab.research.google.com/drive/16kolTGaJ4pXrELjEtp36dgcAjIRSKTSj?usp=sharing) that he and Tirth prepared.

* Masked arrays are an issue and not necessarily faster. Warren had written a doc explaining why `ma` should be avoided, so we have been progressively moving away from them.
* Proposal to deprecate critical values, as they are typically tabulated and no longer relevant now that accurate p-values are available. Sometimes we make a curve fit which is not accurate, etc. We can instead do a Monte Carlo if we don't have a good approximation.

## Meeting 5 (2024-04-03 20 CET)

https://meet.google.com/tce-uegh-gnp

### Agenda

* Infrastructure Updates
    * Single [`Normal`](https://output.circle-artifacts.com/output/job/e77b029e-b2b8-49dd-83c4-de99f3ffb0f6/artifacts/0/html/reference/generated/scipy.stats.Normal.html#scipy.stats.Normal) distribution
    * Single [`moment`](https://output.circle-artifacts.com/output/job/e77b029e-b2b8-49dd-83c4-de99f3ffb0f6/artifacts/0/html/reference/generated/scipy.stats.ContinuousDistribution.moment.html#scipy.stats.ContinuousDistribution.moment) method
    * Enhanced [`plot`](https://output.circle-artifacts.com/output/job/e77b029e-b2b8-49dd-83c4-de99f3ffb0f6/artifacts/0/html/reference/generated/scipy.stats.ContinuousDistribution.plot.html#scipy.stats.ContinuousDistribution.plot) method
    * Added [documentation](https://output.circle-artifacts.com/output/job/e77b029e-b2b8-49dd-83c4-de99f3ffb0f6/artifacts/0/html/reference/generated/scipy.stats.ContinuousDistribution.html#scipy.stats.ContinuousDistribution)
        * How to document attributes?
        * Should `ContinuousDistribution` be public?
* [Consistent axis / nan_policy support in `scipy.stats`](https://colab.research.google.com/drive/16kolTGaJ4pXrELjEtp36dgcAjIRSKTSj?usp=sharing)
    * Last time we agreed that `critical_values` and `significance_level` would be removed from `anderson`/`anderson_ksamp`
    * [gh-20137](https://github.com/scipy/scipy/pull/20137) and [gh-20284](https://github.com/scipy/scipy/pull/20284) give `axis` support to `pearsonr` with the usual broadcasting conventions, ignoring the (unfortunate) precedent set by `spearmanr`
    * Proposed changes to `gstd` [(gh-20285)](https://github.com/scipy/scipy/pull/20285)
    * Array API Masked Arrays? [(gh-20363)](https://github.com/scipy/scipy/pull/20363)

### Notes

* Infrastructure Updates
    * Single [`Normal`](https://output.circle-artifacts.com/output/job/e77b029e-b2b8-49dd-83c4-de99f3ffb0f6/artifacts/0/html/reference/generated/scipy.stats.Normal.html#scipy.stats.Normal) distribution
        * Make sure the `StandardNormal` object behaves like a `Normal` object, with `mu` and `sigma`
    * Single [`moment`](https://output.circle-artifacts.com/output/job/e77b029e-b2b8-49dd-83c4-de99f3ffb0f6/artifacts/0/html/reference/generated/scipy.stats.ContinuousDistribution.moment.html#scipy.stats.ContinuousDistribution.moment) method ✔
    * Enhanced [`plot`](https://output.circle-artifacts.com/output/job/e77b029e-b2b8-49dd-83c4-de99f3ffb0f6/artifacts/0/html/reference/generated/scipy.stats.ContinuousDistribution.plot.html#scipy.stats.ContinuousDistribution.plot) method ✔
    * Added [documentation](https://output.circle-artifacts.com/output/job/e77b029e-b2b8-49dd-83c4-de99f3ffb0f6/artifacts/0/html/reference/generated/scipy.stats.ContinuousDistribution.html#scipy.stats.ContinuousDistribution)
        * Should `ContinuousDistribution` be public? Consensus: not initially.
    * Tirth and Albert will look into a Sphinx extension to prevent duplication of docs and also address the issue of examples.
    * Pamphile will look at the way to document attributes
    * Matt will ask everyone to review sections of the documentation
    * Matt will work on `fit` and `ShiftedScaledDistribution`

## Meeting 6 (2024-04-18 20 CET)

### Agenda

* Going over the suggestions on the PR
* Discuss next steps with the RFC

### Notes

* Going through comments on PR
    * Tirth found a solution to document distributions without duplicating rst files. He is going to make a PR on Matt's branch.
    * Matt is going to finish with all suggestions and then prepare a draft to post on the forum :rocket:
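
## Appendix: moment kinds (sketch)

The moment discussion in Meetings 2 to 5 converged on a single entry point that selects between raw, central and standardized moments. The snippet below is only an illustrative sketch of that idea for a normal distribution, in pure Python. The function name `normal_moment` and its `kind` argument are hypothetical and are not taken from the PR; the closed-form central moments of the normal distribution are standard.

```python
import math


def normal_moment(order, kind="raw", mu=0.0, sigma=1.0):
    """Moment of Normal(mu, sigma), selected by `kind`.

    Hypothetical helper for illustration only; not the SciPy API.
    """
    if order < 0 or int(order) != order:
        raise ValueError("order must be a non-negative integer")
    order = int(order)

    def central(k):
        # E[(X - mu)^k] = sigma^k * (k - 1)!! for even k, and 0 for odd k.
        if k % 2:
            return 0.0
        double_factorial = math.prod(range(k - 1, 0, -2))  # empty product is 1
        return sigma**k * double_factorial

    if kind == "central":
        return central(order)
    if kind == "standardized":
        # Central moment divided by sigma^order.
        return central(order) / sigma**order
    if kind == "raw":
        # Binomial expansion of E[(mu + (X - mu))^order].
        return sum(
            math.comb(order, k) * mu ** (order - k) * central(k)
            for k in range(order + 1)
        )
    raise ValueError(f"unknown kind: {kind!r}")


# Mean, variance and standardized fourth moment of Normal(1, 2)
print(normal_moment(1, "raw", mu=1.0, sigma=2.0))
print(normal_moment(2, "central", mu=1.0, sigma=2.0))
print(normal_moment(4, "standardized", mu=1.0, sigma=2.0))
```

Keeping one function with a `kind` switch mirrors the poll outcome (single method, three computations kept under the hood); whether the real method should dispatch this way is left to the PR.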