SciPy Distribution Infrastructure

Effort to get scipy.stats._distribution_infrastructure into SciPy.

Working PR: https://github.com/mdhaber/scipy/pull/110

Meeting 1: Hello! Presentation of the project (2024-01-31 20 CET)

Agenda

Cheers together, it’s been a while since we had a coffee talk :)
Walk through the newly proposed API
Raise concerns, draft solutions, distribute workload.
Matt shared a notebook with a demo:
~~https://www.dropbox.com/scl/fi/c0r9jbe78sq94owal9267/distribution_infrastructure.ipynb?rlkey=hyqrgkpcb1qfog84e4wiguerv&dl=0~~
https://nbviewer.org/github/mdhaber/scipy/blob/rv_infrastructure/scipy/stats/tests/distribution_infrastructure.ipynb
Update parameter: we are not certain this is needed.
Array operations with syntactic sugar work
Tests are strong: based on hypothesis
How do we parametrize: many methods vs one method with arguments, or even separate functions not part of the class
size vs shape: all in
Limit the scope of distributions so that we are not the limiting factor for the community. People interested in exotic distributions should be able to implement them easily on their side, so we don't have to support them.

Tasks

Review the sample lmoment function, in case we eventually want to support the method of L-moments for fitting distributions

Meeting 2: parameters, moments and tests (2024-02-14 20 CET)

Matt presents how to create a distribution
- Setting ranges of parameters with special objects as class attributes.
  - These are also used to automatically generate the tests with hypothesis.
  - Be clear about what we support in terms of precision, e.g. show in the code, as NumPy does, up to which precision we provide guarantees.
- Minimal overriding needed for basic distributions (e.g. just the PDF)
- Does not yet handle constraints between parameters: whether this is needed in the first PR depends on the initial set of distributions we will support. Still, the infrastructure should allow this even if we don't do it ourselves.
- Transformations: basic infra will be there for the first PR, but more left for future improvements
- Speed. TL;DR it's fast. Vectorization is used and we will add Array API support
- Quad: we have other nice tools now that Matt added (reviewed mostly by Albert and a bit by Pamphile), e.g. root finder, bracketer, etc.
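
The "override only the PDF" idea can be illustrated with a toy sketch. The names below (`ToyDistribution`, `ToyNormal`, `support`) are hypothetical and are not the API of the working PR; the generic `cdf` simply integrates `_pdf` with `scipy.integrate.quad` on scalars, whereas the real infrastructure is vectorized.

```python
import math
from scipy.integrate import quad


class ToyDistribution:
    """Hypothetical base class: subclasses only need to provide `_pdf`."""

    # Support of the distribution; a subclass may narrow it.
    support = (-math.inf, math.inf)

    def _pdf(self, x):
        raise NotImplementedError

    def pdf(self, x):
        a, b = self.support
        return self._pdf(x) if a <= x <= b else 0.0

    def cdf(self, x):
        # Generic fallback: integrate the PDF from the left end of the support.
        a, b = self.support
        if x <= a:
            return 0.0
        if x >= b:
            return 1.0
        value, _abserr = quad(self._pdf, a, x)
        return value


class ToyNormal(ToyDistribution):
    """Standard normal defined by its PDF alone."""

    def _pdf(self, x):
        return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


X = ToyNormal()
print(X.cdf(0.0))   # close to 0.5
print(X.cdf(1.96))  # close to 0.975
```

A subclass that also knows a closed-form CDF could override `cdf` directly and skip the numerical integration.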
Moment
- Now we have central, standard and raw moments
  - Difficult to propose a single function as people usually want different types of moment depending on the order. Raw for 1, standard for 2, etc. Open question whether to do something else.
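
As a reminder of how the three kinds of moments relate, here is a small numerical sketch for a normal distribution with mean 1 and standard deviation 2. The helper names (`raw_moment`, `central_moment`, `standardized_moment`) are made up for illustration and are not the proposed API. The mean is a raw moment, the variance a central moment, and skewness and kurtosis are standardized moments.

```python
import math
from scipy.integrate import quad


def pdf(x, mu=1.0, sigma=2.0):
    """PDF of a normal distribution with mean `mu` and standard deviation `sigma`."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def raw_moment(order):
    # E[X**order]
    return quad(lambda x: x**order * pdf(x), -math.inf, math.inf)[0]


def central_moment(order):
    # E[(X - mean)**order]
    mean = raw_moment(1)
    return quad(lambda x: (x - mean)**order * pdf(x), -math.inf, math.inf)[0]


def standardized_moment(order):
    # E[((X - mean) / std)**order]
    std = math.sqrt(central_moment(2))
    return central_moment(order) / std**order


print(raw_moment(1))           # mean, close to 1
print(central_moment(2))       # variance, close to 4
print(standardized_moment(4))  # kurtosis, close to 3
```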
Visualization
- Convenient functions to plot things like PDF
Presentation of the thought process on how the whole logic works in terms of defining which function needs to be implemented and how inversion or complement work.
- Flow chart must end up in the doc!
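
The complement and inversion fallbacks can be sketched as follows. This is only an illustration of the general idea, not the logic in the PR: the function names are placeholders, the bracket `[lo, hi]` is fixed by assumption, and `scipy.optimize.brentq` stands in for the root finder and bracketer mentioned above.

```python
import math
from scipy.optimize import brentq


def cdf(x):
    """CDF of the standard normal, written with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def ccdf(x):
    # Complement: generic fallback when no dedicated formula is given.
    return 1.0 - cdf(x)


def icdf(p, lo=-10.0, hi=10.0):
    # Inversion: solve cdf(x) = p for x inside the bracket [lo, hi].
    return brentq(lambda x: cdf(x) - p, lo, hi)


print(icdf(0.975))  # close to 1.96
```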
Tests: extensive testing using hypothesis. At some point this needs to be deeply reviewed. Should cover every case from NaN to out of bounds, etc.

Meeting 3 (2024-02-29 20 CET)

Discussion around fit
- Evgeni: have a _fit, and do the minimum in fit
- Failure mode: be explicit about the non-guarantee of the results. Provide some reasonable failure modes by propagating messages from the underlying optimizers. We can intercept the status and make suggestions or give better messages. This gives more transparency and forces users to be more mindful about what they are doing.
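
A minimal sketch of propagating the optimizer's status, assuming a maximum-likelihood fit of a normal distribution done with `scipy.optimize.minimize`. The helper `fit_normal` is hypothetical and only illustrates the failure-mode idea; it is not the planned `fit` implementation.

```python
import numpy as np
from scipy.optimize import minimize


def fit_normal(data):
    """Hypothetical helper: maximum-likelihood fit of a normal distribution."""
    data = np.asarray(data, dtype=float)

    def negative_log_likelihood(params):
        mu, log_sigma = params  # optimize log(sigma) so sigma stays positive
        sigma = np.exp(log_sigma)
        z = (data - mu) / sigma
        return np.sum(0.5 * z**2 + log_sigma + 0.5 * np.log(2.0 * np.pi))

    start = np.array([np.median(data), 0.0])
    result = minimize(negative_log_likelihood, start, method="Nelder-Mead")
    if not result.success:
        # Surface the optimizer's own message instead of returning silently.
        raise RuntimeError(f"Fit did not converge: {result.message}")
    mu, log_sigma = result.x
    return mu, np.exp(log_sigma)


rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=2.0, size=2000)
mu_hat, sigma_hat = fit_normal(sample)
print(mu_hat, sigma_hat)  # close to the sample mean and standard deviation
```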
Documentation: number of files concern
- Maybe not that important to start with as we will start either way with a limited number of distributions
API surface: a lot of methods
- [i]ccdf: maybe we don't need these, for instance. Maybe run a poll.
- log functions: maybe use args like in R. But for things like inverse, complementary, etc., the order matters and arguments might change, making things more complicated.
  - Let's not do magic by automatically generating methods like cdfc, which would let people combine letters as they want.
- Moments: raw, central, std seems an ok tradeoff vs offering a single function.
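
One reason the log and complement variants matter is numerical precision in the tails. The sketch below uses the existing `scipy.stats.norm` methods `cdf`, `sf` and `logsf` as stand-ins for `cdf`, `ccdf` and `logccdf`; it says nothing about how the new infrastructure implements them.

```python
import numpy as np
from scipy import stats

# Moderate tail: the naive complement loses everything to rounding,
# while the dedicated complement keeps a meaningful value.
naive_ccdf = 1.0 - stats.norm.cdf(10.0)
direct_ccdf = stats.norm.sf(10.0)

# Far tail: even the dedicated complement underflows to zero,
# but its logarithm is still representable.
far_ccdf = stats.norm.sf(40.0)
far_log_ccdf = stats.norm.logsf(40.0)

print(naive_ccdf, direct_ccdf)
print(far_ccdf, far_log_ccdf)
```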
RFC content: just mention fit and transformations, no fancy transform, plot (to say it's coming), moment (maybe), no loc/scale.

Meeting 4 (2024-03-13 20 CET)

https://meet.google.com/tce-uegh-gnp

Agenda

Everyone please have a look at the code and play with the new infra (rv_infrastructure on Matt’s fork).
Tirth and Matt wanted to show something else I think?
Pamphile presents the results of the poll
RFC discussion
Christoph: sampling / rvs, can we use tools from stats.sampling?

Poll results

Moment: Consensus seems to be on having a single method. Matt will keep the 3 methods under the hood and have a go at making a higher level API.
Methods: keep as is with all the log, complement, inverse, etc.
Update params: discussion is around input validation and speed, the other argument is around being Pythonic. We will leave the update feature out at first.
- Globally disabling the input validation would be nice to have in general. But this would generate some discussions. For now we will keep that as a param on the class instantiation.
Methods which will be there: moments, sample. Fitting to data is wanted but not coded yet; it will depend on whether Matt gets some time. Plot is not entirely finished; it will be used to show things but will be a quick follow-up. Plot is very handy for checking your data and distribution, which might even help reduce the number of bug reports.
Single Normal distribution
Only a limited set of distributions will be integrated. People like the idea of having a separate repo for distributions.
Duplication in the doc is not a concern
Support for iv_policy: keep the current behaviour
Support for method (strategy to compute distribution): leave alone

Consistency for axis, nan_policy, keepdims, masked array

Matt shared a document that he and Tirth prepared.

Masked arrays are an issue and not necessarily faster. Warren had written a doc explaining why ma should be avoided, so we have been progressively moving away from them.
Proposal to deprecate critical values, as they are typically tabulated and no longer relevant now that accurate p-values are available. Sometimes we make a curve fit which is not accurate, etc. We can instead do a Monte Carlo if we don't have a good approximation.
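
The Monte Carlo alternative to tabulated critical values can be sketched in plain NumPy. The statistic (sample skewness), the standard normal null, and the helper names are assumptions chosen for illustration; SciPy also provides `scipy.stats.monte_carlo_test` for this purpose.

```python
import numpy as np


def skewness(sample, axis=-1):
    """Sample skewness (biased estimator) along `axis`."""
    centered = sample - sample.mean(axis=axis, keepdims=True)
    m2 = np.mean(centered**2, axis=axis)
    m3 = np.mean(centered**3, axis=axis)
    return m3 / m2**1.5


def monte_carlo_pvalue(data, statistic, n_resamples=9999, seed=0):
    """Two-sided Monte Carlo p-value under a standard normal null."""
    rng = np.random.default_rng(seed)
    observed = statistic(data)
    # Each row is one sample drawn under the null hypothesis.
    null_samples = rng.standard_normal((n_resamples, data.size))
    null_stats = statistic(null_samples)
    # Count null statistics at least as extreme as the observed one;
    # the "+ 1" terms keep the p-value strictly positive.
    extreme = np.count_nonzero(np.abs(null_stats) >= abs(observed))
    return (extreme + 1) / (n_resamples + 1)


rng = np.random.default_rng(1)
data = rng.exponential(size=200)  # clearly skewed data
pvalue = monte_carlo_pvalue(data, skewness)
print(pvalue)
```

Because skewness does not depend on location or scale, sampling from the standard normal is enough to represent any normal null here.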

Meeting 5 (2024-04-03 20 CET)

https://meet.google.com/tce-uegh-gnp

Agenda

Infrastructure Updates
- Single Normal distribution
- Single moment method
- Enhanced plot method
- Added documentation
  - How to document attributes?
  - Should ContinuousDistribution be public?
Consistent axis / nan_policy support in scipy.stats
- Last time we agreed that critical_values and significance_level would be removed from anderson/anderson_ksamp
- gh-20137 and gh-20284 give axis support to pearsonr with the usual broadcasting conventions, ignoring the (unfortunate) precedent set by spearmanr
- Proposed changes to gstd (gh-20285)
Array API Masked Arrays? (gh-20363)
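
To make "the usual broadcasting conventions" concrete, here is a NumPy-only sketch of a correlation computed along one axis while the remaining dimensions broadcast. The helper `pearson_r` is hypothetical and is not the implementation from the referenced PRs.

```python
import numpy as np


def pearson_r(x, y, axis=-1):
    """Pearson correlation along `axis`; the other dimensions broadcast."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x, y = np.broadcast_arrays(x, y)
    xc = x - x.mean(axis=axis, keepdims=True)
    yc = y - y.mean(axis=axis, keepdims=True)
    numerator = np.sum(xc * yc, axis=axis)
    denominator = np.sqrt(np.sum(xc**2, axis=axis) * np.sum(yc**2, axis=axis))
    return numerator / denominator


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1, 100))  # 3 series of 100 observations
y = rng.normal(size=(1, 4, 100))  # 4 series of 100 observations
r = pearson_r(x, y, axis=-1)
print(r.shape)  # (3, 4): every x series paired with every y series
```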

Notes

Infrastructure Updates
- Single Normal distribution
  - Make sure StandardNormal object behaves like a Normal object, with mu and sigma
- Single moment method ✔
- Enhanced plot method ✔
- Added documentation
  - Should ContinuousDistribution be public? Consensus: not initially.
  - Tirth and Albert will be looking into a Sphinx extension to prevent duplication of doc and also the issue of examples.
  - Pamphile will look at the way to document attributes
  - Matt will ask everyone to review sections of the documentation
- Matt will work on fit and ShiftedScaledDistribution

Meeting 6 (2024-04-18 20 CET)

Agenda

Going over the suggestions on the PR
Discuss next steps with the RFC

Notes

Going through comments on PR
Tirth found a solution to document distributions without duplicating rst files. He is going to make a PR on Matt's branch.
Matt is going to finish with all suggestions and then prepare a draft to post on the forum