
SciPy Distribution Infrastructure

Effort to get scipy.stats._distribution_infrastructure into SciPy.

Working PR: https://github.com/mdhaber/scipy/pull/110

Meeting 1: Hello! Presentation of the project (2024-01-31 20 CET)

Agenda

Tasks

  • Review the sample lmoment function, in case we eventually want to support the method of L-moments for fitting distributions

Meeting 2: parameters, moments and tests (2024-02-14 20 CET)

  • Matt presents how to create a distribution

    • Setting ranges of parameters with special objects as class attributes.
      • These are also used to automatically generate the tests with hypothesis.
      • Be clear about the precision we support, e.g. state in the code, as NumPy does, up to which precision we provide guarantees.
    • Minimal overriding needed for basic distributions (e.g. just the PDF)
    • Does not yet handle constraints between parameters. Whether this is needed in the first PR depends on the initial set of distributions we will support. Either way, the infrastructure should allow for it even if we don't implement it ourselves.
    • Transformations: basic infra will be there for the first PR, but more left for future improvements
    • Speed. TL;DR it's fast. Vectorization is used and we will add Array API support
    • Quad: we now have other nice tools that Matt added (reviewed mostly by Albert and a bit by Pamphile), e.g. root finder, bracketer, etc.
  • Moment

    • Now we have central, standard and raw moments
      • Difficult to propose a single function, as people usually want a different type of moment depending on the order: raw for 1, standard for 2, etc. Whether to do something else is an open question.
  • Visualization

    • Convenience functions to plot things like the PDF
  • Presentation of the logic that determines which functions need to be implemented and how inversion and complement work.

    • Flow chart must end up in the doc!
  • Tests: extensive testing using hypothesis. At some point the tests need an in-depth review. They should cover every case, from NaN to out-of-bounds inputs, etc.
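
To illustrate the "minimal overriding" idea from this meeting, here is a rough sketch in which a subclass supplies only the PDF and the base class derives the CDF, complement and inverse numerically. The class names, the `support` attribute and the method names are hypothetical and are not taken from the working PR. It uses `scipy.integrate.quad` and `scipy.optimize.brentq` as stand-ins for the quadrature and root-finding tools mentioned above.

```python
from scipy import integrate, optimize


class PDFOnlySketch:
    """Hypothetical base class; NOT the API from the working PR."""

    # Assumption: a finite support, so it can double as the root bracket.
    support = (0.0, 1.0)

    def pdf(self, x):
        raise NotImplementedError("subclasses override only this method")

    def cdf(self, x):
        # Generic fallback: integrate the PDF from the left endpoint to x.
        a, _ = self.support
        value, _abserr = integrate.quad(self.pdf, a, x)
        return value

    def ccdf(self, x):
        # Complement: integrate the right tail directly rather than 1 - cdf.
        _, b = self.support
        value, _abserr = integrate.quad(self.pdf, x, b)
        return value

    def icdf(self, p):
        # Inversion: solve cdf(x) = p on the support with a bracketing solver.
        a, b = self.support
        return optimize.brentq(lambda x: self.cdf(x) - p, a, b)


class LinearDensity(PDFOnlySketch):
    """Density 2x on [0, 1]; the exact CDF is x**2."""

    def pdf(self, x):
        return 2.0 * x


dist = LinearDensity()
print(dist.cdf(0.5), dist.ccdf(0.5), dist.icdf(0.25))
```

A distribution with unbounded support would also need a bracketing step before the root finder, which this sketch leaves out.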

Meeting 3 (2024-02-29 20 CET)

  • Discussion around fit

    • Evgeni: have a _fit and do the minimum in fit
    • Failure mode: be explicit that results are not guaranteed. Provide reasonable failure modes by propagating messages from the underlying optimizers. We can intercept the status and give suggestions or better messages. This gives more transparency and pushes users to be mindful of what they are doing.
  • Documentation: concern about the number of files

    • Maybe not that important to start with, as we will start with a limited number of distributions either way
  • API surface: a lot of methods

    • For instance, maybe we don't need [i]ccdf. Maybe run a poll
    • log functions: maybe use arguments like in R. But for things like inverse, complementary, etc., the order matters and the arguments might change, making things more complicated.
      • Let's not do magic such as automatically generating methods like cdfc and letting people put letters as they want.
    • Moments: raw, central, std seems an ok tradeoff vs offering a single function.
  • RFC content: just mention fit and transformations, no fancy transform, plot (to say it's coming), moment (maybe), no loc/scale.
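
The fit discussion above (a private `_fit` doing the numerical work, a thin public `fit`, and optimizer messages propagated on failure) could look roughly like the following. This is only a sketch under stated assumptions: the function names, the `FitError` exception, the `maxiter` parameter and the choice of a normal likelihood with Nelder-Mead are all illustrative, not the design from the PR.

```python
import numpy as np
from scipy import optimize, stats


class FitError(RuntimeError):
    """Hypothetical exception carrying the optimizer's own message."""


def _fit(data, guess, maxiter=None):
    # Numerical core: minimize the negative log-likelihood of a normal
    # distribution parameterized by (mu, log(sigma)) so sigma stays positive.
    def nll(params):
        mu, log_sigma = params
        return -np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

    return optimize.minimize(nll, guess, method="Nelder-Mead",
                             options={"maxiter": maxiter})


def fit(data, guess=(0.0, 0.0), maxiter=None):
    # Thin public wrapper: validate input, delegate, report failures clearly.
    data = np.asarray(data, dtype=float)
    if data.size == 0 or not np.all(np.isfinite(data)):
        raise ValueError("data must be non-empty and finite")
    res = _fit(data, guess, maxiter=maxiter)
    if not res.success:
        # Propagate the optimizer's message instead of returning silently.
        raise FitError(f"fit did not converge: {res.message}")
    mu, log_sigma = res.x
    return mu, np.exp(log_sigma)


rng = np.random.default_rng(0)
data = rng.normal(2.0, 0.5, size=500)
mu_hat, sigma_hat = fit(data)
print(mu_hat, sigma_hat)
```

Calling `fit(data, maxiter=1)` raises `FitError` with the optimizer's message, which is the kind of explicit failure mode the notes ask for.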

Meeting 4 (2024-03-13 20 CET)

https://meet.google.com/tce-uegh-gnp

Agenda

  • Everyone please have a look at the code and play with the new infra (rv_infrastructure on Matt’s fork).
  • Tirth and Matt wanted to show something else I think?
  • Pamphile presents the results of the poll
  • RFC discussion
  • Christoph: sampling / rvs, can we use tools from stats.sampling?

Poll results

  • Moment: Consensus seems to be on having a single method. Matt will keep the 3 methods under the hood and have a go at making a higher-level API.
  • Methods: keep as is with all the log, complement, inverse, etc.
  • Update params: one side of the discussion is about input validation and speed; the other is about being Pythonic. We will leave the update feature out at first.
    • Globally disabling input validation would be nice to have in general, but it would generate some discussion. For now we will keep it as a parameter at class instantiation.
  • Methods which will be there: moments, sample. Fitting to data is wanted but not coded yet; whether it is included will depend on whether Matt gets some time. Plot is not entirely finished; it will be used to show things but will be a quick follow-up. Plot is very handy for checking your data and distribution, which may even help reduce the number of bug reports.
  • Single Normal distribution
  • Only a limited set of distributions will be integrated. People like the idea of having a separate repo for distributions.
  • Duplication in the doc is not a concern
  • Support for iv_policy: keep the current behaviour
  • Support for method (strategy to compute distribution): leave alone

Consistency for axis, nan_policy, keepdims, masked array

Matt shared a document that he and Tirth prepared.

  • Masked arrays are an issue and not necessarily faster. Warren had written a doc explaining why ma should be avoided, so we have been progressively moving away from them.
  • Proposal to deprecate critical values, as they are typically tabulated and no longer relevant now that accurate p-values are available. Sometimes we use a curve fit, which is not accurate, etc. We can do a Monte Carlo simulation instead if we don't have a good approximation.
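
The Monte Carlo alternative to tabulated critical values mentioned above can be sketched as follows. The helper name, its signature and the example statistic are hypothetical, chosen only to show the idea: simulate the statistic under the null hypothesis and report the fraction of simulated values at least as extreme as the observed one.

```python
import numpy as np


def monte_carlo_pvalue(sample, statistic, null_rvs, n_resamples=9999, seed=None):
    """Hypothetical helper: one-sided Monte Carlo p-value for `statistic`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    observed = statistic(sample)
    # Simulate the statistic under the null, with the observed sample size.
    null = np.array([statistic(null_rvs(rng, sample.size))
                     for _ in range(n_resamples)])
    # The +1 terms count the observed value as one draw, so p is never 0.
    return (1 + np.count_nonzero(null >= observed)) / (1 + n_resamples)


def abs_mean(x):
    return abs(np.mean(x))


def standard_normal(rng, n):
    return rng.standard_normal(n)


rng = np.random.default_rng(1)
p_null = monte_carlo_pvalue(rng.standard_normal(20), abs_mean,
                            standard_normal, seed=2)
p_shifted = monte_carlo_pvalue(rng.normal(3.0, 1.0, size=20), abs_mean,
                               standard_normal, seed=2)
print(p_null, p_shifted)
```

The resolution of the p-value is limited by `n_resamples`, so a small p-value can only be reported as "at most" one over the number of resamples plus one.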

Meeting 5 (2024-04-03 20 CET)

https://meet.google.com/tce-uegh-gnp

Agenda

Notes

  • Infrastructure Updates
    • Single Normal distribution
      • Make sure a StandardNormal object behaves like a Normal object, with mu and sigma
    • Single moment method ✔
    • Enhanced plot method ✔
    • Added documentation
      • Should ContinuousDistribution be public? Consensus: not initially.
      • Tirth and Albert will be looking into a Sphinx extension to prevent duplication of doc and also the issue of examples.
      • Pamphile will look into how to document attributes
      • Matt will ask everyone to review sections of the documentation
    • Matt will work on fit and ShiftedScaledDistribution

Meeting 6 (2024-04-18 20 CET)

Agenda

  • Going over the suggestions on the PR
  • Discuss next steps with the RFC

Notes

  • Going through comments on PR
  • Tirth found a solution to document distributions without duplicating rst files. He is going to make a PR on Matt's branch.
  • Matt is going to finish with all suggestions and then prepare a draft to post on the forum