# NumFOCUS Summit 2023 NumPy Ideas

## Summit details

- Date: 11/09/2023 to 12/09/2023
- Location: Novotel Amsterdam Schiphol Airport - [maps](https://goo.gl/maps/wmu2zFxZj3z2ZvXz7)
- Representatives: Inessa Pawson ([InessaPawson](https://github.com/InessaPawson)), Ganesh Kathiresan ([ganesh-k13](https://github.com/ganesh-k13))

## Exploring bots

### Coverlay Reports

The lack of codecov reporting has already been discussed [here](https://github.com/numpy/numpy/issues/11369#issue-333114350), but was addressed via this [PR](https://github.com/numpy/numpy/pull/11567). The goal of this proposal is to ensure that we are aware of which areas of the code have low coverage, and that low coverage is not due to an oversight in review. This will also provide a very good avenue for new contributors to add test cases to low-coverage areas. We can host regular sprints to increase code coverage by showing the community how to add test cases.

Maintainers decided to configure two settings:

- The minimum coverage threshold is 1%, which ensures that no PR is blocked due to low coverage.
- The bot that displays the report is suppressed, which means one has to check the report manually.

We can, however, build on top of this and generate a weekly report that informs the community of the increase or decrease in code coverage. We can integrate this in two ways:

- Create a Slack channel dedicated to codecov, with the weekly report uploaded to the channel.
- Update a tracking GitHub issue that everyone can access to see where coverage is lacking on a weekly basis.

![](https://hackmd.io/_uploads/rJxcVg11T.png)
_Source: https://app.codecov.io/gh/numpy/numpy?search=&trend=12%20months_

### Stale Pull Request

actions: https://github.com/actions/stale

NumPy has about 200 open PRs today, and the old ones might no longer be relevant. Going over them regularly and closing them would take a lot of maintainer time.
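The weekly coverage report floated above could start small. Below is a minimal, hypothetical sketch of the report-formatting side only; fetching the actual numbers from the Codecov API (or from a coverage artifact) is deliberately left out, and the function name and message layout are illustrative, not an existing interface:

```python
# Hypothetical sketch of the weekly coverage report described above.
# The two percentage inputs would come from the Codecov API in a real
# implementation; here we only show how the weekly summary could look.

def format_weekly_report(project, last_week, this_week):
    """Build a short coverage summary suitable for a Slack message
    or a comment on a tracking GitHub issue."""
    delta = this_week - last_week
    trend = "up" if delta > 0 else "down" if delta < 0 else "flat"
    return (
        f"[{project}] weekly coverage report\n"
        f"  last week: {last_week:.2f}%\n"
        f"  this week: {this_week:.2f}%\n"
        f"  trend:     {trend} ({delta:+.2f} pp)"
    )

print(format_weekly_report("numpy", 84.10, 84.35))
```

A scheduled GitHub Actions workflow could run a script like this weekly and post the result to either of the two channels mentioned above.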
Automating the closing of stale PRs in a graceful way saves a lot of time. Example from scikit-image: https://github.com/scikit-image/scikit-image/blob/main/.github/workflows/dormant_issues_prs.yml

Additionally, a few other things were tweaked:

- Some projects did not use the `stale` tag; a friendlier tag such as `dormant` was used instead.
- The action runs on only a few PRs per day. This helps with rate limits, and it also avoids closing so many PRs at once that a genuine one slips through the cracks.
- Some PRs might legitimately take a long time and must not be closed. This is ensured in two ways:
  - Ignore PRs with certain labels.
  - Allow a grace period between when a PR is flagged for closing (via a comment) and when it is actually closed.

![](https://hackmd.io/_uploads/rJ8I8xJya.png)
_Source: https://github.com/scikit-image/scikit-image/pull/6776_

## Developer Experience

### Pytest-Bisect integration for Spin

Upon failure of a test case, we can automatically find the bad commit by using `git-bisect` and re-running the test case on each commit.

#### Change proposal

Add a new `spin` command that takes:

- Good commit (optional)
- Bad commit
- Test case

`spin` will then bisect, building each commit and running the single test case. This way we can eventually find the bad commit.

### Autogenerate issue metadata on test failures or crashes

Issue reporting is quite easy today with the templates provided by GitHub. We can, however, improve this in two ways:

- Not everyone knows they can create issues in GitHub in case of crashes/failures.
- Autogenerate useful information that can be added to the issue for better debugging.

A few projects display useful information or even generate a crash bundle that can be uploaded to the issue to help with debugging.

#### Generate info on UT failures

In case of a unit-test failure, we can collect information from `np.show_config` and `np.show_runtime`, along with the traceback of the failure, and populate a text file whose contents can be copy-pasted into a new GitHub issue.
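The UT-failure report just described could be assembled roughly as follows. This is a minimal sketch: it relies on `np.show_config` and `np.show_runtime` printing to stdout (true for current NumPy), degrades gracefully when NumPy or one of the helpers is unavailable, and `build_issue_report` is a hypothetical name, not an existing API:

```python
# Sketch of the proposed unit-test failure report. The NumPy helpers
# print to stdout, so we capture their output; everything else here
# (function names, section layout) is illustrative.

import io
from contextlib import redirect_stdout

def capture(printer):
    """Run a helper that prints to stdout and return its output."""
    buf = io.StringIO()
    try:
        with redirect_stdout(buf):
            printer()
    except Exception as exc:  # keep the report usable even if a helper fails
        buf.write(f"<unavailable: {exc}>\n")
    return buf.getvalue()

def build_issue_report(traceback_text):
    """Assemble a text blob that can be pasted into a new GitHub issue."""
    try:
        import numpy as np
        # getattr guards against older NumPy versions lacking show_runtime
        config = capture(getattr(np, "show_config", lambda: print("<n/a>")))
        runtime = capture(getattr(np, "show_runtime", lambda: print("<n/a>")))
    except ImportError:
        config = runtime = "<numpy not importable>\n"
    return (
        "### Build configuration\n" + config
        + "\n### Runtime information\n" + runtime
        + "\n### Traceback\n" + traceback_text + "\n"
    )
```

A pytest plugin hook (e.g. on test failure) could call this and write the result to a file, ready to paste into the issue template.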
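Returning to the pytest-bisect proposal above: the new `spin` command could be a thin wrapper around `git bisect run`. The sketch below only builds the command sequence rather than executing it; the function name and CLI surface are assumptions, and a real integration would fold a build step (e.g. `spin build`) into the run script:

```python
# Hypothetical sketch of the proposed spin bisect command. We only
# construct the `git bisect` invocations; running them (e.g. via
# subprocess.run) and rebuilding on each commit is left to the real tool.

def bisect_commands(bad, test_id, good=None):
    """Return the git command sequence that bisects a failing test.

    `git bisect run` re-runs the given command on each candidate commit
    and uses its exit code to classify the commit as good or bad.
    """
    return [
        ["git", "bisect", "start"],
        ["git", "bisect", "bad", bad],
        # mark the given commit (or the current checkout, if none) as good
        ["git", "bisect", "good"] + ([good] if good else []),
        ["git", "bisect", "run", "python", "-m", "pytest", test_id],
        ["git", "bisect", "reset"],
    ]
```

Each inner list maps directly onto one `subprocess.run` call, which keeps the wrapper easy to test without a real repository.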
#### Collect crash information

In case of segmentation faults or other crashes, we can create a "crash bundle" of sorts that can be uploaded to help narrow down the issue.

## Telemetry

### Documentation Telemetry

Various methods were discussed for collecting telemetry data for the packages. This data will provide useful information on which functions are used most by our users and will help focus development for a larger impact.

Some of the more intrusive approaches included an opt-in side package that would collect data on the imported packages and apply static analysis to identify the functions called. This would require a lot of work to adhere to data privacy laws and is a larger effort.

I suggested an alternative: use the documentation website's access logs to get datapoints of interest within a package. This data should be readily available and should not run afoul of any privacy laws, given its highly anonymous nature.

#### Why do we care about this data?

We can start small by just seeing which pages get the most hits and providing a list of the most popular functions. This will give developers a good idea of which functions are searched for the most. This by no means measures how popular a function is. For example, `numpy.min`, which is probably used a lot, is intuitive enough that hardly anyone visits its docs. `numpy.kron`, however, might get more hits in the docs, probably due to the non-intuitive nature of the function.

#### How to act on this data?

I heard that one of the projects used this click data to release blog posts on those functions, diverting the traffic to more meaningful documentation with examples. This provides a great source of topics for newcomers or for GSoD projects.

### Google search analytics for missing data

Documentation telemetry is helpful for existing functions and improving upon them, but we still need a way to find out what kinds of features users require.
GitHub issues are certainly useful, but not everyone has a GitHub account or searches the issue tracker. Most users tend to reach Stack Overflow, for example, via a Google search.

The proposal is to collect analytics data to help maintainers make more data-driven decisions on which features to focus on.

#### High level overview

Taking two of the most-liked issues in NumPy today:

- [first nonzero element (Trac #1673)](https://github.com/numpy/numpy/issues/2269)
- [ENH: minmax function](https://github.com/numpy/numpy/issues/9836)

![](https://hackmd.io/_uploads/SJJc5l1ka.png)

We can see that Google data provides more insight and a different view of the problem. MinMax is probably searched for more by users, but is not `liked` as much on GitHub. This gives us a slightly more holistic view of what users are requesting.

This data is, of course, highly subject to noise, and we cannot conclude anything solid from the search data alone. But we can use it as a tool for future developments, to make data-driven decisions on what the community needs.

## Final thoughts

The summit was fantastic! We could collaborate with a large number of projects and understand their pain points. I made a recommendation to the board to have a central technical advisor, like a CTO, to ensure we are not duplicating efforts across projects. Common problems such as Cirrus and build systems were trending topics that can be solved with collective effort to save developer time.

NumFOCUS projects are on an upward trend as more and more projects are added. Collaborations such as this summit give developers a forum to think of new ideas and gather more data on solving older problems. I am grateful for this opportunity and thankful to the NumPy steering council for providing it. I hope the ideas proposed above get picked up one day and help the community.