# Time log and description
[TOC]
## Project Description
This [GitHub issue](https://github.com/astropy/astropy-project/issues/413) contains all the details
## Statement of Work
Performance optimisation for core pieces of functionality within Astropy:
1. Engage with the community to identify which functionalities would benefit the most from optimisation (either faster runtime or lower memory usage)
2. Estimate the time required to optimise the most impactful functionalities
3. Set up a variety of scenarios to capture initial (pre-optimisation) benchmarks
4. Profile the python/C code line-by-line to identify potential hotspots
5. Optimise the relevant hotspot lines of code, including developing new formulations, algorithms, and mathematical transformations
6. Run the benchmarks to capture the improvement in runtime/memory usage

Repeat steps 2-6 for as many independent pieces of functionality (identified in step 1) as time allows
## Invoices
- Invoice for Aug 2024: 45 mins + 2 hrs + 1.5 hrs + 3 hrs + 1.5 hrs + 1.5 hrs + 45 mins = 11 hrs @USD 150/hr = USD 1650.
- Invoice for Sep 2024: 3 hrs + 3 hrs + 3 hrs + 1 hr + 1.5 hrs + 0.75 hrs + 2.25 hrs + 2.5 hrs + 0.5 hrs [optimisation] + 0.5 hrs [code-review] = 18 hrs @USD 150/hr = USD 2700
## Time log (overhead)
### June-July: Setting up contract with NumFocus: 45 mins
- Initial contract review (11/6/2024): provided feedback on the contract text and requested an example statement of work. Flagged a problematic clause implying that I could miss payment for more than a month's worth of work, which is unacceptable
- JWarner from NF clarified that the wording is standard legalese and should not apply, since NF makes sure all projects are fully funded before onboarding. However, my view is that if all projects are fully funded beforehand, then that clause does not need to be there
- There is an additional clause later on stating that NF has no liability for loss of business etc. *above* the invoiced amount, which can likely be interpreted as the full amount
- Added Statement of Work to the contract
- Created and populated time-log for tracking the project
## Time log (optimisation)
- 20th Aug : 2.0 hrs
- 21st Aug : 1.5 hrs
- 23rd Aug : 3.0 hrs
- 28th Aug : 1.5 hrs
- 29th Aug : 1.5 hrs
- 30th Aug : 45 mins
### Potential list of issues
- [match_coordinates_sky/3d performance degradation and memory exhaustion (w/ fix)](https://github.com/astropy/astropy/issues/14002)
- [Slow initialization of astropy.wcs.WCS() from a FITS header](https://github.com/astropy/astropy/issues/4520)
- [Optimizing Cosmology.comoving_distance](https://github.com/astropy/astropy/issues/5281)
- [Algorithm Improvements for Fast Lomb Scargle](https://github.com/astropy/astropy/issues/5941)
- [Astropy.Tables performance issue on array arithmetic](https://github.com/astropy/astropy/issues/6253)
- [Performance issues when writing VO tables in binary mode](https://github.com/astropy/astropy/issues/6519)
- Possibly solved, but the issue needs to be verified and closed: [Performance issue with WCS units](https://github.com/astropy/astropy/issues/8414)
- [Large memory footprint in astropy.io.votable.parse_single_table](https://github.com/astropy/astropy/issues/8946)
- [Performance: quantity comparison to 0 is incredibly expensive](https://github.com/astropy/astropy/issues/10396)
- [Knuth's rule fails with simple and small array (eats up system's memory)](https://github.com/astropy/astropy/issues/11357)
- TBD
### 20th Aug : 2 hrs
#### Existing issues
- Familiarising myself with open performance issues (i.e., all open issues tagged with the performance label) and identifying potential issues of interest
- Looks like `SkyCoord` transforms or 3D matching could be an impactful improvement
- Another option is the Lomb-Scargle periodogram or the BLS code
- Looking at the C code for BLS
- The input parameter `N` could be changed to int64
- If the astropy code can be compiled with `-fopenmp-simd` (or `-fopenmp`), then constructs like `#pragma omp for simd reduction(+:mean_iva)` will vectorise the code and improve performance. Such a pragma could/should be protected with `#if _OPENMP >= 201307` (OpenMP 4.0, see https://stackoverflow.com/a/54304772). Similar pragma code already exists within that function, inside an `omp parallel` region guarded by `#ifdef _OPENMP` (the macro is defined by gcc/clang/icc when compiling with `-fopenmp`/`-openmp`); sketched below
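A minimal sketch of the guarded pragma, assuming a simple reduction loop (the function name and loop body are illustrative, not the actual BLS code):

```c
#include <stddef.h>

/* Guard the SIMD pragma so it only activates when the compiler supports
 * OpenMP 4.0 (_OPENMP >= 201307); older compilers just compile the plain
 * loop. Build with -fopenmp-simd (vectorisation only) or -fopenmp. */
double mean_inverse_variance(const double *ivar, size_t n) {
    double mean_iva = 0.0;
#if defined(_OPENMP) && _OPENMP >= 201307
#pragma omp simd reduction(+ : mean_iva)
#endif
    for (size_t i = 0; i < n; i++) {
        mean_iva += ivar[i];
    }
    return mean_iva / (double)n;
}
```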
#### Compiling and dynamic dispatch for native C code
- One option to speed up code is to compile native C code for a range of architectures and then dynamically dispatch at runtime (which is what numpy currently does). Two ways of doing this that I can see:
- use the `target_clones` attribute provided by recent gcc for all cpu-intensive functions. gcc will create a driver function and then dispatch to the relevant function (highest available instruction set) at runtime automatically. See a simple example [here](https://godbolt.org/z/c14Y3b8Wb) for a vectorized sum of an array.
- use the full-blown numpy machinery for baseline and dispatch setup. This might not even be feasible without also copying/adapting numpy's build system
#### Questions
- How are the compiler flags decided? I see [this closed issue](https://github.com/astropy/astropy/issues/6762) that discusses the C flag `-std=c99` and points out that `wcslib` uses `gnu99` by default. Could an option be to set `-std=gnu11` by default, and remove it if not supported by the `os + compiler` combo? `c99` allows `for` loop variables to be defined *within* the loop, which is a good coding pattern.
- Similarly, `-march=native/skylake-avx512/icelake-` or similar (this specific option could also depend on the compiler) might be useful. But it would need to be adapted for different OSes + hardware; for example, `clang` on M1/M2 OSX does not support `-march=native` (it needs `-march=apple-m1` or `-mtune=apple-m1`).
- How about fp16 support? Also, all maths seems to be done in double precision, which doubles the runtime and memory requirements and might not be essential in all use-cases (but is less prone to catastrophic precision loss due to numerical round-off).
- Such compiler flags need a "test and add/remove" type of check. The `archspec` package might be handy to at least figure out what the platforms could be capable of.
- For wheel-building, i.e., cross-platform builds, the situation might (should?) be easier since the compiler and OS are known/controlled.
- How about OpenMP/pthreads, or even the sub-interpreters available with python3.13?
### 21st Aug : 1.5 hrs
- Made a list of open issues within the hackmd document
- Identified the `_KnuthF` bin-setting function within histogram as a candidate for solving both a memory issue and a performance issue.
- The input data is duplicated and then sorted. While the sorting might improve numerical stability, I suspect it is not essential in most cases. The one reason the sorting is required is to calculate the data range within the `bins` function; however, that could be done by tracking `dmin, dmax`, which is an O(N) operation rather than the O(N log N) sort (these two vars could perhaps be added to a `__slots__` on the local `_KnuthF` class)
- Removing the data duplication and the sort should improve performance and reduce the memory footprint
### 23rd Aug : 3 hrs
- Setting up astropy + reading through the contribution guidelines
- https://www.astropy.org/contribute.html has 404 links for `developer install instructions` (pointing to https://docs.astropy.org/en/latest/development/workflow/get_devel_version.html)
- Opened https://github.com/astropy/astropy/issues/16871
- Created the directory structure `<myworkdir>` for my files, and then the `astropy` subdir with the `astropy.git` repo. Now when I put my ipynb within `<myworkdir>`, I immediately run into issues with importing the astropy package :'(
<details>
<summary>Traceback details</summary>
```python
AttributeError Traceback (most recent call last)
Cell In[7], line 6
3 sys.path = sys.path[1:]
5 import numpy as np
----> 6 from astropy import stats
7 import matplotlib.pyplot as plt
8 import timeit
File ~/sorsery-consulting/2024/astropy-cycle4-perf-opt/astropy/astropy/stats/__init__.py:15
2 """
3 This subpackage contains statistical tools provided for or used by Astropy.
4
(...)
11
12 """
14 from . import bayesian_blocks as _bb
---> 15 from . import (
16 biweight,
17 circstats,
18 funcs,
19 info_theory,
20 jackknife,
21 sigma_clipping,
22 spatial,
23 )
24 from . import histogram as _hist
25 from .bayesian_blocks import *
File ~/sorsery-consulting/2024/astropy-cycle4-perf-opt/astropy/astropy/stats/circstats.py:19
15 from typing import TYPE_CHECKING
17 import numpy as np
---> 19 from astropy.units import Quantity
21 if TYPE_CHECKING:
22 from numpy.typing import NDArray
File ~/sorsery-consulting/2024/astropy-cycle4-perf-opt/astropy/astropy/units/__init__.py:13
1 # Licensed under a 3-clause BSD style license - see LICENSE.rst
3 """
4 This subpackage contains classes and functions for defining and converting
5 between different physical units, as well as the units themselves.
(...)
10 code under a BSD license.
11 """
---> 13 from . import (
14 astrophys,
15 cgs,
16 core,
17 decorators,
18 errors,
19 misc,
20 photometric,
21 physical,
22 quantity,
23 si,
24 structured,
25 )
26 from .astrophys import *
27 from .cgs import *
File ~/sorsery-consulting/2024/astropy-cycle4-perf-opt/astropy/astropy/units/astrophys.py:10
3 """
4 This package defines the astrophysics-specific units. They are also
5 available in (and should be used through) the `astropy.units` namespace.
6 """
8 import numpy as np
---> 10 from astropy.constants import si as _si
12 from . import si
13 from .core import UnitBase, def_unit, set_enabled_units
File ~/sorsery-consulting/2024/astropy-cycle4-perf-opt/astropy/astropy/constants/__init__.py:27
23 from astropy import units
25 del units
---> 27 from . import cgs, si
28 from . import utils as _utils
29 from .config import codata, iaudata
File ~/sorsery-consulting/2024/astropy-cycle4-perf-opt/astropy/astropy/constants/cgs.py:7
1 # Licensed under a 3-clause BSD style license - see LICENSE.rst
2 """
3 Astronomical and physics constants in cgs units. See :mod:`astropy.constants`
4 for a complete listing of constants defined in Astropy.
5 """
----> 7 from .config import codata, iaudata
8 from .constant import Constant
10 for _nm, _c in (*sorted(vars(codata).items()), *sorted(vars(iaudata).items())):
File ~/sorsery-consulting/2024/astropy-cycle4-perf-opt/astropy/astropy/constants/config.py:12
8 import importlib
10 import astropy
---> 12 phys_version = astropy.physical_constants.get()
13 astro_version = astropy.astronomical_constants.get()
15 codata = importlib.import_module(".constants." + phys_version, "astropy")
AttributeError: module 'astropy' has no attribute 'physical_constants'
```
</details>
I think this is happening because I am not in the astropy base directory; when I am in the astropy base directory, the import works fine. So either I have to change my workflow and move my testing scripts into the base directory, or modify `sys.path` to insert the astropy base directory at the front (which I tried but did not work, so the first option it is).
- Created a jupyter notebook for benchmarking and line-by-line profiling of the code. Turns out that most/all of the time is spent in `np.histogram`. Replacing that with the equivalent call to `fast_histogram` results in a 10x speedup.
- Here's the callgraph and the results:
- With `fast_histogram` ![knuth_fast_histogram](https://hackmd.io/_uploads/BJxEO-UoA.png)
- Default with `np.histogram`
![knuth_astropy_histogram](https://hackmd.io/_uploads/B1xVObUiR.png)
### 28th Aug : 1.5 hrs
We all want the highest performance we can possibly get from the runtime hardware, but as package distributors we also want to make sure that the package can run on all reasonable hardware. For compiled code, if we compile for the most recent hardware, then that code will certainly run fast on recent hardware, but is almost guaranteed to crash with an `illegal instruction (SIGILL)` on any/all older hardware. One "easy" way to get "portable performance" for C code is to get the compiler to compile for the hardware the code is running on.
For example, Corrfunc does this by *only* distributing the source and always compiling on the install computer; the assumption is that the install computer has the same hardware as the running computer, which may not always be the case, e.g., on supercomputers where the login nodes and compute nodes might differ. Astropy, on the other hand, distributes pre-built wheels for combinations of CPython and OS. No special architecture-level compiler flags are passed, since the underlying hardware could really be anything. In practice, this means that the compiled target hardware is SSE2 (for x86_64 systems), hardware that was released back in 2001. Modern hardware contains many more improvements and a lot more vectorized instructions. So how do we take advantage of modern hardware while preserving the pre-built wheel release strategy?
Enter [**Function Multi-Versioning**](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target_005fclones-function-attribute). This is a feature provided by recent gcc and clang where the compiler automatically generates specialised functions targeting a given list of architectures, and then at runtime, the correct function that corresponds to the runtime hardware is chosen. All this is done **automatically** but requires adding a special "attribute" to the function that needs to be multi-versioned. Requires >= GLIBC 2.23, according to [the docs](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target_005fclones-function-attribute).
For example, adding the following attribute and compiling with a recent gcc (probably >= gcc8, but more recent is better; I checked with gcc14. One could check by successively dropping down the gcc version, but I didn't bother since there should be complete control over the gcc version used at build time)
```c
#include <stdint.h>

__attribute__((target_clones("default,sse4.2,avx,avx2,avx512f,arch=skylake-avx512")))
double sum_square(const double *restrict arr, const int64_t N) {
    double sum = 0.0;
    // This loop can also be vectorized using -Ofast,
    // or using #pragma omp simd reduction(+:sum) and compiling with -fopenmp-simd
    for (int64_t i = 0; i < N; i++) sum += arr[i] * arr[i];
    return sum;
}
```
will generate the following:
- specialised `sum_square` functions for `default` (SSE2), `sse4.2`, `avx`, `avx2`, `avx512f`, and Skylake-AVX512 CPUs
- a resolver function that checks the runtime CPU and calls the correct specialised function, i.e., no `SIGILL` at runtime (a hand-written analogue is sketched below)
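For intuition, the compiler-generated resolver is morally equivalent to the following hand-written dispatch. This is a sketch, not what gcc literally emits; the per-ISA `sum_square_*` variants are hypothetical, and `__builtin_cpu_init`/`__builtin_cpu_supports` are gcc/clang builtins on x86:

```c
#include <stdint.h>

/* Hypothetical per-ISA builds of sum_square (under target_clones, the
 * compiler generates these automatically). */
double sum_square_avx512f(const double *arr, int64_t N);
double sum_square_avx2(const double *arr, int64_t N);
double sum_square_default(const double *arr, int64_t N); /* SSE2 baseline */

typedef double (*sum_square_fn)(const double *, int64_t);

/* Inspect the runtime CPU once and return the best available variant,
 * which is what the gcc-emitted ifunc resolver does. */
static sum_square_fn resolve_sum_square(void) {
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return sum_square_avx512f;
    if (__builtin_cpu_supports("avx2"))    return sum_square_avx2;
    return sum_square_default;
}
```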
### 29th Aug: 1.5 hrs
- Refining and understanding the generated assembly for the [godbolt example](https://godbolt.org/z/bE44Ej5PM)
- Writing down the hackmd logs and the godbolt explanation for folks that might have clicked directly on that link
- Opened [issue 16902](https://github.com/astropy/astropy/issues/16902)
### 30th Aug: 45 mins
- Looking at `astropy.stats` and `astropy.convolution` for candidates. `astropy/stats/src/compute_bounds.c` looks the most promising for the repeated computation of sums and std-devs.
- Convolution does have an interesting use of force_inline to get the compiler to optimise out some branches. Looks a bit clunky but should work in practice
- Related: `wirth_median` modifies (semi-sorts) the input data array, which may be unexpected behaviour. The only call site passes an astropy-owned data buffer; however, it is not obvious that the function modifies the input buffer. Needs docstrings around both the `wirth_select` function and the single caller in `compute_bounds.c`
### 11th Sep: 3 hrs
- Slack message on #releasing
- Invoice for work (up to August) on OpenCollective
- Verified that `wirth_select` was not, in fact, sorting the entire array.
- Created new median-finding code using the `quickselect` algorithm with random pivot selection, which has better worst-case behaviour in practice than a fixed-pivot quickselect. Drafted with claude.ai (including tests) and confirmed against a sorted copy of the array that the code produces correct results; a minimal sketch of the idea follows this list
- Filled out the hackmd log and wrote the next-to-do list to resume from next time
- Added some justification on Slack about why vectorisation might be a reasonable trade-off, and clarified that vectorisation slowdowns are not really an issue on newer CPUs
- Read through the cibuildwheel jobs on GH Actions; seems that the emulated jobs take a *LOT* longer
- Added a comment on GH about the old manylinux and gcc requirements (gcc10 on the manylinux2014 image should be fine)
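For reference, a minimal sketch of the random-pivot quickselect idea (illustrative only, not the claude.ai-generated code; `wirth_select`-style semantics, i.e., the array is partially reordered in place):

```c
#include <stdint.h>
#include <stdlib.h>

static void swap_dbl(double *a, double *b) {
    double tmp = *a; *a = *b; *b = tmp;
}

/* Return the k-th smallest element (0-indexed) of arr[0..n-1].
 * The random pivot makes the O(N^2) worst case (e.g., a fixed pivot on
 * already-sorted input) vanishingly unlikely; expected cost is O(N). */
double quickselect(double *arr, int64_t n, int64_t k) {
    int64_t lo = 0, hi = n - 1;
    while (lo < hi) {
        /* pick a random pivot and move it to the end */
        int64_t p = lo + rand() % (hi - lo + 1);
        swap_dbl(&arr[p], &arr[hi]);
        /* Lomuto partition around the pivot value arr[hi] */
        int64_t store = lo;
        for (int64_t i = lo; i < hi; i++) {
            if (arr[i] < arr[hi]) swap_dbl(&arr[i], &arr[store++]);
        }
        swap_dbl(&arr[store], &arr[hi]); /* pivot is now in its final position */
        if (k == store) break;
        if (k < store) hi = store - 1;
        else           lo = store + 1;
    }
    return arr[k];
}
```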
### 12th Sep: 3 hrs
- Responded to Maarten on GH (30 mins)
- Reading the [NPY SIMD implementation docs](https://numpy.org/doc/2.1/reference/simd/index.html) (30 mins)
- Finding and reading through multiple issues related to SIMD + cpu dispatch (2 hrs)
- relevant issues/PRs/docs:
- the [SIMD implementation comment](https://github.com/numpy/numpy/pull/13516#issuecomment-558859638)
- [NEP 37](https://numpy.org/neps/nep-0037-array-module.html) and [NEP 38](https://numpy.org/neps/nep-0038-SIMD-optimizations.html)
- [ENH: C++ SIMD wrapper](https://github.com/numpy/numpy/issues/24084)
- [ENH, SIMD: Add C++ wrapper for universal intrinsics](https://github.com/numpy/numpy/pull/21057)
- [Using highway for SIMD](https://github.com/numpy/numpy/compare/main...Mousius:numpy:hwy-spike?expand=1)
- [DOC: add NEP 54 on SIMD - moving to C++ and adopting Highway (or not)](https://github.com/numpy/numpy/pull/24138/files)
- PR with [implementation for tanh](https://github.com/numpy/numpy/pull/20363/files)
- Here's the example code implementing a function that squares an array with universal intrinsics (within `simd.hpp`): https://github.com/numpy/numpy/pull/21057/files#diff-cee58cafc4ff85b8fd3d174e5fc719dcffeee5bd72d820701087b920d6f302ecR38-R87. However, that PR is not merged.
- Sent email to numpy developers list
### 13th Sep: 1 hr
- Responded to [GH issue on numpy](https://github.com/numpy/numpy/pull/21057), [read NEP 54](https://numpy.org/neps/nep-0054-simd-cpp-highway.html), skim highway docs (45 mins)
- Added issue on [astropy-project](https://github.com/astropy/astropy-project/issues/442) about improving the documentation for contributors on funded projects (15 mins)
### 16th Sep: 1.5 hrs
- Responded to GH comments about SIMD + highway (30 mins)
- Tried to create a benchmarked C example for wirth_median on my laptop but ran into errors with both gcc and clang saying `ifunc not supported` (1 hr).
- Updating the entire toolchain (macports) resulted in errors installing gcc14 and gcc15 :(
- No relevant issues found on stackoverflow or the gcc bugtracker.
### 18th Sep: 0.75 hrs
- Added time log and notes (15 mins)
- Tried to get the target_clones to work on OSX (30 mins)
### 25th Sep: 2.25 hrs
- Reading through forums and the llvm compiler source to find out what ISAs are supported and whether target_clones works (45 mins)
- Looks like it does not; `target(ISA)` works though. On an OSX machine with Apple Silicon (i.e., >= Apple M1), the target arch would be `target(arch=armv8.5-a)`, since that is what the Apple M1 supports. The regular Intel CPUs would be different though (not sure what), i.e., we would need to create two separate wheels on OSX: Apple Silicon and Intel CPUs
- There's also [this](https://github.com/OpenMathLib/OpenBLAS/blob/092986582f8fb27dde00918d1f618776fe58c11a/cpuid_arm64.c#L280-L283) to detect ARM on OSX
- The code does not compile at all on my laptop ("ifunc not supported on target") but does compile on godbolt (though without specialised per-arch functions like on x86_64)
- Wrote up the benchmarking in C for `wirth_select` and benchmarked on OSX (Apple M2) - 1.5 hrs
- With random data, the timings seem quite variable, with some boost from the `target(arch)` attribute (up to ~30%), but nothing conclusive.
- Realised that for an even number of elements, when the `kth_smallest` function is called twice, the second call could limit the number of elements: the first call has already re-arranged the array so that all elements to the left of `n/2` are smaller than the `n/2`-th element. That is, instead of searching all `n` elements, we can limit the search to `n/2` elements (a sketch follows this list).
- Added tests to make sure all functions returned the same results as the astropy `wirth_select` code.
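A sketch of that observation, assuming a `kth_smallest(arr, n, k)` with `wirth_select`-style semantics (it returns the k-th smallest and leaves everything left of index `k` no larger than `arr[k]`):

```c
#include <stdint.h>

/* Assumed to behave like astropy's wirth_select: returns the k-th
 * smallest of arr[0..n-1], leaving arr partitioned around index k. */
double kth_smallest(double *arr, int64_t n, int64_t k);

/* Median for an even number of elements n. */
double median_even(double *arr, int64_t n) {
    /* After this call, arr[0..n/2-1] are all <= arr[n/2], so the first
     * n/2 positions hold exactly the n/2 smallest elements. */
    double upper = kth_smallest(arr, n, n / 2);
    /* Hence the (n/2 - 1)-th smallest of the full array lives in the left
     * half, and the second search can be limited to n/2 elements. */
    double lower = kth_smallest(arr, n / 2, n / 2 - 1);
    return 0.5 * (lower + upper);
}
```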
### 26th Sep: 2.5 hrs
- Moved over to OzSTAR to try out the AVX512 Intel Skylake CPUs, and the new NT with AMD EPYC CPUs (AVX2).
- Added header macro definitions so that the code compiles correctly on both OSX and linux (a condensed sketch of the header follows this list). The header has these three patterns:
1. If function multiversioning (FMV) is not requested (off by default, to prevent accidental FMV provisioning), then the macros are left blank
2. If FMV is requested, then two macros, viz., `FMV_ARCH_ATTR` and `FMV_FLATTEN_ATTR`, are defined as:
2a. OSX: `FMV_ARCH_ATTR` := `__attribute__((target("arch=armv8.5-a")))`
2b. Linux: `FMV_ARCH_ATTR` := `__attribute__((target_clones("default,arch=skylake-avx512,arch=x86-64-v2,arch=x86-64-v3,arch=x86-64-v4,arch=sapphirerapids")))`
2c. On all: `FMV_FLATTEN_ATTR` := `__attribute__((flatten))`
The first macro declares the set of target_clones architectures to build for on `x86_64`, and targets the Apple M1 (whose features are supported by all later Apple CPUs) on Apple silicon. The second macro, `FMV_FLATTEN_ATTR`, tells the compiler to inline *all* functions called within the attributed function, which may or may not be what is wanted.
- Added a benchmarking function that keeps running the target function up to a max number of iterations, or until the scatter in the runtimes is less than 1% of the average runtime for that function (a sketch of this loop also follows this list).
- Ran the benchmarks on both the Skylake and AMD CPUs, but compiled the code only on the Skylake CPUs. The code, compiled on Skylake, ran fine on the AMD EPYC cores (did not try the other way round)
- The runtimes were again variable, depending on the number of points
- Fixed the claude.ai-generated `quick_select` code
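A condensed sketch of the FMV header described above (macro names as in my notes; the `USE_FMV` guard and exact target lists are illustrative):

```c
/* fmv.h -- FMV must be requested explicitly via -DUSE_FMV so that
 * multi-versioning is never provisioned by accident. */
#if !defined(USE_FMV)
  #define FMV_ARCH_ATTR       /* expands to nothing */
  #define FMV_FLATTEN_ATTR
#elif defined(__APPLE__) && defined(__aarch64__)
  /* single target on Apple Silicon: armv8.5-a (Apple M1 and later) */
  #define FMV_ARCH_ATTR __attribute__((target("arch=armv8.5-a")))
  #define FMV_FLATTEN_ATTR __attribute__((flatten))
#else
  /* x86_64 linux: gcc clones per target and dispatches at runtime */
  #define FMV_ARCH_ATTR __attribute__((target_clones( \
      "default,arch=x86-64-v2,arch=x86-64-v3,arch=x86-64-v4," \
      "arch=skylake-avx512,arch=sapphirerapids")))
  #define FMV_FLATTEN_ATTR __attribute__((flatten))
#endif

/* usage: FMV_ARCH_ATTR FMV_FLATTEN_ATTR double my_hotspot(...); */
```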
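And a sketch of the convergence-based benchmark loop (the 1% scatter criterion is from the note above; `get_time_sec`, the warm-up threshold, and the callback signature are illustrative):

```c
#include <math.h>
#include <time.h>

static double get_time_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Run fn(arg) until the std. dev. of the per-iteration runtimes drops
 * below 1% of their mean, or until max_iter; returns the mean runtime. */
static double benchmark(void (*fn)(void *), void *arg, int max_iter) {
    double sum = 0.0, sumsq = 0.0;
    for (int i = 1; i <= max_iter; i++) {
        const double t0 = get_time_sec();
        fn(arg);
        const double dt = get_time_sec() - t0;
        sum += dt;
        sumsq += dt * dt;
        if (i >= 5) { /* need a few samples before testing convergence */
            const double mean = sum / i;
            const double var = sumsq / i - mean * mean;
            if (sqrt(fmax(var, 0.0)) < 0.01 * mean) return mean;
        }
    }
    return sum / max_iter;
}
```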
### 27th Sep: 30 mins
- Added logs
## Time log (code review)
- 11th Sep (wcslib-8.3_pthreads patch from Mark): 0.5 hrs
### Code-review
- [x] Review the latest wcslib code from Mark (11th Sep)
- [ ] TBD
## Next To do (added on 11th Sep)
- [x] Follow-up on Astropy Slack (#releasing channel) about any comments
- [x] Follow-up on GitHub (https://github.com/astropy/astropy/issues/16902)
- [x] Ask the numpy developers list
- [ ] Create a benchmark with `fast_sigma_clipping` using both with and without the `target_clones` directives on `compute_bounds` and `wirth_median` (4 combos of target_clones)
- [ ] Try on x86_64 (does not work on my OSX laptop with Apple Silicon)
- [ ] Add an explanatory note on top of `wirth_select` to say that the entire array is *NOT* being sorted. Also, quickselect might be faster (O(N) complexity on average but O(N^2) worst-case).
- [ ] Change the function signatures to use int64 rather than int (i.e., allow handling of ~2 billion+ elements)
- [ ] Test whether the `quickselect` code is faster than the `wirth_select` present in astropy. This will require a benchmark suite with different types of data (random, already sorted, sorted runs, reverse sorted) and with different number of elements (few, 100s, 10000s, millions, billions)
- [ ] Open a GitHub issue for docs on existing code snippets to i) benchmark patches for both time and memory, ii) generate fake data with certain specs - random, repeats, sorted