# fMRIprep LTS
## Oct 13th, 2022
Attendance: YC, TG, PB
* Major update by YC. [Link to google slides](https://docs.google.com/presentation/d/10e35C_tx_q9EyOSe_kaAerTkNvfsqUrWzA80OgpHuGw/edit#slide=id.g1f88252dc4_0_162)
* Ran a bunch of MCA replications of structural processing, on 7 T1-weighted images.
* Derived a test at each voxel -> p-value. Also tried applying different FWE and FDR corrections for multiple testing (see the sketch at the end of this section).
* Main experiment is "global null", leave-one-MCA-replication-out experiment.
* Varying the alpha threshold, the proportion of (false) positive detections follows the nominal value closely (YAY)
* With FWE correction, most replicates lead to no detection at all, as expected. PB: need to compute the empirical FWE to compare with the nominal value.
* FDR is more variable. PB: try running the same analyses with a bit of smoothing (FWHM = 3?).
* Couple of feasibility experiments have been planned.
* One is to compare the IEEE (baseline) to MCA distribution -> does not detect variations.
* Another: compare the outcome of one image to another image's distribution. Should lead to lots of detections. WIP
* Another: use a corrupted template for registration. Should also lead to lots of detections. WIP.
* Aim is a paper relatively soon. Looks like all the components are in place. The logic is completely generic and could be applied to any image processing pipeline, or really any software pipeline!
* Once the test is established, PB & TG will look into hiring an engineer (internship?) to complete the software library started by LT.
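Below is a minimal sketch of the kind of leave-one-MCA-replication-out "global null" test and FWE/FDR corrections discussed above. It assumes the replications for one subject are stacked in a NumPy array; the two-sided normal p-value and the `statsmodels` corrections are illustrative choices, not necessarily the exact procedure in YC's slides.
```
# Hypothetical sketch of the leave-one-MCA-replication-out "global null" test.
# Assumes `replications` is an (n_rep, x, y, z) array of MCA outputs for one subject.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def global_null_detection_rates(replications, alpha=0.05, method="fdr_bh"):
    n_rep = replications.shape[0]
    rates = []
    for i in range(n_rep):
        test = replications[i]                      # left-out replication
        rest = np.delete(replications, i, axis=0)   # remaining replications
        mu, sd = rest.mean(axis=0), rest.std(axis=0, ddof=1)
        z = (test - mu) / np.where(sd > 0, sd, np.inf)  # avoid division by zero
        p = 2 * stats.norm.sf(np.abs(z))            # two-sided parametric p-value
        # Multiple-testing correction across voxels ("bonferroni" for FWE, "fdr_bh" for FDR)
        reject, _, _, _ = multipletests(p.ravel(), alpha=alpha, method=method)
        rates.append(reject.mean())                 # proportion of (false) positive voxels
    return np.array(rates)
```
Under the global null, the uncorrected rejection rate should track the nominal alpha, which is the behaviour reported in the slides.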
## May 4th 2022
Attendance: YC, LT, TG
* LT is leaving for France :( Effective May 6th
* missing features for the app
* local run
* logging of tests (central stats?)
* have a clean set of datalad image repos in niprep
* in general move everything to nipreps
* Follow-up LT departure
* Maybe hire a Master student to extend the validation?
* Maybe transfer the LTS maintenance to new data platform at criugm - if it gets created
* Tests still don't work
* MCA for more libraries (LAPACK and BLAS)
* smooth images prior to stats (nilearn's smooth_img; see the sketch at the end of this section)
* enable random seeds in the pipeline
* Idea: use MCA to do data augmentation in fMRI machine learning models
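For the smoothing idea above, a minimal example with nilearn's `smooth_img` (file names are placeholders; fwhm=3 is the value floated in the notes, presumably in mm):
```
# Hypothetical smoothing step before computing voxel-wise statistics.
from nilearn.image import smooth_img

# fwhm=3 corresponds to the kernel discussed for the FDR analyses; adjust as needed.
smoothed = smooth_img("sub-XX_desc-preproc_T1w.nii.gz", fwhm=3)
smoothed.to_filename("sub-XX_desc-preproc_fwhm-3_T1w.nii.gz")
```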
## Apr 13th 2022
Attendance: YC, LT
* Statistic results (https://nbviewer.org/github/yohanchatelain/fmriprep-reproducibility/blob/master/notebooks/stat_test_normal.ipynb)
* 64 MCA iterations of ds000256/sub-CTS201
* Parametric (Gaussian)
* Still a lot of false alarms (~10%)
* Non-parametric (min-max; see the sketch at the end of this section)
* ~3% false alarms => generate many more samples (20 CPU-years allow ~1500 iterations)
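A rough sketch of the non-parametric (min-max) check, assuming `samples` stacks the MCA iterations for one subject and `new_img` is the run under test; the exact tolerance logic in the notebook may differ:
```
# Hypothetical min-max tolerance check: flag voxels of a new run that fall
# outside the range spanned by the existing MCA samples.
import numpy as np

def minmax_false_alarm_rate(samples, new_img):
    lo = samples.min(axis=0)
    hi = samples.max(axis=0)
    outside = (new_img < lo) | (new_img > hi)
    return outside.mean()   # fraction of flagged voxels (~3% in the Apr 13 notebook)
```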
## Mar 30th 2022
Attendance: LT, PB
* presentation of the new datalad version control on all test data
* discussion on pros and cons of using datalad run containers to keep track of test execution.
* preliminary consensus: use datalad only to keep track of reference data. Use pytest to run tests and the same mechanism as mriqc to aggregate results across users.
* for reference: API MRIQC https://mriqc.nimh.nih.gov/
## Mar 16th 2022
Attendance: YC, TG, PB, LT, CM
* Storage
* results of the preprocessing generated by YC are saved on Compute Canada tapes.
* challenge posting the outputs of fuzzy on OSF
* need to remove the workdir.
* LT prepared a datalad repo with bids convention for the archival on OSF.
* We will save only the T1w derivatives for now: one mean and one standard deviation image for each subject.
* How to push a DataLad dataset to OSF
1. Create an OSF sibling for the dataset:
```
OSF_TOKEN=<your_token> datalad create-sibling-osf --title 'BigBrain histogram' \
--mode exportonly \
-s osf-export \
--description "This carefully acquired data will bring science forward" \
--public
```
2. Export the dataset content to the OSF storage remote:
```
OSF_TOKEN=<your_token> git-annex export HEAD --to osf-export-storage
```
3. Make sure to make the dataset public through the OSF portal
* Compute canada outage
* should be OK to transfer to Cedar.
* Stats
* In practice, RFT "doesn't work": it is too liberal but we don't really understand why. Instead, the plan is to use a (more experimental) cross-validation approach: estimate the mean and standard deviation of the % of rejected voxels and adjust the p-value threshold from that (a rough sketch follows at the end of this section).
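A rough sketch of what the cross-validation calibration could look like, under the assumption that we already have a voxel-wise p-value map per left-out MCA replication; this is a guess at the intended procedure, not a settled method:
```
# Hypothetical calibration of the p-value threshold from leave-one-out rejection rates.
import numpy as np

def calibrate_threshold(pvalue_maps, candidate_alphas, target_rate=0.05):
    """pvalue_maps: list of voxel-wise p-value arrays, one per left-out replication."""
    best_alpha, best_gap = None, np.inf
    for alpha in candidate_alphas:
        rates = [np.mean(p < alpha) for p in pvalue_maps]   # % rejected voxels per replication
        gap = abs(np.mean(rates) - target_rate)
        if gap < best_gap:
            best_alpha, best_gap = alpha, gap
    return best_alpha, best_gap
```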
## Feb 2nd 2022
Attendance: YC, TG, PB, LT
* Tolerance interval.
* Voxels aren't independent
* Correct for family-wise error (Bonferroni). But with this we'll pay the price of non-Gaussianity and low # samples. So don't go there yet.
* Don't go with a non-parametric test if the # of non-Gaussian voxels remains modest. We don't want to impose a computational debt on the package maintainers.
* Other possibility: average by ROI, using the ROIs produced by FreeSurfer. For instance, look at the average absolute difference between voxels in each region. ROIs could be considered independent (see the sketch at the end of this section).
* Do we want to use 95% as absolute interval? Where do we put the cut-off?
* -> check test by ROI?
* Other sanity checks:
* Generate more fuzzy samples (30)
* Use cases
* Software update
* Deployment on different hardware
* Simulate a template corruption (ask Basile for corrupted template or how to replicate corruption)
* Functional data: use Hao-Ting's nilearn code to process the images and design the test on this result (don't introduce noise in this analysis)
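For the ROI-averaging idea above, a minimal sketch assuming a FreeSurfer segmentation (e.g. aseg) already resampled to the grid of the two images; the mean absolute difference per label is one possible summary:
```
# Hypothetical ROI-level comparison: average absolute difference per FreeSurfer label.
import numpy as np
import nibabel as nib

def roi_abs_diff(img_a_path, img_b_path, seg_path):
    a = nib.load(img_a_path).get_fdata()
    b = nib.load(img_b_path).get_fdata()
    seg = nib.load(seg_path).get_fdata().astype(int)   # e.g. resampled aseg labels
    diffs = {}
    for label in np.unique(seg):
        if label == 0:          # skip background
            continue
        mask = seg == label
        diffs[label] = np.mean(np.abs(a[mask] - b[mask]))
    return diffs                # one value per ROI; ROIs treated as ~independent tests
```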
## Jan 10th 2022
Attendance: YC, LT
* statistics:
* normality testing (see the sketch at the end of this section)
* mask for non-Gaussian voxels
* non-parametric test for 5 samples?
* fuzzy status:
* now works on fMRI data! The issue was that FSL bet treated stderr output from Verificarlo as errors.
* reference:
* update on make-reference (raw fmriprep outputs, bids anat mean/std)
* what else to include in the reference (func corr mean/std, masks for Gaussian-distributed voxels)
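A minimal sketch of the per-voxel normality screen discussed above, using a Shapiro-Wilk test inside a brain mask; with only ~5 MCA samples the test has little power, which is exactly the concern raised here:
```
# Hypothetical per-voxel normality screen over MCA samples (n_rep, x, y, z).
import numpy as np
from scipy import stats

def gaussian_voxel_mask(samples, brain_mask, alpha=0.05):
    gaussian = np.zeros(samples.shape[1:], dtype=bool)
    for idx in zip(*np.where(brain_mask)):
        _, p = stats.shapiro(samples[(slice(None),) + idx])
        gaussian[idx] = p > alpha    # keep voxels where normality is not rejected
    return gaussian
```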
## Dec 21st 2021
Notes by TG and YC on next steps following our HBM abstract submission:
### Error maps
* Get error maps for the functional pipeline. Status: we have 5 fuzzy samples for 1 subject.
* Curiosity: check if error map resembles local SNR
* Trace the pipeline with Pytracer to understand where the error comes from
### Test definition
* Check if fuzzy samples are Gaussian (they're probably not)
* If they are not, find a way to define confidence intervals from non-Gaussian samples
* Sanity check: check that new fuzzy samples pass the test
### Experiments with test
* Does the test pass if random seed isn't fixed?
* Does the test pass if multithreading is enabled?
* Does the test pass if dependencies are updated?
### Other
* Clean GH repo and explain how users can test their pipeline
## Dec 15th 2021
Attendance: YC, TG, CM, LT
* Fuzzy outputs:
* Anatomical output instabilities are mostly on the borders
* Error maps are available for the anatomical pipeline. 8 subjects.
* Tests will be implemented for 3 derivatives (native space, 2 versions of MNI template) x 8 subjects
* Mask subcortical gray matter regions of the MNI template
* https://templateflow.s3.amazonaws.com/tpl-MNI152NLin6Asym/tpl-MNI152NLin6Asym_res-06_atlas-HCP_dseg.nii.gz
* Overall we keep half of the precision
* For HBM
* TBD push outputs to osf and create datalad repo
* OHBM abstract:
* https://docs.google.com/document/d/1qE2W3qBhf_MZKto_Ywlg93Sesn5aMNAze45fU8J44V0/edit
* TBD fuzzy with numpy update (if available before deadline)
* TBD dataset descriptions
* Statistical reports:
* Relative differences (for anatomical) and pearson correlation (for fMRI)
* We want instead an absolute comparison with fuzzy bounds for each voxel (see the sketch at the end of this section)
* Current work can be used for fuzzy func ref generation
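A minimal sketch of the "absolute comparison with fuzzy bounds for each voxel", assuming per-voxel mean and standard deviation images from the fuzzy reference; k = 3 is an arbitrary width, not an agreed value:
```
# Hypothetical voxel-wise check against fuzzy reference bounds (mean +/- k * std).
import numpy as np
import nibabel as nib

def outside_fuzzy_bounds(test_path, ref_mean_path, ref_std_path, k=3.0):
    test = nib.load(test_path).get_fdata()
    mu = nib.load(ref_mean_path).get_fdata()
    sd = nib.load(ref_std_path).get_fdata()
    outside = np.abs(test - mu) > k * sd
    return outside.mean(), outside     # fraction flagged, and the voxel-wise map
```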
## Nov 24th 2021
Attendance: YC, LT
* YC:
* ran fuzzy-container for anatomical only
* with IEEE => 100% reproducible in sequential mode (as expected)
* with MCA => big differences related to masking (max std ~10^9 voxel-wise)
* next step: run for functional
* explanation of the reproducibility difference between multithreading and multiprocessing:
* multiprocessing parallelizes over independent tasks, whereas multithreading happens at a more fine-grained level (@markiewicz and @mathias, can you confirm?)
* LT:
* big errors must come from registration: voxels inside the mask VS outside the mask (judging from the max std)
* will create an anat-only Slurm file for YC to experiment with in our tool
* statistical test still in development
## Nov 10th 2021
Attendance: PB, YC, LT
* LT
* has an app working for fuzzy (as fuzzy does not work yet, it uses multiple IEEE runs). It quantifies the correlation of time series across all pairs of replications, averaged within a brain mask, and produces HTML reports showing the results voxel by voxel.
* PB
* acquiring a better test dataset is 100% feasible. We need a way to release it publicly. WIP (high priority because the same issue applies to cneuromod).
* re brainhack global, still unclear, will revisit on Nov 24th.
* YC
* test on anat only. Ran IEEE inside fuzzy, with 8 participants. Big variability in run length - one subject takes 5 days (!!).
* next step to include MCA.
* Then check anat+func.
## Oct 25th 2021
Attendance: PB, YC, LT
* LT: update on the fmriprep fMRI repro test
* fuzzy is broken atm. Yohan is going to look into it.
* simple reproducibility metrics -> PB and LT to draft (a sketch follows at the end of this section).
* structural
* dice between reference / new brain mask
* max absolute difference relative to reference intensity in the reference mask
* functional
* dice between reference / new brain mask
* minimum correlation of activity at a voxel between test and retest in the reference mask
* timeline: have a functional app to test inter-os repro in the coming weeks. For fuzzy, unclear, because it's unclear what the problem is.
* move the app from simexp to nipreps?
* YC: plans for identifying reproducibility bottlenecks.
* To be discussed at a later point. We need the app and assess the magnitude of the problem first.
* PB: plans for a brainhack project on fmriprep_reproducibility.
* TBD. Depends if we have a working app soon. Will decide on our meetings scheduled Nov 24th.
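A rough sketch of the draft metrics listed above (Dice between reference and new brain masks, and minimum voxel-wise time-series correlation inside the reference mask); inputs are assumed to be NumPy arrays:
```
# Hypothetical reproducibility metrics for the structural and functional checks.
import numpy as np

def dice(mask_ref, mask_new):
    inter = np.logical_and(mask_ref, mask_new).sum()
    return 2.0 * inter / (mask_ref.sum() + mask_new.sum())

def min_voxel_correlation(bold_ref, bold_new, mask):
    """bold_*: (x, y, z, t) arrays; mask: boolean (x, y, z) reference mask."""
    ref = bold_ref[mask]            # (n_voxels, t)
    new = bold_new[mask]
    ref = ref - ref.mean(axis=1, keepdims=True)
    new = new - new.mean(axis=1, keepdims=True)
    num = (ref * new).sum(axis=1)
    den = np.sqrt((ref ** 2).sum(axis=1) * (new ** 2).sum(axis=1))
    r = num / np.where(den > 0, den, np.inf)
    return r.min()
```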
## June 29 2021 (2d round)
Attendance: Loïc Tetrel, Chris Markiewicz
* CM: fmriprep team can also work on (github actions) tests
* LT: CI tests may not be ideal for a datalad repo environment?
* CM: It should accommodate the load (fmriprep running 10G containers)
* LT: fuzzy issues
* CM: `_fix_surfs7` - collecting precomputed outputs
* from https://fmriprep.org/en/1.1.4/_modules/fmriprep/interfaces/surf.html
* CM: you would better try re-running `bet` inside the fuzzy environment
## June 29 2021 (1st round)
Attendance: Loïc Tetrel, Yohan Chatelain, Tristan Glatard
* LT: reorganized tests to reduce scan time in subjects
* LT: parametrized tests
* LT: created a Makefile
* LT/TG: we could publish test results to (another) git repo
* TG: will investigate why fuzzy bet crashes
```
RuntimeError: Command:
bet /WORK/fmriprep_work/fmriprep_wf/single_subject_CTS201_wf/func_preproc_task_restbaseline_wf/initial_boldref_wf/enhance_and_skullstrip_bold_wf/n4_correct/ref_bold_corrected.nii.gz /WORK/fmriprep_work/fmriprep_wf/single_subject_CTS201_wf
/func_preproc_task_restbaseline_wf/initial_boldref_wf/enhance_and_skullstrip_bold_wf/skullstrip_first_pass/ref_bold_corrected_brain.nii.gz -f 0.20 -m
```
## June 15 2021
Attendance: Loïc Tetrel, Chris Markiewicz, Tristan Glatard, Mathias Goncalves, Yohan Chatelain
* LT: compared reproducibility with single-threading vs multi-threading vs multi-processing - controlling otherwise for random seeding. Multi-processing is completely reproducible (YAY), but multi-threading is not. MG: ANTS is probably to blame. See https://github.com/ANTsX/ANTs/wiki/antsRegistration-reproducibility-issues
* LT: still trouble running the fuzzy fmriprep. Next in line.
* PB: still need to make the code modular for the evaluation, and change the evaluation metrics.
* LT: shows a first implementation of the test using pytest. CM suggests looking at https://docs.pytest.org/en/6.2.x/parametrize.html (a minimal example follows).
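A minimal, self-contained example of the parametrized pytest style CM points to; the cases and threshold are placeholders, not the actual fmriprep-reproducibility tests:
```
# Hypothetical parametrized reproducibility test (pytest).
import numpy as np
import pytest

def dice(a, b):
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# Placeholder cases: (reference mask, new mask, expected minimum Dice).
CASES = [
    (np.ones((4, 4, 4), bool), np.ones((4, 4, 4), bool), 0.99),
    (np.ones((4, 4, 4), bool),
     np.pad(np.ones((3, 4, 4), bool), ((0, 1), (0, 0), (0, 0))), 0.80),
]

@pytest.mark.parametrize("ref_mask,new_mask,threshold", CASES)
def test_brain_mask_dice(ref_mask, new_mask, threshold):
    assert dice(ref_mask, new_mask) >= threshold
```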
## June 8 2021
Attendance: Loïc Tetrel, Chris Markiewicz, Yohan Chatelain, Tristan Glatard, Mathias Goncalves
* TG: will make a PR on shellcheck nitpicks in the bash scripts. Nothing major.
* LT: improved submission scripts
* LT: using options `--random-seed 1234 --fs-no-reconall --anat-only --skull-strip-fixed-seed --nthreads 1 --omp-nthreads 1`, results are exactly reproducible for both the anatomical and functional pipelines
* CM: `--skull-strip-fixed-seed` passes `--use-random-seed 1` to ANTs Atropos... cannot specify the seed directly
* LT: still running fuzzy (5 repetitions)
* How to show unchanged images?
* CM: GIF is lossy; niworkflows uses SVG ([link](https://github.com/nipreps/niworkflows/blob/2600e4fe18012a852a9169bfac5804d3ad789eba/niworkflows/interfaces/reportlets/base.py#L89-L110))
* TODO: compare multithreading implementation of fmriprep with fuzzy
* `omp-nthreads` is likely to affect the results but `nthreads` should not
* Tests: we will implement both a global and a voxelwise one. LT to implement a first version.
## June 1 2021
Attendance: Pierre Bellec, Yohan Chatelain, Loïc Tetrel, Mathias Goncalves, Tristan Glatard
Regrets: Chris Markiewicz
* testing reproducibility of the anat pipeline.
* LT: differences across runs relate to the seeding of the skull-stripping step.
* LT: the experiment will be run now that the latest Singularity images are ready
* TG: Singularity images were updated to use fmriprep 20.2.1. See https://github.com/SIMEXP/fmriprep-lts/pull/3
* TG: is each repetition really independent?
* LT: yes, it should be
* MG: you could check in the logs.
* LT: I have, and it restarts from scratch.
* Update on the container script
* BP suggests trying to control versions in these build files:
* https://github.com/nipreps/fmriprep/blob/7e17deaf05a27c577a8cfb0099b93c9883cc63ce/Dockerfile#L9-L18
* https://github.com/nipreps/fmriprep/blob/7e17deaf05a27c577a8cfb0099b93c9883cc63ce/Dockerfile#L73-L80
* PB: we will build a reference set of results from the existing container, then patch the container dockerfile, rebuild and confirm that the results don't change.
* possible solution: NeuroDebian freeze https://neuro.debian.net/pkgs/neurodebian-freeze.html Maybe overkill.
* after some debate, candidate solution would look like:
* add more control on package versioning.
* aim to rebuild periodically with newer packages (including ubuntu).
* systematically test for changes.
* what images and metrics?
* TG: mean and standard deviation
* PB for 4D: voxelwise correlation between time series, or percentage-of-baseline difference.
* LT: how do we test departure from the reference distributions?
* we'll implement both a voxel-wise and volume-wise approach.
* in terms of images:
* preprocessed T1 and mask, as well as preprocessed fmri.
TODO:
* TG to have a look at the bash script
* https://github.com/SIMEXP/fmriprep-lts/blob/4e983f7cb5914bf336c54942ccc1663cb3b484e1/code/run.bash#L88-L109
* discuss (and vote?) the solution for maintaining the lts container
* contact BP for acquiring new test data
NEXT TIME:
* report
* how to make code cleaner and modular
* what outputs for the report
* report or reportlets?
## May 25th 2021
Attendance: Pierre Bellec, Chris Markiewicz, Tristan Glatard, Mathias Goncalves, Loic Tetrel
* TG created a PR for building a Singularity container for the LTS, with and without fuzzy:
* Building the container: https://github.com/SIMEXP/fmriprep-lts/tree/master/code/containers
* Container images: https://github.com/SIMEXP/fmriprep-lts/tree/master/envs
* https://github.com/SIMEXP/fmriprep-lts/pull/1 containers are stored on OSF.
* Usage: https://github.com/SIMEXP/fmriprep-lts
* -> should pull 20.2.1 instead of 20.2.0
* LT is giving an overview of https://github.com/SIMEXP/fmriprep-lts
* PB Q: where should all these repos live?
* Where to put fmriprep-preprocessed data
* PB: OSF
* CJM: Another option: gin.g-node.org
* CJM: BIDS-style outputs: [`--output-layout bids`](https://fmriprep.org/en/stable/usage.html#Other%20options) (see [nipreps/fmriprep#2303](https://github.com/nipreps/fmriprep/pull/2303), [neurostars #18492](https://neurostars.org/t/sharing-nested-bids-raw-derivatives-in-a-datalad-yoda-way/18492))
* PB: need to decide how to name/organize the different runs
* MG: would be useful to have more slurm options exposed.
* CJM: reproman would be the way to implement a generic tool
* https://github.com/ReproNim/reproman
* TG: maybe decouple the data processing from the test, to better deal with crashes.
* LT: is it OK to develop our code base for Linux only, or is there some form of Windows and macOS support for fmriprep? CJM: bash is a reasonable requirement.
* TG: we should use a testing framework to implement the tests, for instance:
* pytest if Python
* [bats](https://bats-core.readthedocs.io/en/latest/index.html#) if bash
* fmriprep's testing framework (link?)
* Next steps
* investigate discrepancies on the anatomical report
* add CJM to compute canada -> PB to send directions
* CJM to investigate the environment variables of the different jobs.
* implementation of the report
* TG and Ali to review the bash scripts
* also compare fmriprep command line arguments used in Ali and TL's experiments
* Next week
* discuss which images to look at in the tests
* discuss which metrics to quantify variations in the test
* possibly an update from Basile Pinsard on the container build
## test data with NMIND
**Attendance**: Greg Kiar, Pierre Bellec, Audrey Houghton, Xinhui, Yohan Chatelain, Tristan Glatard, Loic Tetrel, Mathias, Steve Giavasis, Sydney Covitz
* GK: check this issue: https://github.com/nmind/hackathon2021/issues/4
* GK also check this issue: https://github.com/nmind/hackathon2021/discussions/16
* AH: selecting some subjects from ABCD and NKI Rockland. A next step is to run fmriprep on these data and check for differences. These are good-quality subjects.
* SC: Q: are tests done with T1+FLAIR? A: Mathias: no.
* GK: how would you test the LTS on low-quality data? PB: our aim is to test whether an install is proper, not that the pipeline is robust, so a single subject should be enough.
* PB: for the test set: partial field of view (ICBM aging, F1000), hyper-intensities (Oulu, ADHD-200), atrophied brains (ADNI), multi-scanner (SIMON).
* Mathias: pretty sure the FLAIR is only used for the surface reconstruction, so it shouldn't affect the fmriprep pipeline much
* would be useful to have infants for a project like https://github.com/nipreps/nibabies
* LT: would we need public data for the evaluation? GK: not necessarily. You can have a centralized evaluation of the tests, or a partial evaluation just with the public data.
* PB: what about having a very small dataset to run quick tests? GK: ultimately this type of test is quite different from processing real data. PB: it would be useful only to catch crashes and changes in outputs. GK: not in scope at the moment.
## May 11th, 2021: update and test data
### varia (container image hosting and generation)
**Attendance**: Chris Markiewicz, Mathias Goncalves, Ali Salari, Loic Tetrel, Pierre Bellec, Tristan Glatard (late)
* LT: where are we at with the fuzzy container repo? AS: no progress yet. Next week. CM: singularity hub is closing down. Consensus: stand-alone datalad repo, put the singularity images on OSF (which has a datalad remote).
* CM: may be worth saving the apt and pip cache and archive that for the build of the LTS. nd_freeze? https://neuro.debian.net/pkgs/neurodebian-freeze.html Talking to Yarik?
* CM: Looking through our dockerfile it does look like we separated the Python from non-python packages pretty well, so Greg's suggestion seems like a reasonable fallback.
### test data selection
* LT: rationale for selecting data
* children, young adults, older adults
* multiparametric (T2w check, FLAIR check, could not find SBREF... BP -> https://openneuro.org/datasets/ds001399/versions/1.0.1 or CM mentions ds000031, ds000244, ds001178, ds001399, ds001417, ds001734, ds001740, ds001771, ds001818, ds001978, ds002147, ds002278,ds002316)
* fieldmap (all different)
* MG: multi-echo
* BP: every vendor?
* CM: lesion masks.
* CM: create one for one subject, with absolutely everything.
* PB: three use cases: (1) test all aspects of the pipeline on one subject; (2) test the pipeline on all kinds of different subjects; (3) same as (1), but tiny, for continuous integration. CM: may need dedicated test infrastructure.
* TODO: put together some docs on the test data.
## May 5th, 2021: preliminary results and planning
**Attendance**: Tristan Glatard, Yohan Chatelain, Greg Kiar, Ali Salari, Loic Tetrel, Pierre Bellec
**Notes**:
* [neurostars thread on cross-run differences](https://neurostars.org/t/differences-between-fmriprep-runs-on-same-data-what-causes-them/18543/3)
* LT: investigated reproducibility
* selected some data on OpenNeuro
* implemented a pure functional workflow with a fixed anatomical workflow.
* also implemented a pure anatomical workflow
* currently running without MCA, but running with MCA is possible
* TG:
* Ali focused on the anatomical pipeline
* re-using options in the neurostars post: `fmriprep --random-seed 1234 --fs-no-reconall --anat-only --skull-strip-fixed-seed --omp-nthreads 1`
* perfectly reproducible results with fixed seed and no MCA perturbations
* it was run on local cluster
* Used one session of the SIMON dataset
* Evaluated stability via MCA on a single subject (roughly 3 significant digits; see the sketch at the end of this section)
* low significance appears to be due to subtle (but impactful) registration differences
* still some small impact of MCA
* see results here: https://github.com/glatard/fuzzy-fmriprep/blob/main/sigdigits.ipynb
* LT:
* presented his project https://github.com/SIMEXP/fmriprep-lts
* TG:
* we should have a pass/fail test
* users need a single run
* GK:
* it is going to be hard to rebuild the same container
* this test could be integrated for developers in CI
* PB: this is a great idea, but would require different data for different use cases
* TG: CI could just grab outputs. Let's focus on the tests first, leave CI for later.
* GK: initiative of Damian Fair, Ted S and Mike M to avoid duplications. Will create a test set. Greg will be our champion so far. NMIND (this Neuroimaging Method Is Not Duplicated)
* TG: should it be part of standard fmriprep tests? PB: no, too heavy.
* GK:
* fmriprep already uses Sentry to integrate some basic testing/logging: https://sentry.io/welcome/
* We could bake in the test for comparing results to the stored MCA-derived mean/variance estimates
* Would be a great way to integrate our tests to the fmriprep existing environment
* PB: how can we find parts of the pipeline which are more variable?
* GK: we could replace inputs of each step by pre-generated ones.
* PB: this would be awesome, but requires lots of data.
* GK: could be done for developers.
* TG: other tracing methods could be used.
* TG: maybe we should focus on the simple test comparing with final outputs.
* PB
* good test sets (GK)
* decide target and performance measure for a regression test
* next decide on structures for the repos
* next decide on structures for the reports and tests
* other issue is to track cleanly our containers
* GK suggests using staged container build
* TG and team to build a datalad repo with fuzzy fmriprep image.
* https://handbook.datalad.org/en/latest/basics/101-127-yoda.html
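For reference, one common way to compute the "significant digits" figure from MCA samples (a Parker-style estimate in base 10); the notebook linked above may compute it differently:
```
# Hypothetical significant-digits estimate from MCA samples (n_rep, ...).
import numpy as np

def significant_digits(samples, eps=1e-30):
    mu = samples.mean(axis=0)
    sd = samples.std(axis=0, ddof=1)
    # Parker-style estimate: digits = -log10(|sigma / mu|); clipped to avoid log of zero.
    return -np.log10(np.maximum(np.abs(sd / np.maximum(np.abs(mu), eps)), eps))
```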
---
## April 21st 2021: discussion on the origin of variations in fmriprep
### Attendance
Tristan Glatard, Yohan Chatelain, Chris Markiewicz, Greg Kiar, Ali Salari, Mathias Goncalves, Loic Tetrel, Pierre Bellec
### Agenda
* round-table discussion.
### Minutes
LT: we should focus on the tool to quantify instabilities.
TG: if there are random variations even when fixing the seed, then adding numerical noise is going to get drowned in those variations.
CM: Good first step would be to do a full run, and then re-run each step one by one with fixed seed to narrow where variations occur.
PB: maybe first step would be to re-run it with fixed seed, and also try to control for the anatomical variability.
PB+TG: metrics: number of significant digits? correlation? differences and variance?
GK: dataset? HNU1, look at different CoRR sites.
PB: provided we identify sources of instabilities related to seeds, what do we do about it?
CM: some variations may be reasonable.
GK: would need to assess if the tool sometimes diverges, or if it's "expected" variability.
CM: could try to optimize the registration.
TG: bootstrap averaging?
TG: compare solutions.
PB: consider parcel-based stability metrics as well?
CM: smooth data? What smoothing level is required to get to something acceptable?
PB: who would like to be directly involved in this work:
* TG and YC
### TODO
* [x] create a mattermost channel for the project
* [ ] understand better reproducibility with fixed seed (with / without fixing T1 processing)
* [ ] decide on data & metric