OpenNeuro automatically into the graph
===

Make phases

# Language

- ON = the upstream OpenNeuro organization: https://github.com/OpenNeuroDatasets/
- ON-LD = our fork / mirror org: https://github.com/OpenNeuroDatasets-jsonld
- repo: a git repo in ON
- fork: a git repo in ON-LD that is a fork of a repo in ON

# What do we want

* OpenNeuro adds a new dataset / GitHub repo
    * we want:
        * OpenNeuro-JSONLD should also get a repo
    * how should we do this:
        * we create a new fork
* OpenNeuro writes to its own repo / updates upstream
    * we want:
        * our fork to get those changes
        * to retain a linear history with the specific contributions on the repo and from upstream
    * how should we do this:
        * TBD -> fill in from https://github.com/OpenNeuroDatasets-JSONLD/.github/issues/16
        * we thought:
            * cron job that just merges every once in a while
            * but we could use the GH CLI to only do this for the repos that were changed / written to in the last X days (if it exists)
        * Yarik says:
            * be smarter -> look at SHAs
* what questions do we have
    * is there any way to have repo-level workflows without adding the `.github` directory every time?

# Phase 1 Make forks of repos

1. Workflow that checks all upstream repos (OpenNeuro) and, if a fork doesn't exist in ON-LD, creates the fork

## Process

- when run / trigger
    - cron
- what happens
    - make a list of current OpenNeuro repos
    - for each ON repo
        - if we don't have a fork
            - make a fork
        - else
            - nothing
    - make a list of our forks
    - **for each ON-LD fork** (probably not relevant for this phase)
        - **if there is no longer an ON repo**
            - ????
        - **else**
            - nothing
- outcome
    - ON-LD is a mirror of ON with extra data (see the sketch below)
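A minimal sketch of the fork-creation loop above, assuming an authenticated `gh` CLI with permission to fork into ON-LD; the `--limit` value and the loop structure are our assumptions, not an existing workflow:

```bash
#!/usr/bin/env bash
set -euo pipefail

UPSTREAM_ORG="OpenNeuroDatasets"
FORK_ORG="OpenNeuroDatasets-JSONLD"

# Make a list of current OpenNeuro repos (paged generously; assumes < 2000 datasets)
gh repo list "$UPSTREAM_ORG" --limit 2000 --json name --jq '.[].name' |
while read -r repo; do
    # If we don't have a fork yet, create one (no local clone needed)
    if ! gh repo view "$FORK_ORG/$repo" >/dev/null 2>&1; then
        gh repo fork "$UPSTREAM_ORG/$repo" --org "$FORK_ORG" --clone=false
    fi
done
```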
## Questions

- what if we would like to delete a dataset?
    - do it by hand? -> delete it from the openneuro-annotations repo
- if repos have different default branch names, should the forks
    - use the same branch name as the default of the upstream repo
    - **or use a consistent branch name, e.g. `main`**

# Phase 2 Fetch upstream updates for forks

2. Workflow for updating forks with changes from the ON repo, regardless of whether the fork has annotations or not

## Process

- when run / trigger
    - cron
- what happens
    - make a list of ON-LD forks
    - for each fork
        - find the upstream repo
            - as far as we can tell, git itself does not track the parents of forks
            - therefore: we just guess the parent from the repo name
        - **if the upstream repo no longer exists**
            - nothing
            - **warning / echo**
        - find the default branch of the repo
        - has the repo changed? (e.g. look at the timestamps of the HEAD commits)
        - if the repo is more recent than the fork
            - do the sync
                - we can just use `gh repo sync`
                - the history will then be a linear history: https://gist.github.com/surchs/4f2cb2295e4672eb87e40e6b63f64a29
        - else
            - nothing
    - sync means
        - get all the changes from the repo if a fast-forward is available
        - otherwise make a merge commit to get these changes
    - sync steps
        - TBD
- outcome
    - our forks are mirrors of ON repos with extra commits (see the sketch below)
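A minimal sketch of the sync loop above, assuming an authenticated `gh`; comparing `pushed_at` timestamps is just one possible "has it changed" check (Phase 4 moves to comparing SHAs, which is more robust once the forks carry extra commits):

```bash
#!/usr/bin/env bash
set -euo pipefail

UPSTREAM_ORG="OpenNeuroDatasets"
FORK_ORG="OpenNeuroDatasets-JSONLD"

gh repo list "$FORK_ORG" --limit 2000 --json name --jq '.[].name' |
while read -r repo; do
    # We guess the upstream from the repo name; does it still exist?
    if ! gh repo view "$UPSTREAM_ORG/$repo" >/dev/null 2>&1; then
        echo "WARNING: upstream $UPSTREAM_ORG/$repo is gone" >&2
        continue
    fi
    # Compare last-push timestamps (ISO 8601 UTC strings compare lexicographically).
    # Caveat: once we push annotations, the fork's pushed_at is newer, which can
    # mask later upstream changes -- hence "look at SHAs" in Phase 4.
    upstream_pushed=$(gh api "repos/$UPSTREAM_ORG/$repo" --jq .pushed_at)
    fork_pushed=$(gh api "repos/$FORK_ORG/$repo" --jq .pushed_at)
    if [[ "$upstream_pushed" > "$fork_pushed" ]]; then
        # gh repo sync fast-forwards the fork's default branch from upstream
        gh repo sync "$FORK_ORG/$repo" --source "$UPSTREAM_ORG/$repo"
    fi
done
```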
#### Left to do:

- **Grab timestamps / the 50 last-modified repos from upstream to trigger the current sync_forks.yml**

## TODO

- [x] Has upstream changed compared to my fork?
- [ ] Figure out if the upstream repo is still there
- [x] Get me the changes from the upstream repo and merge (if possible)
- [ ] Figure out if we can avoid the merge commit
    - for now, keeping the merge commit

## Questions

- How do we detect that a fork has been changed?
    - Some workflow will regularly run at the organization level and look at the timestamps of all repos to produce a list of the repo names that have changed since the last run
- **What should we do if the ON repo changes its default branch name?**
    - cry

## Ideas

- if the repo is gone when updating the fork
    - -> update the README
    - OR -> nuke the fork

## Limitations

- **Due to validation borrowed from the CLI, currently you can't push just a BIDS dictionary**

# Phase 3 Push changes to forks

3. Service for uploading and validating an annotated JSON, incl. making sure the annotation is consistent with the original JSON

- Not a GitHub workflow
    - we don't want people to have to push anywhere, and we want minimal GitHub interaction, if possible, for less advanced users
- Probably an API

## Process

- when run / trigger
    - user action, e.g. "I have a .json I want to add"
    - could be: the service acting on behalf of the user
- what happens
    - we validate the change (service)
    - **decision**:
        - the service opens a PR and automerges if a fast-forward is possible
        - push to main when valid
    - service
        - receives
            - the .json payload
            - the git URL or dsID of the target fork
        - steps
            - validate the .json (could this work?)
                - yell if it fails
            - does the target fork exist?
                - yell if not
            - **clone the target fork (assume the fork is current with its parent repo)**
                - not done as part of the API; assumes this has already happened (via our workflows)
            - if the .json exists in the fork
                - run MERGE
            - else
                - simply add the new .json file
    - what
        - an **API** that we write and serve
        - a CLI would be possible and easy to use (?)
    - where / how
        - TBD
    - MERGE
        - steps
            - if the .json already has "Annotations"
                - your annotations are now the only annotations, i.e. any previous annotations get overwritten
                - this applies to all annotations, so you can delete annotations by not including them in the current upload
            - match the indentation of the .json in the fork
            - now the API server has a ready-to-go participants.json
            - make the fork have this participants.json
                - git commit && git push `<the right branch>`
                - use the GitHub contents API
                - ???
- outcome
    - our fork has a new commit with the change for the .json (see the sketch below)
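A minimal sketch of the MERGE + push steps, assuming `jq` and an authenticated `gh`; the dataset ID, file names, and commit message are placeholders, and the "overwrite Annotations wholesale" rule is taken from the notes above (indentation matching, and whether columns omitted from the upload should lose their old annotations, are not handled here):

```bash
#!/usr/bin/env bash
set -euo pipefail

DSID="ds000001"          # hypothetical example dataset ID
REPO="OpenNeuroDatasets-JSONLD/$DSID"
FILE="participants.json"
BRANCH=$(gh api "repos/$REPO" --jq .default_branch)

# Grab the fork's current participants.json and its blob SHA (needed to update)
resp=$(gh api "repos/$REPO/contents/$FILE?ref=$BRANCH")
jq -r .content <<<"$resp" | base64 -d > fork.json
OLD_SHA=$(jq -r .sha <<<"$resp")

# For every column in the upload (upload.json = the user's payload) that carries
# an "Annotations" key, replace that column's "Annotations" in the fork copy
# wholesale -- no per-key merging, so omitted sub-annotations are deleted
jq --slurpfile up upload.json '
  reduce ($up[0] | to_entries[]) as $col (.;
    if ($col.value | type) == "object" and ($col.value | has("Annotations"))
    then .[$col.key].Annotations = $col.value.Annotations
    else . end)
' fork.json > merged.json

# Push the result as a new commit via the GitHub contents API
gh api -X PUT "repos/$REPO/contents/$FILE" \
  -f message="[bot] update annotations" \
  -f branch="$BRANCH" \
  -f sha="$OLD_SHA" \
  -f content="$(base64 -w0 merged.json)"
```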
## Questions / Scenarios

- Do we only update the "Annotations" section of the fork's participants.json, or replace all contents? If all contents, we may run into conflicts when the fork pulls changes from upstream.
- Do we anticipate allowing users to change the non-Annotations portion of the data dictionary?
    - If the fork's .json has changed since the user annotated it and we do the above, the non-Annotations portion and the Annotations portion may become out of sync once the new .json is uploaded - should we warn?
- User: I am uploading my annotations.json
    - but the fork has changed since I downloaded the .tsv to make my annotations - now what?
    - but someone has already uploaded annotations that are overlapping / conflicting
- User: I want to annotate
    - how do we - ?
    - should we run a sync before a user can download data to make an annotation?
- how could the API clone a repo?

# Sort

## Our questions for incoming changes to the fork

1. how do we do updates to our fork
    a. upstream changed: what happens when there are already annotations in the JSON in our fork
        - scenario 1
            - fork has: changes to the .json "Annotations" key
            - upstream has: also changes to the .json, but not conflicting (!)
            - conclusion for now: assuming that Neurobagel will only ever touch the `Annotations` part, we should hopefully be able to merge any changes from upstream via fast-forwarding
            - possible exceptions: someone has deleted info about a column we annotated in the JSON, someone renamed ""
        - what ifs
            - a fast-forward merge from upstream is not possible / merge conflict
                - make this known somehow / yell
    b. second annotation: what about when annotations have already been pushed to a fork, and someone wants to add more annotations to the fork
        - the service validates the new .json and then we replace what's there with the upload
            - validation to be fleshed out in more detail
        - the user can only make changes to the fork through the service
        - the user will start with the .json file in the fork, as opposed to what is on OpenNeuro
2. how do we make all of the forks in the first place, even for those repos that do not have annotations yet
    - issues we know about
        - some ON repos have `main` AND `master` branches
        - some ON repos get removed
        - for more details, see [Annotating OpenNeuro Datasets](/tkBKEXyxS5C9ja9wuUTz3w#Problems-encountered-when-updating-the-forks-OpenNeuroDatasets-JSONLD)
3. what about when upstream has annotations already
    - -> **we don't do that yet**
4. what should we do if the upstream repo gets deleted
    - ON says: there will always be a new dataset / repo to take over
    - but this information will likely only exist on the openneuro.org website

# Phase 4 Run the CLI when a fork has been updated

## Brief summary of implemented workflow logic:

We have a script `sha_scraper.sh` that runs periodically and is responsible for looking at all repos and asking "have I seen this repo before?"

- if yes: "has this repo changed since I was last here?"
    - if yes: "could I do something about this repo (i.e. run the CLI)?"
        - if yes again: add the repo to my todo list, but do not update the repo SHA in the lookup file
        - if no: only update the repo SHA in the lookup file
    - if no (nothing has changed): do nothing
- if no ("never seen this repo"): go to "could I do something about this repo (i.e. run the CLI)?" and follow that flow

We "add the repo to my todo list but do not update the repo SHA in the lookup file" because we only "update the repo SHA in the lookup file" once the CLI run has succeeded. Otherwise we would not retry running the CLI the next time we come by.

#### sha.txt

In the current automation framework, sha.txt is a file that tells us the last 'version' (commit SHA) of each OpenNeuroDatasets-JSONLD repo that we were able to either (A) successfully run the CLI on, OR (B) skip running the CLI on because it was missing required files.

That's because when a repo meets either of these conditions, we don't want to try to run the CLI again the next time the workflow comes around, UNLESS there has been a change to the repo.

On the other hand, when a repo fails the CLI, for now we do always want to try running the CLI again the next time the workflow comes around. We trigger/achieve this by not updating the entries for those repos in sha.txt until they meet one of the previous two conditions.

These decisions are somewhat arbitrary, so please let us know / open an issue if you think any part of this doesn't make sense, and we can incrementally refine the workflow logic!

## Process:

**Trigger**:
- Maybe another cronjob workflow, in OpenNeuroDatasets-JSONLD/.github
- Should also be manually triggerable, e.g., when we have updated the CLI

**1. Detect changes**
- For each fork repo
    - If the repo has received a push on the default branch, **AND** there are BOTH a participants.json file and a participants.tsv file:
        - We can detect changes by comparing SHAs, as they're going to be different
        - We need a workflow to first record the current SHAs and then compare the previous SHAs with the current (presumably updated) ones; the SHA file needs to be updated with the SHAs from the repos that have received a change (see the sketch at the end of this page)
        - Proceed to the CLI-running step
    - Otherwise:
        - Do nothing / go to the next repo

**2. Run CLI** (happens on a GH workflow runner, maybe a separate workflow triggered by step 1)
- For each repo that matches the conditions in step 1, call https://github.com/neurobagel/bulk_annotations/blob/main/datalad_get_single_dataset.sh and pass in the repo/dataset ID
- Steps covered by the script:
    - Pull the `latest` CLI docker image
    - datalad clones the repo
        - using the default branch
        - need to ensure that any annotations have been made on this branch
    - Get the human-readable dataset name
    - Run the dockerized CLI
    - Copy the output pheno-BIDS JSONLD file to some place
        - Where should the output of the CLI run go?
            - A separate repo, with all JSONLD files in one folder for easy uploading (`openneuro-annotations/data`?)

**Nice to haves**
- The workflow that runs the CLI would ideally write to a persistent log file with the name of the failed dataset and the error (probably just printing the STDOUT/STDERR of the CLI itself is enough)
    - And the ratio of succeeded / total datasets
- Questions:
    - What should happen to the output of the CLI until it gets into the graph?
        - someone comes and does things

# Remaining questions

- Where should the openneuro-uploader API live?
- Should users have to authenticate before being allowed to push?
- Should we keep a 'backup' of the participants.json files from the forks somewhere, in case things go wrong?
- How do we handle the default branch changing in the upstream? (this is already a problem for some datasets in our existing forks)

# Other Steps

1) When upstream changes
    a) We need to detect that there's been a change in an upstream. When this happens:
    b) We run some scripts where the outcome is: our fork of the upstream looks exactly like the upstream, except the participants.json also has our annotations (or rather, the stuff that we added)
2)

# Resources

- https://github.com/OpenNeuroDatasets-JSONLD/.github/issues/16
- https://github.com/neurobagel/planning/issues/77
- https://github.com/OpenNeuroDatasets-JSONLD/.github/issues/17
- [Current process to update our OpenNeuroDataset forks](https://github.com/neurobagel/documentation/wiki/Annotating-OpenNeuro-datasets#current-process-to-update-and-re-process-the-neurobagel-openneuro-forks-and-graph)
- https://miro.com/app/board/uXjVNj-rvas=/?share_link_id=358313471441
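Finally, the sketch referenced from Phase 4's change-detection step, assuming a sha.txt lookup file with `<repo> <sha>` lines; the file names and todo-list mechanism here are our assumptions, not the actual `sha_scraper.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

FORK_ORG="OpenNeuroDatasets-JSONLD"
SHA_FILE="sha.txt"       # "<repo> <sha>" per line, last processed version of each repo
TODO_FILE="todo.txt"     # repos that need a CLI run this pass
: > "$TODO_FILE"

gh repo list "$FORK_ORG" --limit 2000 --json name --jq '.[].name' |
while read -r repo; do
    branch=$(gh api "repos/$FORK_ORG/$repo" --jq .default_branch)
    current=$(gh api "repos/$FORK_ORG/$repo/commits/$branch" --jq .sha)
    previous=$(awk -v r="$repo" '$1 == r {print $2}' "$SHA_FILE")
    # Unseen repo, or seen but changed -> candidate for a CLI run.
    # (The "does it have participants.json AND participants.tsv" check from
    # step 1 is omitted here for brevity.)
    if [[ "$current" != "$previous" ]]; then
        echo "$repo" >> "$TODO_FILE"
        # NB: sha.txt is deliberately NOT updated here -- per the sha.txt section
        # above, entries are only updated after the CLI succeeds (or is skipped
        # for missing files), so failed runs get retried on the next pass
    fi
done
```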