# File mismatching in omni_clustering

###### tags: `omb`

### Problem

`Count files` from the koh dataset were matched with `meta data` files from the koh_ttc dataset. The current input file matching is based on pattern matching, which is not robust for files with similar names. We need lineage-based matching!

```python
omni_obj.inputs.input_files
# > {'filter_expression_1d45b_koh': {
# >      'filtered_rawcounts': 'data/filter_expression/filter_expression_koh_filtered_counts.mtx.gz',
# >      'meta_file': 'data/filter_expression/koh_ttc_meta.json'},
# >  ... }
```

:::info
#### Manual fix:
1. Remove mismatched files and revert associated activities
2. Manually force correct matching
3. Check downstream activities
:::

### Expected hashes

| Dataset           | File      | Correct hash | Incorrect hash |
| ----------------- | --------- | ------------ | -------------- |
| filter_expression | logcounts | 38d89        | 59b80          |
| filter_expression | counts    | 9ad5c        | 1d45b          |
| filter_m3drop     | logcounts | 2eae1        | edc0f          |
| filter_m3drop     | counts    | 7dfac        | cfc86          |

---

### Step 1: Remove mismatched files and revert related activities:

1. Check whether files are mismatched

    *within console:*
    ```python
    omni_obj.inputs.input_files
    ```
    *within terminal:*
    ```bash!
    renku workflow outputs | grep "koh__"
    ```

2. Get outputs from mismatched files and revert the runs that produced them:

    ```python
    from omnibenchmark.utils.build_omni_object import get_omni_object_from_yaml
    from omnibenchmark.renku_commands.general import renku_save
    from omnibenchmark.management.run_commands import revert_run

    ## Load config
    omni_obj = get_omni_object_from_yaml('src/config.yaml')

    # Get mismatched outputs (all cluster results with "koh__" in their path)
    out_mis = [
        out_map['output_files']['cluster_res']
        for out_map in omni_obj.outputs.file_mapping
        if "koh__" in out_map['output_files']['cluster_res']
    ]

    # Revert the runs and save the project state
    revert_run(out_files=out_mis, dataset_name=omni_obj.name)
    renku_save()
    ```

3. Check from terminal

    ```bash!
    renku workflow outputs | grep "koh__"
    ```

:::warning
If this still shows files (e.g. with the correct hash) after the clean up, __please__ check the datasets these files belong to. Some of the methods have old output datasets left. __Only__ keep those activities if you are sure they represent valid runs! Otherwise remove them manually, replacing `OUT_FILES` with the remaining files and `DATASET_NAME` with the name of the dataset they belong to:

```python
from renku.api import Activity
from omnibenchmark.renku_commands.workflows import renku_workflow_revert
from omnibenchmark.management.data_commands import unlink_dataset_files

# OUT_FILES and DATASET_NAME are placeholders, fill them in before running.

# Get the activities for those orphan outputs (one list of activities per file)
activity_ids = [Activity.filter(outputs=out_fi) for out_fi in OUT_FILES]

# Make sure they are not linked to any dataset any longer
unlink_dataset_files(out_files=OUT_FILES, dataset_name=DATASET_NAME, remove=True)

# Revert the first activity found for each file
for act_id in activity_ids:
    renku_workflow_revert(activity_id=act_id[0]._activity.id, plan=False)
```
:::

### Step 2: Manually force correct matching

This is very ugly and I wish there was an easier way to overwrite automatic inputs (one day ...). As we plan to switch to lineages anyway, I would rather not wrangle too much with the current system.
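To make the failure mode concrete, here is a minimal sketch of why name-pattern matching confuses `koh` with `koh_ttc`. It is not omnibenchmark's actual matching code: `match_by_pattern` and `match_by_dataset` are hypothetical helpers, and the assumption that a meta file is named `<dataset>_meta.json` is only inferred from the paths shown in this note.

```python
from pathlib import PurePosixPath

# Meta files as they appear in this note (both live in the same folder).
META_FILES = [
    "data/filter_expression/koh_meta.json",
    "data/filter_expression/koh_ttc_meta.json",
]


def match_by_pattern(dataset, files):
    """Naive matching: keep any file whose name contains the dataset name."""
    return [f for f in files if dataset in PurePosixPath(f).name]


def match_by_dataset(dataset, files, suffix="_meta.json"):
    """Stricter matching: the file name must be exactly <dataset><suffix>."""
    return [f for f in files if PurePosixPath(f).name == dataset + suffix]


loose = match_by_pattern("koh", META_FILES)    # also picks up koh_ttc
strict = match_by_dataset("koh", META_FILES)   # only the koh meta file

print(loose)
print(strict)
```

The stricter variant only helps while file names follow one convention; matching on lineage would not depend on names at all, which is why the manual fix below is treated as a stopgap.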
Add this to `run_workflow.py` after `omni_obj.update_object()`:

For logcounts:

```python
######################## fix for file matching of koh dataset #######################
# remove mismatched inputs
del omni_obj.outputs.inputs.input_files['filter_expression_59b80_koh']
del omni_obj.outputs.inputs.input_files['filter_m3drop_edc0f_koh']
omni_obj.outputs.file_mapping = []
omni_obj.outputs.inputs.default = 'filter_expression_38d89_koh'

# add correctly matched inputs
omni_obj.outputs.inputs.input_files['filter_expression_38d89_koh'] = {
    'filtered_logcounts': 'data/filter_expression/filter_expression_koh_filtered_logcounts.mtx.gz',
    'meta_file': 'data/filter_expression/koh_meta.json',
}
omni_obj.outputs.inputs.input_files['filter_m3drop_2eae1_koh'] = {
    'filtered_logcounts': 'data/filter_m3drop/filter_m3drop_koh_filtered_logcounts.mtx.gz',
    'meta_file': 'data/filter_m3drop/koh_meta.json',
}

# update outputs
omni_obj.outputs.update_outputs()
######################################################################################
```

For counts:

```python
######################## fix for file matching of koh dataset #######################
# remove mismatched inputs
del omni_obj.outputs.inputs.input_files['filter_expression_1d45b_koh']
del omni_obj.outputs.inputs.input_files['filter_m3drop_cfc86_koh']
omni_obj.outputs.file_mapping = []
omni_obj.outputs.inputs.default = 'filter_expression_9ad5c_koh'

# add correctly matched inputs
omni_obj.outputs.inputs.input_files['filter_expression_9ad5c_koh'] = {
    'filtered_rawcounts': 'data/filter_expression/filter_expression_koh_filtered_counts.mtx.gz',
    'meta_file': 'data/filter_expression/koh_meta.json',
}
omni_obj.outputs.inputs.input_files['filter_m3drop_7dfac_koh'] = {
    'filtered_rawcounts': 'data/filter_m3drop/filter_m3drop_koh_filtered_counts.mtx.gz',
    'meta_file': 'data/filter_m3drop/koh_meta.json',
}

# update outputs and propagate them to the command
omni_obj.outputs.update_outputs()
omni_obj.command.outputs = omni_obj.outputs
omni_obj.command.update_command()
######################################################################################
```

If possible, check with:

```python
omni_obj.check_run()
```

### Step 3: Check downstream activities:

In theory, downstream projects should clean up by themselves, but I'm not sure that works. So check for outputs from the koh dataset in downstream projects with:

```bash!
renku workflow inputs | grep "koh__"
```

Outputs from the following projects are already used in metric workflows:

* monocle (both)
* flowsom
* raceid
* cidr
* ascend
* r-seurat (Is this calculated on both logcounts and raw counts on purpose?)

### Checklist:

- [x] [tscan](https://renkulab.io/projects/omnibenchmark/omni_clustering/tscan-clustering)
- [x] [ascend](https://renkulab.io/projects/omnibenchmark/omni_clustering/ascend-clustering)
- [x] [pca-hc](https://renkulab.io/gitlab/omnibenchmark/omni_clustering/pca-hc)
- [x] [cidr](https://renkulab.io/gitlab/omnibenchmark/omni_clustering/cidr-clustering)
- [x] [sc3-svm](https://renkulab.io/gitlab/omnibenchmark/omni_clustering/sc3-svm-clustering)
- [x] [flowsom](https://renkulab.io/gitlab/omnibenchmark/omni_clustering/flowsom-clustering)
- [x] [monocle](https://renkulab.io/gitlab/omnibenchmark/omni_clustering/monocle-clustering)
- [x] [sc3](https://renkulab.io/gitlab/omni_hackathon/omni_clustering/clustering-sc3)
- [x] [seurat](https://renkulab.io/gitlab/omni_hackathon/omni_clustering/r-seurat)
- [x] [rtsne-kmeans](https://renkulab.io/gitlab/omni_hackathon/omni_clustering/r-tsne-k-means-clustering)
- [x] [race_id2](https://renkulab.io/gitlab/omni_hackathon/omni_clustering/raceid2-clustering)
- [x] [pca_kmeans](https://renkulab.io/gitlab/omni_hackathon/omni_clustering/pcakmeans-clustering)
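
When working through a project from the checklist, the inspection of `input_files` could be scripted roughly as follows. This is a sketch only: `find_mismatches` is a hypothetical helper, not part of omnibenchmark, and it assumes `input_files` has the dict layout shown in this note (keys ending in the dataset name, each entry carrying a `meta_file` path named `<dataset>_meta.json`). The hashes are the incorrect ones from the expected-hashes table.

```python
from pathlib import PurePosixPath

# Incorrect hashes taken from the "Expected hashes" table in this note.
INCORRECT_HASHES = {"59b80", "1d45b", "edc0f", "cfc86"}


def find_mismatches(input_files, dataset="koh"):
    """Return the keys of `dataset` entries that look mismatched.

    An entry is flagged if its key carries a known incorrect hash or if its
    meta file is missing or not named exactly <dataset>_meta.json.
    """
    flagged = []
    for key, files in input_files.items():
        if not key.endswith("_" + dataset):
            continue  # entry belongs to another dataset
        bad_hash = any(part in INCORRECT_HASHES for part in key.split("_"))
        meta_name = PurePosixPath(files.get("meta_file", "")).name
        bad_meta = meta_name != f"{dataset}_meta.json"
        if bad_hash or bad_meta:
            flagged.append(key)
    return flagged


# Example input mirroring the entries quoted in this note.
example = {
    "filter_expression_1d45b_koh": {
        "filtered_rawcounts": "data/filter_expression/filter_expression_koh_filtered_counts.mtx.gz",
        "meta_file": "data/filter_expression/koh_ttc_meta.json",
    },
    "filter_expression_9ad5c_koh": {
        "filtered_rawcounts": "data/filter_expression/filter_expression_koh_filtered_counts.mtx.gz",
        "meta_file": "data/filter_expression/koh_meta.json",
    },
}

print(find_mismatches(example))  # ['filter_expression_1d45b_koh']
```

In a project this would be called on the real mapping, e.g. `find_mismatches(omni_obj.inputs.input_files)`; an empty list suggests the manual fix took effect, but it does not replace the `renku workflow` checks above.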