# File mismatching in omni_clustering
###### tags: `omb`
### Problem
`Count files` from the `koh` dataset were matched with `metadata` files from the `koh_ttc` dataset. The current input file matching relies on name patterns, which is not robust for files with similar names (here `koh` is a prefix of `koh_ttc`). We need lineage-based matching!
```python=1
omni_obj.inputs.input_files
> {'filter_expression_1d45b_koh':
{'filtered_rawcounts':
'data/filter_expression/filter_expression_koh_filtered_counts.mtx.gz',
'meta_file':
'data/filter_expression/koh_ttc_meta.json'},
... }
```
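
A minimal sketch of why name-based matching breaks here. It assumes the matching boils down to a substring test of the dataset name against the file name; this is an illustration only and not omnibenchmark's actual matching code. `naive_match` and `exact_match` are hypothetical helpers.

```python
from pathlib import Path


def naive_match(dataset, files):
    """Return all files whose name contains the dataset name (substring test)."""
    return [f for f in files if dataset in Path(f).name]


def exact_match(dataset, files):
    """Return only files named exactly '<dataset>_meta.json'."""
    return [f for f in files if Path(f).name == f"{dataset}_meta.json"]


meta_files = [
    "data/filter_expression/koh_meta.json",
    "data/filter_expression/koh_ttc_meta.json",
]

# The substring test cannot tell koh from koh_ttc, so both files qualify
print(naive_match("koh", meta_files))
# An exact name comparison only keeps koh_meta.json
print(exact_match("koh", meta_files))
```

Stricter name rules only reduce such collisions; matching on file lineage would avoid depending on names at all.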
:::info
#### Manual fix:
1. Remove mismatched files and revert associated activities
2. Manually force correct matching
3. Check downstream activities
:::
### Expected hashes
| Dataset           | File      | Correct hash | Incorrect hash |
| ----------------- | --------- | ------------ | -------------- |
| filter_expression | logcounts | 38d89        | 59b80          |
| filter_expression | counts    | 9ad5c        | 1d45b          |
| filter_m3drop     | logcounts | 2eae1        | edc0f          |
| filter_m3drop     | counts    | 7dfac        | cfc86          |
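
The table can also be used programmatically to flag mismatched input groups. The sketch below assumes input keys follow the `<dataset>_<hash>_koh` pattern seen in `omni_obj.inputs.input_files`; `EXPECTED_HASHES` and `find_incorrect_keys` are hypothetical names, not part of omnibenchmark.

```python
# Hashes copied from the table: (dataset, file type) -> (correct, incorrect)
EXPECTED_HASHES = {
    ("filter_expression", "logcounts"): ("38d89", "59b80"),
    ("filter_expression", "counts"): ("9ad5c", "1d45b"),
    ("filter_m3drop", "logcounts"): ("2eae1", "edc0f"),
    ("filter_m3drop", "counts"): ("7dfac", "cfc86"),
}


def find_incorrect_keys(input_keys):
    """Return the input keys that contain one of the known incorrect hashes."""
    bad_hashes = {incorrect for _, incorrect in EXPECTED_HASHES.values()}
    return [key for key in input_keys if any(f"_{h}_" in key for h in bad_hashes)]


keys = ["filter_expression_1d45b_koh", "filter_m3drop_7dfac_koh"]
print(find_incorrect_keys(keys))  # ['filter_expression_1d45b_koh']
```

In a project, `keys` would be `list(omni_obj.inputs.input_files)`.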
---
### Step 1: Remove mismatched files and revert related activities:
1. Check whether files are mismatched
*within console:*
```python=1
omni_obj.inputs.input_files
```
*within terminal:*
```bash!
renku workflow outputs | grep "koh__"
```
2. Get outputs from mismatched files:
```python=1
from omnibenchmark.utils.build_omni_object import get_omni_object_from_yaml
from omnibenchmark.renku_commands.general import renku_save
from omnibenchmark.management.run_commands import revert_run
# Load config
omni_obj = get_omni_object_from_yaml('src/config.yaml')
# Collect the cluster_res outputs that belong to the koh dataset
out_mis = [
    out_map['output_files']['cluster_res']
    for out_map in omni_obj.outputs.file_mapping
    if "koh__" in out_map['output_files']['cluster_res']
]
# Revert the runs that produced them
revert_run(out_files=out_mis, dataset_name=omni_obj.name)
renku_save()
```
3. Check from terminal
```bash!
renku workflow outputs | grep "koh__"
```
:::warning
If this still shows files (e.g. with the correct hash) after the cleanup, __please__ check which datasets these files belong to. Some of the methods still have old output datasets. __Only__ keep those activities if you are sure they represent valid runs! Otherwise remove them manually, replacing `OUT_FILES` with the remaining files and `DATASET_NAME` with the name of the dataset they belong to:
```python=1
from renku.api import Activity, Plan
from omnibenchmark.renku_commands.workflows import renku_workflow_revert
from omnibenchmark.management.data_commands import unlink_dataset_files
# Get activities for those orphan outputs
activity_ids = [Activity.filter(outputs=out_fi) for out_fi in OUT_FILES]
# Make sure they are no longer linked to any dataset
unlink_dataset_files(out_files=OUT_FILES, dataset_name=DATASET_NAME, remove=True)
# Revert the activities
for act_id in activity_ids:
    renku_workflow_revert(activity_id=act_id[0]._activity.id, plan=False)
```
:::
### Step 2: Manually force correct matching
This is very ugly and I wish there was an easier way to overwrite automatic inputs (one day ...). Since we plan to switch to lineages anyway, I would rather not wrangle too much with the current system.
Add this to `run_workflow.py` after `omni_obj.update_object()`:
For logcounts:
```python=1
######################## fix for file matching of koh dataset #######################
#remove mismatched inputs
del omni_obj.outputs.inputs.input_files['filter_expression_59b80_koh']
del omni_obj.outputs.inputs.input_files['filter_m3drop_edc0f_koh']
omni_obj.outputs.file_mapping = []
omni_obj.outputs.inputs.default = 'filter_expression_38d89_koh'
#add correctly matched
omni_obj.outputs.inputs.input_files['filter_expression_38d89_koh'] = {
    'filtered_logcounts': 'data/filter_expression/filter_expression_koh_filtered_logcounts.mtx.gz',
    'meta_file': 'data/filter_expression/koh_meta.json',
}
omni_obj.outputs.inputs.input_files['filter_m3drop_2eae1_koh'] = {
    'filtered_logcounts': 'data/filter_m3drop/filter_m3drop_koh_filtered_logcounts.mtx.gz',
    'meta_file': 'data/filter_m3drop/koh_meta.json',
}
#update outputs
omni_obj.outputs.update_outputs()
######################################################################################
```
For counts:
```python=1
######################## fix for file matching of koh dataset #######################
#remove mismatched inputs
del omni_obj.outputs.inputs.input_files['filter_expression_1d45b_koh']
del omni_obj.outputs.inputs.input_files['filter_m3drop_cfc86_koh']
omni_obj.outputs.file_mapping = []
omni_obj.outputs.inputs.default = 'filter_expression_9ad5c_koh'
#add correctly matched
omni_obj.outputs.inputs.input_files['filter_expression_9ad5c_koh'] = {
    'filtered_rawcounts': 'data/filter_expression/filter_expression_koh_filtered_counts.mtx.gz',
    'meta_file': 'data/filter_expression/koh_meta.json',
}
omni_obj.outputs.inputs.input_files['filter_m3drop_7dfac_koh'] = {
    'filtered_rawcounts': 'data/filter_m3drop/filter_m3drop_koh_filtered_counts.mtx.gz',
    'meta_file': 'data/filter_m3drop/koh_meta.json',
}
#update outputs
omni_obj.outputs.update_outputs()
omni_obj.command.outputs = omni_obj.outputs
omni_obj.command.update_command()
######################################################################################
```
If possible, check with:
```python=1
omni_obj.check_run()
```
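
Both fix blocks in this step follow the same pattern: drop the stale input groups, then add the correctly matched ones. As a rough sketch, that pattern could be wrapped in a helper that works on a plain dictionary. `swap_input_groups` is a hypothetical name, and the dictionary below only stands in for `omni_obj.outputs.inputs.input_files`; resetting `file_mapping`, setting `default` and calling `update_outputs()` would still have to happen on the real object.

```python
def swap_input_groups(input_files, stale_keys, replacements):
    """Drop stale input groups and add the correctly matched ones, in place."""
    for key in stale_keys:
        # pop with a default so a rerun does not fail on already removed keys
        input_files.pop(key, None)
    input_files.update(replacements)
    return input_files


# Stand-in for omni_obj.outputs.inputs.input_files (counts case)
input_files = {
    "filter_expression_1d45b_koh": {
        "filtered_rawcounts": "data/filter_expression/filter_expression_koh_filtered_counts.mtx.gz",
        "meta_file": "data/filter_expression/koh_ttc_meta.json",
    },
}

swap_input_groups(
    input_files,
    stale_keys=["filter_expression_1d45b_koh", "filter_m3drop_cfc86_koh"],
    replacements={
        "filter_expression_9ad5c_koh": {
            "filtered_rawcounts": "data/filter_expression/filter_expression_koh_filtered_counts.mtx.gz",
            "meta_file": "data/filter_expression/koh_meta.json",
        },
    },
)
print(sorted(input_files))  # ['filter_expression_9ad5c_koh']
```

Unlike `del`, `pop(key, None)` does not raise a `KeyError` when a key is already gone, which makes the fix safe to rerun.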
### Step 3: Check downstream activities:
In theory, things should clean themselves up in downstream projects, but I'm not sure that works. So check whether outputs from the koh dataset are still used as inputs in downstream projects with:
```bash!
renku workflow inputs | grep "koh__"
```
Outputs from the following projects are already used in metric workflows:
* monocle (both)
* flowsom
* raceid
* cidr
* ascend
* r-seurat (is this calculated on both logcounts and raw counts on purpose?)
### Checklist:
- [x] [tscan](https://renkulab.io/projects/omnibenchmark/omni_clustering/tscan-clustering)
- [x] [ascend](https://renkulab.io/projects/omnibenchmark/omni_clustering/ascend-clustering)
- [x] [pca-hc](https://renkulab.io/gitlab/omnibenchmark/omni_clustering/pca-hc)
- [x] [cidr](https://renkulab.io/gitlab/omnibenchmark/omni_clustering/cidr-clustering)
- [x] [sc3-svm](https://renkulab.io/gitlab/omnibenchmark/omni_clustering/sc3-svm-clustering)
- [x] [flowsom](https://renkulab.io/gitlab/omnibenchmark/omni_clustering/flowsom-clustering)
- [x] [monocle](https://renkulab.io/gitlab/omnibenchmark/omni_clustering/monocle-clustering)
- [x] [sc3](https://renkulab.io/gitlab/omni_hackathon/omni_clustering/clustering-sc3)
- [x] [seurat](https://renkulab.io/gitlab/omni_hackathon/omni_clustering/r-seurat)
- [x] [rtsne-kmeans](https://renkulab.io/gitlab/omni_hackathon/omni_clustering/r-tsne-k-means-clustering)
- [x] [race_id2](https://renkulab.io/gitlab/omni_hackathon/omni_clustering/raceid2-clustering)
- [x] [pca_kmeans](https://renkulab.io/gitlab/omni_hackathon/omni_clustering/pcakmeans-clustering)