
---
tags: Howto
---

Datasets
===

Training datasets
---

### CRISPR HMP reads (binary classification set)

:::info
This dataset consists of reads of CRISPR regions (_crispr reads_) and reads of non-CRISPR regions (_negative dataset_) from the [HMP1-II study](https://www.hmpdacc.org/hmp/).
:::

The current dataset was generated by René and consists of 5 files in `/net/sgi/genomenet/data/randomReads_vs_crispr/train`

```bash
-rw-rw-r--+ 1 rmreches algbio-core 124M Jan 30 16:15 e.fasta
-rw-rw-r--+ 1 rmreches algbio-core 123M Jan 30 16:09 d.fasta
-rw-rw-r--+ 1 rmreches algbio-core 126M Jan 30 16:08 c.fasta
-rw-rw-r--+ 1 rmreches algbio-core 79M Jan 30 16:06 b.fasta
-rw-rw-r--+ 1 rmreches algbio-core 120M Jan 30 16:04 a.fasta
```

and 1 file in `/net/sgi/genomenet/data/randomReads_vs_crispr/validation`

```
-rw-rw-r--+ 1 rmreches algbio-core 119M Jan 30 16:07 v.fasta
```

While the CRISPR data (i.e. all entries that start with `>crispr`) should be correct, the negative data (all entries that start with `>randomReads`) should be replaced with the new negative dataset outlined below. The current negative set was not drawn from the HMP1-II study in the same way as the CRISPR data and therefore does not have the same body-site ratio.
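As a rough sketch of how the two classes can be told apart in the existing files, the shell functions below count or extract records by header prefix. The function names are hypothetical and not part of any existing tooling; only the `>crispr` and `>randomReads` prefixes come from the description above.

```bash
#!/usr/bin/env bash
# Sketch: inspect or split a mixed FASTA file by header prefix.
# count_by_prefix and extract_by_prefix are hypothetical helper names.

# count_by_prefix FILE PREFIX
# Prints the number of records whose header line starts with ">PREFIX".
count_by_prefix() {
  # grep -c exits with status 1 when the count is 0, so ignore the status.
  grep -c "^>$2" "$1" || true
}

# extract_by_prefix FILE PREFIX
# Prints every record (header plus all of its sequence lines) whose header
# starts with ">PREFIX"; works for single-line and multi-line FASTA.
extract_by_prefix() {
  awk -v p=">$2" '/^>/ { keep = (index($0, p) == 1) } keep' "$1"
}

# Example (paths as listed above, output file name is an assumption):
#   count_by_prefix a.fasta crispr
#   count_by_prefix a.fasta randomReads
#   extract_by_prefix a.fasta crispr > a.crispr_only.fasta
```

Keeping only the `>crispr` records this way would leave the positive set untouched while the `>randomReads` records are swapped out.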
The new *negative* dataset with random reads from HMP1-II samples is located at:

```bash
luna-gpucompute-01.bifo.helmholtz-hzi.de:/net/sgi/genomenet/data/randomReads_vs_crispr/reads_neg.fastq.tar.gz
```

This is in the `fastq` format, which contains quality information for each sequence in addition to the nucleotides. It can be transformed to `fasta` using

```bash
# Requires GNU sed (first~step addresses); assumes 4-line FASTQ records
sed -n '1~4s/^@/>/p;2~4p' file.fastq > file.fasta
```

So we should either update the `>randomReads` entries located at `/net/sgi/genomenet/data/train` and `/net/sgi/genomenet/data/validation`, or create a new dataset from scratch using the raw CRISPR reads from `/net/sgi/genomenet/data/all_hmp1-II_reads/all_hmp1-II_reads.fasta` (or [here from dropbox](https://www.dropbox.com/s/lbdyiw8i7slbii8/hmp1-II-crispr-reads.tar.gz?dl=0)).

See also [CRISPR detection task](https://hackmd.io/vrCp0yG0TjS4pqYeRb10pQ?both)

Removal of reads
---

:::warning
The negative dataset consists of reads drawn at random from the raw dataset and might include CRISPR reads. So maybe we should check for and remove reads from the negative dataset that are present in the CRISPR positive dataset. Both datasets may also contain copies of the same reads, which can be removed (?).
:::

To remove CRISPR reads from the negative set we use [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml).

```bash!
# Build a bowtie2 index from the CRISPR reads
bowtie2-build all_hmp1-II_reads.fasta all_hmp1-II_reads --threads 50 --large-index
# Align the negative reads against the index; reads that do not align are written to reads_neg_filtered.fasta
bowtie2 -p 50 -x all_hmp1-II_reads -f reads_neg.fasta -S alignment.sam --un reads_neg_filtered.fasta
```

Models
---

==TODO==