---
tags: Great Salt Lakes
---

# Checking there are no duplicate headers in fastq files

[toc]

---

I'm sure there is something already out there that would do this for us, but I couldn't find it. So I wrote something in python and added it to my [bioinformatics toolkit](https://github.com/AstrobioMike/bioinf_tools/tree/v1.8.25#bioinformatics-tools-bit). I think the easiest way for you to use it will be through conda. So you'll have to install that first, unfortunately, but I don't think that will cause any problems or delays for you at this point :)

## Installing conda

Conda is installed at the command line. There are installation instructions on my intro to conda page [here](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda). Be sure to grab the installer that is appropriate for your system, and alter the file we are pointing to as needed (e.g. as commented out in the example code there).

## Installing the [bit](https://github.com/AstrobioMike/bioinf_tools/tree/v1.8.25#bioinformatics-tools-bit) package with conda

We are ignoring extraneous details about conda here. If you want to know more about it at some point, the rest of the [intro](https://astrobiomike.github.io/unix/conda-intro) with those installation instructions is a good place to start.

At the command line, we first need to run this command, which installs the package in its own contained environment:

```bash
conda create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit
```

When that's done, we need to enter the new environment with this:

```bash
conda activate bit
```

We can now see that "(bit)" precedes our command-line prompt, telling us we are in the environment we just created. We need to be in this environment for the programs to be available, so if we opened a new terminal window, we'd need to run `conda activate bit` again.

### Checking there are no duplicate fastq headers

Now to check our files, we run the program like so (replacing the input file with the appropriate one on our system):

```bash
bit-check-for-fastq-dup-headers -i test-R1.fq.gz
```

This reports back whether or not there are any duplicates:

![](https://i.imgur.com/ybPYdZ8.png)

Whereas running it on a file that does have duplicate headers looks like this:

```bash
bit-check-for-fastq-dup-headers -i test-R1-with-dups.fq.gz
```

![](https://i.imgur.com/ZqhxBII.png)

---

> Running it on just a forward read file would be enough to tell us if there's a problem, but we can run it on forward and reverse reads if we want. I expect you will just get a report back saying there are no duplicate headers found. If not, then we'll figure out how to deal with it :)
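
For anyone curious what a check like this involves, here is a minimal sketch in python. To be clear, this is not the actual code behind `bit-check-for-fastq-dup-headers`; the function name `find_duplicate_headers` is made up for illustration. It assumes strict 4-line fastq records (no wrapped sequence lines) and treats two headers as duplicates only if the full header line matches.

```python
import gzip
from collections import Counter


def find_duplicate_headers(fastq_path):
    """Return {header: count} for every header line seen more than once.

    Hypothetical helper for illustration only, not part of the bit package.
    """
    # Read gzipped or plain-text fastq, chosen by the file extension.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Assumption: 4 lines per record, so headers are lines 0, 4, 8, ...
            if line_number % 4 == 0:
                counts[line.rstrip("\n")] += 1
    # Keep only the headers that occurred more than once.
    return {header: n for header, n in counts.items() if n > 1}
```

Calling `find_duplicate_headers("test-R1.fq.gz")` would return an empty dictionary when every header is unique. Note that this holds every header in memory at once, which is fine for a sketch but may be heavy for very large files.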