## Section 3: Analysing basic metrics of rolling circle data

We can now check several metrics from the RCA:

- Repeat count
- Overall read lengths
- Subread lengths
- Average read qualities

In addition, we can filter out reads that don't meet certain criteria, such as a minimum number of repeats or a minimum quality. I have a series of R scripts for this purpose in this [GitHub repository](https://github.com/M-Dunnet/C3POa_metrics_and_filters/tree/main). Information on requirements and how to use the scripts is within the GitHub repository, but in short:

- Graphs for all of the above metrics can be produced using the `Consensus_BasicMetrics.R` script.
- Consensus FASTA files can be split by repeat count using the `Subset_by_Repeats.R` script.
- Consensus FASTA files can be split by average quality using the `Subset_by_Quality.R` script.

The requirements for these scripts are:

- R version 4.3.1
- stringr_1.5.0
- Biostrings_2.68.1
- argparser_0.7.1

### Running Consensus_BasicMetrics.R

This script makes graphs from the information provided in the headers of C3POa consensus-called reads. The script will produce five graphs per file:

- Repeat count distribution
- Quality distribution
- Subread length distribution (includes splint and adapters)
- Trimmed subread length distribution (target read only)
- Overall length distribution

There is only one positional argument: a path string to the directory containing consensus-called reads.
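All of the above metrics come from the consensus FASTA headers themselves. As a minimal Python sketch of that parsing step — assuming the usual C3POa layout of underscore-separated header fields (`>name_avgQuality_originalLength_repeatCount_subreadLength`; check your C3POa version, as the exact layout may differ) — the per-read metrics and the repeat/quality filters can be expressed like this:

```python
def parse_c3poa_header(header):
    """Parse a C3POa consensus read header into its metric fields.

    Assumes underscore-separated fields in the order: name, average
    quality, original read length, repeat count, subread length.
    Adjust the indices if your C3POa version writes a different layout.
    """
    fields = header.lstrip(">").split("_")
    return {
        # the read name itself may contain underscores, so take from the right
        "name": "_".join(fields[:-4]),
        "avg_quality": float(fields[-4]),
        "original_length": int(fields[-3]),
        "repeats": int(fields[-2]),
        "subread_length": int(fields[-1]),
    }


def passes_filters(header, min_repeats=3, min_quality=20.0):
    """Apply the same style of thresholds as the subsetting scripts."""
    rec = parse_c3poa_header(header)
    return rec["repeats"] >= min_repeats and rec["avg_quality"] >= min_quality
```

For example, `passes_filters(">read1_24.6_12000_9_1290")` would keep a nine-repeat read with a mean quality of Q24.6.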
Usage: `Rscript ConsensusFASTA_BasicMetrics.R <input_directory>`

Example outputs:

<style>
table { border-collapse: collapse; }
table, th, td { border: 1px solid black; }
blockquote { border-left: solid blue; padding-left: 10px; }
</style>

| ![](https://hackmd.io/_uploads/ByHvPWyQ6.png) | ![](https://hackmd.io/_uploads/ByuGPbJQT.png) |
| -------- | -------- |
| ![](https://hackmd.io/_uploads/H1kEP-ymp.png) | ![](https://hackmd.io/_uploads/By4ldZ176.png) |

### Running Subset_by_Repeats.R or Subset_by_Quality.R

These scripts filter R2C2 consensus-called reads from C3POa based on read repeat count or average quality score. In regular mode (the default), the script takes all reads above a specified threshold (`--min_repeat` or `--min_quality`, which default to 3 and 20 respectively) and outputs them as a new file. In split mode, the script splits reads into multiple groups based on their repeat count/quality score and outputs multiple files. Reads cannot be placed into multi-repeat or multi-quality groups (i.e. each group consists of a single repeat number).

#### Arguments

Both scripts have two positional arguments:

- First: Input directory where `R2C2_full_length_consensus_reads.fasta` files are held.
- Second: Output directory where new files will end up.

##### Subset_by_Repeats.R optional arguments

```
--min_repeat    Minimum number of repeats to include in the output file. Defaults to 3 (minimum for proofreading)
--split         Turns on split mode; will split reads into multiple files by repeat count
--split_range   Integer. Sets the range for repeat splits, e.g. 0 - [INT]. Split mode only. Defaults to 10
```

##### Subset_by_Quality.R optional arguments

```
--min_quality   Minimum average quality to include in the output file. Defaults to 20 (1/100 error rate)
--split         Turns on split mode; will split reads into multiple files by quality score
--split_min     Integer. Sets the lower bound for splitting. Split mode only. Defaults to 20
--split_max     Integer.
                Sets the upper bound for splitting. Split mode only. Defaults to 40
--round         Rounds average quality to this numeric value, e.g. '2' will round to the nearest number divisible by 2. Only useful in split mode. Defaults to 2
```

## Section 4: Mapping

Mapping of consensus-called reads, either splints or target regions, is performed with Minimap2-2.26. Documentation and build instructions can be found [here](https://lh3.github.io/minimap2/minimap2.html) and [here](https://github.com/lh3/minimap2/releases). Samtools is also required.

The next step is to index the genome. We can do this from the minimap2 directory, or create a symlink to run it elsewhere. I prefer to create a symlink in the same directory as `MapFiles.sh`, as it makes mapping later on easier:

```
ln -s /Volumes/archive/cancergeneticslab/MichaelDunnet/Software/minimap2-2.26_x64-linux/minimap2
```

We can now access minimap2 and `MapFiles.sh` from the same place. Now to index the genome:

```
./minimap2 -d <genome_index.mmi> <genome.fasta>
```

We can now move on to mapping (either target reads or splints). I have written a script, `MapFiles.sh`, that maps, filters, and indexes multiple files. Navigate to the folder containing the reads to be mapped and run:

```
chmod +x MapFiles.sh
./MapFiles.sh <input dir> <output dir> <genome.fa> <genome.mmi>
```

To keep the command short, we can save the file locations as variables (edit these if your locations differ):

```
Input=/Volumes/archive/cancergeneticslab/MichaelDunnet/Sequencing_runs/<files>
Output=/Volumes/archive/cancergeneticslab/MichaelDunnet/Sequencing_runs/<output>
Genome=/Volumes/archive/cancergeneticslab/MichaelDunnet/Genomes/GRCh38_hg38/hg38.fa
Index=/Volumes/archive/cancergeneticslab/MichaelDunnet/Genomes/GRCh38_hg38/hg38.mmi
```

Then we can just run: `./MapFiles.sh $Input $Output $Genome $Index`

BAM files are now ready for further analysis. We can assess accuracy further by looking at differences to the reference, do variant calling, or both.
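`MapFiles.sh` itself lives alongside the data; as a rough sketch of what its per-file mapping step amounts to — assuming the standard `map-ont` minimap2 preset for Nanopore reads and a sorted, indexed BAM as output (check the script for the exact flags it uses) — the pipeline for one file can be assembled like this:

```python
def build_map_command(reads_fasta, index_mmi, out_bam, threads=4):
    """Assemble a minimap2 -> samtools sort -> samtools index pipeline.

    This is a sketch of the likely per-file step; the real MapFiles.sh
    may add filtering or use different flags.
    """
    return (
        f"./minimap2 -ax map-ont -t {threads} {index_mmi} {reads_fasta} "
        f"| samtools sort -o {out_bam} - "
        f"&& samtools index {out_bam}"
    )

# Looping this over every FASTA in $Input reproduces the batch behaviour
print(build_map_command("sample.fasta", "hg38.mmi", "sample.bam"))
```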
## Section 5: Assessing Accuracy Metrics

This section contains a series of scripts for assessing target enrichment and sequencing accuracy.

Requirements:

- pandas = 2.0.1
- matplotlib = 3.7.1
- numpy = 1.19.5
- pysam = 0.21.0
- seaborn = 0.21.0

### Mapping Statistics

Mapping statistics can be assessed with the script `Panel_coverage.py`. This will plot coverage histograms for each region within a BED file, as well as produce a TSV file containing the number of reads mapping to each target gene.

Arguments:

```
--input, -i     The directory location containing input BAM files
--output, -o    The directory location where files will end up
--bedfile, -b   Path string to the location of a bedfile. Only loci in the bed file will be analysed
```

Usage:

```
python3 Panel_coverage.py -i /path/to/input_dir -o /path/to/output_dir -b /path/to/bed.txt
```

Example coverage for MET, where each peak is an exon:

![MET_Coverage.png](https://hackmd.io/_uploads/Sy5LStlQp.png)

### Accuracy Metrics

A series of scripts that measure error relative to the reference, either (a) per kilobase or (b) per read.

#### Per kilobase

Two scripts measure differences from the reference per kilobase, and they function in the same way; however, one is aimed at comparing different repeat counts, `SNV_types_by_Repeat.py`, while the other is aimed at comparing different sample types, `SNV_types.py`. Both scripts output a grouped bar plot containing the total error per kb, as well as insertions, deletions, and mismatches per kb. In addition, the data are written to a TSV for further manipulation if required. `SNV_types_by_Repeat.py` is designed to be used downstream of repeat splitting from Section 3.
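The per-kilobase normalisation these scripts perform is simple arithmetic: each error class is divided by the total aligned bases and scaled to 1,000. A minimal sketch of that calculation (the actual scripts extract these counts from the BAM alignments via pysam):

```python
def errors_per_kb(insertions, deletions, mismatches, aligned_bases):
    """Normalise raw error counts to events per kilobase of aligned sequence."""
    scale = 1000.0 / aligned_bases
    per_kb = {
        "insertions": insertions * scale,
        "deletions": deletions * scale,
        "mismatches": mismatches * scale,
    }
    per_kb["total"] = sum(per_kb.values())
    return per_kb

# e.g. 2 insertions, 3 deletions, 5 mismatches over 2,000 aligned bases
# gives 1.0, 1.5, and 2.5 per kb, for a total error of 5.0 per kb
print(errors_per_kb(2, 3, 5, 2000))
```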
Arguments:

```
--input, -i     The directory location containing input BAM files
--output, -o    The directory location where files will end up
```

Usage:

```
python3 SNV_types.py -i /path/to/input_dir/ -o /path/to/output_dir/
python3 SNV_types_by_Repeat.py -i /path/to/input_dir/ -o /path/to/output_dir/
```

Example output for SNV_types:

![](https://hackmd.io/_uploads/SJlRIH-yX6.png)

Example output for SNV_types_by_Repeat:

![](https://hackmd.io/_uploads/rJj_xZk7p.png)

#### Per read

The script `EDperRead.py` counts the total errors per read and plots them as a histogram. It does not measure insertions, deletions, and mismatches separately. Multiple samples will be plotted as separate histograms within the same overall figure. This script has only one positional argument: the directory location containing input BAM files.

Usage:

```
python3 EDperRead.py /path/to/input_dir/
```

Example output:

![](https://hackmd.io/_uploads/r1GgbWkXa.png)

## Section 6: Variant Calling

*not finished*

The current variant calling software was built with two assumptions:

- There is no matched normal. This is because it takes up valuable sequencing real estate. Nevertheless, this is something I want to implement with the buffy coats.
- Data is in FASTA format. This is because we lose FASTQ format during consensus calling. This cannot be fixed unless we move away from Nanopore.

*add in current features*

### Requirements

- Python3
- pandas == 2.0.1
- numpy == 1.24.3
- matplotlib == 3.7.1
- pysam == 0.21.0
- alignparse == 0.6.1 *
- seaborn == 0.12.2

Changes have been made to alignparse in order for it to function better in our context. Specifically, within the `alignparse` library there is a module called `cs_tag` containing the function `cs_tag_to_mutation_string()`.
I have changed the option for deletions:

```diff
elif op_type == "deletion":
    deletion = "".join(
-        ["del", str(seq_loc), "to", str(seq_loc + len(cs_op) - 2)]
+        ["del", str(seq_loc), cs_op[1:].upper()]
    )
```

This change is required for the scripts to run properly and enables conversion of mutation strings to standard mutation nomenclature. Variant calling and filtering are performed with two scripts, both of which are currently at a development stage.

### Running the variant caller

Initial variant calling is performed with `Mutation_calling_v1.0.2.py`. This script will:

1. Extract mapping data with pysam in regions of interest
2. Find and extract UMI sequences from mapped reads
3. Find differences to the genome for each unique UMI
4. Translate the mutation information into a VCF-style format

The output VCF has the following columns as tab-separated values:

<style scoped>
table { font-size: 11px; }
</style>

| Column Number | Column Header | Column Info |
| -------- | -------- | -------- |
| 1 | GENE_SYMBOL | HUGO gene symbol for the region of interest |
| 2 | MUTATION_TYPE | Insertion (ins), deletion (del), or SNP |
| 3 | GENOMIC_WT_ALLELE | The reference genome allele |
| 4 | GENOMIC_MUT_ALLELE | The mutant allele |
| 5 | CHROMOSOME | Chromosome of the region of interest |
| 6 | GENOME_START | Start coordinate of the mutation (1-based indexing) |
| 7 | GENOME_END | End coordinate of the mutation (1-based indexing) |
| 8 | Mutation Counts | The number of filtered reads with the specified variant |
| 9 | F.DEPTH | Filtered reads (duplicate UMIs removed) covering this position |
| 10 | T.Depth | Total reads covering this position |
| 11 | VAF | Mutation Counts / F.DEPTH |

#### Arguments and Usage

##### Required arguments

```
--input, -i     The directory location containing input BAM files
--bedfile, -b   Path string to the location of a bedfile. Only loci in the bed file will be analysed
```

##### Optional arguments

```
--output, -o            The directory location where files will end up.
                        Defaults to the input directory
--UMI_cutoff, -u        Sets the cut-off for the number of UMI copies for a mutation to pass. Defaults to 1
--mut_proportion, -p    Float value between 0.0 and 1.0 that sets the cut-off for the proportion of reads in a UMI group containing a mutation. Defaults to 1.0 (all reads in the UMI group must contain the mutation)
--minimum_coverage, -c  Sets the minimum read coverage when determining relevant mutations. Defaults to 50
--minimum_frequency, -f Sets the minimum frequency when determining relevant mutations. Defaults to 0.05 (5%)
--maximum_frequency, -F Sets the maximum frequency when determining relevant mutations. Defaults to 0.95 (95%)
--graph_UMI             Toggles the mutation caller to produce a histogram of UMI occurrences
--graph_mutations       Toggles the mutation caller to produce a histogram of unique mutation occurrences
```

##### Example Usage

```
python3 Mutation_calling_v1.0.2.py -i <input dir> -b <bedfile.txt> -c 500 -f 0.01
```

Example histogram of individual UMI occurrences:

![](https://hackmd.io/_uploads/BknUAM1XT.png)

Example histogram of individual mutation occurrences:

![](https://hackmd.io/_uploads/rywYCM1XT.png)

### Running the variant filter

Variant filtering is performed with `Mutations_Filters.py`. This script will:

1. Check the VCF file mutations to detect whether they lie in homopolymer repeats
2. Cross-reference the VCF file with a list of known Nanopore errors

The first time it runs, the script will make an index file of the genome, e.g. for `genome.fa`, `genome.fa.fai` will be created. Creating the index takes around 30 seconds to a minute for the human genome, but only needs to be done once. Like BAM index files, the `genome.fa.fai` file needs to remain in the same directory as the `genome.fa` file. After indexing, sample processing should be very quick.

No specific filtering is performed; rather, a list of additional columns is added which enables user-specific filtering.
| Column Number | Column Header | Column Info |
| -------- | -------- | -------- |
| 12 | Ref_Seq | The four base pairs on either side of the variant region |
| 13 | Homo-Polymer | Yes if two or more repeated bases in the variant are the same |
| 14 | KNOWN_ERROR | Yes if this variant is present in the reference dataset |
| 15 | KNOWN_ERROR_VAF | Reference variant VAF |

##### Required arguments

```
--input, -i             The directory location containing input VCF files
--genome, -g            The path string to the genome fasta file
--cross_reference, -f   Path string to the cross-reference file
```

##### Optional arguments

```
--output, -o    The directory location where files will end up. Defaults to the input directory
```

##### Example Usage

```
python3 Mutations_Filters.py -i <input dir> -g <genome.fa> -f <cross-reference.tsv>
```
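Since the filter script only annotates, downstream filtering is left to the user. As one possible policy — a sketch, not part of the pipeline itself — variants flagged as homopolymer-associated or as known Nanopore errors can be dropped from the annotated TSV using the column names in the table above:

```python
import csv

def filter_variants(tsv_path, out_path):
    """Drop annotated variants flagged as homopolymer or known errors.

    One example of a user-side filtering policy on the annotated
    output; adjust the conditions to suit your own thresholds.
    """
    with open(tsv_path) as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        kept = 0
        for row in reader:
            # skip rows the filter script flagged in columns 13 and 14
            if row["Homo-Polymer"] == "Yes" or row["KNOWN_ERROR"] == "Yes":
                continue
            writer.writerow(row)
            kept += 1
    return kept
```

Variants that survive this pass can then be taken forward for interpretation.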