# Code and scripts used for the array vs. WGS comparison

### Basic starting commands:

```
ssh vanburen@farm.hpc.ucdavis.edu
```
```
srun -p med2 --time=9:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 10GB --pty /bin/bash
```
```
srun -p high2 --time=8:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 10GB --pty /bin/bash
```

### Merging VCF files

Here: ```/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis```

Created sample lists of which horses are in which VCFs:

For ```extractsamples_sept25.txt```
```
NAD49
NAD101
NAD216
NAD220
NAD228
NAD157
NAD197
NAD205
NAD211
NAD217
NAD238
NAD264
```
For ```extractsamples_pioneer.txt```
```
241
416
608
853
854
55
219
408
483
607
```

Update indexes:
```
for vcf in *.vcf.gz; do
    bcftools index "$vcf"
done
```

Extracting samples from the VCFs:
```
#!/bin/bash

# Directory containing the VCF files
vcf_dir="/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/Pioneer"

# Directory for the filtered VCF output (same as input directory)
output_dir="${vcf_dir}"

# List of samples to extract
sample_list="241,416,608,853,854,55,219,408,483,607"

# Loop through each VCF file
for vcf_file in ${vcf_dir}/chr{1..31}.fixed.vcf.gz ${vcf_dir}/chrX.fixed.vcf.gz; do
    # Base name of the VCF file (e.g., chr1.fixed.vcf.gz)
    base_name=$(basename ${vcf_file})

    # Output file name (e.g., filtered_chr1.fixed.vcf.gz)
    output_file="${output_dir}/filtered_${base_name}"

    echo "Processing ${vcf_file}..."

    # Extract just the listed samples
    bcftools view -s ${sample_list} ${vcf_file} -Oz -o ${output_file}
done

echo "Filtering complete."
```
```
#!/bin/bash

# Directory containing the VCF files
vcf_dir="/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/Sep25_2022"

# Directory for the filtered VCF output (same as input directory)
output_dir="${vcf_dir}"

# List of samples to extract
sample_list="NAD49,NAD101,NAD216,NAD220,NAD228,NAD157,NAD197,NAD205,NAD211,NAD217,NAD238,NAD264"

# Loop through each VCF file
for vcf_file in ${vcf_dir}/*.vcf.gz; do
    # Base name of the VCF file (e.g., chr1.vcf.gz)
    base_name=$(basename ${vcf_file})

    # Output file name (e.g., filtered_chr1.vcf.gz)
    output_file="${output_dir}/filtered_${base_name}"

    echo "Processing ${vcf_file}..."

    # Extract just the listed samples
    bcftools view -s ${sample_list} ${vcf_file} -Oz -o ${output_file}
done

echo "Filtering complete."
```

Now to merge both files for each chromosome into one:
```
#!/bin/bash

# Directories
filtered_vcf_dir="/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/Sep25_2022"
pioneer_vcf_dir="/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/Pioneer"
output_vcf_dir="/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/merged"

# Create output directory if it does not exist
mkdir -p ${output_vcf_dir}

# Loop through each chromosome (assuming the naming pattern is consistent)
for chr in {1..31} X; do
    # File paths
    filtered_vcf="${filtered_vcf_dir}/filtered_chr${chr}.vcf.gz"
    pioneer_vcf="${pioneer_vcf_dir}/filtered_chr${chr}.fixed.vcf.gz"
    output_vcf="${output_vcf_dir}/merged_chr${chr}.fixed.vcf.gz"

    # Check that both files exist
    if [[ -f ${filtered_vcf} && -f ${pioneer_vcf} ]]; then
        echo "Merging ${filtered_vcf} and ${pioneer_vcf} into ${output_vcf}"

        # Merge the VCF files
        bcftools merge ${filtered_vcf} ${pioneer_vcf} -Oz -o ${output_vcf}
    else
        echo "One or both files for chromosome ${chr} do not exist. Skipping."
    fi
done
```

Now, need to merge all the per-chromosome VCFs into one:
```
# Extract the header from the first VCF file
zgrep '^#' merged_chr1.fixed.vcf.gz > merged_all.vcf

# Concatenate all the data lines (non-header) from each VCF file.
# Note: with multiple input files, zgrep prefixes each line with its
# filename — this is the "weird naming convention" cleaned up below.
zgrep -v '^#' merged_chr{1..31}.fixed.vcf.gz merged_chrX.fixed.vcf.gz >> merged_all.vcf
```

Now to remove Bien (483):
```
vcftools --remove-indv 483 --vcf merged_all.vcf --recode --out merged_all_with483removed.vcf
```

Then zipped and indexed the VCF:
```
bgzip merged_all_with483removed.vcf.recode.vcf
tabix -p vcf merged_all_with483removed.vcf.recode.vcf.gz
mv merged_all_with483removed.vcf.recode.vcf.gz merged_all_with483removed.vcf.gz
```

Remove the filename-prefix naming convention in the VCF positions:
```
zcat merged_all_with483removed.vcf.gz | \
sed 's/^\(merged_chr[0-9]*\.fixed\.vcf\.gz\):chr\([0-9]*\|X\|Y\)/chr\2/' | \
bgzip -c > simple.vcf.gz
```
```
tabix -p vcf simple.vcf.gz
```
```
zcat simple.vcf.gz | \
sed 's/^##contig=<ID=merged_chr\([0-9XY]*\).fixed.vcf.gz:chr\([0-9XY]*\)>/##contig=<ID=chr\2>/' | \
sed 's/^merged_chr\([0-9XY]*\).fixed.vcf.gz:chr\([0-9XY]*\)/chr\2/' | \
bgzip -c > simple_fixed.vcf.gz
```
```
tabix -p vcf simple_fixed.vcf.gz
```

___

### Array data

Using Axiom Analysis Suite, got the raw CEL files and filtered using the following parameters:

![Screen Shot 2024-10-09 at 1.12.55 PM](https://hackmd.io/_uploads/rk1JND4y1x.png)

The outputted TPED file has 623469 SNPs.
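The comparison script later in this log parses this TPED directly. As a minimal sketch of the TPED layout it assumes (four metadata columns — chromosome, SNP ID, genetic distance, base-pair position — then two allele columns per sample; the line below is made up for illustration, not a real probe):

```python
def parse_tped_line(line):
    """Split one TPED line into (chrom, pos, genotype pairs)."""
    parts = line.split()
    chrom, snp_id, _cm, pos = parts[0], parts[1], parts[2], int(parts[3])
    # Two allele columns per sample; '0 0' means a no-call on the array
    genotypes = [parts[i:i + 2] for i in range(4, len(parts), 2)]
    return chrom, pos, genotypes

# Hypothetical two-sample line for illustration
chrom, pos, gts = parse_tped_line("3 AX-000000 0 36979560 C T 0 0")
print(chrom, pos, gts)  # 3 36979560 [['C', 'T'], ['0', '0']]
```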
Need to match NAD219 to 219 in the VCF:
```
sed -i 's/NAD219/219/g' SamsArrayGenotyping_pleasegodwork.tfam
```

### Here is my final script that takes into account indels, no calls, etc: ```completescript.py```
* adjusted as necessary for runs with or without the two NAD horses (i.e., the coverage files or the NAD ones)

```
import gzip
from collections import defaultdict, Counter
import pickle
import os

# File paths
tped_file = '/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/arraydata/SamsArrayGenotyping_pleasegodwork.tped'
tfam_file = '/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/arraydata/SamsArrayGenotyping_pleasegodwork.tfam'
vcf_file = '/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/withNAD197andNAD205/simple_fixed.vcf.gz'
output_file = 'comparison_results.txt'
checkpoint_file = 'checkpoint.pkl'


def is_indel(allele, ref):
    return len(allele) != len(ref) or allele == '*' or len(allele) > 1 or len(ref) > 1


def read_tped(file_path):
    tped_data = defaultdict(dict)
    line_count = 0
    filtered_count = 0
    duplicate_count = 0
    unique_snps = 0
    all_tped_positions = set()
    with open(file_path, 'r') as f:
        for line in f:
            line_count += 1
            parts = line.strip().split()
            if len(parts) < 4:
                print(f"Warning: Invalid TPED line {line_count}: {line.strip()}")
                continue
            chrom = parts[0]
            pos = int(parts[3])
            # Manually exclude chromosome 17 at position 50,503,041
            if chrom == '17' and pos == 50503041:
                filtered_count += 1
                continue
            if chrom in ['0', 'Y'] or chrom.startswith('NW') or pos == 0:
                filtered_count += 1
                continue
            genotypes = [parts[i:i+2] for i in range(4, len(parts), 2)]
            all_tped_positions.add((chrom, pos))
            if pos in tped_data[chrom]:
                duplicate_count += 1
                continue  # Skip duplicates
            tped_data[chrom][pos] = genotypes
            unique_snps += 1
    total_tped_snps = line_count - filtered_count
    print(f"Total SNPs in TPED: {total_tped_snps}")
    print(f"Filtered out {filtered_count} SNPs")
    print(f"Found {duplicate_count} duplicate positions")
    print(f"Total unique SNPs after filtering and removing duplicates: {unique_snps}")
    print(f"TPED data contains {len(tped_data)} chromosomes")
    return tped_data, unique_snps, filtered_count, duplicate_count, all_tped_positions, total_tped_snps


def read_vcf_sample_ids(vcf_file):
    with gzip.open(vcf_file, 'rt') as vcf:
        for line in vcf:
            if line.startswith('#CHROM'):
                return line.strip().split('\t')[9:]
    return []


def read_tfam_sample_ids(tfam_file):
    sample_ids = []
    with open(tfam_file, 'r') as tfam:
        for line in tfam:
            sample_ids.append(line.split()[1])
    return sample_ids


def parse_vcf_genotype(gt, ref, alts):
    alleles = gt.split(':')[0]
    if alleles in ['./.', '.|.']:
        return ['N', 'N']
    if '/' in alleles:
        a1, a2 = alleles.split('/')
    elif '|' in alleles:
        a1, a2 = alleles.split('|')
    else:
        return ['N', 'N']  # Unexpected format
    all_alleles = [ref] + alts  # alts is already a list, so we don't need to split it
    try:
        return [all_alleles[int(a1)] if a1 != '.' else 'N',
                all_alleles[int(a2)] if a2 != '.' else 'N']
    except (ValueError, IndexError):
        return ['N', 'N']  # Handle any parsing errors or out-of-range indices


def count_top_positions(data):
    position_counter = Counter()
    for sample, positions in data.items():
        position_counter.update((pos[0], pos[1]) for pos in positions)
    return position_counter.most_common()


def write_top_positions(filename, top_positions):
    with open(filename, 'w') as f:
        for (chrom, pos), count in top_positions:
            f.write(f'{chrom}:{pos} - Count: {count}\n')


def write_no_matching_snps(filename, no_matching_snps, tped_data, tfam_sample_ids):
    with open(filename, 'w') as f:
        f.write("Sample\tChromosome\tPosition\tVCF_Genotype\tArray_Genotype\n")
        for sample, snps in no_matching_snps.items():
            for chrom, pos, vcf_genotype in snps:
                array_genotype = ' '.join(tped_data[chrom][pos][tfam_sample_ids.index(sample)])
                f.write(f"{sample}\t{chrom}\t{pos}\t{vcf_genotype}\t{array_genotype}\n")


def write_excluded_positions(filename, excluded_positions):
    with open(filename, 'w') as f:
        f.write("Chromosome\tPosition\tReason\n")
        for chrom, pos in sorted(excluded_positions):
            f.write(f"{chrom}\t{pos}\tExcluded\n")


def save_checkpoint(position, horse_stats, no_matching_snps, no_calls_array, no_calls_vcf,
                    no_calls_both, vcf_positions, snps_actually_compared):
    checkpoint_data = {
        'position': position,
        'horse_stats': horse_stats,
        'no_matching_snps': no_matching_snps,
        'no_calls_array': no_calls_array,
        'no_calls_vcf': no_calls_vcf,
        'no_calls_both': no_calls_both,
        'vcf_positions': vcf_positions,
        'snps_actually_compared': snps_actually_compared
    }
    with open(checkpoint_file, 'wb') as f:
        pickle.dump(checkpoint_data, f)


def load_checkpoint():
    if os.path.exists(checkpoint_file):
        try:
            with open(checkpoint_file, 'rb') as f:
                return pickle.load(f)
        except (EOFError, pickle.UnpicklingError):
            print(f"Warning: Checkpoint file '{checkpoint_file}' is corrupted or incomplete. Starting from the beginning.")
            return None
    return None


def compare_genotypes(vcf_file, tped_data, vcf_sample_ids, tfam_sample_ids, all_tped_positions, start_position=0):
    horse_stats = {sample: {'total_matched_snps': 0, 'total_snps_compared': 0,
                            'no_calls_array': 0, 'no_calls_vcf': 0, 'no_calls_both': 0,
                            'total_no_matching_snps': 0}
                   for sample in tfam_sample_ids}
    no_matching_snps = defaultdict(set)
    no_calls_array = defaultdict(set)
    no_calls_vcf = defaultdict(set)
    no_calls_both = defaultdict(set)
    vcf_line_count = 0
    vcf_positions = set()
    snps_actually_compared = set()
    excluded_positions = set()
    vcf_to_tped_index = {vcf_sample_ids.index(sample): tfam_sample_ids.index(sample)
                         for sample in vcf_sample_ids if sample in tfam_sample_ids}

    def is_indel(allele, ref):
        return len(allele) != len(ref) or allele == '*' or len(allele) > 1 or len(ref) > 1

    def is_no_call(gt):
        return gt in ['./.', '.|.', '.'] or gt.startswith('./.')

    with gzip.open(vcf_file, 'rt') as vcf:
        for line in vcf:
            if line.startswith('#'):
                continue
            vcf_line_count += 1
            if vcf_line_count <= start_position:
                continue
            parts = line.strip().split('\t')
            full_chrom = parts[0].replace('chr', '')
            chrom = full_chrom.split(':')[-1]
            if chrom in ['0', 'Y'] or chrom.startswith('NW'):
                continue
            pos = int(parts[1])
            ref, alts = parts[3], parts[4].split(',')
            vcf_genotypes = parts[9:]
            vcf_positions.add((chrom, pos))
            if (chrom, pos) in all_tped_positions:
                # Check for indels or multiallelic sites only for matching positions
                is_indel_site = is_indel(ref, ref) or any(is_indel(alt, ref) for alt in alts)
                is_multiallelic = len(alts) > 1
                if is_indel_site or is_multiallelic:
                    excluded_positions.add((chrom, pos))
                    continue
                snps_actually_compared.add((chrom, pos))
                if chrom in tped_data and pos in tped_data[chrom]:
                    tped_genotypes = tped_data[chrom][pos]
                    for vcf_index, vcf_gt in enumerate(vcf_genotypes):
                        if vcf_index not in vcf_to_tped_index:
                            continue
                        tped_index = vcf_to_tped_index[vcf_index]
                        tped_gt = tped_genotypes[tped_index]
                        horse_id = vcf_sample_ids[vcf_index]
                        horse_stats[horse_id]['total_snps_compared'] += 1
                        is_vcf_no_call = is_no_call(vcf_gt)
                        is_array_no_call = '0' in tped_gt
                        if is_array_no_call and is_vcf_no_call:
                            horse_stats[horse_id]['no_calls_both'] += 1
                            no_calls_both[horse_id].add((chrom, pos, vcf_gt))
                        elif is_array_no_call:
                            horse_stats[horse_id]['no_calls_array'] += 1
                            no_calls_array[horse_id].add((chrom, pos, vcf_gt))
                        elif is_vcf_no_call:
                            horse_stats[horse_id]['no_calls_vcf'] += 1
                            no_calls_vcf[horse_id].add((chrom, pos, vcf_gt))
                        else:
                            vcf_alleles = parse_vcf_genotype(vcf_gt, ref, alts)
                            if sorted(vcf_alleles) == sorted(tped_gt):
                                horse_stats[horse_id]['total_matched_snps'] += 1
                            else:
                                horse_stats[horse_id]['total_no_matching_snps'] += 1
                                no_matching_snps[horse_id].add((chrom, pos, '/'.join(vcf_alleles)))
            if vcf_line_count % 1000000 == 0:
                print(f"Processed {vcf_line_count} VCF lines")
                save_checkpoint(vcf_line_count, horse_stats, no_matching_snps, no_calls_array,
                                no_calls_vcf, no_calls_both, vcf_positions, snps_actually_compared)

    snps_not_in_vcf = len(all_tped_positions - vcf_positions)
    total_snps_compared = len(snps_actually_compared)
    print(f"Processed {vcf_line_count} VCF lines")
    print(f"SNPs in TPED not found in VCF: {snps_not_in_vcf}")
    print(f"Excluded positions (indels or multiallelic): {len(excluded_positions)}")
    print(f"Total unique positions compared: {total_snps_compared}")
    return (horse_stats, no_matching_snps, no_calls_array, no_calls_vcf, no_calls_both,
            snps_not_in_vcf, total_snps_compared, excluded_positions)


def write_results(output_file, horse_stats, total_tped_snps, filtered_count, duplicate_count,
                  snps_not_in_vcf, total_snps_compared, excluded_positions_count):
    with open(output_file, 'w') as f:
        f.write(f"Filtered SNPs: {filtered_count}\n")
        f.write(f"Total SNPs in TPED (including duplicates): {total_tped_snps}\n")
        f.write(f"Duplicate SNPs found: {duplicate_count}\n")
        f.write(f"SNPs in TPED not found in VCF: {snps_not_in_vcf}\n")
        f.write(f"Excluded positions (indels or multiallelic): {excluded_positions_count}\n")
        f.write(f"Total unique positions compared: {total_snps_compared}\n\n")
        f.write("Note: All subsequent comparisons are based on unique positions only.\n\n")
        for horse_id, stats in horse_stats.items():
            total_snps_compared = stats['total_snps_compared']
            total_no_calls = stats['no_calls_array'] + stats['no_calls_vcf'] + stats['no_calls_both']
            total_no_matching_snps = stats['total_no_matching_snps']
            incorrect_calls_denominator = total_snps_compared - total_no_calls
            incorrect_calls_percentage = (total_no_matching_snps / incorrect_calls_denominator * 100) if incorrect_calls_denominator > 0 else 0
            f.write(f"Sample: {horse_id}\n")
            f.write(f"Total SNPs compared: {total_snps_compared}\n")
            f.write(f"Total matched SNPs: {stats['total_matched_snps']}\n")
            f.write(f"Total no matching SNPs: {total_no_matching_snps}\n")
            f.write(f"No calls from array: {stats['no_calls_array']}\n")
            f.write(f"No calls from VCF: {stats['no_calls_vcf']}\n")
            f.write(f"No calls from both: {stats['no_calls_both']}\n")
            f.write(f"Incorrect calls percentage: {incorrect_calls_percentage:.2f}%\n\n")


def main():
    print("Reading TPED file...")
    tped_data, unique_tped_snps, filtered_count, duplicate_count, all_tped_positions, total_tped_snps = read_tped(tped_file)

    print("Reading sample IDs...")
    vcf_sample_ids = read_vcf_sample_ids(vcf_file)
    tfam_sample_ids = read_tfam_sample_ids(tfam_file)

    checkpoint = load_checkpoint()
    start_position = 0
    if checkpoint:
        start_position = checkpoint['position']
        print(f"Resuming from checkpoint at position {start_position}")

    print("Comparing genotypes...")
    horse_stats, no_matching_snps, no_calls_array, no_calls_vcf, no_calls_both, snps_not_in_vcf, total_snps_compared, excluded_positions = compare_genotypes(
        vcf_file, tped_data, vcf_sample_ids, tfam_sample_ids, all_tped_positions, start_position)

    print("Writing results...")
    write_results(output_file, horse_stats, total_tped_snps, filtered_count, duplicate_count,
                  snps_not_in_vcf, total_snps_compared, len(excluded_positions))

    print("Writing top positions...")
    write_top_positions('top_no_matching_snps.txt', count_top_positions(no_matching_snps))
    write_top_positions('top_no_calls_array.txt', count_top_positions(no_calls_array))
    write_top_positions('top_no_calls_vcf.txt', count_top_positions(no_calls_vcf))
    write_top_positions('top_no_calls_both.txt', count_top_positions(no_calls_both))

    print("Writing all no matching SNPs...")
    write_no_matching_snps('all_no_matching_snps.txt', no_matching_snps, tped_data, tfam_sample_ids)

    print("Writing excluded positions...")
    write_excluded_positions('excluded_positions.txt', excluded_positions)

    print(f"Results written to {output_file}")
    print("Top positions written to top_no_matching_snps.txt, top_no_calls_array.txt, top_no_calls_vcf.txt, and top_no_calls_both.txt")
    print("All no matching SNPs written to all_no_matching_snps.txt")
    print("Excluded positions written to excluded_positions.txt")

    print("\nSummary:")
    print(f"Filtered SNPs: {filtered_count}")
    print(f"Total SNPs in TPED (including duplicates): {total_tped_snps}")
    print(f"Duplicate SNPs found: {duplicate_count}")
    print(f"SNPs in TPED not found in VCF: {snps_not_in_vcf}")
    print(f"Excluded positions (indels or multiallelic): {len(excluded_positions)}")
    print(f"Total unique positions compared: {total_snps_compared}")

    # Consistency check
    assert total_snps_compared + snps_not_in_vcf + len(excluded_positions) == unique_tped_snps, "Mismatch in comparison counts"

    # Remove checkpoint file after successful completion
    if os.path.exists(checkpoint_file):
        os.remove(checkpoint_file)


if __name__ == '__main__':
    main()
```

#### Let's get more info about those excluded positions...

```
import gzip


def read_excluded_positions(file_path):
    excluded_positions = {}
    with open(file_path, 'r') as f:
        next(f)  # skip header
        for line in f:
            columns = line.strip().split('\t')
            if len(columns) < 3:
                continue
            chromosome = columns[0]
            position = columns[1]
            reason = columns[2]
            excluded_positions[f"chr{chromosome}:{position}"] = reason
    return excluded_positions


def calculate_allele_frequencies(genotypes, alleles):
    allele_counts = {allele: 0 for allele in alleles}
    total_alleles = 0
    for genotype in genotypes:
        if genotype in ['./.', '.|.']:
            continue
        alleles_in_genotype = genotype.replace('|', '/').split('/')
        for allele in alleles_in_genotype:
            try:
                allele_index = int(allele)
                allele_value = alleles[allele_index]
                allele_counts[allele_value] += 1
                total_alleles += 1
            except (ValueError, IndexError):
                continue
    allele_frequencies = ({allele: count / total_alleles for allele, count in allele_counts.items()}
                          if total_alleles > 0 else {})
    return allele_frequencies, len(genotypes) - (total_alleles // 2)


def process_vcf(vcf_file_path, output_file_path, excluded_positions_file):
    excluded_positions = read_excluded_positions(excluded_positions_file)
    with gzip.open(vcf_file_path, 'rt') as vcf_file, open(output_file_path, 'w') as out_file:
        for line in vcf_file:
            if line.startswith('#'):
                continue
            fields = line.strip().split('\t')
            chromosome = fields[0]
            position = fields[1]
            pos_key = f"{chromosome}:{position}"
            if pos_key in excluded_positions:
                print(f"Processing excluded position: {pos_key}")
                out_file.write(f"Excluded Position: {pos_key}\n")
                ref_allele = fields[3]
                alt_alleles = fields[4].split(',')
                alleles = [ref_allele] + alt_alleles
                genotypes = [sample.split(':')[0] for sample in fields[9:]]
                allele_frequencies, no_call_count = calculate_allele_frequencies(genotypes, alleles)
                out_file.write(f"Alleles: {', '.join(alleles)}\n")
                out_file.write("Allele Frequencies:\n")
                for allele, freq in allele_frequencies.items():
                    out_file.write(f"  {allele}: {freq:.4f}\n")
                out_file.write(f"No-call count: {no_call_count} / {len(genotypes)}\n")
                out_file.write("-" * 40 + "\n")


vcf_file_path = '/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/lessdeep/joint_lessdeep.recode.vcf.gz'
excluded_positions_file = '/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/lessdeep/excluded_positions.txt'
output_file_path = '/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/lessdeep/output.txt'

process_vcf(vcf_file_path, output_file_path, excluded_positions_file)
```

-----

#### Now the goal is to make a VCF of the deeper coverage ones and also compare in my script:

Extracting samples from VCFs:
```
#!/bin/bash

# Directory containing the VCF files
vcf_dir="/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/Sep25_2022"

# Directory for the filtered VCF output (same as input directory)
output_dir="${vcf_dir}"

# List of samples to extract
sample_list="197,205"

# Loop through each VCF file
for vcf_file in ${vcf_dir}/*.vcf.gz; do
    # Base name of the VCF file (e.g., chr1.vcf.gz)
    base_name=$(basename ${vcf_file})

    # Output file name (e.g., 197_205_chr1.vcf.gz)
    output_file="${output_dir}/197_205_${base_name}"

    echo "Processing ${vcf_file}..."

    # Extract just the listed samples
    bcftools view -s ${sample_list} ${vcf_file} -Oz -o ${output_file}
done

echo "Filtering complete."
```

Now, need to merge all the per-chromosome VCFs into one:
```
# Extract the header from the first VCF file
zgrep '^#' 197_205_chr1.vcf.gz > merged_197_205.vcf

# Concatenate all the data lines (non-header) from each VCF file
zgrep -v '^#' 197_205_chr{1..31}.vcf.gz 197_205_chrX.vcf.gz >> merged_197_205.vcf
```

Then zipped and indexed the VCF:
```
bgzip merged_197_205.vcf
tabix -p vcf merged_197_205.vcf.gz
```

### Now merge this VCF with the other, remove NAD197 and NAD205 (not 197 and 205), then filter out SNPs with >1 call

Need to match contig names: chrom_rename.txt
```
merged_chr1.fixed.vcf.gz:chr1 chr1
merged_chr2.fixed.vcf.gz:chr2 chr2
merged_chr3.fixed.vcf.gz:chr3 chr3
merged_chr4.fixed.vcf.gz:chr4 chr4
merged_chr5.fixed.vcf.gz:chr5 chr5
merged_chr6.fixed.vcf.gz:chr6 chr6
merged_chr7.fixed.vcf.gz:chr7 chr7
merged_chr8.fixed.vcf.gz:chr8 chr8
merged_chr9.fixed.vcf.gz:chr9 chr9
merged_chr10.fixed.vcf.gz:chr10 chr10
merged_chr11.fixed.vcf.gz:chr11 chr11
merged_chr12.fixed.vcf.gz:chr12 chr12
merged_chr13.fixed.vcf.gz:chr13 chr13
merged_chr14.fixed.vcf.gz:chr14 chr14
merged_chr15.fixed.vcf.gz:chr15 chr15
merged_chr16.fixed.vcf.gz:chr16 chr16
merged_chr17.fixed.vcf.gz:chr17 chr17
merged_chr18.fixed.vcf.gz:chr18 chr18
merged_chr19.fixed.vcf.gz:chr19 chr19
merged_chr20.fixed.vcf.gz:chr20 chr20
merged_chr21.fixed.vcf.gz:chr21 chr21
merged_chr22.fixed.vcf.gz:chr22 chr22
merged_chr23.fixed.vcf.gz:chr23 chr23
merged_chr24.fixed.vcf.gz:chr24 chr24
merged_chr25.fixed.vcf.gz:chr25 chr25
merged_chr26.fixed.vcf.gz:chr26 chr26
merged_chr27.fixed.vcf.gz:chr27 chr27
merged_chr28.fixed.vcf.gz:chr28 chr28
merged_chr29.fixed.vcf.gz:chr29 chr29
merged_chr30.fixed.vcf.gz:chr30 chr30
merged_chr31.fixed.vcf.gz:chr31 chr31
merged_chrX.fixed.vcf.gz:chrX chrX
197_205_chr1.vcf.gz:chr1 chr1
197_205_chr2.vcf.gz:chr2 chr2
197_205_chr3.vcf.gz:chr3 chr3
197_205_chr4.vcf.gz:chr4 chr4
197_205_chr5.vcf.gz:chr5 chr5
197_205_chr6.vcf.gz:chr6 chr6
197_205_chr7.vcf.gz:chr7 chr7
197_205_chr8.vcf.gz:chr8 chr8
197_205_chr9.vcf.gz:chr9 chr9
197_205_chr10.vcf.gz:chr10 chr10
197_205_chr11.vcf.gz:chr11 chr11
197_205_chr12.vcf.gz:chr12 chr12
197_205_chr13.vcf.gz:chr13 chr13
197_205_chr14.vcf.gz:chr14 chr14
197_205_chr15.vcf.gz:chr15 chr15
197_205_chr16.vcf.gz:chr16 chr16
197_205_chr17.vcf.gz:chr17 chr17
197_205_chr18.vcf.gz:chr18 chr18
197_205_chr19.vcf.gz:chr19 chr19
197_205_chr20.vcf.gz:chr20 chr20
197_205_chr21.vcf.gz:chr21 chr21
197_205_chr22.vcf.gz:chr22 chr22
197_205_chr23.vcf.gz:chr23 chr23
197_205_chr24.vcf.gz:chr24 chr24
197_205_chr25.vcf.gz:chr25 chr25
197_205_chr26.vcf.gz:chr26 chr26
197_205_chr27.vcf.gz:chr27 chr27
197_205_chr28.vcf.gz:chr28 chr28
197_205_chr29.vcf.gz:chr29 chr29
197_205_chr30.vcf.gz:chr30 chr30
197_205_chr31.vcf.gz:chr31 chr31
197_205_chrX.vcf.gz:chrX chrX
```
```
bcftools annotate --rename-chrs chrom_rename.txt merged_197_205.vcf.gz -Oz -o renamed_merged_197_205.vcf.gz
```
```
bcftools annotate --rename-chrs chrom_rename.txt merged_all_with483removed.vcf.gz -Oz -o renamed_merged_all_with483removed.vcf.gz
```
```
bcftools index renamed_merged_197_205.vcf.gz
bcftools index renamed_merged_all_with483removed.vcf.gz
```
```
bcftools merge renamed_merged_197_205.vcf.gz renamed_merged_all_with483removed.vcf.gz -Oz -o renamed_merged_197_205_withmain.vcf.gz
```

Now to remove NAD197 and NAD205:
```
bcftools view -s ^NAD197,NAD205 renamed_merged_197_205_withmain.vcf.gz -Oz -o renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```

Need to match the ID names in the tfam/tped to the VCF, so copied a new tfam/tped and then changed it:
```
sed -e 's/NAD197/197/g' -e 's/NAD205/205/g' SamsArrayGenotyping_pleasegodwork197and205.tfam > updated_SamsArrayGenotyping_pleasegodwork197and205.tfam
```

Index the VCF:
```
tabix -p vcf renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```

____

Can check genotypes at a VCF position with this:
```
zgrep -E "chr1[[:space:]]*156686314" renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```
And the array with this:
```
awk '$1 == 2 && $4 == 47290946' SamsArrayGenotyping_pleasegodwork.tped
```
Can match genotype to ID in the array with:
```
awk 'FNR==NR {horses[NR]=$2; next} $1 == 1 && $4 == 20619679 {for (i=5; i<=NF; i+=2) print horses[(i-3)/2], $i, $(i+1)}' SamsArrayGenotyping_pleasegodwork.tfam SamsArrayGenotyping_pleasegodwork.tped
```
And for the VCF:
```
bcftools query -r chr1:20619679 -f '%CHROM\t%POS[\t%SAMPLE=%GT]\n' renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```

____

# Now, mapping chestnut

### For the VCF from the deeper coverage horses

Split multiallelic sites into biallelic records:
```
bcftools norm -m-any -Oz -o deeperone.vcf.gz renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```
Transform the VCF into ped/map files:
```
plink --vcf deeperone.vcf.gz --allow-extra-chr --chr-set 32 --recode --biallelic-only --out deeper
```
Add IDs:
```
awk '{ $2 = $1 ":" $4 }1' deeper.map > deeper_ids.map
```
Now to prune it:
```
plink --file deeper_ids --allow-extra-chr --chr-set 32 --geno 0.1 --maf 0.001 --out deeper_ids_pruned --recode
```

#### Set up phenotype

As ```coat_color.phe```:
```
FID IID PHENOTYPE
55 55 1
219 219 1
241 241 2
408 408 1
416 416 2
607 607 1
608 608 1
853 853 2
854 854 2
NAD101 NAD101 2
NAD157 NAD157 1
197 197 1
205 205 0
NAD211 NAD211 1
NAD216 NAD216 1
NAD217 NAD217 2
NAD220 NAD220 2
NAD228 NAD228 2
NAD238 NAD238 2
NAD264 NAD264 1
NAD49 NAD49 1
```

Put into tped/tfam:
```
plink --file deeper_ids_pruned --allow-extra-chr --chr-set 32 --recode transpose --out deeper_final
```
```
cut -f1-6 joint_deeper_ids_pruned.ped > joint_deeper_ids_pruned.fam
```

Do GWAS (basic allelic):
```
plink --file deeper_ids_pruned \
    --pheno coat_color.phe \
    --assoc \
    --allow-extra-chr \
    --chr-set 32 \
    --allow-no-sex \
    --out deeper_vcf
```

Is chestnut in there?
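Aside: the `--assoc` run above is PLINK's basic allelic test, a 1-df chi-square on the 2×2 table of allele counts in cases vs. controls. A minimal sketch of that statistic (the counts below are hypothetical, not from these horses):

```python
def allelic_chisq(case_a, case_b, ctrl_a, ctrl_b):
    """Basic allelic 2x2 chi-square (1 df, no continuity correction)."""
    n = case_a + case_b + ctrl_a + ctrl_b
    row1, row2 = case_a + case_b, ctrl_a + ctrl_b
    col1, col2 = case_a + ctrl_a, case_b + ctrl_b
    chisq = 0.0
    # Sum (observed - expected)^2 / expected over the four cells
    for obs, r, c in [(case_a, row1, col1), (case_b, row1, col2),
                      (ctrl_a, row2, col1), (ctrl_b, row2, col2)]:
        exp = r * c / n
        chisq += (obs - exp) ** 2 / exp
    return chisq

# Hypothetical counts: 18 vs 2 "A" alleles in cases, 6 vs 14 in controls
print(round(allelic_chisq(18, 2, 6, 14), 2))  # 15.0
```

With real allele counts this should reproduce the CHISQ column in the .assoc output.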
```
grep "3:36979560" /home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/with197and205/deeper_vcf.assoc
```
Yep, it is:
```
3 3:36979560 36979560 C 0 0.7273 T 21.82 2.997e-06 0
```

___

### Now for the array:

Go to the directory with the tped/tfam from the array and add phenotypes to the 6th column. Prune it:
```
plink --tped SamsArrayGenotyping_pleasegodwork.tped \
    --tfam SamsArrayGenotyping_pleasegodwork.tfam \
    --allow-extra-chr \
    --chr-set 32 \
    --geno 0.1 \
    --maf 0.001 \
    --out pheno_pruned --recode
```
Then run the GWAS (basic allelic) on the other tped:
```
plink --file pheno_pruned \
    --pheno coat_color.phe \
    --assoc \
    --allow-extra-chr \
    --chr-set 32 \
    --out array_array
```
Is chestnut in there?
```
grep -w '3.*36979560' /home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/old_prebam/arraydata/array_array.assoc
```
Yes, it is:
```
3 AX-104805525 36979560 C 0 0.7273 T 20.1 7.353e-06 0
```

Let's at least figure out what the highest associated SNP in the basic allelic model was, then:
```
import pandas as pd

# Load your GWAS results
gwas_data = pd.read_csv("array_array.assoc", sep=r'\s+', low_memory=False)

# Convert 'CHR' column to numeric, forcing errors to NaN
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'], errors='coerce')

# Remove rows with chromosomes 0 and 33
gwas_data = gwas_data[(gwas_data['CHR'] > 0) & (gwas_data['CHR'] != 33)]

# Find the position of the highest association (lowest p-value)
highest_assoc = gwas_data.loc[gwas_data['P'].idxmin()]

# Extract the relevant information
snp_position = highest_assoc['BP']
snp_chr = highest_assoc['CHR']
snp_id = highest_assoc['SNP']  # Replace 'SNP' with the actual column name for SNP ID

# Print the results
print(f"Highest associated SNP: {snp_id}")
print(f"Chromosome: {snp_chr}, Position: {snp_position}")
print(f"P-value: {highest_assoc['P']}")
```

#### Results:
Highest associated SNP: AX-104528810
Chromosome: 3.0, Position: 36317372
P-value: 2.997e-06

##### This makes sense since chestnut is at 3:36979560

Can see all genotypes in the tped with, for example:
```
awk '$1 == 3 && $4 >= 36279560 && $4 <= 36999560' SamsArrayGenotyping_pleasegodwork.tped
```

### PYTHON: Create Manhattan plot for array AND same code for WGS tped/tfam

Adjust as needed for the .assoc you are using; removed chr 0 from the graph:
```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load your GWAS results using a raw string for the separator
gwas_data = pd.read_csv("array_array.assoc", sep=r'\s+', low_memory=False)

# Convert 'CHR' to numeric, forcing errors to NaN (if any)
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'], errors='coerce')

# Remove chromosome 0
gwas_data = gwas_data[gwas_data['CHR'] > 0]

# Calculate -log10(p-value)
gwas_data['neg_log10_p'] = -np.log10(gwas_data['P'])

# Remove any rows with NaN values in 'CHR' or 'neg_log10_p'
gwas_data = gwas_data.dropna(subset=['CHR', 'neg_log10_p'])

# Sort the DataFrame by chromosome and then by base pair position
gwas_data = gwas_data.sort_values(by=['CHR', 'BP'])

# Create the Manhattan plot
plt.figure(figsize=(12, 6))

# Initialize an empty DataFrame for the adjusted positions
plot_data = pd.DataFrame()

# Initialize position offset
current_pos = 0

# Loop through each chromosome and plot
for chr_num in sorted(gwas_data['CHR'].unique()):
    subset = gwas_data[gwas_data['CHR'] == chr_num].copy()  # Create a copy of the subset

    # Calculate new x positions
    subset['x_pos'] = subset['BP'] + current_pos
    plot_data = pd.concat([plot_data, subset])  # Combine all subsets into one DataFrame

    # Update current position for next chromosome, with a small offset
    current_pos += subset['BP'].max() + 1

    # If it's chromosome X (33), color it red
    if chr_num == 33:
        plt.scatter(subset['x_pos'], subset['neg_log10_p'], color='red', edgecolor='none', s=10)  # Chr X in red
    else:
        # Even chromosomes in red, odd chromosomes in blue
        color = 'red' if chr_num % 2 == 0 else 'blue'
        plt.scatter(subset['x_pos'], subset['neg_log10_p'], color=color, edgecolor='none', s=10)  # Other chromosomes

# Calculate the Bonferroni-corrected threshold
num_snps = len(gwas_data)  # Total number of SNPs tested
bonferroni_threshold = 0.05 / num_snps

# Calculate -log10 of the Bonferroni threshold
neg_log10_bonferroni = -np.log10(bonferroni_threshold)

# Add the Bonferroni-corrected significance threshold line to the plot
plt.axhline(y=neg_log10_bonferroni, color='black', linestyle='--')  # Bonferroni line

# Customizing the axes
plt.xlabel('Base Pair Position')
plt.ylabel('-log10(p-value)')
plt.title('Manhattan Plot for GWAS (Chromosome X in Red)')

# Set the x-ticks to reflect the chromosome positions
tick_positions = [plot_data[plot_data['CHR'] == chr_num]['x_pos'].iloc[0] for chr_num in sorted(gwas_data['CHR'].unique())]
tick_labels = [f'Chr {int(chr_num)}' if chr_num != 33 else 'Chr X' for chr_num in sorted(gwas_data['CHR'].unique())]
plt.xticks(tick_positions, tick_labels, rotation=45, fontsize=7)  # Reduced font size to avoid overlap

# Tight layout and save the plot
plt.tight_layout()
plt.savefig("manhattan_plot_with_chr_x_red.png", dpi=300)  # Save the plot as a high-resolution PNG
plt.show()

# Write SNP count and Bonferroni threshold to a file
with open("bonferroni_threshold_info.txt", "w") as f:
    f.write(f"Number of SNPs used: {num_snps}\n")
    f.write(f"Bonferroni threshold (0.05 / {num_snps}): {bonferroni_threshold:.2e}\n")
    f.write(f"-log10 Bonferroni threshold: {neg_log10_bonferroni:.2f}\n")
```

### Dope, here is the array:

![Screen Shot 2024-10-16 at 3.41.17 PM](https://hackmd.io/_uploads/B1_Qbpp1ye.png)

Highest associated SNP: AX-104528810
Chromosome: 3.0, Position: 36317372
P-value: 2.997e-06

##### This makes sense since chestnut is at 3:36979560

Number of SNPs used: 456243
Bonferroni threshold (0.05 / 456243): 1.09590722e-7
-log10 Bonferroni threshold: 6.96

- the reason for 456243 is that there were 491616 SNPs in array_array.assoc (due to --geno filtering in plink); however, the Manhattan plot script drops NaN values and anything on chr 0.

### And here from the VCF SNPs:

![Screen Shot 2024-10-16 at 4.40.40 PM](https://hackmd.io/_uploads/HkKfJApy1l.png)

Highest associated SNP: 7:95499072
Chromosome: 7, Position: 95499072
P-value: 3.375e-08

Number of SNPs used: 16526343, determined by: ```grep -v "NA" deeper_vcf.assoc | wc -l```
Bonferroni threshold (0.05 / 16526343): 3.02547273e-9
-log10 Bonferroni threshold: 8.52

_____________

## Looking at depths

In ```/group/ctbrowngrp/finnolab_shared/wgs_bams/recalibrated``` with command:
```
samtools depth <bam_file> | awk '{sum+=$3} END {if (NR>0) print "Average depth:", sum/NR; else print "No coverage"}'
```

_____

## Dr. Finno comments, May 28

- change red and blue array Manhattan plot
- color coat chestnut SNP alleles
- make array QQ plot

#### Change red and blue array Manhattan plot

Python script:
```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load GWAS results
gwas_data = pd.read_csv("array_array.assoc", sep=r'\s+', low_memory=False)

# Convert 'CHR' to numeric
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'], errors='coerce')

# Filter: remove chr 0 and any NaNs
gwas_data = gwas_data[gwas_data['CHR'] > 0]
gwas_data['neg_log10_p'] = -np.log10(gwas_data['P'])
gwas_data = gwas_data.dropna(subset=['CHR', 'neg_log10_p'])

# Sort data
gwas_data = gwas_data.sort_values(by=['CHR', 'BP'])

# Start plot
plt.figure(figsize=(12, 6))
plot_data = pd.DataFrame()
current_pos = 0

# Loop through chromosomes and plot
for chr_num in sorted(gwas_data['CHR'].unique()):
    subset = gwas_data[gwas_data['CHR'] == chr_num].copy()
    subset['x_pos'] = subset['BP'] + current_pos
    plot_data = pd.concat([plot_data, subset])
    current_pos += subset['BP'].max() + 1
    color = 'pink' if chr_num == 33 or chr_num % 2 == 0 else 'green'
    plt.scatter(subset['x_pos'], subset['neg_log10_p'], color=color, edgecolor='none', s=10)

# Highlight the chestnut allele
chestnut = plot_data[(plot_data['CHR'] == 3) & (plot_data['BP'] == 36979560)]
if not chestnut.empty:
    plt.scatter(chestnut['x_pos'], chestnut['neg_log10_p'], color='hotpink',
                edgecolor='black', s=40, zorder=10, label='Chestnut Allele')

    # Output chestnut allele stats
    chestnut_info = chestnut[['CHR', 'SNP', 'BP', 'P', 'neg_log10_p']]
    chestnut_info.to_csv("chestnut_allele_info.txt", sep='\t', index=False)
else:
    print("Chestnut allele not found in the dataset.")

# Bonferroni threshold line
num_snps = len(gwas_data)
bonferroni_threshold = 0.05 / num_snps
neg_log10_bonferroni = -np.log10(bonferroni_threshold)
plt.axhline(y=neg_log10_bonferroni, color='black', linestyle='--')

# Axis formatting
plt.xlabel('Base Pair Position')
plt.ylabel('-log10(p-value)')
plt.title('Manhattan Plot for GWAS (Chestnut Allele Highlighted)')
tick_positions = [plot_data[plot_data['CHR'] == chr_num]['x_pos'].iloc[0]
                  for chr_num in sorted(gwas_data['CHR'].unique())]
tick_labels = [f'Chr {int(chr_num)}' if chr_num != 33 else 'Chr X'
               for chr_num in sorted(gwas_data['CHR'].unique())]
plt.xticks(tick_positions, tick_labels, rotation=45, fontsize=7)
plt.legend()
plt.tight_layout()
plt.savefig("manhattan_plot_with_chestnut.png", dpi=300)
plt.show()

# Output Bonferroni stats
with open("bonferroni_threshold_info.txt", "w") as f:
    f.write(f"Number of SNPs used: {num_snps}\n")
    f.write(f"Bonferroni threshold (0.05 / {num_snps}): {bonferroni_threshold:.2e}\n")
    f.write(f"-log10 Bonferroni threshold: {neg_log10_bonferroni:.2f}\n")
```

#### Chestnut is 3:36979560

### With QQ plot for array:

```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from scipy.stats import chi2

# Load GWAS results
gwas_data = pd.read_csv("array_array.assoc", sep=r'\s+', low_memory=False)

# Convert 'CHR' to numeric
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'], errors='coerce')

# Filter: remove chr 0 and any NaNs
gwas_data = gwas_data[gwas_data['CHR'] > 0]
gwas_data['neg_log10_p'] = -np.log10(gwas_data['P'])
gwas_data = gwas_data.dropna(subset=['CHR', 'neg_log10_p'])

# Sort data
gwas_data = gwas_data.sort_values(by=['CHR', 'BP'])

# Start Manhattan plot
plt.figure(figsize=(12, 6))
plot_data = pd.DataFrame()
current_pos = 0

# Loop through chromosomes and plot
for chr_num in sorted(gwas_data['CHR'].unique()):
    subset = gwas_data[gwas_data['CHR'] == chr_num].copy()
    subset['x_pos'] = subset['BP'] + current_pos
    plot_data = pd.concat([plot_data, subset])
    current_pos += subset['BP'].max() + 1
    color = 'pink' if chr_num == 33 or chr_num % 2 == 0 else 'green'
    plt.scatter(subset['x_pos'], subset['neg_log10_p'], color=color, edgecolor='none', s=10)

# Highlight the chestnut allele
chestnut = plot_data[(plot_data['CHR'] == 3) & (plot_data['BP'] == 36979560)]
if not chestnut.empty:
    plt.scatter(chestnut['x_pos'], chestnut['neg_log10_p'], color='limegreen',
                edgecolor='black', s=40, zorder=10)

    # Output chestnut allele stats
    chestnut_info = chestnut[['CHR', 'SNP', 'BP', 'P', 'neg_log10_p']]
    chestnut_info.to_csv("chestnut_allele_info.txt", sep='\t', index=False)
    chestnut_label = True
else:
    print("Chestnut allele not found in the dataset.")
    chestnut_label = False

# Bonferroni threshold line
num_snps = len(gwas_data)
bonferroni_threshold = 0.05 / num_snps
neg_log10_bonferroni = -np.log10(bonferroni_threshold)
plt.axhline(y=neg_log10_bonferroni, color='black', linestyle='--')

# Axis labels and title
plt.xlabel('Base Pair Position')
plt.ylabel('-log10(p-value)')
plt.title('Manhattan Plot for GWAS (Chestnut Allele Highlighted)')

# Chromosome tick positions and labels
tick_positions = [plot_data[plot_data['CHR'] == chr_num]['x_pos'].iloc[0]
                  for chr_num in sorted(gwas_data['CHR'].unique())]
tick_labels = [f'Chr {int(chr_num)}' if chr_num != 33 else 'Chr X'
               for chr_num in sorted(gwas_data['CHR'].unique())]
plt.xticks(tick_positions, tick_labels, rotation=45, fontsize=7)

# Legend for chestnut allele only
if chestnut_label:
    plt.legend(
        [plt.Line2D([], [], marker='o', color='limegreen', label='Chestnut Allele',
                    markerfacecolor='limegreen', markeredgecolor='black',
                    linestyle='', markersize=6)],
        ['Chestnut Allele'], fontsize=8, loc='upper left', frameon=True
    )

# Add QQ plot inset (upper right, small, labeled with white title background)
ax = plt.gca()
axins = inset_axes(ax, width=1.6, height=1.6, loc='upper right', borderpad=2)

# QQ data
sorted_p = np.sort(gwas_data['P'])
expected = -np.log10(np.linspace(1 / num_snps, 1, num_snps))
observed = -np.log10(sorted_p)

# Lambda GC (calculated and written to file below, not drawn on the plot)
chi2_stats = chi2.isf(sorted_p, df=1)
lambda_gc = np.median(chi2_stats) / 0.456

# QQ plot
axins.scatter(expected, observed, s=4, color='black')
axins.plot([0, expected.max()], [0, expected.max()], color='red', linestyle='--')

# Set limits and ticks
max_val = max(expected.max(), observed.max())
axins.set_xlim(0, max_val)
axins.set_ylim(0, max_val)
axins.set_xticks(np.arange(0, max_val + 1, step=1))
axins.set_yticks(np.arange(0, max_val + 1, step=1))

# Axis labels and title with white background
axins.set_title('QQ Plot', fontsize=7, pad=3,
                bbox=dict(facecolor='white', edgecolor='none', boxstyle='round,pad=0.3'))
axins.set_xlabel('Expected -log10(p)', fontsize=6, labelpad=1)
axins.set_ylabel('Observed -log10(p)', fontsize=6, labelpad=1)
axins.tick_params(axis='both', which='major', labelsize=6)
axins.grid(False)

# Clean layout
plt.subplots_adjust(left=0.06, right=0.98, top=0.95, bottom=0.1)
plt.savefig("manhattan_plot_with_chestnut_and_qq.png", dpi=300)
plt.show()

# Output Bonferroni and lambda info
with open("bonferroni_threshold_info.txt", "w") as f:
    f.write(f"Number of SNPs used: {num_snps}\n")
    f.write(f"Bonferroni threshold (0.05 / {num_snps}): {bonferroni_threshold:.2e}\n")
    f.write(f"-log10 Bonferroni threshold: {neg_log10_bonferroni:.2f}\n")
    f.write(f"Genomic inflation factor (lambda GC): {lambda_gc:.3f}\n")
```

______

### EXTRA: Recessive shit that looked weird af:

#### Using SnpSift to run a recessive GWAS model

For array:

```
plink --file \
  SamsArrayGenotyping_pleasegodwork_pheno_pruned \
  --recode vcf --out array \
  --allow-extra-chr \
  --chr-set 32
```

```
mamba create -n snpsiftenv
mamba activate snpsiftenv
```

```
mamba install -c bioconda snpEff
conda install -c bioconda snpeff
```

```
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
```

```
git clone https://github.com/pcingola/SnpEff.git
git clone https://github.com/pcingola/SnpSift.git
```

```
snpEff download EquCab3.0.105
```

```
snpEff ann EquCab3.0.105 /home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/arraydata/array.vcf > annotated_horse_array.vcf
```

```
SnpSift caseControl -v "-++---+++---0--+-+++-" annotated_horse_array.vcf > case_control_output.vcf
```

```
SnpSift extractFields case_control_output.vcf CHROM POS REF ALT "Cases" "Controls" "CC_REC" > gwas_results_snpsift.txt
```

Create Manhattan plot:

```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load GWAS results
gwas_data = pd.read_csv("gwas_results_snpsift.txt", sep='\t', low_memory=False)

# Print column names to verify
print("Columns in the data:", gwas_data.columns)

# Rename columns to match the expected format
gwas_data = gwas_data.rename(columns={'CHROM': 'CHR', 'POS': 'BP'})

# Convert 'CHR' to numeric (cast to str first in case pandas already parsed it as numeric)
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'].astype(str).str.replace('chr', ''), errors='coerce')

# Remove chromosome 0 and 33 (if present)
gwas_data = gwas_data[(gwas_data['CHR'] > 0) & (gwas_data['CHR'] != 33)]

# Calculate -log10(p-value) from the recessive-model 'CC_REC' column
gwas_data['neg_log10_p'] = -np.log10(gwas_data['CC_REC'].astype(float))

# Remove any rows with NaN in 'CHR' or 'neg_log10_p'
gwas_data = gwas_data.dropna(subset=['CHR', 'neg_log10_p'])

# Sort by chromosome and then by base pair position
gwas_data = gwas_data.sort_values(by=['CHR', 'BP'])

# Create the Manhattan plot
plt.figure(figsize=(12, 6))

# Adjusted positions and running offset
plot_data = pd.DataFrame()
current_pos = 0

# Two colors for alternating chromosomes
colors = ['red', 'blue']

# Loop through each chromosome and plot
for chr_num in sorted(gwas_data['CHR'].unique()):
    subset = gwas_data[gwas_data['CHR'] == chr_num].copy()
    subset['x_pos'] = subset['BP'] + current_pos
    plot_data = pd.concat([plot_data, subset])
    current_pos += subset['BP'].max() + 1
    color = colors[int(chr_num) % 2]  # Alternate by odd/even chromosome
    plt.scatter(subset['x_pos'], subset['neg_log10_p'], color=color, edgecolor='none', s=10)

# Customize the axes
plt.xlabel('Base Pair Position')
plt.ylabel('-log10(p-value)')
plt.title('Manhattan Plot for Recessive Association')
plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--')  # Significance threshold line

# Chromosome tick positions and labels
tick_positions = [plot_data[plot_data['CHR'] == chr_num]['x_pos'].iloc[0]
                  for chr_num in sorted(gwas_data['CHR'].unique())]
tick_labels = [f'Chr {int(chr_num)}' for chr_num in sorted(gwas_data['CHR'].unique())]
plt.xticks(tick_positions, tick_labels, rotation=45, fontsize=8)

# Tight layout and save as a high-resolution PNG
plt.tight_layout()
plt.savefig("manhattan_plot_recessive_snpsift.png", dpi=300)
plt.show()
```

Looks like shit tf:

![Screen Shot 2024-10-07 at 1.06.24 PM](https://hackmd.io/_uploads/SJXik6Z11l.png)

### For VCF data:

```
plink --file sos_biallelic_with_ids_pruned \
  --recode vcf --out fromvcf \
  --allow-extra-chr \
  --chr-set 32
```

```
SnpSift caseControl \
  "-+-++--0-++-++-++----" fromvcf.vcf > case_control_fromvcf_output.vcf
```

```
SnpSift extractFields case_control_fromvcf_output.vcf CHROM POS REF ALT "Cases" "Controls" "CC_REC" > gwas_results_snpsift.txt
```

Create Manhattan plot:

```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load GWAS results
gwas_data = pd.read_csv("gwas_results_snpsift.txt", sep='\t', low_memory=False)

# Print column names and the CHROM dtype to verify
print("Columns in the data:", gwas_data.columns)
print("Data type of CHROM column:", gwas_data['CHROM'].dtype)

# Rename columns to match the expected format
gwas_data = gwas_data.rename(columns={'CHROM': 'CHR', 'POS': 'BP'})

# Convert 'CHR' to numeric, handling both string and numeric inputs
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'].astype(str).str.replace('chr', ''), errors='coerce')

# Remove chromosome 0 and 33 (if present)
gwas_data = gwas_data[(gwas_data['CHR'] > 0) & (gwas_data['CHR'] != 33)]

# Calculate -log10(p-value) from the recessive-model 'CC_REC' column
gwas_data['neg_log10_p'] = -np.log10(gwas_data['CC_REC'].astype(float))

# Remove any rows with NaN in 'CHR' or 'neg_log10_p'
gwas_data = gwas_data.dropna(subset=['CHR', 'neg_log10_p'])

# Sort by chromosome and then by base pair position
gwas_data = gwas_data.sort_values(by=['CHR', 'BP'])

# Create the Manhattan plot
plt.figure(figsize=(12, 6))

# Adjusted positions and running offset
plot_data = pd.DataFrame()
current_pos = 0

# Two colors for alternating chromosomes
colors = ['red', 'blue']

# Loop through each chromosome and plot
for chr_num in sorted(gwas_data['CHR'].unique()):
    subset = gwas_data[gwas_data['CHR'] == chr_num].copy()
    subset['x_pos'] = subset['BP'] + current_pos
    plot_data = pd.concat([plot_data, subset])
    current_pos += subset['BP'].max() + 1
    color = colors[int(chr_num) % 2]  # Alternate by odd/even chromosome
    plt.scatter(subset['x_pos'], subset['neg_log10_p'], color=color, edgecolor='none', s=10)

# Customize the axes
plt.xlabel('Base Pair Position')
plt.ylabel('-log10(p-value)')
plt.title('Manhattan Plot for Recessive Association')
plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--')  # Significance threshold line

# Chromosome tick positions and labels
tick_positions = [plot_data[plot_data['CHR'] == chr_num]['x_pos'].iloc[0]
                  for chr_num in sorted(gwas_data['CHR'].unique())]
tick_labels = [f'Chr {int(chr_num)}' for chr_num in sorted(gwas_data['CHR'].unique())]
plt.xticks(tick_positions, tick_labels, rotation=45, fontsize=8)

# Tight layout and save as a high-resolution PNG
plt.tight_layout()
plt.savefig("manhattan_plot_recessive_snpsift.png", dpi=300)
plt.show()
```

Not sure why it also looks like shit:

![Screen Shot 2024-10-08 at 3.18.41 PM](https://hackmd.io/_uploads/ryoC1EQkJl.png)
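One quick sanity check for plots like these: the lambda-GC formula from the QQ inset above can be run standalone on any p-value column. The helper below is a hypothetical sketch (not part of the scripts above); values far from 1 mean the test statistic itself is inflated or deflated, which would explain a weird Manhattan plot.

```python
import numpy as np
from scipy.stats import chi2

def lambda_gc(pvals):
    """Genomic inflation factor: median observed 1-df chi2 over its expected median."""
    pvals = np.asarray(pvals, dtype=float)
    pvals = pvals[(pvals > 0) & (pvals <= 1)]   # drop NA/zero artifacts before isf
    chi2_stats = chi2.isf(pvals, df=1)          # convert p-values back to chi2 statistics
    return np.median(chi2_stats) / chi2.ppf(0.5, df=1)  # expected median ~0.4549 (the 0.456 above)

# Under the null, p-values are uniform and lambda GC should be ~1
null_p = (np.arange(10000) + 0.5) / 10000
print(round(lambda_gc(null_p), 2))  # → 1.0
```

Running it on the CC_REC column (after dropping NAs) would show whether the recessive test is behaving, independent of the plotting code.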
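Another thing worth checking: the `SnpSift caseControl` group strings ("+" case, "-" control, "0" unknown) are positional, one character per sample in VCF column order, so if plink writes the samples in a different order than expected, the hardcoded strings silently scramble cases and controls. A small sketch of deriving the string from the exported VCF's `#CHROM` header instead (the sample IDs and phenotype assignments here are made up for illustration):

```python
# Map of sample ID -> SnpSift caseControl code (hypothetical phenotypes)
phenotypes = {"241": "+", "416": "-", "608": "0"}

def group_string(vcf_header_line, phenotypes):
    """Build the positional +/-/0 string from a VCF #CHROM header line."""
    samples = vcf_header_line.rstrip("\n").split("\t")[9:]  # sample columns start at index 9
    return "".join(phenotypes.get(s, "0") for s in samples)  # unknown samples get "0"

# Example #CHROM line with three samples, in the order plink wrote them
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t241\t416\t608"
print(group_string(header, phenotypes))  # → +-0
```

Grabbing the real header with `grep "^#CHROM" fromvcf.vcf` and feeding it through this would confirm the hardcoded strings line up with the actual sample order.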