# Code and scripts used for the array vs. WGS comparison
### Basic setup:
```
ssh vanburen@farm.hpc.ucdavis.edu
```
```
srun -p med2 --time=9:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 10GB --pty /bin/bash
```
```
srun -p high2 --time=8:00:00 -A ctbrowngrp --nodes=1 --cpus-per-task 2 --mem 10GB --pty /bin/bash
```
### Merging VCF files
Here: ```/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis```
Created sample lists of which horses are in which VCFs:
For ```extractsamples_sept25.txt```
```
NAD49
NAD101
NAD216
NAD220
NAD228
NAD157
NAD197
NAD205
NAD211
NAD217
NAD238
NAD264
```
For ```extractsamples_pioneer.txt```
```
241
416
608
853
854
55
219
408
483
607
```
Update indexes:
```
for vcf in *.vcf.gz; do
bcftools index "$vcf"
done
```
Extracting samples from VCFs:
```
#!/bin/bash
# Define the directory containing your VCF files
vcf_dir="/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/Pioneer"
# Define the directory for the filtered VCF output (same as input directory)
output_dir="${vcf_dir}"
# Define the list of samples
sample_list="241,416,608,853,854,55,219,408,483,607"
# Loop through each VCF file
for vcf_file in ${vcf_dir}/chr{1..31}.fixed.vcf.gz ${vcf_dir}/chrX.fixed.vcf.gz; do
# Extract the base name of the VCF file (e.g., chr1.fixed.vcf.gz)
base_name=$(basename ${vcf_file})
# Define the output file name (e.g., chr1.fixed_filtered.vcf.gz)
output_file="${output_dir}/filtered_${base_name}"
echo "Processing ${vcf_file}..."
# Apply the bcftools view command
bcftools view -s ${sample_list} ${vcf_file} -Oz -o ${output_file}
done
echo "Filtering complete."
```
```
#!/bin/bash
# Define the directory containing your VCF files
vcf_dir="/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/Sep25_2022"
# Define the directory for the filtered VCF output (same as input directory)
output_dir="${vcf_dir}"
# Define the list of samples
sample_list="NAD49,NAD101,NAD216,NAD220,NAD228,NAD157,NAD197,NAD205,NAD211,NAD217,NAD238,NAD264"
# Loop through each VCF file
for vcf_file in ${vcf_dir}/*.vcf.gz; do
# Extract the base name of the VCF file (e.g., chr1.vcf.gz)
base_name=$(basename ${vcf_file})
# Define the output file name (e.g., chr1_filtered.vcf.gz)
output_file="${output_dir}/filtered_${base_name}"
echo "Processing ${vcf_file}..."
# Apply the bcftools view command
bcftools view -s ${sample_list} ${vcf_file} -Oz -o ${output_file}
done
echo "Filtering complete."
```
Now to merge both files for each chromosome into one:
```
#!/bin/bash
# Define directories
filtered_vcf_dir="/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/Sep25_2022"
pioneer_vcf_dir="/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/Pioneer"
output_vcf_dir="/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/merged"
# Create output directory if it does not exist
mkdir -p ${output_vcf_dir}
# Loop through each chromosome (assuming naming pattern is consistent)
for chr in {1..31} X; do
# Define file paths
filtered_vcf="${filtered_vcf_dir}/filtered_chr${chr}.vcf.gz"
pioneer_vcf="${pioneer_vcf_dir}/filtered_chr${chr}.fixed.vcf.gz"
output_vcf="${output_vcf_dir}/merged_chr${chr}.fixed.vcf.gz"
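    # note: bcftools merge requires indexed inputs; if the filtered VCFs lack
    # a .csi/.tbi index, run "bcftools index" on them before merging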
# Check if both files exist
if [[ -f ${filtered_vcf} && -f ${pioneer_vcf} ]]; then
echo "Merging ${filtered_vcf} and ${pioneer_vcf} into ${output_vcf}"
# Merge the VCF files
bcftools merge ${filtered_vcf} ${pioneer_vcf} -Oz -o ${output_vcf}
else
echo "One or both files for chromosome ${chr} do not exist. Skipping."
fi
done
```
Now merge all the per-chromosome VCFs into one:
```
# Extract the header from the first VCF file
zgrep '^#' merged_chr1.fixed.vcf.gz > merged_all.vcf
# Concatenate all the data lines (non-header) from each VCF file
zgrep -v '^#' merged_chr{1..31}.fixed.vcf.gz merged_chrX.fixed.vcf.gz >> merged_all.vcf
```
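Note: when ```zgrep``` is given multiple files it prefixes every output line with the source filename, which is what mangles the CHROM column (fixed below). A cleaner alternative, sketched here assuming all the per-chromosome files carry the same samples (they do), is ```bcftools concat```:
```
# concatenate per-chromosome VCFs without zgrep's filename-prefix side effect
bcftools concat merged_chr{1..31}.fixed.vcf.gz merged_chrX.fixed.vcf.gz \
    -Oz -o merged_all.vcf.gz
tabix -p vcf merged_all.vcf.gz
```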
Now to remove Bien (483):
```
vcftools --remove-indv 483 --vcf merged_all.vcf --recode --out merged_all_with483removed.vcf
```
Then zipped and indexed the VCF:
```
bgzip merged_all_with483removed.vcf.recode.vcf
tabix -p vcf merged_all_with483removed.vcf.recode.vcf.gz
mv merged_all_with483removed.vcf.recode.vcf.gz merged_all_with483removed.vcf.gz
```
Strip the filename prefix that multi-file ```zgrep``` prepended to each record's CHROM field:
```
zcat merged_all_with483removed.vcf.gz | \
sed 's/^\(merged_chr[0-9]*\.fixed\.vcf\.gz\):chr\([0-9]*\|X\|Y\)/chr\2/' | \
bgzip -c > simple.vcf.gz
```
```
tabix -p vcf simple.vcf.gz
```
```
zcat simple.vcf.gz | \
sed 's/^##contig=<ID=merged_chr\([0-9XY]*\).fixed.vcf.gz:chr\([0-9XY]*\)>/##contig=<ID=chr\2>/' | \
sed 's/^merged_chr\([0-9XY]*\).fixed.vcf.gz:chr\([0-9XY]*\)/chr\2/' | \
bgzip -c > simple_fixed.vcf.gz
```
```
tabix -p vcf simple_fixed.vcf.gz
```
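Quick sanity check that the contig lines and records now use plain chrN names:
```
# contig header lines should now read ##contig=<ID=chrN>
bcftools view -h simple_fixed.vcf.gz | grep '^##contig' | head
# and the CHROM column of the first few records should be chrN
zcat simple_fixed.vcf.gz | grep -v '^#' | head -3 | cut -f1,2
```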
___
### Array data
Using Axiom Analysis Suite, imported the raw CEL files and filtered using the following parameters:

The output TPED file has 623,469 SNPs.
Need to rename NAD219 to 219 to match the VCF:
```
sed -i 's/NAD219/219/g' SamsArrayGenotyping_pleasegodwork.tfam
```
### Here is my final script, which accounts for indels, no-calls, etc.:
```completescript.py```
* Adjust the file paths as needed for runs with or without the two NAD horses (i.e., the deeper-coverage files vs. the NAD ones).
```
import gzip
from collections import defaultdict, Counter
import pickle
import os
# File paths
tped_file = '/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/arraydata/SamsArrayGenotyping_pleasegodwork.tped'
tfam_file = '/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/arraydata/SamsArrayGenotyping_pleasegodwork.tfam'
vcf_file = '/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/withNAD197andNAD205/simple_fixed.vcf.gz'
output_file = 'comparison_results.txt'
checkpoint_file = 'checkpoint.pkl'
def is_indel(allele, ref):
return len(allele) != len(ref) or allele == '*' or len(allele) > 1 or len(ref) > 1
def read_tped(file_path):
tped_data = defaultdict(dict)
line_count = 0
filtered_count = 0
duplicate_count = 0
unique_snps = 0
all_tped_positions = set()
with open(file_path, 'r') as f:
for line in f:
line_count += 1
parts = line.strip().split()
if len(parts) < 4:
print(f"Warning: Invalid TPED line {line_count}: {line.strip()}")
continue
chrom = parts[0]
pos = int(parts[3])
# Manually exclude chromosome 17 at position 50,503,041
if chrom == '17' and pos == 50503041:
filtered_count += 1
continue
if chrom in ['0', 'Y'] or chrom.startswith('NW') or pos == 0:
filtered_count += 1
continue
genotypes = [parts[i:i+2] for i in range(4, len(parts), 2)]
all_tped_positions.add((chrom, pos))
if pos in tped_data[chrom]:
duplicate_count += 1
continue # Skip duplicates
tped_data[chrom][pos] = genotypes
unique_snps += 1
total_tped_snps = line_count - filtered_count
print(f"Total SNPs in TPED: {total_tped_snps}")
print(f"Filtered out {filtered_count} SNPs")
print(f"Found {duplicate_count} duplicate positions")
print(f"Total unique SNPs after filtering and removing duplicates: {unique_snps}")
print(f"TPED data contains {len(tped_data)} chromosomes")
return tped_data, unique_snps, filtered_count, duplicate_count, all_tped_positions, total_tped_snps
def read_vcf_sample_ids(vcf_file):
with gzip.open(vcf_file, 'rt') as vcf:
for line in vcf:
if line.startswith('#CHROM'):
return line.strip().split('\t')[9:]
return []
def read_tfam_sample_ids(tfam_file):
sample_ids = []
with open(tfam_file, 'r') as tfam:
for line in tfam:
sample_ids.append(line.split()[1])
return sample_ids
def parse_vcf_genotype(gt, ref, alts):
alleles = gt.split(':')[0]
if alleles in ['./.', '.|.']:
return ['N', 'N']
if '/' in alleles:
a1, a2 = alleles.split('/')
elif '|' in alleles:
a1, a2 = alleles.split('|')
else:
return ['N', 'N'] # Unexpected format
all_alleles = [ref] + alts # alts is already a list, so we don't need to split it
try:
return [all_alleles[int(a1)] if a1 != '.' else 'N',
all_alleles[int(a2)] if a2 != '.' else 'N']
except (ValueError, IndexError):
return ['N', 'N'] # Handle any parsing errors or out-of-range indices
def count_top_positions(data):
position_counter = Counter()
for sample, positions in data.items():
position_counter.update((pos[0], pos[1]) for pos in positions)
return position_counter.most_common()
def write_top_positions(filename, top_positions):
with open(filename, 'w') as f:
for (chrom, pos), count in top_positions:
f.write(f'{chrom}:{pos} - Count: {count}\n')
def write_no_matching_snps(filename, no_matching_snps, tped_data, tfam_sample_ids):
with open(filename, 'w') as f:
f.write("Sample\tChromosome\tPosition\tVCF_Genotype\tArray_Genotype\n")
for sample, snps in no_matching_snps.items():
for chrom, pos, vcf_genotype in snps:
array_genotype = ' '.join(tped_data[chrom][pos][tfam_sample_ids.index(sample)])
f.write(f"{sample}\t{chrom}\t{pos}\t{vcf_genotype}\t{array_genotype}\n")
def write_excluded_positions(filename, excluded_positions):
with open(filename, 'w') as f:
f.write("Chromosome\tPosition\tReason\n")
for chrom, pos in sorted(excluded_positions):
f.write(f"{chrom}\t{pos}\tExcluded\n")
def save_checkpoint(position, horse_stats, no_matching_snps, no_calls_array, no_calls_vcf, no_calls_both, vcf_positions, snps_actually_compared):
checkpoint_data = {
'position': position,
'horse_stats': horse_stats,
'no_matching_snps': no_matching_snps,
'no_calls_array': no_calls_array,
'no_calls_vcf': no_calls_vcf,
'no_calls_both': no_calls_both,
'vcf_positions': vcf_positions,
'snps_actually_compared': snps_actually_compared
}
with open(checkpoint_file, 'wb') as f:
pickle.dump(checkpoint_data, f)
def load_checkpoint():
if os.path.exists(checkpoint_file):
try:
with open(checkpoint_file, 'rb') as f:
return pickle.load(f)
except (EOFError, pickle.UnpicklingError):
print(f"Warning: Checkpoint file '{checkpoint_file}' is corrupted or incomplete. Starting from the beginning.")
return None
return None
def compare_genotypes(vcf_file, tped_data, vcf_sample_ids, tfam_sample_ids, all_tped_positions, start_position=0):
horse_stats = {sample: {'total_matched_snps': 0, 'total_snps_compared': 0,
'no_calls_array': 0, 'no_calls_vcf': 0, 'no_calls_both': 0,
'total_no_matching_snps': 0} for sample in tfam_sample_ids}
no_matching_snps = defaultdict(set)
no_calls_array = defaultdict(set)
no_calls_vcf = defaultdict(set)
no_calls_both = defaultdict(set)
vcf_line_count = 0
vcf_positions = set()
snps_actually_compared = set()
excluded_positions = set()
vcf_to_tped_index = {vcf_sample_ids.index(sample): tfam_sample_ids.index(sample)
for sample in vcf_sample_ids if sample in tfam_sample_ids}
def is_indel(allele, ref):
return len(allele) != len(ref) or allele == '*' or len(allele) > 1 or len(ref) > 1
    def is_no_call(gt):
        # strip FORMAT subfields (e.g. ./.:0,0) before testing for a missing
        # genotype, so phased (.|.) and bare (.) missing calls are caught too
        alleles = gt.split(':')[0]
        return alleles in ['./.', '.|.', '.']
with gzip.open(vcf_file, 'rt') as vcf:
for line in vcf:
if line.startswith('#'):
continue
vcf_line_count += 1
if vcf_line_count <= start_position:
continue
parts = line.strip().split('\t')
full_chrom = parts[0].replace('chr', '')
chrom = full_chrom.split(':')[-1]
if chrom in ['0', 'Y'] or chrom.startswith('NW'):
continue
pos = int(parts[1])
ref, alts = parts[3], parts[4].split(',')
vcf_genotypes = parts[9:]
vcf_positions.add((chrom, pos))
if (chrom, pos) in all_tped_positions:
# Check for indels or multiallelic sites only for matching positions
is_indel_site = is_indel(ref, ref) or any(is_indel(alt, ref) for alt in alts)
is_multiallelic = len(alts) > 1
if is_indel_site or is_multiallelic:
excluded_positions.add((chrom, pos))
continue
snps_actually_compared.add((chrom, pos))
if chrom in tped_data and pos in tped_data[chrom]:
tped_genotypes = tped_data[chrom][pos]
for vcf_index, vcf_gt in enumerate(vcf_genotypes):
if vcf_index not in vcf_to_tped_index:
continue
tped_index = vcf_to_tped_index[vcf_index]
tped_gt = tped_genotypes[tped_index]
horse_id = vcf_sample_ids[vcf_index]
horse_stats[horse_id]['total_snps_compared'] += 1
is_vcf_no_call = is_no_call(vcf_gt)
is_array_no_call = '0' in tped_gt
if is_array_no_call and is_vcf_no_call:
horse_stats[horse_id]['no_calls_both'] += 1
no_calls_both[horse_id].add((chrom, pos, vcf_gt))
elif is_array_no_call:
horse_stats[horse_id]['no_calls_array'] += 1
no_calls_array[horse_id].add((chrom, pos, vcf_gt))
elif is_vcf_no_call:
horse_stats[horse_id]['no_calls_vcf'] += 1
no_calls_vcf[horse_id].add((chrom, pos, vcf_gt))
else:
vcf_alleles = parse_vcf_genotype(vcf_gt, ref, alts)
if sorted(vcf_alleles) == sorted(tped_gt):
horse_stats[horse_id]['total_matched_snps'] += 1
else:
horse_stats[horse_id]['total_no_matching_snps'] += 1
no_matching_snps[horse_id].add((chrom, pos, '/'.join(vcf_alleles)))
if vcf_line_count % 1000000 == 0:
print(f"Processed {vcf_line_count} VCF lines")
save_checkpoint(vcf_line_count, horse_stats, no_matching_snps, no_calls_array, no_calls_vcf, no_calls_both, vcf_positions, snps_actually_compared)
snps_not_in_vcf = len(all_tped_positions - vcf_positions)
total_snps_compared = len(snps_actually_compared)
print(f"Processed {vcf_line_count} VCF lines")
print(f"SNPs in TPED not found in VCF: {snps_not_in_vcf}")
print(f"Excluded positions (indels or multiallelic): {len(excluded_positions)}")
print(f"Total unique positions compared: {total_snps_compared}")
return horse_stats, no_matching_snps, no_calls_array, no_calls_vcf, no_calls_both, snps_not_in_vcf, total_snps_compared, excluded_positions
def write_results(output_file, horse_stats, total_tped_snps, filtered_count, duplicate_count, snps_not_in_vcf, total_snps_compared, excluded_positions_count):
with open(output_file, 'w') as f:
f.write(f"Filtered SNPs: {filtered_count}\n")
f.write(f"Total SNPs in TPED (including duplicates): {total_tped_snps}\n")
f.write(f"Duplicate SNPs found: {duplicate_count}\n")
f.write(f"SNPs in TPED not found in VCF: {snps_not_in_vcf}\n")
f.write(f"Excluded positions (indels or multiallelic): {excluded_positions_count}\n")
f.write(f"Total unique positions compared: {total_snps_compared}\n\n")
f.write("Note: All subsequent comparisons are based on unique positions only.\n\n")
for horse_id, stats in horse_stats.items():
total_snps_compared = stats['total_snps_compared']
total_no_calls = stats['no_calls_array'] + stats['no_calls_vcf'] + stats['no_calls_both']
total_no_matching_snps = stats['total_no_matching_snps']
incorrect_calls_denominator = total_snps_compared - total_no_calls
incorrect_calls_percentage = (total_no_matching_snps / incorrect_calls_denominator * 100) if incorrect_calls_denominator > 0 else 0
f.write(f"Sample: {horse_id}\n")
f.write(f"Total SNPs compared: {total_snps_compared}\n")
f.write(f"Total matched SNPs: {stats['total_matched_snps']}\n")
f.write(f"Total no matching SNPs: {total_no_matching_snps}\n")
f.write(f"No calls from array: {stats['no_calls_array']}\n")
f.write(f"No calls from VCF: {stats['no_calls_vcf']}\n")
f.write(f"No calls from both: {stats['no_calls_both']}\n")
f.write(f"Incorrect calls percentage: {incorrect_calls_percentage:.2f}%\n\n")
def main():
print("Reading TPED file...")
tped_data, unique_tped_snps, filtered_count, duplicate_count, all_tped_positions, total_tped_snps = read_tped(tped_file)
print("Reading sample IDs...")
vcf_sample_ids = read_vcf_sample_ids(vcf_file)
tfam_sample_ids = read_tfam_sample_ids(tfam_file)
checkpoint = load_checkpoint()
start_position = 0
if checkpoint:
start_position = checkpoint['position']
print(f"Resuming from checkpoint at position {start_position}")
print("Comparing genotypes...")
horse_stats, no_matching_snps, no_calls_array, no_calls_vcf, no_calls_both, snps_not_in_vcf, total_snps_compared, excluded_positions = compare_genotypes(
vcf_file, tped_data, vcf_sample_ids, tfam_sample_ids, all_tped_positions, start_position)
print("Writing results...")
write_results(output_file, horse_stats, total_tped_snps, filtered_count, duplicate_count, snps_not_in_vcf, total_snps_compared, len(excluded_positions))
print("Writing top positions...")
write_top_positions('top_no_matching_snps.txt', count_top_positions(no_matching_snps))
write_top_positions('top_no_calls_array.txt', count_top_positions(no_calls_array))
write_top_positions('top_no_calls_vcf.txt', count_top_positions(no_calls_vcf))
write_top_positions('top_no_calls_both.txt', count_top_positions(no_calls_both))
print("Writing all no matching SNPs...")
write_no_matching_snps('all_no_matching_snps.txt', no_matching_snps, tped_data, tfam_sample_ids)
print("Writing excluded positions...")
write_excluded_positions('excluded_positions.txt', excluded_positions)
print(f"Results written to {output_file}")
print("Top positions written to top_no_matching_snps.txt, top_no_calls_array.txt, top_no_calls_vcf.txt, and top_no_calls_both.txt")
print("All no matching SNPs written to all_no_matching_snps.txt")
print("Excluded positions written to excluded_positions.txt")
print("\nSummary:")
print(f"Filtered SNPs: {filtered_count}")
print(f"Total SNPs in TPED (including duplicates): {total_tped_snps}")
print(f"Duplicate SNPs found: {duplicate_count}")
print(f"SNPs in TPED not found in VCF: {snps_not_in_vcf}")
print(f"Excluded positions (indels or multiallelic): {len(excluded_positions)}")
print(f"Total unique positions compared: {total_snps_compared}")
# Consistency check
assert total_snps_compared + snps_not_in_vcf + len(excluded_positions) == unique_tped_snps, "Mismatch in comparison counts"
# Remove checkpoint file after successful completion
if os.path.exists(checkpoint_file):
os.remove(checkpoint_file)
if __name__ == '__main__':
main()
```
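Since the script checkpoints every million VCF lines, it can survive a dropped session; one way to run it detached (a sketch):
```
# run in the background so an SSH drop doesn't kill the multi-hour comparison;
# rerunning the same command later resumes from checkpoint.pkl
nohup python completescript.py > completescript.log 2>&1 &
```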
#### Let's get more info about those excluded positions...
```
import gzip
def read_excluded_positions(file_path):
excluded_positions = {}
with open(file_path, 'r') as f:
next(f)
for line in f:
columns = line.strip().split('\t')
if len(columns) < 3:
continue
chromosome = columns[0]
position = columns[1]
reason = columns[2]
excluded_positions[f"chr{chromosome}:{position}"] = reason
return excluded_positions
def calculate_allele_frequencies(genotypes, alleles):
allele_counts = {allele: 0 for allele in alleles}
total_alleles = 0
for genotype in genotypes:
if genotype in ['./.', '.|.']:
continue
alleles_in_genotype = genotype.replace('|', '/').split('/')
for allele in alleles_in_genotype:
try:
allele_index = int(allele)
allele_value = alleles[allele_index]
allele_counts[allele_value] += 1
total_alleles += 1
except (ValueError, IndexError):
continue
allele_frequencies = {allele: count / total_alleles for allele, count in allele_counts.items()} if total_alleles > 0 else {}
return allele_frequencies, len(genotypes) - (total_alleles // 2)
def process_vcf(vcf_file_path, output_file_path, excluded_positions_file):
excluded_positions = read_excluded_positions(excluded_positions_file)
with gzip.open(vcf_file_path, 'rt') as vcf_file, open(output_file_path, 'w') as out_file:
for line in vcf_file:
if line.startswith('#'):
continue
fields = line.strip().split('\t')
chromosome = fields[0]
position = fields[1]
pos_key = f"{chromosome}:{position}"
if pos_key in excluded_positions:
print(f"Processing excluded position: {pos_key}")
out_file.write(f"Excluded Position: {pos_key}\n")
ref_allele = fields[3]
alt_alleles = fields[4].split(',')
alleles = [ref_allele] + alt_alleles
genotypes = [sample.split(':')[0] for sample in fields[9:]]
allele_frequencies, no_call_count = calculate_allele_frequencies(genotypes, alleles)
out_file.write(f"Alleles: {', '.join(alleles)}\n")
out_file.write("Allele Frequencies:\n")
for allele, freq in allele_frequencies.items():
out_file.write(f" {allele}: {freq:.4f}\n")
out_file.write(f"No-call count: {no_call_count} / {len(genotypes)}\n")
out_file.write("-" * 40 + "\n")
else:
continue
vcf_file_path = '/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/lessdeep/joint_lessdeep.recode.vcf.gz'
excluded_positions_file = '/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/lessdeep/excluded_positions.txt'
output_file_path = '/group/ctbrowngrp4/2024-vanburen-horse-genomes/arrayWGSanalysis/lessdeep/output.txt'
process_vcf(vcf_file_path, output_file_path, excluded_positions_file)
```
-----
#### Now the goal is to make a VCF of the deeper-coverage horses and compare them in my script:
Extracting samples from VCFs:
```
#!/bin/bash
# Define the directory containing your VCF files
vcf_dir="/home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/VCFs/Sep25_2022"
# Define the directory for the filtered VCF output (same as input directory)
output_dir="${vcf_dir}"
# Define the list of samples
sample_list="197,205"
# Loop through each VCF file
for vcf_file in ${vcf_dir}/*.vcf.gz; do
# Extract the base name of the VCF file (e.g., chr1.vcf.gz)
base_name=$(basename ${vcf_file})
# Define the output file name (e.g., chr1_filtered.vcf.gz)
output_file="${output_dir}/197_205_${base_name}"
echo "Processing ${vcf_file}..."
# Apply the bcftools view command
bcftools view -s ${sample_list} ${vcf_file} -Oz -o ${output_file}
done
echo "Filtering complete."
```
Now merge all the per-chromosome VCFs into one:
```
# Extract the header from the first VCF file
zgrep '^#' 197_205_chr1.vcf.gz > merged_197_205.vcf
# Concatenate all the data lines (non-header) from each VCF file
zgrep -v '^#' 197_205_chr{1..31}.vcf.gz 197_205_chrX.vcf.gz >> merged_197_205.vcf
```
Then zipped and indexed the VCF:
```
bgzip merged_197_205.vcf
tabix -p vcf merged_197_205.vcf.gz
```
### Now merge this VCF with the other, remove NAD197 and NAD205 (not 197 and 205), and then filter out SNPs with >1 call
Need to match contig names:
chrom_rename.txt
```
merged_chr1.fixed.vcf.gz:chr1 chr1
merged_chr2.fixed.vcf.gz:chr2 chr2
merged_chr3.fixed.vcf.gz:chr3 chr3
merged_chr4.fixed.vcf.gz:chr4 chr4
merged_chr5.fixed.vcf.gz:chr5 chr5
merged_chr6.fixed.vcf.gz:chr6 chr6
merged_chr7.fixed.vcf.gz:chr7 chr7
merged_chr8.fixed.vcf.gz:chr8 chr8
merged_chr9.fixed.vcf.gz:chr9 chr9
merged_chr10.fixed.vcf.gz:chr10 chr10
merged_chr11.fixed.vcf.gz:chr11 chr11
merged_chr12.fixed.vcf.gz:chr12 chr12
merged_chr13.fixed.vcf.gz:chr13 chr13
merged_chr14.fixed.vcf.gz:chr14 chr14
merged_chr15.fixed.vcf.gz:chr15 chr15
merged_chr16.fixed.vcf.gz:chr16 chr16
merged_chr17.fixed.vcf.gz:chr17 chr17
merged_chr18.fixed.vcf.gz:chr18 chr18
merged_chr19.fixed.vcf.gz:chr19 chr19
merged_chr20.fixed.vcf.gz:chr20 chr20
merged_chr21.fixed.vcf.gz:chr21 chr21
merged_chr22.fixed.vcf.gz:chr22 chr22
merged_chr23.fixed.vcf.gz:chr23 chr23
merged_chr24.fixed.vcf.gz:chr24 chr24
merged_chr25.fixed.vcf.gz:chr25 chr25
merged_chr26.fixed.vcf.gz:chr26 chr26
merged_chr27.fixed.vcf.gz:chr27 chr27
merged_chr28.fixed.vcf.gz:chr28 chr28
merged_chr29.fixed.vcf.gz:chr29 chr29
merged_chr30.fixed.vcf.gz:chr30 chr30
merged_chr31.fixed.vcf.gz:chr31 chr31
merged_chrX.fixed.vcf.gz:chrX chrX
197_205_chr1.vcf.gz:chr1 chr1
197_205_chr2.vcf.gz:chr2 chr2
197_205_chr3.vcf.gz:chr3 chr3
197_205_chr4.vcf.gz:chr4 chr4
197_205_chr5.vcf.gz:chr5 chr5
197_205_chr6.vcf.gz:chr6 chr6
197_205_chr7.vcf.gz:chr7 chr7
197_205_chr8.vcf.gz:chr8 chr8
197_205_chr9.vcf.gz:chr9 chr9
197_205_chr10.vcf.gz:chr10 chr10
197_205_chr11.vcf.gz:chr11 chr11
197_205_chr12.vcf.gz:chr12 chr12
197_205_chr13.vcf.gz:chr13 chr13
197_205_chr14.vcf.gz:chr14 chr14
197_205_chr15.vcf.gz:chr15 chr15
197_205_chr16.vcf.gz:chr16 chr16
197_205_chr17.vcf.gz:chr17 chr17
197_205_chr18.vcf.gz:chr18 chr18
197_205_chr19.vcf.gz:chr19 chr19
197_205_chr20.vcf.gz:chr20 chr20
197_205_chr21.vcf.gz:chr21 chr21
197_205_chr22.vcf.gz:chr22 chr22
197_205_chr23.vcf.gz:chr23 chr23
197_205_chr24.vcf.gz:chr24 chr24
197_205_chr25.vcf.gz:chr25 chr25
197_205_chr26.vcf.gz:chr26 chr26
197_205_chr27.vcf.gz:chr27 chr27
197_205_chr28.vcf.gz:chr28 chr28
197_205_chr29.vcf.gz:chr29 chr29
197_205_chr30.vcf.gz:chr30 chr30
197_205_chr31.vcf.gz:chr31 chr31
197_205_chrX.vcf.gz:chrX chrX
```
```
bcftools annotate --rename-chrs chrom_rename.txt merged_197_205.vcf.gz -Oz -o renamed_merged_197_205.vcf.gz
```
```
bcftools annotate --rename-chrs chrom_rename.txt merged_all_with483removed.vcf.gz -Oz -o renamed_merged_all_with483removed.vcf.gz
```
```
bcftools index renamed_merged_197_205.vcf.gz
bcftools index renamed_merged_all_with483removed.vcf.gz
```
```
bcftools merge renamed_merged_197_205.vcf.gz renamed_merged_all_with483removed.vcf.gz -Oz -o renamed_merged_197_205_withmain.vcf.gz
```
Now to remove NAD197 and NAD205:
```
bcftools view -s ^NAD197,NAD205 renamed_merged_197_205_withmain.vcf.gz -Oz -o renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```
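The leading ```^``` negates the whole comma-separated list, so both NAD197 and NAD205 are dropped. To double-check the remaining samples:
```
# NAD197/NAD205 should be gone; 197 and 205 (deeper-coverage copies) remain
bcftools query -l renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```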
Need to match the ID names in the tfam/tped to the VCF, so I copied the tfam/tped and then edited the copy:
```
sed -e 's/NAD197/197/g' -e 's/NAD205/205/g' SamsArrayGenotyping_pleasegodwork197and205.tfam > updated_SamsArrayGenotyping_pleasegodwork197and205.tfam
```
Index the VCF:
```
tabix -p vcf renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```
____
Can check genotypes at a VCF position with this:
```
zgrep -E "chr1[[:space:]]*156686314" renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```
And array with this:
```
awk '$1 == 2 && $4 == 47290946' SamsArrayGenotyping_pleasegodwork.tped
```
Can match genotype to id in the array with:
```
awk 'FNR==NR {horses[NR]=$2; next} $1 == 1 && $4 == 20619679 {for (i=5; i<=NF; i+=2) print horses[(i-3)/2], $i, $(i+1)}' SamsArrayGenotyping_pleasegodwork.tfam SamsArrayGenotyping_pleasegodwork.tped
```
And for vcf:
```
bcftools query -r chr1:20619679 -f '%CHROM\t%POS[\t%SAMPLE=%GT]\n' renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```
____
# Now, mapping chestnut
### For the VCF from the deeper coverage horses
Split multiallelic sites into biallelic records:
```
bcftools norm -m-any -Oz -o deeperone.vcf.gz renamed_merged_all_with483removed_no_NAD197_NAD205.vcf.gz
```
Transform vcf into ped/map files:
```
plink --vcf deeperone.vcf.gz --allow-extra-chr --chr-set 32 --recode --biallelic-only --out deeper
```
Add IDs (and copy the .ped so both files share the ```deeper_ids``` prefix):
```
awk '{ $2 = $1 ":" $4 }1' deeper.map > deeper_ids.map
cp deeper.ped deeper_ids.ped  # plink --file deeper_ids needs a matching .ped
```
Now to prune it:
```
plink --file deeper_ids --allow-extra-chr --chr-set 32 --geno 0.1 --maf 0.001 --out deeper_ids_pruned --recode
```
#### Set up the phenotype file
As ```coat_color.phe```:
```
FID IID PHENOTYPE
55 55 1
219 219 1
241 241 2
408 408 1
416 416 2
607 607 1
608 608 1
853 853 2
854 854 2
NAD101 NAD101 2
NAD157 NAD157 1
197 197 1
205 205 0
NAD211 NAD211 1
NAD216 NAD216 1
NAD217 NAD217 2
NAD220 NAD220 2
NAD228 NAD228 2
NAD238 NAD238 2
NAD264 NAD264 1
NAD49 NAD49 1
```
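PLINK's default case/control coding is 2 = case, 1 = control, 0 = missing, so horse 205 is excluded from the association test. A quick tally of the file above:
```
# count horses per phenotype code, skipping the header line
awk 'NR > 1 {counts[$3]++} END {for (code in counts) print "phenotype", code": "counts[code]}' coat_color.phe
```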
Put into tped/tfam:
```
plink --file deeper_ids_pruned --allow-extra-chr --chr-set 32 --recode transpose --out deeper_final
```
```
cut -d' ' -f1-6 deeper_ids_pruned.ped > deeper_ids_pruned.fam  # .ped is space-delimited
```
Do GWAS (basic allelic):
```
plink --file deeper_ids_pruned \
--pheno coat_color.phe \
--assoc \
--allow-extra-chr \
--chr-set 32 \
--allow-no-sex \
--out deeper_vcf
```
Is chestnut in there?
```
grep "3:36979560" /home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/with197and205/deeper_vcf.assoc
```
Yep, it is:
```
3 3:36979560 36979560 C 0 0.7273 T 21.82 2.997e-06 0
```
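To pull the strongest hits directly (P is column 9 of PLINK's ```--assoc``` output):
```
# sort by p-value (general-numeric handles scientific notation), show top hits
tail -n +2 deeper_vcf.assoc | sort -g -k9,9 | head
```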
___
### Now for the Array:
Go to directory with tped/tfam from array and add phenotypes to the 6th column.
Prune it:
```
plink --tped SamsArrayGenotyping_pleasegodwork.tped \
--tfam SamsArrayGenotyping_pleasegodwork.tfam \
--allow-extra-chr \
--chr-set 32 \
--geno 0.1 \
--maf 0.001 \
--out pheno_pruned --recode
```
Then, in that directory, run the GWAS (basic allelic) on the array tped:
```
plink --file pheno_pruned \
--pheno coat_color.phe \
--assoc \
--allow-extra-chr \
--chr-set 32 \
--out array_array
```
Is chestnut in there?
```
grep -w '3.*36979560' /home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/old_prebam/arraydata/array_array.assoc
```
Yes, it is:
```
3 AX-104805525 36979560 C 0 0.7273 T 20.1 7.353e-06 0
```
Let's at least figure out which SNP had the highest association in the basic allelic model:
```
import pandas as pd
# Load your GWAS results
gwas_data = pd.read_csv("array_array.assoc", sep=r'\s+', low_memory=False)
# Convert 'CHR' column to numeric, forcing errors to NaN
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'], errors='coerce')
# Remove rows with chromosomes 0 and 33
gwas_data = gwas_data[(gwas_data['CHR'] > 0) & (gwas_data['CHR'] != 33)]
# Calculate the position of the highest association (lowest p-value)
highest_assoc = gwas_data.loc[gwas_data['P'].idxmin()]
# Extract the relevant information
snp_position = highest_assoc['BP']
snp_chr = highest_assoc['CHR']
snp_id = highest_assoc['SNP'] # Replace 'SNP' with the actual column name for SNP ID
# Print the results
print(f"Highest associated SNP: {snp_id}")
print(f"Chromosome: {snp_chr}, Position: {snp_position}")
print(f"P-value: {highest_assoc['P']}")
```
#### Results:
Highest associated SNP: AX-104528810
Chromosome: 3.0, Position: 36317372
P-value: 2.997e-06
##### This makes sense, since chestnut is at 3:36979560 (the top array SNP is ~660 kb away)
Can see all genotypes in a region of the tped, for example:
```
awk '$1 == 3 && $4 >= 36279560 && $4 <= 36999560' SamsArrayGenotyping_pleasegodwork.tped
```
### PYTHON: Create Manhattan plot for array AND same code for WGS tped/tfam
Adjust as needed for the .assoc you are using; chr 0 is removed from the graph:
```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load your GWAS results using a raw string for the separator
gwas_data = pd.read_csv("array_array.assoc", sep=r'\s+', low_memory=False)
# Convert 'CHR' to numeric, forcing errors to NaN (if any)
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'], errors='coerce')
# Remove chromosome 0
gwas_data = gwas_data[gwas_data['CHR'] > 0]
# Calculate -log10(p-value)
gwas_data['neg_log10_p'] = -np.log10(gwas_data['P'])
# Remove any rows with NaN values in 'CHR' or 'neg_log10_p'
gwas_data = gwas_data.dropna(subset=['CHR', 'neg_log10_p'])
# Sort the DataFrame by chromosome and then by base pair position
gwas_data = gwas_data.sort_values(by=['CHR', 'BP'])
# Create the Manhattan plot
plt.figure(figsize=(12, 6))
# Initialize an empty DataFrame for the adjusted positions
plot_data = pd.DataFrame()
# Initialize position offset
current_pos = 0
# Loop through each chromosome and plot
for chr_num in sorted(gwas_data['CHR'].unique()):
subset = gwas_data[gwas_data['CHR'] == chr_num].copy() # Create a copy of the subset
# Calculate new x positions
subset['x_pos'] = subset['BP'] + current_pos
plot_data = pd.concat([plot_data, subset]) # Combine all subsets into one DataFrame
# Update current position for next chromosome
current_pos += subset['BP'].max() + 1 # Update for next chromosome, with a small offset
# If it's chromosome X (33), color it red
if chr_num == 33:
plt.scatter(subset['x_pos'], subset['neg_log10_p'],
color='red', edgecolor='none', s=10) # Chr X in red
else:
# Even chromosomes in red, odd chromosomes in blue
color = 'red' if chr_num % 2 == 0 else 'blue'
plt.scatter(subset['x_pos'], subset['neg_log10_p'],
color=color, edgecolor='none', s=10) # Other chromosomes
# Calculate the Bonferroni-corrected threshold
num_snps = len(gwas_data) # Total number of SNPs tested
bonferroni_threshold = 0.05 / num_snps
# Calculate -log10 of the Bonferroni threshold
neg_log10_bonferroni = -np.log10(bonferroni_threshold)
# Add the Bonferroni-corrected significance threshold line to the plot
plt.axhline(y=neg_log10_bonferroni, color='black', linestyle='--') # Bonferroni line
# Customizing the x-axis
plt.xlabel('Base Pair Position')
plt.ylabel('-log10(p-value)')
plt.title('Manhattan Plot for GWAS (Chromosome X in Red)')
# Set the x-ticks to reflect the chromosome positions
tick_positions = [plot_data[plot_data['CHR'] == chr_num]['x_pos'].iloc[0] for chr_num in sorted(gwas_data['CHR'].unique())]
tick_labels = [f'Chr {int(chr_num)}' if chr_num != 33 else 'Chr X' for chr_num in sorted(gwas_data['CHR'].unique())]
plt.xticks(tick_positions, tick_labels, rotation=45, fontsize=7) # Reduced font size to avoid overlap
# Tight layout and save the plot
plt.tight_layout()
plt.savefig("manhattan_plot_with_chr_x_red.png", dpi=300) # Save the plot as PNG with high resolution
plt.show()
# Write SNP count and Bonferroni threshold to a file
with open("bonferroni_threshold_info.txt", "w") as f:
f.write(f"Number of SNPs used: {num_snps}\n")
f.write(f"Bonferroni threshold (0.05 / {num_snps}): {bonferroni_threshold:.2e}\n")
f.write(f"-log10 Bonferroni threshold: {neg_log10_bonferroni:.2f}\n")
```
### Dope, here is the array:

Highest associated SNP: AX-104528810
Chromosome: 3.0, Position: 36317372
P-value: 2.997e-06
##### This makes sense since chestnut is at 3:36979560
Number of SNPs used: 456243
Bonferroni threshold (0.05 / 456243): 1.09590722e-7
-log10 Bonferroni threshold: 6.96
- The reason for 456,243: there were 491,616 SNPs in array_array.assoc (after ```--geno``` filtering in PLINK), but the Manhattan-plot script drops rows with NaN values and any on chr 0.
### And here from the VCF SNPs:

Highest associated SNP: 7:95499072
Chromosome: 7, Position: 95499072
P-value: 3.375e-08
Number of SNPs used: 16526343, determined by:
```grep -v "NA" deeper_vcf.assoc | wc -l```
Bonferroni threshold (0.05 / 16526343): 3.02547273e-9
-log10 Bonferroni threshold: 8.52
_____________
## Looking at depths
In ```/group/ctbrowngrp/finnolab_shared/wgs_bams/recalibrated```, with this command:
```
samtools depth <bam_file> | awk '{sum+=$3} END {if (NR>0) print "Average depth:", sum/NR; else print "No coverage"}'
```
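Note that by default ```samtools depth``` skips zero-coverage positions, so this average is over covered sites only; adding ```-a``` counts zero-depth sites in the denominator (a sketch):
```
# -a emits zero-depth positions too, giving a genome-wide average
samtools depth -a <bam_file> | awk '{sum+=$3} END {if (NR>0) print "Average depth:", sum/NR}'
```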
_____
## Dr. Finno's comments, May 28
- change red and blue array manhattan plot
- color coat chestnut SNP alleles
- make array QQ plot
#### Change red and blue array manhattan plot
Python script:
```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load GWAS results
gwas_data = pd.read_csv("array_array.assoc", sep=r'\s+', low_memory=False)
# Convert 'CHR' to numeric
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'], errors='coerce')
# Filter: remove chr 0 and any NaNs
gwas_data = gwas_data[gwas_data['CHR'] > 0]
gwas_data['neg_log10_p'] = -np.log10(gwas_data['P'])
gwas_data = gwas_data.dropna(subset=['CHR', 'neg_log10_p'])
# Sort data
gwas_data = gwas_data.sort_values(by=['CHR', 'BP'])
# Start plot
plt.figure(figsize=(12, 6))
plot_data = pd.DataFrame()
current_pos = 0
# Loop through chromosomes and plot
for chr_num in sorted(gwas_data['CHR'].unique()):
subset = gwas_data[gwas_data['CHR'] == chr_num].copy()
subset['x_pos'] = subset['BP'] + current_pos
plot_data = pd.concat([plot_data, subset])
current_pos += subset['BP'].max() + 1
color = 'pink' if chr_num == 33 or chr_num % 2 == 0 else 'green'
plt.scatter(subset['x_pos'], subset['neg_log10_p'], color=color, edgecolor='none', s=10)
# Highlight the chestnut allele
chestnut = plot_data[(plot_data['CHR'] == 3) & (plot_data['BP'] == 36979560)]
if not chestnut.empty:
plt.scatter(chestnut['x_pos'], chestnut['neg_log10_p'],
color='hotpink', edgecolor='black', s=40, zorder=10, label='Chestnut Allele')
# Output chestnut allele stats
chestnut_info = chestnut[['CHR', 'SNP', 'BP', 'P', 'neg_log10_p']]
chestnut_info.to_csv("chestnut_allele_info.txt", sep='\t', index=False)
else:
print("Chestnut allele not found in the dataset.")
# Bonferroni threshold line
num_snps = len(gwas_data)
bonferroni_threshold = 0.05 / num_snps
neg_log10_bonferroni = -np.log10(bonferroni_threshold)
plt.axhline(y=neg_log10_bonferroni, color='black', linestyle='--')
# Axis formatting
plt.xlabel('Base Pair Position')
plt.ylabel('-log10(p-value)')
plt.title('Manhattan Plot for GWAS (Chestnut Allele Highlighted)')
tick_positions = [plot_data[plot_data['CHR'] == chr_num]['x_pos'].iloc[0] for chr_num in sorted(gwas_data['CHR'].unique())]
tick_labels = [f'Chr {int(chr_num)}' if chr_num != 33 else 'Chr X' for chr_num in sorted(gwas_data['CHR'].unique())]
plt.xticks(tick_positions, tick_labels, rotation=45, fontsize=7)
plt.legend()
plt.tight_layout()
plt.savefig("manhattan_plot_with_chestnut.png", dpi=300)
plt.show()
# Output Bonferroni stats
with open("bonferroni_threshold_info.txt", "w") as f:
f.write(f"Number of SNPs used: {num_snps}\n")
f.write(f"Bonferroni threshold (0.05 / {num_snps}): {bonferroni_threshold:.2e}\n")
f.write(f"-log10 Bonferroni threshold: {neg_log10_bonferroni:.2f}\n")
```
#### Chestnut is 3:36979560
### With QQ plot for array:
```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from scipy.stats import chi2
# Load GWAS results
gwas_data = pd.read_csv("array_array.assoc", sep=r'\s+', low_memory=False)
# Convert 'CHR' to numeric
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'], errors='coerce')
# Filter: remove chr 0 and any NaNs
gwas_data = gwas_data[gwas_data['CHR'] > 0]
gwas_data['neg_log10_p'] = -np.log10(gwas_data['P'])
gwas_data = gwas_data.dropna(subset=['CHR', 'neg_log10_p'])
# Sort data
gwas_data = gwas_data.sort_values(by=['CHR', 'BP'])
# Start Manhattan plot
plt.figure(figsize=(12, 6))
plot_data = pd.DataFrame()
current_pos = 0
# Loop through chromosomes and plot
for chr_num in sorted(gwas_data['CHR'].unique()):
subset = gwas_data[gwas_data['CHR'] == chr_num].copy()
subset['x_pos'] = subset['BP'] + current_pos
plot_data = pd.concat([plot_data, subset])
current_pos += subset['BP'].max() + 1
color = 'pink' if chr_num == 33 or chr_num % 2 == 0 else 'green'
plt.scatter(subset['x_pos'], subset['neg_log10_p'], color=color, edgecolor='none', s=10)
# Highlight the chestnut allele
chestnut = plot_data[(plot_data['CHR'] == 3) & (plot_data['BP'] == 36979560)]
if not chestnut.empty:
plt.scatter(chestnut['x_pos'], chestnut['neg_log10_p'],
color='limegreen', edgecolor='black', s=40, zorder=10)
# Output chestnut allele stats
chestnut_info = chestnut[['CHR', 'SNP', 'BP', 'P', 'neg_log10_p']]
chestnut_info.to_csv("chestnut_allele_info.txt", sep='\t', index=False)
chestnut_label = True
else:
print("Chestnut allele not found in the dataset.")
chestnut_label = False
# Bonferroni threshold line
num_snps = len(gwas_data)
bonferroni_threshold = 0.05 / num_snps
neg_log10_bonferroni = -np.log10(bonferroni_threshold)
plt.axhline(y=neg_log10_bonferroni, color='black', linestyle='--')
# Axis labels and title
plt.xlabel('Base Pair Position')
plt.ylabel('-log10(p-value)')
plt.title('Manhattan Plot for GWAS (Chestnut Allele Highlighted)')
# Chromosome tick positions and labels
tick_positions = [plot_data[plot_data['CHR'] == chr_num]['x_pos'].iloc[0] for chr_num in sorted(gwas_data['CHR'].unique())]
tick_labels = [f'Chr {int(chr_num)}' if chr_num != 33 else 'Chr X' for chr_num in sorted(gwas_data['CHR'].unique())]
plt.xticks(tick_positions, tick_labels, rotation=45, fontsize=7)
# Legend for chestnut allele only
if chestnut_label:
plt.legend(
[plt.Line2D([], [], marker='o', color='limegreen', label='Chestnut Allele',
markerfacecolor='limegreen', markeredgecolor='black', linestyle='', markersize=6)],
['Chestnut Allele'],
fontsize=8,
loc='upper left',
frameon=True
)
# Add QQ plot inset (upper right, small, labeled with white title background)
ax = plt.gca()
axins = inset_axes(ax, width=1.6, height=1.6, loc='upper right', borderpad=2)
# QQ data
sorted_p = np.sort(gwas_data['P'])
expected = -np.log10(np.linspace(1 / num_snps, 1, num_snps))
observed = -np.log10(sorted_p)
# Lambda GC (calculate, not shown)
chi2_stats = chi2.isf(sorted_p, df=1)
lambda_gc = np.median(chi2_stats) / 0.456
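# note: 0.456 approximates the median of the chi-square distribution with
# 1 df (~0.4549), so lambda GC is the observed median statistic divided by
# its expected value under the null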
# QQ plot
axins.scatter(expected, observed, s=4, color='black')
axins.plot([0, expected.max()], [0, expected.max()], color='red', linestyle='--')
# Set limits and ticks
max_val = max(expected.max(), observed.max())
axins.set_xlim(0, max_val)
axins.set_ylim(0, max_val)
axins.set_xticks(np.arange(0, max_val + 1, step=1))
axins.set_yticks(np.arange(0, max_val + 1, step=1))
# Axis labels and title with white background
axins.set_title('QQ Plot', fontsize=7, pad=3,
bbox=dict(facecolor='white', edgecolor='none', boxstyle='round,pad=0.3'))
axins.set_xlabel('Expected -log10(p)', fontsize=6, labelpad=1)
axins.set_ylabel('Observed -log10(p)', fontsize=6, labelpad=1)
axins.tick_params(axis='both', which='major', labelsize=6)
axins.grid(False)
# Clean layout
plt.subplots_adjust(left=0.06, right=0.98, top=0.95, bottom=0.1)
plt.savefig("manhattan_plot_with_chestnut_and_qq.png", dpi=300)
plt.show()
# Output Bonferroni and lambda info
with open("bonferroni_threshold_info.txt", "w") as f:
f.write(f"Number of SNPs used: {num_snps}\n")
f.write(f"Bonferroni threshold (0.05 / {num_snps}): {bonferroni_threshold:.2e}\n")
f.write(f"-log10 Bonferroni threshold: {neg_log10_bonferroni:.2f}\n")
f.write(f"Genomic inflation factor (lambda GC): {lambda_gc:.3f}\n")
```
______
### EXTRA: Recessive model that looked weird:
#### Using SNPSift to run a recessive GWAS model
For array:
```
plink --file SamsArrayGenotyping_pleasegodwork_pheno_pruned \
--recode vcf --out array \
--allow-extra-chr \
--chr-set 32
```
```
mamba create -n snpsiftenv
mamba activate snpsiftenv
```
```
mamba install -c bioconda snpeff
```
```
wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
```
```
git clone https://github.com/pcingola/SnpEff.git
git clone https://github.com/pcingola/SnpSift.git
```
```
snpEff download EquCab3.0.105
```
```
snpEff ann EquCab3.0.105 /home/vanburen/2024-vanburen-horse-genomes/arrayWGSanalysis/arraydata/array.vcf > annotated_horse_array.vcf
```
```
SnpSift caseControl -v "-++---+++---0--+-+++-" annotated_horse_array.vcf > case_control_output.vcf
```
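The ```+/-/0``` string assigns case (+), control (-), or missing (0) to each sample in VCF column order, so it must line up with the sample columns. One way to check the order:
```
# list samples in VCF column order; the +/-/0 string must match this order
bcftools query -l annotated_horse_array.vcf
```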
```
SnpSift extractFields case_control_output.vcf CHROM POS REF ALT "Cases" "Controls" "CC_REC" > gwas_results_snpsift.txt
```
Create Manhattan plot:
```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load your GWAS results
gwas_data = pd.read_csv("gwas_results_snpsift.txt", sep='\t', low_memory=False)
# Print column names to verify
print("Columns in the data:", gwas_data.columns)
# Rename columns to match the expected format
gwas_data = gwas_data.rename(columns={'CHROM': 'CHR', 'POS': 'BP'})
# Convert 'CHR' to numeric, forcing errors to NaN (if any)
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'].astype(str).str.replace('chr', ''), errors='coerce')
# Remove chromosome 0 and 33 (if present)
gwas_data = gwas_data[(gwas_data['CHR'] > 0) & (gwas_data['CHR'] != 33)]
# Calculate -log10(p-value) using 'CC_REC' column
gwas_data['neg_log10_p'] = -np.log10(gwas_data['CC_REC'].astype(float))
# Remove any rows with NaN values in 'CHR' or 'neg_log10_p'
gwas_data = gwas_data.dropna(subset=['CHR', 'neg_log10_p'])
# Sort the DataFrame by chromosome and then by base pair position
gwas_data = gwas_data.sort_values(by=['CHR', 'BP'])
# Create the Manhattan plot
plt.figure(figsize=(12, 6))
# Initialize an empty DataFrame for the adjusted positions
plot_data = pd.DataFrame()
# Initialize position offset
current_pos = 0
# Define colors for alternating chromosomes
colors = ['red', 'blue'] # Define two colors
# Loop through each chromosome and plot
for chr_num in sorted(gwas_data['CHR'].unique()):
subset = gwas_data[gwas_data['CHR'] == chr_num].copy() # Create a copy of the subset
# Calculate new x positions
subset['x_pos'] = subset['BP'] + current_pos
plot_data = pd.concat([plot_data, subset]) # Combine all subsets into one DataFrame
# Update current position for next chromosome
current_pos += subset['BP'].max() + 1 # Update for next chromosome, with a small offset
# Assign color based on whether the chromosome is odd or even
color = colors[int(chr_num) % 2] # Alternate between red and blue
plt.scatter(subset['x_pos'], subset['neg_log10_p'],
color=color, edgecolor='none', s=10, label=f'Chr {int(chr_num)}')
# Customizing the x-axis
plt.xlabel('Base Pair Position')
plt.ylabel('-log10(p-value)')
plt.title('Manhattan Plot for Recessive Association')
plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--') # Add a significance threshold line
# Set the x-ticks to reflect the chromosome positions with smaller font size
tick_positions = [plot_data[plot_data['CHR'] == chr_num]['x_pos'].iloc[0] for chr_num in sorted(gwas_data['CHR'].unique())]
tick_labels = [f'Chr {int(chr_num)}' for chr_num in sorted(gwas_data['CHR'].unique())]
plt.xticks(tick_positions, tick_labels, rotation=45, fontsize=8) # Set a smaller font size
# Tight layout and save the plot
plt.tight_layout()
plt.savefig("manhattan_plot_recessive_snpsift.png", dpi=300) # Save the plot as PNG with high resolution
plt.show()
```
The plot looks like a mess:

### For VCF data:
```
plink --file sos_biallelic_with_ids_pruned \
--recode vcf --out fromvcf \
--allow-extra-chr \
--chr-set 32
```
```
SnpSift caseControl "-+-++--0-++-++-++----" fromvcf.vcf > case_control_fromvcf_output.vcf
```
```
SnpSift extractFields case_control_fromvcf_output.vcf CHROM POS REF ALT "Cases" "Controls" "CC_REC" > gwas_results_snpsift.txt
```
Create Manhattan plot:
```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load your GWAS results
gwas_data = pd.read_csv("gwas_results_snpsift.txt", sep='\t', low_memory=False)
# Print column names to verify
print("Columns in the data:", gwas_data.columns)
print("Data type of CHROM column:", gwas_data['CHROM'].dtype)
# Rename columns to match the expected format
gwas_data = gwas_data.rename(columns={'CHROM': 'CHR', 'POS': 'BP'})
# Convert 'CHR' to numeric, handling both string and numeric inputs
gwas_data['CHR'] = pd.to_numeric(gwas_data['CHR'].astype(str).str.replace('chr', ''), errors='coerce')
# Remove chromosome 0 and 33 (if present)
gwas_data = gwas_data[(gwas_data['CHR'] > 0) & (gwas_data['CHR'] != 33)]
# Calculate -log10(p-value) using 'CC_REC' column
gwas_data['neg_log10_p'] = -np.log10(gwas_data['CC_REC'].astype(float))
# Remove any rows with NaN values in 'CHR' or 'neg_log10_p'
gwas_data = gwas_data.dropna(subset=['CHR', 'neg_log10_p'])
# Sort the DataFrame by chromosome and then by base pair position
gwas_data = gwas_data.sort_values(by=['CHR', 'BP'])
# Create the Manhattan plot
plt.figure(figsize=(12, 6))
# Initialize an empty DataFrame for the adjusted positions
plot_data = pd.DataFrame()
# Initialize position offset
current_pos = 0
# Define colors for alternating chromosomes
colors = ['red', 'blue'] # Define two colors
# Loop through each chromosome and plot
for chr_num in sorted(gwas_data['CHR'].unique()):
subset = gwas_data[gwas_data['CHR'] == chr_num].copy() # Create a copy of the subset
# Calculate new x positions
subset['x_pos'] = subset['BP'] + current_pos
plot_data = pd.concat([plot_data, subset]) # Combine all subsets into one DataFrame
# Update current position for next chromosome
current_pos += subset['BP'].max() + 1 # Update for next chromosome, with a small offset
# Assign color based on whether the chromosome is odd or even
color = colors[int(chr_num) % 2] # Alternate between red and blue
plt.scatter(subset['x_pos'], subset['neg_log10_p'],
color=color, edgecolor='none', s=10, label=f'Chr {int(chr_num)}')
# Customizing the x-axis
plt.xlabel('Base Pair Position')
plt.ylabel('-log10(p-value)')
plt.title('Manhattan Plot for Recessive Association')
plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--') # Add a significance threshold line
# Set the x-ticks to reflect the chromosome positions with smaller font size
tick_positions = [plot_data[plot_data['CHR'] == chr_num]['x_pos'].iloc[0] for chr_num in sorted(gwas_data['CHR'].unique())]
tick_labels = [f'Chr {int(chr_num)}' for chr_num in sorted(gwas_data['CHR'].unique())]
plt.xticks(tick_positions, tick_labels, rotation=45, fontsize=8) # Set a smaller font size
# Tight layout and save the plot
plt.tight_layout()
plt.savefig("manhattan_plot_recessive_snpsift.png", dpi=300) # Save the plot as PNG with high resolution
plt.show()
```
Not sure why this one also looks like a mess:
