## Extrachromosomal DNA (microDNA) in Human, Mouse and Chicken

MicroDNA is the most abundant subtype of extrachromosomal circular DNA (eccDNA), a type of double-stranded circular DNA that is derived from chromosomes but exists free of them.

Data source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68644

| SampleID   | Organisms/Tissues    |
| ---------- | -------------------- |
| GSM1677986 | Mouse-AMBrain.Rep1   |
| GSM1677987 | Mouse-AMHeart.Rep1   |
| GSM1677988 | Mouse-AMKidney.Rep1  |
| GSM1677989 | Mouse-AMLiver.Rep1   |
| GSM1677990 | Mouse-AMLung.Rep1    |
| GSM1677991 | Mouse-AMSMuscle.Rep1 |
| GSM1677992 | Mouse-AMSperm.Rep1   |
| GSM1677993 | Mouse-AMSpleen.Rep1  |
| GSM1677994 | Mouse-AMTestis.Rep1  |
| GSM1677995 | Mouse-AMThymus.Rep1  |
| GSM1677996 | Chicken-DT40-BRCA1   |
| GSM1677997 | Chicken-DT40-BRCA2   |
| GSM1677998 | Chicken-DT40-CtIP    |
| GSM1677999 | Chicken-DT40-Ku70    |
| GSM1678000 | Chicken-DT40-Lig4    |
| GSM1678001 | Chicken-DT40-MSH3    |
| GSM1678002 | Chicken-DT40-NBS1    |
| GSM1678003 | Chicken-DT40-Rad54   |
| GSM1678004 | Chicken-DT40-WT      |
| GSM1678005 | Human-C4-2           |
| GSM1678006 | Human-ES2            |
| GSM1678007 | Human-LnCap          |
| GSM1678008 | Human-OVCAR8         |
| GSM1678009 | Human-PC-3           |

I removed the chicken samples and am considering only the human samples (for now).

## 1. MicroDNA from Human Ovarian and Prostate Cancer Cell Lines

The authors examined human cancer cell lines of two origins: prostate (LNCaP, C4-2, and PC-3) and ovarian (OVCAR8 and ES-2).

## 2. Liftover (hg19 (used by the authors) --> hg38)

### - Total length per chromosome

![](https://i.imgur.com/cec7gQ4.png)

MicroDNAs from the human cancer cell lines are primarily 100 to 400 bp in length.

![](https://i.imgur.com/DbZzuMS.png)

### 3. Adjust all HPRC chromosomes - Normalization and removing duplicates (adjust_pgggraph.sh)

```
for chrom in {1..22} X Y; do
    # Split multiallelic records, strip the "grch38#" prefix, then drop duplicate records
    bcftools norm --multiallelics -both -f grch38.fa.gz \
        /lizardfs/erikg/HPRC/year1v2genbank/wgg.88/chr${chrom}.pan/chr${chrom}.pan.fa.*.smooth.grch38.vcf.gz \
        -o chr${chrom}_norm_grch38.vcf &&
    sed 's/grch38#//g' chr${chrom}_norm_grch38.vcf > chr${chrom}.tmp.vcf &&
    bcftools norm -d all chr${chrom}.tmp.vcf -Oz -o chr${chrom}_norm_grch38.norm.nodup.vcf.gz
done
```

### - Number of intervals along chromosomes

```
declare -A cell_line_chrom_counts

# Process each lifted BED file in the directory
for file in GSM*microDNA.lift.bed; do
    # Extract the cell line name from the file name.
    # Note: cut takes the second hyphen-delimited field, so a hyphenated
    # name such as C4-2 is shortened to C4.
    cell_line=$(echo "$file" | grep -o "Human-.*" | cut -d"-" -f 2)

    for chrom in {1..22} X Y; do
        # Count the intervals on this chromosome; the match is anchored so
        # that chr1 does not also count chr10 to chr19
        count=$(grep -c "^chr${chrom}[[:space:]]" "$file")
        key="$cell_line,$chrom"
        # Add to the running total, starting from 0 if the key is new
        cell_line_chrom_counts[$key]=$(( ${cell_line_chrom_counts[$key]:-0} + count ))
    done
done

# Print the results and write them to a file
{
    echo -e "Cell_lines\tChromosome\tNumber_intervals"
    for key in "${!cell_line_chrom_counts[@]}"; do
        echo -e "${key%%,*}\t${key#*,}\t${cell_line_chrom_counts[$key]}"
    done
} | tee cell_line_chrom_counts.txt
```

![](https://i.imgur.com/KiGlHvn.png)

### - Intersection between each BED file and the HPRC VCFs (intersect.sh):

```
vcffolder="/lizardfs/flaviav/microDNA/microDNA/GSE68644_nochicken/GSE68644_human/out_pggb/"
bedfolder="/lizardfs/flaviav/microDNA/microDNA/GSE68644_nochicken/GSE68644_human/"

for vcffile in "$vcffolder"/*.vcf.gz; do
    for bedfile in "$bedfolder"/*.bed; do
        outfile="$(basename "$bedfile")_$(basename "$vcffile")"
        # Keep only the variants that fall inside the microDNA intervals.
        # Note: vcftools expects the BED file to have a header line.
        vcftools --gzvcf "$vcffile" --bed "$bedfile" --recode --stdout | bgzip -c > "$outfile"
    done
done
```

### - Statistics on these files (stats.sh):

```
# Define a list of cell lines
cell_lines=("C4-2" "ES2" "PC-3" "OVCAR8")
# Define a list of chromosomes
chromosomes=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y)
# Define the path to the folder containing the intersected VCF files
vcffolder="/lizardfs/flaviav/microDNA/microDNA/GSE68644_nochicken/GSE68644_human/out_intersect"

# Create a header for the output file
echo -e "Cell_lines\tCount\tTypes\tChr" > output.txt

for cell_line in "${cell_lines[@]}"; do
    for chromosome in "${chromosomes[@]}"; do
        # The trailing "_" stops chr1 from also matching chr10 to chr19
        for vcffile in "$vcffolder"/*"$cell_line"*"chr${chromosome}_"*vcf.gz; do
            # Skip the unexpanded pattern when no file matches
            if [ -f "$vcffile" ]; then
                bcftools stats "$vcffile" > stats.txt

                # Summary (SN) lines have the columns: SN, id, key, value,
                # so the count is in column 4
                snps=$(grep "number of SNPs:" stats.txt | cut -f 4)
                mnps=$(grep "number of MNPs:" stats.txt | cut -f 4)
                indels=$(grep "number of indels:" stats.txt | cut -f 4)
                others=$(grep "number of others:" stats.txt | cut -f 4)

                # Write the counts to the output file
                echo -e "$cell_line\t$snps\tSNPs\tchr$chromosome" >> output.txt
                echo -e "$cell_line\t$mnps\tMNPs\tchr$chromosome" >> output.txt
                echo -e "$cell_line\t$indels\tINDELs\tchr$chromosome" >> output.txt
                echo -e "$cell_line\t$others\tothers\tchr$chromosome" >> output.txt
            fi
        done
    done
done
```

![](https://i.imgur.com/2hRHZBg.jpg)

- TO DO:
    - [x] Email to check the reference that they used
    - [ ] Do on mouse (liftover)

Notes

CHECK AND REMOVE the intervals on these contigs:

- chr4_GL000008v2_random
- chrUn_KI270742v1
- chr14_GL000009v2_random
- chr15_KI270850v1_alt
- chr17_KI270862v1_alt
- chr22_KI270879v1_alt
- chr2_KI270894v1_alt
- chr7_KI270803v1_alt
- chr1_KI270766v1_alt
- chr1_KI270706v1_random

```
for file in *.bed
do
    # Drop intervals on random, alt and unplaced (chrUn) contigs, including
    # two-digit chromosomes such as chr14_..., and remove empty lines
    grep -Ev '^chr([0-9]+|[XY]|Un)_' "$file" | sed '/^$/d' > temp.bed
    mv temp.bed "$file"
done
```

Notes:
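The per-chromosome totals and the 100 to 400 bp length range mentioned in section 2 could be recomputed from the lifted BED files along the following lines. This is only a minimal sketch, not the script behind the plots: the function names `bed_length_per_chrom` and `bed_count_in_range` and the toy intervals are hypothetical, and it assumes tab-separated BED files with 0-based, half-open coordinates.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the toy intervals below are hypothetical.

# Print "chrom <TAB> number_of_intervals <TAB> total_bp" for one BED file.
# BED intervals are 0-based and half-open, so the length is end - start.
bed_length_per_chrom() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF >= 3 { n[$1]++; bp[$1] += $3 - $2 }
         END { for (c in n) print c, n[c], bp[c] }' "$1" | sort -k1,1
}

# Print "intervals_in_range <TAB> total_intervals" for lengths in [min, max] bp.
bed_count_in_range() {
    awk -v min="$2" -v max="$3" 'BEGIN { FS = "\t" }
         NF >= 3 { len = $3 - $2; total++; if (len >= min && len <= max) hit++ }
         END { printf "%d\t%d\n", hit, total }' "$1"
}

# Toy input with made-up coordinates (not taken from GSE68644).
toy_bed=$(mktemp)
printf 'chr1\t100\t300\nchr1\t1000\t1150\nchr10\t50\t650\nchrX\t0\t90\n' > "$toy_bed"

bed_length_per_chrom "$toy_bed"
bed_count_in_range "$toy_bed" 100 400

rm -f "$toy_bed"
```

On the real data, each function would be called once per `GSM*microDNA.lift.bed` file. Comparing the first column by exact value (`$1`) rather than by pattern avoids mixing chr1 with chr10 to chr19.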