## Extrachromosomal DNA (microDNA) in Human, Mouse and Chicken
Extrachromosomal circular DNA (eccDNA) is double-stranded circular DNA that is derived from chromosomes but exists free of them.
MicroDNA is the most abundant subtype of eccDNA.
Dataset (GEO accession GSE68644): https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68644

| SampleID | Organisms/Tissues |
| ---------- | -------------------- |
| GSM1677986 | Mouse-AMBrain.Rep1 |
| GSM1677987 | Mouse-AMHeart.Rep1 |
| GSM1677988 | Mouse-AMKidney.Rep1 |
| GSM1677989 | Mouse-AMLiver.Rep1 |
| GSM1677990 | Mouse-AMLung.Rep1 |
| GSM1677991 | Mouse-AMSMuscle.Rep1 |
| GSM1677992 | Mouse-AMSperm.Rep1 |
| GSM1677993 | Mouse-AMSpleen.Rep1 |
| GSM1677994 | Mouse-AMTestis.Rep1 |
| GSM1677995 | Mouse-AMThymus.Rep1 |
| GSM1677996 | Chicken-DT40-BRCA1 |
| GSM1677997 | Chicken-DT40-BRCA2 |
| GSM1677998 | Chicken-DT40-CtIP |
| GSM1677999 | Chicken-DT40-Ku70 |
| GSM1678000 | Chicken-DT40-Lig4 |
| GSM1678001 | Chicken-DT40-MSH3 |
| GSM1678002 | Chicken-DT40-NBS1 |
| GSM1678003 | Chicken-DT40-Rad54 |
| GSM1678004 | Chicken-DT40-WT |
| GSM1678005 | Human-C4-2 |
| GSM1678006 | Human-ES2 |
| GSM1678007 | Human-LnCap |
| GSM1678008 | Human-OVCAR8 |
| GSM1678009 | Human-PC-3 |
I removed the chicken samples and am considering only human (for now).
## 1. MicroDNA from Human Ovarian and Prostate Cancer Cell Lines.
The authors examined human cancer cell lines of two origins: prostate (LNCaP, C4-2, and PC-3) and ovarian (OVCAR8 and ES-2).
## 2. Liftover from hg19 (used by the authors) to hg38
### - Total length per chromosome

microDNAs from the human cancer cell lines are primarily 100 to 400 bp in length.
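
As a minimal sketch of how the total length per chromosome and the 100 to 400 bp range could be checked on a lifted BED file: the toy file, its coordinates and the variable names below are assumptions made for illustration, not real microDNA calls.

```shell
# Toy BED file with hypothetical intervals (NOT real microDNA calls).
workdir=$(mktemp -d)
toy_bed="$workdir/toy.microDNA.lift.bed"
printf 'chr1\t100\t300\n'   >  "$toy_bed"   # 200 bp
printf 'chr1\t1000\t1150\n' >> "$toy_bed"   # 150 bp
printf 'chr2\t500\t1000\n'  >> "$toy_bed"   # 500 bp
printf 'chr10\t0\t120\n'    >> "$toy_bed"   # 120 bp

# BED coordinates are 0-based and half-open, so length = end - start.
length_table=$(awk 'BEGIN { FS = OFS = "\t" }
    { total[$1] += $3 - $2 }
    END { for (c in total) print c, total[c] }' "$toy_bed" | LC_ALL=C sort)

# Count the intervals whose length falls in the 100-400 bp range.
in_range=$(awk -F'\t' '{ len = $3 - $2 }
    len >= 100 && len <= 400 { n++ }
    END { print n + 0 }' "$toy_bed")
n_total=$(awk 'END { print NR }' "$toy_bed")

printf 'Chromosome\tTotal_length\n%s\n' "$length_table"
echo "$in_range of $n_total intervals are 100-400 bp long"
```

Pointing `toy_bed` at one of the `GSM*microDNA.lift.bed` files should give the same two summaries for a real sample.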

## 3. Adjust all HPRC chromosomes
- Normalization and duplicate removal (adjust_pgggraph.sh)
```
for chrom in {1..22} X Y; do
    # Split multiallelic records, strip the "grch38#" prefix from contig names, then drop duplicate records
    bcftools norm --multiallelics -both -f grch38.fa.gz /lizardfs/erikg/HPRC/year1v2genbank/wgg.88/chr${chrom}.pan/chr${chrom}.pan.fa.*.smooth.grch38.vcf.gz -o chr${chrom}_norm_grch38.vcf &&
    sed 's/grch38#//g' chr${chrom}_norm_grch38.vcf > chr${chrom}.tmp.vcf &&
    bcftools norm -d all chr${chrom}.tmp.vcf -Oz -o chr${chrom}_norm_grch38.norm.nodup.vcf.gz
done
```
### - Number of intervals along chromosomes
```
declare -A cell_line_chrom_counts
# Process each lifted BED file in the directory
for file in GSM*microDNA.lift.bed; do
    # Extract the full cell line name (e.g. C4-2, not just C4): drop everything up to
    # "Human-", then the "microDNA.lift.bed" suffix and the separator in front of it
    cell_line=${file#*Human-}
    cell_line=${cell_line%microDNA.lift.bed}
    cell_line=${cell_line%[._-]}
    # Loop through each chromosome
    for chrom in {1..22} X Y; do
        # Count the intervals whose first column is exactly this chromosome
        # (a plain grep for "chr1" would also count chr10-chr19 and chr1_*_alt contigs)
        count=$(awk -v c="chr$chrom" '$1 == c { n++ } END { print n + 0 }' "$file")
        # Add to the running total for this cell line and chromosome (starts at 0)
        key="$cell_line,$chrom"
        cell_line_chrom_counts[$key]=$(( ${cell_line_chrom_counts[$key]:-0} + count ))
    done
done
# Print the results and write them to cell_line_chrom_counts.txt
{
    echo -e "Cell_lines\tChromosome\tNumber_intervals"
    for key in "${!cell_line_chrom_counts[@]}"; do
        echo -e "${key%%,*}\t${key#*,}\t${cell_line_chrom_counts[$key]}"
    done
} | tee cell_line_chrom_counts.txt
```

### - Intersection between each bed files and HPRC vcfs (intersect.sh):
```
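vcffolder="/lizardfs/flaviav/microDNA/microDNA/GSE68644_nochicken/GSE68644_human/out_pggb/"
bedfolder="/lizardfs/flaviav/microDNA/microDNA/GSE68644_nochicken/GSE68644_human/"
for vcffile in "$vcffolder"/*.vcf.gz; do
    for bedfile in "$bedfolder"/*.bed; do
        # Output name: <BED file name>_<VCF file name>
        outfile="$(basename "$bedfile")_$(basename "$vcffile")"
        # Keep only the variants that fall inside the BED intervals.
        # Note: the vcftools manual describes the --bed file as having a header line,
        # so check that the first interval of a headerless BED file is not skipped.
        vcftools --gzvcf "$vcffile" --bed "$bedfile" --recode --stdout | bgzip -c > "$outfile"
    done
done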
```
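
To illustrate what the intersection keeps, here is a minimal awk sketch on toy data. It assumes the usual coordinate conventions (BED is 0-based and half-open, VCF `POS` is 1-based) and tests only `POS`. The toy file names and coordinates are hypothetical. vcftools applies its own overlap rule, which takes the whole REF/ALT allele into account, so this approximates the idea and does not replace `intersect.sh`.

```shell
# Toy inputs with hypothetical coordinates (NOT real data).
workdir=$(mktemp -d)
toy_bed="$workdir/toy.bed"
toy_vcf="$workdir/toy.vcf"
printf 'chr1\t100\t200\n' > "$toy_bed"      # covers 1-based positions 101..200
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n'   # just before the interval
    printf 'chr1\t101\t.\tC\tT\t.\t.\t.\n'   # first base inside
    printf 'chr1\t200\t.\tG\tA\t.\t.\t.\n'   # last base inside
    printf 'chr1\t201\t.\tT\tC\t.\t.\t.\n'   # just after the interval
    printf 'chr2\t150\t.\tA\tC\t.\t.\t.\n'   # chromosome without intervals
} > "$toy_vcf"

# First file (BED): load the intervals per chromosome.
# Second file (VCF): keep header lines and records whose POS lies in an interval.
awk 'BEGIN { FS = OFS = "\t" }
    NR == FNR { n[$1]++; lo[$1, n[$1]] = $2; hi[$1, n[$1]] = $3; next }
    /^#/ { print; next }
    {
        for (i = 1; i <= n[$1]; i++)
            if ($2 > lo[$1, i] && $2 <= hi[$1, i]) { print; break }
    }' "$toy_bed" "$toy_vcf" > "$workdir/toy.in_bed.vcf"

kept=$(grep -vc '^#' "$workdir/toy.in_bed.vcf")
echo "Kept $kept of 5 records"
```

The boundary records show the off-by-one between the two formats: position 100 is outside the interval `100-200`, while position 200 is inside it.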
### - Statistics on these files (stats.sh):
```
# Define a list of cell lines
cell_lines=("C4-2" "ES2" "PC-3" "OVCAR8")
# Define a list of chromosomes
chromosomes=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y)
# Define the path to the folder containing the VCF files
vcffolder="/lizardfs/flaviav/microDNA/microDNA/GSE68644_nochicken/GSE68644_human/out_intersect"
# Create a header for the output file
echo -e "Cell_lines\tCount\tTypes\tChr" > output.txt
# Iterate through all cell lines
for cell_line in "${cell_lines[@]}"; do
    # Iterate through all chromosomes
    for chromosome in "${chromosomes[@]}"; do
        # Iterate through the VCF files of this cell line and chromosome.
        # The trailing "_" keeps chr1 from also matching chr10-chr19.
        for vcffile in "$vcffolder"/*"$cell_line"*"chr${chromosome}_"*vcf.gz; do
            # Check if the file exists
            if [ -f "$vcffile" ]; then
                # Run bcftools stats on the VCF file for the current cell line and chromosome
                bcftools stats "$vcffile" > stats.txt
                # Extract the counts of SNPs, MNPs, INDELs, and "others".
                # SN lines are "SN <id> <key> <value>", so the count is the 4th tab-separated field.
                snps=$(grep "number of SNPs:" stats.txt | cut -f 4)
                mnps=$(grep "number of MNPs:" stats.txt | cut -f 4)
                indels=$(grep "number of indels:" stats.txt | cut -f 4)
                others=$(grep "number of others:" stats.txt | cut -f 4)
                # Write the counts to the output file
                echo -e "$cell_line\t$snps\tSNPs\tchr$chromosome" >> output.txt
                echo -e "$cell_line\t$mnps\tMNPs\tchr$chromosome" >> output.txt
                echo -e "$cell_line\t$indels\tINDELs\tchr$chromosome" >> output.txt
                echo -e "$cell_line\t$others\tothers\tchr$chromosome" >> output.txt
            fi
        done
    done
done
```
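
The summary (`SN`) lines written by `bcftools stats` are tab-separated, with the key in the third field and the count in the fourth. The sketch below parses a toy file that imitates that layout. The counts are made up and `get_sn` is a hypothetical helper name, so treat it as an illustration of the parsing step only.

```shell
# Toy summary lines in the SN layout (made-up counts, NOT real results).
workdir=$(mktemp -d)
toy_stats="$workdir/toy.stats.txt"
{
    printf '# SN\t[2]id\t[3]key\t[4]value\n'
    printf 'SN\t0\tnumber of records:\t10\n'
    printf 'SN\t0\tnumber of SNPs:\t6\n'
    printf 'SN\t0\tnumber of MNPs:\t1\n'
    printf 'SN\t0\tnumber of indels:\t2\n'
    printf 'SN\t0\tnumber of others:\t1\n'
} > "$toy_stats"

# get_sn KEY FILE (hypothetical helper): print the value (4th field)
# of the SN line whose key (3rd field) matches KEY exactly.
get_sn() {
    awk -F'\t' -v key="$1" '$1 == "SN" && $3 == key { print $4 }' "$2"
}

snps=$(get_sn "number of SNPs:" "$toy_stats")
mnps=$(get_sn "number of MNPs:" "$toy_stats")
indels=$(get_sn "number of indels:" "$toy_stats")
others=$(get_sn "number of others:" "$toy_stats")

# The second field is the file id, not a count.
second_field=$(grep "number of SNPs:" "$toy_stats" | cut -f 2)

echo -e "SNPs=$snps\tMNPs=$mnps\tINDELs=$indels\tothers=$others"
```

Matching on the exact key avoids accidental hits on other lines that merely contain the same words.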

- TO DO:
- [x] Email to check the reference that they used
- [ ] Repeat on mouse (liftover)

Notes: contigs to check and remove
- chr4_GL000008v2_random
- chrUn_KI270742v1
- chr14_GL000009v2_random
- chr15_KI270850v1_alt
- chr17_KI270862v1_alt
- chr22_KI270879v1_alt
- chr2_KI270894v1_alt
- chr7_KI270803v1_alt
- chr1_KI270766v1_alt
- chr1_KI270706v1_random

```
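for file in *.bed
do
    # Drop random, alt and unplaced contigs (including two-digit chromosomes and chrUn), then empty lines
    grep -v -E "^chr([0-9]+|[XY]|Un)_" "$file" | sed '/^$/d' > temp.bed
    mv temp.bed "$file"
done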
```
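
An alternative to listing the contigs to remove is to keep only the primary chromosomes. The sketch below shows that whitelist approach on a toy BED file. The contig names come from the notes above, the coordinates and file names are hypothetical, and it assumes that only chr1-chr22, chrX and chrY should be retained.

```shell
# Toy BED file mixing primary chromosomes, non-primary contigs and an empty line.
workdir=$(mktemp -d)
toy_bed="$workdir/toy.bed"
{
    printf 'chr1\t100\t300\n'
    printf 'chr10\t50\t250\n'
    printf 'chrX\t10\t210\n'
    printf 'chr14_GL000009v2_random\t5\t205\n'
    printf 'chr22_KI270879v1_alt\t7\t207\n'
    printf 'chrUn_KI270742v1\t9\t209\n'
    printf '\n'
} > "$toy_bed"

# Keep a line only if its first column is exactly chr1..chr22, chrX or chrY.
# Empty lines have an empty first column, so they are dropped as well.
awk '$1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y)$/' "$toy_bed" > "$workdir/toy.primary.bed"

kept_chroms=$(cut -f 1 "$workdir/toy.primary.bed" | tr '\n' ' ')
echo "Kept chromosomes: $kept_chroms"
```

A whitelist does not need updating when a new `_alt`, `_random` or `chrUn` contig shows up after liftover, which is the main reason to prefer it over a pattern that enumerates what to exclude.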
Notes: