# MCScan

Provided files:

```
$conda activate JCVI
$cd scratch/data/SbicolorMCscan
$ls
sbicolor.bed  sbicolor.cds  sbicolor.sbicolor.lifted.anchors
```

Create the blocks file with ```--iter=3```:

```
$python -m jcvi.compara.synteny mcscan sbicolor.bed sbicolor.sbicolor.lifted.anchors --iter=3 -o sbicolor.sbicolor.i3.blocks
[15:03:13] DEBUG    Load file `sbicolor.bed`    base.py:34
           DEBUG    Load file `sbicolor.sbicolor.lifted.anchors`    base.py:34
Chain started: 312 blocks
Chain 0: score=5984
36 blocks remained..
Chain 1: score=237
done!
[15:03:14] DEBUG    MCscan blocks written to `sbicolor.sbicolor.i3.blocks`.    synteny.py:1557
```

Check the ```sbicolor.sbicolor.i3.blocks``` file:

```
$head sbicolor.sbicolor.i3.blocks
Sobic.001G000100 . .
Sobic.001G000200 . .
Sobic.001G000400 . .
Sobic.001G000300 . .
Sobic.001G000501 . .
Sobic.001G000700 . .
Sobic.001G000800 . .
Sobic.001G000900 . .
Sobic.001G001000 . .
Sobic.001G001066 . .
$tail sbicolor.sbicolor.i3.blocks
Sobic.K028200 . .
Sobic.K044412 . .
Sobic.K028600 . .
Sobic.K044413 . .
Sobic.K044414 . .
Sobic.K029900 . .
Sobic.K030000 . .
Sobic.K030100 . .
Sobic.K030200 . .
Sobic.K030700 . .
```

Remove the rows that contain ```Sobic.K*``` IDs:

```
$egrep -v "Sobic.K" sbicolor.sbicolor.i3.blocks > blocks
$tail blocks
Sobic.010G279100 . .
Sobic.010G279200 . .
Sobic.010G279300 . .
Sobic.010G279450 . .
Sobic.010G279600 . .
Sobic.010G279700 . .
Sobic.010G279800 . .
Sobic.010G279900 . .
Sobic.010G280000 . .
Sobic.010G280100 . .
```

# Excel Sheet

In this section, we use an Excel sheet to match RING classes to the ```blocks``` file created above.

On the local computer:

```
$scp joelshin@pronghorn.rc.unr.edu:~/scratch/data/SbicolorMCscan/blocks ~/Downloads
```

Open the Excel file that contains the RING classes.
The file ```RING_ID_Class.xlsx``` contains this:

```
Sobic.001G055900 RING-HC
Sobic.001G095200 RING-HC
Sobic.001G121200 RING-HC
Sobic.001G138300 RING-HC
Sobic.001G151800 RING-HC
Sobic.001G161900 RING-HC
Sobic.001G162900 RING-HC
Sobic.001G163300 RING-HC
Sobic.001G191400 RING-HC
Sobic.001G196000 RING-HC
```

And ```blocks.xlsx``` contains this:

```
Sobic.001G000100 . .
Sobic.001G000200 . .
Sobic.001G000400 . .
Sobic.001G000300 . .
Sobic.001G000501 . .
Sobic.001G000700 . .
Sobic.001G000800 . .
Sobic.001G000900 . .
Sobic.001G001000 . .
Sobic.001G001066 . .
```

In column D of ```blocks.xlsx```, enter this formula:

```
=COUNTIF(RING_ID_Class.xlsx!$A$1:$A$431,A1:A34027)
```

The ```=COUNTIF(x, y)``` function takes two arguments. Argument ```x``` is the range to search; here it is the full list of RING gene IDs in ```RING_ID_Class.xlsx```. Argument ```y``` is the value to look for; here it is the gene ID in column A of ```blocks.xlsx```. If that gene ID appears in the RING list, it is counted, so each row gets 1 for a RING gene (assuming each ID is listed once) and 0 otherwise.

For column E:

```
=COUNTIF(RING_ID_Class.xlsx!$A$1:$A$431,B1:B34027)
```

For column F:

```
=COUNTIF(RING_ID_Class.xlsx!$A$1:$A$431,C1:C34027)
```

Sum columns D, E, and F in column G:

```
=SUM(D1+E1+F1)
...
```

Use ```Ctrl + F``` on column G with ```values``` selected in the "Look in:" box, and search for ```2``` and ```3```. A sum of ```2``` means two of the three genes in that row are RING genes, which we interpret as a RING gene duplication; a sum of ```3``` means all three are RING genes, which presumably represents multiple RING gene duplications.

To make things easier, let's copy and paste the results into a ```RING_Links.blocks``` file.

```
$vim RING_Links.blocks
***Copied and pasted***
$wc -l RING_Links.blocks
34027
$head RING_Links.blocks
Sobic.001G000100 . . 0 0 0 0
Sobic.001G000200 . . 0 0 0 0
Sobic.001G000400 . . 0 0 0 0
Sobic.001G000300 . . 0 0 0 0
Sobic.001G000501 . . 0 0 0 0
Sobic.001G000700 . . 0 0 0 0
Sobic.001G000800 . . 0 0 0 0
Sobic.001G000900 . . 0 0 0 0
Sobic.001G001000 . . 0 0 0 0
Sobic.001G001066 . . 0 0 0 0
```

Let's save the rows whose sum (column 7) is ```2```:

```
$awk '$7 == 2 {print $1 "\t" $2 "\t" $3}' RING_Links.blocks > value_2.blocks
$head value_2.blocks
Sobic.001G055900 Sobic.002G055800 .
Sobic.001G063400 Sobic.002G040600 .
Sobic.001G063900 Sobic.002G038733 .
Sobic.001G138300 Sobic.008G188700 .
Sobic.001G153900 . Sobic.002G038733
Sobic.001G161900 Sobic.008G156200 .
Sobic.001G204300 Sobic.006G269600 .
Sobic.001G226900 Sobic.001G508600 .
Sobic.001G227000 Sobic.001G508600 .
Sobic.001G227100 Sobic.001G508466 .
```

Let's also save the rows whose sum is ```3```:

```
$awk '$7 == 3 {print $1 "\t" $2 "\t" $3}' RING_Links.blocks > value_3.blocks
$head value_3.blocks
Sobic.002G038733 Sobic.001G063900 Sobic.001G153900
Sobic.004G348500 Sobic.007G165000 Sobic.001G502400
```

From the ```value_2.blocks``` and ```value_3.blocks``` files, we can see the important RING gene links. Here ```RING n``` denotes a RING gene on chromosome n:

```
From value_2.blocks file
RING 1 -> RING 1
RING 1 -> RING 2
RING 1 -> RING 6
RING 1 -> RING 8
RING 2 -> RING 1
RING 2 -> RING 3
RING 2 -> RING 7
RING 3 -> RING 9
RING 3 -> RING 10
RING 4 -> RING 6
RING 4 -> RING 10
RING 5 -> RING 8
RING 6 -> RING 1
RING 6 -> RING 4
RING 6 -> RING 7
RING 7 -> RING 2
RING 7 -> RING 6
RING 8 -> RING 1
RING 8 -> RING 5
RING 9 -> RING 3
RING 10 -> RING 4

From value_3.blocks file
RING 2 -> RING 1 -> RING 1
RING 4 -> RING 7 -> RING 1
```

# The Selection

According to the Microsynteny section in the MCscan documentation, we need a ```.blocks``` file, a ```.bed``` file, and a ```.layout``` file. Let's discuss the ```.blocks``` file first. According to the ```value_2.blocks``` file, as many as 10 chromosomes have RING gene links. This is an example of what the ```.blocks``` file should look like for 10 different chromosomes:

```
$vim roughdraft.blocks
$cat roughdraft.blocks
Sobic.001 . . . . . . . . .
Sobic.002 . . . . . . . . .
. . . . . . . . .
Sobic.004 . . . . . . . . .
Sobic.005 . . . . . . . . .
Sobic.006 . . . . . . . . .
Sobic.007 . . . . . . . . .
Sobic.008 . . . . . . . . .
Sobic.009 . . . . . . . . .
Sobic.010
```

As you can see, Column 1 represents Gene 1, Column 2 represents Gene 2, and so on. From the ```value_2.blocks``` and ```value_3.blocks``` files, let's select the best representative for each RING link. These are the ones I chose:

```
RING 1 -> RING 2             Sobic.001G055900 Sobic.002G055800
RING 1 -> RING 6             Sobic.001G204300 Sobic.006G269600
RING 1 -> RING 8             Sobic.001G138300 Sobic.008G188700
RING 2 -> RING 1             Sobic.002G055800 Sobic.001G055900
RING 2 -> RING 3             Sobic.002G256300 Sobic.003G403400
RING 2 -> RING 7             Sobic.002G212200 Sobic.007G147000
RING 3 -> RING 9             Sobic.003G367200 Sobic.009G153100
RING 3 -> RING 10            Sobic.003G018300 Sobic.010G043500
RING 4 -> RING 7 -> RING 1   Sobic.004G348500 Sobic.007G165000 Sobic.001G502400
RING 4 -> RING 6             Sobic.004G283400 Sobic.006G184500
RING 4 -> RING 10            Sobic.004G042100 Sobic.010G246000
RING 5 -> RING 8             Sobic.005G001000 Sobic.008G001200
RING 6 -> RING 1             Sobic.006G269600 Sobic.001G204300
RING 6 -> RING 4             Sobic.006G184500 Sobic.004G283400
RING 6 -> RING 7             Sobic.006G205100 Sobic.007G086900
RING 7 -> RING 2             Sobic.007G196201 Sobic.002G256601
RING 7 -> RING 6             Sobic.007G086900 Sobic.006G205100
RING 8 -> RING 1             Sobic.008G188700 Sobic.001G138300
RING 8 -> RING 5             Sobic.008G001200 Sobic.005G001000
RING 9 -> RING 3             Sobic.009G153100 Sobic.003G367200
RING 10 -> RING 4            Sobic.010G246000 Sobic.004G042100
```

This selection has a problem: the distance between the selected RING genes on the same chromosome is too large. For example, take ```Sobic.001G055900``` and ```Sobic.001G204300```:

```
$egrep "Sobic.001G055900" sbicolor.bed
Chr01 4188377 4193250 Sobic.001G055900 0 -
$egrep "Sobic.001G204300" sbicolor.bed
Chr01 18617327 18620279 Sobic.001G204300 0 -
```

From the coordinates, ```Sobic.001G055900``` starts at 4.18 Mb and ```Sobic.001G204300``` ends at 18.62 Mb. MCscan will draw the chromosome region from 4.18 Mb to 18.62 Mb.
This is a problem because it will try to draw every Sobic gene between 4.18 and 18.62 Mb. Therefore, let's select the best and closest links. Our aim is a coordinate difference of about 500 kb, or at most just shy of 1 Mb. From this, a new selection was formed:

```
Sobic.001G153900 Sobic.002G038733 . . . . . . . .
. . . . Sobic.005G012100 . . Sobic.008G031800 . .
Sobic.001G161900 . . . . . . Sobic.008G156200 . .
. . . Sobic.004G283400 . Sobic.006G184500 . . . .
. . . . . Sobic.006G205100 Sobic.007G086900 . . .
Sobic.001G204300 . . . . Sobic.006G269600 . . . .
. . Sobic.003G018400 . . . . . . Sobic.010G043500
. . Sobic.003G044700 . . . . . Sobic.009G065200 .
. . . Sobic.004G305000 . . . . . Sobic.010G086400
. Sobic.002G256300 Sobic.003G403400 . . . . . . .
Sobic.001G502400 . . Sobic.004G348500 . . Sobic.007G165000 . . .
. Sobic.002G239200 . . . . Sobic.007G221000 . . .
```

From the new selection, these are the representative sizes:

```
Chr1: 12.37 - 77.11 Mb
Chr2: 3.78 - 64.25 Mb
Chr3: 1.63 - 71.13 Mb
Chr4: 62.51 - 67.76 Mb
Chr5: 1.12 - 1.12 Mb
Chr6: 53.97 - 60.21 Mb
Chr7: 11.70 - 64.93 Mb
Chr8: 2.84 - 58.85 Mb
Chr9: 6.96 - 6.96 Mb
Chr10: 3.87 - 7.44 Mb
```

As you can see, the coordinate issue remains: every chromosome with more than one selected gene still spans well over 1 Mb.

# Test Blocks file

Our goal is to obtain a representative size between 0.5 and 1 Mb.
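
The representative-size check can be scripted instead of being read off ```sbicolor.bed``` by hand. The following is a minimal sketch, assuming a six-column BED like ```sbicolor.bed```; the helper names ```parse_bed```, ```representative_spans```, and ```too_wide```, and the ```max_bp``` parameter, are hypothetical and not part of JCVI. The two BED records are the ones shown in the ```egrep``` output in The Selection section.

```python
# Two BED records copied from the egrep output earlier in these notes.
BED_TEXT = (
    "Chr01\t4188377\t4193250\tSobic.001G055900\t0\t-\n"
    "Chr01\t18617327\t18620279\tSobic.001G204300\t0\t-\n"
)


def parse_bed(text):
    """Map gene ID -> (chrom, start, end) from BED-formatted text."""
    genes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, start, end, gene_id = fields[:4]
        genes[gene_id] = (chrom, int(start), int(end))
    return genes


def representative_spans(genes, selected_ids):
    """Return chrom -> (smallest start, largest end) over the selected genes."""
    spans = {}
    for gene_id in selected_ids:
        if gene_id not in genes:
            continue  # '.' placeholders and unknown IDs are ignored
        chrom, start, end = genes[gene_id]
        lo, hi = spans.get(chrom, (start, end))
        spans[chrom] = (min(lo, start), max(hi, end))
    return spans


def too_wide(spans, max_bp=1_000_000):
    """List chromosomes whose span exceeds max_bp (assumed 1 Mb goal)."""
    return sorted(c for c, (lo, hi) in spans.items() if hi - lo > max_bp)


genes = parse_bed(BED_TEXT)
# Gene IDs as they would appear in a .blocks row, including a '.' placeholder.
spans = representative_spans(genes, ["Sobic.001G055900", ".", "Sobic.001G204300"])
for chrom, (lo, hi) in spans.items():
    print(f"{chrom}: {lo / 1e6:.2f} - {hi / 1e6:.2f} Mb")
print("Over 1 Mb:", too_wide(spans))
```

To check a real selection, one could read ```sbicolor.bed``` into ```parse_bed``` and pass every gene ID from the candidate ```.blocks``` file; any chromosome reported by ```too_wide``` needs closer links.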