Projects and status

# Projects and status

## Strategic direction

- Strategic objectives to the CSP.

## Publications coming:

- SDG contigger/scaffolder:
    - microbes? hybrid or not, maybe technology agnostic? maybe with the log.
    - replacing w2rap: pe + LMP on thaliana and such (up to human?/barley?/wheat?).
- SDG HiFi assembler
- strawberry RG/Hapil (probably with some of SORTei) - january
- strawberry all genomes
- pseudoseq
- strider
- magic pe+lmp+10x+hi-c
- CS42 full integration
- PP genome +genomes
- F.cylindrus?

---

## SDG Roadmap

building, simplifying, reconnecting

1) sdg contigger/pe + lmp to replace w2rap.
2) Hi-C
3) HiFi -> graph construction

---

## Wheat pangenome assemblies (NIAB's magic population)

**Status:** assemblies with NIAB (w2rap pe+LMP). Weebil published as part of 10+ genomes.

**Next research objective:** SDG pe+lmp+10x+Hi-C assemblies?

**Publications to do:**

- A genomes paper led by us.

---

## Strawberry

**Status:** a lot of data produced, no analyses were ever finished or published. RG assembly in perpetual "about to finish" status.

**Next research objective:** RG pe+nanopore+10x+lmp assembly.

- Bring LinkedReadsMapper in line with PairedReadsMapper.

**Publications to do:**

- RG genome paper
- RG annotation + biology paper
- RG + Hapil genome / map paper / comparison
- Many other analyses of this dataset.

---

## SDG

**Status:** there's a big backlog of cleanup to do, but heuristics for genome assembly are mostly implemented.

**Next research objectives:**

- HiFi assemblies
- Different recipes for data integration

---

## F.cylindrus

---

## Pink pigeon

---

## Salzberg wheat (CS42 illumina+pacbio+10x+Hi-C)

---

## Pseudoseq / polyploid puzzles / strider

---

## Nicotiana

---

## Bees collaboration with Wilfried

---

## Metagenomic collaboration with Chris

---

# SDG YiI student project proposal

#YiI Student proposal

Genome sequencing technologies have matured to the point that we are now producing genome references for all living things.
Genome assemblers have long been tasked with combining the information from fragmented and imprecise reads coming from the instruments into genome assemblies representing a version of an organism's genetic code. The most used representation for the problem of assembling a genome sequencing dataset is a graph that shows the sequences read by the instruments and their adjacencies or overlaps. Because different assemblers are developed by independent groups in independent projects, they often use a purpose-driven version of this representation that is incompatible with other assemblers and other types of data. This leads to genome assembly projects being conducted as a series of "improvements" upon a base sequence reconstruction, effectively correcting and joining assembled sequences together as the analyses move from one sequencing technology to the next.

Our group specialises in complex genome assemblies, where the current state-of-the-art methods are still challenged to produce adequate references. The biggest challenge in this area is to adequately reconstruct multiple copies of areas of high similarity in the genome. A very relevant biological feature that corresponds with this is the presence of multiple haplotypes in the genome. The current approaches to distinguish between different haplotypes involve the use of trios to separate paternal and maternal haplotypes before assembly, and the use of phasing data and approaches to separate or correct the haplotype composition after assembly. Still, the haplotypes are often mixed or incorrectly reconstructed, even on relatively straightforward diploid samples containing only two haplotypes. Appropriate reconstruction of haplotypes from organisms with higher ploidy, especially in genomes that present other complications, is very much an unsolved problem. This is holding back the advance of genomic methods in a whole range of organisms, including many crops.
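To make the graph representation and the haplotype problem described above more concrete, here is a minimal toy sketch in Python. It is not SDG code and does not use the SDG API: it builds a simple de Bruijn-style graph, used here only as one common example of a sequence graph. The two haplotype sequences, the value of `k`, and the function names are all hypothetical and chosen for illustration. Two haplotypes that differ by a single base share most of their graph and diverge at one node, forming a "bubble" that an assembler has to either collapse or resolve.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_graph(sequences, k):
    """Toy de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(set)
    for seq in sequences:
        for kmer in kmers(seq, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_nodes(graph):
    """Nodes with more than one successor, i.e. where the sequences diverge."""
    return sorted(node for node, nexts in graph.items() if len(nexts) > 1)


# Two hypothetical haplotypes differing by a single base (C vs T).
hap_a = "ACGTTGCATGGA"
hap_b = "ACGTTGTATGGA"

graph = build_graph([hap_a, hap_b], k=4)
print(branch_nodes(graph))  # → ['TTG']
```

In this toy case the only branching node is the one just before the variant. Real datasets add sequencing errors, repeats and uneven coverage, which is why telling genuine haplotype branches apart from noise is hard.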
The group's strategy to advance the genome assembly field to a haplotype-resolution level has taken two routes: to develop analytical methods to validate haplotype reconstruction and correctness, and to develop a genome assembly framework that enables the concurrent analysis of information from different sequencing technologies to appropriately reconstruct haplotypes to the best ability of the combined dataset. K-mer spectra vs. copy number analyses and k-mer completeness, a set of analytical methods we pioneered with our KAT tool in 2013, have now been widely accepted as the main tool to evaluate haplotype reconstruction completeness. While we are still advancing these and other analytical and validation tools, we have found that current versions of these analyses already pinpoint major shortcomings in the current methods for haplotype reconstruction. Our work on Pseudoseq.jl (https://github.com/bioinfologics/Pseudoseq.jl) has enabled us to present conceptually simple and clean versions of the polyploid assembly problem to state-of-the-art tools and to understand where some of the shortcomings of their approaches lie.

The next step for our group is to shift the analyses from haplotype-aware (that is, trying to guess how many haplotypes there are and reconstruct them) to haplotype-specific (that is, to analyse each haplotype independently as long as the data allows for it). Our graph-based assembly analytical framework, SDG, presented in a short publication last year (https://f1000research.com/articles/8-1490), can be used as a Python library to produce interactive analyses of all the data involved in a genome assembly project, integrating across different technologies and datasets.

We have a number of areas in which to apply a combination of SDG methods to the problem of haplotype-specific genome reconstruction:

- Analyses of graph topology, where we aim to understand when and how sequencing technologies provide data supporting the reconstruction of multiple haplotypes. These are problems of signal analysis and classification.
- Data integration tests, where we analyse the results of current pipelines vs. the integration of all information into a single instance of an SDG workspace to find unnecessary errors or ambiguity. These are again problems of signal analysis and classification, but focused on evaluating combinations of data and solutions.
- Haplotype-specific assembly methods, where we tackle particular datasets and organisms and their reconstruction. This area is tightly related to producing haplotype-specific assemblies for crops, where we assemble multiple cultivars in a haplotype-specific manner, enabling better analyses of diversity. A number of novel approaches are possible by implementing multi-datatype analyses in SDG.
- Multi-datatype one-step assembly method development, where we aim to replace some of the current iterative processes to combine different data into a final assembly. This is of particular relevance to large-scale assembly projects currently aiming to sequence "all living organisms".

In general, the introduction of methods to analyse variation in-the-graph will allow us to use the information from whole datasets without introducing methodological bias.

The group has two large datasets we are currently working on: 10 wheat genomes with different datatypes, and 6 strawberry genomes plus multiple resequencing datasets of these. Applied methods for crops would be geared towards these, but other methodological developments can be conducted on DToL datasets or others.

# Grants and funding to apply for:

- *UKRI future leaders for bj:* given that technically I am just out of the
- ERC / WT investigator grant (the ultimate goal!).