====== Genome Data Science ======
\\
{{ media:pangenom_small.png?nolink&400}}

Welcome to the web pages of the Genome Data Science group at the Faculty of Technology. The group is headed by Prof. Dr. Alexander Schönhuth and develops methods and tools for working with tens of thousands of genomes and for analyzing and integrating the corresponding data.

===== Recent research highlights: =====

==== AI accessible (pan-)genome embeddings (most recent) ====

  * Computing Linkage Disequilibrium Aware Genome Embeddings using Autoencoders (paper link: https://www.biorxiv.org/content/10.1101/2023.11.01.565013v1.abstract, software link: https://github.com/gizem-tas/haploblock-autoencoders)
  * Generating synthetic genomes using diffusion models (ongoing)
  * Embedding the kingdom of bacteria (ongoing)
  * Embedding drugs with diseased cell lines (ongoing)

==== (Pan-)genome assembly: ====

  * HyLight: Strain aware assembly of low coverage metagenomes (paper link: https://www.biorxiv.org/content/10.1101/2023.12.22.572963v1, software link: https://github.com/HaploKit/HyLight)
  * Hybrid-hybrid correction of errors in long reads with HERO (paper link: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03112-7, software link: https://github.com/HaploKit/HERO)
  * VeChat: correcting errors in long reads using variation graphs (paper link: https://www.nature.com/articles/s41467-022-34381-8, software link: https://github.com/HaploKit/vechat)
  * StrainXpress: strain aware metagenome assembly from short reads (paper link: https://academic.oup.com/nar/article/50/17/e101/6625806, software link: https://github.com/HaploKit/StrainXpress)
  * Strainline: full-length de novo viral haplotype reconstruction from noisy long reads (paper link: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02587-6, software link: https://github.com/HaploKit/Strainline)
  * phasebook: haplotype-aware de novo assembly of diploid genomes from long reads (paper link: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02512-x, software link: https://github.com/phasebook/phasebook)

==== AI based prediction of disease risks: ====

  * Predicting the prevalence of complex genetic diseases from individual genotype profiles using capsule networks (paper link: https://www.nature.com/articles/s42256-022-00604-2, software link: https://github.com/HaploKit/DiseaseCapsule)
  * Machine Learning-Based Ensemble Recursive Feature Selection of Circulating miRNAs for Cancer Tumor Classification (paper link: https://www.mdpi.com/2072-6694/12/7/1785, software link: https://github.com/steppenwolf0/circulating)
  * Deep variational graph autoencoders for novel host-directed therapy options against COVID-19 (paper link: https://www.sciencedirect.com/science/article/pii/S0933365722001701?via%3Dihub, software link: https://github.com/sumantaray/Covid19)

==== FDR control for single cell / somatic variants: ====

  * Accurate and scalable variant calling from single cell DNA sequencing data with ProSolo (paper link: https://www.nature.com/articles/s41467-021-26938-w, software link: https://github.com/prosolo/prosolo)
  * Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery (paper link: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01993-6, software link: https://varlociraptor.github.io/landing/)

===== Particular research highlights =====

  * [[http://bioinformatics.oxfordjournals.org/content/29/24/3143.abstract|Twilight zone indels]] ([[https://bitbucket.org/tobiasmarschall/clever-toolkit/wiki/Home|Software]])
  * [[http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3021.html|The Genome of the Netherlands]]
  * [[https://www.nature.com/articles/ncomms12989|Associating Disease Traits with genotypes]], in particular complex, large and difficult-to-discover ones
  * [[http://biorxiv.org/content/early/2017/03/29/121954|Cancer genome variants]] ([[https://prosic.github.io/|Software]])
  * [[http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.0157|Read-Based Phasing]] ([[https://whatshap.readthedocs.io/en/latest/|Software]])

===== Genome Data Mining =====

We have developed various tools for mining relevant patterns in (huge) genomes and networks. Research highlights:

  * [[http://bioinformatics.oxfordjournals.org/cgi/content/abstract/bts566?ijkey=stM0R7NizuqFm6m&keytype=ref|Maximal cliques in overlap graphs]] ([[https://bitbucket.org/tobiasmarschall/clever-toolkit/wiki/Home|Software]])
  * [[http://genome.cshlp.org/content/27/5/835.ful|Reconstruction of viral quasispecies]] (a particular highlight of ours; [[https://bitbucket.org/jbaaijens/savage|Software]])
  * [[https://homepages.cwi.nl/~as/|Cancer subnetwork markers]] ([[https://homepages.cwi.nl/~as/software.html#wDCB|Software]])