SP1 - SASKit

Learning the best biomarkers for pancreatic cancer, stroke and their comorbidity based on human, animal and in-vitro data

The bioinformatics work focuses on analyzing the transcriptomics and protein data, as well as blood counts and other clinical/phenotypic data, together with public data, with the goal of identifying biomarkers. We aim for easy-to-interpret diagnostic/theranostic biomarkers of deterioration/recovery and, ultimately, for therapeutic intervention. Machine learning of biomarkers is initially based on simple correlation-based approaches that can, in principle, include any measurement for which several time points or conditions are available. The highest-correlation interactions and correlation-based subnetworks will be investigated in detail in Summer/Fall 2023. For example, we would expect the longitudinal change in inflammatory blood components (e.g. CRP) to correlate with inflammation-related gene expression across tissue, age group and overall senescence status. We also run the most promising standard omics analyses, such as Gene Ontology and pathway enrichment analyses. As endpoint data such as progression/survival or recovery data become available, Cox proportional hazards models, support vector machines (SVMs), random forests and (deep) neural networks will be used.
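
The correlation-based starting point described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function name `change_correlations`, the array layout, the use of first-to-last time-point differences and the choice of Pearson correlation are all assumptions made for illustration.

```python
import numpy as np


def change_correlations(marker, expression):
    """Correlate the longitudinal change of one marker with the change of each gene.

    marker:     shape (n_subjects, n_timepoints), e.g. CRP per subject and visit
    expression: shape (n_subjects, n_timepoints, n_genes)
    Returns one Pearson coefficient per gene, computed across subjects.
    (Hypothetical helper; layout and change definition are assumptions.)
    """
    marker = np.asarray(marker, dtype=float)
    expression = np.asarray(expression, dtype=float)

    # Change between the first and the last time point for every subject.
    d_marker = marker[:, -1] - marker[:, 0]              # (n_subjects,)
    d_expr = expression[:, -1, :] - expression[:, 0, :]  # (n_subjects, n_genes)

    # Centre both quantities, then compute Pearson r gene by gene.
    d_marker = d_marker - d_marker.mean()
    d_expr = d_expr - d_expr.mean(axis=0)
    denom = np.sqrt((d_marker ** 2).sum() * (d_expr ** 2).sum(axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (d_marker @ d_expr) / denom
    return r  # NaN for genes whose change does not vary across subjects
```

Genes can then be ranked by the absolute value of the returned coefficients to select candidates for closer inspection. A rank-based coefficient such as Spearman's may be preferable when the relationship is monotonic but not linear.
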
An important component of our analysis is the parallelogram and transfer learning approach, in which we extrapolate expression data for one species/tissue combination once data for the three other combinations are known. We used machine learning with the gene expression dataset from van der Velpen (2016) to optimize the parallelogram algorithm. Neural networks reduced the prediction error for the unknown human tissue by 63% compared to the simple assumption that the tissue-to-blood expression ratios are exactly the same in both species, and we obtained similarly good results for publicly available gene expression datasets matching our SASKit project (mouse and human; blood, brain and pancreas; from the GTEx and MGI portals and from Srivastava (2020)). However, we could not find paired samples for both species in any public dataset. Furthermore, we began an in-depth investigation of transfer learning, which resulted in a review (Kowald et al., 2022). In particular, the 'transfer Variational Autoencoder' (trVAE) of Lotfollahi (2020) can predict a fourth dataset from three suitable datasets. We therefore plan to use trVAE to extrapolate expression in the human brain from our expression data on mouse brain and blood and on patient blood.
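
The simple baseline mentioned above, which assumes identical tissue-to-blood expression ratios in both species, can be sketched as follows. This is an illustration under stated assumptions, not the optimized algorithm: the function names are hypothetical, the inputs are assumed to be per-gene log-expression vectors over orthologous genes (so that a ratio becomes a difference), and the numbers in the usage lines are toy values, not project results.

```python
import numpy as np


def parallelogram_baseline(mouse_blood, mouse_tissue, human_blood):
    """Predict log-expression in the unmeasured human tissue.

    Assumes the tissue-to-blood ratio of each gene is the same in mouse and
    human. On a log scale this reads:
        human_tissue = human_blood + (mouse_tissue - mouse_blood)
    (Hypothetical helper; log-scale inputs are an assumption.)
    """
    mouse_blood = np.asarray(mouse_blood, dtype=float)
    mouse_tissue = np.asarray(mouse_tissue, dtype=float)
    human_blood = np.asarray(human_blood, dtype=float)
    return human_blood + (mouse_tissue - mouse_blood)


def relative_error_reduction(baseline_error, model_error):
    """Fraction by which a model lowers the error relative to the baseline."""
    return 1.0 - model_error / baseline_error


# Toy log-expression values for three orthologous genes (illustrative only).
predicted = parallelogram_baseline(
    mouse_blood=[1.0, 2.0, 3.0],
    mouse_tissue=[2.0, 2.0, 5.0],
    human_blood=[0.0, 1.0, 1.0],
)
```

A learned model can be compared against this baseline by computing the same error measure for both predictions and passing the two values to `relative_error_reduction`.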