SP 4

Data handling, optimization of analysis workflows and applications

The smooth processing of large genomic datasets, such as those produced in genome sequencing, requires both a sophisticated data management strategy and optimized algorithms and analysis pipelines. The analysis of cancer genomes in a discovery setting is usually performed on large-scale high-performance computing (HPC) systems that allow parallelized processing of the analysis tasks. We will optimize existing analysis pipelines from the consortium for our HPC infrastructure in order to speed up the identification of oncogenic drivers and modulators. In a diagnostic setting, the requirements for the compute infrastructure differ substantially: the systems should be small, have low infrastructural requirements, and keep operating and maintenance costs as low as possible. This calls for specialized hardware. We aim to use commodity computer hardware equipped with accelerators (e.g. GPGPUs) and coprocessors (e.g. Intel Xeon Phi), and we will migrate the required data analysis pipelines to this hardware to provide a stable, computationally efficient system suitable for processing next-generation cancer genome diagnostic workflows. Data lifecycle management is required to ensure data integrity and efficient data handling. To cope with the high demands on data management, the RRZK will provide data lifecycle management based on iRODS, which was specifically designed to provide user-friendly and highly customizable data management functionality. Results will be made available by implementing interfaces between the data lifecycle management system and a relational database, combined with a portal that extends the existing sample database. We will thus provide comprehensive, ready-to-use data for clinicians.

Keywords: Data handling, optimization of analysis workflows and applications