This tutorial introduces you to the Apache Hadoop and Spark frameworks for processing big data. These frameworks offer a novel way to build data analysis applications that scale across hundreds to thousands of machines. This data-parallel approach was pioneered in industry by tech companies such as Google and Facebook, and it applies to many scientific workloads. We will introduce the key concepts and features of the Apache Hadoop and Spark stacks. In addition, you will work through hands-on Spark exercises in a Jupyter notebook environment. The exercises and demos provide a basic understanding of Spark and demonstrate its applicability to bioinformatics tasks such as sequence alignment and variant calling using ADAM, and running BLAST on Hadoop.
