Title: "MADDER and Self-Tuning Data Analytics on Hadoop with Starfish"
Abstract
Timely and cost-effective analytics over "big data" is now a key ingredient for success in businesses and scientific disciplines. The Hadoop platform (consisting of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of interfaces, from procedural to declarative, for expressing analysis tasks) is an emerging choice for big data analytics. Out of the box, however, Hadoop's performance can be poor, causing suboptimal use of resources, time, and money (e.g., in pay-as-you-go clouds). Unfortunately, practitioners of big data analytics, such as business analysts, computational scientists, and researchers, often lack the expertise to tune the Hadoop platform for good performance.
I will introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop and adapts to system workloads and user needs to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in the Hadoop platform. While Starfish's design is guided by work on self-tuning database systems, I will discuss how new analysis practices over big data (dubbed the MADDER principles) pose new challenges, leading us to different design choices in Starfish. Starfish is under active development and is available at: http://www.cs.duke.edu/starfish