• Speaker: Andrew Ferguson
• Date: October 28th, 2011 (Friday)
• Room: CIT 345
• Title: "Jockey: Guaranteed Job Latency in Data Parallel Clusters"
• Abstract:
Data processing frameworks such as MapReduce and Dryad are used today in business environments where customers expect guaranteed performance. To date, however, these systems are not capable of providing guarantees on job latency because scheduling policies are based on fair-sharing, and operators seek high cluster use through statistical multiplexing and over-subscription. With Jockey, we provide latency SLAs for jobs in Bing's Cosmos environment. Jockey precomputes statistics using a simulator that captures the job's complex internal dependencies, accurately and efficiently predicting the remaining run time at different resource allocations and in different stages of the job. Our control policy monitors a job's performance, and dynamically adjusts resource allocation in the shared cluster to maximize the job's economic utility while minimizing its impact on the rest of the cluster. In our experiments in Bing's production clusters, Jockey meets the specified job latency SLAs and responds to changes in cluster conditions.
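
For readers new to the idea, the control loop described in the abstract can be illustrated with a toy sketch. This is not Jockey's implementation: the table values, the utility function, and the names `PREDICTED_REMAINING`, `utility`, `lookup_remaining`, and `choose_allocation` are all hypothetical, chosen only to show how precomputed run-time predictions might drive an allocation decision. The sketch assumes the policy picks the smallest allocation that achieves the highest utility, which is one simple way to balance meeting the deadline against impact on the rest of the cluster.

```python
from bisect import bisect_right

# Hypothetical precomputed table (made-up numbers): for each progress
# checkpoint (fraction of the job completed) and each allocation size,
# the predicted remaining run time in minutes.
PREDICTED_REMAINING = {
    0.0: {10: 120.0, 20: 70.0, 40: 45.0},
    0.5: {10: 60.0, 20: 35.0, 40: 22.0},
    0.9: {10: 12.0, 20: 7.0, 40: 5.0},
}


def utility(finish_time, deadline):
    """Toy utility: full value when on time, decaying linearly when late."""
    if finish_time <= deadline:
        return 1.0
    return max(0.0, 1.0 - (finish_time - deadline) / deadline)


def lookup_remaining(progress, allocation):
    """Predicted remaining minutes, read from the latest checkpoint at or
    below the current progress (this overestimates, so it is conservative)."""
    checkpoints = sorted(PREDICTED_REMAINING)
    idx = bisect_right(checkpoints, progress) - 1
    return PREDICTED_REMAINING[checkpoints[max(idx, 0)]][allocation]


def choose_allocation(progress, elapsed, deadline):
    """Return the smallest allocation that achieves the highest utility."""
    best_utility, best_alloc = None, None
    # Ascending order plus a strict comparison keeps the smallest winner.
    for alloc in sorted(PREDICTED_REMAINING[0.0]):
        finish = elapsed + lookup_remaining(progress, alloc)
        u = utility(finish, deadline)
        if best_utility is None or u > best_utility:
            best_utility, best_alloc = u, alloc
    return best_alloc


if __name__ == "__main__":
    # Re-evaluate the allocation as the job progresses against an 80-minute deadline.
    for progress, elapsed in [(0.0, 0.0), (0.5, 50.0), (0.9, 60.0)]:
        print(progress, elapsed, choose_allocation(progress, elapsed, 80.0))
```

In this sketch a job that falls behind (half done after 50 of 80 minutes) is moved to the largest allocation, while a job that is nearly finished with time to spare drops back to the smallest one.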