Resource Management in Shared Distributed Systems

In distributed systems shared by multiple tenants, effective resource management is an important prerequisite for providing quality-of-service guarantees.

However, providing performance guarantees and isolation in multi-tenant distributed systems is extremely hard. Tenants share not only fine-grained resources within a process (such as thread pools and locks) but also resources across multiple processes and machines (such as the disk and the network) along the execution path of their requests. As a result, traditional resource management mechanisms in the operating system and in the hypervisor are ineffective because they manage resources at the wrong granularity. Moreover, tenant-generated requests compete for shared resources not only with each other but also with system-generated tasks, such as replication and garbage collection. In addition, the bottleneck responsible for degrading the performance of a tenant can change in unpredictable ways depending on its input workload, the workload of other tenants and system tasks, the overall state of the system (including caches), and the (nonlinear) performance characteristics of underlying resources.

In this work, we approach the problem of shared systems from two directions. First, we present Retro, a resource management framework for shared distributed systems that takes a top-down approach. Retro monitors per-tenant resource usage both within and across distributed systems, and exposes this information to centralized resource management policies through a high-level API. A policy can shape the resources consumed by a tenant using Retro’s control points, which enforce sharing and rate-limiting decisions.
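
To make the monitor, policy, and control point split concrete, here is a minimal sketch in Python. It is not Retro's API: the function max_min_fair_rates, the class TokenBucket, and all parameter names are hypothetical, and max-min fairness is only one policy a centralized controller might choose. The sketch assumes that per-tenant usage has already been measured and aggregated, and that time is passed in explicitly so the behavior is deterministic.

```python
def max_min_fair_rates(usage, capacity):
    """Hypothetical policy: split `capacity` across tenants max-min fairly.

    `usage` maps tenant -> measured demand on one resource (any unit per second).
    Tenants that need less than an equal share keep their demand; the
    remainder is divided equally among the tenants that want more.
    """
    rates = {}
    remaining = capacity
    pending = sorted(usage.items(), key=lambda kv: kv[1])  # smallest demand first
    while pending:
        share = remaining / len(pending)
        tenant, demand = pending[0]
        if demand <= share:
            rates[tenant] = demand
            remaining -= demand
            pending.pop(0)
        else:
            # Every remaining tenant wants more than an equal share.
            for name, _ in pending:
                rates[name] = share
            break
    return rates


class TokenBucket:
    """Hypothetical control point: admits work at `rate` units per second."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of the last refill, in seconds

    def set_rate(self, rate):
        # Called when the policy issues a new decision for this tenant.
        self.rate = rate

    def try_acquire(self, cost, now):
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In this sketch, a controller would periodically call max_min_fair_rates on the latest usage reports and push each result to the matching tenant's bucket with set_rate, while request-handling code calls try_acquire before using the resource.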

Second, we present 2DFQ, a fair-queue scheduling algorithm based on Weighted Fair Queueing (WFQ). Using fair queue schedulers to provide fairness in shared systems is difficult because of high execution concurrency, and because request costs are unknown and have high variance. 2DFQ spreads requests of different costs across different threads and minimizes the impact of tenants with unpredictable requests.
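
For readers unfamiliar with fair queueing, the following Python sketch shows the basic virtual-time idea that WFQ-style schedulers build on. It is not 2DFQ: it is a simplified, self-clocked fair queue for a single worker, and it assumes each request's cost is known when it is enqueued, which the text above notes is not the case in practice. The class name FairQueue and its methods are hypothetical.

```python
import heapq
from itertools import count


class FairQueue:
    """Simplified self-clocked fair queue (illustrative, single worker)."""

    def __init__(self, weights):
        self.weights = dict(weights)             # tenant -> relative share
        self.virtual_time = 0.0                  # finish tag of the last dequeued request
        self.last_finish = {t: 0.0 for t in self.weights}
        self._heap = []                          # entries: (finish_tag, seq, tenant, request)
        self._seq = count()                      # tie-breaker: FIFO among equal tags

    def enqueue(self, tenant, request, cost):
        # A request starts after the tenant's previous request finishes,
        # but never earlier than the current virtual time.
        start = max(self.virtual_time, self.last_finish[tenant])
        finish = start + cost / self.weights[tenant]
        self.last_finish[tenant] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), tenant, request))

    def dequeue(self):
        # Serve the request with the smallest virtual finish tag.
        finish, _, tenant, request = heapq.heappop(self._heap)
        self.virtual_time = finish
        return tenant, request
```

With equal weights, if tenant A enqueues three unit-cost requests and tenant B then enqueues one, B's request is served second rather than waiting behind all of A's backlog. Expensive or mispredicted requests distort these tags, which is the kind of difficulty the paragraph above describes.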

People Involved

Jonathan Mace

PhD Student (2012)

Distributed Systems, Operating Systems

Rodrigo Fonseca

Associate Professor of Computer Science

Faculty

Peter Bodik

Collaborator

Microsoft Research, Redmond

Madan Musuvathi

Collaborator

Microsoft Research, Redmond
