Principled Workflow-Centric Tracing of Distributed Systems (SoCC ’16)

Workflow-centric tracing captures the workflow of causally related events (e.g., work done to process a request) within and among the components of a distributed system. As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for understanding distributed system behavior. Yet, there is a fundamental lack of clarity about how such infrastructures should be designed to provide maximum benefit for important management tasks, such as resource accounting and diagnosis. Without research into this important issue, there is a danger that workflow-centric tracing will not reach its full potential. To help, this paper distills the design space of workflow-centric tracing and describes key design choices that can help or hinder a tracing infrastructure’s utility for important tasks. Our design space and the design choices we suggest are based on our experiences developing several previous workflow-centric tracing infrastructures.