The Cloudera Cluster Validation System: Checking for hundreds of
thousands of known issues per day
Cloudera is a leading provider of big-data processing software,
particularly in the Hadoop ecosystem. High-performance distributed
systems are notoriously difficult to set up, manage, and troubleshoot,
and so we have invested heavily in tools to automate this work. In
particular, we check all incoming diagnostic data bundles from
customers against a library of hundreds of known problems.
This talk will describe the how and why of that system. Our
programming model is simple enough that support staff can write the
checks, expressive enough to cover hundreds of checks, scalable enough
to grow with the business, and efficient enough to leave running
constantly.
We will also explain how we exploit a limited form of execution
tracing to automate regression testing for these checks, and discuss
how the system fits into our business processes and how the two have
co-evolved.
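
To make the idea concrete, here is a minimal sketch of what a check-based programming model with lightweight tracing might look like. It is an illustration under stated assumptions, not the system described in the talk: the registry, the TracedBundle wrapper, the bundle keys, and the thresholds are all hypothetical. Each check is a plain function over a diagnostic bundle, and the wrapper records which fields a check reads so that a saved bundle can be trimmed into a small regression fixture.

```python
from typing import Callable, Dict, Iterable, Optional, Set


class TracedBundle:
    """Wraps a diagnostic bundle (a flat dict here) and records every key read."""

    def __init__(self, data: Dict[str, object]) -> None:
        self._data = data
        self.reads: Set[str] = set()

    def get(self, key: str, default=None):
        self.reads.add(key)  # the "limited tracing": only field reads are logged
        return self._data.get(key, default)


# Hypothetical registry mapping check names to check functions.
CHECKS: Dict[str, Callable[[TracedBundle], Optional[str]]] = {}


def check(fn):
    """Register a check; a check returns a message if the issue is present."""
    CHECKS[fn.__name__] = fn
    return fn


@check
def heap_too_small(bundle):
    # Key name and threshold are made up for illustration.
    heap_mb = bundle.get("namenode.heap_mb", 0)
    if heap_mb < 1024:
        return f"NameNode heap is {heap_mb} MB; expected at least 1024 MB"
    return None


@check
def swappiness_high(bundle):
    # Key name and threshold are made up for illustration.
    if bundle.get("os.vm.swappiness", 0) > 10:
        return "vm.swappiness is above 10"
    return None


def run_checks(data: Dict[str, object]) -> Dict[str, dict]:
    """Run every registered check and keep its finding plus the keys it read."""
    results = {}
    for name, fn in CHECKS.items():
        traced = TracedBundle(data)
        results[name] = {"finding": fn(traced), "reads": sorted(traced.reads)}
    return results


def make_fixture(data: Dict[str, object], reads: Iterable[str]) -> Dict[str, object]:
    """Keep only the fields a check actually read, for use as a regression fixture."""
    return {key: data[key] for key in reads if key in data}
```

In this sketch, a regression test for a check can be generated by running it once on a real bundle, saving the trimmed fixture together with the finding, and asserting later that the check still produces the same finding on that fixture.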