Low-latency Network Monitoring via Oversubscribed Port Mirroring (ONS ’14)

Modern networks operate at a speed and scale that make it impossible for human operators to manually respond to transient problems, e.g., congestion induced by workload dynamics. Even reacting to issues in seconds can cause significant disruption, so network operators overprovision their networks to minimize the likelihood of problems. Software-defined networking (SDN) introduces the possibility of building autonomous, self-tuning networks that constantly monitor network conditions and react rapidly to problems. Previous work has demonstrated that an SDN controller can install new routes in tens of milliseconds, but state-of-the-art network measurement systems take hundreds of milliseconds or more to collect a view of current network conditions. To support future autonomous SDNs, a much lower-latency network monitoring mechanism is necessary, especially as we move from 1 Gbps to 10 Gbps and 40 Gbps links, which require 10x and 40x faster measurement to detect flows of the same size. We believe that networks need to, and can, adapt to network dynamics at timescales closer to milliseconds or less. This paper introduces Planck, a network measurement architecture that provides statistics on 1 Gbps and 10 Gbps networks at 3.5–6.5 ms timescales, more than an order of magnitude improvement over the state of the art. Planck does so with a novel use of the well-known port mirroring mechanism: it mirrors all traffic traversing a switch to a small number of oversubscribed monitor ports (one in our current system). When more traffic is sent to the monitor ports than they can handle, the excess is dropped, so each monitor port emits a “random” sample of all traffic on the switch. As this is done at line rate in the data path, Planck collects orders of magnitude more samples per second than is possible using previous monitoring mechanisms.
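The arithmetic behind two of the claims above can be sketched briefly. The following is a minimal illustration and not part of Planck itself: the function names are hypothetical, and it assumes that an oversubscribed monitor port drops packets roughly uniformly, so the expected sampled fraction is simply the monitor port's capacity divided by the offered mirrored load. It also shows why a flow of fixed size occupies a faster link for proportionally less time, which is what drives the 10x and 40x measurement-speed requirement.

```python
def mirrored_fraction(offered_gbps: float, monitor_gbps: float) -> float:
    """Expected fraction of mirrored traffic that the monitor port emits.

    Assumption (not stated in the source): drops at the oversubscribed
    port are spread uniformly over the mirrored traffic.
    """
    if offered_gbps <= 0 or monitor_gbps <= 0:
        raise ValueError("rates must be positive")
    # Below capacity nothing is dropped, so the fraction is capped at 1.
    return min(1.0, monitor_gbps / offered_gbps)


def transfer_time_ms(flow_bytes: int, link_gbps: float) -> float:
    """Time in milliseconds for a flow of flow_bytes to cross a link at line rate."""
    return flow_bytes * 8 / (link_gbps * 1e9) * 1e3


if __name__ == "__main__":
    # Hypothetical load: 40 Gbps of switch traffic mirrored to one 10 Gbps port.
    print("sampled fraction:", mirrored_fraction(40, 10))
    # The same 1.25 MB flow is visible for less time as links get faster.
    for link_gbps in (1, 10, 40):
        print(link_gbps, "Gbps:", transfer_time_ms(1_250_000, link_gbps), "ms")
```

Under these assumptions the sampled fraction falls as the switch gets busier, yet the absolute number of samples per second stays pinned at the monitor port's line rate.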
In addition to supporting sFlow-style [9] sampling, Planck provides extremely low-latency link utilization and flow rate estimates, as well as alerts when congestion is detected. A Planck collector, consisting of software running on a commodity server, receives the mirrored samples and performs lightweight analysis to provide measurement data to interested parties, e.g., an SDN controller.
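To make the collector's role concrete, the following is a minimal sketch of the kind of lightweight analysis described above. It is an illustration under stated assumptions, not the Planck implementation. The abstract does not specify an estimator, so the sketch assumes TCP traffic and estimates a flow's rate from the sequence-number difference between two samples of that flow. The `Sample` and `Collector` names, the flow key, and the 90% alert threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    flow: tuple  # hypothetical flow key, e.g. (src, dst, sport, dport)
    t: float     # capture timestamp in seconds
    seq: int     # TCP sequence number carried by the sampled packet


class Collector:
    """Illustrative collector: per-flow rate estimates plus a congestion flag."""

    def __init__(self, link_bps: float, alert_fraction: float = 0.9):
        self.link_bps = link_bps
        self.alert_fraction = alert_fraction  # assumed threshold, not from the source
        self._last = {}   # most recent sample per flow
        self._rates = {}  # most recent rate estimate per flow, in bits/s

    def observe(self, s: Sample):
        """Record a sample; return an updated rate estimate in bits/s, or None."""
        prev = self._last.get(s.flow)
        if prev is None:
            self._last[s.flow] = s
            return None
        if s.t <= prev.t:
            return None  # ignore duplicate or out-of-order timestamps
        delta = (s.seq - prev.seq) % 2**32  # tolerate 32-bit sequence wraparound
        if delta >= 2**31:
            return None  # sequence moved backwards (e.g. retransmission): skip
        rate = delta * 8 / (s.t - prev.t)
        self._last[s.flow] = s
        self._rates[s.flow] = rate
        return rate

    def utilization(self) -> float:
        """Estimated link utilization as a fraction of link capacity."""
        return sum(self._rates.values()) / self.link_bps

    def congested(self) -> bool:
        return self.utilization() >= self.alert_fraction
```

A controller could poll `congested()` or be notified from `observe()`. Because sequence numbers count bytes sent, two samples of a flow suffice for an estimate even when most of the flow's packets were dropped at the monitor port.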