Disco: Fast, good, and cheap outage detection

Abstract

Outage detection has been studied from different angles, such as active probing, analysis of background radiations, or control plane information. We approach outage detection from a new perspective. Disco is a detection technique that uses existing long-running TCP connections to identify bursts of disconnections. The benefits are considerable as we can monitor, without adding a single packet to the traffic, Internet-wide swaths of infrastructure that were not monitored previously because they are, for example, not responsive to ICMP probes or behind NATs. With Disco we analyze state changes on connections between RIPE Atlas probes and the RIPE Atlas infrastructure. This data, that is originally logged to monitor probe availability, has a small footprint and is available as a publicly accessible live stream, which makes light-weight near real-time outage detection possible. Probes perform planned traceroute measurements regardless of their connectivity to the RIPE Atlas infrastructure. This gives us a no cost advantage of viewing the outage inside out as the probes experienced it, characterizing the outage after the fact. Thus, we present an outage detection system able to run in near real-time (fast), with a precision of 95% (good), and without generating any new measurement traffic (cheap). We studied historical probe disconnections from 2011 to 2016 and report on the 443 most prominent outages. To validate our results we inspected traceroute results from affected probes and compared our detection to that of Trinocular.

Publication
Network Traffic Measurement and Analysis Conference, TMA 2017
Cristel Pelsser
Cristel Pelsser
Critical embedded systems, Computer networking, Researcher, Professor

The focus of my research is on network operations, routing, Internet measurements, protocols and security.