Disco: Fast, good, and cheap outage detection
Anant Shah, Romain Fontugne, Emile Aben, Cristel Pelsser, and Randy Bush
Abstract
Outage detection has been studied from different angles, such as active probing, analysis of background radiation, or control plane information. We approach outage detection from a new perspective. Disco is a detection technique that uses existing long-running TCP connections to identify bursts of disconnections. The benefits are considerable: without adding a single packet to the traffic, we can monitor Internet-wide swaths of infrastructure that were not monitored previously because they are, for example, not responsive to ICMP probes or behind NATs. With Disco we analyze state changes on connections between RIPE Atlas probes and the RIPE Atlas infrastructure. This data, which is originally logged to monitor probe availability, has a small footprint and is available as a publicly accessible live stream, which makes lightweight near-real-time outage detection possible. Probes perform planned traceroute measurements regardless of their connectivity to the RIPE Atlas infrastructure. This gives us the no-cost advantage of viewing the outage from the inside out, as the probes experienced it, and of characterizing the outage after the fact. Thus, we present an outage detection system able to run in near real time (fast), with a precision of 95% (good), and without generating any new measurement traffic (cheap). We studied historical probe disconnections from 2011 to 2016 and report on the 443 most prominent outages. To validate our results, we inspected traceroute results from affected probes and compared our detection to that of Trinocular.
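As a rough illustration of what "identifying bursts of disconnections" can mean, the following is a minimal sketch that flags time spans in which several probes disconnect close together. It is not the detection model used in the paper, which the abstract does not specify; a plain sliding-window count is swapped in for simplicity. The function name `find_disconnect_bursts`, the 60-second window, and the threshold of three events are all hypothetical choices, and the input is assumed to be a list of disconnection timestamps in seconds rather than data read from the actual RIPE Atlas stream.

```python
from collections import deque
from typing import Iterable, List, Tuple


def find_disconnect_bursts(
    disconnect_times: Iterable[float],
    window_s: float = 60.0,   # hypothetical window length, in seconds
    min_events: int = 3,      # hypothetical burst threshold
) -> List[Tuple[float, float]]:
    """Return (start, end) spans in which at least `min_events`
    disconnections fall within `window_s` seconds of each other.

    Illustrative sliding-window heuristic only; not the paper's model.
    """
    window: deque = deque()
    bursts: List[Tuple[float, float]] = []
    for t in sorted(disconnect_times):
        window.append(t)
        # Drop events that are older than the window relative to `t`.
        while t - window[0] > window_s:
            window.popleft()
        if len(window) >= min_events:
            start = window[0]
            if bursts and start <= bursts[-1][1]:
                # Overlaps the previous burst: extend it to `t`.
                bursts[-1] = (bursts[-1][0], t)
            else:
                bursts.append((start, t))
    return bursts


if __name__ == "__main__":
    # Synthetic timestamps: two dense clusters among isolated events.
    times = [0, 500, 1000, 1005, 1010, 1020, 2000, 5000, 5001, 5002]
    print(find_disconnect_bursts(times))
```

Isolated disconnections, which happen routinely for individual probes, never reach the threshold on their own; only clusters of near-simultaneous disconnections are reported as candidate outages.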
Publication Details
- Publication Type: Conference Paper
- Publication Date: June 2017
- Published In: 2017 Network Traffic Measurement and Analysis Conference (TMA)
- Pages: 1–9
- Publisher: IEEE
- Location: Dublin, Ireland
- Digital Object Identifier (DOI): 10.23919/TMA.2017.8002902
- External Link: http://icube-publis.unistra.fr/4-SFAP17
Suggested citation
Anant Shah, Romain Fontugne, Emile Aben, Cristel Pelsser, and Randy Bush. 2017. Disco: Fast, good, and cheap outage detection. In 2017 Network Traffic Measurement and Analysis Conference (TMA). IEEE, Dublin, Ireland, 1–9. https://doi.org/10.23919/TMA.2017.8002902
BibTeX Citation
@inproceedings{Shah2017,
title = {Disco: Fast, good, and cheap outage detection},
author = {Anant Shah and Romain Fontugne and Emile Aben and Cristel Pelsser and Randy Bush},
year = 2017,
month = jun,
booktitle = {Network Traffic Measurement and Analysis Conference, {TMA} 2017},
publisher = {IEEE},
address = {Dublin, Ireland},
pages = {1--9},
doi = {10.23919/TMA.2017.8002902},
isbn = {978-1-5386-0405-2},
url = {http://icube-publis.unistra.fr/4-SFAP17},
abstract = {Outage detection has been studied from different angles, such as active probing, analysis of background radiation, or control plane information. We approach outage detection from a new perspective. Disco is a detection technique that uses existing long-running TCP connections to identify bursts of disconnections. The benefits are considerable: without adding a single packet to the traffic, we can monitor Internet-wide swaths of infrastructure that were not monitored previously because they are, for example, not responsive to ICMP probes or behind NATs. With Disco we analyze state changes on connections between RIPE Atlas probes and the RIPE Atlas infrastructure. This data, which is originally logged to monitor probe availability, has a small footprint and is available as a publicly accessible live stream, which makes lightweight near-real-time outage detection possible. Probes perform planned traceroute measurements regardless of their connectivity to the RIPE Atlas infrastructure. This gives us the no-cost advantage of viewing the outage from the inside out, as the probes experienced it, and of characterizing the outage after the fact. Thus, we present an outage detection system able to run in near real time (fast), with a precision of 95\% (good), and without generating any new measurement traffic (cheap). We studied historical probe disconnections from 2011 to 2016 and report on the 443 most prominent outages. To validate our results, we inspected traceroute results from affected probes and compared our detection to that of Trinocular.},
bibsource = {dblp computer science bibliography, https://dblp.org},
biburl = {https://dblp.org/rec/conf/tma/ShahFAPB17.bib},
eventdate = {21-23 June 2017},
eventtitleaddon = {Dublin, Ireland},
file = {:Shah2017 - Disco_ Fast, Good, and Cheap Outage Detection.pdf:PDF},
groups = {International Conferences},
keywords = {Probes, Monitoring, Internet, Real-time systems, Data models, Automata, Telescopes},
x-international-audience = {Yes},
x-language = {EN}
}
Related publications
Chocolatine: Outage Detection for Internet Background Radiation
Andreas Guillot, Romain Fontugne, Philipp Winter, et al.
2019 Network Traffic Measurement and Analysis Conference (TMA), 2019
Pinpointing Delay and Forwarding Anomalies Using Large-Scale Traceroute Measurements
Romain Fontugne, Emile Aben, Cristel Pelsser, et al.
CoRR, 2017
Evaluating the performance of NRENs in deploying IoT in Africa: the case for TTN
Marco Zennaro, Cristel Pelsser, Franck Albinet, et al.
2020 IEEE 17th Annual Consumer Communications & Networking Conference (CCNC), 2020
Measurement Vantage Point Selection Using A Similarity Metric
Thomas Holterbach, Emile Aben, Cristel Pelsser, et al.
Proceedings of the Applied Networking Research Workshop, 2017