FORS: Fault-adaptive Optimized Routing and Scheduling for DAQ Networks

Abstract

Data acquisition (DAQ) networks, widely used in scientific research and industrial applications, are composed of numerous interconnected servers exchanging substantial data volumes produced by large scientific instruments. One traffic matrix generally used in such networks is the all-to-all collective exchange, which demands substantial network resources, making network failures particularly challenging to mitigate. If not mitigated, the effects of network failures severely hamper the performance of the DAQ network, potentially leading to the loss of valuable experimental data. In the context of DAQ networks using a fat-tree topology, we propose FORS: a scheduling and associated routing solution to support the all-to-all collective exchange under network failures. FORS optimizes bandwidth utilization in the face of any failure scenario, ensuring more robust performance than existing approaches. We propose an algorithm to solve the scheduling problem. For the routing, we design an algorithm for simple failure scenarios, along with a linear programming model to address more complex failure scenarios. We validate our proposed solution using a real-world DAQ network as a case study. Results demonstrate significant performance degradation in existing approaches and FORS’ consistent ability to achieve higher throughput across various failure scenarios.

Publication
Computing and Software for Big Science
Cristel Pelsser
Critical embedded systems, Computer networking, Researcher, Professor

The focus of my research is on network operations, routing, Internet measurements, protocols and security.