FORS: Fault-adaptive Optimized Routing and Scheduling for DAQ Networks

Eloise Stein, Quentin Bramas, Flavio Pisani, Tommaso Colombo, and Cristel Pelsser

This 2025 international journal article, by Eloise Stein and 4 coauthors, was published in Computing and Software for Big Science. Topics covered include all-to-all, fat-tree networks, integer linear programming, optimal routing, fault-tolerance, and data acquisition.

Abstract

Data acquisition (DAQ) networks, widely used in scientific research and industrial applications, are composed of numerous interconnected servers, exchanging substantial data volumes produced by large scientific instruments. One traffic matrix generally used in such networks is the all-to-all collective exchange, which demands substantial network resources, making network failures particularly challenging to mitigate. If not mitigated, the effects of network failures severely hamper the performance of the DAQ network, potentially leading to the loss of valuable experimental data. In the context of DAQ networks using a fat-tree topology, we propose FORS: a scheduling and associated routing solution to support the all-to-all collective exchange under network failures. FORS optimizes bandwidth utilization in the face of any failure scenarios, ensuring robust performance compared to the existing approaches. We propose an algorithm to solve the scheduling. For the routing, we design an algorithm for simple failure scenarios, along with a linear programming model to address more complex failure scenarios. We validate our proposed solution using a real-world DAQ network as a case study. Results demonstrate significant performance degradation in existing approaches and FORS’ consistent ability to achieve higher throughput across various failure scenarios.
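
To illustrate the all-to-all collective exchange mentioned in the abstract, the following is a minimal sketch of the classic linear-shift schedule, in which every server sends to a different destination in each phase. It is a generic textbook schedule shown for orientation only, and is not the FORS scheduling algorithm, which the abstract does not detail. The function name `linear_shift_schedule` and the server count `n` are assumptions made for this example.

```python
def linear_shift_schedule(n):
    """Build a linear-shift all-to-all schedule for n servers.

    Hypothetical helper for illustration; not taken from the paper.
    Returns n - 1 phases. In phase p (1 <= p < n), server i sends to
    server (i + p) % n, so each phase is a permutation: every server
    sends once and receives once.
    """
    return [
        [(src, (src + phase) % n) for src in range(n)]
        for phase in range(1, n)
    ]


# Example: print the phases for a small network of 4 servers.
schedule = linear_shift_schedule(4)
for phase, pairs in enumerate(schedule, start=1):
    print(f"phase {phase}: {pairs}")
```

Because each phase pairs every sender with a distinct receiver, no server receives from two sources at once in a fault-free network. A link failure breaks this balance, which is the situation the abstract says FORS is designed to handle.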

Publication Details

Publication Type
Journal Article
Publication Date
April 2025
Published In
Computing and Software for Big Science

Suggested citation

Eloise Stein, Quentin Bramas, Flavio Pisani, Tommaso Colombo, and Cristel Pelsser. 2025. FORS: Fault-adaptive Optimized Routing and Scheduling for DAQ Networks. Computing and Software for Big Science (Apr. 2025).

BibTeX Citation

@article{Stein2025,
	title        = {FORS: Fault-adaptive Optimized Routing and Scheduling for DAQ Networks},
	author       = {Stein, Eloise and Bramas, Quentin and Pisani, Flavio and Colombo, Tommaso and Pelsser, Cristel},
	year         = 2025,
	month        = apr,
	journal      = {Computing and Software for Big Science},
	url          = {https://link.springer.com/journal/41781},
	abstract     = {Data acquisition (DAQ) networks, widely used in scientific research and industrial applications, are composed of numerous interconnected servers, exchanging substantial data volumes produced by large scientific instruments. One traffic matrix generally used in such networks is the all-to-all collective exchange, which demands substantial network resources, making network failures particularly challenging to mitigate. If not mitigated, the effects of network failures severely hamper the performance of the DAQ network, potentially leading to the loss of valuable experimental data. In the context of DAQ networks using a fat-tree topology, we propose FORS: a scheduling and associated routing solution to support the all-to-all collective exchange under network failures. FORS optimizes bandwidth utilization in the face of any failure scenarios, ensuring robust performance compared to the existing approaches. We propose an algorithm to solve the scheduling. For the routing, we design an algorithm for simple failure scenarios, along with a linear programming model to address more complex failure scenarios. We validate our proposed solution using a real-world DAQ network as a case study. Results demonstrate significant performance degradation in existing approaches and FORS’ consistent ability to achieve higher throughput across various failure scenarios.},
	groups       = {International Journals and Magazines},
	keywords     = {all-to-all, fat-tree networks, integer linear programming, optimal routing, fault-tolerance, data acquisition}
}

Related publications