T. Rausch, W. Hummer, V. Muthusamy:
"An Experimentation and Analytics Framework for Large-Scale AI Operations Platforms";
Talk: 2020 USENIX Conference on Operational Machine Learning (OpML 2020) - Online Conference, Berkeley, CA, USA; 2020-07-28 - 2020-08-07; in: "Proceedings of the 2020 USENIX Conference on Operational Machine Learning (OpML 2020)", USENIX Association, (2020), ISBN: 978-1-939133-15-1; 17 - 19.

English abstract:
This paper presents a trace-driven experimentation and analytics framework that allows researchers and engineers to devise and evaluate operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive system and simulation model. Synthetic traces are made available for ad-hoc exploration as well as statistical analysis of experiments to test and examine pipeline scheduling, cluster resource allocation, and similar operational mechanisms.

