Talks and Poster Presentations (with Proceedings-Entry):
J. Träff, A. Rougier, S. Hunold:
"Implementing a Classic: Zero-copy All-to-all Communication with MPI Datatypes";
Talk: 28th ACM International Conference on Supercomputing, ICS 2014,
- 2014-06-13; in: "Proceedings of the 28th ACM International Conference on Supercomputing, ICS 2014",
M. Gerndt, P. Stenström, L. Rauchwerger, B. Miller, M. Schulz (ed.);
We investigate the use of the derived datatype mechanism of MPI (the Message-Passing Interface) in the implementation of the classic all-to-all communication algorithm of Bruck et al. (1997). Through a series of improvements to the canonical implementation of the algorithm, we gradually eliminate the initial and final processor-local data reorganizations, culminating in a zero-copy version that contains no explicit, process-local data movement or copy operations: all necessary data movements are implied by MPI derived datatypes and carried out as part of the communication operations. We furthermore show how the improved algorithm can be used to solve irregular all-to-all communication problems (that are not too irregular). The Bruck algorithm serves as a vehicle to demonstrate the descriptive and performance advantages of MPI datatypes in the implementation of complex algorithms, and to discuss shortcomings and inconveniences in the current MPI datatype mechanism. In particular, we use and implement three new derived datatypes (bounded vector, circular vector, and bucket) not found in MPI that might be useful in other contexts. We also discuss the role of persistent collectives, which are currently not part of MPI, for amortizing type-creation (and other) overheads, and implement a persistent variant of the MPI_Alltoall collective.
On two small systems we experimentally compare the algorithmic improvements to the Bruck et al. algorithm when implemented on top of MPI, showing the zero-copy version to perform significantly better than the initial, straightforward implementation. One of our variants has also been implemented inside mvapich, and we show it to perform better than the mvapich implementation of the Bruck et al. algorithm for the range of processes and problem sizes where it is enabled. The persistent version of MPI_Alltoall has no overhead and outperforms all other variants; in particular, it improves upon the standard implementation by 50% to 15% across the full range of problem sizes considered.
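The communication structure of the Bruck et al. algorithm that the abstract builds on can be sketched in plain Python. This is a single-process simulation for illustration only (the exchange rounds are modeled with array copies rather than actual MPI messages, and all function names are our own); it shows the three phases whose local data movements the paper's zero-copy version folds into MPI derived datatypes:

```python
def bruck_alltoall(blocks):
    """Simulate Bruck's all-to-all among p 'processes'.

    blocks[i][j] is the data block process i sends to process j.
    Returns recv, where recv[i][j] is the block process i received
    from process j. Uses ceil(log2 p) exchange rounds for any p.
    """
    p = len(blocks)
    # Phase 1 (local rotation): process i rotates its blocks up by i,
    # so the block bound for rank (i + j) mod p sits at position j.
    buf = [[blocks[i][(j + i) % p] for j in range(p)] for i in range(p)]
    # Phase 2 (exchange rounds): in round k, process i sends every block
    # whose position index has bit k set to rank (i + 2^k) mod p, and
    # receives the corresponding blocks from rank (i - 2^k) mod p.
    # A block at position j thus travels a total distance of j ranks.
    k = 0
    while (1 << k) < p:
        dist = 1 << k
        new = [row[:] for row in buf]
        for i in range(p):
            src = (i - dist) % p
            for j in range(p):
                if j & dist:
                    new[i][j] = buf[src][j]
        buf = new
        k += 1
    # Phase 3 (inverse local rotation): the block from rank (i - j) mod p
    # now sits at position j; undo the shift to restore rank order.
    return [[buf[i][(i - j) % p] for j in range(p)] for i in range(p)]
```

In a real MPI implementation, phases 1 and 3 are explicit local copies in the canonical version; the paper's contribution is to describe them with derived datatypes so that the communication calls perform the reorganizations implicitly.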
All-to-all collective communication; MPI; derived datatypes
"Official" electronic version of the publication (accessed through its Digital Object Identifier - DOI)
Created from the Publication Database of the Vienna University of Technology.