Doctor's Theses (authored and supervised):
"Data and Container Placement in Scalable Data Analytics Platforms";
Supervisor, Reviewer: O. Kao, I. Brandic;
TU Berlin, Department of Telecommunication Systems,
oral examination: 2018-10-30.
Distributed dataflow systems process large volumes of data in parallel on multiple machines. In production, multiple dataflow applications are scheduled for execution in virtual containers on a per-job basis. Furthermore, they access datasets partitioned into datablocks across the cluster machines' disks. Runtime performance is important for many of these jobs, as their users expect fast results. However, optimizing performance is difficult, because dataflow jobs are very diverse and used in a wide variety of domains such as relational processing, machine learning, and graph processing. Container and datablock placement decisions impact a job's runtime performance significantly. Furthermore, changing placements affects runtime performance without modifying the application's code, and thus can be applied to many jobs without much configuration effort from the user's side. However, jobs benefit differently from placement decisions, because their resource demands differ from job to job. Hence, there is no single placement strategy that is optimal for all possible jobs. Besides that, users require secure long-term data retention for their documents and datasets. This thesis presents container and datablock placement strategies to optimize the runtime performance of distributed dataflow applications running on shared data analytics platforms. It contributes two placement methods for this. The first method improves the efficiency of a job's dataflow operations and the degree of data locality by colocating its input datablocks and containers on a selected set of nodes.
The second method places a job's containers based on the network distances between the containers and the job's input datablocks, as well as on container interference. In addition, this thesis explores the problem of data retention in shared data analytics platforms. To this end, it contributes a method of storing and accessing lineage metadata through smart contracts executed on a decentralized blockchain network.
The methods presented in this thesis have been implemented in a research prototype that has been integrated with Hadoop and Ethereum. For evaluation, we used a 64-node commodity cluster and workloads consisting of applications implemented in Flink from the domains of relational processing, machine learning, and graph processing. We compared the runtime performance of workloads scheduled with our methods against that of workloads scheduled with Hadoop's default placement method. For our blockchain-based data retention method, we measured the overhead in terms of additional response time and reported the costs of using it on Ethereum's blockchain network.
Created from the Publication Database of the Vienna University of Technology.