Diploma and Master Theses (authored and supervised):
"Adaptability in Distributed Stream Processing";
Supervisor: S. Dustdar, B. Satzger;
Institut für Informationssysteme, AB Verteilte Systeme,
The processing of large data sets is normally done using batch-oriented approaches. However, when confronted with the processing of live data streams and events, which are usually subject to high fluctuations in terms of data arrival, these batch-oriented approaches are not applicable.
For addressing these kinds of processing requirements, various stream processing engines have been developed. One of these engines is ESC , a cloud based stream processing engine designed for computations with real time demands written in Erlang . In order to cope withincreasing or decreasing computational needs, ESC is able to automatically scale by attaching or releasing nodes using information about the workload of the underlying machines. But, for
preventing an overload of a node due to bursty data arrival or intensive computation, this approach alone is not sufficient. An adaptation technique is needed that, on the one hand, is able to
identify unbalanced work distributions and take counteracting measures, and, on the other hand, is utilizing strategies of placing operators intelligently onto nodes, such that the chance of an
overload situation becomes less likely.
The problem of mapping operators to nodes can be divided into two separate problems.First of all, the question on where to place operators initially is to be answered. Therefore, three different approaches have been implemented and analyzed. These include the random mapping of operators to nodes, the creation of operators on the currently least loaded node, as well as the mapping of an operator to a node, according to its unique name and considering the logic of the currently executed scenario.
Secondly, a strategy is needed, which is deciding when and where to move operators between nodes, in order to establish a balanced load distribution. Hence, two different load-balancing approaches
have been implemented and analyzed. The first strategy balances the load by moving random workers from nodes with a high load to nodes with a lower load. In addition to that, the second strategy, which constitutes the main focus of this thesis, is derived from an existing
solution to a problem of mapping tasks to processor nodes at run-time, which is called Particles Approach . Therefore, a porting of the existing algorithm has been performed into the environment of distributed stream processing systems, together with an analysis of its effectiveness.
In order to compare the developed concepts with each other, a benchmarking application is required. Currently, the only available benchmark for stream processing systems is Linear Road , which simulates vehicles on expressways in a large metropolitan area. The developed methods are therefore checked using the Linear Road benchmark. Observed performance gains with different approaches are compared with each other and with the results of other stream
Created from the Publication Database of the Vienna University of Technology.