Publication Entry

[Back]

Diploma and Master Theses (authored and supervised):

A. Heinisch:
"Selection and Hardware-Implementation of an Efficient Consensus Algorithm for a Mesochronous System";
Supervisor: T. Polzer, A. Steininger; Technische Informatik, 2015; final examination: 2015-01-15.

English abstract:

This thesis explores the possibility of using a consensus algorithm to replace the traditional,
triple modular redundancy scheme for fault tolerance. We start with the presentation of state of
the art fault tolerance mechanisms and discuss the advantages and drawbacks of TMR systems.
To be able to compare these mechanisms to our approach, we outline the basics of distributed
algorithms and discuss different consensus algorithms found in literature. Based on this theoretical
background a reduction of TMR and replica determinism to consensus has been created.
A theoretical evaluation of the well known Exponential Information Gathering protocol, the
Phase King protocol and the Phase Queen protocol as well as modifications of these protocols to
implement them directly in hardware is given. The theoretical results are analyzed in detail and
augmented by results from fault injection experiments performed using a software simulator. As
the simulator was specifically tailored to incorporate the properties of the target platform, we
were able to evaluate their fitness for direct hardware implementation solely based on the simulation
results. As the fault injection experiments were designed to violate the fault hypothesis in
some cases, the degradation properties of the algorithms could also be analyzed.
The Phase King and the Exponential Information Gathering protocol were implemented on
a Field Programmable Gate Array (FPGA) network. To circumvent the single shared clock tree
in synchronous systems which introduces a single point of failure we decided to implement the
system based on a mesochronous clocking system, namely the Distributed Algorithms for Robust
Tick-Synchronization (DARTS) protocol and a fully parametrizable communication buffer
supporting metastability free communication in this setting. The implementation of these two illustrate
that Byzantine fault tolerance using distributed algorithms is achievable in VLSI circuits.

German abstract:

Diese Arbeit analysiert die Möglichkeit die traditionellen, dreifach redundanten Fehlertoleranzmechanismen
durch Konsensus-Implementierungen zu ersetzen. Nach einem Überblick über
gängige Fehlertoleranzansätze und einer Erörterung der Probleme bei der Verwendung von Triple
Modular Redundancy und Replikadeterminismus werden das Consensus Problem und die
zugrundeliegenden Modelle erörtert. Anhand der theoretischen Modelle werden verwandte Probleme
wie Byzantine Agreement und Reliable Broadcast vorgestellt und eine Reduktion von
Agreement mit TMR und Replikadeterminismus zu Consensus präsentiert.
Eine theoretische Analyse bekannter Konsensusprotokolle, nämlich des Exponential Information
Gathering, des Phase King und des Phase Queen Algorithmus, sowie deren für hardware
Implementierung modifizierte Versionen, wird durchgeführt. Die theoretischen Eigenschaften
der Protokolle werden im Detail analysiert und mit zusätzlichen Daten aus Fehlerinjektions-
Simulationen erweitert. Diese Zusatzinformationen ermöglichen Einblicke in das Verhalten der
Protokolle, wenn diese außerhalb der festgelegten Spezifikation ausgeführt werden.
Das Phase King Protokoll und das Exponential Information Gathering Protokoll werden auf
einem Field Programmable Gate Array (FPGA) Netzwerk implementiert. Da ein einzelner Clock
Tree, wie er in synchronen Systemen gängig ist, ein gewaltiges Fehlerpotenzial aufweist, wurde
zugunsten einer Implementierung mittels verteiltem System entschieden. Dies wurde durch Zuhilfenahme
des Distributed Algorithms for Robust Tick-Synchronization (DARTS) Protokolls
und eines parametrisierbaren Kommunikatonsbuffers, der metastabilitätsfreie Kommunikation
in dieser Konfiguration erlaubt, erreicht. Die Implementierung dieser zwei Protokolle zeigt, dass
verteilte Algorithmen dafür geeignet sind, ein byzantinisch fehlertolerantes Hardwaresystem zu
entwickeln.