Publication Entry

[Back]

Diploma and Master Theses (authored and supervised):

C. Berger:
"Solving a Generalized Constrained Longest Common Subsequence Problem";
Supervisor: G. Raidl, M. Djukanovic; Institute of Logic and Computation, 2020; final examination: 2020-06.

English abstract:

In this thesis we are studying the constrained longest common subsequence (CLCS)
problem that has been introduced as a specific measure of similarity of biological sequences.
It extends the well-studied problem of finding a longest common subsequence (LCS) of a
given set of strings by an additional pattern string that is required to be a subsequence of
the LCS. There are several algorithms introduced in the literature dealing with the CLCS
problem with exactly two input strings (2-CLCS), but the general m-CLCS problem
with an arbitrary set of strings has not yet been approached except by one approximation
algorithm. The m-CLCS problem may find its application in biology for discovering
molecular clusters composed of molecules that all share a common structural pattern.
In this work we propose several new approaches to effectively solve the m-CLCS problem.
We present a heuristic beam search in shape of a general search framework as well as
an exact A search algorithm. Moreover, a greedy heuristic to find CLCS solutions of
reasonable quality within short time is proposed.
Our experimental evaluation has proven that our A search guided by a tight upper
bound calculation is significantly faster than current state-of-the-art algorithms in finding
proven optimal solutions on various 2-CLCS problem instances. Moreover, for the general
m-CLCS problem, our A approach was able to solve small to medium instances to
proven optimality within the allowed time and memory limit. For those instances where
A cannot prove optimality, we propose a heuristic beam search. Two beam search
configurations, one guided by a probability based heuristic and another one guided by
an expected-length calculation heuristic, specially adapted for the m-CLCS problem,
have been shown as particularly efficient. They deliver solutions that almost all reach
the quality of the optimal solutions proven by A search within significantly less time.
Moreover, the proposed search framework provides a solid basis for extensions towards
more general variants of the CLCS problem like the (k,m)-CLCS problem, where instead
of one, we are given an arbitrary number of k 2 N pattern strings constraining the LCS.

German abstract:

Diese Arbeit beschäftigt sich mit dem Constrained Longest Common Subsequence (CLCS)
Problem, welches eingeführt wurde, um die Ähnlichkeit verschiedener biologischer Sequenzen
zu messen. Dabei wird für eine gegebene Menge von beliebigen Strings die
längste gemeinsame Teilfolge an Zeichen gesucht, welche selbst wiederum einen bestimmten
gegebenen String als Teilfolge enthält. Es handelt sich um eine Variante des gut
untersuchten Longest Common Subsequence (LCS) Problems. Verschiedene Algorithmen
sind bekannt um das CLCS Problem für genau zwei Strings (2-CLCS) zu lösen, das
allgemeine m-CLCS Problem mit einer beliebigen Menge von Strings wurde bisher jedoch
nur näherungsweise mit einem Approximationsverfahren gelöst. Das m-CLCS Problem
kann in der Biologie der Identifikation von Molekülgruppen dienen, deren Moleküle eine
Gemeinsamkeit in Form einer bestimmten vorhandenen Teilstruktur aufweisen.
In dieser Arbeit werden mehrere neue Methoden vorgestellt um das m-CLCS Problem
effektiv zu lösen. Wir präsentieren eine heuristische Beam Search in der Form eines
generellen Suchframeworks sowie einen exakten A -Algorithmus. Außerdem wird eine
Greedy Heuristic vorgestellt, die es ermöglicht, Lösungen von akzeptabler Qualität in
kurzer Zeit zu finden.
Die Ergebnisse unserer Tests zeigen, dass unsere A -Suche, geführt von bekannten Upper
Bounds des LCS Problems, signifikant schneller im Lösen von 2-CLCS Instanzen als
bisherige Algorithmen ist. Beim m-CLCS Problem mit mehreren Strings konnten von
der A -Suche kleine bis mittelgroße Instanzen gelöst werden. Für jene Instanzen, die
von A nicht gelöst werden können, schlagen wir den Einsatz von Beam Search vor. Zur
Führung der Beam Search haben sich eine auf Wahrscheinlichkeitstheorie basierenden
Heuristik, sowie eine Heuristik zur Berechnung der erwarteten Länge als besonders effektiv
erwiesen. Diese Beam Search Konfigurationen konnten bei fast allen von der A -Suche
exakt gelösten Instanzen ebenfalls optimale Lösungen finden und waren dabei signifikant
schneller als die A -Suche. Das vorgestellte Suchframework kann fernerhin auf einfache
Art für weitere Varianten des CLCS Problems adaptiert werden, beispielsweise für das
(k,m)-CLCS Problem, bei dem eine beliebige Anzahl k 2 N an Strings als notwendige
Teilfolge der Lösung spezifiziert wird

Electronic version of the publication:

https://publik.tuwien.ac.at/files/publik_294890.pdf

Created from the Publication Database of the Vienna University of Technology.