[Back]


Doctor's Theses (authored and supervised):

M. Djukanovic:
"Exact and Heuristic Approaches for Solving String Problems from Bioinformatics";
Supervisor, Reviewer: G. Raidl, C. Blum; Institute of Logic and Computation, 2021; oral examination: 2021-05-05.



English abstract:
This thesis provides several new algorithms for solving prominent string problems from
the literature, most of these being variants of the well-known longest common subsequence
(LCS) problem. Given a set of input strings, a longest common subsequence is a string
of maximum length that can be obtained from each input string by removing letters, i.e.,
which is a common subsequence of all input strings. This is a combinatorial optimization
problem, which can be solved efficiently for two strings but is N P-hard and challenging
to solve in practice for the general case of an arbitrary set of input strings. The LCS
problem and related string problems make relationships among strings explicit and
provide a range of important measures which serve for detecting similarities between
molecules of various structures, for example. Finding similarities between molecules plays
an important role in bioinformatics helping in understanding complex biological processes,
identifying important motifs and clusters of molecules that possess similar behaviors.
Other important applications lie, for example, in text processing. The Unix command
diff and the Git version-control system are some examples where finding common text
patterns quickly is important. String problems have attracted attention for more than
50 years due to their computational hardness, and various methods were derived and
successfully applied.
The current work primarily focuses on N P-hard variants of string problems, where
previous exact approaches are of limited practical value, and therefore many heuristic
methods had been suggested. On the one side, more effective exact approaches are
proposed here, but on the other side also more effective heuristic methods that scale to
large problem instances of practical interest. Moreover, we consider in particular also
anytime algorithms that are in principle exact but can be terminated early and then yield
promising approximate solutions.
Besides the basic LCS problem, we consider here the following important variants: the
longest common palindromic subsequence problem, the arc-preserving LCS problem, the
longest common square subsequence problem, the repetition-free LCS problem, and the
constrained LCS problem.
Concerning heuristic approaches, we propose a general beam search framework in which
also many previously described methods can be expressed. A rigorous experimental
evaluation and comparison were done. In particular, new state-of-the-art results were
obtained on various benchmark sets utilizing a novel heuristic guidance that approximates the expected solution length of three different string problems. This guidance can more
effectively lead the search towards promising search regions of the search space than
formerly used functions. For solving the longest common square subsequence problem, in
particular, we propose a hybrid of a Reduced Variable Neighborhood Search method and
a Beam search technique.
Concerning exact techniques to solve the string problems, two kinds of methods are
presented in this thesis: (i) pure exact methods based on A∗ search and (ii) anytime
algorithms that build upon the A∗ search framework. An effective A∗ search is obtained
primarily by utilizing an efficiently calculable and at the same time comparably tight upper
bound function for the length of real optimal common subsequence. Besides exhibiting
an excellent performance on the general LCS problem, experimental results indicate that
this A∗ search is also able to outperform all previously published more specific exact
algorithms for the special cases of the longest common palindromic subsequence problem
and the constrained longest common subsequence problem with two input strings.
Concerning anytime algorithms, we first make use of the already derived A∗ search
framework in the way that classical A∗ iterations are interleaved with beam search runs.
The iterations are performed on the same state graph to avoid unnecessary expansions
of redundant nodes in the search. This hybrid is able to meaningfully tackle large-sized
problem instances that cannot be solved exactly with the A∗ search and yields, after
early termination, also upper bounds in addition to the heuristic solutions. In order to
improve the convergence of the hybrid, we further suggest another anytime algorithm
variant in which the beam search part is replaced by a major iteration of the Anytime
Column Search. The effectiveness of the two hybrids is experimentally compared on the
basic LCS problem and the longest common palindromic subsequence problem. New
state-of-the-art results are produced and better final optimality gaps were obtained by
the latter hybrid, in comparison to a several state-of-the-art anytime algorithms from
the literature. Optimality gaps obtained in practice for the considered string problems
are, to the best of our knowledge, compared and reported for the first time in literature.
As an alternative exact approach, we further consider the transformation of LCS problem
instances into Maximum Clique (MC) problem instance on the basis of so-called conflict
graphs. In this way, state-of-the-art MC solving approaches can be utilized for solving
the LCS problem instances. One of the best exact and one of the best heuristic MC
solvers from the literature are used for this purpose. Due to the complexity of the
transformation, the size of the conflict graph grows quickly in the LCS problem instance
size. To address this issue, we present a conflict graph reduction technique based on
suboptimality checks by making use of the best available lower and upper bounds of the
problems. The high effectiveness of the reduction is demonstrated on the repetition-free
LCS problem and the longest arc-preserving subsequence problem. In conjunction with
the general-purpose mixed integer linear programming solver Cplex, new state-of-the-art
results are obtained on a wide range of benchmark instances. Some of the real-world
longest common arc-preserving problem instances were solved exactly for the first time
in the literature by applying this new technique.


"Official" electronic version of the publication (accessed through its Digital Object Identifier - DOI)
http://dx.doi.org/10.34726/hss.2021.93203

Electronic version of the publication:
https://publik.tuwien.ac.at/files/publik_301832.pdf


Created from the Publication Database of the Vienna University of Technology.