M. Djukanovic:

"Exact and Heuristic Approaches for Solving String Problems from Bioinformatics";

Supervisor, Reviewer: G. Raidl, C. Blum; Institute of Logic and Computation, 2021; oral examination: 2021-05-05.

This thesis provides several new algorithms for solving prominent string problems from

the literature, most of these being variants of the well-known longest common subsequence

(LCS) problem. Given a set of input strings, a longest common subsequence is a string

of maximum length that can be obtained from each input string by removing letters, i.e.,

which is a common subsequence of all input strings. This is a combinatorial optimization

problem, which can be solved efficiently for two strings but is N P-hard and challenging

to solve in practice for the general case of an arbitrary set of input strings. The LCS

problem and related string problems make relationships among strings explicit and

provide a range of important measures which serve for detecting similarities between

molecules of various structures, for example. Finding similarities between molecules plays

an important role in bioinformatics helping in understanding complex biological processes,

identifying important motifs and clusters of molecules that possess similar behaviors.

Other important applications lie, for example, in text processing. The Unix command

diff and the Git version-control system are some examples where finding common text

patterns quickly is important. String problems have attracted attention for more than

50 years due to their computational hardness, and various methods were derived and

successfully applied.

The current work primarily focuses on N P-hard variants of string problems, where

previous exact approaches are of limited practical value, and therefore many heuristic

methods had been suggested. On the one side, more effective exact approaches are

proposed here, but on the other side also more effective heuristic methods that scale to

large problem instances of practical interest. Moreover, we consider in particular also

anytime algorithms that are in principle exact but can be terminated early and then yield

promising approximate solutions.

Besides the basic LCS problem, we consider here the following important variants: the

longest common palindromic subsequence problem, the arc-preserving LCS problem, the

longest common square subsequence problem, the repetition-free LCS problem, and the

constrained LCS problem.

Concerning heuristic approaches, we propose a general beam search framework in which

also many previously described methods can be expressed. A rigorous experimental

evaluation and comparison were done. In particular, new state-of-the-art results were

obtained on various benchmark sets utilizing a novel heuristic guidance that approximates the expected solution length of three different string problems. This guidance can more

effectively lead the search towards promising search regions of the search space than

formerly used functions. For solving the longest common square subsequence problem, in

particular, we propose a hybrid of a Reduced Variable Neighborhood Search method and

a Beam search technique.

Concerning exact techniques to solve the string problems, two kinds of methods are

presented in this thesis: (i) pure exact methods based on A∗ search and (ii) anytime

algorithms that build upon the A∗ search framework. An effective A∗ search is obtained

primarily by utilizing an efficiently calculable and at the same time comparably tight upper

bound function for the length of real optimal common subsequence. Besides exhibiting

an excellent performance on the general LCS problem, experimental results indicate that

this A∗ search is also able to outperform all previously published more specific exact

algorithms for the special cases of the longest common palindromic subsequence problem

and the constrained longest common subsequence problem with two input strings.

Concerning anytime algorithms, we first make use of the already derived A∗ search

framework in the way that classical A∗ iterations are interleaved with beam search runs.

The iterations are performed on the same state graph to avoid unnecessary expansions

of redundant nodes in the search. This hybrid is able to meaningfully tackle large-sized

problem instances that cannot be solved exactly with the A∗ search and yields, after

early termination, also upper bounds in addition to the heuristic solutions. In order to

improve the convergence of the hybrid, we further suggest another anytime algorithm

variant in which the beam search part is replaced by a major iteration of the Anytime

Column Search. The effectiveness of the two hybrids is experimentally compared on the

basic LCS problem and the longest common palindromic subsequence problem. New

state-of-the-art results are produced and better final optimality gaps were obtained by

the latter hybrid, in comparison to a several state-of-the-art anytime algorithms from

the literature. Optimality gaps obtained in practice for the considered string problems

are, to the best of our knowledge, compared and reported for the first time in literature.

As an alternative exact approach, we further consider the transformation of LCS problem

instances into Maximum Clique (MC) problem instance on the basis of so-called conflict

graphs. In this way, state-of-the-art MC solving approaches can be utilized for solving

the LCS problem instances. One of the best exact and one of the best heuristic MC

solvers from the literature are used for this purpose. Due to the complexity of the

transformation, the size of the conflict graph grows quickly in the LCS problem instance

size. To address this issue, we present a conflict graph reduction technique based on

suboptimality checks by making use of the best available lower and upper bounds of the

problems. The high effectiveness of the reduction is demonstrated on the repetition-free

LCS problem and the longest arc-preserving subsequence problem. In conjunction with

the general-purpose mixed integer linear programming solver Cplex, new state-of-the-art

results are obtained on a wide range of benchmark instances. Some of the real-world

longest common arc-preserving problem instances were solved exactly for the first time

in the literature by applying this new technique.

http://dx.doi.org/10.34726/hss.2021.93203

https://publik.tuwien.ac.at/files/publik_301832.pdf

Created from the Publication Database of the Vienna University of Technology.