26 February 2025
14:00 Doctoral defense Room 85 of IC 2
Topic
Small E-NFA for Regular Expression Constrained Sequence Alignment
Student
Lise Rommel Romero Navarrete
Advisor / Teacher
Guilherme Pimentel Telles
Brief summary
Sequence alignment is a fundamental problem in molecular biology, used to assess structural and functional similarities between biological sequences. An interesting version of this problem is regular expression constrained sequence alignment (RECSA), where a subalignment must match a regular expression. Regular expressions can be used to represent biological patterns, including those defined by PROSITE. To solve the RECSA problem, automata are used as sequence recognizers. This work addresses the RECSA problem focusing on improving the construction of automata and their integration with constrained sequence alignment. It is organized in three parts. In the first part, we propose, analyze, and implement a simplified version of Thompson's construction to solve the RECSA problem, introducing active state lists for PROSITE patterns as a constraint. We present the results of comparative experiments between our approach and others in the literature. In the second part, we explore Thompson's construction focusing on the conditions under which an automaton of linear size with respect to the size of the regular expression can be constructed. We introduce the concept of a compound regular expression and the construction of an automaton from a regular expression that isolates subexpressions that lead to large automata, allowing, in certain cases, the construction of an E-free NFA of linear size. These cases include PROSITE patterns and other regular expressions that have epsilon symbols and Kleene closures, which typically result in large automata. In the third part we explore the Glushkov automaton based on the sets first, last and follow, and introduce the follow matrix as a new formalism for representing regular expressions. We show properties of follow matrices, including their intrinsic ability to eliminate redundant and trivial components from the regular expression representation, thus providing a more efficient and compact representation of regular languages. We explore the use of the follow matrix to simplify and reduce the size of an E-free NFA, presenting ideas for an algorithm for constructing an E-free NFA based on the follow matrix.
Examination Board
Headlines:
Guilherme Pimentel Telles IC / UNICAMP
Maria Emília Machado Telles Walter IE/UnB
Said Sadique Adi FACOM / UFMS
João Meidanis IC / UNICAMP
Lehilton Lelis Chaves Pedrosa IC / UNICAMP
Substitutes:
Zanoni Dias IC / UNICAMP
Marcelo Keese Albertini FACOM / UFU
Nalvo Franco de Almeida Junior FACOM / UFMS