Canonical minimal automata for generalised regular expressions
Given a generalised regular expression (GRE) R, this program computes the
canonical minimal (incomplete) deterministic finite automaton (DFA) which
recognises the language described by R and hence solves the
equivalence problem for GREs.
The uniqueness up to isomorphism of the minimal DFA for every
regular language is given by the Myhill–Nerode theorem. The
states of the canonical minimal DFA are usually expressed as equivalence
classes of strings under the Nerode congruence; here they are given
as natural numbers in [0, ..., m-1], indexed in order of the
earliest string in shortlex order to reach each state.
It follows from (Stockmeyer, 1974) that this program runs in non-ELEMENTARY
time in the worst case; i.e. there is no k such that this program
runs in O(2^2^2^...^n) time for a power tower of height k. More
specifically, the height of the power tower is given by the number of
nested complements permitted within the expression.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ GREs are given in postfix notation. Whitespace is ignored. ┃
┠─────────────────────────────┬──────────────────────────────┨
┃ $ nothing │ R+ one or more ┃
┃ % anything │ R! complement ┃
┃ ~ the empty string │ RS, concatenation ┃
┃ . any single character │ RS| union ┃
┃ a the letter ‘a’ │ RS& intersection ┃
┃ R* zero or more │ RS\ difference ┃
┃ R? zero or one │ RS^ symmetric difference ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
How it works
• Sub-expressions are successively computed via DFA operations and
minimised. At the end, canonicalisation is performed.
• For the base cases $, %, ~, . and a-to-z, DFAs are given
directly.
• For !, &, |, \ and ^, the product construction on DFAs is used.
Only reachable states are constructed, and the argument DFAs are
completed only when necessary (!, both arguments of | and ^, and the
second argument of \).
• For *, + and concat, epsilon-transitions are added and the powerset
construction is used on the resulting NFA. Again only reachable
states are constructed. (? is simple enough that it is just
constructed directly.)
• Minimisation is performed as follows:
1. Unreachable states are removed.
2. Dead states (states that cannot reach a final state) are removed,
except for the start state. These can be found by graph-searching
backwards from each final state.
3. States in the same Nerode congruence class are merged using
Hopcroft’s algorithm.
• Canonicalisation is performed a breadth-first search in alphabetic
order (which naturally visits each state in order of the earliest
string in shortlex order that reaches it). In particular there is
no need to solve the graph isomorphism problem.