Canonical automata for generalised regular expressions

Canonical minimal automata for generalised regular expressions

Given a generalised regular expression (GRE) R, this program computes the
canonical minimal (incomplete) deterministic finite automaton (DFA) which
recognises the language described by R and hence solves the
equivalence problem for GREs.

The uniqueness up to isomorphism of the minimal DFA for every
regular language is given by the Myhill–Nerode theorem. The
states of the canonical minimal DFA are usually expressed as equivalence
classes of strings under the Nerode congruence; here they are given
as natural numbers in [0, ..., m-1], indexed in order of the
earliest string in shortlex order to reach each state.

It follows from (Stockmeyer, 1974) that this program runs in non-ELEMENTARY
time in the worst case; i.e. there is no k such that this program
runs in O(2^2^2^...^n) time for a power tower of height k. More
specifically, the height of the power tower is given by the number of
nested complements permitted within the expression.

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ GREs are given in postfix notation. Whitespace is ignored. ┃
┠─────────────────────────────┬──────────────────────────────┨
┃ $    nothing                │ R+   one or more             ┃
┃ %    anything               │ R!   complement              ┃
┃ ~    the empty string       │ RS,  concatenation           ┃
┃ .    any single character   │ RS|  union                   ┃
┃ a    the letter ‘a’         │ RS&  intersection            ┃
┃ R*   zero or more           │ RS\  difference              ┃
┃ R?   zero or one            │ RS^  symmetric difference    ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛


  
    
    
    
  
  
    
  
  
    
  
  
    
  
  
    
  
  
    
  







How it works

• Sub-expressions are successively computed via DFA operations and
  minimised. At the end, canonicalisation is performed.
• For the base cases $, %, ~, . and a-to-z, DFAs are given
  directly.
• For !, &, |, \ and ^, the product construction on DFAs is used.
  Only reachable states are constructed, and the argument DFAs are
  completed only when necessary (!, both arguments of | and ^, and the
  second argument of \).
• For *, + and concat, epsilon-transitions are added and the powerset
  construction is used on the resulting NFA. Again only reachable
  states are constructed. (? is simple enough that it is just
  constructed directly.)
• Minimisation is performed as follows:
  1. Unreachable states are removed.
  2. Dead states (states that cannot reach a final state) are removed,
     except for the start state. These can be found by graph-searching
     backwards from each final state.
  3. States in the same Nerode congruence class are merged using
     Hopcroft’s algorithm.
• Canonicalisation is achieved by performing a breadth-first search in alphabetic
  order, which naturally visits each state in order of the earliest
  string in shortlex order that reaches it. In particular there is
  no need to solve the graph isomorphism problem.