Python etc / FSM

FSM

A regular language is a formal language that can be recognized by a finite-state machine (FSM). That means that reading text, character by character, you only need memory to remember current state, and the number of such states is finite.

The beautiful and simple example is a machine that checks whether an input is a simple number like -3, 2.2 or 001. The following diagram is an FSM diagram. Double circles mean accept states, they identify where the machine can stop.

FSM

The machine starts at (1), possibly matches minus sign, then processes as many digits as required. After that, it may match a dot (3->4) which must be followed by one digit (4->5), but maybe more.

The classic example of a non-regular language is a family of strings like:

a-b
aaa-bbb
aaaaa-bbbbb

Formally, we need a line that consists of N occurrences of a, then -, then N occurrences of b. N is any integer greater than zero. You can't do it with a finite machine, because you have to remember the number of a chars you encountered which leads you to the infinite number of states.

Regular expressions can match only regular languages. Remember to check whether the line you are trying to process can be handled by FSM at all. JSON, XML or even simple arithmetic expression with nested brackets cannot be.

Mind, however, that a lot of modern regular expression engines are not regular. For example, Python regex module supports recursion (which will help with that aaa-bbb problem).

Chomsky

Apart from regular languages, Chomsky distinguishes three more types (ordered by descending strictness): context-free, context-sensitive, and unrestricted.

Context-free languages are more powerful than regular ones but still can be efficiently parsed by a program. XML, JSON and SQL are context-free for example.

Many tools allow you to parse such languages easily. Usually, they require you to define some grammar, the rules on how to parse and create a parser automatically. The most popular way to define such grammar is the BNF language. Here is the grammar to parse simple arithmetical expressions (only + supported) defined in BNF:

 <expr> ::= <operand> "+" <expr>  |  <operand>
 <operand> ::= "(" <expr> ")"  |  <const>
 <const> ::= integer

This is the set of rules. An expression is an operand plus another expression or just operand. An operand is either a constant or an expression enclosed in brackets. This way we can see the recursive nature of this language, which makes it non-regular.

The example of a context-free grammar parser for Python is lark. It is what you want if regexes are not enough or code that does the parsing gets messy.

BNF of BNF

Since BNF is a context-free language itself, you can represent its syntax with a BNF :).

<syntax>         ::= <rule> | <rule> <syntax>
<rule>           ::= <opt-whitespace> "<" <rule-name>
                     ">" <opt-whitespace>
                     "::=" <opt-whitespace> <expression>
                     <line-end>
<opt-whitespace> ::= " " <opt-whitespace> | ""
<expression>     ::= <list> | <list> <opt-whitespace>
                     "|" <opt-whitespace> <expression>
<line-end>       ::= <opt-whitespace> <EOL> |
                     <line-end> <line-end>
<list>           ::= <term> |
                     <term> <opt-whitespace> <list>
<term>           ::= <literal> | "<" <rule-name> ">"
<literal>        ::= '"' <text1> '"' | "'" <text2> "'"
<text1>          ::= "" | <character1> <text1>
<text2>          ::= "" | <character2> <text2>
<character>      ::= <letter> | <digit> | <symbol>
<letter>         ::= "A" | "B" | "C" | "D" | "E" | "F" |
                     "G" | "H" | "I" | "J" | "K" | "L" |
                     "M" | "N" | "O" | "P" | "Q" | "R" |
                     "S" | "T" | "U" | "V" | "W" | "X" |
                     "Y" | "Z" | "a" | "b" | "c" | "d" |
                     "e" | "f" | "g" | "h" | "i" | "j" |
                     "k" | "l" | "m" | "n" | "o" | "p" |
                     "q" | "r" | "s" | "t" | "u" | "v" |
                     "w" | "x" | "y" | "z"
<digit>          ::= "0" | "1" | "2" | "3" | "4" | "5" |
                     "6" | "7" | "8" | "9"
symbol>         ::=  "|" | " " | "!" | "#" | "$" | "%" |
                     "&" | "(" | ")" | "*" | "+" | "," |
                     "-" | "." | "/" | ":" | ";" | ">" |
                     "=" | "<" | "?" | "@" | "[" | "\" |
                     "]" | "^" | "_" | "`" | "{" | "}" |
                     "~"
<character1>     ::= <character> | "'"
<character2>     ::= <character> | '"'
<rule-name>      ::= <letter> | <rule-name> <rule-char>
<rule-char>      ::= <letter> | <digit> | "-"