A notation for describing the languages of Finite Automata.
Last week I discussed state machines; in particular I presented Deterministic Finite Automata, the simplest of the machines recognizing languages that form the subject matter of computability theory. In this post I discuss a common notation for describing the languages that these automata recognize, though I won’t show that that is what I am doing until a later post.
Recall that an alphabet
is a set of symbols, like {a
, b
, #
};
a string is an ordered list of symbols from that alphabet
like aab#ba##
,
and a language is a set of those strings,
like {baa
, aba
, ##a#b
}.
Our goal today is to come up with a succinct notation
for describing languages that might contain infinitely many strings,
such as “strings with a b
after every a
”
or “strings with more #
s than b
s.”
We won’t succeed in creating a notation for “more #
s than b
s”
in this post…
We’ll call representations of this notation expressions
and display them thus
.
Let’s start simple.
Let’s let a sequence of symbols from our alphabet represent that sequence of symbols:
a
describes the language {a
},
baa
describes the language {baa
}, etc.
We don’t have any way to describe languages with more than a single string in them,
but we have to start somewhere.
Let’s add a symbol not in our language—say, ?
—that means something else.
Let’s have it mean “You can include the previous symbol or not, both are ok.”
Thus
ab?
describes the language {ab
, a
},
b?a?a?
describes the language {baa
, ,
ba
, a
, b
, aa
}, etc.
Our expressions are more succinct now, but still can’t express {abb
, a
}
Try it; you’ll end up with {abb
, ab
, a
} or {abb
, ab
}.
.
So let’s add more symbols, (
and )
,
and let them group things together.
a(bb)
describes the language {a
},
abb?
describes the language {ab
, abb
},
a(bb)?
describes the language {a
, abb
},
etc.
This adds to what we can express, but we’re still missing things like {a
, b
, #
}.
Let’s add another symbols, |
.
This will mean “either the thing on the left, or the thing on the right, but not both”.
aba|bb
describes the language {aba
, bb
},
a(b|a)
describes the language {ab
, aa
},
a(b|a)?
describes the language {a
, ab
, aa
},
etc.
Now we can express any finite language,
but we still don’t have infinite languages like {a
, aa
, aaa
, …}.
There’s one more symbol we need to make our expressions “regular”: *
.
A *
is very similar to a ?
,
but where a?
means zero or one a
s,
a*
means any number of a
s, zero or more.
aa*
describes the language {a
, aa
, aaa
, …},
(b|a)*
describes the language consisting of any number of a
s and b
s in any order,
etc.
Some of these languages are hard to describe any other way:
for example, a(bab|aaa?)*b
describes the set of strings where each string starts with an a
, ends with a b
,
and in between has any number of bab
, aa
, and aaa
in any order.
There are a variety of shorthands for regular expressions:
for example, many use the symbol +
to mean “one or more” instead of *
’s “zero or more”;
but all of these are just for convenience, not expressive power.
Regular expressions exist “in the wild”:
in most word processors and text editors, for example,
when you search you can choose to search for regular expressions
instead of just simple strings.
But if I were going for that I would have taught you about
some of the more convenient extras they use
such as .
and [
and ]
.
You are welcome to read up on practical regular expressions elsewhere:
there are plenty of good tutorials accessible through any search engine.
This is a post about theory. One of the basic theoretic questions we will ask is given two ways of representing a language (in our case regular expressions and DFAs) can we tell if the two are inter-changeable? Another, perhaps more interesting, question is are there languages that cannot be expressed using these techniques? These and others will be investigated in future posts.
Looking for comments…