Some languages aren’t regular.
I’ve spent several posts (1 2 3 4) on regular languages and the machines that can recognize them. This post will quickly run through languages that aren’t regular, as well as peek at a few more complicated classes of languages.
Consider the set of strings
that have any number of a
s followed by the same number of b
s:
{,
ab
, aabb
, aaabbb
, …}.
This language is not regular:
the only way to get “any number of” in a regular language
is with a *
in the regular expression
or with a loop in the finite automaton.
If we have two things that need to be in arbitrary numbers
(here, the a
s and the b
s)
then they are independent of one another.
This idea is formalized in what is known as the pumping lemma. There are several pumping lemmas; that used most often for regular languages is this:
Every regular language L has some some fixed number p. Any string in L that is longer than p can be “pumped”: that is, there is some non-empty sequence of symbols in the first p symbols of L that can be repeated an arbitrary number of times to produce more strings in L.
Applying this lemma to the language above,
it implies that if the language was regular we’d be able to “pump”
strings in the language with more a
s than b
s.
Most computing theory courses and texts will follow up a discussion of regular languages with two more classes: context free, which will get a lot of attention, and context sensitive, which will be given a more summarial presentation.
Context-free languages are computed by something called a push-down automaton
and represented by context-free grammars.
They also have a pumping lemma
that states you can find some chunk in the middle of long strings
the front and back of which can be pumped together.
Thus, any number of a
s followed by the same number of b
s is context-free,
but any number of a
s followed by the same number of b
s followed by that same number of a
s again is not.
People like to say that programming languages are context-free; this is not entirely true. Most programming languages use a context-free parser as a first step, but then perform a second pass to apply some context-sensitive rules too.
Context-sensitive languages are a generalization of context-free languages that turns out to be neither widely useful nor efficiently computable. Indeed, the presence of context-sensitive languages in computation texts is largely a side-effect of the fact that a linguist (Noam Chomsky) created the main categories of languages (which also explains their name).
There are many more kinds of simplified languages and machines, but I’ll jump next to the most general of all, the subject of the Church-Turing Thesis.
Looking for comments…