Some languages aren’t regular.

I’ve spent several posts (1 2 3 4) on regular languages and the machines that can recognize them. This post will quickly run through languages that aren’t regular, as well as peek at a few more complicated classes of languages.

Not Regular and the Pumping Lemma

Consider the set of strings that have any number of as followed by the same number of bs: {, ab, aabb, aaabbb, …}. This language is not regular: the only way to get “‍any number of‍” in a regular language is with a * in the regular expression or with a loop in the finite automaton. If we have two things that need to be in arbitrary numbers (here, the as and the bs) then they are independent of one another.

This idea is formalized in what is known as the pumping lemma. There are several pumping lemmas; that used most often for regular languages is this:

Every regular language L has some some fixed number p. Any string in L that is longer than p can be “‍pumped‍”: that is, there is some non-empty sequence of symbols in the first p symbols of L that can be repeated an arbitrary number of times to produce more strings in L.

Applying this lemma to the language above, it implies that if the language was regular we’d be able to “‍pump‍” strings in the language with more as than bs.

Context-Free and Context-Sensitive

Most computing theory courses and texts will follow up a discussion of regular languages with two more classes: context free, which will get a lot of attention, and context sensitive, which will be given a more summarial presentation.

Context-free languages are computed by something called a push-down automaton and represented by context-free grammars. They also have a pumping lemma that states you can find some chunk in the middle of long strings the front and back of which can be pumped together. Thus, any number of as followed by the same number of bs is context-free, but any number of as followed by the same number of bs followed by that same number of as again is not.

People like to say that programming languages are context-free; this is not entirely true. Most programming languages use a context-free parser as a first step, but then perform a second pass to apply some context-sensitive rules too.

Context-sensitive languages are a generalization of context-free languages that turns out to be neither widely useful nor efficiently computable. Indeed, the presence of context-sensitive languages in computation texts is largely a side-effect of the fact that a linguist (Noam Chomsky) created the main categories of languages (which also explains their name).

And Beyond…

There are many more kinds of simplified languages and machines, but I’ll jump next to the most general of all, the subject of the Church-Turing Thesis.