Or, how we encode arbitrary meaning in a computer.
The World Wide Web and The Internet have become so successful that it is easy to forget they are not inevitable. The Internet is just one of many possible internets, or ways of connecting computers that do not trust one another. The World Wide Web is just one of many applications of an internet, one that focuses on human-targeted content and hyperlinks.
From an early stage of the Web, it was clear that human-targeted content was insufficient to provide humans with the content they wanted. The problem was not with the content itself: it was with our ability to find the content. Several companies have made billions of dollars out of finding ways to reduce the magnitude of this problem, but it still remains in searches that they cannot handle. Suppose, for example, I want to find pages that use the word “set” in its relatively rare meaning “be seated”: many of these documents exist, but no current search engine can find them. One way of viewing this problem is that both websites and searches are expressions of concepts, but search engines only handle the words in the expressions, not the concepts themselves.
The obvious fix for this is to allow Web documents to include a computer-understandable representation of meaning. This concept has gone by many names, of which “semantic web” is one of the more common. Its development has faced various challenges, most of them social rather than technical. Two of the largest are the difficulty of convincing people to add semantic markup to their work and the potential for pages to lie in their markup, claiming to discuss a more popular topic than their human-readable content actually covers in an effort to direct traffic to their site.
But along the way, various technologies were developed that provide flexible and expressive representation of meaning. A few principles emerged from this:
First, many useful concepts are readily expressed in a hierarchy. An author is a type of creator, and so is an editor, and so is an illustrator. A primary author is a type of author, and a sole author is a type of primary author. If you and I are both aware of this hierarchy and you tell me someone is the sole author of a document, I can still operate on what you told me even if I have no special processing for sole authors and instead treat all creators the same.
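As a minimal sketch of that fallback, the hierarchy can be stored as a mapping from each term to its more general parent. The terms come from the paragraph above; the names `PARENT`, `generalizations`, and `nearest_handled` are made up for illustration and do not come from any real semantic web library.

```python
# Hypothetical vocabulary: each term maps to its more general parent term.
PARENT = {
    "sole author": "primary author",
    "primary author": "author",
    "author": "creator",
    "editor": "creator",
    "illustrator": "creator",
    "creator": None,  # the root has no parent
}


def generalizations(term):
    """Return the term followed by each progressively more general term."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def nearest_handled(term, handled):
    """Return the most specific term in the chain that the consumer handles."""
    for candidate in generalizations(term):
        if candidate in handled:
            return candidate
    return None
```

A consumer that only knows how to process creators would call `nearest_handled("sole author", {"creator"})` and get `"creator"` back, so it can still act on the more specific statement it was given.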
Second, as Shakespeare put it, “There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.” There’s no such thing as an exhaustive list of all concepts I’ll ever care about. So I should make the vocabulary extensible: any user should be able to add a new term. To help me process the new term, they should also be able to tell me the more general term their term is a specialized form of.
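One way to picture an extensible vocabulary is a registry that accepts any new term as long as the user names the more general term it specializes. This is a sketch under assumptions: the `Vocabulary` class and the term "ghostwriter" are invented for illustration, not taken from any standard.

```python
class Vocabulary:
    """A hypothetical extensible vocabulary: every new term names a broader one."""

    def __init__(self, root):
        # The root term is the only one without a broader term.
        self.broader = {root: None}

    def add_term(self, term, broader):
        """Register a new term as a specialized form of an existing term."""
        if broader not in self.broader:
            raise KeyError(f"unknown broader term: {broader!r}")
        if term in self.broader:
            raise ValueError(f"term already defined: {term!r}")
        self.broader[term] = broader

    def is_a(self, term, general):
        """Return True if `term` is `general` or a specialization of it."""
        while term is not None:
            if term == general:
                return True
            term = self.broader[term]
        return False


vocab = Vocabulary("creator")
vocab.add_term("author", "creator")
# A user-supplied term the original vocabulary designer never anticipated.
vocab.add_term("ghostwriter", "author")
```

After these calls, `vocab.is_a("ghostwriter", "creator")` is true, so software written before "ghostwriter" existed can still treat it as a kind of creator.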
Third, we can simplify quite complicated concepts to relationships between pairs of things. If you need a relationship between more things, such as the members of a family or the authors of a paper, you can name the group and then state each individual's relationship to that group. If you need to express complicated ideas, you can either add them as a new specialized term that is a variant of an existing term, or you can assemble them as you would in English, with a root term modified by various adjectives, adverbs, and prepositional phrases.
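The group-naming trick can be sketched with plain subject, predicate, object tuples. This is illustrative Python rather than actual RDF syntax, and the identifiers (`authors-1`, `paper-1`, `alice`, `bob`) and predicate wording are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One pairwise relationship: subject --predicate--> obj."""
    subject: str
    predicate: str
    obj: str


# A two-author paper expressed only as pairwise statements: the group of
# authors gets its own name, and each person is related to that group.
TRIPLES = [
    Triple("authors-1", "is the author group of", "paper-1"),
    Triple("alice", "is first author in", "authors-1"),
    Triple("bob", "is second author in", "authors-1"),
]


def relationships_to(target, triples):
    """List (subject, predicate) pairs for every statement about `target`."""
    return [(t.subject, t.predicate) for t in triples if t.obj == target]
```

Calling `relationships_to("authors-1", TRIPLES)` recovers each individual's relationship to the group, even though no single statement mentions more than two things.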
Fourth, the effectiveness of the first three principles is dependent on the right starting point and overall structure. For example, if I wish to express ideas about family history research but I start with terminology for historical people and facts about them, I have set myself down the wrong path. Research deals in records and evidence and possibility and uncertainty, not in facts and conclusions, and no semantic organizational system will succeed if it requires users to express the wrong things.
Fifth, high generality = low usability. Anything can be written on a blank page, but if I want you to keep accounts of revenue and expenditures then a ledger is preferable. Similarly, by the third principle above we know that anything can be represented in RDF triples, but if I want you to express document markup then a much more limited language like HTML is far better. When I limit what you can do, you find it easier to focus and accomplish what you should do.
The Semantic Web never really emerged as a thing most people use, but it also never really died. Technologies like RDFa and vocabularies like schema.org have helped keep it growing, and it in turn has helped inform and power important new applications like Accessible Rich Internet Applications (ARIA) and Extensible Metadata Platform (XMP). It’s part of the background work that makes our modern world what it is.