An exploration of how storing inference rules in the data can help ensure that collaborators know what they mean by the terms they use.
There is a large and open-ended set of concepts that we want to discuss in family history. “Ritual application of water as an act of admittance into a religious organization” is one example; another is “a government-run effort to collect snapshot information about the residents of its territory on a particular date”; yet another is “the person whose sperm caused another person to be born”.
I do not expect any group to ever honestly claim to have enumerated all of the concepts to which we might want to refer. That leaves us with several less-than-desirable alternatives:
Don’t even try to standardize terminology. If someone wants to call it “baptism” and another “christening”, let them.
Make as large a database of concepts as possible and require people to use those and only those. Not in the database? Submit a motion to have it added, and until that motion is approved just don’t refer to it at all.
Make a database of broad categories of concepts so that any new concept can be categorized as something we have. If that means we can’t tell two concepts apart in our data then the difference must not have mattered anyway.
Let any user add new concepts to “the database” writ large: anyone can add a new concept, provided they also give a way of specifying that concept to the exclusion of any others and a way for others to discover what that concept meant. If a hundred people add the same concept then we just have a hundred variants of it in the data.
As I said, less than desirable all around.
In any given community of speakers of a common language, terms exist that refer to various concepts. Often there are groups of concepts that are referred to together, such as “father” suggesting both genetic parentage and social guardianship. Often there are also distinct concepts that are referred to with the same term, such as “father” being either a genetic parent or a leader in some churches. Often there are concepts that can be referred to by several terms within that community, such as “church” and “religion” (depending on which definition of each we pick). And finally, there are often concepts that have no standard term within a community, such as the Japanese concept of 切腹 (seppuku) in almost any non-Japanese culture.
The act of taking a document and determining what concepts its contents are intended to convey is almost always an act of inference. Some documents come with a built-in glossary or the like, but usually, even if the terms are defined, they are defined in a separate document, and some human judgment is needed to determine whether the right set of definitions was found for the document in question. Even with a glossary, there is no guarantee that the producer of the document followed the definitions as the glossary describes them.
As I have suggested before, there is an advantage in allowing the data to contain the source’s original terms, the concepts we believe those terms are meant to convey, and the inference used to derive the concept from the term. In addition to the advantages of such a storage technique that I outlined in my previous post, storing the rules for these inferences allows an understanding of how to interpret documents like the one at hand to be shared digitally and automatically with other researchers.
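To make the storage idea concrete, here is a minimal sketch in Python. It is not a proposed format: the class names, the fields, the "parish register" source kind, and the rationale text are all hypothetical, chosen only to show a record that keeps the original term, the inferred concept, and the rule that connects them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRule:
    """A shareable rule: in sources of this kind, this term conveys this concept."""
    source_kind: str   # hypothetical label for a family of documents
    term: str          # the term as written in such sources (lowercase)
    concept: str       # the concept we believe the term conveys
    rationale: str     # why we believe it, so others can review the rule


@dataclass(frozen=True)
class Interpretation:
    """One inferred reading of a term, keeping the term and the rule used."""
    original_term: str
    concept: str
    rule: InferenceRule


def interpret(term, source_kind, rules):
    """Return every reading of `term` that the given rules support."""
    return [
        Interpretation(term, rule.concept, rule)
        for rule in rules
        if rule.source_kind == source_kind and rule.term == term.lower()
    ]


WATER_RITE = ("ritual application of water as an act of admittance "
              "into a religious organization")

# Illustrative rules only; real rules would carry a real rationale.
rules = [
    InferenceRule("parish register", "baptism", WATER_RITE, "illustrative rationale"),
    InferenceRule("parish register", "christening", WATER_RITE, "illustrative rationale"),
]

readings = interpret("Christening", "parish register", rules)
for reading in readings:
    print(reading.original_term, "->", reading.concept)
```

Because each reading carries its rule, another researcher who receives the data also receives the reasoning, and can reuse or dispute the rule on its own.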
Another advantage to including inferences and rules explicitly is that by entering the rules into the computer we are implicitly teaching the computer how to translate documents. The translation problem can be posed as “given a document that we would infer means the set of interlinked concepts X, find a document in the target language that we would also infer means X”.
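A heavily simplified sketch of that framing, in Python: it handles single terms rather than whole documents of interlinked concepts, and the rule tables, concept identifiers, and function names are all assumptions made for illustration.

```python
def concepts_for(term, rules):
    """Concepts that the rules say `term` may convey."""
    return {concept for rule_term, concept in rules if rule_term == term}


def translate(term, source_rules, target_rules):
    """Find target-language terms that we would infer mean what `term` means."""
    wanted = concepts_for(term, source_rules)
    return sorted({t for t, concept in target_rules if concept in wanted})


# Hypothetical (term, concept) rules; the concept identifiers are made up.
english_rules = [
    ("baptism", "water-rite"),
    ("christening", "water-rite"),
    ("census", "population-count"),
]
german_rules = [
    ("Taufe", "water-rite"),
    ("Volkszählung", "population-count"),
]

print(translate("christening", english_rules, german_rules))
```

Neither rule table mentions the other language; the shared concept is the only link between them, which is the point of the framing above.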
Given the understanding that concept extraction is inference, and the suggestions I gave in my earlier post on how to store inference and its underlying reasoning, I believe I can propose an acceptable solution to the plethora-of-concepts problem.
The first step toward a solution is to note that sharing inference rules can remove most of the pain associated with having multiple variants of a single concept. If you create the concept “maker of hob-nailed boots” and I create the concept “hob-nailed boot maker”, all we have to do to let the computer make the duplication seem to vanish is add a rule stating that these concepts may be derived from one another. Assuming a reasonably clever user interface, we can then work together as if we had created the same concept, without even needing to be aware that there are two concepts under the hood.
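One way the computer could make such duplicates seem to vanish is sketched below in Python. It assumes that “may be derived from one another” can be treated as a symmetric, transitive equivalence, which is a simplification on my part, and it uses a union-find structure; the class and method names are hypothetical.

```python
class ConceptIndex:
    """Groups concepts that rules declare derivable from one another."""

    def __init__(self):
        self._parent = {}  # concept -> parent concept within its group

    def _find(self, concept):
        """Return the representative concept of the group containing `concept`."""
        self._parent.setdefault(concept, concept)
        root = concept
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited concept directly at the root.
        while self._parent[concept] != root:
            next_concept = self._parent[concept]
            self._parent[concept] = root
            concept = next_concept
        return root

    def add_equivalence(self, a, b):
        """Record a rule saying concepts `a` and `b` derive from one another."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_concept(self, a, b):
        """True if some chain of rules links `a` and `b`."""
        return self._find(a) == self._find(b)


index = ConceptIndex()
index.add_equivalence("maker of hob-nailed boots", "hob-nailed boot maker")

print(index.same_concept("hob-nailed boot maker", "maker of hob-nailed boots"))
print(index.same_concept("hob-nailed boot maker", "blacksmith"))
```

A user interface built on something like this could show either name and search by either, while the data still records that two separately created concepts sit under the hood.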
The next step toward a solution is the existence of a standards body to define the preferred form and definition of the most common concepts. Birth, child-of, marriage certificate, obituary… the majority of references to concepts will probably draw on a relatively small set of concepts, even though the overall set of concepts is mind-bogglingly large. Having a single standard representation for these common concepts will simplify matters by reducing the proliferation of rules and duplicate concepts.
Looking for comments…