Attributes, annotations, and choosing between them
In a recent paper (CFPS 79) submitted to the FHISO (the Family History Information Standards Organisation; see this post for more) genealogy call for papers, I described a principle I think might have applicability in many other areas. To explain it, though, I need to say a few words about data structures.
Two years ago I wrote a little bit about data structures. For this discussion, though, we don’t need to even think about computers. We can just use paper.
Suppose I want to write down information about Seth and store it in a filing cabinet where each sheet of paper in the file is numbered. I could do that by making a single sheet of paper like this:
page 5
type: Person
name: Seth
gender: Male
birth: circa 3868 BC
death: circa 2956 BC
father: Adam (page 1)
mother: Eve (page 2)
known children: Enos (page 6)
In programmer speak, the name, birth, etc., on this sheet are attributes or fields of the page (which programmers might call a record, object, struct, or node) representing Seth.
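As a rough illustration, the single sheet for Seth could be written as one record in Python. The class name `PersonRecord` and its field names are assumptions chosen for this sketch, not part of any proposal; references to other sheets are stored as page numbers.

```python
from dataclasses import dataclass, field


@dataclass
class PersonRecord:
    """One sheet of paper: every fact is an attribute of the same record."""
    page: int
    name: str
    gender: str
    birth: str
    death: str
    father: int  # page number of the father's sheet
    mother: int  # page number of the mother's sheet
    known_children: list[int] = field(default_factory=list)  # page numbers


seth = PersonRecord(
    page=5,
    name="Seth",
    gender="Male",
    birth="circa 3868 BC",
    death="circa 2956 BC",
    father=1,
    mother=2,
    known_children=[6],
)
```

Everything about Seth lives in one place here, which is convenient to read but means that any disagreement about a single fact has to be settled by editing this one record.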
But I don’t need to store all of that information on a single sheet. For example, I could separate out the birth onto a separate sheet:
page 7
type: event
subtype: birth
father: Adam (page 1)
mother: Eve (page 2)
child: Seth (page 5)
date: circa 3868 BC

page 5
type: Person
name: Seth
gender: Male
death: circa 2956 BC
known children: Enos (page 6)
I could even take this to an extreme and have nodes like
page 53
type: name
value: Seth
whom: (page 5)

page 7
type: event
subtype: birth
father: (page 1)
mother: (page 2)
child: (page 5)
date: circa 3868 BC

page 5
type: Person
and put no information at all on Seth’s own node. There is, as far as I know, no standard name for these records-that-describe-records; I call them annotations or properties, but the names are not optimal….
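A minimal sketch of this extreme layout, assuming the filing cabinet is a plain Python dictionary keyed by page number. The names `cabinet`, `REFERENCE_KEYS`, and `annotations_of` are hypothetical helpers for this illustration only.

```python
# Each sheet is a small dict; values under REFERENCE_KEYS are page numbers.
cabinet = {
    5: {"type": "person"},
    53: {"type": "name", "value": "Seth", "whom": 5},
    7: {
        "type": "event",
        "subtype": "birth",
        "father": 1,
        "mother": 2,
        "child": 5,
        "date": "circa 3868 BC",
    },
}

# Assumption: these are the only attributes that point at other pages.
REFERENCE_KEYS = {"whom", "father", "mother", "child"}


def annotations_of(cabinet, page):
    """Return the page numbers of all sheets that refer to `page`."""
    return sorted(
        number
        for number, sheet in cabinet.items()
        if any(sheet.get(key) == page for key in REFERENCE_KEYS)
    )


print(annotations_of(cabinet, 5))  # [7, 53]
```

Seth's own sheet carries only its type; everything else known about him is found by asking which sheets point at page 5.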
Since we can store all the information on one sheet or put it on many individual sheets, which is the “right” thing to do? This is where my Principle of Sensible Disbelief comes in. Generalized slightly beyond the genealogical context, it states:
It should be sensible to exclude any node(s) to which no included nodes refer, and not sensible to exclude any attribute(s) of nodes included.
Returning to the Seth example, there are people who disagree with the dates I used, so the dates should be on their own nodes. Similarly, one might imagine someone claiming that Seth was just a nickname, or that he had a different mother, etc. Each of these properties should thus be its own node.
However, not all attributes should be pulled off in this fashion. For example, it probably does not make sense to pull off the type of each node; it may be sensible to disagree on when Seth was born (or even if he was a real person) but it would be pretty strange to disagree about whether page 5 refers to a person or a birth or a date.
So, why would the principle of sensible disbelief be useful? Why would we want to have a lot of nodes at all and, having them, why restrict them in this fashion?
Having many nodes simplifies collaboration in two ways.
First, it reduces the need to have two editors access the same node simultaneously or to try to merge two versions of the same record: each conceptual edit is on a separate page, so only one need be accessed at a time, and merging is a matter of picking which one to believe (or of adding “match” nodes; more on these below).
Second, it facilitates “persistent data” where every change is simply added to the filing cabinet and no paper is ever removed. Persistent data is a huge boon if various collaborators might not always agree; instead of you and me getting into an edit war, changing Seth’s birth date back and forth, we can simply each add our believed date to the birth, possibly adding explanations of our reasoning (in separate nodes) as well.
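The persistent, add-only filing cabinet can be sketched as follows. The class `FilingCabinet`, its methods, and the attribute name `of` are assumptions made for illustration, and the second date value is a placeholder rather than a real claim.

```python
class FilingCabinet:
    """Append-only store: sheets are added, never edited or removed."""

    def __init__(self):
        self._pages = []  # list index i holds page number i + 1

    def add(self, **attributes):
        """File a new sheet and return its page number."""
        self._pages.append(dict(attributes))
        return len(self._pages)

    def get(self, page):
        """Return a copy of a sheet so callers cannot alter the original."""
        return dict(self._pages[page - 1])

    def referring_to(self, page, key):
        """Page numbers of all sheets whose `key` attribute points at `page`."""
        return [
            number
            for number, sheet in enumerate(self._pages, start=1)
            if sheet.get(key) == page
        ]


cabinet = FilingCabinet()
birth = cabinet.add(type="event", subtype="birth")
mine = cabinet.add(type="date", of=birth, value="circa 3868 BC")
yours = cabinet.add(type="date", of=birth, value="a different date (placeholder)")

# Both opinions coexist; neither collaborator overwrote the other.
print(cabinet.referring_to(birth, "of"))
```

Because there is no update or delete method, a disagreement shows up as two sibling sheets rather than as a history of reverted edits.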
Having only nodes that encode something that might be sensibly disbelieved also has two benefits regarding data consistency.
First, any set of nodes, together with the nodes to which they refer, is a sensible model of the data. There is no need to place special restrictions on which nodes you believe; you might not pick the right set, or even a logically consistent set, but you will at least pick a sensible set. You’ll never have a node with no type, or with multiple conflicting types, etc.
Second, by selecting the attributes carefully we can prevent people from skipping steps or hiding what they are actually doing. For example, I have proposed (see FHISO CFPS 4 for this proposal) that each node should have, as an attribute, a single source. Putting this as an attribute prevents people from making a single node with multiple sources. I chose this because I believe that any multi-source data is actually the result of a match, which might look like
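One way to read "a set of nodes, together with the nodes to which they refer" is as a closure over references. The sketch below assumes each sheet stores its outgoing references in a `refs` dictionary and follows them transitively; the function name `believed_model`, the layout, and the page numbers 8 and 9 are all hypothetical.

```python
def believed_model(cabinet, chosen):
    """Return the chosen pages plus every page they refer to, transitively."""
    model = set()
    pending = list(chosen)
    while pending:
        page = pending.pop()
        if page in model:
            continue  # already included; also guards against reference cycles
        model.add(page)
        pending.extend(cabinet[page].get("refs", {}).values())
    return model


cabinet = {
    1: {"type": "person"},
    2: {"type": "person"},
    5: {"type": "person"},
    7: {"type": "event", "refs": {"father": 1, "mother": 2, "child": 5}},
    8: {"type": "date", "value": "circa 3868 BC", "refs": {"of": 7}},
    9: {"type": "date", "value": "a competing date (placeholder)", "refs": {"of": 7}},
}

# Believing only the date on page 8 pulls in the event and the people it names,
# but not the competing date on page 9, which nothing in the model refers to.
print(sorted(believed_model(cabinet, {8})))  # [1, 2, 5, 7, 8]
```

Since `type` is an attribute of every sheet rather than a separate node, every page in the resulting model is guaranteed to have exactly one type, whichever pages were chosen.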
page 56
type: match
same: page 5, page 43
and be used like
page 57
type: person
source: page 56
to represent the concept that two sources refer to the same person. (Actually, since a match is virtually always accompanied by a node citing the match with the same type as the matched nodes, in CFPS 4 I propose that these two be merged; but that distinction is not important for this post.) Forcing this distinction would not be possible if sources could be appended at will. Without it, the sensible-to-disbelieve claim that two sources refer to the same individual/event/etc. would not be in its own node.
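A small sketch of the single-source rule, assuming `source` is a required attribute holding exactly one page number. The classes `Node` and `Match`, the function `original_sources`, and the source pages 100 and 101 are hypothetical and exist only for this illustration; they are not taken from CFPS 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    page: int
    type: str
    source: int  # exactly one source page; deliberately not a list


@dataclass(frozen=True)
class Match:
    page: int
    same: tuple  # pages claimed to describe the same thing


nodes = {
    5: Node(page=5, type="person", source=100),   # 100: hypothetical source page
    43: Node(page=43, type="person", source=101),  # 101: hypothetical source page
    57: Node(page=57, type="person", source=56),   # cites the match below
}
matches = {56: Match(page=56, same=(5, 43))}


def original_sources(page, nodes, matches):
    """Trace a node back through any matches to the sources it rests on."""
    source = nodes[page].source
    if source in matches:
        found = set()
        for matched_page in matches[source].same:
            found |= original_sources(matched_page, nodes, matches)
        return found
    return {source}


print(sorted(original_sources(57, nodes, matches)))  # [100, 101]
```

The multi-source person on page 57 can only be built by going through page 56, so anyone who doubts that the two records describe the same person has a specific node to disbelieve.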
I enjoy thinking about how I think. One of my many forms of self-absorption, I suppose. I was thus surprised and delighted to find that this principle of sensible disbelief, initially developed to help me design a good family history data model, has proven a useful tool in thinking about other things as well.
Of necessity, most thoughts I think are informal summaries of many distinct concepts mashed together. When I wish to take those thoughts apart I have a set of tools I use to evaluate ideas and see if I can split them further. Quite unexpectedly, since I developed it, the principle of sensible disbelief has come to mind several times as a tool for assisting in this idea decomposition.
I believe in the fundamental transferability of truth, but it still surprises me each time an idea or mental tool jumps over the little fences I put up between disciplines in my mind.