Applying the Principle of Sensible Disbelief to derive polygenea.
It is common for existing family history tools to store conclusion-oriented data, such as the following example person record, which references two other person records that I do not show:
Person:
  id: 1
  name:
    as-written: Charlotte Ward
    surname: Ward
    givenname: Charlotte
  birth:
    father: 2
    mother: 3
  sources:
    - Kelleys Island, Erie, Ohio, pp. 370–371 no. 22, 1876
    - US Census 1900, Kelleys Island, Erie, Ohio, United States, sheet 5B, family 114
I’m using YAML-like syntax because it is easier to read than many other data formats. I suspect an implementation would use XML or JSON instead.
Now let’s remove the parts that violate the principle of sensible disbelief. People might contend that this person’s name is recorded incorrectly, so we’ll have to pull that out.
Person:
  id: 1
  birth:
    father: 2
    mother: 3
  sources:
    - Kelleys Island, Erie, Ohio, pp. 370–371 no. 22, 1876
    - US Census 1900, Kelleys Island, Erie, Ohio, United States, sheet 5B, family 114
We could get the parents wrong, so they need to be separate too.
Person:
  id: 1
  birth:
  sources:
    - Kelleys Island, Erie, Ohio, pp. 370–371 no. 22, 1876
    - US Census 1900, Kelleys Island, Erie, Ohio, United States, sheet 5B, family 114
Maybe those sources are about different people, so we have to get rid of all but one source. We can leave one, though, since a person with no source could never have entered our research to begin with.
Person:
  id: 1
  birth:
  source: Kelleys Island, Erie, Ohio, pp. 370–371 no. 22, 1876
The style of citation is bad too, or at the very least a point of disagreement. Pull it out and replace it with a reference to another node.
Person:
  id: 1
  birth:
  source: 4
How about the word “Person”? Maybe you thought it was a person but it was really a pet, or a city, or a house, or an imaginary friend… We do know it was something, but what kind of thing it was is a sensible object of disbelief. We’d best store just a “thing” and let the personness be a subject of discussion.
Thing:
  id: 1
  birth:
  source: 4
And now that we realize it might not be a person, it also might not be born, right?
Thing:
  id: 1
  source: 4
And here we are left with the core atomic element of a person, place, pet, event, or almost anything else: the Thing node, which contains exactly two fields: a unique ID, and a pointer to exactly one source node.
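As a concrete sketch, a Thing node is small enough to express in a few lines of Python; the class and field names mirror the YAML above and are purely illustrative, not a normative polygenea API:

```python
from dataclasses import dataclass

# Illustrative sketch of a Thing node: exactly two fields.
# frozen=True reflects the idea that nodes, once created, are never
# edited in place.
@dataclass(frozen=True)
class Thing:
    id: int      # globally-unique ID
    source: int  # ID of the single source node this thing cites

charlotte = Thing(id=1, source=4)
```

Freezing the dataclass captures the immutability that the rest of the design leans on: any correction happens by adding new nodes, never by changing this one.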
So what about all those parts we pulled away? They all go off into nodes of their own. The sources, for example, become Citation nodes:
Citation:
  id: 4
  type: Christening record
  place: Kelleys Island, Erie, Ohio, United States
  page: 370–371
  date: 1876
  number: 22
  lang: en

Citation:
  id: 5
  county: Erie
  date: 1900-06-07
  district: 33
  household: 114
  lang: en
  line: 70
  sheet: 5B
  state: Ohio
  supervisors district: 12
  type: census
  document: Twelfth Census of the United States
  township: Kelleys Island
  village: Kelleys Island
  schedule: 1 – Population
The set of attributes of a citation is unbounded; anything you know about a cited document you can put in the citation.
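Because the attribute set is open-ended, a plain key–value mapping is a natural representation. The sketch below restates citation 4’s attributes; the `repository` key at the end is hypothetical, added only to show that recording a newly noticed attribute requires no schema change:

```python
# A citation as an open mapping: only "id" is structural; every other
# key is whatever the researcher knows about the cited document.
citation4 = {
    "id": 4,
    "type": "Christening record",
    "place": "Kelleys Island, Erie, Ohio, United States",
    "page": "370–371",
    "date": "1876",
    "number": "22",
    "lang": "en",
}

# Nodes are immutable, so a richer citation is a *new* mapping rather
# than an edit of the old one ("repository" is a hypothetical key):
citation4_extended = {**citation4, "repository": "example archive"}
```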
The birth that we pulled out earlier is another Thing node:
Thing:
  id: 6
  source: 4

(That the birth cites source 4 and not source 5 is not something we can determine from the original record alone; we have to actually know something about the content of the two sources.)
Most of the other parts we removed are Property nodes; for example, we have the type of each thing:
Property:
  id: 7
  of: 1
  key: type
  value: person
  source: 4

Property:
  id: 8
  of: 6
  key: type
  value: birth
  source: 4
as well as fields like name:
Property:
  id: 9
  of: 1
  key: name
  value: Charlotte Ward
  source: 4

(That source 4 used “Charlotte Ward” and not “Ward, Charlotte” or “First name: Charlotte; Last name: Ward” or some such is not something we can determine from the record alone; even with the “as-written” part of the name in the record we don’t know if it was written that way in source 4 or source 5.)
Let’s assume that the separated name parts in the original example resulted from some indirect evidence; that is to say, they were not identified as separate in the source, but we inferred them based on the name-writing conventions of the period. (We can’t tell whether the evidence was direct or indirect from the given data; we’d have to ask the researcher, and trust memory, or re-perform the research ourselves.) The phrase “based on” suggests a rule or trend, a pattern that usually holds, which we can use to derive new information. To store a rule in the data, we list its antecedents and consequents: if the antecedents are matched by other nodes, the consequent nodes can be derived.
Rule:
  id: 10
  antecedents:
    - Citation:
        lang: en
        date: between(1800, 1900)
    - Node:
    - Property:
        of: antecedent #2
        key: name
        value: regex(^([^,]+) (\S+)$)
        source: antecedent #1
  consequents:
    - Property:
        of: antecedent #2
        key: surname
        value: group 2 of value of antecedent #3
    - Property:
        of: antecedent #2
        key: givenname
        value: group 1 of value of antecedent #3
This rule states that if you have a citation-style source in the English language, created between 1800 and 1900, and it is the source of a name property where the name value is a multi-word string with no commas (that’s what that regex means, in case you are not fluent in regular expressions), then you may derive a surname and givenname property.
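To make the pattern concrete, here is the same regular expression exercised in Python (the pattern is copied verbatim from the rule; the variable name is mine):

```python
import re

# No commas anywhere, and a final space-separated word:
# group 1 captures the given name(s), group 2 the surname.
NAME_PATTERN = re.compile(r'^([^,]+) (\S+)$')

m = NAME_PATTERN.match("Charlotte Ward")
assert m is not None
assert m.group(1) == "Charlotte"
assert m.group(2) == "Ward"

# A comma-form name fails the pattern, so the rule would not fire:
assert NAME_PATTERN.match("Ward, Charlotte") is None
```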
I am not suggesting that the rule syntax above is ideal; however, the idea of using functions to make more general rules holds. I anticipate that rules would usually be generated either via user-friendly rule-generation wizards (I may write more about those later) or by a relatively small set of users willing to write them by hand.
Using the rule we can create an inference; inferences match the rule up with some concrete antecedents and assert that we believe the rule holds in a particular case.
Inference:
  id: 11
  antecedents:
    - 4
    - 1
    - 9

(What if we are wrong and the rule does not apply in this case? We’d add a Property of the inference node that asserts it is false, with a source explaining why we don’t believe it.)
The inference is now the source of the consequents of the rule:
Property:
  id: 12
  of: 1
  key: surname
  value: Ward
  source: 11

Property:
  id: 13
  of: 1
  key: givenname
  value: Charlotte
  source: 11
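Here is a sketch of the computation that the inference licenses, in Python; the function name and dictionary encoding are mine, while all node contents come from the examples above:

```python
import re

def apply_name_rule(name_prop, inference_id, next_id):
    """Derive surname/givenname properties from a name property,
    citing the inference node (not the original citation) as source."""
    m = re.match(r'^([^,]+) (\S+)$', name_prop["value"])
    if m is None:
        return []  # antecedent not satisfied; nothing to derive
    return [
        {"id": next_id, "of": name_prop["of"], "key": "surname",
         "value": m.group(2), "source": inference_id},
        {"id": next_id + 1, "of": name_prop["of"], "key": "givenname",
         "value": m.group(1), "source": inference_id},
    ]

# Property 9 from earlier, run through inference 11:
prop9 = {"id": 9, "of": 1, "key": "name",
         "value": "Charlotte Ward", "source": 4}
derived = apply_name_rule(prop9, inference_id=11, next_id=12)
# derived mirrors properties 12 and 13 above
```

The derived properties cite the inference rather than citation 4, so the chain of reasoning stays visible in the data.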
Now, recall that we had two things, a person and a birth:
Thing:
  id: 1
  source: 4

Thing:
  id: 6
  source: 4

Property:
  id: 7
  of: 1
  key: type
  value: person
  source: 4

Property:
  id: 8
  of: 6
  key: type
  value: birth
  source: 4
How do we connect them together? Using a Connection:
Connection:
  id: 14
  from: 1
  description: is-child-in
  to: 6
  source: 4
We’d likewise have connections for the father and mother, like so:
Connection:
  id: 15
  from: 2
  description: is-father-in
  to: 6
  source: 4

Connection:
  id: 16
  from: 3
  description: is-mother-in
  to: 6
  source: 4
Connections have the same number of fields as properties, but the value is a reference instead of a string.
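The parallel is easy to see with both shapes written out side by side; this sketch uses my own names (Python reserves `from`, hence the trailing underscore):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Property:
    id: int
    of: int      # node being described
    key: str
    value: str   # a literal string
    source: int

@dataclass(frozen=True)
class Connection:
    id: int
    from_: int        # node being described (parallels "of")
    description: str  # parallels "key"
    to: int           # a node *reference*, where Property has a string
    source: int

# Connection 14 from above:
child = Connection(id=14, from_=1, description="is-child-in", to=6, source=4)
```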
We’ve seen a bunch of nodes that use citation 4 as a source; we’d also create a bunch that use citation 5 as a source, such as:
Thing:
  id: 17
  source: 5

Property:
  id: 18
  of: 17
  key: type
  value: person
  source: 5

Property:
  id: 19
  of: 17
  key: surname
  value: Ward
  source: 5

Property:
  id: 20
  of: 17
  key: givenname
  value: Charlotte
  source: 5
Then, to get back to the original two-source record, we’d record the idea that thing 1 and thing 17 are the same thing.
Match:
  id: 21
  same:
    - 1
    - 17

(Some tools and data models, for example LifeLines, DeadEnds, and the now-defunct new.familysearch, do have match actions explicit in the data, but many do not.)
Node 21, a match, is semantically the union of Thing 1 and Thing 17. It has two properties asserting that it is a person (7 and 18), each with a different source, suggesting that we have two sources for the personness of this thing. It likewise has two properties asserting the surname “Ward” (12 and 19), one from a cited source and one from the indirect evidence represented by an inference node. And so on.
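That union reading can be computed mechanically; the helper below is my own sketch, and the property list simply restates nodes from the examples above:

```python
# Property nodes from the examples, as plain dicts:
properties = [
    {"id": 7,  "of": 1,  "key": "type",      "value": "person",    "source": 4},
    {"id": 12, "of": 1,  "key": "surname",   "value": "Ward",      "source": 11},
    {"id": 13, "of": 1,  "key": "givenname", "value": "Charlotte", "source": 11},
    {"id": 18, "of": 17, "key": "type",      "value": "person",    "source": 5},
    {"id": 19, "of": 17, "key": "surname",   "value": "Ward",      "source": 5},
    {"id": 20, "of": 17, "key": "givenname", "value": "Charlotte", "source": 5},
]

def match_properties(same, props):
    """Every property of every member of a match node."""
    return [p for p in props if p["of"] in same]

union = match_properties({1, 17}, properties)  # match 21: same: [1, 17]
surname_sources = [p["source"] for p in union if p["key"] == "surname"]
# The surname "Ward" has two independent supports: citation 5 and inference 11.
```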
Although it may not be immediately evident, there are only two kinds of nodes missing from the set of nodes introduced above: note or comment nodes for containing arbitrary text that one researcher might wish to share with another, and belief sets for representing which nodes a particular researcher considers part of that researcher’s genealogy.
So what does this version of genealogical data give us?
Each disputable claim is individually addressable.
Actions that are complicated in other data models, such as merging or splitting people, or deciding that some record’s “Charlotte Ward” was really the name of a ward of a city rather than the name of a person, become as simple as adding or refuting a match node.
We can make nodes immutable, adding new nodes if an edit is needed (and adding a Connection node “update-of” between them); thus no one can change your data out from under you but you can still see all the changes that others think are useful.
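A sketch of that append-only discipline, with a hypothetical corrected name value and hypothetical fresh IDs 22 and 23 (the source fields here are illustrative):

```python
# An "edit" never mutates an existing node; it appends a replacement
# node plus an update-of connection pointing back at the original.
store = []

def add(node):
    store.append(node)  # append-only: existing entries never change
    return node

add({"kind": "Property", "id": 9, "of": 1, "key": "name",
     "value": "Charlotte Ward", "source": 4})
add({"kind": "Property", "id": 22, "of": 1, "key": "name",
     "value": "Charlotte Amelia Ward", "source": 4})  # hypothetical correction
add({"kind": "Connection", "id": 23, "from": 22,
     "description": "update-of", "to": 9, "source": 4})
```

The original property 9 survives untouched, so anyone who believed it can keep believing it while still seeing the proposed update.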
You and I can share all of our data without having to believe everything the other person believes.
We don’t have to pick just one of a set of likely alternatives: we can enter them all and decide which is most likely later.
Reasoning and process are evident in the data itself. This simplifies checking the quality of work and picking up where someone left off without redundant effort.
Attribution is as simple as a property with “key: contributed-by” (or connection if the user is a node). Multiple independent attributions for duplicated work are handled trivially.
LDS Temple ordinances and other centralized-authority properties can be added as digitally-signed properties of particular Thing nodes and thereafter automatically move as the Thing is matched and unmatched with other Things.
Data can be distributed as long as globally-unique IDs can be generated; there need not be a single centralized repository (though a centralized repository is certainly possible). Because nodes are immutable, almost all concurrency and consistency issues are removed.
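One possible way to get coordination-free globally-unique IDs, my suggestion rather than anything the model prescribes, exploits immutability itself: since a node never changes, a hash of its canonical content can serve as its ID.

```python
import hashlib
import json

def content_id(node: dict) -> str:
    """Derive a globally-unique ID from a node's content.

    Deterministic (same node always hashes the same) and requiring no
    central ID authority; collisions are cryptographically unlikely.
    """
    canonical = json.dumps(node, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = content_id({"kind": "Thing", "source": "cit-4"})
b = content_id({"kind": "Thing", "source": "cit-5"})
assert a != b                                                  # distinct nodes, distinct IDs
assert a == content_id({"kind": "Thing", "source": "cit-4"})   # deterministic
```

Random UUIDs would work equally well; content hashing has the extra property that two researchers who independently record the identical node get the identical ID.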
Automation can be used to derive rules, to check the probability with which a rule applies in a given context, and to provide hints such as “this rule applies only 83% of the time but you have applied it 100% of the time in your research. That means that about 8 of these 45 applications are probably wrong.” Further, all of this automation can coexist with your data without you being required to accept it.
There are probably more, but those come readily to mind.
Many of these benefits come because polygenea stores more information than do conclusion-oriented data models. Most data models simply do not have information regarding indirect inferences; only a few record match actions explicitly; and many don’t match sources to individual claims. Since that information is not in the original data, a fully-automated other-model-to-polygenea converter is not possible. Clever use of change logs might recover some of the information, but much of it was never entered into the computer before.