What are the core pieces of genealogical research?
A few weeks ago I wrote that one criterion for a good collaborative genealogical tool is modeling atomic elements of genealogical research. Over the past three days I’ve spent my blogging time writing up this single post presenting some of my thoughts about what those atomic pieces are.
I intentionally ignore search and presentation herein. Search is how you get information from the real world into your genealogy; presentation the reverse. Using an earlier analogy, search is the cistern and presentation the spigot, but I’m only going to discuss plumbing.
For the most part, I’ll divide pieces of research into five categories: sources, information, rules, matches, and arbitrage. An explanation of each follows. There are some things that don’t fit nicely into these buckets, but they work quite well for most elements of research of which I know.
A source makes claims that have bearing on genealogy. Some notable subtypes of sources are
Records, documents created by people in the know that include assertions about people’s lives and relationships. These include government and church records, headstones, obituaries, etc.; they are the mainstay of genealogy.
Placeholders, documents that express the conclusions of other genealogists. Ideally all such documents would be reduced to their atomic elements, which is why I call them placeholders; practically, such reduction is not always possible or worth the effort.
Anecdotes, things that people (the genealogist or others) attribute to personal knowledge but have not documented. Hunches, traditions, etc., fall in this category.
Research logs are, in part, sources that describe where other sources have been (or have not been) located.
See also Narratives in the “other” section below.
Sources have provenance, the “ancestry” of the source itself (e.g., “jpeg of microfilm of clerk’s copy of original”). Provenance is important but slightly tangential to this post so I’ll not say more about it here.
As a rough rule of thumb, “source = information + formatting.” Source information is the set of assertions the source makes. There is also information created by genealogists as they match up records and draw inferences.
Extracting information from a source requires some interpretation, but that interpretation should be restricted to translation, not computation or inference. Thus, handwriting, formatting, etc., should be parsed (possibly with ambiguity noted in things like “Van or Uan”); but higher-level inferences (like bounding a birthdate based on an age) aren’t really extraction.
While “information” is an established term in genealogy, we also need some additional terms relating to information. I call each person, place, event, etc., attested by a source a datum. Some existing genealogy software calls a person datum a “persona”, but that style of phrasing doesn’t generalize well. Each datum comes from exactly one claimant, which might be a source, a match, or an arbitrage. The set of data coming from a single claimant I call a claim. The individual things a claimant asserts about a datum or about the relationship between data I call an assertion.
I’m not sold on the particular terms in the previous paragraph, but I am confident that each of the concepts needs a term.
Each datum and each assertion within a claim must be individually addressable. That is, we must be able to refer unambiguously to each person, place, event, detail, and relationship in each claim.
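As a minimal sketch of what "individually addressable" could look like, the following models a claim as a bag of data and assertions keyed by an address of the form `claimant#local`. The class names follow the terms above, but the address format, the field names, and the `census-1880-p12` identifier are all assumptions made for illustration, not part of any existing tool.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Datum:
    """A person, place, event, etc., attested by exactly one claimant."""
    address: str  # "<claimant id>#<local id>" (assumed format)
    kind: str     # e.g. "person", "place", "event"


@dataclass(frozen=True)
class Assertion:
    """One thing a claimant says about a datum or a relationship between data."""
    address: str
    predicate: str             # e.g. "name", "parent_of"
    subjects: Tuple[str, ...]  # addresses of the data involved
    value: str = ""            # literal value; empty for pure relationships


@dataclass
class Claim:
    """All data and assertions coming from a single claimant."""
    claimant_id: str  # identifies a source, a match, or an arbitrage
    items: Dict[str, Union[Datum, Assertion]] = field(default_factory=dict)

    def _new_address(self, local_id: str) -> str:
        address = f"{self.claimant_id}#{local_id}"
        if address in self.items:
            raise ValueError(f"address already in use: {address}")
        return address

    def add_datum(self, local_id: str, kind: str) -> Datum:
        datum = Datum(self._new_address(local_id), kind)
        self.items[datum.address] = datum
        return datum

    def add_assertion(self, local_id: str, predicate: str,
                      subjects: Tuple[str, ...], value: str = "") -> Assertion:
        assertion = Assertion(self._new_address(local_id), predicate, subjects, value)
        self.items[assertion.address] = assertion
        return assertion


# Hypothetical source: one census page naming a father and a child.
census = Claim("census-1880-p12")
father = census.add_datum("p1", "person")
child = census.add_datum("p2", "person")
link = census.add_assertion("a1", "parent_of", (father.address, child.address))
print(link.address)  # census-1880-p12#a1
```

Because data and assertions share one address space per claim, a match or an arbitrage can point at either kind of thing with the same sort of reference.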
Each assertion within a source’s information may be first- or second-hand and may be direct (recorded from current observation) or indirect (recorded based on recollection). As with provenance, these distinctions and how they are made are important topics but beyond the scope of this post.
Where sources reference particular individuals, events, etc., inference rules refer to patterns across many instances. Rules need not be universally true; common trends are sufficient. A few examples:
A person A years old on date D was born between D − A − 1 year and D − A.
If X and Y are siblings, then there is some person who is both a parent of X and a parent of Y.
People who immigrate to the USA from Europe between 1892 and 1924 have immigration records at Ellis Island.
In general, a rule is a relationship between various types of assertions. Rules could also contain probabilities of being correct and preconditions under which they are worth considering.
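To make the age example concrete, here is a rough sketch of a rule as a record with a precondition, a derivation, and a confidence. The `Rule` shape, the dictionary keys, and the confidence value are assumptions chosen for illustration (the number is a placeholder, not a measured statistic). Bounds are inclusive, and a February 29 that has no counterpart in the target year is approximated by February 28.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Tuple


def years_before(d: date, years: int) -> date:
    """Return the date `years` calendar years before `d`."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:  # Feb 29 mapped into a non-leap year
        return d.replace(year=d.year - years, day=28)


def birth_bounds(as_of: date, age: int) -> Tuple[date, date]:
    """Earliest and latest birth dates for someone `age` years old on `as_of`."""
    earliest = years_before(as_of, age + 1) + timedelta(days=1)
    latest = years_before(as_of, age)
    return earliest, latest


@dataclass(frozen=True)
class Rule:
    name: str
    confidence: float                     # assumed probability the rule holds
    precondition: Callable[[dict], bool]  # is the rule worth considering?
    derive: Callable[[dict], dict]        # input assertions -> new assertions


age_rule = Rule(
    name="age-implies-birth-range",
    confidence=0.9,  # placeholder value for illustration only
    precondition=lambda a: "age" in a and "as_of" in a,
    derive=lambda a: dict(zip(("born_on_or_after", "born_on_or_before"),
                              birth_bounds(a["as_of"], a["age"]))),
)

# Hypothetical record: a person listed as 34 years old on 1 June 1880.
recorded = {"age": 34, "as_of": date(1880, 6, 1)}
if age_rule.precondition(recorded):
    print(age_rule.derive(recorded))
```

Someone born on 2 June 1845 would still be 34 on 1 June 1880, and someone born on 1 June 1846 would have just turned 34, which is why those two dates come out as the inclusive bounds.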
Some rules might also rank conflicting claims:
If a direct and an indirect assertion disagree, the indirect assertion is probably in error.
Assertion-ranking rules are important for arbitration, discussed below.
Most existing genealogical tools contain a limited set of inbuilt rules, but many more rules-of-thumb and assumptions are used by researchers. By recording and sharing these rules as rules instead of repeating them informally in the notes attached to each person, researcher reasoning becomes transparent. Additionally, collaboration on rules obviates some of the current function of research guides and allows for statistical validation of common rules-of-thumb.
Matches are of four types, as diagrammed below.
| | Positive | Negative |
|---|---|---|
| Same | Particular data from two claims refer to the same person/place/event/etc. | Particular data from two claims refer to distinct people/places/events/etc. |
| Applies | Links assertions with a rule to derive new assertions and/or data. | Indicates that a particular rule is not applicable to a particular set of assertions. |
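The four cells of the table could be captured by one record with a kind and a polarity. This is only a sketch: the field names, the address strings, and the rule name are hypothetical, and the free-text `explanation` field simply assumes that human-readable text is what a match carries.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class MatchKind(Enum):
    SAME = "same"        # two data refer (or do not refer) to the same thing
    APPLIES = "applies"  # a rule does (or does not) apply to some assertions


@dataclass(frozen=True)
class Match:
    kind: MatchKind
    positive: bool
    targets: Tuple[str, ...]  # addresses of data (SAME) or assertions (APPLIES)
    rule: str = ""            # name of the rule; required for APPLIES
    explanation: str = ""     # optional human-readable justification

    def __post_init__(self) -> None:
        if self.kind is MatchKind.SAME and len(self.targets) != 2:
            raise ValueError("a SAME match relates exactly two data")
        if self.kind is MatchKind.APPLIES and not self.rule:
            raise ValueError("an APPLIES match must name a rule")


# Hypothetical addresses and rule name, for illustration only.
same_person = Match(MatchKind.SAME, True, ("census#p1", "obituary#p4"),
                    explanation="same name, spouse, and township")
rule_rejected = Match(MatchKind.APPLIES, False, ("census#a3",),
                      rule="age-implies-birth-range",
                      explanation="age column is illegible")
```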
Some matches may be relatively self-evident, but others may require explanation. I have not yet thought of a good way to model these explanations. It may be that human-readable text is sufficient.
Matches create new information, either by combining existing data with their associated assertions or by creating new assertions from rules. In general, the information resulting from a match need not be internally consistent. I might end up with multiple birthdates for the same person or with intransitive links (A = B ≠ C = A). The state of believing conflicting information is a natural part of ongoing research and should not be prevented by the tool, but it may be corrected later with negative matches or with arbitrage.
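Since a tool should tolerate such conflicts rather than forbid them, one option is to report them on demand. The sketch below, under the assumption that positive and negative "same" matches are available as plain pairs of addresses, groups data with a union-find structure and lists every negative match whose two ends have nonetheless been merged. The function name and the input shape are illustrative.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def find_conflicts(same: Iterable[Pair], distinct: Iterable[Pair]) -> List[Pair]:
    """Return the negative matches contradicted by chains of positive ones."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the groups joined by each positive "same" match.
    for a, b in same:
        parent[find(a)] = find(b)

    # A negative match conflicts when both ends landed in one group.
    return [(a, b) for a, b in distinct if find(a) == find(b)]


# The intransitive case from the text: A = B, C = A, but B is marked distinct from C.
print(find_conflicts(same=[("A", "B"), ("C", "A")], distinct=[("B", "C")]))
# [('B', 'C')]
```

Nothing is rejected or repaired here; the conflicting pairs are just surfaced so that a later negative match or arbitrage can deal with them.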
When various assertions contradict, arbitration can be used to select out the set most likely true. There are rules for arbitrage just as there are for extrapolation, such as “trust direct records more than indirect”. But some arbitrage is much more case-by-case.
An arbitrage applies to a claim, removing one or more related assertions to create a more-consistent claim. Ideally, each arbitrage is backed by a ranking rule, but I am not yet comfortable suggesting that all arbitrage will have such rule backing. It may be that arbitrary arbitration (with human-readable explanatory text) is as good as we can do.
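A rule-backed arbitrage might look roughly like the following, which applies "trust direct records more than indirect" to a claim. The flat `Assertion` shape, the `prefer-direct` rule name, and the sample dates and addresses are assumptions for illustration. An arbitrage with an empty `rule` field would represent the case-by-case variety, justified only by its explanatory text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Assertion:
    address: str
    subject: str    # address of the datum the assertion is about
    predicate: str
    value: str
    direct: bool    # recorded from current observation?


@dataclass(frozen=True)
class Arbitrage:
    removed: Tuple[str, ...]  # addresses of the assertions being set aside
    explanation: str          # human-readable justification
    rule: str = ""            # backing ranking rule; empty if case-by-case


def prefer_direct(assertions: List[Assertion]) -> Arbitrage:
    """Set aside indirect assertions that disagree with every direct one."""
    groups: Dict[Tuple[str, str], List[Assertion]] = defaultdict(list)
    for a in assertions:
        groups[(a.subject, a.predicate)].append(a)

    removed: List[str] = []
    for group in groups.values():
        direct_values = {a.value for a in group if a.direct}
        if not direct_values:
            continue  # nothing to rank against; leave any conflict in place
        removed += [a.address for a in group
                    if not a.direct and a.value not in direct_values]
    return Arbitrage(tuple(removed),
                     "indirect assertion contradicted by a direct one",
                     rule="prefer-direct")


def apply_arbitrage(assertions: List[Assertion], arbitrage: Arbitrage) -> List[Assertion]:
    """Return the more-consistent claim left after the arbitrage."""
    dropped = set(arbitrage.removed)
    return [a for a in assertions if a.address not in dropped]


# Hypothetical matched claim with two birth dates for the same person.
claim = [
    Assertion("m1#a1", "m1#p1", "birth_date", "1846-03-02", direct=True),
    Assertion("m1#a2", "m1#p1", "birth_date", "1847-03-02", direct=False),
    Assertion("m1#a3", "m1#p1", "birth_place", "Utrecht", direct=False),
]
decision = prefer_direct(claim)
cleaned = apply_arbitrage(claim, decision)
print(decision.removed)  # ('m1#a2',)
```

The original claim is left untouched; the arbitrage is its own record, so the decision stays visible and can be revisited.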
There are elements of research that don’t fit easily into the above. For example, there is a general notion that human lives follow narratives. One kind of indirect evidence comes from the tie-ins needed to fit the existing evidence into a reasonable narrative. These narratives thus serve as sources, but are themselves supported by earlier conclusions, putting them at both ends of the genealogical process.
There are other special cases as well, but on the whole I think the above is a reasonable set of atomic pieces of genealogical research.
If my set of atomic elements is a good set (I welcome comments on this), then the next step is to build a data model capturing as many aspects of each as possible. This will necessarily involve some kind of enumeration of the kinds of data, assertions, and rules a tool will support. Then there’s a need for a database design allowing for easy access to the various elements, collaboration protocols to formalize how users interact with one another and with potentially multiple databases, and a user-interface design allowing for easy genealogical work.
There’s a lot of non-trivial work ahead. A conceptual model is just the first step on a long road. But I feel like I’ve taken the right first step.
Looking for comments…