Possible Properties of Sources

What should be true of every citation?

When I started working with FHISO I was surprised to discover that a common complaint leveled against existing genealogical data specifications was that they did not handle sources very well.

“Self,” said I, “sources are easy. They are just references to documents, right? Patashnik and Lamport solved that back in 1985 with BibTeX. A citation is just a set of key:value pairs.”

“‍Perhaps,‍” my self replied, “‍but BibTeX is focused on technical papers, not historical documents.‍”

To which my reply was “‍so add a few new keys, right? What’s the difficulty?‍”

But many people agreed there was a problem, so I was hesitant to write it off that easily. As years have passed I have come to see more of the challenges there. I still hold that citations are simply a set of key:value pairs, and that defining a common but not exclusive set of keys is sufficient for most purposes; but I have come to realize that that model does not express the desirable properties of a citation.

In all that I have read about sources and citations I have not yet seen a list of desirable properties of citations. This post is my attempt to provide such a list.

Referential

Every citation ought to refer to an information source. Given an information source x and a citation y, it should be clear if y refers to x or not.

Constructable

Given a source, an average amateur family historian should be able to generate a citation that refers to that source with very low probability of error.

Comparable

Given two distinct citations it should be possible to determine if the citations refer to the same source by reference to the citations alone (i.e., without checking the referenced source).
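As a minimal sketch of what comparability could look like under the key:value model, the following Python treats a citation as a dictionary and compares two citations after a simple normalization. The normalization rules (ignore key order, letter case, and extra whitespace), the function names, and the sample records are all assumptions made for illustration; a real scheme would need far more careful rules.

```python
from typing import Dict, FrozenSet, Tuple


def normalize(citation: Dict[str, str]) -> FrozenSet[Tuple[str, str]]:
    """Reduce a key:value citation to a form that ignores key order,
    letter case, and runs of whitespace (an assumed, simplistic rule set)."""
    return frozenset(
        (key.strip().lower(), " ".join(value.split()).lower())
        for key, value in citation.items()
    )


def same_source(a: Dict[str, str], b: Dict[str, str]) -> bool:
    """Decide from the citations alone whether they refer to the same source.
    No lookup of the underlying source is performed."""
    return normalize(a) == normalize(b)


# Hypothetical records, invented for this example.
register_a = {"title": "Example Parish Register", "volume": "3", "page": "41"}
register_b = {"Page": "41", "Title": "  example  parish register ", "Volume": "3"}
register_c = {"title": "Example Parish Register", "volume": "3", "page": "42"}

print(same_source(register_a, register_b))
print(same_source(register_a, register_c))
```

Note that this sketch only detects citations that differ in trivial formatting. Two citations built from different keys (say, a call number versus a title and page) would compare as different even when they name the same source, which is one reason comparability is hard to achieve in practice.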

See also Multigranular for a more nuanced version of comparability.

Coverage

For every real-world information source, there should exist a valid citation. This includes documents, conversations, monuments, user memory, and possibly even sources of dubious validity such as hunches.

See also Multilingual for another aspect of coverage.

Identifier

A citation should refer to at most one source. If x refers to y and z ≠ y then it should always be the case that x does not refer to z.

See also Multigranular for a more nuanced version of identity.

Locator

If the cited source is one that can be consulted repeatedly, the citation should provide enough information for a new researcher to locate and consult the source themselves.

A locator is generally also an identifier; however, you can have locators that are not comparable, readily constructable, or even self-evidently referential.

Multigranular

It should be possible to cite sources at various levels of granularity, such as citing a chapter, a page, or a single word of a book. This refines the notion of an identifier.

Multigranular constructability suggests that a coarser-grained citation can be constructed from a finer-grained citation.

Multigranular comparability suggests that we can tell if one citation is a sub-citation of another.

With both constructability and comparability we can generate a citation that is a super-citation of all members of a set of fine-grained citations or assert that no such common super-citation exists.
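The three multigranular operations above can be sketched in Python if we assume, purely for illustration, that a citation is a tuple of components ordered from coarsest to finest (book, chapter, page, and so on). That tuple model, the function names, and the sample data are my assumptions, not part of any existing specification.

```python
from typing import Optional, Sequence, Tuple

# Assumed model: components ordered from coarsest to finest.
Citation = Tuple[str, ...]


def coarsen(citation: Citation, levels: int = 1) -> Citation:
    """Multigranular constructability: drop the finest `levels` components."""
    if levels < 0 or levels > len(citation):
        raise ValueError("cannot drop that many levels")
    return citation[: len(citation) - levels]


def is_subcitation(fine: Citation, coarse: Citation) -> bool:
    """Multigranular comparability: `fine` lies within `coarse` when `coarse`
    is a prefix of `fine`. Under this definition a citation is a
    sub-citation of itself."""
    return fine[: len(coarse)] == coarse


def common_supercitation(citations: Sequence[Citation]) -> Optional[Citation]:
    """Return the finest citation covering every member of the set,
    or None if the members share no common coarse component."""
    if not citations:
        return None
    prefix = citations[0]
    for other in citations[1:]:
        shared = 0
        for left, right in zip(prefix, other):
            if left != right:
                break
            shared += 1
        prefix = prefix[:shared]
    return prefix or None


# Hypothetical data, invented for this example.
page = ("Example Parish Register", "chapter 3", "page 41")
word = page + ("word 7",)
other_page = ("Example Parish Register", "chapter 3", "page 42")
other_book = ("Another Register", "chapter 1")

print(coarsen(word))
print(is_subcitation(word, page))
print(common_supercitation([word, other_page]))
print(common_supercitation([word, other_book]))
```

The prefix test works only because every component here is locational relative to the one before it. A non-locational component would not fit this tuple model without extra rules.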

All multi-grained citation systems that I have seen so far are locators at the finer granularity levels. That is, if you locate the source referenced by the coarse-grained part of the citation then the fine-grained citation provides all needed information to locate the finer-grained subpart of that source. Fine-grained data in theory could be non-locational (e.g., I could say “my second-favorite paragraph”) but I am unaware of existing models with non-locational sub-citation fields.

Multilingual

Citations should be easily generated by speakers of any language about sources in any language, including obscure, dead languages.

Translingual

Some portions of a citation may be extracts from the source and thus be in the source’s language. All other portions of the citation should be language- and locale-transparent, easily presented in any language a user happens to desire.

Canonical

There should be a single one-to-one mapping between sources and citations. Note that this subsumes several other properties on this list, including Identifier and Comparable.
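To make the relationship between Identifier and Canonical concrete, here is a small Python sketch that checks both properties over a finite set of observed (citation, source) pairs. The pair representation, the function names, and the placeholder labels such as "c1" and "s1" are assumptions for illustration. The check says nothing about sources that have no citation at all, so it does not test Coverage.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple

Pair = Tuple[Hashable, Hashable]  # (citation, source), an assumed encoding


def is_identifier(pairs: Iterable[Pair]) -> bool:
    """True if every citation in `pairs` refers to exactly one source."""
    sources_by_citation = defaultdict(set)
    for citation, source in pairs:
        sources_by_citation[citation].add(source)
    return all(len(sources) == 1 for sources in sources_by_citation.values())


def is_canonical(pairs: Iterable[Pair]) -> bool:
    """True if the observed pairs form a one-to-one mapping: each citation
    has one source and each cited source has one citation."""
    pairs = list(pairs)  # allow a second pass over any iterable
    citations_by_source = defaultdict(set)
    for citation, source in pairs:
        citations_by_source[source].add(citation)
    one_citation_per_source = all(
        len(citations) == 1 for citations in citations_by_source.values()
    )
    return is_identifier(pairs) and one_citation_per_source


# Placeholder labels, invented for this example.
one_to_one = [("c1", "s1"), ("c2", "s2")]
two_citations_one_source = [("c1", "s1"), ("c2", "s1")]
one_citation_two_sources = [("c1", "s1"), ("c1", "s2")]

print(is_canonical(one_to_one))
print(is_identifier(two_citations_one_source), is_canonical(two_citations_one_source))
print(is_identifier(one_citation_two_sources))
```

When the mapping is one-to-one, comparing two citations reduces to checking whether they are equal, which is the sense in which Canonical subsumes Comparable.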

Provenance

There are many possible provenance properties, allowing citations to store information about the chain of sources that led to the cited source. I do not enumerate them here in part because I believe that provenance is not part of a citation but rather the assertion of research about the origin of the source and should be stored as a set of research decisions.

As I look at this list I realize that attaining all of the properties on it is likely not feasible. For example, I doubt that there exists any constructable, canonical citation format. But I do think it is worth exploring which properties different citation techniques achieve, and to what degree they achieve each.