Reflections on data standards and safe round-trip communication between partial implementations thereof.
When modeling data, it is common to have some kind if hierarchy, which at it simplest level means some data are nested inside other data.
person: birth: date: +1605-03-04 place: London, England name: John
There is also an equivalence between nesting data and having data point to other data. The above example can be equivalently rendered as
@1, a person @2, a birth child of @2 is @1 date of @2 is +1605-03-04 place of @2 is London, England name of @1 is John
… where the various @number
are what JSON-LD calls blank node identifiers;
that is, tags used to make pointers unambiguous when presenting data as text.
You can always change nesting into pointers, and you can always change pointers back into nesting as long as there are no pointer cycles. If there are cycles, like the following:
@1, a person @2, a person parent of @1 is @2 child of @2 is @1
then at least one pointer per cycle has to remain:
@1, a person parent: a person child: @1
Immutable data has the feature that once created it can never be changed. The idea of immutability has two potential gotchas.
First, immutability works best in a pure-pointer format because if there is nesting then some situations are difficult to create. For example, the nested-with-pointers data
@1, a person child: @2 @2, a child parent: @1
can only be created immutabile all in one go; the natural progression of creating both pointed-to things prior to adding their pointers requires us to change things after they already exist, violating immutability:
@1, a person |
→ | @1, a person @2, a person parent: @1 |
→ | @1, a person
child: @2
@2, a person
parent: @1 |
This issue does not arise in an all-pointers situation like
@1, a person @2, a person parent of @1 is @2 child of @2 is @1
because each line can be made to depend only on the lines above it.
Second, immutability can appear impossible to edit and correct. If you recorded something incorrectly, perhaps with a typo like
@1, a person name of @1 is Jphn
you can’t modify what’s already there. However, this can be fixed by making everything available as a target of a pointer and adding “update of” markers:
@1, a person @2, name of @1 is Jphn @3, name of @1 is John @4, update of @2 is @3 @5, explanation of @4 is "Jphn was a typo"
There is one very nice feature of immutability: it lets programs communicate back-and-forth even if each program choses to ignore some things the other says. This is gained because the data makes explicit the difference between “didn’t include” (missing) and “user deleted” (presence of “deleted” element in data). Thus if I send
@1, a person @2, name of @1 is Jphn @3, name of @1 is John @4, update of @2 is @3 @5, explanation of @4 is "Jphn was a typo"
to someone who discards old versions and ignores explanations, I might get back something like
@1, a person @2, name of @1 is John @3, a birth @4, child of @3 is @1 @5, date of @3 is +1805-06-03
and, when I add this to what I already know I get
@1, a person @2, name of @1 is Jphn @3, name of @1 is John @4, update of @2 is @3 @5, explanation of @4 is "Jphn was a typo" @6, a birth @7, child of @6 is @1 @8, date of @6 is +1805-06-03
which merges the old and the new without spuriously removing the lost bits.
Note that the above process does require some nuance in handling;
there needs to be an unambiguous way of renaming identifiers
and determining which subjects (like @1
above, data that point to no other data)
are referred to by both datasets.
However, solutions to these problems already exist
(e.g., in JSON-LD subjects would have absolute IRIs while others would have blank node identifiers and be compared by content;
or subjects could be given an extra UUID field and all nodes compared by content;
etc.).
Immutability and pointers have pros and cons beyond those mentioned above.
One pro is that immutable data can be processed in a single pass; because pointers are acyclic, each datum can be presented before any data that point to it.
A con is that the data does not resemble the subject it describes: with nesting I can tell be glancing at the data stream all of the things that apply to a particular person, where with pointers I’d need to process the entire data stream first.
Another con is that it generally takes up more space than nesting for two reasons:
There are all of those pointers in the data stream that we have to store; we got the same information for free with adjacency-based nesting.
Immutability means we are sharing more information overall. In particular, the “update” and “delete” nodes encode a kind of data-centric edit history; I have yet to see as compact a way of storing that with nested data.
Looking for comments…