Thoughts on the creation and storage of meaningful data.
Most of what I do on a computer is done through the keyboard. I rarely hit more than half a dozen keys a second, and there are only about a hundred keys on it, suggesting that I rarely top five bytes of input a second. Even if I kept up that peak data outflow sixteen hours a day every day for a year, I’d only produce about 100MB; my entire life’s product could probably fit on a single DVD without resorting to any kind of compression. In practice, I probably produce at most an order of magnitude less data than that, since I spend far less than 112 hours a week typing and even when actively typing I often pause for extended periods of time to think.
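The arithmetic is quick to check. A minimal sketch in Python (the 80-year lifespan is my own assumption for illustration; the other numbers come from the estimate above):

    # Back-of-the-envelope check of the keyboard numbers above.
    keys_per_second = 6                  # peak burst of typing
    bits_per_key = 7                     # ~100 keys fit in 7 bits each
    bytes_per_second = keys_per_second * bits_per_key / 8   # ~5 B/s

    seconds_per_year = 16 * 3600 * 365   # sixteen hours a day, every day
    per_year = bytes_per_second * seconds_per_year
    print(f'peak output per year: {per_year / 1e6:.0f} MB')      # ~110 MB

    per_lifetime = per_year * 80         # assumed 80-year lifespan
    print(f'80-year peak output: {per_lifetime / 1e9:.1f} GB')   # ~8.8 GB, one dual-layer DVD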
There are, of course, much faster ways to produce data. When I sketch with a graphics tablet I can produce data at a rate of several hundred bytes a second; thus, if I had stuck with my original desire to become an illustrator I would be producing about two orders of magnitude more data than I am now. If I were a photographer, I could expect to capture in a single day more than I currently expect to produce in my entire life. If I went into film I could capture a lifetime of data in less than a minute (30fps 1080p at standard bit depths is about 10GB of raw data per minute; it is compression that makes digital video practical).
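That figure is easy to verify. A sketch, assuming 24 bits per pixel as the standard bit depth and carrying the lifetime estimate forward:

    # Raw (uncompressed) data rate of 1080p video at 30fps.
    width, height, fps = 1920, 1080, 30
    bytes_per_second = width * height * 3 * fps    # 3 bytes per pixel
    per_minute = bytes_per_second * 60
    print(f'raw 1080p30: {per_minute / 1e9:.1f} GB per minute')  # ~11.2 GB

    # A lifetime of peak typing (~8.8GB, from the sketch above) takes
    # well under a minute to match on film.
    print(f'a typed lifetime: {8.8e9 / bytes_per_second:.0f} seconds of film')  # ~47 s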
I sometimes think about data input rates when I back up my data. I have had the oldest of my current set of computers for about four years. I’ve put in maybe two dozen hours of sketching on this machine (less than 50MB) and own no kind of camera at all, so I expect to find that I have produced well under 450MB of data. But my hard drive contains several hundred times that: 172GB of data. The vast majority of that is things I’ve digitized or downloaded (I find browsing through a detailed disk usage audit fascinating; for example, for about three years I archived podcast episodes that seemed worth re-listening to: 14GB), but there are still in the neighborhood of 10GB of data I created.
More than half of the data I created is part of the video I demoed in Odd Numbers. It required only a few kilobytes of code to create but generated 5.7GB of images on its way to making that little clip.
My Ph.D. dissertation consisted of a few hundred kilobytes of LaTeX and SVG; but those created a PDF ten times as large, as well as twice that much space in other derivatives that I generated to more easily interface with non-TeXie reviewers. Similar statements apply to the hundred or so other documents that I’ve authored in LaTeX in the past four years.
Then there’s code. D, C, and C++ code typically grows by a factor of 30–100 when I compile it; Python code by a factor of 3 if I use it in an import; Java and Scala code by a factor of 2, or a factor of 10 if I run javadoc; and so on. (Ten years ago I wrote almost only C++: hundreds of thousands of lines of it, with fancy template meta-programming and all the bells and whistles. Then I learned D, which does everything I used C++ for better than C++ does it. It took me about two years to completely switch over, but the last time I wrote C++, other than small tweaks to others’ code, was in October of 2010. As for Scala: I like the idea of it, but in practice Bash and Python are easier for throw-away scripts, D has more power for larger projects, Java is understood by other developers, and PHP and JavaScript work online. Scala hasn’t really found its niche in my workflow.)
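Those growth factors are easy to measure for yourself. A minimal sketch; the directory layout and extensions are hypothetical, not any particular project of mine:

    # Compare total source bytes against total build-artifact bytes
    # to estimate a project's source-to-derivative growth factor.
    import os

    def total_bytes(root, extensions):
        '''Sum the sizes of all files under root ending in one of extensions.'''
        total = 0
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                if name.endswith(extensions):
                    total += os.path.getsize(os.path.join(dirpath, name))
        return total

    # Hypothetical layout: sources in src/, compiler output in build/.
    source = total_bytes('myproject/src', ('.d', '.c', '.cpp', '.h'))
    derived = total_bytes('myproject/build', ('.o', '.obj', '.a'))
    print(f'growth factor: {derived / source:.0f}x')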
All these derivative works don’t just clutter up my hard drive and make automated backups either tricky or wasteful; they also hide and lose information.
To understand the loss, consider the intent to say “You should measure the duration of houseflies the same way that an arrow would measure their duration.” If I passed this source intent through a speech simplifier, I might end up with “time flies like an arrow.” The new sentence is a perfectly valid derivative of the original intent, but it is a valid derivative of other intents as well. The derivation lost information about the noun-vs-verb character of “flies”, the meaning of the “like” clause, and so on. As another example, when turning a diagram with layered components into an image, the image does not store what was behind the top set of components. Derivatives store only what is important to the derivative in question. They do not store the intent or structure of the original source.
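A toy illustration of exactly what the derivation discards: two intents, encoded with their grammatical structure explicit, that flatten to the very same sentence. The tree encoding below is my own, not any standard formalism:

    # Intent 1: 'time' is a verb -- an instruction to measure flies.
    intent_1 = ('imperative',
                ('verb', 'time'),
                ('object', ('noun', 'flies')),
                ('manner', 'like an arrow'))

    # Intent 2: 'flies' is the verb -- a statement that time moves swiftly.
    intent_2 = ('statement',
                ('subject', ('noun', 'time')),
                ('verb', 'flies'),
                ('manner', 'like an arrow'))

    def flatten(tree):
        '''Derive the surface text, discarding all grammatical structure.'''
        if isinstance(tree, str):
            return tree
        return ' '.join(flatten(part) for part in tree[1:])

    # Both intents derive to the same string; the structure is unrecoverable.
    assert flatten(intent_1) == flatten(intent_2) == 'time flies like an arrow'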
To understand the hiding, consider one line from one of my LaTeX files:

\parskip=0.5\baselineskip

That line expresses the intent that there should be half a blank line between paragraphs. That intent is in the resulting PDF, but it is not in any particular location. To discover that I intended half a blank line between paragraphs you’d have to check the gaps each place they appear and know to what to compare them (the inter-line spacing). The information is in the PDF; it’s just not in any single obvious spot.
A lot of derivation happens before data even enters the computer. When I write the word “well”, I know in my head if I mean a hole in the ground or a state of health or a prelude to an explanation or something else, but all I put on the page is the derivative of that idea, the word itself. I know which program I want to run, but all I tell the computer is “move the mouse to this pixel and then click”; the computer can’t tell if I cared about what part of the icon I clicked, or if the delay between motion and clicking was important, etc. All I give the computer is the derivative of my intent.
I surprise myself with how much time I spend pondering less derivative ways of communicating with computers. Polygenea came out of my distaste for the excessively lossy, derivative nature of the GEDCOM-like data prevalent in family history. I have also made many failed attempts to design a painless way to type sentences with their grammatical intent explicit, to compress sampled images with their underlying structure explicit, to make the beliefs I have as a programmer transparent in my code, and so on.
There is another side to the source-vs-derivative issue when it comes to humans creating the one or the other, one that has some relation to Tuesday’s post. To create source you have to be aware of what you mean, while creating a derivative directly allows you to go on instinct. It is harder to tell a computer the shape of an arm and the rules used to outline it than it is to give the computer a sketch of the arm. It requires more thought and care to express a fully diagrammed sentence than to spew out words that approximate how you habitually speak. It’s easier to say “there are two Henry Hermans; check this census” than it is to isolate the two Henrys in that census, make explicit the claim that they must be distinct people since they were living a street apart from each other therein, and then match up each other Henry Herman record with one or the other of the two people. Sources are more powerful, but they are also more work.
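To make that contrast concrete, here is a hypothetical encoding of the Henry Herman example. The structure below is my own illustration, not actual Polygenea or GEDCOM syntax:

    # Derivative: quick to produce; the reasoning stays in my head.
    derivative = 'There are two Henry Hermans; check this census.'

    # Source: each belief isolated as an explicit, checkable claim.
    source = [
        {'claim': 'person', 'id': 'henry-1', 'evidence': 'the census, one entry'},
        {'claim': 'person', 'id': 'henry-2', 'evidence': 'the census, another entry'},
        {'claim': 'distinct', 'of': ['henry-1', 'henry-2'],
         'reason': 'living a street apart in the same census'},
        # ...plus one 'same-person' claim matching each other Henry Herman
        # record to one of the two people above.
    ]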
And yet I have hope. I watch as student after student moves from knowing no source to preferring source over derivative. I observe my own increasing annoyance when I cannot provide sourcey information and see similar annoyance in others who have experienced a source-code means of expression. Better tools have more expressive power and enable the creation of more potent tools; once the mind adapts, who would go back?
And so I continue to work on polygenea; I continue to ponder typing diagrammed sentences and rendering meaning-filled images and programming interfaces that are more sourcey than those we have now. I also ponder how to simplify source, to make it less verbose and less derivative and less of a hassle to create and use. I want to live in a world where everyone has access to the power of sourcey expression, and that means more and more accessible source formats and interfaces.
The power of source is a source of power.