Type definition basics

While trying out different things on the .dawn format, one of the central choices I’ve faced is the one between description and prescription. This document outlines my current thoughts on how to combine both approaches in the .dawn format.

The problem

To a degree, any data encoding is a specification of what a certain string of bytes means, and conversely a prescription of how to encode a given abstract data structure. But for various reasons a number of general storage environments, such as databases, scientific file formats, generic object stores and compound documents, also use a second approach. Instead of defining a single, canonical format for a given data structure, they act as open-ended frameworks which permit multiple ways of organizing the data, and then fix the precise interpretation by attaching descriptive metadata to it.

Databases and object stores do (or can/should do) this because they aim at being completely generic software development environments. They separate logical and physical data models for performance and maintenance reasons, and try to give the programmer broad choice in how to define each side of the data equation. Scientific data formats do it primarily because of performance and provenance issues. Such data tends to be both highly regular and extremely voluminous, which makes it easy to describe but makes any wholesale processing, transformation or fitting into existing storage patterns costly. Furthermore, the stringent archival and trust requirements in scientific research cause raw measurement data to be essentially untouchable; it is acquired and archived as‐is, and any refinement and transformation is generally confined to separate, derivative data products. Compound formats, in turn, derive their structure from the underlying, heterogeneous software architecture, and try to be as transparent as possible. They are happy to store whatever the relevant set of applications cares to throw at them, and thus rarely define low-level codings of their own.

Hence, in many applications it makes sense to store the data in an application-specific format and only afterwards, and out‐of‐band, describe what was stored, instead of going the other way around: first prescribing a format and then forcing the data into the chosen model.
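
A minimal Python sketch may make the descriptive approach more concrete. It assumes a hypothetical application that has already written fixed-size binary records, and a separate descriptor that is only afterwards used to interpret them. The descriptor layout, the field names and the `decode` helper are all illustrative assumptions, not part of the .dawn format.

```python
import struct

# Bytes as some hypothetical application happened to write them:
# three records, each a little-endian uint16 followed by a float32.
raw = struct.pack("<HfHfHf", 1, 20.5, 2, 21.0, 3, 19.25)

# Out-of-band description of what was stored (assumed layout, for
# illustration only). The data itself is never rewritten.
descriptor = {
    "byte_order": "<",
    "fields": [("sensor_id", "H"), ("reading", "f")],
}

def decode(raw, descriptor):
    """Interpret raw bytes according to a separately supplied descriptor."""
    codes = "".join(code for _, code in descriptor["fields"])
    names = [name for name, _ in descriptor["fields"]]
    record = struct.Struct(descriptor["byte_order"] + codes)
    return [dict(zip(names, values)) for values in record.iter_unpack(raw)]

records = decode(raw, descriptor)
```

The same bytes could be given a different interpretation simply by supplying a different descriptor, which is the property the prescriptive approach gives up.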

Another concern of mine is that, in the terms of the scientific data management community, many data formats are unnecessarily low in context and, in the terms of the knowledge representation folks, have underspecified semantics. Since tagging and other kinds of descriptive metadata are an extremely low-overhead means of communicating such information, the underlying reason in most formats is a lack of sufficient metadata facilities and documentation rigour rather than performance.

Since I’m using the .dawn format as a sandbox for my ideas in general data representation, I’d like to see it able to cope with all of these challenges simultaneously. I’d like it to be able to precisely describe a broad class of preexisting data formats, preferably including its own encoding of descriptive metadata. At the same time it should define its own data model and settle for a limited subset of the describable structures, one which can be easily implemented and can serve as a canonicalization target where one is needed, e.g. for cryptographic purposes.

Divide and conquer

The number of data encodings out there is bewildering, so simply enumerating them would serve little purpose and won’t facilitate implementation. It is more useful to abstract away from the details and divide the problem into more manageable pieces.

In the end we wind up dividing the problem into five parts:

1. Choose the atomic datatypes and formally describe their possible encodings.
2. Fix a set of compositional (or generative) primitives used to derive and describe more complex types.
3. Choose a unified, abstract data model.
4. Map all types into the abstract model and formally describe the translation.
5. Attach additional, normally neglected semantics to the resulting data model via descriptive metadata.
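
The five parts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Atom`, `Record` and `Array`, the use of `struct` format codes as encoding descriptions, the choice of plain Python values as the abstract model, and the `meta` dictionary are all hypothetical and do not describe the actual .dawn design.

```python
import struct
from dataclasses import dataclass, field

@dataclass
class Atom:
    """Part 1: an atomic type plus one concrete encoding of it."""
    name: str
    code: str  # struct format including byte order, e.g. "<H"

    def decode(self, buf, offset):
        (value,) = struct.unpack_from(self.code, buf, offset)
        return value, offset + struct.calcsize(self.code)

@dataclass
class Record:
    """Part 2: a compositional primitive, an ordered set of named fields."""
    fields: tuple                              # pairs of (name, type)
    meta: dict = field(default_factory=dict)   # Part 5: descriptive metadata

    def decode(self, buf, offset):
        out = {}
        for name, typ in self.fields:
            out[name], offset = typ.decode(buf, offset)
        return out, offset

@dataclass
class Array:
    """Part 2: a second primitive, a fixed-length repetition of one type."""
    element: object
    count: int

    def decode(self, buf, offset):
        items = []
        for _ in range(self.count):
            item, offset = self.element.decode(buf, offset)
            items.append(item)
        return items, offset

# Parts 3 and 4: the abstract model here is just Python numbers, dicts
# and lists, and each decode() method is the mapping into that model.
u16 = Atom("uint16", "<H")
f32 = Atom("float32", "<f")
sample = Record(
    (("sensor_id", u16), ("reading", f32)),
    meta={"reading": {"unit": "degC"}},  # semantics the bytes do not carry
)
series = Array(sample, 2)

buf = struct.pack("<HfHf", 7, 1.5, 8, 2.5)
value, end = series.decode(buf, 0)
```

Swapping `"<H"` for `">H"` changes only the encoding description in part 1; the abstract value produced for a correspondingly encoded buffer stays the same, which is the separation the division is meant to achieve.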