Unusual and constructive relations

Most data stored in relational databases is enumerative in nature and comes from discrete domains. However, in scientific computing this is often not the case. Surprisingly, some of the most vexing problems in normal transactional data management turn out to be special cases of this more general setting.

Temporal data

Time is a continuous quantity. When temporal data is stored in a database, temporal accuracy is limited only by the discrete representation chosen, and it is even possible to construct variable precision representations which set no fixed lower bound on coarseness. Furthermore, even the coarsest representations are routinely used with continuous predicates, and continuum reasoning is commonly applied to them. As a result, temporal data cannot really be thought of as discrete. Instead the typical timestamped or bitemporal relation ought to be thought of as a constructive representation of an infinite relation: a finite set of endpoints is picked from the countable, dense subset of the real line formed by all times representable as finite binary fractions, and the linear order on the reals is then used to construct time intervals, which are uncountable, connected sets. Thus the underlying bitemporal relation doesn’t really show up in the database at all, but is merely represented succinctly by a clever, constructive, discrete notation. The time part of the relation actually denotes a particularly regular, uncountable set, which is then used in a Cartesian product with the rest of the tuple.
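
To make the point concrete, here is a minimal sketch of what such a constructive representation amounts to: the stored row holds only two endpoints, yet the value it denotes is an uncountable, connected set, with membership decided by the order on the reals rather than by enumerating granules. The class and its names are purely illustrative, not a feature of any existing database.

    from fractions import Fraction

    class HalfOpenInterval:
        """A constructive stand-in for an uncountable, connected set of times.

        Only the two endpoints are ever stored (here as exact binary
        fractions); membership of any point in the continuum is decided
        by the linear order on the reals, never by enumerating granules.
        """
        def __init__(self, start, end):
            self.start = Fraction(start)
            self.end = Fraction(end)

        def __contains__(self, t):
            # Works for any rational (or float) t, not just stored endpoints.
            return self.start <= t < self.end

    # The relation row stores just two numbers...
    valid_time = HalfOpenInterval("1/4", "7/8")

    # ...but it answers membership questions about arbitrarily fine instants.
    print(Fraction(13, 32) in valid_time)   # True
    print(Fraction(15, 16) in valid_time)   # False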

Many of the common problems with temporal data stem from treating temporal relations as though what is in the database were all there is to them. For example, many recent treatments of temporal data models assume that time is represented as a series of finite, discrete granules and that timestamped relations represent Cartesian products between contiguous ranges of such granules and the rest of the relation values. Practical use of timestamps, however, reflects the continuous semantics much better: different granularities of timestamps are used side by side, and even the possible existence of a lower granularity bound is neglected. In that case the topology of the underlying time dimension is no longer discrete, and so closed, half open and open intervals become distinct. This leads directly to the many opportunities for fencepost errors, which are possibly the most troublesome and inefficient aspect of interval calculations and of the data structures maintained using them.
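
The half open convention also shows why the topological distinction matters in practice: on a continuum, adjacency of half open intervals is exact equality of the shared endpoint, and no granule-sized fudge term ever appears. A tiny sketch, with hypothetical helper names:

    def meets(a, b):
        """Half-open intervals [a0, a1) and [b0, b1): adjacency is exact
        equality of the shared endpoint, with no '+1 granule' fudge."""
        return a[1] == b[0]

    def coalesce(a, b):
        """Merge two intervals that meet or overlap into one connected set."""
        if a[1] < b[0] or b[1] < a[0]:
            raise ValueError("disjoint and non-adjacent: no connected union")
        return (min(a[0], b[0]), max(a[1], b[1]))

    # With discrete, closed granule ranges the adjacency test would read
    # a_end + 1 == b_start, and the off-by-one moves around with the
    # granularity; with half-open continuum intervals it never appears.
    print(meets((0.0, 2.5), (2.5, 4.0)))      # True: shared endpoint, exactly
    print(coalesce((0.0, 2.5), (2.5, 4.0)))   # (0.0, 4.0), one connected set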

Another interesting thing to note is that once we recognize the topological issue, we can also see that better formalisms exist for describing the relevant sets than the current interval-based ones. Formalisms based on simplicial complexes are particularly interesting, and not something one would ever think of before bumping into scientific data formats which deal with the same problem in a more general manner.
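
As a toy illustration of what a simplicial formalism buys, the following sketch represents a set of times as 0-simplices (isolated instants) plus open 1-simplices (open intervals); closed, half open and open intervals then differ only in which simplices are included. This is only a cartoon of the idea, with invented names, not a proposal for an actual data model.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Complex1D:
        """A toy 1-dimensional simplicial complex over the time line:
        a set of 0-simplices (isolated instants) plus a set of open
        1-simplices (open intervals given by their endpoints). Isolated
        points, open intervals and their unions are handled uniformly."""
        vertices: frozenset          # isolated instants, e.g. {3.0}
        edges: frozenset             # open intervals as (a, b) pairs, a < b

        def __contains__(self, t):
            return t in self.vertices or any(a < t < b for a, b in self.edges)

    # The half-open interval [1, 2) is the vertex {1} plus the open edge (1, 2);
    # the closed interval [1, 2] would add the vertex {2} as well.
    half_open = Complex1D(frozenset({1.0}), frozenset({(1.0, 2.0)}))
    print(1.0 in half_open, 1.5 in half_open, 2.0 in half_open)   # True True False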

Sampled signals

Few database engineers have a background in DSP, so few of them really understand sampling either. Doing lots of relational design also gives one a sort of relatitis, which causes one to abstract away from the many peculiarities of actual data and to concentrate on the overarching relational structure. This is why sampled functions never quite seem to fit into the relational framework, and why extended datatypes are commonly used to store them despite their exceedingly neat functional structure.

The trouble with sampling is that a sampled function is a representation of a continuous one, not a freestanding data structure, and so some operations on it rely heavily on continuum reasoning. True to their name, most sampled signals are simply streams of instantaneous values taken from a bandlimited function at regular intervals, but the reason we sample isn’t that we’re especially interested in the values the function takes at these precise points. Instead what makes sampling neat is that when the function is bandlimited, the whole continuous function can be losslessly recovered from any regular series of instantaneous values spaced closely enough. While such a representation is both simple and perfectly invertible, it also makes the image of a function in a relational database just that: an image of a far more massive and involved structure. Even though many signal processing algorithms rely on the particular properties of the sampled representation, and so are able to operate directly on the discrete values, others are more involved and also take into account the vast majority of the represented function which can only be inferred from its samples and isn’t directly visible in the database.
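
For the skeptical, here is roughly what that recovery looks like, as a sketch assuming unit-rate sampling and using the Whittaker-Shannon interpolation formula; the function and variable names are mine, not anybody’s API.

    import numpy as np

    def reconstruct(samples, t, rate=1.0):
        """Whittaker-Shannon reconstruction: evaluate the continuous,
        bandlimited function represented by regularly spaced samples at an
        arbitrary instant t (in units of the sampling interval 1/rate).

        The stored relation holds only `samples`; every other value of the
        function is implied by them and recovered here with a sinc sum."""
        n = np.arange(len(samples))
        return float(np.sum(samples * np.sinc(rate * t - n)))

    # Samples of a 0.1-cycles-per-sample sinusoid (well under Nyquist).
    samples = np.sin(2 * np.pi * 0.1 * np.arange(64))

    # The value "between the rows" of the stored relation:
    print(reconstruct(samples, 10.5))        # approximately sin(2*pi*0.1*10.5)
    print(np.sin(2 * np.pi * 0.1 * 10.5))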

The picture of a sampled function as a mere shadow of the real thing is a bit difficult to grasp for most, so a couple of examples are in order. The first one is interpolation. Sampled signals are commonly reconstructed by interpolating between the samples, so that the resulting continuous signal takes the sampled values at the sampling instants and varies smoothly in between. Many signal processing algorithms actually materialize these in-between points in software by applying interpolation algorithms to the surrounding samples. Thus, given a sampled song represented in the database as a function from sampling instant to sample value, it is perfectly reasonable to query for all the points where the signal takes a certain value, and to expect the database to return sampling instants that lie between the ones actually stored. We should then expect to be able to update the values at the resulting set of points, with the result that dozens of stored values near the indicated times will change, not just the value at the queried time or the sample closest to it.
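
A sketch of both halves of that example follows, under the simplifying assumptions that crossings are located by linear interpolation between the bracketing samples and that the update keeps the signal bandlimited, which is exactly what forces the correction to spread over many neighbouring samples. The helper names are hypothetical.

    import numpy as np

    def level_crossings(samples, level):
        """Query: instants (possibly between stored samples) where the
        reconstructed signal crosses `level`, located here by linear
        interpolation between the bracketing samples."""
        out = []
        for n in range(len(samples) - 1):
            a, b = samples[n] - level, samples[n + 1] - level
            if a == 0.0:
                out.append(float(n))
            elif a * b < 0:                           # sign change inside (n, n+1)
                out.append(float(n + a / (a - b)))    # fractional sampling instant
        return out

    def update_at(samples, t0, new_value, rate=1.0):
        """Update: make the *reconstructed* value at instant t0 equal to
        new_value (approximately, for a finite record). Staying bandlimited
        spreads the correction, a scaled sinc, over many neighbouring
        samples instead of writing a single row."""
        n = np.arange(len(samples))
        old = float(np.sum(samples * np.sinc(rate * t0 - n)))
        return samples + (new_value - old) * np.sinc(rate * t0 - n)

    samples = np.sin(2 * np.pi * 0.05 * np.arange(40))
    print(level_crossings(samples, 0.5))      # instants strictly between rows
    updated = update_at(samples, 12.3, 0.0)
    print(np.count_nonzero(~np.isclose(updated, samples)))   # many rows changed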

Of course this seems a bit far-fetched at first. But there is nothing in the relational model that would stop something like this from being implemented, nor am I saying that relational databases must natively understand such complicated operations. What I’m saying instead is that databases ought to have hooks in place for user defined domain operations which can implement such unconventional query and update semantics.
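
What such hooks might minimally look like is sketched below as a plugin interface a database could call out to; the interface and its method names are illustrative assumptions of mine, not any existing system’s API.

    from typing import Any, Iterable, Protocol

    class ContinuumDomain(Protocol):
        """A hypothetical hook interface: what a database would need from a
        user supplied domain plugin to support continuum-valued columns."""

        def contains(self, representation: Any, point: Any) -> bool:
            """Decide membership of a point without enumerating the set."""
            ...

        def solve(self, representation: Any, predicate: Any) -> Iterable[Any]:
            """Query hook: yield points (possibly between stored samples)
            satisfying a predicate over the represented continuum."""
            ...

        def update(self, representation: Any, point: Any, value: Any) -> Any:
            """Update hook: return a new stored representation whose
            reconstruction takes `value` at `point`; may rewrite many
            stored elements, as in the sampled-signal example above."""
            ...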

The second example is even nastier, because it shows that multidimensionality and continuity compound each other. Let’s say we have a two dimensional image and we query for the part of the image where the brightness, according to a user defined formula, exceeds a certain threshold. The result is an irregular piece of the image whose edge will almost nowhere coincide with the sample points, and whose shape is difficult to express in closed form. What should the database do? I’m of the opinion that it should be able to do two things. First, it ought to be able to return an explicit answer accurate down to a level specified by the user, for example by rectangularly tiling the domain and then interpolating the range of the image over the tiles; one plausible mechanism would be to offer an error norm pseudocolumn and to return the results in descending order over it. And second, it ought to be able to return relations which act as verification oracles for the relation predicate (i.e. they can be queried for a match but their contents cannot be enumerated), so that the user can invoke the tiling and interpolation machinery herself, on an as-needed basis.
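
To show that the tiling-plus-error-norm idea is not magic, here is a crude sketch of the first mechanism: classify fixed-size tiles against the predicate and report, per tile, how undecided the predicate still is at this resolution, in descending order. All names, and the error measure itself, are stand-ins of my own.

    import numpy as np

    def threshold_tiles(image, threshold, tile=8):
        """Tile the image domain into `tile` x `tile` rectangles and classify
        each against the predicate brightness > threshold.

        Returns (y0, x0, y1, x1, inside_fraction, error) rows, where `error`
        is a crude stand-in for the error-norm pseudocolumn: zero when the
        whole tile lies on one side of the threshold, and otherwise the
        smaller of the above/below fractions, so the most ambiguous tiles
        sort first and can be refined by the caller."""
        rows = []
        h, w = image.shape
        for y0 in range(0, h, tile):
            for x0 in range(0, w, tile):
                y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
                block = image[y0:y1, x0:x1]
                frac = float(np.mean(block > threshold))
                error = 0.0 if frac in (0.0, 1.0) else min(frac, 1.0 - frac)
                rows.append((y0, x0, y1, x1, frac, error))
        return sorted(rows, key=lambda r: r[-1], reverse=True)

    # A synthetic brightness field with an irregular bright region.
    yy, xx = np.mgrid[0:64, 0:64]
    image = np.exp(-((xx - 30) ** 2 + (yy - 25) ** 2) / 300.0)

    for row in threshold_tiles(image, 0.5)[:5]:
        print(row)      # the fuzziest tiles, i.e. where the edge runs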

All this seems really complicated until we notice that it is precisely the kind of thing databases are already being called on to do with temporal relations. They already deal with tilings, because the underlying continuum cannot otherwise be represented in closed, discrete form, and this necessitates explicit support for domain operations where the domain is highly unconventional. Sound, images and other sampled functions differ somewhat from time, but the solutions are similar. The point is that once the underlying domain has continuum structure, the corresponding discrete relations are necessarily just representations, and processing them will immediately lead to multiresolution representations, to on the fly generation of new tuples based on topological and geometric constraints, and to the need for interfaces capable of dealing with standardized decompositions of continuum domains, like tilings.

Representations in terms of basis functions

Above we already saw how sampled functions can give databases headaches. A related but somewhat more general problem is representation in different bases, or even more generally via different decompositions of the underlying spaces. This is a rather common situation in scientific data processing, because continuous signals are often more compact, more orderly, more easily processed or more intuitively understood after a change of basis.

Once again this is an example of an operation that database designers are familiar with, but only in a limited form. First of all, database normalization aims at data encodings where the stored fields are independent. In the case of vector spaces, the mathematician’s and the database designer’s notions of independence coincide, so database engineers have long been on the lookout for the same minimal, sufficient sets of independent basis vectors as the math geeks. Second, the same goes for more general notions of orthogonality as well. For example, database people generally frown upon nonlinear representations and favor ones that lead to minimal update semantics under common database operations. If we need to encode a time interval, it can be encoded either as an ordered pair of start and end times or as a start time and a length; these are simply two different parametrizations, or bases, for the same linear, two dimensional space of intervals, and both database and science folks generally choose from among such equivalent representations the one that best concentrates typical modifications into as few of the resulting fields as possible. So bases are already being chosen, and they’re chosen with almost identical entropy concentration criteria in mind.
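
The interval example really is just linear algebra, as the following sketch shows: the two encodings are related by an invertible matrix, and which one counts as "normalized" depends on which updates you want to touch the fewest fields.

    import numpy as np

    # Two parametrizations of the same two-dimensional space of intervals:
    # (start, end) and (start, length). The change of basis is linear, so it
    # is just a matrix; picking between them is a normalization choice about
    # which coordinates typical updates should touch.
    TO_START_LENGTH = np.array([[1.0, 0.0],
                                [-1.0, 1.0]])       # (s, e) -> (s, e - s)
    TO_START_END = np.linalg.inv(TO_START_LENGTH)   # (s, l) -> (s, s + l)

    interval_se = np.array([10.0, 17.5])            # start, end
    interval_sl = TO_START_LENGTH @ interval_se     # start, length
    print(interval_sl)                              # [10.   7.5]
    print(TO_START_END @ interval_sl)               # back to [10.  17.5]

    # "Extend the interval by 2" touches one coordinate in either basis, but
    # "shift the interval by 2" touches one coordinate in (start, length) and
    # two in (start, end): the basis decides where the update entropy lands.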

The trouble is that the database knows nothing about this sort of thing. All the logic having to do with, say, intervals is implemented at the application level, and the result of the representational choices is then encoded in the static data dictionary with no indication of where they came from or what the precise semantics of the stored data are. With time periods this is not yet a huge problem, but once we get into more involved representations like spectral decompositions, noninterpolating splines or overcomplete wavelet frames, the lack of metadata capability and the impossibility of neatly integrating the numerical logic into the relational machinery become real issues. For example, one cannot declare two relations to be different representations of the same underlying quantity and expect the database to invoke user defined transform routines to derive one when the other is changed; one cannot fetch values in one representation from a relation stored in another, even if the user is willing to supply the mediating logic; and one cannot declare any of the nice properties of such relations (e.g. the linearity of representations over vector space bases, or the even more specialized shift invariance properties of the Fourier transform) and expect the database to take advantage of them (e.g. by efficiently updating derived relations through additively varying the contributions of just those basis vectors whose coefficients changed).

Some of these operations wouldn’t be too difficult to integrate into existing databases, and they could even improve their everyday operation. For instance, there would quite evidently need to be hooks for user defined, incremental materialized view maintenance, which currently don’t exist, and the very same mechanics would also enable efficient, incremental view maintenance in regular reporting databases. Looking at a hypothetical table containing an inverse Fourier transformed copy of a spectrally encoded vector field, an update could proceed by first batching the updates, then letting the database weed out the redundant ones as usual, then calling a user supplied routine which differences the old and new values and performs a sparse inverse transform, and finally bulk merging the resulting changes issued by the user routine into the view. This would be highly efficient, most of it is already done by the database engine, and the architecture could be applied immediately to regular view maintenance.
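
A sketch of that pipeline’s numerical core, assuming a one dimensional signal stored spectrally and a materialized inverse-DFT view; the routine exploits nothing but linearity, adding in the contribution of each changed coefficient. The function is illustrative, not a description of any existing engine.

    import numpy as np

    def incremental_view_update(view, old_coeffs, new_coeffs):
        """Maintain a materialized inverse-DFT view of a spectrally stored
        signal after a sparse coefficient update.

        Linearity is the declared property being exploited: the view changes
        by the inverse transform of the coefficient *difference*, and when
        only a few coefficients changed, that difference is sparse, so each
        changed coefficient contributes one complex exponential and nothing
        is recomputed from scratch."""
        n = len(view)
        t = np.arange(n)
        for k in np.flatnonzero(old_coeffs != new_coeffs):
            delta_k = new_coeffs[k] - old_coeffs[k]
            view = view + delta_k * np.exp(2j * np.pi * k * t / n) / n
        return view

    rng = np.random.default_rng(0)
    coeffs = rng.normal(size=64) + 1j * rng.normal(size=64)
    view = np.fft.ifft(coeffs)                 # the fully materialized view

    new_coeffs = coeffs.copy()
    new_coeffs[3] += 2.0                       # the batched, de-duplicated update
    view = incremental_view_update(view, coeffs, new_coeffs)

    print(np.allclose(view, np.fft.ifft(new_coeffs)))   # True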

Of course not all operations are as easy as the above example. Making the database understand the invariances of various coordinate transformations, for instance, would probably be too difficult to support. But some of the useful features one can imagine are easy, and they would immensely help in getting relational technology accepted by the scientific community.