Array support in the relational model
If you ask a computational scientist why he doesn’t store his data in a relational database—and he almost certainly doesn’t—he’ll probably tell you that the system doesn’t properly support arrays. Pressed for details, he’s likely to cite high performance overhead, poor storage utilization, cumbersome interfaces to array data, and sparse support for the kinds of access patterns and primitives that are common in scientific data processing.
To me this situation seems curious, because array structures are highly regular. None of the problems with representational power that are normally used to justify object stores and ad hoc flat file formats arise here. In fact most arrays used in the scientific context are extremely regular, with a neat, rectangular domain and, as the range, some real vector space represented as a tuple of floating point numbers. That would seem to make them a straightforward application of relational algebra, and a highly restricted one at that. So where’s the problem?
I don’t think I’m too bad of a relational bigot, but it does appear to me that just about all of the reasons have to do with the limitations and emphases of commercial implementations of the relational model, and not so much with the conceptual model itself. Namely,
- Relational database management systems fail to take the physical‐logical separation underlying the relational model seriously enough, and to implement it as generally as they should. The physical organizations they support are married far too tightly to the typical transaction processing environment to allow for the storage organizations needed in other settings. The extremely tight storage organizations needed by the science folks for their sampled functions are just one example, and we already know of others. For instance, one of the main benefits of dedicated OLAP and multidimensional database engines is precisely a storage organization better suited to complicated, multidimensional queries which retrieve a large percentage of the database contents in one fell swoop.
  - Here the nastiest subproblem is probably that without sufficiently general transformation machinery at the physical level, physical design freedom is limited. Relational databases thus tend to rely on canonical encodings, and so prescribe how data ought to be stored instead of describing existing data as it is. This is a problem with scientific data because the datasets can be positively huge, so that conversion is simply not a viable option.
  - Often the problems caused by this inflexibility are circumvented using ad hoc extensions to the data model, like binary large objects. In the short term this seems sensible enough, but in the long run it actually exacerbates the problem, because the extensions violate the relational model. As a result they fail to support full relational processing, such as query optimization or physical reorganization without logical impact, and we easily end up seriously diminishing the power of the relational approach.
  - Current RDBMSs do tend to have some transformative powers and possibilities for physical level optimization, but even then they do not expose that functionality in their user interface. Even if one can partition, choose numeric representations, cluster, provide alternative access paths and prematerialize views within the database, no similar options or optimizations are available on database import/export, in the line protocols used to communicate with the database, or within the in‐memory environment the database offers to the programmer. The system does not generally allow one to treat the user interface as just another physical realization or serialization of the abstract data model, on a par with internal storage. This makes interfacing with the system unnecessarily difficult, because data has to be converted to the target format and back instead of being served ready‐made by the database, and needlessly inefficient, because all the storage and protocol optimizations that belong on the physical level either aren’t possible or have to be reimplemented from scratch. The resulting access barrier causes scientists to think of relational databases as cumbersome, inflexible, proprietary and uninteroperable, all of which are real showstoppers in the research environment.
  - More generally, relational databases fail to carry most of the functionality present at the logical level down to the physical one. For example, a two‐dimensional float array could be compactly stored as a description of two finite index sets, plus metadata to the effect that the relation is a total function from their Cartesian product to floats. This way only two numbers and the array of values would have to be stored. Carrying at least some parts of relational algebra down to the physical level like this would enable tilings, stripings, index permutations, partial indexes and the like to be described within the same framework as the rest of the data model, letting the database translate between the physical and logical models using raw relational algebra. In the process it would help the database efficiently negotiate most scientific data representations, and so significantly ease interoperability. This is not just a fleeting daydream, either: respectable computer scientists have already suggested similar architectures in the context of the nested relational model, and conversely there is at least one scientific data format library (CDF) in common use which actually implements data compaction by storing a succinct description of the functional dependencies of a relation and synthesizing the full relation from its third normal form. Still, currently nothing of the sort is possible on the commercial front, even with dedicated ETL tools.
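  A minimal sketch of the idea, in Python with hypothetical names (this is no real system’s API): the physical store holds only the two extents and a flat block of values, while the logical interface still yields the full (i, j, value) relation on demand.

  ```python
  import numpy as np

  class DenseFunctionRelation:
      """A total function from {0..m-1} x {0..n-1} to float, stored
      physically as just the two extents and a flat value block."""

      def __init__(self, m, n, values):
          assert len(values) == m * n, "relation must be a total function"
          self.m, self.n = m, n                 # the 'two numbers'
          self.values = np.asarray(values, dtype=float).reshape(m, n)

      def tuples(self):
          """Synthesize the full logical relation (i, j, value) on demand."""
          for i in range(self.m):
              for j in range(self.n):
                  yield (i, j, float(self.values[i, j]))

      def lookup(self, i, j):
          """Point query: the dependency (i, j) -> value makes this O(1)."""
          return float(self.values[i, j])

  # The logical view is a plain ternary relation; the physical cost is
  # two integers plus m*n floats, with no per-tuple key storage.
  rel = DenseFunctionRelation(2, 3, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
  print(list(rel.tuples())[:3])   # [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0)]
  ```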
- Relational databases are hindered by query languages which are lacking in power, regularity and simplicity. Generally speaking, the interface offered by a typical RDBMS comes in the form of a human readable query language with lots of syntactic sugar but limited productivity, lacking the full ease and power of relational algebra. Such query languages were originally designed for an environment where users interactively query the database, which maintains the data comprehensively and takes care of all the computations related to it. They are not very suitable for an environment where processing efficiency, low latency, easy interoperability with software far stupider than human users, and the possibility of tight coupling between the modules of a decomposed software environment—of which the database is just another module—consistently override user‐friendliness concerns. The problem is most visible in tasks which people would rarely perform through the query language but which machines are very good at: interleaved, pipelined execution; input/output of raw, unformatted data; management of huge amounts of temporary state (which on the interface side manifests itself e.g. as the instantiation of sizable, first class, temporary relations as literal data); and speedy, wholesale updates (e.g. reinstalling an entire relation; no database that I know of has facilities for declaring or taking full advantage of this sort of operation). As a result, much of the raw power of a well built RDBMS becomes unavailable on the other side of the user interface, and the scientific users in need of such power end up building their own data format libraries to gain lower level interfaces to their data.
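  To make the wholesale‐update point concrete: today one has to emulate “reinstall this entire relation” with a delete plus a bulk insert inside one transaction. A sketch using Python’s sqlite3 module (table and data are made up):

  ```python
  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE samples (i INTEGER, j INTEGER, v REAL)")

  def reinstall(con, rows):
      """Emulate a wholesale 'replace this entire relation' operation.
      A database that understood the operation as such could skip
      per-row logging and incremental constraint checks; here we can
      only batch the work into a single transaction."""
      with con:                                   # one transaction
          con.execute("DELETE FROM samples")
          con.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)

  reinstall(con, [(0, 0, 1.5), (0, 1, 2.5)])
  print(con.execute("SELECT COUNT(*) FROM samples").fetchone())  # (2,)
  ```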
- Databases fail to cleanly separate domain operations from relational algebra, and as a result also fail to implement domain operations as a fully productive part of their type system. SQL’s GROUP BY construct, for instance, is essentially a limited implementation of partitioning a set by an equivalence relation, combined with a small number of aggregative set functions built from associative, commutative domain operations. There is ample room for generalization here, but unfortunately it is hidden under the special purpose GROUP BY extension, which also takes away most of the financial incentive to implement a more general mechanism. Such general set operations are relatively common in scientific data processing, as are more complex ones, like those involving distance metrics in high dimensional spaces. The trouble is, current relational databases aren’t regular or extensible enough to accommodate such domain operations in a manner which would enable the database core to take advantage of them in the broader relational setting. If we for instance want to partition by a user defined equivalence predicate, we’re often fresh out of luck on the optimization front; a sketch of the generalization follows this item.
  - As an offshoot of this, relational databases tend to have trouble with unconventional domain operations and properties. Perhaps the most salient example comes from the newest development in scientific data modeling, the fibre bundle abstraction. Essentially the whole point of that exercise is to derive a neat, general representation of the preexisting topology of the base space (or domain) of the fibre bundle section (function, respectively). But current relational databases understand nothing about topology, even though their spatial extensions sometimes implement structures which rely heavily on precisely the mathematical structure being described: spatial indexes like R‐trees are needed exactly because topology and order theory work differently in dimensions exceeding one. As it stands, scientists who would like to calculate explicitly with this data while also having it understood by the database engine aren’t well served.
  - It is also interesting to notice that relational databases do not implement abstract relations/predicates with purely computational or logical definitions. Even views always require some materialized base data. Logic databases excel at this sort of thing, but then they usually can’t handle simple relational operations efficiently. This is strange, given that run‐of‐the‐mill relational databases do support some limited forms of domain operations which can be thought of as abstract relations between domain elements, like equivalence and comparison, and do so internally in a manner which is often quite extensible. Furthermore, the problem isn’t limited to scientific data. For instance, many of the problems encountered with classification hierarchies in the OLAP/reporting context stem from the fact that such domains possess properties (e.g. a lattice structure) and are typically handled in ways (e.g. join/meet processing) which the database doesn’t natively understand but could easily leverage once they were implemented and exported to it. Even the interface to such computations needn’t be a problem, because these properties are still well modelled by relations (here, subset inclusion and union) between the elements of the domain; but when the right abstractions and interfaces aren’t present, the database doesn’t grasp the structure, because it expects every relation to be materialized and thus fully accessible. To continue the example, classification hierarchies and type lattices are often so big that we can’t efficiently materialize and maintain them in extensive form. This leads to sparser storage structures which can be queried and computed with using the right code, but not efficiently enumerated. The database system cannot utilize such structures or their supporting code, because the relevant domain operations and relations cannot be declared in a manner it understands. All this is bad from the scientific data management viewpoint, because unconventional domains are quite common in this context, and the lack of the right, general, abstract interfaces makes it extremely difficult and time consuming to circumvent the (understandable) limitations of the software using custom made abstract data types.
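  To return to the GROUP BY point made above: a sketch of grouping by a user defined equivalence, here expressed through a canonicalizing function (all names hypothetical). This partition‐then‐fold is exactly what a database could optimize if the equivalence could be declared to it.

  ```python
  from collections import defaultdict

  def group_by_equivalence(rows, canon, aggregate):
      """Partition rows into equivalence classes (two rows are equivalent
      iff canon maps them to the same representative), then fold each
      class with an associative, commutative aggregate."""
      classes = defaultdict(list)
      for row in rows:
          classes[canon(row)].append(row)
      return {rep: aggregate(members) for rep, members in classes.items()}

  # Hypothetical example: readings are equivalent when taken within the
  # same 0.5-second window; to a database this is an opaque predicate.
  readings = [(0.1, 3.0), (0.3, 5.0), (0.7, 4.0)]
  by_window = group_by_equivalence(
      readings,
      canon=lambda r: int(r[0] / 0.5),            # equivalence class id
      aggregate=lambda ms: sum(v for _, v in ms)  # SUM over each class
  )
  print(by_window)  # {0: 8.0, 1: 4.0}
  ```

  A general pairwise predicate would need a union‐find pass instead of a canonicalizing function; choosing between the two strategies is precisely the kind of decision an optimizer could make, if only it were told the predicate is an equivalence.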
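  And for the point about purely computational relations: a toy domain whose lattice structure lives entirely in code, divisibility on the positive integers with gcd and lcm as meet and join. The relation can be queried point‐wise but never enumerated, which is exactly the shape current systems cannot accommodate.

  ```python
  from math import gcd

  # The abstract relation 'a divides b', defined by computation alone;
  # its extension over all positive integers is infinite, so it can be
  # queried but never materialized or enumerated.
  def divides(a, b):
      return b % a == 0

  def meet(a, b):   # greatest common divisor = lattice meet
      return gcd(a, b)

  def join(a, b):   # least common multiple = lattice join
      return a * b // gcd(a, b)

  print(divides(4, 12), meet(12, 18), join(4, 6))  # True 6 12
  ```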
- Relational databases favor keyed, record oriented updates, sometimes even limiting the full power of relational algebra to queries only. This can make massive, complicated updates prohibitively expensive, even though a set and relation based update could in theory be optimized to a far higher degree than the explicit looping constructs often found in scientific code. Since scientific data processing tends to be highly generative, transformative and in general cumulative, slow insertions and selective copy operations on massive datasets are big minuses.
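  The contrast is easy to demonstrate with Python’s sqlite3 module (schema made up): one declarative, set based update against the explicit per‐row loop typical of scientific code.

  ```python
  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE field (i INTEGER PRIMARY KEY, v REAL)")
  con.executemany("INSERT INTO field VALUES (?, ?)",
                  [(i, float(i)) for i in range(1000)])

  # Set-based: one statement over the whole relation; the engine is
  # free to scan, batch and log however it likes.
  with con:
      con.execute("UPDATE field SET v = v * 2.0 WHERE v > 100.0")

  # Record-oriented: the explicit loop fixes the access pattern and
  # pays per-statement overhead on every keyed update.
  with con:
      for (i,) in con.execute("SELECT i FROM field WHERE v <= 100.0").fetchall():
          con.execute("UPDATE field SET v = v + 1.0 WHERE i = ?", (i,))

  print(con.execute("SELECT MIN(v), MAX(v) FROM field").fetchone())  # (1.0, 1998.0)
  ```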
- The primary market of RDBMSs, the processing of large volumes of small, independent transactions, together with the shadow of older, record structured storage, have biased the industry towards tuple oriented processing and against holistic modes of analysis based on entire relations. Scientific data, on the other hand, generally consists of various sampled representations of continuous functions, which the sampling theorem guarantees will not possess any perfectly localised, discrete interpretation; such data always necessitates at least some nonlocality in processing (see the first sketch following this item). Only OLAP and statistical products are slowly beginning to enable mass updates, correlated processing across multiple table rows, iteration over orderings, explicit support for higher dimensions, and so on. But this approach is still nowhere near as convenient as using a scientific data format in conjunction with one of the proven computational libraries.
  - Perhaps even more fundamentally, and in connection with the broad inability of relational products to understand topology, commercial relational databases completely fail to understand the approximative and representative nature of scientific data. Unlike the data we normally deal with in computer science, scientific data tends to represent continuous functions which, taken as relations, are not even denumerably infinite. Such objects cannot be operated on directly, but must first be represented in discrete form by numerical sampling or by means of symbolic algebra, and only then operated on via the representation. Relational databases generally fail to support the notion of representation: they have no hooks in place which would enable them to support queries or updates to parts of the underlying object which aren’t directly visible in the database, nor any means of declaring the many useful mathematical properties that the underlying object and/or the representation have and/or share. Furthermore, scientific data tends to represent noisy measurements of real, perfectly accurate quantities, so that making sense of it often requires representations of noise, fault and uncertainty. Relational databases rarely support that sort of thing directly. The same goes for multiresolution and statistical summary data, which possess topological or algebraic structure that is vital to their use, yet cannot be expressed in relational databases, or at least not in any standardized schematic form. Particularly in the latter case it’s curious that much of the needed code in fact already exists in the database, in the form of selectivity estimation histograms, wavelet hierarchies used in approximate query answering, and neatly packaged data mining code used by executives. Yet the code isn’t always exposed to the user, or at least not in a form that would make it generally applicable.
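  To illustrate the nonlocality claim from the parent item: even the simplest operation on a sampled signal, a moving average, needs a neighborhood of samples rather than one tuple at a time. One whole‐relation operation in an array library:

  ```python
  import numpy as np

  # A sampled signal: no single tuple (t, v) is meaningful in
  # isolation; even trivial processing needs a window of neighbors.
  signal = np.sin(np.linspace(0.0, 6.28, 100))

  # Three-point moving average: one operation over the whole relation.
  kernel = np.ones(3) / 3.0
  smoothed = np.convolve(signal, kernel, mode="valid")
  print(len(signal), len(smoothed))  # 100 98
  ```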
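  And a sketch of the missing notion of representation (class and names hypothetical): the stored tuples are samples, but queries are answered about the underlying continuous function, here via linear interpolation.

  ```python
  import numpy as np

  class SampledFunction:
      """Discrete representation of a continuous function f: [a, b] -> R.
      The stored relation holds only the samples; point queries against
      the underlying object are answered through the representation."""

      def __init__(self, xs, ys):
          self.xs = np.asarray(xs, dtype=float)
          self.ys = np.asarray(ys, dtype=float)

      def __call__(self, x):
          # Linear interpolation: answers a query about a point that is
          # not 'directly visible in the database'.
          return float(np.interp(x, self.xs, self.ys))

  f = SampledFunction([0.0, 1.0, 2.0], [0.0, 10.0, 0.0])
  print(f(0.5), f(1.5))  # 5.0 5.0 -- neither point is a stored tuple
  ```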
- Practical relational database environments heavily separate schema and data operations, and are geared towards treating the schema as a rigid background structure completely separate from the production data. Consequently any aspect of the live data best modelled in the schema, like the cardinality and structural constraints attached to vector field index sets, is difficult to describe under such systems. A typical example of the kind of problem this causes is the difficulty of processing simultaneously across both rows and columns, or the practical impossibility of flattening structures which would necessitate such processing into normal relations in an easy, efficient and fully supported manner. In the science context more flexible, more powerful and lighter weight data dictionary operations would significantly ease the storage of arrays both within existing relations (e.g. they could help enforce the common multidimensional cardinality constraints when multiple arrays of different sizes are serialized into a single relation) and as first‐class relations (e.g. they could make the declaration and management of tens of thousands of relations within a single database somewhat easier and more organized than it is now).
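  As a sketch of the first kind of constraint (schema made up): when an m‐by‐n array is serialized into an (array_id, i, j, v) relation, the schema‐level fact to enforce is that each array’s stored index set is exactly a full Cartesian product.

  ```python
  from itertools import product

  def check_total_on_box(rows, extents):
      """Verify that for each array id the stored (i, j) index pairs are
      exactly the full Cartesian product given by its declared extents --
      the cardinality/structure constraint a schema could enforce."""
      seen = {}
      for array_id, i, j, _v in rows:
          seen.setdefault(array_id, set()).add((i, j))
      for array_id, (m, n) in extents.items():
          if seen.get(array_id, set()) != set(product(range(m), range(n))):
              return False
      return True

  rows = [("a", 0, 0, 1.0), ("a", 0, 1, 2.0),
          ("a", 1, 0, 3.0), ("a", 1, 1, 4.0)]
  print(check_total_on_box(rows, {"a": (2, 2)}))  # True
  ```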
- Typical RDBMSs faithfully implement the relational model, but they also inherit its inherent generality and open‐endedness. As such, they fail to assign unambiguous semantics to the data. Their metadata capabilities are slim and poorly standardized even in areas where full representational consensus exists, like the treatment of physical units and geographical registration. This despite the fact that comprehensive abstract data dictionary overlays (e.g. RM/T or RM/V2) have been suggested by relational pioneers like Codd from early on. Today certain initiatives (like OpenDB) are underway which aim at remedying the situation a bit, but the fact remains that given an instance of a relational database, it is exceedingly difficult to figure out what the stored data means. In the scientific context this is strictly unacceptable, because research thrives on data sharing, cross validation, retrospective reanalysis, guaranteed provenance of data once gathered, and datasets easily found via comprehensive, publicized indexes.
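  A toy version of such an overlay, far short of RM/T’s machinery (everything here hypothetical): metadata stored as an ordinary relation beside the data it describes, so that at least physical units travel with the columns.

  ```python
  # A minimal, hypothetical data-dictionary overlay: metadata kept as
  # an ordinary relation alongside the data it describes.
  column_units = {
      ("pressure_field", "v"): "Pa",
      ("pressure_field", "x"): "m",
      ("pressure_field", "t"): "s",
  }

  def unit_of(relation, column):
      """Look up the declared physical unit of a column, if any."""
      return column_units.get((relation, column), "unknown")

  print(unit_of("pressure_field", "v"))  # Pa
  ```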
- Much of the concurrency and recovery management code present in commercial database products is overkill for the typical transformative, append only scientific workflow. This severely handicaps commercial databases compared to dedicated scientific data management solutions, so it should be possible to turn the overhead off selectively. Implementing such a feature shouldn’t be too difficult or expensive, either, because recovery mechanisms can generally be shut off in a simple manner (e.g. shadow pages aren’t allocated but overwritten in the buffer; the write‐ahead log is not written and preimages are not copied), locks simply do not have to be acquired or checked, restricted mode processing and the attendant optimizations already exist in most relational products, and so on.
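  Some engines already gesture in this direction. SQLite, for instance, lets one electively switch off journaling and synchronous writes, which fits the rerunnable, recompute‐on‐failure scientific load (database file name made up):

  ```python
  import sqlite3

  con = sqlite3.connect("bulk_load.db")
  # Elective removal of recovery overhead for an append-only, rerunnable
  # load: no rollback journal, no fsync barriers. A crash loses the run,
  # which a transformative scientific workflow can simply repeat.
  con.execute("PRAGMA journal_mode = OFF")
  con.execute("PRAGMA synchronous = OFF")
  ```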
If even some of these problems were rectified, I think relational
databases would quickly become the platform of choice for scientific data
management.