Array support in the relational model
If you ask a computational scientist why he doesn’t store his data in a relational database—and he almost certainly doesn’t—he’ll probably tell you that the system doesn’t properly support arrays. Pressed for details, he’s likely to cite high performance overhead, poor storage utilization, cumbersome interfaces to array data, and sparse support for the kinds of access patterns and primitives that are common in scientific data processing.
To me this situation seems curious, because array structures are highly regular. None of the problems with representational power that are normally used to justify object stores and ad hoc flat file formats arise here. In fact most arrays used in the scientific context are extremely regular, with a neat, rectangular domain and, as the range, some real vector space represented as a tuple of floating point numbers. That would seem to make them a straightforward application of relational algebra, and a highly restricted one at that. So where’s the problem?
I don’t think I’m too bad of a relational bigot, but it does appear to me that just about all of the reasons have to do with the limitations and emphases of commercial implementations of the relational model, and not so much with the conceptual model itself. Namely,
- Relational database management systems fail to take the physical‐logical separation underlying the relational model seriously enough, and to implement it as generally as they should. The physical organizations they support are married far too tightly to the typical transaction processing environment to allow for the storage organizations needed in other settings. The extremely tight storage organizations needed by the science folks for their sampled functions are just one example, and we already know of others. For instance, one of the main benefits of dedicated OLAP and multidimensional database engines is precisely a storage organization better suited to complicated, multidimensional queries which retrieve a large percentage of the database contents in one fell swoop.
  - Here the nastiest subproblem is probably that without sufficiently general transformation machinery at the physical level, physical design freedom is limited. Relational databases thus tend to rely on canonical encodings, and so prescribe how data ought to be stored instead of describing existing data as it is. This is a problem with scientific data because the datasets can be positively huge, so that conversion is simply not a viable option.
  - Often the problems caused by this inflexibility are circumvented using ad hoc extensions to the data model, like binary large objects. In the short term this seems sensible enough, but in the long run it actually exacerbates the problem, because the extensions violate the relational model. As a result they fail to support full relational processing, such as query optimization or physical reorganization without logical impact, and we easily end up seriously diminishing the power of the relational approach.
  - Current RDBMSs do tend to have some transformative powers and possibilities for physical level optimization, but even then they do not expose that functionality in their user interface. Even if one can partition, choose numeric representations, cluster, provide alternative access paths and prematerialize views within the database, no similar options or optimizations are available on database import/export, in the line protocols used to communicate with the database, or within the in‐memory environment the database offers to the programmer. The system does not generally allow one to treat the user interface as just another physical realization or serialization of the abstract data model, on a par with internal storage. This makes interfacing with the system unnecessarily difficult, because data has to be converted to the target format and back instead of being served ready‐made by the database, and needlessly inefficient, because all the storage and protocol optimizations that belong on the physical level either aren’t possible or have to be reimplemented from scratch. The resulting access barrier causes scientists to think of relational databases as cumbersome, inflexible, proprietary and uninteroperable, all of which are real showstoppers in the research environment.
  - More generally, relational databases fail to carry most of the functionality present at the logical level down to the physical one. For example, a two‐dimensional float array could be compactly stored as a description of two finite index sets, plus metadata to the effect that the relation is a total function from their Cartesian product to floats. This way only two numbers and the array of values would have to be stored. Carrying at least some parts of relational algebra down to the physical level like this would enable tilings, stripings, index permutations, partial indexes and the like to be described within the same framework as the rest of the data model, letting the database translate between the physical and logical models using raw relational algebra. In the process it would help the database efficiently negotiate most scientific data representations, and so significantly ease interoperability. This is not just a fleeting daydream, either: respectable computer scientists have already suggested similar architectures in the context of the nested relational model, and conversely there is at least one scientific data format library (CDF) in common use which actually implements data compaction by storing a succinct description of the functional dependencies of a relation and synthesizing the full relation from its third normal form. Still, currently nothing of the sort is possible on the commercial front, even with dedicated ETL tools.
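  A minimal sketch of the idea, in Python with hypothetical names (this is no real system’s API): the physical store holds only the two extents and a flat block of values, while the logical interface still yields the full (i, j, value) relation on demand.

  ```python
  import numpy as np

  class DenseFunctionRelation:
      """A total function from {0..m-1} x {0..n-1} to float, stored
      physically as just the two extents and a flat value block."""

      def __init__(self, m, n, values):
          assert len(values) == m * n, "relation must be a total function"
          self.m, self.n = m, n                 # the 'two numbers'
          self.values = np.asarray(values, dtype=float).reshape(m, n)

      def tuples(self):
          """Synthesize the full logical relation (i, j, value) on demand."""
          for i in range(self.m):
              for j in range(self.n):
                  yield (i, j, float(self.values[i, j]))

      def lookup(self, i, j):
          """Point query: the dependency (i, j) -> value makes this O(1)."""
          return float(self.values[i, j])

  # The logical view is a plain ternary relation; the physical cost is
  # two integers plus m*n floats, with no per-tuple key storage.
  rel = DenseFunctionRelation(2, 3, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
  print(list(rel.tuples())[:3])   # [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0)]
  ```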
- Relational databases are hindered by query languages which are lacking in power, regularity and simplicity. Generally speaking, the interface offered by a typical RDBMS comes in the form of a human readable query language with lots of syntactic sugar but limited productivity, lacking the full ease and power of relational algebra. Such query languages were originally designed for an environment where users interactively query the database, which maintains the data comprehensively and takes care of all the computations related to it. They are not very suitable for an environment where processing efficiency, low latency, easy interoperability with software far stupider than human users, and the possibility of tight coupling between the modules of a decomposed software environment—of which the database is just another module—consistently override user‐friendliness concerns. The problem is most visible in tasks which people would rarely perform through the query language but which machines are very good at: interleaved, pipelined execution; input/output of raw, unformatted data; management of huge amounts of temporary state (which on the interface side manifests itself e.g. as the instantiation of sizable, first class, temporary relations as literal data); and speedy, wholesale updates (e.g. reinstalling an entire relation; no database that I know of has facilities for declaring or taking full advantage of this sort of operation). As a result, much of the raw power of a well built RDBMS becomes unavailable on the other side of the user interface, and the scientific users in need of such power end up building their own data format libraries to gain lower level interfaces to their data.
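  To make the wholesale‐update point concrete: today one has to emulate “reinstall this entire relation” with a delete plus a bulk insert inside one transaction. A sketch using Python’s sqlite3 module (table and data are made up):

  ```python
  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE samples (i INTEGER, j INTEGER, v REAL)")

  def reinstall(con, rows):
      """Emulate a wholesale 'replace this entire relation' operation.
      A database that understood the operation as such could skip
      per-row logging and incremental constraint checks; here we can
      only batch the work into a single transaction."""
      with con:                                   # one transaction
          con.execute("DELETE FROM samples")
          con.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)

  reinstall(con, [(0, 0, 1.5), (0, 1, 2.5)])
  print(con.execute("SELECT COUNT(*) FROM samples").fetchone())  # (2,)
  ```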
- Databases fail to cleanly separate domain operations from relational algebra, and as a result also fail to implement domain operations as a fully productive part of their type system. SQL’s GROUP BY construct, for instance, is essentially a limited implementation of partitioning a set by an equivalence relation, combined with a small number of aggregative set functions built from associative, commutative domain operations. There is ample room for generalization here, but unfortunately it is hidden under the special purpose GROUP BY extension, which also takes away most of the financial incentive to implement a more general mechanism. Such general set operations are relatively common in scientific data processing, as are more complex ones, like those involving distance metrics in high dimensional spaces. The trouble is, current relational databases aren’t regular or extensible enough to accommodate such domain operations in a manner which would enable the database core to take advantage of them in the broader relational setting. If we for instance want to partition by a user defined equivalence predicate, we’re often fresh out of luck on the optimization front; a sketch of the generalization follows this item.
  - As an offshoot of this, relational databases tend to have trouble with unconventional domain operations and properties. Perhaps the most salient example comes from the newest development in scientific data modeling, the fibre bundle abstraction. Essentially the whole point of that exercise is to derive a neat, general representation of the preexisting topology of the base space (or domain) of the fibre bundle section (function, respectively). But current relational databases understand nothing about topology, even though their spatial extensions sometimes implement structures which rely heavily on precisely the mathematical structure being described: spatial indexes like R‐trees are needed exactly because topology and order theory work differently in dimensions exceeding one. As it stands, scientists who would like to calculate explicitly with this data while also having it understood by the database engine aren’t well served.
  - It is also interesting to notice that relational databases do not implement abstract relations/predicates with purely computational or logical definitions. Even views always require some materialized base data. Logic databases excel at this sort of thing, but then they usually can’t handle simple relational operations efficiently. This is strange, given that run‐of‐the‐mill relational databases do support some limited forms of domain operations which can be thought of as abstract relations between domain elements, like equivalence and comparison, and do so internally in a manner which is often quite extensible. Furthermore, the problem isn’t limited to scientific data. For instance, many of the problems encountered with classification hierarchies in the OLAP/reporting context stem from the fact that such domains possess properties (e.g. a lattice structure) and are typically handled in ways (e.g. join/meet processing) which the database doesn’t natively understand but could easily leverage once they were implemented and exported to it. Even the interface to such computations needn’t be a problem, because these properties are still well modelled by relations (here, subset inclusion and union) between the elements of the domain; but when the right abstractions and interfaces aren’t present, the database doesn’t grasp the structure, because it expects every relation to be materialized and thus fully accessible. To continue the example, classification hierarchies and type lattices are often so big that we can’t efficiently materialize and maintain them in extensive form. This leads to sparser storage structures which can be queried and computed with using the right code, but not efficiently enumerated. The database system cannot utilize such structures or their supporting code, because the relevant domain operations and relations cannot be declared in a manner it understands. All this is bad from the scientific data management viewpoint, because unconventional domains are quite common in this context, and the lack of the right, general, abstract interfaces makes it extremely difficult and time consuming to circumvent the (understandable) limitations of the software using custom made abstract data types.
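  To return to the GROUP BY point made above: a sketch of grouping by a user defined equivalence, here expressed through a canonicalizing function (all names hypothetical). This partition‐then‐fold is exactly what a database could optimize if the equivalence could be declared to it.

  ```python
  from collections import defaultdict

  def group_by_equivalence(rows, canon, aggregate):
      """Partition rows into equivalence classes (two rows are equivalent
      iff canon maps them to the same representative), then fold each
      class with an associative, commutative aggregate."""
      classes = defaultdict(list)
      for row in rows:
          classes[canon(row)].append(row)
      return {rep: aggregate(members) for rep, members in classes.items()}

  # Hypothetical example: readings are equivalent when taken within the
  # same 0.5-second window; to a database this is an opaque predicate.
  readings = [(0.1, 3.0), (0.3, 5.0), (0.7, 4.0)]
  by_window = group_by_equivalence(
      readings,
      canon=lambda r: int(r[0] / 0.5),            # equivalence class id
      aggregate=lambda ms: sum(v for _, v in ms)  # SUM over each class
  )
  print(by_window)  # {0: 8.0, 1: 4.0}
  ```

  A general pairwise predicate would need a union‐find pass instead of a canonicalizing function; choosing between the two strategies is precisely the kind of decision an optimizer could make, if only it were told the predicate is an equivalence.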
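  And for the point about purely computational relations: a toy domain whose lattice structure lives entirely in code, divisibility on the positive integers with gcd and lcm as meet and join. The relation can be queried point‐wise but never enumerated, which is exactly the shape current systems cannot accommodate.

  ```python
  from math import gcd

  # The abstract relation 'a divides b', defined by computation alone;
  # its extension over all positive integers is infinite, so it can be
  # queried but never materialized or enumerated.
  def divides(a, b):
      return b % a == 0

  def meet(a, b):   # greatest common divisor = lattice meet
      return gcd(a, b)

  def join(a, b):   # least common multiple = lattice join
      return a * b // gcd(a, b)

  print(divides(4, 12), meet(12, 18), join(4, 6))  # True 6 12
  ```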
- Relational databases favor keyed, record oriented updates, sometimes even limiting the full power of relational algebra to queries only. This can make massive, complicated updates prohibitively expensive, even though a set and relation based update could in theory be optimized to a far higher degree than the explicit looping constructs often found in scientific code. Since scientific data processing tends to be highly generative, transformative and in general cumulative, slow insertions and selective copy operations on massive datasets are big minuses.
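  The contrast is easy to demonstrate with Python’s sqlite3 module (schema made up): one declarative, set based update against the explicit per‐row loop typical of scientific code.

  ```python
  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE field (i INTEGER PRIMARY KEY, v REAL)")
  con.executemany("INSERT INTO field VALUES (?, ?)",
                  [(i, float(i)) for i in range(1000)])

  # Set-based: one statement over the whole relation; the engine is
  # free to scan, batch and log however it likes.
  with con:
      con.execute("UPDATE field SET v = v * 2.0 WHERE v > 100.0")

  # Record-oriented: the explicit loop fixes the access pattern and
  # pays per-statement overhead on every keyed update.
  with con:
      for (i,) in con.execute("SELECT i FROM field WHERE v <= 100.0").fetchall():
          con.execute("UPDATE field SET v = v + 1.0 WHERE i = ?", (i,))

  print(con.execute("SELECT MIN(v), MAX(v) FROM field").fetchone())  # (1.0, 1998.0)
  ```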
- The primary market of RDBMSs, the processing of large volumes of small, independent transactions, together with the shadow of older, record structured storage, have biased the industry towards tuple oriented processing and against holistic modes of analysis based on entire relations. Scientific data, on the other hand, generally consists of various sampled representations of continuous functions, which the sampling theorem guarantees will not possess any perfectly localised, discrete interpretation; such data always necessitates at least some nonlocality in processing (see the first sketch following this item). Only OLAP and statistical products are slowly beginning to enable mass updates, correlated processing across multiple table rows, iteration over orderings, explicit support for higher dimensions, and so on. But this approach is still nowhere near as convenient as using a scientific data format in conjunction with one of the proven computational libraries.
  - Perhaps even more fundamentally, and in connection with the broad inability of relational products to understand topology, commercial relational databases completely fail to understand the approximative and representative nature of scientific data. Unlike the data we normally deal with in computer science, scientific data tends to represent continuous functions which, taken as relations, are not even denumerably infinite. Such objects cannot be operated on directly, but must first be represented in discrete form by numerical sampling or by means of symbolic algebra, and only then operated on via the representation. Relational databases generally fail to support the notion of representation: they have no hooks in place which would enable them to support queries or updates to parts of the underlying object which aren’t directly visible in the database, nor any means of declaring the many useful mathematical properties that the underlying object and/or the representation have and/or share. Furthermore, scientific data tends to represent noisy measurements of real, perfectly accurate quantities, so that making sense of it often requires representations of noise, fault and uncertainty. Relational databases rarely support that sort of thing directly. The same goes for multiresolution and statistical summary data, which possess topological or algebraic structure that is vital to their use, yet cannot be expressed in relational databases, or at least not in any standardized schematic form. Particularly in the latter case it’s curious that much of the needed code in fact already exists in the database, in the form of selectivity estimation histograms, wavelet hierarchies used in approximate query answering, and neatly packaged data mining code used by executives. Yet the code isn’t always exposed to the user, or at least not in a form that would make it generally applicable.
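  To illustrate the nonlocality claim from the parent item: even the simplest operation on a sampled signal, a moving average, needs a neighborhood of samples rather than one tuple at a time. One whole‐relation operation in an array library:

  ```python
  import numpy as np

  # A sampled signal: no single tuple (t, v) is meaningful in
  # isolation; even trivial processing needs a window of neighbors.
  signal = np.sin(np.linspace(0.0, 6.28, 100))

  # Three-point moving average: one operation over the whole relation.
  kernel = np.ones(3) / 3.0
  smoothed = np.convolve(signal, kernel, mode="valid")
  print(len(signal), len(smoothed))  # 100 98
  ```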
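  And a sketch of the missing notion of representation (class and names hypothetical): the stored tuples are samples, but queries are answered about the underlying continuous function, here via linear interpolation.

  ```python
  import numpy as np

  class SampledFunction:
      """Discrete representation of a continuous function f: [a, b] -> R.
      The stored relation holds only the samples; point queries against
      the underlying object are answered through the representation."""

      def __init__(self, xs, ys):
          self.xs = np.asarray(xs, dtype=float)
          self.ys = np.asarray(ys, dtype=float)

      def __call__(self, x):
          # Linear interpolation: answers a query about a point that is
          # not 'directly visible in the database'.
          return float(np.interp(x, self.xs, self.ys))

  f = SampledFunction([0.0, 1.0, 2.0], [0.0, 10.0, 0.0])
  print(f(0.5), f(1.5))  # 5.0 5.0 -- neither point is a stored tuple
  ```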
- Practical relational database environments heavily separate schema and data operations, and are geared towards treating the schema as a rigid background structure completely separate from the production data. Consequently any aspect of the live data best modelled in the schema, like the cardinality and structural constraints attached to vector field index sets, is difficult to describe under such systems. A typical example of the kind of problem this causes is the difficulty of processing simultaneously across both rows and columns, or the practical impossibility of flattening structures which would necessitate such processing into normal relations in an easy, efficient and fully supported manner. In the science context more flexible, more powerful and lighter weight data dictionary operations would significantly ease the storage of arrays both within existing relations (e.g. they could help enforce the common multidimensional cardinality constraints when multiple arrays of different sizes are serialized into a single relation) and as first‐class relations (e.g. they could make the declaration and management of tens of thousands of relations within a single database somewhat easier and more organized than it is now).
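  As a sketch of the first kind of constraint (schema made up): when an m‐by‐n array is serialized into an (array_id, i, j, v) relation, the schema‐level fact to enforce is that each array’s stored index set is exactly a full Cartesian product.

  ```python
  from itertools import product

  def check_total_on_box(rows, extents):
      """Verify that for each array id the stored (i, j) index pairs are
      exactly the full Cartesian product given by its declared extents --
      the cardinality/structure constraint a schema could enforce."""
      seen = {}
      for array_id, i, j, _v in rows:
          seen.setdefault(array_id, set()).add((i, j))
      for array_id, (m, n) in extents.items():
          if seen.get(array_id, set()) != set(product(range(m), range(n))):
              return False
      return True

  rows = [("a", 0, 0, 1.0), ("a", 0, 1, 2.0),
          ("a", 1, 0, 3.0), ("a", 1, 1, 4.0)]
  print(check_total_on_box(rows, {"a": (2, 2)}))  # True
  ```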
- Typical RDBMSs faithfully implement the relational model, but they also inherit its inherent generality and open‐endedness. As such, they fail to assign unambiguous semantics to the data. Their metadata capabilities are slim and poorly standardized even in areas where full representational consensus exists, like the treatment of physical units and geographical registration. This despite the fact that comprehensive abstract data dictionary overlays (e.g. RM/T or RM/V2) have been suggested by relational pioneers like Codd from early on. Today certain initiatives (like OpenDB) are underway which aim at remedying the situation a bit, but the fact remains that given an instance of a relational database, it is exceedingly difficult to figure out what the stored data means. In the scientific context this is strictly unacceptable, because research thrives on data sharing, cross validation, retrospective reanalysis, guaranteed provenance of data once gathered, and datasets easily found via comprehensive, publicized indexes.
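  A toy version of such an overlay, far short of RM/T’s machinery (everything here hypothetical): metadata stored as an ordinary relation beside the data it describes, so that at least physical units travel with the columns.

  ```python
  # A minimal, hypothetical data-dictionary overlay: metadata kept as
  # an ordinary relation alongside the data it describes.
  column_units = {
      ("pressure_field", "v"): "Pa",
      ("pressure_field", "x"): "m",
      ("pressure_field", "t"): "s",
  }

  def unit_of(relation, column):
      """Look up the declared physical unit of a column, if any."""
      return column_units.get((relation, column), "unknown")

  print(unit_of("pressure_field", "v"))  # Pa
  ```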
- Much of the concurrency and recovery management code present in commercial database products is overkill for the typical transformative, append only scientific workflow. This severely handicaps commercial databases compared to dedicated scientific data management solutions, so it should be possible to turn the overhead off selectively. Implementing such a feature shouldn’t be too difficult or expensive, either, because recovery mechanisms can generally be shut off in a simple manner (e.g. shadow pages aren’t allocated but overwritten in the buffer; the write‐ahead log is not written and preimages are not copied), locks simply do not have to be acquired or checked, restricted mode processing and the attendant optimizations already exist in most relational products, and so on.
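  Some engines already gesture in this direction. SQLite, for instance, lets one electively switch off journaling and synchronous writes, which fits the rerunnable, recompute‐on‐failure scientific load (database file name made up):

  ```python
  import sqlite3

  con = sqlite3.connect("bulk_load.db")
  # Elective removal of recovery overhead for an append-only, rerunnable
  # load: no rollback journal, no fsync barriers. A crash loses the run,
  # which a transformative scientific workflow can simply repeat.
  con.execute("PRAGMA journal_mode = OFF")
  con.execute("PRAGMA synchronous = OFF")
  ```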
If even some of these problems were rectified, I think relational
databases would quickly become the platform of choice for scientific data
management.