Repetition, weak entities and identity

Almost any file format is likely to contain some repeating structures. Often such structures are defined as the content of a record type, or as the element of a list or an array. Sometimes repeating structures just appear without being separately defined or named.

Repeating structures per se rhyme well with the relational model, because they constitute neat templates for relations. When they appear as structures or records, the corresponding mapping to the relational model often takes the form of a relation whose tuples directly possess the described structure, modulo some details with keys, pointers, redundancy and storage optimization. When repeating structure is brought about by arrays, the resulting relation typically looks like a Cartesian product of the sets of indices and the array member structure. Thus far everything is nice and simple: a tuple is a record, is a struct, is an object, is a function from the finite set of field names to the field values, is an entity, is an element of the Cartesian product of the domains, is a substitution of variables which satisfies the respective n‐ary predicate between the domains.
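The array case can be made concrete with a small sketch. The record type Point, the function array_to_relation and the sample grid are all hypothetical names invented for illustration; the point is only that the array indices become ordinary attributes and jointly act as the key of the resulting relation.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """Hypothetical array member structure with two fields."""
    x: float
    y: float


def array_to_relation(grid):
    """Flatten a 2-D array of records into a set of tuples (row, col, x, y).

    The index pair (row, col) identifies each tuple, so it plays the
    role of the key in the resulting relation.
    """
    return {
        (i, j, *cell)
        for i, row in enumerate(grid)
        for j, cell in enumerate(row)
    }


grid = [
    [Point(0.0, 0.0), Point(1.0, 0.0)],
    [Point(0.0, 1.0), Point(1.0, 1.0)],
]
relation = array_to_relation(grid)
```

Each tuple in relation is at once a record and an element of the product of the index sets and the field domains, which is the correspondence described above.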

But this is as far as it goes. There are at least three common situations which break the nice correspondence. The first is nesting. In this case repetitive use of the outermost structure can be easily handled, but the inner structure can repeat as well, either within a single outer structure, across different containing structures, or at the broad conceptual level across files, representations and the like. From the database viewpoint such an inner structure would need its own relation, yet relational design theory strictly forbids nesting, which is a non‐first normal form construct. The second exception is optionality. Optional fields, and the possibility of representing a supertype with fewer fields by subsetting the structural description of a subtype, give rise to structures without repetition but still with a varying total number of fields. The only way to model these using a single relation would be to use null values, but the relational model frowns upon nulls and other kinds of missing data notation. The third is full‐fledged format polymorphism. Unions, assignments to variables of a supertype (at worst an abstract top supertype whose type subhierarchy might contain types not sharing a single field with each other) and equivalent constructs like weakly typed or typeless references can make relational modelling approaches based on both nulls and multiple tables exceedingly cumbersome and irregular.

The general solution to this problem is to isolate each repeating structure to its own table and to subdivide each such table so that each combination of present fields defines its own relation. In entity‐relationship terms, nested structures and optional fields in repeating structures become weak entities, and their tables will duplicate the keys of the containing structure. The different combinations of present fields will partition the set of rows in the relations into separate tables with no null values, and the number of tables will increase considerably. In the case of simple optionality, each optional field can be factored out of the relation and placed into a parallel table with a shared key. The absence of a row in the appropriate side table now takes the place of a null value. With multiple optional fields, however, this approach can only handle the case where any combination of the options is permissible, since independent side tables cannot by themselves express dependencies between the options. Type hierarchies are keyed by the key of the lowest common supertype, and both regular entities and associations between them are handled normally as separate relations. Union types are tackled by taking the union of their sets of fields, and applying the above to the relevant subsets (i.e. supertypes) of the union.
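As a minimal sketch of two of these moves, assume a hypothetical format in which a track record contains a nested list of points and an optional comment. The table and column names below are invented for illustration, and SQLite merely stands in for any relational store. The nested points become a weak entity keyed by the owner's key plus a position, and the optional comment moves to a side table that shares the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE track (
    track_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
-- Nested repeating structure: a weak entity that duplicates the owner's key.
CREATE TABLE track_point (
    track_id INTEGER NOT NULL REFERENCES track(track_id),
    seq      INTEGER NOT NULL,
    x        REAL NOT NULL,
    y        REAL NOT NULL,
    PRIMARY KEY (track_id, seq)
);
-- Optional field: a side table with a shared key; a missing row replaces NULL.
CREATE TABLE track_comment (
    track_id INTEGER PRIMARY KEY REFERENCES track(track_id),
    comment  TEXT NOT NULL
);
""")

conn.executemany("INSERT INTO track VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany(
    "INSERT INTO track_point VALUES (?, ?, ?, ?)",
    [(1, 0, 0.0, 0.0), (1, 1, 1.0, 2.0), (2, 0, 5.0, 5.0)],
)
# Only track 1 carries the optional comment.
conn.execute("INSERT INTO track_comment VALUES (?, ?)", (1, "first"))

# Tracks that have the optional field.
commented = conn.execute(
    "SELECT t.name, c.comment FROM track t "
    "JOIN track_comment c USING (track_id) ORDER BY t.track_id"
).fetchall()

# Tracks for which the field is absent: no NULL is stored anywhere.
uncommented = conn.execute(
    "SELECT name FROM track "
    "WHERE track_id NOT IN (SELECT track_id FROM track_comment)"
).fetchall()

points_of_first = conn.execute(
    "SELECT COUNT(*) FROM track_point WHERE track_id = 1"
).fetchone()[0]
```

Every column is declared NOT NULL, so the absence of a comment is visible only as the absence of a row in track_comment.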

An interesting special case occurs when two types do not originally share any fields, or in particular any key fields, but still need to be stored together or referred to in the same field. This case often comes about when working at extremely high levels of abstraction where more or less generic supertypes exist, or when utilizing multiple independent and overlapping aggregation hierarchies, e.g. by type and simultaneously by a free form grouping present in the user interface. In this case indiscriminate aggregation, composition and linking can occur, and consequently objects which might have to be referred to in the same context often do not share a smart key. Generally speaking this situation calls for a common surrogate key, which essentially becomes an object identity. To date this is the only robust argument for OIDs known to me. It has special significance for weak entities, which tend to be strengthened by being independently named, numbered and reified.
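One possible shape for such a common surrogate key is sketched below, again with SQLite and with invented names throughout: object_identity, image, person and group_member are assumptions made for the example. Two types with no shared fields each borrow their key from a single identity table, so a free form grouping can refer to either through one column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table whose only job is to hand out object identities.
CREATE TABLE object_identity (object_id INTEGER PRIMARY KEY);

-- Two unrelated types; their natural fields have nothing in common.
CREATE TABLE image (
    object_id INTEGER PRIMARY KEY REFERENCES object_identity(object_id),
    path      TEXT NOT NULL
);
CREATE TABLE person (
    object_id INTEGER PRIMARY KEY REFERENCES object_identity(object_id),
    name      TEXT NOT NULL
);

-- A free form grouping that may contain objects of any type.
CREATE TABLE group_member (
    group_name TEXT NOT NULL,
    object_id  INTEGER NOT NULL REFERENCES object_identity(object_id),
    PRIMARY KEY (group_name, object_id)
);
""")


def new_object_id(conn):
    """Allocate a fresh surrogate key by inserting an empty identity row."""
    return conn.execute("INSERT INTO object_identity DEFAULT VALUES").lastrowid


img = new_object_id(conn)
conn.execute("INSERT INTO image VALUES (?, ?)", (img, "cat.png"))
per = new_object_id(conn)
conn.execute("INSERT INTO person VALUES (?, ?)", (per, "Alice"))

# Both objects can now be referred to in the same field.
conn.executemany(
    "INSERT INTO group_member VALUES (?, ?)",
    [("favourites", img), ("favourites", per)],
)
members = conn.execute(
    "SELECT object_id FROM group_member "
    "WHERE group_name = 'favourites' ORDER BY object_id"
).fetchall()
```

The surrogate carries no meaning of its own; it exists only so that group_member has a single, uniformly typed column to point at.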