A brief look at file format design

Introduction: What are file formats? Why talk about them?

Everybody has come across a great many file formats and all the problems that go along with them. Coders in particular face the burden of supporting the multitude of formats out there. File formats are a fact of life, then; why talk about them? The answer is simple: far too many bad formats exist. As the scene is very active in establishing its own standards and conventions, putting a finger on the basics might well serve a valuable purpose in the long run.

In computer science circles what most know as file formats are called external data representations or serialisations. These two terms nicely summarise what formats are all about: they are about representing abstract data types (ADTs) as concrete byte sequences. But this is only part of the story: ADTs include a procedural side as well. It tells how the stored structures are accessed and modified. Now, current software architectures do not admit a portable definition of complete ADTs—we must separate ADTs into procedural and declarative parts and fix a particular concrete transfer syntax for the latter. This is why computer scientists draw a line between transfer syntaxes and internal representations—the first is used when we cannot control the procedural side of our ADTs, the second when we can. Strictly specified file formats are, then, just a way to circumvent the need for common access libraries and proper type encapsulation. In the scene, this internal vs. external distinction is not often appreciated. A common mistake is made: file formats are modelled too closely after the operation of the writing program. Instead, the logical organization of the data itself should be used as the starting point.

One direct consequence of fixed representations is the need to accurately define them. Often, in addition to being incompletely defined, file formats tend to be too syntax oriented—usage guidelines, algorithm details and data dependencies are seldom spelled out. In effect, we leave a considerable hole in the definition of our abstract type. This means that even though a format spec may be available, improper implementation is still quite easy. All too frequently this happens because the writer of the original application thought no one else would use the format—the scene rests purely on de facto standards which are almost invariably meant to be private at the time of their inception. This assumption often fails, so most file formats should be designed and defined more carefully. This is the subject matter of this article—food for thought, basically.

Fundamentals of serialisation

In designing a transfer encoding, we face a multitude of engineering goals, some of which are mutually incompatible. Some of the major ones are:

  1. Compact encoding
  2. Speedy access
  3. Generality and extensibility
  4. Simplicity and clarity (in formats, specs and software)
  5. Architectural requirements (like serial access)
  6. Consistency/orthogonality (see below)
  7. Fault tolerance and security

Usually tension develops between size and speed, generality and encoding simplicity, and consistency and software complexity. Fault tolerance is often in marked contrast with all the others. It is quite clear that file formats really require thought and design. In this text, design and tradeoffs are discussed from the point of view of a large scale, storage intensive application, since most problems do not surface with small, simple files. Games and multimedia apps are good examples.

Proper design

When creating a file format, design is as important as in all software development. One should not just throw bytes in a file. Specification of requirements, drafting, implementation, testing, documentation and maintenance should all be done, and with care. The first obvious decision is whether creating a new format is worth the effort—often an existing one, possibly modified, will suffice. In each stage of development, simplicity, elegance, efficiency and feasibility should be checked for. Leaving something to chance may lock one into a format not worth supporting. Now, users hate flag days.

When putting data in a file, we are coding abstract structures. For instance, an instrument is not a sequence of bytes from sound editing software; it is a collection of sound signals with lots of auxiliary data of considerable inner structure. All of it may be of considerable size and might have to be modified later on. A picture is not a sequence of scanlines but rather a 2D signal of 1‐4D vectors with heaps of potential ambiguity in interpretation (e.g. the color and gamma problems so common in the printing industry). One must always take heed of the fact that what we see inside a program is usually not what we are trying to code, but instead is itself one particular representation of it.

In designing a format, the structure and purpose of the data guide us best. We need to define exactly what it is we are encoding, break the data down into solid, meaningful chunks and then encode these chunks along with their relationships to one another. The problems are the same as in constructing a relational database or building an object model for some application. Room must be left for extensions, but so that they neither make the format less efficient nor more cluttered. The need for later updates must be considered. We have to find out where the data will be used and how. Then usage patterns can be utilized to speed up operation. If we are dealing with structured data, the structure should be preserved and made available to users. Existing standards should be leveraged for compatibility. If the amount of data is large (or can potentially become so), we must think how it will be found without reading the whole file. The format must be made robust, since transmission errors and defective media are quite common. Self‐sufficiency is also a concern—only commonly used libraries, representations and algorithms should be used. Patents and other intellectual property issues must be taken into account. All in all, there is usually less work in implementing a file format than in designing it properly.


Normalisation

Normalisation is a concept which originates in relational database theory. The rigorous definition is rather technical, but the basic idea is simple: we should store each piece of information only once. When this is done properly, all stored fields are completely independent. Formally, normality is expressed in terms of functional dependencies: if some collection of fields (partially) determines the contents of some other field, there is a (partial) functional dependency. Normalisation tries to get rid of these. The advantages are two‐fold: first, such dependencies imply redundancy, and second, they impose constraints on the fields’ contents. (If such dependencies are present, we cannot arbitrarily change values in the individual fields—we must change all involved fields so that the dependency holds.) This means that we both waste space and complicate updates and program logic; killing dependencies is always good. Most programmers actively normalise their data structures, even when not consciously aiming to do so. Sometimes, however, the resulting structures are so counter‐intuitive that something else is used instead. This may be good or bad, depending on the situation. Whatever the best solution might be, the decision should be conscious. Two examples illustrate the point:

The first example is a straightforward adaptation of a classic relational database case. When we store customer information in a database, the first idea often involves keying records by customer name. But this is really bad practice: now, wherever we want to refer to a customer from another relation, we must store the name explicitly. It is clear that storing the same names in different relations over and over again is neither efficient nor robust. We have not normalised our database schema. Instead, we should assign customers unique ID numbers and use those to connect the relations to each other.
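
To make the idea concrete, here is a minimal sketch in Python; the field and record names are invented for illustration:

```python
# Denormalised: the customer's name is repeated in every order, so a
# name change must touch every record that mentions it.
orders_bad = [
    {"customer_name": "J. Smith", "item": "soundcard"},
    {"customer_name": "J. Smith", "item": "joystick"},
]

# Normalised: customers get unique IDs, orders point to them.
customers = {1: {"name": "J. Smith"}}
orders = [
    {"customer_id": 1, "item": "soundcard"},
    {"customer_id": 1, "item": "joystick"},
]

def rename_customer(cid, new_name):
    """One update fixes every order that refers to this customer."""
    customers[cid]["name"] = new_name

rename_customer(1, "J. Smith-Jones")
assert all(customers[o["customer_id"]]["name"] == "J. Smith-Jones"
           for o in orders)
```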

Next, consider a module format which permits variable tempo and bpm settings. It is clear that this is not the best possible way to set the speed of a composition since the same perceptual parameter (bpm) is affected by two separate, rather intricately connected parameters. Instead we should have only one parameter for the speed. Now, in our format each instrument may have multiple layers. Each layer is complete with envelopes, volumes and other parameters. It is clear that layers are independent of instruments, here. There is no point, then, in storing layer definitions as part of the instrument’s parameters since multiple instruments may need to use the same layer. (Say, two different piano sounds use an identical layer of hammer noise.) Storing a volume in instrument parameters is sound, though, because the volume is an inherent part of the instrument’s definition. Changing it should never affect other instruments. This simple example portrays one of the most important concepts in normalisation: separation and pointing. Objects which are connected but separate should not be nested but instead stored separately and linked using pointers of some kind. (This parallels the relational definition in which normalisation creates new tables and index fields.)

Data types

When we design file formats, the ADT we are trying to encode is often enormously complex. This means that we cannot just give a list of all possible values and their respective encodings, but we must break the problem down into more manageable pieces. In other words, we must build our encoding from simple, known pieces. These pieces are the atomic data types. Most computer platforms have a simple, limited set of datatypes in common use. They include numerical, character, string and logic types. Many standards exist for the representation of signed and unsigned integers, real and fractional numbers, characters, character strings and truth values. It must be emphasized that each platform has its own encoding for them. In other words, when we design a transfer file format, we must even define how the atomic types are encoded.
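
As a sketch of what defining the atomic encoding means in practice, Python's struct module can pin down size, signedness and byte order explicitly instead of inheriting whatever the platform happens to use:

```python
import struct

# Pin the atomic encoding down explicitly: '<' = little-endian,
# 'H' = unsigned 16-bit, 'i' = signed 32-bit.  Relying on the
# platform's native layout (no prefix) is exactly what makes a
# transfer format non-portable.
packed = struct.pack("<Hi", 513, -2)
assert packed == b"\x01\x02\xfe\xff\xff\xff"

# The same bytes decode identically on any platform.
count, delta = struct.unpack("<Hi", packed)
assert (count, delta) == (513, -2)
```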

In addition to atomic types, more complex structures often need to be used. These are called composite types, since they are built up from the atomic ones. The simplest are records and vectors, more difficult ones include various kinds of linked lists, (search) trees, hash tables, heaps, procedural data, choice structures (polymorphism) and complex combinations of the preceding. Unlike atomic types, these often need to have some consistent structure and may need considerable program logic for updates (for instance, inserting to a balanced tree). Their definition and usage is considerably more complex, then. Defining them separately (i.e. giving all the operations, data encoding rules et cetera in their entirety instead of just showing one specific use of the structure in the format) is therefore wise: as with atomic types, reuse of existing constructs makes for elegant formats and clean software. Since composite types take more room than atoms and often entail polymorphism, efficiency must be assessed and addressed, too.

Though a treatment of common composite types and related algorithms (e.g. external sorting) could be nice, that would take a while and constitute a considerable detour. Perhaps I’ll address trees, heaps, hashing and the works some other time…

Pointers (references) deserve a special mention, here. Usually they are just atomic types with some special interpretation (bytes from some offset, index into a list, parameter to a function). Pointers are handy, but they always imply polymorphism which must then be accounted for. Robust design with pointers is difficult—pointers are vulnerable to transmission errors, tend to complicate updates somewhat (through the dangling pointer problem) and are of little use when serial access is sought after. Elegance is also an issue. It is especially bad if special values are assigned to pointer fields (say, zero means null) or pointers patterned after program logic are used (for instance, some module formats store GUS memory addresses for samples as an optimization). Pointers clearly do not promote portability, either.
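
A sketch of file format pointers as offsets into a string pool; the layout used here (count, offset table, pool) is invented for illustration:

```python
import struct

# Records refer to names through byte offsets into a shared string
# pool -- one common way "pointers" appear in file formats.
names = [b"kick.wav", b"snare.wav"]
pool = b""
offsets = []
for n in names:
    offsets.append(len(pool))
    pool += n + b"\x00"          # NUL-terminated entries

# Serialise: count, then 32-bit offsets, then the pool itself.
blob = struct.pack("<I", len(offsets))
blob += b"".join(struct.pack("<I", o) for o in offsets)
blob += pool

# Dereferencing means re-parsing -- and any edit that moves the pool
# invalidates every stored offset (the dangling pointer problem in
# its file-format guise).
def deref(blob, index):
    (count,) = struct.unpack_from("<I", blob, 0)
    (off,) = struct.unpack_from("<I", blob, 4 + 4 * index)
    base = 4 + 4 * count
    end = blob.index(b"\x00", base + off)
    return blob[base + off:end]

assert deref(blob, 1) == b"snare.wav"
```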

Extensibility and maintenance

When we try to achieve long‐lived formats, we often need to leave hooks for future extension. There are many ways to achieve this. Indicating options so that they may be skipped safely and efficiently always pays off in the form of version compatibility. Putting in mere version tags does not. A good example of proper design is EA‐IFF—skipping and extensions work seamlessly, here.
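
The skipping mechanism can be sketched in a few lines. The chunk IDs below are made up, and real EA-IFF additionally pads chunks to even length and nests them in FORM containers, but the principle is the same: every chunk announces its length, so unknown chunks can be stepped over blindly:

```python
import struct

def make_chunk(cid, body):
    # 4-byte ID, big-endian 32-bit length (as in EA-IFF), then body.
    return cid + struct.pack(">I", len(body)) + body

def chunks(data):
    pos = 0
    while pos < len(data):
        cid = data[pos:pos + 4]
        (size,) = struct.unpack_from(">I", data, pos + 4)
        yield cid, data[pos + 8:pos + 8 + size]
        pos += 8 + size          # skip the body, understood or not

data = (make_chunk(b"NAME", b"intro") +
        make_chunk(b"XTRA", b"\x00" * 7) +   # unknown to old readers
        make_chunk(b"BODY", b"payload"))

# An old reader simply ignores chunk types it does not know.
known = {cid: body for cid, body in chunks(data)
         if cid in (b"NAME", b"BODY")}
assert known == {b"NAME": b"intro", b"BODY": b"payload"}
```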

Extensions suffer from normality constraints, as well: when adding to a format, normality should still hold after the extension. This can be difficult or impossible to attain. Alternatively the format can be made polymorphic so that the whole definition varies and functional dependencies do not arise. An example is seen in image file formats: palette images employ a form of normalisation. Here colors are heavily quantized which imposes a kind of complex functional dependency between RGB values. It is not wise, then, to store the RGB values for each pixel, but rather put them in a palette and index. When extending to true color images, the original dependency is broken. We must now discard the palette and our format probably becomes polymorphic, soon complicating software. (Had we thought it through more carefully in the beginning, we might have used another abstraction and avoided the polymorphism. TIFF almost achieves this, albeit with a considerable price in complexity.)

Like programs, file formats come in versions and need constant maintenance: definition of new fields and field values, occasional reworking of the format with the attendant flag days and support for coders using the format. Since most coders do not have the time or resources to maintain an extensive file format (say, a database encoding), they define the basic format once and leave extensions to be done in a decentralized fashion. This is not always bad, but care should be taken over interoperability. If users are permitted to assign new field values, guidelines for doing so should be given and the fields should be large enough to avoid collisions. If a file format gains enough popularity and a sufficient user base, one should think about documenting it a bit more formally and possibly handing it over to a body capable of maintaining it properly. This is what has happened with most international standards, and is usually good for a format. It is the reason we have HTML and the Web, for instance. (BTW, HTML is not an independent language, but an application of a highly modular and extensible base language, called SGML, which is also an ISO standard. It is only proper that HTML has returned to the standards community, then. SGML revisions come roughly at a pace of one per decade—I would guess it was well designed and documented in the first place.)

Issues and techniques

In an actual design, the tradeoffs outlined in chapter two need to be resolved. This and related aspects of file formats are brought to the fore here in the third part. Although this chapter is a bit more application oriented than the other ones, there is still no code present. This is because the only nontrivial code involves shuffling around particular data structures, such as trees, and so was consciously left out. A lot remains to be said, still.

File system coexistence and interdependence

The behavior of the underlying file system is one of the determining factors in storage performance. No matter how well optimized a particular file format is, the file system is the ultimate bottleneck. The functions of file formats and file system metadata can also overlap somewhat. This is why one must think rather carefully what can and what cannot be expected of the underlying storage architecture.

The usual division of labor between programs and the file system is this: file systems provide naming (e.g. a hierarchical tree or an acyclic graph), a certain minimum amount of metadata and access control (e.g. UNIX file permissions and ACLs, and DOS file modes), maintain a cache and provide block based access to file contents. Programs, in turn, take care of multiple file integration, in‐file indexing and data representation, establish a proper naming hierarchy and mostly manage resource sharing (e.g. file locking). Conventions vary from platform to platform, of course.

The implementation of a file system is pivotal in how well it responds to differing usage patterns. Most file systems are based on lists of aggregated disk sectors (e.g. FAT clusters). Sometimes clusters are indexed by trees (e.g. in Ext2). Sectors can also be aggregated into variable length runs and then put under a search tree (e.g. HPFS). If files are composed of linked lists of clusters, then the lists must either be kept in main memory all the time or cached efficiently; otherwise random access will be too expensive. In tree based file systems, random access is usually very efficient, but serial access may not be as efficient as in the linked list approach. Again, caching of tree nodes greatly helps the situation. The tree paradigm also introduces some uncertainty when files grow: increasing the file length may require considerable restructuring of the tree. UNIX derivatives can also behave strangely when zero sectors are overwritten—many UNIX file systems support sparse files, i.e. all‐zero sectors need not be stored at all. In this case writing nonzero data to previously zero sectors can suddenly cause the disk to become full. This is the first instance in which one should be paranoid: file systems tend to do unexpected things. Expect it.

The implications for file format design are obvious. First, serial access is always the fastest method in any properly designed file system. Critical parts of a file format should be serially readable. The same goes for writing. Second, starting from the beginning of the file is good. Most file systems handle this case far more predictably than, say, reading from the end of the file. (Seems obvious? Engineering tradeoffs may push some init data to the end of the file, as in .zip files.) Grouping commonly used data in one location improves efficiency, since the data may then be kept in the file system block cache. As access is block based, doing small discontinuous reads is not sensible. On the other hand, doing long continuous reads of data which is then immediately discarded is even worse. This is why proper skipping and indexing is important in big files—in current multiprocessing environments, messing up the block cache with unnecessary data is a mortal sin. Aligning data to some proper boundary in the file (say, 512 bytes, the most common sector size), packing more data per disk sector and reading bigger chunks all speed processing: they cut OS service calls (and context switches), avoid retrieval of data which is not needed and make it possible for the file system to use more aggressive hardware optimisations (scatter/gather, elevator algorithms, blind bus mastering DMA etc.). The same goes for writing: no sane coder stores data byte by byte through direct library calls. Instead, writing should always employ buffering—sometimes even implementing it inline can be justified (e.g. in search and replace algorithms employing finite automata). One should rarely store long runs of zeroes. Not only is it inefficient, it can also lead to unexpected behavior on UNIX based systems.
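
The buffering point can be sketched as follows; BytesIO stands in for a real file and the buffer size is arbitrary:

```python
import io

# Gather output in memory and flush in large chunks instead of
# pushing each field to the OS separately.
class BufferedOut:
    def __init__(self, raw, bufsize=64 * 1024):
        self.raw, self.buf, self.bufsize = raw, bytearray(), bufsize
        self.flushes = 0             # counts the actual writes

    def write(self, data):
        self.buf += data
        if len(self.buf) >= self.bufsize:
            self.flush()

    def flush(self):
        if self.buf:
            self.raw.write(bytes(self.buf))
            self.buf.clear()
            self.flushes += 1

raw = io.BytesIO()
out = BufferedOut(raw, bufsize=4096)
for i in range(10000):
    out.write(b"\x2a")               # 10000 one-byte writes...
out.flush()
assert raw.getvalue() == b"\x2a" * 10000
assert out.flushes == 3              # ...but only three real writes
```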

Uniform access and reuse of structures

Like always in software engineering, modularity and code reuse are advisable in storage related programming. Since we often have to deal with more than one file format, isolating format dependent code into separate modules makes it much easier to support multiple formats and reuse the libraries in another program. Because storage is rarely the number one concern in our programs, we usually want to implement it simply and throw as few lines of code at it as possible. This means that reuse of data types and structures is something to be pursued—reuse of structure leads to reuse of code. Small modules have other advantages as well: they reduce coding and testing time and so allow us to optimize and debug more carefully. Optimization is further eased by the fact that using the same structures and code multiple times tends to generate repeatable access patterns which can then be reliably modelled. Unified error handling is easier if interfaces are compact. Third parties like to have compact libraries with easy, intuitive interfaces. Documentation is easier if the same data structures recur over the documents. Manual checking of the resulting files is viable only if just a few complex data types are used. Packing file access code in neat modules and reusing both data structures and access code is enormously useful. Doing things this way will probably make for a cleaner format, as well—one does not feel so tempted to use irregular structures when compact code is an issue.

Random vs. serial access

In the distant past of mainframes and tape/drum storage, the issue of random versus serial access did not exist—all access had to be serial. Some people might say the situation is now reversed and all access should be considered random. This is a valid point, but a weak one. For one, random access is always less efficient than serial access, especially when the operating system prefetches contiguous blocks of data and when the underlying medium is serial in nature (like CDs and tape). Second, serial access tends to lead to less fragmentation, since files are necessarily written in a single, swift, linear pass. Third, random access fares especially badly when only frugal resources are available. If the environment lacks extensive buffering and multitasking facilities, seek and rotational latency, interrupts and task switches to driver code, clogged asynchronous command queues (like the ones used with bus mastering SCSI cards), drive level cache thrashing and the like will become major problems. In particular, time is simply wasted on head movement if only one program is running at a time, since the intermittent disk sectors crossed during a seek will not be utilized. Serial access also makes better use of the bookkeeping mechanisms of certain file systems (e.g. HPFS, whose accounting mechanism uses sector runs and heavy caching of allocation structures). Furthermore, serial access is so much more deterministic overall that even hard realtime guarantees can be achieved. This means truly dependable multimedia and realtime applications are difficult to build on top of random access file formats.

Summa summarum, one should always strive for serial access, even if it can be achieved only partially. On the other hand, not all things can be expected to translate smoothly into serial constructs. One excellent example is updatable structures, discussed below. To get some perspective on the problems of serial access and updates, one should check out the design of the UDF 1.5 file system used on CD‐R and CD‐RW discs.

The need for speed

From a format perspective, there are three main ways to speed up file access. The first is to reduce the number of blocks fetched. This equals minimizing retrieval of useless data (i.e. knowing exactly where to go and placing unrelated data in separate disk blocks) and the size of the data we do use. The second is to blindly (i.e. without program intervention) retrieve long stretches of data blocks. This comes down to pruning structures which must be parsed in order to be accessed (pointers, polymorphism, variable length fields and discontinuous streams). The third involves pipelining and caching so that we need not wait for I/O when data is needed (e.g. prefetching). All these require taking advantage of expected usage patterns.

Two basic tenets in file format and file system design are that the CPU is considerably faster than I/O and that there are ample amounts of external storage space available. The implication is that we can burn both cycles and bytes quite liberally to achieve speed. This way, data compression, complex storage models, indexing and redundant representations can be justified.

To avoid moving around something which is not of immediate use to us, logical grouping should show up in the encoding as well. This way it is likely that once a disk block is retrieved, it will be used in its entirety. Dimensionality and locality give good guidelines. For instance, storing an editable picture by scanline would never be as good as using a tiled organization. This is because most editing operations tend to obey the principle of locality—pixels close to each other are affected together. Since pictures are 2D, locality works in terms of distance on the plane, not 1D scanlines. Storing by scanline breaks the proper definition of locality and makes updates and off‐the‐file caching inefficient. (This is precisely the reason we use BSPs, octrees, bounding boxes etc. to sort objects in 3D graphics: they match the 3D definition of locality better than does a 1D list of extents.) Distances do not directly translate to locality, of course—if our algorithms do not work in terms of dimensions or distances, some other organization can give better performance. (For example, when displaying an unpacked animation, we would store by scanline nevertheless.)
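
A little arithmetic makes the locality argument concrete. The image, tile and block sizes below are arbitrary; the point is that a small local edit touches far fewer distinct disk blocks under a tiled layout:

```python
# A W x H image, one byte per pixel, stored either by scanline or in
# T x T tiles laid out one after another.
W, H, T = 256, 256, 16

def scanline_offset(x, y):
    return y * W + x

def tiled_offset(x, y):
    tile = (y // T) * (W // T) + (x // T)        # which tile
    return tile * T * T + (y % T) * T + (x % T)  # offset inside it

# A local 16x16 edit around (40, 40) spans 16 separate scanline runs
# but at most 4 tiles, so it lands in far fewer distinct disk blocks.
def blocks_touched(offset_fn, block=512):
    return {offset_fn(x, y) // block
            for x in range(40, 56) for y in range(40, 56)}

assert len(blocks_touched(tiled_offset)) < len(blocks_touched(scanline_offset))
```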

If we wish to be able to fetch or store files without explicitly parsing each and every byte, we must use structures which allow blind access. This means fixed length fields and the absence of options. Alternatively we might consider some extra I/O buffering. Pointers are a big no‐no either way. To optimise our use of the file system interface (which is invariably block oriented), we need to align our data blocks. Usually we do not know the exact block size of the device our data will be stored on, but a sufficiently large power of two is a good guess. 4K will probably do. Often individual fields are aligned too, but on smaller boundaries. The rationale is that certain architectures have trouble handling anything smaller than a word or a double word. In my opinion, concerns such as these are seldom important in transfer file formats—since the main bottleneck in storage is disk latency, twiddling nonaligned fields is a minor concern. This is especially so since most storage related code is written in high level languages which usually automate the handling of nonaligned fields. On the other hand, alignment costs little compared to the enormous disk sizes of today—RISC and Motorola advocates would probably bury anyone with a differing opinion.
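
Rounding an offset up to a power-of-two boundary is a one-liner worth getting right; a sketch using the 4K guess from above:

```python
# Round an offset up to the next multiple of a power-of-two boundary.
def align(offset, boundary=4096):
    assert boundary & (boundary - 1) == 0, "boundary must be a power of two"
    return (offset + boundary - 1) & ~(boundary - 1)

assert align(0) == 0
assert align(1) == 4096
assert align(4096) == 4096       # already aligned offsets stay put
assert align(5000, 512) == 5120  # the common sector size works too
```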

As the major problem in storage subsystem performance is disk seek latency, not transfer rate (nowadays, at least), we see that if we can anticipate which data are needed next, we can pipeline and gain performance. File systems frequently use heuristics to achieve such an effect—they commonly prefetch the next block in line on the assumption that it will soon be needed. This is good, but not enough: the file system can hardly know about the logical structure of our files. Hence it would be nice if we could control the prefetch logic. Unfortunately many common storage APIs do not offer the programmer any such control. Often we cannot even construct our own because asynchronous I/O is not available. POSIX comes closest with its broad set of I/O primitives. They enable us to initiate I/O without disturbing the operation of application logic. If we know which data will be needed next, we can ask for it ahead of time. On the down side, such explicit prefetching means the fetched data must be held in application memory. It also implies some rather intricate parallelism and all that comes with it (i.e. resource locking, deadlocks, asynchronous error reporting etc.).

Caching and state

As disks are slow, the fastest I/O is no I/O at all. The only way to achieve this is to already have the data in memory when we need it. Depending on access patterns there are many ways to accomplish this. Prefetching and pipelining were already touched upon in the last chapter. Caching is what is left, then.

Not all programs benefit from caching. Some programs, especially those with simple storage requirements and small files, tend to be designed so that both reading and writing are done in a one‐shot, all‐or‐nothing fashion. In this case, caching does not help. On the other hand, most larger programs use—or could be made to use—the same data over and over. Examples are found in databases, text editors, configuration management, graphics and 3D applications, file compression and archiving and more. In this case, keeping often used data in the core helps a lot.

Optimal caching requires knowledge the operating system (or the application itself, for that matter) does not have. Thus less must often suffice. Mostly it does. A simple least‐recently‐used discard policy with a coarse, shared block cache is quite effective. Combined with some simple prefetching (e.g. always prefetch the next disk block) and periodic writeback, near optimal performance can result. But only on average. There are situations where a single program may cause significant thrashing and degrade the performance of the whole system. Plus, access to the kernel and/or libraries is quite expensive in terms of processing time. This is why some caching/buffering should be done inside the program and some control over the operation of the system wide shared cache exerted. The negative side is that programs become more complex if caching is done in application space. We end up using a lot of memory, too. In the extreme, we may force the operating system to page out. This is the primary benefit of OS side caching versus the more customized schemes: the system does not know about the inner workings of our program, but it does have global knowledge about memory usage and the behavior of other programs. The system can adapt to low memory conditions in a way our program cannot. In addition, a global caching scheme yields a significant benefit if multiple programs share access to a file.
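
A minimal application-side LRU block cache can be sketched over an ordered dictionary; fetch_block stands in for the real, expensive read:

```python
from collections import OrderedDict

# Least-recently-used block cache: hits are served from memory,
# misses fetch the block and may evict the stalest entry.
class BlockCache:
    def __init__(self, capacity, fetch_block):
        self.capacity, self.fetch = capacity, fetch_block
        self.blocks = OrderedDict()
        self.misses = 0

    def read(self, blockno):
        if blockno in self.blocks:
            self.blocks.move_to_end(blockno)     # mark recently used
            return self.blocks[blockno]
        self.misses += 1
        data = self.fetch(blockno)
        self.blocks[blockno] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)      # evict the LRU block
        return data

cache = BlockCache(2, lambda n: bytes([n]) * 512)
cache.read(0); cache.read(1); cache.read(0)      # 0 is now most recent
cache.read(2)                                    # evicts block 1
cache.read(0)                                    # still cached -- a hit
assert cache.misses == 3
```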

Program state is another thing, closely related to caching. Basically, there exists a tradeoff between keeping relevant data (especially metadata, such as indices) in the external store/cache and keeping it inside the application data structures in a decoded form. In the latter case, applications become more memory intensive and often require more careful handling of error conditions and special cases—keeping program state in sync with the external files takes some effort. In the former they depend on the cache of the operating system, generate extraneous disk accesses (metadata lookups and file updates) and generally come to require random access. The problem is, if the program does not keep count of where different data are in a file, it will have to seek for what it needs. Archivers provide examples: JAR is clearly in the keep‐it‐in‐the‐core camp. RAR takes the dump‐it‐to‐the‐disk line.

It seems statefulness is purely a programming tradeoff. But caching and intended program state often dictate how external data is encoded. If considerable state can be accommodated, complex external representations may become bearable (the JAR/UC2 way). If more weight is put in the storage side, some things such as serial access and low memory operation are greatly eased (the ARJ/PKZIP way). If both lowmem operation and efficient random access are required, there will usually be some redundancy in the file format. This is exemplified by .zip files which have both a distributed, block based organization and a central directory—the first works for lowmem and serial access, the second for random access and large files.

Some format designs betray an incongruence in expecting the application to be stateful but neglecting easy buildup of the state. This is bad, since such an approach does not scale. Text based formats are a pathological example—ever tried to build an efficient editor for text files of arbitrary size? (Say, 100MB and beyond: HTML, SGML and PS take room.) Since text editors deal with lines, one needs state to tell us where the lines are. Text files have no indices, so we need to read the entire file to be able to go around in the text. Using such an encoding as a basis for another format is dubious. As a worst case scenario, some formats have even used line numbers for pointers—extremely bad since now one needs to browse through the entire file to number the lines before proceeding to read the data.
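
The missing state is cheap to build in one serial pass; a sketch of a line offset index:

```python
# Build a line-offset index in a single serial scan -- the state an
# editor needs before it can seek to line N without rescanning.
def line_index(data):
    offsets = [0]
    pos = data.find(b"\n")
    while pos != -1:
        offsets.append(pos + 1)      # each line starts after a newline
        pos = data.find(b"\n", pos + 1)
    return offsets

def get_line(data, idx, n):
    end = idx[n + 1] - 1 if n + 1 < len(idx) else len(data)
    return data[idx[n]:end]

text = b"first\nsecond\nthird\n"
idx = line_index(text)
assert get_line(text, idx, 1) == b"second"
```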

Space efficiency and redundancy

Most people would agree small files are a good thing. Although storage space is ample today, there is always a pressure to make our data smaller. There are many ways to accomplish this: properly set field sizes, field packing, encodings designed especially for certain kinds of data, polymorphic formats and generic data compression methods, both lossless and lossy. All these reduce redundancy in the resulting file, sometimes resorting to losing some insignificant data in the process. (This might be called perceptual redundance.)

The first thing to do when wishing for a small file is to set field sizes properly, only code data once (normalize) and to be sure space is not wasted through excessive fragmentation and/or alignment. Fields which assume default values may be made optional and variable length fields employed. At the expense of byte alignment, some fields may be packed more densely (e.g. by using bit fields, stuffing more than one numeric field into a byte, or using field lengths which are not multiples of eight bits). Redundant encodings can be collapsed into a more efficient form. (One good example is the two character linefeed convention of MS-DOS text files.)
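As an illustration of bit level packing, the following sketch (the field widths are invented for the example) stuffs a 4-bit type, a 3-bit flag set and a 1-bit extension marker into a single byte instead of three aligned bytes:

```python
def pack_fields(ftype, flags, ext):
    """Pack type (4 bits), flags (3 bits) and ext (1 bit) into one byte."""
    assert 0 <= ftype < 16 and 0 <= flags < 8 and ext in (0, 1)
    return (ftype << 4) | (flags << 1) | ext

def unpack_fields(b):
    """Inverse of pack_fields: shift and mask each field back out."""
    return (b >> 4) & 0xF, (b >> 1) & 0x7, b & 1
```

The cost, as noted, is byte alignment: the fields can no longer be read with ordinary byte or word accesses.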

Sometimes the data structures one uses contain hidden redundance. For instance, one can code binary tree structures extremely efficiently by sorting and storing branch lengths only. (This is the technique pkzip uses to store its static Huffman trees.) Similarly effective constructs exist for many nontrivial data structures, such as graphs. In general, tailor made structures cut back on storage requirements, especially if complex storage organizations are used. Examples are found in scientific data formats, since they are designed to hold highly structured data. Sparse arrays are a prime one.
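The branch-lengths trick deserves a sketch. Below is a rough Python rendition of the canonical Huffman reconstruction used in DEFLATE: only the per-symbol code lengths are stored in the file, and both writer and reader regenerate identical codes from them.

```python
def canonical_codes(lengths):
    """Rebuild canonical Huffman codes (as bit strings) from the stored
    per-symbol code lengths, in the manner of DEFLATE (RFC 1951)."""
    max_len = max(lengths)
    # Count how many codes exist of each length.
    bl_count = [0] * (max_len + 1)
    for l in lengths:
        if l:
            bl_count[l] += 1
    # Compute the smallest code value for each length.
    code = 0
    next_code = [0] * (max_len + 1)
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Assign codes to symbols in symbol order.
    codes = {}
    for sym, l in enumerate(lengths):
        if l:
            codes[sym] = format(next_code[l], '0%db' % l)
            next_code[l] += 1
    return codes
```

Eight symbols thus cost eight small length fields instead of an explicit tree encoding, and the tree shape comes back for free.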

All the measures thus far take out what might be called external redundance: they make sure encoding the data does not cause appreciable bloat. Nothing is said about the redundance of the data itself, though. Substituting binary fields for numeric character strings is good, but does not suffice if most of our numeric data consists of zeroes. Compression is the tool of choice for reducing redundance of this kind.

Many methods are available, from run length coding through simple dictionary encoders (LZ77, LZ78 and the LZW variant of the latter) and basic statistical compressors (Shannon-Fano, Huffman and arithmetic) to complex Markov modellers (PPM, PPMC, PPM+ etc.), hybrids (dictionary plus low order statistical, like PKZIP, ARJ and RAR) and derivatives (LZP and methods using multiple dictionaries). Sometimes certain invertible transformations are used as preprocessors (like the Burrows-Wheeler transformation). When dealing with sampled analog signals and the like, delta coding, prediction based on linear filtering (LPC) and certain classes of mathematical transformations (like FFT/DCT or even Karhunen-Loève) may provide a useful impedance match between the source and the following compression engine. If perfect reconstruction can be sacrificed, lossy methods can be employed. These most often come in the guise of vector quantization, either in some suitable transform domain (like frequencies, as in the case of DCT) or not (as with the difference signal in CELP). In some cases straight adaptive quantization does the trick. (An example is seen in ADPCM.) Even better performance results if perceptual models are used to tune the quantization process (say, MP3). The combinations and ramifications are endless.
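Of the preprocessors mentioned, delta coding is simple enough to sketch in a few lines. The point is the impedance match: a smooth sampled signal becomes a run of small values, which the back-end compressor squeezes much better than the raw samples.

```python
def delta_encode(samples):
    """Replace each sample with its difference from the previous one."""
    prev, out = 0, []
    for s in samples:
        out.append(s - prev)
        prev = s
    return out

def delta_decode(deltas):
    """Invert delta coding by running summation."""
    prev, out = 0, []
    for d in deltas:
        prev += d
        out.append(prev)
    return out
```

The transformation is perfectly invertible; all the redundance reduction is left to whatever compressor follows it in the pipe.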

When choosing a compression scheme, many factors come into play. One must think about the computational burden (the higher the compression ratio, the more data each outgoing byte depends on), memory requirements (good compression equals tons of model state), intellectual property issues (e.g. LZP and LZW are patented), code complexity and development time, the possibility of break-in (i.e. what happens when we want to start reading a compressed stream halfway through?), scalability (e.g. one will not want to use any of the optimal algorithms for streams of arbitrary size) and, in the case of lossy compression, quality issues and the availability of know-how and resources to build a quality perceptual model. It is also important to note that compression is generally much slower than decompression. MPEG encoders are a notorious example. Only approaches heavy on modelling (like PPM+) level the difference. In this case both ends of the pipe tend to slow down.

Compression is not a cure-all and should be used with consideration. It really does not sit well with indexing, random access, small memory footprints or easy modification. Quirky compression methods are one excellent reason to abandon a file format altogether. (See GIF versus PNG.) Anything that reduces redundance will reduce fault tolerance. This applies to compression as well. Checksums, error correction, synchro flags and the like often need to be added afterwards to ensure that bits survive intact. Often the best way to proceed is to use a simple, reliable compression method, do compression blockwise to limit error propagation and memory requirements, and to rely little on stuffed fields and pointers. This is the approach used by early archivers (ARJ, PKZIP) and most compressive file systems (Stac, DriveSpace, NTFS etc.) and it seems a proper tradeoff.
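A rough sketch of the blockwise approach, using zlib as a stand-in for whatever simple method is chosen (the block size and length-prefix framing here are invented): each block is compressed independently, so both damage and memory requirements stay confined to one block.

```python
import zlib

BLOCK = 64 * 1024  # illustrative default block size

def compress_blocks(data, block=BLOCK):
    """Compress each block independently and length-prefix it."""
    out = []
    for i in range(0, len(data), block):
        comp = zlib.compress(data[i:i + block])
        out.append(len(comp).to_bytes(4, 'little') + comp)
    return b''.join(out)

def decompress_blocks(blob):
    """Walk the length-prefixed blocks; a bad block would be skippable."""
    out, pos = [], 0
    while pos < len(blob):
        n = int.from_bytes(blob[pos:pos + 4], 'little')
        out.append(zlib.decompress(blob[pos + 4:pos + 4 + n]))
        pos += 4 + n
    return b''.join(out)
```

The price is a slightly worse ratio (each block starts with an empty model), which is exactly the redundance that buys the fault tolerance.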

Searching and indexing

Serialising highly structured data as simple streams is a mistake. Few sentient coders would even think of encoding a database as a simple stream of fixed size cells without any extra structure or indices. The key to processing large amounts of data efficiently is to ignore most of it. To do this, we need metadata (pointers, indices, directory trees, maps, keywords etc.). If complex datatypes and/or storage organizations are used, we also need indices to keep track of file contents, free space et cetera.

There are many ways to index data. The simplest is to order according to some fixed key field. This is seldom a good alternative if field lengths vary or our data are large, but if we know the number of sort items beforehand, it is still worth consideration. Gathering a separate directory (i.e. a sorted list of keys and pointers to data) is better, but does not scale very far. Such lists are best suited for serial processing and highly irregular sort keys or, on the other hand, for flatly distributed fixed length keys. In the latter case, interpolation search can be extremely efficient. Now, since a simple list does not scale, why not index it? Use induction, and we end up with a tree structure. Multiway balanced trees are really one of the few standard solutions to external indexing. They display quite admirable access statistics, since the capacity of a tree grows exponentially with its depth. Only hashing can beat trees, but because of its unpredictable access patterns, the difficulty of extending hash tables and the difficulty of building multidimensional hashes, trees are better suited to external storage use. The best way to store search trees is to cluster them into connected, equally large pieces and fit the pieces onto whole disk blocks. When done properly, such an organization can achieve an average of two to three disk accesses for trees spanning millions of search items. As an added benefit, trees are suitable for multidimensional indexing (e.g. k‐d trees and BSPs with quad and octrees as special cases), range queries, subranging (partitioning a search range into finer subranges) and insertion. (Though it must be said that keeping a tree balanced after insertion is not nice, especially if topological partitioning of the described kind is used. There are ways to distribute the load of insertions across lookups, but they require more complex structuring and behave better when used in‐core.)
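Interpolation search, mentioned above for flatly distributed fixed length keys, is short enough to sketch: instead of halving the range as binary search does, it guesses the position from the key's value, which on uniform keys gives roughly log log n probes.

```python
def interpolation_search(keys, target):
    """Search a sorted list of uniformly distributed numeric keys by
    estimating the target's position from its value."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            pos = lo                      # degenerate range, avoid /0
        else:
            pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

On skewed key distributions it degrades badly, which is why the text recommends it only for flat distributions; trees remain the general-purpose answer.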

If we are storing data which does not admit parametrization by numeric values, we often need to use keyword techniques. Trees (or, rather, tries) are suitable for storing these, as well: extensive keywording tends to generate lots of words with identical substrings. By packing these into a trie and storing it similarly to a B‐tree, we can achieve both time and space efficiency. The same argument goes for other similarly regular metadata.

Data and metadata should not be mixed: metadata need to be clustered together, whereas bulk data should be stored separately. This is because the access patterns and structuring of metadata and application data are different and largely independent. Separating the two allows us to use highly optimized data structures (fixed length fields, sorted arrays, prefix coded trees, statistical compression) for both. This way locality works for us, as well: metadata will stay cached, the likelihood of finding multiple useful data on the same disk sectors will increase, data accesses can largely be made blind and metadata lookups do not drive quite so many data sectors out of the OS block cache. We also gain in update complexity—we can now rewrite metadata and application data separately.

Updates and insertion

Updating is something which is not often thought through very carefully when designing a file format. Basically, most files are small enough to be rewritten each time they need to be updated. However, every once in a while we bump into really big and complex files. File archives are a common (bad) example: big, meant for update and not optimized for it. Building a file which is meant to hold gigabytes of rewritable data and which needs to be completely rewritten on each update betrays a degree of indifference to proper design.

The basic problem with updates and inserts is that we are dealing with flat files. Each byte has a fixed address, so once a chunk of data is left behind, we cannot arbitrarily change its length. Nevertheless, changing data later often requires resizing. Since file systems generally do not support insertion, but only overwrite, we must do space management ourselves. The problems are similar to the ones encountered in memory management and file systems and require similar solutions. The most common way is to partition the data and use a linked list. Another one is to use linear chunks but instead of just packing them end to end, we form a file directory and index the data from there. Both these approaches suffer from severe external fragmentation. Using fixed size pages and linking is sometimes better, but now small data items need special treatment to avoid internal fragmentation. Random access and latency also become problems since now the data can be scattered all over the file. Sometimes the former can be alleviated by binding the data stream together with a tree based index instead of a linked list. This is what many file systems do. Efficient operation is more difficult to reach. If we mainly read in a serial fashion, maintaining a separate linked list organization and prefetching data can help a bit.
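A toy model of the fixed-size-page strategy may clarify it (all names are invented, and a real implementation would work on a file rather than an in-core list): released pages go on a free list and are reused, so an update or deletion never forces a full rewrite.

```python
PAGE = 512  # illustrative page size

class PagedFile:
    """Toy model of fixed-size-page storage with a free list."""
    def __init__(self):
        self.pages = []   # page payloads, indexed by page number
        self.free = []    # page numbers released for reuse

    def alloc(self, payload):
        """Store a payload, reusing a free page if one exists."""
        assert len(payload) <= PAGE
        if self.free:
            i = self.free.pop()
            self.pages[i] = payload
        else:
            i = len(self.pages)
            self.pages.append(payload)
        return i

    def release(self, i):
        """Give page i back to the free list (no data is moved)."""
        self.pages[i] = None
        self.free.append(i)
```

The internal fragmentation the text warns about is visible immediately: a one-byte record still occupies a whole page.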

Of course, not all files need to be updated. Optimizing for update in a serial and/or lossy encoding (e.g. MP3, which is inherently a distribution format, not an editing intermediate) would be unwise. Coding for incremental update in a file which will never grow beyond a couple of hundred kilobytes is sheer waste of time. But sometimes we do need some flexibility in how we can move the data around in a file. This is especially important if many programs store their data in the same file, since we have no prior knowledge of field lengths or their internal organization. This is what happens in component software, software frameworks and some object oriented environments, like OLE 2. Maintaining files like these is a real bitch if incremental modifications and versioning are not supported.

Fault tolerance

Fault tolerance is another area which is often forgotten when designing everyday file formats. This may have something to do with the fact that usually disks do not fail and, even if they do, the bulk of data on an average hard drive is dispensable. But anyone who has used compression/archiving software knows what an asset fault tolerance can be. By making file formats resistant to some common forms of damage we can avoid a lot of user disappointment.

Most storage media of today use error correcting codes and, outside the secondary store, most of the components in a computer are relatively reliable. This means that the probability of bit errors is not very high. Instead, protocol failures (like missing eighth bits) and entire damaged sectors have come to dominate. In particular, one is likely to miss at least an entire sector of data if any uncorrectable errors are found. This means that our primary concerns are long burst errors and completely malformed files. At the very least we should detect the latter, but preferably we should have some tolerance with respect to both.

Error tolerance is a sum of many things. The first is redundance. To be able to take some damage without dropping dead, a file must contain some of it. Second, the redundance should be in a common form to facilitate automated correction. In most cases, this implies error correcting coding. Third, our formats must not be fragile. Most data should be structured so that any local damage will stay local. This may exclude central directories, linked lists and so on, except perhaps when used to cache data which is also available elsewhere. Solid compression will also present serious problems. It should always be possible to resync to the data flow even if some errors are present. This means we should always insert some flag bytes so that we do not have to rely entirely on pointers. Polymorphism and variable length fields are never as fault tolerant as fixed records, either.

If we decide to go heavyweight and add error correction, we should remember what we are trying to correct. The traditional algebraic codes, such as BCH and RS, are optimized for short, scattered errors. This means they are suitable for correcting spurious bit errors (like the ones caused by coating defects on disks), but not very efficient at the level of entire files. They must be augmented by heavy interleaving (like in the CIRC coding used on CDs) or some entirely different form of coding must be used. One example of the alternatives is the page summing code used by many archivers (like RAR and UC2). They xor together checksummed data pages. The redundance comes in the form of the stored xor data. Recovery is extremely efficient, but using whole pages can lead to some fragmentation. Also, recovery may not be possible if a burst damage spans a page boundary.
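The page summing scheme is easy to sketch. All data pages are xored into one parity page; any single lost page is then the xor of the surviving pages and the parity. (This sketch assumes equal page sizes and exactly one damaged page, located by per-page checksums.)

```python
def parity_page(pages):
    """XOR equally sized pages together into one recovery page."""
    out = bytearray(len(pages[0]))
    for p in pages:
        for i, byte in enumerate(p):
            out[i] ^= byte
    return bytes(out)

def recover(pages, parity, lost):
    """Rebuild the single page at index `lost` (given as None)."""
    rest = [p for i, p in enumerate(pages) if i != lost]
    # XOR of all surviving pages and the parity equals the lost page.
    return parity_page(rest + [parity])
```

One parity page per group of data pages is the entire storage overhead, which is why the scheme is popular with archivers.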

So there are many ways in which to make a format more robust. But again, most of the complexity is in the software. We should always check data for consistency, correct any correctable errors, perhaps kill damaged data and inform the user. Writers should not exploit fragile features in file formats, even if such constructs are present. Neither reading nor writing software should rely on anything—some programs are plagued by incorrect storage API usage, faulty expectations on file names and so on. Assuming things not solidly documented will lead to errors and general headache.

Specification and documentation

As was indicated in the introduction, documentation is one of the most important parts of the design process for a new file format. Interoperability rests almost completely on proper definition of terms, data structures and common conventions. This chapter tries to point out some of the common mistakes made and raise the general awareness of file format designers for aspects of the process not easily spotted. The result is a number of obscure looking guidelines. This means that taking this chapter too literally will result in ugly, bloated file format specs which may, at worst, turn out unreadable. This is precisely what has happened with most standards documents over the years: aiming too high in accuracy has taxed readability and clarity. Basically, what seems enough probably is—common sense is a good guide in documentation as well. One healthy habit is to include descriptions at multiple levels: first outlining the organisation of the document and defining terms, then drafting the structure of the file and finally giving all the details, for example. This way everybody will understand and those wishing to delve more deeply into the design will also find the data they need.

Stating the assumptions

Like software, file formats and other data representations never exist in a vacuum. They are only one part of the extensive abstract machinery which enables us to put meaning to the bits. This means that file formats rest on hidden assumptions about where and how they are used. Most people do not realize this, not in connection with file formats or otherwise. This is too bad—it leads to confusion and ambiguity.

There are loads of tacit assumptions about what files are: they are unstructured, pure byte streams with r/w access, seeking, a length, a name and possibly some auxiliary data. They exist in a directory and only one active program entity accesses them at a time. Once something is written to a file, it will be there the next time we look, even if the file has migrated to another computer in the meantime. Files must be opened in order to be used and, once open, they work at a certain, constant speed. Et cetera. In addition to being too simple, most of these assumptions can fail completely under certain very real circumstances. Some file systems allow files to be accessed simultaneously by multiple programs. In some cases, files have types and are not flat byte streams (e.g. some older operating systems use a record based organization). Bytes need not be eight bits long. Read and write access need not be equal (e.g. flash file systems, CD-RWs and MOs take longer to write) and file state need not immediately reflect what was written (e.g. batch updates in some mainframe environments). Media can fail under the most peculiar of circumstances. (How about network breakdowns with NFS? Or timeouts due to caching latency on opening a large file on an AFS volume?) Some tape file systems do not maintain file lengths and most do not allow seeking. Replicating network file systems sometimes return the master replica even if you just stored an entirely new version of some secondary one you were working on. The lesson goes: there is very little one can safely assume. Think and define.

Anyone designing file formats should actively recognize that files are an abstraction which has little to do with how the abstraction is in some particular case implemented. Files on NFS volumes, behind flash file systems, on hard drives or floppies, on defective media, behind the Internet on the Moon, on read‐only partitions, pointed to by multiple directory entries, residing on serial media (e.g. tape or solid compressed archives), existing with no name or directory, of indeterminate length etc. are files all the same. If some design decision limits the applicability of one’s format, this should be thought through and acknowledged. If some special properties are expected of the files or the programs reading them (e.g. some minimum amount of usable memory or only one program reading at a time), they should be stated.

Specifying the syntax

When specifying file syntax, we should first define the data types used. Often too much is assumed of the reader: unlike before, when the scene was based purely on the Amiga, nowadays we have multiple platforms and data encodings. Byte order, field lengths, numerical types, terminology and APIs vary. Fixing a file format requires definition of all of these, if referred to. Using C structs, for instance, is dangerous: the definition of C is vague in this regard. Talking about words means different things to x86 and VAX people. So the rule goes: if you use it, define it. It pays off.
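A two-line demonstration of why byte order must be pinned down: the same four bytes decode to two different numbers depending on which convention the reader happens to assume.

```python
import struct

raw = b'\x01\x00\x00\x02'             # four bytes read from some file

little = struct.unpack('<I', raw)[0]  # little-endian reading: 0x02000001
big = struct.unpack('>I', raw)[0]     # big-endian reading:    0x01000002

# Same bytes, two incompatible interpretations -- hence: define it.
```

A spec which says only "a 32-bit integer" has left this choice to chance; one explicit character ('<' or '>') in a format description removes the ambiguity entirely.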

Often one encounters format specs of reasonable accuracy, but lacking in completeness. Some fields may not be explained, or they may have private values. These are difficult, especially if they deal with format polymorphism. Leaving an option tag partially defined, one risks incompatibility. If you are thinking about leaving something out because not everybody should know, you should probably think about incorporating strong crypto instead. Else someone will eventually know. Also, one should always tell what to do with undefined values. Leaving them open to redefinition is asking for trouble.

Specifying a format as a sequence of fields is the most common way. It is often the best way, as well. There are pitfalls, however. The worst come up with variable length and optional fields, pointers and alignment. Options should be indicated clearly, along with info on when they are present and when not. Variable length fields should have a unique algorithm to determine their length. For instance, name fields are often variable length with NUL termination, but the terminating condition is not spelled out. Sometimes they have an implicit maximum length, not clearly indicated. A few formats even let fields be terminated by the end of file or by another field, possibly pointed to in the file header. This is not very robust. The definition of pointers often lacks units and a base: the spec does not tell exactly what is counted, whether counts start from 1 or 0, whether there are any magic values (such as an invalid pointer) and where the origin is (beginning of file? beginning of block? beginning of format data, if embedded in another file?). Often interplay with alignment is seen: it is not clear whether padding bytes should be included in counts or not. In IFF, for instance, they sometimes are and sometimes are not. (Hint: padding after a block vs. inside it…) Alignment itself has problems similar to pointers: on what boundary is a field supposed to go? Counted from where? Counted how (including pad bytes, optional fields etc.)? All in all, there are tons of things in most common file formats which are not completely specified and so contribute to general confusion about the encoding. Doing it right should not be so hard.
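For instance, a well specified name field pins down both the terminator and the maximum length. A sketch of a reader enforcing such a spec (the field layout here is invented for illustration):

```python
def read_name(blob, offset, max_len=32):
    """Read a NUL-terminated name field with an explicit maximum length.
    Both the terminator and the limit are part of the (hypothetical)
    spec here, not left to the reader's imagination."""
    end = blob.find(b'\x00', offset, offset + max_len)
    if end == -1:
        # No terminator within max_len bytes: the file violates the spec.
        raise ValueError('unterminated name field')
    return blob[offset:end], end + 1  # field value, offset of next field
```

Note that the function also answers the pointer questions from above: offsets are 0-based byte counts from the beginning of the blob, and the returned continuation offset includes the terminator.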

Seeing to the semantics

The idea of file format specifications is not just to give rules outlining all the things one can do and still produce a valid file. Rather, we attempt to give a meaningful encoding of some object. Thus giving field values, constraints and so on does not suffice. The relationships between field values, proper interpretation of field contents and usage guidelines should be included as well.

One common feature of file formats in the scene is that they assume hideous piles of historical knowledge. For instance, what does a BPM count mean in a .mod? Sometimes this is told, sometimes it is not. How does one know that in some cases, .mod support requires skipping effects update on the first time slot of a row? It is clear that such assumptions and exceptions should be spelled out. How about referring to some esoteric data structure, known only to the readership of some peculiar ACM periodical? (Computer scientists often make this mistake…) Offline references do not serve well in format specs, either.

Another point is field interdependency. Forbidden combinations are often left untold. This is bad since the whole idea of spec writing is to make people able to use a format, i.e., to write it. Few formats are so orthogonal that fields may assume any combination of values.

Many coders think that talking about how the data is rendered or otherwise used is not a thing to do in a file format spec. But such things are an inherent part of what the files mean. They cannot be separated from the format—the files are of no use if such things are not discussed somewhere. We can think of semantic specification as a poor man’s version of providing the procedural side of an abstract data type. So, at the very least, pointers should be given to online material in which implementation and semantic details are discussed.

Types of file formats

Unlike relational database design, for example, file format design is not a well defined science. Rather, it is quite an intuitive area of coding and so is really nothing more than a bag of tricks. To add to this bag, some common high level strategies are outlined in this last chapter. I will try to briefly classify the different tactics taken by format designers throughout the years, and give some of the plusses and minuses of their approaches.

Text based formats

Possibly the oldest and most basic of all file formats is the one composed of pure, binary encoded text. The roots of this representation reach back to the age of telegraphs and punched cards. Back then, two common encodings existed: EBCDIC and ASCII. The former was pushed by IBM as part of its early mainframe designs, the latter was ANSI standardized for telegraph use. Nowadays the most common encoding encountered is 8-bit extended ASCII (XASCII). Codes in the upper half of the code space are assigned variably to special characters or some foreign alphabet, or are not specified. (These assignments are called code pages in the DOS/Win world.) No structure is present in addition to line and page breaks, although ASCII control codes do have provisions for record framing etc. On top of pure text, many transfer encodings have been defined. In most formats of this kind, portability is so great a concern that pure 7-bit standard ASCII is used—in some cases, mail transfer and legacy software can strip the eighth bit, so ASCII encodings are considered more robust than binary ones. ASCII can also easily be edited by hand, quoted in text and put on paper for publication. Some prime examples of text based formats are PS, PDF (to a lesser degree), TeX, most formats used in document management applications and the Internet (SGML, HTML, XML, DSSSL, CSS, SDP, VRML) and some scientific ones (such as FITS).

It is true that these formats are robust. On the other hand, ASCII usually incurs a certain penalty in performance—trying to use absolute offsets in ASCII files is quite risky. (How do we seek in a text file? Line break conventions vary and text may be filtered in transit. This makes unambiguous offset calculation impossible.) Since binary data must first be encoded in ASCII, length calculations, seeking, indexing and selective loading become difficult. Techniques like error correction and incremental update rest on the latter, so they must often go if an ASCII encoding is desired.

One provision for ASCII must be made, however. If your program handles data with little structure (text, pictures, sound), having an alternative, descriptive text encoding is worth considering. It can greatly aid in testing and program development and in the occasional need to paste into text.

Pure binary data

Files of this type aren’t so numerous these days. This is good. It is quite clear that pure data with no identification or properly defined structure just begs for misinterpretation, compatibility problems, undetected transmission errors and the sort. Sometimes file structure is sacrificed in favor of out of band identification, though. This need not be bad, especially if the habit is so spread that nobody misinterprets the files. An example would be .au files: there is no structure apart from the name extension or external file type telling sound data is at hand. The same goes for certain other files (e.g. user profiles and initialisation files), especially in the UNIX world. It is clear that this type of file is not suitable for anything but the most basic storage and shouldn’t be used if there is any alternative. Since alternatives do not always exist, one should be able to dump simple data (text, pictures, sound) as pure data, though. That can help when no common ground exists between the programs the user has.

Record based formats

Nowadays record based formats are almost extinct in the PC world. That is understandable since their roots are in the commercial and legacy mainframe arenas. Most record based formats have been used as private data stores in database, commercial transaction and customer service systems provided by, e.g., IBM. Some operating systems (e.g. VMS) also propagate the idea of record based storage. In some cases this really makes sense: if your data can easily be structured in fixed length record form, record based storage is the fastest and easiest way to deal with storage. Second, if one has direct control over the file system (as some database products do, for example), directly using disk sectors to store records can greatly accelerate caching and processing. The problem is, in most computer systems of today such strict control is not available. Most contemporary operating systems have been designed to run multiple programs simultaneously and to offer the programmer a common, simple file abstraction instead of direct, fine grained control. Generally files come in the form of flat byte streams. When this is the case, strict record based thinking tends to lose its charm.

Header file formats

This is by far the most common type of file today. Most programs use a header with some minimal amount of identification data, some parameter and length fields, and pure data after that. Files of this type usually serve a single, well defined purpose, are meant for one-shot operations (i.e. no incremental update), do not extend well and are probably the easiest to construct. As long as they are not used for extensive amounts of data, they serve their purpose well. Examples include almost all module types, some compressed file types (e.g. compress), most images (notably JFIF), some word processor files and configuration files. Using a header is easy since one can simply lay it down and then dump whatever data one has after it. Reading and writing become serial, the coding often becomes relatively efficient and I/O buffering is very easy. On the downside, if there is potential for large files, caching the whole file in memory may not be possible. Updating equals rewriting. Extensions are usually implemented by extending the header. This soon makes for version incompatibility, clutter and slowdown. To summarize: if the amount of data stays below, say, 1-2MB and newer format revisions are not expected, use a header. Otherwise, don't.
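A header format really is this simple, which is much of its appeal. A sketch with an invented layout: a magic number, a version, flags and a payload length, with byte order and field widths fixed explicitly rather than inherited from a C struct.

```python
import struct

# Invented layout: 4-byte magic, 16-bit version, 16-bit flags,
# 32-bit payload length -- little-endian, no implicit padding.
HEADER = struct.Struct('<4sHHI')

def write_header(version, flags, length):
    """Lay down the header; the payload is simply dumped after it."""
    return HEADER.pack(b'DEMO', version, flags, length)

def read_header(blob):
    """Parse and validate the header before trusting anything else."""
    magic, version, flags, length = HEADER.unpack_from(blob, 0)
    if magic != b'DEMO':
        raise ValueError('not a DEMO file')  # cheap misuse detection
    return version, flags, length
```

The magic number is the minimal identification the pure-binary formats above lack, and the version field is the usual (if eventually cluttered) extension hook.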

Block based formats

Tagged and highly polymorphic file formats are an emerging breed. Examples include EA‐IFF, RIFF (e.g. .wav) and some module types. The idea is to organize data in chunks which can often be nested. Basically, each such chunk has a length and a type—the length can be used to skip a block, the type tells what is inside. Sometimes—especially if the format is generic, like IFF—special grouping blocks are available which allow the creation of composite objects. Constructs of this kind facilitate easy extension—all that is needed is a new block type. Since blocks can be made quite independent of one another, shuffling them around is quite easy and efficient. This is why many packers use similar constructs for their archives (e.g. LHA, ARJ, RAR and ACE). The idea also has a certain simplicity and elegance to it. Very regular and easily read formats can be constructed on the block paradigm. The downside is that incremental update is difficult to implement and access is inherently serial. Fault tolerance can also be low since a broken length field can render the rest of the file unusable.
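The chunk discipline can be sketched in a few lines of Python (RIFF-style framing assumed: a 4-byte id, a 32-bit little-endian length, the payload, and a pad byte after odd-length chunks). The length field is what lets a reader skip block types it has never heard of.

```python
import struct

def make_chunk(cid, payload):
    """Frame a payload as an (id, length, data) chunk, RIFF-style."""
    pad = b'\x00' if len(payload) & 1 else b''
    return struct.pack('<4sI', cid, len(payload)) + payload + pad

def walk_chunks(blob):
    """Yield (id, payload) pairs; unknown types are skipped by length."""
    pos = 0
    while pos + 8 <= len(blob):
        cid, length = struct.unpack_from('<4sI', blob, pos)
        yield cid, blob[pos + 8:pos + 8 + length]
        pos += 8 + length + (length & 1)  # step over odd-length padding
```

The fragility noted above is also visible here: corrupt one length field and every chunk after it is misframed, since the walker has no flag bytes to resynchronize on.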

Hierarchical object stores

Traditionally, most applications have had their own file format. Nowadays, when software bloat has exploded user expectations way beyond what a single coder can ever hope to achieve, more than one program must be used to achieve a single task. For instance, to create a convincing electronic document, one may need to use a word processor, a database, a spreadsheet, a Web browser, a formula editor and copious numbers of utilities. Often big projects have lots of different kinds of data which should be stored and handled together. One soon begins to think it would be nice if all these things could be put in a single file, since that is what they constitute, logically. This is why a new breed of file formats has emerged. We’ll call them hierarchical object stores. They are based on object oriented ideas (especially data encapsulation and abstract data types) and utilize file formats which can hold a hierarchy of related objects. The objects are usually loaded and rendered on demand by their respective programs and application level integration lets people use intuitive operations to define the composite files. Windows uses this idea in the files created by the COM persistent storage module. Another example is OpenDoc. Object stores are an elegant idea, but they are often a heavy weight solution—they require whole environments and new APIs to be created in order to be useful to people. This is one of the main suspects in the current bloat of Win9x—integration is expensive.

Files of this type generally hold a complex array of data structures, starting with a header, continuing with a central directory and ending with complex linked list structures to aid in insertion. Basically, they are complete file systems constructed inside a file. Sometimes compression, links to external objects (à la OLE/DirectX) and network integration (see DCOM) are added. No low-cal stuff, then. If we seek a less extensive solution, file formats such as IFF and RIFF can be used—they represent the early attempts at creating a common, hierarchical file organisation.

In a sense, object oriented solutions like these are good. They present a coherent framework in which to build hybrid documents, something which would be enormously difficult otherwise. They also cultivate the idea of proper data encapsulation—something which takes us closer to clean OO design. On the other hand, most object stores display low performance and hence tend not to bend to serve the scene. They also lack portability, since the whole programming environment must be carried with them. Further, they really abolish the idea of file formats at the usual level—the programmer has very little say in how data is stored in the file-system-in-a-file, and crafting our own really misses the point. One still has a say in how the particular objects serialize themselves, but little meaningful engineering can be done since the fixed wrapper around the data mostly defies space and speed optimisation. Promoting encapsulation also means that objects must be ported instead of data—good if we can use a common standard for the code (like Java), bad if not (like protected mode asm, which we would love, of course).

Recently some scientific formats have ventured into the hierarchical format scene. The most notable example would be HDF. This is good, since such a marriage might spawn efficient yet general formats. (Scientists like Crays, and Cray-cycles cost.) It would be nice to see the scene adopt some of the ideas presented by such efforts.

Dimensional data and scientific formats

Files of this kind originate in the scientific computing community. They are primarily meant to carry the vast piles of data generated by supercomputers, satellites, meteorological and seismic observations and the like. As for structured data, CAD/CAM is the major source. The common factor here is that serious know-how has gone into optimization, normalisation, generalisation and abstraction in the design stage. Often extensive metadata, indexing, space management and update capabilities are present. Two major representatives of this genre are CDF (with netCDF and other derivatives) and HDF. Both formats admit single file or multiple file organization. They are meant to be used through standard APIs—they address not only the storage format but also related operations. The formats are efficient, if not very simple—CDFs of tens of gigabytes work quite well. In the structured data end of the spectrum, Sun’s XDR and many of the standard object database formats reign (notably CORBA persistence derivatives). Never heard of any of these? That is because most formats of the seventh kind are of little interest to the casual coder. However, if one wishes to store tremendous amounts of data, they are worth a look. They also showcase a number of clever ideas, both in the format and in implementations. (The libraries are mostly available in source form, for free.) Sparse arrays of high dimension are something which only sophisticated formats such as CDF can handle. The same goes for multiresolution data. (Imagine building a game with a 3D city model filling an entire CD, an image editor for true color wall posters or an audio editor for serious studio use (64+ channels at 24 bits and 48+ kHz) without such constructs. Sparse arrays, dimensionality, incremental updates and multiresolution encodings are useful when we are coding truly hardcore apps.) Again, such heavyweight formats are not meant for casual use, but some of the ideas could, to my mind, be assimilated.