Text, technology, i18n and art

A while ago, in the course of both work and site building, I bumped into something I never ever thought could be so complicated: text. Like most nerdy individuals I had pretty much equated text with what I’m able to type at my keyboard, with a little additional Scandinavian color for having to deal with ä’s and ö’s. That is, nothing severe. Of course you have a faint idea that not everybody writes like you do, but hey, really, if people don’t get by with simple Western characters, they’re probably not worth the attention anyway. Right?

Dead wrong. For one, type is by no means straightforward. It seems so quite simply because someone else did the hard work. Secondly, even with all their complications and the immense weight they have in the world of computing, Western writing systems are actually pretty unimaginative and boring. Once one gets even a tiny glimpse of how such beauties as Japanese Kanji and Indic cursive scripts work, there’s no return—one starts to think of written text as art, and what shows on screen rather pales in comparison.

This article is about getting the art of writing online. It mostly concerns itself with the technology but the motivation is first and foremost æsthetic. There is also a strong additional flavor of internationalization (i18n) because the up‐and‐coming multilingual Web well demonstrates the true depth of typography and the value of attention given to pure text.

Foundations

After we learn to read and write, we pretty soon develop a ton of automatisms which eventually shield us from the complications of the written word. Rapid and efficient use of text depends on our being able to process text atomically, without regard for its low‐level structure. From the functional standpoint, this is a necessity. From the æsthetic one it’s a disaster. Especially now, in the age of computers and desktop publishing, one rarely stops to wonder about such niceties as typography, the fine‐scale structure of what one writes (think about the leeway we give ourselves with email) or the possibilities given to us by text itself. We also rarely think about what could be different about text. This chapter tries to highlight the latter through a brief tour of some of the world’s writing systems.

But it’s just text!

 ‐illustrate by Unicode/Devanagari
 ‐unidimensionality as a parallel to spoken language—contrast with formal mathematical, musical and artistic notations

The first thing to consider about text, well before we even start to delve into how it is currently handled by computers, is what text actually is. The naïve answer is: graphically encoded speech. In most respects this is quite accurate. Text, especially in Western countries, displays many of the characteristics of spoken language. Text written in a phonetically derived alphabet is composed of a linear, unidimensional stream of characters, most of which have a more or less consistent relationship with the sound emitted when reading the text aloud.

Now, although basically correct, the accuracy of this characterisation varies with language and orthography, even within the family of languages encoded in the Latin alphabet. For instance, English has quite complicated rules of pronunciation, while Finnish spelling, with only two or three exceptions, is in one‐to‐one correspondence with the spoken form. Some polymorphism can also be observed in that some information encoded in text (such as French accents, common ligatures et cetera) isn’t actually phonetically significant. Nevertheless, the above description will suffice for the languages most Westerners are familiar with.

The picture gets far more complicated, however, when we start to consider languages outside our own cultural sphere. For example, in Hebrew, vowels, one of the features most important to the Finnish reader, play a greatly diminished role in written text; they may actually be dropped altogether from the written representation. The lesson of this first example is that not everything that is spoken is necessarily written on paper.

It is also not self‐evident that all writing systems are based on alphabets. Indic scripts and the Japanese kana substitute syllabaries for alphabets: entire syllables (typically consonant–vowel pairs) are encoded where the Latin alphabet uses approximately one letter per phoneme.

An even more disturbing example is East Asian orthography which is heavily based on ideographic forms. What is encoded is more in line with entire words/ideas than phonetic information. In fact, a Japanese person might well understand many of the ideas presented in Chinese text, while not being able to pronounce the text in its original language. The resulting wealth of characters, of which there are maybe tens of thousands, is a primary trouble‐maker in East Asian information processing tasks. Not surprisingly then we have a term for the text processing peculiarities of the area: CJKV (Chinese–Japanese–Korean–Vietnamese).

Till now we’ve operated under the assumption that we can easily identify from text the basic, discrete units constituting it: characters. But this need not be the case. Indic scripts are a prime example. In those, multiple adjoining syllabic units are ligated together to form compounds with very little visual similarity to the original ones. Hence it is extremely difficult for a reader unfamiliar with the script to even begin to guess where syllable, character or word boundaries lie. The same difficulty applies to a computer engaged in a rendering task. These ligatures are a special headache for font makers and for coders of font rendering software, since there are lesser used scripts in the Indic family whose ligatures have never even been exhaustively catalogued. Furthermore, the rules governing the visual appearance of the language may very well cause the display of text to advance in a somewhat nonlinear order (e.g. truncated syllables may modify pretty much arbitrary parts of the base glyph to which they belong, even when that base glyph lies anywhere from one to six syllables downstream of the one that eventually forms the ligature).

As the final example, we have a language of sorts which does not even follow the rule of unidimensionality established above: formal mathematical notation. It is well established that mathematical notation actually works in more or less two dimensions. It is no surprise, then, that math was notoriously hard to typeset before the advent of computers and relied heavily on handwriting.
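
To see this two‐dimensionality concretely, consider how TeX (to pick the canonical mathematical typesetting system) forces the writer to linearize a formula. A small LaTeX fragment, with the quadratic formula as my own choice of example:

    \[
      x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
    \]

The fraction stacks material vertically, the exponent climbs, and the radical spans its argument; none of that two‐dimensional structure survives in the linear stream of markup itself, which is exactly the point.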

Characters: alphabets, syllabaries and ideographs

Because of the immense number of different characters in their native orthography, the East Asian computing folks have a very straightforward and useful view of characters. To them, the idea of a character exists to limit the number of different conceptual units one has to learn in order to read and write. After one has a certain familiarity with computerized data representation in its general form, one is in a position to imagine notations without such well‐defined elementary units. But for human consumption such a system would probably be too unwieldy. Hence the concept of characters as a quantum level of text.

There are many classifications of characters. Perhaps the most important classes of characters in general use are alphabetic, syllabic, ideographic, pictographic, diacritical and punctuation characters.

Alphabetic characters are phonetically motivated ones and are usually organized in sets of relatively limited variability. For instance, few Western languages need more than 60 or 70 letters, case variations included.

Syllabic characters, such as the ones used to encode (selected parts of) Japanese and Hindi, usually come in somewhat larger sets and are used primarily with languages with a limited phonetic palette, or those which place timbral qualities semantically on par with phonetic ones. Chinese could be an excellent example, except that, for some historical reason unknown to the writer, it just doesn’t use a syllabary.

Ideographic characters encode entire ideas, something a Western person would likely use multicharacter words for. Unlike with pictograms, the outward appearance of ideographs has little to do with their meaning. These are what most people call Chinese characters, although they are used all over East Asia and in many aboriginal and tribal languages elsewhere. Since entire ideas are encoded, quite a number of ideographs are usually needed. The most extensive Japanese character dictionary (Dai Kan‐Wa Jiten, 大漢和辞典) comprises a staggering 50,000+ Kanji characters.

Pictograms, then, would include characters which are simplified pictures of what they denote. Egyptian hieroglyphs are a prime example. Some experts argue these are the most primitive form of written language, though by no means less expressive than the current ones. Most written languages of today do not use pictograms, but ancient Egyptian writing shows that pictograms, too, can accumulate into repertoires of considerable size.

Diacritics are a class of characters used to decorate other ones. Some languages (like the quốc ngữ form of Vietnamese) use them for extensive phonetic annotation, some (like French) for considerably milder forms of the same, in some languages they are there simply for decorative reasons, and some have actually adopted letters with diacritic marks as atomic parts of their orthography (the Scandinavian ones are the most important example for the writer).

Punctuation is differentiated from diacritics as being mainly a phenomenon of the written language, where these characters variably encode phonetic features (like periods of silence) and actual semantics which are not always expressible in spoken language (like the difference between full stops and semi‐colons).

Writing direction, cursive writing and complex scripts

Writing systems differ not only in the character repertoire they utilize. There are also differences in how characters are composed, in the level of polymorphism witnessed in the actual graphic form of the text, and in the direction in which the linear flow of text advances.

Perhaps the easiest of these to understand is the difference in orientation. This concept has its roots in the fact that once we have a text longer than a single character, we need some consistent way to lay it out without confusing its order. Usually this means that we start to compose the text linearly in some fixed direction. After the length grows to some tens of characters, the traditional two‐dimensional substrate (paper) and its higher tech derivatives (scrollable windows or console displays) no longer stretch to fit and we need to decide how to continue. This is where the concepts of major and minor writing direction come from, as do all the problems of line layout, hyphenation and pagination.

Most Western languages, and through Western colonial influence many others, are or can be written left to right, top to bottom. This is the order you are seeing before you right now. Other examples include top to bottom, left to right (traditional Japanese), right to left, top to bottom (Hebrew), boustrophedon (ox‐turning: alternating left to right and right to left while advancing top to bottom, or even bottom to top in some ancient scripts; this happens in some forms of ancient Greek) and others. Writing direction is the first stepping stone of any i18n‐savvy textual environment, since it is relatively easy to implement, crucial to languages such as Hebrew, and summons nasty interactions in environments where texts of varying directionality have to be mixed. Writing direction is also something that demonstrates well some of the deeply rooted expectations of Western computer science professionals—you might want to consider that HTML and CSS, which are what I’m writing this text in, simply cannot handle traditional Japanese text.
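
Directionality runs so deep that it is recorded per character in the Unicode database. A minimal sketch in Python (my choice of tool here, not anything the standards mandate), using the standard unicodedata module:

    import unicodedata

    # Unicode assigns every character a bidirectional category:
    # 'L' = left-to-right, 'R' = right-to-left, 'EN' = European number.
    for ch in ["A", "\u05d0", "1"]:   # Latin A, Hebrew alef, a digit
        print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch)}")

The character stream itself stays one‐dimensional; deciding what these categories should mean for display is a separate, and much harder, problem to which I return below.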

Above I already mentioned that many different modes of composition of text exist. Indeed, the Western one, with invariant alphabetic units combined into words, separated by spaces, structured into sentences with punctuation, laid out on lines and paginated, is certainly not the only way to go. The first minor variation of this model is witnessed in Chinese text: no word boundaries are marked and punctuation is minimalistic at best, if not nonexistent. Unlike in Western text, no hyphenation is used either. The flow of text is broken into lines almost without regard for the structure of the language, with punctuation being just about the only element which affects line formatting.

Similarly, in East Asian text space has a very different role than in Western text. As in Chinese, in Japanese space does not separate words but instead demarcates the division between adjacent characters. It is extremely important that this spacing is equal and well chosen. Variable pitch fonts are not often used, at least with ideographs. Kana syllables may sometimes use variable spacing, but this is more of an effect and a way to condense the presentation than a consistent typographical rule, as it is with modern Western type.

Finally, perhaps the most radical differentiation that can be made is between cursive and discrete writing. By this I mean the difference between placing separate characters side by side, as contrasted with the Indic scripts’ concept of text as a flow of ligated letter forms. Western people know the concept in the form of cursive calligraphy, which sadly has lost some of its status as a typographic discipline. Cursive scripts tend to cause great trouble to computers because, unlike the discrete characters of typical type, cursive characters join on a pair‐wise basis. This means that optimally no two distinct pairs of adjacent characters join equally, and perhaps that we might even want to join two identical characters in similar contexts very differently. Handling this sort of thing with the typical graphics programming construct, a lookup table of bitmaps, simply does not work.

When we deal with script fonts, we also bump into the trouble of defining what exactly the characters of a given writing system are—since characters join into each other in a seamless flow, the boundaries between them can become pretty hazy indeed. This is most readily seen in ligatures, of which the Western fi‐ligature is a representative instance. Here, we cannot plausibly render the two letters as independent, but must rather form a combined glyph where the top hook of the f extends to supplant the dot of the i, with the horizontal stroke joining the body of the i. It was already pointed out in the beginning that in Indic scripts this principle of joining extends to multiple adjacent characters, mutating them into something not directly recognizable as the original glyphs at all. Such ligation can also alter the perceived order of the glyphs’ features, in fact so much so that sometimes linear editing of the characters is pretty much impossible to implement. This is why characters (the abstract notation in the text, pre‐ligation) and graphemes (what the user sees as characters, post‐ligation) are two completely separate things. More on this sort of detail later on…
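
The character/grapheme split is easy to poke at in Python; a small sketch of my own, with the standard unicodedata module doing the work:

    import unicodedata

    # Two characters, one grapheme: 'e' followed by a combining acute.
    decomposed = "e\u0301"
    print(len(decomposed))                           # 2 code points
    print(unicodedata.normalize("NFC", decomposed))  # composes into a single é

    # And one character, two perceived letters: the fi ligature U+FB01.
    # Compatibility normalization splits it back into 'f' and 'i'.
    print(unicodedata.normalize("NFKC", "\ufb01"))   # fi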

Computable representations of text

 ‐the stream/flow aspect
 ‐the origins of nerdy byte‐centric thinking
 ‐the history of codes
 ‐the concept of code extension and escaping (see the sketch after this list)
 ‐the result of byte‐centrism in thinking of multibyte charsets as multidimensional
 ‐the tragedy of table‐form representation of 94/96×94/96 charsets which are then still unidimensional in text
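
One concrete teaser on the code extension point while these notes await proper prose: ISO 2022 style encodings switch character sets in mid‐stream with escape sequences, and Python’s standard iso2022_jp codec makes the mechanism visible (a minimal sketch; the sample text is mine):

    # ASCII and one hiragana character in a single stream: the escape
    # sequences designate JIS X 0208 and then switch back to ASCII.
    text = "abc あ"
    print(text.encode("iso2022_jp"))
    # b'abc \x1b$B$"\x1b(B' -- ESC $ B enters JIS X 0208, ESC ( B leaves it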

Text coding concepts

 ‐graphic characters
 ‐control characters
 ‐code points
 ‐bit combinations
 ‐combining characters (see the sketch after this list)
 ‐functions
 ‐protocol issues
 ‐escape sequences
 ‐multibyte encodings
 ‐multibyte character sets?!
 ‐switching/shifting/escaping
 ‐escape sequences
 ‐autodetection
 ‐designation/invocation
 ‐spacing/nonspacing characters
 ‐punctuation
 ‐diacritics
 ‐accents
 ‐invisible characters
 ‐code pages
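
Several of the concepts above (combining characters, spacing versus nonspacing characters, invisible characters) can be probed with Python’s unicodedata; a small sketch of my own:

    import unicodedata

    # General category, canonical combining class and name for a few
    # representative code points: a letter, a nonspacing mark, a
    # precomposed letter and an invisible character.
    for ch in ["a", "\u0301", "\u00e4", "\u200b"]:
        print(f"U+{ord(ch):04X}",
              unicodedata.category(ch),
              unicodedata.combining(ch),
              unicodedata.name(ch, "<unnamed>"))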

Plain text versus text based protocols

 ‐not all plain text is ASCII
 ‐not all ASCII text is plain text
 ‐ASCII has been used to build many protocols (IETF ones as examples) and further standards (SGML/XML/TEI); see the sketch after this list
 ‐there are multiple extensions of ASCII (ISCII…)
 ‐there are multiple variants of pure ASCII (ASCII does not specify use of control codes)
 ‐control codes are used in a variety of applications (MODEMs, tape, synch/asynch line syntax…)
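
Two of these points are easy to make concrete; a hedged Python sketch (the request and host name are just illustrations):

    # IETF protocols are "just ASCII" -- but they lean on control codes:
    # CR LF pairs (0x0D 0x0A) delimit the lines of this HTTP request.
    request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
    print(request.decode("ascii"))

    # Conversely, text that looks perfectly plain need not be ASCII at all:
    try:
        "ääkköset".encode("ascii")
    except UnicodeEncodeError as e:
        print(e)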

Encoding and storage

Coded character sets

ASCII/ISO 646 IRV

ISO‐8859 series standards

 ‐amazingly, the š‐ and Š‐characters, needed for formally correct Finnish, are missing from ISO‐8859‐1! (see the sketch below)
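
A quick Python demonstration; ISO‐8859‐15 (Latin‐9) is the later revision of the standard that added the missing letters:

    # š (U+0161) does not fit into ISO-8859-1...
    try:
        "š".encode("iso8859-1")
    except UnicodeEncodeError as e:
        print(e)

    # ...but ISO-8859-15 carries it at code point 0xA8.
    print("š".encode("iso8859-15"))   # b'\xa8'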

Japanese Industrial Standards (JIS)

Other Asian national standards

Proprietary character sets and code pages

Encoding methods

ISO 2022 aka ECMA‐35—the mother of all encodings

EUC is a locale‐fixed variant of 2022‼

Shift‐JIS: can you say M$?
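
The three deserve a side‐by‐side look: the same two kanji come out very differently on the wire. A minimal sketch using Python’s standard codecs:

    text = "漢字"
    print(text.encode("iso2022_jp"))  # b'\x1b$B4A;z\x1b(B' -- stateful, escape-driven
    print(text.encode("euc_jp"))      # b'\xb4\xc1\xbb\xfa' -- the JIS code points with the high bit set
    print(text.encode("shift_jis"))   # b'\x8a\xbf\x8e\x9a' -- rows shuffled around, no escapes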

Unicode/ISO 10646

The Unicode character model

 ‐combining characters (sketched below)
 ‐Hangul jamo
 ‐separation of style
 ‐Han unification
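
Two of these model features, combining characters and Hangul jamo, show up nicely in normalization; a small Python sketch:

    import unicodedata

    # A base letter plus a combining diaeresis composes into one
    # precomposed character under NFC.
    print(unicodedata.normalize("NFC", "a\u0308"))            # ä

    # Three Hangul jamo (initial HIEUH, medial A, final NIEUN) compose
    # into the single precomposed syllable 한 (U+D55C).
    print(unicodedata.normalize("NFC", "\u1112\u1161\u11ab"))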

Logical ordering and composition

 ‐extreme intelligence required of the reader
 ‐what’s the deal with bidirectionality attributes? they are badly misplaced: bidirectional rendering is not the problem, but the fact that once we write in some specific direction, embedding anything else is difficult
 ‐in the common case that the embedding should be opposite, it is possible, in the one‐dimensional abstraction, to embed the other string by inverting its logical direction; but this really isn’t the problem of a character encoding. the encoding should perhaps aid (maybe even give hints as to which direction some characters are supposed to be written in), but the selection of writing direction is clearly a part of an upper layer protocol
 ‐what the bidirectional algorithm does is one specific instance of the general problem: what should be done when we have to embed a string of characters with whatever native direction into a stream of characters written in another?
 ‐possible answers: we can rotate the line to fit; we can print the characters in the current direction, just pretending direction isn’t important (it is); and if the direction of the embedding is coincident, we can embed it one‐dimensionally and distribute into lines afterwards

UCS‐4, UCS‐2, UTF‐16, UTF‐8 and UTF‐7
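
To show what the names mean in practice, here is one character from outside the Basic Multilingual Plane pushed through the different encoding forms (a Python sketch; the clef is merely a handy example):

    # U+1D11E MUSICAL SYMBOL G CLEF in four Unicode encoding forms.
    ch = "\U0001d11e"
    print(ch.encode("utf-32-be").hex())  # 0001d11e -- fixed four bytes (UCS-4)
    print(ch.encode("utf-16-be").hex())  # d834dd1e -- a surrogate pair
    print(ch.encode("utf-8").hex())      # f09d849e -- four variable-width bytes
    print(ch.encode("utf-7"))            # b'+2DTdHg-' -- the mail-safe ASCII disguise

UCS‐2, being a fixed sixteen bits per character, simply cannot represent it at all.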

Typography

Fonts, glyphs and elementary text composition

Legibility vs. beauty

 ‐legibility aids: serifs (this site is a horrible example!)
 ‐spacing
 ‐x‐height
 ‐ascenders/descenders
 ‐variable pitch characters and word formation (the effects of kerning—is there a physiological reason?)
 ‐topological variation in alphabets
 ‐some analogies between typographical and phonetical forms (the width of consonants and vowels and the correlation to acoustical length)

Types of type

Ligatures and script fonts

Font formats and support for i18n

Internationalisation in the Web

Basics: MIME, HTTP, XML and XHTML

Data input and form fill‐out

Searching and indexing

Web typography, or lack thereof

Epilogue: i18n—is that edible?