decoy’s HTML tutorial

An overview of HTML

HTML is the language we use to create Web pages. It’s an application of a far older and more comprehensive framework called SGML. As a result, HTML inherits many features from SGML, like the hideous entity references any valid, non‐English page is bound to contain, the angle bracket notation for marking up text and the hierarchical, nested markup structure of documents. This little tutorial talks about HTML version 4.0 which was, at the time of writing, the newest. Most of what is said applies to XHTML version 1.0 as well, which is what this document is now written in. However, the broad set of dynamic features and server side processing related things (like forms) are left untouched. This text is about static HTML. The text also references older versions of the language, but only for completeness: for the purposes of this text, the main difference between older versions and 4.0 is presentational markup, which shouldn’t be used anyway, if possible. Sometimes we need such markup for compatibility, so some of the more important features are described. (As a matter of fact, these features are also available in the looser, Transitional variant of the 4.0 specification.)

HTML, as an acronym, comes from the words HyperText Markup Language: it’s a language for marking up text documents. Hypertext, in turn, is text where we use links to break up the traditional linear way of reading and also include all sorts of multimedia stuff. The idea is that the reader chooses what to read and follows links accordingly. Multimedia enters the picture because an architecture which is not bound to paper can represent data much more efficiently by abandoning pure text. In the context of markup languages, we also talk about functional markup, which means that the markup we use reflects the structure of the document. This is in stark contrast with presentational markup, which simply tells how the document looks. For instance, tables are good for tabular data but should never be used for layout purposes. HTML provides ample opportunity for functional markup since most of the really important structures like lists, text paragraphs, images, objects external to the document, scripts, tables, addresses, emphasized text and so on are present. In the following, we take a short tour of the more important HTML elements. It might be useful to see how this document has been built up, as well, since it has been built in a fairly puritanical manner.

The structure of HTML

An HTML document is simply a plain text file which can be edited with any text editor. Text is encoded as‐is and is intermixed with markup telling what the text is. Markup is composed of easily identifiable tags which HTML applications use to do things. The tags themselves do nothing, however: HTML is a declarative language in the sense that none of the structures in it do anything but simply stand for specific data types, like lists or paragraphs. Also, unlike in older publishing languages, the tags do not cause a fixed line or page break but rather delimit logical units of the text. HTML is not a WYSIWYG publishing tool either.

Another thing to note about HTML is that, as an SGML application, it consists of structure markup. This means that the language has lots of structure (like marked text inside differently marked text) and it also places limitations on how the different structures can be used (plain text cannot occur directly inside the document body, for instance). There are a bunch of rules on how the different elements of the document may be nested and whether certain series of successive elements are legitimate, and also semantic conventions which tell how common structures should be implemented in the HTML environment. The basic principle is that we always concentrate on the structure of the document instead of its layout. For instance, a list is not a pile of indented, bulleted lines of text but rather a logical unit consisting of a sequence of elements which may in turn have their own inner structure and semantics. Thinking about the structure of the document like this quickly leads us to use nested structures and content models, which tell which elements a given element may contain and in which order. For instance, we might have a book comprised of parts which are composed of sections in which we have paragraphs which in turn contain lists of mixed text and figures. All this is quite possible in HTML, although the representation is not quite so easy to comprehend before the document is rendered in a web browser which knows what the logical structure should look like on screen. Specifically, HTML (like all SGML) abstracts each single logical part of a document, like the ones listed above, as an element.

In actual documents the manifestation of elements is a specific tagging syntax. We place angle brackets around parts of text we want the computer to interpret. Each such bracketed expression is a tag. A single element is represented by a start tag (which also includes textual attributes), the element content (just text or more elements) and the end tag. That is, <something size="123">456</something> is an element named something with one attribute (size) of value "123" and the pure text content "456". Element and attribute names are case insensitive (but use small letters since HTML is going there). Whitespace inside the start tag separates successive attributes. The end tag has nothing but the element name, preceded by a slash. Note that if an attribute value contains whitespace, it must be quoted like above; unquoted values may only contain letters, numbers and a few special characters. (We should always quote because in a couple of years, that will be mandatory.) Since both tags and attributes are markup, they will not be directly shown on‐screen but are used to derive formatting instructions for the browser.
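
The anatomy described above can be sketched with real HTML elements. The class name, title text and file name below are made up for illustration only.

```html
<!-- Start tag with two attributes, then text content, then the end tag. -->
<p class="note" title="A short example">This text is the element content.</p>

<!-- Names are case insensitive in HTML 4.0, so this is the same element type.
     Lower case is still the safer habit. -->
<P CLASS="note">Upper case works, but is best avoided.</P>

<!-- An attribute value containing whitespace must be quoted. -->
<img src="map.png" alt="Map of the campus">
```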

Basic constructs

The document

The top level structure of a document is the following: first we start an HTML document with <html>, then comes the document header after <head> (inside which we must have at least the document <title>), after which we start the actual document body with <body> and finally end the HTML document with </html>. This already shows that not all elements need to be ended explicitly. For instance, the canonical syntax for the above would have included ending the head before the body, and the body before the whole document ends. But since the structural definition of an HTML document says that the body can only occur after the head, not inside it, any HTML application can automatically close the head in the right place when it sees the body start tag. There are some elements for which including the end tag is actually an error: they do not have content at all, only attributes, and hence end of content markup would be stupid. Images are an example. Most elements can be ended explicitly, but where the end tag is optional, common style is to leave it out. List and table entries are good examples.
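
As a rough sketch, the canonical form of the top level structure looks like the following. The title and paragraph text are placeholders, and the document type declaration that a valid document starts with is left out here.

```html
<html>
  <head>
    <title>A minimal document</title>
  </head>   <!-- optional: implied by the body start tag -->
  <body>
    <p>The visible content goes here.
  </body>   <!-- optional: implied by the end of the document -->
</html>
```

The paragraph is deliberately left without an end tag, since the browser can infer where it ends.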

Often we would like to have a nice background image in the document and perhaps some unusual colors. The right and trendy way to accomplish this is to use CSS style definitions. But if your browser does not support these or you are just lazy, there is an alternative way. In older HTML versions the same effect is achieved by using the text, bgcolor and background attributes of the <body> element. The first sets the text color, the second the color of the background and the third gives the URL of an image to be tiled in the background. These are especially useful with older Netscapes, whose CSS support is even now inferior to that of Micro$oft’s browsers.
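
A minimal sketch of the legacy approach, using the attribute names from HTML 3.2 and HTML 4.0 Transitional (text, bgcolor and background). The image file name paper.gif is hypothetical.

```html
<!-- Presentational attributes on body: not available in HTML 4.0 Strict. -->
<body text="#000000" bgcolor="#ffffff" background="paper.gif">
  <p>Black text on a white background, with paper.gif tiled behind it.
</body>
```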

Headers and rules

The next important element to know is the heading. It is declared by using the elements from <h1> to <h6>. A larger number means a smaller heading. More control is achieved through style sheets. The headings are situated outside the actual text paragraphs, tables and whatever, inside either a <body> or a <div> element. The content of a heading is the text to be rendered. Version 2 of the CSS style language also makes it possible to number headings automatically, but browser support is nonexistent. It is better to just number any headings by hand.

Another useful way to separate parts of the document from each other is to use a rule. It is often used when there is a clear division but a separate section (with a title) is not an option. Notes, sidebars and end of chapter lists/summaries are excellent examples. A horizontal rule is created with the <hr> element. This is an empty element with no end tag and it is permitted in precisely the same parts of the document as the headings are. The element is most commonly rendered as a thin horizontal line spanning the width of the page, perhaps with some shading.
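
Headings and rules might be combined as in the following sketch. The heading texts are invented and the numbers are written by hand, as suggested above.

```html
<body>
  <h1>1. Title of the document</h1>
  <p>Introductory text.
  <h2>1.1 A subsection</h2>
  <p>Text of the subsection.
  <hr>
  <p>A closing note, set apart from the section above by a rule.
</body>
```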

Text and sections

When we want to include pure text paragraphs, we use the <p> element. Paragraphs cannot nest so the browser knows to end the previous one if a new one is started. No end tags are needed. Another common grouping construct is the division, <div>. Divisions can be nested and permit all kinds of other content (like tables and lists) as well. Consequently divisions need to be ended—the browser cannot know whether the next element is a nested element or simply follows the division if no end tag is present. Divisions are mainly used to group other elements, but can be used as a text container as well.
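
A small sketch of how paragraphs and divisions behave. The content is placeholder text.

```html
<div>
  <p>First paragraph.
  <p>Second paragraph. Its start tag implicitly ends the first one.
  <div>
    <p>A paragraph inside a nested division.
  </div>   <!-- required: otherwise the next paragraph would stay nested -->
  <p>Back in the outer division.
</div>
```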

As for attributes, <div>s, <p>s and almost all other elements can carry the lang attribute. This marks a specific element as having content in a specific language. Nowadays it’s practically useless since no support is present. But in the future it will allow multilingual document construction and automatic language selection by the browser. Some further generic attributes exist: class is a whitespace separated list of class names to which the element belongs and id is a unique textual name for the element. Both are used in stylesheets to attach style information to specific elements. Classes are used when multiple elements utilize the same style information (like the foreground color). The id attribute must be unique within the document so it cannot be used similarly. Instead it is very useful when we must be sure that some style information is attached to just one specific element (like the size of a picture). IDs are also used within HTML to refer to a given element, like when a link map needs to be attached to multiple images or when we need to jump to a specific part of the document with a link. The problem is, Netscape does not support the id attribute and instead utilizes the legacy name attribute of some elements (like anchors and imagemaps) for the same purpose. In the newest HTML work name is deprecated and it’s also of very limited applicability. Use id when possible. In addition to lang, id and class, HTML 4.0 also permits a title for an element. It is used to give a short human readable title and is most commonly rendered as a popup/tooltip.
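
The generic attributes could be used roughly like this. The class names, the id and the title text are all invented for the example.

```html
<!-- Two classes, a language code and a tooltip title on one paragraph. -->
<p class="warning important" lang="en" title="Read this first">
  Back up your files before continuing.

<!-- A paragraph marked as Finnish. -->
<p lang="fi">Tämä kappale on suomeksi.

<!-- An id names exactly one element, which a link can then target. -->
<div id="summary">
  <p>The main points once more.
</div>
<p><a href="#summary">Jump to the summary</a>
```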

In the age before CSS, some elements also permitted formatting attributes for this kind of layout. What CSS now calls float was done with the align attribute of images (and tables): the values left and right floated the object to the side in question, letting other objects (like text) flow around it. To control the flow, the clear attribute of the line break element <br> was used to tell that the following content wants a clear left, right or all (left and right) margin.

Pure textual information is often longwinded and boring. It is also quite difficult to find the salient points in the flow without reading the whole thing really carefully. This (and machine processing) is why we need markup to differentiate between different types of text inside a paragraph. The currently legal elements for this include <em> for emphasis, <strong> for stronger emphasis, <dfn> for the defining instance of a term, <abbr> for an abbreviation, <acronym> for acronyms and so on. The exact list can be found in the HTML specification, maintained by W3C. It must be noted that these elements do not tell how the text should be rendered but are semantic in the sense that they tell something about the meaning of the text inside. The older versions of HTML also had presentational tags like <u> for underline, <b> for bold, <i> for italics and <strike> for strikethrough. Even worse, a <font> element with face (font name) and size attributes was present. Documents formatted with elements such as <font> are practically incomprehensible to human beings without being rendered by a browser.
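
A short sketch of semantic inline markup inside a paragraph. The sentences are invented, and how each element is rendered is up to the browser and the style sheet.

```html
<p>A <dfn>content model</dfn> tells which elements a given element may contain.
It is <em>usually</em> written down in the
<abbr title="document type definition">DTD</abbr>,
and you should <strong>never</strong> guess at it.
No <acronym title="what you see is what you get">WYSIWYG</acronym>
editor is needed to write markup like this.
```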

Images

Images are a very important part of the Web environment. They are declared with an element called <img>. As attributes we give a source (attribute name src), which gives the URL from which the image can be fetched, and alt (alternative), which gives a short textual description of the picture to be shown while the image is being retrieved. Even more importantly, alt is used in nongraphical browsers to represent the image, as they cannot display the image itself. Choose alts with care. In principle, from HTML 4.0 upwards alt attributes are mandatory in images. If you need to tell more about the image, use the longdesc attribute, which gives the URL of a longer description. Images cannot be used outside text content, so they must always appear inside divisions or paragraphs. The older HTML versions defined width and height attributes as well, so that the picture could be stretched and room reserved for it on the page before the image data itself was fetched. Nowadays we should use CSS for this as well. Both methods admit pixel sizes (the default in HTML) and percentages of the available space (mark with a trailing % sign). It’s quite clear that percentages are more robust: after all there is no way of knowing what the screen size of a particular viewer’s computer is. It is not in the spirit of the Web or HTML to build sites which are dependent on screen sizes or the browser which is used to view the pages.
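
An image element using the attributes discussed above might look like this. The file names and the description are hypothetical.

```html
<p>
  <!-- img is an empty element: no content and no end tag -->
  <img src="chart.png"
       alt="Bar chart of monthly visitors"
       longdesc="chart-description.html"
       width="50%">
```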

Links

It was already mentioned that HTML is supposed to represent hypertext. So we need links. In HTML all links are embedded into the document, they have but one target and can do nothing but go to the document which the link points to. We define a link with the anchor (<a>) element. The content of the element is what is shown in the user interface as the link (can be text and/or images) and the end point of the link is given with the href attribute. The target is a URI, of which URLs are a special case. For instance, http://www.iki.fi/~decoy/front is a URI pointing to my current homepage while mailto:decoy@iki.fi is a link which lets the viewer send mail to my e‐mail address. The part before the colon gives the protocol/access method to the resource, the rest tells where the resource is. Other examples are links to FTP servers, like ftp://ftp.funet.fi/ or even such more esoteric resources as WAIS and Gopher servers. The slash after the server name/last directory name is often forgotten but should definitely be included: some sites permit files and directories of the same name and those are told apart by the trailing slash. And even if a given server is configured to redirect malformed queries for a specific directory (URL ends with the name) to the right one (slash ended), there will be extra load on the Net. That is bad Netiquette. Also remember that the full name of the server should be used. There are lots of servers out there whose addresses do not start with a plain www. For instance: http://www-library.itsi.disa.mil/. Again we see how valid syntax really matters.

The above examples use strict, absolute URIs. But there are also relative forms, just like in operating system file names. For instance, a URI starting with ../ tells to look one step higher in the user visible directory hierarchy of the server. ../../dsound.html would mean two steps up and a specific file and so on. It is also possible to reference particular ids within a given document. This is done by appending a hash sign (#) and the name of the id to the URL. In fact, the URL part can be dropped: a link which simply consists of a hash mark and an id will target the given id in the same document. HTML 4.0 lets us define ids in almost any element. Parts of a text element can be made link targets by using the id attribute of a <span> element wrapped around the text in question. (The HTML 3.0 draft proposed an empty <spot> element for marking a single point, but it is not part of HTML 4.0.) Netscape does not support such constructs but instead prefers the older form in which we define an anchor (<a>) element with the link target id given in the name attribute. This is too bad since named anchors will probably go away in time and they make no distinction between links to a point in the file and links which reference a whole part.
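
The different link forms can be sketched side by side. The two absolute URIs and dsound.html come from the text above, while index.html, notes.html and the ids are made up for the example.

```html
<p>
  <a href="http://www.iki.fi/~decoy/front">An absolute URI</a>,
  <a href="mailto:decoy@iki.fi">a mail link</a>,
  <a href="../index.html">a file one step up</a>,
  <a href="../../dsound.html">a file two steps up</a>,
  <a href="notes.html#lists">an id in another document</a> and
  <a href="#tables">an id in this document</a>.

<!-- Link targets: an id on a whole element, or on a span inside text. -->
<h2 id="tables">Tables</h2>
<p>Some text with <span id="keypoint">a marked phrase</span> in it.

<!-- The legacy form preferred by older Netscapes. -->
<p><a name="oldstyle">A named anchor</a>
```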

Lists

In addition to vanilla text, HTML gives us markup for some structured forms of text as well. Lists and tables are the major examples. There are three kinds of lists: ordered, unordered and term definition ones. An ordered list is created with an <ol> element, an unordered with <ul> and a definition list with a <dl>. The ordered and unordered lists have the same structure with multiple embedded list items (<li>), the definition list differs somewhat. Lists can be recursive, that is, there can be new lists inside list items. Usually we do not end list items since they cannot nest (a list item directly inside a list item is not legal, though a list item inside a list inside a list item is). Ordered lists are usually rendered with numbering, unordered with bullets. Ordered lists permit control over the numbering with the start (first number) and type (1: arabic; a: lower alpha; A: upper alpha; i: lower roman and I: upper roman numbers) attributes.
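
A sketch of an ordered list with a nested unordered list. The item texts are placeholders. Note that start is always given as a number, whatever the type.

```html
<ol start="3" type="a">
  <li>First item, labelled c
  <li>Second item, which contains a nested list:
    <ul>
      <li>A bullet
      <li>Another bullet
    </ul>
  <li>Third item
</ol>
```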

Definition lists do not use list items but rather a list of term‐definition pairs. Terms are marked with <dt> (definition term) and definitions with <dd> (definition description). Only the definition part can have complex substructure; the content of the term is similar to paragraph content.
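
A definition list might be written like this. The terms and definitions are only examples.

```html
<dl>
  <dt>Element
  <dd>A logical part of a document. Definitions may hold further structure:
    <ul>
      <li>a start tag
      <li>content
      <li>an end tag
    </ul>
  <dt>Attribute
  <dd>A named value given in the start tag of an element.
</dl>
```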

Lists and tables are prime examples of why correct syntax matters. For instance, there cannot be text directly inside a list. Only list items can appear there. In newer M$ browsers incorrect markup of this kind will produce some pretty wild errors (like text moving outside of the list etc.).

Tables

Tables are surely one of the most important features of HTML. They are created with the <table> element and they must be ended. Table content is grouped into rows (<tr>: table row) of equal length; each row contains the same number of table cells. Rows are rarely terminated. Two cell types exist: <td> is an ordinary table data cell, <th> is a table header cell. Cells contain content similar to paragraphs and are usually not ended either. The most important table attributes are rowspan and colspan, which can be given in data and header cells to join multiple cells into one larger one. The values are numerical and give the size of the cell, vertically or horizontally (the default values are 1, of course). The space taken by the extension protrudes into the following rows and columns. Cells created this way must not overlap, and the cells covered by the spans must be left out of the data that follows. Tables can also have <caption>s. This element is placed inside the table but before the actual row data. Tables are recursive structures as well: table cells can have embedded tables.
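
The following sketch shows a three column table with one cell spanning two rows and another spanning two columns. The data is invented. Every row still accounts for three columns once the spans are counted.

```html
<table>
  <caption>Opening hours</caption>
  <tr><th>Day       <th>Morning  <th>Afternoon
  <tr><td>Monday    <td>9-12     <td rowspan="2">13-17
  <!-- the third cell of this row is covered by the rowspan above -->
  <tr><td>Tuesday   <td>10-12
  <tr><td>Wednesday <td colspan="2">Closed
</table>
```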

Frames

Frames are a technique used to create multiple independent partitions in a web browser window. Graphical browsers display the result as multiple separately scrollable subwindows (frames) of the main browser one, each of which behaves like a browser in itself. To control what is shown in the frames, a target attribute is used in links to tell which frame should show the link endpoint. As special cases, the top level browser window or a newly created copy can be used as the link target. Frames are a recursive construct as well, so it is possible to show in a frame a document which creates further frames. On the plus side, frames help create structured views to pages very quickly and also permit a characteristic, tidy look. The downsides are that only leading graphical browsers support frames completely, that documents become difficult to manage, that the methodology is not completely device independent (frames, like tables and columns, are basically a graphical construct which means they are not good functional markup) and the fact that some people are deeply annoyed by frames. (For instance, some heavily subsidised sites use the window creation facility to display ads and banners in a completely indiscriminate manner.)

Thanks to Netscape, the frame facility was added to HTML 4.0 as well. Before that, frames were a Netscape novelty and only partially documented. The fourth version of HTML clarifies the situation by strictly separating frame definitions from the actual document data. This is done by declaring a separate frameset document which can only create frames and which merely links to the initial content of each frame. After this, the frames are completely independent of the frameset definition.

The doctype declaration for a frameset document is <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN" "http://www.w3.org/TR/REC-html40/frameset.dtd">. In a frameset, no <body> is allowed; it is substituted with an element called <frameset>. Horizontal and vertical divisions are given with its rows and cols attributes. The frameset element contains <frame>s, one for each partition, which use the src attribute to link to the content of a given frame. Scrolling and margin information can be specified for each frame as well. In addition to these, there is a <noframes> element which is used to specify an alternative text in case the frame facility is not present. This should always be used, and not just to tell the user to get a graphical browser. End tags are left out only from the <frame> elements, which are empty. In framed documents, link syntax is almost identical to the usual one. The only difference is that we have an attribute called target which specifies the name of the frame receiving the link target. The name is specified with the name attribute of a <frame> element in the frameset declaration. Alternatively a special name can be used. These start with an underscore and can be used to open the link in the current frame (_self), the parent frame (_parent), the top level browser window (_top) or in a separate newly created window (_blank). It must be noted that since frames are not pure HTML, the target attribute is not available in the strict form of HTML 4.0; you will need to specify the document as being HTML 4.0 Transitional instead of HTML 4.0 Strict. This happens by changing the above document type declaration to use the word Transitional instead of Frameset and the system identifier loose.dtd instead of frameset.dtd.
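
A frameset document along these lines might look like the sketch below. The frame names and file names are hypothetical. In HTML 4.0 the content of each frame is given with the src attribute of <frame>.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN" "http://www.w3.org/TR/REC-html40/frameset.dtd">
<html>
  <head>
    <title>A framed site</title>
  </head>
  <!-- two columns: a narrow menu and the rest of the window -->
  <frameset cols="25%,*">
    <frame name="menu" src="menu.html">
    <frame name="content" src="front.html">
    <noframes>
      <body>
        <p>This site uses frames. The same pages can be reached through the
        <a href="menu.html">plain menu</a>.
      </body>
    </noframes>
  </frameset>
</html>
```

A link inside menu.html would then name the receiving frame, for instance <a href="news.html" target="content">News</a>.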

My opinion is that frames are not good unless they’re targeted strictly for a known browser and the graphical structure of the page absolutely requires frames. This is rare, especially for content centric (useful) pages. Most people use frames to keep hotlinks readily available, to get nice presentation or to reproduce polymorphic directory structures. The first case can be handled with vanilla links, the second with proper usage of styles and the third one by generating plain old HTML in a server side script. So if frames are only used to brag, why use them and make the page incompatible with simpler, older and special purpose browsers? Of course, some things are so much easier to implement with frames that it wouldn’t be sensible to try to get around using them. But you should always serve frameless versions of your site as well, even if it does make maintenance a pain in the ass.

Why?

Up to this point, all the advice given has been rather puritanical in nature. My advice diverges considerably from the mainstream graphical web design paradigms. Layout issues have mostly been neglected and the structural details of an HTML document discussed in more detail. It is therefore necessary to elaborate a bit on the theoretical basics of HTML and to tell what HTML really is, what it is meant to do and where its roots are.

HTML and SGML

It was already mentioned above that HTML is an application of SGML. The latter is, as a language, somewhat older and more comprehensive than HTML. Its intended purpose and prime field of application differ considerably from the contemporary, entertainment oriented ones of HTML. SGML was originally meant for structured documentation and similar useful purposes. SGML is rooted in the 70’s and in IBM. It was developed as a cure for the structureless, manually operated and hence quite difficult to manage publishing languages of the time. The idea was to produce a language which could encapsulate maintainable and above all machine processable documents in a unified format. It was primarily meant for applications dealing with highly regular text data, such as thesauri, term banks and technical manuals—anything with a precise template and/or strict structural constraints.

The first important idea was to let semantic markup supplant the explicit, procedural layout instructions of the type used then. That is, instead of telling the layout program to insert a page break here and a margin there, we simply tell where the text paragraphs are and let the layout system autogenerate the layout stuff. Only in the layout stage do we deal with layout data, before that we have meaningful (semantic) blocks of data. The advantages come in the form of uniform documentation, easily machine processable documents (the programs now know what the text blocks are) and the possibility of automated structural validation of the documents. In addition to this, we gain a degree of modularity because the blocks marked as meaning something can be reused wherever the same type of block is permitted.

It is easy to see that a single, simple markup language does not suffice here: after all, every application needs its own block types and document structures. For instance, thesauri might mark text up as terms, definitions and cross references. At the same time, a technical manual might require something else. This is why SGML ended up being defined as a metalanguage, or a toolset. This means that instead of defining a strict set of permissible types of text the language can utilize, we only set up the basic mechanisms which many applications are likely to use (such as a basic linking facility, an example markup encoding, a mechanism for references external to the document and so on) and a framework which lets the specific application define the datatypes and conventions it uses. The latter part is split in two. First there is the SGML declaration, which defines document size related constraints, the possible presence of syntactical options and shorthands and things pertaining to the actual character encoding of SGML (the language is designed to be strictly independent of the character sets and encodings chosen). Then there is the document type definition (DTD), which tells the block types and permissible document structures for the application. As an example, the SGML declaration for HTML says that the ISO 646 character set (7‐bit ISO standardized ASCII) is used, that element start tag omission is not used and that element names, the structural depth of the document and all the other capacities of the document assume the maximum values permitted. I.e. the SGML declaration specifies the encoding we are about to use. Similarly, the HTML DTD tells what we are about to encode: all the elements and attributes and the rules constraining their use are listed there. We seldom change the declaration part, but the DTD may well change (for instance, we have multiple versions of HTML). The static declaration is what makes us use angle brackets and forbids non‐ASCII characters in HTML.

Now, the step to HTML is very simple: HTML is just an application of SGML. It uses the simplest example SGML declaration included in the standard and each version of HTML is simply one more (albeit very similar to the others) DTD. The DTDs list the elements permissible in a given version of HTML, the attributes they may have, the types of the attributes and how the elements can be nested. As a practical complication, many vendors have extended the structures permissible in pure HTML by their own little gimmicks. This has led to a situation in which people casually use/abuse elements and attributes which really do not belong in any version of the HTML DTD. In fact, few people validate their HTML or care about the validity of the structures they put on their pages. The bad habits spread rapidly, as <multicol> formatted text and table based layout tell us.

Valid syntax

When they hear for the first time that proprietary extension elements and features (which of course work quite well with mainstream browsers and also look OK) should not be used, most people do not relate. Even though my opinion is very clear (just say no), it is extremely difficult to communicate why it is of pivotal importance to adhere to the correct syntax of HTML. The way to illustrate the point is to see what happens when something other than a mainstream browser is used. Actually most current nice looking pages on the Web are completely useless as far as alternative browsers are concerned. Examples of such browsers include Lynx (which is text based and cannot utilize tables, images or fonts at all and often even color is disabled), render‐to‐speech approaches (with the same limitations) and Braille (same limitations, plus the low speed of reading causes people to skip any text which does not get to the point in the first one or two sentences). In addition to these, older and/or smaller budget browsers cannot utilize the high end graphical extensions nowadays in favor and might well skip any content expressed through them. Overall, writing invalid HTML makes any page incompatible with anything but the specific browser which happens by chance to render the page correctly, and makes the pages a lot more difficult to maintain. The latter problem arises because machine validation is no longer possible. My own pages would long ago have fallen apart if it weren’t for continuous validation and syntactical purity. The page may also become error prone, as relying on some feature of HTML based on testing in a graphical browser does not future proof the design like validation does. Last but not least, using HTML extensions makes the vendors think such proprietary stuff is permissible and contributes to a very counter productive, anti‐Netiquette extension war like the one which rages between Microsoft and Netscape.

Separating style from content

In the current age of WYSIWYG editors, easy to use home publishing applications and hyper graphical web sites the distinction between document content and presentational details has been blurred. This distinction is, however, very important—if the content is to be automatically processed or we want to be able to present it in multiple formats for different users, it is clear the graphical outlook and the content itself should be neatly separated. The difference is most pronounced if we think about such specialized needs as non‐graphical browsing (Braille, pure text and speech renderings) or the highly simplified picture a search engine gets of a Web page. If the content of a page has very close ties with the layout, we easily end up with a data representation which does not differ considerably from presenting all the data as a single picture. And we all know what considerable trouble machine understanding/translation of pictures entails…

When we deal with larger sites, there are further reasons to separate style from content. The most important is maintenance: if presentational data is completely separate from page content, it can be maintained separately from the actual pages, and using the same, always up to date style data for all pages, old or new, is quite straightforward. Updates to either content or style are possible without touching the other and the documents themselves are greatly simplified. This adds up to great ease of maintenance.

The recent progress in browser technology has brought the possibility of separating style information from document content to the casual user of the Web. This is achieved by coding the presentational aspects of a page into style sheets in a style sheet language. The most well known of these is Cascading Style Sheets (CSS). Within the XML activity, a new language called XSL has been developed as well, but the support for the latter is quite sparse at the moment. CSS is here now, however, and it is quite a blessing even for the more graphical designers, since it allows control over layout which goes far beyond the level provided by presentational HTML.
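
As a rough illustration of the separation, the document below carries only structural markup while all presentation lives in style sheets. The file name site.css, the class name and the id are assumptions made for the example.

```html
<html>
  <head>
    <title>Style kept apart from content</title>
    <!-- a site wide style sheet shared by all pages -->
    <link rel="stylesheet" type="text/css" href="site.css">
    <!-- page specific rules; these could live in site.css as well -->
    <style type="text/css">
      body      { color: black; background: white }
      p.warning { color: red }
      #summary  { font-style: italic }
    </style>
  </head>
  <body>
    <p class="warning">Back up your files before continuing.
    <div id="summary">
      <p>The main points once more.
    </div>
  </body>
</html>
```

Changing the look of every page on the site then means editing site.css only, without touching the documents themselves.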

Advanced topics

About the more exotic SGML constructs

As an application of SGML, HTML inherits most of the features of the SGML architecture. Even though HTML employs only a very limited subset of SGML’s power, this subset is still surprisingly broad. In particular, it includes many constructs which naïve implementations of HTML do not take into consideration.

It was already mentioned that SGML documents aim for automatic validation. What was not said is that within the SGML community, documents are validated whenever possible. Many SGML editors check the correctness of a document while the user is editing it and prevent the creation of invalid structures in the first place. Web browsers, however, do not care for such things. They simply follow the lax principles of the Net and accept just about anything, whether valid HTML or not. As an example, browsers accept documents which contain severely incorrect markup (most typically markup of the form <em><strong></em></strong>), characters from outside the permissible ISO 646 set (for instance, the Windows native character set sent with an HTTP header that declares the wrong encoding), omissions of elements marked as compulsory in the DTD (say, <html>) and, especially, omission of all the SGML header data needed for document type identification and automatic processing (usually the first line of a valid document). It is a real shame that such structures are tolerated, because there is no guarantee that any of them will work as intended in an architecture which does support standard SGML. It is no use trying to edit the documents with a compliant SGML editor if the headers are not present, for instance.
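The mis‐nested markup mentioned above is easy to detect mechanically. The following is only a minimal sketch in Python, using the standard library's html.parser module; the names NestingChecker and nesting_errors are made up for this illustration. It keeps a stack of open elements and reports every end tag that does not close the innermost one. It is not a validator: it knows nothing about the DTD, so end tags that HTML 4 allows us to omit (such as </p>) are not handled.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical helper: records end tags that close the wrong element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.errors = []     # (innermost open element, end tag found) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        innermost = self.open_tags[-1] if self.open_tags else None
        self.errors.append((innermost, tag))
        # Recover roughly the way a lenient browser might: forget the most
        # recently opened element of that name, if there is one.
        if tag in self.open_tags:
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]


def nesting_errors(markup):
    """Return the list of nesting errors found in a markup string."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors


print(nesting_errors("<em><strong>text</em></strong>"))
print(nesting_errors("<em><strong>text</strong></em>"))
```

The first call reports that </em> arrived while <strong> was still open; the second, correctly nested string produces an empty list.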

The SGML community has taken quite a different approach to solving many problems than the HTML community has. It is then quite natural that some of its innovations have not been implemented in current Net browsers. A typical example is the SGML marked section (which appears in the form <![type[data]]> in SGML source). Such sections could be quite useful if only one had the courage to use them: they make it possible to mark portions of a document to be ignored even though they are parsed, to mark long blocks of text as (sort of) conditional, and to prevent an application from parsing a text block, so that we can give examples of SGML within SGML without the parser going haywire.
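Of the marked section types, the one that prevents parsing survives in XML as the CDATA section, so its effect can be tried out with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the element name example is invented for the illustration, and the other uses of marked sections described above have no XML counterpart to show this way.

```python
import xml.etree.ElementTree as ET

# Everything between <![CDATA[ and ]]> is passed through as plain text,
# so the <em> tags and the bare ampersand are not treated as markup.
source = "<example><![CDATA[<em>not markup</em> & friends]]></example>"

root = ET.fromstring(source)

print(root.text)  # the literal text, angle brackets included
print(len(root))  # number of child elements the parser saw
```

The parser sees no child elements at all: the tags inside the section arrive as ordinary character data.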

Another difference between the typical HTML application and a complete SGML implementation is support for the internal DTD subset (support for this is even rarer). The internal subset is a part of the DTD which is included in the document itself and can make slight modifications to the DTD. Any parts of the DTD already declared cannot be overridden, but we really can declare new elements and entities. In SGML, this lets us include the content of the newly declared entities in our documents, which would be very nice in HTML as well. As it is, we have to use server side includes or some similar mechanism to accomplish the same goal, and this can greatly add to the burden of the HTTP server. On the other hand, XLink should give us embedding links in the future. But still…
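To make the idea of the internal subset concrete, here is a small sketch using XML, which inherits the mechanism from SGML, and Python's standard xml.etree.ElementTree parser. The document type page, the entity name footer and its text are all invented for the example. The entity is declared between the square brackets of the document type declaration and then used in the body like any other entity reference.

```python
import xml.etree.ElementTree as ET

# The part between [ and ] is the internal DTD subset. It declares one
# new general entity, which the document body then references.
document = """<!DOCTYPE page [
  <!ENTITY footer "Maintained by the site owner">
]>
<page><p>&footer;</p></page>"""

root = ET.fromstring(document)
footer_text = root.find("p").text

print(footer_text)  # the entity reference has been replaced by its text
```

The parser expands the reference while reading the document, so the application only ever sees the replacement text.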

But the most frightening aspect is that as browsers attempt to handle just about anything they may come across, some of the underlying principles of SGML have been forgotten along the way. One of the worst problems is the handling of style related information inside HTML (that is, presentational markup): most browsers keep style definitions in effect until they are explicitly ended. This happens even though the element which contained the original style definition (which itself is an element in HTML) ended a while ago. From the standpoint of SGML this is senseless and indicates that the browser does not handle the useful hierarchical element structure of HTML documents correctly. It is a telltale sign that browser designers have left all of the beautiful implications of the SGML architecture behind and have gone with extensions and nonstandard processing instead. It may well be that features like this won’t work after a couple of years (assuming the browser manufacturers regain their senses), whereas all valid HTML pages should work just fine. In fact, valid pages should work with even more browsers as time goes by. This is a very weighty reason to go with valid markup: it is future proof.

XML vs. SGML vs. HTML

A few years ago, when the Web blasted into the public eye, the new popularity of HTML also spilled over onto the underlying SGML architecture. It was rapidly determined that at least some of the problems and omissions listed above were there mostly because SGML was so difficult to implement, not because browser manufacturers were lazy. This was no news to the SGML expert: SGML has long been known to be quite difficult to implement and even to understand. What happened then was what always happens when nerds see a problem: a cooperative effort to fix it quickly and simply. The result is called XML, Extensible Markup Language, and it’s basically a highly simplified SGML.

More concretely, XML fixes a single SGML declaration, so that only one character mapping (that of XML), only one standardized character set (ISO 10646/Unicode), practically no SGML extensions and one set of capacities (basically no limits) have to be implemented. The possibility of inclusions and exclusions in DTDs (these allow one to permit or disallow the occurrence of an element at any level of the document tree from a given element downwards) is also gone, as are some of the other possibilities for creating mixed content (both text and elements). Parsing is far simpler in XML than in SGML, and the aim has been to make it easy, fast and cheap to implement, so that even lightweight applications can benefit from structured markup. XML extends the SGML architecture by introducing the concept of well‐formed documents: XML documents which do not necessarily validate but are still sensible in their own right. This facility lets us embed chunks of XML into contexts where they are not explicitly allowed by the DTD.
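The difference between being well‐formed and being valid can be seen with any non‐validating XML parser. As a rough sketch, Python's standard xml.etree.ElementTree parser checks well‐formedness only and never consults a DTD; the helper name is_well_formed and the sample element names are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML; no DTD validation is done."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested elements that no DTD has ever declared: well-formed.
print(is_well_formed("<note><to>reader</to></note>"))

# Overlapping elements, as tolerated by browsers: not well-formed.
print(is_well_formed("<em><strong>text</em></strong>"))
```

The first document is accepted even though no DTD declares note or to. The second is rejected outright, because overlapping elements break the nesting rules that every XML document has to obey.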

The most significant part of XML is by no means the core standard, but the dozens of applications being built on it right now. Core XML is almost amusingly simple, while some of the actual standard encodings built on it (e.g. SVG for vector images and MathML for mathematical notation) and the other architectural features (like namespaces, querying and fragment interchange) are an order of magnitude more complex than anything ever seen in SGML circles. This tells a lot about XML: a primary guideline of XML’s architects was to build a modular and easily extensible language. Instead of huge, monolithic DTDs, we use the secondary XML standards (namespaces, XLink/XPointer, fragment interchange, XSL and so on) to glue small standard XML structures together into a coherent document.

It is thus clear that XML is more of an architecture than the core standard itself would suggest. In the near future, XML with its helper standards will considerably expand the possibilities of the Web to represent structured information. It will also present a serious learning challenge to anybody who uses HTML now. The easiest way to go is probably XHTML, which is simply HTML ported from SGML to XML. This may sound like a trivial change but it’s really anything but: after going to XHTML, the modular aspects of XML are at the writer’s disposal, and when the support comes, such things as mathematical notation, three dimensional graphics and two dimensional uniform vector graphics can be used where today we take graphical shortcuts with GIFs and so on. XML is a nice thing indeed and it’s coming to us head on…
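The modularity described above rests on namespaces: each vocabulary is identified by a URI, so elements from XHTML and MathML can sit in the same document without colliding. The following sketch, again using Python's standard xml.etree.ElementTree, parses a made‐up XHTML fragment with a small MathML expression embedded in it. The two namespace URIs are the ones assigned by the W3C; the page content itself is invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/Math/MathML"

# A tiny XHTML page with a MathML island inside a paragraph.
page = f"""<html xmlns="{XHTML}">
  <body>
    <p>The sum is
      <math xmlns="{MATHML}"><mi>x</mi><mo>+</mo><mi>y</mi></math>
    </p>
  </body>
</html>"""

root = ET.fromstring(page)

# ElementTree writes qualified names as {namespace-uri}localname.
math_nodes = root.findall(f".//{{{MATHML}}}math")
identifiers = [mi.text for mi in root.iter(f"{{{MATHML}}}mi")]

print(root.tag)
print(len(math_nodes), identifiers)
```

An application that understands only XHTML can skip the elements in the MathML namespace, while one that knows both can pick the mathematics out by namespace alone.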