HTML is the language we use to create Web pages. It is an application of a far older and more comprehensive framework called SGML. As such, HTML inherits many features from SGML, like the hideous entity references any valid, non-English page is bound to contain, the angle bracket notation for marking up text and the hierarchical, nested markup structure of documents. This little tutorial talks about HTML version 4.0 which was, at the time of writing, the newest. Most of what is said applies to XHTML version 1.0 as well, which is what this document is now written in. However, the broad set of dynamic features and server side processing related things (like forms) are left untouched. This text is about static HTML. The text also references older versions of the language, but only for completeness: the main difference between older versions of the language and 4.0 is presentational markup, which shouldn’t be used anyway if possible. Sometimes we need such markup for compatibility, so some of the more important features are described. (As a matter of fact, these features are available in the looser Transitional variant of the 4.0 specification as well.)
HTML, as an acronym, comes from the words HyperText Markup
Language—it’s a language for marking up text documents.
Hypertext, in turn, is text where we use links to break up the
traditional linear way of reading and also include all sorts of
multimedia. The idea is that the reader chooses what to read
and follows links accordingly. Multimedia enters the picture because
with a non‐paper bound architecture it is possible to represent data
much more efficiently by abandoning pure text. In the context of
markup languages, we also talk about functional markup,
which means that the markup we use reflects the structure of the
document. This is in stark contrast with presentational markup which
simply tells how the document looks. For instance, tables are good for
tabular data but should never be used for layout purposes. HTML
provides ample opportunity for functional markup since most of the
really important structures like lists, text paragraphs, images,
objects external to the document, scripts, tables, addresses,
emphasized text and so on are present. In the following, a short tour
of the more important HTML elements is taken. It might be useful to
see how this document has been built up, as well: it has
been built in a fairly puritan manner.
An HTML document is simply a plain text file. Text is encoded as-is
and is intermixed with markup telling what the text is.
Markup is composed of easily identifiable tags which HTML
applications use to do things. On the other hand, HTML is a
declarative language in the sense that none of the structures
in it do anything but simply stand for specific data types,
like lists or paragraphs. Also unlike older publishing languages, the
specific tags do not cause a fixed line or page break but rather delimit
logical units of the text. HTML is not a WYSIWYG publishing
tool either.
Another thing to note about HTML is that, as an SGML application, it consists of structural markup. This means that the language has lots of structure (like marked text inside differently marked text) and it also places limitations on how the different structures can be used (plain text cannot occur directly inside the document body, for instance). There are a bunch of rules on how the different elements of the document may be nested and whether certain series of successive elements are legitimate, and also semantic conventions which tell how common structures should be implemented in the HTML environment. The basic principle is that we always concentrate on the structure of the document instead of its layout. For instance, a list is not a pile of indented, bulleted lines of text but rather a logical unit consisting of a sequence of elements which may in turn have their own inner structure and semantics. Thinking about the structure of the document like this quickly leads us to use nested structures and content models, which tell which elements a given element may contain and in which order. For instance, we might have a book comprised of parts which are composed of sections in which we have paragraphs which in turn contain lists of mixed text and figures. All this is quite possible in HTML, although the representation is not quite so easy to comprehend before the document is rendered in a web browser which knows how the logical structure should look on screen. Specifically, HTML (like all SGML) abstracts a single logical part of a document, like the ones listed above, as an element.
In actual documents the manifestation of elements is a specific
tagging syntax. We place angle brackets around parts of text
we want the computer to interpret. Each such bracketed expression is a
tag. A single element is represented by a start tag (which also includes
textual attributes), the element content (just text or more
elements) and the end tag. That is, <something size="123">456</something>
is an element named something with one attribute (size) of value
"123" and the pure text content "456". Element and attribute names are
case insensitive (but use lowercase letters since HTML is going there).
Whitespace inside the start tag separates successive attributes. The end
tag has nothing but the element name. Note that attribute values which
contain whitespace must be quoted like above; unquoted values may only
contain letters, numbers and a few special characters. (We
should always quote because in a couple of years, that will be
mandatory.) Since both tags and attributes are markup, they will not be
directly shown on‐screen but are used to derive formatting instructions
for the browser.
The top level structure of a document is the following: first we start
an HTML document with <html>, then comes the document header after
<head> (inside which we must have at least the document <title>), after
which we start the actual document body with <body> and finally end the
HTML document with </html>. It is
already seen that not all elements need to be ended explicitly. For
instance, the canonical syntax for the above would have included ending
the head before the body and the body before the whole document ends.
But since the structural definition of an HTML document says that body
can only occur after the head, not inside it, any HTML application can
automatically close the head in the right place when it sees the body
start tag. There are some elements for which including the end tag is
actually an error: they do not have content at all, only attributes,
and hence an end tag would be meaningless. Images are an
example. For some other elements the end tag is optional and is
customarily left out in HTML 4.0 (XHTML, in contrast, requires it).
List and table entries are good examples.
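The top level structure described above can be sketched as follows. This is only a minimal illustration: the title and paragraph text are made up, and every end tag is written out so that the same skeleton also suits the stricter XHTML style.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
  "http://www.w3.org/TR/REC-html40/strict.dtd">
<html>
<head>
  <!-- the title is the one element the header must contain -->
  <title>A minimal document</title>
</head>
<body>
  <!-- text sits inside a block element, not directly in the body -->
  <p>Hello, Web.</p>
</body>
</html>
```

In HTML 4.0 the end tags of the head, the body and the paragraph could all be dropped without changing the meaning, since the parser can infer them from the document structure.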
Often we would like to have a nice background image in the document and
perhaps some unusual colors. The right and trendy way to accomplish this
is to use CSS style definitions. But if your browser does not support
these or you are just lazy, there is an alternative way. In older
HTML versions the same effect is achieved by using the text, bgcolor and
background attributes of the <body> element. The first sets the
text color, the second the color of the background and the third gives
the URL of an image
to be tiled in the background. These are especially useful with older
Netscapes, whose CSS support is even now inferior to that of
Micro$oft’s browsers.
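As a rough comparison, the two approaches might look like this. The colors and the image name paper.png are invented for the example; note that in HTML 3.2 and 4.0 Transitional the attribute which sets the text color on the body is called text.

```html
<!-- Variant 1: legacy presentational attributes on the body start tag -->
<body text="#000000" bgcolor="#ffffee" background="paper.png">

<!-- Variant 2: the same look from a style sheet placed in the head -->
<style type="text/css">
  body {
    color: #000000;                       /* text color */
    background: #ffffee url(paper.png);   /* background color and tiled image */
  }
</style>
```

Only one of the variants is needed in a given document. The style sheet version keeps the body start tag free of presentational detail.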
The next important element to know is the heading. It is declared
using the elements from <h1> to <h6>. A larger number means
a smaller heading. More control is achieved through style sheets. The
headings are situated outside the actual text paragraphs, tables and
whatever, inside either a <body> or a <div> element. The
content of a heading is the text to be rendered. Version 2 of the CSS
style language also gives a possibility of autonumbering headings, but
the support is nonexistent. It is better to just number any headings by
hand.
Another useful way to separate parts of the document from each other is
to use a rule. It is often used when there is a clear division but a
separate section (with a title) is not an option. Notes, sidebars and
end of chapter lists/summaries are excellent examples. A horizontal rule
is created with the <hr> element. This is an empty element with no
end tag and it is permitted in precisely the same parts of the document
as are the headings. The element is most commonly rendered as a wide,
narrow horizontal line, perhaps with some shading.
When we want to include pure text paragraphs, we use the <p>
element. Paragraphs cannot nest so the browser knows to end the previous
one if a new one is started. No end tags are needed. Another common
grouping construct is the division, <div>. Divisions can be nested
and permit all kinds of other content (like tables and lists) as well.
Consequently divisions need to be ended—the browser cannot know
whether the next element is a nested element or simply follows the
division if no end tag is present. Divisions are mainly used to group
other elements, but can be used as a text container as well.
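A small body fragment may help to show how headings, rules, paragraphs and divisions fit together. The heading texts and hand-written numbers are invented for the illustration.

```html
<body>
  <h1>1. Main heading</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>

  <!-- a rule marks a division which does not merit its own heading -->
  <hr>

  <div>
    <h2>1.1 A note</h2>
    <p>Divisions may hold headings, paragraphs, lists and tables.</p>
    <div>
      <p>Divisions can nest, so each one needs its end tag.</p>
    </div>
  </div>
</body>
```

The paragraph end tags are optional in HTML 4.0 and are written out here only for clarity. In XHTML the empty rule would be written as <hr />.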
As for attributes, <div>s, <p>s and almost all other
elements can have the lang attribute. This marks a specific element as
having content in a specific language. Nowadays it’s practically useless
since no support is present. But in the future it will allow multilingual
document construction and automatic language selection by the browser.
Some further generic attributes exist: class is a whitespace separated
list of class names to which the element belongs and id is a unique
textual name for the element. Both are used in stylesheets to attach
style information to specific elements. Classes are used when multiple
elements utilize the same style information (like the foreground color).
The id attribute must be unique within the document so it cannot be used
similarly. Instead it is very useful when we must be sure that some
style information is attached to just one specific element (like the
size of a picture). IDs are also used within HTML to refer to a given
element, like when a link map needs to be attached to multiple images or
when we need to jump to a specific part of the document with a link. The
problem is, Netscape does not support the id attribute for this and instead
utilizes the legacy name attribute of some elements (like anchors and
imagemaps) for the same purpose. In the newest HTML work name is
deprecated and it’s also of very limited applicability. Use id
when possible. In addition to lang, id and class, HTML 4.0 also permits a
title for an element. It is used to give a short human readable title
and is most commonly rendered as a popup/tooltip.
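The generic attributes can be seen working together in a sketch like the one below. The class name warning, the id logo and the file logo.png are hypothetical names chosen for the example.

```html
<head>
  <title>Generic attributes</title>
  <style type="text/css">
    .warning { color: #cc0000; }  /* shared by every element of the class */
    #logo    { width: 120px; }    /* applies to the one element with this id */
  </style>
</head>
<body>
  <p class="warning" title="Read this first">Back up your files.</p>
  <p class="warning" lang="fr">Sauvegardez vos fichiers.</p>
  <p><img id="logo" src="logo.png" alt="Company logo"></p>
  <!-- the same id doubles as a link target -->
  <p><a href="#logo">Jump to the logo</a></p>
</body>
```

Both paragraphs pick up the class rule, while the id rule can only ever match the single image.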
In the age before CSS, floating was controlled with two formatting
attributes. On images (and tables) the align attribute had, among others,
the values right and left and if used, floated the object to the side in
question, letting other objects (like text) flow around it. To control
the flow, the clear attribute of the <br> element was used to tell that
the content following the break wants a clear right, left or all (left
and right) margin. In CSS the same jobs are done by the float and clear
properties, which apply to paragraphs and divisions as well.
Pure textual information is often longwinded and boring. It is also
quite difficult to find the salient points in the flow without reading
the whole thing really carefully. This (and machine processing) is why
we need markup to differentiate between different types of text inside a
paragraph. The currently legal elements for this include <em> for
emphasis, <strong> for stronger emphasis, <dfn> for
the defining instance of a term, <abbr> for an abbreviation,
<acronym> for acronyms and so on. The exact list can be found in
the HTML specification, maintained by W3C. It must be noticed that these
elements do not tell how the text should be rendered but are
semantic in the sense that they tell something about the
meaning of the text inside. The older versions of HTML also had
presentational tags like <u> for underline, <b> for bold,
<i> for italics and <strike> for overstrike. Even worse, a
<font> element with face (font name) and size attributes was
present. Documents formatted with elements such as <font> are
practically incomprehensible to human beings until they have been
rendered by a browser.
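For illustration, a paragraph using the semantic inline elements might read as follows. The sentence itself is invented; the title attributes carry the expansions of the shortened forms.

```html
<p>
  <dfn>Hypertext</dfn> is text connected by links. The
  <abbr title="World Wide Web Consortium">W3C</abbr> maintains
  <acronym title="HyperText Markup Language">HTML</acronym>.
  Use <em>emphasis</em> sparingly and
  <strong>strong emphasis</strong> even more so.
</p>
```

Nothing here says bold or italic. A graphical browser will probably choose those renderings, while a speech browser is free to change the tone of voice instead.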
Images are a very important part of the Web environment. They are
declared with an element called <img>. As attributes we give a
source (attribute name src) which gives the URL which can be
used to fetch the image and alt (alternative) which gives a short
textual description of the picture to be shown while the image is being
retrieved. Even more importantly alt is used in nongraphical browsers to
represent the image as the latter cannot be displayed. Choose alts with
care. In principle, from HTML 4.0 upwards alt attributes are mandatory in
images. If you need to tell more about the image, use the longdesc
attribute, which links to a longer description. Images cannot be used
outside text content, so they must always appear inside divisions or
paragraphs. The older HTML versions defined width and height attributes
as well, so that the picture could be stretched and its dimensions
calculated without fetching the image data itself. Nowadays we should use
CSS for this as well.
Both methods admit pixel sizes (the default in HTML) and
percentages of the available width (mark with a trailing % sign). It’s
quite clear that percentages are more robust—after
all there is no way of knowing what the screen size of a particular
viewer’s computer is. It is not in the spirit of the Web or
HTML to build sites
which are dependent on screen sizes or the browser which is used to view
the pictures.
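A hedged example of an image with its textual alternatives is given below. The file names and the description are made up, and the width is set through an inline style only to keep the sketch in one piece.

```html
<p>
  <img src="images/diagram.png"
       alt="Block diagram of the mixer: two inputs, one output"
       longdesc="diagram-description.html"
       style="width: 50%">
</p>
```

The percentage is relative to the width available to the enclosing paragraph, so the picture scales with the window instead of assuming a screen size.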
It was already mentioned that HTML is supposed to represent hypertext.
So we need links. In HTML all links are embedded into the document, they
have but one target and can do nothing but go to the document which the
link points to. We define a link with the anchor (<a>) element.
The content of the element is what is shown in the user interface as the
link (can be text and/or images) and the end point of the link is given
with the href attribute. The target is a URI, of which URLs are a
special case. For instance, http://www.iki.fi/~decoy/front is a URI pointing
to my current homepage while mailto:decoy@iki.fi
is a link which lets the viewer send mail to my e-mail address. The
part before the colon gives the protocol/access method to the resource,
the rest tells where the resource is. Other examples are links to FTP
servers, like ftp://ftp.funet.fi/, or
even to such more esoteric resources as WAIS and Gopher servers. The
slash after the server name/last directory name is often forgotten but
should definitely be included: some sites permit files and directories of
the same name and those are told apart by the trailing slash.
And even if a given server is configured to redirect a malformed query for a
directory (URL ends with the name) to the right one (slash
ended), the redirection puts extra load on the Net. That is bad Netiquette.
Also remember that the full name of the server should be used. There are
lots of servers out there whose addresses do not start with
www.. For instance:
http://www‐library.itsi.disa.mil/. Again we see how valid syntax
really matters.
The above examples use strict, absolute URIs. But there are also
relative forms, just like in operating system file names. For instance,
a URI starting with ../ tells to look one step higher in
the user visible directory hierarchy of the server.
../../dsound.html would mean two steps up and a specific file
and so on. It is also possible to reference particular IDs within a
given document. This is done by appending a hash sign (#) and the name
of the id to the URL. In fact, the URL part can be dropped: a link which
simply consists of a hash mark and an id will target the given id in the
same document. HTML 4.0 lets us define ids in almost any element.
Parts of a text element can be made link targets by using the
id attribute of a <span> element wrapped around the text in question.
(The empty <spot> element which the HTML 3.0 draft proposed for this
purpose is not part of HTML 4.0.) Netscape does not support such constructs
but instead prefers the older form in which we define an anchor
(<a>) element with the link target id given in the name attribute.
This is too bad since named anchors will probably go away in time and
they make no distinction between links to a point in the file and links
which reference a whole part.
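The different link forms can be collected into one fragment. The URIs with a scheme are the ones used in the text; notes.html and the id values are hypothetical.

```html
<p>
  <a href="http://www.iki.fi/~decoy/front">An absolute URI</a>,
  <a href="mailto:decoy@iki.fi">a mail link</a>,
  <a href="ftp://ftp.funet.fi/">an FTP server, trailing slash included</a>,
  <a href="../../dsound.html">a relative link two levels up</a>,
  <a href="notes.html#summary">a fragment in another document</a> and
  <a href="#summary">a fragment in this document</a>.
</p>

<!-- HTML 4.0 style targets: an id on the element itself -->
<h2 id="summary">Summary</h2>
<p>Text with <span id="key-point">a link target inside it</span>.</p>

<!-- legacy target for browsers which only know named anchors -->
<p><a name="old-summary">Summary for older browsers</a></p>
```

A page meant for both old and new browsers could carry both kinds of target for a while, at the cost of some redundancy.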
In addition to vanilla text, HTML gives us markup for some structured
forms of text as well. Lists and tables are the major examples. There
are three kinds of lists: ordered, unordered and term definition ones.
An ordered list is created with an <ol> element, an unordered with
<ul> and a definition list with a <dl>. The ordered and
unordered lists have the same structure with multiple embedded list
items (<li>), the definition list differs somewhat. Lists can be
recursive, that is, there can be new lists inside list items.
Usually we do not end list items since they cannot nest (a list item
directly inside a list item is not legal, though a list item inside a
list inside a list item is). Ordered lists are usually rendered with numbering,
unordered with bullets. Ordered lists permit control over the numbering
with the start (first number) and type (1: arabic; a: lower alpha; A:
upper alpha; i: lower roman and I: upper roman numbers) attributes.
Definition lists do not use list items but rather a list of
term-definition pairs. Terms are marked with <dt> (definition
term) and definitions with <dd> (definition data). Only the
definition part can have complex substructure; the term is similar to
paragraph content.
Lists and tables are a very important example of the importance of correct syntax. For instance, there cannot be text inside a list. Only list items can. In newer M$ browsers incorrect markup of this kind will produce some pretty wild errors (like text moving outside of the list etc.).
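As a sketch of the list structures described above, consider the fragment below. The item texts are invented, and the start and type attributes are presentational ones which HTML 4.0 only allows in its Transitional variant.

```html
<ol start="3" type="a">
  <li>Third step, labelled "c"
    <!-- a nested list lives inside a list item, never directly in the list -->
    <ul>
      <li>A detail of the third step</li>
    </ul>
  </li>
  <li>Fourth step, labelled "d"</li>
</ol>

<dl>
  <dt>HTML</dt>
  <dd><p>HyperText Markup Language.</p></dd>
  <dt>SGML</dt>
  <dd>Standard Generalized Markup Language.</dd>
</dl>
```

The end tags of the list items, terms and definitions are optional in HTML 4.0. They are written out here so that the nesting is easy to follow.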
Tables are surely one of the most important features of HTML.
They are created with the <table> element and they must be ended.
Table content is grouped into rows (<tr>: table row) of equal
length; each row contains the same number of table cells. Rows are
rarely terminated. Two cell types exist: <td> is an ordinary table
data cell, <th> is a table header cell. Cells contain content
similar to the paragraphs and are usually not ended either. The most
important of table attributes are rowspan and colspan which can be given
in data and header cells to join multiple cells into one complex one.
The values are numerical and give the size of the cell, vertically or
horizontally. The space taken by the extension (the default values are 1,
of course) protrudes into the following rows and columns. Cells created this
way must not overlap and the cells covered by a span must not be present in
the following data. Tables can also have <caption>s. This element
is placed inside the table but before the actual row data. Tables are
recursive structures as well—table cells can have embedded tables.
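A small table with both kinds of span may make the rules concrete. The opening hours are invented data; every row adds up to three columns once the spans are counted.

```html
<table>
  <caption>Opening hours</caption>
  <tr>
    <th>Day</th>
    <th>Morning</th>
    <th>Afternoon</th>
  </tr>
  <tr>
    <td>Monday</td>
    <td colspan="2">Open all day</td>
  </tr>
  <tr>
    <td>Tuesday</td>
    <td rowspan="2">Closed</td>
    <td>12 to 16</td>
  </tr>
  <tr>
    <td>Wednesday</td>
    <!-- the Morning cell is covered by the rowspan above, so it is left out -->
    <td>12 to 18</td>
  </tr>
</table>
```

Leaving out the covered cell is the important part: adding a third cell to the last row would make the spans overlap.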
Frames are a technique used to create multiple independent partitions
into a web browser window. Graphical browsers display the result as
multiple separately scrollable subwindows (frames) of the
main browser one—each behaves like a browser in itself. To control
what is shown in the frames, a target
attribute is used in links to tell
which frame should show the link endpoint. As special cases, the top
level browser window or a newly created copy can be used as the link
target. Frames are a recursive construct as well, so it is possible to
show in a frame a document which creates further frames. On the plus
side, frames help create structured views to pages very quickly and also
permit a characteristic, tidy look. The downsides are that
only leading graphical browsers support frames completely, that
documents become difficult to manage, that the methodology is not
completely device independent (frames, like tables and columns, are
basically a graphical construct which means they are not good functional
markup) and the fact that some people are deeply annoyed by frames.
(For instance, some heavily subsidised sites use the window creation
facility to display ads and banners in a completely indiscriminate
manner.)
Thanks to Netscape, the frame facility was added to HTML 4.0 as well. Earlier specifications were a Netscape novelty and they were only partially documented. The fourth version of HTML clarifies the situation by strictly separating frame definitions from the actual document data. This is done by declaring a separate frameset document which can only create frames and only giving links to the original content of the frames in this document. After this, the frames are completely independent of the frameset definition.
The doctype declaration for a frameset document is <!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"
"http://www.w3.org/TR/REC-html40/frameset.dtd">. In a frameset,
no <body> is allowed; it is substituted with an element called
<frameset>. Horizontal and vertical divisions are given as its rows and
cols attributes. Scrolling and margin information can be specified as well.
The frameset element contains <frame>s which fill the resulting
partitions and use the src attribute to link to the content
of a given frame. In addition to these, there is a <noframes>
element which is used to specify an alternative text in case the frame
facility is not present. This should always be used, and not just to
tell the user to get a graphical browser. End tags can only be omitted
from the frame elements. In framed documents, link syntax is almost
identical to the usual one. The only difference is that we have an
attribute called target which specifies the name of the frame receiving
the link target. The name is specified with the name attribute of a
<frame> element in the frameset declaration. Alternatively a
special name can be used. These start with an underscore and can be used
to open the link in the current frame (_self), the parent frame (_parent),
the top level browser window (_top) or in a separate newly created window
(_blank). It must be noted
that since frames are not pure HTML, the target attribute is not
available in the strict form of HTML 4.0: you will need to specify the
document as being HTML 4.0 Transitional instead of HTML 4.0 Strict. This
happens by changing the above document type declaration to use the word
Transitional instead of Frameset and the system identifier
loose.dtd instead of frameset.dtd.
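Put together, a frameset document might look roughly like this. The frame names menu and content and all the file names are hypothetical; the content of each frame is named with the src attribute of its frame element.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"
  "http://www.w3.org/TR/REC-html40/frameset.dtd">
<html>
<head>
  <title>A framed site</title>
</head>
<!-- two columns: a narrow menu and the rest of the window -->
<frameset cols="25%,*">
  <frame name="menu" src="menu.html">
  <frame name="content" src="welcome.html">
  <noframes>
    <body>
      <p>This site uses frames. The
      <a href="menu.html">table of contents</a> is also available
      as an ordinary page.</p>
    </body>
  </noframes>
</frameset>
</html>
```

A link inside menu.html would then be written as <a href="chapter1.html" target="content"> so that the chapter opens in the right-hand frame, and menu.html itself would be declared as HTML 4.0 Transitional.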
My opinion is that frames are not good unless they’re targeted strictly for
a known browser and the graphical structure of the page absolutely
requires frames. This is rare, especially for content centric
(useful) pages. Most people use frames to keep hotlinks readily
available, to get nice presentation or to reproduce polymorphic
directory structures. The first case can be handled with vanilla links,
the second with proper usage of styles and the third one by generating
plain old HTML in a server side script. So if frames are only used to
brag, why use them and make the page incompatible with simpler, older
and special purpose browsers? Of course, some things are so much easier
to implement with frames that it wouldn’t be sensible to try to get
around using them. But you should always serve frameless versions of
your site as well, even if it does make maintenance a pain in the ass.
Up to this point, all the advice given has been rather puritan in nature. My advice diverges considerably from the mainstream graphical web design paradigms. Layout issues have mostly been neglected and the structural details of an HTML document discussed in more detail. It is therefore necessary to elaborate a bit on the theoretical basics of HTML and to tell what HTML really is, what it is meant to do and where its roots are.
It was already mentioned above that HTML is an application of SGML. The
latter is, as a language, somewhat older and more comprehensive than
HTML. Its intended purpose and prime field of application differ
considerably from the contemporary, entertainment oriented ones of HTML.
SGML was originally meant for structured documentation and similar
useful purposes. SGML is rooted in the 70’s and in IBM. It was
developed as a cure for the structureless, manually operated and hence
quite difficult to manage publishing languages of the time. The idea was
to produce a language which could encapsulate maintainable and above all
machine processable documents in a unified format. It was primarily
meant for applications dealing with highly regular text data, such as
thesauri, term banks and technical manuals—anything with a precise
template and/or strict structural constraints.
The first important idea was to let semantic markup supplant the explicit, procedural layout instructions of the type used then. That is, instead of telling the layout program to insert a page break here and a margin there, we simply tell where the text paragraphs are and let the layout system autogenerate the layout stuff. Only in the layout stage do we deal with layout data; before that we have meaningful (semantic) blocks of data. The advantages come in the form of uniform documentation, easily machine processable documents (the programs now know what the text blocks are) and the possibility of automated structural validation of the documents. In addition to this, we gain a degree of modularity because the blocks marked as meaning something can be reused wherever the same type of block is permitted.
It is easy to see that a single, fixed markup language does not suffice here. After all, every application needs its own block types and document structures. For instance, thesauri might mark text up as terms, definitions and cross references. At the same time, a technical manual might require something else. This is why SGML ended up being defined as a metalanguage, or a toolset. This means that instead of defining a strict set of permissible types of text the language can utilize, we only set up the basic mechanisms which many applications are likely to use (such as a basic linking facility, an example markup encoding, a mechanism for references external to the document and so on) and a framework which lets the specific application define the datatypes and conventions it uses. The latter part is split in two. First there is the SGML declaration, which defines document size related constraints, the possible presence of syntactical options and shorthands and things pertaining to the actual character encoding of SGML (the language is designed to be strictly independent of the character sets and encodings chosen). Second there is the document type definition (DTD), which tells the block types and permissible document structures for the application. As an example, the SGML declaration for HTML says that the markup itself is written in the ISO 646 character set (7-bit ISO standardized ASCII), tells which tag omission and other shorthand features are in use and sets the limits on the length of element names, the structural depth of the document and all the other capacities of the document. I.e. the SGML declaration specifies the encoding we are about to use. Similarly, the HTML DTD tells what we are about to encode: all the elements and attributes and the rules constraining their use are listed there. We seldom change the declaration part, but the DTD may well change (for instance, we have multiple versions of HTML). The static declaration is what makes us use angle brackets and keeps the markup itself within ASCII.
Now, the step to HTML is very simple: HTML is just an application of
SGML. It uses the simplest example SGML declaration included in the
standard and each version of HTML is simply one more (albeit very
similar to the others) DTD. The DTDs list the elements permissible in a
given version of HTML, the attributes they may have, the types of the
attributes and how the elements can be nested. As a practical
complication, many vendors have extended the structures permissible in
pure HTML by their own little gimmicks. This has led to a situation in
which people casually use/abuse elements and attributes which really do
not belong in any version of the HTML DTD.
In fact, few people validate their HTML
or care about the validity of the structures they put on
their pages. The bad habits spread rapidly, as <multicol>
formatted text and table based layout tell us.
When they hear for the first time that proprietary extension elements and features (which of course work quite well with mainstream browsers and also look OK) should not be used, most people do not relate. Even though my opinion is very clear (just say no), it is extremely difficult to communicate why it is of pivotal importance to adhere to the correct syntax of HTML. The way to illustrate the point is to see what happens when something other than a mainstream browser is used. Actually most current nice looking pages on the Web are completely useless as far as alternative browsers are concerned. Examples of such browsers include Lynx (which is text based and cannot utilize tables, images or fonts at all and often even color is disabled), render-to-speech approaches (with the same limitations) and Braille (same limitations, plus the low speed of reading causes people to skip any text which does not get to the point in the first one or two sentences). In addition to these, older and/or smaller budget browsers cannot utilize the high end graphical extensions nowadays in favor and might well skip any content expressed through them. Overall, writing invalid HTML makes any page incompatible with anything but the specific browser which happens by chance to render the page correctly, and makes the pages a lot more difficult to maintain. The latter problem arises because machine validation is no longer possible. My own pages would long ago have fallen apart if it weren’t for continuous validation and syntactical purity. The page may well become error prone, since relying on some feature of HTML on the basis of testing in one graphical browser does not future proof the design like validation does. Last but not least, using HTML extensions makes the vendors think such proprietary stuff is permissible and contributes to a very counterproductive, anti-Netiquette extension war like the one which rages between Microsoft and Netscape.
In the current age of WYSIWYG editors, easy to use home publishing applications and hyper graphical web sites the distinction between document content and presentational details has been blurred. This distinction is, however, very important—if the content is to be automatically processed or we want to be able to present it in multiple formats for different users, it is clear the graphical outlook and the content itself should be neatly separated. The difference is most pronounced if we think about such specialized needs as non‐graphical browsing (Braille, pure text and speech renderings) or the highly simplified picture a search engine gets of a Web page. If the content of a page has very close ties with the layout, we easily end up with a data representation which does not differ considerably from presenting all the data as a single picture. And we all know what considerable trouble machine understanding/translation of pictures entails…
When we deal with larger sites, there are further reasons to separate style from content. The most important is maintenance: if presentational data is completely separate from page content, it can be maintained separately from the actual pages and using the same, always up to date style data for all pages, old or new, is quite straightforward. Updates to either content or style are possible without touching the other and the documents themselves are greatly simplified. This adds up to great ease of maintenance.
The recent progress in browser technology has brought the possibility of separating style information from document content to the casual user of the Web. This is achieved by coding the presentational aspects of a page into style sheets in a style sheet language. The most well known of these is Cascading Style Sheets (CSS). Within the XML activity, a new language called XSL has been developed as well, but the support for the latter is quite sparse at the moment. CSS is here now, however, and it is quite a blessing even for the more graphical designers—CSS allows control over layout which is far extended from the level provided by presentational HTML.
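To give an idea of what the separation looks like in practice, here is a minimal sketch. The file name site.css, the class name note and the rules themselves are assumptions made for the example.

```html
<head>
  <title>Style kept apart from content</title>
  <!-- one style sheet shared by every page on the site -->
  <link rel="stylesheet" type="text/css" href="site.css">
  <!-- rules which concern this page only -->
  <style type="text/css">
    h1     { font-family: sans-serif; }
    p.note { margin-left: 2em; font-style: italic; }
  </style>
</head>
<body>
  <h1>Release notes</h1>
  <p class="note">The body contains structure only.</p>
</body>
```

Changing the look of the whole site then means editing site.css, without touching a single page.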
As an application of SGML, HTML inherits most of the features of the SGML architecture. Even though HTML only employs a very limited subset of SGML’s power, this subset is still surprisingly broad. In particular, it includes many constructs which naïve implementations of HTML do not take into consideration.
It was already mentioned that SGML documents aim for automatic
validation. What was not said was that within the SGML community,
documents are validated whenever possible. This means that many SGML
editors check the correctness of documents even when the user is editing
and prohibit the creation of invalid structures in the first place. Web
browsers, however, do not care for such stuff. They simply follow the
lax principles of the Net and accept just about anything, whether valid
HTML or not. As an example, documents are accepted which contain
severely incorrect markup (most typically misnested markup of the form
<em><strong></em></strong>), characters from
outside the permissible ISO 646 set (for instance, the Windows native
character set when used with an HTTP header signifying the incorrect
usage), omitted elements marked as compulsory in the DTD (say,
<html>) and especially omission of all of the SGML header data
needed for document type identification and automatic processing
(usually the first line of a valid document). It is a real shame that
such structures are sanctioned because there is no guarantee that any of
them will work as intended in an architecture which does support
standard SGML. It is no use trying to edit the documents with a
compliant SGML editor if the headers are not present, for instance.
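The misnesting problem mentioned above is easy to detect mechanically. What follows is only a minimal sketch, not a validator: it uses Python's standard `html.parser` module, and the names `NestingChecker` and `nesting_errors` are made up for this illustration. It also assumes that every non-empty element carries an explicit end tag, which is stricter than the HTML 4 DTD (where end tags such as `</p>` and `</li>` may be omitted).

```python
from html.parser import HTMLParser

# Elements declared EMPTY in HTML 4; they never have an end tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img",
                 "input", "link", "meta", "param"}


class NestingChecker(HTMLParser):
    """Hypothetical helper: reports end tags that do not match the
    innermost element that is still open."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open element names
        self.errors = []         # human-readable problem reports

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()
            return
        innermost = self.open_elements[-1] if self.open_elements else "nothing"
        self.errors.append(f"</{tag}> found while {innermost} is innermost")
        # Recover so that later tags can still be checked.
        if tag in self.open_elements:
            self.open_elements.remove(tag)


def nesting_errors(markup):
    """Return a list of nesting problems found in the markup."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors


print(nesting_errors("<em><strong>text</strong></em>"))
print(nesting_errors("<em><strong>text</em></strong>"))
```

The first call reports nothing, while the second reports the stray `</em>`. A browser accepts both inputs silently, which is exactly the laxness described above.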
The SGML community has approached many of these problems quite
differently from the HTML community. It is then quite natural
that some of its innovations have not been implemented in current Net
browsers. A typical example is SGML marked sections (which
appear in the form <![type[data]]> in SGML source). Such
sections could be quite useful if only one had the
courage to use them: they make it possible to mark portions of a
document to be ignored even though parsed, to mark long blocks of
text as (sort of) conditional, and to prevent an application from parsing
a text block so that we can give examples of SGML within
SGML without the parser going haywire.
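The last use, showing markup as literal text, can be tried out with the one kind of marked section that survives in XML, the CDATA section. This is a sketch under an assumption: Python's standard library has no SGML parser, so its XML parser stands in here, and the element name `example` is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The CDATA marked section tells the parser to treat its content as
# plain character data, so the tags and the bare ampersand inside it
# are not interpreted as markup.
source = "<example><![CDATA[<em>shown, not parsed</em> & left alone]]></example>"

root = ET.fromstring(source)
literal_text = root.text   # the raw characters from inside the section
child_count = len(root)    # no child elements were created

print(literal_text)
print(child_count)
```

The `<em>` inside the section ends up as text rather than as a child element. The ignored and conditional kinds of marked section have no counterpart in the body of an XML document, so they are not shown.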
Another difference between the typical HTML application and a complete SGML implementation is support for the internal DTD subset (support for this is even rarer). This is a part of the DTD which is included in the document itself and can make slight modifications to the DTD. Any parts of the DTD already declared cannot be overridden, but we really can declare new elements and entities. In SGML, this lets us include the content of the newly declared entities in our documents, which would be very nice in HTML as well. As it is, we have to use server side includes or some similar mechanism to accomplish the same goal, and this can greatly add to the burden of the HTTP server. On the other hand, XLink should give us embedding links in the future. But still…
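To make the idea of the internal subset concrete, here is a small sketch using Python's standard XML parser as a stand-in for an SGML system (XML keeps the internal subset mechanism). The document type `page`, the entity name `signature` and its replacement text are all invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The part between "[" and "]" is the internal DTD subset. It declares
# a new entity, which the document body then uses by reference.
document = """<!DOCTYPE page [
  <!ENTITY signature "Maintained by the example.org webmaster">
]>
<page><footer>&signature;</footer></page>"""

root = ET.fromstring(document)
footer_text = root.find("footer").text

print(footer_text)
```

The parser replaces `&signature;` with the declared text while reading the document, so no server side processing is involved. This only shows an entity whose text is given inline; pulling in the content of another file through an external entity is a separate feature that many parsers disable by default.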
But the most frightening aspect is that when the browsers attempt to
handle just about anything that they may come across, some of the
underlying principles of SGML have been forgotten along the way. One of
the worst problems is the handling of style related information inside
HTML (that is, presentational markup): most browsers keep any style
definitions in effect until they are explicitly ended. This happens even though
the element which contained the original style definition (which itself
is an element in HTML) has already ended. From the
standpoint of SGML this is senseless and indicates that the browser does
not handle the useful hierarchical element structure of HTML documents
correctly. This is a telltale sign that browser designers have left all
of the beautiful implications of the SGML architecture behind and
have instead gone with extensions and nonstandard processing. It may well be
that features like this won’t work after a couple of years
(assuming the browser manufacturers regain their senses) whereas all the
valid HTML pages should work just fine. In fact, the valid pages should
work with even more browsers as time goes by. This is a very weighty
reason to go with valid markup: it is future proof.
A few years ago, when the Web burst into the public eye, the new popularity of HTML also spilled over onto the underlying SGML architecture. It was rapidly determined that at least some of the problems and omissions listed above were there mostly because SGML was so difficult to implement, not because browser manufacturers were lazy. This was no news to the SGML expert: SGML has long been known to be quite difficult to implement and even to understand. What happened next was what always happens when nerds see a problem: a cooperative effort to fix it quickly and simply. The result is called XML, Extensible Markup Language, and it’s basically highly simplified SGML.
More concretely, XML fixes a single SGML declaration so that only one
character mapping, only one standardized character set (ISO
10646/Unicode), practically no SGML extensions and a single set of
capacities (basically no limits) have to be implemented. The possibility
of inclusion and exclusion elements in DTDs (these allow one to permit or
disallow element occurrence at any level of the document tree from a
given element downwards) is also gone, as are some of the other
possibilities in creating mixed content (both text and
elements). Parsing is far simpler in XML than in SGML, and the aim has
been to make it easy, fast and cheap to implement so that even light
weight applications can benefit from structured markup. XML extends the
SGML architecture by introducing the concept of well‐formed
documents, which are XML documents that do not necessarily validate but
are still sensible in their own right. The facility lets us embed
chunks of XML into contexts where it is not explicitly allowed by the
DTD.
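The difference between being well‐formed and being valid can be seen with any non‐validating XML parser. The sketch below uses the one in Python's standard library; the helper name `is_well_formed` and the sample element names are made up for the illustration, and no DTD is consulted at any point.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, with no DTD involved."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Properly nested: well-formed, although nothing declares these elements.
print(is_well_formed("<note><em><strong>fine</strong></em></note>"))

# Overlapping elements: rejected outright, unlike in a lenient browser.
print(is_well_formed("<note><em><strong>broken</em></strong></note>"))
```

A validating parser would go one step further and also check the element structure against the declarations in a DTD.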
The most significant part of XML is by no means the core standard,
but the tens of applications being built on it right now. Core XML
is almost amusingly simple, while some of the actual standard
encodings built on it (e.g. SVG for vector images and MathML
for mathematical notation) and the other architectural features (like namespaces,
querying and fragment interchange) are an order of magnitude more
complex than anything ever seen in SGML circles. This tells a lot
about XML: a primary guideline of XML’s architects was to build a
modular and easily extensible language. Instead of huge, monolithic
DTDs, we use the secondary XML standards (namespaces, XLink/XPointer,
fragment interchange, XSL and so on) to glue small standard XML
structures together into a coherent document.
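As a rough illustration of this gluing, the sketch below parses a single document that mixes three vocabularies by means of namespaces. The namespace URIs are the ones published for XHTML, SVG and MathML; the document content and the helper name `vocabulary_of` are invented for the example, and Python's standard parser is used only to show how the pieces are told apart, not to render anything.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"

# One document, three vocabularies, each identified by its namespace.
document = f"""<html xmlns="{XHTML}">
  <body>
    <p>A circle and a fraction:</p>
    <svg xmlns="{SVG}" width="40" height="40"><circle cx="20" cy="20" r="15"/></svg>
    <math xmlns="{MATHML}"><mfrac><mn>1</mn><mn>2</mn></mfrac></math>
  </body>
</html>"""

root = ET.fromstring(document)


def vocabulary_of(element):
    """Return the namespace URI from ElementTree's '{uri}local' tag form."""
    if element.tag.startswith("{"):
        return element.tag[1:].split("}")[0]
    return ""


# Every element carries its namespace, so a processor can hand each
# fragment to the component that understands that vocabulary.
vocabularies = {vocabulary_of(element) for element in root.iter()}
circle = root.find(f".//{{{SVG}}}circle")

print(sorted(vocabularies))
print(circle.get("r"))
```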
It is thus clear that XML is more of an architecture than the core standard itself would suggest. In the near future, XML with its helper standards will considerably expand the possibilities of the Web to represent structured information. It will also present a serious learning challenge to anybody who uses HTML now. The easiest way to go is probably XHTML, which is simply HTML ported from SGML to XML. This may sound like a trivial change, but it really is not: after going to XHTML, the modular aspects of XML are at the writer’s disposal and, when the support comes, such things as mathematical notation, three dimensional graphics and two dimensional uniform vector graphics can be utilized where we today take graphical shortcuts with GIFs and so on. XML is a nice thing indeed and it’s coming at us head on…