Squeaky clean markup


May, 2000.
By Molly E. Holzschlag.

HTML as we know it has undergone specific and eye-opening changes to prepare for tomorrow's more demanding technological needs. It has been reformulated into Extensible Hypertext Markup Language, or XHTML. The language's maturation was inevitable, but that hasn't made it any less of a headache for coders. Despite the annoyance of breaking old HTML habits, XHTML is a useful tool once you've climbed the learning curve.

Why extensibility?

Today's data management requirements have become extremely complex. Looking at the number of browsers and browser versions in the current market is enough to make even the most experienced developer dizzy. What's more, developers will need to have methods for managing special user agents that will become more commonplace in coming years: browsers for hand-held computers and other devices like wireless phones and pagers.

Then there's the demand for flexible HTML to accommodate industry-specific needs. Public sites and private intranets are becoming highly functional by necessity. For example, sites developed for online banking have different data display and management concerns from those developed for health care management.

A language that can extend to meet these custom demands on both the front and back ends is very desirable. What's more, allowing media management to occur via the language, with the browser acting as an interpreter rather than as a complete engine, is an intelligent approach to reaching beyond HTML's limitations.

But before extending a language, its authors should make it as stable as possible by addressing its weaknesses. To do this, the World Wide Web Consortium (W3C) went back to the origin of Web markup, the Standard Generalized Markup Language (SGML). SGML's rules are quite strict, and HTML, as its personality clearly demonstrates, is an unruly child in need of some discipline to make it behave more consistently and appropriately. XML is a simplified subset of SGML, and XHTML, as an application of XML, inherits the discipline of that mature SGML parent.

Knocking the dust out of HTML

XHTML exists to bridge the gap between HTML and the more flexible, powerful, and syntactically accurate XML. Specific goals of XHTML are to:

So will XHTML actually replace HTML 4.0 as the next client-side markup language? The beauty of XHTML is that it's ready to go right away. Coders who want to comply with current standards should be writing their documents in XHTML (version 1.0), rather than HTML, right now. Browsers can understand XHTML because it relies heavily on HTML, but contains some specific deviations that are fairly easy for the browser to accommodate. This means that once you learn XHTML's basic rules, you can begin writing to the XHTML standard.

Of course, writing good XHTML depends on a sophisticated understanding of HTML as it exists as a formal standard. So if you hand-code HTML by intuition rather than by the specification, or if you rely on editors and WYSIWYG applications, it is in your best interests to brush up on HTML 4.0 before attempting to write XHTML documents. You can do this by visiting the W3C's Web site and studying the HTML 4.0 standard. You can also purchase any number of books on HTML 4.0, or attend classes, seminars, and conferences that address it.

Different is better

Because of the more refined syntactical rules, as well as the concurrent flexibility found within XHTML, authors working with HTML today should take a look at the differences in XHTML and begin incorporating the stricter coding practices into their HTML and XHTML documents.

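To begin with, it's important to look at the format of XHTML documents. All XHTML documents must validate against one of the three XHTML DTDs, use <html> as the root element, declare the XHTML namespace on that root element, and include a DOCTYPE declaration before the root element.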

For those unfamiliar with DTDs, a DTD (document type definition) is a set of rules that defines the logical structure of a legal, conforming document. XHTML has three DTDs: Strict, Transitional, and Frameset. If you know your HTML 4.0 well, you'll understand this terminology. For those of you who don't, let me simply say that each DTD provides a guideline for legitimate syntax. The DTD you choose depends on the kind of document you are writing.

DTDs are named in the DOCTYPE declaration. I'll show you how to declare them shortly. But before that, I want to turn to the issue of the root element and namespace. The root element for either a standard HTML page or an XHTML page is <html>. In XHTML, you'll need to add the xmlns (XML namespace) attribute and its value to further define the page as an XHTML (not HTML) document (see Listing 1).
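The listing is not reproduced in this copy of the article. As a stand-in, and not as the original Listing 1, here is a minimal sketch of an XHTML 1.0 document that assumes the Strict DTD. It is wrapped in a few lines of Python so that the root element and namespace can be checked with the standard xml.etree.ElementTree parser. The page title, the paragraph text, and the helper name root_is_xhtml_html are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI that XHTML 1.0 documents declare on the root element.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# A minimal XHTML 1.0 document. The DOCTYPE names the Strict DTD;
# the title and body text are placeholders invented for this sketch.
MINIMAL_XHTML = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Coffee Crazy</title>
  </head>
  <body>
    <p>Welcome to the Coffee Crazy Web site.</p>
  </body>
</html>"""


def root_is_xhtml_html(document: str) -> bool:
    """Return True if the root element is <html> in the XHTML namespace."""
    root = ET.fromstring(document)
    # ElementTree spells a namespaced tag as "{namespace-uri}localname".
    return root.tag == "{%s}html" % XHTML_NS


print(root_is_xhtml_html(MINIMAL_XHTML))  # True
```

Note that this parser only checks that the document is well formed and reports the root element. It does not validate the document against the DTD named in the DOCTYPE.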

Writing code that shines

After ensuring that the document itself conforms to the standard, turn to the syntactical issues of XHTML. Typically, any differences between HTML and XHTML syntax are due to the influence of the rigorous XML on the less well-behaved HTML. Concerns for HTML coders include the following core issues: well-formed markup, lowercase tags and attributes, quoted attribute values, end tags for nonempty elements, and terminated empty elements.

Let's take a detailed look at what each of these considerations entails.

Neatness counts

Precise coders will really enjoy the clean aspect of XHTML. However, many HTML coders have learned HTML in bootstrap fashion by necessity, or simply don't know or care much about meticulous code. Those people face the hard reality of having to clean up their proverbial acts.

A well-formed document contains clean, consistent markup written to the rules and regulations of the language. HTML browsers are very forgiving, but XHTML—in order to be flexible—cannot afford instability. As a result, no sloppy code is allowed.

Specifically, all documents must be written symmetrically. This means that no element can be illegally nested in another element. Here's a bit of symmetrical HTML code:

<b>This code is <i>symmetrical</i></b>

A poorly formed example would be:

<b>This code is <i>not</b></i>

Despite the fact that many browsers read the code in the second example properly, it is still poorly formed because the <i> and </b> tags overlap. You can easily test symmetry by drawing a line from each element to its companion closing element (Figure 1). If there is no intersection, the code is symmetrical. If the lines cross, the code is asymmetrical and therefore considered poorly formed.

Figure 1. This exercise will help you determine whether your code is symmetrical.
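If you would rather let a machine draw the lines, any XML parser will do the same job. Here is a minimal sketch that uses Python's standard xml.etree.ElementTree module. The helper name is_well_formed is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if an XML parser accepts the fragment as well formed."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        # Crossed (asymmetrical) tags are reported as a mismatched tag.
        return False


symmetrical = "<b>This code is <i>symmetrical</i></b>"
crossed = "<b>This code is <i>not</b></i>"

print(is_well_formed(symmetrical))  # True
print(is_well_formed(crossed))      # False
```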

Getting in touch with your case-sensitive side

I have been advocating lowercase tags and attributes for years, despite guidelines imposed by various companies or publishers for whom I have taught or developed HTML. There is elegance and precision in lowercase tagging, and XHTML enforces this precision. Since XML is case sensitive, all XHTML element and attribute names must be written in lowercase in order to achieve effective reformulation. This way, supporting user agents can differentiate between tags and attributes appropriately.

So, if you've been coding in all caps:

<A HREF="http://www.molly.com">Go to Molly's Site</A>

or using combination case:

<A href="http://www.molly.com">Go to Molly's Site</a>

get ready to change your style. You are required under XHTML to place all tags and attributes in lower case:

<a href="http://www.molly.com">Go to Molly's Site</a>
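To see what case sensitivity means in practice, here is a small sketch that feeds the three variations above to Python's standard XML parser. The helper name parsed_root_name is hypothetical. The parser checks only well-formedness, so the all-caps version parses, but it yields an element named "A", which is not the XHTML element "a".

```python
import xml.etree.ElementTree as ET


def parsed_root_name(fragment: str):
    """Return the root element's name, or None if the fragment is not well formed."""
    try:
        return ET.fromstring(fragment).tag
    except ET.ParseError:
        return None


all_caps = '<A HREF="http://www.molly.com">Go to Molly\'s Site</A>'
mixed = '<A href="http://www.molly.com">Go to Molly\'s Site</a>'
lower = '<a href="http://www.molly.com">Go to Molly\'s Site</a>'

print(parsed_root_name(all_caps))  # A (well formed, but not the element "a")
print(parsed_root_name(mixed))     # None (<A> and </a> do not match)
print(parsed_root_name(lower))     # a
```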

And I quote

Another lax practice that HTML forgives and XHTML tightens is the quotation of attribute values. If you've got a value attached to an attribute, it must appear in quotes. Current practices, both those of hand coders and those of HTML editors and WYSIWYG applications, are very inconsistent. Here's an unquoted example:

<div align=center>

Notice that the center value doesn't have quotes around it. In XHTML, this is forbidden! XHTML insists that you write your code as follows:

<div align="center">

Writing XHTML requires that you be more attentive to your code.
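An XML parser is just as strict about quotes as the rule suggests. The following sketch uses Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the sample text inside the div are assumptions made for this example, and a closing </div> has been added so that the quoting is the only difference between the two samples.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if an XML parser accepts the fragment as well formed."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


unquoted = "<div align=center>Centered text</div>"
quoted = '<div align="center">Centered text</div>'

print(is_well_formed(unquoted))  # False
print(is_well_formed(quoted))    # True

# The quoted value is available as an ordinary attribute once parsed.
print(ET.fromstring(quoted).get("align"))  # center
```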

Managing nonempty elements

A nonempty element is one that contains content. In HTML 4.0, some nonempty elements do not require a closing tag, even though the data that follows is governed by the opening tag. The paragraph element is an excellent example of this. In legal HTML 4.0 you can code a paragraph as follows:

<p>Welcome to the Coffee Crazy Web site. We offer great coffees, teas, and specialty gifts. Since the holidays are right around the corner, be sure to stop by our special holiday catalog and see all we have to offer.

This is a nonempty element. The <p> tag opens the paragraph and can legally carry its attributes. However, the element contains the paragraph data and is never explicitly closed before another paragraph or element begins.

In XHTML, end tags are required for nonempty elements. The <p> element has another legal option in HTML 4.0, and that's the open/close tag option. Using the open/close option, the code sample would then read:

<p>Welcome to the Coffee Crazy Web site. We offer great coffees, teas, and specialty gifts. Since the holidays are right around the corner, be sure to stop by our special holiday catalog and see all we have to offer.</p>

This paragraph is an appropriately managed nonempty element under XHTML 1.0.
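The difference is easy to check with an XML parser. In this sketch, which uses Python's standard xml.etree.ElementTree module, the helper name is_well_formed, the wrapping <body> element, and the shortened paragraph text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if an XML parser accepts the fragment as well formed."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# HTML 4.0 style: each <p> is left open.
open_only = "<body><p>Welcome to the Coffee Crazy Web site.<p>We offer great coffees.</body>"
# XHTML style: every <p> has a matching </p>.
open_close = "<body><p>Welcome to the Coffee Crazy Web site.</p><p>We offer great coffees.</p></body>"

print(is_well_formed(open_only))   # False
print(is_well_formed(open_close))  # True

# With end tags in place, the parser sees two sibling paragraphs.
print(len(ET.fromstring(open_close).findall("p")))  # 2
```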

Tying up loose ends

HTML contains numerous empty elements. These are elements that have no content and no closing tags. XHTML 1.0 requires that any empty element either be closed with a closing tag or contain a terminator in its opening tag.

Examples of empty elements include <br>, <hr>, and <img>. If you would like to add a line break under XHTML rules, you'll add a terminator to the opening tag. The terminator style dictates that the tag markup is followed by a space and then a forward slash: <br />, <hr />, or <img />.

So, legal XHTML would be written as follows:

<img src="http://www.foo.com/images/coffee.jpg" width="50" height="50" alt="Coffee" /> Please feel free to get in touch. We are located at:<br />
10,000 Coffee Talk Way<br />
Anywhere, USA<br />
<hr width="75%" />
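Both ways of closing an empty element, the terminator and the explicit closing tag, satisfy an XML parser, while the bare HTML form does not. Here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the wrapping <p> element are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if an XML parser accepts the fragment as well formed."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


unterminated = "<p>10,000 Coffee Talk Way<br>Anywhere, USA</p>"
terminated = "<p>10,000 Coffee Talk Way<br />Anywhere, USA</p>"
paired = "<p>10,000 Coffee Talk Way<br></br>Anywhere, USA</p>"

print(is_well_formed(unterminated))  # False
print(is_well_formed(terminated))    # True
print(is_well_formed(paired))        # True
```

The terminator form, with its leading space, is the one the article recommends.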

The white-glove test

XHTML obviously places demands on the coder, but the demands on user agents and applications will cause equal and possibly greater concern. Developers are faced with creating updated Web browsers that not only support and differentiate between extensible languages, but also possess backward compatibility. New user agents that are being developed for handheld devices and cell phones need to incorporate extensible technology.

Application developers must pay attention to these changes. Even leading HTML editors like Allaire's HomeSite do not currently conform to XHTML's strict coding practices, such as ensuring that all attributes are quoted. WYSIWYG software vendors are especially challenged to revise their products so that they output well-formed code. While this is undoubtedly a serious task, the end result is certain to be a much more sophisticated set of development tools that grow and evolve with developer and market demands.

Copyright Dunstan Orchard