An introduction to XML

By: Lars Marius Garshol

About this document

This document is something my thesis advisor asked me to write in connection with my thesis, to clarify what XML is and what I saw in it. It occurred to me that this might be useful for a lot of people, so I put it out on the web. After the Norwegian version received quite a few hits I asked on comp.text.sgml whether there was any interest in an English version. There was, so I translated it.

What's wrong with HTML?

(If you don't know the difference between a tag and an element in HTML/SGML you should read the glossary at the end of this document.)

Originally, the intention with HTML was that the elements should be used to mark up information according to its meaning, without regard to how this would actually be rendered in a browser. In other words: title, main header, emphasized text and the contact information of the author should be placed inside the elements TITLE, H1, EM (or possibly STRONG) and ADDRESS. Using FONT, I and similar elements to get a nice layout makes it a lot more difficult to present the information to the best possible effect regardless of the user's environment. Processing the information automatically also becomes difficult (or even impossible). (See reference 1.)

The reason why the browser should decide for itself how to display titles, headers and so on is that it knows a lot more about the user's preferences and environment and so can make decisions based on that. The author, not knowing his reader, cannot do this as well, of course. This is especially important for people who are blind, who run non-graphical browsers, or who have weak eyesight and therefore need larger font sizes. This means that an author who doesn't follow the rules will cause problems for those readers who read in a non-standard environment.

Unfortunately, browser vendors have either not understood this or decided to ignore it, as they have ignored standards that tried to place information about layout outside the HTML documents themselves, like CSS. (See reference 2.) Instead, they've introduced their own elements and attributes whose only purpose is to specify the layout, like FONT, CENTER, BGCOLOR etc. They've also made HTML editors (like Netscape Gold) which produce HTML where the markup is presentational rather than semantic. (For instance, Netscape Gold uses UL to produce indentation, and not just for lists.)

The result is that a lot of pages on the web now contain tagging that's written for a specific version of a specific browser (with default preferences) and a specific screen resolution. These pages are often more or less unreadable to those who use something else. Thus, HTML has gradually been turned into a presentational language for Netscape and MSIE by the vendors and their users.

This, however, is not the only problem. If you want to mark up your information really precisely according to its meaning you'll want lots of elements that just aren't present in HTML. If you are, say, a chemist, you'll probably want special elements for chemical formulas, for measurement data and so on. If you are an airplane manufacturer you'll want to be able to talk about engines, parts and models. Catering to the needs of all trades and people will obviously mean having an enormous number of elements, which is quite simply a Bad Thing for both developers and users.

Another problem is that HTML has very little internal structure, which means that you can easily write valid HTML that does not make sense at all when you consider the semantics of the elements. This is because (among other things) the contents of BODY have been defined so that you can place the elements allowed therein in any order you please. This means that you don't need an H1 with the H2s inside it and H3s inside the H2s. (Think of H1 as a book title, H2 as part title and H3 as chapter title.) HTML should ideally be written this way, but the HTML standard does not require it. (See references 1 and 3.)

People have been aware of these problems for quite some time, and in the summer of '96 the W3C (which defines the web standards) started work on a new standard to deal with these problems. The W3C has set up a working group that is now creating this new standard called XML, for eXtensible Markup Language. The working group (from now on called XWG, for XML working group) has split their work into three phases:

Phase 1
Define a standard for the creation of markup languages.
Phase 2
Develop a common standard for linking in these markup languages.
Phase 3
Develop a common standard for specifying the layout of documents encoded in these languages.

Phase 1 is now completed, since the XML 1.0 specification is finished. Phase 2 is still under way, although there is a working draft. Phase 3 has not yet reached that stage; so far only a proposal exists.

XML

Please note that the descriptions given below are simplified and only meant to give an impression of XML. They leave out a lot of the standards and are (for reasons of readability) a little inaccurate. If you want more detailed and accurate information you should go on to the sources listed in the appendices below. Also note that, apart from XML 1.0 itself, these standards are not finalized yet, so they may change before they're officially accepted. As a first introduction, however, this document should be useful.

XML itself

There already exists a standard for defining markup languages like HTML, which is called SGML. HTML is actually defined in SGML. SGML could have been used as this new standard, and browsers could have been extended with SGML parsers. However, SGML is quite complex to implement and contains a lot of features that are very rarely used. Its support for different character sets is also a bit weak, which is something that can cause problems on the web where people use many different kinds of computers and languages. It's also difficult to interpret an SGML document without having the definition of the markup language (the DTD) available. Because of this, the XWG decided to develop a simplified version of SGML, which they called XML. (As they like to say, XML is more like SGML light, than HTML++.)

The main point of XML is that you, by defining your own markup language, can encode the information of your documents much more precisely than is possible with HTML. This means that programs processing these documents can "understand" them much better and therefore process the information in ways that are impossible with HTML (or ordinary text processor documents). Imagine that you marked up recipes (for, say, soups and seafood dishes etc) according to a DTD tailored for recipes where you entered the amounts of each ingredient and alternatives for some ingredients. You could then easily make a program that, given a list of the contents of your fridge, would go through the entire list of recipes and make a list of the dishes you could make with them. Given nutritional information about the ingredients (x calories per ounce of this, y calories per ounce of that etc) the program could sort the suggestions by the amount of calories in each dish. Or by how long they'd take to prepare, or the price (given price information for the ingredients), or... The possibilities are almost endless, because the information is encoded in a way that the computer can "understand".
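To make the fridge idea a little more concrete, here is a minimal sketch in Python. The recipe markup (RECIPES, RECIPE, INGREDIENT and their attributes) and the calorie figures are invented for this illustration; they do not come from any real recipe DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe markup; element and attribute names are made up.
RECIPES_XML = """
<RECIPES>
  <RECIPE NAME="Fish soup">
    <INGREDIENT NAME="cod" GRAMS="400"/>
    <INGREDIENT NAME="carrot" GRAMS="100"/>
    <INGREDIENT NAME="cream" GRAMS="200"/>
  </RECIPE>
  <RECIPE NAME="Tomato soup">
    <INGREDIENT NAME="tomato" GRAMS="500"/>
    <INGREDIENT NAME="onion" GRAMS="50"/>
  </RECIPE>
  <RECIPE NAME="Shrimp salad">
    <INGREDIENT NAME="shrimp" GRAMS="200"/>
    <INGREDIENT NAME="lettuce" GRAMS="100"/>
  </RECIPE>
</RECIPES>
"""

# Made-up nutritional data: calories per 100 grams of each ingredient.
CALORIES_PER_100G = {
    "cod": 80, "carrot": 40, "cream": 300,
    "tomato": 18, "onion": 40, "shrimp": 100, "lettuce": 15,
}


def possible_dishes(recipes_xml, fridge):
    """Return (name, calories) for every recipe whose ingredients are
    all in the fridge, sorted by total calories (lowest first)."""
    root = ET.fromstring(recipes_xml)
    dishes = []
    for recipe in root.findall("RECIPE"):
        ingredients = recipe.findall("INGREDIENT")
        if all(i.get("NAME") in fridge for i in ingredients):
            calories = sum(
                int(i.get("GRAMS")) * CALORIES_PER_100G[i.get("NAME")] / 100
                for i in ingredients
            )
            dishes.append((recipe.get("NAME"), calories))
    return sorted(dishes, key=lambda dish: dish[1])


fridge = {"tomato", "onion", "shrimp", "lettuce", "carrot"}
print(possible_dishes(RECIPES_XML, fridge))
```

Because the amounts and ingredient names are separate, labelled pieces of data, the program never has to guess where they are in the text. Sorting by preparation time or price would only need another attribute and another sort key.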

Defining your own markup language with XML is actually surprisingly simple. If you wanted to make a markup language for FAQs you might want it to be used like this: (note that this example is really too simple to be very useful)

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE FAQ SYSTEM "FAQ.DTD">
<FAQ>
  <INFO>
    <SUBJECT>   XML                </SUBJECT>
    <AUTHOR>    Lars Marius Garshol</AUTHOR>
    <EMAIL>     larsga@ifi.uio.no  </EMAIL>
    <VERSION>   1.0                </VERSION>
    <DATE>      20.jun.97          </DATE>
  </INFO>

  <PART NO="1">
    <Q NO="1">
      <QTEXT>What is XML?</QTEXT>
      <A>SGML light.</A>
    </Q>

    <Q NO="2">
      <QTEXT>What can I use it for?</QTEXT>
      <A>Anything.</A>
    </Q>
  </PART>
</FAQ>

In XML, the markup language shown above (let's call it FAQML) would have a DTD like this:

<!ELEMENT FAQ     (INFO, PART+)>

<!ELEMENT INFO    (SUBJECT, AUTHOR, EMAIL?, VERSION?, DATE?)>
<!ELEMENT SUBJECT (#PCDATA)>
<!ELEMENT AUTHOR  (#PCDATA)>
<!ELEMENT EMAIL   (#PCDATA)>
<!ELEMENT VERSION (#PCDATA)>
<!ELEMENT DATE    (#PCDATA)>

<!ELEMENT PART    (Q+)>
<!ELEMENT Q       (QTEXT, A)>

<!ELEMENT QTEXT   (#PCDATA)>
<!ELEMENT A       (#PCDATA)>

<!ATTLIST PART    NO    CDATA #IMPLIED
                  TITLE CDATA #IMPLIED>
<!ATTLIST Q       NO    CDATA #IMPLIED>

<!ELEMENT> is used to define elements like this: <!ELEMENT NAME CONTENTS>. NAME gives the name of the element, and CONTENTS describes which elements are allowed inside the element being defined, and in what order. A,B means that you must have an A first, followed by a B. ? after an element means that it can be skipped, + means that it must be included one or more times and * means that it can be skipped or included one or more times. #PCDATA means ordinary text without markup (more or less).
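The operators ?, + and * mean much the same as they do in regular expressions, so one way to get a feel for content models is to translate them into regular expressions and match them against a list of child element names. The Python sketch below does this for simple models only; it is an illustration of the idea, not how a real XML parser validates documents, and it ignores alternatives (|) and #PCDATA. The function names are my own.

```python
import re


def content_model_to_regex(model):
    """Translate a simple DTD content model into a regular expression.

    Handles names, ',', '?', '+', '*' and parentheses only.
    """
    model = re.sub(r"\s+", "", model)
    # Each element name becomes a group matching the name plus one
    # trailing space, so that "A" can never match the start of "AUTHOR".
    pattern = re.sub(
        r"[A-Za-z][\w.-]*",
        lambda match: "(?:%s )" % re.escape(match.group(0)),
        model,
    )
    # A comma only expresses sequence, which is concatenation in a regex.
    return pattern.replace(",", "")


def children_match(model, child_names):
    """Check whether a sequence of child element names fits the model."""
    sequence = "".join(name + " " for name in child_names)
    return re.fullmatch(content_model_to_regex(model), sequence) is not None


info_model = "(SUBJECT, AUTHOR, EMAIL?, VERSION?, DATE?)"
print(children_match(info_model, ["SUBJECT", "AUTHOR", "DATE"]))  # EMAIL and VERSION skipped
print(children_match(info_model, ["AUTHOR", "SUBJECT"]))          # wrong order
print(children_match("(INFO, PART+)", ["INFO"]))                  # at least one PART required
```

The first call accepts the children because the skipped elements are marked with ?, while the other two are rejected: one has the elements in the wrong order and the other lacks the PART that + requires.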

An important difference between XML and SGML is that elements which do not have any contents (like IMG and BR of HTML) are written like this in XML: <IMG SRC="pamela.gif"/>. Note the slash before the final >. This means that a program can read the document without knowing the DTD (which is where it says that IMG does not have any contents) and still know that IMG does not have an end tag and that what comes after IMG is not inside the element.
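This is easy to try out with any XML parser. The short Python sketch below uses the standard library's ElementTree module and never supplies a DTD; the P element wrapped around the image is just something I added to have a complete document.

```python
import xml.etree.ElementTree as ET

# No DTD is supplied anywhere in this example.
paragraph = ET.fromstring('<P>Before <IMG SRC="pamela.gif"/> after</P>')
image = paragraph.find("IMG")

print(len(image))        # number of child elements inside IMG
print(image.text)        # text inside IMG
print(repr(image.tail))  # text that follows IMG, which belongs to P


def is_well_formed(text):
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The SGML/HTML way of writing the empty element, without the slash.
print(is_well_formed('<P>Before <IMG SRC="pamela.gif"> after</P>'))
```

The parser can tell from the slash alone that IMG is empty and that " after" lies outside it. Without the slash it has to assume that IMG is still open when </P> arrives, and so it rejects the document.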

<!ATTLIST> defines the attributes of an element. In the DTD given above it's used to give PART and Q an attribute called NO, which contains ordinary text and which can be skipped. As you can see, PART has two attributes, and the last one is called TITLE, contains text and can be skipped.

Linking in XML

HyTime is a standard for adding linking attributes and elements to SGML DTDs. It is far more advanced than what's possible with HTML and contains a lot of stuff not useful on the web. The XWG is therefore currently making a similar standard for XML which borrows a lot from HyTime (and similar standards) and simplifies it.

To make it possible to use this linking standard in any DTD (regardless of which elements the DTD has), no particular elements are defined for linking. Instead, linking elements use special attributes that identify them as linking elements. All elements that have an attribute called XML-LINK will be considered linking elements. The value of XML-LINK specifies what kind of link the element represents.

XML links can be between two or more resources, and resources can be either files (and not necessarily XML or HTML files) or elements in files. The ACTUATE attribute specifies when a link is to be followed: either when the user explicitly requests this, for instance by clicking (value USER), or automatically, i.e. when the system reads the linking element (value AUTO). What happens when you follow the link is specified with SHOW, which can take the following values:

EMBED
This means that the resource the link points to is to be inserted into the document the link comes from. This will happen either during the displaying of the document or during processing of the document. This can be useful for including text from other files (with ACTUATE=AUTO) or to include a picture in a page. It can also be used to insert footnotes into the text and ACTUATE will then specify if the user has to click on the footnotes to include them or whether all footnotes will be inserted automatically.
REPLACE
This means that the resource the link points to is to replace the linking element. If you have two different versions of a paragraph you can link them in such a way that one can see the other version in the same context by following the link.
NEW
In this case, following the link will not affect the resource the link came from. Instead, the linked resource will be processed/displayed in a new context. Ordinary HTML links are of type NEW, as the new page is displayed in place of the previous one.
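As a rough sketch of how a program might pick out linking elements in an arbitrary DTD, the Python below looks for any element carrying an XML-LINK attribute and reads its SHOW and ACTUATE values. Several things here are assumptions made for the illustration: the element names, the HREF attribute used for the target, and the XML-LINK value SIMPLE. The sketch also expects SHOW and ACTUATE to be written out on every link.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; REF and FIGURE are ordinary elements that
# become linking elements only because they carry XML-LINK.
DOC = """
<ARTICLE>
  <PARA>See <REF XML-LINK="SIMPLE" HREF="intro.xml"
                 SHOW="NEW" ACTUATE="USER">the intro</REF>.</PARA>
  <FIGURE XML-LINK="SIMPLE" HREF="chart.gif" SHOW="EMBED" ACTUATE="AUTO"/>
  <NOTE>No link here.</NOTE>
</ARTICLE>
"""

WHEN = {
    "USER": "when the user asks for it",
    "AUTO": "automatically",
}
WHAT = {
    "EMBED": "inserted into the document",
    "REPLACE": "put in place of the linking element",
    "NEW": "shown in a new context",
}


def find_links(xml_text):
    """Collect every element that has an XML-LINK attribute."""
    links = []
    for element in ET.fromstring(xml_text).iter():
        if "XML-LINK" in element.attrib:
            links.append({
                "element": element.tag,
                "target": element.get("HREF"),  # HREF is an assumption
                "show": element.get("SHOW"),
                "actuate": element.get("ACTUATE"),
            })
    return links


def describe(link):
    """Spell out what following the link means."""
    return "%s is %s %s" % (
        link["target"], WHAT[link["show"]], WHEN[link["actuate"]])


for link in find_links(DOC):
    print(link["element"], "->", describe(link))
```

Note that the code never mentions REF or FIGURE by name. It would work unchanged with a DTD that used completely different element names, which is the point of identifying links by attribute.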

XML linking is even more advanced than this. Links can be between more than two resources, they can be specified outside the actual documents themselves and the linked-to element inside a resource can be specified in very powerful ways. The element can be identified by an ID attribute or by its position in the element structure, and one can even specify that the link goes to things like "fourth LI inside the first UL inside BODY".
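To give an impression of what addressing by ID and by position means, here is a small Python sketch. It uses the path expressions of the standard ElementTree module, which are a different notation from the one in the XML linking draft; only the idea is the same. The sample page is invented.

```python
import xml.etree.ElementTree as ET

PAGE = """
<HTML>
  <BODY>
    <P>Intro</P>
    <UL>
      <LI>one</LI>
      <LI>two</LI>
      <LI ID="third">three</LI>
      <LI>four</LI>
    </UL>
    <UL>
      <LI>other list</LI>
    </UL>
  </BODY>
</HTML>
"""

root = ET.fromstring(PAGE)

# "Fourth LI inside the first UL inside BODY", addressed by position.
# Positions count from 1, among children with the same element name.
by_position = root.find("BODY/UL[1]/LI[4]")
print(by_position.text)

# The same kind of lookup by ID attribute, wherever the element is.
by_id = root.find(".//*[@ID='third']")
print(by_id.text)
```

Addressing by position works even when the target document has no IDs at all, which matters when you link into documents you cannot edit.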

In FAQML this could have been used both for specifying links to relevant information outside the FAQ as well as specifying internal relationships between different answers. It could also have been used for footnotes etc.

XML and layout

There is actually already an SGML standard for this as well, and it's called DSSSL, and isn't very simple, either. The XWG has therefore decided to make a simplified version of DSSSL as well and call it XSL. So far, not much has been done about this. One proposal (see references) has been submitted, but it hasn't been accepted yet, and it's uncertain if it will be. So, I'm going to describe DSSSL instead of XSL, at least until the future shape of XSL becomes clearer.

DSSSL is actually a full programming language, based on Scheme (a LISP dialect), and is very powerful. It can be used both as a stylesheet specifying fonts and positioning for the different elements and as a transformational language that can be used to transform documents from one DTD to another.

The most common use of DSSSL is to convert SGML documents to other formats better suited for presentation, like PDF (also known as Acrobat), PostScript, LaTeX, HTML or RTF. What the XWG is planning is to use XSL to specify how XML documents are to be displayed on screen.

Below I try to show how we could make a stylesheet for FAQML, but without explaining very much of what really happens. I've split the DSSSL file into several parts in order to be able to comment it as it's written, but it is meant to be a single file.

DSSSL consists of several different parts, and the most basic one is the expression language which is quite simply a subset of Scheme. This means that DSSSL-stylesheets are really one large Scheme expression that is calculated by the DSSSL engine, with a file as the result of the calculation. Another important part (which is built on the expression language) is the style language, which I've used almost exclusively in this example. A third part is the query language, which can be used to find any element you want in your document. I've used it in this example to find the number of a FAQ question from inside the QTEXT element. This was necessary because NO is an attribute of the surrounding Q element, and not QTEXT itself.

All formatting in DSSSL is done with so-called flow objects. In the code below you'll see a lot of (element X (make Y ...)) expressions, which indicate that when element X shows up the DSSSL engine is to create a flow object of type Y. The style rules for Y come next, followed by the contents of Y. There's much more to DSSSL than this, but the rest is considered to be outside the scope of this document.

<!doctype style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">

;--- DSSSL stylesheet for FAQML

;---Constants

(define *font-size* 	12pt)
(define *font* 		"Times New Roman")

The first line tells the SGML parser that this document follows the DTD for DSSSL. (Yes, DSSSL is an SGML application.) The next two lines are comments (after ; the rest of the line is ignored). Then I define two constants that I use below in the styles themselves. This is done to make it easy to change the font size of the entire document without having to adjust sizes for all kinds of headers etc. Instead, I just change the value of *font-size*.

;---Element styles

(element FAQ
  (make simple-page-sequence
	font-family-name:	*font*
	font-size: 		*font-size*
	input-whitespace-treatment: 'collapse
	line-spacing:		(* *font-size* 1.2)

	(process-children)))

This part creates a flow object for the FAQ element, i.e. the whole document. The flow object is "simple-page-sequence", which I assume is meant for small articles. I then specify what font to use, the font size, that whitespace is to be considered insignificant (like in HTML) and finally the line height, which is set to 1.2 times the font size.

(element INFO
  (make paragraph
	quadding:		'center
	space-after:		(* *font-size* 1.5)

	(process-children)))

This indicates that the element INFO (from start-tag to end-tag) is to be laid out as a paragraph that is centered and has a blank space as high as 1.5 lines after it. After creating the paragraph flow object the DSSSL engine is to go on to process the child elements of INFO.

(element SUBJECT
  (make paragraph
	font-size:		(* *font-size* 2)
	line-spacing:		(* *font-size* 2)
	space-after:		(* *font-size* 2)
	
	(process-children)))

The subject element gets its own paragraph and is displayed in double font size. AUTHOR and EMAIL are simpler versions of this, so I skip them. (You can find them in the complete DSSSL file linked to below.)

(element VERSION
  (make paragraph
	
	(make sequence
	  (literal "Version: "))
	(process-children)))

The VERSION element is given its own paragraph, which contains sequence flow objects. I insert one containing the text "Version: " before the actual contents of VERSION are processed. This means that the text "Version: " will be inserted in front of the actual version number. DATE is similar, so I skip that.

(element PART
  (make paragraph
	font-size:    		(* *font-size* 1.5)
	line-spacing:		(* *font-size* 2)

	(make sequence
	  (literal (attribute-string "NO" (current-node)))
	  (literal ". ")
	  (literal (attribute-string "TITLE" (current-node)))
	  )

	(process-children)))

I wanted PART to have a large font size and contain both number and title. We've already seen how to do this with sequence, but the problem of getting hold of the number and title is new. They are only given as attributes, and thus will be ignored by (process-children). The function attribute-string gives us what we want. (attribute-string "NO" (current-node)) returns the value of the attribute NO in the current element. The rest of this style sheet is so simple that I'll just skip it without comments.
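For readers who don't know Scheme, the same pattern (one rule per element, a call to process the children, and explicit lookups for attributes) can be sketched in Python. This is only an analogy that produces plain text; it is not DSSSL and has no flow objects or fonts. The function names are my own, and the sample document is a cut-down FAQML file with a TITLE attribute added to PART.

```python
import xml.etree.ElementTree as ET

FAQ = """
<FAQ>
  <INFO>
    <SUBJECT>XML</SUBJECT>
    <VERSION>1.0</VERSION>
  </INFO>
  <PART NO="1" TITLE="Basics">
    <Q NO="1">
      <QTEXT>What is XML?</QTEXT>
      <A>SGML light.</A>
    </Q>
  </PART>
</FAQ>
"""


def text_of(element):
    """Text directly inside the element, with whitespace trimmed."""
    return (element.text or "").strip()


def process_children(element):
    """Apply the matching rule to each child, like (process-children)."""
    return "".join(render(child) for child in element)


def render(element):
    """Look up the rule for this element type and apply it."""
    return RULES.get(element.tag, default_rule)(element)


def default_rule(element):
    text = text_of(element)
    return (text + "\n" if text else "") + process_children(element)


def version_rule(element):
    # Literal text placed in front of the element's own content.
    return "Version: " + text_of(element) + "\n"


def part_rule(element):
    # Attributes are not children, so they are fetched explicitly,
    # much like attribute-string in the stylesheet.
    heading = "%s. %s\n" % (element.get("NO"), element.get("TITLE"))
    return heading + process_children(element)


def q_rule(element):
    return "Q%s: %s\nA: %s\n" % (
        element.get("NO"),
        text_of(element.find("QTEXT")),
        text_of(element.find("A")),
    )


RULES = {"VERSION": version_rule, "PART": part_rule, "Q": q_rule}

output = render(ET.fromstring(FAQ))
print(output)
```

Elements without a rule of their own (FAQ, INFO and SUBJECT here) fall through to the default rule, which just emits their text and moves on to the children.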

In case anyone's interested, they can find the entire DSSSL file here, together with the results in RTF and PostScript formats. The RTF file is produced by Jade (see reference 12) and the PostScript file is produced from this. Note that the RTF and PostScript files are from the Norwegian version. This should make no difference, though.

The difference between XSL and DSSSL

At this point it doesn't seem like XSL will be based on Scheme, since Microsoft and Netscape have already implemented JavaScript in their browsers. So XSL will probably be defined as an XML DTD that uses JavaScript for programming. That's a pity, since DSSSL has such a nice syntax and Scheme is such a great programming language, but Netscape Navigator and MSIE are of course large enough as they are.

What will XML be used for?

Please note that what follows is only my personal view of the future of the web and should as such be taken with a pinch of salt. (Reference 4 is an excellent article by XWG Chairman Jon Bosak on XML and the future of the web.)

The layout problem

The first thing I hope XML can put right is the problem of making web pages with decent layout that are still accessible to anyone, regardless of browser. Since XSL is meant to be a complete standard that browsers will support, one should, after a while, be able to expect a stable standard to write against. XSL also lets you check whether optional features are present and, if they are not, supply alternative code to take care of those cases.

A FAQ-maintainer will also be rid of the problems with maintaining the FAQ in HTML, .txt and PDF versions (or whatever). Instead s/he can make one (or more) DSSSL stylesheets to be run each time the original has been updated to create new versions of the distribution files. (Just like I produced .RTF and .PS files for my FAQ above.)

Considering that neither Microsoft nor Netscape has been able to implement CSS (or even HTML) properly, one can wonder what will happen when they try to implement XML and XSL. My hope is that they'll decide they have to make a real effort and do it properly, and that if they don't, somebody who does will take over the market. They've now promised to support XML, so there's room for hope, but no more... (See references 5 and 6.)

More versatile ways of displaying data

An API to be supported by all XML and HTML processors (that is, browsers and other tools) is under development under the name Document Object Model (or DOM). This happens a little on the side of the XWG's work, but is still well under way. (See reference 7.) This API will make it possible to make Java applets (or JavaScript snippets) that can be used to change the display of XML-encoded information in web browsers. (The members of XWG like to call this "giving Java something to work with.")

This can be used in a nearly infinite number of ways. Examples of what the developers have in mind are footnotes that stay invisible until you click the footnote number in the text, and a table of contents from which you can descend through the levels of a document by clicking (like in Windows Explorer). You can also make things like tables that can be sorted by any column by clicking on it. The possibilities are nearly unlimited, and this is only the tip of the iceberg.
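The sortable table is simple enough to sketch. The example below uses Python's xml.dom.minidom module, which implements the DOM interfaces, rather than a Java applet in a browser, so it only shows the tree manipulation and not the clicking or redisplay. The table contents and the function names are invented for the illustration.

```python
from xml.dom.minidom import parseString

# Written on one line so that the only children of TABLE are TR elements.
TABLE = (
    "<TABLE>"
    "<TR><TD>pear</TD><TD>3</TD></TR>"
    "<TR><TD>apple</TD><TD>10</TD></TR>"
    "<TR><TD>fig</TD><TD>7</TD></TR>"
    "</TABLE>"
)


def cell_text(row, column):
    """Return the text of the given cell (counting from 0) in a row."""
    cell = row.getElementsByTagName("TD")[column]
    return "".join(
        node.data for node in cell.childNodes
        if node.nodeType == node.TEXT_NODE
    )


def sort_table(table, column, key=str):
    """Reorder the TR children of a table by the chosen column."""
    rows = [
        node for node in table.childNodes
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "TR"
    ]
    rows.sort(key=lambda row: key(cell_text(row, column)))
    for row in rows:
        # appendChild moves a node that is already in the tree, so
        # appending the rows in sorted order reorders the table.
        table.appendChild(row)


def first_column(table):
    return [cell_text(row, 0) for row in table.getElementsByTagName("TR")]


table = parseString(TABLE).documentElement

sort_table(table, 0)            # alphabetically by name
print(first_column(table))

sort_table(table, 1, key=int)   # numerically by the second column
print(first_column(table))
```

In a browser the same calls would be made from a script or applet in response to a click on a column header, and the display would follow the changed tree.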

This can be made significantly more advanced. One could imagine that VRML (a language for coding 3D worlds) was redefined in XML and VRML viewers were written as Java applets using DOM. (If you think this is science fiction take a look at reference 8.) This would mean that VRML could be used together with HTML with no need for extra software on the client side. (Well, there would be the applets, but they both install and remove themselves.)

Jon Bosak describes an even more advanced possibility in reference 4. The major vendors of electronic components (so-called chips) have joined forces to make a DTD that can be used to describe components. Together with the right Java applets this could be used to download any descriptions of chips and then model how these work together.

Searching and agents

The applications described here are currently not feasible, but I hope that in time they may be.

That the information in XML documents is so precisely described by the markup means that one can search them in much better ways than the primitive text searches currently available from search engines like Excite and Altavista. There are already SGML query languages that are similar to SQL in power, and this field is still under research. (See references 9 and 10.)

With standardized DTDs for different applications one could retrieve information much more accurately than today. One could envision things like a central search engine for chip vendors where you could do very precise searches for components by specification, almost as if they were in an ordinary relational database. Similar services would be possible for all documents with a common DTD.
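A search of this kind might look something like the Python sketch below. The catalog markup is entirely made up (I don't know what the chip vendors' DTD looks like), and a real service would use a proper query language rather than a hand-written loop, but it shows what a text search cannot do: compare a number inside a named field.

```python
import xml.etree.ElementTree as ET

# Hypothetical component catalog; all names and values are invented.
CATALOG = """
<CATALOG>
  <CHIP VENDOR="Acme">
    <NAME>AX-1</NAME><TYPE>amplifier</TYPE><VOLTAGE>5.0</VOLTAGE>
  </CHIP>
  <CHIP VENDOR="Acme">
    <NAME>AX-2</NAME><TYPE>timer</TYPE><VOLTAGE>3.3</VOLTAGE>
  </CHIP>
  <CHIP VENDOR="Bolt">
    <NAME>B-100</NAME><TYPE>amplifier</TYPE><VOLTAGE>3.3</VOLTAGE>
  </CHIP>
</CATALOG>
"""


def find_chips(catalog_xml, chip_type, max_voltage):
    """Return (vendor, name) for chips of the given type whose supply
    voltage is at most max_voltage."""
    matches = []
    for chip in ET.fromstring(catalog_xml).iter("CHIP"):
        right_type = chip.findtext("TYPE") == chip_type
        low_enough = float(chip.findtext("VOLTAGE")) <= max_voltage
        if right_type and low_enough:
            matches.append((chip.get("VENDOR"), chip.findtext("NAME")))
    return matches


print(find_chips(CATALOG, "amplifier", 3.3))
```

A plain text search for "amplifier 3.3" would also hit a page that merely mentions both words, and would miss a 3.0 volt amplifier altogether.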

Exploiting this for global search engines like Excite and Altavista is going to be a lot more difficult because of the number of different DTDs. With an overview of the most important ones and a little artificial intelligence in the search engines this could perhaps be handled, but for now this is pure science fiction.

Jon Bosak writes about using this sort of technique with intelligent agents, which are personal robots that search the web (and possibly other services as well) for information for you based on your preferences. This might be easier, as you could list the DTDs and your preferences privately, but it's still science fiction.

Exchanging information between different systems

Because a DTD gives a standard format for information related to a specific subject it can be used to simplify the exchange of information between different sources. Many kinds of applications have or will have standard DTDs. I've already mentioned chip manufacturers, and many other industries already have standard DTDs and more will follow. This means that systems can use these common DTDs to exchange information with each other, regardless of their internal format. The main applications of this will probably be the exchange of data between companies in the same industry or researchers within an academic field, although many other applications for ordinary users are imaginable.

There is already an XML DTD for chemists, called CML. (See reference 11.) CML will be very useful for exchanging research results and other data between chemists and companies working with chemistry in any way. It can also be used with Java applets in education. The list of possibilities just goes on and on.

Appendices

Glossary

DTD
In both XML and SGML this is a definition of a markup language. The best-known example is HTML, which is defined by a DTD describing the structure of HTML documents.
Element
SGML/XML elements and tags are often confused. An SGML element is the area from and including the start tag (<TAG>) up to and including the corresponding end tag (</TAG>). In HTML the HTML element is the whole document (apart from the !DOCTYPE declaration). Another example is links, where the element is the start and end tags as well as the underlined text between them.

Additional information

  1. The official XML-FAQ, by Peter Flynn.
  2. The XML 1.0 specification.
  3. XML-LINK. The official draft of the XML linking standard.
  4. XSL, the current proposal.
  5. Links to more information about SGML and DSSSL.
  6. The XML files, a series of articles in Web Review.
  7. The W3C XML pages.
  8. Free XML software, an exhaustive list.
  9. James Tauber's XML links. Well structured, easy to find what you need.

References

  1. How to write HTML, by Tim Berners-Lee.
  2. An early CSS proposal, by Håkon Wium Lie. Note the date!
  3. HTML 3.2, the BODY content model, from the W3C.
  4. XML, Java and the Future of the Web, article by Jon Bosak.
  5. Microsoft's page about their XML support.
  6. An overview document by the developers of XML support in Mozilla.
  7. The Document Object Model.
  8. Frag Island, a Quake clone as a Java applet.
  9. SgmlQL, a query language for SGML.
  10. The home page of Arijit Sengupta, lots of information about research on SGML query languages.
  11. CML, Chemical Markup Language.
  12. Jade, James' DSSSL Engine.

Last update 12.Oct.99 22:56, by Lars M. Garshol.