Introduction to XML

What is eXtensible Markup Language (XML)?

XML is similar to HTML in that it is a markup language that organizes information. Both HTML and XML are World Wide Web Consortium (W3C) standards. The difference is that while HTML elements determine how the browser displays the information, XML elements describe the information itself.

For example, compare the simple HTML and XML documents in the following table. The HTML row shows a simple HTML document that describes a restaurant menu. Notice that the elements both structure the HTML file and control how it is displayed in the browser. The <h1> and <h4> elements do not identify the contents of the elements (the menu item and price); they simply tell the browser how to format that content.

Conversely, in the XML example, each element identifies what is contained within that element. By default, however, a browser displays raw XML as a hierarchy of elements rather than as a formatted page.

HTML:

    <!doctype html>
    <html>
      <head>
        <title>Menu</title>
      </head>
      <body>
        <h1>Cheeseburger</h1>
        <h4>$3.99</h4>
        <h1>Hamburger</h1>
        <h4>$3.49</h4>
      </body>
    </html>

XML:

    <?xml version="1.0" encoding="utf-8"?>
    <menu>
      <item>
        <name>Cheeseburger</name>
        <price>3.99</price>
      </item>
      <item>
        <name>Hamburger</name>
        <price>3.49</price>
      </item>
    </menu>

One of the main differences between XML and HTML is that with HTML, you generally use predefined elements. XML allows you to create elements to describe the information you are creating.

Also, note the <?xml version="1.0" encoding="utf-8"?> declaration in the XML example. XML files should include this declaration as the first line. It identifies the version of XML and the character encoding used for the file.
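Because the XML element names describe the data, a program can ask for items by name. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is part of the original lesson); the read_menu function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The menu document from the XML row of the table above.
MENU_XML = """<?xml version="1.0" encoding="utf-8"?>
<menu>
  <item>
    <name>Cheeseburger</name>
    <price>3.99</price>
  </item>
  <item>
    <name>Hamburger</name>
    <price>3.49</price>
  </item>
</menu>"""


def read_menu(xml_text):
    """Return a list of (name, price) pairs from a menu document."""
    # Encode to bytes so the parser honors the encoding in the declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        (item.findtext("name"), float(item.findtext("price")))
        for item in root.findall("item")
    ]


for name, price in read_menu(MENU_XML):
    print(f"{name}: ${price:.2f}")
```

Doing the same with the HTML version would mean guessing that every <h1> is an item name and every <h4> is a price, because the HTML element names say nothing about the content.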

Why use XML?

XML provides several benefits:

  • XML is text-based – You can edit the same file with any number of tools (including Notepad). If you try to open a file like a Word document or a FrameMaker file in a text editor, you’ll usually see a lot of strange characters or garbled text. An XML file is just plain text and can be opened in almost any program that can read text.
  • XML is an open standard – You aren’t locked into a particular product or version to author documents. Vendors that provide XML editors follow the W3C XML standard. Many products, like unstructured FrameMaker or PowerPoint, store files in a proprietary format that cannot be opened easily in other tools. An XML file can be opened and edited in any XML editing tool. You could be using one XML editor while your coworkers use different XML editing tools to work on the same set of XML files.
  • XML is extensible – You can define your own set of elements to describe your data or use a set that has been predefined for your content type. For example, DITA (Darwin Information Typing Architecture) and DocBook are both XML standards with a predefined set of elements that you can use to author your documentation. Alternatively, you can create your own set of elements and structures.

XML’s role in structured authoring

XML and XML tools enforce the rules of structured authoring. Because elements in XML describe the structure of the information, XML is often referred to in the same breath as structured authoring. Structured authoring:

  • Provides an authoring methodology in which you author and organize information based on the type of information.
  • Identifies different pieces of information based on what that information contains.
  • Separates content from format or appearance of a document.
  • Enforces a set of rules when authoring content.
  • Breaks information into topics and smaller components instead of long narratives.

The benefits of structured authoring include:

  • Consistent content and organization
  • Ability to programmatically convert content into multiple output formats or deliverables
  • Potential for improved writer productivity
  • Potential for improved quality of the information

You can use XML and XML tools to enforce the rules of structured authoring. Because elements in XML describe the structure of the information, you can enforce rules on what elements can be included and the order in which they are included within the documentation that you create. For example, you can set up rules for the XML documents to ensure:

  • Tasks must include at least one step
  • All figures must have a title
  • All names must include a last name and a first name

This structured approach allows you to ensure that documentation is consistent even when written by multiple authors, enables you to transform the documents into a variety of outputs, frees writers from having to worry about style choices, and improves the quality of information.
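To make the idea of rules concrete, the following sketch checks the three example rules by hand. It assumes Python's standard xml.etree.ElementTree module and hypothetical element names (<task>, <step>, <figure>, <title>, <name>, <lastname>, <firstname>). XML tools normally enforce such rules with a schema or DTD rather than with hand-written code, so treat this only as an illustration of what is being checked.

```python
import xml.etree.ElementTree as ET


def check_rules(xml_text):
    """Return a list of rule violations found in the document."""
    root = ET.fromstring(xml_text)
    problems = []
    # root.iter(tag) visits the root and every descendant with that tag.
    for task in root.iter("task"):
        if task.find("step") is None:
            problems.append("task without a step")
    for figure in root.iter("figure"):
        if figure.find("title") is None:
            problems.append("figure without a title")
    for name in root.iter("name"):
        if name.find("lastname") is None or name.find("firstname") is None:
            problems.append("name missing lastname or firstname")
    return problems


# Hypothetical sample documents, one that follows the rules and one that does not.
GOOD = ("<topic><task><step>Press Enter</step></task>"
        "<figure><title>Menu</title></figure></topic>")
BAD = "<topic><task/><figure/><name><firstname>Tom</firstname></name></topic>"

print(check_rules(GOOD))  # prints []
print(check_rules(BAD))   # prints the three violations
```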

XML does not include any formatting information within it. The documents solely focus on the content.

How does XML get formatted?

So, you see raw XML in the browser, but how does it get formatted into something that readers can easily read?

XML has an associated standard called XSLT (Extensible Stylesheet Language Transformations). An XSLT stylesheet, applied by an XSLT processor, transforms the raw XML into an output for the reader to consume. That output can include HTML, HTML Help, Eclipse Help, PDF, or even other XML files.

The following diagram illustrates this process. A future lesson covers XSLT in more detail.
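The transformation idea can be sketched without XSLT itself. Python's standard library does not include an XSLT processor, so the example below uses xml.etree.ElementTree as a stand-in: it reads the menu XML and builds HTML from it, which is the same kind of mapping a stylesheet would describe. The menu_to_html function and the choice of <h1> and <h4> for the output are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MENU_XML = """<menu>
  <item><name>Cheeseburger</name><price>3.99</price></item>
  <item><name>Hamburger</name><price>3.49</price></item>
</menu>"""


def menu_to_html(xml_text):
    """Build an HTML string from a menu document (not real XSLT)."""
    menu = ET.fromstring(xml_text)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for item in menu.findall("item"):
        # Content elements are mapped to formatting elements here,
        # so the XML source itself stays free of formatting.
        ET.SubElement(body, "h1").text = item.findtext("name")
        ET.SubElement(body, "h4").text = "$" + item.findtext("price")
    return ET.tostring(html, encoding="unicode")


print(menu_to_html(MENU_XML))
```

The same XML source could be passed through a different mapping to produce another output format, which is the main reason to keep formatting out of the source.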

Anatomy of an XML Element

An XML file contains a series of building block elements that contain the content of your document.

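Elements have a starting tag and an ending tag around some set of content. For example, in <firstname>Tom</firstname>, <firstname> is the starting tag, Tom is the content, and </firstname> is the ending tag.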

Note the following:

  • The starting tag begins with a less-than sign (<), includes a name, and ends with a greater-than sign (>). The name cannot contain any spaces.

  • The ending tag begins with a less-than sign followed by a forward slash (</), includes the name, and ends with a greater-than sign (>). The name must exactly match the starting tag name, including case.

Naming elements

Element names must conform to certain rules. They:

  • Can contain any alphanumeric character.
  • Can contain hyphens, periods, or underscores.
  • Must begin with a letter or an underscore. They cannot begin with a number, a hyphen, or a period.
  • Cannot contain spaces.

Samples of valid names include <firstname>, <item1>, <memo_document>, and <_dog>.

Samples of invalid names include <1item>, <first name>, and <-dog>.
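The naming rules can be approximated with a regular expression. This is a simplified, ASCII-only sketch in Python; the XML specification also permits many non-ASCII letters and allows a name to start with an underscore as well as a letter. The is_valid_name function is a hypothetical helper.

```python
import re

# First character: a letter or underscore.
# Remaining characters: letters, digits, periods, underscores, or hyphens.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_name(name):
    """Return True if name satisfies the simplified naming rules."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["firstname", "item1", "memo_document", "1item", "first name"]:
    print(candidate, is_valid_name(candidate))
```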

Defining the content of elements

The contents of an element can be:

  • Text:

    <firstname>Tom</firstname>

  • Other elements:

    <name><lastname>Smith</lastname><firstname>Tom</firstname></name>

    When an element contains additional elements, those elements are called nested elements or child elements. For example, <lastname> and <firstname> are nested within the <name> element. They are considered children of the <name> element.

  • A combination of elements and text:

    <step>Press <key>Enter</key> to continue</step>

    In this example, the <step> element contains both regular text, as well as the element <key>.

  • Nothing at all. In a few cases, an element is empty and contains no content:

    <img src="images/imagename.jpg"/>

    In XML, an empty element must end with /> or have a closing tag.
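The following sketch shows how these kinds of content look to a program, assuming Python's standard xml.etree.ElementTree module. In that library, text before the first child is stored in the parent's text attribute and text after a child is stored in that child's tail attribute.

```python
import xml.etree.ElementTree as ET

# Other elements: <name> has two children.
name = ET.fromstring(
    "<name><lastname>Smith</lastname><firstname>Tom</firstname></name>"
)
child_tags = [child.tag for child in name]
print(child_tags)  # prints ['lastname', 'firstname']

# A combination of elements and text.
step = ET.fromstring("<step>Press <key>Enter</key> to continue</step>")
key = step.find("key")
print(repr(step.text))  # prints 'Press '
print(repr(key.text))   # prints 'Enter'
print(repr(key.tail))   # prints ' to continue'

# An empty element: no text and no children, only an attribute.
img = ET.fromstring('<img src="images/imagename.jpg"/>')
print(img.text, len(img), img.get("src"))
```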

Well-formed documents

A well-formed XML document is a document that conforms to the basic syntax rules defined in the W3C XML standard.

The basic rules include:

  • A single root element must contain all of the other elements within the XML document. For example, consider the following restaurant menu defined in XML:
    <?xml version="1.0" encoding="utf-8"?>
    <menu>
    <item>
    <name>Cheeseburger</name>
    <price>3.99</price>
    </item>
    <item>
    <name>Hamburger</name>
    <price>3.49</price>
    </item>
    </menu>

    In this example, all of the other elements are included within the <menu> element. The <menu> element is the root element for that document.

  • Each starting tag must have a corresponding ending tag. If you are used to working with HTML, you know that some elements, like <br> or <hr>, do not have an ending tag. With XML, every element must be closed.

    If you define an element that does not include content, you can close the starting tag by inserting / before the ending >. For example, <img src="file.jpg"/> doesn’t contain any content and uses the /> syntax to close the element.

  • You must close the elements in the right sequence. Inner elements must be closed before outer elements.

    For example, in the menu, you must close <name> and <price> before closing <item>.

    <item><name>hamburger</item></name> would be incorrect. Because <name> is nested within <item>, you must close <name> before <item>.

  • Case matters in element names. <name> is not the same as <Name>. Any XML processor will consider these to be completely different elements. Always ensure that the case of each element name is consistent, especially between a starting tag and its ending tag.

You can check whether your XML is well-formed by using the XML validation service provided by W3Schools at http://www.w3schools.com/xml/xml_validator.asp.
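You can also check well-formedness locally. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, whose parser raises ParseError when a document breaks the rules above. The is_well_formed function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = [
    "<menu><item><name>Hamburger</name></item></menu>",  # well-formed
    "<item><name>hamburger</item></name>",               # closed in wrong order
    "<name>Tom</Name>",                                  # case mismatch
    "<a/><b/>",                                          # two root elements
    "<p>line<br></p>",                                   # <br> never closed
    "<p>line<br/></p>",                                  # empty element closed
]
for sample in samples:
    print(is_well_formed(sample), sample)
```

This only tests well-formedness. It does not check a document against a set of authoring rules such as a schema.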

Creating a well-formed document

To create a well-formed document, consider a contact list. The following contact list contains phone numbers for a set of people and an optional note.

Jones, Fred
    home: (512) 555-3301
    work: (512) 555-2212
Reynolds, Biff
    home: (512) 555-2222
    Birthday: July 31st
Smith, Bill
    home: (512) 555-2323
    cell: (512) 555-5111
    Contractor

Think about the elements that you might define to contain this contact list:

  • A contact list root-level element to contain the list of contacts.
  • A contact element to contain all of the information for a single person.
  • A name element to contain the name of the person. You could further break this down into a first name and a last name.
  • Different types of phone number elements to include the different phone numbers associated with a contact.
  • A note element to include any notes associated with the contact.

The following example shows these components:

Contact list
   Contact
      Name
         Last Name
         First Name
      Phone number (different types)
      Note

After you determine how you want to structure the well-formed document, you can create the document. Note that it has the root-level element <contact_list> and all other elements are contained within that root-level element.

When naming elements and structuring the file, ensure that you conform to all of the rules for a well-formed document.

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="contact.xsl"?>
<contact_list>
   <contact>
      <name>
         <lastname>Smith</lastname>
         <firstname>Bill</firstname>
      </name>
      <phonenumber_home>(512) 555-2323</phonenumber_home>
      <phonenumber_cell>(512) 555-5111</phonenumber_cell>
      <note>Contractor</note>
   </contact>
   <contact>
      <name>
         <lastname>Jones</lastname>
         <firstname>Fred</firstname>
      </name>
      <phonenumber_home>(512) 555-3301</phonenumber_home>
      <phonenumber_work>(512) 555-2212</phonenumber_work>
      <note/>
   </contact>
   …
</contact_list>

Note that:

  • The element names reflect their content.
  • No formatting is defined within the elements.
  • Elements are defined and used consistently.

Also note that you nest all of the information for a particular person within a single contact element. By nesting this way, you’ll be able to do such things as sort the list of contacts alphabetically later when performing XSLT processing.
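The benefit of nesting can be sketched in a few lines. The example below sorts contacts by last name using Python's standard xml.etree.ElementTree module as a stand-in for the XSLT processing mentioned above. It uses a shortened copy of the contact list, and the sorted_contacts function name is hypothetical.

```python
import xml.etree.ElementTree as ET

CONTACTS_XML = """<contact_list>
  <contact>
    <name><lastname>Smith</lastname><firstname>Bill</firstname></name>
    <phonenumber_home>(512) 555-2323</phonenumber_home>
  </contact>
  <contact>
    <name><lastname>Jones</lastname><firstname>Fred</firstname></name>
    <phonenumber_home>(512) 555-3301</phonenumber_home>
  </contact>
</contact_list>"""


def sorted_contacts(xml_text):
    """Return (lastname, firstname, home phone) tuples sorted by name."""
    root = ET.fromstring(xml_text)
    contacts = sorted(
        root.findall("contact"),
        key=lambda c: (c.findtext("name/lastname"), c.findtext("name/firstname")),
    )
    # Each <contact> moves as a unit, so a phone number stays with its owner.
    return [
        (
            c.findtext("name/lastname"),
            c.findtext("name/firstname"),
            c.findtext("phonenumber_home"),
        )
        for c in contacts
    ]


for last, first, home in sorted_contacts(CONTACTS_XML):
    print(f"{last}, {first}: {home}")
```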

If you were to include this information within HTML, you would not have the elements that describe the content. Instead you might have something like the following:

<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <title>Contact List</title>
   </head>
   <body>
      <p><b>Jones, Fred</b></p>
      <ul>
         <li><p><b>home: </b>(512) 555-3301</p></li>
         <li><p><b>work: </b>(512) 555-2212</p></li>
      </ul>
      <p><b>Reynolds, Biff</b></p>
      <ul>
         <li><p><b>home: </b>(512) 555-2222</p></li>
      </ul>
      <p><b>Smith, Bill</b></p>
      <ul>
         <li><p><b>home: </b>(512) 555-2323</p></li>
         <li><p><b>cell: </b>(512) 555-5111</p></li>
      </ul>
   </body>
</html>

In this example, note the following:

  • The element names do not reflect the content.
  • Element names are generic.
  • Formatting elements like the <b> elements specify appearance.

Summary

This topic provided a high-level overview of XML and structured authoring. It also broke down the basic structure of an element and provided rules for well-formed documents.