What is eXtensible Markup Language (XML)?
XML is similar to HTML in that it is a markup language that organizes information. Both HTML and XML are World Wide Web Consortium (W3C) standards. The difference is that HTML elements determine how the browser displays the information, whereas XML elements describe the information itself.
For example, review the simple HTML and XML pages in the following table. The HTML row shows a simple HTML document that describes a restaurant menu. Notice that the elements both structure the HTML file and control how it is displayed in the browser. The <h1> and <h4> elements do not identify their contents (the menu item and the price); they simply tell the browser how to format that content.
Conversely, in the XML example, each element identifies what it contains. However, when you view raw XML in a browser, the browser shows the hierarchy of the XML file by default.
HTML
<!doctype html>
<html>
  <head>
    <title>Menu</title>
  </head>
  <body>
    <h1>Cheeseburger</h1>
    <h4>$3.99</h4>
    <h1>Hamburger</h1>
    <h4>$3.49</h4>
  </body>
</html>

XML
<?xml version="1.0" encoding="utf-8"?>
<menu>
  <item>
    <name>Cheeseburger</name>
    <price>3.99</price>
  </item>
  <item>
    <name>Hamburger</name>
    <price>3.49</price>
  </item>
</menu>
One of the main differences between XML and HTML is that with HTML, you generally use predefined elements. XML allows you to create elements to describe the information you are creating.
Also, note the <?xml version="1.0" encoding="UTF-8"?> statement in the XML example. All XML files should include this line, called the XML declaration, as the first line. It identifies the version of XML and the character encoding used for the file.
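Because each XML element names what it contains, a program can pull the menu data out by element name. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the parser (this lesson does not prescribe a tool). The names MENU_XML and read_menu are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The menu document from the table above, as bytes so the parser
# honors the encoding named in the XML declaration.
MENU_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<menu>
  <item>
    <name>Cheeseburger</name>
    <price>3.99</price>
  </item>
  <item>
    <name>Hamburger</name>
    <price>3.49</price>
  </item>
</menu>"""


def read_menu(xml_bytes):
    """Return a list of (name, price) pairs, one per <item> element."""
    root = ET.fromstring(xml_bytes)
    return [
        (item.findtext("name"), float(item.findtext("price")))
        for item in root.findall("item")
    ]


if __name__ == "__main__":
    for name, price in read_menu(MENU_XML):
        print(f"{name}: {price:.2f}")
```

Doing the same with the HTML version would mean guessing that every <h1> is an item name and every <h4> is a price, because those elements say nothing about their contents.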
Why use XML?
XML provides several benefits:
- XML is text-based – You can edit the same file with any number of tools (including Notepad). If you try to open a file like a Word document or a FrameMaker file in a text editor, you’ll usually see a lot of strange characters or garbled text. An XML file is just plain text and can be opened in almost any program that can read text.
- XML is an open standard – You aren’t locked into a particular product or version to author documents. Vendors that provide XML editors follow the W3C XML standard. Many products, like unstructured FrameMaker or PowerPoint, store files in a proprietary format that cannot be opened easily in other tools. An XML file can be opened and edited in any XML editing tool. You could be using one XML editor while your coworkers use different XML editing tools to work on the same set of XML files.
- XML is extensible – You can define your own set of elements to describe your data or use a set that has been predefined for your content type. For example, DITA (Darwin Information Typing Architecture) and DocBook are both XML standards with a predefined set of elements that you can use to author your documentation. Alternatively, you can create your own set of elements and structures.
XML’s role in structured authoring
Because elements in XML describe the structure of the information, XML is often referred to in the same breath as structured authoring. Structured authoring:
- Provides an authoring methodology in which you author and organize information based on the type of information.
- Identifies different pieces of information based on what that information contains.
- Separates content from format or appearance of a document.
- Enforces a set of rules when authoring content.
- Breaks information into topics and smaller components instead of long narratives.
The benefits of structured authoring include:
- Consistent content and organization
- Ability to programmatically convert content into multiple output formats or deliverables
- Potential improvement in writer productivity
- Potential improvement in the quality of the information
You can use XML and XML tools to enforce the rules of structured authoring. Because elements in XML describe the structure of the information, you can enforce rules on what elements can be included and the order in which they are included within the documentation that you create. For example, you can set up rules for the XML documents to ensure:
- Tasks must include at least one step
- All figures must have a title
- All names must include a last name and a first name
This structured approach allows you to ensure that documentation is consistent even when written by multiple authors, enables you to transform the documents into a variety of outputs, frees writers from having to worry about style choices, and improves the quality of information.
XML does not include any formatting information within it. The documents solely focus on the content.
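To make the three example rules above concrete, here is a rough sketch that checks them by hand with Python's standard xml.etree.ElementTree module. In practice, an XML tool would normally enforce such rules from a DTD or schema; this hand-written check only illustrates the idea. The element names <doc>, <task>, <step>, <figure>, and <title>, and the function name check_rules, are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET


def check_rules(xml_text):
    """Return a list of rule violations found in the document."""
    root = ET.fromstring(xml_text)
    problems = []
    # Rule 1: tasks must include at least one step.
    for task in root.iter("task"):
        if task.find("step") is None:
            problems.append("task has no step")
    # Rule 2: all figures must have a title.
    for figure in root.iter("figure"):
        if figure.find("title") is None:
            problems.append("figure has no title")
    # Rule 3: all names must include a last name and a first name.
    for name in root.iter("name"):
        if name.find("lastname") is None or name.find("firstname") is None:
            problems.append("name is missing lastname or firstname")
    return problems


# Hypothetical sample documents, one that follows the rules and one that does not.
GOOD = """<doc>
  <task><step>Press Enter.</step></task>
  <figure><title>Login screen</title></figure>
  <name><lastname>Smith</lastname><firstname>Tom</firstname></name>
</doc>"""

BAD = """<doc>
  <task></task>
  <figure></figure>
  <name><firstname>Tom</firstname></name>
</doc>"""

if __name__ == "__main__":
    print(check_rules(GOOD))
    print(check_rules(BAD))
```

The checks are only possible because the elements describe their content: a rule such as "figures must have a title" has nothing to attach to in a document marked up purely for appearance.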
How does XML get formatted?
So, you see raw XML in the browser, but how does it get formatted into something that readers can easily read?
XML has an associated standard called XSLT (Extensible Stylesheet Language Transformations). An XSLT stylesheet, applied by an XSLT processor, transforms the raw XML into an output for the reader to consume. That output can include HTML, HTML Help, Eclipse Help, PDF, or even other XML files.
The following diagram illustrates this process. A future lesson covers XSLT in more detail.
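To illustrate only the general idea of turning raw XML into a reader-facing output, here is a rough sketch in Python using the standard xml.etree.ElementTree module. It is a stand-in for an XSLT stylesheet, not XSLT itself, and the function name menu_to_html is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# The restaurant menu from the first example, without formatting.
MENU_XML = """<menu>
  <item><name>Cheeseburger</name><price>3.99</price></item>
  <item><name>Hamburger</name><price>3.49</price></item>
</menu>"""


def menu_to_html(xml_text):
    """Build an HTML tree from the menu XML and return it as a string."""
    menu = ET.fromstring(xml_text)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for item in menu.findall("item"):
        # The formatting decisions (h1 for names, h4 for prices) live here,
        # in the transformation, not in the XML source.
        ET.SubElement(body, "h1").text = item.findtext("name")
        ET.SubElement(body, "h4").text = "$" + item.findtext("price")
    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    print(menu_to_html(MENU_XML))
```

The same XML source could be passed through a different transformation to produce a different output, which is the point of keeping formatting out of the content.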
Anatomy of an XML element
An XML file contains a series of building block elements that contain the content of your document.
Elements have a starting and ending tag around some set of content:
Note the following:
- The starting tag begins with a less-than sign (<), includes a name, and ends with a greater-than sign (>). The name cannot contain any spaces.
- The ending tag begins with a less-than sign followed by a forward slash (</), includes a name, and ends with a greater-than sign (>). The name must exactly match the starting tag name, including case.
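The tag-matching rules above can be tried out with any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed("<firstname>Tom</firstname>"))  # matching tags
    print(is_well_formed("<firstname>Tom</Firstname>"))  # case differs
    print(is_well_formed("<firstname>Tom"))              # no ending tag
```

Only the first call should be accepted; the parser rejects the other two because the ending tag either differs in case or is missing.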
Naming elements
Element names must conform to certain rules. They:
- Can contain any alphanumeric character.
- Can contain hyphens, periods, or underscores.
- Must begin with a letter or an underscore, not a number, hyphen, or period.
- Cannot contain spaces.
Samples of valid names include <firstname>, <item1>, <memo_document>, and <_dog>.
Samples of invalid names include <1item>, <first name>, and <-dog>.
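One way to test a candidate name is to let an XML parser decide. The sketch below assumes Python's standard xml.etree.ElementTree module and wraps the candidate in an empty element; the function name is_valid_element_name is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_valid_element_name(name):
    """Return True if the parser accepts <name/> as an element called name."""
    try:
        root = ET.fromstring(f"<{name}/>")
    except ET.ParseError:
        return False
    # Guard against input that parses as an element plus something else,
    # in which case the parsed tag would differ from the candidate name.
    return root.tag == name


if __name__ == "__main__":
    for candidate in ["firstname", "item1", "memo_document", "1item", "first name"]:
        print(candidate, is_valid_element_name(candidate))
```

This is a quick check rather than a full implementation of the naming rules. For example, it makes no attempt to handle names that contain a colon, which XML reserves for namespaces.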
Defining the content of elements
The contents of an element can be:
- Text:
<firstname>Tom</firstname>
- Other elements:
<name><lastname>Smith</lastname><firstname>Tom</firstname></name>
When an element contains additional elements, those elements are called nested elements or child elements. For example, <lastname> and <firstname> are nested within the <name> element. They are considered children of the <name> element.
- A combination of elements and text:
<step>Press <key>Enter</key> to continue</step>
In this example, the <step> element contains both regular text and the <key> element.
- Nothing at all (an empty element), in a few cases:
<img src="images/imagename.jpg"/>
In XML, an empty element must end with /> or have a closing tag.
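The four kinds of content above look like this once parsed. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name describe is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Text only.
firstname = ET.fromstring("<firstname>Tom</firstname>")

# Child elements only.
name = ET.fromstring(
    "<name><lastname>Smith</lastname><firstname>Tom</firstname></name>"
)

# Mixed content: text and a child element.
step = ET.fromstring("<step>Press <key>Enter</key> to continue</step>")

# Empty element: no content, only an attribute.
img = ET.fromstring('<img src="images/imagename.jpg"/>')


def describe(element):
    """Summarize an element's text, child tag names, and attributes."""
    return {
        "text": element.text,
        "children": [child.tag for child in element],
        "attributes": dict(element.attrib),
    }


if __name__ == "__main__":
    for element in (firstname, name, step, img):
        print(element.tag, describe(element))
    # ElementTree keeps the text that follows a child in the child's "tail".
    print(repr(step[0].tail))
    print("".join(step.itertext()))
```

For mixed content, ElementTree stores the text before the first child in the parent's text and the text after a child in that child's tail, so itertext() is the simplest way to read the whole sentence back.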
Creating a well-formed document
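To see how to create a well-formed document, consider a contact list. A well-formed document follows the XML syntax rules: it has a single root element, every starting tag has a matching ending tag, and elements are properly nested. The following contact list contains phone numbers for a set of people and an optional note.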
Jones, Fred
home: (512) 555-3301
work: (512) 555-2212
Reynolds, Biff
home: (512) 555-2222
Birthday: July 31st
Smith, Bill
home: (512) 555-2323
cell: (512) 555-5111
Contractor
Think about the elements that you might define to contain this contact list:
- A contact list root-level element to contain the list of contacts.
- A contact element to contain all of the information for a single person.
- A name element to contain the name of the person. You could further break this down into a first name and a last name.
- Different types of phone number elements to include the different phone numbers associated with a contact.
- A note element to include any notes associated with the contact.
The following example shows these components:
Contact list
  Contact
    Name
      Last Name
      First Name
    Phone number (different types)
    Note
After you determine how you want to structure the document, you can create it. Note that it has the root-level element <contact_list> and that all other elements are contained within that root-level element.
When naming elements and structuring the file, make sure that you conform to all the rules of a well-formed document.
<?xml-stylesheet type="text/xsl" href="contact.xsl"?>
<contact_list>
  <contact>
    <name>
      <lastname>Smith</lastname>
      <firstname>Bill</firstname>
    </name>
    <phonenumber_home>(512) 555-2323</phonenumber_home>
    <phonenumber_cell>(512) 555-5111</phonenumber_cell>
    <note>Contractor</note>
  </contact>
  <contact>
    <name>
      <lastname>Jones</lastname>
      <firstname>Fred</firstname>
    </name>
    <phonenumber_home>(512) 555-3301</phonenumber_home>
    <phonenumber_work>(512) 555-2212</phonenumber_work>
    <note/>
  </contact>
  …
</contact_list>
Note that:
- The element names reflect their content.
- No formatting is defined within the elements.
- Elements are defined and used consistently.
Also note that you nest all of the information for a particular person within a single contact element. By nesting this way, you’ll be able to do such things as sort the list of contacts alphabetically later when performing XSLT processing.
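Sorting with XSLT is left for a later lesson, but the benefit of nesting can be sketched now. The example below assumes Python's standard xml.etree.ElementTree module as a stand-in for XSLT processing and includes only the two contacts shown in full above. The names CONTACTS_XML and sorted_contacts are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two contacts written out in the example above, in the same order.
CONTACTS_XML = """<contact_list>
  <contact>
    <name><lastname>Smith</lastname><firstname>Bill</firstname></name>
    <phonenumber_home>(512) 555-2323</phonenumber_home>
    <phonenumber_cell>(512) 555-5111</phonenumber_cell>
    <note>Contractor</note>
  </contact>
  <contact>
    <name><lastname>Jones</lastname><firstname>Fred</firstname></name>
    <phonenumber_home>(512) 555-3301</phonenumber_home>
    <phonenumber_work>(512) 555-2212</phonenumber_work>
    <note/>
  </contact>
</contact_list>"""


def sorted_contacts(xml_text):
    """Return 'Last, First' strings sorted alphabetically by name."""
    root = ET.fromstring(xml_text)
    names = [
        (c.findtext("name/lastname"), c.findtext("name/firstname"))
        for c in root.findall("contact")
    ]
    return [f"{last}, {first}" for last, first in sorted(names)]


if __name__ == "__main__":
    for entry in sorted_contacts(CONTACTS_XML):
        print(entry)
```

Because each person's details sit inside one <contact> element, the sort only has to look at name/lastname and name/firstname within each contact; nothing depends on the order or formatting of the source file.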
If you were to include this information within HTML, you would not have the elements that describe the content. Instead you might have something like the following:
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Contact List</title>
  </head>
  <body>
    <p><b>Jones, Fred</b></p>
    <ul>
      <li><p><b>home: </b>(512) 555-3301</p></li>
      <li><p><b>work: </b>(512) 555-2212</p></li>
    </ul>
    <p><b>Reynolds, Biff</b></p>
    <ul>
      <li><p><b>home: </b>(512) 555-2222</p></li>
    </ul>
    <p><b>Smith, Bill</b></p>
    <ul>
      <li><p><b>home: </b>(512) 555-2323</p></li>
      <li><p><b>cell: </b>(512) 555-5111</p></li>
    </ul>
  </body>
</html>
In this example, note the following:
- The element names do not reflect the content.
- Element names are generic.
- Formatting elements like the <b> elements specify appearance.