Introduction to DTDs
This week dives deeper into elements and introduces attributes. It also provides more information about what makes up an XML file. First up is a short introduction to DTDs.
Introducing DTDs
The Introduction to XML lesson mentioned that XML can enforce rules to determine such things as the order in which elements occur or whether an image must have a caption. The document that defines the structures and rules to which an XML file must conform is called a DTD (Document Type Definition).
There are really two levels of XML document compliance:
- An XML document is well-formed if it follows the basic rules of XML.
- An XML document is valid if it follows the structures and rules defined in the DTD.
Each element within a document instance must have a corresponding entry in the DTD. DTDs contain the definitions for both elements and attributes. The following sections describe how an XML instance file references a DTD file and how to construct a DTD.
Anatomy of an XML file
The following example shows the XML file for the contact list that was described in the Introduction to XML lesson. This XML file is also called an instance file, because it's an instance of an XML document that uses a particular DTD:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE contact_list SYSTEM "contact.dtd">
<?xml-stylesheet type="text/xsl" href="contact.xsl"?>
<contact_list>
<!-- This is a comment in the code. -->
<contact>
<name>
<lastname>Smith</lastname>
<firstname>Bill</firstname>
</name>
<phonenumber type="home">(512) 555-2323</phonenumber>
<phonenumber type="cell">(512) 555-5111</phonenumber>
<note>Contractor at Jones &amp; Sons</note>
</contact>
<contact>
<name>
<lastname>Jones</lastname>
<firstname>Fred</firstname>
</name>
<phonenumber type="home">(512) 555-3301</phonenumber>
<phonenumber type="work">(512) 555-2212</phonenumber>
<note/>
</contact>
<contact>
<name>
<lastname>Reynolds</lastname>
<firstname>Biff</firstname>
</name>
<phonenumber type="home">(512) 555-2222</phonenumber>
<note>Birthday: July 4</note>
</contact>
</contact_list>
A complete XML instance document includes the following components:
- DOCTYPE declaration (optional):
<!DOCTYPE contact_list SYSTEM "contact.dtd">
The DOCTYPE declaration identifies the DTD that you are using to validate the file. This example shows an external DTD. The DOCTYPE name must match the root element. You can also define the DTD within the XML instance file.
- Stylesheet declaration (optional):
<?xml-stylesheet type="text/xsl" href="contact.xsl"?>
The stylesheet declaration indicates the stylesheet that you want to use to transform the content of the file. Week 4 will cover stylesheets in detail.
- Root-level element:
<contact_list> </contact_list>
The root-level element contains the content of the document instance. You can have only one root element.
Examples of DTDs
DTDs can either be external (outside of your XML instance file) or internal (defined within your XML file). Typically, for DTDs that you want to reuse, you will place the DTD outside of your XML file.
Example of an internal DTD
The following sample shows an internal DTD. The elements are defined within the same file as the contents.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE memo [
<!ELEMENT memo (to,from,subject,body)>
<!ATTLIST memo type CDATA #REQUIRED>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT body (para)+>
<!ELEMENT para (#PCDATA)>
]>
<memo type="meeting">
<to>John Doe</to>
<from>Jane Wilson</from>
<subject>Hello</subject>
<body>
<para>Meeting today at 3!</para>
</body>
</memo>
Note the following guidelines:
- The internal DTD is enclosed within the <!DOCTYPE> statement. The <!DOCTYPE> statement ends with a ]>.
- You must have an <!ELEMENT> statement for each element that you use to structure the content in your XML instance file.
- An <!ELEMENT> statement defines the structure for the named element.
- An <!ELEMENT> statement for the root element (<memo>) defines the overall structure of the document. In this case, a memo consists of four parts: the <to> section, the <from> section, the <subject> section, and the <body> section.
- Child elements are also defined within the DTD. The <para> element is a child of the <body> element.
For any element that has an attribute, you must create an <!ATTLIST> statement to define the attributes for that element. Unlike the elements that define the content of your XML instance, note that the <!ELEMENT> and <!ATTLIST> statements that define the DTD do not have end tags.
Example of an external DTD
An external DTD consists of <!ELEMENT> and <!ATTLIST> statements just like an internal DTD. The only difference is that it is contained in its own standalone file. You do not include the <!DOCTYPE> statement within the DTD file. The DTD file should have the extension .dtd.
<!ELEMENT contact_list (contact)+>
<!ELEMENT contact (name, phonenumber*, note?)>
<!ELEMENT name (lastname, firstname) >
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT phonenumber (#PCDATA)>
<!ATTLIST phonenumber type CDATA #IMPLIED>
<!ELEMENT note (#PCDATA)>
To reference the DTD, you include a <!DOCTYPE> statement in your XML instance file that points to the DTD file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE contact_list SYSTEM "contact.dtd">
<contact_list>
...
</contact_list>
Defining an <!ELEMENT> statement
The syntax for an <!ELEMENT> statement for a DTD is as follows:
<!ELEMENT elementName (elementContents)>
Where:
- elementName indicates the name of the element.
- elementContents indicates the contents of the element.
The name of the element in the <!ELEMENT> statement must match the name of the element as it appears in the XML instance exactly (case matters).
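As a small illustration of the case-sensitivity rule, the following sketch reuses the name, lastname, and firstname declarations from the contact DTD. Wrapping them in a standalone document with an internal DTD is only an assumption made to keep the example self-contained.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE name [
<!ELEMENT name (lastname, firstname)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
]>
<!-- Valid: each element name matches its declaration exactly. -->
<name>
<lastname>Smith</lastname>
<firstname>Bill</firstname>
</name>
<!-- Writing <Lastname>Smith</Lastname> instead would still be well-formed,
     but it would be invalid: no element named Lastname is declared. -->
```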
Defining the contents of the element
How you define the contents of an element depends on what that element contains in the XML instance.
- For example, the root-level element of our contact list specifies that the <contact_list> element includes a series of contacts:
<!ELEMENT contact_list (contact)+>
- The content of the <contact> element includes the information for that contact:
<!ELEMENT contact (name, phonenumber*, note?)>
- If an element contains only text, you can define it as #PCDATA:
<!ELEMENT lastname (#PCDATA)>
Note that some of the elements in this example have a +, *, or ? symbol after them. Those symbols indicate how many times an element can occur.
(contact)+
indicates that the <contact_list> contains multiple contacts, but you must have at least one. You wouldn't want an empty contact list.
phonenumber*
indicates that you can have any number of phone numbers or no phone numbers at all.
note?
indicates that you can either have 0 or 1 note.
The following table defines all of the qualifiers:
Symbol |
Meaning |
+ |
1 to many (a contact list can contain many contacts, but it must contain at least 1) |
* |
0 to many (each contact can have any number of phone numbers) |
? |
0 or 1 (a contact can contain a note or not – it is optional) |
| (pipe) |
The element can contain a choice of content |
no qualifier |
The element will occur only once (the name has a single first and a single last name) |
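To see the qualifiers at work, consider a minimal instance checked against the contact.dtd declarations shown earlier. The contact data here is invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE contact_list SYSTEM "contact.dtd">
<contact_list>
<!-- (contact)+ : at least one contact is required. -->
<contact>
<!-- name has no qualifier: exactly one is required. -->
<name>
<lastname>Doe</lastname>
<firstname>Pat</firstname>
</name>
<!-- phonenumber* : having no phone numbers is allowed. -->
<!-- note? : the note can be left out. -->
</contact>
</contact_list>
```

By contrast, an empty <contact_list></contact_list> would fail validation, because the + qualifier requires at least one <contact>.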
The following table shows more examples of <!ELEMENT> statements.
Content |
Structure |
a single element |
<!ELEMENT body (para)+>
|
a sequence of elements |
<!ELEMENT memo (to,from,subject,body)>
<!ELEMENT contact (name, phonenumber*, note?)>
|
a choice between multiple elements |
<!ELEMENT computer (laptop | desktop | tablet)>
|
text |
<!ELEMENT para (#PCDATA)>
|
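The choice row can be tried out with a short, self-contained sketch. The declarations for laptop, desktop, and tablet are assumptions added for this example; the table gives only the computer declaration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE computer [
<!ELEMENT computer (laptop | desktop | tablet)>
<!-- The three child declarations below are assumed for this sketch. -->
<!ELEMENT laptop (#PCDATA)>
<!ELEMENT desktop (#PCDATA)>
<!ELEMENT tablet (#PCDATA)>
]>
<!-- Valid: exactly one of the three choices appears. -->
<computer>
<laptop>13-inch model</laptop>
</computer>
```

Because the choice has no qualifier after the closing parenthesis, a <computer> that contains both a <laptop> and a <tablet>, or no child at all, would be invalid.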
Defining an <!ATTLIST> statement
Each element that includes an attribute in your DTD must have a corresponding <!ATTLIST> statement that defines the content of that attribute.
The syntax for an <!ATTLIST> statement is as follows:
<!ATTLIST elementName attributeName dataType defaultDeclaration >
Where:
- elementName indicates the name of the element to which this attribute list applies. The name of the element must exactly match what is in your XML instance file.
- attributeName indicates the name of the attribute. The name of the attribute must match exactly the name of the attribute in the XML instance file.
- dataType indicates the type of information that the attribute will hold. Some types include:
  - CDATA for character data
  - ID to indicate a unique value that identifies the element
  - IDREF to refer to the ID of a different element (for example, within a link)
  - a list of choices, such as (yes | no) or (laptop|desktop|tablet|phone)
- defaultDeclaration indicates whether an attribute must be present or whether a default value is available:
  - #REQUIRED to indicate that the attribute is required
  - #FIXED, followed by a quoted value, if you want the attribute set to a static value
  - #IMPLIED to indicate that the attribute can either be there or not (optional)
  - A quoted value such as "yes" to indicate a default
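The ID and IDREF types are easier to follow with an example. The sketch below is hypothetical: the team and member elements and the id and reportsTo attributes are invented for illustration and are not part of the lesson's DTDs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE team [
<!ELEMENT team (member)+>
<!ELEMENT member (#PCDATA)>
<!ATTLIST member
  id ID #REQUIRED
  reportsTo IDREF #IMPLIED
>
]>
<team>
<!-- Each id value must be unique within the document. -->
<member id="m1">Ann</member>
<!-- reportsTo must match an id that exists somewhere in the document. -->
<member id="m2" reportsTo="m1">Bob</member>
</team>
```

Note that ID values must be legal XML names, so a value that starts with a digit (such as id="1") would be invalid.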
This looks complicated, but for the purposes of this class you probably will only use a handful of combinations. The following table shows the most commonly used combinations.
Examples of <!ATTLIST> statements
Description |
Statement |
Single required attribute on an element |
<!ATTLIST memo type CDATA #REQUIRED>
The element <memo> must have a type attribute defined to be valid. |
Single optional attribute on an element |
<!ATTLIST memo type CDATA #IMPLIED>
The element <memo> can have a type attribute, but it doesn't have to. If you leave out the type attribute, the file is still valid. |
Single attribute that has a list of choices with a default value |
<!ATTLIST memo type (reminder|meeting) "meeting">
The element <memo> can have an attribute of type from which you can set one of two values. If no value is specified, the value defaults to meeting.
|
Element with multiple attributes |
<!ATTLIST book
publisher CDATA #IMPLIED
reseller CDATA #FIXED "MyStore"
ISBN ID #REQUIRED
inPrint (yes|no) "yes"
>
In this example, the book element has four attributes associated with it: publisher, reseller, ISBN, and inPrint. You define the contents for each attribute.
(Example from Microsoft documentation.) |
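To show how the default declarations in the book example behave, here is a sketch of an instance that uses it. The text-only content model for book, the ISBN value, and the title are placeholders assumed for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book [
<!-- The content model for book is assumed; the table defines only its attributes. -->
<!ELEMENT book (#PCDATA)>
<!ATTLIST book
  publisher CDATA #IMPLIED
  reseller CDATA #FIXED "MyStore"
  ISBN ID #REQUIRED
  inPrint (yes|no) "yes"
>
]>
<!-- Only the required ISBN attribute is written out. A parser that reads
     the DTD supplies reseller="MyStore" and inPrint="yes" as defaults,
     and the optional publisher attribute is simply absent. -->
<book ISBN="isbn-0000000000">Sample Title</book>
```

Writing reseller="OtherStore" on this element would be invalid, because a #FIXED attribute can only carry the value given in the DTD. The ISBN value starts with letters because an ID must be a legal XML name.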
A previous lesson described well-formed documents and how they must comply with the basic XML rules. There are actually two levels of compliance for XML documents:
- rules for well-formed XML
- rules for valid XML
Well-formed files obey all the basic rules (such as having end tags, proper nesting, and so on). Valid files conform to the rules specified in an associated DTD.
The following table summarizes the rules for a well-formed document.
Rule |
Well-formed example |
Not well-formed example |
a single root element is required |
<?xml version="1.0" encoding="UTF-8"?>
<contact_list></contact_list> |
<?xml version="1.0" encoding="UTF-8"?>
<contact_list></contact_list>
<contact_list></contact_list> |
closing tags are required |
<name>
<lastname>Smith</lastname>
</name> |
<name>
<lastname>Smith
</name> |
elements must be properly nested |
<name>
<lastname>Smith</lastname>
</name> |
<name>
<lastname>Smith</name>
</lastname> |
case matters |
<name></name> |
<Name></name> |
Attribute values must be included in quotes |
<phonenumber type="home">
555 216 3213</phonenumber> |
<phonenumber type=home>
555 216 3213
</phonenumber> |
To be valid, the document has to conform to the DTD with which it is associated. For example, if you had the following DTD:
<!ELEMENT memo (to,from,subject,body)>
<!ATTLIST memo type CDATA #REQUIRED>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT body (para)+>
<!ELEMENT para (#PCDATA)>
The following XML would be invalid:
<memo type="meeting">
<to>John Doe</to>
<subject>Hello</subject>
<body>
<para>Meeting today at 3!</para>
</body>
</memo>
It does not contain the <from> element where it should be located. A <memo> element as defined in the DTD must contain a <to> element, followed by a <from> element, followed by a <subject> element, and finally followed by the <body> element.
Similarly, the following XML would be invalid:
<memo type="meeting">
<to>John Doe</to>
<from>Mary Jones</to>
<subject>Hello</subject>
<body>
<p>Meeting today at 3!</p>
</body>
</memo>
The <p> element is not defined in the DTD; the <para> element is. The element names must match exactly. This example is not well-formed XML either: the <from> element has an incorrect closing tag of </to>.
Examples
The following examples illustrate some types of elements that you might want to create:
List of items
To create a list of items, you would create:
- a container to hold the items in the list
- one or more items
The following example also includes an attribute named compact. In the transformation, you could specify that lists with compact set to yes have less space between items. If you didn’t specify a value, it would default to no.
<!-- Example of a list -->
<!ELEMENT list (item)+>
<!ELEMENT item (#PCDATA)>
<!ATTLIST list compact (yes|no) "no" >
An example of XML that conforms to this DTD follows:
<list compact="yes">
<item>first item</item>
<item>second item</item>
<item>third item</item>
</list>
If you wanted the list item to be able to contain another element (such as <keyName> or <fieldName>), you could define it as a mixed-content element:
<!ELEMENT list (item)*>
<!ELEMENT item (#PCDATA | fieldName | keyName)*>
<!ATTLIST list compact (yes|no) "no" >
<!ELEMENT fieldName (#PCDATA)>
<!ELEMENT keyName (#PCDATA)>
In this example, you could have the additional elements in your content of the <item> element:
<list>
<item>This element can contain: <keyName>Enter</keyName>.</item>
<item>This element can also contain: <fieldName>Name</fieldName> field.</item>
<item>Or you could just have an element that has text and nothing more.</item>
</list>
Note that the list in the previous example doesn’t have the attribute for compact. If you do not specify the attribute, the default value of no is used during transformation processing.
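Because compact is declared as a list of choices, only the listed values are accepted. The following sketch puts the list DTD above into an internal DTD to illustrate this; the item text is made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE list [
<!ELEMENT list (item)*>
<!ELEMENT item (#PCDATA | fieldName | keyName)*>
<!ATTLIST list compact (yes|no) "no">
<!ELEMENT fieldName (#PCDATA)>
<!ELEMENT keyName (#PCDATA)>
]>
<!-- Valid: "yes" is one of the two declared choices.
     A value such as compact="maybe" would be invalid. -->
<list compact="yes">
<!-- Mixed content: text and keyName elements in any order. -->
<item>Press <keyName>Enter</keyName> to continue.</item>
</list>
```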
Choices between elements
The following example shows an <!ELEMENT> statement in which you have a choice between two elements.
<!ELEMENT pets_onetype (cat* | dog*)>
In this example, the pipe (|) between the <cat> and <dog> elements indicates that you can choose between them within the <pets_onetype> element. Because the asterisk is attached to the <cat> and <dog> elements inside the parentheses, you can have 0 to many cats or 0 to many dogs, but you cannot mix them (for example, two dogs and three cats). Once you choose, you can have as many of that type of pet as you would like.
Conversely, if you moved the asterisk outside of the parentheses:
<!ELEMENT pets_twotypes (dog | cat)*>
You would be able to have any combination of <dog> and <cat> elements. You choose between the two of them, but then you can have 0 to many choices.
The following table illustrates the possible combinations:
Element |
Example code |
Description |
<pets_onetype>
|
<pets_onetype>
<dog>Ruff</dog>
<cat>Fido</cat>
</pets_onetype> |
Invalid – You can have either dogs or cats for the <pets_onetype> element. You cannot have both cats and dogs. |
<pets_twotypes>
|
<pets_twotypes>
<dog>Fluffy</dog>
<cat>Cujo</cat>
</pets_twotypes> |
Valid – You can have any combination of dogs and cats. |
<pets_onetype>
|
<pets_onetype>
<cat>Fluffy</cat>
<cat>Frisky</cat>
</pets_onetype> |
Valid – You can have any number of a single type of pet for the <pets_onetype> element. |
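The two content models can be compared side by side in one document. In the sketch below, the household wrapper element and the text-only dog and cat declarations are assumptions added so that the example is complete; the pet names are invented.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE household [
<!-- household and the dog/cat declarations are assumed for this sketch. -->
<!ELEMENT household (pets_onetype, pets_twotypes)>
<!ELEMENT pets_onetype (cat* | dog*)>
<!ELEMENT pets_twotypes (dog | cat)*>
<!ELEMENT dog (#PCDATA)>
<!ELEMENT cat (#PCDATA)>
]>
<household>
<!-- Valid: only dogs, so a single branch of the choice is used. -->
<pets_onetype>
<dog>Rex</dog>
<dog>Spot</dog>
</pets_onetype>
<!-- Valid: dogs and cats in any order and in any number. -->
<pets_twotypes>
<cat>Tom</cat>
<dog>Rex</dog>
<cat>Felix</cat>
</pets_twotypes>
</household>
```

Both content models also accept an element with no pets at all, because the asterisk allows zero occurrences.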
Optional element within a sequence
Assume that you are creating an <address> element. You know that not all addresses have apartment numbers, so you want to make sure that the apartment number is optional. You also know that addresses have a pretty standard order: name, street, apartment number, city, state, zip code.
You could define this within your DTD as:
<!ELEMENT address (name, street, aptNumber?, city, state, zipCode) >
The commas between the elements indicate that the elements must occur in exactly the order defined in this sequence. You can also have only one of each item. The question mark after <aptNumber> indicates that it is an optional element.
Example code |
Description |
<address>
<name>Joe Smith</name>
<street>123 Main Street</street>
<city>Austin</city>
<state>TX</state>
<zipCode>78759</zipCode>
</address> |
Valid – The <address> element includes all elements except the apartment number, which is optional. |
<address>
<name>Joe Smith</name>
<city>Austin</city>
<state>TX</state>
<zipCode>78759</zipCode>
</address> |
Invalid – The address does not include the <street> element. The <street> element is not optional. |
<address>
<name>Joe Smith</name>
<street>123 Main Street</street>
<aptNumber>#14</aptNumber>
<city>Austin</city>
<state>TX</state>
<zipCode>78759</zipCode>
</address> |
Valid – The address includes all elements, including the optional <aptNumber> element. |
<address>
<name>Joe Smith</name>
<name>Maggie Smith</name>
<street>123 Main Street</street>
<aptNumber>#14</aptNumber>
<city>Austin</city>
<state>TX</state>
<zipCode>78759</zipCode>
</address> |
Invalid – The address includes two names. The DTD as currently defined does not permit that. If you want the address to be able to include multiple names, you would need to redefine the <!ELEMENT> statement:
<!ELEMENT address (name+, street, aptNumber?, city, state, zipCode) > |
<address>
<name>Joe Smith</name>
<street>123 Main Street</street>
<aptNumber>#14</aptNumber>
<city>Austin</city>
<zipCode>78759</zipCode>
<state>TX</state>
</address> |
Invalid – The <zipCode> occurs before the <state> element. The sequence in the DTD indicates that the <state> element should appear first. |
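To validate the examples in the table, the DTD also needs a declaration for each child of <address>. The sketch below assumes that every child holds only text; those (#PCDATA) declarations are not given in the lesson and are added here for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE address [
<!ELEMENT address (name, street, aptNumber?, city, state, zipCode)>
<!-- Text-only child declarations are assumed for this sketch. -->
<!ELEMENT name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT aptNumber (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zipCode (#PCDATA)>
]>
<!-- Valid: aptNumber is left out, and the other elements appear in order. -->
<address>
<name>Joe Smith</name>
<street>123 Main Street</street>
<city>Austin</city>
<state>TX</state>
<zipCode>78759</zipCode>
</address>
```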
Reusing a child element in multiple elements
Consider the following structure for a memo. In this example, instead of only a <body> element as in the previous <memo> examples, it includes a <greeting>, <body>, and <closing> element. The <para> element can occur within any of those elements. The <from> element can be used within the <memo> element or within the <closing> element.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE memo [
<!ELEMENT memo (to,from,subject,greeting,body,closing)>
<!ATTLIST memo type CDATA #REQUIRED>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT greeting (para)+>
<!ELEMENT body (para)+>
<!ELEMENT closing (para+,from)>
<!ELEMENT para (#PCDATA)>
]>
<memo type="letter">
<to>John Doe</to>
<from>Jane Wilson</from>
<subject>Hello</subject>
<greeting>
<para>It has been a long time since we talked. How have you been?</para>
</greeting>
<body>
<para>We should get to discuss the project really soon!</para>
</body>
<closing>
<para>Sincerely,</para>
<from>Joe</from>
</closing>
</memo>
When you define the <para> and <from> elements within the DTD, you define each of them only one time. Each element that includes the <para> element uses the same definition. If you defined <para> more than one time within a single DTD, the DTD would be invalid. An element can be defined only a single time within a DTD.
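For contrast, the following fragment sketches the mistake described above. It is a deliberately incorrect variation on the memo DTD, written only to show what a duplicate declaration looks like.

```xml
<!-- Incorrect DTD fragment: para is declared twice. -->
<!ELEMENT greeting (para)+>
<!ELEMENT para (#PCDATA)>
<!ELEMENT body (para)+>
<!-- The next line repeats the para declaration and makes the DTD invalid.
     Deleting it fixes the problem; greeting and body both keep using
     the single para declaration above. -->
<!ELEMENT para (#PCDATA)>
```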