XML (Extensible Markup Language) has become one of the most widely used methods of representing information in a self-describing format across computing devices running all manner of operating systems and programming languages. XML is designed to transport and store information in a simple, structured manner, independent of how that information may be viewed, displayed, or otherwise processed. Similar to HTML in concept, XML can serve as a universal data format that web services and stand-alone applications alike can build on.

How Does XML Differ from HTML?

At first glance, XML appears to be almost identical to HTML in format. This is primarily because both are related to the Standard Generalized Markup Language (SGML), which was first released as an international (ISO) standard in 1986. As a result, students and developers who know HTML syntax (i.e., those who don’t rely solely on WYSIWYG (What You See Is What You Get) tools) are able to quickly pick up basic XML fundamentals. There are, however, two primary differences between XML and HTML:

#1 – XML separates form and content. HTML primarily consists of tags that define how text and other elements such as images should be displayed in a web browser. Although XML can be viewed in most modern browsers, its tags express the structure and content of the data rather than its appearance. The data can then be manipulated by a web service, application, or XML style sheet, whether to display it or to use it in calculations that support other programs.

#2 – XML is extensible. In XML, tags can be defined by developers, organizations, or even individuals for use in open source or proprietary applications. HTML, on the other hand, uses a standardized tag set published by the W3C (World Wide Web Consortium).
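
As a minimal sketch of both points, the following Python snippet uses the standard library's xml.etree.ElementTree module to read a small document and use its data in a calculation. The tag and attribute names (inventory, item, sku, name, price) and the helper total_price are invented for this illustration; they are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: these tag and attribute names are made up for this example.
DOCUMENT = """
<inventory>
  <item sku="A1"><name>Widget</name><price>2.50</price></item>
  <item sku="B2"><name>Gadget</name><price>4.00</price></item>
</inventory>
"""

def total_price(xml_text: str) -> float:
    """Sum the <price> of every <item>; no display information is involved."""
    root = ET.fromstring(xml_text)
    return sum(float(item.findtext("price")) for item in root.findall("item"))

print(total_price(DOCUMENT))
```

The document says nothing about fonts, colors, or layout. A separate program or style sheet would decide how, or whether, to display the same data.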

History of XML

Before the Internet became a part of everyday life, digital media publishers in the late 1980s saw the benefits of using SGML for the dynamic display of information. As the World Wide Web started to grow, experienced SGML and web researchers and developers realized that a number of issues related to information sharing and display could be mitigated through the use of SGML or a derivative of the markup language. As a result, Dan Connolly added SGML to the W3C activity list in 1995, and work on a way forward began in the middle of 1996 under Jon Bosak.

An 11-person working group was formed, supported by a 150-member XML interest group. The membership conducted technical discussions and debates over a shared email list, and decisions on the standard were made by consensus or by a majority vote of the working group. The initial goals of the working group for XML were: Internet usability, compatibility with SGML, formality, stability, conciseness, authoring ease, legibility, and a small number of optional features.

A record of the group’s design decisions was compiled by Michael Sperberg-McQueen on December 4, 1997. James Clark served as the Technical Lead of the XML working group and is credited with naming the new markup language XML as well as creating the empty-element syntax. The three co-editors of the XML specification were Tim Bray, Michael Sperberg-McQueen, and Jean Paoli. Other names suggested for the language at the time were MAGMA (Minimal Architecture for Generalized Markup Applications), MGML (Minimal Generalized Markup Language), and SLIM (Structured Language for Internet Markup). Design work on the standard continued throughout 1997, and XML 1.0 became a W3C Recommendation on February 10, 1998. All of the primary goals of XML 1.0 were achieved; however, because terseness was not considered essential, the first version of the markup language also permitted redundant syntactic constructs and repetition of element identifiers.

Basic XML Constructs

The basic constructs of XML 1.0 are Unicode characters, markup and content, tags, elements, attributes, and the XML declaration.

Unicode

The XML specification defines an XML document as a string of characters. Almost every legal Unicode character may appear in an XML document.

XML Markup and Associated Content

The XML specification divides the characters of a document into two categories: markup and content. The two are distinguished by simple syntactic rules. At a high level, a string of characters that begins with “<” and ends with “>”, or that begins with “&” and ends with “;”, is markup. All other strings of characters are content. There is one special case: in a CDATA section, the delimiters <![CDATA[ and ]]> are classified as markup, while the text between them is content. Finally, whitespace that appears before and after the outermost element is considered markup.
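
The split between markup and content can be observed with a parser. The sketch below uses Python's standard xml.etree.ElementTree module; the note element name is arbitrary and chosen only for illustration. The parser consumes the entity reference and the CDATA delimiters as markup and returns only the content.

```python
import xml.etree.ElementTree as ET

# The <note> element name is arbitrary; it is used only for illustration.
source = "<note>Tom &amp; Jerry <![CDATA[<b>not a tag</b>]]></note>"

note = ET.fromstring(source)

# &amp; is markup that stands for "&"; the CDATA delimiters are markup too,
# while the text between them is passed through untouched as content.
content = note.text
print(content)

# The <b>...</b> inside the CDATA section was never treated as an element.
child_count = len(note)
print(child_count)
```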

XML Tag

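XML tags are among the most basic building blocks of the markup language. Unlike HTML, where some end tags may be omitted, every XML start tag must have a matching end tag unless the element is written as a single empty-element tag. For example: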

Start Tag: <start>

End Tag: </start>

Or a combined, empty element tag: <empty />

XML Element

All of the characters that fall between a start tag and its matching end tag are the element’s content. The content may also contain other elements (child elements) or other markup.
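
A short sketch of element content and child elements, using Python's standard xml.etree.ElementTree module. The book, title, and author names are hypothetical and serve only as an example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <book>, <title> and <author> are illustrative names.
book = ET.fromstring(
    "<book><title>XML Basics</title><author>A. Writer</author></book>"
)

# Iterating over an element yields its direct child elements in document order.
child_tags = [child.tag for child in book]

# The character content of a child element is available through .text.
title_text = book.find("title").text

print(child_tags, title_text)
```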

XML Attribute

An XML attribute is a name/value pair that resides within a start tag or empty-element tag. The attribute’s value is the text between the two quotation marks, as shown in this example:

<myTag attribute1="some information" attribute2="http://www.byteguide.com"/>

A second example shows an attribute on an element that has separate start and end tags:

<info attributeInfo="2">A B C</info>
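
Reading attributes back might look like the following sketch, which uses Python's standard xml.etree.ElementTree module. It parses an info element with one attribute, written with straight quotes and a matching end tag; the attribute name other is a made-up example of an attribute that is absent.

```python
import xml.etree.ElementTree as ET

# An <info> element with one attribute, straight quotes, and a matching end tag.
info = ET.fromstring('<info attributeInfo="2">A B C</info>')

value = info.get("attributeInfo")   # attribute values are always strings
missing = info.get("other", "n/a")  # a default is returned for absent attributes

print(info.attrib, value, missing, info.text)
```

Because attribute values are plain strings, a number such as 2 has to be converted explicitly (for example with int(value)) before it is used in arithmetic.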

XML Declaration

XML documents should normally begin with an XML declaration that states basic information about the markup that follows, such as the XML version and the character encoding. For example, a declaration specifying XML version 1.0 and UTF-8 encoding would look like this:

<?xml version="1.0" encoding="UTF-8" ?>
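
Most XML libraries can write the declaration for you. As a minimal sketch, Python's standard xml.etree.ElementTree module does so when asked (the xml_declaration argument requires Python 3.8 or later). The greeting element is a hypothetical name, and the serializer's exact quoting style may differ from the example above.

```python
import xml.etree.ElementTree as ET

root = ET.Element("greeting")  # hypothetical element name
root.text = "hello"

# xml_declaration=True prepends the declaration; because a byte encoding is
# requested, the result is a bytes object rather than a str.
serialized = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
print(serialized.decode("utf-8"))

# Parsing the bytes back honours the encoding named in the declaration.
round_trip = ET.fromstring(serialized).text
print(round_trip)
```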

Does XML Allow Comments?

Similar to HTML, XML supports comments, which may appear anywhere in the document after the XML declaration as long as they are outside other markup. XML comments are typically used when documents are edited by a human during development or updates; however, they are not as common as comments in normal programming practice. An XML comment starts with “<!--” and ends with “-->”. Within the comment, the string “--” is not permitted, since it is reserved for signaling the end of the comment.
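
Both comment rules can be checked with a parser. The sketch below uses Python's standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper written for this example, and the element names are arbitrary.

```python
import xml.etree.ElementTree as ET

# By default ElementTree discards comments while building the tree.
doc = ET.fromstring("<root><!-- reviewed by an editor --><item>1</item></root>")
children = [child.tag for child in doc]

def is_well_formed(xml_text: str) -> bool:
    """Hypothetical helper: True if the text parses without a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A double hyphen inside a comment body is rejected by the parser.
double_hyphen_ok = is_well_formed("<root><!-- a -- b --></root>")

print(children, double_hyphen_ok)
```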

How Does XML Handle Errors?

In the XML standard, an XML document is defined as text that meets all of the syntax rules laid out in the specification; such a document is considered well-formed. The requirements for a document to be well-formed are:

–          XML documents may only contain legal Unicode characters.

–          Special syntax characters such as “<” and “&” can only be used in their markup-defining roles.

–          Beginning, ending, and empty element tags must be correctly nested, not overlap, and have none missing.

–          Beginning and ending tags have to match exactly.

–          Tag names cannot start with a digit, a hyphen (-), or a period (.).

–          The following characters can’t appear in tag names: !"#$%&'()*+,/;<=>?@[\]^`{|}~

–          There can only be one root element in an XML document that is the parent of all other document elements.

When an XML processor finds a violation of the well-formedness rules, it is required to report the error and cease processing of the document. This is distinct from HTML, where errors found in web pages are handled more gracefully. By the specification, a well-formed XML document only has to follow the XML syntax rules. A separate and stricter term, valid, refers to a well-formed document that also conforms to the rules of a schema such as an XML Schema (XSD) or a Document Type Definition (DTD).
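
The stop-on-first-error behavior can be seen with any conforming parser. In the sketch below, Python's standard xml.etree.ElementTree module is given one well-formed sample and several samples that each break one rule from the list above. The sample documents and the check helper are invented for this illustration, and the wording of the error messages depends on the underlying parser.

```python
import xml.etree.ElementTree as ET

# Each sample breaks one well-formedness rule; the element names are arbitrary.
SAMPLES = {
    "well-formed": "<a><b>text</b></a>",
    "overlapping tags": "<a><b></a></b>",
    "mismatched end tag": "<a></A>",
    "name starts with a digit": "<1a></1a>",
    "two root elements": "<a></a><b></b>",
    "illegal control character": "<a>\x01</a>",
}

def check(xml_text: str) -> str:
    """Return 'ok', or the parser's message for the first violation found."""
    try:
        ET.fromstring(xml_text)
        return "ok"
    except ET.ParseError as err:
        # Parsing stops here; no partial tree is returned to the caller.
        return f"error: {err}"

results = {label: check(text) for label, text in SAMPLES.items()}
for label, outcome in results.items():
    print(f"{label}: {outcome}")
```

Note that this checks well-formedness only. Python's standard library does not validate documents against a DTD or XSD.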

How Are XML Documents Validated?

XML documents are validated against a schema. The Document Type Definition (DTD) is the original schema language for XML and was inherited from SGML.

DTD Advantages

–          Since DTDs were included in the XML 1.0 standard, they are supported just about anywhere XML is supported.

–          They are terse, so they present more information on a single screen than element-based schema languages.

–          They allow the declaration of public entity sets for publishing characters.

–          They define a document type, grouping all constraints for the information in a single document or collection of rules.

DTD Limitations

–          They have no explicit support for newer XML features, most notably namespaces.

–          They are simpler than SGML DTDs and support only rudimentary data types.

–          They are hard to read without tools.

–          Their syntax is based on regular expressions rather than on XML itself.

What is the Newest Schema Language for XML?

The newest schema language for XML is XML Schema, or XSD. Unlike XML DTDs, XML Schema uses a rich data typing system. It also permits authors to place more detailed constraints on document structure than DTDs allow. XSD is itself written in an XML-based format, which makes it easier for processing tools to support. The main disadvantage of XSD is that it is newer than the original XML specification, even though it has been around for more than a decade. As a result, not all legacy applications that are based on XML support it.

What is XSLT?

XSLT is an XML-based language that can be used to create other documents from existing XML documents. The output formats include, but are not limited to, HTML, XHTML, plain text, and 3D graphics formats such as X3D or VRML. XSLT uses XPath to address the elements and attributes of the source XML document, and an XSLT template-processing engine, or processor, creates the output document. The XSLT document is also referred to as the “stylesheet” and includes one or more template rules. These rules tell the processor how to create the various components of the output document.
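
The XPath addressing step can be sketched in Python, with one caveat: the standard library has no XSLT processor, and xml.etree.ElementTree supports only a limited subset of XPath. The snippet is therefore an illustration of selecting nodes by path and attribute, not an XSLT transformation. The catalog document is hypothetical, and the final string-building line is a plain-Python stand-in for what a template rule would do.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; element and attribute names are illustrative.
catalog = ET.fromstring(
    "<catalog>"
    "<book lang='en'><title>First</title></book>"
    "<book lang='fr'><title>Deuxieme</title></book>"
    "</catalog>"
)

# XPath-style expressions select nodes by structure and by attribute value.
all_titles = [t.text for t in catalog.findall("./book/title")]
english = [t.text for t in catalog.findall("./book[@lang='en']/title")]

# Stand-in for one template rule: turn each selected title into an HTML list item.
html = "<ul>" + "".join(f"<li>{title}</li>" for title in all_titles) + "</ul>"

print(english)
print(html)
```

A real stylesheet would express the same selection as XPath expressions inside template rules and leave the traversal to the XSLT processor.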

How is XML Used on the Internet?

Over the past decade, XML has become one of the primary data interchange formats used over the Internet and local networks. IETF RFC 3023, XML Media Types, provides a governing set of rules for creating valid media types when sending XML across the Internet. It defines the text/xml and application/xml types but does not define requirements for the semantics of the data they carry. The text/xml type may be deprecated in the future, since it can result in encoding issues depending on the application. The RFC also introduces the convention of appending “+xml” to the names of XML-based formats, such as image/svg+xml for SVG, the XML-based 2D graphics format. A related document is RFC 3470, which provides guidelines for the use of XML within existing IETF protocols. Also known as IETF BCP 70, it covers a number of the requirements for both the design and deployment of an XML language.
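
An application that receives data over the network can use these naming rules to decide whether to hand a payload to an XML parser. The following sketch assumes only the two generic type names and the “+xml” suffix convention; is_xml_media_type is a hypothetical helper, not a function from any library.

```python
def is_xml_media_type(media_type: str) -> bool:
    """Hypothetical helper: detect XML media types by name or by the '+xml' suffix."""
    # Drop parameters such as '; charset=utf-8' and normalise case.
    essence = media_type.split(";", 1)[0].strip().lower()
    return essence in ("text/xml", "application/xml") or essence.endswith("+xml")

for candidate in ("application/xml", "image/svg+xml; charset=utf-8", "text/html"):
    print(candidate, is_xml_media_type(candidate))
```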

What is the Future of XML?

The future of XML depends on whom you ask. Some developers and researchers think that it’s time to pause development of new features or rules for XML and let industry fully adopt the existing standards. Others believe that there are significant gains to be made by continuing to iterate on the standard in order to improve application performance, capabilities, and end-user experience. In recent history, XML has made it possible for developers to take large data sets that don’t logically fit in a relational database and share this information without a dedicated application programming interface (API). In essence, XML simplifies information exchange across operating systems, programming languages, and even spoken languages.

Unfortunately, the growing number of capabilities and specifications that rely on XML has resulted in a large body of rules that must be learned and adhered to, and that are not necessarily in sync with each other. This decreases usability for new adopters and potentially lengthens the lead time to create XML tools. Recommendations for inclusion in an XML 2.0 standard include full integration of namespaces, removal of DTDs, and incorporation of XML Base and the XML Information Set into the base XML standard. There has also been significant work toward including a binary encoding of XML in the base standard rather than as an extension to it.