View Only

Managing XML Data

By Ashley Bassman posted Sat October 01, 2022 04:09 PM

Developed by: 

Matthias Nicola

IBM Silicon Valley Lab

Susanne Englert

IBM Silicon Valley Lab

Download the full report to get started!

Executive Summary: 

This paper provides principles and guidelines for using DB2® pureXML® to solve business problems effectively and to achieve high performance when managing XML data in enterprise applications. The examples illustratingthe best practices are based on a real-world financial application scenario and demonstrate how to implement the guidelines. The examples can be easily adapted to other types of XML applications. The paper covers the following areas:

  • Storage options for XML data to improve performance and storage efficiency
  • Techniques for adding XML data into a DB2 database
  • Techniques for querying and updating XML documents efficiently
  • Techniques for using indexes over XML data with queries effectively
  • Techniques for efficiently maintaining and monitoring an XML database
  • Techniques for developing efficient pureXML applications


XML provides a neutral, flexible way of exchanging data among different systems, applications, and organizations. With XML, data is maintained in an extensible self- describing format to accommodate ever-evolving business needs. XML documents use tags to describe the data values they contain, and the nesting of tags to express hierarchical relationships between the data items. XML can describe very structured data and enforce the structure through XML schemas, but can also describe semi-structured data which is prevalent in content-oriented applications.

Service oriented architectures (SOA), enterprise application integration (EAI), enterprise information integration (EII), Web services, the enterprise message bus (ESB), and standardization efforts in many industries all rely on XML as an underlying technology for data exchange.

Organizations as well as entire industries have standardized XML schemas to promote exchange of data and areevolving those schemas to meet changing business needs. These efforts include ACORD in the insurance industry, FpML® and FIXML in the financial industry, RosettaNet in supply chain management, ARTS in the retail business, HL7 in healthcare, XBRL for business reporting, and DITA for authoring, managing, and publishing documentation in print and on the Web.

Such industry-specific initiatives, as well as regulatory requirements, are driving the deployment of XML. Asmore business transactions are conducted through Web-based interfaces and electronic forms, government agencies and commercial enterprises bear greater responsibility for preserving the original order, request, claim, trade, or submission. XML provides a straightforward means of capturing and maintaining the data associated with these electronic transactions. Indeed, XML documents frequently represent transaction records in message-based transaction processing systems.

Pros and cons of XML and relational data

As a self-describing data format, XML allows diverse data (with or without an XML schema) to be stored in asingle document or row without sacrificing the ability to search or aggregate portions of that data. Applications can evolve their XML schemas without causing any changes to the underlying database schema. While the flexibility of XML means that examining and interpreting XML data can consume more CPU and I/O resources than if the same data was stored in relational form, factors such as schema complexity might make storing the data in relational form impractical.

With more rigid schema definitions, the relational model requires significantly less interpretation and allows for more optimized data operations. As such, it can provide very high performance but might fail to meet application requirements for schema flexibility. The relational data model is very suitable for applications with stable data structures and predictable access patterns. XML is often more suitable for applications with complex and variable data structures, and for combining structured and unstructured information.

In some cases, XML offers performance benefits over relational models precisely because of its flexibility.Relational databases often require normalization to fit business data into flat, tabular structures. This normalization of complex business data requires transformation when data is stored and retrieved, and often leads to multi-way join queries in relational databases. XML can be a more natural representation of complex business objects with all relevant relationships represented in a single document. The hierarchies within anXML document are essentially pre-computed joins between related data items.

Another consideration in choosing a data model is the application using the data. Even if data originates in XML, if subsequent processing of that data depends on the data being stored in a tabular format—for example, when applying relational online analytical processing (OLAP) to data in a data warehouse—then storing the data in relational format instead of XML might be the correct choice.

XML solutions to relational data model problems

The storage data model should match, to the greatest extent possible, the highest value and most critical usage model for your data. If the data being modeled is naturally tabular, it is typically better to represent it in relational format that as XML. However, there are cases where the relational model is not necessarily the best choice and sometimes even poor choice to hold your data. The following are some situations where an XML representation tends to be more beneficial than the relational format:

When the schema is volatile

Problem with relational data: If the schema of the data changes often, then representing the data in relational form results incurs cost and overhead of changing the relational schema. While some forms of schema modification are relatively painless in relational databases, such as adding a new column to a table, other forms are more involved, such as dropping a column or changing the type of a column. Still other forms of schema modification are downright difficult, such as normalizing one table into multiple tables. Changing the tablesthen means that applications need to change the SQL statements that access them.

Solution with XML data: Portions of the schema that are volatile can be expressed as a single XML column. The self-describing and extensible nature of XML allows seamless handling of schema variability andevolution. Changes in the XML document format are accommodated without changing tables or columns in the database and typically without breaking existing XML queries.

When data is inherently hierarchical in nature

Problem with relational data: Data that is inherently hierarchical or recursive is often difficult to represent in relational schemas. Examples include bill of materials, engineering objects, or biological data. A bill of materials explosion can be stored in a relational database but reconstructing it in parts or in full might requirerecursive SQL.

Solution with XML data: Since XML is a hierarchical data model, it is a much more natural representation for inherently hierarchical business data. Using XML allows simple, navigational data access to replace complex set operations if the same was represented in tabular format.

When data represents business objects

Problem with relational data: If application data represents business objects, such as insurance claim forms, then it is often beneficial to keep the data items that comprise a particular claim together, instead of spreading them over a set of tables. This is particularly true when the individual data items of a claim form have no valid business meaning by themselves and can only be interpreted in the context of the complete form. Normalizing the claims across dozens of relational tables means that applications deal with a complex and unnatural fragmentation of their business data. This increases complexity and the chance for errors.

Solution with XML data: XML allows you to represent even complex business objects as cohesive and distinctdocuments while still capturing all of the relationships between the data items that comprise the business object. Representing each claim form (business object) as a single XML document in a single row of a table provides a very intuitive storage model for the application developer and allows rapid application development.

When objects have sparse attributes

Problem with relational data: Some applications have a large number of possible attributes, most of which are sparse, that is attributes applicable to very few objects. A classic example is a product catalog where the number of different product attributes is huge, including: size, color, weight, length, height, material, style, weave, voltage, resolution, water resistance, and a nearly endless list of other properties. For any given product, only a subset of these attributes is relevant. One possible relational approach is to store this data is to have one column per attribute, which means a very large percentage of the cells in the table contains NULLvalues. This is undesirable and can be inefficient. A different relational approach for such sparse data is a 3-column table that stores several name/value pairs for each product ID. This means the attribute names aren’t column names but values in a VARCHAR column. This prevents relational database systems from accurately estimating constraint selectivity and generating efficient query plans. Also, defining and enforcing constraints, such as uniqueness for a certain attribute, is extremely difficult.

Solution with XML data: The beauty of XML is that elements and attributes can be optional, so they are simply omitted if they don’t apply for a specific product. Neither NULL values nor name/value pairs are needed. The XML schema can define a very large number of optional elements, but only few of them are used for any given object. While every row in a relational table has to have the exact same columns, XML documentsin an XML columns can have different elements from one row to the next. Also, an XML index for an optional element will be very small if this element appears only in a small percentage of the documents (rows). This is a clear advantage over relational indexes which have exactly one entry per row. 

Problem with relational data: If you export a set of rows from a relational table and send them to another application or organization, the recipient cannot interpret the data without additional metadata that describes the columns. This is particularly true if your relational schema has changed since the last time you sent data.

Solution with XML data: XML data is self-describing. The XML tags are metadata which describe the values that they enclose.

Benefits of DB2 pureXML over alternative storage options

Since XML has become increasingly critical to the operations of an enterprise, XML documents are assets thatneed to be shared, persisted, searched, secured, and updated with full transactional consistency. Depending on its use, XML data might also need to be transformed, audited, and integrated with other data. To meet these requirements, storing XML data in its native hierarchical format in a DB2 database has several advantages, including:

  • Retaining awareness of the internal structure of the XML data. This has advantages over storing theXML documents as character or binary large objects (CLOBs or BLOBs) in the Inparticular, you can easily query XML data using XQuery, XPath, and SQL/XML to take advantage of the XML structure, and you can enhance query performance by creating indexes over XML data. Additionally, you can easily update, transform, and publish XML data using SQL, XQuery and XSLT.
  • Maintaining the hierarchical and flexible nature of XML data. This has advantages over decomposing (shredding) the XML documents into relational tables where an administrator maps XML elements and attributes to relational columns. After shredding, XML document values are stored in these tables without their original Shredding often requires a large number of tables and often it is toocomplex to be practical. Queries over decomposed XML documents can require complex SQL joins that tend to be difficult to develop and tune. Changes to the XML schema often break the mapping to the relational database schema. This is incurs costly and time-consuming maintenance which defeatsthe flexibility for which XML is typically chosen. This is why DB2 pureXML allows you to use a single XML column to store and query XML documents that are based on different XML schemas, or different versions of an evolving XML schema.
  • Integration of XML documents with relational data in a single database. This has advantages overstoring relational data in one database and XML documents in a separate, XML-only database. Such an approach requires skills and labor to operate and maintain two database systems instead of just one. Also, combining data from the two databases usually requires extra logic in the application that is often difficult and inefficient. When you store both XML and relational data in a single DB2 database, you can combine both types of data in queries, perform joins between them, and even convert one to other as needed. This can potentially be more cost-efficient and provides better performance than using two separate databases.

Best practices for DB2 pureXML: Overview

The DB2 pureXML feature offers sophisticated capabilities for storing, indexing, validating and querying XML data – fully integrated with the DB2 relational data management features. This document describes principles for using pureXML in an effective and efficient manner. The goal of this document is not to be an introduction to the pureXML features or how they work. Instead, this document provides guidelines to help achieve higher pureXML performance, as well as examples of how to deploy the pureXML functionality to solve specific business problems effectively.

We are using a real-world application scenario as the setting for the best practices and examples in this paper. It’s a scenario from the financial industry and deals with the management of “derivative trades” based on an XML format called FpML (Financial Products Markup Language). You don’t need to be a financial expert tounderstand this scenario. Although we use this specific scenario, the best practices also apply to other XML applications, such as XML forms processing, order management systems, XML in health care and electronic patient records management.

The topics in this paper are roughly organized according to the typical life cycle of a database project. We start in this section by reviewing the data and tables that the application requires. Then we discuss the DB2 storage options for XML data in “Choosing the right storage options for XML data.” After that, ”Guidelines for addingXML data to a DB2 database” provides tips and techniques for adding XML data into a DB2 database. In ”How to query XML data efficiently and effectively,” we present guidelines and examples for querying XML data more efficiently. To improve query performance, best practices for defining and using XML indexes are given in ”Usage guidelines for XML indexes.” Guidelines for XML namespaces and XML updates are discussed in ”Dealing with XML namespaces” and ”Effectively updating XML data,” respectively. The sections ”Maintaining and monitoring an XML database” and ”Developing pureXML applications” coveradditional topics for database administrators (DBAs) and application developers. Finally, ”Summary” concludes with a summary of the most important guidelines.

 Sample scenario: derivative trades in FpML format

A “derivative trade” is a financial trade which is based upon (derived from) some other financial asset, such as a stock, an index, an interest rate, a currency, or other. In a derivative trade, two parties agree to exchange cash, depending on market conditions that affect the underlying asset. Typically, one party uses the trade to mitigate risk; the other party uses the trade to gain immediate income (through fees or premiums) or to speculate that future market conditions will provide profits. Consider the following example.

YourWorld Investments and MyGlobal Bank agree on a currency exchange derivative trade. They negotiate thaton October 25, YourWorld will pay 71,900,000 Chinese Yuan to MyGlobal, and MyGlobal will pay 10,000,000 US Dollars to YourWorld. MyGlobal will profit from this trade if the value of the US Dollar declines below 7.19 CYN per USD between now and Oct 25. MyGlobal might be using this trade to hedge against the risk of a falling US Dollar. YourWorld might speculate that the US Dollar will gain in value, or collect up-front fees from MyGlobal for entering the trade.

What’s interesting about derivatives is that (a) there are many different types and variations, (b) the conditions of a particular trade are often individually negotiated and complex, and (c) the life span of a derivative can range from days to years and their conditions might change over time. The financial industry found that the flexibility and extensibility of XML was required to define a standard data format that could capture the high variability of derivatives. As a result, they developed FpML. FpML is essentially an XML Schema that defines how XML elements and attributes are used to describe derivative trades. The International Swaps and Derivatives Association, Inc (ISDA) manages the FpML standard on behalf of a community of investment banks that make a market in OTC derivatives. See [11] and [12] for more information on derivatives and FpML.

Download the full report to get started!