Db2

Db2

Where DBAs and data experts come together to stop operating and start innovating. Connect, share, and shape the AI era with us.


#Data


#Data
 View Only

A Deep-Dive in Open‑Data Formats: Parquet, ORC, and Avro

By Ghareeb Falazi posted 5 days ago

  

Modern data platforms rely on efficient, interoperable, and open file formats to store and process large‑scale analytics and streaming data. Open‑Data Formats (ODFs) such as Apache Parquet, Apache ORC, and Apache Avro form the backbone of lakehouse architectures and understanding how these formats differ — in design, purpose, and capabilities — is essential for selecting the right tool for your workload.

What Are Open‑Data Formats?

Open Data Formats (ODFs) are publicly documented and vendor neutral data serialization formats. They define how data is structured and encoded so that multiple systems can read and process it consistently. Because they are open, these formats allow interoperability across different data warehousing and distributed processing systems such as Spark, Hive, Flink, Trino, and Db2, avoiding vendor lock-in and ensuring long term data accessibility.

Major Factors to Consider When Choosing an ODF

When comparing ODFs, we need to focus on key characteristics and design choices that differentiate between them. Here, you can learn about these concepts that distinguish ODFs and later, we will apply them as criteria to compare the most prominent ODFs.

Column‑ vs Row‑Orientation

ODFs focus on storing tabular data and they use either a row oriented layout or column oriented layout for this purpose. In a row-oriented layout, entire records are stored contiguously, which is ideal when whole rows are frequently written or read. In contrast, in a column-oriented layout, values from the same column are stored contiguously. This layout enables better storage efficiency because each column typically contains homogeneous data and therefore benefits from encoding and compression tailored to its characteristics. Colum-oriented layouts also reduce I/O for analytic workloads by allowing engines to read only the specific columns needed for a query, rather than scanning entire rows and discarding unneeded columns.

Support for Distributed Processing

Distributed engines rely on splitting files into units to process data in parallel. ODFs may split the data in two ways. First, if the ODF supports column-oriented layout, data is split vertically, meaning that different columns can be processed independently. Additionally, the ODF may chunk rows to into groups, which splits the data horizontally. Each of the resulting row groups can be processed independently. Having both horizontal and vertical data splitting enables the highest degree of distributed processing.

Predicate Pushdown Support

Predicate pushdown is an optimization technique that allows a query engine to apply filtering conditions as early as possible in the data access path. In the context of ODF engines, this means that we apply filtering conditions while reading the data files themselves, rather than after loading the data into memory.

When a query includes a filter (for example, WHERE date >= '2025-01-01' AND country = 'DE'), a naive execution model would read all rows from the data file and then discard rows that do not satisfy the condition. Predicate pushdown avoids this by using metadata stored inside the data file to determine which parts of the file can be safely skipped without reading their actual values.

To enable predicate pushdown, an ODF stores statistics such as minimum and maximum values, null counts, or bloom filters for logical subsets of the data (for example, row groups, stripes, or pages). During query planning or execution, the engine compares the query’s predicates with these statistics: If the statistics prove that no values in a data chunk can satisfy the predicate, that chunk is skipped entirely, whereas, if the statistics indicate a possible match, only then is the corresponding data read and evaluated.

ODFs differ in the scopes and types of statistics they require to be stored in data files. Storing more detailed statistics, potentially at multiple levels, increases the chance that queries that involve filters run faster. However, this increases complexity and storage requirements.

Compression and Encoding

Compression and encoding reduce storage and I/O requirements by storing data efficiently. ODFs vary in the compression codecs and encoding techniques they support and the level at which they are applied (file, row-group, page, etc.)

Complex Types Support

Modern datasets include a variety of data types. In addition to primitive types, such as integers, Booleans and byte arrays, datasets may include data with complex types such as temporal, geospatial, and nested types. ODFs differ in how naturally they represent complex types. Strong support for complex types ensures that standard readers and writers can handle them natively, minimizes the need to preprocess data before storing it, and ensures better query performance for nested types.

Schema Evolution Support

Because real‑world schemas change over time (for example, by adding, removing, or changing the data types of columns), ODFs with strong schema evolution support allow changing the schema over time without having to re-write existing data files. It also helps producers and consumers to interoperate despite evolving schemas, which reduces friction even in long‑running data pipelines. In summary, ODFs with capable schema evolution support ensure that changes in data structures do not disrupt data processing workflows.

Prominent ODFs

In the following, we introduce the most prominent ODFs and highlight their limitations and conceptual parallels.

Apache Parquet

Apache Parquet is an ODF designed to deliver efficient data storage and high‑performance retrieval across large and complex datasets. Parquet’s formal specification defines how data is structured and formatted, while multiple implementations (in Java, C++, Go, Rust, and others) ensure broad ecosystem compatibility. Supported across numerous analytics frameworks, Parquet provides a robust, interoperable foundation for modern data processing pipelines

Parquet has a column oriented layout organized as row groups (horizontal partitions of consecutive rows), each containing one column chunk per column, which is further divided into pages—the indivisible units for encoding and compression; the pages of a column chunk are guaranteed to be contiguous in the row group. This hierarchy enables vertical pruning (read only needed columns). Furthermore, it enables horizontal pruning via predicate-pushdown at two levels. First, whole column chunks can be skipped over during scans with the help of statistics stored in the file’s footer for each column chunk in each row group. Second, pages can be skipped by examining either the statistics stored in each page’s header or the statistics stored per-page inside the column index, which is stored separately from row group data enabling access to page statistics without having to load the page itself. Parquet’s metadata driven design (with statistics at multiple levels) powers strong predicate pushdown and efficient distributed processing.

Another strong characteristic of Parquet is its supported compression and encoding mechanisms that are built around its columnar structure. Because each column within a row group is stored as a contiguous column chunk and further subdivided into pages, Parquet can apply compression at the page level and choose encodings that fit the statistical properties of each column. This design, supported by compression formats such as Snappy, Gzip, LZO, Brotli, and Zstd, allows Parquet to optimize both space and scan efficiency by tailoring compression and encoding per column chunk and applying them per data page.

Parquet's native support for complex and nested data types stands as one of its defining strengths. Built on Dremel's record splitting and assembly algorithm, it achieves logical nesting of types — including lists, structs, and maps — while physically storing records in a columnar format, preserving both efficient access and compact storage. Complementing this, Parquet deliberately limits its set of primitive types, relying instead on logical type definitions to express richer structures. This design keeps reader and writer implementations straightforward without sacrificing the breadth of complex types we find in real-world analytical workloads.

We mentioned earlier that providing a strong support for schema evolution is an important characteristic to facilitate typical long-running data pipelines that might go through schema changes. However, Parquet itself does not enforce schema evolution rules at all — that responsibility falls on the query engine or table format layered on top (e.g., Apache Hive, Apache Spark, or Apache Iceberg). These systems provide varying degrees of schema evolution support on top of Parquet files. Typically, adding a new nullable column or dropping an existing column can be handled gracefully by Parquet readers.

Apache ORC

Apache ORC (Optimized Row Columnar) is a high performance, type-rich columnar storage format introduced in Hive 0.11 to overcome the limitations of earlier formats like RCFile, particularly around compression efficiency, type semantics, and query pruning. It achieves smaller file sizes through type specific encodings and supports fast analytical workloads through built-in indexing and selective reading.‑‑in indexing and selective reading.

Like Parquet, ORC has a column oriented layout. The body of the file is organized as stripes (horizontal partitions of consecutive rows), each consisting of one or more adjacent streams for each column. Streams are further divided into row groups of 10K rows by default (not to be confused with Parquet’s row groups, which are similar to ORC’s stripes). This file structure enables vertical pruning (read needed columns only). Furthermore, a rich hierarchy of metadata enables horizontal pruning via predicate-pushdown at three levels. First, per-column statistics stored at the file’s footer enable skipping the whole file during a scan if its columns do not fulfil the applied predicate. Second, the File Metadata section, also located at the file’s footer, stores column statistics at stripe-level granularity, enabling predicate pushdown to eliminate irrelevant input splits on a per-stripe basis. Finally, row groups irrelevant for a scan can be identified and skipped by examining their statistics, which are stored in dedicated streams, known as row indexes, at the beginning of each stripe in the file’s body. Similar to Parquet, the hierarchical and metadata driven design facilitates strong predicate pushdown and streamlined distributed processing.

ORC supports efficient encoding by allowing to choose a different encoding mechanism, e.g., dictionary based or Run Length Encoding (RLE), for each column in each stripe. The choice is based on column-wide characteristics, e.g., data type, and local column characteristics within the domain of the stripe, e.g., value cardinality. Furthermore, ORC supports a variety of compression algorithms, such as zlib, Snappy, and LZO, and although the compression algorithm must be the same for the whole file, it is applied independently to each compression chunk allowing it to be decompressed without accessing the previous bytes. The compression chunk size is configurable and defaults to 256KB. Larger chunks lead to better compression but require more memory.

Similar to Parquet, ORC files are self-describing in terms of the types of the contained objects since the schema information is stored directly in the file’s footer. Additionally, ORC supports a rich set of scalar (primitive) and compound (complex) types derived from Hive column types, providing maximal compatibility with the Hive ecosystem. Compound types can be nested and when the file is serialized, the resulting type tree is flattened using a specific algorithm that allows readers to reconstruct it during scans.

Like Parquet, the ORC specification does not define schema evolution semantics, leaving this responsibility to the query engine or table format built on top. Consequently, schema evolution support varies across different engines and implementations.

Apache Avro

Apache Avro is a compact, row oriented serialization format and RPC framework designed for efficient data exchange and long term interoperability.

In streaming or messaging scenarios Avro encodes records—referred to as objects—in a binary or JSON format. These objects are transmitted without carrying schema information, requiring the receiver to have out‑of‑band access to it. This makes Avro more suitable for high‑throughput pipelines and distributed systems where schemas are centrally managed or shared across producers and consumers.

For persistent storage, Avro provides Object Container Files, which embed the full schema in the file header along with user-defined metadata. The body of the file consists of data blocks, each containing a set of objects, which are optionally compressed using codecs such as Deflate, Snappy, or Bzip2. This design allows readers to interpret the file without depending on any external schema source. However, Avro provides limited support for distributed processing and because it stores complete objects contiguously and does not maintain per column statistics, it provides no predicate pushdown capabilities.

A key strength of Avro is its schema centric architecture and robust schema evolution rules. Each Object Container File includes its own schema, and the specification defines how reader and writer schemas are reconciled. Hence, long running pipelines can evolve independently with minimal friction. Furthermore, Avro supports a wide range of primitive and complex types—records, arrays, maps, unions, and more—allowing nested structures to be modelled naturally.

In summary, Avro emphasizes interoperability, efficient row‑level serialization, and strong schema evolution, making it ideal for streaming, messaging, and RPC workloads, while complementing—rather than replacing—the analytical strengths of columnar formats like Parquet and ORC.

Comparison of Parquet, ORC, and Avro

Based on the evaluation criteria introduced earlier, the following comparison highlights how Parquet, ORC, and Avro differ in design goals and technical capabilities.

Criterion

Parquet

ORC

Avro

Layout Orientation

Column-oriented

Column-oriented

Row-oriented

Distributed Processing

Strong horizontal and vertical splitting via a hierarchy structured as row groups > columns chunks > pages

Strong horizontal and vertical splitting via a hierarchy structured as stripes > streams > row groups

Limited to block-level horizontal splitting

Predicate Pushdown

Multi-level statistics at row-group and page levels enable strong predicate pushdown capabilities

Multi-level statistics at file, stripe, and row-group levels enable strong predicate pushdown capabilities

No support

Encoding

Supports multiple encodings. Chosen per column based on its characteristics

Supports multiple encodings. Chosen per column based on its characteristics.

Only supports JSON and binary encodings.

Compression

Supports multiple codecs. Chosen per column chunk and applied per page.

Supports multiple codecs. Chosen per column and applied per compression chunk (whose size is configurable and independent from other file structures).

Supports multiple codecs. Block-level compression without column optimizations

Complex Types Support

Native and efficient support for nested types

Rich primitive and compound types aligned with Hive

Broad support for complex and nested record types

Schema Evolution

Handled by engines or table formats layered on top

Handled by engines or table formats layered on top

Built-in, well-defined reader–writer schema resolution

Parquet and ORC are purpose-built for large-scale analytical workloads, emphasizing efficient columnar storage, rich metadata, and aggressive query pruning. Avro, by contrast, focuses on row-level serialization, interoperability, and robust schema evolution, making it a natural fit for streaming, messaging, and RPC use cases. The choice between Parquet and ORC mainly revolves around the ecosystem you have at hand. For Hive-based ecosystems, ORC is generally a better choice since it is designed and tailored for this purpose. In contrast, when interoperability among engines and ecosystems is a requirement, Parquet is generally a better option due to its strong support across exiting systems. To summarize, together, the three formats address complementary needs within modern data architectures rather than serving as direct substitutes.

Defining ODF-based Tables in Db2

With the Datalake table feature, multiple flavours of Db2, such as Db2 on IBM Software Hub, support defining and using ODF-based tables whose data resides in ODF files outside of Db2 itself, such as in an object storage. Db2’s role is to manage table definition and metadata, and to provide non-exclusive access to it in a way that avoids vendor lock-in and maintains data autonomy by allowing other engines and frameworks to access the data files too. A detailed description of this feature is provided in our previous blog.
In this blog, we focus on how we can select the ODFs we use in our Db2 Datalake tables and how we tweak them to suit our needs.

Choosing a Specific ODF for the Datalake Table

Db2 supports two types of Datalake tables, which are Hive tables and Iceberg tables. In both cases, we can select the ODF we want to be used as the underlying data file format directly in the CREATE DATALAKE TABLE statement. As an example, let’s assume that we want to create a simple table called mytable with two columns and that we have already created a storage alias called datalakealias for the object storage where we want the table data to be stored.

If we want to create mytable as a Hive table that uses Parquet as an underlying data type, we do that as follows:

CREATE DATALAKE TABLE mytable (id INT, name VARCHAR(100)) STORED AS PARQUET location 'DB2REMOTE://datalakealias//mypath/mytable';

Instead, if we want to use ORC as our underlying ODF, we replace the keyword PARQUET in the previous statement with the keyword. For Iceberg tables, the CREATE DATALAKE TABLE statement differs slightly. If we want to create mytable as an Iceberg table that uses Parquet as the underlying ODF, the statement would be:

CREATE DATALAKE TABLE mytable (id INT, name VARCHAR(100)) STORED BY ICEBERG STORED AS PARQUET location 'DB2REMOTE://datalakealias//mypath/mytable';

For ORC, we simply replace the keyword PARQUET in the previous statement with the keyword ORC.

For a list of all file formats supported by Db2 Datalake tables, please refer to Supported file formats for Datalake tables - IBM Documentation

Fine-Tuning the ODF File Layout

Some of the benefits we gain from using an ODF as an underlying file format for a Datalake table, such as efficient compression, distributed processing, and predicate pushdown capabilities, depend on the correct configuration of the storage layout of the data files that underly the table. For example, Parquet’s compression benefits can diminish when a suboptimal page sizes or encodings are chosen, as overly small pages increase overhead and inconsistent dictionary use across pages can lead to uneven compression and unpredictable performance. Furthermore, ORC stripes that are too large can diminish the usefulness of the associated stripe-level statistics when handling pushed-down predicates and thus degrade the performance of table scans. Therefore, the storage layout of the ODF files that underly Datalake tables must be carefully configured to handle the workload at hand. We can configure different ODF aspects of a Db2 Datalake table in two ways:

  1. When creating the Datalake table. These configurations are applied to all the data files that are created as the table evolves over time.
  2. Using session properties that are applied to the data files that are created during the current session the client application has with Db2.

For example, let us assume that we want to set the Parquet row group size to 33554432 bytes (32 MiB) in a Db2 Datalake table. Our first option is to apply this setting during the CREATE DATALAKE TABLE statement using the TBLPROPERTIES clause:

  • Hive table:

CREATE DATALAKE TABLE mytable (id INT, name VARCHAR(100)) STORED AS PARQUET location 'DB2REMOTE://datalakealias//mypath/mytable' TBLPROPERTIES('parquet.block.size'='33554432');

  • Iceberg table:

CREATE DATALAKE TABLE mytable (id INT, name VARCHAR(100)) STORED BY ICEBERG STORED AS PARQUET location 'DB2REMOTE://datalakealias//mypath/mytable' TBLPROPERTIES('write.parquet.row-group-size-bytes'='33554432');

Note that the property that is configured is different for Hive and Iceberg. Down below, you can find references to all properties that can be configured in Db2 for both cases.

The second option we have is to configure the ODF’s storage layout using session properties:

  • Hive table: 

SET HADOOP PROPERTY 'parquet.block.size'='33554432'

  • Iceberg table: 

SET HADOOP PROPERTY 'write.parquet.row-group-size-bytes'='33554432'

Further Reading

About the Author

Ghareeb Falazi is a Software Engineer at the IBM R&D lab in Ehningen, Germany and leads the Db2 Datalake table performance squad responsible for measuring and improving the performance of Hive and Iceberg tables defined using Db2. His work involves introducing embedded performance metrics that can help fine-tune Db2 Datalake tables and measuring the performance of new Db2 Datalake Table features. Ghareeb holds a doctoral degree in Computer Science from the University of Stuttgart, Germany. Ghareeb can be reached at ghareeb.falazi@ibm.com .

0 comments
16 views

Permalink