Why compress?
Compression is ubiquitous in modern computing; it has become "part of the furniture". The chances are that whilst you are reading this blog post, the page also contains multiple images that use the compressed JPEG or GIF file formats. Compression can be either lossy or lossless, both with the same goal of reducing the size of data.
Multimedia data (images, audio and video) are often compressed using lossy algorithms, exchanging some integrity/clarity for a greater compression ratio. As it happens, the two image formats mentioned use different approaches: JPEG uses lossy compression whilst GIF uses lossless. Lossy algorithms cannot decode the encoded form back to the original input bytes, whilst lossless algorithms always can.
In IBM MQ, the compression algorithms used on message data are always lossless. You wouldn't ever want to decompress a message and have it give you an account number, an order quantity or a holiday destination that has lost any detail from the original message. Integrity of messages is always critically important.
IBM MQ 9.4 adds the ability to use the LZ4 compression algorithm for channel traffic, client connections and also Native HA replication links. It has always been possible to compress and decompress MQ data as it is transferred over a channel using custom-written exits. Compression has also been available as a built-in capability of IBM MQ for almost 20 years, since the COMPHDR and COMPMSG attributes were first introduced to channels in IBM WebSphere MQ 6.0 (2005), avoiding the need for such exits.
The aim of compressing IBM MQ data over a communications link is a very simple one: try to encode the data into a smaller number of bytes before sending it over the network, then decode it back to the original set of bytes. But why? What are the benefits of doing this? Well, let's look at two - efficiency and economy...
Efficiency
Using compression to transmit data can be more efficient if that data is readily compressible, if you have spare CPU cycles to perform the encoding and decoding, and if network bandwidth or latency is a limiting factor. If not all of those conditions hold, then it is likely to be more efficient to continue transferring the data uncompressed.
Data varies greatly in terms of how readily compressible it is. A message in a format that contains a large amount of repeating byte strings and/or whitespace is an example of data that would be readily compressible, whilst a message formed of entirely random bytes would typically not compress well, if at all. Small blocks of data also don't tend to compress well and may even cause the encoded representation to be larger than the uncompressed data. In circumstances where the encoded representation is larger, IBM MQ sends the original uncompressed data instead, although by then it has already paid the cost of encoding.
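This behaviour is easy to see with a small sketch using Python's standard zlib module - DEFLATE rather than MQ's own channel compression, and the sample data is purely illustrative, but the compressibility characteristics are the same:

```python
import os
import zlib

# Repetitive, structured data (think XML/JSON with repeated tags and padding)
# compresses dramatically...
repetitive = b"<order><item>WIDGET</item><qty>1</qty></order>" * 100
print(len(repetitive), "->", len(zlib.compress(repetitive)))

# ...whilst the same volume of random bytes does not compress at all;
# the encoded form ends up slightly larger than the input.
random_data = os.urandom(4600)
print(len(random_data), "->", len(zlib.compress(random_data)))
```

Running this shows the repetitive block shrinking to a small fraction of its size, whilst the random block grows slightly, mirroring the case where MQ falls back to sending the original uncompressed bytes.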
Compression algorithms vary greatly in terms of compression ratio, CPU cost, and encoding and decoding times. Many compression algorithms offer variants that attempt to increase the compression ratio in exchange for slower encoding and/or decoding times; in MQ these variants are identified with 'FAST' and 'HIGH' suffixes. As an example, ZLIBFAST weights the ZLib algorithm towards speed, whilst ZLIBHIGH exchanges speed for a higher compression ratio.
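The speed-versus-ratio trade-off can be sketched with zlib's compression levels, which are analogous to (though not identical with) MQ's 'FAST' and 'HIGH' variants:

```python
import time
import zlib

# Illustrative message payload: repetitive, JSON-like text.
data = b'{"account": "12345678", "quantity": 10, "destination": "MALTA"}' * 2000

for level, label in [(1, "level 1 (speed-weighted, like ZLIBFAST)"),
                     (9, "level 9 (ratio-weighted, like ZLIBHIGH)")]:
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = (time.perf_counter() - start) * 1000
    print(f"{label}: {len(out)} bytes in {elapsed:.2f} ms")
```

On typical data the higher level spends more CPU time to produce fewer bytes; whether that exchange is worthwhile depends on the data and on where your bottleneck is.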
If you have a mix of different messaging applications where message sizes and formats would benefit from different approaches to compression (e.g. JSON, XML, a file transfer of a zip file), then separating the traffic into different transmission queues and channels, each with its own COMPMSG setting, is one approach to matching the compression to the application data.
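As a sketch of that approach, assuming two hypothetical sender channels (the channel, queue and connection names below are purely illustrative), MQSC could assign ZLib compression to verbose JSON traffic whilst leaving zip-file transfers uncompressed, since a zip archive is already compressed and will not shrink further:

```
DEFINE CHANNEL('APP.JSON.SDR') CHLTYPE(SDR) TRPTYPE(TCP) +
       CONNAME('remotehost(1414)') XMITQ('APP.JSON.XMITQ') +
       COMPMSG(ZLIBHIGH)

DEFINE CHANNEL('FILE.ZIP.SDR') CHLTYPE(SDR) TRPTYPE(TCP) +
       CONNAME('remotehost(1414)') XMITQ('FILE.ZIP.XMITQ') +
       COMPMSG(NONE)
```

The applications then route their messages via the appropriate transmission queue, and each channel applies the compression suited to its traffic.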
Native HA log replication links need to transfer the content of the log extents, which is an eclectic mix of application data, headers and object media images. The round-trip time in sending log data to a quorum of replicas directly affects how quickly new messaging workload is acknowledged; for that reason, if compression is enabled it is important that a 'FAST' algorithm is used.
Economy
There is inevitably a financial cost associated with using computing resources, whether that be storage, processing or communications. Some of those costs may be billed regardless of how fully you utilize the resource (e.g. CPU), and some may be billed per unit of usage (e.g. disk or network). Compressing data can use otherwise idle CPU cycles to encode and decode data, reducing the number of bytes sent over a network link.
Whilst compressing and decompressing data may be less efficient in terms of elapsed time, resulting in lower throughput, you may still wish to compress because it is more economical to do so.
Algorithms in MQ
When IBM MQ added compression support to channels, the algorithms originally available for message compression were RLE, ZLIBFAST and ZLIBHIGH.
RLE (run-length encoding) is one of the simplest lossless algorithms, replacing a sequence of repeating bytes with that byte followed by a count of how many times it was repeated. This type of algorithm is clearly very efficient where messages contain a lot of adjacent repeating characters (e.g. "AAAA" or blank-padded fields), but drastically less so when handling repeating multi-byte patterns.
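A naive run-length encoder can be sketched in a few lines of Python; this is purely illustrative of the technique and not MQ's actual RLE wire format:

```python
def rle_encode(data: bytes) -> bytes:
    # Emit each run as a (byte, count) pair, capping runs at 255
    # so the count fits in a single byte.
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([data[i], run])
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    # Expand each (byte, count) pair back to the original run.
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes([data[i]]) * data[i + 1]
    return bytes(out)

print(rle_encode(b"AAAABBBCCD"))  # adjacent runs: 10 bytes shrink to 8
print(rle_encode(b"ABCDABCD"))    # no adjacent repeats: output doubles in size
```

The second call shows the weakness: with no run longer than one byte, every input byte costs two output bytes.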
ZLIBFAST and ZLIBHIGH implement the ZLib algorithm, which is a combination of the LZ77 compression algorithm and a Huffman encoding scheme. The LZ77 algorithm is an example of sliding-window compression, where the encoder maintains a fixed amount of context to look for matching patterns. Back references to repeating bytes within the window are used to encode the output. This dictionary matching approach is better at handling data that contains repeating sequences that span multiple bytes (e.g. "ABCDABCD").
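The difference from RLE is easy to demonstrate, again using Python's zlib as a stand-in for the DEFLATE family: "ABCDABCD..." data contains no adjacent single-byte repeats for RLE to exploit, yet the sliding-window matcher reduces it to a handful of back references:

```python
import zlib

# A repeating multi-byte pattern: hopeless for RLE (no adjacent repeats),
# ideal for LZ77-style back references.
data = b"ABCD" * 1000
compressed = zlib.compress(data)
print(len(data), "->", len(compressed))

# Lossless: decoding restores the original bytes exactly.
assert zlib.decompress(compressed) == data
```

After the first "ABCD", every subsequent repetition is encoded as a short reference back into the window, so 4,000 bytes collapse to a few dozen.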
The LZ4 algorithm introduced in IBM MQ 9.4 uses a similar dictionary matching approach but it does not combine this with an entropy encoding phase. LZ4 encoding and decoding are faster than ZLib and less demanding in terms of CPU, however there is a small trade-off in terms of a lower compression ratio. The decoder for LZ4 is extremely fast when compared to ZLib and hence this is a good fit for use by IBM MQ channels and Native HA log replication where transmission time is critically important.
Conclusion
In determining whether compression would be beneficial, remember that the type of data being compressed can be just as significant as the network bandwidth and latency, if not more so. In IBM MQ 9.4 the LZ4 algorithm offers faster compression; however, if your driving force for compressing data is solely to reduce the number of bytes sent over a network, then ZLib continues to offer superior compression ratios.