
Flexibility through Kafka and Data Replication

By Jonathan Rosen posted Fri March 29, 2024 04:32 PM

  

Traveling can be fun and exciting. Realizing that you booked the wrong return flight after you already left?  Maybe not as much fun.  If you are fortunate enough to have a flexible budget, then maybe it's not as bad a situation as it could have been.  However you decide to change your flight, whether by calling a ticketing agent, logging into your airline's app on your phone, or using a travel agency (yes, some still exist), a table in a database at that airline is going to be updated.

Among the fields that the database will look at, and possibly update, in the table that holds your ticket are your:

  • Name
  • Record locator
  • Existing flight
  • Departing airport
  • Flight number
  • Departure time
  • Arrival airport
  • Arrival time

When you make the change to your flight, the database will most likely UPDATE, INSERT, or DELETE information in the row where your information lives.  These changes are made with simple SQL commands that modify the database row (or column) your record lies on.  Capturing these changes and transmitting them to other databases is called change data capture, or CDC.
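
To make that concrete, here is a minimal sketch, in Python, of the kind of UPDATE the booking system might issue and the change event a CDC tool could derive from the database log.  The table name, column names, and values are all hypothetical.

```python
# Hypothetical table, columns, and values: the UPDATE an airline's booking
# system might issue when you change your flight.
update_sql = """
UPDATE tickets
   SET flight_number   = 'XY1482',
       departure_time  = '2024-04-02 08:15:00',
       arrival_airport = 'ORD',
       arrival_time    = '2024-04-02 10:05:00'
 WHERE record_locator = 'ABC123';
"""

# A CDC change event typically carries the operation type plus before/after
# row images, so downstream systems can see exactly what changed.
change_event = {
    "operation": "UPDATE",
    "table": "tickets",
    "before": {"record_locator": "ABC123", "flight_number": "XY0098"},
    "after":  {"record_locator": "ABC123", "flight_number": "XY1482"},
}
print(change_event["operation"], change_event["after"]["flight_number"])
```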

What else happens behind the scenes, and how might an airline handle these transactions at scale?  They are most likely using an Online Transaction Processing (OLTP) database (DB), which is designed to handle a high volume of transactions.  The OLTP database might be PostgreSQL, MariaDB, or even IBM's Db2 for z/OS, which excels at heavy transactional volumes.  Once that DB processes the information, the job is not done.  The airline probably wants the flexibility to do something with this data in near real time or keep it around to be used later.  The IT department probably wants something that is fault tolerant and can handle transactions at scale… I'm talking thousands, tens of thousands, or over 100,000 transactions per second.

Kafka, the streaming platform developed at LinkedIn in 2011 and then donated to the Apache open-source community, is one of the most popular destinations we see companies send their data to.  A Kafka stream has built-in fault tolerance, is scalable, and retains data so the data engineer can access it in real time or whenever it's needed in the future.  An application might pull this data from the Kafka stream and notify you of your flight change via an SMS text message.  An ETL tool such as DataStage could be used to enrich the data and combine it with another set of data to update your frequent flyer miles.  From my personal experience I've noticed that this happens during off-peak hours or several days later, since it is not as critical a function as updating your ticket and boarding pass and getting your bag rerouted to the new flight.  As you can see, Kafka gives a huge amount of flexibility since it can handle both critical and non-time-sensitive events.
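
As an illustration of that notification path, here is a minimal sketch of a consumer that reads flight-change events from a Kafka topic using the confluent-kafka Python client.  The topic name, consumer group, and SMS function are hypothetical placeholders, not anything a real airline necessarily uses.

```python
# Sketch of a consumer that reads flight-change events from a Kafka topic
# and hands them to a notification step.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "flight-change-notifier",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["flight.ticket.changes"])  # hypothetical topic name

def send_sms(passenger_name, text):
    # Placeholder for whatever SMS gateway the airline actually uses.
    print(f"SMS to {passenger_name}: {text}")

try:
    while True:
        msg = consumer.poll(1.0)           # wait up to one second for a record
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())    # assumes JSON-encoded change events
        send_sms(event["name"], f"Your flight is now {event['flight_number']}")
finally:
    consumer.close()
```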

When it works correctly, one of the great things about Kafka is that it records events onto a topic in the order in which they occur, in a fault-tolerant and durable way.  For anyone without an IT background reading this, fault tolerance helps prevent a scenario that you have probably experienced: working on a project and your computer crashes before you saved.  In a way, fault tolerance is kind of similar to an auto-save feature that prevents data loss.  However, Kafka is fault tolerant, not fault proof.  When an issue occurs, say a consumer application that is subscribed to a topic fails, it's possible that you could have data quality issues… duplicate data, or data that arrives out of chronological order.
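
One common consumer-side mitigation is to remember the newest version of each record that has already been applied and skip anything older or already seen.  Here is a sketch, assuming each change event carries a record key and a commit timestamp; the field names are hypothetical.

```python
# Guard against duplicate and out-of-order change events on the consumer side,
# assuming each event carries a record key and a commit timestamp.
last_seen = {}   # record key -> latest timestamp already applied

def should_apply(event):
    key = event["record_locator"]
    ts = event["commit_timestamp"]
    # Skip anything already applied, or older than the newest version seen.
    if key in last_seen and ts <= last_seen[key]:
        return False
    last_seen[key] = ts
    return True

events = [
    {"record_locator": "ABC123", "commit_timestamp": 100, "flight_number": "XY0098"},
    {"record_locator": "ABC123", "commit_timestamp": 100, "flight_number": "XY0098"},  # duplicate
    {"record_locator": "ABC123", "commit_timestamp": 101, "flight_number": "XY1482"},
]
applied = [e for e in events if should_apply(e)]
print(applied)   # only the first event and the newer update survive
```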

At IBM, our software engineers developed a way to combat this data quality issue by creating technology called the Transactionally Consistent Consumer (TCC).  When using IBM Data Replication to transmit to a Kafka stream this comes standard, and it can be further enhanced with something we call a KCOP (Kafka Custom Operation Processor).  Adding a simple time stamp to a Kafka record makes the data engineer's life easier in the event of an issue that would require investigation and intervention.  As of writing this there are a handful of prebuilt KCOPs which can do a variety of data transformations and enrichments.  All KCOPs can add a header to the data on Kafka so a data engineer can easily understand what they are looking at when viewing data in Kafka.  Other KCOPs can transform the data from Avro to CSV or JSON formats for downstream applications to consume.  There is also the ability to create your own KCOP to deliver exactly what you want for your use case.  This provides the ultimate flexibility for the user, which in turn increases business value for the client.
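
This is not the KCOP API itself, but as a rough sketch of why that extra metadata helps, here is how a data engineer might inspect the timestamp and headers on a Kafka record during an investigation, again using the confluent-kafka Python client with a hypothetical topic name.

```python
# Inspect the timestamp and headers on Kafka records during an investigation.
# Generic Kafka client code, not the KCOP API itself.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "investigation",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["flight.ticket.changes"])  # hypothetical topic name

msg = consumer.poll(10.0)
if msg is not None and not msg.error():
    ts_type, ts_ms = msg.timestamp()           # broker- or producer-assigned time
    print("record timestamp (ms since epoch):", ts_ms)
    for key, value in (msg.headers() or []):   # headers added upstream, if any
        print("header:", key, value)
consumer.close()
```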

If you want to learn more about how IBM's technology can assist with your replication needs, check out our website and contact an IBM solution engineer today.  In the meantime, I'm going to update my flight.
