Apache Arrow Flight is a general-purpose client-server framework that simplifies high-performance transport of large datasets over network interfaces.
In this article, we will see how to extend the Arrow Flight OSS module in watsonx.data Presto (Java) to connect to Arrow Flight based data sources. We will also look at an example: the Presto IBM Arrow Flight connector added to watsonx.data.
Introduction
IBM watsonx.data is a hybrid, open data lakehouse to power AI and analytics with all your data, anywhere. It gives users a single point of entry for accessing data from multiple data sources. Presto (Java) is one of the query engines available in watsonx.data; it uses a plugin architecture to provide connectors for the different types of data sources it can access.
Support for an Arrow Flight based connector in watsonx.data lets it connect to additional data sources, provided they can be accessed through the Arrow Flight protocol.
Arrow Flight server implementations differ in aspects such as the format of the request accepted by the server or the way in which metadata is returned from the data source. Therefore, each Arrow Flight server implementation requires a different Arrow Flight connector in Presto (Java). However, much of the code can be shared across these connectors. To take advantage of this, we designed the presto-base-arrow-flight module as a template that can be extended to create different implementations of the Arrow Flight connector.
How the connector works
The above diagram shows how the Arrow Flight connector in watsonx.data works in a Cloud Pak for Data (CPD) environment.
NOTE: To use the Arrow Flight connector in CPD, the CPD Flight server must be installed in the CPD environment.
As the CPD Flight server can connect to different data sources, adding an Arrow Flight connector in watsonx.data for the CPD Flight server enables watsonx.data to connect to different data sources using the same connector.
The connector template is designed in such a way that any logic specific to the Arrow Flight server implementation is delegated to the extending module.
For example, calls to get metadata are delegated to the extending module, which is presto-ibm-arrow-flight in the case of watsonx.data. The connector template has an abstract class, BaseArrowFlightClientHandler, that declares the methods for getting metadata; the extending module extends this class and provides implementations of listSchemaNames and listTables.
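The delegation described above can be illustrated with a small, dependency-free sketch. The class names SketchClientHandler and InMemoryClientHandler, the method signatures, and the describeCatalog helper are all assumptions made for illustration; the real BaseArrowFlightClientHandler in Presto has its own signatures and talks to a Flight server rather than an in-memory map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for the template's abstract handler.
// It is NOT the real BaseArrowFlightClientHandler API.
abstract class SketchClientHandler {
    // Server-specific: how schema names are requested and decoded.
    abstract List<String> listSchemaNames();

    // Server-specific: how the tables of one schema are requested and decoded.
    abstract List<String> listTables(String schemaName);

    // Shared logic lives in the template and is reused by every extending module.
    final String describeCatalog() {
        StringBuilder out = new StringBuilder();
        for (String schema : listSchemaNames()) {
            out.append(schema)
               .append(": ")
               .append(String.join(", ", listTables(schema)))
               .append('\n');
        }
        return out.toString();
    }
}

// Hypothetical extending module: answers metadata calls from an in-memory map
// where a real connector would call its Flight server.
final class InMemoryClientHandler extends SketchClientHandler {
    private final Map<String, List<String>> catalog;

    InMemoryClientHandler(Map<String, List<String>> catalog) {
        this.catalog = catalog;
    }

    @Override
    List<String> listSchemaNames() {
        // Sorted so the output order is deterministic.
        return new ArrayList<>(new TreeSet<>(catalog.keySet()));
    }

    @Override
    List<String> listTables(String schemaName) {
        // Unknown schemas yield an empty list rather than an error.
        return catalog.getOrDefault(schemaName, List.of());
    }
}
```

Only the two abstract methods change from one server implementation to the next; the shared helper is written once in the template, which is the point of the design.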
The following list describes the sequence of events when a SQL query targeting an Arrow Flight catalog is run:
- The connector in the Presto co-ordinator constructs a FlightDescriptor and gets FlightInfo from the Flight server. The FlightInfo provides a list of endpoints to fetch data from the Flight server.
- The Presto co-ordinator then constructs instances of ConnectorSplit containing the serialized Flight endpoint.
- The connector splits are then sent to the different Presto workers.
- The Presto workers process the connector splits in parallel, which enables parallel data fetches from the Flight server.
- Each Presto worker gets the Flight ticket from the Flight endpoint set in its connector split and requests the FlightStream for that ticket from the Flight server.
- The ConnectorPageSource implementation in the connector template constructs Presto pages from the Flight stream.
- The co-ordinator fetches the results from the workers and sends the result to the client.
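The sequence above can be simulated in plain Java to show where the parallelism comes from. This is a minimal sketch under stated assumptions: SketchFlightServer, SketchSplit, and SplitFlowSketch are hypothetical stand-ins, not Presto or Arrow Flight classes, a string ticket stands in for a serialized Flight endpoint, and a list of integers stands in for a Flight stream and the resulting page.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a connector split carrying a serialized endpoint.
final class SketchSplit {
    final byte[] serializedEndpoint;

    SketchSplit(byte[] serializedEndpoint) {
        this.serializedEndpoint = serializedEndpoint;
    }
}

// Hypothetical stand-in for a Flight server that holds partitioned rows.
final class SketchFlightServer {
    private final List<List<Integer>> partitions;

    SketchFlightServer(List<List<Integer>> partitions) {
        this.partitions = partitions;
    }

    // Stands in for fetching FlightInfo: one endpoint (ticket) per partition.
    List<String> endpointsFor(String query) {
        List<String> tickets = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++) {
            tickets.add("ticket-" + i);
        }
        return tickets;
    }

    // Stands in for fetching the stream that belongs to one ticket.
    List<Integer> stream(String ticket) {
        int index = Integer.parseInt(ticket.substring("ticket-".length()));
        return partitions.get(index);
    }
}

final class SplitFlowSketch {
    // Coordinator side: turn every endpoint into a split.
    static List<SketchSplit> planSplits(SketchFlightServer server, String query) {
        List<SketchSplit> splits = new ArrayList<>();
        for (String ticket : server.endpointsFor(query)) {
            splits.add(new SketchSplit(ticket.getBytes(StandardCharsets.UTF_8)));
        }
        return splits;
    }

    // Worker side: recover the ticket from the split and read its stream into a "page".
    static List<Integer> processSplit(SketchFlightServer server, SketchSplit split) {
        String ticket = new String(split.serializedEndpoint, StandardCharsets.UTF_8);
        return new ArrayList<>(server.stream(ticket));
    }

    // Splits run concurrently on a worker pool; results are gathered in split order.
    static List<Integer> runQuery(SketchFlightServer server, String query, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<Integer>>> pending = new ArrayList<>();
            for (SketchSplit split : planSplits(server, query)) {
                pending.add(pool.submit(() -> processSplit(server, split)));
            }
            List<Integer> result = new ArrayList<>();
            for (Future<List<Integer>> page : pending) {
                result.addAll(page.get());
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("split processing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each split carries everything a worker needs to contact the server, the coordinator never touches the data itself until it collects the finished pages.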
Conclusion
As demonstrated in the above flow, to add an Arrow Flight connector in watsonx.data, we created two modules: presto-base-arrow-flight, the connector template, and presto-ibm-arrow-flight, the module that extends it.
The presto-base-arrow-flight module is included in the Presto OSS version, so anyone interested in creating an Arrow Flight connector for a different Flight server implementation can extend this base module and create their own connector.
To give an idea of how extending this connector template can simplify connector development, we can compare the number of classes needed to create a Presto connector. A connector like presto-bigquery, which directly implements Plugin, has 27 classes. In contrast, an extending module of presto-base-arrow-flight needs just 4 or 5 classes, with only a few methods to implement in each.