Are you tired of writing custom code for every database you need to connect to, just to get the data you need for your machine learning (ML) models?
Here is a handy solution: Trino, now fully integrated into our Kubeflow on Power distribution for machine learning operations (MLOps)!
Trino is a distributed query engine that allows you to query all kinds of databases, relational or non-relational, using SQL. That alone greatly harmonizes data access. But it gets even better: within the same SQL query, you can join data from multiple database systems on the fly. Yes, you read that right. A single SQL query can span all of your databases.
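To make this more concrete, here is a minimal sketch of what such a cross-database query could look like from Python, using the `trino` client package. The catalog names (`postgresql`, `mongodb`), the schemas, tables, and columns, as well as the host, port, and user, are all hypothetical placeholders and depend on how your Trino catalogs are configured.

```python
# Hypothetical catalogs, schemas, tables, and columns; adjust them to your setup.
# Trino addresses every table as <catalog>.<schema>.<table>, so a single
# statement can reference tables that live in different database systems.
FEDERATED_QUERY = """
SELECT c.customer_id, c.name, o.total
FROM postgresql.public.customers AS c
JOIN mongodb.shop.orders AS o
  ON c.customer_id = o.customer_id
"""


def run_query(sql, host="trino.example.local", port=8080, user="ml-user"):
    """Run a SQL statement against Trino and return all result rows.

    Host, port, and user are placeholder defaults (assumptions).
    """
    # Imported lazily so this module loads even without the client installed
    # (install it with: pip install trino).
    from trino.dbapi import connect

    conn = connect(host=host, port=port, user=user)
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()
```

Calling `run_query(FEDERATED_QUERY)` would then return the joined rows, with Trino fetching the data from both systems behind the scenes.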
Overall, Trino can save you lots of time and storage space, and it spares you from maintaining duplicate code. For example, you’ll now only need one Kubeflow component inside your ML pipeline to query all your data sources via Trino.
The following example shows these features in action.
To follow along, you need:
- Access to an OpenShift on IBM Power cluster via CLI
- A PostgreSQL database and a MongoDB instance (ideally also running in your OpenShift cluster)
The basic setup of the example we develop below looks as follows: