
Accelerating metadata cataloging through distributed harvesters

By PAUL LLAMAS VIRGEN posted Wed June 18, 2025 01:43 PM


A common challenge for today's companies is the existence of large amounts of unstructured data. As data generation continues to surge, there is a need for data-intensive technologies that operate at exabyte scale.

It is a common scenario for companies to store large amounts of data across different locations or storage tiers. This distributed storage approach is necessary because unstructured data is generated by numerous entities in various locations. Co-located storage tiers must be available to accommodate and manage this data as it is created and used in daily business operations, as shown in the following figure.

The IBM Fusion Data Cataloging service includes a feature called Data Harvesters, designed for data ingestion at scale across distributed NFS data sources. It accelerates the transformation of unstructured data into structured formats, streamlining the data journey.

Data administrators, data scientists, and data engineers can use this capability to scan and ingest metadata at the edge: deploy a compute node close to the data source, install a data harvester, and immediately begin scanning and ingesting data.

This article explores some of these industry challenges and demonstrates how Data Harvesters can effectively address them.

1 Low-speed network communication from the cluster to the data source

Challenge:

Distributed entities that consume data from various sources may be located in different regions. The actual data storage must therefore be as close as possible to the consuming or producing entity; an entity that consumes and produces data is typically paired with the data source offering the highest network speed to ensure seamless data transfer.

When this scenario scales to the enterprise level, where multiple entities are spread across the globe and data sources are more centralized, the focus shifts to high data transmission speed to the data store. In large enterprises with data spread across different regions, however, network connectivity between regions is not always optimal.

Solution:

Given these constraints, it is often best to process the data where it is stored to maximize efficiency.

Harvesters help by scanning data close to the data source, running on co-located compute nodes and thereby mitigating latency-related issues.

Benefits:

- Improves efficiency and ingestion rates on low-speed network links

- Distributes the heavy lifting to external compute nodes instead of the central entity

2 Distributed data sources and network security policies

Challenge:

Beyond the network speed issues described above, the number of distributed data sources is another factor to consider. Deploying infrastructure at every location is not always feasible, and managing many distributed data sources poses a greater risk because it significantly affects the performance of scanning and metadata enrichment.

Network security policies further complicate the equation. Different security policies, firewall rules, country-to-country regulations, or similar restrictions are often in place: large volumes of data are distributed across regions, and each region may have rules governing the movement of data out of that region. This makes it difficult for regular procedures to perform scanning and data inspection on every single data source.

Solution:

Fusion Data Cataloging Harvesters circumvent these challenges by processing and gathering data where it resides, on a compute node that is co-located with the data source. Once the certification entity verifies and trusts the compute node and the code, the data harvester can operate with high performance.

Benefits:

This approach has the following benefits:

- Alleviates the need to move large amounts of data.

- Addresses the need to comply with regional data movement policies.

3 Higher-performance data ingestion for massive scans

Challenge:

In today's business landscape, the generation of data has increased exponentially, presenting both opportunities and challenges for organizations. To support companies on their journey from unstructured to structured data, it is essential to scale data discovery and ingestion processes effectively. This need becomes even more crucial when data is not co-located, requiring a robust solution for large-scale distributed data ingestion.

Solution:

Deploying Fusion Data Cataloging Harvesters enhances distributed parallel ingestion: organizations can process and gather data at multiple sources simultaneously, significantly reducing bottlenecks and latency.

This approach leverages distributed computing to process data efficiently while complying with regional regulations. It improves performance and offers a scalable solution for future data growth and complexity.

Driven by the need to solve data discovery and ingestion challenges, Fusion Data Cataloging Harvesters were created as a complementary solution with a vision for future industrial needs.

Benefits:

Deploying data harvesters significantly increases the ability to distribute the load onto compute nodes near the data sources, improving data ingestion performance. In short, data harvesters are components that perform the heavy lifting at the edge and communicate with the cluster for data ingestion.

This Fusion Data Cataloging capability runs outside of the cluster instance on a separate compute node, which can be a virtual machine. The compute node does not need to be part of the cluster, but it must be allocated sufficient resources to run data harvester tasks.

The following diagram describes a common use case where different data sources are distributed across different countries.

As shown in the diagram, the data harvester requires a compute node to run on, and it is recommended to allocate that node near the data source for high-performance data ingestion.

The diagram shows a hybrid scanning approach: the cluster, located near one of the data sources (in France, in this case), starts a scan through the regular connection manager, since FDC is co-located with that data source. For the remaining data sources, which are located far from the cluster, a harvester is installed to improve scanning performance.

An entity situated near two locations (Spain and Portugal) can harvest data from both. If two data sources are close to each other, a single harvester can efficiently scan them both.

In the end, the two scanning paths shown connect to the same cluster. The key difference lies in where the heavy lifting is done during scanning: the green path harvests data at the edge, while the blue path relies on the connection manager to handle the heavy lifting within the cluster.

How to leverage this capability 

Getting started

1. Setting up the environment.

Before installing the DCS Harvester, a few dependencies are required. The regular package manager can be used to install them:

yum install -y python3.11 python3.11-pip tar xz curl

2. Installing the DCS Harvester CLI.

A detailed installation guide is available in the Setting up external host documentation topic. In short, the setup relies on the OpenShift CLI with valid login credentials to retrieve certificates, define host names and credentials, and install the Python dependencies:

# Extract the OpenShift router TLS certificate used to trust the DCS endpoints
oc -n openshift-ingress get secret router-certs-default -o json | jq -r '.data."tls.crt"' | base64 -d > $(pwd)/router_tls.crt

# Point the harvester at the certificate for the import service and DCS
export IMPORT_SERVICE_CERT=$(pwd)/router_tls.crt
export DCS_CERT=$(pwd)/router_tls.crt

3. Generating a SQLite file with metadata.

Data harvesters are designed to import metadata from any SQLite file that follows the required schema. This schema contains two tables:

- One table named metaocean, which stores the actual metadata associated with the scanned records of a data source.

- A second table named connections, with a single record providing details of the data source of origin.

The following is an example of these two tables:

Example tables

METAOCEAN TABLE

| inode | owner | group | uid | gid | permissions | path | filename | mtime | atime | ctime | size | fkey |
|-------|-------|-------|-----|-----|-------------|------|----------|-------|-------|-------|------|------|
| 1 | root | wheel | 0 | 90 | rwxrwxrwx | /exports/demo/ | file1.txt | 2022-11-15 22:26:38.000000 | 2022-11-15 22:26:38.000000 | 2022-11-15 22:26:38.000000 | 100 | 1_my_nfs.ibm.com_nfs_0 |
| 2 | root | wheel | 0 | 90 | r--r--r-- | /exports/demo/ | file2.txt | 2022-11-15 22:26:38.000000 | 2022-11-15 22:26:38.000000 | 2022-11-15 22:26:38.000000 | 1450 | 2_my_nfs.ibm.com_nfs_0 |

CONNECTIONS TABLE

| name | platform | cluster | datasource | site | host | mount_point | le_enabled |
|------|----------|---------|------------|------|------|-------------|------------|
| nfs_server_0 | NFS | my_nfs.ibm.com | nfs_0 | ibm.com | my_nfs.ibm.com | /exports/demo | 0 |
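To make this step more concrete, the following Python sketch builds a SQLite file containing the two required tables and the example rows shown above. It is a minimal, non-authoritative illustration: the column types, the output path /tmp/example.sqlite, and the use of the standard sqlite3 module are assumptions for this example, so consult the DCS documentation for the exact schema expected by your harvester version.

import sqlite3

# Assumed output path for this example; substitute your own location.
SQLITE_FILE_PATH = "/tmp/example.sqlite"

conn = sqlite3.connect(SQLITE_FILE_PATH)
cur = conn.cursor()

# metaocean: one row per scanned record of the data source.
# Column types are assumptions; check the documented schema.
cur.execute("""
    CREATE TABLE IF NOT EXISTS metaocean (
        inode INTEGER, owner TEXT, "group" TEXT, uid INTEGER, gid INTEGER,
        permissions TEXT, path TEXT, filename TEXT,
        mtime TEXT, atime TEXT, ctime TEXT, size INTEGER, fkey TEXT
    )
""")

# connections: a single row describing the data source of origin.
cur.execute("""
    CREATE TABLE IF NOT EXISTS connections (
        name TEXT, platform TEXT, cluster TEXT, datasource TEXT,
        site TEXT, host TEXT, mount_point TEXT, le_enabled INTEGER
    )
""")

ts = "2022-11-15 22:26:38.000000"
cur.executemany(
    "INSERT INTO metaocean VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "root", "wheel", 0, 90, "rwxrwxrwx", "/exports/demo/", "file1.txt",
         ts, ts, ts, 100, "1_my_nfs.ibm.com_nfs_0"),
        (2, "root", "wheel", 0, 90, "r--r--r--", "/exports/demo/", "file2.txt",
         ts, ts, ts, 1450, "2_my_nfs.ibm.com_nfs_0"),
    ],
)
cur.execute(
    "INSERT INTO connections VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("nfs_server_0", "NFS", "my_nfs.ibm.com", "nfs_0", "ibm.com",
     "my_nfs.ibm.com", "/exports/demo", 0),
)

conn.commit()
conn.close()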

4. Importing metadata to DCS using the Harvester CLI.

Once the SQLite file is created, define its path in a variable. Then run the harvester to send all of those records to the main catalog.

Note: The import iterates over every row of the provided SQLite file and sends the metadata to the DCS instance running on the OpenShift cluster, so it can take several minutes to complete.

SQLITE_FILE_PATH=/tmp/example.sqlite
CONFIG_FILE=$(pwd)/harvester/config.ini
python3.11 harvester metaocean $SQLITE_FILE_PATH -c $CONFIG_FILE -p nfs

5. Using the REST API to retrieve imported metadata.

# Define token based on x-auth-token header

curl -k -u <user>:<password> https://<dcs_console_route_hostname>/auth/v1/token -i

# Query the recently indexed records

curl -k -H "Authorization: Bearer <token>" https://<dcs_console_route_hostname>/db2whrest/v1/sql_query\?select%20\*%20from%20metaocean%20limit%2010 -X GET -H "Accept: application/json"
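For scripting, the same two calls can be wrapped in a short Python sketch. This is a non-authoritative illustration assuming the requests package, the endpoints shown in the curl commands above, and that the token is returned in the x-auth-token response header as noted in the comment; replace the placeholder host and credentials with your own values.

import requests
from urllib.parse import quote

# Placeholder values; replace with your DCS route hostname and credentials.
DCS_HOST = "<dcs_console_route_hostname>"
USER, PASSWORD = "<user>", "<password>"

# 1. Request a token; it is expected in the x-auth-token response header.
resp = requests.get(
    f"https://{DCS_HOST}/auth/v1/token",
    auth=(USER, PASSWORD),
    verify=False,  # mirrors curl -k; point verify at router_tls.crt in production
)
token = resp.headers["x-auth-token"]

# 2. Query the recently indexed records, mirroring the curl example above.
query = quote("select * from metaocean limit 10", safe="*")
records = requests.get(
    f"https://{DCS_HOST}/db2whrest/v1/sql_query?{query}",
    headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    verify=False,
)
print(records.json())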

In conclusion, Fusion Data Cataloging Data Harvesters provide the capability to perform base metadata ingestion and metadata enrichment at the edge in scenarios where data sources are distributed across different locations.

This flexible method imports metadata in cases where the origin data source cannot be scanned directly because of technical constraints such as high latency or network security policies, or where scans must scale to thousands of parallel connections.

One advantage of using FDC Harvesters is the reduced load on the OpenShift cluster, since processing is distributed across hosts that are not necessarily attached to the main cluster.

The industry needs a capability that complements the current Fusion Data Cataloging for base metadata and metadata-enrichment ingestion. The Fusion Data Cataloging Harvester answers that need by delivering large-scale performance in highly distributed data source scenarios.

Acknowledgements: @SantiagoValle @BEN RANDALL 
