Streaming on watsonx.data integration - Custom Stages In Action

By Deepak Ranganathan posted 2 days ago

  

Once the custom stages are deployed to the StreamSets Data Collector engine, as outlined in the previous post, Streaming on watsonx.data integration - Custom Stages, you should be able to use them within flows as shown below.

Steps to create a new flow

To create a new flow, follow these steps:

  1. Select the project you have deployed the StreamSets engine to.
  2. Navigate to the Assets tab and create a new asset by selecting New Asset.
  3. Search for StreamSets on the "What do you want to do?" page.
  4. Select “Create a real-time streaming data flow” to start building the flow.
  5. Enter a name for the flow on the "Create StreamSets flow" page, select the environment you deployed the custom stages to, and select Create at the bottom right corner of the page.
  6. The landing page that opens is the canvas on which you build the flow.
  7. Select sources, processors, targets and executors from the left pane to build the flow.

Below are some sample flows with custom stages

Content extractor processor and copy file executor use case

This use case gives an overview of custom stages that extract text and metadata from PDF documents and back up the documents to a new location. PDF document processing is implemented using Apache Tika, which provides a unified interface for processing various file formats, including PDFs, making it useful for tasks like search engine indexing, content analysis, and translation.

The PDF files are read from a local directory in the whole file data format.

The Content Extractor Processor extracts the content and the metadata. In the next stage, the Field Remover stage retains only the file content. Finally, the content of the PDF is written to the local file system.
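As a rough illustration of what the Field Remover step does to each record, here is a minimal sketch in plain Java. It does not use the StreamSets API: the record is modeled as a simple map, and the field names content and metadata, as well as the class and method names, are assumptions made for this example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class FieldRetainSketch {

    /**
     * Returns a copy of the record that keeps only the named fields.
     * The input record is left unchanged.
     */
    static Map<String, Object> keepOnly(Map<String, Object> record, Set<String> fieldsToKeep) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : record.entrySet()) {
            if (fieldsToKeep.contains(entry.getKey())) {
                out.put(entry.getKey(), entry.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical record shape after content extraction:
        // one field for the text, one for the document metadata.
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("content", "Text extracted from the PDF");
        record.put("metadata", Map.of("Content-Type", "application/pdf"));

        // Keep only the text, as the flow does before writing to the local file system.
        Map<String, Object> trimmed = keepOnly(record, Set.of("content"));
        System.out.println(trimmed.keySet());
    }
}
```

In the actual flow this step needs no code; the built-in Field Remover stage is configured on the canvas to keep the content field.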

Below are some screenshots of the flow preview and output

Content Extractor Processor preview

PDF content sent as input to the Local FS stage to create the files

Details on the file that was created by the flow

For this exercise, the Copy File executor was implemented to copy the file from the origin to the /tmp folder.
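The core of such a copy step can be sketched with the standard java.nio.file API. This is a minimal sketch and not the executor's actual implementation: the class name CopyFileSketch and the method copyInto are hypothetical, and the StreamSets executor wiring (configuration, event handling) is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class CopyFileSketch {

    /**
     * Copies the source file into targetDir, keeping the file name.
     * An existing file with the same name is overwritten.
     */
    static Path copyInto(Path source, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(source.getFileName());
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a file picked up by the origin.
        Path source = Files.createTempFile("sample", ".pdf");
        Files.write(source, "placeholder".getBytes(StandardCharsets.UTF_8));

        // Back up into a subfolder of the system temp directory (/tmp on most Linux hosts).
        Path backupDir = Paths.get(System.getProperty("java.io.tmpdir"), "backup-sketch");
        Path copied = copyInto(source, backupDir);
        System.out.println("Copied to " + copied);
    }
}
```

REPLACE_EXISTING is a design choice for this sketch so that re-running the flow does not fail on a file that was already backed up; a real executor might instead skip or rename duplicates.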

Custom Expression Language example

The product ships with an extensive list of expression language functions, but in some situations you might need to implement your own to meet your needs.
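A custom expression language function is, at its core, a public static Java method. The sketch below shows only that method body, using a hypothetical masking function (maskKeepLast) chosen for illustration. How the method is registered with the engine is not shown; in the StreamSets API this is assumed to be done with annotations such as @ElFunction and @ElParam, so check the custom stages post for the exact wiring.

```java
public class MaskElSketch {

    /**
     * Masks every character of the value except the last `visible` ones.
     * Hypothetical example function; in a real stage library this method
     * would additionally carry the engine's EL annotations (assumed, not shown).
     */
    public static String maskKeepLast(String value, int visible) {
        if (value == null) {
            return null;
        }
        int keep = Math.max(0, visible);
        int maskedLength = Math.max(0, value.length() - keep);

        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < maskedLength; i++) {
            out.append('*');
        }
        // Append the unmasked tail (the whole value if it is shorter than `visible`).
        out.append(value.substring(maskedLength));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskKeepLast("4111111111111111", 4));
    }
}
```

Keeping the function free of side effects and tolerant of null input makes it safer to call from expressions that are evaluated once per record.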

This is the third blog post in the series "Streaming on watsonx.data integration - Custom Stages".

The first and second blog posts are Streaming on watsonx.data integration and Streaming on watsonx.data integration - Custom Stages.
