Overview
StreamSets provides a streaming data platform for building and managing data movement. StreamSets Data Collector offers a wide array of pre-built stages for various data sources and targets, and it ships with powerful stages that support data processing and event-based task handling. However, there are instances where specific business logic or unique integration requirements demand functionality beyond the standard offerings. This is why StreamSets supports custom stages, which allow you to extend the platform's core capabilities, integrate with niche systems, and implement your own specialized logic right inside your data flow. By building custom components, you are no longer limited to the out-of-the-box connectors and processors. Custom stages have become an invaluable asset for many of our customers, allowing them to extend StreamSets Data Collector's capabilities and tailor it to their precise needs.
Why Custom Stages?
Custom stages empower users to:
- Implement specialized data processing logic: Develop custom transformations, aggregations, or data quality checks that are unique to a particular use case.
- Integrate with proprietary systems: Connect to custom APIs, databases, or legacy systems not supported by standard stages.
- Optimize performance for specific scenarios: Craft highly optimized stages for critical data flows, potentially using specialized libraries or algorithms.
Types of custom stages
There are five main types of custom stages you can develop, depending on where they fit in your pipeline:
- Custom Origin: A custom origin is designed to consume data from a specific system that is currently not supported by StreamSets Data Collector and produce a stream of records for the rest of the pipeline to process.
- Custom Processor: A custom processor can be designed to transform or enrich data in a way that is not supported by existing processors.
- Custom Destination: A custom destination is designed to write data to a target that is currently not supported by StreamSets Data Collector.
- Custom Executor: A custom executor is designed to trigger a task when it receives an event that is currently not supported by StreamSets Data Collector.
- Custom Expression Language (EL): A custom EL enables you to create expressions that evaluate or modify data in ways that are not currently supported by StreamSets Data Collector.
Creating a Custom Stage: A High-Level Overview
Developing a custom stage in StreamSets Data Collector typically involves:
- Project Setup: Using the StreamSets Maven archetype to generate a project for a custom origin, processor, or destination. The archetype creates a template project in a directory with the necessary structure and dependencies.
- Implementing Logic: Writing Java code within the generated project to define the stage's behavior. This includes initializing resources, handling data records, configuring stage properties, implementing the core processing logic, and releasing resources on destroy to ensure there are no memory leaks.
- Building and Packaging: Compiling the custom stage project into a JAR file, which will be deployed to the Data Collector.
- Deployment: Making the custom stage library available to Data Collector instances by uploading the JAR file as part of an external resource archive.
- Pipeline Integration: Incorporating the newly created custom stage into Data Collector pipelines, configuring its properties, and connecting it to other stages to build a complete data flow.
Steps to create and deploy a custom stage to the StreamSets environment
In this section, we outline the steps to create a custom stage, package it, and add it as an asset to the project so that it can be deployed as an external resource to a StreamSets environment.
Creating a custom stage
Create a new custom stage project using the Maven archetype:
Step 1: Install Maven
Install Maven on your workstation. Follow the steps in Apache Maven installation.
Step 2: Generate the project
Once Maven is installed, run the following command to generate the stage project from the archetype:
mvn archetype:generate -DarchetypeGroupId=com.streamsets \
-DarchetypeArtifactId=streamsets-datacollector-stage-lib-tutorial \
-DarchetypeVersion=6.4.0 -DinteractiveMode=true
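For scripted or CI use, the same generation can be run non-interactively by supplying the project properties up front. This is a sketch; the groupId, artifactId, version, and package values below are examples, so substitute your own:

```shell
# Batch-mode (-B) equivalent of the interactive command above.
# Property values are examples; replace them with your own coordinates.
mvn archetype:generate -B \
  -DarchetypeGroupId=com.streamsets \
  -DarchetypeArtifactId=streamsets-datacollector-stage-lib-tutorial \
  -DarchetypeVersion=6.4.0 \
  -DgroupId=com.example \
  -DartifactId=samplestage \
  -Dversion=1.0-SNAPSHOT \
  -Dpackage=com.example
```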
The build is successful when you see output similar to the following:
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------< org.apache.maven:standalone-pom >-------------------
[INFO] Building Maven Stub Project (No POM) 1
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] >>> archetype:3.4.1:generate (default-cli) > generate-sources @ standalone-pom >>>
[INFO]
[INFO] <<< archetype:3.4.1:generate (default-cli) < generate-sources @ standalone-pom <<<
[INFO]
[INFO]
[INFO] --- archetype:3.4.1:generate (default-cli) @ standalone-pom ---
[INFO] Generating project in Interactive mode
[INFO] Archetype repository not defined. Using the one from [com.streamsets:streamsets-datacollector-stage-lib-tutorial:6.4.0] found in catalog remote
Define value for property 'groupId': com.example
Define value for property 'artifactId': samplestage
Define value for property 'version' 1.0-SNAPSHOT:
Define value for property 'package' com.example:
Confirm properties configuration:
groupId: com.example
artifactId: samplestage
version: 1.0-SNAPSHOT
package: com.example
Y:
[INFO] ----------------------------------------------------------------------------
[INFO] Using following parameters for creating project from Archetype: streamsets-datacollector-stage-lib-tutorial:6.4.0
[INFO] ----------------------------------------------------------------------------
[INFO] Parameter: groupId, Value: com.example
[INFO] Parameter: artifactId, Value: samplestage
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Parameter: package, Value: com.example
[INFO] Parameter: packageInPathFormat, Value: com/example
[INFO] Parameter: package, Value: com.example
[INFO] Parameter: groupId, Value: com.example
[INFO] Parameter: artifactId, Value: samplestage
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] Project created from Archetype in dir: /Users/deepak/custom_stages/custom_origin/samplestage
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:51 min
[INFO] Finished at: 2025-11-04T19:47:42-05:00
[INFO] ------------------------------------------------------------------------
Maven generates a template project from the archetype in a directory with the artifactId you provided as its name.
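A sketch of the generated layout, assuming the example artifactId above (exact file names vary by archetype version):

```
samplestage/
├── pom.xml                  # build file with the StreamSets stage API dependencies
└── src/main/
    ├── java/com/example/    # sample stage classes to use as starting points
    └── resources/           # stage icons and other bundled resources
```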

Step 3: Package the template
You can package the template by running the following Maven command from the project directory:
mvn clean package -DskipTests
Packaging is successful when you see output similar to the following:
[INFO] Scanning for projects...
[INFO]
[INFO] ----------------------< com.example:samplestage >-----------------------
[INFO] Building samplestage 1.0-SNAPSHOT
[INFO] from pom.xml
[INFO] --------------------------------[ jar ]---------------------------------
....output omitted....
[INFO] Building tar: /Users/deepak/custom_stages/custom_origin/samplestage/target/samplestage-1.0-SNAPSHOT.tar.gz
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
Packaging the external archive as an asset
The custom stage can be deployed to the StreamSets Data Collector as part of an external archive. An external resource archive file must use the required folder names and directory structure.
The root folder must be named externalResources and include the following directories:
resources
- The resources directory must include text files created for runtime resources.
streamsets-libs-extras
- The streamsets-libs-extras directory must include a subdirectory for each set of required external libraries based on the stage library name, as follows: <stage library name>/lib/
user-libs
- The user-libs directory must include a subdirectory for each custom stage.
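Putting the three directories together, the archive layout looks like the following (samplestage here stands in for your stage library and custom stage names):

```
externalResources/
├── resources/                       # text files for runtime resources
├── streamsets-libs-extras/
│   └── <stage library name>/lib/    # external libraries required by a stage library
└── user-libs/
    └── samplestage/                 # one subdirectory per custom stage
```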
Navigate to the target directory shown in the output of Step 3: Package the template and copy the generated tar.gz into the externalResources archive as shown below.

Repackage the folder containing the custom stage into a new externalResources.tar.gz. The external archive is now ready to be deployed.
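A minimal sketch of the packaging commands, assuming the directory and file names used in this tutorial (a placeholder file stands in for the stage tarball built in Step 3 so the sketch is self-contained):

```shell
# Create the required externalResources layout.
mkdir -p externalResources/resources
mkdir -p externalResources/streamsets-libs-extras
mkdir -p externalResources/user-libs/samplestage

# Copy the stage tarball built in Step 3 into user-libs.
# (touch creates a placeholder standing in for target/samplestage-1.0-SNAPSHOT.tar.gz.)
touch samplestage-1.0-SNAPSHOT.tar.gz
cp samplestage-1.0-SNAPSHOT.tar.gz externalResources/user-libs/samplestage/

# Repackage the folder as the archive to deploy.
tar -czf externalResources.tar.gz externalResources

# Inspect the archive contents to verify the structure.
tar -tzf externalResources.tar.gz
```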
Please review the product documentation for more detailed information on the archive file - Archive File as the Source.
Adding externalResources as an Asset to the Project
To deploy the archive file, you need to import it as an asset to the project by selecting the Import assets option under the Assets tab.

Click New Asset to import the archive.

Select one of the options in the left pane to import the archive.
Once you have imported the asset, select Done to proceed.

You should see the externalResources archive listed as an asset under the project

Updating the StreamSets environment to deploy the externalResources
Navigate to Projects → Manage → StreamSets under the Tools menu in the left pane to access the environment you created. Edit the environment by clicking the vertical ellipsis next to it, as shown below.

In the edit environment screen, scroll down to the External Resources section to select the externalResources.tar.gz that was added as an asset to the Project.

Review the changes and click Save to continue and generate the engine run command.

To execute the command that creates the engine, you need a container management platform such as Docker or Podman on the host where the command is run. For detailed instructions, see Prerequisites. Copy the generated command, and then run it using the detailed steps in Running the engine command.
Once the engine run command has run successfully, you can view the environment details by selecting the environment from the StreamSets landing page.

You can manage the Data Collector engine(s) from the Environment details landing page.
