StreamSets Data Collector 5.7.1 - Docker Deployment and Web Socket Pipeline Guide | WhoisXML API

StreamSets Data Collector 5.7.1 - Docker Deployment and Web Socket Pipeline Guide

This guide provides step-by-step instructions for deploying StreamSets Data Collector version 5.7.1 in a Docker container and creating a simple pipeline to connect to a WebSocket and store the received data locally.

Prerequisites:

  • Docker installed on your machine (download and install it if you have not already).
  • A StreamSets account.

StreamSets Data Collector Deployment:

Step 1: Set Up the Data Collector Deployment:

  • After logging in, navigate to the sidebar and choose "Deployment," as illustrated in the screenshot below.
navigate to the sidebar and choose "Deployment"
  • Initiate a new deployment by clicking on the plus sign, as demonstrated in the screenshot below.
Initiate a new deployment by clicking on the plus sign
  • Define the deployment based on your requirements, making sure to choose engine version 5.7.1, the stable release at the time of writing. Afterward, click on the "Save & Next" button.
Complete the defined deployment based on your requirements
  • Tailor the engine settings to meet your specifications. Subsequently, select the "Save & Next" button.
Tailor the engine settings to meet your specifications
  • Choose "Docker Image" as the installation type and then proceed by selecting the "Save & Next" button.
Choose "Docker Image"
  • Customize the sharing settings for the deployment to extend access to other users and groups. Subsequently, click on the "Save & Next" button.
Customize the sharing settings for the deployment
  • Choose the "Start & Generate Install Script" button.
Choose the "Start & Generate Install Script" button
  • Ensure that Docker is running, then copy the generated install command shown below and paste it into a Windows, Ubuntu, or macOS terminal to start the engine.
Ensure that Docker is running
  • Verify that the container is active. In the screenshot below, observe the running container within Docker Desktop.
Verify that the container is active
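If you prefer to check from a script rather than Docker Desktop, the following is a minimal sketch that lists running containers through the Docker CLI. It assumes the `docker` command is on your PATH and that the engine image name contains the substring "datacollector"; that substring is an assumption, so adjust `IMAGE_HINT` to match the image named in your generated install script.

```python
import subprocess

# Assumption: the engine image name contains this substring. Adjust it to
# match the image shown in your generated install script.
IMAGE_HINT = "datacollector"


def parse_docker_ps(output: str) -> list:
    """Parse tab-separated `docker ps` rows into dictionaries."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, image, status = line.split("\t", 2)
        rows.append({"name": name, "image": image, "status": status})
    return rows


def running_engines(rows: list, hint: str = IMAGE_HINT) -> list:
    """Keep containers whose image matches the hint and whose status starts with 'Up'."""
    return [r for r in rows if hint in r["image"] and r["status"].startswith("Up")]


def list_running_engines() -> list:
    """Ask the local Docker CLI for running containers and filter them."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Image}}\t{{.Status}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return running_engines(parse_docker_ps(result.stdout))
```

Calling `list_running_engines()` returns an empty list when no matching container is up, which is a quick signal that the install command did not start the engine.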

Step 2: Advanced Data Collector Configuration:

To use the WhoisXML API WebSocket, customize your Data Collector settings as follows.

  • You can download the built jar file here. After downloading it, replace the existing file in the Docker container with the new one. Docker Desktop makes this replacement straightforward.
customize your data collector settings
  • Navigate to directory /opt/streamsets-datacollector-5.7.1/streamsets-libs/streamsets-datacollector-basic-lib/lib.
Navigate to directory
  • In this directory, you should find a jar file named "streamsets-datacollector-basic-lib-5.7.1.jar".
find a jar file
  • Replace the existing file with the downloaded one. Alternatively, you can drag the downloaded file to this location for a simple and direct replacement.
  • Restart the Data Collector/Container. 
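As an alternative to dragging the file in Docker Desktop, the replacement and restart can be sketched with the Docker CLI commands `docker cp` and `docker restart`. The directory and jar name below come from the steps above; the container name and local jar path passed to the functions are hypothetical and must be replaced with your own values.

```python
import subprocess

# Directory and file name taken from the steps above.
LIB_DIR = (
    "/opt/streamsets-datacollector-5.7.1/streamsets-libs/"
    "streamsets-datacollector-basic-lib/lib"
)
JAR_NAME = "streamsets-datacollector-basic-lib-5.7.1.jar"


def build_replace_commands(container: str, local_jar: str) -> list:
    """Return the docker commands that copy the jar in and restart the container."""
    target = f"{container}:{LIB_DIR}/{JAR_NAME}"
    return [
        ["docker", "cp", local_jar, target],  # overwrite the existing jar
        ["docker", "restart", container],     # restart so the new jar is loaded
    ]


def replace_jar(container: str, local_jar: str) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in build_replace_commands(container, local_jar):
        subprocess.run(cmd, check=True)
```

For example, `replace_jar("my-sdc-container", "./streamsets-datacollector-basic-lib-5.7.1.jar")` would copy the downloaded jar over the existing one and restart the container, where `my-sdc-container` stands in for the name shown by Docker Desktop.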

Creating a pipeline:

Step 3: Set Up the Pipeline:

  • In the StreamSets UI, go to the sidebar, select "Build," and then choose "Pipelines." Create a new pipeline by clicking on the plus button, as demonstrated in the screenshot below.
In the StreamSets UI, go to the sidebar, select "Build," and then choose "Pipelines."
  • Customize the new pipeline according to your specifications, and proceed by clicking on the "Next" button.
Customize the new pipeline
  • Adjust the pipeline configuration, choose the designated data collector, and click on the "Save & Open in Canvas" button.
Adjust the pipeline configuration
  • The following user interface (UI) is then displayed.
user interface (UI) is displayed
  • Click on the "Add Stage" button, then search for "WebSocket" and choose the "WebSocket Client."
Click on the "Add Stage" button

Choose the WebSocket stage and configure it based on your preferences:

WebSocket Configuration: 

  • Add the Resource URL.
  • Add the Request Data (it will contain your WHOIS API key).
  • Set Max Message Length (bytes) to at least 522184.
Add Max Message Length
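Because the limit is expressed in bytes rather than characters, a message with non-ASCII text can exceed it even when its character count looks safe. The sketch below illustrates the difference, assuming messages arrive as UTF-8 text; the helper name is hypothetical and is not part of StreamSets.

```python
# Minimum value suggested in the configuration above.
MAX_MESSAGE_BYTES = 522184


def fits_max_message_length(message: str, limit: int = MAX_MESSAGE_BYTES) -> bool:
    """Return True if the UTF-8 encoding of the message is within the byte limit."""
    return len(message.encode("utf-8")) <= limit


# 300,000 two-byte characters encode to 600,000 bytes, which is over the limit
# even though the character count is below 522184.
sample = "é" * 300_000
print(len(sample), fits_max_message_length(sample))
```

If you expect larger messages than this from the WebSocket, raise Max Message Length accordingly.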

Data Format Configuration:

  • Set Data Format to JSON.
  • Set Max Object Length (chars) to 9999999 (you can change it according to your requirements).
Data Format Configuration
  • Add another stage by selecting the "Add Stage" button on the UI, and choose the "Local FS" stage.
Add another stage

Local FS Configuration:

Choose the "Local FS" stage and configure it according to your specific requirements.

  • Set Directory Template to the desired output file location.
Add Directory Template for the desired output file location

Include the necessary configuration for the Data Format as per your requirements.

  • Set Data Format to JSON.
Add Data Format JSON.

Once configured, validate the pipeline by clicking on the "Validate" button to identify and rectify any errors. The final state of the pipeline should resemble the provided example.

Final Steps:

final steps

Running a Pipeline:

Run the pipeline by selecting "Draft Run" in the UI and then choosing "Start Pipeline."

running a pipeline

Once the pipeline has started, the UI should look like the screenshot below.

you should observe the displayed UI

Within the Docker container, you should be able to locate the file created in the specified directory, containing data sourced from the WebSocket, as depicted below.

locate the file created in the specified directory
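To inspect the output outside the UI, a small script can read the records back. This is a minimal sketch that assumes the Local FS stage wrote one JSON object per line and that you have copied the output file out of the container to a local path; both the file layout and the path are assumptions to verify against your own configuration.

```python
import json


def parse_json_lines(text: str) -> list:
    """Parse text containing one JSON value per line, skipping blank lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def load_records(path: str) -> list:
    """Read an output file from a local path and return its parsed records."""
    with open(path, encoding="utf-8") as handle:
        return parse_json_lines(handle.read())
```

For example, `len(load_records("./sdc-output-file"))` gives a quick count of the records received so far, where `./sdc-output-file` is a hypothetical local copy of the file.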

Conclusion:

This guide walked through deploying StreamSets Data Collector 5.7.1 in Docker and building a pipeline that reads from a WebSocket and writes the received data to a local file. The WebSocket Client stage receives the data, the Local FS stage writes it as JSON, and the validation step catches configuration errors before the run. After a successful run, the output file appears in the specified directory inside the Docker container.
