StreamSets Data Collector 5.7.1 - Docker Deployment and WebSocket Pipeline Guide
This guide provides step-by-step instructions for deploying StreamSets Data Collector version 5.7.1 in a Docker container and creating a simple pipeline to connect to a WebSocket and store the received data locally.
Prerequisites:
- Docker installed on your machine (download and install it if you have not already).
- A StreamSets account.
StreamSets Data Collector Deployment:
Step 1: Set Up the Data Collector Deployment:
- After logging in, navigate to the sidebar and choose "Deployment," as illustrated in the screenshot below.
- Initiate a new deployment by clicking on the plus sign, as demonstrated in the screenshot below.
- Define the deployment according to your requirements, making sure to select engine version 5.7.1, the stable release at the time of writing. Then click the "Save & Next" button.
- Adjust the engine settings to meet your needs, then click the "Save & Next" button.
- Choose "Docker Image" as the installation type, then click the "Save & Next" button.
- Configure the sharing settings if you want to give other users and groups access to the deployment, then click the "Save & Next" button.
- Click the "Start & Generate Install Script" button.
- Make sure Docker is running, then copy the generated install command and paste it into a Windows, Ubuntu, or macOS terminal to start the engine.
- Verify that the container is active. In the screenshot below, observe the running container within Docker Desktop.
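The same check can be made from a script instead of Docker Desktop. The following is a minimal sketch, not part of the official install procedure: it parses the output of `docker ps` and keeps the running containers whose image name contains a given hint. The default hint "datacollector" and the function names are assumptions made for illustration; use the image name that appears in your generated install command.

```python
import subprocess

# Go-template format understood by `docker ps`: one tab-separated row per container.
PS_FORMAT = "{{.Names}}\t{{.Image}}\t{{.Status}}"


def docker_ps_output():
    """Return the raw `docker ps` output (running containers only)."""
    result = subprocess.run(
        ["docker", "ps", "--format", PS_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def running_containers(ps_output, image_hint="datacollector"):
    """Keep the rows whose image contains image_hint and whose status starts with "Up".

    image_hint is an assumption; adjust it to the image in your install command.
    """
    matches = []
    for line in ps_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or unexpected rows
        name, image, status = parts
        if image_hint in image and status.startswith("Up"):
            matches.append({"name": name, "image": image, "status": status})
    return matches
```

Calling `running_containers(docker_ps_output())` on a machine where the engine was started should return one entry; an empty list suggests the container is not running or the image hint does not match.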
Step 2: Advanced Data Collector Configuration:
To use the WhoisXML API WebSocket, replace the basic stage library jar in the Data Collector container as follows.
- You can download the built jar file here. After downloading it, replace the existing file in the Docker container with the new one. Docker Desktop offers a straightforward way to do this.
- Navigate to the directory /opt/streamsets-datacollector-5.7.1/streamsets-libs/streamsets-datacollector-basic-lib/lib.
- In this directory, you should find a jar file named "streamsets-datacollector-basic-lib-5.7.1.jar".
- Replace the existing file with the downloaded one, for example by dragging the downloaded file into this location.
- Restart the Data Collector/Container.
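If you prefer the command line to Docker Desktop, the replacement can also be done with `docker cp` and `docker restart`. The sketch below only builds the commands, so you can review them before running anything. The directory and jar name come from the steps above; the container name, the local jar path, and the backup path are hypothetical placeholders, and backing up the original jar first is an extra precaution rather than a required step.

```python
import posixpath

# Directory and file name taken from the steps above.
LIB_DIR = ("/opt/streamsets-datacollector-5.7.1/streamsets-libs/"
           "streamsets-datacollector-basic-lib/lib")
JAR_NAME = "streamsets-datacollector-basic-lib-5.7.1.jar"


def jar_replacement_commands(container, downloaded_jar, backup_path):
    """Build the docker commands that back up, replace, and reload the jar.

    container, downloaded_jar, and backup_path are placeholders for your own values.
    """
    target = posixpath.join(LIB_DIR, JAR_NAME)
    return [
        # 1. Copy the original jar out of the container as a backup.
        ["docker", "cp", f"{container}:{target}", backup_path],
        # 2. Copy the downloaded jar over the original inside the container.
        ["docker", "cp", downloaded_jar, f"{container}:{target}"],
        # 3. Restart the container so the Data Collector loads the new jar.
        ["docker", "restart", container],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order, or joined with spaces and pasted into a terminal.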
Creating a pipeline:
Step 3: Set Up the Pipeline:
- In the StreamSets UI, go to the sidebar, select "Build," and then choose "Pipelines." Create a new pipeline by clicking the plus button, as demonstrated in the screenshot below.
- Fill in the details of the new pipeline, then click the "Next" button.
- Adjust the pipeline configuration, choose the Data Collector deployed in Step 1, and click the "Save & Open in Canvas" button.
- The pipeline canvas is then displayed, as shown below.
- Click on the "Add Stage" button, then search for "WebSocket" and choose the "WebSocket Client."
Select the WebSocket Client stage and configure it as follows:
WebSocket Configuration:
- Set Resource URL.
- Set Request Data (it will contain the WHOIS API key).
- Set Max Message Length (bytes) to at least 522184.
Data Format Configuration:
- Set Data Format to JSON.
- Set Max Object Length (chars) to 9999999 (you can change it according to your requirements).
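The two limits above use different units: Max Message Length counts bytes, while Max Object Length counts characters. As a rough illustration, the sketch below measures a sample message both ways and compares it with the configured values. It assumes that messages arrive as UTF-8 text, and the function name is hypothetical; it is a sanity check you can run on a captured message, not a description of how the stage enforces its limits.

```python
MAX_MESSAGE_BYTES = 522184   # Max Message Length (bytes) from the WebSocket settings
MAX_OBJECT_CHARS = 9999999   # Max Object Length (chars) from the Data Format settings


def check_limits(message, max_bytes=MAX_MESSAGE_BYTES, max_chars=MAX_OBJECT_CHARS):
    """Measure a message in bytes and characters and compare it with both limits.

    Assumes UTF-8 encoding, where non-ASCII characters occupy more than one byte.
    """
    byte_length = len(message.encode("utf-8"))
    char_length = len(message)
    return {
        "bytes": byte_length,
        "chars": char_length,
        "fits_message_limit": byte_length <= max_bytes,
        "fits_object_limit": char_length <= max_chars,
    }
```

A message made mostly of non-ASCII text can exceed the byte limit while staying well under the character limit, which is one reason to size the two settings separately.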
- Add another stage by clicking the "Add Stage" button, and choose the "Local FS" stage.
Local FS Configuration:
Select the "Local FS" stage and configure it according to your requirements.
- Set Directory Template to the desired output file location.
Configure the Data Format as required:
- Set Data Format to JSON.
Once configured, validate the pipeline by clicking the "Validate" button to identify and fix any errors. The final pipeline should resemble the example shown.
Final Steps:
Running a Pipeline:
Run the pipeline by selecting "Draft Run" in the UI, then choosing "Start Pipeline."
Once the pipeline starts, the UI should appear as illustrated below.
Within the Docker container, you should find the output file in the specified directory, containing the data received from the WebSocket, as depicted below.
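To confirm that the output file holds valid data, you can copy it out of the container (for example with `docker cp`) and parse it. The sketch below assumes the Local FS stage wrote one JSON object per line; if your stage is configured to write a single JSON array instead, `json.load` on the whole file would be the equivalent check. The function names are illustrative only.

```python
import json


def parse_records(lines):
    """Parse an iterable of text lines, one JSON object per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # ignore blank lines between records
            records.append(json.loads(line))
    return records


def load_records(path):
    """Read an output file that was copied out of the container."""
    with open(path, encoding="utf-8") as handle:
        return parse_records(handle)
```

`len(load_records(path))` then gives the number of records written, and a `json.JSONDecodeError` points to a truncated or malformed line.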
Conclusion:
This guide walked through deploying StreamSets Data Collector 5.7.1 in Docker and building a pipeline that reads from a WebSocket and stores the received data locally. The WebSocket Client stage receives the data, the Local FS stage writes it to a file, and the validation step catches configuration errors before the pipeline runs. After a successful run, the output file appears in the specified directory inside the Docker container.