StreamSets to Snowflake Integration
This guide provides step-by-step instructions for StreamSets to Snowflake Integration.
Prerequisites
- StreamSets account.
- Data collector using Docker (as described in previous guide)
- Snowflake account
- WHOISXMLAPI.COM Real-Time Domain Registration subscription
Design Overview
Creating a Database in Snowflake
Step 1: Set Up a Warehouse:
- After logging in, navigate to the sidebar select "Admin" and choose "Warehouses," Click on Warehouse button, as illustrated in the screenshot below.
- Select the appropriate warehouse name and click on the “Create Warehouse” button.
Step 2: Setup a Database
- Navigate to the sidebar select "Data" and choose "Databases" Click on Database button, as illustrated in the screenshot below.
- Select the appropriate database name and click on “Create” button.
Step 3: Create Schema:
- Select database and click on “+ Schema” button
- Select the appropriate Schema name and click on “Create” button.
Step 4: Create Stage:
- Select Schema created in the previous step and click on “Create” button and choose “Stage” and click on “Snowflake Managed”.
- Select the appropriate Stage name and click on “Create” button.
Step 5: Create Table:
- Select Schema created in the previous step and click on “Create” button and choose “Table” and click on “Standard”.
- Create Table by writing the SQL query as depicted below and click on the “Create Table” button.
Creating a Connection for Snowflake
Step 6: Setup a Connection:
- In the StreamSets UI, go to the sidebar, select "Setup," and then choose "Connections." Create the connection by clicking on the plus button, as demonstrated in the screenshot below.
- Customize the new Connection according to your specifications, and proceed by clicking on the "Save & Next" button. Make sure to select the Data collector and type.
- Configure the new Connection according to your credentials, and proceed by clicking on the "Save" button. Make sure to Test the connection.
Creating a pipeline
Step 7: Setup Pipeline:
- In the StreamSets UI, go to the sidebar, select "Build," and then choose "Pipelines." Initiate the pipeline by clicking on the plus button, as demonstrated in the screenshot below.
- Customize the new pipeline according to your specifications, and proceed by clicking on the "Next" button.
- Adjust the pipeline configuration, choose the designated data collector, and click on the "Save & Open in Canvas" button.
- Subsequently, the following user interface (UI) will be displayed.
- Click on the "Add Stage" button, then search for "WebSocket" and choose the "WebSocket Client."
Choose the WebSocket stage and configure it based on your preferences:
WebSocket Configuration:
- Add Resource URL
- Add request Data (It will contain the API Key)
- Add Max Message Length (bytes) = 522184 minimum
Data Format Configuration:
- Add Data Format JSON.
- Add Max Object Length (chars) = 9999999 (You can change it according to your requirements)
- Add another stage by selecting the "Add Stage" button on the UI, and choose the "Snowflake (destinations)" stage.
Snowflake Configuration:
- Select the connection which we created in previous steps.
- Add the Snowflake configuration warehouse, database, schema & table. Check create new columns; it will automatically create columns in the table.
- Add the staging configuration stage location, stage name , stage database and schema.
- Add the data configuration:
Final Steps:
Once configured, validate the pipeline by clicking on the "Validate" button to identify and rectify any errors. The final state of the pipeline should resemble the provided example.
Running Pipeline:
Execute the pipeline by selecting "Draft Run" on the UI, then choose "Start Pipeline."
Following the completion of the aforementioned steps, upon starting the pipeline, you should observe the displayed UI as illustrated below.
Within the Snowflake database, you should be able to preview the data on the table, containing data sourced from the WebSocket, as depicted below.
Conclusion
In conclusion, this comprehensive guide outlined the step-by-step process to set up and execute a pipeline in the StreamSets UI using Docker. From configuring the WebSocket Client to defining the Snowflake stage, each step contributed to building a functional pipeline for data processing. The validation step ensures the integrity of the pipeline, and upon successful execution, the snowflake UI displays the output. By following these instructions, users can successfully create, configure, and run a streamlined data processing pipeline, facilitating effective data integration and management.