Shashi Shankar
Mar 19, 2023
Data Loading Process
Snowflake Stage Types
Internal
Table Stage – a Table Stage is allocated to each table by default to store files to be loaded into that table
User Stage – a User Stage is allocated to each user by default to store files prior to being loaded
Named Stage – a named database object that stores data files within Snowflake-managed storage
External
AWS S3
Azure Blob Storage
Google Cloud Storage
Named Stage – a database object that references the URL of an external cloud storage location where the data files are kept (see the example stage definitions below)
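As a quick illustration of the stage types above, the following Snowflake SQL sketch creates a named internal stage, a named external stage on S3, and uploads a local file to the user stage. All object, bucket, and file names (my_int_stage, my_ext_stage, my_bucket, orders.csv) are hypothetical.

-- Named internal stage: files are stored in Snowflake-managed storage
CREATE STAGE my_int_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Named external stage: references data files kept in external cloud storage (here, AWS S3)
CREATE STAGE my_ext_stage
  URL = 's3://my_bucket/data/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
  FILE_FORMAT = (TYPE = CSV);

-- Upload a local file to the current user's stage (run from SnowSQL or a driver, not the web UI)
PUT file:///tmp/orders.csv @~/orders;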
Bulk Loading
Used for loading batches of data from a stage into Snowflake tables
Uses the COPY INTO command
Requires an appropriately sized virtual warehouse to execute the COPY command; the required warehouse size depends on the data volume
Data files must first be uploaded to a stage before bulk loading
Simple transformations can be applied during COPY, such as a) column reordering, b) omitting columns, c) type conversion using casts, and d) truncating text fields (see the sketch below)
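A minimal COPY sketch, assuming a hypothetical orders table and the my_int_stage stage from above, showing the simple transformations listed here (reordering, omitting a column, casting, and truncating long text):

COPY INTO orders (order_id, order_date, amount)
FROM (
  SELECT $2,                          -- column reordering: file column 2 feeds order_id
         TO_DATE($1, 'YYYY-MM-DD'),   -- type conversion using a cast
         $4::NUMBER(10,2)             -- cast; file column 3 is deliberately not selected
  FROM @my_int_stage
)
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
TRUNCATECOLUMNS = TRUE                -- truncate text values that exceed the column length
ON_ERROR = 'ABORT_STATEMENT';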
Continuous Loading - Snowpipe
Used for loading micro-batches of data for incremental loads
Snowflake provides the compute resources (serverless), so no user-managed virtual warehouse is required
Data is loaded as soon as it lands in a pre-defined cloud storage location
The Snowflake service Snowpipe is used to load the data
Snowpipe uses the COPY command and supports the same simple transformations as the bulk load option (see the sketch below)
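A minimal Snowpipe sketch, assuming the hypothetical my_ext_stage and a raw_orders landing table; AUTO_INGEST relies on cloud event notifications (for example S3 event notifications to SQS) being configured separately:

CREATE PIPE orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_orders
  FROM @my_ext_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);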
Continuous Data Pipelines – Streams and Tasks
Used for automating transformations on continuously loaded data (for example, data loaded by Snowpipe)
Raw data is first loaded into a staging area, where it is transformed using SQL queries and stored procedures before being loaded into one or more Snowflake tables
Multiple tasks can be chained together
Streams capture change data (CDC) and provide data lineage
Table streams record when staging tables and any downstream tables are populated with data
Uses Snowpipe and SQL
Data is first staged and then transformed and loaded into destination table(s)
Each time a task is scheduled to run, it can verify whether a stream contains change data for a table by calling SYSTEM$STREAM_HAS_DATA and skip the run if no change data exists (see the sketch below)
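A minimal stream-and-task sketch, assuming the hypothetical raw_orders table and a transform_orders() stored procedure; the WHEN clause uses SYSTEM$STREAM_HAS_DATA so scheduled runs are skipped when there is no change data:

CREATE STREAM raw_orders_stream ON TABLE raw_orders;

CREATE TASK transform_orders_task
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw_orders_stream')
AS
  CALL transform_orders();

ALTER TASK transform_orders_task RESUME;   -- tasks are created in a suspended state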
Snowpipe Streaming
Components of Data Streaming Using Snowpipe
Snowflake Connector for Kafka – the Kafka connector continuously loads micro-batches of records from one or more Apache Kafka topics into an internal (Snowflake) stage, which is then fed to Snowpipe.
Data Files – Source data files or micro batches of source data
Snowflake Internal Stage – Snowflake managed temporary storage of source data
Snowpipe - Continuously loads micro-batches of data from an External Stage location (Amazon S3, Google Cloud Storage, or Microsoft Azure) or Internal Stage into a staging table.
Snowflake-Managed Compute Services – used for loading data from Snowpipe into the staging tables.
Staging Tables – Snowflake tables into which data is loaded by Snowpipe.
Stream – storage for Change Data Capture (CDC). Keeps both before and after images.
Task – executes SQL statements (which can call stored procedures) to transform the change data and move the optimized data sets into destination tables for analysis. Each time this transformation process runs, it selects the change data in the stream, performs DML operations on the destination tables, and consumes the change data when the transaction is committed.
User-Managed Compute – a user-managed compute environment (virtual warehouse) is used to transform and load data from the staging tables into the target tables
Target Tables – data loaded into the target tables is available to users for analysis (see the sketch below)
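Tying the last few components together, a hedged sketch of the staging-to-target step: a stream on a hypothetical staging_orders table feeds a task that merges change rows into a target_orders table using user-managed compute (transform_wh). Reading the stream inside the MERGE consumes its change data when the transaction commits, as described above.

CREATE STREAM staging_orders_stream ON TABLE staging_orders;

CREATE TASK load_target_orders
  WAREHOUSE = transform_wh
  SCHEDULE = '10 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('staging_orders_stream')
AS
  MERGE INTO target_orders t
  USING staging_orders_stream s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET t.order_date = s.order_date, t.amount = s.amount
  WHEN NOT MATCHED THEN INSERT (order_id, order_date, amount)
    VALUES (s.order_id, s.order_date, s.amount);

ALTER TASK load_target_orders RESUME;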