
Data Loading

Shashi Shankar

Mar 19, 2023

Data Loading Process

Snowflake Stage Types


Internal

  • Table Stage – a Table Stage is automatically allocated to each table for staging files to be loaded into that table

  • User Stage – a User Stage is automatically allocated to each user to store files prior to being loaded

  • Named Stage – a database object created within a schema; a shared internal location for staging files (see the sketch below)
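As a quick sketch (the file, table, and stage names are placeholders, not from the post), the three internal stage flavors are referenced like this; PUT is run from a client such as SnowSQL:

    -- User stage: allocated to every user automatically, referenced as @~
    PUT file:///tmp/orders_2023_03.csv @~/orders/;

    -- Table stage: allocated to every table automatically, referenced as @%<table_name>
    PUT file:///tmp/orders_2023_03.csv @%orders;

    -- Named internal stage: an explicit, shareable database object
    CREATE STAGE IF NOT EXISTS my_int_stage
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
    PUT file:///tmp/orders_2023_03.csv @my_int_stage;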


External

  • AWS S3

  • Azure

  • Google Cloud Storage


Named Location – a named external stage is a database object that stores the URL of the cloud storage location where the data files are kept (see the sketch below)
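A minimal sketch of a named external stage over S3; the bucket, storage integration, and file format shown here are assumptions:

    -- The storage integration (created separately by an account admin) supplies access
    -- without embedding credentials in the stage definition
    CREATE STAGE my_ext_stage
      URL = 's3://my-company-raw-data/orders/'
      STORAGE_INTEGRATION = my_s3_integration
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

    -- The stage stores only the URL and access settings; the files stay in cloud storage
    LIST @my_ext_stage;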


Bulk Loading



  • Used for loading batches of data from a stage into Snowflake tables

  • Uses the COPY command

  • Requires an appropriately sized compute virtual warehouse to execute the COPY command; the required warehouse size depends on the data volume

  • Data files must be uploaded to a stage before bulk loading

  • Simple transformations – a) column reordering, b) omitting column(s), c) type conversion using casts, and d) truncating text fields – can be applied during COPY (see the example below)
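A hedged example of a COPY command applying those transformations; the orders table, stage path, and file layout are assumptions:

    COPY INTO orders (order_id, order_date, amount, customer_name)
    FROM (
      SELECT
        $2::NUMBER,            -- column reordering: file column 2 feeds order_id
        $1::DATE,              -- type conversion using a cast
        $4::NUMBER(12,2),      -- file column 3 is left unselected (omitted)
        SUBSTR($5, 1, 100)     -- truncate an over-length text field
      FROM @my_int_stage/orders/
    )
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    ON_ERROR = 'CONTINUE';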


Continuous Loading - Snowpipe


  • Used for loading micro-batches of data for incremental loads

  • Snowflake provides the compute resources (serverless); no user-managed warehouse is required

  • Data is loaded as soon as it lands in a pre-defined cloud location

  • The Snowflake service Snowpipe is used to load the data

  • Snowpipe uses the COPY command and supports the same simple transformations listed under the bulk loading option (see the pipe definition below)
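A minimal Snowpipe definition under assumed names; AUTO_INGEST relies on cloud event notifications, and the pipe itself is a wrapper around a COPY command:

    CREATE PIPE orders_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO raw_orders
      FROM @my_ext_stage/orders/
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

    -- Check that the pipe is running and ingesting files
    SELECT SYSTEM$PIPE_STATUS('orders_pipe');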

Transformations for continuous data loads can be automated by combining Snowpipe with Streams and Tasks:

  • Raw data is first loaded into a staging area, where it is transformed using SQL queries and stored procedures before being loaded into one or more Snowflake tables

  • Multiple tasks can be chained together

  • Captures change data (CDC) and provides lineage

  • Table streams record when staging tables and any downstream tables are populated with data

  • Uses Snowpipe and SQL

  • Data is first staged and then transformed and loaded into destination table(s)

  • Each time a task is scheduled to run, it can verify whether a stream contains change data for a table using SYSTEM$STREAM_HAS_DATA and skip the run if no change data is present (see the stream and task sketch below)
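A sketch of that pattern with assumed object names: a stream on the staging table and a task that only runs when the stream reports change data:

    -- Stream captures the change data (CDC) that Snowpipe inserts into the staging table
    CREATE STREAM raw_orders_stream ON TABLE raw_orders;

    -- Task polls on a schedule but skips the run when there is no change data
    CREATE TASK transform_orders_task
      WAREHOUSE = transform_wh
      SCHEDULE = '5 MINUTE'
    WHEN
      SYSTEM$STREAM_HAS_DATA('raw_orders_stream')
    AS
      INSERT INTO orders_clean (order_id, order_date, amount)
      SELECT order_id, order_date, amount
      FROM raw_orders_stream
      WHERE METADATA$ACTION = 'INSERT';

    -- Tasks are created in a suspended state and must be resumed
    ALTER TASK transform_orders_task RESUME;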


Snowpipe Streaming




Components of Data Streaming Using Snowpipe


  • Snowflake Connector for Kafka – the Kafka connector continuously loads micro-batches of records from one or more Apache Kafka topics into an internal (Snowflake) stage, from which they are fed to Snowpipe.

  • Data Files – Source data files or micro batches of source data

  • Snowflake Internal Stage – Snowflake managed temporary storage of source data

  • Snowpipe - Continuously loads micro-batches of data from an External Stage location (Amazon S3, Google Cloud Storage, or Microsoft Azure) or Internal Stage into a staging table.

  • Snowflake-Managed Compute Services – used by Snowpipe for loading data into staging tables.

  • Staging Table – Snowflake tables into which data is loaded by Snowpipe.

  • Stream  – storage for Change Data Capture (CDC). Keeps both before and after images.

  • Task – executes SQL statements (which can call stored procedures) to transform the change data and move the optimized data sets into destination tables for analysis. Each time this transformation process runs, it selects the change data in the stream, performs DML operations on the destination tables, and consumes the change data when the transaction is committed.

  • User-Managed Compute – a user-managed compute environment (virtual warehouse) is used to transform and load data from the staging table into target tables

  • Target Tables – data loaded into the target tables is available for consumption by users (see the sketch below)
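To tie the last few components together, a hedged sketch of the DML a task might run on user-managed compute: inspect the stream's change records, then MERGE them into a target table; committing the MERGE consumes the stream. All object names are assumptions:

    -- Peek at the change records captured on the staging table
    SELECT order_id, amount,
           METADATA$ACTION,      -- 'INSERT' or 'DELETE'; an update appears as a DELETE/INSERT pair
           METADATA$ISUPDATE,    -- TRUE when the record is part of an update
           METADATA$ROW_ID       -- stable row identifier, useful for lineage
    FROM raw_orders_stream;

    -- DML the task would execute; when it commits, the stream's change data is consumed
    MERGE INTO target_orders t
    USING raw_orders_stream s
      ON t.order_id = s.order_id
    WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
      UPDATE SET t.order_date = s.order_date, t.amount = s.amount
    WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
      INSERT (order_id, order_date, amount)
      VALUES (s.order_id, s.order_date, s.amount);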
