In today’s data-driven world, the ability to effectively store, process, and analyze vast amounts of information isn’t just an advantage – it’s a necessity. For years, organizations grappled with legacy data warehousing solutions that were often rigid, expensive, and struggled to keep pace with modern data demands. Then came Snowflake.
This blog post is my personal endeavor to create a comprehensive, yet quick-reference guide to Snowflake. The goal is simple: to have a single place where I (and hopefully you, dear reader!) can quickly refresh my understanding of key Snowflake concepts, features, and best practices. As I continue my journey with Snowflake, I’ll be updating and expanding this post, aiming to cover everything from the fundamentals to more advanced topics like performance tuning, security, data sharing, and ecosystem integrations.
So, whether you’re new to Snowflake, considering it for your organization, or, like me, just want a handy refresher on its powerful capabilities, let’s dive in!
Table of Contents
Introduction
What is Snowflake? The Data Cloud Explained
At its core, Snowflake is a cloud-native data platform delivered as Software-as-a-Service (SaaS). This means you don’t install software or manage hardware; you simply sign up and start using it. While it’s often referred to as a “cloud data warehouse,” that description, though accurate, doesn’t capture its full scope.
Snowflake is much more: it’s a global Data Cloud that allows organizations to:
- Store and analyze massive volumes of structured and semi-structured data (like JSON, Avro, XML, Parquet).
- Run diverse workloads: From traditional business intelligence and reporting to data engineering, data science, and even building data applications.
- Securely share live data: Without copying or moving it, both internally and externally with partners, customers, and data providers.
- Benefit from cloud elasticity: Scale resources up or down near-instantly and pay only for what you use.
Snowflake was built from the ground up for the cloud, leveraging the infrastructure of major cloud providers (AWS, Azure, GCP) but operating as a distinct service on top of them. Its unique architecture is what truly sets it apart.
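To give the semi-structured support mentioned above a concrete shape, here is a minimal sketch (the table name and JSON payload are hypothetical) of landing raw JSON in a VARIANT column and querying nested fields with Snowflake’s path notation:

-- Store raw JSON in a VARIANT column (hypothetical table)
CREATE TABLE IF NOT EXISTS RAW_EVENTS (
  EVENT     VARIANT,
  LOADED_AT TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- PARSE_JSON is used via INSERT ... SELECT rather than a VALUES clause
INSERT INTO RAW_EVENTS (EVENT)
SELECT PARSE_JSON('{"user": {"id": 42, "country": "DE"}, "type": "login"}');

-- Query nested fields with colon/dot path notation and cast them to SQL types
SELECT EVENT:user.id::NUMBER      AS user_id,
       EVENT:user.country::STRING AS country,
       EVENT:type::STRING         AS event_type
FROM RAW_EVENTS;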
How Does Snowflake Work? A Look Under the Hood
The “magic” of Snowflake lies in its unique, multi-cluster, shared-data architecture. This architecture is designed to separate storage, compute, and cloud services, allowing each to scale independently. Let’s break down the three key layers:
- Database Storage (The Foundation):
- When you load data into Snowflake, it’s reorganized into Snowflake’s internal optimized, compressed, columnar format and stored in the cloud storage service of the underlying cloud provider (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage).
- This storage layer is designed for elasticity, resilience, and cost-effectiveness. You pay for the storage you consume.
- All your data resides in this central storage repository, accessible by compute resources.
- Query Processing (The Muscle – Virtual Warehouses):
- This is where the actual data processing and query execution happen. Snowflake uses “Virtual Warehouses” for compute.
- A Virtual Warehouse is essentially one or more clusters of compute resources (CPUs, memory, temporary storage).
- Crucially, these are independent of the storage layer and other virtual warehouses.
- You can have multiple virtual warehouses of different sizes (e.g., X-Small, Small, Medium, Large, X-Large, etc.) running simultaneously, accessing the same data without contention.
- For example, your ETL/ELT jobs can run on one warehouse, your BI tools can query data using another, and data scientists can run complex analyses on a third – all without impacting each other’s performance.
- You can resize a warehouse up or down (e.g., from Small to Large) at any time; the new size applies to queued and new queries, while queries already in flight finish on the existing resources. You can also set warehouses to auto-suspend when idle to save costs.
- Cloud Services (The Brain):
- This is the coordination and management layer of Snowflake. It handles tasks like:
- Authentication and access control
- Query parsing, optimization, and compilation
- Infrastructure management (including virtual warehouse provisioning)
- Metadata management (schemas, tables, views, security, etc.)
- Transaction management and concurrency control
- This layer runs on compute instances provisioned by Snowflake across availability zones within a region, ensuring high availability and fault tolerance. It’s completely managed by Snowflake, so you don’t have to worry about it.
This decoupling of storage and compute is a fundamental differentiator. It means you can scale storage capacity independently of your compute needs, and vice-versa, providing unparalleled flexibility and cost efficiency.
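To make the decoupling concrete, here is a small sketch (warehouse, database, and table names are hypothetical) of two independently sized warehouses working against the same data without competing for compute:

-- Two independent warehouses, sized for different workloads
CREATE WAREHOUSE IF NOT EXISTS ETL_WH WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE IF NOT EXISTS BI_WH  WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60  AUTO_RESUME = TRUE;

-- Session A: heavy transformation runs on ETL_WH
USE WAREHOUSE ETL_WH;
CREATE OR REPLACE TABLE ANALYTICS.PUBLIC.DAILY_SALES AS
SELECT ORDER_DATE, SUM(AMOUNT) AS TOTAL_AMOUNT
FROM ANALYTICS.PUBLIC.ORDERS
GROUP BY ORDER_DATE;

-- Session B: dashboards read the same data on BI_WH at the same time
USE WAREHOUSE BI_WH;
SELECT * FROM ANALYTICS.PUBLIC.DAILY_SALES ORDER BY ORDER_DATE DESC LIMIT 30;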
What is a Data Warehouse?
A data warehouse is a centralized repository designed to store large volumes of historical data from various disparate sources, optimized for querying, reporting, and business intelligence rather than transactional processing.
The ETL (Extract, Transform, Load) process is the traditional method used to populate a data warehouse:
- Data is extracted from operational systems (like databases, CRMs, ERPs), transformed (cleaned, standardized, aggregated, joined) into a consistent and usable format, and then loaded into the warehouse.
- Snowflake serves as a powerful, cloud-native data warehouse and is a natural destination for these ETL pipelines. Because storage and compute are separated, it handles the “load” phase efficiently and increasingly supports ELT (Extract, Load, Transform) patterns, where raw data is loaded into Snowflake first and transformations are then performed directly inside Snowflake using its scalable compute.
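As a rough illustration of that ELT flow, here is a hedged sketch (stage, table, and column names are hypothetical): raw files are loaded as-is, then cleaned with SQL inside Snowflake:

-- 1) Extract + Load: copy raw files from a stage into a landing table, untransformed
COPY INTO RAW.ORDERS_RAW
FROM @RAW.ORDERS_STAGE
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- 2) Transform: clean and reshape the data inside Snowflake using its scalable compute
CREATE OR REPLACE TABLE ANALYTICS.ORDERS_CLEAN AS
SELECT ORDER_ID,
       TRY_TO_DATE(ORDER_DATE)      AS ORDER_DATE,
       UPPER(TRIM(COUNTRY_CODE))    AS COUNTRY_CODE,
       TRY_TO_NUMBER(AMOUNT, 12, 2) AS AMOUNT
FROM RAW.ORDERS_RAW
WHERE ORDER_ID IS NOT NULL;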
How Snowflake Can Benefit a Finance Organization
Finance organizations are often at the heart of data-driven decision-making, dealing with sensitive information, complex reporting requirements, and the need for accuracy and timeliness. Snowflake offers significant advantages for them:
- Single Source of Truth (SSoT):
- Finance departments often pull data from various systems (ERP, CRM, HR, market data feeds). Snowflake can act as the central repository, consolidating all this data into a governed and reliable SSoT. This reduces discrepancies and ensures everyone is working from the same numbers for reporting, forecasting, and analysis.
- Enhanced Performance & Scalability for Reporting and Analytics:
- Month-end, quarter-end, and year-end closing processes can be incredibly compute-intensive. Snowflake’s ability to instantly scale virtual warehouses up ensures that complex financial reports, reconciliations, and analytical queries run quickly, even during peak demand. Warehouses can then be scaled down or suspended to save costs.
- Secure and Governed Data Sharing:
- Finance needs to share data securely with auditors, regulators, internal departments (e.g., sales for commission calculations), and sometimes external partners. Snowflake’s Secure Data Sharing allows live, read-only access to specific datasets without copying or moving data, maintaining full control and auditability (a minimal sharing sketch appears after this list).
- Cost Optimization and Predictability:
- The pay-per-use model (separate pricing for storage and compute) means finance only pays for the resources consumed. The ability to auto-suspend virtual warehouses during idle periods significantly reduces costs compared to traditional, always-on systems. This makes budgeting for data infrastructure more predictable.
- Robust Security and Compliance:
- Financial data is highly sensitive. Snowflake provides enterprise-grade security features like end-to-end encryption (in transit and at rest), role-based access control (RBAC), multi-factor authentication (MFA), network policies, and support for compliance standards relevant to finance (e.g., SOC 2 Type II, PCI DSS, HIPAA, and GDPR).
- Advanced Analytics, AI/ML, and Forecasting:
- Beyond standard reporting, finance teams can leverage Snowflake to perform more sophisticated analyses like predictive forecasting, risk modeling, fraud detection, and customer profitability analysis. Snowflake’s ability to handle diverse data types and integrate with data science tools empowers these advanced use cases.
- Reduced IT Overhead and Faster Time to Value:
- As a SaaS platform, Snowflake eliminates the need for finance or IT teams to manage hardware, software updates, or infrastructure maintenance. This frees up valuable resources to focus on strategic data initiatives rather than operational tasks, leading to faster insights and decision-making.
- Agility for Evolving Business Needs:
- The finance landscape is constantly changing with new regulations, business models, and M&A activities. Snowflake’s flexible architecture allows finance teams to adapt quickly, integrate new data sources, and modify analytical models without lengthy IT projects.
By addressing these key areas, Snowflake empowers finance organizations to transform their data operations from a cost center into a strategic asset, driving better financial planning, risk management, and overall business performance.
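To illustrate the Secure Data Sharing point above, here is a minimal sketch, assuming hypothetical database, table, and account names, of sharing a read-only reporting table with an auditor’s Snowflake account:

-- Create a share and grant read-only access to specific objects
CREATE SHARE FINANCE_AUDIT_SHARE;
GRANT USAGE ON DATABASE FINANCE_DB TO SHARE FINANCE_AUDIT_SHARE;
GRANT USAGE ON SCHEMA FINANCE_DB.REPORTING TO SHARE FINANCE_AUDIT_SHARE;
GRANT SELECT ON TABLE FINANCE_DB.REPORTING.GL_BALANCES TO SHARE FINANCE_AUDIT_SHARE;

-- Make the share available to a consumer account (identifier is a placeholder)
ALTER SHARE FINANCE_AUDIT_SHARE ADD ACCOUNTS = MYORG.AUDITOR_ACCOUNT;

The consuming account can then create a database from the share and query the data live; no copies are made.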
Scaling Policy
Beyond manually resizing a virtual warehouse (scaling up/down), Snowflake offers powerful auto-scaling capabilities for multi-cluster warehouses. This is governed by the scaling policy, which dictates how Snowflake automatically adds or removes compute clusters within that warehouse to handle fluctuating query loads. When you configure a multi-cluster warehouse, you set a minimum and maximum number of clusters. The scaling policy then determines when to start additional clusters (scale out) as query concurrency increases or query queues build up, up to the defined maximum. Conversely, it will shut down idle clusters (scale in) to conserve credits, down to the defined minimum.
There are two scaling policies (a brief example of setting one follows this list):
- Standard (default): Favors starting additional clusters to minimize query queuing, at the cost of potentially consuming more credits.
- Economy: Favors conserving credits by keeping running clusters fully loaded, which may allow queries to queue briefly before a new cluster is started.
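Here is a brief, hedged example of setting the policy (the warehouse name is hypothetical; multi-cluster warehouses require Enterprise Edition or higher):

-- Multi-cluster warehouse that scales out between 1 and 4 clusters
CREATE WAREHOUSE IF NOT EXISTS BI_WH
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';

-- Later, favor credit conservation over immediate scale-out
ALTER WAREHOUSE BI_WH SET SCALING_POLICY = 'ECONOMY';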
Basic SQL Operation Examples
-- ==================================
-- WAREHOUSE MANAGEMENT
-- ==================================

-- Create a new X-Small virtual warehouse
-- AUTO_SUSPEND: Number of seconds of inactivity before suspending (e.g., 600 seconds = 10 minutes)
-- AUTO_RESUME: Automatically resume when a query is submitted
-- INITIALLY_SUSPENDED: Starts the warehouse in a suspended state
-- MIN_CLUSTER_COUNT & MAX_CLUSTER_COUNT: For multi-cluster warehouses
-- SCALING_POLICY: 'STANDARD' or 'ECONOMY' for multi-cluster warehouses
CREATE WAREHOUSE IF NOT EXISTS MY_XS_WH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 600
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE
  COMMENT = 'My general purpose X-Small warehouse';

CREATE WAREHOUSE IF NOT EXISTS MY_SCALABLE_WH
  WAREHOUSE_SIZE = 'SMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE
  COMMENT = 'My auto-scaling warehouse for BI';

-- View all warehouses
SHOW WAREHOUSES;

-- Alter an existing warehouse (e.g., resize, change auto-suspend)
ALTER WAREHOUSE MY_XS_WH SET WAREHOUSE_SIZE = 'SMALL';
ALTER WAREHOUSE MY_XS_WH SET AUTO_SUSPEND = 300;

-- Suspend a warehouse (stops billing for compute)
ALTER WAREHOUSE MY_XS_WH SUSPEND;

-- Resume a warehouse
ALTER WAREHOUSE MY_XS_WH RESUME;

-- Drop a warehouse (be careful!)
DROP WAREHOUSE IF EXISTS MY_TEMP_WH;

-- Use a specific warehouse for the current session
USE WAREHOUSE MY_XS_WH;

-- ==================================
-- DATABASE & SCHEMA MANAGEMENT
-- ==================================

-- Create a new database
CREATE DATABASE IF NOT EXISTS MY_DATABASE
  COMMENT = 'Primary database for my projects';

-- View all databases
SHOW DATABASES;

-- Use a specific database (sets the context for subsequent commands)
USE DATABASE MY_DATABASE;

-- Create a new schema within the current database
CREATE SCHEMA IF NOT EXISTS MY_SCHEMA
  COMMENT = 'Schema for raw data';
CREATE SCHEMA IF NOT EXISTS ANALYTICS_SCHEMA
  COMMENT = 'Schema for analytical views and tables';

-- View all schemas in the current database
SHOW SCHEMAS;

-- Use a specific schema (sets the context for tables, views, etc.)
USE SCHEMA MY_DATABASE.MY_SCHEMA;
-- Or if MY_DATABASE is already in use:
-- USE SCHEMA MY_SCHEMA;

-- Drop a schema (be careful! This drops all objects within it)
DROP SCHEMA IF EXISTS MY_DATABASE.OLD_SCHEMA;

-- Drop a database (be careful! This drops all schemas and objects within it)
DROP DATABASE IF EXISTS MY_OLD_DATABASE;

-- ==================================
-- TABLE MANAGEMENT (DDL)
-- ==================================

-- Ensure you are in the correct database and schema context
USE DATABASE MY_DATABASE;
USE SCHEMA MY_SCHEMA;

-- Create a new table
CREATE TABLE IF NOT EXISTS EMPLOYEES (
  ID NUMBER(10,0) NOT NULL PRIMARY KEY,                 -- Integer up to 10 digits
  FIRST_NAME VARCHAR(100),
  LAST_NAME VARCHAR(100),
  EMAIL VARCHAR(100) UNIQUE,
  HIRE_DATE DATE,
  SALARY NUMBER(12,2),                                  -- Numeric with 2 decimal places
  DEPARTMENT_ID NUMBER(5,0),
  IS_ACTIVE BOOLEAN DEFAULT TRUE,
  CREATED_AT TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()  -- Timestamp without timezone
);

-- Describe a table structure
DESCRIBE TABLE EMPLOYEES;

-- Show all tables in the current schema
SHOW TABLES;

-- Add a new column to an existing table
ALTER TABLE EMPLOYEES ADD COLUMN PHONE_NUMBER VARCHAR(20);

-- Modify an existing column (e.g., increase a VARCHAR length; Snowflake only permits certain type changes)
-- ALTER TABLE EMPLOYEES ALTER COLUMN PHONE_NUMBER SET DATA TYPE VARCHAR(30);

-- Rename a column
ALTER TABLE EMPLOYEES RENAME COLUMN PHONE_NUMBER TO MOBILE_NUMBER;

-- Drop a column
ALTER TABLE EMPLOYEES DROP COLUMN MOBILE_NUMBER;

-- Rename a table
ALTER TABLE EMPLOYEES RENAME TO STAFF_MEMBERS;
-- (Rename back for subsequent examples)
ALTER TABLE STAFF_MEMBERS RENAME TO EMPLOYEES;

-- Create a table like another table (copies structure, not data)
CREATE TABLE EMPLOYEES_BACKUP LIKE EMPLOYEES;

-- Create a table from a query (CTAS - copies structure and data)
CREATE TABLE HIGH_EARNERS AS
SELECT ID, FIRST_NAME, LAST_NAME, SALARY
FROM EMPLOYEES
WHERE SALARY > 100000;

-- Drop a table (be careful!)
DROP TABLE IF EXISTS EMPLOYEES_BACKUP;
DROP TABLE IF EXISTS HIGH_EARNERS;

-- Truncate a table (removes all data, but keeps the table structure; faster than DELETE without WHERE)
TRUNCATE TABLE IF EXISTS EMPLOYEES;

-- ==================================
-- DATA MANIPULATION (DML)
-- ==================================

-- Use the correct context
USE WAREHOUSE MY_XS_WH; -- Ensure a warehouse is active
USE DATABASE MY_DATABASE;
USE SCHEMA MY_SCHEMA;

-- Insert a single row of data (example emails use placeholder addresses)
INSERT INTO EMPLOYEES (ID, FIRST_NAME, LAST_NAME, EMAIL, HIRE_DATE, SALARY, DEPARTMENT_ID)
VALUES (1, 'Alice', 'Smith', 'alice.smith@example.com', '2022-01-15', 75000.00, 101);

-- Insert multiple rows of data
INSERT INTO EMPLOYEES (ID, FIRST_NAME, LAST_NAME, EMAIL, HIRE_DATE, SALARY, DEPARTMENT_ID)
VALUES
  (2, 'Bob', 'Johnson', 'bob.johnson@example.com', '2021-07-20', 82000.00, 102),
  (3, 'Carol', 'Williams', 'carol.williams@example.com', '2023-03-01', 68000.00, 101);

-- Select all data from a table
SELECT * FROM EMPLOYEES;

-- Select specific columns
SELECT FIRST_NAME, LAST_NAME, EMAIL FROM EMPLOYEES;

-- Select with a filter (WHERE clause)
SELECT * FROM EMPLOYEES WHERE SALARY > 70000;
SELECT * FROM EMPLOYEES WHERE DEPARTMENT_ID = 101 AND IS_ACTIVE = TRUE;

-- Select with ordering
SELECT * FROM EMPLOYEES ORDER BY LAST_NAME ASC, FIRST_NAME ASC;
SELECT * FROM EMPLOYEES ORDER BY SALARY DESC;

-- Update existing data
UPDATE EMPLOYEES
SET SALARY = SALARY * 1.05,
    EMAIL = 'alice.smith.updated@example.com'
WHERE ID = 1;

-- Delete data (be careful!)
DELETE FROM EMPLOYEES WHERE ID = 3;
-- If you want to delete all rows, TRUNCATE is usually faster for large tables:
-- TRUNCATE TABLE EMPLOYEES;

-- ==================================
-- SESSION CONTEXT
-- ==================================

-- Show current session context
SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA();
Loading Data
Stages
In Snowflake, a stage is a named location where data files are stored (or “staged”) before being loaded into Snowflake tables, or where data is placed after being unloaded from tables. Think of it as an intermediary storage area or a “loading dock” for your data. Stages are essential for bulk data operations, allowing Snowflake to efficiently read from or write to files in various formats (like CSV, JSON, Parquet, Avro, ORC, XML).
There are two main types of stages:
- Internal Stages: These are storage locations managed directly by Snowflake within your Snowflake account. They are convenient for temporary storage, ad-hoc loading, or when you don’t want to manage external cloud storage accounts. Snowflake encrypts data in internal stages by default. There are three kinds of internal stages:
- User stages (private to each user, @~),
- Table stages (tied to a specific table, @%table_name), and
- Named internal stages (database objects accessible by users with appropriate privileges).
- External Stages: These point to data files stored in your own cloud storage locations outside of Snowflake, such as Amazon S3 buckets, Google Cloud Storage buckets, or Microsoft Azure Blob storage containers (and S3-compatible services like MinIO). External stages are ideal when your data already resides in cloud storage, or when you need to share data with other applications or processes outside Snowflake. You define the URL and necessary credentials (or a storage integration) for Snowflake to access these external locations.
The primary use of stages is with the COPY INTO <table> command to load data from staged files into Snowflake tables, and the COPY INTO @stage command to unload data from Snowflake tables into files in a stage. They facilitate efficient, parallelized data movement between Snowflake and file-based storage.
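As a hedged sketch of both directions, reusing the stage and table names from elsewhere in this post (adjust paths, patterns, and formats to your data):

-- Load staged CSV files into a table
COPY INTO MY_DATABASE.MY_SCHEMA.EMPLOYEES
FROM @my_internal_stage
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1)
PATTERN = '.*employees.*[.]csv'
ON_ERROR = 'ABORT_STATEMENT';

-- Unload query results from a table into files in a stage
COPY INTO @my_internal_stage/exports/employees_
FROM (SELECT * FROM MY_DATABASE.MY_SCHEMA.EMPLOYEES)
FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP')
OVERWRITE = TRUE;

COPY INTO runs on whichever virtual warehouse is active for the session, so loading large sets of files can be parallelized simply by using a larger warehouse.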
Here are some common operations for managing stages in Snowflake:
1. Creating Stages:
- Creating a Named Internal Stage:
-- Create a basic internal stage
CREATE OR REPLACE STAGE my_internal_stage;

-- Create an internal stage with a specific file format and copy options
CREATE OR REPLACE STAGE my_csv_internal_stage
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1)
  COPY_OPTIONS = (ON_ERROR = 'SKIP_FILE');

SHOW STAGES LIKE 'my_csv_internal_stage';
- Creating an External Stage (e.g., for AWS S3):
(Using a Storage Integration is the recommended and more secure method for external stages)
-- Step 1: Create a Storage Integration (if you haven't already)
-- This is a one-time setup for a given S3 bucket/path and credentials approach
CREATE OR REPLACE STORAGE INTEGRATION s3_integration_example
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::YOUR_AWS_ACCOUNT_ID:role/your_snowflake_access_role' -- Role Snowflake will assume
  STORAGE_ALLOWED_LOCATIONS = ('s3://your-s3-bucket-name/path1/', 's3://your-s3-bucket-name/path2/');

-- For direct key/secret access (less recommended than an IAM role for AWS S3, but viable for
-- S3-compatible services like MinIO), credentials can instead be supplied on the stage itself, e.g.:
-- CREDENTIALS = (AWS_KEY_ID = 'YOUR_AWS_ACCESS_KEY_ID' AWS_SECRET_KEY = 'YOUR_AWS_SECRET_ACCESS_KEY')
-- together with a URL/endpoint pointing at the S3-compatible service.

-- Step 2: Create the External Stage using the Storage Integration
CREATE OR REPLACE STAGE my_s3_external_stage
  STORAGE_INTEGRATION = s3_integration_example
  URL = 's3://your-s3-bucket-name/data_source_a/' -- Specific path within the allowed locations
  FILE_FORMAT = (TYPE = 'PARQUET');

SHOW STAGES LIKE 'my_s3_external_stage';
2. Listing Files in a Stage:
-- List files in an internal stage
LIST @my_internal_stage;

-- List files in an external stage
LIST @my_s3_external_stage;

-- List files matching a pattern
LIST @my_s3_external_stage PATTERN='.*.parquet';
3. Putting Files into an Internal Stage (from local machine):
(PUT is executed from a client such as SnowSQL or a Snowflake driver/connector; it cannot be run directly in a Snowflake web UI worksheet.)
-- From SnowSQL CLI:
-- PUT file:///path/to/your/local/file.csv @my_internal_stage;
-- PUT file:///path/to/your/local/data_*.json @my_internal_stage AUTO_COMPRESS=TRUE OVERWRITE=TRUE;
4. Getting Files from an Internal Stage (to local machine):
(Also executed using the SnowSQL client.)
-- From SnowSQL CLI:
-- GET @my_internal_stage/file.csv file:///path/to/your/local/destination/;
-- GET @my_internal_stage file:///path/to/your/local/destination/ PATTERN='.*.json.gz';
5. Removing Files from a Stage:
-- Remove a specific file from an internal stage (for internal stages, REMOVE actually deletes the file)
REMOVE @my_internal_stage/file.csv;

-- Remove all files in the root of an internal stage
REMOVE @my_internal_stage;

-- Remove files matching a pattern
REMOVE @my_internal_stage PATTERN='.*.json.gz';

-- Note on external stages: Snowflake doesn't delete from external storage, but it removes the files
-- from its listing for that stage. For true deletion from external S3, you'd use S3 tools.
6. Altering a Stage:
You can alter properties like FILE_FORMAT, COPY_OPTIONS, URL (for external), or STORAGE_INTEGRATION (for external).
-- Alter the file format of an existing stage
CREATE OR REPLACE FILE FORMAT my_new_json_format
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE;

ALTER STAGE my_internal_stage
  SET FILE_FORMAT = (FORMAT_NAME = 'my_new_json_format');

-- Change the URL for an external stage (if allowed by its storage integration)
ALTER STAGE my_s3_external_stage
  SET URL = 's3://your-s3-bucket-name/new_path/';
7. Describing a Stage:
To see the properties and definition of a stage:
DESCRIBE STAGE my_internal_stage;
DESC STAGE my_s3_external_stage;
8. Dropping a Stage:
This removes the named stage object. For internal stages, it also deletes all files stored within that stage. For external stages, it only removes the Snowflake stage object; the files in your external cloud storage remain untouched.
DROP STAGE IF EXISTS my_internal_stage;
DROP STAGE IF EXISTS my_s3_external_stage;