Snowflake is expensive! If you don't architect it right... How we realized 6x savings.

We often hear claims that Snowflake is expensive. There may be some truth to this, but on a platform where you only pay for what you use... maybe it's just not architected right.

We've seen 6x savings from fixing inefficiencies in a Snowflake setup. Here are some tips to get it right:

Serverless Architecture

If you're already using a cloud platform (AWS, Azure, GCP), use it! Opt for a hybrid solution that connects data from your cloud platform to Snowflake. When we first started with Snowflake many years ago, the primary use case was for data warehousing. There were whispers of wider ML/AI use cases, but any development was still in the background. For a platform built entirely from the ground up, advanced data use cases were surely on the roadmap, but at that time, they weren’t yet in production.

We opted for a solution that provided flexibility in our architecture: we can handle traditional BI workloads through Snowflake while also retaining our raw data in our data lake on S3 for future ML/AI use cases. Not only did this provide a flexible architecture, but it also created cost efficiencies.

Snowpipe is Snowflake's serverless tool for continuous, micro-batch data ingestion. It connects to your cloud storage and automates the loading of new files as they arrive. Under the hood, Snowflake still uses compute resources (warehouses) to power the ingestion, but because it's serverless, users don't need to provision a dedicated warehouse. Snowflake manages this compute across its users for a fully optimized allocation of resources.
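As a rough sketch, setting this up takes an external stage and a pipe with auto-ingest enabled (the stage, integration, and table names below are hypothetical, and the storage integration is assumed to already exist):

```sql
-- Hypothetical names throughout; assumes my_s3_integration already exists.
CREATE STAGE raw_db.public.s3_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = my_s3_integration;

-- Serverless pipe: AUTO_INGEST reacts to S3 event notifications,
-- so no dedicated warehouse is provisioned for ingestion.
CREATE PIPE raw_db.public.events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_db.public.events_raw
  FROM @raw_db.public.s3_stage
  FILE_FORMAT = (TYPE = 'JSON');
```

With this in place, you pay per-file for Snowpipe compute rather than keeping a warehouse running while you wait for new data.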

In contrast, many ingestion methods can be inefficient. We’ve migrated from ELT tools that ran simultaneous batching and writing, resulting in warehouses staying on longer than necessary—leading to an increased bill. Additionally, some off-the-shelf tools have expensive merge logic, whether it’s creating a separate metadata table or not optimizing the DML to update existing records. Where possible, try an append-only integration into your raw data and build your transformations with Snowflake’s tasks and streams.

An append-only pattern ensures that no data is lost from your raw layer during transformation. You can also leverage Snowflake's serverless features as part of your ELT logic. A Snowflake stream tracks data manipulation (DML) changes to a source table or view, including inserts, updates, and deletes—essentially, change data capture. The resulting change records are available for querying, so a task can read from the stream and apply the required downstream transformation logic to your transformed table.

Snowflake tasks can also run serverless—omit the warehouse and Snowflake allocates shared compute for you. Combined with a stream, a task processes only the changed rows rather than rescanning the whole table, optimizing both reads and writes.
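A minimal sketch of the stream-plus-task pattern, again with hypothetical table and column names:

```sql
-- Stream captures DML changes (CDC) on the raw table.
CREATE STREAM raw_db.public.events_stream
  ON TABLE raw_db.public.events_raw;

-- Serverless task: no WAREHOUSE parameter, so Snowflake manages compute.
-- The WHEN clause skips runs (and compute charges) when the stream is empty.
CREATE TASK raw_db.public.transform_events
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('raw_db.public.events_stream')
AS
  INSERT INTO analytics_db.public.events_clean
  SELECT id, payload:user_id::STRING AS user_id, loaded_at
  FROM raw_db.public.events_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK raw_db.public.transform_events RESUME;
```

Reading from the stream advances its offset, so each run only sees changes since the previous successful run.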

Understand Your Workloads

Understanding your workloads is key to optimizing Snowflake. The unique architecture requires rethinking conventional optimization techniques. Evenly spreading out compute, as you might with other systems, is likely not the best approach. We won’t dive too deep here, but a few tips to be aware of:

  • Understand your cache requirements and map them to your workloads. For instance, ingestion deserves a dedicated warehouse with an aggressive auto-suspend setting. The default auto-suspend on Snowflake is 10 minutes, meaning the warehouse only shuts down after 10 minutes of inactivity. Ingestion doesn't benefit from the warehouse cache, so those extra minutes of idle compute are simply wasted—and compounded daily, they add up noticeably on the monthly bill.
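Tightening this is a one-line change per warehouse. A sketch, using a hypothetical warehouse name (AUTO_SUSPEND is specified in seconds):

```sql
-- Dedicated ingestion warehouse with a short auto-suspend:
-- ingestion gets no cache benefit, so 60s instead of the 600s default.
CREATE WAREHOUSE ingest_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;

-- Or tighten an existing warehouse:
ALTER WAREHOUSE ingest_wh SET AUTO_SUSPEND = 60;
```

For BI workloads that do benefit from a warm cache, a longer auto-suspend can be the right trade-off—the point is to set it per workload, not globally.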

Over-engineering Your Architecture

Finally, let's talk about over-engineering. In an effort to build the "perfect" setup, teams end up adding unnecessary complexity. We'll focus here on multi-tenant architectures. Building your architecture for multi-tenant use cases requires careful planning around security, authentication, governance, and cost, among other factors.

One of my favorite articles on design patterns for multi-tenant applications on Snowflake can be found here.

It was the source for building a highly scalable internal data platform. The flowchart is great and guides you to the ideal solution for your organization. However, there are cost implications depending on the setup you choose. Here are a few things to consider:

  • Account per tenant: Each of your tenants has their own dedicated Snowflake account (or reader account). This will be the most expensive route, though it offers complete isolation from other tenants. Warehouse resources cannot be shared.
  • Object per tenant: You can consolidate storage at the tenant level, meaning each tenant has their own dedicated databases. From here, you can decide whether to use shared or dedicated compute. Your decision here will significantly impact your bill.
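To make the object-per-tenant trade-off concrete, here is a minimal sketch (tenant and warehouse names are hypothetical) of dedicated storage with shared compute:

```sql
-- Object per tenant: dedicated database, dedicated role, shared warehouse.
CREATE DATABASE tenant_acme_db;
CREATE ROLE tenant_acme_role;
GRANT USAGE ON DATABASE tenant_acme_db TO ROLE tenant_acme_role;

-- Shared compute keeps the bill down; swap this for a dedicated
-- warehouse per tenant if isolation matters more than cost.
GRANT USAGE ON WAREHOUSE shared_wh TO ROLE tenant_acme_role;
```

Shared compute means tenants contend for the same warehouse, so noisy-neighbor risk is the price of the lower bill.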

The flowchart should still be the guide.

Wrapping Up

These are just a few of the many factors to consider when looking to optimize your Snowflake bill. We’ve encountered a host of other considerations, which we’ll expand on at a later date (e.g., proper scaling in Snowflake: horizontal vs. vertical). Additionally, be on the lookout for overall Total Cost of Ownership optimization articles, where we take a wider view of common scaling pain points and how to plan and optimize accordingly.

If you'd like to learn more, have any questions, or want to get in touch, feel free to reach out by filling out our contact form here.
