
The Rise of AI Data Cloud

The pendulum swings relentlessly. A common theme is emerging after talking to many customers: they are once again asking for an integrated and simplified data platform, without having to stitch together multiple disparate software products. Call it a hangover from the drunkenness of the Modern Data Stack. The integrated data platform is also core to enterprise architecture for achieving intelligent data applications beyond traditional analytics.

However, there is a twist to this — interoperability. To better prepare for pendulum swings, users want their platform to adopt open standards so that there is some semblance of future-proofing.

Snowflake Summit 2024 reflected this sentiment as it showcased its deep commitment to data and AI. The key highlights are:

  • Snowflake is expanding its appeal beyond the core data analyst persona. In the last few years, it added app development capabilities. Now, it’s adding more data science, operational, and governance features through conversational interfaces.

  • Snowflake is not only going wider, but it is also adding capabilities vertically. This includes more foundational capabilities that are starting to encroach on its ecosystem. However, the ecosystem is so large that we can expect it to grow even further. Some vendors are delighted because their category, such as DataOps, is now being elevated by Snowflake's increased focus.

  • While Snowflake is expanding the platform capabilities beyond the core data platform for analysts, there is a gap between market perception and its messaging and positioning. The market still perceives Snowflake as being behind in data science and as higher in cost, although it has made positive strides on both fronts.

In this document, we examine key announcements and their impact on customers and the industry. Like the 2023 Snowflake Summit blog, this document does not specify the latest availability status (private preview, public preview, or general availability) of the new announcements.

Data Management

Snowflake’s north star is to handle both structured and unstructured data, with discovery and governance, at the best possible cost performance. Unstructured data is handled through Document AI, which provides a workflow to parse PDFs. This technology is based on the August 2022 acquisition of Poland-based Applica and has been enhanced by Snowflake’s Arctic LLMs. In fact, it uses Arctic TILT, a mere 0.8B parameter model with a very high benchmark rating. More on the Arctic LLMs later in the document.

Managed Iceberg Tables went GA at the Summit. They are optimized to deliver performance comparable to native tables. This year, Snowflake also announced the ability to use any analytical compute engine that supports the Iceberg format.

The two open table formats predominantly in use are Iceberg and Delta. A bevy of database management systems support Iceberg, like Snowflake, Cloudera, IBM watsonx.data, and ClickHouse, as do object-store-based fully managed data services, like Salesforce Data Cloud, Fivetran, and Confluent. Some products, like Google’s BigQuery and Amazon Redshift, support multiple formats. While the Delta format is mainly used by Databricks, Databricks has created a translation layer called UniForm to interoperate between the two formats. Microsoft Fabric is based on OneLake, which uses Delta as its native table format. Under the hood, Microsoft Fabric uses Onehouse’s Apache XTable translation layer to support both Delta and Iceberg. To complete the story, Onehouse also supports the Hudi table format.

Before we go any further, let’s address a huge spanner that Databricks threw into the works when it announced its acquisition of Tabular, a company founded by the people who originally created the Iceberg table format while at Netflix. Databricks paid an eye-watering sum of between $1B and $2B for a company with 40 employees and funding of $37M. The announcement was timed to the minute, landing just as the Snowflake Summit keynote got started. We will leave any commentary on this acquisition out of this blog and instead focus on Snowflake’s Polaris and Horizon catalogs, which are needed to achieve Iceberg interoperability.

Polaris Catalog

Polaris Catalog is built on the Iceberg REST API and will be open-sourced in 90 days. It tracks technical metadata, such as table names, columns, partitions, and bucket paths on object stores like Amazon S3 or Google Cloud Storage. With this metadata, any supported compute engine can act on the underlying data in a read/write manner. These engines include Snowflake, Spark, Flink, Trino/Starburst, Presto, and Dremio.

Polaris Catalog deployment options range from Snowflake-hosted (to be in public preview soon) to self-hosted as Docker containers managed by Kubernetes.

Polaris Catalog supports coarse-grained role-based access control. Fine-grained access control, comprising row-level and column-level security, is done in Horizon Catalog, which is covered next. One way of thinking of Polaris Catalog is as a mechanism to ensure multi-engine concurrency control and query planning.
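To make the concurrency-control role concrete: in the Iceberg design, a catalog's essential job is to atomically swap each table's pointer to its current metadata file, so that two engines cannot both commit on top of the same starting state. The sketch below is a toy in-memory model of that idea in Python. It is not Polaris code and not the Iceberg REST API; the class, method names, table name, and bucket paths are all hypothetical.

```python
from dataclasses import dataclass, field
from threading import Lock


class CommitConflict(Exception):
    """Raised when another engine committed to the table first."""


@dataclass
class ToyCatalog:
    # table name -> path of the table's current metadata file (hypothetical paths)
    _pointers: dict = field(default_factory=dict)
    _lock: Lock = field(default_factory=Lock)

    def current(self, table: str):
        """Return the current metadata pointer, or None if the table is unknown."""
        return self._pointers.get(table)

    def commit(self, table: str, expected, new: str) -> None:
        """Swap the pointer only if it still equals what the caller read earlier."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: pointer moved, re-read and retry")
            self._pointers[table] = new


catalog = ToyCatalog()
catalog.commit("sales.orders", None, "s3://bucket/orders/v1.metadata.json")

# Two engines read the same starting point...
seen_by_spark = catalog.current("sales.orders")
seen_by_trino = catalog.current("sales.orders")

# ...the first commit wins; the second must re-read and retry.
catalog.commit("sales.orders", seen_by_spark, "s3://bucket/orders/v2.metadata.json")
try:
    catalog.commit("sales.orders", seen_by_trino, "s3://bucket/orders/v2b.metadata.json")
    conflict = False
except CommitConflict:
    conflict = True
```

The design choice worth noticing is that the engines never coordinate with each other directly. Each one only needs to agree with the catalog, which is why a shared catalog is a prerequisite for safe multi-engine writes.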

Horizon

Horizon is an overarching solution with built-in governance, discovery, and access for content internal to an organization, as well as sourced from third parties. It has a unified set of compliance, security, privacy, interoperability, and access capabilities. Horizon includes Snowflake’s internal technical catalog, business catalog, and Trust Center (see below). Horizon was released in November 2023, for all data and application assets. It will extend Polaris Catalog by adding column masking policies, row access policies, object tagging and data sharing capabilities.
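As a rough illustration of what fine-grained policies do, the Python sketch below filters rows by region (a row access policy) and hides part of an email column (a column masking policy) depending on the caller's role. This is only a conceptual model under assumed rules; the role names, columns, and functions are hypothetical and do not represent Horizon's actual policy syntax or API.

```python
def mask_email(value: str) -> str:
    """Hide the local part of an email address and keep the domain."""
    _, _, domain = value.partition("@")
    return "***@" + domain


def apply_policies(rows, role, region):
    """Return the rows a caller may see, with sensitive columns masked.

    Assumed rules for this sketch: ADMIN sees everything unmasked;
    every other role sees only rows from its own region, with emails masked.
    """
    visible = []
    for row in rows:
        # Row access policy: non-admins are restricted to their own region.
        if role != "ADMIN" and row["region"] != region:
            continue
        out = dict(row)
        # Column masking policy: non-admins get a masked email.
        if role != "ADMIN":
            out["email"] = mask_email(row["email"])
        visible.append(out)
    return visible


rows = [
    {"id": 1, "region": "EMEA", "email": "ana@example.com"},
    {"id": 2, "region": "APAC", "email": "li@example.com"},
]
analyst_view = apply_policies(rows, role="ANALYST", region="EMEA")
admin_view = apply_policies(rows, role="ADMIN", region="EMEA")
```

Here the analyst sees one row with a masked email, while the admin sees both rows unchanged. The underlying data is identical in both cases, and only the view differs.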

Horizon has all the features of a full-blown data catalog like business glossary and lineage, and is being expanded to become a registry for AI models. Its object descriptions can then be used to develop a semantic layer, which is available as a YAML file and is used by the Cortex Analyst (see below).

Cost Management

Snowflake’s slick new cost management user interface has enhanced capabilities such as budgets, which allow users to set spending limits and notifications. It has three goals:

  • Cost transparency shows a spend overview at the account level and across teams.

  • Cost control through allocation at the account level and, in the future, at the query level.

  • Cost optimization through rule-based heuristics leading to recommendations.

Cost is such a key topic in this ecosystem that several players offer solutions to help reduce costs, like Capital One Software’s Slingshot.
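The budget idea of a spending limit with notifications can be pictured with a small Python sketch. It assumes a monthly limit measured in credits and a set of notification thresholds expressed as fractions of that limit. The class, field names, and threshold values are hypothetical and are not taken from Snowflake's interface.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    name: str
    monthly_limit: float                 # assumed unit: credits per month
    notify_at: tuple = (0.5, 0.8, 1.0)   # hypothetical notification thresholds

    def alerts(self, spend: float) -> list:
        """Return one message per threshold that current spend has crossed."""
        return [
            f"{self.name}: {threshold:.0%} of limit reached"
            for threshold in self.notify_at
            if spend >= threshold * self.monthly_limit
        ]


team_budget = Budget(name="analytics-team", monthly_limit=1000.0)
quiet = team_budget.alerts(spend=100.0)
warnings = team_budget.alerts(spend=850.0)
```

With a limit of 1000 credits, a spend of 850 crosses the 50% and 80% thresholds but not the 100% one, so two notifications are produced.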
