Lakehouse Modeling Playbook: When to Use Star Schemas, OBTs, or Data Vault on Databricks
How to choose and optimize among Star Schemas, One Big Table (OBT), and Data Vault models on Databricks across the layers and workloads of the lakehouse architecture, balancing performance, governance, and cost
Executive Summary
The Databricks Lakehouse Platform robustly supports multiple data modeling techniques, but strategic selection and optimization are critical for maximizing performance, controlling costs, and ensuring governance. The choice is not between a single “best” model, but about applying the right model to the right layer for the right workload. The most effective strategy is often a hybrid approach, leveraging different models across the Medallion Architecture’s Bronze, Silver, and Gold layers.
Optimization Trumps Model Choice: Liquid Clustering Delivers >20x Speed-Up
How a model is optimized is more important than which model is chosen. In a documented experiment, applying Liquid Clustering (`CLUSTER BY`) and `OPTIMIZE` to a One Big Table (OBT) model resulted in a greater than 20x task speed-up and a 3x reduction in wall-clock duration. Query time for the optimized OBT dropped to 1.13 seconds, outperforming a standard relational model which took 2.6 seconds. This was achieved by reducing the number of files scanned from 7 to 2, demonstrating that physical data layout is the primary driver of performance.
The Hybrid Model Prevails: Use OBT for Agility, Star Schemas for Governance
The most successful and common pattern on Databricks is a hybrid architecture. This involves using agile models like OBT or Data Vault in the Silver Layer for rapid integration and cleansing. From there, curated, business-centric Dimensional Models (Star Schemas) are built in the Gold Layer to provide a governed, performant, and reusable semantic layer for enterprise BI and reporting. This approach balances speed of development with the need for a stable, single source of truth.
OBT Slashes Build Time but Magnifies Governance Risk
The One Big Table (OBT) model offers maximum simplicity and accelerates development by eliminating complex joins. However, it concentrates all data, including sensitive PII, into a single wide table, inherently increasing the risk and “blast radius” of a data breach. In contrast, a Dimensional Model naturally segregates sensitive data into specific dimension tables, reducing risk. Therefore, deploying Unity Catalog’s row-level filters and column-level masks is a non-negotiable prerequisite before granting any access to an OBT.
DLT Automation Reduces Maintenance Overhead for Dimensional Models
For domains requiring historical tracking, the engineering effort over time favors Dimensional Models. Delta Live Tables (DLT) provides an `AUTO CDC` feature that handles complex Slowly Changing Dimension (SCD) Type 2 logic out of the box, automatically managing history with `__START_AT` and `__END_AT` columns. Maintaining history in an OBT requires complex, manual `MERGE` statements that become operationally expensive as the table grows.
3NF is an Explicit Anti-Pattern for Lakehouse Analytics
Legacy, highly normalized models like Third Normal Form (3NF) are considered an anti-pattern for analytical workloads on Databricks. The excessive number of joins required for queries negates the benefits of the platform’s distributed engine and data skipping optimizations, leading to extremely poor performance. Such models should be confined to the Bronze ingestion layer and redesigned into denormalized structures for the Silver and Gold layers.
1. Why Modeling Strategy Determines Lakehouse ROI
Choosing a data modeling strategy on the Databricks Lakehouse is not merely a technical exercise; it is a critical business decision that directly determines the return on investment (ROI) of your data platform. The cost of a suboptimal choice compounds across performance, security, and maintenance. An inefficient model burns excess compute (DBUs), leading to higher operational costs. A poorly governed model increases the risk of data breaches and compliance failures, while a complex model inflates engineering maintenance hours. This report provides a playbook for selecting, optimizing, and governing the right modeling approach—linking each technical choice to its business, security, and cost consequences.
2. Modeling Options Deep Dive
Databricks supports a variety of modeling techniques, each with distinct strengths, limitations, and ideal placement within the Medallion Architecture. Understanding these options is the first step toward building an efficient and scalable lakehouse.
Dimensional Modeling—Star Schemas Super-charged by Photon & DLT
Dimensional Modeling, introduced by Ralph Kimball, organizes data into “facts” (measurable business events) and “dimensions” (descriptive context) to optimize for analytics.
- Definition: The most common implementation is the Star Schema, which features a central fact table linked to multiple denormalized dimension tables. This design minimizes complex joins and is intuitive for business users. A more normalized variant, the Snowflake Schema, is less common on Databricks as its additional joins can degrade performance.
- Placement: Dimensional models are the best practice for the Gold Layer, serving as the curated, business-ready presentation layer for BI tools and reporting.
- Strengths: The structure is optimized for analytical (OLAP) queries, providing fast slicing, dicing, and aggregation. It creates a consistent semantic layer for governed reporting. Databricks enhances performance with Delta Live Tables (DLT) for simplified SCD management, informational PK/FK constraints in Unity Catalog, and Liquid Clustering (see the DDL sketch after this list).
- Limitations: It requires significant upfront design effort and ongoing maintenance of ETL/ELT pipelines. The rigid structure can be less agile for exploratory analysis compared to OBT.
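To make the Gold-layer pattern concrete, here is a minimal DDL sketch of a star schema with informational PK/FK constraints and Liquid Clustering. The `gold` schema and the table and column names are illustrative assumptions, not taken from the source.

```sql
-- Hypothetical gold-layer star schema: one dimension, one fact table
CREATE TABLE gold.dim_customer (
  customer_sk    BIGINT NOT NULL,
  customer_name  STRING,
  market_segment STRING,
  CONSTRAINT pk_dim_customer PRIMARY KEY (customer_sk)  -- informational only
);

CREATE TABLE gold.fct_orders (
  order_id    BIGINT NOT NULL,
  customer_sk BIGINT,
  order_date  DATE,
  total_price DECIMAL(18,2),
  CONSTRAINT pk_fct_orders PRIMARY KEY (order_id),
  CONSTRAINT fk_orders_customer FOREIGN KEY (customer_sk)
    REFERENCES gold.dim_customer (customer_sk)           -- informational only
)
CLUSTER BY (order_date, customer_sk);                    -- Liquid Clustering keys
```

The constraints are not enforced at write time; they exist to document the model for query optimization hints, BI tools, and modelers.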
One Big Table—Rapid Prototyping & ML Feature Stores
The One Big Table (OBT) model prioritizes simplicity and speed by consolidating all data into a single, wide table.
- Definition: An OBT is a highly denormalized table that pre-joins all relevant fact and dimension attributes for a specific use case, effectively flattening the data structure (see the sketch after this list).
- Placement: OBTs are flexible. In the Silver Layer, they can serve as an agile integration table. In the Gold Layer, they provide a high-performance asset for specific use cases like ML feature engineering or single-purpose dashboards.
- Strengths: OBTs offer maximum simplicity and fast development cycles. By eliminating joins at query time, they can deliver excellent performance, especially for filtered queries on clustered columns. Governance is simplified due to fewer tables to manage.
- Limitations: The model leads to significant data redundancy and can increase storage costs. Performance degrades if queries filter on non-clustered columns, leading to large, inefficient scans. The concentration of sensitive data requires meticulous implementation of row-level security and column masking.
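As a concrete but hypothetical example of the pre-joined structure described above, the sketch below builds an OBT with a CTAS and clusters it on the columns most likely to be filtered. All table and column names are assumptions.

```sql
-- Hypothetical OBT: flatten fact and dimension attributes into one wide table
CREATE OR REPLACE TABLE silver.orders_obt
CLUSTER BY (market_segment, order_date)   -- cluster on the most common filter columns
AS
SELECT
  o.order_id,
  o.customer_id,
  o.order_date,
  o.total_price,
  c.customer_name,
  c.market_segment,
  p.product_name,
  p.category
FROM silver.orders o
JOIN silver.customers c ON o.customer_id = c.customer_id
JOIN silver.products  p ON o.product_id  = p.product_id;
```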
Data Vault—Audit-Grade Foundation for Regulated Industries
Data Vault is a modeling methodology designed for agility, scalability, and auditability, making it ideal for building an integrated enterprise data foundation.
- Definition: It separates business keys (Hubs), relationships (Links), and descriptive attributes (Satellites) into distinct components. This “write-optimized” model tracks data history and source, providing a complete audit trail (see the sketch after this list).
- Placement: The raw Data Vault is typically built in the Silver Layer, serving as an integrated and auditable foundation. From this layer, downstream data marts, often in a Star Schema format, are created in the Gold Layer for business consumption.
- Strengths: The model is highly agile and extensible; new data sources can be added with minimal impact on the existing structure. It provides a complete, auditable history, which is ideal for regulatory and compliance requirements.
- Limitations: Querying the raw Data Vault directly for analytics is extremely complex due to the high number of joins required. It is not suitable for direct BI or ad-hoc analysis and almost always requires a denormalized presentation layer (like a Star Schema) built on top.
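For illustration only, a minimal Data Vault sketch in the Silver layer might look like the following. The hash-key convention and all names are assumptions rather than a prescribed standard.

```sql
-- Hub: the business key for Customer
CREATE TABLE silver.hub_customer (
  customer_hk   STRING NOT NULL,   -- hash of the business key
  customer_bk   STRING NOT NULL,   -- business key from the source system
  load_ts       TIMESTAMP,
  record_source STRING,
  CONSTRAINT pk_hub_customer PRIMARY KEY (customer_hk)
);

-- Satellite: descriptive attributes and their history
CREATE TABLE silver.sat_customer_details (
  customer_hk    STRING NOT NULL,  -- references the hub's hash key
  load_ts        TIMESTAMP NOT NULL,
  record_source  STRING,
  customer_name  STRING,
  market_segment STRING
);

-- Link: the relationship between Order and Customer
CREATE TABLE silver.link_order_customer (
  order_customer_hk STRING NOT NULL,  -- hash of the combined business keys
  order_hk          STRING NOT NULL,
  customer_hk       STRING NOT NULL,
  load_ts           TIMESTAMP,
  record_source     STRING
);
```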
3NF & Excessive Snowflaking—Why Normalization Fails in MPP Lakes
Highly normalized models, such as Third Normal Form (3NF) or deep Snowflake Schemas, are a foundational part of traditional relational databases (OLTP systems) but are considered an anti-pattern for modern analytics on Databricks.
- Definition: 3NF focuses on eliminating data redundancy by ensuring all table attributes depend only on the primary key. Snowflaking extends a Star Schema by further normalizing dimension tables into sub-dimensions.
- Placement: Data often arrives from source systems in 3NF and may be stored as-is in the Bronze Layer. It is strongly discouraged for use in the Gold Layer.
- Reason for Failure: On a distributed MPP platform like Databricks, the large number of joins required by normalized models incurs significant compute overhead from data shuffling. This negates the benefits of data skipping optimizations and leads to extremely poor query performance for analytical workloads.
3. Performance Benchmarks & Tuning Levers
While model choice matters, performance on Databricks is more heavily influenced by data layout optimization and caching. An optimized OBT can outperform a non-optimized Star Schema, and vice-versa, proving that tuning is non-negotiable.
Head-to-Head Metrics: Join vs. Scan Cost
The primary performance trade-off between a Dimensional Model and an OBT is join cost versus scan cost.
| Factor | Dimensional Model (Star Schema) | One Big Table (OBT) |
|---|---|---|
| Primary Cost Driver | Join Cost: Compute resources are consumed joining the fact table with dimension tables. This is heavily optimized by the Photon engine and Adaptive Query Execution (AQE). | Scan Cost: Queries must scan a single, very wide table. Performance hinges on minimizing the amount of data scanned through effective file pruning. |
| Data Skipping | Highly Effective: Filters on dimension tables can prune files in both the dimensions and the central fact table, significantly reducing data reads. | Conditionally Effective: Data skipping works well only when queries filter on the columns used for clustering. Its effectiveness diminishes for queries on non-clustered columns, leading to large scans. |
Key Takeaway: A Dimensional Model’s join cost is often less than an OBT’s scan cost if the OBT is not properly clustered for common query patterns.
Liquid Clustering Case Study—20× Task Speed-Up
Liquid Clustering is Databricks’ most advanced data layout strategy and is essential for OBT performance. Refer to the article “One Big Table vs. Dimensional Modeling on Databricks SQL” for full case study details.
A benchmark test illustrates its profound impact:
- Before Optimization: A query on a standard OBT took 3.5 seconds and scanned 7 files.
- After Optimization: After applying `CLUSTER BY` on the filter column (`c_mktsegment`) and running `OPTIMIZE`, the same query saw a >20x task speed-up and a >3x reduction in wall-clock duration.
- Result: The query time dropped to 1.13 seconds, and the number of files scanned was reduced from 7 to just 2.
This demonstrates that applying Liquid Clustering is a critical step that can make an OBT significantly faster than even a standard relational model (which took 2.6 seconds in the same test).
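The two statements behind this result are straightforward. The sketch below assumes the case study’s `tpch_obt` table already exists and that `c_mktsegment` is the dominant filter column; it is not the article’s verbatim code.

```sql
-- Enable Liquid Clustering on the filter column, then rewrite the data layout
ALTER TABLE tpch_obt CLUSTER BY (c_mktsegment);
OPTIMIZE tpch_obt;   -- compacts files and clusters rows so queries can prune effectively
```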
Photon Caching: Achieving < 500 ms Queries at Scale
For high-concurrency BI workloads with repetitive query patterns, Databricks SQL’s automatic caching provides a substantial performance boost for both Dimensional Models and OBTs. When data is cached, subsequent queries can see execution times drop significantly, often to under 500 milliseconds, as both I/O and computation are minimized. This makes both models viable for interactive dashboards, provided the underlying data is accessed frequently.
4. Cost & Governance Trade-Offs
The choice between a Dimensional Model and an OBT involves a direct trade-off between upfront engineering costs and long-term governance overhead.
Storage vs. Compute TCO Across Models
Total Cost of Ownership (TCO) is driven by a combination of storage, compute, maintenance, and engineering time. While storage is relatively inexpensive, compute costs can vary dramatically based on model efficiency.
| Cost Driver | Dimensional Model (Star Schema) | One Big Table (OBT) |
|---|---|---|
| Storage Cost | Lower: Less data redundancy, as contextual attributes are stored once in dimension tables. | Higher: Significant data redundancy from flattening all attributes increases the storage footprint. |
| Compute Cost | Join Complexity: Queries require joins, which consume compute, though this is highly optimized on Databricks. | Large Scans: Inefficient queries that cannot be pruned by clustering keys lead to massive, expensive table scans. |
| Maintenance Cost | Moderate: Requires running `OPTIMIZE` to compact files and apply clustering. Can be automated with Predictive Optimization. | Moderate: Same maintenance requirements as a Dimensional Model to ensure clustering remains effective. |
| Initial TCO | Higher: Requires substantial upfront engineering time for schema design and building complex ETL pipelines. | Lower: Simpler and faster to set up, reducing initial development costs. |
| Long-Term TCO | Lower: Easier to govern and maintain once built. | Higher: Can increase due to complex governance for fine-grained access control and data quality management on a single, wide table. |
Key Takeaway: OBTs offer a lower barrier to entry but can accrue higher long-term costs related to governance and inefficient queries if not managed properly. Dimensional Models require more upfront investment but provide a more stable and cost-effective foundation for governed analytics over time.
Security Blast Radius—Dimensional Isolation vs. OBT Concentration
The structure of a data model has direct implications for its risk profile and data privacy.
- Dimensional Model: This model inherently reduces risk by segregating data. Sensitive PII can be isolated in specific dimension tables (e.g., a ‘Customer’ dimension). This separation reduces the “blast radius” of a breach, as access to a fact table alone may not expose sensitive context, aligning with the principle of data minimization.
- One Big Table: An OBT concentrates all data, including sensitive fields, into a single asset. This inherently increases risk, as a single misconfigured permission could expose a vast amount of information. This model demands a flawless implementation of fine-grained access controls to manage its higher intrinsic risk.
Unity Catalog is the critical enabler for managing this risk, especially for OBTs. Its features for row-level filters and column-level masks are essential for implementing the ‘need-to-know’ principle securely. For example, a column mask can be created as a SQL UDF to redact data for unauthorized users and applied directly to the table:
```sql
-- Create a masking function that redacts price for non-privileged users.
-- The mask must return the column's data type, so the DECIMAL value is
-- replaced with NULL rather than a masked string.
CREATE FUNCTION price_mask(o_totalprice DECIMAL)
  RETURN CASE WHEN is_account_group_member('admin') THEN o_totalprice ELSE NULL END;

-- Apply the mask to the OBT column
ALTER TABLE tpch_obt ALTER COLUMN o_totalprice SET MASK price_mask;
```
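A row-level filter follows the same pattern. This is a hedged sketch that assumes a hypothetical `o_region` column on the OBT and an `admin` group; it is not code from the referenced article.

```sql
-- Allow admins to see all rows; restrict everyone else to a single region
CREATE FUNCTION region_filter(o_region STRING)
  RETURN is_account_group_member('admin') OR o_region = 'EUROPE';

ALTER TABLE tpch_obt SET ROW FILTER region_filter ON (o_region);
```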
5. Operational Complexity & Automation
Databricks tooling can significantly reduce the operational burden of data modeling, but the benefits are not distributed equally across all approaches.
SCD & CDC: DLT AUTO-CDC vs. Manual MERGE
Handling Slowly Changing Dimensions (SCDs) and Change Data Capture (CDC) is a common requirement where the choice of model has a major impact on engineering effort.
- Dimensional Model Approach: Delta Live Tables (DLT) provides a powerful, declarative framework that simplifies implementing dimensional models. Its `AUTO CDC` APIs automate the complex logic for both SCD Type 1 (overwrite) and Type 2 (history tracking). For SCD Type 2, DLT automatically manages history with `__START_AT` and `__END_AT` columns and uses a `SEQUENCE BY` column to handle out-of-order data robustly (see the DLT sketch after this list).
- OBT Approach: An OBT-centric pipeline faces significant challenges with historical tracking. Implementing the equivalent of SCDs requires complex custom logic, often involving manual `MERGE` statements to handle updates. This results in higher long-term operational toil and can be computationally expensive, as changes to an underlying dimension may require regenerating large portions of the OBT.
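As a rough illustration of the DLT approach, the SQL sketch below assumes a hypothetical streaming CDC feed named `customer_cdc_feed`, keyed by `customer_id` and sequenced by a `change_ts` column.

```sql
-- Declare the target streaming table
CREATE OR REFRESH STREAMING TABLE dim_customer;

-- Let DLT apply the CDC feed as an SCD Type 2 dimension
APPLY CHANGES INTO live.dim_customer
FROM STREAM(live.customer_cdc_feed)
KEYS (customer_id)
SEQUENCE BY change_ts          -- resolves out-of-order events
STORED AS SCD TYPE 2;          -- DLT maintains __START_AT / __END_AT automatically
```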
Predictive Optimization & Lineage via Unity Catalog
Unity Catalog provides two key automation capabilities that level the playing field:
- Predictive Optimization: For tables in Unity Catalog, this feature can automate maintenance tasks like running `OPTIMIZE` and applying Liquid Clustering. It intelligently selects clustering keys and runs the necessary jobs, ensuring tables remain performant without manual intervention (see the sketch after this list).
- Automated Data Lineage: Unity Catalog automatically captures and visualizes column-level lineage for all workloads (SQL, Python, DLT, dbt). This provides end-to-end visibility into data flows, which is crucial for debugging, impact analysis, and auditing any data model.
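Enabling these capabilities is typically a one-line operation. The sketch below uses hypothetical catalog and table names; availability of Predictive Optimization and `CLUSTER BY AUTO` depends on your workspace and account settings.

```sql
-- Turn on Predictive Optimization for managed tables in a catalog
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;

-- Let Databricks choose and maintain Liquid Clustering keys automatically
ALTER TABLE main.gold.fct_orders CLUSTER BY AUTO;
```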
6. Workload-Driven Prescriptions
The optimal modeling strategy is dictated by the specific workload. Rather than adhering to a single dogma, teams should let the use case guide the design.
Governed BI Dashboards → Star Schemas
For governed dashboarding and enterprise BI, the Dimensional Model (Star Schema) is the recommended approach. Its structure is inherently optimized for the OLAP queries (slicing, dicing, aggregation) that power these tools. The separation of facts and dimensions provides an intuitive semantic layer for BI tools like Power BI and Tableau, ensuring a consistent single source of truth for reporting. On Databricks, this model is highly performant, leveraging the Photon engine, Liquid Clustering on keys, and informational constraints in Unity Catalog to accelerate queries.
ML Feature Engineering → OBT as Feature Table
For machine learning workloads, the One Big Table (OBT) is highly suitable for creating feature tables. The denormalized, flattened structure provides a simple, wide table where each row represents an observation and each column a feature. This eliminates the need for complex joins during model training and inference, simplifying the feature engineering pipeline and improving performance.
Hybrid Medallion Pattern—OBT Silver → Star Gold
For a general-purpose enterprise lakehouse, the most effective pattern is a hybrid Medallion Architecture.
- Silver Layer: Data is cleansed and conformed into an OBT. This provides an agile, integrated table for data preparation and allows data scientists and analysts to quickly explore and prototype with a comprehensive view of the data.
- Gold Layer: Once analytical requirements stabilize, the Silver OBT is transformed into curated Star Schemas. This layer provides highly performant, reusable, and governed data products for downstream BI, reporting, and analytics, ensuring consistency and a single source of truth.
This pattern combines the agility of OBTs for development and exploration with the performance and governance of Star Schemas for production analytics.
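Reusing the hypothetical `silver.orders_obt` table from the OBT sketch earlier, the Silver-to-Gold hand-off can be as simple as deriving conformed dimensions and facts from the Silver OBT:

```sql
-- Derive a conformed dimension from the Silver OBT
CREATE OR REPLACE TABLE gold.dim_customer
CLUSTER BY (customer_id)
AS SELECT DISTINCT customer_id, customer_name, market_segment
FROM silver.orders_obt;

-- Derive the fact table, keeping only keys and measures
CREATE OR REPLACE TABLE gold.fct_orders
CLUSTER BY (order_date)
AS SELECT order_id, customer_id, order_date, total_price
FROM silver.orders_obt;
```

In practice the Gold dimension would also carry surrogate keys and SCD handling (for example via the DLT pattern shown earlier); this sketch only shows the structural hand-off.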
7. Common Antipatterns & Failure Modes
Most modeling failures on Databricks trace back to ignoring platform-native optimizations or applying patterns from legacy systems without adaptation.
| Antipattern | Reason for Failure | Remediation |
|---|---|---|
| Excessive Snowflaking / Over-normalization | Highly normalized models (3NF, deep snowflakes) create excessive joins, which degrade performance on distributed engines like Databricks by increasing data shuffling and reducing the effectiveness of data skipping. | Prefer denormalized structures. Redesign legacy 3NF models into a Star Schema for the Gold layer. Keep dimension tables wide by flattening many-to-one relationships to avoid snowflake joins. |
| Over-partitioning | Partitioning tables < 1 TB or on low-cardinality columns leads to a “small file problem.” The query engine spends more time reading metadata from thousands of tiny files than reading data, which increases overhead and hurts performance. | Minimize partitioning. For most use cases, use Liquid Clustering (`CLUSTER BY`). It is more flexible, automatically manages file sizes, and incrementally optimizes data layout, providing superior performance without the rigidity of partitioning. |
| Naive OBT Implementation | Creating a wide OBT without optimization shifts the performance bottleneck from joins to scans. Queries filtering on non-clustered columns will force inefficient and expensive full table scans, negating the model’s benefits. | Apply Liquid Clustering on the OBT using the 1-3 most frequently filtered columns. This co-locates related data, enabling efficient file pruning. For Unity Catalog tables, enable Predictive Optimization to automate this process. |
8. Tooling Ecosystem
A scalable modeling strategy relies on a robust tooling ecosystem for governance, transformation, and automation. Metadata and lineage are prerequisites, not afterthoughts.
Unity Catalog—Single Source of Governance Truth
Unity Catalog is the central, unified governance solution for the Databricks Lakehouse. It provides the foundational layer for the entire modeling ecosystem with key capabilities:
- Centralized Metadata: A 3-level namespace (`catalog.schema.table`) organizes all data assets. It can store informational `PRIMARY KEY` and `FOREIGN KEY` constraints, which provide semantic context for modelers and are used by visual modeling tools like erwin and sqlDBM.
- Automated Data Lineage: It automatically captures and visualizes column-level lineage for all workloads, providing end-to-end visibility.
- Fine-Grained Access Control: It enables robust security via a hierarchical privilege model, RBAC, row-level filters, and column-level masking (see the sketch after this list).
- Data Discovery and Sharing: It provides a searchable catalog for all assets and facilitates secure data sharing via the open Delta Sharing protocol.
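A hedged sketch of what fine-grained, read-only access looks like across the 3-level namespace, assuming a hypothetical `sales` catalog, `gold` schema, and `bi_analysts` group:

```sql
-- Grant read-only access to the gold layer through the 3-level namespace
GRANT USE CATALOG ON CATALOG sales TO `bi_analysts`;
GRANT USE SCHEMA ON SCHEMA sales.gold TO `bi_analysts`;
GRANT SELECT ON TABLE sales.gold.fct_orders TO `bi_analysts`;
```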
dbt + DLT + Delta Sharing—Composable Lakehouse Stack
Databricks native tools and third-party integrations form a powerful, composable stack for building and managing data models:
- Delta Live Tables (DLT): A declarative framework for building reliable, maintainable, and testable data processing pipelines, with built-in support for CDC and data quality expectations.
- dbt (data build tool): A popular open-source tool for data transformation that integrates seamlessly with Databricks and Unity Catalog’s 3-level namespace, enabling engineers to build, test, and document models using SQL.
- Delta Sharing: An open protocol for securely sharing live data from your lakehouse to other organizations, regardless of their computing platform.
9. Migration Playbook
Migrating from a legacy data warehouse to the Databricks Lakehouse involves more than a simple “lift-and-shift”; it requires a thoughtful redesign of data models to align with the architecture’s strengths.
Schema Redesign Quick Wins
The core activity in a migration is moving away from legacy, highly normalized 3NF models, which perform poorly on Databricks. The goal is to redesign schemas into denormalized structures that optimize query performance.
A key recommendation is to adopt a hybrid strategy that minimizes risk and downtime. Start by creating OBTs in the Silver Layer for rapid data integration and to provide immediate value to data science teams. This allows you to decommission legacy pipelines quickly. Once business requirements for reporting and BI have stabilized, create curated Dimensional Models (Star Schemas) in the Gold Layer from the Silver OBTs. This staged conversion provides a pragmatic path to a fully modernized, performant, and governed lakehouse architecture.
Roll-Out Checklist & Governance Guards
- Inventory and Prioritize: Identify source systems and prioritize migration based on business impact and technical complexity.
- Design Bronze Layer: Set up raw ingestion pipelines using `COPY INTO` or Auto Loader to land data in its native format in the Bronze layer (see the ingestion sketch after this checklist).
- Implement Silver Layer: Build cleansing and transformation logic to create conformed OBTs or a Data Vault in the Silver layer.
- Apply Governance: Before exposing any data, configure Unity Catalog with appropriate access controls, including row filters and column masks for all sensitive data in OBTs.
- Build Gold Layer: Develop and deploy optimized Star Schemas for key business domains in the Gold layer.
- Optimize and Automate: Implement Liquid Clustering on all large tables and enable Predictive Optimization in Unity Catalog to automate maintenance.
- Migrate BI Tools: Repoint BI dashboards and reports from the legacy warehouse to the new Gold layer tables and views in Databricks SQL.
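For the Bronze ingestion step above, a hedged `COPY INTO` sketch (hypothetical bucket, schema, and table names) looks like this; Auto Loader is the streaming alternative for continuous ingestion.

```sql
-- Create a schemaless target table and let COPY INTO infer and evolve its schema
CREATE TABLE IF NOT EXISTS bronze.orders_raw;

COPY INTO bronze.orders_raw
FROM 's3://landing-bucket/orders/'
FILEFORMAT = JSON
COPY_OPTIONS ('mergeSchema' = 'true');
```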
Ready to begin your migration journey? Our team at Dhristhi specializes in helping organizations navigate complex data platform transformations. Contact us to discuss your specific migration challenges and develop a tailored strategy for your organization.
This post is part of our ongoing series on modern data engineering practices. Stay tuned for more insights on data platform transformations and best practices.