Medallion Architecture: Five need to knows
This Week: Databricks Lakehouses & New AI Engineering Resources
Dear Reader…
As a data engineer who is new to Databricks, understanding Medallion Architecture is essential for designing and building robust data pipelines. Simply put, it is a data design pattern used to logically organise data in a lakehouse when using Databricks. The approach has gained serious traction in recent years, so this week we take a look at the essentials of Medallion Architecture, along with some of its pros and cons. Here are the 5 “need to knows” on Medallion Architecture:
#1. “And the 🏅 Medalists are…”
The Medallion architecture logically organises data in a lakehouse into three distinct layers: Bronze, Silver, and Gold. Each “medallion” represents progressively higher levels of data quality. The Bronze layer is where raw data is ingested and stored in its original form, with minimal transformations. The Silver layer is where data is cleaned, deduplicated, and transformed to ensure accuracy and consistency. Finally, the Gold layer is where data is aggregated, optimised, and presented for consumption by stakeholders.

This approach provides a scalable and flexible framework for managing data in a large data lakehouse, creating a logical progression of data quality from raw ingestion to high-performance analytics and reporting. The architecture means that pipelines can be more robust, efficient, and scalable. Each layer has specific roles and responsibilities in the data pipeline:
🥉Bronze Layer: This is the raw data ingestion layer where data is stored in its original form, with minimal transformations. It includes metadata for data lineage and auditing. The Bronze layer acts as a landing zone for all incoming data, ensuring that raw data is preserved and traceable.
🥈Silver Layer: This layer is where data is cleaned, deduplicated, and transformed to ensure accuracy and consistency. It prepares data for downstream consumption. The Silver layer is critical for ensuring data quality and making data usable for analytics and reporting.
🏅Gold Layer: This is the final layer where data is aggregated, optimised, and presented for consumption by stakeholders. It is designed for high-performance analytics and reporting. The Gold layer is optimised for query performance, making it suitable for real-time analytics and production applications.
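The flow through the three layers can be sketched in plain Python. This is a simplified illustration of the pattern, not Spark code: the field names (`order_id`, `amount`) and the lineage columns are invented for the example, and in Databricks each step would be a DataFrame transformation writing to a Delta table.

```python
from datetime import datetime, timezone

# Bronze: land raw records as-is, adding lineage metadata (source, ingest time).
def to_bronze(raw_records, source):
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [{**r, "_source": source, "_ingested_at": ingested_at} for r in raw_records]

# Silver: clean and deduplicate — drop malformed rows, keep one row per business key.
def to_silver(bronze_records):
    seen, silver = set(), []
    for r in bronze_records:
        if r.get("order_id") is None:
            continue  # discard rows missing the key
        if r["order_id"] in seen:
            continue  # deduplicate on the business key
        seen.add(r["order_id"])
        silver.append({"order_id": r["order_id"], "amount": float(r["amount"])})
    return silver

# Gold: aggregate for consumption by stakeholders — here, total revenue.
def to_gold(silver_records):
    return {"total_revenue": sum(r["amount"] for r in silver_records)}

raw = [
    {"order_id": 1, "amount": "10.0"},
    {"order_id": 1, "amount": "10.0"},    # duplicate
    {"order_id": None, "amount": "5.0"},  # malformed
    {"order_id": 2, "amount": "7.5"},
]
gold = to_gold(to_silver(to_bronze(raw, source="orders_api")))
print(gold)  # {'total_revenue': 17.5}
```

Note how each layer only consumes the output of the previous one: raw data is never mutated in Bronze, so a bug in the Silver logic can always be fixed and replayed from the preserved source.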
#2. 🔑 Guiding Principles
The approach is guided by these key principles to ensure data quality and reliability:
Data Maturity: Data quality improves as it progresses through the layers. The Bronze layer focuses on raw data ingestion, the Silver layer on data cleaning and transformation, and the Gold layer on data aggregation and optimisation.
Agility: The architecture adapts to changing business needs and new data sources. It is flexible and can accommodate various data types and sources, making it suitable for dynamic data environments.
Atomicity, Consistency, Isolation, and Durability (ACID): Ensures data integrity and reliability as it passes through multiple layers of validations and transformations. The Medallion architecture adheres to ACID principles to ensure that data is accurate, consistent, and reliable.
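In a Databricks lakehouse the ACID guarantees come from Delta Lake's transaction log. As a tiny illustration of the atomicity idea only (plain Python, not how Delta implements it), a write-then-atomic-rename pattern ensures a reader sees either the old version of a dataset or the new one, never a half-written file:

```python
import json
import os
import tempfile

def atomic_write_json(path, records):
    """Commit a new version of a JSON dataset atomically: write to a temp
    file in the same directory, then rename over the target in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(records, f)
        os.replace(tmp_path, path)  # atomic rename: the "commit" point
    except BaseException:
        os.remove(tmp_path)  # roll back: discard the partial write
        raise

atomic_write_json("silver_orders.json", [{"order_id": 1, "amount": 10.0}])
with open("silver_orders.json") as f:
    print(json.load(f))  # [{'order_id': 1, 'amount': 10.0}]
```

Delta Lake generalises this idea: writers stage data files and then atomically commit an entry to the transaction log, which is what lets concurrent readers and writers operate safely across the Bronze, Silver, and Gold layers.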
#3. 🧑🏽‍🍳 “Mise en place”
In French cuisine, the concept of “mise en place” literally translates to “put in place”: the practice of organising all your ingredients and tools before you start to cook. In the same way, here are the best practices to ensure you have everything in place:
Separate Staging Area: Create a separate staging area in the Bronze layer for raw data to maintain data integrity. This ensures that raw data is preserved and traceable.
Data Quality Checks: Enforce data quality checks in the Silver layer to ensure data accuracy and consistency. This includes data cleaning, deduplication, and transformation.
Optimise Query Performance: Optimise query performance in the Gold layer for efficient analytics and reporting. This includes data aggregation, indexing, and caching.
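The Silver-layer quality checks above can be made concrete with a small, framework-free sketch. The rule names and fields here are illustrative assumptions, not a Databricks API; in practice Delta Live Tables expectations or a library such as Great Expectations would play this role:

```python
def check_quality(records, rules):
    """Split records into passing and failing sets against named rules,
    tagging each failing record with the rules it violated."""
    passed, failed = [], []
    for r in records:
        violations = [name for name, rule in rules.items() if not rule(r)]
        (failed if violations else passed).append({**r, "_violations": violations})
    return passed, failed

# Example rules for a hypothetical customer-orders feed.
rules = {
    "has_id": lambda r: r.get("customer_id") is not None,
    "positive_amount": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] > 0,
}

records = [
    {"customer_id": "c1", "amount": 25.0},
    {"customer_id": None, "amount": 12.0},
    {"customer_id": "c2", "amount": -3.0},
]
passed, failed = check_quality(records, rules)
print(len(passed), len(failed))  # 1 2
```

Routing failures to a quarantine table rather than silently dropping them keeps the Silver layer trustworthy while preserving the evidence needed to fix upstream issues.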
#4. 🙌🏽 Databricks Success Functionality
Two features of Databricks that work seamlessly with this approach are:
Delta Live Tables (DLT): Use DLT for building and managing data pipelines, ensuring data quality and consistency. DLT provides a managed service for creating and maintaining data pipelines, making it easier to implement the Medallion architecture.
Structured Streaming: Leverage Structured Streaming for real-time data processing and integration with various data sources. Structured Streaming provides a scalable and fault-tolerant framework for processing real-time data, making it suitable for dynamic data environments.
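A core job Structured Streaming does between Bronze and Silver is stateful deduplication across incoming micro-batches (in Spark, `dropDuplicates` with a watermark). The stateful idea can be sketched without Spark, assuming an invented `event_id` key and an unbounded in-memory key set (Spark's watermarking exists precisely to bound this state):

```python
class StreamingDeduplicator:
    """Deduplicate records across micro-batches by remembering keys already seen."""

    def __init__(self, key):
        self.key = key
        self.seen = set()  # state carried between batches

    def process_batch(self, batch):
        out = []
        for record in batch:
            k = record[self.key]
            if k not in self.seen:
                self.seen.add(k)
                out.append(record)
        return out

dedup = StreamingDeduplicator(key="event_id")
batch1 = dedup.process_batch([{"event_id": "a"}, {"event_id": "b"}, {"event_id": "a"}])
batch2 = dedup.process_batch([{"event_id": "b"}, {"event_id": "c"}])
print([r["event_id"] for r in batch1 + batch2])  # ['a', 'b', 'c']
```

The second batch's `"b"` is dropped because it was seen in the first batch, which is exactly the cross-batch guarantee that makes streaming ingestion into the Silver layer idempotent.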
#5. Compatibility and Scalability
The Medallion architecture is designed to be compatible and scalable, making it suitable for various data types and environments:
Data Mesh Compatibility: The Medallion architecture is compatible with the concept of a data mesh, allowing for flexible data integration and reuse. A data mesh is a decentralised data architecture that enables data sharing and collaboration across different teams and departments.
Scalability: The architecture supports both structured and unstructured data, making it suitable for various data types and scalable for cloud-based data warehouses. The Medallion architecture can handle large volumes of data and scale horizontally to meet the needs of growing data environments.
Writer RAG Tool: build production-ready RAG apps in minutes with simple API calls.
Knowledge Graph integration for intelligent data retrieval and AI-powered interactions.
Streamlined full-stack platform eliminates complex setups for scalable, accurate AI workflows.
Critical Considerations to Medallion Architecture
At a recent Databricks Data & AI Summit, David Whiteley, CTO of Advancing Analytics UK, shared his thoughts on why a one-size-fits-all use of Medallion Architecture may be detrimental.
Here are some of the top criticisms of the approach to consider:
Rigidity: The architecture can be too rigid, forcing data into three distinct layers (Bronze, Silver, and Gold) without flexibility for additional steps or nuances. This can limit the ability to adapt to changing business needs or incorporate new data sources. The strict layering can make it inflexible in handling exceptions or you may find you have data that does not fit neatly into one of the predefined layers.
Storage Requirements: This approach requires significant storage, effectively tripling the amount of storage used in a data lake, which can be impractical for teams with intensive storage demands. Equally, these increased storage requirements can lead to higher costs, which can be a roadblock if you are working within budget constraints.
Additional Downstream Processing: It often requires additional downstream processing, which can lead to complexity and the need for separate processes for business-focused transformations. These additional processing steps can be resource-intensive, requiring more computational power and potentially leading to longer processing times.
Limited Flexibility: The architecture may not suit every company's unique needs, particularly those with diverse source systems or very specific industry requirements. Adapting the Medallion architecture to fit specific business needs can be challenging, requiring significant customisation and potentially leading to additional complexity.
Complexity: Managing multiple tiers can be complex and expensive, and the architecture requires ongoing maintenance and management, which has the potential to increase operational overhead and reduce overall efficiency.
Not a One-Size-Fits-All Solution: The Medallion architecture is not universally applicable. It should be adapted to fit specific business needs and data processing pipelines, and considered in the context of your specific data management requirements rather than adopted as a generic solution.
Last up today is a highlight from the Data Innovators Exchange: the recently launched Enterprise AI Engineering Classroom and Resource Hub. There you will find materials to get you started using IBM Watsonx in an enterprise environment.

Like this content? Join the conversation at the Data Innovators Exchange.