The Hitchhiker’s Guide to the Modern Data Lakehouse
If you're reading this, you're probably somewhere between mildly curious and mildly panicked about modern data architecture.
Good news: You’re not alone. Even better news: You don’t need to panic. You just need a guide.
Much like hitchhiking across the galaxy, building a modern data lakehouse can feel a little overwhelming at first. This guide is here to help you navigate the strange, wonderful universe of data lakehouses and build a platform that not only works but wows.
Grab your towel. Let’s go!
What constitutes a “modern” data architecture? The answer will vary depending on who you talk to, and whatever you hear is likely to become outdated quickly because this is a rapidly evolving space. Irrespective of all the noise, there are certain key components that you should always consider to ensure you have your bases covered. You can find these listed in the reference architecture below.
This reference architecture should serve as a guide as you start building a modern data lakehouse. It should prompt you to ask deeper questions about the business problems you are trying to address and how you will solve them. It should also help orient your thinking towards the long-term while designing for scale and security.
It is important to view this architecture as a collection of tools and processes. You do not have to look for a single product that addresses all of it, and not every component will be applicable in every situation. But, at a minimum, think deeply about these areas with respect to your priorities. This should help you identify the components you need to implement a successful solution.
You will find that a lot of these capabilities are available natively within the major cloud computing platforms like Amazon Web Services (AWS), Microsoft Azure and Google Cloud. But there is also a thriving ecosystem of SaaS products that have built deep, specialized tools to serve specific needs.
If you are building a data platform for scale, you should use this reference architecture to build a data lakehouse. What that essentially means is that you have a centrally managed library of data assets and users, and these users can analyze your data using a variety of tools like SQL, Apache Spark, Trino, PyTorch, TensorFlow, etc., based on their preference and/or use case. Storage and compute are truly separated in such an architecture, and the focus shifts to robust governance and security capabilities to bring this to life.
As Peter Drucker put it, “The most serious mistakes are not being made as a result of wrong answers. The truly dangerous thing is asking the wrong questions.”
Key Component Areas
Data Security
Data Governance
Data Ingestion
Data Access
Data & Performance Management
Data Storage
Deployment
Data Models
[Reference architecture diagram]
Data Storage
Your data will have to be stored in a format that is efficient for analytics. These days, columnar formats are preferred over row-based formats for analytics because they are much faster for queries that scan only a subset of columns. Ideally, the format in which the data is stored will be readable by multiple compute engines without the need to move the data around. With AI use cases gaining prominence, your platform will need to support unstructured data as well.
Structured Data: In most cases, the majority of your data will be structured. This means that you are aware of the structure of your data and it can be easily represented in a tabular format using rows and columns.
Semi-structured Data: Semi-structured data does not have a predefined structure; instead, the structure is encoded within the data itself. JSON and XML files are common examples of semi-structured data that need to be stored. It is also possible to convert semi-structured data into a structured format using pre-processing steps.
Unstructured Data: Data like images, audio, video and free-form text files cannot be easily analyzed by traditional analytics tools. This has changed to some extent with the use of AI frameworks. Now, you can convert these types of files into vector embeddings and perform sentiment analysis and semantic search on them.
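To make the idea of an open, columnar format concrete, here is a minimal sketch that writes a small table to Parquet with pyarrow. The path and column names are illustrative assumptions; the resulting file could then be read in place by Spark, Trino, DuckDB, pandas and many other engines.

```python
# A minimal sketch: writing tabular data to Parquet, an open columnar format.
# The path and columns are illustrative assumptions, not from this guide.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [19.99, 5.25, 42.00],
    "country": ["US", "DE", "IN"],
})

# Parquet stores values column by column, which is why analytical queries
# that touch only a few columns are cheaper than on row-based formats.
pq.write_table(table, "orders.parquet")
```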
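For the unstructured side, the sketch below shows one way free-form text might be converted into vectors for semantic search, using the sentence-transformers library. The model name and sample documents are assumptions; in practice the vectors would typically be stored in a vector index or a lakehouse table.

```python
# A minimal sketch of turning free-form text into vectors for semantic search.
# The model name and documents are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The delivery arrived two days late and the box was damaged.",
    "Great product, the battery lasts all week.",
]
doc_vectors = model.encode(documents)

# Semantic search: embed the query and rank documents by cosine similarity.
query_vector = model.encode("complaints about shipping")
print(util.cos_sim(query_vector, doc_vectors))
```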
Ask yourself:
Is the data stored in a format that will allow different compute engines to access it? Or is it locked to a single compute engine?
Is there support for both analytics and AI use cases (structured, semi-structured and unstructured)?
Data Ingestion
The amount of data being generated has exploded over the last 10 years. You need tools and processes to make sure the data generated from disparate sources like your OLTP systems, IoT devices and clickstream data (generated from your website traffic) are loaded, transformed and made available for analysis in a timely manner.
Streaming Ingestion: You may need to “stream” data into your data lakehouse for certain types of data sources, such as IoT devices and clickstream data. Depending on the use case, you may require low-latency loads.
Batch Ingestion: The majority of your data will be loaded in bulk at a pre-determined frequency - hourly, daily, etc. The processes and tools that you use for this will have to efficiently handle the extraction and loading of large volumes of data.
Data Transformation: This is where the bulk of your business logic will be embedded in the data processing pipeline. You may need to clean, enhance, aggregate and/or apply some business logic to calculate relevant metrics.
Orchestration: As data pipelines become more complex, having the ability to string together multiple processes becomes critical. You should be able to create and monitor your workflows. You also need to factor in error handling and the ability to recover from failures.
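As a small illustration of the batch ingestion and transformation steps described above, here is a hedged sketch using PySpark. The paths, column names and cleanup rules are illustrative assumptions, not a prescribed pipeline.

```python
# A minimal sketch of a batch extract-transform-load step with PySpark.
# Paths, columns and the cleanup rules are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_orders_batch").getOrCreate()

# Extract: read the raw daily export dropped by an upstream system.
raw = spark.read.csv("raw/orders/2025-01-01/", header=True, inferSchema=True)

# Transform: deduplicate, fix types and apply simple business logic.
cleaned = (
    raw.dropDuplicates(["order_id"])
       .withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount") > 0)
)

# Load: append to the lakehouse in a columnar format, partitioned by date
# (assumes the source includes an order_date column).
cleaned.write.mode("append").partitionBy("order_date").parquet("lake/orders/")

spark.stop()
```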
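For orchestration, the sketch below uses Apache Airflow, one of several popular orchestrators, to schedule the batch job and retry it on failure. The DAG id, schedule and task callable are assumptions.

```python
# A minimal orchestration sketch with Apache Airflow: schedule a daily run,
# retry on failure, and keep the workflow observable in the Airflow UI.
# The DAG id, schedule and callable are illustrative assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_daily_ingest():
    # Placeholder for the actual ingest/transform logic (assumed to exist).
    print("running daily ingest...")

with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        # Basic error handling: retry before paging a human.
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
    },
) as dag:
    ingest = PythonOperator(task_id="ingest_orders", python_callable=run_daily_ingest)
```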
Ask yourself:
What are the tools that you will use for the above capabilities? Will they integrate well with the rest of the architecture?
Will your architecture scale to meet future growth in data volumes?
How will you monitor workflows?
How will you recover from failures?
Are there out-of-the-box features that you can leverage to minimize code?
Data Access
A data platform only adds value if users are able to derive insights from it. Being able to access the data using a variety of tools to serve analytics and AI use cases is what will define the value of your data platform.
SQL Engines: You need a SQL engine to be able to access the structured and semi-structured data. These are the workloads that would be traditionally served by a data warehouse.
Visualization: You will need to create reports and visualizations that make the insights that you generate easy to consume and understand.
Big Data Engines: In the modern data stack, SQL is complemented by open-source big data engines like Apache Spark and Trino. Look at your use cases and determine what role big data engines may play in your solution.
Drivers: Most visualization and reporting tools (like Tableau, Power BI, etc.) connect to data using drivers. Compatibility, performance, and security features of the available drivers will shape your user experience.
API: You may also need programmatic access to your data from application code. API-based connectivity to your data will help simplify that code.
ML/AI Engines: Your data platform will have a large role to play in your AI ecosystem. Data is the foundation over which AI will be built and operated. Your AI models will be trained on your data and you may need to build Retrieval Augmented Generation (RAG) workflows or Agents that require access to your data.
Data Marketplace: If you create data products that will be shared with other users and teams, think about how that data will be shared securely with your data consumers and whether there is a need to monetize your data product.
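As one concrete example of SQL access over open storage, the sketch below queries Parquet files like the ones written earlier with DuckDB, an embedded SQL engine. The path and query are illustrative assumptions; Trino, Spark SQL or a cloud warehouse could serve the same role.

```python
# A minimal sketch: running SQL directly against lakehouse files with DuckDB.
# The path and columns are illustrative assumptions.
import duckdb

# Because the data sits in an open columnar format, a SQL engine can read it
# in place; there is no need to copy it into a proprietary store first.
result = duckdb.sql("""
    SELECT country, SUM(amount) AS total_revenue
    FROM read_parquet('lake/orders/*/*.parquet')
    GROUP BY country
    ORDER BY total_revenue DESC
""").df()

print(result)
```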
Ask yourself:
What are the tools that you will use to access your data? How will they integrate with the rest of the components?
Do you need to monetize your data product(s)?
Will your data access choices allow you to meet your performance goals for dashboards?
What are your user concurrency requirements?
What AI use cases are on the horizon and how would they fit in?
Data & Performance Management
Once the domain of database administrators, these tasks are increasingly being automated by AI within the product itself. However, it is important to understand which tasks are automated (and to what extent), and where you may have to step in to fill the gaps.
Resource Optimization: How the system manages resources like CPU and memory and shares them across competing workloads will have a direct impact on user experience. This used to be managed through workload management and prioritization, but these days automatic scaling allows you to adjust your infrastructure to the workload in real time.
Data Organization: Over the years, data platforms have become more intelligent and are increasingly capable of optimizing how the data is stored based on your workloads. Automations may be available to ensure that tasks like sorting, partitioning, distribution and clustering are managed by the platform itself without human intervention.
Data Materialization: Analytics workloads are resource intensive. If certain frequently run processes are time consuming, it may make sense to pre-compute their results beforehand. Look for features where the system automatically does this analysis and materializes data where applicable.
Code Optimization: Performance management and monitoring are integral to any data platform. There needs to be a mechanism to identify potentially problematic workloads. Again, modern lakehouses are capable of surfacing performance optimization suggestions and even making slight code modifications under the hood.
Data Backup: Your data needs to be backed up to recover from unexpected data corruption or data loss. Keep in mind that there may be both business and regulatory requirements to meet here.
Data Recovery: Look at how quickly you can recover from failure, whether that is a hardware failure or data corruption that needs to be fixed by restoring backups. Consider disaster recovery and business continuity needs as you build a robust strategy for this.
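To make data materialization concrete, here is a hedged sketch that manually pre-computes a daily revenue summary so dashboards can read a small table instead of scanning the raw files. The database file, paths and columns are assumptions, and many modern platforms can automate this kind of materialization for you.

```python
# A minimal sketch of data materialization: pre-compute an expensive aggregate
# once so that frequently run dashboards read a small summary table instead.
# The database file, paths and columns are illustrative assumptions.
import duckdb

con = duckdb.connect("lakehouse.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE daily_revenue AS
    SELECT order_date, country, SUM(amount) AS revenue
    FROM read_parquet('lake/orders/*/*.parquet')
    GROUP BY order_date, country
""")
con.close()
```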
Ask yourself:
What are the administrative tasks that need to be handled and how much of these are automated?
Do you need to profile the data to understand usage patterns to make optimal choices for distribution, sorting, indices etc.? Are there automations that can be leveraged for this?
How steep is the learning curve for tasks that need to be self managed?
What are your recovery point objectives (RPO) and recovery time objectives (RTO)?
Data Models
The data you generate and manage can come from different sources, but those sources can be related to each other in a variety of ways. Your data models should capture those relationships. Data models can be defined at different levels (described below) and serve different purposes.
Logical Data Model: This is a platform-agnostic and business-oriented view of your data. These models can be highly normalized since they are not performance oriented.
Physical Data Model: This is how you will store the data in your data platform and your approach will vary depending on whether the data is stored as files in your data lake or in tables on a data warehouse. Choices that are made here will have a consequential impact on your performance and user experience. The right physical model for you is the one that works best for the platform and tools that you plan to use.
Semantic Data Model: Think of the Semantic Data Model as a layer above the physical layer. This is geared towards the reporting and access needs of your users and adds optimizations on to the physical model to serve specific access needs.
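As an example of a physical-model choice, the sketch below defines a simple star schema (a common dimensional-modeling pattern): a fact table of orders joined to a customer dimension. All table and column names are illustrative assumptions; the right physical model still depends on your platform and tools.

```python
# A minimal sketch of a star-schema physical model: one fact table plus a
# dimension table. Names and types are illustrative assumptions.
import duckdb

con = duckdb.connect("lakehouse.duckdb")

con.execute("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_key INTEGER PRIMARY KEY,
        customer_name VARCHAR,
        country VARCHAR
    )
""")

con.execute("""
    CREATE TABLE IF NOT EXISTS fact_orders (
        order_id BIGINT,
        customer_key INTEGER REFERENCES dim_customer (customer_key),
        order_date DATE,
        amount DOUBLE
    )
""")

con.close()
```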
Ask yourself:
What are your data sources and how are they related?
How should your data be modeled? Should the data be separated by data source or by business meaning?
How will this data be consumed by users and applications?
Think about dimensional modeling techniques, normalization and denormalization, and platform-specific best practices.
Deployment
How you deploy, provision, test and manage your infrastructure and code is critical yet often overlooked. A holistic approach will help speed up releases and will facilitate error-free deployments.
Development: You need to consider how teams will collaborate on the same pieces of code or database objects, and the checks and balances that will need to be put in place. You will need tools to manage changes to your code, database objects and data models.
Operations: This will need a robust set of tools and processes to ensure that your release cycles are error-free and relatively low touch. Leverage infrastructure-as-code principles where applicable to maintain a higher degree of control over your configurations.
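As one hedged example of infrastructure-as-code, the sketch below declares a lakehouse storage bucket in Python with Pulumi; Terraform, CloudFormation and similar tools are equally valid choices. The resource name and tags are assumptions.

```python
# A minimal infrastructure-as-code sketch using Pulumi with AWS.
# The bucket name and tags are illustrative assumptions.
import pulumi
import pulumi_aws as aws

# Declaring storage in code keeps its configuration versioned, reviewable
# in pull requests, and reproducible across dev/test/prod environments.
raw_bucket = aws.s3.Bucket(
    "lakehouse-raw",
    tags={"environment": "dev", "layer": "raw"},
)

pulumi.export("raw_bucket_name", raw_bucket.id)
```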
Ask yourself:
What are the test cases to include in your deployment process?
Do you have the right tools to manage your source code and database objects?
Data Security
Keeping data secure is a critical aspect of any data platform. This means that access is restricted to the right users (authentication) with the right permission levels (authorization). You must also have visibility to track and audit activity, as well as the ability to encrypt your data at rest and in motion.
Authentication: To authenticate, you are looking for robust integrations with identity providers that your organization may use, like Microsoft Entra ID or Okta. You also want to have a single sign-on experience, where you do not have to authenticate multiple times based on what you are doing.
Authorization: You should aim to manage access to your data assets centrally to the extent possible. Having to create and manage user permissions per compute instance in a world where data assets are shared is cumbersome and error prone. You also need to be able to grant granular permissions at the column, row, table, schema or database level.
Encryption: Depending on your business and/or compliance needs, you may have to encrypt your data at rest and in motion. You may need to have the ability to manage your encryption keys and algorithms as well.
Auditing: Finally, you will need to have robust auditing mechanisms to check and report on activity on your data assets to ensure that your security policies are strong and functioning as desired.
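To illustrate central, granular authorization, here is a hedged sketch that applies role-based GRANT statements through a generic database connection. The exact SQL syntax varies by engine, and the role, schema and column names are assumptions.

```python
# A minimal sketch of granular, role-based authorization expressed as SQL.
# Syntax varies by engine; roles, schema and columns are illustrative assumptions.
GRANT_STATEMENTS = [
    # Schema-level access for the analyst role.
    "GRANT USAGE ON SCHEMA sales TO analyst",
    # Table-level read access.
    "GRANT SELECT ON sales.orders TO analyst",
    # Column-level access: expose only non-sensitive columns of the customer table.
    "GRANT SELECT (customer_key, country) ON sales.customers TO analyst",
]

def apply_grants(connection):
    """Apply the authorization statements through a DB-API style connection."""
    cursor = connection.cursor()
    for statement in GRANT_STATEMENTS:
        cursor.execute(statement)
    connection.commit()
```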
Ask yourself:
What regulatory or compliance requirements need to be met?
Are you able to manage security permissions centrally?
What automated audit processes can you implement to ensure your security policies are effective?
Data Governance
While data governance has been an aspirational goal for many years now, organizations have had varying degrees of success in this space. To fuel the next wave of AI innovations in data, metadata (business and technical) and data lineage will be critical components.
Logging: Information logged regarding usage and performance can be valuable. Automate reports that surface useful insights regularly.
Monitoring: The ability to see what is happening on your system be it user activity, system activity or resource consumption metrics is essential and will enable you to react in a timely manner to any issues that you may experience.
Data Lineage: Data undergoes a lot of transformation from the time it is generated till it is available to an end user for analysis. Keeping track of the processes that may have modified the data improves trust and transparency. Historically, this has been a challenge since there are a variety of tools and processes that modify the data and it becomes extremely difficult to track how the data has been transformed across these multiple layers.
Technical Metadata: This is the metadata that captures technical information about the data: data types, size, data layouts, etc. Automated schema detection and storage is a key component of managing structured and semi-structured data in your lakehouse.
Business Metadata: Every data element has a business meaning. Capturing this information will allow your user community to easily find the information they are looking for.
Data Quality: As companies adopt agentic architectures to automate decision making based on their data, it becomes more important that there is a high degree of trust in the quality of that data. Automated checks should be integrated into your data pipeline to proactively identify potential data quality issues.
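As a small illustration of automated data quality checks, here is a hedged sketch that validates a couple of simple rules before data is published. The path, columns and rules are assumptions; dedicated frameworks such as Great Expectations or dbt tests offer much richer functionality.

```python
# A minimal sketch of automated data quality checks in a pipeline step.
# Path, columns and rules are illustrative assumptions.
import duckdb

def run_quality_checks(parquet_glob: str) -> list:
    """Return human-readable descriptions of any failed checks."""
    failures = []
    con = duckdb.connect()

    # Rule 1: the primary key must never be NULL.
    null_keys = con.execute(
        f"SELECT COUNT(*) FROM read_parquet('{parquet_glob}') WHERE order_id IS NULL"
    ).fetchone()[0]
    if null_keys:
        failures.append(f"{null_keys} rows have a NULL order_id")

    # Rule 2: the primary key must be unique.
    dupes = con.execute(
        f"""SELECT COUNT(*) FROM (
                SELECT order_id FROM read_parquet('{parquet_glob}')
                GROUP BY order_id HAVING COUNT(*) > 1
            )"""
    ).fetchone()[0]
    if dupes:
        failures.append(f"{dupes} order_id values are duplicated")

    con.close()
    return failures

# Fail the pipeline early if any check does not pass.
issues = run_quality_checks("lake/orders/*/*.parquet")
if issues:
    raise ValueError("Data quality checks failed: " + "; ".join(issues))
```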
Ask yourself:
Will you be able to centrally govern the tools in your stack?
Is there value to be had from investing in data lineage and business metadata management?
Who is the true owner of the data? Can they provide insights into how to determine data quality checks?
Final Thought: Don’t Panic (and Build Smart)
Building a modern data lakehouse might seem daunting at first, but with the right map, you'll be cruising through your data galaxy in no time.
The secret?
Focus not just on individual tools or layers, but on how everything integrates - creating a seamless, delightful experience for your users.
Now, go forth and build boldly. The universe (and your data) awaits.