Analytics has changed the business landscape in many ways, providing organizations with actionable insights that yield positive results. Even so, “Limited access to data” and “No centralized tool for capturing and analyzing data” remain barriers to the effective use of data analytics.
Barriers to Effective Use of Analytics
What Is a Data Lake?
A data lake, as the name suggests, is a body or space filled with data in structured, semi-structured, or unstructured form; it is independent of the data format. Just as a lake is fed by a source, a data lake is connected to data sources (internal or external), and data gushes into the lake in full flow, bringing all the dirt and garbage along with it.
The data in the lake ranges from messy data such as audio files, emails, photos, or satellite imagery to neat and clean data such as phone numbers, customer names, addresses, and zip codes.
Here’s how James Dixon, who coined the term “data lake”, describes it:
“If you think of a data mart as a store of bottled water — cleansed and packaged and structured for easy consumption — the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users of the lake can come to examine, dive in, or take samples.”
Here’s another analogy from Alex Gorelik, author of the book “The Enterprise Big Data Lake”:
“A data lake is sort of like a piggy bank. You often don’t know what you are saving the data for, but you want it in case you need it one day.”
Who Uses a Data Lake?
Anyone who knows how to swim or pedal a boat can use a data lake; in other words, anyone with some familiarity with data processing and analysis techniques. That’s why it’s usually data scientists and data engineers who work with data lakes. But with some additional layers on top of the data lake, a good user interface for data governance, and a few self-service analytics tools (like Tableau, Power BI, or Metabase) in the mix, anyone can use the data as the business requires.
Why Are We Using a Data Lakehouse Instead of a Data Lake?
We were executing several enterprise programs that relied heavily on data from multiple systems, including SaaS products, ERP, cloud CRM, and collaboration tools. This voluminous data, present in different formats across different platforms, created critical challenges:
- No singular, consolidated view of enterprise data owing to the absence of a centralized model and siloed operations across multiple geographical regions.
- A lack of effective enterprise-wide data governance, which made it impossible to define a streamlined decision-making process.
- No provision for self-service business intelligence and analytics.
How We Built a Data Lakehouse
Now let’s get straight to the point: how we built a data lakehouse. Yes, you read that correctly, we built a data lakehouse instead of a data lake.
A data lakehouse is an emerging system design that combines the data structures and management features of a data warehouse with the low-cost storage of a data lake.
What Our Data Lakehouse Architecture Looks Like
Our data lakehouse architecture is made up of roughly five layers:
- Data Source Layer: Here we defined the data sources, whether internal or external, and how they are ingested into the system. We mostly used REST APIs and data connectors for this.
- Ingestion layer: Data is pulled from different sources and delivered to the storage layer. This layer is further divided into two sub-layers.
- Transformation and Storage layer: We used Apache Airflow to schedule and orchestrate data pipelines (workflows). Orchestration refers to the sequencing, coordination, scheduling, and management of complex data pipelines from diverse sources. Various types of data (structured, semi-structured, and unstructured) are kept in a cost-effective object store, such as Amazon S3. We save the data in Amazon S3 in the Parquet file format. Both CSV and Parquet are used to store data, but internally they could not be more different: CSV is row storage, while Parquet organizes the data in columns.
- Data Curation Layer: Building a processing layer requires two components, so we combined two cloud services, AWS Glue and Amazon Athena. Glue crawlers act as the data integration service for S3, and Athena queries the data coming from S3 after it has passed through the Glue ETL pipeline.
- Cataloging Layer: We have multiple solutions for this layer. We run Metabase, an open-source business intelligence tool, on a t3.medium instance to simplify business intelligence and
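To make the sequencing part of orchestration more concrete, here is a minimal sketch in plain Python. It is not Airflow code and not our actual pipeline: the task names (`extract_crm`, `load_to_s3`, and so on) are hypothetical, and the standard library's `graphlib` stands in for a real scheduler. It only shows how declared dependencies are resolved into a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# These names are illustrative assumptions, not the real pipeline.
PIPELINE = {
    "extract_crm": set(),
    "extract_erp": set(),
    "transform_to_parquet": {"extract_crm", "extract_erp"},
    "load_to_s3": {"transform_to_parquet"},
    "run_glue_crawler": {"load_to_s3"},
}


def run_order(pipeline):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


def run_pipeline(pipeline, actions):
    """Call each task's function in dependency order; return the order used."""
    executed = []
    for task in run_order(pipeline):
        actions[task]()  # a real orchestrator also handles retries and schedules
        executed.append(task)
    return executed


if __name__ == "__main__":
    log = []
    actions = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
    print(run_pipeline(PIPELINE, actions))
```

An orchestrator such as Airflow layers scheduling, retries, and monitoring on top of this kind of dependency graph.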
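The difference between row storage and column storage can also be sketched in a few lines of plain Python. This only illustrates the two layouts; it does not read or write real CSV or Parquet files, and the sample records and function names are made up for the example.

```python
# Hypothetical sample records, made up for illustration.
RECORDS = [
    {"customer": "Asha", "zip": "560001", "amount": 120.0},
    {"customer": "Ben", "zip": "94105", "amount": 75.5},
    {"customer": "Chen", "zip": "10001", "amount": 300.0},
]


def to_row_layout(records):
    """Row storage (CSV-like): all fields of one record are kept together."""
    return [tuple(record.values()) for record in records]


def to_column_layout(records):
    """Column storage (Parquet-like): all values of one field are kept together."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def total_from_rows(rows, amount_index=2):
    """Summing one field in row layout walks through every full row."""
    return sum(row[amount_index] for row in rows)


def total_from_columns(columns):
    """In column layout the same query reads only the column it needs."""
    return sum(columns["amount"])


if __name__ == "__main__":
    rows = to_row_layout(RECORDS)
    columns = to_column_layout(RECORDS)
    print(rows[0])
    print(columns["amount"])
    print(total_from_rows(rows), total_from_columns(columns))
```

Analytical queries usually touch a few columns across many rows, which is why a columnar layout tends to suit them better than a row layout.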
What Is the Business Use Case of a Data Lakehouse?
By building a data lakehouse, we combined a data lake and a data warehouse, which allows the data teams and all the data-savvy folks in the organization to operate swiftly because they no longer need to access multiple systems to use the data.
Better reliability: Data engineers no longer have to worry about data pipelines breaking because of data quality issues and weak integrations, which would eventually disrupt the whole inflow of data into a data silo.
Less redundant data: This is the main factor that moved us from building a data lake to building a data lakehouse. In a data lake, we just dump all the data we have, and eventually the lake turns into a big, stinking data swamp; in other words, GIGO (garbage in, garbage out). With a data lakehouse, we have full control over what gets into the lakehouse, and we eliminated redundancies to support efficient data movement.
No more stale data: We improved data freshness to the point where data that used to refresh every month now refreshes daily or hourly, which produces closer to real-time insights and supports better decision-making.
Efficient and affordable: Since we control the data going into the lakehouse and monitor every ETL pipeline, moving everything into a single architecture saved us a lot of money; we cut our monthly cloud expense by almost 65%.
We also applied a few other techniques to make it cheaper, but that’s a topic for another day.
George Fraser, CEO and co-founder of Fivetran, has expressed the belief that data lakehouses will become increasingly popular, because having data stored in an open-source format that query engines can access allows businesses to extract maximum value from the data they already have. Cost-effectiveness is another area where the data lakehouse usually outperforms the data warehouse.
Still, you can’t deny that the data lakehouse as a concept is at a very nascent stage compared to the data warehouse. Yet after all these considerations, the data lakehouse performs far more efficiently and effectively.