- The client wanted to create a centralized repository for all organizational data.
- The goal was to make the data accessible to everyone across the organization while keeping it well organized.
- The client needed continuous data refreshes so the repository stayed current.
- The main objective was to convert raw data into actionable insights for better decision-making.
- The client wanted to build a data lakehouse architecture on the cloud to achieve this.
The main challenges of the project were:
- Collecting and processing data from various sources in a raw format.
- Building a scalable infrastructure to handle large amounts of data.
- Integrating different technologies to build a unified system.
- Providing easy access to data for end-users while ensuring data security and governance.
The following items were provided as deliverables:
- Data Collection Layer (Data Source + Transfer Layer)
- Data Ingestion Layer
- Storage and Metadata Processing Layer
- Cataloging Layer
- BI/Reporting Layer
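The way these layers hand data to one another can be sketched as a minimal pipeline. This is a toy illustration under assumed names (`collect`, `ingest`, `store_and_catalog`, `report` are all hypothetical), not the client's actual implementation, which ran on AWS DataSync, Airflow, and Hive Metastore.

```python
# Minimal sketch of the layered flow; all names are illustrative assumptions,
# not the client's actual code.

def collect(sources):
    """Data Collection Layer: gather raw records from each source."""
    return [record for source in sources for record in source]

def ingest(raw_records):
    """Data Ingestion Layer: wrap each record with basic lineage metadata."""
    return [{"payload": r, "source_format": "raw"} for r in raw_records]

def store_and_catalog(records):
    """Storage + Metadata/Cataloging Layers: index records by a derived key."""
    return {f"record_{i}": rec for i, rec in enumerate(records)}

def report(catalog):
    """BI/Reporting Layer: expose a simple aggregate for dashboards."""
    return {"record_count": len(catalog)}

# Example run over two hypothetical raw sources
sources = [["a", "b"], ["c"]]
summary = report(store_and_catalog(ingest(collect(sources))))
print(summary)  # {'record_count': 3}
```

Each function stands in for one deliverable layer, so the composition mirrors the left-to-right data flow of the architecture.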
I followed a phased approach to building an enterprise data lakehouse on the cloud to improve data governance for my client. The five stages were data collection, ingestion, storage and metadata processing, cataloging, and BI/reporting. Using AWS DataSync agents, Airflow, and Hive Metastore, I built fully functional data lakehouses for two sub-verticals and provided a data cataloging tool and a BI endpoint for consuming the data. The architecture gave the organization a centralized repository for all of its data, made that data accessible to everyone, and helped convert raw data into actionable insights for better decision-making.
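The cataloging role that Hive Metastore played here can be illustrated with a toy analogue: a registry that maps a table name to its schema and storage location, which is what lets BI tools query by name without knowing the underlying paths. The class, table name, and S3 prefix below are all hypothetical.

```python
# Toy analogue of a metadata catalog (the role Hive Metastore played in
# the architecture); every name here is an illustrative assumption.

class MetadataCatalog:
    def __init__(self):
        self._tables = {}

    def register_table(self, name, schema, location):
        """Record a table's schema and storage path, as a metastore would."""
        self._tables[name] = {"schema": schema, "location": location}

    def lookup(self, name):
        """BI tools resolve a table name to its schema and data location."""
        return self._tables[name]

catalog = MetadataCatalog()
catalog.register_table(
    "sales",                                   # hypothetical table name
    {"order_id": "bigint", "amount": "double"},
    "s3://example-lakehouse/sales/",           # hypothetical S3 prefix
)
print(catalog.lookup("sales")["location"])  # s3://example-lakehouse/sales/
```

Keeping this name-to-location mapping out of the BI layer is what makes the storage layer swappable and is the core of the governance benefit described above.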