Data Lake 101: The Basics


    21/08/2019

    Reading Time: 5 minutes

    What is a Data Lake?

    A data lake is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale. You can run different types of analytics on that data, such as batch reporting, real-time analytics, or machine learning, to understand customer behavior and serve your customers better.


    Why do you need a data lake?

    Most companies have a significant amount of data generated from several channels such as click-streams, internet-connected devices, social media, or log files; however, they are often unsure how to leverage this data. According to an Aberdeen survey, organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. This is because data lakes helped them generate business value from their data and identify and act on opportunities for faster business growth.

    Deploying Data Lakes in the cloud

    There are many reasons to deploy a data lake in the cloud, among them scalability, security, high availability, faster deployment, frequent feature and functionality updates, and reduced costs. ESG research found that 39% of respondents consider the cloud their primary deployment environment for analytics, 41% for data warehouses, and 43% for Spark, and these numbers are expected to keep growing.

    Build a Data Lake on AWS

    As an Advanced AWS Partner, we offer to build a cost-optimized, comprehensive, scalable, and secure data lake for your business needs with minimum effort. With several years of know-how in this domain, we can help you either migrate your existing big data cluster to the cloud or build a data lake from scratch. We use open-source components within each layer of the data lake, so you don't pay anything in license fees. There is no minimum fee or setup cost for any of the services; you only pay for what you use. With the help of the tools and services we use, Commencis guarantees to substantially reduce your existing on-premises big data cluster costs and report execution times (by up to ten times).

    Layers of a Data Lake on AWS

    Data lakes consist of several layers. Below are the main AWS services that we use in each layer of a data lake:

    • Data Ingestion (Kinesis Firehose)
    • ETL (Glue)
    • Storage (S3)
    • Analytics (EMR)
    • Security (IAM, STS, KMS)
    • Orchestration (Data Pipeline, Step Functions, Airflow)
    • Monitoring (CloudWatch, CloudTrail)
    • Visualization (Athena, QuickSight)

    Data Ingestion

    Data ingestion is the first component of our architecture. In data lakes, data is gathered from several channels such as internet-connected devices, click events, and social media. The data coming from these channels can be ingested into the cloud in different ways; we use Amazon Kinesis Data Firehose for this purpose. Kinesis Data Firehose is Amazon's fully managed streaming service that can load data into data lakes, data stores, and analytics tools, and it can also batch, compress, transform, and encrypt the data before loading it. Firehose is scalable as well, meaning you don't have to worry about ingestion capacity during peaks in your system.
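
    As an illustration, here is a minimal Python (boto3) sketch of pushing click events into a Firehose delivery stream. The stream name, region, and event fields are hypothetical, and the delivery stream (with its S3 destination) is assumed to already exist.

        import json
        import boto3

        # Send a small batch of click events to a hypothetical Kinesis Data
        # Firehose delivery stream; Firehose buffers, optionally transforms,
        # and delivers them to the configured destination (e.g. S3).
        firehose = boto3.client("firehose", region_name="eu-west-1")

        events = [
            {"user_id": "u-123", "event": "page_view", "ts": "2019-08-21T10:00:00Z"},
            {"user_id": "u-456", "event": "click", "ts": "2019-08-21T10:00:01Z"},
        ]

        response = firehose.put_record_batch(
            DeliveryStreamName="clickstream-delivery",  # hypothetical stream name
            Records=[{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events],
        )

        # FailedPutCount > 0 means some records were rejected and should be retried.
        print("Failed records:", response["FailedPutCount"])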

    ETL

    After ingestion, we transform our data to prepare it for analytics. We use the AWS Glue service for ETL operations: Amazon's fully managed, serverless ETL service that creates ETL jobs with minimum effort. Glue discovers our data and stores the associated metadata (e.g. table definitions and schemas) in the Glue Data Catalog. Once cataloged, our data is immediately searchable and available for ETL. It is also possible (and recommended) to convert the data into columnar formats such as Parquet or ORC to optimize cost and improve performance.
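
    To make this concrete, below is a condensed sketch of a Glue ETL job script that reads a crawled table from the Data Catalog and rewrites it to S3 as partitioned Parquet. The database, table, bucket, and partition column names are hypothetical placeholders.

        import sys

        from awsglue.context import GlueContext
        from awsglue.job import Job
        from awsglue.utils import getResolvedOptions
        from pyspark.context import SparkContext

        args = getResolvedOptions(sys.argv, ["JOB_NAME"])
        glue_context = GlueContext(SparkContext.getOrCreate())
        job = Job(glue_context)
        job.init(args["JOB_NAME"], args)

        # Read the raw events table previously discovered by a Glue crawler.
        events = glue_context.create_dynamic_frame.from_catalog(
            database="datalake_raw", table_name="click_events"
        )

        # Write the data back to S3 as partitioned, columnar Parquet so that
        # downstream engines scan less data per query.
        glue_context.write_dynamic_frame.from_options(
            frame=events,
            connection_type="s3",
            connection_options={
                "path": "s3://my-datalake/processed/click_events/",
                "partitionKeys": ["event_date"],
            },
            format="parquet",
        )
        job.commit()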

    Storage

    Storage lies at the heart of a data lake. Because of the significant amount of data being stored (often petabytes), the underlying data storage must be scalable, reliable, and cost-effective. Amazon Simple Storage Service (S3), an object storage service that offers scalability, data availability, security, and performance, meets all of these requirements. We use S3 instead of HDFS to store our data, so the data does not have to live in Hadoop clusters all the time; decoupling storage from compute in this way brings scalability, reliability, and cost optimization.
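
    The sketch below shows, with a hypothetical bucket and key, the kind of partitioned and encrypted layout we write into S3. Hive-style key prefixes (year=/month=/day=) let Glue, Athena, and Spark prune partitions at query time.

        import boto3

        s3 = boto3.client("s3")

        # Upload one raw data file into a Hive-style partitioned prefix,
        # encrypting it at rest with a KMS-managed key.
        with open("events-0001.json.gz", "rb") as body:
            s3.put_object(
                Bucket="my-datalake",  # hypothetical bucket
                Key="raw/click_events/year=2019/month=08/day=21/events-0001.json.gz",
                Body=body,
                ServerSideEncryption="aws:kms",
            )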

    Analytics

    Depending on our clients' know-how and the tools they wish to use, there are many analytics solutions in the industry. Even so, Amazon's big data platform, EMR, has an answer for almost every situation. AWS EMR bundles the most popular open-source big data tools, such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, and Presto, with the dynamic scalability of Amazon EC2 and the scalable storage of Amazon S3. EMR gives analytical teams the engines and elasticity to run petabyte-scale analysis for a fraction of the cost of traditional on-premises clusters. Developers and analysts can use Jupyter-based EMR Notebooks for iterative development and testing.
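
    As a flavor of what such an analysis looks like, here is a minimal PySpark sketch, as it might run on an EMR cluster, of a batch report over the Parquet data written by the ETL layer. Paths and column names are hypothetical.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.appName("daily-report").getOrCreate()

        # On EMR, Spark reads s3:// paths directly through EMRFS.
        events = spark.read.parquet("s3://my-datalake/processed/click_events/")

        # Example batch report: daily event counts per event type.
        report = (
            events.groupBy("event_date", "event")
                  .agg(F.count("*").alias("events"))
        )

        report.write.mode("overwrite").parquet("s3://my-datalake/reports/daily_events/")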

    Security

    If you are planning to keep customer data in the cloud, security becomes the most important topic. All the AWS services we consider using are GDPR-ready, meaning they come with encryption, access management, and monitoring capabilities. The AWS Identity and Access Management (IAM) service enables you to manage access to AWS services and resources, and with the AWS Security Token Service (STS) you can request temporary, limited-privilege credentials for IAM identities. Encryption is a crucial aspect of keeping your data secure: with the AWS Key Management Service (KMS) you can create and manage the keys used for encryption, and you can also use AWS-managed keys or client-side encryption.
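
    The snippet below sketches how STS hands out temporary credentials scoped to a limited-privilege role. The account id and role name are hypothetical; the role's policy determines what the resulting client may touch.

        import boto3

        sts = boto3.client("sts")

        # Request temporary credentials for a hypothetical read-only role;
        # they expire automatically after the requested duration.
        creds = sts.assume_role(
            RoleArn="arn:aws:iam::123456789012:role/datalake-readonly",
            RoleSessionName="analyst-session",
            DurationSeconds=3600,
        )["Credentials"]

        # Build an S3 client that can do only what the role's policy allows.
        s3 = boto3.client(
            "s3",
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )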

    Monitoring

    For monitoring purposes, we use Amazon's monitoring services, AWS CloudWatch and AWS CloudTrail. CloudWatch enables you to monitor your applications, understand and respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. With the help of CloudTrail, we can track and record each API call made to the AWS environment, which lets us log every user activity across AWS services and resources.
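
    As an example, the sketch below defines a hypothetical CloudWatch alarm that fires when a Firehose delivery stream stops receiving records, which usually means an upstream producer has broken. Names and thresholds are placeholders.

        import boto3

        cloudwatch = boto3.client("cloudwatch")

        # Alarm when fewer than one record arrives in three consecutive
        # 5-minute windows; treat missing data as breaching so total
        # silence also triggers the alarm.
        cloudwatch.put_metric_alarm(
            AlarmName="firehose-no-incoming-records",
            Namespace="AWS/Firehose",
            MetricName="IncomingRecords",
            Dimensions=[{"Name": "DeliveryStreamName", "Value": "clickstream-delivery"}],
            Statistic="Sum",
            Period=300,
            EvaluationPeriods=3,
            Threshold=1,
            ComparisonOperator="LessThanThreshold",
            TreatMissingData="breaching",
        )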

    Orchestration

    Orchestration is another mandatory requirement for data lakes. It is very important for an enterprise to be able to schedule, trigger or retrigger, and monitor big data workflows. We use Apache Airflow, a popular open-source orchestration tool, to achieve this. Airflow gives us the ability to run transient EMR clusters, meaning the clusters run only when there is work to do; ultimately, this reduces cluster costs significantly. Of course, AWS offers its own services for the same purpose, such as AWS Step Functions and AWS Data Pipeline, but Airflow's GUI and ease of use keep it one step ahead of its equivalents.
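
    Below is a condensed sketch of such a transient-cluster DAG: create an EMR cluster, run one Spark step, then tear the cluster down whatever happens. Operator import paths depend on your Airflow and Amazon provider versions, and the cluster configuration and step definition are hypothetical and heavily abbreviated.

        from datetime import datetime

        from airflow import DAG
        from airflow.providers.amazon.aws.operators.emr import (
            EmrAddStepsOperator,
            EmrCreateJobFlowOperator,
            EmrTerminateJobFlowOperator,
        )

        with DAG(
            "daily_report",
            start_date=datetime(2019, 8, 21),
            schedule_interval="@daily",
            catchup=False,
        ) as dag:

            # Spin up a cluster just for this run (a real config also needs
            # instance types, counts, and an EMR release label).
            create_cluster = EmrCreateJobFlowOperator(
                task_id="create_cluster",
                job_flow_overrides={"Name": "transient-report-cluster"},
            )

            # Submit a single spark-submit step against the new cluster.
            run_step = EmrAddStepsOperator(
                task_id="run_spark_report",
                job_flow_id=create_cluster.output,
                steps=[{
                    "Name": "daily-report",
                    "ActionOnFailure": "TERMINATE_CLUSTER",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "s3://my-datalake/jobs/daily_report.py"],
                    },
                }],
            )

            # Tear the cluster down even if the step fails, so we never pay
            # for idle nodes.
            terminate_cluster = EmrTerminateJobFlowOperator(
                task_id="terminate_cluster",
                job_flow_id=create_cluster.output,
                trigger_rule="all_done",
            )

            create_cluster >> run_step >> terminate_cluster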

    Visualization

    Last but not least, visualization is an essential topic for a data lake. Business analysts and data scientists need visual dashboards to see and analyze the output of the workflows. Most enterprises pay substantial license fees for popular BI tools, but thanks to AWS services, you don't have to. Amazon QuickSight is a fast, cloud-powered business intelligence service that makes it easy to deliver insights to everyone in your organization. As a fully managed service, QuickSight lets you easily create and publish interactive dashboards that include ML Insights. Dashboards can then be accessed from any device and embedded into your applications, portals, and websites. You can also control or limit user access to the dashboards to provide the appropriate level of data security.

    Another great tool we use on AWS is Amazon Athena, an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is fully managed and serverless, so there is no infrastructure to manage, and you pay only for the queries you run. Athena is easy to use: simply point it at your data in Amazon S3, define the schema, and start querying with standard SQL, which makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. At the same time, Athena is integrated with the AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas, populate your catalog with new and modified table and partition definitions, and maintain schema versioning.
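
    To show how little ceremony a query takes, here is a minimal boto3 sketch that submits an Athena query over the hypothetical events table used throughout this article; the database and results bucket are placeholders too.

        import boto3

        athena = boto3.client("athena")

        # Standard SQL (Presto dialect) over data sitting in S3.
        query = """
            SELECT event, COUNT(*) AS events
            FROM click_events
            WHERE event_date = DATE '2019-08-21'
            GROUP BY event
            ORDER BY events DESC
        """

        # Athena runs asynchronously and writes results to the given S3 location.
        execution = athena.start_query_execution(
            QueryString=query,
            QueryExecutionContext={"Database": "datalake_processed"},
            ResultConfiguration={"OutputLocation": "s3://my-datalake/athena-results/"},
        )

        print("Query execution id:", execution["QueryExecutionId"])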

