What is Azure Data Lake Storage?
Azure Data Lake Storage (ADLS) is a secure and scalable data lake for high-performance analytics workloads. It was formerly known as Azure Data Lake Store. It offers a single storage platform for integrating large volumes of organizational data. It is cost-effective, with tiered storage and policy management. ADLS also offers single sign-on capabilities and access controls, and its data is accessed through the Hadoop Distributed File System (HDFS) interface, so any tool that supports HDFS can work with Azure Data Lake Storage.
Benefits of Azure Data Lake
The Data Lake in Azure solution is designed for organizations that want to take advantage of Big Data. It provides a data platform that can help Developers, Data Scientists, and Analysts store data of any size and format and perform all types of processing and analytics across multiple platforms using various programming languages. It can also work with existing solutions, such as identity management and security solutions. Moreover, it integrates with other data warehouses and cloud environments. It can be useful for organizations that need the following:
- Azure Active Directory:
Azure Active Directory (AAD) provides the identities used within the solution, and Role-Based Access Control (RBAC) determines what each identity is allowed to do. Applications and services can authenticate either as a service principal or as a managed identity. With a service principal, the service that connects has to store the principal’s credentials, whereas a managed identity is tied directly to the service, so there is no credential storage to manage.
- Multi-protocol SDK:
This is a newer version of the Blob Storage SDK that can be used with Azure Data Lake. It handles reading and writing data in ADLS and retries when a transient failure occurs. It has some limitations, however: it cannot perform atomic manipulations or control access.
- Low-cost Storage:
Azure Storage is a cost-effective option for data storage, with features such as migrating data from hot storage to colder storage, lifecycle management, archive storage, and more.
- Reliability:
Azure Storage allows users to make copies of their data to prepare for data center failure or a natural disaster. Also, the advanced threat detection system integrates with the data storage and detects malicious programs or software that might damage the data or compromise your privacy.
- Scalability:
Azure Storage is massively scalable, with capacity limits measured in petabytes. The exact limit depends on the region and has changed over time, so confirm it against the current Azure Storage scalability targets. It offers both horizontal and vertical scaling.
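The multi-protocol SDK item above mentions retrying when a transient failure occurs. The following is a minimal sketch of that general idea using exponential backoff. It is not the SDK's implementation: `TransientError`, `with_retries`, and `make_flaky_read` are hypothetical names invented for this illustration, and the delay values are assumptions.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a temporary failure such as a timeout."""


def with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError, wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay doubles after each failed attempt: 0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_read(failures):
    """Build a fake read that fails `failures` times, then returns data."""
    state = {"calls": 0}

    def flaky_read():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError("simulated transient failure")
        return b"payload"

    return flaky_read


if __name__ == "__main__":
    # Two simulated failures, then success on the third attempt.
    print(with_retries(make_flaky_read(2), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting. Only errors that are known to be temporary should be retried; anything else is better raised immediately.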
Working of Azure Data Lake
Azure Data Lake is built on Azure Blob storage, the Microsoft object storage solution for the cloud. The solution features low-cost, tiered storage and high-availability/disaster recovery capabilities. It integrates with other Azure services, including Azure Data Factory, a tool used for creating and running extract, transform, and load (ETL) and extract, load, and transform (ELT) processes.
The solution is based on the Apache Hadoop YARN (Yet Another Resource Negotiator) cluster management platform. It can scale dynamically across SQL servers within the data lake, as well as servers in the Azure SQL Database and the Azure SQL Data Warehouse.
To start using Azure Data Lake, you need to create a free account on the Microsoft Azure portal. From the portal, you will be able to access all of the Azure services.
Azure Data Lake Store Security
When implementing a Big Data solution, security shouldn’t be optional. To conform to security standards and limit sensitive information visibility, data must be secured in transit and at rest. ADLS provides rich security capabilities so users can have peace of mind when storing their assets in the ADLS infrastructure. Users can monitor performance, audit usage, and control access through the integrated Azure Active Directory.
Auditing
ADLS creates audit logs for all operations performed in it. These logs can be analyzed with U-SQL scripts.
Access Control
ADLS provides access control through the support of POSIX-compliant access control lists (ACL) on files and folders stored in its infrastructure. It also manages authentication through the integration of AAD based on OAuth tokens from supported identity providers. Tokens will carry the user’s security group data, and this information will be passed through all the ADLS microservices.
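To make the access-control model more concrete, here is a minimal sketch of how a POSIX-style ACL is commonly evaluated: owner first, then named users, then groups, then everyone else, with the mask limiting named-user and group entries. This illustrates the general POSIX algorithm, not ADLS source code. The `Acl` class, the `is_allowed` function, and the user and group names are all hypothetical.

```python
from dataclasses import dataclass, field

READ, WRITE, EXECUTE = 4, 2, 1  # classic rwx permission bits


@dataclass
class Acl:
    owner: str
    owner_perms: int
    owning_group: str
    owning_group_perms: int
    other_perms: int
    mask: int = READ | WRITE | EXECUTE  # upper bound for named users and groups
    named_users: dict = field(default_factory=dict)
    named_groups: dict = field(default_factory=dict)


def is_allowed(acl, user, groups, requested):
    """Return True if `user` (a member of `groups`) holds all `requested` bits."""
    if user == acl.owner:
        # The owner entry is not limited by the mask.
        return (acl.owner_perms & requested) == requested
    if user in acl.named_users:
        return (acl.named_users[user] & acl.mask & requested) == requested
    # Collect every group entry that applies to this user.
    group_perms = []
    if acl.owning_group in groups:
        group_perms.append(acl.owning_group_perms)
    group_perms += [p for g, p in acl.named_groups.items() if g in groups]
    if group_perms:
        # One matching group entry that grants the request is enough.
        return any((p & acl.mask & requested) == requested for p in group_perms)
    return (acl.other_perms & requested) == requested


if __name__ == "__main__":
    acl = Acl(owner="alice", owner_perms=7, owning_group="analysts",
              owning_group_perms=5, other_perms=0, mask=5,
              named_users={"bob": 7})
    print(is_allowed(acl, "bob", set(), READ))   # bob's rwx entry is capped by the r-x mask
    print(is_allowed(acl, "bob", set(), WRITE))
```

The `groups` argument plays the role of the security group data that the text says is carried in the user's token.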
Data Encryption
ADLS encrypts data in transit and at rest, providing server-side encryption of data with the help of keys, including customer-managed keys in the Azure Key Vault.
Data Encryption Key Types
ADLS uses a Master Encryption Key (MEK) stored in Azure Key Vault to encrypt and decrypt data. Users have the option to manage this key themselves, but then there is always a risk of being unable to decrypt the data if the key is lost. ADLS also includes the following keys:
- Block Encryption Key (BEK): These keys are generated for each block of data.
- Data Encryption Key (DEK): These keys are encrypted by the MEK and are responsible for generating BEKs to encrypt data blocks.
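The relationship between the three key types can be sketched in a few lines. This is an illustration of the hierarchy only, under loud assumptions: the source does not say how ADLS wraps the DEK or derives BEKs, so the HMAC-based derivation and the XOR "toy cipher" below are stand-ins chosen to keep the example dependency-free. They are not Azure's algorithms and are not safe for real data. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def toy_cipher(key, data, label):
    """XOR data with an HMAC-SHA256 keystream.

    Toy cipher for illustration only: NOT what ADLS uses, NOT secure.
    Applying it twice with the same key and label restores the input.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hmac.new(key, label + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


def derive_bek(dek, block_index):
    """Derive a per-block key (BEK) from the DEK and the block's index (assumed scheme)."""
    return hmac.new(dek, b"bek" + block_index.to_bytes(8, "big"), hashlib.sha256).digest()


def encrypt_blocks(mek, blocks):
    """Return (wrapped_dek, encrypted_blocks) for a list of plaintext blocks."""
    dek = secrets.token_bytes(32)
    wrapped_dek = toy_cipher(mek, dek, b"wrap")  # the DEK is kept only in encrypted form
    encrypted = [toy_cipher(derive_bek(dek, i), block, b"data")
                 for i, block in enumerate(blocks)]
    return wrapped_dek, encrypted


def decrypt_blocks(mek, wrapped_dek, encrypted):
    """Unwrap the DEK with the MEK, re-derive each BEK, and decrypt every block."""
    dek = toy_cipher(mek, wrapped_dek, b"wrap")
    return [toy_cipher(derive_bek(dek, i), block, b"data")
            for i, block in enumerate(encrypted)]


if __name__ == "__main__":
    mek = secrets.token_bytes(32)  # in ADLS the MEK lives in Azure Key Vault
    wrapped, enc = encrypt_blocks(mek, [b"first block", b"second block"])
    print(decrypt_blocks(mek, wrapped, enc))
```

The sketch also shows why losing the MEK is fatal: without it the wrapped DEK cannot be recovered, and without the DEK no BEK can be re-derived.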
Azure Data Lake Store Pricing
At the time of writing, Data Lake Store is available in the US-2 region at preview pricing rates (excluding outbound data transfer):
Usage | Cost |
---|---|
Data Stored | US$0.04 per GB per month |
Data Lake Transactions | US$0.07 per million transactions |
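As a quick worked example of the table above, the sketch below estimates a monthly bill from the two preview rates. It assumes transactions are billed pro rata per million and ignores outbound data transfer, which the table excludes. The function and constant names are hypothetical.

```python
STORAGE_USD_PER_GB_MONTH = 0.04      # "Data Stored" rate from the table
USD_PER_MILLION_TRANSACTIONS = 0.07  # "Data Lake Transactions" rate from the table


def monthly_cost(stored_gb, transactions):
    """Estimate the monthly cost in US dollars, rounded to cents."""
    storage_cost = stored_gb * STORAGE_USD_PER_GB_MONTH
    # Assumption: partial millions of transactions are billed proportionally.
    transaction_cost = transactions / 1_000_000 * USD_PER_MILLION_TRANSACTIONS
    return round(storage_cost + transaction_cost, 2)


if __name__ == "__main__":
    # 5,000 GB stored and 20 million transactions in one month.
    print(f"US${monthly_cost(5_000, 20_000_000):.2f}")
```

For 5,000 GB and 20 million transactions this works out to US$200.00 for storage plus US$1.40 for transactions, so storage dominates the bill at these rates.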
In the next section of this Azure Data Lake Tutorial, you will learn to get started with Analytics.
How do I get started?
Getting started with Azure Data Lake Analytics is extremely easy. Here’s what you’ll need:
- An Azure subscription: grab a free trial if you don’t have one.
- An Azure Data Lake Analytics account: create one in your Azure subscription.
  - You’ll also have to create a Store account during this step.
- Some data to play with: start with text or images.
You don’t need to install anything on your personal computer to use it. You can write and submit the necessary jobs in your browser.
Components of Azure Data Lake
The full solution consists of three components that provide storage, analytics service, and cluster capabilities.
Azure Data Lake Storage is a massively scalable and secure data lake for high-performance analytics workloads. Azure Data Lake Storage was formerly known as, and is sometimes still referred to as, the Azure Data Lake Store. Designed to eliminate data silos, it provides a single storage platform that organizations can use to integrate their data.
The storage can help optimize costs with tiered storage and policy management. It also provides role-based access controls and single sign-on capabilities through Azure Active Directory. Users can manage and access data within the storage using the Hadoop Distributed File System (HDFS). Therefore, any HDFS-based tool that you are using will work with ADLS.
Azure Data Lake Analytics is an on-demand analytics platform for Big Data. Users can develop and run parallel data transformation and processing programs in U-SQL, R, Python, and .NET over petabytes of data. U-SQL is a Big Data query language created by Microsoft for the Azure Data Lake Analytics service. With Azure Data Lake Analytics, users pay per job to process data on demand in an analytics-as-a-service environment. It is a cost-effective analytics solution because you pay only for the processing power that you use.
Azure HDInsight is a cluster management solution that offers easy, fast, and cost-effective ways to process massive amounts of data. It’s a cloud deployment of Apache Hadoop that enables users to take advantage of optimized open-source analytic clusters for Apache Spark, Hive, MapReduce, HBase, Storm, Kafka, and R Server. With these frameworks, you can support a broad range of functions, such as ETL, data warehousing, machine learning, and IoT. Azure HDInsight also integrates with Azure Active Directory for role-based access controls and single sign-on capabilities.
Need for Azure Data Lake
The Azure Data Lake offers the following benefits and facilities:
- Data warehousing: Since the solution supports any type of data, you can use it to integrate all of your enterprise data into a single data warehouse.
- Internet of Things (IoT) capabilities: The Azure platform provides tools for processing streaming data in real-time from multiple types of devices.
- Support for hybrid cloud environments: You can use the Azure HDInsight component to extend an existing on-premises Big Data infrastructure to the Azure cloud.
- Enterprise features: The environment is managed and supported by Microsoft and includes enterprise features for security, encryption, and governance. You can also extend your on-premises security solutions and controls to the Azure cloud environment.
- Speed to deployment: It’s easy to get up and running with the Azure Data Lake solution. All of the components are available through the portal and there are no servers to install or infrastructure to manage.