Monday, 21 March 2022

AWS Glue

 

  • A fully managed service to extract, transform, and load (ETL) your data for analytics.
  • Discover and search across different AWS data sets without moving your data.
  • AWS Glue consists of:
    • Central metadata repository
    • ETL engine
    • Flexible scheduler

Use Cases

  • Run queries against an Amazon S3 data lake
    • You can use AWS Glue to make your data available for analytics without moving your data.
  • Analyze the log data in your data warehouse
    • Create ETL scripts to transform, flatten, and enrich the data from source to target.
  • Create event-driven ETL pipelines
    • As soon as new data becomes available in Amazon S3, you can run an ETL job by invoking AWS Glue ETL jobs using an AWS Lambda function (a sketch follows this list).
  • A unified view of your data across multiple data stores
    • With AWS Glue Data Catalog, you can quickly search and discover all your datasets and maintain the relevant metadata in one central repository.
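
A minimal sketch of the event-driven pattern above, assuming a hypothetical Glue job named orders-etl: an AWS Lambda function subscribed to S3 ObjectCreated events starts the job with boto3 and passes the new object's location as job arguments.

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Each S3 event record carries the bucket and key of the newly created object.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Start the (hypothetical) Glue ETL job and hand it the object location.
        glue.start_job_run(
            JobName="orders-etl",
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
```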

Concepts

  • AWS Glue Data Catalog
    • A persistent metadata store.
    • Metadata about the data used as sources and targets of your ETL jobs is stored in the Data Catalog.
    • You can only use one Data Catalog per region.
    • The AWS Glue Data Catalog can be used as the Hive metastore.
    • It can contain database and table resource links.

  • Database
    • A set of associated table definitions, organized into a logical group.
    • A container for tables that define data from different data stores.
    • If the database is deleted from the Data Catalog, all the tables in the database are also deleted.
    • A link to a local or shared database is called a database resource link.
  • Data store, data source, and data target
    • To persistently store your data in a repository, you can use a data store.
      • Data stores: Amazon S3, Amazon RDS, Amazon Redshift, Amazon DynamoDB, JDBC
    • A data source is a data store that is used as input to a process or transform.
    • A data target is a data store that a process or transform writes to.
  • Table
    • The metadata definition that represents your data.
    • You can define tables using JSON, CSV, Parquet, Avro, and XML.
    • You can use the table as the source or target in a job definition.
    • A link to a local or shared table is called a table resource link.
    • To add a table definition:
      • Run a crawler.
      • Create a table manually using the AWS Glue console.
      • Use the AWS Glue API CreateTable operation (a boto3 sketch follows this list).
      • Use AWS CloudFormation templates.
      • Migrate the Apache Hive metastore.
    • A partitioned table describes an AWS Glue table definition of an Amazon S3 folder.
    • Reduce overall data transfers, data processing, and query processing time with partition indexes.
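
A minimal boto3 sketch of the CreateTable path above; the database, table, columns, and S3 location are hypothetical, and a crawler would normally infer this definition for you.

```python
import boto3

glue = boto3.client("glue")

# Register a CSV table in the Data Catalog (all names and paths are hypothetical).
glue.create_table(
    DatabaseName="sales_db",
    TableInput={
        "Name": "raw_orders",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": "order_id", "Type": "string"},
                        {"Name": "amount", "Type": "double"}],
            "Location": "s3://example-bucket/raw/orders/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
                "Parameters": {"separatorChar": ","},
            },
        },
    },
)
```
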
  • Connection
    • It contains the properties that you need to connect to your data.
    • To store connection information for a data store, you can add a connection using:
      • JDBC
      • Amazon RDS
      • Amazon Redshift
      • Amazon DocumentDB
      • MongoDB
      • Kafka
      • Network
    • You can enable SSL connection for JDBC, Amazon RDS, Amazon Redshift, and MongoDB.
  • Crawler
    • You can use crawlers to populate the AWS Glue Data Catalog with tables.
    • Crawlers can crawl file-based and table-based data stores.
      • Data stores: S3, JDBC, DynamoDB, Amazon DocumentDB, and MongoDB
    • It can crawl multiple data stores in a single run.
    • How Crawlers work
      • Determine the format, schema, and associated properties of the raw data by classifying the data – create a custom classifier to configure the results of the classification.
      • Group the data into tables or partitions – you can group the data based on the crawler heuristics.
      • Write metadata to the AWS Glue Data Catalog – set up how the crawler adds, updates, and deletes tables and partitions.
    • For incremental datasets with a stable table schema, you can use incremental crawls. It only crawls the folders that were added since the last crawler run.
    • You can run a crawler on demand or on a schedule (a boto3 sketch follows this list).
    • Select the Logs link to view the results of the crawler. The link redirects you to CloudWatch Logs.
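
A minimal boto3 sketch of creating and starting a crawler; the crawler name, IAM role, database, schedule, and S3 path are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# The role must allow Glue to read the S3 path and write to the Data Catalog.
glue.create_crawler(
    Name="orders-crawler",
    Role="AWSGlueServiceRole-demo",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/orders/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional: run daily at 02:00 UTC
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE",
                        "DeleteBehavior": "LOG"},
)

# Run it on demand instead of waiting for the schedule.
glue.start_crawler(Name="orders-crawler")
```
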
  • Classifier
    • It reads the data in the data store.
    • You can use a set of built-in classifiers or create a custom classifier.
    • By adding a classifier, you can determine the schema of your data.
    • Custom classifier types: Grok, XML, JSON, and CSV
  • AWS Glue Studio
    • Visually author, run, view, and edit your ETL jobs.
    • Diagnose, debug, and check the status of your ETL jobs.
  • Job
    • To perform ETL work, you need to create a job.
    • When creating a job, you need to provide data sources, targets, and other information. AWS Glue generates a PySpark script from this input and stores the job definition in the AWS Glue Data Catalog.
    • Job types: Spark, Streaming ETL, and Python shell
    • Job properties (a boto3 sketch of creating a job with these properties follows this list):
      • Job bookmarks maintain the state information and prevent the reprocessing of old data.
      • Job metrics allow you to enable or disable the creation of CloudWatch metrics when the job runs.
      • Security configuration helps you define the encryption options of the ETL job.
      • Worker type is the predefined worker that is allocated when a job runs.
        • Standard
        • G.1X (memory-intensive jobs)
        • G.2X (jobs with ML transforms)
      • Max concurrency is the maximum number of concurrent runs that are allowed for the created job. If the threshold is reached, an error will be returned.
      • Job timeout (minutes) is the execution time limit.
      • Delay notification threshold (minutes) is the threshold after which AWS Glue sends a delay notification via Amazon CloudWatch if a job runs longer than the specified time.
      • Number of retries allows you to specify the number of times AWS Glue would automatically restart the job if it fails.
      • Job parameters and Non-overrideable Job parameters are sets of key-value pairs.
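
A minimal boto3 sketch of creating a Spark job that sets several of the properties above (worker type, retries, timeout, max concurrency, and job bookmarks); the names, script location, and capacity values are hypothetical.

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="orders-etl",                      # hypothetical job name
    Role="AWSGlueServiceRole-demo",
    Command={"Name": "glueetl",             # Spark job type
             "ScriptLocation": "s3://example-bucket/scripts/orders_etl.py",
             "PythonVersion": "3"},
    WorkerType="G.1X",
    NumberOfWorkers=5,
    Timeout=60,                             # job timeout in minutes
    MaxRetries=1,
    ExecutionProperty={"MaxConcurrentRuns": 2},
    DefaultArguments={
        "--job-bookmark-option": "job-bookmark-enable",  # enable job bookmarks
        "--enable-metrics": "",                          # publish CloudWatch metrics
    },
)
```
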
  • Script
    • A script allows you to extract the data from sources, transform it, and load the data into the targets.
    • You can generate ETL scripts using Scala or PySpark (a minimal PySpark skeleton follows this list).
    • AWS Glue has a script editor that displays both the script and diagram to help you visualize the flow of your data.
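
A minimal PySpark skeleton of the kind AWS Glue generates, shown here as a sketch; the database, table, field mappings, and output path are hypothetical.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table from the Data Catalog into a DynamicFrame.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Rename/retype fields (source name, source type, target name, target type).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")])

# Write the transformed data to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet")

job.commit()
```
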
  • Development endpoint
    • An environment that allows you to develop and test your ETL scripts.
    • To create and test AWS Glue scripts, you can connect the development endpoint using:
      • Apache Zeppelin notebook on your local machine
      • Zeppelin notebook server in Amazon EC2 instance
      • SageMaker notebook
      • Terminal window
      • PyCharm Python IDE
    • With SageMaker notebooks, you can share development endpoints among single or multiple users.
      • Single-tenancy Configuration
      • Multi-tenancy Configuration
  • Notebook server
    • A web-based environment to run PySpark statements.
    • You can use a notebook server for interactive development and testing of your ETL scripts on a development endpoint.
      • SageMaker notebooks server
      • Apache Zeppelin notebook server
  • Trigger
    • It allows you to manually or automatically start one or more crawlers or ETL jobs.
    • You can define triggers based on schedule, job events, and on-demand.
    • You can also use triggers to pass job parameters. If a trigger starts multiple jobs, the parameters are passed to each job.
  • Workflows
    • It helps you orchestrate ETL jobs, triggers, and crawlers.
    • Workflows can be created using the AWS Management Console or AWS Glue API (a boto3 sketch follows this list).
    • You can visualize the components and the flow of work with a graph using the AWS Management Console.
    • Jobs and crawlers can fire an event trigger within a workflow.
    • By defining the default workflow run properties, you can share and manage state throughout a workflow run.
    • With AWS Glue API, you can retrieve the static and dynamic view of a running workflow.
    • The static view shows the design of the workflow, while the dynamic view includes the latest run information for the jobs and crawlers. Run information shows the success status and error details.
    • You can stop, repair, and resume a workflow run.
    • Workflow restrictions:
      • A trigger can be associated with only one workflow.
      • When setting up a trigger, you can only have one starting trigger (on-demand or schedule).
      • If a crawler or job in a workflow is started by a trigger outside the workflow, any triggers within the workflow that depend on that crawler or job completing will not fire.
      • If a crawler or job is started from within the workflow, only the triggers within the workflow will fire upon its completion.
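
A minimal boto3 sketch of a workflow with one scheduled starting trigger and one conditional trigger, matching the restrictions above; all names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="nightly-etl")

# The single starting trigger: run the crawler on a schedule.
glue.create_trigger(
    Name="nightly-start",
    WorkflowName="nightly-etl",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",
    Actions=[{"CrawlerName": "orders-crawler"}],
    StartOnCreation=True,
)

# Conditional trigger: start the job once the crawler succeeds.
glue.create_trigger(
    Name="run-etl-after-crawl",
    WorkflowName="nightly-etl",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{"LogicalOperator": "EQUALS",
                               "CrawlerName": "orders-crawler",
                               "CrawlState": "SUCCEEDED"}]},
    Actions=[{"JobName": "orders-etl"}],
    StartOnCreation=True,
)
```
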
  • Transform
    • To process your data, you can use AWS Glue built-in transforms. These transforms can be called from your ETL script.
    • Enables you to manipulate your data into different formats.
    • Clean your data using machine learning (ML) transforms.
      • Tune transform:
        • Recall vs. Precision
        • Lower Cost vs. Accuracy
      • With match enforcement, you can force the output to match the labels used in teaching the ML transform.
  • Dynamic Frame
    • A distributed table that supports nested data.
    • Each record is self-describing and designed for schema flexibility with semi-structured data.
    • Each record consists of data and schema.
    • You can use dynamic frames to provide a set of advanced transformations for data cleaning and ETL.

Populating the AWS Glue Data Catalog

  1. Select any custom classifiers that will run with a crawler to infer the format and schema of the data. You must provide the code for your custom classifiers, and they run in the order that you specify.
  2. To create a schema, a custom classifier must successfully recognize the structure of your data.
  3. If no custom classifier matches the data’s schema, the built-in classifiers try to recognize it.
  4. For a crawler to access the data stores, you need to configure the connection properties. This will allow the crawler to connect to a data store, and the inferred schema will be created for your data.
  5. The crawler will write metadata to the AWS Glue Data Catalog. The metadata is stored in a table definition, and the table will be written to a database.

Authoring Jobs

  1. You need to select a data source for your job. Define the table that represents your data source in the AWS Glue Data Catalog. If the source requires a connection, you can reference the connection in your job. You can add multiple data sources by editing the script.
  2. Select the data target of your job or allow the job to create the target tables when it runs.
  3. By providing arguments for your job and generated script, you can customize the job-processing environment.
  4. AWS Glue can generate an initial script, but you can also edit the script if you need to add sources, targets, and transforms.
  5. Configure how your job is invoked. You can select on-demand, time-based schedule, or by an event.
  6. Based on the input, AWS Glue generates a Scala or PySpark script. You can edit the script based on your needs.

Glue DataBrew

  • A visual data preparation tool for cleaning and normalizing data to prepare it for analytics and machine learning.
  • You can choose from over 250 pre-built transformations to automate data preparation tasks, such as filtering anomalies, converting data to standard formats, and correcting invalid values. After your data is ready, you can immediately use it for analytics and machine learning projects.
  • When running profile jobs in DataBrew to auto-generate 40+ data quality statistics like column-level cardinality, numerical correlations, unique values, standard deviation, and other statistics, you can configure the size of the dataset you want analyzed.

Monitoring

  • Record the actions taken by a user, role, or AWS service using AWS CloudTrail.
  • You can use Amazon CloudWatch Events with AWS Glue to automate the actions when an event matches a rule.
  • With Amazon CloudWatch Logs, you can monitor, store, and access log files from different sources.
  • You can assign tags to crawlers, jobs, triggers, development endpoints, and machine learning transforms.
  • Monitor and debug ETL jobs and Spark applications using Apache Spark web UI.
  • You can view real-time logs on the Amazon CloudWatch dashboard if you enable continuous logging.

Security

  • Security configuration allows you to encrypt your data at rest using SSE-S3 and SSE-KMS (a boto3 sketch follows this list).
    • S3 encryption mode
    • CloudWatch logs encryption mode
    • Job bookmark encryption mode
  • With AWS KMS keys, you can encrypt the job bookmarks and the logs generated by crawlers and ETL jobs.
  • AWS Glue only supports symmetric customer master keys (CMKs).
  • For data in transit, AWS provides SSL encryption.
  • Managing access to resources using:
    • Identity-Based Policies
    • Resource-Based Policies
  • You can grant cross-account access in AWS Glue using the Data Catalog resource policy and IAM role.
  • Data Catalog Encryption:
    • Metadata encryption
    • Encrypt connection passwords
  • You can create a policy for the data catalog to define fine-grained access control.
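
A minimal boto3 sketch of a security configuration that covers the three encryption modes listed above; the configuration name and KMS key ARN are hypothetical.

```python
import boto3

glue = boto3.client("glue")

KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example"  # hypothetical key

glue.create_security_configuration(
    Name="etl-encryption",
    EncryptionConfiguration={
        "S3Encryption": [{"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": KMS_KEY_ARN}],
        "CloudWatchEncryption": {"CloudWatchEncryptionMode": "SSE-KMS",
                                 "KmsKeyArn": KMS_KEY_ARN},
        "JobBookmarksEncryption": {"JobBookmarksEncryptionMode": "CSE-KMS",
                                   "KmsKeyArn": KMS_KEY_ARN},
    },
)
```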

Pricing

  • You are charged at an hourly rate based on the number of DPUs used to run your ETL job.
  • You are charged at an hourly rate based on the number of DPUs used to run your crawler.
  • Data Catalog storage and requests:
    • You will be charged per month if you store more than a million objects.
    • You will be charged per month if you exceed a million requests in a month.

AWS Data Pipeline

 

  • A web service for scheduling regular data movement and data processing activities in the AWS cloud. Data Pipeline integrates with on-premises and cloud-based storage systems.
  • A managed ETL (Extract-Transform-Load) service.
  • Native integration with S3, DynamoDB, RDS, EMR, EC2 and Redshift.

Features

  • You can quickly and easily provision pipelines that remove the development and maintenance effort required to manage your daily data operations, letting you focus on generating insights from that data.
  • Data Pipeline provides built-in activities for common actions such as copying data between Amazon S3 and Amazon RDS, or running a query against Amazon S3 log data.
  • Data Pipeline supports JDBC, RDS and Redshift databases.

Components

  • A pipeline definition specifies the business logic of your data management.
  • A pipeline schedules and runs tasks by creating EC2 instances to perform the defined work activities.
  • Task Runner polls for tasks and then performs those tasks. For example, Task Runner could copy log files to S3 and launch EMR clusters. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by Data Pipeline.

Pipeline Definition

  • From your pipeline definition, Data Pipeline determines the tasks, schedules them, and assigns them to task runners.
  • If a task is not completed successfully, Data Pipeline retries the task according to your instructions and, if necessary, reassigns it to another task runner. If the task fails repeatedly, you can configure the pipeline to notify you.
  • A pipeline definition can contain the following types of components (a boto3 sketch follows this list):
    • Data Nodes – The location of input data for a task or the location where output data is to be stored.
    • Activities – A definition of work to perform on a schedule using a computational resource and typically input and output data nodes.
    • Preconditions – A conditional statement that must be true before an action can run. There are two types of preconditions:
      • System-managed preconditions are run by the Data Pipeline web service on your behalf and do not require a computational resource.
      • User-managed preconditions only run on the computational resource that you specify using the runsOn or workerGroup fields. The workerGroup resource is derived from the activity that uses the precondition.
    • Scheduling Pipelines – Defines the timing of a scheduled event, such as when an activity runs. There are three types of items associated with a scheduled pipeline:
      • Pipeline Components – Specify the data sources, activities, schedule, and preconditions of the workflow.
      • Instances – Data Pipeline compiles the running pipeline components to create a set of actionable instances. Each instance contains all the information for performing a specific task.
      • Attempts – To provide robust data management, Data Pipeline retries a failed operation. It continues to do so until the task reaches the maximum number of allowed retry attempts.
    • Resources – The computational resource that performs the work that a pipeline defines.
    • Actions – An action that is triggered when specified conditions are met, such as the failure of an activity.
    • Schedules – Define when your pipeline activities run and the frequency with which the service expects your data to be available. All schedules must have a start date and a frequency.
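
A minimal boto3 sketch of defining and activating a pipeline that copies files between two S3 locations on demand; every name, role, and path is hypothetical, and the objects use the id/name/fields structure the Data Pipeline API expects.

```python
import boto3

dp = boto3.client("datapipeline")

pipeline_id = dp.create_pipeline(name="daily-s3-copy",
                                 uniqueId="daily-s3-copy-001")["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default",
         "fields": [{"key": "scheduleType", "stringValue": "ondemand"},
                    {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                    {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"}]},
        {"id": "SourceNode", "name": "SourceNode",
         "fields": [{"key": "type", "stringValue": "S3DataNode"},
                    {"key": "directoryPath", "stringValue": "s3://example-bucket/input/"}]},
        {"id": "DestNode", "name": "DestNode",
         "fields": [{"key": "type", "stringValue": "S3DataNode"},
                    {"key": "directoryPath", "stringValue": "s3://example-bucket/output/"}]},
        {"id": "Ec2Instance", "name": "Ec2Instance",
         "fields": [{"key": "type", "stringValue": "Ec2Resource"},
                    {"key": "instanceType", "stringValue": "t2.micro"}]},
        {"id": "CopyJob", "name": "CopyJob",
         "fields": [{"key": "type", "stringValue": "CopyActivity"},
                    {"key": "input", "refValue": "SourceNode"},
                    {"key": "output", "refValue": "DestNode"},
                    {"key": "runsOn", "refValue": "Ec2Instance"}]},
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```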

Task Runners

  • When Task Runner is installed and configured, it polls Data Pipeline for tasks associated with pipelines that you have activated.
  • When a task is assigned to Task Runner, it performs that task and reports its status back to Data Pipeline.

AWS Data Pipeline vs Amazon Simple Workflow Service (SWF)

  • Both services provide execution tracking, handling retries and exceptions, and running arbitrary actions.
  • AWS Data Pipeline is specifically designed to facilitate the specific steps that are common across a majority of data-driven workflows.

Pricing

  • You are billed based on how often your activities and preconditions are scheduled to run and where they run (AWS or on-premises).

Amazon QuickSight

Amazon QuickSight is a cloud-powered business analytics service that makes it easy to build visualizations, perform ad hoc analysis, and quickly get business insights from your data, anytime, on any device.

Features

    • Provides ML Insights for discovering hidden trends and outliers, identifying key business drivers, and performing powerful what-if analysis and forecasting.
    • Has a wide library of visualizations, charts, and tables. You can add interactive features like drill-downs and filters, and perform automatic data refreshes to build interactive dashboards.
    • Allows you to schedule automatic email-based reports, so you can get key insights delivered to your inbox.
    • QuickSight allows users to connect to data sources, create/edit datasets, create visual analyses, invite co-workers to collaborate on analyses, and publish dashboards and reports.
    • Has a super-fast, parallel, in-memory, calculation engine (SPICE), allowing you to achieve blazing fast performance at scale.
    • Allows you to directly connect to and import data from a wide variety of cloud and on-premises data sources.

SPICE

    • Uses a combination of columnar storage and in-memory technologies (a boto3 sketch of refreshing a SPICE dataset follows this list).
    • Data in SPICE is persisted until it is explicitly deleted by the user.
    • SPICE also automatically replicates data for high availability and enables QuickSight to scale easily.
    • The SPICE engine supports data sets up to 250M rows and 500GB.
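
A minimal boto3 sketch of refreshing the SPICE copy of a dataset by starting an ingestion; the account ID, dataset ID, and ingestion ID are hypothetical.

```python
import boto3

quicksight = boto3.client("quicksight")

# Kick off a SPICE refresh for an existing dataset (all IDs are hypothetical).
quicksight.create_ingestion(
    AwsAccountId="111122223333",
    DataSetId="sales-dataset",
    IngestionId="manual-refresh-001",
)
```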

Concepts

  • QuickSight Author is a user who can connect to data sources (within AWS or outside), create interactive dashboards using advanced QuickSight capabilities, and publish dashboards with other users in the account.
  • QuickSight Reader is a user who uses interactive dashboards. Readers can log in via QuickSight username/password, SAML portal or AD auth, view shared dashboards, filter data, drill down to details or export data as a CSV file.
    • Readers can be easily upgraded to authors via the QuickSight user management options.
    • Readers with pay-per-session pricing only exist in Enterprise Edition. Standard Edition accounts can be easily upgraded to Enterprise.
  • QuickSight Admin is a user who can manage QuickSight users and account-level preferences, as well as purchase SPICE capacity and annual subscriptions for the account.
    • Admins have all QuickSight authoring capabilities.
    • Admins can also upgrade Standard Edition accounts to Enterprise Edition.
    • QuickSight Authors and Readers can be upgraded to Admins at any time.
  • QuickSight Reader session has a 30-minute duration and is renewed at 30-minute intervals. The session starts with a user-initiated action (login, dashboard load, page refresh, drill-down or filtering).
  • Dashboards are a collection of visualizations, tables, and other visual displays arranged and visible together.
  • Stories are guided tours through specific views of an analysis. They are used to convey key points, a thought process, or the evolution of an analysis for collaboration.
  • Data Management
    • Data preparation is the process of transforming raw data for use in an analysis.
    • You can upload XLSX, CSV, TSV, CLF, and ELF data files directly from the Amazon QuickSight website, or upload them to an Amazon S3 bucket and point QuickSight to the bucket.
    • You can also connect Amazon QuickSight to an Amazon EC2 or on-premises database.
  • Data Visualization and Analysis
    • A visual, also known as a data visualization, is a graphical representation of a data set using a type of diagram, chart, graph, or table. All visuals begin in AutoGraph mode, which automatically selects a visualization based on the fields you select.
    • A data analysis is the basic workspace for creating and interacting with visuals, which are graphical representations of your data. Each analysis contains a collection of visuals that you assemble and arrange for your purposes.
    • To create a visualization, start by selecting the data fields you want to analyze, drag fields directly onto the visual canvas, or do both. Amazon QuickSight automatically selects an appropriate visualization to display based on the data you’ve selected.
    • Amazon QuickSight has a feature called AutoGraph that allows it to select the most appropriate visualization based on the properties of the data, such as cardinality and data type.
    • You can perform typical arithmetic and comparison functions; conditional functions such as if/then; and date, numeric, and string calculations.
  • Machine Learning Insights
    • Using machine learning and natural language capabilities, Amazon QuickSight Enterprise Edition provides forecasting and decision-making insights.
    • You can select from a list of customized context-sensitive narratives, called auto-narratives, and add them to your analysis. In addition to choosing auto-narratives, you can choose to view forecasts, anomalies, and the factors contributing to them.
    • Major features
      • ML-powered anomaly detection – continuously analyzes all your data to detect anomalies.
      • ML-powered forecasting – forecasts key business metrics.
      • Auto-narratives – builds rich dashboards with embedded narratives to tell the story of your data in plain language.
  • Security
    • Offers role-based access control, Active Directory integration, CloudTrail auditing, single sign-on, private VPC subnets, and data backup.
    • FedRAMP, HIPAA, PCI DSS, ISO, and SOC compliant.
    • Row-level security enables QuickSight dataset owners to control access to data at row granularity based on permissions associated with the user interacting with the data.
  • Pricing
    • QuickSight has a pay-per-session model for dashboard readers, users who consume dashboards others have created.

Amazon Kinesis

 

  • Makes it easy to collect, process, and analyze real-time, streaming data.
  • Kinesis can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications.

Kinesis Video Streams

  • A fully managed AWS service that you can use to stream live video from devices to the AWS Cloud, or build applications for real-time video processing or batch-oriented video analytics.

  • Benefits
    • You can connect and stream from millions of devices.
    • You can configure your Kinesis video stream to durably store media data for custom retention periods. Kinesis Video Streams also generates an index over the stored data based on producer-generated or service-side timestamps.
    • Kinesis Video Streams is serverless, so there is no infrastructure to set up or manage.
    • You can build real-time and batch applications on data streams.
    • Kinesis Video Streams enforces Transport Layer Security (TLS)-based encryption on data streaming from devices, and encrypts all data at rest using AWS KMS.
  • Components
    • Producer – Any source that puts data into a Kinesis video stream.
    • Kinesis video stream – A resource that enables you to transport live video data, optionally store it, and make the data available for consumption both in real time and on a batch or ad hoc basis.
      • Time-encoded data is any data in which the records are in a time series, and each record is related to its previous and next records.
      • A fragment is a self-contained sequence of frames. The frames belonging to a fragment should have no dependency on any frames from other fragments.
      • Upon receiving the data from a producer, Kinesis Video Streams stores incoming media data as chunks. Each chunk consists of the actual media fragment, a copy of media metadata sent by the producer, and the Kinesis Video Streams-specific metadata such as the fragment number, and server-side and producer-side timestamps.
    • Consumer – Gets data, such as fragments and frames, from a Kinesis video stream to view, process, or analyze it. Generally these consumers are called Kinesis Video Streams applications.
  • Kinesis Video Streams provides
    • APIs for you to create and manage streams and read or write media data to and from a stream.
    • A console that supports live and video-on-demand playback.
    • A set of producer libraries that you can use in your application code to extract data from your media sources and upload to your Kinesis video stream.
  • Video Playbacks
    • You can view a Kinesis video stream using either
      • HTTP Live Streaming (HLS) – You can use HLS for live playback (a boto3 sketch follows this list).
      • GetMedia API – You use the GetMedia API to build your own applications to process Kinesis video streams. GetMedia is a real-time API with low latency.
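
A minimal boto3 sketch of requesting an HLS playback URL for a stream; the stream name is hypothetical, and the stream must contain H.264 video for HLS playback.

```python
import boto3

kvs = boto3.client("kinesisvideo")

# HLS URLs are issued by the archived-media API, which sits behind a per-stream endpoint.
endpoint = kvs.get_data_endpoint(
    StreamName="front-door-camera",
    APIName="GET_HLS_STREAMING_SESSION_URL",
)["DataEndpoint"]

media = boto3.client("kinesis-video-archived-media", endpoint_url=endpoint)

url = media.get_hls_streaming_session_url(
    StreamName="front-door-camera",
    PlaybackMode="LIVE",
)["HLSStreamingSessionURL"]

print(url)  # open in any HLS-capable player
```
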
  • Metadata
    • Metadata is a mutable key-value pair. You can use it to describe the content of the fragment, embed associated sensor readings that need to be transferred along with the actual fragment, or meet other custom needs.
    • There are two modes in which the metadata can be embedded with fragments in a stream:
      • Nonpersistent: You can affix metadata on an ad hoc basis to fragments in a stream, based on business-specific criteria that have occurred.
      • Persistent: You can affix metadata to successive, consecutive fragments in a stream based on a continuing need.
  • Pricing
    • You pay only for the volume of data you ingest, store, and consume through the service.

 

Kinesis Data Stream

  • A massively scalable, highly durable data ingestion and processing service optimized for streaming data. You can configure hundreds of thousands of data producers to continuously put data into a Kinesis data stream.

Concepts

    • Data Producer – An application that typically emits data records as they are generated to a Kinesis data stream. Data producers assign partition keys to records. Partition keys ultimately determine which shard ingests the data record for a data stream.
    • Data Consumer – A distributed Kinesis application or AWS service retrieving data from all shards in a stream as it is generated. Most data consumers are retrieving the most recent data in a shard, enabling real-time analytics or handling of data.
    • Data Stream – A logical grouping of shards. There are no bounds on the number of shards within a data stream. A data stream will retain data for 24 hours, or up to 7 days when extended retention is enabled.
    • Shard – The base throughput unit of a Kinesis data stream.
      • A shard is an append-only log and a unit of streaming capability. A shard contains a sequence of records ordered by arrival time.
      • Add or remove shards from your stream dynamically as your data throughput changes.
      • One shard can ingest up to 1000 data records per second, or 1MB/sec. Add more shards to increase your ingestion capability.
      • When consumers use enhanced fan-out, one shard provides 1MB/sec data input and 2MB/sec data output for each data consumer registered to use enhanced fan-out.
      • When consumers do not use enhanced fan-out, a shard provides 1MB/sec of input and 2MB/sec of data output, and this output is shared with any consumer not using enhanced fan-out.
      • You will specify the number of shards needed when you create a stream and can change the quantity at any time.
    • Data Record
      • A record is the unit of data stored in a Kinesis stream. A record is composed of a sequence number, partition key, and data blob.
      • A data blob is the data of interest your data producer adds to a stream. The maximum size of a data blob is 1 MB.
    • Partition Key
      • A partition key is typically a meaningful identifier, such as a user ID or timestamp. It is specified by your data producer while putting data into a Kinesis data stream, and useful for consumers as they can use the partition key to replay or build a history associated with the partition key.
      • The partition key is also used to segregate and route data records to different shards of a stream.
    • Sequence Number
      • A sequence number is a unique identifier for each data record. It is assigned by Kinesis Data Streams when a data producer calls the PutRecord or PutRecords API to add data to a stream (a boto3 sketch follows this list).
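
A minimal boto3 sketch of a producer putting one record into a hypothetical stream; the partition key routes the record to a shard, and the response carries the shard ID and the assigned sequence number.

```python
import boto3

kinesis = boto3.client("kinesis")

response = kinesis.put_record(
    StreamName="clickstream",                              # hypothetical stream
    Data=b'{"user_id": "u-123", "action": "page_view"}',   # data blob (max 1 MB)
    PartitionKey="u-123",                                  # determines the shard
)
print(response["ShardId"], response["SequenceNumber"])
```
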
  • Amazon Kinesis Agent is a pre-built Java application that offers an easy way to collect and send data to your Amazon Kinesis data stream.
  • Monitoring

    • You can monitor shard-level metrics in Kinesis Data Streams.
    • You can monitor your data streams in Amazon Kinesis Data Streams using CloudWatch, the Kinesis Agent, and the Kinesis libraries.
    • Log API calls with CloudTrail.
  • Security

    • Kinesis Data Streams can automatically encrypt sensitive data as a producer enters it into a stream. Kinesis Data Streams uses AWS KMS master keys for encryption.
    • Use IAM for managing access controls.
    • You can use an interface VPC endpoint to keep traffic between your Amazon VPC and Kinesis Data Streams from leaving the Amazon network.
  • Pricing

    • You are charged for each shard at an hourly rate.
    • PUT Payload Units are charged at a rate per million PUT Payload Units.
    • When consumers use enhanced fan-out, they incur hourly charges per consumer-shard hour and per GB of data retrieved.
    • You are charged for an additional rate on each shard hour incurred by your data stream once you enable extended data retention.
  • Limits

    • There is no upper limit on the number of shards you can have in a stream or account.
    • There is no upper limit on the number of streams you can have in an account.
    • A single shard can ingest up to 1 MiB of data per second (including partition keys) or 1,000 records per second for writes.
    • The default shard limit is 500 shards for the following AWS Regions: US East (N. Virginia), US West (Oregon), and EU (Ireland). For all other Regions, the default shard limit is 200 shards.
    • Each shard can support up to five read transactions per second.

 

Kinesis Data Firehose

  • The easiest way to load streaming data into data stores and analytics tools.
  • It is a fully managed service that automatically scales to match the throughput of your data.
  • It can also batch, compress, and encrypt the data before loading it.

  • Features

    • It can capture, transform, and load streaming data into S3, Redshift, Elasticsearch Service, generic HTTP endpoints, and service providers like Datadog, New Relic, MongoDB, and Splunk, enabling near real-time analytics with existing business intelligence tools and dashboards being used today.
    • You can specify a batch size or batch interval to control how quickly data is uploaded to destinations. Additionally, you can specify if data should be compressed.
    • Once launched, your delivery streams automatically scale up and down to handle gigabytes per second or more of input data rate, and maintain data latency at levels you specify for the stream.
    • Kinesis Data Firehose can convert the format of incoming data from JSON to Parquet or ORC formats before storing the data in S3.
    • You can configure Kinesis Data Firehose to prepare your streaming data before it is loaded to data stores. Kinesis Data Firehose provides pre-built Lambda blueprints for converting common data sources such as Apache logs and system logs to JSON and CSV formats. You can use these pre-built blueprints without any change, or customize them further, or write your own custom functions.
  • Concepts

    • Kinesis Data Firehose Delivery Stream – The underlying entity of Kinesis Data Firehose. You use Kinesis Data Firehose by creating a Kinesis Data Firehose delivery stream and then sending data to it.
    • Record – The data of interest that your data producer sends to a Kinesis Data Firehose delivery stream. A record can be as large as 1,000 KB.
    • Data Producer – Producers send records to Kinesis Data Firehose delivery streams (a boto3 sketch follows this list).
    • Buffer Size and Buffer Interval – Kinesis Data Firehose buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations. Buffer Size is in MBs and Buffer Interval is in seconds.
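
A minimal boto3 sketch of a producer sending one record to a hypothetical delivery stream; Kinesis Data Firehose buffers it by size or interval before delivering it to the configured destination.

```python
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",  # hypothetical delivery stream
    Record={"Data": b'{"user_id": "u-123", "action": "page_view"}\n'},
)
```
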
  • Stream Sources

    • You can send data to your Kinesis Data Firehose Delivery stream using different types of sources:
      • a Kinesis data stream,
      • the Kinesis Agent,
      • or the Kinesis Data Firehose API using the AWS SDK.
    • You can also use CloudWatch Logs, CloudWatch Events, AWS IoT, or Amazon SNS as your data source.
    • Some AWS services can only send messages and events to a Kinesis Data Firehose delivery stream that is in the same Region.
  • Data Delivery and Transformation

    • Kinesis Data Firehose can invoke your Lambda function to transform incoming source data and deliver the transformed data to destinations (a sketch of such a function follows this block).
    • Kinesis Data Firehose buffers incoming data up to 3 MB by default.
    • If your Lambda function invocation fails because of a network timeout or because you’ve reached the Lambda invocation limit, Kinesis Data Firehose retries the invocation three times by default.
    • Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in S3. Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON.
    • Data delivery format:
      • For data delivery to S3, Kinesis Data Firehose concatenates multiple incoming records based on buffering configuration of your delivery stream. It then delivers the records to S3 as an S3 object.
      • For data delivery to Redshift, Kinesis Data Firehose first delivers incoming data to your S3 bucket in the format described earlier. Kinesis Data Firehose then issues a Redshift COPY command to load the data from your S3 bucket to your Redshift cluster.
      • For data delivery to Elasticsearch, Kinesis Data Firehose buffers incoming records based on the buffering configuration of your delivery stream. It then generates an Elasticsearch bulk request to index multiple records to your Elasticsearch cluster.
      • For data delivery to Splunk, Kinesis Data Firehose concatenates the bytes that you send.
    • Data delivery frequency
      • The frequency of data delivery to S3 is determined by the S3 Buffer size and Buffer interval value that you configured for your delivery stream.
      • The frequency of data COPY operations from S3 to Redshift is determined by how fast your Redshift cluster can finish the COPY command.
      • The frequency of data delivery to Elasticsearch is determined by the Elasticsearch Buffer size and Buffer interval values that you configured for your delivery stream.
      • Kinesis Data Firehose buffers incoming data before delivering it to Splunk. The buffer size is 5 MB, and the buffer interval is 60 seconds.
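
A minimal sketch of a transformation Lambda function of the kind the pre-built blueprints generate; the field it rewrites is hypothetical, but the recordId/result/data response shape is what Kinesis Data Firehose expects back.

```python
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Firehose delivers each record base64-encoded.
        payload = json.loads(base64.b64decode(record["data"]))
        payload["action"] = payload.get("action", "").upper()   # hypothetical transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                                      # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode((json.dumps(payload) + "\n").encode()).decode(),
        })
    return {"records": output}
```
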
  • Monitoring

    • Kinesis Data Firehose exposes several metrics through the console, as well as CloudWatch for monitoring.
    • Kinesis Agent publishes custom CloudWatch metrics, and helps assess whether the agent is healthy, submitting data into Kinesis Data Firehose as specified, and consuming the appropriate amount of CPU and memory resources on the data producer.
    • Log API calls with CloudTrail.
  • Security

    • Kinesis Data Firehose provides you the option to have your data automatically encrypted after it is uploaded to the destination.
    • Manage resource access with IAM.
  • Pricing

    • You pay only for the volume of data you transmit through the service. You are billed for the volume of data ingested into Kinesis Data Firehose, and if applicable, for data format conversion to Apache Parquet or ORC.
  • Limits

    • By default, each account can have up to 50 Kinesis Data Firehose delivery streams per Region.
    • The maximum size of a record sent to Kinesis Data Firehose, before base64-encoding, is 1,000 KiB.

 

Kinesis Data Analytics

  • Analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time. You can quickly build SQL queries and Java applications using built-in templates and operators for common processing functions to organize, transform, aggregate, and analyze data at any scale.

  • General Features

    • Kinesis Data Analytics is serverless and takes care of everything required to continuously run your application.
    • Kinesis Data Analytics elastically scales applications to keep up with any volume of data in the incoming data stream.
    • Kinesis Data Analytics delivers sub-second processing latencies so you can generate real-time alerts, dashboards, and actionable insights.
  • SQL Features

    • Kinesis Data Analytics supports standard ANSI SQL.
    • Kinesis Data Analytics integrates with Kinesis Data Streams and Kinesis Data Firehose so that you can readily ingest streaming data.
    • SQL applications in Kinesis Data Analytics support two types of inputs:
      • A streaming data source is continuously generated data that is read into your application for processing.
      • A reference data source is static data that your application uses to enrich data coming in from streaming sources.
    • Kinesis Data Analytics provides an easy-to-use schema editor to discover and edit the structure of the input data. The wizard automatically recognizes standard data formats such as JSON and CSV.
    • Kinesis Data Analytics offers functions optimized for stream processing so that you can easily perform advanced analytics such as anomaly detection and top-K analysis on your streaming data (a boto3 sketch of creating a SQL application follows this list).
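
A minimal boto3 sketch of creating a SQL application whose application code is a tumbling-window aggregate pump; the application name, columns, and window length are hypothetical, and the input stream mapping can be attached afterwards with add_application_input (or in the console).

```python
import boto3

kda = boto3.client("kinesisanalytics")  # SQL (v1) API

# Typical pump: aggregate the source stream into one-minute tumbling windows.
sql_code = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("ticker" VARCHAR(4), "avg_price" DOUBLE);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM "ticker", AVG("price")
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY "ticker", STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
"""

kda.create_application(
    ApplicationName="clickstream-aggregator",  # hypothetical name
    ApplicationCode=sql_code,
)
```
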
  • Java Features

    • Kinesis Data Analytics includes open source libraries based on Apache Flink. Apache Flink is an open source framework and engine for building highly available and accurate streaming applications.
    • Kinesis Data Analytics for Apache Flink supports streaming applications built using Apache Beam Java SDK.
    • You can use the Kinesis Data Analytics Java libraries to integrate with multiple AWS services.
    • You can create and delete durable application backups through a simple API call. You can immediately restore your applications from the latest backup after a disruption, or you can restore your application to an earlier version.
    • Java applications in Kinesis Data Analytics enable you to build applications whose processed records affect the results exactly once, referred to as exactly once processing.
    • The service stores previous and in-progress computations, or state, in running application storage. State is always encrypted and incrementally saved in running application storage.
  • An application is the primary resource in Kinesis Data Analytics. Kinesis Data Analytics applications continuously read and process streaming data in real time.
    • You write application code using SQL to process the incoming streaming data and produce output. Then, Kinesis Data Analytics writes the output to a configured destination.
    • You can also process and analyze streaming data using Java.
  • Components

    • Input is the streaming source for your application. In the input configuration, you map the streaming source to one or more in-application data streams.
    • Application code is a series of SQL statements that process input and produce output.
    • You can create one or more in-application streams to store the output. You can then optionally configure an application output to persist data from specific in-application streams to an external destination.
  • An in-application data stream is an entity that continuously stores data in your application for you to perform processing.
  • Kinesis Data Analytics provisions capacity in the form of Kinesis Processing Units (KPU). A single KPU provides 4 GB of memory and corresponding compute and networking capacity.
  • For Java applications using Apache Flink, you build your application locally, and then make your application code available to the service by uploading it to an S3 bucket.
  • Kinesis Data Analytics for Java applications provides your application 50 GB of running application storage per Kinesis Processing Unit. Kinesis Data Analytics scales storage with your application.
  • Running application storage is used for saving application state using checkpoints. It is also accessible to your application code to use as temporary disk for caching data or any other purpose.
  • Pricing

    • You are charged an hourly rate based on the average number of Kinesis Processing Units (or KPUs) used to run your stream processing application.
    • For Java applications, you are charged a single additional KPU per application for application orchestration. Java applications are also charged for running application storage and durable application backups.