Thursday, 7 July 2022

AWS Redshift : Theory

What is Redshift?

  • Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud.
  • Customers can use Redshift for just $0.25 per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year.
  • A typical analytic workload aggregates records such as:
    • Sum of radios sold in EMEA
    • Sum of radios sold in Pacific
    • Unit cost of radio in each region
    • Sales price of each radio
    • Sales price minus unit cost
Redshift Configuration

  • Single node
  • Multi-node
    • Leader node
      It manages client connections and receives queries. A leader node receives queries from client applications, parses them, and develops execution plans. It coordinates the parallel execution of these plans across the compute nodes, combines the intermediate results from all the nodes, and then returns the final result to the client application.
    • Compute node
      A compute node executes the execution plans; intermediate results are sent to the leader node for aggregation before being returned to the client application. A cluster can have up to 128 compute nodes.

OLAP

OLAP (Online Analytical Processing) is the type of processing system Redshift uses.

OLAP transaction Example:

Suppose we want to calculate the net profit for EMEA and Pacific for the Digital Radio product. This requires pulling a large number of records: the sum of radios sold in each region, the unit cost and sales price of each radio, and the margin (sales price minus unit cost).

Complex queries are required to fetch the records given above. Data warehousing databases use a different type of architecture, both from a database perspective and at the infrastructure layer.
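The net-profit aggregation the example describes can be sketched in plain Python (the region records and figures below are illustrative, not from any real dataset):

```python
# Illustrative records: one row per radio sale (hypothetical figures).
sales = [
    {"region": "EMEA", "units": 120, "sale_price": 50.0, "unit_cost": 30.0},
    {"region": "EMEA", "units": 80, "sale_price": 55.0, "unit_cost": 30.0},
    {"region": "Pacific", "units": 200, "sale_price": 45.0, "unit_cost": 28.0},
]

def net_profit(rows, region):
    """Net profit = (sales price - unit cost) * units, summed over one region."""
    return sum(
        (r["sale_price"] - r["unit_cost"]) * r["units"]
        for r in rows
        if r["region"] == region
    )

print(net_profit(sales, "EMEA"))     # 4400.0
print(net_profit(sales, "Pacific"))  # 3400.0
```

In a real warehouse the same aggregation would run as a SQL query over billions of rows, which is why the storage and execution architecture matters.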

Redshift consists of two types of nodes:

Single node: A single node stores up to 160 GB.

Multi-node: A multi-node configuration consists of more than one node, of two types: a leader node and compute nodes.

Let's understand the concept of leader node and compute nodes through an example.

Redshift warehouse is a collection of computing resources known as nodes, and these nodes are organized in a group known as a cluster. Each cluster runs in a Redshift Engine which contains one or more databases.

When you launch a Redshift instance, it starts with a single node of size 160 GB. When you want to grow, you can add additional nodes to take advantage of parallel processing. A leader node manages the multiple nodes: it handles client connections and coordinates the compute nodes, which store the data and execute the queries.
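The leader/compute split can be sketched as a tiny fan-out/fan-in in Python. This is a toy model of the idea, not Redshift's actual execution engine; the data slices are made up:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data slices, as if distributed across three compute nodes.
node_slices = [
    [3, 1, 4],  # compute node 1
    [1, 5, 9],  # compute node 2
    [2, 6, 5],  # compute node 3
]

def compute_node_sum(data_slice):
    # Each compute node runs its part of the plan and returns a partial result.
    return sum(data_slice)

def leader_node_query(slices):
    # The leader node fans the plan out to the compute nodes in parallel,
    # then combines the intermediate results into the final answer.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(compute_node_sum, slices))
    return sum(partials)

print(leader_node_query(node_slices))  # 36
```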

Why Redshift is 10 times faster

Redshift is 10 times faster because of the following reasons:

  • Columnar Data Storage
    Instead of storing data as a series of rows, Amazon Redshift organizes the data by column. Row-based systems are ideal for transaction processing, while column-based systems are ideal for data warehousing and analytics, where queries often involve aggregates performed over large data sets. Since only the columns involved in the queries are processed and columnar data is stored sequentially on the storage media, column-based systems require fewer I/Os, thus improving query performance.
  • Advanced Compression
    Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores.
    Amazon Redshift does not require indexes or materialized views, so it uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift samples your data automatically and selects the most appropriate compression technique.
  • Massively Parallel Processing
    Amazon Redshift automatically distributes the data and query load across all nodes. Amazon Redshift makes it easy to add new nodes to your data warehouse, allowing you to maintain fast query performance as your data warehouse grows.
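The compression advantage of columnar layout can be illustrated with a toy run-length encoder in Python: values within one column are similar, so they collapse far better than interleaved row data. This is a rough sketch of the principle, not Redshift's actual encoder:

```python
from itertools import groupby

# Row-oriented: each record keeps all its columns together.
rows = [("radio", "EMEA"), ("radio", "EMEA"), ("radio", "Pacific"),
        ("radio", "Pacific"), ("radio", "Pacific")]

# Column-oriented: each column is stored contiguously.
product_col = [r[0] for r in rows]
region_col = [r[1] for r in rows]

def run_length_encode(column):
    # Adjacent identical values collapse into (value, count) pairs.
    return [(value, len(list(group))) for value, group in groupby(column)]

print(run_length_encode(product_col))  # [('radio', 5)]
print(run_length_encode(region_col))   # [('EMEA', 2), ('Pacific', 3)]
```

Five rows of the product column compress to a single (value, count) pair; the same values scattered through row storage would not.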

Redshift features

Features of Redshift are given below:

  • Easy to set up, deploy, and manage
    • Automated Provisioning
      Redshift is simple to set up and operate. You can deploy a new data warehouse with just a few clicks in the AWS Console, and Redshift automatically provisions the infrastructure for you. Administrative tasks such as backups and replication are automated, so you can focus on your data, not on administration.
    • Automated backups
      Redshift automatically backs up your data to S3. You can also replicate snapshots to S3 in another region for disaster recovery.
  • Cost-effective
    • No upfront costs, pay as you go
      Amazon Redshift is the most cost-effective data warehouse service as you need to pay only for what you use.
      Its costs start at $0.25 per hour with no commitment and no upfront costs and can scale to $250 per terabyte per year.
      Amazon Redshift is the only data warehouse service that offers On-Demand pricing with no up-front costs, and it also offers Reserved Instance pricing that saves up to 75% over a 1-3 year term.
    • Choose your node type.
      You can choose either of two node types to optimize Redshift.
      • Dense compute node
        Dense compute nodes create a high-performance data warehouse by using fast CPUs, large amounts of RAM, and solid-state disks.
      • Dense storage node
        If you want to reduce cost, you can use dense storage nodes. They create a cost-effective data warehouse by using larger hard disk drives.
  • Scale quickly to meet your needs.
    • Petabyte-scale data warehousing
      Amazon Redshift scales the number of nodes up or down as your needs change. With just a few clicks in the AWS Console or a single API call, you can easily change the number of nodes in a data warehouse.
    • Exabyte-scale data lake analytics
      This feature of Redshift allows you to run queries against exabytes of data in Amazon S3. Amazon S3 is a secure and cost-effective way to store unlimited data in an open format.
    • Limitless concurrency
      This feature of Redshift means that multiple queries can access the same data in Amazon S3. It allows you to run queries across multiple nodes regardless of the complexity of a query or the amount of data.
  • Query your data lake
    Amazon Redshift is the only data warehouse that can query the Amazon S3 data lake without loading the data. This provides flexibility: store frequently accessed data in Redshift and unstructured or infrequently accessed data in Amazon S3.
  • Secure
    With a couple of parameter settings, you can set Redshift to use SSL to secure your data in transit. You can also enable encryption so that all data written to disk is encrypted.
  • Faster performance
    Amazon Redshift provides columnar data storage, compression, and parallel processing to reduce the amount of I/O needed to perform queries. This improves query performance.

AWS QLDB : Theory

Amazon QLDB :

Amazon QLDB is a fully managed ledger database. It offers the key features of a blockchain ledger, including immutability, transparency, and a cryptographically verifiable transaction log. However, QLDB is owned by a central trusted authority, so in a sense it has almost all the features of a distributed ledger technology with a centralized approach.

Also, you can’t directly compare Amazon QLDB with blockchain, as the two have some fundamental differences. QLDB was launched alongside Amazon Managed Blockchain.

Amazon QLDB Use-Cases :

In this section, we take a look at the Amazon QLDB use cases, which give a complete picture of what QLDB has to offer.

Manufacturing :

Manufacturing companies can take full advantage of what Amazon QLDB has to offer. In manufacturing, it is important for companies to ensure that their recorded data matches the actual supply chain. With QLDB, they can record every transaction and its history. As we’re already seeing blockchain in manufacturing, QLDB will only make things more efficient.

This means that each of their individual batches will be properly documented. In the end, they will be equipped with the knowledge of tracing the parts if something goes wrong during the distribution life cycle of a product.

QLDB Customers and Partners :

At the time of writing, QLDB has strong partners and customers, including the following:

  • Digital Asset
  • Accenture
  • Asano
  • Realm
  • Wipro
  • Zillant
  • Splunk
  • Klarna.
How it Works :


Common Use Cases

  • Finance
    • Banks can use Amazon QLDB to easily store an accurate and complete record of all financial transactions, instead of building a custom ledger with complex auditing functionality.
  • Insurance
    • Insurance companies can use Amazon QLDB to track the entire history of claim transactions. Whenever a conflict arises, Amazon QLDB can cryptographically verify the integrity of the claims data.

Components Of QLDB :

  • Ledger :
    • Consists of tables and a journal that keep the complete, immutable history of changes in the tables.
  • Tables :
    • Contains a collection of document revisions.
  • Journal :
    • An immutable transaction log where transactions are appended as a sequence of blocks that are cryptographically chained together, providing secure verification and an immutable history of changes to your ledger data.
    • The history of changes cannot be altered; the current data itself can still be updated.
  • Current State
    • The current state is similar to a traditional database where you can view and query the latest data.
  • History :
    • The history is a table where you can view and query the history of all the data and every change ever made to the data.
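The hash-chained journal can be sketched with Python's standard hashlib. This is a simplified model of the idea, not QLDB's actual block format:

```python
import hashlib
import json

def block_hash(previous_hash, transaction):
    # Each block's hash covers the previous block's hash, chaining them together.
    payload = json.dumps({"prev": previous_hash, "tx": transaction}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def build_journal(transactions):
    journal, prev = [], "0" * 64  # genesis: all-zero previous hash
    for tx in transactions:
        h = block_hash(prev, tx)
        journal.append({"prev": prev, "tx": tx, "hash": h})
        prev = h
    return journal

def verify(journal):
    # Recompute every hash; any altered transaction breaks the chain.
    prev = "0" * 64
    for block in journal:
        if block["prev"] != prev or block_hash(prev, block["tx"]) != block["hash"]:
            return False
        prev = block["hash"]
    return True

journal = build_journal(["credit 100", "debit 30"])
print(verify(journal))           # True
journal[0]["tx"] = "credit 999"  # tamper with history
print(verify(journal))           # False
```

This is why the history of changes cannot be silently altered: rewriting any past transaction invalidates every later block's hash.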

Performance :

  • Amazon QLDB can execute 2-3x as many transactions as ledgers in common blockchain frameworks.

Scalability :

  • Amazon QLDB automatically scales based on the workloads of your application.

Reliability :

  • Multiple copies of a QLDB ledger are replicated across Availability Zones in a Region. You can continue to operate QLDB even if a zone fails.
  • Ensures redundancy within a region.
  • Also ensures full recovery when an availability zone goes down.

Backup and Restore :

  • You can export the contents of your QLDB journals to S3 as a backup plan.

Security :

  • Amazon QLDB uses the SHA-256 hash function to build a secure representation of your data’s change history, called a digest. The digest serves as proof of your data’s change history, enabling you to go back to a point in time and verify the validity and integrity of your data changes.
  • All data in transit and at rest are encrypted by default.
  • Uses AWS-owned keys for encryption of data.
  • The authentication is done by attaching a signature to the HTTP requests. The signature is then verified using the AWS credentials.
  • Integrated with AWS PrivateLink.

Pricing :

  • You are billed based on five categories
    • Write I/Os
      • Pricing per 1 million requests
    • Read I/Os
      • Pricing per 1 million requests
    • Journal Storage Rate
      • Pricing per GB-month
    • Indexed Storage Rate
      • Pricing per GB-month
    • Data Transfer OUT From Amazon QLDB To Internet
      • You are charged based on the amount of data transferred per month. The rate varies for different regions.
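A toy monthly-cost estimator built from those five categories; every rate below is a placeholder for illustration, not an actual AWS price (check the AWS pricing page for real, region-specific figures):

```python
# Placeholder rates (NOT real AWS prices; illustrative only).
RATES = {
    "write_io_per_million": 0.70,   # per 1 million write requests
    "read_io_per_million": 0.14,    # per 1 million read requests
    "journal_gb_month": 0.03,       # journal storage, per GB-month
    "indexed_gb_month": 0.25,       # indexed storage, per GB-month
    "transfer_out_gb": 0.09,        # data transfer out to the internet, per GB
}

def monthly_cost(write_ios, read_ios, journal_gb, indexed_gb, transfer_gb):
    # Sum the five billing categories for one month.
    return (
        write_ios / 1_000_000 * RATES["write_io_per_million"]
        + read_ios / 1_000_000 * RATES["read_io_per_million"]
        + journal_gb * RATES["journal_gb_month"]
        + indexed_gb * RATES["indexed_gb_month"]
        + transfer_gb * RATES["transfer_out_gb"]
    )

print(round(monthly_cost(2_000_000, 10_000_000, 50, 20, 5), 2))  # 9.75
```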

Limitations :

  • Amazon QLDB does not support native backup and restore; however, you can export your data from QLDB to S3.
  • Does not support Point-in-time restore feature.
  • Does not support cross-region replication.
  • Does not support the use of customer managed CMKs (customer master keys).

AMAZON Neptune : Theory

 

  • Amazon Neptune is a fully managed graph database service used for building applications that work with highly connected datasets.
  • Optimized for storing billions of relationships between pieces of information.
  • Provides millisecond latency when querying the graph.
  • Neptune supports graph query languages like Apache TinkerPop Gremlin and W3C’s SPARQL.
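A minimal Python sketch of the kind of relationship query a graph database answers, here "who is reachable from a user within two hops"; the graph and names below are made up, and a real application would express this in Gremlin or SPARQL instead:

```python
from collections import deque

# Hypothetical social graph stored as an adjacency list.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}

def within_hops(graph, start, max_hops):
    """Return every user reachable from `start` within `max_hops` edges (BFS)."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        user, depth = queue.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph[user]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    seen.discard(start)
    return sorted(seen)

print(within_hops(follows, "alice", 2))  # ['bob', 'carol', 'dave', 'erin']
```

Relational databases answer such queries with repeated joins; a graph store keeps the relationships first-class, which is what makes billions of them queryable in milliseconds.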
How it works


Common Use Cases

  • Social Networking
    • Amazon Neptune can easily process users’ interactions such as comments, follows, and likes in a social network application through highly interactive queries.
  • Recommendation Engines
    • You can use Amazon Neptune to build applications that suggest personalized and relevant products based on relationships between information such as customers’ interests and purchase history.
  • Knowledge Graphs
    • With the help of Amazon Neptune, you can create a knowledge graph for search engines that will enable users to quickly discover new information. 
  • Identity Graphs
    • You can use Amazon Neptune as a graph database to easily link and update user profile data for ad-targeting, personalization, and analytics. 

Performance

  • Supports 15 read replicas and 100,000s of queries per second.
  • Amazon Neptune uses query optimization for both SPARQL queries and Gremlin traversals.

Reliability

  • Database volume is replicated six ways across three availability zones.
  • Amazon Neptune can withstand the loss of up to two copies of data without affecting write availability, and up to three copies without affecting read availability.
  • Amazon Neptune’s storage is self-healing. Data blocks are continuously scanned for errors and replaced automatically.
  • Amazon Neptune uses asynchronous replication to update the changes made to the primary instance to all of Neptune’s read replicas.
  • Replicas can act as a failover target with no data loss.
  • Supports automatic failover.
  • Supports promotion priority within a cluster. Amazon Neptune will promote the replica with the highest priority tier to primary when the primary instance fails.

 

Cluster Volume vs. Local Storage

  • Stored data type
    • Cluster volume: persistent data
    • Local storage: temporary data
  • Scalability
    • Cluster volume: automatically scales out when more space is required
    • Local storage: limited to the DB instance class

Backup And Restore

  • Automated backups are always enabled.
  • Supports Point-In-Time restoration, which can be up to 5 minutes in the past.
  • Supports sharing of encrypted manual snapshots.

Security

  • Amazon Neptune supports AWS Key Management Service ( KMS ) encryption at rest.
  • It also supports HTTPS connections. Neptune enforces a minimum of TLS v1.2 for SSL client connections in all AWS Regions where Neptune is available.
  • To encrypt an existing Neptune instance, you should create a new instance with encryption enabled and migrate your data into it.
  • You can create custom endpoints for Amazon Neptune to access your workload. Custom endpoints allow you to distribute your workload across a designated set of instances within a Neptune cluster.
  • Offers database deletion protection.

Pricing

  • You are billed based on the DB instance hours, I/O requests, storage, and Data transfer.
  • Storage and I/O are billed in per GB-month increments and per million request increments, respectively.

Monitoring

  • Visualize your graph using the Neptune Workbench.
  • You can receive event notifications on your Amazon Neptune DB clusters, DB instances, DB cluster snapshots, parameter groups, or security groups through Amazon SNS.

Limitations

  • It does not support cross-region replicas.
  • Encryption of an existing Neptune instance is not supported.
  • Sharing of automatic DB snapshots to other accounts is not allowed. A workaround for this is to manually copy the snapshot from the automatic snapshot, then, copy the manual snapshot to another account.


AWS ElastiCache : Theory

 

What is Elasticache?


Elasticache is a web service used to deploy, operate, and scale an in-memory cache in the cloud.

  • It improves the performance of web applications by allowing you to retrieve information from fast, managed in-memory cache instead of relying entirely on slower disk-based databases.
  • For example, if you are running an online business, customers continuously ask for information about particular products. Instead of the front end always querying the database for product information, you can cache the data using Elasticache.
  • It is used to improve latency and throughput for many read-heavy application workloads (such as social networking, gaming, media sharing, and Q&A portals) or compute-intensive workloads (such as a recommendation engine).
  • Caching improves application performance by storing critical pieces of data in memory for low latency access.
  • Cached information may include the results of I/O-intensive database queries or the results of computationally-intensive calculations.
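The pattern described above is usually called cache-aside. A minimal in-process sketch in Python; a real deployment would use a Memcached or Redis client instead of a dict, and the database lookup here is a placeholder:

```python
import time

cache = {}  # stand-in for ElastiCache; maps key -> (value, expiry timestamp)
TTL_SECONDS = 60

def slow_database_lookup(product_id):
    # Placeholder for the slower disk-based database query.
    return {"id": product_id, "name": f"product-{product_id}"}

def get_product(product_id):
    key = f"product:{product_id}"
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                       # cache hit: served from memory
    value = slow_database_lookup(product_id)  # cache miss: go to the database
    cache[key] = (value, time.time() + TTL_SECONDS)
    return value

print(get_product(42))  # miss: hits the database, then populates the cache
print(get_product(42))  # hit: served from the in-memory cache
```

The TTL keeps cached product data from going stale indefinitely; choosing it is a trade-off between freshness and database load.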
Types of Elasticache
Memcached
  • Amazon ElastiCache for Memcached is a Memcached-compatible in-memory key-value store service used as a cache.
  • It is an easy-to-use, high performance, in-memory data store.
  • It can be used as a cache or session store.
  • It is mainly used in real-time applications such as Web, Mobile Apps, Gaming, Ad-Tech, and E-Commerce.
Working of Memcached
  • Databases store data on disk or SSDs, while Memcached keeps its data in memory, eliminating the need to access the disk.
  • Because it is an in-memory key-value store, Memcached avoids seek-time delays and can access data in microseconds.
  • It is a distributed service, meaning it can be scaled out by adding new nodes.
  • It is a multithreaded service, meaning its compute capacity can be scaled up. Its speed, scalability, simple design, efficient memory management, and API support for the most popular languages make Memcached a popular choice for caching use cases.
Benefits of Memcached
  • Sub-millisecond response times
  • Simplicity
  • Scalability
  • Community
Following are the use cases of Memcached
  • Caching
  • Session store

Redis
  • Redis stands for Remote Dictionary Server.
  • It is a fast, open-source, and in-memory key-value data store.
  • Its response time is under a millisecond, and it serves millions of requests per second for real-time applications such as Gaming, AdTech, Financial Services, Healthcare, and IoT.
  • It is used for caching, session management, gaming, leaderboards, real-time analytics, geospatial, etc.
Working of Redis
  • Redis keeps its data in-memory instead of storing the data in disk or SSDs. Therefore, it eliminates the need for accessing the data from the disk.
  • It avoids seek time delays, and data can be accessed in microseconds.
  • It is an open-source in-memory key-value data store that supports data structures such as sorted sets and lists.
Benefits of Redis
  • In-memory data store
    • Redis stores data in memory, while databases such as PostgreSQL and MongoDB store data on disk.
    • Because it does not have to read from disk, it has a faster response time.
    • Read and write operations take less than a millisecond, and it supports millions of requests per second.
  • Flexible data structures
    • It supports a variety of data structures to meet your application's needs. The following are the data structures supported by Redis:
      • Strings: text up to 512 MB in size.
      • Lists: collections of strings.
      • Sets: unordered collections of strings with the ability to intersect and union.
      • Sorted sets: sets ordered by value.
      • Hashes: a structure for storing fields and their associated values.
      • Bitmaps: a data type that provides bit-level operations.
      • HyperLogLogs: a probabilistic data structure used to estimate the number of unique items in a data set.
  • Simplicity
    • Redis lets you write fewer lines of code to store, access, and use data in your applications.
    • For example, if your application's data is stored in a hashmap and you want to persist it to a data store, you can use the Redis hash data structure directly. Without such a structure, you would need to write many lines of code to convert between formats.
  • Replication and persistence
    • Redis provides a primary-replica architecture in which data is replicated to multiple servers.
    • This improves read performance and gives faster recovery when a server fails.
    • It also supports persistence via point-in-time backups, i.e., copying the data set to disk.
  • High availability and scalability
    • Redis supports highly available solutions with consistent performance and reliability.
    • Cluster size can be adjusted on demand: scale in, scale out, or scale up.
  • Extensibility
    • Redis is an open-source project supported by a vibrant community.
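A pure-Python sketch of the sorted-set idea behind Redis leaderboards. A real application would use Redis commands such as ZADD and ZREVRANGE; the functions below only mimic those semantics in-process, and the players and scores are made up:

```python
scores = {}  # member -> score, mimicking a Redis sorted set

def zadd(member, score):
    # Add a member with a score, or update its score (like ZADD).
    scores[member] = score

def zrevrange(top_n):
    # Members ordered by descending score, like ZREVRANGE 0 top_n-1.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [member for member, _ in ranked[:top_n]]

zadd("alice", 320)
zadd("bob", 450)
zadd("carol", 275)
print(zrevrange(2))  # ['bob', 'alice']
```

Redis keeps the set ordered as members are inserted, so range queries stay fast even with millions of members; this sketch re-sorts on every read, which only works at toy scale.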

Differences between Memcached and Redis
  • Sub-millisecond latency
    • Memcached: response times are sub-millisecond, since data in memory is read far more quickly than from disk.
    • Redis: likewise sub-millisecond, for the same reason.
  • Developer ease of use
    • Memcached: syntax is simple to understand and use.
    • Redis: syntax is simple to understand and use.
  • Distributed architecture
    • Memcached: data is distributed across multiple nodes, allowing you to scale out as demand grows.
    • Redis: data is distributed across multiple nodes, allowing you to scale out as demand grows.
  • Support for many programming languages
    • Memcached: supports languages such as C, C++, Java, Python, etc.
    • Redis: supports languages such as C, C++, Java, Python, etc.
  • Advanced data structures
    • Memcached: not supported.
    • Redis: supports advanced data structures such as sets, sorted sets, hashes, and bit arrays.
  • Multithreaded architecture
    • Memcached: supported; multiple processing cores allow it to handle more operations by scaling up compute capacity.
    • Redis: not supported.
  • Snapshots
    • Memcached: not supported.
    • Redis: keeps point-in-time backups of the data on disk to recover from faults.
  • Replication
    • Memcached: does not replicate data.
    • Redis: a primary-replica architecture replicates data across multiple servers and scales database reads.
  • Transactions
    • Memcached: not supported.
    • Redis: supports transactions that execute a group of commands.
  • Lua scripting
    • Memcached: not supported.
    • Redis: can execute Lua scripts, which boost performance and simplify the application.
  • Geospatial support
    • Memcached: none.
    • Redis: purpose-built commands for working with geospatial data, e.g., finding the distance between two elements or all elements within a given distance.

As noted above, there are two types of Elasticache: Memcached and Redis. The benefits and use cases of Memcached listed earlier are expanded below.

  • Sub-millisecond response times
    Since Memcached stores data in the server's main memory and in-memory stores do not have to go to disk, it has a fast response time and supports millions of operations per second.
  • Simplicity
    The design of Memcached is very simple, which makes it powerful and easy to use in application development. It supports many languages such as Java, Ruby, Python, C, and C++.
  • Scalability
    Memcached's architecture is distributed and multithreaded, which makes it easy to scale. You can split the data among a number of nodes, enabling you to scale out capacity by adding new nodes; being multithreaded also lets you scale up compute capacity.
  • Community
    Memcached is an open-source project supported by a vibrant community. Applications such as WordPress and Django use Memcached to improve performance.
  • Caching (use case)
    Memcached implements a high-performance in-memory cache, which decreases data-access latency, increases throughput, and eases the load on your back-end system. It serves cached items in less than a millisecond and enables you to easily and cost-effectively scale for higher loads.
  • Session store (use case)
    Memcached is commonly used by application developers to store and manage session data for internet-based applications. It provides the sub-millisecond latency and the scale required to manage session state such as user profiles and credentials.