Sunday, 20 March 2022

Amazon Aurora

 

  • A fully managed relational database engine that’s compatible with MySQL and PostgreSQL.
  • With some workloads, Aurora can deliver up to five times the throughput of MySQL and up to three times the throughput of PostgreSQL.
  • Aurora includes a high-performance storage subsystem. The underlying storage grows automatically as needed, up to 128 terabytes. The minimum storage is 10GB.
  • Aurora will keep your database up-to-date with the latest patches.
  • Aurora supports quick, efficient cloning operations.
    • You can share your Amazon Aurora DB clusters with other AWS accounts for quick and efficient database cloning.
  • Aurora is fault-tolerant and self-healing.

DB Clusters

    • An Aurora DB cluster consists of one or more DB instances and a cluster volume that manages the data for those DB instances.
    • An Aurora cluster volume is a virtual database storage volume that spans multiple AZs, with each AZ having a copy of the DB cluster data.
    • DB instance types:
      • Primary DB instance – Supports read and write operations, and performs all of the data modifications to the cluster volume. Each Aurora DB cluster has one primary DB instance.
      • Aurora Replica – Connects to the same storage volume as the primary DB instance and supports only read operations. Each Aurora DB cluster can have up to 15 Aurora Replicas in addition to the primary DB instance. Aurora automatically fails over to an Aurora Replica in case the primary DB instance becomes unavailable. You can specify the failover priority for Aurora Replicas. Aurora Replicas can also offload read workloads from the primary DB instance.

Aurora Endpoints

  • When you connect to an Aurora cluster, the host name and port that you specify point to an intermediate handler called an endpoint.
  • Types of endpoints:
    • Cluster endpoint – connects to the current primary DB instance for a DB cluster. This endpoint is the only one that can perform write operations. Each Aurora DB cluster has one cluster endpoint and one primary DB instance.
    • Reader endpoint – connects to one of the available Aurora Replicas for that DB cluster. Each Aurora DB cluster has one reader endpoint. The reader endpoint provides load-balancing support for read-only connections to the DB cluster. Use the reader endpoint for read operations, such as queries. You can’t use the reader endpoint for write operations.
    • Custom endpoint – represents a set of DB instances that you choose. When you connect to the endpoint, Aurora performs load balancing and chooses one of the instances in the group to handle the connection. You define which instances this endpoint refers to, and you decide what purpose the endpoint serves.
    • Instance endpoint – connects to a specific DB instance within an Aurora cluster. The instance endpoint provides direct control over connections to the DB cluster. The main way that you use instance endpoints is to diagnose capacity or performance issues that affect one specific instance in an Aurora cluster.
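
A minimal sketch (Python/boto3) of creating and listing custom endpoints; the cluster, endpoint, and instance identifiers below are placeholders, not values from these notes:

  import boto3

  rds = boto3.client("rds", region_name="us-east-1")

  # Create a custom endpoint that load-balances across two chosen replicas
  rds.create_db_cluster_endpoint(
      DBClusterIdentifier="my-aurora-cluster",
      DBClusterEndpointIdentifier="analytics-endpoint",
      EndpointType="READER",            # custom endpoints can be READER or ANY
      StaticMembers=["replica-1", "replica-2"],
  )

  # List all endpoints (cluster, reader, custom) for the cluster
  endpoints = rds.describe_db_cluster_endpoints(
      DBClusterIdentifier="my-aurora-cluster"
  )
  for ep in endpoints["DBClusterEndpoints"]:
      print(ep["EndpointType"], ep.get("Endpoint"))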

Storage and Reliability

    • Aurora data is stored in the cluster volume, which is designed for reliability. A cluster volume consists of copies of the data across multiple Availability Zones in a single AWS Region.
    • Aurora automatically detects failures in the disk volumes that make up the cluster volume. When a segment of a disk volume fails, Aurora immediately repairs the segment. When Aurora repairs the disk segment, it uses the data in the other volumes that make up the cluster volume to ensure that the data in the repaired segment is current.
    • When a database starts up after a shutdown, or restarts after a failure, Aurora preloads the buffer pool with the pages for known common queries that are kept in an in-memory page cache.
    • Aurora is designed to recover from a crash almost instantaneously and continue to serve your application data without the binary log. Aurora performs crash recovery asynchronously on parallel threads, so that your database is open and available immediately after a crash.
    • Amazon Aurora Auto Scaling works with Amazon CloudWatch to automatically add and remove Aurora Replicas in response to changes in performance metrics that you specify. It is available for both the MySQL-compatible and PostgreSQL-compatible editions of Aurora. There is no additional cost to use Aurora Auto Scaling beyond what you already pay for Aurora and CloudWatch alarms. (See the sketch after this list.)
    • Dynamic resizing automatically decreases the allocated storage space from your Aurora database cluster when you delete data.
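
A minimal sketch (Python/boto3) of how Aurora Auto Scaling can be wired up through Application Auto Scaling; the cluster identifier and target value are placeholders:

  import boto3

  autoscaling = boto3.client("application-autoscaling", region_name="us-east-1")

  # Register the cluster's replica count as a scalable target
  autoscaling.register_scalable_target(
      ServiceNamespace="rds",
      ResourceId="cluster:my-aurora-cluster",           # placeholder cluster ID
      ScalableDimension="rds:cluster:ReadReplicaCount",
      MinCapacity=1,
      MaxCapacity=15,
  )

  # Add/remove Aurora Replicas to keep average reader CPU near 40%
  autoscaling.put_scaling_policy(
      PolicyName="aurora-replica-cpu-tracking",
      ServiceNamespace="rds",
      ResourceId="cluster:my-aurora-cluster",
      ScalableDimension="rds:cluster:ReadReplicaCount",
      PolicyType="TargetTrackingScaling",
      TargetTrackingScalingPolicyConfiguration={
          "TargetValue": 40.0,
          "PredefinedMetricSpecification": {
              "PredefinedMetricType": "RDSReaderAverageCPUUtilization"
          },
      },
  )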

High Availability and Fault Tolerance

    • When you create Aurora Replicas across Availability Zones, RDS automatically provisions and maintains them synchronously. The primary DB instance is synchronously replicated across Availability Zones to Aurora Replicas to provide data redundancy, eliminate I/O freezes, and minimize latency spikes during system backups.
    • An Aurora DB cluster is fault tolerant by design. If the primary instance in a DB cluster fails, Aurora automatically fails over to a new primary instance in one of two ways:
      • By promoting an existing Aurora Replica to the new primary instance
      • By creating a new primary instance
    • Aurora storage is also self-healing. Data blocks and disks are continuously scanned for errors and repaired automatically.
    • Aurora backs up your cluster volume automatically and retains restore data for the length of the backup retention period, from 1 to 35 days.
    • Aurora automatically maintains 6 copies of your data across 3 Availability Zones and will automatically attempt to recover your database in a healthy AZ with no data loss.
    • Aurora has a Backtrack feature that rewinds or restores the DB cluster to the time you specify. However, take note that the Amazon Aurora Backtrack feature is not a total replacement for fully backing up your DB cluster, since the limit for a backtrack window is only 72 hours. (See the sketch after this list.)
    • With Aurora MySQL, you can set up cross-region Aurora Replicas using either logical or physical replication. Aurora PostgreSQL does not currently support cross-region replicas.
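
A minimal sketch (Python/boto3) of rewinding a cluster with Backtrack, assuming Backtrack was enabled when the cluster was created; the identifier and time window are placeholders:

  import boto3
  from datetime import datetime, timedelta, timezone

  rds = boto3.client("rds", region_name="us-east-1")

  # Rewind the cluster to its state 30 minutes ago (placeholder window)
  rds.backtrack_db_cluster(
      DBClusterIdentifier="my-aurora-mysql-cluster",
      BacktrackTo=datetime.now(timezone.utc) - timedelta(minutes=30),
      UseEarliestTimeOnPointInTimeUnavailable=True,
  )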

Aurora Global Database 

    • An Aurora global database spans multiple AWS Regions, enabling low latency global reads and disaster recovery from region-wide outages.
    • Consists of one primary AWS Region where your data is mastered, and up to five read-only, secondary AWS Regions.
    • Aurora global databases use dedicated infrastructure to replicate your data.
    • Aurora global databases introduce a higher level of failover capability than a default Aurora cluster.
    • An Aurora cluster can recover in less than 1 minute even in the event of a complete regional outage. This provides your application with an effective Recovery Point Objective (RPO) of 5 seconds and a Recovery Time Objective (RTO) of less than 1 minute.
    • Has managed planned failover capability, which lets you change which AWS Region hosts the primary cluster while preserving the physical topology of your global database and avoiding unnecessary application changes.
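
A minimal sketch (Python/boto3) of a managed planned failover that promotes a secondary Region's cluster to primary; the global database identifier and cluster ARN are placeholders:

  import boto3

  rds = boto3.client("rds", region_name="us-east-1")

  # Promote the cluster in the secondary Region to be the new primary,
  # keeping the global database topology intact (planned failover).
  rds.failover_global_cluster(
      GlobalClusterIdentifier="my-global-db",
      TargetDbClusterIdentifier=(
          "arn:aws:rds:us-west-2:123456789012:cluster:my-secondary-cluster"
      ),
  )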

DB Cluster Configurations

    • Aurora supports two types of instance classes
      • Memory Optimized
      • Burstable Performance
    • Aurora Serverless is an on-demand, autoscaling configuration for Amazon Aurora (supports both MySQL and PostgreSQL). An Aurora Serverless DB cluster automatically starts up, shuts down, and scales up or down capacity based on your application’s needs.
      • A non-Serverless DB cluster for Aurora is called a provisioned DB cluster.
      • Instead of provisioning and managing database servers, you specify Aurora Capacity Units (ACUs). Each ACU is a combination of processing and memory capacity.
      • You can choose to pause your Aurora Serverless DB cluster after a given amount of time with no activity. The DB cluster automatically resumes and services the connection requests after receiving requests.
      • Aurora Serverless does not support fast failover, but it supports automatic multi-AZ failover.
      • The cluster volume for an Aurora Serverless cluster is always encrypted. You can choose the encryption key, but not turn off encryption.
        • You can set the following specific values (see the configuration sketch at the end of this section):
        • Minimum Aurora capacity unit – Aurora Serverless can reduce capacity down to this capacity unit.
        • Maximum Aurora capacity unit – Aurora Serverless can increase capacity up to this capacity unit.
        • Pause after inactivity – The amount of time with no database traffic to scale to zero processing capacity.
      • You pay by the second and only when the database is in use. 
      • You can share snapshots of Aurora Serverless DB clusters with other AWS accounts or publicly. You also have the ability to copy Aurora Serverless DB cluster snapshots across AWS regions.
    • Limitations of Aurora Serverless
      • Aurora Serverless supports specific MySQL and PostgreSQL versions only.
      • The port number for connections must be:
        • 3306 for Aurora MySQL
        • 5432 for Aurora PostgreSQL
      • You can’t give an Aurora Serverless DB cluster a public IP address. You can access an Aurora Serverless DB cluster only from within a virtual private cloud (VPC) based on the Amazon VPC service.
      • Each Aurora Serverless DB cluster requires two AWS PrivateLink endpoints. If you reach the limit for PrivateLink endpoints within your VPC, you can’t create any more Aurora Serverless clusters in that VPC.
      • A DB subnet group used by Aurora Serverless can’t have more than one subnet in the same Availability Zone.
      • Changes to a subnet group used by an Aurora Serverless DB cluster are not applied to the cluster.
      • Aurora Serverless doesn’t support the following features:
        • Loading data from an Amazon S3 bucket
        • Saving data to an Amazon S3 bucket
        • Invoking an AWS Lambda function with an Aurora MySQL native function
        • Aurora Replicas
        • Backtrack
        • Multi-master clusters
        • Database cloning
        • IAM database authentication
        • Restoring a snapshot from a MySQL DB instance
        • Amazon RDS Performance Insights
    • When you reboot the primary instance of an Aurora DB cluster, RDS also automatically restarts all of the Aurora Replicas in that DB cluster. When you reboot the primary instance of an Aurora DB cluster, no failover occurs. When you reboot an Aurora Replica, no failover occurs.
    • Deletion protection is enabled by default when you create a production DB cluster using the AWS Management Console. However, deletion protection is disabled by default if you create a cluster using the AWS CLI or API.
      • For Aurora MySQL, you can’t delete a DB instance in a DB cluster if both of the following conditions are true:
        • The DB cluster is a Read Replica of another Aurora DB cluster.
        • The DB instance is the only instance in the DB cluster.
  • Aurora Multi-Master
    • The feature is available on Aurora MySQL 5.6.
    • Allows you to create multiple read-write instances of your Aurora database across multiple Availability Zones, which enables uptime-sensitive applications to achieve continuous write availability through instance failure. 
    • In the event of instance or Availability Zone failures, Aurora Multi-Master enables the Aurora database to maintain read and write availability with zero application downtime. There is no need for database failovers to resume write operations.
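
A minimal sketch (Python/boto3) of the Aurora Serverless capacity settings described earlier in this section (minimum/maximum ACUs and pause after inactivity); the cluster identifier, credentials, and values are placeholders:

  import boto3

  rds = boto3.client("rds", region_name="us-east-1")

  # Create an Aurora Serverless cluster with min/max ACUs and auto-pause
  rds.create_db_cluster(
      DBClusterIdentifier="my-serverless-cluster",    # placeholder
      Engine="aurora-mysql",                          # a supported serverless engine version is assumed
      EngineMode="serverless",
      MasterUsername="admin",
      MasterUserPassword="REPLACE_ME",
      ScalingConfiguration={
          "MinCapacity": 2,                # minimum Aurora capacity units
          "MaxCapacity": 16,               # maximum Aurora capacity units
          "AutoPause": True,               # pause after inactivity
          "SecondsUntilAutoPause": 300,    # 5 minutes with no traffic
      },
  )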

Tags

    • You can use Amazon RDS tags to add metadata to your RDS resources.
    • Tags can be used with IAM policies to manage access and to control what actions can be applied to the RDS resources.
    • Tags can be used to track costs by grouping expenses for similarly tagged resources.
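
A minimal sketch (Python/boto3) of tagging an Aurora cluster so the tags can be used in IAM policy conditions and cost allocation; the ARN and tag values are placeholders:

  import boto3

  rds = boto3.client("rds", region_name="us-east-1")

  rds.add_tags_to_resource(
      ResourceName="arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster",
      Tags=[
          {"Key": "Environment", "Value": "production"},
          {"Key": "CostCenter", "Value": "data-platform"},
      ],
  )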

Monitoring

    • Subscribe to Amazon RDS events to be notified when changes occur with a DB instance, DB cluster, DB cluster snapshot, DB parameter group, or DB security group.
    • Database log files
    • RDS Enhanced Monitoring – look at metrics in real time for the operating system.
    • RDS Performance Insights monitors your Amazon RDS DB instance load so that you can analyze and troubleshoot your database performance.
    • Use CloudWatch Metrics, Alarms and Logs
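
A minimal sketch (Python/boto3) of subscribing to RDS events for a DB cluster through an SNS topic; the topic ARN, cluster name, and categories are placeholders:

  import boto3

  rds = boto3.client("rds", region_name="us-east-1")

  rds.create_event_subscription(
      SubscriptionName="aurora-cluster-events",
      SnsTopicArn="arn:aws:sns:us-east-1:123456789012:rds-alerts",  # placeholder
      SourceType="db-cluster",
      SourceIds=["my-aurora-cluster"],
      EventCategories=["failover", "failure", "maintenance"],
      Enabled=True,
  )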

Security

    • Use IAM to control access.
    • To control which devices and EC2 instances can open connections to the endpoint and port of the DB instance for Aurora DB clusters in a VPC, you use a VPC security group.
    • You can make endpoint and port connections using Transport Layer Security (TLS) / Secure Sockets Layer (SSL). In addition, firewall rules can control whether devices running at your company can open connections to a DB instance.
    • Use RDS encryption to secure your RDS instances and snapshots at rest.
    • You can authenticate to your DB cluster using AWS IAM database authentication. IAM database authentication works with Aurora MySQL and Aurora PostgreSQL. With this authentication method, you don’t need to use a password when you connect to a DB cluster. Instead, you use an authentication token, which is a unique string of characters that Amazon Aurora generates on request.
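
A minimal sketch (Python/boto3) of generating an IAM authentication token in place of a password; the endpoint and user name are placeholders, and the actual connection would still be made with a MySQL or PostgreSQL client over SSL/TLS:

  import boto3

  rds = boto3.client("rds", region_name="us-east-1")

  # Short-lived token used instead of a password (IAM database authentication)
  token = rds.generate_db_auth_token(
      DBHostname="my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",  # placeholder
      Port=3306,
      DBUsername="iam_db_user",
      Region="us-east-1",
  )
  print(token[:40], "...")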
  • Aurora for MySQL
    • Performance Enhancements
      • Push-Button Compute Scaling
      • Storage Auto-Scaling
      • Low-Latency Read Replicas
      • Serverless Configuration
      • Custom Database Endpoints
      • Fast insert accelerates parallel inserts sorted by primary key.
      • Aurora MySQL parallel query is an optimization that parallelizes some of the I/O and computation involved in processing data-intensive queries.
      • You can use the high-performance Advanced Auditing feature in Aurora MySQL to audit database activity. To do so, you enable the collection of audit logs by setting several DB cluster parameters (see the sketch after this list).
    • Scaling
      • Instance scaling – scale your Aurora DB cluster by modifying the DB instance class for each DB instance in the DB cluster.
      • Read scaling – as your read traffic increases, you can create additional Aurora Replicas and connect to them directly to distribute the read load for your DB cluster.
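
A minimal sketch (Python/boto3) of turning on Aurora MySQL Advanced Auditing by setting DB cluster parameters; the parameter group name is a placeholder:

  import boto3

  rds = boto3.client("rds", region_name="us-east-1")

  rds.modify_db_cluster_parameter_group(
      DBClusterParameterGroupName="my-aurora-mysql-params",   # placeholder
      Parameters=[
          {"ParameterName": "server_audit_logging",
           "ParameterValue": "1", "ApplyMethod": "immediate"},
          {"ParameterName": "server_audit_events",
           "ParameterValue": "CONNECT,QUERY_DML,QUERY_DDL",
           "ApplyMethod": "immediate"},
      ],
  )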

  Feature                                          | Amazon Aurora Replicas      | MySQL Replicas
  Number of Replicas                               | Up to 15                    | Up to 5
  Replication type                                 | Asynchronous (milliseconds) | Asynchronous (seconds)
  Performance impact on primary                    | Low                         | High
  Act as failover target                           | Yes (no data loss)          | Yes (potentially minutes of data loss)
  Automated failover                               | Yes                         | No
  Support for user-defined replication delay       | No                          | Yes
  Support for different data or schema vs. primary | No                          | Yes

  • Aurora for PostgreSQL
    • Performance Enhancements
      • Push-button Compute Scaling
      • Storage Auto-Scaling
      • Low-Latency Read Replicas
      • Custom Database Endpoints
    • Scaling
      • Instance scaling
      • Read scaling
    • Amazon Aurora PostgreSQL now supports logical replication. With logical replication, you can replicate data changes from your Aurora PostgreSQL database to other databases using native PostgreSQL replication slots, or data replication tools such as the AWS Database Migration Service.
    • Rebooting the primary instance of an Amazon Aurora DB cluster also automatically reboots the Aurora Replicas for that DB cluster, in order to re-establish an entry point that guarantees read/write consistency across the DB cluster.
    • You can import data (supported by the PostgreSQL COPY command) stored in an Amazon S3 bucket into a PostgreSQL table.
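
A minimal sketch (Python with psycopg2) of the S3 import described above, calling the aws_s3 extension from SQL; the connection details, table, and bucket are placeholders, and the cluster is assumed to already have an IAM role that allows reading from the bucket:

  import psycopg2

  conn = psycopg2.connect(
      host="my-aurora-pg.cluster-abc123.us-east-1.rds.amazonaws.com",  # placeholder
      port=5432, dbname="mydb", user="postgres", password="REPLACE_ME",
  )
  with conn, conn.cursor() as cur:
      # One-time setup: the extension that provides the import function
      cur.execute("CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;")
      # Load a CSV object from S3 into an existing table using COPY options
      cur.execute("""
          SELECT aws_s3.table_import_from_s3(
              'sales', '', '(format csv)',
              aws_commons.create_s3_uri('my-bucket', 'data/sales.csv', 'us-east-1')
          );
      """)
  conn.close()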

AWS Transfer Family

 

  • AWS Transfer Family is a secure transfer service for moving files into and out of AWS storage services, such as Amazon S3 and Amazon EFS.
  • With Transfer Family, you do not need to run or maintain any server infrastructure of your own.
  • You can provision a Transfer Family server with multiple protocols (SFTP, FTPS, FTP).

Benefits

  • Fully managed service that scales in real time.
  • You don’t need to modify your applications or run any file transfer protocol infrastructure.
  • Supports up to 3 Availability Zones and is backed by an auto scaling, redundant fleet for your connection and transfer requests.
  • Integration with S3 and EFS lets you capitalize on their features and functionalities as well.
  • Managed File Transfer Workflows (MFTW) is a fully managed, serverless File Transfer Workflow service to set up, run, automate, and monitor processing of files uploaded using Transfer Family.
  • Server endpoint types:
    1. Publicly accessible
      • Can be changed to a VPC hosted endpoint. Server must be stopped before making the change.
    2. VPC hosted
      • Can be optionally set as Internet Facing. Take note that only SFTP and FTPS are supported for the VPC hosted endpoint.
  • Custom Hostnames
    1. Your server host name is the hostname that your users enter in their clients when they connect to your server. You can use a custom domain for this. To redirect traffic from your registered custom domain to your server endpoint, you can use Amazon Route 53 or any DNS provider.

How to delegate access

  1. You first associate your hostname with the server endpoint, then add your users and provision them with the right level of access. A server hostname must be unique in the AWS Region where it’s created.
  2. Your users’ transfer requests are then serviced directly out of your Transfer Family server endpoint.
  3. If you have multiple protocols enabled for the same server endpoint and want to provide access using the same user name over multiple protocols, you can do so as long as the credentials specific to the protocol have been set up in your identity provider.
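
A minimal sketch (Python/boto3) of creating an SFTP-only, service-managed server and then adding a user scoped to an S3 home directory; the role ARN, key, bucket, and names are placeholders:

  import boto3

  transfer = boto3.client("transfer", region_name="us-east-1")

  # Publicly accessible SFTP endpoint with service-managed users
  server = transfer.create_server(
      Protocols=["SFTP"],
      IdentityProviderType="SERVICE_MANAGED",
      EndpointType="PUBLIC",
  )

  # Provision a user with an IAM role and a home directory in S3
  transfer.create_user(
      ServerId=server["ServerId"],
      UserName="alice",                                        # placeholder
      Role="arn:aws:iam::123456789012:role/TransferS3Access",  # placeholder
      HomeDirectory="/my-transfer-bucket/home/alice",          # placeholder
      SshPublicKeyBody="ssh-rsa AAAA... alice@example.com",    # placeholder
  )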

Managing Users

  • Supported identity provider types:
    • Service managed using SSH keys
    • AWS Managed Microsoft AD (does not support Simple AD)
    • A custom method via a RESTful interface. The custom identity provider method uses Amazon API Gateway and enables you to integrate your directory service to authenticate and authorize your users. The service automatically assigns an identifier that uniquely identifies your server.
  • For service managed identities, each user name must be unique on your server.
  • You also specify a user’s home directory, or landing directory, and assign an AWS IAM role to the user. 
    • Optionally, you can provide a session policy to limit user access only to the home directory of your Amazon S3 bucket.
    • The home directory is your S3 bucket or EFS filesystem. If no path is specified, your users are redirected to the root folder.
  • Amazon S3 vs Amazon EFS access management:
    • Amazon S3 – supports session policies.
    • Amazon EFS – supports POSIX user, group, and secondary group IDs.
    • Both – support public/private keys, home directories, and logical directories.

  • Logical directories lets you construct a virtual directory structure that uses user-friendly names so that you can avoid disclosing absolute directory paths, Amazon S3 bucket names, and EFS file system names to your end users.
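
A minimal sketch (Python/boto3) of a logical home directory that maps a friendly path onto a bucket prefix without exposing the bucket name; all identifiers are placeholders:

  import boto3

  transfer = boto3.client("transfer", region_name="us-east-1")

  transfer.create_user(
      ServerId="s-1234567890abcdef0",                          # placeholder
      UserName="bob",
      Role="arn:aws:iam::123456789012:role/TransferS3Access",  # placeholder
      HomeDirectoryType="LOGICAL",
      HomeDirectoryMappings=[
          # The user sees "/uploads"; objects land under the bucket prefix
          {"Entry": "/uploads", "Target": "/my-transfer-bucket/bob/uploads"},
      ],
  )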

Pricing

  • You are billed on an hourly basis for each of the protocols enabled, from the time you create and configure your server endpoint, until the time you delete it. 
  • You are also billed based on the amount of data uploaded and downloaded over SFTP, FTPS, or FTP.
  • There is no additional charge for using managed workflows.

AWS Transfer for SFTP

  • SFTP or Secure Shell File Transfer Protocol is file transfer over SSH.
  • SFTP servers for Transfer Family operate over port 22.
  • SFTP is a newer protocol and uses a single channel for commands and data, requiring fewer port openings than FTPS.

AWS Transfer for FTPS

  • FTPS or File Transfer Protocol Secure is file transfer with TLS encryption.
  • The port range that AWS Transfer Family uses to establish FTPS data connections is 8192–8200. The control connection uses port 21.
  • When creating an FTPS server, you need to provide a server certificate, which must be uploaded to AWS Certificate Manager.

AWS Transfer for FTP

  • FTP or File Transfer Protocol is unencrypted file transfer.
  • The port range that AWS Transfer Family uses to establish FTP data connections is 8192–8200. The control connection uses port 21.
  • Only supported for access within a VPC; it cannot be public facing.

AWS Storage Gateway

 

  • The service enables hybrid storage between on-premises environments and the AWS Cloud.
  • It integrates on-premises enterprise applications and workflows with Amazon’s block and object cloud storage services through industry standard storage protocols.
  • The service stores files as native S3 objects, archives virtual tapes in Amazon Glacier, and stores EBS Snapshots generated by the Volume Gateway with Amazon EBS.

Storage Solutions

File Gateway vs Volume Gateway vs Tape Gateway

    • File Gateway – supports a file interface into S3 and combines a service and a virtual software appliance.
      • The software appliance, or gateway, is deployed into your on-premises environment as a virtual machine running on VMware ESXi or Microsoft Hyper-V hypervisor.
      • File gateway supports
        • S3 Standard
        • S3 Standard – Infrequent Access
        • S3 One Zone – IA
      • With a file gateway, you can do the following:
        • You can store and retrieve files directly using the NFS version 3 or 4.1 protocol.
        • You can store and retrieve files directly using the SMB file system version, 2 and 3 protocol.
        • You can access your data directly in S3 from any AWS Cloud application or service.
        • You can manage your S3 data using lifecycle policies, cross-region replication, and versioning.
      • File Gateway now supports Amazon S3 Object Lock, enabling write-once-read-many (WORM) file-based systems to store and access objects in Amazon S3.
      • Any modifications such as file edits, deletes or renames from the gateway’s NFS or SMB clients are stored as new versions of the object, without overwriting or deleting previous versions.
      • File Gateway local cache can support up to 64TB of data.
    • Volume Gateway – provides cloud-backed storage volumes that you can mount as iSCSI devices from your on-premises application servers.
      • Cached volumes – you store your data in S3 and retain a copy of frequently accessed data subsets locally. Cached volumes can range from 1 GiB to 32 TiB in size and must be rounded to the nearest GiB. Each gateway configured for cached volumes can support up to 32 volumes.

      • Stored volumes – if you need low-latency access to your entire dataset, first configure your on-premises gateway to store all your data locally. Then asynchronously back up point-in-time snapshots of this data to S3. Stored volumes can range from 1 GiB to 16 TiB in size and must be rounded to the nearest GiB. Each gateway configured for stored volumes can support up to 32 volumes.

      • AWS Storage Gateway customers using the Volume Gateway configuration for block storage can detach and attach volumes, from and to a Volume Gateway. You can use this feature to migrate volumes between gateways to refresh underlying server hardware, switch between virtual machine types, and move volumes to better host platforms or newer Amazon EC2 instances.
    • Tape Gateway – archive backup data in Amazon Glacier.
      • Has a virtual tape library (VTL) interface to store data on virtual tape cartridges that you create.
      • Deploy your gateway on an EC2 instance to provision iSCSI storage volumes in AWS.
      • The AWS Storage Gateway service integrates Tape Gateway with Amazon S3 Glacier Deep Archive storage class, allowing you to store virtual tapes in the lowest-cost Amazon S3 storage class.
      • Tape Gateway also has the capability to move your virtual tapes archived in Amazon S3 Glacier to Amazon S3 Glacier Deep Archive storage class, enabling you to further reduce the monthly cost to store long-term data in the cloud by up to 75%.
      • Supports Write-Once-Read-Many and Tape Retention Lock on virtual tapes.

  • Storage Gateway Hosting Options

    • As a VM containing the Storage Gateway software, run on VMware ESXi, Microsoft Hyper-V on premises
    • As a VM in VMware Cloud on AWS
    • As a hardware appliance on premises
    • As an AMI in an EC2 instance
  • Storage Gateway stores volume, snapshot, tape, and file data in the AWS Region in which your gateway is activated. File data is stored in the AWS Region where your S3 bucket is located.
  • The local gateway appliance maintains a cache of recently written or read data so your applications can have low-latency access to data that is stored durably in AWS. The gateways use a read-through and write-back cache.
  • File Gateway File Share

    • You can create an NFS or SMB file share using the AWS Management Console or service API.
    • After your file gateway is activated and running, you can add additional file shares and grant access to S3 buckets.
    • You can use a file share to access objects in an S3 bucket that belongs to a different AWS account.
    • The AWS Storage Gateway service added support for Access Control Lists (ACLs) to Server Message Block (SMB) shares on the File Gateway, helping enforce data security standards when using the gateway for storing and accessing data in Amazon Simple Storage Service (S3).
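
A minimal sketch (Python/boto3) of creating an NFS file share on an activated file gateway; the gateway ARN, IAM role, and bucket are placeholders:

  import boto3
  import uuid

  sgw = boto3.client("storagegateway", region_name="us-east-1")

  sgw.create_nfs_file_share(
      ClientToken=str(uuid.uuid4()),   # idempotency token
      GatewayARN="arn:aws:storagegateway:us-east-1:123456789012:gateway/sgw-12345678",  # placeholder
      Role="arn:aws:iam::123456789012:role/StorageGatewayS3Access",                     # placeholder
      LocationARN="arn:aws:s3:::my-file-gateway-bucket",                                # placeholder
      DefaultStorageClass="S3_STANDARD_IA",
      ClientList=["10.0.0.0/16"],      # NFS clients allowed to mount the share
  )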

Security

    • You can use AWS KMS to encrypt data written to a virtual tape.
    • Storage Gateway uses Challenge-Handshake Authentication Protocol (CHAP) to authenticate iSCSI initiator connections. CHAP provides protection against playback attacks by requiring authentication to access storage volume targets.
    • Authentication and access control with IAM.

Compliance

    • Storage Gateway is HIPAA eligible.
    • Storage Gateway is compliant with the Payment Card Industry Data Security Standard (PCI DSS).

Pricing

    • You are charged based on the type and amount of storage you use, the requests you make, and the amount of data transferred out of AWS.
    • You are charged only for the amount of data you write to the Tape Gateway tape, not the tape capacity.

AWS Snowmobile

 

  • An exabyte-scale data transfer service used to move extremely large amounts of data to AWS. You can transfer up to 100PB per Snowmobile.
  • Snowmobile will be returned to your designated AWS region where your data will be uploaded into the AWS storage services you have selected, such as S3 or Glacier.
  • Snowmobile uses multiple layers of security to help protect your data including dedicated security personnel:
    • GPS tracking, alarm monitoring
    • 24/7 video surveillance
    • an optional escort security vehicle while in transit
    • All data is encrypted with 256-bit encryption keys you manage through the AWS Key Management Service and designed for security and full chain-of-custody of your data.
  • Snowmobile pricing is based on the amount of data stored on the truck per month.

AWS Snowball Edge

 

  • A type of Snowball device with on-board storage and compute power for select AWS capabilities. It can undertake local processing and edge-computing workloads in addition to transferring data between your local environment and the AWS Cloud.
  • Has on-board S3-compatible storage and compute to support running Lambda functions and EC2 instances.
  • Options for device configurations
    • Storage optimized – this option has the most storage capacity at up to 80 TB of usable storage space, 24 vCPUs, and 32 GiB of memory for compute functionality. You can transfer up to 100 TB with a single Snowball Edge Storage Optimized device.
    • Compute optimized – this option has the most compute functionality with 52 vCPUs, 208 GiB of memory, and 7.68 TB of dedicated NVMe SSD storage for instances. This option also comes with 42 TB of additional storage space.
    • Compute Optimized with GPU – identical to the compute optimized option, save for an installed GPU, equivalent to the one available in the P3 Amazon EC2 instance type.

Features

  • Network adapters with transfer speeds of up to 100 Gbit/second.
  • Encryption is enforced, protecting your data at rest and in physical transit.
  • You can import or export data between your local environments and S3.
  • Snowball Edge devices come with an on-board LCD display that can be used to manage network connections and get service status information.
  • You can cluster Snowball Edge devices for local storage and compute jobs to achieve 99.999 percent data durability across 5–10 devices, and to locally grow and shrink storage on demand.
  • You can use the file interface to read and write data to a Snowball Edge device through a file share or Network File System (NFS) mount point.
  • You can write Python-language Lambda functions and associate them with S3 buckets when you create a Snowball Edge device job. Each function triggers whenever there’s a local S3 PUT object action executed on the associated bucket on the appliance (see the sketch after this list).
  • Snowball Edge devices have S3 and EC2 compatible endpoints available, enabling programmatic use cases.
  • Customers who need to run Amazon EC2 workloads on AWS Snowball Edge devices can attach multiple persistent block storage volumes to their Amazon EC2 instances.
  • For latency-sensitive applications such as machine learning, you can deploy a performance-optimized SSD volume (sbp1). Performance optimized volumes on the Snowball Edge Compute Optimized device use NVMe SSD, and on the Snowball Edge Storage Optimized device, they use SATA SSD. Alternatively, you can use capacity-optimized HDD volumes (sbg1) on any Snowball Edge.
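
A minimal sketch of the kind of Python Lambda function described above, assuming the local S3 PUT event mirrors the standard S3 notification format; the processing step is illustrative only:

  # Runs on the Snowball Edge when an object is PUT into the associated local bucket
  def lambda_handler(event, context):
      records = event.get("Records", [])
      for record in records:
          bucket = record["s3"]["bucket"]["name"]
          key = record["s3"]["object"]["key"]
          # Placeholder processing step: just log what arrived
          print(f"New object s3://{bucket}/{key} received on the device")
      return {"processed": len(records)}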

Snowball vs Snowball Edge

  • (View AWS Migration CheatSheet: Snowball: Snowball vs Snowball Edge section)

Job Types

  • Import To S3 – transfer of 80 TB or less of your local data copied onto a single device, and then moved into S3.
    • Snowball Edge devices and jobs have a one-to-one relationship. Each job has exactly one device associated with it. If you need to import more data, you can create new import jobs or clone existing ones.
  • Export From S3 – transfer of any amount of data located in S3, copied onto any number of Snowball Edge devices, and then move one Snowball Edge device at a time into your on-premises data destination.
    • When you create an export job, it’s split into job parts. Each job part is no more than 100 TB in size, and each job part has exactly one Snowball Edge device associated with it.
  • Local Compute and Storage Only – these jobs involve one Snowball Edge device, or multiple devices used in a cluster. This job type is only for local use.
    • A cluster job is for workloads that require increased data durability and storage capacity. Clusters have anywhere from 5 to 10 Snowball Edge devices, called nodes.
    • A cluster offers increased durability and increased storage capacity over a standalone Snowball Edge for local storage and compute.

Recommendations

  • Files should be in a static state while being written to the device.
  • The Job created status is the only status in which you can cancel a job. When a job changes to a different status, it can’t be canceled.
  • All files transferred to a Snowball Edge device should be no smaller than 1 MB in size.
  • Perform multiple write operations at one time by running each command from multiple terminal windows on a computer with a network connection to a single Snowball Edge device.
  • Transfer small files in batches.

Security

  • All data transferred to a device is protected by SSL encryption over the network.
  • To protect data at rest, Snowball Edge uses server side-encryption.
  • Access to Snowball Edge requires credentials that AWS can use to authenticate your requests. Those credentials must have permissions to access AWS resources, such as an Amazon S3 bucket or an AWS Lambda function.

Pricing

  • You are charged depending on the type of Snowball Edge machine you choose (storage optimized or compute optimized).
  • You are charged based on what you select for your term of use: On demand, a 1-year commitment, or a 3-year commitment.
  • For on-demand use, you pay a service fee per data transfer job, which includes 10 days of on-site Snowball Edge device usage. Shipping days, including the day the device is received and the day it is shipped back to AWS, are not counted toward the 10 days. If the device is kept for more than 10 days, you will incur an additional fee for each day beyond 10 days.
  • There is a one-time setup fee per job ordered through the console.

Limits

  • Each file transferred must have a maximum size of 5 terabytes.
  • Jobs must be completed within 120 days of the Snowball Edge device being prepared.
  • The default service limit for the number of AWS Snowball Edge devices you can have at one time is 1.
  • If you allocate the minimum recommendation of 128 MB of memory for each of your functions, you can have up to seven Lambda functions in a single job. (Limited because of physical limits)

Amazon S3 Glacier

 

  • Long-term archival solution optimized for infrequently used data, or “cold data.”
  • Glacier is a REST-based web service.
  • You can store an unlimited number of archives and an unlimited amount of data.
  • You cannot specify Glacier as the storage class at the time you create an object.
  • It is designed to provide an average annual durability of 99.999999999% for an archive. Glacier synchronously stores your data across multiple AZs before confirming a successful upload.
  • To detect corruption of data packets over the wire, a checksum of the data is uploaded along with the data. Glacier compares the received checksum with the checksum of the data it received, and also validates data with checksums during retrieval.
  • Glacier works together with Amazon S3 lifecycle rules to help you automate archiving of S3 data and reduce your overall storage costs. Requested archival data is copied to the S3 One Zone-IA storage class.

Data Model

  • Vault
    • A container for storing archives.
    • Each vault resource has a unique address, of the form:
      https://region-specific-endpoint/account-id/vaults/vault-name
    • You can store an unlimited number of archives in a vault.
    • Vault operations are Region specific.
  • Archive
    • Can be any data such as a photo, video, or document and is a base unit of storage in Glacier.
    • Each archive has a unique address with form:
      https://region-specific-endpoint/account-id/vaults/vault-name/archives/archive-id
  • Job
    • You can perform a select query on an archive, retrieve an archive, or get an inventory of a vault. Glacier Select runs the query in place and writes the output results to Amazon S3.
    • Select, archive retrieval, and vault inventory jobs are associated with a vault. A vault can have multiple jobs in progress at any point in time.
  • Notification Configuration
    • Because jobs take time to complete, Glacier supports a notification mechanism to notify you when a job is complete.

Glacier Operations

  • Retrieving an archive (asynchronous operation)
  • Retrieving a vault inventory (list of archives) (asynchronous operation)
  • Create and delete vaults
  • Get the vault description for a specific vault or for all vaults in a region
  • Set, retrieve, and delete a notification configuration on the vault
  • Upload and delete archives. You cannot update an existing archive.
  • Glacier jobs – select, archive-retrieval, inventory-retrieval.
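
A minimal sketch (Python/boto3) of the vault, archive, and job operations listed above; the vault name is a placeholder, and account ID "-" means the current account:

  import boto3

  glacier = boto3.client("glacier", region_name="us-east-1")

  glacier.create_vault(accountId="-", vaultName="my-archive-vault")   # placeholder name

  # Upload an archive (single operation; multipart upload exists for large archives)
  archive = glacier.upload_archive(
      vaultName="my-archive-vault",
      archiveDescription="sample archive",
      body=b"example archive contents",
  )

  # Retrieval is asynchronous: initiate a job, then fetch the output once it completes
  job = glacier.initiate_job(
      vaultName="my-archive-vault",
      jobParameters={
          "Type": "archive-retrieval",
          "ArchiveId": archive["archiveId"],
          "Tier": "Standard",
      },
  )
  print(job["jobId"])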

Vaults

  • Vault operations are region specific.
  • Vault names must be unique within an account and the region in which the vault is being created.
  • You can delete a vault only if there are no archives in the vault as of the last inventory that Glacier computed and there have been no writes to the vault since the last inventory.
  • You can retrieve vault information such as the vault creation date, number of archives in the vault, and the total size of all the archives in the vault.
  • Glacier maintains an inventory of all archives in each of your vaults for disaster recovery or occasional reconciliation. A vault inventory refers to the list of archives in a vault. Glacier updates the vault inventory approximately once a day. Downloading a vault inventory is an asynchronous operation.
  • You can assign your own metadata to Glacier vaults in the form of tags. A tag is a key-value pair that you define for a vault.
  • Glacier Vault Lock allows you to easily deploy and enforce compliance controls for individual Glacier vaults with a vault lock policy. You can specify controls such as “write once read many” (WORM) in a vault lock policy and lock the policy from future edits. Once locked, the policy can no longer be changed.
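
A minimal sketch (Python/boto3) of the two-step Vault Lock workflow (initiate, then complete while the lock is still in progress); the vault name and WORM-style policy are placeholders:

  import boto3
  import json

  glacier = boto3.client("glacier", region_name="us-east-1")

  # Placeholder policy: deny archive deletion until archives are 365 days old
  lock_policy = {
      "Version": "2012-10-17",
      "Statement": [{
          "Sid": "deny-delete-for-365-days",
          "Effect": "Deny",
          "Principal": "*",
          "Action": "glacier:DeleteArchive",
          "Resource": "arn:aws:glacier:us-east-1:123456789012:vaults/my-archive-vault",
          "Condition": {"NumericLessThan": {"glacier:ArchiveAgeInDays": "365"}},
      }],
  }

  # Step 1: attach the policy and enter the in-progress (testable) state
  lock = glacier.initiate_vault_lock(
      vaultName="my-archive-vault",
      policy={"Policy": json.dumps(lock_policy)},
  )

  # Step 2: lock the policy permanently; it can no longer be changed afterwards
  glacier.complete_vault_lock(
      vaultName="my-archive-vault",
      lockId=lock["lockId"],
  )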

Archives

  • Glacier supports the following basic archive operations: upload, download, and delete. Downloading an archive is an asynchronous operation.
  • You can upload an archive in a single operation or upload it in parts.
  • Using the multipart upload API, you can upload large archives, up to about 40,000 GB (10,000 parts × 4 GB each).
  • You cannot upload archives to Glacier by using the management console. Use the AWS CLI or write code to make requests, by using either the REST API directly or by using the AWS SDKs.
  • You cannot delete an archive using the Amazon S3 Glacier (Glacier) management console. Glacier provides an API call that you can use to delete one archive at a time.
  • After you upload an archive, you cannot update its content or its description. The only way you can update the archive content or its description is by deleting the archive and uploading another archive.
  • Glacier does not support any additional metadata for the archives.

Glacier Select

  • You can perform filtering operations using simple SQL statements directly on your data in Glacier.
  • You can run queries and custom analytics on your data that is stored in Glacier, without having to restore your data to a hotter tier like S3.
  • When you perform select queries, Glacier provides three data access tiers:
    • Expedited – data accessed is typically made available within 1–5 minutes.
    • Standard – data accessed is typically made available within 3–5 hours.
    • Bulk – data accessed is typically made available within 5–12 hours.

Glacier Data Retrieval Policies

  • Set data retrieval limits and manage the data retrieval activities across your AWS account in each region.
  • Three types of policies:
    • Free Tier Only – you can keep your retrievals within your daily free tier allowance and not incur any data retrieval cost.
    • Max Retrieval Rate – ensures that the peak retrieval rate from all retrieval jobs across your account in a region does not exceed the bytes-per-hour limit you set.
    • No Retrieval Limit

Security

  • Glacier encrypts your data at rest by default and supports secure data transit with SSL.
  • Data stored in Amazon Glacier is immutable, meaning that after an archive is created it cannot be updated.
  • Access to Glacier requires credentials that AWS can use to authenticate your requests. Those credentials must have permissions to access Glacier vaults or S3 buckets.
  • Glacier requires all requests to be signed for authentication protection. To sign a request, you calculate a digital signature using a cryptographic hash function that returns a hash value that you include in the request as your signature.
  • Glacier supports policies only at the vault level.
  • You can attach identity-based policies to IAM identities.
  • A Glacier vault is the primary resource and resource-based policies are referred to as vault policies.
  • When activity occurs in Glacier, that activity is recorded in a CloudTrail event along with other AWS service events in Event History.

Pricing

  • You are charged per GB per month of storage
  • You are charged for retrieval operations such as retrieve requests and amount of data retrieved depending on the data access tier – Expedited, Standard, or Bulk
  • Upload requests are charged.
  • You are charged for data transferred out of Glacier.
  • Pricing for Glacier Select is based upon the total amount of data scanned, the amount of data returned, and the number of requests initiated.
  • There is a charge if you delete data within 90 days.