AWS Databases & Analytics
Zhengliang Wang edited at Sat Jun 29 2024
Shared Responsibility on AWS
- Quick provisioning, high availability, vertical and horizontal scaling
- automated backup & restore, upgrades
- monitoring and alerting
Relational
RDS
- automated provisioning, OS patching
- continuous backup and restore to specific timestamp
- monitoring dashboard
- read replica
- multi AZ for disaster recovery
- maintenance windows for upgrades
- scaling capability (vertical and horizontal)
- storage backed by EBS
Amazon Aurora
- proprietary AWS technology (not open source)
- PostgreSQL and MySQL are supported as Aurora DB
- Aurora is optimized for the AWS cloud, and AWS claims a 5x performance improvement over MySQL on RDS and 3x over PostgreSQL on RDS
- Aurora costs more than RDS (about 20% more) but is more efficient
- Not in free tier
Amazon Aurora Serverless
- Automated database instantiation and auto-scaling based on actual usage
- PostgreSQL and MySQL are both supported as Aurora Serverless DB
- No capacity planning needed
- least management overhead
- pay per second, can be more cost-effective
- use case: good for infrequent, intermittent workload
Deployment: Read replicas, Multi-AZ
- reads can be served by read replicas; writes go to the main RDS instance
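The read/write split noted above can be sketched as a small router. This is only an illustration of the idea, not how RDS routes traffic: the class name, the endpoint names, and the "starts with SELECT" rule are all assumptions made for the example.

```python
import itertools


class ReadWriteRouter:
    """Toy router: writes go to the main instance, reads rotate over replicas."""

    def __init__(self, main, replicas):
        self.main = main
        # Fall back to the main instance when no replicas are configured.
        self._readers = itertools.cycle(replicas or [main])

    def endpoint_for(self, sql):
        # Simplification: statements starting with SELECT are reads,
        # everything else is treated as a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._readers) if is_read else self.main


# Hypothetical endpoint names, for illustration only.
router = ReadWriteRouter(
    main="main.example.internal",
    replicas=["replica-1.example.internal", "replica-2.example.internal"],
)
print(router.endpoint_for("SELECT * FROM orders"))
print(router.endpoint_for("INSERT INTO orders VALUES (1)"))
```

In practice the application usually just holds two connection strings (one for the writer, one for readers); the round-robin here only shows why adding replicas spreads read load.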
ElastiCache
- Redis or Memcached
- in memory database
DynamoDB
- Fully managed, highly available with replication across 3 AZs
- NoSQL database
- scales to massive workloads, distributed "serverless" db.
- millions of requests per second, trillions of rows, 100s of TB of storage
- fast and consistent in performance
- single-digit millisecond latency - low latency retrieval
- integrated with IAM
- low cost and auto scaling capacity
- key/value based db with PrimaryKey(partitionKey, sortKey)
- Global Tables
- Active-Active replication (read/write to any Region)
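To make the PrimaryKey(partitionKey, sortKey) idea concrete, here is a minimal in-memory sketch. It is not the DynamoDB API: `ToyTable`, its method names, and the `user#` / `order#` key formats are hypothetical and only mimic how items group by partition key and order by sort key.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table keyed by (partition key, sort key)."""

    def __init__(self):
        # partition key -> list of (sort key, item), kept ordered by sort key
        self._partitions = defaultdict(list)

    def put_item(self, pk, sk, item):
        rows = self._partitions[pk]
        # The full primary key is unique: overwrite an existing (pk, sk) pair.
        rows[:] = [(k, v) for k, v in rows if k != sk]
        rows.append((sk, item))
        rows.sort(key=lambda row: row[0])

    def get_item(self, pk, sk):
        # Point lookup needs both parts of the key.
        for k, v in self._partitions.get(pk, []):
            if k == sk:
                return v
        return None

    def query(self, pk, sk_prefix=""):
        # A query stays inside one partition and can narrow by sort key.
        return [v for k, v in self._partitions.get(pk, []) if k.startswith(sk_prefix)]


table = ToyTable()
table.put_item("user#1", "order#2024-02", {"total": 20})
table.put_item("user#1", "order#2024-01", {"total": 10})
table.put_item("user#1", "profile", {"name": "A"})
print(table.query("user#1", "order#"))
```

The design point is that the partition key decides where an item lives, while the sort key orders items inside that partition, which is what makes range-style lookups cheap.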
DynamoDB Accelerator - DAX
- fully managed in-memory cache for DynamoDB
- 10x performance improvement
- secure, highly scalable & highly available
- DAX is used only with DynamoDB and is integrated with it
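The caching idea behind DAX can be illustrated with a generic read-through cache. This is a sketch of the pattern only, not DAX's implementation or client API; `ReadThroughCache`, the `load_from_table` callback, and the TTL value are assumptions chosen for the example.

```python
import time


class ReadThroughCache:
    """Serve repeated reads from memory; fall through to the table on a miss."""

    def __init__(self, load_from_table, ttl_seconds=300.0, clock=time.monotonic):
        self._load = load_from_table   # hypothetical function that reads the table
        self._ttl = ttl_seconds        # arbitrary expiry chosen for the sketch
        self._clock = clock
        self._entries = {}             # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: read from the backing table and remember it.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (now + self._ttl, value)
        return value


def slow_table_read(key):
    # Stand-in for a network round trip to the database.
    return {"id": key}


cache = ReadThroughCache(slow_table_read)
cache.get("item-1")
cache.get("item-1")
print(cache.hits, cache.misses)
```

Only the first read of a key pays the table round trip; later reads within the TTL are answered from memory, which is where the latency gain for read-heavy workloads comes from.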
Redshift
- PostgreSQL-based
- OLAP - online analytical processing (analytics and data warehouse)
- load data in batches (for example, once every hour), not every second
- AWS claims 10x better performance than other data warehouses; scales to PBs of data
- Columnar storage of data instead of row based
- massively parallel query execution, highly available
- pay as you go based on the instance provisioned
- has a SQL interface to perform queries
- BI tools such as Amazon QuickSight or Tableau integrate with it
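Why columnar storage suits analytics can be shown with a tiny example. This is a conceptual sketch in plain Python, not how Redshift stores data on disk; the sample rows and the `to_columns` helper are made up for illustration.

```python
# Hypothetical sample data in a row-based layout: one record per order.
rows = [
    {"order_id": 1, "region": "eu", "amount": 120.0},
    {"order_id": 2, "region": "us", "amount": 80.0},
    {"order_id": 3, "region": "eu", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited in full just to read one field.
total_from_rows = sum(row["amount"] for row in rows)

# Columnar layout: the aggregate reads only the 'amount' column.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

An analytical query typically aggregates a few columns over many rows, so a layout that lets the engine skip the other columns reads far less data.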
Redshift Serverless
- use case: reporting, dashboarding applications, real-time analytics
- pay only for what you use
- run analytics workloads without managing data warehouse infrastructure
Elastic MapReduce (EMR)
- helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
- supports Apache Spark, HBase, Presto, Flink
- Auto-scaling
- use cases: data processing, machine learning, web indexing, big data
Amazon Athena
- Serverless query service to perform analytics against S3 objects
- uses standard SQL language to query files
- supports CSV, JSON, ORC, Avro, and Parquet (built on Presto)
- Pricing: $5 per TB of data scanned
- use compressed or columnar data for cost savings
- use case: BI analytics, reporting, analyze & query VPC Flow Logs, ELB logs
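The cost tip above follows directly from the per-TB pricing. A rough sketch of the arithmetic, using the $5 per TB figure from these notes; treating a TB as 1024^4 bytes and the "columnar scan reads 10% of the raw data" ratio are assumptions for illustration, not measured numbers.

```python
PRICE_PER_TB_USD = 5.0     # figure quoted in these notes
BYTES_PER_TB = 1024 ** 4   # assumption: binary terabyte


def scan_cost_usd(bytes_scanned):
    """Estimated query cost, which depends only on the bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


raw_csv_bytes = 2 * BYTES_PER_TB
# Hypothetical ratio: compressed, columnar data lets the query read 10% as much.
columnar_bytes = raw_csv_bytes * 0.10

print(f"raw CSV scan: ${scan_cost_usd(raw_csv_bytes):.2f}")
print(f"columnar scan: ${scan_cost_usd(columnar_bytes):.2f}")
```

Because the bill tracks bytes scanned rather than query time, anything that shrinks the scan (compression, columnar formats, selecting fewer columns) lowers the cost proportionally.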
NoSQL
DocumentDB
- MongoDB-compatible: DocumentDB is to MongoDB what Aurora is to PostgreSQL/MySQL
- similar concepts as Aurora
- storage automatically grows in increments of 10GB
Amazon Neptune
- Graph Database
- highly available across 3 AZs, up to 15 read replicas
- use case: graphs, fraud detection, recommendation engines, social networking
Amazon Timestream
- serverless time series database
- store and analyze trillions of events
- 100s of times faster and 1/10 the cost of RDS (per AWS claims)
Amazon QLDB (Quantum Ledger Database)
- a ledger is a book recording financial transactions
- immutable system: no entry can be removed or modified, cryptographically verifiable
- 2-3x better performance than common ledger blockchain frameworks; manipulate data using SQL
- no decentralization component in comparison to Amazon Managed Blockchain.
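"Immutable and cryptographically verifiable" can be illustrated with a simple hash chain, where each entry's hash covers the previous entry's hash. This is a generic sketch of the idea under that assumption, not QLDB's actual journal format or API; `ToyLedger` and its fields are hypothetical.

```python
import hashlib
import json


class ToyLedger:
    """Append-only log where every entry is chained to the one before it."""

    def __init__(self):
        self._entries = []  # each entry: {"data", "prev_hash", "hash"}

    @staticmethod
    def _digest(data, prev_hash):
        payload = json.dumps({"data": data, "prev_hash": prev_hash}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def append(self, data):
        prev_hash = self._entries[-1]["hash"] if self._entries else ""
        entry = {
            "data": data,
            "prev_hash": prev_hash,
            "hash": self._digest(data, prev_hash),
        }
        self._entries.append(entry)
        return entry["hash"]

    def verify(self):
        # Recompute every hash; any edited entry breaks the chain from there on.
        prev_hash = ""
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != self._digest(entry["data"], prev_hash):
                return False
            prev_hash = entry["hash"]
        return True


ledger = ToyLedger()
ledger.append({"from": "alice", "to": "bob", "amount": 10})
ledger.append({"from": "bob", "to": "carol", "amount": 4})
print(ledger.verify())
```

Changing any past entry changes its hash, which no longer matches what later entries recorded, so tampering is detectable without trusting whoever stores the log.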
Amazon Managed Blockchain
- join public blockchain network
- create own scalable private network
- compatible with Hyperledger Fabric or Ethereum
AWS Glue
- managed extract, transform, and load (ETL) service
- useful to prepare and transform data for analytics
- fully serverless service
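The extract, transform, load steps can be pictured with a tiny local pipeline. This is plain Python for illustration and does not use the Glue API; the sample CSV, the field names, and the cleaning rules are all made up.

```python
import csv
import io
import json

# Hypothetical raw input, standing in for a file landed in a data lake.
RAW_CSV = "order_id,amount,currency\n1,120.50,usd\n2,,usd\n3,80.00,eur\n"


def extract(text):
    """Extract: parse the raw CSV into dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: drop incomplete rows, fix types, normalize currency codes."""
    cleaned = []
    for record in records:
        if not record["amount"]:
            continue  # skip rows with a missing amount
        cleaned.append({
            "order_id": int(record["order_id"]),
            "amount": float(record["amount"]),
            "currency": record["currency"].upper(),
        })
    return cleaned


def load(records):
    """Load: emit JSON lines, a format that analytics tools can read."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


output = load(transform(extract(RAW_CSV)))
print(output)
```

A managed ETL service runs this same three-stage shape at scale, so the notes' point is mainly that you write the transform logic and leave the servers to the service.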
DMS - Database Migration Service
- quickly and securely migrates databases to AWS; resilient and self-healing
- supports homogeneous migrations, e.g. Oracle to Oracle
- supports heterogeneous migrations, e.g. Microsoft SQL Server to Aurora
Summary
- Relational databases - OLTP: RDS & Aurora
- Differences between Multi-AZ, Read Replicas, and Multi-Region
- In-memory Database: ElastiCache
- K/V databases: DynamoDB (serverless) & DAX (cache for DynamoDB)
- Warehouse - OLAP: Redshift (SQL)
- Hadoop Cluster: EMR
- Athena: query data on S3 with SQL capability
- QuickSight: dashboard on data (serverless)
- DocumentDB: Aurora for MongoDB
- Amazon QLDB: Financial Transaction Ledger
- Amazon Managed Blockchain: Hyperledger Fabric & Ethereum blockchains
- Glue: ETL and data catalog service
- Database Migration: DMS