---
tags: AWS
---
AWS Services
===
**AWS Config**
* This enables you to simplify compliance auditing, security analysis, change management, and operational troubleshooting.
ECS / EKS -> managed container orchestration services: ECS (AWS-native) / EKS (managed Kubernetes)
**Docker** - typical use cases:
* breaking up monoliths
* service oriented architecture
* batch jobs
* short lived jobs
* variety and flexibility
* elasticity
* CI/CD pipelines
    * packaging in Docker images
    * test
    * deploy in prod
**Fargate**
Run containers without managing servers or clusters
**CloudFormation** - infrastructure as code: define YAML or JSON templates for all configs.
Model and provision all your cloud infrastructure resources. A template has the following sections:
```
{
  "AWSTemplateFormatVersion" : "version date",
  "Description" : "JSON string",
  "Metadata" : {
  },
  "Parameters" : {
  },
  "Mappings" : {
  },
  "Conditions" : {
  },
  "Transform" : {
  },
  "Resources" : {
  },
  "Outputs" : {
  }
}
```
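A minimal boto3 sketch of deploying a template like the one above; the stack name, file name, and parameter are placeholders, not values from these notes.
```
# Sketch: deploy a CloudFormation template with boto3 (names are placeholders)
import boto3

cfn = boto3.client("cloudformation")

with open("template.json") as f:
    template_body = f.read()

# create_stack validates the template and starts provisioning the resources it defines
response = cfn.create_stack(
    StackName="my-stack",
    TemplateBody=template_body,
    Parameters=[{"ParameterKey": "EnvType", "ParameterValue": "test"}],  # hypothetical parameter
)

# Block until the stack reaches CREATE_COMPLETE (or raise if creation fails)
waiter = cfn.get_waiter("stack_create_complete")
waiter.wait(StackName="my-stack")
print(response["StackId"])
```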
**Elastic Beanstalk** - publish a web app without having to manage the underlying infrastructure
Service Role / Instance Role
**Apache Flume**
is distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. Mostly relevant for users who are new to AWS.
SQS
---
* Using SQS you can decouple components of the application so that they can run independently
* managed queue (Simple Queue Service)
* web service that gives you access to a message queue that can be used to store messages while they wait for a consumer to process them
* standard queue (not ordered and might get duplicates) / FIFO queue
* pull based; for push based use SNS
* messages up to 256 KB in size
* retention from 1 minute to 14 days; default retention is 4 days
* visibility timeout (the amount of time a message stays hidden in the SQS queue after a reader picks it up); maximum is 12 hours
* SQS guarantees that your messages will be processed at least once.
* Enable long polling by setting ==ReceiveMessageWaitTimeSeconds== > 0
* FIFO limited to 300 transactions per sec
* SQS short polling / long polling; use long polling to save money
* Delay seconds: postpones delivery of new messages to the queue (0 seconds up to 15 minutes)
* For a standard queue, enter a value for Receive message wait time. The range is 0 to 20 seconds. The default value is 0 seconds, which sets short polling. Any non-zero value sets long polling.
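A minimal boto3 sketch of the settings above (long polling, retention, visibility timeout); the queue name is a placeholder.
```
# Sketch: SQS queue with long polling enabled (queue name is a placeholder)
import boto3

sqs = boto3.client("sqs")

# ReceiveMessageWaitTimeSeconds > 0 on the queue enables long polling by default
queue_url = sqs.create_queue(
    QueueName="demo-queue",
    Attributes={
        "ReceiveMessageWaitTimeSeconds": "20",  # 0 = short polling, 1-20 = long polling
        "MessageRetentionPeriod": "345600",     # 4 days (the default), max 14 days
        "VisibilityTimeout": "30",              # seconds a received message stays hidden
    },
)["QueueUrl"]

sqs.send_message(QueueUrl=queue_url, MessageBody="hello", DelaySeconds=0)

# WaitTimeSeconds long-polls this single ReceiveMessage call for up to 20 seconds
messages = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20, MaxNumberOfMessages=1)
for msg in messages.get("Messages", []):
    print(msg["Body"])
    # delete after successful processing, otherwise it reappears after the visibility timeout
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```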
SNS
--
* push notifications
* all messages are stored across multiple AZs
* pub-sub model
* SMS / SQS / Android push, and other endpoints
* multiple transport protocols
* You should use SNS instead of SES (Simple Email Service) when you want to monitor your EC2 instances / databases
Basic operations:
* Create Topic: the access point (subscribers can dynamically subscribe)
* Subscribe
* Delete Topic
* Publish
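A minimal boto3 sketch of that topic lifecycle; the topic name and email endpoint are placeholders.
```
# Sketch: SNS pub-sub flow (topic name and email address are placeholders)
import boto3

sns = boto3.client("sns")

# Create Topic: the access point that subscribers attach to
topic_arn = sns.create_topic(Name="demo-topic")["TopicArn"]

# Subscribe: SNS supports multiple transport protocols (email, sms, sqs, http/https, lambda, ...)
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="user@example.com")

# Publish: every confirmed subscriber receives the message (push, not pull)
sns.publish(TopicArn=topic_arn, Subject="Alert", Message="CPU alarm on instance i-1234")

# Delete Topic when it is no longer needed
sns.delete_topic(TopicArn=topic_arn)
```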
SWF (Simple Workflow Service)
---
>Old way of orchestrating a big workflow.
* Workflow executions can last up to 1 year
* SWF presents a task-oriented API, while SQS presents a message-oriented API
* SWF ensures that a task is assigned only once and is ==never duplicated==; with SQS you need to handle duplicate messages and also ensure that a message is processed only once
* SWF keeps track of all the tasks and events in an application; with SQS you need to implement your own application-level tracking
* Actors:
    * Workflow starters: applications that can start a workflow
    * Deciders: control the flow of activity tasks in a workflow execution
    * Activity workers: carry out activity tasks
* Domain: a collection of related workflows
* Amazon Simple Queue Service (SQS) and Amazon Simple Workflow Service (SWF) are the services that you can use for creating a decoupled architecture in AWS. Decoupled architecture is a type of computing architecture that enables computing components or layers to execute independently while still interfacing with each other.
Amazon MQ (Apache Active MQ)
---
* HA Amazon MQ (application migration use case)
* Active/standby broker (HA)
* Single-instance broker
* If you're using messaging with existing applications and want to move your messaging service to the cloud quickly and easily, it is recommended that you consider Amazon MQ.
**EMR**
Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) tool for big data processing and analysis
EMR -> core nodes [YARN + HDFS], 1 or 2 nodes
    -> task nodes [YARN only]
Autoscaling -> task nodes can be autoscaled
EMR creation:
* Bootstrap actions -> run while the cluster is getting launched
* Steps -> processing commands submitted to the cluster (see the sketch below)
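A minimal boto3 sketch of submitting a step to a running cluster; the cluster id and S3 path are placeholders.
```
# Sketch: add a processing step to an existing EMR cluster (ids and paths are placeholders)
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[
        {
            "Name": "spark-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # runs the command below on the cluster
                "Args": ["spark-submit", "s3://my-bucket/jobs/etl.py"],
            },
        }
    ],
)
```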
Presto-> fast and flexible
3 V's of Big Data:
* Velocity
* Volume
* Variety
Athena
--
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Serverless (think of it as a serverless EMR/Presto).
**Athena Federated Queries**
Allows you to query data sources other than S3 buckets using a data connector.
MySQL, DynamoDB, Redshift, TimeSeries
Features :
* Execute a CREATE TABLE AS SELECT (CTAS) query in Amazon Athena to convert query results into efficient storage formats like Parquet or ORC. These formats are optimized for Athena, leading to reduced data scanning per query, thereby enhancing query performance and lowering costs.
* Amazon Athena's "Reuse Query Results" feature allows users to reuse the last stored query result when re-running a query, enhancing performance and reducing costs.
* Amazon Athena now supports User Defined Functions (UDFs), allowing customers to write custom scalar functions and invoke them in SQL queries. UDFs in Athena are defined within an AWS Lambda function as methods in a Java deployment package. This feature is particularly useful for scenarios requiring custom processing, such as categorizing earthquake locations using a geospatial indexing system. UDFs enable the encapsulation of complex logic into reusable functions, thus simplifying SQL queries in Athena.
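A minimal boto3 sketch of the CTAS approach described above; the database, table, and bucket names are placeholders.
```
# Sketch: run a CTAS query through Athena to produce Parquet output (names are placeholders)
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE logs_parquet
WITH (format = 'PARQUET', external_location = 's3://my-bucket/parquet/') AS
SELECT * FROM logs_raw
"""

# Query results (and the CTAS output) land in S3; Athena itself is serverless
execution = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(execution["QueryExecutionId"])
```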
**QuickSight** -> BI tool
Kinesis (similar to Kafka)
--
platform to which streaming data is sent
Types:
1. Kinesis Streams [shards; no long-term storage]
2. Firehose [loads data into S3 and other destinations]
3. Analytics [data analytics on the fly]
* default data retention is 24 hours, max 7 days
* Kinesis streams consist of shards (5 read transactions per second per shard, up to a max of 2 MB/s)
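A minimal boto3 sketch of writing a record to a stream; the stream name and payload are placeholders.
```
# Sketch: put a record onto a Kinesis data stream (stream name is a placeholder)
import json
import boto3

kinesis = boto3.client("kinesis")

# PartitionKey decides which shard the record goes to, which preserves
# per-key ordering (records with the same key land on the same shard).
kinesis.put_record(
    StreamName="demo-stream",
    Data=json.dumps({"event": "click", "user": "u-123"}).encode("utf-8"),
    PartitionKey="u-123",
)
```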
File data -> S3
Streaming data -> Kinesis
Copying data to S3:
* using NFS mounts and the AWS DataSync agent
* using S3DistCp to copy from S3 to HDFS
Connectivity between on-premises and AWS:
1. VPN
2. AWS Direct Connect
3. Amazon S3 multipart upload
4. AWS Snowball [physical data transfer]
replicate data in multiple regions so that if one region is down you can still get the data
batch processing vs stream processing
decouple collection and processing
preserve the client ordering
collect multiple streams together
consume in parallel
Amazon Firehose -> data loader for Kinesis, so that you don't have to configure and program Kinesis consumers yourself
instead of using the KCL, you can use Spark Streaming on EMR
Firehose has at least 60 s of latency, while Kinesis Streams is sub-1 s latency
Kinesis Analytics
Streams -> Kinesis
Table streams -> DynamoDB
**Amazon Neptune** is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Amazon Neptune is highly available, with read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across Availability Zones. A Neo4j alternative.
**EFS** - shared file storage. Network file storage for EC2 instances.
**HBase/Cassandra** -> columnar data store
shard
**Redshift** - data warehouse
S3 Intelligent-Tiering moves data between different S3 storage tiers based on access patterns
Amazon Elasticsearch Service is a managed service for Elasticsearch, a popular open-source search and analytics engine for big data use cases such as log and clickstream analysis.
**GraphX** is Apache Spark's API for graphs and graph-parallel computation.
smaller objects -> DynamoDB
bigger objects -> S3, because DynamoDB has a higher storage cost
S3 as data lake
Athena time taken to execute the query:
* (Run time: 18.12 seconds, Data scanned: 90.74 GB)
* (Run time: 19.44 seconds, Data scanned: 13.08 GB)
* Partitioned compressed data:
    * (Run time: 6.02 seconds, Data scanned: 0 KB) -> to create the partitioned data
    * (Run time: 19.69 seconds, Data scanned: 432.46 MB)
* Parquet: (Run time: 3.03 seconds, Data scanned: 1.97 MB)
Master node on dedicated (On-Demand) instances, while task nodes can run on Spot pricing
Auto scaling rules based on YARN metrics, e.g. scale if available memory is less than 15% for a 5-minute period
```
hive -d SAMPLE=s3://aws-tc-largeobjects/AWS-200-BIG/v3.1/lab-4-hive/data -d DAY=2009-04-13 -d HOUR=08 -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 -d OUTPUT=s3://hive-bucket-123/output/
```
File sizes -> splittable formats preferred: Parquet + Snappy, 2 GB - 4 GB files > CSV, JSON
Non-splittable / small files (1 MB - 128 MB): use Kinesis Firehose to combine them into larger files
Hive (reliability + more features) < Spark SQL < Presto (less reliable) < Redshift (highest performance; requires defining a schema)
Apache Zeppelin: notebook (Jupyter-like) with Spark integration
JupyterHub -> Jupyter notebook applications / SageMaker
Hue -> browser-based web UI for Hadoop / facilitates collaboration
Ganglia (individual workers) vs CloudWatch (overall metrics) / Hadoop UI monitoring
**Amazon Redshift** is a managed data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services.
* data warehousing tool
* business intelligence tool
* runs in only one Availability Zone
Glue
--
Serverless, fully managed ETL; the Glue Data Catalog can be used from Athena. Makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.
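A minimal boto3 sketch of driving Glue: start a crawler, then read the catalog tables that Athena can query; the crawler and database names are placeholders.
```
# Sketch: crawl data and read the Glue Data Catalog (names are placeholders)
import boto3

glue = boto3.client("glue")

# A crawler scans the data store (e.g. an S3 prefix) and writes table
# definitions into the Glue Data Catalog, which Athena can then query.
glue.start_crawler(Name="my-crawler")

# List the tables the crawler has catalogued
tables = glue.get_tables(DatabaseName="my_database")
for table in tables["TableList"]:
    print(table["Name"], table.get("StorageDescriptor", {}).get("Location"))
```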
**X-Ray** - debug and trace Lambda functions
Cloud Front
--
content distribution network (CDN); caches static content
An edge location is separate from an AZ; it is the location where content is cached
Origin: the origin of all the files that the CDN will distribute; this can be an S3 bucket / EC2 instance / Elastic Load Balancer / Route 53
Edge locations are not read-only. Cached objects expire after the Time to Live (TTL)
**Lambda@Edge** is a feature of Amazon CloudFront that lets you run code closer to users of your application, which improves performance and reduces latency. With Lambda@Edge, you don't have to provision or manage infrastructure in multiple locations around the world. You pay only for the compute time you consume - there is no charge when your code is not running.
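A minimal sketch of a Lambda@Edge handler (Python) attached to an origin-request trigger that adds a header before CloudFront forwards the request; the header name is a placeholder.
```
# Sketch: Lambda@Edge origin-request handler (header name is a placeholder)
def handler(event, context):
    # CloudFront passes the request inside the Lambda@Edge event record
    request = event["Records"][0]["cf"]["request"]

    # Header names are lower-cased keys mapping to lists of {key, value} pairs
    request["headers"]["x-edge-processed"] = [
        {"key": "X-Edge-Processed", "value": "true"}
    ]

    # Returning the request lets CloudFront continue to the origin
    return request
```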
AWS storage gateway
--
>AWS Storage Gateway is a hybrid cloud storage service that gives you on-premises access to virtually unlimited cloud storage.
The AWS Storage Gateway Hardware Appliance is a physical, standalone, validated server configuration for on-premises deployments.
Transfer data between an in-house data center and AWS
1. File Gateway
* Gets your data into Amazon S3 for use with other AWS services
* Utilizes standard storage protocols such as NFS and SMB
* Provides a file-based interface to S3

2. Tape Gateway
* Tape Gateway enables you to replace using physical tapes on premises with virtual tapes in AWS without changing existing backup workflows for long term storage retention needs.

3. Cached Volume Gateway/ Volume Gateway
* Volume Gateway offers cloud-backed storage to your on-premises applications using industry standard iSCSI connectivity.
* In the cached Volume Gateway mode, your primary data is stored in Amazon S3, while retaining your frequently accessed data locally in the cache for low latency access

**Rekognition**
Deep learning based visual analysis service
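A minimal boto3 sketch of a Rekognition label-detection call on an image in S3; the bucket and key are placeholders.
```
# Sketch: detect labels in an S3-hosted image with Rekognition (bucket/key are placeholders)
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-bucket", "Name": "photo.jpg"}},
    MaxLabels=10,
    MinConfidence=80,
)
for label in response["Labels"]:
    print(label["Name"], label["Confidence"])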
**Step Functions**
state machine - workflow template
task - Lambda function
activity - handle for external compute
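A minimal boto3 sketch tying these pieces together: a one-state machine whose task is a Lambda function; the role ARN and Lambda ARN are placeholders.
```
# Sketch: create and start a Step Functions state machine (ARNs are placeholders)
import json
import boto3

sfn = boto3.client("stepfunctions")

# The state machine (workflow template) is written in Amazon States Language;
# the Task state here invokes a Lambda function.
definition = {
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-order",
            "End": True,
        }
    },
}

state_machine = sfn.create_state_machine(
    name="order-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",
)

sfn.start_execution(
    stateMachineArn=state_machine["stateMachineArn"],
    input=json.dumps({"orderId": "1234"}),
)
```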
Resources
* https://acloud.guru/forums/aws-certified-solutions-architect-associate/discussion/-LZipqkEptfvidSZs_5i/passed_aws_ssa%20%20Follow%20Responses
* https://d1.awsstatic.com/whitepapers/AWS_Cloud_Best_Practices.pdf
* https://www.aws.training/learningobject/curriculum?id=20685
* https://gist.github.com/leonardofed/bbf6459ad154ad5215d354f3825435dc
* https://blog.newrelic.com/engineering/aws-certified-solutions-architect-associate-exam-prep/
* AWS Console Recorder Chrome plugin: generates scripts for actions performed in the UI
### AWS Guard Duty
Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts, workloads, and data stored in Amazon S3. CloudWatch is used along with Lambda to act on the events.
### AWS Inspector
>Automated security assessment service to help improve the security and compliance of applications deployed on AWS
ARN format:
arn:partition:service:region:account_id:resource
* partition : aws|aws-cn
* service : s3| ec2| rds
* region : us-east-1 | eu-central-1
* account_id
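A small illustration of splitting an example ARN into those parts (the ARN itself is a made-up S3 example; S3 ARNs leave region and account_id empty).
```
# Sketch: split an ARN into its components
arn = "arn:aws:s3:::my-bucket/reports/2023.csv"

# arn:partition:service:region:account_id:resource
prefix, partition, service, region, account_id, resource = arn.split(":", 5)
print(partition, service, region or "<none>", account_id or "<none>", resource)
```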
**AWS Control tower**
Service Control Policies, called guardrails, control all accounts/users in one place, e.g. block all SSH connections
**Systems Manager**
> you can use to view and control your infrastructure on AWS
> you can view operational data from multiple AWS services and automate operational tasks across your AWS resources
**Elastic Transcoder**
1. Media transcoder in the cloud
2. Converts media files from their original source format into different formats that will play on smartphones, tablets, PCs, etc.
3. Provides transcoding presets for popular formats
Pipeline -> Preset -> Job (see the sketch below)
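A minimal boto3 sketch of that Pipeline -> Preset -> Job flow; the pipeline id, preset id, and object keys are placeholders.
```
# Sketch: submit an Elastic Transcoder job (ids and keys are placeholders)
import boto3

transcoder = boto3.client("elastictranscoder")

transcoder.create_job(
    PipelineId="1111111111111-abcde1",       # pipeline: input bucket, output bucket, IAM role
    Input={"Key": "uploads/video.mov"},       # source file in the pipeline's input bucket
    Outputs=[
        {
            "Key": "transcoded/video-720p.mp4",
            "PresetId": "1351620000001-000010",  # placeholder id for a built-in preset
        }
    ],
)
```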
AWS WorkDocs
---
Amazon WorkDocs is a fully managed, secure content creation, storage, and collaboration service. With Amazon WorkDocs, you can easily create, edit, and share content, and because it’s stored centrally on AWS, access it from anywhere on any device.
* Migrate on-premises file servers to the cloud and reduce costs
