AWS Services
===

**AWS Config**
* Enables you to simplify compliance auditing, security analysis, change management, and operational troubleshooting.

ECS / EKS -> managed container orchestration services / Kubernetes

**Docker**
* breaking up monoliths
* service-oriented architecture
* batch jobs
* short-lived jobs
* variety and flexibility
* elasticity
* CI/CD pipelines
* packaging in Docker images
* test and deploy in prod

**Fargate**
Run containers without managing servers or clusters.

**CloudFormation**
Infrastructure as code: define YAML (or JSON) templates for all configs.
Model and provision all your cloud infrastructure resources. You can create a template:
```
{
  "AWSTemplateFormatVersion" : "version date",
  "Description" : "JSON string",
  "Metadata" : { },
  "Parameters" : { },
  "Mappings" : { },
  "Conditions" : { },
  "Transform" : { },
  "Resources" : { },
  "Outputs" : { }
}
```

**Beanstalk**
Publish a web app without caring about the infrastructure; mostly for users who are new to AWS.
Service Role / Instance Role

**Apache Flume** is distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

SQS
---
* Using SQS you can decouple components of the application so that they can run independently.
* Managed queue (Simple Queue Service).
* Web service that gives you access to a message queue that can be used to store messages while waiting for a computer to process them.
* Standard queue (not ordered and might get duplicates) / FIFO queue.
* Pull based; for push based use SNS.
* Messages up to 256 KB in size.
* Retention from 1 minute to 14 days; default retention is 4 days.
* Visibility timeout (amount of time a message stays hidden from other consumers in the SQS queue after a reader picks it up); maximum is 12 hours.
* SQS guarantees that your messages will be processed at least once.
* Enable long polling by setting ==ReceiveMessageWaitTimeSeconds== > 0.
* FIFO queues are limited to 300 transactions per second.
* SQS short polling / long polling: use long polling to save money.
* Delay seconds: postpones delivery of new messages to the queue (delay queue).
* For a standard queue, enter a value for Receive message wait time. The range is 0 to 20 seconds. The default value is 0 seconds, which sets short polling. Any non-zero value sets long polling.

SNS
---
* Push notifications.
* All messages are stored across multiple AZs.
* Pub-sub model.
* Endpoints: SMS / SQS / mobile push (Android).
* Multiple transport protocols.
* You should use SNS instead of SES (Simple Email Service) when you want to monitor your EC2 instances / database.

API actions:
* Create Topic: access point (subscribers can dynamically subscribe)
* Subscribe
* Delete Topic
* Publish

SWF (Simple Workflow Service)
---
> Old way of orchestrating a big workflow.

* Workflow executions can last up to 1 year.
* SWF presents a task-oriented API, while SQS presents a message-oriented API.
* SWF ensures that a task is assigned only once and ==never duplicated==; with SQS you need to handle duplicated messages and also ensure that a message is processed only once.
* SWF keeps track of all the tasks and events in an application; with SQS you need to implement your own application-level tracking.
* Actors:
    * Workflow starters: applications that can start a workflow.
    * Deciders: control the flow of activity tasks in a workflow execution.
    * Activity workers: carry out activity tasks.
* Domain: a collection of related workflows.
* Amazon Simple Queue Service (SQS) and Amazon Simple Workflow Service (SWF) are the services that you can use for creating a decoupled architecture in AWS. Decoupled architecture is a type of computing architecture that enables computing components or layers to execute independently while still interfacing with each other.
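The SQS and SNS notes above map almost directly onto boto3 calls. The following is a minimal sketch (not from the original notes) of the pub-sub fan-out pattern they describe: an SNS topic delivering into an SQS queue that a worker then long-polls. The queue/topic names, the sample payload, and the 60-second visibility timeout are illustrative assumptions.

```python
"""Minimal boto3 sketch of SNS -> SQS fan-out with a long-polling consumer.

Assumptions (not from the notes): names and payloads are made up, and AWS
credentials plus a default region are already configured for boto3.
"""
import json
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

# Queue with long polling enabled (ReceiveMessageWaitTimeSeconds > 0),
# the default-style 4-day retention and a 60-second visibility timeout.
queue_url = sqs.create_queue(
    QueueName="demo-worker-queue",
    Attributes={
        "ReceiveMessageWaitTimeSeconds": "20",   # long polling
        "MessageRetentionPeriod": "345600",      # 4 days, in seconds
        "VisibilityTimeout": "60",
    },
)["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Topic that fans out to the queue (pub-sub: SNS pushes, SQS is pulled).
topic_arn = sns.create_topic(Name="demo-events")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)

# The queue policy must allow SNS to deliver messages into it.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "Policy": json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }],
        })
    },
)

# Producer side: publish once to the topic; every subscriber gets a copy.
sns.publish(TopicArn=topic_arn, Message=json.dumps({"order_id": 42}))

# Consumer side: long-poll the queue, process, then delete each message.
resp = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
)
for msg in resp.get("Messages", []):
    body = json.loads(msg["Body"])   # SNS envelope; payload is in body["Message"]
    print(body.get("Message"))
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

Because delivery is at-least-once, the consumer should be idempotent or deduplicate on its own (or use a FIFO queue), exactly as the SWF-vs-SQS comparison above points out.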
Amazon MQ (Apache ActiveMQ)
---
* HA Amazon MQ (application migration): active broker / standby broker.
* If you're using messaging with existing applications and want to move your messaging service to the cloud quickly and easily, it is recommended that you consider Amazon MQ.

**EMR**
Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) tool for big data processing and analysis.
* EMR -> core nodes [YARN + HDFS] (1 or more) -> task nodes [YARN only].
* Autoscaling -> task nodes can be autoscaled.
* EMR creation:
    * Bootstrap actions -> run while the cluster is getting launched.
    * Steps -> processing commands.
* Presto -> fast and flexible.

3 V's of Big Data: Velocity, Volume, Variety.

Athena
---
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Serverless (no EMR cluster to manage).

**Athena Federated Queries**
Allow you to query data sources other than S3 buckets using a data connector: MySQL, DynamoDB, Redshift, Timestream.

Features:
* Execute a CREATE TABLE AS SELECT (CTAS) query in Amazon Athena to convert query results into efficient storage formats like Parquet or ORC. These formats are optimized for Athena, leading to reduced data scanning per query, thereby enhancing query performance and lowering costs.
* Amazon Athena's "Reuse Query Results" feature allows users to reuse the last stored query result when re-running a query, enhancing performance and reducing costs.
* Amazon Athena supports User Defined Functions (UDFs), allowing customers to write custom scalar functions and invoke them in SQL queries. UDFs in Athena are defined within an AWS Lambda function as methods in a Java deployment package. This feature is particularly useful for scenarios requiring custom processing, such as categorizing earthquake locations using a geospatial indexing system. UDFs enable the encapsulation of complex logic into reusable functions, thus simplifying SQL queries in Athena.

**QuickSight** -> BI tool.

Kinesis (similar to Kafka)
---
Platform to which streaming data is sent.

Types:
1. Kinesis Streams [shards; no long-term storage]
2. Kinesis Firehose [Firehose + S3]
3. Kinesis Analytics [data analytics on the fly]

* Default data retention is 24 hours, max 7 days.
* Kinesis streams consist of shards (5 transactions per second for reads, up to a max of 2 MB/s per shard).

File data -> S3
Streaming data -> Kinesis

* Copying data to S3 using NFS mounts and the S3 agent / DataSync.
* Using S3DistCp to copy from S3 to HDFS.

Connectivity between on-premises and AWS:
1. VPN
2. AWS Direct Connect
3. Amazon S3 multipart upload
4. AWS Snowball [physical data transfer]

Replicate data in multiple regions so that if one region is down you can still get the data.

* Batch processing vs stream processing.
* Decouple collection and processing.
* Preserve the client ordering.
* Collect multiple streams together.
* Consume in parallel.

* Amazon Kinesis Firehose -> data loader for Kinesis, so that you don't have to configure and program a consumer for Kinesis yourself.
* Instead of using the KCL you can use Spark Streaming on EMR.
* Firehose has at least 60 s of latency, while Kinesis Streams is sub-1 s latency.
* Kinesis Analytics.

Streams -> Kinesis
Table streams -> DynamoDB
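To make the streams/shards bullets concrete, here is a minimal boto3 sketch (not from the original notes) of a producer writing records and a consumer reading one shard. The stream name `demo-stream`, the payloads, and the partition-key scheme are made-up assumptions; a production consumer would more likely use the KCL, Firehose, or Spark Streaming as noted above.

```python
"""Minimal boto3 sketch of writing to and reading from a Kinesis data stream.

Assumption (not from the notes): a stream named "demo-stream" with at least
one shard already exists, and credentials/region are configured.
"""
import json
import time
import boto3

kinesis = boto3.client("kinesis")
stream = "demo-stream"   # hypothetical stream name

# Producer: records with the same partition key land on the same shard,
# which is how per-client ordering is preserved.
for i in range(5):
    kinesis.put_record(
        StreamName=stream,
        Data=json.dumps({"event_id": i, "ts": time.time()}).encode(),
        PartitionKey=f"client-{i % 2}",
    )

# Consumer: iterate one shard from its oldest retained record
# (retention is 24 h by default, up to the stream's configured maximum).
shard_id = kinesis.describe_stream(StreamName=stream)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in resp["Records"]:
    print(record["PartitionKey"], json.loads(record["Data"]))
```

Firehose removes the consumer side entirely by buffering and delivering the stream straight to S3, which is where the roughly 60-second latency noted above comes from.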
**Amazon Neptune** is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Amazon Neptune is highly available, with read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across Availability Zones. Neo4j alternative.

**EFS** - shared file storage. Network file storage for EC2 instances.

**HBase / Cassandra** -> columnar data stores; sharded.

**Redshift** - data warehouse.

S3 Intelligent-Tiering moves data between different S3 storage tiers.

Amazon Elasticsearch Service is a managed service for Elasticsearch, a popular open-source search and analytics engine for big data use cases such as log and clickstream analysis.

**GraphX** is Apache Spark's API for graphs and graph-parallel computation.

Smaller objects -> DynamoDB; bigger objects -> S3, because DynamoDB has a higher storage cost.

S3 as a data lake.

Athena: time taken to execute the same query as the data is progressively optimized
* Run time: 18.12 seconds, data scanned: 90.74 GB
* Run time: 19.44 seconds, data scanned: 13.08 GB
* Partitioned compressed data: run time 6.02 seconds, data scanned 0 KB (creating the partitions: run time 19.69 seconds, data scanned 432.46 MB)
* Parquet: run time 3.03 seconds, data scanned 1.97 MB

* Master node on dedicated (on-demand) EMR instances, while task nodes can be on Spot pricing.
* Auto scaling rules based on YARN metrics, e.g. scale if available memory is less than 15% for a 5-minute period.

`hive -d SAMPLE=s3://aws-tc-largeobjects/AWS-200-BIG/v3.1/lab-4-hive/data -d DAY=2009-04-13 -d HOUR=08 -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 -d OUTPUT=s3://hive-bucket-123/output/`

File sizes -> splittable (Parquet + Snappy, 2 GB-4 GB) > CSV/JSON (non-splittable). Use Kinesis Firehose to combine the files if they are small (1 MB-128 MB).

Hive (reliability + more features) < Spark SQL < Presto (less reliable) < Redshift (highest performance); defining the schema.

* Apache Zeppelin: Jupyter-style notebook with Spark integration.
* JupyterHub -> Jupyter notebook applications / SageMaker.
* Hue -> web UI for Hadoop; browser based; facilitates collaboration.
* Ganglia (individual workers) vs CloudWatch (overall metrics) / Hadoop UI monitoring.

**Amazon Redshift** is an Internet hosting service and data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services.
* Data warehousing tool.
* Business intelligence tool.
* Only one Availability Zone.

Glue
---
Serverless, fully managed ETL / Glue catalog in Athena. Makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.

**X-Ray** -> debug Lambda functions.

CloudFront
---
* Content distribution; static content cache (CDN).
* An edge location is separate from an AZ; it is the location where content is cached.
* Origin: the origin of all the files that the CDN will distribute; this can be an S3 bucket / EC2 instance / Elastic Load Balancer / Route 53.
* Edge locations are not read only.
* Time to Live (TTL).

**Lambda@Edge** is a feature of Amazon CloudFront that lets you run code closer to users of your application, which improves performance and reduces latency. With Lambda@Edge, you don't have to provision or manage infrastructure in multiple locations around the world. You pay only for the compute time you consume - there is no charge when your code is not running.
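To illustrate the Lambda@Edge note above, here is a minimal sketch (not from the original notes) of a viewer-request handler in Python. The `/old/` -> `/new/` redirect and the `X-Edge-Processed` header are hypothetical examples; the function would have to be deployed in us-east-1 and attached to a CloudFront distribution's viewer-request event.

```python
"""Minimal sketch of a Lambda@Edge viewer-request handler (Python runtime).

Assumptions (not from the notes): the header name and the redirect paths
are made up; replace them with whatever your distribution needs.
"""

def lambda_handler(event, context):
    # CloudFront passes the request in under Records[0].cf.request.
    request = event["Records"][0]["cf"]["request"]
    headers = request["headers"]

    # Example: redirect legacy paths at the edge, before reaching the origin.
    if request["uri"].startswith("/old/"):
        return {
            "status": "301",
            "statusDescription": "Moved Permanently",
            "headers": {
                "location": [{
                    "key": "Location",
                    "value": request["uri"].replace("/old/", "/new/", 1),
                }],
            },
        }

    # Otherwise tag the request and let CloudFront continue to the cache/origin.
    headers["x-edge-processed"] = [{"key": "X-Edge-Processed", "value": "true"}]
    return request
```

Returning a response object short-circuits the request at the edge; returning the (possibly modified) request lets CloudFront continue to the cache or origin.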
AWS Storage Gateway
---
> AWS Storage Gateway is a hybrid cloud storage service that gives you on-premises access to virtually unlimited cloud storage. The AWS Storage Gateway Hardware Appliance is a physical, standalone, validated server configuration for on-premises deployments.

Transfer data between an in-house data center and AWS.

1. File Gateway
    * Gets your data into Amazon S3 for use with other AWS services.
    * Uses standard storage protocols such as NFS and SMB.
    * Provides a file-based interface to S3.
![File Gateway](https://i.imgur.com/w4B5Suj.png)
2. Tape Gateway
    * Tape Gateway enables you to replace physical tapes on premises with virtual tapes in AWS without changing existing backup workflows, for long-term storage retention needs.
![Tape Gateway](https://i.imgur.com/v1oHhOS.png)
3. Cached Volume Gateway / Volume Gateway
    * Volume Gateway offers cloud-backed storage to your on-premises applications using industry-standard iSCSI connectivity.
    * In cached Volume Gateway mode, your primary data is stored in Amazon S3, while your frequently accessed data is retained locally in the cache for low-latency access.
![Volume Gateway](https://i.imgur.com/h2sg033.png)

**Rekognition**
Deep-learning-based visual analysis service.

**Step Functions**
* State machine - workflow template
* Task - Lambda function
* Activity - handle for external compute
(a minimal boto3 sketch appears at the end of these notes)

Resources
* https://acloud.guru/forums/aws-certified-solutions-architect-associate/discussion/-LZipqkEptfvidSZs_5i/passed_aws_ssa%20%20Follow%20Responses
* https://d1.awsstatic.com/whitepapers/AWS_Cloud_Best_Practices.pdf
* https://www.aws.training/learningobject/curriculum?id=20685
* https://gist.github.com/leonardofed/bbf6459ad154ad5215d354f3825435dc
* https://blog.newrelic.com/engineering/aws-certified-solutions-architect-associate-exam-prep/
* AWS Console Recorder Chrome plugin: generates scripts for actions performed in the UI.

### AWS GuardDuty
Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts, workloads, and data stored in Amazon S3. CloudWatch Events can be used along with Lambda to act on the events it raises.

### AWS Inspector
> Automated security assessment service to help improve the security and compliance of applications deployed on AWS.

> ARN format: `arn:partition:service:region:account_id:resource`
* partition: aws | aws-cn
* service: s3 | ec2 | rds
* region: us-east-1 | eu-central-1
* account_id

**AWS Control Tower**
Service Control Policies, called guardrails, control all users in one place, e.g. block all SSH connections.

**Systems Manager**
> You can use it to view and control your infrastructure on AWS.
> You can view operational data from multiple AWS services and automate operational tasks across your AWS resources.

**Elastic Transcoder**
1. Media transcoder in the cloud.
2. Converts media files from their original source format to different formats that will play on smartphones, tablets, PCs, etc.
3. Provides transcoding presets for popular formats.

Pipeline -> Preset -> Job

AWS WorkDocs
---
Amazon WorkDocs is a fully managed, secure content creation, storage, and collaboration service. With Amazon WorkDocs, you can easily create, edit, and share content, and because it's stored centrally on AWS, access it from anywhere on any device.
- Migrate on-premises file servers to the cloud and reduce costs.

![](https://i.imgur.com/o60LePT.png)
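Referring back to the Step Functions bullets above, here is a minimal boto3 sketch (not from the original notes) of a state machine with a single Lambda task. The Lambda ARN, IAM role ARN, and names are placeholders you would replace with real resources.

```python
"""Minimal boto3 sketch for the Step Functions section: a two-state state
machine (a Lambda task followed by success) defined in Amazon States
Language, then started.

Assumptions (not from the notes): all ARNs and names below are placeholders.
"""
import json
import boto3

sfn = boto3.client("stepfunctions")

# The workflow template itself (ASL): one Task state that invokes a Lambda
# function, then the execution succeeds.
definition = {
    "Comment": "Hypothetical order-processing workflow",
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-order",
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}

state_machine_arn = sfn.create_state_machine(
    name="demo-order-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/demo-stepfunctions-role",
)["stateMachineArn"]

# Start one execution; Step Functions records every state transition,
# which is the tracking the earlier SWF-vs-SQS comparison mentions.
sfn.start_execution(
    stateMachineArn=state_machine_arn,
    input=json.dumps({"order_id": 42}),
)
```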
{"metaMigratedAt":"2023-06-14T22:09:10.891Z","metaMigratedFrom":"YAML","title":"AWS Services","breaks":true,"description":"aws config","contributors":"[{\"id\":\"bdab001c-f0b9-483e-8787-8537e3a95eb9\",\"add\":18865,\"del\":5274}]"}