Module 9: Cloud Architecture

Section 1: AWS Well-Archtitected Framework

Architecture: designing and building

Architecture is the art and science of designing and building large structure.

manage size and complexity
identify business goals
capabilities needing improvement
alignement between tech deliverables of a solution and business goals
work with delivery team

What is the AWS Well-Architected Framework ?

A guide for designing infrastructure that are
- Secure
- High-performing
- Resilient
- Efficient
A consistent approach to evaluating and implementing cloud architectures
A way to provide best practices that were developed through lessons learned by reviewing customer architectures

Pillars of the AWS Well-Architected Framework

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Pillar organization

Best practice area

Identity and Access Management

Question text

SEC 1: How do you manage credentials ad authentication ?

Question context

Credential and authentication mechanisms include password, tokens and keys granting access directly or inderectly in your workload. Protect credentials with appropriate mechanisms to help reduce the risk of accidental or malicious use.

Best practices

Define requirements of identity and access management
Secure AWS account root user
Enforce use of multi-factor authentication
Automate enforcement of access controls
Integrate with centralized federation provider
Enforce password requirements
Rotate credentials regularly
Audit credentials periodically

Section 2: Operational Excellence Pillar

Focus
- Run and monitor systems to deliver business value, and to continually improve supporting processes and procedures
Key topics
- Managing and automating changes
- Responding to events
- Defining standards to successfully manage daily operations

Operational excellence design principles

Perform operations as code
- Define entire workload (app and infra)
- Limit human error
- Consistent responses to event
Annotate documentation
- automating the creation of annoted doc
- input to your operations as code
Make frequent, small, reversible changes
- components updated regularly
Refine operations procedures frequently
- opportunities to improve procedures
Anticipate failure
- potential sources of failure
Learn from all operational events failures
- share what is learned

Operational excellence questions

Prepare
- How do your determine what your priorties are ?
- How do you design your workload so that you can understand its state ?
- How do you reduce defects, ease, remediation and improve flow into production ?
- How do you mitigate deployement risks ?
- How do you know that you are ready to support a workload ?
Operate
- How do you understand the health of workload ?
- How do you manage workload and operation events ?
Evolve
- How do you evolve operations ?

Section 3: Security Pillar

Focus
- Protecte info, systemes, and assets while delivering business value through risk assessments and mitigation strategies
Key topics
- Identifying and managing who can do what
- Establishing controls to detect security events
- Protecting systems and services
- Protecting confientiality and integrity of data

Security design principles

Implement a strong identity foundation
- principle of least privileges
- separation of duties
Enable traceability
- monitor, alert and audit actions
- integrate logs and metrics to automatically respond and take action
Apply security to all layers
- defense-in-depth
- security controls to all layers of your architecture
Automate security best practices
- improve ability to securely scale more rapidly and cost effectively
Protect data in transit and at rest
- classify data into sensitivity levels
- use mechanisms such as encryption, tokenization and access control
Keep people away from data
- reduce risk of loss or modif of sensitive data due to human error
- create tools to reduce manual processing of data
Prepare for security events
- incident management process

Security questions

Identity and access management
- How do you manage credentials and authentication ?
- How od you control human access ?
- Ho do you control programmatic access ?
Detective controls
- How do you detect and investigate security events ?
- How do you defend against emerging security threats ?
Infrastructure protection
- How do you protect your networks ?
- How do you protect your compute resources ?
Data protection
- How do you classify your data ?
- How do you protect your data at rest ?
- How do you protect your data in transit ?
Incident response
- How od you respond to an incident ?

Section 4: Reliabality Pillar

Focus
- Prevent and quickly recover from failures to meet business and customer demand
Key topics
- Setting up
- Cross-project requirements
- Recovery planning
- Handling change

Reliability design principles

Test recovery procedures
- test how you system fails
- validate recovery procedures
- expore failure pathways
Automatically recover from failure
- monitor systems for key performance indicator
- configure system to trigger an automated recovery when a threshold is breached
Scale horizontally to increase aggregate system availability
- replace one large resources with multiple smaller one
- reduce impact of a single point of failure
Stop guessing capacity
- monitor demand and system usage
- automate the addition or removal of resources to maintain the optimal level
Manage change in automation
- use automation to make changes to infra
- manage changes to automation

Reliability question

Foundations
- How do you manage service limits ?
- How do you manage your network topology ?
Change management
- How does your system adapt to changes in demand ?
- How do you monitor your resources ?
- How do you implement change ?
Failure management
- How do you back up data ?
- How does your system withstand component failure ?
- How do you test resilience ?
- How do you plan for disaster recovery ?

Section 5: Performance Efficiency Pillar

Focus
- Use IT and computing resources efficiently to meet system requirements and to maintain that efficiency as demand changes and technologies evolve
Key topics
- Selecing the right resource types and sizes based on workload requirements
- Monitoring performances
- Making informed decisions to maintain efficiency as business needs evolve

Performance efficiency design principles

Democratize advanced technologies
- consume tech as a service
- focus on product dev
Go global in minutes
- deploy systems in multiple AZ
Use serverless architectures
- remove operational burden of maintaining servers
- reduce costs
Experiment more often
- comparative testing
Have mechanical sympathy
- use tech approach aligning best with what you want to achieve

Performance efficiency question

Selection
- How do you select the best performing architecture ?
- How do you select your compute solution ?
- How do you select your storage solution ?
- How do you select your db solution ?
- How do you select your networking solution ?
Review
- How do you evolve your workload to rake advantageof new releases ?
Monitoring
- How do your monitor your resources to ensure they are performing as expected ?
Tradeoffs
- How do you use tradeoffs to improve performance ?

Section 6: Cost Optimization Pillar

Focus
- Run systems to deliver business value at the lowest price point
Key topics
- Understanding and controlling when money is being spent
- Selecting the most appropriate and right number of resource types
- Analyzing spending over time
- Scaling to meeting business needs without overspending

Cost optimization design principles

Adopt a consumption mode
- pay only for the computing resources required
Measure overall efficiency
- measure business output
- cost associated
- measure the gain you are making
Stop spending money on data centers operations
- AWS does the heavy-lifting
Analyze and attribute expenditure
- accuratly identify system usage and cost
Use managed and application-level services to reduce cost of ownership
- reduce operational burden of maintaining servers for tasks
- lower cost per transaction

Cost optimization questions

Expenditure awareness
- How do you govern usage ?
- How do you monitor usage and cost ?
- How od you decommission resources ?
Cost-effective resources
- How do you evaluate cost when you select services ?
- How do you meet cost target when you select resource type and size ?
- How do you you use pricing models to reduce cost ?
- How do you plan for data transfer changes ?
Matching supply and demand
- How do you match supply of resources with demand ?
Optimizing over time
- How do you evaluate new services ?

Section 7: Reliability and Availability

"Everything fails, all the time" - Werner Vogels, CTO, Amazon.com

Reliability

A measure of your system's ability to provide functionality when desired by the user
System includes: hardware, software, firmware
Probability that your entire system will function as intended for a specified period
Mean time between failures (MTBF) = total time in service/number of failures

Understanding reliability metrics

Availability

Normal operation time / total time
A percentage of uptime (ex: 99.9%) over time (ex: 1y)
Number of 9s - 5 9s means 99.999% availability

High availability

System can withstand some measure of degradation while still remaining available
Downtime is minimized
Minimal human intervention is required

Availability tiers

Factors that influenc availability

Fault tolerance
- The built-in redundancy of an app compononents and its ability to remain operational
Scalability
- The ability of an app to accomodate increases in capacity needs without changing design
Recoverability
- The process, policies, and procedures that are related to restoring service after a catastrophic event

Section 8: AWS Trusted Advisor

Online tool that provides real-time guidnace to help you provision your resources following AWS best practices
Looks at your entire AWS enivronment and gives you a real time recommendations in
- Cost Optimization
  - eliminating unused resources
  - commitment to reserve capacity
- Performance
  - checking service limit
  - service throughput
- Security
  - improve security of the app by identifying gaps
  - permissions
  - increase availability and redundancy
- Fault Tolerance
  - service usage more than 80% of the service limit
  - values based on snapshot

Wrap-up

A SysOps engineer working at a company wants to protect their data in transit and at rest. What service could they use to protect their data ?

Elastic Load Balancibg
Amazon Elastic Block Stor (Amazon EBS)
Amazon Simple Storage Service (Amazon S3)
All of the above

Answer

Keywors:

protect their data in transit and at rest

Answer: 4.