Performing operations as code (provision instances, deployments, document generation, etc).
Users, identity management, securing our databases, firewalls, DDoS protection: all of it is very important and must be handled with care.
Handling more than the expected load, elegantly, with minimal downtime: Kubernetes, DynamoDB, RDS, Aurora, S3, Route53, etc. help you scale seamlessly while still delivering A+ grade service to your user base.
Instance tenancy, decommissioning instances during non-dev time and using the right services all help control the cost of running your applications on the cloud.

Perform Operations As Code (CloudFormation/Terraform)
Most projects are deployed with several environment setups: one for the development team (dev env), one for the testing team (test env), a beta env for experimentation and, most importantly, the live/production env that real users interact with.
Performing operations as code refers to automating the setup and deployment of these environments. Tools like HashiCorp Terraform and AWS CloudFormation help automate infrastructure provisioning. So instead of starting instances and installing dependencies by hand, automate the process to make deployments easier and faster.
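As a minimal illustration of scripted provisioning (a boto3 sketch rather than an actual Terraform/CloudFormation template; the AMI ID, tag values and user-data script below are placeholders), assuming AWS credentials are already configured:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# User data runs on first boot, so dependencies are installed automatically
# instead of by hand over SSH. The script below is a placeholder.
user_data = """#!/bin/bash
yum update -y
yum install -y git
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Environment", "Value": "dev"}],
    }],
)
print(response["Instances"][0]["InstanceId"])
```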
Automate your documentation for your services
When teams depend on documentation, say a frontend team that needs API documentation to integrate with your backend, then rather than asking your deployment team to rebuild and redeploy the docs every time the API changes, use your cloud provider's automation tools to make it happen. Jenkins, AWS CodeBuild and CodeDeploy help in this regard.
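For example, a docs pipeline could be kicked off programmatically; a small boto3 sketch, assuming a CodeBuild project (hypothetically named api-docs-build here) already builds and publishes the docs:

```python
import boto3

codebuild = boto3.client("codebuild", region_name="us-east-1")

# Trigger the (hypothetical) docs build; in practice this would usually be
# wired to a webhook or a CodePipeline stage rather than run by hand.
build = codebuild.start_build(projectName="api-docs-build")
print(build["build"]["id"], build["build"]["buildStatus"])
```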
Always make frequent, small changes
Another thing you want to avoid is rolling out a huge feature set in one go. What if you released ten new features only to find out that the existing fifty are broken because of one of the new ones?
Plus, if deployments are frequent and small, you can easily roll back to the previous version without making a huge mess for your user base in case of a calamity.
Refine ops by adopting new features and tools
Always keep an open mind towards new tools and new features in existing ones. For example, if you provision your infrastructure with AWS CloudFormation, it works well within AWS but comes with vendor lock-in, meaning you can't use it with other cloud providers; HashiCorp Terraform, on the other hand, is multi-cloud by nature.
Anticipate Failure (TDD, pre-mortem exercises, etc)
Another benefit of automated ops is that you can provision a throwaway environment for TDD or a pre-mortem exercise.
For example, you might run a small, controlled DDoS test against your app just to see how your infrastructure and application react to it.
Anticipating failures keeps you in a positive mindset towards future bugs, attacks and issues: you'll be prepared if such a situation arises.
Set up notifications, monitoring and dashboards for your applications
Remember the last time your app crashed, started receiving negative reviews from your users, and you had no idea what was wrong?
Continuous monitoring of your app, checking error logs and setting up notifications for an app crash or an attack on your servers help you react to such events much faster.
X-Ray, CloudWatch, CloudFormation, CloudTrail and VPC Flow Logs help with Operational Excellence on AWS.
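As a minimal sketch of that kind of alerting, a CloudWatch alarm on one instance's CPU that notifies an SNS topic (the topic ARN and instance ID are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when average CPU on one instance stays above 80% for 10 minutes,
# then notify a (placeholder) SNS topic that pages the on-call channel.
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-web-server",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```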
Strong Identity Management (IAM, Federated, OAuth)
Delegating access to your employees is very important, especially when it comes to sharing access to critical resources like your S3 buckets or your production environment. A proper IAM setup, the use of IAM groups for policy management and enabling federated/OAuth login into AWS all matter a great deal.
Enabling MFA on your root account, and not using the root account for day-to-day operations on AWS, is also considered good practice.
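A minimal sketch of group-based policy management with boto3 (the group and user names are placeholders; ReadOnlyAccess is an AWS managed policy):

```python
import boto3

iam = boto3.client("iam")

# Attach policies to groups, not to individual users, then add users to groups.
iam.create_group(GroupName="developers")
iam.attach_group_policy(
    GroupName="developers",
    PolicyArn="arn:aws:iam::aws:policy/ReadOnlyAccess",
)
iam.add_user_to_group(GroupName="developers", UserName="alice")
```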
Enable Traceability (Monitoring, Alerts, Notifications)
Again, alerts and notifications help you respond faster to security-related issues, and monitoring lets you keep track of what's going on with your infrastructure.
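CloudTrail is the usual starting point for traceability on AWS; a rough boto3 sketch, assuming the target S3 bucket already exists and carries the bucket policy CloudTrail requires:

```python
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

# Record API activity across all regions into an (assumed pre-existing) bucket.
cloudtrail.create_trail(
    Name="account-audit-trail",
    S3BucketName="my-audit-logs-bucket",   # placeholder bucket name
    IsMultiRegionTrail=True,
)
cloudtrail.start_logging(Name="account-audit-trail")
```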
Managing App Secrets (.env, AWS Secrets Manager, Hashicorp Vault)
Another big security concern is poor management of your secrets. Backend services need some way to authenticate themselves against the database to perform database operations, and a common way to store those credentials is a .env file in the project's root directory, which is of course not the best practice we can follow.
Using tools like AWS Secrets Manager with AWS KMS, or HashiCorp Vault, is among the highest industry standards for secrets management.
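A minimal sketch of reading a database credential from Secrets Manager at runtime instead of from a .env file (the secret name and its keys are placeholders):

```python
import json
import boto3

secrets = boto3.client("secretsmanager", region_name="us-east-1")

# Fetch the secret at startup; nothing sensitive lives in the repo or on disk.
response = secrets.get_secret_value(SecretId="prod/my-app/db")  # placeholder name
credentials = json.loads(response["SecretString"])
db_user, db_password = credentials["username"], credentials["password"]
```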
Security At Every Step (VPC -> Subnet -> NACL -> Security Groups)
Cloud providers like AWS let you enable secure firewalls at different levels of your account. You can put firewalls on your VPC, restrict routing to the open internet, and control what your instances can see and who can reach them from the open internet. Using these services in an optimal way results in higher security for your application.
Security As Code (Automated Rules, Templates)
Another great feature of templating engines like Terraform or CloudFormation is that you can describe security group rules, or even a VPC's NACLs, inside the template, so that any instance you launch gets those security rules applied automatically.
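Terraform and CloudFormation express this declaratively; purely as an illustration of the same rules written as code, a boto3 sketch that creates a security group allowing only inbound HTTPS (the VPC ID is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Security rules live in code and code review, not in ad-hoc console clicks.
group = ec2.create_security_group(
    GroupName="web-https-only",
    Description="Allow inbound HTTPS only",
    VpcId="vpc-0123456789abcdef0",   # placeholder VPC ID
)
ec2.authorize_security_group_ingress(
    GroupId=group["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)
```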
Protect data at rest and in transit (BCrypt /w salts and SSL)
Tools like Vault store your secrets encrypted at rest and in transit. On top of that, you can encrypt your EBS volumes for extra security.
Always use SSL for your frontend and backend; Let's Encrypt gives you free SSL certificates.
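For passwords specifically, hashing with a salted, slow algorithm like bcrypt is the point; a minimal sketch, assuming the Python bcrypt package is installed:

```python
import bcrypt

# Hash on signup: gensalt() embeds a random salt and a configurable work factor.
password = "correct horse battery staple".encode("utf-8")
hashed = bcrypt.hashpw(password, bcrypt.gensalt(rounds=12))

# Verify on login: checkpw re-derives the hash using the stored salt.
assert bcrypt.checkpw(password, hashed)
assert not bcrypt.checkpw(b"wrong password", hashed)
```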
Isolation of data from people/devs
Another security measure you'll want to take is isolating your production environment from your development team. Your teams should not have direct access to production servers unless it's required.
Prepare Beforehand
The internet is always open to vulnerabilities. No matter how hard you try, you can never be 100% secure, so keep sharpening your knowledge of security-related events and be prepared with enough tactics in case of an emergency.
IAM, CloudTrail, CloudFormation, ELB, KMS, Custom VPCs, WAF, CloudWatch
Test Recovery Procedures Regularly
If you're using enough automation, recovering your application after a disaster is fairly easy. Taking regular snapshots of your databases and volumes, Multi-AZ deployments, etc. through automated scripts is a necessity.
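A minimal sketch of an automated EBS snapshot (the volume ID is a placeholder); in practice this would run on a schedule, e.g. from a cron job or a scheduled Lambda:

```python
import boto3
from datetime import datetime, timezone

ec2 = boto3.client("ec2", region_name="us-east-1")

# Snapshot a data volume and tag it so old snapshots can be pruned later.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",   # placeholder volume ID
    Description=f"nightly backup {datetime.now(timezone.utc):%Y-%m-%d}",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [{"Key": "backup", "Value": "nightly"}],
    }],
)
print(snapshot["SnapshotId"])
```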
Auto Recover From Failure (Auto Scaling Groups)
All the major cloud providers can scale out and scale in based on the load your servers receive. If you face an outage or run short of CPU, memory or storage, ASGs can help you in this regard.
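A rough sketch of an Auto Scaling group with a target-tracking policy via boto3 (the launch template name, subnet IDs and target value are placeholders):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Keep between 2 and 10 instances, tracking ~60% average CPU across the group.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchTemplate={"LaunchTemplateName": "web-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",   # placeholder subnets
)
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 60.0,
    },
)
```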
Prefer Horizontal Scaling
Horizontal scaling is about distributing your workload across multiple machines, which also helps you avoid a single point of failure.
Enable Service Discovery (Consul, ELB target groups)
Service discovery is the cloud-native approach to deployments: instances can be terminated or started, and your load balancer will still route traffic to the healthy ones without any manual intervention.
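With ELB, that usually means registering instances into a target group; a minimal boto3 sketch (the target group ARN and instance ID are placeholders; with an ASG this registration happens automatically):

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Register a new instance with the load balancer's target group; health checks
# decide when it actually starts receiving traffic.
elbv2.register_targets(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                   "targetgroup/web/0123456789abcdef",   # placeholder ARN
    Targets=[{"Id": "i-0123456789abcdef0"}],
)
```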
Stop Guessing Capacity
Gone are the days when you had to predict your user count and buy servers before deploying your application. With services like AWS Lambda, Azure Functions, Cloud Functions on GCP and ASGs, you can start with smaller instances and let your cloud provider manage the scaling for you.
Introduce Automation
If there's one takeaway from this report, it's that automation is mission critical. Start with small things and grow into a fully automated pipeline.
N/W limits, Resource Limits, Licenses for various products, etc
All cloud providers put default limits on the resources available to you, e.g. EC2 limits on AWS or compute limits on GCP. So while using automation like ASGs, keep track of what your limits are and request limit increases ahead of time.
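On AWS you can check those limits programmatically with the Service Quotas API; a small sketch that prints a few EC2 quotas, assuming the caller has the relevant servicequotas permissions:

```python
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

# List current EC2 quotas so automation can warn before an ASG hits a ceiling.
response = quotas.list_service_quotas(ServiceCode="ec2", MaxResults=20)
for quota in response["Quotas"]:
    print(quota["QuotaName"], quota["Value"])
```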
DB Backups, Mean Time To Recovery (MTTR), Recovery Point Objective (RPO), Multi-AZ Setup
Taking regular backups of your production databases is very important. So is knowing how long a backup takes and where it is stored (say, S3 on AWS), and what the restore time from an existing backup is.
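A minimal sketch of a manual RDS snapshot via boto3 (the instance and snapshot identifiers are placeholders); restore drills would then restore from such a snapshot into a fresh instance and time the result:

```python
import boto3
from datetime import datetime, timezone

rds = boto3.client("rds", region_name="us-east-1")

# Take a manual snapshot of the production database instance.
stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M")
rds.create_db_snapshot(
    DBInstanceIdentifier="prod-db",              # placeholder instance name
    DBSnapshotIdentifier=f"prod-db-{stamp}",
)
```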
VPC, IAM, CloudTrail, Config, CloudWatch, S3, Glacier, Auto Scaling Groups, CloudFormation, Shield
Democratize Technologies (NoSQL -> DynamoDB, SQL -> Aurora/RDS, AI/ML/DL -> SageMaker)
Focus on the technology rather than the tool. For example, if you want a SQL database, then rather than installing MySQL/PostgreSQL/SQL Server on an instance and running it yourself, use one of the managed AWS databases directly.
They are optimized for your database engine, and their scaling is managed by your cloud provider.
Global Presence /w minimal effort
Databases like DynamoDB and Aurora are managed by AWS and are fault tolerant by default. Plus, since they're managed databases, they can be made globally available with minimal latency.
Use Serverless (Lambda, S3 Web Hosting, RDS, ELB, Aurora)
Use as many managed services as you can. Their security patches, scaling, high availability and fault tolerance are all handled by your cloud provider, and they give you some DDoS protection as well.
Mechanical Sympathy (right instance type, e.g. i3 instances for better IOPS, good for MongoDB)
Always research which kind of resource is the best fit for your application. Cloud providers give you a multitude of options when selecting a resource; for compute on AWS alone there are 7-8 major categories of EC2 instances, plus Lambda, containers and more. Picking the one that is the right fit for your app matters.
If you want to host a MongoDB cluster on AWS, consider instances of the i3 class, which are optimized for high IOPS. Since MongoDB is a document-based NoSQL database (it stores data in BSON), it performs much better on i3 than on something like t2 or m4, which are general purpose instances.
Experiment Often
Living on the edge of experimentation when it comes to deployments can also help, such as using Envoy proxy instead of NGINX because it has much better support for binary protocols and HTTP/2, keeping performance a priority.
Select your compute (Instances, Containers, Functions)
AWS, GCP, Azure and DigitalOcean all provide different options for compute, like plain EC2-style instances, Kubernetes, containers or even lightweight LXC containers, so always select the one that is optimal for your performance needs and your budget.
Select required Storage type (HDD, SSD, SSD /w P-IOPS)
Selecting the correct storage type is also very important. For a database cluster on EC2 instances, you don't want to use HDD or even general purpose SSD; for that kind of workload you'd want provisioned IOPS (P-IOPS) storage so performance stays consistent.
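A minimal sketch of creating a provisioned-IOPS EBS volume with boto3 (the availability zone, size and IOPS values are placeholders to tune for the workload):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# io1 volumes let you provision IOPS explicitly instead of relying on gp2 bursts.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,            # GiB, placeholder
    VolumeType="io1",
    Iops=10000,          # provisioned IOPS, placeholder
)
print(volume["VolumeId"])
```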
Monitoring is Critical
Again, monitoring is critical: it will tell you whether the resources you selected are performing the way you expected them to.
Adopt a Consumption Model (dev/test envs can be turned off during non-working hours)
Most of the time, development instances are only needed during working hours (9AM-8PM), so you can turn them off (STOP) while they're not being used.
Instances in the stopped state do not lose their data, since their associated EBS volumes are not released until the instances are terminated.
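A small sketch of stopping every running instance tagged as dev (the tag key and value are placeholders); this would typically run on a schedule via cron or a scheduled Lambda, with a mirror script to start them again in the morning:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running instances tagged Environment=dev and stop them for the night.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Environment", "Values": ["dev"]},   # placeholder tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]
if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print("Stopped:", instance_ids)
```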
Monitor and Measure Efficiency
Do extensive monitoring, and if you find spare compute, use it for other workloads or change your instance type to cut costs.
Have as few data centers as possible
Try to migrate all of your physical servers to your cloud provider. You can use ASGs to avoid paying for oversized compute.
Analyze expenditure, use Budget reports
AWS provides an extensive billing, budgeting and reporting system with which you can set up daily reminders and keep your bills under control.
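Spend can also be pulled programmatically for dashboards or daily reports; a minimal sketch with the Cost Explorer API (the dates are placeholders, and Cost Explorer must be enabled on the account):

```python
import boto3

ce = boto3.client("ce", region_name="us-east-1")

# Daily unblended cost for a (placeholder) date range, e.g. to post to Slack.
report = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-01-31"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
)
for day in report["ResultsByTime"]:
    amount = day["Total"]["UnblendedCost"]["Amount"]
    print(day["TimePeriod"]["Start"], f"${float(amount):.2f}")
```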
Use more managed services
Use as many managed services as possible. Services like Aurora scale automatically based on load, security patches are applied automatically by AWS, backups are built in, latency is in the single-digit milliseconds and it's globally scalable. Just like Aurora, AWS is full of managed services, so try to take advantage of their feature sets.