# GCP
# Projects
are the base for all resources, similar to AWS accounts
Project ID :- globally unique, chosen by us, immutable
Project name :- not unique, chosen by us, mutable
Project number :- globally unique, chosen by Google, immutable
Projects can be grouped into folders, similar to AWS Organizations
An Organization root node must be created before folders can be used
`gcloud compute project-info describe --project <Project_id>`
`gcloud projects describe <Project_id>`
> Use both these commands to get project related data
# IAM roles :-
Google has 7 types of members :-
1. Google account
2. Service account -> used for service-to-service interactions. They are identified by an email ID, and they have their own IAM policies.
3. Google group
4. Google workspace domain
5. Cloud Identity domain -> user accounts managed in Cloud Identity without access to Google Workspace (G Suite) services
6. All authenticated users
7. All Users
IAM roles can be classified into 3 types :-
* Primitive (now called basic) roles
* Predefined roles
* Custom roles
You must have the `iam.roles.create` permission to create a custom role.
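A minimal sketch of creating a custom role with gcloud (the role ID, project, title, and permission list are placeholders chosen for illustration) :-
```
gcloud iam roles create instanceViewer \
  --project=my-project \
  --title="Instance Viewer" \
  --permissions=compute.instances.get,compute.instances.list \
  --stage=GA
```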
1. A less restrictive parent policy overrides a more restrictive child policy
1. Service accounts use keys instead of passwords
1. Use gsutil instead of gcloud for all storage-related actions. gsutil uses a Linux-style command structure
1. Use bq for running BigQuery queries and other BigQuery commands
**IAP (Identity Aware Proxy)**
lets you establish a central authorization layer for applications accessed by HTTPS, so you can use an application-level access control model instead of relying on network-level firewalls
* IAP works with signed headers or the App Engine standard environment Users API to secure your app.
* When you grant a user access to an application or resource through IAP, they're subject to the fine-grained access controls implemented by the product in use, without requiring a VPN
* IAP TCP forwarding enables you to establish an encrypted tunnel over which you can forward SSH connections to VMs. When you connect to a VM that uses IAP, IAP wraps the SSH connection inside HTTPS before forwarding the connection to the VM. Then, IAP checks whether you have the required IAM permissions and, if you do, grants access to the VM.
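Hedged sketch of SSHing to a VM that has no external IP via IAP TCP forwarding (instance and zone names are placeholders) :-
```
gcloud compute ssh my-internal-vm \
  --zone=us-central1-a \
  --tunnel-through-iap
```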
**Workload Identity**
Workload Identity allows workloads in your **GKE clusters** to impersonate Identity and Access Management (IAM) service accounts to access Google Cloud services.
**Workload Identity Federation**
Can be used by 3rd-party identity providers (e.g. AWS, Azure, etc.) to impersonate a Google service account, and thus helps to **avoid sharing service account keys**
# Compute Engine :-
***Preemptible VMs*** last at most 24 hours and can be terminated at any time by Google during a resource crunch. Spark jobs and batch-processing applications are good use cases
***Instance template*** :- contains the info required to build a VM for a deployment. It may include a startup script
***Shielded VMs*** :- hardened VMs with a set of security controls (Secure Boot, vTPM, integrity monitoring) that defend against rootkits and bootkits
***Sole-tenant nodes*** are VMs running on a physically separate Compute Engine server, i.e. dedicated hardware
Before provisioning VMs on sole-tenant nodes, you must create a sole-tenant node group. A **node group** is a homogeneous set of sole-tenant nodes in a specific zone. Node groups can contain multiple VMs running on machine types of various sizes, as long as the machine type has 2 or more vCPUs.
***General Purpose (E2, N2, N2D, N1)*** :- for web and application servers
***Memory Optimized (M2, M1)*** :- for in-memory databases and in-memory data analytics
***Compute Optimized (C2)*** :- for gaming applications and HPC (high-performance computing)
**Accelerator Optimized (A2)** :- for GPU-intensive applications like ML training (CUDA), gaming, etc.
**GPU vs TPU**
GPUs offer high-precision floating-point operations, whereas TPUs use lower-precision floating point.
Hence TPUs are mainly suited to deep-learning workloads, while GPUs have broader uses
Enabling the **Confidential VM service** while creating a compute instance encrypts the data in use (in memory) with hardware-generated keys (AMD SEV) that are not accessible to Google
**Managed Instance Groups**
* Stateful - persists per-instance state like hostname, IP address, disks, etc.
Mainly useful for database/sharding kinds of use cases
* Stateless - doesn't persist any state; mainly used for serving web traffic
Managed instance groups support two types of update:
* Automatic, or **proactive**, updates -> if you want to apply updates **automatically**, set the type to proactive.
* Selective, or **opportunistic**, updates -> Alternatively, if an automated update is potentially too disruptive, you can choose to perform an opportunistic update. The MIG applies an opportunistic update only when you **manually** initiate the update on selected instances or when **new** instances are created
**Boot Disk Options**
Standard < Balanced < SSD < Extreme ***(Only Zonal)***
1. Sole-tenancy is appropriate for specific types of workloads, for example, gaming workloads with performance requirements might benefit because they are isolated on their own hardware, finance or healthcare workloads might have requirements around security and compliance, and Windows workloads might have requirements related to licensing.
1. Shutdown scripts are especially useful for instances in a managed instance group with an autoscaler. If the autoscaler shuts down an instance in the group, the shutdown script runs before the instance stops and the shutdown script performs any actions that you define
1. The shutdown script won't run if the instance is reset using `instances().reset`.
2. To pass in a local shutdown script file, supply the --**metadata-from-file** flag, followed by a metadata key pair, **shutdown-script**=PATH/TO/FILE, where PATH/TO/FILE is a relative path to the shutdown script. For example:
```
gcloud compute instances create example-instance \
--metadata-from-file shutdown-script=examples/scripts/install.sh
```
4. You can set custom metadata for an instance or project from the Google Cloud Console, the gcloud command-line tool, or the Compute Engine API.
5. Custom metadata is useful for passing in arbitrary values to your project or instance, and for setting startup and shutdown scripts.
6. The metadata server requires that all requests provide the `Metadata-Flavor: Google` header, which indicates that the request was made with the intention of retrieving metadata values (see the curl sketch after this list).
7. Guest attributes are a specific type of custom metadata that your applications can write to while running on your instance. Any application or user on your instance can both read and write these guest-attribute metadata values.
8. For OS Patch management , we can use VM Manager
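Hedged sketch of querying the metadata server and writing a guest attribute from inside a VM (the key names are placeholders; guest attributes must be enabled on the instance via the `enable-guest-attributes=TRUE` metadata key) :-
```
# Read a custom metadata value (note the mandatory header)
curl -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/my-key"

# Write a guest attribute from within the instance
curl -X PUT -H "Metadata-Flavor: Google" --data "my-value" \
  "http://metadata.google.internal/computeMetadata/v1/instance/guest-attributes/my-namespace/my-key"
```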
**Connection options for Internal VMs**
* *SSH tunneling with IAP* :- You don't want any external IP address access to any VMs in your project. You can use IAP on all Linux VMs, including bastion host VMs and VMs within projects that use Cloud VPN or Cloud Interconnect.
* *Bastion Hosts* :- You have a specific use case, like session recording, and you can't use IAP.
* *Cloud VPN/Cloud Interconnect* :- Your organization has configured Cloud VPN or Cloud Interconnect for their networking needs. Cloud VPN and Cloud Interconnect are separate Google Cloud products that aren't included in Compute Engine pricing.
# Google VPC networks
are global, while subnets are regional
VPC network can be created in
* **Auto mode** :- create one subnet in each GCP region automatically when you create the network. As new regions become available, new subnets in those regions are automatically added to the auto mode network
These automatically created subnets use a set of predefined IP ranges that fit within the 10.128.0.0/9 CIDR block. As new Google Cloud regions become available, new subnets in those regions are automatically added to auto mode VPC networks by using an IP range from that block. In addition to the automatically created subnets, you can add more subnets manually to auto mode VPC networks in regions that you choose by using IP ranges outside of 10.128.0.0/9.
This is OK for dev or test environments, but not for production
* **Custom mode** :- you create subnets when you create the network, or you can add subnets later. Good for production setups
> You can switch a VPC network from auto mode to custom mode. This is a one-way conversion; custom mode VPC networks cannot be changed to auto mode VPC networks.
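Hedged sketch of creating a custom-mode VPC network and one subnet (names, region and IP range are placeholders) :-
```
gcloud compute networks create my-vpc --subnet-mode=custom
gcloud compute networks subnets create my-subnet \
  --network=my-vpc \
  --region=us-central1 \
  --range=10.0.0.0/24
```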
**Types of VPC networking**
1. ***Shared VPC*** connects VPCs within the same organization only. A project in a Shared VPC is either a host project or a service project. When a host project is enabled, all of its VPCs automatically become shared VPCs. Shared VPC lets organization administrators **delegate administrative responsibilities**, such as creating and managing instances, to Service Project Admins while maintaining centralized control over network resources like subnets, routes, and firewalls.
1. ***VPC peering*** connects VPCs across projects and organizations. Currently works only with Compute Engine, GKE and App Engine flexible, and currently only **25** VPCs can be peered. It has the below advantages over external IPs or Cloud VPN :-
* Lower latency because of internal addressing
* Not exposed to the public internet
* External addressing incurs egress charges, which is costly
1. **Private Service Connect** is a newer service which helps connect services across VPCs by providing a **static internal IP address** for each published service
With this we can consume a service from another VPC network, or make a service available outside its own VPC
1. **Serverless VPC** Access makes it possible for you to connect directly to your Virtual Private Cloud (VPC) network from serverless environments such as Cloud Run, App Engine, or Cloud Functions
**VPC Service Controls** add a security perimeter around the Google-managed resources used from a VPC, mitigating data exfiltration.
For example, a perimeter can restrict which Google APIs a project can reach (e.g. allow Compute Engine but block GKE)
# Cloud NAT :-
Cloud NAT is a distributed, software-defined managed service. It's not based on proxy VMs or appliances. Cloud NAT configures the Andromeda software that powers your Virtual Private Cloud (VPC) network so that it provides source network address translation (source NAT or SNAT) for **VMs without external IP addresses**.
Cloud NAT also provides destination network address translation (destination NAT or DNAT) for established inbound response packets.
**Used for** :- When a VM does not have an external IP address assigned, it cannot make direct connections to external services, including other Google Cloud services. To allow these VMs to reach services on the public internet, you can set up and configure Cloud NAT, which can route traffic on behalf of any VM on the network.
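Hedged sketch of setting up Cloud NAT for a VPC (a Cloud Router is required; all names are placeholders) :-
```
gcloud compute routers create nat-router \
  --network=my-vpc --region=us-central1
gcloud compute routers nats create my-nat-config \
  --router=nat-router --region=us-central1 \
  --nat-all-subnet-ip-ranges \
  --auto-allocate-nat-external-ips
```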
# Cloud VPN :-
for connecting an on-prem network with a GCP network. Uses a secure IPsec VPN tunnel.
Ideal for scenarios with lower traffic and bandwidth needs, e.g. when 3rd-party users want to connect to GCP resources
It's a **regional** resource
# Cloud Interconnect :-
A dedicated, direct connection from an on-prem network to the GCP network
If a dedicated connection is not available, one can also use the options below :-
1. **Partner Interconnect**, where a supported service provider provides the connection to Google's network routers
2. **Direct Peering** , which is good for connecting to **google workspace applications**
3. **Carrier Peering** , which is similar to Direct Peering , but involves 3rd party service provider
# Cloud router :-
A software router for dynamically exchanging routes between GCP and an on-prem network. Uses BGP (Border Gateway Protocol). It has two modes
1. Regional dynamic routing -> advertises only the subnets in the region where the router is created
2. Global dynamic routing -> advertises all the subnets in the entire VPC
> For example, if you use a Cloud VPN tunnel to connect your networks, you can use Cloud Router to establish a BGP session with your on-premises router over your Cloud VPN tunnel. Cloud Router automatically learns new subnets in your VPC network and announces them to your on-premises network
# Billing Accounts
To manage ***billing accounts*** and to add projects to them, you must be a billing admin
1. To change the billing account of an existing project, you must be the owner of the project and a billing admin on the destination billing account
1. To set a budget alert you must be a billing admin
1. Budget alerts can be applied either to a billing account or to a project
> setting budget alerts and triggering them doesn't cap API usage
# GCP Load balancing
1. A client sends a content request to the external IPv4 address defined in the forwarding rule.
2. The forwarding rule directs the request to the target HTTPS proxy.
3. The target proxy uses the rule in the URL map to determine that the single backend service receives all requests.
4. The load balancer determines that the backend service has only one instance group and directs the request to a virtual machine (VM) instance in that group.
5. The VM serves the content requested by the user.
**Types of Load Balancer**
1. **Global Load Balancers** (Premium Tier)
* Global HTTPS - Layer 7 load balancing , can route different URLs to different backends . Used in cross-regional load balancing
* Global SSL Proxy - Layer 4 for non-HTTPS SSL traffic, only on specific port numbers
* Global TCP Proxy - Layer 4 for non-SSL traffic , only on specific port
1. **Regional Load Balancers** (Standard Tier)
* Internal TCP/UDP - Load balancing of any traffic , on any port
* Internal HTTPS - traffic inside a VPC, internal tier of multi-tier apps, only on specific ports
* External TCP/UDP - Regional Layer 4 for non-SSL traffic only on specific port
* HTTP and HTTPS traffic require global, external load balancing.
* TCP traffic can be handled by global, external load balancing; external, regional load balancing or internal, regional load balancing.
* UDP traffic can be handled by external regional load balancing or internal regional load balancing.
* SSL Proxy is used for external TCP traffic with SSL; SSL processing is offloaded to the proxy.
* Network TCP/UDP Load Balancing is regional, passthrough load balancing for external TCP/UDP traffic; Internal TCP/UDP Load Balancing is used for internal traffic.
* TCP Proxy Load Balancing is a reverse proxy load balancer that distributes TCP traffic coming from the internet to virtual machine (VM) instances in GCP. Proxy load balancers terminate incoming client connections at the load balancer and then open new connections from the load balancer to the backends.
* Passthrough load balancers do not terminate client connections. Instead, load-balanced packets are received by backend VMs with the packet's source, destination, and, if applicable, port information unchanged.
A ***target pool*** provides a single access point to all the instances in a group
A managed instance group is created from an instance template: the template defines the VM configuration, while the group manages the running instances
***Backend services*** wrap one or more instance groups
***VM instance groups*** are groups of similar VM instances used for load balancing or autoscaling.
* Managed :- maintain HA automatically by keeping the configured number of instances running
* Unmanaged :- instances don't share a common instance template and can only be used for load balancing
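Hedged sketch of creating an instance template and a managed instance group from it (names, machine type and image are placeholders) :-
```
gcloud compute instance-templates create web-template \
  --machine-type=e2-medium \
  --image-family=debian-12 --image-project=debian-cloud
gcloud compute instance-groups managed create web-mig \
  --zone=us-central1-a \
  --template=web-template \
  --size=3
```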
# Service Directory
Managed service to discover, publish and connect to services.
Kind of like a naming service where we can connect to services by name instead of by IP
# Cloud Armor
It's a software-defined service at layer 7 (the application layer), for protection against DDoS attacks on any application.
**It can only be configured for services behind an external HTTP(S) Load Balancer**
It enforces these security rules/policies at the Google edge PoPs, and doesn't even allow unsolicited traffic to enter Google datacenters
# Cloud storage
offers the following storage classes :-
* Regional storage (data stored in a specific region, low cost)
* Multi-regional storage (data stored across multiple regions, higher cost)
* Nearline storage (durable data accessed at most once a month, low cost)
* Coldline storage (highly durable data accessed at most once in 90 days)
* Archive storage (highly durable data accessed at most once a year, or for DR purposes, very low cost)
1. Cloud Storage isn't file storage; it is object storage based on buckets. Objects are immutable and can't be edited in place
1. Versioning is optional for Cloud Storage and can be turned on via gsutil. If versioning is on, a new version (generation) number gets attached to the object
2. If you delete a versioned object, it only becomes a non-current version. Deleting the non-current version then deletes it permanently
3. A lifecycle-management config can be assigned to a bucket, e.g. automatically downgrading the storage class of objects older than 365 days to Coldline (see the sketch below)
> Only two actions are supported
> Delete and setStorageClass
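A minimal sketch of such a lifecycle config and applying it with gsutil (the bucket name and age threshold are placeholders) :-
```
# lifecycle.json : downgrade objects older than 365 days to Coldline
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 365}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://BUCKET_NAME
```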
`gsutil rewrite (for rewriting an object in place, e.g. to change its encryption key or storage class)`
`gsutil rewrite -s nearline gs://BUCKET_NAME/OBJECT_NAME (to convert an object's storage class to Nearline)`
Two systems exist for enabling ***bucket permission***:-
* IAM :- newer; provides access through GCP's existing IAM policies, but cannot grant access at the individual-object level
* ACL :- legacy; doesn't provide permissions as fine-grained as IAM, but can be used to grant access at the object level
> Enabling uniform bucket-level access disables the existing ACLs on the bucket completely, and access then has to be granted via IAM only
```
# Grant public access to all users on all objects in a bucket
gsutil iam ch allUsers:objectViewer gs://BUCKET_NAME
```
```
# Grant public access to all users on a specific object in a bucket
gsutil acl ch -u AllUsers:R gs://BUCKET_NAME/OBJECT_NAME
```
`gsutil stat gs://BUCKET_NAME/OBJECT_NAME` :- for checking the metadata of an object
`gsutil ls -L -b gs://BUCKET_NAME` :- for checking the metadata of a bucket
> Avoid using sequential filenames such as timestamp-based filenames if you are uploading many files in parallel
`gsutil hash [-c] [-h] [-m] filename...`
> Calculate hashes on local files, which can be used to compare with gsutil ls -L output. If a specific hash option is not provided, this command calculates all gsutil-supported hashes for the files.
A ***signed URL*** is a URL that provides limited permission and time to make a request. Signed URLs contain authentication information in their query string, allowing users without credentials to perform specific actions on a resource
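Hedged sketch of generating a signed URL valid for 10 minutes with gsutil (requires a service-account private key file and the pyopenssl library; file and bucket names are placeholders) :-
```
gsutil signurl -d 10m service-account-key.json gs://BUCKET_NAME/OBJECT_NAME
```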
A **signed policy document** specifies what can be uploaded to a bucket, adding controls over size, content type, and other upload attributes
***Data locking*** (retention policies and holds) can be applied to buckets and objects to block deletion/modification, e.g. for audits
***Decompressive transcoding*** :- upload a compressed (gzip) file and have the object decompressed while serving the request. This is done by setting the correct metadata (`Content-Encoding: gzip`)
***Composite objects*** :- upload a single object in parallel pieces and compose them into one composite object, without having to re-upload a concatenated file
**Storage Transfer Service** :- used for transferring large quantities of data (e.g. ~100 TB) from on-prem, other clouds, or other buckets
**Transfer Appliance** :- a physical device shipped to your datacenter for manually transferring data. Comes in 40 TB and 300 TB capacities
**Cloud Storage FUSE** :- adapter that allows you to mount and access Cloud Storage buckets as local file systems, so applications can read and write objects in your bucket using standard file system semantics
# Cloud Bigtable :-
Fully managed, wide-column NoSQL database, accessed using the HBase API
> It's a wide column database build on top of "sparse (can be empty) multidimensional array"
By default Bigtable provides eventual consistency, but we can also choose the consistency options below :-
1. **Read-your-writes** consistency, where any app that writes data will read back the same data. For this, you need to select *single-cluster routing*
2. **Strong** consistency, where all apps read the same data. For this, you need to use *single-cluster routing* with failover only
* For maximum performance, insert data with a row key that is already sorted, so Bigtable doesn't spend much time sorting.
> That's why time-series data, which is already sorted, is a good match
* Bigtable requires that column family names follow the regular expression [_a-zA-Z0-9][-_.a-zA-Z0-9]*. If you are importing data into Bigtable HBase, you might need to first change the family names to follow this pattern
* Bigtable stores timestamps in microseconds, while HBase stores timestamps in milliseconds
* In HBase, deletes mask puts, but Bigtable does not mask puts after deletes when put requests are sent after deletion requests. This means that in Bigtable, a write request sent to a cell is not affected by a previously sent delete request to the same cell.
* Differences between Hbase & bigTable :- https://cloud.google.com/bigtable/docs/hbase-differences
* If the hottest node is frequently above the recommended value, even when your average CPU utilization is reasonable, you might be accessing a small part of your data much more frequently than the rest of your data.
> Use the Key Visualizer tool to identify hotspots in your table that might be causing spikes in CPU utilization.
> Check your schema design to make sure it supports an even distribution of reads and writes across each table.
* Creating a client for Bigtable is a relatively expensive operation. Therefore, you should create the smallest possible number of clients
* If your data includes integers that you want to store or sort numerically, pad the integers with leading zeroes. Bigtable stores data lexicographically. For example, lexicographically, 3 > 20 but 20 > 03
* In many cases, you should design row keys **that start with a common value and end with a granular value**. For example, if your row key includes a continent, country, and city, you can create row keys like `asia#india#bangalore` so that they automatically sort first by the values with lower cardinality (see the cbt sketch after this list).
* When you create a Bigtable instance, your choice of SSD or HDD storage for the instance is permanent. You **cannot use the Google Cloud Console to change the type of storage** that is used for the instance
* **Rows with time buckets** :- ideal for storing time-sensitive data like stocks, where each row represents a bucket of time such as an hour or a day. The row key can include a non-time identifier like `hour23`
* **Column for each event** :- write a new column for each new event. Ideal when you don't need to measure changes in the time-series data, and it saves storage space
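Hedged sketch of the row-key guidance above using the `cbt` CLI (project, instance, table and values are placeholders) :-
```
# Create a table and a column family, then write a row whose key sorts
# from low-cardinality (continent) to high-cardinality (city/date) values
cbt -project=my-project -instance=my-instance createtable user_events
cbt -project=my-project -instance=my-instance createfamily user_events metrics
cbt -project=my-project -instance=my-instance set user_events \
  "asia#india#bangalore#20240101" metrics:clicks=42
```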
# Cloud SQL :-
MySQL, PostgreSQL (and SQL Server) as a service; vertical scaling for read-write, horizontal scaling (read replicas) for reads
***Cloud SQL proxy*** is the recommended way to connect to Cloud SQL from a gcp VM
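A hedged sketch, assuming the v1 `cloud_sql_proxy` binary and a MySQL instance (the instance connection name is a placeholder) :-
```
# Start the proxy; apps then connect to 127.0.0.1:3306 as if the DB were local
./cloud_sql_proxy -instances=my-project:us-central1:my-db=tcp:3306
```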
**Enable automatic storage increases**
If you enable this setting, Cloud SQL checks your available storage every 30 seconds. If the available storage falls below a threshold size, Cloud SQL automatically adds additional storage capacity. If the available storage repeatedly falls below the threshold size, Cloud SQL continues to add storage until it reaches the maximum of 64 TB.
The legacy configuration for high availability used a failover replica instance. The new configuration does not use a failover replica. Instead, it uses Google's **regional persistent disks**, which synchronously replicate data at the block level between two zones in a region
**Cons** :-
* Only supports regional storage; multi-regional storage is used only for backups
* Storage capacity is capped per instance (up to 64 TB)
# Cloud Spanner :-
Similar to Cloud SQL, but offers transactional consistency at a global scale, and it can also scale to petabytes, unlike Cloud SQL which is bound by the DB instance size
> Scales by adding nodes . Each node can manage 2TB of data
**Cons**
* 100 databases per instance, and up to 2 TB per node
* 5000 tables per database
* 1024 columns per table, and 10 MB max per column
* 32 indexes per table , and 1000 indexes per database
# Cloud datastore :-
Highly scalable NoSQL document database. Also provides a SQL-like query language called **GQL**
`gcloud datastore export gs://bucket-name --async` :- for data backup
***Cloud Firestore*** provides two modes :- Native and Datastore.
1. ***Datastore*** mode is the older Google Cloud Datastore, mostly used for backend/server-side storage
2. ***Native*** mode uses the Firebase API and is similar to a document database like MongoDB. Better suited for mobile clients
> Datastore mode doesn't support all Firestore features, like offline support for mobile devices, and uses an object/entity-based data model
# Cloud MemoryStore
Managed caching service with both Redis and Memcached
**Redis** supports many more data types, but only has a max size of **300 GB** per instance. Supports two tiers :-
* Basic -> Only for development with no replication
* Standard -> Replication and good for prod requirements
**Memcached** :- key-value caching storage with a maximum of 20 nodes and a **max size of 5 TB**. Better suited for bigger caching requirements.
Good for database query caches
# App Engine
comes with two flavours :-
1. ***Standard*** is kind of a sandbox environment supporting only specific runtimes/versions of languages like Java, Python and Go. Can be scaled down to 0 instances
2. ***Flexible*** App Engine is a container-based service that runs on Compute Engine VMs. It provides more flexibility in runtimes and geographic location
> Multiple versions of an app engine can exist simultaneously . We can decide which one is currently live
* For A/B testing, use traffic splitting
* For canary releases, use "Random" splitting, where traffic is sent randomly across versions
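Hedged sketch of splitting traffic between two versions of a service (service and version names are placeholders) :-
```
gcloud app services set-traffic default \
  --splits=v1=0.9,v2=0.1 \
  --split-by=random
```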
> An App Engine app can be segregated into various "services", which act like a microservices architecture. Each service can support multiple versions
> "Instances" are the basic building blocks of App Engine, and scaling is achieved by adding/removing instances only
* App Engine sends this request to bring an instance into existence; users cannot send a request to /_ah/start
* Configuration files in App Engine :-
1. app.yaml :- Each service in your app has its own app.yaml file, which acts as a descriptor for its deployment. You must first create the app.yaml file for the default service before you can create and deploy app.yaml files for additional services within your app.
1. cron.yaml :- Use the cron.yaml file to define scheduled tasks for your application.
1. dos.yaml :- The dos.yaml file provides the controls to denylist IP addresses or subnets to protect your app from Denial of Service (DOS) attacks or similar forms of abuse
1. dispatch.yaml :- The dispatch.yaml allows you to override routing rules. You can use the dispatch.yaml to send incoming requests to a specific service (formerly known as modules) based on the path or hostname in the URL.The dispatch.yaml file should reside in the root directory or in the directory that defines the default service
1. index.yaml :- App Engine uses indexes in Datastore for every query your application makes. These indexes are updated whenever an entity changes, so the results can be returned quickly when the app makes a query . This is managed by index.yaml file
# Cloud functions
are run via triggers. Triggers can be HTTP, Pub/Sub, Cloud Storage events, etc.
**Reducing the amount of imported code** and using **lazy initializations** are recommended practices for improving function performance
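Hedged sketch of deploying an HTTP-triggered function (function name, runtime, entry point and region are placeholders) :-
```
gcloud functions deploy hello-http \
  --runtime=python311 \
  --trigger-http \
  --entry-point=handler \
  --region=us-central1 \
  --allow-unauthenticated
```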
# Deployment Manager
Google's own managed Infrastructure-as-Code service, supporting configurations written in YAML, Jinja, or Python.
**Creating a new deployment** :-
```
gcloud deployment-manager deployments create <depl-name> --config <config-file-name.yaml>
```
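A minimal sketch of what `<config-file-name.yaml>` might contain (a single VM resource; the resource name, zone, machine type and image are placeholders) :-
```
cat > config-file-name.yaml <<'EOF'
resources:
- name: example-vm
  type: compute.v1.instance
  properties:
    zone: us-central1-a
    machineType: zones/us-central1-a/machineTypes/e2-medium
    disks:
    - deviceName: boot
      boot: true
      autoDelete: true
      initializeParams:
        sourceImage: projects/debian-cloud/global/images/family/debian-12
    networkInterfaces:
    - network: global/networks/default
EOF
```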
**Deleting a deployment** :-
```
gcloud deployment-manager deployments delete <depl-name>
```
# Cloud Monitor (Stackdriver)
is a multi-cloud logging and monitoring solution.
> Cloud Debugger connects the app's production data to the source code, for real-time code inspection in prod.
> For some metrics, like memory, the Cloud Monitoring (Stackdriver) agent needs to be installed on the VM
# Dataproc:-
* Service for easily creating Hadoop/Spark clusters
* An HDFS cluster is automatically created for storage
* On GCP, Jupiter is the underlying networking fabric and Colossus is the underlying storage system
* GCS is the cloud native alternative for HDFS
> But GCS can have some latency . So , if you have large jobs with lots of tiny blocks of data , then stick to HDFS
> Also if you are modifying or renaming your directories constantly , then better to use HDFS . Since gcs rename is an expensive operation
> Also if you heavily use the append operation on HDFS or workloads that involve heavy I/O and a lot of partitioned writes
> Also , avoid iterating sequentially over many nested directories in a single job
* To get your data into GCS, you can use DistCp
* GCS should be used for initial and final source of data , the intermediate job outputs can be written to HDFS
* For reducing costs , preemptible worker instances can be used
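Hedged sketch of creating a cluster with secondary (preemptible by default) workers to cut cost (names and counts are placeholders; flags may vary by gcloud version) :-
```
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --num-secondary-workers=2
```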
# Cloud Dataflow :-
Fully managed service for running data-processing pipelines. The Apache Beam SDK is used for programming the pipelines
https://beam.apache.org/documentation/programming-guide/#overview
> Cloud storage is used as a staging area for dataflow , can be used as input/output as well
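Hedged sketch of launching a Dataflow job from the Google-provided Word_Count template, reading from and writing to Cloud Storage (the job name and bucket paths are placeholders) :-
```
gcloud dataflow jobs run wordcount-example \
  --gcs-location=gs://dataflow-templates/latest/Word_Count \
  --region=us-central1 \
  --parameters=inputFile=gs://my-bucket/input.txt,output=gs://my-bucket/output/results
```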
1. Can be used for both batch and streaming pipelines; that's why the name Beam (Batch + strEAM)
> Dataflow is ideal for any new pipeline building activities , for migrating existing on prem Hadoop jobs , use dataproc

2. In Dataflow, the execution graph is optimized into the best execution plan, and a step doesn't wait for the previous step to finish before starting
3. Autoscaling happens automatically step-by-step
4. It can handle late arriving records with intelligent watermarking
> This feature is currently supported only in Beam's Java SDK
6. Dataflow windowing
* Fixed :- data divided into fixed slices (e.g. hourly/daily) with non-overlapping intervals
* Sliding :- used for running computations over intervals that can overlap
* Session :- defined by a minimum gap duration, with timing governed by the elements themselves. E.g. web sessions
7. Data flow job replacement
* Pass the --update option.
* Set the --jobName option in PipelineOptions to the same name as the job you want to update.
* If any transform names in your pipeline have changed, you must supply a transform mapping and pass it using the --transformNameMapping option.
* When you launch your replacement job, the value you pass for the --jobName option must match exactly the name of the job you want to replace.
* Your replacement job might fail Dataflow's compatibility check if you remove certain stateful operations from your pipeline
* You must run your replacement job in the same zone in which you ran your prior job.
* Beam’s **default windowing behavior is to assign all elements of a PCollection to a single, global window and discard late data**, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must do at least one of the following:
1. Set a non-global windowing function.
2. Set a non-default trigger. This allows the global window to emit results under other conditions, since the default windowing behavior (waiting for all data to arrive) will never occur
* A **watermark** is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. If new data arrives with a timestamp that's in the window but older than the watermark, the data is considered late data
# Dataprep :-
Used for data preparation/transformation from inputs like Cloud Storage to BigQuery or another analytics engine
# Data Fusion :-
1. Fully managed , cloud native service for building and managing data pipelines
2. Uses GKE cluster to provision instances
3. Graphical UI available to orchestrate pipeline building via workflows and available templates
4. It's primarily a **code-free ETL/ELT workflow orchestrator**, and can be used mainly by business users.
5. It's mostly good for batch-processing requirements. For streaming and more complex workflows it's better to use Cloud Dataflow
# BigQuery :-
Data warehousing and data analysis service
> Extremely inexpensive storage: roughly 2 cents/GB/month for active storage, dropping to about 1 cent/GB/month (long-term storage) once a table goes unmodified for 90 consecutive days
> Data analyzed using ANSI SQL 2011 queries
1. Big query is the serverless data warehouse
2. ***Federated query*** :- external data sources like Google Sheets or Cloud Storage can also be queried natively via BigQuery, but they lose the benefit of BigQuery caching, so every query re-reads the source
3. BigQuery is column-based storage instead of record-based storage like a traditional SQL database, so an individual column can be read without accessing the entire record
4. Monitoring of metrics like "queryCount" or "querySize" can be done via Stackdriver (Cloud) Monitoring, where alerts/notifications can be added
5. **Avro** is the preferred storage format
6. For efficient encoding , **parquet** type can be used since it has better compression ratio and smaller file size
7. To submit a query , you need cloud IAM permissions
8. Access control is at org, project,dataset,table,column and row level
9. To set access control at the column level, you define a taxonomy of policy tags (e.g. **"Business Criticality: High/Medium"**), apply a tag to the column, and grant the tag to a specific Google group; only users granted that tag can access that column
10. An ***authorized view*** allows you to share data externally without sharing the underlying table
> You cannot export data from a view. And the tables in the view must be in the same location
> Assigning your data analysts the project-level **bigquery.user** role does not give them the ability to view or query table data in the dataset containing the tables queried by the view. **Most individuals (data scientists, business intelligence analysts, data analysts) in an enterprise should be assigned the project-level bigquery.user role**
> The bigquery.user role gives your data analysts the permissions required to create query jobs, but they cannot successfully query the view unless they also have at least **bigquery.dataViewer** access to the dataset that contains the view.
11. Denormalizing the data in BigQuery datasets improves query performance
> But denormalizing decreases performance when grouping on a 1-to-many field, since it creates more shuffling of data over the network
> The solution for this is using ***STRUCT*** (nested or repeated data). Any field with a dot in its name is a STRUCT
> ***ARRAY*** type can be identified by the Mode "REPEATED"
`finding the number of elements with ARRAY_LENGTH(<array>)`
`deduplicating elements with ARRAY_AGG(DISTINCT <field>)`
`ordering elements with ARRAY_AGG(<field> ORDER BY <field>)`
`limiting ARRAY_AGG(<field> LIMIT 5)`
9. You need to UNNEST() arrays to bring the array elements back into rows
10. UNNEST() always follows the table name in your FROM clause (think of it conceptually like a pre-joined table); see the bq sketch after this list
11. BigQuery supports ***clustering*** on tables. Clustering is now supported on non-partitioned tables as well
12. Clustering is about co-locating related data for faster aggregation, and a clustering column is ideal for data with **high cardinality**
13. To view a table schema `bq show <dataset>.<tablename>`
14. To view all tables in a dataset `bq ls <dataset>`
15. Data can directly be streamed into a big query table via streaming inserts
> Streaming data inserts are charged, but batch loads aren't
> DML statements can now run without any limit in BigQuery
16. BigQuery imports/exports data only to/from regional or multi-regional buckets in the same location as the dataset
17. **require_partition_filter** enforces that all SELECT queries contain the necessary WHERE clause on the partitioning column
18. **Partitioning is preferred over sharding**
19. **A dataset's location is immutable** and can't be updated after creation
20. All other **schema modifications** are unsupported and require manual workarounds, including:
* Changing a column's name
* Changing a column's data type
* Changing a column's mode (aside from relaxing REQUIRED columns to NULLABLE)
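Hedged sketch of the ARRAY/UNNEST notes above, run via the bq CLI (the project, dataset, table and column names are placeholders) :-
```
bq query --use_legacy_sql=false '
SELECT title, tag
FROM `my-project.my_dataset.articles`, UNNEST(tags) AS tag
WHERE ARRAY_LENGTH(tags) > 3'
```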
**IAM Roles**
* The bigquery.user role, when applied to a dataset, only allows reading the dataset and listing its tables; when applied at the project level it also provides access to run jobs and monitor the jobs the user has run
# Data Protection & Compliance
***Data Catalog***
It's a metadata management service, e.g. for tagging a particular field as sensitive.
***Data Loss Prevention (DLP) API***
It helps manage sensitive data like credit card numbers, phone numbers, etc.
The API can be invoked in a Dataflow pipeline while reading data from storage sources, and can be used to mask or remove sensitive data
# Cloud Composer (Managed Apache Airflow)
This is a managed open-source orchestrator for production workflows
It uses GKE clusters internally to create an Airflow instance, and it has two sets of resources
* Customer resources -> the customer pays for resources like GKE, Cloud Storage & Redis
* Tenant resources -> GCP uses its own internal resources to manage the instance, for which the user is not billed
1. ***DAG*** :- a Directed Acyclic Graph is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies
2. ***Operator*** :- the description of a single task; it's usually atomic. E.g. the BashOperator is used to execute a bash command
3. ***Task*** :- a parameterised instance of an Operator; a node in the DAG
4. ***Task Instance*** :- a specific run of a task; characterized as a DAG, a task, and a point in time
# Cloud Workflows :-
It's a managed orchestrator for chaining API calls, and works with both Google services and HTTP-based APIs.
It's primarily used for BPM-style workflows for processes like order management, inventory management, etc.
Workflows can be created in a YAML or JSON format
# GKE (Google Kubernetes Engine):-
The types of available clusters include: zonal (single-zone or multi-zonal) and regional.
GKE modes of operation :-
1. Standard :- we need to manage the nodes, along with the required configuration (pay per node)
1. Autopilot :- no need to manage any nodes; Google does that automatically and we only pay for the pods being used (pay per pod)
***Batch jobs*** :- can be achieved with Kubernetes Jobs
**StatefulSets** :- used for running stateful pods like database engines
***Daemons*** :- ensure every node runs a copy of a pod for backend services like log collectors. Achieved with a Kubernetes DaemonSet
`To enable Cloud Operations for GKE :- gcloud container clusters update <cluster_name> --enable-stackdriver-kubernetes`
`To see running pods :- kubectl get pods`
> By default all pods are internal. To make them public facing, you need to connect a load balancer to your deployment using a Service, e.g. via the `kubectl expose` command:
`kubectl expose deployment <deployment_name> --type=LoadBalancer --port=80 --target-port=8080`
There are 5 types of services :-
1. ***ClusterIP*** (default): Internal clients send requests to a stable internal IP address.
1. ***NodePort***: Clients send requests to the IP address of a node on one or more nodePort values that are specified by the Service. Suitable for local and POC needs, because the local node IP won't change
1. ***LoadBalancer***: Clients send requests to the IP address of a network load balancer. Ideal for production use cases where the service needs to be exposed over the internet
* *External LoadBalancer* :- if the service needs to be reached from outside the cluster and VPC (a GKE network LB is provisioned automatically)
* *Internal LoadBalancer* :- if the service only needs to be reached from inside the VPC. Uses an IP from the VPC
* *HTTPS LoadBalancer* :- to allow HTTPS traffic from outside the VPC. Uses a Kubernetes Ingress resource; the Service is configured to use both NodePort and ClusterIP
3. ***ExternalName***: Internal clients use the DNS name of a Service as an alias for an external DNS name.
4. **Ingress**: When you create the Ingress, the GKE ingress controller creates and configures an external Application Load Balancer or an internal Application Load Balancer according to the information in the Ingress and the associated Services. Also, the load balancer is given a stable IP address that you can associate with a domain name.
5. ***Headless***: You can use a headless service in situations where you want a Pod grouping, but don't need a stable IP address.
`kubectl apply command to update the deployment config file`
`kubectl apply -f ./<fileName>` ***to deploy an app to GKE cluster***
`kubectl describe deployment <deployment_name> / kubectl get pods -l <key>:<value>` ***for getting details of the deployment or pods***
`kubectl rollout undo deployment <deployment_name>` ***for rolling back a deployment***
`kubectl scale deployment <deploymentName> --replicas <numOfReplicas>` ***for scaling a deployment***
`kubectl delete deployment <deploymentName>` ***for deleting a deployment***
`kubectl autoscale deployment my-app --max 6 --min 4 --cpu-percent 50`
`kubectl set image deployment nginx nginx=nginx:1.9.1` ***For performing rolling update***
```
gcloud container clusters create cluster-name \
--enable-vertical-pod-autoscaling --cluster-version=1.14.7
```
By default, GKE collects logs for both your system and application workloads deployed to the cluster.
1. System logs – These logs include the audit logs for the cluster including the Admin Activity log, Data Access log, and the Events log. For detailed information about the Audit Logs for GKE, refer to the Audit Logs for GKE documentation. Some system logs run as containers, such as those for the kube-system, and they're described in Controlling the collection of your application logs.
1. Application logs – Kubernetes containers collect logs for your workloads, written to STDOUT and STDERR.
GKE is by default integrated with Cloud Operations for logging & monitoring. If disabled, logs are written only to the worker nodes, where they may get overwritten or deleted
You can perform a **rolling update** to update the images, configuration, labels, annotations, and resource limits/requests of the workloads in your clusters
***Labels and selectors are used to assign pods to nodes***
**Binary Authorization** :- Can be used to authorize only tested images to be deployed on Production
# Cloud Audit Logs
provides the following audit logs for each Cloud project, folder, and organization:
1. **Admin Activity audit logs** - Admin Activity audit logs contain log entries for API calls or other administrative actions that modify the configuration or metadata of resources. For example, these logs record when users create VM instances or change Identity and Access Management permissions. ***Admin Activity audit logs are always written***; you can't configure or disable them. There is no charge for your Admin Activity audit logs
1. **Data Access audit logs** :- Data Access audit logs contain API calls that read the configuration or metadata of resources, as well as user-driven API calls that create, modify, or read user-provided resource data. ***Data Access audit logs are disabled by default*** because they can be quite large
1. **System Event audit logs** :- System Event audit logs contain log entries for Google Cloud administrative actions that modify the configuration of resources. System Event audit logs are generated by Google systems; they are not driven by direct user action.***System Event audit logs are always written***; you can't configure or disable them. There is no charge for your System Event audit logs.
1. **Policy Denied audit logs**:- Cloud Logging records Policy Denied audit logs when a Google Cloud service denies access to a user or service account because of a security policy violation. ***Policy Denied audit logs are generated by default*** and your Cloud project is charged for the logs storage
# Cloud Run :-
* Executables in the container image must be compiled for Linux 64-bit
* The container must listen for requests on 0.0.0.0 on the port to which requests are sent. By default, requests are sent to 8080
* The container should not implement any transport layer security directly. TLS is terminated by Cloud Run for HTTPS and gRPC, and then requests are proxied as HTTP or gRPC to the container without TLS
* The following environment variables are automatically added to the running containers:
1. PORT The port your HTTP server should listen on. 8080
1. K_SERVICE The name of the Cloud Run service being run. hello-world
1. K_REVISION The name of the Cloud Run revision being run. hello-world.1
1. K_CONFIGURATION The name of the Cloud Run configuration that created the revision. hello-world
* Each instance gets 1vCPU & 256 MB of memory
> Cloud Run container instances expose a metadata server that you can use to retrieve details about your container instance, such as the project ID, region, instance ID or service accounts http://metadata.google.internal/computeMetadata/v1/project/project-id
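Hedged sketch of deploying a container image that follows the contract above (service name, image and region are placeholders) :-
```
gcloud run deploy hello-world \
  --image=gcr.io/my-project/hello-world \
  --region=us-central1 \
  --port=8080 \
  --allow-unauthenticated
```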
# Pub/Sub :-
If messages have the same ordering key and you publish the messages to the same region, subscribers can receive the messages in order.
```
gcloud pubsub topics publish TOPIC_ID \
--message=MESSAGE_DATA \
--ordering-key=ORDERING_KEY
```
> Increasing the --ack-deadline would give the consuming application more time to acknowledge successful processing of the message.
Pub/Sub delivers each published message at least once for every subscription. There are some exceptions to this at-least-once behavior:
1. a message that cannot be delivered within the maximum retention time of 7 days is deleted and is no longer accessible
2. A message published before a given subscription was created will usually not be delivered for that subscription
3. Once a message is sent to a subscriber, the subscriber should acknowledge the message
4. A subscription can use either the pull or push mechanism for message delivery
5. Push-based delivery requires an HTTPS webhook endpoint with a valid SSL certificate to receive notifications. It must return a success response code (e.g. 200) as acknowledgement
6. ***Dead-letter topics*** are specified to receive messages that cannot be acknowledged (see the subscription sketch after this list).
> --max-retry-delay is the maximum delay between consecutive deliveries of a given message.
> The min-retry-delay is the minimum delay between consecutive deliveries of a given message.
7. By default the publisher client batches messages; turn this off if you need lower latency
8. Pub/Sub doesn't guarantee ordering or deduplication by default; these need to be handled at the application end
9. On App Engine, we recommend that you use the /_ah/push-handlers/ prefix in the endpoints URL path, as described in Registering App Engine endpoints. This code allows the endpoint to receive push messages from Pub/Sub API.
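Hedged sketch of creating a subscription with an extended ack deadline, a dead-letter topic and retry delays (topic and subscription names are placeholders) :-
```
gcloud pubsub subscriptions create my-sub \
  --topic=my-topic \
  --ack-deadline=60 \
  --dead-letter-topic=my-dead-letter-topic \
  --max-delivery-attempts=5 \
  --min-retry-delay=10s \
  --max-retry-delay=600s
```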
# Cloud AI Platform
A fully managed service for custom machine learning models
# AI Hub
Pre-built and tested Kubeflow pipelines to use as an out-of-the-box service
# Cloud Auto ML
used to train and deploy custom models without any need for prior coding knowledge
1. A CSV file needs to be provided to use this. That file contains the training dataset source (a GCS bucket), labels and features
> It shouldn't contain duplicates or blank lines and should be UTF-8 encoded
2. To solve an **overfitting** problem, you can:
* Increase the training set.
* Decrease features parameters.
* Increase regularization.
# Firebase
* By default, Cloud Firestore automatically maintains single-field indexes for each field in a document and each subfield in a map.
* You can exempt a field from your automatic indexing settings by creating a single-field index exemption
* A composite index stores a sorted mapping of all the documents in a collection, based on an ordered list of fields to index.
* Indexing Best practices :-
1. If you have a string field that often holds long string values that you don't use for querying, you can cut storage costs by exempting the field from indexing.
2. If you index a field that increases or decreases sequentially between documents in a collection, like a timestamp, then the maximum write rate to the collection is 500 writes per second. If you don't query based on the field with sequential values, you can exempt the field from indexing to bypass this limit.
3. Large array or map fields can approach the limit of 40,000 index entries per document. If you are not querying based on a large array or map field, you should exempt it from indexing.
# Anthos
Used for managing multiple Kubernetes clusters across hybrid and multi-cloud environments, including AWS, Azure, on-premises, etc.
Fleets are how Anthos lets you logically group and normalize Kubernetes clusters, making administration of infrastructure easier. Adopting fleets helps your organization uplevel management from individual clusters to groups of clusters, with a single view on your entire fleet in the Google Cloud console.
> For example, you can apply a security policy with Policy Controller to all fleet services in namespace foo, regardless of which clusters they happen to be in, or where those clusters are.
To manage the connection to Google in hybrid and multi-cloud fleets, Google provides a Kubernetes deployment called the **Connect Agent**. Once installed in a cluster as part of fleet registration, the agent establishes a connection between your cluster outside Google Cloud and its Google Cloud fleet host project, letting you manage your clusters and workloads from Google and use Google services.
In on-premises environments, connectivity to Google can use the public internet, a high-availability VPN, Public Interconnect, or Dedicated Interconnect, depending on your applications' latency, security, and bandwidth requirements when interacting with Google Cloud.
**Multi Cluster Ingress** is a cloud-hosted, multi-cluster ingress controller which can be used to share load across multiple clusters.
It can be used to manage clusters in multiple regions for high availability, and can also provide proximity-based routing
# Imp links
1. https://cloud.google.com/certification/guides/data-engineer/
1. https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1
2. https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
3. https://cloud.google.com/bigtable/docs/schema-design
4. https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting
5. https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference
6. https://cloud.google.com/speech-to-text/docs/best-practices
7. https://cloud.google.com/speech-to-text/docs/sync-recognize
8. https://cloud.google.com/speech-to-text/docs/release-notes#v1beta1
9. https://cloud.google.com/vision/docs/labels
10. https://cloud.google.com/ai-platform/training/docs/managing-models-jobs
11. https://cloud.google.com/stackdriver/docs/solutions/gke
12. https://developers.google.com/machine-learning/crash-course/generalization/peril-of-overfitting
13. https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/l2-regularization
14. https://towardsdatascience.com/deep-learning-3-more-on-cnns-handling-overfitting-2bd5d99abe5d
15. https://developers.google.com/machine-learning/crash-course/training-neural-networks/best-practices
16. https://cloud.google.com/pubsub/docs/faq#duplicates
17. https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
18. https://cloud.google.com/bigquery/docs/reference/auditlogs/#mapping-audit-entries-to-log-streams
19. https://cloud.google.com/bigquery/docs/monitoring#slots-available
20. https://cloud.google.com/security/encryption/default-encryption
21. https://cloud.google.com/architecture/hadoop/hadoop-gcp-migration-jobs
22. https://cloud.google.com/bigquery/docs/loading-data#loading_encoded_data
23. https://cloud.google.com/bigquery/docs/reference/standard-sql/numbering_functions
24. https://cloud.google.com/bigquery/streaming-data-into-bigquery#manually_removing_duplicates
25. https://cloud.google.com/bigquery/docs/best-practices-performance-compute
26. https://cloud.google.com/architecture/data-preprocessing-for-ml-with-tf-transform-pt1#preprocessing_operations
27. https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#preventing_compatibility_breaks
28. https://cloud.google.com/blog/products/data-analytics/tips-and-tricks-to-get-your-cloud-dataflow-pipelines-into-production
29. https://firebase.google.com/docs/firestore/query-data/index-overview#composite_indexes
30. https://towardsdatascience.com/how-i-passed-google-professional-data-engineer-exam-in-2020-2830e10658b6
31. https://cloud.google.com/bigquery-ml/docs/preventing-overfitting
32. https://cloud.google.com/blog/products/bigquery/busting-12-myths-about-bigquery
33. https://cloud.google.com/blog/products/bigquery/performing-large-scale-mutations-in-bigquery
34. https://cloud.google.com/docs/enterprise/best-practices-for-enterprise-organizations
35. https://cloud.google.com/blog/products/ai-machine-learning/understanding-neural-networks-with-tensorflow-playground
36. https://github.com/sathishvj/awesome-gcp-certifications/blob/master/professional-data-engineer.md
37. https://github.com/Leverege/gcp-data-engineer-exam
38. https://www.examtopics.com/exams/google/professional-data-engineer/view/
39. https://www.gcp-examquestions.com/google-associate-cloud-engineer-practice-exam-part-1/
40. https://www.examtopics.com/exams/google/associate-cloud-engineer/
29. https://cloud.google.com/certification/sample-questions/cloud-engineer
30. https://jayendrapatil.com/google-cloud-associate-cloud-engineer-certification-learning-path/
31. https://www.examtopics.com/exams/google/professional-cloud-architect/view/
32. https://cloud.google.com/load-balancing/docs/backend-service
33. https://cloud.google.com/architecture/identity/federating-gcp-with-active-directory-introduction
34. https://cloud.google.com/datastore/docs/best-practices
35. https://cloud.google.com/compute/docs/images/image-management-best-practices
36. https://cloud.google.com/architecture/framework
37. https://cloud.google.com/firewall/docs/firewalls#default_firewall_rules
38. https://cloud.google.com/architecture/migrating-vms-migrate-for-compute-engine-best-practices
39. https://www.gcp-examquestions.com/course/google-professional-cloud-architect-practice-exam/
40. https://cloud.google.com/learn/certification/cloud-architect