AWS Certified Data Engineer - Associate (DEA-C01)
===

## REF

* [Question bank](https://www.examtopics.com/exams/amazon/aws-certified-data-engineer-associate-dea-c01/view/5/)

## Questions

### Q1
#### A data engineer is configuring an AWS Glue job to read data from an Amazon S3 bucket. The data engineer has set up the necessary AWS Glue connection details and an associated IAM role. However, when the data engineer attempts to run the AWS Glue job, the data engineer receives an error message that indicates that there are problems with the Amazon S3 VPC gateway endpoint. The data engineer must resolve the error and connect the AWS Glue job to the S3 bucket. Which solution will meet this requirement?

- [ ] A. Update the AWS Glue security group to allow inbound traffic from the Amazon S3 VPC gateway endpoint.
- [ ] B. Configure an S3 bucket policy to explicitly grant the AWS Glue job permissions to access the S3 bucket.
- [ ] C. Review the AWS Glue job code to ensure that the AWS Glue connection details include a fully qualified domain name.
- [x] D. Verify that the VPC's route table includes inbound and outbound routes for the Amazon S3 VPC gateway endpoint.

### Q2
#### A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries use the data hub to support company-wide analytics. A governance team must ensure that the company's data analysts can access data only for customers who are within the same country as the analysts. Which solution will meet these requirements with the LEAST operational effort?

- [ ] A. Create a separate table for each country's customer data. Provide access to each analyst based on the country that the analyst serves.
- [x] B. Register the S3 bucket as a data lake location in AWS Lake Formation. Use the Lake Formation row-level security features to enforce the company's access policies.
- [ ] C. Move the data to AWS Regions that are close to the countries where the customers are. Provide access to each analyst based on the country that the analyst serves.
- [ ] D. Load the data into Amazon Redshift. Create a view for each country. Create separate IAM roles for each country to provide access to data from each country. Assign the appropriate roles to the analysts.

### Q3
#### A media company wants to improve a system that recommends media content to customers based on user behavior and preferences. To improve the recommendation system, the company needs to incorporate insights from third-party datasets into the company's existing analytics platform. The company wants to minimize the effort and time required to incorporate third-party datasets. Which solution will meet these requirements with the LEAST operational overhead?

- [x] A. Use API calls to access and integrate third-party datasets from AWS Data Exchange.
- [ ] B. Use API calls to access and integrate third-party datasets from AWS DataSync.
- [ ] C. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories.
- [ ] D. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR).
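
> For Q1, one quick way to confirm the route-table side of the fix is to list the S3 gateway endpoint in the Glue connection's VPC and check which route tables it is associated with. A minimal boto3 sketch; the region, VPC ID, and bucket-side details are placeholder assumptions.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

# Look for the S3 endpoint attached to the Glue connection's VPC (VPC ID is a placeholder).
endpoints = ec2.describe_vpc_endpoints(
    Filters=[
        {"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]},
        {"Name": "service-name", "Values": ["com.amazonaws.us-east-1.s3"]},
    ]
)["VpcEndpoints"]

for ep in endpoints:
    if ep["VpcEndpointType"] == "Gateway":
        # A gateway endpoint only works for subnets whose route table appears here;
        # the route to the S3 prefix list is added to these tables automatically.
        print(ep["VpcEndpointId"], "associated route tables:", ep["RouteTableIds"])

if not endpoints:
    print("No S3 gateway endpoint in this VPC - create one and associate the subnet's route table.")
```
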
### Q4
#### A financial company wants to implement a data mesh. The data mesh must support centralized data governance, data analysis, and data access control. The company has decided to use AWS Glue for data catalogs and extract, transform, and load (ETL) operations. Which combination of AWS services will implement a data mesh? (Choose two.)

- [ ] A. Use Amazon Aurora for data storage. Use an Amazon Redshift provisioned cluster for data analysis.
- [x] B. Use Amazon S3 for data storage. Use Amazon Athena for data analysis.
- [ ] C. Use AWS Glue DataBrew for centralized data governance and access control.
- [ ] D. Use Amazon RDS for data storage. Use Amazon EMR for data analysis.
- [x] E. Use AWS Lake Formation for centralized data governance and access control.

### Q5
#### A data engineer maintains custom Python scripts that perform a data formatting process that many AWS Lambda functions use. When the data engineer needs to modify the Python scripts, the data engineer must manually update all the Lambda functions. The data engineer requires a less manual way to update the Lambda functions. Which solution will meet this requirement?

- [ ] A. Store a pointer to the custom Python scripts in the execution context object in a shared Amazon S3 bucket.
- [x] B. Package the custom Python scripts into Lambda layers. Apply the Lambda layers to the Lambda functions.
- [ ] C. Store a pointer to the custom Python scripts in environment variables in a shared Amazon S3 bucket.
- [ ] D. Assign the same alias to each Lambda function. Call each Lambda function by specifying the function's alias.

### Q6
#### A company created an extract, transform, and load (ETL) data pipeline in AWS Glue. A data engineer must crawl a table that is in Microsoft SQL Server. The data engineer needs to extract, transform, and load the output of the crawl to an Amazon S3 bucket. The data engineer also must orchestrate the data pipeline. Which AWS service or feature will meet these requirements MOST cost-effectively?

- [ ] A. AWS Step Functions
- [x] B. AWS Glue workflows
- [ ] C. AWS Glue Studio
- [ ] D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)

### Q7
#### A financial services company stores financial data in Amazon Redshift. A data engineer wants to run real-time queries on the financial data to support a web-based trading application. The data engineer wants to run the queries from within the trading application. Which solution will meet these requirements with the LEAST operational overhead?

- [ ] A. Establish WebSocket connections to Amazon Redshift.
- [x] B. Use the Amazon Redshift Data API.
- [ ] C. Set up Java Database Connectivity (JDBC) connections to Amazon Redshift.
- [ ] D. Store frequently accessed data in Amazon S3. Use Amazon S3 Select to run the queries.

### Q8
#### A company uses Amazon Athena for one-time queries against data that is in Amazon S3. The company has several use cases. The company must implement permission controls to separate query processes and access to query history among users, teams, and applications that are in the same AWS account. Which solution will meet these requirements?

- [ ] A. Create an S3 bucket for each use case. Create an S3 bucket policy that grants permissions to appropriate individual IAM users. Apply the S3 bucket policy to the S3 bucket.
- [x] B. Create an Athena workgroup for each use case. Apply tags to the workgroup. Create an IAM policy that uses the tags to apply appropriate permissions to the workgroup.
- [ ] C. Create an IAM role for each use case. Assign appropriate permissions to the role for each use case. Associate the role with Athena.
- [ ] D. Create an AWS Glue Data Catalog resource policy that grants permissions to appropriate individual IAM users for each use case. Apply the resource policy to the specific tables that Athena uses.
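
> For Q8, the separation happens per workgroup: each workgroup keeps its own query history, result location, and limits. A minimal boto3 sketch of creating a tagged workgroup; the workgroup name, bucket, and tag values are assumptions.

```python
import boto3

athena = boto3.client("athena")

# One workgroup per use case, with its own query-result location.
athena.create_work_group(
    Name="marketing-adhoc",  # assumed workgroup name
    Configuration={
        "ResultConfiguration": {"OutputLocation": "s3://example-athena-results/marketing/"},
        "EnforceWorkGroupConfiguration": True,
        "PublishCloudWatchMetricsEnabled": True,
    },
    Tags=[{"Key": "team", "Value": "marketing"}],
)

# An IAM policy can then allow athena:StartQueryExecution / athena:GetQueryExecution
# only on workgroups carrying the matching tag, e.g. with a condition such as
# {"StringEquals": {"aws:ResourceTag/team": "marketing"}} on the workgroup ARN.
```
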
### Q9
#### A data engineer needs to schedule a workflow that runs a set of AWS Glue jobs every day. The data engineer does not require the Glue jobs to run or finish at a specific time. Which solution will run the Glue jobs in the MOST cost-effective way?

- [x] A. Choose the FLEX execution class in the Glue job properties.
- [ ] B. Use the Spot Instance type in Glue job properties.
- [ ] C. Choose the STANDARD execution class in the Glue job properties.
- [ ] D. Choose the latest version in the GlueVersion field in the Glue job properties.

### Q10
#### A data engineer needs to create an AWS Lambda function that converts the format of data from .csv to Apache Parquet. The Lambda function must run only if a user uploads a .csv file to an Amazon S3 bucket. Which solution will meet these requirements with the LEAST operational overhead?

- [x] A. Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
- [ ] B. Create an S3 event notification that has an event type of s3:ObjectTagging:* for objects that have a tag set to .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
- [ ] C. Create an S3 event notification that has an event type of s3:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set the Amazon Resource Name (ARN) of the Lambda function as the destination for the event notification.
- [ ] D. Create an S3 event notification that has an event type of s3:ObjectCreated:*. Use a filter rule to generate notifications only when the suffix includes .csv. Set an Amazon Simple Notification Service (Amazon SNS) topic as the destination for the event notification. Subscribe the Lambda function to the SNS topic.

### Q11
#### A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column. Which solution will MOST speed up the Athena query performance?

- [ ] A. Change the data format from .csv to JSON format. Apply Snappy compression.
- [ ] B. Compress the .csv files by using Snappy compression.
- [x] C. Change the data format from .csv to Apache Parquet. Apply Snappy compression.
- [ ] D. Compress the .csv files by using gzip compression.
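
> For Q10 (and the Parquet point in Q11), a minimal sketch of the Lambda handler that the s3:ObjectCreated:* notification with a .csv suffix filter would invoke. It assumes pandas and pyarrow are packaged for the function (for example via a Lambda layer, as in Q5); the bucket handling and output prefix are assumptions.

```python
import io
import urllib.parse

import boto3
import pandas as pd  # assumes pandas + pyarrow are available via a Lambda layer

s3 = boto3.client("s3")

def handler(event, context):
    # The S3 event notification carries the bucket name and the .csv object key.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        df = pd.read_csv(io.BytesIO(body))

        # Columnar Parquet with Snappy compression, so Athena scans only the selected columns.
        out = io.BytesIO()
        df.to_parquet(out, engine="pyarrow", compression="snappy", index=False)

        parquet_key = "parquet/" + key.rsplit(".", 1)[0] + ".parquet"  # assumed output prefix
        s3.put_object(Bucket=bucket, Key=parquet_key, Body=out.getvalue())
```
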
### Q12
#### A manufacturing company collects sensor data from its factory floor to monitor and enhance operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3 bucket. The company needs to display a real-time view of operational efficiency on a large screen in the manufacturing facility. Which solution will meet these requirements with the LOWEST latency?

- [x] **A. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.**
- [ ] B. Configure the S3 bucket to send a notification to an AWS Lambda function when any new object is created. Use the Lambda function to publish the data to Amazon Aurora. Use Aurora as a source to create an Amazon QuickSight dashboard.
- [ ] C. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Create a new Data Firehose delivery stream to publish data directly to an Amazon Timestream database. Use the Timestream database as a source to create an Amazon QuickSight dashboard.
- [ ] D. Use AWS Glue bookmarks to read sensor data from the S3 bucket in real time. Publish the data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.

> Grafana is faster than QuickSight for this kind of real-time dashboard.

### Q13
#### A company stores daily records of the financial performance of investment portfolios in .csv format in an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data. The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog. Which solution will meet these requirements?

- [ ] A. Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Configure the output destination to a new path in the existing S3 bucket.
- [x] B. Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Specify a database name for the output.
- [ ] C. Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Specify a database name for the output.
- [ ] D. Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Configure the output destination to a new path in the existing S3 bucket.

> Start by elimination: the options that grant the overly broad AmazonS3FullAccess policy should be ruled out.
> That leaves B and D, which attach only the AWSGlueServiceRole policy to the IAM role and associate it with the AWS Glue crawler, which is the sensible setup.
> D is wrong because you don't need to allocate DPUs, and the output destination should be a database, not an S3 bucket, so the answer is B.
> In other words, the crawler's output belongs in a Data Catalog database; DPU metrics only show how many resources the crawler consumes (useful for scaling or saving cost), which is not the main concern here.

### Q14
#### A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded. A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB. How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?

- [ ] A. Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.
- [x] B. Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.
- [ ] C. Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.
- [ ] D. Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.

> A company loads daily transaction data into Redshift at the end of each day.
> The engineer builds a Lambda function that writes the load statuses into DynamoDB; the question is how the Lambda function should be invoked.
> EventBridge is chosen because it can schedule Lambda tasks and can also trigger the Lambda function directly from Redshift Data API events.
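
> For Q14, a minimal sketch of the Lambda handler that an EventBridge rule (matching the Redshift Data API completion event) could invoke to record a table's load status. The DynamoDB table name `redshift_load_status`, its key layout, and the event detail fields read here are assumptions.

```python
import datetime

import boto3

dynamodb = boto3.resource("dynamodb")
status_table = dynamodb.Table("redshift_load_status")  # assumed DynamoDB table

def handler(event, context):
    # EventBridge delivers the matched event; the detail layout below is an assumption
    # about how the load job reports which Redshift table finished and whether it succeeded.
    detail = event.get("detail", {})
    status_table.put_item(
        Item={
            "table_name": detail.get("redshiftTable", "unknown"),
            "load_date": datetime.date.today().isoformat(),
            "status": detail.get("state", "UNKNOWN"),
            "recorded_at": datetime.datetime.utcnow().isoformat(),
        }
    )
    return {"written": True}
```
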
### Q15
#### A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be regularly propagated to the S3 bucket. The data includes files that are in multiple formats. The data engineer needs to automate the transfer process and must schedule the process to run periodically. Which AWS service should the data engineer use to transfer the data in the MOST operationally efficient way?

- [x] A. AWS DataSync
- [ ] B. AWS Glue
- [ ] C. AWS Direct Connect
- [ ] D. Amazon S3 Transfer Acceleration

> AWS DataSync is a managed data transfer service that simplifies and accelerates moving large amounts of data online between on-premises storage and Amazon S3, EFS, or FSx for Windows File Server. DataSync is optimized for efficient, incremental, and reliable transfers of large datasets, making it suitable for transferring 5 TB of data with daily updates.

### Q16
#### A company uses an on-premises Microsoft SQL Server database to store financial transaction data. The company migrates the transaction data from the on-premises database to AWS at the end of each month. The company has noticed that the cost to migrate data from the on-premises database to an Amazon RDS for SQL Server database has increased recently. The company requires a cost-effective solution to migrate the data to AWS. The solution must cause minimal downtime for the applications that access the database. Which AWS service should the company use to meet these requirements?

- [ ] A. AWS Lambda
- [x] B. AWS Database Migration Service (AWS DMS)
- [ ] C. AWS Direct Connect
- [ ] D. AWS DataSync

> An on-premises Microsoft database stores transaction data that is moved to the cloud every month; migration costs have risen recently and the company wants to save money.
> This really just tests whether you know what DMS is for: copying database data into AWS. In this scenario it is used as a monthly batch migration, not real-time synchronization.

### Q17
#### A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load (ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform transformations, and load the transformed data into Amazon Redshift for analytics. The data updates must occur every hour. Which combination of tasks will meet these requirements with the LEAST operational overhead? (Choose two.)

- [x] A. Configure AWS Glue triggers to run the ETL jobs every hour.
- [ ] B. Use AWS Glue DataBrew to clean and prepare the data for analytics.
- [ ] C. Use AWS Lambda functions to schedule and run the ETL jobs every hour.
- [x] D. Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift.
- [ ] E. Use the Redshift Data API to load transformed data into Amazon Redshift.

> AWS prefers to steer users toward Glue and advertises it as easy to set up.
> The transformed data naturally has to be written from Glue into Redshift, which is what Glue connections provide.

### Q18
#### A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling. Which solution will meet this requirement?

- [ ] A. Turn on concurrency scaling in workload management (WLM) for Redshift Serverless workgroups.
- [x] B. Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.
- [ ] C. Turn on concurrency scaling in the settings during the creation of any new Redshift cluster.
- [ ] D. Turn on concurrency scaling for the daily usage quota for the Redshift cluster.

> Know Redshift's workload management (WLM) feature.
> Concurrency scaling in Amazon Redshift allows the cluster to automatically add and remove compute resources in response to workload demands. Enabling concurrency scaling at the workload management (WLM) queue level allows you to specify which queues can benefit from concurrency scaling based on the query workload.
> https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling-queues.html
> You route queries to concurrency scaling clusters by enabling a workload manager (WLM) queue as a concurrency scaling queue. To turn on concurrency scaling for a queue, set the Concurrency Scaling mode value to auto.
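
> For Q18, on a provisioned cluster the per-queue setting lives in the `wlm_json_configuration` parameter of the cluster's parameter group. A hedged boto3 sketch, assuming a manual WLM setup; the parameter-group name and queue layout are placeholders.

```python
import json

import boto3

redshift = boto3.client("redshift")

# Two queues: the BI queue gets concurrency scaling, the default queue does not.
wlm_config = [
    {"query_group": ["bi"], "query_concurrency": 5, "concurrency_scaling": "auto"},
    {"query_concurrency": 5, "concurrency_scaling": "off"},  # default queue
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="custom-wlm-params",  # assumed parameter group attached to the cluster
    Parameters=[
        {"ParameterName": "wlm_json_configuration", "ParameterValue": json.dumps(wlm_config)}
    ],
)
# The new WLM configuration takes effect when the parameter group change is applied to the cluster.
```
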
### Q19 `*****`
#### A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each query can run for more than 15 minutes. Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)

- [x] A. Use an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
- [x] B. Create an AWS Step Functions workflow and add two states. Add the first state before the Lambda function. Configure the second state as a Wait state to periodically check whether the Athena query has finished using the Athena Boto3 get_query_execution API call. Configure the workflow to invoke the next query when the current query has finished running.
- [ ] C. Use an AWS Glue Python shell job and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
- [ ] D. Use an AWS Glue Python shell script to run a sleep timer that checks every 5 minutes to determine whether the current Athena query has finished running successfully. Configure the Python shell script to invoke the next query when the current query has finished running.
- [ ] E. Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the Athena queries in AWS Batch.

> A: Lambda has a 15-minute execution limit, so at first glance A looks ruled out; but because start_query_execution is asynchronous, the work can be split into two parts (start the query, then check on it later), so A still works.
> B: Step Functions: https://docs.aws.amazon.com/zh_tw/step-functions/latest/dg/welcome.html
> C: running a Glue Python shell job just to launch the queries is definitely more expensive than B, where Step Functions tracks the query state through the Lambda function.
> D: sleeping inside a Glue job does not save money; you keep paying while it sleeps.
> E: looks feasible, but it does not save money either.
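
> For Q19, a minimal sketch of the two Lambda handlers the Step Functions workflow would call: one starts the Athena query and returns immediately, the other checks whether it has finished after a Wait state. The database name, output location, and state-machine wiring are assumptions.

```python
import boto3

athena = boto3.client("athena")

def start_handler(event, context):
    # Kick off the query and return right away; the query keeps running in Athena.
    response = athena.start_query_execution(
        QueryString=event["query"],  # the SQL text is passed in by the state machine
        QueryExecutionContext={"Database": "analytics_db"},  # assumed database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/daily/"},
    )
    return {"QueryExecutionId": response["QueryExecutionId"]}

def check_handler(event, context):
    # Called after a Wait state; the workflow loops until the state is terminal
    # (SUCCEEDED, FAILED, or CANCELLED), then moves on to the next query.
    state = athena.get_query_execution(QueryExecutionId=event["QueryExecutionId"])[
        "QueryExecution"
    ]["Status"]["State"]
    return {"QueryExecutionId": event["QueryExecutionId"], "State": state}
```
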
### Q20
#### A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options. The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache Hbase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS. Which extract, transform, and load (ETL) service will meet these requirements?

- [ ] A. AWS Glue
- [x] B. Amazon EMR
- [ ] C. AWS Lambda
- [ ] D. Amazon Redshift

> This question is really asking for the AWS equivalent of a Hadoop stack, which is Amazon EMR.

### Q21
#### A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII. Which solution will meet this requirement with the LEAST operational effort?

- [ ] A. Use an Amazon Kinesis Data Firehose delivery stream to process the dataset. Create an AWS Lambda transform function to identify the PII. Use an AWS SDK to obfuscate the PII. Set the S3 data lake as the target for the delivery stream.
- [ ] B. Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
- [x] C. Use the Detect PII transform in AWS Glue Studio to identify the PII. Create a rule in AWS Glue Data Quality to obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
- [ ] D. Ingest the dataset into Amazon DynamoDB. Create an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data. Use the same Lambda function to ingest the data into the S3 data lake.

> The key point is that AWS Glue Studio has a Detect PII transform.
> So the choice comes down to B or C.
> B: looks like the simplest to implement.
> C: its supporters argue that B does not actually work, because detection alone cannot perform the obfuscation. We cannot directly handle PII with Glue Studio, and Glue Data Quality can be used to handle PII.
> The Detect PII transform in AWS Glue Studio is specifically used to identify personally identifiable information (PII) within the data. It can detect and flag this information, but on its own, it does not perform the obfuscation or removal of these details. To effectively obfuscate or alter the identified PII, an additional transformation would be necessary. This could be accomplished in several ways, such as writing a custom script within the same AWS Glue job using Python or Scala to modify the PII data as needed, or using AWS Glue Data Quality, if available, to create rules that automatically obfuscate or modify the data identified as PII. AWS Glue Data Quality is a newer tool that helps improve data quality through rules and transformations, but whether it's needed will depend on the functionality's availability and the specificity of the obfuscation requirements.
> I choose C, because the documentation covers it: https://docs.aws.amazon.com/zh_tw/glue/latest/dg/detect-PII.html
> Choosing how to handle the identified PII data:
> * If you choose to detect PII across the entire data source, you can select a global action to apply:
> * Enrich data with detection results: if you choose Detect PII in each cell, the detected entities can be stored in a new column.
> * Redact detected text: detected PII values are replaced with the string given in the optional replacement-text field; if no string is specified, detected PII entities are replaced with '*******'.
> * Partially redact detected text: part of each detected PII value is replaced with a chosen string. Two options are available: keep the trailing characters unmasked, or mask with an explicit regex pattern. This feature is not available in AWS Glue 2.0.
> * Apply cryptographic hash: the detected PII value is passed to a SHA-256 cryptographic hash function and replaced with the function's output.

### Q22
#### A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the company's operational databases into an Amazon S3 based data lake. The ETL workflows use AWS Glue and Amazon EMR to process data. The company wants to improve the existing architecture to provide automated orchestration and to require minimal manual effort. Which solution will meet these requirements with the LEAST operational overhead?

- [ ] A. AWS Glue workflows
- [x] B. AWS Step Functions tasks
- [ ] C. AWS Lambda functions
- [ ] D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows

> Because there are multiple ETL workflows that also involve EMR, Glue workflows alone only orchestrate work inside Glue (the **data cleansing** part); they are a poor fit for driving EMR, and forcing it would get expensive.
> When you see "orchestration" across multiple services, think of AWS Step Functions tasks.
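
> For Q22, a hedged sketch of what a Step Functions definition for the Glue + EMR pipeline might look like, expressed as a Python dict and registered with boto3. The Glue job name, EMR cluster ID, script path, and role ARN are placeholders; the `.sync` service integrations make Step Functions wait for each task to finish before moving on.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "ingest-operational-db"},  # placeholder Glue job
            "Next": "RunEmrStep",
        },
        "RunEmrStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId": "j-XXXXXXXXXXXXX",  # placeholder EMR cluster
                "Step": {
                    "Name": "transform",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "s3://example-bucket/jobs/transform.py"],
                    },
                },
            },
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="etl-orchestration",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEtlRole",  # placeholder role
)
```
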
### Q23
#### A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class. A data engineer examined data access patterns to identify trends. During the first 6 months, most data files are accessed several times each day. Between 6 months and 2 years, most data files are accessed once or twice each month. After 2 years, data files are accessed only once or twice each year. The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new storage solution must continue to provide high availability. Which solution will meet these requirements in the MOST cost-effective way?

- [ ] A. Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
- [ ] B. Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
- [x] C. Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.
- [ ] D. Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.

> With S3 One Zone-IA, customers can store infrequently accessed data in a single Availability Zone at roughly 20% lower cost than S3 Standard-IA.
> However, the data is still being accessed and high availability is required, so One Zone-IA is out.
> S3 Glacier Deep Archive: 12-48 hours to restore.
> S3 Glacier Flexible Retrieval: retrieval in minutes, or free bulk retrievals in 5-12 hours.
> Vote split: B 32% / C 68%.
> Pick the most cost-effective plan, so choose C.
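
> For Q23, a minimal sketch of the lifecycle rule as boto3 would express it: Standard-IA after roughly 6 months and Glacier Deep Archive after roughly 2 years. The bucket name and the exact day counts are assumptions.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-bucket",  # assumed bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-objects",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 180, "StorageClass": "STANDARD_IA"},  # about 6 months
                    {"Days": 730, "StorageClass": "DEEP_ARCHIVE"},  # about 2 years
                ],
            }
        ]
    },
)
```
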
### Q24
#### A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks. The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team's BI cluster. The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster. Which solution will meet these requirements?

- [x] A. Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.
- [ ] B. Create materialized views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
- [ ] C. Create database views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
- [ ] D. Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.

> A and D both get a lot of votes.
> A currently looks like the simplest approach: Redshift data sharing lets the BI cluster read the ETL cluster's data without copying it.

### Q25 `***`
#### A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3. Which solution will meet this requirement MOST cost-effectively?

- [ ] A. Use an Amazon EMR provisioned cluster to read from all sources. Use Apache Spark to join the data and perform the analysis.
- [ ] B. Copy the data from DynamoDB, Amazon RDS, and Amazon Redshift into Amazon S3. Run Amazon Athena queries directly on the S3 files.
- [x] C. Use Amazon Athena Federated Query to join the data from all data sources.
- [ ] D. Use Redshift Spectrum to query data from DynamoDB, Amazon RDS, and Amazon S3 directly from Redshift.

> The cheapest option is required, so it is definitely not A (spinning up a separate EMR cluster) or D (pushing everything through Redshift).
> B costs extra for the copies stored in S3.
> C is reasonable, because the data already exists where it is and can be queried in place.

### Q26
#### A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance. Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)

- [ ] A. Use Hadoop Distributed File System (HDFS) as a persistent data store.
- [x] B. Use Amazon S3 as a persistent data store.
- [ ] C. Use x86-based instances for core nodes and task nodes.
- [x] D. Use Graviton instances for core nodes and task nodes.
- [ ] E. Use Spot Instances for all primary nodes.

> Cost comparison: HDFS storage costs more than S3, and x86 instances cost more than ARM (Graviton).

### Q27
#### A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools. Which solution will meet these requirements with the LEAST operational overhead?

- [ ] A. Use Kinesis Data Streams to stage data in Amazon S3. Use the COPY command to load data from Amazon S3 directly into Amazon Redshift to make the data immediately available for real-time analysis.
- [ ] B. Access the data from Kinesis Data Streams by using SQL queries. Create materialized views directly on top of the stream. Refresh the materialized views regularly to query the most recent stream data.
- [x] C. Create an external schema in Amazon Redshift to map the data from Kinesis Data Streams to an Amazon Redshift object. Create a materialized view to read data from the stream. Set the materialized view to auto refresh.
- [ ] D. Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.

> For real-time insights, don't stage the data in S3 first.
> C is correct. (KDS -> Redshift)
> D is wrong as it has more operational overhead (KDS -> KDF -> S3 -> Redshift)

### Q28
#### A company uses an Amazon QuickSight dashboard to monitor usage of one of the company's applications. The company uses AWS Glue jobs to process data for the dashboard. The company stores the data in a single Amazon S3 bucket. The company adds new data every day. A data engineer discovers that dashboard queries are becoming slower over time. The data engineer determines that the root cause of the slowing queries is long-running AWS Glue jobs. Which actions should the data engineer take to improve the performance of the AWS Glue jobs? (Choose two.)

- [x] A. Partition the data that is in the S3 bucket. Organize the data by year, month, and day.
- [x] B. Increase the AWS Glue instance size by scaling up the worker type.
- [ ] C. Convert the AWS Glue schema to the DynamicFrame schema class.
- [ ] D. Adjust AWS Glue job scheduling frequency so the jobs run half as many times each day.
- [ ] E. Modify the IAM role that grants access to AWS Glue to grant access to all S3 features.

> To fix the slowdown, work on how much data is read (partitioning) and on the hardware spec (worker type).
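
> For the partitioning half of Q28, a hedged sketch of how a Glue PySpark job can write the data back partitioned by year/month/day so downstream queries scan less data. The database, table, output path, and the assumption that year/month/day columns already exist in the data are all placeholders; this runs inside a Glue ETL job, not as a plain Python script.

```python
# Sketch of the write step inside an AWS Glue PySpark job.
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Assumed source table registered by a crawler in the Data Catalog.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="app_usage", table_name="raw_events"
)

# Writing Parquet partitioned by year/month/day lets Athena/QuickSight prune partitions.
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={
        "path": "s3://example-data-bucket/curated/events/",  # assumed output path
        "partitionKeys": ["year", "month", "day"],  # assumed columns present in the data
    },
    format="parquet",
)
```
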
### Q29
####

### Q30
####

### Q31
####

### Q32
####

### Q33
####

### Q34
####

### Q35
####

### Q36
####

### Q37
####

### Q38
####

### Q39
####

## Questions (Set 2)

### Q1
#### A data engineer working for an analytics company is building a consumer for a Kinesis Data Streams application. They have written the consumer using the Kinesis Client Library (KCL); however, they are currently receiving an ExpiredIteratorException when reading records from Kinesis Data Streams. What would you recommend to the engineer to solve their issue?

- [ ] A. Change the capacity mode of the Kinesis Data Stream to on-demand.
- [x] B. Increase WCU in DynamoDB checkpointing table.
- [ ] C. Increase the amount of shards in Kinesis Data Streams.
- [ ] D. Increase RCU in DynamoDB checkpointing table.

> The ExpiredIteratorException comes from the write side (checkpointing into the KCL's DynamoDB table), so try raising the WCU limit.
> * Read capacity unit (RCU): each API call that reads data from a table counts as one read request. Read requests can be strongly consistent, eventually consistent, or transactional. For items up to 4 KB, one RCU performs one strongly consistent read request per second; larger items require additional RCUs. For items up to 4 KB, one RCU performs two eventually consistent read requests per second, and two RCUs are required for one transactional read request per second. For example, a strongly consistent read of an 8 KB item requires two RCUs, an eventually consistent read of an 8 KB item requires one RCU, and a transactional read of an 8 KB item requires four RCUs. See the read consistency documentation for more details.
> * Write capacity unit (WCU): each API call that writes data to a table counts as one write request. For items up to 1 KB, one WCU performs one standard write request per second; larger items require additional WCUs. For items up to 1 KB, two WCUs are required for one transactional write request per second. For example, a standard write of a 1 KB item requires one WCU, a standard write of a 3 KB item requires three WCUs, and a transactional write of a 3 KB item requires six WCUs.
> * Replicated write capacity unit (rWCU): with DynamoDB global tables, data is automatically written to multiple AWS Regions of your choice. Each write happens in the local Region as well as in the replicated Regions.

### Q2
#### You are building a pipeline to process, analyze and classify images. Your image datasets contain images that you need to preprocess as a first step by resizing and enhancing image contrast. Which AWS service should you consider to use to preprocess the datasets?

- [ ] A. AWS Step Functions
- [ ] B. AWS Data Pipeline
- [x] C. AWS SageMaker Data Wrangler
- [ ] D. AWS Glue DataBrew

> Amazon SageMaker Data Wrangler reduces the time needed to aggregate and prepare tabular and image data for ML from weeks to minutes. With SageMaker Data Wrangler you can simplify data preparation and feature engineering and complete each step of the data-preparation workflow, including data selection, cleansing, exploration, visualization, and processing at scale, from a single visual interface. You can use SQL to select the data you want from a variety of data sources and import it quickly. You can then use data quality and insights reports to automatically validate data quality and detect anomalies such as duplicate rows and target leakage. SageMaker Data Wrangler includes more than 300 built-in data transformations, so you can quickly transform data without writing any code.

### Q3
#### A social media company currently stores 1TB of data within S3 and wants to be able to analyze this data using SQL queries. Which method allows the company to do this with the LEAST effort?

- [ ] A. Use AWS Kinesis Data Analytics to query the data from S3
- [x] B. Use Athena to query the data from S3
- [ ] C. Create an AWS Glue job to transfer the data to Redshift. Query the data from Redshift
- [ ] D. Use Kinesis Data Firehose to transfer the data to RDS. Query the data from RDS

> Pick the option with the fewest steps, since the data already sits in S3 and Athena can query it in place.

### Q4
#### A solutions architect at a retail company is working on an application that utilizes SQS for decoupling its components. However, when inspecting application logs, they see that SQS messages are often being processed multiple times. What would you advise the architect to fix the issue?

- [ ] A. Decrease the visibility timeout for the messages
- [ ] B. Change the application to use long polling with SQS
- [x] C. Increase the visibility timeout for the messages
- [ ] D. Change the application to use short polling with SQS

> A retail company uses SQS queues to decouple components but finds that many messages are being processed multiple times.
> Raising the visibility timeout keeps a received message hidden from other consumers while it is being processed, so the same message is not received again before the consumer finishes and deletes it.
> To prevent other consumers from processing the message again, Amazon SQS sets a visibility timeout, a period of time during which Amazon SQS prevents all consumers from receiving and processing the message. The default visibility timeout for a message is 30 seconds. The minimum is 0 seconds. The maximum is 12 hours. For information about configuring visibility timeout for a queue using the console, see Configuring queue parameters using the Amazon SQS console.
>
> ![image](https://hackmd.io/_uploads/HJK7Ap0IC.png)
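
> For Q4 in this set, a minimal sketch of raising the visibility timeout, both as a queue default and per receive call, so a message stays invisible while it is being processed. The queue URL and timeout values are assumptions.

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder queue

# Raise the queue's default visibility timeout above the worst-case processing time.
sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"VisibilityTimeout": "300"})

# Or override it for a single receive; the message stays hidden from other consumers
# for 300 seconds unless it is deleted (processed) first.
messages = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=1, VisibilityTimeout=300, WaitTimeSeconds=10
)
for message in messages.get("Messages", []):
    # ... process the message, then delete it so it is not redelivered ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```
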
### Q5
#### Which of the following services are capable of reading from AWS Kinesis Data Streams (SELECT THREE)?

- [x] A. Amazon Managed Service for Apache Flink
- [ ] B. EFS
- [ ] C. S3
- [x] D. **EC2**
- [x] E. EMR

> * The currently available services for processing data from Kinesis Data Streams are: Amazon Managed Service for Apache Flink, Spark on Amazon EMR, EC2, Lambda, Kinesis Data Firehose, and the Kinesis Client Library.
> * S3 and EFS can be output locations using the previously mentioned services; however, they cannot be used with Kinesis Data Streams without an intermediary processing step/service.
> * An intermediary processing layer is required before the data can land in S3 or EFS.

### Q6
#### A company has a daily ETL process, which processes transactions from a production database. The process is not time sensitive and can be run at any point of the day. Currently, the company is in the process of migrating the ETL job to an AWS Glue Spark job. As a Certified Data Engineer, what would be the most cost-efficient way to structure the Glue ETL job?

- [ ] A. Set the Glue to version 2.0
- [x] B. Set the execution class of the Glue job to FLEX
- [ ] C. Set the execution class of the Glue job to STANDARD
- [ ] D. Set the Glue job to use Spot instances

> Same reasoning as Q9 in the first set: the job is not time sensitive, so the FLEX execution class is the cost saver; AWS Glue jobs do not expose a Spot Instance option.

### Q7
#### Which of the following statements are CORRECT regarding AWS SQS (SELECT TWO)?

- [x] A. Message size is limited to 256KB
- [ ] B. Messages cannot be duplicated
- [ ] C. Messages will be delivered in order
- [x] D. Messages can be duplicated

### Q8
#### A data engineer at a company is tasked with designing a data integration and transformation solution for the organization's data lake in AWS. The goal is to ensure efficient, automated, and scalable data ingestion and transformation workflows. Which AWS service is best suited for achieving this, offering capabilities for data cataloging, ETL job orchestration, and serverless execution?

- [x] **A. AWS Glue**
- [ ] B. AWS Lambda
- [ ] C. AWS Data Pipeline
- [ ] **D. AWS Step Functions**

> AWS Step Functions. Explanation: AWS Step Functions are great for orchestrating serverless workflows, but do not offer the full range of ETL and data cataloging capabilities required for a data lake integration and transformation solution.
> The answer is A, because Glue runs serverless.
> Step Functions, by contrast, offers neither full ETL capability nor data cataloging; it would need extra services bolted on for the data lake.

### Q9
#### A data engineer working in an analytics company has been tasked to migrate their Apache Cassandra database to AWS. What AWS service should the engineer use to migrate the database to AWS, with the LEAST amount of operational overhead?

- [ ] A. DocumentDB
- [ ] B. Amazon Neptune
- [x] C. **Amazon Keyspaces**
- [ ] D. Amazon RDS

> Memorization question: Amazon Keyspaces is the managed Cassandra service. Cassandra is an open-source distributed NoSQL database system, originally developed at Meta (Facebook) to improve search performance over simple structured data in its email/inbox system.
> Amazon Neptune: Managed Graph Database - AWS
> Amazon DocumentDB: managed JSON document database - AWS

### Q10
#### A data analyst at a social media company wants to create a new Redshift table from a query. What would you recommend to the analyst?

- [x] A. Use the SELECT INTO command on Redshift to query data and create a table from the results of the query
- [ ] B. Use the CREATE TABLE command on Redshift to create a table from a given query
- [ ] C. Use the COPY command on Redshift to create a table from a given query
- [ ] D. Use the SELECT command on Redshift and save the intermediary results to S3.
- [ ] E. Use the COPY command to create a new table on Redshift from the S3 data

> Concept question: tests whether the analyst knows this.
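
> For the last question, a hedged sketch of creating a table from a query by submitting the SQL through the Redshift Data API, which also illustrates the Q7 pattern from the first set. Both SELECT ... INTO and CREATE TABLE ... AS work in Redshift; the cluster, database, user, and table names below are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# CTAS form; the equivalent SELECT INTO form would be:
#   SELECT user_id, COUNT(*) AS post_count INTO daily_post_counts FROM posts GROUP BY user_id;
sql = """
CREATE TABLE daily_post_counts AS
SELECT user_id, COUNT(*) AS post_count
FROM posts
GROUP BY user_id;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",  # placeholder cluster
    Database="social",                      # placeholder database
    DbUser="analyst",                       # placeholder user (or use SecretArn instead)
    Sql=sql,
)
print("statement id:", response["Id"])  # poll with describe_statement(Id=...) if needed
```
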