Preparation for Cloud Architect

--- title: 'Preparation for Cloud Architect' disqus: hackmd tags: Cathay --- {%hackmd BJrTq20hE %} Preparation for Cloud Architect === ![downloads](https://img.shields.io/github/downloads/atom/atom/total.svg) ## Table of Contents [TOC] ## Hint :::info :bulb: This article is for cloud architect interview. 我們組名稱為數位發展中心-雲端策略發展部(CSD)-金融醫療科-數據組我這一小組的組內目前有三個專案: 1. 全方位醫療數據平台 (目前進度:整體架構建立, 包含ETL, data lakehouse, 預計串接組內另外三個上線中的產品, 包括醫療電子錢包, [FHIR轉換平台](https://dlvcl9.axshare.com/?id=giuyyw&p=%E6%A1%88%E4%BB%B6%E7%AE%A1%E7%90%86-%E6%B5%81%E7%A8%8B%E5%9C%96__%E9%86%AB%E8%AD%B7%E4%BA%BA%E5%93%A1%E6%96%BC_ifhir_%E6%93%8D%E4%BD%9C_&g=1), [人壽理賠平台](https://dlvcl9.axshare.com/?id=giuyyw&p=%E6%A1%88%E4%BB%B6%E7%AE%A1%E7%90%86-%E6%B5%81%E7%A8%8B%E5%9C%96__%E9%86%AB%E8%AD%B7%E4%BA%BA%E5%93%A1%E6%96%BC_ifhir_%E6%93%8D%E4%BD%9C_&g=1)) 2. 長照醫療物聯網平台 IoMT hardware toolkit 3. 自然語言轉SQL並做視覺化分析 no-code tool devlopment 上述 2, 3是由我負責 ::: ![](https://i.imgur.com/Cmmkcrn.png) ## Keyword :::info 資料庫 -> DB2, Teradata, PostgreSQL, MySQL, Oracle (sql) MongoDB、Cassandra、Redis (no-sql) 雲端平台 -> GCP (google cloud platform) AWS (amazon web services) Azure (Microsoft Azure) DataLake, Data warehouse -> 資料倉儲 SQL語法 & ETL ::: [DataLake definition](https://aws.amazon.com/tw/big-data/datalakes-and-analytics/what-is-a-data-lake/) [Data warehouse definition](https://aws.amazon.com/tw/data-warehouse/) [闖蕩16年的大數據架構師，帶你穿越資料倉儲的演變史 ](https://communeit.medium.com/%E9%97%96%E8%95%A916%E5%B9%B4%E7%9A%84%E5%A4%A7%E6%95%B8%E6%93%9A%E6%9E%B6%E6%A7%8B%E5%B8%AB-%E5%B8%B6%E4%BD%A0%E7%A9%BF%E8%B6%8A%E6%95%B8%E6%93%9A%E5%80%89%E7%9A%84%E6%BC%94%E8%AE%8A%E5%8F%B2-%E4%B8%80-%E5%BE%9E0%E5%88%B01%E7%9A%84%E8%9B%BB%E8%AE%8A-9d7cf07aecf8) 延伸閱讀：[什麼？我寫的一條SQL讓公司網站癱瘓了…SQL慢查詢改善方案](https://medium.com/%E6%95%B8%E6%93%9A%E5%88%86%E6%9E%90%E4%B8%8D%E6%98%AF%E5%80%8B%E4%BA%8B/%E4%BB%80%E9%BA%BC-%E6%88%91%E5%AF%AB%E7%9A%84%E4%B8%80%E6%A2%9Dsql%E8%AE%93%E5%85%AC%E5%8F%B8%E7%B6%B2%E7%AB%99%E7%99%B1%E7%98%93%E4%BA%86-sql%E6%85%A2%E6%9F%A5%E8%A9%A2%E6%94%B9%E5%96%84%E6%96%B9%E6%A1%88-b354b4114ce2) ## Familiar with complex SQL usage & ETL - [一些進階的sql語法參考](https://www.1keydata.com/tw/sql/advanced.html) 大概會考的問題程度： ``` * 什麼是 SQL JOIN 和 SQL UNION？它們有何不同之處？ * SQL INTERSECT 和 SQL MINUS 是什麼？它們如何使用？ * 什麼是 SQL Subquery？它有什麼應用場景？ * SQL EXISTS 的作用是什麼？它如何使用？ * 什麼是 SQL CASE？它有什麼應用場景？ * 什麼是 SQL 窗口函數？它有哪些常用的函數？ * 寫出一個使用 SQL UNION ALL 和 ORDER BY 的查詢語句。 * 寫出一個使用 SQL Subquery 和 WHERE 子句的查詢語句。 * 寫出一個使用 SQL EXISTS 和 NOT EXISTS 的查詢語句。 * 寫出一個使用 SQL CASE 和 GROUP BY 的查詢語句，並設置一個別名為“total”。 ``` ETL (Extract, Transform, Load data)是什麼？ ETL 的常见技术方案是什么？ ETL常用的三种工具介绍及对比 Datastage，Informatica和Kettle? :::warning [30天學會Data Integration](https://https://ithelp.ithome.com.tw/users/20091626/ironman/1230) - Kettle - 6hr ::: ## Familiar with No-SQL & SQL database ``` 傳統型資料倉庫：一般做法是先將源資料匯出成檔案形式，再透過FTP傳輸到資料倉庫伺服器，然後再載入進資料倉庫介面層或者ods層。這個過程主要是涉及到ETL中的E和L，其實現工具主要是資料庫自帶的匯入匯出工具，或者是kettle、datastage、informatic等etl工具。 Nosql型資料倉庫：一般是基於hadoop生態的資料倉庫，不僅有結構化資料的採集，也有非結構資料的採集。結構化資料採集：可同傳統型資料倉庫一樣，透過落地檔案來完成；也可使用hadoop元件sqoop，只需簡單配置，便可同時完成抽取和載入。如果是實時採集，可以透過maxwell或者flume等工具將資料先放入kafka，再透過流式計算元件sparkstreaming或者flink進行加工轉換或者直接存入hdfs；非結構化資料採集：主要為日誌資料，同樣可使用flume進行採集，直接載入進hdfs或者先進入kafka。 ``` ## Familiar with the data warehouse system including but not limited to “Bigquery”, “Redshift”, or “Snowflake”- - Bigquery: google, Redshift: Amazon, Snowflake: Azure Bigquery 解釋及實作 https://blog.cloud-ace.tw/big-data/bigquery-tutorial/ - 3hr https://vocus.cc/article/62e22cecfd89780001813f92 - 0.5hr ## Experience with processing huge data set efficiently What is meant by "experience in working with large datasets" on a job requirement list? (The optimal solution as below.) ``` It can mean several things, listed here in decreasing order of stringency: Working with data that doesn’t fit in RAM and has to be distributed across many machines (using message passing or some frameworks such as Hadoop and Spark) Working with data that doesn’t fit in RAM but can be processed locally from a hard drive Working with data which barely fits in RAM (typically requires some advanced programming skills, e.g. C++, Java and algorithms) Working with data that is bigger than what you can process using some basic statistics software and packages (typically requires using Python/R and some knowledge of algorithms) Working with data that is bigger than fits into person’s memory (typically requires using Excel) ``` ## Experience with data pipeline system building HADOOP https://cloud.google.com/learn/what-is-hadoop?hl=zh-tw SPARK https://cloud.google.com/learn/what-is-apache-spark?hl=zh-tw STORM https://storm.apache.org/ ## Experience with public cloud systems including but not limited to GCP, AWS and Azure [【懶人包】Google Cloud 基礎教學資源彙集 - 6hr](https://ikala.cloud/google-cloud-products-quick-start/)