Try   HackMD

Azure Synapse Analytics 2020/10/14

tags: Azure Azure Synapse Microsoft Synapse Data Warehouse 雲端 分析 資料倉儲 數據倉儲

2020/10/14 13:00-17:00
微軟研討會
打破資料孤島,構建資料與業務中台的關鍵服務:Azure Synapse Analytics 實作坊

Azure Portal上目前有3種選擇

  • Azure Synapse Analytics(formerly SQl DW)
  • Azure Synapse Analytics (workspaces preview)
  • Azure Synapse Analytics (private link hubs preview)

Auzre Synapse Analytics

先前為Azure DW,Azure Synapse Analytics是一組服務
建立後會出現以下Azure resource

  1. Azure Synapse Analytics(formerly SQl DW)
  1. Azure Synapse Analytics (workspaces preview)
  • Synapse workspace
  • SQL pool (如果有建立)(xxx.sql.azuresynapse.net 是由微軟配置的SQL Server)
  • Apache Spark pool (如果有建立)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Azure Synapse Analytics(formerly SQl DW)
Synapse SQL pool (data warehouse)

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Azure Synapse Analytics (workspaces preview)
Synapse workspace

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

官方文件
Azure Synapse 有四個元件:

  • Synapse SQL:完成以 T-SQL 為基礎的分析 - 正式推出
  • SQL 集區 (依據佈建的 DWU 付費)
  • SQL 隨選 (依據處理的 TB 量付費) (預覽)
  • Spark:深入整合的 Apache Spark (預覽)
  • Synapse Pipelines:混合式資料整合 (預覽)
  • Studio:整合的使用者體驗。 (預覽)

Lab File

Lab 1
Azure Synapse Analytics(formerly SQl DW) 傳統Data warehouse(沒有spark)

1.Create Azure Synapse Analytics(formerly SQl DW) in Azure portal

​[Synapse SQL pool]
​SQL Server: synapsebrad.database.windows.net
​Pool: 例如設定為 sqlpool01 (在sql物件就是db name)

​sqladmin
​Pa55w.rd1234
​
​
​performance level
​因為是測試所以將scale等級從預設1000先降到200

​平時不用Pause就只會計算儲存空間的費用

2.create workload group in Azure portal
Workload management-new workload group
設定Min esources % 例如設定為 20
Min. resources % per request 例如設定為 5
Cap resources % 預設 100

3.CREATE a LOGIN in SSMS(connect to [Synapse SQL pool])
```
master
IF NOT EXISTS (SELECT * FROM sys.sql_logins WHERE name = 'ELTLogin')

​BEGIN
​CREATE LOGIN [ELTLogin] WITH PASSWORD='Aasd456++'
​END;

​--sqlpool01
​IF NOT EXISTS (SELECT * FROM sys.database_principals WHERE name = 'ELTLogin')

​BEGIN
​CREATE USER [ELTLogin] FOR LOGIN [ELTLogin]
​GRANT CONTROL ON DATABASE::sqlpool01 TO ELTLogin 
​END;


4.Create WORKLOAD CLASSIFIER for the login in SSMS

select * from sys.workload_management_workload_groups

select * from sys.workload_management_workload_classifiers

select * from sys.workload_management_workload_classifier_details


CREATE WORKLOAD CLASSIFIER ELTLogin
WITH  
(
      WORKLOAD_GROUP = 'WGMedium'
     ,MEMBERNAME     = 'ELTLogin' 
     ,IMPORTANCE     = HIGH
     ,WLM_LABEL      = 'loadsfact'
);


SELECT c.[name], c.group_name, c.importance, cd.classifier_type, cd.classifier_value
  FROM sys.workload_management_workload_classifiers c
  JOIN sys.workload_management_workload_classifier_details cd
    ON cd.classifier_id = c.classifier_id
  WHERE c.name = 'ELTLogin'


-- gen sample load data
CREATE TABLE factstaging (ColA int)
INSERT INTO factstaging VALUES(0)
INSERT INTO factstaging VALUES(1)
INSERT INTO factstaging VALUES(2)
GO


5.登入ETLlogin帳號測試insert data in SSMS
	
	用CREATE AS SELECT寫入資料模擬分析查詢的workload
-switch user to ELTLogin
CREATE TABLE testclassifierfact2 WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM factstaging
OPTION (LABEL='loadsfact')

-- 查詢剛剛語法是否有落入指定的CLASSIFIER
SELECT TOP 1 request_id, classifier_name, group_name, resource_allocation_percentage, submit_time, [status], [label], command 
FROM sys.dm_pdw_exec_requests 
WHERE [label] = 'loadsfact'
ORDER BY submit_time DESC








Lab 2

Azure Synapse Analytics (workspaces preview)

1.Create Azure Synapse Analytics(formerly SQl DW) in Azure portal

workspace: synapasewsbrad
storage account: synapasestbrad
container: synapase

預設管理員帳號 sqladminuser
輸入自訂密碼 Pa55w.rd1234



2.firewall 
Allow Azure services and resources to access this workspace
change to ON


3.create a SQL Pool
name: sqlpool01
performance level同樣降到200
因為是測試所以將scale等級從預設1000先降到200

因為目前是preview所以workload group功能還沒有出現


synapsewsbrad.sql.azuresynapse.net
sqlpool01


4.create a spark pool
pool name: spark01
node size 因為是測試降成small
autospale 因為是測試diskable
number of nodes: 因為是測試降低4

建立很快只是配置檔完成

後續第一次執行notebook語法才會真正初始化主機

5.回到Ovewview
點選 Workspace web URL 
進入 Azure Synapse Analytics的Web開發與管理介面
左邊選擇 Develop,然後按下+號,選SQL Scripts

-- type your sql script here, we now have intellisense
CREATE TABLE [dbo].[Trip]
(
    [DateID] int NOT NULL,
    [MedallionID] int NOT NULL,
    [HackneyLicenseID] int NOT NULL,
    [PickupTimeID] int NOT NULL,
    [DropoffTimeID] int NOT NULL,
    [PickupGeographyID] int NULL,
    [DropoffGeographyID] int NULL,
    [PickupLatitude] float NULL,
    [PickupLongitude] float NULL,
    [PickupLatLong] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [DropoffLatitude] float NULL,
    [DropoffLongitude] float NULL,
    [DropoffLatLong] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [PassengerCount] int NULL,
    [TripDurationSeconds] int NULL,
    [TripDistanceMiles] float NULL,
    [PaymentType] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [FareAmount] money NULL,
    [SurchargeAmount] money NULL,
    [TaxAmount] money NULL,
    [TipAmount] money NULL,
    [TollsAmount] money NULL,
    [TotalAmount] money NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

COPY INTO [dbo].[Trip]
FROM 'https://nytaxiblob.blob.core.windows.net/2013/Trip2013/QID6392_20171107_05910_0.txt.gz'
WITH
(
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = '|',
    FIELDQUOTE = '',
    ROWTERMINATOR='0X0A',
    COMPRESSION = 'GZIP'
)
OPTION (LABEL = 'COPY : Load [dbo].[Trip] - Taxi dataset');

SELECT PassengerCount,
      SUM(TripDistanceMiles) as SumTripDistance,
      AVG(TripDistanceMiles) as AvgTripDistance
FROM  dbo.Trip
WHERE TripDistanceMiles > 0 AND PassengerCount > 0
GROUP BY PassengerCount
ORDER BY PassengerCount


6.進入 Azure Synapse Analytics的Web開發與管理介面
	左邊選擇 Develop,然後按下+號,選Import
	選 Analyze with Apache Spark.ipynb

	執行第一段就會開始初始化主機(會稍微久一點)
from azureml.opendatasets import NycTlcYellow

data = NycTlcYellow()
data_df = data.to_spark_dataframe()
# Display 10 rows
display(data_df.limit(10))

	
	
7.進入 Azure Synapse Analytics的Web開發與管理介面
	左邊選擇 Develop,然後按下+號,選Import
	Analyze data with SQL on-demand.sql
	
	Connect to 改成SQL on-demand
	
	-- 此URL路徑下是有很多個檔案,以下語法一次查詢出來 (約36 sec)
	SELECT
		TOP 100 *
	FROM
		OPENROWSET(
			BULK 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/puYear=*/puMonth=*/*.parquet',
			FORMAT = 'parquet'
		) AS [result];



	-- 下方這句須等6都做完才會有資料
	SELECT *
	FROM nyctaxi.dbo.passengercountstats
	
	
	
	
8.Azure Synapse Link for CosmosDB
自動複寫CosmosDB即時資料到背景空間給Azure Synapse取用分析



Create CosmosDB in Azure Portal

Account Type可以改成production
Multi-region wrtite改成disable

neworking
改成All Network



設定
Feature
	啟用Azure Synapse Link (約3分鐘,會在背後建立一個空間自動即時同步資料給Synapse用)


Data Explorer
	1.new一個database
	RetailData
	level 400

	2.new container
	container id: StoreDemoGraphics (依據csv檔名來新增)
	partition key: /id
	analyitca store: on(預設也必須是on)

	container id: RetailSales (依據csv檔名來新增)
	partition key: /id
	analyitca store: on(預設也必須是on)
	
	container id: Products (依據csv檔名來新增)
	partition key: /id
	analyitca store: on(預設也必須是on)


進入 Azure Synapse Analytics的Web開發與管理介面
 Data
 的Linked + Connect to external data 加入 CosmosDB
 


9.到storage account建立一個retaildata資料夾(Container)
,然後上傳csv檔
Products.csv
RetailSales.csv
StoreDemoGraphics.csv


10.Power BI帳號
個人帳號無法使用PowreBI
所以要用Azure AD建立的帳號才能登入

建立後到PowreBI網站,使用此Azure AD新帳號申請試用

登入Web PowerBI後,改用心檢視畫面
左邊workspaces,啟用PowerBI Pro 60天免費試用
new workspaces
加入存取權給Azure AD管理員


回到 Azure Synapse Analytics的Web開發與管理介面
new linked 選擇powerBI
切到develop就有一個PowerBI


下載一個檔案,就可以用Power BI Desktop開發
開發完就發佈到Power BI Web工作區

此時就可以在Azure Synapse Analytics的Web開發與管理介面
develop的PowerBI就會出現此報表,並且可以修改開發,未來就直接在Azure Synapse Analytics開發報表