
# Copel - Big Data Training

###### tags: `copel`

## 1. Access JupyterHub

Open http://pdt05.copel.nt:30080 to sign in to the JupyterHub of the lab environment.

## 2. Open a terminal

The two images below show how to start a new terminal.

![novo-terminal-launcher](https://i.imgur.com/RIVW8Gi.png)

![novo-terminal-file-new](https://i.imgur.com/EYuq5cI.png)

## 3. Authenticate your user

Run the command below in the terminal.

```bash
kinit <your-username>

# For example, my username is robertogyn19
$ kinit robertogyn19
Password for robertogyn19@ADALGISO.NIVALDO:

# The klist command can be used to check that everything worked
$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: robertogyn19@ADALGISO.NIVALDO

Valid starting       Expires              Service principal
11/28/2020 17:59:19  11/29/2020 03:59:19  krbtgt/ADALGISO.NIVALDO@ADALGISO.NIVALDO
        renew until 12/05/2020 17:58:45
```

![kinit](https://i.imgur.com/7dYOg1X.png)

## 4. Notebook

Run the command below in the terminal to copy the notebook stored in HDFS to the current directory.

```bash
hdfs dfs -copyToLocal /treinamento/pecld/PeCLD.ipynb .
```

After copying the file, open it by double-clicking it.

![jupyter-abrir-notebook](https://i.imgur.com/ZanUstZ.png)

## 5. Code snippets

### 5.1 Start Spark

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .master('local[4]')  # run locally with 4 worker threads
    .config('spark.executor.memory', '8g')
    .config('spark.driver.memory', '8g')
    .getOrCreate()
)
# Render DataFrames directly in the notebook output
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
```

### 5.2 Read the cad-uc-fatura file

```python
# Keep the result in df; the snippets below refer to it
df = spark.read.parquet('/data/pecld/cad-uc-fatura')
```

### 5.3 Filter 1 - clause 2

```python
# Due date within the last 1095 days (roughly three years) and one of the listed statuses
clause2 = (df.DTA_VENC_EUF >= F.date_sub(F.current_date(), 1095)) & (
    df.COD_SITU_COM_EUF.isin(['AR', 'RN', 'AA']))
```

### 5.4 Filter 2 - clause 3

```python
clause3 = df.COD_ORIG_FAT_EUF.isin(['FAT', 'SOM', 'EVE', 'PAC', 'PAR', 'DSC', 'FRC'])
```

### 5.5 Full filter - clauses 1, 2 and 3

```python
# clause1 is not shown in this handout; it is assumed to be defined in the notebook
df = df.filter((clause1 | clause2) & clause3)
```

### 5.6 Column STA_JUDIC_EUF

```python
# Encode the 'S' flag as 1 and everything else as 0
df = df.withColumn('STA_JUDIC_EUF', F.when(df.STA_JUDIC_EUF == 'S', 1).otherwise(0))
```

### 5.7 Column IND_BXRD_FAT

```python
# 1 when both low-income columns are filled in, 0 otherwise
baixa_renda_clause = (df.COD_ROTI_BXRD_EUF.isNotNull()) & (df.COD_CRIT_BXRD_EUF.isNotNull())
df = df.withColumn('IND_BXRD_FAT', F.when(baixa_renda_clause, 1).otherwise(0))
```

### 5.8 Column DIAS_PGTO - trunc(dta_pgto_euf)

```python
# Truncate the payment timestamp to the start of its day
dta_pgto_trunc = F.date_trunc('day', df.DTA_PGTO_EUF)
```

### 5.9 Column DIAS_PGTO - dias_pgto_inner_1

```python
# dta_venc_trunc is assumed to be defined in the notebook
# 0 when paid before the due date, otherwise days from due date to payment
dias_pgto_c1_inner = F.when(dta_pgto_trunc < dta_venc_trunc, 0).otherwise(
    F.datediff(dta_pgto_trunc, dta_venc_trunc))
```

### 5.10 Column DIAS_PGTO - dias_pgto_inner_2

```python
# current_trunc is assumed to be defined in the notebook
# 0 when the due date is still in the future, otherwise days from due date to today
dias_pgto_c2_inner = F.when(dta_venc_trunc > current_trunc, 0).otherwise(
    F.datediff(current_trunc, dta_venc_trunc))
```

### 5.11 Column DIAS_PGTO - dias_pgto_c2

```python
dias_pgto_c2 = df.COD_SITU_COM_EUF.isin(['AB', 'DA', 'AG'])
```

### 5.12 Writing the output

```python
# mode="overwrite" replaces any existing data at this path
df.write.parquet("/treinamento/output/roberto/faturas.parquet", mode="overwrite")
```
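
The DIAS_PGTO fragments in 5.8 to 5.11 may be easier to follow against a row-level picture of what they compute. The sketch below restates them in plain Python with `datetime`, outside Spark. The function names, the constant `C2_STATUSES`, the sample dates, and in particular the way the pieces are combined in `dias_pgto` are assumptions made for illustration: the handout does not show the final combination, so check it against the notebook before relying on it.

```python
from datetime import date, datetime
from typing import Optional

# Status codes taken from snippet 5.11
C2_STATUSES = {'AB', 'DA', 'AG'}


def trunc_day(ts: datetime) -> date:
    """Plain-Python counterpart of F.date_trunc('day', ...): drop the time of day."""
    return ts.date()


def days_late(reference: date, due: date) -> int:
    """Days from the due date to the reference date, floored at 0 (as in 5.9 and 5.10)."""
    return max((reference - due).days, 0)


def dias_pgto(paid_at: Optional[datetime], due_at: datetime,
              status: str, today: date) -> Optional[int]:
    """Hypothetical combination of the fragments; not taken from the handout."""
    due = trunc_day(due_at)
    if paid_at is not None:
        # Branch sketched in 5.9: payment date versus due date
        return days_late(trunc_day(paid_at), due)
    if status in C2_STATUSES:
        # Branch sketched in 5.10 and 5.11: current date versus due date
        return days_late(today, due)
    return None


due = datetime(2020, 11, 10, 0, 0)
today = date(2020, 11, 28)
print(dias_pgto(datetime(2020, 11, 13, 15, 30), due, 'AR', today))  # paid late -> 3
print(dias_pgto(datetime(2020, 11, 5, 9, 0), due, 'AR', today))     # paid early -> 0
print(dias_pgto(None, due, 'AB', today))                            # unpaid -> 18
```

In Spark the same branching would typically be written as a chain of `F.when(...)` calls over the column expressions from 5.8 to 5.11, so that the computation stays inside the DataFrame rather than running row by row in Python.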