# Snowflake

###### tags: `snowflake`

## Load data into Snowflake

Sources and scenarios covered here:

* Google BigQuery
* Amazon S3
* Parquet files from an Amazon S3 bucket
* ongoing loads, when additional batches of files are added to S3 or there are just a lot of groups of files
* raw files

### Uploading files

1. Create a schema: `create schema [schema name]`
2. Create a Snowflake stage: a pointer to the material to import (see internal vs. external stages in Snowflake)
    * show the file formats available: `show file formats`
    * create a file format to use with the stage (you can inspect it later with `describe file format <name>`):

```
create or replace file format parquet
    type = PARQUET
    TRIM_SPACE = false
    null_if = ()
    BINARY_AS_TEXT = true
    COMPRESSION = AUTO;
```

```
create or replace file format gz_format
    type = csv
    field_delimiter = '"'
    skip_header = 1
    null_if = ('NULL', 'null')
    empty_field_as_null = true
    COMPRESSION = 'GZIP';
```

### Create stage

```
create or replace stage mprod_cm_demo_stage
    url = 's3://gn5456-prodanalytics-snowflake-us-west-2/music/db-delph-trgt-1/db_replica_full/MPROD_CM/'
    credentials = (aws_key_id='AKIAXSMVNSKGBY2Y55AP' aws_secret_key='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')
    file_format = gz_format;
```

* list the stages made: `show stages`
* list the files in the S3 bucket: `list @mprod_cm_demo_stage`

![](https://i.imgur.com/1opGHTq.png)

* create the target table in Snowflake:

```
create table ddhar_test.stg.mprod_cm (
    MPROD_CM_ID bigint,
    ROOT_OBJECT_ID bigint,
    OBJECT_ID bigint,
    MPROD_ID string,
    MPROD_OWNER string,
    MPROD_TABLE string,
    MPROD_PARENT_ID string,
    MPROD_PARENT_TABLE string,
    NAME string,
    ABREVIATED_NAME string,
    DESCRIPTION string,
    LANGUAGE_CONTEXT_ID bigint,
    SORT bigint,
    TYPE string,
    SOURCE_ID bigint,
    STATUS_ID bigint,
    DESTINATION_ID bigint,
    CDATE datetime,
    UDATE datetime,
    SESSION_ID bigint
)
```

* copy the files that are already sitting in the S3 bucket into the table
* note that you can't use this transforming form when creating pipes; there you need to write a simpler statement:

```
copy into DDHAR_TEST.STG.MPROD_CM from (
    select
        t.$1::bigint,
        t.$2::bigint,
        t.$3::bigint,
        t.$4::string,
        t.$5::string,
        t.$6::string,
        t.$7::string,
        t.$8::string,
        t.$9::string,
        t.$10::string,
        t.$11::string,
        t.$12::bigint,
        t.$13::bigint,
        t.$14::string,
        t.$15::bigint,
        t.$16::bigint,
        t.$17::bigint,
        t.$18::datetime,
        t.$19::datetime,
        t.$20::bigint
    from @mprod_cm_demo_stage (file_format => gz_format) t
)
```

## Create Pipe

* automated ingestion of data into Snowflake
* you can have a bucket and enable continuous reading of a folder
* say you want to automate a pipeline from certain files in your S3 bucket: use the stage over the source to pipe the files into a table in Snowflake
* unfortunately I haven't figured out a way for this to take columns :(
* show the pipes already made: `show pipes`

```
create pipe if not exists my_mprod_cm_pipe auto_ingest = true as
    copy into ddhar_test.stg.mprod_cm
    from @mprod_cm_demo_stage/
    file_format = gz_format;
```

```
create pipe if not exists d_mprod_cm_pipe auto_ingest = true as
    copy into ddhar_test.stg.mprod_cm
    from @mprod_cm_demo_stage/
    file_format = gz_format;
```

* you do this in order to set datatypes for all the rows
* you can also make a pipe and a table to copy Parquet files into, e.g. reusing the same stage with the `parquet` file format:

```
create pipe if not exists my_mprod_cm_pipe auto_ingest = true as
    copy into ddhar_test.stg.mprod_cm
    from @mprod_cm_demo_stage/
    file_format = parquet;
```
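A minimal sketch (not from the original notes) of checking on and backfilling a pipe once it exists, reusing the pipe and table names from the examples above. `alter pipe ... refresh`, `system$pipe_status`, and `information_schema.copy_history` are standard Snowflake features; the assumption here is that the pipe lives in the current database/schema.

```
-- Queue any files that were already in the stage before the pipe was created
-- (auto_ingest only reacts to new S3 event notifications).
alter pipe my_mprod_cm_pipe refresh;

-- Check the pipe's state and how many files are pending.
select system$pipe_status('my_mprod_cm_pipe');

-- See what actually landed in the table over the last hour.
select *
from table(information_schema.copy_history(
    table_name => 'ddhar_test.stg.mprod_cm',
    start_time => dateadd(hour, -1, current_timestamp())));
```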
## Querying files from S3 in Amazon Athena

1. Build a crawler that holds metadata about a table
    * it analyzes a sample of the data (5-10%) and captures metadata about the table
    * once you create and run it, it works out what types of files we're dealing with, etc.
2. Build a database where the crawler stores the information it gathers

#### Schema migration script

* for when you have a lot of tables to copy!

#### Importing multiple schemas into Snowflake

1. Oracle Data Modeler
2. Create a logical model
    * logical : relational :: entity : table
3. Convert the datatypes. Use domains: logical concepts that map to native types; you can chain a mapping from a native type to a domain and then to another type, etc.
    * to get to Snowflake types: there is no difference between the different kinds of VARCHAR
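As a small illustration of that last point (not from the original notes), here is what a converted table might look like; the Oracle column types and the table name `mprod_cm_from_oracle` are hypothetical examples, not taken from the actual schemas.

```
-- Oracle source (illustrative)     Snowflake target
--   MPROD_CM_ID  NUMBER(19)    ->  bigint (alias for NUMBER(38,0))
--   NAME         VARCHAR2(255) ->  varchar (length makes no practical difference)
--   CDATE        DATE          ->  timestamp_ntz (Oracle DATE carries a time component)
create table ddhar_test.stg.mprod_cm_from_oracle (
    MPROD_CM_ID bigint,
    NAME        varchar,        -- varchar, varchar(255), string, text all behave the same
    CDATE       timestamp_ntz
);
```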