It is a Zoo Tycoon 1 file
Columns and data structure
Animal | Terrain | Status | Cost | Location/Era |
---|---|---|---|---|
Ankylosaurus | Savannah | Extinct | $2,700 | North America (Cretaceous) |
Triceratops | Savannah | Extinct | $4,000 | North America (Cretaceous) |
It is an ISO-8859-1 (ANSI) encoded file
Location and Era are concatenated
It is a Zoo Tycoon 2 file
Columns and data structure
animal | biome | status | popularity | cost | location |
---|---|---|---|---|---|
Grizzly Bear | Boreal Forest | Endangered | 3.5 | $15,000 | North America |
Polar Bear | Tundra | Low risk | 3 | $10,000 | Arctic |
It is a UTF-8 encoded file
Compared to the Zoo Tycoon 1 files:
- there is no era information
- "terrain" has been renamed to "biome"
- there is a new "popularity" column
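These header differences are enough to tell the two formats apart, and they are what the norm detection rule later in this tutorial relies on. A minimal Python sketch of the idea (the function name and sample headers are illustrative, not part of ARC):

```python
def detect_game_version(header_line: str) -> str:
    """Guess the Zoo Tycoon version from a file's semicolon-delimited header row."""
    columns = [c.strip().lower() for c in header_line.split(";")]
    # Only Zoo Tycoon 2 files carry a "popularity" column.
    return "Zoo Tycoon 2" if "popularity" in columns else "Zoo Tycoon 1"

print(detect_game_version("Animal;Terrain;Status;Cost;Location/Era"))
print(detect_game_version("animal;biome;status;popularity;cost;location"))
```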
Well, it is a file …
Let's see if ARC can tell which version of the Zoo Tycoon game a file comes from
We will set a target data model. This data model will store the data of these 2 different files the way we want.
We will set up a first pipeline to load the animals from the first type of file
We will test our pipeline by loading all the files and see what happens
We will then define a second pipeline to load the files from the game Zoo Tycoon 2
We will change both the data model and the pipeline to host some new data
We will test our pipeline by loading all the files again and see what happens
We will finally change the data model to structure our file data into a normalized two-table model and work again on the pipeline
We want to load the files into a data model called "bio"
Structure of a "bio" data model
animal_name | cost_in_euros | conservation_status |
---|---|---|
text | float | text |
In the main menu > click on "Norm family management"
In the "Norm Families" section
1- Click in the cell to the right of the "+"
2- Write "bio"
3- Press Ctrl+Enter to validate the entry, or click the "Add" button
The "Ctrl+Enter" hotkey executes the default action associated with the cell
Add a new norm family called "test" in "Norm Families" table
Delete the "test" norm family entry
Click the line's checkbox and click on "Delete"
A table associated with a data model must follow the naming format mapping_{norm family}_{table name}_ok
Create the "animal" table associated to the data model "bio"
In the "Model tables" section
1- Add an entry "mapping_bio_animal_ok"
In the "data model fields" section
- in "Field name", write "animal_name". Use only lower case and no spaces; only dollar signs and underscores are allowed as separators, as the name must be database compliant
- in "Type", select "text". It is the database type of the field to be created
- in "Comment", write an optional comment that describes the field to be created
- in "animal", write an "x" (or whatever you want). It means that the field animal_name now belongs to the table animal
Insert the other fields cost_in_euros (float as type) and conservation_status (text as type)
Insert the two mandatory fields id_source and id_{table_name}
Field Name | Field Type | Comment | animal |
---|---|---|---|
- | - | - | - |
animal_name | text | Name of the animal | x |
cost_in_euros | float | Value of the animal in euros | x |
conservation_status | text | Conservation status of the animal | x |
id_source | text | Name of the original file | x |
id_animal | bigint | Animal table primary key | x |
game_name | text | The name of the game where the animal data comes from | x |
- Write "id" in the cell just below the "Field Name" header
- Press Ctrl+Enter to trigger the filter action
- Click on the "Field Type" header to sort the view by that column, ascending
- Click again to sort descending
In the main menu > click on "Norm management"
Fill the cells of the "+" line of the "Norm definition" section as follows
- in "Norm family", choose the "bio" data model created before
It means that the target of the processing pipeline is the "bio" data model
- in "Norm", choose a name for your processing pipeline. I chose ZT1_V001 (like Zoo Tycoon 1 version 001)
- in "Periodicity", choose the periodicity of your file. At the moment, only "Annual" or "Monthly" can be selected, but more can be added. I chose Annual. Note that this is only metadata that is not actually used by the pipeline.
- in "State", choose "ACTIF". It means that the norm is active. A norm can be disabled by choosing "INACTIF"
Overview on the "Norm calculation"
In "Norm calculation", a SQL select query must be written
For each file loaded, this query is evaluated. If it returns any records, ARC knows the norm, and this pipeline will be used for the file. The "Norm calculation" rules must be mutually exclusive across all norms and must not overlap, otherwise ARC will generate an error.
The query may use a table called "alias_table". Each file loaded by ARC is read row by row, and the content is temporarily stored in a table called "alias_table".
alias_table has 3 columns
When ARC loads the file animals-zt1.csv from the "DEFAULT" warehouse, the "alias_table" generated looks like :
id_source | id_ligne | ligne |
---|---|---|
DEFAULT_animals-zt1.csv | 0 | Animal;Terrain;Status;Cost;Location/Era |
DEFAULT_animals-zt1.csv | 1 | African Elephant;Savannah;Vulnerable;$2,500;Africa |
DEFAULT_animals-zt1.csv | 2 | Olive Baboon;Savannah;Least Concern;$900;Africa |
DEFAULT_animals-zt1.csv | 3 | Plains Zebra;Savannah;Near Threatened;$800;Africa |
… |
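Conceptually, this staging step can be sketched in Python (the helper name is ours, not ARC's):

```python
def build_alias_table(warehouse, filename, lines):
    """Mimic how ARC stages a raw file: one record per line, numbered from 0."""
    id_source = f"{warehouse}_{filename}"
    return [{"id_source": id_source, "id_ligne": i, "ligne": line}
            for i, line in enumerate(lines)]

rows = build_alias_table("DEFAULT", "animals-zt1.csv", [
    "Animal;Terrain;Status;Cost;Location/Era",
    "African Elephant;Savannah;Vulnerable;$2,500;Africa",
])
print(rows[0]["id_source"])  # DEFAULT_animals-zt1.csv
```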
Let's fill the remaining cells of the "Norm definition" before adding it in ARC
- In "Norm calculation", write the SQL query
SELECT 1 FROM alias_table WHERE id_source LIKE '%animal%'
AND NOT EXISTS (SELECT 1 FROM alias_table WHERE id_ligne=0 AND lower(ligne) LIKE '%;popularity;%')
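To see what this query does, here is a rough Python equivalent of its two conditions, run against rows shaped like alias_table (a sketch for understanding, not how ARC actually evaluates it):

```python
def matches_zt1_norm(rows):
    """True when the file name contains 'animal' and the header row
    (id_ligne == 0) does not contain a ';popularity;' column."""
    name_ok = any("animal" in r["id_source"] for r in rows)
    zt2_header = any(r["id_ligne"] == 0 and ";popularity;" in r["ligne"].lower()
                     for r in rows)
    return name_ok and not zt2_header

zt1_rows = [{"id_source": "DEFAULT_animals-zt1.csv", "id_ligne": 0,
             "ligne": "Animal;Terrain;Status;Cost;Location/Era"}]
zt2_rows = [{"id_source": "DEFAULT_animals-zt2.csv", "id_ligne": 0,
             "ligne": "animal;biome;status;popularity;cost;location"}]
print(matches_zt1_norm(zt1_rows), matches_zt1_norm(zt2_rows))  # True False
```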
- In "Validity calculation", write a SQL query that returns '2020-06-01'
SELECT '2020-06-01'
- Click the "Add" button below the "Norm definition" section, or press Ctrl+Enter
- The new line corresponding to the created norm appears in the "Norm definition" section
- It can be updated by clicking on the cells to edit, then clicking the "Update" button or pressing Ctrl+Enter
- It can be selected by clicking on the line checkbox. The selection opens the "Norms calendar" section
- It can be deleted by selecting it with the checkbox and clicking the "Delete" button
In the "Norm definition" section
- Click on the norm checkbox to open its "Norms calendar" section
Fill the cells of "+" line of the "Norm calendar" section
- in "Validity min", 2020-01-01
- in "Validity max", 2025-01-01
- in "State", choose "ACTIVE"
- Click the "Add" button below the section or press Ctrl-Enter
In the "Norm calendar" section
- Click on the "norm calendar" checkbox to open its "Rulesets" section
Fill the cells of "+" line of the "Rulesets" section
- in "Version", write "v001"
- in "State", select the "Bac à sable 1" (sandbox 1)
- Click the "Add" button below the section or press Ctrl-Enter
In the "Rulesets" section
- Click on the "Rulesets" checkbox to open the pipeline steps section
- Click on the "Load" link to open the "Load rules" section
Fill the cells of "+" line of the "Load rules" section
- in "Type of file", choose "plat" (flat file)
- in "Delimiter", write ;
- in "Format", write <encoding>ISO-8859-1</encoding>
- Click the "Add" button below the section or press Ctrl-Enter
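These load settings amount to reading a semicolon-delimited flat file decoded as ISO-8859-1. In Python terms (a sketch of the idea, not ARC's actual loader):

```python
import csv
import io

# A ZT1-style file as raw bytes, encoded in ISO-8859-1 as declared in the rule.
raw = ("Animal;Terrain;Status;Cost;Location/Era\n"
       "Ankylosaurus;Savannah;Extinct;$2,700;North America (Cretaceous)\n"
       ).encode("iso-8859-1")

with io.TextIOWrapper(io.BytesIO(raw), encoding="iso-8859-1") as fh:
    rows = list(csv.reader(fh, delimiter=";"))

print(rows[1][0], rows[1][3])  # Ankylosaurus $2,700
```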
Enter the sandbox workbench
- Next to "Choose your working environment", select "BAS1" (sandbox1)
- Click on "Manage environment"
In the "Run a module" section
- Click on "Initialize"
In the "Run a module" section
- Click on "Reset Sandbox"
In the "Register files" section, click on "Browse"
- Select all the files you want to upload to ARC using the Ctrl key
ARC can read zip, gzip, tar and tar.gz archives
- Select the "DEFAULT" target warehouse
- Click on "Register"
In the "Files status in the workflow" section
- 3 files are in the "Register OK" state
- click on the cell "3". The details of the 3 files appear in the "Workflow files detail" section
The step after "Register" is "Load". So the 3 files in the "Register OK" state are eligible to be processed by the "Load" module
In the "Run a module" section
- Click on the "Load" button. It runs the "Load" module on the eligible files (the ones in a "Register OK" state).
In the "Files status in the workflow" section
- Two files are in a "Load OK" state, one file is in a "Load KO" state
- Click on the cell containing "2" referencing the "2" files in the "Load OK" status
It opens the "Workflow files detail" section with the 2 files' details
- These 2 files have been marked by ARC with the "ZT1_V001" norm, "2020-06-01" as validity and "A" as periodicity
- Click on the cell containing "1" referencing the file in the "Load KO" status
It opens or updates the "Workflow files detail" section with the file's details
- The norm, validity and periodicity are empty, but there is a "Message report": "java.lang.Exception: Zero norm match the expression"
- It means that ARC couldn't find any norm matching this file; thus, it isn't a Zoo Tycoon 1 file as we defined them
Execute a full processing chain
In the "Run a module" section
- Execute the full remaining processing chain for the 2 files
- The 2 files stop at "Map KO". Indeed, we still haven't defined how ARC must map the file data into the "bio" data model
In the "Run a module" section
- Click on the "Restructure" button
- The 2 files in the "Load OK" status went through the "Restructure" process. They are now in a "Restructure OK" state and eligible for the next module "Control".
In the "Run a module" section
- Click on the "Undo Load" button to bring all 3 files back to the "Register OK" state, as if the "Load" module had never been run
- Click on the "Load" button to process the files with the "Load" module again
In the "Files status in the workflow" section
- Click on the cell containing "2" referencing the "2" files in the "Load OK" status
It opens or updates the "Workflow files detail" section with the file details
In the "Workflow files detail" section
- Click on "Download Data"
- Open the csv file corresponding to "DEFAULT_animals-001.csv"
id_source | id | date_integration | id_norme | periodicite | validite | i_animal | v_animal | i_terrain | v_terrain | i_status | v_status | i_cost | v_cost | i_location_era | v_location_era |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
text | int4 | text | text | text | text | int4 | text | int4 | text | int4 | text | int4 | text | int4 | text |
DEFAULT_animals-001.csv | 1 | 2020-06-14 | ZT1_V001 | A | 2020-06-01 | 1 | African Elephant | 1 | Savannah | 1 | Vulnerable | 1 | $2,500 | 1 | Africa |
DEFAULT_animals-001.csv | 2 | 2020-06-14 | ZT1_V001 | A | 2020-06-01 | 2 | Olive Baboon | 2 | Savannah | 2 | Least Concern | 2 | $900 | 2 | Africa |
DEFAULT_animals-001.csv | 3 | 2020-06-14 | ZT1_V001 | A | 2020-06-01 | 3 | Plains Zebra | 3 | Savannah | 3 | Near Threatened | 3 | $800 | 3 | Africa |
In the main menu, click on the "Norm management" link
- Select the "ZT1_V001" norm by clicking on its checkbox in the "Norms definition" section
- Select the calendar by clicking on its checkbox in the "Norms calendar" section
- Select the ruleset corresponding to "sandbox 1" by clicking on its checkbox in the "Rulesets" section
- Click on the "Map Model" link to open the "Mapping rules" section
- Click on "Generate a ruleset" in the "Map Model" section
It initializes empty rules for all the columns of our "bio" data model; these will have to be filled in
Use SQL syntax to write the data model mapping expression
Remember you must use the column names used by ARC to store the data, such as "v_animal", "v_location_era", "id_source", "id_norme", "validite", …
These column names must be escaped with curly braces, such as {v_animal}, {id_source}, …
In the "Map Model" section, fill the "SQL expression" and optionally the "Comment" for the variables of the data model
- Click on the "SQL expression" cell or the "Comment" cell corresponding to the variable, and write the rules
- Click on "Update" or press Ctrl+Enter to validate the entries
- Be careful when writing multiple cells at once: if a SQL expression is not correct, ARC rejects all the entries and they will be lost
Field name | SQL expression | Comment |
---|---|---|
id_animal | {pk:mapping_bio_animal_ok} | this sql expression means id_animal is a serial number primary key for each file |
id_source | {id_source} | The name of the file. Don't forget to use brackets ! |
animal_name | {v_animal} | The animal name as it is stored by ARC. v_ means "value" |
cost_in_euros | regexp_replace({v_cost},'[^0123456789]','','g')::int*0.89 | Convert v_cost into euros: keep only the numeric digits, cast to integer and apply the conversion rate. SQL is powerful |
conservation_status | {v_status} | TODO : change that to store modalities instead of plain text |
game_name | case when {id_norme} like 'ZT1%' then 'Zoo Tycoon 1' end | We've built the norm to identify the game version of the data. That is a smart choice for our use case. |
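The cost_in_euros expression can be checked outside ARC. This Python sketch reproduces the same keep-the-digits-then-convert logic (the 0.89 dollar-to-euro rate is the one used in the table above):

```python
import re

def cost_in_euros(v_cost):
    """Keep only the digits of v_cost, cast to int, apply the conversion rate."""
    digits = re.sub(r"[^0-9]", "", v_cost)
    return int(digits) * 0.89

print(cost_in_euros("$2,500"))  # 2225.0
```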
Enter the sandbox workbench
- Next to "Choose your working environment", select "BAS1" (sandbox1)
- Click on "Manage environment"
- Run the "Initialize" module to be sure that your new rules will be applied to the sandbox
- Run the required modules to try the "Map model" rules
- Download the output data of the "Map model OK" files to visualize the output
Fix the error on cost_in_euros: when v_cost contains no digits, regexp_replace returns an empty string, and casting '' to integer fails
(nullif(regexp_replace({v_cost},'[^0123456789]','','g'),'')::int*0.89)::int
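The nullif(..., '') wrapper turns that empty string into NULL so the cast succeeds, and the outer ::int rounds the final value. The same behaviour, sketched in Python (the function is illustrative, not ARC code):

```python
import re

def cost_in_euros(v_cost):
    """Mirror of the corrected SQL: rows without digits yield None, not an error."""
    if v_cost is None:
        return None
    digits = re.sub(r"[^0-9]", "", v_cost)
    if digits == "":                  # nullif(regexp_replace(...), '') -> NULL
        return None
    return int(int(digits) * 0.89)    # the final ::int cast

print(cost_in_euros("$2,500"), cost_in_euros("n/a"))  # 2225 None
```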
Manuel Soulier
SNDI Orléans, INSEE
manuel.soulier@insee.fr