# Tokenization procedure
v 0.1
[TOC]
## Current Flow
```plantuml
@startuml
start
:<b>STEP 1</b>\n[IT] Files with exported profiles are downloaded \nfrom the provider's (e.g. Stripe, Affinipay) SFTP \n(sometimes 2 of them, one for CC, one for ACH/BA)\n FILE COUNT SO FAR: 2;
:<b>STEP 2</b>\n[IT] Sensitive columns are masked: \nwe create masked copies of those files for the Ops team\n FILE COUNT SO FAR: 4;
:<b>STEP 3</b>\n[IT] Masked files are attached to the JIRA ticket;
:<b>STEP 4</b>\n[Ops] Files are reviewed; addresses are amended etc.;
:<b>STEP 5</b>\n[Ops] Passes files back to IT for \ntokenization (usually a subset, or in multiple files)\n FILE COUNT SO FAR: 6?;
:<b>STEP 6</b>\n[IT] Performs checks on files, matches \nthem against the original, unmasked files, \nresolves problems, and produces files for \ntokenization (one for CC records, one for ACH/BA records)\n FILE COUNT SO FAR: 8;
:<b>STEP 7</b>\n[IT] Runs tokenization scripts on those 2 files.\n Each run outputs 2 files \n(an errors file and a successful-tokenizations file), \nso for the entire step - 4 files\n FILE COUNT SO FAR: 12;
:<b>STEP 8</b>\n[IT] Puts files in the JIRA ticket for the Ops team to pick up;
stop
```
>[color=red] <b>Note</b>: if some data is missing (which is often the case), another wave of this operation is necessary; steps 4 to 8 have to be repeated.
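The masking in STEP 2 can be sketched as below. This is only an illustration, not the actual IT script: the set of sensitive columns and the keep-last-4 masking rule are assumptions.

```python
# Minimal sketch of STEP 2 (masking sensitive columns).
# SENSITIVE and the keep-last-4 rule are assumptions.
import csv
import io

SENSITIVE = {"card.number", "email"}  # assumed list of columns to mask


def mask_value(value: str) -> str:
    """Replace everything but the last 4 characters with 'x'."""
    if len(value) <= 4:
        return "x" * len(value)
    return "x" * (len(value) - 4) + value[-4:]


def mask_csv(text: str) -> str:
    """Return a masked copy of a CSV export for the Ops team."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in SENSITIVE & set(row):
            if row[col]:
                row[col] = mask_value(row[col])
        writer.writerow(row)
    return out.getvalue()
```

The unmasked originals stay with IT; only the output of `mask_csv` would be attached to the JIRA ticket in STEP 3.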
## Current problems
- Pushing around too many different files, with different data formats: source files are different, revised files different, output files also different
- Necessity to continuously combine files and to perform comparison and validation
- (often) lack of a unique reference column, which makes comparing files and finding records difficult, or just much slower - records have to be matched by multiple columns
- lack of standardization, and thus difficulty with tracking and securing data quality
- data quality problems (addresses first and foremost)
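The missing-reference-column problem is what forces matching on a composite key. A minimal sketch of that fallback, assuming (hypothetically) that name + email + ZIP identify a record - which is exactly the fragile part:

```python
# Sketch of matching records WITHOUT a unique reference column:
# fall back on a composite key built from several fields.
# The chosen fields are an assumption; collisions remain possible.
def composite_key(row: dict) -> tuple:
    return (
        row.get("name", "").strip().lower(),
        row.get("email", "").strip().lower(),
        row.get("card.address_zip", "").strip(),
    )


def match(originals: list, revised: list):
    """Split revised rows into those found in originals and the rest."""
    index = {composite_key(r): r for r in originals}
    matched, unmatched = [], []
    for row in revised:
        target = matched if composite_key(row) in index else unmatched
        target.append(row)
    return matched, unmatched
```

A single `mxReference` column (as proposed below in V 2.0) replaces all of this with a plain dictionary lookup.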
## Proposed flow V 2.0
```plantuml
@startuml
start
:<b>STEP 1</b>\n[IT] Files with exported profiles are downloaded \nfrom the provider's (e.g. Stripe's) SFTP \n(sometimes 2 of them, one for CC, one for ACH/BA)\n FILE COUNT SO FAR: 2;
#palegreen:<b>STEP 2</b>\n[IT] Files go through scripts that:\n 1) combine files into 1 (CC and ACH/BA records together)\n 2) validate data\n 3) autocorrect data where safe & possible\n 4) auto-fix address fields\n 5) add a unique reference column [mxReference]\n and a status column [mxStatus] indicating whether we as \n IT believe the record to be valid and ready for tokenization\n 6) create a masked version of the file for the Ops team\n FILE COUNT SO FAR: 3;
:<b>STEP 3</b>\n[IT] Masked files are attached to the JIRA ticket;
:<b>STEP 4</b>\n[Ops] Files are reviewed; addresses are amended etc.;
#palegreen:<b>STEP 5</b>\n[Ops] Passes a new version of the same file back to IT for \ntokenization:\n 1) the number of records stays the same \n(no additions, no subtractions unless agreed)\n 2) always the same format, CSV\n 3) if some records are not to be tokenized, \nOps changes their mxStatus \n 4) if extra columns are to be added for the \nvalidation/inspection process, \nthey are added with an mx prefix after the original columns\n FILE COUNT SO FAR: 3;
:<b>STEP 6</b>\n[IT] Performs checks on file received from the Ops\n FILE COUNT SO FAR: 3;
#palegreen:<b>STEP 7</b>\n[IT] Runs tokenization scripts on the single file\n containing both CC and ACH records.\n Each run outputs the same file,\n only with extra columns:\n- mxStatus (tokenized|tokenization_error),\n- mxTokenizedAt,\n- columns depending on tokenization output\n (error message or token data)\n FILE COUNT SO FAR: 3;
:<b>STEP 8</b>\n[IT] Puts files in the JIRA ticket for the Ops team to pick up;
stop
```
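The new STEP 2 can be sketched roughly as follows. `is_valid` is a placeholder assumption, and the masking, autocorrection and address geo-fixing parts are omitted here; the point is the merge plus the `mxReference`/`mxStatus` stamping:

```python
# Sketch of the proposed STEP 2: merge the CC and ACH/BA exports into
# one CSV and stamp each record with mxReference and mxStatus.
# The validation rule is a placeholder assumption.
import csv
import io
import uuid


def is_valid(row: dict) -> bool:
    # Placeholder: require a country code (assumed rule).
    return bool(row.get("card.address_country"))


def prepare(cc_csv: str, ach_csv: str) -> str:
    rows, fieldnames = [], []
    for text in (cc_csv, ach_csv):
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in fieldnames:  # union of both headers, in order
                fieldnames.append(name)
        rows.extend(reader)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["mxReference"] + fieldnames + ["mxStatus"], restval=""
    )
    writer.writeheader()
    for row in rows:
        row["mxReference"] = str(uuid.uuid4())
        row["mxStatus"] = "ready_for_tokenization" if is_valid(row) else "invalid"
        writer.writerow(row)
    return out.getvalue()
```

From this point on, every later step (Ops review, tokenization, error reporting) can address a record by its `mxReference` alone.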
### Expected benefits
- procedure simplified, more manageable
- less manual intervention required from both Ops and IT
- errors and problems easier to discover and track
## TODOs [IT]
- prepare preValidation script (input files)
- ~~prepare Geo-Checker script for auto-fixing addresses~~
- change the Tokenization script to accept unified files in the new format, and to output files in the same format
## Current CSV row vs Proposed future CSV row
### Current Row
```csvpreview {header="true"}
description,name,email,id,card.address_city,card.address_country,card.address_line1,card.address_line2,card.address_state,card.address_zip,card.exp_month,card.exp_year,card.id,card.name,card.number,default_source,card.transaction_ids
John Doe,John Doe,john.doe@gmail.com,cus_Pt7EYURu9wJhC8,,US,,,,92057,4,2027,card_1P3L03K6NnMfWf5L2WaPJgXJ,John Doe,xxxxxxxxxxx1007,card_1P3L03K6NnMfWf5L2WaPJgXJ,7470871316301
```
### V2 Row
```csvpreview {header="true"}
mxReference,description,name,email,id,card.address_city,card.address_country,card.address_line1,card.address_line2,card.address_state,card.address_zip,card.exp_month,card.exp_year,card.id,card.name,card.number,default_source,card.transaction_ids,mxStatus,mxAutoFixes,mxValidationErrors,mxTokenizationError,mxCustomerVaultToken,mxVaultToken,mxTransactionId,CustomColumn1
5e268d41-ed44-4b6f-bba9-3098dca750b8,John Doe,John Doe,john.doe@gmail.com,cus_Pt7EYURu9wJhC8,,US,,,,92057,4,2027,card_1P3L03K6NnMfWf5L2WaPJgXJ,John Doe,xxxxxxxxxxx1007,card_1P3L03K6NnMfWf5L2WaPJgXJ,7470871316301,ready_for_tokenization,"{""card.address_country"": [""United States of America"", ""US""]}",,(If tokenization failed),(If tokenization succeeded),(If tokenization succeeded),(If tokenization succeeded),
```
#### mxStatuses, possible values
- [ ] ready_for_tokenization
- [ ] invalid
- [ ] tokenized
- [ ] tokenization_error
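These statuses form a small lifecycle, and since Ops edits the CSV by hand, checking the values (and transitions) explicitly catches typos early. A sketch with an assumed transition map - the document does not specify which transitions are allowed, so the map below is a guess from the flow above:

```python
# Sketch: treat mxStatus as a closed set with explicit transitions,
# so a typo in a hand-edited file fails fast.
# ALLOWED_TRANSITIONS is an assumption, not a documented rule.
MX_STATUSES = {
    "ready_for_tokenization",
    "invalid",
    "tokenized",
    "tokenization_error",
}

ALLOWED_TRANSITIONS = {
    "ready_for_tokenization": {"tokenized", "tokenization_error", "invalid"},
    "invalid": {"ready_for_tokenization"},            # Ops fixed the record
    "tokenization_error": {"ready_for_tokenization"}, # retry after a fix
    "tokenized": set(),                               # terminal state
}


def check_transition(old: str, new: str) -> bool:
    """True if moving from old to new is allowed; raise on unknown values."""
    if old not in MX_STATUSES or new not in MX_STATUSES:
        raise ValueError(f"unknown mxStatus: {old!r} -> {new!r}")
    return new in ALLOWED_TRANSITIONS[old]
```

Running this check in STEP 6 (when IT validates the file returned by Ops) would flag both misspelled statuses and illegal edits, e.g. reverting an already-tokenized record.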