# Xendit Reliability ETL
The purpose of this project is to organize the data we need so we can easily analyze how reliably we are operating as an engineering organization. The data sources we currently look at are:
1. Incidents
2. PagerDuty
3. Sentry
A primer on the above can be found in our [RFC documents](https://drive.google.com/drive/u/1/folders/1ltxy0JbLY53PX7mtwow7Depc2FqflfJ-), so it will not be covered in this document.
## Installation
Installation is designed to be simple, with minimal dependencies. Get started by running,
```
npm install
```
in the root directory.
## Execution
The main feature of this project is to provide an effective ETL pipeline so that the source data can be easily analyzed. To get started quickly, run the following command in the project root,
```
node scripts/initialize.js
```
Running the script above initializes the ETL by creating the database and the schemas specified in the `/scripts/schemas` directory. By default, the database is created in memory. To create it on disk instead, run `export DB_MODE=DISK` before running the script.
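For example, to initialize the database on disk instead of in memory:
```
export DB_MODE=DISK
node scripts/initialize.js
```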
```
export DIRECTORY_PATH=incidents
node scripts/run_etl.js
```
The script above maps the data in `raw_data/incidents` into the appropriate form (Transform) and then loads it into the database (Load). To ETL other data, set `DIRECTORY_PATH` to the name of the raw data directory you want to process, as in the example below.
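For instance, assuming you have already extracted PagerDuty incident data into `raw_data/pagerduty_incidents` (see the Extension section below):
```
export DIRECTORY_PATH=pagerduty_incidents
node scripts/run_etl.js
```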
```
node scripts/run_queries.js
```
You should see a set of CSV files generated in the `/output` directory. These are the results of the queries found in the `/scripts/queries` directory, converted into CSV.
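For illustration, a query file in `/scripts/queries` might contain something like the following; the `Incidents` table and the `severity_level` column come from the schemas described later in this README, and the actual queries may differ.
```
SELECT severity_level, COUNT(*) AS incident_count
FROM Incidents
GROUP BY severity_level;
```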
## Configuration
All of the configuration needed to run the ETL and the data extractors can be found in `.env.local`. Copy the file to `.env` to avoid having to pass environment variables on the command line every time you run the scripts.
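For reference, a filled-in `.env` might look something like the sketch below; the variable names are the ones used by the commands in this README, and `.env.local` is the authoritative template.
```
DB_MODE=DISK
DIRECTORY_PATH=incidents
PAGERDUTY_SECRET=<YOUR_PAGERDUTY_SECRET>
SENTRY_API_KEY=<YOUR_API_KEY>
GITHUB_API_KEY=<YOUR_API_KEY>
DATADOG_API_KEY=<YOUR_API_KEY>
DATADOG_APPLICATION_KEY=<YOUR_APP_KEY>
```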
### Steps
The actual steps the ETL follows under the hood are pretty straightforward:
1. Initialize an in-memory SQLite3 database
2. Initialize tables from `/scripts/schemas`
3. Load the JSON data from `/raw_data` into the initialized tables using the mappers found in `/scripts/mappers`
4. Perform the queries in `/scripts/queries` and write the output to `/output`
5. Clean everything up
Currently everything runs in memory; since we're not dealing with huge datasets, performance has not yet become a bottleneck.
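For orientation, below is a minimal sketch of that flow using the `sqlite3` npm package. The real logic is split across `initialize.js`, `run_etl.js`, and `run_queries.js`, and the file and module names used here are hypothetical, so treat this purely as an illustration of the five steps.
```
// Hypothetical, condensed illustration of the five ETL steps above.
const fs = require('fs');
const sqlite3 = require('sqlite3');

// Step 1: initialize an in-memory SQLite3 database.
const db = new sqlite3.Database(':memory:');

db.serialize(() => {
  // Step 2: create tables from a schema file (file name is hypothetical).
  db.exec(fs.readFileSync('scripts/schemas/incidents.sql', 'utf8'));

  // Step 3: map raw JSON into rows and insert them via a mapper.
  const mapper = require('./scripts/mappers/incidents'); // hypothetical module name
  const raw = JSON.parse(fs.readFileSync('raw_data/incidents/example.json', 'utf8'));
  for (const [table, row] of mapper(raw)) {
    const placeholders = row.map(() => '?').join(', ');
    db.run(`REPLACE INTO ${table} VALUES (${placeholders})`, row);
  }

  // Step 4: run a query and write the result to /output as CSV.
  db.all('SELECT * FROM Incidents', (err, rows) => {
    if (err) throw err;
    const csv = rows.map((r) => Object.values(r).join(',')).join('\n');
    fs.writeFileSync('output/incidents.csv', csv);

    // Step 5: clean everything up.
    db.close();
  });
});
```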
## Extension
There are a few areas which need to be covered before you're able to understand and contribute to this project:
1. Source Data/Extractors
2. Data Schemas
3. Transform
4. Load
5. Queries
### Source Data/Extractors
As mentioned above, the current data sources that inform how well we're doing in terms of reliability are (this list will expand):
1. Incidents
2. PagerDuty
3. Sentry
4. Github
5. DataDog
#### Incidents
Our incidents data comes from [postmortem docs](https://drive.google.com/drive/u/1/folders/0ANGugt3gnJvqUk9PVA) and is currently hand-parsed and converted into a YAML format. This process is still relatively time-consuming and is a candidate for automation in the future.
##### Structures
```
...
- incident_name: <INCIDENT_NAME>
  postmortem_link: <POSTMORTEM_LINK>
  incident_date: <INCIDENT_DATE>
  affected_entities:
    - entity_name: <AFFECTED_ENTITY_NAME>
      products:
        - <AFFECTED_PRODUCT>
  root_causes:
    - <ROOT_CAUSES>
  stats:
    time_of_first_trigger: <TIME_OF_FIRST_TRIGGER>
    time_of_customer_detect: <TIME_OF_CUSTOMER_DETECT>
    time_of_internal_detect: <TIME_OF_INTERNAL_DETECT>
    time_of_recovery: <TIME_OF_RECOVERY>
    time_of_reconcile: <TIME_OF_RECONCILE>
    time_of_rca: <TIME_OF_RCA>
    number_of_impacted_customers: <NUMBER_OF_IMPACTED_CUSTOMERS>
    number_of_failed_requests: <NUMBER_OF_FAILED_REQUESTS>
  severity_level: <SEVERITY_LEVEL>
  reliability_gaps: <RELIABILITY_GAPS>
...
```
- `INCIDENT_NAME`, the name of the incident, taken from the document title.
- `POSTMORTEM_LINK`, the link to the postmortem document.
- `INCIDENT_DATE`, the date of the incident, also taken from the document title.
- `AFFECTED_ENTITY_NAME`, the name of the affected entity (e.g. Xendit, Instamoney). You can get this value from the Affected Entities section of the document.
- `AFFECTED_PRODUCT`, the corresponding affected product. You can also get this value from the Affected Entities section of the document.
- `ROOT_CAUSES`, must be one of the existing values listed [here](https://docs.google.com/spreadsheets/d/1heZfdhdiEMJN7fvaE-ZdM0HmSGQ4kqC8j7pks8W3kuI/edit?usp=sharing); choose the most appropriate one.
- `TIME_OF_*`, fill with `null` if the corresponding field in the document is empty; otherwise use the `YYYY-MM-DD HH:ss` format.
- `NUMBER_OF_*`, fill with a number; if the corresponding field in the document is empty, fill it with `NaN`.
- `SEVERITY_LEVEL`, the severity level of the incident.
- `RELIABILITY_GAPS`, taken from the Reliability Gaps section of the document. Delete the `reliability_gaps` field if all gaps in that section are empty.
#### PagerDuty
The process for getting performance data from PagerDuty is as follows (a rough sketch of what such an extractor might look like follows the list):
1. Run the extractor script: `PAGERDUTY_SECRET=<YOUR_PAGERDUTY_SECRET> node scripts/extractors/pagerduty.js`
2. Make sure the raw data gets written to:
1. `/raw_data/pagerduty_services/`
2. `/raw_data/pagerduty_incidents/`
3. `/raw_data/pagerduty_log_entries/`
4. `/raw_data/pagerduty_teams/`
5. `/raw_data/pagerduty_users/`
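For reference, an extractor along these lines could look like the sketch below, assuming the public PagerDuty REST API v2 and Node 18+'s built-in `fetch`; the actual `scripts/extractors/pagerduty.js` may be organized differently, and the output file name here is made up.
```
// Illustrative sketch only, not the actual extractor.
const fs = require('fs');

async function extractServices() {
  // Fetch one page of services from the PagerDuty REST API v2.
  const res = await fetch('https://api.pagerduty.com/services?limit=100', {
    headers: {
      Authorization: `Token token=${process.env.PAGERDUTY_SECRET}`,
      Accept: 'application/vnd.pagerduty+json;version=2',
    },
  });
  if (!res.ok) throw new Error(`PagerDuty API returned ${res.status}`);
  const body = await res.json();

  // Write the raw JSON where the ETL expects it.
  fs.mkdirSync('raw_data/pagerduty_services', { recursive: true });
  fs.writeFileSync(
    'raw_data/pagerduty_services/services.json', // hypothetical file name
    JSON.stringify(body.services, null, 2)
  );
}

extractServices().catch((err) => {
  console.error(err);
  process.exit(1);
});
```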
#### Sentry
Sentry offers us an API to read the information about our usage and statistics. In order to pull this data, you must have access to a Sentry API secret key (we're currently using personal tokens to avoid having to go through the OAuth flow).
1. Run the Sentry projects extractor by running `SENTRY_API_KEY=<YOUR_API_KEY> node scripts/extractors/sentry.js`
2. Check to make sure that the data has been written to:
1. `/raw_data/sentry_projects/`
2. `/raw_data/sentry_issues/`
3. `/raw_data/sentry_teams/`
#### Github
Our Github data allows us to know whether we've implemented our RFCs correctly. To pull the Github data, run the following:
1. Run the extractor by calling `GITHUB_API_KEY=<YOUR_API_KEY> node scripts/extractors/github.js`
2. Check to make sure the data has been written to:
1. `/raw_data/github_teams/`
2. `/raw_data/github_repos/`
3. `/raw_data/github_repo_teams/`
4. `/raw_data/github_repo_hooks/`
#### DataDog
To pull DataDog data, run the following:
1. Run the extractor by calling `DATADOG_API_KEY=<YOUR_API_KEY> DATADOG_APPLICATION_KEY=<YOUR_APP_KEY> node scripts/extractors/datadog.js`
2. Check to make sure the data has been written to:
1. `/raw_data/datadog_monitors/`
2. `/raw_data/datadog_synthetics/`
### Data Schemas
Below are the data schemas (tables) that the source data is loaded into, which allow us to perform complex queries more efficiently.
#### Incidents
1. Incidents
2. AffectedProducts
3. RootCauses
#### PagerDuty
1. PagerDutyTeams
2. PagerDutyUsers
3. PagerDutyUserTeams
4. PagerDutyUserContactMethods
5. PagerDutyServices
6. PagerDutyIncidents
7. PagerDutyLogEntries
#### Sentry
1. SentryProjects
2. SentryTeams
3. SentryIssues
4. SentryIssueEvents
#### Github
1. GithubRepos
2. GithubTeams
3. GithubRepoTeams
4. GithubRepoHooks
#### DataDog
1. DataDogMonitors
2. DataDogSynthetics
#### Developer Survey
### Transform
Transform is our process of cleaning and augmenting the underlying data to enable useful joins and analysis in the future. All of the transform logic happens in the mappers found in the `/scripts/mappers` directory.
Each mapper should return an array of arrays. Each of the subarray elements should be in the format:
```
[
"<Table Name>",
[<Table Data>]
]
```
Each of these subarray elements represents a separate `INSERT` statement. By returning multiple subarray elements, we can perform multiple SQL inserts for a single piece of source data. A known problem right now is that storing the table data as an array instead of an object makes it unclear to the reader which field each value maps to, and the mapping depends on the order of the fields defined in the schemas.
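As an illustration, a mapper for the incidents data might return something like the sketch below; the exact columns and their order are dictated by the schema files, so the values shown here are placeholders.
```
// Illustrative sketch of a mapper's return value; real mappers live in /scripts/mappers.
module.exports = function mapIncident(incident) {
  return [
    // One subarray per INSERT: [table name, row values in schema column order].
    ['Incidents', [incident.incident_name, incident.postmortem_link, incident.incident_date]],
    ['RootCauses', [incident.incident_name, incident.root_causes[0]]],
  ];
};
```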
### Load
Load is the process where we insert the transformed data into the database. Note that we currently use a `REPLACE INTO` statement, which overwrites the existing row when there is a conflict on a primary key or unique constraint. This means that, for now, you need to load your data in ascending time order, or the final result will only contain the earliest data. This is a known issue and contributions would be helpful here.
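To make the caveat concrete, here is how `REPLACE INTO` behaves in SQLite (the table and column names below are made up for illustration):
```
CREATE TABLE Example (id TEXT PRIMARY KEY, value TEXT);
REPLACE INTO Example (id, value) VALUES ('a', 'older');
REPLACE INTO Example (id, value) VALUES ('a', 'newer');
-- The conflicting row is overwritten, so the table now holds only 'newer'.
-- Whichever REPLACE runs last wins, which is why the raw data has to be
-- loaded in ascending time order for the latest record to survive.
```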