# E2E Data Ingestion

This document describes all the changes needed to enable us to ingest data into Elasticsearch. The remaining open issues are listed at the end.

## Infrastructure

The [xxx-]es-http service needs to be exposed.

## [Elasticsearch JavaScript Client](https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/8.4/index.html)

### Versioning

We are sticking to the client version closest to the Elasticsearch version, although the clients are forward compatible:

> Language clients are forward compatible; meaning that clients support communicating with greater or equal minor versions of Elasticsearch. Elasticsearch language clients are only backwards compatible with default distributions and without guarantees made.
>
> For more information consult the [Compatibility matrix](https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/8.4/installation.html#js-compatibility-matrix).

---

### Connecting

> By default Elasticsearch will start with security features like authentication and TLS enabled. To connect to the Elasticsearch cluster you’ll need to configure the Node.js Elasticsearch client to use HTTPS with the generated CA certificate in order to make requests successfully.
>
> Depending on the circumstances there are two options for verifying the HTTPS connection, either verifying with the CA certificate itself or via the HTTP CA certificate fingerprint.

#### TLS configuration

> The generated root CA certificate can be found in the certs directory in your Elasticsearch config location ($ES_CONF_PATH/certs/http_ca.crt).

```typescript
const fs = require('fs')
const { Client } = require('@elastic/elasticsearch')

const client = new Client({
  node: 'https://localhost:9200',
  auth: { ... },
  tls: {
    ca: fs.readFileSync('./http_ca.crt'),
    rejectUnauthorized: false
  }
})
```

#### CA fingerprint

> You can configure the client to only trust certificates that are signed by a specific CA certificate (CA certificate pinning) by providing a caFingerprint option. This will verify that the fingerprint of the CA certificate that has signed the certificate of the server matches the supplied value. You must configure a SHA256 digest.

```typescript
const { Client } = require('@elastic/elasticsearch')

const client = new Client({
  node: 'https://example.com',
  auth: { ... },
  // the fingerprint (SHA256) of the CA certificate that is used to sign
  // the certificate that the Elasticsearch node presents for TLS
  caFingerprint: '20:0D:CA:FA:76:...',
  tls: {
    // might be required if it's a self-signed certificate
    rejectUnauthorized: false
  }
})
```

---

**For testing purposes you can turn off certificate verification by passing a `tls` object to the client.**

```typescript
tls: {
  rejectUnauthorized: false
}
```

---
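The `caFingerprint` value itself can be derived from the same `http_ca.crt`. A minimal sketch using only Node's standard library; the local file path is an assumption, and running `openssl x509 -fingerprint -sha256 -noout -in http_ca.crt` on the Elasticsearch host should print the same value:

```typescript
const { createHash } = require('crypto')
const { readFileSync } = require('fs')

// read the PEM certificate copied from the cluster (path is an assumption)
const pem = readFileSync('./http_ca.crt', 'utf8')

// strip the PEM armor and whitespace, then decode the base64 body to DER bytes
const der = Buffer.from(
  pem.replace(/-----(BEGIN|END) CERTIFICATE-----|\s/g, ''),
  'base64'
)

// the fingerprint is the SHA256 digest of the DER-encoded certificate,
// conventionally printed as colon-separated uppercase hex pairs,
// e.g. '20:0D:CA:FA:76:...'
const hex = createHash('sha256').update(der).digest('hex').toUpperCase()
const caFingerprint = hex.match(/.{2}/g)!.join(':')

console.log(caFingerprint)
```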
#### ApiKey authentication

An [API key](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-api-create-api-key.html#security-api-create-api-key) can be provided either as the base64 encoded string or as an object with the key's id and value.

```typescript
const { Client } = require('@elastic/elasticsearch')

const client = new Client({
  node: 'https://localhost:9200',
  auth: {
    apiKey: 'base64EncodedKey'
  }
})
```

```typescript
const { Client } = require('@elastic/elasticsearch')

const client = new Client({
  node: 'https://localhost:9200',
  auth: {
    apiKey: {
      id: 'foo',
      api_key: 'bar'
    }
  }
})
```

#### API key generation

```json
POST /_security/api_key
{
  "name": "alkemio-server",
  "role_descriptors": {
    "alkemio-server": {
      "cluster": ["monitor"],
      "index": [
        {
          "names": ["*"], // specify the index
          "privileges": ["read", "write", "monitor"]
        }
      ]
    }
  },
  "metadata": {
    "application": "alkemio-server",
    "environment": {
      "managed_by": "alkemio-server",
      "trusted": true,
      "tags": ["ingestion", "alkemio-server"]
    }
  }
}
```

#### Bearer authentication

A [service account token](https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-create-service-token.html) can be provided by passing an auth object. **From what I have seen so far this option won't help us.**

```typescript
const { Client } = require('@elastic/elasticsearch')

const client = new Client({
  node: 'https://localhost:9200',
  auth: {
    bearer: 'token'
  }
})
```

## The Nest.js service

The usage of the client needs to be abstracted and the client needs to be initialized beforehand. For that purpose a service is introduced (see the sketch after this list) that solves the following:

- the use cases are narrowed down
- configuration is hidden and managed centrally
- the client is initialized a single time
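A minimal sketch of what such a service could look like, assuming `@nestjs/config` for configuration and API key authentication; the names (`ElasticsearchService`, `ELASTICSEARCH_URL`, `ELASTICSEARCH_API_KEY`, `ingestContributionEvent`) are hypothetical:

```typescript
import { Injectable, OnModuleDestroy } from '@nestjs/common'
import { ConfigService } from '@nestjs/config'
import { Client } from '@elastic/elasticsearch'

@Injectable()
export class ElasticsearchService implements OnModuleDestroy {
  private readonly client: Client

  // Nest instantiates the provider once, so the client is created a single time
  constructor(config: ConfigService) {
    this.client = new Client({
      node: config.getOrThrow<string>('ELASTICSEARCH_URL'),
      auth: { apiKey: config.getOrThrow<string>('ELASTICSEARCH_API_KEY') },
    })
  }

  // a narrowed use case: append a contribution event to the data stream
  public async ingestContributionEvent(event: Record<string, unknown>) {
    return this.client.index({
      index: 'contribution-events',
      // data streams are append-only, so documents must be created, not updated
      op_type: 'create',
      document: { '@timestamp': new Date().toISOString(), ...event },
    })
  }

  async onModuleDestroy() {
    await this.client.close()
  }
}
```

Registered as a regular provider, the service is a singleton, so consumers get the single pre-initialized client and only see the narrowed methods rather than the raw client API.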
## Elastic storage

This section covers the issues related to storing the documents and managing the indices.

### Index vs [Stream](https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html)

I have chosen the data stream here instead of a plain index, because the stream handles a lot of the lifecycle issues automatically, like rollover, write index, aliases, etc.

### Index Lifecycle Management (ILM)

```json
PUT _ilm/policy/contribution-events-ilm-policy
{
  "policy": {
    "_meta": {
      "description": "used for contribution events",
      "project": {
        "name": "MAU",
        "department": "Alkemio"
      }
    },
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "20gb",
            "max_age": "60d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "60d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 2,
            "index_codec": "best_compression"
          },
          "readonly": {},
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "100d",
        "actions": {
          "readonly": {},
          "set_priority": {
            "priority": 0
          }
        }
      }
    }
  }
}
```

### Index template

```json
PUT _index_template/contribution-events
{
  "index_patterns": ["contribution-events*"],
  "data_stream": {},
  "priority": 500,
  "template": {
    "settings": {
      "number_of_shards": 2,
      "index.lifecycle.name": "contribution-events-ilm-policy",
      "index.lifecycle.rollover_alias": "contribution-events"
    },
    "mappings": {
      "properties": {
        "id": { "type": "keyword" },
        "name": { "type": "keyword" },
        "author": { "type": "keyword" },
        "type": { "type": "keyword" },
        "hub": { "type": "keyword" },
        "environment": { "type": "keyword" },
        "alkemio": { "type": "boolean" },
        "@timestamp": { "type": "date", "format": "date_optional_time" }
      }
    }
  },
  "_meta": {
    "description": "Index template for contribution events"
  }
}
```

### Stream creation

```
PUT _data_stream/contribution-events
```

## Issues to be solved

- exposing the Elasticsearch HTTP service publicly
- managing certificates
- using the certificate to connect
- managing the expiration of API keys/bearer tokens

###### tags: Data, Ingestion, Elasticsearch