---
Title: Data Search
Subtitle: Finding the right dataset in a massive mess - A Practical Case Study in the Energy Domain
---
# Data Search
## A Practical Case Study in the Energy Domain
## Problem / Introduction - Finding the right dataset in a massive data mess
The fact is that searching for specific information across multiple, complex and heterogeneous databases is difficult and can take a long time.
Reasons:
* Heterogeneous data
* Missing metadata
* Heterogeneous data sources such as spreadsheets, RDBMSs, NoSQL stores, JSON, CSV, PDF and plain-text files, as well as sources that can only be accessed under NDAs, among others
In most companies there are several data sources: internal, external, open access, legal, governmental. Many problems require cross-referencing data among those sources, or at least being able to shorten a list of possibly related data points. Moreover, those sources can (and do) use different ways of referring to the same entity and of referencing related entities (for whatever counts as an entity in each data source). All of this makes searching through such a setup even harder.
So the first step in finding the right data is finding *where* that data is available.
Only *after* finding where the data *might be* is it possible to go forward and look for the particular data instance (or instances) one is interested in, including cross-referencing those datasets.
During the first iterations, the cross-referencing stage was done manually.
## Deeper description of the problem
For our particular problem we have hundreds of datasets coming from different data sources. These include:
* regulations
* filings
* geographic (GIS) databases, from geographic regions to locations and building shapes
* weather data (temperature, pluviometry, wind speed and direction)
* historical power generation data
* historical energy prices
* power plants and power generation units
* internal annotations
* internal projects
* entity descriptions
* entity name mappings
* other entity data
* companies, including holding companies
So how do we go about finding that information?
First, we create a full-text search database with as much meta-information as possible about each of the databases we hold. This is done in several ways, including manual data input (especially for databases that are not yet ingested), automated metadata extraction, and manual correction of the extracted metadata.
We established an iterative process to create a *MASTER* dataset that contains cross-referenced data from the different databases. This dataset is then used to find data that is related but difficult to match. The process itself was defined while the first iterations were carried out manually (going through metadata, finding data sources that can be cross-referenced, and cross-referencing them to obtain a better MASTER dataset that can be searched for related data points across different databases). The first couple of iterations were done manually, in parallel with the metadata gathering and extraction.
We also cross-reference data for the most used datasets, and we plug external search engines (like the FERC eLibrary) into our own search interface; they are pre-configured to accept and merge different search terms, and their results are post-processed to shorten the list of matching candidates. Relying on these external sources is mostly due to time and staffing limitations.
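To give a concrete flavour of the cross-referencing step, here is a minimal sketch of fuzzy name matching between two hypothetical sources; in practice the MASTER dataset also relies on the curated name-mapping tables listed above rather than on string similarity alone.

```python
import difflib

# Hypothetical example: company names as they appear in two different sources.
filings_names = ["Acme Power Holdings Inc.", "Riverbend Energy LLC"]
plant_owner_names = ["ACME POWER HOLDINGS", "Riverbend Energy", "Sunfield Generation"]

def normalize(name: str) -> str:
    """Lowercase and strip common legal suffixes so near-identical names can match."""
    name = name.lower()
    for suffix in (" inc.", " inc", " llc", " ltd", " corp.", " corp"):
        name = name.removesuffix(suffix)
    return name.strip()

# Map each filings name to its closest candidate in the plant-owner list (if any).
candidates = {normalize(n): n for n in plant_owner_names}
for original in filings_names:
    matches = difflib.get_close_matches(normalize(original), list(candidates), n=1, cutoff=0.8)
    matched = candidates[matches[0]] if matches else None
    print(f"{original!r} -> {matched!r}")
```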
## Tech Stack
* Python
* Elasticsearch
* BigQuery + SQL
* Cloud Run
## Data Paths
* Manual metadata input
* Automatic metadata extraction from the DBs
* BigQuery -> JSON -> Elasticsearch
This process runs linearly; there is no performance or time pressure that prevents going through these steps in series, and doing so has the advantage of avoiding concurrency issues [* 1](even if something fails, there is no mission-critical real-time requirement: at most the user-facing interfaces will be a day behind).
Daily tasks (in order of occurrence):
TODO reference the data paths image here (already created)
1. Extract the changes in the datasets and tables stored in BigQuery and dump them to a GCS bucket.
1. Extract the manually introduced data (from a spreadsheet and the UI forms) and dump it into a JSON file in GCS (Google Cloud Storage).
1. Take those, plus several other reference datasets [* 2](static datasets built for reference, mapping different naming conventions, geographic identifiers and aliases), and combine all of them into a single JSON file.
1. This JSON file represents a Metadata Catalog that can be searched later.
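Below is a minimal sketch of the merge step (step 3 above), assuming the automated BigQuery extract and the manual entries have already been dumped as JSON lists of per-table entries; the file names and the `table_id` key are illustrative assumptions.

```python
import json

def merge_catalog(automated_path: str, manual_path: str, output_path: str) -> None:
    """Combine automatically extracted metadata and manually entered metadata
    into a single JSON metadata catalog, keyed by table id."""
    with open(automated_path) as f:
        automated = {entry["table_id"]: entry for entry in json.load(f)}
    with open(manual_path) as f:
        manual = {entry["table_id"]: entry for entry in json.load(f)}

    catalog = {}
    for table_id in automated.keys() | manual.keys():
        entry = dict(automated.get(table_id, {}))
        # Manual edits take priority over automatically inferred fields.
        entry.update(manual.get(table_id, {}))
        catalog[table_id] = entry

    with open(output_path, "w") as f:
        json.dump(list(catalog.values()), f, indent=2)

# merge_catalog("bq_metadata.json", "manual_metadata.json", "metadata_catalog.json")
```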
### Mixing manual data with the automated pathway
Adding manual updates on top of both the manually and the automatically inferred metadata raises the question: what problems come up here?
Priorities on metadata modifications: show the manual edit over the automated ones.
The difference between manually added and automatically extracted metadata is that automated metadata means the organization has already ingested the data into BigQuery, while manually added metadata means the organization only knows the data exists but does not have it yet; it just knows how and where to get it.
Issue: what happens when a manually added metadata entry finally gets ingested as a data source?
TODO reference the conflicting updates image here (already created)
*Path A - Search path*: The user-updated data takes priority over the automated one (this is the implemented policy, even though others, such as timestamp-based priority, could be implemented).
*Path B - User write path*: The new data is kept in a separate index. This avoids any data-overwrite conflicts and leaves the display policy to a later stage, which is configurable and can change without creating data conflicts.
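As an illustration of Path A, here is a minimal sketch of a read-time merge with the Python Elasticsearch client (8.x-style API), assuming two indices named `metadata-automated` and `metadata-manual`; the index and field names are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_metadata(query: str, size: int = 20) -> list[dict]:
    """Search both indices and let manual entries override automated ones
    for the same table_id (Path A: user-updated data wins)."""
    resp = es.search(
        index="metadata-automated,metadata-manual",
        query={"multi_match": {"query": query, "fields": ["name", "description", "columns"]}},
        size=size,
    )
    merged: dict[str, dict] = {}
    for hit in resp["hits"]["hits"]:
        doc, table_id = hit["_source"], hit["_source"]["table_id"]
        # Manual documents replace automated ones; automated never overwrite manual.
        if table_id not in merged or hit["_index"] == "metadata-manual":
            merged[table_id] = doc
    return list(merged.values())
```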
## Extracting metadata from hundreds of databases
Extracting metadata from hundreds of DBs is one of the most CPU- and memory-intensive tasks.
The goal of this step is, for each new table and database, to extract the different values that exist for *significant* columns, where what counts as *significant* varies from DB to DB depending on the domain and the search goal.
### Steps:
1. Manually check column names to understand naming patterns and what kind of content they hold.
1. From those column names, select a few names and string patterns to be blacklisted (anything that is a time series, geo-point, geo-shape, date or ID, for example). This first blacklist cuts down the processing time considerably.
1. After this pre-selection of fields, run a `SELECT COUNT(DISTINCT ...)` over each selected column.
1. Select a threshold for the maximum number of distinct values per column.
1. Run the queries (see the sketch after this list).
1. Store the query results in a new DB or file.
These results now need to be post-processed to remove any undesired data.
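As referenced in step 5, here is a minimal sketch of the counting/extraction queries using the BigQuery Python client; the project, table, column names and threshold are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()   # assumes default GCP credentials
MAX_DISTINCT = 500           # threshold on the number of distinct values per column

def extract_column_values(table: str, columns: list[str]) -> dict[str, list]:
    """For each pre-selected (non-blacklisted) column, keep its distinct values
    only if their count stays under the threshold; otherwise skip the column."""
    extracted = {}
    for column in columns:
        count_job = client.query(f"SELECT COUNT(DISTINCT {column}) AS n FROM `{table}`")
        n_distinct = list(count_job.result())[0]["n"]
        if n_distinct <= MAX_DISTINCT:
            values_job = client.query(f"SELECT DISTINCT {column} FROM `{table}`")
            extracted[column] = [row[column] for row in values_job.result()]
    return extracted

# Example (placeholder table and column names):
# extract_column_values("my-project.energy.power_plants", ["fuel_type", "state"])
```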
### Updating Metadata
Afterwards, to update the metadata, the same process can be run on the data that has changed since the last run, and the results are post-processed and merged with the previous ones.
This updating and post-processing can partly be done in SQL, but for our purposes we need full-text search, so the post-processing also converts the metadata to JSON and indexes it into an Elasticsearch index.
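A minimal sketch of that last step, pushing the JSON metadata catalog into an Elasticsearch index with the official Python client; the index name, document id field and file path are assumptions.

```python
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_catalog(catalog_path: str, index: str = "metadata-catalog") -> None:
    """Bulk-index the metadata catalog JSON so it becomes full-text searchable."""
    with open(catalog_path) as f:
        entries = json.load(f)
    actions = (
        {"_index": index, "_id": entry["table_id"], "_source": entry}
        for entry in entries
    )
    helpers.bulk(es, actions)

# index_catalog("metadata_catalog.json")
```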
## Searching through metadata
If the things I'm looking for are out there, find out *where* they are.
## Searching through the data - MASTER data search
Geographic search (Google Maps style)
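As a rough illustration of what a map-style search could translate to, here is a minimal sketch of an Elasticsearch `geo_distance` filter, assuming an index with a `geo_point` field called `location`; the index name, field name and coordinates are placeholders.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_near(lat: float, lon: float, radius_km: float = 50, index: str = "master-data"):
    """Return documents whose geo_point location falls within the given radius."""
    resp = es.search(
        index=index,
        query={
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": f"{radius_km}km",
                        "location": {"lat": lat, "lon": lon},
                    }
                }
            }
        },
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]

# Example: documents within 50 km of a placeholder coordinate.
# search_near(40.0, -105.0)
```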