# Indexing Content on the CNs

The indexing process takes existing content, parses it, extracts values to populate the fields of an index document, and updates the Solr index with that document.

Indexing performance on the CNs is very slow: a complete re-index currently takes several days, perhaps weeks. The goal is to re-index the entire corpus in an hour. One hour is 3600 seconds, and there are currently about 3E6 documents in DataONE (say 3.6E6), which works out to 1,000 documents per second on average.

## Feeding the indexer

On the DataONE CNs, the indexer running as `index_task_processor` reads tasks from a postgres database. Those tasks are added to the database by `index_task_generator`, which listens for events on the Hazelcast system metadata map. The Hazelcast system metadata map emits a large number of events, many of which are not relevant to the indexing process.

```plantuml
hzSysMeta -> index_task_generator: change
index_task_generator -> index_task_generator: filter
note right: [1]
index_task_generator -> queue: add
queue -> index_task_processor: task
```

**Figure 1.** Simplistic depiction of index task creation.

[1] The filter algorithm is something like:

```
1. Fetch the solr document for the pid.
2. If there is no solr document for the pid:
   2.1 Check the archive flag in the system metadata. If archive=true,
       filter this pid out (return true).
   2.2 If archive=false, keep this pid task (return false).
3. If there is a solr document, compare the modification dates of the solr
   document and the system metadata:
   3.1 If sysmeta > solr, return false (keep the index task).
   3.2 If sysmeta < solr, return true (filter it out).
   3.3 If sysmeta = solr, compare the replica info:
       3.3.1 If a serialVersion is available in solr, compare serial versions:
             3.3.1.1 If solr = sysmeta, return true (filter it out): no change in replicas.
             3.3.1.2 If solr < sysmeta, return false (keep the index task): solr has
                     a smaller (older) serial version.
             3.3.1.3 If solr > sysmeta, return true (filter it out): solr has
                     a bigger (newer) serial version.
       3.3.2 If serialVersion is not available in solr, compare the replica lists:
             3.3.2.1 No change in replica info: return true (filter out).
             3.3.2.2 There is a change: return false (keep the index task).
```

An alternative approach for creating index events is to use the information in the `metacat` postgres database. Specifically, content with a system metadata `dateModified` more recent than the corresponding entry in the Solr index has not been indexed.

```plantuml
metacat_db -> index_task_generator: modified > X
index_task_generator -> index_task_generator: filter
note right: [1]
index_task_generator -> queue: add
queue -> index_task_processor: task
```

**Figure 2.** Alternative approach to populating the index queue.

[1] The filter algorithm is something like:

```
1. Get records from postgres newer than X.
2. For each id:
   2.1 Get the record from Solr (use the realtime get API).
   2.2 If not found: add a task.
   2.3 If the postgres modified stamp > the solr modified stamp: add a task.
```

When re-indexing, the generator can page through all records with no need to check for an entry in Solr (a sketch of this appears at the end of this section).

- At the node level this must be event driven rather than polling. The CN would also benefit from this, but it is less critical there.
- Removing Hazelcast:
  - sync and indexing both use it
  - need to ensure the CNs use postgres replication to maintain consistency
- The session cluster is used to pass JWTs between CNs. Could switch to postgres replication for this. Actually there is no need to do this if the client is able to maintain its JWT.
- Add the file path to the stored index task.
- Or use a hash of the id, a content hash, etc.
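A minimal sketch of the postgres-driven generator from Figure 2, for the re-index case where the Solr comparison is skipped. It assumes the `systemmetadata` table with `guid` and `date_modified` columns used by the SQL later in this document; the connection string and `enqueueTask` are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

/** Sketch: generate index tasks by paging the metacat systemmetadata table. */
public class PgTaskGenerator {

    public static void main(String[] args) throws SQLException {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/metacat", "user", "password")) {
            // X = last processed timestamp; the epoch for a full re-index.
            Timestamp since = Timestamp.from(Instant.EPOCH);
            String sql = "SELECT guid, date_modified FROM systemmetadata"
                    + " WHERE date_modified > ? ORDER BY date_modified LIMIT 1000";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int n;
                do {
                    ps.setTimestamp(1, since);
                    n = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // An incremental run would first compare against the
                            // Solr record (realtime get) before enqueueing.
                            enqueueTask(rs.getString("guid"));
                            since = rs.getTimestamp("date_modified");
                            n++;
                        }
                    }
                    // Note: '>' skips rows tied at the boundary timestamp; a
                    // production version would use '>=' plus de-duplication.
                } while (n > 0); // stop once caught up
            }
        }
    }

    // Placeholder: e.g. INSERT INTO the task table read by index_task_processor.
    static void enqueueTask(String pid) {
        System.out.println("queue: " + pid);
    }
}
```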
## Indexing stuff

General rules for performant Solr indexing:

1. Do not commit manually; let Solr do it.
2. Do not rely on document state.
3. Where #2 cannot be avoided, use realtime get for retrieval.

Fetching documents by query requires that the index is up to date, which requires a commit to be processed. Doing this per document adds significant overhead (2-3 orders of magnitude). The realtime get API avoids the commit entirely, as sketched below.
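A minimal sketch of the difference, using the SolrJ client (8.x style). The core URL and identifier are placeholders, and realtime get assumes the update log is enabled in `solrconfig.xml`.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

/** Sketch: retrieve an uncommitted document via realtime get instead of a query. */
public class RealtimeGetExample {

    public static void main(String[] args) throws Exception {
        // Core URL is a placeholder.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/search_core").build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "urn:uuid:example-pid"); // placeholder identifier
            solr.add(doc);

            // A query (solr.query(...)) would not see this document until a
            // commit is processed -- the per-document commit the rules above
            // warn against.

            // Realtime get serves the latest version from the update log,
            // no commit required:
            SolrDocument latest = solr.getById("urn:uuid:example-pid");
            System.out.println(latest);
        }
    }
}
```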
**DataONE Indexing**

There are three broad categories of content being indexed:

1. System metadata

   System metadata (`sysm`) is present for all indexed content. Certain properties of `sysm` are mutable, and hence existing indexed entries may be updated.

2. Independent documents

   Metadata formats such as Dublin Core, EML, or ISO are immutable in DataONE and only need to be indexed once (unless the index is being rebuilt). These documents can be indexed independently of other documents; that is, index values extracted from the source document do not change values for other documents.

3. Interdependent documents

   Formats such as resource maps and annotations are immutable in DataONE and only need to be indexed once. However, these formats have index values that update certain fields of other documents, and those other documents may or may not have been indexed beforehand.

## Interdependent Document Formats

These are documents handled by:

* `org.dataone.cn.indexer.annotation.RdfXmlSubprocessor`
* `org.dataone.cn.indexer.parser.ResourceMapSubprocessor`
* `org.dataone.cn.indexer.annotation.EmlAnnotationSubprocessor`
* `org.dataone.cn.indexer.annotation.AnnotatorSubprocessor`

```
RdfXmlSubprocessor
    prov_wasDerivedFrom
    prov_wasInformedBy
    prov_used
    prov_generated
    prov_generatedByProgram
    prov_generatedByExecution
    prov_generatedByUser
    prov_usedByProgram
    prov_usedByExecution
    prov_usedByUser
    prov_wasExecutedByExecution
    prov_wasExecutedByUser
    prov_hasDerivations
    prov_instanceOfClass
    hasPart
    isPartOf

ResourceMapSubprocessor
    resourceMap
    documents
    isDocumentedBy

EmlAnnotationSubprocessor
    sem_annotation

AnnotatorSubprocessor
    sem_annotation
    sem_annotation_bioportal_sm
    sem_annotation_esor_sm
    sem_annotation_cosine_sm
    sem_annotates
    sem_annotated_by
```

```plantuml
indexer -> SolrIndexService: processObject()
activate SolrIndexService
SolrIndexService -> systemMetadataProcessor: processDocument()
activate systemMetadataProcessor
note right: (1)
systemMetadataProcessor -> SolrIndexService: docs[]
deactivate systemMetadataProcessor
SolrIndexService -> subprocessor: processDocument(docs[])
activate subprocessor
note right: (2)
subprocessor -> SolrIndexService: docs[]
deactivate subprocessor
loop merge_doc in docs
  loop subprocessor in subprocessors
    SolrIndexService -> subprocessor: mergeWithIndexedDocument(merge_doc)
    activate subprocessor
    note right of subprocessor: (3)
    subprocessor -> Solr: query(identifier)
    Solr -> subprocessor: docs
    subprocessor -> SolrIndexService: merge_doc
    deactivate subprocessor
  end
end
SolrIndexService -> SolrIndexService: getAddCommand(merge_docs[])
SolrIndexService -> indexer: add_doc
deactivate SolrIndexService
```

**Figure 3.** General flow for indexing a single document. (1) System metadata values are extracted. (2) Metadata document values are extracted using the subprocessors that match the `formatId` of the document. (3) All subprocessors apply `mergeWithIndexedDocument()`, though only the interdependent document processors (currently four) do anything.

The `mergeWithIndexedDocument` operation retrieves the document from Solr with the same ID as the document being indexed. Then, for each field in the retrieved document, if the field is in the list of `fieldsToMerge` and the document being indexed does not already hold that value in the field, the field is added to the document being indexed. The modified document is returned. `fieldsToMerge` is set from the respective bean configuration.

Note that the interdependent document processors **each** request entries matching the document identifier from Solr, and this is applied to **all** entries being indexed. Each of these processors issues two queries to Solr, and the queries are repeated for each processor. The queries require that the index is up to date (any changes committed), hence the overhead is high: a commit and eight Solr queries (two per processor, four processors) are needed for each system metadata entry processed (update or add).

This process should be eliminated if possible. Instead, the `merge_doc` should be created with field-level `add` or `set` modifiers and document `_version_=0`. This will create documents where not present, or update existing documents with the new values. It eliminates the need to make multiple requests to Solr to retrieve possibly existing documents in order to generate the `merge_doc` (see the sketch below).
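A minimal sketch of such a `merge_doc` using SolrJ. The identifiers are placeholders; `resourceMap` and `isDocumentedBy` are fields from the list above. Field-level modifiers are expressed as single-entry maps, and `_version_=0` disables version matching so the update is applied whether or not the document already exists.

```java
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

/** Sketch: create-or-update a document with atomic field modifiers. */
public class AtomicMergeDoc {

    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/search_core").build()) {

            // merge_doc equivalent: touch only the interdependent fields.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "urn:uuid:data-pid"); // placeholder PID
            // "add" appends to a multi-valued field; "set" would replace a value.
            doc.addField("resourceMap",
                    Collections.singletonMap("add", "urn:uuid:ore-pid"));
            doc.addField("isDocumentedBy",
                    Collections.singletonMap("add", "urn:uuid:metadata-pid"));
            // _version_=0: no optimistic concurrency check, so the document is
            // created if absent or updated in place if present.
            doc.addField("_version_", 0L);
            solr.add(doc);
            // No commit and no prior lookup of the existing document needed.
        }
    }
}
```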
The challenge is that the `identifier` in question may be a PID or a SID (e.g., a resource map may assert that a metadata SID documents a data PID), and only PIDs are used for the Solr document `id`. Hence the need for queries on both PID and SID, and hence the need to `commit` for each change to the index.

### Series Identifiers

SIDs complicate indexing considerably. The object of any relation may be a PID or a SID, while Solr documents are referenced only by `id`, which is set to the PID value.

#### Use a composite key on PID and SID

Something like `id = "{PID} {SID}"`? Benefit: the [realtime get API](https://solr.apache.org/guide/8_1/realtime-get.html) can be used to retrieve the document (no commit needed). A `/get` request can include multiple IDs, so all four combinations could be retrieved with a single call.

#### Create Solr documents for both PID and SID

Can lead to inconsistency in the SID-identified document as older revisions receive updates. Perhaps don't update the SID doc if the current object is obsoleted. This is basically the same as using the SID as the `id`, below.

#### Use SID as `id` when available

If the current doc has a SID and is not obsoleted, then `id` = SID, else `id` = PID. Use `/get` to retrieve the referenced doc when needed. However, the object of a relation can be a SID or a PID, and we don't know which. Hence there needs to be a mechanism to determine whether an identifier is a PID or a SID, and this needs to be available before the content is available on the CN. E.g., resource map A holds a reference to "B". B is not yet sync'd, so there is no way to know whether "B" is a PID or a SID. A doc with `id` = B is created; later B is sync'd and we find it has SID "C". Hence a doc with `id` = C is created, the document B values are copied to C, and then B is deleted.

#### Separate collections for documents and relations, and possibly replica info

With this option there is never any need to retrieve content from the index while indexing. Likely to provide the highest performance, on the order of 1000 docs/sec. Requires refactoring any components using the Solr index.

#### Continue as is

Performance can be tuned a fair bit, but the hard limits of at least one commit and one Solr search per system metadata entry remain.

#### Query PG for PID

All identifiers are in postgres before indexing, so postgres can be queried to resolve an ID to a PID. The following query returns the PID given an ID:

```sql
SELECT guid FROM systemmetadata
 WHERE guid={ID} OR series_id={ID}
 ORDER BY date_modified DESC LIMIT 1;
```

If no row is returned, then the item has not yet been synchronized and an entry in Solr can not be created (since it is not known whether the ID is a PID or a SID). Otherwise the resulting PID can be used with the Solr realtime get to retrieve the document for merge.

See also: https://github.com/amoeba/dataone-solr-refactor-demo

## Indexer Suggestions

**Minimal/no code refactoring:**

1. Configuration layout is a bit opaque. Configuration of the indexer components is cumbersome, with potential duplication of resources between the jars and the filesystem. Suggest moving all configuration to the file system, with internal resources providing the minimum necessary to bootstrap.

1. The current 5.x version of Solr is 5.5.5 [^v5]. The CNs should at least be updated to this version to pick up bug fixes. Note that Solr older than 7.7.x is EOL.

1. Upgrade to the current production release of Solr. Solr is now on the 8.10 series, with many significant improvements over the 5.x series. The upgrade is a major change and will require schema modifications and a complete re-index. Solr 8.x will work with Java 8 [^jver].

1. Add an index field that records the time the Solr record was added or modified. This is useful for diagnosing indexing delays / latency.

1. Use more granular update commands. The current implementation is brute force and highly dependent on a synchronous state of the Solr index; it requires a commit on every write operation, which defeats index update optimizations and leads to poor update performance. See [updating parts of a document](https://solr.apache.org/guide/8_8/updating-parts-of-documents.html) and [optimistic concurrency](https://solr.apache.org/guide/8_8/updating-parts-of-documents.html#optimistic-concurrency). Note that the granular approach (setting or updating individual fields) works regardless of whether the document already exists.

1. Use Cross Data Center Replication (CDCR) for propagating the index to other CN instances. ~~CDCR is more resilient and less chatty. It was introduced in version 6.0, so a Solr upgrade is necessary to use it.~~ CDCR is deprecated and scheduled for removal in 9.0; however, there *should* be a replacement available. https://solr.apache.org/guide/8_10/cross-data-center-replication-cdcr.html

   We can also set up additional index replicas kept up to date from the master index. It may be preferable to serve all user-facing operations from replica instances. This would reduce the load on the active master, especially during heavy indexing operations.

1. Set `preferLocalShards`. This is important for cloud clients when updating the index - otherwise updates can be sent to the "leader", which may or may not be local.

1. Writes are sent to the Solr cloud leader [^wleader]. This means that if the leader is at ORC, the indexer at UCSB writing to Solr cloud sends documents to ORC for the leader to propagate back to UCSB. The [FORCELEADER action](https://solr.apache.org/guide/6_6/collections-api.html#CollectionsAPI-forceleader) was introduced in Solr 6; in prior versions, the leader depends on startup order (earliest start wins).

1. Increase or decrease the number of ZK nodes. DataONE currently operates with two CNs, and the Solr index uses Zookeeper for replication. A two-node ZK ensemble needs both nodes to form a majority quorum, so running with two servers is less stable than running with a single server. Note though that DataONE operates in single-master mode, so updates to Solr occur only on the active node, and those changes are propagated to the other via ZK.

1. Use a combination of atomic updates with optimistic concurrency. Despite language in the Solr docs implying that updates apply only to existing documents, the operations work fine when no document is present. Combine this with `_version_=0` and documents will be created if they don't exist, or updated if they do. With this strategy the latest update wins.

**High priority refactoring:**

1. Make `index-task-processor` more atomic in operation. It should be possible to run multiple instances all using the same queue. Use Postgres for the queue; there are many examples demonstrating efficient queues with SKIP LOCKED (see the sketch after this list). Make it as easy to get `systemMetadata` as the metadata document (e.g., read from a file path).

1. Clients to Solr cloud should generally not request explicit commits. Instead, clients should rely on the `/get` API, which retrieves a document by identifier. Note however the need to retrieve documents by PID or SID - only the PID is used as the document `id`.

1. Index processors that update related documents repeat requests for those documents. The processors `RdfXmlSubprocessor`, `ResourceMapSubprocessor`, `EmlAnnotationSubprocessor`, and `AnnotatorSubprocessor` each query for the documents to be updated.

1. Indexer processors that need to update content retrieve documents by PID or SID. This requirement defeats optimizations that can leverage retrieval of documents through the `/get` API (retrieving an added document that has not yet been indexed). This in turn implies that the indexing process should be split into parsing and indexing stages: the parsing stage generates documents that can be handed off to the index with minimal additional processing, instead of performing synchronous, sequential operations on the Solr index.
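A minimal sketch of the claim-and-complete cycle for the Postgres-backed queue referenced in item 1 above. The `index_task` table, its columns, and the delete-on-completion convention are all hypothetical; `FOR UPDATE SKIP LOCKED` lets concurrent workers each lock a different row, so multiple `index-task-processor` instances can poll the same table without contention.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Sketch: one worker iteration against a Postgres-backed task queue. */
public class IndexQueueWorker {

    public static void main(String[] args) throws SQLException {
        // Connection details and the index_task table are hypothetical.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/metacat", "user", "password")) {
            conn.setAutoCommit(false);
            String claim = "SELECT id, pid FROM index_task"
                    + " WHERE status = 'pending'"
                    + " ORDER BY date_added"
                    + " LIMIT 1 FOR UPDATE SKIP LOCKED";
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(claim)) {
                if (rs.next()) {
                    long taskId = rs.getLong("id");
                    processTask(rs.getString("pid")); // parse + write to Solr
                    try (PreparedStatement done = conn.prepareStatement(
                            "DELETE FROM index_task WHERE id = ?")) {
                        done.setLong(1, taskId);
                        done.executeUpdate();
                    }
                }
            }
            // Committing releases the row lock; while it is held, other
            // workers skip this row and claim different tasks.
            conn.commit();
        }
    }

    static void processTask(String pid) { /* index the object */ }
}
```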
**Other refactoring...**

1. Use JSON format for update documents. Parsing and creating JSON is generally faster, with fewer escaping issues.

1. Clean up logging. Logging is a bit verbose by default, and messages can be hard to parse.

[^v5]: See https://archive.apache.org/dist/lucene/solr/5.5.5/

[^jver]: See https://solr.apache.org/guide/8_10/solr-system-requirements.html

[^wleader]: See https://solr.apache.org/guide/8_10/shards-and-indexing-data-in-solrcloud.html#leaders-and-replicas