# Indexing Content on the CNs
The indexing process takes existing content, parses it, extracts values to populate fields of an index document, and updates the Solr index with the index document.
Indexing performance on the CNs is very slow; a complete re-index currently takes several days, perhaps weeks.
The goal is to re-index the entire corpus in about an hour. One hour is 3600 seconds, and there are currently about 3E6 documents in DataONE; call it 3.6E6. That means sustaining about 1,000 docs per second, on average.
## Feeding the indexer
In DataONE CNs, the indexer running as `index_task_processor` reads tasks from a postgres database. Those tasks are added to the database by `index_task_generator`, which listens to events on the Hazelcast system metadata map. The Hazelcast system metadata map emits a large number of events, many of which are not relevant to the indexing process.
```plantuml
hzSysMeta -> index_task_generator: change
index_task_generator -> index_task_generator: filter
note right: [1]
index_task_generator -> queue: add
queue -> index_task_processor: task
```
**Figure 1.** Simplistic depiction of index task creation.
[1] The filter algorithm is something like:
```
1. Fetch the solr document for the pid.
2. If there is no solr document for the pid:
   2.1 Check the archive flag in the system metadata. If archive=true, filter this pid out (return true).
   2.2 If archive=false, keep this pid task (return false).
3. If there is a solr document, compare the modification dates of the solr document and the system metadata.
   3.1 If sysmeta > solr, return false (keep index task).
   3.2 If sysmeta < solr, return true (filter it out).
   3.3 If sysmeta = solr, compare the replica info.
       3.3.1 If serialVersion is available in the solr document, compare serial versions:
             3.3.1.1 If solr = sysmeta, return true (filter it out) since there is no replica change.
             3.3.1.2 If solr < sysmeta, return false (keep index task) since solr has a smaller (older) serial version.
             3.3.1.3 If solr > sysmeta, return true (filter it out) since solr has a bigger (newer) serial version.
       3.3.2 If serialVersion is not available in the solr document, compare the replica lists:
             3.3.2.1 If there is no change in replica info, return true (filter it out).
             3.3.2.2 If there is a change, return false (keep index task).
```
An alternative approach for creating index events is to use the information in the `metacat` postgres database. Specifically, content with system metadata `dateModified` more recent than the most recent entry in the solr index has not been indexed.
```plantuml
metacat_db -> index_task_generator: modified > X
index_task_generator -> index_task_generator: filter
note right: [1]
index_task_generator -> queue: add
queue -> index_task_processor: task
```
**Figure 2.** Alternative approach to populating index queue.
[1] The filter algorithm is something like:
```
1. Get records from postgres newer than X.
2. For each id:
   2.1 Get the record from Solr (use the realtime get API).
   2.2 If not found: add task.
   2.3 If the postgres modified timestamp > the solr modified timestamp: add task.
```
When re-indexing from scratch, the generator can simply page through all records with no need to check for an entry in Solr.
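A minimal sketch of this generator loop, in Java with JDBC and SolrJ. The `systemmetadata` table and column names follow the resolution query later in this document; the connection details, Solr core name, `dateModified` index field, and `addTask()` handoff are assumptions:
```java
// Hedged sketch: generate index tasks from the metacat postgres database
// instead of Hazelcast events.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.util.Date;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class PgIndexTaskGenerator {

    public static void main(String[] args) throws Exception {
        Timestamp lastRun = Timestamp.valueOf("2021-01-01 00:00:00"); // "X"
        try (Connection pg = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/metacat", "indexer", "secret");
             SolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/search_core").build();
             PreparedStatement st = pg.prepareStatement(
                     "SELECT guid, date_modified FROM systemmetadata"
                             + " WHERE date_modified > ?")) {
            st.setTimestamp(1, lastRun);
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    String pid = rs.getString("guid");
                    // Realtime get: sees uncommitted adds, so no commit needed.
                    SolrDocument doc = solr.getById(pid);
                    if (doc == null) {
                        addTask(pid); // 2.2: not found in solr
                    } else if (rs.getTimestamp("date_modified")
                            .after((Date) doc.getFieldValue("dateModified"))) {
                        addTask(pid); // 2.3: postgres newer than solr
                    }
                }
            }
        }
    }

    static void addTask(String pid) { /* enqueue for index_task_processor */ }
}
```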
- Node-level task generation must be event driven rather than polling. The CN would also benefit from this, but it is less critical.
- Removing Hazelcast:
  - Both sync and indexing use it.
  - Need to ensure the CNs use postgres replication to maintain consistency.
  - The session cluster is used to pass JWTs between CNs. This could switch to postgres replication, though there is actually no need if the client is able to maintain its own JWT.
- Add the file path to the stored index task.
  - Or use a hash of the id, a content hash, etc.
## Indexing stuff
General rules for performant Solr indexing:
1. Do not commit manually; let Solr manage commits.
2. Do not rely on document state.
3. Where #2 is unavoidable, use realtime get for retrieval.

Fetching documents by query requires that the index be up to date, which in turn requires a commit to be processed. Doing this per document adds significant overhead (2-3 orders of magnitude).
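To illustrate the difference, a hedged SolrJ sketch of the two retrieval paths (URL, core name, and identifier are assumptions):
```java
// Sketch contrasting query vs. realtime get.
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class GetVsQuery {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/search_core").build()) {
            // Search: only finds the document after a commit has been processed.
            solr.query(new SolrQuery("id:\"urn:uuid:example\""));
            // Realtime get: reads the latest version from the update log,
            // so no commit is required first.
            SolrDocument doc = solr.getById("urn:uuid:example");
        }
    }
}
```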
**DataONE Indexing**
There are three broad categories of content being indexed:
1. System metadata
System metadata (`sysm`) is present for all indexed content. Certain properties of `sysm` are mutable, and hence existing indexed entries may be updated.
2. Independent Documents
Metadata formats such as Dublin Core, EML, or ISO are immutable in DataONE and only need to be indexed once (unless the index is being rebuilt). These documents can be indexed independently of other documents; that is, index values extracted from the source document do not change values of other documents.
3. Interdependent Documents
Formats such as resource maps and annotations are immutable in DataONE and only need to be indexed once. However, these formats have index values that update certain fields of other documents. Those other documents may or may not have been indexed beforehand.
## Interdependent Document Formats
These are documents handled by:
* `org.dataone.cn.indexer.annotation.RdfXmlSubprocessor`
* `org.dataone.cn.indexer.parser.ResourceMapSubprocessor`
* `org.dataone.cn.indexer.annotation.EmlAnnotationSubprocessor`
* `org.dataone.cn.indexer.annotation.AnnotatorSubprocessor`
```
RdfXmlSubprocessor
prov_wasDerivedFrom
prov_wasInformedBy
prov_used
prov_generated
prov_generatedByProgram
prov_generatedByExecution
prov_generatedByUser
prov_usedByProgram
prov_usedByExecution
prov_usedByUser
prov_wasExecutedByExecution
prov_wasExecutedByUser
prov_hasDerivations
prov_instanceOfClass
hasPart
isPartOf
ResourceMapSubprocessor
resourceMap
documents
isDocumentedBy
EmlAnnotationSubprocessor
sem_annotation
AnnotatorSubprocessor
sem_annotation
sem_annotation_bioportal_sm
sem_annotation_esor_sm
sem_annotation_cosine_sm
sem_annotates
sem_annotated_by
```
```plantuml
indexer -> SolrIndexService: processObject()
activate SolrIndexService
SolrIndexService -> systemMetadataProcessor:processDocument()
activate systemMetadataProcessor
note right: (1)
systemMetadataProcessor -> SolrIndexService:docs[]
deactivate systemMetadataProcessor
SolrIndexService -> subprocessor:processDocument(docs[])
activate subprocessor
note right: (2)
subprocessor -> SolrIndexService:docs[]
deactivate subprocessor
loop merge_doc in docs:
loop subprocessor in subprocessors
SolrIndexService -> subprocessor: mergeWithIndexedDocument(merge_doc)
activate subprocessor
note right of subprocessor: (3)
subprocessor -> Solr: query(identifier)
Solr -> subprocessor: docs
subprocessor -> SolrIndexService:merge_doc
deactivate subprocessor
end
end
SolrIndexService -> SolrIndexService:getAddCommand(merge_docs[])
SolrIndexService -> indexer: add_doc
deactivate SolrIndexService
```
**Figure 3.** General flow for indexing a single document. (1) System metadata values are extracted. (2) Metadata document values are extracted using subprocessors that match the `formatId` of the document. (3) All subprocessors apply `mergeWithIndexedDocument()`, though only the interdependent document processors (currently 4) do anything.
The `mergeWithIndexedDocument` operation retrieves the document from Solr with the same ID as the document being indexed. Then, for each field in the retrieved document that is in the list of `fieldsToMerge` and whose value is not already present on the document being indexed, the field value is added to the document being indexed. The modified document is returned.
`fieldsToMerge` is set from the respective bean configuration.
Note that interdependent document processors **each** request entries matching the document identifier from Solr, and this is applied to **all** entries being indexed.
Each of these processors issues two queries to solr, and these queries are repeated for each processor. The queries require that the index is up to date (any changes committed), hence the overhead is high: a commit and 8 solr queries are needed for each system metadata entry processed (update or add).
This process should be eliminated if possible. Instead, the `merge_doc` should be created with field-level `add` or `set` modifiers and the document `_version_=0`. This will create documents if not present, or update existing documents with the new values. It eliminates the need to make multiple requests to solr to retrieve possibly existing documents in order to generate the `merge_doc`.
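A sketch of what the merge-free path could look like in SolrJ; the field names follow the `ResourceMapSubprocessor` list above, while the identifiers and core name are illustrative:
```java
// Hedged sketch: build the merge_doc with atomic-update modifiers instead of
// reading the possibly-existing document back from Solr first.
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class MergeFreeUpdate {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/search_core").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "urn:uuid:data-pid");
            // "add" appends to a multi-valued field; "set" replaces a value.
            doc.addField("resourceMap",
                    Collections.singletonMap("add", "urn:uuid:ore-pid"));
            doc.addField("isDocumentedBy",
                    Collections.singletonMap("add", "urn:uuid:metadata-pid"));
            // _version_ = 0: no optimistic-concurrency check, so the document
            // is created if absent and updated in place if present.
            doc.addField("_version_", 0L);
            solr.add(doc);
        }
    }
}
```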
The challenge is that the `identifier` in question may be a PID or a SID (e.g. a resource map may assert that a metadata SID documents a data PID), and only PIDs are used for the solr document `id`. Hence the need for the queries on PID and SID, and hence the need to `commit` for each change to the index.
### Series Identifiers
SIDs complicate indexing considerably. The object of any relation may be a PID or a SID, but Solr documents are referenced only by `id`, which is set to the PID value.
#### Use a composite key on PID and SID
Something like `id = "{PID} {SID}"`?
Benefit: can use the [realtime get API](https://solr.apache.org/guide/8_1/realtime-get.html) to retrieve the document (no commit needed).
A `/get` request can include multiple IDs, so all 4 combinations could be retrieved with a single call.
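For example (a SolrJ sketch; the identifiers and composite keys shown are illustrative):
```java
// Sketch: retrieve all candidate composite keys with a single /get request.
import java.util.Arrays;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocumentList;

public class CompositeKeyLookup {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/search_core").build()) {
            // One call covers the combinations of an identifier appearing as
            // a bare PID, a bare SID, or in a "{PID} {SID}" composite key.
            SolrDocumentList found = solr.getById(Arrays.asList(
                    "pidA", "sidA", "pidA sidA", "sidA pidA"));
        }
    }
}
```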
#### Create Solr documents for both PID and SID
Can lead to inconsistency in the SID-identified document as older revisions receive updates. Perhaps don't update the SID doc if the current object is obsoleted. This is basically the same as using the SID as the `id`, below.
#### Use SID as `id` when available
If the current doc has a SID and is not obsoleted, then `id`=SID, else `id`=PID.
Use `/get` to retrieve the referenced doc when needed.
However, the object of a relation can be a SID or a PID, and we don't know which. Hence there needs to be a mechanism to determine whether an identifier is a PID or a SID, and this needs to be available before the content is available on the CN.
e.g. resource_map A has a reference to "B". B is not yet sync'd, so there is no way to know if "B" is a PID or a SID. Create a doc with `id` = B; later B is sync'd and we find it has SID "C". Hence a doc with `id` = C is created, the document B values are copied to C, and then B is deleted.
#### Separate collections for documents and relations, and possibly replica info
This option means there is never any need to retrieve content from the index while indexing. Likely to provide the highest performance, on the order of 1000 docs/sec.
Requires refactoring any components using the Solr index.
#### Continue as is
Performance can be tuned a fair bit, but the hard limits of at least one commit and one solr search per system metadata entry remain.
#### Query PG for PID
All identifiers are in postgres before indexing, so we could query postgres to resolve an ID to a PID. The following query returns the PID for a given ID:
```sql
SELECT guid FROM systemmetadata
WHERE guid={ID} OR series_id={ID}
ORDER BY date_modified DESC LIMIT 1;
```
If the response is NULL, then the item has not yet been synchronized and an entry in Solr cannot be created (since it is not known whether the ID is a PID or a SID). Otherwise the resulting PID can be used with the Solr realtime get to retrieve the document for merge.
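A minimal sketch of the resolution step, using JDBC and the query above:
```java
// Sketch: resolve an identifier that may be a PID or a SID to its PID.
// Returns null when the item is not yet synchronized.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PidResolver {
    static String resolveToPid(Connection pg, String id) throws SQLException {
        try (PreparedStatement st = pg.prepareStatement(
                "SELECT guid FROM systemmetadata"
                        + " WHERE guid = ? OR series_id = ?"
                        + " ORDER BY date_modified DESC LIMIT 1")) {
            st.setString(1, id);
            st.setString(2, id);
            try (ResultSet rs = st.executeQuery()) {
                // No row: not yet synchronized; PID vs. SID is unknown.
                return rs.next() ? rs.getString("guid") : null;
            }
        }
    }
}
```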
See also: https://github.com/amoeba/dataone-solr-refactor-demo
## Indexer Suggestions
**Minimal/no code refactoring:**
1. Configuration layout is a bit opaque.
Configuration of the indexer components is cumbersome, with potential duplication of resources between the jars and the filesystem. Suggest moving all configuration to the file system, with internal resources providing only the minimum necessary to bootstrap.
1. The current 5.x version of Solr is 5.5.5 [^v5].
The CNs should at least be updated to this version to pick up bug fixes.
Note that any Solr older than 7.7.x is EOL.
1. Upgrade to the current production release of Solr.
Solr is now on the 8.10 series, with many significant improvements over the 5.x series. The upgrade is a major change and will require schema modifications and a complete re-index. Solr 8.x will work with Java 8 [^jver].
1. Add an index field that records the time the solr record was added or modified.
This is useful for diagnosing indexing delays / latency.
1. Use more granular update commands.
The current implementation is brute force and depends on the Solr index always being in an up-to-date, committed state. This forces a commit on every write operation, which defeats index update optimizations and leads to poor update performance.
See [updating parts of a document](https://solr.apache.org/guide/8_8/updating-parts-of-documents.html) and [optimistic concurrency](https://solr.apache.org/guide/8_8/updating-parts-of-documents.html#optimistic-concurrency)
Note that the granular approach (setting or updating individual fields) works regardless of whether the document exists or not.
1. Use Cross Data Center Replication (CDCR) for propagating the index to other CN instances.
~~CDCR is more resilient and less chatty. It was introduced in Version 6.0, so a Solr upgrade is necessary to use.~~
CDCR is deprecated and scheduled to be removed in 9.0. However there *should* be a replacement available. https://solr.apache.org/guide/8_10/cross-data-center-replication-cdcr.html
We can also set up many index replicas that are kept up to date from the master index.
It may be preferable to serve all user-facing operations from replica instances. This would reduce the load on the active master, especially during heavy indexing operations.
1. Set `preferLocalShards`.
This directs distributed query traffic to replicas on the local node where possible, avoiding unnecessary cross-node hops. Note that it applies to queries, not updates; updates are always routed to the shard leader, as described in the next item.
1. Writes are sent to the Solr cloud leader[^wleader].
This means that if the leader is ORC, the indexer on UCSB writing to Solr cloud is sending documents to ORC for the leader to propagate back to UCSB.
The [FORCELEADER action](https://solr.apache.org/guide/6_6/collections-api.html#CollectionsAPI-forceleader) was introduced in Solr 6. In prior versions, the leader depends on startup order (earliest start wins).
1. Increase or decrease the number of ZK nodes.
DataONE is currently operating with two CNs. The solr cloud uses ZooKeeper for coordination, and ZK requires a majority quorum: with two participants the quorum is both nodes, so losing either one halts the ensemble. Running with two servers is therefore less stable than running a single server.
Note though that DataONE operates in single-master mode, so updates to Solr only occur on the active node, and those changes are propagated to the other node.
1. Use a combination of atomic updates with optimistic concurrency.
Despite language in the solr docs that implies updates apply only to existing documents, the operations work fine when no document is present. Combine this with `_version_=0` (no version check) and documents will be created if they don't exist, or updated if they do.
With this strategy the latest update wins.
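A short sketch of the `_version_` semantics from the Solr optimistic concurrency documentation (identifiers, field names, and core name are illustrative):
```java
// _version_ > 1 : must match the stored version exactly, else a conflict error
// _version_ = 1 : the document must already exist
// _version_ < 0 : the document must NOT exist
// _version_ = 0 : no check; create-or-update, latest write wins
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UpsertSketch {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/search_core").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "urn:uuid:example");
            doc.addField("seriesId",
                    Collections.singletonMap("set", "urn:sid:example"));
            doc.addField("_version_", 0L); // upsert semantics
            solr.add(doc);
        }
    }
}
```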
**High priority refactoring:**
1. Make `index-task-processor` more atomic in operation.
It should be possible to run multiple instances all using the same queue.
Use Postgres for the queue (there are many examples demonstrating efficient queues with SKIP LOCKED; see the sketch after this list).
Make it as easy to get the system metadata as the metadata document (e.g. read both from a file path).
1. Clients of Solr cloud should generally not request explicit commits.
Instead, clients should rely on the `/get` API, which retrieves a document by identifier. Note however the need to retrieve documents by PID or SID; only the PID is used as the document `id`.
1. Index processors repeat requests for related documents.
The processors `RdfXmlSubprocessor`, `ResourceMapSubprocessor`, `EmlAnnotationSubprocessor`, and `AnnotatorSubprocessor` each query Solr for the documents to be updated, repeating the same lookups for every entry processed.
1. Indexer processors that need to update content retrieve documents by PID or SID.
This requirement defeats optimizations that leverage retrieval of documents using the `/get` API (which can retrieve an added document that has not yet been committed to the index).
This in turn implies that the indexing process should be split into parsing and indexing stages. The parsing stage generates documents that can be handed off to the index with minimal additional processing, instead of performing synchronous, sequential operations on the Solr index.
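A hedged sketch of the shared Postgres queue suggested above, using `FOR UPDATE SKIP LOCKED`; the `index_task` table, its columns, and the connection details are hypothetical:
```java
// Multiple index_task_processor instances can claim tasks from one table:
// locked rows are skipped rather than blocked on, so workers never contend.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class QueueWorker {
    public static void main(String[] args) throws Exception {
        try (Connection pg = DriverManager.getConnection(
                "jdbc:postgresql://localhost/metacat", "indexer", "secret")) {
            pg.setAutoCommit(false);
            try (PreparedStatement claim = pg.prepareStatement(
                    "SELECT id, pid FROM index_task"
                            + " WHERE status = 'queued'"
                            + " ORDER BY date_added"
                            + " LIMIT 10 FOR UPDATE SKIP LOCKED");
                 ResultSet rs = claim.executeQuery()) {
                while (rs.next()) {
                    // Process the task; claimed rows stay locked (and are
                    // skipped by other workers) until this transaction ends.
                    String pid = rs.getString("pid");
                }
            }
            // Mark the claimed tasks done / delete them, then commit.
            pg.commit();
        }
    }
}
```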
**Other refactoring...**
1. Use JSON format for update documents.
Parsing and creating JSON is generally faster, with fewer escaping issues.
1. Clean up logging.
Logging is a bit verbose by default, and messages can be hard to parse.
[^v5]: See https://archive.apache.org/dist/lucene/solr/5.5.5/
[^jver]: See https://solr.apache.org/guide/8_10/solr-system-requirements.html
[^wleader]: See https://solr.apache.org/guide/8_10/shards-and-indexing-data-in-solrcloud.html#leaders-and-replicas