---
tags: forge
---

# External Models Requirements

"External Models" refers to the ability to include _external_ models/scores in the Forge.AI pipeline such that customers can add their own scores and still access them as if those scores had been computed by Forge.AI.

## Options

Two options have been considered:

1. Custom models are run inside Forge's infrastructure.
2. Custom models are run inside the customer's infrastructure.

Both options have their pros and cons, but customers are likely to be uncomfortable handing their model code to Forge. As a result, option 2 (models are run inside the customer's infrastructure) is preferred.

## High-Level Description

As a customer, I deploy a REST endpoint within my infrastructure which computes the score I want to add to Forge's documents. The score will follow Forge's score description with:

* `scoreType` (a UUID assigned by Forge)
* `modelName`
* `modelVersion`
* `scope` -- entity, entity location, sentence, section, or document.

Once processed, my scores are only available/viewable to my organization, very much like documents submitted to the ingest API. I can query my scores in the appropriate Snowflake table (`EntityScores`, `EntityLocationScores`, `SentenceScores`, `SectionScores`, or `DocumentScores`) in a dedicated database/share. I understand that this will require a cross-database join between the generic Forge database and my dedicated database (a sample query is sketched below).

Due to the asynchronous nature of document processing, it is possible that a document is available in Forge's database before my custom score has been computed or loaded into Snowflake. The opposite, however (a custom score loaded into Snowflake before the document is loaded into Forge's data lake), should not happen.

The input to the REST endpoint is a PDOC (fully populated with text and all augmentations computed by Forge's pipeline) in JSON format. The output of the endpoint is a JSON object described below. The REST endpoint will be called for every document processed by Forge's ANvIL pipeline. Although called for every document, the REST endpoint is not required to compute a score for every document and may return an empty result.

One REST endpoint should serve one score identified by the immutable triplet `scope`/`scoreType`/`modelName`. The endpoint may return scores for multiple `modelVersion`s -- during a transition period from one version to the next, for instance. Multiple scores should each be served by their own endpoint (which may be hosted on the same server).

Monitoring of the operation of the REST endpoint is left up to the customer. However, Forge will make available the following metrics as metadata items (stored in the `DocumentMetadata` table inside the dedicated database/share):

* `<scoreType>/<modelName> Time` -- the (successful) query time in milliseconds
* `<scoreType>/<modelName> Timeout` -- if the query timed out
* `<scoreType>/<modelName> Error` -- the (unsuccessful) HTTP status code or network error
* `<scoreType>/<modelName> Message` -- the (unsuccessful) HTTP status message
* `<scoreType>/<modelName> Dropped` -- if the document was dropped and not sent to the endpoint because the endpoint could not sustain the throughput
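To make the cross-database join concrete, here is a minimal sketch (Python with `snowflake-connector-python`) of how a customer might join their dedicated score share against the generic Forge database. The database names (`FORGEAI_DB`, `CUSTOMER_DB`), the `DOCUMENTS` table, and the column names are assumptions for illustration only -- substitute the names provisioned for your organization.

```python
# Sketch only: database/table/column names below are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>",
)

# Join my dedicated score database (hypothetical CUSTOMER_DB) against the
# generic Forge database (hypothetical FORGEAI_DB) on the document uuid.
SQL = """
SELECT d.DOCUMENTUUID,
       d.TITLE,                                          -- hypothetical column
       s.MODELNAME,
       s.MODELVERSION,
       s.SCORE,
       s.CONFIDENCE
FROM   FORGEAI_DB.ARTICLES_V0660.DOCUMENTS       AS d    -- generic Forge database
JOIN   CUSTOMER_DB.ARTICLES_V0660.DOCUMENTSCORES AS s    -- my dedicated database/share
  ON   s.DOCUMENTUUID = d.DOCUMENTUUID
WHERE  s.SCORETYPE = 'de8b0e4c-ba3f-48cd-8e13-b5035c48d0a7'
"""

for row in conn.cursor().execute(SQL):
    print(row)
```

The same pattern applies to the `DocumentMetadata` items above, e.g. to derive query-time or error-rate statistics for an endpoint.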
As a customer, I can configure my endpoint to be called, or not, for each type of document processed by Forge:

* open-source -- for documents collected by Forge
* SEC 8K/10K/10Q -- for SEC filings
* ~~Naviga -- for documents received from Naviga~~
* ~~Factset -- for transcripts received from Factset (if applicable to this customer)~~
* ~~DJN -- for documents received from Dow Jones (if applicable to this customer)~~
* FactSquared -- for earnings transcripts
* API -- for documents submitted via the ingest API (if applicable to this customer)

As a customer, I understand this facility can only be used to add scores to documents. It cannot be used to add any other elements, like entities, resolved entities, events, relationships, etc.

### Workflow

Here is an expected typical workflow (from a customer's point of view):

1. Imagine a new score to be added to Forge's data. It could be a numerical score, like a new sentiment model, a textual score (classification), etc.<br>The current `documentScore` element is expected to be sufficiently flexible to handle all score types:
    * `score` -- the _score_ itself.
    * `confidence` -- holds the _confidence_ of the score (if applicable).
    * `index` -- may be used to store multi-valued scores.

    For example, a new sentiment model would store the sentiment score in the `score` element with a confidence in the `confidence` element (if applicable). A new classification score could use the `index` element for the classes with a `score` for each class.
2. Use the historical data, accessible in Snowflake, to develop the model.<br>Note: this might require access to the full text of the document.
3. Request a `scoreType` from Forge.
4. Implement a REST API endpoint to wrap the model and interface with ANvIL.
5. Test the REST API locally using sample PDOCs provided by Forge.AI (see the client sketch after this list).
6. Deploy the endpoint in test mode.<br>That means, at a minimum, making the endpoint accessible from Forge's AWS internal network, possibly via a Secure Link.
7. Register/configure the endpoint with Forge in test mode.<br>In test mode, a smaller number of documents will be submitted to the endpoint.
8. After validation of the REST endpoint and the model performance, deploy the endpoint in prod mode.
9. Register/configure the endpoint with Forge in prod mode.<br>In prod mode, every applicable document processed by Forge will be submitted to the endpoint.
10. It is expected that re-training will be required from time to time, requiring the current production version of the endpoint+model to run in parallel with a newer version in development.
11. If I want to have a test endpoint and a prod endpoint (of the same model), I can deploy 2 endpoints (with the same `scoreType` but different `modelName`) and keep one in test mode.
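As an illustration of the local-testing step above, here is a minimal sketch of a test call using the Python `requests` library. The endpoint URL and sample file name are placeholders; the `PUT` verb, headers, and 30s timeout mirror the requirements described in the next sections.

```python
# Sketch only: URL and file name are placeholders for a locally deployed endpoint.
import gzip
import json

import requests

with open("sample_pdoc.json", "rb") as f:        # a sample PDOC provided by Forge.AI
    pdoc_bytes = f.read()

resp = requests.put(
    "http://localhost:8080/de8b0e4c-ba3f-48cd-8e13-b5035c48d0a7",  # placeholder scoreType URL
    data=gzip.compress(pdoc_bytes),               # compressed request body
    headers={
        "Content-Type": "application/json; encoding=UTF-8",
        "Content-Encoding": "gzip",
        "Accept-Encoding": "gzip",
    },
    timeout=30,                                   # mirror Forge's fixed 30s timeout
)

assert resp.status_code == 200, resp.status_code  # anything other than 200 is treated as an error
print(json.dumps(resp.json(), indent=2))          # the score response described below
```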
## JSON response

This response format is modeled after the `documentScore` XML element from the PDOC:

```json=
{
  "version": "1.0",
  "timestamp": "<UNIX time (seconds since 1970-01-01 00:00:00 UTC)>",
  "uuid": "<string: document uuid>",
  "scoreType": "<string>",
  "modelName": "<string>",
  "scope": "entity | entity-location | sentence | section | document",
  "versions": [
    {
      "modelVersion": "<string (256 characters max)>",
      "scores": [
        {
          "score": "<string (256 characters max)> (may be a float encoded as a string)",
          "confidence": "<float> (optional)",
          "index": "<integer> (optional)",
          "entityId": "<integer> (required for entity score or entity-location score)",
          "sentenceId": "<integer> (required for sentence score or entity-location score)",
          "startOffset": "<integer> (required for entity-location score)",
          "sectionId": "<integer> (required for section score)"
        }
      ]
    }
  ]
}
```

An empty response can take either of the following forms -- with an empty `scores` array:

```json=
{
  "version": "1.0",
  "timestamp": "1628877265",
  "uuid": "9fd57a12-fc5f-11eb-9a03-0242ac130003",
  "scoreType": "de8b0e4c-ba3f-48cd-8e13-b5035c48d0a7",
  "modelName": "foo",
  "scope": "entity",
  "versions": [
    {
      "modelVersion": "1.0",
      "scores": []
    }
  ]
}
```

or with an empty `versions` array:

```json=
{
  "version": "1.0",
  "timestamp": "1628877265",
  "uuid": "9fd57a12-fc5f-11eb-9a03-0242ac130003",
  "scoreType": "de8b0e4c-ba3f-48cd-8e13-b5035c48d0a7",
  "modelName": "foo",
  "scope": "entity",
  "versions": []
}
```

Note that if a score is returned multiple times, only one of them will be made available in Snowflake.

## Requirements

### To the REST endpoint

* The REST endpoint should respond to the `PUT` verb. The payload will be in the message body in JSON format encoded using UTF-8. The response will also be in JSON format encoded using UTF-8.
* The server should expect standard HTTP headers on the request, including:
    * `Content-Type: application/json; encoding=UTF-8`
    * `Content-Length: ...`
    * `Accept-Encoding: gzip`
    * `Content-Encoding: gzip` (if compression is used)
* The response should include standard HTTP headers, including:
    * `Content-Type: application/json; encoding=UTF-8`
    * `Content-Length: ...`
    * `Content-Encoding: gzip` (if compression is used)
* The REST endpoint should support HTTP/1.1, in particular persistent connections.
* Any HTTP status other than `200` (OK) will be considered an error. It is assumed the call failed and no score could be computed. No retry will be attempted.
* If the REST endpoint doesn't respond within a fixed 30s timeout, the call is aborted. It is assumed the call failed and no score could be computed. No retry will be attempted.
* The REST endpoint is expected to sustain at least 100 calls/minute, with peaks at 800 calls/minute.<br>PDOCs are on average 260KB for general web articles, and can be as large as 200MB (uncompressed).
* From time to time, Forge processes large volumes of documents outside of its normal pipeline -- for processing historical documents or reprocessing documents -- in a dedicated batch environment. In that case volume may be 10 times higher than normal processing, and Forge will contact the customer to discuss inclusion, or not, of the custom scores in the reprocessing.
* It is recommended that the REST endpoint support compressed queries with the `gzip` method (i.e. `Content-Encoding: gzip`) to lower bandwidth requirements for large documents.<br>The REST endpoint may also reply with compressed responses (and set the `Content-Encoding` header accordingly).
* It is advised to deploy the REST endpoint behind a load-balancer to enable blue/green deployments with no downtime, as maintenance windows are not supported in this release.
* It is recommended that the endpoint URL encode at least the `scoreType`, and potentially `modelName` and `modelVersion`, such that multiple scores may be served behind one load-balancer, e.g. `http://.../<scoreType>/<modelVersion>`.
* Although Forge guarantees it will only send a specific version of the PDOC to the endpoint, it is recommended to prepare for changes in the PDOC format version.<br>The version of the PDOC format is included in the PDOC itself, allowing code to dynamically accommodate multiple versions. Typically, changes to the PDOC format are backward compatible, such that code that simply ignores unknown elements will be easier to upgrade and is likely to "just work" against new versions.<br>Forge guarantees to support at least 2 versions behind the current version.
* Custom models are called asynchronously from Forge's pipeline. There is no guarantee that a document submitted for scoring is already available in Snowflake.<br>Similarly, it should be expected that there will be an (unspecified) delay between the computation of a score and that score being available in Snowflake.
* The REST endpoint should be idempotent, as documents may occasionally be submitted multiple times (during reprocessing for example).

#### Out of scope

* Authentication is out-of-scope (beyond a hard-coded token-based approach in the URL). It is advised to deploy the REST endpoint either through an AWS SecureLink or behind an IP-based firewall.
* Inclusion of custom HTTP headers in the request is also out-of-scope.

### To Forge's pipeline

* Custom models should not impact Forge's internal pipeline in any way. To that end, custom models should be called *after* Forge's pipeline, possibly leveraging the syndication infrastructure.<br>This includes loading documents into Snowflake. In fact, it is expected that documents would be available in Snowflake prior to having custom scores.
* If multiple custom models (from the same customer) are deployed, it is left to the implementation whether each custom model is called in parallel or serially.
* Custom scores should not be loaded into Snowflake until all custom scores (for a customer) are computed (for a document).
* Custom models from customer A should also not impact custom models from customer B. It is expected that custom models from customer A and from customer B would be called in parallel.<br>It is expected that, at a minimum, one agent/handler would be deployed for each customer (with a registered endpoint), such that even in the event of a crash of the agent, other customers' scores wouldn't be affected.
* ~~Client side failures, including network failures, that prevented some documents from being scored should be resubmitted. It is acceptable if this is initially a manual process.~~
* `scoreType` (assigned by Forge) should be a UUID to ensure uniqueness (and opaqueness) across customers.
* See above for the description of the REST call.
* Forge's agent should take advantage of persistent connections offered by HTTP/1.1 (if supported by the endpoint).
* To reduce bandwidth, requests might be compressed using the `gzip` method (if supported by the endpoint and configured to use compression).
* PDOCs submitted to the customer's endpoint should follow the external schema -- stripped of `internalMetadata` and `events`, for instance.
* PDOCs submitted to the customer's endpoint should follow a _fixed_ version/format.<br>Although only the latest version of the PDOC is supported in this first release, it should be expected that in the future a specific version of the PDOC must be created to call a custom model.<br>In other words, customer models are not required to update to handle changes to the PDOC format.<br>Multiple endpoints from the same customer may use different PDOC versions. In that case, the custom scores will still be available in each versioned database: old scores will be visible in the latest database, and new scores will be visible in the old database.
* The custom model handler should make sure that only the score expected to be computed by the model is recognized and loaded into Snowflake. This does NOT apply to `modelVersion` -- although it would be an error for a custom model to return scores without a `modelVersion`. Note that multiple `modelVersion`s are supported.<br>Other scores should simply be ignored -- e.g. if the custom model is registered to compute `scoreType=abc`, `modelName=foo` but returns `scoreType=abc`, `modelName=bar`.<br>If multiple scores are returned, it is left to the implementation how to handle them, as long as <u>at least</u> one of the scores is available in Snowflake.
* Only documents that a customer has access to (and has elected to receive for this endpoint) should be candidates for scoring.<br>For instance, if a customer does not have access to the DowJones dataset, no DJN documents should be submitted to the endpoint.<br>No provision is expected in this release to further filter documents submitted to the endpoint. However, the custom model handler must be able to limit the number of documents submitted to the endpoint **while in test mode** to, say, 1% of the normal traffic. There is no requirement to ensure any kind of distribution of the documents submitted in test mode -- i.e. a random sample is acceptable.
* Any validation error of the response will cause the whole response to be rejected with error code 418. Examples of validation errors include:
    * `scoreType` doesn't match the expected type
    * `index` is not an `integer`
    * a score for an entity with a non-existing `entityId`
    * an entity-location level score without the associated `startOffset`
    * etc.

    Note that a response without any score is not an error.
* Processing information (timing of successful scoring, or error information) should be made available as metadata items loaded into the dedicated customer database, so that a customer may use that data to derive operational statistics such as:
    * runtime performance statistics
    * error rate
    * etc.
* If the endpoint is falling behind -- i.e. it cannot process documents fast enough to sustain the throughput -- the pipeline should catch up by dropping (not sending) documents older than 15 minutes (i.e. documents that have been sitting in the queue for more than 15 minutes). In this case the `<scoreType>/<modelName> Dropped` metadata item should be added to the PDOC.
* Responses from custom models should be archived in S3.<br>Those data files may be used for restoring the database in case of a disaster, or for debugging or auditing.<br>Note: responses from all custom model endpoints per customer should be saved in one file (per document).
* High-level monitoring and alerting should be in place such that we get notified if a configured endpoint is not responding or is falling behind.<br>At a minimum, the same runtime information added to the metadata should be exposed via Prometheus.
* All custom endpoints will be called from our production environment -- except for our own test endpoints.
* Although it is not specified how configuration information for each endpoint is stored, managed, or handled, it should be easy to disable or suspend all endpoints for one customer.<br>Note: it is expected that an admin UI will be deployed in the future to ease management of many endpoints for many customers.
* The system should enforce uniqueness of `scoreType` across all customers.<br>It should also enforce a 1:1 relationship between `scoreType`/`modelName` and `endpoint` -- i.e. 2 different endpoints should not return the same `scoreType`/`modelName` (even with different `modelVersion`s).
* It should be possible to include, or not, custom models during reprocessing or batch processing of historical documents.

### To Forge's Snowflake

The Snowflake share for the custom models data should look like:

* database: (some name, not versioned)
    * schema: `ARTICLES_V0660` (union of `FORGEAI_ARTICLES` & `FORGEAI_NAVIGA`)
        * views: `EntityScores`, `EntityLocationScores`, `SentenceScores`, `SectionScores`, `DocumentScores`, `DocumentMetadata`
    * schema: `DJN_V0660` (if applicable)
        * views: `EntityScores`, `EntityLocationScores`, `SentenceScores`, `SectionScores`, `DocumentScores`, `DocumentMetadata`
    * schema: `FACTSET_V0660` (if applicable)
        * views: `EntityScores`, `EntityLocationScores`, `SentenceScores`, `SectionScores`, `DocumentScores`, `DocumentMetadata`
    * schema: `SEC_8K_V0660`, `SEC_10K_V0660`, `SEC_10Q_V0660` (if applicable)
        * views: `EntityScores`, `EntityLocationScores`, `SentenceScores`, `SectionScores`, `DocumentScores`, `DocumentMetadata`
    * schema: `<ingest api name>_V0660` (if applicable)
        * views: all the 06.60 views

As a result, in Forge's Snowflake account there needs to be a customer database with all the tables and views in it:

* database: `<API name>`
    * schema: `FORGEAI_ARTICLES`
        * tables: `EntityScores`, `EntityLocationScores`, `SentenceScores`, `SectionScores`, `DocumentScores`, `DocumentMetadata`
    * schema: `FORGEAI_NAVIGA`
        * tables: `EntityScores`, `EntityLocationScores`, `SentenceScores`, `SectionScores`, `DocumentScores`, `DocumentMetadata`
    * schema: `ARTICLES_V0660`
        * views (union of `FORGEAI_ARTICLES` & `FORGEAI_NAVIGA`): `EntityScores`, `EntityLocationScores`, `SentenceScores`, `SectionScores`, `DocumentScores`, `DocumentMetadata`
    * same for DJN, SEC, ...
    * schema: `FORGEAI_<API name>`
        * tables: `EntityScores`, `EntityLocationScores`, `SentenceScores`, `SectionScores`, `DocumentScores`, `DocumentMetadata`
    * schema: `<API name>_V0660`
        * views (from `FORGEAI_<API name>`): `EntityScores`, `EntityLocationScores`, `SentenceScores`, `SectionScores`, `DocumentScores`, `DocumentMetadata`

### Out of scope

Although desirable, the following items are out-of-scope initially:

* In this first release, only one score is to be computed by one call to the endpoint, so as not to complicate the timeout logic and reporting/monitoring.
* Any client code, even in the form of sample code, to help in the deployment of the REST endpoint. That includes code to parse/generate PDOCs.<br><u>However, we should provide sample PDOCs covering various cases for testing locally</u>.
* An administration UI to manage the deployment of REST endpoints. Initially this will be done manually.
* Monitoring of the REST endpoint is left to the customer. No action is taken if the REST endpoint is not responding, for instance.
* It is not envisioned that Forge would charge compute time, network I/O cost, or ETL cost back to the customer, and there is no provision to specifically keep track of compute time per customer.
* Support for maintenance windows. It is advised to deploy the REST endpoint behind a load-balancer to enable blue/green deployments with no downtime.
* Support for adding customer's score(s) to already processed documents.
* Reprocessing documents to re-compute a customer's score(s).
* Support for redirection (HTTP 3xx). It is advised to deploy the REST endpoint behind a load-balancer/proxy to handle internal routing.
* Support for caching (HTTP 304 Not Modified).
* Retries and/or the HTTP 503 (Service Unavailable) status code (simply considered an error in this first release).
* It would be advisable to filter data sent to a model to only include relevant data from the PDOC. This is however out-of-scope at this time.
* No notification to the customer will be implemented to indicate when a computed score is loaded into Snowflake.
* PDOCs syndicated through the current syndication pipeline will not contain any custom scores -- i.e. no change to syndication.
* PDOCs retrieved via the API will not contain any custom scores -- i.e. no change to that API.
* SEC forms 3/4/5 are not included.

## Implementation Framework

<iframe width="100%" height="600px" src="https://docs.google.com/drawings/d/1ASAj2s6TqrdjiV57Axzzu5WbWQ4vRMjg2ov7UYrm1OM/edit"></iframe>

## Acceptance criteria

To validate and test the implementation, we should provide an endpoint computing a (simple) score for each score scope. For example:

* At document scope, compute the count of sections as `scoreType = test`, `modelName = section-count`, `modelVersion = 0`, `scope = document`;
* At section scope, compute the count of sentences as `scoreType = test`, `modelName = sentence-count`, `modelVersion = 0`, `scope = section`;
* At sentence scope, compute the count of entities as `scoreType = test`, `modelName = entity-count`, `modelVersion = 0`, `scope = sentence`;
* At location scope, compute the minimum label length (as `index` 0) and maximum label length (as `index` 1) as `scoreType = test`, `modelName = label-length`, `modelVersion = 0`, `scope = entity-location`;
* At entity scope, compute the number of instances of that entity as `scoreType = test`, `modelName = instance-count`, `modelVersion = 0`, `scope = entity`.

To exercise error cases, an "error endpoint" which always responds with an "HTTP 500 Internal Server Error" should be set up (`scoreType = test`, `modelName = error`, `modelVersion = 0`) for each scope. Finally, a "timeout endpoint" which always times out (i.e. doesn't respond for at least 30s) should be set up (`scoreType = test`, `modelName = timeout`, `modelVersion = 0`) for each scope.

Care must be taken with regard to characters outside of the basic ASCII set.

## Q&A

* Can custom scores be used by Forge?
    * No.
    * Custom scores will be stored in the customer's dedicated database regardless of the source (open-source, SEC, DJN, etc.) of the document.
* How can a customer re-train and re-deploy a model?
    * Recommendation is to update the `modelVersion` to indicate the new re-trained model.
    * In this first release, Forge recommends a zero-downtime deployment model to upgrade the endpoint, although Forge can accommodate a short downtime upon request.
* How can a customer deploy an updated model (significant update, more than a re-train)?
    * Recommendation is to register/deploy a new endpoint identified with the same `scoreType` but a different `modelName`.
* How does Forge test a new version of the custom model handler?
    * As much effort as possible will be made to ensure correctness of the agent in our dev environment.
    * Rolling deploy of the agent/handler with careful monitoring.
    * Potentially will require help from the customer.
* Estimate of outbound bandwidth: ~270GB/day (to be confirmed).
* How are "API customers" handled?
    * option 1: like regular documents, API documents are available in Snowflake before custom scores become available
    * option 2: the entire API document, including custom scores, becomes available at once
    * In the first version, option 1 will be supported.

## High-Level Project Plan

1. (High-Level) Design Document
2. Modify syndication to only include a control record (i.e. `docuuid`)
    1. Syndication work
    2. S3 work: store record
    3. sync-agent: API to fetch documents
3. Custom Model Agent
4. Custom Score Loader
5. Implement a mock service (see [Acceptance criteria]; a minimal sketch is included below)
6. Review PDOC documentation for 06.60
7. DevOps
    1. Dev
    2. Monitoring
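For reference, here is a minimal sketch (Python/Flask) of the document-scope "section-count" mock endpoint described in the acceptance criteria. The URL layout and the PDOC field name (`sections`) are assumptions for illustration; the actual field names should be taken from the PDOC documentation.

```python
# Sketch only: the "/test/section-count" route and the "sections" PDOC field
# are hypothetical placeholders.
import gzip
import json
import time

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/test/section-count", methods=["PUT"])
def section_count():
    body = request.get_data()
    if request.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)              # support compressed queries
    pdoc = json.loads(body.decode("utf-8"))

    # Hypothetical PDOC layout: count the entries of a top-level "sections" list.
    count = len(pdoc.get("sections", []))

    return jsonify({
        "version": "1.0",
        "timestamp": str(int(time.time())),
        "uuid": pdoc.get("uuid", ""),             # echo the document uuid
        "scoreType": "test",
        "modelName": "section-count",
        "scope": "document",
        "versions": [{
            "modelVersion": "0",
            "scores": [{"score": str(count)}],
        }],
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

The other mock scopes (sentence-count, entity-count, label-length, instance-count), as well as the error and timeout endpoints, would follow the same pattern with different routes and response payloads.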