# Forge.AI Data
### Overview
Forge collects publicly available information in real time and transforms it into a resource suitable for enriching our customers' knowledge bases, improving automated decision-making and reasoning processes, and making the world of open source communications available to a wide assortment of machine learning and artificial intelligence applications. The source information is collected from national, local, and industry-specific news sites, organizations' web sites, various US and international sites such as the SEC, social media, and private feeds. It is transformed in real time into a structured resource that is made available to our customers either in a streaming format or through a SQL-enabled cloud data warehouse.
There are three types of data generated by Forge:
1. Data that corresponds directly with a published document (e.g. news story, SEC filing, tweet, etc.)
2. Calculated information derived from a published document, such as automated LDA topic vectors representing the source information
3. Aggregate information derived by looking at a large collection of the corpus of processed data
This document describes how the data is managed in the cloud data warehouse. It is intended to orient you to the nature of the data managed in the data warehouse, its structure, and how to query the data and integrate it into your operations. The real-time streaming data solution has a conceptually similar structure; however, its representation in XML or JSON and its delivery mechanism using HTTP/2 Push over HTTPS are different and are discussed in a separate document.
### Forge Document Processing
Before digging into the structure of Forge's data, a very short description of how the data is processed will be helpful in understanding the subsequent material.
##### High Level Picture
##### Domain Specific Ontologies
* Corporate Structure
* People and Corporate Relationships
* Business Concepts
* Marketing Concepts
* Risk (Operational Risk, Credit Risk, Market Risk)
* Science (Physics, Chemistry, Biology, etc.)
* Military and National Security
* Events
### Data Description
This section describes the tables in which an extracted document is stored. All of this information is available in a single XML/JSON file for customers choosing to receive the streaming data. After the tables are identified and described, sample SQL and expected output are provided to better illustrate the data.
**DOCUMENTS**
The DOCUMENTS table identifies each document that Forge has processed and that you have access to. It is the "anchor" for all other document-specific information, including:
* The document summary and text
* Meta information about the document
* Document topic information
* Entities and domain concepts
* Relationships expressed in the document
* LDA topic model vectors representing the document
* Events discussed in this document
* Pointers to similar documents
The DOCUMENTS.DOCID, a GUID, is the key that joins these different elements together.
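
The join on DOCID can be sketched as follows. This is only an illustration that uses an in-memory SQLite database as a stand-in for the data warehouse. The `ENTITIES.DOCID` and `ENTITIES.NAME` column names and the two entity rows are assumptions made up for the example; only the table names and the role of DOCID come from this document.

```python
import sqlite3

# In-memory stand-in for the warehouse; the real schema has many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table DOCUMENTS (DOCID text primary key, TITLE text);
    -- Assumed columns: ENTITIES.DOCID and ENTITIES.NAME are hypothetical.
    create table ENTITIES (DOCID text, NAME text);
""")

docid = "08c1096f-7a44-472c-8b15-91d0833a5ba0"
conn.execute(
    "insert into DOCUMENTS values (?, ?)",
    (docid, "Successful USFDA inspection of Asymchem Dunhua 1 API manufacturing facility"),
)
# Illustrative entity rows, not real Forge output.
conn.executemany(
    "insert into ENTITIES values (?, ?)",
    [(docid, "Asymchem"), (docid, "USFDA")],
)

# DOCID is the join key between the anchor table and a detail table.
rows = conn.execute(
    """
    select d.TITLE, e.NAME
    from DOCUMENTS d
    join ENTITIES e on e.DOCID = d.DOCID
    where d.DOCID = ?
    order by e.NAME
    """,
    (docid,),
).fetchall()

entity_names = [name for _title, name in rows]
print(entity_names)
```

The same pattern (join a detail table to DOCUMENTS on DOCID) should apply to the other document-specific tables, although their actual column names may differ from the ones assumed here.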
**DOCUMENTMETADATA**
Document metadata is information provided by the document publisher that describes the document. It may include authorship information, additional publisher information, publisher-supplied summaries, etc. Forge does not alter the publisher-supplied metadata, but may augment it with additional data such as company stock symbols, exchange information, etc.
The remaining tables are:
* DOCUMENTTOPICS
* DOCUMENTTOPICMODELVECTORS
* DOCUMENTTOPICWORDS
* DOCUMENTSOURCE
* ENTITIES
* SIMILARDOCS
* SECTIONMETADATA
* RELATIONSHIPSUPPORTINGSENTENCES
* RELATIONSHIPPREDICATEASSERTIONS
* RELATIONSHIPS
* TOPICWORDS
* TOPICS
* TOPICMODELVECTORS
* SENTENCES
* SOURCETEXTS
* SECTIONS
* ENTITYCOMPONENTS
* ENTITYPROPERTIES
* ENTITYLOCATIONS
* ENTITYASSERTIONS
### Data Structure Details
### Sample Data Access via Snowflake
Let's start by asking a simple question:
> How many new documents have been published by the different sites we monitor over the past 24 hours?
The SQL for this, using a fixed 24-hour window for illustration, would be:
```
-- Count documents per site for one 24-hour window and keep the 10 busiest sites.
select
    REGEXP_SUBSTR(URL, '((https?):\/\/)?([^\/]+)') as site,  -- scheme and host part of the URL
    count(*) as cnt
from
    DOCUMENTS
where
    URL like 'http%'
    and DOCDATETIME between '2019-05-30 12:00:00' and '2019-05-31 12:00:00'
group by
    site
order by
    cnt desc
limit 10;
```
| Site | Count |
| -------- | -------- |
|https://seekingalpha.com|509|
|https://www.prnewswire.com|461|
|https://www.businesswire.com|318|
|https://www.sec.gov|263|
|https://www.reuters.com|167|
|https://www.cnn.com|129|
|https://www.oilandgas360.com|110|
|https://www.marketwatch.com|94|
|https://www.cnet.com|80|
|https://www.npr.org|51|
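
To make the site extraction concrete, here is a minimal Python sketch that applies the same regular expression to a few URLs taken from the sample rows in this document and counts documents per site. The helper names `site_of` and `count_by_site` are hypothetical, and Python's `re` module only approximates the behavior of `REGEXP_SUBSTR` in the warehouse.

```python
import re
from collections import Counter

# Same pattern as in the SQL: an optional scheme followed by everything up to the first "/".
SITE_PATTERN = re.compile(r"((https?)://)?([^/]+)")


def site_of(url):
    """Return the scheme-and-host prefix of a URL, or None if nothing matches."""
    match = SITE_PATTERN.search(url)
    return match.group(0) if match else None


def count_by_site(urls):
    """Count URLs per site, mirroring the "URL like 'http%'" filter and the group by."""
    return Counter(site_of(u) for u in urls if u.startswith("http"))


sample_urls = [
    "https://www.marketwatch.com/story/why-atlantic-hurricane-season-should-make-commodities-traders-nervous-2019-05-31?siteid=rss&rss=1",
    "https://www.marketwatch.com/articles/dow-jones-industrial-average-trump-mexico-tariff-threat-51559319062?mod=mw_hpm",
    "https://www.prnewswire.com/news-releases/successful-usfda-inspection-of-asymchem-dunhua-1-api-manufacturing-facility-300859877.html",
]

site_counts = count_by_site(sample_urls)
for site, cnt in site_counts.most_common(10):  # like "order by cnt desc limit 10"
    print(site, cnt)
```

Because the pattern stops at the first `/` after the scheme, query strings and paths never affect the grouping key.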
Next, let's explore these documents by getting the summary information for the first 10 documents published by MarketWatch in the following 24-hour window (the first five rows are shown below).
```
select
    DOCID, TITLE, URL, DOCDATETIME
from
    DOCUMENTS
where
    URL like 'https://www.marketwatch.com%'
    and DOCDATETIME between '2019-05-31 12:00:00' and '2019-06-01 12:00:00'
order by
    DOCDATETIME asc
limit 10;
```
| DOCID | TITLE | URL | DOCDATETIME |
|:-------------------------------------|:----------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------|
| d09411d2-8bfa-4c11-b6e8-9cd70892c154 | Stocks off sharply as Trump threatens tariffs on Mexico, U.S.-China trade tensions continue - MarketWatch | https://www.marketwatch.com/story/stocks-off-sharply-as-trump-threatens-tariffs-on-mexico-us-china-trade-tensions-continue-2019-05-31?link=MW_latest_news | 2019-05-31 12:00:00 |
| 7d51ddb2-c4b8-45b8-a28c-7f056050d18c | Why Atlantic hurricane season should make commodities traders nervous - MarketWatch | https://www.marketwatch.com/story/why-atlantic-hurricane-season-should-make-commodities-traders-nervous-2019-05-31?siteid=rss&rss=1 | 2019-05-31 12:11:00 |
| 9bff732a-e451-4f39-8061-7c86eda4b6c1 | The Dow Drops 245 Points Because Trump’s New Tariff Threat Hits Closer to Home | https://www.marketwatch.com/articles/dow-jones-industrial-average-trump-mexico-tariff-threat-51559319062?mod=mw_hpm | 2019-05-31 12:14:00 |
| b34706e3-2064-4feb-94c3-ff67da5a8b6b | China Is Planning a Corporate Blacklist in Latest Trade-War Move | https://www.marketwatch.com/articles/china-trade-tension-blacklist-companies-huawei-apple-microsoft-51559319114?mod=mw_hpm | 2019-05-31 12:15:00 |
| dd712dc1-82fe-4728-a03a-1767e9dae922 | 2-year Treasury plunge below 2% underlines bond investors’ expectations for rate cuts - MarketWatch | https://www.marketwatch.com/story/2-year-treasury-plunge-below-2-underlines-bond-investors-expectations-for-rate-cuts-2019-05-31?siteid=rss&rss=1 | 2019-05-31 12:18:00 |
Let's look at a single document more deeply by querying for it. Recall that all documents are uniquely identified by their DOCUMENTS.DOCID.
First, let's look at its row in the DOCUMENTS table itself.
|Name | Value |
|:----------------|:------|
| DOCID | 08c1096f-7a44-472c-8b15-91d0833a5ba0 |
| PARENTUUID | 311b7150-f0cf-474f-8389-5b02331afd69 |
| URL | https://www.prnewswire.com/news-releases/successful-usfda-inspection-of-asymchem-dunhua-1-api-manufacturing-facility-300859877.html |
| ORIGINALURL | |
| DOCTYPE | article |
| TITLE | Successful USFDA inspection of Asymchem Dunhua 1 API manufacturing facility |
| ORIGINALTITLE | |
| DOCDATETIME | 2019-05-31 12:05:00 |
| SUMMARY | TIANJIN, China, May 31, 2019 /PRNewswire/ -- Asymchem, a leading custom manufacturer of Intermediates and API's for the pharmaceutical industry, today announced that its Dunhua 1 site, a small molecule custom manufacturing API facility, successfully completed a U.S. Food and Drug Administration (USFDA) general GMP reinspection conducted between April 8-12, 2019. (Dunhua 1 site), established in 2007, manufactures APIs and API intermediates for the global pharma industry. This is the second successful inspection for the Dunhua 1 site, and 13th successful western agency audit in total across all Asymchem's 7 manufacturing sites. |
| COLLECTDATETIME | 2019-05-31 16:13:13 |
### Sample Data Access in Python
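The sketch below shows one way a single DOCUMENTS row might be fetched by DOCID from Python. It is written against the generic Python DB-API and, so that it runs anywhere, uses an in-memory SQLite database loaded with the sample row shown above as a stand-in for the warehouse. The function name `fetch_document` and its `placeholder` parameter are hypothetical. With the Snowflake Python connector, the connection would instead come from `snowflake.connector.connect(...)`, and the placeholder would likely need to be `%s`, which is that connector's default parameter style.

```python
import sqlite3


def fetch_document(conn, docid, placeholder="?"):
    """Return one DOCUMENTS row as a dict, or None if the DOCID is unknown.

    `conn` is any DB-API 2.0 connection. `placeholder` is the driver's
    parameter marker ("?" for sqlite3; assumed "%s" for Snowflake).
    """
    sql = (
        "select DOCID, TITLE, URL, DOCDATETIME "
        "from DOCUMENTS where DOCID = " + placeholder
    )
    cur = conn.cursor()
    cur.execute(sql, (docid,))
    row = cur.fetchone()
    if row is None:
        return None
    columns = [col[0] for col in cur.description]
    return dict(zip(columns, row))


# Stand-in database holding a subset of the DOCUMENTS columns.
demo = sqlite3.connect(":memory:")
demo.execute(
    "create table DOCUMENTS (DOCID text primary key, TITLE text, URL text, DOCDATETIME text)"
)
demo.execute(
    "insert into DOCUMENTS values (?, ?, ?, ?)",
    (
        "08c1096f-7a44-472c-8b15-91d0833a5ba0",
        "Successful USFDA inspection of Asymchem Dunhua 1 API manufacturing facility",
        "https://www.prnewswire.com/news-releases/successful-usfda-inspection-of-asymchem-dunhua-1-api-manufacturing-facility-300859877.html",
        "2019-05-31 12:05:00",
    ),
)

doc = fetch_document(demo, "08c1096f-7a44-472c-8b15-91d0833a5ba0")
print(doc["TITLE"], doc["DOCDATETIME"])
```

Passing the DOCID as a bound parameter rather than formatting it into the SQL string keeps the query safe to reuse with IDs that come from user input or from other query results.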