# A version from Versionista looks like:
###### tags: 'scratchpad'
```json
{
"capture_time": "2018-12-07T21:08:38.000Z",
"uri": "https://edgi-wm-versionista.s3.amazonaws.com/versionista2/74273-6221689/version-18702657.html",
"version_hash": "a85dfa67eea08d56a665d39fa6758db0fc0dc0af3777d85f6902e73a50b5577f",
"title": "EERE Success Story—The Navy Saves Energy in its Buildings With EERE Expertise | Department of Energy",
"page_url": "https://energy.gov/eere/success-stories/articles/eere-success-story-navy-saves-energy-its-buildings-eere-expertise",
"status": 200,
"source_type": "versionista",
"source_metadata": {
// How to get this data from the original source if you need it later
// The particular field names and values vary by source (they're differently
// named for Wayback, for example); they're just whatever would be necessary to do that.
"account": "versionista2",
"url": "https://versionista.com/74273/6221689/18702657/",
"page_id": "6221689",
"site_id": "74273",
"version_id": "18702657",
// whatever other extra data is useful that the source can provide...
// length & content_type will soon be promoted to top-level values outside
// source_metadata, but this is where it typically lives for now. See:
// https://github.com/edgi-govdata-archiving/web-monitoring-db/issues/199
"length": 45230,
"content_type": "text/html; charset=UTF-8",
// You don't really need `status` here anymore -- it's here for Versionista
// because we previously weren't able to get status codes from some sources
// and so didn't have a top-level status field outside `source_metadata` (see
// above). This is how we stored status codes before we had that.
"status": 200,
// For Wayback, these are the headers from the original, snapshotted response.
// It's only sort-of that for Versionista, but that's what we're trying to
// get at with this field.
"headers": {
"age": "0",
"date": "Fri, 07 Dec 2018 21:20:19 GMT",
"vary": "Accept-Encoding",
"expires": "Fri, 07 Dec 2018 21:20:19 GMT",
"x-cachee": "MISS",
"connection": "close",
"content-type": "text/html; charset=UTF-8",
"accept-ranges": "bytes",
"cache-control": "private, max-age=0",
"transfer-encoding": "chunked"
},
// We don't do anything with load_time, but it's new information that we
// are able to gather, so we record it.
"load_time": 1905,
// Similar for last_date -- this is new stuff that we are only recently
// able to get from Versionista and we don't use it in any meaningful
// way right now. It's the last time at which this version was seen
// (as opposed to `capture_time`, which is when it was *first* seen)
"last_date": "2018-12-07T21:08:38.000Z",
// If the page_url redirected to another URL, this is it. It's an array
// so that we have the same structure as other sources (i.e. Wayback)
// which provide the full redirect chain, which is much better.
"redirects": [
"https://www.energy.gov/eere/success-stories/articles/eere-success-story-navy-saves-energy-its-buildings-eere-expertise"
],
// WE DON'T CARE ABOUT this stuff, which is super specific to Versionista
"has_content": true,
"diff_length": 19711,
"diff_hash": "c5d3b3b21f7cbc81427418ee3f68b1a5d1cacd41ba90c22ea054314c6e6a19a0",
"diff_text_hash": "06cb905eb3dce9bb4a35c88a8cdba5e15305bb865223c89e5285556211c70f5c",
"diff_text_length": 176,
"diff_with_first_url": "https://versionista.com/74273/6221689/18702657:9445187/",
"diff_with_previous_url": "https://versionista.com/74273/6221689/18702657:18669493/"
}
}
```
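The example above is annotated with `//` comments, so a strict JSON parser will reject it as written. A minimal sketch of one way to load it, assuming every comment sits on its own line (as they all do above); `strip_line_comments` is a hypothetical helper, not part of any EDGI tooling, and the sample is a trimmed copy of the record:

```python
import json


def strip_line_comments(annotated: str) -> str:
    """Drop lines whose first non-blank characters are `//`.

    Only whole-line comments are handled, so `//` inside a string value
    (such as a URL) is left untouched.
    """
    kept = [
        line
        for line in annotated.splitlines()
        if not line.lstrip().startswith("//")
    ]
    return "\n".join(kept)


# Trimmed copy of the record above, comments included.
sample = """{
  "status": 200,
  "source_type": "versionista",
  "source_metadata": {
    // How to get this data from the original source if you need it later
    "account": "versionista2",
    "url": "https://versionista.com/74273/6221689/18702657/",
    "version_id": "18702657"
  }
}"""

version = json.loads(strip_line_comments(sample))
print(version["source_metadata"]["version_id"])
```

Trailing comments on the same line as a value would need a real JSON-with-comments parser; this sketch deliberately does not attempt that.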
`source_metadata` can contain anything (ideally everything shown above).

`capture_time`, `uri`, and `capture_url` are the only fields required for a version to be accepted into `web-monitoring-db`.
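As a rough illustration of the required-field rule, here is a small sketch that reports which required fields a record is missing. The field names are taken from the note (with `uri` lowercased to match the example); `REQUIRED_FIELDS` and `missing_required_fields` are hypothetical names, and this is not how `-db` itself validates imports. The sample record mirrors the example above, which has no `capture_url`.

```python
# Field names as listed in this note; an assumption, not the -db source of truth.
REQUIRED_FIELDS = ("capture_time", "uri", "capture_url")


def missing_required_fields(record: dict) -> list:
    """Return the required field names that are absent or empty in `record`."""
    return [
        name
        for name in REQUIRED_FIELDS
        if record.get(name) in (None, "")
    ]


record = {
    "capture_time": "2018-12-07T21:08:38.000Z",
    "uri": "https://edgi-wm-versionista.s3.amazonaws.com/versionista2/74273-6221689/version-18702657.html",
    "source_metadata": {},  # free-form, so it is never checked here
}

print(missing_required_fields(record))  # prints ['capture_url']
```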