# [Investigate what occurs when Reader is refreshed](https://github.com/department-of-veterans-affairs/caseflow/issues/14298#) #14298
###### tags: `Echo`
There are 2 pages for Caseflow Reader:
* Document List page - shows a table of documents for the Veteran
* Document View page - shows one document at a time
To help diagnose problems, the backend calls made by each page are described below, followed by a list of recent known and resolved problems.
## Document List page
Upon load or page refresh, the [Document List page](https://github.com/department-of-veterans-affairs/caseflow/wiki/Caseflow-Reader#document-list) makes 2 requests to the backend:
1. `Reader::AppealController#show` returns info about the current appeal.
2. `Reader::DocumentsController#index` returns an object with the following:
    * `documents`: a list of document records and associated document tags (aka "Issue Tags" in Reader). To get the documents, the controller calls `appeal.document_fetcher.`**find_or_create_documents!**
        * In this case, `document_fetcher` uses `EFolderService` for AMA and Legacy appeals (see [Integrations](https://github.com/department-of-veterans-affairs/caseflow/wiki/Caseflow-Reader#integrations)). Upon `document_fetcher` initialization, it
            * retrieves document metadata from eFolder Express with **EFolderService.fetch_documents_for(appeal, user)** (next section has further descriptions)
            * then sets `manifest_vbms_fetched_at` and `manifest_vva_fetched_at` (which are also sent back to the frontend)
        * Upon **find_or_create_documents!** being called, it ensures versions of `Document`s can be tracked by `series_id` as follows:
            * it calls `DocumentSeriesIdAssigner` to [ensure all known `Document`s have a `series_id`](https://github.com/department-of-veterans-affairs/caseflow/blob/7e38376d2cda2c0d5c8ffe9a5097c54d89aa6d7c/app/services/document_series_id_assigner.rb#L23)
            * and merges fetched documents with known `Document` records or, if unknown, creates a new `Document` (copying annotations/comments from [previous doc with the same `series_id`](https://github.com/department-of-veterans-affairs/caseflow/blob/ef965424cc9d8670bac32193fa80870d5ee9fed4/app/services/document_fetcher.rb#L68))
    * `annotations` (aka document "Comments" in Reader)
        * also calls `appeal.document_fetcher.`**find_or_create_documents!**
    * `manifestVbmsFetchedAt`: timestamp indicating when documents were fetched from VBMS
    * `manifestVvaFetchedAt`: timestamp indicating when documents were fetched from VVA
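
The merge step described above can be illustrated with a minimal sketch. It is not the actual `DocumentFetcher` code: `Doc` and `merge_fetched_documents` are hypothetical names, plain Structs stand in for ActiveRecord models, and picking the last known record in a series as the "previous" version is an assumption.

```ruby
# Hypothetical stand-in for Caseflow's Document record (not the real model).
Doc = Struct.new(:vbms_document_id, :series_id, :annotations, keyword_init: true)

# Merge a freshly fetched document list into the known records.
# known:   Array<Doc> already stored (may include older versions)
# fetched: Array<Doc> just returned by eFolder (no annotations yet)
def merge_fetched_documents(known, fetched)
  known_by_version = known.to_h { |d| [d.vbms_document_id, d] }
  known_by_series = known.group_by(&:series_id)

  fetched.map do |doc|
    existing = known_by_version[doc.vbms_document_id]
    next existing if existing # this exact version is already known: reuse it

    # Unknown version: create a record, copying comments from a previous
    # version in the same series if there is one (assumption: the last one).
    previous = known_by_series.fetch(doc.series_id, []).last
    Doc.new(
      vbms_document_id: doc.vbms_document_id,
      series_id: doc.series_id,
      annotations: previous ? previous.annotations.dup : []
    )
  end
end

known = [Doc.new(vbms_document_id: "v1", series_id: "s1", annotations: ["check date"])]
fetched = [
  Doc.new(vbms_document_id: "v2", series_id: "s1", annotations: []), # new version of s1
  Doc.new(vbms_document_id: "v9", series_id: "s2", annotations: [])  # brand-new document
]
merged = merge_fetched_documents(known, fetched)
merged.each { |d| puts "#{d.vbms_document_id} (#{d.series_id}): #{d.annotations.inspect}" }
```

In this sketch the new version `v2` inherits the comment from `v1`, while `v9` starts with none. The older `v1` record is left in `known`, which is why stored record counts and displayed document counts can drift apart.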
### `series_id` and `vbms_document_id`
* A specific version of a document is referenced by its `vbms_document_id`.
* All versions of the same document have the same `series_id`.
* So there may be `Document` records that represent older versions of documents that are (correctly) not presented in the UI. As a result, `Document.where(file_number: vet.file_number).count` does not necessarily equal the `documents.size` returned from `Reader::DocumentsController`.
From [Reader's VBMS integration](https://github.com/department-of-veterans-affairs/caseflow/wiki/Caseflow-Reader#vbms):
> Each document has a `series_id` and a `version_id` (unfortunately we refer to `version_id` as `vbms_document_id` in most of the code). In VBMS a document may be uploaded with multiple versions. Each version of the document gets its own `version_id`, but will have the same `series_id`. Whenever we see a new document with the same `series_id` as an existing document, we copy over all the metadata (comments, tags, etc.) we'd associated with that first document.
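
To make the count mismatch concrete, here is a small sketch using plain Ruby objects. `StoredDoc` and the sample IDs are hypothetical, and it assumes eFolder returns only the latest version of each series.

```ruby
# Hypothetical stand-in for rows in Caseflow's documents table.
StoredDoc = Struct.new(:series_id, :vbms_document_id)

stored = [
  StoredDoc.new("series-A", "version-1"), # superseded by version-2
  StoredDoc.new("series-A", "version-2"), # current version
  StoredDoc.new("series-B", "version-3")
]

# Version IDs currently returned by eFolder (assumed: latest version per series).
current_version_ids = ["version-2", "version-3"]

total_records = stored.size
distinct_series = stored.map(&:series_id).uniq.size
stale = stored.reject { |d| current_version_ids.include?(d.vbms_document_id) }

puts "records: #{total_records}, series: #{distinct_series}"
puts "stale versions: #{stale.map(&:vbms_document_id).inspect}"
```

Three stored records map to only two series, so a raw record count overstates what the UI shows by the number of stale versions.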
### `EfolderService.fetch_documents_for(appeal, user)`
`EfolderService` is a client for the eFolder Express service (aka Caseflow eFolder). `EfolderService.fetch_documents_for` is used by Reader to download documents from VBMS and VVA.
* First it sends a POST request to `/api/v2/manifests` (see [Reader access to VBMS](https://github.com/department-of-veterans-affairs/caseflow/wiki/Caseflow-Reader#vbms))
    * In response to the POST request, eFolder Express (specifically `Api::V2::ManifestsController#start`) creates a `Manifest` (and a corresponding `FilesDownload` per current_user) for the Veteran. A `Manifest` typically has 2 `ManifestSource`s -- one for each of VBMS and VVA.
        * [Schema diagram of relevant eFolder Express classes](https://dbdiagram.io/d/5ed6741c39d18f555300202a)
    * It starts to retrieve documents for each `ManifestSource` using a high_priority `V2::DownloadManifestJob` parameterized by the `current_user`. `V2::DownloadManifestJob` does the following:
        * uses `ManifestFetcher` to [fetch a *list of documents* for all the "file numbers" known for the veteran using BGS info](https://github.com/department-of-veterans-affairs/caseflow-efolder/blob/7061bde7c9a2f919db122314f5f7e94f6d35cfb4/app/services/manifest_fetcher.rb#L23). The actual document-list fetching is done by calling `v2_fetch_documents_for(file_number)` on `VBMSService` and `VVAService`. A `DocumentCreator` is used to [delete and recreate all `Record`s associated with the `manifest_source`, after applying `DocumentFilter`](https://github.com/department-of-veterans-affairs/caseflow-efolder/blob/5edb1df749fd3bae0bf601ae8738ba3bae524ebb/app/services/document_creator.rb#L11).
        * then it starts a low_priority `V2::SaveFilesInS3Job` to retrieve the documents' *contents* and store them as files in S3: [`manifest_source.records.each(&:fetch!)`](https://github.com/department-of-veterans-affairs/caseflow-efolder/blob/a25a268ad641829addc890401583f2d5ee2dca8f/app/jobs/v2/save_files_in_s3_job.rb#L7)
            * A `Record` corresponds to a `Document` to be retrieved by `RecordFetcher`, which will [fetch the contents from S3 before trying VBMS/VVA](https://github.com/department-of-veterans-affairs/caseflow-efolder/blob/5b925d7ecba59a0f636c68d77ae975a2083cb45b/app/services/record_fetcher.rb#L15), convert images to PDF files if needed, and save to S3.
                * If conversion to PDF fails, the image is saved to S3 (however Reader can only show PDFs) and no alert is logged. **Should investigate a solution and at least log the error when `record.conversion_status==conversion_failed`.**
* Once all documents for the appeal are fetched, `EfolderService` sends a GET request to `/api/v2/manifests/#{manifest_id}` to return the retrieved documents.
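
The POST-then-GET sequence above can be sketched as a simple polling loop. This is only an illustration under stated assumptions: `FakeEfolderClient`, `fetch_manifest`, the `"pending"`/`"success"` status values, and the response shape are all hypothetical, and no network calls are made.

```ruby
# Fake client standing in for HTTP calls to eFolder Express (hypothetical).
class FakeEfolderClient
  def initialize(polls_until_done)
    @remaining = polls_until_done
  end

  # Simulates POST /api/v2/manifests
  def post_manifest(file_number)
    { id: 42, file_number: file_number, sources: sources("pending"), records: [] }
  end

  # Simulates GET /api/v2/manifests/:id
  def get_manifest(id)
    @remaining -= 1
    done = @remaining <= 0
    { id: id,
      sources: sources(done ? "success" : "pending"),
      records: done ? [{ id: 1 }, { id: 2 }] : [] }
  end

  private

  def sources(status)
    [{ name: "VBMS", status: status }, { name: "VVA", status: status }]
  end
end

# Start the manifest, then poll until no source is pending or attempts run out.
def fetch_manifest(client, file_number, max_attempts: 5, wait: 0)
  manifest = client.post_manifest(file_number)
  attempts = 0
  while manifest[:sources].any? { |s| s[:status] == "pending" }
    raise "manifest #{manifest[:id]} still pending after #{max_attempts} attempts" if attempts >= max_attempts

    sleep(wait)
    manifest = client.get_manifest(manifest[:id])
    attempts += 1
  end
  manifest
end

manifest = fetch_manifest(FakeEfolderClient.new(2), "123456789")
puts "records returned: #{manifest[:records].size}"
```

Bounding the number of attempts keeps a slow VBMS or VVA source from blocking the page load indefinitely; what the caller does on timeout is left open here.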
## Document View page
From [Reader's Document View](https://github.com/department-of-veterans-affairs/caseflow/wiki/Caseflow-Reader#document-view):
> [The frontend] makes calls directly to the `/api/v2/records/:id` endpoint on eFolder Express to retrieve the content of a document. [...] the document contents should already be cached in S3.
1. With each document shown to the user, `DocumentController#pdf` is called for the current, next, and previous documents. (Note this is not the same `Reader::DocumentsController` used for the Document List page above.)
    * It serves up the [pdf file from directory `/tmp/pdfs/`](https://github.com/department-of-veterans-affairs/caseflow/blob/43562ed9e75b14c6f949802521da9df6a4214c2b/app/models/document.rb#L154). The pdf could come from [3 places](https://github.com/department-of-veterans-affairs/caseflow/blob/43562ed9e75b14c6f949802521da9df6a4214c2b/app/models/document.rb#L122):
        > Currently three levels of caching. Try to serve content
        > from memory, then look to S3 if it's not in memory, and
        > if it's not in S3 grab it from VBMS
        > Log where we get the file from for now for easy verification
        > of S3 integration.
    * So if the document is not in S3 and comes from VVA, then Reader won't be able to show it. **Should investigate a solution.**
    * Can check in Rails logs for ["File #{vbms_document_id} fetched from VBMS"](https://github.com/department-of-veterans-affairs/caseflow/blob/43562ed9e75b14c6f949802521da9df6a4214c2b/app/models/document.rb#L130)
2. `DocumentController#mark_as_read` updates `DocumentView` records to capture when the user views the document
3. `Reader::DocumentsController#show` sets up the page content
4. `Reader::AppealController#show` returns info about the current appeal
5. `Metrics::V1::HistogramController#create` sends a histogram to DataDog about `pdf_page_render_time_in_ms` but values seem to always be 0: `[{"group":"front_end","name":"pdf_page_render_time_in_ms","value":0,"app_name":"Reader","attrs":{"overscan":6,"document_type":"VA Memo","page_count":4}}, ...]`
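
The three-level lookup quoted in step 1 can be sketched as follows. This is a simplified illustration, not the code in `document.rb`: `TieredDocumentStore` is a hypothetical name, a Hash stands in for the first tier and for the S3 bucket, a lambda stands in for the VBMS fetch, and writing VBMS results back to S3 is an assumption.

```ruby
# Hypothetical three-tier lookup: memory, then S3, then VBMS.
class TieredDocumentStore
  attr_reader :log

  def initialize(s3:, vbms:)
    @memory = {}
    @s3 = s3     # Hash standing in for an S3 bucket
    @vbms = vbms # callable standing in for a VBMS fetch
    @log = []    # records which tier served each request
  end

  def fetch(document_id)
    if @memory.key?(document_id)
      @log << "#{document_id}: memory"
      return @memory[document_id]
    end

    content = @s3[document_id]
    if content
      @log << "#{document_id}: S3"
    else
      content = @vbms.call(document_id)
      @s3[document_id] = content # assumption: cache in S3 for later requests
      @log << "#{document_id}: VBMS"
    end
    @memory[document_id] = content
  end
end

s3 = { "doc-1" => "pdf bytes from S3" }
store = TieredDocumentStore.new(s3: s3, vbms: ->(id) { "pdf bytes for #{id} from VBMS" })
store.fetch("doc-1") # served from S3, then kept in memory
store.fetch("doc-1") # served from memory
store.fetch("doc-2") # falls through to VBMS
puts store.log
```

Note that the last tier only knows about VBMS, which mirrors the gap described above for VVA-sourced documents that are missing from S3.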
### Documents cached in S3
Reader pulls document files from S3, if they're available.
[A RetrieveDocumentsForReaderJob caches documents in S3](https://github.com/department-of-veterans-affairs/caseflow/wiki/Caseflow-Reader#eagerly-caching-documents-in-s3):
* According to [serverless.yml](https://github.com/department-of-veterans-affairs/appeals-lambdas/blob/master/async-jobs-trigger/serverless.yml#L92), this job runs every 5 minutes for [active Reader users](https://github.com/department-of-veterans-affairs/caseflow/blob/master/app/queries/batch_users_for_reader_query.rb#L7).
* This job chooses up to 5 users who (1) logged in within the last week and (2) have either never used eFolder to fetch documents or have not done so within the last day.
* For the Legacy and AMA appeals these users are assigned to, the job calls [`appeal.document_fetcher.`**find_or_create_documents!**](https://github.com/department-of-veterans-affairs/caseflow/blob/a45cada3ddf04c1356f055c23afb5e54c4bdd7ca/app/workflows/fetch_documents_for_reader_job.rb#L32) -- same as on Reader's Document List page.
**Concerns**:
* 5 minutes may be too frequent. Could the same 5 users be chosen by consecutive jobs if the first job is still processing? Since `efolder_documents_fetched_at` is not set until a job finishes, if the first job takes longer than 5 minutes (e.g., 1000+ documents) then the next job would pick the same users. **Should investigate improvements to this.**
* How often is S3 used compared to document retrievals from VBMS/VVA? The intent of the job is to retrieve preferably all documents from S3. **Should measure how well this job is achieving this intent and improve it, while considering S3 file auto-deletions.**
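
The overlap concern can be illustrated with a small simulation. It is a sketch only: `ReaderUser` and `pick_users` are hypothetical and approximate the selection rule described above, and setting the timestamp when a user is claimed is just one possible mitigation, not something the codebase is known to do.

```ruby
# Hypothetical stand-in for the user fields the selection relies on.
ReaderUser = Struct.new(:id, :last_login_at, :efolder_documents_fetched_at, keyword_init: true)

DAY = 24 * 60 * 60 # seconds

# Up to `limit` users who logged in within the last week and whose documents
# were never fetched or were last fetched more than a day ago.
def pick_users(users, now:, limit: 5)
  users.select do |u|
    u.last_login_at >= now - 7 * DAY &&
      (u.efolder_documents_fetched_at.nil? || u.efolder_documents_fetched_at < now - DAY)
  end.first(limit)
end

now = Time.utc(2020, 6, 2, 12, 0, 0)
users = (1..8).map do |i|
  ReaderUser.new(id: i, last_login_at: now - DAY, efolder_documents_fetched_at: nil)
end

first_run = pick_users(users, now: now)
# The next run starts 5 minutes later while the first is still working,
# so no timestamps have been updated yet.
second_run = pick_users(users, now: now + 5 * 60)
overlap = first_run.map(&:id) & second_run.map(&:id)
puts "overlap: #{overlap.inspect}"

# Possible mitigation: stamp users when they are claimed, not when the job ends.
first_run.each { |u| u.efolder_documents_fetched_at = now }
third_run = pick_users(users, now: now + 5 * 60)
puts "after claiming: #{third_run.map(&:id).inspect}"
```

With the timestamp written only at the end of a job, both runs pick the same five users; stamping at claim time lets the next run move on to other users.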
#### When are these files auto-deleted in S3?
Asked Tango: [Slack convo](https://dsva.slack.com/archives/C3EAF3Q15/p1591118799353200)
Some digging reveals this:
```ruby
bucket = Caseflow::S3Service.init!
client = Aws::S3::Client.new
resp = client.get_bucket_lifecycle({ bucket: bucket.name })
pp resp.rules.pluck(:id, :prefix)
# => [["delete form 8s after 5 days", "form_8 "],
#     ["delete documents after 5 days", "documents"]]
```
The oldest file in the S3 `documents` folder is 5 days old ([AWS S3 web UI shows folder contents](https://console.amazonaws-us-gov.com/s3/buckets/dsva-appeals-caseflow-prod/?region=us-gov-west-1&tab=overview)), so Reader documents are indeed deleted after 5 days.
## Doc counts in Reader
In the Reader UI, document counts are displayed to the user. They can be reproduced in a Rails console as follows:
```ruby
appeal = Appeal.find_by(uuid: ...)
docs = Document.where(file_number: appeal.veteran_file_number)
# Document List page
page1resp = ExternalApi::EfolderService.document_count(appeal.veteran_file_number, user)
# Document View page
page2resp = ExternalApi::EfolderService.fetch_documents_for(appeal, user)
page2resp[:documents].size
```
These document counts can change over time. For example,
* 2 `Document` records were created and retrieved but are no longer retrievable by eFolder, possibly because new versions are available.
* eFolder has 1 new `Record` that Reader doesn't yet know about, possibly because a new document was uploaded to VBMS/VVA.
* The net change in document count is only 1 (2 gone, 1 added), but there are 3 differences. **Should investigate a better way to track documents.**
Some code for further investigation:
```ruby
docs = Document.where(file_number: appeal.veteran_file_number)
vbms_idsD = docs.pluck(:vbms_document_id)
df = appeal.document_fetcher # takes many seconds to complete
df.number_of_documents
df.documents.group_by { |d| d.upload_date.beginning_of_day }.map { |k, v| [k, v.size] }.sort
df.documents.group_by { |d| d.received_at.beginning_of_day }.map { |k, v| [k, v.size] }.sort
vbms_idsR = df.documents.pluck(:vbms_document_id)
# Known to Reader but no longer returned by eFolder
vbms_idsD - vbms_idsR
# => ["{2605FFFC-C9C7-4EF8-BAB8-E1042CB7A92F}",
#     "{50FB8137-8D01-431E-B71D-55F8A6BC7F09}"]
# Returned by eFolder but not yet known to Reader
vbms_idsR - vbms_idsD
# => ["{2B507FFF-0CF2-41DA-92A4-0394D3BBF52A}"]
```
## Doc counts in Queue page
Document counts are shown in the table on some Queue pages.
`AppealsController#document_count` provides these document counts. It calls `EFolderService.document_count(appeal.veteran_file_number, current_user)`, which:
* checks Rails.cache `"Efolder-document-count-#{file_number}"`
* checks Rails.cache `"Efolder-document-count-bgjob-#{file_number}"` (`expires_in: 15.minutes`)
* starts background `FetchEfolderDocumentCountJob`, which checks Rails.cache `"Efolder-document-count-#{file_number}"` (`expires_in: 4.hours`) and sends GET request to `/api/v2/document_counts`
    * In response, eFolder Express `api/v2/document_counts#index` checks its cache `veteran-doc-count-#{file_number}` (`expires_in: 2.hours`) and responds with `DocumentCounter.new(veteran_file_number: file_number)`
        * which calls `v2_fetch_documents_for(file_number)` for both VBMSService and VVAService (same as `ManifestFetcher` mentioned in the context of Reader's Document List page), and then returns a set of `document_ids`, which is counted.
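
The layered cache checks on the Caseflow side can be sketched as below. This is an approximation under stated assumptions: `TtlCache` is a hypothetical in-memory stand-in for `Rails.cache`, `document_count` here is a free function rather than the real `EFolderService` method, and returning `nil` while the count is unavailable is an assumption.

```ruby
# Minimal TTL cache standing in for Rails.cache (hypothetical, in-memory only).
class TtlCache
  def initialize(clock)
    @clock = clock # callable returning the current time in seconds
    @store = {}
  end

  def write(key, value, expires_in:)
    @store[key] = [value, @clock.call + expires_in]
  end

  def read(key)
    value, expires_at = @store[key]
    return nil if expires_at.nil? || @clock.call >= expires_at

    value
  end
end

# Return the cached count, or enqueue a background fetch at most once per 15 minutes.
def document_count(cache, file_number, fetch_job:)
  count = cache.read("Efolder-document-count-#{file_number}")
  return count if count

  job_key = "Efolder-document-count-bgjob-#{file_number}"
  unless cache.read(job_key)
    cache.write(job_key, true, expires_in: 15 * 60) # avoid duplicate jobs
    fetch_job.call(file_number)
  end
  nil # assumption: no count is available until the job has finished
end

now = 0
cache = TtlCache.new(-> { now })
queue = []
enqueue = ->(file_number) { queue << file_number }

first = document_count(cache, "123", fetch_job: enqueue)  # enqueues a job
second = document_count(cache, "123", fetch_job: enqueue) # job key blocks a duplicate
cache.write("Efolder-document-count-123", 426, expires_in: 4 * 60 * 60) # what the job would store
third = document_count(cache, "123", fetch_job: enqueue)
now += 4 * 60 * 60 # both cache entries have expired
fourth = document_count(cache, "123", fetch_job: enqueue)

p [first, second, third, fourth]
puts "jobs enqueued: #{queue.size}"
```

Because the Caseflow-side count can be cached for up to 4 hours and the eFolder-side count for another 2, a count shown in Queue may lag behind a freshly loaded Reader document list.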
## Known Problems
1. The PDF version of a TIFF from VVA is not shown because the TIFF (not the PDF) is in S3 and cannot be immediately retrieved like a VBMS-sourced file. [#14193](https://github.com/department-of-veterans-affairs/caseflow/issues/14193)
    * Which documents fail conversion?
2. Why do document counts change over time? e.g., 421 + 5 more: [#14289](https://github.com/department-of-veterans-affairs/caseflow/issues/14289)
    * Why 425 vs 426? 2 gone + 1 added; VBMS's response changes over time
    * 6/2/2020: Now 440. `docs.pluck(:series_id).uniq.size => 440`
    * Added details at [14081 investigation Part-3](https://hackmd.io/9DYl3EwdTqKCpALVbIWnQg#Part-3)
    * Need to better synchronize documents with VBMS/VVA.
3. Is the same job submitted within the same time span? "active user" check and limited to 5 at a time
4. [Should no longer be a problem] Document count numbers are not the same in Queue and Reader ([Related resolved ticket due to VBMS pagination](https://github.com/department-of-veterans-affairs/caseflow/issues/13261))
## To diagnose
* Which documents (VBMS vs VVA) are retrieved by a call to ...
* `appeal.document_fetcher.`**find_or_create_documents!** is called by
    * `RetrieveDocumentsForReaderJob` for 5 active users
    * `Reader::DocumentsController#index` on the Document List page