# HORAE Use-case 1
[TOC]
## 1. Description
For every manuscript in the HORAE and HORPIC corpora, extract the HTR transcriptions of all pages (in manifest order) and of all lines (in the order they were extracted on the page).
Save the transcriptions to a .txt file, line by line: one file per manuscript, named after the manuscript.
## 2. Useful links
* Full API endpoints [documentation](https://arkindex.gitlab.io/api-client/)
* Arkindex [web interface](https://arkindex.teklia.com/browse)
## 3. Configure client
### Requirements
* Python 3.6
### Install client
#### Setup virtual environment
You may prefer to install the client in a virtual environment. If you would rather do a system-wide install, you can skip this section.
```shell
$ pip install --user virtualenvwrapper
$ mkdir arkindex_client && cd arkindex_client
$ mkvirtualenv -p /usr/bin/python3 -a . arkindex_client
```
> Anaconda: [Setup environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands)
#### Install arkindex-client module
You can install the [arkindex client](https://gitlab.com/arkindex/api-client) with [pip](https://pip.pypa.io/en/stable/)
```shell
$ pip3 install arkindex-client
```
> Anaconda: [Install modules](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-pkgs.html#installing-non-conda-packages)
If you prefer, you can install the project from source
```shell
$ git clone https://gitlab.com/arkindex/api-client.git
$ cd api-client
$ pip3 install .
```
### Configure client
#### Login to arkindex service
To pass credentials to the Python client, you can use a token, which you can retrieve from the [Teklia API](https://arkindex.teklia.com/api/v1/user/login/)

#### Test client
Open a Python prompt, import the `arkindex` module and configure the client with the retrieved token
```python
>>> from arkindex import ArkindexClient
>>> client = ArkindexClient(
...     base_url='https://arkindex.teklia.com/api/v1',
...     token='my_secret_token'
... )
```
To check the configuration, you can call the user endpoint and inspect the result
```python
>>> client.request('RetrieveUser')
{'id': 42,
'email': 'me@horae.com',
'verified_email': True,
'is_admin': False,
'auth_token': 'my_secret_token'
}
```
#### Use environment variable
You can export two variables in your shell
```sh
$ export ARKINDEX_API_TOKEN=my_secret_token
$ export ARKINDEX_API_URL="https://arkindex.teklia.com/api/v1"
```
Then you can configure the client directly from the environment
```python
>>> from arkindex import ArkindexClient, options_from_env
>>> client = ArkindexClient(**options_from_env())
>>> client.request('RetrieveUser')
{'id': 42,
'email': 'me@horae.com',
'verified_email': True,
'is_admin': False,
'auth_token': 'my_secret_token'
}
```
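If one of the two variables is unset, the configuration step above will not behave as expected. The following is a minimal sketch, assuming only the two variable names exported above, that reports which ones are missing before the client is built. `missing_arkindex_variables` is a hypothetical helper written for this guide, not part of arkindex-client.

```python
import os

# The two variable names exported in the shell snippet above
REQUIRED_VARIABLES = ("ARKINDEX_API_TOKEN", "ARKINDEX_API_URL")


def missing_arkindex_variables(environ=os.environ):
    """Return the names of the required variables that are unset or empty.

    Hypothetical helper: it only inspects a mapping, it never contacts the API.
    """
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_arkindex_variables()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Both variables are set")
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching your shell.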
## 4. Use client to retrieve transcriptions
### Extract HTR on all manuscripts
> You may want to use a web browser to see the [HORAE elements](https://arkindex.teklia.com/browse?corpus=ad53fca4-3082-4382-bf17-19eb17d2b83e&type=&name=&display=table) you are manipulating. Manuscripts are of type **volume**.

#### Retrieve HORAE corpus information
The endpoint that lists corpora is [ListCorpus](https://arkindex.gitlab.io/api-client/#operation/ListCorpus)
```python
>>> corpora = client.request('ListCorpus')
```
The response contains details about accessible corpora
```python
>>> horae_corpus = next(filter(lambda corpus: corpus['name'] == 'HORAE', corpora))
>>> horae_corpus
{'id': 'ad53fca4-3082-4382-bf17-19eb17d2b83e',
'name': 'HORAE',
'description': 'Collection of Books of Hours'
[...]
```
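The description also mentions the HORPIC corpus, and `next(filter(...))` raises `StopIteration` when no corpus matches. As a minimal sketch, assuming the response is an iterable of dicts with a `name` key as shown above, the lookup can be done for several names at once. `find_corpora` is a hypothetical helper and the sample data is made up.

```python
def find_corpora(corpora, names):
    """Map each requested name to the first corpus carrying it, or None."""
    found = {name: None for name in names}
    for corpus in corpora:
        name = corpus["name"]
        # Keep only the first match for each requested name
        if name in found and found[name] is None:
            found[name] = corpus
    return found


# Made-up sample shaped like the ListCorpus response shown above
sample = [
    {"id": "aaa", "name": "HORAE"},
    {"id": "bbb", "name": "Other"},
]
result = find_corpora(sample, ["HORAE", "HORPIC"])
print(result["HORAE"]["id"])
print(result["HORPIC"])
```

A `None` value tells you the corpus is either missing or not accessible with your token, without interrupting the script.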
NB: It is also possible to retrieve the details of a corpus directly from its id
```python
>>> client.request('RetrieveCorpus', id="ad53fca4-3082-4382-bf17-19eb17d2b83e")
{'id': 'ad53fca4-3082-4382-bf17-19eb17d2b83e',
'name': 'HORAE',
'description': 'Collection of Books of hours'
[...]
```
#### Retrieve manuscripts
We can use the [ListElements](https://arkindex.gitlab.io/api-client/#operation/ListElements) endpoint to fetch the list of manuscripts.
This endpoint is paginated because it generally returns a lot of elements: each response contains a limited number of elements, plus a link to the next page with the following elements.
The **client.paginate()** method handles pagination and returns an [iterator](https://docs.python.org/3.6/glossary.html#term-iterator) object that supports **len()**
NB: **len()** returns the total number of elements without requesting every page
```python
>>> manuscripts = client.paginate(
...     'ListElements',
...     corpus=horae_corpus['id'],
...     type='volume'
... )
>>> len(manuscripts) # Assert manuscript count is 500
500
```
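To make the pagination behaviour more concrete, here is a toy sketch of a lazy paginator. It is not how arkindex-client is implemented: `ToyPaginator`, `fake_fetch` and the `count` / `results` / `next` response keys are assumptions chosen for illustration, and no network call is made.

```python
class ToyPaginator:
    """Iterate lazily over paged results.

    fetch_page(n) must return {'count': int, 'results': list, 'next': int or None}.
    """

    def __init__(self, fetch_page):
        self.fetch_page = fetch_page
        self.requests = 0      # number of pages fetched so far
        self.next_page = 1     # page to fetch next, None when exhausted
        self.buffer = []
        self.count = 0
        self._load()           # the first page already carries the total count

    def _load(self):
        response = self.fetch_page(self.next_page)
        self.requests += 1
        self.count = response["count"]
        self.buffer.extend(response["results"])
        self.next_page = response["next"]

    def __len__(self):
        return self.count

    def __iter__(self):
        return self

    def __next__(self):
        # Fetch further pages only when the local buffer runs dry
        while not self.buffer:
            if self.next_page is None:
                raise StopIteration
            self._load()
        return self.buffer.pop(0)


ITEMS = ["volume-{}".format(i) for i in range(5)]
PAGE_SIZE = 2


def fake_fetch(page_number):
    """Stand-in for an HTTP request: slice ITEMS into pages of PAGE_SIZE."""
    start = (page_number - 1) * PAGE_SIZE
    has_more = start + PAGE_SIZE < len(ITEMS)
    return {
        "count": len(ITEMS),
        "results": ITEMS[start:start + PAGE_SIZE],
        "next": page_number + 1 if has_more else None,
    }


paginator = ToyPaginator(fake_fetch)
print(len(paginator), paginator.requests)  # total known after a single request
print(list(paginator), paginator.requests)  # remaining pages fetched on demand
```

The point of the design is that the total is available from the first response, while the other pages are requested only as iteration reaches them.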
#### Retrieve a page on the first manuscript
To get an idea of the results, we can run it on a single page of the first manuscript. The children of a manuscript can be retrieved with the [ListElementChildren](https://arkindex.gitlab.io/api-client/#operation/ListElementChildren) endpoint and a `page` type filter
```python
>>> first_manuscript = next(manuscripts)
>>> pages_list = list(client.paginate(
...     'ListElementChildren',
...     id=first_manuscript['id'],
...     type='page'
... ))
>>> page = pages_list[6]
>>> page
{'id': '47e2a641-0f65-489c-beea-dd815019ff65',
'type': 'page',
'name': '6',
[...]
```
#### Extract page and line HTR of a page
We can extract the HTR transcriptions of a page, of type **page** or **line**, with the paginated [ListTranscriptions](https://arkindex.gitlab.io/api-client/#operation/ListTranscriptions) endpoint
```python
>>> page_id = page['id']
>>> transcriptions = list(client.paginate(
...     'ListTranscriptions',
...     id=page_id,
...     type='line'
... ))
>>> len(transcriptions)
21
>>> transcriptions[0]
{'id': '11856db0-b490-4353-85af-582bacf6010b',
'type': 'line',
'text': 'qui',
'score': 0.59,
[...]
```
From this point, you can also get the page transcription by using the `type='page'` filter
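Once the line transcriptions are fetched, turning them into plain text is a matter of joining their `text` fields in the order received. The sketch below assumes each transcription is a dict with the `text` and `score` keys shown above. `lines_to_text` and its `min_score` threshold are hypothetical, and the sample values are made up.

```python
def lines_to_text(transcriptions, min_score=0.0):
    """Join transcription texts, one per output line, in the order received.

    min_score is a hypothetical threshold: entries scoring below it are skipped.
    Entries with an empty text are skipped as well.
    """
    kept = [
        tr["text"]
        for tr in transcriptions
        if tr.get("text") and tr.get("score", 0.0) >= min_score
    ]
    return "\n".join(kept)


# Made-up sample mimicking the 'text' and 'score' fields shown above
sample = [
    {"text": "qui", "score": 0.59},
    {"text": "habitat", "score": 0.91},
    {"text": "", "score": 0.80},
]
print(lines_to_text(sample))
print(lines_to_text(sample, min_score=0.6))
```

Leaving `min_score` at its default keeps every non-empty line, which matches the use-case description.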
## 5. To go further
#### Iterate over pages for all manuscripts
NB: Pages are returned in the order of the manuscript's manifest
```python
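>>> def extract_page_htr(page_id):
...     """Return the page-level transcription (or None) and all line transcriptions."""
...     return {
...         'page': next(
...             client.paginate(
...                 'ListTranscriptions',
...                 id=page_id,
...                 type='page'
...             ), None
...         ),
...         'lines': list(
...             client.paginate(
...                 'ListTranscriptions',
...                 id=page_id,
...                 type='line'
...             )
...         )
...     }
...
>>> # `manuscripts` is an iterator: recreate it with client.paginate()
>>> # if some elements were already consumed
>>> for manuscript in manuscripts:
...     # Start from an empty dict so each manuscript only holds its own pages
...     transcriptions = {}
...     pages = client.paginate(
...         'ListElementChildren',
...         id=manuscript['id'],
...         type='page'
...     )
...     for page in pages:
...         transcriptions[page['id']] = extract_page_htr(page['id'])
...     # save(manuscript, transcriptions)
...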
```
#### Save results
Transcriptions are stored in a dict whose keys are page ids.
It is possible to dump the transcriptions of each manuscript as JSON
```python
>>> from pathlib import Path
>>> import json
>>> FOLDER_PATH = Path('./results/')
>>> def save(manuscript, transcriptions, folder=FOLDER_PATH):
...     # Create the output folder on first use
...     folder.mkdir(parents=True, exist_ok=True)
...     filepath = folder / manuscript['name']
...     assert not filepath.exists(), \
...         "File '{}' already exists".format(filepath)
...     with filepath.open('w') as f:
...         # Dump the whole dict (page id -> transcriptions) as one JSON document
...         json.dump(transcriptions, f, indent=2)
...
```
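The use-case description asks for a plain .txt file per manuscript, written line by line. A minimal sketch of that variant follows, assuming the dict has the shape built by `extract_page_htr` (page id mapped to a dict with a `lines` list whose items carry a `text` key). `save_as_text` is a hypothetical helper, and the demo data is made up and written to a temporary folder.

```python
import tempfile
from pathlib import Path


def save_as_text(manuscript_name, transcriptions, folder=Path("./results")):
    """Write every line transcription of a manuscript to <folder>/<name>.txt."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    filepath = folder / "{}.txt".format(manuscript_name)
    with filepath.open("w", encoding="utf-8") as f:
        # Dicts keep insertion order on CPython 3.6+, so pages are written
        # in the order they were added to the dict
        for page in transcriptions.values():
            for line in page["lines"]:
                f.write(line["text"] + "\n")
    return filepath


# Made-up data shaped like the dict built in the loop above
demo = {
    "page-1": {"page": None, "lines": [{"text": "qui"}, {"text": "habitat"}]},
    "page-2": {"page": None, "lines": [{"text": "in"}]},
}
with tempfile.TemporaryDirectory() as tmp:
    written = save_as_text("demo_manuscript", demo, folder=tmp)
    print(written.name)
    print(written.read_text(encoding="utf-8"))
```

Unlike `save` above, this sketch overwrites an existing file; add an existence check if that is not what you want.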
## 6. Full script
An example script (relevant to use-case 1 description with logging and more features) is available [here](http://cloud.teklia.com/s/B48W6Mne2okgrbC)
Teklia, 20/11/2019