# HORAE Use-case 1

[TOC]

## 1. Description

For every manuscript in the HORAE and HORPIC corpora, extract the HTR transcriptions of all pages (in manifest order) and of all lines (in the order they were extracted on the page).

Save the transcriptions to a .txt file, line by line: one file per manuscript, named after the manuscript.

## 2. Useful links

* Full API endpoints [documentation](https://arkindex.gitlab.io/api-client/)
* Arkindex [web interface](https://arkindex.teklia.com/browse)

## 3. Configure client

### Requirements

* Python 3.6

### Install client

#### Setup virtual environment

You may prefer to install the client in a virtual environment. If you want a system-wide install, you can skip this section.

```shell
$ pip install --user virtualenvwrapper
$ mkdir arkindex_client && cd arkindex_client
$ mkvirtualenv -p /usr/bin/python3 -a . arkindex_client
```

> Anaconda: [Setup environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands)

#### Install arkindex-client module

You can install the [arkindex client](https://gitlab.com/arkindex/api-client) from [pip](https://pip.pypa.io/en/stable/):

```shell
$ pip3 install arkindex-client
```

> Anaconda: [Install modules](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-pkgs.html#installing-non-conda-packages)

If you prefer, you can install the project from source:

```shell
$ git clone https://gitlab.com/arkindex/api-client.git
$ cd api-client
$ pip3 install .
```

### Configure client

#### Login to arkindex service

To pass credentials to the Python client, use a token, which you can retrieve from the [Teklia API](https://arkindex.teklia.com/api/v1/user/login/).

![](https://i.imgur.com/mMOICCV.png "Retrieve Arkindex credentials")

#### Test client

Open a Python prompt, then import and configure the Arkindex client with the retrieved token:

```python
>>> from arkindex import ArkindexClient
>>> client = ArkindexClient(
...     base_url='https://arkindex.teklia.com/api/v1',
...     token='my_secret_token'
... )
```

To check the configuration, call the user endpoint and inspect the result:

```python
>>> client.request('RetrieveUser')
{'id': 42,
 'email': 'me@horae.com',
 'verified_email': True,
 'is_admin': False,
 'auth_token': 'my_secret_token'
}
```

#### Use environment variable

You can export two variables in your shell:

```sh
$ export ARKINDEX_API_TOKEN=my_secret_token
$ export ARKINDEX_API_URL="https://arkindex.teklia.com/api/v1"
```

Then you can configure the client directly from the environment:

```python
>>> from arkindex import ArkindexClient, options_from_env
>>> client = ArkindexClient(**options_from_env())
>>> client.request('RetrieveUser')
{'id': 42,
 'email': 'me@horae.com',
 'verified_email': True,
 'is_admin': False,
 'auth_token': 'my_secret_token'
}
```

## 4. Use client to retrieve transcriptions

### Extract HTR on all manuscripts

> You may want to use a web browser to see the [HORAE elements](https://arkindex.teklia.com/browse?corpus=ad53fca4-3082-4382-bf17-19eb17d2b83e&type=&name=&display=table) you are manipulating. Manuscripts are of type **volume**.

![](https://i.imgur.com/ZWePgBh.png "Horae Volumes")

#### Retrieve HORAE corpus information

The endpoint that lists corpora is [ListCorpus](https://arkindex.gitlab.io/api-client/#operation/ListCorpus):

```python
>>> corpora = client.request('ListCorpus')
```

The response contains details about the accessible corpora:

```python
>>> horae_corpus = next(filter(lambda corpus: corpus['name'] == 'HORAE', corpora))
>>> horae_corpus
{'id': 'ad53fca4-3082-4382-bf17-19eb17d2b83e',
 'name': 'HORAE',
 'description': 'Collection of Books of Hours'
 [...]
```

NB: The details of a corpus can also be retrieved directly from its id:

```python
>>> client.request('RetrieveCorpus', id="ad53fca4-3082-4382-bf17-19eb17d2b83e")
{'id': 'ad53fca4-3082-4382-bf17-19eb17d2b83e',
 'name': 'HORAE',
 'description': 'Collection of Books of hours'
 [...]
```

#### Retrieve manuscripts

We may use the [ListElements](https://arkindex.gitlab.io/api-client/#operation/ListElements) endpoint to fetch the list of manuscripts.
This endpoint is paginated because it generally returns many elements: each response contains a limited number of elements, with a link to the next page containing the following ones.

The **client.paginate()** method handles pagination and returns an [iterator](https://docs.python.org/3.6/glossary.html#term-iterator) object that supports **len()**.

NB: **len()** returns the total number of elements without requesting all pages.

```python
>>> manuscripts = client.paginate(
...     'ListElements',
...     corpus=horae_corpus['id'],
...     type='volume'
... )
>>> len(manuscripts)  # Assert manuscript count is 500
500
```

#### Retrieve a page on the first manuscript

To get an idea of the results, we may first run the extraction on a single page of the first manuscript.
Manuscript children can be retrieved with the [ListElementChildren](https://arkindex.gitlab.io/api-client/#operation/ListElementChildren) endpoint, using a `page` type filter:

```python
>>> first_manuscript = next(manuscripts)
>>> pages_list = list(client.paginate(
...     'ListElementChildren',
...     id=first_manuscript['id'],
...     type='page'
... ))
>>> page = pages_list[6]
>>> page
{'id': '47e2a641-0f65-489c-beea-dd815019ff65',
 'type': 'page',
 'name': '6',
 [...]
```

#### Extract page and line HTR of a page

Transcriptions of type **page** and **line** can be retrieved for a page with the paginated [ListTranscriptions](https://arkindex.gitlab.io/api-client/#operation/ListTranscriptions) endpoint:

```python
>>> page_id = page['id']
>>> transcriptions = list(client.paginate(
...     'ListTranscriptions',
...     id=page_id,
...     type='line'
... ))
>>> len(transcriptions)
21
>>> transcriptions[0]
{'id': '11856db0-b490-4353-85af-582bacf6010b',
 'type': 'line',
 'text': 'qui',
 'score': 0.59,
 [...]
```

From here, you can also get the page-level transcription by using the `type='page'` filter.

## 5. To go further

#### Iterate over pages for all manuscripts

NB: Pages are automatically sorted in manifest order.

```python
>>> def extract_page_htr(page_id):
...     return {
...         # First page-level transcription, or None if there is none
...         'page': next(
...             client.paginate(
...                 'ListTranscriptions',
...                 id=page_id,
...                 type='page'
...             ), None
...         ),
...         'lines': list(
...             client.paginate(
...                 'ListTranscriptions',
...                 id=page_id,
...                 type='line'
...             )
...         )
...     }
...
>>> for manuscript in manuscripts:
...     # Start from an empty dict for each manuscript
...     transcriptions = {}
...     pages = client.paginate(
...         'ListElementChildren',
...         id=manuscript['id'],
...         type='page'
...     )
...     for page in pages:
...         transcriptions[page['id']] = extract_page_htr(page['id'])
...     # save(manuscript, transcriptions)  # see "Save results"
```

#### Save results

Transcriptions are stored in a dict whose keys are page ids.

It is possible to dump the transcriptions of each manuscript as JSON:

```python
>>> from pathlib import Path
>>> import json
>>> FOLDER_PATH = Path('./results/')
>>> def save(manuscript, transcriptions, folder=FOLDER_PATH):
...     folder.mkdir(parents=True, exist_ok=True)
...     filepath = folder / manuscript['name']
...     assert not filepath.exists(), \
...         "File '{}' already exists".format(filepath)
...     # Dump the whole {page id: transcriptions} dict as one JSON document
...     with filepath.open('w') as f:
...         json.dump(transcriptions, f, indent=2)
```

## 6. Full script

An example script (covering the use-case 1 description, with logging and more features) is available [here](http://cloud.teklia.com/s/B48W6Mne2okgrbC).

Teklia, 20/11/2019
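
NB: The description in section 1 asks for one .txt file per manuscript with one transcription per line, while "Save results" dumps JSON. A minimal sketch of the plain-text variant follows. It assumes each line transcription is a dict with a `text` key, as in the sample output in section 4. `write_manuscript_txt` and its parameters are hypothetical names used for illustration; they are not part of arkindex-client.

```python
from pathlib import Path


def write_manuscript_txt(name, pages, folder=Path('./results')):
    """Write one .txt file for a manuscript, one line transcription per row.

    `pages` is an iterable of pages in manifest order; each page is a list
    of line transcription dicts (with a 'text' key) in extraction order.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # File is identified by the manuscript name (hypothetical naming scheme)
    filepath = folder / '{}.txt'.format(name)
    with filepath.open('w', encoding='utf-8') as f:
        for lines in pages:
            for line in lines:
                f.write(line['text'] + '\n')
    return filepath
```

With the loop from section 5, `pages` could be built as `(t['lines'] for t in transcriptions.values())`. Dicts keep insertion order from Python 3.7 onwards (on 3.6 this is only a CPython implementation detail), so an explicit list of pages may be the safer choice there.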