# Architecture Notes
Notes on the EZID application architecture, from the Python 3 revision onwards.
:::warning
**DRAFT STATUS**
These diagrams are undergoing active revision.
:::
## Context
The context diagram depicts the interaction of the EZID and N2T systems with external agents and systems.
---
## Abstract Resolver
General sequence of interaction between a user and a resolver service; the underlying question is "Given an identifier, how do I get the thing it identifies?":
```plantuml
actor user
user -> resolver: Where is thing?\nhttp://resolver.net/thing
activate user
note left
There is no standard URL pattern
for a resolver; common practice is
to provide the identifier value
as a path parameter.
end note
activate resolver
resolver -> resolver: match thing to rules
note right
Success typically returns HTTP 302
or 307, though it may be any 3xx code
depending on the implementation.
An error typically returns 404.
end note
resolver -> user: http://some.target/thing
deactivate resolver
user -> target: GET http://some.target/thing
activate target
note right
Note that the target may itself
redirect the user to another location.
end note
target -> user: thing
deactivate target
deactivate user
```
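As a concrete sketch of this exchange, the snippet below asks a resolver for an identifier without following redirects, then fetches the target itself. The resolver base URL and identifier are placeholders, not real endpoints.
```python
import requests

# Hypothetical resolver endpoint and identifier (placeholders for illustration).
RESOLVER = "https://resolver.example.org"
pid = "ark:/12345/x9abc"

# Ask the resolver where the thing lives, but do not follow the redirect yet.
resp = requests.get(f"{RESOLVER}/{pid}", allow_redirects=False)

if resp.status_code in (301, 302, 303, 307, 308):
    target = resp.headers["Location"]
    print("Resolver redirected to:", target)
    thing = requests.get(target)  # fetch the thing from its target location
    print(thing.status_code)
elif resp.status_code == 404:
    print("Resolver has no rule matching this identifier")
```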
Generalized parts of a resolver service:
```plantuml
!include https://raw.githubusercontent.com/datadavev/C4-PlantUML/master/C4_Component.puml
Person(user, "User")
System_Boundary(resolver, "Resolver") {
System(service, "Resolver")
ContainerDb(rules, "Rules")
}
System(target, "Target")
Rel_L(rules, service, "Inform")
Rel_R(service, user, "pid\ntarget", "http")
Rel(user, service, "resolve(pid)", "http")
Rel_R(user, target, "get(pid)", "http")
```
---
## N2T Resolver Context
```plantuml
!include https://raw.githubusercontent.com/datadavev/C4-PlantUML/master/C4_Component.puml
Person(user, "User")
System(target, "Target")
System_Boundary(resolver, "N2T Resolver") {
System(service, "Resolver","N2T HTTP\nresolver service")
ContainerDb(rules, "Rules", "Rules merged\nfrom several sources")
System(binder, "Binder","Adds entries for\nindividual identifiers")
Rel(binder, rules, "Identifiers")
}
ContainerDb_Ext(naans,"NAAN\nRegistry")
ContainerDb_Ext(idorg,"Other\nRules")
Rel(rules, naans, "Imports")
Rel(rules, idorg, "Imports")
Rel_L(rules, service, "Inform")
Rel(service, user, "2. target/pid")
Rel(user, service, "1. resolve(pid)")
Rel_R(user, target, "3. get(pid)")
System_Ext(ezid, "EZID")
Rel_L(ezid, binder, "Register\nIdentifiers")
```
* N2T maintains a list of resolver rules built from several sources
* Given an identifier, N2T:
  1. Finds the rule that matches the most leading characters of the identifier (longest-prefix match; see the sketch after this list).
  2. Redirects the user to the target via an HTTP redirect.
* Rules in N2T are continually updated through the binder service
* Access to the binding service requires authentication
* EZID is currently the only system using binder (sort of)
* N2T also supports "inflection", an operation that returns information about an identifier
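
A minimal sketch of the longest-prefix matching step, assuming a simple prefix-to-template rule table; the rule format and targets here are illustrative, not the actual N2T rule sources.
```python
from typing import Optional

# Illustrative rule table: identifier prefix -> redirect template ({id} is the full identifier).
RULES = {
    "ark:/12345/": "https://repo-a.example.org/{id}",
    "ark:/12345/fk4": "https://sandbox.example.org/{id}",
    "doi:10.5072/": "https://doi-test.example.org/{id}",
}

def match_rule(identifier: str) -> Optional[str]:
    """Return the redirect target for the rule matching the most leading characters."""
    best = None
    for prefix in RULES:
        if identifier.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return RULES[best].format(id=identifier) if best else None

print(match_rule("ark:/12345/fk4x9abc"))  # the longer "fk4" rule wins over "ark:/12345/"
```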
---
## EZID Context
The EZID system interacts with several external services, including N2T, DataCite, and CrossRef.
```plantuml
!include https://raw.githubusercontent.com/datadavev/C4-PlantUML/master/C4_Component.puml
Person(user, "User", "EZID User")
System_Boundary(ezid, "EZID"){
System(ezidapp, "EZID", "EZID Application")
SystemDb(eziddb, "Data Store", "RDS Datastore")
}
System_Ext(n2t, "n2t", "Name to Thing\nresolver")
System_Ext(datacite, "DataCite", "DataCite DOI service")
System_Ext(crossref, "CrossRef", "Crossref DOI service")
ContainerDb_Ext(naans,"NAAN\nRegistry")
ContainerDb_Ext(idorg,"Other\nRules")
Rel(user, ezidapp, "Create, Update, Search", "HTTP")
Rel(ezidapp, user, "Notify", "Email")
Rel(ezidapp, n2t, "Register ARK metadata", "HTTP")
Rel(ezidapp, datacite, "Register DataCite metadata", "HTTP")
Rel(ezidapp, crossref, "Register CrossRef metadata", "HTTP")
BiRel_R(ezidapp, eziddb, "CRUD")
Rel(n2t, naans, "Imports")
Rel(n2t, idorg, "Imports")
Rel(user, n2t, "Resolve", "HTTP")
Rel(ezidapp, n2t, "bind", "HTTP")
SHOW_LEGEND()
```
* Registered EZID users have one or more prefixes ("shoulders"); minted identifiers append characters to a shoulder to form globally unique identifiers.
* Each prefix is associated with a minter.
* "Minters" are state machines that emit strings guaranteed to be unique within that minter (a minimal sketch follows this list).
* EZID currently sends minted identifiers to N2T, where the N2T rules are updated to support resolution.
* EZID does not currently support identifier resolution.
* Resolve capability for EZID is close to completion.
* Once completed, binding with N2T will no longer be needed
* N2T will then point to EZID for resolving those NAANs
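
A minimal sketch of the minter contract, assuming only that a minter is persistent state plus a function that never repeats a string; the real EZID minters use the noid algorithm rather than a plain counter.
```python
class CounterMinter:
    """Toy minter: a shoulder plus a counter.

    This only illustrates the contract (each call to mint() returns a string
    this minter has never returned before); the real minters use the noid
    algorithm and persist their state rather than keeping it in memory.
    """

    def __init__(self, shoulder: str, start: int = 0):
        self.shoulder = shoulder  # e.g. "ark:/99999/fk4"
        self.counter = start      # stands in for persisted minter state

    def mint(self) -> str:
        self.counter += 1
        return f"{self.shoulder}{self.counter:06d}"

minter = CounterMinter("ark:/99999/fk4")
print(minter.mint())  # ark:/99999/fk4000001
print(minter.mint())  # ark:/99999/fk4000002
```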
---
## Container
Major functional elements of the EZID application. Not shown are the background processes that perform asynchronous operations: registering identifiers with N2T, DataCite, and CrossRef; updating the search index; evaluating links; and supporting batch downloads.
```plantuml
!include https://raw.githubusercontent.com/datadavev/C4-PlantUML/master/C4_Container.puml
Person(user, "User", "EZID User")
Person(admin, "Admin", "EZID Administrator")
Container_Boundary(aws, "EZID, AWS Hosted"){
Container_Boundary(ezid, "EZID EC2"){
Container(ezidui, "Web App", "HTML, Javascript, Python", "Web service UI and API")
Container(ezidapp, "EZID", "Django, Python", "EZID application")
Container(ezid_man, "Management", "Bash, Python", "Commandline management operations")
ContainerDb(ezid_s, "Minters", "BDB")
}
ContainerDb(eziddb,"RDS Read/Write", "AWS RDS MySQL")
Container(lb,"Load Balancer","HTTP")
}
Rel(user, lb, "CRU", "HTTP")
Rel(lb, ezidui, "CRU")
Rel(admin, lb, "Manage", "HTTP")
Rel(lb, ezidui, "Manage")
Rel(ezidui, ezidapp, "API Calls", "HTTP")
Rel(admin, ezid_man, "Manage EZID system", "SSH")
Rel(ezid_man, ezidapp, "Manage", "CLI")
Rel_L(ezidapp, ezid_s, "Minter state")
BiRel_R(ezidapp, eziddb, "CRUDLS")
BiRel_R(ezid_man, eziddb, "CRUDLS")
SHOW_LEGEND()
```
* The "Minters" DB is currently external to RDS, and exists as a set of Berkeley DB databases.
* Moving minter state into RDS is an important step for scalability (a hypothetical sketch follows).
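
A hypothetical sketch of minter state held in RDS: one row per shoulder, updated under a row lock so concurrent workers cannot step on each other. The model, field names, and `advance_state()` step function are assumptions, not the actual EZID schema or noid implementation.
```python
from django.db import models, transaction

class MinterState(models.Model):
    """Hypothetical table holding one row of minter state per shoulder."""
    shoulder = models.CharField(max_length=255, unique=True)
    state = models.JSONField(default=dict)  # opaque serialized minter state

def advance_state(shoulder, state):
    # Placeholder for the real noid step function; here just a counter.
    n = state.get("n", 0) + 1
    return f"{shoulder}{n:06d}", {"n": n}

def mint_in_rds(shoulder: str) -> str:
    """Mint one identifier while holding a row lock on the minter state."""
    with transaction.atomic():
        row = MinterState.objects.select_for_update().get(shoulder=shoulder)
        identifier, row.state = advance_state(shoulder, row.state)
        row.save()
    return identifier
```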
---
## Component
### EZID Container
Identifies the components involved in the EZID container and broadly indicates the interactions between them.
```plantuml
!include https://raw.githubusercontent.com/datadavev/C4-PlantUML/master/C4_Component.puml
AddElementTag("v3.1", $fontColor="#000000", $borderColor="#44B055", $bgColor="#CDEAD0")
AddElementTag("v3.2", $fontColor="#000000", $borderColor="#E99276", $bgColor="#FCF0E6")
Component(api, "API", "Django")
Component(app, "EZID", "Django")
BiRel(api, app, "All user operations")
ComponentDb(rds, "RDS", "MySQL")
BiRel(app, rds, "CRUDLS")
ComponentDb(minters, "Minters", "BDB", $tags="v3.2")
Rel(app, minters, "Create", "impl.nog.minter.mint_id")
ComponentQueue(updateq, "Update Q", "impl.backproc.py")
Rel(app, updateq, "Identifier CUD", "DB UpdateQueue", "Create\nUpdate\nDelete")
Rel(updateq, rds, "Update")
ComponentQueue(binderq, "Binder Q", "impl.binder_async.py")
Rel(updateq, binderq, "create", "")
Component(noidegg, "noid_egg", "noid_egg.py", $tags="v3.1")
Rel(binderq, noidegg, "Async Calls")
Component_Ext(n2t, "n2t", "HTTP")
BiRel(noidegg, n2t, "State")
ComponentQueue(downloadq, "Download Q", "impl.download.py")
Rel(app, downloadq, "Batch Download", "DB download_queue", "Download")
Component_Ext(email, "Email", "SMTP")
Rel(downloadq, email, "Email bundle")
ComponentQueue(dataciteq, "DataCite Q", "MySQL, Python")
Rel(updateq, dataciteq, "Update DataCite")
Component(datacitepy, "datacite.py", "Python", "modernize HTTP", $tags="v3.2")
Rel(dataciteq, datacitepy, "async")
Component_Ext(datacite, "DataCite", "HTTPS")
Rel(datacitepy, datacite, "Send metadata")
ComponentQueue(crossrefq, "Crossref Q", "MySQL, Python")
Rel(updateq, crossrefq, "Update Crossref")
Component(crossrefpy, "crossref.py", "Python", "modernize HTTP", $tags="v3.2")
Rel(crossrefq, crossrefpy, "async")
Component_Ext(crossref, "Crossref", "HTTPS")
Rel(crossrefpy, crossref, "Send metadata")
Component(resolve, "Resolve", "Python", $tags="v3.1")
Rel(app, resolve, "Resolve")
Rel(resolve, rds, "Read")
SHOW_LEGEND()
```
---
**Misc notes below; may be out of date.**
```
Person(admin, "Admin", "EZID Administrator")
Rel(admin, ezidapp, "Manage", "HTTP, SSH")
Rel(ezidapp, admin, "Notify", "Email")
Rel(naan, admin, "ARK Shoulder", "OOB")
Rel(datacite, admin, "DOI Shoulder", "OOB")
Rel(crossref, admin, "DOI Shoulder", "OOB")
```
## Download Operation
```plantuml
actor Alice
Alice -> API: Batch Download
API -> download: enqueue_request
download -> download_queue: payload
activate download_queue
download -> API: ACK
API -> Alice: ACK
thread -> download_queue: objects.all()[0]
download_queue -> thread: task
note right: These steps are called separately\nin the thread loop. They can all\nbe coded in a single method called\nonce with no loss of functionality.
thread -> thread: CREATE: _createFile
thread -> thread: HARVEST: _harvest
thread -> thread: COMPRESS: _compressFile
thread -> thread: DELETE: _deleteUncompressedFile
thread -> thread: MOVE: _moveCompressedFile
thread -> Mail: MAIL: _notifyRequestor: msg
thread -> download_queue: delete task
deactivate download_queue
Mail -> Alice: message
```
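Following the note in the diagram, the separate per-state steps could be collapsed into one pass. The sketch below strings the helpers named in the diagram together in order; their exact signatures and the task plumbing are assumptions, not the actual impl.download API.
```python
def process_download_task(task):
    """One-pass version of the download worker loop sketched above.

    The helper names mirror the diagram; the signatures used here (and the
    task/mail plumbing) are assumptions, not the real impl.download code.
    """
    path = _createFile(task)              # CREATE
    _harvest(task, path)                  # HARVEST: write the matching records
    archive = _compressFile(path)         # COMPRESS
    _deleteUncompressedFile(path)         # DELETE
    url = _moveCompressedFile(archive)    # MOVE: publish to the download area
    _notifyRequestor(task, url)           # MAIL: tell the requestor where it is
    task.delete()                         # remove the finished task from download_queue
```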
## Mint Identifier
Create, Part 1: store the pending identifier, enqueue a task, and respond to the user.
```plantuml
actor Alice
Alice -> API: Mint Identifier
API -> ezid: mintIdentifier
ezid -> identifierLock: _acquireIdentifierLock
note right: Why is this\nlocked twice?
activate identifierLock
ezid -> ezid: _mintIdentifier
ezid -> ezid: createIdentifier
ezid -> identifierLock: _acquireIdentifierLock
activate identifierLock
ezid -> models.store_identifier: StoreIdentifier.save
note right #aqua: BLOB is written here
ezid -> update_queue: enqueue (SI, "create")
activate update_queue
deactivate identifierLock
ezid -> API: ACK
deactivate identifierLock
API -> Alice: ACK
note right of update_queue: Worker thread in\nbackproc.py
```
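The mint, update, and delete flows all follow the same acquire-lock / save / enqueue / ACK shape. A minimal sketch of that pattern, with in-memory stand-ins for EZID's per-identifier locks and the database UpdateQueue:
```python
import threading

_locks = {}                   # per-identifier locks (stand-in for _acquireIdentifierLock)
_locks_guard = threading.Lock()
update_queue = []             # stand-in for the database UpdateQueue table

def _acquire_identifier_lock(identifier):
    with _locks_guard:
        lock = _locks.setdefault(identifier, threading.Lock())
    lock.acquire()
    return lock

def create_identifier(identifier, metadata, store):
    """Store the pending identifier, enqueue a task for the background worker,
    and acknowledge the caller; `store` stands in for StoreIdentifier."""
    lock = _acquire_identifier_lock(identifier)
    try:
        store.save(identifier, metadata)             # BLOB is written here
        update_queue.append((identifier, "create"))  # picked up by the backproc.py worker
    finally:
        lock.release()
    return "success: " + identifier                  # ACK back through the API
```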
Create, Part 2: process the task in the update queue. This is shown for a DataCite DOI; CrossRef DOIs follow the same process, using the CrossRef queue instead.
```plantuml
activate update_queue
update_queue -> update_queue: get task
note right: _backprocDaemon\nbackproc.py
update_queue -> update_queue: metadata = toLegacy()
update_queue -> update_queue: BLOB = blobify(metadata)
update_queue -> models.search_identifier: updateFromLegacy (id, "create", metadata, BLOB)
note right #aqua: BLOB is written here
alt is DOI?
update_queue -> datacite_async: enqueueIdentifier (id, "create", BLOB)
activate datacite_async
note right: _workerThread in\nregister_async.py
par Async execution
datacite_async -> datacite.py: uploadMetadata
activate datacite.py
participant DataCite #white
datacite.py -> DataCite: HTTP
DataCite -> datacite.py: ACK
deactivate datacite.py
datacite_async -> datacite.py: setTargetUrl
activate datacite.py
datacite_async -> datacite_async !!: workerThread\ndelete
deactivate datacite_async
datacite.py -> datacite.py: registerIdentifier
datacite.py -> DataCite: HTTP
DataCite -> datacite.py: ACK
deactivate datacite.py
end
end
note right: This block only applies to DOIs.\nIt is executed in parallel with the\noperation on binder_async
update_queue -> binder_async: enqueueIdentifier (id, "create", BLOB)
activate binder_async
update_queue -> update_queue !!: daemon\ndelete task
deactivate update_queue
binder_async -> binder_async: create
note right: _workerThread
binder_async -> noid_egg: setElements
binder_async -> binder_async !!: workerThread\ndelete
deactivate binder_async
activate noid_egg
noid_egg -> noid_egg: _setElements
noid_egg -> noid_egg: _issue
participant N2T #white
noid_egg -> N2T: HTTP
deactivate noid_egg
```
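The datacite, crossref, and binder queues share the same worker shape: take the oldest task, hand its payload to the external service, then delete the task. A generic sketch under that assumption (the queue and task methods here are stand-ins, and the real workers add retry and error handling):
```python
import time

def register_worker(queue, register_fn, poll_interval=5):
    """Generic _workerThread shape for the async register queues.

    `queue.oldest()`, `task.identifier`, `task.blob`, and `task.delete()` are
    stand-ins for the real queue models; `register_fn` stands in for e.g.
    the DataCite uploadMetadata + setTargetUrl calls.
    """
    while True:
        task = queue.oldest()
        if task is None:
            time.sleep(poll_interval)            # nothing queued; poll again later
            continue
        register_fn(task.identifier, task.blob)  # push the payload to the external service
        task.delete()                            # task is removed once the call succeeds
```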
---
### Metadata BLOBs
Progression of metadata from initial POST to appearance in the processing queues.
```plantuml
[*] --> metadata
note right: metadata dict is loaded from ANVL POST body
state "createIdentifier() in ezid.py" as ezid{
metadata --> StoreIdentifier: updateFromUntrustedLegacy()
StoreIdentifier --> [*]: Save()
note right: Metadata persisted\nto StoreIdentifier table
StoreIdentifier --> UpdateQueue: enqueue()
note right: StoreIdentifier instance is added\nas a value to the UpdateQueue
}
state "_daemonThread() in backproc.py" as daemonThread {
UpdateQueue --> UpdateModel
UpdateModel --> SI.metadata: SI.toLegacy()
SI.metadata --> BLOB: impl.util.blobify()
note left: The BLOB is a gzipped copy\nof the metadata converted to an\n"exchange" format using toExchange()
BLOB -> SearchIdentifier
SI.metadata -> SearchIdentifier
SearchIdentifier: BLOB (not used)
SearchIdentifier: metadata
BLOB --> binder_async: all
binder_async: BLOB
BLOB --> datacite_queue: if DataCite
datacite_queue: BLOB
BLOB --> crossref_queue: if CrossRef
SI.metadata --> crossref_queue: if CrossRef
crossref_queue: BLOB
crossref_queue: metadata (used only for owner)
UpdateModel --> [*]
note right: deleted after populating\nsearch and other queues
}
binder_async --> [*]: impl.util.deblobify()
datacite_queue --> [*]: impl.util.deblobify()
crossref_queue --> [*]: impl.util.deblobify()
note left: BLOBs are deblobified prior\nto any subsequent processing\nin any of these queues
```
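A minimal sketch of the blobify/deblobify round trip, assuming only "gzip of a serialized metadata dict". The real impl.util.blobify() uses the toExchange() encoding rather than the JSON used here to keep the sketch self-contained.
```python
import gzip
import json

def blobify(metadata: dict) -> bytes:
    """Gzip a serialized copy of the metadata dict (JSON here, exchange format in EZID)."""
    return gzip.compress(json.dumps(metadata).encode("utf-8"))

def deblobify(blob: bytes) -> dict:
    """Reverse of blobify(); the queues deblobify before any further processing."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

metadata = {"_profile": "erc", "erc.who": "Example Author", "_target": "https://example.org/x"}
blob = blobify(metadata)
assert deblobify(blob) == metadata
```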
## Update Identifier
```plantuml
actor Alice
Alice -> API: Update Metadata
API -> ezid: setMetadata
ezid -> identifierLock: _acquireIdentifierLock
activate identifierLock
ezid -> models.store_identifier: StoreIdentifier.save
note right #aqua: BLOB is written here
ezid -> update_queue: enqueue (SI, "update")
deactivate identifierLock
activate update_queue
ezid -> API: ACK
API -> Alice: ACK
note right of update_queue: Worker thread in\nbackproc.py
```
## Delete Identifier
```plantuml
actor Alice
Alice -> API: delete
API -> ezid: deleteIdentifier
ezid -> identifierLock: _acquireIdentifierLock
activate identifierLock
ezid -> models.store_identifier: StoreIdentifier.delete
note right #aqua: BLOB is deleted here
ezid -> update_queue: enqueue (SI, "delete")
deactivate identifierLock
activate update_queue
ezid -> API: ACK
API -> Alice: ACK
note right of update_queue: Worker thread in\nbackproc.py
```
## BLOBs
BLOBs are stored in the following tables:
- `ezidapp_storeidentifier`
- `ezidapp_searchidentifier`
- `ezidapp_binderqueue`
- `ezidapp_updatequeue`
- `ezidapp_crossrefqueue`
- `ezidapp_datacitequeue`

Each of the queue tables is ephemeral; entries are deleted once processed. The `searchidentifier` entries are created from `storeidentifier` records.