# Architecture Notes

Notes for EZID application architecture, Python 3 revision onwards.

:::warning
**DRAFT STATUS**

These diagrams are undergoing active revision.
:::

## Context

The context diagram depicts the interaction of the EZID and N2T systems with external agents and systems.

---

## Abstract Resolver

General sequence of a user's interaction with a resolver service; basically, "Given an identifier, how do I get the thing it identifies?":

```plantuml
actor user
user -> resolver: Where is thing?\nhttp://resolver.net/thing
activate user
note left
There is no standard URL for a resolver;
common practice is to provide the identifier
value as a path parameter.
end note
activate resolver
resolver -> resolver: match thing to rules
note right
Success will typically return HTTP 302 or 307,
though it may be any 300-level code depending on
the implementation. Errors will typically return 404.
end note
resolver -> user: http://some.target/thing
deactivate resolver
user -> target: GET http://some.target/thing
activate target
note right
Note that the target may itself redirect
the user to another location.
end note
target -> user: thing
deactivate target
deactivate user
```

Generalized parts of a resolver service:

```plantuml
!include https://raw.githubusercontent.com/datadavev/C4-PlantUML/master/C4_Component.puml
Person(user, "User")
System_Boundary(resolver, "Resolver") {
  System(service, "Resolver")
  ContainerDb(rules, "Rules")
}
System(target, "Target")
Rel_L(rules, service, "Inform")
Rel_R(service, user, "pid\ntarget", "http")
Rel(user, service, "resolve(pid)", "http")
Rel_R(user, target, "get(pid)", "http")
```

---

## N2T Resolver Context

```plantuml
!include https://raw.githubusercontent.com/datadavev/C4-PlantUML/master/C4_Component.puml
Person(user, "User")
System(target, "Target")
System_Boundary(resolver, "N2T Resolver") {
  System(service, "Resolver", "N2T HTTP\nresolver service")
  ContainerDb(rules, "Rules", "Rules merged\nfrom several sources")
  System(binder, "Binder", "Adds entries for\nindividual identifiers")
  Rel(binder, rules, "Identifiers")
}
ContainerDb_Ext(naans, "NAAN\nRegistry")
ContainerDb_Ext(idorg, "Other\nRules")
Rel(rules, naans, "Imports")
Rel(rules, idorg, "Imports")
Rel_L(rules, service, "Inform")
Rel(service, user, "2. target/pid")
Rel(user, service, "1. resolve(pid)")
Rel_R(user, target, "3. get(pid)")
System_Ext(ezid, "EZID")
Rel_L(ezid, binder, "Register\nIdentifiers")
```

* N2T maintains a list of resolver rules built from several sources.
* Given an identifier, N2T:
  1. Finds the rule that matches the most characters of the identifier, starting from the left.
  2. Redirects the user to the target using HTTP redirection (see the sketch after this list).
* Rules in N2T are continually updated through the binder service.
* Access to the binding service requires authentication.
* EZID is currently the only system using binder (sort of).
* N2T also supports "inflection", an operation that returns information about an identifier.
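A minimal sketch of the longest-prefix matching and redirect described above, assuming a hypothetical in-memory rule table and example target URLs (the real N2T rule store and redirect handling differ):

```python
# Minimal sketch of longest-prefix rule matching; NOT the actual N2T
# implementation. The rule table, prefixes, and target URLs below are
# hypothetical, and real N2T rules carry more than a target URL.
RULES = {
    "ark:/12345": "https://repository.example.org/",
    "ark:/12345/x5": "https://ezid.example.org/",
    "doi:10.5072": "https://datacite.example.org/",
}

def resolve(pid: str):
    """Return (status, redirect URL) for the longest matching rule, or None."""
    match = max((k for k in RULES if pid.startswith(k)), key=len, default=None)
    if match is None:
        return None  # the resolver would answer 404
    # 302/307 per the sequence diagram; append the full identifier to the target.
    return 302, RULES[match] + pid

print(resolve("ark:/12345/x5abc"))  # (302, 'https://ezid.example.org/ark:/12345/x5abc')
```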
---

## EZID Context

The EZID system interacts with external services such as N2T, DataCite, and CrossRef.

```plantuml
!include https://raw.githubusercontent.com/datadavev/C4-PlantUML/master/C4_Component.puml
Person(user, "User", "EZID User")
System_Boundary(ezid, "EZID") {
  System(ezidapp, "EZID", "EZID Application")
  SystemDb(eziddb, "Data Store", "RDS Datastore")
}
System_Ext(n2t, "n2t", "Name to Thing\nresolver")
System_Ext(datacite, "DataCite", "DataCite DOI service")
System_Ext(crossref, "CrossRef", "Crossref DOI service")
ContainerDb_Ext(naans, "NAAN\nRegistry")
ContainerDb_Ext(idorg, "Other\nRules")
Rel(user, ezidapp, "Create, Update, Search", "HTTP")
Rel(ezidapp, user, "Notify", "Email")
Rel(ezidapp, n2t, "Register ARK metadata", "HTTP")
Rel(ezidapp, datacite, "Register DataCite metadata", "HTTP")
Rel(ezidapp, crossref, "Register CrossRef metadata", "HTTP")
BiRel_R(ezidapp, eziddb, "CRUD")
Rel(n2t, naans, "Imports")
Rel(n2t, idorg, "Imports")
Rel(user, n2t, "Resolve", "HTTP")
Rel(ezidapp, n2t, "bind", "HTTP")
SHOW_LEGEND()
```

* EZID registered users have one or more prefixes ("shoulders"); minted identifiers append characters to a shoulder to form a globally unique identifier.
* Prefixes are associated with minters.
* "Minters" are state machines that emit strings guaranteed to be unique within a minter.
* EZID currently sends minted identifiers to N2T, where the N2T rules are updated to support resolution.
* EZID does not currently support identifier resolution.
* Resolve capability for EZID is close to completion.
* Once complete, binding with N2T will no longer be needed.
* N2T will then point to EZID for resolving those NAANs.

---

## Container

Major functional elements of the EZID application. Not shown are the background processes, which perform asynchronous operations: registering identifiers with N2T, DataCite, and CrossRef; updating the search index; evaluating links; and supporting batch downloads.

```plantuml
!include https://raw.githubusercontent.com/datadavev/C4-PlantUML/master/C4_Container.puml
Person(user, "User", "EZID User")
Person(admin, "Admin", "EZID Administrator")
Container_Boundary(aws, "EZID, AWS Hosted") {
  Container_Boundary(ezid, "EZID EC2") {
    Container(ezidui, "Web App", "HTML, Javascript, Python", "Web service UI and API")
    Container(ezidapp, "EZID", "Django, Python", "EZID application")
    Container(ezid_man, "Management", "Bash, Python", "Commandline management operations")
    ContainerDb(ezid_s, "Minters", "BDB")
  }
  ContainerDb(eziddb, "RDS read/Write", "AWS RDS MySQL")
  Container(lb, "Load Balance", "HTTP")
}
Rel(user, lb, "CRU", "HTTP")
Rel(lb, ezidui, "CRU")
Rel(admin, lb, "Manage", "HTTP")
Rel(lb, ezidui, "Manage")
Rel(ezidui, ezidapp, "API Calls", "HTTP")
Rel(admin, ezid_man, "Manage EZID system", "SSH")
Rel(ezid_man, ezidapp, "Manage", "CLI")
Rel_L(ezidapp, ezid_s, "Minter state")
BiRel_R(ezidapp, eziddb, "CRUDLS")
BiRel_R(ezid_man, eziddb, "CRUDLS")
SHOW_LEGEND()
```

* The "Minters" DB is currently external to RDS and exists as a set of Berkeley DB databases.
* Moving minter state into RDS is an important step for scalability (a toy sketch of the minting pattern follows this list).
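As a toy illustration of a minter as a state machine that emits strings unique within that minter: a sketch assuming a simple sequential suffix and a JSON state file. The real EZID minters (`impl.nog.minter.mint_id`) use a different scheme and keep their state in Berkeley DB.

```python
# Toy illustration of a minter as a state machine that emits strings unique
# within that minter. The real EZID minters (impl.nog.minter.mint_id) use a
# different scheme and persist state in BerkeleyDB; the JSON state file and
# sequential suffix here are assumptions for illustration only.
import json
from pathlib import Path

STATE_FILE = Path("minter_state.json")  # hypothetical stand-in for BDB state

def mint_id(shoulder: str) -> str:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    n = state.get(shoulder, 0)
    state[shoulder] = n + 1              # advance the minter's state
    STATE_FILE.write_text(json.dumps(state))
    return f"{shoulder}{n:06d}"          # e.g. ark:/99999/fk4000000

print(mint_id("ark:/99999/fk4"))
```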
---

## Component

### EZID Container

Identifies components involved with the EZID container and generally indicates interactions between components.

```plantuml
!include https://raw.githubusercontent.com/datadavev/C4-PlantUML/master/C4_Component.puml
AddElementTag("v3.1", $fontColor="#000000", $borderColor="#44B055", $bgColor="#CDEAD0")
AddElementTag("v3.2", $fontColor="#000000", $borderColor="#E99276", $bgColor="#FCF0E6")
Component(api, "API", "Django")
Component(app, "EZID", "Django")
BiRel(api, app, "All user operations")
ComponentDb(rds, "RDS", "MySQL")
BiRel(app, rds, "CRUDLS")
ComponentDb(minters, "Minters", "BDB", $tags="v3.2")
Rel(app, minters, "Create", "impl.nog.minter.mint_id")
ComponentQueue(updateq, "Update Q", "impl.backproc.py")
Rel(app, updateq, "Identifier CUD", "DB UpdateQueue", "Create\nUpdate\nDelete")
Rel(updateq, rds, "Update")
ComponentQueue(binderq, "Binder Q", "impl.binder_async.py")
Rel(updateq, binderq, "create", "")
Component(noidegg, "noid_egg", "noid_egg.py", $tags="v3.1")
Rel(binderq, noidegg, "Async Calls")
Component_Ext(n2t, "n2t", "HTTP")
BiRel(noidegg, n2t, "State")
ComponentQueue(downloadq, "Download Q", "impl.download.py")
Rel(app, downloadq, "Batch Download", "DB download_queue", "Download")
Component_Ext(email, "Email", "SMTP")
Rel(downloadq, email, "Email bundle")
ComponentQueue(dataciteq, "DataCite Q", "MySQL, Python")
Rel(updateq, dataciteq, "Update DataCite")
Component(datacitepy, "datacite.py", "Python", "modernize HTTP", $tags="v3.2")
Rel(dataciteq, datacitepy, "async")
Component_Ext(datacite, "DataCite", "HTTPS")
Rel(datacitepy, datacite, "Send metadata")
ComponentQueue(crossrefq, "Crossref Q", "MySQL, Python")
Rel(updateq, crossrefq, "Update Crossref")
Component(crossrefpy, "crossref.py", "Python", "modernize HTTP", $tags="v3.2")
Rel(crossrefq, crossrefpy, "async")
Component_Ext(crossref, "Crossref", "HTTPS")
Rel(crossrefpy, crossref, "Send metadata")
Component(resolve, "Resolve", "Python", $tags="v3.1")
Rel(app, resolve, "Resolve")
Rel(resolve, rds, "Read")
SHOW_LEGEND()
```

---

**Misc notes below, may be out of date**

```
Person(admin, "Admin", "EZID Administrator")
Rel(admin, ezidapp, "Manage", "HTTP, SSH")
Rel(ezidapp, admin, "Notify", "Email")
Rel(naan, admin, "ARK Shoulder", "OOB")
Rel(datacite, admin, "DOI Shoulder", "OOB")
Rel(crossref, admin, "DOI Shoulder", "OOB")
```

## Download Operation

```plantuml
Actor Alice
Alice -> API: Batch Download
API -> download: enqueue_request
download -> download_queue: payload
activate download_queue
download -> API: ACK
API -> Alice: ACK
thread -> download_queue: objects.all[0]
download_queue -> thread: task
note right: These steps are called separately\nin the thread loop. They can all\nbe coded in a single method called\nonce with no loss of functionality.
thread -> thread: CREATE: _createFile
thread -> thread: HARVEST: _harvest
thread -> thread: COMPRESS: _compressFile
thread -> thread: DELETE: _deleteUncompressedFile
thread -> thread: MOVE: _moveCompressedFile
thread -> Mail: MAIL: _notifyRequestor: msg
thread -> download_queue: delete task
deactivate download_queue
Mail -> Alice: message
```
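The note in the diagram above suggests the per-task steps could be collapsed into a single method. A minimal sketch of that shape, assuming an in-memory queue and stand-in step bodies in place of `impl.download` and the `download_queue` table:

```python
# Sketch of the download worker with the CREATE/HARVEST/COMPRESS/DELETE/
# MOVE/MAIL steps folded into a single call, as the diagram note suggests.
# The in-memory queue and step bodies are stand-ins for impl.download and
# the download_queue table; names and formats here are assumptions.
import gzip, os, tempfile
from collections import deque

download_queue = deque([{"request_id": "r1", "owner": "alice"}])

def process_task(task: dict) -> str:
    tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)   # CREATE
    tmp.write(f"harvested records for {task['owner']}\n")                 # HARVEST
    tmp.close()
    archive = tmp.name + ".gz"
    with open(tmp.name, "rb") as src, gzip.open(archive, "wb") as dst:    # COMPRESS
        dst.write(src.read())
    os.remove(tmp.name)                                                   # DELETE
    # MOVE (_moveCompressedFile) and MAIL (_notifyRequestor) would follow here.
    return archive

while download_queue:
    process_task(download_queue[0])      # oldest task first, as in objects.all[0]
    download_queue.popleft()             # delete the task only after success
```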
## Mint Identifier

Create Part 1. Store pending identifier, enqueue task, respond to user.

```plantuml
Actor Alice
Alice -> API: Mint Identifier
API -> ezid: mintIdentifier
ezid -> identifierLock: _acquireIdentifierLock
note right: Why is this\nlocked twice?
activate identifierLock
ezid -> ezid: _mintIdentifier
ezid -> ezid: createIdentifier
ezid -> identifierLock: _acquireIdentifierLock
activate identifierLock
ezid -> models.store_identifier: StoreIdentifier.save
note right #aqua: BLOB is written here
ezid -> update_queue: enqueue (SI, "create")
activate update_queue
deactivate identifierLock
ezid -> API: ACK
deactivate identifierLock
API -> Alice: ACK
note right of update_queue: Worker thread in\nbackproc.py
```

Create Part 2. Process task in update queue. This is shown for a DataCite DOI; CrossRef DOIs follow the same process, except using the CrossRef queue.

```plantuml
activate update_queue
update_queue -> update_queue: get task
note right: _backprocDaemon\nbackproc.py
update_queue -> update_queue: metadata = toLegacy()
update_queue -> update_queue: BLOB = blobify(metadata)
update_queue -> models.search_identifier: updateFromLegacy (id, "create", metadata, BLOB)
note right #aqua: BLOB is written here
alt is DOI?
update_queue -> datacite_async: enqueueIdentifier (id, "create", BLOB)
activate datacite_async
note right: _workerThread in\nregister_async.py
par Async execution
datacite_async -> datacite.py: uploadMetadata
activate datacite.py
participant DataCite #white
datacite.py -> DataCite: HTTP
DataCite -> datacite.py: ACK
deactivate datacite.py
datacite_async -> datacite.py: setTargetUrl
activate datacite.py
datacite_async -> datacite_async !!: workerThread\ndelete
deactivate datacite_async
datacite.py -> datacite.py: registerIdentifier
datacite.py -> DataCite: HTTP
DataCite -> datacite.py: ACK
deactivate datacite.py
end
end
note right: This block only applies to DOIs.\nIt is executed in parallel with the\noperation on binder_async
update_queue -> binder_async: enqueueIdentifier (id, "create", BLOB)
activate binder_async
update_queue -> update_queue !!: daemon\ndelete task
deactivate update_queue
binder_async -> binder_async: create
note right: _workerThread
binder_async -> noid_egg: setElements
binder_async -> binder_async !!: workerThread\ndelete
deactivate binder_async
activate noid_egg
noid_egg -> noid_egg: _setElements
noid_egg -> noid_egg: _issue
participant N2T #white
noid_egg -> N2T: HTTP
deactivate noid_egg
```
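A minimal sketch of the enqueue-then-worker pattern used by the `*_async` queues above, assuming an in-memory queue and a generic `send()` callable in place of the real MySQL-backed queues and the `datacite.py` / `noid_egg.py` clients:

```python
# Sketch of the register_async pattern: work is enqueued per target service
# and a worker thread pops each task, calls the service, and removes the
# task once it is handled. The in-memory Queue and the send() callable are
# stand-ins; names here are assumptions, not the actual EZID code.
import queue
import threading

datacite_q = queue.Queue()

def enqueue_identifier(q, identifier, operation, blob):
    q.put((identifier, operation, blob))

def worker(q, send):
    while True:
        identifier, operation, blob = q.get()   # _workerThread: next task
        send(identifier, operation, blob)       # e.g. upload metadata, register identifier
        q.task_done()                           # real queues delete the row only on success

threading.Thread(target=worker, args=(datacite_q, print), daemon=True).start()
enqueue_identifier(datacite_q, "doi:10.5072/FK2TEST", "create", b"...blob...")
datacite_q.join()                               # wait until the worker has drained the queue
```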
---

### Metadata BLOBs

Progression of metadata from initial POST to appearance in the processing queues.

```plantuml
[*] --> metadata
note right: metadata dict is loaded from ANVL POST body
state "createIdentifier() in ezid.py" as ezid {
metadata --> StoreIdentifier: updateFromUntrustedLegacy()
StoreIdentifier --> [*]: Save()
note right: Metadata persisted\nto StoreIdentifier table
StoreIdentifier --> UpdateQueue: enqueue()
note right: StoreIdentifier instance is added\nas a value to the UpdateQueue
}
state "_daemonThread() in backproc.py" as daemonThread {
UpdateQueue --> UpdateModel
UpdateModel --> SI.metadata: SI.toLegacy()
SI.metadata --> BLOB: impl.util.blobify()
note left: The BLOB is a gzipped copy\nof the metadata converted to an\n"exchange" format using toExchange()
BLOB -> SearchIdentifier
SI.metadata -> SearchIdentifier
SearchIdentifier: BLOB (not used)
SearchIdentifier: metadata
BLOB --> binder_async: all
binder_async: BLOB
BLOB --> datacite_queue: if DataCite
datacite_queue: BLOB
BLOB --> crossref_queue: if CrossRef
SI.metadata --> crossref_queue: if CrossRef
crossref_queue: BLOB
crossref_queue: metadata (used only for owner)
UpdateModel --> [*]
note right: deleted after populating\nsearch and other queues
}
binder_async --> [*]: impl.util.deblobify()
datacite_queue --> [*]: impl.util.deblobify()
crossref_queue --> [*]: impl.util.deblobify()
note left: BLOBs are deblobified prior\nto any subsequent processing\nin any of these queues
```

## Update Identifier

```plantuml
Actor Alice
Alice -> API: Update Metadata
API -> ezid: setMetadata
ezid -> identifierLock: _acquireIdentifierLock
activate identifierLock
ezid -> models.store_identifier: StoreIdentifier.save
note right #aqua: BLOB is written here
ezid -> update_queue: enqueue (SI, "update")
deactivate identifierLock
activate update_queue
ezid -> API: ACK
API -> Alice: ACK
note right of update_queue: Worker thread in\nbackproc.py
```

## Delete Identifier

```plantuml
Actor Alice
Alice -> API: delete
API -> ezid: deleteIdentifier
ezid -> identifierLock: _acquireIdentifierLock
activate identifierLock
ezid -> models.store_identifier: StoreIdentifier.delete
note right #aqua: BLOB is deleted here
ezid -> update_queue: enqueue (SI, "delete")
deactivate identifierLock
activate update_queue
ezid -> API: ACK
API -> Alice: ACK
note right of update_queue: Worker thread in\nbackproc.py
```

## BLOBs

BLOBs are stored in the following tables:

- `ezidapp_storeidentifier`
- `ezidapp_searchidentifier`
- `ezidapp_binderqueue`
- `ezidapp_updatequeue`
- `ezidapp_crossrefqueue`
- `ezidapp_datacitequeue`

Each of the queues is ephemeral. The `searchidentifier` entries are created from `storeidentifier` records.
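For orientation, a minimal round-trip sketch of what `blobify()`/`deblobify()` do, assuming a simplified one-pair-per-line exchange format (the real encoding and escaping live in `impl.util` and `toExchange()` and differ in detail):

```python
# Minimal round-trip sketch of blobify()/deblobify(): the metadata dict is
# serialized to an "exchange" form and gzipped. The one-pair-per-line format
# here is a simplification of the real exchange format.
import gzip

def blobify(metadata: dict) -> bytes:
    exchange = "\n".join(f"{k} {v}" for k, v in metadata.items())
    return gzip.compress(exchange.encode("utf-8"))

def deblobify(blob: bytes) -> dict:
    exchange = gzip.decompress(blob).decode("utf-8")
    return dict(line.split(" ", 1) for line in exchange.splitlines())

blob = blobify({"erc.who": "Smith, J.", "erc.what": "Example dataset"})
assert deblobify(blob)["erc.who"] == "Smith, J."
```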