# Open Neuro PET Hack Notes

**High-level overview**

- Data deposition is not user-triggered but rather an administrative process on OpenNeuroPET's side, involving plenty of paperwork due to GDPR.
- Data retrieval requires authentication and approval. There will be an SQL-based API (not built yet) that DataLad could query to learn whether a user is allowed to retrieve data (many degrees of freedom with regard to what that API returns).
- The API likely returns a dataset ID, a token, and a user.
- During the deposition process, OpenNeuroPET converts the data to a DataLad dataset.

**URL-based lookup workflow**

* Every OpenNeuro dataset is ideally one DataLad dataset.
* For each file in a dataset, a URL is registered. This URL points to the storage backend.
* Taken together, the info in that URL (e.g., base URL, ID) plus the info that DataLad has (e.g., checksums, dataset UUID, file sizes, hashtree formats) needs to be sufficient to allow for infrastructure flexibility (e.g., moving data), such that new access URLs can be composed on the fly. Given an annex key and file ID, we should be able to perform a lookup on Google Drive/S3/Azure/...
* Different auth methods possibly encoded in the URL: token, temporary URLs, ...

How are we doing it?

### Support for multiple and future storage backends

Basic concept for extensibility and long-term accessibility: each individual file can be identified by two static components and one dynamic, i.e. storage-specific, component. The static components are:

- Dataset UUID
- File key (in this case: the annex key)

The dynamic component is specific to the concrete storage backend. Conceptually, a storage-specific mapping (SPM) produces an access URL:

`access_url = SPM(UUID, file-key)`

This ensures that stored datasets remain accessible even if the storage system changes. All that is required is the implementation of a new storage-specific mapping function on the client side (a sketch follows at the end of this section).

**Exploring the storage backend**

Current storage backend of interest: **Microsoft Azure** (lawyers said yes).

- It is likely that each dataset is one [Azure container within Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-overview). But it's worthwhile to explore whether, instead of containers, one might group files into "Azure folders" to identify datasets (there is no literal concept of a folder, but blob names can contain slashes).
- What does it take to download a file from Azure?
- What's the Azure URL scheme?
  - Container names can be between 3 and 63 characters long.
  - Container names must start with a letter or number, and can contain only lowercase letters, numbers, and the dash (-) character.
  - Two or more consecutive dash characters aren't permitted in container names.
  - The URI for a container is similar to:
    > https://myaccount.blob.core.windows.net/mycontainer
- Which protocol is used for upload/download? From the docs: "Users/client applications can access objects in Blob Storage via HTTP/HTTPS [...]. Objects in Blob Storage are accessible via the Azure Storage REST API, Azure PowerShell, Azure CLI, or an Azure Storage client library"
- How can we encode all additional information for files stored in Azure?
- What are the limits on Azure (how many containers, files, etc.), and what granularity will we go for?
  - Answer from the docs: "A storage account can include an unlimited number of containers, and a container can store an unlimited number of blobs." Containers could be identified by the dataset UUID.
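To make the SPM idea concrete: below is a minimal sketch, assuming one Azure container per dataset (named after the lowercased dataset UUID, which satisfies the container-name rules above) and the annex key used verbatim as the blob name. The account name and function names are hypothetical placeholders, not a fixed OpenNeuroPET convention.

```python
AZURE_ACCOUNT = "neurothenticate"  # hypothetical storage account name


def spm_azure(dataset_uuid: str, annex_key: str) -> str:
    """Storage-specific mapping (SPM) for Azure Blob Storage.

    Assumes container name == lowercased dataset UUID (36 characters of
    lowercase letters, digits, and single dashes -> a valid container name)
    and blob name == annex key.
    """
    return (
        f"https://{AZURE_ACCOUNT}.blob.core.windows.net/"
        f"{dataset_uuid.lower()}/{annex_key}"
    )


# Moving to a different backend only requires swapping the mapping, e.g.:
def spm_s3(dataset_uuid: str, annex_key: str) -> str:
    # hypothetical bucket name
    return f"https://openneuropet-data.s3.amazonaws.com/{dataset_uuid}/{annex_key}"
```

A client that knows the two static components could compose `spm_azure(uuid, key)` on the fly and append whatever auth component (e.g., a SAS token) the access API hands back.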
Key names (e.g., the annex key) can additionally be encoded in the blob names within a container.

### Python stuff

Python library to interact with Azure storage: https://pypi.org/project/azure-storage-blob/

There are different ways to authenticate. Promising: SAS tokens (can be time-limited); see the sketch at the end of these notes for how such a token could be generated.

Code to download a blob from an Azure container with a temporary token:

```python
from azure.storage.blob import ContainerClient

# time-limited SAS token provided by OpenNeuroPET
sas_token = "<super-secret-string-here>"

container = ContainerClient.from_container_url(
    'https://neurothenticate.blob.core.windows.net/pet-phantoms',
    credential=sas_token,
)

# download_blob() requires the exact blob name (relative to the container),
# not the full blob URL
stream = container.download_blob(
    blob='OpenNeuroPET-Phantoms-20221207T203406Z-001.zip'
)
```

# Metadata pipeline notes

## OpenNeuro Metadata form

[Metadata form](https://openview.metadatacenter.org/templates/https:%2F%2Frepo.metadatacenter.org%2Ftemplates%2Ffacf7262-e29b-40a3-b053-e5fcdaa080dd) (generator & schema) on CEDAR

## High-level pipeline

- user sends BIDS data to OpenNeuroPET via FTP
-
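**Appendix: generating a time-limited SAS token (sketch)**

Referenced from the "Python stuff" section above: a minimal sketch of how a read-only, time-limited container SAS token could be minted on the server side with `azure-storage-blob`. The account name, container name, and account key are placeholders; how such tokens would actually be issued (e.g., via the planned SQL-based access API) is not decided in these notes.

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

# Placeholders -- real values would come from the OpenNeuroPET deployment.
ACCOUNT_NAME = "neurothenticate"
ACCOUNT_KEY = "<storage-account-key>"
CONTAINER = "pet-phantoms"

# Read-only, list-enabled token that expires after one hour.
sas_token = generate_container_sas(
    account_name=ACCOUNT_NAME,
    container_name=CONTAINER,
    account_key=ACCOUNT_KEY,
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

# The resulting string can be handed to a client and passed as
# `credential=sas_token` in the download example above.
```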