The design of a storage-oriented system parachain for the Polkadot ecosystem involves a multiplicity of decisions, often interdependent, that determine the exact inner workings, guarantees provided, and use-cases it caters to. Even within Web3, these use-cases can have very different requirements that could affect the design in one way or another. The aim of this document is to expose all the dimensions that were taken into account for an implementation of Storage Hub, briefly introduce two approaches to such a solution, and expand on the one this team considers to add more value to the ecosystem.
When it comes to the design of a storage chain, some of the key questions that should be asked include:
It is evident that some of these questions are related to each other. For example, one cannot talk about the Game Theory and incentives without considering how price is calculated. At the same time, it is also true that the answers to these questions depend greatly on the use cases the system aims to serve. Even within Web3-native use cases, providing cheap NFT storage with just enough bandwidth to be read sporadically is not the same as storing appchain runtimes that need to be fetched by Relay Chain validators whenever they need to verify a block. In the near future, there could even be a market for running AI training or processing over some data stored in Storage Hub.
The following is a basic diagram of how, at this point in the document, Storage Hub should be visualised:
For all the previous reasons, the design of Storage Hub was reduced to two alternatives that could make sense, considering different design values. The alternatives are:
Storage Hub will be a service available to all parachains. Its use-cases are as unpredictable as the imagination of the teams behind those parachains. Considering this, the Providers Interaction design represents the superior alternative for adding value to Polkadot's ecosystem, being able to adaptively create new offerings for requirements, use-cases and demand that might not even exist today but will be crucial in the future, all the while providing strong certainty that even if a Main Storage Provider fails to deliver, disappears, or becomes a bad actor, there is always the possibility of choosing another.
This design is ideated with three core values in mind that should guide the decision process when more than one option is presented:
This section covers the way in which a user of this system would interact with it at an abstract level. It explains the actions the user takes to upload and retrieve data, but not necessarily the exact means or the appearance of the application used to interact. In other words, what this section describes is how much of the inner workings of the system the user is exposed to.
When talking about "the user" in this section, it should be thought of as a Polkadot parachain or relay chain, interacting with Storage Hub through XCM (or a similar communication channel), or as an application running on that parachain, to which the parachain exposes the interaction with Storage Hub. It is expected that dApps would interact directly with Storage Hub, but mostly to retrieve/upload data that is managed by other parachains.
There is a public list of available Main Storage Providers that offer their service in an open market. Users can compare characteristics of Storage Providers to select at least one as their Main Provider. Such characteristics include:
The Storage Provider agrees to uphold the conditions advertised, and the user should be given all the information it needs to make a sensible decision.
The user informs the choice of Main Storage Provider through an on-chain transaction, while the data is transferred off-chain in peer-to-peer communication between the parachain / relay chain node, and the Storage Provider. Once this is done, the Storage Provider is eligible to collect payments, contingent upon proving that it continues to store the data when prompted. For more details on proofs, please refer to the Proofs section.
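As an illustration of what this on-chain step could carry, a minimal sketch is shown below. All type and field names are hypothetical and do not represent an actual Storage Hub interface; the point is that only the request metadata goes on-chain, while the file itself travels off-chain.

```rust
// Hypothetical sketch of the metadata committed on-chain when a user requests
// storage. The file data itself is transferred off-chain, peer-to-peer.

type AccountId = u64;   // placeholder types for illustration
type ProviderId = u64;
type Balance = u128;

/// Information the user's on-chain transaction would carry.
struct StorageRequest {
    owner: AccountId,
    /// Identifier of the file (hash of its location, see below).
    file_id: [u8; 32],
    /// Size in bytes the providers agree to accept off-chain.
    size: u64,
    /// The Main Storage Provider chosen from the public list.
    main_provider: ProviderId,
    /// Price per billing period agreed with that provider.
    agreed_price_per_period: Balance,
}

/// Hypothetical state transition: record the request so the chosen provider
/// becomes eligible to collect payments once it starts proving storage.
fn register_storage_request(_req: StorageRequest) {
    // ... persist the request, assign Backup Storage Providers,
    // and open the payment stream (omitted) ...
}
```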
At any point in time, if the user feels the Main Storage Provider is not upholding their side of the deal, it can switch to another Storage Provider as its Main Provider. The data does not need to be re-uploaded to the new Main Storage Provider, since it would request it from the previous one. If that is not possible due to legitimate issues, censorship, or dishonest behaviour, the data can be retrieved from Backup Storage Providers.
There is an on-chain Distributed Hash Table (DHT) that links a file to the ID of the Main Storage Provider who exposes it, and also to the Backup Storage Providers. The Main Storage Provider maintains an on-chain "profile" with the appropriate information for retrieving data from it, which happens off-chain. A file is identified by the hash of its location, and the location is the concatenation of the `AccountId` that owns the file and an arbitrary path. For example, for `AccountId = 12345`, the file whose path is `cool_project/v1/logo` is identified by `hash(/12345/cool_project/v1/logo)`. Account IDs can be considered to have a dedicated "drive". The reason for this location-addressed system is to allow for modification and upgradability of files.
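To make the location-addressing concrete, here is a minimal sketch of deriving a file identifier. The choice of SHA-256 is an assumption made purely for illustration; the actual hash function is not specified here.

```rust
use sha2::{Digest, Sha256};

/// Derive a file identifier from the owner's account and an arbitrary path,
/// following the `/<account>/<path>` location scheme described above.
/// SHA-256 is an illustrative choice, not the specified hash function.
fn file_id(account_id: &str, path: &str) -> Vec<u8> {
    let location = format!("/{}/{}", account_id, path);
    Sha256::digest(location.as_bytes()).to_vec()
}

fn main() {
    // For AccountId = 12345 and path cool_project/v1/logo, the identifier
    // is the hash of /12345/cool_project/v1/logo.
    let id = file_id("12345", "cool_project/v1/logo");
    println!("file id: {:02x?}", id);
}
```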
The existence of a Main Storage Provider is primarily for providing a hassle-free retrieval user experience. The Storage Provider is tasked with making data retrieval channels available, as stipulated by its advertised services and agreed upon by the user who selected it. For example, a Storage Provider may offer to expose the data through an HTTP endpoint, with a given maximum request rate limit, hosted in a given region, and with a given network bandwidth. Another Storage Provider could offer access through WebSocket or IPFS as its value proposition. Ultimately, the objective is to foster a competitive open market where Storage Providers compete for selection as a Main Storage Provider by offering unique value propositions for data retrieval, thereby appealing to various user needs. This way, the system as a whole is flexible enough to serve use cases that go from mission-critical data retrieval applications to cheap/free NFT storage.
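As a rough illustration of the kind of on-chain "profile" a Main Storage Provider could advertise, consider the sketch below. Field names, units, and the set of channels are assumptions for illustration only.

```rust
/// Hypothetical ways a Main Storage Provider can expose data off-chain.
enum RetrievalChannel {
    Http { endpoint: String, max_requests_per_minute: u32 },
    WebSocket { endpoint: String },
    Ipfs { gateway: String },
}

/// Hypothetical advertised offering that users compare when selecting
/// a Main Storage Provider in the open market.
struct ProviderProfile {
    provider_id: u64,
    /// Region where the retrieval infrastructure is hosted.
    region: String,
    /// Advertised network bandwidth, in megabits per second.
    bandwidth_mbps: u32,
    /// One or more retrieval channels offered as the value proposition.
    channels: Vec<RetrievalChannel>,
    /// Price per unit of data per billing period, in currency units.
    price_per_unit: u128,
}
```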
The reason behind this model for data retrieval is to avoid the complexity of trying to prove on-chain that the Storage Provider served the data, which, for large files, would necessarily be transferred off-chain. Proving that the Storage Provider served the requested data to the requesting user, without trusting either of the two (i.e. assuming either could be malicious), can be extremely cumbersome and impractical.
Although good for simplicity of data retrieval, it should be acknowledged that this open market competition may lead to centralisation in a handful of Storage Providers that, due to reputation, superior value propositions, or even default options in dApps, end up being the Main Storage Provider choice for most users. However, the unstoppability goal can still be accomplished through the role of Backup Storage Providers. These alternative Storage Providers store a backup of the data to ensure that users can always permissionlessly change their Main Storage Provider, even when that provider is reluctant or unable to serve the data back.
Reading from and writing to a file are fundamentally different actions in this system. Writing involves an on-chain transaction where the user posts the request for storing a file, and pays according to the defined payment method (see Payment). The user then transfers the file data through an off-chain channel, after which the Storage Provider proves it is storing the file in order to be compensated. In this process, it can be verified on-chain that the user who initiates the writing request is allowed to do so; otherwise the transaction is invalid.
Reading is a completely off-chain action, where the Main Storage Provider is supposed to serve the file to the user requesting it, provided the latter has access, but there is no way to verify on-chain that the Main Storage Provider is upholding the reading permissions set for that file. There are two ways to approach this: trusting the Main Storage Provider, and changing it should it misbehave, or through file encryption.
An `AccountId` in Storage Hub is the owner of a file, and it is the only one (at first) able to assign permissions to any other `AccountId` over that file. An `AccountId` in Storage Hub can be controlled by different kinds of entities. Some examples include:
The common pattern in all of the previous is that they send messages through XCM, except for the first case. In fact, all of them can be thought of as "locations" in the XCM language, where in the first case the origin location does not come from outside of Storage Hub.
An `AccountId` can assign permissions to these locations specifically (for example to the location of `parachain A`, or `account 1` in `pallet 2` of `parachain B`) and those permissions can be
Wildcards are allowed when assigning permissions to locations. For example, the location `0/polkadot/parachain_a/*` would be used to allow writing operations from any requests coming from locations under `parachain_a`, whereas `0/polkadot/parachain_a` allows writing only from messages whose sender is `parachain_a` itself. This enables assigning permissions to a larger group of individuals at once, which could be all accounts in a given collective registered in a Collectives pallet of a parachain. That is also the reason why negative permissions exist: to disallow a specific user or group of users who belong to a group that is allowed by a wildcard. Negative permissions have the last say when in conflict with positive permissions.
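A minimal sketch of how such wildcard and negative permissions could be resolved is shown below; the location string format and the rule representation are assumptions made for illustration.

```rust
/// Hypothetical permission rule over an XCM-style location string,
/// e.g. "0/polkadot/parachain_a" or "0/polkadot/parachain_a/*".
struct Rule {
    pattern: String,
    /// true = positive (allow), false = negative (deny).
    allow: bool,
}

/// A pattern ending in "/*" matches every location nested under its prefix;
/// otherwise the location must match exactly.
fn matches(pattern: &str, location: &str) -> bool {
    match pattern.strip_suffix("/*") {
        Some(prefix) => location.starts_with(&format!("{}/", prefix)),
        None => location == pattern,
    }
}

/// Resolve write access: any matching negative rule overrides positive ones,
/// since negative permissions have the last say.
fn can_write(rules: &[Rule], location: &str) -> bool {
    let mut allowed = false;
    for rule in rules {
        if matches(&rule.pattern, location) {
            if !rule.allow {
                return false;
            }
            allowed = true;
        }
    }
    allowed
}

fn main() {
    // Hypothetical example: allow everything under parachain_a,
    // except one specific account that is explicitly denied.
    let rules = vec![
        Rule { pattern: "0/polkadot/parachain_a/*".into(), allow: true },
        Rule { pattern: "0/polkadot/parachain_a/account_9".into(), allow: false },
    ];
    assert!(can_write(&rules, "0/polkadot/parachain_a/account_1"));
    assert!(!can_write(&rules, "0/polkadot/parachain_a/account_9"));
}
```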
Reading permissions cannot be assigned to groups in the same way as writing permissions, because reading does not involve an on-chain action like a transaction or XCM message whose origin can be checked. Instead, it is proposed for reading permissions to be NFTs in Storage Hub that represent the credentials, and these NFTs could be configurable by the permissions issuer to allow transfers or not. Whoever requests reading a file from a Main Storage Provider should provide a valid signature from an account that holds the corresponding NFT credentials. A special case for Collectives could be considered, should users have a way to prove they belong to a Collective.
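The check a Main Storage Provider would be trusted to perform before serving a file could look roughly like the sketch below. The signature and NFT-ownership helpers are stubs standing in for a real cryptographic verification and an on-chain query; all names are hypothetical.

```rust
/// Hypothetical read request: the requester signs a challenge to prove control
/// of an account that holds the credential NFT for the file.
struct ReadRequest {
    file_id: Vec<u8>,
    requester: String,   // account claiming read access
    challenge: Vec<u8>,  // e.g. a nonce issued by the provider
    signature: Vec<u8>,  // requester's signature over the challenge
}

/// Stub: a real implementation would verify the signature cryptographically.
fn signature_is_valid(_account: &str, _challenge: &[u8], _signature: &[u8]) -> bool {
    true // placeholder for illustration only
}

/// Stub: a real implementation would query on-chain who owns the credential NFT.
fn holds_credential_nft(_account: &str, _file_id: &[u8]) -> bool {
    true // placeholder for illustration only
}

/// What the Main Storage Provider checks before serving the (possibly encrypted) file.
fn may_serve(req: &ReadRequest) -> bool {
    signature_is_valid(&req.requester, &req.challenge, &req.signature)
        && holds_credential_nft(&req.requester, &req.file_id)
}
```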
There are two options when it comes to enforcing reading permissions: trusting the Main Storage Provider to enforce them and changing to another if it fails to do so, or adding a layer of encryption. The first one is understandably simpler to implement, but gives fewer guarantees to the user. Adding a layer of encryption comes with the difficulty of sharing the encryption secrets properly and avoiding leaks. One model that could be implemented is for the owner of the file to divide the encryption keys using Shamir Secret Sharing, and distribute the secret among a subset of Storage Hub nodes. Then, when reading the file, the user gets the encrypted blob from the Main Storage Provider, and communicates with Storage Hub's nodes to retrieve the pieces and reconstruct the encryption keys. The user should prove to the nodes that it possesses the corresponding NFT credentials through a valid signature, and the retrieval of the pieces should happen over an encrypted channel.
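The flow described above could be sketched as follows. For brevity the split here is a trivial XOR-based n-of-n scheme standing in for Shamir Secret Sharing (real Shamir would additionally allow reconstruction from only a threshold of the pieces), and the deterministic filler bytes stand in for a cryptographic random number generator.

```rust
// Simplified stand-in for the key-distribution flow: the file owner splits the
// encryption key into pieces held by Storage Hub nodes, and a reader who proves
// it holds the NFT credentials retrieves the pieces and reconstructs the key.
// NOTE: this is an n-of-n XOR split for illustration only, not real Shamir
// Secret Sharing, and the filler bytes below are not cryptographically random.

/// Split a key into `n` pieces such that XOR-ing all of them restores the key.
fn split_key(key: &[u8], n: usize) -> Vec<Vec<u8>> {
    assert!(n >= 2);
    let mut pieces: Vec<Vec<u8>> = Vec::new();
    let mut running = key.to_vec();
    for i in 0..n - 1 {
        // Deterministic filler standing in for random bytes.
        let piece: Vec<u8> = (0..key.len())
            .map(|j| (i as u8).wrapping_mul(31).wrapping_add(j as u8))
            .collect();
        for (r, p) in running.iter_mut().zip(&piece) {
            *r ^= *p;
        }
        pieces.push(piece);
    }
    pieces.push(running); // the last piece completes the XOR back to the key
    pieces
}

/// Reconstruct the key by XOR-ing all pieces retrieved from the nodes.
fn reconstruct_key(pieces: &[Vec<u8>]) -> Vec<u8> {
    let mut key = vec![0u8; pieces[0].len()];
    for piece in pieces {
        for (k, p) in key.iter_mut().zip(piece) {
            *k ^= *p;
        }
    }
    key
}

fn main() {
    let key = b"example-encryption-key".to_vec();
    let pieces = split_key(&key, 5); // e.g. distributed among 5 Storage Hub nodes
    assert_eq!(reconstruct_key(&pieces), key);
}
```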
The aim of this section is to provide a system design that naturally incentivises the actors in it to behave in a way that benefits the system as a whole. These behaviours are:
With these goals in mind, the system should provide positive incentives to guide behaviours, and negative incentives to punish actors that misbehave. Positive incentives are preferred to make actors willingly act in the expected way, while negative ones should be a last resort and only used for proven bad actors.
The incentives should also be guided by target parameters that can be modified by governance of the Storage Hub. These are:
Before moving on to the actual incentives specification, the difference between a Main and a Backup Storage Provider should be explained. The Main Storage Provider is the one that stores the full file given by the user, and exposes the means for retrieving the data off-chain. For a given file, there are `N` Backup Storage Providers, where `N` is determined in Replications. The job of a Backup Storage Provider is to store a piece of the file, and have it readily available to be served to another Storage Provider, if the user assigns it as the new Main Storage Provider. That means a Backup Storage Provider does not need to expose the piece of data for public retrieval.
Both kinds of Storage Providers are required to regularly prove they keep storing the data they committed to, but their roles in the system are different. Main Storage Providers are there to provide convenience and simplicity for data retrieval. They are supposed to compete for user adoption, and it is expected that a certain degree of centralisation will happen. They are also trusted parties when it comes to expecting them to allow data retrieval. Backup Storage Providers are there in case that trust is broken, so that the user can freely choose another Main Storage Provider and rest assured that the data is available in Storage Hub to do so. Backup Storage Providers do not compete, as they do not offer a distinctive service to one another, and they are not chosen by the user. Instead, they are assigned by the system, with considerations to have an even distribution of data.
The user's payment for storing a file is distributed between Main and Backup Storage Providers (and a small fraction goes to the treasury). The amount going to the Main Storage Provider is agreed upon when the user selects that Main Storage Provider, but it would normally be substantially higher than what a Backup Storage Provider receives, because Main Storage Providers have to provide the infrastructure for convenient data retrieval. What does not go to the Main Storage Provider gets evenly distributed among the assigned Backup Storage Providers. For example, if a user is paying 10 units of currency, and the Main Storage Provider charges 5 units, then the remaining 5 units are distributed among the Backup Storage Providers.
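A worked sketch of this split is shown below; the treasury cut is left as an explicit parameter since its exact fraction is not specified here, and all names are illustrative.

```rust
/// Hypothetical split of a user's per-period payment: a small treasury cut,
/// the fee agreed with the Main Storage Provider, and an even share for each
/// assigned Backup Storage Provider.
fn backup_share(
    total: u128,
    treasury_cut: u128,
    main_provider_fee: u128,
    backup_count: u128,
) -> u128 {
    // Whatever does not go to the treasury or the Main Storage Provider
    // is distributed evenly among the Backup Storage Providers.
    (total - treasury_cut - main_provider_fee) / backup_count
}

fn main() {
    // Example from the text (ignoring the treasury cut for simplicity):
    // the user pays 10 units, the Main Storage Provider charges 5, and the
    // remaining 5 units are shared by, say, 5 Backup Storage Providers.
    assert_eq!(backup_share(10, 0, 5, 5), 1);
}
```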
With all these considerations taken into account, the following incentives are proposed:
Alternatively, if there is more interest in providing a stable price for a more hassle-free user experience, the price could be kept flat following the curve below, with the disadvantage of losing the demand-side incentives:
Regardless of the exact method, payments in this design need to work as money streams, meaning that there should be a constant flow of monetary incentives that keeps requiring the Storage Providers (both kinds) to prove they are still storing the data. The frequency with which these money streams release payments is a system parameter, and so is the price per unit of data for the Backup Storage Providers. The latter means that each user pays an amount per file that depends on the size of the file.
When the user announces on-chain its intention to store a file of a given size, it should be checked that it has enough balance to cover at least one billing period (or more; this could be a system setting). Both the Main Storage Provider and the assigned Backup Storage Providers will then open their peer-to-peer communication channels to receive the data, and will not accept more data than what the user requested and agreed to pay for.
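A sketch of this size-dependent, per-period check is given below; parameter names and the shape of the check are assumptions for illustration.

```rust
/// Hypothetical system parameters for the payment stream.
struct StreamParams {
    /// Price per unit of data, per billing period.
    price_per_unit_per_period: u128,
    /// Number of billing periods the user must be able to cover up front.
    min_prepaid_periods: u128,
}

/// Cost of storing `file_size` units of data for one billing period.
fn cost_per_period(params: &StreamParams, file_size: u128) -> u128 {
    params.price_per_unit_per_period * file_size
}

/// Check performed when the user announces its intention to store a file:
/// the balance must cover at least the configured number of billing periods.
fn can_open_stream(params: &StreamParams, file_size: u128, user_balance: u128) -> bool {
    user_balance >= cost_per_period(params, file_size) * params.min_prepaid_periods
}

fn main() {
    // Illustrative numbers only: a 100-unit file at 2 per unit per period
    // costs 200 per billing period.
    let params = StreamParams { price_per_unit_per_period: 2, min_prepaid_periods: 1 };
    assert!(can_open_stream(&params, 100, 250));
    assert!(!can_open_stream(&params, 100, 150));
}
```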
Two alternatives are presented as payment methods:
As mentioned in previous sections, there are Main Storage Providers and Backup Storage Providers. In principle, there is only one Main Storage Provider that stores the entire file for a user, but the option could be made available for users to select more than one, and pay for it accordingly. When it comes to Backup Storage Providers, the system selects `N` providers to store chunks of the file. Using Erasure Coding, it is assured that as long as `M` out of the `N` providers return their chunks (where `M < N`), the file can be reconstructed. Both `M` and `N` are system parameters.
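As an illustration of the M-of-N guarantee, the sketch below uses example numbers; the real values of `M` and `N` are system parameters, and the coding scheme itself is out of scope here.

```rust
/// Hypothetical erasure-coding parameters: a file is split into chunks held by
/// `n` Backup Storage Providers, and any `m` of those chunks (m < n) suffice
/// to reconstruct it.
struct ErasureParams {
    n: u32, // Backup Storage Providers assigned to the file
    m: u32, // minimum number of chunks needed for reconstruction
}

impl ErasureParams {
    /// How many providers can fail or withhold their chunk while the file
    /// remains recoverable.
    fn tolerated_failures(&self) -> u32 {
        self.n - self.m
    }

    /// Whether the file can be rebuilt from the chunks actually returned.
    fn can_reconstruct(&self, chunks_returned: u32) -> bool {
        chunks_returned >= self.m
    }
}

fn main() {
    // Example values only: with n = 10 and m = 7, up to 3 providers can
    // disappear and the file can still be reconstructed.
    let params = ErasureParams { n: 10, m: 7 };
    assert_eq!(params.tolerated_failures(), 3);
    assert!(params.can_reconstruct(7));
    assert!(!params.can_reconstruct(6));
}
```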
This section refers to the proofs Storage Providers (Main and Backup alike) have to submit regularly in order to get paid. The criteria for choosing a proofs method consist of:
Three approaches are considered:
Given the trust considerations of Hashes + Consensus, that alternative is almost automatically ruled out, leaving the decision between the first two. A deeper technical analysis on the computational cost and size of the proofs should be carried out to choose the best one.
This design does not directly address how mutability features will be implemented for this system, but takes it into account by providing various means by which they could later be added. Some of these means and considerations include:
The following diagrams illustrate the Protocol Interaction design, with a focus on how the user interacts with it.