# Technical & Product Debts

## Localstore rewrite (ongoing)

The Bee localstore sometimes loses data, its disk usage grows indefinitely, and the current implementation can neither deduplicate chunks for stamping nor store multiple postage stamps per chunk. These problems are currently being addressed with a complete rewrite of the localstore.

https://github.com/ethersphere/bee/issues/3171

## Libp2p as a dependency

The libp2p library is used to implement the lower-level p2p abstractions. It is created and maintained by Protocol Labs as an open-source project. New versions are released quite aggressively: they break backwards compatibility in minor/patch versions and also impose external dependency upgrades (for example a Go compiler version upgrade). This makes it difficult to keep the library up to date. On the other hand, it deals with unsanitized data from the network, so keeping it current is critical from a security perspective. Another issue is that Protocol Labs is the creator of IPFS, which is currently the biggest competitor in the space, and therefore their priorities may not be aligned with the priorities of the Swarm project. There is currently an ongoing version upgrade that was originally reported in August as a security issue.

https://github.com/ethersphere/bee/pull/3483

## Big file/data support

This is the [original 4th milestone](https://www.ethswarm.org/roadmap) that is supposed to happen after the storage incentives. Currently the reliability of the protocols responsible for uploading and downloading chunks is not very high, so chunks are sometimes missing during retrieval. With smaller datasets this is less of a problem, but with bigger datasets the probability of at least one missing chunk is also bigger. The localstore issues mentioned above also contribute to this problem, so fixing them is a requirement before this issue can be addressed.
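To make the effect of dataset size concrete, here is a minimal sketch. It assumes that every chunk is retrieved independently and fails with the same probability, which is a simplification. The 4096-byte chunk size, the per-chunk failure rate and the function name are illustrative assumptions, and the intermediate chunks of the file's hash tree are ignored.

```python
def prob_any_chunk_missing(
    file_size_bytes: int,
    per_chunk_miss_prob: float,
    chunk_size: int = 4096,  # assumed chunk size in bytes
) -> float:
    """Probability that at least one data chunk of a file cannot be retrieved.

    Assumes independent retrievals with an identical failure probability,
    so P(all chunks found) = (1 - p) ** n.
    """
    n_chunks = -(-file_size_bytes // chunk_size)  # ceiling division
    return 1.0 - (1.0 - per_chunk_miss_prob) ** n_chunks


if __name__ == "__main__":
    p = 1e-4  # hypothetical per-chunk failure rate, not a measured value
    for label, size in (("1 MiB", 1024**2), ("1 GiB", 1024**3)):
        print(label, prob_any_chunk_missing(size, p))
```

Under these assumptions the same per-chunk failure rate that is barely noticeable for a file of a few hundred chunks makes a failed download of a file with hundreds of thousands of chunks almost certain.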
## Single Owner Chunk (SOC) reserve calculation

The current implementation that calculates a value from the chunks stored in the reserve for the Storage Incentives assumes that the nodes in a given neighborhood store the same chunks. With SOCs, however, this is not necessarily the case: an attacker may craft and upload SOCs so that they end up at different nodes of a neighborhood, and thereby force nodes to be frozen or slashed. There does not seem to be an issue opened about this problem. Although there were some discussions about possible workarounds, they have not been documented and reviewed yet, so it is not known whether they correctly fix the issue without breaking anything else.

## Postage stamp related issues

Multiple issues related to postage stamps have been identified. They are either missing functionality in the current implementation or usability issues that make stamps very hard to understand and use correctly. For the sake of completeness, here is an already prioritized list of the issues discovered so far.

#### Utilisation rate and presentation

The postage stamp utilisation calculation employs a probabilistic method that is susceptible to the birthday paradox. This makes it non-intuitive to calculate the capacity and current utilisation of a given stamp. Research was done to provide a formula for the calculation, and it seems better to provide a table with pre-calculated values. This table has not yet been published and therefore is not yet implemented in the tooling (`swarm-cli`, `bee-dashboard`). There may also be other, implementation-related problems that make the utilisation calculation incorrect.

https://github.com/ethersphere/bee/issues/3122

#### Multiple stamps per chunk

For data availability and correctness, the localstore must be able to store multiple stamps per chunk.
Currently this is not the case, but it is going to be addressed as part of the localstore rewrite.

https://github.com/ethersphere/bee/issues/2850

#### Enrich stamper data / deduplication

When uploading, the input data may produce several identical chunks, and in the current implementation the chunks are not deduplicated before stamping. This may also lead to an incorrect utilisation calculation. This is also addressed as part of the localstore rewrite.

https://github.com/ethersphere/bee/issues/3606

https://github.com/ethersphere/bee/issues/3122

#### Portability

The number one request from users who tried to use Swarm was to let them do the stamping on the client side before uploading, so that they can manage and potentially share their own postage stamps between different computers. The problem is that this moves the tracking of a postage stamp's utilisation to the client side, while nodes risk losing connectivity with other nodes if they upload data with overcommissioned postage stamps. Therefore this topic needs more research and also an extra API that covers this use case.

https://github.com/ethersphere/bee/issues/2905

#### Bucket depth / neighborhood depth

The postage stamp definition includes a concept of bucket depth, which by definition is a function of the size of the network and is therefore subject to change (grow) over time. There is also a requirement that the difference between the stamp depth and the bucket depth be greater than a constant value, so that the utilisation can be calculated correctly. Together these mean that the minimum postage stamp size will grow over time, and the changes will need to be reflected in the PostageStamp smart contract. The status quo is that the Swarm team needs to revisit and update the bucket depth value every year, find an automated solution to this problem, or eliminate the need for the bucket depth altogether.
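The interaction between stamp depth and bucket depth can be sketched with a toy model. It assumes that a batch of depth `d` nominally covers `2**d` chunks split evenly over `2**bucket_depth` buckets, that chunks fall into buckets uniformly at random, and that the batch counts as full as soon as a chunk lands in a bucket that is already full. The function names, the `min_gap` constant and all parameter values are hypothetical and are not taken from Bee or the PostageStamp contract.

```python
import random


def batch_slots(depth: int) -> int:
    """Nominal number of chunks a batch of this depth can stamp."""
    return 2 ** depth


def bucket_capacity(depth: int, bucket_depth: int) -> int:
    """Slots per bucket when the batch is split into 2**bucket_depth buckets."""
    if depth < bucket_depth:
        raise ValueError("depth must not be smaller than bucket_depth")
    return 2 ** (depth - bucket_depth)


def min_depth(bucket_depth: int, min_gap: int) -> int:
    """Smallest depth for which depth - bucket_depth is greater than min_gap."""
    return bucket_depth + min_gap + 1


def simulate_effective_utilisation(depth: int, bucket_depth: int, seed: int = 0) -> float:
    """Stamp random chunks until one hits a full bucket.

    Returns the fraction of the nominal slots that were used at that point.
    """
    rng = random.Random(seed)
    capacity = bucket_capacity(depth, bucket_depth)
    fill = [0] * (2 ** bucket_depth)
    stamped = 0
    while True:
        bucket = rng.randrange(len(fill))  # uniform bucket assignment (assumption)
        if fill[bucket] == capacity:
            return stamped / batch_slots(depth)
        fill[bucket] += 1
        stamped += 1
```

For example, `min_depth(16, min_gap=0)` is 17 and `min_depth(17, min_gap=0)` is 18, so in this model every increase of the bucket depth doubles the smallest batch that can be bought. Calling `simulate_effective_utilisation(8, 4)` with different seeds shows how far below the nominal capacity a small batch can stop being usable, which is the kind of gap the pre-calculated table mentioned under "Utilisation rate and presentation" is meant to communicate.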
#### Privacy of postage stamps

Currently the postage stamp transactions are public on the blockchain where the stamps were bought, and therefore each chunk can be traced back to the wallet that paid for it. This leaks information in a way that is antithetical to the principles of Swarm, so it needs to be addressed in some way.

#### Data expiry and availability

The intuitive way to explain and present how data is stored on Swarm with postage stamps uses the formula `capacity * time`, e.g. 16 GB for a month. This may lead to the assumption that once the postage stamp is paid for, a certain capacity is allocated for a certain time. However, the time is not fixed: if the postage stamp unit price increases, the time the data is stored decreases. Most people find this difficult to understand and plan with, and it would be good to find solutions that can provide some guarantees for data availability for a given time period even if the price changes.

## Bandwidth incentives

Most of the focus in 2022 was on figuring out how the storage incentives could work, but there is also a bandwidth incentives mechanism in Swarm. The current implementation of the bandwidth incentives is based on a design (time-based settlements) that was accepted as a last-minute idea before Swarm mainnet went live. As a result it is not very well documented, the tradeoffs chosen at that time are not known, and the implementation details are known by only one person (Abel). Because of this, it is currently not possible to tell users how much they could expect to earn or how much it will cost to upload data, even after the storage incentives are in place and data is flowing on the network. Without more work on this topic, the current value proposition of the bandwidth incentives is quite weak.
## End-to-end encryption by default

One of the value propositions of Swarm is that node operators are not responsible for the data they store, because the data is split into chunks and everything is encrypted ("plausible deniability"), and that they cannot see the uploaded content ("privacy"). However, currently the data is not encrypted by default, so node operators can access the data and can be held responsible for storing illegal data (at least chunk-sized pieces of it). It is possible to turn on encryption by default and still keep a way to store publicly available data in a deterministic manner (so that two uploaders uploading the same dataset would get the same hash). A proposal for this is written here:

https://github.com/ethersphere/bee/issues/2958

## Bee API

The Bee API was designed with an ambitious scope for its features, but it appears that some practical considerations were not fully addressed in the implementation. As a result, certain features are currently incomplete or buggy, and some may not meet the specific needs of users (such as feeds, manifests, pinning, PSS, and so on).

#### API layering

One idea to improve the API situation is to reduce the scope of the core Bee and expose only a small set of functionality ("lean Bee"): basically keeping only chunks, postage stamps, chequebooks and administrative endpoints in layer 1 and exposing everything else in higher layers. This has the potential to keep the functionality at layer 1 well tested, and the maintainer team could focus on reliability and performance issues. It could also make it easier to create alternative implementations, because the scope of such a project would be much smaller. On the other hand, this opens questions about the standardisation process and potential fragmentation of layer 2 solutions.

#### Debug / Restricted API

There are certain endpoints that make sense to keep restricted, so that accessing them requires a different privilege level than the regular endpoints.
Typically such endpoints are the ones that perform operations on the blockchain: because money is involved, calling these endpoints incorrectly or maliciously can cause damage to the node operator. A workaround in the past was to put such operations on the so-called Debug API, which was originally reserved for functionality that lets developers test potentially destructive operations that are not available during normal operation. However, this led to a situation where most node operators effectively needed to turn on the Debug API for everyday usage, which makes the node harder to use because there are now two ports (1633 and 1635) that need to be accessible. It also enabled debug functionality in production, which is not desirable. An improvement to this was the introduction of the Restricted API, which enables the restricted operations on the normal API if a certain header with an encrypted key is passed in with the request. The plan was to deprecate the restricted endpoints on the Debug API and move to the Restricted API as the default. However, one can argue that the actual implementation of the Restricted API got a bit over-engineered and/or under-documented, because we have not seen anybody using this feature in the wild.

#### Big file HTTP operations

A problem when downloading big files over the HTTP API is that it is not feasible to keep the whole file in memory. Instead, the implementation sends back an HTTP 200 OK message, then uses a lookahead buffer: it fetches just enough chunks to fill the buffer and then streams it to the client. However, if a chunk is missing later on, then due to the nature of HTTP there is no way to indicate the problem other than to bail out quickly. When a chunk is missing, this leads to the perception that Swarm silently returns corrupt data.
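The failure mode described above can be reproduced with a small sketch. This is not Bee's implementation: `stream_file`, `download`, `MissingChunk` and the `fetch` callback are hypothetical names, and the in-memory dictionary stands in for the network. The point it illustrates is that the status line is already sent before a missing chunk can be detected.

```python
from typing import Callable, Iterator, List, Optional, Tuple


class MissingChunk(Exception):
    """Raised when a chunk cannot be retrieved (hypothetical error type)."""


def stream_file(
    chunk_ids: List[str],
    fetch: Callable[[str], Optional[bytes]],
    lookahead: int = 2,
) -> Iterator[bytes]:
    """Yield the status line first, then the body one lookahead window at a time."""
    # The status is committed before any chunk has been fetched.
    yield b"HTTP/1.1 200 OK\r\n"
    for start in range(0, len(chunk_ids), lookahead):
        buffer = []
        for chunk_id in chunk_ids[start:start + lookahead]:
            data = fetch(chunk_id)
            if data is None:
                # Too late to change the status: all we can do is abort.
                raise MissingChunk(chunk_id)
            buffer.append(data)
        yield from buffer


def download(
    chunk_ids: List[str],
    fetch: Callable[[str], Optional[bytes]],
    lookahead: int = 2,
) -> Tuple[List[bytes], bool]:
    """Client view: everything received, plus whether the stream completed."""
    received: List[bytes] = []
    try:
        for part in stream_file(chunk_ids, fetch, lookahead):
            received.append(part)
    except MissingChunk:
        return received, False
    return received, True
```

With a store that lacks chunk `"c"`, calling `download(["a", "b", "c", "d"], store.get)` returns the 200 status line and the first two chunks before the stream is cut off. A client that only checks the status code would treat the truncated body as a successful download.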
https://github.com/ethersphere/bee/issues/3714

## JS projects

At the beginning of 2021 the dev team collectively decided that the tooling and user-facing components would be built by the JavaScript team, which at that time consisted of 4 developers. The team started, released and maintained numerous projects:

- [bee-js](https://github.com/ethersphere/bee-js): enables access to the Bee API from JavaScript and is the basis for all the other tooling projects
- [swarm-cli](https://github.com/ethersphere/swarm-cli): CLI tool to work with your Bee node
- [bee-dashboard](https://github.com/ethersphere/bee-dashboard): a web UI written in React to control your Bee node
- [Swarm Desktop](https://github.com/ethersphere/swarm-desktop): a desktop app that bundles Bee as a light client and uses the `bee-dashboard` as a UI
- [Swarm Gateway](https://gateway.ethswarm.org/): a website to share and access content on the Swarm network without installing anything. Since it allows accessing unfiltered content under the `ethswarm.org` domain, it is a future risk for the Swarm Foundation/Association, and it would be better to move it to a different domain or have it run by someone else from the community.
- [gateway-proxy](https://github.com/ethersphere/gateway-proxy): a backend service behind the gateway website that simplifies and automates postage stamp management. It has some rudimentary data stewardship functionality.
- [bee-factory](https://github.com/ethersphere/bee-factory): test environment for running multiple Bees and a blockchain node locally

There are other related supporting projects that are omitted here. However, since then the team headcount has been reduced to one person (Aron), so the bus factor is low, and without a clear roadmap for these projects the motivation to maintain them may be low.