Our current storage system is a significant source of errors due to its lack of transactionality and improper use of caches, affecting both the stability and the performance of our application. To address this, we propose a complete overhaul of the storage layer into a more robust, transactional, and efficient system.
Instead of saving in memory first and then flushing to a database, use the database's transaction APIs directly (see the BoltDB sketch below).
GoLevelDB: Uses a global read/write mutex, so only one transaction can be open at a time; other writes and transaction operations block until the current transaction completes (see the GoLevelDB transactions implementation).
BoltDB: Uses a copy-on-write B+tree for transactions, supporting fully serializable ACID transactions; all other operations block during a read-write transaction to maintain data integrity (see the BoltDB transactions implementation).
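As an example of going through the transaction API directly, here is a minimal BoltDB (bbolt) sketch; the bucket and key names are illustrative. Both writes become visible atomically, or not at all:

package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("demo.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Update runs the function inside a single read-write transaction:
	// both writes commit together or not at all.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("blocks"))
		if err != nil {
			return err
		}
		if err := b.Put([]byte("height"), []byte("42")); err != nil {
			return err
		}
		return b.Put([]byte("hash"), []byte("0xdeadbeef"))
	})
	if err != nil {
		log.Fatal(err)
	}
}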
In both cases we are currently not using batches, which would improve performance by reducing the number of individual writes. The tx-indexer is a concrete example: a 10x speed improvement just from using transactions.
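To sketch the batching win with GoLevelDB (the path and keys are invented): many Put calls are staged in one leveldb.Batch and applied in a single atomic write, the same kind of change behind the tx-indexer speedup mentioned above.

package main

import (
	"fmt"
	"log"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	db, err := leveldb.OpenFile("demo-level.db", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One batch instead of 1000 individual Put calls: the whole batch
	// is applied in a single atomic write.
	batch := new(leveldb.Batch)
	for i := 0; i < 1000; i++ {
		batch.Put([]byte(fmt.Sprintf("key-%04d", i)), []byte("value"))
	}
	if err := db.Write(batch, nil); err != nil {
		log.Fatal(err)
	}
}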
Different storage solutions offer different APIs and functionality. PebbleDB, inspired by RocksDB, supports parallel transactions, which makes it a good fit for our use case where most keys are "write once, read many": conflicts are rare, and performance holds up when multiple transactions run simultaneously.
This doesn't mean we won't still have a Storage interface, but in practice its methods and semantics will be tightly coupled to PebbleDB's functionality; if another database happens to satisfy the whole interface, that is a coincidence. In any case, implementing an in-memory storage on top of these interfaces is trivial (a sketch follows the interface definitions below); such a memory storage can cover specific corner cases and, mainly, testing.
PebbleDB is specifically designed to handle high-concurrency environments without blocking issues commonly seen in other databases. Here are the key reasons why PebbleDB is a suitable choice:
Concurrent Compactions: PebbleDB supports level-based compaction with concurrent compactions, reducing write stalls and ensuring smoother performance under heavy write loads. This feature prevents blocking issues that occur with other databases like GoLevelDB and BoltDB.
Indexed Batches: PebbleDB uses batches for atomic operations, which maintain consistency and atomicity without full transactions. This avoids the overhead and complexity of transaction management and improves write throughput (see the sketch after this list).
High Write Throughput: PebbleDB's log-structured merge (LSM) tree implementation is engineered for low write amplification and high write throughput, outperforming the older LSM implementation used in GoLevelDB.
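For illustration, a minimal indexed-batch sketch with cockroachdb/pebble (the path and keys are made up). Writes staged in the batch are readable through the same batch before a single atomic commit:

package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo-pebble.db", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// An indexed batch can be read before it is committed, so staged
	// writes are visible to the rest of the same logical operation.
	b := db.NewIndexedBatch()
	defer b.Close()

	if err := b.Set([]byte("height"), []byte("42"), nil); err != nil {
		log.Fatal(err)
	}
	v, closer, err := b.Get([]byte("height")) // reads the staged value
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("staged: %s", v)
	closer.Close()

	// All writes in the batch land atomically.
	if err := b.Commit(pebble.Sync); err != nil {
		log.Fatal(err)
	}
}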
Cache should only be used for performance improvements, not as a core architectural component. For example, an LRU cache for compiled Gno code can be beneficial. Gas usage should be defined per data operation (Get/Set) independently of cache hits, to avoid gas inconsistencies.
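A sketch of that rule, with invented names and gas costs: gas is consumed on every Get before the cache is consulted, so a hit and a miss cost the same and gas stays deterministic.

package main

import "fmt"

// Illustrative flat cost; real costs would be defined per operation.
const gasPerGet = 100

type gasMeter struct{ used int64 }

func (g *gasMeter) Consume(n int64) { g.used += n }

type cachedStore struct {
	meter *gasMeter
	cache map[string][]byte // stand-in for an LRU cache
	db    map[string][]byte // stand-in for the real datastore
}

func (s *cachedStore) Get(key []byte) []byte {
	// Gas is charged before the cache is consulted, so hits and
	// misses are deterministic and cost the same.
	s.meter.Consume(gasPerGet)
	if v, ok := s.cache[string(key)]; ok {
		return v
	}
	v := s.db[string(key)]
	s.cache[string(key)] = v
	return v
}

func main() {
	s := &cachedStore{
		meter: &gasMeter{},
		cache: map[string][]byte{},
		db:    map[string][]byte{"k": []byte("v")},
	}
	s.Get([]byte("k"))                            // miss
	s.Get([]byte("k"))                            // hit: same gas charged
	fmt.Println("gas used:", s.meter.used)        // 200
}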
Avoid operations that only "eventually" persist to disk. All operations should be immediate and atomic, eliminating the Set/SetSync split.
To implement transactions properly, we need robust error handling at the storage level: commit or roll back each transaction as appropriate, so that failures never leave garbage behind in the storage layer.
We propose moving to a simpler and faster serialization format for our data. We have been using Amino, which is designed for determinism and ease of use within the Cosmos ecosystem, but we are evaluating MessagePack for its compactness and speed.
MessagePack: A binary format that is highly efficient in terms of storage size and speed. It offers space savings compared to Amino and is known for its performance in serialization and deserialization.
Amino: Currently in use. Amino ensures deterministic encoding and is designed for the Cosmos ecosystem; however, that functionality and its extra complexity are not needed at the storage level.
Libraries I liked the most for MessagePack in Go:
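Whichever library is chosen, the usage pattern is similar. As an illustration, a round trip with vmihailenco/msgpack/v5 (named here only as one widely used option, not necessarily the final pick; the struct is invented):

package main

import (
	"fmt"
	"log"

	"github.com/vmihailenco/msgpack/v5"
)

type BlockMeta struct {
	Height int64  `msgpack:"height"`
	Hash   []byte `msgpack:"hash"`
}

func main() {
	in := BlockMeta{Height: 42, Hash: []byte{0xde, 0xad}}

	// Marshal to a compact binary payload...
	raw, err := msgpack.Marshal(in)
	if err != nil {
		log.Fatal(err)
	}

	// ...and decode it back.
	var out BlockMeta
	if err := msgpack.Unmarshal(raw, &out); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d bytes, decoded height=%d\n", len(raw), out.Height)
}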
State is a Merkle tree; for now, each node is saved under a separate key.
Genesis, latest height, and similar metadata.
var ErrDatastoreKeyNotFound = errors.New("key not found")

type Datastore interface {
	io.Closer
	DatastoreWriter // not sure if we should have it here (to avoid misuse).
	DatastoreReader
	WBatch() (DatastoreWriteBatch, error)
	RWBatch() (DatastoreReadWriteBatch, error)
}
type DatastoreWriter interface {
	Set(key []byte, value []byte) error
	Delete(key []byte) error
}

type DatastoreReader interface {
	Get(key []byte) ([]byte, error)
	Iterator(lowerBound []byte, upperBound []byte) (DatastoreIterator, error)
	ReverseIterator(lowerBound []byte, upperBound []byte) (DatastoreIterator, error)
}

type DatastoreIterator interface {
	io.Closer
	Next() bool
	Error() error
	Value() []byte
}

type DatastoreBatch interface {
	Commit() error
	Rollback() error
}

type DatastoreWriteBatch interface {
	DatastoreBatch
	DatastoreWriter
}

type DatastoreReadWriteBatch interface {
	DatastoreBatch
	DatastoreWriter
	DatastoreReader
}
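To back the earlier claim that an in-memory storage is trivial on top of these interfaces, here is a minimal map-backed sketch of the reader/writer half (names are illustrative; iterators, batches, and locking are omitted for brevity):

// memDatastore is a map-backed sketch of DatastoreWriter/DatastoreReader;
// it is not safe for concurrent use without a mutex.
type memDatastore struct {
	data map[string][]byte
}

func newMemDatastore() *memDatastore {
	return &memDatastore{data: map[string][]byte{}}
}

func (m *memDatastore) Set(key, value []byte) error {
	m.data[string(key)] = append([]byte(nil), value...) // defensive copy
	return nil
}

func (m *memDatastore) Delete(key []byte) error {
	delete(m.data, string(key))
	return nil
}

func (m *memDatastore) Get(key []byte) ([]byte, error) {
	v, ok := m.data[string(key)]
	if !ok {
		return nil, ErrDatastoreKeyNotFound
	}
	return v, nil
}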
Implement SaveBlock from BlockStore using a write batch:
func (bs *BlockStoreImpl) SaveBlock(block *types.Block, blockParts *types.PartSet, seenCommit *types.Commit) error {
	b, err := bs.db.WBatch()
	if err != nil {
		return fmt.Errorf("open write batch: %w", err)
	}
	// Deferred so any early return discards the batch; assumed to be
	// a no-op after a successful Commit.
	defer b.Rollback()

	bb, err := msgpack.Marshal(block)
	if err != nil {
		return fmt.Errorf("marshal block: %w", err)
	}
	if err := b.Set(blockKey, bb); err != nil {
		return err
	}

	bbp, err := msgpack.Marshal(blockParts)
	if err != nil {
		return fmt.Errorf("marshal block parts: %w", err)
	}
	if err := b.Set(blockPartsKey, bbp); err != nil {
		return err
	}

	bsc, err := msgpack.Marshal(seenCommit)
	if err != nil {
		return fmt.Errorf("marshal seen commit: %w", err)
	}
	if err := b.Set(seenCommitKey, bsc); err != nil {
		return err
	}

	return b.Commit()
}
// TODO