Storage Improvements

Introduction

Our current storage system causes significant errors due to a lack of transactionality and improper cache usage. These issues affect the stability and performance of our application. To address them, we propose a complete overhaul of the storage layer into a more robust, transactional, and efficient system.

Detailed Proposal

Use original storage transactions

Instead of saving in memory first and then to a database, use transaction APIs directly.

  • GoLevelDB: Uses a global read/write mutex. Only one transaction can be open at a time, blocking other write and transaction operations until the current transaction is completed. GoLevelDB transactions implementation.

  • BoltDB: Uses a copy-on-write B+tree structure for transactions, which supports fully serializable ACID transactions. All operations are blocked during a read-write transaction to maintain data integrity. BoltDB transactions implementation.

In both cases we are currently not using batches, which would improve performance by reducing the number of transactions. The tx-indexer is a case in point: a 10x speed improvement just from using transactions.

Decide on One Storage Solution and Stick to It

Different storage solutions offer different APIs and functionalities. PebbleDB, based on RocksDB, supports parallel transactions, making it ideal for our use case where most keys are "write once, read many." This reduces conflicts and improves performance when multiple transactions run simultaneously.

This does not mean we will drop the Storage interface; rather, the interface's methods and functionality will be tightly coupled to PebbleDB's capabilities. If another database happens to satisfy the full exposed functionality, that is a coincidence. Implementing an in-memory storage against these interfaces remains trivial in any case; that memory storage can serve specific corner cases and, mainly, testing when needed.

Why PebbleDB?

PebbleDB is specifically designed to handle high-concurrency environments without blocking issues commonly seen in other databases. Here are the key reasons why PebbleDB is a suitable choice:

  • Concurrent Compactions: PebbleDB supports level-based compaction with concurrent compactions, reducing write stalls and ensuring smoother performance under heavy write loads. This feature prevents blocking issues that occur with other databases like GoLevelDB and BoltDB.

  • Indexed Batches: PebbleDB uses batches for atomic operations, which help in maintaining consistency and atomicity without the need for full transactions. This approach avoids the overhead and complexity associated with transaction management, thereby enhancing write throughput.

  • High Write Throughput: PebbleDB's log-structured merge (LSM) tree implementation, combined with concurrent compactions and efficient batch commits, reduces write amplification and sustains higher write throughput than GoLevelDB under heavy load.

Avoid Using Cache Wraps

Cache should only be used for performance improvements, not as a core architectural component. For example, an LRU cache for compiled Gno code can be beneficial. Gas usage should be defined per data operation (Get/Set) independently of cache hits, to avoid gas inconsistencies.

Ensure Every Operation is Atomic

Avoid operations that will "eventually" persist to disk. All operations should be immediate and atomic, eliminating the use of Set/SetSync.

Proper Error Handling in Transactions

To properly implement transactions, we need robust error handling at the storage level. Commit or rollback transactions as needed to prevent leaving garbage in the storage layer.

Use Simpler Serialization Formats

We propose to use simpler and faster serialization formats for our data. While we have been using Amino, which is designed for determinism and ease of use in the Cosmos ecosystem, we are evaluating the use of MessagePack for its compactness and speed.

  • MessagePack: A binary format that is highly efficient in terms of storage size and speed. It offers space savings compared to Amino and is known for its performance in serialization and deserialization.

  • Amino: Currently in use, Amino ensures deterministic encoding and is designed for the Cosmos ecosystem. However, these functionalities and extra complexities are not needed at storage level.

Libraries I liked the most for MessagePack in Go:

Needed functionality

State

State is a Merkle tree; for now, each node is saved on a separate key.

Possible Improvements:

  • Add a manifest with nodes for data prefetch.
  • Store nodes in order, using the original hashes as a secondary index to improve fetch speed.

Stored entities:

  • Realm (TODO: better understanding of all realm.go logic and how it fits on storage batches.)
  • Object
  • Type
  • BlockNode
  • NumMemPackages
  • MemPackage
  • ITER MemPackages
  • Load STDLIB once into the storage

Set

  • Set while generating the tree; the entire tree may not exist yet.

Get

  • Lazy get. Prefetch if needed.

Transaction

  • GET TxResult by hash
  • Standard set/gets with transaction info, potentially using block height and transaction index or directly the transaction hash.

Code

  • Store files in separate keys.
  • Use an LRU cache for compiled code.

Blocks

  • IndexCounter ???
  • State? (Size, Height, AppHash)
  • Validator keys by height
  • ConsensusParams by height
  • LoadBlockPart(types.Part) ?? by height and index
  • LoadBlockMeta(types.BlockMeta) ?? by height
  • LoadBlockCommit(types.Commit) by height
  • LoadSeenCommit(types.Commit) by height
  • ABCI response keys by height
  • SaveBlock(block *types.Block, blockParts *types.PartSet, seenCommit *types.Commit)
  • KEYBASE GetByName, GetByAddress

Other data

Genesis, latest height, and similar data.

API Proposal

Base storage


var ErrDatastoreKeyNotFound = errors.New("key not found")

type Datastore interface {
    io.Closer
    DatastoreWriter // not sure if we should have it here (to avoid misuse).
    DatastoreReader
    
    WBatch() (DatastoreWriteBatch, error)
    RWBatch() (DatastoreReadWriteBatch, error)
}

type DatastoreWriter interface {
    Set(key []byte, value []byte) error
    Delete(key []byte) error
}

type DatastoreReader interface {
    Get(key []byte) ([]byte, error)
    Iterator(lowerBound []byte, upperBound []byte) (DatastoreIterator, error)
    ReverseIterator(lowerBound []byte, upperBound []byte) (DatastoreIterator, error)
}

type DatastoreIterator interface {
    io.Closer
    Next() bool
    Error() error
    Key() []byte
    Value() []byte
}

type DatastoreBatch interface {
    Commit() error
    Rollback() error
}

type DatastoreWriteBatch interface {
    DatastoreBatch
    DatastoreWriter
}

type DatastoreReadWriteBatch interface {
    DatastoreBatch
    DatastoreWriter
    DatastoreReader
}

Intermediate Block Storage

Implement SaveBlock from BlockStore using a write batch:


func (bs *BlockStoreImpl) SaveBlock(block *types.Block, blockParts *types.PartSet, seenCommit *types.Commit) error {
    b, err := bs.db.WBatch()
    if err != nil {
        return err // TODO: wrap
    }
    defer b.Rollback() // no-op once Commit has succeeded
    
    bb, err := msgpack.Marshal(block)
    if err != nil {
        return err // TODO: wrap
    }
    
    if err := b.Set(blockKey, bb); err != nil {
        return err
    }
    
    bbp, err := msgpack.Marshal(blockParts)
    if err != nil {
        return err // TODO: wrap
    }
    
    if err := b.Set(blockPartsKey, bbp); err != nil {
        return err
    }
    
    bsc, err := msgpack.Marshal(seenCommit)
    if err != nil {
        return err // TODO: wrap
    }
    
    if err := b.Set(seenCommitKey, bsc); err != nil {
        return err
    }
    
    
    return b.Commit()
}

Intermediate VM State Storage

// TODO

Select a repo