# Software Design Principles
## Objective
To define long-term principles for creating maintainable Go projects at Prysmatic Labs with excellent developer experience, security, and testability.
## Background
[Prysm](https://github.com/prysmaticlabs/prysm) has become a fairly large Go project with a diverse set of contributors and complex features. As our project has become more critical and is now running in production, it's integral that we, as software engineers, manage complexity well. Bringing popular software design principles into our organization and integrating them religiously can have a positive, compounding effect on the success of Prysm and our productivity.
We want to give credit and huge thanks to Terence Tsao and Nishant Das. Working under extremely complex constraints, they wrote the code that now powers Prysm in production, and we wouldn't be here without their work.
## Design Patterns
### SOLID
[SOLID](https://www.digitalocean.com/community/conceptual_articles/s-o-l-i-d-the-first-five-principles-of-object-oriented-design) is a five-letter acronym for a set of popular software engineering principles that help developers write clean, maintainable, and ideally more secure code. It stands for:
- Single responsibility principle
- Open / closed principle
- Liskov substitution principle
- Interface segregation principle
- Dependency inversion principle
In general, the teachings of SOLID allow developers to create extremely readable code that is easy to test and refactor as needed, leading to major productivity boosts when working on complex codebases. When it comes to managing complexity, using the right abstractions and principles makes life easier for an entire team. We'll go over every letter of SOLID in this document and outline how it can be applied to Prysm.
### Shift-Left Thinking
Another useful pattern is [Shift-Left Thinking](https://en.wikipedia.org/wiki/Shift-left_testing). The earlier in the software development process a failure can be detected, the easier it is to correct. Following that assertion to its logical conclusion, prefer type system / compiler checks > unit tests > integration tests > end-to-end tests > production failures. As such, emphasis should be placed on leveraging Go to its full capacity and using the power of the compiler to prevent bugs from making it into PRs at all.
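As a small illustration of pushing checks leftward into the compiler (a sketch only; Prysm applies a similar idea with its `types.Slot` and `types.Epoch` types, and `slotsPerEpoch` below is a hypothetical constant), distinct types for slots and epochs turn a whole class of mix-ups into compile errors:
```go
package main

import "fmt"

// Slot and Epoch share an underlying type, but Go treats them as distinct,
// so mixing them up is a compile error rather than a runtime bug.
type Slot uint64
type Epoch uint64

// slotsPerEpoch is a hypothetical constant for illustration.
const slotsPerEpoch = 32

func processSlot(s Slot) {
	fmt.Println("processing slot", s)
}

// startSlot converts an epoch into its first slot explicitly.
func startSlot(e Epoch) Slot {
	return Slot(uint64(e) * slotsPerEpoch)
}

func main() {
	e := Epoch(5)
	// processSlot(e) // Does not compile: cannot use e (Epoch) as Slot.
	processSlot(startSlot(e)) // The explicit conversion makes intent visible.
}
```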
## Single responsibility principle
Let's take a look at one of the most important functions in Prysm, namely `validateBeaconBlockPubSub` in `beacon-chain/sync/validate_beacon_block.go`. The function spans 150 lines of code and performs, among other things, the following actions:
- Acquisitions of mutexes
- Sending over event feeds
- Validation of data integrity
- Retrieval of data from the database
- Acquiring another mutex to add data to a cache map
- Slot time verification
- More database access
- ...
```go
// validateBeaconBlockPubSub checks that the incoming block has a valid BLS signature.
// Blocks that have already been seen are ignored. If the BLS signature is any valid signature,
// this method rebroadcasts the message.
func (s *Service) validateBeaconBlockPubSub(ctx context.Context, pid peer.ID, msg *pubsub.Message) pubsub.ValidationResult {
	receivedTime := timeutils.Now()
	// Validation runs on publish (not just subscriptions), so we should approve any message from
	// ourselves.
	if pid == s.cfg.P2P.PeerID() {
		return pubsub.ValidationAccept
	}
	// We should not attempt to process blocks until fully synced, but propagation is OK.
	if s.cfg.InitialSync.Syncing() {
		return pubsub.ValidationIgnore
	}
	ctx, span := trace.StartSpan(ctx, "sync.validateBeaconBlockPubSub")
	defer span.End()
	m, err := s.decodePubsubMessage(msg)
	if err != nil {
		log.WithError(err).Debug("Could not decode message")
		traceutil.AnnotateError(span, err)
		return pubsub.ValidationReject
	}
	s.validateBlockLock.Lock()
	defer s.validateBlockLock.Unlock()
	...
```
It becomes really difficult to achieve full test coverage of this function, and it also becomes hard to modify in the future if needed. A new reader of our code will struggle to reason about its complexity. What if we need to refactor one part of it? Refactoring safely while wrangling mutexes, caches, goroutine spawns, etc. becomes a major task with potentially dangerous consequences for users.
The single responsibility principle favors creating smaller functions to accomplish the smaller tasks we care about. For example, in the function above, we could have a function that checks whether we're ready to validate the block, such as simply checking if we are receiving a message from ourselves and not currently syncing. Next, we could have another function that encompasses a few smaller tasks, and so on, as in the sketch below. At the end, our function will be a lot more readable and easier to unit test in smaller chunks.
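As a rough sketch of that decomposition (the helper names below are hypothetical, not Prysm's actual API):
```go
// A hypothetical decomposition of validateBeaconBlockPubSub into
// single-purpose helpers, each easy to unit test on its own.
func (s *Service) validateBeaconBlockPubSub(ctx context.Context, pid peer.ID, msg *pubsub.Message) pubsub.ValidationResult {
	if result, done := s.preValidationResult(pid); done {
		return result
	}
	blk, err := s.decodeBlock(ctx, msg)
	if err != nil {
		return pubsub.ValidationReject
	}
	return s.validateBlock(ctx, blk)
}

// preValidationResult short-circuits validation: accept our own messages
// and ignore everything while we are still syncing.
func (s *Service) preValidationResult(pid peer.ID) (pubsub.ValidationResult, bool) {
	if pid == s.cfg.P2P.PeerID() {
		return pubsub.ValidationAccept, true
	}
	if s.cfg.InitialSync.Syncing() {
		return pubsub.ValidationIgnore, true
	}
	return pubsub.ValidationIgnore, false
}
```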
### Risks of SRP done incorrectly
#### Purely aesthetic SRP
Note: it can be tempting to fulfill the single responsibility principle (SRP) by just taking chunks of a function and putting them into one-off helpers with poor names just to make things look cleaner. Instead, splitting code up for SRP should be intentional, and care should be taken to ensure functions are named after their purpose.
```go
func computeSomethingExpensive() {
	step1()
	step2()
	step3()
}
```
These smaller helpers exist only to split the function up and make it look pretty, but they are neither reusable nor easy to understand without the context of the bigger function surrounding them. We should avoid SRP done for aesthetic purposes only.
#### Bugs with improper SRP
The main thing to consider with SRP is whether or not we are encapsulating code properly. Caching issues are a good example of where violating the SRP can introduce bugs. When every consumer of a cache is required to remember to mark cache keys dirty appropriately, the risk of mistakes shoots up compared to making that an internal detail of a data access type that mediates between talking to the cache and talking to the database. Reducing or eliminating things that callers need to remember is a good way to prevent bugs using the principle of SRP.
### Recommendations
#### Reusable validation pipelines with functional options
There are common data structures in Prysm that are validated through our codebase in different ways. Namely, blocks, attestations, signed messages, etc. Many times, defining validation pipelines ends up being repetitive, verbose, and violates principles of DRY (don't repeat yourself) code. An alternative for better validation of data structures is to have reusable, extensible validation pipelines that are easy to include as desired.
For example, in a web application, you might write form validators that are extensible and easy to chain, such as:
```go
func validateUser(user *User) {
	valid, err := withValidationPipeline(
		user,
		passwordMinLength(8),
		validEmail,
		uniqueEmail,
	)
	...
}
```
In Prysm, there are many functions that perform some validation on blocks or other data structures. Let's take a look at how we could create reusable validation pipelines for blocks:
```go
func (s *Service) validateBlockPubSub(req *rpc.Request) {
	...
	blk := readFromRequest(req, &ethpb.SignedBeaconBlock{})
	valid, err := withValidationPipeline(
		blk,
		validStructure,
		validSignature(proposerPubKey),
		isCanonical,
		notYetSeen,
		fromValidPeer(s.p2p.PeerSet()),
	)
	...
}
```
Other functions in our codebase might also want to validate a block from input arguments, but use fewer of these validators. For example, we may only want to check for structural integrity and a valid signature.
```go
func (s *Service) checkSlashableBlock(blk *ethpb.SignedBeaconBlock) {
	valid, err := withValidationPipeline(
		blk,
		validStructure,
		validSignature(proposerPubKey),
	)
	...
}
```
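For reference, here is one possible shape for such a pipeline. Validators are plain functions over a block, and parameterized validators like `validSignature` are built from closures in the style of functional options. All names are illustrative rather than existing Prysm APIs:
```go
// blockValidator is a single validation step over a signed block.
type blockValidator func(blk *ethpb.SignedBeaconBlock) error

// withValidationPipeline runs each validator in order and stops at the
// first failure.
func withValidationPipeline(blk *ethpb.SignedBeaconBlock, validators ...blockValidator) (bool, error) {
	for _, v := range validators {
		if err := v(blk); err != nil {
			return false, err
		}
	}
	return true, nil
}

// validSignature is a parameterized validator built from a closure.
func validSignature(pubKey [48]byte) blockValidator {
	return func(blk *ethpb.SignedBeaconBlock) error {
		// Verify blk.Signature against pubKey here.
		return nil
	}
}
```
Because validators are ordinary values, call sites can mix and match them freely, and each validator can be unit tested in isolation.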
## Open / closed principle
This principle refers to writing code that is simple to maintain. It states that types such as Go interfaces or classes in object-oriented programming should be closed for modification but open for extension. In Go, this means that if we want to modify or add new functionality to a struct, we shouldn't have to worry about breaking or changing all other code that depends on it. Instead, we should favor **interfaces** as a way to extend functionality.
### Recommendations
#### Leverage interface embedding in structs
Interface embedding in structs is a powerful way to extend an interface or "override a method". For example, let's say we want to implement sorting in reverse. We already have the awesome `sort.Interface` from the standard library, and we know that sorting in reverse simply requires a different comparison function to check list elements.
A really nice way to write clean code for this is to leverage the standard library as follows:
```go
type reverse struct {
	sort.Interface
}

// Less delegates to the embedded sort.Interface with the arguments flipped.
func (r reverse) Less(i, j int) bool {
	return r.Interface.Less(j, i)
}

func Reverse(data sort.Interface) sort.Interface {
	return &reverse{data}
}

func main() {
	lst := []int{5, 2, 9}
	sort.Sort(Reverse(sort.IntSlice(lst)))
	fmt.Println(lst) // Prints [9 5 2].
}
```
Embedding `sort.Interface` just means that `reverse` holds a value of some type that satisfies `sort.Interface`. By implementing `Less`, `reverse` itself *satisfies* `sort.Interface`. It's not really overriding anything; it's just allowing us to wrap and delegate to another value that implements `sort.Interface`. [`TeeReader`](https://cs.opensource.google/go/go/+/refs/tags/go1.17:src/io/io.go;l=550) is a standard library example of this kind of pattern.
## Liskov substitution principle
This principle states that two types are interchangeable as long as a caller is unable to tell the difference between them. In Go, we use interfaces to accomplish this. If we can abstract common behavior into an interface, we can use it extensively. Although we do a good job at using interfaces in general, there is room to improve and use them to their full extent.
### Recommendations
#### Smaller interfaces are preferred for composition
Rather than having a major interface
```go
type Database interface {
	... // 10-20 methods.
}
```
We could split this interface into smaller chunks and compose them into the bigger one, as sketched below. We use this pattern effectively in Prysm, but still have a few large interfaces in the codebase that could be refactored.
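For example, the big interface could be assembled from focused pieces (the interface and method names below are illustrative, not Prysm's actual database API):
```go
// Hypothetical, focused interfaces that callers can depend on individually.
type BlockReader interface {
	Block(ctx context.Context, root [32]byte) (*ethpb.SignedBeaconBlock, error)
}

type BlockWriter interface {
	SaveBlock(ctx context.Context, blk *ethpb.SignedBeaconBlock) error
}

// Database composes the smaller interfaces. Code that only reads blocks can
// accept a BlockReader alone and remains trivially mockable in tests.
type Database interface {
	BlockReader
	BlockWriter
	// ...
}
```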
## Interface segregation principle
In general, callers should not be forced to depend on arguments they do not use. This helps keep code clean and easy to test. In Prysm, we follow this pattern reasonably well, but there is still room for improvement.
#### Depend on only what you need
Bad:
```go
func validBlock(blockchainSrv *blockchain.Service, blk *ethpb.SignedBeaconBlock) error {
	canonical := blockchainSrv.IsCanonical(blk)
	if !canonical {
		return errors.New("block is not canonical")
	}
	...
}
```
Preferred:
```go
type CanonicalChecker interface {
	IsCanonical(*ethpb.SignedBeaconBlock) bool
}

func validBlock(checker CanonicalChecker, blk *ethpb.SignedBeaconBlock) error {
	canonical := checker.IsCanonical(blk)
	if !canonical {
		return errors.New("block is not canonical")
	}
	...
}
```
## Dependency inversion
> High-level modules should not depend on low-level modules. Both should depend on abstractions. Abstractions should not depend on details. Details should depend on abstractions.
- Uncle Bob
In practice for Prysm: high-level entry points such as `main` should deal with the specifics (flags, user input, wiring), while low-level packages should deal in abstractions.
#### CLI flags should not need to be accessed beyond the main package
In Prysm, our `main.go` file simply serves to define execution commands and list the flags used. Control then flows into `beacon-chain/node/node.go` or `validator/node/node.go`, which perform a wide array of cli flag parsing and checking. Moreover, we end up propagating cli flag contexts down to low-level packages such as the database. It is common to see code in different parts of Prysm that accesses a `cli.Context` to fetch flag values, such as `dataDir := cliCtx.String(cmd.DataDirFlag.Name)`.
Example deep inside the validator's account package which should not need to deal with CLI flags at such a low level of abstraction:
```go
// Input the directory where they wish to backup their accounts.
backupDir, err := prompt.InputDirectory(cliCtx, backupPromptText, flags.BackupDirFlag)
if err != nil {
	return errors.Wrap(err, "could not parse keys directory")
}
// Allow the user to interactively select the accounts to backup or optionally
// provide them via cli flags as a string of comma-separated, hex strings.
filteredPubKeys, err := filterPublicKeysFromUserInput(
	cliCtx,
	flags.BackupPublicKeysFlag,
	pubKeys,
	prompt.SelectAccountsBackupPromptText,
)
if err != nil {
	return errors.Wrap(err, "could not filter public keys for backup")
}
```
Instead, we could leverage our `main` packages a lot more for initialization of configuration values from flags, reading user input, and performing otherwise "specific", implementation-dependent operations. The recommendation is to avoid leaking `cli.Context` outside of any `main` package in Prysm. Ideally, low-level packages should deal in abstractions and interfaces rather than the specifics of our application context.
```go
// in cmd/beacon-chain/main.go
func main() {
	options := make([]PrysmOption, 0)
	if cliCtx.Bool(enableTracingFlag.Name) {
		options = append(options, WithTracing())
	}
	if cliCtx.Bool(pprofFlag.Name) {
		options = append(options, WithPprof())
	}
	...
}

// in beacon-chain/node/node.go
func New(opts ...PrysmOption) { ... }
```
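For completeness, one conventional way to define the `PrysmOption` type from this sketch is the functional options pattern (again illustrative, not the actual Prysm API):
```go
// Config collects everything node construction needs; flags never leave main.
type Config struct {
	TracingEnabled bool
	PprofEnabled   bool
}

// PrysmOption mutates the config during construction.
type PrysmOption func(cfg *Config)

func WithTracing() PrysmOption {
	return func(cfg *Config) { cfg.TracingEnabled = true }
}

func WithPprof() PrysmOption {
	return func(cfg *Config) { cfg.PprofEnabled = true }
}

// New applies each option in order to build the node's configuration.
func New(opts ...PrysmOption) *Config {
	cfg := &Config{}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}
```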
#### Avoid tight coupling
Golang does not allow circular package imports, and for good reason. More often than not, a circular dependency is a sign that code should perhaps live side by side. These circular graphs typically arise when we try to segregate packages based on what feels nice rather than what is functional.
For example, let's say we have a `client/` and a `server/` package in a web application, and the client requires some types and imports the server package. However, the server imports the client package as well because it needs to know some information about the kind of client options being used for initialization. This is a code smell; instead, we might find these should be consolidated under a single package. `net/http` has both `client.go` and `server.go` files for this same reason.
Preferred:
```
http/
  client.go
  server.go
```
Current:
```
server/
  server.go
client/
  client.go
types/
  types.go
```
We often create a third package, such as `types/`, simply to break import cycles. However, this does not address the reason we have a cycle in the first place.
A specific example in Prysm is the tight coupling that exists between packages on initialization. For example, `initial-sync`, `sync`, and `blockchain` depend on each other for various things, so we patch over this dependency by having global event feeds in the `node/` package, which makes things significantly more complex.
The initialization process of Prysm often relies on things being done across different packages in a specific order, which is communicated via the shared event feed. This creates tight coupling (whereas SRP pushes us towards loosely coupled components). Likewise, the `initial-sync`, `sync`, and `blockchain` packages all share responsibility for key stages of the blockchain processing algorithm.
#### Avoid having multiple packages modify properties of a struct owned by another package
Kasey points out that one form of tight coupling we have is where multiple packages may attempt to mutate the same struct value. This often leads to fairly complex locking logic. Another reference point and rule of thumb for limiting what packages know about each other is the [Law of Demeter](https://en.wikipedia.org/wiki/Law_of_Demeter). Specifically: "Only talk to your *immediate* friends."
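A tiny illustration of that rule with hypothetical types:
```go
type Wallet struct{ balance uint64 }

func (w *Wallet) Balance() uint64 { return w.balance }

type Owner struct{ wallet *Wallet }

func (o *Owner) Wallet() *Wallet { return o.wallet }

type Account struct{ owner *Owner }

// Violates the Law of Demeter: callers reach through two intermediaries,
// coupling themselves to Owner and Wallet internals:
//   balance := account.Owner().Wallet().Balance()
func (a *Account) Owner() *Owner { return a.owner }

// Preferred: the immediate friend exposes only what callers need.
func (a *Account) OwnerBalance() uint64 { return a.owner.wallet.Balance() }
```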
## Miscellaneous Design Patterns
### Encapsulation
#### Abstract caches away
Users of caches should not need to worry about acquiring locks nor transforming the cache keys for the objects involved.
Current:
```go
func (s *service) doSomething() {
	// Some other operations...
	...
	s.cacheLock.Lock()
	defer s.cacheLock.Unlock()
	hashed, err := block.HashTreeRoot()
	if err != nil {
		log.Error(err)
		return
	}
	s.blocksCache[hashed] = block
	...
}
```
Preferred:
```go
func (s *service) doSomething() {
	// Some other operations...
	...
	s.seenBlocksCache.Put(block)
}
```
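A minimal sketch of what such a cache type could look like internally, with the lock and the key derivation hidden from callers (names illustrative):
```go
// SeenBlocksCache hides both the lock and the key derivation from callers.
type SeenBlocksCache struct {
	lock   sync.RWMutex
	blocks map[[32]byte]*ethpb.SignedBeaconBlock
}

// Put hashes the block and stores it under its root in one safe call.
func (c *SeenBlocksCache) Put(block *ethpb.SignedBeaconBlock) error {
	hashed, err := block.HashTreeRoot()
	if err != nil {
		return err
	}
	c.lock.Lock()
	defer c.lock.Unlock()
	c.blocks[hashed] = block
	return nil
}
```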
#### Encapsulate nil checks
Exposing many nil checks that could be done in one place can lead to bugs if someone forgets to perform them. By encapsulating nil checks, we get the ability to make safe calls and simplify our code a lot. Thanks to Kasey for the analogy:
> Imagine if HTTP 404 did not exist and everyone always received 200s and they had to then do nil checks on the body to check if the data is not found
Current:
```go
st, err := beaconDB.StateByRoot(ctx, root)
if err != nil {
	return err
}
if st == nil || st.IsNil() {
	...
}
```
Preferred:
```go
// beacon-chain/db/kv/state.go
var ErrNotFound = errors.New("not found")

func (s *Store) StateByRoot(ctx context.Context, root [32]byte) (*State, error) {
	st := s.db.view(root)
	if st == nil || st.IsNil() {
		return nil, ErrNotFound
	}
	return st, nil
}

// Elsewhere.
st, err := beaconDB.StateByRoot(ctx, root)
switch {
case errors.Is(err, kv.ErrNotFound):
	// Handle...
case err != nil:
	return err
}
```
#### Minimize feature flag conditionals
We should place feature flags deep in the implementation they toggle rather than requiring the rest of the codebase to deal with the conditionals.
Current:
```go
if featureconfig.Get().EnableCommitteeCache {
	committee, err := committeeCache.Get(publicKey)
	if err == cache.ErrNotFound {
		// Get from db...
		...
	} else if err != nil {
		log.Error(err)
		return
	}
} else {
	// Do something else...
}
```
Preferred:
```go
// in cache/committee.go
func (c *CommitteeCache) Get(publicKey [48]byte) (*Committee, error) {
	if !featureconfig.Get().EnableCommitteeCache {
		return nil, cache.ErrNotFound
	}
	...
}

// elsewhere
committee, err := committeeCache.Get(publicKey)
if err == cache.ErrNotFound {
	// Get from db...
	...
} else if err != nil {
	log.Error(err)
	return
}
```
#### Leverage package-level errors
Current:
```go
state := syncCommitteeStateCache.Get(slot)
if state == nil || state.IsNil() { ... }
```
Preferred:
```go
// beacon-chain/cache
func (c *SyncCommitteeHeadStateCache) Get(slot types.Slot) (state.BeaconState, error) {
	c.lock.RLock()
	defer c.lock.RUnlock()
	val, exists := c.cache.Get(slot)
	if !exists || val == nil {
		return nil, ErrNotFound
	}
	return val.(*stateAltair.BeaconState), nil
}

// Elsewhere.
state, err := syncCommitteeStateCache.Get(slot)
if err == cache.ErrNotFound { ... }
```
### Concurrency
#### Use channels more for communication instead of mutexes
Current:
```go
func (c *AttestationCache) Get(ctx context.Context, req *ethpb.AttestationDataRequest) (*ethpb.AttestationData, error) {
	if req == nil {
		return nil, errors.New("nil attestation data request")
	}
	s, e := reqToKey(req)
	if e != nil {
		return nil, e
	}
	delay := minDelay
	// Another identical request may be in progress already. Let's wait until
	// any in progress request resolves or our timeout is exceeded.
	for {
		if ctx.Err() != nil {
			return nil, ctx.Err()
		}
		c.lock.RLock()
		if !c.inProgress[s] {
			c.lock.RUnlock()
			break
		}
		c.lock.RUnlock()
		// This increasing backoff is to decrease the CPU cycles while waiting
		// for the in progress boolean to flip to false.
		time.Sleep(time.Duration(delay) * time.Nanosecond)
		delay *= delayFactor
		delay = math.Min(delay, maxDelay)
	}
	...
```
Preferred: [in-progress cache using semaphore](https://gist.github.com/rauljordan/e5b7bcaefbff778d2eab379e0923ddab)
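The core idea of the linked approach, sketched minimally with hypothetical names: waiters block on a per-key channel that is closed when the in-progress request resolves, instead of polling a mutex-guarded boolean with a growing backoff.
```go
// inflight deduplicates concurrent requests for the same key. Waiters block
// on a channel instead of polling a boolean under a lock.
type inflight struct {
	mu      sync.Mutex
	pending map[string]chan struct{}
}

// acquire returns owner=true if the caller should perform the work itself
// (and must call release afterwards); otherwise it blocks until the
// in-progress request resolves or the context is cancelled.
func (f *inflight) acquire(ctx context.Context, key string) (owner bool, err error) {
	f.mu.Lock()
	ch, busy := f.pending[key]
	if !busy {
		f.pending[key] = make(chan struct{})
		f.mu.Unlock()
		return true, nil
	}
	f.mu.Unlock()
	select {
	case <-ch: // The in-progress request resolved.
		return false, nil
	case <-ctx.Done():
		return false, ctx.Err()
	}
}

// release signals all waiters and clears the in-progress marker.
func (f *inflight) release(key string) {
	f.mu.Lock()
	close(f.pending[key])
	delete(f.pending, key)
	f.mu.Unlock()
}
```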
#### Avoid using default for main business logic
A `select` statement with a `default` branch never blocks, so patterns like the one below poll for cancellation once and then bury the main logic in a fallback branch. Note also that cancellation is signaled on the `ctx.Done()` channel; `ctx.Err()` returns an error and cannot be received from:
```go
select {
case <-ctx.Done():
	return ctx.Err()
default:
	// Main logic here.
	...
}
```
## References
- [Dave Cheney: SOLID principles in Go](https://www.youtube.com/watch?v=zzAdEt3xZ1M)
## Acknowledgements
First of all, thanks to everyone on the team that contributed towards the software used by the majority of Ethereum consensus. You guys are rockstars that have done amazing work under the time constraints of shipping phase 0, and we constantly learn from your output. Thanks to Kasey and Preston for their insights and recommendations: Kasey for the comments that started this discussion, and Preston for teaching the ways of [Uncle Bob](https://en.wikipedia.org/wiki/Robert_C._Martin).