WW Distributed Crawling Architecture

Overview

Two-step project:

Implement a distributed, replicated log data-structure on top of Wetware.
Implement a web-crawling framework on top of the replicated log.

Rationale. The goal of Wetware is to facilitate the development of distributed systems, so it makes sense to start by showing that we can easily reproduce a widely used data-structure out of ww primitives. From there, we show that it's straightforward to build a non-trivial application in the usual way.

System Architecture

The system comprises a set of worker peers, each of which consumes URLs from a replicated queue, fetches the resource, and appends the resource to a separate, replicated "results" queue. From this description, we identify three core components:

Consistent Replicated Log (Raft Protocol)
Distributed Queues for URLs and Results
Item Processor

Consistent Replicated Log with Raft

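The Raft protocol provides a replicated log abstraction with PC/EC semantics (in PACELC terms: it favors consistency over availability during a partition, and consistency over latency otherwise). We can use this as a foundation for modelling our distributed queues.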

As a first step, we should demonstrate the implementation of the pure Raft protocol on top of Wetware, in a separate package/repository. In effect, this means all protocol state and state-transition logic should be encapsulated by a WASM process, and RPC messages should be transported over pubsub or channels, whichever is more appropriate.

Etcd's Raft implementation is well-suited to this task because it has minimal dependencies, and message transport is entirely factored out of its design; it provides no facilities for transporting messages, leaving the design space unconstrained for the application.
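Because transport is left to the application, one possible shape for it is a thin layer that routes each outbound message to a per-node pubsub topic. The sketch below is an assumption-laden illustration: Message, PubSub, memBus, inboxTopic and Send are all hypothetical names, the bus is an in-memory stand-in, and Wetware's actual pubsub API may look quite different. With etcd's library, Payload would carry a serialized Raft message.

```go
package main

import "fmt"

// Message is a hypothetical envelope for a serialized Raft RPC.
type Message struct {
	From, To uint64
	Payload  []byte
}

// PubSub is a hypothetical minimal pubsub interface; Wetware's actual
// API may differ.
type PubSub interface {
	Publish(topic string, m Message)
	Subscribe(topic string, handler func(Message))
}

// memBus is an in-memory PubSub that delivers synchronously. It exists
// only to make the sketch runnable.
type memBus struct{ subs map[string][]func(Message) }

func newMemBus() *memBus {
	return &memBus{subs: map[string][]func(Message){}}
}

func (b *memBus) Publish(topic string, m Message) {
	for _, h := range b.subs[topic] {
		h(m)
	}
}

func (b *memBus) Subscribe(topic string, h func(Message)) {
	b.subs[topic] = append(b.subs[topic], h)
}

// inboxTopic names the per-node topic on which a node receives messages.
// The "raft/<id>" naming scheme is an assumption of this sketch.
func inboxTopic(id uint64) string { return fmt.Sprintf("raft/%d", id) }

// Send routes each outbound message to its recipient's inbox topic.
func Send(ps PubSub, msgs []Message) {
	for _, m := range msgs {
		ps.Publish(inboxTopic(m.To), m)
	}
}

func main() {
	bus := newMemBus()

	// Node 2 listens on its inbox; a real node would hand each inbound
	// message to its Raft state machine.
	var got []Message
	bus.Subscribe(inboxTopic(2), func(m Message) { got = append(got, m) })

	Send(bus, []Message{
		{From: 1, To: 2, Payload: []byte("vote")},
		{From: 1, To: 3, Payload: []byte("vote")}, // no subscriber: dropped
	})
	fmt.Println(len(got), got[0].From) // prints "1 1"
}
```

Dropping messages that have no subscriber is acceptable here because Raft is designed to tolerate message loss.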

This first step should be considered "done" when basic "Raft for Wetware" functionality has been released as a stand-alone project, with CI, documentation and issue-tracking. It will constitute the first of many "standard library" modules, providing more sophisticated behaviors on top of Wetware primitives. In the interest of expediency, early versions can use hard-coded values as a means of deferring the implementation of such things as CLI parameters until later.

The user-facing interface to the Raft cluster should follow the ww run rom.wasm pattern introduced in Wetware v0.1.0. The idea is to compile the Raft module into a WASM binary (called a "ROM"), which is first executed locally. The ww run command connects to a cluster and negotiates a Host capability, which is then passed into the local WASM process. From there, the guest logic is responsible for:

Enumerating cluster peers and selecting the appropriate subset of peers for the Raft cluster
Starting the Raft process on each of the above peers, and storing the process capability along some anchor path in the corresponding host (e.g. /<peer id>/raft/<version>)
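The text does not say how the subset of peers is chosen. As one possible approach, the guest could sort the peer IDs and keep an odd number of them, so that every invocation against the same membership view picks the same cluster. SelectPeers and its max parameter are hypothetical, and the selection policy itself is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// SelectPeers picks a deterministic subset of peers for the Raft cluster.
// It sorts the IDs and keeps the first n, where n is the largest odd number
// not exceeding min(max, len(peers)). An even-sized cluster tolerates no
// more failures than the odd-sized cluster one smaller.
func SelectPeers(peers []string, max int) []string {
	sorted := append([]string(nil), peers...) // copy; leave the input untouched
	sort.Strings(sorted)

	n := len(sorted)
	if max < n {
		n = max
	}
	if n < 0 {
		n = 0
	}
	if n > 0 && n%2 == 0 {
		n-- // round down to an odd cluster size
	}
	return sorted[:n]
}

func main() {
	peers := []string{"peerD", "peerA", "peerC", "peerB"}
	fmt.Println(SelectPeers(peers, 5)) // prints "[peerA peerB peerC]"
}
```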

The Raft process proper, i.e. the one assigned to the anchor path, is responsible for setting up any channels needed to consume client requests (e.g. to store/read values), and assigning these to sub-anchors (e.g. /<peer id>/raft/<version>/request & /<peer id>/raft/<version>/response). Subcommands that employ these sub-anchors to fetch/store values from the Raft cluster can later be added to the top-level Raft ROM.
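To keep the outer ROM, the inner ROM and any later subcommands in agreement about the anchor layout, the paths could be derived in one place. The helper names below (RaftAnchor, RequestAnchor, ResponseAnchor) are hypothetical; only the path layout comes from the examples above.

```go
package main

import (
	"fmt"
	"path"
)

// RaftAnchor returns the anchor path under which a peer's Raft process
// capability is stored: /<peer id>/raft/<version>.
func RaftAnchor(peerID, version string) string {
	return path.Join("/", peerID, "raft", version)
}

// RequestAnchor returns the sub-anchor for the channel that carries
// client requests to the Raft process.
func RequestAnchor(peerID, version string) string {
	return path.Join(RaftAnchor(peerID, version), "request")
}

// ResponseAnchor returns the sub-anchor for the channel that carries
// responses back to clients.
func ResponseAnchor(peerID, version string) string {
	return path.Join(RaftAnchor(peerID, version), "response")
}

func main() {
	fmt.Println(RaftAnchor("peerA", "v0.0.1"))     // /peerA/raft/v0.0.1
	fmt.Println(RequestAnchor("peerA", "v0.0.1"))  // /peerA/raft/v0.0.1/request
	fmt.Println(ResponseAnchor("peerA", "v0.0.1")) // /peerA/raft/v0.0.1/response
}
```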

For the avoidance of doubt: the "inner" ROM (i.e. the one actually executed on the remote Wetware node) can be embedded into the "outer" ROM by means of Go's embed package. Future work (independent of this effort) will facilitate the creation of local ROMs, e.g. via some Lisp-based scripting language, but as this is pioneering work, we'll have to do it the low-level way.

Tasks

Extend ww run command to optionally (default: true) dial a Wetware cluster, establish a Host capability, and pass it into the ROM being executed.
"Hello World" for Raft ROM: bootstraps the Host connection and publishes "Hello, World" to a "test" topic.
Outer ROM embeds a "Hello, World" app: start a process along /<peer id>/hello on some node, and have it publish a message to the same "test" topic.
Integrate the etcd Raft library into the inner ROM, preferably using pubsub for message transport
Add documentation to README
Add CI using GitHub Actions
Release v0.0.1 w/ WASM binary artefact

Distributed Queues for URLs and Results

// TODO(lthibault)

// Model multi-topic queue on top of Raft

Item Processor

// TODO(lthibault)

// Allow users to supply a custom ROM that worker nodes use to

Putting it all together

// TODO(lthibault)

// In essence: write a Crawler ROM that

References

The Raft Protocol

Chain Replication / CRAQ
