Paper Study: Distributed Systems
tags: Main
Time, Clocks, and the Ordering of Events in a Distributed System
What are the main contributions of the paper?
- It discusses and defines the meaning of time in distributed systems.
- It proposes a distributed algorithm for synchronizing a system of logical clocks which can be used to totally order the events.
- The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.
What is a distributed system?
- A distributed system consists of a collection of distinct processes which are spatially separated, and which communicate with one another by exchanging messages.
Why can a single computer also be viewed as a distributed system?
- The central control unit, the memory units, and the input-output channels can all be considered as separate processes. If the message transmission delay is not negligible compared to the time between events in a single process, it can be viewed as a distributed system.
- The paper, however, is primarily concerned with systems of spatially separated computers.
What is the challenge of time in a distributed system?
- It is sometimes impossible to say which of two events occurred first.
What is the difference between a logical clock, a physical clock, and physical time?
- physical time: the real time in the physical world.
- physical clock: a clock on a computer that represents physical time.
- logical clock: a logical clock is just a way of assigning a number to an event, where the number is thought of as the time at which the event occurred. Hence it may be implemented by counters with no actual timing mechanism, and it is represented as a function C, where C_i(a) is the logical time at which event a occurred in process Pi.
How to define a logical clock with respect to the notion of "happened before"?
- The relation "happened before" is a partial ordering of the events in the system.
- It is denoted as "->" .
- a->b means that it is possible for event a to causally affect event b.
- Two events are concurrent if neither can causally affect the other.
- A logical clock must satisfy the two clock conditions:
- If a and b are events in process Pi, and a comes before b, then C_i(a) < C_i(b).
- If a is the sending of a message by process Pi and b is the receipt of that message by another process Pj, then C_i(a) < C_j(b).
How to implement a logical clock?
- Each message m contains a timestamp Tm, which equals the logical time at which the message was sent. Upon receiving a message timestamped Tm, a process must advance its clock to be later than Tm.
- If event a is the sending of a message m by process Pi, then the message m contains a timestamp Tm = C_i(a).
- Upon receiving a message m, process Pj sets C_j to a value greater than or equal to its present value and greater than Tm (see the sketch below).
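A minimal Python sketch of a Lamport logical clock implementing the two rules above; the class and method names are illustrative, not from the paper:

```python
# A minimal sketch of a Lamport logical clock: a counter per process, no
# physical timing mechanism. Names are illustrative, not from the paper.

class LamportClock:
    def __init__(self):
        self.time = 0  # just a counter

    def local_event(self):
        """Clock ticks between successive events in the same process."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message m with Tm = C_i(a)."""
        return self.local_event()

    def receive(self, tm):
        """Advance past the message timestamp Tm on receipt."""
        self.time = max(self.time, tm) + 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
tm = p1.send()         # event a in P1: C_1(a) = 1
print(p2.receive(tm))  # event b in P2: C_2(b) = 2 > Tm
```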
How to obtain a total order (a => b) among events using logical clocks?
- We simply order the events by the logical times at which they occur. To break ties, we use any arbitrary total ordering < of the processes.
- If a is an event in process Pi and b is an event in process Pj, then a => b if and only if either (i) C_i(a) < C_j(b) or (ii) C_i(a)=C_j(b) and Pi < Pj.
- The relation => is a way of completing the "happened before" partial ordering to a total ordering. The ordering => depends upon the system of clocks C_i and is not unique (see the sketch below).
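A minimal sketch of the tie-breaking rule, assuming each event is represented as a (logical_time, process_id, name) triple and process ids supply the arbitrary total ordering of processes:

```python
# Compare logical times first; break ties with the fixed process ordering.

def total_order_key(event):
    logical_time, process_id, _ = event  # assumed triple, for illustration
    return (logical_time, process_id)

events = [(2, 1, "b@P1"), (2, 0, "a@P0"), (1, 1, "c@P1")]
print(sorted(events, key=total_order_key))
# [(1, 1, 'c@P1'), (2, 0, 'a@P0'), (2, 1, 'b@P1')]
```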
Why do we need a total order in a distributed system?
- To solve the process synchronization problem, where different requests for the resource must be granted in the order in which they are made.
Why can't the synchronization problem be solved by a centralized scheduler?
- Let P0 be the scheduling process. Suppose P1 sends a request to P0 and then sends a message to P2. Upon receiving the latter message, P2 sends a request to P0. It is possible for P2's request to reach P0 before P1's request does. The condition for synchronization is then violated if P2's request is granted first.
- In short, out of order messages can occur with respect to their sending time.
- Using the concept of logical clocks, it is possible to design a distributed algorithm that solves the problem with no central synchronizing process or central storage.
What is the limitation of the proposed synchronization algorithm based on logical clock?
- A process must know all the commands issued by other processes, so that the failure of a single process will make it impossible for any other process to execute.
Why is the entire concept of failure only meaningful in the context of physical time?
- Without physical time, there is no way to distinguish a failed process from one which is just pausing between events.
- A user can tell that a system has "crashed" only because he has been waiting too long for a response.
What is the anomalous behavior mentioned in the paper?
- An anomalous behavior occurs when the ordering obtained by the proposed logical clock algorithm differs from the ordering perceived by the user, since the user's expectation can be based on information flowing outside the system (e.g., a phone call).
How to avoid the anomalous behavior with physical clocks?
- Assume the physical clocks of processes are synchronized and run at approximately the correct rate, within some bound.
- Based on that bound, we can then detect lost messages and avoid the anomalous behavior.
- The only requirement of this implementation is that the clocks be synchronized. That is why machines in a distributed system normally need their time synchronized periodically, e.g., via the Network Time Protocol (NTP).
CAP Theorem: Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
What are the main contributions of the paper?
- Prove it is impossible to achieve all three (consistency, availability, and partition tolerance) properties in an asynchronous network model.
- Discuss solutions in a partially synchronous model.
- Open the minds of designers to a wider range of systems and tradeoffs.
What is the definition of CAP?
- Consistency (sequential): there must exist a total order on all operations such that each operation looks as if it were completed at a single instant.
- Availability: every request received by a non-failing node in the system must result in a response. That is, any algorithm used by the service must eventually terminate.
- Partition tolerance: when some nodes crash or some communication links fail, the service must still perform as expected. No set of failures less than total network failure is allowed to cause the system to respond incorrectly.
What is the definition of linearizable/atomic consistency?
- Linearizability (also known as atomic consistency) can be defined as sequential consistency with the real-time constraint.
What are the definitions of the asynchronous and partially synchronous network models?
- In the asynchronous model an algorithm has no way of determining whether a message has been lost, or has been arbitrarily delayed in the transmission channel.
- A partially synchronous model means every node has a clock, and all clocks increase at the same rate. However, the clocks themselves are not synchronized.
- As a result, it is possible to determine whether a message has been lost in the partially synchronous network model, but not in the asynchronous network model.
Why is it possible to determine whether a message has been lost in the partially synchronous network model?
- A local timer can be used to schedule an action to occur a given interval of time after some other event. Therefore, we can assume that every message is either delivered within a given, known time tmsg, or it is lost (see the sketch below).
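A minimal sketch of this timeout-based loss detection, assuming a known delivery bound (here called TMSG) and a simulated inbox; both names are illustrative:

```python
# In the partially synchronous model, a message not delivered within the
# known bound tmsg can safely be declared lost.

import time

TMSG = 0.5  # seconds; the known delivery bound assumed by the model

def await_reply(inbox, sent_at):
    """Poll the local timer; after tmsg with no delivery, treat it as lost."""
    while time.monotonic() - sent_at < TMSG:
        if inbox:
            return inbox.pop(0)
        time.sleep(0.01)
    return None  # declared lost: safe because delivery never exceeds tmsg

print(await_reply([], time.monotonic()))  # None: the message was lost
```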
What does fair execution mean?
- All executions, including those in which messages are lost.
How to prove Theorem 1: It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties: Availability, Atomic consistency in all fair executions (including those in which messages are lost).
- Proof by contradiction: assume the network consists of at least two nodes. Then it can be divided into two disjoint, non-empty sets {G1, G2}. Assume all messages between G1 and G2 are lost. If a write occurs in G1 and a later read occurs in G2, the read operation cannot return the result of the earlier write operation.
What is the purpose of Corollary 1.1, and how is it different from Theorem 1?
- Corollary 1.1 is used to support the argument for why there could still be a solution to CAP in the partially synchronous network model.
- Theorem 1 is the statement of the CAP theorem, so it applies to the asynchronous network model, where an algorithm has no way of knowing whether a message was lost.
- Corollary 1.1 applies only to executions in which no message is actually lost; the point is that the system still cannot know that no message was lost.
Why could there be a solution in the partially synchronous network model but not in the asynchronous network model?
- In the asynchronous model an algorithm has no way of determining whether a message has been lost or has been arbitrarily delayed in the transmission channel.
- But in the partially synchronous network model, a timeout mechanism can be designed to detect message loss.
- Hence, it is possible to design a solution for CAP if we can tell whether a message has been lost.
Give an example solution for the asynchronous network model.
- Even though, by definition, no node in the asynchronous model can detect message loss, a weaker consistency guarantee is still achievable.
- A relaxed consistency model (Delayed-t consistency) can be used, in which a centralized node ensures atomic data whenever all messages in an execution are delivered.
Why is time so important to the discussion of consistency and atomic operations?
- Both consistency and atomicity define a partial order on the operations and then require that if one operation begins after another one ends, the former does not precede the latter in the partial order.
CAP Twelve Years Later: How the “Rules” Have Changed
What are the main contributions of the paper?
- Clarifies the purpose of the CAP theorem and argues that the modern CAP goal should be to maximize the combinations of consistency and availability that make sense for the specific application.
- Points out misleading interpretations of the CAP theorem and the opportunities it opens up.
- Discusses the steps, challenges, and design principles for handling network partitions.
What do BASE and ACID refer to?
- ACID refers to the guarantees of traditional database systems: Atomicity, Consistency, Isolation, Durability.
- BASE refers to the new distributed system design paradigm which means Basically Available, Soft state, Eventually consistent.
What are the misleading interpretations of and opportunities from the CAP theorem?
- Because partitions are rare, there is little reason to forfeit C or A when the system is not partitioned. Also, if users cannot reach the service at all due to a network partition, there is no need to choose between C and A either. More importantly, designers generally cannot choose for a system never to experience partitions.
- The choice between C and A can occur many times within the same system at very fine granularity; not only can subsystems make different choices, but the choice can change according to the operation or even the specific data or user involved. For instance, the scope of consistency reflects the idea that, within some boundary (both time and space), state is consistent, but outside that boundary all bets are off.
- All three properties are more continuous than binary. Availability is obviously continuous from 0 to 100 percent, but there can be many levels of consistency model as well.
Why does traditional system design only talk about ACID, not CAP?
- Traditional systems mostly run on a single node or a small-scale cluster behind a single network switch. A network failure either does not occur at all or brings down the whole system when it does.
- In practice, most groups assume that a datacenter (single site) has no partitions within, and thus design for CA within a single site.
- Therefore, the network partition problem is only handled implicitly in their system designs.
Why might it make sense to give up / relax consistency?
- Availability is more important to (web) services, and partitions cannot be controlled by the system designers.
- Given the high latency across the wide area, it is relatively common to forfeit perfect consistency across the wide area for better performance.
- The subtle beauty of a consistent system is that the invariants tend to hold even when the designer does not know what they are. Consequently, a wide range of reasonable invariants will work just fine.
What is an invariant?
- An invariant (in the general sense) is a condition that must be true at some point in time, or even always, while your program is executing.
What are the three steps to handle network partitions?
- Detect the network partition: when partitions are present or perceived, a strategy that detects partitions and explicitly accounts for them is in order.
- Manage the network partition: after a partition occurs, the system should enter an explicit partition mode that can limit or control some operations to preserve the chosen level of C and A.
- Recover from the network partition: after the partition heals, the system should initiate a recovery process to restore consistency and compensate for mistakes made during the partition.
Why don't retries or two-phase commit really address the network partition problem?
- They only delay the decision of when to enter partition mode.
What are the challenges in each of the steps? And how these challenges can be handled?
- Detection:
- Problem
- There might not be a global notion of a partition, since some nodes might detect a partition, and others might not.
- Byzantine fault: a component such as a server can inconsistently appear both failed and functioning to failure-detection systems, presenting different symptoms to different observers.
- Solution: One-sided partitioning
- Since the detecting side could perform inconsistent operations, it must enter partition mode; the undetecting side can continue its operation until it also detects the partition.
- Systems that use a quorum are an example of this one-sided partitioning.
- Managing:
- Problem
- How to define the relaxed models for consistency and availability?
- How to limit or control the operations to ensure the relaxed models?
- What information should be stored for recovery?
- Solution
- All the new proposed system designs following the BASE paradigm.
- Recovery:
- Problem
- Designers must be explicit about all the invariants, which is both challenging and error-prone.
- The storage and recovery strategies for the invariants must be timely, efficient (with respect to resource consumption), and accurate.
- How to resolve merge conflicts between the states of the invariants?
- Solution
- The known checkpoint and roll-back techniques.
- Using commutative operations: the system concatenates logs, sorts them into some order, and then executes them. Commutativity implies the ability to rearrange operations into a preferred consistent global order (see the sketch after this list).
- Most systems cannot always merge conflicts. For example, CVS occasionally has conflicts that the user must resolve manually. Other systems can always merge conflicts by restricting themselves to certain operations (like Google File System or Google Docs).
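A minimal sketch of log-merge recovery with commutative operations (here, counter increments), assuming each side kept a timestamped log during the partition; all names are illustrative:

```python
# Concatenate per-partition logs, sort them into one global order, replay.
# Because increments commute, any merge order yields the same final state.

def replay(logs):
    state = {}
    # (timestamp, node_id) gives one deterministic global order.
    for _, _, key, delta in sorted(entry for log in logs for entry in log):
        state[key] = state.get(key, 0) + delta  # increments commute
    return state

log_g1 = [(1, "g1", "views", +3), (4, "g1", "views", +1)]
log_g2 = [(2, "g2", "views", +2)]
print(replay([log_g1, log_g2]))  # {'views': 6} regardless of merge order
```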
Consistency, Availability, and Convergence
What are the main contributions from the paper?
- It examines the limits of consistency in fault-tolerant distributed storage systems. In particular, it identifies fundamental tradeoffs among properties of consistency, availability, and convergence.
- It proves that no consistency semantics stronger than real-time causal consistency (RTC) can be provided in an always-available, one-way convergent system, and that RTC itself can be provided in such a system.
- It introduces bounded fork join causal semantics that extends causal consistency to Byzantine environments while retaining availability and convergence.
What is the definition of availability?
- A system allows reads and writes to complete regardless of which messages are lost.
What is the definition of consistency?
- Consistency constrains the order in which reads and writes may appear to occur; the exact constraints and requirements on the order are defined by the consistency model.
What is the definition of convergence?
- Convergence requires connected nodes to observe one another’s updates.
What is the definition of one-way convergence?
- A one-way convergent system guarantees that if node p can receive from node q, then eventually p’s state reflects updates known to q.
- So it is a relaxation of the strong definition of convergence.
- Eventual consistency then becomes a property of convergence.
What is a Byzantine environment?
- A Byzantine environment is one in which Byzantine faults may occur. A Byzantine fault is a condition of a computer system, particularly a distributed computing system, in which components may fail and there is imperfect information on whether a component has failed.
How to define consistency semantics?
- A consistency semantics is a test on an execution—if the test for consistency C passes on an execution e, we say e is C-consistent.
- An execution comprises a set of nodes and a sequence of read and write operations at each node.
- Note that we permit a read operation to return multiple results and assume serial execution at each node so that one operation at a node ends before the next one starts.
- We say that a consistency semantics Cs is stronger than another consistency semantics Cw iff the set of executions accepted by Cs is a subset of the set of executions accepted by Cw.
- We say that two consistency semantics are incomparable iff neither of them is stronger.
What is the definition of causal consistency?
What is a BH (before hand) graph?
How is real-time causal consistency different from causal consistency?
Why do the authors need to define real-time causal consistency?
Why convergence matters?
- Systems that fail to propagate information among nodes may technically have very strong consistency and availability.
- When we examine weaker semantics like causal consistency, we find that we must explicitly consider convergence.
- Convergence exposes the fundamental trade-off between safety (consistency) and liveness (availability and convergence).
Why is RTC not the strongest always-available consistency semantics under two-way convergence?
- because nodes could conspire to impose order among logically-concurrent updates while exchanging those updates.
Is linearizability the strongest semantic for some natural convergence requirement?
Paxos Made Live & Paxos Made Simple
What are the contributions of the papers?
- Paxos Made Simple
- Describe the Paxos consensus algorithm.
- Prove the correctness of the algorithm.
- Paxos Made Live
- Introduce the Paxos consensus algorithm.
- Show how to use the Paxos algorithm to build a fault-tolerant log system.
What is a consensus problem?
- Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen.
- If no value is proposed, then no value should be chosen.
- If a value has been chosen, then processes should be able to learn the chosen value.
What is the purpose of solving a consensus problem?
- For storage systems with data replicas, we need a consensus algorithm to ensure mutual consistency.
- Consensus algorithm can be used to build fault tolerant distributed systems.
What are the agents of a consensus problem?
- Proposers: clients/replicas who propose values.
- Acceptors: coordinators who make the consensus decision.
- Learners: clients/replicas who wait for the decision.
Why doesn't a single acceptor/coordinator work?
- Vulnerable to the single-point-of-failure problem.
What are the challenges of multiple acceptors/coordinators?
- There may be multiple agents who simultaneously believe they are the coordinator.
- These coordinators may select different values as a result.
- The consensus decision must be maintained even after a coordinator has failed.
What is the basic idea to solve the problem?
- Ordering the acceptors allows each agent to distinguish between the current acceptor and previous acceptors. In this way, agents can reject messages from old acceptors and prevent them from disrupting consensus once it is reached.
How does the Paxos algorithm work?
- Elect a replica to be the coordinator.
- The coordinator selects a value and broadcasts it to all replicas in a message called the accept message. Other replicas either acknowledge this message or reject it.
- Once a majority of the replicas acknowledge the coordinator, consensus has been reached, and the coordinator broadcasts a commit message to notify replicas.
How to order the coordinators?
- Paxos orders the coordinators by assigning them an increasing sequence number as follows. Each replica keeps track of the most recent sequence number it has seen so far. When a replica wants to become coordinator, it generates a unique sequence number higher than any it has seen and broadcasts it to all replicas in a propose message. If a majority of replicas reply and indicate they have not seen a higher sequence number, then the replica acts as coordinator. These replies are called promise messages, since replicas promise henceforth to reject messages from old coordinators.
How to force future coordinators to select that same value in order to ensure continued agreement?
- The promise messages from replicas include the most recent value they have heard, if any, along with the sequence number of the coordinator from whom they heard it. The new coordinator chooses the value from the most recent coordinator.
- If consensus was achieved by a previous coordinator, the new coordinator is guaranteed to hear about the value decided upon from at least one replica.
- That value will have the highest sequence number of all responses received, and so will be selected by the new coordinator.
- So future coordinators are guaranteed to select that same value, ensuring continued agreement (see the sketch below).
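A minimal single-decree sketch of the propose/promise and accept/acknowledge flow described above, assuming in-process acceptors called directly; the names (Acceptor, prepare, accept, propose) are illustrative, not from either paper:

```python
# Phase 1: collect promises from a majority. Phase 2: broadcast an accept
# message; consensus is reached on majority acknowledgment.

class Acceptor:
    def __init__(self):
        self.promised = 0          # highest sequence number promised so far
        self.accepted = (0, None)  # (sequence number, value) last accepted

    def prepare(self, n):
        """Promise to reject proposals below n; report last accepted value."""
        if n > self.promised:
            self.promised = n
            return self.accepted   # promise message
        return None                # reject: already promised a higher number

    def accept(self, n, value):
        """Accept unless a higher-numbered prepare has been seen."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, my_value):
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) <= len(acceptors) // 2:
        return None  # a coordinator with a higher number exists; retry later
    # Choose the value from the highest-numbered prior acceptance, if any.
    _, prior = max(promises, key=lambda p: p[0])
    value = prior if prior is not None else my_value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, my_value="A"))  # -> "A"
print(propose(acceptors, n=2, my_value="B"))  # -> "A" (prior value preserved)
```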
Raft: In Search of an Understandable Consensus Algorithm
What are the main contributions of the paper?
What is a replicated state machine, and what is it for?
- A mechanism introduced to cope with the fault tolerance problem in distributed systems.
- Multiple machines (state machines) hold the same (replicated) log; they read the same sequence of operations and execute the same commands, so all machines are guaranteed to be in the same state. Even if some machines fail, the others can take over and keep working normally.
What is the role of a consensus algorithm in a replicated state machine?
- Keep the replicated log consistent.
What are the properties and benefits from using a consensus algorithm in a replicated state machine?
What are the contributions from Paxos mentioned by the authors?
- first defines a protocol capable of reaching agreement on a single decision
- ensures both safety and liveness
- supports changes in cluster membership
- correctness
- efficient in normal case
Why do the authors claim that Paxos does not provide a good foundation either for system building or for education?
Intuitively, how does Raft solve these problems?
What are the purposes of the four operations: leader election, log replication, safety, and membership changes?
What are the states of a server: leaders, followers, and candidates?
- leaders: handle all client requests; once elected, a server serves as leader until it fails.
- followers: only respond to requests from leaders and candidates; if no request arrives for a period of time, a follower becomes a candidate.
- candidates: candidates for the leadership (a minimal state sketch follows).
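A minimal sketch of the follower-to-candidate transition driven by a randomized election timeout, as described above; the tick-driven loop and the names (State, Server, tick) are illustrative, not Raft's actual implementation:

```python
import enum, random

class State(enum.Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class Server:
    def __init__(self):
        self.state = State.FOLLOWER
        # Randomized timeout reduces the chance of split votes.
        self.election_timeout = random.randint(150, 300)  # milliseconds
        self.elapsed = 0

    def on_heartbeat(self):
        """AppendEntries from the leader resets the election timer."""
        self.elapsed = 0

    def tick(self, ms):
        """Advance local time; become a candidate if the leader goes silent."""
        self.elapsed += ms
        if self.state is State.FOLLOWER and self.elapsed >= self.election_timeout:
            self.state = State.CANDIDATE  # start an election (request votes)

s = Server()
s.tick(400)
print(s.state)  # State.CANDIDATE
```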
What is the purpose of term?
How does leader election work in Raft?
How is the election triggered?
What does a candidate server do during an election?
How do servers vote?
What is the restriction on voting?
When will a leader be selected? And why does a server declare itself the leader?
What is the purpose of having a candidate server return to follower after it recognizes the leader?
Why does a candidate reject the AppendEntries RPC and continue in the candidate state if the term in the RPC is smaller than the candidate’s current term?
Why could the split vote problem occur, and how does Raft solve it?
What is the procedure to replicate logs?
If followers crash or run slowly, or if network packets are lost, the leader retries Append-Entries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries. Will this cause any problem?
When will a log entry be committed?
What is the property of committed entries?
What is the Log Matching Property? Why can it be guaranteed during normal operation?
What types of inconsistency could occur, and why could they happen?
How is the inconsistency handled in Raft?
Why do the authors claim that Raft can automate reconfiguration?
What is the Leader Completeness Property? Why is it needed? How can it be guaranteed?
Why can log entries only flow in one direction, from leaders to followers?
Why will the log eventually be consistent?
Why does Raft never commit log entries from previous terms by counting replicas?
What is the essential requirement for the availability guarantee of Raft?
What is the main issue with cluster membership changes?
How does Raft solve the problem of cluster membership changes?
What is the problem if a newly joined server gets elected as the leader? How does Raft solve it?
What is the problem if the old cluster leader is not part of the new configuration?
What problems can removed servers cause? How can they be solved?
How to avoid an unbounded log?
When will the system have more than one leader?
Pros and cons of checkpointing locally vs. by the leader.
How to ensure linearizable semantics for clients?
Why can leader crashes leave the logs inconsistent?
How does Raft resolve inconsistency between leaders and followers?
How to keep the logs from growing unbounded?
Can you map the Raft system design to the failure (network partition) handling steps described in "CAP Twelve Years Later: How the “Rules” Have Changed"?
What are your own thoughts about Raft compared to Paxos? What are the pros and cons?
The Chubby lock service for loosely-coupled distributed systems
What are the main contributions of the paper?
- a coarse-grained locking service
- reliable storage for loosely coupled system
What is a lock service?
- The purpose of a lock service is to synchronize client activities and to let clients agree on basic information about their environment.
How is a lock service different from a consensus protocol library?
- A simple lock service can synchronize the different services in a system without embedding a consensus protocol library in the code.
How are the requirements different between fine-grained and coarse-grained locks?
- The frequency of acquiring and releasing locks:
- Coarse-grained: after acquiring a lock, it may be held for days or weeks before release.
- Fine-grained: after acquiring a lock, it is released within seconds.
What do cell, namespace, file, directory, and handle represent in Chubby?
- cell:
- namespace:
- file, directories: nodes
- handle: analogous to a file descriptor
Why doesn't Chubby expose operations that can move files from one directory to another?
- Since different directories may be served by different Chubby masters, moving files between directories could cause errors.
What is the difference between advisory locks and mandatory locks?
- advisory lock: a lock enforced only by convention at the application layer; we can still access the object directly if we want to.
- mandatory lock: a lock enforced at the object layer; locking an object with a mandatory lock makes it inaccessible to everyone else.
Why does Chubby choose to implement advisory locks?
- Chubby locks often protect resources implemented by other services, rather than just the file associated with the lock.
- Chubby did not wish to force users to shut down applications when they needed to access locked files for debugging or administrative purposes.
- Mandatory checks would provide little benefit.
Why does Chubby need a sequencer?
- Communication is typically uncertain and processes fail independently, so messages can arrive late or out of order; several processes may then believe they hold the lock at the same time, causing inconsistent data.
How does the sequencer work?
- A sequencer contains the lock's name, the mode in which it was acquired, and the lock generation number.
- After acquiring the lock, a process can also obtain a sequencer and pass it to servers; the generation number lets a server check whether the holder's lock is still current (see the sketch below).
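A minimal sketch of how a server might validate a sequencer against the lock's current generation number; the Sequencer type, the lock path, and is_valid() are assumed names for illustration, not Chubby's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sequencer:
    lock_name: str
    mode: str        # "exclusive" or "shared"
    generation: int  # incremented each time the lock is newly acquired

# Server-side view: the current generation of each lock it knows about.
current_generation = {"/ls/cell/leader-lock": 7}

def is_valid(seq: Sequencer) -> bool:
    """Honor a request only if its sequencer matches the live lock state."""
    return current_generation.get(seq.lock_name) == seq.generation

print(is_valid(Sequencer("/ls/cell/leader-lock", "exclusive", 7)))  # True
print(is_valid(Sequencer("/ls/cell/leader-lock", "exclusive", 6)))  # False: stale holder
```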
What is an event in Chubby?
- Events spare a Chubby client from polling: instead of repeatedly asking whether something has happened, the client registers for events, and the Chubby server notifies the client once a registered event occurs.
How to use Chubby's API to implement primary election?
- All potential primaries open the lock file and attempt to acquire the lock. One succeeds and becomes the primary, while the others act as replicas.
How to make Chubby scalable?
- Create more cells, so that clients can access a nearby server.
- The master can increase the lease time from the default 12 s to around 60 s under heavy load.
- Chubby clients cache file data, meta-data, the absence of files, and open handles to reduce the number of calls they make on the server.
- protocol-conversion servers that translate the Chubby protocol into less-complex protocols such as DNS and others.
- Proxy
- Partitioning
Why is it common to use Chubby as a name server in place of DNS?
- Under a large number of lookups, DNS comes under heavy load and latency grows, since DNS caching is purely time-based (TTL). Chubby's caching is invalidation-based, so clients can answer most lookups from their cache without asking the server.
Weighted Voting for Replicated Data
What are the main contributions?
What are the benefits of having replicated data?
What is the proposed algorithm?
What are the important properties from the proposed algorithm?
What is a FileSuite?
What does the quorum size imply?
Why must the proposed storage system be built on top of a transactional file system?
What is serial consistency, and why can serial consistency be guaranteed in the system?
What are weak representatives? What is the purpose of adding them?
Why can weak representatives be stored on users' local disks and be invalidated by simply setting the version number to unknown?
Why does using a stable file system to guarantee serial consistency decrease concurrency? How is this solved?
How is Quorum different from Paxos?
The Load, Capacity and Availability of Quorum Systems
What are the main contributions of the paper?
What is a quorum system?
What are the definitions of load, capacity, and availability of a quorum system?
What is the protocol template based on quorum systems?
What is the main problem of majority quorum systems?
What is the definition of global system failure probability of a quorum system?
Why is the defined system load a best-case measure?
Don't Settle for Eventual
What are the main contributions of the paper?
What type of storage system is COPS? And what are the user requirements for this type of system?
What is the meaning of consistency to users in practice?
How is eventual consistency different from causal consistency?
Why can convergence conflicts exist under causal consistency?
How does COPS handle conflicts?
What are the system design goals of COPS?
What is the progress property?
Why does replication between COPS clusters happen asynchronously?
Why does COPS need to track the dependencies of stored objects?
What is the definition of nearest dependency, and what is its purpose?
What are the pros and cons of COPS-GT compared to COPS?
What are primary nodes and equivalent nodes of a key?
The authors argue previous works didn't support causal+ consistency with convergence and scalability. What does this mean?
How is the system evaluated in the experiments?
The Potential Dangers of Causal Consistency
What are the main contributions of the paper?
- Describes how causal consistency's need for write propagation and dependency tracking severely limits its scalability.
- Proposes explicit causality as an alternative that reduces the other risks causal consistency brings.
What are the strengths of causal consistency?
- It is the strongest consistency achievable under partitions.
What are the issues of causal consistency?
- Causal consistency requires write propagation and dependency tracking, which severely limits its scalability.
What is explicit causality?
- With explicit causality, the application explicitly specifies the (few) dependencies that matter, instead of the system tracking the full potential causality of every operation.
What are the benefits of explicit causality?
- With fewer dependencies, the space (amount of data) needed for metadata is reduced.
- Increased concurrency: with fewer dependencies but the same overall write demand, more requests are independent and can be written concurrently, speeding up processing.
What are the limitations of explicit causality?
- The application must take on the extra, mandatory responsibility of establishing dependencies.
- It cannot solve the "Not Scaling Throughput" and "Data centers" problems, since these are limitations inherent to the mechanism's design.
Consistent Hashing and Random Trees
What are the main contributions of the paper?
What is a hotspot?
What are the objectives of its system design?
What is consistent hashing? How can it be used to store data?
What is a random tree? What do the parameters d and q mean? How is it used to locate a cache item?
What is the hotspot problem of a random tree? How can the problem be solved?
With high probability, the number of requests a given cache gets is no more than what? Intuitively, how did the authors prove it?
What are the properties (balance, monotonicity, spread, load) of consistent hashing? What are the benefits/implications of these properties?
Chord
What are the main contributions of the paper?
What is a P2P system?
What are the system design goals of Chord, and how they are achieved?
What is the consistent hashing protocol used by Chord?
What is the definition of scalability for Chord?
How does Chord support node joins/leaves/failures and data replication?
How is the system evaluated in the experiments?
Ceph: A Scalable, High-Performance Distributed File System
What are the main contributions?
- Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs).
What are the primary goals of Ceph, and how does it achieve them?
- scalability (to hundreds of petabytes and beyond)
- performance
- reliability
What are the three fundamental design features of Ceph?
- Decoupled Data and Metadata
- Metadata operations are collectively managed by a metadata server cluster, while clients interact directly with OSDs to perform file I/O (reads and writes).
- Dynamic Distributed Metadata Management
- use Dynamic Subtree Partitioning
- A dynamic hierarchical partition preserves locality in each MDS’s workload, facilitating efficient updates and aggressive prefetching to improve performance for common workloads
- Reliable Autonomic Distributed Object Storage
- Ceph delegates responsibility for data migration, replication, failure detection, and failure recovery to the cluster of OSDs that store the data, while at a high level, OSDs collectively provide a single logical object store to clients and metadata servers.
Why does Ceph choose to offer a near-POSIX interface?
- The authors find it appropriate to extend the interface and selectively relax consistency semantics in order to better align with the needs of applications and to improve system performance.
What is an MDS server, and what is an OSD server?
- OSD servers: store data and metadata.
- MDS servers: manage the namespace (file names, directories) and coordinate consistency and coherence.
What is the file system architecture based on object storage devices (OSDs)? What are the benefits?
Why do the authors say the consistency semantics of Ceph are relaxed?
- The Ceph synchronization model thus retains its simplicity by providing correct read-write and shared-write semantics between clients via synchronous I/O, and extending the application interface to relax consistency for performance conscious distributed applications.
Why are concurrent writes more common in scientific computing?
- Such workloads (many clients writing to the same file) are typical in scientific applications, where performance is often critical.
What are the trade-offs between static subtree partitioning and hashing?
- static subtree partitioning
- fails to cope with dynamic workloads and data sets
- hashing
- destroys metadata locality and critical opportunities for efficient metadata prefetching and storage
How does dynamic subtree partitioning work?
- Each MDS measures the popularity of metadata within the directory hierarchy using counters with an exponential time decay.
- Each operation increments the counters of the touched node and of every node on the path from the root directory down to it.
- This forms a weighted tree, and over time the tree is balanced by migrating directories between MDSs (see the sketch below).
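A minimal sketch of popularity counters with exponential time decay propagated up a directory path; the half-life constant, the Node class, and touch() are assumptions for illustration, not Ceph's actual code:

```python
import time

HALF_LIFE = 5.0  # seconds; the decay constant here is an assumption

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.count = 0.0
        self.stamp = time.time()

    def _decay(self, now):
        # Halve the counter once per elapsed half-life.
        self.count *= 0.5 ** ((now - self.stamp) / HALF_LIFE)
        self.stamp = now

    def touch(self):
        """Record one operation on this node and every ancestor up to the root."""
        now = time.time()
        node = self
        while node is not None:
            node._decay(now)
            node.count += 1
            node = node.parent

root = Node()
home = Node(parent=root)
home.touch(); home.touch()
print(root.count >= home.count)  # True: ancestors aggregate their subtree's heat
```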
Why can't hotspots be solved by dynamic subtree partitioning alone, and how does Ceph solve them?
- For read-heavy hotspots, the contents are replicated onto multiple MDSs. For write-heavy ones, contents are hashed by file name onto different MDSs, sacrificing some locality to achieve load balance.
The Google File System
What are the main contributions?
What are the four design considerations of GFS? How are they different from the goals of general file systems?
What is the role of master and chunk server in GFS?
How does GFS reduce the load on the master?
What are the benefits of having a large chunk size?
Why is GFS not suitable for small files?
Explain the file region states after a mutation.
How does GFS use leases to ensure data consistency?
Why separate the data flow from the control flow?
How does GFS reduce the time of a snapshot?
What is a shadow master?
Dynamo: Amazon’s Highly Available Key-value Store
What are the main contributions?
What is a blob?
What is the query model of Dynamo?
What is the ACID property of Dynamo?
When does Dynamo resolve data conflicts, and why?
Why does Dynamo let the application resolve conflicts?
Compare Bigtable and Dynamo.
What is the routing problem of DHTs? How does Dynamo solve it?
What are the problems of consistent hashing? How does Dynamo solve them?
Why can Dynamo only provide eventual consistency?
How does Dynamo handle version inconsistency in its read and write requests?
Why can a write only be considered successful when at least W nodes respond? (R+W > N)
What is a sloppy quorum? What is it for?
What is a Merkle tree? How does Dynamo use it to synchronize replicas?
Bigtable: A Distributed Storage System for Structured Data
What are the main contributions of Bigtable?
What is the data model of Bigtable?
What is the difference between the family and the qualifier in a column key?
Why do we usually store the URL in reversed order as the row key in Bigtable?
What is a tablet?
What are an SSTable and a memtable? How are they used to handle random access?
What is the purpose of using Bloom filters in Bigtable?
How is Chubby used by Bigtable?
Spanner: Google’s Globally Distributed Database
What are the main contributions of Spanner? And how is Spanner different from Bigtable?
What is external consistency?
What is the role of each component: universe master, placement driver, zone master, location proxy, and spanserver?
Why could multiple Paxos state machines per tablet allow for more flexible replication configurations?
Why may decrees be approved out of order but must be applied in order?
What is a transaction manager?
What does a Paxos group represent in Spanner?
What are the two dimensions of placement that can be controlled by administrators?
What is the TrueTime API?
How are the system design goals and solutions different in a geo-distributed system?
MapReduce: Simplified Data Processing on Large Clusters
What is the programming model of MapReduce?
What are the advantages of using MapReduce for programming?
What are the targeted applications for MapReduce?
How does MapReduce handle failures?
What is data locality? Why is it important to MapReduce?
What is a straggler? How does MapReduce address the problem?
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
What is a RDD?
What are the targeted applications for Spark?
What is distributed memory?
How does Spark handle failures?
How does Spark compare to MapReduce?
What is an ephemeral dataset? What does materialized mean?
Disk-Locality in Datacenter Computing Considered Irrelevant
What is the argument proposed by the paper?
Why did previous papers argue that disk/data locality is important? Why does this paper disagree?
Why may memory caching not work well in practice?
Why may SSDs not help either?
Kafka: a Distributed Messaging System for Log Processing
What are the design goals of Kafka?
What is a messaging system, and how is it different from Kafka?
How are Kafka's iterators different from traditional iterators? Why does the difference occur?
What is the structure of a Kafka log?
Why does Kafka avoid explicitly caching messages in memory at the Kafka layer and rely on the underlying file system page cache instead?
What is the sendfile API? How does it help Kafka?
What is the problem if the broker does not maintain information about which messages each consumer has consumed? How does Kafka overcome this problem?
What is a consumer group? How does Kafka use it to achieve scalability? Why can synchronization overhead be avoided?
How does Kafka use ZooKeeper?
What is the delivery guarantee of Kafka, and why?
A message system supporting fault tolerance
The Many Faces of Publish/Subscribe
Large-scale cluster management at Google with Borg
What are the benefits of Borg?
What is the declarative job specification language?
What is a cell in Borg? Briefly describe the types of computing resources in a cell.
Why are Borg programs statically linked?
How is an Alloc different from a Task?
What are the priority bands of Borg?
What is admission control? How does Borg enforce admission control?
What is the purpose of the Borg Name Service?
How does Borg handle Borgmaster failure?
What are the feasibility check and scoring in Borg scheduling?
What is the task startup latency problem? How does Borg address it?
How does Borg ensure its scalability?
How does Borg ensure its availability?
How is cell compaction measured and defined? Why does Borg choose it to evaluate its utilization?
Explain each of the experiment plots.
Why would using smaller cells require significantly more machines?
What does a fine-grained resource request mean? Why is it important to Borg?
What are resource reclamation and task reservation? Which types of jobs are scheduled according to the limit and the reservation, respectively?
Why should resource oversubscription be avoided? How does Borg ensure it does not happen?
What are the good/bad lessons learned?
Omega: flexible, scalable schedulers for large compute clusters
What are the three scheduling architectures (monolithic, two-level, shared-state)?
- Monolithic
- Single instance, no parallelism, must implement in a single code base
- Two-level
- a central coordinator decides how many resources each sub-cluster can have
- allocates resources to different scheduler frameworks
- concurrency control is pessimistic
- Both Mesos and YARN are two-level scheduling architectures
- Shared-state scheduling
- grant each scheduler full access to the entire cluster
- use optimistic concurrency control
- master maintains a copy of the resource allocations in the cluster, which we call cell state
- Each scheduler has a local copy of cell state, and use the local copy to make placement decision
- Conflicts may occur. Whether or not the transaction succeeds, the scheduler resyncs its local copy of cell state afterwards and, if necessary, re-runs its scheduling algorithm and tries again
What is lock-free optimistic concurrency control? What are the strengths and weaknesses of this approach?
- All resources in the cluster are visible to every scheduler and schedulers can compete for all of them, so conflicts are possible.
- Strength: Can increase parallelism
- Weakness: potentially increases the amount of wasted scheduling work if conflicts occur too frequently (see the sketch below)
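A minimal sketch of two schedulers committing against shared cell state with optimistic concurrency control; CellState, the per-machine version numbers, and commit() are illustrative assumptions, not Omega's actual interface:

```python
# Each scheduler works from a local copy of cell state and commits
# optimistically; a version mismatch means a conflict, forcing a resync.

class CellState:
    def __init__(self, machines):
        # machine -> (version, free_cpus)
        self.state = {m: (0, cpus) for m, cpus in machines.items()}

    def snapshot(self):
        return dict(self.state)  # each scheduler's local copy

    def commit(self, machine, seen_version, cpus_needed):
        """Apply the claim only if nobody changed the machine since."""
        version, free = self.state[machine]
        if version != seen_version or free < cpus_needed:
            return False  # conflict: resync local copy and retry scheduling
        self.state[machine] = (version + 1, free - cpus_needed)
        return True

cell = CellState({"m1": 4})
copy_a = cell.snapshot()  # scheduler A's local copy
copy_b = cell.snapshot()  # scheduler B's local copy
print(cell.commit("m1", copy_a["m1"][0], cpus_needed=2))  # True
print(cell.commit("m1", copy_b["m1"][0], cpus_needed=2))  # False: B resyncs, retries
```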
What are the design principles of Mesos?
What is the problem if tasks are long?
What is the purpose of filters? What types of filters are provided by Mesos?
Why does Mesos count resources offered to a framework toward its allocation of the cluster?
What happens if a framework has not responded to an offer for a sufficiently long time?
Apache Hadoop YARN: Yet Another Resource Negotiator
What is the multi-tenancy issue for resource management?
What are the problems of Hadoop-On-Demand?
What is the role of Resource Manager, Node Manager, Application Master?
Borg, Omega, and Kubernetes
What are the key lessons from the paper?
What are the differences (API, architecture) between the three systems?
What are the advantages of using container for resource management?
What are the benefits of building management APIs around containers rather than machines?
Why does K8S assign each pod its own IP rather than ports on the machine's shared IP?
Why is it important to label containers?
What is the ownership problem in Borg?
What are open problems?
SLURM: Simple Linux Utility for Resource Management
What are the design goals of SLURM?
What are the roles of slurmd and slurmctld (Node Manager, Partition Manager, Job Manager)?
What are interactive mode, batch mode, and allocate mode? How are their initiations different?
How is SLURM different from other container managers/orchestrators, such as K8S?
A Comparison of List Schedules for Parallel Processing Systems
What is a list schedule?
What is the definition of level and co-level?
Explain HLFET (Highest Levels First with Estimated Times).
How can scheduling algorithms be evaluated (Program, Graph Generator, Stochastic model)?
What are the key observations from the experiments?
Dynamic critical-path scheduling: an effective technique for allocating task graphs to multiprocessors
What is a critical path?
How does the MCP algorithm work?
How does the ETF algorithm work?
How does the DCP algorithm work?