2021/10/20 Meeting: <br><small>Design of Global Data Deduplication for A Scale-out Distributed Storage System</small>
===
###### tags: `CS Master`
**Authors:** Myoungwon Oh, Sejin Park, Jungyeon Yoon, Sangjae Kim, Kang-won Lee, Sage Weil, Heon Y. Yeom, Myoungsoo Jung
**From:** SK Telecom, Keimyung University, Red Hat, Seoul National University, Yonsei University
## Abstract
* Distributed storage systems can hold data at the PB level. However, storing and managing the huge volume of content generated by the explosion of data remains a challenge.
* One promising way to mitigate these big-data issues is data deduplication, which removes redundant data across the many nodes of the storage system.
* It is non-trivial to apply a conventional deduplication design:
    * chunk lookup for deduplication does not scale well
    * managing the metadata associated with deduplication requires heavy modification of the existing distributed storage system
    * the processing and additional I/O traffic imposed by deduplication can significantly degrade performance
* The paper proposes a new deduplication method that is highly scalable and compatible with existing scale-out storage.
    * It employs a double-hashing algorithm that leverages the hashes already used by the underlying scale-out storage, addressing the limits of conventional fingerprint hashing.
* The experimental results show that the design can save more than 90% of the total storage space while offering the same or similar performance compared to the conventional scale-out storage.
## Motivation
* To store massive volumes of data, both scalable storage and a data-reduction technique such as deduplication are needed.
* Many previous studies have demonstrated that data deduplication can remove a large amount of redundant data with low overhead.
* Naively applying deduplication is inappropriate for distributed storage environments because it removes the extra copies added by the redundancy scheme (replication or erasure coding).
* Traditionally, a deduplication system needs a fingerprint index that stores the hash value of each chunk; it compares hash values (rather than searching through chunk contents) to locate existing chunks.

* This conventional design does not work well in a scale-out setting:
    * keeping the fingerprint index scalable is problematic with a limited working-memory size
    * it is complex to ensure compatibility between the newly added deduplication metadata (e.g., reference counts) and the existing metadata
    * the performance degradation caused by the additional I/O that deduplication introduces must be addressed
### Proposal
* Effectively remove the fingerprint index by leveraging the underlying storage system's existing hash table (double hashing), and extend its metadata structure to support data deduplication natively (self-contained object). Performance degradation is minimized through rate control and selective deduplication.
* Implemented on Ceph
* main contributions of this paper
    * An effective deduplication approach that supports high availability and high performance
        * Requires no additional fingerprint index or deduplication-specific data structure, and no modification of the existing storage structures.
    * Minor changes to the existing system: on Ceph, only 4K lines of code are added/modified.
## Background
* Distributed scale-out storage systems
    * (a) Centralized: a dedicated metadata server (MDS) (e.g., Lustre, GFS, and pNFS)
        * Bottleneck and single point of failure
    * (b) Decentralized ([shared nothing](https://en.wikipedia.org/wiki/Shared-nothing_architecture)) (e.g., Ceph, GlusterFS, Swift, Dynamo, and Scality)
        * Hash algorithms: CRUSH, DHT

### Deduplication Range
* The most straightforward way to deduplicate without violating the policies of the underlying storage system is block-level deduplication (e.g., POD, Permabit (acquired by Red Hat)).
* Experiment (details explained later)

ref: [FIO](https://github.com/axboe/fio)

As the number of OSDs increases, the gap between local deduplication and global deduplication widens, so the paper targets global deduplication for a distributed storage system to achieve a higher deduplication ratio.
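For reference (the paper's exact metric definition may differ), the space saving from deduplication in this kind of experiment is typically computed as:

$$
\text{space saving} = 1 - \frac{\text{physical bytes stored after deduplication}}{\text{logical bytes written}}
$$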
## Problem and Key ideas
### Problem definition
**Scalability of fingerprint index**
Two challenges in keeping the fingerprint index scalable: how to look it up quickly, and how to place and distribute it evenly.
* The fingerprint index can grow so large that it no longer fits in memory for fast lookup => performance degrades. Previous studies used a representative (sampled) fingerprint index to keep the table small, but that can fail to remove all duplicates, and the index can still grow without bound.
* Once a fingerprint index entry is created/updated, it needs to be distributed evenly across the MDS(s). Existing systems use a centralized MDS to avoid this issue => SPOF and bottleneck. *(My idea here: use a consensus algorithm.)*
**Compatibility between the newly applied deduplication metadata and existing metadata**
Prior designs keep deduplication metadata in an external structure for ease of implementation.

If we add a fingerprint index or reference-count information for deduplication, the underlying system's storage features do not recognize these additional data structures. Features such as high availability and data recovery therefore do not cover the external structure; they must be reimplemented separately for it, which is difficult or simply infeasible in the end.
**Minimizing performance degradation**
Inline deduplication vs. post-processing deduplication?
* Inline: deduplicate on the write path. No additional space is required, but write latency is high.
* Post-processing: deduplicate in the background. Foreground latency stays lower, but the background work can still degrade foreground I/O performance.

Applying deduplication to frequently used (hot) data adds overhead because that data will soon be updated again; if deduplication keeps being applied to hot data during I/O, the unnecessary extra deduplication I/O degrades performance.
### Key idea
#### Double hashing
* Key mechanism of the traditional fingerprint index: detect redundant chunks quickly by storing <fingerprint (hash of a chunk's contents), location> pairs.
* Key mechanism of a distributed hash algorithm (which determines an object's location): hash the object ID to get a location (e.g., CRUSH in Ceph, DHT in GlusterFS).
* But there is no relationship between an object's ID and its contents => duplicates end up everywhere; you would have to traverse everything to find them, or maintain a fingerprint index.

Double hashing combines the two: remap the ordinary policy-based object IDs (object IDs 1, 2, 3) to a new content-based object ID (object ID K) by applying an additional, content-based hash (see the sketch after the list below).

* No fingerprint index
* Preserves the original scalability of the underlying distributed storage system.
* No modification is required on the client side because client requests still use the original object ID.
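A minimal Python sketch of the double-hashing idea; `crush_placement` is an illustrative stand-in for CRUSH/DHT, and the SHA-256 fingerprint is an assumption rather than the paper's exact choice:

```python
import hashlib

def crush_placement(object_id: str, num_nodes: int) -> int:
    """First hash (illustrative stand-in for CRUSH/DHT):
    policy-based object ID -> storage node."""
    return int(hashlib.md5(object_id.encode()).hexdigest(), 16) % num_nodes

def content_fingerprint(chunk_data: bytes) -> str:
    """Second hash: chunk contents -> content-based chunk object ID."""
    return hashlib.sha256(chunk_data).hexdigest()

# Two different user-visible objects holding identical content...
node_a = crush_placement("object-1", num_nodes=8)
node_b = crush_placement("object-2", num_nodes=8)   # typically a different placement

# ...are remapped to the SAME content-based chunk object ID, which the same
# distributed hash then places at the SAME location -- so the duplicate is
# removed without maintaining any fingerprint index.
chunk_id = content_fingerprint(b"identical chunk contents")
chunk_node = crush_placement(chunk_id, num_nodes=8)
```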
#### Self-contained object
This addresses the external-metadata problem: the deduplication metadata (chunk map, reference counts) lives inside ordinary objects, so the underlying storage features (replication, recovery, and so on) apply to it unchanged.

#### Deduplication rate control & selective deduplication based on post-processing
Post-processing deduplication with rate control and selective deduplication (a hot object is not deduplicated until its state changes).
## Design

### Object
Metadata objects and chunk objects are both regular objects, so the system can handle them with the same operations (see the sketch after this list).
* Metadata object
* Stored in metadata pool
* Structure
* object ID (user-visible)
* chunk map
* offset range
* chunk ID
* cached bit (true if the chunk data is cached)
* dirty bit (true if dedup. is needed)
* data (for cached obj)
* Chunk Object
* Structure
* ID (determined by chunk's contents)
* reference count
* data
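A rough Python sketch of the two object layouts described above (field names are paraphrased from these notes, not taken from the Ceph source):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class ChunkMapEntry:
    """One entry of a metadata object's chunk map."""
    offset_range: Tuple[int, int]        # (start, end) within the object
    chunk_id: Optional[str] = None       # content-based chunk object ID
    cached: bool = False                 # true if the chunk data is cached here
    dirty: bool = True                   # true if deduplication is still needed

@dataclass
class MetadataObject:
    """Stored in the metadata pool; addressed by the user-visible object ID."""
    object_id: str
    chunk_map: Dict[int, ChunkMapEntry] = field(default_factory=dict)
    data: bytes = b""                    # cached chunk data (for cached objects)

@dataclass
class ChunkObject:
    """Stored in the chunk pool; its ID is the hash of its contents."""
    chunk_id: str
    reference_count: int = 1
    data: bytes = b""
```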
### Pool-based Object Management
Different pools hold objects with different purposes/characteristics (similar to namespaces).
The metadata pool and the chunk pool can each independently choose a redundancy scheme (replication or erasure coding) depending on usage, and each pool can be placed on different storage depending on the required performance.
### Cache Manager
Various cache algorithms could be applied here; the experiments use a simple LRU-based approach.
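A tiny LRU sketch using Python's `OrderedDict`, purely to illustrate "an LRU-based approach"; the paper's cache manager operates on object/chunk state, which is not modeled here:

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least-recently-used entry once capacity is exceeded."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, object_id: str):
        if object_id not in self.entries:
            return None
        self.entries.move_to_end(object_id)      # mark as most recently used
        return self.entries[object_id]

    def put(self, object_id: str, data: bytes):
        self.entries[object_id] = data
        self.entries.move_to_end(object_id)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict the LRU entry
```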
### Data Deduplication
If two chunks have the same contents, their location in the storage system is the same, which naturally removes the duplicate (double hashing).
* Deduplication engine (a combined sketch of the engine and its rate control appears after the rate-control list below)
    1. Pick a dirty metadata object from the dirty-object list.
    2. Find the dirty chunk entries in the metadata object's chunk map; the dirty chunk data is cached inside the dirty object.
    3. Check whether the chunk entry corresponding to the dirty chunk already has a chunk object ID.
        * If yes, the old chunk is still referenced => de-reference it, then generate a new chunk object and ID from the new contents.
        * If no => generate a new chunk object and ID.
    4. Place the chunk in the chunk pool.
    5. If the chunk is new in the chunk pool, set its reference count to 1; otherwise increment the reference count.
    6. When the chunk write to the chunk pool completes, update the metadata object's chunk map.
* Deduplication rate control
    * If IOPS is above the high watermark: a single dedup operation per foreground I/O.
    * If IOPS is below the high watermark but above the low watermark: a single dedup operation per 100 foreground I/Os.
    * If IOPS is below the low watermark: no limit on dedup operations.
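A minimal sketch of one engine pass plus the watermark policy, reusing `MetadataObject`, `ChunkObject`, and `content_fingerprint` from the earlier sketches; the pools are plain dictionaries here, and the band-to-allowance mapping simply mirrors the notes above (watermark values are tunable assumptions):

```python
import math

def dedup_allowance(current_iops: float, low_wm: float, high_wm: float,
                    foreground_ios: int) -> float:
    """Watermark-based rate control, mirroring the notes above."""
    if current_iops > high_wm:
        return foreground_ios          # one dedup per foreground I/O
    if current_iops > low_wm:
        return foreground_ios / 100    # one dedup per 100 foreground I/Os
    return math.inf                    # below the low watermark: no limit

def dedup_one_object(meta: "MetadataObject", chunk_pool: dict) -> None:
    """One pass of the dedup engine over a dirty metadata object,
    following steps 1-6 above (simplified: one cached data blob per object)."""
    for entry in meta.chunk_map.values():
        if not entry.dirty:
            continue
        chunk_data = meta.data                          # dirty chunk is cached in the object
        new_id = content_fingerprint(chunk_data)        # content-based chunk object ID
        if entry.chunk_id is not None:                  # step 3: old chunk still referenced
            chunk_pool[entry.chunk_id].reference_count -= 1   # de-reference it
        if new_id in chunk_pool:
            chunk_pool[new_id].reference_count += 1           # duplicate: bump refcount
        else:
            chunk_pool[new_id] = ChunkObject(new_id, 1, chunk_data)  # new chunk object
        entry.chunk_id = new_id                         # step 6: update the chunk map
        entry.dirty = False
```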
### I/O Path example
* Write (similar to the original write path; both paths are sketched after the read steps below)
    1. The client issues an object write request (object ID, offset, size, and data) to the metadata pool.
    2. The hash algorithm in the distributed storage determines the location of the new object from the object ID and writes the data.
    3. A chunk map is created in the metadata part of the object. The chunk ID is not yet determined; the cached bit and dirty bit are set to true.
    4. The dirty-object ID list is updated for later data deduplication.
* Read
    1. The client issues an object read request (object ID, offset, and size) to the metadata pool.
    2. The hash algorithm in the metadata pool determines the location of the requested object.
    3. Read the object's chunk map to find the exact chunk.
    4. Cached?
        a. If the chunk is cached, read it from the object's data part and return it to the client.
        b. Otherwise, read the chunk ID and issue a read request (chunk ID, offset, size) to the chunk pool.
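A condensed Python sketch of both paths, reusing the object sketch above; pools are simplified to dictionaries, so this illustrates the flow rather than the actual Ceph OSD code path:

```python
dirty_object_ids: set = set()          # dirty-object list consumed by the dedup engine

def write_object(meta_pool: dict, object_id: str, offset: int, data: bytes) -> None:
    """Write path: store data in the metadata pool, mark it cached+dirty,
    and queue the object for background deduplication (steps 1-4 above)."""
    meta = meta_pool.setdefault(object_id, MetadataObject(object_id))
    meta.data = data                                   # keep the data cached in the object
    meta.chunk_map[offset] = ChunkMapEntry(
        offset_range=(offset, offset + len(data)),
        chunk_id=None,                                 # chunk ID not determined yet
        cached=True,
        dirty=True)
    dirty_object_ids.add(object_id)                    # step 4: update the dirty-object list

def read_object(meta_pool: dict, chunk_pool: dict, object_id: str, offset: int) -> bytes:
    """Read path: consult the chunk map; serve from the cache or follow the chunk ID."""
    meta = meta_pool[object_id]
    entry = meta.chunk_map[offset]
    if entry.cached:
        return meta.data                               # 4a: cached, serve from the object
    return chunk_pool[entry.chunk_id].data             # 4b: redirect to the chunk pool
```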
### Consistency Model
Every read always returns correct (up-to-date) data.

## Evaluation
### Environment setup
* 4 server nodes; each server has an Intel Xeon E5-2690 @ 2.6 GHz
* 128 GB of RAM and four SSDs
* Static (fixed) deduplication chunk size of 32 KB
### Performance Comparison
#### Small Random Performance

#### Sequential Performance

#### Space Saving

### Deduplication Rate Control

---
Things this paper could improve:
It is fairly conceptual; for more detailed explanations you have to dig into the Ceph documentation or source code.
Follow-up topic raised by this paper: consensus.
Who cites it, and who improves on it?
https://github.com/kweonwooj/papers
```
1. What previous research and ideas were cited that this paper is building off of? (this info tends to live in the introduction)
2. Was there reasoning for performing this research, if so what was it? (introduction section)
3. Clearly list out the objectives of the study
4. Was any equipment/software used? (methods section)
5. What variables were measured during experimentation? (methods)
6. Were any statistical tests used? What were their results? (methods/results section)
7. What are the main findings? (results section)
8. How do these results fit into the context of other research and their 'field'? (discussion section)
9. Explain each figure and discuss their significance.
10. Can the results be reproduced and is there any code available?
11. Name the authors, year, and title of the paper!
12. Are any of the authors familiar, do you know their previous work?
13. What key terms and concepts do I not know and need to look up in a dictionary, textbook, or ask someone?
14. What are your thoughts on the results? Do they seem valid?
```
---
The solution looks simple, but it genuinely avoids a lot of problems.
---
replication
erasure code
---
Some classic papers:
consensus algorithms: Raft, Raft vs. Paxos
fine-grained tasks
MapReduce
---
Thesis directions
* how to make a program **concurrent** without much effort
* an egalitarian consensus algorithm (I prefer to call it a synchronization algorithm)
    * leader election and no leader election are both expensive (and the latter is also complex)
    * it's not event-driven (time is arguably a kind of event, but not in the current context)
    * optionally preventing BF
* distributed worker routing
* distributed storage (too big a topic)
* binary dissemination / ring-based tight binary dissemination (二元散佈/環狀緊密二元散佈)