# KEP-NNNN: Kubernetes CSI Differential Snapshot API - DO NOT USE
<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Design Details](#design-details)
- [Alternative Designs](#Alternative-Designs)
<!-- /toc -->
## Summary
Kubernetes CSI Differential Snapshots provides a common API to query the list of changes between any arbitrary pair of Kubernetes CSI snapshots of the same volume, to facilitate efficient backup and restore of Kubernetes CSI volumes. This enhancement to CSI covers only volumes backed by block volumes in the backend storage.
## Motivation
Efficient backup of data is an important feature for a backup system. Since not all of the data in a volume changes between backups, backing up only the data that has changed is desirable. Many storage systems track the changes made since a previous point in time and are able to expose this to the backup application. Kubernetes CSI Snapshots provide a standard API to snapshot data but do not provide a standard way to find out which data has changed.
### Goals
* Provide changes between any arbitrary pair of snapshots of the same volume so that changed data can be identified quickly and easily for backup.
* Handle changes in volumes backed by block storage.
* Optional: This interface is optional. If it is not implemented by the storage vendor of the volumes being backed up, the backup software may use a proprietary differential snapshot service or run a full backup of the volumes.
* Changes should be requestable against snapshots that have been deleted. If the storage system supports it (vSphere, for example), it should return the change tracking information; otherwise it should report that all data has changed.
* The minimum change granularity is a single device block.
* Handle multiple block sizes.
* Be supportable by a majority of hardware/software/cloud storage vendors.
### Non-Goals
* Handle changes for file share volumes
* Provide file system level differences
  * Directory vs file level
  * Blocks within files
  * Subdirectory vs entire volume
## Proposal
The user/client creates a GetChangedBlocks CR in the same namespace as the target VolumeSnapshots and specifies StartOffset 0. The user/client then watches for updates to the GetChangedBlocks CR until its Status reports Success or Failure. The DiffSnap Controller listens for the creation of GetChangedBlocks CRs and processes them accordingly. If the operation succeeds, the ChangeBlockList field of the Status contains the list of ChangedBlock entries. As long as NextOffset is not nil, the client can get the next page of ChangedBlock entries by creating another GetChangedBlocks CR whose StartOffset is the NextOffset of the previous GetChangedBlocks CR.
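
The paging loop described above can be sketched as follows. This is a minimal illustration, not part of the proposal: `fetchPage` is a hypothetical callback standing in for "create a GetChangedBlocks CR, wait for Success or Failure, and read its Status", and `fakeFetch` is a hard-coded stand-in used only to make the sketch runnable. An empty `NextOffset` is assumed to mean that no pages remain.

```go
package main

import "fmt"

// ChangedBlock mirrors the Offset and Size fields of the proposed type.
type ChangedBlock struct {
	Offset uint64
	Size   uint64
}

// page stands in for the Status of one completed GetChangedBlocks CR.
type page struct {
	Blocks     []ChangedBlock
	NextOffset string // assumption: empty means there are no more pages
}

// fetchPage is a hypothetical callback: create a GetChangedBlocks CR with
// the given StartOffset, wait for it to finish, and return its Status.
type fetchPage func(startOffset string) (page, error)

// collectChangedBlocks follows NextOffset until it is empty.
func collectChangedBlocks(fetch fetchPage) ([]ChangedBlock, error) {
	var all []ChangedBlock
	offset := "0"
	for {
		p, err := fetch(offset)
		if err != nil {
			return nil, fmt.Errorf("page at offset %q: %w", offset, err)
		}
		all = append(all, p.Blocks...)
		if p.NextOffset == "" {
			return all, nil
		}
		offset = p.NextOffset
	}
}

// fakeFetch serves two hard-coded pages so the loop can be exercised.
func fakeFetch(startOffset string) (page, error) {
	switch startOffset {
	case "0":
		return page{Blocks: []ChangedBlock{{0, 4096}}, NextOffset: "4096"}, nil
	case "4096":
		return page{Blocks: []ChangedBlock{{8192, 4096}}}, nil
	}
	return page{}, fmt.Errorf("unknown offset %q", startOffset)
}

func main() {
	blocks, err := collectChangedBlocks(fakeFetch)
	fmt.Println(len(blocks), err)
}
```

Because StartOffset is a string, the loop treats it as opaque and never does arithmetic on it, which keeps the client compatible with vendors that return a token instead of a number.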

The DiffSnap Controller is a sidecar of the CSI external-snapshotter that watches for GetChangedBlocks create events. The GetChangedBlocks object contains VolumeSnapshotBase and VolumeSnapshotTarget. The DiffSnap Controller fetches these VolumeSnapshots and their associated VolumeSnapshotContents. From these objects, the controller obtains the handles of the backend snapshots and the CSI Driver name. The DiffSnap Controller then processes the GetChangedBlocks object by making a gRPC call to the corresponding CSI Driver.
If the corresponding CSI Driver supports DIFFERENTIAL_SNAPSHOT_SERVICE, it responds to the gRPC GetChangedBlocksRequest by creating a differential snapshot between the two snapshots specified in the request. The CSI Driver may convert the vendor-specific metadata to GetChangedBlocks API metadata before sending the GetChangedBlocksResponse back to the gRPC caller.
The DiffSnap Controller then updates the GetChangedBlocks Status with the metadata from the GetChangedBlocksResponse.
### User Stories
#### User Story 1
#### User Story 2
### Notes/Constraints/Caveats
### Risks and Mitigations
#### Size of GetChangedBlocks CR:
If the changes between two snapshots are large, the ChangedBlocks list in the Status could be large and therefore burden etcd. This problem could be mitigated in the following ways:
- Limit the size of the ChangedBlocks list by setting the "MaxEntries" field in the Spec.
- Pagination could split the changes across multiple GetChangedBlocks CRs. Since the size limit of an object in etcd is 1.5MB, each GetChangedBlocks CR can potentially contain 98000 ChangedBlock entries. For a 2MB data block, these 98000 ChangedBlock entries can express 192GB of changes.
- Consecutive changed blocks could also be combined into one changed block whose size is the total size of all the consecutive changed blocks. This reduces the number of ChangedBlock entries in the Status.
- Clean up regularly:
  + The client deletes the GetChangedBlocks CR after it reads the ChangedBlocks.
  + The DiffSnap Controller has a routine that cleans up all GetChangedBlocks CRs that have expired, by checking the Timeout field in the Status.
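
The sizing figures in the pagination item can be reproduced with a short calculation. The sketch below assumes each entry costs 16 bytes (Offset and Size as two raw uint64 values), which appears to be consistent with the numbers quoted but is not stated in the text. A JSON-encoded entry is larger than 16 bytes, so the result is best read as an optimistic upper bound.

```go
package main

import "fmt"

const (
	etcdObjectLimit = 3 * 512 * 1024 // 1.5 MiB, the object size limit cited above
	bytesPerEntry   = 16             // assumption: Offset + Size as two raw uint64 values
)

// maxEntries returns how many entries of entrySize bytes fit in limit bytes.
func maxEntries(limit, entrySize uint64) uint64 { return limit / entrySize }

// coveredBytes returns the amount of changed data described by n entries
// that each refer to one block of blockSize bytes.
func coveredBytes(n, blockSize uint64) uint64 { return n * blockSize }

func main() {
	n := maxEntries(etcdObjectLimit, bytesPerEntry)
	span := coveredBytes(n, 2<<20) // 2 MiB per block
	fmt.Printf("%d entries, %d GiB\n", n, span>>30)
	// prints "98304 entries, 192 GiB"
}
```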
## Design Details
If VolumeSnapshotBase is invalid or the snapshot has been deleted, the controller will respond with an appropriate error. If VolumeSnapshotTarget is specified but VolumeSnapshotBase is nil, the controller will respond with all used blocks in the volume. The same applies when VolumeSnapshotTarget is specified and VolumeSnapshotBase is missing.
### Differential Snapshot CRDs
The Differential Snapshot API in this KEP includes only the GetChangedBlocks CRD. However, a GetChangedFiles CRD will be added to the Differential Snapshot API in the future. GetChangedBlocks is a namespace-scoped CRD. GetChangedBlocks objects are created in the same namespace as VolumeSnapshotBase and VolumeSnapshotTarget.
```
// metav1 refers to k8s.io/apimachinery/pkg/apis/meta/v1.

// GetChangedBlocks is a specification for a GetChangedBlocks resource
type GetChangedBlocks struct {
	metav1.TypeMeta `json:",inline"`
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec GetChangedBlocksSpec `json:"spec"`
	// +optional
	Status GetChangedBlocksStatus `json:"status,omitempty"`
}

// GetChangedBlocksSpec is the spec for a GetChangedBlocks resource
type GetChangedBlocksSpec struct {
	// Base VolumeSnapshot. Optional.
	VolumeSnapshotBase string `json:"snapshotBase,omitempty"`
	// Target VolumeSnapshot. Required.
	VolumeSnapshotTarget string `json:"snapshotTarget"`
	// Optional.
	VolumeId string `json:"volumeId,omitempty"`
	// Logical offset from beginning of disk/volume.
	// Use string instead of uint64 to give vendor
	// the flexibility of implementing it either
	// string "token" or a number.
	StartOffset string `json:"startOffset,omitempty"`
	// Maximum number of entries in the response.
	MaxEntries uint64 `json:"maxEntries"`
	// Vendor specific parameters passed in as opaque key-value pairs. Optional.
	Parameters map[string]string `json:"parameters,omitempty"`
}

// GetChangedBlocksStatus is the status for a GetChangedBlocks resource
type GetChangedBlocksStatus struct {
	State string `json:"state"`
	Error string `json:"error,omitempty"`
	// Array of ChangedBlock.
	ChangeBlockList []ChangedBlock `json:"changeBlockList"`
	// StartOffset of the next "page".
	NextOffset string `json:"nextOffset,omitempty"`
	// Size of volume in bytes.
	VolumeSize uint64 `json:"volumeSize"`
	// Seconds since epoch.
	Timeout uint64 `json:"timeout"`
}

type ChangedBlock struct {
	// Logical offset.
	Offset uint64 `json:"offset"`
	// Size of the block data.
	Size uint64 `json:"size"`
	// Additional vendor specific info. Optional.
	Context []byte `json:"context,omitempty"`
	// If ZeroOut is true, this block in SnapshotTarget is zeroed out.
	// This is an optimization that lets the data mover avoid transferring zero blocks.
	// Not all vendors support this.
	ZeroOut bool `json:"zeroOut"`
}

// GetChangedBlocksList is a list of GetChangedBlocks resources
type GetChangedBlocksList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata"`

	Items []GetChangedBlocks `json:"items"`
}
```
### DiffSnap Controller
The DiffSnap Controller is implemented as a sidecar in the CSI external-snapshotter. It watches for create events of GetChangedBlocks CRs. The GetChangedBlocks object contains VolumeSnapshotBase and VolumeSnapshotTarget. The DiffSnap Controller then fetches these VolumeSnapshots and their associated VolumeSnapshotContents. From these objects, the controller obtains the handles of the backend snapshots and the CSI Driver name.
The controller then uses the CSI Driver name and snapshot handles to send a GetChangedBlocksRequest gRPC call to the corresponding CSI Driver. When the controller receives the gRPC response, it updates the Status of the GetChangedBlocks CR with the content of the gRPC GetChangedBlocksResponse.
If Timeout is not set in the GetChangedBlocksResponse, the controller sets the Timeout in the GetChangedBlocks Status to 1 hour after its creation time. This Timeout is used by the cleanup routine.
In the future, the CSI DiffSnap Controller will also support the GetChangedFiles CR.
The DiffSnap Controller must have Get permission on VolumeSnapshot and VolumeSnapshotContent, and Get, Update, List, and Delete permissions on GetChangedBlocks.
#### Cleanup Routine
It is the responsibility of the client to delete GetChangedBlocks objects after it fetches the information from them. However, the controller also has a goroutine that cleans up any GetChangedBlocks objects whose Timeout has expired. This Timeout field is the same as the Timeout in the GetChangedBlocksResponse (set by the CSI Driver). If this field is not set, the controller sets it to a default value of 1 hour after creation time. This default value could be configurable.
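
The timeout rule above could look roughly like the following. This is a sketch under stated assumptions: the function names are hypothetical, a Timeout of 0 is taken to mean "not set by the CSI Driver", and an object is considered expired once the current time reaches its Timeout.

```go
package main

import (
	"fmt"
	"time"
)

// defaultTTLSeconds is the 1 hour default described above, in seconds.
const defaultTTLSeconds = uint64(time.Hour / time.Second)

// effectiveTimeout returns the expiry time in seconds since the epoch.
// Assumption: a driverTimeout of 0 means the CSI Driver did not set one.
func effectiveTimeout(createdUnix, driverTimeout uint64) uint64 {
	if driverTimeout != 0 {
		return driverTimeout
	}
	return createdUnix + defaultTTLSeconds
}

// expired reports whether the cleanup routine should delete the object.
func expired(nowUnix, timeout uint64) bool {
	return nowUnix >= timeout
}

func main() {
	created := uint64(time.Now().Unix())
	timeout := effectiveTimeout(created, 0)
	// Not expired at creation time, expired one hour later.
	fmt.Println(expired(created, timeout), expired(created+defaultTTLSeconds, timeout))
}
```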
### Differential Snapshot Interface
The Differential Snapshot interface will be added to the existing CSI services that storage vendors provide as part of their CSI Volume Driver. If this service is implemented, the GetPluginCapabilities response will contain DIFFERENTIAL_SNAPSHOT_SERVICE along with the other CSI services.
```
service DifferentialSnapshot {
	rpc GetChangedBlocks(GetChangedBlocksRequest)
		returns (GetChangedBlocksResponse) {}
}

type GetChangedBlocksRequest struct {
	// Snapshot handle, optional.
	// If SnapshotBase is not specified, return all used blocks.
	SnapshotBase string
	// Snapshot handle, required.
	SnapshotTarget string
	// Optional.
	VolumeId string
	// Logical offset from beginning of disk/volume.
	// Use string instead of uint64 to give vendor
	// the flexibility of implementing it either
	// string "token" or a number.
	StartOffset string
	// Maximum number of entries in the response.
	MaxEntries uint64
	// Vendor specific parameters passed in as opaque key-value pairs. Optional.
	Parameters map[string]string
}

type GetChangedBlocksResponse struct {
	// Array of ChangedBlock.
	ChangeBlockList []ChangedBlock
	// StartOffset of the next "page".
	NextOffset string
	// Size of volume in bytes.
	VolumeSize uint64
	// Seconds since epoch.
	Timeout uint64
}

type ChangedBlock struct {
	// Logical offset.
	Offset uint64
	// Size of the block data.
	Size uint64
	// Additional vendor specific info. Optional.
	Context []byte
	// If ZeroOut is true, this block in SnapshotTarget is zeroed out.
	// This is an optimization that lets the data mover avoid transferring zero blocks.
	// Not all vendors support this.
	ZeroOut bool
}
```
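
To illustrate how StartOffset, MaxEntries and NextOffset could interact on the driver side, here is a minimal sketch. It is not the proposed implementation: `pageOf` is a hypothetical helper, StartOffset is parsed as a decimal byte offset (a vendor could equally use an opaque token), an empty StartOffset is treated as 0, and a MaxEntries of 0 is assumed to mean "no limit".

```go
package main

import (
	"fmt"
	"strconv"
)

// ChangedBlock mirrors the Offset and Size fields of the proposed type.
type ChangedBlock struct {
	Offset uint64
	Size   uint64
}

// pageOf returns up to maxEntries blocks whose Offset is >= startOffset,
// plus the StartOffset of the next page ("" when nothing is left).
// blocks must be sorted by Offset.
func pageOf(blocks []ChangedBlock, startOffset string, maxEntries uint64) ([]ChangedBlock, string, error) {
	if startOffset == "" {
		startOffset = "0" // assumption: an omitted StartOffset means the beginning
	}
	start, err := strconv.ParseUint(startOffset, 10, 64)
	if err != nil {
		return nil, "", fmt.Errorf("invalid StartOffset %q: %w", startOffset, err)
	}
	var out []ChangedBlock
	for _, b := range blocks {
		if b.Offset < start {
			continue
		}
		// Assumption: maxEntries == 0 means "no limit".
		if maxEntries != 0 && uint64(len(out)) == maxEntries {
			return out, strconv.FormatUint(b.Offset, 10), nil
		}
		out = append(out, b)
	}
	return out, "", nil
}

func main() {
	blocks := []ChangedBlock{{0, 512}, {1024, 512}, {4096, 512}}
	page, next, err := pageOf(blocks, "0", 2)
	fmt.Println(len(page), next, err)
}
```

Using the offset of the first block that did not fit as NextOffset means the next request resumes exactly where the previous page stopped, without the driver keeping per-client state.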
## CSI Driver
Storage vendors who wish to support CSI Differential Snapshot would add the CSI Differential Snapshot Interface to their CSI Driver by implementing a gRPC server that serves GetChangedBlocksRequest. The gRPC server creates the differential snapshot for the two specified snapshots. If necessary, the gRPC server then converts the differential snapshot's metadata into the format of GetChangedBlocksResponse. Any vendor-specific information could also be packaged in the Context field of the ChangedBlock.
For consecutive changed blocks, the gRPC server may combine them into one ChangedBlock whose Offset is the offset of the first changed block and whose Size is the total size of all the consecutive changed blocks. This is done to reduce the footprint of the GetChangedBlocks CR on the etcd server.
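
Combining consecutive blocks could be done along these lines. The sketch assumes the input is sorted by Offset and non-overlapping. It merges two blocks only when their ZeroOut flags match, which is an assumption added here so that zeroed and non-zeroed ranges are not mixed, and it ignores the Context field entirely.

```go
package main

import "fmt"

// ChangedBlock mirrors the proposed type, without the Context field.
type ChangedBlock struct {
	Offset  uint64
	Size    uint64
	ZeroOut bool
}

// mergeConsecutive coalesces blocks where one ends exactly where the next
// begins and both have the same ZeroOut value.
// Input must be sorted by Offset and non-overlapping.
func mergeConsecutive(blocks []ChangedBlock) []ChangedBlock {
	var merged []ChangedBlock
	for _, b := range blocks {
		if n := len(merged); n > 0 {
			last := &merged[n-1]
			if last.Offset+last.Size == b.Offset && last.ZeroOut == b.ZeroOut {
				last.Size += b.Size
				continue
			}
		}
		merged = append(merged, b)
	}
	return merged
}

func main() {
	in := []ChangedBlock{
		{Offset: 0, Size: 4096},
		{Offset: 4096, Size: 4096},
		{Offset: 16384, Size: 4096},
	}
	fmt.Println(mergeConsecutive(in)) // [{0 8192 false} {16384 4096 false}]
}
```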
## Sample usages
### Backup Block PVC with data mover
Below is an example of a backup workflow that utilizes the CSI Differential Snapshots to increase backup efficiency:
* Create a VolumeSnapshot of the Block PVC to be backed up. (The user already has a previous VolumeSnapshot of the same PVC.)
* Create a new Block PVC (PVC2) using the VolumeSnapshot as Source
* Create a data mover pod with the new PVC mounted as a raw block device.
* Create a GetChangedBlocks CR on the Kubernetes API Server and watch its status until it is either Success or Failure.
* If the Status is Success, the list of changed blocks is given in the Status's ChangedBlocks field. The data mover then backs up the changed data blocks by reading the raw block device at the offset and length specified in each ChangedBlock.
* If the Status is Failure, back up the entire volume.
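
The read step of the data mover can be sketched as below. This is an illustration only: `readChangedBlocks` and its `emit` callback are hypothetical names, and `newFakeDevice` is an in-memory stand-in for the raw block device, which in a real pod would more likely be an `*os.File` opened on the device path. Each block is read into a single buffer, so a real data mover would probably want to read large merged blocks in smaller chunks.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// ChangedBlock mirrors the Offset and Size fields of the proposed type.
type ChangedBlock struct {
	Offset uint64
	Size   uint64
}

// readChangedBlocks reads each changed extent from dev and hands it to emit,
// a hypothetical callback that would upload the data to the backup target.
func readChangedBlocks(dev io.ReaderAt, blocks []ChangedBlock, emit func(offset uint64, data []byte) error) error {
	for _, b := range blocks {
		buf := make([]byte, b.Size)
		section := io.NewSectionReader(dev, int64(b.Offset), int64(b.Size))
		// io.ReadFull fails if the device ends before the block does.
		if _, err := io.ReadFull(section, buf); err != nil {
			return fmt.Errorf("read %d bytes at offset %d: %w", b.Size, b.Offset, err)
		}
		if err := emit(b.Offset, buf); err != nil {
			return err
		}
	}
	return nil
}

// newFakeDevice returns an in-memory "device" where byte i has value i.
func newFakeDevice(size int) io.ReaderAt {
	data := make([]byte, size)
	for i := range data {
		data[i] = byte(i)
	}
	return bytes.NewReader(data)
}

func main() {
	dev := newFakeDevice(16)
	err := readChangedBlocks(dev, []ChangedBlock{{Offset: 2, Size: 3}}, func(off uint64, data []byte) error {
		fmt.Println(off, data)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```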
### Backup Block PVC without data mover
Below is another example of a backup workflow, one that does not involve creating a PVC from a snapshot, for cases where the backup solution can access the storage device directly.
* Create a VolumeSnapshot of the PVC to be backed up.
* Similarly, create a GetChangedBlocks CR on the Kubernetes API Server and watch its status until it is either Success or Failure. The Context field of the ChangedBlock may contain additional data specific to the storage vendor.
* Based on this list and the Context field, the backup software can then connect directly to the backend storage to fetch specific data blocks.
### Backup FileSystem PVC with data mover
### Restore FileSystem PVC with data mover
### Test plan
## Alternative Designs:
### GetChangedBlocksStatus subresource
GetChangedBlocksStatus could be a Kubernetes subresource to be used in Kubernetes API Aggregation Layer to reduce storage space on the etcd of the Kubernetes API Server.
### Forwarding CBT Service
Instead of storing the result in the GetChangedBlocksStatus, the DiffSnap controller could add a URL to the Status from which the user/client could fetch the stream of ChangedBlocks. The DiffSnap controller would not call the gRPC until the client fetches from the URL. This way, no data is stored in etcd; the DiffSnap controller simply forwards data from the CSI Driver to the user/client.
## Rejection:
After review in the "Container Storage Interface (CSI) Community Sync" meeting on May 25th, this design was rejected because of the concern about having etcd in the data path. The concern is not only the storage consumed on etcd but also the high I/O load placed on etcd, which was not designed for high I/O.