# Kubernetes DP: Changed Block Tracking API Design
## Introduction:
Efficient backup of data is an important feature for a backup system. Since not all of the data in a volume changes between backups, backing up only the data that has changed is desirable. Many storage systems track the changes that were made since a previous point-in-time and are able to expose this to the backup application. Kubernetes/CSI snapshots provide a standard API to snapshot data but do not provide a standard way to find out which data has changed.
In this document, we focus on proposing differential snapshots for block devices only.
## Requirements:
- Provide information so that changed data can be identified quickly and easily for backup
- Handle changes in Kubernetes volumes
- Handle changes for block volumes
- Optional API:
  This is an optional API. If it is not implemented, backup is still possible but less efficient.
- CSI API addition:
  We will define a CSI driver API that will be proposed as an addition to the CSI specification.
- Kubernetes API:
  We will define a Kubernetes API that will be proposed as an addition to the existing Kubernetes snapshot APIs.
- Provide changes between any arbitrary pair of snapshots:
  If the driver cannot produce a difference between the snapshots, it should simply mark all data as changed.
  It should be possible to request changes against snapshots that have been deleted. If the storage system supports this (vSphere, for example), it should return the change tracking information; otherwise it should report that all data has changed.
- Provide block-level differences:
  For block devices, the minimum change granularity is a single device block.
- Handle multiple block sizes:
  There are multiple block sizes in use; the change list format should not assume a fixed block size.
- Be supportable by a majority of hardware/software/cloud storage vendors
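The arbitrary-pair requirement implies a simple fallback: when the driver cannot diff the requested snapshot pair (for example, because one snapshot has been deleted and the storage system keeps no tracking data), it reports the whole volume as one changed extent. A minimal sketch with hypothetical driver hooks:

```go
// extent is a changed region of the volume (logical offset + length).
type extent struct {
	Offset uint64
	Length uint64
}

// changedExtents sketches the required fallback: if the driver cannot
// compute a diff for the snapshot pair, it reports the whole volume as a
// single changed extent. queryDiff is a hypothetical driver hook that
// returns the real per-extent diff when one is available.
func changedExtents(canDiff bool, queryDiff func() []extent, volumeSize uint64) []extent {
	if !canDiff {
		// No tracking data: everything must be backed up.
		return []extent{{Offset: 0, Length: volumeSize}}
	}
	return queryDiff()
}
```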
## Additional requirements for File Change Tracking (not covered by this API):
- Handle changes for file share volumes
- Provide file system level differences
  - Directory vs file level
  - Blocks within files
  - Subdirectory vs entire volume
## Research of CBT APIs supported by storage vendors:
### VMware CBT:
API:
```Go
type DiskChangeInfo struct {
	StartOffset int64
	Length      int64
	ChangedArea []DiskChangeExtent
}
type DiskChangeExtent struct {
	Start  int64
	Length int64
}
func (this *GlobalObjectManager) QueryChangedDiskAreas(ctx context.Context, id vim.ID, snapshotId vim.ID, startOffset int64, changeId string) (*vim.DiskChangeInfo, error)

// Sample usage: walk the disk until every changed area has been covered.
startPosition := int64(0)
for startPosition < diskCapacity {
	changes, err := mgr.QueryChangedDiskAreas(ctx, diskId, snapshotId, startPosition, changeId)
	if err != nil {
		return err
	}
	for _, area := range changes.ChangedArea {
		// Read and back up area.Length bytes at offset area.Start here.
	}
	startPosition = changes.StartOffset + changes.Length
}
```
- VMware CBT works with every VMware datastore type (VMFS, NFS, vVols, vSAN). Below is the list of vVols vendors:
  - Dell EMC
  - Pure Storage
  - HPE Nimble/3PAR
  - NetApp
  - Hitachi Storage
  - IBM XIV
### AWS Incremental Snapshot:
API:
```json
Request:
GET /snapshots/secondSnapshotId/changedblocks?firstSnapshotId=FirstSnapshotId&maxResults=MaxResults&pageToken=NextToken&startingBlockIndex=StartingBlockIndex HTTP/1.1
Response body:
{
"BlockSize": number,
"ChangedBlocks": [
{
"BlockIndex": number,
"FirstBlockToken": "string",
"SecondBlockToken": "string"
}
],
"ExpiryTime": number,
"NextToken": "string",
"VolumeSize": number
}
```
This API is largely compatible with VMware CBT, except for:
* the concept of a token as the ID of a block on each snapshot (instead of the logical offset from the beginning of the disk/volume)
* a fixed BlockSize.
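To illustrate how the two models line up, an EBS-style (BlockIndex, fixed BlockSize) entry maps directly onto the offset/length extent form VMware CBT reports. A minimal sketch (the helper name is ours, not part of either API):

```go
// extentFromBlock converts an EBS-style changed-block entry (a BlockIndex
// against a fixed BlockSize) into the offset/length extent form that
// VMware CBT reports. Illustrative only; not part of either API.
func extentFromBlock(blockIndex, blockSize int64) (offset, length int64) {
	return blockIndex * blockSize, blockSize
}
```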
### Azure Disk CSI Driver Incremental Snapshot:
So far we have not been able to find details of the API, but based on the public documentation it offers incremental snapshots between the current snapshot and the immediately previous one (not between two arbitrary snapshots).
```bash=
sourceResourceId=$(az disk show -g <yourResourceGroupNameHere> -n <exampleDiskName> --query '[id]' -o tsv)
az snapshot create -g <yourResourceGroupNameHere> \
-n <yourDesiredSnapShotNameHere> \
-l <exampleLocation> \
--source "$sourceResourceId" \
--incremental
```
### Ceph RBD Incremental Backup:
The Ceph RBD incremental backup response is a stream file that includes the data itself. The stream contains the following:
- Header
- Metadata records: Each metadata record contains the following:
- From Snap:
- record tag ('f')
- snap name length
- snap name
- To Snap
- record tag ('t')
- snap name length
- snap name
- Size
- record tag ('s')
- image size
- Data records: Each data record contains the following:
- Updated Data:
- record tag ('w')
- offset
- length
- bytes of actual data
- Zero Data
- record tag ('z')
- offset
- length
- Final record
### Other APIs still to be researched:
We do not yet have enough information about these APIs, are still in the process of researching them, or their documentation has not been published:
- Dell-EMC PowerMax, PowerFlex...
- Google Cloud Platform Persistent Disk
## Kubernetes CSI CBT API:
API:
```Go
Request:
type RequestCBT struct {
	SnapshotId1 string // Snapshot handle
	SnapshotId2 string // Snapshot handle
	VolumeId    string
	StartOffset string // Logical offset from the beginning of the disk/volume.
	                   // A string instead of uint64 gives the vendor the
	                   // flexibility of implementing it as either a "token"
	                   // or a number.
	MaxEntries  uint64 // Maximum number of entries in the response
}
Response:
type ResponseCBT struct {
	ChangeBlockList []ChangedBlock // Array of ChangedBlock
	NextOffset      string         // StartOffset of the next "page"
	VolumeSize      uint64         // Size of the volume in bytes
	Timeout         uint64         // Seconds since epoch
}
type ChangedBlock struct {
	Offset uint64 // Logical offset
	Size   uint64 // Size of the block data
}
```
- The client starts with StartOffset 0 and processes the list of ChangedBlock in the ResponseCBT. As long as NextOffset in the ResponseCBT is not nil, the client can continue to fetch the next set of ChangedBlock.
- As long as the client queries for these ChangedBlock within the Timeout specified in the response, the server should be able to send the next set of ChangedBlock to the client.
- If SnapshotId1 is invalid or the snapshot has been deleted, the server will respond with an appropriate error.
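The paging flow above can be sketched as a client-side loop. The types repeat the proposal so the sketch is self-contained; the fetch callback is hypothetical and stands in for whatever transport the CSI addition ends up using, and an empty NextOffset is taken to mean there are no more pages:

```go
// Types as proposed above, repeated so the sketch is self-contained.
type ChangedBlock struct {
	Offset uint64
	Size   uint64
}
type RequestCBT struct {
	SnapshotId1, SnapshotId2, VolumeId string
	StartOffset                        string
	MaxEntries                         uint64
}
type ResponseCBT struct {
	ChangeBlockList []ChangedBlock
	NextOffset      string
	VolumeSize      uint64
	Timeout         uint64
}

// FetchAllChangedBlocks implements the paging loop described above: start
// at offset "0" and keep requesting pages until NextOffset is empty. The
// fetch callback is a hypothetical stand-in for the transport.
func FetchAllChangedBlocks(fetch func(RequestCBT) (*ResponseCBT, error), snap1, snap2, volID string) ([]ChangedBlock, error) {
	req := RequestCBT{SnapshotId1: snap1, SnapshotId2: snap2, VolumeId: volID, StartOffset: "0", MaxEntries: 256}
	var all []ChangedBlock
	for {
		resp, err := fetch(req)
		if err != nil {
			return nil, err
		}
		all = append(all, resp.ChangeBlockList...)
		if resp.NextOffset == "" {
			return all, nil
		}
		// Continue from the server-provided token/offset.
		req.StartOffset = resp.NextOffset
	}
}
```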
### Usage scenarios:
- To back up a snapshot, create a PV/PVC from the snapshot, mount the PVC on the data mover pod as a raw block device, communicate with the CBT service to retrieve the CBT, and then back up only the changed blocks. Most CSI vendors should support this datapath.
- Alternatively, during backup the data mover pod mounts the snapshot directly (without creating a PV/PVC from it), retrieves the CBT, and reads/backs up only the changed blocks. This is a vendor-specific method.
- Nothing is mounted; the receiving end retrieves the CBT and fetches the changed blocks from the sending end (for example, replicating data between two storage devices).
## Implementation details:
(In progress)
- How do we apply CBT to a file system PVC that is backed by a block device?
- Security, credentials in Secrets, ...
- Is this CBT API implemented as a service?
- How does the client retrieve the data?
## References:
- Azure:
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/disks-incremental-snapshots
- AWS: https://docs.aws.amazon.com/ebs/latest/APIReference/API_ListChangedBlocks.html
- CephRBD:
https://docs.ceph.com/en/latest/dev/rbd-diff/
- CBT requirements discussion document: https://docs.google.com/document/d/1NrIaej4YlXmBgbBJr1OCAhu2G4DpuyrVEo8ecoPX9Ro/edit
- VMware CBT:
https://code.vmware.com/docs/11750/virtual-disk-development-kit-programming-guide/GUID-9992FD61-1D92-4CA2-937D-2DE47DDD27D2.html
https://godoc.org/github.com/vmware/govmomi/vim25/types#DiskChangeInfo
https://godoc.org/github.com/vmware/govmomi/vim25/types#DiskChangeExtent
## Contributors:
- Antony Bett (Dell-EMC)
- Dave Smith-Uchida (VMware)
- John Strunk (Redhat)
- Oleksiy Stashok (Google)
- Phuong Hoang (Dell-EMC)
- Sukarna Grandhi (Dell-EMC)
- Thomas Watson (Dell-EMC)
- Tom Manville (Kasten)
- Xiangqian Yu (Google)
- Xing Yang (VMware)
###### tags: `CBT` `API` `Kubernetes` `CSI` `VMWare` `Dell-EMC` `AWS`