# Velero Changed Block Tracking (CBT) Integration Plan
## AI GENERATED NOTICE
This document only roughly outlines the state of CBT support across CSI drivers and a possible integration path for Velero. Do not treat the implementation details or the timeline as authoritative.
## Executive Summary
This document outlines the plan for integrating Changed Block Tracking (CBT) capabilities into Velero, leveraging the Kubernetes CSI Snapshot Metadata Service APIs introduced in KEP-3314. CBT will enable Velero to perform efficient incremental backups by identifying and backing up only the data blocks that have changed between snapshots.
## Background
### What is Changed Block Tracking?
Changed Block Tracking (CBT) is a storage optimization feature that identifies which blocks of data have changed between two snapshots. For Velero, this means:
- **Faster backups**: Only changed blocks need to be processed
- **Reduced storage costs**: Store only deltas, not full copies
- **Lower network bandwidth**: Transfer only modified data
- **Improved RTO/RPO**: Faster backup and restore operations
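
To make the idea concrete, the following is a minimal conceptual sketch that finds changed blocks by comparing two in-memory volume images. The function name `changedBlocks` is hypothetical, and real CBT implementations track writes inside the storage backend instead of comparing data after the fact.

```go
package main

import "bytes"

// changedBlocks returns the byte offsets of every fixed-size block that
// differs between two volume images. Conceptual illustration only: a real
// storage backend records changes as they happen and never scans the data.
func changedBlocks(base, target []byte, blockSize int) []int64 {
	if blockSize <= 0 {
		return nil
	}
	var offsets []int64
	for off := 0; off < len(target); off += blockSize {
		end := off + blockSize
		if end > len(target) {
			end = len(target)
		}
		// A block that extends past the base image counts as changed.
		if end > len(base) || !bytes.Equal(base[off:end], target[off:end]) {
			offsets = append(offsets, int64(off))
		}
	}
	return offsets
}
```

With a 16-byte image, a 4-byte block size, and modified bytes at positions 5 and 12, the function reports the blocks at offsets 4 and 12, so only 8 of the 16 bytes would need to be backed up.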
### Kubernetes CBT Support Status
- **KEP-3314**: [CSI Snapshot Metadata Service](https://github.com/kubernetes/enhancements/issues/3314)
- **Status**: Alpha in Kubernetes 1.33
- **Documentation PR**: [kubernetes/website#48456](https://github.com/kubernetes/website/pull/48456)
- **Implementation**: [kubernetes-csi/external-snapshot-metadata](https://github.com/kubernetes-csi/external-snapshot-metadata)
## Current CSI Driver Support
### Drivers with CBT Support
| CSI Driver | Status | Tracking Issue/PR | Notes |
|------------|--------|------------------|-------|
| **Ceph CSI** | ✅ Implemented | [#5346](https://github.com/ceph/ceph-csi/issues/5346), [PR #5347](https://github.com/ceph/ceph-csi/pull/5347) | Full implementation of GetMetadataAllocated and GetMetadataDelta APIs |
| AWS EBS CSI | ❌ Not Implemented | - | Has incremental snapshots at AWS level, not exposed via CSI |
| Azure Disk CSI | ❌ Not Implemented | - | Has incremental snapshots at Azure level, not exposed via CSI |
| GCP PD CSI | ❌ Not Implemented | - | No CBT support |
| NetApp Trident | ❌ Not Implemented | - | No CBT support |
| Dell PowerStore/PowerMax | ❌ Not Implemented | - | No CBT support |
| Pure Storage | ❌ Not Implemented | - | No CBT support |
## Technical Architecture
### CSI Snapshot Metadata Service APIs
```protobuf
service SnapshotMetadata {
  // Returns allocated blocks in a snapshot
  rpc GetMetadataAllocated(GetMetadataAllocatedRequest)
      returns (stream GetMetadataAllocatedResponse) {}

  // Returns changed blocks between two snapshots
  rpc GetMetadataDelta(GetMetadataDeltaRequest)
      returns (stream GetMetadataDeltaResponse) {}
}
```
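
Both RPCs are server-streaming, so a client receives block metadata in batches until the stream ends. The sketch below shows the receive loop under simplifying assumptions: `metadataStream` and `sliceStream` are hypothetical stand-ins for the generated gRPC client stream, which signals end of stream with `io.EOF` in the same way.

```go
package main

import (
	"errors"
	"io"
)

// BlockMetadata mirrors the (byte_offset, size_bytes) tuple from the API.
type BlockMetadata struct {
	ByteOffset int64
	SizeBytes  int64
}

// metadataStream is a hypothetical stand-in for a generated gRPC client
// stream: Recv returns one batch per call and io.EOF once the stream ends.
type metadataStream interface {
	Recv() ([]BlockMetadata, error)
}

// collectBlocks drains the stream and returns every block it delivered.
// On a non-EOF error it returns the blocks received so far with the error.
func collectBlocks(s metadataStream) ([]BlockMetadata, error) {
	var all []BlockMetadata
	for {
		batch, err := s.Recv()
		if errors.Is(err, io.EOF) {
			return all, nil
		}
		if err != nil {
			return all, err
		}
		all = append(all, batch...)
	}
}

// sliceStream is an in-memory metadataStream for examples and unit tests.
type sliceStream struct{ batches [][]BlockMetadata }

func (s *sliceStream) Recv() ([]BlockMetadata, error) {
	if len(s.batches) == 0 {
		return nil, io.EOF
	}
	b := s.batches[0]
	s.batches = s.batches[1:]
	return b, nil
}
```

Collecting everything into one slice keeps the example short. For very large volumes a real client would more likely process each batch as it arrives.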
### Key Components
1. **External Snapshot Metadata Sidecar**
   - Deployed alongside CSI drivers
   - Exposes SnapshotMetadataService API
   - Handles gRPC streaming of metadata
2. **SnapshotMetadataService CRD**
   - Advertises CSI driver CBT capabilities
   - Contains endpoint and authentication details
3. **Block Metadata Format**
   - Fixed-length blocks or variable-length extents
   - BlockMetadata tuples: `(byte_offset, size_bytes)`
   - Streaming response for large datasets
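
A consumer that receives fixed-length blocks may want to merge contiguous ones into larger extents before reading data. The following sketch shows one way to do that; the `coalesce` helper is hypothetical and not part of any existing API.

```go
package main

import "sort"

// BlockMetadata mirrors the (byte_offset, size_bytes) tuple.
type BlockMetadata struct {
	ByteOffset int64
	SizeBytes  int64
}

// coalesce merges adjacent or overlapping blocks into variable-length
// extents. The input is copied and sorted, so callers may pass blocks in
// any order.
func coalesce(blocks []BlockMetadata) []BlockMetadata {
	if len(blocks) == 0 {
		return nil
	}
	sorted := append([]BlockMetadata(nil), blocks...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].ByteOffset < sorted[j].ByteOffset
	})
	out := []BlockMetadata{sorted[0]}
	for _, b := range sorted[1:] {
		last := &out[len(out)-1]
		lastEnd := last.ByteOffset + last.SizeBytes
		if b.ByteOffset <= lastEnd {
			// Extend the current extent if this block reaches further.
			if end := b.ByteOffset + b.SizeBytes; end > lastEnd {
				last.SizeBytes = end - last.ByteOffset
			}
			continue
		}
		out = append(out, b)
	}
	return out
}
```

For example, 4096-byte blocks at offsets 0, 4096, and 16384 become two extents: one of 8192 bytes at offset 0 and one of 4096 bytes at offset 16384.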
## Velero Integration Plan
### Phase 1: Foundation
#### 1.1 Discovery and Capability Detection
- [ ] Implement discovery of SnapshotMetadataService CRDs
- [ ] Add capability detection for CSI drivers with CBT support
- [ ] Create feature flag for CBT functionality
- [ ] Update Velero CRDs to track CBT metadata
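
Capability detection could be sketched as a lookup keyed by CSI driver name and gated by the feature flag. Everything here is illustrative: the `metadataService` and `cbtDiscovery` types and their fields are hypothetical, and the sketch assumes each discovered SnapshotMetadataService object can be associated with exactly one driver name.

```go
package main

// metadataService holds the subset of a discovered SnapshotMetadataService
// object that a client would need. Field names are hypothetical.
type metadataService struct {
	DriverName string // CSI driver the service belongs to
	Address    string // gRPC endpoint of the sidecar
	Audience   string // audience expected in the auth token
}

// cbtDiscovery answers "can this driver use CBT?" for the backup workflow.
type cbtDiscovery struct {
	featureEnabled bool
	byDriver       map[string]metadataService
}

// newCBTDiscovery indexes the discovered services by driver name.
func newCBTDiscovery(featureEnabled bool, found []metadataService) *cbtDiscovery {
	d := &cbtDiscovery{
		featureEnabled: featureEnabled,
		byDriver:       make(map[string]metadataService, len(found)),
	}
	for _, s := range found {
		d.byDriver[s.DriverName] = s
	}
	return d
}

// serviceFor returns the metadata service for a driver. It reports false
// when the feature flag is off or the driver advertises no CBT support,
// in which case the caller falls back to a traditional backup.
func (d *cbtDiscovery) serviceFor(driver string) (metadataService, bool) {
	if !d.featureEnabled {
		return metadataService{}, false
	}
	s, ok := d.byDriver[driver]
	return s, ok
}
```

Checking the feature flag inside `serviceFor` keeps the fallback decision in one place, so callers never need to consult the flag themselves.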
#### 1.2 Client Implementation
- [ ] Implement gRPC client for SnapshotMetadata service
- [ ] Add authentication/authorization handling
- [ ] Implement streaming metadata receiver
- [ ] Add retry and resume logic for interrupted transfers
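
Resume logic can build on the starting offset that the metadata requests accept: after an interruption, the client asks again from the end of the last block it received. The sketch below illustrates this with a hypothetical `fetchFunc` callback in place of a real gRPC call; `flakySource` is an in-memory fake that drops the stream once.

```go
package main

import (
	"errors"
	"fmt"
)

// BlockMetadata mirrors the (byte_offset, size_bytes) tuple.
type BlockMetadata struct {
	ByteOffset int64
	SizeBytes  int64
}

// fetchFunc is a hypothetical wrapper around one streaming call: it emits
// blocks at or after startingOffset and returns an error if the stream
// breaks before completion.
type fetchFunc func(startingOffset int64, emit func(BlockMetadata)) error

// fetchWithResume retries an interrupted stream, resuming after the last
// block that was received so no metadata is fetched twice.
func fetchWithResume(fetch fetchFunc, maxAttempts int) ([]BlockMetadata, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var blocks []BlockMetadata
	var next int64
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		lastErr = fetch(next, func(b BlockMetadata) {
			blocks = append(blocks, b)
			next = b.ByteOffset + b.SizeBytes
		})
		if lastErr == nil {
			return blocks, nil
		}
	}
	return blocks, fmt.Errorf("metadata stream failed after %d attempts: %w", maxAttempts, lastErr)
}

// flakySource simulates a stream that fails once after failAfter blocks.
func flakySource(all []BlockMetadata, failAfter int) fetchFunc {
	failed := false
	return func(start int64, emit func(BlockMetadata)) error {
		sent := 0
		for _, b := range all {
			if b.ByteOffset < start {
				continue
			}
			if !failed && sent == failAfter {
				failed = true
				return errors.New("stream interrupted")
			}
			emit(b)
			sent++
		}
		return nil
	}
}
```

A production client would probably add backoff between attempts and distinguish retryable from permanent errors; both are omitted here for brevity.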
### Phase 2: Core Integration
#### 2.1 Backup Workflow Enhancement
- [ ] Modify backup controller to detect CBT-capable volumes
- [ ] Implement full backup with GetMetadataAllocated
- [ ] Store block metadata in backup repository
- [ ] Add CBT metadata to backup manifests
#### 2.2 Incremental Backup Support
- [ ] Implement incremental backup using GetMetadataDelta
- [ ] Add parent snapshot tracking
- [ ] Implement delta block reading and storage
- [ ] Create incremental backup chain management
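
Chain management largely comes down to following parent links from an incremental backup back to its full backup. The following sketch assumes a hypothetical `BackupRecord` type in which an empty parent ID marks a full backup.

```go
package main

import "fmt"

// BackupRecord is a hypothetical catalog entry for one backup of a volume.
type BackupRecord struct {
	SnapshotID       string
	ParentSnapshotID string // empty for a full backup
}

// resolveChain walks parent links from target back to the full backup and
// returns the chain ordered oldest first. It fails if a parent is missing
// or the links form a cycle.
func resolveChain(records map[string]BackupRecord, target string) ([]BackupRecord, error) {
	var chain []BackupRecord
	seen := map[string]bool{}
	id := target
	for {
		rec, ok := records[id]
		if !ok {
			return nil, fmt.Errorf("snapshot %q missing from chain", id)
		}
		if seen[id] {
			return nil, fmt.Errorf("cycle detected at snapshot %q", id)
		}
		seen[id] = true
		chain = append(chain, rec)
		if rec.ParentSnapshotID == "" {
			break
		}
		id = rec.ParentSnapshotID
	}
	// Reverse so the full backup comes first.
	for i, j := 0, len(chain)-1; i < j; i, j = i+1, j-1 {
		chain[i], chain[j] = chain[j], chain[i]
	}
	return chain, nil
}
```

Running this check before deleting any backup would show whether later incrementals still depend on it.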
### Phase 3: Data Mover Integration
#### 3.1 CSI Data Mover Enhancement
- [ ] Integrate CBT APIs with existing CSI Data Mover
- [ ] Optimize data transfer to read only changed blocks
- [ ] Implement parallel block processing
- [ ] Add progress tracking for block-level operations
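
Parallel block processing could look like a small worker pool that reads only the listed blocks from the snapshot device. This is a sketch under assumptions: `readBlocksParallel` is a hypothetical helper, and `memVolume` is an in-memory stand-in for a block device opened as an `io.ReaderAt`.

```go
package main

import (
	"io"
	"sync"
)

// BlockMetadata mirrors the (byte_offset, size_bytes) tuple.
type BlockMetadata struct {
	ByteOffset int64
	SizeBytes  int64
}

// readBlocksParallel reads every listed block from src using up to
// `workers` goroutines. results[i] holds the data for blocks[i]; the first
// read error, if any, is returned alongside the partial results.
func readBlocksParallel(src io.ReaderAt, blocks []BlockMetadata, workers int) ([][]byte, error) {
	if workers < 1 {
		workers = 1
	}
	results := make([][]byte, len(blocks))
	jobs := make(chan int)
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				buf := make([]byte, blocks[i].SizeBytes)
				n, err := src.ReadAt(buf, blocks[i].ByteOffset)
				// io.ReaderAt may return io.EOF together with a full read.
				if err != nil && !(err == io.EOF && n == len(buf)) {
					mu.Lock()
					if firstErr == nil {
						firstErr = err
					}
					mu.Unlock()
					continue
				}
				results[i] = buf // each index is written by one worker only
			}
		}()
	}
	for i := range blocks {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results, firstErr
}

// memVolume is an in-memory io.ReaderAt used for illustration.
type memVolume []byte

func (m memVolume) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 || off >= int64(len(m)) {
		return 0, io.EOF
	}
	n := copy(p, m[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}
```

Holding every block in memory is only acceptable for an example. A real data mover would stream each block to the repository as soon as it is read.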
#### 3.2 Repository Integration
- [ ] Enhance repository format to store block metadata
- [ ] Implement deduplication at block level
- [ ] Add compression for block data
- [ ] Create block-level integrity verification
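
Block-level deduplication and integrity verification can both be built on content hashes. The sketch below uses SHA-256 as an assumed hash choice and a hypothetical in-memory `blockStore`; it does not describe how Velero's repository actually stores data.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// blockStore is a hypothetical content-addressed store: blocks are keyed
// by the hex SHA-256 digest of their contents.
type blockStore struct {
	blocks map[string][]byte
}

func newBlockStore() *blockStore {
	return &blockStore{blocks: make(map[string][]byte)}
}

// put stores data under its digest and reports whether the block was new.
// Identical blocks are stored once, which gives deduplication for free.
func (s *blockStore) put(data []byte) (key string, isNew bool) {
	sum := sha256.Sum256(data)
	key = hex.EncodeToString(sum[:])
	if _, ok := s.blocks[key]; ok {
		return key, false
	}
	s.blocks[key] = append([]byte(nil), data...) // copy to own the bytes
	return key, true
}

// verify re-hashes a stored block and checks that it still matches its key.
func (s *blockStore) verify(key string) bool {
	data, ok := s.blocks[key]
	if !ok {
		return false
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]) == key
}
```

Storing the same block twice returns the same key and leaves a single copy in the store, while `verify` detects a block whose contents changed after it was written.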
### Phase 4: Advanced Features
#### 4.1 Restore Optimization
- [ ] Implement selective block restore
- [ ] Add fast incremental restore from deltas
- [ ] Support point-in-time recovery with block chains
- [ ] Implement block-level verification during restore
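
Restoring from a chain amounts to replaying the full backup and then each incremental in order, letting later blocks overwrite earlier ones. The sketch below does this against an in-memory volume; `BlockData` and `applyChain` are hypothetical names.

```go
package main

import "fmt"

// BlockData is a hypothetical restored block: its offset and its contents.
type BlockData struct {
	ByteOffset int64
	Data       []byte
}

// applyChain rebuilds a volume image by writing the blocks of each backup
// in chain order (full backup first). Blocks from later backups overwrite
// earlier ones; regions never written stay zeroed.
func applyChain(volumeSize int64, chain [][]BlockData) ([]byte, error) {
	vol := make([]byte, volumeSize)
	for i, backup := range chain {
		for _, b := range backup {
			end := b.ByteOffset + int64(len(b.Data))
			if b.ByteOffset < 0 || end > volumeSize {
				return nil, fmt.Errorf("backup %d: block at offset %d exceeds volume size %d",
					i, b.ByteOffset, volumeSize)
			}
			copy(vol[b.ByteOffset:end], b.Data)
		}
	}
	return vol, nil
}
```

Point-in-time recovery falls out of the same loop: truncating the chain at an earlier incremental yields the volume as it was at that backup.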
#### 4.2 Performance and Monitoring
- [ ] Add metrics for CBT operations
- [ ] Implement bandwidth throttling for block transfers
- [ ] Add CBT-specific monitoring dashboards
- [ ] Create performance benchmarking tools
## Implementation Details
### Backup Flow with CBT
```mermaid
graph TD
A[Start Backup] --> B{CBT Supported?}
B -->|Yes| C[Create VolumeSnapshot]
B -->|No| D[Traditional Backup]
C --> E{First Backup?}
E -->|Yes| F[GetMetadataAllocated]
E -->|No| G[GetMetadataDelta]
F --> H[Read Allocated Blocks]
G --> I[Read Changed Blocks]
H --> J[Store in Repository]
I --> J
J --> K[Update Metadata]
K --> L[Complete Backup]
```
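
The branching in this flow can be captured in a small decision function. The sketch below uses hypothetical names (`backupPlan`, `planBackup`) and treats an empty parent snapshot ID as "first backup".

```go
package main

// backupPlan describes which path of the flow a volume takes.
type backupPlan struct {
	Mode string // "traditional", "full", or "incremental"
	RPC  string // metadata RPC to call; empty for traditional backups
}

// planBackup mirrors the two decisions in the flow: whether CBT is
// supported, and whether a previous snapshot exists to diff against.
func planBackup(cbtSupported bool, parentSnapshotID string) backupPlan {
	switch {
	case !cbtSupported:
		return backupPlan{Mode: "traditional"}
	case parentSnapshotID == "":
		return backupPlan{Mode: "full", RPC: "GetMetadataAllocated"}
	default:
		return backupPlan{Mode: "incremental", RPC: "GetMetadataDelta"}
	}
}
```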
### Configuration Schema
```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
spec:
  config:
    # CBT-specific configuration (proposed keys; config values are strings)
    enableCBT: "true"
    cbtBlockSize: "4096" # Block size in bytes
    cbtStreamTimeout: "300s"
    cbtMaxParallelStreams: "4"
```
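
Because the `config` section of a BackupStorageLocation is a map of strings, the proposed keys would have to be parsed into typed values before use. The following sketch shows one way to do that. The `CBTConfig` type is hypothetical, and the defaults simply reuse the values from the example above.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// CBTConfig is a hypothetical typed view of the proposed CBT settings.
type CBTConfig struct {
	Enabled            bool
	BlockSize          int64
	StreamTimeout      time.Duration
	MaxParallelStreams int
}

// parseCBTConfig converts string config values into a CBTConfig. Missing
// keys keep their defaults (assumed here, not defined by Velero).
func parseCBTConfig(cfg map[string]string) (CBTConfig, error) {
	out := CBTConfig{BlockSize: 4096, StreamTimeout: 300 * time.Second, MaxParallelStreams: 4}
	var err error
	if v, ok := cfg["enableCBT"]; ok {
		if out.Enabled, err = strconv.ParseBool(v); err != nil {
			return out, fmt.Errorf("enableCBT: %w", err)
		}
	}
	if v, ok := cfg["cbtBlockSize"]; ok {
		if out.BlockSize, err = strconv.ParseInt(v, 10, 64); err != nil {
			return out, fmt.Errorf("cbtBlockSize: %w", err)
		}
		if out.BlockSize <= 0 {
			return out, fmt.Errorf("cbtBlockSize: must be positive, got %d", out.BlockSize)
		}
	}
	if v, ok := cfg["cbtStreamTimeout"]; ok {
		if out.StreamTimeout, err = time.ParseDuration(v); err != nil {
			return out, fmt.Errorf("cbtStreamTimeout: %w", err)
		}
	}
	if v, ok := cfg["cbtMaxParallelStreams"]; ok {
		if out.MaxParallelStreams, err = strconv.Atoi(v); err != nil {
			return out, fmt.Errorf("cbtMaxParallelStreams: %w", err)
		}
	}
	return out, nil
}
```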
### API Extensions
```go
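// VolumeSnapshotBackup extended with CBT metadata.
type VolumeSnapshotBackup struct {
	// Existing fields...

	// CBT-specific fields
	CBTEnabled       bool            // true when the backup was taken using CBT
	ParentSnapshotID string          // snapshot this backup is a delta against; empty for full backups
	BlockMetadata    []BlockMetadata // blocks captured in this backup
	BackupType       string          // "full" or "incremental"
}

// BlockMetadata describes one backed-up block or extent.
type BlockMetadata struct {
	Offset int64  // byte offset within the volume
	Size   int64  // length in bytes
	Hash   string // content hash used for integrity checks
}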
```
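
A validation method helps keep these fields consistent with each other. The sketch below repeats the proposed types so it is self-contained. The rules it enforces are assumptions made for illustration: incremental backups need CBT and a parent, and blocks are expected in ascending, non-overlapping order.

```go
package main

import "fmt"

type BlockMetadata struct {
	Offset int64
	Size   int64
	Hash   string
}

type VolumeSnapshotBackup struct {
	CBTEnabled       bool
	ParentSnapshotID string
	BlockMetadata    []BlockMetadata
	BackupType       string // "full" or "incremental"
}

// Validate checks that the CBT fields are mutually consistent. The rules
// are illustrative assumptions, not an agreed specification.
func (b VolumeSnapshotBackup) Validate() error {
	switch b.BackupType {
	case "full":
		if b.ParentSnapshotID != "" {
			return fmt.Errorf("full backup must not reference parent %q", b.ParentSnapshotID)
		}
	case "incremental":
		if !b.CBTEnabled {
			return fmt.Errorf("incremental backup requires CBT")
		}
		if b.ParentSnapshotID == "" {
			return fmt.Errorf("incremental backup requires a parent snapshot")
		}
	default:
		return fmt.Errorf("unknown backup type %q", b.BackupType)
	}
	var prevEnd int64
	for i, m := range b.BlockMetadata {
		if m.Offset < 0 || m.Size <= 0 {
			return fmt.Errorf("block %d: invalid offset %d or size %d", i, m.Offset, m.Size)
		}
		if m.Offset < prevEnd {
			return fmt.Errorf("block %d overlaps the previous block or is out of order", i)
		}
		prevEnd = m.Offset + m.Size
	}
	return nil
}
```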
## Testing Strategy
### Unit Tests
- Mock SnapshotMetadata service responses
- Test block metadata parsing and validation
- Verify incremental chain logic
### Integration Tests
- Test with Ceph CSI (currently the only implementation)
- Validate full and incremental backup flows
- Test restore from incremental chains
- Verify data integrity after restore
### Performance Tests
- Benchmark backup time reduction
- Measure network bandwidth savings
- Compare storage usage (full vs incremental)
- Test with various block change patterns
## Migration and Compatibility
### Backward Compatibility
- CBT is opt-in via feature flag
- Graceful fallback to traditional backup
- Support mixed environments (CBT and non-CBT volumes)
- Maintain compatibility with existing backup formats
### Migration Path
1. Deploy updated Velero with CBT support
2. Enable CBT feature flag
3. New backups automatically use CBT where available
4. Existing backups continue to work
5. Gradual migration as snapshots are recreated
## Security Considerations
- **Authentication**: Use ServiceAccount tokens for gRPC calls
- **Authorization**: Implement RBAC for SnapshotMetadataService access
- **Data Integrity**: Verify block hashes during transfer
- **Encryption**: Support encryption of block data in transit and at rest
## Performance Expectations
### Based on Ceph CSI Implementation
| Metric | Traditional Backup | CBT-Enabled Backup | Improvement |
|--------|-------------------|-------------------|-------------|
| Backup Time (100GB, 10% change) | 30 min | 5 min | 83% reduction |
| Network Transfer | 100GB | 10GB | 90% reduction |
| Storage Used | 100GB | 10GB | 90% reduction |
| Restore Time | 30 min | 10 min | 67% reduction |
## Risks and Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| Limited CSI driver support | Low adoption | Contribute CBT implementations to popular drivers |
| API changes (alpha feature) | Breaking changes | Abstract API layer, version detection |
| Large metadata streams | Memory/performance issues | Implement efficient streaming, pagination |
| Complex incremental chains | Data corruption risk | Implement chain validation, periodic full backups |
## Timeline and Milestones
- **Q1 2025**: Foundation and discovery
- **Q2 2025**: Core backup integration
- **Q3 2025**: Data Mover enhancement
- **Q4 2025**: Advanced features and optimization
- **Q1 2026**: GA release with full CBT support
## Success Criteria
1. **Functional Success**
   - Successfully perform incremental backups with CBT
   - Restore data accurately from incremental chains
   - Support at least 2 CSI drivers with CBT
2. **Performance Success**
   - Achieve >70% reduction in backup time for incremental backups
   - Reduce network bandwidth usage by >80% for incremental backups
   - Reduce storage consumption by >70% for incremental backups
3. **Adoption Success**
   - Documentation and examples available
   - Community feedback incorporated
   - Production deployments validated
## References
- [KEP-3314: CSI Snapshot Metadata Service](https://github.com/kubernetes/enhancements/issues/3314)
- [Kubernetes Documentation PR](https://github.com/kubernetes/website/pull/48456)
- [External Snapshot Metadata Implementation](https://github.com/kubernetes-csi/external-snapshot-metadata)
- [Ceph CSI CBT Implementation](https://github.com/ceph/ceph-csi/issues/5346)
- [Velero CSI Plugin](https://github.com/vmware-tanzu/velero-plugin-for-csi)
## Next Steps
1. **Immediate Actions**
   - Create Velero enhancement proposal for CBT support
   - Set up development environment with Ceph CSI
   - Begin prototype implementation
2. **Community Engagement**
   - Present plan to Velero community
   - Collaborate with CSI driver maintainers
   - Gather feedback from end users
3. **Development Kickoff**
   - Form CBT working group
   - Define detailed technical specifications
   - Start Phase 1 implementation
---
*Last Updated: December 2024*
*Document Version: 1.0*
*Author: Velero Team*