# kopia repo maint notes
First pass at notes:
## Upstream doc:
* https://velero.io/docs/v1.14/repository-maintenance/
* kopia blob_gc_test.go https://github.com/kopia/kopia/blob/master/repo/maintenance/blob_gc_test.go
## customer info
https://docs.google.com/document/d/1zpyL_sux7H2p50rH0MBdo_JxppBuACVAnEvPvLBEvsA/edit?tab=t.0
## OADP dpa.spec
```
default-repo-maintain-frequency <integer>
How often (in nanoseconds) 'maintain' is run for backup repositories by
default.
```
```
oc explain dpa.spec.configuration.velero.args.default-repo-maintain-frequency
GROUP: oadp.openshift.io
KIND: DataProtectionApplication
VERSION: v1alpha1
FIELD: default-repo-maintain-frequency <integer>
DESCRIPTION:
How often (in nanoseconds) 'maintain' is run for backup repositories by
default.
```
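If the field description is taken literally, the value is a count of nanoseconds, so a bare `360` (as in the DPA below) would read as 360 ns, not 360 seconds or minutes. A minimal sketch of the unit conversion, assuming the description is accurate; `to_nanoseconds` is a made-up helper name, not part of OADP or Velero:

```python
from datetime import timedelta

def to_nanoseconds(interval: timedelta) -> int:
    """Convert a timedelta to whole nanoseconds (hypothetical helper)."""
    # timedelta stores days, seconds and microseconds as integers,
    # so the conversion stays in exact integer arithmetic.
    total_us = (interval.days * 86_400 + interval.seconds) * 1_000_000 + interval.microseconds
    return total_us * 1_000

print(to_nanoseconds(timedelta(hours=1)))    # 3600000000000
print(to_nanoseconds(timedelta(minutes=6)))  # 360000000000
```

Worth double-checking how the operator actually interprets the integer before relying on any particular value.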
## DPA
```
spec:
  backupLocations:
    - velero:
        config:
          profile: default
          region: us-west-2
        credential:
          key: cloud
          name: cloud-credentials
        default: true
        objectStorage:
          bucket: cvpbucketuswest2
          prefix: velero
        provider: aws
  configuration:
    nodeAgent:
      enable: true
      uploaderType: kopia
    velero:
      args:
        default-repo-maintain-frequency: 360
      defaultPlugins:
        - kubevirt
        - csi
        - openshift
        - aws
  snapshotLocations:
    - velero:
        config:
          profile: default
          region: us-west-2
        provider: aws
status:
  conditions:
    - lastTransitionTime: '2024-10-22T19:27:00Z'
      message: Reconcile complete
      reason: Complete
      status: 'True'
      type: Reconciled
```
### Wes's test notes
* start w/ a clean bucket
```
aws s3 ls --summarize --recursive --human-readable s3://cvpbucketuswest2/
Total Objects: 0
Total Size: 0 Bytes
```
* first volley
```
velero backup create vol24-1 --include-namespaces minimal-24csivol --snapshot-move-data=true
```
* repo maint jobs started; they seem to be running FULL maint
* `oc logs pod/repo-maintain-job-1729626671302-zbdsq`
* https://termbin.com/079j
```
time="2024-10-22T19:51:12Z" level=warning msg="active indexes [xn0_41d7e4c001635d817cc31ffc4a15f0b7-sca1a20b0845c562c12e-c1 xn0_9e4b7b2bedf4e266102b87e865ec9800-sca19c4675875bd2b12e-c1] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-10-22T19:51:12Z" level=info msg="Start to open repo for maintenance, allow index write on load" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:165"
time="2024-10-22T19:51:12Z" level=warning msg="active indexes [xn0_41d7e4c001635d817cc31ffc4a15f0b7-sca1a20b0845c562c12e-c1 xn0_9e4b7b2bedf4e266102b87e865ec9800-sca19c4675875bd2b12e-c1 xn0_e8b4c3d621246975962f11cea3dc1eb1-sd12627bda768460f12e-c1] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-10-22T19:51:12Z" level=info msg="Succeeded to open repo for maintenance" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:172"
time="2024-10-22T19:51:12Z" level=info msg="Running full maintenance..." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
time="2024-10-22T19:51:12Z" level=info msg="Running full maintenance..." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
```
* repomaint logs after backup deleted w/ velero cli
* https://termbin.com/1oku - after several repo maint runs.
### create dm backup w/ ttl 1hr
```
remote_velero backup create vol24-ttl-short --include-namespaces minimal-24csivol --snapshot-move-data=true --ttl 1h0m0s
Backup request "vol24-ttl-short" submitted successfully.
```
```
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
vol24-ttl-short Completed 0 0 2024-10-22 20:19:29 +0000 UTC 35m dpa-sample-1 <none>
```
```
Total Objects: 331
Total Size: 4.0 GiB
whayutin@thinkdoe:~/OPENSHIFT/git/OADP/oadp-operator$ date
Tue Oct 22 02:40:40 PM MDT 2024
```
* backup was deleted after ttl expired
```
Total Objects: 352
Total Size: 4.0 GiB
whayutin@thinkdoe:~/OPENSHIFT/git/OADP/oadp-operator$ date
Tue Oct 22 05:07:41 PM MDT 2024
```
### connected to repo manually
```
kopia maintenance info
```
```
Owner: default@default
Quick Cycle:
  scheduled: true
  interval: 1h0m0s
  next run: now
Full Cycle:
  scheduled: true
  interval: 24h0m0s
  next run: 2024-10-23 20:21:05 UTC (in 20h57m1s)
Log Retention:
  max count: 10000
  max age of logs: 720h0m0s
  max total size: 1.1 GB
Object Lock Extension: disabled
Recent Maintenance Runs:
  compact-single-epoch:
    2024-10-22 22:22:10 UTC (0s) SUCCESS
    2024-10-22 21:22:10 UTC (0s) SUCCESS
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  delete-superseded-epoch-indexes:
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  full-rewrite-contents:
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  generate-epoch-range-index:
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  snapshot-gc:
    2024-10-22 20:21:05 UTC (0s) SUCCESS
  advance-epoch:
    2024-10-22 22:22:10 UTC (0s) SUCCESS
    2024-10-22 21:22:10 UTC (0s) SUCCESS
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  cleanup-epoch-markers:
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  cleanup-logs:
    2024-10-22 20:21:06 UTC (0s) SUCCESS
```
kopia repository status
```
```
Config file: /root/.config/kopia/repository.config
Description: Repository in S3: s3.amazonaws.com cvpbucketuswest2
Hostname: oadp-mustgather-pod
Username: root
Read-only: false
Format blob cache: 15m0s
Storage type: s3
Storage capacity: unbounded
Storage config: {
  "bucket": "cvpbucketuswest2",
  "prefix": "velero/kopia/minimal-24csivol/",
  "endpoint": "s3.amazonaws.com",
  "accessKeyID": "AKIAVBQYB2FD4NQGDQOB",
  "secretAccessKey": "****************************************",
  "sessionToken": ""
}
Unique ID: 7d847c816937c6b1bc282e0649cf0ae51ca1ec11d3042cb2d5308cbee6b9cfeb
Hash: BLAKE2B-256-128
Encryption: AES256-GCM-HMAC-SHA256
Splitter: DYNAMIC-4M-BUZHASH
Format version: 3
Content compression: true
Password changes: true
Max pack length: 21 MB
Index Format: v2
Epoch Manager: enabled
Current Epoch: 0
Epoch refresh frequency: 20m0s
Epoch advance on: 20 blobs or 10.5 MB, minimum 24h0m0s
Epoch cleanup margin: 4h0m0s
Epoch checkpoint every: 7 epochs
```
### update the maintenance owner and run maint
```
kopia maintenance set --owner=root@oadp-mustgather-pod
```
```
kopia maintenance run --full
```
```
Running full maintenance...
Looking for active contents...
Looking for unreferenced contents...
GC found 0 unused contents (0 B)
GC found 875 unused contents that are too recent to delete (4.3 GB)
GC found 0 in-use contents (0 B)
GC found 54 in-use system-contents (39.2 KB)
Previous content rewrite has not been finalized yet, waiting until the next blob deletion.
Not enough time has passed since previous successful Snapshot GC. Will try again next time.
Looking for unreferenced blobs...
Deleted total 0 unreferenced blobs (0 B)
Cleaned up 0 logs.
Cleaning up old index blobs which have already been compacted...
Finished full maintenance.
```
### turn off the child safety
```
kopia maintenance run --full --safety=none
```
```
kopia maintenance run --full --safety=none
Running full maintenance...
Looking for active contents...
Looking for unreferenced contents...
GC found 875 unused contents (4.3 GB)
GC found 0 unused contents that are too recent to delete (0 B)
GC found 0 in-use contents (0 B)
GC found 54 in-use system-contents (39.2 KB)
Rewriting contents from short packs...
Total bytes rewritten 13.8 MB
Found safe time to drop indexes: 2024-10-22 23:33:34.799396449 +0000 UTC
Dropping contents deleted before 2024-10-22 23:33:34.799396449 +0000 UTC
Looking for unreferenced blobs...
deleted 100 unreferenced blobs (2.3 GB)
deleted 200 unreferenced blobs (4.2 GB)
Deleted total 236 unreferenced blobs (4.3 GB)
Cleaned up 0 logs.
Cleaning up old index blobs which have already been compacted...
Finished full maintenance.
```
### AND THAT DID IT FOLKS
```
2024-10-22 17:33:35 18 Bytes velero/kopia/minimal-24csivol/xw1729640014
Total Objects: 133
Total Size: 13.6 MiB
whayutin@thinkdoe:~/OPENSHIFT/git/OADP/oadp-operator$
```
## Some notes from digging into Kopia src
First, `safety=none` is only safe to run if we can guarantee that nothing else will access the repository while maintenance is running. That means users must shut down every Velero instance that references the repository before running a manual full maintenance with safety off. We don't want to recommend this under normal circumstances; it should be reserved for exceptional cases such as:
* "We accidentally backed up a 10TB volume we didn't intend to back up, so we deleted the Velero backup and need to get rid of the data in our bucket immediately."
* "We have found a bug in Velero and full maintenance is not working properly. The data should have been removed days ago but has not been."
In terms of normal expectations for GC, these are the values Kopia uses to govern GC behavior during full maintenance:
```
// SafetyFull has default safety parameters which allow safe GC concurrent with snapshotting
// by other Kopia clients.
SafetyFull = SafetyParameters{
	BlobDeleteMinAge:                24 * time.Hour, //nolint:mnd
	DropContentFromIndexExtraMargin: time.Hour,
	MarginBetweenSnapshotGC:         4 * time.Hour,  //nolint:mnd
	MinContentAgeSubjectToGC:        24 * time.Hour, //nolint:mnd
	RewriteMinAge:                   2 * time.Hour,  //nolint:mnd
	SessionExpirationAge:            96 * time.Hour, //nolint:mnd
	RequireTwoGCCycles:              true,
	MinRewriteToOrphanDeletionDelay: time.Hour,
}
```
https://github.com/kopia/kopia/blob/master/repo/maintenance/maintenance_safety.go#L56C1-L68C1
The above values are used in all cases except when `safety=none`.
It appears that `BlobDeleteMinAge` and `MinContentAgeSubjectToGC` are measured from creation time, meaning that content created less than 24 hours ago is not yet subject to garbage collection. Everything below assumes that we're talking about content from snapshots that were created at least 24 hours ago.
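This would explain the "too recent to delete" result in the manual full run above. A rough sketch of the age check, assuming the content was created at about the `CREATED` time of `vol24-ttl-short` and that maintenance ran at about the time shown in the `--safety=none` output. The function name is hypothetical, and this is a simplified model of the behavior described here, not Kopia's code:

```python
from datetime import datetime, timedelta, timezone

# Value copied from the SafetyFull parameters quoted above.
MIN_CONTENT_AGE_SUBJECT_TO_GC = timedelta(hours=24)

def old_enough_for_gc(created: datetime, now: datetime,
                      min_age: timedelta = MIN_CONTENT_AGE_SUBJECT_TO_GC) -> bool:
    """Return True once content is at least min_age old (simplified model)."""
    return now - created >= min_age

created = datetime(2024, 10, 22, 20, 19, 29, tzinfo=timezone.utc)  # backup CREATED time
run_at = datetime(2024, 10, 22, 23, 33, 34, tzinfo=timezone.utc)   # assumed maintenance time

print(run_at - created)                                           # 3:14:05
print(old_enough_for_gc(created, run_at))                         # False
print(old_enough_for_gc(created, created + timedelta(hours=24)))  # True
# Assumption: --safety=none behaves like a zero minimum age,
# which is consistent with the output seen above.
print(old_enough_for_gc(created, run_at, min_age=timedelta(0)))   # True
```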
`RequireTwoGCCycles`, `MarginBetweenSnapshotGC`, and `DropContentFromIndexExtraMargin` are used as follows:
When `RequireTwoGCCycles` is true, content first marked for deletion during one full maintenance cycle isn't actually deleted until a second cycle confirms that it's still marked for deletion. That second cycle must run at least 4 hours later (`MarginBetweenSnapshotGC`). Once this condition has been met, the content is removed as long as it was marked for deletion no later than one hour after the start of that GC cycle (`DropContentFromIndexExtraMargin`).
This guards against the first full maintenance starting at about the same time as the creation of another snapshot that references the old content.
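The two-cycle rule as read above can be sketched as a simple spacing check. This is a simplified model of these notes, not Kopia's implementation: the function name and timestamps are made up, and the `DropContentFromIndexExtraMargin` adjustment is left out.

```python
from datetime import datetime, timedelta, timezone

# Value copied from the SafetyFull parameters quoted above.
MARGIN_BETWEEN_SNAPSHOT_GC = timedelta(hours=4)

def second_cycle_confirms(first_gc_start: datetime, second_gc_start: datetime,
                          margin: timedelta = MARGIN_BETWEEN_SNAPSHOT_GC) -> bool:
    """True if the second GC cycle starts at least `margin` after the first."""
    return second_gc_start - first_gc_start >= margin

first = datetime(2024, 10, 23, 20, 0, tzinfo=timezone.utc)  # made-up timestamp

print(second_cycle_confirms(first, first + timedelta(hours=1)))   # False
print(second_cycle_confirms(first, first + timedelta(hours=24)))  # True
```

Under this model, the 24h full-cycle interval reported by `kopia maintenance info` would space consecutive full runs far enough apart to satisfy the 4-hour margin, while back-to-back manual runs would not.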