# kopia repo maint notes

First pass at notes:

## Upstream doc:
* https://velero.io/docs/v1.14/repository-maintenance/
* kopia blob_gc_test.go https://github.com/kopia/kopia/blob/master/repo/maintenance/blob_gc_test.go

## customer info
https://docs.google.com/document/d/1zpyL_sux7H2p50rH0MBdo_JxppBuACVAnEvPvLBEvsA/edit?tab=t.0

## OADP dpa.spec
```
default-repo-maintain-frequency <integer>
  How often (in nanoseconds) 'maintain' is run for backup repositories by default.
```

```
oc explain dpa.spec.configuration.velero.args.default-repo-maintain-frequency
GROUP:      oadp.openshift.io
KIND:       DataProtectionApplication
VERSION:    v1alpha1

FIELD: default-repo-maintain-frequency <integer>

DESCRIPTION:
    How often (in nanoseconds) 'maintain' is run for backup repositories by default.
```

## DPA
```
spec:
  backupLocations:
    - velero:
        config:
          profile: default
          region: us-west-2
        credential:
          key: cloud
          name: cloud-credentials
        default: true
        objectStorage:
          bucket: cvpbucketuswest2
          prefix: velero
        provider: aws
  configuration:
    nodeAgent:
      enable: true
      uploaderType: kopia
    velero:
      args:
        default-repo-maintain-frequency: 360
      defaultPlugins:
        - kubevirt
        - csi
        - openshift
        - aws
  snapshotLocations:
    - velero:
        config:
          profile: default
          region: us-west-2
        provider: aws
status:
  conditions:
    - lastTransitionTime: '2024-10-22T19:27:00Z'
      message: Reconcile complete
      reason: Complete
      status: 'True'
      type: Reconciled
```

### Wes's test notes
* start w/ a clean bucket
```
aws s3 ls --summarize --recursive --human-readable s3://cvpbucketuswest2/

Total Objects: 0
Total Size: 0 Bytes
```
* first volley
```
velero backup create vol24-1 --include-namespaces minimal-24csivol --snapshot-move-data=true
```
* repo maint jobs started..
seem to be running FULL maint
  * oc logs pod/repo-maintain-job-1729626671302-zbdsq
  * https://termbin.com/079j
```
time="2024-10-22T19:51:12Z" level=warning msg="active indexes [xn0_41d7e4c001635d817cc31ffc4a15f0b7-sca1a20b0845c562c12e-c1 xn0_9e4b7b2bedf4e266102b87e865ec9800-sca19c4675875bd2b12e-c1] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-10-22T19:51:12Z" level=info msg="Start to open repo for maintenance, allow index write on load" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:165"
time="2024-10-22T19:51:12Z" level=warning msg="active indexes [xn0_41d7e4c001635d817cc31ffc4a15f0b7-sca1a20b0845c562c12e-c1 xn0_9e4b7b2bedf4e266102b87e865ec9800-sca19c4675875bd2b12e-c1 xn0_e8b4c3d621246975962f11cea3dc1eb1-sd12627bda768460f12e-c1] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-10-22T19:51:12Z" level=info msg="Succeeded to open repo for maintenance" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:172"
time="2024-10-22T19:51:12Z" level=info msg="Running full maintenance..." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
time="2024-10-22T19:51:12Z" level=info msg="Running full maintenance..." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
```
* repomaint logs after backup deleted w/ velero cli
  * https://termbin.com/1oku - after several repo maint runs.

### create dm backup w/ ttl 1hr
```
remote_velero backup create vol24-ttl-short --include-namespaces minimal-24csivol --snapshot-move-data=true --ttl 1h0m0s
Backup request "vol24-ttl-short" submitted successfully.
```

```
NAME              STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
vol24-ttl-short   Completed   0        0          2024-10-22 20:19:29 +0000 UTC   35m       dpa-sample-1       <none>
```

```
Total Objects: 331
Total Size: 4.0 GiB
whayutin@thinkdoe:~/OPENSHIFT/git/OADP/oadp-operator$ date
Tue Oct 22 02:40:40 PM MDT 2024
```
* backup was deleted after ttl expired
```
Total Objects: 352
Total Size: 4.0 GiB
whayutin@thinkdoe:~/OPENSHIFT/git/OADP/oadp-operator$ date
Tue Oct 22 05:07:41 PM MDT 2024
```

### connected to repo manually
```
kopia maintenance info
```

```
Owner: default@default
Quick Cycle:
  scheduled: true
  interval: 1h0m0s
  next run: now
Full Cycle:
  scheduled: true
  interval: 24h0m0s
  next run: 2024-10-23 20:21:05 UTC (in 20h57m1s)
Log Retention:
  max count: 10000
  max age of logs: 720h0m0s
  max total size: 1.1 GB
Object Lock Extension: disabled
Recent Maintenance Runs:
  compact-single-epoch:
    2024-10-22 22:22:10 UTC (0s) SUCCESS
    2024-10-22 21:22:10 UTC (0s) SUCCESS
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  delete-superseded-epoch-indexes:
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  full-rewrite-contents:
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  generate-epoch-range-index:
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  snapshot-gc:
    2024-10-22 20:21:05 UTC (0s) SUCCESS
  advance-epoch:
    2024-10-22 22:22:10 UTC (0s) SUCCESS
    2024-10-22 21:22:10 UTC (0s) SUCCESS
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  cleanup-epoch-markers:
    2024-10-22 20:21:06 UTC (0s) SUCCESS
  cleanup-logs:
    2024-10-22 20:21:06 UTC (0s) SUCCESS
```

```
kopia repository status
```

```
Config file: /root/.config/kopia/repository.config

Description: Repository in S3: s3.amazonaws.com cvpbucketuswest2
Hostname: oadp-mustgather-pod
Username: root
Read-only: false
Format blob cache: 15m0s

Storage type: s3
Storage capacity: unbounded
Storage config: {
  "bucket": "cvpbucketuswest2",
  "prefix": "velero/kopia/minimal-24csivol/",
  "endpoint": "s3.amazonaws.com",
  "accessKeyID": "AKIAVBQYB2FD4NQGDQOB",
  "secretAccessKey": "****************************************",
  "sessionToken": ""
}

Unique ID: 7d847c816937c6b1bc282e0649cf0ae51ca1ec11d3042cb2d5308cbee6b9cfeb
Hash: BLAKE2B-256-128
Encryption: AES256-GCM-HMAC-SHA256
Splitter: DYNAMIC-4M-BUZHASH
Format version: 3
Content compression: true
Password changes: true
Max pack length: 21 MB
Index Format: v2

Epoch Manager: enabled
Current Epoch: 0

Epoch refresh frequency: 20m0s
Epoch advance on: 20 blobs or 10.5 MB, minimum 24h0m0s
Epoch cleanup margin: 4h0m0s
Epoch checkpoint every: 7 epochs
```

### update the maintenance owner and run maint
```
kopia maintenance set --owner=root@oadp-mustgather-pod
```

```
kopia maintenance run --full
```

```
Running full maintenance...
Looking for active contents...
Looking for unreferenced contents...
GC found 0 unused contents (0 B)
GC found 875 unused contents that are too recent to delete (4.3 GB)
GC found 0 in-use contents (0 B)
GC found 54 in-use system-contents (39.2 KB)
Previous content rewrite has not been finalized yet, waiting until the next blob deletion.
Not enough time has passed since previous successful Snapshot GC. Will try again next time.
Looking for unreferenced blobs...
Deleted total 0 unreferenced blobs (0 B)
Cleaned up 0 logs.
Cleaning up old index blobs which have already been compacted...
Finished full maintenance.
```

### turn off the child safety
```
kopia maintenance run --full --safety=none
```

```
kopia maintenance run --full --safety=none
Running full maintenance...
Looking for active contents...
Looking for unreferenced contents...
GC found 875 unused contents (4.3 GB)
GC found 0 unused contents that are too recent to delete (0 B)
GC found 0 in-use contents (0 B)
GC found 54 in-use system-contents (39.2 KB)
Rewriting contents from short packs...
Total bytes rewritten 13.8 MB
Found safe time to drop indexes: 2024-10-22 23:33:34.799396449 +0000 UTC
Dropping contents deleted before 2024-10-22 23:33:34.799396449 +0000 UTC
Looking for unreferenced blobs...
  deleted 100 unreferenced blobs (2.3 GB)
  deleted 200 unreferenced blobs (4.2 GB)
Deleted total 236 unreferenced blobs (4.3 GB)
Cleaned up 0 logs.
Cleaning up old index blobs which have already been compacted...
Finished full maintenance.
```

### AND THAT DID IT FOLKS
```
2024-10-22 17:33:35   18 Bytes velero/kopia/minimal-24csivol/xw1729640014

Total Objects: 133
Total Size: 13.6 MiB
whayutin@thinkdoe:~/OPENSHIFT/git/OADP/oadp-operator$
```

## Some notes from digging into Kopia src

First, `safety=none` is only something we can run if we can guarantee that nothing will access the repository during that time. This means that before running a manual full maintenance with safety off, users must first shut down all Velero instances which reference the repository.

This is not something we want to recommend that users do under normal circumstances. It should be limited to exceptional circumstances, something like "We accidentally backed up a 10TB volume we didn't intend to back up, so we deleted the Velero backup and need to immediately get rid of this in our bucket." Or also "We have found a bug in Velero and full maintenance is not working properly. The data should have been removed days ago but has not."

In terms of normal expectations for GC, these are the values Kopia uses to govern GC behavior during full maintenance:
```
// SafetyFull has default safety parameters which allow safe GC concurrent with snapshotting
// by other Kopia clients.
SafetyFull = SafetyParameters{
	BlobDeleteMinAge:                24 * time.Hour, //nolint:mnd
	DropContentFromIndexExtraMargin: time.Hour,
	MarginBetweenSnapshotGC:         4 * time.Hour,  //nolint:mnd
	MinContentAgeSubjectToGC:        24 * time.Hour, //nolint:mnd
	RewriteMinAge:                   2 * time.Hour,  //nolint:mnd
	SessionExpirationAge:            96 * time.Hour, //nolint:mnd
	RequireTwoGCCycles:              true,
	MinRewriteToOrphanDeletionDelay: time.Hour,
}
```
https://github.com/kopia/kopia/blob/master/repo/maintenance/maintenance_safety.go#L56C1-L68C1

The above values are used in all cases except when `safety=none`.

It appears that `BlobDeleteMinAge` and `MinContentAgeSubjectToGC` reference creation time, meaning that content created fewer than 24 hours ago is not yet subject to garbage collection. Everything below assumes that we're talking about content from snapshots that were created at least 24 hours ago.

`RequireTwoGCCycles`, `MarginBetweenSnapshotGC`, and `DropContentFromIndexExtraMargin` are used as follows:

When `RequireTwoGCCycles` is true, any content first marked for deletion during one full maintenance cycle won't actually be deleted until a second cycle confirms that it's still marked for deletion. That second cycle must be at least 4 hours later (`MarginBetweenSnapshotGC`). Once this condition has been met, as long as the content was marked for deletion no later than one hour after the start of that GC cycle (`DropContentFromIndexExtraMargin`), it will be removed.

The reason for the above is to guard against the first full maintenance starting at about the same time as another snapshot is created which references the old content.