Last-Liveness-Seen GC
=====================
This is a draft of notes about a potential garbage collection design.
:::info
:wolf: This is written in the wee hours, and has a low professionalism quotient.
Kindly accept this first draft for what it is. Future drafts would surely be needed.
:::
This GC is aimed at content-addressable data storage systems (such as IPFS) and alludes to CIDs in several places, but is probably otherwise general.
key concerns
------------
- Garbage collection should be safe to run concurrently with the addition of new data.
- Any synchronization primitives used should be of minimal cost and in minimal number.
- synchronization primitives need to be reliable in the face of crashes (or able to fall back to resumable checkpoints).
- fsync costs are predominant in practice.
- journaling systems are difficult to implement. Simpler would be better.
- synchronization primitives could have costs in space needed, and it is preferable if these are A) predictable and B) minimal.
- an integer per object is a reasonable amount of overhead. (This is for large-ish data storage, not for a programming language memory GC system that needs bit twiddling to minimize metadata to sub-word sizes.)
- Garbage collection should be able to make partial progress (in other words, be an "anytime algorithm").
- Systems under heavy load should be able to decide to deprioritize, or even kill, a garbage collection process, and backslide a minimal amount.
approximate implementation
--------------------------
### main process
- there will be three global values which need to be flushed atomically: `gc-epoch` (an int), `gc-epoch-finished` (an int), and `root-in-prog` (a CID, or possibly set of them).
- don't attempt to maintain strict reference counts; instead,
keep a (fixed length!) record next to each object which remembers the last `gc-epoch` (an int) which "saw" an object.
- when preparing to do a marking run, increment the `gc-epoch` int
- start from a root (a pin or whatever) and walk.
Mark everything with the `gc-epoch` int as follows:
- for each object visited: set its `last-seen` record to the max of the current `gc-epoch` int and the object's existing `last-seen` record.
- if the number didn't rise, no need to continue recursing over this object's links;
otherwise, if the number did rise: recurse.
- you have now updated the last-seen for a whole DAG recursively.
_this does not mean you are ready to run GC._
- update the `gc-epoch-finished` number to the `gc-epoch` number you just finished with.
(this is important so that if we crashed part-way, we could notice it,
and start the process again -- still bumping the `gc-epoch` number, but starting over from the same root, since we don't know where exactly in the walk we crashed.)
- do this process again for each root.
- (optional optimization: you may do several roots with the same epoch number, but... consider tradeoffs and maybe don't; it'll make the redo in case of a crash bigger.)
Note at no point did I say "you're done! yeet stuff!". Not yet.
- Have a stable ordering function for roots.
- A simple sorting by CID works fine for this.
- Remember the root you started on, and the `gc-epoch` int you started with there.
- Okay, I lied. This is two more pieces of state than the three vars I noted at the top.
- **Go until you've looped back to the root you started on.**
- **Now** you can do GC.
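The marking walk above can be sketched in a few lines. This is a minimal single-threaded sketch, assuming an in-memory store; the names (`Store`, `mark_from_root`, the dict shapes) are all illustrative, and the atomic-flush concerns of a real store are reduced to comments.

```python
class Store:
    def __init__(self):
        # cid -> {"last_seen": epoch int, "links": [cid, ...]}
        self.objects = {}
        self.gc_epoch = 0           # bumped before each marking run
        self.gc_epoch_finished = 0  # last epoch whose walk completed

    def put(self, cid, links=()):
        # stamped with the current epoch here for simplicity;
        # see "concurrent adds" for why a live store stamps higher
        self.objects[cid] = {"last_seen": self.gc_epoch, "links": list(links)}

    def mark_from_root(self, root):
        self.gc_epoch += 1  # in a real store, flushed atomically
        epoch = self.gc_epoch
        stack = [root]
        while stack:
            obj = self.objects[stack.pop()]
            if obj["last_seen"] >= epoch:
                continue  # the number didn't rise: subtree already seen
            # existing record < epoch, so max(epoch, existing) == epoch
            obj["last_seen"] = epoch
            stack.extend(obj["links"])
        # recorded only after the whole walk succeeds: a crash mid-walk
        # leaves gc_epoch > gc_epoch_finished, which says "redo this root"
        self.gc_epoch_finished = epoch
```

Note the early `continue` is what makes re-walks after a crash (or shared-subtree revisits) cheap: a subtree whose top already carries the current epoch is never descended into again.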
### actual gc phase
- Yeet every object which has a `last-seen` value that's less than the `gc-epoch` that the last full loop started on.
- That's it.
(Bonus: consider replacing "yeet" with "move to recycle bin",
but hopefully this is only an emergency fallback position until we're sure the kinks are all ironed out.)
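As a sketch, the sweep is a one-liner plus a delete loop. Assume here that `last_seen` is a plain dict mapping each object's CID to its last-seen epoch, and that `loop_start_epoch` is the `gc-epoch` the completed full loop over all roots started on; both names are illustrative.

```python
def sweep(last_seen, loop_start_epoch):
    """Yeet every object last seen before the full loop started."""
    dead = [cid for cid, epoch in last_seen.items()
            if epoch < loop_start_epoch]
    for cid in dead:
        del last_seen[cid]  # or: move to a recycle bin as the fallback
    return sorted(dead)
```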
### concurrent adds
... are pretty easy, you just pick another, higher `gc-epoch` number
and start writing your new entries with that.
FIXME: this has some implications about how we ratchet up the number that I think 'main process' doesn't yet account for sufficiently.
I hope this can be solved in finite state, but at most it should take a small stack.
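The stamping rule for a concurrent add is tiny. A sketch, with the same illustrative dict-of-epochs shape as above (`gc_epoch` would be an atomically readable int in a real store):

```python
def add_object(last_seen, gc_epoch, cid):
    # Stamp strictly above the marker's current epoch, so even a sweep
    # keyed to the in-progress marking run can never collect this object.
    last_seen[cid] = gc_epoch + 1
```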
### removes
There are no special actions to 'remove' data;
just drop the pin/root.
(In other words, this is actually a GC.)
does this feel a lot like the way you write a high performance ring buffer with mechanical sympathy?
-------------
yes yes it does
pros and cons
-------------
Couple of things are cool about this:
- :arrow_up_small: can make partial progress, without stop-the-world synchronization! Almost no global mutexes. **Can pause at any time**, and can resume reasonably even after crashes.
- okay, can only really pause at granularity of root. but that's pretty heckin good.
- by "pause" here I mean: drop all but a few words of state, and forget about any in-progress walk completely.
- the pernicious case is if the entire dataset is one massive DAG covered by a single root. but it's unclear that we need to regard this as a serious problem (or if there's much that can be done to improve this).
- a milder form of pause in the walk -- remember where you are, but cease allocating resources to progressing -- can be done at any time, with zero backsliding, because there are **no** exclusive mutexes needed during any of the walk operations.
- :arrow_double_up: almost all of the operations are happy with _atomic writes_ only -- much lower cost than mutexes.
- what!! not even CAS? Yarly: in any case where a "race" occurs, both processes in the race are still pushing the number high enough that the object won't get hit in the next GC, so it's functionally fine.
- :arrow_up_small: no synchronization or overhead at all on writes. (okay, maybe one more write op per object for the epoch number.)
Not everything is awesome:
- :arrow_down_small: This approach is not proactive about making data unavailable.
- (I think some user stories which involve billing might be less than thrilled about this.)
- :arrow_down_small: This approach is going to walk the data store *more* times than a single giant walk (unless you decide to do every root at once with a single `gc-epoch` int, which... is a tunable option, actually).
- This is the other side of the same coin as incremental progress. There's no way to buy incrementalism that doesn't do some additional work.
- _It's probably worth it._