Last-Liveness-Seen GC
=====================
This is a draft of notes about a potential garbage collection design.
:::info
:wolf: This is written in the wee hours, and has a low professionalism quotient.
Kindly accept this first draft for what it is. Future drafts would surely be needed.
:::
This GC is aimed at content-addressable data storage systems (such as IPFS) and alludes to CIDs in several places, but is probably otherwise general.
key concerns
------------
- Garbage collection should be safe to run concurrently with the addition of new data.
- Any synchronization primitives used should be of minimal cost and in minimal number.
- synchronization primitives need to be reliable in the face of crashes (or able to fall back to resumable checkpoints).
- fsync costs are predominant in practice.
- journaling systems are difficult to implement. Simpler would be better.
- synchronization primitives could have costs in space needed, and it is preferable if these are A) predictable and B) minimal.
- an integer per object is a reasonable amount of overhead. (This is for large-ish data storage, not for a programming language memory GC system that needs bit twiddling to minimize metadata to sub-word sizes.)
- Garbage collection should be able to make partial progress (in other words, be an "anytime algorithm").
- Systems under heavy load should be able to decide to deprioritize, or even kill, a garbage collection process, and backslide a minimal amount.
approximate implementation
--------------------------
### main process
- there will be three global values which need to be flushed atomically: `gc-epoch` (an int), `gc-epoch-finished` (an int), and `root-in-prog` (a CID, or possibly set of them).
- don't attempt to maintain strict reference counts; instead,
keep a (fixed length!) record next to each object which remembers the last `gc-epoch` (an int) which "saw" an object.
- when preparing to do a marking run, increment the `gc-epoch` int
- start from a root (a pin or whatever) and walk.
Mark everything with the `gc-epoch` int as follows:
- for each object visited: set its `last-seen` record to the max of the current `gc-epoch` int and the object's existing `last-seen` record.
- if the number didn't rise, no need to continue recursing over this object's links;
otherwise, if the number did rise: recurse.
- you have now updated the last-seen for a whole DAG recursively.
_this does not mean you are ready to run GC._
- update the `gc-epoch-finished` number to the `gc-epoch` number you just finished with.
(this is important so that if we crashed part-way, we could notice it,
and start the process again -- still bumping the `gc-epoch` number, but starting over from the same root, since we don't know where exactly in the walk we crashed.)
- do this process again for each root.
- (optional optimization: you may do several roots with the same epoch number, but... consider tradeoffs and maybe don't; it'll make the redo in case of a crash bigger.)
Note at no point did I say "you're done! yeet stuff!". Not yet.
- Have a stable ordering function for roots.
- A simple sorting by CID works fine for this.
- Remember the root you started on, and the `gc-epoch` int you started with there.
- Okay, I lied. This is two more pieces of state than the three vars I noted at the top.
- **Go until you've looped back to the root you started on.**
- **Now** you can do GC.
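The marking walk above can be sketched in a few lines. This is a minimal single-threaded sketch, assuming an in-memory store; the names (`Store`, `mark_from_root`, the dict shapes) are all illustrative, and the atomic-flush concerns of a real store are reduced to comments.

```python
class Store:
    def __init__(self):
        # cid -> {"last_seen": epoch int, "links": [cid, ...]}
        self.objects = {}
        self.gc_epoch = 0           # bumped before each marking run
        self.gc_epoch_finished = 0  # last epoch whose walk completed

    def put(self, cid, links=()):
        # stamped with the current epoch here for simplicity;
        # see "concurrent adds" for why a live store stamps higher
        self.objects[cid] = {"last_seen": self.gc_epoch, "links": list(links)}

    def mark_from_root(self, root):
        self.gc_epoch += 1  # in a real store, flushed atomically
        epoch = self.gc_epoch
        stack = [root]
        while stack:
            obj = self.objects[stack.pop()]
            if obj["last_seen"] >= epoch:
                continue  # the number didn't rise: subtree already seen
            # existing record < epoch, so max(epoch, existing) == epoch
            obj["last_seen"] = epoch
            stack.extend(obj["links"])
        # recorded only after the whole walk succeeds: a crash mid-walk
        # leaves gc_epoch > gc_epoch_finished, which says "redo this root"
        self.gc_epoch_finished = epoch
```

Note the early `continue` is what makes re-walks after a crash (or shared-subtree revisits) cheap: a subtree whose top already carries the current epoch is never descended into again.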
### actual gc phase
- Yeet every object which has a `last-seen` value that's less than the `gc-epoch` that the last full loop started on.
- That's it.
(Bonus: consider replacing "yeet" with "move to recycle bin",
but hopefully this is only an emergency fallback position until we're sure the kinks are all ironed out.)
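As a sketch, the sweep is a one-liner plus a delete loop. Assume here that `last_seen` is a plain dict mapping each object's CID to its last-seen epoch, and that `loop_start_epoch` is the `gc-epoch` the completed full loop over all roots started on; both names are illustrative.

```python
def sweep(last_seen, loop_start_epoch):
    """Yeet every object last seen before the full loop started."""
    dead = [cid for cid, epoch in last_seen.items()
            if epoch < loop_start_epoch]
    for cid in dead:
        del last_seen[cid]  # or: move to a recycle bin as the fallback
    return sorted(dead)
```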
### concurrent adds
... are pretty easy, you just pick another, higher `gc-epoch` number
and start writing your new entries with that.
FIXME: this has some implications about how we ratchet up the number that I think 'main process' doesn't yet account for sufficiently.
I hope this can be solved in finite state, but at most it should take a small stack.
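The stamping rule for a concurrent add is tiny. A sketch, with the same illustrative dict-of-epochs shape as above (`gc_epoch` would be an atomically readable int in a real store):

```python
def add_object(last_seen, gc_epoch, cid):
    # Stamp strictly above the marker's current epoch, so even a sweep
    # keyed to the in-progress marking run can never collect this object.
    last_seen[cid] = gc_epoch + 1
```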
### removes
There are no special actions to 'remove' data;
just drop the pin/root.
(In other words, this is actually a GC.)
does this feel a lot like the way you write a high performance ring buffer with mechanical sympathy?
-------------
yes yes it does
pros and cons
-------------
Couple of things are cool about this:
- :arrow_up_small: can make partial progress, without stop-the-world synchronization! Almost no global mutexes. **Can pause at any time**, and can resume reasonably even after crashes.
- okay, can only really pause at granularity of root. but that's pretty heckin good.
- by "pause" here I mean: drop all but a few words of state, and forget about any in-progress walk completely.
- the pernicious case is if the entire dataset is one massive DAG covered by a single root. but it's unclear that we need to regard this as a serious problem (or if there's much that can be done to improve this).
- a milder form of pause in the walk -- remember where you are, but cease allocating resources to progressing -- can be done at any time, with zero backsliding, because there are **no** exclusive mutexes needed during any of the walk operations.
- :arrow_double_up: almost all of the operations are happy with _atomic writes_ only -- much lower cost than mutexes.
- what!! not even CAS? Yarly: in any case where a "race" occurs, both processes in the race are still pushing the number high enough that the object won't get hit in the next GC, so it's functionally fine.
- :arrow_up_small: no synchronization or overhead at all on writes. (okay, maybe one more write op per object for the epoch number.)
Not everything is awesome:
- :arrow_down_small: This approach is not proactive about making data unavailable.
- (I think some user stories which involve billing might be less than thrilled about this.)
- :arrow_down_small: This approach is going to walk the data store *more* times than a single giant walk (unless you decide to do every root at once with a single `gc-epoch` int, which... is a tunable option, actually).
- This is the other side of the same coin as incremental progress. There's no way to buy incrementalism that doesn't do some additional work.
- _It's probably worth it._