diff options
Diffstat (limited to 'doc/concepts/garbage.org')
-rw-r--r-- | doc/concepts/garbage.org | 82 |
1 files changed, 82 insertions, 0 deletions
diff --git a/doc/concepts/garbage.org b/doc/concepts/garbage.org new file mode 100644 index 00000000..26f6cc51 --- /dev/null +++ b/doc/concepts/garbage.org @@ -0,0 +1,82 @@ +* Garbage Collection + +For every build, for all non-failed actions an entry is created in +the action cache and the corresponding artifacts are stored in the +CAS. So, over time, a lot of files accumulate in the local build +root. Hence we have a way to reclaim disk space while keeping the +benefits of having a cache. This operation is referred to as garbage +collection and usually uses the heuristics to keeping what is most +recently used. Our approach follows this paradigm as well. + +** Invariants assumed by our build system + +Our tool assumes several invariants on the local build root, that we +need to maintain during garbage collection. Those are the following. +- If an artifact is referenced in any cache entry (action cache, + target-level cache), then the corresponding artifact is in CAS. +- If a tree is in CAS, then so are its immediate parts (and hence + also all transitive parts). + + +** Generations of cache and CAS + +In order to allow garbage collection while keeping the desired +invariants, we keep several (currently two) generations of cache +and CAS. Each generation in itself has to fulfill the invariants. +The effective cache or CAS is the union of the caches or CASes of +all generations, respectively. Obviously, then the effective cache +and CAS fulfill the invariants as well. + +The actual ~gc~ command rotates the generations: the oldest +generation is be removed and the remaining generations are moved +one number up (i.e., currently the young generation will simply +become the old generation), implicitly creating a new, empty, +youngest generation. As an empty generation fulfills the required +invariants, this operation preservers the requirement that each +generation individually fulfill the invariants. + +All additions are made to the youngest generation; in order to keep +the invariant, relevant entries only present in an older generation +are also added to the youngest generation first. Moreover, whenever +an entry is referenced in any way (cache hit, request for an entry +to be in CAS) and is only present in an older generation, it is +also added to the younger generation, again adding referenced +parts first. As a consequence, the youngest generation contains +everything directly or indirectly referenced since the last garbage +collection; in particular, everything referenced since the last +garbage collection will remain in the effective cache or CAS upon +the next garbage collection. + +These generations are stored as separate directories inside the +local build root. As the local build root is, starting from an +empty directory, entirely managed by `just` and compatible tools, +generations are on the same file system. Therefore the adding of +old entries to the youngest generation can be implemented in an +efficient way by using hard links. + +The moving up of generations can happen atomically by renaming the +respective directory. Also, the oldest generation can be removed +logically by renaming a directory to a name that is not searched +for when looking for existing generations. The actual recursive +removal from the file system can then happen in a separate step +without any requirements on order. + +** Parallel operations in the presence of garbage collection + +The addition to cache and CAS can continue to happen in parallel; +that certain values are taken from an older generation instead +of freshly computed does not make a difference for the youngest +generation (which is the only generation modified). But build +processes assume they don't violate the invariant if they first +add files to CAS and later a tree or cache entry referencing them. +This, however, only holds true if no generation rotation happens in +between. To avoid those kind of races, we make processes coordinate +over a single lock for each build root. +- Any build process keeps a shared lock for the entirety of the build. +- The garbage collection process takes an exclusive lock for the + period it does the directory renames. +We consider it acceptable that, in theory, local build processes +could starve local garbage collection. Moreover, it should be noted +that the actual removal of no-longer-needed files from the file +system happens without any lock being held. Hence the disturbance +of builds caused by garbage collection is small. |