# Handling of large blobs
## Current situation and shortcomings
When building locally, all intermediate artifacts end up in the
local CAS. Depending on the nature of the artifacts built, this
can include large ones (like disk images) that differ only in small
parts from the ones created in previous builds of a code base with
only small changes. In this way, the disk can fill up quite
quickly. Of course, space can always be reclaimed by `just gc`,
but doing so more often limits how far back the cache reaches.
## Proposed changes
We propose to use the same solution that is already used for reducing
traffic when downloading large final artifacts for installing: there,
large blobs are split in a content-defined way and only chunks not
already known have to be transferred. Similarly, when storing the
chunks that large blobs are composed of, duplicate chunks have to be
stored only once.
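
To illustrate the idea of content-defined splitting, the following sketch
(an illustration of the general technique only, not the splitting
algorithm actually used, with hypothetical window and mask parameters)
places a chunk boundary wherever a hash of a small sliding window of the
content matches a fixed bit pattern; a small local edit therefore only
changes the chunks around it, while all other chunks keep their hashes.

```python
import hashlib

WINDOW = 48           # bytes of context used to decide on a boundary
MASK = (1 << 13) - 1  # on average one boundary every ~8 KiB

def split_chunks(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries."""
    if not data:
        return []
    chunks, start = [], 0
    for pos in range(WINDOW, len(data)):
        window_hash = hashlib.blake2b(data[pos - WINDOW:pos],
                                      digest_size=8).digest()
        if int.from_bytes(window_hash, "big") & MASK == 0:
            chunks.append(data[start:pos])   # boundary found: close the chunk
            start = pos
    chunks.append(data[start:])              # final (possibly short) chunk
    return chunks
```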
### New store: large-objects CAS
A new form of CAS is added, called large-objects CAS. It follows
the same generation regime as the regular CAS; more precisely,
next to the already existing `casf`, `casx`, and `cast`, two new
entries are added, `cas-large-f` and `cas-large-t`, where the
latter is only available in native mode (i.e., when trees are hashed
differently than blobs).
The entries in the large-objects CAS are keyed by the hash of the
large object, and the value of an entry is the concatenation of the
hashes of the chunks the large object is composed of. An entry in
the large-objects CAS promises
- that the chunks the large object is composed of are in the main
CAS, more precisely in `casf` of the same generation,
- that the concatenation of the specified chunks indeed gives the
requested object,
- that, if the object is a tree, its parts are also in the same
generation, in the main or large-objects CAS, and
- that the object is strictly larger than the maximal size a chunk
obtained by splitting can have.
Here, the last promise ensures that the chunks of a large object can
never themselves be replaced by a large-objects entry later on.
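
For concreteness, the following sketch shows what such an entry could
look like on disk, assuming 32-byte (SHA-256) hashes and a hypothetical
directory holding the large-objects CAS; the value is just the
concatenation of the chunk hashes in order. This illustrates the
described format, not the tool's actual on-disk layout.

```python
from pathlib import Path

HASH_LEN = 32  # raw SHA-256 digests, concatenated without a separator

def write_large_entry(cas_large: Path, object_hash: str,
                      chunk_hashes: list[str]) -> None:
    # The entry is keyed by the hash of the large object; its value is
    # the concatenation of the hashes of its chunks, in order.
    (cas_large / object_hash).write_bytes(
        b"".join(bytes.fromhex(h) for h in chunk_hashes))

def read_large_entry(cas_large: Path, object_hash: str) -> list[str]:
    raw = (cas_large / object_hash).read_bytes()
    return [raw[i:i + HASH_LEN].hex() for i in range(0, len(raw), HASH_LEN)]
```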
### Using objects from the large-objects CAS
Whenever an object is not found in the main CAS, the large-objects
CAS is inspected. If the object is found there, the following steps
are carried out, in this order:
- if the entry is not already in the youngest generation, the chunks
are promoted to the youngest generation,
- the object itself is spliced on disk in a temporary file,
- if the object is a tree, the parts are promoted to the youngest
generation (only necessary if the large-object entry was not
found in the youngest generation anyway),
- the large object is added to the respective main CAS, and finally
- the large-objects entry is added to the youngest generation (if
not already there).
Promoting the chunks ensures that they are already present in the
youngest generation at almost no storage cost, as promotion is
implemented using hard links. Therefore, the overhead of splitting
that blob again later is minimal.
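
The following sketch restates these steps for a plain blob over a
simplified, hypothetical on-disk model (one directory per generation
with `casf` and `cas-large-f` subdirectories); the additional
promotion of tree parts is omitted for brevity, and none of the names
reflect the actual implementation.

```python
import os
import tempfile
from pathlib import Path

def promote(src_gen: Path, dst_gen: Path, sub: str, key: str) -> None:
    # Promotion is implemented via hard links, so it costs almost no space.
    dst = dst_gen / sub / key
    if not dst.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        os.link(src_gen / sub / key, dst)

def splice_from_large_cas(youngest: Path, found_in: Path,
                          object_hash: str, chunk_hashes: list[str]) -> Path:
    in_youngest = (found_in == youngest)
    # Promote the chunks first (only needed for an older generation).
    if not in_youngest:
        for h in chunk_hashes:
            promote(found_in, youngest, "casf", h)
    # Splice the object on disk into a temporary file.
    with tempfile.NamedTemporaryFile(dir=youngest, delete=False) as tmp:
        for h in chunk_hashes:
            tmp.write((youngest / "casf" / h).read_bytes())
    # (For a tree, its parts would additionally be promoted here.)
    # Add the spliced object to the main CAS of the youngest generation ...
    target = youngest / "casf" / object_hash
    Path(tmp.name).replace(target)
    # ... and finally record the large-objects entry there as well.
    if not in_youngest:
        promote(found_in, youngest, "cas-large-f", object_hash)
    return target
```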
### Blob splitting uses large-objects CAS as cache
When `just execute` is asked to split an object, first the
large-objects CAS is inspected. If the entry is present in the
youngest generation, the answer is directly served from there;
if found in an older generation, it is served from there after
appropriate promotion (chunks; parts of the tree, if the object to
split was a tree; large-objects entry) to the youngest generation.
When `just execute` actually performs a split and the object that
was to be split was larger than the maximal size of a chunk,
after having added the chunks to the main CAS, it will also write
a corresponding entry to the large-objects CAS. In this way,
subsequent requests for the same object can be served from there
without having to inspect the object again.
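
As an illustration, serving a split request with the large-objects CAS
as a cache could look as follows, reusing `split_chunks`,
`write_large_entry`, and `read_large_entry` from the sketches above;
the lookup in older generations (with subsequent promotion) is omitted
for brevity, and `MAX_CHUNK_SIZE` is a hypothetical constant.

```python
import hashlib
from pathlib import Path

MAX_CHUNK_SIZE = 1 << 20  # hypothetical upper bound on a chunk's size

def split_with_cache(youngest: Path, object_hash: str) -> list[str]:
    cas_large = youngest / "cas-large-f"
    if (cas_large / object_hash).exists():
        # Cache hit: serve the answer directly from the large-objects CAS.
        return read_large_entry(cas_large, object_hash)
    # Cache miss: actually split the object and store its chunks.
    data = (youngest / "casf" / object_hash).read_bytes()
    chunk_hashes = []
    for chunk in split_chunks(data):
        h = hashlib.sha256(chunk).hexdigest()
        (youngest / "casf" / h).write_bytes(chunk)  # add chunk to the main CAS
        chunk_hashes.append(h)
    if len(data) > MAX_CHUNK_SIZE:
        # Record the split so later requests are served from the cache.
        write_large_entry(cas_large, object_hash, chunk_hashes)
    return chunk_hashes
```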
Similarly, if a blob is split in order to be transferred to an
endpoint that supports blob splicing, a corresponding entry is
added to the large-objects CAS (provided the original blob was
larger than the maximal size of a chunk; however, we transfer via
blob splicing only for such objects anyway).
### Compactification of a CAS generation
During garbage collection, while already holding the exclusive
garbage-collection lock, the following compactification steps are
performed on the youngest generation before doing the generation
rotation.
- For every entry in the large-objects CAS the corresponding entries
in the main CAS are removed (for an entry in `cas-large-f` the
entries in both `casf` and `casx` are removed, as files and
executable files are both hashed the same way, namely as blobs).
- For every entry in the main CAS that is larger than the
compactification threshold, the object is split, the chunks are
added to the main CAS, the list of chunks the object is composed
of is added to the large-objects CAS, and finally the object is
removed from the main CAS.
It should be noted that these steps do not modify the objects
that can be obtained from that CAS generation. In particular, all
invariants are kept.
As compactification threshold we choose a fixed value larger than
the maximal size a chunk obtained from splitting can have. More
precisely, we take the maximal size at which a blob can still be
transferred via `grpc` without having to use the streaming interface,
i.e., we choose 2MB. In this way, the blobs are already split
correctly for transfer to an endpoint that supports blob splicing.
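
Putting both compactification steps together over the same hypothetical
on-disk model as in the earlier sketches (reusing `split_chunks` and
`write_large_entry`), a pass over the youngest generation could look as
follows; handling of `casx`, `cast`, and trees is omitted for brevity.

```python
import hashlib
from pathlib import Path

COMPACTIFICATION_THRESHOLD = 2 * 1024 * 1024  # 2 MB, see above

def compactify(youngest: Path) -> None:
    casf, cas_large = youngest / "casf", youngest / "cas-large-f"
    # Step 1: remove main-CAS entries that are already represented by a
    # large-objects entry (the corresponding `casx` entry, removed in the
    # real procedure as well, is not modelled here).
    for entry in cas_large.iterdir():
        (casf / entry.name).unlink(missing_ok=True)
    # Step 2: split every remaining entry above the threshold; keep only
    # its chunks and the corresponding large-objects entry.
    for blob in list(casf.iterdir()):
        if blob.stat().st_size > COMPACTIFICATION_THRESHOLD:
            chunk_hashes = []
            for chunk in split_chunks(blob.read_bytes()):
                h = hashlib.sha256(chunk).hexdigest()
                (casf / h).write_bytes(chunk)
                chunk_hashes.append(h)
            write_large_entry(cas_large, blob.name, chunk_hashes)
            blob.unlink()
```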
These compactification steps are also carried out if the `--no-rotate`
option is given to `gc`.