Blob Splitting Protocol
=======================

Introduction
------------

Due to the generic nature of our build system, we do not impose any
restrictions on the type of artifacts that are generated by actions or
that are inputs to actions. Thus, artifacts might be large executables,
libraries, a compiled set of slides in PDF format, or even whole
file-system images. Such artifacts result in large blobs, which need to
be fetched from a remote CAS if they are to be inspected or used
locally. Depending on the network connection, this might imply a
significant waiting time until the complete artifact is downloaded, as
well as a lot of network traffic. This situation is especially painful
in the case of short modification-inspection cycles: for each small
modification of the sources, the complete artifact needs to be
downloaded, even though possibly only a small fraction of the compiled
binary has changed.

Thus, we introduced a blob splitting API as a conservative extension to
the original [remote execution
API](https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/remote_execution.proto),
which allows splitting the binary data of a blob into chunks and then
transmitting only the modified chunks instead of the whole blob data.
This reduces traffic between the remote server and the local host and
avoids unnecessary waiting times for users frequently working on large
artifacts. The downside of this API is increased storage consumption,
since the binary data of split artifacts is stored twice.

Remote execution API extension
------------------------------

We extended the existing remote execution API by introducing a new
remote procedure call (rpc) `SplitBlob` to the
`ContentAddressableStorage` service, which splits a blob into chunks,
along with the respective request and response messages, and by adding
a flag to the `CacheCapabilities` message that indicates whether blob
splitting is supported by the server instance. The following code
snippet depicts the required extensions to the remote execution
protocol:

    service ContentAddressableStorage {
      ...
      rpc SplitBlob(SplitBlobRequest) returns (SplitBlobResponse) {}
    }

    message SplitBlobRequest {
      string instance_name = 1;
      Digest blob_digest = 2;
    }

    message SplitBlobResponse {
      repeated Digest chunk_digests = 1;
    }

    message CacheCapabilities {
      ...
      bool blob_split_support = <free field number>;
    }

### Contract between client and server

The contract between the client and the server for this protocol
extension is as follows: if a client requests to split a blob at the
remote execution server, the server answers with a list of chunk
digests and gives two promises: first, all referenced chunks are
available in its CAS for download; second, the concatenation of the
returned chunks in the given order results in the original blob the
client requested.
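
Expressed as executable checks, the promise reads as follows (a minimal
Python sketch; `fetch` is a hypothetical helper returning a blob's
bytes from the server's CAS and is not part of the API):

    def check_split_contract(fetch, blob_digest, chunk_digests):
        # Promise 1: every returned chunk digest is fetchable from the CAS.
        chunks = [fetch(digest) for digest in chunk_digests]
        # Promise 2: concatenation in the given order restores the blob.
        assert b"".join(chunks) == fetch(blob_digest)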

The client does not give any promise to the server; it is free not to
use the blob splitting rpc. However, to actually benefit from the
protocol extension (saving traffic), a meaningful client behavior is to
request the chunk digests of a blob from the server, to fetch only
those chunks that are missing in its local CAS, and to store the
fetched chunks as well as the reconstructed original blob in its local
CAS. If the client requests to split a blob, but the server does not
support blob splitting or an error occurred during the request, the
client falls back to fetching the entire blob.

Blob split procedure
--------------------

### Server side

When the server receives a request to split a blob via the `SplitBlob`
rpc, it first checks whether the blob given by the `blob_digest` from
the `SplitBlobRequest` message exists in its CAS. If not, the status
code `google.rpc.Code.NOT_FOUND` is returned. Otherwise, it loads the
blob data and splits it into chunks according to the implemented data
chunking algorithm. As explained later, there are different
implementation options for this algorithm, but they all must ensure one
property: the concatenation of the chunks results in the original blob.
After the blob is split, the server computes a digest for each chunk
according to the configured digest function and stores each chunk that
is not yet present in its CAS. If an error occurs while storing the
chunks in the CAS due to storage shortage, the status code
`google.rpc.Code.RESOURCE_EXHAUSTED` is returned. Otherwise, the chunk
digests are collected in the `chunk_digests` field of a
`SplitBlobResponse` message in the same order as the chunks occur
within the binary blob data, and the message is sent back to the client
with the status code `google.rpc.Code.OK`.

> Note: Since the same data is basically stored twice for each blob, the
> storage consumption for the remote server is roughly doubled. Some
> savings occur when chunks are reused in more than one blob or even
> several times in the same blob.
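
As an illustration, here is a minimal Python sketch of the server-side
split procedure; `cas.contains`, `cas.get`, `cas.put`, and the
`split_into_chunks` parameter are hypothetical helpers, not part of the
protocol, and digests are abbreviated to plain SHA-256 hex strings for
brevity:

    import hashlib

    class NotFoundError(Exception):
        """Maps to status code google.rpc.Code.NOT_FOUND."""

    def handle_split_blob(cas, blob_digest, split_into_chunks):
        if not cas.contains(blob_digest):
            raise NotFoundError(blob_digest)
        data = cas.get(blob_digest)
        chunk_digests = []
        # Any chunking algorithm works, as long as the chunks concatenate
        # back to `data` (see the chunking section below).
        for chunk in split_into_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            if not cas.contains(digest):
                cas.put(digest, chunk)  # on storage shortage: RESOURCE_EXHAUSTED
            chunk_digests.append(digest)
        # The order of the digests matches the order of the chunks in the blob.
        return chunk_digests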

### Client side

If a client wants to take advantage of blob splitting, it requests to
split a blob into chunks at the remote side by calling the `SplitBlob`
rpc from the `ContentAddressableStorage` service with a
`SplitBlobRequest` message containing the `blob_digest` of the
respective blob. If the status code returned by the `SplitBlob` rpc
indicates an error, the split operation failed and the client needs to
fall back to fetching the entire blob. Otherwise, blob splitting was
successful, and the remote server returns an ordered list of
`chunk_digests`, guaranteeing that the chunks are available in its CAS
and that the concatenation of all chunks in the given order results in
the requested blob. The client checks which chunks are available
locally, fetches only the locally missing chunks from the remote side
via the `BatchReadBlobs` rpc or the `ByteStream` API, and assembles the
requested blob by concatenating the binary data of its chunks. Finally,
it stores the chunks as well as the resulting blob in its local CAS.

> Note: Since the same data is basically stored twice for each blob, the
> local storage consumption is roughly doubled. Some savings occur when
> chunks are reused in more than one blob or even several times in the
> same blob.
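
The corresponding client-side logic, again as a hedged Python sketch
with the same hypothetical `cas` helpers; `remote.split_blob` and
`remote.read_blob` stand in for the `SplitBlob` rpc and the
`BatchReadBlobs`/`ByteStream` transfers and are not actual API names:

    def fetch_blob(remote, cas, blob_digest):
        try:
            chunk_digests = remote.split_blob(blob_digest)  # SplitBlob rpc
        except Exception:
            # No split support at the server or the rpc failed:
            # fall back to fetching the entire blob.
            data = remote.read_blob(blob_digest)
            cas.put(blob_digest, data)
            return data
        # Fetch only the chunks missing in the local CAS.
        for digest in chunk_digests:
            if not cas.contains(digest):
                cas.put(digest, remote.read_blob(digest))
        # Reassemble in the given order and keep the result locally.
        data = b"".join(cas.get(digest) for digest in chunk_digests)
        cas.put(blob_digest, data)
        return data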

Compatibility issues
--------------------

We aim to get this protocol extension into the official remote
execution API mentioned above. Once this is done and field numbers are
assigned, servers that have upgraded to the new API version inform
upgraded clients whether they support blob splitting by setting the
`blob_split_support` field in the `CacheCapabilities` message
accordingly. A server only provides the aforementioned guarantee to
clients once it has announced support for blob splitting via this
field.

An old client can communicate with a server implementing the new API
without any modification. Nobody is forced to use the blob splitting
functionality, and unknown fields are simply ignored on the client
side. A client implementing the new API can communicate with an old
server by evaluating the `blob_split_support` field, which will be set
to its default value `false` on the client side.

Until the protocol extension is officially accepted, the field number
for the `blob_split_support` field is not known. In this case, early
implementers can use a _trial request_ to determine whether the remote
server supports blob splitting. This allows implementing a prototypical
client and server employing blob splitting without requiring the
protocol extension to be officially accepted. It also allows new
clients to communicate with old servers: they still behave correctly
and do not need to be adapted. The client implementation needs to be
rebuilt once a field number has been assigned.
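
Such a trial request could look as follows (a Python sketch on top of
`grpc`; `SplitBlobRequest` and the stub are assumed to come from code
generated for the extended protocol above, and the choice of probe blob
is an assumption):

    import grpc

    def probe_blob_split_support(stub, instance_name, known_digest):
        # Probe with a blob that is known to exist in the remote CAS,
        # e.g., one the client has just uploaded.
        request = SplitBlobRequest(instance_name=instance_name,
                                   blob_digest=known_digest)
        try:
            stub.SplitBlob(request)
            return True
        except grpc.RpcError as e:
            # An old server answers an unknown rpc with UNIMPLEMENTED.
            if e.code() == grpc.StatusCode.UNIMPLEMENTED:
                return False
            raise  # other errors say nothing about split support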

Data chunking algorithm
-----------------------

As stated earlier, a blob split request from a client is answered by
the server with the promise that the concatenation of the returned
chunks results in the requested blob. While this property is guaranteed
by the server, it says nothing about the quality of the split result.
It is desirable to generate chunks that are likely to be known by the
client, because then less data needs to be transferred.

A trivial implementation of the split operation would be to return a
_singleton list_ containing the original blob as its single chunk.
While this implementation is correct and fast, it is not very useful,
since it does not save any traffic.
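
In code, this degenerate splitter is a one-liner:

    def singleton_split(blob):
        # Correct (the concatenation property holds trivially),
        # but no traffic is saved.
        return [blob]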

A better approach is _fixed-block chunking_, where the data is split
into chunks of fixed, equal size. While this approach typically results
in more than one chunk and improves the probability of saving some
traffic, it suffers from the data shifting problem: the insertion of a
single character at the beginning of a data stream shifts the entire
data while the chunk boundaries stay fixed. Thus, completely new
chunks, unknown to the client, are created even though the data
patterns are mostly similar.
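
The following Python snippet illustrates fixed-block chunking and its
shifting problem (the block size of 4 bytes is chosen only for the
example):

    def fixed_block_chunks(data, size=4):
        # Cut the data into equally sized blocks; the last one may be shorter.
        return [data[i:i + size] for i in range(0, len(data), size)]

    print(fixed_block_chunks(b"abcdefghijkl"))
    # [b'abcd', b'efgh', b'ijkl']
    print(fixed_block_chunks(b"Xabcdefghijkl"))  # one byte inserted in front
    # [b'Xabc', b'defg', b'hijk', b'l'] -- no chunk is shared with the above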

A more intelligent way of splitting blobs is to select chunk boundaries
based on locations in the content of the data, such that splitting a
slightly modified blob results in mostly identical chunks, insensitive
to the data shifting problem of fixed-block chunking. Since the
resulting chunk sizes typically vary, this approach is called
_variable-block chunking_; it is explained in the following section.

### Variable-block chunking

In variable-block chunking, the borders of the chunks are
content-defined and shift with the data pattern. This can be achieved
by computing hash values of data blocks of a specific size at every
byte position in the data stream and matching those hash values against
a predefined pattern. In case of a match, that byte position is
declared a chunk boundary. This approach avoids the data shifting
problem of fixed-block chunking and ensures that unmodified data more
than a block size away from the changes will have the same boundaries.
A possible pattern to be matched is all zeroes in the lower `n` bits;
since a uniformly distributed hash value matches with probability
`2^-n`, this results in chunks with an average size of `2^n` bytes.
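
The boundary test for such a pattern amounts to a single mask
comparison, e.g.:

    def is_chunk_boundary(hash_value, n=16):
        # Declare a boundary where the lower n bits of the hash are all zero;
        # this yields an expected chunk size of 2**n bytes (64 KiB for n=16).
        return (hash_value & ((1 << n) - 1)) == 0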

### Rolling hashes

A common technique to efficiently compute hash values of consecutive
bytes at every byte position in the data stream is to use a rolling
hash function. A rolling hash function makes it cheap to compute the
hash value of a chunk of bytes of a specific size at byte position `i`
from the hash value of the data chunk of the same size at byte position
`i-1`. Different plain rolling hash functions are available for
implementation, such as a [polynomial rolling
hash](https://ieeexplore.ieee.org/document/5390135), the [Rabin
fingerprint](http://www.cs.cmu.edu/~15-749/READINGS/optional/rabin1981.pdf),
or a [cyclic polynomial rolling
hash](https://dl.acm.org/doi/abs/10.1145/256163.256168) (also called
Buzhash). However, the [fast rolling Gear hash
algorithm](https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia)
(as employed by FastCDC) has proven to be very compute-efficient and
faster than the other rolling hash algorithms for content-based
chunking of a data stream, while achieving deduplication ratios similar
to the Rabin fingerprint.
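
To make the scheme concrete, here is a hedged Python sketch of
Gear-based variable-block chunking; the random table and the parameters
are illustrative and not the reference constants of any of the cited
papers:

    import random

    random.seed(0)  # fixed seed: both sides must use the same table
    GEAR = [random.getrandbits(64) for _ in range(256)]

    def gear_chunks(data, mask=(1 << 16) - 1):
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            # Gear update: shift the state and add a per-byte random value.
            # The shift makes bytes older than 64 positions fall out of the
            # hash, so no explicit removal of the oldest byte is needed.
            h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
            if (h & mask) == 0:  # content-defined boundary
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])  # remainder after the last boundary
        return chunks

    # The property the protocol relies on holds by construction:
    # b"".join(gear_chunks(blob)) == blob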