 doc/future-designs/blob-splitting.md | 225
 1 file changed, 225 insertions(+), 0 deletions(-)
diff --git a/doc/future-designs/blob-splitting.md b/doc/future-designs/blob-splitting.md
new file mode 100644
index 00000000..598996b8
--- /dev/null
+++ b/doc/future-designs/blob-splitting.md
@@ -0,0 +1,225 @@

Blob Splitting Protocol
=======================

Introduction
------------

Due to the generic nature of our build system, we do not impose any
restrictions on the type of artifacts that are generated by actions or
that are inputs to actions. Thus, artifacts might be large executables,
libraries, a compiled set of slides in PDF format, or even whole file
system images. Such artifacts result in large blobs, which need to be
fetched from a remote CAS if they are to be inspected or used locally.
Depending on the network connection, this might imply a significant
waiting time until the complete artifact is downloaded, as well as a
lot of network traffic. This situation is especially painful in the
case of short modification-inspection cycles: for each small
modification of the sources, the complete artifact needs to be
downloaded, even though possibly only a small fraction of the compiled
binary artifact has changed.

Thus, we propose a blob splitting strategy as a conservative extension
to the original [remote execution
API](https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/remote_execution.proto),
which allows splitting the binary data of a blob into chunks and
transmitting only the modified chunks instead of the whole blob data.
The goal is to reduce traffic between the local host and the remote
server and to avoid unnecessary waiting times for users frequently
working on large artifacts, at the cost of increased storage
consumption.

Remote execution API extension
------------------------------

We extend the existing remote execution API by introducing a new remote
procedure call (rpc) to the `ContentAddressableStorage` service to split
a blob into chunks, along with the respective request and response
messages, and by adding a flag to the `ServerCapabilities` message to
indicate whether blob splitting is supported by the server or not. The
following code snippet depicts the proposed extensions to the remote
execution protocol:

    service ContentAddressableStorage {
      ...
      rpc SplitBlob(SplitBlobRequest) returns (SplitBlobResponse) {}
    }

    message SplitBlobRequest {
      string instance_name = 1;
      Digest digest = 2;
      DigestFunction.Value digest_function = 3;
    }

    message SplitBlobResponse {
      repeated Digest digests = 1;
      google.rpc.Status status = 2;
    }

    message ServerCapabilities {
      ...
      bool blob_split_support = <free field number>;
    }

### Contract between client and server

The contract between the client and the server for this protocol
extension is as follows: if a client requests to split a blob at the
remote execution server, the server answers with a list of chunk
digests and promises that, first, all referenced blobs are available in
its CAS to be downloaded and, second, the concatenation of the returned
blobs in the given order will result in the original blob the client
has requested.
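To make this promise concrete, the following sketch shows the check a
client could perform on a split result. It assumes SHA-256 as the
digest function and a hypothetical `fetch_blob` helper that downloads a
blob by its digest; both are illustrative assumptions, not part of the
protocol.

    import hashlib

    def verify_split(blob_digest_hash, chunk_digests, fetch_blob):
        # chunk_digests is the ordered list from SplitBlobResponse.digests;
        # fetch_blob is a hypothetical helper returning a blob's binary data.
        data = b"".join(fetch_blob(d) for d in chunk_digests)
        # The concatenation must hash back to the originally requested
        # digest (assuming SHA-256 as digest function).
        return hashlib.sha256(data).hexdigest() == blob_digest_hash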
The client does not give any promise to the server; it is free to not
use the blob splitting rpc. However, in order to benefit from the
protocol extension (saving traffic), a meaningful behavior of the
client would be to request the chunk digests of a blob from the server,
to fetch only those chunks that are missing in its local CAS, and to
store the fetched chunks as well as the reconstructed original blob in
its local CAS. If the client requests to split a blob, but the server
does not support blob splitting or an error occurred during the
request, the client falls back to fetching the entire blob.

Blob split procedure
--------------------

### Server side

When the server receives a request to split a blob via the `SplitBlob`
rpc, it first checks whether the blob given by the `digest` field of
the `SplitBlobRequest` message exists in its CAS. If not, an error is
returned by sending back a `SplitBlobResponse` message with the
`status.code` field set to `google.rpc.Code.NOT_FOUND`. Otherwise, the
server loads the blob data and splits it into chunks according to the
implemented data chunking algorithm. As explained later, there are
different implementation options for this algorithm, but they all must
ensure one property: the concatenation of the chunks results in the
original blob. After the blob has been split, the server computes a
digest for each chunk according to the `digest_function` specified in
the `SplitBlobRequest` message and stores each chunk that is not yet
present in its CAS. If an error occurs while storing the chunks in the
CAS, an error is returned by sending back a `SplitBlobResponse` message
with the `status.code` field set to `google.rpc.Code.INTERNAL`.
Otherwise, the chunk digests are collected in the `digests` field of a
`SplitBlobResponse` message in the same order as the chunks occur
within the blob data and sent back to the client with `status.code` set
to `google.rpc.Code.OK`.

> Note: Since the same data is basically stored twice for each blob,
> the storage consumption for the remote server is roughly doubled.
> Some savings occur when chunks are reused in more than one blob or
> even several times within the same blob.

### Client side

If a client wants to take advantage of blob splitting, it requests the
remote side to split a blob into chunks by calling the `SplitBlob` rpc
of the `ContentAddressableStorage` service with a valid
`SplitBlobRequest` message containing the respective blob `digest` and
`digest_function`. If the `status` field of the returned
`SplitBlobResponse` message contains an error, the split operation
failed and the client needs to fall back to fetching the entire blob.
Otherwise, blob splitting was successful, and the remote server returns
an ordered list of chunk `digests` and guarantees that the chunks are
available in its CAS and that the concatenation of all chunks in the
given order will result in the requested blob. The client checks which
chunks are available locally, fetches only the locally missing chunks
from the remote side via the `BatchReadBlobs` rpc or the `ByteStream`
API, and reconstructs the requested blob by concatenating the binary
data of its chunks. Finally, it stores the chunks as well as the
resulting blob in its local CAS.

> Note: Since the same data is basically stored twice for each blob,
> the local storage consumption is roughly doubled. Some savings occur
> when chunks are reused in more than one blob or even several times
> within the same blob.
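In code, the client-side procedure could look like the following
sketch. The `local_cas` and `remote` objects and their methods
(`split_blob`, `read_blob`, `read_blobs`, `has`, `get`, `put`) are
purely illustrative placeholders for the CAS and rpc abstractions of a
concrete client implementation; error handling is kept minimal.

    def fetch_via_split(blob_digest, local_cas, remote):
        # Ask the server to split the blob; on any error, fall back to
        # fetching the entire blob.
        response = remote.split_blob(blob_digest)
        if response.status.code != 0:  # google.rpc.Code.OK
            return remote.read_blob(blob_digest)

        # Fetch only the chunks that are missing in the local CAS and
        # store them locally.
        missing = [d for d in response.digests if not local_cas.has(d)]
        for digest, data in remote.read_blobs(missing):
            local_cas.put(digest, data)

        # Reconstruct the blob by concatenating the chunks in the given
        # order and store the result in the local CAS as well.
        blob = b"".join(local_cas.get(d) for d in response.digests)
        local_cas.put(blob_digest, blob)
        return blob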
Compatibility issues
--------------------

We aim to get this protocol extension into the official remote
execution API mentioned above. Once this is done and field numbers are
assigned, servers that have upgraded to the new API version inform
clients that have also upgraded whether they support blob splitting or
not by setting the `blob_split_support` field in the
`ServerCapabilities` message accordingly. A server only provides the
aforementioned guarantee to clients once it has announced support for
blob splitting using this field.

An old client can communicate with a server implementing the new API
without any modification. Nobody is forced to use the blob splitting
functionality, and unknown fields are just ignored at the client side.
A client implementing the new API can communicate with an old server by
evaluating the `blob_split_support` field, which will be set to its
default value `false` at the client side.

Until the protocol extension is officially accepted, the field number
for the `blob_split_support` field is not known. In this case, early
implementors can use a _trial request_ to determine whether the remote
server supports blob splitting or not. This makes it possible to
implement a prototypical client and server employing blob splitting
without requiring the protocol extension to be officially accepted. It
also allows new clients to communicate with old servers; they still
behave correctly and do not need to be adapted. The client
implementation needs to be rebuilt once a field number has been
assigned.

Data chunking algorithm
-----------------------

As stated earlier, a blob split request from a client is answered by
the server with the promise that the concatenation of the returned
chunks results in the requested blob. While this property is guaranteed
by the server, it does not say anything about the quality of the split
result. It is desirable to generate chunks that are likely to be known
by the client, because then less data needs to be transferred.

A trivial implementation of the split operation would be to return a
_singleton list_ containing the original blob as its single chunk.
While this implementation is correct and fast, it is not very useful,
since it does not save any traffic.

A better approach is _fixed-block chunking_, where the data is split
into chunks of fixed and equal size. While this approach typically
results in more than one chunk and improves the chances of saving some
traffic, it suffers from the data shifting problem: the insertion of a
single character at the beginning of a data stream shifts the entire
data while the chunk boundaries stay fixed. Thus, completely new
chunks, unknown to the client, are created even though the data
patterns are mostly similar, as illustrated by the sketch below.
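A minimal sketch of fixed-block chunking illustrating the data shifting
problem (the block size of 4 bytes is chosen purely for illustration):

    def fixed_block_chunks(data, block_size=4):
        # Split the data into consecutive blocks of fixed, equal size.
        return [data[i:i + block_size]
                for i in range(0, len(data), block_size)]

    original = b"abcdefghijkl"
    modified = b"Xabcdefghijkl"  # one byte inserted at the front

    # Every boundary shifts, so none of the original chunks reappear and
    # a client that already has the original chunks cannot reuse any.
    print(fixed_block_chunks(original))  # [b'abcd', b'efgh', b'ijkl']
    print(fixed_block_chunks(modified))  # [b'Xabc', b'defg', b'hijk', b'l']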
A more intelligent way of splitting blobs is to derive the chunk
boundaries from the content of the data, such that splitting a slightly
modified blob results in mostly identical chunks and is insensitive to
the data shifting problem of fixed-block chunking. Since the resulting
chunk sizes typically differ, this approach is also called
_variable-block chunking_ and is explained in the following section.

### Variable-block chunking

In variable-block chunking, the chunk boundaries are content-defined
and shift with the data pattern. This can be achieved by computing hash
values of data blocks of a specific size at every byte position in the
data stream and matching those hash values against a predefined
pattern. In case of a match, that byte position is declared a chunk
boundary. This approach avoids the data shifting problem of fixed-block
chunking and ensures that unmodified data more than a block size away
from the changes will have the same boundaries. A possible pattern to
be matched might be all zeroes in the lower `n` bits. This would result
in chunks with an average size of `2^n` bytes.

### Rolling hashes

A common technique to efficiently compute hash values of consecutive
bytes at every byte position in the data stream is to use a rolling
hash function. A rolling hash function allows the hash value of a chunk
of bytes of a specific size at byte position `i` to be computed cheaply
from the hash value of the data chunk of the same size at byte position
`i-1`. Different plain rolling hash functions are available for
implementation, such as a [polynomial rolling
hash](https://ieeexplore.ieee.org/document/5390135), the [Rabin
fingerprint](http://www.cs.cmu.edu/~15-749/READINGS/optional/rabin1981.pdf),
or a [cyclic polynomial rolling
hash](https://dl.acm.org/doi/abs/10.1145/256163.256168). However, the
[fast rolling Gear hash
algorithm](https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia)
has proven to be very compute-efficient, specifically for content-based
chunking of a stream of data.
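To tie the two previous sections together, the following sketch
implements content-defined chunking with a Gear-like rolling hash. The
random gear table, the minimum and maximum chunk sizes, and the choice
of `n = 16` (an average chunk size of roughly 64 KiB) are illustrative
assumptions and not prescribed by this design.

    import random

    # One fixed random 64-bit value per possible byte value (illustrative
    # table; any fixed table agreed upon by the implementation works).
    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]
    MASK_64 = (1 << 64) - 1

    def split_into_chunks(data, n=16, min_size=4096, max_size=262144):
        # Content-defined chunking: a boundary is declared where the lower
        # n bits of the rolling Gear hash are all zero, which yields chunks
        # of roughly 2^n bytes on average; min_size and max_size bound the
        # chunk sizes.
        mask = (1 << n) - 1
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            # Gear hash update: shift the hash and add the table entry of
            # the incoming byte.
            h = ((h << 1) + GEAR[byte]) & MASK_64
            size = i - start + 1
            if (size >= min_size and (h & mask) == 0) or size >= max_size:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])  # final (possibly short) chunk
        return chunks

    # The defining property: the concatenation reproduces the original blob.
    blob = bytes(random.getrandbits(8) for _ in range(100_000))
    assert b"".join(split_into_chunks(blob)) == blob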