Blob Splitting Protocol
=======================

Introduction
------------

Due to the generic nature of our build system, we do not impose any
restrictions on the type of artifacts that are generated by actions or
that serve as inputs to actions. Thus, artifacts might be large
executables, libraries, a compiled set of slides in PDF format, or even
whole file system images. Such artifacts result in large blobs, which
need to be fetched from a remote CAS if they are to be inspected or
used locally. Depending on the network connection, this can mean a
significant waiting time until the complete artifact is downloaded, and
it causes a lot of network traffic. This situation is especially
painful in case of short modification-inspection cycles: for each small
modification of the sources, the complete artifact needs to be
downloaded, even though possibly only a small fraction of the compiled
binary artifact has changed.

Thus, we introduced a blob splitting API as a conservative extension to
the original [remote execution
API](https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/remote_execution.proto),
which allows the binary data of a blob to be split into chunks so that
only the modified chunks need to be transmitted instead of the whole
blob data. This reduces traffic between the remote server and the local
host and avoids unnecessary waiting times for users frequently working
on large artifacts. The downside of this API is an increased storage
consumption, since the binary data of the split artifacts is stored
twice.

Remote execution API extension
------------------------------

We extended the existing remote execution API by introducing a new
remote procedure call (rpc) to the `ContentAddressableStorage` service
to split a blob into chunks, along with the respective request and
response messages, and by adding a flag to the `CacheCapabilities`
message to indicate whether blob splitting is supported by the server
instance or not. The following code snippet depicts the required
extensions to the remote execution protocol:

```proto
service ContentAddressableStorage {
  ...
  rpc SplitBlob(SplitBlobRequest) returns (SplitBlobResponse) {}
}

message SplitBlobRequest {
  string instance_name = 1;
  Digest blob_digest = 2;
}

message SplitBlobResponse {
  repeated Digest chunk_digests = 1;
}

message CacheCapabilities {
  ...
  bool blob_split_support = ;
}
```

### Contract between client and server

The contract between the client and the server for this protocol
extension is the following: if a client requests to split a blob at the
remote execution server, the server answers with a list of chunk
digests and promises that, first, all referenced blobs are available in
its CAS to be downloaded and, second, the concatenation of the returned
blobs in the given order results in the original blob the client has
requested.

The client does not give any promise to the server; it is free to not
use the blob splitting rpc. However, for the protocol extension to
achieve its purpose (saving traffic), a meaningful behavior of the
client is to request the chunk digests of a blob from the server, to
fetch only those chunks that are missing in its local CAS, and to store
the fetched chunks as well as the reconstructed original blob in its
local CAS. If the client requests to split a blob, but the server does
not support blob splitting or an error occurred during the request, the
client falls back to fetching the entire blob.
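To make the contract concrete, the following minimal Python sketch
checks the server's promise on the client side: the concatenation of
the returned chunks, in order, must hash to the digest of the requested
blob. The helpers `split_blob` and `fetch_blob` are hypothetical
stand-ins for the respective rpc calls, and digests are simplified to
plain SHA-256 hex strings, whereas the protocol's `Digest` message also
carries the blob size.

```python
import hashlib

def verify_split_contract(blob_digest: str, split_blob, fetch_blob) -> bytes:
    """Fetch all chunks of a blob and verify the server's promise.

    `split_blob(digest)` is assumed to return the ordered list of chunk
    digests, and `fetch_blob(digest)` the raw bytes of a single blob;
    SHA-256 is assumed as the configured digest function.
    """
    chunk_digests = split_blob(blob_digest)
    # Concatenating the chunks in the returned order must reproduce
    # the original blob.
    data = b"".join(fetch_blob(d) for d in chunk_digests)
    if hashlib.sha256(data).hexdigest() != blob_digest:
        raise RuntimeError("server violated the blob splitting contract")
    return data
```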
Blob split procedure
--------------------

### Server side

When the server receives a request to split a blob via the `SplitBlob`
rpc, it first checks whether the blob given by the `blob_digest` from
the `SplitBlobRequest` message exists in its CAS. If not, the status
code `google.rpc.Code.NOT_FOUND` is returned. Otherwise, it loads the
blob data and splits it into chunks according to the implemented data
chunking algorithm. As explained later, there are different
implementation options for this algorithm, but they all must ensure one
property: the concatenation of the chunks results in the original blob.
After the blob is split, the server computes a digest for each chunk
according to the configured digest function and stores each chunk that
is not yet present in its CAS. If an error occurs while storing the
chunks in the CAS due to storage shortage, the status code
`google.rpc.Code.RESOURCE_EXHAUSTED` is returned. Otherwise, the chunk
digests are collected in the `chunk_digests` field of a
`SplitBlobResponse` message in the same order as the chunks occur
within the binary blob data, and the message is sent back to the client
with the status code `google.rpc.Code.OK`.

> Note: Since the same data is basically stored twice for each blob,
> the storage consumption for the remote server is roughly doubled.
> Some savings occur when chunks are reused in more than one blob or
> even several times in the same blob.

### Client side

If a client wants to take advantage of blob splitting, it requests to
split a blob into chunks at the remote side by calling the `SplitBlob`
rpc from the `ContentAddressableStorage` service, given a
`SplitBlobRequest` message containing the `blob_digest` of the
respective blob. If the status code returned by the `SplitBlob` rpc
contains an error, the split operation failed and the client needs to
fall back to fetching the entire blob. Otherwise, blob splitting was
successful, and the remote server returns an ordered list of
`chunk_digests`, guaranteeing that the chunks are available in its CAS
and that the concatenation of all chunks in the given order results in
the requested blob. The client checks which chunks are available
locally, fetches only the locally missing chunks from the remote side
via the `BatchReadBlobs` rpc or the `ByteStream` API, and assembles the
requested blob by concatenating the binary data of its chunks. Finally,
it stores the chunks as well as the resulting blob in its local CAS.

> Note: Since the same data is basically stored twice for each blob,
> the local storage consumption is roughly doubled. Some savings occur
> when chunks are reused in more than one blob or even several times in
> the same blob.
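A sketch of this client-side logic, under the same assumptions as in
the earlier example (hypothetical helpers `split_blob`, `fetch_blob`,
and `fetch_blobs_batch` standing in for the rpc calls, and a local CAS
object with `has`/`get`/`put` operations; none of these names are part
of the protocol):

```python
def fetch_via_splitting(blob_digest: str, local_cas, split_blob,
                        fetch_blob, fetch_blobs_batch) -> bytes:
    """Reconstruct a remote blob locally, transferring only missing chunks."""
    try:
        chunk_digests = split_blob(blob_digest)
    except Exception:
        # Split not supported or failed: fall back to fetching the
        # entire blob.
        data = fetch_blob(blob_digest)
        local_cas.put(blob_digest, data)
        return data
    # Fetch only the chunks missing in the local CAS, e.g., via the
    # BatchReadBlobs rpc, and store them locally.
    missing = [d for d in chunk_digests if not local_cas.has(d)]
    for digest, data in fetch_blobs_batch(missing):
        local_cas.put(digest, data)
    # Assemble the blob by concatenating its chunks in the given order,
    # and store the resulting blob in the local CAS as well.
    blob = b"".join(local_cas.get(d) for d in chunk_digests)
    local_cas.put(blob_digest, blob)
    return blob
```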
Compatibility issues
--------------------

We aim to get this protocol extension into the official remote
execution API mentioned above. Once this is done and field numbers are
assigned, servers that have upgraded to the new API version inform
clients that have upgraded to the new API version whether they support
blob splitting or not by setting the `blob_split_support` field in the
`CacheCapabilities` message accordingly. A server only provides the
aforementioned guarantee to clients once it has announced to support
blob splitting using this field.

An old client can communicate with a server implementing the new API
without any modification. Nobody is forced to use the blob splitting
functionality, and unknown fields are just ignored at the client side.
A client implementing the new API can communicate with an old server by
evaluating the `blob_split_support` field, which will be set to its
default value `false` at the client side.

Until the protocol extension is officially accepted, the field number
for the `blob_split_support` field is not known. In this case, early
implementors can use a _trial request_ to determine whether the remote
server supports blob splitting or not. This allows a prototypical
client and server employing blob splitting to be implemented without
requiring the protocol extension to be officially accepted. It also
allows new clients to communicate with old servers; they still behave
correctly and do not need to be adapted. The client implementation
needs to be rebuilt once a field number has been assigned.

Data chunking algorithm
-----------------------

As stated earlier, a blob split request from a client is answered by
the server with the promise that the concatenation of the returned
chunks results in the requested blob. While this property is guaranteed
by the server, it does not say anything about the quality of the split
result. It is desirable to generate chunks that are likely to be known
by the client, because then less data needs to be transferred.

A trivial implementation of the split operation would be to return a
_singleton list_ containing the original blob as the single chunk.
While this implementation is correct and fast, it is not very useful,
since it does not save any traffic.

A better approach is _fixed-block chunking_, where the data is split
into chunks of fixed and equal size. While this approach typically
results in more than one chunk and improves the probability of saving
some traffic, it suffers from the data shifting problem: the insertion
of a single character at the beginning of a data stream shifts the
entire data while the chunk boundaries stay fixed. Thus, completely new
chunks, unknown to the client, are created even though the data
patterns are mostly similar.

A more intelligent way of splitting blobs is to select chunk boundaries
based on the content of the data, such that splitting a slightly
modified blob results in mostly identical chunks, avoiding the data
shifting problem of fixed-block chunking. Since the resulting chunk
sizes typically differ, this approach is also called _variable-block
chunking_ and is explained in the following section.

### Variable-block chunking

In variable-block chunking, the borders of the chunks are content
defined and shift with the data pattern. This can be achieved by
computing hash values of data blocks of a specific size at every byte
position in the data stream and matching those hash values against a
predefined pattern. In case of a match, that byte position is declared
a chunk boundary. This approach avoids the data shifting problem of
fixed-block chunking and ensures that unmodified data more than a block
size away from the changes will have the same boundaries. A possible
pattern to be matched might be all zeroes in the lower `n` bits. This
results in chunks with an average size of `2^n` bytes.
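The following sketch illustrates variable-block chunking in its
simplest form: it computes a hash over a sliding window at every byte
position and declares a boundary whenever the lower `n` bits of the
hash are zero. The window size of 64 bytes, the choice of `n = 12`
(average chunk size of 4 KiB), and the use of truncated SHA-256 as the
window hash are arbitrary example parameters, not prescribed by the
protocol. Recomputing the hash at every position is deliberately naive;
the rolling hashes described in the next section make this efficient.

```python
import hashlib

WINDOW_SIZE = 64       # bytes hashed at each position (example value)
BOUNDARY_BITS = 12     # average chunk size of 2^12 = 4096 bytes

def split_into_chunks(data: bytes) -> list[bytes]:
    """Naive variable-block chunking: O(window) hash work per byte.

    A chunk boundary is declared at position i whenever the hash of the
    window ending at i has all zeroes in its lower BOUNDARY_BITS bits.
    """
    mask = (1 << BOUNDARY_BITS) - 1
    chunks = []
    start = 0
    for i in range(WINDOW_SIZE, len(data)):
        window = data[i - WINDOW_SIZE:i]
        h = int.from_bytes(hashlib.sha256(window).digest()[:8], "big")
        if h & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])  # remainder as the final chunk
    # Invariant required by the protocol: concatenation reproduces data.
    assert b"".join(chunks) == data
    return chunks
```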
### Rolling hashes

A common technique to efficiently compute hash values of consecutive
bytes at every byte position in the data stream is to use a rolling
hash function. A rolling hash function allows the hash value of a chunk
of bytes of a specific size at byte position `i` to be computed cheaply
from the hash value of the data chunk of the same size at byte position
`i-1`.

Different plain rolling hash functions are available for implementation
such as a [polynomial rolling
hash](https://ieeexplore.ieee.org/document/5390135), the [Rabin
fingerprint](http://www.cs.cmu.edu/~15-749/READINGS/optional/rabin1981.pdf),
or a [cyclic polynomial rolling
hash](https://dl.acm.org/doi/abs/10.1145/256163.256168) (also called
Buzhash). However, the fast rolling Gear hash, as used in the
[FastCDC](https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia)
chunking algorithm, has been proven to be very compute efficient and
faster than the other rolling hash algorithms specifically for
content-based chunking of a stream of data, while achieving similar
deduplication ratios as the Rabin fingerprint.
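As an illustration of the Gear construction, consider the sketch below;
it is not justbuild's actual implementation. The rolling state is
updated per input byte with one shift, one addition, and one table
lookup, and the left shift implicitly ages older bytes out of the
64-bit state, so the hash covers a sliding window of roughly the last
64 bytes without explicit bookkeeping. The random table, the boundary
test on the top 13 bits (average chunk size of 8 KiB), and the seeding
are example choices.

```python
import random

# A fixed table of 256 random 64-bit values (the "gear" table); a real
# implementation would hard-code this table as a constant.
random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]

BOUNDARY_BITS = 13  # average chunk size of 2^13 = 8 KiB
MASK = ((1 << BOUNDARY_BITS) - 1) << (64 - BOUNDARY_BITS)

def gear_chunk_boundaries(data: bytes) -> list[int]:
    """Return chunk boundary positions using a Gear rolling hash.

    The update `fp = (fp << 1) + GEAR[byte]` gradually shifts the
    contribution of older bytes out of the 64-bit state.
    """
    fp = 0
    boundaries = []
    for i, byte in enumerate(data):
        fp = ((fp << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        # Declare a boundary when the top BOUNDARY_BITS bits are zero;
        # each position matches with probability 2^-BOUNDARY_BITS.
        if fp & MASK == 0:
            boundaries.append(i + 1)
    return boundaries
```

Splitting the data at the returned positions yields the chunks.
Production chunkers such as FastCDC additionally enforce minimum and
maximum chunk sizes and vary the mask to normalize the chunk-size
distribution.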