 doc/future-designs/blob-splitting.md | 225
 1 file changed, 225 insertions(+), 0 deletions(-)
diff --git a/doc/future-designs/blob-splitting.md b/doc/future-designs/blob-splitting.md
new file mode 100644
index 00000000..598996b8
--- /dev/null
+++ b/doc/future-designs/blob-splitting.md
@@ -0,0 +1,225 @@

Blob Splitting Protocol
=======================

Introduction
------------

Due to the generic nature of our build system, we do not impose any
restrictions on the type of artifacts that are generated by actions or
that are inputs to actions. Thus, artifacts might be large executables,
libraries, a compiled set of slides in PDF format, or even whole file
system images. Such artifacts result in large blobs, which need to be
fetched from a remote CAS if they are to be inspected or used locally.
Depending on the network connection, this might imply a significant
waiting time until the complete artifact is downloaded, as well as a
lot of network traffic. This situation is especially painful in the
case of short modification-inspection cycles: for each small
modification of the sources, the complete artifact needs to be
downloaded, even though possibly only a small fraction of the compiled
binary artifact has changed.

Thus, we propose a blob splitting strategy as a conservative extension
to the original [remote execution
API](https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/remote_execution.proto),
which allows splitting the binary data of a blob into chunks and
transmitting only the modified chunks instead of the whole blob data.
The goal is to reduce traffic between the local host and the remote
server and to avoid unnecessary waiting times for users frequently
working on large artifacts, at the cost of increased storage
consumption.

Remote execution API extension
------------------------------

We extend the existing remote execution API by introducing a new remote
procedure call (rpc) to the `ContentAddressableStorage` service to split
a blob into chunks, along with the respective request and response
messages, and by adding a flag to the `ServerCapabilities` message to
indicate whether blob splitting is supported by the server or not. The
following code snippet depicts the proposed extensions to the remote
execution protocol:

    service ContentAddressableStorage {
      ...
      rpc SplitBlob(SplitBlobRequest) returns (SplitBlobResponse) {}
    }

    message SplitBlobRequest {
      string instance_name = 1;
      Digest digest = 2;
      DigestFunction.Value digest_function = 3;
    }

    message SplitBlobResponse {
      repeated Digest digests = 1;
      google.rpc.Status status = 2;
    }

    message ServerCapabilities {
      ...
      bool blob_split_support = <free field number>;
    }

### Contract between client and server

The contract between the client and the server for this protocol
extension is as follows: if a client requests to split a blob at the
remote execution server, the server answers with a list of chunk
digests and promises that, first, all referenced blobs are available in
its CAS to be downloaded and, second, the concatenation of the returned
blobs in the given order will result in the original blob the client
has requested.
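To make this promise concrete, the following sketch shows the check a
client could perform on a split result. It assumes SHA-256 as the
digest function and a hypothetical `fetch_blob` helper that downloads a
blob by its digest; both are illustrative assumptions, not part of the
protocol.

    import hashlib

    def verify_split(blob_digest_hash, chunk_digests, fetch_blob):
        # chunk_digests is the ordered list from SplitBlobResponse.digests;
        # fetch_blob is a hypothetical helper returning a blob's binary data.
        data = b"".join(fetch_blob(d) for d in chunk_digests)
        # The concatenation must hash back to the originally requested
        # digest (assuming SHA-256 as digest function).
        return hashlib.sha256(data).hexdigest() == blob_digest_hash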
The client does not give any promise to the server; it is free to not
use the blob splitting rpc. However, in order to benefit from the
protocol extension (saving traffic), a meaningful behavior of the
client would be to request the chunk digests of a blob from the server,
to fetch only those chunks that are missing in its local CAS, and to
store the fetched chunks as well as the reconstructed original blob in
its local CAS. If the client requests to split a blob, but the server
does not support blob splitting or an error occurred during the
request, the client falls back to fetching the entire blob.

Blob split procedure
--------------------

### Server side

When the server receives a request to split a blob via the `SplitBlob`
rpc, it first checks whether the blob given by the `digest` field of
the `SplitBlobRequest` message exists in its CAS. If not, an error is
returned by sending back a `SplitBlobResponse` message with the
`status.code` field set to `google.rpc.Code.NOT_FOUND`. Otherwise, the
server loads the blob data and splits it into chunks according to the
implemented data chunking algorithm. As explained later, there are
different implementation options for this algorithm, but they all must
ensure one property: the concatenation of the chunks results in the
original blob. After the blob has been split, the server computes a
digest for each chunk according to the `digest_function` specified in
the `SplitBlobRequest` message and stores each chunk that is not yet
present in its CAS. If an error occurs while storing the chunks in the
CAS, an error is returned by sending back a `SplitBlobResponse` message
with the `status.code` field set to `google.rpc.Code.INTERNAL`.
Otherwise, the chunk digests are collected in the `digests` field of a
`SplitBlobResponse` message in the same order as the chunks occur
within the blob data and sent back to the client with `status.code` set
to `google.rpc.Code.OK`.

> Note: Since the same data is basically stored twice for each blob,
> the storage consumption for the remote server is roughly doubled.
> Some savings occur when chunks are reused in more than one blob or
> even several times within the same blob.

### Client side

If a client wants to take advantage of blob splitting, it requests the
remote side to split a blob into chunks by calling the `SplitBlob` rpc
of the `ContentAddressableStorage` service with a valid
`SplitBlobRequest` message containing the respective blob `digest` and
`digest_function`. If the `status` field of the returned
`SplitBlobResponse` message contains an error, the split operation
failed and the client needs to fall back to fetching the entire blob.
Otherwise, blob splitting was successful, and the remote server returns
an ordered list of chunk `digests` and guarantees that the chunks are
available in its CAS and that the concatenation of all chunks in the
given order will result in the requested blob. The client checks which
chunks are available locally, fetches only the locally missing chunks
from the remote side via the `BatchReadBlobs` rpc or the `ByteStream`
API, and reconstructs the requested blob by concatenating the binary
data of its chunks. Finally, it stores the chunks as well as the
resulting blob in its local CAS.

> Note: Since the same data is basically stored twice for each blob,
> the local storage consumption is roughly doubled. Some savings occur
> when chunks are reused in more than one blob or even several times
> within the same blob.
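In code, the client-side procedure could look like the following
sketch. The `local_cas` and `remote` objects and their methods
(`split_blob`, `read_blob`, `read_blobs`, `has`, `get`, `put`) are
purely illustrative placeholders for the CAS and rpc abstractions of a
concrete client implementation; error handling is kept minimal.

    def fetch_via_split(blob_digest, local_cas, remote):
        # Ask the server to split the blob; on any error, fall back to
        # fetching the entire blob.
        response = remote.split_blob(blob_digest)
        if response.status.code != 0:  # google.rpc.Code.OK
            return remote.read_blob(blob_digest)

        # Fetch only the chunks that are missing in the local CAS and
        # store them locally.
        missing = [d for d in response.digests if not local_cas.has(d)]
        for digest, data in remote.read_blobs(missing):
            local_cas.put(digest, data)

        # Reconstruct the blob by concatenating the chunks in the given
        # order and store the result in the local CAS as well.
        blob = b"".join(local_cas.get(d) for d in response.digests)
        local_cas.put(blob_digest, blob)
        return blob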
Compatibility issues
--------------------

We aim to get this protocol extension into the official remote
execution API mentioned above. Once this is done and field numbers are
assigned, servers that have upgraded to the new API version inform
clients that have also upgraded whether they support blob splitting or
not by setting the `blob_split_support` field in the
`ServerCapabilities` message accordingly. A server only provides the
aforementioned guarantee to clients once it has announced support for
blob splitting using this field.

An old client can communicate with a server implementing the new API
without any modification. Nobody is forced to use the blob splitting
functionality, and unknown fields are just ignored at the client side.
A client implementing the new API can communicate with an old server by
evaluating the `blob_split_support` field, which will be set to its
default value `false` at the client side.

Until the protocol extension is officially accepted, the field number
for the `blob_split_support` field is not known. In this case, early
implementors can use a _trial request_ to determine whether the remote
server supports blob splitting or not. This makes it possible to
implement a prototypical client and server employing blob splitting
without requiring the protocol extension to be officially accepted. It
also allows new clients to communicate with old servers; they still
behave correctly and do not need to be adapted. The client
implementation needs to be rebuilt once a field number has been
assigned.

Data chunking algorithm
-----------------------

As stated earlier, a blob split request from a client is answered by
the server with the promise that the concatenation of the returned
chunks results in the requested blob. While this property is guaranteed
by the server, it does not say anything about the quality of the split
result. It is desirable to generate chunks that are likely to be known
by the client, because then less data needs to be transferred.

A trivial implementation of the split operation would be to return a
_singleton list_ containing the original blob as its single chunk.
While this implementation is correct and fast, it is not very useful,
since it does not save any traffic.

A better approach is _fixed-block chunking_, where the data is split
into chunks of fixed and equal size. While this approach typically
results in more than one chunk and improves the chances of saving some
traffic, it suffers from the data shifting problem: the insertion of a
single character at the beginning of a data stream shifts the entire
data while the chunk boundaries stay fixed. Thus, completely new
chunks, unknown to the client, are created even though the data
patterns are mostly similar, as illustrated by the sketch below.
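A minimal sketch of fixed-block chunking illustrating the data shifting
problem (the block size of 4 bytes is chosen purely for illustration):

    def fixed_block_chunks(data, block_size=4):
        # Split the data into consecutive blocks of fixed, equal size.
        return [data[i:i + block_size]
                for i in range(0, len(data), block_size)]

    original = b"abcdefghijkl"
    modified = b"Xabcdefghijkl"  # one byte inserted at the front

    # Every boundary shifts, so none of the original chunks reappear and
    # a client that already has the original chunks cannot reuse any.
    print(fixed_block_chunks(original))  # [b'abcd', b'efgh', b'ijkl']
    print(fixed_block_chunks(modified))  # [b'Xabc', b'defg', b'hijk', b'l']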
A more intelligent way of splitting blobs is to derive the chunk
boundaries from the content of the data, such that splitting a slightly
modified blob results in mostly identical chunks and is insensitive to
the data shifting problem of fixed-block chunking. Since the resulting
chunk sizes typically differ, this approach is also called
_variable-block chunking_ and is explained in the following section.

### Variable-block chunking

In variable-block chunking, the chunk boundaries are content-defined
and shift with the data pattern. This can be achieved by computing hash
values of data blocks of a specific size at every byte position in the
data stream and matching those hash values against a predefined
pattern. In case of a match, that byte position is declared a chunk
boundary. This approach avoids the data shifting problem of fixed-block
chunking and ensures that unmodified data more than a block size away
from the changes will have the same boundaries. A possible pattern to
be matched might be all zeroes in the lower `n` bits. This would result
in chunks with an average size of `2^n` bytes.

### Rolling hashes

A common technique to efficiently compute hash values of consecutive
bytes at every byte position in the data stream is to use a rolling
hash function. A rolling hash function allows the hash value of a chunk
of bytes of a specific size at byte position `i` to be computed cheaply
from the hash value of the data chunk of the same size at byte position
`i-1`. Different plain rolling hash functions are available for
implementation, such as a [polynomial rolling
hash](https://ieeexplore.ieee.org/document/5390135), the [Rabin
fingerprint](http://www.cs.cmu.edu/~15-749/READINGS/optional/rabin1981.pdf),
or a [cyclic polynomial rolling
hash](https://dl.acm.org/doi/abs/10.1145/256163.256168). However, the
[fast rolling Gear hash
algorithm](https://www.usenix.org/conference/atc16/technical-sessions/presentation/xia)
has proven to be very compute-efficient, specifically for content-based
chunking of a stream of data.
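To tie the two previous sections together, the following sketch
implements content-defined chunking with a Gear-like rolling hash. The
random gear table, the minimum and maximum chunk sizes, and the choice
of `n = 16` (an average chunk size of roughly 64 KiB) are illustrative
assumptions and not prescribed by this design.

    import random

    # One fixed random 64-bit value per possible byte value (illustrative
    # table; any fixed table agreed upon by the implementation works).
    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]
    MASK_64 = (1 << 64) - 1

    def split_into_chunks(data, n=16, min_size=4096, max_size=262144):
        # Content-defined chunking: a boundary is declared where the lower
        # n bits of the rolling Gear hash are all zero, which yields chunks
        # of roughly 2^n bytes on average; min_size and max_size bound the
        # chunk sizes.
        mask = (1 << n) - 1
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            # Gear hash update: shift the hash and add the table entry of
            # the incoming byte.
            h = ((h << 1) + GEAR[byte]) & MASK_64
            size = i - start + 1
            if (size >= min_size and (h & mask) == 0) or size >= max_size:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])  # final (possibly short) chunk
        return chunks

    # The defining property: the concatenation reproduces the original blob.
    blob = bytes(random.getrandbits(8) for _ in range(100_000))
    assert b"".join(split_into_chunks(blob)) == blob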