author Oliver Reiche <oliver.reiche@huawei.com> 2023-06-01 13:36:32 +0200
committer Oliver Reiche <oliver.reiche@huawei.com> 2023-06-12 16:29:05 +0200
commit b66a7359fbbff35af630c88c56598bbc06b393e1
tree d866802c4b44c13cbd90f9919cc7fc472091be0c /doc/specification
parent 144b2c619f28c91663936cd445251ca28af45f88
doc: Convert orgmode files to markdown
Diffstat (limited to 'doc/specification')
-rw-r--r-- doc/specification/remote-protocol.md 145
-rw-r--r-- doc/specification/remote-protocol.org 139
2 files changed, 145 insertions, 139 deletions
diff --git a/doc/specification/remote-protocol.md b/doc/specification/remote-protocol.md
new file mode 100644
index 00000000..1afd7e32
--- /dev/null
+++ b/doc/specification/remote-protocol.md
@@ -0,0 +1,145 @@
+Specification of the just Remote Execution Protocol
+===================================================
+
+Introduction
+------------
+
+just supports remote execution of actions across multiple machines. As
+such, it makes use of a remote execution protocol. The basis of our
+protocol is the open-source gRPC [remote execution
+API](https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/remote_execution.proto).
+We use this protocol in a **compatible** mode, but by default, we use a
+modified version, allowing us to pass git trees and files directly
+without even looking at their content or traversing them. This
+modification makes sense since it is more efficient if sources are
+available in git repositories and much open-source code is hosted in git
+repositories. With this protocol, we take advantage of already hashed
+git content as much as possible by avoiding unnecessary conversion and
+communication overhead.
+
+In the following sections, we explain which modifications we applied to
+the original protocol and which requirements the remote execution
+service must meet to work seamlessly with just.
+
+just Protocol Description
+-------------------------
+
+### git Blob and Tree Hashes
+
+In order to be able to work with git hashes, both the client side and
+the server side need to be extended to support the regular git hash
+functions for blobs and trees:
+
+The hash of a blob is computed as
+
+ sha1sum(b"blob <size_of_content>\0<content>")
+
+The hash of a tree is computed as
+
+ sha1sum(b"tree <size_of_entries>\0<entries>")
+
+where `<entries>` is a sequence (without newlines) of `<entry>`, and
+each `<entry>` is
+
+ <mode> <file or dir name>\0<git-hash of the corresponding blob or tree>
+
+`<mode>` is a number defining if the object is a file (`100644`), an
+executable file (`100755`), a tree (`040000`), or a symbolic link
+(`120000`). More information on how git internally stores its objects
+can be found in the official [git
+documentation](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects).
+
+Since git hashes blob content differently from trees, the object type
+has to be transmitted in addition to the content and the hash. To this
+end, just prefixes the git hash values passed over the wire with a
+single-byte marker, which allows the remote side to distinguish a blob
+from a tree without inspecting the (potentially large) content. The
+markers are
+
+ - `0x62` for a git blob (`0x62` corresponds to the character `b`)
+ - `0x74` for a git tree (`0x74` corresponds to the character `t`)
+
+Since hashes are transmitted as hexadecimal strings, a prefixed git hash
+is 42 characters long (2 for the marker, 40 for the SHA-1 hash). The
+server side has to accept this length as valid in order to detect our
+protocol and to apply the appropriate git hash function for the prefix.
+
+### Blob and Tree Availability
+
+Typically, it makes sense for a client to check the availability of a
+blob or a tree at the remote side, before it actually uploads it. Thus,
+the remote side should be able to answer availability requests based on
+our prefixed hash values.
+
+### Blob Upload
+
+A blob is uploaded to the remote side by passing its raw content as well
+as its `Digest` containing the git hash value for a blob prefixed by
+`0x62`. The remote side needs to verify the received content by applying
+the git blob hash function to it, before the blob is stored in the
+content addressable storage (CAS).
+
+If a blob is part of a git repository and already known to the remote
+side, we do not even have to calculate the hash value of a possibly
+large file; instead, we can directly use the hash value calculated by
+git and pass it through.
+
+### Tree Upload
+
+In contrast to regular files, which are uploaded as blobs, the original
+protocol has no notion of directories on the remote side. Thus,
+directories need to be traversed and converted to `Directory` Protobuf
+messages, which are then serialized and uploaded as blobs.
+
+In our modified protocol, we avoid this traversal and conversion
+overhead by directly uploading the git tree objects instead of the
+serialized Protobuf messages if the directory is part of a git
+repository. Consequently, we can also reuse the corresponding git hash
+value for a tree object, which just needs to be prefixed by `0x74` when
+uploaded.
+
+The remote side must accept git tree objects instead of `Directory`
+Protobuf messages at any location where `Directory` messages are
+referenced (e.g., the root directory of an action). The tree content is
+verified using the git hash function for trees. In addition, the remote
+side has to be extended to parse the git tree object format.
+
+Using this git tree representation makes tree handling much more
+efficient, since the effort of traversing and uploading the content of a
+git tree occurs only once; for each subsequent request, we directly
+pass around the git tree id. We require the invariant that if a tree is
+part of any CAS, then all its content is also available in this CAS. To
+adhere to this invariant, the client side has to ensure that the content
+of a tree is available in the CAS before uploading this tree. One way
+to ensure that the tree content is known to the remote side is for the
+client to upload it. The server side has to ensure that this invariant
+holds. In particular, if the remote side implements any sort of pruning
+strategy for the CAS, it has to honor this invariant when an element is
+pruned.
+
+Another consequence of this efficient tree handling is that it improves
+**action digest** calculation noticeably, since known git trees referred
+to by the root directory do not need to be traversed. This in turn makes
+it faster to determine whether an action result is already available in
+the action cache.
+
+### Tree Download
+
+Once an action is successfully executed, it might have generated output
+files or output directories in its staging area on the remote side. Each
+output file needs to be uploaded to the CAS with the corresponding git
+blob hash. Each output directory needs to be translated to a git tree
+object and uploaded to the CAS with the corresponding git tree hash.
+The server side is allowed to return a tree to the client only if the
+content of that tree is available in the CAS.
+
+In case of a generated output directory, the server only returns the
+corresponding git tree id to the client instead of a flat list of all
+recursively generated output directories as part of a `Tree` Protobuf
+message as it is done in the original protocol. The remote side promises
+that each blob and subtree contained in the root tree is available in
+the remote CAS. Such blobs and trees must be accessible, using the
+streaming interface, without specifying the size (since sizes are not
+stored in a git tree). Due to the Protobuf 3 specification, which is
+used in this remote execution API, not specifying the size means the
+default value 0 is used.
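
The hashing rules from the "git Blob and Tree Hashes" section of the added file can be illustrated with a minimal Python sketch. The function names (`git_object_hash`, `wire_hash`, `tree_entry`) are hypothetical and not part of just. The sketch assumes two details of git's on-disk format that the specification does not spell out: inside a tree object each entry stores the hash as 20 raw bytes rather than as hex, and the mode of a subtree is stored without the leading zero (`40000`).

```python
import hashlib

# Hex spellings of the single-byte markers from the specification.
BLOB_MARKER = "62"  # 0x62, the character 'b'
TREE_MARKER = "74"  # 0x74, the character 't'


def git_object_hash(kind: str, payload: bytes) -> str:
    """Hex SHA-1 of b"<kind> <size>\\0<payload>", as git computes it."""
    if kind not in ("blob", "tree"):
        raise ValueError(f"unsupported object type: {kind}")
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def wire_hash(kind: str, payload: bytes) -> str:
    """Marker-prefixed hash as it would be passed over the wire."""
    marker = BLOB_MARKER if kind == "blob" else TREE_MARKER
    return marker + git_object_hash(kind, payload)


def tree_entry(mode: str, name: str, hex_hash: str) -> bytes:
    """One <entry>: mode, space, name, NUL, then the 20 raw hash bytes."""
    return f"{mode} {name}".encode() + b"\0" + bytes.fromhex(hex_hash)


# An empty file, and a tree that contains only that file.
empty_blob = git_object_hash("blob", b"")
entries = tree_entry("100644", "empty.txt", empty_blob)
print(wire_hash("blob", b""))  # 62e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(wire_hash("tree", entries))
```

Both printed values are 42 characters long, which is the length a server would use to recognize the modified protocol.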
diff --git a/doc/specification/remote-protocol.org b/doc/specification/remote-protocol.org
deleted file mode 100644
index dea7177e..00000000
--- a/doc/specification/remote-protocol.org
+++ /dev/null
@@ -1,139 +0,0 @@
-* Specification of the just Remote Execution Protocol
-
-** Introduction
-
-just supports remote execution of actions across multiple machines. As such, it
-makes use of a remote execution protocol. The basis of our protocol is the
-open-source gRPC
-[[https://github.com/bazelbuild/remote-apis/blob/main/build/bazel/remote/execution/v2/remote_execution.proto][remote
-execution API]]. We use this protocol in a *compatible* mode, but by default, we
-use a modified version, allowing us to pass git trees and files directly without
-even looking at their content or traversing them. This modification makes sense
-since it is more efficient if sources are available in git repositories and much
-open-source code is hosted in git repositories. With this protocol, we take
-advantage of already hashed git content as much as possible by avoiding
-unnecessary conversion and communication overhead.
-
-In the following sections, we explain which modifications we applied to the
-original protocol and which requirements the remote execution service must meet
-to work seamlessly with just.
-
-
-** just Protocol Description
-
-*** git Blob and Tree Hashes
-
-In order to be able to work with git hashes, both the client side and the server
-side need to be extended to support the regular git hash functions for blobs and
-trees:
-
-The hash of a blob is computed as
-#+BEGIN_SRC
-sha1sum(b"blob <size_of_content>\0<content>")
-#+END_SRC
-The hash of a tree is computed as
-#+BEGIN_SRC
-sha1sum(b"tree <size_of_entries>\0<entries>")
-#+END_SRC
-where ~<entries>~ is a sequence (without newlines) of ~<entry>~, and each
-~<entry>~ is
-#+BEGIN_SRC
-<mode> <file or dir name>\0<git-hash of the corresponding blob or tree>
-#+END_SRC
-~<mode>~ is a number defining if the object is a file (~100644~), an executable
-file (~100755~), a tree (~040000~), or a symbolic link (~120000~). More
-information on how git internally stores its objects can be found in the
-official [[https://git-scm.com/book/en/v2/Git-Internals-Git-Objects][git
-documentation]].
-
-Since git hashes blob content differently from trees, the object type has to be
-transmitted in addition to the content and the hash. To this end, just prefixes
-the git hash values passed over the wire with a single-byte marker, which allows
-the remote side to distinguish a blob from a tree without inspecting the
-(potentially large) content. The markers are
-
-- ~0x62~ for a git blob (~0x62~ corresponds to the character ~b~)
-- ~0x74~ for a git tree (~0x74~ corresponds to the character ~t~)
-
-Since hashes are transmitted as hexadecimal strings, a prefixed git hash is 42
-characters long (2 for the marker, 40 for the SHA-1 hash). The server side has
-to accept this length as valid in order to detect our protocol and to apply the
-appropriate git hash function for the prefix.
-
-
-*** Blob and Tree Availability
-
-Typically, it makes sense for a client to check the availability of a blob or a
-tree at the remote side, before it actually uploads it. Thus, the remote side
-should be able to answer availability requests based on our prefixed hash
-values.
-
-
-*** Blob Upload
-
-A blob is uploaded to the remote side by passing its raw content as well as its
-~Digest~ containing the git hash value for a blob prefixed by ~0x62~. The remote
-side needs to verify the received content by applying the git blob hash function
-to it, before the blob is stored in the content addressable storage (CAS).
-
-If a blob is part of a git repository and already known to the remote side, we
-do not even have to calculate the hash value of a possibly large file; instead,
-we can directly use the hash value calculated by git and pass it through.
-
-
-*** Tree Upload
-
-In contrast to regular files, which are uploaded as blobs, the original protocol
-has no notion of directories on the remote side. Thus, directories need to be
-traversed and converted to ~Directory~ Protobuf messages, which are then
-serialized and uploaded as blobs.
-
-In our modified protocol, we avoid this traversal and conversion overhead by
-directly uploading the git tree objects instead of the serialized Protobuf
-messages if the directory is part of a git repository. Consequently, we can also
-reuse the corresponding git hash value for a tree object, which just needs to be
-prefixed by ~0x74~ when uploaded.
-
-The remote side must accept git tree objects instead of ~Directory~ Protobuf
-messages at any location where ~Directory~ messages are referenced (e.g., the
-root directory of an action). The tree content is verified using the git hash
-function for trees. In addition, the remote side has to be extended to parse the
-git tree object format.
-
-Using this git tree representation makes tree handling much more efficient,
-since the effort of traversing and uploading the content of a git tree occurs
-only once; for each subsequent request, we directly pass around the git tree
-id. We require the invariant that if a tree is part of any CAS, then all its
-content is also available in this CAS. To adhere to this invariant, the client
-side has to ensure that the content of a tree is available in the CAS before
-uploading this tree. One way to ensure that the tree content is known to the
-remote side is for the client to upload it. The server side has to ensure that
-this invariant holds. In particular, if the remote side implements any sort of
-pruning strategy for the CAS, it has to honor this invariant when an element is
-pruned.
-
-Another consequence of this efficient tree handling is that it improves *action
-digest* calculation noticeably, since known git trees referred to by the root
-directory do not need to be traversed. This in turn makes it faster to determine
-whether an action result is already available in the action cache.
-
-
-*** Tree Download
-
-Once an action is successfully executed, it might have generated output files or
-output directories in its staging area on the remote side. Each output file
-needs to be uploaded to the CAS with the corresponding git blob hash. Each
-output directory needs to be translated to a git tree object and uploaded to the
-CAS with the corresponding git tree hash. The server side is allowed to
-return a tree to the client only if the content of that tree is available in
-the CAS.
-
-In case of a generated output directory, the server only returns the
-corresponding git tree id to the client instead of a flat list of all
-recursively generated output directories as part of a ~Tree~ Protobuf message as
-it is done in the original protocol. The remote side promises that each blob and
-subtree contained in the root tree is available in the remote CAS. Such blobs
-and trees must be accessible, using the streaming interface, without specifying
-the size (since sizes are not stored in a git tree). Due to the Protobuf 3
-specification, which is used in this remote execution API, not specifying the
-size means the default value 0 is used.
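
The specification requires the remote side to parse git tree objects and to keep every blob and subtree of a stored tree available in the CAS. A minimal Python sketch of that check follows. The names `TreeEntry`, `parse_tree`, and `missing_content` are hypothetical, and the CAS is modelled as a plain set of marker-prefixed hashes purely for illustration. The sketch assumes git's on-disk layout, in which each entry ends with 20 raw hash bytes and a subtree's mode is stored as `40000`. Symbolic links are treated as blobs, since git stores the link target as a blob.

```python
from typing import AbstractSet, Iterator, List, NamedTuple


class TreeEntry(NamedTuple):
    mode: str
    name: str
    wire_hash: str  # marker-prefixed hex hash, 42 characters


def parse_tree(entries: bytes) -> Iterator[TreeEntry]:
    """Yield the entries of a git tree object body (header already removed)."""
    pos = 0
    while pos < len(entries):
        space = entries.index(b" ", pos)        # end of <mode>
        nul = entries.index(b"\0", space + 1)   # end of <name>
        raw = entries[nul + 1:nul + 21]         # 20 raw SHA-1 bytes
        if len(raw) != 20:
            raise ValueError("truncated tree entry")
        mode = entries[pos:space].decode()
        name = entries[space + 1:nul].decode(errors="surrogateescape")
        # Subtrees get the tree marker 0x74; everything else is a blob (0x62).
        marker = "74" if mode in ("40000", "040000") else "62"
        yield TreeEntry(mode, name, marker + raw.hex())
        pos = nul + 21


def missing_content(entries: bytes, cas: AbstractSet[str]) -> List[str]:
    """Names of direct entries whose content is not yet in the CAS."""
    return [e.name for e in parse_tree(entries) if e.wire_hash not in cas]


# A tree with one empty file and one empty subdirectory.
EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
body = (b"100644 a.txt\0" + bytes.fromhex(EMPTY_BLOB)
        + b"40000 sub\0" + bytes.fromhex(EMPTY_TREE))
print(missing_content(body, {"62" + EMPTY_BLOB}))  # ['sub']
```

A client could run such a check before uploading a tree, and a server before accepting one. The check covers only direct entries, so a full check would apply it recursively to each subtree.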