From 8a9664f9e63bf0528ee70ce40904314dd9deee64 Mon Sep 17 00:00:00 2001
From: Klaus Aehlig <klaus.aehlig@huawei.com>
Date: Wed, 15 Jun 2022 12:31:53 +0200
Subject: Document the concept of target-level caching

---
 doc/concepts/target-cache.org | 219 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 doc/concepts/target-cache.org

diff --git a/doc/concepts/target-cache.org b/doc/concepts/target-cache.org
new file mode 100644
index 00000000..dccbd5e7
--- /dev/null
+++ b/doc/concepts/target-cache.org
@@ -0,0 +1,219 @@
+* Target-level caching
+
+** ~git~ trees as content-fixed roots
+
+*** The ~"git tree"~ root scheme
+
+The multi-repository configuration supports a scheme ~"git tree"~.
+This scheme is given by two parameters,
+- the id of the tree (as a string with the hex encoding), and
+- an arbitrary ~git~ repository containing the specified tree
+  object, as well as all needed tree and blob objects reachable
+  from that tree.
+For example, a root could be specified as follows.
+#+BEGIN_SRC
+["git tree", "6a1820e78f61aee6b8f3677f150f4559b6ba77a4", "/usr/local/src/justbuild.git"]
+#+END_SRC
+
+It should be noted that the ~git~ tree identifier alone already
+specifies the content of the full tree. However, ~just~ needs access
+to some repository containing the tree in order to know what the
+tree looks like.
+
+Nevertheless, it is an important observation that the tree identifier
+alone already specifies the content of the whole (logical) directory.
+The equality of two such directories can be established by comparing
+the two identifiers _without_ the need to read any file from
+disk. Those "fixed-content" descriptions, i.e., descriptions of a
+repository root that already fully determine the content, are the
+key to caching whole targets.
+
+*** ~KNOWN~ artifacts
+
+The in-memory representation of known artifacts has an optional
+reference to a repository containing that artifact. Artifacts
+"known" from local repositories might not be known to the CAS used
+for the action execution; this additional reference allows filling
+such misses in the CAS.
+
+** Content-fixed repositories
+
+*** The parts of a content-fixed repository
+
+In order to meaningfully cache a target, we need to be able to
+efficiently compute the cache key. We restrict this to the case where
+we can compute the information about the repository without file-system
+access. This requires that all roots (workspace, target root, etc.)
+be content fixed, as well as the bindings of the free repository
+names (and hence also all transitively reachable repositories).
+We call such repositories "content-fixed" repositories.
+
+*** Canonical description of a content-fixed repository
+
+The local data of a repository consists of the following.
+- The roots (for workspace, targets, rules, expressions). As the
+  tree identifier already defines the content, we leave out the
+  path to the repository containing the tree.
+- The names of the targets, rules, and expression files.
+- The names of the outgoing "bindings".
+
+Additionally, repositories can reach additional repositories via
+bindings. Moreover, this repository-level dependency relation
+is not necessarily cycle free. In particular, we cannot use the
+tree unfolding as canonical representation of that graph up to
+bisimulation, as we do with most other data structures. To still get
+a canonical representation, we factor out the largest bisimulation,
+i.e., minimize the respective automaton (with repositories as
+states, local data as locally observable properties, and the binding
+relation as edges).
+
+Finally, for each repository individually, the reachable repositories
+are renamed ~"0"~, ~"1"~, ~"2"~, etc., following a depth-first
+traversal starting from the repository in question where outgoing
+edges are traversed in lexicographical order. The entry point is
+hence recognisable as repository ~"0"~.
+
+The repository key is the content identifier of the canonically
+formatted canonical serialisation of the JSON encoding of the obtained
+multi-repository configuration (with repository-free git-root
+descriptions). The serialisation itself is stored in CAS.
+
+These identifications and replacements of global names do not change
+the semantics, as our name data types are completely opaque to our
+expression language. In the ~"json_encode"~ expression, they're
+serialized as ~null~, and a string representation is only generated in
+user messages not available to the language itself. Moreover, names
+cannot be compared for equality either, so their only observable
+properties, i.e., the way ~"DEP_ARTIFACTS"~, ~"DEP_RUNFILES"~, and
+~"DEP_PROVIDES"~ react to them, are invariant under repository
+bisimulation.
+
+** Configuration and the ~"export"~ rule
+
+Targets not only depend on the content of their repository, but also
+on their configurations. Normally, the effective part of a
+configuration is only determined after analysing the target. However,
+for caching, we need to compute the cache key directly. This property
+is provided by the built-in ~"export"~ rule; only ~"export"~ targets
+residing in content-fixed repositories will be cached. This also
+serves as an indication of which targets of a repository are intended
+for consumption by other repositories.
+
+An ~"export"~ rule takes precisely the following arguments.
+- ~"target"~ specifying a single target, the target to be cached.
+  It must not be tainted.
+- ~"flexible_config"~ a list of strings; those specify the variables
+  of the configuration that are considered. All other parts of
+  the configuration are ignored. So the effective configuration for
+  the ~"export"~ target is the configuration restricted to those
+  variables (filled up with ~null~ if the variable was not present
+  in the original configuration).
+- ~"fixed_config"~ a dict of arbitrary JSON values (taken
+  unevaluated) with keys disjoint from the ~"flexible_config"~.
+
+An ~"export"~ target is analyzed as follows. The configuration is
+restricted to the variables specified in the ~"flexible_config"~;
+this will result in the effective configuration for the exported
+target. It is a requirement that the effective configuration contain
+only pure JSON values. The (necessarily conflict-free) union with
+the ~"fixed_config"~ is computed and the ~"target"~ is evaluated
+in this configuration. The result (artifacts, runfiles, provided
+information) is the result of that evaluation. It is a requirement
+that the provided information contains only pure JSON values
+and artifacts (including tree artifacts); in particular, they may
+not contain names.
+
+** Cache key
+
+We only consider ~"export"~ targets in content-fixed repositories
+for caching. An export target is then fully described by
+- the repository key of the repository the export target resides in,
+- the target name of the export target within that repository,
+  described as module-name pair, and
+- the effective configuration.
+More precisely, the canonical description is the JSON object with
+those values for the keys ~"repo_key"~, ~"target_name"~, and ~"effective_config"~,
+respectively. The cache key is the blob identifier of the
+canonical serialisation (including sorted keys, etc.) of the just
+described piece of JSON. To allow debugging and cooperation with
+other tools, whenever a cache key is computed, it is ensured
+that the serialisation ends up in the applicable CAS.
+
+It should be noted that the cache key can be computed _without_
+analyzing the target referred to. This is possible, as the
+configuration is pruned a priori instead of the usual procedure
+to analyse and afterwards determine the parts of the configuration
+that were relevant.
+
+** Cached value
+
+The value to be cached is the result of evaluating the target,
+that is, its artifacts, runfiles, and provided data. All artifacts
+inside those data structures will be described as known artifacts.
+
+As serialisation, we will essentially use our usual JSON encoding;
+while this can be used as is for artifacts and runfiles where we
+know that they have to be a map from strings to artifacts, additional
+information will be added for the provided data. The provided data
+can contain artifacts, but also legitimately pure JSON values that
+coincide with our JSON encoding of artifacts; the same holds true
+for nodes and result values. Moreover, the tree unfolding implicit
+in the JSON serialisation can be exponentially larger than the value.
+
+Therefore, in our serialisation, we add an entry for every subexpression
+and separately add a list of which subexpressions are artifacts,
+nodes, or results. During deserialisation, we use this subexpression
+structure to deserialize every subexpression only once.
+
+** Sharding of target cache
+
+In our target description, the execution environment is not included.
+For local execution, it is implicit anyway. As we also want to
+cache high-level targets when using remote execution, we shard the
+target cache (e.g., by using appropriate subdirectories) by the blob
+identifier of the serialisation of the description of the execution
+backend. Here, ~null~ stands for local execution, and for remote
+execution we use an object with keys ~"remote_execution_address"~
+and ~"remote_execution_properties"~ filled in the obvious way. As
+usual, we add the serialisation to the CAS.
+
+** ~"export"~ targets, strictness and the extensional projection
+
+As opposed to the target that is exported, the corresponding export
+target, if part of a content-fixed repository, will be strict: a
+build depending on such a target can only succeed if all artifacts
+in the result of the target (regardless of whether direct artifacts,
+runfiles, or as part of the provided data) can be built, even if
+not all (or even none) are actually used in the build.
+
+Upon cache hit, the artifacts of an export target are the known
+artifacts corresponding to the artifacts of the exported target.
+While extensionally equal, known artifacts are defined differently,
+so an export target and the exported target are intensionally
+different (and that difference might only be visible on the second
+build). As intensional equality is used when testing for absence
+of conflicts in staging, a target and its exported version almost
+always conflict and hence should not be used together. One way to
+achieve this is to always use the export target for any target that
+is exported. This fits well together with the recommendation of
+only depending on export targets of other repositories.
+
+If a target forwards artifacts of an exported target (indirect header
+files, indirect link dependencies, etc), and is exported again, no
+additional conflicts occur; replacing by the corresponding known
+artifact is a projection: the known artifact corresponding to a
+known artifact is the artifact itself. Moreover, by the strictness
+property described earlier, if an export target has a cache hit,
+then so have all export targets it depends upon. Keep in mind that
+a repository can only be content-fixed if all its dependencies are.
+
+For this strictness-based approach to work, it is, however, a
+requirement that any artifact that is exported (typically indirectly,
+e.g., as part of a common dependency) by several targets is only
+used through the same export target. For a well-structured repository,
+this should be a natural property anyway.
+
+The forwarding of artifacts is the reason we chose that, in the
+non-cached analysis of an export target, the artifacts are passed on
+as received and are not wrapped in an "add to cache" action. The
+latter choice would violate the projection property we rely upon.
-- 
cgit v1.2.3