From 8a9664f9e63bf0528ee70ce40904314dd9deee64 Mon Sep 17 00:00:00 2001
From: Klaus Aehlig
Date: Wed, 15 Jun 2022 12:31:53 +0200
Subject: Document the concept of target-level caching

---
 doc/concepts/target-cache.org | 219 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 doc/concepts/target-cache.org

(limited to 'doc/concepts')

diff --git a/doc/concepts/target-cache.org b/doc/concepts/target-cache.org
new file mode 100644
index 00000000..dccbd5e7
--- /dev/null
+++ b/doc/concepts/target-cache.org
@@ -0,0 +1,219 @@
+* Target-level caching
+
+** ~git~ trees as content-fixed roots
+
+*** The ~"git tree"~ root scheme
+
+The multi-repository configuration supports a scheme ~"git tree"~.
+This scheme is given by two parameters,
+- the id of the tree (as a string with the hex encoding), and
+- an arbitrary ~git~ repository containing the specified tree
+  object, as well as all needed tree and blob objects reachable
+  from that tree.
+For example, a root could be specified as follows.
+#+BEGIN_SRC
+["git tree", "6a1820e78f61aee6b8f3677f150f4559b6ba77a4", "/usr/local/src/justbuild.git"]
+#+END_SRC
+
+It should be noted that the ~git~ tree identifier alone already
+specifies the content of the full tree. However, ~just~ needs access
+to some repository containing the tree in order to know what the
+tree looks like.
+
+Nevertheless, it is an important observation that the tree identifier
+alone already specifies the content of the whole (logical) directory.
+The equality of two such directories can be established by comparing
+the two identifiers _without_ the need to read any file from
+disk. Those "fixed-content" descriptions, i.e., descriptions of a
+repository root that already fully determine the content, are the
+key to caching whole targets.
+
+*** ~KNOWN~ artifacts
+
+The in-memory representation of known artifacts has an optional
+reference to a repository containing that artifact.
+Artifacts "known" from local repositories might not be known to the CAS
+used for the action execution; this additional reference allows filling
+such misses in the CAS.
+
+** Content-fixed repositories
+
+*** The parts of a content-fixed repository
+
+In order to meaningfully cache a target, we need to be able to
+efficiently compute the cache key. We restrict this to the case where
+we can compute the information about the repository without file-system
+access. This requires that all roots (workspace, target root, etc)
+be content fixed, as well as the bindings of the free repository
+names (and hence also all transitively reachable repositories).
+We call such repositories "content-fixed" repositories.
+
+*** Canonical description of a content-fixed repository
+
+The local data of a repository consists of the following.
+- The roots (for workspace, targets, rules, expressions). As the
+  tree identifier already defines the content, we leave out the
+  path to the repository containing the tree.
+- The names of the targets, rules, and expression files.
+- The names of the outgoing "bindings".
+
+Additionally, repositories can reach additional repositories via
+bindings. Moreover, this repository-level dependency relation
+is not necessarily cycle free. In particular, we cannot use the
+tree unfolding as canonical representation of that graph up to
+bisimulation, as we do with most other data structures. To still get
+a canonical representation, we factor out the largest bisimulation,
+i.e., minimize the respective automaton (with repositories as
+states, local data as locally observable properties, and the binding
+relation as edges).
+
+Finally, for each repository individually, the reachable repositories
+are renamed ~"0"~, ~"1"~, ~"2"~, etc, following a depth-first
+traversal starting from the repository in question where outgoing
+edges are traversed in lexicographical order. The entry point is
+hence recognisable as repository ~"0"~.
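The renaming step can be illustrated with a minimal Python sketch. It is not taken from the ~just~ sources: the function name ~canonical_names~ and the example repositories are hypothetical, it assumes that "lexicographical order" refers to the binding names labelling the outgoing edges, and it assumes that bisimilar repositories have already been identified beforehand.

```python
def canonical_names(bindings, start):
    """Map each repository reachable from `start` to "0", "1", "2", ...

    `bindings` maps a repository name to a dict {binding name: repository}.
    Numbers are assigned in depth-first pre-order; outgoing edges are
    visited in lexicographical order of the binding name (an assumption).
    """
    names = {}

    def visit(repo):
        if repo in names:
            return  # already numbered; this also terminates cycles
        names[repo] = str(len(names))
        for binding in sorted(bindings.get(repo, {})):
            visit(bindings[repo][binding])

    visit(start)
    return names


# Hypothetical repository graph with a cycle between "base" and "extras".
bindings = {
    "app": {"lib": "base", "tools": "extras"},
    "base": {"more": "extras"},
    "extras": {"back": "base"},
}
print(canonical_names(bindings, "app"))
# {'app': '0', 'base': '1', 'extras': '2'}
```

Since the numbering is computed per entry point, the same repository receives different names depending on where the traversal starts; starting from "base" in the example yields "base" as "0" and "extras" as "1".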
+
+The repository key is the content identifier of the canonically formatted
+canonical serialisation of the JSON encoding of the obtained
+multi-repository configuration (with repository-free git-root
+descriptions). The serialisation itself is stored in CAS.
+
+These identifications and replacement of global names do not change
+the semantics, as our name data types are completely opaque to our
+expression language. In the ~"json_encode"~ expression, they're
+serialized as ~null~ and string representation is only generated in
+user messages not available to the language itself. Moreover, names
+cannot be compared for equality either, so their only observable
+properties, i.e., the way ~"DEP_ARTIFACTS"~, ~"DEP_RUNFILES"~, and
+~"DEP_PROVIDES"~ react to them, are invariant under repository
+bisimulation.
+
+** Configuration and the ~"export"~ rule
+
+Targets not only depend on the content of their repository, but also
+on their configurations. Normally,
+the effective part of a configuration is only determined after
+analysing the target. However, for caching, we need to compute
+the cache key directly. This property is provided by the built-in ~"export"~ rule; only ~"export"~ targets
+residing in content-fixed repositories will be cached. This also
+serves as an indication of which targets of a repository are intended
+for consumption by other repositories.
+
+An ~"export"~ rule takes precisely the following arguments.
+- ~"target"~ specifying a single target, the target to be cached.
+  It must not be tainted.
+- ~"flexible_config"~ a list of strings; those specify the variables
+  of the configuration that are considered. All other parts of
+  the configuration are ignored. So the effective configuration for
+  the ~"export"~ target is the configuration restricted to those
+  variables (filled up with ~null~ if the variable was not present
+  in the original configuration).
+- ~"fixed_config"~ a dict of arbitrary JSON values (taken
+  unevaluated) with keys disjoint from the ~"flexible_config"~.
+
+An ~"export"~ target is analyzed as follows. The configuration is
+restricted to the variables specified in the ~"flexible_config"~;
+this will result in the effective configuration for the exported
+target. It is a requirement that the effective configuration contain
+only pure JSON values. The (necessarily conflict-free) union with
+the ~"fixed_config"~ is computed and the ~"target"~ is evaluated
+in this configuration. The result (artifacts, runfiles, provided
+information) is the result of that evaluation. It is a requirement
+that the provided information contains only pure JSON values
+and artifacts (including tree artifacts); in particular, it may
+not contain names.
+
+** Cache key
+
+We only consider ~"export"~ targets in content-fixed repositories
+for caching. An export target is then fully described by
+- the repository key of the repository the export target resides in,
+- the target name of the export target within that repository,
+  described as module-name pair, and
+- the effective configuration.
+More precisely, the canonical description is the JSON object with
+those values for the keys ~"repo_key"~, ~"target_name"~, and ~"effective_config"~,
+respectively. The cache key is the blob identifier of the
+canonical serialisation (including sorted keys, etc) of the just
+described piece of JSON. To allow debugging and cooperation with
+other tools, whenever a cache key is computed, it is ensured
+that the serialisation ends up in the applicable CAS.
+
+It should be noted that the cache key can be computed _without_
+analyzing the target referred to. This is possible, as the
+configuration is pruned a priori instead of the usual procedure
+to analyse and afterwards determine the parts of the configuration
+that were relevant.
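The key computation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual implementation: the helper names are hypothetical, the module-name pair is assumed to be encoded as a two-element list, the canonical serialisation is assumed to be compact JSON with sorted keys, and the blob identifier is assumed to be the ~git~ blob hash (SHA-1 over the header "blob <size>\0" followed by the content).

```python
import hashlib
import json


def effective_config(config, flexible_config):
    # Restrict the configuration to the listed variables; variables that
    # are absent are filled with None, which serialises as JSON null.
    return {var: config.get(var) for var in flexible_config}


def canonical_json(value):
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def git_blob_id(data):
    # git's blob identifier: SHA-1 over "blob <size>\0" plus the content.
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def target_cache_key(repo_key, module, name, config, flexible_config):
    # Returns the key together with the serialisation, so that the caller
    # can also store the serialisation in the CAS.
    description = {
        "repo_key": repo_key,
        "target_name": [module, name],  # assumed encoding of the pair
        "effective_config": effective_config(config, flexible_config),
    }
    serialisation = canonical_json(description).encode()
    return git_blob_id(serialisation), serialisation


# Hypothetical values; "DEBUG" is not flexible and hence does not matter.
key, blob = target_cache_key(
    "0123abcd", "lib", "hello",
    {"ARCH": "x86_64", "DEBUG": True}, ["ARCH", "OS"])
print(key)
print(blob.decode())
```

Note that nothing in the sketch looks at the target definition itself: the key depends only on the repository key, the target name, and the a priori pruned configuration.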
+
+** Cached value
+
+The value to be cached is the result of evaluating the target,
+that is, its artifacts, runfiles, and provided data. All artifacts
+inside those data structures will be described as known artifacts.
+
+As serialisation, we will essentially use our usual JSON encoding;
+while this can be used as is for artifacts and runfiles where we
+know that they have to be a map from strings to artifacts, additional
+information will be added for the provided data. The provided data
+can contain artifacts, but also legitimately pure JSON values that
+coincide with our JSON encoding of artifacts; the same holds true
+for nodes and result values. Moreover, the tree unfolding implicit
+in the JSON serialisation can be exponentially larger than the value.
+
+Therefore, in our serialisation, we add an entry for every subexpression
+and separately add a list of which subexpressions are artifacts,
+nodes, or results. During deserialisation, we use this subexpression
+structure to deserialize every subexpression only once.
+
+** Sharding of target cache
+
+In our target description, the execution environment is not included.
+For local execution, it is implicit anyway. As we also want to
+cache high-level targets when using remote execution, we shard the
+target cache (e.g., by using appropriate subdirectories) by the blob
+identifier of the serialisation of the description of the execution
+backend. Here, ~null~ stands for local execution, and for remote
+execution we use an object with keys ~"remote_execution_address"~
+and ~"remote_execution_properties"~ filled in the obvious way. As
+usual, we add the serialisation to the CAS.
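A minimal Python sketch of the sharding scheme, again under assumptions that are not confirmed by the text: the function names and the cache root are hypothetical, the remote-execution properties are assumed to be a string-to-string map, the serialisation is assumed to be compact JSON with sorted keys, and the blob identifier is assumed to be the ~git~ blob hash.

```python
import hashlib
import json
import os


def backend_description(address=None, properties=None):
    # None serialises as JSON null and stands for local execution.
    if address is None:
        return None
    return {
        "remote_execution_address": address,
        "remote_execution_properties": properties or {},
    }


def shard_id(description):
    # Blob identifier (assumed git-style) of the serialised description.
    data = json.dumps(description, sort_keys=True,
                      separators=(",", ":")).encode()
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def target_cache_dir(cache_root, description):
    # One subdirectory of the target cache per execution backend.
    return os.path.join(cache_root, shard_id(description))


# Hypothetical cache root and remote endpoint.
local = backend_description()
remote = backend_description("10.0.0.1:8980", {"OS": "linux"})
print(target_cache_dir("/var/cache/tc", local))
print(target_cache_dir("/var/cache/tc", remote))
```

With this layout, entries produced on different execution backends never share a directory, while two invocations against the same backend description agree on the shard.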
+
+** ~"export"~ targets, strictness and the extensional projection
+
+As opposed to the target that is exported, the corresponding export
+target, if part of a content-fixed repository, will be strict: a
+build depending on such a target can only succeed if all artifacts
+in the result of that target (regardless of whether they are direct
+artifacts, runfiles, or part of the provided data) can be built, even if
+not all (or even none) are actually used in the build.
+
+Upon cache hit, the artifacts of an export target are the known
+artifacts corresponding to the artifacts of the exported target.
+While extensionally equal, known artifacts are defined differently,
+so an export target and the exported target are intensionally
+different (and that difference might only be visible on the second
+build). As intensional equality is used when testing for absence
+of conflicts in staging, a target and its exported version almost
+always conflict and hence should not be used together. One way to
+achieve this is to always use the export target for any target that
+is exported. This fits well together with the recommendation of
+only depending on export targets of other repositories.
+
+If a target forwards artifacts of an exported target (indirect header
+files, indirect link dependencies, etc), and is exported again, no
+additional conflicts occur; replacing by the corresponding known
+artifact is a projection: the known artifact corresponding to a
+known artifact is the artifact itself. Moreover, by the strictness
+property described earlier, if an export target has a cache hit,
+then so have all export targets it depends upon. Keep in mind that
+a repository can only be content-fixed if all its dependencies are.
+
+For this strictness-based approach to work, it is, however, a
+requirement that any artifact that is exported (typically indirectly,
+e.g., as part of a common dependency) by several targets is only
+used through the same export target.
+For a well-structured repository, this should be a natural property anyway.
+
+The forwarding of artifacts is the reason we chose that in the
+non-cached analysis of an export target the artifacts are passed on
+as received and are not wrapped in an "add to cache" action. The
+latter choice would violate the projection property we rely upon.