Diffstat (limited to 'doc/concepts/target-cache.md')
-rw-r--r-- | doc/concepts/target-cache.md | 231 |
1 files changed, 231 insertions, 0 deletions
diff --git a/doc/concepts/target-cache.md b/doc/concepts/target-cache.md
new file mode 100644
index 00000000..0db627e1
--- /dev/null
+++ b/doc/concepts/target-cache.md
@@ -0,0 +1,231 @@
+Target-level caching
+====================
+
+`git` trees as content-fixed roots
+----------------------------------
+
+### The `"git tree"` root scheme
+
+The multi-repository configuration supports a scheme `"git tree"`. This
+scheme is given by two parameters,
+
+ - the id of the tree (as a string with the hex encoding), and
+ - an arbitrary `git` repository containing the specified tree object,
+   as well as all needed tree and blob objects reachable from that
+   tree.
+
+For example, a root could be specified as follows.
+
+``` jsonc
+["git tree", "6a1820e78f61aee6b8f3677f150f4559b6ba77a4", "/usr/local/src/justbuild.git"]
+```
+
+It should be noted that the `git` tree identifier alone already
+specifies the content of the full tree. However, `just` needs access to
+some repository containing the tree in order to know what the tree looks
+like.
+
+Nevertheless, it is an important observation that the tree identifier
+alone already specifies the content of the whole (logical) directory.
+The equality of two such directories can be established by comparing the
+two identifiers *without* the need to read any file from
+disk. Those "fixed-content" descriptions, i.e., descriptions of a
+repository root that already fully determine the content, are the key to
+caching whole targets.
+
+### `KNOWN` artifacts
+
+The in-memory representation of known artifacts has an optional
+reference to a repository containing that artifact. Artifacts "known"
+from local repositories might not be known to the CAS used for the
+action execution; this additional reference allows filling such misses
+in the CAS.
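To illustrate the observation about `"git tree"` roots made above, here is a minimal Python sketch of comparing two roots by tree identifier only. The function name `same_root_content` and the list layout of a root are assumptions taken from the example root shown earlier; this is not code from `just` itself.

```python
def same_root_content(root_a, root_b):
    """Return True if both roots are "git tree" roots naming the same tree.

    Only the tree identifier (second entry) is compared; the path of the
    repository holding the tree (third entry) is deliberately ignored and
    no file is read from disk.
    """
    if root_a[0] != "git tree" or root_b[0] != "git tree":
        return False  # in this sketch, only "git tree" roots are content-fixed
    return root_a[1] == root_b[1]
```

Two roots that name the same tree but point to different repositories on disk compare as equal here, which is exactly the property the cache key relies on.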
+
+Content-fixed repositories
+--------------------------
+
+### The parts of a content-fixed repository
+
+In order to meaningfully cache a target, we need to be able to
+efficiently compute the cache key. We restrict this to the case where we
+can compute the information about the repository without file-system
+access. This requires that all roots (workspace, target root, etc.) be
+content fixed, as well as the bindings of the free repository names (and
+hence also all transitively reachable repositories). We call such
+repositories "content-fixed" repositories.
+
+### Canonical description of a content-fixed repository
+
+The local data of a repository consists of the following.
+
+ - The roots (for workspace, targets, rules, expressions). As the tree
+   identifier already defines the content, we leave out the path to the
+   repository containing the tree.
+ - The names of the targets, rules, and expression files.
+ - The names of the outgoing "bindings".
+
+Additionally, repositories can reach additional repositories via
+bindings. Moreover, this repository-level dependency relation is not
+necessarily cycle free. In particular, we cannot use the tree unfolding
+as canonical representation of that graph up to bisimulation, as we do
+with most other data structures. To still get a canonical
+representation, we factor out the largest bisimulation, i.e., minimize
+the respective automaton (with repositories as states, local data as
+locally observable properties, and the binding relation as edges).
+
+Finally, for each repository individually, the reachable repositories
+are renamed `"0"`, `"1"`, `"2"`, etc., following a depth-first traversal
+starting from the repository in question where outgoing edges are
+traversed in lexicographical order. The entry point is hence
+recognisable as repository `"0"`.
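The renaming step just described can be sketched in Python as follows. This is a minimal illustration, not the implementation used by `just`: the function name `canonical_names`, the representation of bindings as nested dictionaries, and the choice to order outgoing edges by binding name are assumptions, and the preceding bisimulation minimisation is assumed to have happened already.

```python
def canonical_names(bindings, start):
    """Rename all repositories reachable from `start` to "0", "1", "2", ...

    `bindings` maps a repository name to a dict {binding name: repository}.
    Repositories are numbered in the order a depth-first traversal first
    visits them; outgoing edges are followed in lexicographical order of
    the binding name (an assumption of this sketch).
    """
    names = {}

    def visit(repo):
        if repo in names:
            return  # already numbered; this also terminates on cycles
        names[repo] = str(len(names))
        outgoing = bindings.get(repo, {})
        for binding in sorted(outgoing):
            visit(outgoing[binding])

    visit(start)
    return names
```

Because the numbering depends on the starting repository, the same graph yields a different renaming for each entry point, and the entry point itself always receives `"0"`.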
+
+The repository key is the content-identifier of the canonically formatted
+canonical serialisation of the JSON encoding of the obtained
+multi-repository configuration (with repository-free git-root
+descriptions). The serialisation itself is stored in CAS.
+
+These identifications and the replacement of global names do not change
+the semantics, as our name data types are completely opaque to our
+expression language. In the `"json_encode"` expression, they are
+serialized as `null`, and a string representation is only generated in user
+messages not available to the language itself. Moreover, names cannot be
+compared for equality either, so their only observable properties, i.e.,
+the way `"DEP_ARTIFACTS"`, `"DEP_RUNFILES"`, and `"DEP_PROVIDES"` react
+to them, are invariant under repository bisimulation.
+
+Configuration and the `"export"` rule
+-------------------------------------
+
+Targets not only depend on the content of their repository, but also on
+their configurations. Normally, the effective part of a configuration is
+only determined after analysing the target. However, for caching, we
+need to compute the cache key directly. This property is provided by the
+built-in `"export"` rule; only `"export"` targets residing in
+content-fixed repositories will be cached. This also serves as an
+indication of which targets of a repository are intended for consumption
+by other repositories.
+
+An `"export"` rule takes precisely the following arguments.
+
+ - `"target"` specifying a single target, the target to be cached. It
+   must not be tainted.
+ - `"flexible_config"` a list of strings; those specify the variables
+   of the configuration that are considered. All other parts of the
+   configuration are ignored. So the effective configuration for the
+   `"export"` target is the configuration restricted to those variables
+   (filled up with `null` if the variable was not present in the
+   original configuration).
+ - `"fixed_config"` a dict of arbitrary JSON values (taken
+   unevaluated) with keys disjoint from the `"flexible_config"`.
+
+An `"export"` target is analyzed as follows. The configuration is
+restricted to the variables specified in the `"flexible_config"`; this
+will result in the effective configuration for the exported target. It
+is a requirement that the effective configuration contain only pure JSON
+values. The (necessarily conflict-free) union with the `"fixed_config"`
+is computed and the `"target"` is evaluated in this configuration. The
+result (artifacts, runfiles, provided information) is the result of that
+evaluation. It is a requirement that the provided information contain
+only pure JSON values and artifacts (including tree artifacts); in
+particular, it may not contain names.
+
+Cache key
+---------
+
+We only consider `"export"` targets in content-fixed repositories for
+caching. An export target is then fully described by
+
+ - the repository key of the repository the export target resides in,
+ - the target name of the export target within that repository,
+   described as a module-name pair, and
+ - the effective configuration.
+
+More precisely, the canonical description is the JSON object with those
+values for the keys `"repo_key"`, `"target_name"`, and
+`"effective_config"`, respectively. The cache key is the blob
+identifier of the canonical serialisation (including sorted keys, etc.)
+of the just described piece of JSON. To allow debugging and cooperation
+with other tools, whenever a cache key is computed, it is ensured that
+the serialisation ends up in the applicable CAS.
+
+It should be noted that the cache key can be computed
+*without* analyzing the target referred to. This is
+possible, as the configuration is pruned a priori instead of the usual
+procedure to analyse and afterwards determine the parts of the
+configuration that were relevant.
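The computation of the cache key described above can be sketched in Python. Several details are assumptions of this sketch rather than statements about `just`: the blob identifier is taken to be the `git` blob hash (SHA-1 over a `blob <size>\0` header followed by the content), the canonical serialisation is approximated by `json.dumps` with sorted keys and compact separators, the module-name pair is encoded as a two-element list, and all function names are hypothetical.

```python
import hashlib
import json


def git_blob_id(data: bytes) -> str:
    """Identifier git assigns to a blob with the given content (SHA-1)."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def canonical_json(value) -> bytes:
    """Assumed canonical form: sorted keys, no insignificant whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def restrict_config(config, flexible_config):
    """Keep only the flexible variables; absent ones are filled with null."""
    return {var: config.get(var) for var in flexible_config}


def target_cache_key(repo_key, module, name, config, flexible_config):
    """Return (cache key, serialisation) for an export target.

    The target itself is never analysed; only its description is hashed.
    The serialisation is returned so a caller can add it to the CAS.
    """
    description = {
        "repo_key": repo_key,
        "target_name": [module, name],
        "effective_config": restrict_config(config, flexible_config),
    }
    serialised = canonical_json(description)
    return git_blob_id(serialised), serialised
```

Since the configuration is restricted before hashing, two builds that differ only in variables outside `"flexible_config"` obtain the same key.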
+
+Cached value
+------------
+
+The value to be cached is the result of evaluating the target, that is,
+its artifacts, runfiles, and provided data. All artifacts inside those
+data structures will be described as known artifacts.
+
+As serialisation, we will essentially use our usual JSON encoding; while
+this can be used as is for artifacts and runfiles where we know that
+they have to be a map from strings to artifacts, additional information
+will be added for the provided data. The provided data can contain
+artifacts, but also legitimately pure JSON values that coincide with our
+JSON encoding of artifacts; the same holds true for nodes and result
+values. Moreover, the tree unfolding implicit in the JSON serialisation
+can be exponentially larger than the value.
+
+Therefore, in our serialisation, we add an entry for every subexpression
+and separately add a list of which subexpressions are artifacts, nodes,
+or results. During deserialisation, we use this subexpression structure
+to deserialize every subexpression only once.
+
+Sharding of target cache
+------------------------
+
+In our target description, the execution environment is not included.
+For local execution, it is implicit anyway. As we also want to cache
+high-level targets when using remote execution, we shard the target
+cache (e.g., by using appropriate subdirectories) by the blob identifier
+of the serialisation of the description of the execution backend. Here,
+`null` stands for local execution, and for remote execution we use an
+object with keys `"remote_execution_address"` and
+`"remote_execution_properties"` filled in the obvious way. As usual, we
+add the serialisation to the CAS.
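The sharding scheme described above can be sketched in Python. As before, the `git` blob hash is assumed as blob identifier and `json.dumps` with sorted keys as serialisation; the function names, the use of a plain dictionary for the execution properties, and the example address are hypothetical.

```python
import hashlib
import json


def git_blob_id(data: bytes) -> str:
    """Identifier git assigns to a blob with the given content (SHA-1)."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def backend_description(address=None, properties=None):
    """Describe the execution backend; None (JSON null) means local."""
    if address is None:
        return None
    return {
        "remote_execution_address": address,
        "remote_execution_properties": properties or {},
    }


def shard_id(address=None, properties=None):
    """Blob identifier of the serialised backend description."""
    description = backend_description(address, properties)
    serialised = json.dumps(description, sort_keys=True, separators=(",", ":"))
    return git_blob_id(serialised.encode("utf-8"))
```

A cache entry could then live in a subdirectory named after `shard_id(...)`, so that results obtained with different execution backends never share an entry.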
+
+`"export"` targets, strictness and the extensional projection
+-------------------------------------------------------------
+
+As opposed to the target that is exported, the corresponding export
+target, if part of a content-fixed repository, will be strict: a build
+depending on such a target can only succeed if all artifacts in the
+result of that target (regardless of whether direct artifacts, runfiles,
+or part of the provided data) can be built, even if not all (or even none)
+are actually used in the build.
+
+Upon cache hit, the artifacts of an export target are the known
+artifacts corresponding to the artifacts of the exported target. While
+extensionally equal, known artifacts are defined differently, so an
+export target and the exported target are intensionally different (and
+that difference might only be visible on the second build). As
+intensional equality is used when testing for absence of conflicts in
+staging, a target and its exported version almost always conflict and
+hence should not be used together. One way to achieve this is to always
+use the export target for any target that is exported. This fits well
+together with the recommendation of only depending on export targets of
+other repositories.
+
+If a target forwards artifacts of an exported target (indirect header
+files, indirect link dependencies, etc.), and is exported again, no
+additional conflicts occur; replacing by the corresponding known
+artifact is a projection: the known artifact corresponding to a known
+artifact is the artifact itself. Moreover, by the strictness property
+described earlier, if an export target has a cache hit, then so have all
+export targets it depends upon. Keep in mind that a repository can only
+be content-fixed if all its dependencies are.
+
+For this strictness-based approach to work, it is, however, a
+requirement that any artifact that is exported (typically indirectly,
+e.g., as part of a common dependency) by several targets is only used
+through the same export target. For a well-structured repository, this
+should be a natural property anyway.
+
+The forwarding of artifacts is the reason we chose that, in the
+non-cached analysis of an export target, the artifacts are passed on as
+received and are not wrapped in an "add to cache" action. The latter
+choice would violate the projection property we rely upon.