path: root/doc/concepts/target-cache.md
Diffstat (limited to 'doc/concepts/target-cache.md')
-rw-r--r-- doc/concepts/target-cache.md 231
1 files changed, 231 insertions, 0 deletions
diff --git a/doc/concepts/target-cache.md b/doc/concepts/target-cache.md
new file mode 100644
index 00000000..0db627e1
--- /dev/null
+++ b/doc/concepts/target-cache.md
@@ -0,0 +1,231 @@
+Target-level caching
+====================
+
+`git` trees as content-fixed roots
+----------------------------------
+
+### The `"git tree"` root scheme
+
+The multi-repository configuration supports a scheme `"git tree"`. This
+scheme is given by two parameters,
+
+ - the id of the tree (as a string with the hex encoding), and
+ - an arbitrary `git` repository containing the specified tree object,
+ as well as all needed tree and blob objects reachable from that
+ tree.
+
+For example, a root could be specified as follows.
+
+``` jsonc
+["git tree", "6a1820e78f61aee6b8f3677f150f4559b6ba77a4", "/usr/local/src/justbuild.git"]
+```
+
+It should be noted that the `git` tree identifier alone already
+specifies the content of the full tree. However, `just` needs access to
+some repository containing the tree in order to know what the tree looks
+like.
+
+Nevertheless, it is an important observation that the tree identifier
+alone already specifies the content of the whole (logical) directory.
+The equality of two such directories can be established by comparing the
+two identifiers *without* the need to read any file from
+disk. Those "fixed-content" descriptions, i.e., descriptions of a
+repository root that already fully determine the content, are the key to
+caching whole targets.
+
+### `KNOWN` artifacts
+
+The in-memory representation of known artifacts has an optional
+reference to a repository containing that artifact. Artifacts "known"
+from local repositories might not be known to the CAS used for the
+action execution; this additional reference allows such misses in the
+CAS to be filled.
+
+Content-fixed repositories
+--------------------------
+
+### The parts of a content-fixed repository
+
+In order to meaningfully cache a target, we need to be able to
+efficiently compute the cache key. We restrict this to the case where we
+can compute the information about the repository without file-system
+access. This requires that all roots (workspace, target root, etc.) be
+content fixed, as well as the bindings of the free repository names (and
+hence also all transitively reachable repositories). We call such
+repositories "content-fixed" repositories.
+
+### Canonical description of a content-fixed repository
+
+The local data of a repository consists of the following.
+
+ - The roots (for workspace, targets, rules, expressions). As the tree
+ identifier already defines the content, we leave out the path to the
+ repository containing the tree.
+ - The names of the targets, rules, and expression files.
+ - The names of the outgoing "bindings".
+
+Additionally, repositories can reach additional repositories via
+bindings. Moreover, this repository-level dependency relation is not
+necessarily cycle free. In particular, we cannot use the tree unfolding
+as canonical representation of that graph up to bisimulation, as we do
+with most other data structures. To still get a canonical
+representation, we factor out the largest bisimulation, i.e., minimize
+the respective automaton (with repositories as states, local data as
+locally observable properties, and the binding relation as edges).
+
+Finally, for each repository individually, the reachable repositories
+are renamed `"0"`, `"1"`, `"2"`, etc., following a depth-first traversal
+starting from the repository in question where outgoing edges are
+traversed in lexicographical order. The entry point is hence
+recognisable as repository `"0"`.
+
+The repository key is the content identifier of the canonically
+formatted serialisation of the JSON encoding of the obtained
+multi-repository configuration (with repository-free git-root
+descriptions). The serialisation itself is stored in CAS.
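+
+As an illustration, a configuration consisting of an entry point and
+one repository bound under the (hypothetical) local name `"base"` might
+be described roughly as follows. The field names are those of the
+multi-repository configuration; the second tree identifier is a
+placeholder, and the precise layout of the serialisation is an
+assumption of this sketch. Note that the roots carry only the tree
+identifier and no repository path.
+
+``` jsonc
+{ "repositories":
+  { "0":
+    { "workspace_root": ["git tree", "6a1820e78f61aee6b8f3677f150f4559b6ba77a4"]
+    , "target_file_name": "TARGETS"
+    , "rule_file_name": "RULES"
+    , "expression_file_name": "EXPRESSIONS"
+    , "bindings": {"base": "1"}
+    }
+  , "1":
+    { "workspace_root": ["git tree", "<tree id of the bound repository>"]
+    , "target_file_name": "TARGETS"
+    , "rule_file_name": "RULES"
+    , "expression_file_name": "EXPRESSIONS"
+    , "bindings": {}
+    }
+  }
+}
+```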
+
+These identifications and replacements of global names do not change
+the semantics, as our name data types are completely opaque to our
+expression language. In the `"json_encode"` expression, they are
+serialized as `null`, and a string representation is only generated in
+user messages not available to the language itself. Moreover, names
+cannot be compared for equality either, so their only observable
+properties, i.e., the way `"DEP_ARTIFACTS"`, `"DEP_RUNFILES"`, and
+`"DEP_PROVIDES"` react to them, are invariant under repository
+bisimulation.
+
+Configuration and the `"export"` rule
+-------------------------------------
+
+Targets not only depend on the content of their repository, but also on
+their configurations. Normally, the effective part of a configuration is
+only determined after analysing the target. However, for caching, we
+need to compute the cache key directly. This property is provided by the
+built-in `"export"` rule; only `"export"` targets residing in
+content-fixed repositories will be cached. This also serves as an
+indication of which targets of a repository are intended for
+consumption by other repositories.
+
+An `"export"` rule takes precisely the following arguments.
+
+ - `"target"` specifying a single target, the target to be cached. It
+ must not be tainted.
+ - `"flexible_config"` a list of strings; those specify the variables
+ of the configuration that are considered. All other parts of the
+ configuration are ignored. So the effective configuration for the
+ `"export"` target is the configuration restricted to those variables
+ (filled up with `null` if the variable was not present in the
+ original configuration).
+ - `"fixed_config"` a dict of arbitrary JSON values (taken
+ unevaluated) with keys disjoint from the `"flexible_config"`.
+
+An `"export"` target is analyzed as follows. The configuration is
+restricted to the variables specified in the `"flexible_config"`; this
+will result in the effective configuration for the exported target. It
+is a requirement that the effective configuration contain only pure JSON
+values. The (necessarily conflict-free) union with the `"fixed_config"`
+is computed and the `"target"` is evaluated in this configuration. The
+result (artifacts, runfiles, provided information) is the result of that
+evaluation. It is a requirement that the provided information contain
+only pure JSON values and artifacts (including tree artifacts); in
+particular, it may not contain names.
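+
+A sketch of an `"export"` target in a targets file could look as
+follows; the target names and configuration variables are hypothetical
+and chosen only for illustration.
+
+``` jsonc
+{ "libfoo":
+  { "type": "export"
+  , "target": "libfoo (unexported)"
+  , "flexible_config": ["CC", "CFLAGS"]
+  , "fixed_config": {"DEBUG": null}
+  }
+}
+```
+
+With these values, a configuration `{"CC": "gcc", "ARCH": "x86_64"}`
+would be restricted to the effective configuration
+`{"CC": "gcc", "CFLAGS": null}`, and `"libfoo (unexported)"` would be
+evaluated in `{"CC": "gcc", "CFLAGS": null, "DEBUG": null}`.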
+
+Cache key
+---------
+
+We only consider `"export"` targets in content-fixed repositories for
+caching. An export target is then fully described by
+
+ - the repository key of the repository the export target resides in,
+ - the target name of the export target within that repository,
+ described as a module-name pair, and
+ - the effective configuration.
+
+More precisely, the canonical description is the JSON object with those
+values for the keys `"repo_key"`, `"target_name"`, and
+`"effective_config"`, respectively. The cache key is the blob
+identifier of the canonical serialisation (including sorted keys, etc.)
+of the just described piece of JSON. To allow debugging and cooperation
+with other tools, whenever a cache key is computed, it is ensured that
+the serialisation ends up in the applicable CAS.
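+
+For a hypothetical export target `"libfoo"` in the top-level module,
+considered with the configuration variables `"CC"` and `"CFLAGS"`, such
+a description might look as follows. The repository key is shown as a
+placeholder, and the exact encoding of the module-name pair is an
+assumption of this sketch.
+
+``` jsonc
+{ "repo_key": "<blob identifier of the canonical repository description>"
+, "target_name": ["", "libfoo"]
+, "effective_config": {"CC": "gcc", "CFLAGS": null}
+}
+```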
+
+It should be noted that the cache key can be computed
+*without* analyzing the target referred to. This is
+possible, as the configuration is pruned a priori instead of the usual
+procedure to analyse and afterwards determine the parts of the
+configuration that were relevant.
+
+Cached value
+------------
+
+The value to be cached is the result of evaluating the target, that is,
+its artifacts, runfiles, and provided data. All artifacts inside those
+data structures will be described as known artifacts.
+
+As serialisation, we will essentially use our usual JSON encoding; while
+this can be used as is for artifacts and runfiles where we know that
+they have to be a map from strings to artifacts, additional information
+will be added for the provided data. The provided data can contain
+artifacts, but also legitimately pure JSON values that coincide with our
+JSON encoding of artifacts; the same holds true for nodes and result
+values. Moreover, the tree unfolding implicit in the JSON serialisation
+can be exponentially larger than the value.
+
+Therefore, in our serialisation, we add an entry for every subexpression
+and separately add a list of which subexpressions are artifacts, nodes,
+or results. During deserialisation, we use this subexpression structure
+to deserialize every subexpression only once.
+
+Sharding of target cache
+------------------------
+
+In our target description, the execution environment is not included.
+For local execution, it is implicit anyway. As we also want to cache
+high-level targets when using remote execution, we shard the target
+cache (e.g., by using appropriate subdirectories) by the blob identifier
+of the serialisation of the description of the execution backend. Here,
+`null` stands for local execution, and for remote execution we use an
+object with keys `"remote_execution_address"` and
+`"remote_execution_properties"` filled in the obvious way. As usual, we
+add the serialisation to the CAS.
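+
+For instance, the description of a remote execution backend might be
+serialised as follows, whereas local execution would simply be
+described by `null`. The address and the property shown are
+hypothetical values, and encoding the properties as a JSON object is an
+assumption of this sketch.
+
+``` jsonc
+{ "remote_execution_address": "build.example.org:8980"
+, "remote_execution_properties": {"OSFamily": "linux"}
+}
+```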
+
+`"export"` targets, strictness and the extensional projection
+-------------------------------------------------------------
+
+As opposed to the target that is exported, the corresponding export
+target, if part of a content-fixed repository, will be strict: a build
+depending on such a target can only succeed if all artifacts in the
+result of the target (regardless of whether direct artifacts, runfiles, or as
+part of the provided data) can be built, even if not all (or even none)
+are actually used in the build.
+
+Upon cache hit, the artifacts of an export target are the known
+artifacts corresponding to the artifacts of the exported target. While
+extensionally equal, known artifacts are defined differently, so an
+export target and the exported target are intensionally different (and
+that difference might only be visible on the second build). As
+intensional equality is used when testing for absence of conflicts in
+staging, a target and its exported version almost always conflict and
+hence should not be used together. One way to achieve this is to always
+use the export target for any target that is exported. This fits well
+together with the recommendation of only depending on export targets of
+other repositories.
+
+If a target forwards artifacts of an exported target (indirect header
+files, indirect link dependencies, etc), and is exported again, no
+additional conflicts occur; replacing by the corresponding known
+artifact is a projection: the known artifact corresponding to a known
+artifact is the artifact itself. Moreover, by the strictness property
+described earlier, if an export target has a cache hit, then so have all
+export targets it depends upon. Keep in mind that a repository can only
+be content-fixed if all its dependencies are.
+
+For this strictness-based approach to work, it is, however, a
+requirement that any artifact that is exported (typically indirectly,
+e.g., as part of a common dependency) by several targets is only used
+through the same export target. For a well-structured repository, this
+should be a natural property anyway.
+
+The forwarding of artifacts is the reason we chose that, in the
+non-cached analysis of an export target, the artifacts are passed on as
+received and are not wrapped in an "add to cache" action. The latter
+choice would violate the projection property we rely upon.