* Target-level caching
** ~git~ trees as content-fixed roots
*** The ~"git tree"~ root scheme
The multi-repository configuration supports a scheme ~"git tree"~.
This scheme is given by two parameters,
- the id of the tree (as a string with the hex encoding), and
- an arbitrary ~git~ repository containing the specified tree
object, as well as all needed tree and blob objects reachable
from that tree.
For example, a root could be specified as follows.
#+BEGIN_SRC
["git tree", "6a1820e78f61aee6b8f3677f150f4559b6ba77a4", "/usr/local/src/justbuild.git"]
#+END_SRC
It should be noted that the ~git~ tree identifier alone already
specifies the content of the full tree. However, ~just~ needs access
to some repository containing the tree in order to know what the
tree looks like.
Nevertheless, it is an important observation that the tree identifier
alone already specifies the content of the whole (logical) directory.
The equality of two such directories can be established by comparing
the two identifiers _without_ the need to read any file from
disk. Those "content-fixed" descriptions, i.e., descriptions of a
repository root that already fully determine the content, are the
key to caching whole targets.
*** ~KNOWN~ artifacts
The in-memory representation of known artifacts has an optional
reference to a repository containing that artifact. Artifacts
"known" from local repositories might not be known to the CAS used
for the action execution; this additional reference makes it possible
to fill such misses in the CAS.
** Content-fixed repositories
*** The parts of a content-fixed repository
In order to meaningfully cache a target, we need to be able to
efficiently compute the cache key. We restrict this to the case where
we can compute the information about the repository without file-system
access. This requires that all roots (workspace, target root, etc)
be content fixed, as well as the bindings of the free repository
names (and hence also all transitively reachable repositories).
We call such repositories "content-fixed" repositories.
*** Canonical description of a content-fixed repository
The local data of a repository consists of the following.
- The roots (for workspace, targets, rules, expressions). As the
tree identifier already defines the content, we leave out the
path to the repository containing the tree.
- The names of the target, rule, and expression files.
- The names of the outgoing "bindings".
Additionally, a repository can reach other repositories via its
bindings. This repository-level dependency relation is not necessarily
cycle-free. In particular, we cannot use the tree unfolding as the
canonical representation of that graph up to bisimulation, as we do
with most other data structures. To still obtain a canonical
representation, we factor out the largest bisimulation, i.e., minimize
the respective automaton (with repositories as states, local data as
locally observable properties, and the binding relation as edges).
Finally, for each repository individually, the reachable repositories
are renamed ~"0"~, ~"1"~, ~"2"~, etc, following a depth-first
traversal starting from the repository in question, where outgoing
edges are traversed in lexicographical order. The entry point is
hence recognisable as repository ~"0"~.
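For illustration only (the second tree identifier and the exact
selection of fields are placeholders, not a normative format), a
repository with a single binding ~"toolchain"~ might, after
minimization and canonical renaming, be described as follows.
#+BEGIN_SRC
{ "repositories":
  { "0":
    { "workspace_root": ["git tree", "6a1820e78f61aee6b8f3677f150f4559b6ba77a4"]
    , "target_file_name": "TARGETS"
    , "bindings": {"toolchain": "1"}
    }
  , "1":
    { "workspace_root": ["git tree", "5f9d0b3a7c1e4d2b8a6f0c3e9d1b7a5c4e2f8d60"]
    , "target_file_name": "TARGETS"
    , "bindings": {}
    }
  }
}
#+END_SRC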
The repository key is the content identifier of the canonical
serialisation of the JSON encoding of the multi-repository
configuration obtained in this way (with repository-free git-root
descriptions). The serialisation itself is stored in the CAS.
These identifications and the replacement of global names do not
change the semantics, as our name data types are completely opaque
to our expression language. In the ~"json_encode"~ expression, names
are serialized as ~null~, and a string representation is only
generated in user messages, which are not available to the language
itself. Moreover, names cannot be compared for equality, so their
only observable properties, i.e., the way ~"DEP_ARTIFACTS"~,
~"DEP_RUNFILES"~, and ~"DEP_PROVIDES"~ react to them, are invariant
under repository bisimulation.
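For example (the value shown is made up), if a provided value were
to contain a name, its JSON encoding would render that position as
~null~, so bisimilar repositories yield identical encodings.
#+BEGIN_SRC
provided value (containing a name):  {"deps": [<name>], "flags": ["-O2"]}
result of "json_encode":             {"deps": [null], "flags": ["-O2"]}
#+END_SRC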
** Configuration and the ~"export"~ rule
Targets depend not only on the content of their repository, but
also on their configuration. Normally, the effective part of a
configuration is only determined after analysing the target. For
caching, however, we need to compute the cache key directly. This
property is provided by the built-in ~"export"~ rule; only ~"export"~
targets residing in content-fixed repositories will be cached. This
also serves as an indication of which targets of a repository are
intended for consumption by other repositories.
An ~"export"~ rule takes precisely the following arguments.
- ~"target"~ specifying a single target, the target to be cached.
It must not be tainted.
- ~"flexible_config"~ a list of strings; those specify the variables
of the configuration that are considered. All other parts of
the configuration are ignored. So the effective configuration for
the ~"export"~ target is the configuration restricted to those
variables (filled up with ~null~ if the variable was not present
in the original configuration).
- ~"fixed_config"~ a dict with of arbitrary JSON values (taken
unevaluated) with keys disjoint from the ~"flexible_config"~.
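As an example (all target and variable names here are made up), an
~"export"~ target in a ~TARGETS~ file could be described as follows.
#+BEGIN_SRC
{ "libfoo":
  { "type": "export"
  , "target": "libfoo-impl"
  , "flexible_config": ["OS", "ARCH", "DEBUG"]
  , "fixed_config": {"USE_SYSTEM_LIBS": true}
  }
}
#+END_SRC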
An ~"export"~ target is analyzed as follows. The configuration is
restricted to the variables specified in the ~"flexible_config"~;
this will result in the effective configuration for the exported
target. It is a requirement that the effective configuration contain
only pure JSON values. The (necessarily conflict-free) union with
the ~"fixed_config"~ is computed and the ~"target"~ is evaluated
in this configuration. The result of the ~"export"~ target (artifacts,
runfiles, provided information) is the result of that evaluation. It
is a requirement that the provided information contain only pure JSON
values and artifacts (including tree artifacts); in particular, it
may not contain names.
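Continuing the hypothetical ~"libfoo"~ example from above, analysis
in some concrete configuration proceeds roughly as follows.
#+BEGIN_SRC
configuration of the export target:  {"OS": "linux", "ARCH": "x86_64", "CC": "clang"}
restricted to the flexible_config:   {"OS": "linux", "ARCH": "x86_64", "DEBUG": null}
union with the fixed_config:         { "OS": "linux", "ARCH": "x86_64", "DEBUG": null
                                     , "USE_SYSTEM_LIBS": true}
#+END_SRC
The second line is the effective configuration that enters the cache
key; the third is the configuration in which ~"libfoo-impl"~ is
evaluated.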
** Cache key
We only consider ~"export"~ targets in content-fixed repositories
for caching. An export target is then fully described by
- the repository key of the repository the export target resides in,
- the target name of the export target within that repository,
described as module-name pair, and
- the effective configuration.
More precisely, the canonical description is the JSON object with
those values for the keys ~"repo_key"~, ~"target_name"~, and
~"effective_config"~, respectively. The cache key is the blob
identifier of the canonical serialisation (including sorted keys,
etc) of the just-described piece of JSON. To allow debugging and
cooperation with other tools, it is ensured that, whenever a cache
key is computed, the serialisation ends up in the applicable CAS.
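Continuing the example, and assuming the export target ~"libfoo"~
resides in module ~"src"~ of a repository whose repository key is
the placeholder hash below, the canonical description whose
serialisation is hashed would be the following.
#+BEGIN_SRC
{ "effective_config": {"ARCH": "x86_64", "DEBUG": null, "OS": "linux"}
, "repo_key": "ca72ad863a2498181c341fc2858921ee29bc4f95"
, "target_name": ["src", "libfoo"]
}
#+END_SRC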
It should be noted that the cache key can be computed _without_
analyzing the target referred to. This is possible, as the
configuration is pruned a priori, instead of following the usual
procedure of analysing first and only afterwards determining which
parts of the configuration were relevant.
** Cached value
The value to be cached is the result of evaluating the target,
that is, its artifacts, runfiles, and provided data. All artifacts
inside those data structures will be described as known artifacts.
As serialisation, we will essentially use our usual JSON encoding;
while this can be used as is for artifacts and runfiles, which we
know to be maps from strings to artifacts, additional information
will be added for the provided data. The provided data
can contain artifacts, but also legitimately pure JSON values that
coincide with our JSON encoding of artifacts; the same holds true
for nodes and result values. Moreover, the tree unfolding implicit
in the JSON serialisation can be exponentially larger than the value.
Therefore, in our serialisation, we add an entry for every subexpression
and separately add a list of which subexpressions are artifacts,
nodes, or results. During deserialisation, we use this subexpression
structure to deserialize every subexpression only once.
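The following is only a conceptual sketch of that idea, not the
actual serialisation format of the tool: each distinct subexpression
is listed once, larger entries refer to earlier ones by index, and a
separate list records which entries are artifacts.
#+BEGIN_SRC
{ "subexpressions":
  [ {"file_type": "f", "id": "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", "size": 0}
  , {"headers": {"index": 0}, "static-lib": {"index": 0}}
  ]
, "artifact_indices": [0]
, "provided_data": {"index": 1}
}
#+END_SRC
Here the artifact at index 0 is referenced twice by the provided
data, but stored and deserialised only once.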
** Sharding of target cache
In our target description, the execution environment is not included.
For local execution, it is implicit anyway. As we also want to
cache high-level targets when using remote execution, we shard the
target cache (e.g., by using appropriate subdirectories) by the blob
identifier of the serialisation of the description of the execution
backend. Here, ~null~ stands for local execution, and for remote
execution we use an object with keys ~"remote_execution_address"~
and ~"remote_execution_properties"~ filled in the obvious way. As
usual, we add the serialisation to the CAS.
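For example, with a remote endpoint (the address and properties
shown are, of course, specific to a particular setup), the
description whose blob identifier is used for sharding would look
like the following, whereas for local execution it is simply ~null~.
#+BEGIN_SRC
{ "remote_execution_address": "build.example.com:8980"
, "remote_execution_properties": {"OS": "linux"}
}
#+END_SRC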
** ~"export"~ targets, strictness and the extensional projection
As opposed to the target that is exported, the corresponding export
target, if part of a content-fixed repository, will be strict: a
build depending on such a target can only succeed if all artifacts
in the result of that target (regardless of whether they are direct
artifacts, runfiles, or part of the provided data) can be built,
even if not all (or even none) of them are actually used in the build.
Upon cache hit, the artifacts of an export target are the known
artifacts corresponding to the artifacts of the exported target.
While extensionally equal, known artifacts are defined differently,
so an export target and the exported target are intensionally
different (and that difference might only be visible on the second
build). As intensional equality is used when testing for absence
of conflicts in staging, a target and its exported version almost
always conflict and hence should not be used together. One way to
achieve this is to always use the export target for any target that
is exported. This fits well together with the recommendation of
only depending on export targets of other repositories.
If a target forwards artifacts of an exported target (indirect header
files, indirect link dependencies, etc), and is exported again, no
additional conflicts occur; replacing by the corresponding known
artifact is a projection: the known artifact corresponding to a
known artifact is the artifact itself. Moreover, by the strictness
property described earlier, if an export target has a cache hit,
then so have all export targets it depends upon. Keep in mind that
a repository can only be content-fixed if all its dependencies are.
For this strictness-based approach to work, it is, however, a
requirement that any artifact that is exported (typically indirectly,
e.g., as part of a common dependency) by several targets is only
used through the same export target. For a well-structured repository,
this should be a natural property anyway.
The forwarding of artifacts is the reason why we chose that, in the
non-cached analysis of an export target, the artifacts are passed on
as received and are not wrapped in an "add to cache" action. The
latter choice would violate the projection property we rely upon.