1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
|
* Anonymous targets
** Motivation
Using [[https://github.com/protocolbuffers/protobuf][Protocol
buffers]] allows to specify, in a language-independent way, a wire
format for structured data. This is done by using description files
from which APIs for various languages can be generated. As protocol
buffers can contain other protocol buffers, the description files
themselves have a dependency structure.
From a software-engeneering point of view, the challenge is to
ensure that the author of the description files does not have to
be aware of the languages for which APIs will be generated later.
In fact, the main benefit of the language-independent description
is that clients in various languages can be implemented using the
same wire protocol (and thus capable of communicating with the
same server).
For a build system that means that we have to expect that language
bindings at places far away from the protocol definition, and
potentially several times. Such a duplication can also occur
implicitly if two buffers, for which language bindings are generated
both use a common buffer for which bindings are never requested
explicitly. Still, we want to avoid duplicate work for common parts
and we have to avoid conflicts with duplicate symbols and staging
conflicts for the libraries for the common part.
Our approach is that a "proto" target only provides the description
files together with their dependency structure. From those, a
consuming target generates "anonymous targets" as additional
dependencies; as those targets will have an appropriate notion of
equality, no duplicate work is done and hence, as a side effect,
staging or symbol conflicts are avoided as well.
** Prelimanary remark: action identifiers
Actions are defined as Merkle-tree hash of the contents. As all
components (input tree, list of output strings, command vector,
environment, and cache pragma) are given by expressions, that can
quickly be computed. This identifier also defines the notion of
equality for actions, and hence action artifacts. Recall that eqality
of artifacts is also (implicitly) used in our notion of disjoint
map union (where the set of keys does not have to be disjoint, as
long as the values for all duplicate keys are equal).
When constructing the action graph for traversal, we can drop
duplicates (i.e., actions with the same identifier, and hence the
same description). For the serialization of the graph as part of
the analyse command, we can afford the preperatory step to compute
a map from action id to list of origins.
** Equality
*** Notions of equality
In the context of builds, there are different concepts of equality
to consider. We recall the definitions, as well as their use in
our build tool.
**** Locational equality ("Defined at the same place")
Names (for targets and rules) are given by repository name, module
name, and target name (inside the module); additionally, for target
names, there's a bit specifying that we explicitly refer to a file.
Names are equal if and only if the respective strings (and the file
bit) are equal.
For targets, we use locational equality, i.e., we consider targets
equal precisely if their names are equal; targets defined at different
places are considered different, even if they're defined in the
same way. The reason we use notion of equality is that we have to
refer to targets (and also check if we already have a pending task
to analyse them) before we have fully explored them with all the
targets referred to in their definition.
**** Intensional equality ("Defined in the same way")
In our expression language we handle definitions; in particular,
we treat artifacts by their definition: a particular source file,
the output of a particular action, etc. Hence we use intensional
equality in our expression language; two objects are equal precisely
if they are defined in the same way. This notion of equality is easy
to determine without the need of reading a source file or running
an action. We implement quick tests by keeping a Merkle-tree hash
of all expression values.
**** Extensional equality ("Defining the same object")
For built artifacts, we use extensional equality, i.e., we consider
two files equal, if they are bit-by-bit identical. Implementation-wise,
we compare an appropriate cryptographic hash. Before running an
action, we built its inputs. In particular (as inputs are considered
extensionally) an action might cause a cache hit with an intensionally
different one.
**** Observable equality ("The defined objects behave in the same way")
Finally, there is the notion of observable equality, i.e., the
property that two binaries behaving the same way in all situations.
As this notion is undecidable, it is never used directly by any
build tool. However, it is often the motivation for a build in the
first place: we want a binary that behaves in a particular way.
*** Relation between these notions
The notions of equality were introduced in order from most fine grained
to most coarse. Targets defined at the same place are obviously defined
in the same way. Intensionally equal artifacts create equal action
graphs; here we can confidently say "equal" and not only isomorphic:
due to our preliminary clean up, even the node names are equal.
Making sure that equal actions produce bit-by-bit equal outputs
is the realm of [[https://reproducible-builds.org/][reproducibe
builds]]. The tool can support this by appropriate sandboxing,
etc, but the rules still have to define actions that don't pick
up non-input information like the current time, user id, readdir
order, etc. Files that are bit-by-bit identical will behave in
the same way.
*** Example
Consider the following target file.
#+BEGIN_SRC
{ "foo":
{ "type": "generic"
, "outs": ["out.txt"]
, "cmds": ["echo Hello World > out.txt"]
}
, "bar":
{ "type": "generic"
, "outs": ["out.txt"]
, "cmds": ["echo Hello World > out.txt"]
}
, "baz":
{ "type": "generic"
, "outs": ["out.txt"]
, "cmds": ["echo -n Hello > out.txt && echo ' World' >> out.txt"]
}
, "foo upper":
{ "type": "generic"
, "deps": ["foo"]
, "outs": ["upper.txt"]
, "cmds": ["cat out.txt | tr a-z A-Z > upper.txt"]
}
, "bar upper":
{ "type": "generic"
, "deps": ["bar"]
, "outs": ["upper.txt"]
, "cmds": ["cat out.txt | tr a-z A-Z > upper.txt"]
}
, "baz upper":
{ "type": "generic"
, "deps": ["baz"]
, "outs": ["upper.txt"]
, "cmds": ["cat out.txt | tr a-z A-Z > upper.txt"]
}
, "ALL":
{ "type": "install"
, "files":
{"foo.txt": "foo upper", "bar.txt": "bar upper", "baz.txt": "baz upper"}
}
}
#+END_SRC
Assume we build the target ~"ALL"~. Then we will analyse 7 targets,
all the locationally different ones (~"foo"~, ~"bar"~, ~"baz"~,
~"foo upper"~, ~"bar upper"~, ~"baz upper"~). For the targets ~"foo"~
and ~"bar"~, we immediately see that the definition is equal; their
intensional equality also renders ~"foo upper"~ and ~"bar upper"~
intensionally equal. Our action graph will contain 4 actions: one
with origins ~["foo", "bar"]~, one with origins ~["baz"]~, one with
origins ~["foo upper", "bar upper"]~, and one with origins ~["baz
upper"]~. The ~"install"~ target will, of course, not create any
actions. Building sequentially (~-J 1~), we will get one cache hit.
Even though the artifacts of ~"foo"~ and ~"bar"~ and of ~"baz~"
are defined differently, they are extensionally equal; both define
a file with contents ~"Hello World\n"~.
** Anonymous targets
Besides named targets we also have additional targets (and hence also
configured targets) that are not associated with a location they are
defined at. Due to the absence of definition location, their notion
of equality will take care of the necessary deduplication (implicitly,
by the way our dependency exploration works). We will call them
"anonymous targets", even though, technically, they're not fully
anonymous as the rules that are part of their structure will be
given by name, i.e., defining rule location.
*** Value type: target graph node
In order to allow targets to adequately describe a dependency
structure, we have a value type in our expression language, that
of a (target) graph node. As with all value types, equality is
intensional, i.e., nodes defined in the same way are equal even
if defined at different places. This can be achieved by our usual
approach for expressions of having cached Merkle-tree hashes and
comparing them when an equality test is required. This efficient
test for equality also allows using graph nodes as part of a map
key, e.g., for our asynchronous map conumers.
As a graph node can only be defined with all data given, the defined
dependency structure is cycle-free by construction. However, the
tree unfolding will usually be exponentially larger. For internal
handling, this is not a problem: our shared-pointer implementation
can efficently represent a directed acyclic graph and since we
cache hashes in expressions, we can compute the overall hash without
folding the structure to a tree. When presenting nodes to the user,
we only show the map of identifier to definition, to avoid that
exponential unfolding.
We have two kinds of nodes.
**** Value nodes
These represent a target that, in any configuration, returns a fixed
value. Source files would typically be represented this way. The
constructor function ~"VALUE_NODE"~ takes a single argument ~"$1"~
that has to be a result value.
**** Abstract nodes
These represent internal nodes in the dag. Their constructor
~"ABSTRACT_NODE"~ takes the following arguments (all evaluated).
- ~"node_type"~. An arbitrary string, not interpreted in any way, to
indicate the role that the node has in the dependency structure.
When we create an anonymous target from a node, this will serve
as the key into the rule mapping to be applied.
- ~"string_fields"~. This has to be a map of strings.
- ~"target_fields"~. These have to be a map of lists of graph nodes.
Moreover, we require that the keys for maps provided as ~"string_fields"~
and ~"target_fields"~ be disjoint.
*** Graph nodes in ~export~ targets
Graph nodes are completely free of names and hence are eligible
for exporting. As with other values, in the cache the intensional
definition of artifacts implict in them will be replaced by the
corresponding, extensionally equal, known value.
However, some care has to be taken in the serialisation that is
part of the caching, as we do not want to unfold the the dag to
a tree. Therefore, we take as JSON serialisation a simple dict
with ~"type"~ set to ~"NODE"~, and ~"value"~ set to the Merkle-tree
hash. That serialisation respects intensional equality. To allow
deserialisation, we add an additional map to the serialisation from
node hash to its definition.
*** Dependings on anonymous targets
**** Parts of an anonymous target
An anonymous target is given by a pair of a node and a map mapping
the abstract node-type specifying strings to rule names. So, in
the implementation these are just two expression pointers (with
their defined notion of equality, i.e., equality of the respective
Merkle-tree hashes). Such a pair of pointers also forms an additional
variant of a name value, refering to such an anonymous target.
It should be noted that such an anonymous target contains all the
information needed to evaluate it in the same way as a regular (named)
target defined by a user-defined rule. It is an analysis error
analysing an anonymous target where there is no entry in the rules
map for the string given as ~"node_type"~ for the corresponding node.
**** Anonymous targets as additional dependencies
We keep the property that a user can only request named targets.
So anonymous targets have to be requested by other targets. We
also keep the property that other targets are only requested at
certain fixed steps in the evaluation of a target. To still achieve
a meaningful use of anonymous targets our rule language hanldes
anonymous targets in the following way.
***** Rules parameter ~"anonymous"~
In the rule definition a parameter ~"anonymous"~ (with empty map as
default) is allowed. It is used to define an additional dependency on
anonymous targets. The value has to be a map with keys the additional
implicitly defined field names. It is hence a requirement that the
set of keys be disjoint from all other field names (the values of
~"config_fields"~, ~"string_fields"~, and ~"target_fields"~, as well as
the keys of the ~"implict"~ parameter). Another consequence is that
~"config_transitions"~ map may now also have meaningful entries for
the keys of the ~"anonymous"~ map. Each value in the map has to be
itself a map, with entries ~"target"~, ~"provider"~, and ~"rule_map"~.
For ~"target"~, a single string has to be specifed, and the value has
to be a member of the ~"target_fields"~ list. For provider, a single
string has to be specified as well. The idea is that the nodes are
collected from that provider of the targets in the specified target
field. For ~"rule_map"~ a map has to be specified from strings to
rule names; the latter are evaluated in the context of the rule
definition.
****** Example
For generating language bindings for protocol buffers, a rule might
look as follows.
#+BEGIN_SRC
{ "cc_proto_bindings":
{ "target_fields": ["proto_deps"]
, "anonymous":
{ "protos":
{ "target": "proto_deps"
, "provider": "proto"
, "rule_map": {"proto_library": "cc_proto_library"}
}
}
, "expression": {...}
}
}
#+END_SRC
***** Evaluation mechanism
The evaluation of a target defined by a user-defined rule is handled
as follows. After the target fields are evaluated as usual, an
additional step is carried out.
For each anymous-target field, i.e., for each key in the ~"anonymous"~
map, a list of anymous targets is generated from the corresponding
value: take all targets from the specified ~"target"~ field in all
their specified configuration transitions (they have already been
evaluated) and take the values provided for the specified ~"provider"~
key (using the empty list as default). That value has to be a list
of nodes. All the node lists obtained that way are concatenated.
The configuration transition for the respective field name is
evaluated. Those targets are then evaluated for all the transitioned
configurations requested.
In the final evaluation of the defining expression, the anonymous-target
fields are available in the same way as any other target field.
Also, they contribute to the effective configuration in the same
way as regular target fields.
|