| author | Oliver Reiche <oliver.reiche@huawei.com> | 2023-06-01 13:36:32 +0200 |
|---|---|---|
| committer | Oliver Reiche <oliver.reiche@huawei.com> | 2023-06-12 16:29:05 +0200 |
| commit | b66a7359fbbff35af630c88c56598bbc06b393e1 (patch) | |
| tree | d866802c4b44c13cbd90f9919cc7fc472091be0c /doc/concepts | |
| parent | 144b2c619f28c91663936cd445251ca28af45f88 (diff) | |
| download | justbuild-b66a7359fbbff35af630c88c56598bbc06b393e1.tar.gz | |
doc: Convert orgmode files to markdown
Diffstat (limited to 'doc/concepts')
| mode | file | lines |
|---|---|---|
| -rw-r--r-- | doc/concepts/anonymous-targets.md | 345 |
| -rw-r--r-- | doc/concepts/anonymous-targets.org | 336 |
| -rw-r--r-- | doc/concepts/built-in-rules.md | 172 |
| -rw-r--r-- | doc/concepts/built-in-rules.org | 167 |
| -rw-r--r-- | doc/concepts/cache-pragma.md | 134 |
| -rw-r--r-- | doc/concepts/cache-pragma.org | 130 |
| -rw-r--r-- | doc/concepts/configuration.md | 115 |
| -rw-r--r-- | doc/concepts/configuration.org | 107 |
| -rw-r--r-- | doc/concepts/doc-strings.md | 152 |
| -rw-r--r-- | doc/concepts/doc-strings.org | 145 |
| -rw-r--r-- | doc/concepts/expressions.md | 368 |
| -rw-r--r-- | doc/concepts/expressions.org | 344 |
| -rw-r--r-- | doc/concepts/garbage.md | 86 |
| -rw-r--r-- | doc/concepts/garbage.org | 82 |
| -rw-r--r-- | doc/concepts/multi-repo.md | 170 |
| -rw-r--r-- | doc/concepts/multi-repo.org | 167 |
| -rw-r--r-- | doc/concepts/overview.md | 210 |
| -rw-r--r-- | doc/concepts/overview.org | 206 |
| -rw-r--r-- | doc/concepts/rules.md | 567 |
| -rw-r--r-- | doc/concepts/rules.org | 551 |
| -rw-r--r-- | doc/concepts/target-cache.md | 231 |
| -rw-r--r-- | doc/concepts/target-cache.org | 219 |
22 files changed, 2550 insertions, 2454 deletions
diff --git a/doc/concepts/anonymous-targets.md b/doc/concepts/anonymous-targets.md
new file mode 100644
index 00000000..6692d0ae
--- /dev/null
+++ b/doc/concepts/anonymous-targets.md
@@ -0,0 +1,345 @@

Anonymous targets
=================

Motivation
----------

Using [Protocol buffers](https://github.com/protocolbuffers/protobuf)
makes it possible to specify, in a language-independent way, a wire
format for structured data. This is done by using description files
from which APIs for various languages can be generated. As protocol
buffers can contain other protocol buffers, the description files
themselves have a dependency structure.

From a software-engineering point of view, the challenge is to ensure
that the author of the description files does not have to be aware of
the languages for which APIs will be generated later. In fact, the main
benefit of the language-independent description is that clients in
various languages can be implemented using the same wire protocol (and
are thus capable of communicating with the same server).

For a build system, that means we have to expect that language bindings
are needed at places far away from the protocol definition, and
potentially several times. Such a duplication can also occur implicitly
if two buffers, for which language bindings are generated, both use a
common buffer for which bindings are never requested explicitly. Still,
we want to avoid duplicate work for common parts, and we have to avoid
conflicts from duplicate symbols and staging conflicts for the
libraries of the common part.

Our approach is that a "proto" target only provides the description
files together with their dependency structure. From those, a consuming
target generates "anonymous targets" as additional dependencies; as
those targets have an appropriate notion of equality, no duplicate work
is done and hence, as a side effect, staging and symbol conflicts are
avoided as well.

Preliminary remark: action identifiers
--------------------------------------

Actions are identified by the Merkle-tree hash of their contents. As
all components (input tree, list of output strings, command vector,
environment, and cache pragma) are given by expressions, this hash can
quickly be computed. The identifier also defines the notion of equality
for actions, and hence for action artifacts. Recall that equality of
artifacts is also (implicitly) used in our notion of disjoint map union
(where the set of keys does not have to be disjoint, as long as the
values for all duplicate keys are equal).

When constructing the action graph for traversal, we can drop
duplicates (i.e., actions with the same identifier, and hence the same
description). For the serialization of the graph as part of the analyse
command, we can afford the preparatory step of computing a map from
action id to list of origins.

Equality
--------

### Notions of equality

In the context of builds, there are different concepts of equality to
consider. We recall the definitions, as well as their use in our build
tool.

#### Locational equality ("Defined at the same place")

Names (for targets and rules) are given by repository name, module
name, and target name (inside the module); additionally, for target
names, there is a bit specifying that we explicitly refer to a file.
Names are equal if and only if the respective strings (and the file
bit) are equal.

For targets, we use locational equality, i.e., we consider targets
equal precisely if their names are equal; targets defined at different
places are considered different, even if they are defined in the same
way. The reason we use this notion of equality is that we have to refer
to targets (and also check if we already have a pending task to analyse
them) before we have fully explored them with all the targets referred
to in their definition.

#### Intensional equality ("Defined in the same way")

In our expression language we handle definitions; in particular, we
treat artifacts by their definition: a particular source file, the
output of a particular action, etc. Hence we use intensional equality
in our expression language; two objects are equal precisely if they are
defined in the same way. This notion of equality is easy to determine
without the need to read a source file or run an action. We implement
quick tests by keeping a Merkle-tree hash of all expression values.

#### Extensional equality ("Defining the same object")

For built artifacts, we use extensional equality, i.e., we consider two
files equal if they are bit-by-bit identical. Implementation-wise, we
compare an appropriate cryptographic hash. Before running an action, we
build its inputs. In particular (as inputs are considered
extensionally), an action might cause a cache hit with an intensionally
different one.

#### Observable equality ("The defined objects behave in the same way")

Finally, there is the notion of observable equality, i.e., the property
that two binaries behave the same way in all situations. As this notion
is undecidable, it is never used directly by any build tool. However,
it is often the motivation for a build in the first place: we want a
binary that behaves in a particular way.

### Relation between these notions

The notions of equality were introduced in order from most fine-grained
to most coarse. Targets defined at the same place are obviously defined
in the same way. Intensionally equal artifacts create equal action
graphs; here we can confidently say "equal" and not only isomorphic:
due to our preliminary clean-up, even the node names are equal. Making
sure that equal actions produce bit-by-bit equal outputs is the realm
of [reproducible builds](https://reproducible-builds.org/). The tool
can support this by appropriate sandboxing, etc., but the rules still
have to define actions that don't pick up non-input information like
the current time, user id, readdir order, etc. Files that are
bit-by-bit identical will behave in the same way.

### Example

Consider the following target file.

```jsonc
{ "foo":
  { "type": "generic"
  , "outs": ["out.txt"]
  , "cmds": ["echo Hello World > out.txt"]
  }
, "bar":
  { "type": "generic"
  , "outs": ["out.txt"]
  , "cmds": ["echo Hello World > out.txt"]
  }
, "baz":
  { "type": "generic"
  , "outs": ["out.txt"]
  , "cmds": ["echo -n Hello > out.txt && echo ' World' >> out.txt"]
  }
, "foo upper":
  { "type": "generic"
  , "deps": ["foo"]
  , "outs": ["upper.txt"]
  , "cmds": ["cat out.txt | tr a-z A-Z > upper.txt"]
  }
, "bar upper":
  { "type": "generic"
  , "deps": ["bar"]
  , "outs": ["upper.txt"]
  , "cmds": ["cat out.txt | tr a-z A-Z > upper.txt"]
  }
, "baz upper":
  { "type": "generic"
  , "deps": ["baz"]
  , "outs": ["upper.txt"]
  , "cmds": ["cat out.txt | tr a-z A-Z > upper.txt"]
  }
, "ALL":
  { "type": "install"
  , "files":
    {"foo.txt": "foo upper", "bar.txt": "bar upper", "baz.txt": "baz upper"}
  }
}
```

Assume we build the target `"ALL"`. Then we will analyse 7 targets: all
the locationally different ones (`"foo"`, `"bar"`, `"baz"`,
`"foo upper"`, `"bar upper"`, `"baz upper"`) as well as `"ALL"` itself.
For the targets `"foo"` and `"bar"`, we immediately see that the
definitions are equal; their intensional equality also renders
`"foo upper"` and `"bar upper"` intensionally equal. Our action graph
will contain 4 actions: one with origins `["foo", "bar"]`, one with
origins `["baz"]`, one with origins `["foo upper", "bar upper"]`, and
one with origins `["baz upper"]`. The `"install"` target will, of
course, not create any actions. Building sequentially (`-J 1`), we will
get one cache hit. Even though the artifacts of `"foo"` and `"bar"` on
the one hand and of `"baz"` on the other are defined differently, they
are extensionally equal; all of them define a file with content
`"Hello World\n"`.

Anonymous targets
-----------------

Besides named targets, we also have additional targets (and hence also
configured targets) that are not associated with a location they are
defined at. Due to the absence of a definition location, their notion
of equality will take care of the necessary deduplication (implicitly,
by the way our dependency exploration works). We will call them
"anonymous targets", even though, technically, they are not fully
anonymous, as the rules that are part of their structure are given by
name, i.e., by defining rule location.

### Value type: target graph node

In order to allow targets to adequately describe a dependency
structure, we have a value type in our expression language: that of a
(target) graph node. As with all value types, equality is intensional,
i.e., nodes defined in the same way are equal even if defined at
different places. This can be achieved by our usual approach for
expressions of having cached Merkle-tree hashes and comparing them when
an equality test is required. This efficient test for equality also
allows using graph nodes as part of a map key, e.g., for our
asynchronous map consumers.

As a graph node can only be defined with all data given, the defined
dependency structure is cycle-free by construction. However, the tree
unfolding will usually be exponentially larger. For internal handling,
this is not a problem: our shared-pointer implementation can
efficiently represent a directed acyclic graph, and since we cache
hashes in expressions, we can compute the overall hash without
unfolding the structure to a tree. When presenting nodes to the user,
we only show the map of identifier to definition, to avoid that
exponential unfolding.

We have two kinds of nodes.

#### Value nodes

These represent a target that, in any configuration, returns a fixed
value. Source files would typically be represented this way. The
constructor function `"VALUE_NODE"` takes a single argument `"$1"` that
has to be a result value.

#### Abstract nodes

These represent internal nodes in the DAG. Their constructor
`"ABSTRACT_NODE"` takes the following arguments (all evaluated).

 - `"node_type"`. An arbitrary string, not interpreted in any way, that
   indicates the role the node has in the dependency structure. When we
   create an anonymous target from a node, this string serves as the
   key into the rule mapping to be applied.
 - `"string_fields"`. This has to be a map of strings.
 - `"target_fields"`. These have to be a map of lists of graph nodes.

Moreover, we require that the keys of the maps provided as
`"string_fields"` and `"target_fields"` be disjoint. A sketch using
both constructors is given at the end of this file.

### Graph nodes in `export` targets

Graph nodes are completely free of names and hence are eligible for
exporting. As with other values, in the cache the intensional
definition of the artifacts implicit in them will be replaced by the
corresponding, extensionally equal, known value.

However, some care has to be taken in the serialisation that is part of
the caching, as we do not want to unfold the DAG to a tree. Therefore,
we take as JSON serialisation a simple dict with `"type"` set to
`"NODE"` and `"value"` set to the Merkle-tree hash. That serialisation
respects intensional equality. To allow deserialisation, we add to the
serialisation an additional map from node hash to definition.

### Depending on anonymous targets

#### Parts of an anonymous target

An anonymous target is given by a pair of a node and a map mapping the
abstract node-type specifying strings to rule names. So, in the
implementation, these are just two expression pointers (with their
defined notion of equality, i.e., equality of the respective
Merkle-tree hashes). Such a pair of pointers also forms an additional
variant of a name value, referring to such an anonymous target.

It should be noted that such an anonymous target contains all the
information needed to evaluate it in the same way as a regular (named)
target defined by a user-defined rule. It is an analysis error to
analyse an anonymous target for which the rules map has no entry for
the string given as `"node_type"` of the corresponding node.

#### Anonymous targets as additional dependencies

We keep the property that a user can only request named targets. So
anonymous targets have to be requested by other targets. We also keep
the property that other targets are only requested at certain fixed
steps in the evaluation of a target. To still achieve a meaningful use
of anonymous targets, our rule language handles anonymous targets in
the following way.

##### Rules parameter `"anonymous"`

In the rule definition, a parameter `"anonymous"` (with the empty map
as default) is allowed. It is used to define an additional dependency
on anonymous targets. The value has to be a map whose keys are the
additional implicitly defined field names. It is hence a requirement
that the set of keys be disjoint from all other field names (the values
of `"config_fields"`, `"string_fields"`, and `"target_fields"`, as well
as the keys of the `"implicit"` parameter). Another consequence is that
the `"config_transitions"` map may now also have meaningful entries for
the keys of the `"anonymous"` map.
Each value in the map has to be
itself a map, with entries `"target"`, `"provider"`, and `"rule_map"`.

For `"target"`, a single string has to be specified, and the value has
to be a member of the `"target_fields"` list. For `"provider"`, a
single string has to be specified as well. The idea is that the nodes
are collected from that provider of the targets in the specified target
field. For `"rule_map"`, a map has to be specified from strings to rule
names; the latter are evaluated in the context of the rule definition.

###### Example

For generating language bindings for protocol buffers, a rule might
look as follows.

``` jsonc
{ "cc_proto_bindings":
  { "target_fields": ["proto_deps"]
  , "anonymous":
    { "protos":
      { "target": "proto_deps"
      , "provider": "proto"
      , "rule_map": {"proto_library": "cc_proto_library"}
      }
    }
  , "expression": {...}
  }
}
```

##### Evaluation mechanism

The evaluation of a target defined by a user-defined rule is handled as
follows. After the target fields are evaluated as usual, an additional
step is carried out.

For each anonymous-target field, i.e., for each key in the
`"anonymous"` map, a list of anonymous targets is generated from the
corresponding value: take all targets from the specified `"target"`
field in all their specified configuration transitions (they have
already been evaluated) and take the values provided for the specified
`"provider"` key (using the empty list as default). That value has to
be a list of nodes. All the node lists obtained that way are
concatenated. The configuration transition for the respective field
name is evaluated. Those targets are then evaluated for all the
transitioned configurations requested.

In the final evaluation of the defining expression, the
anonymous-target fields are available in the same way as any other
target field. Also, they contribute to the effective configuration in
the same way as regular target fields.
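
To make the node constructors described above concrete, the defining
expression of a rule providing protocol-buffer descriptions might
create nodes roughly as in the following sketch. This is an
illustration only: the variable names (`"name"`, `"dep nodes"`,
`"descriptor artifacts"`) are hypothetical, and we assume the usual
`"RESULT"` and `"singleton_map"` constructs of the expression language.

``` jsonc
{ "type": "ABSTRACT_NODE"
, "node_type": "proto_library"
// the name of the library, as a string field
, "string_fields":
  { "type": "singleton_map"
  , "key": "name"
  , "value": {"type": "var", "name": "name"}
  }
// the (already constructed) nodes of the dependencies
, "target_fields":
  { "type": "singleton_map"
  , "key": "deps"
  , "value": {"type": "var", "name": "dep nodes"}
  }
}
```

The leaves of such a structure would be value nodes, e.g.,
`{"type": "VALUE_NODE", "$1": {"type": "RESULT", "artifacts": {"type":
"var", "name": "descriptor artifacts"}}}`, fixing the description files
themselves.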

diff --git a/doc/concepts/anonymous-targets.org b/doc/concepts/anonymous-targets.org
deleted file mode 100644
index 98d194c7..00000000
--- a/doc/concepts/anonymous-targets.org
+++ /dev/null

diff --git a/doc/concepts/built-in-rules.md b/doc/concepts/built-in-rules.md
new file mode 100644
index 00000000..3672df36
--- /dev/null
+++ b/doc/concepts/built-in-rules.md
@@ -0,0 +1,172 @@

Built-in rules
==============

Targets are defined in `TARGETS` files. Each target file is a single
`JSON` object. If the target name is contained as a key in that object,
the corresponding value defines the target; otherwise the target is
implicitly considered a source file.
The target definition itself
is a `JSON` object as well. The mandatory key `"type"` specifies the
rule defining the target; the meaning of the remaining keys depends on
the rule defining the target.

There are a couple of rules built in, all named by a single string. The
user can define additional rules (and, in fact, we expect the majority
of targets to be defined by user-defined rules); referring to them in a
qualified way (with module) will always refer to those, even if new
built-in rules are added later (as built-in rules will always be named
by a single string only).

The following rules are built in. Built-in rules can have a special
syntax.

`"export"`
----------

The `"export"` rule evaluates a given target in a specified
configuration. More precisely, the field `"target"` has to name a
single target (not a list of targets), the field `"flexible_config"` a
list of strings, treated as variable names, and the field
`"fixed_config"` has to be a map that is taken unevaluated. It is a
requirement that the domain of the `"fixed_config"` and the
`"flexible_config"` be disjoint. The optional fields `"doc"` and
`"config_doc"` can be used to describe the target and the
`"flexible_config"`, respectively.

To evaluate an `"export"` target, first the configuration is restricted
to the `"flexible_config"` and then the union with the `"fixed_config"`
is built. The target specified in `"target"` is then evaluated. It is a
requirement that this target be untainted. The result is the result of
this evaluation; artifacts, runfiles, and provides map are forwarded
unchanged.

The main point of the `"export"` rule is that the relevant part of the
configuration can be determined without having to analyze the target
itself. This makes such targets eligible for target-level caching
(provided the content of the repository as well as all reachable ones
can be determined cheaply). This eligibility is also the reason why it
is good practice to only depend on `"export"` targets of other
repositories.

`"install"`
-----------

The `"install"` rule allows staging artifacts (and runfiles) of other
targets in a different way. More precisely, a new stage (i.e., a map of
artifacts with keys treated as file names) is constructed in the
following way.

The runfiles from all targets in the `"deps"` field are taken; the
`"deps"` field is an evaluated field and has to evaluate to a list of
targets. It is an error if those runfiles conflict.

The `"files"` argument is a special form. It has to be a map, and the
keys are taken as paths. The values are evaluated and have to evaluate
to a single target. That target has to have a single artifact, or no
artifacts and a single runfile. In this way, `"files"` defines a stage;
this stage overlays the runfiles of the `"deps"`, and conflicts are
ignored.

Finally, the `"dirs"` argument has to evaluate to a list of pairs
(i.e., lists of length two) with the first component a target name and
the second component a string, taken as a directory name. For each
entry, both runfiles and artifacts of the specified target are staged
to the specified directory. It is an error if a conflict with the stage
constructed so far occurs.

Both runfiles and artifacts of the `"install"` target are the stage
just described. An `"install"` target always has an empty provides map;
any provided information of the dependencies is discarded.
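
As an illustration of the two rules just described, a `TARGETS` file
might contain entries like the following sketch. All target names
(`"libfoo"`, `"tool"`, `"foo binary"`, `"docs"`) and configuration
variables are made up for this example.

``` jsonc
{ "libfoo exported":
  { "type": "export"
  , "target": "libfoo"
  // variables callers may still set
  , "flexible_config": ["ARCH", "DEBUG"]
  // variables fixed for all consumers
  , "fixed_config": {"OS": "linux"}
  , "doc": ["The exported libfoo library."]
  }
, "staged":
  { "type": "install"
  // the runfiles of these targets form the base stage
  , "deps": ["tool"]
  // stage the single artifact of "foo binary" at path bin/foo
  , "files": {"bin/foo": "foo binary"}
  // stage runfiles and artifacts of "docs" under share/doc
  , "dirs": [["docs", "share/doc"]]
  }
}
```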
+ +`"generic"` +----------- + +The `"generic"` rules allows to define artifacts as the output of an +action. This is mainly useful for ad-hoc constructions; for anything +occurring more often, a proper user-defined rule is usually the better +choice. + +The `"deps"` argument is evaluated and has to evaluate to a list of +target names. The runfiles and artifacts of these targets form the +inputs of the action. Conflicts are not an error and resolved by giving +precedence to the artifacts over the runfiles; conflicts within +artifacts or runfiles are resolved in a latest-wins fashion using the +order of the targets in the evaluated `"deps"` argument. + +The fields `"cmds"`, `"out_dirs"`, `"outs"`, and `"env"` are evaluated +fields where `"cmds"`, `"out_dirs"`, and `"outs"` have to evaluate to a +list of strings, and `"env"` has to evaluate to a map of strings. During +their evaluation, the functions `"out_dirs"`, `"outs"` and `"runfiles"` +can be used to access the logical paths of the directories, artifacts +and runfiles, respectively, of a target specified in `"deps"`. Here, +`"env"` specifies the environment in which the action is carried out. +`"out_dirs"` and `"outs"` define the output directories and files, +respectively, the action has to produce. Since some artifacts are to be +produced, at least one of `"out_dirs"` or `"outs"` must be a non-empty +list of strings. It is an error if one or more paths are present in both +the `"out_dirs"` and `"outs"`. Finally, the strings in `"cmds"` are +extended by a newline character and joined, and command of the action is +interpreting this string by `sh`. + +The artifacts of this target are the outputs (as declared by +`"out_dirs"` and `"outs"`) of this action. Runfiles and provider map are +empty. + +`"file_gen"` +------------ + +The `"file_gen"` rule allows to specify a file with a given content. To +be able to accurately report about file names of artifacts or runfiles +of other targets, they can be specified in the field `"deps"` which has +to evaluate to a list of targets. The names of the artifacts and +runfiles of a target specified in `"deps"` can be accessed through the +functions `"outs"` and `"runfiles"`, respectively, during the evaluation +of the arguments `"name"` and `"data"` which have to evaluate to a +single string. + +Artifacts and runfiles of a `"file_gen"` target are a singleton map with +key the result of evaluating `"name"` and value a (non-executable) file +with content the result of evaluating `"data"`. The provides map is +empty. + +`"tree"` +-------- + +The `"tree"` rule allows to specify a tree out of the artifact stage of +given targets. More precisely, the deps field `"deps"` has to evaluate +to a list of targets. For each target, runfiles and artifacts are +overlayed in an artifacts-win fashion and the union of the resulting +stages is taken; it is an error if conflicts arise in this way. The +resulting stage is transformed into a tree. Both, artifacts and runfiles +of the `"tree"` target are a singleton map with the key the result of +evaluating `"name"` (which has to evaluate to a single string) and value +that tree. + +`"configure"` +------------- + +The `"configure"` rule allows to configure a target with a given +configuration. The field `"target"` is evaluated and the result of the +evaluation must name a single target (not a list). The `"config"` field +is evaluated and must result in a map, which is used as configuration +for the given target. 

The `"configure"` rule uses the given configuration to overlay the
current environment for evaluating the given target, and thereby
performs a configuration transition. It forwards all results
(artifacts/runfiles/provides map) of the configured target to the upper
context. The result of a target that uses this rule is the result of
the target given in the `"target"` field (the configured target).

As a full configuration transition is performed, the same care has to
be taken when using this rule as when writing a configuration
transition in a rule. Typically, this rule is used only at a top-level
target of a project and configures only variables internal to the
project. In any case, when using non-internal targets as dependencies
(i.e., targets that a caller of the `"configure"` target potentially
might use as well), care should be taken that those are only used in
the initial configuration. Such preservation of the configuration is
necessary to avoid conflicts if the targets depended upon are visible
in the `"configure"` target itself, e.g., as a link dependency (which
almost always happens when depending on a library). Even if a
non-internal target depended upon is not visible in the `"configure"`
target itself, requesting it in a modified configuration causes
additional overhead by increasing the target graph and potentially the
action graph.

diff --git a/doc/concepts/built-in-rules.org b/doc/concepts/built-in-rules.org
deleted file mode 100644
index 9463b10c..00000000
--- a/doc/concepts/built-in-rules.org
+++ /dev/null

diff --git a/doc/concepts/cache-pragma.md b/doc/concepts/cache-pragma.md
new file mode 100644
index 00000000..858f2b4f
--- /dev/null
+++ b/doc/concepts/cache-pragma.md
@@ -0,0 +1,134 @@

Action caching pragma
=====================

Introduction: exit code, build failures, and caching
----------------------------------------------------

The exit code of a process is used to signal success or failure of that
process.
By convention, 0 indicates success and any other value
indicates some form of failure.

Our tool expects all build actions to follow this convention. A
non-zero exit code of a regular build action has two consequences.

 - As the action failed, the whole build is aborted and considered
   failed.
 - As such a failed action can never be part of a successful build, it
   is (effectively) not cached.

This non-caching is achieved by re-requesting the action without cache
look-up whenever a failed action is reported from cache.

In particular, for building, we have the property that everything that
does not lead to aborting the build can (and will) be cached. This
property is justified, as we expect build actions to behave in a
functional way.

Test and run actions
--------------------

Tests have a lot of similarity to regular build actions: a process is
run with given inputs, and the results are processed further (e.g., to
create reports on test suites). However, they break the
above-described connection between caching and continuation of the
build: we expect that some tests might be flaky (even though they
shouldn't be, of course) and hence only want to cache successful tests.
Nevertheless, we do want to continue testing after the first test
failure.

Another breakage of the functionality assumption of actions are "run"
actions, i.e., local actions that are executed either because of their
side effect on the host system, or because of their non-deterministic
results (e.g., monitoring some resource). Those actions should never be
cached, but if they fail, the build should be aborted.

Tainting
--------

Targets that, directly or indirectly, depend on non-functional actions
are not regular targets. They are test targets, run targets, benchmark
results, etc.; in any case, they are tainted in some way. When adding
high-level caching of targets, we will only support caching for
untainted targets.

To make everybody aware of their special nature, they are clearly
marked as such: tainted targets not generated by a tainted rule (e.g.,
a test rule) have to explicitly state their taintedness in their
attributes. This declaration also gives a natural way to mark targets
that are technically pure but should still only be used in tests, e.g.,
a mock version of a larger library.

Besides being for tests only, there might be other reasons why a target
might not be fit for general use, e.g., configuration files with
accounts for developer access, or files under restrictive licences. To
avoid having to extend the framework for each new use case, we allow
arbitrary strings as markers for the kind of taintedness of a target.
Of course, a target can be tainted in more than one way.

More precisely, rules can have `"tainted"` as an additional property.
Moreover, `"tainted"` is another reserved keyword for target arguments
(like `"type"` and `"arguments_config"`). In both cases, the value has
to be a list of strings, and the empty list is assumed if not
specified.

A rule is tainted with the set of strings in its `"tainted"` property.
A target is tainted with the union of the set of strings of its
`"tainted"` argument and the set of strings its generating rule is
tainted with.

Every target has to be tainted with (at least) the union of what its
dependencies are tainted with.

For tainted targets, the `analyse`, `build`, and `install` commands
report the set of strings the target is tainted with.
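
For example, a mock version of a library, intended for tests only,
could declare its taintedness explicitly. The following sketch mirrors
the library example used elsewhere in these documents; the rule,
target, and file names are made up.

``` jsonc
{ "libfoo mock":
  { "type": ["@", "rules", "CC", "library"]
  // never to be used outside of tests
  , "tainted": ["test"]
  , "name": ["foo_mock"]
  , "hdrs": ["foo.hpp"]
  , "srcs": ["foo_mock.cpp"]
  }
}
```

Every target depending on `"libfoo mock"` then has to be tainted with
(at least) `"test"` as well.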

### `"may_fail"` and `"no_cache"` properties of `"ACTION"`

The `"ACTION"` function in the defining expression of a rule has two
additional parameters (besides inputs, etc.), `"may_fail"` and
`"no_cache"`. Those are not evaluated and have to be lists of strings
(with the empty list assumed if the respective parameter is not
present). Only strings the defining rule is tainted with may occur in
those lists. If the list is not empty, the corresponding may-fail or
no-cache bit of the action is set. A sketch of an action using these
parameters is given at the end of this file.

For actions with the `"may_fail"` bit set, the optional parameter
`"fail_message"`, with default value `"action failed"`, is evaluated.
That message will be reported if the action returns a non-zero exit
value.

Actions with the no-cache bit set are never cached. If an action with
the may-fail bit set exits with a non-zero exit value, the build is
continued, provided the action nevertheless managed to produce all
expected outputs. We continue to ignore actions with non-zero exit
status from cache.

### Marking of failed artifacts

To simplify finding failures in accumulated reports, our tool keeps
track of artifacts generated by failed actions. More precisely,
artifacts are considered failed if one of the following conditions
applies.

 - Artifacts generated by failed actions are failed.
 - Tree artifacts containing a failed artifact are failed.
 - Artifacts generated by an action taking a failed artifact as input
   are failed.

The identifiers used for built artifacts (including trees) remain
unchanged; in particular, they only describe the contents, not whether
they were obtained in a failed way.

When reporting artifacts, e.g., in the log file, an additional marker
is added to indicate that the artifact is a failed one. After every
`build` or `install` command, if the requested artifacts contain a
failed one, a different exit code is returned.

### The `install-cas` subcommand

A typical workflow for testing is to first run the full test suite and
then only look at the failed tests in more detail. As we don't take
failed actions from cache, installing the output can't be done by
rerunning the same target with `install` instead of `build`. Instead,
the output has to be taken from CAS, using the identifier shown in the
build log. To simplify this workflow, there is the `install-cas`
subcommand that installs a CAS entry, identified by the identifier as
shown in the log, to a given location or (if no location is specified)
to `stdout`.
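
Putting the pragma parameters described above together, the `"ACTION"`
call in the defining expression of a test rule might look roughly as
follows. This is a sketch: the input variable, command, and output
names are hypothetical, and we assume the rule itself is tainted with
`"test"`, as only then may that string occur in `"may_fail"` and
`"no_cache"`.

``` jsonc
{ "type": "ACTION"
, "inputs": {"type": "var", "name": "test inputs"}
// expected outputs; on a non-zero exit value, the build continues
// only if all of them were still produced
, "outs": ["result", "stdout", "stderr"]
, "cmd": ["sh", "./run_test.sh"]
// a failing test does not abort the build ...
, "may_fail": ["test"]
// ... and runs of this action are never served from cache
, "no_cache": ["test"]
, "fail_message": "test suite failed"
}
```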
-
-Actions with the no-cache bit set are never cached. If an action
-with the may-fail bit set exits with non-zero exit value, the build
-is continued if the action nevertheless managed to produce all
-expected outputs. We continue to ignore actions with non-zero exit
-status from cache.
-
-*** Marking of failed artifacts
-
-To simplify finding failures in accumulated reports, our tool
-keeps track of artifacts generated by failed actions. More
-precisely, artifacts are considered failed if one of the following
-conditions applies.
-- Artifacts generated by failed actions are failed.
-- Tree artifacts containing a failed artifact are failed.
-- Artifacts generated by an action taking a failed artifact as
-  input are failed.
-The identifiers used for built artifacts (including trees) remain
-unchanged; in particular, they will only describe the contents and
-not if they were obtained in a failed way.
-
-When reporting artifacts, e.g., in the log file, an additional marker
-is added to indicate that the artifact is a failed one. After every
-~build~ or ~install~ command, if the requested artifacts contain
-failed one, a different exit code is returned.
-
-*** The ~install-cas~ subcommand
-
-A typical workflow for testing is to first run the full test suite
-and then only look at the failed tests in more details. As we don't
-take failed actions from cache, installing the output can't be
-done by rerunning the same target as ~install~ instead of ~build~.
-Instead, the output has to be taken from CAS using the identifier
-shown in the build log. To simplify this workflow, there is the
-~install-cas~ subcommand that installs a CAS entry, identified by
-the identifier as shown in the log to a given location or (if no
-location is specified) to ~stdout~.
diff --git a/doc/concepts/configuration.md b/doc/concepts/configuration.md
new file mode 100644
index 00000000..743ed41e
--- /dev/null
+++ b/doc/concepts/configuration.md
@@ -0,0 +1,115 @@
+Configuration
+=============
+
+Targets describe abstract concepts like "library". Depending on
+requirements, a library might manifest itself in different ways. For
+example,
+
+ - it can be built for various target architectures,
+ - it can have the requirement to produce position-independent code,
+ - it can be a special build for debugging, profiling, etc.
+
+So, a target (like a library described by header files, source files,
+dependencies, etc) has some additional input. As those inputs are
+typically of a global nature (e.g., a profiling build usually wants all
+involved libraries to be built for profiling), this additional input,
+called "configuration", follows the same approach as the `UNIX`
+environment: it is a global collection of key-value pairs and every
+target picks what it needs.
+
+Top-level configuration
+-----------------------
+
+The configuration is a `JSON` object. The configuration for the target
+requested can be specified on the command line using the `-c` option;
+its argument is a file name and that file is supposed to contain the
+`JSON` object.
+
+Propagation
+-----------
+
+Rules and target definitions have to declare which parts of the
+configuration they want to have access to. The (essentially) full
+configuration, however, is passed on to the dependencies; in this way, a
+target not using a part of the configuration can still depend on it, if
+one of its dependencies does.
+
+### Rules configuration and configuration transitions
+
+As part of the definition of a rule, it specifies a set `"config_vars"`
+of variables.
During the evaluation of the rule, the configuration
+restricted to those variables (variables unset in the original
+configuration are set to `null`) is used as environment.
+
+Additionally, the rule can request that certain targets be evaluated in
+a modified configuration by specifying `"config_transitions"`
+accordingly. Typically, this is done when a tool is required during the
+build; then this tool has to be built for the architecture on which the
+build is carried out and not the target architecture. Those tools often
+are `"implicit"` dependencies, i.e., dependencies that every target
+defined by that rule has, without the need to specify it in the target
+definition.
+
+### Target configuration
+
+Additionally (and independently of the configuration-dependency of the
+rule), the target definition itself can depend on the configuration.
+This can happen if a debug version of a library has additional
+dependencies (e.g., for structured debug logs).
+
+If such a configuration-dependency is needed, the reserved keyword
+`"arguments_config"` is used to specify a set of variables (if unset,
+the empty set is assumed; this should be the usual case). The
+environment in which all arguments of the target definition are
+evaluated is the configuration restricted to those variables (again,
+with values unset in the original configuration set to `null`).
+
+For example, a library where the debug version has an additional
+dependency could look as follows.
+
+``` jsonc
+{ "libfoo":
+  { "type": ["@", "rules", "CC", "library"]
+  , "arguments_config": ["DEBUG"]
+  , "name": ["foo"]
+  , "hdrs": ["foo.hpp"]
+  , "srcs": ["foo.cpp"]
+  , "local defines":
+    { "type": "if"
+    , "cond": {"type": "var", "name": "DEBUG"}
+    , "then": ["DEBUG"]
+    }
+  , "deps":
+    { "type": "++"
+    , "$1":
+      [ ["libbar", "libbaz"]
+      , { "type": "if"
+        , "cond": {"type": "var", "name": "DEBUG"}
+        , "then": ["libdebuglog"]
+        }
+      ]
+    }
+  }
+}
+```
+
+Effective configuration
+-----------------------
+
+A target is influenced by the configuration through
+
+ - the configuration dependency of the target definition, as specified in
+   `"arguments_config"`,
+ - the configuration dependency of the underlying rule, as specified in
+   the rule's `"config_vars"` field, and
+ - the configuration dependency of target dependencies, not taking into
+   account values explicitly set by a configuration transition.
+
+Restricting the configuration to this collection of variables yields the
+effective configuration for that target-configuration pair. The
+`--dump-targets` option of the `analyse` subcommand makes it possible to
+inspect the effective configurations of all involved targets. Due to
+configuration transitions, a target can be analyzed in more than one
+configuration, e.g., if a library is used both for a tool needed during
+the build and for the final binary cross-compiled for a different
+target architecture.
diff --git a/doc/concepts/configuration.org b/doc/concepts/configuration.org
deleted file mode 100644
index 4217d22d..00000000
--- a/doc/concepts/configuration.org
+++ /dev/null
@@ -1,107 +0,0 @@
-* Configuration
-
-Targets describe abstract concepts like "library". Depending on
-requirements, a library might manifest itself in different ways.
-For example,
-- it can be built for various target architectures,
-- it can have the requirement to produce position-independent code,
-- it can be a special build for debugging, profiling, etc.
- -So, a target (like a library described by header files, source files, -dependencies, etc) has some additional input. As those inputs are -typically of a global nature (e.g., a profiling build usually wants -all involved libraries to be built for profiling), this additional -input, called "configuration" follows the same approach as the -~UNIX~ environment: it is a global collection of key-value pairs -and every target picks, what it needs. - -** Top-level configuration - -The configuration is a ~JSON~ object. The configuration for the -target requested can be specified on the command line using the -~-c~ option; its argument is a file name and that file is supposed -to contain the ~JSON~ object. - -** Propagation - -Rules and target definitions have to declare which parts of the -configuration they want to have access to. The (essentially) full -configuration, however, is passed on to the dependencies; in this way, -a target not using a part of the configuration can still depend on -it, if one of its dependencies does. - -*** Rules configuration and configuration transitions - -As part of the definition of a rule, it specifies a set ~"config_vars"~ -of variables. During the evaluation of the rule, the configuration -restricted to those variables (variables unset in the original -configuration are set to ~null~) is used as environment. - -Additionally, the rule can request that certain targets be evaluated -in a modified configuration by specifying ~"config_transitions"~ -accordingly. Typically, this is done when a tool is required during -the build; then this tool has to be built for the architecture on -which the build is carried out and not the target architecture. Those -tools often are ~"implicit"~ dependencies, i.e., dependencies that -every target defined by that rule has, without the need to specify -it in the target definition. - -*** Target configuration - -Additionally (and independently of the configuration-dependency -of the rule), the target definition itself can depend on the -configuration. This can happen, if a debug version of a library -has additional dependencies (e.g., for structured debug logs). - -If such a configuration-dependency is needed, the reserved key -word ~"arguments_config"~ is used to specify a set of variables (if -unset, the empty set is assumed; this should be the usual case). -The environment in which all arguments of the target definition are -evaluated is the configuration restricted to those variables (again, -with values unset in the original configuration set to ~null~). - -For example, a library where the debug version has an additional -dependency could look as follows. 
-#+BEGIN_SRC
-{ "libfoo":
-  { "type": ["@", "rules", "CC", "library"]
-  , "arguments_config": ["DEBUG"]
-  , "name": ["foo"]
-  , "hdrs": ["foo.hpp"]
-  , "srcs": ["foo.cpp"]
-  , "local defines":
-    { "type": "if"
-    , "cond": {"type": "var", "name": "DEBUG"}
-    , "then": ["DEBUG"]
-    }
-  , "deps":
-    { "type": "++"
-    , "$1":
-      [ ["libbar", "libbaz"]
-      , { "type": "if"
-        , "cond": {"type": "var", "name": "DEBUG"}
-        , "then": ["libdebuglog"]
-        }
-      ]
-    }
-  }
-}
-#+END_SRC
-
-** Effective configuration
-
-A target is influenced by the configuration through
-- the configuration dependency of target definition, as specified
-  in ~"arguments_config"~,
-- the configuration dependency of the underlying rule, as specified
-  in the rule's ~"config_vars"~ field, and
-- the configuration dependency of target dependencies, not taking
-  into account values explicitly set by a configuration transition.
-Restricting the configuration to this collection of variables yields
-the effective configuration for that target-configuration pair.
-The ~--dump-targets~ option of the ~analyse~ subcommand allows to
-inspect the effective configurations of all involved targets. Due to
-configuration transitions, a target can be analyzed in more than one
-configuration, e.g., if a library is used both, for a tool needed
-during the build, as well as for the final binary cross-compiled
-for a different target architecture.
diff --git a/doc/concepts/doc-strings.md b/doc/concepts/doc-strings.md
new file mode 100644
index 00000000..a1a156ac
--- /dev/null
+++ b/doc/concepts/doc-strings.md
@@ -0,0 +1,152 @@
+Documentation of build rules, expressions, etc
+==============================================
+
+Build rules can attain a non-trivial complexity. This is especially true
+if several rules have to exist for slightly different use cases, or if
+the rule supports many different fields. Therefore, documentation of the
+rules (and also of expressions, for the benefit of rule authors) is
+desirable.
+
+Experience shows that documentation that is not versioned together with
+the code it refers to quickly gets out of date, or lost. Therefore, we
+add documentation directly into the respective definitions.
+
+Multi-line strings in JSON
+--------------------------
+
+In JSON, the newline character is encoded specially and not taken
+literally; also, there is no implicit joining of string literals. So,
+in order to also have documentation readable in the JSON representation
+itself, instead of single strings, we take arrays of strings, with the
+understanding that they describe the strings obtained by joining the
+entries with newline characters.
+
+Documentation is optional
+-------------------------
+
+While documentation is highly recommended, it still remains optional.
+Therefore, when in the following we state that a key is for a list or a
+map, it is always implied that it may be absent; in this case, the empty
+array or the empty map is taken as default, respectively.
+
+Rules
+-----
+
+Each rule is described as a JSON object with a fixed set of keys. So
+having fixed keys for documentation does not cause conflicts. More
+precisely, the keys `doc`, `field_doc`, `config_doc`, `artifacts_doc`,
+`runfiles_doc`, and `provides_doc` are reserved for documentation. Here,
+`doc` has to be a list of strings describing the rule in general.
+`field_doc` has to be a map from (some of) the field names to an array
+of strings, containing additional information on that particular field.
+`config_doc` has to be a map from (some of) the config variables to an
+array of strings describing the respective variable. `artifacts_doc` is
+an array of strings describing the artifacts produced by the rule.
+`runfiles_doc` is an array of strings describing the runfiles produced
+by this rule. Finally, `provides_doc` is a map describing (some of) the
+providers of that rule; as opposed to fields or config variables, there
+is no authoritative list of providers given elsewhere in the rule, so it
+is up to the rule author to give an accurate documentation of the
+provided data.
+
+### Example
+
+``` jsonc
+{ "library":
+  { "doc":
+    [ "A C library"
+    , ""
+    , "Define a library that can be used to be statically linked to a"
+    , "binary. To do so, the target can simply be specified in the deps"
+    , "field of a binary; it can also be a dependency of another library"
+    , "and the information is then propagated to the corresponding binary."
+    ]
+  , "string_fields": ["name"]
+  , "target_fields": ["srcs", "hdrs", "private-hdrs", "deps"]
+  , "field_doc":
+    { "name":
+      ["The base name of the library (i.e., the name without the leading lib)."]
+    , "srcs": ["The source files (i.e., *.c files) of the library."]
+    , "hdrs":
+      [ "The public header files of this library. Targets depending on"
+      , "this library will have access to those header files"
+      ]
+    , "private-hdrs":
+      [ "Additional internal header files that are used when compiling"
+      , "the source files. Targets depending on this library have no access"
+      , "to those header files."
+      ]
+    , "deps":
+      [ "Any other libraries that this library uses. The dependency is"
+      , "also propagated (via the link-deps provider) to any consumers of"
+      , "this target. So only direct dependencies should be declared."
+      ]
+    }
+  , "config_vars": ["CC"]
+  , "config_doc":
+    { "CC":
+      [ "single string, defaulting to \"cc\", specifying the compiler"
+      , "to be used. The compiler is also used to launch the preprocessor."
+      ]
+    }
+  , "artifacts_doc":
+    ["The actual library (libname.a) staged in the specified directory"]
+  , "runfiles_doc": ["The public headers of this library"]
+  , "provides_doc":
+    { "compile-deps":
+      [ "Map of artifacts specifying any additional files that, besides the runfiles,"
+      , "have to be present in compile actions of targets depending on this library"
+      ]
+    , "link-deps":
+      [ "Map of artifacts specifying any additional files that, besides the artifacts,"
+      , "have to be present in link actions of targets depending on this library"
+      ]
+    , "link-args":
+      [ "List of strings that have to be added to the command line for linking actions"
+      , "in targets depending on this library"
+      ]
+    }
+  , "expression": { ... }
+  }
+}
```

+
+Expressions
+-----------
+
+Expressions are also described by a JSON object with a fixed set of
+keys. Here we use the keys `doc` and `vars_doc` for documentation, where
+`doc` is an array of strings describing the expression as a whole and
+`vars_doc` is a map from (some of) the `vars` to an array of strings
+describing this variable.
+
+Export targets
+--------------
+
+As export targets play the role of interfaces between repositories, it
+is important that they be documented as well. Again, export targets are
+described as a JSON object with a fixed set of keys and we use the keys
+`doc` and `config_doc` for documentation. Here `doc` is an array of
+strings describing the target in general and `config_doc` is a map
+from (some of) the variables of the `flexible_config` to an array of
+strings describing this parameter.
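+
+As a sketch, a documented export target could look as follows; the
+target and variable names here are purely illustrative.
+
+``` jsonc
+{ "exported-lib":
+  { "type": "export"
+  , "target": "libfoo"
+  , "flexible_config": ["CC", "DEBUG"]
+  , "doc": ["The public interface of this repository: the foo library."]
+  , "config_doc":
+    { "CC": ["The C compiler to be used."]
+    , "DEBUG": ["If non-null, build the debug version of the library."]
+    }
+  }
+}
```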
+
+Presentation of the documentation
+---------------------------------
+
+As all documentation consists of values (that need not be evaluated) in
+JSON objects, it is easy to write tools rendering documentation pages for
+rules, etc, and we expect those tools to be written independently.
+Nevertheless, for the benefit of developers using rules from git-tree
+roots that might not be checked out, there is a subcommand `describe`
+which takes a target specification like the `analyse` command, looks up
+the corresponding rule and describes it fully, i.e., prints in
+human-readable form
+
+ - the documentation for the rule
+ - all the fields available for that rule together with
+   - their type (`string_field`, `target_field`, etc), and
+   - their documentation,
+ - all the configuration variables of the rule with their documentation
+   (if given), and
+ - the documented providers.
diff --git a/doc/concepts/doc-strings.org b/doc/concepts/doc-strings.org
deleted file mode 100644
index d9a94dc5..00000000
--- a/doc/concepts/doc-strings.org
+++ /dev/null
@@ -1,145 +0,0 @@
-* Documentation of build rules, expressions, etc
-
-Build rules can obtain a non-trivial complexity. This is especially
-true if several rules have to exist for slightly different use
-cases, or if the rule supports many different fields. Therefore,
-documentation of the rules (and also expressions for the benefit
-of rule authors) is desirable.
-
-Experience shows that documentation that is not versioned together with
-the code it refers to quickly gets out of date, or lost. Therefore,
-we add documentation directly into the respective definitions.
-
-** Multi-line strings in JSON
-
-In JSON, the newline character is encoded specially and not taken
-literally; also, there is not implicit joining of string literals.
-So, in order to also have documentation readable in the JSON
-representation itself, instead of single strings, we take arrays
-of strings, with the understanding that they describe the strings
-obtained by joining the entries with newline characters.
-
-** Documentation is optional
-
-While documentation is highly recommended, it still remains optional.
-Therefore, when in the following we state that a key is for a list
-or a map, it is always implied that it may be absent; in this case,
-the empty array or the empty map is taken as default, respectively.
-
-** Rules
-
-Each rule is described as a JSON object with a fixed set of keys.
-So having fixed keys for documentation does not cause conflicts.
-More precisely, the keys ~doc~, ~field doc~, ~config_doc~,
-~artifacts_doc~, ~runfiles_doc~, and ~provides_doc~
-are reserved for documentation. Here, ~doc~ has to be a list of
-strings describing the rule in general. ~field doc~ has to be a map
-from (some of) the field names to an array of strings, containing
-additional information on that particular field. ~config_doc~ has
-to be a map from (some of) the config variables to an array of
-strings describing the respective variable. ~artifacts_doc~ is
-an array of strings describing the artifacts produced by the rule.
-~runfiles_doc~ is an array of strings describing the runfiles produced
-by this rule. Finally, ~provides_doc~ is a map describing (some
-of) the providers by that rule; as opposed to fields or config
-variables there is no authoritative list of providers given elsewhere
-in the rule, so it is up to the rule author to give an accurate
-documentation on the provided data.
- -*** Example - -#+BEGIN_SRC -{ "library": - { "doc": - [ "A C library" - , "" - , "Define a library that can be used to be statically linked to a" - , "binary. To do so, the target can simply be specified in the deps" - , "field of a binary; it can also be a dependency of another library" - , "and the information is then propagated to the corresponding binary." - ] - , "string_fields": ["name"] - , "target_fields": ["srcs", "hdrs", "private-hdrs", "deps"] - , "field_doc": - { "name": - ["The base name of the library (i.e., the name without the leading lib)."] - , "srcs": ["The source files (i.e., *.c files) of the library."] - , "hdrs": - [ "The public header files of this library. Targets depending on" - , "this library will have access to those header files" - ] - , "private-hdrs": - [ "Additional internal header files that are used when compiling" - , "the source files. Targets depending on this library have no access" - , "to those header files." - ] - , "deps": - [ "Any other libraries that this library uses. The dependency is" - , "also propagated (via the link-deps provider) to any consumers of" - , "this target. So only direct dependencies should be declared." - ] - } - , "config_vars": ["CC"] - , "config_doc": - { "CC": - [ "single string. defaulting to \"cc\", specifying the compiler" - , "to be used. The compiler is also used to launch the preprocessor." - ] - } - , "artifacts_doc": - ["The actual library (libname.a) staged in the specified directory"] - , "runfiles_doc": ["The public headers of this library"] - , "provides_doc": - { "compile-deps": - [ "Map of artifacts specifying any additional files that, besides the runfiles," - , "have to be present in compile actions of targets depending on this library" - ] - , "link-deps": - [ "Map of artifacts specifying any additional files that, besides the artifacts," - , "have to be present in a link actions of targets depending on this library" - ] - , "link-args": - [ "List of strings that have to be added to the command line for linking actions" - , "in targets depending on this library" - ] - } - , "expression": { ... } - } -} -#+END_SRC - -** Expressions - -Expressions are also described by a JSON object with a fixed set of -keys. Here we use the keys ~doc~ and ~vars_doc~ for documentation, -where ~doc~ is an array of strings describing the expression as a -whole and ~vars_doc~ is a map from (some of) the ~vars~ to an array -of strings describing this variable. - -** Export targets - -As export targets play the role of interfaces between repositories, -it is important that they be documented as well. Again, export targets -are described as a JSON object with fixed set of keys amd we use -the keys ~doc~ and ~config_doc~ for documentation. Here ~doc~ is an -array of strings describing the targeted in general and ~config_doc~ -is a map from (some of) the variables of the ~flexible_config~ to -an array of strings describing this parameter. - -** Presentation of the documentation - -As all documentation are just values (that need not be evaluated) -in JSON objects, it is easy to write tool rendering documentation -pages for rules, etc, and we expect those tools to be written -independently. 
Nevertheless, for the benefit of developers using
-rules from a git-tree roots that might not be checked out, there is
-a subcommand ~describe~ which takes a target specification like the
-~analyze~ command, looks up the corresponding rule and describes
-it fully, i.e., prints in human-readable form
-- the documentation for the rule
-- all the fields available for that rule together with
-  - their type (~string_field~, ~target_field~, etc), and
-  - their documentation,
-- all the configuration variables of the rule with their
-  documentation (if given), and
-- the documented providers.
diff --git a/doc/concepts/expressions.md b/doc/concepts/expressions.md
new file mode 100644
index 00000000..9e8a8f36
--- /dev/null
+++ b/doc/concepts/expressions.md
@@ -0,0 +1,368 @@
+Expression language
+===================
+
+At various places, in particular in order to define a rule, we need a
+restricted form of functional computation. This is achieved by our
+expression language.
+
+Syntax
+------
+
+All expressions are given by JSON values. One can think of expressions
+as abstract syntax trees serialized to JSON; nevertheless, the precise
+semantics is given by the evaluation mechanism described later.
+
+Semantic Values
+---------------
+
+Expressions evaluate to semantic values. Semantic values are JSON values
+extended by additional atomic values for build-internal values like
+artifacts, names, etc.
+
+### Truth
+
+Every value can be treated as a boolean condition. We follow a
+convention similar to `LISP`, considering everything true that is not
+empty. More precisely, the values
+
+ - `null`,
+ - `false`,
+ - `0`,
+ - `""`,
+ - the empty map, and
+ - the empty list
+
+are considered logically false. All other values are logically true.
+
+Evaluation
+----------
+
+The evaluation follows a strict, functional, call-by-value evaluation
+mechanism; the precise evaluation is as follows.
+
+ - Atomic values (`null`, booleans, strings, numbers) evaluate to
+   themselves.
+ - For lists, each entry is evaluated in the order they occur in the
+   list; the result of the evaluation is the list of the results.
+ - For JSON objects (which can be understood as maps, or dicts), the key
+   `"type"` has to be present and has to be a literal string. That
+   string determines the syntactical construct (sloppily also referred
+   to as "function") the object represents, and the remaining
+   evaluation depends on the syntactical construct. The syntactical
+   construct has to be either one of the built-in ones or a special
+   function available in the given context (e.g., `"ACTION"` within the
+   expression defining a rule).
+
+All evaluation happens in an "environment" which is a map from strings
+to semantic values.
+
+### Built-in syntactical constructs
+
+#### Special forms
+
+##### Variables: `"var"`
+
+There has to be a key `"name"`; its value (i.e., the expression in the
+object at that key) has to be a literal string, taken as the
+variable name. If the variable name is in the domain of the
+environment and the value of the environment at the variable
+name is non-`null`, then the result of the evaluation is the
+value of the variable in the environment.
+
+Otherwise, the key `"default"` is taken (if present, otherwise
+the value `null` is taken as default for `"default"`) and
+evaluated. The value obtained this way is the result of the
+evaluation.
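+
+As a small illustration, the following expression evaluates to the
+value of the variable `CC` in the environment, falling back to the
+literal string `"cc"` if `CC` is unset or set to `null` (the variable
+name is, of course, just an example).
+
+``` jsonc
+{"type": "var", "name": "CC", "default": "cc"}
```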
+
+##### Sequential binding: `"let*"`
+
+The key `"bindings"` (default `[]`) has to be (syntactically) a
+list of pairs (i.e., lists of length two) with the first
+component a literal string.
+
+For each pair in `"bindings"` the second component is evaluated,
+in the order the pairs occur. After each evaluation, a new
+environment is taken for the subsequent evaluations; the new
+environment is like the old one but amended at the position
+given by the first component of the pair to now map to the value
+just obtained.
+
+Finally, the `"body"` is evaluated in the final environment
+(after evaluating all binding entries) and the result of
+evaluating the `"body"` is the value for the whole `"let*"`
+expression.
+
+##### Environment Map: `"env"`
+
+Creates a map from selected environment variables.
+
+The key `"vars"` (default `[]`) has to be a list of literal
+strings referring to the variable names that should be included
+in the produced map. This field is not evaluated. This
+expression is only for convenience and does not give new
+expression power. It is equivalent to, but a lot shorter than,
+multiple `singleton_map` expressions combined with `map_union`.
+
+##### Conditionals
+
+###### Binary conditional: `"if"`
+
+First the key `"cond"` is evaluated. If it evaluates to a
+value that is logically true, then the key `"then"` is
+evaluated and its value is the result of the evaluation.
+Otherwise, the key `"else"` (if present, otherwise `[]` is
+taken as default) is evaluated and the obtained value is the
+result of the evaluation.
+
+###### Sequential conditional: `"cond"`
+
+The key `"cond"` has to be a list of pairs. In the order of
+the list, the first components of the pairs are evaluated,
+until one evaluates to a value that is logically true. For
+that pair, the second component is evaluated and the result
+of this evaluation is the result of the `"cond"` expression.
+
+If all first components evaluate to a value that is
+logically false, the result of the expression is the result
+of evaluating the key `"default"` (defaulting to `[]`).
+
+###### String case distinction: `"case"`
+
+If the key `"case"` is present, it has to be a map (an
+"object", in JSON's terminology). In this case, the key
+`"expr"` is evaluated; it has to evaluate to a string. If
+the value is a key in the `"case"` map, the expression at
+this key is evaluated and the result of that evaluation is
+the value for the `"case"` expression.
+
+Otherwise (i.e., if `"case"` is absent or `"expr"` evaluates
+to a string that is not a key in `"case"`), the key
+`"default"` (with default `[]`) is evaluated and this gives
+the result of the `"case"` expression.
+
+###### Sequential case distinction on arbitrary values: `"case*"`
+
+If the key `"case"` is present, it has to be a list of
+pairs. In this case, the key `"expr"` is evaluated. It is an
+error if that evaluates to a name-containing value. The
+result of that evaluation is sequentially compared to the
+evaluation of the first components of the `"case"` list
+until an equal value is found. In this case, the evaluation
+of the second component of the pair is the value of the
+`"case*"` expression.
+
+If the `"case"` key is absent, or no equality is found, the
+result of the `"case*"` expression is the result of
+evaluating the `"default"` key (with default `[]`).
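+
+As an illustration of the conditionals, the following sketch selects
+compile flags depending on an architecture variable; the variable name
+`ARCH` and the flag lists are chosen purely for the example. The pair
+with first component `null` matches exactly when `ARCH` is unset, and
+the default uses the `"fail"` construct described later in this
+document.
+
+``` jsonc
+{ "type": "case*"
+, "expr": {"type": "var", "name": "ARCH"}
+, "case": [["x86_64", ["-m64"]], ["x86", ["-m32"]], [null, []]]
+, "default": {"type": "fail", "msg": "unsupported architecture"}
+}
```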
+ +##### Conjunction and disjunction: `"and"` and `"or"` + +For conjunction, if the key `"$1"` (with default `[]`) is +syntactically a list, its entries are sequentially evaluated +until a logically false value is found; in that case, the result +is `false`, otherwise true. If the key `"$1"` has a different +shape, it is evaluated and has to evaluate to a list. The result +is the conjunction of the logical values of the entries. In +particular, `{"type": "and"}` evaluates to `true`. + +For disjunction, the evaluation mechanism is the same, but the +truth values and connective are taken dually. So, `"and"` and +`"or"` are logical conjunction and disjunction, respectively, +using short-cut evaluation if syntactically admissible (i.e., if +the argument is syntactically a list). + +##### Mapping + +###### Mapping over lists: `"foreach"` + +First the key `"range"` is evaluated and has to evaluate to +a list. For each entry of this list, the expression `"body"` +is evaluated in an environment that is obtained from the +original one by setting the value for the variable specified +at the key `"var"` (which has to be a literal string, +default `"_"`) to that value. The result is the list of +those evaluation results. + +###### Mapping over maps: `"foreach_map"` + +Here, `"range"` has to evaluate to a map. For each entry (in +lexicographic order (according to native byte order) by +keys), the expression `"body"` is evaluated in an +environment obtained from the original one by setting the +variables specified at `"var_key"` and `"var_val"` (literal +strings, default values `"_"` and `"$_"`, respectively). The +result of the evaluation is the list of those values. + +##### Folding: `"foldl"` + +The key `"range"` is evaluated and has to evaluate to a list. +Starting from the result of evaluating `"start"` (default `[]`) +a new value is obtained for each entry of the range list by +evaluating `"body"` in an environment obtained from the original +by binding the variable specified by `"var"` (literal string, +default `"_"`) to the list entry and the variable specified by +`"accum_var"` (literal string, default value `"$1"`) to the old +value. The result is the last value obtained. + +#### Regular functions + +First `"$1"` is evaluated; for binary functions `"$2"` is evaluated +next. For functions that accept keyword arguments, those are +evaluated as well. Finally the function is applied to this (or +those) argument(s) to obtain the final result. + +##### Unary functions + + - `"nub_right"` The argument has to be a list. It is an error + if that list contains (directly or indirectly) a name. The + result is the input list, except that for all duplicate + values, all but the rightmost occurrence is removed. + + - `"basename"` The argument has to be a string. This string is + interpreted as a path, and the file name thereof is + returned. + + - `"keys"` The argument has to be a map. The result is the + list of keys of this map, in lexicographical order + (according to native byte order). + + - `"values"` The argument has to be a map. The result are the + values of that map, ordered by the corresponding keys + (lexicographically according to native byte order). + + - `"range"` The argument is interpreted as a non-negative + integer as follows. Non-negative numbers are rounded to the + nearest integer; strings have to be the decimal + representation of an integer; everything else is considered + zero. 
The result is a list of the given length, consisting
+   of the decimal representations of the first non-negative
+   integers. For example, `{"type": "range",
+   "$1": "3"}` evaluates to `["0", "1", "2"]`.
+
+ - `"enumerate"` The argument has to be a list. The result is a
+   map containing one entry for each element of the list. The
+   key is the decimal representation of the position in the
+   list (starting from `0`), padded with leading zeros to
+   length at least 10. The value is the element. The padding is
+   chosen in such a way that iterating over the resulting map
+   (which happens in lexicographic order of the keys) has the
+   same iteration order as the list for all lists indexable by
+   32-bit integers.
+
+ - `"++"` The argument has to be a list of lists. The result is
+   the concatenation of those lists.
+
+ - `"map_union"` The argument has to be a list of maps. The
+   result is a map containing as keys the union of the keys of
+   the maps in that list. For each key, the value is the value
+   of that key in the last map in the list that contains that
+   key.
+
+ - `"join_cmd"` The argument has to be a list of strings. A
+   single string is returned that quotes the original vector in
+   a way understandable by a POSIX shell. As the command for an
+   action is directly given by an argument vector, `"join_cmd"`
+   is typically only used for generated scripts.
+
+ - `"json_encode"` The result is a single string that is the
+   canonical JSON encoding of the argument (with minimal white
+   space); all atomic values that are not part of JSON (i.e.,
+   the added atomic values to represent build-internal values)
+   are serialized as `null`.
+
+##### Unary functions with keyword arguments
+
+ - `"change_ending"` The argument has to be a string,
+   interpreted as a path. The ending is replaced by the value of
+   the keyword argument `"ending"` (a string, default `""`).
+   For example, `{"type":
+   "change_ending", "$1": "foo/bar.c", "ending": ".o"}`
+   evaluates to `"foo/bar.o"`.
+
+ - `"join"` The argument has to be a list of strings. The
+   return value is the concatenation of those strings,
+   separated by the specified `"separator"` (a string,
+   default `""`).
+
+ - `"escape_chars"` Prefix, in the argument, every character
+   occurring in `"chars"` (a string, default `""`) with
+   `"escape_prefix"` (a string, default `"\"`).
+
+ - `"to_subdir"` The argument has to be a map (not necessarily
+   of artifacts). The keys as well as the `"subdir"` (string,
+   default `"."`) argument are interpreted as paths and keys
+   are replaced by the path concatenation of those two paths.
+   If the optional argument `"flat"` (default `false`)
+   evaluates to a true value, the keys are instead replaced by
+   the path concatenation of the `"subdir"` argument and the
+   base name of the old key. It is an error if conflicts occur
+   in this way; in case of such a user error, the argument
+   `"msg"` is also evaluated and the result of that evaluation
+   reported in the error message. Note that conflicts can also
+   occur in non-flat staging if two keys are different as
+   strings, but name the same path (like `"foo.txt"` and
+   `"./foo.txt"`), and are assigned different values. It also
+   is an error if the values for keys in conflicting positions
+   are name-containing.
+
+##### Binary functions
+
+ - `"=="` The result is `true` if the arguments are equal,
+   `false` otherwise. It is an error if one of the arguments
+   is a name-containing value.
+
+ - `"concat_target_name"` This function is only present to
+   simplify transitions from some other build systems and
+   normally not used outside code generated by transition
+   tools. The second argument has to be a string or a list of
+   strings (in the latter case, it is treated as strings by
+   concatenating the entries). If the first argument is a
+   string, the result is the concatenation of those two
+   strings. If the first argument is a list of strings, the
+   result is that list with the second argument concatenated to
+   the last entry of that list (if any).
+
+##### Other functions
+
+ - `"empty_map"` This function takes no arguments and always
+   returns an empty map.
+
+ - `"singleton_map"` This function takes two keyword arguments,
+   `"key"` and `"value"`, and returns a map with one entry,
+   mapping the given key to the given value.
+
+ - `"lookup"` This function takes two keyword arguments,
+   `"key"` and `"map"`. The `"key"` argument has to evaluate to
+   a string and the `"map"` argument has to evaluate to a map.
+   If that map contains the given key and the corresponding
+   value is non-`null`, the value is returned. Otherwise the
+   `"default"` argument (with default `null`) is evaluated and
+   returned.
+
+#### Constructs related to reporting of user errors
+
+Normally, if an error occurs during the evaluation, the error is
+reported together with a stack trace. This, however, might not be
+the most informative way to present a problem to the user,
+especially if the underlying problem is a proper user error, e.g.,
+in rule usage (leaving out mandatory arguments, violating semantic
+prerequisites, etc). To allow proper error reporting, the following
+functions are available. All of them have an optional argument
+`"msg"` that is evaluated (only) in case of error and the result of
+that evaluation included in the error message presented to the user.
+
+ - `"fail"` Evaluation of this function unconditionally fails.
+
+ - `"context"` This function is only there to provide additional
+   information in case of error. Otherwise it is the identity
+   function (a unary function, i.e., the result of the evaluation
+   is the result of evaluating the argument `"$1"`).
+
+ - `"assert_non_empty"` Evaluate the argument (given by the
+   parameter `"$1"`). If it evaluates to a non-empty string, map,
+   or list, return the result of the evaluation. Otherwise fail.
+
+ - `"disjoint_map_union"` Like `"map_union"`, but it is an error if
+   two (or more) maps contain the same key but map it to different
+   values. It is also an error if the argument is a name-containing
+   value.
diff --git a/doc/concepts/expressions.org b/doc/concepts/expressions.org
deleted file mode 100644
index ac66e878..00000000
--- a/doc/concepts/expressions.org
+++ /dev/null
@@ -1,344 +0,0 @@
-* Expression language
-
-At various places, in particular in order to define a rule, we need
-a restricted form of functional computation. This is achieved by
-our expression language.
-
-** Syntax
-
-All expressions are given by JSON values. One can think of expressions
-as abstract syntax trees serialized to JSON; nevertheless, the precise
-semantics is given by the evaluation mechanism described later.
-
-** Semantic Values
-
-Expressions evaluate to semantic values. Semantic values are JSON
-values extended by additional atomic values for build-internal
-values like artifacts, names, etc.
-
-*** Truth
-
-Every value can be treated as a boolean condition. We follow a
-convention similar to ~LISP~ considering everything true that is
-not empty.
More precisely, the values -- ~null~, -- ~false~, -- ~0~, -- ~""~, -- the empty map, and -- the empty list -are considered logically false. All other values are logically true. - -** Evaluation - -The evaluation follows a strict, functional, call-by-value evaluation -mechanism; the precise evaluation is as follows. - -- Atomic values (~null~, booleans, strings, numbers) evaluate to - themselves. -- For lists, each entry is evaluated in the order they occur in the - list; the result of the evaluation is the list of the results. -- For JSON objects (wich can be understood as maps, or dicts), the - key ~"type"~ has to be present and has to be a literal string. - That string determines the syntactical construct (sloppily also - referred to as "function") the object represents, and the remaining - evaluation depends on the syntactical construct. The syntactical - construct has to be either one of the built-in ones or a special - function available in the given context (e.g., ~"ACTION"~ within - the expression defining a rule). - -All evaluation happens in an "environment" which is a map from -strings to semantic values. - -*** Built-in syntactical constructs - -**** Special forms - -***** Variables: ~"var"~ - -There has to be a key ~"name"~ that (i.e., the expression in the -object at that key) has to be a literal string, taken as variable -name. If the variable name is in the domain of the environment and -the value of the environment at the variable name is non-~null~, -then the result of the evaluation is the value of the variable in -the environment. - -Otherwise, the key ~"default"~ is taken (if present, otherwise the -value ~null~ is taken as default for ~"default"~) and evaluated. -The value obtained this way is the result of the evaluation. - -***** Sequential binding: ~"let*"~ - -The key ~"bindings"~ (default ~[]~) has to be (syntactically) a -list of pairs (i.e., lists of length two) with the first component -a literal string. - -For each pair in ~"bindings"~ the second component is evaluated, in -the order the pairs occur. After each evaluation, a new environment -is taken for the subsequent evaluations; the new environment is -like the old one but amended at the position given by the first -component of the pair to now map to the value just obtained. - -Finally, the ~"body"~ is evaluated in the final environment (after -evaluating all binding entries) and the result of evaluating the -~"body"~ is the value for the whole ~"let*"~ expression. - -***** Environment Map: ~"env"~ - -Creates a map from selected environment variables. - -The key ~"vars"~ (default ~[]~) has to be a list of literal strings referring to -the variable names that should be included in the produced map. This field is -not evaluated. This expression is only for convenience and does not give new -expression power. It is equivalent but lot shorter to multiple ~singleton_map~ -expressions combined with ~map_union~. - -***** Conditionals - -****** Binary conditional: ~"if"~ - -First the key ~"cond"~ is evaluated. If it evaluates to a value that -is logically true, then the key ~"then"~ is evaluated and its value -is the result of the evaluation. Otherwise, the key ~"else"~ (if -present, otherwise ~[]~ is taken as default) is evaluated and the -obtained value is the result of the evaluation. - -****** Sequential conditional: ~"cond"~ - -The key ~"cond"~ has to be a list of pairs. In the order of the -list, the first components of the pairs are evaluated, until one -evaluates to a value that is logically true. 
For that pair, the -second component is evaluated and the result of this evaluation is -the result of the ~"cond"~ expression. - -If all first components evaluate to a value that is logically false, -the result of the expression is the result of evaluating the key -~"default"~ (defaulting to ~[]~). - -****** String case distinction: ~"case"~ - -If the key ~"case"~ is present, it has to be a map (an "object", in -JSON's terminology). In this case, the key ~"expr"~ is evaluated; it -has to evaluate to a string. If the value is a key in the ~"case"~ -map, the expression at this key is evaluated and the result of that -evaluation is the value for the ~"case"~ expression. - -Otherwise (i.e., if ~"case"~ is absent or ~"expr"~ evaluates to a -string that is not a key in ~"case"~), the key ~"default"~ (with -default ~[]~) is evaluated and this gives the result of the ~"case"~ -expression. - -****** Sequential case distinction on arbitrary values: ~"case*"~ - -If the key ~"case"~ is present, it has to be a list of pairs. In this -case, the key ~"expr"~ is evaluated. It is an error if that evaluates -to a name-containing value. The result of that evaluation -is sequentially compared to the evaluation of the first components -of the ~"case"~ list until an equal value is found. In this case, -the evaluation of the second component of the pair is the value of -the ~"case*"~ expression. - -If the ~"case"~ key is absent, or no equality is found, the result of -the ~"case*"~ expression is the result of evaluating the ~"default"~ -key (with default ~[]~). - -***** Conjunction and disjunction: ~"and"~ and ~"or"~ - -For conjunction, if the key ~"$1"~ (with default ~[]~) is syntactically -a list, its entries are sequentially evaluated until a logically -false value is found; in that case, the result is ~false~, otherwise -true. If the key ~"$1"~ has a different shape, it is evaluated and -has to evaluate to a list. The result is the conjunction of the -logical values of the entries. In particular, ~{"type": "and"}~ -evaluates to ~true~. - -For disjunction, the evaluation mechanism is the same, but the truth -values and connective are taken dually. So, ~"and"~ and ~"or"~ are -logical conjunction and disjunction, respectively, using short-cut -evaluation if syntactically admissible (i.e., if the argument is -syntactically a list). - -***** Mapping - -****** Mapping over lists: ~"foreach"~ - -First the key ~"range"~ is evaluated and has to evaluate to a list. -For each entry of this list, the expression ~"body"~ is evaluated -in an environment that is obtained from the original one by setting -the value for the variable specified at the key ~"var"~ (which has -to be a literal string, default ~"_"~) to that value. The result -is the list of those evaluation results. - -****** Mapping over maps: ~"foreach_map"~ - -Here, ~"range"~ has to evaluate to a map. For each entry (in -lexicographic order (according to native byte order) by keys), the -expression ~"body"~ is evaluated in an environment obtained from -the original one by setting the variables specified at ~"var_key"~ -and ~"var_val"~ (literal strings, default values ~"_"~ and -~"$_"~, respectively). The result of the evaluation is the list of -those values. - -***** Folding: ~"foldl"~ - -The key ~"range"~ is evaluated and has to evaluate to a list. 
-Starting from the result of evaluating ~"start"~ (default ~[]~) a -new value is obtained for each entry of the range list by evaluating -~"body"~ in an environment obtained from the original by binding -the variable specified by ~"var"~ (literal string, default ~"_"~) to -the list entry and the variable specified by ~"accum_var"~ (literal -string, default value ~"$1"~) to the old value. The result is the -last value obtained. - -**** Regular functions - -First ~"$1"~ is evaluated; for binary functions ~"$2"~ is evaluated -next. For functions that accept keyword arguments, those are -evaluated as well. Finally the function is applied to this (or -those) argument(s) to obtain the final result. - -***** Unary functions - -- ~"nub_right"~ The argument has to be a list. It is an error if that list - contains (directly or indirectly) a name. The result is the - input list, except that for all duplicate values, all but the - rightmost occurrence is removed. - -- ~"basename"~ The argument has to be a string. This string is - interpreted as a path, and the file name thereof is returned. - -- ~"keys"~ The argument has to be a map. The result is the list of - keys of this map, in lexicographical order (according to native - byte order). - -- ~"values"~ The argument has to be a map. The result are the values - of that map, ordered by the corresponding keys (lexicographically - according to native byte order). - -- ~"range"~ The argument is interpreted as a non-negative integer as - follows. Non-negative numbers are rounded to the nearest integer; - strings have to be the decimal representation of an integer; - everything else is considered zero. The result is a list of the - given length, consisting of the decimal representations of the - first non-negative integers. For example, ~{"type": "range", - "$1": "3"}~ evaluates to ~["0", "1", "2"]~. - -- ~"enumerate"~ The argument has to be a list. The result is a map - containing one entry for each element of the list. The key is - the decimal representation of the position in the list (starting - from ~0~), padded with leading zeros to length at least 10. The - value is the element. The padding is chosen in such a way that - iterating over the resulting map (which happens in lexicographic - order of the keys) has the same iteration order as the list for - all lists indexable by 32-bit integers. - -- ~"++"~ The argument has to be a list of lists. The result is the - concatenation of those lists. - -- ~"map_union"~ The argument has to be a list of maps. The result - is a map containing as keys the union of the keys of the maps in - that list. For each key, the value is the value of that key in - the last map in the list that contains that key. - -- ~"join_cmd"~ The argument has to be a list of strings. A single - string is returned that quotes the original vector in a way - understandable by a POSIX shell. As the command for an action is - directly given by an argument vector, ~"join_cmd"~ is typically - only used for generated scripts. - -- ~"json_encode"~ The result is a single string that is the canonical - JSON encoding of the argument (with minimal white space); all atomic - values that are not part of JSON (i.e., the added atomic values - to represent build-internal values) are serialized as ~null~. - -***** Unary functions with keyword arguments - -- ~"change_ending"~ The argument has to be a string, interpreted as - path. The ending is replaced by the value of the keyword argument - ~"ending"~ (a string, default ~""~). 
For example, ~{"type": - "change_ending", "$1": "foo/bar.c", "ending": ".o"}~ evaluates - to ~"foo/bar.o"~. - -- ~"join"~ The argument has to be a list of strings. The return - value is the concatenation of those strings, separated by the - the specified ~"separator"~ (strings, default ~""~). - -- ~"escape_chars"~ Prefix every in the argument every character - occuring in ~"chars"~ (a string, default ~""~) by ~"escape_prefix"~ (a - strings, default ~"\\"~). - -- ~"to_subdir"~ The argument has to be a map (not necessarily of - artifacts). The keys as well as the ~"subdir"~ (string, default - ~"."~) argument are interpreted as paths and keys are replaced - by the path concatenation of those two paths. If the optional - argument ~"flat"~ (default ~false~) evaluates to a true value, - the keys are instead replaced by the path concatenation of the - ~"subdir"~ argument and the base name of the old key. It is an - error if conflicts occur in this way; in case of such a user - error, the argument ~"msg"~ is also evaluated and the result - of that evaluation reported in the error message. Note that - conflicts can also occur in non-flat staging if two keys are - different as strings, but name the same path (like ~"foo.txt"~ - and ~"./foo.txt"~), and are assigned different values. - It also is an error if the values for keys in conflicting positions - are name-containing. - -***** Binary functions - -- ~"=="~ The result is ~true~ is the arguments are equal, ~false~ - otherwise. It is an error if one of the arguments are name-containing - values. - -- ~"concat_target_name"~ This function is only present to simplify - transitions from some other build systems and normally not used - outside code generated by transition tools. The second argument - has to be a string or a list of strings (in the latter case, - it is treated as strings by concatenating the entries). If the - first argument is a string, the result is the concatenation of - those two strings. If the first argument is a list of strings, - the result is that list with the second argument concatenated to - the last entry of that list (if any). - -***** Other functions - -- ~"empty_map"~ This function takes no arguments and always returns - an empty map. - -- ~"singleton_map"~ This function takes two keyword arguments, - ~"key"~ and ~"value"~ and returns a map with one entry, mapping - the given key to the given value. - -- ~"lookup"~ This function takes two keyword arguments, ~"key"~ - and ~"map"~. The ~"key"~ argument has to evaluate to a string - and the ~"map"~ argument has to evaluate to a map. If that map - contains the given key and the corresponding value is non-~null~, - the value is returned. Otherwise the ~"default"~ argument (with - default ~null~) is evaluated and returned. - -**** Constructs related to reporting of user errors - -Normally, if an error occurs during the evaluation the error is -reported together with a stack trace. This, however, might not -be the most informative way to present a problem to the user, -especially if the underlying problem is a proper user error, e.g., -in rule usage (leaving out mandatory arguments, violating semantical -prerequisites, etc). To allow proper error reporting, the following -functions are available. All of them have an optional argument -~"msg"~ that is evaluated (only) in case of error and the result of -that evaluation included in the error message presented to the user. - -- ~"fail"~ Evaluation of this function unconditionally fails. 
-
-- ~"context"~ This function is only there to provide additional
-  information in case of error. Otherwise it is the identify
-  function (a unary function, i.e., the result of the evaluation
-  is the result of evaluating the argument ~"$1"~).
-
-- ~"assert_non_empty"~ Evaluate the argument (given by the parameter
-  ~"$1"~). If it evaluates to a non-empty string, map, or list,
-  return the result of the evaluation. Otherwise fail.
-
-- ~"disjoint_map_union"~ Like ~"map_union"~ but it is an error,
-  if two (or more) maps contain the same key, but map it to
-  different values. It is also an error if the argument is a
-  name-containing value.
diff --git a/doc/concepts/garbage.md b/doc/concepts/garbage.md
new file mode 100644
index 00000000..69594b1c
--- /dev/null
+++ b/doc/concepts/garbage.md
@@ -0,0 +1,86 @@
+Garbage Collection
+==================
+
+For every build, for all non-failed actions an entry is created in the
+action cache and the corresponding artifacts are stored in the CAS. So,
+over time, a lot of files accumulate in the local build root. Hence we
+have a way to reclaim disk space while keeping the benefits of having a
+cache. This operation is referred to as garbage collection and usually
+uses the heuristic of keeping what was most recently used. Our approach
+follows this paradigm as well.
+
+Invariants assumed by our build system
+--------------------------------------
+
+Our tool assumes several invariants on the local build root that we
+need to maintain during garbage collection. Those are the following.
+
+ - If an artifact is referenced in any cache entry (action cache,
+   target-level cache), then the corresponding artifact is in CAS.
+ - If a tree is in CAS, then so are its immediate parts (and hence also
+   all transitive parts).
+
+Generations of cache and CAS
+----------------------------
+
+In order to allow garbage collection while keeping the desired
+invariants, we keep several (currently two) generations of cache and
+CAS. Each generation in itself has to fulfill the invariants. The
+effective cache or CAS is the union of the caches or CASes of all
+generations, respectively. Obviously, then the effective cache and CAS
+fulfill the invariants as well.
+
+The actual `gc` command rotates the generations: the oldest generation
+is removed and the remaining generations are moved one number up
+(i.e., currently the young generation will simply become the old
+generation), implicitly creating a new, empty, youngest generation. As
+an empty generation fulfills the required invariants, this operation
+preserves the requirement that each generation individually fulfills the
+invariants.
+
+All additions are made to the youngest generation; in order to keep the
+invariant, relevant entries only present in an older generation are also
+added to the youngest generation first. Moreover, whenever an entry is
+referenced in any way (cache hit, request for an entry to be in CAS) and
+is only present in an older generation, it is also added to the younger
+generation, again adding referenced parts first. As a consequence, the
+youngest generation contains everything directly or indirectly
+referenced since the last garbage collection; in particular, everything
+referenced since the last garbage collection will remain in the
+effective cache or CAS upon the next garbage collection.
+
+These generations are stored as separate directories inside the local
+build root.
As the local build root is, starting from an empty
directory, entirely managed by `just` and compatible tools,
generations are on the same file system. Therefore the adding of old
entries to the youngest generation can be implemented in an efficient
way by using hard links.

The moving up of generations can happen atomically by renaming the
respective directory. Also, the oldest generation can be removed
logically by renaming a directory to a name that is not searched for
when looking for existing generations. The actual recursive removal from
the file system can then happen in a separate step without any
requirements on order.

Parallel operations in the presence of garbage collection
---------------------------------------------------------

The addition to cache and CAS can continue to happen in parallel; that
certain values are taken from an older generation instead of freshly
computed does not make a difference for the youngest generation (which
is the only generation modified). But build processes assume they don't
violate the invariants if they first add files to CAS and only later add
a tree or cache entry referencing them. This, however, only holds true
if no generation rotation happens in between. To avoid these kinds of
races, we make processes coordinate over a single lock for each build
root.

 - Any build process keeps a shared lock for the entirety of the build.
 - The garbage collection process takes an exclusive lock for the
   period it does the directory renames.

We consider it acceptable that, in theory, local build processes could
starve local garbage collection. Moreover, it should be noted that the
actual removal of no-longer-needed files from the file system happens
without any lock being held. Hence the disturbance of builds caused by
garbage collection is small.

diff --git a/doc/concepts/garbage.org b/doc/concepts/garbage.org
deleted file mode 100644
index 26f6cc51..00000000
diff --git a/doc/concepts/multi-repo.md b/doc/concepts/multi-repo.md
new file mode 100644
index 00000000..c465360e

Multi-repository build
======================

Repository configuration
------------------------

### Open repository names

A repository can have external dependencies. This is realized by having
unbound ("open") repository names being used as references.
The actual
definition of those external repositories is not part of the repository;
we think of them as inputs, i.e., we think of this repository as a
function of the referenced external targets.

### Binding in a separate repository configuration

The actual binding of the free repository names is specified in a
separate repository-configuration file, which is specified on the
command line (via the `-C` option); this command-line argument is
optional and the default is that the repository worked on has no
external dependencies. Typically (but not necessarily), this
repository-configuration file is located outside the referenced
repositories and versioned separately or generated from such a file via
`bin/just-mr.py`. It serves as meta-data for a group of repositories
belonging together.

This file contains one JSON object. For the key `"repositories"` the
value is an object; its keys are the global names of the specified
repositories. For each repository, there is an object describing it. The
key `"workspace_root"` describes where to find the repository and should
be present for all (direct or indirect) external dependencies of the
repository worked upon. Additional roots and file names (for targets,
rules, and expressions) can be specified. For keys not given, the same
rules for default values apply as for the corresponding command-line
arguments. Additionally, for each repository, the key `"bindings"`
specifies the map of the open repository names to the global names that
provide these dependencies. Repositories may depend on each other (or
even themselves), but the resulting global target graph has to be
cycle-free.

Whenever a location has to be specified, the value has to be a list,
with the first entry specifying the naming scheme; the semantics of the
remaining entries depends on the scheme (see "Root naming scheme"
below).

Additionally, the key `"main"` (with default `""`) specifies the main
repository. The target to be built (as specified on the command line) is
taken from this repository. Also, the command-line arguments `-w`,
`--target_root`, etc., apply to this repository. If no option `-w` is
given and `"workspace_root"` is not specified in the
repository-configuration file either, the root is determined from the
working directory as usual.

The value of `main` can be overwritten on the command line (with the
`--main` option). In this way, a consistent configuration of
interdependent repositories can be versioned and referred to regardless
of the repository worked on.

#### Root naming scheme

##### `"file"`

The `"file"` scheme tells that the repository (or respective
root) can be found in a directory in the local file system; the
only argument is the absolute path to that directory.

##### `"git tree"`

The `"git tree"` scheme tells that the root is defined to be a
tree given by a git tree identifier. It takes two arguments:

 - the tree identifier, as hex-encoded string, and
 - the absolute path to some repository containing that tree.

#### Example

Consider, for example, the following repository-configuration file.
In the following, we assume it is located at `/etc/just/repos.json`.
``` jsonc
{ "main": "env"
, "repositories":
  { "foobar":
    { "workspace_root": ["file", "/opt/foobar/repo"]
    , "rule_root": ["file", "/etc/just/rules"]
    , "bindings": {"base": "barimpl"}
    }
  , "barimpl":
    { "workspace_root": ["file", "/opt/barimpl"]
    , "target_file_name": "TARGETS.bar"
    }
  , "env": {"bindings": {"foo": "foobar", "bar": "barimpl"}}
  }
}
```

It specifies 3 repositories, with global names `foobar`, `barimpl`,
and `env`. Within `foobar`, the repository name `base` refers to
`barimpl`, the repository that can be found at `/opt/barimpl`.

The repository `env` is the main repository and there is no
workspace root defined for it, so it only provides bindings for the
external repositories `foo` and `bar`, but the actual repository is
taken from the working directory (unless `-w` is specified). In this
way, it provides an environment for developing applications based on
`foo` and `bar`.

For example, the invocation `just build -C /etc/just/repos.json
baz` tells our tool to build the target `baz` from the module the
working directory is located in. `foo` will refer to the repository
found at `/opt/foobar/repo` (using rules from `/etc/just/rules`,
taking `base` to refer to the repository at `/opt/barimpl`) and `bar`
will refer to the repository at `/opt/barimpl`.

Naming of targets
-----------------

### Reference in target files

In addition to the normal target references (a string for a target in
the same module, a module-target pair for a target in the same
repository, `["./", relpath, target]` for relative addressing, and
`["FILE", null, name]` for an explicit file reference in the same
module), references of the form `["@", repo, module, target]` can be
specified, where `repo` is a string referring to an open name. That open
repository name is resolved to the global name by the `"bindings"`
parameter of the repository the target reference is made in. Within the
repository the resolved name refers to, the target `[module, target]` is
taken.

### Expression language: names as abstract values

Targets are a global concept as they distinguish targets from different
repositories. Their names, however, depend on the repository they occur
in (as the local names might differ in various repositories). Moreover,
some targets cannot be named in certain repositories, as not every
repository has a local name in every other repository.

To handle this naming problem, we note the following. During the
evaluation of a target, names occur in two places: as the result of
evaluating the parameters (for target fields) and in the evaluation of
the defining expression when requesting properties of a target depended
upon (via `DEP_ARTIFACTS` and related functions). In the latter case,
however, the only legitimate way to obtain a target name is by the
`FIELD` function. To enforce this behavior, and to avoid problems with
serializing target names, our expression language considers target names
as opaque values. More precisely,

 - in a target description, the target fields are evaluated and the
   result of the evaluation is parsed, in the context of the module the
   `TARGETS` file belongs to, as a target name, and
 - during evaluation of the defining expression of the target's
   rule, when accessing `FIELD`, the values of target fields will be
   reported as abstract name values, and when querying values of
   dependencies (via `DEP_ARTIFACTS` etc.) the correct abstract target
   name has to be provided (a sketch follows this list).
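As a sketch of this pattern (the field name `"deps"` is hypothetical), a
defining expression may collect the artifacts of all its dependencies by
passing the abstract names obtained from `FIELD` on to `DEP_ARTIFACTS`,
without ever inspecting those names:

``` jsonc
{ "type": "map_union"
, "$1":
  { "type": "foreach"
  , "var": "d"
  , "range": {"type": "FIELD", "name": "deps"}
  , "body": {"type": "DEP_ARTIFACTS", "dep": {"type": "var", "name": "d"}}
  }
}
```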
While the defining expression has access to target names (via target
fields), it is not useful to provide them in the provided data; a
consuming target cannot use names unless it has those targets as
dependencies anyway. Our tool will not enforce this policy; however,
only targets not having names in their provided data are eligible to be
used in `export` rules.

File layout in actions
----------------------

As `just` does full staging for actions, no special considerations are
needed when combining targets of different repositories. Each target
brings its staging of artifacts as usual. In particular, no repository
names (neither local nor global ones) will ever be visible in any
action. So for the consuming target it makes no difference if its
dependency comes from the same or a different repository.

diff --git a/doc/concepts/multi-repo.org b/doc/concepts/multi-repo.org
deleted file mode 100644
index f1ad736f..00000000
diff --git a/doc/concepts/overview.md b/doc/concepts/overview.md
new file mode 100644
index 00000000..a9bcc847

Tool Overview
=============

Structuring
-----------

### Structuring the Build: Targets, Rules, and Actions

The primary units this build system deals with are targets: the user
requests the system to build (or install) a target, targets depend on
other targets, etc. Targets typically reflect the units a software
developer thinks in: libraries, binaries, etc. The definition of a
target only describes the information directly belonging to the target,
e.g., its source, private and public header files, and its direct
dependencies. Any other information needed to build a target (like the
public header files of an indirect dependency) is inferred by the build
tool. In this way, the build description can be kept maintainable.

A built target consists of files logically belonging together (like the
actual library file and its public headers) as well as information on
how to use the target (linking arguments, transitive header files, etc.).
For a consumer of a target, the definition of this collection of files
as well as the additionally provided information is what defines the
target as a dependency, regardless of where the target is coming from
(i.e., targets coinciding here are indistinguishable for other targets).
Of course, to actually build a single target from its dependencies, many
invocations of the compiler or other tools are necessary (so-called
"actions"); the build tool translates these high-level descriptions
into the individual actions necessary and only re-executes those where
inputs have changed.

This translation of high-level concepts into individual actions is not
hard-coded into the tool. It is provided by the user as "rules" and
forms additional input to the build. To avoid duplicate work, rules are
typically maintained centrally for a project or an organization.

### Structuring the Code: Modules and Repositories

The code base is usually split into many directories, each containing
source files belonging together. To allow the definition of targets
where their code is, the targets are structured in a similar way. For
each directory, there can be a targets file. Directories for which such
a targets file exists are called "modules". Each file belongs to the
module that is closest when searching upwards in the directory tree. The
targets file of a module defines the targets formed from the source
files belonging to this module.

Larger projects are often split into "repositories". For this build
tool, a repository is a logical unit. Often those coincide with the
repositories in the sense of version control. This, however, does not
have to be the case. Also, from one directory in the file system many
repositories can be formed that might differ in the rules used, targets
defined, or binding of their dependencies.

Staging
-------

A peculiarity of this build system is the complete separation between
physical and logical paths. Targets have their own view of the world,
i.e., they can place their artifacts at any logical path they like, and
this is how they look to other targets. It is up to the consuming
targets what they do with artifacts of the targets they depend on; in
particular, they are not obliged to leave them at the logical location
their dependency put them.

When such a collection of artifacts at logical locations (often referred
to as the "stage") is realized on the file system (when installing a
target, or as inputs to actions), the paths are interpreted as paths
relative to the respective root (installation or action directory).

This separation is what allows flexible combination of targets from
various sources without leaking repository names or requiring a
different file arrangement depending on whether a target is in the
"main" repository.

Repository data
---------------

A repository uses a (logical) directory for several purposes: to obtain
source files, to read definitions of targets, to read rules, and to read
expressions that can be used by rules. While all those directories can
be (and often are) the same, this does not have to be the case. For each
of those purposes, a different logical directory (also called "root")
can be used. In this way, one can, e.g., add target definitions to a
source tree originally written for a different build tool without
modifying the original source tree.

Those roots are usually defined in a repository configuration. For the
"main" repository, i.e., the repository from which the target to be
built is requested, the roots can also be overwritten at the command
line. Roots can be defined as paths in the file system, but also as
`git` tree identifiers (together with the location of some repository
containing that tree).
The latter definition is preferable for rules and
dependencies, as it allows high-level caching of targets. It also
motivates the need to add target definitions without changing the
root itself.

The same flexibility as for the roots is also present for the names of
the files defining targets, rules, and expressions. While the default
names `TARGETS`, `RULES`, and `EXPRESSIONS` are often used, other file
names can be specified for those as well, either in the repository
configuration or (for the main repository) on the command line.

The final piece of data needed to describe a repository is the binding
of the open repository names that are used to refer to other
repositories. More details can be found in the documentation on
multi-repository builds.

Targets
-------

### Target naming

In description files, targets, rules, and expressions are referred to by
name. As the context always fixes whether a name for a target, rule, or
expression is expected, they use the same naming scheme.

 - A single string refers to the target with this name in the same
   module.
 - A pair `[module, name]` refers to the target `name` in the module
   `module` of the same repository. There are no module names with a
   distinguished meaning. The naming scheme is unambiguous, as all
   other names given by lists have length at least 3.
 - A list `["./", relative-module-path, name]` refers to a target with
   the given name in the module that has the specified path relative to
   the current module (in the current repository).
 - A list `["@", repository, module, name]` refers to the target with
   the specified name in the specified module of the specified
   repository.

Additionally, there are special targets that can also be referred to in
target files.

 - An explicit reference to a source-file target in the same module,
   specified as `["FILE", null, name]`. The explicit `null` at the
   second position (where normally the module would be) is necessary to
   ensure the name has length more than 2, to distinguish it from a
   reference to the module `"FILE"`.
 - A reference to a collection, given by a shell pattern, of explicit
   source files in the top-level directory of the same module,
   specified as `["GLOB", null, pattern]`. The explicit `null` at the
   second position is required for the same reason as in the explicit
   file reference.
 - A reference to a tree target in the same module, specified as
   `["TREE", null, name]`. The explicit `null` at the second position is
   required for the same reason as in the explicit file reference.

### Data of an analyzed target

Analyzing a target results in 3 pieces of data.

 - The "artifacts" are a staged collection of artifacts. Typically,
   these are what is normally considered the main reason to build a
   target, e.g., the actual library file in case of a library.

 - The "runfiles" are another staged collection of artifacts.
   Typically, these are files that directly belong to the target and
   are somehow needed to use the target. For example, in case of a
   library that would be the public header files of the library itself.

 - A "provides" map with additional information the target wants to
   provide to its consumers. The data contained in that map can also
   contain additional artifacts. Typically, this is the remaining
   information needed to use the target in a build.
   In case of a library, that typically would include any other
   libraries this library transitively depends upon (a stage), the
   correct linking order (a list of strings), and the public headers of
   the transitive dependencies (another stage).

A target is completely determined by these 3 pieces of data. A consumer
of the target will have no other information available. Hence it is
crucial that everything (apart from artifacts and runfiles) needed to
build against that target is contained in the provides map.

When the installation of a target is requested on the command line,
artifacts and runfiles are installed; in case of staging conflicts,
artifacts take precedence.

### Source targets

#### Files

If a target is not found in the targets file, it is implicitly
treated as a source file. Both explicit and implicit source files
look the same. The artifacts stage has a single entry: the path is
the relative path of the file to the module root and the value is the
file artifact located at the specified location. The runfiles are
the same as the artifacts and the provides map is empty.

#### Collection of files given by a shell pattern

A collection of files given by a shell pattern has, both as
artifacts and runfiles, the (necessarily disjoint) union of the
artifact maps of the (zero or more) source targets that match the
pattern. Only *files* in the *top-level* directory of the given
module are considered for matches. The provides map is empty.

#### Trees

A tree describes a directory. Internally, however, it is a single
opaque artifact. Consuming targets cannot look into the internal
structure of that tree. Only when realized in the file system (when
installation is requested or as part of the input to an action) is the
directory structure visible again.

An explicit tree target is similar to an explicit file target,
except that at the specified location there has to be a directory
rather than a file, and the tree artifact corresponding to that
directory is taken instead of a file artifact.

diff --git a/doc/concepts/overview.org b/doc/concepts/overview.org
deleted file mode 100644
index 5dc7ad20..00000000
diff --git a/doc/concepts/rules.md b/doc/concepts/rules.md
new file mode 100644
index 00000000..2ab4c334

User-defined Rules
==================

Targets are defined in terms of high-level concepts like "libraries",
"binaries", etc. In order to translate these high-level definitions
into actionable tasks, the user defines rules, explaining at a single
point how all targets of a given type are built.

Rules files
-----------

Rules are defined in rules files (by default named `RULES`). Those
contain a JSON object mapping rule names to their rule definition. For
rules, the same naming scheme as for targets applies. However, built-in
rules (always named by a single string) take precedence in naming; to
explicitly refer to a rule defined in the current module, the module has
to be specified, possibly by a relative path, e.g.,
`["./", ".", "install"]`.

Basic components of a rule
--------------------------

A rule is defined through a JSON object with various keys. The only
mandatory key is `"expression"`, containing the defining expression of
the rule.

### `"config_fields"`, `"string_fields"` and `"target_fields"`

These keys specify the fields that a target defined by that rule can
have. In particular, those have to be disjoint lists of strings.
For `"config_fields"` and `"string_fields"` the respective field has to
evaluate to a list of strings, whereas `"target_fields"` have to
evaluate to a list of target references. Those references are evaluated
immediately, and in the name context of the target they occur in.

The difference between `"config_fields"` and `"string_fields"` is that
`"config_fields"` are evaluated before the target fields and hence can
be used by the rule to specify config transitions for the target fields.
`"string_fields"`, on the other hand, are evaluated *after*
the target fields; hence the rule cannot use them to specify a
configuration transition. However, the target definitions in those
fields may use the `"outs"` and `"runfiles"` functions to access the
names of the artifacts or runfiles of a target specified in one of the
target fields.

### `"implicit"`

This key specifies a map of implicit dependencies. The keys of the map
are additional target fields, the values are the fixed lists of targets
for those fields. If a short-form name of a target is used (e.g., only a
string instead of a module-target pair), it is interpreted relative to
the repository and module the rule is defined in, not the one the rule
is used in. Other than this, those fields are evaluated the same way as
target fields settable on invocation of the rule.

### `"config_vars"`

This is a list of strings specifying which parts of the configuration
the rule uses. The defining expression of the rule is evaluated in an
environment that is the configuration restricted to those variables; if
one of those variables is not specified in the configuration, the value
in the restriction is `null`.

### `"config_transitions"`

This key specifies a map of (some of) the target fields (whether
declared as `"target_fields"` or as `"implicit"`) to a configuration
expression. Here, a configuration expression is any expression in our
language. It has access to the `"config_vars"` and the `"config_fields"`
and has to evaluate to a list of maps. Each map specifies a transition
of the current configuration by amending it on the domain of that map to
the given values.

### `"imports"`

This specifies a map of expressions that can later be used by
`CALL_EXPRESSION`. In this way, duplication of (rule) code can be
avoided. For each key, the value has to be the name of an expression;
expressions are named following the same naming scheme as targets and
rules. The names are resolved in the context of the rule. Expressions
themselves are defined in expression files, the default name being
`EXPRESSIONS`.

Each expression is a JSON object. The only mandatory key is
`"expression"`, which has to be an expression in our language. It
optionally can have a key `"vars"`, where the value has to be a list of
strings (and the default is the empty list). Additionally, it can have
another optional key `"imports"`, following the same scheme as the
`"imports"` key of a rule; in the `"imports"` key of an expression,
names are resolved in the context of that expression. It is a
requirement that the `"imports"` graph be cycle-free.

### `"expression"`

This specifies the defining expression of the rule. The value has to be
an expression of our expression language (basically, an abstract syntax
tree serialized as JSON). It has access to the following extra functions
and, when evaluated, has to return a result value.
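As a minimal sketch of how those components fit together (the rule and
field names are hypothetical), consider a rule that stages the artifacts
of its dependencies into a subdirectory given as a string field; the
functions used in its defining expression are described in the remainder
of this section.

``` jsonc
{ "staged":
  { "string_fields": ["subdir"]
  , "target_fields": ["deps"]
  , "expression":
    { "type": "RESULT"
    , "artifacts":
      { "type": "to_subdir"
      , "subdir": {"type": "join", "$1": {"type": "FIELD", "name": "subdir"}}
      , "$1":
        { "type": "disjoint_map_union"
        , "msg": "deps artifacts must not overlap"
        , "$1":
          { "type": "foreach"
          , "var": "dep"
          , "range": {"type": "FIELD", "name": "deps"}
          , "body":
            {"type": "DEP_ARTIFACTS", "dep": {"type": "var", "name": "dep"}}
          }
        }
      }
    }
  }
}
```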
#### `FIELD`

The field function takes one argument, `name`, which has to evaluate
to the name of a field. For string fields, the given list of strings
is returned; for target fields, the list of abstract names for the
given targets is returned. These abstract names are opaque within the
rule language (but meaningful when reported in error messages) and
should only be used to be passed on to other functions that expect
names as inputs.

#### `DEP_ARTIFACTS` and `DEP_RUNFILES`

These functions give access to the artifacts, or runfiles,
respectively, of one of the targets depended upon. They take two
(evaluated) arguments, the mandatory `"dep"` and the optional
`"transition"`.

The argument `"dep"` has to evaluate to an abstract name (as can be
obtained from the `FIELD` function) of some target specified in one
of the target fields. The `"transition"` argument has to evaluate to
a configuration transition (i.e., a map), and the empty transition is
taken as default. It is an error to request a target-transition pair
for a target that was not requested in the given transition through
one of the target fields.

#### `DEP_PROVIDES`

This function gives access to a particular entry of the provides map
of one of the targets depended upon. The arguments `"dep"` and
`"transition"` are as for `DEP_ARTIFACTS`; additionally, there is
the mandatory argument `"provider"`, which has to evaluate to a
string. The function returns the value of the provides map of the
target at the given provider. If the key is not in the provides map
(or the value at that key is `null`), the optional argument
`"default"` is evaluated and returned. The default for `"default"`
is the empty list.

#### `BLOB`

The `BLOB` function takes a single (evaluated) argument `data`, which
is optional and defaults to the empty string. This argument has to
evaluate to a string. The function returns an artifact that is a
non-executable file with the given string as content.

#### `TREE`

The `TREE` function takes a single (evaluated) argument `$1`, which
has to be a map of artifacts. The result is a single tree artifact
formed from the input map. It is an error if the map cannot be
transformed into a tree (e.g., due to staging conflicts).

#### `ACTION`

Actions are a way to define new artifacts from (zero or more)
already defined artifacts by running a command, typically a
compiler, linker, archiver, etc. The action function takes the
following arguments.

 - `"inputs"` A map of artifacts. These artifacts are present when
   the command is executed; the keys of the map are the relative
   paths from the working directory of the command. The command must
   not make any assumptions about the location of the working
   directory in the file system (and instead should refer to files
   by path relative to the working directory). Moreover, the
   command must not modify the input files in any way. (In-place
   operations can be simulated by staging, as is shown in the
   example later in this document.)

   It is an additional requirement that no conflicts occur when
   interpreting the keys as paths. For example, `"foo.txt"` and
   `"./foo.txt"` are different as strings and hence legitimately
   can be assigned different values in a map. When interpreted as a
   path, however, they name the same path; so, if the `"inputs"`
   map contains both those keys, the corresponding values have to
   be equal.

 - `"cmd"` The command to execute, given as `argv` vector, i.e., a
   non-empty list of strings.
   The 0'th element of that list will
   also be the program to be executed.

 - `"env"` The environment in which the command should be executed,
   given as a map of strings to strings.

 - `"outs"` and `"out_dirs"` Two lists of strings naming the files
   and directories, respectively, the command is expected to
   create. It is an error if the command fails to create the
   promised output files. These two lists have to be disjoint, but
   an entry of `"outs"` may well name a location inside one of the
   `"out_dirs"`.

This function returns a map whose keys are the strings mentioned in
`"outs"` and `"out_dirs"`; the corresponding values are the artifacts
defined to be the ones created by running the given command (in the
given environment with the given inputs).

#### `RESULT`

The `RESULT` function is the only way to obtain a result value. It
takes three (evaluated) arguments, `"artifacts"`, `"runfiles"`, and
`"provides"`, all of which are optional and default to the empty
map. It defines the result of a target that has the given artifacts,
runfiles, and provided data, respectively. In particular,
`"artifacts"` and `"runfiles"` have to be maps to artifacts, and
`"provides"` has to be a map. Moreover, the keys in `"runfiles"`
and `"artifacts"` are treated as paths; it is an error if this
interpretation yields conflicts. The keys in the artifacts or
runfile maps as seen by other targets are the normalized paths of
the keys given.

Result values themselves are opaque in our expression language and
cannot be deconstructed in any way. Their only purpose is to be the
result of the evaluation of the defining expression of a target.

#### `CALL_EXPRESSION`

This function takes one mandatory argument `"name"`, which is
unevaluated; it has to be a string literal. The expression
imported by that name through the imports field is evaluated in the
current environment restricted to the variables of that expression.
The result of that evaluation is the result of the `CALL_EXPRESSION`
statement.

During the evaluation of an expression, rule fields can still be
accessed through the functions `FIELD`, `DEP_ARTIFACTS`, etc. In
particular, even an expression with no variables (that, hence, is
always evaluated in the empty environment) can carry out non-trivial
computations and be non-constant. The special functions `BLOB`,
`ACTION`, and `RESULT` are also available. If inside the evaluation
of an expression the function `CALL_EXPRESSION` is used, the name
argument refers to the `"imports"` map of that expression. So the
call graph is deliberately recursion-free.

Evaluation of a target
----------------------

A target defined by a user-defined rule is evaluated in the following
way.

 - First, the config fields are evaluated.

 - Then, the target fields are evaluated. This happens for each field
   as follows.

   - The configuration transition for this field is evaluated and the
     transitioned configurations determined.
   - The argument expression for this field is evaluated. The result
     is interpreted as a list of target names. Each of those targets
     is analyzed in all the specified configurations.

 - The string fields are evaluated. If the expression for a string
   field queries a target (via `outs` or `runfiles`), the value for
   that target is returned in the first configuration.
   The rationale
   here is that such generator expressions are intended to refer to the
   corresponding target in its "main" configuration; they are hardly
   used anyway for fields branching their targets over many
   configurations.

 - The effective configuration for the target is determined. Of the
   configuration, the target has effectively used the variables
   occurring in the `arguments_config` of the rule invocation, the
   `config_vars` the rule specified, and the parts of the configuration
   used by the targets depended upon. For a target depended upon, all
   parts it used of its configuration are relevant, except for those
   fixed by the configuration transition.

 - The rule expression is evaluated and the result of that evaluation
   is the result of the rule.

Example of developing a rule
----------------------------

Let's consider step by step an example of writing a rule. Say we want
to write a rule that programmatically patches some files.

### Framework: The minimal rule

Every rule has to have a defining expression evaluating to a `RESULT`.
So the minimally correct rule is the `"null"` rule in the following
example rule file.

    { "null": {"expression": {"type": "RESULT"}}}

This rule accepts no parameters, and has the empty map as artifacts,
runfiles, and provided data. So it is not very useful.

### String inputs

Let's allow the target definition to have some fields. The simplest
fields are `string_fields`; they are given by a list of strings. In the
defining expression we can access them directly via the `FIELD`
function. Strings can be used when defining maps, but we can also create
artifacts from them, using the `BLOB` function. To create a map, we can
use the `singleton_map` function. We define values step by step, using
the `let*` construct.

``` jsonc
{ "script only":
  { "string_fields": ["script"]
  , "expression":
    { "type": "let*"
    , "bindings":
      [ [ "script content"
        , { "type": "join"
          , "separator": "\n"
          , "$1":
            { "type": "++"
            , "$1":
              [["H"], {"type": "FIELD", "name": "script"}, ["w", "q", ""]]
            }
          }
        ]
      , [ "script"
        , { "type": "singleton_map"
          , "key": "script.ed"
          , "value":
            {"type": "BLOB", "data": {"type": "var", "name": "script content"}}
          }
        ]
      ]
    , "body":
      {"type": "RESULT", "artifacts": {"type": "var", "name": "script"}}
    }
  }
}
```

### Target inputs and derived artifacts

Now it is time to add the input files. Source files are targets like any
other target (and happen to contain precisely one artifact). So we add a
target field `"srcs"` for the files to be patched. Here we have to keep
in mind that, on the one hand, target fields accept a list of targets
and, on the other hand, the artifacts of a target are a whole map. We
choose to patch all the artifacts of all given `"srcs"` targets. We can
iterate over lists with `foreach` and over maps with `foreach_map`.

Next, we have to keep in mind that targets may place their artifacts at
arbitrary logical locations. For us that means that first we have to
make a decision at which logical locations we want to place the output
artifacts. As one thinks of patching as an in-place operation, we choose
to logically place the outputs where the inputs have been. Of course, we
do not modify the input files in any way; after all, we have to define a
mathematical function computing the output artifacts, not a collection
of side effects.
The actual patching is done by an `ACTION`. We have the script already;
to make things easy, we stage the input to a fixed place and also expect
a fixed output location. Then the actual command is a simple shell
script. The only thing we have to keep in mind is that we want useful
output precisely if the action fails. Also note that, while we define
our actions sequentially, they will be executed in parallel, as none of
them depends on the output of another one.

``` jsonc
{ "ed patch":
  { "string_fields": ["script"]
  , "target_fields": ["srcs"]
  , "expression":
    { "type": "let*"
    , "bindings":
      [ [ "script content"
        , { "type": "join"
          , "separator": "\n"
          , "$1":
            { "type": "++"
            , "$1":
              [["H"], {"type": "FIELD", "name": "script"}, ["w", "q", ""]]
            }
          }
        ]
      , [ "script"
        , { "type": "singleton_map"
          , "key": "script.ed"
          , "value":
            {"type": "BLOB", "data": {"type": "var", "name": "script content"}}
          }
        ]
      , [ "patched files per target"
        , { "type": "foreach"
          , "var": "src"
          , "range": {"type": "FIELD", "name": "srcs"}
          , "body":
            { "type": "foreach_map"
            , "var_key": "file_name"
            , "var_val": "file"
            , "range":
              {"type": "DEP_ARTIFACTS", "dep": {"type": "var", "name": "src"}}
            , "body":
              { "type": "let*"
              , "bindings":
                [ [ "action output"
                  , { "type": "ACTION"
                    , "inputs":
                      { "type": "map_union"
                      , "$1":
                        [ {"type": "var", "name": "script"}
                        , { "type": "singleton_map"
                          , "key": "in"
                          , "value": {"type": "var", "name": "file"}
                          }
                        ]
                      }
                    , "cmd":
                      [ "/bin/sh"
                      , "-c"
                      , "cp in out && chmod 644 out && /bin/ed out < script.ed > log 2>&1 || (cat log && exit 1)"
                      ]
                    , "outs": ["out"]
                    }
                  ]
                ]
              , "body":
                { "type": "singleton_map"
                , "key": {"type": "var", "name": "file_name"}
                , "value":
                  { "type": "lookup"
                  , "map": {"type": "var", "name": "action output"}
                  , "key": "out"
                  }
                }
              }
            }
          }
        ]
      , [ "artifacts"
        , { "type": "disjoint_map_union"
          , "msg": "srcs artifacts must not overlap"
          , "$1":
            { "type": "++"
            , "$1": {"type": "var", "name": "patched files per target"}
            }
          }
        ]
      ]
    , "body":
      {"type": "RESULT", "artifacts": {"type": "var", "name": "artifacts"}}
    }
  }
}
```

A typical invocation of that rule would be a target file like the
following.

``` jsonc
{ "input.txt":
  { "type": "ed patch"
  , "script": ["%g/world/s//user/g", "%g/World/s//USER/g"]
  , "srcs": [["FILE", null, "input.txt"]]
  }
}
```

As the input file has the same name as a target (in the same module), we
use the explicit file reference in the specification of the sources.

### Implicit dependencies and config transitions

Say, instead of patching a file, we want to generate source files from
some high-level description using our actively developed code generator.
Then we have to take some additional considerations into account.
 - First of all, every target defined by this rule depends not only on
   the targets the user specifies: our code generator is an implicit
   dependency as well. And as it is under active development, we
   certainly do not want it to be taken from the ambient build
   environment (as we did in the previous example with `ed`, which,
   however, is a pretty stable tool). So we use an `implicit` target
   for this.
 - Next, we notice that our code generator is used during the build. In
   particular, we want that tool (written in some compiled language) to
   be built for the platform we run our actions on, not the target
   platform we build our final binaries for. Therefore, we have to use
   a configuration transition.
 - As our defining expression also needs the configuration transition
   to access the artifacts of that implicit target, we had better define
   it as a reusable expression. Other rules in our rule collection might
   also have the same task; so `["transitions", "for host"]` might be a
   good place to define it. In fact, it can look like the expression
   with that name in our own code base.

So, the overall organization of our rule might be as follows.

``` jsonc
{ "generated code":
  { "target_fields": ["srcs"]
  , "implicit": {"generator": [["generators", "foogen"]]}
  , "config_vars": ["HOST_ARCH"]
  , "imports": {"for host": ["transitions", "for host"]}
  , "config_transitions":
    {"generator": [{"type": "CALL_EXPRESSION", "name": "for host"}]}
  , "expression": ...
  }
}
```

### Providing information to consuming targets

In the simple case of patching, the resulting file is indeed the only
information the consumer of that target needs; in fact, the main point
was that the resulting target could be a drop-in replacement for a source
file. A typical rule, however, defines something like a library, and a
library is much more than just the actual library file and the public
headers: a library may depend on other libraries; therefore, in order to
use it, we need

 - to have the header files of dependencies available that might be
   included by the public header files of that library,
 - to have the libraries transitively depended upon available during
   linking, and
 - to know the order in which to link the dependencies (as they might
   have dependencies among each other).

In order to keep the build description maintainable, all this should be
taken care of by simply depending on that library. We do
*not* want the consumer of a target to have to be aware of
such transitive dependencies (e.g., when constructing the link command
line), as used to be the case in early build tools like `make`.

It is a deliberate design choice that a target is given only by the
result of its analysis, regardless of where it is coming from.
Therefore, all this information needs to be part of the result of a
target. Such information is precisely what the mentioned
`"provides"` map is for. As a map, it can contain an arbitrary amount of
information, and the interface function `"DEP_PROVIDES"` is designed in
such a way that adding more providers does not affect targets not aware
of them (there is no function asking for all providers of a target). The
keys and their meaning have to be agreed upon by a target and its
consumers. As the latter, however, typically are targets of the same
family (authored by the same group), this usually is not a problem.
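For example, a consuming rule could collect the `"link-args"` provided
by all targets in a `"deps"` field along the following lines (a sketch;
the field name is made up, and the provider key is only meaningful
because rule and consumers agree on it).

``` jsonc
{ "type": "++"
, "$1":
  { "type": "foreach"
  , "var": "dep"
  , "range": {"type": "FIELD", "name": "deps"}
  , "body":
    { "type": "DEP_PROVIDES"
    , "dep": {"type": "var", "name": "dep"}
    , "provider": "link-args"
    , "default": []
    }
  }
}
```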
A typical example of computing a provided value is the `"link-args"` in
the rules used by `just` itself. They are defined by the following
expression.

``` jsonc
{ "type": "nub_right"
, "$1":
  { "type": "++"
  , "$1":
    [ {"type": "keys", "$1": {"type": "var", "name": "lib"}}
    , {"type": "CALL_EXPRESSION", "name": "link-args-deps"}
    , {"type": "var", "name": "link external", "default": []}
    ]
  }
}
```

This expression

 - collects the respective provider of its dependencies,
 - adds itself in front, and
 - deduplicates the resulting list, keeping only the right-most
   occurrence of each entry.

In this way, the invariant is kept that the `"link-args"` form a
topological ordering of the dependencies (in the sense that each entry
is mentioned before its dependencies).
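To illustrate the deduplication semantics of `nub_right` in isolation
(with made-up flag values): the expression

``` jsonc
{"type": "nub_right", "$1": ["-lfoo", "-lbar", "-lfoo", "-lbaz"]}
```

evaluates to `["-lbar", "-lfoo", "-lbaz"]`; the earlier occurrence of
`"-lfoo"` is dropped in favour of the right-most one, so a library
always stays behind the libraries that depend on it.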
diff --git a/doc/concepts/rules.org b/doc/concepts/rules.org
deleted file mode 100644
index d4c61b5e..00000000
--- a/doc/concepts/rules.org
+++ /dev/null

diff --git a/doc/concepts/target-cache.md b/doc/concepts/target-cache.md
new file mode 100644
index 00000000..0db627e1
--- /dev/null
+++ b/doc/concepts/target-cache.md

Target-level caching
====================

`git` trees as content-fixed roots
----------------------------------

### The `"git tree"` root scheme

The multi-repository configuration supports a scheme `"git tree"`.
This scheme is given by two parameters,

 - the id of the tree (as a string with the hex encoding), and
 - an arbitrary `git` repository containing the specified tree object,
   as well as all needed tree and blob objects reachable from that
   tree.

For example, a root could be specified as follows.

``` jsonc
["git tree", "6a1820e78f61aee6b8f3677f150f4559b6ba77a4", "/usr/local/src/justbuild.git"]
```

It should be noted that the `git` tree identifier alone already
specifies the content of the full tree. However, `just` needs access to
some repository containing the tree in order to know what the tree looks
like.

Nevertheless, it is an important observation that the tree identifier
alone already specifies the content of the whole (logical) directory.
The equality of two such directories can be established by comparing the
two identifiers *without* the need to read any file from
disk. Those "fixed-content" descriptions, i.e., descriptions of a
repository root that already fully determine the content, are the key to
caching whole targets.

### `KNOWN` artifacts

The in-memory representation of known artifacts has an optional
reference to a repository containing that artifact. Artifacts "known"
from local repositories might not be known to the CAS used for the
action execution; this additional reference allows filling such misses
in the CAS.

Content-fixed repositories
--------------------------

### The parts of a content-fixed repository

In order to meaningfully cache a target, we need to be able to
efficiently compute the cache key. We restrict this to the case where we
can compute the information about the repository without file-system
access. This requires that all roots (workspace, target root, etc.) be
content fixed, as well as the bindings of the free repository names (and
hence also all transitively reachable repositories). We call such
repositories "content-fixed" repositories.

### Canonical description of a content-fixed repository

The local data of a repository consists of the following.

 - The roots (for workspace, targets, rules, expressions). As the tree
   identifier already defines the content, we leave out the path to the
   repository containing the tree.
 - The names of the targets, rules, and expression files.
 - The names of the outgoing "bindings".

Additionally, repositories can reach additional repositories via
bindings. Moreover, this repository-level dependency relation is not
necessarily cycle free. In particular, we cannot use the tree unfolding
as canonical representation of that graph up to bisimulation, as we do
with most other data structures. To still get a canonical
representation, we factor out the largest bisimulation, i.e., minimize
the respective automaton (with repositories as states, local data as
locally observable properties, and the binding relation as edges).

Finally, for each repository individually, the reachable repositories
are renamed `"0"`, `"1"`, `"2"`, etc, following a depth-first traversal
starting from the repository in question, where outgoing edges are
traversed in lexicographical order. The entry point is hence
recognisable as repository `"0"`.

The repository key is the content identifier of the canonical
serialisation of the JSON encoding of the multi-repository
configuration obtained this way (with repository-free git-root
descriptions). The serialisation itself is stored in CAS.
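Assuming, purely for the sake of illustration, that the configuration
uses the field names `"repositories"`, `"workspace_root"`, and
`"bindings"` (a sketch, not the precise serialisation), a canonically
renamed configuration could look as follows; the second tree identifier
is left as a placeholder.

``` jsonc
{ "repositories":
  { "0":
    { "workspace_root":
      ["git tree", "6a1820e78f61aee6b8f3677f150f4559b6ba77a4"]
    , "bindings": {"base": "1"}
    }
  , "1":
    {"workspace_root": ["git tree", "<tree id of the base repository>"]}
  }
}
```

Note that the paths to the repositories containing the trees are
dropped, as discussed above; only the tree identifiers remain, and the
entry point is recognisable as `"0"`.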
These identifications and the replacement of global names do not change
the semantics, as our name data types are completely opaque to our
expression language. In the `"json_encode"` expression, they are
serialized as `null`, and a string representation is only generated in
user messages not available to the language itself. Moreover, names
cannot be compared for equality either, so their only observable
properties, i.e., the way `"DEP_ARTIFACTS"`, `"DEP_RUNFILES"`, and
`"DEP_PROVIDES"` react to them, are invariant under repository
bisimulation.

Configuration and the `"export"` rule
-------------------------------------

Targets not only depend on the content of their repository, but also on
their configurations. Normally, the effective part of a configuration is
only determined after analysing the target. However, for caching, we
need to compute the cache key directly. This property is provided by the
built-in `"export"` rule; only `"export"` targets residing in
content-fixed repositories will be cached. This also serves as an
indication of which targets of a repository are intended for consumption
by other repositories.

An `"export"` rule takes precisely the following arguments.

 - `"target"` specifying a single target, the target to be cached. It
   must not be tainted.
 - `"flexible_config"` a list of strings; those specify the variables
   of the configuration that are considered. All other parts of the
   configuration are ignored. So the effective configuration for the
   `"export"` target is the configuration restricted to those variables
   (filled up with `null` if the variable was not present in the
   original configuration).
 - `"fixed_config"` a dict of arbitrary JSON values (taken
   unevaluated) with keys disjoint from the `"flexible_config"`.

An `"export"` target is analyzed as follows. The configuration is
restricted to the variables specified in the `"flexible_config"`; this
will result in the effective configuration for the exported target. It
is a requirement that the effective configuration contain only pure JSON
values. The (necessarily conflict-free) union with the `"fixed_config"`
is computed and the `"target"` is evaluated in this configuration. The
result (artifacts, runfiles, provided information) is the result of that
evaluation. It is a requirement that the provided information contain
only pure JSON values and artifacts (including tree artifacts); in
particular, it may not contain names.
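A hypothetical export target (all names and variables made up) could
hence be written as follows; only `ARCH` and `DEBUG` are taken from the
ambient configuration, while `USE_PIC` is set unconditionally.

``` jsonc
{ "mylib":
  { "type": "export"
  , "target": "mylib impl"
  , "flexible_config": ["ARCH", "DEBUG"]
  , "fixed_config": {"USE_PIC": true}
  }
}
```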
This is +possible, as the configuration is pruned a priori instead of the usual +procedure to analyse and afterwards determine the parts of the +configuration that were relevant. + +Cached value +------------ + +The value to be cached is the result of evaluating the target, that is, +its artifacts, runfiles, and provided data. All artifacts inside those +data structures will be described as known artifacts. + +As serialisation, we will essentially use our usual JSON encoding; while +this can be used as is for artifacts and runfiles where we know that +they have to be a map from strings to artifacts, additional information +will be added for the provided data. The provided data can contain +artifacts, but also legitimately pure JSON values that coincide with our +JSON encoding of artifacts; the same holds true for nodes and result +values. Moreover, the tree unfolding implicit in the JSON serialisation +can be exponentially larger than the value. + +Therefore, in our serialisation, we add an entry for every subexpression +and separately add a list of which subexpressions are artifacts, nodes, +or results. During deserialisation, we use this subexpression structure +to deserialize every subexpression only one. + +Sharding of target cache +------------------------ + +In our target description, the execution environment is not included. +For local execution, it is implicit anyway. As we also want to cache +high-level targets when using remote execution, we shard the target +cache (e.g., by using appropriate subdirectories) by the blob identifier +of the serialisation of the description of the execution backend. Here, +`null` stands for local execution, and for remote execution we use an +object with keys `"remote_execution_address"` and +`"remote_execution_properties"` filled in the obvious way. As usual, we +add the serialisation to the CAS. + +`"export"` targets, strictness and the extensional projection +------------------------------------------------------------- + +As opposed to the target that is exported, the corresponding export +target, if part of a content-fixed repository, will be strict: a build +depending on such a target can only succeed if all artifacts in the +result of target (regardless whether direct artifacts, runfiles, or as +part of the provided data) can be built, even if not all (or even none) +are actually used in the build. + +Upon cache hit, the artifacts of an export target are the known +artifacts corresponding to the artifacts of the exported target. While +extensionally equal, known artifacts are defined differently, so an +export target and the exported target are intensionally different (and +that difference might only be visible on the second build). As +intensional equality is used when testing for absence of conflicts in +staging, a target and its exported version almost always conflict and +hence should not be used together. One way to achieve this is to always +use the export target for any target that is exported. This fits well +together with the recommendation of only depending on export targets of +other repositories. + +If a target forwards artifacts of an exported target (indirect header +files, indirect link dependencies, etc), and is exported again, no +additional conflicts occur; replacing by the corresponding known +artifact is a projection: the known artifact corresponding to a known +artifact is the artifact itself. 
It should be noted that the cache key can be computed
*without* analyzing the target referred to. This is
possible, as the configuration is pruned a priori, instead of following
the usual procedure of analysing the target and only afterwards
determining the parts of the configuration that were relevant.

Cached value
------------

The value to be cached is the result of evaluating the target, that is,
its artifacts, runfiles, and provided data. All artifacts inside those
data structures will be described as known artifacts.

As serialisation, we will essentially use our usual JSON encoding; while
this can be used as is for artifacts and runfiles, where we know that
they have to be a map from strings to artifacts, additional information
will be added for the provided data. The provided data can contain
artifacts, but also legitimately pure JSON values that coincide with our
JSON encoding of artifacts; the same holds true for nodes and result
values. Moreover, the tree unfolding implicit in the JSON serialisation
can be exponentially larger than the value.

Therefore, in our serialisation, we add an entry for every subexpression
and separately add a list of which subexpressions are artifacts, nodes,
or results. During deserialisation, we use this subexpression structure
to deserialize every subexpression only once.

Sharding of target cache
------------------------

In our target description, the execution environment is not included.
For local execution, it is implicit anyway. As we also want to cache
high-level targets when using remote execution, we shard the target
cache (e.g., by using appropriate subdirectories) by the blob identifier
of the serialisation of the description of the execution backend. Here,
`null` stands for local execution, and for remote execution we use an
object with keys `"remote_execution_address"` and
`"remote_execution_properties"` filled in the obvious way. As usual, we
add the serialisation to the CAS.

`"export"` targets, strictness and the extensional projection
--------------------------------------------------------------

As opposed to the target that is exported, the corresponding export
target, if part of a content-fixed repository, will be strict: a build
depending on such a target can only succeed if all artifacts in the
result of the target (regardless of whether they are direct artifacts,
runfiles, or part of the provided data) can be built, even if not all
(or even none) are actually used in the build.

Upon cache hit, the artifacts of an export target are the known
artifacts corresponding to the artifacts of the exported target. While
extensionally equal, known artifacts are defined differently, so an
export target and the exported target are intensionally different (and
that difference might only become visible on the second build). As
intensional equality is used when testing for absence of conflicts in
staging, a target and its exported version almost always conflict and
hence should not be used together. One way to achieve this is to always
use the export target for any target that is exported. This fits well
together with the recommendation of only depending on export targets of
other repositories.

If a target forwards artifacts of an exported target (indirect header
files, indirect link dependencies, etc.), and is exported again, no
additional conflicts occur; replacing by the corresponding known
artifact is a projection: the known artifact corresponding to a known
artifact is the artifact itself. Moreover, by the strictness property
described earlier, if an export target has a cache hit, then so have all
export targets it depends upon. Keep in mind that a repository can only
be content-fixed if all its dependencies are.

For this strictness-based approach to work, it is, however, a
requirement that any artifact that is exported (typically indirectly,
e.g., as part of a common dependency) by several targets is only used
through the same export target. For a well-structured repository, this
should be a natural property anyway.

The forwarding of artifacts is the reason we chose that, in the
non-cached analysis of an export target, the artifacts are passed on as
received and are not wrapped in an "add to cache" action. The latter
choice would violate the projection property we rely upon.

diff --git a/doc/concepts/target-cache.org b/doc/concepts/target-cache.org
deleted file mode 100644
index 591a66af..00000000
--- a/doc/concepts/target-cache.org
+++ /dev/null