Add basic documentation for anonymous targets

author: Klaus Aehlig <klaus.aehlig@huawei.com> 2022-03-22 12:41:30 +0100
committer: Klaus Aehlig <klaus.aehlig@huawei.com> 2022-03-22 12:43:35 +0100
commit: a9f66b1355fa45cff1ce3501f1c86f7fcca84092 (patch)
tree: c6fc27d8c9b85bcf1cffb9f7027ca48a030547ae /doc/concepts/anonymous-targets.org
parent: 072b2946ab808c630381502381c2d36466e053f4 (diff)
download: justbuild-a9f66b1355fa45cff1ce3501f1c86f7fcca84092.tar.gz
1 files changed, 335 insertions, 0 deletions
diff --git a/doc/concepts/anonymous-targets.org b/doc/concepts/anonymous-targets.org
new file mode 100644
index 00000000..0bbfb658
--- /dev/null
+++ b/doc/concepts/anonymous-targets.org
@@ -0,0 +1,335 @@
+* Anonymous targets
+** Motivation
+
+Using [[https://github.com/protocolbuffers/protobuf][Protocol
+buffers]] allows to specify, in a language-independent way, a wire
+format for structured data. This is done by using description files
+from which APIs for various languages can be generated. As protocol
+buffers can contain other protocol buffers, the description files
+themselves have a dependency structure.
+
+From a software-engeneering point of view, the challenge is to
+ensure that the author of the description files does not have to
+be aware of the languages for which APIs will be generated later.
+In fact, the main benefit of the language-independent description
+is that clients in various languages can be implemented using the
+same wire protocol (and thus capable of communicating with the
+same server).
+
+For a build system that means that we have to expect that language
+bindings at places far away from the protocol definition, and
+potentially several times. Such a duplication can also occur
+implicitly if two buffers, for which language bindings are generated
+both use a common buffer for which bindings are never requested
+explicitly. Still, we want to avoid duplicate work for common parts
+and we have to avoid conflicts with duplicate symbols and staging
+conflicts for the libraries for the common part.
+
+Our approach is that a "proto" target only provides the description
+files together with their dependency structure. From those, a
+consuming target generates "anonymous targets" as additional
+dependencies; as those targets will have an appropriate notion of
+equality, no duplicate work is done and hence, as a side effect,
+staging or symbol conflicts are avoided as well.
+
+** Prelimanary remark: action identifiers
+
+Actions are defined as Merkle-tree hash of the contents. As all
+components (input tree, list of output strings, command vector,
+environment, and cache pragma) are given by expressions, that can
+quickly be computed. This identifier also defines the notion of
+equality for actions, and hence action artifacts. Recall that eqality
+of artifacts is also (implicitly) used in our notion of disjoint
+map union (where the set of keys does not have to be disjoint, as
+long as the values for all duplicate keys are equal).
+
+When constructing the action graph for traversal, we can drop
+duplicates (i.e., actions with the same identifier, and hence the
+same description). For the serialization of the graph as part of
+the analyse command, we can afford the preperatory step to compute
+a map from action id to list of origins.
+
+** Equality
+
+*** Notions of equality
+
+In the context of builds, there are different concepts of equality
+to consider. We recall the definitions, as well as their use in
+our build tool.
+
+**** Locational equality ("Defined at the same place")
+
+Names (for targets and rules) are given by repository name, module
+name, and target name (inside the module); additionally, for target
+names, there's a bit specifying that we explicitly refer to a file.
+Names are equal if and only if the respective strings (and the file
+bit) are equal.
+
+For targets, we use locational equality, i.e., we consider targets
+equal precisely if their names are equal; targets defined at different
+places are considered different, even if they're defined in the
+same way. The reason we use notion of equality is that we have to
+refer to targets (and also check if we already have a pending task
+to analyse them) before we have fully explored them with all the
+targets referred to in their definition.
+
+**** Intensional equality ("Defined in the same way")
+
+In our expression language we handle definitions; in particular,
+we treat artifacts by their definition: a particular source file,
+the output of a particular action, etc. Hence we use intensional
+equality in our expression language; two objects are equal precisely
+if they are defined in the same way. This notion of equality is easy
+to determine without the need of reading a source file or running
+an action. We implement quick tests by keeping a Merkle-tree hash
+of all expression values.
+
+**** Extensional equality ("Defining the same object")
+
+For built artifacts, we use extensional equality, i.e., we consider
+two files equal, if they are bit-by-bit identical. Implementation-wise,
+we compare an appropriate cryptographic hash. Before running an
+action, we built its inputs. In particular (as inputs are considered
+extensionally) an action might cause a cache hit with an intensionally
+different one.
+
+**** Observable equality ("The defined objects behave in the same way")
+
+Finally, there is the notion of observable equality, i.e., the
+property that two binaries behaving the same way in all situations.
+As this notion is undecidable, it is never used directly by any
+build tool. However, it is often the motivation for a build in the
+first place: we want a binary that behaves in a particular way.
+
+*** Relation between these notions
+
+The notions of equality were introduced in order from most fine grained
+to most coarse. Targets defined at the same place are obviously defined
+in the same way. Intensionally equal artifacts create equal action
+graphs; here we can confidently say "equal" and not only isomorphic:
+due to our preliminary clean up, even the node names are equal.
+Making sure that equal actions produce bit-by-bit equal outputs
+is the realm of [[https://reproducible-builds.org/][reproducibe
+builds]]. The tool can support this by appropriate sandboxing,
+etc, but the rules still have to define actions that don't pick
+up non-input information like the current time, user id, readdir
+order, etc. Files that are bit-by-bit identical will behave in
+the same way.
+
+*** Example
+
+Consider the following target file.
+
+#+BEGIN_SRC
+{ "foo":
+  { "type": "generic"
+  , "outs": ["out.txt"]
+  , "cmds": ["echo Hello World > out.txt"]
+  }
+, "bar":
+  { "type": "generic"
+  , "outs": ["out.txt"]
+  , "cmds": ["echo Hello World > out.txt"]
+  }
+, "baz":
+  { "type": "generic"
+  , "outs": ["out.txt"]
+  , "cmds": ["echo -n Hello > out.txt && echo ' World' >> out.txt"]
+  }
+, "foo upper":
+  { "type": "generic"
+  , "deps": ["foo"]
+  , "outs": ["upper.txt"]
+  , "cmds": ["cat out.txt | tr a-z A-Z > upper.txt"]
+  }
+, "bar upper":
+  { "type": "generic"
+  , "deps": ["bar"]
+  , "outs": ["upper.txt"]
+  , "cmds": ["cat out.txt | tr a-z A-Z > upper.txt"]
+  }
+, "baz upper":
+  { "type": "generic"
+  , "deps": ["baz"]
+  , "outs": ["upper.txt"]
+  , "cmds": ["cat out.txt | tr a-z A-Z > upper.txt"]
+  }
+, "ALL":
+  { "type": "install"
+  , "files":
+    {"foo.txt": "foo upper", "bar.txt": "bar upper", "baz.txt": "baz upper"}
+  }
+}
+#+END_SRC
+
+Assume we build the target "ALL". Then we will analyse 7 targets, all
+the locationally different ones ("foo", "bar", "baz", "foo upper", "bar
+upper", "baz upper"). For the targets "foo" and "bar", we immediately
+see that the definition is equal; their intensional equality also
+renders "foo upper" and "bar upper" intensionally equal. Our action
+graph will contain 4 actions: one with origins ["foo", "bar"], one
+with origins ["baz"], one with origins ["foo upper", "bar upper"],
+and one with origins ["baz upper"]. The "install" target will, of
+course, not create any actions. Building sequentially ("-J 1"), we
+will get one cache hit. Even though the artifacts of "foo"/"bar"
+and "baz" are defined differently, they are extensionally equal;
+both define a file with contents "Hello World\n".
+
+** Anonymous targets
+
+Besides named targets we also have additional targets (and hence also
+configured targets) that are not associated with a location they are
+defined at. Due to the absence of definition location, their notion
+of equality will take care of the necessary deduplication (implicitly,
+by the way our dependency exploration works). We will call them
+"anonymous targets", even though, technically, they're not fully
+anonymous as the rules that are part of their structure will be
+given by name, i.e., defining rule location.
+
+*** Value type: target graph node
+
+In order to allow targets to adequately describe a dependency
+structure, we have a value type in our expression language, that
+of a (target) graph node. As with all value types, equality is
+intensional, i.e., nodes defined in the same way are equal even
+if defined at different places. This can be achieved by our usual
+approach for expressions of having cached Merkle-tree hashes and
+comparing them when an equality test is required. This efficient
+test for equality also allows using graph nodes as part of a map
+key, e.g., for our asynchronous map conumers.
+
+As a graph node can only be defined with all data given, the defined
+dependency structure is cycle-free by construction. However, the
+tree unfolding will usually be exponentially larger. For internal
+handling, this is not a problem: our shared-pointer implementation
+can efficently represent a directed acyclic graph and since we
+cache hashes in expressions, we can compute the overall hash without
+folding the structure to a tree. When presenting nodes to the user,
+we only show the map of identifier to definition, to avoid that
+exponential unfolding.
+
+We have two kinds of nodes.
+
+**** Value nodes
+
+These represent a target that, in any configuration, returns a fixed
+value. Source files would typically be represented this way. The
+constructor function ~"VALUE_NODE"~ takes a single argument ~"$1"~
+that has to be a result value.
+
+**** Abstract nodes
+
+These represent internal nodes in the dag. Their constructor
+~"ABSTRACT_NODE"~ takes the following arguments (all evaluated).
+- ~"node_type"~. An arbitrary string, not interpreted in any way, to
+  indicate the role that the node has in the dependency structure.
+  When we create an anonymous target from a node, this will serve
+  as the key into the rule mapping to be applied.
+- ~"string_fields"~. This has to be a map of strings.
+- ~"target_fields"~. These have to be a map of lists of graph nodes.
+Moreover, we require that the keys for maps provided as ~"string_fields"~
+and ~"target_fields"~ be disjoint.
+
+*** Graph nodes in ~export~ targets
+
+Graph nodes are completely free of names and hence are eligible
+for exporting. As with other values, in the cache the intensional
+definition of artifacts implict in them will be replaced by the
+corresponding, extensionally equal, known value.
+
+However, some care has to be taken in the serialisation that is
+part of the caching, as we do not want to unfold the the dag to
+a tree. Therefore, we take as JSON serialisation a simple dict
+with with "type" set to "NODE", and "value" set to the Merkle-tree
+hash. That serialisation respects intensional equality. To allow
+deserialisation, we add an additional map to the serialisation from
+node hash to its definition.
+
+*** Dependings on anonymous targets
+
+**** Parts of an anonymous target
+
+An anonymous target is given by a pair of a node and a map mapping
+the abstract node-type specifying strings to rule names. So, in
+the implementation these are just two expression pointers (with
+their defined notion of equality, i.e., equality of the respective
+Merkle-tree hashes). Such a pair of pointers also forms an additional
+variant of a name value, refering to such an anonymous target.
+
+It should be noted that such an anonymous target contains all the
+information needed to evaluate it in the same way as a regular (named)
+target defined by a user-defined rule. It is an analysis error
+analysing an anonymous target where there is no entry in the rules
+map for the string given as "node_type" for the corresponding node.
+
+**** Anonymous targets as additional dependencies
+
+We keep the property that a user can only request named targets.
+So anonymous targets have to be requested by other targets. We
+also keep the property that other targets are only requested at
+certain fixed steps in the evaluation of a target. To still achieve
+a meaningful use of anonymous targets our rule language hanldes
+anonymous targets in the following way.
+
+***** Rules paramaeter "anonymous"
+
+In the rule definition a paramter "anonymous" (with empty map as
+default) is allowed. It is used to define an additional dependency on
+anonymous targets. The value has to be a map with keys the additional
+implicitly defined field names. It is hence a requirement that the
+set of keys be disjoint from all other field names (the values of
+"config_fields", "string_fields", and "target_fields", as well as
+the keys of the "implict" parameter). Another consequence is that
+"config_transitions" map may now also have meaningful entries for
+the keys of the "anonymous" map. Each value in the map has to be
+itself a map, with entries "target", "provider", and "rule_map".
+
+For "target", a single string has to be specifed, and the value has
+to be a member of the "target_fields" list. For provider, a single
+string has to be specified as well. The idea is that the nodes are
+collected from that provider of the targets in the specified target
+field. For "rule_map" a map has to be specified from strings to
+rule names; the latter are evaluated in the context of the rule
+definition.
+
+****** Example
+
+For generating language bindings for protocol buffers, a rule might
+look as follows.
+
+#+BEGIN_SRC
+{ "cc_proto_bindings":
+  { "target_fields": ["proto_deps"]
+  , "anonymous":
+    { "protos":
+      { "target": "proto_deps"
+      , "provider": "proto"
+      , "rule_map": {"proto_library": "cc_proto_library"}
+      }
+    }
+  , "expression": {...}
+  }
+}
+#+END_SRC
+
+***** Evaluation mechanism
+
+The evaluation of a target defined by a user-defined rule is handled
+as follows. After the target fields are evaluated as usual, an
+additional step is carried out.
+
+For each anymous-target field, i.e., for each key in the "anonymous"
+map, a list of anymous targets is generated from the corresponding
+value: take all targets from the specified "target" field in all
+their specified configuration transitions (they have already been
+evaluated) and take the values provided for the specifed "provider"
+key (using the empty list as default). That value has to be a list
+of nodes. All the node lists obtained that way are concatenated.
+The configuration transition for the respective field name is
+evaluated. Those targets are then evaluated for all the transitioned
+configurations requested.
+
+In the final evaluation of the defining expression, the anonymous-target
+fields are available in the same way as any other target field.
+Also, they contribute to the effective configuration in the same
+way as regular target fields.
author	Klaus Aehlig <klaus.aehlig@huawei.com>	2022-03-22 12:41:30 +0100
committer	Klaus Aehlig <klaus.aehlig@huawei.com>	2022-03-22 12:43:35 +0100
commit	a9f66b1355fa45cff1ce3501f1c86f7fcca84092 (patch)
tree	c6fc27d8c9b85bcf1cffb9f7027ca48a030547ae /doc/concepts/anonymous-targets.org
parent	072b2946ab808c630381502381c2d36466e053f4 (diff)
download	justbuild-a9f66b1355fa45cff1ce3501f1c86f7fcca84092.tar.gz