Background
Every package built with linuxkit pkg build has an automatically generated digest as a tag, which can also be shown via lkt pkg show-tag. This digest is used to determine whether a package's specific state is available in the local cache or on a registry, or needs a new build.
The tag can be overridden on user request.
The generated tag has limits, does not handle all use cases, and can be misleading.
What should the tag be replaced with?
Current state
When linuxkit builds a package with linuxkit pkg build, it automatically generates a tag that is a hash. By default, that tag is the tree hash that git ls-tree reports for the package directory. In the example below it is a 40-hex-character git object ID, not a sha256 digest.
For example, for pkg/init on the most recent commit to master as of this writing:
$ lkt pkg show-tag pkg/init
linuxkit/init:680da6e6f79bb8236a095147d532cd2160e23c9f
$ git ls-tree --full-tree HEAD -- ./pkg/init
040000 tree 680da6e6f79bb8236a095147d532cd2160e23c9f pkg/init
In addition, if there are changes that are not committed - i.e. the git tree for that directory is dirty - it adds the word dirty and the digest of the file contents:
$ touch pkg/init/foo
$ lkt pkg show-tag pkg/init
linuxkit/init:680da6e6f79bb8236a095147d532cd2160e23c9f-dirty-35f1311
The appended digest is given by listing all of the files in the tree and then sha256 digesting them.
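To make the shape of the current tag concrete, here is a minimal sketch that rebuilds the two tags shown above from their parts. It is not linuxkit's implementation. The function names are hypothetical, and the 7-character truncation of the dirty digest is an assumption inferred only from the example output.

```python
from typing import Optional


def tree_hash_from_ls_tree(line: str) -> str:
    """Extract the object hash from one line of `git ls-tree` output."""
    # Line format: "<mode> <type> <hash><TAB><path>"; split() handles tabs and spaces.
    _mode, obj_type, obj_hash, _path = line.split(None, 3)
    if obj_type != "tree":
        raise ValueError(f"expected a tree entry, got {obj_type!r}")
    return obj_hash


def compose_tag(image: str, tree_hash: str, dirty_digest: Optional[str] = None) -> str:
    """Build '<image>:<tree hash>', appending '-dirty-<short digest>' when dirty."""
    tag = f"{image}:{tree_hash}"
    if dirty_digest:
        # Assumption: the suffix is the first 7 hex characters of the digest.
        tag += f"-dirty-{dirty_digest[:7]}"
    return tag
```

Note that nothing outside the directory, and nothing about build args, feeds into either part of the tag.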
What is missing
The generated tag does not capture the following inputs:
- build-arg-files that are outside the directory, e.g. lkt pkg build --build-arg-file /tmp/foo ./pkg/init.
- files in the directory that are not git committed. Arguably this case can be ignored, as someone is choosing explicitly to avoid git.
- contents that are not determinable, e.g. ADD https://example.com/foo. There is no way to know that the contents of that URL have changed.
- dynamically generated build args, e.g. those related to the platform, or the special REL_* linuxkit ones.
Purpose of the digest tag
The digest tag has two purposes.
First, and primarily, it serves as a way to check whether anything has changed such that a rebuild is needed. All of the things listed as "missing" above fit within that category: something has changed, yet lkt pkg build cannot detect it.
Second, it has some element of provenance: given an artifact (OCI image) can I get to the source? This assumes that, given an ls-tree output digest, you can find that exact state again, which is somewhat questionable. However, this part is secondary, because every lkt pkg build also adds a label to the image with the git commit and repository that generated the image. This should be enough for provenance; if not, it should be fixed here. This leaves just the first issue to be resolved.
@justincormack uses the terms "input hash" and "output hash" for these.
For the first purpose, a key goal is to be able to determine the tag of the input tree without rebuilding, or even calling buildkit. Based solely on the input, we should be able to:
- Determine the value of the tag/hash/identifier
- Use that identifier to determine if it exists in the cache or registry, and therefore if it needs to be rebuilt
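The two steps above amount to a lookup that runs before any build. A minimal sketch of that decision follows. The function name and the two lookup callables are hypothetical stand-ins for whatever cache and registry checks the tool performs; they are not linuxkit APIs.

```python
from typing import Callable


def resolve_source(
    tag: str,
    in_local_cache: Callable[[str], bool],
    in_registry: Callable[[str], bool],
) -> str:
    """Decide where the image for `tag` comes from: 'cache', 'registry', or 'build'."""
    # The tag is computed from the inputs alone, so these checks need no build.
    if in_local_cache(tag):
        return "cache"
    if in_registry(tag):
        return "registry"
    # Not found anywhere: only now is a rebuild needed.
    return "build"
```

This only works if the tag changes whenever any input changes, which is exactly what the "missing" cases above break.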
Some possible future avenues
Directory contents plus build args
One possibility is to include all of the file contents as well as all build args (generated and static, from files or CLI flags) in a single digest. This has nothing to do with git commands, or even with whether files are checked in; it is just the content. Change a build arg = rebuild an image; change a file = rebuild an image.
The whole dirty suffix just goes away (although we could keep it if it is helpful).
Like most digests, this one would be a one-way hash. There would be no way to go from hash-to-source, but the git commit label is for that; you always can go from source-to-hash.
It would benefit us to do a better job capturing the source info in labels, like the build args or the CLI flags used.
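As a rough illustration of the content-plus-build-args idea, the sketch below hashes every file under a directory together with a map of build args. It is one possible scheme under stated assumptions, not a proposal for the exact format: the function names, the length-prefixed framing, and the choice to ignore file modes and symlink targets are all assumptions made for the example. Build args read from files outside the directory would be merged into build_args by the caller.

```python
import hashlib
import os
from typing import Mapping


def _feed(h, *parts: bytes) -> None:
    """Feed length-prefixed parts so adjacent fields cannot be confused."""
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)


def input_digest(directory: str, build_args: Mapping[str, str]) -> str:
    """sha256 over file paths and contents plus build args, in sorted order."""
    h = hashlib.sha256()
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # in-place sort makes the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, directory)
            with open(path, "rb") as f:
                _feed(h, b"file", rel.encode(), f.read())
    for key in sorted(build_args):
        _feed(h, b"arg", key.encode(), build_args[key].encode())
    return h.hexdigest()
```

With a scheme like this, git state plays no role: an uncommitted file or a changed build arg changes the digest in the same way a commit would.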
buildkit
Another approach might be to figure out what buildkit does to determine if something needs to be rebuilt and adopt it, maybe as a library.
I would be hesitant to actually use buildkit, as we prefer to be able to determine tags and such purely via CLI.
Other approaches?