redesigning the pkg tag · Issue #4180 · linuxkit/linuxkit · GitHub

redesigning the pkg tag #4180

Description

@deitch

Background

Every package built with linuxkit pkg build has an automatically generated digest as a tag, which can also be shown via lkt pkg show-tag. This digest is used to determine whether a package's specific state is available in the local cache or on a registry, or whether it needs a new build.

The tag can be overridden on user request.

The generated tag has limits, does not handle all use cases, and can be misleading.

What should the tag be replaced with?

Current state

When linuxkit builds a package with linuxkit pkg build, it automatically generates a tag that is a hash. By default, that tag is the tree object ID that git ls-tree reports for the package directory. This is a git object hash (40 hex characters in the example below), not a sha256 digest.

For example, for pkg/init at the most recent commit to master as of this writing:

$ lkt pkg show-tag pkg/init
linuxkit/init:680da6e6f79bb8236a095147d532cd2160e23c9f

$ git ls-tree --full-tree HEAD -- ./pkg/init
040000 tree 680da6e6f79bb8236a095147d532cd2160e23c9f    pkg/init
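
As a rough illustration of where that tag value comes from, the sketch below shells out to git ls-tree and picks the object ID out of the output. It is not linuxkit's implementation; the function names parse_ls_tree_line and tree_hash are hypothetical, and the only assumption about git is the "mode type object path" line layout shown above.

```python
import subprocess


def parse_ls_tree_line(line: str) -> dict:
    """Split one `git ls-tree` output line into its four fields."""
    # Layout: "<mode> <type> <object><TAB><path>". Splitting at most three
    # times keeps any spaces inside the path intact.
    mode, obj_type, obj_id, path = line.rstrip("\n").split(None, 3)
    return {"mode": mode, "type": obj_type, "object": obj_id, "path": path}


def tree_hash(repo_root: str, pkg_dir: str) -> str:
    """Return the object ID git records for pkg_dir at HEAD (hypothetical helper)."""
    out = subprocess.run(
        ["git", "-C", repo_root, "ls-tree", "--full-tree", "HEAD", "--", pkg_dir],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    lines = out.splitlines()
    if not lines:
        raise ValueError(f"{pkg_dir} is not tracked at HEAD")
    return parse_ls_tree_line(lines[0])["object"]
```

Calling tree_hash(".", "pkg/init") from the repository root would return the same value that appears after the colon in the show-tag output, as long as the tree is clean.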

In addition, if there are uncommitted changes, i.e. the git tree for that directory is dirty, it appends the word dirty and a digest of the file contents:

$ touch pkg/init/foo
$ lkt pkg show-tag pkg/init
linuxkit/init:680da6e6f79bb8236a095147d532cd2160e23c9f-dirty-35f1311

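The appended digest is computed by listing all of the files in the tree and taking a sha256 digest of their contents. Only a short prefix of that digest appears in the tag, as the seven hex characters in the example above show.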
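
To make the shape of the dirty tag concrete, here is a minimal sketch. It assumes the digest is a sha256 over the file contents concatenated in sorted path order and that the tag keeps the first seven hex characters; both details are guesses from the example output, not taken from the linuxkit source. The names files_digest and dirty_tag are hypothetical.

```python
import hashlib
from pathlib import Path


def files_digest(root: str, rel_paths: list[str]) -> str:
    """sha256 over the contents of the listed files, in sorted path order."""
    h = hashlib.sha256()
    for rel in sorted(rel_paths):
        h.update(Path(root, rel).read_bytes())
    return h.hexdigest()


def dirty_tag(image: str, tree_hash: str, digest: str, short: int = 7) -> str:
    """Compose '<image>:<tree hash>-dirty-<digest prefix>'."""
    # The prefix length of 7 is an assumption based on the example tag.
    return f"{image}:{tree_hash}-dirty-{digest[:short]}"
```

Note that a digest built this way covers file contents only, so renaming a file without changing its bytes would not change it.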

What is missing

The generated tag does not account for the following:

  • build-arg-files that are outside the directory, e.g. lkt pkg build --build-arg-file /tmp/foo ./pkg/init.
  • files in the directory that are not git committed. Arguably this case can be ignored, as someone is choosing explicitly to avoid git.
  • contents that are not determinable, e.g. ADD https://example.com/foo. There is no way to know that the contents of that URL have changed.
  • dynamically generated build args, e.g. those related to the platform, or the special REL_* linuxkit ones.

Purpose of the digest tag

The digest tag has two purposes.

First, and primarily, it serves as a way to check whether anything has changed such that a rebuild is needed. All of the things listed as "missing" above fit within that category: something has changed, yet lkt pkg build cannot detect it.

Second, it has some element of provenance: given an artifact (OCI image) can I get to the source? This assumes that, given an ls-tree output digest, you can find that exact state again, which is somewhat questionable. However, this part is secondary, because every lkt pkg build also adds a label to the image with the git commit and repository that generated the image. This should be enough for provenance; if not, it should be fixed here. This leaves just the first issue to be resolved.

@justincormack uses the terms "input hash" and "output hash" for these.

In terms of the first, a key goal is to be able to determine the tag of the input tree without rebuilding or even calling buildkit to rebuild it. Solely based on the input, we should be able to:

  1. Determine the value of the tag/hash/identifier
  2. Use that identifier to determine if it exists in the cache or registry, and therefore if it needs to be rebuilt
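
The two steps above can be sketched as a pure lookup that never touches the builder. This is only an illustration: the plan function and Action enum are hypothetical, and plain sets of tags stand in for whatever cache and registry queries a real implementation would make.

```python
from enum import Enum


class Action(Enum):
    USE_CACHE = "use local cache"
    PULL = "pull from registry"
    BUILD = "build"


def plan(tag: str, local_cache: set[str], registry: set[str]) -> Action:
    """Decide what to do for an input-derived tag, without invoking a build."""
    if tag in local_cache:
        return Action.USE_CACHE
    if tag in registry:
        return Action.PULL
    # Neither location has this exact input state, so it must be built.
    return Action.BUILD
```

The point of the design is that the tag is computed from inputs alone, so this decision is cheap and can run before buildkit is ever contacted.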

Some possible future avenues

Directory contents plus build args

One possibility is to include all of the file contents as well as build args, generated and static, files or CLI flags, into a single digest. This has nothing to do with git commands or even if files are checked in, it is just the content. Change a build arg = rebuild an image; change a file = rebuild an image.
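
A minimal sketch of what such a digest could look like follows. It is one possible encoding, not a proposal for the exact format: the input_digest name, the "file:" and "arg:" labels, and the length-prefixed framing are all assumptions made for illustration. Build args are passed as an already-merged mapping, so values from files, CLI flags, and generated args are treated alike.

```python
import hashlib
import os


def _feed(h, label: bytes, data: bytes) -> None:
    # Length-prefix each part so that different (label, data) pairs can
    # never produce the same byte stream.
    for part in (label, data):
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)


def input_digest(pkg_dir: str, build_args: dict[str, str]) -> str:
    """sha256 over every file under pkg_dir plus all build args."""
    h = hashlib.sha256()
    files = []
    for dirpath, _dirnames, filenames in os.walk(pkg_dir):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, pkg_dir).replace(os.sep, "/")
            files.append((rel, full))
    # Sort so the result does not depend on directory listing order.
    for rel, full in sorted(files):
        with open(full, "rb") as f:
            _feed(h, b"file:" + rel.encode(), f.read())
    for key in sorted(build_args):
        _feed(h, b"arg:" + key.encode(), build_args[key].encode())
    return h.hexdigest()
```

This sketch ignores file modes and symlink targets, and it still cannot see remote content such as an ADD of a URL; those would need separate handling.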

The whole dirty concept just goes away (although we could keep it if it is helpful).

Like most digests, this would be a one-way hash. There would be no way to go from hash to source, but the git commit label is for that; you can always go from source to hash.

It would benefit us to do a better job capturing the source info in labels, like the build args or the CLI flags used.

buildkit

Another approach might be to figure out what buildkit does to determine if something needs to be rebuilt and adopt it, maybe as a library.

I would be hesitant to actually use buildkit, as we prefer to be able to determine tags and such purely via CLI.

Other approaches?
