
cache probe failure on coder/dogfood image #43


Closed
johnstcn opened this issue Sep 2, 2024 · 6 comments
Labels: bug (Something isn't working)

@johnstcn
Member

johnstcn commented Sep 2, 2024

Dockerfile: https://github.com/coder/coder/blob/main/dogfood/contents/Dockerfile
devcontainer.json: https://github.com/coder/coder/blob/main/dogfood/contents/devcontainer.json

Error:

get cached image: error probing build cache: failed to optimize instructions: failed to get files used from context: copy failed: no source files specified

Reproducible with envbuilder provider v0.0.5

@coder-labeler coder-labeler bot added bug Something isn't working help wanted Extra attention is needed labels Sep 2, 2024
@johnstcn johnstcn removed the help wanted Extra attention is needed label Sep 2, 2024
@mafredri
Member

mafredri commented Sep 2, 2024

Turns out this is because we're using resource parameters directly here:

opts := eboptions.Options{
	// These options are always required
	CacheRepo:       data.CacheRepo.ValueString(),
	Filesystem:      osfs.New("/"),
	ForceSafe:       false, // This should never be set to true, as this may be running outside of a container!
	GetCachedImage:  true,  // always!
	Logger:          tfLogFunc(ctx),
	Verbose:         data.Verbose.ValueBool(),
	WorkspaceFolder: workspaceFolder,

	// Options related to compiling the devcontainer
	BuildContextPath:     data.BuildContextPath.ValueString(),
	DevcontainerDir:      data.DevcontainerDir.ValueString(),
	DevcontainerJSONPath: data.DevcontainerJSONPath.ValueString(),
	DockerfilePath:       data.DockerfilePath.ValueString(),
	DockerConfigBase64:   data.DockerConfigBase64.ValueString(),
	FallbackImage:        data.FallbackImage.ValueString(),

	// These options are required for cloning the Git repo
	CacheTTLDays:         data.CacheTTLDays.ValueInt64(),
	GitURL:               data.GitURL.ValueString(),
	GitCloneDepth:        data.GitCloneDepth.ValueInt64(),
	GitCloneSingleBranch: data.GitCloneSingleBranch.ValueBool(),
	GitUsername:          data.GitUsername.ValueString(),
	GitPassword:          data.GitPassword.ValueString(),
	GitSSHPrivateKeyPath: data.GitSSHPrivateKeyPath.ValueString(),
	GitHTTPProxyURL:      data.GitHTTPProxyURL.ValueString(),
	RemoteRepoBuildMode:  data.RemoteRepoBuildMode.ValueBool(),
	RemoteRepoDir:        filepath.Join(tmpDir, "repo"),
	SSLCertBase64:        data.SSLCertBase64.ValueString(),

	// Other options
	BaseImageCacheDir:  data.BaseImageCacheDir.ValueString(),
	BinaryPath:         envbuilderPath,                       // needed to reproduce the final layer.
	ExitOnBuildFailure: data.ExitOnBuildFailure.ValueBool(),  // may wish to do this instead of fallback image?
	Insecure:           data.Insecure.ValueBool(),            // might have internal CAs?
	IgnorePaths:        tfListToStringSlice(data.IgnorePaths), // may need to be specified?

	// The below options are not relevant and are set to their zero value explicitly.
	// They must be set by extra_env.
	CoderAgentSubsystem: nil,
	CoderAgentToken:     "",
	CoderAgentURL:       "",
	ExportEnvFile:       "",
	InitArgs:            "",
	InitCommand:         "",
	InitScript:          "",
	LayerCacheDir:       "",
	PostStartScriptPath: "",
	PushImage:           false, // This is only relevant when building.
	SetupScript:         "",
	SkipRebuild:         false,
}

However, another way to feed these options is via extra_env. We're not considering those when building the envbuilder options within the provider.
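To illustrate the gap, here is a minimal sketch of overlaying extra_env values onto options that were built from resource parameters. The probeOptions type and applyExtraEnv function are hypothetical stand-ins, not the provider's code. The ENVBUILDER_* variable names follow envbuilder's naming convention but are assumptions here, as is the choice to let extra_env take precedence.

```go
package main

import (
	"fmt"
	"strconv"
)

// probeOptions is a hypothetical, trimmed-down stand-in for eboptions.Options.
type probeOptions struct {
	DockerfilePath   string
	BuildContextPath string
	Verbose          bool
}

// applyExtraEnv overlays values found in extraEnv onto opts and returns the result.
// Assumption: a value present in extraEnv wins over the resource parameter.
func applyExtraEnv(opts probeOptions, extraEnv map[string]string) (probeOptions, error) {
	if v, ok := extraEnv["ENVBUILDER_DOCKERFILE_PATH"]; ok {
		opts.DockerfilePath = v
	}
	if v, ok := extraEnv["ENVBUILDER_BUILD_CONTEXT_PATH"]; ok {
		opts.BuildContextPath = v
	}
	if v, ok := extraEnv["ENVBUILDER_VERBOSE"]; ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return opts, fmt.Errorf("parse ENVBUILDER_VERBOSE: %w", err)
		}
		opts.Verbose = b
	}
	return opts, nil
}

func main() {
	// Options as built from resource parameters only.
	base := probeOptions{DockerfilePath: "Dockerfile"}
	merged, err := applyExtraEnv(base, map[string]string{
		"ENVBUILDER_DOCKERFILE_PATH": "dogfood/contents/Dockerfile",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(merged.DockerfilePath)
}
```

Without a step like this, a template that sets the Dockerfile path only through extra_env would be probed with an empty or default path, which is consistent with the "no source files specified" error above.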

johnstcn added a commit that referenced this issue Sep 4, 2024
Relates to #43

Our previous logic did not pass options from extra_env to envbuilder.RunCacheProbe.
This fixes the logic and adds more comprehensive tests around the overriding logic.
Future commits will refactor this logic some more.
@johnstcn
Member Author

johnstcn commented Sep 5, 2024

I think I've found a minimal Dockerfile example that succeeds with envbuilder but fails with the provider:

FROM localhost:5000/envbuilder-test-alpine:latest AS a
RUN date > /date.txt
FROM localhost:5000/envbuilder-test-alpine:latest
COPY --from=a /date.txt /date.txt

When added as an integration test case, the error is:

2024-09-05T14:19:28.294+0100 [WARN]  sdk.proto: Response contains warning diagnostic:
  diagnostic_severity=WARNING diagnostic_summary="Cached image not found." 
  tf_req_id=b783cbc4-0d7f-d6d4-dd35-548e8ace47a8 
  tf_provider_addr=registry.terraform.io/hashicorp/envbuilder tf_proto_version=6.6 
  tf_rpc=ApplyResourceChange tf_resource_type=envbuilder_cached_image 
  diagnostic_detail="Failed to find cached image in repository \"localhost:37083/test\". It 
  will be rebuilt in the next apply. Error: get cached image: error probing build cache: 
  failed to optimize instructions: failed to get files used from context: failed to get 
  fileinfo for /tmp/envbuilder-provider-cached-image-data-
  source1005954580/.envbuilder/0/date.txt: lstat /tmp/envbuilder-provider-cached-image-
  data-source1005954580/.envbuilder/0/date.txt: no such file or directory"

@johnstcn
Member Author

johnstcn commented Sep 5, 2024

In the terraform provider we override envbuilder's constants.MagicDir before starting a cache probe. This has a number of drawbacks, including:

  • If the version of kaniko imported by the provider is out of sync with that of envbuilder, this will no longer work,
  • There are other 'magic' constants that are not related to MagicDir that also need to be updated.

A separate fix to envbuilder will be necessary to allow overriding these properly at runtime, preferably without needing to import kaniko directly.
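One possible shape for such a fix, sketched under assumptions: instead of mutating a package-level constant, the working directory is carried as a value and every derived path is computed from it. The workDir type and its methods are hypothetical and are not envbuilder's API. The "/.envbuilder" default is assumed from the paths in the logs above.

```go
package main

import (
	"fmt"
	"path"
)

// defaultBase is the assumed default location when no override is given.
const defaultBase = "/.envbuilder"

// workDir is a hypothetical value type that replaces a global "magic dir" constant.
// The zero value resolves to defaultBase.
type workDir struct {
	base string
}

// String returns the configured base directory, or the default if unset.
func (w workDir) String() string {
	if w.base == "" {
		return defaultBase
	}
	return w.base
}

// Join builds a path below the working directory.
func (w workDir) Join(elem ...string) string {
	return path.Join(append([]string{w.String()}, elem...)...)
}

func main() {
	// Inside the container: default location.
	fmt.Println(workDir{}.Join("repo"))
	// Inside the provider: a per-probe temporary directory (path is illustrative).
	fmt.Println(workDir{base: "/tmp/probe/.envbuilder"}.Join("repo"))
}
```

Because the directory travels with the options rather than living in a global, two cache probes in the same process could use different directories, and the provider would not need to reach into kaniko's constants.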

@johnstcn
Member Author

johnstcn commented Sep 10, 2024

Getting closer now, but the provider still gets a cache miss on a specific instruction, whereas running the envbuilder image with ENVBUILDER_GET_CACHED_IMAGE=1 results in a cache hit:

Provider:

2024-09-10T12:24:34.601+0100 [INFO]  provider.terraform-provider-envbuilder_v0.0.7: #2: No cached layer found for cmd COPY files /: @module=envbuilder tf_provider_addr=registry.terraform.io/coder/envbuilder tf_resource_type=envbuilder_cached_image tf_rpc=ApplyResourceChange @caller=github.com/coder/terraform-provider-envbuilder/internal/tfutil/tfutil.go:73 tf_req_id=f87f22e6-d3fc-df4f-ea2c-dcf79ff1ea99 timestamp="2024-09-10T12:24:34.601+0100"
2024-09-10T12:24:34.601+0100 [DEBUG] provider.terraform-provider-envbuilder_v0.0.7: #2: Key missing was: sha256:075680e983398fda61b1ac59ad733ad81d18df4bc46411666bb8a03fb9ea0195-SHELL ["/bin/bash", "-c"]-ARG DEBIAN_FRONTEND="noninteractive"-|4-DEBIAN_FRONTEND=noninteractive-GOPATH=/tmp/-GO_VERSION=1.22.5-PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin-RUN apt-get update && apt-get install --yes ca-certificates-COPY files /-2dc0bd7156c4b4750d28df5606c28421594902d08eef2ee746bcf350f9a79bf6: @caller=github.com/coder/terraform-provider-envbuilder/internal/tfutil/tfutil.go:73 tf_req_id=f87f22e6-d3fc-df4f-ea2c-dcf79ff1ea99 tf_resource_type=envbuilder_cached_image tf_rpc=ApplyResourceChange @module=envbuilder tf_provider_addr=registry.terraform.io/coder/envbuilder timestamp="2024-09-10T12:24:34.601+0100"

Envbuilder image:

#3: Checking for cached layer localhost:5000/cache:e99886072dd86a782b5332a697fdbd36abf362a2db38784cf1c3f338c83b1994...
#3: Using caching version of cmd: COPY files /
...
#3: COPY files /
#3: Found cached layer, extracting to filesystem

terraform-provider-envbuilder-43-docker_run.log.gz
terraform-provider-envbuilder-43-cache_probe.log.gz
terraform-provider-envbuilder-43-main.tf.gz

@johnstcn
Member Author

Ok, after adding some additional logging in Kaniko:

Provider:

2024-09-13T21:49:21.788+0100 [DEBUG] provider.terraform-provider-envbuilder: #3: CacheHasher: path=/tmp/envbuilder-provider-cached-image-data-source1858353700/.envbuilder/repo/dogfood/contents/files/etc/apt/sources.list.d/docker.list  mode=-rw-r--r-- uid=1000 gid=1000 reg len=101: @module=envbuilder tf_resource_type=envbuilder_cached_image tf_rpc=ApplyResourceChange @caller=/home/coder/src/coder/terraform-provider-envbuilder/internal/tfutil/tfutil.go:73 tf_provider_addr=registry.terraform.io/coder/envbuilder tf_req_id=a5e4b5b9-b042-12aa-5dff-8d66c795d2e8 timestamp="2024-09-13T21:49:21.788+0100"

Envbuilder:

#2: CacheHasher: path=/workspaces/coder/dogfood/contents/files/etc/apt/sources.list.d/docker.list  mode=-rw-r--r-- uid=0 gid=0 reg len=101

I hacked our ownership workaround for the injected envbuilder binary to apply to all files, and sure enough, the cache miss disappeared!

So it seems like cache probe, at minimum, needs to detect if it is not running as UID/GID 0 and conditionally adjust its behaviour to pretend that all local files are owned by root when calculating hashes.

Downside of this is that we could potentially end up pulling a cached layer with incorrect ownership information if the desired permissions are not 0:0. We might need some alternative way to invalidate the cache in this case. Or just run the provider as root 🙃
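A rough sketch of the idea, assuming the hash covers a file's context-relative path, mode, ownership, and size: when normalization is on, UID and GID are forced to 0 before hashing. The fileMeta type and hashMeta function are hypothetical and simplified; they are not the kaniko CacheHasher.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io/fs"
	"os"
)

// fileMeta is a hypothetical subset of the metadata that feeds the cache hash.
// Path is assumed to be relative to the build context.
type fileMeta struct {
	Path string
	Mode fs.FileMode
	UID  int
	GID  int
	Size int64
}

// hashMeta hashes the metadata. If asRoot is true, ownership is treated as 0:0
// so that a non-root probe computes the same hash as a root build.
func hashMeta(m fileMeta, asRoot bool) string {
	if asRoot {
		m.UID, m.GID = 0, 0
	}
	h := sha256.New()
	fmt.Fprintf(h, "%s|%s|%d|%d|%d", m.Path, m.Mode, m.UID, m.GID, m.Size)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Values taken from the two CacheHasher log lines above.
	onHost := fileMeta{Path: "files/etc/apt/sources.list.d/docker.list", Mode: 0o644, UID: 1000, GID: 1000, Size: 101}
	inContainer := onHost
	inContainer.UID, inContainer.GID = 0, 0

	// Detect whether the current process is running as root.
	normalize := os.Getuid() != 0
	fmt.Println("normalizing ownership:", normalize)
	fmt.Println("raw hashes match:", hashMeta(onHost, false) == hashMeta(inContainer, false))
	fmt.Println("normalized hashes match:", hashMeta(onHost, true) == hashMeta(inContainer, false))
}
```

The trade-off mentioned above is visible here: once ownership is normalized, two contexts that differ only in UID/GID become indistinguishable to the cache.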

johnstcn added a commit to coder/kaniko that referenced this issue Sep 25, 2024
Fix for caching when context file permissions change. This is irrelevant for a COPY operation, since the files are either copied as root:root or with a specific owner/group, depending on the COPY --chown= argument.

Relates to coder/terraform-provider-envbuilder#43

Co-authored-by: Cian Johnston <[email protected]>
@johnstcn
Member Author

Fixed by coder/kaniko#29

Verified with the template contained inside envbuilder-dogfood/.
See also: coder/coder#14796
