Terraform 'data source will be read during apply' messages - What is it and how to fix



Terraform users will likely be familiar with “data source will be read during apply” messages that may appear in the plan output. These messages can be confusing and may even lead to unexpected re-creation of resources. Typically, these messages are related to using data sources in combination with Terraform modules and explicit dependencies.

Data sources and modules are two powerful and essential concepts. Data sources allow you to fetch information about existing resources and pass that data to other resources. Modules promote reusability and hide complexity by encapsulating collections of resources into shareable, versioned packages. Explicit dependencies are also valid in various situations. But combining these concepts can lead to confusion and unexpected behavior.

Let’s walk through everything you need to know about these messages and how to get rid of them.

The basics

Let’s first do a quick dive through Terraform data sources and dependencies to get some terminology straight.

The following data source loads all metadata of an existing AWS VPC.

data "aws_vpc" "vpc" {
  # filter is a nested block, not an argument, so it takes no "="
  filter {
    name   = "tag:Name"
    values = ["main"]
  }
}

resource "aws_subnet" "subnet" {
  vpc_id = data.aws_vpc.vpc.id
  [...]
}

In this example we load the VPC tagged with the name “main” and pass the ID of that VPC to a subnet resource. This allows us to create a subnet within an existing VPC without knowing the ID of that VPC. This also creates an implicit dependency from the aws_subnet resource on the aws_vpc data source. The subnet will only be created after that data source has been read. This makes sense - we need the ID of the VPC to create the subnet.

In some cases you’ll want to create an explicit dependency on a resource using the depends_on meta-argument. This is useful when you want to make sure that a resource is created after another resource, but you don’t need to pass any information from that resource into the dependent resource. The following is such an example:

resource "aws_vpc_peering_connection" "services" {
  peer_owner_id = [...]
  peer_vpc_id   = [...]
}

resource "aws_instance" "foo" {
  instance_type = "t2.micro"
  [...]
  depends_on = [aws_vpc_peering_connection.services]
}

Here we create an AWS EC2 instance that depends on the existence of a VPC peering connection. We don’t need to pass any information from the peering connection to the EC2 instance, but we do want to make sure that the peering connection is created before the EC2 instance is created (for example because whatever runs on that EC2 instance is going to connect to the peered “services” network).

While implicit dependencies are much more common and explicit dependencies are generally discouraged, both are valid and useful concepts in Terraform. However, as we’ll see next, explicit dependencies can lead to a lot of confusion when used on modules.

Why depends_on doesn’t go well with modules

Terraform modules are re-usable containers with multiple resources. What follows is an example with two modules, where the second instance module depends on the first vpc module.

# creates a VPC with peering that we want to wait for
# (contents not shown, details not relevant)
module "vpc" {
  source = "./vpc"

  tags = local.tags
}

# create an EC2 instance
module "instance" {
  source = "./instance"

  depends_on = [module.vpc]
}

# ./instance/main.tf
data "aws_ami" "ami" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    [...]
  }
}

resource "aws_instance" "instance" {
  ami           = data.aws_ami.ami.id
  instance_type = "t2.micro"
  [...]
}

I’ve seen explicit dependencies on modules like this fairly often, probably because it just feels “right” to have the entire VPC finished before spinning up the EC2 instance.

The depends_on in this example creates a dependency from every resource (including data sources) in the “instance” module on every resource in the “vpc” module. This means that even if we change something insignificant such as the tags on the VPC, Terraform can no longer read the aws_ami data source during the plan and instead reports # module.instance.data.aws_ami.ami will be read during apply. Because the data source is not read until apply, its attributes are unknown while planning, so every resource that implicitly or explicitly depends on it (such as the EC2 instance) shows up as changed. If the attribute that consumes the data source output forces a replacement when it changes, as ami does on aws_instance, the EC2 instance will be re-created.

This is what it looks like in the output of terraform graph. You can see very clearly here that the AMI data source depends on all resources in the iam and vpc modules.

[Figure: Terraform dependency graph]

Even more confusion can arise when nothing in the module is changed directly, but one of the values passed into the module changes instead (such as updated tags). You might change a completely unrelated resource in your project, but if that results in a changed input value for the vpc module, the data sources in the instance module, which depends on the vpc module, will be deferred to apply as well.

As that may be a bit hard to follow, let’s go through these steps more explicitly:

  1. Something changes - a resource, variable, local: you name it
  2. A property of that changed entity is passed to the vpc module
  3. One of the resources in the vpc module uses this input and thus might change
  4. All resources in the instance module, including the AMI data source, depend on all resources in the vpc module
  5. The AMI data source is re-read because one of its dependencies might change - this generates the data source will be read during apply message
  6. The EC2 instance resource is re-created because the AMI property might change

Confused? Good. That’s the point. This situation is something we’ll definitely want to avoid. Now that the challenge is (hopefully) a bit more clear, let’s dive into what we can actually do about it.

How to get rid of the “data source will be read during apply” message

1. Don’t set dependencies

Purists might hate this solution, but pragmatists will like it. Do you really need the explicit dependency? Is it going to cause issues in practice if it’s omitted?

The VPC peering example above illustrates this. In practice, the peering will most likely finish before the EC2 instance has started and initialized (arguably; this depends on factors such as whether the peering is auto-approved or crosses accounts). In addition, the startup script or application that depends on the peering connection may retry until the connection becomes available. Also, if your Terraform code doesn’t (often) build a completely new environment, the dependency might not add a lot of value in practice. Therefore, you might consider omitting the explicit dependency.

If you decide to omit the explicit dependency, add a comment or update your documentation so that the next person (or future you) working on the code knows that the dependency exists and why it is omitted. While I prefer code over documentation, sometimes you just have to be practical.
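
As a minimal sketch of what such a comment could look like, assuming the module layout from the earlier example (the wording of the comment and the assumption that the application retries are only illustrations):

```hcl
# main.tf
module "vpc" {
  source = "./vpc"
}

# NOTE: whatever runs on this instance connects to the peered "services"
# network that is created in module.vpc. We intentionally do NOT set
# `depends_on = [module.vpc]` here, because it would defer every data source
# in this module to apply whenever anything in module.vpc changes.
# Assumption: the application retries until the peering is available.
module "instance" {
  source = "./instance"
}
```

The comment lives next to the module call, so anyone tempted to add the depends_on back sees the reasoning first.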

2. Don’t use modules

The second solution is to reconsider using modules. Modules are meant to be reusable components, or they should be used to abstract away complexity. What I’ve seen fairly often though is that modules are used to organize code in multiple directories. Terraform unfortunately does not support the use of subdirectories: it only loads the .tf files in a module’s top-level directory and does not descend into subdirectories. To work around this, modules are used to organize code in a more structured way.

The use of modules isn’t free, however: it adds complexity as you have to pass around outputs and variables. And as we’ve seen, it can lead to “data source will be read during apply” messages and re-creation of resources.

While having a lot of files in a single directory can be a bit messy, it does remove the need for modules and the complexity that comes with them. It’s a trade-off. Consider this option if you’re passing a lot of values between your modules. Refactoring Terraform is very doable these days, for example with moved blocks (Terraform 1.1 and later) or terraform state mv, so moving resources out of your modules is no big deal.
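
To illustrate, here is a rough sketch of moving the EC2 instance out of the instance module and into the root module without destroying it. It assumes Terraform 1.1 or later (which introduced moved blocks) and reuses the resource names from the earlier example. The AMI name pattern is a placeholder chosen for this sketch, not something taken from the original module.

```hcl
# main.tf (root module), after copying the contents of ./instance here
# and deleting the `module "instance"` block.

data "aws_ami" "ami" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"] # placeholder pattern, adjust as needed
  }
}

resource "aws_instance" "instance" {
  ami           = data.aws_ami.ami.id
  instance_type = "t2.micro"
}

# Tell Terraform that the instance already tracked in state under the old
# module address now lives at the new root-level address.
moved {
  from = module.instance.aws_instance.instance
  to   = aws_instance.instance
}
```

Before applying, it is worth running terraform plan to confirm that the instance is reported as moved rather than as destroyed and re-created.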

3. Stop using depends_on: pass dependent resources instead

If you really need the explicit dependency and you have valid reasons for using modules, this is what you’ll need to do.

Most important: remove the depends_on from your modules. This is the root cause of the problem. Next, pass the resource that you really want to depend on (such as the VPC peering connection) into the module and let only the relevant resource(s) depend on that variable. Here’s what that looks like in code:

# vpc/main.tf
resource "aws_vpc_peering_connection" "services" {
  [...]
}

output "peering_connection" {
  value = aws_vpc_peering_connection.services
}

# instance/main.tf
variable "peering_connection" {
  type = object({})
}

resource "aws_instance" "instance" {
  [...]
  depends_on = [var.peering_connection]
}

# main.tf
module "vpc" {
  source = "./vpc"
}

module "instance" {
  source = "./instance"

  peering_connection = module.vpc.peering_connection
}

You can see in this example how we pass the aws_vpc_peering_connection resource to the instance module, and let the aws_instance resource specifically depend on it. Running terraform graph shows us that the proper dependency now exists:

[Figure: Terraform dependency graph]

Building re-usable Terraform modules

The previous solution brings some additional complications. The re-usable instance module is no longer an abstraction to spin up an EC2 instance. It’s now an abstraction to spin up an EC2 instance that depends on a VPC peering connection.

And what if we have more dependencies? Perhaps in some cases you have an aws_iam_role_policy that you must sometimes depend on. Let’s see how we can pass a dynamic list of dependencies instead.

# vpc/main.tf
resource "aws_vpc_peering_connection" "services" {
  [...]
}

output "peering_connection" {
  value = aws_vpc_peering_connection.services
}

# instance/main.tf
variable "instance_dependencies" {
  type    = list(string)
  default = []
}

resource "aws_instance" "instance" {
  [...]
  depends_on = [var.instance_dependencies]
}

# main.tf
module "vpc" {
  source = "./vpc"
}

module "instance" {
  source = "./instance"

  instance_dependencies = [module.vpc.peering_connection.id]
}

Here we pass an optional list of instance_dependencies to the module. Through this list we pass the ID of the VPC peering connection, on which the EC2 instance will explicitly depend. If we have additional dependencies, such as the ID of an IAM policy, we can pass those as well without changing the module code.
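
As a sketch of that last point, the root module could pass both IDs in one list. The aws_iam_role.app and aws_iam_role_policy.app resources, including their names and policy contents, are hypothetical and exist only to provide a second dependency for this illustration.

```hcl
# main.tf
module "vpc" {
  source = "./vpc"
}

# Hypothetical IAM resources, defined only to have a second dependency.
resource "aws_iam_role" "app" {
  name = "app"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "app" {
  name = "app"
  role = aws_iam_role.app.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "*"
    }]
  })
}

module "instance" {
  source = "./instance"

  # Only aws_instance.instance inside the module depends on this variable;
  # the AMI data source does not reference it and is unaffected.
  instance_dependencies = [
    module.vpc.peering_connection.id,
    aws_iam_role_policy.app.id,
  ]
}
```

The module code stays the same as in the example above: the list simply grows or shrinks per caller, and callers without extra dependencies can leave it out thanks to the empty default.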

Conclusion

This blog post explained how to address the primary cause of the “data source will be read during apply” messages in Terraform. Instead of using depends_on on modules, use one of these three solutions:

  1. Don’t add the dependency if it’s (practically) not needed
  2. Don’t use modules if you don’t need them
  3. Pass the dependent resources to the module and only let the relevant resources depend on them

Of course, the “data source will be read during apply” message may also unexpectedly pop up in other situations where you don’t use depends_on on a module. In those cases, you can most likely use similar solutions to those described above.
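
One such situation, sketched here on the assumption that it matches your case, is a depends_on set directly on a data block: Terraform defers reading that data source until apply whenever the referenced resource has pending changes. The resource and data source names below are hypothetical. Where possible, relying on a reference to an attribute of the resource gives Terraform a narrower, implicit dependency instead.

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "main"
  }
}

# Deferred: read during apply whenever aws_vpc.main has any pending change,
# even one that only touches its tags.
data "aws_subnets" "deferred" {
  filter {
    name   = "vpc-id"
    values = [aws_vpc.main.id]
  }

  depends_on = [aws_vpc.main]
}

# Narrower: no depends_on, only the implicit reference to aws_vpc.main.id.
# Once the VPC exists its ID is known at plan time, so this data source can
# be read during the plan.
data "aws_subnets" "planned" {
  filter {
    name   = "vpc-id"
    values = [aws_vpc.main.id]
  }
}
```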

Get in touch! Follow me on Twitter: @SanderKnape.

