When Your Infrastructure Fights Back: Recognizing Complexity in Terraform
Something felt off, and I couldn’t name it.
Our Terraform codebase worked. Pipelines ran. Resources deployed. Nothing was technically broken. But every change felt heavier than it should. Simple modifications — updating a tag policy, adjusting a naming convention, adding a new component to an existing account — took longer than anyone expected. Not because the tasks were hard, but because the codebase had quietly become difficult to reason about.
I kept explaining this friction to people as “tech debt” or “it’s just how Terraform is.” But those labels never felt precise enough. They described the symptom without diagnosing the cause.
Then I picked up John Ousterhout’s A Philosophy of Software Design and something clicked. The book isn’t about infrastructure — it’s about software systems. But its framework for understanding complexity mapped so precisely to what I was experiencing in our Terraform and Terragrunt code that I had to write about it.
The definition is deceptively simple: complexity is anything related to the structure of a system that makes it hard to understand and modify [1]. Not the size of the system. Not the number of resources. The friction you feel when trying to change it.
The book identifies three symptoms. Every single one shows up in infrastructure code.
Symptom 1: Change Amplification
Change amplification is what happens when a seemingly simple modification requires changes in many different places. In application code, this might mean updating a color value across dozens of files. In Terraform, the stakes are higher because changes don’t just mean editing files — they mean deployments, state file modifications, and blast radius.
Here’s a real example. A company — let’s call them Acme Corp — decides to change its resource naming prefix from "acme" to "acmeco" after a merger. What sounds like a find-and-replace exercise turns into weeks of work. The prefix is everywhere:
# modules/s3/main.tf
resource "aws_s3_bucket" "this" {
  bucket = "${var.prefix}-${var.environment}-${var.name}"
}

# modules/rds/main.tf
resource "aws_db_instance" "this" {
  identifier = "${var.prefix}-${var.environment}-${var.name}"
}

# modules/iam/main.tf
resource "aws_iam_role" "this" {
  name = "${var.prefix}-${var.environment}-${var.role_name}"
}

# ... repeated across 20+ modules
Changing var.prefix from "acme" to "acmeco" sounds trivial. The code change takes 15 minutes. But the production deployment can take weeks. Why? Because many AWS resources are immutable with respect to their name or identifier. S3 buckets can’t be renamed. RDS instances can’t change their identifier without downtime. IAM roles referenced by ARN across accounts will break every service that depends on them.
This means terraform plan will show destroy and recreate for resources you never intended to replace. And this is where teams sometimes make it worse: the temptation is to push the change through, trusting Terraform to handle the recreation. But for stateful resources like databases or buckets with data, that’s not a rename — it’s a data loss event.
The safer pattern is to use lifecycle blocks to protect resources whose names are baked into their identity:
resource "aws_s3_bucket" "this" {
  bucket = "${var.prefix}-${var.environment}-${var.name}"

  lifecycle {
    ignore_changes = [bucket] # Prevent accidental destruction on prefix change
  }
}

resource "aws_db_instance" "this" {
  identifier = "${var.prefix}-${var.environment}-${var.name}"

  lifecycle {
    ignore_changes = [identifier] # Same — protect stateful resources
  }
}
But here’s the uncomfortable truth: once you need ignore_changes to protect yourself from your own naming convention, that’s a design smell. It means a decision that was supposed to be cosmetic (the prefix) has become load-bearing infrastructure. A single naming decision, duplicated across every layer, now cascades through the entire system — and the only thing preventing catastrophe is a lifecycle block that silences the very tool trying to warn you.
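There is a structural alternative worth sketching. If resource identity is decoupled from the convention, the convention can change without touching anything stateful. The `name_override` variable and `bucket_name` local below are illustrative inventions, not from any real codebase:

```hcl
# Sketch: freeze the identity of existing stateful resources explicitly,
# so only new resources pick up a changed prefix.
variable "name_override" {
  description = "Set for pre-merger resources whose live name must not change."
  type        = string
  default     = null
}

locals {
  # coalesce() returns the first non-null argument: the frozen name if set,
  # otherwise the convention-derived one.
  bucket_name = coalesce(var.name_override, "${var.prefix}-${var.environment}-${var.name}")
}

resource "aws_s3_bucket" "this" {
  bucket = local.bucket_name
}
```

The point is not this particular mechanism but the separation: the convention becomes a default rather than an identity, and changing it stops being a destroy-and-recreate event.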
The same pattern appears with tag policies. Suppose your organization standardizes new required tags. In theory, straightforward. In practice:
# Your custom module — tags as a map, works fine
module "vpc" {
  source = "./modules/vpc"
  tags   = local.common_tags # ✅ Easy to update
}

# Third-party AWS module — different tag format
module "eks" {
  source = "terraform-aws-modules/eks/aws"
  tags   = local.common_tags # ✅ Also works

  # But wait — node groups have their own tag structure
  eks_managed_node_groups = {
    default = {
      tags = { # ❌ Different format, easy to miss
        "kubernetes.io/cluster/${var.cluster_name}" = "owned"
      }
    }
  }
}

# Bare resource — because the use case was too small for a module
resource "aws_sns_topic" "alerts" {
  name = "${var.prefix}-alerts"
  # ❌ No tags at all — nobody added them when this was created
}
One policy change. Twenty-plus files to modify. Multiple deployment pipelines to run. Each with its own blast radius.
The key insight is that change amplification is a design problem, not a discipline problem. The question isn’t “why didn’t someone make this easier?” but “what structural decisions allowed a single change to require modifications in so many places?” When a naming convention or tag policy must be manually updated across modules, Terragrunt configs, and state boundaries, that’s not Terraform being difficult. That’s the design leaking information across boundaries.
How to spot it: If you needed to change your tagging policy tomorrow, how many files would you touch? If the answer is more than a handful, that’s change amplification. A subtler test: when you change a module’s interface, how many Terragrunt configs break? If updating a variable name requires editing every environment that consumes it, your module boundaries might be too thin.
What this points toward: centralize the things that change together. If tags are defined in one place and inherited everywhere, a policy change becomes a single-file edit. The goal isn’t DRY for its own sake — it’s ensuring that decisions that change as a unit live as a unit.
# Instead of scattering tags across 20 modules:
# define once in root.hcl, inherit everywhere.

# root.hcl
locals {
  common_tags = {
    ManagedBy   = "terraform"
    Project     = "platform"
    Environment = local.environment
    CostCenter  = local.cost_center
  }
}
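For completeness, here is one way a child config could consume those shared tags, so the policy edit truly stays in one file. This is a sketch assuming a Terragrunt version that supports `expose = true` on include blocks; the file path is illustrative:

```hcl
# environments/production/eu-west-1/12-lambda/my-function/terragrunt.hcl (sketch)
include "root" {
  path   = find_in_parent_folders("root.hcl")
  expose = true # makes root.hcl locals readable as include.root.locals
}

inputs = {
  # A tag-policy change in root.hcl now lands here automatically on the next plan.
  tags = include.root.locals.common_tags
}
```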
Symptom 2: Cognitive Load
Cognitive load is about how much an engineer needs to know to complete a task. High cognitive load means you have to hold too many things in your head simultaneously to make a safe change. In infrastructure, this is amplified because the things you need to know often aren’t visible in the code you’re looking at.
Consider what it takes to add a Lambda function to an existing account in a Terragrunt setup. Here’s the Terragrunt config you’d write:
# environments/production/eu-west-1/12-lambda/my-function/terragrunt.hcl
include "root" {
  path = find_in_parent_folders("root.hcl")
}

include "env" {
  path = find_in_parent_folders("env.hcl")
}

terraform {
  source = "${get_repo_root()}//modules/lambda"
}

dependency "vpc" {
  config_path = "../../10-networking/vpc"
}

dependency "security" {
  config_path = "../../11-security/iam-roles"
}

inputs = {
  function_name = "my-function"
  subnet_ids    = dependency.vpc.outputs.private_subnet_ids
  role_arn      = dependency.security.outputs.lambda_role_arn

  # But wait — where does the environment come from?
  # Is it in root.hcl? env.hcl? The module defaults?
  # What about the naming prefix? The log retention period?
  # The VPC security group — is that in the vpc dependency or somewhere else?
}
This looks clean. But to write it correctly, you need to understand: the module interface for Lambda (which variables are required, which are optional, what types they expect). The IAM structure (centralized roles or per-service?). The VPC configuration (which subnets, which security groups?). The naming conventions. The Terragrunt dependency chain. The state organization. And the variable hierarchy — where is each value actually set?
That’s seven distinct mental models before writing a single line of HCL. Most of that context isn’t in the file you’re editing.
Terragrunt’s DRY features can make this worse. Its include blocks, dependency blocks, and generate blocks are genuinely useful for reducing repetition. But they create obscurity: important information that isn’t obvious from looking at the code.
# root.hcl — generates provider configuration dynamically
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.region}"

  assume_role {
    # Which role does this assume? What permissions does it have?
    # You need to trace through root.hcl → env.hcl → account.hcl
    # to figure out what local.role_name resolves to.
    role_arn = "arn:aws:iam::${local.account_id}:role/${local.role_name}"
  }

  default_tags {
    tags = {
      Environment = "${local.environment}"
      ManagedBy   = "terraform"
    }
  }
}
EOF
}
This is the Terragrunt DRY paradox. The very mechanisms designed to simplify your code can make it harder to understand, because the reader has to reconstruct context that has been abstracted away. Less repetition, more cognitive load.
The worst version is the “where is this value set?” problem. A variable might be defined in the module’s variables.tf with a default, overridden in root.hcl, overridden again at the environment level, and again at the component level. Tracing a single value through that hierarchy requires mentally unwinding a stack of configurations. Get it wrong, and you won’t know until terraform plan shows you something unexpected — or worse, until terraform apply does.
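Concretely, the stack described above tends to look something like this. File paths and values are illustrative; the only place the winning value is guaranteed to surface is the plan output:

```hcl
# 1) modules/lambda/variables.tf — the default, invisible at the call site
variable "log_retention" {
  type    = number
  default = 7
}

# 2) root.hcl — overridden org-wide
inputs = {
  log_retention = 14
}

# 3) environments/production/env.hcl — overridden again per environment
inputs = {
  log_retention = 30
}

# 4) environments/production/eu-west-1/12-lambda/my-function/terragrunt.hcl
#    — the value that actually wins
inputs = {
  log_retention = 90
}
```

Four files, one value. Every reader of the component config has to know the other three files exist to predict what Terraform will do.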
How to spot it: Can a new team member deploy a new service by reading the code, or do they need a walkthrough from someone who holds the context in their head? If it’s the latter, the knowledge lives in people, not in the system. A subtler test: how long does terraform plan take to mentally parse? If a plan output requires 10 minutes of reading to determine whether it’s safe to apply, that’s cognitive load you’re paying on every deployment.
What this points toward: make the implicit explicit. If a Terragrunt config depends on understanding a chain of includes to know which provider role it assumes, consider adding a comment or a README that makes the resolution path visible. Not everything needs to be automated — sometimes documentation at the point of confusion is more valuable than another layer of abstraction.
Also: resist the urge to make every module maximally configurable. Modules with 30 input variables are technically flexible, but they push the burden of understanding onto the consumer. Sometimes an opinionated module that makes 25 of those decisions for you — with an escape hatch for the 5 that actually vary — reduces cognitive load more than a fully generic one.
# High cognitive load: consumer must understand every option
module "lambda" {
  source           = "./modules/lambda-generic"
  function_name    = "my-function"
  runtime          = "python3.12"
  handler          = "main.handler"
  memory_size      = 256
  timeout          = 30
  vpc_config       = { ... }
  iam_policy_arns  = [ ... ]
  log_retention    = 14
  tracing_mode     = "Active"
  environment_vars = { ... }
  layers           = [ ... ]
  # 15 more variables...
}

# Lower cognitive load: opinionated wrapper with sensible defaults
module "lambda" {
  source        = "./modules/platform-lambda"
  function_name = "my-function"
  handler       = "main.handler"
  # Everything else has team-standard defaults.
  # Override only what's different for this use case.
}
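The inside of such a wrapper is worth showing too. This is a sketch under the assumption that `platform-lambda` simply wraps the generic module; the variable names and default values are invented for illustration:

```hcl
# modules/platform-lambda/main.tf (sketch)
variable "function_name" { type = string }
variable "handler"       { type = string }

variable "memory_size" {
  type    = number
  default = 256 # escape hatch: one of the few knobs that genuinely varies
}

module "lambda" {
  source        = "../lambda-generic"
  function_name = var.function_name
  handler       = var.handler
  memory_size   = var.memory_size

  # Team-standard decisions, made once here instead of by every consumer:
  runtime       = "python3.12"
  timeout       = 30
  log_retention = 14
  tracing_mode  = "Active"
}
```

The fixed decisions now live in one reviewed place, and the consumer's interface shrinks to the handful of things that actually differ per service.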
Symptom 3: Unknown Unknowns
This is the most dangerous symptom. Unknown unknowns are when it’s not just hard to make a change — it’s unclear what you even need to know to make the change safely. There’s no error message, no obvious documentation to read, no one to ask because nobody realizes the dependency exists.
Infrastructure code is uniquely prone to this.
Picture this: you need to update a security group rule.
# Account A: 10-networking/security-groups/main.tf
resource "aws_security_group" "api" {
  name_prefix = "api-sg-"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.allowed_cidr] # You change this. Seems safe.
  }
}
Straightforward, right? But in another account, there’s a Terragrunt dependency referencing this security group by its ID — and that reference isn’t documented anywhere in the module you’re looking at:
# Account B: 12-compute/service/terragrunt.hcl
dependency "account_a_networking" {
  config_path = "../../../../account-a/10-networking/security-groups"
}

inputs = {
  allowed_security_groups = [
    dependency.account_a_networking.outputs.api_security_group_id
    # ☝️ This dependency is invisible from Account A's perspective
  ]
}
You apply your change in Account A. Works fine. Two days later, a deployment in Account B fails because your security group change invalidated an assumption that was never made explicit.
Or consider layer ordering. If you organize your infrastructure in numbered layers — 10-networking, 11-dns, 12-compute — what happens when you need to insert something between 10 and 11?
infrastructure/
├── 10-networking/ # VPC, subnets, route tables
├── 11-dns/ # Route53 zones
├── 12-compute/ # ECS, Lambda
└── ...
# Need to add a NAT gateway layer between networking and DNS.
# Do you call it 10a? 10.5? Renumber everything?
# Scripts that parse folder names by number might break.
# Cross-references to "10-networking" in other configs need updating.
The most insidious unknown unknowns are implicit ordering dependencies. Terragrunt’s run-all command deploys layers in dependency order, but not all dependencies are declared:
# This works because 11-dns always deploys after 10-networking.
# But there's no dependency block enforcing that order.
# If someone runs a targeted apply on 11-dns alone,
# or if run-all parallelizes differently than expected,
# the implicit assumption breaks silently.

# 11-dns/main.tf
data "aws_vpc" "main" {
  filter {
    name   = "tag:Name"
    values = ["${var.prefix}-main"] # Assumes 10-networking already ran
  }
}
Nothing in the code tells you this could fail. It works in sequence. It breaks in parallel. And the person deploying has no way to know the constraint exists.
The core problem: there is something you need to know, but there is no way for you to find out what it is, or even whether there is an issue [1]. In application code, this is bad. In infrastructure code, where a mistake can take down production, it’s dangerous.
How to spot it: When was the last time a change broke something nobody expected? If it happens regularly, your codebase has hidden dependencies that aren’t visible to the people making changes. A subtler test: could you draw a dependency graph of your infrastructure from the code alone, without asking anyone? If not, some of those dependencies exist only in people’s heads — and that’s where unknown unknowns live.
One practice that helps: keep a simple log of surprises. What broke, what you didn’t expect, what you had to ask someone about. Patterns emerge fast. After a few weeks, you’ll have a map of exactly where your codebase hides its assumptions.
What this points toward: make dependencies explicit and documented. Cross-account references should be declared as Terragrunt dependencies, even if Terraform doesn’t strictly require them. Implicit ordering should be replaced with explicit dependency blocks. And any hardcoded value — a subnet ID, an account number, a CIDR range — should have a comment explaining why that specific value was chosen and where it comes from.
# Bad: implicit dependency, no one knows this exists
data "aws_vpc" "main" {
  filter {
    name   = "tag:Name"
    values = ["platform-main"]
  }
}

# Better: explicit dependency, visible in the dependency graph
dependency "networking" {
  config_path = "../10-networking/vpc"
}

inputs = {
  vpc_id = dependency.networking.outputs.vpc_id
}
Why This Matters for Platform Engineers
If you’re building platforms — infrastructure that other people use to deploy and operate their services — these three symptoms aren’t just your problem. They’re your users’ problem. Every instance of change amplification, every spike in cognitive load, every unknown unknown in your codebase is friction that platform consumers absorb. It’s the question they ask in Slack that they shouldn’t have needed to ask. It’s the deployment that took three hours instead of thirty minutes. It’s the incident that happened because nobody knew about a hidden dependency.
Skelton and Pais put it well in Team Topologies: the primary benefit of a platform is to reduce the cognitive load on the teams using it [2]. If your infrastructure code is complex in these ways, you’re not reducing cognitive load. You’re just relocating it.
This is worth sitting with. The instinct in platform engineering is to add — more modules, more abstractions, more DRY configurations, more layers. But complexity is incremental. Each small addition is harmless on its own. It’s the accumulation that becomes nearly impossible to unwind.
Naming Changes Conversations
I’m not going to wrap this up with a prescriptive solution. The fixes depend heavily on context — your team size, your codebase history, your organizational constraints. What I can say is that having a vocabulary for these problems changed how my team talked about them.
Before this framework, we’d say things like “this feels like tech debt” or “Terraform is just hard.” Those labels are vague enough that they don’t lead anywhere. They don’t help you prioritize, and they don’t help you explain to leadership why a “simple” change is taking two weeks.
But when you can say “we have a change amplification problem — our tagging policy is duplicated across 20 modules, so every policy update costs us days instead of minutes,” that’s a conversation people can act on. When you can point to cognitive load — “a new team member can’t deploy a service without a 45-minute walkthrough because the variable hierarchy spans four configuration layers” — you’ve moved from a feeling to a diagnosis.
And when you can name unknown unknowns — “we had two incidents this quarter caused by cross-account dependencies that weren’t declared in the code” — you’ve made the invisible visible. That’s the first step toward designing against it.
The Longer View
The next time your Terraform fights back — when a “simple” change spirals into a multi-day effort, when someone asks why this is so hard, when a deploy fails because of a dependency nobody knew about — consider whether the problem is really about Terraform. Or whether it’s about complexity that crept in one reasonable decision at a time.
The most useful insight from A Philosophy of Software Design might be the simplest: complexity is incremental. No single shortcut or tactical decision makes a system complex. But dozens of them, accumulated over months and years, create a codebase that fights every change.
The good news? Recognition is the hardest part. Once you can name the symptoms, you can start designing against them.
References
[1] Ousterhout, J. (2018). A Philosophy of Software Design. Yaknyam Press. Chapter 2: The Nature of Complexity.
[2] Skelton, M. & Pais, M. (2019). Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution Press.