Terraform Best Practices for Production Infrastructure
Managing infrastructure as code with Terraform is powerful, but it comes with responsibilities. Here are the practices that have saved us from outages and made our infrastructure maintainable.
State Management
Remote State is Non-Negotiable
Never use local state files in production. Ever.
Good: S3 Backend with State Locking (AWS)
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "production/vpc/terraform.tfstate"
region = "us-east-1"
encrypt = true
# S3 supports native state locking via the use_lockfile attribute (Terraform 1.10+)
use_lockfile = true
# Keep credential and instance metadata checks enabled (these are the defaults)
skip_credentials_validation = false
skip_metadata_api_check = false
}
}
Alternative: GCS Backend (Native Locking)
terraform {
backend "gcs" {
bucket = "company-terraform-state"
prefix = "terraform/state"
}
}
Why lock state files?
- Prevents concurrent modifications
- Prevents state corruption
- Saves you from "who ran terraform at the same time" incidents
Previously the S3 backend required a separate DynamoDB table for locking; newer Terraform releases can lock natively in S3 via use_lockfile, and the GCS backend has always locked natively.
State File Organization
Separate state files by:
- Environment (dev, staging, prod)
- Lifecycle (infrastructure that changes together)
- Blast radius (isolate critical resources)
Directory structure:
terraform/
├── environments/
│   ├── production/
│   │   ├── networking/
│   │   ├── databases/
│   │   └── applications/
│   └── staging/
│       ├── networking/
│       └── applications/
└── modules/
    ├── vpc/
    ├── rds/
    └── eks/
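Each component directory carries its own backend block, with a key that mirrors its path so state stays isolated per environment and component. A minimal sketch for the production databases stack, reusing the bucket from the earlier example:
# environments/production/databases/backend.tf
terraform {
  backend "s3" {
    bucket       = "company-terraform-state"
    key          = "production/databases/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}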
Module Design
Keep Modules Focused
A module should do one thing well. Bad modules try to be everything.
Bad: God Module
module "everything" {
source = "./modules/infrastructure"
create_vpc = true
create_database = true
create_k8s = true
create_cdn = true
# ... 50 more parameters
}
Good: Focused Modules
module "vpc" {
source = "./modules/vpc"
cidr_block = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b"]
}
module "database" {
source = "./modules/rds"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.database_subnet_ids
instance_class = "db.r5.large"
}
Version Your Modules
Pin module versions in production:
# Pin to a tag in a Git repository
module "vpc" {
  source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v2.1.0"
}

# Or pin a registry module with a version constraint
module "vpc" {
  source  = "company/vpc/aws"
  version = "~> 2.1"
}
Resource Naming and Tagging
Consistent Naming Convention
locals {
name_prefix = "${var.environment}-${var.project}-${var.component}"
common_tags = {
Environment = var.environment
Project = var.project
ManagedBy = "Terraform"
CostCenter = var.cost_center
}
}
resource "aws_instance" "app" {
# ... configuration
tags = merge(local.common_tags, {
Name = "${local.name_prefix}-app-server"
Component = "application"
})
}
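On newer versions of the AWS provider, default_tags on the provider block can apply the common tags automatically, so per-resource merge() calls only need to add resource-specific tags. A minimal sketch, reusing the locals above:
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource created through this provider
  default_tags {
    tags = local.common_tags
  }
}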
Data Sources vs. Resources
Use data sources for resources managed elsewhere:
# Don't create what already exists
data "aws_vpc" "main" {
filter {
name = "tag:Name"
values = ["production-vpc"]
}
}
# Use the existing VPC
resource "aws_subnet" "app" {
vpc_id = data.aws_vpc.main.id
# ...
}
Variables and Validation
Use Input Validation
Catch mistakes before they reach AWS:
variable "environment" {
type = string
description = "Environment name"
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging, or prod."
}
}
variable "instance_type" {
type = string
description = "EC2 instance type"
validation {
condition = can(regex("^(t3|m5|c5)\\.", var.instance_type))
error_message = "Only t3, m5, and c5 instance families are allowed."
}
}
Use Sensitive Variables
Protect secrets in logs and CLI output. Note that marking a value sensitive only redacts it from output; it is still stored in plain text in the state file, which is one more reason to encrypt and restrict access to remote state:
variable "database_password" {
type = string
sensitive = true
}
output "db_endpoint" {
value = aws_db_instance.main.endpoint
sensitive = false
}
output "db_password" {
value = aws_db_instance.main.password
sensitive = true # Won't show in terraform output
}
Lifecycle Management
Prevent Accidental Deletion
resource "aws_db_instance" "production" {
# ... configuration
lifecycle {
prevent_destroy = true
# Ignore tags added by other tools
ignore_changes = [
tags["LastBackup"],
tags["BackupRetention"]
]
}
}
Create Before Destroy
For zero-downtime updates:
resource "aws_launch_template" "app" {
# ... configuration
lifecycle {
create_before_destroy = true
}
}
Secrets Management
Never Hardcode Secrets
Bad:
resource "aws_db_instance" "main" {
password = "SuperSecret123" # DON'T DO THIS
}
Good:
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "production/database/master-password"
}
resource "aws_db_instance" "main" {
password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
Conditional Resources
Use count for Optional Resources
variable "create_backup" {
type = bool
default = true
}
resource "aws_backup_vault" "main" {
count = var.create_backup ? 1 : 0
name = "backup-vault"
}
# Reference with index
resource "aws_backup_plan" "main" {
  count = var.create_backup ? 1 : 0
  name  = "backup-plan"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main[0].name
    schedule          = "cron(0 5 * * ? *)"
  }
}
Use for_each for Multiple Similar Resources
variable "availability_zones" {
type = map(object({
cidr_block = string
}))
default = {
"us-east-1a" = { cidr_block = "10.0.1.0/24" }
"us-east-1b" = { cidr_block = "10.0.2.0/24" }
}
}
resource "aws_subnet" "private" {
for_each = var.availability_zones
vpc_id = aws_vpc.main.id
availability_zone = each.key
cidr_block = each.value.cidr_block
tags = {
Name = "private-${each.key}"
}
}
Testing and Validation
Use terraform fmt and validate
In CI/CD:
#!/bin/bash
set -euo pipefail  # fail the pipeline as soon as any check fails

terraform fmt -check -recursive
terraform init -backend=false
terraform validate
Plan Before Apply
Always review plans in production:
# Generate plan
terraform plan -out=tfplan
# Review the plan
terraform show tfplan
# Apply only if approved
terraform apply tfplan
Use tflint for Extra Validation
# Install tflint
curl -s https://raw.githubusercontent.com/terraform-linters/tflint/master/install_linux.sh | bash
# Run checks
tflint --init
tflint
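tflint --init downloads any rulesets declared in a .tflint.hcl file at the project root. A minimal sketch enabling the AWS ruleset (the version shown is illustrative):
plugin "aws" {
  enabled = true
  version = "0.38.0"  # illustrative; pin whatever version you have vetted
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}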
Drift Detection
Scheduled Drift Checks
#!/bin/bash
# drift-check.sh
# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes (drift) detected
terraform plan -detailed-exitcode -input=false
if [ $? -eq 2 ]; then
  echo "Drift detected!"
  # Send alert
  curl -X POST https://alerts.company.com/drift \
    -d '{"message": "Terraform drift detected in production"}'
fi
Common Pitfalls to Avoid
1. Not Using Workspaces Correctly
Workspaces are for the same infrastructure in different environments, not different infrastructure.
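For example, a single configuration can vary sizing per workspace with terraform.workspace while the resources themselves stay identical. A minimal sketch (the sizing map and the ami_id variable are illustrative):
locals {
  # Same resources in every workspace; only the sizing differs
  instance_types = {
    dev     = "t3.small"
    staging = "t3.medium"
    prod    = "m5.large"
  }
  instance_type = local.instance_types[terraform.workspace]
}

resource "aws_instance" "app" {
  ami           = var.ami_id  # assumed to be declared elsewhere
  instance_type = local.instance_type
}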
2. Circular Dependencies
# BAD: Circular dependency
resource "aws_security_group" "app" {
ingress {
security_groups = [aws_security_group.db.id]
}
}
resource "aws_security_group" "db" {
ingress {
security_groups = [aws_security_group.app.id]
}
}
# GOOD: Use security group rules
resource "aws_security_group" "app" {
# ... base config
}
resource "aws_security_group" "db" {
# ... base config
}
resource "aws_security_group_rule" "app_to_db" {
type = "ingress"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_group_id = aws_security_group.db.id
source_security_group_id = aws_security_group.app.id
}
3. Not Handling Provider Credentials Properly
Use AWS profiles or IAM roles, never hardcode credentials.
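A minimal sketch of both safe patterns, assuming a named CLI profile and a dedicated deployment role already exist (the names are illustrative):
provider "aws" {
  region = "us-east-1"

  # Local runs: use a named profile from ~/.aws/credentials
  profile = "production-admin"

  # CI/CD runners: assume a dedicated deployment role instead of a profile
  # assume_role {
  #   role_arn = "arn:aws:iam::123456789012:role/terraform-deploy"
  # }
}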
4. Forgetting to Ignore Generated Files
# .gitignore
.terraform/
*.tfstate
*.tfstate.*
.terraform.lock.hcl # usually better to commit this so provider versions stay consistent across machines
tfplan
*.tfvars # if they contain secrets
Documentation
Required README Sections
Every Terraform project should have:
# Infrastructure Module
## Overview
Brief description of what this creates
## Requirements
- Terraform >= 1.0
- AWS Provider >= 4.0
## Usage
Example code showing how to use the module
## Inputs
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| vpc_cidr | CIDR block for VPC | string | n/a | yes |
## Outputs
| Name | Description |
|------|-------------|
| vpc_id | The ID of the VPC |
## Examples
Link to example implementations
Conclusion
Good Terraform practices are about:
- Safety: Preventing accidental destruction
- Consistency: Using patterns that work across teams
- Maintainability: Making it easy to understand and modify
- Security: Protecting sensitive data and resources
Start with these practices early. It's much harder to retrofit them into existing infrastructure than to build them in from the start.