Terraform Best Practices for Production Infrastructure
Managing infrastructure as code with Terraform is powerful, but it comes with responsibilities. Here are the practices that have saved us from outages and made our infrastructure maintainable.
State Management
Remote State is Non-Negotiable
Never use local state files in production. Ever.
Good: S3 Backend with State Locking (AWS)
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "production/vpc/terraform.tfstate"
region = "us-east-1"
encrypt = true
# S3 supports native state locking via the use_lockfile attribute (Terraform 1.10+)
use_lockfile = true
# Keep credential and instance metadata checks enabled (these are the defaults)
skip_credentials_validation = false
skip_metadata_api_check = false
}
}
Alternative: GCS Backend (Native Locking)
terraform {
backend "gcs" {
bucket = "company-terraform-state"
prefix = "terraform/state"
}
}
Why lock state files?
- Prevents concurrent modifications
- Prevents state corruption
- Saves you from "who ran terraform at the same time" incidents
Previously the S3 backend required a separate DynamoDB table for locking; newer Terraform releases can lock natively in S3 via use_lockfile, and the GCS backend has always locked natively.
State File Organization
Separate state files by:
- Environment (dev, staging, prod)
- Lifecycle (infrastructure that changes together)
- Blast radius (isolate critical resources)
Directory structure:
terraform/
├── environments/
│   ├── production/
│   │   ├── networking/
│   │   ├── databases/
│   │   └── applications/
│   └── staging/
│       ├── networking/
│       └── applications/
└── modules/
    ├── vpc/
    ├── rds/
    └── eks/
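Each component directory carries its own backend block, with a key that mirrors its path so state stays isolated per environment and component. A minimal sketch for the production databases stack, reusing the bucket from the earlier example:
# environments/production/databases/backend.tf
terraform {
  backend "s3" {
    bucket       = "company-terraform-state"
    key          = "production/databases/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true
  }
}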
Module Design
Keep Modules Focused
A module should do one thing well. Bad modules try to be everything.
Bad: God Module
module "everything" {
source = "./modules/infrastructure"
create_vpc = true
create_database = true
create_k8s = true
create_cdn = true
# ... 50 more parameters
}
Good: Focused Modules
module "vpc" {
source = "./modules/vpc"
cidr_block = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b"]
}
module "database" {
source = "./modules/rds"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.database_subnet_ids
instance_class = "db.r5.large"
}
Version Your Modules
Pin module versions in production:
# Pin to a tag in a Git repository
module "vpc" {
  source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v2.1.0"
}

# Or pin a registry module with a version constraint
module "vpc" {
  source  = "company/vpc/aws"
  version = "~> 2.1"
}
Resource Naming and Tagging
Consistent Naming Convention
locals {
name_prefix = "${var.environment}-${var.project}-${var.component}"
common_tags = {
Environment = var.environment
Project = var.project
ManagedBy = "Terraform"
CostCenter = var.cost_center
}
}
resource "aws_instance" "app" {
# ... configuration
tags = merge(local.common_tags, {
Name = "${local.name_prefix}-app-server"
Component = "application"
})
}
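On newer versions of the AWS provider, default_tags on the provider block can apply the common tags automatically, so per-resource merge() calls only need to add resource-specific tags. A minimal sketch, reusing the locals above:
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource created through this provider
  default_tags {
    tags = local.common_tags
  }
}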
Data Sources vs. Resources
Use data sources for resources managed elsewhere:
# Don't create what already exists
data "aws_vpc" "main" {
filter {
name = "tag:Name"
values = ["production-vpc"]
}
}
# Use the existing VPC
resource "aws_subnet" "app" {
vpc_id = data.aws_vpc.main.id
# ...
}
Variables and Validation
Use Input Validation
Catch mistakes before they reach AWS:
variable "environment" {
type = string
description = "Environment name"
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging, or prod."
}
}
variable "instance_type" {
type = string
description = "EC2 instance type"
validation {
condition = can(regex("^(t3|m5|c5)\\.", var.instance_type))
error_message = "Only t3, m5, and c5 instance families are allowed."
}
}
Use Sensitive Variables
Protect secrets in logs and CLI output. Note that marking a value sensitive only redacts it from output; it is still stored in plain text in the state file, which is one more reason to encrypt and restrict access to remote state:
variable "database_password" {
type = string
sensitive = true
}
output "db_endpoint" {
value = aws_db_instance.main.endpoint
sensitive = false
}
output "db_password" {
value = aws_db_instance.main.password
sensitive = true # Won't show in terraform output
}
Lifecycle Management
Prevent Accidental Deletion
resource "aws_db_instance" "production" {
# ... configuration
lifecycle {
prevent_destroy = true
# Ignore tags added by other tools
ignore_changes = [
tags["LastBackup"],
tags["BackupRetention"]
]
}
}
Create Before Destroy
For zero-downtime updates:
resource "aws_launch_template" "app" {
# ... configuration
lifecycle {
create_before_destroy = true
}
}
Secrets Management
Never Hardcode Secrets
Bad:
resource "aws_db_instance" "main" {
password = "SuperSecret123" # DON'T DO THIS
}
Good:
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "production/database/master-password"
}
resource "aws_db_instance" "main" {
password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
Conditional Resources
Use count for Optional Resources
variable "create_backup" {
type = bool
default = true
}
resource "aws_backup_vault" "main" {
count = var.create_backup ? 1 : 0
name = "backup-vault"
}
# Reference with index
resource "aws_backup_plan" "main" {
  count = var.create_backup ? 1 : 0
  name  = "backup-plan"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main[0].name
    schedule          = "cron(0 5 * * ? *)"
  }
}
Use for_each for Multiple Similar Resources
variable "availability_zones" {
type = map(object({
cidr_block = string
}))
default = {
"us-east-1a" = { cidr_block = "10.0.1.0/24" }
"us-east-1b" = { cidr_block = "10.0.2.0/24" }
}
}
resource "aws_subnet" "private" {
for_each = var.availability_zones
vpc_id = aws_vpc.main.id
availability_zone = each.key
cidr_block = each.value.cidr_block
tags = {
Name = "private-${each.key}"
}
}
Testing and Validation
Use terraform fmt and validate
In CI/CD:
#!/bin/bash
set -euo pipefail  # fail the pipeline as soon as any check fails

terraform fmt -check -recursive
terraform init -backend=false
terraform validate
Plan Before Apply
Always review plans in production:
# Generate plan
terraform plan -out=tfplan
# Review the plan
terraform show tfplan
# Apply only if approved
terraform apply tfplan
Use tflint for Extra Validation
# Install tflint
curl -s https://raw.githubusercontent.com/terraform-linters/tflint/master/install_linux.sh | bash
# Run checks
tflint --init
tflint
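tflint --init downloads any rulesets declared in a .tflint.hcl file at the project root. A minimal sketch enabling the AWS ruleset (the version shown is illustrative):
plugin "aws" {
  enabled = true
  version = "0.38.0"  # illustrative; pin whatever version you have vetted
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}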
Drift Detection
Scheduled Drift Checks
#!/bin/bash
# drift-check.sh
# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes (drift) detected
terraform plan -detailed-exitcode -input=false
if [ $? -eq 2 ]; then
  echo "Drift detected!"
  # Send alert
  curl -X POST https://alerts.company.com/drift \
    -d '{"message": "Terraform drift detected in production"}'
fi
Common Pitfalls to Avoid
1. Not Using Workspaces Correctly
Workspaces are for the same infrastructure in different environments, not different infrastructure.
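For example, a single configuration can vary sizing per workspace with terraform.workspace while the resources themselves stay identical. A minimal sketch (the sizing map and the ami_id variable are illustrative):
locals {
  # Same resources in every workspace; only the sizing differs
  instance_types = {
    dev     = "t3.small"
    staging = "t3.medium"
    prod    = "m5.large"
  }
  instance_type = local.instance_types[terraform.workspace]
}

resource "aws_instance" "app" {
  ami           = var.ami_id  # assumed to be declared elsewhere
  instance_type = local.instance_type
}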
2. Circular Dependencies
# BAD: Circular dependency
resource "aws_security_group" "app" {
ingress {
security_groups = [aws_security_group.db.id]
}
}
resource "aws_security_group" "db" {
ingress {
security_groups = [aws_security_group.app.id]
}
}
# GOOD: Use security group rules
resource "aws_security_group" "app" {
# ... base config
}
resource "aws_security_group" "db" {
# ... base config
}
resource "aws_security_group_rule" "app_to_db" {
type = "ingress"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_group_id = aws_security_group.db.id
source_security_group_id = aws_security_group.app.id
}
3. Not Handling Provider Credentials Properly
Use AWS profiles or IAM roles, never hardcode credentials.
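A minimal sketch of both safe patterns, assuming a named CLI profile and a dedicated deployment role already exist (the names are illustrative):
provider "aws" {
  region = "us-east-1"

  # Local runs: use a named profile from ~/.aws/credentials
  profile = "production-admin"

  # CI/CD runners: assume a dedicated deployment role instead of a profile
  # assume_role {
  #   role_arn = "arn:aws:iam::123456789012:role/terraform-deploy"
  # }
}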
4. Forgetting to Ignore Generated Files
# .gitignore
.terraform/
*.tfstate
*.tfstate.*
.terraform.lock.hcl # usually better to commit this so provider versions stay consistent across machines
tfplan
*.tfvars # if they contain secrets
Documentation
Required README Sections
Every Terraform project should have:
# Infrastructure Module
## Overview
Brief description of what this creates
## Requirements
- Terraform >= 1.0
- AWS Provider >= 4.0
## Usage
Example code showing how to use the module
## Inputs
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| vpc_cidr | CIDR block for VPC | string | n/a | yes |
## Outputs
| Name | Description |
|------|-------------|
| vpc_id | The ID of the VPC |
## Examples
Link to example implementations
Conclusion
Good Terraform practices are about:
- Safety: Preventing accidental destruction
- Consistency: Using patterns that work across teams
- Maintainability: Making it easy to understand and modify
- Security: Protecting sensitive data and resources
Start with these practices early. It's much harder to retrofit them into existing infrastructure than to build them in from the start.