Building Observability at Scale
After processing over 1TB of logs and 50K+ metrics weekly, here's what I've learned about building production observability systems.
The Problem with Traditional Monitoring
Traditional monitoring approaches break down at scale. When you're dealing with hundreds of microservices, thousands of instances, and millions of metrics, you can't just throw everything at a single monitoring stack and hope for the best.
Early in our scaling journey, we faced:
- Query timeouts on Grafana dashboards
- High cardinality metrics crushing Prometheus
- Log storage costs spiraling out of control
- Alert fatigue from poorly tuned thresholds
- No clear ownership of observability data
The Three Pillars at Scale
Metrics: The Cardinality Problem
High cardinality metrics are the silent killer of monitoring systems. We learned this the hard way when customer IDs in metric labels brought down our Prometheus cluster.
What worked:
- Aggressive use of recording rules to pre-aggregate data (sketched after this list)
- Strict label policies (no customer IDs, request IDs, or timestamps)
- Separate Prometheus instances for different teams/services
- Thanos for long-term storage and global querying
- Regular cardinality audits using promtool
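As a concrete example, here is a minimal recording-rules sketch of the pre-aggregation approach (metric and label names are illustrative):

groups:
  - name: service_aggregations
    interval: 1m
    rules:
      # Pre-aggregate request rate per service and status code
      - record: service:http_requests:rate5m
        expr: sum by (service_name, status_code) (rate(http_requests_total[5m]))
      # Pre-compute p95 latency so dashboards never touch raw buckets
      - record: service:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (service_name, le) (rate(http_request_duration_seconds_bucket[5m])))

Dashboards and alerts then query the service:* series instead of the raw, high-cardinality metrics.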
Key metrics to track:
# Request rate, error rate, duration (RED method)
rate(http_requests_total[5m])
rate(http_requests_failed_total[5m])
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Saturation metrics
node_filesystem_avail_bytes / node_filesystem_size_bytes
Logs: Cost vs. Value
At $0.50 per GB ingested, our log costs were approaching $15K monthly. We had to make hard decisions about what to keep.
What we did:
- Implemented sampling for high-volume, low-value logs (sketched after this list)
- Moved to Loki for cost-effective log aggregation
- Retained only 7 days of full logs, 30 days of sampled
- Structured logging with consistent fields across services
- Log levels enforced at the platform level
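Here's a rough sketch of the sampling step using Vector's sample transform (the input name and rate are illustrative; error-level events are excluded from sampling):

transforms:
  sample_verbose_logs:
    type: sample
    inputs:
      - parsed_app_logs          # upstream transform name (illustrative)
    rate: 10                     # keep roughly 1 in 10 matching events
    exclude:                     # never sample away errors
      type: vrl
      source: '.level == "error"'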
Structured logging example:
{
  "timestamp": "2024-03-10T10:30:00Z",
  "level": "error",
  "service": "payment-processor",
  "trace_id": "abc123",
  "message": "Payment gateway timeout",
  "error_code": "GATEWAY_TIMEOUT",
  "duration_ms": 5000
}
Traces: The Missing Context
Metrics tell you something is wrong. Logs tell you what. Traces tell you why.
We implemented distributed tracing using OpenTelemetry and saw immediate value:
- Reduced mean time to detection (MTTD) by 60%
- Identified cross-service performance bottlenecks
- Clear understanding of request flows through the system
Critical practices:
- Automatic trace propagation across services
- Sampling strategy (100% for errors, 1% for success; see the collector sketch after this list)
- Correlation IDs linking traces to logs and metrics
- Span attributes for business context (user tier, feature flags)
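The error-biased sampling above can be expressed with the tail_sampling processor from opentelemetry-collector-contrib; a sketch (values illustrative):

processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans before deciding
    policies:
      - name: keep-all-errors     # keep every trace containing an error span
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-success      # ~1% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

Policies are OR-combined, so error traces are always kept and the rest fall through to the 1% probabilistic policy.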
The Tools We Use
Our observability stack evolved through trial and error:
Core Components
- Prometheus + Thanos: Metrics collection and long-term storage
- Loki: Cost-effective log aggregation
- Tempo: Distributed tracing backend
- Grafana: Unified visualization layer
- AlertManager: Alert routing and de-duplication
Integration Layer
- OpenTelemetry Collector: Centralized telemetry processing
- Vector: Log routing and transformation
- Promtail: Log shipping to Loki
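A minimal OpenTelemetry Collector pipeline tying these pieces together might look like this (endpoints are placeholders; logs flow through Vector/Promtail as above):

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp/tempo:                      # traces to Tempo
    endpoint: tempo-distributor:4317
    tls:
      insecure: true
  prometheusremotewrite:           # metrics to Prometheus/Thanos
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]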
Lessons Learned the Hard Way
1. Instrumentation is Code
Treat your observability instrumentation with the same rigor as your application code:
- Code reviews for new metrics
- Tests for alert rules (example after this list)
- Version control for dashboards
- Documentation for what metrics mean
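For alert-rule tests we lean on promtool's unit-test format; a sketch against a hypothetical HighErrorRate rule (series, values, and labels are illustrative):

# alert_tests.yml — run with: promtool test rules alert_tests.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # counter growing by 30 per minute => 0.5 failures/second
      - series: 'http_requests_failed_total{service_name="checkout"}'
        values: '0+30x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service_name: checkout
              severity: page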
2. Default to Structured Data
Every log line, metric label, and trace attribute should follow a schema. We created service templates with:
- Pre-configured logging libraries
- Standard metric naming conventions
- Required trace attributes
- Automatic correlation ID injection
3. Cardinality is Everything
We now have automated checks that reject metric labels that could cause cardinality explosions:
# Allowed labels
- service_name
- endpoint
- status_code
- availability_zone
# Blocked labels (high cardinality)
- customer_id
- request_id
- user_agent
- timestamp
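One scrape-side guard is Prometheus metric relabeling, which strips the blocked labels before they reach storage (a sketch; the job name is illustrative):

scrape_configs:
  - job_name: payment-processor
    metric_relabel_configs:
      # Drop blocked high-cardinality labels before ingestion
      - action: labeldrop
        regex: customer_id|request_id|user_agent|timestamp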
4. Make Alerts Actionable
Every alert must have:
- Clear ownership (who gets paged)
- Runbook link (what to do)
- Context (why it matters)
- Threshold reasoning (how we chose the value)
Bad alert:
alert: HighErrorRate
expr: rate(errors[5m]) > 10
Good alert:
alert: PaymentProcessingErrors
expr: |
  rate(payment_errors_total{severity="critical"}[5m]) > 0.01
  and on (error_type)
  sum(rate(payment_errors_total[5m])) by (error_type) > 5
annotations:
  summary: "Critical payment processing errors detected"
  description: "{{ $value }} payment errors per second"
  runbook: "https://wiki.company.com/runbooks/payment-errors"
  impact: "Customers cannot complete purchases"
labels:
  severity: page
  team: payments
5. Cost Governance from Day One
We implemented cost controls early:
- Metric retention tiers (7d, 30d, 1y based on importance)
- Automatic log sampling for verbose services
- Budget alerts at team level
- Quarterly review of highest-cost signals
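On the log side, several of these guardrails are just Loki limits (values illustrative; per-team overrides live in the runtime overrides file):

limits_config:
  retention_period: 168h        # default tier: 7 days
  ingestion_rate_mb: 10         # per-tenant ingestion cap
  ingestion_burst_size_mb: 20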
Scaling Strategies
Horizontal Scaling
- Prometheus federation for multiple clusters (sketched after this list)
- Separate Loki instances per environment
- Regional Tempo deployments for trace collection
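The federation job pulls only the pre-aggregated recording-rule series from each cluster-local Prometheus (targets and match patterns are illustrative):

scrape_configs:
  - job_name: federate-clusters
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"service:.*"}'   # recording rules only, never raw series
    static_configs:
      - targets:
          - prometheus-us-east.internal:9090
          - prometheus-eu-west.internal:9090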
Vertical Optimization
- Recording rules to reduce query load
- Log aggregation pipelines with Vector
- Trace sampling based on error rates and latency
Data Lifecycle
- Hot storage (7 days): SSD-backed, fast queries
- Warm storage (30 days): Object storage, acceptable latency
- Cold storage (1 year): Compressed archives, rare access
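On the metrics side, the tiers map roughly onto Thanos compactor retention and downsampling flags (a sketch; values are illustrative):

# Thanos compactor arguments (sketch)
args:
  - --retention.resolution-raw=30d    # full-resolution data
  - --retention.resolution-5m=90d     # 5m downsampled
  - --retention.resolution-1h=1y      # 1h downsampled, long-term trends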
Measuring Observability
We track our observability platform itself:
Reliability metrics:
- Query success rate (target: >99.9%)
- P95 query latency (target: <5s)
- Ingestion lag (target: <30s)
Cost metrics:
- Cost per GB ingested
- Cost per 1M metrics
- Total observability spend as % of infrastructure
Value metrics:
- MTTD (mean time to detection)
- MTTR (mean time to resolution)
- Number of incidents found through proactive monitoring
The Bottom Line
Building observability at scale requires:
- Discipline: Enforce standards across teams
- Trade-offs: Not everything needs to be monitored
- Iteration: Your first stack won't be your last
- Cost awareness: Observability can get expensive fast
- Ownership: Make teams responsible for their telemetry
The goal isn't perfect observability—it's actionable insights at reasonable cost. Start with the essentials, measure what matters, and evolve as you scale.
Remember: The best observability strategy is the one your team will actually use. Keep it simple, make it actionable, and iterate based on real needs.