observability · monitoring · prometheus · grafana

Building Observability at Scale: Lessons Learned

March 10, 2024 · 5 min read

After processing over 1TB of logs and 50K+ metrics weekly, here's what I've learned about building production observability systems.

The Problem with Traditional Monitoring

Traditional monitoring approaches break down at scale. When you're dealing with hundreds of microservices, thousands of instances, and millions of metrics, you can't just throw everything at a single monitoring stack and hope for the best.

Early in our scaling journey, we faced:

  • Query timeouts on Grafana dashboards
  • High cardinality metrics crushing Prometheus
  • Log storage costs spiraling out of control
  • Alert fatigue from poorly tuned thresholds
  • No clear ownership of observability data

The Three Pillars at Scale

Metrics: The Cardinality Problem

High cardinality metrics are the silent killer of monitoring systems. We learned this the hard way when customer IDs in metric labels brought down our Prometheus cluster.

What worked:

  • Aggressive use of recording rules to pre-aggregate data (sketch after this list)
  • Strict label policies (no customer IDs, request IDs, or timestamps)
  • Separate Prometheus instances for different teams/services
  • Thanos for long-term storage and global querying
  • Regular cardinality audits using promtool
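
For example, a recording rule that pre-aggregates request rate per service and endpoint, so dashboards query a handful of pre-computed series instead of the raw ones (rule and metric names here are illustrative, not our exact rules):

groups:
  - name: http_aggregations
    interval: 1m
    rules:
      # One pre-computed series per service/endpoint pair instead of
      # thousands of raw per-instance series.
      - record: service_endpoint:http_requests:rate5m
        expr: sum by (service_name, endpoint) (rate(http_requests_total[5m]))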

Key metrics to track:

# Request rate, error rate, duration (RED method)
rate(http_requests_total[5m])
rate(http_requests_failed_total[5m])
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Saturation (how full the filesystem is)
1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)

Logs: Cost vs. Value

At $0.50 per GB ingested, our log costs were approaching $15K monthly. We had to make hard decisions about what to keep.

What we did:

  • Implemented sampling for high-volume, low-value logs (sampling sketch below)
  • Moved to Loki for cost-effective log aggregation
  • Retained only 7 days of full logs, 30 days of sampled
  • Structured logging with consistent fields across services
  • Log levels enforced at the platform level
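
The sampling itself lives in the log pipeline rather than in application code. A minimal sketch of the idea using Vector's sample transform (the source names and exact policy are illustrative, not our production config):

# vector.toml: keep 1 in 10 debug-level events, pass everything else through
[transforms.sample_debug]
type    = "sample"
inputs  = ["app_logs"]          # upstream source, e.g. kubernetes_logs
rate    = 10                    # forward 1 out of every 10 sampled events
exclude = '.level != "debug"'   # non-debug events are never sampled away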

Structured logging example:

{
  "timestamp": "2024-03-10T10:30:00Z",
  "level": "error",
  "service": "payment-processor",
  "trace_id": "abc123",
  "message": "Payment gateway timeout",
  "error_code": "GATEWAY_TIMEOUT",
  "duration_ms": 5000
}
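
For illustration, one minimal way to emit lines in that shape with Python's standard logging; the hard-coded service name and the trace_id/fields plumbing are placeholders for what our shared platform libraries inject automatically:

import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line following the shared schema."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-processor",            # injected by the service template in practice
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        # Carry through structured extras such as error_code or duration_ms.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-processor")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment gateway timeout",
    extra={"trace_id": "abc123",
           "fields": {"error_code": "GATEWAY_TIMEOUT", "duration_ms": 5000}},
)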

Traces: The Missing Context

Metrics tell you something is wrong. Logs tell you what. Traces tell you why.

We implemented distributed tracing using OpenTelemetry and saw immediate value:

  • Reduced mean time to detection (MTTD) by 60%
  • Identified cross-service performance bottlenecks
  • Clear understanding of request flows through the system

Critical practices:

  • Automatic trace propagation across services
  • Sampling strategy (100% for errors, 1% for success; tail-sampling sketch below)
  • Correlation IDs linking traces to logs and metrics
  • Span attributes for business context (user tier, feature flags)
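
The 100%-errors / 1%-success split is a tail-based sampling decision, since you only know whether a trace contains an error after it completes. One way to express it is the OpenTelemetry Collector's tail_sampling processor (shipped in the contrib distribution; values here are illustrative):

processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the trace is complete
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]   # keep 100% of traces containing an error span
      - name: sample-success
        type: probabilistic
        probabilistic:
          sampling_percentage: 1  # keep 1% of everything else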

The Tools We Use

Our observability stack evolved through trial and error:

Core Components

  • Prometheus + Thanos: Metrics collection and long-term storage
  • Loki: Cost-effective log aggregation
  • Tempo: Distributed tracing backend
  • Grafana: Unified visualization layer
  • AlertManager: Alert routing and de-duplication

Integration Layer

  • OpenTelemetry Collector: Centralized telemetry processing
  • Vector: Log routing and transformation
  • Promtail: Log shipping to Loki

Lessons Learned the Hard Way

1. Instrumentation is Code

Treat your observability instrumentation with the same rigor as your application code:

  • Code reviews for new metrics
  • Tests for alert rules (see the promtool sketch after the alert examples below)
  • Version control for dashboards (provisioning sketch after this list)
  • Documentation for what metrics mean
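
For dashboards, "version control" concretely means dashboard JSON files checked into git and loaded through Grafana's file-based provisioning; a minimal sketch (folder and path names are ours to choose):

# grafana/provisioning/dashboards/team-dashboards.yml
apiVersion: 1
providers:
  - name: team-dashboards
    folder: Observability
    type: file
    disableDeletion: true                  # the repo, not the UI, is the source of truth
    options:
      path: /var/lib/grafana/dashboards    # dashboard JSON synced from the repo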

2. Default to Structured Data

Every log line, metric label, and trace attribute should follow a schema. We created service templates with:

  • Pre-configured logging libraries
  • Standard metric naming conventions
  • Required trace attributes
  • Automatic correlation ID injection

3. Cardinality is Everything

We now have automated checks that reject metric labels that could cause cardinality explosions:

# Allowed labels
- service_name
- endpoint
- status_code
- availability_zone

# Blocked labels (high cardinality)
- customer_id
- request_id
- user_agent
- timestamp
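
However the checks themselves are wired up, a cheap last line of defense is to strip known offenders at scrape time with a Prometheus relabeling rule (a sketch of the idea):

# prometheus.yml, per scrape job
metric_relabel_configs:
  - action: labeldrop
    regex: "customer_id|request_id|user_agent"   # drop known high-cardinality labels before ingestion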

4. Make Alerts Actionable

Every alert must have:

  • Clear ownership (who gets paged)
  • Runbook link (what to do)
  • Context (why it matters)
  • Threshold reasoning (how we chose the value)

Bad alert:

alert: HighErrorRate
expr: rate(errors[5m]) > 10

Good alert:

alert: PaymentProcessingErrors
expr: |
  rate(payment_errors_total{severity="critical"}[5m]) > 0.01
  and on(error_type)
  sum(rate(payment_errors_total[5m])) by (error_type) > 5
annotations:
  summary: "Critical payment processing errors detected"
  description: "{{ $value }} payment errors per second"
  runbook: "https://wiki.company.com/runbooks/payment-errors"
  impact: "Customers cannot complete purchases"
labels:
  severity: page
  team: payments
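
That rule is also a good place to show what "tests for alert rules" looks like in practice. A minimal promtool unit-test sketch, assuming the rule above is wrapped in a standard groups/rules block in alerts.yml (the input series and values are synthetic):

# alerts_test.yml, run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'payment_errors_total{severity="critical", error_type="gateway_timeout"}'
        values: '0+600x15'              # counter grows 600/min, i.e. 10 errors/sec
    alert_rule_test:
      - eval_time: 15m
        alertname: PaymentProcessingErrors
        exp_alerts:
          - exp_labels:
              severity: page
              team: payments
              error_type: gateway_timeout
            exp_annotations:
              summary: "Critical payment processing errors detected"
              description: "10 payment errors per second"
              runbook: "https://wiki.company.com/runbooks/payment-errors"
              impact: "Customers cannot complete purchases"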

5. Cost Governance from Day One

We implemented cost controls early:

  • Metric retention tiers (7d, 30d, 1y based on importance)
  • Automatic log sampling for verbose services
  • Budget alerts at team level
  • Quarterly review of highest-cost signals

Scaling Strategies

Horizontal Scaling

  • Prometheus federation for multiple clusters (config sketch below)
  • Separate Loki instances per environment
  • Regional Tempo deployments for trace collection
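
Concretely, federation means the global Prometheus scrapes a filtered /federate endpoint on each cluster-local instance, pulling only pre-aggregated series. A sketch (job names, targets, and the series selector are placeholders):

# global prometheus.yml: pull only recording-rule outputs from each cluster
scrape_configs:
  - job_name: federate-clusters
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"service_endpoint:.*"}'   # aggregated series only, never raw ones
    static_configs:
      - targets:
          - prometheus.cluster-a.internal:9090
          - prometheus.cluster-b.internal:9090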

Vertical Optimization

  • Recording rules to reduce query load
  • Log aggregation pipelines with Vector
  • Trace sampling based on error rates and latency

Data Lifecycle

  • Hot storage (7 days): SSD-backed, fast queries
  • Warm storage (30 days): Object storage, acceptable latency
  • Cold storage (1 year): Compressed archives, rare access
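
On the metrics side, the closest concrete knob is the Thanos compactor's retention per downsampling resolution; a sketch that mirrors the tiers above (exact values depend on your compaction and downsampling settings):

thanos compact \
  --wait \
  --objstore.config-file=bucket.yml \
  --retention.resolution-raw=7d \
  --retention.resolution-5m=30d \
  --retention.resolution-1h=365d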

Measuring Observability

We track our observability platform itself:

Reliability metrics:

  • Query success rate (target: >99.9%)
  • P95 query latency (target: <5s)
  • Ingestion lag (target: <30s)
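
Most of these come from the stack's own self-metrics. For example, query success rate on a Prometheus instance can be approximated from its HTTP metrics (label values vary slightly between versions):

# Share of /api/v1/query requests returning 2xx over the last hour
sum(rate(prometheus_http_requests_total{handler="/api/v1/query", code=~"2.."}[1h]))
  /
sum(rate(prometheus_http_requests_total{handler="/api/v1/query"}[1h]))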

Cost metrics:

  • Cost per GB ingested
  • Cost per 1M metrics
  • Total observability spend as % of infrastructure

Value metrics:

  • MTTD (mean time to detection)
  • MTTR (mean time to resolution)
  • Number of incidents found through proactive monitoring

The Bottom Line

Building observability at scale requires:

  1. Discipline: Enforce standards across teams
  2. Trade-offs: Not everything needs to be monitored
  3. Iteration: Your first stack won't be your last
  4. Cost awareness: Observability can get expensive fast
  5. Ownership: Make teams responsible for their telemetry

The goal isn't perfect observability—it's actionable insights at reasonable cost. Start with the essentials, measure what matters, and evolve as you scale.

Remember: The best observability strategy is the one your team will actually use. Keep it simple, make it actionable, and iterate based on real needs.