Building Observability at Scale
After processing over 1TB of logs and 50K+ metrics weekly, here's what I've learned about building production observability systems.
The Problem with Traditional Monitoring
Traditional monitoring approaches break down at scale. When you're dealing with hundreds of microservices, thousands of instances, and millions of metrics, you can't just throw everything at a single monitoring stack and hope for the best.
Early in our scaling journey, we faced:
- Query timeouts on Grafana dashboards
- High cardinality metrics crushing Prometheus
- Log storage costs spiraling out of control
- Alert fatigue from poorly tuned thresholds
- No clear ownership of observability data
The Three Pillars at Scale
Metrics: The Cardinality Problem
High cardinality metrics are the silent killer of monitoring systems. We learned this the hard way when customer IDs in metric labels brought down our Prometheus cluster.
What worked:
- Aggressive use of recording rules to pre-aggregate data (sketched after this list)
- Strict label policies (no customer IDs, request IDs, or timestamps)
- Separate Prometheus instances for different teams/services
- Thanos for long-term storage and global querying
- Regular cardinality audits using promtool
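As a concrete example, here is a minimal recording-rules sketch of the pre-aggregation approach (metric and label names are illustrative):

groups:
  - name: service_aggregations
    interval: 1m
    rules:
      # Pre-aggregate request rate per service and status code
      - record: service:http_requests:rate5m
        expr: sum by (service_name, status_code) (rate(http_requests_total[5m]))
      # Pre-compute p95 latency so dashboards never touch raw buckets
      - record: service:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (service_name, le) (rate(http_request_duration_seconds_bucket[5m])))

Dashboards and alerts then query the service:* series instead of the raw, high-cardinality metrics.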
Key metrics to track:
# Request rate, error rate, duration (RED method)
rate(http_requests_total[5m])
rate(http_requests_failed_total[5m])
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Saturation metrics
node_filesystem_avail_bytes / node_filesystem_size_bytes
Logs: Cost vs. Value
At $0.50 per GB ingested, our log costs were approaching $15K monthly. We had to make hard decisions about what to keep.
What we did:
- Implemented sampling for high-volume, low-value logs (sketched after this list)
- Moved to Loki for cost-effective log aggregation
- Retained only 7 days of full logs, 30 days of sampled
- Structured logging with consistent fields across services
- Log levels enforced at the platform level
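Here's a rough sketch of the sampling step using Vector's sample transform (the input name and rate are illustrative; error-level events are excluded from sampling):

transforms:
  sample_verbose_logs:
    type: sample
    inputs:
      - parsed_app_logs          # upstream transform name (illustrative)
    rate: 10                     # keep roughly 1 in 10 matching events
    exclude:                     # never sample away errors
      type: vrl
      source: '.level == "error"'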
Structured logging example:
{
  "timestamp": "2024-03-10T10:30:00Z",
  "level": "error",
  "service": "payment-processor",
  "trace_id": "abc123",
  "message": "Payment gateway timeout",
  "error_code": "GATEWAY_TIMEOUT",
  "duration_ms": 5000
}
Traces: The Missing Context
Metrics tell you something is wrong. Logs tell you what. Traces tell you why.
We implemented distributed tracing using OpenTelemetry and saw immediate value:
- Reduced mean time to detection (MTTD) by 60%
- Identified cross-service performance bottlenecks
- Clear understanding of request flows through the system
Critical practices:
- Automatic trace propagation across services
- Sampling strategy (100% for errors, 1% for success; see the collector sketch after this list)
- Correlation IDs linking traces to logs and metrics
- Span attributes for business context (user tier, feature flags)
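The error-biased sampling above can be expressed with the tail_sampling processor from opentelemetry-collector-contrib; a sketch (values illustrative):

processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans before deciding
    policies:
      - name: keep-all-errors     # keep every trace containing an error span
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-success      # ~1% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

Policies are OR-combined, so error traces are always kept and the rest fall through to the 1% probabilistic policy.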
The Tools We Use
Our observability stack evolved through trial and error:
Core Components
- Prometheus + Thanos: Metrics collection and long-term storage
- Loki: Cost-effective log aggregation
- Tempo: Distributed tracing backend
- Grafana: Unified visualization layer
- AlertManager: Alert routing and de-duplication
Integration Layer
- OpenTelemetry Collector: Centralized telemetry processing
- Vector: Log routing and transformation
- Promtail: Log shipping to Loki
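A minimal OpenTelemetry Collector pipeline tying these pieces together might look like this (endpoints are placeholders; logs flow through Vector/Promtail as above):

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp/tempo:                      # traces to Tempo
    endpoint: tempo-distributor:4317
    tls:
      insecure: true
  prometheusremotewrite:           # metrics to Prometheus/Thanos
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]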
Lessons Learned the Hard Way
1. Instrumentation is Code
Treat your observability instrumentation with the same rigor as your application code:
- Code reviews for new metrics
- Tests for alert rules (example after this list)
- Version control for dashboards
- Documentation for what metrics mean
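For alert-rule tests we lean on promtool's unit-test format; a sketch against a hypothetical HighErrorRate rule (series, values, and labels are illustrative):

# alert_tests.yml — run with: promtool test rules alert_tests.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # counter growing by 30 per minute => 0.5 failures/second
      - series: 'http_requests_failed_total{service_name="checkout"}'
        values: '0+30x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service_name: checkout
              severity: page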
2. Default to Structured Data
Every log line, metric label, and trace attribute should follow a schema. We created service templates with:
- Pre-configured logging libraries
- Standard metric naming conventions
- Required trace attributes
- Automatic correlation ID injection
3. Cardinality is Everything
We now have automated checks that reject metric labels that could cause cardinality explosions:
# Allowed labels
- service_name
- endpoint
- status_code
- availability_zone
# Blocked labels (high cardinality)
- customer_id
- request_id
- user_agent
- timestamp
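One scrape-side guard is Prometheus metric relabeling, which strips the blocked labels before they reach storage (a sketch; the job name is illustrative):

scrape_configs:
  - job_name: payment-processor
    metric_relabel_configs:
      # Drop blocked high-cardinality labels before ingestion
      - action: labeldrop
        regex: customer_id|request_id|user_agent|timestamp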
4. Make Alerts Actionable
Every alert must have:
- Clear ownership (who gets paged)
- Runbook link (what to do)
- Context (why it matters)
- Threshold reasoning (how we chose the value)
Bad alert:
alert: HighErrorRate
expr: rate(errors[5m]) > 10
Good alert:
alert: PaymentProcessingErrors
expr: |
  rate(payment_errors_total{severity="critical"}[5m]) > 0.01
  and on (error_type)
  sum(rate(payment_errors_total[5m])) by (error_type) > 5
annotations:
  summary: "Critical payment processing errors detected"
  description: "{{ $value }} payment errors per second"
  runbook: "https://wiki.company.com/runbooks/payment-errors"
  impact: "Customers cannot complete purchases"
labels:
  severity: page
  team: payments
5. Cost Governance from Day One
We implemented cost controls early:
- Metric retention tiers (7d, 30d, 1y based on importance)
- Automatic log sampling for verbose services
- Budget alerts at team level
- Quarterly review of highest-cost signals
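On the log side, several of these guardrails are just Loki limits (values illustrative; per-team overrides live in the runtime overrides file):

limits_config:
  retention_period: 168h        # default tier: 7 days
  ingestion_rate_mb: 10         # per-tenant ingestion cap
  ingestion_burst_size_mb: 20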
Scaling Strategies
Horizontal Scaling
- Prometheus federation for multiple clusters (sketched after this list)
- Separate Loki instances per environment
- Regional Tempo deployments for trace collection
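The federation job pulls only the pre-aggregated recording-rule series from each cluster-local Prometheus (targets and match patterns are illustrative):

scrape_configs:
  - job_name: federate-clusters
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"service:.*"}'   # recording rules only, never raw series
    static_configs:
      - targets:
          - prometheus-us-east.internal:9090
          - prometheus-eu-west.internal:9090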
Vertical Optimization
- Recording rules to reduce query load
- Log aggregation pipelines with Vector
- Trace sampling based on error rates and latency
Data Lifecycle
- Hot storage (7 days): SSD-backed, fast queries
- Warm storage (30 days): Object storage, acceptable latency
- Cold storage (1 year): Compressed archives, rare access
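On the metrics side, the tiers map roughly onto Thanos compactor retention and downsampling flags (a sketch; values are illustrative):

# Thanos compactor arguments (sketch)
args:
  - --retention.resolution-raw=30d    # full-resolution data
  - --retention.resolution-5m=90d     # 5m downsampled
  - --retention.resolution-1h=1y      # 1h downsampled, long-term trends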
Measuring Observability
We track our observability platform itself:
Reliability metrics:
- Query success rate (target: >99.9%)
- P95 query latency (target: <5s)
- Ingestion lag (target: <30s)
Cost metrics:
- Cost per GB ingested
- Cost per 1M metrics
- Total observability spend as % of infrastructure
Value metrics:
- MTTD (mean time to detection)
- MTTR (mean time to resolution)
- Number of incidents found through proactive monitoring
The Bottom Line
Building observability at scale requires:
- Discipline: Enforce standards across teams
- Trade-offs: Not everything needs to be monitored
- Iteration: Your first stack won't be your last
- Cost awareness: Observability can get expensive fast
- Ownership: Make teams responsible for their telemetry
The goal isn't perfect observability—it's actionable insights at reasonable cost. Start with the essentials, measure what matters, and evolve as you scale.
Remember: The best observability strategy is the one your team will actually use. Keep it simple, make it actionable, and iterate based on real needs.