Cost Optimization · April 7, 2026 · 15 min read

Why Your Observability Bill Is Out of Control (And the Open-Source Fix)

Nobody planned for this. The same platform that cost $800/month with 10 engineers starts costing $14,000/month at 50. Here's why it happens and how to fix it.

Something strange happens around the time an engineering team starts to grow.

The same observability platform that cost you $800/month when you had 10 engineers starts costing $14,000/month when you hit 50. By the time you're 100 engineers shipping to production daily, the bill is somewhere between $50,000 and $150,000 per year — and climbing. For larger enterprises, seven-figure observability spend is table stakes.

Nobody planned for this. It wasn't in the budget. And yet here you are, in a meeting defending why you're spending more on watching your software than you spent building it.

This isn't bad luck. It's architecture. The billing models underlying most commercial observability platforms are specifically designed to grow faster than your engineering team does. Understanding why is the first step toward fixing it.

How Observability Bills Actually Work (And Why They Spiral)

The Host-Based Trap

Traditional APM vendors like Datadog built their billing around a deceptively simple metric: the number of hosts you're monitoring. In the early days of monolithic servers, this made sense. You had 10 servers, you paid for 10 agents, done.

Then came containers and Kubernetes. A single physical node might run 50–100 ephemeral pods, each technically a "host" from the agent's perspective. Autoscaling means your host count fluctuates wildly, and many platforms bill on the 99th percentile of usage, so even a brief autoscale spike inflates the bill for the entire month.
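
High-percentile billing is easy to see in a quick back-of-envelope calculation. This sketch models a hypothetical month of hourly host-count samples (the cluster size, spike duration, and per-host rate are all assumptions for illustration, not any vendor's actual price sheet):

```python
from statistics import quantiles

# Hypothetical month: a steady 20-node cluster that autoscales to
# 120 pods for a 7-hour traffic spike (720 hourly samples total).
hourly_host_counts = [20] * 713 + [120] * 7

# Billing on the 99th percentile of samples prices the whole month
# at near-peak capacity, even though the spike lasted 7 hours.
p99 = quantiles(hourly_host_counts, n=100)[98]

rate_per_host = 23  # assumed $/host/month, purely illustrative
print(f"p99 hosts: {p99:.0f} -> bill: ${p99 * rate_per_host:,.0f}")
print(f"avg hosts: {sum(hourly_host_counts) / 720:.0f}")
```

Seven hours of autoscaling roughly quintuples the month's bill relative to what average-usage billing would charge.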

Teams that moved from VMs to Kubernetes expecting cost savings were blindsided. Their infrastructure costs dropped; their observability costs doubled.

The Dual-Billing Problem for Logs

Logs are where things get truly painful. Many major vendors charge you twice for the same data:

  • Ingestion fee — to receive and process your logs
  • Indexing fee — to make them searchable for analysis

The math compounds quickly. At scale, a platform ingesting 500GB of logs per day can cost over $10,000/month in indexing costs alone — even if you already paid to get the logs in. Some teams respond by aggressively filtering logs at the source, inadvertently destroying the diagnostic signal they were paying to preserve in the first place.
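
The dual-billing math is worth writing out. This sketch uses assumed per-GB rates in the general range commercial platforms publish (the exact figures are illustrative, not quoted from any vendor):

```python
# Back-of-envelope dual-billing math: pay once to ingest every GB,
# then again to index the portion you want searchable.
ingest_gb_per_day = 500
days = 30

ingest_rate = 0.10      # $/GB ingested (assumed rate)
index_rate = 1.70       # $/GB indexed with retention (assumed rate)
indexed_fraction = 0.5  # suppose only half the volume is indexed

monthly_gb = ingest_gb_per_day * days
ingest_cost = monthly_gb * ingest_rate
index_cost = monthly_gb * indexed_fraction * index_rate

print(f"ingest: ${ingest_cost:,.0f}")
print(f"index:  ${index_cost:,.0f}")
print(f"total:  ${ingest_cost + index_cost:,.0f}")
```

Even indexing only half the volume, indexing dwarfs ingestion, which is exactly the pressure that pushes teams toward over-aggressive source filtering.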

The Custom Metrics Cardinality Explosion

Custom metrics are billed by cardinality — the number of unique time series generated by your metric labels. This is where many observability bills become genuinely unpredictable.

Consider a simple metric like http_request_duration_seconds. Add a user_id label to enable per-user latency tracking, and if you have 100,000 users, you've just created 100,000 time series from a single metric. At typical commercial rates, that one decision can add thousands of dollars to your monthly bill.

In practice, this creates perverse incentives. Engineers strip useful dimensional context from their metrics to keep bills manageable. The observability that was supposed to give you more visibility leaves you with less.
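
The cardinality multiplication behind this is simple: every label multiplies the worst-case number of distinct time series a metric can emit. A minimal sketch (label names and cardinalities are invented for illustration; in practice only the label combinations that actually occur materialize as series, but a high-cardinality label dominates either way):

```python
from math import prod

# Worst-case series count = product of per-label cardinalities.
def series_count(label_cardinalities: dict[str, int]) -> int:
    return prod(label_cardinalities.values())

safe = {"method": 5, "status_code": 8, "service": 12}
risky = {**safe, "user_id": 100_000}  # the one fateful label

print(series_count(safe))   # 480
print(series_count(risky))  # 48,000,000
```

At any per-series price, the second number is why `user_id` never ships as a label, and why the dashboard quietly loses its per-user dimension.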

The Root Cause: Proprietary Data Lock-In

All of these billing dynamics share a common root: your telemetry data lives in a proprietary format inside a proprietary system, and the vendor is the only one who can access it.

This isn't a coincidence. When you can't easily move your data, you can't easily move your workload. You're not choosing an observability vendor — you're taking on a long-term financial dependency that compounds with every byte you ingest.

The good news? The industry has been quietly building a solution to this for the past four years.

OpenTelemetry Has Changed Everything

OpenTelemetry — the open-source observability framework backed by the CNCF — has become the de facto standard for instrumenting cloud-native applications. As of 2026:

  • 95% of new cloud-native instrumentation uses OpenTelemetry
  • It's the second most active CNCF project by contributor count, trailing only Kubernetes
  • 58% of adopters cite vendor portability as their primary motivation
  • The OpenTelemetry Collector now processes over 10 billion spans per day across known public deployments — a 4x increase from 2023

The signal is unmistakable: the industry has voted. Proprietary instrumentation agents are a legacy pattern. OpenTelemetry is the standard.

But here's the thing most teams miss: OpenTelemetry solves the collection side of the problem, not the storage side. You can instrument your application with OTel today and still send all that telemetry straight into a $150,000/year proprietary platform. You've gained portability at the edge — but if you're not thinking carefully about where that data lands, you've only solved half the problem.

The other half is the backend.

Why ClickHouse Is the Right Backend for Observability Data

Observability data has specific characteristics that make it a poor fit for many general-purpose databases and search engines:

  • Massive write volume — millions of log lines and trace spans per minute
  • High cardinality — billions of unique series across labels, trace IDs, service names
  • Time-series access patterns — almost all queries are range queries over time
  • Scan-heavy analytics — "how many 5xx errors in the last hour broken down by service?" requires scanning entire time ranges, not point lookups

ClickHouse, a columnar OLAP database originally developed at Yandex, was purpose-built for exactly this workload. Its architecture delivers several decisive advantages:

Columnar storage means that a query touching only the status_code and timestamp columns of a trillion-row log table never reads the other 20 columns. This translates directly to query speed and storage efficiency.

LZ4 compression on columnar data achieves compression ratios that are dramatically better than row-oriented storage — often 10–20x for log data, which is highly repetitive in structure. Storing a terabyte of logs in ClickHouse commonly requires less than 100GB on disk.
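
The "logs are highly repetitive" point is easy to demonstrate. Python's stdlib has no LZ4 binding, so this sketch uses zlib (DEFLATE) instead; the absolute ratios differ from ClickHouse's codecs, but the effect of repetitive structure, and of grouping a single column's values together, shows up the same way:

```python
import json
import random
import zlib

random.seed(0)

# Synthesize structurally repetitive JSON log lines.
rows = [
    {
        "timestamp": 1_760_000_000 + i,
        "level": random.choice(["INFO", "WARN", "ERROR"]),
        "service": "checkout",
        "status_code": random.choice([200, 200, 200, 500]),
        "message": "request completed",
    }
    for i in range(10_000)
]

raw = "\n".join(json.dumps(r) for r in rows).encode()
print(f"row-oriented ratio: {len(raw) / len(zlib.compress(raw)):.1f}x")

# A single column's values grouped together compress even better:
# the idea behind ClickHouse's per-column compression.
col = "\n".join(str(r["status_code"]) for r in rows).encode()
print(f"status_code column: {len(col) / len(zlib.compress(col)):.1f}x")
```

Exact ratios depend on the data and codec, but double-digit compression on structured logs is routine.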

Vectorized query execution means analytical queries that would take seconds in PostgreSQL or Elasticsearch complete in milliseconds at scale.
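
To make the columnar advantage concrete, here is a toy column store in pure Python, one list per column. The point is structural, not a benchmark: the error-breakdown query from the bullet list above touches exactly two columns and never reads the wide `body` column a row store would have to drag along:

```python
from collections import Counter

# Toy column store: one Python list per column (values are synthetic).
N = 1_000_000
table = {
    "timestamp": list(range(N)),
    "status_code": [500 if i % 49 == 0 else 200 for i in range(N)],
    "service": ["checkout" if i % 2 else "search" for i in range(N)],
    "body": ["..."] * N,  # wide column the query below never touches
}

# "How many 5xx errors, broken down by service?" -- reads only
# the status_code and service columns.
errors_by_service = Counter(
    svc
    for code, svc in zip(table["status_code"], table["service"])
    if code >= 500
)
print(errors_by_service)
```

ClickHouse adds sorted storage, sparse indexes, and vectorized execution on top of this layout, but skipping irrelevant columns entirely is the foundational win.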

The result: the same observability workload that costs $10,000/month on a commercial platform can run on ClickHouse-backed infrastructure for $500–$1,000/month. That's not an exaggeration — it's the math that makes platforms like Qorrelate possible.

What OpenTelemetry + ClickHouse Looks Like in Practice

Here's the architecture that powers cost-efficient, high-performance observability for modern engineering teams:

Your Services
    │
    ▼
OpenTelemetry SDKs (vendor-neutral instrumentation)
    │
    ▼
OpenTelemetry Collector (routing, filtering, transformation)
    │
    ▼
ClickHouse-Backed Observability Platform
(logs + metrics + traces + session replays — unified)

Each layer is open and swappable. Your application code uses standard OTel APIs. The Collector can route to multiple backends simultaneously, transform data in flight, and apply sampling policies to control volume. The backend uses ClickHouse for storage, giving you sub-second query performance at petabyte scale.

Contrast this with the proprietary approach:

Your Services
    │
    ▼
Proprietary Agent (vendor lock-in starts here)
    │
    ▼
Vendor's Ingest Pipeline (you pay per GB)
    │
    ▼
Vendor's Proprietary Storage (you pay per indexed GB)
    │
    ▼
Vendor's Query Layer (you pay per query beyond the free tier)

The proprietary model charges you at every layer. The open model charges you only for storage and compute — at commodity cloud prices.

The Real Cost Comparison

Let's run the numbers for a mid-sized engineering team with realistic observability requirements:

Workload                               Datadog (estimated)   OTel + ClickHouse-Native
500GB logs/month                       ~$1,500–$2,500        ~$75–$100
50M trace spans/month                  ~$1,500–$3,000        ~$25
10K custom metric series               ~$500–$1,000          Included
Infrastructure monitoring (50 hosts)   ~$2,000–$3,500        ~$50–$100
Monthly total                          $5,500–$10,000        $150–$275

At scale, this difference compounds dramatically. Teams spending $100,000/year on Datadog are often paying $90,000 for the privilege of proprietary lock-in and $10,000 for the actual infrastructure that stores and serves their data.

Getting Started Without Ripping Out Everything

The most common objection to migration is inertia: "We have years of dashboards, alerts, and workflows built around our current platform. We can't just tear it all out."

You don't have to.

OpenTelemetry supports side-by-side routing. You can configure the OTel Collector to send telemetry to both your existing platform and a new backend simultaneously — zero service disruption, immediate visibility into the new system. Most teams run parallel ingest for 30 days, validate that the new platform captures everything they need, and then cut over.
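
Dual export is a few lines of Collector configuration. The sketch below mirrors the Collector's YAML structure as a Python dict so it can be checked programmatically; the `otlp` receiver, `batch` processor, and `otlphttp` exporter are real Collector components, while both endpoints are placeholders you would replace with your vendors' actual OTLP URLs:

```python
# One pipeline, two exporters: both backends receive every span
# during the parallel-ingest period. Endpoints are assumptions.
collector_config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "processors": {"batch": {}},
    "exporters": {
        "otlphttp/existing": {"endpoint": "https://otlp.current-vendor.example"},
        "otlphttp/new": {"endpoint": "https://otlp.new-backend.example"},
    },
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlphttp/existing", "otlphttp/new"],
            }
        }
    },
}
```

Cutting over later means deleting one exporter from the `exporters` list; application code never changes.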

A practical migration path looks like this:

  1. Week 1–2: Instrument new services with OpenTelemetry SDKs. Configure the Collector to route to both platforms.
  2. Week 3–4: Migrate existing services from proprietary agents to OTel SDKs. No code changes required for many languages — only configuration.
  3. Month 2: Rebuild critical dashboards and alerts in the new platform. OTel's standard semantic conventions mean your log fields, metric names, and span attributes follow predictable patterns.
  4. Month 3: Decommission the proprietary agents. Collect your first dramatically smaller invoice.
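
The semantic-conventions point in step 3 is what makes dashboard rebuilds mechanical rather than archaeological: fields arrive under predictable names. A minimal sketch of remapping proprietary log fields to standard keys (the legacy field names here are invented; the target keys follow OTel's stable HTTP semantic conventions):

```python
# Map assumed proprietary field names to OTel semantic-convention keys.
SEMCONV_MAP = {
    "svc": "service.name",
    "verb": "http.request.method",
    "code": "http.response.status_code",
}

def to_semconv(record: dict) -> dict:
    """Rename known fields; pass unknown fields through unchanged."""
    return {SEMCONV_MAP.get(k, k): v for k, v in record.items()}

legacy = {"svc": "checkout", "verb": "POST", "code": 502, "region": "us-east-1"}
print(to_semconv(legacy))
```

Once every service emits the same key for the same concept, one alert template covers the whole fleet.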

The hardest part isn't technical — it's organizational. Changing the platform your team reaches for when something breaks at 3am requires trust, which requires the new platform to prove itself during the parallel period. Give it that chance.

What to Look for in an OpenTelemetry-Native Platform

Not all "OTel-compatible" platforms are created equal. Some legacy platforms added an OTel ingest endpoint while keeping their proprietary billing model intact. You send OTel data; they charge you like it's their own format.

A genuinely OTel-native platform should offer:

  • Native OTLP ingest — the OpenTelemetry Protocol, not a translation layer
  • Unified signals — logs, metrics, traces, and session replays in a single query interface, not siloed products
  • Predictable pricing — per GB, per span, per metric series — not per host, per user, or per "feature module"
  • No proprietary agent requirement — your standard OTel Collector should be sufficient
  • Open-source core — you should be able to self-host if you need to, as a safeguard against future pricing changes

Qorrelate: Built on OpenTelemetry and ClickHouse from Day One

Qorrelate is a modern observability platform built specifically around this architecture. We didn't bolt OpenTelemetry support onto a legacy system — we built the entire platform on OTel as the instrumentation layer and ClickHouse as the storage layer, because we believed (and the industry has since confirmed) that this is the right foundation for scalable, affordable observability.

The result is a platform that's 12x cheaper than Datadog for equivalent workloads, with:

  • 50+ native integrations — including AWS, GCP, Azure, Kubernetes, PostgreSQL, Redis, and all major language SDKs
  • Unified observability — logs, metrics, traces, session replays, and workflow automation in one place
  • SIEM capabilities — security event detection built on the same data pipeline as your performance telemetry
  • BYOK encryption — bring your own keys for data at rest, required by many compliance frameworks
  • MIT-licensed open-source core — full transparency, self-hosting supported, no proprietary lock-in

Pricing is straightforward: $0.15/GB for logs, $0.20/1K unified time series, $0.50/1M trace spans. No host-based pricing, no dual billing, no cardinality multipliers.
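
With flat per-unit rates, the whole pricing model fits in one function. This sketch recomputes a bill for the mid-sized workload from the comparison table earlier in the post, using the published rates above:

```python
LOG_RATE = 0.15     # $/GB of logs
SERIES_RATE = 0.20  # $/1K unified time series
SPAN_RATE = 0.50    # $/1M trace spans

def monthly_bill(log_gb: float, series: int, spans_m: float) -> float:
    """Monthly cost from the three published per-unit rates."""
    return log_gb * LOG_RATE + series / 1000 * SERIES_RATE + spans_m * SPAN_RATE

# 500GB logs, 10K metric series, 50M spans per month.
bill = monthly_bill(log_gb=500, series=10_000, spans_m=50)
print(f"${bill:,.2f}")  # $102.00
```

No host counts, percentile high-water marks, or cardinality multipliers appear anywhere in the function, which is the structural difference the post argues for.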

There's also a free tier — 5GB logs, 5K metric time series, and 500K trace spans per month, with 7-day retention and up to 3 team members. No credit card required.

If your current observability bill is giving you anxiety, take Qorrelate for a free spin and run the parallel ingest experiment. The math usually speaks for itself.

The Bottom Line

Observability costs spiral out of control because the dominant platforms are designed to make your bill grow faster than your business. Host-based pricing, dual log billing, and cardinality-driven metric costs are structural features, not bugs — they're how those businesses grow revenue.

OpenTelemetry has dismantled the instrumentation lock-in that made this dynamic unavoidable. ClickHouse has proven that the storage problem is solved. What remains is the will to make the switch.

For most engineering teams, the question isn't whether migrating to an OTel-native, ClickHouse-backed platform makes financial sense. It's whether the migration friction is worth $80,000 a year in savings. We think it usually is.

Have questions about migrating your observability stack? Reach out to us at support@qorrelate.io — we're happy to walk through the specifics of your current setup and what a migration would look like.
