<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[OllyGarden]]></title><description><![CDATA[Fix Your Telemetry. Autonomously.]]></description><link>https://blog.olly.garden</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1761335922273/94e9fddb-b4b1-424e-a285-f74fe440b99d.png</url><title>OllyGarden</title><link>https://blog.olly.garden</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 03:18:02 GMT</lastBuildDate><atom:link href="https://blog.olly.garden/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Severity-Based Log Routing with the OpenTelemetry Collector]]></title><description><![CDATA[Log storage costs scale with volume, and modern applications generate extraordinary volumes. A distributed system handling thousands of requests per second can easily produce millions of log records d]]></description><link>https://blog.olly.garden/severity-based-log-routing-with-the-opentelemetry-collector</link><guid isPermaLink="true">https://blog.olly.garden/severity-based-log-routing-with-the-opentelemetry-collector</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[log management]]></category><category><![CDATA[Log Routing]]></category><category><![CDATA[opentelemetry collector]]></category><category><![CDATA[telemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Thu, 19 Mar 2026 10:28:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768832177546/a244d6b5-d21a-4718-8d94-5ac16d3985f6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Log storage costs scale with volume, and modern applications generate extraordinary volumes. 
A distributed system handling thousands of requests per second can easily produce millions of log records daily, the vast majority of which are INFO or DEBUG messages that exist primarily for post-hoc debugging. Sending all of this to a managed observability backend accumulates costs rapidly, yet dropping these logs entirely means losing the context you need when something goes wrong.</p>
<p>Analysis of real-world log traffic reveals a striking pattern: the vast majority of logs sent to vendor backends are INFO severity and lower. In my <a href="https://www.youtube.com/watch?v=kdzeUiMI_t4">KubeCon North America 2025 talk with Michele Mancioppi</a>, we presented findings showing this imbalance across production environments. The ideal scenario inverts this distribution: vendor backends should primarily receive WARN and above, the logs that signal problems requiring attention, while verbose logs flow to cheaper storage tiers.</p>
<p>The fundamental insight is that not all logs require the same storage tier. ERROR and WARN messages demand immediate visibility and fast query performance because they indicate problems requiring human attention. INFO and DEBUG messages, by contrast, primarily serve forensic purposes: understanding what happened before an error occurred. These forensic logs can tolerate slower query performance and longer retrieval times in exchange for dramatically lower storage costs.</p>
<p>The OpenTelemetry Collector's routing connector enables this tiered storage pattern by evaluating each log record's severity and directing it to the appropriate destination. Important logs flow to your vendor backend for alerting and dashboards. Verbose logs flow to object storage for cost-effective archival. The result is observability that remains comprehensive without the comprehensive bill.</p>
<h2>Understanding Log Severity in OpenTelemetry</h2>
<p>Before configuring routing, understanding how OpenTelemetry represents log severity is essential. The OTLP data model defines a 24-level severity scale through the <code>severity_number</code> field, grouped into six base levels with four sub-levels each.</p>
<table>
<thead>
<tr>
<th>Range</th>
<th>Base Level</th>
<th>Typical Use</th>
</tr>
</thead>
<tbody><tr>
<td>1-4</td>
<td>TRACE</td>
<td>Fine-grained debugging, execution flow</td>
</tr>
<tr>
<td>5-8</td>
<td>DEBUG</td>
<td>Diagnostic information for developers</td>
</tr>
<tr>
<td>9-12</td>
<td>INFO</td>
<td>Normal operational messages</td>
</tr>
<tr>
<td>13-16</td>
<td>WARN</td>
<td>Potential issues that may require attention</td>
</tr>
<tr>
<td>17-20</td>
<td>ERROR</td>
<td>Errors that require investigation</td>
</tr>
<tr>
<td>21-24</td>
<td>FATAL</td>
<td>Critical failures, system crashes</td>
</tr>
</tbody></table>
<p>The base severity number for each level represents the first value in its range: TRACE is 1, DEBUG is 5, INFO is 9, WARN is 13, ERROR is 17, and FATAL is 21. When routing by severity, you compare against these base values.</p>
<p>The <code>severity_text</code> field preserves the original severity name from the source logging framework. A Java application using <code>java.util.logging</code> might emit <code>SEVERE</code> as <code>severity_text</code> while the collector maps it to <code>severity_number</code> 17 (ERROR). This dual representation lets you route on normalized numbers while retaining source-specific terminology.</p>
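<p>A simplified sketch of how one such record carries both fields (field names abbreviated here; the real OTLP payload nests this inside <code>resourceLogs</code>, <code>scopeLogs</code>, and <code>logRecords</code>):</p>
<pre><code class="language-yaml"># Both representations travel together on the same log record
severity_text: SEVERE       # preserved verbatim from java.util.logging
severity_number: 17         # normalized into the ERROR range (17-20)
body: "Could not open connection to database"
</code></pre>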
<p>OTTL, the OpenTelemetry Transformation Language used by the routing connector, provides named constants for severity comparisons. Writing <code>severity_number &gt;= SEVERITY_NUMBER_WARN</code> is clearer than writing <code>severity_number &gt;= 13</code> and survives specification changes should the numeric values ever be updated.</p>
<h2>The Routing Connector</h2>
<p>The routing connector evaluates telemetry against a routing table and forwards it to matching pipelines. Unlike processors, which transform data in place, connectors sit between pipelines, receiving from one and emitting to others. This architecture enables fan-out patterns where a single input pipeline routes to multiple output pipelines based on arbitrary conditions.</p>
<p>For severity-based routing, the connector examines each log record's <code>severity_number</code> field and routes to different pipelines depending on the value. The routing table uses OTTL conditions, so you have access to the full expression language for complex routing logic.</p>
<p>The connector operates as both an exporter (from the perspective of the input pipeline) and a receiver (from the perspective of output pipelines). This dual role is reflected in how you wire it in the service section: the input pipeline exports to the connector, while output pipelines receive from it.</p>
<h2>Complete Configuration</h2>
<p>The following configuration demonstrates severity-based routing with two tiers: important logs (WARN and above) route to an observability backend via OTLP, while informational logs (INFO and below) route to local files for archival. This example uses a local LGTM stack (Loki, Grafana, Tempo, Mimir) as the vendor backend, making it easy to test the pattern locally before deploying to production.</p>
<p>Start the LGTM stack with Docker, mapping the OTLP gRPC port to 14317 to avoid conflicts with the collector:</p>
<pre><code class="language-bash">docker run -d --name lgtm -p 3000:3000 -p 14317:4317 grafana/otel-lgtm
</code></pre>
<pre><code class="language-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch/vendor:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048

  batch/file:
    timeout: 10s
    send_batch_size: 5000
    send_batch_max_size: 10000

exporters:
  otlp/lgtm:
    endpoint: localhost:14317
    tls:
      insecure: true

  file:
    path: ./archive.jsonl
    rotation:
      max_megabytes: 100
      max_days: 7
      max_backups: 10
    format: json

connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    error_mode: ignore
    table:
      - context: log
        condition: severity_number &gt;= SEVERITY_NUMBER_WARN
        pipelines: [logs/vendor]

service:
  pipelines:
    logs/intake:
      receivers: [otlp]
      exporters: [routing/severity]

    logs/vendor:
      receivers: [routing/severity]
      processors: [batch/vendor]
      exporters: [otlp/lgtm]

    logs/archive:
      receivers: [routing/severity]
      processors: [batch/file]
      exporters: [file]
</code></pre>
<p>The configuration defines three pipelines forming a routing topology. The intake pipeline receives all logs via OTLP and exports to the routing connector. The routing connector evaluates each log record: those with <code>severity_number &gt;= SEVERITY_NUMBER_WARN</code> (13 or higher, meaning WARN, ERROR, or FATAL) route to the vendor pipeline, while everything else routes to the archive pipeline via <code>default_pipelines</code>.</p>
<p>Notice that each output pipeline has its own batch processor with different parameters. The vendor pipeline uses aggressive batching with 1-second timeouts and smaller batches optimized for near-real-time delivery. The archive pipeline uses relaxed batching with 10-second timeouts and larger batches optimized for file write efficiency. This demonstrates a key benefit of the routing pattern: each destination can have processing tuned to its characteristics.</p>
<h2>Configuration Walkthrough</h2>
<p>The <code>routing/severity</code> connector configuration warrants closer examination.</p>
<pre><code class="language-yaml">connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    error_mode: ignore
    table:
      - context: log
        condition: severity_number &gt;= SEVERITY_NUMBER_WARN
        pipelines: [logs/vendor]
</code></pre>
<p>The <code>default_pipelines</code> field specifies where unmatched logs route. In this configuration, logs that do not match the WARN-or-higher condition route to the archive pipeline. Without <code>default_pipelines</code>, unmatched logs would be silently dropped, which is rarely the desired behavior.</p>
<p>The <code>error_mode: ignore</code> setting determines behavior when OTTL condition evaluation fails. With <code>ignore</code>, evaluation errors log a warning and route the affected log to <code>default_pipelines</code>. The alternative, <code>propagate</code>, causes evaluation errors to fail the entire batch, potentially losing data. Production configurations should almost always use <code>ignore</code>.</p>
<p>The <code>context: log</code> setting means the condition evaluates per individual log record. Alternative contexts like <code>resource</code> evaluate once per ResourceLogs batch, which is more efficient but cannot inspect log-level fields like <code>severity_number</code>. For severity-based routing, log context is required.</p>
<p>The condition <code>severity_number &gt;= SEVERITY_NUMBER_WARN</code> uses OTTL's named severity constant. This matches WARN (13-16), ERROR (17-20), and FATAL (21-24). The <code>SEVERITY_NUMBER_WARN</code> constant evaluates to 13, the base value for the WARN range.</p>
<h2>Routing to Multiple Destinations</h2>
<p>Some organizations want important logs sent to both the vendor backend and archival storage for redundancy. The routing connector supports this by listing multiple pipelines in a single route.</p>
<pre><code class="language-yaml">connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    error_mode: ignore
    table:
      - context: log
        condition: severity_number &gt;= SEVERITY_NUMBER_WARN
        pipelines: [logs/vendor, logs/archive]
</code></pre>
<p>With this configuration, logs at WARN and above route to both pipelines, while INFO and below route only to archive. This ensures critical logs have redundant storage while still benefiting from reduced vendor costs for verbose logs.</p>
<h2>Splitting at INFO</h2>
<p>The boundary between important and archival logs is a policy decision. The example above uses WARN as the threshold, sending INFO to archival storage. Some organizations prefer to keep INFO in the vendor backend for operational visibility while archiving only DEBUG and TRACE levels.</p>
<pre><code class="language-yaml">connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    error_mode: ignore
    table:
      - context: log
        condition: severity_number &gt;= SEVERITY_NUMBER_INFO
        pipelines: [logs/vendor]
</code></pre>
<p>Changing <code>SEVERITY_NUMBER_WARN</code> to <code>SEVERITY_NUMBER_INFO</code> shifts the boundary. Now INFO (9-12), WARN (13-16), ERROR (17-20), and FATAL (21-24) route to the vendor, while only DEBUG (5-8) and TRACE (1-4) route to archive.</p>
<p>The cost implications depend on your log distribution. If 90% of your logs are DEBUG level, archiving DEBUG yields substantial savings. If DEBUG logs are rare but INFO logs are prolific, archiving only DEBUG may not meaningfully reduce vendor costs.</p>
<h2>Handling Unspecified Severity</h2>
<p>Log records may arrive with <code>severity_number</code> set to 0 (SEVERITY_NUMBER_UNSPECIFIED) when the source did not map severity correctly. These ambiguous logs need a routing decision. The safest approach treats unknown logs as potentially important by adding <code>or severity_number == 0</code> to the vendor routing condition. We will cover strategies for inferring and mapping severity from log content in a future article.</p>
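<p>As a sketch, the vendor route from the earlier configuration needs only one extra clause to catch these records:</p>
<pre><code class="language-yaml">connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    error_mode: ignore
    table:
      - context: log
        # Route unspecified severity to the vendor alongside WARN and above
        condition: severity_number &gt;= SEVERITY_NUMBER_WARN or severity_number == 0
        pipelines: [logs/vendor]
</code></pre>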
<h2>Performance Considerations</h2>
<p>The routing connector evaluates conditions for every log record when using log context. High log volumes make condition evaluation a meaningful cost. OTTL condition compilation happens once at startup, but evaluation happens continuously during operation.</p>
<p>Simple numeric comparisons like <code>severity_number &gt;= SEVERITY_NUMBER_WARN</code> are fast. Avoid expensive operations in routing conditions: string matching with <code>IsMatch</code> and regular expressions, body parsing with <code>ParseJSON</code>, and complex boolean logic all add per-record evaluation cost that scales with log volume.</p>
<p>If your routing logic requires expensive operations, consider whether a transform processor earlier in the pipeline could precompute the routing decision into an attribute. Routing on a precomputed attribute is faster than repeating expensive evaluations.</p>
<pre><code class="language-yaml">processors:
  transform/route_tag:
    log_statements:
      - context: log
        statements:
          - set(attributes["route"], "vendor") where severity_number &gt;= SEVERITY_NUMBER_WARN
          - set(attributes["route"], "archive") where severity_number &lt; SEVERITY_NUMBER_WARN

connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    table:
      - context: log
        condition: attributes["route"] == "vendor"
        pipelines: [logs/vendor]
</code></pre>
<p>This pattern moves routing logic to the transform processor, which executes once per log record. The routing connector then performs a simple attribute comparison. For complex routing logic involving multiple conditions, this approach consolidates evaluation.</p>
<h2>Batching Strategy</h2>
<p>The batch processor configuration for each output pipeline affects efficiency and latency. The vendor pipeline typically wants low latency for alerting, so smaller batches with short timeouts make sense. The archive pipeline optimizes for throughput and file write efficiency, so larger batches with longer timeouts are appropriate.</p>
<p>File write efficiency improves with larger batches. Writing many small chunks incurs filesystem overhead, while larger batches amortize that cost. The archive pipeline's batch configuration targets larger payloads:</p>
<pre><code class="language-yaml">processors:
  batch/file:
    timeout: 10s
    send_batch_size: 5000
    send_batch_max_size: 10000
</code></pre>
<p>These parameters send a batch once 5,000 log records accumulate or after 10 seconds of waiting, whichever comes first, with <code>send_batch_max_size</code> capping any single batch at 10,000 records. The file exporter then writes these batches efficiently with its built-in rotation handling.</p>
<p>The vendor pipeline uses more aggressive parameters for responsiveness:</p>
<pre><code class="language-yaml">processors:
  batch/vendor:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
</code></pre>
<p>The one-second timeout ensures logs reach the vendor backend quickly for alerting. The smaller batch sizes prevent individual batches from becoming unwieldy.</p>
<h2>Trade-offs and Limitations</h2>
<p>Severity-based routing assumes severity numbers are correctly populated. Logs with missing or incorrect severity will route incorrectly. If your log sources do not reliably set severity, you may need preprocessing to infer severity from log content before routing decisions occur.</p>
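<p>One possible preprocessing sketch uses a transform processor to backfill <code>severity_number</code> from <code>severity_text</code> before the routing connector runs. The match patterns below are illustrative assumptions, not a complete mapping; adapt them to the logging frameworks in your environment:</p>
<pre><code class="language-yaml">processors:
  transform/severity_backfill:
    log_statements:
      - context: log
        statements:
          # Patterns are examples only; extend for your sources
          - set(severity_number, SEVERITY_NUMBER_ERROR) where severity_number == 0 and IsMatch(severity_text, "(?i)^(error|severe)$")
          - set(severity_number, SEVERITY_NUMBER_WARN) where severity_number == 0 and IsMatch(severity_text, "(?i)^warn(ing)?$")
          # Fall back to INFO when some severity text exists but is unrecognized
          - set(severity_number, SEVERITY_NUMBER_INFO) where severity_number == 0 and severity_text != ""
</code></pre>
<p>Placing this processor in the intake pipeline, before the routing connector, means the routing conditions see normalized values.</p>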
<p>Routing decisions are permanent within a single collector deployment. Once a log routes to archive, it does not also appear in the vendor backend unless you explicitly configure dual routing. If you later need archived logs for investigation, you query your archive storage rather than your vendor's fast query interface. Ensure your archive storage has adequate query tooling for forensic analysis.</p>
<h2>Verification</h2>
<p>After deploying severity-based routing, verify that logs route as expected. Use <code>telemetrygen</code> to send test logs at different severity levels and check each destination.</p>
<p>Send an INFO log, which should route to archive:</p>
<pre><code class="language-bash">telemetrygen logs --otlp-insecure --severity-number 9 --severity-text INFO \
  --body "Test info message" --logs 1
</code></pre>
<p>Send a WARN log, which should route to vendor:</p>
<pre><code class="language-bash">telemetrygen logs --otlp-insecure --severity-number 13 --severity-text WARN \
  --body "Test warning message" --logs 1
</code></pre>
<p>Verify the routing by checking both destinations. The INFO log should appear in the local <code>archive.jsonl</code> file, while the WARN log should appear in Loki at <a href="http://localhost:3000">http://localhost:3000</a> (the LGTM container started earlier).</p>
<h2>Summary</h2>
<p>Severity-based log routing enables tiered storage without sacrificing observability. Important logs reach your vendor backend for fast querying and alerting. Verbose logs reach archival storage for cost-effective retention. The OpenTelemetry Collector's routing connector makes this pattern straightforward to implement.</p>
<p>The key configuration elements are the routing connector with OTTL conditions on <code>severity_number</code>, separate output pipelines for each destination, and batch processor tuning appropriate to each destination's characteristics. The pattern scales with log volume since routing decisions are per-record evaluations of simple numeric conditions.</p>
<p>Start by analyzing your log distribution across severity levels. If verbose logs dominate volume, severity-based routing can meaningfully reduce vendor costs. If important logs dominate, the savings may be modest. Either way, the architectural separation between immediate visibility and archival storage provides flexibility for future optimization.</p>
]]></content:encoded></item><item><title><![CDATA[Your telemetry answers yesterday's questions]]></title><description><![CDATA[Every piece of telemetry exists to answer a question. A span answers "what happened during this request?" A metric answers "how is this system performing over time?" A log answers "what did the applic]]></description><link>https://blog.olly.garden/your-telemetry-answers-yesterday-s-questions</link><guid isPermaLink="true">https://blog.olly.garden/your-telemetry-answers-yesterday-s-questions</guid><category><![CDATA[observability]]></category><category><![CDATA[telemetry]]></category><category><![CDATA[instrumentation]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Thu, 12 Mar 2026 10:07:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/695b816347e92352fdf40037/be0c349c-017b-4344-a4b9-bcccb6e8bd66.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every piece of telemetry exists to answer a question. A span answers "what happened during this request?" A metric answers "how is this system performing over time?" A log answers "what did the application observe at this moment?" When engineers configure instrumentation, they are implicitly encoding the questions they expect to ask. The problem is that the questions change, and the instrumentation does not.</p>
<p>A service deployed three months ago had a particular set of unknowns. How will it perform under real traffic? Are the retry mechanisms working correctly? Does the circuit breaker trigger at the right thresholds? The instrumentation was configured to answer these questions, and it did. The service proved itself. The unknowns became knowns. But the instrumentation kept running, answering questions that stopped being relevant weeks ago.</p>
<h2>When stability makes telemetry redundant</h2>
<p>Consider a payment processing service that has been running in production for six months without a significant incident. During its first weeks, engineers needed detailed spans for every database query, every downstream call, every retry attempt. Those spans helped them verify that the service behaved correctly under production conditions.</p>
<p>Six months later, the service processes thousands of transactions per hour with predictable latency and a near-zero error rate. The detailed spans still flow into the backend. Every database query, every downstream call, every retry, all captured, serialized, transmitted, stored. The pipeline processes them faithfully. Nobody looks at them.</p>
<p>This is not wasted telemetry in the traditional sense. Each individual span is well-formed and technically correct. The problem is relevance. The questions these spans answer, "is the database query pattern correct?" and "do retries work as designed?", were answered months ago. The telemetry is accurate but obsolete. It consumes real resources to confirm what the system has already proven through months of stable operation.</p>
<h2>When pressure creates new questions</h2>
<p>The opposite scenario is more urgent. A downstream dependency starts responding intermittently. Traffic spikes during a major sales event. A configuration change in an adjacent service introduces unexpected latency.</p>
<p>Operators open their dashboards and find that the existing telemetry describes the normal world with precision but has little to say about the abnormal world they are experiencing right now. The service-level metrics confirm elevated error rates, but there is no breakdown by downstream dependency. The traces capture the full request lifecycle, but they lack attributes that would distinguish between traffic patterns. The logs report application-level events but miss the infrastructure signals that would explain the cascading failure.</p>
<p>The gap between the questions operators need to answer and the questions the telemetry was designed to answer becomes painfully visible during incidents. Engineers spend the first thirty minutes of an outage not debugging the problem but instrumenting for it: adding log lines, enabling verbose tracing, deploying configuration changes to capture the attributes they need. This is reactive instrumentation, the opposite of the proactive observability that the industry aspires to.</p>
<p>The root cause is temporal mismatch. The instrumentation was configured for a different moment in the system's lifecycle, when the risks were different, when the traffic patterns were different, when the dependencies behaved differently. The system changed. The world around it changed. The telemetry stayed the same.</p>
<h2>The review that never happens</h2>
<p>The textbook answer is periodic reassessment. Teams should review their instrumentation regularly, asking whether the telemetry they collect still matches the questions they need to answer. Reduce verbosity for stable services. Add coverage for services under new pressure. Retire metrics that no alert or dashboard references.</p>
<p>This is sound advice that almost no organization follows. The reason is simple: there is always something more urgent. Feature delivery, incident response, infrastructure maintenance, and hiring all compete for the same engineering hours. Telemetry review is important but never urgent, which means it loses to everything that is both important and urgent.</p>
<p>The observability team, if the organization has one, is occupied with pipeline operations: keeping collectors running, managing backend capacity, responding to cost overruns. Asking application teams to audit their own instrumentation requires them to context-switch from their primary work, understand what they are currently emitting, evaluate whether it is still relevant, and make informed changes. Each of these steps demands time and expertise that teams under delivery pressure cannot spare.</p>
<p>The result is that instrumentation configurations calcify at their initial state. Services that were instrumented for launch keep their launch-day telemetry forever. Services that were instrumented during an incident keep their incident-response telemetry long after the incident resolves. Nobody adjusts because nobody has time, and the mismatch between questions and answers widens silently.</p>
<h2>AI as continuous telemetry reviewer</h2>
<p>This is the kind of problem where AI changes the equation fundamentally. The work of reviewing telemetry, analyzing what each service emits, evaluating whether it matches current conditions, identifying gaps and redundancies, is exactly the kind of continuous, attention-intensive analysis that humans cannot sustain and AI can.</p>
<p>An AI system observing the telemetry stream can build and maintain a model of each service's emissions and behavioral patterns. It can detect when a service has stabilized and its verbose instrumentation has become redundant. It can recognize when traffic patterns shift and existing telemetry lacks the attributes needed to understand the new behavior. It can identify metrics that nothing references and spans that nobody queries.</p>
<p>The critical capability is not just detection but reasoning. AI can formulate the questions that current conditions would demand, then check whether the existing telemetry can answer them. "If this service's primary database became unavailable, would the current instrumentation reveal the failure mode?" "If traffic doubled, would the existing metrics distinguish between capacity pressure and application errors?" These are the questions a thorough human review would ask. AI can ask them continuously, across every service, without competing with feature delivery for engineering time.</p>
<p>This does not replace human decision-making about instrumentation strategy. Engineers still decide what matters, what trade-offs to accept, and what risks to prioritize. AI handles the part that humans agree is important but cannot sustain: the ongoing, service-by-service evaluation of whether the telemetry still fits the reality.</p>
<h2>Closing the temporal gap</h2>
<p>The fundamental insight is that telemetry quality is not a property of individual spans or metrics. It is a measure of alignment between what is collected and what is needed right now. That alignment degrades in both directions: stable systems become over-instrumented, and pressured systems become under-instrumented. Both conditions waste resources. One wastes money. The other wastes time during incidents.</p>
<p>Organizations that treat instrumentation as a one-time project accept this drift as inevitable. Those that recognize telemetry as something that evolves with the system, manage it as an ongoing lifecycle, and invest in AI systems that maintain alignment between collection and need get observability that adapts to their current reality rather than preserving a snapshot of the past.</p>
<p>Your telemetry answers yesterday's questions. The question is whether you have a system that keeps it current.</p>
]]></content:encoded></item><item><title><![CDATA[When to Use Each Telemetry Signal: Logs, Traces, and Metrics]]></title><description><![CDATA[Understanding when to use logs, traces, or metrics is fundamental to building effective observability. Each signal serves a distinct purpose, and choosing the right one for a given situation directly impacts your ability to debug, monitor, and unders...]]></description><link>https://blog.olly.garden/when-to-use-each-telemetry-signal-logs-traces-and-metrics</link><guid isPermaLink="true">https://blog.olly.garden/when-to-use-each-telemetry-signal-logs-traces-and-metrics</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[Devops]]></category><category><![CDATA[cloud native]]></category><category><![CDATA[distributed systems]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Tue, 17 Feb 2026 14:00:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771335707319/7e76045f-1c9b-4f59-981c-c3a278b33e79.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Understanding when to use logs, traces, or metrics is fundamental to building effective observability. Each signal serves a distinct purpose, and choosing the right one for a given situation directly impacts your ability to debug, monitor, and understand your systems. The challenge is that these signals overlap in capability, leading teams to either over-instrument with redundant data or miss critical insights by using the wrong signal for the job.</p>
<h2 id="heading-logs-the-system-lifecycle-narrator">Logs: The system lifecycle narrator</h2>
<p>Logs are the grandfather of all telemetry signals. Most of us learned to read logs and error messages early in our computing journey, first as users trying to understand why something failed, and eventually as developers writing log statements to communicate system state. Because of this history, logs remain the most universal and accessible signal.</p>
<p>Most legacy systems rely exclusively on logs. Before modern observability practices, aggregating log records was the standard approach to understanding request rates, queue depths, or user journeys across systems. Many teams implemented correlation IDs to tie logs for a specific request across multiple services, essentially building a primitive form of distributed tracing before dedicated tracing systems existed.</p>
<p>Today, logs serve a more focused purpose: understanding the lifecycle of an application. They excel at recording when a pooled connection to a database was established, when a ring was rebalanced, when a circuit breaker opened or closed, when a critical resource stopped working, or when a service recovered from degraded mode. These are events about the application itself, not about individual business transactions.</p>
<p>The strength of logs lies in their flexibility and low barrier to entry. Any developer can add a log statement. The weakness is that this flexibility often leads to inconsistent structure, making analysis difficult at scale.</p>
<h2 id="heading-traces-the-transaction-investigator">Traces: The transaction investigator</h2>
<p>Traces capture telemetry for business transactions, typically in the context of an end-user request that might touch dozens or hundreds of services. Unlike logs, which describe system state, traces describe what happened during a specific operation and how long each step took.</p>
<p>Think of spans, the building blocks of traces, as super-logs. A span is essentially a log entry with a timestamp, a duration, causality relationships through parent span references, and built-in correlation IDs through the trace ID. The critical addition is context propagation: a standardized mechanism to pass trace and span IDs to downstream services, ensuring that all participants in a transaction can contribute to the same trace. This is what those hand-rolled correlation ID solutions were trying to achieve, but traces provide it as a first-class capability with standard protocols and automatic propagation.</p>
<p>The power of traces becomes evident when debugging errors. A trace shows not just that an error occurred, but the exact path the request took, which services were involved, and where the failure originated. When aggregated across many transactions, traces reveal user behavior patterns, system bottlenecks, and optimization opportunities.</p>
<p>One pattern that traces expose with unusual clarity is N+1 queries. While this anti-pattern is difficult to spot with other signals, a trace immediately reveals when a single request triggers dozens of sequential database or network calls. The visual representation of span timing makes the problem obvious in a way that logs or metrics cannot match.</p>
<p>Spans carry highly detailed attributes: which feature flags were active, the user's IP address, whether this is a VIP customer, which authentication mechanism was used, which payment method was selected, and which specific service instances processed the request. This level of detail makes traces the backbone of observability. When you have questions or theories about service behavior, traces often provide the answers.</p>
<p>This power comes with trade-offs. The detail that makes traces valuable also makes them expensive. The sheer volume of spans in a complex system creates significant storage and processing costs. Traces are also the most difficult signal to learn and implement correctly, which drives teams toward auto-instrumentation. While convenient, auto-instrumentation often increases volume further without adding proportional value.</p>
<h2 id="heading-metrics-the-pre-calculated-answer-engine">Metrics: The pre-calculated answer engine</h2>
<p>Metrics are aggregations of events or numeric representations of system state. They answer questions like: what is the current queue depth? How many users visited this page? What is the p99 latency for a specific endpoint?</p>
<p>As aggregations, metrics require choosing dimensions upfront. You might aggregate by endpoint path, service location, or page visited. You typically do not store the IP address for each individual user visit unless you specifically need to count visits per IP. Time-series databases, the systems specialized for storing metrics, are optimized for aggregated data rather than high-cardinality dimensions.</p>
<p>Metrics excel at pre-calculating answers to questions you know you will ask. RED metrics (requests, errors, duration) for HTTP services are the classic example. If you know you will want to track request rates, error percentages, and latency distributions for every endpoint, metrics provide this efficiently and at low query cost.</p>
<p>The limitation appears during ad-hoc exploration. While metrics can answer many questions, there will inevitably be investigations where you need a dimension you did not anticipate. Am I seeing high latency for all users, or only those in Europe? Only Germany? Only Berlin? If you did not include geographic dimensions in your metrics, you cannot answer these questions without re-instrumenting.</p>
<p>Metrics are the classic signal in monitoring. When operating a database, experienced operators know which metrics to watch: connection pool utilization, query latency distributions, replication lag. These indicators quickly reveal the health of the system without requiring investigation into individual transactions.</p>
<h2 id="heading-choosing-the-right-signal">Choosing the right signal</h2>
<p>The decision framework is straightforward once you understand each signal's purpose.</p>
<p>Use traces to record events related to business transactions. When an HTTP request arrives, when a user places an order, when a payment is processed, these are trace-worthy operations. The value is in understanding the complete path and timing of individual transactions.</p>
<p>Use metrics to pre-calculate answers to questions you know you will ask. If you need to monitor request rates, error percentages, or latency distributions, define those metrics upfront. The value is in fast, cheap access to known indicators.</p>
<p>Use logs to understand lifecycle events of your services. When dependencies change state, when configuration reloads, when the application starts or stops gracefully, these belong in logs. The value is in understanding the application as a running system, not the transactions it processes.</p>
<h2 id="heading-when-signals-overlap">When signals overlap</h2>
<p>Real systems often require multiple signals for the same event. A database connection failure might warrant a log (lifecycle event: dependency unavailable), affect a metric (connection error count), and appear in traces (failed span for database operations). This overlap is expected and appropriate.</p>
<p>The mistake is using one signal where another would be more effective. Aggregating log records to compute request rates works, but metrics do this more efficiently. Searching traces to understand when a service entered degraded mode works, but logs make this pattern explicit. Understanding why a specific request failed from metrics alone is nearly impossible, while a trace makes the answer visible.</p>
<p>Match the signal to the question. System health and known indicators call for metrics. Transaction debugging and behavior analysis call for traces. Application lifecycle and operational events call for logs.</p>
<h2 id="heading-summary">Summary</h2>
<p>Each telemetry signal has a distinct purpose that reflects its design and history. Logs narrate system lifecycle events: startups, configuration changes, dependency state transitions. Traces capture business transaction details: request paths, timing, errors, and the attributes that explain behavior. Metrics pre-calculate answers to monitoring questions: rates, distributions, and aggregate states.</p>
<p>Effective observability uses all three signals appropriately. The goal is not coverage through redundancy, but precision through choosing the right tool for each question you need to answer.</p>
]]></content:encoded></item><item><title><![CDATA[You don't have too much telemetry. You have bad telemetry.]]></title><description><![CDATA[The quarterly budget review arrives, and the observability line item has doubled again. The reflexive response is familiar: "We need to sample more aggressively" or "Let's only observe critical services." These tactics will reduce costs. They will al...]]></description><link>https://blog.olly.garden/you-dont-have-too-much-telemetry-you-have-bad-telemetry</link><guid isPermaLink="true">https://blog.olly.garden/you-dont-have-too-much-telemetry-you-have-bad-telemetry</guid><category><![CDATA[#observability #opentelemetry #devops #sre #cloud-native]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Wed, 04 Feb 2026 13:00:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770192983262/f651641f-93df-4f5c-9e42-2e9f7f393fa6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The quarterly budget review arrives, and the observability line item has doubled again. The reflexive response is familiar: "We need to sample more aggressively" or "Let's only observe critical services." These tactics will reduce costs. They will also destroy your ability to debug production incidents, trading a financial problem for an operational one.</p>
<p>The uncomfortable truth is that most organizations do not have a volume problem. They have a governance problem. Before reaching for the sampling dial, engineering leaders should ask a more fundamental question: do we actually know what we are collecting, and is it worth keeping?</p>
<h2 id="heading-the-governance-gap">The governance gap</h2>
<p>Most organizations cannot answer three basic questions about their telemetry: What are we collecting? Who owns it? Is it valuable?</p>
<p>Telemetry tends to grow organically. A developer enables debug logging during an incident and forgets to disable it. Auto-instrumentation captures every internal function call by default. Library internals generate spans that no one ever examines. Over months and years, this accumulation becomes the baseline that everyone assumes is necessary.</p>
<p>The result is a telemetry estate where no one understands the data, no one owns the data, and no one has consciously decided the data is worth the cost of keeping it. When the bill arrives, the only lever that seems available is sampling, which treats all telemetry as equally valuable and cuts it indiscriminately.</p>
<h2 id="heading-patterns-of-bad-telemetry">Patterns of bad telemetry</h2>
<p>Before implementing sampling, engineering leaders should understand the common patterns of telemetry that provides minimal debugging value while consuming significant resources.</p>
<p>Health check floods represent one of the most common offenders. Kubernetes probes, load balancer checks, and monitoring systems generate millions of traces daily. These traces confirm that services are responding, but they reveal nothing about application behavior, user experience, or system bottlenecks. They crowd out useful signal and consume pipeline capacity.</p>
<p>Debug logs abandoned in production create similar waste. During incident response, engineers often increase logging verbosity to understand system behavior. Once the incident resolves, these verbose settings remain in place, generating enormous log volumes that no one examines until the next billing cycle.</p>
<p>High-cardinality metric attributes cause a different kind of problem. Adding user identifiers or transaction IDs to metric labels seems useful until the metrics backend collapses under millions of unique time series. The cost grows multiplicatively with each additional high-cardinality attribute.</p>
<p>Internal span proliferation occurs when auto-instrumentation, especially via eBPF, captures every method call within a service. A single user request might generate fifteen spans, ten of which complete in under a millisecond and represent internal implementation details rather than meaningful system boundaries. These spans add noise to traces without aiding debugging.</p>
<p>Orphaned spans result from broken context propagation between services. These spans cannot be assembled into coherent traces, rendering them useless for understanding request flow. They consume storage and processing resources while providing zero debugging value.</p>
<h2 id="heading-fix-at-source-not-at-pipe">Fix at source, not at pipe</h2>
<p>Many organizations attempt to address telemetry waste by adding filters in their collection pipeline. This approach misses the fundamental inefficiency. By the time data reaches the collector, the application has already generated, serialized, and transmitted it across the network. Filtering at the collector reduces storage costs, but the computational and network costs have already been incurred.</p>
<p>Source-level fixes eliminate waste entirely. Configuring instrumentation agents to exclude health check endpoints prevents those traces from being created. Establishing log level policies in deployment configurations ensures debug logging stays in development environments. Code review practices can catch high-cardinality metric attributes before they reach production.</p>
<p>The collector should serve as a safety net for edge cases, not the primary mechanism for data governance. Filter processors handle scenarios that cannot be addressed at the source, such as legacy applications or third-party services. For everything else, the most cost-effective solution is preventing waste from being generated.</p>
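<p>As a sketch of that safety net, the Collector's filter processor can drop health-check spans with an OTTL condition. The <code>/healthz</code> path here is an assumption; match whatever paths your probes actually use:</p>
<pre><code class="lang-yaml">processors:
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        # Drop spans for health check endpoints that slipped past source-level config
        - attributes["url.path"] == "/healthz"
</code></pre>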
<h2 id="heading-the-volume-question-remains">The volume question remains</h2>
<p>Even after addressing bad telemetry, some organizations will still face legitimate volume challenges. High-traffic systems generate substantial telemetry even when every span and log provides genuine value. The difference is what happens next.</p>
<p>Sampling garbage gives you a smaller pile of garbage. When telemetry is a mix of useful signal and noise, sampling cuts both indiscriminately. You reduce costs, but you also reduce your ability to debug the specific incidents that sampling happened to discard.</p>
<p>Sampling after cleanup is a strategic decision about valuable data. When you have eliminated the noise, every piece of remaining telemetry serves a purpose. Sampling decisions become intentional trade-offs between cost and observability coverage rather than desperate cuts to an unmanaged data stream. Tail-based sampling can preserve error traces while reducing successful request volume. Rate limiting can cap burst traffic while maintaining baseline visibility.</p>
<p>The key insight is that cleanup dramatically reduces the volume that needs sampling in the first place. Organizations often discover that addressing bad telemetry alone brings costs within acceptable ranges, eliminating the need for aggressive sampling entirely.</p>
<h2 id="heading-a-practical-approach-for-engineering-leaders">A practical approach for engineering leaders</h2>
<p>Addressing telemetry governance requires visibility before action. Start by inventorying what you collect, identifying the top contributors to volume across traces, metrics, and logs. Most organizations find that a small number of sources account for the majority of data.</p>
<p>Categorize that volume by type. Health checks, internal spans, debug logs, and high-cardinality metrics each require different remediation strategies. Understanding the composition of your telemetry guides where to focus effort.</p>
<p>Assess value honestly by asking when each category of telemetry last contributed to resolving an incident. If no one can recall using health check traces for debugging, they are candidates for elimination or aggressive filtering.</p>
<p>Implement fixes at the source where possible. Agent configuration changes, log level policies, and instrumentation code reviews address the root cause rather than treating symptoms. Reserve collector-level filtering for cases where source changes are impractical.</p>
<p>Finally, if volume remains a concern after cleanup, implement sampling with intention. Document what is being sampled and why. Ensure that sampling policies preserve the traces most likely to matter during incidents, such as errors, high-latency requests, and specific customer traffic.</p>
<p>The path from reactive cost cutting to intentional data governance requires effort, but the reward is an observability system that costs less and works better. The next time the budget conversation surfaces, the answer should not be "sample more." It should be "we know exactly what we collect, and it is worth keeping."</p>
]]></content:encoded></item><item><title><![CDATA[Reducing Log Volume with the OpenTelemetry Log Deduplication Processor]]></title><description><![CDATA[Your logs are probably at least 80% repetitive noise. Connection retries, health checks, heartbeat messages: the same log line repeated thousands of times per minute. You pay storage costs for each one while the signal drowns in noise. The OpenTeleme...]]></description><link>https://blog.olly.garden/reducing-log-volume-with-the-opentelemetry-log-deduplication-processor</link><guid isPermaLink="true">https://blog.olly.garden/reducing-log-volume-with-the-opentelemetry-log-deduplication-processor</guid><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Mon, 19 Jan 2026 15:00:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768587144939/4c55a1cc-f5d0-4062-bc95-fe02422c2bb9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Your logs are probably at least 80% repetitive noise. Connection retries, health checks, heartbeat messages: the same log line repeated thousands of times per minute. You pay storage costs for each one while the signal drowns in noise. The OpenTelemetry Collector's log deduplication processor offers an elegant solution to this problem.</p>
<h2 id="heading-the-repetitive-log-problem">The repetitive log problem</h2>
<p>Modern distributed systems generate enormous volumes of logs, but much of that volume provides diminishing returns. Consider a typical microservice that logs connection errors when a downstream dependency is unavailable. If the service retries every 100 milliseconds for 30 seconds, that's 300 nearly identical log entries for a single incident. Each entry consumes storage, network bandwidth, and processing capacity in your logging backend.</p>
<p>Health check endpoints compound the problem. Kubernetes probes, load balancer checks, and monitoring systems all generate log entries at regular intervals. A single service might log thousands of health check responses per hour, none of which provide meaningful insight beyond "the service was running."</p>
<p>The logdedupprocessor in the OpenTelemetry Collector solves this by aggregating identical logs over a configurable time window. Instead of forwarding every duplicate entry, it emits a single log with a count of how many times that message appeared.</p>
<h2 id="heading-how-log-deduplication-works">How log deduplication works</h2>
<p>The core concept is straightforward. Logs are considered identical when they share the same resource attributes, scope, body, attributes, and severity. The processor computes a hash of these fields and tracks occurrences within a configurable interval.</p>
<p>When the interval expires, the processor emits a single log entry with three additional attributes: <code>log_count</code> (the number of duplicates), <code>first_observed_timestamp</code>, and <code>last_observed_timestamp</code>. You keep full visibility into frequency patterns without storing every identical entry.</p>
<p>This approach differs from sampling in an important way. Sampling discards data permanently. Deduplication preserves the information that matters (what happened, how often, and when) while eliminating redundant storage.</p>
<h2 id="heading-practical-configuration">Practical configuration</h2>
<p>Here is a configuration that deduplicates connection errors while preserving audit logs:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">processors:</span>
  <span class="hljs-attr">logdedup:</span>
    <span class="hljs-attr">interval:</span> <span class="hljs-string">1s</span>
    <span class="hljs-attr">conditions:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">severity_number</span> <span class="hljs-string">&gt;=</span> <span class="hljs-string">SEVERITY_NUMBER_ERROR</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes["log.type"]</span> <span class="hljs-string">==</span> <span class="hljs-string">"connection"</span>
    <span class="hljs-attr">exclude_fields:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes.request_id</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes.timestamp</span>
</code></pre>
<p>The <code>conditions</code> field uses OpenTelemetry Transformation Language (OTTL) expressions to filter which logs get deduplicated. Logs that do not match pass through unchanged. In this example, only ERROR-level logs with the <code>log.type=connection</code> attribute are candidates for deduplication.</p>
<p>The <code>exclude_fields</code> option removes high-cardinality fields from the comparison. Fields like request IDs and timestamps differ between entries even when the log message is semantically identical. By excluding them, logs that differ only in these volatile fields are treated as duplicates.</p>
<h2 id="heading-a-complete-pipeline-example">A complete pipeline example</h2>
<p>To use the log deduplication processor, include it in your collector pipeline:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">receivers:</span>
  <span class="hljs-attr">otlp:</span>
    <span class="hljs-attr">protocols:</span>
      <span class="hljs-attr">grpc:</span>
        <span class="hljs-attr">endpoint:</span> <span class="hljs-number">0.0</span><span class="hljs-number">.0</span><span class="hljs-number">.0</span><span class="hljs-string">:4317</span>

<span class="hljs-attr">processors:</span>
  <span class="hljs-attr">logdedup:</span>
    <span class="hljs-attr">interval:</span> <span class="hljs-string">1s</span>
    <span class="hljs-attr">conditions:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">severity_number</span> <span class="hljs-string">&gt;=</span> <span class="hljs-string">SEVERITY_NUMBER_ERROR</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes["log.type"]</span> <span class="hljs-string">==</span> <span class="hljs-string">"connection"</span>
    <span class="hljs-attr">exclude_fields:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes.request_id</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes.timestamp</span>

<span class="hljs-attr">exporters:</span>
  <span class="hljs-attr">debug:</span>

<span class="hljs-attr">service:</span>
  <span class="hljs-attr">pipelines:</span>
    <span class="hljs-attr">logs:</span>
      <span class="hljs-attr">receivers:</span> [<span class="hljs-string">otlp</span>]
      <span class="hljs-attr">processors:</span> [<span class="hljs-string">logdedup</span>]
      <span class="hljs-attr">exporters:</span> [<span class="hljs-string">debug</span>]
</code></pre>
<h2 id="heading-testing-with-telemetrygen">Testing with telemetrygen</h2>
<p>To test this configuration locally, use telemetrygen to generate connection error logs:</p>
<pre><code class="lang-bash">telemetrygen logs \
  --otlp-insecure \
  --logs 100 \
  --rate 10 \
  --severity-text ERROR \
  --severity-number 17 \
  --body <span class="hljs-string">"Connection refused: failed to connect to database at 10.0.0.5:5432"</span> \
  --telemetry-attributes <span class="hljs-string">'log.type="connection"'</span> \
  --telemetry-attributes <span class="hljs-string">'service.name="order-service"'</span> \
  --telemetry-attributes <span class="hljs-string">'db.system="postgresql"'</span>
</code></pre>
<p>This generates 100 logs at 10 per second, all with ERROR severity and the <code>log.type=connection</code> attribute that triggers deduplication. After a few seconds, you should see a few log entries with <code>log_count: N</code> in your backend instead of 100 separate entries.</p>
<h2 id="heading-tradeoffs-and-considerations">Tradeoffs and considerations</h2>
<p>The log deduplication processor introduces latency equal to your interval setting. Logs are held until the interval expires before being forwarded. For most use cases, a 1-second delay is acceptable, but real-time alerting systems may need adjustment.</p>
<p>For compliance-critical logs where every occurrence must be preserved with its original timestamp, skip deduplication entirely. Audit logs, security events, and regulatory records often require complete fidelity.</p>
<p>The tradeoff is straightforward: reduced storage and clearer signal at the cost of slight delay and losing individual timestamps. For high-volume repetitive logs, that tradeoff is usually worth it.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The log deduplication processor provides a practical solution to the noise problem in modern logging pipelines. By aggregating identical entries while preserving frequency information, you can dramatically reduce storage costs and improve signal clarity without sacrificing observability.</p>
<p>Combined with other OpenTelemetry Collector processors like filtering and sampling, log deduplication gives you fine-grained control over your telemetry pipeline. The result is a logging system that captures what matters while discarding the noise.</p>
]]></content:encoded></item><item><title><![CDATA[What 10,000 Slack Messages Reveal About OpenTelemetry Adoption Challenges]]></title><description><![CDATA[The OpenTelemetry community has grown tremendously over the past few years, and
with that growth comes valuable insights hidden in our community conversations.
We analyzed nearly 10,000 messages from the #otel-collector and
#opentelemetry Slack chann...]]></description><link>https://blog.olly.garden/what-10000-slack-messages-reveal-about-opentelemetry-adoption-challenges</link><guid isPermaLink="true">https://blog.olly.garden/what-10000-slack-messages-reveal-about-opentelemetry-adoption-challenges</guid><category><![CDATA[OpenTelemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Tue, 06 Jan 2026 14:27:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767628658449/3c6da5d0-0e18-4f07-b81c-783487852480.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The OpenTelemetry community has grown tremendously over the past few years, and
with that growth comes valuable insights hidden in our community conversations.
We analyzed nearly 10,000 messages from the <code>#otel-collector</code> and
<code>#opentelemetry</code> Slack channels spanning from May 2019 to December 2025 to understand
what challenges users face most often, which components generate the most
discussion, and where the community might need additional documentation or
tooling improvements.</p>
<h2 id="heading-the-dataset">The Dataset</h2>
<p>Our analysis covered 9,966 messages across two of the most active OpenTelemetry
Slack channels:</p>
<ul>
<li><strong>#otel-collector</strong>: 5,570 messages (56%)</li>
<li><strong>#opentelemetry</strong>: 4,396 messages (44%)</li>
</ul>
<p>These messages break down into several categories:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Category</th><th>Percentage</th></tr>
</thead>
<tbody>
<tr>
<td>Questions</td><td>46.7%</td></tr>
<tr>
<td>Error Reports</td><td>25.9%</td></tr>
<tr>
<td>Discussions</td><td>23.3%</td></tr>
<tr>
<td>Configuration</td><td>3.0%</td></tr>
<tr>
<td>Help Responses</td><td>1.0%</td></tr>
</tbody>
</table>
</div><p>The high proportion of questions and error reports (over 72% combined) tells us
that these channels serve as critical support resources for the community, and
the topics that appear most frequently represent real adoption challenges.</p>
<p>We applied topic modeling using BERTopic to cluster similar messages, then
analyzed sentiment and frustration indicators to identify which topics cause
the most difficulty. Messages containing error reports, repeated requests for
help, or expressions of confusion scored higher on our frustration metric.</p>
<h2 id="heading-most-discussed-collector-components">Most Discussed Collector Components</h2>
<p>Topic modeling revealed clear patterns in which Collector components generate
the most community discussion. Here are the top components by message volume:</p>
<h3 id="heading-1-prometheus-receiver-and-exporter-498-messages-50">1. Prometheus Receiver and Exporter (498 messages, 5.0%)</h3>
<p>Prometheus integration dominates community discussions. Users frequently ask
about:</p>
<ul>
<li>Configuring the Prometheus receiver to scrape metrics</li>
<li>Setting up the Prometheus remote write exporter</li>
<li>Understanding metric type and metadata preservation across the pipeline</li>
<li>Integrating with existing Prometheus infrastructure</li>
</ul>
<p>This makes sense given Prometheus's widespread adoption. Many organizations
start their OpenTelemetry journey by wanting to integrate with or migrate from
existing Prometheus setups. The remote write exporter in particular sees heavy
use, as it allows teams to continue using Prometheus as a storage backend while
adopting OpenTelemetry for collection and processing.</p>
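<p>A minimal sketch of this hybrid setup, scraping an existing Prometheus endpoint and writing to a Prometheus-compatible backend (the job name, target, and endpoint are placeholders):</p>
<pre><code class="lang-yaml">receivers:
  prometheus:
    config:
      # Standard Prometheus scrape configuration, embedded in the receiver
      scrape_configs:
        - job_name: "app"
          scrape_interval: 30s
          static_configs:
            - targets: ["app:9090"]

exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus.example.com/api/v1/write"
</code></pre>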
<h3 id="heading-2-k8sattributes-processor-258-messages-26">2. k8sattributes Processor (258 messages, 2.6%)</h3>
<p>Kubernetes metadata enrichment is the second most discussed topic. Common
challenges include:</p>
<ul>
<li>Pod association and metadata extraction in DaemonSet deployments</li>
<li>RBAC permissions for accessing the Kubernetes API</li>
<li>Performance implications in large clusters</li>
<li>Interaction with the kubeletstats receiver</li>
</ul>
<p>The complexity of Kubernetes environments and the desire for rich metadata
context makes this processor essential but sometimes tricky to configure
correctly. Users often discover that running the Collector as a DaemonSet
requires different pod association rules than running it as a gateway, leading
to troubleshooting cycles that could be avoided with clearer guidance.</p>
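<p>A typical DaemonSet-oriented configuration associates telemetry with pods by IP first, falling back to the incoming connection (the extracted metadata list is just an example set):</p>
<pre><code class="lang-yaml">processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
    pod_association:
      # Try the pod IP recorded as a resource attribute first
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      # Fall back to the address of the incoming connection
      - sources:
          - from: connection
</code></pre>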
<h3 id="heading-3-tail-sampling-processor-167-messages-17">3. Tail Sampling Processor (167 messages, 1.7%)</h3>
<p>Tail-based sampling generates significant discussion, often with a higher
frustration level than other topics. Users struggle with:</p>
<ul>
<li>Policy configuration and interaction between multiple policies</li>
<li>Stateful sampling across distributed services</li>
<li>Head sampling vs. tail sampling trade-offs</li>
<li>Debugging why traces are or aren't being sampled</li>
<li>Understanding the decision wait period and its impact on latency</li>
</ul>
<p>The stateful nature of tail sampling, which requires collecting all spans of a
trace before making a decision, adds operational complexity that head sampling
avoids. Many teams end up running both approaches, using head sampling at the
SDK level for baseline reduction and tail sampling in the Collector for
intelligent retention of interesting traces.</p>
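<p>A common two-policy sketch keeps every error trace and a percentage of the rest; policies are evaluated independently, and a trace is kept if any policy matches (the percentage and wait time are illustrative):</p>
<pre><code class="lang-yaml">processors:
  tail_sampling:
    # How long to buffer spans before making a per-trace decision
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
</code></pre>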
<h3 id="heading-4-kafka-receiver-and-exporter-131-messages-13">4. Kafka Receiver and Exporter (131 messages, 1.3%)</h3>
<p>Kafka integration appears frequently, particularly around:</p>
<ul>
<li>Connection and authentication issues with managed Kafka services (AWS MSK)</li>
<li>Topic configuration and consumer group management</li>
<li>Message format and serialization</li>
<li>High-availability deployment patterns</li>
</ul>
<h3 id="heading-5-memory-limiter-processor-125-messages-13">5. Memory Limiter Processor (125 messages, 1.3%)</h3>
<p>Resource management is a consistent concern:</p>
<ul>
<li>Proper memory limit configuration relative to container limits</li>
<li>GOMEMLIMIT interaction with the memory limiter</li>
<li>Debugging memory spikes and OOM situations</li>
<li>CPU usage profiling with pprof</li>
</ul>
<p>Understanding the relationship between Go's memory management, container
limits, and the memory limiter processor requires knowledge that spans multiple
domains. The recent addition of <code>GOMEMLIMIT</code> support has helped, but users
still need guidance on proper configuration for their specific deployment
scenarios.</p>
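<p>A starting-point configuration, with illustrative limits that should stay comfortably below the container's memory limit (the processor is conventionally placed first in each pipeline):</p>
<pre><code class="lang-yaml">processors:
  memory_limiter:
    check_interval: 1s
    # Hard limit; keep this below the container memory limit
    limit_mib: 400
    # Soft limit is limit_mib - spike_limit_mib (300 MiB here);
    # above it, the processor starts refusing incoming data
    spike_limit_mib: 100
</code></pre>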
<h2 id="heading-top-10-problem-areas-and-pain-points">Top 10 Problem Areas and Pain Points</h2>
<p>Beyond component-specific discussions, our frustration analysis identified the
topics that cause the most difficulty for users. These represent areas where
improved documentation, better error messages, or tooling enhancements could
have the highest impact.</p>
<h3 id="heading-1-connection-and-export-failures">1. Connection and Export Failures</h3>
<p>The most frustrating experiences relate to OTLP export failures, particularly:</p>
<ul>
<li><code>DEADLINE_EXCEEDED</code> errors when exporting to backends</li>
<li>TLS configuration issues with load balancers</li>
<li>gRPC vs. HTTP protocol confusion</li>
<li>Connectivity issues behind proxies or in cloud environments</li>
</ul>
<h3 id="heading-2-custom-collector-distributions">2. Custom Collector Distributions</h3>
<p>Building custom distributions with <code>ocb</code> (OpenTelemetry Collector Builder)
generates significant frustration:</p>
<ul>
<li>Version conflicts between components</li>
<li>Build failures on specific platforms (Windows MSI notably painful)</li>
<li>Dependency resolution issues</li>
<li>Understanding which components to include</li>
</ul>
<h3 id="heading-3-configuration-syntax-and-validation">3. Configuration Syntax and Validation</h3>
<p>Many users struggle with basic configuration:</p>
<ul>
<li>YAML syntax errors that produce cryptic error messages</li>
<li>Understanding the relationship between receivers, processors, and exporters</li>
<li>Pipeline configuration and data flow</li>
<li>Environment variable substitution syntax</li>
</ul>
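<p>The pieces connect in the <code>service.pipelines</code> section; this minimal sketch (values are placeholders) also shows the <code>${env:VAR}</code> substitution syntax that trips people up:</p>
<pre><code class="lang-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp:
    # Resolved from the environment at startup.
    endpoint: ${env:BACKEND_ENDPOINT}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
</code></pre>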
<h3 id="heading-4-context-propagation">4. Context Propagation</h3>
<p>Distributed tracing fundamentals cause confusion:</p>
<ul>
<li>B3 vs. W3C trace context formats</li>
<li>Baggage propagation across service boundaries</li>
<li>Extract and inject operations in SDKs</li>
<li>Cross-language propagation issues</li>
</ul>
<h3 id="heading-5-attribute-and-resource-management">5. Attribute and Resource Management</h3>
<p>Understanding the data model proves challenging:</p>
<ul>
<li>When to use resource attributes vs. span/metric/log attributes</li>
<li>Moving attributes between resource and signal levels</li>
<li>Semantic conventions compliance</li>
<li>Attribute cardinality and its impact</li>
</ul>
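<p>One concrete tool for moving attributes between levels is the <code>groupbyattrs</code> processor; the sketch below (attribute names are illustrative) promotes the listed record-level attributes to the resource:</p>
<pre><code class="lang-yaml">processors:
  groupbyattrs:
    # Records sharing these attribute values are grouped under one
    # resource, and the attributes move to the resource level.
    keys:
      - host.name
      - k8s.pod.name
</code></pre>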
<h3 id="heading-6-ottl-opentelemetry-transformation-language">6. OTTL (OpenTelemetry Transformation Language)</h3>
<p>While powerful, OTTL generates confusion:</p>
<ul>
<li>Function syntax and available operations</li>
<li>Context-specific paths and accessors</li>
<li>Debugging transformation failures</li>
<li>Performance implications of complex transforms</li>
</ul>
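<p>For illustration (attribute names are made up), here is a small transform processor configuration; <code>error_mode: ignore</code> keeps a single failing statement from dropping data, which makes debugging less punishing:</p>
<pre><code class="lang-yaml">processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Cap runaway attribute values.
          - truncate_all(attributes, 4096)
          # Backfill a default only where the attribute is missing.
          - set(attributes["deployment.environment"], "unknown") where attributes["deployment.environment"] == nil
</code></pre>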
<h3 id="heading-7-kubernetes-operator-and-auto-instrumentation">7. Kubernetes Operator and Auto-Instrumentation</h3>
<p>The Operator simplifies deployment but introduces its own challenges:</p>
<ul>
<li>Instrumentation injection not working as expected</li>
<li>Multiple collector deployment modes (DaemonSet vs. Sidecar vs. Deployment)</li>
<li>CRD configuration options</li>
<li>Troubleshooting injected agents</li>
</ul>
<h3 id="heading-8-backend-integration">8. Backend Integration</h3>
<p>Connecting to observability backends requires effort:</p>
<ul>
<li>Jaeger configuration and migration from legacy setups</li>
<li>Vendor-specific exporter configuration</li>
<li>Authentication and authorization with managed services</li>
<li>Multi-backend routing</li>
</ul>
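<p>For the multi-backend case, the simplest pattern is listing several exporters in one pipeline, which fans the same data out to each; names and endpoints below are illustrative:</p>
<pre><code class="lang-yaml">exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
  otlphttp/vendor:
    endpoint: https://api.vendor.example.com
    headers:
      # Authenticate to the managed backend via an env var.
      api-key: ${env:VENDOR_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, otlphttp/vendor]
</code></pre>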
<h3 id="heading-9-docker-and-container-deployment">9. Docker and Container Deployment</h3>
<p>Container-related issues appear regularly:</p>
<ul>
<li>Image selection (contrib vs. core)</li>
<li>Version availability on Docker Hub</li>
<li>Custom image building</li>
<li>Resource limits and performance tuning</li>
</ul>
<h3 id="heading-10-queue-and-retry-behavior">10. Queue and Retry Behavior</h3>
<p>Understanding the exporter helper's behavior:</p>
<ul>
<li>Persistent queue configuration and storage</li>
<li>Retry policies and backoff behavior</li>
<li>Data loss scenarios and prevention</li>
<li>Queue sizing for high-volume deployments</li>
</ul>
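<p>A hedged sketch of the exporter helper settings, with a persistent queue backed by the <code>file_storage</code> extension (sizes are illustrative and need tuning to your volume):</p>
<pre><code class="lang-yaml">extensions:
  file_storage:
    directory: /var/lib/otelcol/queue

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      # Persist the queue to disk so it survives a restart.
      storage: file_storage
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      # After this much elapsed retry time, the batch is dropped.
      max_elapsed_time: 300s

service:
  extensions: [file_storage]
</code></pre>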
<h2 id="heading-what-this-tells-us">What This Tells Us</h2>
<p>Several themes emerge from this analysis:</p>
<p><strong>The Prometheus ecosystem remains central.</strong> Organizations aren't abandoning
Prometheus; they're integrating it with OpenTelemetry. Documentation and
tooling that bridges these ecosystems will continue to be valuable.</p>
<p><strong>Kubernetes complexity compounds OTel complexity.</strong> The k8sattributes
processor and Operator discussions show that Kubernetes environments introduce
additional layers of configuration and troubleshooting. Simplified deployment
patterns and better defaults could help.</p>
<p><strong>Sampling is conceptually difficult.</strong> Tail sampling, despite being well
documented, generates ongoing confusion. Interactive tools or visualization of
sampling decisions might help users understand and debug their configurations.</p>
<p><strong>Error messages need improvement.</strong> Many frustration-heavy discussions start
with a cryptic error message. Investing in actionable error messages with
suggested fixes would significantly improve the user experience.</p>
<p><strong>The gap between "getting started" and "production ready" is real.</strong> Basic
tutorials work, but scaling to production with proper memory limits, persistent
queues, and multi-backend routing requires significant learning.</p>
<h2 id="heading-moving-forward">Moving Forward</h2>
<p>We hope this analysis helps maintainers and SIGs identify areas where
documentation improvements would have the highest impact. The data clearly
shows that certain topics, particularly around configuration patterns, sampling
strategies, and multi-backend deployments, generate recurring questions that
better guides could address.</p>
<p>On my end, I have lined up a series of articles that tackle some of these pain
points directly, covering topics like decomposing Collector configuration files
into manageable pieces, routing telemetry to multiple backends based on tenant
or environment, and building effective tail sampling strategies.</p>
<h2 id="heading-acknowledgments">Acknowledgments</h2>
<p>Thank you to everyone who participates in the OpenTelemetry Slack community.
Your questions, error reports, and discussions not only help fellow users but
also provide valuable signal for where the project can improve. A special
thanks to the community members who take time to answer questions and share
their experiences - the 1% of help responses in our data represent countless
hours of volunteer effort that makes this community welcoming for newcomers.</p>
<hr />
<p><em>This analysis used topic modeling and sentiment analysis on publicly available
Slack messages. Individual messages were aggregated into topics; no personally
identifiable information was used in this report.</em></p>
]]></content:encoded></item><item><title><![CDATA[Meet Rose: OllyGarden's AI Instrumentation Agent]]></title><description><![CDATA[Imagine the perfect observability world: There is an incident, the on-call team gets paged in the middle of the night, wakes up and thanks to your telemetry, the root-cause is identified within just a]]></description><link>https://blog.olly.garden/meet-rose-ollygardens-ai-instrumentation-agent</link><guid isPermaLink="true">https://blog.olly.garden/meet-rose-ollygardens-ai-instrumentation-agent</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Nicolas Wörner]]></dc:creator><pubDate>Wed, 29 Oct 2025 12:00:32 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/67e562b72417211a3624884f/9f886af7-7fec-4ee4-9397-7236a8f010cb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Imagine the perfect observability world: There is an incident, the on-call team gets paged in the middle of the night, wakes up and thanks to your telemetry, the root-cause is identified within just a few minutes. Telemetry is produced without sensitive data or passwords and you only pay for what you <em>actually</em> need. Dashboards aren't broken or inconsistent and you have confidence in your data.</p>
<p>Unfortunately, this perfect observability world is often a utopia, and the harsh reality looks different. But why is that, and what can we do to bring today's reality closer to it?</p>
<p>Modern tools and instrumentation approaches have made it easier than ever to collect telemetry data from applications. You press the "easy button" and, thanks to powerful tools like eBPF or auto-instrumentation agents, a lot of data magically appears in the observability backend. In almost no time, the whole system is suddenly instrumented. Sounds great, doesn't it? Where's the catch?</p>
<p>Such approaches are great for getting started quickly and have valid use cases for baseline visibility. However, as your applications scale, the amount of telemetry data produced grows significantly as well. That data is often low-quality, and at a certain point it becomes challenging (and expensive) to maintain and make sense of it.</p>
<p>To gain more control over telemetry, you can instrument applications manually. That gives you the power to capture exactly what is needed. Even better: telemetry quality can be enhanced with custom application- and business-specific attributes while guaranteeing that no sensitive data is produced. You only pay for what you need, and thanks to the reduced amount of low-signal data, relevant issues can be identified faster.</p>
<p>While the theory sounds promising, the reality is that manual instrumentation isn't trivial. Done right, it requires consistency across application boundaries, correct context propagation, OpenTelemetry-specific domain knowledge, and, most importantly, engineers who have the time and knowledge to maintain the instrumentation.</p>
<h2>Introducing Rose</h2>
<p>Today we are announcing the research preview of <a href="https://ollygarden.com/rose"><strong>OllyGarden Rose</strong></a>, our AI instrumentation agent. Rose integrates seamlessly into your development workflow as a <strong>GitHub</strong> app that analyzes OpenTelemetry instrumentation in pull requests, identifies pitfalls and suggests improvements to ensure consistent, high-quality telemetry practices. It’s designed to facilitate the manual instrumentation process by reducing engineering time, ensuring consistency, and providing clear guidance that builds confidence in your telemetry. At a later stage, OllyGarden Rose will be able to do assessments of the instrumentation quality of an entire code repository, or even install and perform an initial instrumentation on its own, guided by our knowledge about what’s good telemetry, as well as other external sources.</p>
<p><strong>Research preview launches October 29, 2025. Click</strong> <a href="https://ollygarden.com/rose"><strong>here</strong></a> <strong>to learn more.</strong></p>
<h3>Key Features</h3>
<h4>Context-Aware Analysis</h4>
<p>Rose understands your entire codebase, not just the diff. It knows your organization's telemetry patterns, recognizes which semantic conventions apply, and understands whether you're instrumenting an HTTP client, database call, or message queue. It provides guidance specific to your exact situation.</p>
<h4>OllyGarden Knowledge Base</h4>
<p>While general-purpose coding assistants understand OpenTelemetry SDK syntax, they often lack the depth to guide instrumentation across application boundaries and understand the why and what. Built on OllyGarden's expertise from years of contributing to OpenTelemetry, best-practices and industry standards are encoded into actionable rules and patterns that Rose applies automatically.</p>
<h4>OpenTelemetry Education</h4>
<p>Every comment Rose makes includes an explanation of why something matters, optionally a concrete code suggestion showing how to fix it, and links to relevant OpenTelemetry documentation. The goal isn’t just to fix issues, but to teach and share observability best practices with every pull request.</p>
<h2>Join the Research Preview</h2>
<p>Our mission at OllyGarden is to bring the reality closer to the perfect observability world, where it's easy to achieve and maintain high-quality telemetry data. In addition to the OllyGarden insights platform, Rose is another step towards that goal.</p>
<p>We'll provide Rose free of charge to selected participants during the research preview period. In return, we expect participants to provide feedback and access to the target source code repository (the one to be instrumented), so that we can analyze what worked and what didn't.</p>
<p>NDAs are available for organizations with security requirements. Our goal is to learn from real-world code out there, with any level of instrumentation. We're especially interested in teams already using or planning to adopt manual instrumentation practices.</p>
<p><strong>Ready to participate?</strong> <a href="https://ollygarden.com/rose">Contact us</a> to join the research preview for free.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing OllyGarden Tulip: Our Open-Source Distribution of the OpenTelemetry Collector]]></title><description><![CDATA[TL;DR: We're launching OllyGarden Tulip, a commercially supported OpenTelemetry Collector distribution with stable releases, predictable upgrade paths, and professional support from the people who helped build the Collector. It's open source and free...]]></description><link>https://blog.olly.garden/introducing-tulip-supported-otel-collector</link><guid isPermaLink="true">https://blog.olly.garden/introducing-tulip-supported-otel-collector</guid><category><![CDATA[opentelemetry collector]]></category><category><![CDATA[OpenTelemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Thu, 16 Oct 2025 07:00:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760365384730/d7551537-6e3d-44d0-96a9-12570420f40d.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> We're launching <a target="_blank" href="https://olly.garden/tulip">OllyGarden Tulip</a>, a commercially supported OpenTelemetry Collector distribution with stable releases, predictable upgrade paths, and professional support from the people who helped build the Collector. It's open source and free to use, with optional commercial support for production deployments. Quarterly releases start with v25.11, with LTS releases every 18 months.</p>
<hr />
<p>Back in 2019, I encountered something that would shape the next six years of my career: the OpenCensus Service. This unassuming piece of infrastructure could receive telemetry data in one format (OpenCensus, Zipkin) and export it in another, like Jaeger. Simple, elegant, powerful.</p>
<p>When OpenCensus merged with OpenTracing, the service became OpenTelemetry Service. Coming from the Jaeger world, I immediately saw a naming problem: having an "OpenTelemetry Service" that functioned like Jaeger Collector would confuse everyone. In my first SIG calls, I suggested renaming it to OpenTelemetry Collector. The response? "Too much legacy already. People know it by this name."</p>
<p>I was wrong about the timing, but right about the need. Eventually, the community came around, and OpenTelemetry Collector was born.</p>
<h2 id="heading-helping-build-the-collector-ecosystem">Helping Build the Collector Ecosystem</h2>
<p>That early misstep didn't discourage me. Instead, it pulled me deeper into the community. Over the following years, I implemented authentication support and the first auth mechanism, built the load balancing exporter, created the OpenTelemetry Collector Builder (ocb), developed the OpenTelemetry Operator, and maintained the tail sampling processor. I gave conference talks at events worldwide.</p>
<p>More importantly, I talked to users. Hundreds of conversations about their deployments, their challenges, their workarounds. I addressed concerns where I could, but some problems were too big for a single engineer, even one working at a large organization.</p>
<h2 id="heading-the-problems-i-couldnt-solve-alone">The Problems I Couldn't Solve Alone</h2>
<p>The requests came consistently, almost predictably: "Can we get commercial support for the Collector?"</p>
<p>Some users had backend vendors who offered Collector support, but only as long as they remained customers. Want to migrate to a different backend? Your Collector support disappears. Companies building custom Collector distributions with proprietary components were left entirely on their own.</p>
<p>It was painful. Passionate users consuming my code and projects, and I couldn't offer them the support they needed.</p>
<p>The technical challenges were equally frustrating. Teams stuck on ancient Collector versions because upgrades broke their dashboards and alerts when internal telemetry metrics changed. Organizations forced to update configurations for unrelated components just to consume a critical bug fix. Custom distribution maintainers struggling to keep pace with upstream changes while managing their own components.</p>
<p>These weren't edge cases. These were real operational pain points affecting production systems at scale.</p>
<h2 id="heading-introducing-ollygarden-tulip">Introducing OllyGarden Tulip</h2>
<p>Today, we're changing that. I'm excited to announce <a target="_blank" href="https://olly.garden/tulip"><strong>OllyGarden Tulip</strong></a>, a commercially supported OpenTelemetry Collector distribution that solves the problems I've heard about for years.</p>
<p>Tulip provides the stability guarantees, predictable release cycles, and professional support that production systems deserve. It's an open source distribution built using ocb, the same tool I created for the community. You can use it freely, extend it, and build on it. And when you need support, we're here with the deep expertise that comes from years of building and maintaining the Collector itself.</p>
<p>This isn't just another distribution. It's the support offering that Collector users have been asking for, delivered by the people who know this codebase intimately.</p>
<h2 id="heading-why-ollygarden-tulip-exists">Why OllyGarden Tulip Exists</h2>
<p>When Yuri and I founded OllyGarden at the beginning of this year, our focus was clear: give observability engineers superpowers to understand what's good and what's bad about their telemetry through our Insights platform. That remains our core mission, and we're making significant progress there.</p>
<p>But as we've built OllyGarden, I've kept hearing the same pains from Collector users that I've witnessed for years. These aren't problems we can ignore, and they're problems we can solve right now. So we're accelerating our plans and launching <strong>OllyGarden Tulip</strong> today, a commercially supported OpenTelemetry Collector distribution built specifically to address the support and stability challenges that production teams face every day.</p>
<h2 id="heading-what-makes-tulip-different">What Makes Tulip Different</h2>
<p>Tulip provides stability guarantees that match real-world needs. Need a critical bug fix without updating every component? We've got you covered. Upgraded to the latest version and experiencing unexpected performance issues? Throw it at us. Want predictable release cycles that align with your planning? We deliver.</p>
<p>Our approach combines flexibility with reliability. We provide quarterly releases tracking upstream versions, starting with v25.11 (November 2025). Every 18 months, we'll release an LTS version, with the first likely at v26.5. The distribution itself is open source, built using ocb. You can use the binaries or container images for free. Need components we don't support yet? Fork our repository and add them. We'll still support the ones in our manifest. Need commercial support? We're here for you.</p>
<h2 id="heading-built-on-open-source-backed-by-experience">Built on Open Source, Backed by Experience</h2>
<p>OllyGarden Tulip isn't a fork or a proprietary reimagining. It's an open source distribution of the Collector built using the same tools I created, specifically ocb. You can use it freely. You can extend it. You can build on it.</p>
<p>What we're offering is something the community has asked for repeatedly: stable, professional support from people who know this codebase intimately.</p>
<p>I haven't been as involved in the OpenTelemetry Collector community since January. I've been focused on building our products. But Tulip brings us closer again. More importantly, it provides a support offering that our users deserve.</p>
<h2 id="heading-who-this-is-for">Who This Is For</h2>
<p>You should consider OllyGarden Tulip if you run the OpenTelemetry Collector in production and need reliable support, if you build custom Collector distributions and want stable upstream compatibility, if you need predictable upgrade paths that won't break your observability infrastructure, if you want to decouple your Collector support from your backend vendor relationship, or if you value stability and professional support over bleeding-edge features.</p>
<h2 id="heading-getting-started">Getting Started</h2>
<p>OllyGarden Tulip is available now. Our open source manifest and container images are free to use in any environment. For commercial support, contact us to discuss your needs. Visit our documentation site for implementation guides and resources.</p>
<p>We're starting this journey with the v25.11 release, and we're committed to the long-term stability that production systems require.</p>
<h2 id="heading-a-personal-note">A Personal Note</h2>
<p>For six years, I've watched the OpenTelemetry Collector grow from an experimental service to critical infrastructure powering observability at organizations worldwide. I've celebrated its successes and felt the pain of its operational challenges.</p>
<p>OllyGarden Tulip represents my commitment to the users who've trusted my code over the years. You've built incredible things. You deserve support that matches your ambition.</p>
<p>Let's build something reliable together.</p>
<hr />
<p><strong>Ready to learn more?</strong> Visit our documentation or contact us to discuss commercial support options.</p>
]]></content:encoded></item><item><title><![CDATA[The Variability Principle: How to Decide What Deserves a Span]]></title><description><![CDATA[Every team discovers OpenTelemetry the same way. First, excitement—finally, visibility into distributed systems! Then comes the instrumentation party. Spans everywhere. Every function. Every validation. Every calculation gets its own span because "mo...]]></description><link>https://blog.olly.garden/what-deserves-a-span</link><guid isPermaLink="true">https://blog.olly.garden/what-deserves-a-span</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[tracing]]></category><dc:creator><![CDATA[Jakub Mikłasz]]></dc:creator><pubDate>Mon, 06 Oct 2025 08:53:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759430553963/e8406985-0d26-44ab-ac5d-a411fd243db7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every team discovers OpenTelemetry the same way. First, excitement—finally, visibility into distributed systems! Then comes the instrumentation party. Spans everywhere. Every function. Every validation. Every calculation gets its own span because "more data is better," right?</p>
<p>Three months later, you're staring at a trace with 500 spans trying to figure out why a simple API call took 3 seconds. Your observability bill has grown 10x. And your engineers have given up on traces entirely because they're impossible to read.</p>
<p>There's a better way.</p>
<h2 id="heading-the-problem-span-explosion"><strong>The Problem: Span Explosion</strong></h2>
<p>Most teams create spans like this:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">ProcessPayment</span><span class="hljs-params">(ctx context.Context, payment Payment)</span> <span class="hljs-title">error</span></span> {
    ctx, span := tracer.Start(ctx, <span class="hljs-string">"process payment"</span>)
    <span class="hljs-keyword">defer</span> span.End()

    validateAmount(ctx, payment.Amount)      <span class="hljs-comment">// Another span</span>
    validateCard(ctx, payment.CardNumber)    <span class="hljs-comment">// Another span</span>
    calculateFees(ctx, payment.Amount)       <span class="hljs-comment">// Another span</span>
    formatCurrency(ctx, payment.Total)       <span class="hljs-comment">// Another span</span>
    <span class="hljs-comment">// ... 10 more spans for trivial operations</span>
}
</code></pre>
<p>At 10,000 requests per minute with 15 spans each, you're generating 6.5 billion spans per month. At $0.20 per million spans, that's $1,300 monthly just for payment processing traces.</p>
<p>But cost isn't the real problem. The real problem is that your traces become unreadable. When everything has a span, nothing stands out. Signal drowns in noise.</p>
<h2 id="heading-the-variability-principle-your-new-mental-model"><strong>The Variability Principle: Your New Mental Model</strong></h2>
<p>Here's the principle that changed everything for us:</p>
<blockquote>
<p><strong>"Is this operation unpredictable?"</strong></p>
</blockquote>
<p>If yes, create a span. If no, don't.</p>
<p>This simple question cuts through all the complexity. It's not about operation importance or business value—it's about performance predictability.</p>
<h3 id="heading-unpredictable-create-a-span"><strong>Unpredictable = Create a Span</strong></h3>
<p>Operations with unpredictable performance need spans:</p>
<ul>
<li><p><strong>Database queries</strong>: Could take 5ms or 5 seconds depending on locks, data size, indexes</p>
</li>
<li><p><strong>HTTP calls</strong>: Network latency, retries, timeouts are all variable</p>
</li>
<li><p><strong>External APIs</strong>: You don't control their performance</p>
</li>
<li><p><strong>Message queues</strong>: Depends on queue depth, consumer availability</p>
</li>
<li><p><strong>Cache operations</strong>: Network round-trip to Redis/Memcached</p>
</li>
<li><p><strong>File I/O</strong>: Disk performance varies, especially with network storage</p>
</li>
</ul>
<p>These operations can surprise you. When they're slow, you need to know.</p>
<h3 id="heading-predictable-skip-the-span"><strong>Predictable = Skip the Span</strong></h3>
<p>Operations with predictable performance don't need spans:</p>
<ul>
<li><p><strong>Validation logic</strong>: Checking if a string contains "@" is always microseconds</p>
</li>
<li><p><strong>Math calculations</strong>: CPU-bound operations are consistent</p>
</li>
<li><p><strong>Data transformation</strong>: Mapping objects in memory is deterministic</p>
</li>
<li><p><strong>String formatting</strong>: Always fast, never the problem</p>
</li>
<li><p><strong>Getters/setters</strong>: Not worth measuring</p>
</li>
</ul>
<p>These operations can't surprise you. They're never the bottleneck.</p>
<h2 id="heading-the-pattern-in-practice"><strong>The Pattern in Practice</strong></h2>
<p>Let's refactor that payment processing:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">ProcessPayment</span><span class="hljs-params">(ctx context.Context, payment Payment)</span></span> {
    ctx, span := tracer.Start(ctx, <span class="hljs-string">"process payment"</span>)
    <span class="hljs-keyword">defer</span> span.End()

    <span class="hljs-comment">// Add context as attributes, not spans</span>
    span.SetAttributes(
        attribute.Float64(<span class="hljs-string">"payment.amount"</span>, payment.Amount),
        attribute.String(<span class="hljs-string">"payment.currency"</span>, payment.Currency),
    )

    <span class="hljs-comment">// Validation is predictable - no span needed</span>
    <span class="hljs-keyword">if</span> payment.Amount &lt;= <span class="hljs-number">0</span> || !isValidCard(payment.CardNumber) {
        span.RecordError(errors.New(<span class="hljs-string">"invalid payment"</span>))
        <span class="hljs-keyword">return</span>
    }

    <span class="hljs-comment">// Database operation is unpredictable - needs a span</span>
    ctx, dbSpan := tracer.Start(ctx, <span class="hljs-string">"INSERT payments"</span>)
    dbSpan.SetAttributes(
        attribute.String(<span class="hljs-string">"db.system"</span>, <span class="hljs-string">"postgresql"</span>),
        attribute.String(<span class="hljs-string">"db.collection.name"</span>, <span class="hljs-string">"payments"</span>),
        attribute.String(<span class="hljs-string">"db.operation.name"</span>, <span class="hljs-string">"INSERT"</span>),
    )
    db.SavePayment(ctx, payment)
    dbSpan.End()

    <span class="hljs-comment">// External API is unpredictable - needs a span</span>
    ctx, chargeSpan := tracer.Start(ctx, <span class="hljs-string">"charge card"</span>)
    paymentGateway.Charge(ctx, payment)
    chargeSpan.End()
}
</code></pre>
<p>Result: 3 spans instead of 15. Traces are readable. Engineers can actually find problems.</p>
<h2 id="heading-what-to-use-instead-of-spans"><strong>What to Use Instead of Spans</strong></h2>
<p>When you skip creating a span, you still need to capture information. That's where attributes and events come in.</p>
<h3 id="heading-attributes-context-without-cost"><strong>Attributes: Context Without Cost</strong></h3>
<p>Attributes add metadata to existing spans. They're perfect for:</p>
<ul>
<li><p>Request/response data (user ID, order total, currency)</p>
</li>
<li><p>Configuration values (retry count, timeout settings)</p>
</li>
<li><p>Business context (customer tier, feature flags)</p>
</li>
</ul>
<pre><code class="lang-go">span.SetAttributes(
    attribute.String(<span class="hljs-string">"user.id"</span>, userID),
    attribute.Float64(<span class="hljs-string">"order.total"</span>, <span class="hljs-number">157.46</span>),
    attribute.Bool(<span class="hljs-string">"cache.hit"</span>, <span class="hljs-literal">true</span>),
)
</code></pre>
<p>Attributes are indexed and searchable. They let you filter traces without creating separate spans.</p>
<h3 id="heading-events-milestones-in-time"><strong>Events: Milestones in Time</strong></h3>
<p>Events mark important moments within a span's lifecycle. They're perfect for:</p>
<ul>
<li><p>Validation checkpoints</p>
</li>
<li><p>State transitions</p>
</li>
<li><p>Progress markers in loops</p>
</li>
</ul>
<pre><code class="lang-go"><span class="hljs-comment">// Mark validation completion</span>
span.AddEvent(<span class="hljs-string">"validation completed"</span>)

<span class="hljs-comment">// Track calculation results</span>
span.AddEvent(<span class="hljs-string">"total calculated"</span>,
    trace.WithAttributes(
        attribute.Int(<span class="hljs-string">"line_items.count"</span>, <span class="hljs-number">4</span>),
        attribute.Float64(<span class="hljs-string">"total"</span>, <span class="hljs-number">157.46</span>),
    ))

<span class="hljs-comment">// Record state changes</span>
span.AddEvent(<span class="hljs-string">"payment saved"</span>)

<span class="hljs-comment">// Track retry attempts</span>
span.AddEvent(<span class="hljs-string">"retry attempt"</span>,
    trace.WithAttributes(
        attribute.Int(<span class="hljs-string">"attempt"</span>, <span class="hljs-number">3</span>),
        attribute.String(<span class="hljs-string">"reason"</span>, <span class="hljs-string">"timeout"</span>),
    ))
</code></pre>
<p>Events show you <em>when</em> something happened and provide rich context without the overhead of a full span. When debugging, they help you see the timeline of operations within your parent span.</p>
<h2 id="heading-the-decision-framework"><strong>The Decision Framework</strong></h2>
<p>Before creating any span, ask one question:</p>
<p><strong>"Is this operation unpredictable?"</strong></p>
<p>Yes → Create a span</p>
<p>No → Use attributes or events</p>
<p>That's it. This single question replaces complex decision trees and eliminates 80% of unnecessary spans.</p>
<h2 id="heading-remember-this"><strong>Remember This</strong></h2>
<p>Your traces should tell a story, not document every CPU cycle. Each span costs money, performance, and clarity.</p>
<p>Create spans only for operations that could surprise you. For everything else, there are attributes and events.</p>
<p>The best observability isn't about having all the data—it's about having the right data.</p>
]]></content:encoded></item><item><title><![CDATA[How to Name Your Metrics]]></title><description><![CDATA[Metrics are the quantitative backbone of observability—the numbers that tell us how our systems are performing. This is the third post in our OpenTelemetry naming series, where we've already explored how to name spans and how to enrich them with mean...]]></description><link>https://blog.olly.garden/how-to-name-your-metrics</link><guid isPermaLink="true">https://blog.olly.garden/how-to-name-your-metrics</guid><category><![CDATA[observability]]></category><category><![CDATA[telemetry]]></category><category><![CDATA[OpenTelemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Tue, 09 Sep 2025 22:00:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759779166859/ee11b350-4fb4-4bd1-97ef-396f83a7a553.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Metrics are the quantitative backbone of observability—the numbers that tell us how our systems are performing. This is the third post in our OpenTelemetry naming series, where we've already explored <a target="_blank" href="how-to-name-your-spans">how to name spans</a> and how to enrich them with meaningful attributes. Now let's tackle the art of naming the measurements that matter.</p>
<p>Unlike spans that tell stories about what happened, metrics tell us about quantities: how many, how fast, how much. But here's the thing—naming them well is just as crucial as naming spans, and the principles we've learned apply here too. The "who" still belongs in attributes, not names.</p>
<h2 id="heading-learning-from-traditional-systems">Learning from Traditional Systems</h2>
<p>Before diving into OpenTelemetry best practices, let's examine how traditional monitoring systems handle metric naming. Take Kubernetes, for example. Its metrics follow patterns like:</p>
<ul>
<li><p><code>apiserver_request_total</code></p>
</li>
<li><p><code>scheduler_schedule_attempts_total</code></p>
</li>
<li><p><code>container_cpu_usage_seconds_total</code></p>
</li>
<li><p><code>kubelet_volume_stats_used_bytes</code></p>
</li>
</ul>
<p>Notice the pattern? <strong>Component name + resource + action + unit</strong>. The service or component name is baked right into the metric name. This approach made sense in simpler data models where you had limited options for storing context.</p>
<p>But this creates several problems:</p>
<ul>
<li><p><strong>Cluttered observability backend</strong>: Every component gets its own metric namespace, making it harder to find the right metric among dozens or hundreds of similarly-named metrics</p>
</li>
<li><p><strong>Inflexible aggregation</strong>: Can't easily sum metrics across different components</p>
</li>
<li><p><strong>Vendor lock-in</strong>: Metric names become tied to specific implementations</p>
</li>
<li><p><strong>Maintenance overhead</strong>: Adding new services requires new metric names</p>
</li>
</ul>
<h2 id="heading-the-core-anti-pattern-service-names-in-metric-names">The Core Anti-Pattern: Service Names in Metric Names</h2>
<p>Here's the most important principle for OpenTelemetry metrics: <strong>Don't include your service name in the metric name</strong>.</p>
<p>Let's say you have a payment service. You might be tempted to create metrics like:</p>
<ul>
<li><p><code>payment.transaction.count</code></p>
</li>
<li><p><code>payment.latency.p95</code></p>
</li>
<li><p><code>payment.error.rate</code></p>
</li>
</ul>
<p>Don't do this. The service name is already available as context through the <code>service.name</code> resource attribute. Instead, use:</p>
<ul>
<li><p><code>transaction.count</code> with <code>service.name=payment</code></p>
</li>
<li><p><code>http.server.request.duration</code> with <code>service.name=payment</code></p>
</li>
<li><p><code>error.rate</code> with <code>service.name=payment</code></p>
</li>
</ul>
<p>Why is this better? Because now you can easily aggregate across all services:</p>
<pre><code class="lang-plaintext">sum(transaction.count)  // All transactions across all services
sum(transaction.count{service.name="payment"})  // Just payment transactions
</code></pre>
<p>If every service had its own metric name, you'd need to know every service name to build meaningful dashboards. With clean names, one query works for everything.</p>
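<p>A toy illustration of why this matters for aggregation, with plain Python dictionaries standing in for metric points (not a real query engine or backend):</p>

```python
# Each point carries a clean metric name plus a service.name attribute.
points = [
    {"name": "transaction.count", "service.name": "payment",  "value": 120},
    {"name": "transaction.count", "service.name": "checkout", "value": 45},
    {"name": "transaction.count", "service.name": "refunds",  "value": 8},
]

# sum(transaction.count): one expression covers every service,
# including services that don't exist yet.
total = sum(p["value"] for p in points if p["name"] == "transaction.count")

# sum(transaction.count{service.name="payment"}): filter by attribute.
payment_total = sum(
    p["value"]
    for p in points
    if p["name"] == "transaction.count" and p["service.name"] == "payment"
)
```

If each service had baked its name into the metric (`payment_transaction_total`, `checkout_transaction_total`, …), the first query would instead be a hand-maintained list of metric names.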
<h2 id="heading-opentelemetrys-rich-context-model">OpenTelemetry's Rich Context Model</h2>
<p>OpenTelemetry metrics benefit from the same rich context model we discussed in our span attributes article. Instead of forcing everything into the metric name, we have multiple layers where context can live:</p>
<h3 id="heading-traditional-approach-prometheus-style">Traditional Approach (Prometheus style):</h3>
<pre><code class="lang-plaintext">payment_service_transaction_total{method="credit_card",status="success"}
user_service_auth_latency_milliseconds{endpoint="/login",region="us-east"}  
inventory_service_db_query_seconds{table="products",operation="select"}
</code></pre>
<h3 id="heading-opentelemetry-approach">OpenTelemetry Approach:</h3>
<pre><code class="lang-plaintext">transaction.count
- Resource: service.name=payment, service.version=1.2.3, deployment.environment.name=prod
- Scope: instrumentation.library.name=com.acme.payment, instrumentation.library.version=2.1.0
- Attributes: method=credit_card, status=success

auth.duration  
- Resource: service.name=user, service.version=2.0.1, deployment.environment.name=prod
- Scope: instrumentation.library.name=express.middleware
- Attributes: endpoint=/login, region=us-east
- Unit: ms

db.client.operation.duration
- Resource: service.name=inventory, service.version=1.5.2
- Scope: instrumentation.library.name=postgres.client  
- Attributes: db.sql.table=products, db.operation=select
- Unit: s
</code></pre>
<p>This three-layer separation follows the OpenTelemetry specification's <strong>Events → Metric Streams → Timeseries</strong> model, where context flows through multiple hierarchical levels rather than being crammed into names.</p>
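<p>The same layering can be written down as a plain data structure. This is a sketch of the conceptual model, not the OTLP wire format:</p>

```python
metric_stream = {
    "name": "transaction.count",
    "unit": "1",
    "resource": {            # who is emitting: stable per process
        "service.name": "payment",
        "service.version": "1.2.3",
        "deployment.environment.name": "prod",
    },
    "scope": {               # which instrumentation produced it
        "name": "com.acme.payment",
        "version": "2.1.0",
    },
    "attributes": {          # per-measurement context
        "method": "credit_card",
        "status": "success",
    },
}

# The name stays clean; every piece of context has a dedicated layer.
assert "payment" not in metric_stream["name"]
```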
<h2 id="heading-units-keep-them-out-of-names-too">Units: Keep Them Out of Names Too</h2>
<p>Just like we learned that service names don't belong in metric names, <strong>units don't belong there either</strong>.</p>
<p>Traditional systems often include units in the name because they lack proper unit metadata:</p>
<ul>
<li><p><code>response_time_milliseconds</code></p>
</li>
<li><p><code>memory_usage_bytes</code></p>
</li>
<li><p><code>throughput_requests_per_second</code></p>
</li>
</ul>
<p>OpenTelemetry treats units as metadata, separate from the name:</p>
<ul>
<li><p><code>http.server.request.duration</code> with unit <code>ms</code></p>
</li>
<li><p><code>system.memory.usage</code> with unit <code>By</code></p>
</li>
<li><p><code>http.server.request.rate</code> with unit <code>{request}/s</code></p>
</li>
</ul>
<p>This approach has several benefits:</p>
<ol>
<li><p><strong>Clean names</strong>: No ugly suffixes cluttering your metric names</p>
</li>
<li><p><strong>Standardized units</strong>: Follow the Unified Code for Units of Measure (UCUM)</p>
</li>
<li><p><strong>Backend flexibility</strong>: Systems can handle unit conversion automatically</p>
</li>
<li><p><strong>Consistent conventions</strong>: Aligns with OpenTelemetry semantic conventions</p>
</li>
</ol>
<p>The specification recommends using non-prefixed units like <code>By</code> (bytes) rather than <code>MiBy</code> (mebibytes) unless there are technical reasons to do otherwise.</p>
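<p>To make the unit-as-metadata idea concrete, here is a small sketch: one duration metric carrying a UCUM unit, which a backend could convert at display time. The conversion table is illustrative, not part of any SDK:</p>

```python
# One metric, one name; the unit lives alongside the value as metadata.
metric = {"name": "http.server.request.duration", "unit": "ms", "value": 1500.0}

# Because the unit is metadata, a single stored metric can be rendered
# at any scale. A backend might keep a UCUM conversion table like this:
TO_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}

def as_seconds(m):
    return m["value"] * TO_SECONDS[m["unit"]]
```

With name-embedded units you would need `request_duration_ms` and `request_duration_seconds` as two separate metrics; here the conversion is a lookup, not a second timeseries.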
<h2 id="heading-practical-naming-guidelines">Practical Naming Guidelines</h2>
<p>When creating metric names, apply the same <code>{verb} {object}</code> principle we learned for spans, where it makes sense:</p>
<ol>
<li><p><strong>Focus on the operation</strong>: What is being measured?</p>
</li>
<li><p><strong>Not the operator</strong>: Who is doing the measuring?</p>
</li>
<li><p><strong>Follow semantic conventions</strong>: Use established patterns when available</p>
</li>
<li><p><strong>Keep units as metadata</strong>: Don't suffix names with units</p>
</li>
</ol>
<p>Here are examples following OpenTelemetry semantic conventions:</p>
<ul>
<li><p><code>http.server.request.duration</code> (not <code>payment_http_requests_ms</code>)</p>
</li>
<li><p><code>db.client.operation.duration</code> (not <code>user_service_db_queries_seconds</code>)</p>
</li>
<li><p><code>messaging.client.sent.messages</code> (not <code>order_service_messages_sent_total</code>)</p>
</li>
<li><p><code>transaction.count</code> (not <code>payment_transaction_total</code>)</p>
</li>
</ul>
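<p>These guidelines can even be checked mechanically. Below is a hedged sketch of a lint function; the lists of known services and unit suffixes are hypothetical and would come from your own service inventory:</p>

```python
# Hypothetical inputs: your organization's service names and the unit
# suffixes you want to ban from metric names.
KNOWN_SERVICES = {"payment", "user_service", "inventory", "order_service"}
UNIT_SUFFIXES = ("_ms", "_seconds", "_bytes", "_total", "_percent")

def lint_metric_name(name):
    """Return a list of problems with a proposed metric name."""
    problems = []
    first_component = name.replace(".", "_").split("_")[0]
    if first_component in KNOWN_SERVICES or any(
        name.startswith(s) for s in KNOWN_SERVICES
    ):
        problems.append("service name belongs in service.name, not the metric name")
    if name.endswith(UNIT_SUFFIXES):
        problems.append("unit belongs in unit metadata, not the metric name")
    return problems

print(lint_metric_name("payment_transaction_total"))  # two problems
print(lint_metric_name("transaction.count"))          # []
```

A check like this fits naturally into code review or a CI step, catching bad names before they reach a dashboard.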
<h2 id="heading-real-world-migration-examples">Real-world Migration Examples</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Traditional (Context+Units in Name)</td><td>OpenTelemetry (Clean Separation)</td><td>Why It's Better</td></tr>
</thead>
<tbody>
<tr>
<td><code>payment_transaction_total</code></td><td><code>transaction.count</code> + <code>service.name=payment</code> + unit <code>1</code></td><td>Aggregatable across services</td></tr>
<tr>
<td><code>user_service_auth_latency_ms</code></td><td><code>auth.duration</code> + <code>service.name=user</code> + unit <code>ms</code></td><td>Standard operation name, proper unit metadata</td></tr>
<tr>
<td><code>inventory_db_query_seconds</code></td><td><code>db.client.operation.duration</code> + <code>service.name=inventory</code> + unit <code>s</code></td><td>Follows semantic conventions</td></tr>
<tr>
<td><code>api_gateway_requests_per_second</code></td><td><code>http.server.request.rate</code> + <code>service.name=api-gateway</code> + unit <code>{request}/s</code></td><td>Clean name, proper rate unit</td></tr>
<tr>
<td><code>redis_cache_hit_ratio_percent</code></td><td><code>cache.hit_ratio</code> + <code>service.name=redis</code> + unit <code>1</code></td><td>Ratios are unitless</td></tr>
</tbody>
</table>
</div><h2 id="heading-benefits-of-clean-naming">Benefits of Clean Naming</h2>
<p>Separating context from metric names provides specific technical advantages that improve both query performance and operational workflows. The first benefit is cross-service aggregation. A query like <code>sum(transaction.count)</code> returns data from all services without requiring you to know or maintain a list of service names. In a system with 50 microservices, this means one query instead of 50, and that query doesn't break when you add the 51st service.</p>
<p>This consistency makes dashboards reusable across services. A dashboard built for monitoring HTTP requests in your authentication service works without modification for your payment service, inventory service, or any other HTTP-serving component. You write the query once—<code>http.server.request.duration</code> filtered by <code>service.name</code>—and apply it everywhere. No more maintaining dozens of nearly-identical dashboards. Some observability vendors now take this further, automatically generating dashboards based on semantic convention metric names—when your services emit <code>http.server.request.duration</code>, the platform knows exactly what visualizations and aggregations make sense for that metric.</p>
<p>Clean naming also reduces metric namespace clutter. Consider a platform with dozens of services each defining their own metrics. With traditional naming, your metric browser shows hundreds of service-specific variations: <code>apiserver_request_total</code>, <code>payment_service_request_total</code>, <code>user_service_request_total</code>, <code>inventory_service_request_total</code>, and so on. Finding the right metric becomes an exercise in scrolling and searching through redundant variations. With clean naming, you have one metric name (<code>request.count</code>) with attributes capturing the context. This makes metric discovery straightforward—you find the measurement you need, then filter by the service you care about.</p>
<p>Unit handling becomes systematic when units are metadata rather than name suffixes. Observability platforms can perform unit conversions automatically—displaying the same duration metric as milliseconds in one graph and seconds in another, based on what makes sense for the visualization. The metric remains <code>request.duration</code> with unit metadata <code>ms</code>, not two separate metrics <code>request_duration_ms</code> and <code>request_duration_seconds</code>.</p>
<p>The approach also ensures compatibility between manual and automatic instrumentation. When you follow semantic conventions like <code>http.server.request.duration</code>, your custom metrics align with those generated by auto-instrumentation libraries. This creates a consistent data model where queries work across both manually and automatically instrumented services, and engineers don't need to remember which metrics come from which source.</p>
<h2 id="heading-common-pitfalls-to-avoid">Common Pitfalls to Avoid</h2>
<p>Engineers often embed deployment-specific information directly into metric names, creating patterns like <code>user_service_v2_latency</code>. This breaks when version 3 deploys—every dashboard, alert, and query that references the metric name must be updated. The same problem occurs with instance-specific names like <code>node_42_memory_usage</code>. In a cluster with dynamic scaling, you end up with hundreds of distinct metric names that represent the same measurement, making it impossible to write simple aggregation queries.</p>
<p>Environment-specific prefixes cause similar maintenance problems. With metrics named <code>prod_payment_errors</code> and <code>staging_auth_count</code>, you can't write a single query that works across environments. A dashboard that monitors production can't be used for staging without modification. When you need to compare metrics between environments—a common debugging task—you have to write complex queries that explicitly reference each environment's metric names.</p>
<p>Technology stack details in metric names create future migration headaches. A metric named <code>nodejs_payment_memory</code> becomes misleading when you rewrite the service in Go. Similarly, <code>postgres_user_queries</code> requires renaming if you migrate to something else. These technology-specific names also prevent you from writing queries that work across services using different tech stacks, even when they perform the same business function.</p>
<p>Mixing business domains with infrastructure metrics violates the separation between what a system does and how it does it. A metric like <code>ecommerce_cpu_usage</code> conflates the business purpose (e-commerce) with the technical measurement (CPU usage). This makes it harder to reuse infrastructure monitoring across different business domains and complicates multi-tenant deployments where the same infrastructure serves multiple business functions.</p>
<p>The practice of including units in metric names—<code>latency_ms</code>, <code>memory_bytes</code>, <code>count_total</code>—creates redundancy now that OpenTelemetry provides proper unit metadata. It also prevents automatic unit conversion. With <code>request_duration_ms</code> and <code>request_duration_seconds</code> as separate metrics, you need different queries for different time scales. With a single <code>request.duration</code> metric that includes unit metadata, the observability platform handles conversion automatically.</p>
<p>The pattern is clear: context that varies by deployment, instance, environment, or version belongs in attributes, not in the metric name. The metric name should identify what you're measuring. Everything else—who's measuring it, where it's running, which version it is—goes in the attribute layer where it can be filtered, grouped, and aggregated as needed.</p>
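<p>As an illustration of that separation, here is a sketch of how a migration might split one legacy name into a clean name plus attributes. The mapping table is hypothetical; a real migration would be driven by an audit of your own metrics:</p>

```python
# Hypothetical mapping from legacy names to (clean name, extracted attributes).
MIGRATIONS = {
    "prod_payment_errors": (
        "error.count",
        {"service.name": "payment", "deployment.environment.name": "prod"},
    ),
    "nodejs_payment_memory": (
        "process.memory.usage",
        {"service.name": "payment"},  # runtime details belong in resource attrs
    ),
}

def migrate(legacy_name):
    clean_name, attributes = MIGRATIONS[legacy_name]
    return {"name": clean_name, "attributes": attributes}

m = migrate("prod_payment_errors")
print(m["name"])  # error.count
```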
<h2 id="heading-cultivating-better-metrics">Cultivating Better Metrics</h2>
<p>Just like the spans we covered earlier in this series, well-named metrics are a gift to your future self and your team. They provide clarity during incidents, enable powerful cross-service analysis, and make your observability data truly useful rather than just voluminous.</p>
<p>The key insight is the same one we learned with spans: <strong>separation of concerns</strong>. The metric name describes what you're measuring. The context—who's measuring it, where, when, and how—lives in the rich attribute hierarchy that OpenTelemetry provides.</p>
<p>In our next post, we'll dive deep into <strong>metric attributes</strong>—the context layer that makes metrics truly powerful. We'll explore how to structure the rich contextual information that doesn't belong in names, and how to balance informativeness with cardinality concerns.</p>
<p>Until then, remember: a clean metric name is like a well-tended garden path—it leads you exactly where you need to go.</p>
]]></content:encoded></item><item><title><![CDATA[How to Name Your Span Attributes]]></title><description><![CDATA[Welcome to the second installment in our series on OpenTelemetry naming best practices. In our previous post, we explored how to name spans using the {verb} {object} pattern. Today, we're diving into span attributes, the rich contextual data that tra...]]></description><link>https://blog.olly.garden/how-to-name-your-span-attributes</link><guid isPermaLink="true">https://blog.olly.garden/how-to-name-your-span-attributes</guid><category><![CDATA[observability]]></category><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[telemetry]]></category><category><![CDATA[instrumentation]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Tue, 26 Aug 2025 22:00:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756384123234/2ba9e63f-3cbc-4e0c-a2ba-9891f61830f8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to the second installment in our series on OpenTelemetry naming best practices. In our previous post, we explored how to name spans using the <code>{verb} {object}</code> pattern. Today, we're diving into span attributes, the rich contextual data that transforms your traces from simple operation logs into powerful debugging and analysis primitives.</p>
<p>This guide targets developers who are:</p>
<ul>
<li><strong>Instrumenting their own applications</strong> with custom spans and attributes  </li>
<li><strong>Enriching telemetry</strong> beyond what auto-instrumentation provides  </li>
<li><strong>Creating libraries</strong> that others will instrument</li>
</ul>
<p>The attribute naming decisions you make directly impact the usability and maintainability of your observability data. Let's get them right.</p>
<h2 id="heading-start-with-semantic-conventions">Start with Semantic Conventions</h2>
<p>Here's the most important rule that will save you time and improve interoperability: <strong>if an OpenTelemetry semantic convention exists and the semantics match your use case, use it</strong>.</p>
<p>This isn't just about convenience—it's about building telemetry that integrates seamlessly with the broader OpenTelemetry ecosystem. When you use standardized attribute names, your data automatically works with existing dashboards, alerting rules, and analysis tools.</p>
<h3 id="heading-when-semantics-match-use-the-convention">When Semantics Match, Use the Convention</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Your Need</td><td>Use This Semantic Convention</td><td>Why</td></tr>
</thead>
<tbody>
<tr>
<td>HTTP request method</td><td><code>http.request.method</code></td><td>Standardized across all HTTP instrumentation</td></tr>
<tr>
<td>Database collection name</td><td><code>db.collection.name</code></td><td>Works with database monitoring tools</td></tr>
<tr>
<td>Service identification</td><td><code>service.name</code></td><td>Core resource attribute for service correlation</td></tr>
<tr>
<td>Network peer address</td><td><code>network.peer.address</code></td><td>Standard for network-level debugging</td></tr>
<tr>
<td>Error classification</td><td><code>error.type</code></td><td>Enables consistent error analysis</td></tr>
</tbody>
</table>
</div><p>The key principle is <strong>semantic match over naming preference</strong>. Even if you prefer <code>database_table</code> over <code>db.collection.name</code>, use the semantic convention when it accurately describes your data.</p>
<h3 id="heading-when-semantics-dont-match-dont-force-it">When Semantics Don't Match, Don't Force It</h3>
<p>Resist the temptation to misuse semantic conventions:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Don't Do This</td><td>Why It's Wrong</td></tr>
</thead>
<tbody>
<tr>
<td>Using <code>db.collection.name</code> for a file name</td><td>Files and database collections are different concepts</td></tr>
<tr>
<td>Using <code>http.request.method</code> for business actions</td><td>"approve_payment" isn't an HTTP method</td></tr>
<tr>
<td>Using <code>user.id</code> for a transaction ID</td><td>Users and transactions are different entities</td></tr>
</tbody>
</table>
</div><p>Misusing semantic conventions is worse than creating custom attributes—it creates confusion and breaks tooling that expects the standard semantics.</p>
<h2 id="heading-the-golden-rule-domain-first-never-company-first">The Golden Rule: Domain First, Never Company First</h2>
<p>When you need custom attributes beyond the semantic conventions, the most critical principle is: <strong>start with the domain or technology, never your company or application name</strong>.</p>
<p>This principle seems obvious but is consistently violated across the industry. Here's why it matters and how to get it right.</p>
<h3 id="heading-why-company-first-naming-fails">Why Company-First Naming Fails</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Bad Attribute Name</td><td>Problems</td></tr>
</thead>
<tbody>
<tr>
<td><code>og.user.id</code></td><td>Company prefix pollutes global namespace</td></tr>
<tr>
<td><code>myapp.request.size</code></td><td>Application-specific, not reusable</td></tr>
<tr>
<td><code>acme.inventory.count</code></td><td>Makes correlation with standard attributes difficult</td></tr>
<tr>
<td><code>shopify_store.product.sku</code></td><td>Unnecessarily ties concept to one vendor</td></tr>
</tbody>
</table>
</div><p>These approaches create attributes that are:</p>
<ul>
<li>Difficult to correlate across teams and organizations  </li>
<li>Impossible to reuse in different contexts  </li>
<li>Vendor-locked and inflexible  </li>
<li>Inconsistent with OpenTelemetry's interoperability goals</li>
</ul>
<h3 id="heading-domain-first-success-stories">Domain-First Success Stories</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Good Attribute Name</td><td>Why It Works</td></tr>
</thead>
<tbody>
<tr>
<td><code>user.id</code></td><td>Universal concept, vendor-neutral</td></tr>
<tr>
<td><code>request.size</code></td><td>Reusable across applications</td></tr>
<tr>
<td><code>inventory.count</code></td><td>Clear, domain-specific concept</td></tr>
<tr>
<td><code>product.sku</code></td><td>Standard e-commerce terminology</td></tr>
<tr>
<td><code>workflow.step.name</code></td><td>Generic process management concept</td></tr>
</tbody>
</table>
</div><p>This approach creates attributes that are universally understandable, reusable by others facing similar problems, and future-proof.</p>
<h2 id="heading-understanding-the-structure-dots-and-underscores">Understanding the Structure: Dots and Underscores</h2>
<p>OpenTelemetry attribute names follow a specific structural pattern that balances readability with consistency. Understanding this pattern helps you create attributes that feel natural alongside standard semantic conventions.</p>
<h3 id="heading-use-dots-for-hierarchical-separation">Use Dots for Hierarchical Separation</h3>
<p>Dots (<code>.</code>) separate hierarchical components, following the pattern: <code>{domain}.{component}.{property}</code></p>
<p>Examples from semantic conventions:</p>
<ul>
<li><code>http.request.method</code> - HTTP domain, request component, method property  </li>
<li><code>db.collection.name</code> - Database domain, collection component, name property  </li>
<li><code>service.instance.id</code> - Service domain, instance component, id property</li>
</ul>
<h3 id="heading-use-underscores-for-multi-word-components">Use Underscores for Multi-Word Components</h3>
<p>When a single component contains multiple words, use underscores (<code>_</code>):</p>
<ul>
<li><code>http.response.status_code</code> - "status_code" is one logical component  </li>
<li><code>system.memory.usage_percent</code> - "usage_percent" is one measurement concept</li>
</ul>
<h3 id="heading-create-deeper-hierarchies-when-needed">Create Deeper Hierarchies When Needed</h3>
<p>You can nest further when it adds clarity:</p>
<ul>
<li><code>http.request.body.size</code>  </li>
<li><code>k8s.pod.label.{key}</code>  </li>
<li><code>messaging.kafka.message.key</code></li>
</ul>
<p>Each level should represent a meaningful conceptual boundary.</p>
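<p>A small sketch of what that structural rule looks like as code. The regular expression is an approximation of the convention, not the normative grammar from the specification:</p>

```python
import re

# Each dot-separated component is lowercase words joined by underscores.
COMPONENT = r"[a-z][a-z0-9]*(_[a-z0-9]+)*"
ATTRIBUTE_NAME = re.compile(rf"^{COMPONENT}(\.{COMPONENT})*$")

def looks_well_formed(name):
    return ATTRIBUTE_NAME.match(name) is not None

print(looks_well_formed("http.response.status_code"))  # True
print(looks_well_formed("Http.Response.StatusCode"))   # False
```

Templated names like <code>k8s.pod.label.{key}</code> would need the placeholder expanded before a check like this applies.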
<h2 id="heading-reserved-namespaces-what-you-must-never-use">Reserved Namespaces: What You Must Never Use</h2>
<p>Certain namespaces are strictly reserved, and violating these rules can break your telemetry data.</p>
<h3 id="heading-the-otel-namespace-is-off-limits">The <code>otel.*</code> Namespace is Off-Limits</h3>
<p>The <code>otel.*</code> prefix is exclusively reserved for the OpenTelemetry specification itself. It's used to express OpenTelemetry concepts in telemetry formats that don't natively support them.</p>
<p>Reserved <code>otel.*</code> attributes include:</p>
<ul>
<li><code>otel.scope.name</code> - Instrumentation scope name  </li>
<li><code>otel.status_code</code> - Span status code  </li>
<li><code>otel.span.sampling_result</code> - Sampling decision</li>
</ul>
<p><strong>Never create attributes starting with <code>otel.</code></strong> Any additions to this namespace must be approved as part of the OpenTelemetry specification.</p>
<h3 id="heading-other-reserved-attributes">Other Reserved Attributes</h3>
<p>The specification also reserves these specific attribute names:</p>
<ul>
<li><code>error.type</code>  </li>
<li><code>exception.message</code>, <code>exception.stacktrace</code>, <code>exception.type</code>  </li>
<li><code>server.address</code>, <code>server.port</code>  </li>
<li><code>service.name</code>  </li>
<li><code>telemetry.sdk.language</code>, <code>telemetry.sdk.name</code>, <code>telemetry.sdk.version</code>  </li>
<li><code>url.scheme</code></li>
</ul>
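<p>A guard for these rules might look like the following sketch. The reserved list mirrors the names above; treat it as illustrative rather than exhaustive:</p>

```python
RESERVED_PREFIXES = ("otel.",)
RESERVED_NAMES = {
    "error.type",
    "exception.message", "exception.stacktrace", "exception.type",
    "server.address", "server.port",
    "service.name",
    "telemetry.sdk.language", "telemetry.sdk.name", "telemetry.sdk.version",
    "url.scheme",
}

def check_custom_attribute(name):
    """Reject custom attribute names that collide with reserved ones."""
    if name.startswith(RESERVED_PREFIXES):
        raise ValueError(f"{name}: the otel.* namespace is reserved")
    if name in RESERVED_NAMES:
        raise ValueError(f"{name}: reserved by the specification")
    return name

check_custom_attribute("inventory.item.id")   # fine
# check_custom_attribute("otel.my_flag")      # would raise ValueError
```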
<h2 id="heading-semantic-convention-patterns">Semantic Convention Patterns</h2>
<p>The best way to develop good attribute naming intuition is studying OpenTelemetry's semantic conventions. These represent thousands of hours of design work by observability experts.</p>
<h3 id="heading-domain-organization-patterns">Domain Organization Patterns</h3>
<p>Notice how semantic conventions organize around clear domains:</p>
<p><strong>Infrastructure Domains</strong></p>
<ul>
<li><code>service.*</code> - Service identity and metadata  </li>
<li><code>host.*</code> - Host/machine information  </li>
<li><code>container.*</code> - Container runtime information  </li>
<li><code>process.*</code> - Operating system processes</li>
</ul>
<p><strong>Communication Domains</strong></p>
<ul>
<li><code>http.*</code> - HTTP protocol specifics  </li>
<li><code>network.*</code> - Network layer information  </li>
<li><code>rpc.*</code> - Remote procedure call attributes  </li>
<li><code>messaging.*</code> - Message queue systems</li>
</ul>
<p><strong>Data Domains</strong></p>
<ul>
<li><code>db.*</code> - Database operations  </li>
<li><code>url.*</code> - URL components</li>
</ul>
<h3 id="heading-universal-property-patterns">Universal Property Patterns</h3>
<p>Across all domains, consistent patterns emerge for common properties:</p>
<p><strong>Identity Properties</strong></p>
<ul>
<li><code>.name</code> - Human-readable identifiers (<code>service.name</code>, <code>container.name</code>)  </li>
<li><code>.id</code> - System identifiers (<code>container.id</code>, <code>process.pid</code>)  </li>
<li><code>.version</code> - Version information (<code>service.version</code>)  </li>
<li><code>.type</code> - Classification (<code>messaging.operation.type</code>, <code>error.type</code>)</li>
</ul>
<p><strong>Network Properties</strong></p>
<ul>
<li><code>.address</code> - Network addresses (<code>server.address</code>, <code>client.address</code>)  </li>
<li><code>.port</code> - Port numbers (<code>server.port</code>, <code>client.port</code>)</li>
</ul>
<p><strong>Measurement Properties</strong></p>
<ul>
<li><code>.size</code> - Byte measurements (<code>http.request.body.size</code>)  </li>
<li><code>.count</code> - Quantities (<code>messaging.batch.message_count</code>)  </li>
<li><code>.duration</code> - Time measurements (<code>http.server.request.duration</code>)</li>
</ul>
<p>When creating custom domains, follow these same patterns. For inventory management, consider:</p>
<ul>
<li><code>inventory.item.name</code>  </li>
<li><code>inventory.item.id</code>  </li>
<li><code>inventory.location.address</code>  </li>
<li><code>inventory.batch.count</code></li>
</ul>
<h2 id="heading-creating-custom-domains-safely">Creating Custom Domains Safely</h2>
<p>Sometimes your business logic requires attributes outside existing semantic conventions. This is normal—OpenTelemetry can't cover every possible business domain.</p>
<h3 id="heading-guidelines-for-safe-custom-domains">Guidelines for Safe Custom Domains</h3>
<ol>
<li><strong>Choose descriptive, generic names</strong> that others could reuse  </li>
<li><strong>Avoid company-specific terminology</strong> in the domain name  </li>
<li><strong>Follow hierarchical patterns</strong> established by semantic conventions  </li>
<li><strong>Consider if your domain could become a future semantic convention</strong></li>
</ol>
<h3 id="heading-examples-of-well-designed-custom-attributes">Examples of Well-Designed Custom Attributes</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Domain</td><td>Good Attributes</td><td>Why They Work</td></tr>
</thead>
<tbody>
<tr>
<td>Business</td><td><code>payment.method</code>, <code>order.status</code></td><td>Clear, reusable business concepts</td></tr>
<tr>
<td>Logistics</td><td><code>inventory.location</code>, <code>shipment.carrier</code></td><td>Domain-specific but transferable</td></tr>
<tr>
<td>Process</td><td><code>workflow.step.name</code>, <code>approval.status</code></td><td>Generic process management</td></tr>
<tr>
<td>Content</td><td><code>document.format</code>, <code>media.codec</code></td><td>Universal content concepts</td></tr>
</tbody>
</table>
</div><h2 id="heading-the-rare-exception-when-prefixes-make-sense">The Rare Exception: When Prefixes Make Sense</h2>
<p>In rare cases, you might need company or application prefixes. This typically happens when your custom attribute might conflict with attributes from other sources in a distributed system.</p>
<p><strong>Consider prefixes when:</strong></p>
<ul>
<li>Your attribute might conflict with vendor attributes in a distributed system  </li>
<li>You're instrumenting proprietary technology that's truly company-specific  </li>
<li>You're capturing internal implementation details that shouldn't be generalized</li>
</ul>
<p>For most business logic attributes, stick with domain-first naming.</p>
<h2 id="heading-your-action-plan">Your Action Plan</h2>
<p>Naming span attributes well creates telemetry data that's maintainable, interoperable, and valuable across your organization. Here's your roadmap:</p>
<ol>
<li><strong>Always check semantic conventions first</strong> - Use them when semantics match  </li>
<li><strong>Lead with domain, never company</strong> - Create vendor-neutral attributes  </li>
<li><strong>Respect reserved namespaces</strong> - Especially avoid <code>otel.*</code>  </li>
<li><strong>Follow hierarchical patterns</strong> - Use dots and underscores consistently  </li>
<li><strong>Build for reusability</strong> - Think beyond your current needs</li>
</ol>
<p>By following these principles, you're not just solving today's instrumentation challenges, you're contributing to a more coherent, interoperable observability ecosystem that benefits everyone.</p>
<p>In our next post in this series, we'll shift our focus from spans to metrics, exploring how to name the quantitative measurements that tell us how our systems are performing, and why the same principles of separation and domain-first thinking apply to the numbers that matter most.  </p>
]]></content:encoded></item><item><title><![CDATA[How to Name Your Spans]]></title><description><![CDATA[One of the most fundamental yet often overlooked aspects of good instrumentation
is naming. This post is the first in a series dedicated to the art and science
of naming things in OpenTelemetry. We'll start with spans, the building blocks
of a distri...]]></description><link>https://blog.olly.garden/how-to-name-your-spans</link><guid isPermaLink="true">https://blog.olly.garden/how-to-name-your-spans</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[distributed tracing]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Tue, 05 Aug 2025 22:00:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754500711825/72f72ddf-ae37-487f-8648-cc92a7991192.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the most fundamental yet often overlooked aspects of good instrumentation
is naming. This post is the first in a series dedicated to the art and science
of naming things in OpenTelemetry. We'll start with spans, the building blocks
of a distributed trace, and give you the most important takeaway right at the
beginning: how to name the spans that describe your unique business logic.</p>
<h2 id="heading-naming-your-business-spans">Naming your business spans</h2>
<p>While OpenTelemetry's automatic instrumentation is fantastic for covering
standard operations (like incoming HTTP requests or database calls), the most
valuable insights often come from the custom spans you add to your own business
logic. These are the operations unique to your application's domain.</p>
<p>For these custom spans, we recommend a pattern that borrows from basic grammar.
Simple, clear sentences often follow a subject -&gt; verb -&gt; direct object
structure. The "subject" (the service performing the work) is already part of
the trace's context. We can use the rest of that structure for our span name:</p>
<h2 id="heading-verb-object">{verb} {object}</h2>
<p>This pattern is descriptive, easy to understand, and helps maintain low cardinality—a
crucial concept we'll touch on later.</p>
<ul>
<li><strong>{verb}</strong>: A verb describing the work being done (for example: process, send,
calculate, render).</li>
<li><strong>{object}</strong>: A noun describing what is being acted upon (for example:
payment, invoice, shopping_cart, ad).</li>
</ul>
<p>Let's look at some examples:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Bad Span Name</td><td>Good Span Name</td><td>Why It's Better</td></tr>
</thead>
<tbody>
<tr>
<td>process_payment_for_user_jane_doe</td><td>process payment</td><td>The verb and object are clear. The user ID belongs in an attribute.</td></tr>
<tr>
<td>send_invoice_#98765</td><td>send invoice</td><td>Aggregable. You can easily find the P95 latency for sending all invoices.</td></tr>
<tr>
<td>render_ad_for_campaign_summer_sale</td><td>render ad</td><td>The specific campaign is a detail, not the core operation. Put it in an attribute.</td></tr>
<tr>
<td>calculate_shipping_for_zip_90210</td><td>calculate shipping</td><td>The operation is consistent. The zip code is a parameter, not part of the name.</td></tr>
<tr>
<td>validation_failed</td><td>validate user_input</td><td>Focus on the operation, not the outcome. The result belongs in the span's status.</td></tr>
</tbody>
</table>
</div><p>By adhering to the <code>{verb} {object}</code> format, you create a clear, consistent
vocabulary for your business operations. This makes your traces incredibly
powerful. A product manager could ask, "How long does it take to process
payments?" and an engineer can immediately filter for those spans and get an
answer.</p>
<h2 id="heading-why-this-pattern-works">Why this pattern works</h2>
<p>So why is <code>process payment</code> good and <code>process_invoice_#98765</code> bad? The reason is
<strong>cardinality</strong>.</p>
<p>Cardinality refers to the number of unique values a piece of data can have. A
span name should have <strong>low cardinality</strong>. If you include unique identifiers
like a user ID or an invoice number in the span name, you will create a unique
name for every single operation. This floods your observability backend, makes
it impossible to group and analyze similar operations, and can significantly
increase costs.</p>
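<p>To make the cardinality gap concrete, here is a small, purely illustrative Python sketch (the span names are hypothetical) counting how many distinct names each approach produces over a thousand operations:</p>

```python
# Illustrative only: embedding unique IDs in span names explodes cardinality.
bad_names = {f"process_payment_for_user_{i}" for i in range(1000)}
good_names = {"process payment" for _ in range(1000)}

print(len(bad_names))   # 1000 distinct names for the backend to index
print(len(good_names))  # 1 distinct name, trivially aggregable
```

<p>A backend can compute the P95 latency for the single name in one query; the thousand one-off names cannot be grouped at all without extra parsing.</p>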
<p>The <code>{verb} {object}</code> pattern naturally produces low-cardinality names. The
unique, high-cardinality details (<code>invoice_#98765</code>, <code>user_jane_doe</code>) belong in
<strong>span attributes</strong>, which we will cover in a future blog post.</p>
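<p>As a sketch of that separation (the helper and attribute keys below are invented for illustration, not official semantic conventions), you can keep the name low-cardinality while routing the specifics into an attribute map:</p>

```python
def business_span(verb, obj, **details):
    """Return a low-cardinality '{verb} {obj}' span name plus an
    attribute dict holding the high-cardinality, request-specific details."""
    return f"{verb} {obj}", details

name, attributes = business_span(
    "process", "payment",
    user_id="jane_doe",    # hypothetical attribute keys
    invoice_id="98765",
)
print(name)        # process payment
print(attributes)  # {'user_id': 'jane_doe', 'invoice_id': '98765'}
```

<p>The same pattern maps directly onto any OpenTelemetry SDK: the first value becomes the span name, the second becomes its attributes.</p>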
<h2 id="heading-learning-from-semantic-conventions">Learning from Semantic Conventions</h2>
<p>This <code>{verb} {object}</code> approach isn't arbitrary. It's a best practice that
reflects the principles behind the official <strong>OpenTelemetry Semantic Conventions
(SemConv)</strong>. SemConv provides a standardized set of names for common operations,
ensuring that a span for an HTTP request is named consistently, regardless of
the language or framework.</p>
<p>When you look closely, you'll see this same pattern of describing an operation
on a resource echoed throughout the conventions. By following it for your custom
spans, you are aligning with the established philosophy of the entire
OpenTelemetry ecosystem.</p>
<p>Let's look at a few examples from SemConv.</p>
<h3 id="heading-http-spans">HTTP spans</h3>
<p>For server-side HTTP spans, the convention is <code>{method} {route}</code>.</p>
<ul>
<li><strong>Example:</strong> <code>GET /api/users/:id</code></li>
<li><strong>Analysis:</strong> This is a verb (<code>GET</code>) acting on an object (<code>/api/users/:id</code>).
The use of a route template instead of the actual path (<code>/api/users/123</code>) is a
perfect example of maintaining low cardinality.</li>
</ul>
<h3 id="heading-database-spans">Database spans</h3>
<p>Database spans are often named <code>{db.operation} {db.name}.{db.sql.table}</code>.</p>
<ul>
<li><strong>Example:</strong> <code>INSERT my_database.users</code></li>
<li><strong>Analysis:</strong> This is a verb (<code>INSERT</code>) acting on an object
(<code>my_database.users</code>). The specific values being inserted are high-cardinality
and are rightly excluded from the name.</li>
</ul>
<h3 id="heading-rpc-spans">RPC spans</h3>
<p>For Remote Procedure Calls, the convention is <code>{rpc.service}/{rpc.method}</code>.</p>
<ul>
<li><strong>Example:</strong> <code>com.example.UserService/GetUser</code></li>
<li><strong>Analysis:</strong> While the format is different, the principle is the same. It
describes a method (<code>GetUser</code>), which is a verb, within a service
(<code>com.example.UserService</code>), which is the object or resource.</li>
</ul>
<p>The key takeaway is that by using <code>{verb} {object}</code>, you are speaking the same
language as the rest of your instrumentation.</p>
<h2 id="heading-cultivating-a-healthy-system">Cultivating a healthy system</h2>
<p>Naming spans is not a trivial task. It's a foundational practice for building a
robust and effective observability strategy. By adopting a clear, consistent
pattern like <code>{verb} {object}</code> for your business-specific spans, you can
transform your telemetry data from a tangled mess into a well-tended garden.</p>
<p>A well-named span is a gift to your future self and your team. It provides
clarity during stressful outages, enables powerful performance analysis, and
ultimately helps you build better, more reliable software.</p>
<p>In our next post in this series, we will dig into the next layer of detail:
<strong>span attributes</strong>. We'll explore how to add the rich, high-cardinality context
to your spans that is necessary for deep debugging, without compromising the
aggregability of your span names.</p>
]]></content:encoded></item><item><title><![CDATA[🌱 Cultivating Unique service.instance.id on NGINX Ingress with OpenTelemetry]]></title><description><![CDATA[A task to set a unique service.instance.id on the NGINX Ingress Controller using OpenTelemetry should be simple, right? But as it turned out, the ingress controller doesn't expose all of NGINX’s OTel knobs out of the box, so I had to roll my own twea...]]></description><link>https://blog.olly.garden/cultivating-unique-serviceinstanceid-on-nginx-ingress-with-opentelemetry</link><guid isPermaLink="true">https://blog.olly.garden/cultivating-unique-serviceinstanceid-on-nginx-ingress-with-opentelemetry</guid><category><![CDATA[#observability #monitoring #DevOps #tools #softwaredevelopment #infrastructure #performance #metrics #logging #troubleshooting]]></category><dc:creator><![CDATA[Yuri Oliveira Sá]]></dc:creator><pubDate>Thu, 31 Jul 2025 06:00:06 GMT</pubDate><content:encoded><![CDATA[<p>A task to set a unique <code>service.instance.id</code> on the NGINX Ingress Controller using OpenTelemetry should be simple, right? But as it turned out, the ingress controller doesn't expose all of NGINX’s OTel knobs out of the box, so I had to roll my own tweak garden.</p>
<h3 id="heading-why-should-i-care">Why should I care?</h3>
<p>Without unique instance IDs, your tracing data looks like a tangled tangleweed: hard to trace, difficult to debug, and completely at odds with what we’re trying to achieve with observability.</p>
<h2 id="heading-the-solution">The solution</h2>
<h3 id="heading-part-1-nginx-ingress-controller-helm-chart">Part 1 - Nginx Ingress Controller - Helm Chart</h3>
<p>To get <code>POD_UID</code> injected into each trace, here’s the minimal yet powerful config snippet I landed on:</p>
<ul>
<li>Set the <code>POD_UID</code> environment variable, since the NGINX Ingress Controller sets only <code>POD_NAME</code> and <code>POD_NAMESPACE</code> by default.</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-attr">extraEnvs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">POD_UID</span>
    <span class="hljs-attr">valueFrom:</span>
      <span class="hljs-attr">fieldRef:</span>
         <span class="hljs-attr">fieldPath:</span> <span class="hljs-string">metadata.uid</span>
</code></pre>
<ul>
<li>Pass <code>POD_UID</code> into NGINX and set it as a span attribute on each request.</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-attr">controller:</span>
  <span class="hljs-attr">config:</span>
    <span class="hljs-attr">main-snippet:</span> <span class="hljs-string">|
      env POD_UID;
</span>    <span class="hljs-attr">server-snippet:</span> <span class="hljs-string">|
      set $pod_uid "unknown";
      access_by_lua_block {
        ngx.var.pod_uid = os.getenv("POD_UID") or "unknown"
      }
      opentelemetry_attribute service.instance.id $pod_uid;</span>
</code></pre>
<h3 id="heading-part-2-opentelemetry-collector-config">Part 2 - OpenTelemetry Collector Config</h3>
<ul>
<li>Configure the <a target="_blank" href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor">transform</a> processor to set the resource-attribute. </li>
</ul>
<pre><code class="lang-yaml">    <span class="hljs-attr">processors:</span>
      <span class="hljs-attr">transform:</span>
        <span class="hljs-attr">error_mode:</span> <span class="hljs-string">ignore</span>
        <span class="hljs-attr">trace_statements:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">set(resource.attributes["service.instance.id"],</span> <span class="hljs-string">span.attributes["service.instance.id"])</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">delete_key(span.attributes,</span> <span class="hljs-string">"service.instance.id"</span><span class="hljs-string">)</span>
</code></pre>
<h3 id="heading-how-it-works">How it works:</h3>
<ul>
<li>Explicitly exposes the <code>POD_UID</code> environment variable so NGINX can see it.</li>
<li>Initializes a default <code>$pod_uid</code>—my fail-safe in case the env var goes missing.</li>
<li>Uses Lua to pull in the real <code>POD_UID</code>.</li>
<li>Sets the OpenTelemetry attribute <code>service.instance.id</code> to match the actual UID.</li>
<li>Finally, the OpenTelemetry Collector captures <code>service.instance.id</code> from the span attributes and promotes it to a resource attribute.</li>
</ul>
<h3 id="heading-benefits-to-your-telemetry">Benefits to your telemetry:</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Without unique <code>service.instance.id</code></td><td>With unique <code>service.instance.id</code></td></tr>
</thead>
<tbody>
<tr>
<td>Pods share the same instance ID, causing confusion in telemetry.</td><td>Each pod is individually identifiable.</td></tr>
<tr>
<td>Difficult to isolate errors and debug effectively.</td><td>Allows precise tracing to individual pods.</td></tr>
<tr>
<td>Limited visibility into pod-specific issues.</td><td>Enables accurate root-cause analysis.</td></tr>
</tbody>
</table>
</div><h3 id="heading-final-result">Final result</h3>
<p>By implementing this solution, each NGINX ingress pod is clearly distinguishable in tracing data. This improves observability significantly by providing accurate, pod-specific telemetry, facilitating precise troubleshooting and diagnostics.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing OllyGarden]]></title><description><![CDATA[In the rush for visibility, many organizations find themselves lost in an overgrown jungle of data. Teams generate a constant stream of telemetry, hoping it will sprout into useful insights. Instead, they often end up with "bad telemetry"—data that i...]]></description><link>https://blog.olly.garden/introducing-ollygarden</link><guid isPermaLink="true">https://blog.olly.garden/introducing-ollygarden</guid><category><![CDATA[#observability #monitoring #DevOps #tools #softwaredevelopment #infrastructure #performance #metrics #logging #troubleshooting]]></category><dc:creator><![CDATA[OllyGarden]]></dc:creator><pubDate>Wed, 09 Jul 2025 07:00:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1751918339757/6a8ca332-0975-4bd1-ba76-8f65f1a83099.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the rush for visibility, many organizations find themselves lost in an overgrown jungle of data. Teams generate a constant stream of telemetry, hoping it will sprout into useful insights. Instead, they often end up with "bad telemetry"—data that is noisy, irrelevant, or incomplete, driving up costs and obscuring the very answers they seek.</p>
<p>Today, we're coming out of stealth to introduce <a target="_blank" href="https://olly.garden/">OllyGarden</a>, a new company dedicated to tending this garden of observability. Backed by a <strong>pre-seed round</strong> led by <a target="_blank" href="https://www.dig.ventures/">DIG Ventures</a>, with investments from observability leaders like <a target="_blank" href="https://www.datadoghq.com/">Datadog Ventures</a>, <a target="_blank" href="https://grafana.com/">Grafana Labs</a>, and <a target="_blank" href="https://www.dash0.com/">Dash0</a>, as well as special angels with deep knowledge in our space, like <a target="_blank" href="https://www.linkedin.com/in/batuhanuslu/">Batuhan Uslu</a>, <a target="_blank" href="https://www.linkedin.com/in/bensigelman/">Ben Sigelman</a>, <a target="_blank" href="https://www.linkedin.com/in/caniszczyk/">Chris Aniszczyk</a>, and <a target="_blank" href="https://www.linkedin.com/in/irlivingstone/">Ian Livingstone</a>, we're ready to get to work.</p>
<p>Our mission is simple: to improve the efficiency of telemetry pipelines. We believe the first and most crucial step is to help companies understand and optimize the telemetry they are already generating.</p>
<h2 id="heading-the-weeds-were-tackling-real-world-telemetry-pains"><strong>The Weeds We’re Tackling: Real-World Telemetry Pains</strong></h2>
<p>The problem of bad telemetry isn't theoretical; it's a daily struggle for engineering teams everywhere. During our research, we heard the same stories again and again. These might sound familiar to you:</p>
<ul>
<li><p><strong>Runaway Costs:</strong> An engineer at a company based in Berlin mentioned that a single high-cardinality metric was generating over <strong>$20,000 worth of telemetry per month</strong> without clear value. After reducing the cardinality, that cost dropped to $2,000. How many such metrics are hiding in your pipelines, undetected?</p>
</li>
<li><p><strong>Broken Insights &amp; Poor User Experience:</strong> What happens when telemetry fails? An engineer working at a company in Australia mentioned that their support team uses distributed tracing to debug user-reported problems, but incomplete traces limit their ability to help their users, directly impacting customer satisfaction.</p>
</li>
<li><p><strong>The "Instrument Everything" Trap:</strong> Auto-instrumentation tools are powerful, but if left unconfigured they can gather far more information than necessary. This can generate extremely high volumes of data, overloading systems before you even get a chance to analyze the data.</p>
</li>
<li><p><strong>Vast, Unchecked Data Volumes:</strong> Another company shared that they generate about <strong>3 Petabytes of uncompressed telemetry per month</strong>, acknowledging that among that trove of data, "a lot of it is bad telemetry". With industry experts estimating that up to 90% of telemetry data goes unused, imagine how many CPU cycles, collector instances, and how much egress traffic could be spared if that bad telemetry wasn't generated in the first place.</p>
</li>
</ul>
<h2 id="heading-our-first-step-ollygarden-insights-amp-the-instrumentation-score"><strong>Our First Step: OllyGarden Insights &amp; the Instrumentation Score</strong></h2>
<p>You can't improve what you can't measure. For too long, evaluating telemetry quality has been a subjective exercise based on gut feelings and tribal knowledge. We're tackling this problem by giving observability engineers superpowers, allowing them to see exactly how good the telemetry is inside their pipelines.</p>
<p>Our first product, <strong>OllyGarden Insights</strong>, analyzes your telemetry streams to give you deep insights into your data quality. It assesses your instrumentation against best practices, identifies services that are over- or under-instrumented, and helps you make informed, data-driven decisions about what to change.</p>
<p>To provide a common vocabulary for this, we launched the <strong>Instrumentation Score</strong>, a standardized value to objectively assess OpenTelemetry instrumentation. We believe such a fundamental metric for telemetry health shouldn't be proprietary. That’s why we initiated the Instrumentation Score as an <strong>open-source effort</strong>, with an open governance model and support from partners like Dash0, Datadog, Grafana Labs, Honeycomb, New Relic, and Splunk. This is our first major contribution back to the community we care so much about, and a testament to how we plan to operate.</p>
<h2 id="heading-why-ollygarden-our-roots-in-opentelemetry"><strong>Why OllyGarden? Our Roots in OpenTelemetry</strong></h2>
<p>We know the challenges of telemetry because we've been helping build the solutions for years. OllyGarden was founded by OpenTelemetry veteran Juraci Paixão Kröhling and SRE expert Yuri Oliveira Sá. As a member of the OpenTelemetry Governance Committee, creator of the OpenTelemetry Operator and OpenTelemetry Collector Builder (ocb), and one of the project's top contributors, Juraci has been on the front lines of the observability revolution.</p>
<p>This deep experience shapes our core philosophy:</p>
<ul>
<li><p><strong>A Vendor-Neutral Approach:</strong> Our business is to make your telemetry more efficient. We are a neutral partner you can trust, acting as a complement to observability backends by helping their customers send them higher-quality data. We succeed when you gain clarity, regardless of where your data is stored.</p>
</li>
<li><p><strong>The Start of Our Journey:</strong> OllyGarden Insights is our first major step. We are focused on delivering immediate value by helping you understand and improve the telemetry you already have. We are just at the beginning of our journey and are incredibly excited about the future, but our commitment today is clear: to empower engineers with the backend-neutral tools they need to cultivate clarity from their data.</p>
</li>
</ul>
<h2 id="heading-start-cultivating-better-telemetry-today"><strong>Start Cultivating Better Telemetry Today</strong></h2>
<p>It's time to move beyond the jungle of bad telemetry and start purposefully cultivating a garden of clear, actionable insights.</p>
<p>We are now engaging with <strong>early users</strong>. By joining us, you will not only see the power of purposeful instrumentation firsthand but also have the unique opportunity to help shape our product and secure special early-bird pricing. We'll analyze a sample of your telemetry and provide you with your <strong>Instrumentation Score</strong> and actionable insights for improvement.</p>
<p>Ready to see what's really growing in your telemetry pipelines? Contact us at <a target="_blank" href="mailto:contact@olly.garden"><strong>contact@olly.garden</strong></a> or visit our website to learn more.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing the Instrumentation Score]]></title><description><![CDATA[Telemetry data is the foundation of our observability. We gather metrics, traces, and logs, aiming to cultivate a clear understanding of application health. Yet, a persistent question often arises: "Is our telemetry actually good?" How do we distingu...]]></description><link>https://blog.olly.garden/instrumentation-score</link><guid isPermaLink="true">https://blog.olly.garden/instrumentation-score</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[instrumentation]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Wed, 11 Jun 2025 04:09:52 GMT</pubDate><content:encoded><![CDATA[<p>Telemetry data is the foundation of our observability. We gather metrics, traces, and logs, aiming to cultivate a clear understanding of application health. Yet, a persistent question often arises: "Is our telemetry actually <em>good</em>?" How do we distinguish valuable insights from data that merely consumes resources?</p>
<p>For too long, evaluating instrumentation effectiveness has been a subjective exercise. We've lacked a common language or a standard measure to truly understand if our telemetry is enriching our insights or just overgrowing the plot. At <a target="_blank" href="https://olly.garden/">OllyGarden</a>, we recognize this challenge and are building a product to give super powers to observability engineers. We heard from them that our Instrumentation Score was something special. And we believe that <a target="_blank" href="https://www.youtube.com/watch?v=vhYMRtqvMg8&amp;t=118s">if you want something really special, you share it</a>.</p>
<p>Today, OllyGarden introduces its first major contribution to the observability ecosystem: the <strong>Instrumentation Score</strong>.</p>
<p>The Instrumentation Score is a standardized, numerical value designed to objectively assess the consistency and effectiveness of <a target="_blank" href="https://opentelemetry.io/">OpenTelemetry</a> instrumentation. It analyzes <a target="_blank" href="https://opentelemetry.io/docs/specs/otlp/">OTLP</a> (OpenTelemetry Protocol) data streams against a predefined set of rules rooted in OpenTelemetry best practices and <a target="_blank" href="https://opentelemetry.io/docs/specs/semconv/">semantic conventions</a>. It’s a health check for your telemetry, providing a clear, actionable measure of its quality.</p>
<p>As <a target="_blank" href="https://www.linkedin.com/in/jpkroehling/">Juraci Paixão Kröhling</a>, Co-founder at OllyGarden, states:</p>
<blockquote>
<p>"As an OpenTelemetry contributor and enthusiast, I've seen firsthand the project's power to democratize instrumentation. Yet, a persistent question has always been: 'Are we generating good telemetry?' Too often, the answer is unclear, leading to missed insights or wasted resources. The Instrumentation Score, an initiative we're launching from OllyGarden, aims to provide that clarity. It's about establishing a common, actionable language for telemetry quality, built on OpenTelemetry principles, to empower every engineer and organization to confidently improve their observability practices and truly harness the value of their data."</p>
</blockquote>
<p>The Instrumentation Score provides a common vocabulary for discussing instrumentation effectiveness. For <strong>engineers and SREs</strong>, it offers actionable guidance, highlighting where instrumentation can be improved. For <strong>CTOs and technology leaders</strong>, the strategic value includes improved ROI on observability by focusing on <a target="_blank" href="https://blog.olly.garden/purposeful-instrumentation">purposeful telemetry</a> and reducing <a target="_blank" href="https://blog.olly.garden/theres-a-lot-of-bad-telemetry-out-there">bad telemetry</a>.</p>
<p>OllyGarden is committed to OpenTelemetry and the open-source ecosystem. The Instrumentation Score leverages OpenTelemetry Semantic Conventions and analyzes OTLP data. Crucially, we've ensured this initiative is not only open source but is being developed with an open governance model, with support or contributions from key industry players like <a target="_blank" href="https://www.dash0.com/">Dash0</a>, <a target="_blank" href="https://newrelic.com">New Relic</a>, <a target="_blank" href="https://splunk.com/">Splunk</a>, <a target="_blank" href="https://datadog.com/">Datadog</a>, and <a target="_blank" href="https://grafana.com/">Grafana Labs</a>. The specification is an open-source effort, and we are actively requesting observability engineers to contribute with their own rules by opening a pull request against the GitHub repository. Our goal is a collaborative evolution, with the Instrumentation Score eventually finding a home within a neutral foundation.</p>
<p>The introduction of the Instrumentation Score is a step towards a future where organizations can confidently understand and improve their telemetry.</p>
<p>We invite you to learn more and get involved:</p>
<ul>
<li><p>Explore the <strong>Instrumentation Score landing page</strong>: <a target="_blank" href="https://score.olly.garden/">https://score.olly.garden/</a></p>
</li>
<li><p>Review the <strong>specification and contribute on GitHub</strong>: <a target="_blank" href="https://github.com/instrumentation-score/">https://github.com/instrumentation-score/</a></p>
</li>
</ul>
<p>OllyGarden aims to improve the efficiency of telemetry pipelines. The Instrumentation Score is the first seed we’re planting, hoping that together, we can help everyone grow a more effective observability practice.</p>
]]></content:encoded></item><item><title><![CDATA[Concrete Applications of Purposeful Instrumentation]]></title><description><![CDATA[In our Purposeful Instrumentation blog post, we laid the groundwork for a more disciplined approach to observability. We argued that the goal isn't to merely amass data, but to cultivate high-quality telemetry signals – focusing on quality over quant...]]></description><link>https://blog.olly.garden/concrete-applications-of-purposeful-instrumentation</link><guid isPermaLink="true">https://blog.olly.garden/concrete-applications-of-purposeful-instrumentation</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[telemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Wed, 04 Jun 2025 22:00:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1749055629118/11dea6ef-f2d6-400e-9e31-6a85b0c64e96.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In our <a target="_blank" href="https://blog.olly.garden/purposeful-instrumentation">Purposeful Instrumentation</a> blog post, we laid the groundwork for a more disciplined approach to observability. We argued that the goal isn't to merely amass data, but to <strong>cultivate high-quality telemetry signals</strong> – focusing on quality over quantity. The aim is to transform our experience during high-pressure incidents from frantically searching through a "dense thicket of irrelevant data" to confidently navigating a "well-lit path to the root cause."</p>
<p>Many of us have experienced the pitfalls of the "instrument everything" mantra. While well-intentioned, it often leads to an "overgrown jungle of telemetry data," where critical signals are drowned out by noise. Purposeful instrumentation, in contrast, encourages us to strategically gather the <em>right</em> data. This isn't just about digital decluttering; it yields tangible benefits: <strong>reduced noise, faster troubleshooting, and improved clarity and maintainability</strong> in our systems.</p>
<p>This post moves from philosophy to practice. We'll dive into concrete examples and techniques, showcasing how to apply purposeful instrumentation in real-world scenarios—from initial telemetry design to ongoing pipeline adjustments and even code-level optimizations.</p>
<h2 id="heading-designing-telemetry-with-nasas-rigor"><strong>Designing Telemetry with NASA's Rigor</strong></h2>
<p>When we think about systems operating under the most severe limitations, spacecraft telemetry, particularly from missions like NASA's Mars rovers, offers <a target="_blank" href="https://www-robotics.jpl.nasa.gov/media/documents/Flight_Software_Case_Study_Spacecraft_Telemetry.pdf">profound inspiration</a>. The extreme constraints of space exploration—limited bandwidth, power, and processing capabilities—force engineers to meticulously justify and optimize every single bit of data transmitted. For observability engineers on Earth, even without such stark limitations, these practices offer invaluable lessons in <strong>cultivating efficiency</strong>.</p>
<p>Here are some key takeaways:</p>
<ul>
<li><p><strong>Data Type Optimization</strong>: Spacecraft systems often convert 64-bit floating-point numbers to 32-bit or even 16-bit integers. Sometimes, scaled integers (like centi-degrees Celsius) are used to preserve essential precision while drastically reducing data volume. For our enterprise systems, this prompts a critical question: Do we <em>really</em> need microsecond precision for every timer, or would seconds suffice for certain metrics, thereby reducing storage and processing overhead?</p>
</li>
<li><p><strong>Bit Packing and Enumerated Types</strong>: To save space, boolean flags and enumerated values with a limited set of states are often packed into smaller integer types on spacecraft. For example, 15 distinct safety checks might be encoded into a single 16-bit integer. This principle is directly applicable to software telemetry, particularly in how we design attributes to <strong>reduce cardinality</strong> and data volume. Instead of verbose string representations for statuses, can an enumerated integer suffice?</p>
</li>
<li><p><strong>Configurable Data Collection</strong>: Spacecraft aren't static in their data collection. They possess "knobs" that allow operators to increase data verbosity for anomaly investigations, switching between "Brief records" for nominal operations and "Verbose records" when digging deeper. This mirrors the need in our systems for dynamic control over telemetry, perhaps adjusting log levels or sampling rates based on operational context rather than maintaining a constant, high-volume stream.</p>
</li>
<li><p><strong>Summary Data and Compression</strong>: Reporting small, high-level summary data packets independently from detailed diagnostic data products allows for quick operational decision-making. If summaries are nominal, large, detailed data products might even be discarded to save precious bandwidth. Lossless compression is also a standard practice, always balancing the CPU cost of compression/decompression against bandwidth savings.</p>
</li>
<li><p><strong>The "Very Small Products" Problem</strong>: Interestingly, generating a multitude of tiny data products can be inefficient, consuming storage slots and impacting system performance, as was observed with the Mars 2020 rover's packetizer. This highlights the importance of <strong>batching and aggregation</strong> not just for network efficiency but also for processing and storage optimization within our telemetry pipelines. The OpenTelemetry Collector’s batch processor is a prime example of applying this principle.</p>
</li>
</ul>
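<p>The first two techniques translate directly into ordinary application code. Here is a minimal, hypothetical sketch of scaled integers and bit packing (the field layout and helper names are invented for illustration):</p>

```python
def encode_temperature(celsius):
    """Scaled integer: centi-degrees Celsius fit in 16 bits instead of a 64-bit float."""
    return round(celsius * 100)           # 23.47 C -> 2347

def decode_temperature(raw):
    return raw / 100

def pack_flags(flags):
    """Pack up to 16 boolean checks into a single 16-bit integer."""
    value = 0
    for i, flag in enumerate(flags):
        if flag:
            value |= 1 << i               # one bit per check
    return value

def flag_is_set(value, index):
    return bool(value >> index & 1)

packed = pack_flags([True, False, True])  # bits 0 and 2 set -> 5
print(packed)
print(decode_temperature(encode_temperature(23.47)))
```

<p>Whether this kind of packing is worth the decode complexity depends on your pipeline; for most backend services, the bigger win is simply asking whether the extra precision is needed at all.</p>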
<p>These extreme examples from NASA underscore a fundamental discipline: diligently asking, "What data do I <em>really</em> need?" and "What is the cost versus the value?" This scrutiny is crucial for building sustainable and effective telemetry strategies, ensuring we're not just collecting data, but harvesting actionable insights.</p>
<h2 id="heading-tuning-automatic-instrumentation-for-precision-with-opentelemetry"><strong>Tuning Automatic Instrumentation for Precision with OpenTelemetry</strong></h2>
<p>OpenTelemetry's <a target="_blank" href="https://opentelemetry.io/docs/zero-code/">auto-instrumentation</a> agents are a massive boon, offering broad telemetry coverage for popular libraries and frameworks with minimal upfront effort. It’s tempting to see this as "zero code, zero thought." However, this convenience doesn't absolve us from the need for <strong>purposeful configuration</strong>. Blindly enabling instrumentation for every conceivable library can quickly lead back to that "overgrown jungle of telemetry data," swamping your systems with noise and incurring unnecessary costs.</p>
<ul>
<li><p><strong>Review Default Configurations</strong>: Auto-instrumentation defaults are often tuned for maximum coverage, which might not align with your specific observability goals or the critical paths of your application. As <a target="_blank" href="https://youtu.be/QzStkLbA7Qk">Elena Kovalenko of Delivery Hero noted</a>, unconfigured auto-instrumentation can generate extremely high cardinality and massive data volumes, potentially overloading collectors and backend systems. It’s vital to treat the default settings as a starting point, not a final destination.</p>
</li>
<li><p><strong>Selectively Disable Unnecessary Instrumentation</strong>: Most OpenTelemetry auto-instrumentation agents allow for fine-grained control, enabling you to disable instrumentation for components that are irrelevant to your critical diagnostic paths or those known to produce excessive, low-value data.</p>
<ul>
<li><strong>Concrete Example: Suppressing JDBC Telemetry</strong>: If your primary diagnostic focus is at the service interaction level, the verbose telemetry generated by JDBC instrumentation (tracing every database call) might be more noise than signal. With the OpenTelemetry Java agent, for instance, you can easily <a target="_blank" href="https://opentelemetry.io/docs/zero-code/java/agent/disable/">disable this by setting</a> the environment variable <code>OTEL_INSTRUMENTATION_JDBC_ENABLED=false</code>. This targeted <strong>pruning</strong> ensures that resources aren't wasted collecting, processing, and storing data that doesn't contribute significantly to your understanding of system health.</li>
</ul>
</li>
</ul>
<p>Auto-instrumentation plants the seeds of visibility; purposeful configuration helps you cultivate the desired crop, ensuring a healthy yield of actionable insights rather than a field of weeds.</p>
<h2 id="heading-optimizing-data-flow-with-the-opentelemetry-collector-pipeline-adjustments"><strong>Optimizing Data Flow with the OpenTelemetry Collector: Pipeline Adjustments</strong></h2>
<p>The OpenTelemetry Collector is more than just a telemetry forwarder; it's a powerful, vendor-agnostic control plane. It’s a great place to implement purposeful telemetry strategies by filtering, sampling, enriching, and transforming data <em>before</em> it even reaches your observability backends. Let's look at how sophisticated organizations are leveraging the Collector.</p>
<h3 id="heading-ebays-journey-scaling-distributed-tracing-with-cost-optimization"><strong>eBay's Journey: Scaling Distributed Tracing with Cost Optimization</strong></h3>
<p><a target="_blank" href="https://youtu.be/qq8hTct8zm4?si=t5pnl-tyV9-WQ6VP">Handling telemetry at eBay's scale</a>—ingesting 6.5 million spans per second—necessitates highly judicious instrumentation and aggressive optimization. They faced challenges with broken call chains due to context propagation issues and the difficulty of applying uniform sampling across APIs with vastly different traffic volumes.</p>
<p>Their approach to sampling evolved:</p>
<ol>
<li><p><strong>Initial Strategy</strong>: They started with head sampling at the client (e.g., 2% of requests) combined with parent-based sampling to ensure entire traces were captured if any part was sampled.</p>
</li>
<li><p><strong>Adding Tail Sampling</strong>: After that, they employed a tail-sampling strategy to retain "interesting" traces—those with errors, high latency, or specific critical attributes—along with a baseline 1% of successful traces, storing these for 14 days. This allowed them to focus retention on the most valuable diagnostic data.</p>
</li>
<li><p><strong>Evolving Tail Sampling with OTel Collector</strong>: Recognizing the significant memory and complexity challenges of performing in-memory tail sampling within the OpenTelemetry Collector for long-duration traces or requests spanning multiple clusters, eBay pivoted. They now leverage <strong>exemplars from metrics</strong> to identify traces of interest. These traces are then copied from a raw trace table to a sampled table after a 10-15 minute delay. This innovative, storage-based tail sampling approach demonstrates a mature balance between comprehensive diagnostic capability and cost control.</p>
</li>
</ol>
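<p>For reference, the first two stages map naturally onto the Collector's <code>tail_sampling</code> processor. The sketch below is illustrative (the thresholds are invented, not eBay's actual configuration); policies are OR-ed, so a trace kept by any policy survives:</p>

```yaml
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 2000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
```

<p>As eBay found, this in-memory approach requires all spans of a trace to reach the same Collector instance within the decision window, which is exactly the limitation that pushed them toward storage-based tail sampling.</p>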
<h3 id="heading-tomtoms-centralized-control-enforcing-governance-and-flexibility"><strong>TomTom's Centralized Control: Enforcing Governance and Flexibility</strong></h3>
<p><a target="_blank" href="https://engineering.tomtom.com/opentelemetry-simpifying-observability/">TomTom implemented</a> a <strong>centralized OpenTelemetry Collector Service</strong> that acts as a gateway between their internal applications and various SaaS observability platforms. This central hub provides several advantages:</p>
<ul>
<li><p><strong>Governance and Standardization</strong>: It allows them to enforce authentication, manage general configurations like batching and encryption consistently, and, crucially, handle <strong>data enrichment and manipulation</strong> centrally.</p>
</li>
<li><p><strong>Filtering and PII Redaction</strong>: They use the <code>filter</code> processor to drop noisy or irrelevant logs (e.g., from specific Kubernetes namespaces). For sensitive data, a combination of the <code>transform</code> and <code>attributes</code> processors is used to redact Personally Identifiable Information (PII) before telemetry leaves their trust boundary.</p>
</li>
<li><p><strong>Telemetry Enrichment</strong>: Data is enriched with valuable metadata, such as an "owner" label, which provides better context during troubleshooting and improves accountability.</p>
</li>
<li><p><strong>Strategic Benefits</strong>: This centralized model offers flexibility in switching telemetry backends, enforces data governance policies, and has proven critical for cost control and maintaining data quality at an enterprise scale.</p>
</li>
</ul>
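<p>A hedged sketch of what such gateway processing can look like (the namespace, regex, and owner value are invented; the <code>filter</code>, <code>transform</code>, and <code>attributes</code> processors are the actual Collector components involved):</p>

```yaml
processors:
  filter/drop-noisy:
    logs:
      log_record:
        - 'resource.attributes["k8s.namespace.name"] == "dev-sandbox"'
  transform/redact-pii:
    log_statements:
      - context: log
        statements:
          - replace_pattern(body, "\\d{3}-\\d{2}-\\d{4}", "[REDACTED]")
  attributes/enrich:
    actions:
      - key: owner
        value: team-maps
        action: upsert
```

<p>In the <code>filter</code> processor, records matching a condition are dropped; the <code>attributes</code> processor's <code>upsert</code> action adds the owner label centrally, so individual teams don't have to remember to do it.</p>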
<p>These real-world examples illustrate the power of the OpenTelemetry Collector as a central point for <strong>cultivating telemetry quality</strong>.</p>
<h2 id="heading-crafting-intentional-manual-instrumentation-ollygardens-example"><strong>Crafting Intentional Manual Instrumentation: OllyGarden’s Example</strong></h2>
<p>While auto-instrumentation provides breadth, manual instrumentation offers depth and precision. But even here, more isn't always better. A common pitfall is "over-spanning": creating an excessive number of highly granular spans for minor, sequential internal operations within a single logical unit of work. This can obscure the true flow of a request, add unnecessary overhead, and make traces harder to interpret—akin to "wandering aimlessly in the woods" instead of following a clear path. For example, a single logical <code>onTraces</code> operation might be fragmented into several child spans for <code>processResourceSpans</code>, cluttering the trace view and inflating span counts unnecessarily.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749028779417/6c876cb5-54e0-4774-8546-0780f772c774.png" alt class="image--center mx-auto" /></p>
<p>Here’s the original Go code we wrote and landed in production:</p>
<pre><code class="lang-go">    ctx, span := telemetry.Tracer().Start(ctx, <span class="hljs-string">"tendril.processResourceSpans"</span>)
    <span class="hljs-keyword">defer</span> span.End()

    <span class="hljs-comment">// Extract service information from resource</span>
    svcName := getResourceString(rs.Resource(), attrServiceName)
    span.SetAttributes(attribute.String(<span class="hljs-string">"service.name"</span>, svcName))

    svcVersion := getResourceString(rs.Resource(), attrServiceVersion)
    span.SetAttributes(attribute.String(<span class="hljs-string">"service.version"</span>, svcVersion))

    svcEnv := getResourceString(rs.Resource(), attrEnvironmentName)
    span.SetAttributes(attribute.String(<span class="hljs-string">"deployment.environment.name"</span>, svcEnv))
</code></pre>
<p><strong>The Purposeful Solution: Leveraging Span Events</strong></p>
<p>Instead of creating distinct child spans for every micro-step, it's often far more effective to <strong>consolidate these internal milestones as span events within a single, overarching span</strong> that represents the larger logical operation. This aligns with the core principle of choosing the "most effective signal type for your defined purpose." Logs provide detailed context for discrete occurrences, metrics track aggregatable trends, and traces show flow; span events offer a way to add rich, contextual markers to a span without creating new ones.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749028851145/a7995198-ff05-4582-abbb-e0fcb85ea1e1.png" alt class="image--center mx-auto" /></p>
<p>And here’s the code after the fine-tuning:</p>
<pre><code class="lang-go">    span := trace.SpanFromContext(ctx)

    <span class="hljs-comment">// Extract service information from resource</span>
    svcName := getResourceString(rs.Resource(), attrServiceName)
    svcVersion := getResourceString(rs.Resource(), attrServiceVersion)
    svcEnv := getResourceString(rs.Resource(), attrEnvironmentName)

    span.AddEvent(<span class="hljs-string">"processing resource spans for service"</span>, trace.WithAttributes(
        attribute.String(<span class="hljs-string">"service.name"</span>, svcName),
        attribute.String(<span class="hljs-string">"service.version"</span>, svcVersion),
        attribute.String(<span class="hljs-string">"deployment.environment.name"</span>, svcEnv),
    ))
</code></pre>
<p><strong>Benefits of Using Span Events Over Excessive Child Spans:</strong></p>
<ul>
<li><p><strong>Clearer Trace Representation</strong>: A single span with well-defined events provides a <strong>cleaner, more focused view</strong> of a component's internal workings within the context of the larger trace. This gives a "well-lit path" to understanding that component's behavior.</p>
</li>
<li><p><strong>Reduced Overhead and Cost</strong>: Span events are generally lighter-weight than full spans. This translates to <strong>reduced data volume</strong> and consequently <strong>lower processing and storage costs</strong> in your observability backend.</p>
</li>
<li><p><strong>Enhanced Context</strong>: Events, with their associated attributes, allow you to capture crucial details (e.g., input size, processing duration for a specific sub-task, success/failure flags) at precise points within the operation, without fragmenting the trace into many tiny pieces.</p>
</li>
</ul>
<h2 id="heading-conclusion-towards-insightful-and-economical-observability"><strong>Conclusion: Towards Insightful and Economical Observability</strong></h2>
<p>Moving from indiscriminate data collection to <strong>purposeful software telemetry</strong> is more than an engineering exercise; it's a strategic imperative. It ensures that our substantial investments in observability deliver tangible business value—faster incident resolution, optimized performance, and controlled costs—rather than just overwhelming data lakes.</p>
<p>This journey of <strong>continuous cultivation</strong> is not a one-off task. It requires ongoing review, governance, and a feedback loop where insights from incidents, performance anomalies, and cost reports are fed back into your instrumentation design and data pipeline policies. As your systems evolve, so too must your telemetry strategy.</p>
<p>The guiding questions we discussed in our previous post remain your most valuable tools:</p>
<ul>
<li><p>"What question are we trying to answer with this data?"</p>
</li>
<li><p>"What data do we <em>truly</em> need, and at what precision?"</p>
</li>
<li><p>"Why <em>this specific signal type</em> (metric, log, trace, event)?"</p>
</li>
<li><p>"How will this data actually be <em>used</em> and by whom?"</p>
</li>
<li><p>"And critically, what is its ongoing <em>cost versus its value</em>?"</p>
</li>
</ul>
<p>By consistently applying this critical lens, engineering teams can cultivate an observability practice that is not only powerful and insightful but also sustainable and economically sound. This deliberate, adaptive, and insight-driven approach is the future of effective software observability. OllyGarden is committed to being a neutral and valuable partner in this ecosystem, helping you analyze, optimize, and manage your OpenTelemetry pipelines to harvest the richest insights efficiently.</p>
]]></content:encoded></item><item><title><![CDATA[Purposeful Instrumentation]]></title><description><![CDATA[It’s the middle of the night. An alert jolts you awake – a critical service is sputtering. Your mind races as you dive into a labyrinth of dashboards, logs, and traces. Are you navigating a well-lit path to the root cause, or are you lost in a dense ...]]></description><link>https://blog.olly.garden/purposeful-instrumentation</link><guid isPermaLink="true">https://blog.olly.garden/purposeful-instrumentation</guid><category><![CDATA[instrumentation]]></category><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Thu, 01 May 2025 08:01:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746086306894/23a8b470-5d3d-4e77-865c-79f74ef0450d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It’s the middle of the night. An alert jolts you awake – a critical service is sputtering. Your mind races as you dive into a labyrinth of dashboards, logs, and traces. Are you navigating a well-lit path to the root cause, or are you lost in a dense thicket of irrelevant data? In these high-pressure moments, the quality – not just the quantity – of your observability instrumentation is what truly counts.</p>
<p>Many teams, in their quest for visibility, fall into the trap of "instrument everything." The intention is good, but the result is often an overgrown jungle of telemetry data: noisy metrics, verbose logs, and sprawling traces that obscure rather than illuminate, or perhaps a monoculture of one of those telemetry data types. This is where the practice of Purposeful Instrumentation comes in – a disciplined approach to cultivating high-quality observability signals. It's about moving beyond simply collecting data to strategically gathering the right data to understand system health, optimize performance, and troubleshoot effectively. Think of it as tending a garden: you don't just let everything grow wild; you carefully select, nurture, and prune to ensure a healthy and productive yield. It's fundamentally about quality over quantity, having the telemetry you need, without excess.</p>
<h2 id="heading-why-prune-the-noise">Why Prune the Noise?</h2>
<p>Adopting a purposeful approach isn't just about tidiness; it delivers tangible benefits that directly impact your team's effectiveness and your organization's bottom line.</p>
<ol>
<li><p><strong>Reduced Noise &amp; Increased Signal:</strong> Over-instrumentation creates a cacophony. Imagine trying to hear a single bird's song in the middle of a roaring stadium. Purposeful instrumentation acts like a filter, silencing the distracting roar and amplifying the signals that truly indicate system behavior and potential issues. You focus your resources on telemetry that provides genuine insight, making it easier to spot anomalies and trends.</p>
</li>
<li><p><strong>Faster Troubleshooting &amp; Resolution:</strong> When an incident occurs, time is critical. Sifting through irrelevant data wastes precious minutes, if not hours. With instrumentation designed to answer specific questions or diagnose common failure modes, you have targeted data trails leading you directly towards the problem's source. It’s the difference between wandering aimlessly in the woods and following a clearly marked trail.</p>
</li>
<li><p><strong>Significant Cost Optimization:</strong> Telemetry data isn't free. Storage, processing, and analysis all incur costs, which can escalate rapidly with high data volumes and cardinality. Instrumenting only what provides clear value ensures you're not paying to store noise. This optimizes your observability spend and demonstrably increases the return on your investment (ROI). Think of it as allocating water and fertilizer only to the plants you intend to grow.</p>
</li>
<li><p><strong>Improved Clarity &amp; Maintainability:</strong> Code cluttered with arbitrary instrumentation is harder to read, understand, and maintain. When instrumentation is added with clear intent, documented appropriately (even if informally via commit messages or code comments), it serves as a form of living documentation. Future engineers (including your future self!) can readily grasp <em>why</em> a particular metric, span, or log statement exists and how it contributes to understanding the system.</p>
</li>
</ol>
<h2 id="heading-guiding-questions-for-purposeful-instrumentation">Guiding Questions for Purposeful Instrumentation</h2>
<p>Before adding <em>any</em> new metric, span, span event, or log line, pause and cultivate intention by asking critical questions:</p>
<ul>
<li><p><strong>What question am I trying to answer?</strong> This is the cornerstone. Are you trying to understand latency distribution, error rates under specific conditions, resource consumption patterns, or the flow of requests across services? Defining the question sharpens the focus of your instrumentation. Don't aim to predict every <em>possible</em> future question, but consider the <em>types</em> of questions most likely to arise based on the service's function and history. What are the known failure modes or performance bottlenecks for this component?</p>
</li>
<li><p><strong>What data do I <em>really</em> need to answer this?</strong> Challenge the defaults. Do you need millisecond precision, or would seconds suffice? Do you need the full user ID (potentially creating high cardinality), or could you use a user <em>type</em> or a randomized cohort ID? Can data be aggregated at the source to reduce volume and cardinality? For example, instead of logging every request, could you use metrics instead with enough labels to distinguish between outcomes?</p>
</li>
<li><p><strong>Why <em>this</em> type of signal (Metric, Trace, Log)?</strong> Each signal type has strengths. Metrics are great for aggregatable trends and alerting (e.g., overall request rate). Traces excel at illustrating request flows and latency breakdowns across distributed systems. Logs provide detailed, event-specific context, especially for non-transaction data (configuration changes, connections/disconnections to databases, …). Are you choosing the most effective signal type for your defined purpose? Adding high-cardinality attributes to metrics intended for aggregation, for instance, is often an anti-pattern. Creating a span when a simple event on an existing span would suffice adds unnecessary overhead.</p>
</li>
<li><p><strong>How will this data actually be <em>used</em>?</strong> Will this feed a critical dashboard panel? Trigger an alert? Be used primarily for ad-hoc debugging during incidents? Understanding the consumption pattern helps determine the required granularity, retention, and format. How do you envision it being visualized or queried? Instrumenting data that no one knows how to use or interpret is like planting seeds you never intend to water. Again, don’t aim to predict exactly how things will be used, but having an idea helps set the direction.</p>
</li>
<li><p><strong>What is the <em>cost</em> versus the <em>value</em>?</strong> Consider the compute resources needed to generate the data, the network bandwidth to transmit it, and the storage/processing costs in your observability backend. Is the potential insight or troubleshooting value gained worth this ongoing cost? Regularly reassess this balance, especially for verbose or high-frequency telemetry.</p>
</li>
</ul>
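<p>The "aggregate at the source" idea from the second question can be sketched in a few lines of Go (the bucket fields are illustrative; a real implementation would feed an OpenTelemetry counter instrument rather than a plain map):</p>

```go
package main

import "fmt"

// outcomeKey identifies a pre-aggregation bucket. Both fields have a
// bounded set of values, keeping cardinality low; the names are
// illustrative, not taken from any semantic convention.
type outcomeKey struct {
	route   string
	outcome string // e.g. "ok", "client_error", "server_error"
}

// counter aggregates at the source: one integer per bucket instead of
// one log record per request.
type counter map[outcomeKey]int

func (c counter) record(route, outcome string) {
	c[outcomeKey{route, outcome}]++
}

func main() {
	c := counter{}
	c.record("/checkout", "ok")
	c.record("/checkout", "ok")
	c.record("/checkout", "server_error")
	fmt.Println(c[outcomeKey{route: "/checkout", outcome: "ok"}]) // 2
}
```

<p>Three requests collapse into two buckets. The shape is the point: a handful of bounded buckets per flush interval instead of one record per request.</p>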
<p>You’ll never get the perfect balance on the first shot. In fact, even if you get into a perfect state today, it won’t be suitable anymore tomorrow as systems evolve. Keep an open mind and add things that are going in the direction of what you believe you’ll need. Adding too much fertilizer does hurt your garden.</p>
<h2 id="heading-applying-purposefulness-in-practice">Applying Purposefulness in Practice</h2>
<p>Purposeful instrumentation isn't just a theoretical concept; it's applied through conscious choices during development and operation.</p>
<ul>
<li><p><strong>Manual Instrumentation:</strong> When you manually add code to emit telemetry (e.g., using OpenTelemetry APIs), be explicit. Add enough detail explaining the 'why' behind non-obvious metrics or attributes. Document the intended use case, especially for custom, high-value signals. This foresight is invaluable during incident response or later refactoring, and writing it down is a great exercise for reasoning about the instrumentation in the first place.</p>
</li>
<li><p><strong>Use semantic conventions in your favor:</strong> not only do they tell you how things should be named, they also help you brainstorm what kind of instrumentation to add. For instance, are you adding <code>deployment.environment.name</code> to your resource attributes?</p>
</li>
<li><p><strong>Auto-Instrumentation:</strong> Tools like OpenTelemetry's auto-instrumentation agents (e.g., the Java agent) are powerful, providing broad coverage with minimal effort. However, "zero code" doesn't mean "zero thought." Don't blindly enable every single instrumentation library offered. Review the default configuration. Can you disable instrumentation for components irrelevant to your critical paths (e.g., verbose JDBC logging if you primarily diagnose issues at the service level)? Can you configure sampling decisions more intelligently? Tune or suppress instrumentation known to generate excessive noise or high-cardinality data that bloats costs without commensurate value. Auto-instrumentation provides the seeds; purposeful configuration helps you cultivate the desired crop.</p>
</li>
<li><p><strong>Regular Review &amp; Weeding:</strong> Instrumentation needs aren't static. Systems evolve, code gets refactored, and priorities shift. Schedule periodic reviews (e.g., quarterly) of your existing telemetry. It might still be a bit early, but consider using <a target="_blank" href="https://github.com/open-telemetry/weaver">OTel Weaver</a> to help you here. Ask: Are there metrics, logs, or trace attributes that haven't been queried or looked at in months? Be ruthless about pruning unused or redundant instrumentation. This ongoing "weeding" keeps your observability garden healthy, cost-effective, and focused on yielding insights.</p>
</li>
</ul>
<h2 id="heading-reaping-the-benefits-of-critical-scrutiny">Reaping the Benefits of Critical Scrutiny</h2>
<p>Consistently applying a critical, purposeful lens to your instrumentation strategy, whether manual or automatic, transforms observability from a potential data swamp into a beautiful field full of data ready to harvest. It ensures your telemetry remains:</p>
<ul>
<li><p><strong>Focused:</strong> Directly addressing key operational questions and business KPIs.</p>
</li>
<li><p><strong>Relevant:</strong> Aligned with current system architecture and troubleshooting needs.</p>
</li>
<li><p><strong>Cost-Effective:</strong> Providing maximum insight for the resources invested.</p>
</li>
<li><p><strong>Actionable:</strong> Enabling swift diagnosis, resolution, and performance optimization.</p>
</li>
</ul>
<p>By consciously choosing <em>what</em> to plant in your observability garden and <em>why</em>, you cultivate a rich harvest of insights.</p>
<h2 id="heading-next-up">Next up</h2>
<p>What does purposeful instrumentation <em>actually</em> look like in real code, and how do we correct existing instrumentation that might not be that useful? Stay tuned for our next article, where we'll walk through concrete examples of common instrumentation pitfalls and the precise steps to fix them, both directly at the source and with the OTel Collector.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In today's complex, distributed systems, observability is non-negotiable. But the path to enlightenment isn't paved with sheer data volume. It's built on the foundation of <strong>purposeful instrumentation</strong> – the deliberate act of gathering the <em>right</em> signals to illuminate system behavior.</p>
<p>By embedding the practice of asking "Why this signal? Why now? How will it help?" into our development workflow, we shift from reactive data collection to proactive insight generation. We reduce noise, accelerate troubleshooting, control costs, and ultimately, build more reliable and performant software.</p>
<p>So, the next time you reach for that instrumentation library or add that log line, take a moment. Pause. Ask yourself: "What is my purpose?". Cultivate clarity, and you'll reap the rewards of truly effective observability.</p>
<p><em>Acknowledgement: The concept of intentional instrumentation gained prominence for me through a conversation with Adriel Perkins, which evolved into purposeful instrumentation.</em></p>
]]></content:encoded></item><item><title><![CDATA[There's a Lot of Bad Telemetry Out There]]></title><description><![CDATA[Ninety percent. That's the number the founder of an observability company mentioned some time ago when talking about telemetry data that withers away unused. It is data created, collected, transmitted, and stored, without ever blooming into a dashboa...]]></description><link>https://blog.olly.garden/theres-a-lot-of-bad-telemetry-out-there</link><guid isPermaLink="true">https://blog.olly.garden/theres-a-lot-of-bad-telemetry-out-there</guid><category><![CDATA[observability]]></category><category><![CDATA[telemetry]]></category><category><![CDATA[OpenTelemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Fri, 28 Mar 2025 13:21:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743091623264/5303c3c0-1e2c-42ae-a6e9-ee965a970a67.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ninety percent. That's the number the founder of an observability company mentioned some time ago when talking about telemetry data that withers away unused. It is data created, collected, transmitted, and stored, without ever blooming into a dashboard, alert, or query result. Our monitoring tools keep planting more and more data seeds, hoping they'll sprout into useful insights during a 2am production incident. Like planting without proper soil analysis, these generic instrumentation tools create low-quality telemetry that isn't suited for our environment. Traces from health check endpoints are like weeds taking up valuable space.</p>
<p>Don’t get me wrong, this is still better than nothing! However, many companies have matured past this stage and need to cultivate higher quality telemetry.</p>
<p>Telemetry is essential for us, observability engineers, to understand how our systems are performing. It's the soil from which observability and modern monitoring blossom. Despite its importance, a significant amount of telemetry data out there is, to put it bluntly, bad. This post will explore what good telemetry looks like, digging into the problems caused by bad telemetry, examine the root causes, and plant a thought on how to cultivate higher quality data.</p>
<h1 id="heading-what-is-good-telemetry">What is Good Telemetry?</h1>
<p>Good telemetry is characterized by providing an <strong>accurate</strong>, <strong>relevant</strong>, <strong>timely</strong>, and <strong>actionable</strong> picture of a system's health and performance. It offers <strong>just enough</strong> information — not too much or too little — to allow for improved troubleshooting, quick identification and resolution of issues, and a deeper understanding of the system. Additionally, good telemetry facilitates faster incident response, minimizing downtime and service disruptions, and bears fruit in the form of data-driven decisions that optimize performance.</p>
<p>Examples include traces that show the path of a request through a system, metrics that accurately reflect resource utilization, and logs that provide context for errors and warnings. Quite frequently, good telemetry is used to provide business insights along with operational data.</p>
<h1 id="heading-bad-telemetry">Bad Telemetry</h1>
<p>Bad telemetry is the opposite of good telemetry: inaccurate, irrelevant, old, non-actionable, or far too much in terms of volume.</p>
<ul>
<li><p><strong>Inaccurate</strong> telemetry happens when we have the wrong values, leading us to wrong conclusions. For instance, concurrently counting the number of leaves in a tree and storing the counter in a non-concurrent data structure will eventually result in the wrong number of leaves being reported.</p>
</li>
<li><p><strong>Irrelevant</strong> telemetry doesn’t provide any meaningful insight. It might be interesting to know that I have three boxes of kiwis, but without knowing the size of the box, that information is meaningless.  </p>
</li>
<li><p><strong>Incomplete</strong> telemetry means that we can’t determine the root cause of a problem. How can I tell whether my strawberries received enough sun if I’m not recording the amount of sun they received? How can I tell why my tulips didn’t grow if I don’t know whether they were planted in the first place?  </p>
</li>
<li><p><strong>Old</strong> data is bad telemetry because it means I’m only taking action when it might be too late. There’s no point in employing a scarecrow after the birds have eaten all the seeds.</p>
</li>
<li><p>Having <strong>too much</strong> data can also make it harder to find the real information we are looking for. It’ll definitely take us longer to find a magic herb in the middle of overgrown bushes.</p>
</li>
</ul>
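<p>The first bullet can be made concrete in Go: concurrent increments to a plain variable silently lose updates, while an atomic counter stays exact. A minimal sketch, mirroring the leaf-counting example above:</p>

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countConcurrently simulates many goroutines reporting into one shared
// counter. atomic.Int64 makes the result exact; a plain int64 incremented
// with ++ from multiple goroutines would lose updates under contention.
func countConcurrently(writers, perWriter int) int64 {
	var leaves atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWriter; j++ {
				leaves.Add(1)
			}
		}()
	}
	wg.Wait()
	return leaves.Load()
}

func main() {
	fmt.Println(countConcurrently(8, 10000)) // always 80000
}
```

<p>Swap <code>atomic.Int64</code> for a plain <code>int64</code> and the reported total will typically fall short of 80000: exactly the kind of quietly wrong number that misleads an investigation.</p>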
<p>One interesting consideration is that the definition of bad telemetry also varies based on the backend we are using. Rice needs tons of water, while purple coneflower will probably complain about it. For a time-series database, for example, a high-cardinality metric is certainly not desirable.</p>
<h1 id="heading-consequences-of-bad-telemetry">Consequences of bad telemetry</h1>
<p>We, as an industry, have tolerated bad telemetry as a price to pay for having telemetry in the first place. However, bad telemetry isn’t just bad: it’s the weed in our garden. Misleading insights from erroneous data can lead to poor decisions, and time and effort can be wasted analyzing useless data. Additionally, bad telemetry can result in slower problem resolution, due to the difficulty in identifying and fixing issues. It can lead to poor decision-making and incorrect business strategies based on faulty metrics.</p>
<p>Perhaps more relevant for today’s economic realities, bad telemetry often means paying way more for observability than you should: high egress costs, overprovisioned infrastructure, complex pipelines, and unreasonably big checks to observability vendors.</p>
<h1 id="heading-root-causes-of-bad-telemetry">Root Causes of Bad Telemetry</h1>
<p>Many developers and engineers lack a full understanding of the principles of good telemetry. Observability is not taught at university alongside operating systems or databases, so it's not surprising that engineers don't learn how to properly instrument their applications.</p>
<p>Like security, observability is typically a discipline that engineers come across as they get more experienced, once they’ve been burned by bug reports they couldn’t reproduce and the frustrated users that followed. Insufficient planning is another issue: telemetry is often an afterthought rather than a core part of system design. Poor implementation is a factor too, as instrumentation can be complex and mistakes are easily made. Finally, telemetry systems require ongoing tending and upkeep, and inadequate maintenance can lead to an infestation of bad telemetry. Instrumenting complex systems is particularly challenging: tasks like propagating context correctly across distributed services are crucial for accurate tracing.</p>
<p>It doesn’t help that things are still moving really fast in observability: best practices can suddenly become anti-patterns, and new areas lack standardization. Applications making use of generative AI tools or LLMs need to be instrumented today, but the standards for instrumenting those components are still being worked on. Without clear industry guidance, people have to make decisions based on their own experience (or their vendor’s suggestions), which quite often means they’ll be at odds with the standard once it is created. OpenTelemetry semantic conventions definitely help here, but while OTel makes constant progress, we don’t yet have stable conventions for everything that matters out there.</p>
<h1 id="heading-how-to-improve-telemetry-quality">How to Improve Telemetry Quality</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743190253404/11f25ca7-aee5-4f56-abc7-af4b01a0ac27.png" alt class="image--center mx-auto" /></p>
<p>We are at the beginning of this journey towards high quality telemetry, and while there’s a lot to learn, I believe that we have enough to start changing the status quo. We already know a few anti-patterns, like the ones we mentioned earlier. We can also be opinionated about some solutions, such as OpenTelemetry, and gather insights based on those strong opinions. Perhaps all my telemetry should have a <code>service.name</code> and <code>service.version</code> field? Perhaps some resource attributes, like <code>process.executable.path</code>, should be filtered out at the source when going to a time-series metrics backend?</p>
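<p>As a sketch of that last idea, a minimal OpenTelemetry Collector configuration could use the <code>resource</code> processor to delete such an attribute before it reaches a time-series backend. The endpoint below is a placeholder, and your own receivers and exporters will differ:</p>
<pre><code class="lang-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Drop a high-noise resource attribute before it becomes a metric label.
  resource:
    attributes:
      - key: process.executable.path
        action: delete

exporters:
  prometheusremotewrite:
    endpoint: https://metrics.example.com/api/v1/write  # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resource]
      exporters: [prometheusremotewrite]
</code></pre>
<p>Because the processor is attached only to the metrics pipeline, logs and traces flowing through other pipelines would keep the full resource context where it’s still useful.</p>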
<p>Improving telemetry quality is a long-term activity, and I believe we’ll never have perfectly good telemetry. New services come and go every day, new versions are deployed all the time, and they all bring new telemetry with them, which is likely to be imperfect.</p>
<p>It’s our job as observability engineers to understand what is good and what is bad telemetry, knowing our tools and how they’ll behave with the telemetry we are sending to them. Once we have our own recipes for what’s good and what’s bad, it’s a matter of looking into our telemetry and applying that knowledge, cross-pollinating with the engineers doing the instrumentation. Very likely, they are the ones making the actual changes, or at least reviewing the pull requests we send their way to improve the instrumentation of their code.</p>
<p>The question that remains is: how do we show “progress”? Are we really better off with good telemetry, or can we just survive with bad telemetry? Instinctively, we know that good telemetry is more efficient than bad telemetry, but we should be ready to measure the impact.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Bad telemetry remains a pervasive challenge with significant consequences, one that we, as observability engineers, have often tolerated or underestimated. This is partly due to a lack of clear understanding of what constitutes good telemetry and partly due to the absence of robust tools and processes for detection and remediation. However, by actively defining and implementing standards for good telemetry, identifying the root causes of poor data quality (like pulling out weeds), and leveraging tools like OpenTelemetry, we can unlock the full potential of observability. This means reduced alert fatigue, faster incident resolution, and ultimately, more reliable and performant systems. Pioneering teams are already demonstrating the value of this approach, implementing semantic conventions and data quality pipelines that yield healthy, actionable insights. It's time for each of us to assess our own telemetry landscape, advocate for better instrumentation practices within our teams, and ensure that the data we rely on is truly serving our needs.</p>
<h1 id="heading-credits">Credits</h1>
<p>Dan Blanco, my colleague at OpenTelemetry’s Governance Committee, is the author of the quote “There’s a lot of bad telemetry out there,” and he used it when I was describing what we are building at OllyGarden. He immediately understood our value proposition, wrapping it up with this quote. I like it: it’s blunt, it’s real, and it’s blameless.</p>
]]></content:encoded></item></channel></rss>