<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[OllyGarden]]></title><description><![CDATA[Fix Your Telemetry. Autonomously.]]></description><link>https://blog.olly.garden</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1761335922273/94e9fddb-b4b1-424e-a285-f74fe440b99d.png</url><title>OllyGarden</title><link>https://blog.olly.garden</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 03:18:02 GMT</lastBuildDate><atom:link href="https://blog.olly.garden/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Severity-Based Log Routing with the OpenTelemetry Collector]]></title><description><![CDATA[Log storage costs scale with volume, and modern applications generate extraordinary volumes. A distributed system handling thousands of requests per second can easily produce millions of log records d]]></description><link>https://blog.olly.garden/severity-based-log-routing-with-the-opentelemetry-collector</link><guid isPermaLink="true">https://blog.olly.garden/severity-based-log-routing-with-the-opentelemetry-collector</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[log management]]></category><category><![CDATA[Log Routing]]></category><category><![CDATA[opentelemetry collector]]></category><category><![CDATA[telemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Thu, 19 Mar 2026 10:28:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768832177546/a244d6b5-d21a-4718-8d94-5ac16d3985f6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Log storage costs scale with volume, and modern applications generate extraordinary volumes. 
A distributed system handling thousands of requests per second can easily produce millions of log records daily, the vast majority of which are INFO or DEBUG messages that exist primarily for post-hoc debugging. Sending all of this to a managed observability backend accumulates costs rapidly, yet dropping these logs entirely means losing the context you need when something goes wrong.</p>
<p>Analysis of real-world log traffic reveals a striking pattern: the vast majority of logs sent to vendor backends are INFO severity and lower. In my <a href="https://www.youtube.com/watch?v=kdzeUiMI_t4">KubeCon North America 2025 talk with Michele Mancioppi</a>, we presented findings showing this imbalance across production environments. The ideal scenario inverts this distribution: vendor backends should primarily receive WARN and above, the logs that signal problems requiring attention, while verbose logs flow to cheaper storage tiers.</p>
<p>The fundamental insight is that not all logs require the same storage tier. ERROR and WARN messages demand immediate visibility and fast query performance because they indicate problems requiring human attention. INFO and DEBUG messages, by contrast, primarily serve forensic purposes: understanding what happened before an error occurred. These forensic logs can tolerate slower query performance and longer retrieval times in exchange for dramatically lower storage costs.</p>
<p>The OpenTelemetry Collector's routing connector enables this tiered storage pattern by evaluating each log record's severity and directing it to the appropriate destination. Important logs flow to your vendor backend for alerting and dashboards. Verbose logs flow to object storage for cost-effective archival. The result is observability that remains comprehensive without the comprehensive bill.</p>
<h2>Understanding Log Severity in OpenTelemetry</h2>
<p>Before configuring routing, understanding how OpenTelemetry represents log severity is essential. The OTLP data model defines a 24-level severity scale through the <code>severity_number</code> field, grouped into six base levels with four sub-levels each.</p>
<table>
<thead>
<tr>
<th>Range</th>
<th>Base Level</th>
<th>Typical Use</th>
</tr>
</thead>
<tbody><tr>
<td>1-4</td>
<td>TRACE</td>
<td>Fine-grained debugging, execution flow</td>
</tr>
<tr>
<td>5-8</td>
<td>DEBUG</td>
<td>Diagnostic information for developers</td>
</tr>
<tr>
<td>9-12</td>
<td>INFO</td>
<td>Normal operational messages</td>
</tr>
<tr>
<td>13-16</td>
<td>WARN</td>
<td>Potential issues that may require attention</td>
</tr>
<tr>
<td>17-20</td>
<td>ERROR</td>
<td>Errors that require investigation</td>
</tr>
<tr>
<td>21-24</td>
<td>FATAL</td>
<td>Critical failures, system crashes</td>
</tr>
</tbody></table>
<p>The base severity number for each level represents the first value in its range: TRACE is 1, DEBUG is 5, INFO is 9, WARN is 13, ERROR is 17, and FATAL is 21. When routing by severity, you compare against these base values.</p>
<p>The <code>severity_text</code> field preserves the original severity name from the source logging framework. A Java application using <code>java.util.logging</code> might emit <code>SEVERE</code> as <code>severity_text</code> while the collector maps it to <code>severity_number</code> 17 (ERROR). This dual representation lets you route on normalized numbers while retaining source-specific terminology.</p>
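<p>A simplified sketch of how one such record carries both fields (field names abbreviated here; the real OTLP payload nests this inside <code>resourceLogs</code>, <code>scopeLogs</code>, and <code>logRecords</code>):</p>
<pre><code class="language-yaml"># Both representations travel together on the same log record
severity_text: SEVERE       # preserved verbatim from java.util.logging
severity_number: 17         # normalized into the ERROR range (17-20)
body: "Could not open connection to database"
</code></pre>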
<p>OTTL, the OpenTelemetry Transformation Language used by the routing connector, provides named constants for severity comparisons. Writing <code>severity_number &gt;= SEVERITY_NUMBER_WARN</code> is clearer than writing <code>severity_number &gt;= 13</code> and survives specification changes should the numeric values ever be updated.</p>
<h2>The Routing Connector</h2>
<p>The routing connector evaluates telemetry against a routing table and forwards it to matching pipelines. Unlike processors, which transform data in place, connectors sit between pipelines, receiving from one and emitting to others. This architecture enables fan-out patterns where a single input pipeline routes to multiple output pipelines based on arbitrary conditions.</p>
<p>For severity-based routing, the connector examines each log record's <code>severity_number</code> field and routes to different pipelines depending on the value. The routing table uses OTTL conditions, so you have access to the full expression language for complex routing logic.</p>
<p>The connector operates as both an exporter (from the perspective of the input pipeline) and a receiver (from the perspective of output pipelines). This dual role is reflected in how you wire it in the service section: the input pipeline exports to the connector, while output pipelines receive from it.</p>
<h2>Complete Configuration</h2>
<p>The following configuration demonstrates severity-based routing with two tiers: important logs (WARN and above) route to an observability backend via OTLP, while informational logs (INFO and below) route to local files for archival. This example uses a local LGTM stack (Loki, Grafana, Tempo, Mimir) as the vendor backend, making it easy to test the pattern locally before deploying to production.</p>
<p>Start the LGTM stack with Docker, mapping the OTLP gRPC port to 14317 to avoid conflicts with the collector:</p>
<pre><code class="language-bash">docker run -d --name lgtm -p 3000:3000 -p 14317:4317 grafana/otel-lgtm
</code></pre>
<pre><code class="language-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch/vendor:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048

  batch/file:
    timeout: 10s
    send_batch_size: 5000
    send_batch_max_size: 10000

exporters:
  otlp/lgtm:
    endpoint: localhost:14317
    tls:
      insecure: true

  file:
    path: ./archive.jsonl
    rotation:
      max_megabytes: 100
      max_days: 7
      max_backups: 10
    format: json

connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    error_mode: ignore
    table:
      - context: log
        condition: severity_number &gt;= SEVERITY_NUMBER_WARN
        pipelines: [logs/vendor]

service:
  pipelines:
    logs/intake:
      receivers: [otlp]
      exporters: [routing/severity]

    logs/vendor:
      receivers: [routing/severity]
      processors: [batch/vendor]
      exporters: [otlp/lgtm]

    logs/archive:
      receivers: [routing/severity]
      processors: [batch/file]
      exporters: [file]
</code></pre>
<p>The configuration defines three pipelines forming a routing topology. The intake pipeline receives all logs via OTLP and exports to the routing connector. The routing connector evaluates each log record: those with <code>severity_number &gt;= SEVERITY_NUMBER_WARN</code> (13 or higher, meaning WARN, ERROR, or FATAL) route to the vendor pipeline, while everything else routes to the archive pipeline via <code>default_pipelines</code>.</p>
<p>Notice that each output pipeline has its own batch processor with different parameters. The vendor pipeline uses aggressive batching with 1-second timeouts and smaller batches optimized for near-real-time delivery. The archive pipeline uses relaxed batching with 10-second timeouts and larger batches optimized for file write efficiency. This demonstrates a key benefit of the routing pattern: each destination can have processing tuned to its characteristics.</p>
<h2>Configuration Walkthrough</h2>
<p>The <code>routing/severity</code> connector configuration warrants closer examination.</p>
<pre><code class="language-yaml">connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    error_mode: ignore
    table:
      - context: log
        condition: severity_number &gt;= SEVERITY_NUMBER_WARN
        pipelines: [logs/vendor]
</code></pre>
<p>The <code>default_pipelines</code> field specifies where unmatched logs route. In this configuration, logs that do not match the WARN-or-higher condition route to the archive pipeline. Without <code>default_pipelines</code>, unmatched logs would be silently dropped, which is rarely the desired behavior.</p>
<p>The <code>error_mode: ignore</code> setting determines behavior when OTTL condition evaluation fails. With <code>ignore</code>, evaluation errors log a warning and route the affected log to <code>default_pipelines</code>. The alternative, <code>propagate</code>, causes evaluation errors to fail the entire batch, potentially losing data. Production configurations should almost always use <code>ignore</code>.</p>
<p>The <code>context: log</code> setting means the condition evaluates per individual log record. Alternative contexts like <code>resource</code> evaluate once per ResourceLogs batch, which is more efficient but cannot inspect log-level fields like <code>severity_number</code>. For severity-based routing, log context is required.</p>
<p>The condition <code>severity_number &gt;= SEVERITY_NUMBER_WARN</code> uses OTTL's named severity constant. This matches WARN (13-16), ERROR (17-20), and FATAL (21-24). The <code>SEVERITY_NUMBER_WARN</code> constant evaluates to 13, the base value for the WARN range.</p>
<h2>Routing to Multiple Destinations</h2>
<p>Some organizations want important logs sent to both the vendor backend and archival storage for redundancy. The routing connector supports this by listing multiple pipelines in a single route.</p>
<pre><code class="language-yaml">connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    error_mode: ignore
    table:
      - context: log
        condition: severity_number &gt;= SEVERITY_NUMBER_WARN
        pipelines: [logs/vendor, logs/archive]
</code></pre>
<p>With this configuration, logs at WARN and above route to both pipelines, while INFO and below route only to archive. This ensures critical logs have redundant storage while still benefiting from reduced vendor costs for verbose logs.</p>
<h2>Splitting at INFO</h2>
<p>The boundary between important and archival logs is a policy decision. The example above uses WARN as the threshold, sending INFO to archival storage. Some organizations prefer to keep INFO in the vendor backend for operational visibility while archiving only DEBUG and TRACE levels.</p>
<pre><code class="language-yaml">connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    error_mode: ignore
    table:
      - context: log
        condition: severity_number &gt;= SEVERITY_NUMBER_INFO
        pipelines: [logs/vendor]
</code></pre>
<p>Changing <code>SEVERITY_NUMBER_WARN</code> to <code>SEVERITY_NUMBER_INFO</code> shifts the boundary. Now INFO (9-12), WARN (13-16), ERROR (17-20), and FATAL (21-24) route to the vendor, while only DEBUG (5-8) and TRACE (1-4) route to archive.</p>
<p>The cost implications depend on your log distribution. If 90% of your logs are DEBUG level, archiving DEBUG yields substantial savings. If DEBUG logs are rare but INFO logs are prolific, archiving only DEBUG may not meaningfully reduce vendor costs.</p>
<h2>Handling Unspecified Severity</h2>
<p>Log records may arrive with <code>severity_number</code> set to 0 (SEVERITY_NUMBER_UNSPECIFIED) when the source did not map severity correctly. These ambiguous logs need a routing decision. The safest approach treats unknown logs as potentially important by adding <code>or severity_number == 0</code> to the vendor routing condition. We will cover strategies for inferring and mapping severity from log content in a future article.</p>
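<p>As a sketch, the vendor route from the earlier configuration needs only one extra clause to catch these records:</p>
<pre><code class="language-yaml">connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    error_mode: ignore
    table:
      - context: log
        # Route unspecified severity to the vendor alongside WARN and above
        condition: severity_number &gt;= SEVERITY_NUMBER_WARN or severity_number == 0
        pipelines: [logs/vendor]
</code></pre>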
<h2>Performance Considerations</h2>
<p>The routing connector evaluates conditions for every log record when using log context. High log volumes make condition evaluation a meaningful cost. OTTL condition compilation happens once at startup, but evaluation happens continuously during operation.</p>
<p>Simple numeric comparisons like <code>severity_number &gt;= SEVERITY_NUMBER_WARN</code> are fast. Avoid expensive operations in routing conditions: string matching with <code>IsMatch</code> and regular expressions, body parsing with <code>ParseJSON</code>, and complex boolean logic all add per-record evaluation cost that scales with log volume.</p>
<p>If your routing logic requires expensive operations, consider whether a transform processor earlier in the pipeline could precompute the routing decision into an attribute. Routing on a precomputed attribute is faster than repeating expensive evaluations.</p>
<pre><code class="language-yaml">processors:
  transform/route_tag:
    log_statements:
      - context: log
        statements:
          - set(attributes["route"], "vendor") where severity_number &gt;= SEVERITY_NUMBER_WARN
          - set(attributes["route"], "archive") where severity_number &lt; SEVERITY_NUMBER_WARN

connectors:
  routing/severity:
    default_pipelines: [logs/archive]
    table:
      - context: log
        condition: attributes["route"] == "vendor"
        pipelines: [logs/vendor]
</code></pre>
<p>This pattern moves routing logic to the transform processor, which executes once per log record. The routing connector then performs a simple attribute comparison. For complex routing logic involving multiple conditions, this approach consolidates evaluation.</p>
<h2>Batching Strategy</h2>
<p>The batch processor configuration for each output pipeline affects efficiency and latency. The vendor pipeline typically wants low latency for alerting, so smaller batches with short timeouts make sense. The archive pipeline optimizes for throughput and file write efficiency, so larger batches with longer timeouts are appropriate.</p>
<p>File write efficiency improves with larger batches. Writing many small chunks incurs filesystem overhead, while larger batches amortize that cost. The archive pipeline's batch configuration targets larger payloads:</p>
<pre><code class="language-yaml">processors:
  batch/file:
    timeout: 10s
    send_batch_size: 5000
    send_batch_max_size: 10000
</code></pre>
<p>These parameters send a batch once 5,000 log records accumulate or after 10 seconds of waiting, whichever comes first, with <code>send_batch_max_size</code> capping any single batch at 10,000 records. The file exporter then writes these batches efficiently with its built-in rotation handling.</p>
<p>The vendor pipeline uses more aggressive parameters for responsiveness:</p>
<pre><code class="language-yaml">processors:
  batch/vendor:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
</code></pre>
<p>The one-second timeout ensures logs reach the vendor backend quickly for alerting. The smaller batch sizes prevent individual batches from becoming unwieldy.</p>
<h2>Trade-offs and Limitations</h2>
<p>Severity-based routing assumes severity numbers are correctly populated. Logs with missing or incorrect severity will route incorrectly. If your log sources do not reliably set severity, you may need preprocessing to infer severity from log content before routing decisions occur.</p>
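<p>One possible preprocessing sketch uses a transform processor to backfill <code>severity_number</code> from <code>severity_text</code> before the routing connector runs. The match patterns below are illustrative assumptions, not a complete mapping; adapt them to the logging frameworks in your environment:</p>
<pre><code class="language-yaml">processors:
  transform/severity_backfill:
    log_statements:
      - context: log
        statements:
          # Patterns are examples only; extend for your sources
          - set(severity_number, SEVERITY_NUMBER_ERROR) where severity_number == 0 and IsMatch(severity_text, "(?i)^(error|severe)$")
          - set(severity_number, SEVERITY_NUMBER_WARN) where severity_number == 0 and IsMatch(severity_text, "(?i)^warn(ing)?$")
          # Fall back to INFO when some severity text exists but is unrecognized
          - set(severity_number, SEVERITY_NUMBER_INFO) where severity_number == 0 and severity_text != ""
</code></pre>
<p>Placing this processor in the intake pipeline, before the routing connector, means the routing conditions see normalized values.</p>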
<p>Routing decisions are permanent within a single collector deployment. Once a log routes to archive, it does not also appear in the vendor backend unless you explicitly configure dual routing. If you later need archived logs for investigation, you query your archive storage rather than your vendor's fast query interface. Ensure your archive storage has adequate query tooling for forensic analysis.</p>
<h2>Verification</h2>
<p>After deploying severity-based routing, verify that logs route as expected. Use <code>telemetrygen</code> to send test logs at different severity levels and check each destination.</p>
<p>Send an INFO log, which should route to archive:</p>
<pre><code class="language-bash">telemetrygen logs --otlp-insecure --severity-number 9 --severity-text INFO \
  --body "Test info message" --logs 1
</code></pre>
<p>Send a WARN log, which should route to vendor:</p>
<pre><code class="language-bash">telemetrygen logs --otlp-insecure --severity-number 13 --severity-text WARN \
  --body "Test warning message" --logs 1
</code></pre>
<p>Verify the routing by checking both destinations. The INFO log should appear in the local <code>archive.jsonl</code> file, while the WARN log should appear in Loki at <a href="http://localhost:3000">http://localhost:3000</a> (the LGTM container started earlier).</p>
<h2>Summary</h2>
<p>Severity-based log routing enables tiered storage without sacrificing observability. Important logs reach your vendor backend for fast querying and alerting. Verbose logs reach archival storage for cost-effective retention. The OpenTelemetry Collector's routing connector makes this pattern straightforward to implement.</p>
<p>The key configuration elements are the routing connector with OTTL conditions on <code>severity_number</code>, separate output pipelines for each destination, and batch processor tuning appropriate to each destination's characteristics. The pattern scales with log volume since routing decisions are per-record evaluations of simple numeric conditions.</p>
<p>Start by analyzing your log distribution across severity levels. If verbose logs dominate volume, severity-based routing can meaningfully reduce vendor costs. If important logs dominate, the savings may be modest. Either way, the architectural separation between immediate visibility and archival storage provides flexibility for future optimization.</p>
]]></content:encoded></item><item><title><![CDATA[Your telemetry answers yesterday's questions]]></title><description><![CDATA[Every piece of telemetry exists to answer a question. A span answers "what happened during this request?" A metric answers "how is this system performing over time?" A log answers "what did the applic]]></description><link>https://blog.olly.garden/your-telemetry-answers-yesterday-s-questions</link><guid isPermaLink="true">https://blog.olly.garden/your-telemetry-answers-yesterday-s-questions</guid><category><![CDATA[observability]]></category><category><![CDATA[telemetry]]></category><category><![CDATA[instrumentation]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Thu, 12 Mar 2026 10:07:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/695b816347e92352fdf40037/be0c349c-017b-4344-a4b9-bcccb6e8bd66.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every piece of telemetry exists to answer a question. A span answers "what happened during this request?" A metric answers "how is this system performing over time?" A log answers "what did the application observe at this moment?" When engineers configure instrumentation, they are implicitly encoding the questions they expect to ask. The problem is that the questions change, and the instrumentation does not.</p>
<p>A service deployed three months ago had a particular set of unknowns. How will it perform under real traffic? Are the retry mechanisms working correctly? Does the circuit breaker trigger at the right thresholds? The instrumentation was configured to answer these questions, and it did. The service proved itself. The unknowns became knowns. But the instrumentation kept running, answering questions that stopped being relevant weeks ago.</p>
<h2>When stability makes telemetry redundant</h2>
<p>Consider a payment processing service that has been running in production for six months without a significant incident. During its first weeks, engineers needed detailed spans for every database query, every downstream call, every retry attempt. Those spans helped them verify that the service behaved correctly under production conditions.</p>
<p>Six months later, the service processes thousands of transactions per hour with predictable latency and a near-zero error rate. The detailed spans still flow into the backend. Every database query, every downstream call, every retry, all captured, serialized, transmitted, stored. The pipeline processes them faithfully. Nobody looks at them.</p>
<p>This is not wasted telemetry in the traditional sense. Each individual span is well-formed and technically correct. The problem is relevance. The questions these spans answer, "is the database query pattern correct?" and "do retries work as designed?", were answered months ago. The telemetry is accurate but obsolete. It consumes real resources to confirm what the system has already proven through months of stable operation.</p>
<h2>When pressure creates new questions</h2>
<p>The opposite scenario is more urgent. A downstream dependency starts responding intermittently. Traffic spikes during a major sales event. A configuration change in an adjacent service introduces unexpected latency.</p>
<p>Operators open their dashboards and find that the existing telemetry describes the normal world with precision but has little to say about the abnormal world they are experiencing right now. The service-level metrics confirm elevated error rates, but there is no breakdown by downstream dependency. The traces capture the full request lifecycle, but they lack attributes that would distinguish between traffic patterns. The logs report application-level events but miss the infrastructure signals that would explain the cascading failure.</p>
<p>The gap between the questions operators need to answer and the questions the telemetry was designed to answer becomes painfully visible during incidents. Engineers spend the first thirty minutes of an outage not debugging the problem but instrumenting for it: adding log lines, enabling verbose tracing, deploying configuration changes to capture the attributes they need. This is reactive instrumentation, the opposite of the proactive observability that the industry aspires to.</p>
<p>The root cause is temporal mismatch. The instrumentation was configured for a different moment in the system's lifecycle, when the risks were different, when the traffic patterns were different, when the dependencies behaved differently. The system changed. The world around it changed. The telemetry stayed the same.</p>
<h2>The review that never happens</h2>
<p>The textbook answer is periodic reassessment. Teams should review their instrumentation regularly, asking whether the telemetry they collect still matches the questions they need to answer. Reduce verbosity for stable services. Add coverage for services under new pressure. Retire metrics that no alert or dashboard references.</p>
<p>This is sound advice that almost no organization follows. The reason is simple: there is always something more urgent. Feature delivery, incident response, infrastructure maintenance, and hiring all compete for the same engineering hours. Telemetry review is important but never urgent, which means it loses to everything that is both important and urgent.</p>
<p>The observability team, if the organization has one, is occupied with pipeline operations: keeping collectors running, managing backend capacity, responding to cost overruns. Asking application teams to audit their own instrumentation requires them to context-switch from their primary work, understand what they are currently emitting, evaluate whether it is still relevant, and make informed changes. Each of these steps demands time and expertise that teams under delivery pressure cannot spare.</p>
<p>The result is that instrumentation configurations calcify at their initial state. Services that were instrumented for launch keep their launch-day telemetry forever. Services that were instrumented during an incident keep their incident-response telemetry long after the incident resolves. Nobody adjusts because nobody has time, and the mismatch between questions and answers widens silently.</p>
<h2>AI as continuous telemetry reviewer</h2>
<p>This is the kind of problem where AI changes the equation fundamentally. The work of reviewing telemetry, analyzing what each service emits, evaluating whether it matches current conditions, identifying gaps and redundancies, is exactly the kind of continuous, attention-intensive analysis that humans cannot sustain and AI can.</p>
<p>An AI system observing the telemetry stream can build and maintain a model of each service's emissions and behavioral patterns. It can detect when a service has stabilized and its verbose instrumentation has become redundant. It can recognize when traffic patterns shift and existing telemetry lacks the attributes needed to understand the new behavior. It can identify metrics that nothing references and spans that nobody queries.</p>
<p>The critical capability is not just detection but reasoning. AI can formulate the questions that current conditions would demand, then check whether the existing telemetry can answer them. "If this service's primary database became unavailable, would the current instrumentation reveal the failure mode?" "If traffic doubled, would the existing metrics distinguish between capacity pressure and application errors?" These are the questions a thorough human review would ask. AI can ask them continuously, across every service, without competing with feature delivery for engineering time.</p>
<p>This does not replace human decision-making about instrumentation strategy. Engineers still decide what matters, what trade-offs to accept, and what risks to prioritize. AI handles the part that humans agree is important but cannot sustain: the ongoing, service-by-service evaluation of whether the telemetry still fits the reality.</p>
<h2>Closing the temporal gap</h2>
<p>The fundamental insight is that telemetry quality is not a property of individual spans or metrics. It is a measure of alignment between what is collected and what is needed right now. That alignment degrades in both directions: stable systems become over-instrumented, and pressured systems become under-instrumented. Both conditions waste resources. One wastes money. The other wastes time during incidents.</p>
<p>Organizations that treat instrumentation as a one-time project accept this drift as inevitable. Those that recognize telemetry as something that evolves with the system, manage it as an ongoing lifecycle, and invest in AI systems that maintain alignment between collection and need get observability that adapts to their current reality rather than preserving a snapshot of the past.</p>
<p>Your telemetry answers yesterday's questions. The question is whether you have a system that keeps it current.</p>
]]></content:encoded></item><item><title><![CDATA[When to Use Each Telemetry Signal: Logs, Traces, and Metrics]]></title><description><![CDATA[Understanding when to use logs, traces, or metrics is fundamental to building effective observability. Each signal serves a distinct purpose, and choosing the right one for a given situation directly impacts your ability to debug, monitor, and unders...]]></description><link>https://blog.olly.garden/when-to-use-each-telemetry-signal-logs-traces-and-metrics</link><guid isPermaLink="true">https://blog.olly.garden/when-to-use-each-telemetry-signal-logs-traces-and-metrics</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[Devops]]></category><category><![CDATA[cloud native]]></category><category><![CDATA[distributed systems]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Tue, 17 Feb 2026 14:00:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771335707319/7e76045f-1c9b-4f59-981c-c3a278b33e79.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Understanding when to use logs, traces, or metrics is fundamental to building effective observability. Each signal serves a distinct purpose, and choosing the right one for a given situation directly impacts your ability to debug, monitor, and understand your systems. The challenge is that these signals overlap in capability, leading teams to either over-instrument with redundant data or miss critical insights by using the wrong signal for the job.</p>
<h2 id="heading-logs-the-system-lifecycle-narrator">Logs: The system lifecycle narrator</h2>
<p>Logs are the grandfather of all telemetry signals. Most of us learned to read logs and error messages early in our computing journey, first as users trying to understand why something failed, and eventually as developers writing log statements to communicate system state. Because of this history, logs remain the most universal and accessible signal.</p>
<p>Most legacy systems rely exclusively on logs. Before modern observability practices, aggregating log records was the standard approach to understanding request rates, queue depths, or user journeys across systems. Many teams implemented correlation IDs to tie logs for a specific request across multiple services, essentially building a primitive form of distributed tracing before dedicated tracing systems existed.</p>
<p>Today, logs serve a more focused purpose: understanding the lifecycle of an application. They excel at recording when a pooled connection to a database was established, when a ring was rebalanced, when a circuit breaker opened or closed, when a critical resource stopped working, or when a service recovered from degraded mode. These are events about the application itself, not about individual business transactions.</p>
<p>The strength of logs lies in their flexibility and low barrier to entry. Any developer can add a log statement. The weakness is that this flexibility often leads to inconsistent structure, making analysis difficult at scale.</p>
<h2 id="heading-traces-the-transaction-investigator">Traces: The transaction investigator</h2>
<p>Traces capture telemetry for business transactions, typically in the context of an end-user request that might touch dozens or hundreds of services. Unlike logs, which describe system state, traces describe what happened during a specific operation and how long each step took.</p>
<p>Think of spans, the building blocks of traces, as super-logs. A span is essentially a log entry with a timestamp, a duration, causality relationships through parent span references, and built-in correlation IDs through the trace ID. The critical addition is context propagation: a standardized mechanism to pass trace and span IDs to downstream services, ensuring that all participants in a transaction can contribute to the same trace. This is what those hand-rolled correlation ID solutions were trying to achieve, but traces provide it as a first-class capability with standard protocols and automatic propagation.</p>
<p>The power of traces becomes evident when debugging errors. A trace shows not just that an error occurred, but the exact path the request took, which services were involved, and where the failure originated. When aggregated across many transactions, traces reveal user behavior patterns, system bottlenecks, and optimization opportunities.</p>
<p>One pattern that traces expose with unusual clarity is N+1 queries. While this anti-pattern is difficult to spot with other signals, a trace immediately reveals when a single request triggers dozens of sequential database or network calls. The visual representation of span timing makes the problem obvious in a way that logs or metrics cannot match.</p>
<p>Spans carry highly detailed attributes: which feature flags were active, the user's IP address, whether this is a VIP customer, which authentication mechanism was used, which payment method was selected, and which specific service instances processed the request. This level of detail makes traces the backbone of observability. When you have questions or theories about service behavior, traces often provide the answers.</p>
<p>This power comes with trade-offs. The detail that makes traces valuable also makes them expensive. The sheer volume of spans in a complex system creates significant storage and processing costs. Traces are also the most difficult signal to learn and implement correctly, which drives teams toward auto-instrumentation. While convenient, auto-instrumentation often increases volume further without adding proportional value.</p>
<h2 id="heading-metrics-the-pre-calculated-answer-engine">Metrics: The pre-calculated answer engine</h2>
<p>Metrics are aggregations of events or numeric representations of system state. They answer questions like: what is the current queue depth? How many users visited this page? What is the p99 latency for a specific endpoint?</p>
<p>As aggregations, metrics require choosing dimensions upfront. You might aggregate by endpoint path, service location, or page visited. You typically do not store the IP address for each individual user visit unless you specifically need to count visits per IP. Time-series databases, the systems specialized for storing metrics, are optimized for aggregated data rather than high-cardinality dimensions.</p>
<p>Metrics excel at pre-calculating answers to questions you know you will ask. RED metrics (requests, errors, duration) for HTTP services are the classic example. If you know you will want to track request rates, error percentages, and latency distributions for every endpoint, metrics provide this efficiently and at low query cost.</p>
<p>The limitation appears during ad-hoc exploration. While metrics can answer many questions, there will inevitably be investigations where you need a dimension you did not anticipate. Am I seeing high latency for all users, or only those in Europe? Only Germany? Only Berlin? If you did not include geographic dimensions in your metrics, you cannot answer these questions without re-instrumenting.</p>
<p>Metrics are the classic signal in monitoring. When operating a database, experienced operators know which metrics to watch: connection pool utilization, query latency distributions, replication lag. These indicators quickly reveal the health of the system without requiring investigation into individual transactions.</p>
<h2 id="heading-choosing-the-right-signal">Choosing the right signal</h2>
<p>The decision framework is straightforward once you understand each signal's purpose.</p>
<p>Use traces to record events related to business transactions. When an HTTP request arrives, when a user places an order, when a payment is processed, these are trace-worthy operations. The value is in understanding the complete path and timing of individual transactions.</p>
<p>Use metrics to pre-calculate answers to questions you know you will ask. If you need to monitor request rates, error percentages, or latency distributions, define those metrics upfront. The value is in fast, cheap access to known indicators.</p>
<p>Use logs to understand lifecycle events of your services. When dependencies change state, when configuration reloads, when the application starts or stops gracefully, these belong in logs. The value is in understanding the application as a running system, not the transactions it processes.</p>
<h2 id="heading-when-signals-overlap">When signals overlap</h2>
<p>Real systems often require multiple signals for the same event. A database connection failure might warrant a log (lifecycle event: dependency unavailable), affect a metric (connection error count), and appear in traces (failed span for database operations). This overlap is expected and appropriate.</p>
<p>The mistake is using one signal where another would be more effective. Aggregating log records to compute request rates works, but metrics do this more efficiently. Searching traces to understand when a service entered degraded mode works, but logs make this pattern explicit. Understanding why a specific request failed from metrics alone is nearly impossible, while a trace makes the answer visible.</p>
<p>Match the signal to the question. System health and known indicators call for metrics. Transaction debugging and behavior analysis call for traces. Application lifecycle and operational events call for logs.</p>
<h2 id="heading-summary">Summary</h2>
<p>Each telemetry signal has a distinct purpose that reflects its design and history. Logs narrate system lifecycle events: startups, configuration changes, dependency state transitions. Traces capture business transaction details: request paths, timing, errors, and the attributes that explain behavior. Metrics pre-calculate answers to monitoring questions: rates, distributions, and aggregate states.</p>
<p>Effective observability uses all three signals appropriately. The goal is not coverage through redundancy, but precision through choosing the right tool for each question you need to answer.</p>
]]></content:encoded></item><item><title><![CDATA[You don't have too much telemetry. You have bad telemetry.]]></title><description><![CDATA[The quarterly budget review arrives, and the observability line item has doubled again. The reflexive response is familiar: "We need to sample more aggressively" or "Let's only observe critical services." These tactics will reduce costs. They will al...]]></description><link>https://blog.olly.garden/you-dont-have-too-much-telemetry-you-have-bad-telemetry</link><guid isPermaLink="true">https://blog.olly.garden/you-dont-have-too-much-telemetry-you-have-bad-telemetry</guid><category><![CDATA[#observability #opentelemetry #devops #sre #cloud-native]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Wed, 04 Feb 2026 13:00:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770192983262/f651641f-93df-4f5c-9e42-2e9f7f393fa6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The quarterly budget review arrives, and the observability line item has doubled again. The reflexive response is familiar: "We need to sample more aggressively" or "Let's only observe critical services." These tactics will reduce costs. They will also destroy your ability to debug production incidents, trading a financial problem for an operational one.</p>
<p>The uncomfortable truth is that most organizations do not have a volume problem. They have a governance problem. Before reaching for the sampling dial, engineering leaders should ask a more fundamental question: do we actually know what we are collecting, and is it worth keeping?</p>
<h2 id="heading-the-governance-gap">The governance gap</h2>
<p>Most organizations cannot answer three basic questions about their telemetry: What are we collecting? Who owns it? Is it valuable?</p>
<p>Telemetry tends to grow organically. A developer enables debug logging during an incident and forgets to disable it. Auto-instrumentation captures every internal function call by default. Library internals generate spans that no one ever examines. Over months and years, this accumulation becomes the baseline that everyone assumes is necessary.</p>
<p>The result is a telemetry estate where no one understands the data, no one owns the data, and no one has consciously decided the data is worth the cost of keeping it. When the bill arrives, the only lever that seems available is sampling, which treats all telemetry as equally valuable and cuts it indiscriminately.</p>
<h2 id="heading-patterns-of-bad-telemetry">Patterns of bad telemetry</h2>
<p>Before implementing sampling, engineering leaders should understand the common patterns of telemetry that provides minimal debugging value while consuming significant resources.</p>
<p>Health check floods represent one of the most common offenders. Kubernetes probes, load balancer checks, and monitoring systems generate millions of traces daily. These traces confirm that services are responding, but they reveal nothing about application behavior, user experience, or system bottlenecks. They crowd out useful signal and consume pipeline capacity.</p>
<p>Debug logs abandoned in production create similar waste. During incident response, engineers often increase logging verbosity to understand system behavior. Once the incident resolves, these verbose settings remain in place, generating enormous log volumes that no one examines until the next billing cycle.</p>
<p>High-cardinality metric attributes cause a different kind of problem. Adding user identifiers or transaction IDs to metric labels seems useful until the metrics backend collapses under millions of unique time series. The cost grows multiplicatively with each additional high-cardinality attribute.</p>
<p>Internal span proliferation occurs when auto-instrumentation, especially via eBPF, captures every method call within a service. A single user request might generate fifteen spans, ten of which complete in under a millisecond and represent internal implementation details rather than meaningful system boundaries. These spans add noise to traces without aiding debugging.</p>
<p>Orphaned spans result from broken context propagation between services. These spans cannot be assembled into coherent traces, rendering them useless for understanding request flow. They consume storage and processing resources while providing zero debugging value.</p>
<h2 id="heading-fix-at-source-not-at-pipe">Fix at source, not at pipe</h2>
<p>Many organizations attempt to address telemetry waste by adding filters in their collection pipeline. This approach misses the fundamental inefficiency. By the time data reaches the collector, the application has already generated, serialized, and transmitted it across the network. Filtering at the collector reduces storage costs, but the computational and network costs have already been incurred.</p>
<p>Source-level fixes eliminate waste entirely. Configuring instrumentation agents to exclude health check endpoints prevents those traces from being created. Establishing log level policies in deployment configurations ensures debug logging stays in development environments. Code review practices can catch high-cardinality metric attributes before they reach production.</p>
<p>The collector should serve as a safety net for edge cases, not the primary mechanism for data governance. Filter processors handle scenarios that cannot be addressed at the source, such as legacy applications or third-party services. For everything else, the most cost-effective solution is preventing waste from being generated.</p>
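<p>As a sketch of that safety net, the Collector's filter processor can drop health-check spans with an OTTL condition. The <code>/healthz</code> path here is an assumption; match whatever paths your probes actually use:</p>
<pre><code class="lang-yaml">processors:
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        # Drop spans for health check endpoints that slipped past source-level config
        - attributes["url.path"] == "/healthz"
</code></pre>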
<h2 id="heading-the-volume-question-remains">The volume question remains</h2>
<p>Even after addressing bad telemetry, some organizations will still face legitimate volume challenges. High-traffic systems generate substantial telemetry even when every span and log provides genuine value. The difference is what happens next.</p>
<p>Sampling garbage gives you a smaller pile of garbage. When telemetry is a mix of useful signal and noise, sampling cuts both indiscriminately. You reduce costs, but you also reduce your ability to debug the specific incidents that sampling happened to discard.</p>
<p>Sampling after cleanup is a strategic decision about valuable data. When you have eliminated the noise, every piece of remaining telemetry serves a purpose. Sampling decisions become intentional trade-offs between cost and observability coverage rather than desperate cuts to an unmanaged data stream. Tail-based sampling can preserve error traces while reducing successful request volume. Rate limiting can cap burst traffic while maintaining baseline visibility.</p>
<p>The key insight is that cleanup dramatically reduces the volume that needs sampling in the first place. Organizations often discover that addressing bad telemetry alone brings costs within acceptable ranges, eliminating the need for aggressive sampling entirely.</p>
<h2 id="heading-a-practical-approach-for-engineering-leaders">A practical approach for engineering leaders</h2>
<p>Addressing telemetry governance requires visibility before action. Start by inventorying what you collect, identifying the top contributors to volume across traces, metrics, and logs. Most organizations find that a small number of sources account for the majority of data.</p>
<p>Categorize that volume by type. Health checks, internal spans, debug logs, and high-cardinality metrics each require different remediation strategies. Understanding the composition of your telemetry guides where to focus effort.</p>
<p>Assess value honestly by asking when each category of telemetry last contributed to resolving an incident. If no one can recall using health check traces for debugging, they are candidates for elimination or aggressive filtering.</p>
<p>Implement fixes at the source where possible. Agent configuration changes, log level policies, and instrumentation code reviews address the root cause rather than treating symptoms. Reserve collector-level filtering for cases where source changes are impractical.</p>
<p>Finally, if volume remains a concern after cleanup, implement sampling with intention. Document what is being sampled and why. Ensure that sampling policies preserve the traces most likely to matter during incidents, such as errors, high-latency requests, and specific customer traffic.</p>
<p>The path from reactive cost cutting to intentional data governance requires effort, but the reward is an observability system that costs less and works better. The next time the budget conversation surfaces, the answer should not be "sample more." It should be "we know exactly what we collect, and it is worth keeping."</p>
]]></content:encoded></item><item><title><![CDATA[Reducing Log Volume with the OpenTelemetry Log Deduplication Processor]]></title><description><![CDATA[Your logs are probably at least 80% repetitive noise. Connection retries, health checks, heartbeat messages: the same log line repeated thousands of times per minute. You pay storage costs for each one while the signal drowns in noise. The OpenTeleme...]]></description><link>https://blog.olly.garden/reducing-log-volume-with-the-opentelemetry-log-deduplication-processor</link><guid isPermaLink="true">https://blog.olly.garden/reducing-log-volume-with-the-opentelemetry-log-deduplication-processor</guid><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Mon, 19 Jan 2026 15:00:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768587144939/4c55a1cc-f5d0-4062-bc95-fe02422c2bb9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Your logs are probably at least 80% repetitive noise. Connection retries, health checks, heartbeat messages: the same log line repeated thousands of times per minute. You pay storage costs for each one while the signal drowns in noise. The OpenTelemetry Collector's log deduplication processor offers an elegant solution to this problem.</p>
<h2 id="heading-the-repetitive-log-problem">The repetitive log problem</h2>
<p>Modern distributed systems generate enormous volumes of logs, but much of that volume provides diminishing returns. Consider a typical microservice that logs connection errors when a downstream dependency is unavailable. If the service retries every 100 milliseconds for 30 seconds, that's 300 nearly identical log entries for a single incident. Each entry consumes storage, network bandwidth, and processing capacity in your logging backend.</p>
<p>Health check endpoints compound the problem. Kubernetes probes, load balancer checks, and monitoring systems all generate log entries at regular intervals. A single service might log thousands of health check responses per hour, none of which provide meaningful insight beyond "the service was running."</p>
<p>The logdedupprocessor in the OpenTelemetry Collector solves this by aggregating identical logs over a configurable time window. Instead of forwarding every duplicate entry, it emits a single log with a count of how many times that message appeared.</p>
<h2 id="heading-how-log-deduplication-works">How log deduplication works</h2>
<p>The core concept is straightforward. Logs are considered identical when they share the same resource attributes, scope, body, attributes, and severity. The processor computes a hash of these fields and tracks occurrences within a configurable interval.</p>
<p>When the interval expires, the processor emits a single log entry with three additional attributes: <code>log_count</code> (the number of duplicates), <code>first_observed_timestamp</code>, and <code>last_observed_timestamp</code>. You keep full visibility into frequency patterns without storing every identical entry.</p>
<p>This approach differs from sampling in an important way. Sampling discards data permanently. Deduplication preserves the information that matters (what happened, how often, and when) while eliminating redundant storage.</p>
<h2 id="heading-practical-configuration">Practical configuration</h2>
<p>Here is a configuration that deduplicates connection errors while preserving audit logs:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">processors:</span>
  <span class="hljs-attr">logdedup:</span>
    <span class="hljs-attr">interval:</span> <span class="hljs-string">1s</span>
    <span class="hljs-attr">conditions:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">severity_number</span> <span class="hljs-string">&gt;=</span> <span class="hljs-string">SEVERITY_NUMBER_ERROR</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes["log.type"]</span> <span class="hljs-string">==</span> <span class="hljs-string">"connection"</span>
    <span class="hljs-attr">exclude_fields:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes.request_id</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes.timestamp</span>
</code></pre>
<p>The <code>conditions</code> field uses OpenTelemetry Transformation Language (OTTL) expressions to filter which logs get deduplicated. Logs that do not match pass through unchanged. In this example, only ERROR-level logs with the <code>log.type=connection</code> attribute are candidates for deduplication.</p>
<p>The <code>exclude_fields</code> option removes high-cardinality fields from the comparison. Fields like request IDs and timestamps differ between entries even when the log message is semantically identical. By excluding them, logs that differ only in these volatile fields are treated as duplicates.</p>
<h2 id="heading-a-complete-pipeline-example">A complete pipeline example</h2>
<p>To use the log deduplication processor, include it in your collector pipeline:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">receivers:</span>
  <span class="hljs-attr">otlp:</span>
    <span class="hljs-attr">protocols:</span>
      <span class="hljs-attr">grpc:</span>
        <span class="hljs-attr">endpoint:</span> <span class="hljs-number">0.0</span><span class="hljs-number">.0</span><span class="hljs-number">.0</span><span class="hljs-string">:4317</span>

<span class="hljs-attr">processors:</span>
  <span class="hljs-attr">logdedup:</span>
    <span class="hljs-attr">interval:</span> <span class="hljs-string">1s</span>
    <span class="hljs-attr">conditions:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">severity_number</span> <span class="hljs-string">&gt;=</span> <span class="hljs-string">SEVERITY_NUMBER_ERROR</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes["log.type"]</span> <span class="hljs-string">==</span> <span class="hljs-string">"connection"</span>
    <span class="hljs-attr">exclude_fields:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes.request_id</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">attributes.timestamp</span>

<span class="hljs-attr">exporters:</span>
  <span class="hljs-attr">debug:</span>

<span class="hljs-attr">service:</span>
  <span class="hljs-attr">pipelines:</span>
    <span class="hljs-attr">logs:</span>
      <span class="hljs-attr">receivers:</span> [<span class="hljs-string">otlp</span>]
      <span class="hljs-attr">processors:</span> [<span class="hljs-string">logdedup</span>]
      <span class="hljs-attr">exporters:</span> [<span class="hljs-string">debug</span>]
</code></pre>
<h2 id="heading-testing-with-telemetrygen">Testing with telemetrygen</h2>
<p>To test this configuration locally, use telemetrygen to generate connection error logs:</p>
<pre><code class="lang-bash">telemetrygen logs \
  --otlp-insecure \
  --logs 100 \
  --rate 10 \
  --severity-text ERROR \
  --severity-number 17 \
  --body <span class="hljs-string">"Connection refused: failed to connect to database at 10.0.0.5:5432"</span> \
  --telemetry-attributes <span class="hljs-string">'log.type="connection"'</span> \
  --telemetry-attributes <span class="hljs-string">'service.name="order-service"'</span> \
  --telemetry-attributes <span class="hljs-string">'db.system="postgresql"'</span>
</code></pre>
<p>This generates 100 logs at 10 per second, all with ERROR severity and the <code>log.type=connection</code> attribute that triggers deduplication. After a few seconds, you should see a few log entries with <code>log_count: N</code> in your backend instead of 100 separate entries.</p>
<h2 id="heading-tradeoffs-and-considerations">Tradeoffs and considerations</h2>
<p>The log deduplication processor introduces latency equal to your interval setting. Logs are held until the interval expires before being forwarded. For most use cases, a 1-second delay is acceptable, but real-time alerting systems may need adjustment.</p>
<p>For compliance-critical logs where every occurrence must be preserved with its original timestamp, skip deduplication entirely. Audit logs, security events, and regulatory records often require complete fidelity.</p>
<p>The tradeoff is straightforward: reduced storage and clearer signal at the cost of slight delay and losing individual timestamps. For high-volume repetitive logs, that tradeoff is usually worth it.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>The log deduplication processor provides a practical solution to the noise problem in modern logging pipelines. By aggregating identical entries while preserving frequency information, you can dramatically reduce storage costs and improve signal clarity without sacrificing observability.</p>
<p>Combined with other OpenTelemetry Collector processors like filtering and sampling, log deduplication gives you fine-grained control over your telemetry pipeline. The result is a logging system that captures what matters while discarding the noise.</p>
]]></content:encoded></item><item><title><![CDATA[What 10,000 Slack Messages Reveal About OpenTelemetry Adoption Challenges]]></title><description><![CDATA[The OpenTelemetry community has grown tremendously over the past few years, and
with that growth comes valuable insights hidden in our community conversations.
We analyzed nearly 10,000 messages from the #otel-collector and
#opentelemetry Slack chann...]]></description><link>https://blog.olly.garden/what-10000-slack-messages-reveal-about-opentelemetry-adoption-challenges</link><guid isPermaLink="true">https://blog.olly.garden/what-10000-slack-messages-reveal-about-opentelemetry-adoption-challenges</guid><category><![CDATA[OpenTelemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Tue, 06 Jan 2026 14:27:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767628658449/3c6da5d0-0e18-4f07-b81c-783487852480.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The OpenTelemetry community has grown tremendously over the past few years, and
with that growth comes valuable insights hidden in our community conversations.
We analyzed nearly 10,000 messages from the <code>#otel-collector</code> and
<code>#opentelemetry</code> Slack channels spanning from May 2019 to December 2025 to understand
what challenges users face most often, which components generate the most
discussion, and where the community might need additional documentation or
tooling improvements.</p>
<h2 id="heading-the-dataset">The Dataset</h2>
<p>Our analysis covered 9,966 messages across two of the most active OpenTelemetry
Slack channels:</p>
<ul>
<li><strong>#otel-collector</strong>: 5,570 messages (56%)</li>
<li><strong>#opentelemetry</strong>: 4,396 messages (44%)</li>
</ul>
<p>These messages break down into several categories:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Category</th><th>Percentage</th></tr>
</thead>
<tbody>
<tr>
<td>Questions</td><td>46.7%</td></tr>
<tr>
<td>Error Reports</td><td>25.9%</td></tr>
<tr>
<td>Discussions</td><td>23.3%</td></tr>
<tr>
<td>Configuration</td><td>3.0%</td></tr>
<tr>
<td>Help Responses</td><td>1.0%</td></tr>
</tbody>
</table>
</div><p>The high proportion of questions and error reports (over 72% combined) tells us
that these channels serve as critical support resources for the community, and
the topics that appear most frequently represent real adoption challenges.</p>
<p>We applied topic modeling using BERTopic to cluster similar messages, then
analyzed sentiment and frustration indicators to identify which topics cause
the most difficulty. Messages containing error reports, repeated requests for
help, or expressions of confusion scored higher on our frustration metric.</p>
<h2 id="heading-most-discussed-collector-components">Most Discussed Collector Components</h2>
<p>Topic modeling revealed clear patterns in which Collector components generate
the most community discussion. Here are the top components by message volume:</p>
<h3 id="heading-1-prometheus-receiver-and-exporter-498-messages-50">1. Prometheus Receiver and Exporter (498 messages, 5.0%)</h3>
<p>Prometheus integration dominates community discussions. Users frequently ask
about:</p>
<ul>
<li>Configuring the Prometheus receiver to scrape metrics</li>
<li>Setting up the Prometheus remote write exporter</li>
<li>Understanding metric type and metadata preservation across the pipeline</li>
<li>Integrating with existing Prometheus infrastructure</li>
</ul>
<p>This makes sense given Prometheus's widespread adoption. Many organizations
start their OpenTelemetry journey by wanting to integrate with or migrate from
existing Prometheus setups. The remote write exporter in particular sees heavy
use, as it allows teams to continue using Prometheus as a storage backend while
adopting OpenTelemetry for collection and processing.</p>
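<p>A minimal sketch of this hybrid setup, scraping an existing Prometheus endpoint and writing to a Prometheus-compatible backend (the job name, target, and endpoint are placeholders):</p>
<pre><code class="lang-yaml">receivers:
  prometheus:
    config:
      # Standard Prometheus scrape configuration, embedded in the receiver
      scrape_configs:
        - job_name: "app"
          scrape_interval: 30s
          static_configs:
            - targets: ["app:9090"]

exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus.example.com/api/v1/write"
</code></pre>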
<h3 id="heading-2-k8sattributes-processor-258-messages-26">2. k8sattributes Processor (258 messages, 2.6%)</h3>
<p>Kubernetes metadata enrichment is the second most discussed topic. Common
challenges include:</p>
<ul>
<li>Pod association and metadata extraction in DaemonSet deployments</li>
<li>RBAC permissions for accessing the Kubernetes API</li>
<li>Performance implications in large clusters</li>
<li>Interaction with the kubeletstats receiver</li>
</ul>
<p>The complexity of Kubernetes environments and the desire for rich metadata
context makes this processor essential but sometimes tricky to configure
correctly. Users often discover that running the Collector as a DaemonSet
requires different pod association rules than running it as a gateway, leading
to troubleshooting cycles that could be avoided with clearer guidance.</p>
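<p>A typical DaemonSet-oriented configuration associates telemetry with pods by IP first, falling back to the incoming connection (the extracted metadata list is just an example set):</p>
<pre><code class="lang-yaml">processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
    pod_association:
      # Try the pod IP recorded as a resource attribute first
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      # Fall back to the address of the incoming connection
      - sources:
          - from: connection
</code></pre>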
<h3 id="heading-3-tail-sampling-processor-167-messages-17">3. Tail Sampling Processor (167 messages, 1.7%)</h3>
<p>Tail-based sampling generates significant discussion, often with a higher
frustration level than other topics. Users struggle with:</p>
<ul>
<li>Policy configuration and interaction between multiple policies</li>
<li>Stateful sampling across distributed services</li>
<li>Head sampling vs. tail sampling trade-offs</li>
<li>Debugging why traces are or aren't being sampled</li>
<li>Understanding the decision wait period and its impact on latency</li>
</ul>
<p>The stateful nature of tail sampling, which requires collecting all spans of a
trace before making a decision, adds operational complexity that head sampling
avoids. Many teams end up running both approaches, using head sampling at the
SDK level for baseline reduction and tail sampling in the Collector for
intelligent retention of interesting traces.</p>
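<p>A common two-policy sketch keeps every error trace and a percentage of the rest; policies are evaluated independently, and a trace is kept if any policy matches (the percentage and wait time are illustrative):</p>
<pre><code class="lang-yaml">processors:
  tail_sampling:
    # How long to buffer spans before making a per-trace decision
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
</code></pre>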
<h3 id="heading-4-kafka-receiver-and-exporter-131-messages-13">4. Kafka Receiver and Exporter (131 messages, 1.3%)</h3>
<p>Kafka integration appears frequently, particularly around:</p>
<ul>
<li>Connection and authentication issues with managed Kafka services (AWS MSK)</li>
<li>Topic configuration and consumer group management</li>
<li>Message format and serialization</li>
<li>High-availability deployment patterns</li>
</ul>
<h3 id="heading-5-memory-limiter-processor-125-messages-13">5. Memory Limiter Processor (125 messages, 1.3%)</h3>
<p>Resource management is a consistent concern:</p>
<ul>
<li>Proper memory limit configuration relative to container limits</li>
<li>GOMEMLIMIT interaction with the memory limiter</li>
<li>Debugging memory spikes and OOM situations</li>
<li>CPU usage profiling with pprof</li>
</ul>
<p>Understanding the relationship between Go's memory management, container
limits, and the memory limiter processor requires knowledge that spans multiple
domains. The recent addition of <code>GOMEMLIMIT</code> support has helped, but users
still need guidance on proper configuration for their specific deployment
scenarios.</p>
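<p>A starting-point configuration, with illustrative limits that should stay comfortably below the container's memory limit (the processor is conventionally placed first in each pipeline):</p>
<pre><code class="lang-yaml">processors:
  memory_limiter:
    check_interval: 1s
    # Hard limit; keep this below the container memory limit
    limit_mib: 400
    # Soft limit is limit_mib - spike_limit_mib (300 MiB here);
    # above it, the processor starts refusing incoming data
    spike_limit_mib: 100
</code></pre>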
<h2 id="heading-top-10-problem-areas-and-pain-points">Top 10 Problem Areas and Pain Points</h2>
<p>Beyond component-specific discussions, our frustration analysis identified the
topics that cause the most difficulty for users. These represent areas where
improved documentation, better error messages, or tooling enhancements could
have the highest impact.</p>
<h3 id="heading-1-connection-and-export-failures">1. Connection and Export Failures</h3>
<p>The most frustrating experiences relate to OTLP export failures, particularly:</p>
<ul>
<li><code>DEADLINE_EXCEEDED</code> errors when exporting to backends</li>
<li>TLS configuration issues with load balancers</li>
<li>gRPC vs. HTTP protocol confusion</li>
<li>Connectivity issues behind proxies or in cloud environments</li>
</ul>
<h3 id="heading-2-custom-collector-distributions">2. Custom Collector Distributions</h3>
<p>Building custom distributions with <code>ocb</code> (OpenTelemetry Collector Builder)
generates significant frustration:</p>
<ul>
<li>Version conflicts between components</li>
<li>Build failures on specific platforms (Windows MSI notably painful)</li>
<li>Dependency resolution issues</li>
<li>Understanding which components to include</li>
</ul>
<h3 id="heading-3-configuration-syntax-and-validation">3. Configuration Syntax and Validation</h3>
<p>Many users struggle with basic configuration:</p>
<ul>
<li>YAML syntax errors that produce cryptic error messages</li>
<li>Understanding the relationship between receivers, processors, and exporters</li>
<li>Pipeline configuration and data flow</li>
<li>Environment variable substitution syntax</li>
</ul>
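<p>The pieces connect in the <code>service.pipelines</code> section; this minimal sketch (values are placeholders) also shows the <code>${env:VAR}</code> substitution syntax that trips people up:</p>
<pre><code class="lang-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp:
    # Resolved from the environment at startup.
    endpoint: ${env:BACKEND_ENDPOINT}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
</code></pre>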
<h3 id="heading-4-context-propagation">4. Context Propagation</h3>
<p>Distributed tracing fundamentals cause confusion:</p>
<ul>
<li>B3 vs. W3C trace context formats</li>
<li>Baggage propagation across service boundaries</li>
<li>Extract and inject operations in SDKs</li>
<li>Cross-language propagation issues</li>
</ul>
<h3 id="heading-5-attribute-and-resource-management">5. Attribute and Resource Management</h3>
<p>Understanding the data model proves challenging:</p>
<ul>
<li>When to use resource attributes vs. span/metric/log attributes</li>
<li>Moving attributes between resource and signal levels</li>
<li>Semantic conventions compliance</li>
<li>Attribute cardinality and its impact</li>
</ul>
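<p>One concrete tool for moving attributes between levels is the <code>groupbyattrs</code> processor; the sketch below (attribute names are illustrative) promotes the listed record-level attributes to the resource:</p>
<pre><code class="lang-yaml">processors:
  groupbyattrs:
    # Records sharing these attribute values are grouped under one
    # resource, and the attributes move to the resource level.
    keys:
      - host.name
      - k8s.pod.name
</code></pre>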
<h3 id="heading-6-ottl-opentelemetry-transformation-language">6. OTTL (OpenTelemetry Transformation Language)</h3>
<p>While powerful, OTTL generates confusion:</p>
<ul>
<li>Function syntax and available operations</li>
<li>Context-specific paths and accessors</li>
<li>Debugging transformation failures</li>
<li>Performance implications of complex transforms</li>
</ul>
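<p>For illustration (attribute names are made up), here is a small transform processor configuration; <code>error_mode: ignore</code> keeps a single failing statement from dropping data, which makes debugging less punishing:</p>
<pre><code class="lang-yaml">processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Cap runaway attribute values.
          - truncate_all(attributes, 4096)
          # Backfill a default only where the attribute is missing.
          - set(attributes["deployment.environment"], "unknown") where attributes["deployment.environment"] == nil
</code></pre>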
<h3 id="heading-7-kubernetes-operator-and-auto-instrumentation">7. Kubernetes Operator and Auto-Instrumentation</h3>
<p>The Operator simplifies deployment but introduces its own challenges:</p>
<ul>
<li>Instrumentation injection not working as expected</li>
<li>Multiple collector deployment modes (DaemonSet vs. Sidecar vs. Deployment)</li>
<li>CRD configuration options</li>
<li>Troubleshooting injected agents</li>
</ul>
<h3 id="heading-8-backend-integration">8. Backend Integration</h3>
<p>Connecting to observability backends requires effort:</p>
<ul>
<li>Jaeger configuration and migration from legacy setups</li>
<li>Vendor-specific exporter configuration</li>
<li>Authentication and authorization with managed services</li>
<li>Multi-backend routing</li>
</ul>
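<p>For the multi-backend case, the simplest pattern is listing several exporters in one pipeline, which fans the same data out to each; names and endpoints below are illustrative:</p>
<pre><code class="lang-yaml">exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
  otlphttp/vendor:
    endpoint: https://api.vendor.example.com
    headers:
      # Authenticate to the managed backend via an env var.
      api-key: ${env:VENDOR_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, otlphttp/vendor]
</code></pre>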
<h3 id="heading-9-docker-and-container-deployment">9. Docker and Container Deployment</h3>
<p>Container-related issues appear regularly:</p>
<ul>
<li>Image selection (contrib vs. core)</li>
<li>Version availability on Docker Hub</li>
<li>Custom image building</li>
<li>Resource limits and performance tuning</li>
</ul>
<h3 id="heading-10-queue-and-retry-behavior">10. Queue and Retry Behavior</h3>
<p>Understanding the exporter helper's behavior:</p>
<ul>
<li>Persistent queue configuration and storage</li>
<li>Retry policies and backoff behavior</li>
<li>Data loss scenarios and prevention</li>
<li>Queue sizing for high-volume deployments</li>
</ul>
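<p>A hedged sketch of the exporter helper settings, with a persistent queue backed by the <code>file_storage</code> extension (sizes are illustrative and need tuning to your volume):</p>
<pre><code class="lang-yaml">extensions:
  file_storage:
    directory: /var/lib/otelcol/queue

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      # Persist the queue to disk so it survives a restart.
      storage: file_storage
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      # After this much elapsed retry time, the batch is dropped.
      max_elapsed_time: 300s

service:
  extensions: [file_storage]
</code></pre>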
<h2 id="heading-what-this-tells-us">What This Tells Us</h2>
<p>Several themes emerge from this analysis:</p>
<p><strong>The Prometheus ecosystem remains central.</strong> Organizations aren't abandoning
Prometheus; they're integrating it with OpenTelemetry. Documentation and
tooling that bridges these ecosystems will continue to be valuable.</p>
<p><strong>Kubernetes complexity compounds OTel complexity.</strong> The k8sattributes
processor and Operator discussions show that Kubernetes environments introduce
additional layers of configuration and troubleshooting. Simplified deployment
patterns and better defaults could help.</p>
<p><strong>Sampling is conceptually difficult.</strong> Tail sampling, despite being well
documented, generates ongoing confusion. Interactive tools or visualization of
sampling decisions might help users understand and debug their configurations.</p>
<p><strong>Error messages need improvement.</strong> Many frustration-heavy discussions start
with a cryptic error message. Investing in actionable error messages with
suggested fixes would significantly improve the user experience.</p>
<p><strong>The gap between "getting started" and "production ready" is real.</strong> Basic
tutorials work, but scaling to production with proper memory limits, persistent
queues, and multi-backend routing requires significant learning.</p>
<h2 id="heading-moving-forward">Moving Forward</h2>
<p>We hope this analysis helps maintainers and SIGs identify areas where
documentation improvements would have the highest impact. The data clearly
shows that certain topics, particularly around configuration patterns, sampling
strategies, and multi-backend deployments, generate recurring questions that
better guides could address.</p>
<p>On my end, I have lined up a series of articles that tackle some of these pain
points directly, covering topics like decomposing Collector configuration files
into manageable pieces, routing telemetry to multiple backends based on tenant
or environment, and building effective tail sampling strategies.</p>
<h2 id="heading-acknowledgments">Acknowledgments</h2>
<p>Thank you to everyone who participates in the OpenTelemetry Slack community.
Your questions, error reports, and discussions not only help fellow users but
also provide valuable signal for where the project can improve. A special
thanks to the community members who take time to answer questions and share
their experiences - the 1% of help responses in our data represent countless
hours of volunteer effort that makes this community welcoming for newcomers.</p>
<hr />
<p><em>This analysis used topic modeling and sentiment analysis on publicly available
Slack messages. Individual messages were aggregated into topics; no personally
identifiable information was used in this report.</em></p>
]]></content:encoded></item><item><title><![CDATA[Meet Rose: OllyGarden's AI Instrumentation Agent]]></title><description><![CDATA[Imagine the perfect observability world: There is an incident, the on-call team gets paged in the middle of the night, wakes up and thanks to your telemetry, the root-cause is identified within just a]]></description><link>https://blog.olly.garden/meet-rose-ollygardens-ai-instrumentation-agent</link><guid isPermaLink="true">https://blog.olly.garden/meet-rose-ollygardens-ai-instrumentation-agent</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Nicolas Wörner]]></dc:creator><pubDate>Wed, 29 Oct 2025 12:00:32 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/67e562b72417211a3624884f/9f886af7-7fec-4ee4-9397-7236a8f010cb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Imagine the perfect observability world: There is an incident, the on-call team gets paged in the middle of the night, wakes up and thanks to your telemetry, the root-cause is identified within just a few minutes. Telemetry is produced without sensitive data or passwords and you only pay for what you <em>actually</em> need. Dashboards aren't broken or inconsistent and you have confidence in your data.</p>
<p>Unfortunately, this perfect observability world is often a utopia, and the harsh reality looks different. But why is that, and what can we do to bring today's reality closer to it?</p>
<p>Modern tools and instrumentation approaches have made it easier than ever to collect telemetry data from applications. You press the "easy button" and, thanks to powerful tools like eBPF or auto-instrumentation agents, a lot of data magically appears in the observability backend. In almost no time, the whole system is suddenly instrumented. Sounds great, doesn't it? Where's the catch?</p>
<p>Such approaches are great for getting started quickly and have valid use cases for baseline visibility. However, as your applications scale, the amount of telemetry data produced grows significantly as well. That data is often low-quality, and at a certain point it becomes challenging (and expensive) to maintain and make sense of it.</p>
<p>To gain more control over telemetry, you can instrument applications manually. That gives you the power to capture exactly what is needed. Even better: telemetry quality can be enhanced with custom application- and business-specific attributes while guaranteeing that no sensitive data is produced. You only pay for what you need, and thanks to the reduced amount of low-signal data, relevant issues can be identified faster.</p>
<p>While the theory sounds promising, the reality is that manual instrumentation isn't trivial. Done right, it requires consistency across application boundaries, correct context propagation, OpenTelemetry-specific domain knowledge, and, most importantly, engineers who have the time and knowledge to maintain the instrumentation.</p>
<h2>Introducing Rose</h2>
<p>Today we are announcing the research preview of <a href="https://ollygarden.com/rose"><strong>OllyGarden Rose</strong></a>, our AI instrumentation agent. Rose integrates seamlessly into your development workflow as a <strong>GitHub</strong> app that analyzes OpenTelemetry instrumentation in pull requests, identifies pitfalls and suggests improvements to ensure consistent, high-quality telemetry practices. It’s designed to facilitate the manual instrumentation process by reducing engineering time, ensuring consistency, and providing clear guidance that builds confidence in your telemetry. At a later stage, OllyGarden Rose will be able to do assessments of the instrumentation quality of an entire code repository, or even install and perform an initial instrumentation on its own, guided by our knowledge about what’s good telemetry, as well as other external sources.</p>
<p><strong>Research preview launches October 29, 2025. Click</strong> <a href="https://ollygarden.com/rose"><strong>here</strong></a> <strong>to learn more.</strong></p>
<h3>Key Features</h3>
<h4>Context-Aware Analysis</h4>
<p>Rose understands your entire codebase, not just the diff. It knows your organization's telemetry patterns, recognizes which semantic conventions apply, and understands whether you're instrumenting an HTTP client, database call, or message queue. It provides guidance specific to your exact situation.</p>
<h4>OllyGarden Knowledge Base</h4>
<p>While general-purpose coding assistants understand OpenTelemetry SDK syntax, they often lack the depth to guide instrumentation across application boundaries and understand the why and what. Built on OllyGarden's expertise from years of contributing to OpenTelemetry, best-practices and industry standards are encoded into actionable rules and patterns that Rose applies automatically.</p>
<h4>OpenTelemetry Education</h4>
<p>Every comment Rose makes includes an explanation of why something matters, optionally a concrete code suggestion showing how to fix it, and links to relevant OpenTelemetry documentation. The goal isn’t just to fix issues, but to teach and share observability best practices with every pull request.</p>
<h2>Join the Research Preview</h2>
<p>Our mission at OllyGarden is to bring the reality closer to the perfect observability world, where it's easy to achieve and maintain high-quality telemetry data. In addition to the OllyGarden insights platform, Rose is another step towards that goal.</p>
<p>We'll provide Rose free of charge to selected participants during the research preview period. In return, we expect participants to provide feedback and access to the target source code repository (the one to be instrumented), so that we can analyze what worked and what didn't.</p>
<p>NDAs are available for organizations with security requirements. Our goal is to learn from real-world code out there, with any level of instrumentation. We're especially interested in teams already using or planning to adopt manual instrumentation practices.</p>
<p><strong>Ready to participate?</strong> <a href="https://ollygarden.com/rose">Contact us</a> to join the research preview for free.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing OllyGarden Tulip: Our Open-Source Distribution of the OpenTelemetry Collector]]></title><description><![CDATA[TL;DR: We're launching OllyGarden Tulip, a commercially supported OpenTelemetry Collector distribution with stable releases, predictable upgrade paths, and professional support from the people who helped build the Collector. It's open source and free...]]></description><link>https://blog.olly.garden/introducing-tulip-supported-otel-collector</link><guid isPermaLink="true">https://blog.olly.garden/introducing-tulip-supported-otel-collector</guid><category><![CDATA[opentelemetry collector]]></category><category><![CDATA[OpenTelemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Thu, 16 Oct 2025 07:00:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760365384730/d7551537-6e3d-44d0-96a9-12570420f40d.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR:</strong> We're launching <a target="_blank" href="https://olly.garden/tulip">OllyGarden Tulip</a>, a commercially supported OpenTelemetry Collector distribution with stable releases, predictable upgrade paths, and professional support from the people who helped build the Collector. It's open source and free to use, with optional commercial support for production deployments. Quarterly releases start with v25.11, with LTS releases every 18 months.</p>
<hr />
<p>Back in 2019, I encountered something that would shape the next six years of my career: the OpenCensus Service. This unassuming piece of infrastructure could receive telemetry data in one format (OpenCensus, Zipkin) and export it in another, like Jaeger. Simple, elegant, powerful.</p>
<p>When OpenCensus merged with OpenTracing, the service became OpenTelemetry Service. Coming from the Jaeger world, I immediately saw a naming problem: having an "OpenTelemetry Service" that functioned like Jaeger Collector would confuse everyone. In my first SIG calls, I suggested renaming it to OpenTelemetry Collector. The response? "Too much legacy already. People know it by this name."</p>
<p>I was wrong about the timing, but right about the need. Eventually, the community came around, and OpenTelemetry Collector was born.</p>
<h2 id="heading-helping-build-the-collector-ecosystem">Helping Build the Collector Ecosystem</h2>
<p>That early misstep didn't discourage me. Instead, it pulled me deeper into the community. Over the following years, I implemented authentication support and the first auth mechanism, built the load balancing exporter, created the OpenTelemetry Collector Builder (ocb), developed the OpenTelemetry Operator, and maintained the tail sampling processor. I gave conference talks at events worldwide.</p>
<p>More importantly, I talked to users. Hundreds of conversations about their deployments, their challenges, their workarounds. I addressed concerns where I could, but some problems were too big for a single engineer, even one working at a large organization.</p>
<h2 id="heading-the-problems-i-couldnt-solve-alone">The Problems I Couldn't Solve Alone</h2>
<p>The requests came consistently, almost predictably: "Can we get commercial support for the Collector?"</p>
<p>Some users had backend vendors who offered Collector support, but only as long as they remained customers. Want to migrate to a different backend? Your Collector support disappears. Companies building custom Collector distributions with proprietary components were left entirely on their own.</p>
<p>It was painful. Passionate users consuming my code and projects, and I couldn't offer them the support they needed.</p>
<p>The technical challenges were equally frustrating. Teams stuck on ancient Collector versions because upgrades broke their dashboards and alerts when internal telemetry metrics changed. Organizations forced to update configurations for unrelated components just to consume a critical bug fix. Custom distribution maintainers struggling to keep pace with upstream changes while managing their own components.</p>
<p>These weren't edge cases. These were real operational pain points affecting production systems at scale.</p>
<h2 id="heading-introducing-ollygarden-tulip">Introducing OllyGarden Tulip</h2>
<p>Today, we're changing that. I'm excited to announce <a target="_blank" href="https://olly.garden/tulip"><strong>OllyGarden Tulip</strong></a>, a commercially supported OpenTelemetry Collector distribution that solves the problems I've heard about for years.</p>
<p>Tulip provides the stability guarantees, predictable release cycles, and professional support that production systems deserve. It's an open source distribution built using ocb, the same tool I created for the community. You can use it freely, extend it, and build on it. And when you need support, we're here with the deep expertise that comes from years of building and maintaining the Collector itself.</p>
<p>This isn't just another distribution. It's the support offering that Collector users have been asking for, delivered by the people who know this codebase intimately.</p>
<h2 id="heading-why-ollygarden-tulip-exists">Why OllyGarden Tulip Exists</h2>
<p>When Yuri and I founded OllyGarden at the beginning of this year, our focus was clear: give observability engineers superpowers to understand what's good and what's bad about their telemetry through our Insights platform. That remains our core mission, and we're making significant progress there.</p>
<p>But as we've built OllyGarden, I've kept hearing the same pains from Collector users that I've witnessed for years. These aren't problems we can ignore, and they're problems we can solve right now. So we're accelerating our plans and launching <strong>OllyGarden Tulip</strong> today, a commercially supported OpenTelemetry Collector distribution built specifically to address the support and stability challenges that production teams face every day.</p>
<h2 id="heading-what-makes-tulip-different">What Makes Tulip Different</h2>
<p>Tulip provides stability guarantees that match real-world needs. Need a critical bug fix without updating every component? We've got you covered. Upgraded to the latest version and experiencing unexpected performance issues? Throw it at us. Want predictable release cycles that align with your planning? We deliver.</p>
<p>Our approach combines flexibility with reliability. We provide quarterly releases tracking upstream versions, starting with v25.11 (November 2025). Every 18 months, we'll release an LTS version, with the first likely at v26.5. The distribution itself is open source, built using ocb. You can use the binaries or container images for free. Need components we don't support yet? Fork our repository and add them. We'll still support the ones in our manifest. Need commercial support? We're here for you.</p>
<h2 id="heading-built-on-open-source-backed-by-experience">Built on Open Source, Backed by Experience</h2>
<p>OllyGarden Tulip isn't a fork or a proprietary reimagining. It's an open source distribution of the Collector built using the same tools I created, specifically ocb. You can use it freely. You can extend it. You can build on it.</p>
<p>What we're offering is something the community has asked for repeatedly: stable, professional support from people who know this codebase intimately.</p>
<p>I haven't been as involved in the OpenTelemetry Collector community since January. I've been focused on building our products. But Tulip brings us closer again. More importantly, it provides a support offering that our users deserve.</p>
<h2 id="heading-who-this-is-for">Who This Is For</h2>
<p>You should consider OllyGarden Tulip if you run the OpenTelemetry Collector in production and need reliable support, if you build custom Collector distributions and want stable upstream compatibility, if you need predictable upgrade paths that won't break your observability infrastructure, if you want to decouple your Collector support from your backend vendor relationship, or if you value stability and professional support over bleeding-edge features.</p>
<h2 id="heading-getting-started">Getting Started</h2>
<p>OllyGarden Tulip is available now. Our open source manifest and container images are free to use in any environment. For commercial support, contact us to discuss your needs. Visit our documentation site for implementation guides and resources.</p>
<p>We're starting this journey with the v25.11 release, and we're committed to the long-term stability that production systems require.</p>
<h2 id="heading-a-personal-note">A Personal Note</h2>
<p>For six years, I've watched the OpenTelemetry Collector grow from an experimental service to critical infrastructure powering observability at organizations worldwide. I've celebrated its successes and felt the pain of its operational challenges.</p>
<p>OllyGarden Tulip represents my commitment to the users who've trusted my code over the years. You've built incredible things. You deserve support that matches your ambition.</p>
<p>Let's build something reliable together.</p>
<hr />
<p><strong>Ready to learn more?</strong> Visit our documentation or contact us to discuss commercial support options.</p>
]]></content:encoded></item><item><title><![CDATA[The Variability Principle: How to Decide What Deserves a Span]]></title><description><![CDATA[Every team discovers OpenTelemetry the same way. First, excitement—finally, visibility into distributed systems! Then comes the instrumentation party. Spans everywhere. Every function. Every validation. Every calculation gets its own span because "mo...]]></description><link>https://blog.olly.garden/what-deserves-a-span</link><guid isPermaLink="true">https://blog.olly.garden/what-deserves-a-span</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[tracing]]></category><dc:creator><![CDATA[Jakub Mikłasz]]></dc:creator><pubDate>Mon, 06 Oct 2025 08:53:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759430553963/e8406985-0d26-44ab-ac5d-a411fd243db7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every team discovers OpenTelemetry the same way. First, excitement—finally, visibility into distributed systems! Then comes the instrumentation party. Spans everywhere. Every function. Every validation. Every calculation gets its own span because "more data is better," right?</p>
<p>Three months later, you're staring at a trace with 500 spans trying to figure out why a simple API call took 3 seconds. Your observability bill has grown 10x. And your engineers have given up on traces entirely because they're impossible to read.</p>
<p>There's a better way.</p>
<h2 id="heading-the-problem-span-explosion"><strong>The Problem: Span Explosion</strong></h2>
<p>Most teams create spans like this:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">ProcessPayment</span><span class="hljs-params">(ctx context.Context, payment Payment)</span> <span class="hljs-title">error</span></span> {
    ctx, span := tracer.Start(ctx, <span class="hljs-string">"process payment"</span>)
    <span class="hljs-keyword">defer</span> span.End()

    validateAmount(ctx, payment.Amount)      <span class="hljs-comment">// Another span</span>
    validateCard(ctx, payment.CardNumber)    <span class="hljs-comment">// Another span</span>
    calculateFees(ctx, payment.Amount)       <span class="hljs-comment">// Another span</span>
    formatCurrency(ctx, payment.Total)       <span class="hljs-comment">// Another span</span>
    <span class="hljs-comment">// ... 10 more spans for trivial operations</span>
}
</code></pre>
<p>At 10,000 requests per minute with 15 spans each, you're generating 6.5 billion spans per month. At $0.20 per million spans, that's $1,300 monthly just for payment processing traces.</p>
<p>But cost isn't the real problem. The real problem is that your traces become unreadable. When everything has a span, nothing stands out. Signal drowns in noise.</p>
<h2 id="heading-the-variability-principle-your-new-mental-model"><strong>The Variability Principle: Your New Mental Model</strong></h2>
<p>Here's the principle that changed everything for us:</p>
<blockquote>
<p><strong>"Is this operation unpredictable?"</strong></p>
</blockquote>
<p>If yes, create a span. If no, don't.</p>
<p>This simple question cuts through all the complexity. It's not about operation importance or business value—it's about performance predictability.</p>
<h3 id="heading-unpredictable-create-a-span"><strong>Unpredictable = Create a Span</strong></h3>
<p>Operations with unpredictable performance need spans:</p>
<ul>
<li><p><strong>Database queries</strong>: Could take 5ms or 5 seconds depending on locks, data size, indexes</p>
</li>
<li><p><strong>HTTP calls</strong>: Network latency, retries, timeouts are all variable</p>
</li>
<li><p><strong>External APIs</strong>: You don't control their performance</p>
</li>
<li><p><strong>Message queues</strong>: Depends on queue depth, consumer availability</p>
</li>
<li><p><strong>Cache operations</strong>: Network round-trip to Redis/Memcached</p>
</li>
<li><p><strong>File I/O</strong>: Disk performance varies, especially with network storage</p>
</li>
</ul>
<p>These operations can surprise you. When they're slow, you need to know.</p>
<h3 id="heading-predictable-skip-the-span"><strong>Predictable = Skip the Span</strong></h3>
<p>Operations with predictable performance don't need spans:</p>
<ul>
<li><p><strong>Validation logic</strong>: Checking if a string contains "@" is always microseconds</p>
</li>
<li><p><strong>Math calculations</strong>: CPU-bound operations are consistent</p>
</li>
<li><p><strong>Data transformation</strong>: Mapping objects in memory is deterministic</p>
</li>
<li><p><strong>String formatting</strong>: Always fast, never the problem</p>
</li>
<li><p><strong>Getters/setters</strong>: Not worth measuring</p>
</li>
</ul>
<p>These operations can't surprise you. They're never the bottleneck.</p>
<h2 id="heading-the-pattern-in-practice"><strong>The Pattern in Practice</strong></h2>
<p>Let's refactor that payment processing:</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">ProcessPayment</span><span class="hljs-params">(ctx context.Context, payment Payment)</span></span> {
    ctx, span := tracer.Start(ctx, <span class="hljs-string">"process payment"</span>)
    <span class="hljs-keyword">defer</span> span.End()

    <span class="hljs-comment">// Add context as attributes, not spans</span>
    span.SetAttributes(
        attribute.Float64(<span class="hljs-string">"payment.amount"</span>, payment.Amount),
        attribute.String(<span class="hljs-string">"payment.currency"</span>, payment.Currency),
    )

    <span class="hljs-comment">// Validation is predictable - no span needed</span>
    <span class="hljs-keyword">if</span> payment.Amount &lt;= <span class="hljs-number">0</span> || !isValidCard(payment.CardNumber) {
        span.RecordError(errors.New(<span class="hljs-string">"invalid payment"</span>))
        <span class="hljs-keyword">return</span>
    }

    <span class="hljs-comment">// Database operation is unpredictable - needs a span</span>
    ctx, dbSpan := tracer.Start(ctx, <span class="hljs-string">"INSERT payments"</span>)
    dbSpan.SetAttributes(
        attribute.String(<span class="hljs-string">"db.system"</span>, <span class="hljs-string">"postgresql"</span>),
        attribute.String(<span class="hljs-string">"db.collection.name"</span>, <span class="hljs-string">"payments"</span>),
        attribute.String(<span class="hljs-string">"db.operation.name"</span>, <span class="hljs-string">"INSERT"</span>),
    )
    db.SavePayment(ctx, payment)
    dbSpan.End()

    <span class="hljs-comment">// External API is unpredictable - needs a span</span>
    ctx, chargeSpan := tracer.Start(ctx, <span class="hljs-string">"charge card"</span>)
    paymentGateway.Charge(ctx, payment)
    chargeSpan.End()
}
</code></pre>
<p>Result: 3 spans instead of 15. Traces are readable. Engineers can actually find problems.</p>
<h2 id="heading-what-to-use-instead-of-spans"><strong>What to Use Instead of Spans</strong></h2>
<p>When you skip creating a span, you still need to capture information. That's where attributes and events come in.</p>
<h3 id="heading-attributes-context-without-cost"><strong>Attributes: Context Without Cost</strong></h3>
<p>Attributes add metadata to existing spans. They're perfect for:</p>
<ul>
<li><p>Request/response data (user ID, order total, currency)</p>
</li>
<li><p>Configuration values (retry count, timeout settings)</p>
</li>
<li><p>Business context (customer tier, feature flags)</p>
</li>
</ul>
<pre><code class="lang-go">span.SetAttributes(
    attribute.String(<span class="hljs-string">"user.id"</span>, userID),
    attribute.Float64(<span class="hljs-string">"order.total"</span>, <span class="hljs-number">157.46</span>),
    attribute.Bool(<span class="hljs-string">"cache.hit"</span>, <span class="hljs-literal">true</span>),
)
</code></pre>
<p>Attributes are indexed and searchable. They let you filter traces without creating separate spans.</p>
<h3 id="heading-events-milestones-in-time"><strong>Events: Milestones in Time</strong></h3>
<p>Events mark important moments within a span's lifecycle. They're perfect for:</p>
<ul>
<li><p>Validation checkpoints</p>
</li>
<li><p>State transitions</p>
</li>
<li><p>Progress markers in loops</p>
</li>
</ul>
<pre><code class="lang-go"><span class="hljs-comment">// Mark validation completion</span>
span.AddEvent(<span class="hljs-string">"validation completed"</span>)

<span class="hljs-comment">// Track calculation results</span>
span.AddEvent(<span class="hljs-string">"total calculated"</span>,
    trace.WithAttributes(
        attribute.Int(<span class="hljs-string">"line_items.count"</span>, <span class="hljs-number">4</span>),
        attribute.Float64(<span class="hljs-string">"total"</span>, <span class="hljs-number">157.46</span>),
    ))

<span class="hljs-comment">// Record state changes</span>
span.AddEvent(<span class="hljs-string">"payment saved"</span>)

<span class="hljs-comment">// Track retry attempts</span>
span.AddEvent(<span class="hljs-string">"retry attempt"</span>,
    trace.WithAttributes(
        attribute.Int(<span class="hljs-string">"attempt"</span>, <span class="hljs-number">3</span>),
        attribute.String(<span class="hljs-string">"reason"</span>, <span class="hljs-string">"timeout"</span>),
    ))
</code></pre>
<p>Events show you <em>when</em> something happened and provide rich context without the overhead of a full span. When debugging, they help you see the timeline of operations within your parent span.</p>
<h2 id="heading-the-decision-framework"><strong>The Decision Framework</strong></h2>
<p>Before creating any span, ask one question:</p>
<p><strong>"Is this operation unpredictable?"</strong></p>
<p>Yes → Create a span</p>
<p>No → Use attributes or events</p>
<p>That's it. This single question replaces complex decision trees and eliminates 80% of unnecessary spans.</p>
<h2 id="heading-remember-this"><strong>Remember This</strong></h2>
<p>Your traces should tell a story, not document every CPU cycle. Each span costs money, performance, and clarity.</p>
<p>Create spans only for operations that could surprise you. For everything else, there are attributes and events.</p>
<p>The best observability isn't about having all the data—it's about having the right data.</p>
]]></content:encoded></item><item><title><![CDATA[How to Name Your Metrics]]></title><description><![CDATA[Metrics are the quantitative backbone of observability—the numbers that tell us how our systems are performing. This is the third post in our OpenTelemetry naming series, where we've already explored how to name spans and how to enrich them with mean...]]></description><link>https://blog.olly.garden/how-to-name-your-metrics</link><guid isPermaLink="true">https://blog.olly.garden/how-to-name-your-metrics</guid><category><![CDATA[observability]]></category><category><![CDATA[telemetry]]></category><category><![CDATA[OpenTelemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Tue, 09 Sep 2025 22:00:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1759779166859/ee11b350-4fb4-4bd1-97ef-396f83a7a553.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Metrics are the quantitative backbone of observability—the numbers that tell us how our systems are performing. This is the third post in our OpenTelemetry naming series, where we've already explored <a target="_blank" href="how-to-name-your-spans">how to name spans</a> and how to enrich them with meaningful attributes. Now let's tackle the art of naming the measurements that matter.</p>
<p>Unlike spans that tell stories about what happened, metrics tell us about quantities: how many, how fast, how much. But here's the thing—naming them well is just as crucial as naming spans, and the principles we've learned apply here too. The "who" still belongs in attributes, not names.</p>
<h2 id="heading-learning-from-traditional-systems">Learning from Traditional Systems</h2>
<p>Before diving into OpenTelemetry best practices, let's examine how traditional monitoring systems handle metric naming. Take Kubernetes, for example. Its metrics follow patterns like:</p>
<ul>
<li><p><code>apiserver_request_total</code></p>
</li>
<li><p><code>scheduler_schedule_attempts_total</code></p>
</li>
<li><p><code>container_cpu_usage_seconds_total</code></p>
</li>
<li><p><code>kubelet_volume_stats_used_bytes</code></p>
</li>
</ul>
<p>Notice the pattern? <strong>Component name + resource + action + unit</strong>. The service or component name is baked right into the metric name. This approach made sense in simpler data models where you had limited options for storing context.</p>
<p>But this creates several problems:</p>
<ul>
<li><p><strong>Cluttered observability backend</strong>: Every component gets its own metric namespace, making it harder to find the right metric among dozens or hundreds of similarly-named metrics</p>
</li>
<li><p><strong>Inflexible aggregation</strong>: Can't easily sum metrics across different components</p>
</li>
<li><p><strong>Vendor lock-in</strong>: Metric names become tied to specific implementations</p>
</li>
<li><p><strong>Maintenance overhead</strong>: Adding new services requires new metric names</p>
</li>
</ul>
<h2 id="heading-the-core-anti-pattern-service-names-in-metric-names">The Core Anti-Pattern: Service Names in Metric Names</h2>
<p>Here's the most important principle for OpenTelemetry metrics: <strong>Don't include your service name in the metric name</strong>.</p>
<p>Let's say you have a payment service. You might be tempted to create metrics like:</p>
<ul>
<li><p><code>payment.transaction.count</code></p>
</li>
<li><p><code>payment.latency.p95</code></p>
</li>
<li><p><code>payment.error.rate</code></p>
</li>
</ul>
<p>Don't do this. The service name is already available as context through the <code>service.name</code> resource attribute. Instead, use:</p>
<ul>
<li><p><code>transaction.count</code> with <code>service.name=payment</code></p>
</li>
<li><p><code>http.server.request.duration</code> with <code>service.name=payment</code></p>
</li>
<li><p><code>error.rate</code> with <code>service.name=payment</code></p>
</li>
</ul>
<p>Why is this better? Because now you can easily aggregate across all services:</p>
<pre><code class="lang-plaintext">sum(transaction.count)  // All transactions across all services
sum(transaction.count{service.name="payment"})  // Just payment transactions
</code></pre>
<p>If every service had its own metric name, you'd need to know every service name to build meaningful dashboards. With clean names, one query works for everything.</p>
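<p>A toy illustration of why this matters for aggregation, with plain Python dictionaries standing in for metric points (not a real query engine or backend):</p>

```python
# Each point carries a clean metric name plus a service.name attribute.
points = [
    {"name": "transaction.count", "service.name": "payment",  "value": 120},
    {"name": "transaction.count", "service.name": "checkout", "value": 45},
    {"name": "transaction.count", "service.name": "refunds",  "value": 8},
]

# sum(transaction.count): one expression covers every service,
# including services that don't exist yet.
total = sum(p["value"] for p in points if p["name"] == "transaction.count")

# sum(transaction.count{service.name="payment"}): filter by attribute.
payment_total = sum(
    p["value"]
    for p in points
    if p["name"] == "transaction.count" and p["service.name"] == "payment"
)
```

If each service had baked its name into the metric (`payment_transaction_total`, `checkout_transaction_total`, …), the first query would instead be a hand-maintained list of metric names.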
<h2 id="heading-opentelemetrys-rich-context-model">OpenTelemetry's Rich Context Model</h2>
<p>OpenTelemetry metrics benefit from the same rich context model we discussed in our span attributes article. Instead of forcing everything into the metric name, we have multiple layers where context can live:</p>
<h3 id="heading-traditional-approach-prometheus-style">Traditional Approach (Prometheus style):</h3>
<pre><code class="lang-plaintext">payment_service_transaction_total{method="credit_card",status="success"}
user_service_auth_latency_milliseconds{endpoint="/login",region="us-east"}  
inventory_service_db_query_seconds{table="products",operation="select"}
</code></pre>
<h3 id="heading-opentelemetry-approach">OpenTelemetry Approach:</h3>
<pre><code class="lang-plaintext">transaction.count
- Resource: service.name=payment, service.version=1.2.3, deployment.environment.name=prod
- Scope: instrumentation.library.name=com.acme.payment, instrumentation.library.version=2.1.0
- Attributes: method=credit_card, status=success

auth.duration  
- Resource: service.name=user, service.version=2.0.1, deployment.environment.name=prod
- Scope: instrumentation.library.name=express.middleware
- Attributes: endpoint=/login, region=us-east
- Unit: ms

db.client.operation.duration
- Resource: service.name=inventory, service.version=1.5.2
- Scope: instrumentation.library.name=postgres.client  
- Attributes: db.sql.table=products, db.operation=select
- Unit: s
</code></pre>
<p>This three-layer separation follows the OpenTelemetry specification's <strong>Events → Metric Streams → Timeseries</strong> model, where context flows through multiple hierarchical levels rather than being crammed into names.</p>
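<p>The same layering can be written down as a plain data structure. This is a sketch of the conceptual model, not the OTLP wire format:</p>

```python
metric_stream = {
    "name": "transaction.count",
    "unit": "1",
    "resource": {            # who is emitting: stable per process
        "service.name": "payment",
        "service.version": "1.2.3",
        "deployment.environment.name": "prod",
    },
    "scope": {               # which instrumentation produced it
        "name": "com.acme.payment",
        "version": "2.1.0",
    },
    "attributes": {          # per-measurement context
        "method": "credit_card",
        "status": "success",
    },
}

# The name stays clean; every piece of context has a dedicated layer.
assert "payment" not in metric_stream["name"]
```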
<h2 id="heading-units-keep-them-out-of-names-too">Units: Keep Them Out of Names Too</h2>
<p>Just like we learned that service names don't belong in metric names, <strong>units don't belong there either</strong>.</p>
<p>Traditional systems often include units in the name because they lack proper unit metadata:</p>
<ul>
<li><p><code>response_time_milliseconds</code></p>
</li>
<li><p><code>memory_usage_bytes</code></p>
</li>
<li><p><code>throughput_requests_per_second</code></p>
</li>
</ul>
<p>OpenTelemetry treats units as metadata, separate from the name:</p>
<ul>
<li><p><code>http.server.request.duration</code> with unit <code>ms</code></p>
</li>
<li><p><code>system.memory.usage</code> with unit <code>By</code></p>
</li>
<li><p><code>http.server.request.rate</code> with unit <code>{request}/s</code></p>
</li>
</ul>
<p>This approach has several benefits:</p>
<ol>
<li><p><strong>Clean names</strong>: No ugly suffixes cluttering your metric names</p>
</li>
<li><p><strong>Standardized units</strong>: Follow the Unified Code for Units of Measure (UCUM)</p>
</li>
<li><p><strong>Backend flexibility</strong>: Systems can handle unit conversion automatically</p>
</li>
<li><p><strong>Consistent conventions</strong>: Aligns with OpenTelemetry semantic conventions</p>
</li>
</ol>
<p>The specification recommends using non-prefixed units like <code>By</code> (bytes) rather than <code>MiBy</code> (mebibytes) unless there are technical reasons to do otherwise.</p>
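<p>To make the unit-as-metadata idea concrete, here is a small sketch: one duration metric carrying a UCUM unit, which a backend could convert at display time. The conversion table is illustrative, not part of any SDK:</p>

```python
# One metric, one name; the unit lives alongside the value as metadata.
metric = {"name": "http.server.request.duration", "unit": "ms", "value": 1500.0}

# Because the unit is metadata, a single stored metric can be rendered
# at any scale. A backend might keep a UCUM conversion table like this:
TO_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}

def as_seconds(m):
    return m["value"] * TO_SECONDS[m["unit"]]
```

With name-embedded units you would need `request_duration_ms` and `request_duration_seconds` as two separate metrics; here the conversion is a lookup, not a second timeseries.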
<h2 id="heading-practical-naming-guidelines">Practical Naming Guidelines</h2>
<p>When creating metric names, apply the same <code>{verb} {object}</code> principle we learned for spans, where it makes sense:</p>
<ol>
<li><p><strong>Focus on the operation</strong>: What is being measured?</p>
</li>
<li><p><strong>Not the operator</strong>: Who is doing the measuring?</p>
</li>
<li><p><strong>Follow semantic conventions</strong>: Use established patterns when available</p>
</li>
<li><p><strong>Keep units as metadata</strong>: Don't suffix names with units</p>
</li>
</ol>
<p>Here are examples following OpenTelemetry semantic conventions:</p>
<ul>
<li><p><code>http.server.request.duration</code> (not <code>payment_http_requests_ms</code>)</p>
</li>
<li><p><code>db.client.operation.duration</code> (not <code>user_service_db_queries_seconds</code>)</p>
</li>
<li><p><code>messaging.client.sent.messages</code> (not <code>order_service_messages_sent_total</code>)</p>
</li>
<li><p><code>transaction.count</code> (not <code>payment_transaction_total</code>)</p>
</li>
</ul>
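<p>These guidelines can even be checked mechanically. Below is a hedged sketch of a lint function; the lists of known services and unit suffixes are hypothetical and would come from your own service inventory:</p>

```python
# Hypothetical inputs: your organization's service names and the unit
# suffixes you want to ban from metric names.
KNOWN_SERVICES = {"payment", "user_service", "inventory", "order_service"}
UNIT_SUFFIXES = ("_ms", "_seconds", "_bytes", "_total", "_percent")

def lint_metric_name(name):
    """Return a list of problems with a proposed metric name."""
    problems = []
    first_component = name.replace(".", "_").split("_")[0]
    if first_component in KNOWN_SERVICES or any(
        name.startswith(s) for s in KNOWN_SERVICES
    ):
        problems.append("service name belongs in service.name, not the metric name")
    if name.endswith(UNIT_SUFFIXES):
        problems.append("unit belongs in unit metadata, not the metric name")
    return problems

print(lint_metric_name("payment_transaction_total"))  # two problems
print(lint_metric_name("transaction.count"))          # []
```

A check like this fits naturally into code review or a CI step, catching bad names before they reach a dashboard.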
<h2 id="heading-real-world-migration-examples">Real-world Migration Examples</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Traditional (Context+Units in Name)</td><td>OpenTelemetry (Clean Separation)</td><td>Why It's Better</td></tr>
</thead>
<tbody>
<tr>
<td><code>payment_transaction_total</code></td><td><code>transaction.count</code> + <code>service.name=payment</code> + unit <code>1</code></td><td>Aggregatable across services</td></tr>
<tr>
<td><code>user_service_auth_latency_ms</code></td><td><code>auth.duration</code> + <code>service.name=user</code> + unit <code>ms</code></td><td>Standard operation name, proper unit metadata</td></tr>
<tr>
<td><code>inventory_db_query_seconds</code></td><td><code>db.client.operation.duration</code> + <code>service.name=inventory</code> + unit <code>s</code></td><td>Follows semantic conventions</td></tr>
<tr>
<td><code>api_gateway_requests_per_second</code></td><td><code>http.server.request.rate</code> + <code>service.name=api-gateway</code> + unit <code>{request}/s</code></td><td>Clean name, proper rate unit</td></tr>
<tr>
<td><code>redis_cache_hit_ratio_percent</code></td><td><code>cache.hit_ratio</code> + <code>service.name=redis</code> + unit <code>1</code></td><td>Ratios are unitless</td></tr>
</tbody>
</table>
</div><h2 id="heading-benefits-of-clean-naming">Benefits of Clean Naming</h2>
<p>Separating context from metric names provides specific technical advantages that improve both query performance and operational workflows. The first benefit is cross-service aggregation. A query like <code>sum(transaction.count)</code> returns data from all services without requiring you to know or maintain a list of service names. In a system with 50 microservices, this means one query instead of 50, and that query doesn't break when you add the 51st service.</p>
<p>This consistency makes dashboards reusable across services. A dashboard built for monitoring HTTP requests in your authentication service works without modification for your payment service, inventory service, or any other HTTP-serving component. You write the query once—<code>http.server.request.duration</code> filtered by <code>service.name</code>—and apply it everywhere. No more maintaining dozens of nearly-identical dashboards. Some observability vendors now take this further, automatically generating dashboards based on semantic convention metric names—when your services emit <code>http.server.request.duration</code>, the platform knows exactly what visualizations and aggregations make sense for that metric.</p>
<p>Clean naming also reduces metric namespace clutter. Consider a platform with dozens of services each defining their own metrics. With traditional naming, your metric browser shows hundreds of service-specific variations: <code>apiserver_request_total</code>, <code>payment_service_request_total</code>, <code>user_service_request_total</code>, <code>inventory_service_request_total</code>, and so on. Finding the right metric becomes an exercise in scrolling and searching through redundant variations. With clean naming, you have one metric name (<code>request.count</code>) with attributes capturing the context. This makes metric discovery straightforward—you find the measurement you need, then filter by the service you care about.</p>
<p>Unit handling becomes systematic when units are metadata rather than name suffixes. Observability platforms can perform unit conversions automatically—displaying the same duration metric as milliseconds in one graph and seconds in another, based on what makes sense for the visualization. The metric remains <code>request.duration</code> with unit metadata <code>ms</code>, not two separate metrics <code>request_duration_ms</code> and <code>request_duration_seconds</code>.</p>
<p>The approach also ensures compatibility between manual and automatic instrumentation. When you follow semantic conventions like <code>http.server.request.duration</code>, your custom metrics align with those generated by auto-instrumentation libraries. This creates a consistent data model where queries work across both manually and automatically instrumented services, and engineers don't need to remember which metrics come from which source.</p>
<h2 id="heading-common-pitfalls-to-avoid">Common Pitfalls to Avoid</h2>
<p>Engineers often embed deployment-specific information directly into metric names, creating patterns like <code>user_service_v2_latency</code>. This breaks when version 3 deploys—every dashboard, alert, and query that references the metric name must be updated. The same problem occurs with instance-specific names like <code>node_42_memory_usage</code>. In a cluster with dynamic scaling, you end up with hundreds of distinct metric names that represent the same measurement, making it impossible to write simple aggregation queries.</p>
<p>Environment-specific prefixes cause similar maintenance problems. With metrics named <code>prod_payment_errors</code> and <code>staging_auth_count</code>, you can't write a single query that works across environments. A dashboard that monitors production can't be used for staging without modification. When you need to compare metrics between environments—a common debugging task—you have to write complex queries that explicitly reference each environment's metric names.</p>
<p>Technology stack details in metric names create future migration headaches. A metric named <code>nodejs_payment_memory</code> becomes misleading when you rewrite the service in Go. Similarly, <code>postgres_user_queries</code> requires renaming if you migrate to something else. These technology-specific names also prevent you from writing queries that work across services using different tech stacks, even when they perform the same business function.</p>
<p>Mixing business domains with infrastructure metrics violates the separation between what a system does and how it does it. A metric like <code>ecommerce_cpu_usage</code> conflates the business purpose (e-commerce) with the technical measurement (CPU usage). This makes it harder to reuse infrastructure monitoring across different business domains and complicates multi-tenant deployments where the same infrastructure serves multiple business functions.</p>
<p>The practice of including units in metric names—<code>latency_ms</code>, <code>memory_bytes</code>, <code>count_total</code>—creates redundancy now that OpenTelemetry provides proper unit metadata. It also prevents automatic unit conversion. With <code>request_duration_ms</code> and <code>request_duration_seconds</code> as separate metrics, you need different queries for different time scales. With a single <code>request.duration</code> metric that includes unit metadata, the observability platform handles conversion automatically.</p>
<p>The pattern is clear: context that varies by deployment, instance, environment, or version belongs in attributes, not in the metric name. The metric name should identify what you're measuring. Everything else—who's measuring it, where it's running, which version it is—goes in the attribute layer where it can be filtered, grouped, and aggregated as needed.</p>
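<p>As an illustration of that separation, here is a sketch of how a migration might split one legacy name into a clean name plus attributes. The mapping table is hypothetical; a real migration would be driven by an audit of your own metrics:</p>

```python
# Hypothetical mapping from legacy names to (clean name, extracted attributes).
MIGRATIONS = {
    "prod_payment_errors": (
        "error.count",
        {"service.name": "payment", "deployment.environment.name": "prod"},
    ),
    "nodejs_payment_memory": (
        "process.memory.usage",
        {"service.name": "payment"},  # runtime details belong in resource attrs
    ),
}

def migrate(legacy_name):
    clean_name, attributes = MIGRATIONS[legacy_name]
    return {"name": clean_name, "attributes": attributes}

m = migrate("prod_payment_errors")
print(m["name"])  # error.count
```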
<h2 id="heading-cultivating-better-metrics">Cultivating Better Metrics</h2>
<p>Just like the spans we covered earlier in this series, well-named metrics are a gift to your future self and your team. They provide clarity during incidents, enable powerful cross-service analysis, and make your observability data truly useful rather than just voluminous.</p>
<p>The key insight is the same one we learned with spans: <strong>separation of concerns</strong>. The metric name describes what you're measuring. The context—who's measuring it, where, when, and how—lives in the rich attribute hierarchy that OpenTelemetry provides.</p>
<p>In our next post, we'll dive deep into <strong>metric attributes</strong>—the context layer that makes metrics truly powerful. We'll explore how to structure the rich contextual information that doesn't belong in names, and how to balance informativeness with cardinality concerns.</p>
<p>Until then, remember: a clean metric name is like a well-tended garden path—it leads you exactly where you need to go.</p>
]]></content:encoded></item><item><title><![CDATA[How to Name Your Span Attributes]]></title><description><![CDATA[Welcome to the second installment in our series on OpenTelemetry naming best practices. In our previous post, we explored how to name spans using the {verb} {object} pattern. Today, we're diving into span attributes, the rich contextual data that tra...]]></description><link>https://blog.olly.garden/how-to-name-your-span-attributes</link><guid isPermaLink="true">https://blog.olly.garden/how-to-name-your-span-attributes</guid><category><![CDATA[observability]]></category><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[telemetry]]></category><category><![CDATA[instrumentation]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Tue, 26 Aug 2025 22:00:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756384123234/2ba9e63f-3cbc-4e0c-a2ba-9891f61830f8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to the second installment in our series on OpenTelemetry naming best practices. In our previous post, we explored how to name spans using the <code>{verb} {object}</code> pattern. Today, we're diving into span attributes, the rich contextual data that transforms your traces from simple operation logs into powerful debugging and analysis primitives.</p>
<p>This guide targets developers who are:</p>
<ul>
<li><strong>Instrumenting their own applications</strong> with custom spans and attributes  </li>
<li><strong>Enriching telemetry</strong> beyond what auto-instrumentation provides  </li>
<li><strong>Creating libraries</strong> that others will instrument</li>
</ul>
<p>The attribute naming decisions you make directly impact the usability and maintainability of your observability data. Let's get them right.</p>
<h2 id="heading-start-with-semantic-conventions">Start with Semantic Conventions</h2>
<p>Here's the most important rule that will save you time and improve interoperability: <strong>if an OpenTelemetry semantic convention exists and the semantics match your use case, use it</strong>.</p>
<p>This isn't just about convenience—it's about building telemetry that integrates seamlessly with the broader OpenTelemetry ecosystem. When you use standardized attribute names, your data automatically works with existing dashboards, alerting rules, and analysis tools.</p>
<h3 id="heading-when-semantics-match-use-the-convention">When Semantics Match, Use the Convention</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Your Need</td><td>Use This Semantic Convention</td><td>Why</td></tr>
</thead>
<tbody>
<tr>
<td>HTTP request method</td><td><code>http.request.method</code></td><td>Standardized across all HTTP instrumentation</td></tr>
<tr>
<td>Database collection name</td><td><code>db.collection.name</code></td><td>Works with database monitoring tools</td></tr>
<tr>
<td>Service identification</td><td><code>service.name</code></td><td>Core resource attribute for service correlation</td></tr>
<tr>
<td>Network peer address</td><td><code>network.peer.address</code></td><td>Standard for network-level debugging</td></tr>
<tr>
<td>Error classification</td><td><code>error.type</code></td><td>Enables consistent error analysis</td></tr>
</tbody>
</table>
</div><p>The key principle is <strong>semantic match over naming preference</strong>. Even if you prefer <code>database_table</code> over <code>db.collection.name</code>, use the semantic convention when it accurately describes your data.</p>
<h3 id="heading-when-semantics-dont-match-dont-force-it">When Semantics Don't Match, Don't Force It</h3>
<p>Resist the temptation to misuse semantic conventions:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Don't Do This</td><td>Why It's Wrong</td></tr>
</thead>
<tbody>
<tr>
<td>Using <code>db.collection.name</code> for a file name</td><td>Files and database collections are different concepts</td></tr>
<tr>
<td>Using <code>http.request.method</code> for business actions</td><td>"approve_payment" isn't an HTTP method</td></tr>
<tr>
<td>Using <code>user.id</code> for a transaction ID</td><td>Users and transactions are different entities</td></tr>
</tbody>
</table>
</div><p>Misusing semantic conventions is worse than creating custom attributes—it creates confusion and breaks tooling that expects the standard semantics.</p>
<h2 id="heading-the-golden-rule-domain-first-never-company-first">The Golden Rule: Domain First, Never Company First</h2>
<p>When you need custom attributes beyond the semantic conventions, the most critical principle is: <strong>start with the domain or technology, never your company or application name</strong>.</p>
<p>This principle seems obvious but is consistently violated across the industry. Here's why it matters and how to get it right.</p>
<h3 id="heading-why-company-first-naming-fails">Why Company-First Naming Fails</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Bad Attribute Name</td><td>Problems</td></tr>
</thead>
<tbody>
<tr>
<td><code>og.user.id</code></td><td>Company prefix pollutes global namespace</td></tr>
<tr>
<td><code>myapp.request.size</code></td><td>Application-specific, not reusable</td></tr>
<tr>
<td><code>acme.inventory.count</code></td><td>Makes correlation with standard attributes difficult</td></tr>
<tr>
<td><code>shopify_store.product.sku</code></td><td>Unnecessarily ties concept to one vendor</td></tr>
</tbody>
</table>
</div><p>These approaches create attributes that are:</p>
<ul>
<li>Difficult to correlate across teams and organizations  </li>
<li>Impossible to reuse in different contexts  </li>
<li>Vendor-locked and inflexible  </li>
<li>Inconsistent with OpenTelemetry's interoperability goals</li>
</ul>
<h3 id="heading-domain-first-success-stories">Domain-First Success Stories</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Good Attribute Name</td><td>Why It Works</td></tr>
</thead>
<tbody>
<tr>
<td><code>user.id</code></td><td>Universal concept, vendor-neutral</td></tr>
<tr>
<td><code>request.size</code></td><td>Reusable across applications</td></tr>
<tr>
<td><code>inventory.count</code></td><td>Clear, domain-specific concept</td></tr>
<tr>
<td><code>product.sku</code></td><td>Standard e-commerce terminology</td></tr>
<tr>
<td><code>workflow.step.name</code></td><td>Generic process management concept</td></tr>
</tbody>
</table>
</div><p>This approach creates attributes that are universally understandable, reusable by others facing similar problems, and future-proof.</p>
<h2 id="heading-understanding-the-structure-dots-and-underscores">Understanding the Structure: Dots and Underscores</h2>
<p>OpenTelemetry attribute names follow a specific structural pattern that balances readability with consistency. Understanding this pattern helps you create attributes that feel natural alongside standard semantic conventions.</p>
<h3 id="heading-use-dots-for-hierarchical-separation">Use Dots for Hierarchical Separation</h3>
<p>Dots (<code>.</code>) separate hierarchical components, following the pattern: <code>{domain}.{component}.{property}</code></p>
<p>Examples from semantic conventions:</p>
<ul>
<li><code>http.request.method</code> - HTTP domain, request component, method property  </li>
<li><code>db.collection.name</code> - Database domain, collection component, name property  </li>
<li><code>service.instance.id</code> - Service domain, instance component, id property</li>
</ul>
<h3 id="heading-use-underscores-for-multi-word-components">Use Underscores for Multi-Word Components</h3>
<p>When a single component contains multiple words, use underscores (<code>_</code>):</p>
<ul>
<li><code>http.response.status_code</code> - "status_code" is one logical component  </li>
<li><code>system.memory.usage_percent</code> - "usage_percent" is one measurement concept</li>
</ul>
<h3 id="heading-create-deeper-hierarchies-when-needed">Create Deeper Hierarchies When Needed</h3>
<p>You can nest further when it adds clarity:</p>
<ul>
<li><code>http.request.body.size</code>  </li>
<li><code>k8s.pod.label.{key}</code>  </li>
<li><code>messaging.kafka.message.key</code></li>
</ul>
<p>Each level should represent a meaningful conceptual boundary.</p>
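<p>A small sketch of what that structural rule looks like as code. The regular expression is an approximation of the convention, not the normative grammar from the specification:</p>

```python
import re

# Each dot-separated component is lowercase words joined by underscores.
COMPONENT = r"[a-z][a-z0-9]*(_[a-z0-9]+)*"
ATTRIBUTE_NAME = re.compile(rf"^{COMPONENT}(\.{COMPONENT})*$")

def looks_well_formed(name):
    return ATTRIBUTE_NAME.match(name) is not None

print(looks_well_formed("http.response.status_code"))  # True
print(looks_well_formed("Http.Response.StatusCode"))   # False
```

Templated names like <code>k8s.pod.label.{key}</code> would need the placeholder expanded before a check like this applies.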
<h2 id="heading-reserved-namespaces-what-you-must-never-use">Reserved Namespaces: What You Must Never Use</h2>
<p>Certain namespaces are strictly reserved, and violating these rules can break your telemetry data.</p>
<h3 id="heading-the-otel-namespace-is-off-limits">The <code>otel.*</code> Namespace is Off-Limits</h3>
<p>The <code>otel.*</code> prefix is exclusively reserved for the OpenTelemetry specification itself. It's used to express OpenTelemetry concepts in telemetry formats that don't natively support them.</p>
<p>Reserved <code>otel.*</code> attributes include:</p>
<ul>
<li><code>otel.scope.name</code> - Instrumentation scope name  </li>
<li><code>otel.status_code</code> - Span status code  </li>
<li><code>otel.span.sampling_result</code> - Sampling decision</li>
</ul>
<p><strong>Never create attributes starting with <code>otel.</code></strong> Any additions to this namespace must be approved as part of the OpenTelemetry specification.</p>
<h3 id="heading-other-reserved-attributes">Other Reserved Attributes</h3>
<p>The specification also reserves these specific attribute names:</p>
<ul>
<li><code>error.type</code>  </li>
<li><code>exception.message</code>, <code>exception.stacktrace</code>, <code>exception.type</code>  </li>
<li><code>server.address</code>, <code>server.port</code>  </li>
<li><code>service.name</code>  </li>
<li><code>telemetry.sdk.language</code>, <code>telemetry.sdk.name</code>, <code>telemetry.sdk.version</code>  </li>
<li><code>url.scheme</code></li>
</ul>
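<p>A guard for these rules might look like the following sketch. The reserved list mirrors the names above; treat it as illustrative rather than exhaustive:</p>

```python
RESERVED_PREFIXES = ("otel.",)
RESERVED_NAMES = {
    "error.type",
    "exception.message", "exception.stacktrace", "exception.type",
    "server.address", "server.port",
    "service.name",
    "telemetry.sdk.language", "telemetry.sdk.name", "telemetry.sdk.version",
    "url.scheme",
}

def check_custom_attribute(name):
    """Reject custom attribute names that collide with reserved ones."""
    if name.startswith(RESERVED_PREFIXES):
        raise ValueError(f"{name}: the otel.* namespace is reserved")
    if name in RESERVED_NAMES:
        raise ValueError(f"{name}: reserved by the specification")
    return name

check_custom_attribute("inventory.item.id")   # fine
# check_custom_attribute("otel.my_flag")      # would raise ValueError
```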
<h2 id="heading-semantic-convention-patterns">Semantic Convention Patterns</h2>
<p>The best way to develop good attribute naming intuition is studying OpenTelemetry's semantic conventions. These represent thousands of hours of design work by observability experts.</p>
<h3 id="heading-domain-organization-patterns">Domain Organization Patterns</h3>
<p>Notice how semantic conventions organize around clear domains:</p>
<p><strong>Infrastructure Domains</strong></p>
<ul>
<li><code>service.*</code> - Service identity and metadata  </li>
<li><code>host.*</code> - Host/machine information  </li>
<li><code>container.*</code> - Container runtime information  </li>
<li><code>process.*</code> - Operating system processes</li>
</ul>
<p><strong>Communication Domains</strong></p>
<ul>
<li><code>http.*</code> - HTTP protocol specifics  </li>
<li><code>network.*</code> - Network layer information  </li>
<li><code>rpc.*</code> - Remote procedure call attributes  </li>
<li><code>messaging.*</code> - Message queue systems</li>
</ul>
<p><strong>Data Domains</strong></p>
<ul>
<li><code>db.*</code> - Database operations  </li>
<li><code>url.*</code> - URL components</li>
</ul>
<h3 id="heading-universal-property-patterns">Universal Property Patterns</h3>
<p>Across all domains, consistent patterns emerge for common properties:</p>
<p><strong>Identity Properties</strong></p>
<ul>
<li><code>.name</code> - Human-readable identifiers (<code>service.name</code>, <code>container.name</code>)  </li>
<li><code>.id</code> - System identifiers (<code>container.id</code>, <code>process.pid</code>)  </li>
<li><code>.version</code> - Version information (<code>service.version</code>)  </li>
<li><code>.type</code> - Classification (<code>messaging.operation.type</code>, <code>error.type</code>)</li>
</ul>
<p><strong>Network Properties</strong></p>
<ul>
<li><code>.address</code> - Network addresses (<code>server.address</code>, <code>client.address</code>)  </li>
<li><code>.port</code> - Port numbers (<code>server.port</code>, <code>client.port</code>)</li>
</ul>
<p><strong>Measurement Properties</strong></p>
<ul>
<li><code>.size</code> - Byte measurements (<code>http.request.body.size</code>)  </li>
<li><code>.count</code> - Quantities (<code>messaging.batch.message_count</code>)  </li>
<li><code>.duration</code> - Time measurements (<code>http.server.request.duration</code>)</li>
</ul>
<p>When creating custom domains, follow these same patterns. For inventory management, consider:</p>
<ul>
<li><code>inventory.item.name</code>  </li>
<li><code>inventory.item.id</code>  </li>
<li><code>inventory.location.address</code>  </li>
<li><code>inventory.batch.count</code></li>
</ul>
<h2 id="heading-creating-custom-domains-safely">Creating Custom Domains Safely</h2>
<p>Sometimes your business logic requires attributes outside existing semantic conventions. This is normal—OpenTelemetry can't cover every possible business domain.</p>
<h3 id="heading-guidelines-for-safe-custom-domains">Guidelines for Safe Custom Domains</h3>
<ol>
<li><strong>Choose descriptive, generic names</strong> that others could reuse  </li>
<li><strong>Avoid company-specific terminology</strong> in the domain name  </li>
<li><strong>Follow hierarchical patterns</strong> established by semantic conventions  </li>
<li><strong>Consider if your domain could become a future semantic convention</strong></li>
</ol>
<h3 id="heading-examples-of-well-designed-custom-attributes">Examples of Well-Designed Custom Attributes</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Domain</td><td>Good Attributes</td><td>Why They Work</td></tr>
</thead>
<tbody>
<tr>
<td>Business</td><td><code>payment.method</code>, <code>order.status</code></td><td>Clear, reusable business concepts</td></tr>
<tr>
<td>Logistics</td><td><code>inventory.location</code>, <code>shipment.carrier</code></td><td>Domain-specific but transferable</td></tr>
<tr>
<td>Process</td><td><code>workflow.step.name</code>, <code>approval.status</code></td><td>Generic process management</td></tr>
<tr>
<td>Content</td><td><code>document.format</code>, <code>media.codec</code></td><td>Universal content concepts</td></tr>
</tbody>
</table>
</div><h2 id="heading-the-rare-exception-when-prefixes-make-sense">The Rare Exception: When Prefixes Make Sense</h2>
<p>In rare cases, you might need company or application prefixes. This typically happens when your custom attribute might conflict with attributes from other sources in a distributed system.</p>
<p><strong>Consider prefixes when:</strong></p>
<ul>
<li>Your attribute might conflict with vendor attributes in a distributed system  </li>
<li>You're instrumenting proprietary technology that's truly company-specific  </li>
<li>You're capturing internal implementation details that shouldn't be generalized</li>
</ul>
<p>For most business logic attributes, stick with domain-first naming.</p>
<h2 id="heading-your-action-plan">Your Action Plan</h2>
<p>Naming span attributes well creates telemetry data that's maintainable, interoperable, and valuable across your organization. Here's your roadmap:</p>
<ol>
<li><strong>Always check semantic conventions first</strong> - Use them when semantics match  </li>
<li><strong>Lead with domain, never company</strong> - Create vendor-neutral attributes  </li>
<li><strong>Respect reserved namespaces</strong> - Especially avoid <code>otel.*</code>  </li>
<li><strong>Follow hierarchical patterns</strong> - Use dots and underscores consistently  </li>
<li><strong>Build for reusability</strong> - Think beyond your current needs</li>
</ol>
<p>By following these principles, you're not just solving today's instrumentation challenges, you're contributing to a more coherent, interoperable observability ecosystem that benefits everyone.</p>
<p>In our next post in this series, we'll shift our focus from spans to metrics, exploring how to name the quantitative measurements that tell us how our systems are performing, and why the same principles of separation and domain-first thinking apply to the numbers that matter most.  </p>
]]></content:encoded></item><item><title><![CDATA[How to Name Your Spans]]></title><description><![CDATA[One of the most fundamental yet often overlooked aspects of good instrumentation
is naming. This post is the first in a series dedicated to the art and science
of naming things in OpenTelemetry. We'll start with spans, the building blocks
of a distri...]]></description><link>https://blog.olly.garden/how-to-name-your-spans</link><guid isPermaLink="true">https://blog.olly.garden/how-to-name-your-spans</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[distributed tracing]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Tue, 05 Aug 2025 22:00:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754500711825/72f72ddf-ae37-487f-8648-cc92a7991192.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the most fundamental yet often overlooked aspects of good instrumentation
is naming. This post is the first in a series dedicated to the art and science
of naming things in OpenTelemetry. We'll start with spans, the building blocks
of a distributed trace, and give you the most important takeaway right at the
beginning: how to name the spans that describe your unique business logic.</p>
<h2 id="heading-naming-your-business-spans">Naming your business spans</h2>
<p>While OpenTelemetry's automatic instrumentation is fantastic for covering
standard operations (like incoming HTTP requests or database calls), the most
valuable insights often come from the custom spans you add to your own business
logic. These are the operations unique to your application's domain.</p>
<p>For these custom spans, we recommend a pattern that borrows from basic grammar.
Simple, clear sentences often follow a subject -&gt; verb -&gt; direct object
structure. The "subject" (the service performing the work) is already part of
the trace's context. We can use the rest of that structure for our span name:</p>
<h2 id="heading-verb-object">{verb} {object}</h2>
<p>This pattern is descriptive, easy to understand, and helps maintain low cardinality—a
crucial concept we'll touch on later.</p>
<ul>
<li><strong>{verb}</strong>: A verb describing the work being done (for example: process, send,
calculate, render).</li>
<li><strong>{object}</strong>: A noun describing what is being acted upon (for example:
payment, invoice, shopping_cart, ad).</li>
</ul>
<p>Let's look at some examples:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Bad Span Name</td><td>Good Span Name</td><td>Why It's Better</td></tr>
</thead>
<tbody>
<tr>
<td>process_payment_for_user_jane_doe</td><td>process payment</td><td>The verb and object are clear. The user ID belongs in an attribute.</td></tr>
<tr>
<td>send_invoice_#98765</td><td>send invoice</td><td>Aggregable. You can easily find the P95 latency for sending all invoices.</td></tr>
<tr>
<td>render_ad_for_campaign_summer_sale</td><td>render ad</td><td>The specific campaign is a detail, not the core operation. Put it in an attribute.</td></tr>
<tr>
<td>calculate_shipping_for_zip_90210</td><td>calculate shipping</td><td>The operation is consistent. The zip code is a parameter, not part of the name.</td></tr>
<tr>
<td>validation_failed</td><td>validate user_input</td><td>Focus on the operation, not the outcome. The result belongs in the span's status.</td></tr>
</tbody>
</table>
</div><p>By adhering to the <code>{verb} {object}</code> format, you create a clear, consistent
vocabulary for your business operations. This makes your traces incredibly
powerful. A product manager could ask, "How long does it take to process
payments?" and an engineer can immediately filter for those spans and get an
answer.</p>
<h2 id="heading-why-this-pattern-works">Why this pattern works</h2>
<p>So why is <code>process payment</code> good and <code>process_invoice_#98765</code> bad? The reason is
<strong>cardinality</strong>.</p>
<p>Cardinality refers to the number of unique values a piece of data can have. A
span name should have <strong>low cardinality</strong>. If you include unique identifiers
like a user ID or an invoice number in the span name, you will create a unique
name for every single operation. This floods your observability backend, makes
it impossible to group and analyze similar operations, and can significantly
increase costs.</p>
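<p>To make the cardinality gap concrete, here is a small, purely illustrative Python sketch (the span names are hypothetical) counting how many distinct names each approach produces over a thousand operations:</p>

```python
# Illustrative only: embedding unique IDs in span names explodes cardinality.
bad_names = {f"process_payment_for_user_{i}" for i in range(1000)}
good_names = {"process payment" for _ in range(1000)}

print(len(bad_names))   # 1000 distinct names for the backend to index
print(len(good_names))  # 1 distinct name, trivially aggregable
```

<p>A backend can compute the P95 latency for the single name in one query; the thousand one-off names cannot be grouped at all without extra parsing.</p>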
<p>The <code>{verb} {object}</code> pattern naturally produces low-cardinality names. The
unique, high-cardinality details (<code>invoice_#98765</code>, <code>user_jane_doe</code>) belong in
<strong>span attributes</strong>, which we will cover in a future blog post.</p>
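<p>As a sketch of that separation (the helper and attribute keys below are invented for illustration, not official semantic conventions), you can keep the name low-cardinality while routing the specifics into an attribute map:</p>

```python
def business_span(verb, obj, **details):
    """Return a low-cardinality '{verb} {obj}' span name plus an
    attribute dict holding the high-cardinality, request-specific details."""
    return f"{verb} {obj}", details

name, attributes = business_span(
    "process", "payment",
    user_id="jane_doe",    # hypothetical attribute keys
    invoice_id="98765",
)
print(name)        # process payment
print(attributes)  # {'user_id': 'jane_doe', 'invoice_id': '98765'}
```

<p>The same pattern maps directly onto any OpenTelemetry SDK: the first value becomes the span name, the second becomes its attributes.</p>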
<h2 id="heading-learning-from-semantic-conventions">Learning from Semantic Conventions</h2>
<p>This <code>{verb} {object}</code> approach isn't arbitrary. It's a best practice that
reflects the principles behind the official <strong>OpenTelemetry Semantic Conventions
(SemConv)</strong>. SemConv provides a standardized set of names for common operations,
ensuring that a span for an HTTP request is named consistently, regardless of
the language or framework.</p>
<p>When you look closely, you'll see this same pattern of describing an operation
on a resource echoed throughout the conventions. By following it for your custom
spans, you are aligning with the established philosophy of the entire
OpenTelemetry ecosystem.</p>
<p>Let's look at a few examples from SemConv.</p>
<h3 id="heading-http-spans">HTTP spans</h3>
<p>For server-side HTTP spans, the convention is <code>{method} {route}</code>.</p>
<ul>
<li><strong>Example:</strong> <code>GET /api/users/:id</code></li>
<li><strong>Analysis:</strong> This is a verb (<code>GET</code>) acting on an object (<code>/api/users/:id</code>).
The use of a route template instead of the actual path (<code>/api/users/123</code>) is a
perfect example of maintaining low cardinality.</li>
</ul>
<h3 id="heading-database-spans">Database spans</h3>
<p>Database spans are often named <code>{db.operation} {db.name}.{db.sql.table}</code>.</p>
<ul>
<li><strong>Example:</strong> <code>INSERT my_database.users</code></li>
<li><strong>Analysis:</strong> This is a verb (<code>INSERT</code>) acting on an object
(<code>my_database.users</code>). The specific values being inserted are high-cardinality
and are rightly excluded from the name.</li>
</ul>
<h3 id="heading-rpc-spans">RPC spans</h3>
<p>For Remote Procedure Calls, the convention is <code>{rpc.service}/{rpc.method}</code>.</p>
<ul>
<li><strong>Example:</strong> <code>com.example.UserService/GetUser</code></li>
<li><strong>Analysis:</strong> While the format is different, the principle is the same. It
describes a method (<code>GetUser</code>), which is a verb, within a service
(<code>com.example.UserService</code>), which is the object or resource.</li>
</ul>
<p>The key takeaway is that by using <code>{verb} {object}</code>, you are speaking the same
language as the rest of your instrumentation.</p>
<h2 id="heading-cultivating-a-healthy-system">Cultivating a healthy system</h2>
<p>Naming spans is not a trivial task. It's a foundational practice for building a
robust and effective observability strategy. By adopting a clear, consistent
pattern like <code>{verb} {object}</code> for your business-specific spans, you can
transform your telemetry data from a tangled mess into a well-tended garden.</p>
<p>A well-named span is a gift to your future self and your team. It provides
clarity during stressful outages, enables powerful performance analysis, and
ultimately helps you build better, more reliable software.</p>
<p>In our next post in this series, we will dig into the next layer of detail:
<strong>span attributes</strong>. We'll explore how to add the rich, high-cardinality context
to your spans that is necessary for deep debugging, without compromising the
aggregability of your span names.</p>
]]></content:encoded></item><item><title><![CDATA[🌱 Cultivating Unique service.instance.id on NGINX Ingress with OpenTelemetry]]></title><description><![CDATA[A task to set a unique service.instance.id on the NGINX Ingress Controller using OpenTelemetry should be simple, right? But as it turned out, the ingress controller doesn't expose all of NGINX’s OTel knobs out of the box, so I had to roll my own twea...]]></description><link>https://blog.olly.garden/cultivating-unique-serviceinstanceid-on-nginx-ingress-with-opentelemetry</link><guid isPermaLink="true">https://blog.olly.garden/cultivating-unique-serviceinstanceid-on-nginx-ingress-with-opentelemetry</guid><category><![CDATA[#observability #monitoring #DevOps #tools #softwaredevelopment #infrastructure #performance #metrics #logging #troubleshooting]]></category><dc:creator><![CDATA[Yuri Oliveira Sá]]></dc:creator><pubDate>Thu, 31 Jul 2025 06:00:06 GMT</pubDate><content:encoded><![CDATA[<p>A task to set a unique <code>service.instance.id</code> on the NGINX Ingress Controller using OpenTelemetry should be simple, right? But as it turned out, the ingress controller doesn't expose all of NGINX’s OTel knobs out of the box, so I had to roll my own tweak garden.</p>
<h3 id="heading-why-should-i-care">Why should I care?</h3>
<p>Without unique instance IDs, your tracing data looks like a tangled tangleweed: hard to trace, difficult to debug, and completely at odds with what we’re trying to achieve with observability.</p>
<h2 id="heading-the-solution">The solution</h2>
<h3 id="heading-part-1-nginx-ingress-controller-helm-chart">Part 1 - Nginx Ingress Controller - Helm Chart</h3>
<p>To get <code>POD_UID</code> injected into each trace, here’s the minimal yet powerful config snippet I landed on:</p>
<ul>
<li>Set the <code>POD_UID</code> environment variable, since the NGINX Ingress Controller sets only <code>POD_NAME</code> and <code>POD_NAMESPACE</code> by default.</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-attr">extraEnvs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">POD_UID</span>
    <span class="hljs-attr">valueFrom:</span>
      <span class="hljs-attr">fieldRef:</span>
         <span class="hljs-attr">fieldPath:</span> <span class="hljs-string">metadata.uid</span>
</code></pre>
<ul>
<li>Pass <code>POD_UID</code> into NGINX and set it as a span attribute on each request.</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-attr">controller:</span>
  <span class="hljs-attr">config:</span>
    <span class="hljs-attr">main-snippet:</span> <span class="hljs-string">|
      env POD_UID;
</span>    <span class="hljs-attr">server-snippet:</span> <span class="hljs-string">|
      set $pod_uid "unknown";
      access_by_lua_block {
        ngx.var.pod_uid = os.getenv("POD_UID") or "unknown"
      }
      opentelemetry_attribute service.instance.id $pod_uid;</span>
</code></pre>
<h3 id="heading-part-2-opentelemetry-collector-config">Part 2 - OpenTelemetry Collector Config</h3>
<ul>
<li>Configure the <a target="_blank" href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor">transform</a> processor to set the resource-attribute. </li>
</ul>
<pre><code class="lang-yaml">    <span class="hljs-attr">processors:</span>
      <span class="hljs-attr">transform:</span>
        <span class="hljs-attr">error_mode:</span> <span class="hljs-string">ignore</span>
        <span class="hljs-attr">trace_statements:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">set(resource.attributes["service.instance.id"],</span> <span class="hljs-string">span.attributes["service.instance.id"])</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">delete_key(span.attributes,</span> <span class="hljs-string">"service.instance.id"</span><span class="hljs-string">)</span>
</code></pre>
<h3 id="heading-how-it-works">How it works:</h3>
<ul>
<li>Explicitly exposes the <code>POD_UID</code> environment variable so NGINX can see it.</li>
<li>Initializes a default <code>$pod_uid</code>—my fail-safe in case the env var goes missing.</li>
<li>Uses Lua to pull in the real <code>POD_UID</code>.</li>
<li>Sets the OpenTelemetry attribute <code>service.instance.id</code> to match the actual UID.</li>
<li>Finally, the OpenTelemetry Collector captures <code>service.instance.id</code> from the span attributes and promotes it to a resource attribute.</li>
</ul>
<h3 id="heading-benefits-to-your-telemetry">Benefits to your telemetry:</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Without unique <code>service.instance.id</code></td><td>With unique <code>service.instance.id</code></td></tr>
</thead>
<tbody>
<tr>
<td>Pods share the same instance ID, causing confusion in telemetry.</td><td>Each pod is individually identifiable.</td></tr>
<tr>
<td>Difficult to isolate errors and debug effectively.</td><td>Allows precise tracing to individual pods.</td></tr>
<tr>
<td>Limited visibility into pod-specific issues.</td><td>Enables accurate root-cause analysis.</td></tr>
</tbody>
</table>
</div><h3 id="heading-final-result">Final result</h3>
<p>By implementing this solution, each NGINX ingress pod is clearly distinguishable in tracing data. This improves observability significantly by providing accurate, pod-specific telemetry, facilitating precise troubleshooting and diagnostics.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing OllyGarden]]></title><description><![CDATA[In the rush for visibility, many organizations find themselves lost in an overgrown jungle of data. Teams generate a constant stream of telemetry, hoping it will sprout into useful insights. Instead, they often end up with "bad telemetry"—data that i...]]></description><link>https://blog.olly.garden/introducing-ollygarden</link><guid isPermaLink="true">https://blog.olly.garden/introducing-ollygarden</guid><category><![CDATA[#observability #monitoring #DevOps #tools #softwaredevelopment #infrastructure #performance #metrics #logging #troubleshooting]]></category><dc:creator><![CDATA[OllyGarden]]></dc:creator><pubDate>Wed, 09 Jul 2025 07:00:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1751918339757/6a8ca332-0975-4bd1-ba76-8f65f1a83099.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the rush for visibility, many organizations find themselves lost in an overgrown jungle of data. Teams generate a constant stream of telemetry, hoping it will sprout into useful insights. Instead, they often end up with "bad telemetry"—data that is noisy, irrelevant, or incomplete, driving up costs and obscuring the very answers they seek.</p>
<p>Today, we're coming out of stealth to introduce <a target="_blank" href="https://olly.garden/">OllyGarden</a>, a new company dedicated to tending this garden of observability. Backed by a <strong>pre-seed round</strong> led by <a target="_blank" href="https://www.dig.ventures/">DIG Ventures</a>, with investments from observability leaders like <a target="_blank" href="https://www.datadoghq.com/">Datadog Ventures</a>, <a target="_blank" href="https://grafana.com/">Grafana Labs</a>, and <a target="_blank" href="https://www.dash0.com/">Dash0</a>, as well as special angels with deep knowledge in our space, like <a target="_blank" href="https://www.linkedin.com/in/batuhanuslu/">Batuhan Uslu</a>, <a target="_blank" href="https://www.linkedin.com/in/bensigelman/">Ben Sigelman</a>, <a target="_blank" href="https://www.linkedin.com/in/caniszczyk/">Chris Aniszczyk</a>, and <a target="_blank" href="https://www.linkedin.com/in/irlivingstone/">Ian Livingstone</a>, we're ready to get to work.</p>
<p>Our mission is simple: to improve the efficiency of telemetry pipelines. We believe the first and most crucial step is to help companies understand and optimize the telemetry they are already generating.</p>
<h2 id="heading-the-weeds-were-tackling-real-world-telemetry-pains"><strong>The Weeds We’re Tackling: Real-World Telemetry Pains</strong></h2>
<p>The problem of bad telemetry isn't theoretical; it's a daily struggle for engineering teams everywhere. During our research, we heard the same stories again and again. These might sound familiar to you:</p>
<ul>
<li><p><strong>Runaway Costs:</strong> An engineer at a company based in Berlin mentioned that a single high-cardinality metric was generating over <strong>$20,000 worth of telemetry per month</strong> without clear value. After reducing the cardinality, that cost dropped to $2,000. How many such metrics are hiding in your pipelines, undetected?</p>
</li>
<li><p><strong>Broken Insights &amp; Poor User Experience:</strong> What happens when telemetry fails? An engineer working at a company in Australia mentioned that their support team uses distributed tracing to debug user-reported problems, but incomplete traces limit their ability to help their users, directly impacting customer satisfaction.</p>
</li>
<li><p><strong>The "Instrument Everything" Trap:</strong> Auto-instrumentation tools are powerful, but if left unconfigured they can gather far more information than necessary. This can generate extremely high volumes of data, overloading systems before you even get a chance to analyze the data.</p>
</li>
<li><p><strong>Vast, Unchecked Data Volumes:</strong> Another company shared that they generate about <strong>3 Petabytes of uncompressed telemetry per month</strong>, acknowledging that among that trove of data, "a lot of it is bad telemetry". With industry experts estimating that up to 90% of telemetry data goes unused, imagine how many CPU cycles, collector instances, and how much egress traffic could be spared if that bad telemetry wasn't generated in the first place.</p>
</li>
</ul>
<h2 id="heading-our-first-step-ollygarden-insights-amp-the-instrumentation-score"><strong>Our First Step: OllyGarden Insights &amp; the Instrumentation Score</strong></h2>
<p>You can't improve what you can't measure. For too long, evaluating telemetry quality has been a subjective exercise based on gut feelings and tribal knowledge. We're tackling this problem by giving observability engineers superpowers, allowing them to see exactly how good the telemetry is inside their pipelines.</p>
<p>Our first product, <strong>OllyGarden Insights</strong>, analyzes your telemetry streams to give you deep insights into your data quality. It assesses your instrumentation against best practices, identifies services that are over- or under-instrumented, and helps you make informed, data-driven decisions about what to change.</p>
<p>To provide a common vocabulary for this, we launched the <strong>Instrumentation Score</strong>, a standardized value to objectively assess OpenTelemetry instrumentation. We believe such a fundamental metric for telemetry health shouldn't be proprietary. That’s why we initiated the Instrumentation Score as an <strong>open-source effort</strong>, with an open governance model and support from partners like Dash0, Datadog, Grafana Labs, Honeycomb, New Relic, and Splunk. This is our first major contribution back to the community we care so much about, and a testament to how we plan to operate.</p>
<h2 id="heading-why-ollygarden-our-roots-in-opentelemetry"><strong>Why OllyGarden? Our Roots in OpenTelemetry</strong></h2>
<p>We know the challenges of telemetry because we've been helping build the solutions for years. OllyGarden was founded by OpenTelemetry veteran Juraci Paixão Kröhling and SRE expert Yuri Oliveira Sá. As a member of the OpenTelemetry Governance Committee, creator of the OpenTelemetry Operator and OpenTelemetry Collector Builder (ocb), and one of the project's top contributors, Juraci has been on the front lines of the observability revolution.</p>
<p>This deep experience shapes our core philosophy:</p>
<ul>
<li><p><strong>A Vendor-Neutral Approach:</strong> Our business is to make your telemetry more efficient. We are a neutral partner you can trust, acting as a complement to observability backends by helping their customers send them higher-quality data. We succeed when you gain clarity, regardless of where your data is stored.</p>
</li>
<li><p><strong>The Start of Our Journey:</strong> OllyGarden Insights is our first major step. We are focused on delivering immediate value by helping you understand and improve the telemetry you already have. We are just at the beginning of our journey and are incredibly excited about the future, but our commitment today is clear: to empower engineers with the backend-neutral tools they need to cultivate clarity from their data.</p>
</li>
</ul>
<h2 id="heading-start-cultivating-better-telemetry-today"><strong>Start Cultivating Better Telemetry Today</strong></h2>
<p>It's time to move beyond the jungle of bad telemetry and start purposefully cultivating a garden of clear, actionable insights.</p>
<p>We are now engaging with <strong>early users</strong>. By joining us, you will not only see the power of purposeful instrumentation firsthand but also have the unique opportunity to help shape our product and secure special early-bird pricing. We'll analyze a sample of your telemetry and provide you with your <strong>Instrumentation Score</strong> and actionable insights for improvement.</p>
<p>Ready to see what's really growing in your telemetry pipelines? Contact us at <a target="_blank" href="mailto:contact@olly.garden"><strong>contact@olly.garden</strong></a> or visit our website to learn more.</p>
]]></content:encoded></item><item><title><![CDATA[Introducing the Instrumentation Score]]></title><description><![CDATA[Telemetry data is the foundation of our observability. We gather metrics, traces, and logs, aiming to cultivate a clear understanding of application health. Yet, a persistent question often arises: "Is our telemetry actually good?" How do we distingu...]]></description><link>https://blog.olly.garden/instrumentation-score</link><guid isPermaLink="true">https://blog.olly.garden/instrumentation-score</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[instrumentation]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Wed, 11 Jun 2025 04:09:52 GMT</pubDate><content:encoded><![CDATA[<p>Telemetry data is the foundation of our observability. We gather metrics, traces, and logs, aiming to cultivate a clear understanding of application health. Yet, a persistent question often arises: "Is our telemetry actually <em>good</em>?" How do we distinguish valuable insights from data that merely consumes resources?</p>
<p>For too long, evaluating instrumentation effectiveness has been a subjective exercise. We've lacked a common language or a standard measure to truly understand if our telemetry is enriching our insights or just overgrowing the plot. At <a target="_blank" href="https://olly.garden/">OllyGarden</a>, we recognize this challenge and are building a product to give super powers to observability engineers. We heard from them that our Instrumentation Score was something special. And we believe that <a target="_blank" href="https://www.youtube.com/watch?v=vhYMRtqvMg8&amp;t=118s">if you want something really special, you share it</a>.</p>
<p>Today, OllyGarden introduces its first major contribution to the observability ecosystem: the <strong>Instrumentation Score</strong>.</p>
<p>The Instrumentation Score is a standardized, numerical value designed to objectively assess the consistency and effectiveness of <a target="_blank" href="https://opentelemetry.io/">OpenTelemetry</a> instrumentation. It analyzes <a target="_blank" href="https://opentelemetry.io/docs/specs/otlp/">OTLP</a> (OpenTelemetry Protocol) data streams against a predefined set of rules rooted in OpenTelemetry best practices and <a target="_blank" href="https://opentelemetry.io/docs/specs/semconv/">semantic conventions</a>. It’s a health check for your telemetry, providing a clear, actionable measure of its quality.</p>
<p>As <a target="_blank" href="https://www.linkedin.com/in/jpkroehling/">Juraci Paixão Kröhling</a>, Co-founder at OllyGarden, states:</p>
<blockquote>
<p>"As an OpenTelemetry contributor and enthusiast, I've seen firsthand the project's power to democratize instrumentation. Yet, a persistent question has always been: 'Are we generating good telemetry?' Too often, the answer is unclear, leading to missed insights or wasted resources. The Instrumentation Score, an initiative we're launching from OllyGarden, aims to provide that clarity. It's about establishing a common, actionable language for telemetry quality, built on OpenTelemetry principles, to empower every engineer and organization to confidently improve their observability practices and truly harness the value of their data."</p>
</blockquote>
<p>The Instrumentation Score provides a common vocabulary for discussing instrumentation effectiveness. For <strong>engineers and SREs</strong>, it offers actionable guidance, highlighting where instrumentation can be improved. For <strong>CTOs and technology leaders</strong>, the strategic value includes improved ROI on observability by focusing on <a target="_blank" href="https://blog.olly.garden/purposeful-instrumentation">purposeful telemetry</a> and reducing <a target="_blank" href="https://blog.olly.garden/theres-a-lot-of-bad-telemetry-out-there">bad telemetry</a>.</p>
<p>OllyGarden is committed to OpenTelemetry and the open-source ecosystem. The Instrumentation Score leverages OpenTelemetry Semantic Conventions and analyzes OTLP data. Crucially, we've ensured this initiative is not only open source but is being developed with an open governance model, with support or contributions from key industry players like <a target="_blank" href="https://www.dash0.com/">Dash0</a>, <a target="_blank" href="https://newrelic.com">New Relic</a>, <a target="_blank" href="https://splunk.com/">Splunk</a>, <a target="_blank" href="https://datadog.com/">Datadog</a>, and <a target="_blank" href="https://grafana.com/">Grafana Labs</a>. The specification is an open-source effort, and we are actively requesting observability engineers to contribute with their own rules by opening a pull request against the GitHub repository. Our goal is a collaborative evolution, with the Instrumentation Score eventually finding a home within a neutral foundation.</p>
<p>The introduction of the Instrumentation Score is a step towards a future where organizations can confidently understand and improve their telemetry.</p>
<p>We invite you to learn more and get involved:</p>
<ul>
<li><p>Explore the <strong>Instrumentation Score landing page</strong>: <a target="_blank" href="https://score.olly.garden/">https://score.olly.garden/</a></p>
</li>
<li><p>Review the <strong>specification and contribute on GitHub</strong>: <a target="_blank" href="https://github.com/instrumentation-score/">https://github.com/instrumentation-score/</a></p>
</li>
</ul>
<p>OllyGarden aims to improve the efficiency of telemetry pipelines. The Instrumentation Score is the first seed we’re planting, hoping that together, we can help everyone grow a more effective observability practice.</p>
]]></content:encoded></item><item><title><![CDATA[Concrete Applications of Purposeful Instrumentation]]></title><description><![CDATA[In our Purposeful Instrumentation blog post, we laid the groundwork for a more disciplined approach to observability. We argued that the goal isn't to merely amass data, but to cultivate high-quality telemetry signals – focusing on quality over quant...]]></description><link>https://blog.olly.garden/concrete-applications-of-purposeful-instrumentation</link><guid isPermaLink="true">https://blog.olly.garden/concrete-applications-of-purposeful-instrumentation</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><category><![CDATA[telemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Wed, 04 Jun 2025 22:00:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1749055629118/11dea6ef-f2d6-400e-9e31-6a85b0c64e96.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In our <a target="_blank" href="https://blog.olly.garden/purposeful-instrumentation">Purposeful Instrumentation</a> blog post, we laid the groundwork for a more disciplined approach to observability. We argued that the goal isn't to merely amass data, but to <strong>cultivate high-quality telemetry signals</strong> – focusing on quality over quantity. The aim is to transform our experience during high-pressure incidents from frantically searching through a "dense thicket of irrelevant data" to confidently navigating a "well-lit path to the root cause."</p>
<p>Many of us have experienced the pitfalls of the "instrument everything" mantra. While well-intentioned, it often leads to an "overgrown jungle of telemetry data," where critical signals are drowned out by noise. Purposeful instrumentation, in contrast, encourages us to strategically gather the <em>right</em> data. This isn't just about digital decluttering; it yields tangible benefits: <strong>reduced noise, faster troubleshooting, and improved clarity and maintainability</strong> in our systems.</p>
<p>This post moves from philosophy to practice. We'll dive into concrete examples and techniques, showcasing how to apply purposeful instrumentation in real-world scenarios—from initial telemetry design to ongoing pipeline adjustments and even code-level optimizations.</p>
<h2 id="heading-designing-telemetry-with-nasas-rigor"><strong>Designing Telemetry with NASA's Rigor</strong></h2>
<p>When we think about systems operating under the most severe limitations, spacecraft telemetry, particularly from missions like NASA's Mars rovers, offers <a target="_blank" href="https://www-robotics.jpl.nasa.gov/media/documents/Flight_Software_Case_Study_Spacecraft_Telemetry.pdf">profound inspiration</a>. The extreme constraints of space exploration—limited bandwidth, power, and processing capabilities—force engineers to meticulously justify and optimize every single bit of data transmitted. For observability engineers on Earth, even without such stark limitations, these practices offer invaluable lessons in <strong>cultivating efficiency</strong>.</p>
<p>Here are some key takeaways:</p>
<ul>
<li><p><strong>Data Type Optimization</strong>: Spacecraft systems often convert 64-bit floating-point numbers to 32-bit or even 16-bit integers. Sometimes, scaled integers (like centi-degrees Celsius) are used to preserve essential precision while drastically reducing data volume. For our enterprise systems, this prompts a critical question: Do we <em>really</em> need microsecond precision for every timer, or would seconds suffice for certain metrics, thereby reducing storage and processing overhead?</p>
</li>
<li><p><strong>Bit Packing and Enumerated Types</strong>: To save space, boolean flags and enumerated values with a limited set of states are often packed into smaller integer types on spacecraft. For example, 15 distinct safety checks might be encoded into a single 16-bit integer. This principle is directly applicable to software telemetry, particularly in how we design attributes to <strong>reduce cardinality</strong> and data volume. Instead of verbose string representations for statuses, can an enumerated integer suffice?</p>
</li>
<li><p><strong>Configurable Data Collection</strong>: Spacecraft aren't static in their data collection. They possess "knobs" that allow operators to increase data verbosity for anomaly investigations, switching between "Brief records" for nominal operations and "Verbose records" when digging deeper. This mirrors the need in our systems for dynamic control over telemetry, perhaps adjusting log levels or sampling rates based on operational context rather than maintaining a constant, high-volume stream.</p>
</li>
<li><p><strong>Summary Data and Compression</strong>: Reporting small, high-level summary data packets independently from detailed diagnostic data products allows for quick operational decision-making. If summaries are nominal, large, detailed data products might even be discarded to save precious bandwidth. Lossless compression is also a standard practice, always balancing the CPU cost of compression/decompression against bandwidth savings.</p>
</li>
<li><p><strong>The "Very Small Products" Problem</strong>: Interestingly, generating a multitude of tiny data products can be inefficient, consuming storage slots and impacting system performance, as was observed with the Mars 2020 rover's packetizer. This highlights the importance of <strong>batching and aggregation</strong> not just for network efficiency but also for processing and storage optimization within our telemetry pipelines. The OpenTelemetry Collector’s batch processor is a prime example of applying this principle.</p>
</li>
</ul>
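<p>The first two techniques translate directly into ordinary application code. Here is a minimal, hypothetical sketch of scaled integers and bit packing (the field layout and helper names are invented for illustration):</p>

```python
def encode_temperature(celsius):
    """Scaled integer: centi-degrees Celsius fit in 16 bits instead of a 64-bit float."""
    return round(celsius * 100)           # 23.47 C -> 2347

def decode_temperature(raw):
    return raw / 100

def pack_flags(flags):
    """Pack up to 16 boolean checks into a single 16-bit integer."""
    value = 0
    for i, flag in enumerate(flags):
        if flag:
            value |= 1 << i               # one bit per check
    return value

def flag_is_set(value, index):
    return bool(value >> index & 1)

packed = pack_flags([True, False, True])  # bits 0 and 2 set -> 5
print(packed)
print(decode_temperature(encode_temperature(23.47)))
```

<p>Whether this kind of packing is worth the decode complexity depends on your pipeline; for most backend services, the bigger win is simply asking whether the extra precision is needed at all.</p>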
<p>These extreme examples from NASA underscore a fundamental discipline: diligently asking, "What data do I <em>really</em> need?" and "What is the cost versus the value?" This scrutiny is crucial for building sustainable and effective telemetry strategies, ensuring we're not just collecting data, but harvesting actionable insights.</p>
<h2 id="heading-tuning-automatic-instrumentation-for-precision-with-opentelemetry"><strong>Tuning Automatic Instrumentation for Precision with OpenTelemetry</strong></h2>
<p>OpenTelemetry's <a target="_blank" href="https://opentelemetry.io/docs/zero-code/">auto-instrumentation</a> agents are a massive boon, offering broad telemetry coverage for popular libraries and frameworks with minimal upfront effort. It’s tempting to see this as "zero code, zero thought." However, this convenience doesn't absolve us from the need for <strong>purposeful configuration</strong>. Blindly enabling instrumentation for every conceivable library can quickly lead back to that "overgrown jungle of telemetry data," swamping your systems with noise and incurring unnecessary costs.</p>
<ul>
<li><p><strong>Review Default Configurations</strong>: Auto-instrumentation defaults are often tuned for maximum coverage, which might not align with your specific observability goals or the critical paths of your application. As <a target="_blank" href="https://youtu.be/QzStkLbA7Qk">Elena Kovalenko of Delivery Hero noted</a>, unconfigured auto-instrumentation can generate extremely high cardinality and massive data volumes, potentially overloading collectors and backend systems. It’s vital to treat the default settings as a starting point, not a final destination.</p>
</li>
<li><p><strong>Selectively Disable Unnecessary Instrumentation</strong>: Most OpenTelemetry auto-instrumentation agents allow for fine-grained control, enabling you to disable instrumentation for components that are irrelevant to your critical diagnostic paths or those known to produce excessive, low-value data.</p>
<ul>
<li><strong>Concrete Example: Suppressing JDBC Telemetry</strong>: If your primary diagnostic focus is at the service interaction level, the verbose telemetry generated by JDBC instrumentation (tracing every database call) might be more noise than signal. With the OpenTelemetry Java agent, for instance, you can easily <a target="_blank" href="https://opentelemetry.io/docs/zero-code/java/agent/disable/">disable this by setting</a> the environment variable <code>OTEL_INSTRUMENTATION_JDBC_ENABLED=false</code>. This targeted <strong>pruning</strong> ensures that resources aren't wasted collecting, processing, and storing data that doesn't contribute significantly to your understanding of system health.</li>
</ul>
</li>
</ul>
<p>Auto-instrumentation plants the seeds of visibility; purposeful configuration helps you cultivate the desired crop, ensuring a healthy yield of actionable insights rather than a field of weeds.</p>
<h2 id="heading-optimizing-data-flow-with-the-opentelemetry-collector-pipeline-adjustments"><strong>Optimizing Data Flow with the OpenTelemetry Collector: Pipeline Adjustments</strong></h2>
<p>The OpenTelemetry Collector is more than just a telemetry forwarder; it's a powerful, vendor-agnostic control plane. It’s a great place to implement purposeful telemetry strategies by filtering, sampling, enriching, and transforming data <em>before</em> it even reaches your observability backends. Let's look at how sophisticated organizations are leveraging the Collector.</p>
<h3 id="heading-ebays-journey-scaling-distributed-tracing-with-cost-optimization"><strong>eBay's Journey: Scaling Distributed Tracing with Cost Optimization</strong></h3>
<p><a target="_blank" href="https://youtu.be/qq8hTct8zm4?si=t5pnl-tyV9-WQ6VP">Handling telemetry at eBay's scale</a>—ingesting 6.5 million spans per second—necessitates highly judicious instrumentation and aggressive optimization. They faced challenges with broken call chains due to context propagation issues and the difficulty of applying uniform sampling across APIs with vastly different traffic volumes.</p>
<p>Their approach to sampling evolved:</p>
<ol>
<li><p><strong>Initial Strategy</strong>: They started with head sampling at the client (e.g., 2% of requests) combined with parent-based sampling to ensure entire traces were captured if any part was sampled.</p>
</li>
<li><p><strong>Adding Tail Sampling</strong>: After that, they employed a tail-sampling strategy to retain "interesting" traces—those with errors, high latency, or specific critical attributes—along with a baseline 1% of successful traces, storing these for 14 days. This allowed them to focus retention on the most valuable diagnostic data.</p>
</li>
<li><p><strong>Evolving Tail Sampling with OTel Collector</strong>: Recognizing the significant memory and complexity challenges of performing in-memory tail sampling within the OpenTelemetry Collector for long-duration traces or requests spanning multiple clusters, eBay pivoted. They now leverage <strong>exemplars from metrics</strong> to identify traces of interest. These traces are then copied from a raw trace table to a sampled table after a 10-15 minute delay. This innovative, storage-based tail sampling approach demonstrates a mature balance between comprehensive diagnostic capability and cost control.</p>
</li>
</ol>
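<p>For reference, the first two stages map naturally onto the Collector's <code>tail_sampling</code> processor. The sketch below is illustrative (the thresholds are invented, not eBay's actual configuration); policies are OR-ed, so a trace kept by any policy survives:</p>

```yaml
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 2000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
```

<p>As eBay found, this in-memory approach requires all spans of a trace to reach the same Collector instance within the decision window, which is exactly the limitation that pushed them toward storage-based tail sampling.</p>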
<h3 id="heading-tomtoms-centralized-control-enforcing-governance-and-flexibility"><strong>TomTom's Centralized Control: Enforcing Governance and Flexibility</strong></h3>
<p><a target="_blank" href="https://engineering.tomtom.com/opentelemetry-simpifying-observability/">TomTom implemented</a> a <strong>centralized OpenTelemetry Collector Service</strong> that acts as a gateway between their internal applications and various SaaS observability platforms. This central hub provides several advantages:</p>
<ul>
<li><p><strong>Governance and Standardization</strong>: It allows them to enforce authentication, manage general configurations like batching and encryption consistently, and, crucially, handle <strong>data enrichment and manipulation</strong> centrally.</p>
</li>
<li><p><strong>Filtering and PII Redaction</strong>: They use the <code>filter</code> processor to drop noisy or irrelevant logs (e.g., from specific Kubernetes namespaces). For sensitive data, a combination of the <code>transform</code> and <code>attributes</code> processors is used to redact Personally Identifiable Information (PII) before telemetry leaves their trust boundary.</p>
</li>
<li><p><strong>Telemetry Enrichment</strong>: Data is enriched with valuable metadata, such as an "owner" label, which provides better context during troubleshooting and improves accountability.</p>
</li>
<li><p><strong>Strategic Benefits</strong>: This centralized model offers flexibility in switching telemetry backends, enforces data governance policies, and has proven critical for cost control and maintaining data quality at an enterprise scale.</p>
</li>
</ul>
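<p>A hedged sketch of what such gateway processing can look like (the namespace, regex, and owner value are invented; the <code>filter</code>, <code>transform</code>, and <code>attributes</code> processors are the actual Collector components involved):</p>

```yaml
processors:
  filter/drop-noisy:
    logs:
      log_record:
        - 'resource.attributes["k8s.namespace.name"] == "dev-sandbox"'
  transform/redact-pii:
    log_statements:
      - context: log
        statements:
          - replace_pattern(body, "\\d{3}-\\d{2}-\\d{4}", "[REDACTED]")
  attributes/enrich:
    actions:
      - key: owner
        value: team-maps
        action: upsert
```

<p>In the <code>filter</code> processor, records matching a condition are dropped; the <code>attributes</code> processor's <code>upsert</code> action adds the owner label centrally, so individual teams don't have to remember to do it.</p>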
<p>These real-world examples illustrate the power of the OpenTelemetry Collector as a central point for <strong>cultivating telemetry quality</strong>.</p>
<h2 id="heading-crafting-intentional-manual-instrumentation-ollygardens-example"><strong>Crafting Intentional Manual Instrumentation: OllyGarden’s Example</strong></h2>
<p>While auto-instrumentation provides breadth, manual instrumentation offers depth and precision. But even here, more isn't always better. A common pitfall is "over-spanning": creating an excessive number of highly granular spans for minor, sequential internal operations within a single logical unit of work. This can obscure the true flow of a request, add unnecessary overhead, and make traces harder to interpret—akin to "wandering aimlessly in the woods" instead of following a clear path. For example, a single logical <code>onTraces</code> operation might be fragmented into several child spans for <code>processResourceSpans</code>, cluttering the trace view and inflating span counts unnecessarily.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749028779417/6c876cb5-54e0-4774-8546-0780f772c774.png" alt class="image--center mx-auto" /></p>
<p>Here’s the original Go code we wrote and landed in production:</p>
<pre><code class="lang-go">    ctx, span := telemetry.Tracer().Start(ctx, <span class="hljs-string">"tendril.processResourceSpans"</span>)
    <span class="hljs-keyword">defer</span> span.End()

    <span class="hljs-comment">// Extract service information from resource</span>
    svcName := getResourceString(rs.Resource(), attrServiceName)
    span.SetAttributes(attribute.String(<span class="hljs-string">"service.name"</span>, svcName))

    svcVersion := getResourceString(rs.Resource(), attrServiceVersion)
    span.SetAttributes(attribute.String(<span class="hljs-string">"service.version"</span>, svcVersion))

    svcEnv := getResourceString(rs.Resource(), attrEnvironmentName)
    span.SetAttributes(attribute.String(<span class="hljs-string">"deployment.environment.name"</span>, svcEnv))
</code></pre>
<p><strong>The Purposeful Solution: Leveraging Span Events</strong></p>
<p>Instead of creating distinct child spans for every micro-step, it's often far more effective to <strong>consolidate these internal milestones as span events within a single, overarching span</strong> that represents the larger logical operation. This aligns with the core principle of choosing the "most effective signal type for your defined purpose." Logs provide detailed context for discrete occurrences, metrics track aggregatable trends, and traces show flow; span events offer a way to add rich, contextual markers to a span without creating new ones.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749028851145/a7995198-ff05-4582-abbb-e0fcb85ea1e1.png" alt class="image--center mx-auto" /></p>
<p>And here’s the code after the fine-tuning:</p>
<pre><code class="lang-go">    span := trace.SpanFromContext(ctx)

    <span class="hljs-comment">// Extract service information from resource</span>
    svcName := getResourceString(rs.Resource(), attrServiceName)
    svcVersion := getResourceString(rs.Resource(), attrServiceVersion)
    svcEnv := getResourceString(rs.Resource(), attrEnvironmentName)

    span.AddEvent(<span class="hljs-string">"processing resource spans for service"</span>, trace.WithAttributes(
        attribute.String(<span class="hljs-string">"service.name"</span>, svcName),
        attribute.String(<span class="hljs-string">"service.version"</span>, svcVersion),
        attribute.String(<span class="hljs-string">"deployment.environment.name"</span>, svcEnv),
    ))
</code></pre>
<p><strong>Benefits of Using Span Events Over Excessive Child Spans:</strong></p>
<ul>
<li><p><strong>Clearer Trace Representation</strong>: A single span with well-defined events provides a <strong>cleaner, more focused view</strong> of a component's internal workings within the context of the larger trace. This gives a "well-lit path" to understanding that component's behavior.</p>
</li>
<li><p><strong>Reduced Overhead and Cost</strong>: Span events are generally lighter-weight than full spans. This translates to <strong>reduced data volume</strong> and consequently <strong>lower processing and storage costs</strong> in your observability backend.</p>
</li>
<li><p><strong>Enhanced Context</strong>: Events, with their associated attributes, allow you to capture crucial details (e.g., input size, processing duration for a specific sub-task, success/failure flags) at precise points within the operation, without fragmenting the trace into many tiny pieces.</p>
</li>
</ul>
<h2 id="heading-conclusion-towards-insightful-and-economical-observability"><strong>Conclusion: Towards Insightful and Economical Observability</strong></h2>
<p>Moving from indiscriminate data collection to <strong>purposeful software telemetry</strong> is more than an engineering exercise; it's a strategic imperative. It ensures that our substantial investments in observability deliver tangible business value—faster incident resolution, optimized performance, and controlled costs—rather than just overwhelming data lakes.</p>
<p>This journey of <strong>continuous cultivation</strong> is not a one-off task. It requires ongoing review, governance, and a feedback loop where insights from incidents, performance anomalies, and cost reports are fed back into your instrumentation design and data pipeline policies. As your systems evolve, so too must your telemetry strategy.</p>
<p>The guiding questions we discussed in our previous post remain your most valuable tools:</p>
<ul>
<li><p>"What question are we trying to answer with this data?"</p>
</li>
<li><p>"What data do we <em>truly</em> need, and at what precision?"</p>
</li>
<li><p>"Why <em>this specific signal type</em> (metric, log, trace, event)?"</p>
</li>
<li><p>"How will this data actually be <em>used</em> and by whom?"</p>
</li>
<li><p>"And critically, what is its ongoing <em>cost versus its value</em>?"</p>
</li>
</ul>
<p>By consistently applying this critical lens, engineering teams can cultivate an observability practice that is not only powerful and insightful but also sustainable and economically sound. This deliberate, adaptive, and insight-driven approach is the future of effective software observability. OllyGarden is committed to being a neutral and valuable partner in this ecosystem, helping you analyze, optimize, and manage your OpenTelemetry pipelines to harvest the richest insights efficiently.</p>
]]></content:encoded></item><item><title><![CDATA[Purposeful Instrumentation]]></title><description><![CDATA[It’s the middle of the night. An alert jolts you awake – a critical service is sputtering. Your mind races as you dive into a labyrinth of dashboards, logs, and traces. Are you navigating a well-lit path to the root cause, or are you lost in a dense ...]]></description><link>https://blog.olly.garden/purposeful-instrumentation</link><guid isPermaLink="true">https://blog.olly.garden/purposeful-instrumentation</guid><category><![CDATA[instrumentation]]></category><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Thu, 01 May 2025 08:01:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746086306894/23a8b470-5d3d-4e77-865c-79f74ef0450d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It’s the middle of the night. An alert jolts you awake – a critical service is sputtering. Your mind races as you dive into a labyrinth of dashboards, logs, and traces. Are you navigating a well-lit path to the root cause, or are you lost in a dense thicket of irrelevant data? In these high-pressure moments, the quality – not just the quantity – of your observability instrumentation is what truly counts.</p>
<p>Many teams, in their quest for visibility, fall into the trap of "instrument everything." The intention is good, but the result is often an overgrown jungle of telemetry data: noisy metrics, verbose logs, and sprawling traces that obscure rather than illuminate, or perhaps a monoculture of one of those telemetry data types. This is where the practice of Purposeful Instrumentation comes in – a disciplined approach to cultivating high-quality observability signals. It's about moving beyond simply collecting data to strategically gathering the right data to understand system health, optimize performance, and troubleshoot effectively. Think of it as tending a garden: you don't just let everything grow wild; you carefully select, nurture, and prune to ensure a healthy and productive yield. It's fundamentally about quality over quantity, having the telemetry you need, without excess.</p>
<h2 id="heading-why-prune-the-noise">Why Prune the Noise?</h2>
<p>Adopting a purposeful approach isn't just about tidiness; it delivers tangible benefits that directly impact your team's effectiveness and your organization's bottom line.</p>
<ol>
<li><p><strong>Reduced Noise &amp; Increased Signal:</strong> Over-instrumentation creates a cacophony. Imagine trying to hear a single bird's song in the middle of a roaring stadium. Purposeful instrumentation acts like a filter, silencing the distracting roar and amplifying the signals that truly indicate system behavior and potential issues. You focus your resources on telemetry that provides genuine insight, making it easier to spot anomalies and trends.</p>
</li>
<li><p><strong>Faster Troubleshooting &amp; Resolution:</strong> When an incident occurs, time is critical. Sifting through irrelevant data wastes precious minutes, if not hours. With instrumentation designed to answer specific questions or diagnose common failure modes, you have targeted data trails leading you directly towards the problem's source. It’s the difference between wandering aimlessly in the woods and following a clearly marked trail.</p>
</li>
<li><p><strong>Significant Cost Optimization:</strong> Telemetry data isn't free. Storage, processing, and analysis all incur costs, which can escalate rapidly with high data volumes and cardinality. Instrumenting only what provides clear value ensures you're not paying to store noise. This optimizes your observability spend and demonstrably increases the return on your investment (ROI). Think of it as allocating water and fertilizer only to the plants you intend to grow.</p>
</li>
<li><p><strong>Improved Clarity &amp; Maintainability:</strong> Code cluttered with arbitrary instrumentation is harder to read, understand, and maintain. When instrumentation is added with clear intent, documented appropriately (even if informally via commit messages or code comments), it serves as a form of living documentation. Future engineers (including your future self!) can readily grasp <em>why</em> a particular metric, span, or log statement exists and how it contributes to understanding the system.</p>
</li>
</ol>
<h2 id="heading-guiding-questions-for-purposeful-instrumentation">Guiding Questions for Purposeful Instrumentation</h2>
<p>Before adding <em>any</em> new metric, span, span event, or log line, pause and cultivate intention by asking critical questions:</p>
<ul>
<li><p><strong>What question am I trying to answer?</strong> This is the cornerstone. Are you trying to understand latency distribution, error rates under specific conditions, resource consumption patterns, or the flow of requests across services? Defining the question sharpens the focus of your instrumentation. Don't aim to predict every <em>possible</em> future question, but consider the <em>types</em> of questions most likely to arise based on the service's function and history. What are the known failure modes or performance bottlenecks for this component?</p>
</li>
<li><p><strong>What data do I <em>really</em> need to answer this?</strong> Challenge the defaults. Do you need millisecond precision, or would seconds suffice? Do you need the full user ID (potentially creating high cardinality), or could you use a user <em>type</em> or a randomized cohort ID? Can data be aggregated at the source to reduce volume and cardinality? For example, instead of logging every request, could you use metrics instead with enough labels to distinguish between outcomes?</p>
</li>
<li><p><strong>Why <em>this</em> type of signal (Metric, Trace, Log)?</strong> Each signal type has strengths. Metrics are great for aggregatable trends and alerting (e.g., overall request rate). Traces excel at illustrating request flows and latency breakdowns across distributed systems. Logs provide detailed, event-specific context, especially for non-transaction data (configuration changes, connections/disconnections to databases, …). Are you choosing the most effective signal type for your defined purpose? Adding high-cardinality attributes to metrics intended for aggregation, for instance, is often an anti-pattern. Creating a span when a simple event on an existing span would suffice adds unnecessary overhead.</p>
</li>
<li><p><strong>How will this data actually be <em>used</em>?</strong> Will this feed a critical dashboard panel? Trigger an alert? Be used primarily for ad-hoc debugging during incidents? Understanding the consumption pattern helps determine the required granularity, retention, and format. How do you envision it being visualized or queried? Instrumenting data that no one knows how to use or interpret is like planting seeds you never intend to water. Again, don’t aim to predict exactly how things will be used, but having an idea helps set the direction.</p>
</li>
<li><p><strong>What is the <em>cost</em> versus the <em>value</em>?</strong> Consider the compute resources needed to generate the data, the network bandwidth to transmit it, and the storage/processing costs in your observability backend. Is the potential insight or troubleshooting value gained worth this ongoing cost? Regularly reassess this balance, especially for verbose or high-frequency telemetry.</p>
</li>
</ul>
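<p>The "aggregate at the source" idea from the second question can be sketched in a few lines of Go (the bucket fields are illustrative; a real implementation would feed an OpenTelemetry counter instrument rather than a plain map):</p>

```go
package main

import "fmt"

// outcomeKey identifies a pre-aggregation bucket. Both fields have a
// bounded set of values, keeping cardinality low; the names are
// illustrative, not taken from any semantic convention.
type outcomeKey struct {
	route   string
	outcome string // e.g. "ok", "client_error", "server_error"
}

// counter aggregates at the source: one integer per bucket instead of
// one log record per request.
type counter map[outcomeKey]int

func (c counter) record(route, outcome string) {
	c[outcomeKey{route, outcome}]++
}

func main() {
	c := counter{}
	c.record("/checkout", "ok")
	c.record("/checkout", "ok")
	c.record("/checkout", "server_error")
	fmt.Println(c[outcomeKey{route: "/checkout", outcome: "ok"}]) // 2
}
```

<p>Three requests collapse into two buckets. The shape is the point: a handful of bounded buckets per flush interval instead of one record per request.</p>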
<p>You’ll never get the perfect balance on the first shot. In fact, even if you get into a perfect state today, it won’t be suitable anymore tomorrow as systems evolve. Keep an open mind and add things that are going in the direction of what you believe you’ll need. Adding too much fertilizer does hurt your garden.</p>
<h2 id="heading-applying-purposefulness-in-practice">Applying Purposefulness in Practice</h2>
<p>Purposeful instrumentation isn't just a theoretical concept; it's applied through conscious choices during development and operation.</p>
<ul>
<li><p><strong>Manual Instrumentation:</strong> When you manually add code to emit telemetry (e.g., using OpenTelemetry APIs), be explicit. Add enough detail explaining the 'why' behind non-obvious metrics or attributes. Document the intended use case, especially for custom, high-value signals. This foresight is invaluable during incident response or later refactoring, and writing it down is a great exercise for reasoning about the instrumentation in the first place.</p>
</li>
<li><p><strong>Use semantic conventions in your favor:</strong> not only do they tell you how things should be named, they also help you brainstorm what kind of instrumentation to add. For instance, are you adding <code>deployment.environment.name</code> to your resource attributes?</p>
</li>
<li><p><strong>Auto-Instrumentation:</strong> Tools like OpenTelemetry's auto-instrumentation agents (e.g., the Java agent) are powerful, providing broad coverage with minimal effort. However, "zero code" doesn't mean "zero thought." Don't blindly enable every single instrumentation library offered. Review the default configuration. Can you disable instrumentation for components irrelevant to your critical paths (e.g., verbose JDBC logging if you primarily diagnose issues at the service level)? Can you configure sampling decisions more intelligently? Tune or suppress instrumentation known to generate excessive noise or high-cardinality data that bloats costs without commensurate value. Auto-instrumentation provides the seeds; purposeful configuration helps you cultivate the desired crop.</p>
</li>
<li><p><strong>Regular Review &amp; Weeding:</strong> Instrumentation needs aren't static. Systems evolve, code gets refactored, and priorities shift. Schedule periodic reviews (e.g., quarterly) of your existing telemetry. It might still be a bit early, but consider using <a target="_blank" href="https://github.com/open-telemetry/weaver">OTel Weaver</a> to help you here. Ask: Are there metrics, logs, or trace attributes that haven't been queried or looked at in months? Be ruthless about pruning unused or redundant instrumentation. This ongoing "weeding" keeps your observability garden healthy, cost-effective, and focused on yielding insights.</p>
</li>
</ul>
<h2 id="heading-reaping-the-benefits-of-critical-scrutiny">Reaping the Benefits of Critical Scrutiny</h2>
<p>Consistently applying a critical, purposeful lens to your instrumentation strategy, whether manual or automatic, transforms observability from a potential data swamp into a beautiful field full of data ready to harvest. It ensures your telemetry remains:</p>
<ul>
<li><p><strong>Focused:</strong> Directly addressing key operational questions and business KPIs.</p>
</li>
<li><p><strong>Relevant:</strong> Aligned with current system architecture and troubleshooting needs.</p>
</li>
<li><p><strong>Cost-Effective:</strong> Providing maximum insight for the resources invested.</p>
</li>
<li><p><strong>Actionable:</strong> Enabling swift diagnosis, resolution, and performance optimization.</p>
</li>
</ul>
<p>By consciously choosing <em>what</em> to plant in your observability garden and <em>why</em>, you cultivate a rich harvest of insights.</p>
<h2 id="heading-next-up">Next up</h2>
<p>What does purposeful instrumentation <em>actually</em> look like in real code, and how do we correct existing instrumentation that might not be that useful? Stay tuned for our next article, where we'll walk through concrete examples of common instrumentation pitfalls and the precise steps to fix them, both directly at the source and with the OTel Collector.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In today's complex, distributed systems, observability is non-negotiable. But the path to enlightenment isn't paved with sheer data volume. It's built on the foundation of <strong>purposeful instrumentation</strong> – the deliberate act of gathering the <em>right</em> signals to illuminate system behavior.</p>
<p>By embedding the practice of asking "Why this signal? Why now? How will it help?" into our development workflow, we shift from reactive data collection to proactive insight generation. We reduce noise, accelerate troubleshooting, control costs, and ultimately, build more reliable and performant software.</p>
<p>So, the next time you reach for that instrumentation library or add that log line, take a moment. Pause. Ask yourself: "What is my purpose?". Cultivate clarity, and you'll reap the rewards of truly effective observability.</p>
<p><em>Acknowledgement: The concept of intentional instrumentation gained prominence for me through a conversation with Adriel Perkins, which evolved into purposeful instrumentation.</em></p>
]]></content:encoded></item><item><title><![CDATA[There's a Lot of Bad Telemetry Out There]]></title><description><![CDATA[Ninety percent. That's the number the founder of an observability company mentioned some time ago when talking about telemetry data that withers away unused. It is data created, collected, transmitted, and stored, without ever blooming into a dashboa...]]></description><link>https://blog.olly.garden/theres-a-lot-of-bad-telemetry-out-there</link><guid isPermaLink="true">https://blog.olly.garden/theres-a-lot-of-bad-telemetry-out-there</guid><category><![CDATA[observability]]></category><category><![CDATA[telemetry]]></category><category><![CDATA[OpenTelemetry]]></category><dc:creator><![CDATA[Juraci Paixão Kröhling]]></dc:creator><pubDate>Fri, 28 Mar 2025 13:21:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1743091623264/5303c3c0-1e2c-42ae-a6e9-ee965a970a67.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ninety percent. That's the number the founder of an observability company mentioned some time ago when talking about telemetry data that withers away unused. It is data created, collected, transmitted, and stored, without ever blooming into a dashboard, alert, or query result. Our monitoring tools keep planting more and more data seeds, hoping they'll sprout into useful insights during a 2am production incident. Like planting without proper soil analysis, these generic instrumentation tools create low-quality telemetry that isn't suited for our environment. Traces from health check endpoints are like weeds taking up valuable space.</p>
<p>Don’t get me wrong, this is still better than nothing! However, many companies have matured past this stage and need to cultivate higher quality telemetry.</p>
<p>Telemetry is essential for us, observability engineers, to understand how our systems are performing. It's the soil from which observability and modern monitoring blossom. Despite its importance, a significant amount of telemetry data out there is, to put it bluntly, bad. This post will explore what good telemetry looks like, digging into the problems caused by bad telemetry, examine the root causes, and plant a thought on how to cultivate higher quality data.</p>
<h1 id="heading-what-is-good-telemetry">What is Good Telemetry?</h1>
<p>Good telemetry is characterized by providing an <strong>accurate</strong>, <strong>relevant</strong>, <strong>timely</strong>, and <strong>actionable</strong> picture of a system's health and performance. It offers <strong>just enough</strong> information — not too much or too little — to allow for improved troubleshooting, quick identification and resolution of issues, and a deeper understanding of the system. Additionally, good telemetry facilitates faster incident response, minimizing downtime and service disruptions, and bears fruit in the form of data-driven decisions that optimize performance.</p>
<p>Examples include traces that show the path of a request through a system, metrics that accurately reflect resource utilization, and logs that provide context for errors and warnings. Quite frequently, good telemetry is used to provide business insights along with operational data.</p>
<h1 id="heading-bad-telemetry">Bad Telemetry</h1>
<p>Bad telemetry is the opposite of good telemetry: inaccurate, irrelevant, old, non-actionable, or far too much in terms of volume.</p>
<ul>
<li><p><strong>Inaccurate</strong> telemetry happens when we have the wrong values, leading us to wrong conclusions. For instance, concurrently counting the number of leaves in a tree and storing the counter in a non-concurrent data structure will eventually result in the wrong number of leaves being reported.</p>
</li>
<li><p><strong>Irrelevant</strong> telemetry doesn’t provide any meaningful insight. It might be interesting to know that I have three boxes of kiwis, but without knowing the size of the box, that information is meaningless.  </p>
</li>
<li><p><strong>Incomplete</strong> telemetry means that we can’t determine the root cause of a problem. How can I tell whether my strawberries received enough sun if I’m not recording the amount of sun they received? How can I tell why my tulips didn’t grow if I don’t know whether they were planted in the first place?  </p>
</li>
<li><p><strong>Old</strong> data is bad telemetry because it means I’m only taking action when it might be too late. There’s no point in employing a scarecrow after the birds have eaten all the seeds.</p>
</li>
<li><p>Having <strong>too much</strong> data can also make it harder to find the real information we are looking for. It’ll definitely take us longer to find a magic herb in the middle of overgrown bushes.</p>
</li>
</ul>
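<p>The first bullet can be made concrete in Go: concurrent increments to a plain variable silently lose updates, while an atomic counter stays exact. A minimal sketch, mirroring the leaf-counting example above:</p>

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countConcurrently simulates many goroutines reporting into one shared
// counter. atomic.Int64 makes the result exact; a plain int64 incremented
// with ++ from multiple goroutines would lose updates under contention.
func countConcurrently(writers, perWriter int) int64 {
	var leaves atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWriter; j++ {
				leaves.Add(1)
			}
		}()
	}
	wg.Wait()
	return leaves.Load()
}

func main() {
	fmt.Println(countConcurrently(8, 10000)) // always 80000
}
```

<p>Swap <code>atomic.Int64</code> for a plain <code>int64</code> and the reported total will typically fall short of 80000: exactly the kind of quietly wrong number that misleads an investigation.</p>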
<p>One interesting consideration is that the definition of bad telemetry also varies based on the backend we are using. Rice needs tons of water, while purple coneflower will probably complain about it. For a time-series database, for example, a high-cardinality metric is certainly not desirable.</p>
<h1 id="heading-consequences-of-bad-telemetry">Consequences of bad telemetry</h1>
<p>We, as an industry, have tolerated bad telemetry as a price to pay for having telemetry in the first place. However, bad telemetry isn’t just bad: it’s the weed in our garden. Misleading insights from erroneous data can lead to poor decisions, and time and effort can be wasted analyzing useless data. Additionally, bad telemetry can result in slower problem resolution, due to the difficulty in identifying and fixing issues. It can lead to poor decision-making and incorrect business strategies based on faulty metrics.</p>
<p>Perhaps more relevant for today’s economic realities, bad telemetry often means paying way more for observability than you should: high egress costs, overprovisioned infrastructure, complex pipelines, and unreasonably big checks to observability vendors.</p>
<h1 id="heading-root-causes-of-bad-telemetry">Root Causes of Bad Telemetry</h1>
<p>Many developers and engineers lack a full understanding of the principles of good telemetry. Observability is not taught at university alongside operating systems or databases, so it's not surprising that engineers don't learn how to properly instrument their applications.</p>
<p>Like security, observability is typically a discipline that engineers come across as they get more experienced, once they’ve been burned by bug reports they couldn’t reproduce and the frustrated users that followed. Insufficient planning is another issue: telemetry is often an afterthought rather than a core part of system design. Poor implementation is a factor too, as instrumentation can be complex and mistakes are easily made. Finally, telemetry systems require ongoing tending and upkeep, and inadequate maintenance can lead to an infestation of bad telemetry. Instrumenting complex systems is particularly challenging: tasks like propagating context correctly across distributed services are crucial for accurate tracing.</p>
<p>It doesn’t help that things are still moving really fast in observability: best practices can suddenly become anti-patterns, and new areas lack standardization. Applications making use of generative AI tools or LLMs need to be instrumented today, but the standards for instrumenting those components are still being worked on. Without clear industry guidance, people have to make decisions based on their own experience (or their vendor’s suggestions), which quite often means they’ll be at odds with the standard once it is created. OpenTelemetry semantic conventions definitely help here, but while OTel makes constant progress, we don’t yet have stable conventions for everything that matters out there.</p>
<h1 id="heading-how-to-improve-telemetry-quality">How to Improve Telemetry Quality</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743190253404/11f25ca7-aee5-4f56-abc7-af4b01a0ac27.png" alt class="image--center mx-auto" /></p>
<p>We are at the beginning of this journey towards high quality telemetry, and while there’s a lot to learn, I believe that we have enough to start changing the status quo. We already know a few anti-patterns, like the ones we mentioned earlier. We can also be opinionated about some solutions, such as OpenTelemetry, and gather insights based on those strong opinions. Perhaps all my telemetry should have a <code>service.name</code> and <code>service.version</code> field? Perhaps some resource attributes, like <code>process.executable.path</code>, should be filtered out at the source when going to a time-series metrics backend?</p>
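<p>As a sketch of that last idea, a minimal OpenTelemetry Collector configuration could use the <code>resource</code> processor to delete such an attribute before it reaches a time-series backend. The endpoint below is a placeholder, and your own receivers and exporters will differ:</p>
<pre><code class="lang-yaml">receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Drop a high-noise resource attribute before it becomes a metric label.
  resource:
    attributes:
      - key: process.executable.path
        action: delete

exporters:
  prometheusremotewrite:
    endpoint: https://metrics.example.com/api/v1/write  # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resource]
      exporters: [prometheusremotewrite]
</code></pre>
<p>Because the processor is attached only to the metrics pipeline, logs and traces flowing through other pipelines would keep the full resource context where it’s still useful.</p>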
<p>Improving telemetry quality is a long-term activity, and I believe we’ll never have perfectly good telemetry. New services come and go every day, new versions are deployed all the time, and they all bring new telemetry with them, which is likely to be imperfect.</p>
<p>It’s our job as observability engineers to understand what is good and what is bad telemetry, knowing our tools and how they’ll behave with the telemetry we are sending to them. Once we have our own recipes for what’s good and what’s bad, it’s a matter of looking into our telemetry and applying that knowledge, cross-pollinating with the engineers doing the instrumentation. Very likely, they are the ones making the actual changes, or at least reviewing the pull requests we send their way to improve the instrumentation of their code.</p>
<p>The question that remains is: how do we show “progress”? Are we really better off with good telemetry, or can we just survive with bad telemetry? Instinctively, we know that good telemetry is more efficient than bad telemetry, but we should be ready to measure the impact.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Bad telemetry remains a pervasive challenge with significant consequences, one that we, as observability engineers, have often tolerated or underestimated. This is partly due to a lack of clear understanding of what constitutes good telemetry and partly due to the absence of robust tools and processes for detection and remediation. However, by actively defining and implementing standards for good telemetry, identifying the root causes of poor data quality (like pulling out weeds), and leveraging tools like OpenTelemetry, we can unlock the full potential of observability. This means reduced alert fatigue, faster incident resolution, and ultimately, more reliable and performant systems. Pioneering teams are already demonstrating the value of this approach, implementing semantic conventions and data quality pipelines that yield healthy, actionable insights. It's time for each of us to assess our own telemetry landscape, advocate for better instrumentation practices within our teams, and ensure that the data we rely on is truly serving our needs.</p>
<h1 id="heading-credits">Credits</h1>
<p>Dan Blanco, my colleague at OpenTelemetry’s Governance Committee, is the author of the quote “There’s a lot of bad telemetry out there,” and he used it when I was describing what we are building at OllyGarden. He immediately understood our value proposition, wrapping it up with this quote. I like it: it’s blunt, it’s real, and it’s blameless.</p>
]]></content:encoded></item></channel></rss>