Observability

Observability is how we understand what’s happening inside a system by analyzing its outputs: metrics, logs, and traces. This page explains how telemetry flows through Midaz, the tools that power it, and how you can connect your own systems to monitor performance, troubleshoot issues, and ensure operational excellence.

📘

Note

Midaz uses OpenTelemetry to collect and forward telemetry data to a centralized platform. Our stack includes Prometheus, Loki, Tempo, and Grafana, all working together to give you full visibility into your applications.

How the data flows

Telemetry data originates in your application and flows through a Client Collector running in your environment. This data is enriched and securely forwarded to a Central Collector managed by Lerian. From there, it’s routed to three specialized backends:

  • Prometheus for metrics
  • Loki for logs
  • Tempo for traces

Grafana sits on top of everything, giving you a unified view of all telemetry signals.

Stack components


OpenTelemetry Collector (Client)

A lightweight agent running close to your application, either as a DaemonSet or a Deployment. It enriches telemetry with Kubernetes metadata and your tenant identifier (client_id), then routes it to the Central Collector.

  • Why it matters: Reduces load on the central pipeline, enables source-level filtering, and adds crucial metadata like k8s.pod.name.
  • How it’s installed: Managed via Helm and Terraform.
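
For orientation, the sketch below shows roughly how such a client-side pipeline is wired together. It reuses the processor and exporter names described on this page, but the ordering and the receiver choice are assumptions rather than the literal managed configuration.

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, resource/add_client_id, batch]
      exporters: [otlphttp/server]   # forwards traces to the Central Collector
    # metrics and logs pipelines follow the same receivers → processors → exporters pattern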

OpenTelemetry Collector (Central)

A centralized deployment that receives telemetry from all clients. It performs global processing, enforces multi-tenancy, and exports signals to the appropriate storage backends.

Prometheus

Our metrics backend. Optimized for storing and querying numerical time-series data. The Central Collector pushes metrics using remote_write.

Loki

Our logs backend. Stores logs using label-based indexing, making it fast and cost-effective. Logs are sent from the Central Collector to the loki-write service.

Tempo

Our traces backend. Stores full distributed traces and integrates tightly with Prometheus and Loki through Grafana.
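
To make that routing concrete, here is a simplified sketch of what the Central Collector’s exporter section could look like. The exporter names, endpoints, and ports are illustrative assumptions, not the actual managed configuration.

exporters:
  prometheusremotewrite:   # metrics → Prometheus via remote_write
    endpoint: "http://prometheus:9090/api/v1/write"
  loki:                    # logs → the loki-write service
    endpoint: "http://loki-write:3100/loki/api/v1/push"
  otlp/tempo:              # traces → Tempo over OTLP
    endpoint: "tempo:4317"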

Grafana

Your single pane of glass. Grafana connects to Prometheus, Loki, and Tempo, enabling you to correlate metrics, logs, and traces in one place.

Deployment option: embedded client collector


You can enable the Client Collector as a dependency of your Midaz application using a single configuration flag:

otel-collector-lerian:
  enabled: true

This automatically installs a DaemonSet and configures your application to export telemetry to it. The necessary environment variables and secrets are injected via Helm.
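
As a rough illustration, the injected configuration boils down to standard OpenTelemetry SDK environment variables along these lines (the service name and endpoint below are assumptions; the Helm chart manages the real values for you):

env:
  - name: HOST_IP                         # node IP, used to reach the DaemonSet collector on this node
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: OTEL_EXPORTER_OTLP_ENDPOINT     # points the SDK at the node-local Client Collector
    value: "http://$(HOST_IP):4317"
  - name: OTEL_SERVICE_NAME               # identifies the emitting service in metrics, logs, and traces
    value: "midaz-onboarding"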

👍

Tip

This setup reduces maintenance, ensures compatibility, and gives you out-of-the-box observability with no extra steps.

Client Collector Processors


In the OpenTelemetry Collector, processors are the core of data manipulation. They run sequentially to enrich, filter, sample, and transform telemetry data before exporting it to the backends.

Below is the list of processors configured in the Lerian Client Collector, their purpose, and how to configure them.

1. batch

What it is: Groups multiple telemetry signals (metrics, logs, or traces) into batches before sending them to the next stage.

Why it matters: Improves compression efficiency, reduces network requests, and enhances overall pipeline performance — a best practice in production.

Configuration:

processors:
  batch: {}
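
The empty map applies the processor’s defaults. If you need to tune it, the batch processor also accepts sizing and timing options; the values below are illustrative rather than the production settings:

processors:
  batch:
    timeout: 5s              # flush a batch after at most 5 seconds
    send_batch_size: 8192    # preferred number of items per batch
    send_batch_max_size: 0   # 0 means no hard upper limit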

2. memory_limiter

What it is: Monitors the collector's memory usage and drops data if it approaches a defined threshold.

Why it matters: Prevents the collector from being terminated by Kubernetes due to out-of-memory (OOMKilled) events, ensuring stability under heavy load.

Configuration:

memory_limiter:
  check_interval: 1s
  limit_percentage: 75
  spike_limit_percentage: 15

3. spanmetrics

What it is: Generates metrics directly from trace data (latency, request count, and error count).

Why it matters: Produces “RED” metrics (Rate, Errors, Duration) automatically, without additional code instrumentation. Configured dimensions become labels in Prometheus for granular analysis.

Configuration:

spanmetrics:
  metrics_exporter: prometheus
  dimensions:
    - name: http.method
    - name: http.status_code
    - name: service.name
    - name: client.id
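
Once these metrics land in Prometheus, you can query them directly. As a hedged example, assuming the processor’s default metric naming (calls_total, with the configured dimensions flattened into labels such as client_id and status_code), an error rate per service might look like:

sum by (service_name) (
  rate(calls_total{status_code="STATUS_CODE_ERROR", client_id="Firmino"}[5m])
)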

4. transform/remove_sensitive_attributes

What it is: Uses OpenTelemetry Transformation Language (OTTL) to remove sensitive span attributes.

How it works: Deletes every span attribute whose key starts with app.request., except app.request.request_id, using a negative lookahead pattern.

Why it matters: Removes sensitive request data (e.g., headers, bodies) while keeping a unique identifier for correlation.

Configuration:

transform/remove_sensitive_attributes:
  trace_statements:
    - context: span
      statements:
        - delete_matching_keys(attributes, "^app\\.request\\.(?!request_id$).*")

5. tail_sampling

What it is: A sampling strategy that decides which traces to keep after all spans for a trace have been received.

Why it matters: Reduces storage costs by focusing only on high-value traces (e.g., specific clients or HTTP 5xx errors).

Configuration:

tail_sampling:
  policies:
    - name: keep_firmino_traces_policy
      # Keeps 100% of traces with client.id = "Firmino"
    - name: http_server_errors_policy
      # Keeps 100% of traces with HTTP 5xx status
    - name: drop_all_other_traces_policy
      # Drops all other traces
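
A fuller version of these policies could look like the sketch below. The policy types, attribute keys, and thresholds are assumptions that follow the comments above (for example, it assumes http.status_code is recorded as a numeric span attribute); they are not the exact managed configuration.

tail_sampling:
  policies:
    - name: keep_firmino_traces_policy
      type: string_attribute
      string_attribute:
        key: client.id
        values: ["Firmino"]
    - name: http_server_errors_policy
      type: numeric_attribute
      numeric_attribute:
        key: http.status_code
        min_value: 500
        max_value: 599
    - name: drop_all_other_traces_policy
      type: probabilistic
      probabilistic:
        sampling_percentage: 0   # never samples on its own; traces not kept by a policy above are dropped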

6. filter/drop_node_metrics

What it is: Filters out metrics based on their names.

Why it matters: Removes high-volume, low-value node-level metrics (e.g., k8s.node.cpu.utilization) to save storage and focus on application-level data.

Configuration:

filter/drop_node_metrics:
  metrics:
    exclude:
      match_type: regexp
      metric_names:
        - ^k8s\.node\..*$

7. filter/include_midaz_namespaces

What it is: Keeps only metrics from specific Kubernetes namespaces.

Why it matters: Ensures processing is limited to telemetry from midaz and midaz-plugins, eliminating noise from unrelated workloads.

Configuration:

filter/include_midaz_namespaces:
  metrics:
    include:
      match_type: regexp
      resource_attributes:
        - key: k8s.namespace.name
          value: '^(midaz|midaz-plugins)$'

8. k8sattributes

What it is: Enriches telemetry with Kubernetes metadata (e.g., pod name, namespace, deployment name).

Why it matters: Provides essential context for filtering, correlation, and troubleshooting in Grafana.

Configuration:

k8sattributes:
  auth_type: "serviceAccount"
  passthrough: false
  extract:
    metadata:
      - k8s.pod.name
      - k8s.deployment.name
      - k8s.namespace.name
      - k8s.node.name

9. resource/add_client_id

What it is: Adds or updates the client.id attribute for all telemetry data.

Why it matters: Enables multi-tenancy in Midaz by ensuring that all telemetry is tagged with the correct client identifier.

Configuration:

resource/add_client_id:
  attributes:
    - key: client.id
      value: "Firmino"
      action: upsert

10. transform/remove_log_body

What it is: Uses OTTL to remove the log body content.

How it works: Replaces the log record body with an empty string.

Why it matters: Prevents sensitive or personally identifiable information (PII) from being stored in logs.

Configuration:

transform/remove_log_body:
  log_statements:
    - context: log
      statements:
        - set(body, "")

In short

| Processor | Data type | Primary function | Direct benefit |
| --- | --- | --- | --- |
| batch | Metrics, Logs, Traces | Groups telemetry into batches before export. | Improves compression, reduces network overhead, and optimizes pipeline performance. |
| memory_limiter | Metrics, Logs, Traces | Monitors memory usage and drops data if limits are reached. | Prevents OOMKilled events and ensures stability under high load. |
| spanmetrics | Traces → Metrics | Generates RED metrics (Rate, Errors, Duration) from trace spans. | Enables detailed performance analysis without extra instrumentation. |
| transform/remove_sensitive_attributes | Traces | Removes sensitive span attributes while keeping request_id. | Protects sensitive data while retaining identifiers for correlation. |
| tail_sampling | Traces | Applies intelligent sampling after receiving the full trace. | Reduces storage costs by keeping only high-value traces (e.g., errors, targeted clients). |
| filter/drop_node_metrics | Metrics | Excludes node-level metrics by name. | Reduces data volume and focuses on application-level metrics. |
| filter/include_midaz_namespaces | Metrics | Keeps only metrics from midaz and midaz-plugins namespaces. | Eliminates noise from unrelated workloads. |
| k8sattributes | Metrics, Logs, Traces | Adds Kubernetes metadata (pod, namespace, deployment, node). | Provides essential context for filtering, correlation, and troubleshooting in Grafana. |
| resource/add_client_id | Metrics, Logs, Traces | Adds or updates the client.id attribute for all telemetry. | Enables multi-tenancy and data isolation across clients. |
| transform/remove_log_body | Logs | Clears the log body content. | Prevents storing sensitive or PII data in logs. |

Telemetry flow: end-to-end


Here’s what happens when telemetry is enabled:

  1. Application starts and detects OpenTelemetry configuration.
  2. Telemetry is exported to the local Client Collector.
  3. Client Collector enriches data with Kubernetes metadata and your client_id.
  4. Processors enrich, filter, and transform the data.
  5. Data is forwarded to the Central Collector.
  6. Central Collector processes and routes data:
    • Metrics → Prometheus
    • Logs → Loki
    • Traces → Tempo
  7. Grafana lets you query it all, correlating across signals.

You can, for example, run:

sum(rate(http_server_duration_seconds_count{
  k8s_pod_name=~"checkout-.*",
  client_id="client-name"
}[5m]))

And then jump straight to the related logs or traces.
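
As an illustration, a matching Loki query could look like the line below. The exact label names depend on how the Central Collector maps resource attributes to Loki labels, so treat these as assumptions:

{client_id="client-name", k8s_pod_name=~"checkout-.*"}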

Authenticating collector requests


To ensure data security and integrity, all telemetry sent from your cluster to Lerian’s platform must be authenticated using a secure API key.

How to set it up

  1. Create the Kubernetes Secret to store your API token:

kubectl create secret generic otel-api-key \
  --from-literal=OTEL_API_KEY='YOUR_TOKEN_HERE' \
  -n midaz

  2. Reference the secret in your Helm values file to inject it as an environment variable:

extraEnvs:
  - name: OTEL_API_KEY
    valueFrom:
      secretKeyRef:
        name: otel-api-key
        key: OTEL_API_KEY

  3. Telemetry is then sent securely to Lerian’s telemetry endpoint (https://telemetry.lerian.io:443) over HTTPS, with the API key included in the request headers.

❗️

Important

This key must remain private. If compromised, contact Lerian support immediately to rotate the token.

Data encryption in transit

All telemetry data, including metrics, logs, and traces, is transmitted from your environment to Lerian’s observability platform using HTTPS with TLS encryption.

This means:

  • The communication between the Client Collector and the Central Collector is fully encrypted.
  • Data in transit is protected against interception, tampering, or unauthorized access.
  • Even if network traffic is inspected, the contents remain unreadable without the proper cryptographic keys.

Combined with API key authentication, this ensures your telemetry is both secure and verifiable from source to destination.

❗️

Important

We enforce encrypted transport by default. No data is accepted over insecure channels.

Dual Routing: send metrics to your own Prometheus


Need to keep a copy of your metrics internally? You can configure the Client Collector to send telemetry to multiple destinations.

Example

exporters:
  otlphttp/server:
    endpoint: "<https://telemetry.lerian.io:443>"
    headers:
      x-api-key: "${OTEL_API_KEY}"
  prometheus/local:
    endpoint: prometheus-server-example:8889

Add both exporters to your metrics pipeline, and the same metrics will be sent to both our platform and your internal Prometheus.
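
A sketch of that pipeline wiring, reusing the exporter names from the example above (the receiver and processor names are assumptions):

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/server, prometheus/local]   # same metrics flow to Lerian and to your Prometheus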

📘

Note

This setup is ideal for local monitoring without disrupting the standard flow to Lerian’s observability stack.

Glossary


DaemonSet

A Kubernetes workload type that ensures a Pod runs on every (or selected) node in a cluster. Used for deploying the Client Collector, so it can collect node-level data like Kubelet metrics.

Deployment

A Kubernetes workload that manages replicas of a Pod. Used for the Central Collector and other platform services.

Exporter

Sends telemetry data from the Collector to one or more backends (e.g., Prometheus for metrics, Loki for logs, Tempo for traces).

Grafana

An open-source visualization layer. Grafana connects to Prometheus, Loki, and Tempo to provide a unified interface for querying and exploring metrics, logs, and traces.

Loki

Our backend for logs. Loki indexes metadata labels rather than full log content, making it fast and cost-efficient for high-volume use cases.

Multi-tenancy

An architectural approach where a single platform serves multiple clients (tenants). In Midaz, telemetry data is tagged with a client_id to ensure isolation and traceability across tenants.

Observability

The ability to understand a system’s internal state by analyzing its external outputs. In practice, it means collecting and analyzing metrics, logs, and traces to monitor performance and troubleshoot issues.

OpenTelemetry (OTel)

An open-source framework with tools, APIs, and SDKs for instrumenting, generating, collecting, and exporting telemetry data — metrics, logs, and traces.

OTel Collector

A standalone service that receives, processes, and exports telemetry data. It acts as a bridge between instrumented applications and backends like Prometheus or Grafana.

OTLP (OpenTelemetry Protocol)

The default protocol used by OpenTelemetry to transport telemetry data between applications, collectors, and backends via gRPC or HTTP.

Pipeline

Defines how telemetry flows through the Collector. A pipeline typically chains together receivers, processors, and exporters for a given signal type (metrics, logs, or traces).

Processor

Handles data transformation inside the Collector, such as enriching signals with metadata, filtering unwanted data, batching messages, or enforcing sampling policies.

Prometheus

Our backend for storing and querying metrics. It supports powerful time-series queries (PromQL) and integrates with the OpenTelemetry Collector via remote write.

Receiver

The component of the Collector that ingests incoming telemetry data. Supports formats like OTLP, Jaeger, Prometheus, and others.

SDK (Software Development Kit)

A set of libraries you embed in your application code to produce telemetry signals like spans, counters, or logs.

Tempo

Our backend for traces. It stores full distributed traces and integrates closely with Prometheus and Loki for seamless correlation in Grafana.

Terraform

An Infrastructure as Code (IaC) tool we use to provision and manage cloud infrastructure, including the installation of observability components via Helm.