# About Observability

> Understand how telemetry — metrics, logs, and traces — flows through Midaz so you can monitor performance and troubleshoot issues.

Observability is how we understand what’s happening inside a system by analyzing its outputs: metrics, logs, and traces.

This page explains how telemetry flows through **Midaz**, the tools that power it, and how you can connect your own systems to monitor performance, troubleshoot issues, and ensure operational excellence.

## Who configures what?

***

To avoid confusion, here’s a quick split of responsibilities:

### Client side

On your infrastructure, the main configuration for observability lives in the `components/infra/grafana/otelcol-config.yaml` file.

In this file, you define the collector’s behavior:

* **Processors**: batching, memory limits, filtering, obfuscation, sampling, etc.
* **Exporters**: for example, dual routing to Prometheus
* **API key authentication secrets**

<Danger>
  After editing this file, you **must** restart the stack with `make down && make up` for changes to take effect.
</Danger>

This setup ensures telemetry is processed efficiently, secured properly, and routed to the right destinations.

### Lerian side

On Lerian’s managed infrastructure, the observability stack is centrally configured and operated. This includes:

* **Central Collector**
* **Prometheus**
* **Loki**
* **Tempo**
* **Grafana**

<Note>
  These components are preconfigured and maintained by Lerian. You don’t edit them directly.
</Note>

This ensures consistency across environments and removes the need for local maintenance on your side.

## How the data flows

***

Telemetry data originates in your application and flows through a **Client Collector**, powered by [OpenTelemetry](https://opentelemetry.io/). Running in your environment, this collector enriches the data and securely forwards it to a **Central Collector** managed by Lerian. From there, it’s routed to three specialized backends:

* [**Prometheus**](https://grafana.com/oss/prometheus/) for metrics
* [**Loki**](https://grafana.com/oss/loki) for logs
* [**Tempo**](https://grafana.com/oss/tempo) for traces

Grafana sits on top of everything, giving you a unified view of all telemetry signals. This flow ensures observability at scale, built on OpenTelemetry standards for portability and consistency.

## Stack components

***

Together, these components form a complete observability pipeline: flexible on your side, consistent and secure on Lerian’s side, and fully based on **OpenTelemetry standards**.

#### Client Collector

The Client Collector is a lightweight [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) that runs close to your application, either as a DaemonSet or a Deployment. It enriches telemetry with Kubernetes metadata and your tenant identifier (`client_id`), then routes the data to the Central Collector.

It matters because it reduces load on the central pipeline, enables source-level filtering, and attaches crucial metadata such as `k8s.pod.name`. Installation is managed via Helm and Terraform, making it easy to integrate into your infrastructure.

#### Central Collector

The Central Collector is a centralized [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) deployment that receives telemetry from all clients. It performs global processing, enforces multi-tenancy, and exports signals to the appropriate storage backends.

<Note>
  The Central Collector is fully managed by Lerian. You don’t configure or modify it directly.
</Note>

This setup ensures consistency across tenants and guarantees that telemetry data is routed securely and efficiently to its final destinations.

#### Prometheus

Prometheus is optimized for storing and querying numerical time-series data. The Central Collector pushes metrics using `remote_write`.

#### Loki

Loki stores logs using label-based indexing, making it fast and cost-effective. Logs are sent from the Central Collector to the `loki-write` service.

#### Tempo

Tempo stores full distributed traces and integrates tightly with Prometheus and Loki through Grafana.

#### Grafana

Grafana is your single pane of glass. It connects to Prometheus, Loki, and Tempo, enabling you to correlate metrics, logs, and traces in one place.

<Tip>
  You can pivot between metrics, logs, and traces directly inside Grafana to speed up troubleshooting.
</Tip>

## Embedded Collector

***

You can enable the Client Collector as a dependency of your Midaz application with a single configuration flag:

```
otel-collector-lerian:
  enabled: true
```

This automatically installs a DaemonSet and configures your application to export telemetry to it. The required environment variables and secrets are injected via Helm, so you don’t need to manage them manually.

## Editing the Client Collector

***

When you need to customize behavior (obfuscation, filtering, sampling, etc.), you will:

1. Edit `components/infra/grafana/otelcol-config.yaml`.
2. Add or adjust the `processors` or `exporters` blocks.
3. Restart the stack:

   ```
   make down
   make up
   ```
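
For orientation, a collector configuration file generally has four top-level sections: `receivers`, `processors`, `exporters`, and `service`. The sketch below is a minimal, hypothetical layout, not the file shipped with Midaz. The `otlp` receiver and its ports are assumptions; the processor and exporter blocks reuse settings shown elsewhere on this page.

```yaml
# Hypothetical minimal otelcol-config.yaml layout (illustration only).
receivers:
  otlp:                      # assumed receiver; your file may define others
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  batch: {}

exporters:
  otlphttp/server:
    endpoint: "https://telemetry.lerian.io:443"
    headers:
      x-api-key: "${OTEL_API_KEY}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/server]
```

A processor or exporter only takes effect once it is listed in a pipeline under `service`, so defining a block under `processors:` is not enough on its own.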

## Client Collector Processors

***

In the OpenTelemetry Collector, **processors** are the core of data manipulation. They run sequentially to enrich, filter, sample, and transform telemetry data before exporting it to the backends.

Below is the list of processors configured in the **Lerian Client Collector**, their purpose, and how to configure them.

<Tip>
  **Where to configure**: add each block under `processors:` in `otelcol-config.yaml`.
</Tip>

**1. batch**

* **What it is:** Groups multiple telemetry signals (metrics, logs, or traces) into batches before sending them to the next stage.
* **Why it matters:** Improves compression efficiency, reduces network requests, and enhances overall pipeline performance.
* **Configuration:**

```
processors:
  batch: {}
```
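
If the defaults do not suit your traffic, the batch processor accepts a few tuning fields. The values below are illustrative assumptions, not Lerian recommendations.

```yaml
processors:
  batch:
    # Flush a batch after this long, even if it is not full (assumed value).
    timeout: 5s
    # Flush as soon as this many spans, data points, or log records are queued (assumed value).
    send_batch_size: 1024
    # Upper bound on a single batch; larger batches are split (assumed value).
    send_batch_max_size: 2048
```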

**2. memory\_limiter**

* **What it is:** Monitors the collector's memory usage and drops data if it approaches a defined threshold.
* **Why it matters:** Prevents the collector from being OOMKilled by Kubernetes.
* **Configuration:**

```
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
```
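
To make the percentages concrete, assume a collector container with a 2048 MiB memory limit (an assumption for illustration). The hard limit is 75% of 2048 MiB = 1536 MiB, and the spike allowance is 15% of 2048 MiB, about 307 MiB. The collector starts refusing data at the soft limit, which is the hard limit minus the spike allowance: 1536 - 307 = 1229 MiB. A roughly equivalent configuration using absolute values could look like this:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    # Hard limit: 75% of an assumed 2048 MiB container limit.
    limit_mib: 1536
    # Spike allowance: about 15% of 2048 MiB; soft limit = 1536 - 307 = 1229 MiB.
    spike_limit_mib: 307
```

Percentages adapt automatically when the container limit changes, which is usually the safer choice in Kubernetes; absolute values are mainly useful when you want a fixed ceiling.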

**3. spanmetrics**

* **What it is:** Generates metrics directly from trace data.
* **Why it matters:** Produces “RED” metrics (Rate, Errors, Duration) automatically.
* **Configuration:**

```
processors:
  spanmetrics:
    metrics_exporter: prometheus
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: service.name
      - name: client.id
```

**4. transform/remove\_sensitive\_attributes**

* **What it is:** Removes sensitive span attributes using regex.
* **Why it matters:** Keeps identifiers but strips headers, bodies, or other sensitive request data.
* **Configuration:**

```
processors:
  transform/remove_sensitive_attributes:
    trace_statements:
      - context: span
        statements:
          - delete_matching_keys(attributes, "^app\\.request\\.(?!request_id$).*")
```
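
One thing to verify in your environment: Go's `regexp` package (RE2 syntax) does not support lookaheads such as `(?!...)`, and the collector is written in Go. If your collector build rejects the pattern above at startup, one possible alternative is to list the sensitive sub-keys explicitly. The key names `headers`, `body`, and `query` below are hypothetical; substitute the attributes your services actually emit.

```yaml
processors:
  transform/remove_sensitive_attributes:
    trace_statements:
      - context: span
        statements:
          # Hypothetical key names: delete only the listed app.request.* attributes,
          # leaving app.request.request_id untouched.
          - delete_matching_keys(attributes, "^app\\.request\\.(headers|body|query)(\\..*)?$")
```

The trade-off is that an explicit list has to be updated whenever a new sensitive attribute is introduced.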

**5. tail\_sampling**

* **What it is:** A sampling strategy that decides whether to keep a trace after its spans have been received, so the decision can take the whole trace into account.
* **Why it matters:** Keeps only high-value traces (errors, specific clients) and reduces storage costs.
* **Configuration:**

```
processors:
  tail_sampling:
    policies:
      - name: keep_client_traces_policy
      - name: http_server_errors_policy
      - name: drop_all_other_traces_policy
```
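
The block above lists policy names only; in a working configuration each policy also needs a `type` and matching settings. The sketch below is one plausible way to fill them in, not the configuration Lerian ships: the policy types, the `decision_wait` value, the `http.status_code` range, and the `example-client` ID are all assumptions.

```yaml
processors:
  tail_sampling:
    # How long to wait for a trace's spans before deciding (assumed value).
    decision_wait: 10s
    policies:
      # Keep every trace tagged with the given tenant (hypothetical client ID).
      - name: keep_client_traces_policy
        type: string_attribute
        string_attribute:
          key: client.id
          values: ["example-client"]
      # Keep traces that contain a span with an HTTP 5xx status code.
      - name: http_server_errors_policy
        type: numeric_attribute
        numeric_attribute:
          key: http.status_code
          min_value: 500
          max_value: 599
      # Sample none of the remaining traces.
      - name: drop_all_other_traces_policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 0
```

A trace is kept when any policy votes to sample it, so the first two policies define what survives and the last one contributes nothing on its own.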

**6. filter/drop\_node\_metrics**

* **What it is:** Filters out node-level metrics.
* **Why it matters:** Reduces noise and focuses on app-level telemetry.
* **Configuration:**

```
processors:
  filter/drop_node_metrics:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - '^k8s\.node\..*$'
```

**7. filter/include\_midaz\_namespaces**

* **What it is:** Keeps only metrics from `midaz` and `midaz-plugins`.
* **Why it matters:** Eliminates irrelevant Kubernetes workloads.
* **Configuration:**

```
processors:
  filter/include_midaz_namespaces:
    metrics:
      include:
        match_type: regexp
        resource_attributes:
          - key: k8s.namespace.name
            value: '^(midaz|midaz-plugins)$'
```

**8. k8sattributes**

* **What it is:** Adds Kubernetes metadata to telemetry.
* **Why it matters:** Enables richer context in Grafana queries.
* **Configuration:**

```
processors:
  k8sattributes:
    auth_type: "serviceAccount"
    passthrough: false
    extract:
      metadata:
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.namespace.name
        - k8s.node.name
```

**9. resource/add\_client\_id**

* **What it is:** Inserts or updates `client.id` in telemetry.
* **Why it matters:** Critical for multi-tenancy.
* **Configuration:**

```
processors:
  resource/add_client_id:
    attributes:
      - key: client.id
        value: "example-client"
        action: upsert
```

**10. transform/remove\_log\_body**

* **What it is:** Removes log body content.
* **Why it matters:** Prevents sensitive or PII data from persisting in logs.
* **Configuration:**

```
processors:
  transform/remove_log_body:
    log_statements:
      - context: log
        statements:
          - set(body, "")
```

**11. transform/obfuscate\_attributes**

* **What it is:** Obfuscates selected attributes.
* **Why it matters:** Protects sensitive values (like `legalDocument` or `accountAlias`) before data leaves your cluster.
* **Configuration** (in `otelcol-config.yaml`):

```
processors:
  transform/obfuscate_attributes:
    trace_statements:
      - context: span
        statements:
          - replace_pattern(attributes["legalDocument"], ".*", "***")
          - replace_pattern(attributes["accountAlias"], ".*", "***")
    log_statements:
      - context: log
        statements:
          - replace_pattern(attributes["legalDocument"], ".*", "***")
          - replace_pattern(attributes["accountAlias"], ".*", "***")
```

* **Customizing the fields:**
  * Defaults: `legalDocument`, `accountAlias`
  * Add or remove fields as needed
  * Restart required: `make down && make up`
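
As a sketch of that customization, the block below adds one more field to the defaults. The attribute name `customerEmail` is hypothetical and only illustrates the pattern; use the attribute names your services actually emit.

```yaml
processors:
  transform/obfuscate_attributes:
    trace_statements:
      - context: span
        statements:
          - replace_pattern(attributes["legalDocument"], ".*", "***")
          - replace_pattern(attributes["accountAlias"], ".*", "***")
          # Hypothetical extra field added for illustration.
          - replace_pattern(attributes["customerEmail"], ".*", "***")
    log_statements:
      - context: log
        statements:
          - replace_pattern(attributes["legalDocument"], ".*", "***")
          - replace_pattern(attributes["accountAlias"], ".*", "***")
          # Keep the log statements in sync with the trace statements.
          - replace_pattern(attributes["customerEmail"], ".*", "***")
```

Adding the field to both `trace_statements` and `log_statements` keeps the value masked regardless of which signal carries it.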

### In short

| Processor                                   | Data type             | Function                             | Benefit                                |
| :------------------------------------------ | :-------------------- | :----------------------------------- | :------------------------------------- |
| **batch**                                   | Metrics, Logs, Traces | Groups telemetry before export       | Improves compression and network use   |
| **memory\_limiter**                         | Metrics, Logs, Traces | Drops data when memory limit is near | Prevents OOMKilled                     |
| **spanmetrics**                             | Traces → Metrics      | Creates RED metrics                  | Immediate performance insights         |
| **transform/remove\_sensitive\_attributes** | Traces                | Strips sensitive span attrs          | Keeps IDs, removes secrets             |
| **tail\_sampling**                          | Traces                | Smart sampling of traces             | Lower storage, focus on errors/targets |
| **filter/drop\_node\_metrics**              | Metrics               | Excludes noisy node-level data       | Cleaner dataset                        |
| **filter/include\_midaz\_namespaces**       | Metrics               | Keeps only Midaz namespaces          | Removes irrelevant metrics             |
| **k8sattributes**                           | Metrics, Logs, Traces | Adds K8s metadata                    | Richer Grafana context                 |
| **resource/add\_client\_id**                | All signals           | Tags telemetry with client ID        | Enables multi-tenancy                  |
| **transform/remove\_log\_body**             | Logs                  | Clears log body                      | Avoids storing PII                     |
| **transform/obfuscate\_attributes**         | Traces, Logs          | Masks chosen fields                  | Keeps sensitive values in your cluster |
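
Because processors run in the order they are listed, the `service` section determines how the blocks above are chained. The wiring below is a sketch under assumptions: the `otlp` receiver name and the exact ordering are not taken from the shipped configuration, and the exporter name is borrowed from the dual routing example on this page.

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]              # assumed receiver name
      processors:
        - memory_limiter             # first, so overload is handled before other work
        - k8sattributes
        - resource/add_client_id
        - transform/remove_sensitive_attributes
        - transform/obfuscate_attributes
        - spanmetrics                # needs the exporter named in its metrics_exporter setting
        - tail_sampling
        - batch                      # last, so only surviving data is batched
      exporters: [otlphttp/server]
    metrics:
      receivers: [otlp]
      processors:
        - memory_limiter
        - k8sattributes              # adds k8s.namespace.name before the namespace filter
        - filter/include_midaz_namespaces
        - filter/drop_node_metrics
        - resource/add_client_id
        - batch
      exporters: [otlphttp/server]
    logs:
      receivers: [otlp]
      processors:
        - memory_limiter
        - k8sattributes
        - resource/add_client_id
        - transform/remove_log_body
        - transform/obfuscate_attributes
        - batch
      exporters: [otlphttp/server]
```

Placing `spanmetrics` before `tail_sampling` in this sketch means the generated metrics reflect all traces, not only the sampled ones.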

## Protecting sensitive data

***

Midaz treats the **Client Collector as a telemetry firewall**. All filtering, sampling, and transformation rules are defined in your configuration file (`components/infra/grafana/otelcol-config.yaml`).

<Frame caption="Client vs Lerian responsibilities in the observability pipeline">
  <img src="https://mintcdn.com/lerian-49cb71fc/SEOef3JqTInYAAau/images/en/d2/observability-client-lerian.svg?fit=max&auto=format&n=SEOef3JqTInYAAau&q=85&s=abe74dfcf252e0f0fb5edf9483729e1a" alt="Client vs. Lerian responsibilities in the observability pipeline" width="2461" height="714" data-path="images/en/d2/observability-client-lerian.svg" />
</Frame>

This file runs inside **your infrastructure**, ensuring that sensitive attributes are removed or obfuscated *before* data leaves your cluster.

<Danger>
  Sensitive values such as request bodies, legal documents, or account aliases never reach Lerian’s Central Collector.
</Danger>

Our architecture enforces this separation:

* **Client Collector (you configure):** Runs in your cluster. Apply processors such as `transform/remove_sensitive_attributes`, `transform/remove_log_body`, and `transform/obfuscate_attributes`.
* **Central Collector (Lerian managed):** Receives only the filtered, sanitized telemetry streams and routes them to Prometheus, Loki, and Tempo.

Configurations are fully packaged and managed via Helm, keeping deployments consistent, traceable, and aligned with best practices.

<Note>
  You decide what’s sensitive in `otelcol-config.yaml`. Lerian only sees sanitized telemetry.
</Note>

## Telemetry flow

***

Here’s what happens when telemetry is enabled:

1. **Application starts** and detects OpenTelemetry configuration.
2. **Telemetry is exported** to the local Client Collector.
3. **Client Collector enriches data** with Kubernetes metadata and your `client_id`.
4. **Processors filter, sample, and transform** the data.
5. **Data is forwarded** to the Central Collector.
6. **Central Collector processes and routes data**:
   * Metrics → Prometheus
   * Logs → Loki
   * Traces → Tempo
7. **Grafana lets you query it all**, correlating across signals.

You can, for example, run:

```
sum(rate(http_server_duration_seconds_count{
  k8s_pod_name=~"checkout-.*",
  client_id="client-name"
}[5m]))
```

And then jump straight to the related logs or traces.

## Authenticating collector requests

***

To ensure data security and integrity, all telemetry sent from your cluster to Lerian’s platform must be authenticated using a secure API key.

### How to set it up

1. **Create the Kubernetes Secret** to store your API token:

```
kubectl create secret generic otel-api-key \
  --from-literal=OTEL_API_KEY='YOUR_TOKEN_HERE' \
  -n midaz
```

2. **Reference the secret in your Helm values file** to inject it as an environment variable:

```
extraEnvs:
  - name: OTEL_API_KEY
    valueFrom:
      secretKeyRef:
        name: otel-api-key
        key: OTEL_API_KEY
```

3. **Telemetry is securely sent** to Lerian’s telemetry endpoint over HTTPS, with the API key included in the headers.

```
https://telemetry.lerian.io:443
```

<Danger>
  This key must remain private. If compromised, contact Lerian support immediately to rotate the token.
</Danger>

## Data encryption in transit

***

All telemetry data, including metrics, logs, and traces, is transmitted from your environment to Lerian’s observability platform using **HTTPS with TLS encryption**.

This means:

* The communication between the Client Collector and the Central Collector is fully encrypted.
* Data in transit is protected against interception, tampering, or unauthorized access.
* Even if network traffic is inspected, the contents remain unreadable without the proper cryptographic keys.

Combined with API key authentication, this ensures your telemetry is both **secure and verifiable** from source to destination.

<Danger>
  We enforce encrypted transport by default. No data is accepted over insecure channels.
</Danger>

## Dual routing

***

Need to keep a copy of your metrics internally? You can configure the Client Collector to send telemetry to multiple destinations.

### Example

```
exporters:
  otlphttp/server:
    endpoint: "https://telemetry.lerian.io:443"
    headers:
      x-api-key: "${OTEL_API_KEY}"
  prometheus/local:
    endpoint: prometheus-server-example:8889
```

Add both exporters to your metrics pipeline. The same metrics are then sent to our platform and exposed at the `prometheus/local` endpoint, where your internal Prometheus can scrape them.

<Note>
  This setup is ideal for local monitoring without disrupting the standard flow to Lerian’s observability stack.
</Note>
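
Wiring both exporters into the metrics pipeline could look like the sketch below. The `otlp` receiver name and the processor list are assumptions; only the two exporter names come from the example above.

```yaml
service:
  pipelines:
    metrics:
      receivers: [otlp]                     # assumed receiver name
      processors: [memory_limiter, batch]   # assumed subset of the configured processors
      # Every exporter listed here receives a copy of the same metrics.
      exporters: [otlphttp/server, prometheus/local]
```

Traces and logs pipelines are unaffected unless you add a second exporter to them as well.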

## Glossary

***

<AccordionGroup>
  <Accordion title="DaemonSet">
    A Kubernetes workload type that ensures a Pod runs on every (or selected) node in a cluster. Used for deploying the **Client Collector**, so that it can collect node-level data like Kubelet metrics.
  </Accordion>

  <Accordion title="Deployment">
    A Kubernetes workload that manages replicas of a Pod. Used for the **Central Collector** and other platform services.
  </Accordion>

  <Accordion title="Exporter">
    Sends telemetry data from the Collector to one or more backends (e.g., Prometheus for metrics, Loki for logs, Tempo for traces).
  </Accordion>

  <Accordion title="Grafana">
    An open-source visualization layer. Grafana connects to Prometheus, Loki, and Tempo to provide a unified interface for querying and exploring metrics, logs, and traces.
  </Accordion>

  <Accordion title="Loki">
    Our backend for **logs**. Loki indexes metadata labels rather than full log content, making it fast and cost-efficient for high-volume use cases.
  </Accordion>

  <Accordion title="Multi-tenancy">
    An architectural approach where a single platform serves multiple clients (tenants). In Midaz, telemetry data is tagged with a `client_id` to ensure isolation and traceability across tenants.
  </Accordion>

  <Accordion title="Observability">
    The ability to understand a system’s internal state by analyzing its external outputs. In practice, it means collecting and analyzing **metrics**, **logs**, and **traces** to monitor performance and troubleshoot issues.
  </Accordion>

  <Accordion title="OpenTelemetry (OTel)">
    An open-source framework with tools, APIs, and SDKs for instrumenting, generating, collecting, and exporting telemetry data: metrics, logs, and traces.
  </Accordion>

  <Accordion title="OTel Collector">
    A standalone service that receives, processes, and exports telemetry data. It acts as a bridge between instrumented applications and backends like Prometheus, Loki, or Tempo.
  </Accordion>

  <Accordion title="OTLP (OpenTelemetry Protocol)">
    The default protocol used by OpenTelemetry to transport telemetry data between applications, collectors, and backends via gRPC or HTTP.
  </Accordion>

  <Accordion title="Pipeline">
    Defines how telemetry flows through the Collector. A pipeline typically chains together **receivers**, **processors**, and **exporters** for a given signal type (metrics, logs, or traces).
  </Accordion>

  <Accordion title="Processor">
    Handles data transformation inside the Collector, such as enriching signals with metadata, filtering unwanted data, batching messages, or enforcing sampling policies.
  </Accordion>

  <Accordion title="Prometheus">
    Our backend for storing and querying **metrics**. It supports powerful time-series queries (PromQL) and integrates with the OpenTelemetry Collector via remote write.
  </Accordion>

  <Accordion title="Receiver">
    The component of the Collector that ingests incoming telemetry data. Supports formats like OTLP, Jaeger, Prometheus, and others.
  </Accordion>

  <Accordion title="SDK (Software Development Kit)">
    A set of libraries you embed in your application code to produce telemetry signals like spans, counters, or logs.
  </Accordion>

  <Accordion title="Tempo">
    Our backend for **traces**. It stores full distributed traces and integrates closely with Prometheus and Loki for seamless correlation in Grafana.
  </Accordion>

  <Accordion title="Terraform">
    An Infrastructure as Code (IaC) tool we use to provision and manage cloud infrastructure, including the installation of observability components via Helm.
  </Accordion>
</AccordionGroup>