# Implementing observability in Flowker

Flowker emits traces, metrics, and structured logs using the **OpenTelemetry** standard. This guide explains what's available, how to enable it, and how to interpret the data in your observability stack.

## Overview

***

Flowker's telemetry is built on three signals:

| Signal  | Backend    | What it covers                                         |
| ------- | ---------- | ------------------------------------------------------ |
| Traces  | Tempo      | Distributed spans across workflow executions and steps |
| Metrics | Prometheus | HTTP request rates, latency, and system resource usage |
| Logs    | Loki       | Structured JSON logs for every operation               |

All signals are exported via **OTLP (OpenTelemetry Protocol)** to a collector of your choice.

## Configuration

***

Telemetry is controlled by environment variables.

```bash
# Enable telemetry (required to activate OTLP export)
ENABLE_TELEMETRY=true

# OTLP collector endpoint (required when ENABLE_TELEMETRY=true)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

# Service identity
OTEL_RESOURCE_SERVICE_NAME=flowker
OTEL_RESOURCE_SERVICE_VERSION=1.0.0
OTEL_RESOURCE_DEPLOYMENT_ENVIRONMENT=production
OTEL_LIBRARY_NAME=flowker

# Log verbosity: debug | info | warn | error
LOG_LEVEL=info
```

<Note>
  If `ENABLE_TELEMETRY=true` is set without `OTEL_EXPORTER_OTLP_ENDPOINT`, Flowker will fail to start.
</Note>
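
The startup rule in the note can be mirrored in a deployment pre-flight check. The sketch below is a hypothetical helper, not part of Flowker. It assumes only the two variable names documented above and that `true` is the enabling value.

```python
import os
from typing import List, Mapping


def validate_telemetry_env(env: Mapping[str, str]) -> List[str]:
    """Return the problems found in the telemetry settings (empty if none)."""
    problems = []
    # Assumption: only the literal value "true" (any case) enables telemetry.
    enabled = env.get("ENABLE_TELEMETRY", "").strip().lower() == "true"
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if enabled and not endpoint:
        problems.append(
            "ENABLE_TELEMETRY=true requires OTEL_EXPORTER_OTLP_ENDPOINT to be set"
        )
    return problems


if __name__ == "__main__":
    for problem in validate_telemetry_env(os.environ):
        print(f"config error: {problem}")
```

Running a check like this in CI or an init step surfaces the misconfiguration before the service fails to start.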

## Distributed tracing

***

Every HTTP request and internal operation creates an **OpenTelemetry span**. Trace context is propagated through the full execution chain, so a single workflow run produces one connected trace from the HTTP handler down to individual executor steps.

### Span naming convention

Spans follow a `<layer>.<resource>.<operation>` pattern:

**Execution spans**

| Span name                                        | Description                                              |
| ------------------------------------------------ | -------------------------------------------------------- |
| `command.execution.execute`                      | Root span for a workflow execution                       |
| `command.execution.execute_executor_node`        | Span for each executor node processed                    |
| `command.execution.execute_with_provider_config` | Span for a node resolved with a specific provider config |
| `command.execution.recover`                      | Span for incomplete execution recovery at startup        |

**Workflow command spans**

| Span name                     | Description                 |
| ----------------------------- | --------------------------- |
| `command.workflow.create`     | Create a new workflow       |
| `command.workflow.update`     | Update an existing workflow |
| `command.workflow.activate`   | Activate a workflow         |
| `command.workflow.deactivate` | Deactivate a workflow       |
| `command.workflow.clone`      | Clone a workflow            |
| `command.workflow.delete`     | Delete a workflow           |

**Executor configuration spans**

| Span name                                   | Description                     |
| ------------------------------------------- | ------------------------------- |
| `command.executor_config.create`            | Create executor configuration   |
| `command.executor_config.update`            | Update executor configuration   |
| `command.executor_config.activate`          | Activate executor configuration |
| `command.executor_config.enable`            | Enable executor configuration   |
| `command.executor_config.disable`           | Disable executor configuration  |
| `command.executor_config.mark_configured`   | Mark executor as configured     |
| `command.executor_config.mark_tested`       | Mark executor as tested         |
| `command.executor_config.test_connectivity` | Test executor connectivity      |
| `command.executor_config.delete`            | Delete executor configuration   |

**Provider configuration spans**

| Span name                                   | Description                    |
| ------------------------------------------- | ------------------------------ |
| `command.provider_config.create`            | Create provider configuration  |
| `command.provider_config.update`            | Update provider configuration  |
| `command.provider_config.enable`            | Enable provider configuration  |
| `command.provider_config.disable`           | Disable provider configuration |
| `command.provider_config.test_connectivity` | Test provider connectivity     |
| `command.provider_config.delete`            | Delete provider configuration  |

**Query spans**

| Span name                              | Description                             |
| -------------------------------------- | --------------------------------------- |
| `query.execution.get`                  | Get execution by ID                     |
| `query.execution.list`                 | List executions                         |
| `query.execution.get_results`          | Get execution results                   |
| `query.workflow.get`                   | Get workflow by ID                      |
| `query.workflow.get_by_name`           | Get workflow by name                    |
| `query.workflow.list`                  | List workflows                          |
| `query.executor_config.get`            | Get executor config by ID               |
| `query.executor_config.get_by_name`    | Get executor config by name             |
| `query.executor_config.list`           | List executor configs                   |
| `query.executor_config.exists`         | Check executor config existence         |
| `query.executor_config.exists_by_name` | Check executor config existence by name |
| `query.provider_config.get`            | Get provider config by ID               |
| `query.provider_config.list`           | List provider configs                   |

<Tip>
  In Grafana Tempo, search by service name (`flowker`) and filter by span name to isolate specific operations. Use `command.execution.execute` as the entry point to see a full workflow trace.
</Tip>
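
Because span names follow the `<layer>.<resource>.<operation>` pattern, they can be split mechanically when post-processing exported traces. This is a minimal sketch; `SpanName` and `parse_span_name` are illustrative names, not Flowker APIs.

```python
from typing import NamedTuple


class SpanName(NamedTuple):
    layer: str      # "command" or "query" in the tables above
    resource: str   # e.g. "execution", "workflow", "executor_config"
    operation: str  # e.g. "execute", "get_by_name"


def parse_span_name(name: str) -> SpanName:
    """Split a span name into its three dot-separated parts."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a <layer>.<resource>.<operation> span name: {name!r}")
    return SpanName(*parts)


if __name__ == "__main__":
    print(parse_span_name("command.execution.execute_executor_node"))
```

Grouping spans by `layer` and `resource` is one way to separate write-path (`command`) latency from read-path (`query`) latency.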

## Metrics

***

Flowker emits HTTP and system metrics automatically through the OpenTelemetry SDK and exports them via OTLP. No additional configuration is needed beyond enabling telemetry.

### HTTP metrics (via otelfiber)

Collected per route by the `otelfiber` middleware:

| Metric                        | Type          | Description                      |
| ----------------------------- | ------------- | -------------------------------- |
| `http.server.duration`        | Histogram     | Request duration in milliseconds |
| `http.server.request.size`    | Histogram     | Request payload size in bytes    |
| `http.server.response.size`   | Histogram     | Response payload size in bytes   |
| `http.server.active_requests` | UpDownCounter | Number of in-flight requests     |

Each metric carries the attributes `http.method`, `http.route`, and `http.status_code`.
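
Prometheus label names cannot contain dots, so these names are rewritten on the way into Prometheus. This is why the queries in the Grafana dashboards section use `http_route` and `http_status_code`. The sketch below is a simplified illustration of that renaming. `to_prometheus_label` is a hypothetical helper, and the translation in a real collector also handles cases not covered here, such as names starting with a digit.

```python
import re


def to_prometheus_label(attribute: str) -> str:
    """Replace every character outside [a-zA-Z0-9_] with an underscore."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", attribute)


if __name__ == "__main__":
    for name in ("http.method", "http.route", "http.status_code"):
        print(name, "->", to_prometheus_label(name))
```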

### System metrics

| Metric             | Type  | Unit       | Description                      |
| ------------------ | ----- | ---------- | -------------------------------- |
| `system.cpu.usage` | Gauge | percentage | CPU usage of the process host    |
| `system.mem.usage` | Gauge | percentage | Memory usage of the process host |

### Histogram buckets

Latency histograms use the following bucket boundaries (in seconds):

```
0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
```
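
To see how an observation maps onto these boundaries, the sketch below finds the bucket whose upper bound (the `le` label in Prometheus) first covers a value. It assumes the value is expressed in the same unit as the boundaries and that upper bounds are inclusive. `bucket_upper_bound` is an illustrative name, not a Flowker or SDK function.

```python
from bisect import bisect_left

# Boundaries copied from the list above.
BOUNDARIES = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def bucket_upper_bound(value: float) -> float:
    """Return the upper bound of the first bucket that contains value."""
    # bisect_left gives the index of the first boundary >= value,
    # which makes each upper bound inclusive.
    index = bisect_left(BOUNDARIES, value)
    # Values above the last boundary land in the overflow (+Inf) bucket.
    return BOUNDARIES[index] if index < len(BOUNDARIES) else float("inf")


if __name__ == "__main__":
    print(bucket_upper_bound(0.03))
```

For example, an observation of `0.03` is counted in the `0.05` bucket, and anything above `10` falls into the overflow bucket.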

<Note>
  Flowker does not expose a Prometheus scrape endpoint (`/metrics`) directly. Metrics are exported via OTLP to your collector, which then forwards them to Prometheus. Configure your OTLP collector to include a `prometheusremotewrite` exporter.
</Note>

## Structured logging

***

Flowker uses **structured JSON logging** via Zap. Every log entry is enriched with contextual fields that can be indexed and queried in Loki.

### Log fields reference

| Field           | Description                       | Example                     |
| --------------- | --------------------------------- | --------------------------- |
| `operation`     | Span/operation name               | `command.execution.execute` |
| `workflow.id`   | Workflow identifier               | `wf_abc123`                 |
| `execution.id`  | Execution identifier              | `exec_xyz789`               |
| `node.id`       | Node identifier within a workflow | `node-payment`              |
| `executor.id`   | Executor identifier               | `exec_cfg_001`              |
| `error.message` | Error description when applicable | `database ping failed: ...` |

### Log levels

| Level   | When used                                                        |
| ------- | ---------------------------------------------------------------- |
| `debug` | Detailed internal state — for development only                   |
| `info`  | Normal operation milestones (execution started, recovered, etc.) |
| `warn`  | Recoverable issues or unexpected but non-fatal conditions        |
| `error` | Operation failures that require attention                        |

Set the `LOG_LEVEL` environment variable to control verbosity.
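
The sketch below illustrates the usual threshold behavior of a level setting: an entry is written when its level is at or above the configured one. It assumes Flowker follows this common convention. `is_emitted` is a hypothetical helper for illustration.

```python
# Levels in increasing order of severity, as listed in the table above.
LEVELS = ["debug", "info", "warn", "error"]


def is_emitted(entry_level: str, configured_level: str) -> bool:
    """Return True when an entry at entry_level passes the configured threshold."""
    return LEVELS.index(entry_level) >= LEVELS.index(configured_level)


if __name__ == "__main__":
    for level in LEVELS:
        print(level, is_emitted(level, "info"))
```

Under this assumption, `LOG_LEVEL=info` suppresses `debug` entries and keeps `info`, `warn`, and `error`.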

### Example log entries

Workflow execution started:

```json
{
  "level": "info",
  "operation": "command.execution.execute",
  "workflow.id": "wf_abc123",
  "message": "Starting workflow execution"
}
```

Incomplete execution recovery:

```json
{
  "level": "info",
  "operation": "command.execution.recover",
  "count": 3,
  "message": "Recovering incomplete executions"
}
```

Execution failed:

```json
{
  "level": "error",
  "execution.id": "exec_xyz789",
  "workflow.id": "wf_abc123",
  "execution.status": "failed",
  "error.message": "executor node missing providerConfigId",
  "message": "Workflow execution failed"
}
```
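
Because each entry is a single JSON object, logs can also be filtered outside Loki with a few lines of code. The sketch below assumes one JSON object per line and uses only the field names shown in the examples above. `failed_executions` is an illustrative helper, not a Flowker API.

```python
import json
from typing import Iterable, Iterator


def failed_executions(lines: Iterable[str]) -> Iterator[dict]:
    """Yield error-level entries that carry an execution.id field."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        if entry.get("level") == "error" and "execution.id" in entry:
            yield entry


if __name__ == "__main__":
    import sys

    # Read log lines from standard input and print a short summary per failure.
    for entry in failed_executions(sys.stdin):
        print(entry["execution.id"], entry.get("error.message", ""))
```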

## Health probes

***

Flowker exposes Kubernetes-compatible liveness and readiness probes for operational monitoring. Liveness signals whether the process is running; readiness signals whether dependencies (notably the database) are reachable. Configure both at the cluster level as part of your deployment manifests so that orchestration can restart unhealthy pods and remove degraded instances from load balancers.

## Grafana dashboards

***

Flowker's telemetry integrates directly with the Lerian observability stack. Pre-configured dashboards are available through the Lerian-managed Grafana instance.

### Recommended panels

**Request throughput**

* Query: `sum(rate(http_server_duration_count{service_name="flowker"}[5m])) by (http_route)`
* Shows requests per second, broken down by route

**P95 latency**

* Query: `histogram_quantile(0.95, sum(rate(http_server_duration_bucket{service_name="flowker"}[5m])) by (le, http_route))`
* Shows the 95th percentile response time per route
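
The P95 value is an estimate derived from bucket counts, not an exact measurement. The sketch below is a simplified illustration of the linear interpolation that Prometheus applies to cumulative buckets. `estimate_quantile` is a hypothetical name, and the real function handles additional edge cases that are omitted here.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Rank falls in the overflow bucket: report the last finite bound.
                return lower_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    example = [(0.1, 50), (0.25, 90), (0.5, 100), (float("inf"), 100)]
    print(estimate_quantile(0.95, example))
```

The accuracy of the result depends on how closely the bucket boundaries bracket the real latencies, so a P95 that sits inside a wide bucket should be read as approximate.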

**Error rate**

* Query: `sum(rate(http_server_duration_count{service_name="flowker", http_status_code=~"5.."}[5m])) / sum(rate(http_server_duration_count{service_name="flowker"}[5m]))`
* Shows the ratio of 5xx responses

**Active executions (via logs)**

* Loki query: `count_over_time({service_name="flowker"} |= "Starting workflow execution" [1m])`
* Shows how many executions started in each one-minute window

<Note>
  For full observability stack setup, see [**Platform → Observability**](/en/platform/observability).
</Note>
