Observability Guide

Flowker emits traces, metrics, and structured logs using the OpenTelemetry standard. This guide explains what’s available, how to enable it, and how to interpret the data in your observability stack.

Overview

Flowker’s telemetry is built on three signals:

Signal	Backend	What it covers
Traces	Tempo	Distributed spans across workflow executions and steps
Metrics	Prometheus	HTTP request rates, latency, and system resource usage
Logs	Loki	Structured JSON logs for every operation

All signals are exported via OTLP (OpenTelemetry Protocol) to a collector of your choice.

Configuration

Telemetry is controlled by environment variables.

# Enable telemetry (required to activate OTLP export)
ENABLE_TELEMETRY=true

# OTLP collector endpoint (required when ENABLE_TELEMETRY=true)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

# Service identity
OTEL_RESOURCE_SERVICE_NAME=flowker
OTEL_RESOURCE_SERVICE_VERSION=1.0.0
OTEL_RESOURCE_DEPLOYMENT_ENVIRONMENT=production
OTEL_LIBRARY_NAME=flowker

# Log verbosity: debug | info | warn | error
LOG_LEVEL=info

If ENABLE_TELEMETRY=true is set without OTEL_EXPORTER_OTLP_ENDPOINT, Flowker will fail to start.
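The rule above can be checked before deployment. The following is a minimal sketch of such a pre-flight check, not Flowker's own startup code. The function name `validate_telemetry_env` is hypothetical, and treating the values as exact lowercase strings is an assumption.

```python
from typing import Mapping

ALLOWED_LOG_LEVELS = {"debug", "info", "warn", "error"}


def validate_telemetry_env(env: Mapping[str, str]) -> list[str]:
    """Return a list of configuration problems; an empty list means OK."""
    problems = []

    # Assumption: the flag is compared as the exact string "true".
    enabled = env.get("ENABLE_TELEMETRY", "") == "true"
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if enabled and not endpoint:
        problems.append(
            "ENABLE_TELEMETRY=true requires OTEL_EXPORTER_OTLP_ENDPOINT"
        )

    # LOG_LEVEL is only checked when it is set; no default is assumed here.
    level = env.get("LOG_LEVEL")
    if level is not None and level not in ALLOWED_LOG_LEVELS:
        problems.append(f"LOG_LEVEL must be debug, info, warn, or error, got {level!r}")

    return problems
```

Calling `validate_telemetry_env(os.environ)` in a CI step or an entrypoint script would surface the missing endpoint before the service is started.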

Distributed tracing

Every HTTP request and internal operation creates an OpenTelemetry span. Spans are propagated through the full execution chain, so a single workflow run produces a connected trace from the HTTP handler down to individual executor steps.

Span naming convention

Spans follow a <layer>.<resource>.<operation> pattern.

Execution spans

Span name	Description
`command.execution.execute`	Root span for a workflow execution
`command.execution.execute_executor_node`	Span for each executor node processed
`command.execution.execute_with_provider_config`	Span for a node resolved with a specific provider config
`command.execution.recover`	Span for incomplete execution recovery at startup

Workflow command spans

Span name	Description
`command.workflow.create`	Create a new workflow
`command.workflow.update`	Update an existing workflow
`command.workflow.activate`	Activate a workflow
`command.workflow.deactivate`	Deactivate a workflow
`command.workflow.clone`	Clone a workflow
`command.workflow.delete`	Delete a workflow

Executor configuration spans

Span name	Description
`command.executor_config.create`	Create executor configuration
`command.executor_config.update`	Update executor configuration
`command.executor_config.activate`	Activate executor configuration
`command.executor_config.enable`	Enable executor configuration
`command.executor_config.disable`	Disable executor configuration
`command.executor_config.mark_configured`	Mark executor as configured
`command.executor_config.mark_tested`	Mark executor as tested
`command.executor_config.test_connectivity`	Test executor connectivity
`command.executor_config.delete`	Delete executor configuration

Provider configuration spans

Span name	Description
`command.provider_config.create`	Create provider configuration
`command.provider_config.update`	Update provider configuration
`command.provider_config.enable`	Enable provider configuration
`command.provider_config.disable`	Disable provider configuration
`command.provider_config.test_connectivity`	Test provider connectivity
`command.provider_config.delete`	Delete provider configuration

Query spans

Span name	Description
`query.execution.get`	Get execution by ID
`query.execution.list`	List executions
`query.execution.get_results`	Get execution results
`query.workflow.get`	Get workflow by ID
`query.workflow.get_by_name`	Get workflow by name
`query.workflow.list`	List workflows
`query.executor_config.get`	Get executor config by ID
`query.executor_config.get_by_name`	Get executor config by name
`query.executor_config.list`	List executor configs
`query.executor_config.exists`	Check executor config existence
`query.executor_config.exists_by_name`	Check executor config existence by name
`query.provider_config.get`	Get provider config by ID
`query.provider_config.list`	List provider configs

In Grafana Tempo, search by service name (flowker) and filter by span name to isolate specific operations. Use command.execution.execute as the entry point to see a full workflow trace.
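When post-processing exported traces, for example to group spans by resource, the naming pattern can be split into its three parts. This is a small sketch; `SpanName` and `parse_span_name` are hypothetical helper names, not part of Flowker.

```python
from typing import NamedTuple


class SpanName(NamedTuple):
    layer: str      # "command" or "query"
    resource: str   # e.g. "execution", "executor_config"
    operation: str  # e.g. "execute", "get_by_name"


def parse_span_name(name: str) -> SpanName:
    """Split a <layer>.<resource>.<operation> span name into its parts."""
    parts = name.split(".")
    # Underscores are allowed inside a part; dots only separate the parts.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a <layer>.<resource>.<operation> span name: {name!r}")
    return SpanName(*parts)
```

For example, `parse_span_name("query.executor_config.exists_by_name").resource` yields `"executor_config"`.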

Metrics

Flowker exposes HTTP and system metrics automatically via the OpenTelemetry SDK. No additional configuration is needed beyond enabling telemetry.

HTTP metrics (via otelfiber)

Collected per route by the otelfiber middleware:

Metric	Type	Description
`http.server.duration`	Histogram	Request duration in milliseconds
`http.server.request.size`	Histogram	Request payload size in bytes
`http.server.response.size`	Histogram	Response payload size in bytes
`http.server.active_requests`	UpDownCounter	Number of in-flight requests

Each metric carries labels: http.method, http.route, http.status_code.
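The PromQL queries later in this guide use underscore names such as http_server_duration and http_status_code, because the dotted OpenTelemetry names are rewritten on their way into Prometheus. The sketch below shows a simplified version of that rewrite. It is an approximation: depending on the collector's exporter settings, the real translation may also append unit or type suffixes, which this hypothetical helper ignores.

```python
import re


def prometheus_name(otel_name: str) -> str:
    """Replace characters outside [a-zA-Z0-9_:] with underscores."""
    return re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
```

With this rule, `http.server.duration` becomes `http_server_duration` and the label `http.status_code` becomes `http_status_code`.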

System metrics

Metric	Type	Unit	Description
`system.cpu.usage`	Gauge	percentage	CPU usage of the process host
`system.mem.usage`	Gauge	percentage	Memory usage of the process host

Histogram buckets

Latency histograms use the following bucket boundaries (in seconds):

0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
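To make the boundaries concrete, the sketch below assigns observations to buckets the way an explicit-bucket histogram does, with each upper boundary inclusive and one extra overflow bucket above the last boundary. It illustrates the bucketing rule only and is not Flowker's or the SDK's code. The function name is hypothetical, and observations are assumed to be in the same unit as the boundaries.

```python
from bisect import bisect_left

# Boundaries as listed above.
BOUNDS = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def bucket_counts(observations):
    """Count observations per bucket; the final slot is the overflow bucket."""
    counts = [0] * (len(BOUNDS) + 1)
    for value in observations:
        # bisect_left returns the index of the first boundary >= value,
        # so a value equal to a boundary lands in that boundary's bucket.
        counts[bisect_left(BOUNDS, value)] += 1
    return counts
```

An observation of 0.03 falls into the bucket bounded by 0.05, and anything above 10 falls into the overflow bucket, where quantile estimates lose resolution.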

Flowker does not expose a Prometheus scrape endpoint (/metrics) directly. Metrics are exported via OTLP to your collector, which then forwards them to Prometheus. To enable this, add a prometheusremotewrite exporter to the collector’s metrics pipeline.

Structured logging

Flowker uses structured JSON logging via Zap. Every log entry is enriched with contextual fields that can be indexed and queried in Loki.

Log fields reference

Field	Description	Example
`operation`	Span/operation name	`command.execution.execute`
`workflow.id`	Workflow identifier	`wf_abc123`
`execution.id`	Execution identifier	`exec_xyz789`
`node.id`	Node identifier within a workflow	`node-payment`
`executor.id`	Executor identifier	`exec_cfg_001`
`error.message`	Error description when applicable	`database ping failed: ...`

Log levels

Level	When used
`debug`	Detailed internal state — for development only
`info`	Normal operation milestones (execution started, recovered, etc.)
`warn`	Recoverable issues or unexpected but non-fatal conditions
`error`	Operation failures that require attention

Set the LOG_LEVEL environment variable to control verbosity.

Example log entries

Workflow execution started:

{
  "level": "info",
  "operation": "command.execution.execute",
  "workflow.id": "wf_abc123",
  "message": "Starting workflow execution"
}

Incomplete execution recovery:

{
  "level": "info",
  "operation": "command.execution.recover",
  "count": 3,
  "message": "Recovering incomplete executions"
}

Execution failed:

{
  "level": "error",
  "execution.id": "exec_xyz789",
  "workflow.id": "wf_abc123",
  "execution.status": "failed",
  "error.message": "executor node missing providerConfigId",
  "message": "Workflow execution failed"
}
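Because every entry is JSON, logs can also be filtered outside Loki with a few lines of code. The sketch below extracts failed executions from a stream of log lines. It assumes one JSON object per line, which is an assumption about the output format (the examples above are pretty-printed for readability), and `failed_executions` is a hypothetical helper name.

```python
import json


def failed_executions(lines):
    """Yield (execution.id, error.message) for each error-level entry."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("level") == "error":
            # Field names follow the log fields reference above.
            yield entry.get("execution.id"), entry.get("error.message")
```

Typical usage would be `for exec_id, msg in failed_executions(open("flowker.log")): ...`, where the file name is only an example.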

Health probes

Flowker exposes Kubernetes-compatible liveness and readiness probes for operational monitoring. Liveness signals whether the process is running; readiness signals whether dependencies (notably the database) are reachable. Configure both at the cluster level as part of your deployment manifests so that orchestration can restart unhealthy pods and remove degraded instances from load balancers.

Grafana dashboards

Flowker’s telemetry integrates directly with the Lerian observability stack. Pre-configured dashboards are available through the Lerian-managed Grafana instance.

Recommended panels

Request throughput

Query: sum(rate(http_server_duration_count{service_name="flowker"}[5m])) by (http_route)
Shows requests per second, broken down by route

P95 latency

Query: histogram_quantile(0.95, sum(rate(http_server_duration_bucket{service_name="flowker"}[5m])) by (le, http_route))
Shows the 95th percentile response time per route
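The value returned by histogram_quantile is an estimate interpolated within a bucket, not an exact percentile. The sketch below is a simplified version of that interpolation, intended to show why the result depends on the bucket boundaries. It omits several edge cases that Prometheus handles, and the function signature is an illustration rather than any real API.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the overflow bucket: report the
                # highest finite boundary instead of infinity.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With 90 of 100 requests at or below 0.25 and all 100 at or below 0.5, the 95th percentile is estimated halfway through the 0.25 to 0.5 bucket, at 0.375.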

Error rate

Query: sum(rate(http_server_duration_count{service_name="flowker", http_status_code=~"5.."}[5m])) / sum(rate(http_server_duration_count{service_name="flowker"}[5m]))
Shows the ratio of 5xx responses

Active executions (via logs)

Loki query: count_over_time({service_name="flowker"} |= "Starting workflow execution" [1m])
Shows the number of executions started per minute, as a proxy for active executions

For full observability stack setup, see Platform → Observability.