Flowker emits traces, metrics, and structured logs using the OpenTelemetry standard. This guide explains what’s available, how to enable it, and how to interpret the data in your observability stack.

Overview


Flowker’s telemetry is built on three signals:
| Signal | Backend | What it covers |
| --- | --- | --- |
| Traces | Tempo | Distributed spans across workflow executions and steps |
| Metrics | Prometheus | HTTP request rates, latency, and system resource usage |
| Logs | Loki | Structured JSON logs for every operation |
All signals are exported via OTLP (OpenTelemetry Protocol) to a collector of your choice.

Configuration


Telemetry is controlled by environment variables.
# Enable telemetry (required to activate OTLP export)
ENABLE_TELEMETRY=true

# OTLP collector endpoint (required when ENABLE_TELEMETRY=true)
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

# Service identity
OTEL_RESOURCE_SERVICE_NAME=flowker
OTEL_RESOURCE_SERVICE_VERSION=1.0.0
OTEL_RESOURCE_DEPLOYMENT_ENVIRONMENT=production
OTEL_LIBRARY_NAME=flowker

# Log verbosity: debug | info | warn | error
LOG_LEVEL=info
If ENABLE_TELEMETRY=true is set without OTEL_EXPORTER_OTLP_ENDPOINT, Flowker will fail to start.
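
The startup rule above can be checked before deployment. The following is a minimal sketch, not Flowker's own validation logic: the function name `validate_telemetry_env` is hypothetical, and treating `true` case-insensitively and ignoring an unset `LOG_LEVEL` are assumptions made for illustration.

```python
# Hypothetical pre-deployment check; Flowker's real startup validation may differ.
VALID_LOG_LEVELS = ("debug", "info", "warn", "error")


def validate_telemetry_env(env):
    """Return a list of problems found in a mapping of environment variables.

    An empty list means the telemetry settings look consistent.
    """
    problems = []

    # Assumption: the flag is compared case-insensitively against "true".
    enabled = env.get("ENABLE_TELEMETRY", "").strip().lower() == "true"
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if enabled and not endpoint:
        problems.append("ENABLE_TELEMETRY=true requires OTEL_EXPORTER_OTLP_ENDPOINT")

    # Assumption: an unset LOG_LEVEL is acceptable; only explicit values are checked.
    level = env.get("LOG_LEVEL")
    if level is not None and level.strip().lower() not in VALID_LOG_LEVELS:
        problems.append("LOG_LEVEL must be one of: " + ", ".join(VALID_LOG_LEVELS))

    return problems
```

Passing `dict(os.environ)` from a CI step or an entrypoint script would surface a missing endpoint before the service itself refuses to start.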

Distributed tracing


Every HTTP request and internal operation creates an OpenTelemetry span. Spans are propagated through the full execution chain, so a single workflow run produces a connected trace from the HTTP handler down to individual executor steps.

Span naming convention

Spans follow a <layer>.<resource>.<operation> pattern:

Execution spans
| Span name | Description |
| --- | --- |
| command.execution.execute | Root span for a workflow execution |
| command.execution.execute_executor_node | Span for each executor node processed |
| command.execution.execute_with_provider_config | Span for a node resolved with a specific provider config |
| command.execution.recover | Span for incomplete execution recovery at startup |
Workflow command spans
| Span name | Description |
| --- | --- |
| command.workflow.create | Create a new workflow |
| command.workflow.update | Update an existing workflow |
| command.workflow.activate | Activate a workflow |
| command.workflow.deactivate | Deactivate a workflow |
| command.workflow.clone | Clone a workflow |
| command.workflow.delete | Delete a workflow |
Executor configuration spans
| Span name | Description |
| --- | --- |
| command.executor_config.create | Create executor configuration |
| command.executor_config.update | Update executor configuration |
| command.executor_config.activate | Activate executor configuration |
| command.executor_config.enable | Enable executor configuration |
| command.executor_config.disable | Disable executor configuration |
| command.executor_config.mark_configured | Mark executor as configured |
| command.executor_config.mark_tested | Mark executor as tested |
| command.executor_config.test_connectivity | Test executor connectivity |
| command.executor_config.delete | Delete executor configuration |
Provider configuration spans
| Span name | Description |
| --- | --- |
| command.provider_config.create | Create provider configuration |
| command.provider_config.update | Update provider configuration |
| command.provider_config.enable | Enable provider configuration |
| command.provider_config.disable | Disable provider configuration |
| command.provider_config.test_connectivity | Test provider connectivity |
| command.provider_config.delete | Delete provider configuration |
Query spans
| Span name | Description |
| --- | --- |
| query.execution.get | Get execution by ID |
| query.execution.list | List executions |
| query.execution.get_results | Get execution results |
| query.workflow.get | Get workflow by ID |
| query.workflow.get_by_name | Get workflow by name |
| query.workflow.list | List workflows |
| query.executor_config.get | Get executor config by ID |
| query.executor_config.get_by_name | Get executor config by name |
| query.executor_config.list | List executor configs |
| query.executor_config.exists | Check executor config existence |
| query.executor_config.exists_by_name | Check executor config existence by name |
| query.provider_config.get | Get provider config by ID |
| query.provider_config.list | List provider configs |
In Grafana Tempo, search by service name (flowker) and filter by span name to isolate specific operations. Use command.execution.execute as the entry point to see a full workflow trace.
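
When post-processing exported spans, for example to group them by layer or resource, the naming convention can be split mechanically. This is an illustrative sketch only: `SpanName` and `parse_span_name` are hypothetical helpers, not part of Flowker or the OpenTelemetry SDK, and the sketch assumes every span name has exactly three dot-separated parts, as the names listed above do.

```python
from typing import NamedTuple


class SpanName(NamedTuple):
    """The three parts of a <layer>.<resource>.<operation> span name."""
    layer: str
    resource: str
    operation: str


def parse_span_name(name):
    """Split a span name into its layer, resource, and operation.

    Raises ValueError when the name does not have exactly three
    non-empty dot-separated parts.
    """
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("not a <layer>.<resource>.<operation> name: " + repr(name))
    return SpanName(*parts)


# Example: the root span of a workflow execution.
root = parse_span_name("command.execution.execute")
print(root.layer, root.resource, root.operation)
```

Underscores inside a part, as in `execute_executor_node`, are preserved because only dots act as separators.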

Metrics


Flowker exposes HTTP and system metrics automatically via the OpenTelemetry SDK. No additional configuration is needed beyond enabling telemetry.

HTTP metrics (via otelfiber)

Collected per route by the otelfiber middleware:
| Metric | Type | Description |
| --- | --- | --- |
| http.server.duration | Histogram | Request duration in milliseconds |
| http.server.request.size | Histogram | Request payload size in bytes |
| http.server.response.size | Histogram | Response payload size in bytes |
| http.server.active_requests | UpDownCounter | Number of in-flight requests |
Each metric carries labels: http.method, http.route, http.status_code.

System metrics

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| system.cpu.usage | Gauge | percentage | CPU usage of the process host |
| system.mem.usage | Gauge | percentage | Memory usage of the process host |

Histogram buckets

Latency histograms use the following bucket boundaries (in seconds):
0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
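
To see how a single measurement maps onto these boundaries, the sketch below finds the bucket an observation falls into. It is an illustration rather than Flowker or SDK code: `bucket_upper_bound` is a hypothetical helper, and it assumes the observed value is expressed in the same unit as the boundaries and that each upper bound is inclusive, as with a Prometheus `le` label.

```python
from bisect import bisect_left

# Bucket boundaries copied from the list above.
BOUNDARIES = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]


def bucket_upper_bound(value):
    """Return the upper bound of the bucket that contains `value`.

    Upper bounds are treated as inclusive, so a value equal to a boundary
    lands in that boundary's bucket. Values above the last boundary fall
    into the overflow bucket, represented here as infinity.
    """
    index = bisect_left(BOUNDARIES, value)
    if index == len(BOUNDARIES):
        return float("inf")
    return BOUNDARIES[index]


# A 0.03 observation is above 0.025 and at most 0.05.
print(bucket_upper_bound(0.03))
```

The widening gaps between boundaries mean resolution is finest for fast requests and coarsest above 1, which is worth keeping in mind when reading percentile estimates for slow routes.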
Flowker does not expose a Prometheus scrape endpoint (/metrics) directly. Metrics are exported via OTLP to your collector, which then forwards to Prometheus. Configure your OTLP collector to include a prometheusremotewrite exporter.

Structured logging


Flowker uses structured JSON logging via Zap. Every log entry is enriched with contextual fields that can be indexed and queried in Loki.

Log fields reference

| Field | Description | Example |
| --- | --- | --- |
| operation | Span/operation name | command.execution.execute |
| workflow.id | Workflow identifier | wf_abc123 |
| execution.id | Execution identifier | exec_xyz789 |
| node.id | Node identifier within a workflow | node-payment |
| executor.id | Executor identifier | exec_cfg_001 |
| error.message | Error description when applicable | database ping failed: ... |

Log levels

| Level | When used |
| --- | --- |
| debug | Detailed internal state, for development only |
| info | Normal operation milestones (execution started, recovered, etc.) |
| warn | Recoverable issues or unexpected but non-fatal conditions |
| error | Operation failures that require attention |
Set the LOG_LEVEL environment variable to control verbosity.

Example log entries

Workflow execution started:
{
  "level": "info",
  "operation": "command.execution.execute",
  "workflow.id": "wf_abc123",
  "message": "Starting workflow execution"
}
Incomplete execution recovery:
{
  "level": "info",
  "operation": "command.execution.recover",
  "count": 3,
  "message": "Recovering incomplete executions"
}
Execution failed:
{
  "level": "error",
  "execution.id": "exec_xyz789",
  "workflow.id": "wf_abc123",
  "execution.status": "failed",
  "error.message": "executor node missing providerConfigId",
  "message": "Workflow execution failed"
}
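
Because each entry is a JSON object with predictable keys, logs can also be filtered outside Loki, for instance when inspecting a captured log file. The following sketch assumes one JSON object per line, which is an assumption about the raw output (the examples above are pretty-printed for readability); `filter_logs` is a hypothetical helper.

```python
import json


def filter_logs(lines, wanted):
    """Return parsed log entries whose fields equal every key/value in `wanted`.

    `lines` is an iterable of strings, each expected to hold one JSON object.
    Blank lines and lines that are not JSON objects are skipped.
    """
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(entry, dict):
            continue
        if all(entry.get(key) == value for key, value in wanted.items()):
            matches.append(entry)
    return matches


sample = [
    '{"level": "info", "operation": "command.execution.execute", "workflow.id": "wf_abc123", "message": "Starting workflow execution"}',
    '{"level": "error", "execution.id": "exec_xyz789", "workflow.id": "wf_abc123", "message": "Workflow execution failed"}',
]

# All error entries for one workflow.
for entry in filter_logs(sample, {"level": "error", "workflow.id": "wf_abc123"}):
    print(entry["message"])
```

A dictionary is used for the filter because field names such as `workflow.id` contain dots and cannot be passed as Python keyword arguments.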

Health endpoints


Flowker exposes three health endpoints for operational monitoring and Kubernetes probe configuration.

GET /health

Combined health check. Returns service status, version, uptime, and dependency checks.

Response (healthy):
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": "4h32m15s",
  "timestamp": "2024-11-01T14:32:00Z",
  "checks": {
    "database": {
      "status": "healthy",
      "message": "database connection ok"
    }
  }
}
Response (unhealthy — status 503):
{
  "status": "unhealthy",
  "timestamp": "2024-11-01T14:32:00Z",
  "checks": {
    "database": {
      "status": "unhealthy",
      "message": "database ping failed: connection refused"
    }
  }
}

GET /health/live

Kubernetes liveness probe. Returns 200 OK if the process is running. Does not check dependencies.
{
  "status": "alive",
  "timestamp": "2024-11-01T14:32:00Z"
}

GET /health/ready

Kubernetes readiness probe. Returns 200 OK when all dependencies are healthy. Returns 503 Service Unavailable when the database is unreachable.
{
  "status": "ready",
  "timestamp": "2024-11-01T14:32:00Z",
  "checks": {
    "database": {
      "status": "healthy",
      "message": "database connection ok"
    }
  }
}

Kubernetes probe configuration

livenessProbe:
  httpGet:
    path: /health/live
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
Use /health/live for the liveness probe (restart on process hang) and /health/ready for the readiness probe (remove from load balancer when database is unavailable). Use /health for manual checks and monitoring dashboards.
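
For manual checks or a custom monitor, the responses shown above can be interpreted with a few lines of code. This is a sketch under stated assumptions: `summarize_health` is a hypothetical helper, it treats any check whose `status` is not `healthy` as failing, and it takes the status code and body as arguments so that it works with whatever HTTP client you already use.

```python
import json


def summarize_health(status_code, body):
    """Interpret a response from /health, /health/live, or /health/ready.

    Returns a tuple (ok, failing_checks), where `failing_checks` is a sorted
    list of dependency names whose status is not "healthy".
    """
    payload = json.loads(body)
    # /health/live carries no "checks" object, so default to an empty one.
    checks = payload.get("checks", {})
    failing = sorted(
        name for name, check in checks.items() if check.get("status") != "healthy"
    )
    ok = status_code == 200 and not failing
    return ok, failing


unhealthy_body = """
{
  "status": "unhealthy",
  "timestamp": "2024-11-01T14:32:00Z",
  "checks": {
    "database": {
      "status": "unhealthy",
      "message": "database ping failed: connection refused"
    }
  }
}
"""

print(summarize_health(503, unhealthy_body))
```

Returning the names of failing checks, rather than a bare boolean, makes it easier to tell which dependency caused a 503.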

Grafana dashboards


Flowker’s telemetry integrates directly with the Lerian observability stack. Pre-configured dashboards are available through the Lerian-managed Grafana instance.

Request throughput
  • Query: sum(rate(http_server_duration_count{service_name="flowker"}[5m])) by (http_route)
  • Shows requests per second, broken down by route
P95 latency
  • Query: histogram_quantile(0.95, sum(rate(http_server_duration_bucket{service_name="flowker"}[5m])) by (le, http_route))
  • Shows the 95th percentile response time per route
Error rate
  • Query: sum(rate(http_server_duration_count{service_name="flowker", http_status_code=~"5.."}[5m])) / sum(rate(http_server_duration_count{service_name="flowker"}[5m]))
  • Shows the ratio of 5xx responses
Execution starts per minute (via logs)
  • Loki query: count_over_time({service_name="flowker"} |= "Starting workflow execution" [1m])
  • Shows how many workflow executions started in each one-minute window
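
To make the P95 latency query above more concrete, the sketch below reproduces the idea behind `histogram_quantile`: locate the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus's implementation. The function works on plain cumulative counts instead of rates, and the sample counts are invented for the example.

```python
INF = float("inf")


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the overflow bucket (INF, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == INF or count == prev_count:
                # No finite upper edge or no observations to interpolate over:
                # fall back to the lower edge of the bucket.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented sample data: 100 requests, 90 of them at or below 0.25.
sample_buckets = [(0.1, 50), (0.25, 90), (0.5, 100), (INF, 100)]
print(estimate_quantile(0.95, sample_buckets))
```

Because the result is interpolated within a bucket, its precision is limited by the bucket boundaries: a reported P95 can only be as exact as the width of the bucket it falls in.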
For full observability stack setup, see Platform → Observability.