
GlassFlow Metrics

This guide provides comprehensive information about GlassFlow’s metrics, including available metrics, labels, and monitoring best practices.

  • Metrics are enabled by default and available at the OTEL collector endpoint
  • All backend metrics follow Prometheus format and can be scraped by Prometheus
  • Metrics include component-specific labels for detailed monitoring
  • OpenTelemetry (OTEL) collector exposes metrics via Prometheus exporter
  • UI metrics are exported via OTLP, not Prometheus scraping

Metrics Overview

GlassFlow exports comprehensive metrics in Prometheus format through an OpenTelemetry collector. The metrics are designed to provide visibility into:

  • Data Ingestion: Kafka record consumption rates and volumes
  • Data Processing: Processing duration, throughput, and byte volume metrics
  • Data Sinking: ClickHouse write operations and performance
  • Error Handling: Dead Letter Queue (DLQ) operations
  • OTLP Receiver: Incoming OTLP request rates and latency
  • HTTP Server: API request rates and latency
  • UI: Page views, interactions, and frontend API performance

Metric Naming Convention

All GlassFlow backend metrics follow a consistent naming pattern:

{namespace}_gfm_{metric_name}

Where:

  • {namespace} - Deployment namespace prefix (e.g., “glassflow” if deployed in glassflow namespace)
  • gfm - GlassFlow Metrics prefix
  • {metric_name} - Descriptive metric name

UI metrics use the prefix gfm_ui_ and are exported via OTLP.

The namespace prefix is automatically added based on your deployment configuration. If you deploy GlassFlow in a different namespace, the prefix will change accordingly.
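As a quick illustration of the convention (a sketch, not GlassFlow code), the full Prometheus-visible name can be assembled from its three parts:

```python
# Sketch: assemble the Prometheus-visible metric name from the deployment
# namespace, the fixed "gfm" prefix, and the descriptive metric name.
def full_metric_name(namespace: str, metric_name: str) -> str:
    """e.g. full_metric_name("glassflow", "kafka_records_read_total")"""
    return f"{namespace}_gfm_{metric_name}"

name = full_metric_name("glassflow", "kafka_records_read_total")
# → "glassflow_gfm_kafka_records_read_total"
```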

Core Metrics

Data Ingestion Metrics

{namespace}_gfm_kafka_records_read_total

  • Type: Counter
  • Description: Total number of records read from Kafka
  • Unit: Records
  • Components: Ingestor
  • Labels:
    • component: Component type (e.g., “ingestor”) - Added by GlassFlow
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

Example:

glassflow_gfm_kafka_records_read_total{component="ingestor",instance="ingestor-0-7f44fbbfd8-bqbw9",job="pipeline-load-pipeline-1-05b7/ingestor",pipeline_id="load-pipeline-1-05b7"} 16914

In this example, glassflow is the namespace prefix. If you deploy in a different namespace, the prefix will change accordingly.

Data Processing Metrics

{namespace}_gfm_processing_duration_seconds

  • Type: Histogram
  • Description: Processing duration in seconds
  • Unit: Seconds
  • Components: Ingestor, Sink, Transform, Filter, Dedup
  • Labels:
    • component: Component type - Added by GlassFlow
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • stage: (Optional) Processing stage — Added by GlassFlow. Values: dedup_filter, dedup_write, schema_mapping, total_preparation, per_message. Omitted when not applicable.
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus
    • le: Histogram bucket boundary - Added by Prometheus

Histogram Buckets:

  • 0.001s (1ms)
  • 0.005s (5ms)
  • 0.01s (10ms)
  • 0.025s (25ms)
  • 0.05s (50ms)
  • 0.1s (100ms)
  • 0.25s (250ms)
  • 0.5s (500ms)
  • 1.0s (1s)
  • 2.5s (2.5s)
  • 5.0s (5s)
  • 10.0s (10s)

Example:

glassflow_gfm_processing_duration_seconds_bucket{component="ingestor",instance="ingestor-0-7f44fbbfd8-bqbw9",job="pipeline-load-pipeline-1-05b7/ingestor",pipeline_id="load-pipeline-1-05b7",le="5"} 16914
glassflow_gfm_processing_duration_seconds_sum{component="ingestor",instance="ingestor-0-7f44fbbfd8-bqbw9",job="pipeline-load-pipeline-1-05b7/ingestor",pipeline_id="load-pipeline-1-05b7"} 2.8343126270000267
glassflow_gfm_processing_duration_seconds_count{component="ingestor",instance="ingestor-0-7f44fbbfd8-bqbw9",job="pipeline-load-pipeline-1-05b7/ingestor",pipeline_id="load-pipeline-1-05b7"} 16914


{namespace}_gfm_processor_messages_total

  • Type: Counter
  • Description: Total number of messages processed by a processor, by status
  • Unit: Messages
  • Components: Ingestor, Sink, Transform, Filter, Dedup
  • Labels:
    • component: Component type - Added by GlassFlow
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • status: Outcome - Added by GlassFlow — Values: success, error, filtered, duplicate, out
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

There is no separate gfm_records_filtered_total metric. To track filtered records, query gfm_processor_messages_total with status="filtered".
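To make the label semantics concrete: per-status series belong to one counter family, so shares are computed across label values. A small sketch with hypothetical counts:

```python
# Illustration (hypothetical values): gfm_processor_messages_total is one
# counter family; per-status series are distinguished by the status label.
counts = {"success": 9200, "error": 12, "filtered": 640, "duplicate": 148}

# Share of records dropped by a filter = filtered count over all statuses.
# PromQL equivalent idea:
#   sum(rate(..._processor_messages_total{status="filtered"}[5m]))
#     / sum(rate(..._processor_messages_total[5m]))
filtered_share = counts["filtered"] / sum(counts.values())  # 0.064
```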

{namespace}_gfm_bytes_processed_total

  • Type: Counter
  • Description: Total bytes processed
  • Unit: Bytes
  • Components: Ingestor, Sink, Transform, Filter, Dedup
  • Labels:
    • component: Component type - Added by GlassFlow
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • direction: Data flow direction - Added by GlassFlow — Values: in, out
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

Data Sinking Metrics

{namespace}_gfm_clickhouse_records_written_total

  • Type: Counter
  • Description: Total number of records written to ClickHouse
  • Unit: Records
  • Components: Sink
  • Labels:
    • component: Component type (e.g., “sink”) - Added by GlassFlow
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

Example:

glassflow_gfm_clickhouse_records_written_total{component="sink",instance="sink-7c87594fd9-jmdw2",job="pipeline-load-pipeline-1-05b7/sink",pipeline_id="load-pipeline-1-05b7"} 80000


{namespace}_gfm_clickhouse_records_written_per_second

  • Type: Gauge
  • Description: Number of records written to ClickHouse per second
  • Unit: Records per second
  • Components: Sink
  • Labels:
    • component: Component type (e.g., “sink”) - Added by GlassFlow
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

Example:

glassflow_gfm_clickhouse_records_written_per_second{component="sink",instance="sink-7c87594fd9-jmdw2",job="pipeline-load-pipeline-1-05b7/sink",pipeline_id="load-pipeline-1-05b7"} 430120.2485745206


Error Handling Metrics

{namespace}_gfm_dlq_records_written_total

  • Type: Counter
  • Description: Total number of records written to dead letter queue
  • Unit: Records
  • Components: Ingestor, Sink
  • Labels:
    • component: Component type - Added by GlassFlow
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

This metric is defined in the code but may not appear in the sample metrics if no records have been written to the DLQ during the observation period.

HTTP Server Metrics

{namespace}_gfm_http_server_request_count

  • Type: Counter
  • Description: Total number of HTTP requests
  • Unit: Requests
  • Components: API server
  • Labels:
    • method: HTTP method - Added by GlassFlow
    • path: Route path template - Added by GlassFlow
    • status: HTTP response status code (integer) - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

HTTP server metrics do not include component or pipeline_id labels. They are scoped by method, path, and status only.

{namespace}_gfm_http_server_request_duration_seconds

  • Type: Histogram
  • Description: Duration of HTTP requests in seconds
  • Unit: Seconds
  • Components: API server
  • Labels:
    • method: HTTP method - Added by GlassFlow
    • path: Route path template - Added by GlassFlow
    • status: HTTP response status code (integer) - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus
    • le: Histogram bucket boundary - Added by Prometheus

Histogram Buckets: Same as processing_duration_seconds (0.001s to 10s).

OTLP Receiver Metrics

{namespace}_gfm_receiver_request_count

  • Type: Counter
  • Description: Total number of OTLP receiver requests
  • Unit: Requests
  • Components: otlp.logs, otlp.metrics, otlp.traces
  • Labels:
    • component: Component type - Added by GlassFlow
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • transport: Transport protocol - Added by GlassFlow — Values: http, grpc
    • status: Request outcome - Added by GlassFlow — Values: ok, error
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

{namespace}_gfm_receiver_request_duration_seconds

  • Type: Histogram
  • Description: Duration of OTLP receiver requests in seconds
  • Unit: Seconds
  • Components: otlp.logs, otlp.metrics, otlp.traces
  • Labels:
    • component: Component type - Added by GlassFlow
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • transport: Transport protocol - Added by GlassFlow — Values: http, grpc
    • status: Request outcome - Added by GlassFlow — Values: ok, error
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus
    • le: Histogram bucket boundary - Added by Prometheus

Histogram Buckets: Same as processing_duration_seconds (0.001s to 10s).

Back-Pressure Metrics

These metrics are emitted by the ingestor component and the NATS stream sampler. They help identify whether back-pressure is active and how long episodes last.

{namespace}_gfm_ingestor_backpressure_active

  • Type: Gauge (Int64)
  • Description: Set to 1 while the ingestor is blocked waiting for NATS to drain; 0 otherwise
  • Unit: —
  • Components: Ingestor
  • Labels:
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

{namespace}_gfm_ingestor_backpressure_events_total

  • Type: Counter
  • Description: Total number of times the ingestor entered a back-pressure episode
  • Unit: Events
  • Components: Ingestor
  • Labels:
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

{namespace}_gfm_ingestor_backpressure_duration_seconds

  • Type: Histogram
  • Description: Duration of each ingestor back-pressure episode in seconds
  • Unit: Seconds
  • Components: Ingestor
  • Labels:
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus
    • le: Histogram bucket boundary - Added by Prometheus

Histogram Buckets: 0.1, 0.5, 1, 2.5, 5, 10, 30, 60, 120, 300, 600, 1800 seconds (second-to-minute scale, unlike the millisecond-scale default buckets used by request/processing histograms).

{namespace}_gfm_stream_depth

  • Type: Gauge (Int64)
  • Description: Number of messages currently stored in a JetStream stream
  • Unit: Messages
  • Components: Ingestor
  • Labels:
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • stream: JetStream stream name - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

{namespace}_gfm_stream_depth_ratio

  • Type: Gauge (Float64)
  • Description: Stream depth divided by max_messages; ranges from 0.0 to 1.0. Sustained values near 1.0 indicate the stream is near capacity and back-pressure is likely
  • Unit: Ratio (0.0–1.0)
  • Components: Ingestor
  • Labels:
    • pipeline_id: Unique pipeline identifier - Added by GlassFlow
    • stream: JetStream stream name - Added by GlassFlow
    • instance: Instance identifier - Added by Prometheus
    • job: Job identifier - Added by Prometheus

UI Metrics

UI metrics are exported via OTLP (not Prometheus scraping). All UI metrics use the gfm_ui_ prefix.

Interaction Metrics

gfm_ui_page_views_total

  • Type: Counter
  • Description: Total page views
  • Labels: path, component

gfm_ui_button_clicks_total

  • Type: Counter
  • Description: Total button clicks
  • Labels: button_name, component

gfm_ui_form_submissions_total

  • Type: Counter
  • Description: Total form submissions
  • Labels: form_name, success, component

Pipeline Lifecycle Metrics

gfm_ui_pipeline_created_total

  • Type: Counter
  • Description: Total pipelines created from the UI
  • Labels: pipeline_type, component

gfm_ui_pipeline_deleted_total

  • Type: Counter
  • Description: Total pipelines deleted from the UI
  • Labels: pipeline_id, component

gfm_ui_pipeline_status_changed_total

  • Type: Counter
  • Description: Total pipeline status changes from the UI
  • Labels: pipeline_id, from_status, to_status, component

UI API Metrics

gfm_ui_api_request_count

  • Type: Counter
  • Description: Total API requests made from the UI
  • Labels: method, path, status, component

gfm_ui_api_request_errors_total

  • Type: Counter
  • Description: Total API request errors from the UI (HTTP status >= 400)
  • Labels: method, path, status, component

gfm_ui_api_request_duration_seconds

  • Type: Histogram
  • Description: Duration of API requests made from the UI
  • Labels: method, path, status, component

gfm_ui_page_load_duration_seconds

  • Type: Histogram
  • Description: Page load duration
  • Labels: path, component

Component-Specific Metrics

Ingestor Component

The Ingestor component primarily exports:

  • {namespace}_gfm_kafka_records_read_total - Records consumed from Kafka
  • {namespace}_gfm_processing_duration_seconds - Processing time for ingested records
  • {namespace}_gfm_processor_messages_total - Message counts by status
  • {namespace}_gfm_bytes_processed_total - Bytes processed (in/out)
  • {namespace}_gfm_dlq_records_written_total - Records sent to DLQ on processing errors
  • {namespace}_gfm_ingestor_backpressure_active - 1 while blocked on NATS back-pressure
  • {namespace}_gfm_ingestor_backpressure_events_total - Count of back-pressure episodes
  • {namespace}_gfm_ingestor_backpressure_duration_seconds - Duration of each back-pressure episode
  • {namespace}_gfm_stream_depth - Current message count in the JetStream stream
  • {namespace}_gfm_stream_depth_ratio - Stream fill ratio (0.0–1.0)

Sink Component

The Sink component primarily exports:

  • {namespace}_gfm_clickhouse_records_written_total - Records written to ClickHouse
  • {namespace}_gfm_clickhouse_records_written_per_second - Write rate to ClickHouse
  • {namespace}_gfm_processing_duration_seconds - Processing time for sink operations (with optional stage: dedup_filter, dedup_write, schema_mapping, total_preparation, per_message)
  • {namespace}_gfm_processor_messages_total - Message counts by status
  • {namespace}_gfm_bytes_processed_total - Bytes processed (in/out)
  • {namespace}_gfm_dlq_records_written_total - Records sent to DLQ on write errors

Transform Component

The Transform component primarily exports:

  • {namespace}_gfm_processing_duration_seconds - Processing time for transform operations
  • {namespace}_gfm_processor_messages_total - Message counts by status
  • {namespace}_gfm_bytes_processed_total - Bytes processed (in/out)

Filter Component

The Filter component primarily exports:

  • {namespace}_gfm_processing_duration_seconds - Processing time for filter operations
  • {namespace}_gfm_processor_messages_total - Message counts by status (use status="filtered" to track filtered records)
  • {namespace}_gfm_bytes_processed_total - Bytes processed (in/out)

Dedup Component

The Dedup component primarily exports:

  • {namespace}_gfm_processing_duration_seconds - Processing time for dedup operations (with stage: dedup_filter, dedup_write)
  • {namespace}_gfm_processor_messages_total - Message counts by status (use status="duplicate" to track deduplicated records)
  • {namespace}_gfm_bytes_processed_total - Bytes processed (in/out)

API Server

The API server exports HTTP metrics when metrics are enabled:

  • {namespace}_gfm_http_server_request_count - HTTP request count by method, path, and status
  • {namespace}_gfm_http_server_request_duration_seconds - HTTP request duration by method, path, and status

OTLP Receiver

The OTLP receiver components (otlp.logs, otlp.metrics, otlp.traces) export:

  • {namespace}_gfm_receiver_request_count - Receiver request count by transport (http/grpc) and status (ok/error)
  • {namespace}_gfm_receiver_request_duration_seconds - Receiver request duration by transport and status
  • {namespace}_gfm_bytes_processed_total - Bytes processed (in/out)

The gfm_ingestor_backpressure_* and gfm_stream_depth* metrics are emitted by the Kafka ingestor and are not available for OTLP source pipelines.

UI

The UI exports interaction and performance metrics via OTLP:

  • gfm_ui_page_views_total - Page view counts
  • gfm_ui_button_clicks_total - Button click counts
  • gfm_ui_form_submissions_total - Form submission counts
  • gfm_ui_api_request_count - API request counts from the UI
  • gfm_ui_api_request_errors_total - API request error counts from the UI
  • gfm_ui_api_request_duration_seconds - API request duration from the UI
  • gfm_ui_page_load_duration_seconds - Page load duration
  • gfm_ui_pipeline_created_total - Pipeline creation counts
  • gfm_ui_pipeline_deleted_total - Pipeline deletion counts
  • gfm_ui_pipeline_status_changed_total - Pipeline status change counts

Metric Labels

GlassFlow metrics include labels from two sources:

Application Labels (Added by GlassFlow)

These labels are added by the GlassFlow application code:

  • component - Component type. Values: ingestor, sink, dedup, transform, filter, api, otlp.logs, otlp.metrics, otlp.traces
  • pipeline_id - Unique pipeline identifier. Example: load-pipeline-1-05b7
  • stage - Processing stage (optional, for processing_duration_seconds). Values: dedup_filter, dedup_write, schema_mapping, total_preparation, per_message
  • status - Outcome (processor messages), request result (receiver), or HTTP status code (HTTP/UI metrics). Values: success, error, filtered, duplicate, out, ok, or an HTTP code such as 200
  • direction - Data flow direction (for bytes_processed_total). Values: in, out
  • transport - Transport protocol (for receiver metrics). Values: http, grpc
  • stream - JetStream stream name (for stream depth metrics). Example: the pipeline's stream name
  • method - HTTP method (HTTP and UI API metrics). Values: GET, POST, PUT, DELETE
  • path - Route path template (HTTP and UI metrics). Examples: /api/v1/pipelines, /health

Prometheus Labels (Added by Prometheus)

These labels are automatically added by Prometheus during the scraping process:

  • instance - Instance identifier (typically the pod name). Example: ingestor-0-7f44fbbfd8-bqbw9
  • job - Job identifier (from the Prometheus config). Example: pipeline-load-pipeline-1-05b7/ingestor
  • le - Histogram bucket boundary (histogram metrics only). Examples: 0.001, 0.005, 1.0, +Inf

Label Sources:

  • Application labels (component, pipeline_id) are added by GlassFlow code and are consistent across all deployments
  • Prometheus labels (instance, job, le) are added by Prometheus during scraping and depend on your monitoring setup
  • The job label comes from your Prometheus configuration’s job_name field
  • The instance label typically contains the Kubernetes pod name or target endpoint
  • HTTP server metrics (gfm_http_server_request_count, gfm_http_server_request_duration_seconds) do not include component or pipeline_id labels

Accessing Metrics

Metrics Endpoint

The OTEL collector service exposes metrics at:

{release-name}-otel-collector.{namespace}.svc.cluster.local:9090/metrics

For example, if you installed GlassFlow with the release name glassflow-chart in the glassflow namespace:

glassflow-chart-otel-collector.glassflow.svc.cluster.local:9090/metrics
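The endpoint serves the Prometheus text exposition format, one sample per line. A minimal parser sketch (handling the simple quoted-label form used by the samples in this guide, not the full exposition grammar) shows how a line breaks down into name, labels, and value:

```python
import re

# Sketch: parse one line of Prometheus text exposition format into
# (metric name, label dict, value). Covers the quoted-label samples shown
# in this guide; it is not a full exposition-format parser.
LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

def parse_sample(line: str):
    m = LINE.match(line.strip())
    name, raw_labels, value = m.group(1), m.group(2), float(m.group(3))
    return name, dict(LABEL.findall(raw_labels)), value

sample = ('glassflow_gfm_kafka_records_read_total{component="ingestor",'
          'pipeline_id="load-pipeline-1-05b7"} 16914')
name, labels, value = parse_sample(sample)
# name == "glassflow_gfm_kafka_records_read_total", value == 16914.0
```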

Prometheus Scraping

To scrape metrics with Prometheus, add the following configuration to your Prometheus config:

# GlassFlow OTEL Collector metrics
- job_name: 'glassflow-otel-collector'
  static_configs:
    - targets: ['glassflow-chart-otel-collector.glassflow.svc.cluster.local:9090']
  metrics_path: /metrics
  scrape_interval: 15s

💡 Replace glassflow-chart and glassflow with your actual release name and namespace if different.

Understanding the job Label

The job label in your metrics comes from the job_name field in your Prometheus configuration. For example:

  • If your Prometheus config has job_name: 'glassflow-otel-collector', then job="glassflow-otel-collector"
  • If you use Kubernetes service discovery, the job name might be auto-generated based on the service name
  • The job name helps Prometheus identify which scrape configuration was used to collect the metrics

Job Label Examples:

  • job="pipeline-load-pipeline-1-05b7/ingestor" - Indicates this metric came from an ingestor component
  • job="pipeline-load-pipeline-1-05b7/sink" - Indicates this metric came from a sink component
  • job="glassflow-otel-collector" - Indicates this metric came from the OTEL collector endpoint

Monitoring Best Practices

Key Metrics to Monitor

  1. Throughput Metrics:

    • rate({namespace}_gfm_kafka_records_read_total[5m]) - Kafka consumption rate
    • {namespace}_gfm_clickhouse_records_written_per_second - ClickHouse write rate
    • rate({namespace}_gfm_bytes_processed_total{direction="in"}[5m]) - Bytes ingestion rate
    • rate({namespace}_gfm_bytes_processed_total{direction="out"}[5m]) - Bytes output rate
  2. Latency Metrics:

    • histogram_quantile(0.95, rate({namespace}_gfm_processing_duration_seconds_bucket[5m])) - 95th percentile processing time
    • histogram_quantile(0.99, rate({namespace}_gfm_processing_duration_seconds_bucket[5m])) - 99th percentile processing time
    • Use the stage label on processing_duration_seconds for per-stage timing (e.g., dedup_filter, dedup_write, schema_mapping, total_preparation, per_message)
    • histogram_quantile(0.95, rate({namespace}_gfm_http_server_request_duration_seconds_bucket[5m])) - 95th percentile API latency
  3. Error Metrics:

    • rate({namespace}_gfm_dlq_records_written_total[5m]) - DLQ write rate
    • rate({namespace}_gfm_processor_messages_total{status="error"}[5m]) - Processor error rate
    • rate({namespace}_gfm_processor_messages_total{status="filtered"}[5m]) - Record filtering rate
    • rate({namespace}_gfm_processor_messages_total{status="duplicate"}[5m]) - Deduplication rate
  4. Receiver Metrics:

    • rate({namespace}_gfm_receiver_request_count{status="error"}[5m]) - OTLP receiver error rate
    • histogram_quantile(0.95, rate({namespace}_gfm_receiver_request_duration_seconds_bucket[5m])) - 95th percentile receiver latency
  5. Back-Pressure Metrics (Kafka ingestor pipelines only):

    • {namespace}_gfm_ingestor_backpressure_active - Currently in back-pressure (1) or not (0)
    • rate({namespace}_gfm_ingestor_backpressure_events_total[5m]) - Rate of new back-pressure episodes
    • histogram_quantile(0.95, rate({namespace}_gfm_ingestor_backpressure_duration_seconds_bucket[5m])) - 95th percentile episode duration
    • {namespace}_gfm_stream_depth_ratio - Stream fill ratio; alert when sustained near 1.0
    • {namespace}_gfm_stream_depth - Absolute message backlog in stream
  6. Health Metrics:

    • up - Target availability, generated by Prometheus per scrape target (not namespace-prefixed)
    • target_info - Target metadata emitted by the OTEL Prometheus exporter (not namespace-prefixed)
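The rate() expressions above all reduce to the same idea: the per-second increase of a counter over a window. A sketch with hypothetical samples makes the arithmetic explicit:

```python
# Sketch: PromQL's rate() is, conceptually, the per-second increase of a
# counter over a time window (PromQL additionally handles counter resets
# and extrapolation; this sketch assumes no reset between samples).
def per_second_rate(v0: float, t0: float, v1: float, t1: float) -> float:
    return (v1 - v0) / (t1 - t0)

# Hypothetical: kafka_records_read_total went from 16914 to 19914
# over 60 seconds of scrapes → 50 records/second consumption rate.
throughput = per_second_rate(16914, 0.0, 19914, 60.0)  # 50.0
```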

Installation and Setup

For detailed installation instructions and configuration options, see the Observability Installation Guide.
