
Scaling Guide

GlassFlow supports horizontal scaling of pipeline components. You can run multiple replicas of the ingestor, transformation, and sink stages to increase throughput and improve availability. This page describes what can be scaled, how to configure it, and recommended practices.

Overview

  • Ingestor — Scale the number of consumers reading from Kafka (supported for base, left, and right ingestors when using joins).
  • Transform — Scale the number of transformation workers that process messages from the buffer.
  • Sink — Scale the number of sink workers that write batches to ClickHouse.

Together, these components let you scale a pipeline fully horizontally, with one exception: join pipelines require the join stage to run with a single replica, although their ingestor and sink stages can still be scaled.

What Can Be Scaled

| Component | Scalable | Notes |
| --- | --- | --- |
| Ingestor | Yes | Increase replicas to consume from more Kafka partitions in parallel. |
| Transform | Yes | Increase replicas when transformation is the bottleneck. Transform replicas cannot be changed after the pipeline is created when deduplication is enabled. |
| Sink | Yes | Increase replicas to write to ClickHouse with higher throughput. |
| Join | No | Join must run with 1 replica. Use multiple pipelines or filter-based routing if you need more join capacity. |

How to Configure Scaling

Replicas are configured under pipeline_resources in the pipeline JSON. See the Pipeline JSON Reference for the full schema.

Example: scaling ingestor, transform, and sink:

{
  "pipeline_resources": {
    "ingestor": { "base": { "replicas": 4 } },
    "transform": { "replicas": 2 },
    "sink": { "replicas": 2 }
  }
}

For join pipelines, configure left and right ingestors separately; the join stage remains at 1 replica.
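For example, a join pipeline whose two source topics each have four partitions might be configured as follows. This is a sketch: it assumes the join schema mirrors the base-ingestor example above, with `left` and `right` keys in place of `base`; check the Pipeline JSON Reference for the exact field names.

```json
{
  "pipeline_resources": {
    "ingestor": {
      "left": { "replicas": 4 },
      "right": { "replicas": 4 }
    },
    "sink": { "replicas": 2 }
  }
}
```

Note that no replica count is set for the join stage; it always runs as a single instance.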

Best Practices

Match ingestor replicas to Kafka partitions

For non-join pipelines, the number of ingestor replicas is typically limited by the number of Kafka topic partitions. Running more ingestor replicas than partitions does not increase throughput for that topic; aim for at most one consumer per partition per consumer group.
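This cap can be expressed as a small helper. The function below is an illustrative sketch (not part of the GlassFlow API) that clamps a desired replica count to the topic's partition count:

```python
def effective_ingestor_replicas(desired_replicas: int, topic_partitions: int) -> int:
    """Cap ingestor replicas at the topic's partition count.

    Within one consumer group, Kafka assigns each partition to at most one
    consumer, so replicas beyond the partition count sit idle.
    """
    return min(desired_replicas, topic_partitions)


print(effective_ingestor_replicas(8, 4))  # 4: the other 4 replicas would be idle
```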

Identify the bottleneck first

Before scaling, use metrics to see where time is spent (ingestor, transform, or sink). Scale the component that is CPU- or I/O-bound. Scaling the wrong stage adds cost without improving throughput.
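As a simple model of this check: compare per-stage throughput and scale the slowest stage first. The numbers below are hypothetical measurements, not values emitted by GlassFlow itself:

```python
# Hypothetical per-stage throughput (events/sec) read from your metrics
# system; the stage with the lowest rate is the bottleneck.
stage_rates = {"ingestor": 50_000, "transform": 12_000, "sink": 30_000}

bottleneck = min(stage_rates, key=stage_rates.get)
print(bottleneck)  # transform
```

Here the transform stage processes the fewest events per second, so adding ingestor or sink replicas would not raise end-to-end throughput.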

Join pipelines

Join pipelines do not currently support scaling the join component. To handle more load with joins:

  • Scale left and right ingestor replicas to match the partition count of each joined topic.
  • Scale the sink if the join output is the bottleneck.
  • For very high throughput, consider splitting workload across multiple join pipelines (e.g. by key ranges or filters) so each pipeline has one join instance.
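One way to split by key is deterministic hashing, so that matching left and right records always land in the same pipeline. The routing function below is an illustrative sketch, not a GlassFlow feature; you would apply it upstream when producing to the per-pipeline topics or filters:

```python
import hashlib


def pipeline_for_key(key: str, num_pipelines: int) -> int:
    """Route a record key to one of several join pipelines.

    Uses a stable digest rather than Python's built-in hash(), which is
    randomized per process, so routing stays consistent across producers.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_pipelines
```

Because the mapping is deterministic, both joined topics can be partitioned with the same function and each pipeline's single join instance only ever sees its own key range.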

Transform and deduplication

When deduplication is enabled, the transform stage runs the deduplication logic and its replica count cannot be changed after the pipeline is created (fixed at 1 in practice). If you need both deduplication and more transformation capacity, scale the ingestor and sink; the single transform/dedup instance will process from the buffer as fast as it can.
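Under that constraint, a deduplicating pipeline might scale only the ingestor and sink, keeping the transform stage at one replica. A sketch, using the same field names as the earlier example:

```json
{
  "pipeline_resources": {
    "ingestor": { "base": { "replicas": 4 } },
    "transform": { "replicas": 1 },
    "sink": { "replicas": 2 }
  }
}
```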

Scaling beyond a single pipeline

For higher throughput than a single pipeline can provide, add more pipelines and distribute Kafka load across them (e.g. by topic or filter). See Performance for throughput examples and scaling by adding pipelines.

Summary

  • Ingestor, transform, and sink can be scaled via pipeline resources in the pipeline configuration.
  • Join pipelines support scaling ingestors and sink only; the join stage is single-replica.
  • Align ingestor replicas with Kafka partitions, use metrics to find bottlenecks, and combine component scaling with multi-pipeline scaling when needed.