FAQ

Q: How is GlassFlow’s deduplication different from ClickHouse’s ReplacingMergeTree?

ReplacingMergeTree (RMT) performs deduplication during background merges, so queries can return duplicate rows until a merge runs unless you force deduplication at read time with FINAL, which can significantly hurt read performance. GlassFlow moves deduplication upstream, before data is written to ClickHouse, which keeps query results correct in real time and reduces load on ClickHouse.
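
As a rough illustration, the following Go sketch (using the clickhouse-go v2 client; the `events` table name and connection details are assumptions for this example, not part of GlassFlow) shows the difference: a plain count over a ReplacingMergeTree table may include not-yet-merged duplicates, while the same query with FINAL deduplicates at read time at extra cost.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ClickHouse/clickhouse-go/v2"
)

func main() {
	// Hypothetical connection; adjust Addr/credentials for your setup.
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"localhost:9000"},
	})
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	var plain, deduped uint64

	// Without FINAL: counts whatever row versions exist in the parts,
	// duplicates included, until a background merge collapses them.
	if err := conn.QueryRow(ctx, "SELECT count() FROM events").Scan(&plain); err != nil {
		log.Fatal(err)
	}

	// With FINAL: rows are merged at query time, so the count is exact,
	// but the read becomes significantly more expensive.
	if err := conn.QueryRow(ctx, "SELECT count() FROM events FINAL").Scan(&deduped); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("raw rows: %d, deduplicated rows: %d\n", plain, deduped)
}
```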

Q: How does GlassFlow’s deduplication work?

GlassFlow uses BadgerDB, an embedded key-value store, to track deduplication state both in memory and on disk. For more details, see Deduplication.
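
To make the idea concrete, here is a minimal Go sketch of key-based deduplication on top of BadgerDB. The key scheme, TTL window, and function names are illustrative assumptions, not GlassFlow's actual implementation.

```go
package main

import (
	"errors"
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v4"
)

// isDuplicate reports whether key was already seen within the TTL window,
// recording it as seen if it was not. (Illustrative only; GlassFlow's real
// dedup logic and key derivation may differ.)
func isDuplicate(db *badger.DB, key []byte, ttl time.Duration) (bool, error) {
	dup := false
	err := db.Update(func(txn *badger.Txn) error {
		_, err := txn.Get(key)
		if err == nil {
			dup = true // key already stored: duplicate event
			return nil
		}
		if !errors.Is(err, badger.ErrKeyNotFound) {
			return err
		}
		// First sighting: remember the key, expiring it after the window.
		return txn.SetEntry(badger.NewEntry(key, nil).WithTTL(ttl))
	})
	return dup, err
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/dedup-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for _, id := range []string{"evt-1", "evt-2", "evt-1"} {
		dup, err := isDuplicate(db, []byte(id), time.Hour)
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("%s duplicate=%v", id, dup) // evt-1 is flagged on its second appearance
	}
}
```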

Q: Why do duplicates happen in data pipelines at all?

Duplicate events can occur for several reasons depending on the source. In Kafka, common causes include producer retries, network issues, or consumer reprocessing after failures. For example, if a producer doesn’t receive an acknowledgment, it may retry sending the same event—even if Kafka already received and stored it. Similarly, consumers might reprocess events after a crash or restart if offsets weren’t committed properly. In OTLP pipelines, duplicates can arise from collector retries or overlapping export intervals.
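
As an aside, Kafka's idempotent producer mode mitigates the retry-induced duplicates described above (though not consumer reprocessing or other sources). A hedged Go sketch using confluent-kafka-go; the broker address and topic are placeholders:

```go
package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	// enable.idempotence lets the broker discard producer retries of the
	// same batch, so a lost acknowledgment no longer creates a duplicate
	// event. It does not help with consumer-side reprocessing.
	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers":  "localhost:9092", // placeholder
		"enable.idempotence": true,             // implies acks=all
	})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()

	topic := "events" // placeholder
	err = p.Produce(&kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
		Value:          []byte(`{"id":"evt-1"}`),
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	p.Flush(5000) // wait up to 5s for delivery reports
}
```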

These duplicates become a problem when writing to systems like ClickHouse, which are optimized for fast analytical queries but don’t handle event deduplication natively. Without a deduplication layer, the same event could be stored multiple times, inflating metrics, skewing analysis, and consuming unnecessary storage.

Q: What happens during failures? Can you lose or duplicate data?

GlassFlow uses NATS JetStream as an internal buffer between pipeline stages. The source is only acknowledged after messages are successfully published to NATS — for Kafka this means committing offsets, for OTLP this means returning a success response to the sender. Data then flows through optional transformations (filter, deduplication, joins) and is batch-inserted into ClickHouse using the native protocol. If the system crashes after acknowledging the source but before inserting into ClickHouse, that batch is lost. We’re actively improving recovery guarantees to address this gap. For a detailed breakdown, see the Data Flow documentation.
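
The ordering of acknowledgments is the key point: the source offset is committed only after the durable publish to JetStream succeeds. Below is a minimal Go sketch of that pattern; the subject, topic, group ID, and addresses are assumptions for illustration, not GlassFlow's actual code, and the JetStream stream for the subject is assumed to exist.

```go
package main

import (
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":  "localhost:9092", // placeholder
		"group.id":           "glassflow-demo",
		"enable.auto.commit": false, // commit manually, only after the publish
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	if err := c.SubscribeTopics([]string{"events"}, nil); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := c.ReadMessage(10 * time.Second)
		if err != nil {
			continue // timeouts and transient errors: poll again
		}
		// 1. Durably publish into the internal buffer first.
		if _, err := js.Publish("pipeline.events", msg.Value); err != nil {
			continue // offset not committed, so Kafka will redeliver
		}
		// 2. Only now acknowledge the source by committing the offset.
		if _, err := c.CommitMessage(msg); err != nil {
			log.Printf("commit failed: %v", err)
		}
	}
}
```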
