
GlassFlow for ClickHouse ETL Documentation

GlassFlow is an open-source streaming ETL tool that moves data from Kafka to ClickHouse. It provides built-in deduplication and temporal joins, handles late-arriving events with exactly-once guarantees, and scales to high-throughput, low-latency workloads on ClickHouse. This reduces the need for ReplacingMergeTree, FINAL, and JOINs in ClickHouse.




Features

Streaming Deduplication

  • Real-time deduplication of Kafka streams before ingestion into ClickHouse (Deduplication)
  • Configurable time windows up to 7 days for deduplication
  • Simple configuration of deduplication keys and time windows
  • Prevents duplicate data from reaching ClickHouse
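The idea of keyed deduplication within a time window can be illustrated with a minimal Python sketch. This is not GlassFlow's implementation: the `WindowedDeduplicator` class and its `accept` method are hypothetical names, and the sketch assumes events arrive with non-decreasing timestamps.

```python
from collections import OrderedDict


class WindowedDeduplicator:
    """Hypothetical helper: drops events whose key was seen within the window."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        # key -> timestamp of the first accepted event for that key.
        # Insertion order equals time order because timestamps are assumed
        # to be non-decreasing.
        self._seen = OrderedDict()

    def _evict_expired(self, now: float) -> None:
        # The oldest entry is always first, so stop at the first live one.
        while self._seen:
            key, ts = next(iter(self._seen.items()))
            if now - ts < self.window_seconds:
                break
            self._seen.popitem(last=False)

    def accept(self, key: str, timestamp: float) -> bool:
        """Return True to forward the event, False if it is a duplicate."""
        self._evict_expired(timestamp)
        if key in self._seen:
            return False
        self._seen[key] = timestamp
        return True


# Usage: a one-hour window keyed on an order ID.
dedup = WindowedDeduplicator(window_seconds=3600)
events = [("order-1", 0), ("order-2", 10), ("order-1", 20), ("order-1", 4000)]
forwarded = [key for key, ts in events if dedup.accept(key, ts)]
print(forwarded)
```

The third event is dropped because `order-1` was already seen 20 seconds earlier. The fourth is forwarded because the first occurrence has aged out of the window by then.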

Temporal Stream Joins

  • Join two Kafka streams in real-time (Join)
  • Configurable time windows up to 7 days for stream joins
  • Configure join keys and time windows through the UI
  • Simplified join setup process
  • Produce joined streams ready for ClickHouse ingestion
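As a rough illustration of what a time-windowed join between two streams involves, the sketch below buffers events per join key and matches each new event against recent events from the other stream. It is a simplified model, not GlassFlow's implementation: `TemporalJoiner`, the `"left"`/`"right"` stream names, and the merge-by-dict-union output are all assumptions, and timestamps are assumed to be non-decreasing.

```python
from collections import defaultdict


class TemporalJoiner:
    """Hypothetical helper: joins two keyed streams within a time window."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        # One buffer per stream: join key -> list of (timestamp, payload).
        self._buffers = {"left": defaultdict(list), "right": defaultdict(list)}

    def process(self, side: str, key: str, timestamp: float, payload: dict) -> list:
        """Buffer the event and return the joined rows it produces."""
        other = "right" if side == "left" else "left"
        # Keep only other-side events that are still inside the window.
        live = [
            (ts, p)
            for ts, p in self._buffers[other][key]
            if timestamp - ts <= self.window_seconds
        ]
        self._buffers[other][key] = live
        self._buffers[side][key].append((timestamp, payload))

        joined = []
        for _, other_payload in live:
            left, right = (
                (payload, other_payload) if side == "left" else (other_payload, payload)
            )
            joined.append({**left, **right})  # right-side fields win on conflict
        return joined


# Usage: join orders (left) with user profiles (right) on user_id, 60 s window.
joiner = TemporalJoiner(window_seconds=60)
out = []
out += joiner.process("left", "u1", 0, {"user_id": "u1", "order": "A"})
out += joiner.process("right", "u1", 30, {"user_id": "u1", "email": "a@example.com"})
out += joiner.process("right", "u1", 500, {"user_id": "u1", "email": "b@example.com"})
print(out)
```

Only the second event produces a joined row. By the time the third event arrives, the buffered order is older than the window and is discarded.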

Filter

  • Expression-based filtering to process only events that match your criteria (Filter)
  • Configure filter expressions at the pipeline level; non-matching events are dropped before downstream processing
  • Uses the expr language for type-safe, field-based conditions

Stateless Transformations

  • Reshape event payloads with expression-based field mappings (Stateless transformations)
  • Normalize fields, derive new fields, and map nested JSON to a flat schema
  • Each event is transformed independently, with no stored state, using a rich expression language with built-in helpers
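To make the three operations above concrete (flattening nested JSON, normalizing a field, deriving a new field), here is a small Python sketch of a stateless, per-event transformation. It does not use GlassFlow's expression language. The `flatten` and `transform` functions and the field names are hypothetical and only illustrate the shape of the result.

```python
def flatten(event: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Recursively map nested dicts to a single level, joining keys with `sep`."""
    flat = {}
    for key, value in event.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def transform(event: dict) -> dict:
    """Stateless transformation: the output depends only on this one event."""
    row = flatten(event)
    # Normalize: trim whitespace and lowercase the email address.
    row["user_email"] = row["user_email"].strip().lower()
    # Derive: compute a new field from two existing ones.
    row["total"] = row["price"] * row["quantity"]
    return row


# Usage
event = {"user": {"email": "  Ada@Example.com "}, "price": 2.5, "quantity": 4}
row = transform(event)
print(row)
```

Because no state is kept between calls, each event can be processed independently and in any order.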

Kubernetes Native Architecture

  • Robust and scalable architecture natively built for Kubernetes (Architecture)
  • Easy installation using Helm
  • Custom Kubernetes controller for managing pipelines
  • Horizontal scalability

Built-in Kafka Connector

  • Automatic data extraction from Kafka topics (Supported Kafka connections)
  • Seamless integration with Kafka clusters
  • No manual data pulling required
  • Supports multiple Kafka topics and partitions
  • Native support for JSON data types including nested JSON and arrays

Optimized ClickHouse Sink

  • Native ClickHouse connection for maximum performance
  • Configurable batch sizes for efficient data ingestion
  • Adjustable maximum wait time before a batch is flushed, to balance latency against throughput
  • Built-in retry mechanisms
  • Automatic schema detection and management
  • Full support for JSON data types in ClickHouse including nested JSON and arrays
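The interplay between batch size and wait time can be sketched as a buffer that flushes when either limit is reached. This is an illustrative model only: `BatchBuffer`, its parameter names, and the `flush` callback (which stands in for an insert into ClickHouse) are assumptions, not GlassFlow's API.

```python
import time


class BatchBuffer:
    """Hypothetical helper: flushes on max batch size or max wait time."""

    def __init__(self, flush, max_batch_size, max_wait_seconds, clock=time.monotonic):
        self.flush = flush  # callable receiving a list of rows
        self.max_batch_size = max_batch_size
        self.max_wait_seconds = max_wait_seconds
        self.clock = clock  # injectable so the example is deterministic
        self._rows = []
        self._first_row_at = None

    def add(self, row) -> None:
        if not self._rows:
            # Start the wait timer when the first row of a batch arrives.
            self._first_row_at = self.clock()
        self._rows.append(row)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Flush if the batch is full or has waited long enough."""
        if not self._rows:
            return
        full = len(self._rows) >= self.max_batch_size
        waited = self.clock() - self._first_row_at >= self.max_wait_seconds
        if full or waited:
            batch, self._rows = self._rows, []
            self.flush(batch)


# Usage with a fake clock: batches of at most 3 rows, flushed after 5 s at the latest.
now = [0.0]
batches = []
buf = BatchBuffer(batches.append, max_batch_size=3, max_wait_seconds=5,
                  clock=lambda: now[0])
for i in range(4):
    buf.add({"id": i})
now[0] = 6.0       # simulate time passing
buf.maybe_flush()  # a real sink would call this from a timer
print(batches)
```

Larger batches generally mean fewer, more efficient inserts. The wait limit bounds how long a row can sit in the buffer when traffic is low.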

Additional Features

  • User-Friendly Interface: Web-based UI for pipeline configuration and management
  • SDK Support: Python SDK for programmatic management of pipelines
  • Local Development: Includes demo setup with local Kafka and ClickHouse instances
  • Self-Hosted: Open-source solution that can be self-hosted in your infrastructure

