GlassFlow for ClickHouse ETL Documentation
GlassFlow for ClickHouse Streaming ETL is a real-time stream processor designed to simplify data pipeline creation and management between Kafka and ClickHouse. It provides a powerful, user-friendly interface for building and managing real-time data pipelines with built-in support for deduplication and temporal joins.
It's fully open-source and built for data engineers by data engineers. GlassFlow handles late-arriving events, ensures exactly-once correctness, and scales with high-throughput data. It delivers accurate, low-latency results from streaming data without compromising simplicity or performance. The tool’s intuitive web interface makes it easy to configure and monitor pipelines, while its robust architecture ensures reliable data processing.

Features
Streaming Deduplication
- Real-time deduplication of Kafka streams before ingestion into ClickHouse
- Configurable time windows up to 7 days for deduplication
- Simple configuration of deduplication keys and time windows
- Prevents duplicate data from reaching ClickHouse
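To make the idea of key-based deduplication over a time window concrete, here is a minimal Python sketch. It is not GlassFlow's implementation or API: the class name `WindowedDeduplicator`, the `accept` method, and the `event_id` field are hypothetical and chosen for illustration only. The sketch assumes a "first event wins" policy and keeps state in memory; the 7-day cap mirrors the limit listed above.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(days=7)  # upper bound taken from the feature list above


class WindowedDeduplicator:
    """Illustrative only: keeps the first event seen per key within a time window."""

    def __init__(self, key_field: str, window: timedelta):
        if window > MAX_WINDOW:
            raise ValueError("deduplication window must not exceed 7 days")
        self.key_field = key_field
        self.window = window
        self._seen = {}  # key -> time the key was first accepted

    def accept(self, event: dict, event_time: datetime) -> bool:
        """Return True if the event should be forwarded, False if it is a duplicate."""
        # Forget keys whose first occurrence fell out of the window.
        cutoff = event_time - self.window
        self._seen = {k: t for k, t in self._seen.items() if t > cutoff}

        key = event[self.key_field]
        if key in self._seen:
            return False  # duplicate within the window: drop it
        self._seen[key] = event_time
        return True
```

With a one-hour window, a second event carrying the same `event_id` 30 minutes later would be dropped, while the same key arriving two hours later would be accepted again because the earlier entry has expired.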
Temporal Stream Joins
- Join two Kafka streams in real time
- Configurable time windows up to 7 days for stream joins
- Configure join keys and time windows through the UI
- Simplified join setup process
- Produce joined streams ready for ClickHouse ingestion
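The following Python sketch illustrates what a temporal join between two streams involves: events are matched when their join keys are equal and their timestamps lie within the configured window. It is a simplified illustration, not GlassFlow's implementation or API. The names `TemporalJoiner` and `push`, the `"left"`/`"right"` side labels, and the example fields are all hypothetical, and the sketch omits buffer eviction, which a real system would need.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class TemporalJoiner:
    """Illustrative only: joins two streams on a key within a time window."""

    def __init__(self, left_key: str, right_key: str, window: timedelta):
        self.keys = {"left": left_key, "right": right_key}
        self.window = window
        # Per-side buffers: join key value -> list of (event_time, event)
        self._buffers = {"left": defaultdict(list), "right": defaultdict(list)}

    def push(self, side: str, event: dict, event_time: datetime) -> list:
        """Buffer an event and return the joined records it produces."""
        other = "right" if side == "left" else "left"
        key = event[self.keys[side]]
        self._buffers[side][key].append((event_time, event))

        joined = []
        for other_time, other_event in self._buffers[other][key]:
            # Match only events whose timestamps are within the window.
            if abs(event_time - other_time) <= self.window:
                if side == "left":
                    joined.append({"left": event, "right": other_event})
                else:
                    joined.append({"left": other_event, "right": event})
        return joined
```

Each call to `push` returns the records created by the new event, so a caller could forward them directly to a sink. Because matching is symmetric, it does not matter which stream delivers its event first.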
Kubernetes Native Architecture
- Robust and scalable architecture natively built for Kubernetes
- Easy installation using Helm
- Custom Kubernetes controller for managing pipelines
- Horizontal scalability
Built-in Kafka Connector
- Automatic data extraction from Kafka topics
- Seamless integration with Kafka clusters
- No manual data pulling required
- Supports multiple Kafka topics and partitions
- Native support for JSON data types including nested JSON and arrays
Optimized ClickHouse Sink
- Native ClickHouse connection for maximum performance
- Configurable batch sizes for efficient data ingestion
- Adjustable wait times for optimal throughput
- Built-in retry mechanisms
- Automatic schema detection and management
- Full support for JSON data types in ClickHouse including nested JSON and arrays
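The interplay of batch size, wait time, and retries listed above can be sketched as follows. This is a generic illustration of the batching pattern, not GlassFlow's sink or its configuration options: `BatchingSink`, its parameters, and the `write_batch` callback are hypothetical names, and the exponential backoff is an assumed retry policy. The callback stands in for whatever actually performs the insert into ClickHouse.

```python
import time
from typing import Callable


class BatchingSink:
    """Illustrative only: flushes when the batch is full or has waited too long."""

    def __init__(self, write_batch: Callable[[list], None], max_batch_size: int,
                 max_wait_seconds: float, max_retries: int = 3,
                 retry_delay_seconds: float = 0.5):
        self._write_batch = write_batch  # hypothetical callback doing the insert
        self.max_batch_size = max_batch_size
        self.max_wait_seconds = max_wait_seconds
        self.max_retries = max_retries
        self.retry_delay_seconds = retry_delay_seconds
        self._buffer = []
        self._first_buffered_at = 0.0

    def add(self, row: dict) -> None:
        if not self._buffer:
            self._first_buffered_at = time.monotonic()
        self._buffer.append(row)
        self.flush_if_due()

    def flush_if_due(self) -> None:
        """Flush when either the size limit or the wait limit is reached."""
        if not self._buffer:
            return
        full = len(self._buffer) >= self.max_batch_size
        waited = time.monotonic() - self._first_buffered_at >= self.max_wait_seconds
        if full or waited:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch = list(self._buffer)
        for attempt in range(self.max_retries + 1):
            try:
                self._write_batch(batch)
                self._buffer.clear()  # drop rows only after a successful write
                return
            except Exception:
                if attempt == self.max_retries:
                    raise  # rows stay buffered so the caller can retry later
                time.sleep(self.retry_delay_seconds * 2 ** attempt)
```

Larger batches generally mean fewer, more efficient inserts, while the wait limit bounds how long a row can sit in the buffer when traffic is low. In a real pipeline, `flush_if_due` would also need to be called periodically so that a partially filled batch is written even when no new rows arrive.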
Additional Features
- User-Friendly Interface: Web-based UI for pipeline configuration and management
- SDK Support: Python SDK for programmatic management of pipelines
- Local Development: Includes demo setup with local Kafka and ClickHouse instances
- Docker Support: Easy deployment using Docker and Docker Compose
- Self-Hosted: Open-source solution that can be self-hosted in your infrastructure