Kafka Deduplication Demo
This demo walks you through a local installation using GlassFlow CLI. The CLI brings up a Kind cluster and deploys GlassFlow ETL, with local Kafka and ClickHouse. You will create a pipeline via the UI and verify that duplicate events are removed before they reach ClickHouse.
Prerequisites
- Docker (or compatible runtime like Docker Desktop, OrbStack, Colima, or Podman)
- Helm: install via the Helm docs, or run:
```shell
brew install helm
```
- kubectl (installed automatically via Homebrew if you use the recommended install)
Install GlassFlow locally
Verify installation
```shell
glassflow version
```
Start GlassFlow with Kafka and ClickHouse
```shell
glassflow up --demo
```
This will:
- Create a Kind cluster (if needed)
- Download pre-built container images (first run only)
- Install GlassFlow ETL (glassflow namespace), Kafka (kafka namespace), and ClickHouse (clickhouse namespace)
- Wait for all services to be ready
For more options and details, see the Installation Guide.
Set up the pipeline via the UI
Create a new topic in Kafka
```shell
kubectl exec -n kafka svc/kafka -- bash -c 'cat > /tmp/client.properties << EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="user1" password="glassflow-demo-password";
EOF
kafka-topics.sh --bootstrap-server kafka.kafka.svc.cluster.local:9092 \
  --command-config /tmp/client.properties \
  --create --topic duplicated-events \
  --partitions 1 --replication-factor 1'
```
Send example data to the topic
Send one sample event to the duplicated-events topic so the pipeline will have data to ingest once you create it in the UI:
```shell
kubectl exec -n kafka svc/kafka -- bash -c 'cat > /tmp/events.json << "EOFEVENTS"
{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:00:00Z"}
EOFEVENTS
cat > /tmp/client.properties << EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="user1" password="glassflow-demo-password";
EOF
cat /tmp/events.json | kafka-console-producer.sh --bootstrap-server kafka.kafka.svc.cluster.local:9092 \
  --producer.config /tmp/client.properties \
  --topic duplicated-events'
```
Create a new table in ClickHouse
```shell
kubectl exec -n clickhouse svc/clickhouse -- clickhouse-client \
  --user default \
  --password glassflow-demo-password \
  --query "CREATE TABLE IF NOT EXISTS deduplicated_events (event_id UUID, type String, source String, created_at DateTime) ENGINE = MergeTree ORDER BY event_id"
```
Configure pipeline in the UI
Once running, you can access the following endpoints (ports may vary if alternative ports were chosen during installation):
- GlassFlow UI: http://localhost:30080
- GlassFlow API: http://localhost:30180
- ClickHouse HTTP: http://localhost:30090
Open the GlassFlow UI and use the connection details below to create a pipeline.
- In this demo we will create a single source pipeline.
- Give the pipeline a name (for example, "Demo Pipeline").
- The UI will automatically generate a pipeline ID for the pipeline.
Kafka Connection
Authentication Method: SASL/PLAIN
Security Protocol: SASL_PLAINTEXT
Bootstrap Servers: kafka.kafka.svc.cluster.local:9092
Username: user1
Password: glassflow-demo-password
Kafka Topic
Topic Name: duplicated-events
Consumer Group Initial Offset: latest
Schema:
```json
{
  "event_id": "ddccabe2-c673-4d8a-affc-8647db00f7b5",
  "type": "page_view",
  "source": "web",
  "created_at": "2025-12-03T15:17:34.907877Z"
}
```
Deduplication
Enabled: true
Deduplicate Key: event_id
Deduplicate Key Type: string
Time Window: 1h
Skip the Filter step and skip the Transform step.
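To make the Time Window setting concrete, here is a minimal Python sketch of how keyed, time-windowed deduplication behaves. This is an illustration of the idea only, not GlassFlow's actual implementation:

```python
from datetime import datetime, timedelta

def deduplicate(events, key="event_id", window=timedelta(hours=1)):
    """Keep an event only if its key has not been emitted within the window."""
    last_seen = {}  # key value -> timestamp of the last emitted event
    kept = []
    for event in events:
        ts = datetime.fromisoformat(event["created_at"].replace("Z", "+00:00"))
        k = event[key]
        if k not in last_seen or ts - last_seen[k] > window:
            last_seen[k] = ts
            kept.append(event)
    return kept

# The three demo events: two share an event_id one minute apart.
events = [
    {"event_id": "49a6fdd6f305428881f3436eb498fc9d", "created_at": "2025-03-20T10:00:00Z"},
    {"event_id": "49a6fdd6f305428881f3436eb498fc9d", "created_at": "2025-03-20T10:01:00Z"},
    {"event_id": "f0ed455046a543459d9a51502cdc756d", "created_at": "2025-03-20T10:03:00Z"},
]
print(len(deduplicate(events)))  # 2
```

Because the two duplicate events arrive one minute apart, well inside the 1h window, only the first copy survives.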
ClickHouse Connection
Host: clickhouse.clickhouse.svc.cluster.local
HTTP/S Port: 8123
Native Port: 9000
Username: default
Password: glassflow-demo-password
Use SSL: false
ClickHouse Table
Table: deduplicated_events
Wait for the pipeline to be deployed; the UI will redirect to the pipeline page once it is ready.
Send data to Kafka
Run the following on your machine in a terminal:
```shell
# Send multiple JSON events to Kafka
kubectl exec -n kafka svc/kafka -- bash -c 'cat > /tmp/events.json << "EOFEVENTS"
{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:00:00Z"}
{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:01:00Z"}
{"event_id": "f0ed455046a543459d9a51502cdc756d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:03:00Z"}
EOFEVENTS
cat > /tmp/client.properties << EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="user1" password="glassflow-demo-password";
EOF
cat /tmp/events.json | kafka-console-producer.sh --bootstrap-server kafka.kafka.svc.cluster.local:9092 \
  --producer.config /tmp/client.properties \
  --topic duplicated-events'
```
Verify Results
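Before querying the table, it helps to know what to expect: with deduplication keyed on event_id, the three events sent above should collapse to two rows. A quick stand-alone check of that count (illustrative only, run anywhere with Python):

```python
import json

# The three payload lines sent to the duplicated-events topic above.
payloads = [
    '{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:00:00Z"}',
    '{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:01:00Z"}',
    '{"event_id": "f0ed455046a543459d9a51502cdc756d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:03:00Z"}',
]
unique_ids = {json.loads(p)["event_id"] for p in payloads}
print(len(unique_ids))  # expected number of rows in deduplicated_events
```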
After a short wait (up to the pipeline's maximum delay time, 1 minute by default), you should see the deduplicated events in ClickHouse:
```shell
kubectl exec -n clickhouse svc/clickhouse -- clickhouse-client \
  --user default \
  --password glassflow-demo-password \
  --format prettycompact \
  --query "SELECT * FROM deduplicated_events"
```
```
   ┌─event_id─────────────────────────────┬─type──────┬─source─┬──────────created_at─┐
1. │ 49a6fdd6-f305-4288-81f3-436eb498fc9d │ page_view │ web    │ 2025-03-20 10:00:00 │
2. │ f0ed4550-46a5-4345-9d9a-51502cdc756d │ page_view │ web    │ 2025-03-20 10:03:00 │
   └──────────────────────────────────────┴───────────┴────────┴─────────────────────┘
```
Congratulations! You've completed the demo.
What you achieved:
- You created a GlassFlow pipeline via the UI running on your local Kubernetes cluster.
- You sent events, including duplicates, to a Kafka topic.
- GlassFlow consumed from the topic, deduplicated by event_id, and wrote the result to a ClickHouse table.
- You verified the deduplicated data in the ClickHouse table.
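One detail worth noting in the query output: the producer sent event_id as a 32-character hex string, but the table shows the dashed canonical UUID form, since the column is typed UUID and the value is normalized on the way in. The same mapping in Python:

```python
import uuid

raw = "49a6fdd6f305428881f3436eb498fc9d"   # event_id as sent to Kafka
stored = str(uuid.UUID(raw))               # dashed form shown by clickhouse-client
print(stored)  # 49a6fdd6-f305-4288-81f3-436eb498fc9d
```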
Cleaning Up
To clean up the demo environment:
```shell
glassflow down
```
or force deletion by adding the --force flag:
```shell
glassflow down --force
```