
Kafka Deduplication Demo

This demo walks you through a local installation using the GlassFlow CLI. The CLI brings up a Kind cluster and deploys GlassFlow ETL together with local Kafka and ClickHouse instances. You will create a pipeline via the UI and verify that duplicate events are removed before they reach ClickHouse.

Prerequisites

  • Docker (or compatible runtime like Docker Desktop, OrbStack, Colima, or Podman)
  • Helm: install via the Helm docs, or run: brew install helm
  • kubectl (installed automatically via Homebrew if you use the recommended install)

Install GlassFlow locally

Verify installation

glassflow version

Start GlassFlow with Kafka and ClickHouse

glassflow up --demo

This will:

  • Create a Kind cluster (if needed)
  • Download pre-built container images (first run only)
  • Install GlassFlow ETL (glassflow namespace), Kafka (kafka namespace), and ClickHouse (clickhouse namespace)
  • Wait for all services to be ready

For more options and details, see the Installation Guide.

Set up the pipeline via the UI

Create a new topic in Kafka

kubectl exec -n kafka svc/kafka -- bash -c 'cat > /tmp/client.properties << EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="user1" password="glassflow-demo-password";
EOF
kafka-topics.sh --bootstrap-server kafka.kafka.svc.cluster.local:9092 \
  --command-config /tmp/client.properties \
  --create --topic duplicated-events \
  --partitions 1 --replication-factor 1'

Send example data to the topic

Send one sample event to the duplicated-events topic so the pipeline will have data to ingest once you create it in the UI:

kubectl exec -n kafka svc/kafka -- bash -c 'cat > /tmp/events.json << "EOFEVENTS"
{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:00:00Z"}
EOFEVENTS
cat > /tmp/client.properties << EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="user1" password="glassflow-demo-password";
EOF
cat /tmp/events.json | kafka-console-producer.sh --bootstrap-server kafka.kafka.svc.cluster.local:9092 \
  --producer.config /tmp/client.properties \
  --topic duplicated-events'
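The command above deliberately mixes two heredoc styles: the events file uses a quoted delimiter ("EOFEVENTS") so the shell takes the JSON literally, while the properties file uses an unquoted EOF. A minimal local sketch of the difference (no cluster needed):

```shell
#!/usr/bin/env bash
# Quoted vs unquoted heredoc delimiters: quoting suppresses $-expansion.
VALUE="expanded by the shell"

# Unquoted delimiter: $VALUE is substituted before cat sees the text.
unquoted_out=$(cat << EOF
unquoted: $VALUE
EOF
)
echo "$unquoted_out"

# Quoted delimiter: the body is passed through literally, $VALUE survives.
quoted_out=$(cat << 'EOF'
quoted: $VALUE
EOF
)
echo "$quoted_out"
```

This is why the demo quotes the delimiter around the JSON payload: nothing inside it can be accidentally expanded by the shell.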

Create a new table in ClickHouse

kubectl exec -n clickhouse svc/clickhouse -- clickhouse-client \
  --user default \
  --password glassflow-demo-password \
  --query "CREATE TABLE IF NOT EXISTS deduplicated_events (event_id UUID, type String, source String, created_at DateTime) ENGINE = MergeTree ORDER BY event_id"

Configure pipeline in the UI

Once everything is running, open the GlassFlow UI in your browser (ports may vary if alternatives were chosen during installation) and use the connection details below to create a pipeline.

  • In this demo we will create a single-source pipeline.
  • Give the pipeline a name (for example, "Demo Pipeline").
  • The UI will automatically generate a pipeline ID.

Kafka Connection

Authentication Method: SASL/PLAIN
Security Protocol: SASL_PLAINTEXT
Bootstrap Servers: kafka.kafka.svc.cluster.local:9092
Username: user1
Password: glassflow-demo-password

Kafka Topic

Topic Name: duplicated-events
Consumer Group Initial Offset: latest
Schema:
{
  "event_id": "ddccabe2-c673-4d8a-affc-8647db00f7b5",
  "type": "page_view",
  "source": "web",
  "created_at": "2025-12-03T15:17:34.907877Z"
}

Deduplication

Enabled: true
Deduplicate Key: event_id
Deduplicate Key Type: string
Time Window: 1h

Skip the Filter and Transform steps.
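Conceptually, deduplication keeps the first event seen for each key within the time window and drops later ones with the same key. For the small batch used in this demo, that first-occurrence-wins behaviour can be sketched locally with awk (a simplification: the real pipeline also applies the 1h window and runs continuously on the stream):

```shell
#!/usr/bin/env bash
# Write the demo batch: two events share an event_id, one is unique.
cat > /tmp/demo-events.json << 'EOFEVENTS'
{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:00:00Z"}
{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:01:00Z"}
{"event_id": "f0ed455046a543459d9a51502cdc756d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:03:00Z"}
EOFEVENTS

# Splitting each line on double quotes, field 4 is the event_id value.
# '!seen[$4]++' prints a line only the first time its event_id is seen.
awk -F'"' '!seen[$4]++' /tmp/demo-events.json > /tmp/deduped.json
cat /tmp/deduped.json
```

With this batch, the second 49a6... event is dropped and two rows remain, which matches the result you will see in ClickHouse at the end of the demo.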

ClickHouse Connection

Host: clickhouse.clickhouse.svc.cluster.local
HTTP/S Port: 8123
Native Port: 9000
Username: default
Password: glassflow-demo-password
Use SSL: false

ClickHouse Table

Table: deduplicated_events

Wait for the deployment to finish; the UI redirects to the pipeline page once the pipeline is deployed.

Send data to Kafka

Run the following on your machine in a terminal:

# Send multiple JSON events to Kafka
kubectl exec -n kafka svc/kafka -- bash -c 'cat > /tmp/events.json << "EOFEVENTS"
{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:00:00Z"}
{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:01:00Z"}
{"event_id": "f0ed455046a543459d9a51502cdc756d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:03:00Z"}
EOFEVENTS
cat > /tmp/client.properties << EOF
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="user1" password="glassflow-demo-password";
EOF
cat /tmp/events.json | kafka-console-producer.sh --bootstrap-server kafka.kafka.svc.cluster.local:9092 \
  --producer.config /tmp/client.properties \
  --topic duplicated-events'
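Before checking ClickHouse, you can confirm what this batch actually contains. A quick local tally of event_id occurrences (recreating the same three one-line JSON events) shows that one id appears twice and should therefore produce a single row after deduplication:

```shell
#!/usr/bin/env bash
# Recreate the batch locally and count occurrences per event_id.
cat > /tmp/batch.json << 'EOFEVENTS'
{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:00:00Z"}
{"event_id": "49a6fdd6f305428881f3436eb498fc9d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:01:00Z"}
{"event_id": "f0ed455046a543459d9a51502cdc756d", "type": "page_view", "source": "web", "created_at": "2025-03-20T10:03:00Z"}
EOFEVENTS

# Extract the event_id values and tally duplicates (highest count first).
counts=$(grep -o '"event_id": "[^"]*"' /tmp/batch.json | sort | uniq -c | sort -rn)
echo "$counts"
```

The output lists the 49a6... id with a count of 2, so the pipeline should collapse those two events into one ClickHouse row.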

Verify Results

After a short delay (up to the sink's maximum delay time, which defaults to 1 minute), you should see the deduplicated events in ClickHouse:

kubectl exec -n clickhouse svc/clickhouse -- clickhouse-client \
  --user default \
  --password glassflow-demo-password \
  --format prettycompact \
  --query "SELECT * FROM deduplicated_events"

   ┌─event_id─────────────────────────────┬─type──────┬─source─┬──────────created_at─┐
1. │ 49a6fdd6-f305-4288-81f3-436eb498fc9d │ page_view │ web    │ 2025-03-20 10:00:00 │
2. │ f0ed4550-46a5-4345-9d9a-51502cdc756d │ page_view │ web    │ 2025-03-20 10:03:00 │
   └──────────────────────────────────────┴───────────┴────────┴─────────────────────┘

Congratulations! You've completed the demo.

What you achieved:

  • You created a GlassFlow pipeline via the UI running on your local Kubernetes cluster.
  • You sent events, including duplicates, to a Kafka topic.
  • GlassFlow consumed from the topic, deduplicated by event_id, and wrote the result to a ClickHouse table.
  • You verified the deduplicated data in the ClickHouse table.

Cleaning Up

To clean up the demo environment:

glassflow down

or force deletion by adding the --force flag:

glassflow down --force