Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
168 changes: 168 additions & 0 deletions analytics/kafka/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,168 @@
# Kafka on EKS Auto Mode (Strimzi)

A workshop-ready Apache Kafka cluster running on the Spark-on-EKS Auto Mode cluster, deployed via the [Strimzi](https://strimzi.io/) operator in **KRaft mode** (no ZooKeeper).

The shipped configuration runs **3 controllers + 3 brokers** with rack-aware placement across availability zones, JMX Prometheus metrics, Cruise Control, and the Kafka Exporter.

## Why this folder?

The workshop's Spark labs already use the EKS Auto Mode cluster created by `analytics/terraform/spark-k8s-operator/`. This folder adds an interactive Kafka lab on top of that same cluster — no additional infrastructure required. You install the operator, deploy the Kafka custom resources, and run a smoke test.

## Prerequisites

- The Spark-on-EKS workshop cluster is already running (you've completed the bootstrap that the workshop CFN does on your IDE).
- `kubectl` is configured against the cluster:
```sh
kubectl get nodes
```
- The default `gp3` StorageClass exists (the workshop's Terraform already creates it):
```sh
kubectl get storageclass
```
You should see `gp3 (default)` with provisioner `ebs.csi.eks.amazonaws.com`.

## Layout

```
analytics/kafka/
├── README.md # this file
├── install-strimzi.sh # installs the latest Strimzi operator
├── kafka-cluster.yaml # Kafka CR + KafkaNodePools + JMX ConfigMap
├── deploy-kafka.sh # applies the cluster and waits for Ready
├── cleanup.sh # tears everything down
└── examples/
├── hello-test-topic.yaml # 3-partition, 3-replica test topic
└── kafka-rebalance.yaml # Cruise Control rebalance demo
```

## Step 1: Install the Strimzi operator

The Strimzi Cluster Operator is the controller that watches for `Kafka`, `KafkaNodePool`, `KafkaTopic`, and other Strimzi CRs and provisions the corresponding workloads.

```sh
./install-strimzi.sh
```

This creates the `kafka` namespace and applies the latest Strimzi install bundle directly from `https://strimzi.io/install/latest?namespace=kafka`. The script waits for `deploy/strimzi-cluster-operator` to be Ready (typically under a minute).

> **Note:** the script uses `kubectl create` rather than `kubectl apply` because the Strimzi install bundle contains CRDs whose annotations exceed Kubernetes' size limit for `last-applied-configuration`.

## Step 2: Deploy the Kafka cluster

```sh
./deploy-kafka.sh
```

This applies `kafka-cluster.yaml` which contains:

- A `Kafka` resource named `cluster` (KRaft mode, Kafka 4.0.0, plain + TLS internal listeners, rack-aware)
- A `KafkaNodePool` named `controller` with **3 controller replicas**, 100 Gi gp3 each
- A `KafkaNodePool` named `broker` with **3 broker replicas**, 100 Gi gp3 each
- A `ConfigMap` with the JMX Prometheus exporter rules
- The `entityOperator` (for `KafkaTopic` / `KafkaUser` reconciliation), `cruiseControl`, and `kafkaExporter` all enabled

Each broker requests 6 CPU / 58 Gi memory (limit 8 CPU / 64 Gi). On Auto Mode, this typically pulls **6 × `r5a.4xlarge` instances** (one per broker/controller) over 3 availability zones thanks to the `rack` and `topologySpreadConstraints` settings. The first deploy takes around 3 minutes once Karpenter has provisioned nodes.

When complete you should see:

```console
NAME READY METADATA STATE WARNINGS
kafka.kafka.strimzi.io/cluster True
NAME DESIRED REPLICAS ROLES NODEIDS
kafkanodepool.kafka.strimzi.io/broker 3 ["broker"] [0,1,2]
kafkanodepool.kafka.strimzi.io/controller 3 ["controller"] [3,4,5]

cluster-broker-0 1/1 Running
cluster-broker-1 1/1 Running
cluster-broker-2 1/1 Running
cluster-controller-3 1/1 Running
cluster-controller-4 1/1 Running
cluster-controller-5 1/1 Running
cluster-cruise-control-* 1/1 Running
cluster-entity-operator-* 2/2 Running
cluster-kafka-exporter-* 1/1 Running
```

Bootstrap servers in the cluster:

- Plain: `cluster-kafka-bootstrap.kafka.svc:9092`
- TLS: `cluster-kafka-bootstrap.kafka.svc:9093`

## Step 3: Smoke test

Create a topic:

```sh
kubectl apply -f examples/hello-test-topic.yaml
kubectl get kafkatopic -n kafka
```

Produce 3 messages (the `quay.io/strimzi/kafka` image ships with `kafka-console-*` scripts):

```sh
kubectl run kafka-producer-test -n kafka --rm -i --restart=Never \
--image=quay.io/strimzi/kafka:latest-kafka-4.0.0 \
-- bash -c 'echo -e "msg-1\nmsg-2\nmsg-3" | /opt/kafka/bin/kafka-console-producer.sh \
--bootstrap-server cluster-kafka-bootstrap:9092 --topic hello-test'
```

Consume them back:

```sh
kubectl run kafka-consumer-test -n kafka --rm -i --restart=Never \
--image=quay.io/strimzi/kafka:latest-kafka-4.0.0 \
-- bash -c '/opt/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server cluster-kafka-bootstrap:9092 --topic hello-test \
--from-beginning --timeout-ms 10000'
```

Expected output:

```console
msg-1
msg-2
msg-3
```

## Optional: Cruise Control rebalance

The `Kafka` CR enables Cruise Control, which can rebalance partitions across brokers. To trigger an analysis:

```sh
kubectl apply -f examples/kafka-rebalance.yaml
kubectl get kafkarebalance rebalance -n kafka -w
```

The CR transitions through `PendingProposal` → `ProposalReady`. When `ProposalReady`, you can approve it:

```sh
kubectl annotate kafkarebalance rebalance -n kafka \
strimzi.io/rebalance=approve --overwrite
```

It then transitions to `Rebalancing` → `Ready`.

## Step 4: Tear down

When you're done with the lab:

```sh
./cleanup.sh
```

This removes all Kafka topics/users, the Kafka CR, the KafkaNodePools, the broker/controller PVCs (broker data is deleted), the Strimzi operator, and the `kafka` namespace.

## What you learned

- How to install the Strimzi operator on an existing EKS Auto Mode cluster.
- How to declare a KRaft-mode Kafka cluster with `KafkaNodePool` (the post-StatefulSet abstraction Strimzi uses for elastic broker pools).
- How rack awareness + topology spread constraints + pod anti-affinity combine to spread brokers across AZs.
- How to produce/consume from inside the cluster.
- How Cruise Control fits into a Strimzi-managed Kafka cluster.

## Customizing

- **Storage size**: change `spec.storage.volumes[0].size` in each `KafkaNodePool` (default 100 Gi).
- **Broker count**: change `spec.replicas` on the `broker` and `controller` node pools (defaults 3 + 3). Keep controller replicas odd to maintain KRaft quorum.
- **Resource request/limits**: edit `spec.kafka.resources` on the `Kafka` CR. Smaller dev sizes (e.g. 2 CPU / 8 Gi) work fine for hello-world testing and let Karpenter consolidate onto smaller instances.
- **Storage class**: the manifest uses `class: gp3`. To experiment with NVMe-backed local storage, look at the [EC2 Instance Store CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/lis-csi.html) — but note it does **not** work with Auto Mode today.
32 changes: 32 additions & 0 deletions analytics/kafka/cleanup.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
set -euo pipefail

NAMESPACE=kafka

echo "Deleting KafkaTopics, KafkaUsers, the Kafka CR, and KafkaNodePools..."
kubectl delete kafkatopic --all -n "${NAMESPACE}" --ignore-not-found
kubectl delete kafkauser --all -n "${NAMESPACE}" --ignore-not-found
kubectl delete kafka cluster -n "${NAMESPACE}" --ignore-not-found
kubectl delete kafkanodepool --all -n "${NAMESPACE}" --ignore-not-found

echo "Deleting kafka-metrics ConfigMap..."
kubectl delete configmap kafka-metrics -n "${NAMESPACE}" --ignore-not-found

echo ""
echo "Waiting up to 3 minutes for broker/controller pods to terminate..."
kubectl wait --for=delete pod -l strimzi.io/cluster=cluster -n "${NAMESPACE}" --timeout=180s || true

echo ""
echo "Deleting persistent volume claims (broker + controller data)..."
kubectl delete pvc -l strimzi.io/cluster=cluster -n "${NAMESPACE}" --ignore-not-found

echo ""
echo "Uninstalling Strimzi operator..."
kubectl delete -f "https://strimzi.io/install/latest?namespace=${NAMESPACE}" -n "${NAMESPACE}" --ignore-not-found

echo ""
echo "Deleting namespace ${NAMESPACE}..."
kubectl delete namespace "${NAMESPACE}" --ignore-not-found

echo ""
echo "Cleanup complete."
20 changes: 20 additions & 0 deletions analytics/kafka/deploy-kafka.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
NAMESPACE=kafka

echo "Applying Kafka cluster + KafkaNodePools (controller, broker)..."
kubectl apply -f "${SCRIPT_DIR}/kafka-cluster.yaml"

echo ""
echo "Waiting for the Kafka cluster to become Ready (3-5 minutes typical)..."
kubectl wait --for=condition=Ready kafka/cluster -n "${NAMESPACE}" --timeout=600s

echo ""
kubectl get kafka,kafkanodepool -n "${NAMESPACE}"
echo ""
kubectl get pods -n "${NAMESPACE}"
echo ""
echo "Bootstrap servers:"
kubectl get kafka cluster -n "${NAMESPACE}" -o jsonpath='{range .status.listeners[*]} {.name}: {.bootstrapServers}{"\n"}{end}'
12 changes: 12 additions & 0 deletions analytics/kafka/examples/hello-test-topic.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
name: hello-test
namespace: kafka
labels:
strimzi.io/cluster: cluster
spec:
partitions: 3
replicas: 3
config:
retention.ms: 3600000
25 changes: 25 additions & 0 deletions analytics/kafka/examples/kafka-rebalance.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
name: rebalance
namespace: kafka
labels:
strimzi.io/cluster: cluster
spec:
goals:
- RackAwareGoal
- ReplicaCapacityGoal
- DiskCapacityGoal
- NetworkInboundCapacityGoal
- NetworkOutboundCapacityGoal
- CpuCapacityGoal
- ReplicaDistributionGoal
- PotentialNwOutGoal
- DiskUsageDistributionGoal
- NetworkInboundUsageDistributionGoal
- NetworkOutboundUsageDistributionGoal
- CpuUsageDistributionGoal
- TopicReplicaDistributionGoal
- LeaderReplicaDistributionGoal
- LeaderBytesInDistributionGoal
- PreferredLeaderElectionGoal
24 changes: 24 additions & 0 deletions analytics/kafka/install-strimzi.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
#!/usr/bin/env bash
set -euo pipefail

# Install latest Strimzi Cluster Operator into the kafka namespace
# Reference: https://strimzi.io/quickstarts/

NAMESPACE=kafka

echo "Creating namespace: ${NAMESPACE}"
kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -

echo "Installing latest Strimzi operator into ${NAMESPACE}..."
# 'kubectl create' is intentional: the install bundle contains CRDs that
# can fail with 'kubectl apply' due to Kubernetes annotation size limits.
kubectl create -f "https://strimzi.io/install/latest?namespace=${NAMESPACE}" -n "${NAMESPACE}"

echo ""
echo "Waiting for the Strimzi cluster operator to become Ready..."
kubectl -n "${NAMESPACE}" rollout status deploy/strimzi-cluster-operator --timeout=180s

echo ""
kubectl get pods -n "${NAMESPACE}"
echo ""
echo "Strimzi operator is ready. Next: run ./deploy-kafka.sh"
Loading