Kafka Partitions & Consumer Groups
Kafka topics are divided into partitions — ordered, immutable append-only logs. Consumer groups allow multiple consumers to read from a topic in parallel, with each partition assigned to one consumer in the group.
Kafka = a multi-lane highway with a logbook. Each lane (partition) is one-way and in-order. Consumer groups = teams of toll booth operators: every team covers all lanes, but within a team each lane is handled by exactly one operator. More lanes = more potential throughput.
Partitions are the unit of parallelism: N partitions can be processed by up to N consumers in a group simultaneously. Messages within a partition are totally ordered; across partitions, ordering is not guaranteed. Producers assign messages to partitions by hashing the key (murmur2 by default, mod partition count) or, for keyless messages, round-robin/sticky batching. Consumer groups track progress via committed offsets stored in the internal __consumer_offsets topic. Rebalancing occurs when consumers join or leave a group, and partitions are reassigned.
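The key-to-partition mapping above can be sketched in a few lines. This is an illustration, not the real client code: `hashlib.md5` stands in for the murmur2 hash that actual Kafka producers use, but the modulo step and the per-key ordering guarantee are the same idea.

```python
# Sketch of keyed partition assignment (assumption: md5 as a stand-in
# for the client's murmur2 hash; real clients hash the serialized key).
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Deterministically map a key to a partition, like a keyed producer."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always lands on the same partition, so all events for
# one key stay totally ordered within that partition's log.
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
```

Note what this implies for resizing: changing the partition count changes the modulo, so existing keys remap to different partitions and per-key ordering across the resize boundary is lost.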
Partition count caps consumer parallelism: a group cannot have more active consumers than partitions. Plan partition count carefully; decreasing it requires recreating the topic. Kafka retains messages by retention.ms (time) or retention.bytes (size) regardless of consumption, which enables replay. Replication is leader/follower: each partition has one leader broker and N-1 followers, and producers write to (and consumers, by default, read from) the leader. The ISR (In-Sync Replica) set is the leader plus the followers within a configurable lag. acks=all on the producer combined with min.insync.replicas=2 means a write is acknowledged only once it reaches the leader and at least one follower, preventing data loss on leader failure. Incremental cooperative rebalancing (Kafka 2.4+) softens the stop-the-world effect of a rebalance by moving only the partitions that must change owners instead of revoking all assignments.
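The durability and rebalancing settings above map to concrete configs. A minimal sketch (config names are standard Kafka settings; the values are illustrative, not recommendations for every workload):

```properties
# Topic config (assumes the topic was created with --replication-factor 3)
min.insync.replicas=2        # acks=all writes need the leader + 1 follower
retention.ms=604800000       # 7 days, retained regardless of consumption

# Producer config
acks=all                     # acknowledge only after the full ISR has the write

# Consumer config (Kafka 2.4+): incremental cooperative rebalancing
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```

With replication factor 3 and min.insync.replicas=2, the topic tolerates one broker failure without losing acknowledged writes or availability; losing two ISR members makes acks=all writes fail rather than silently risk data loss.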
I'd design partition count based on expected throughput and consumer parallelism. A rule of thumb: size for 2x expected peak throughput to allow room to grow without adding partitions. I'd use a partition key that distributes load evenly while keeping related events together (e.g., user_id for user event streams). I'd set acks=all and min.insync.replicas=2 for any topic carrying business-critical data. For consumer lag monitoring, I'd track the lag per partition and alert when lag exceeds a threshold that corresponds to my maximum acceptable processing delay.
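The lag check is simple arithmetic once the offsets are in hand. A sketch, assuming the offsets were already fetched elsewhere (e.g. via an admin client); the function and variable names are illustrative:

```python
# Per-partition consumer lag = latest offset on the broker (log end offset)
# minus the group's committed offset for that partition.
def partition_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Return {partition: lag}; an uncommitted partition counts from 0."""
    return {
        p: end - committed_offsets.get(p, 0)
        for p, end in log_end_offsets.items()
    }

lags = partition_lag({0: 1_000, 1: 1_200}, {0: 950, 1: 1_200})
assert lags == {0: 50, 1: 0}

# Alert when any partition's lag exceeds the threshold corresponding to
# the maximum acceptable processing delay (value here is hypothetical).
MAX_ACCEPTABLE_LAG = 10_000
hot = [p for p, lag in lags.items() if lag > MAX_ACCEPTABLE_LAG]
```

Tracking lag per partition, not just summed per topic, is what surfaces a single stuck or hot partition that a topic-level average would hide.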
Using a poor partition key that creates hot partitions. If you partition by event type and one type has 100x the volume of the others, one partition absorbs nearly all the load while the rest sit idle. Always analyze message volume distribution across candidate key values before choosing a partition key.
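That distribution analysis can be a quick offline check on a sample of keys. A sketch, with `hashlib.md5` again standing in for the client's murmur2 hash (the skew conclusion doesn't depend on which hash is used, only on the key distribution):

```python
# Measure partition skew for a candidate key on a sample of messages.
import hashlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    # md5 as a stand-in for the producer's murmur2 key hash.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def skew_ratio(keys: list, num_partitions: int) -> float:
    """Hottest partition's message count divided by the per-partition mean.
    A ratio near 1.0 is balanced; a large ratio flags a hot partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    mean = len(keys) / num_partitions
    return max(counts.values()) / mean

# Partitioning by event type when one type dominates: heavy skew.
sample = ["click"] * 100 + ["view", "purchase", "refund"]
assert skew_ratio(sample, 6) > 5
```

A high-cardinality, evenly distributed key (such as user_id) drives this ratio toward 1.0; a low-cardinality or volume-skewed key (such as event type here) does not.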