Apache Kafka is a distributed streaming platform that allows applications to publish and subscribe to streams of records. Kafka runs as a cluster of one or more servers. A stream of records is called a ‘topic’. A ‘Producer’ publishes messages to a topic, and a ‘Consumer’ subscribes to a topic and consumes the messages published to it.
The above diagram shows that when messages are published to a Kafka topic, they are appended to the tail of the log. Each message is identified by its ‘offset’: its position in the log. All consumers retrieve messages from the same log, and consuming a message does not destroy it; messages remain in the log until the configured retention period expires. The retention period can be defined by time or by log size, and consumers use the offset to track their position in the log.
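The append-only log behaviour described above can be sketched with a minimal in-memory model (plain Python, not the Kafka API; the `CommitLog` name and its methods are illustrative, not real Kafka calls):

```python
class CommitLog:
    """Minimal sketch of a Kafka-style append-only log for one topic."""

    def __init__(self):
        self._records = []  # records are appended to the tail, never modified

    def publish(self, message):
        """Append a message; its offset is its position in the log."""
        self._records.append(message)
        return len(self._records) - 1  # offset of the new message

    def read_from(self, offset):
        """Return messages at and after `offset`; reading destroys nothing."""
        return self._records[offset:]

log = CommitLog()
log.publish("a")  # offset 0
log.publish("b")  # offset 1
log.publish("c")  # offset 2

# Two consumers reading from the same offset see the same messages,
# because consuming does not remove records from the log:
print(log.read_from(1))  # ['b', 'c']
print(log.read_from(1))  # ['b', 'c'] again
```

A real broker additionally expires records past the retention period; this sketch keeps everything to focus on the offset semantics.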
When a consumer starts, it must tell Kafka where in the commit log it wishes to start receiving messages: either the earliest available message or the latest. A consumer that disconnects and later re-connects must remember the offset of the last message it consumed if it wishes to see all messages published while it was disconnected.
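The three starting positions can be illustrated with a small sketch (plain Python over a list standing in for the log; `start_offset` is a hypothetical helper, not part of any Kafka client):

```python
# The log as a list; a message's offset is its index.
log = ["m0", "m1", "m2", "m3", "m4"]

def start_offset(policy, committed=None):
    """Where a consumer begins: a stored offset, 'earliest', or 'latest'."""
    if committed is not None:
        return committed + 1  # resume just after the last message consumed
    if policy == "earliest":
        return 0              # replay everything still retained
    return len(log)           # 'latest': only messages published from now on

print(log[start_offset("earliest"):])             # ['m0', 'm1', 'm2', 'm3', 'm4']
print(log[start_offset("latest"):])               # [] until new messages arrive
print(log[start_offset("latest", committed=2):])  # ['m3', 'm4'] - catches up
```

The third call shows why a re-connecting consumer must remember its last offset: resuming from `committed + 1` delivers exactly the messages published while it was away.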
To balance load and achieve scalability, Kafka allows topics to be divided into ‘partitions’. Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log.
By defining a topic to have multiple partitions, messages published to the topic are distributed across the partitions (typically by a hash of the message key, or round-robin when no key is supplied), and the consumers in a consumer group are likewise spread across the partitions. Kafka manages both the distribution of messages and the assignment of consumers to partitions.
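Both mechanisms can be sketched in a few lines. This is a simplification, not Kafka's actual implementation: `crc32` here stands in for the murmur2 hash Kafka's default partitioner uses, and `assign` is a plain round-robin stand-in for Kafka's group rebalancing protocol.

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key):
    """Pick a partition from the message key (crc32 as a stand-in
    for Kafka's murmur2-based default partitioner)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def assign(partitions, consumers):
    """Spread partitions round-robin across the consumers in a group
    (a simplification of Kafka's rebalancing)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Messages with the same key always land in the same partition,
# so ordering is preserved per key:
print(partition_for("order-42") == partition_for("order-42"))  # True

# Two consumers in one group share three partitions:
print(assign([0, 1, 2], ["c1", "c2"]))  # {'c1': [0, 2], 'c2': [1]}
```

Note that ordering is guaranteed only within a partition, which is why keyed messages are hashed: all messages for one key stay in order on one partition.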
For more information, see the Apache Kafka documentation.