Apache Kafka order of messages with multiple partitions

Apache Kafka Problem Overview


According to the Apache Kafka documentation, message ordering is guaranteed only within a partition (or for a topic that has a single partition). If that is so, what parallelism benefit do we actually get? Isn't it then equivalent to a traditional MQ?

Apache Kafka Solutions


Solution 1 - Apache Kafka

In Kafka, the maximum parallelism for consuming a topic within one consumer group is equal to the number of partitions of that topic.

For example, assume that your messages are partitioned by user_id, and consider 4 messages with user_ids 1, 2, 3 and 4. Assume that you have a "users" topic with 4 partitions.

Since partitioning is based on user_id, assume that the message with user_id 1 goes to partition 1, the message with user_id 2 goes to partition 2, and so on. (Kafka numbers partitions from 0; the labels 1 to 4 are used here only to match the user_ids.)
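To make the key-to-partition step concrete, here is a minimal sketch of a key-based partitioner. It is not Kafka's actual code: as far as I know the Java client's default partitioner hashes the serialized key with murmur2, while this sketch uses CRC32 purely as a deterministic stand-in from the standard library. The function name `choose_partition` is made up for illustration.

```python
import zlib

NUM_PARTITIONS = 4  # the "users" topic from the example above


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition index in [0, num_partitions).

    Assumption: CRC32 stands in for the real client's hash function.
    The property that matters is that the same key always yields the
    same partition as long as num_partitions does not change.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# The same user_id is always routed to the same partition.
for user_id in ["1", "2", "3", "4", "1", "2"]:
    print(f"user_id={user_id} -> partition {choose_partition(user_id)}")
```

Note that with a hash, four distinct user_ids are not guaranteed to land on four distinct partitions; the neat one-to-one mapping in the example is a simplification.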

Also assume that you have 4 consumers for the topic, all in the same consumer group. Kafka then assigns one partition to each consumer, so as soon as the 4 messages are published, they can be consumed in parallel, one per consumer.

If you had 2 consumers for the topic instead of 4, each consumer would handle 2 partitions and the consumption throughput would be roughly half.
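The assignment of partitions to consumers can be sketched as follows. This is only an illustration of the idea using a simple round-robin spread; Kafka's real assignors are configurable and more involved, and `assign_partitions` is a hypothetical helper, not a Kafka API.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Returns a dict mapping each consumer name to its list of partitions.
    Each partition is owned by exactly one consumer of the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


four = assign_partitions(range(4), ["c1", "c2", "c3", "c4"])
two = assign_partitions(range(4), ["c1", "c2"])
five = assign_partitions(range(4), ["c1", "c2", "c3", "c4", "c5"])

print(four)  # one partition per consumer
print(two)   # two partitions per consumer
print(five)  # the fifth consumer gets nothing
```

The last case shows why the partition count caps the parallelism: a consumer beyond the number of partitions has nothing to read and sits idle.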

To completely answer your question, Kafka only provides a total order over messages within a partition, not between different partitions in a topic.

That is, if consumption is very slow on partition 2 and very fast on partition 4, then the message with user_id 4 will be consumed before the message with user_id 2. This is how Kafka is designed.
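A small deterministic simulation of that situation, assuming made-up processing times for the consumers that own partitions 2 and 4. Nothing here talks to Kafka; the message names and timings are purely illustrative.

```python
# partition -> messages in the order they were appended to that partition
partitions = {
    2: ["user2-a", "user2-b"],
    4: ["user4-a", "user4-b"],
}

# Hypothetical seconds each owning consumer needs per message:
# partition 2 is slow, partition 4 is fast.
seconds_per_message = {2: 5.0, 4: 1.0}


def consumption_order(partitions, seconds_per_message):
    """Return messages in the order their processing finishes."""
    events = []
    for partition, messages in partitions.items():
        for index, message in enumerate(messages):
            # Each consumer works through its partition strictly in order.
            finished_at = (index + 1) * seconds_per_message[partition]
            events.append((finished_at, partition, message))
    return [message for _, _, message in sorted(events)]


order = consumption_order(partitions, seconds_per_message)
print(order)  # ['user4-a', 'user4-b', 'user2-a', 'user2-b']
```

Within each partition the relative order is preserved (a before b), but across partitions the fast one overtakes the slow one.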

Solution 2 - Apache Kafka

I decided to move my comment to a separate answer as I think it makes sense to do so.

While John is 100% right about what he wrote, you may consider rethinking your problem. Do you really need ALL messages to stay in order? Or do you need all messages for a specific user_id (or whatever) to stay in order?

If the first, then there's not much you can do: you should use 1 partition and lose all the parallelism.

But in the second case, you might consider partitioning your messages by some key. All messages for that key will then arrive at one partition (they might actually go to a different partition if you resize the topic, but that's a different case), which guarantees that all messages for that key are in order.
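Here is a rough sketch of both points: per-key ordering, and what happens on a resize. The partitioner is deliberately simplified to `user_id % num_partitions` so the result is easy to follow by hand; a real client hashes the serialized key instead. `partition_for` and `produce` are hypothetical names.

```python
from collections import defaultdict


def partition_for(user_id: int, num_partitions: int) -> int:
    # Simplified stand-in for a key-based partitioner.
    return user_id % num_partitions


def produce(events, num_partitions):
    """Append (user_id, payload) events to per-partition logs in send order."""
    logs = defaultdict(list)
    for user_id, payload in events:
        logs[partition_for(user_id, num_partitions)].append((user_id, payload))
    return logs


events = [
    (1, "login"), (2, "login"), (1, "add-to-cart"), (2, "logout"), (1, "checkout"),
]
logs = produce(events, num_partitions=4)

# All events of user 1 sit in one partition, in the order they were sent.
user1_events = [payload for user_id, payload in logs[1] if user_id == 1]
print(user1_events)  # ['login', 'add-to-cart', 'checkout']

# After growing the topic from 4 to 6 partitions, some keys map elsewhere.
print(partition_for(5, 4), partition_for(5, 6))  # 1 5
```

The last line is the resize caveat: once the partition count changes, new messages for a key may go to a different partition than its older messages, so per-key ordering across the resize is no longer guaranteed.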

Solution 3 - Apache Kafka

In Kafka, messages with the same key, from the same producer, are delivered to the consumer in order.

On top of that, data within a partition is stored in the order in which it is written; therefore, data read from a partition is read in order for that partition.

So if you want your messages in order on a topic with multiple partitions, you need to group your messages by a key, so that messages with the same key go to the same partition, and within that partition the messages are ordered.

In a nutshell, you need to design a logical two-level solution like the above (key to partition, then order within the partition) to get ordered messages on a multi-partition topic.

Solution 4 - Apache Kafka

You may consider adding a field that holds the timestamp/date at which the record was created at the source.

Once the data is consumed, you can load it into a database. The data needs to be sorted at the database level before the dataset is used for any use case. This is just an attempt to help you think about the problem in more than one way.

Let's say the message key is the timestamp generated when the data was created, and the value is the actual message string.

Whenever a message is picked up by a consumer, it is written into HBase with the Kafka key as the RowKey and the Kafka value as the value.

Since HBase is a sorted map, having the timestamp as the key automatically keeps the data in order. You can then serve the data from HBase to the downstream apps.

This way you are not losing the parallelism of Kafka. You can also sort the data and apply multiple processing steps to it at the database level.
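A minimal in-memory sketch of the idea, with a tiny `SortedStore` class standing in for HBase (it is not an HBase client, and all names here are illustrative). One detail worth calling out: HBase orders row keys lexicographically as bytes, so a numeric timestamp should be encoded with a fixed width, for example zero-padded, or "9" would sort after "10".

```python
import bisect


class SortedStore:
    """Tiny in-memory stand-in for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, row_key: str, value: str) -> None:
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value  # a repeated key overwrites the old value

    def scan(self):
        return [(key, self._rows[key]) for key in self._keys]


def row_key(timestamp_ms: int) -> str:
    # Zero-pad so that lexicographic order matches numeric order.
    return f"{timestamp_ms:020d}"


store = SortedStore()
# Messages arrive out of order because they came from different partitions.
arrivals = [
    (1700000000300, "third"),
    (1700000000100, "first"),
    (1700000000200, "second"),
]
for timestamp_ms, message in arrivals:
    store.put(row_key(timestamp_ms), message)

values = [value for _, value in store.scan()]
print(values)  # ['first', 'second', 'third']
```

Two records with the same timestamp would collide on the same row key in this sketch, so in practice you would probably append something unique (a producer id or sequence number) to the key.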

Note: A distributed message broker generally does not guarantee overall ordering across partitions. If you insist on that, you may need to consider a different message broker, or use a single partition in Kafka, which is not a good idea. Kafka gets its parallelism from adding partitions and adding consumers to a consumer group.

Solution 5 - Apache Kafka

A traditional MQ works in such a way that once a message has been processed, it is removed from the queue. A message queue allows a group of subscribers to pull a message, or a batch of messages, from the end of the queue. Queues usually offer some level of transaction when pulling a message off, so that the desired action is confirmed to have executed before the message is removed.

With Kafka on the other hand, you publish messages/events to topics, and they get persisted. They don’t get removed when consumers receive them. This allows you to replay messages, but more importantly, it allows a multitude of consumers to process logic based on the same messages/events.

You can still scale out to get parallel processing in the same domain, but more importantly, you can add different types of consumers that execute different logic based on the same event. In other words, with Kafka, you can adopt a reactive pub/sub architecture. ref: https://hackernoon.com/a-super-quick-comparison-between-kafka-and-message-queues-e69742d855a8
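The difference can be sketched with a toy append-only log in which each consumer group keeps its own read position. This is only a conceptual model: in real Kafka, offsets are tracked per partition and per group, and retention settings eventually expire old data. `TopicLog` and its methods are invented for this example.

```python
class TopicLog:
    """Append-only log; reading never removes anything."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer group name -> next offset to read

    def publish(self, message: str) -> None:
        self._messages.append(message)

    def poll(self, group: str, max_messages: int = 10):
        start = self._offsets.get(group, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int) -> None:
        self._offsets[group] = offset


log = TopicLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.publish(event)

billing = log.poll("billing")      # one type of consumer
analytics = log.poll("analytics")  # another type sees the same events
nothing_new = log.poll("analytics")

log.seek("billing", 0)             # rewind and replay
replayed = log.poll("billing")

print(billing == analytics == replayed)  # True
```

Each group reads at its own pace and none of them consumes the messages away from the others, which is what makes the replay and the multiple-consumer-types scenarios possible.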

Solution 6 - Apache Kafka

Well, this is an old thread, but still relevant, hence decided to share my view.

I think this question is a bit confusing.

If you need strict ordering of messages, then the same strict ordering should be maintained while consuming the messages. There is absolutely no point in ordering messages in the queue but not while consuming them. Kafka allows the best of both worlds: it preserves the order of messages within a partition from production through consumption, while allowing parallelism across multiple partitions. Hence, if you need:

1. Absolute ordering of all events published on a topic: use a single partition. You will not have parallelism, nor do you need it (again, parallelism and strict ordering don't go together).

2. Relative ordering only among related messages: go for multiple partitions and consumers, and use consistent hashing to ensure that all messages that need to keep their relative order go to a single partition.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Rajan R.G | View Question on Stackoverflow
Solution 1 - Apache Kafka | Vishal John | View Answer on Stackoverflow
Solution 2 - Apache Kafka | serejja | View Answer on Stackoverflow
Solution 3 - Apache Kafka | Dean Jain | View Answer on Stackoverflow
Solution 4 - Apache Kafka | wandermonk | View Answer on Stackoverflow
Solution 5 - Apache Kafka | Yash | View Answer on Stackoverflow
Solution 6 - Apache Kafka | Saket | View Answer on Stackoverflow