scala - How does maxOffsetsPerTrigger work in Spark Structured Streaming - Stack Overflow

Date: 2025-01-06

I am a bit confused about the behavior of the maxOffsetsPerTrigger parameter. Does it apply to a global limit, or is it per partition?

For example, if my maxOffsetsPerTrigger is set to 1 and I have subscribed to a topic with 5 partitions, my expectation is that it will read a total of 1 message in each micro-batch. However, from the checkpoint location, I observe that it is reading 1 message from each partition, resulting in a total of 5 messages in each batch.

Is this the expected behavior?

Spark version: 3.5.1



2 Answers

| Scenario | Behavior |
| --- | --- |
| 1 topic, 5 partitions, maxOffsetsPerTrigger=1 | Spark reads 1 message from each partition and processes 5 messages in total per micro-batch. This is likely to ensure that all partitions contribute data even when maxOffsetsPerTrigger is very low, preventing starvation of some partitions (this is my assumption; I did not find it documented in the Spark docs). |
| 1 topic, 5 partitions, maxOffsetsPerTrigger=10 | Spark divides the limit proportionally among partitions and processes 10 messages in total per micro-batch. |
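The behavior in both rows can be modeled with a small sketch. This is not Spark's actual implementation (the real logic lives in the Kafka source's rate-limiting code); it is just a simplified model of proportional splitting with a one-offset floor for non-empty partitions:

```scala
// Simplified sketch (NOT Spark's exact code) of prorating a global
// maxOffsetsPerTrigger across Kafka partitions: each partition gets a
// share proportional to its backlog, rounded up to at least 1 offset
// for any partition that has pending data.
object RateLimitSketch {
  def prorate(limit: Long, backlog: Map[String, Long]): Map[String, Long] = {
    val total = backlog.values.sum.toDouble
    backlog.map { case (partition, lag) =>
      val share = limit * (lag / total)
      // Tiny shares are rounded up so no non-empty partition is starved.
      val granted = if (share < 1) math.ceil(share).toLong else math.floor(share).toLong
      partition -> math.min(granted, lag)
    }
  }
}
```

With five partitions of equal backlog, `prorate(1, ...)` grants one offset to every partition (5 messages total), while `prorate(10, ...)` grants two offsets each (10 total), matching the table above.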

The official doc on Spark Kafka Integration is poorly worded.

Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.

Your observation is expected behavior. maxOffsetsPerTrigger is a global cap that Spark prorates across the topic's partitions, but every partition with pending data is granted at least one offset per batch. When the limit is smaller than the number of non-empty partitions, the batch therefore exceeds the configured limit, which is why you see 5 messages with a limit of 1 and 5 partitions.
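For reference, here is a minimal sketch of a streaming query that sets this option. The bootstrap servers, topic name, and checkpoint path are placeholders, not values from the question:

```scala
// Minimal sketch: reading a 5-partition Kafka topic with
// maxOffsetsPerTrigger set. Connection details are placeholders.
import org.apache.spark.sql.SparkSession

object MaxOffsetsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("max-offsets-demo").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "my-topic")                     // 5-partition topic
      .option("maxOffsetsPerTrigger", "1")                 // global cap, prorated per partition
      .load()

    // With 5 non-empty partitions, each micro-batch still contains 5 rows,
    // because every partition with pending data receives at least 1 offset.
    df.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/chk")            // placeholder
      .start()
      .awaitTermination()
  }
}
```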
