scala - How does maxOffsetsPerTrigger work in Spark Structured Streaming - Stack Overflow
I am a bit confused about the behavior of the maxOffsetsPerTrigger parameter. Is it a global limit, or does it apply per partition?

For example, if maxOffsetsPerTrigger is set to 1 and I have subscribed to a topic with 5 partitions, my expectation is that Spark will read a total of 1 message in each micro-batch. However, from the checkpoint location I can see that it reads 1 message from each partition, for a total of 5 messages per batch.

Is this the expected behavior?

Spark version: 3.5.1
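For context, this is roughly how the stream is set up; the broker address, topic name, and checkpoint path below are placeholders rather than my real configuration:

```scala
// Minimal sketch of the query in question, assuming a local broker at
// localhost:9092 and a hypothetical topic "events" with 5 partitions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object MaxOffsetsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("maxOffsetsPerTriggerDemo")
      .master("local[*]")
      .getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 1) // the limit whose behavior is in question
      .load()

    val query = df.selectExpr("CAST(value AS STRING)")
      .writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .option("checkpointLocation", "/tmp/maxoffsets-demo-checkpoint")
      .start()

    query.awaitTermination()
  }
}
```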
2 Answers

| scenario | behavior |
|---|---|
| 1 topic, 5 partitions, maxOffsetsPerTrigger=1 | Spark reads from each partition and processes a total of 5 messages in each micro-batch. This may be to ensure that all partitions contribute data even when maxOffsetsPerTrigger is very low, preventing starvation of some partitions (this is my assumption, as I did not find anything documented in the Spark docs). |
| 1 topic, 5 partitions, maxOffsetsPerTrigger=10 | Spark divides the limit proportionally among partitions and processes 10 messages overall in each micro-batch. |
The official documentation on the Spark Kafka integration is poorly worded:

> Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.
Your observation is correct. Although the documentation describes the option as a total that is split proportionally, when the limit is smaller than the number of partitions each partition still contributes at least one message per micro-batch, so in effect the parameter behaves per partition of the Kafka topic.
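To make the table concrete, here is a rough sketch (my own illustration, not the actual Spark source) of a proportional split that keeps a floor of one offset per partition; it reproduces both rows above:

```scala
// Rough illustration (not the actual Spark implementation) of splitting a total
// limit proportionally across partitions while giving each partition at least
// one offset, which matches the behavior observed above.
object RateLimitSketch {
  // available: records available per partition; limit: maxOffsetsPerTrigger
  def split(available: Map[String, Long], limit: Long): Map[String, Long] = {
    val total = available.values.sum.toDouble
    available.map { case (partition, size) =>
      val prorated = limit * (size / total)
      // floor of one record for every partition that has data
      partition -> math.max(prorated.toLong, 1L)
    }
  }

  def main(args: Array[String]): Unit = {
    val available = Map(
      "events-0" -> 100L, "events-1" -> 100L, "events-2" -> 100L,
      "events-3" -> 100L, "events-4" -> 100L)

    println(split(available, limit = 1))  // 1 per partition => 5 messages total
    println(split(available, limit = 10)) // 2 per partition => 10 messages total
  }
}
```

With limit = 1 the proportional share per partition rounds down to 0 but is floored at 1, giving 5 messages in total, while with limit = 10 the limit divides evenly into 2 offsets per partition, giving 10 messages in total.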