Pentaho

 Kafka producer reconnect

Imre Sandor posted 10-26-2022 08:39
Hi,

We have a Pentaho integration in place that streams data to a data warehouse system through Kafka. The connection is configured with the bootstrap servers and the Kafka topic. As bootstrap servers we specify all Kafka brokers: broker1:9091,broker2:9092,broker3:9092
Pentaho connects to the Kafka cluster as expected and can send data. However, if for any reason the lead broker for that topic dies and restarts, sending does not resume, even though Kafka has elected a new leader for the topic.
Pentaho retries sending for about a minute, but since the original lead broker is no longer the leader for the topic, the send fails and Pentaho never tries to reconnect.
Questions:
How can we make Pentaho attempt a reconnect after a failed send operation?
Is there a way to lengthen the retry period after a failed send?
Why do we experience massive message loss during such an incident? (The last time we experimented with this, the status lines indicated ~27000 messages sent, while we only had ~23000 messages in the Kafka topic.)

By the way, we are using build 9.3.0.0-428.

Thanks in advance,
Imre
Sandeep Chinaga Kemparaju
Hi Imre,

Can you attach the transformation and logs so we can review the errors? I see you have 3 brokers; ideally, when the lead broker comes back up, Pentaho should automatically connect to it. I would like to see the error that is logged when it tries to connect again.
Imre Sandor
Hi,

Sorry, I couldn't get back earlier.
I have attached the relevant log fragments.
There were two attempts to push messages. During the transfer we killed the lead broker for the topic. The Kafka cluster quickly moved leadership to a different broker, and cruisecontrol restarted the missing broker soon after (within a couple of seconds), but the Kafka producer was not able to resume the push operation. It failed, leaving the inconsistent message counts described above.

Imre
Imre Sandor
Hi Sandeep, 
Did you have a chance to take a look at the logs?

Regards,
Imre
Sandeep Chinaga Kemparaju
Imre,

Reviewing the logs, the connection to the server is being dropped. This can happen when the lead broker dies or is killed for some reason, which is what you described initially.
As I mentioned earlier, the producer should pick up the new leader, which isn't happening in your case.
This could be due to the way your HA setup is done. However, you can try setting the "retries" parameter to a value other than the default of 0, so that the producer retries and picks up the new leader.
Can you try the "retries" and "retry.backoff.ms" parameters in the Options tab of the Kafka Producer step? See the attached screenshot.
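
As a rough illustration of the two options above, here is a minimal Python sketch that builds them as name/value pairs, the way they would be typed into the Options tab. The helper names (producer_retry_options, retry_window_ms) and the values 10 and 1000 are hypothetical, chosen only for the example. The optional "delivery.timeout.ms" entry is a standard Kafka producer option in clients that support it; whether the client bundled with a given Pentaho build honors it is an assumption to verify.

```python
def producer_retry_options(retries=10, retry_backoff_ms=1000, delivery_timeout_ms=None):
    """Return Kafka producer options as name -> string value pairs.

    The defaults here are illustrative only, not recommended values.
    """
    if retries < 0 or retry_backoff_ms < 0:
        raise ValueError("retries and retry_backoff_ms must be non-negative")
    options = {
        "retries": str(retries),
        "retry.backoff.ms": str(retry_backoff_ms),
    }
    # Optional overall time limit; only meaningful on clients that support it.
    if delivery_timeout_ms is not None:
        options["delivery.timeout.ms"] = str(delivery_timeout_ms)
    return options


def retry_window_ms(options):
    """Approximate total wait between retries, assuming a fixed backoff."""
    return int(options["retries"]) * int(options["retry.backoff.ms"])


opts = producer_retry_options(retries=10, retry_backoff_ms=1000)
for name, value in sorted(opts.items()):
    print(f"{name}={value}")
print("approximate retry window (ms):", retry_window_ms(opts))
```

The retry window only counts the waiting time between attempts; each attempt can also take up to its own request timeout, so the real period is longer. The window should comfortably exceed the time your cluster needs to elect a new leader.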
Imre Sandor

Hi,

I was finally able to get the other party to configure the Pentaho producer to retry the connection as you recommended. The settings did help, but we are still not quite satisfied.
As far as the Pentaho producer is concerned, they sent 3603416 messages to a single topic (tszpayments), as shown in the output below.

However, when I check the tszpayments topic in Kafka, 61 messages are missing. The Pentaho logs do not show any errors. During the transfer of those messages we performed two topic leader switches (we killed the topic leader for the tszpayments topic).
We can associate some of the missing messages with the first leader switch, but not all of them. It seems that the second leader switch didn't result in any loss of messages.

Questions: 
How can we detect on the Pentaho producer side that some messages in a batch didn't go through?
Is it expected that a leader switch can result in some data loss?
As the above output shows, Pentaho takes the records from a PostgreSQL table. Is there any way to ensure the ordering of the messages? (The records have a recid field that is unique for the records to be transferred and monotonically increasing in the table. However, when consuming the messages from Kafka, they are not in order. There is a single partition for that topic.)
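
Until the producer side is sorted out, one way to quantify the problem is to compare the recid values from the source table with the recid values read back from the topic. The sketch below is only an illustration: the function name check_recids is hypothetical, and it assumes both lists have already been exported (one from PostgreSQL, one from a consumer reading the single partition from the beginning).

```python
def check_recids(source_recids, consumed_recids):
    """Compare recids sent from the table with recids read from the topic.

    source_recids:   recids in the order they were read from the table.
    consumed_recids: recids in the order a consumer received them.
    """
    seen = set()
    duplicates = []    # recids delivered more than once
    out_of_order = []  # recids that arrived after a larger recid
    previous = None
    for recid in consumed_recids:
        if recid in seen:
            duplicates.append(recid)
        seen.add(recid)
        if previous is not None and recid < previous:
            out_of_order.append(recid)
        previous = recid
    # Anything in the table that never showed up in the topic.
    missing = [recid for recid in source_recids if recid not in seen]
    return {"missing": missing, "duplicates": duplicates, "out_of_order": out_of_order}


# Tiny made-up example, not real data from the tszpayments topic.
report = check_recids([1, 2, 3, 4, 5, 6], [1, 2, 4, 3, 3, 6])
print(report)
```

On the producer side, the standard Kafka options "acks", "enable.idempotence" and "max.in.flight.requests.per.connection" are the ones that usually govern loss and reordering during retries; whether the Kafka Producer step passes them through unchanged is an assumption to verify.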

Thanks in advance,
Imre