watchfrr: force kill daemons on restart #17163

Tuetuopay · 2024-10-18T13:32:58Z

Today, watchfrr sends a SIGINT (`kill -2`) to a misbehaving daemon through frrcommon. The issue is that a stuck daemon (for example, in a thread starvation situation) never handles the SIGINT, so watchfrr keeps trying to stop it indefinitely.

Let's not waste time and kill -9 from the get go.


Tuetuopay · 2024-10-18T13:37:48Z

Note: this PR is also meant to start a discussion. I would definitely understand if this is not the way you want to do this, or if there are major issues with these changes that I missed.

Anyways.

This is becoming problematic at scale: I've been hitting a lot of issues with bgpd becoming completely unresponsive, and an unresponsive bgpd also does not handle signals. Using kill -2 is way too polite and will never recover from such a case. Just today I had a route reflector where watchfrr tried for almost two hours to restart bgpd, to no avail (and waiting any longer would not have fixed anything).

Jafaral · 2024-10-21T18:45:39Z

This seems like a band-aid solution. Instead of using a "hammer" with bgp, we should be looking at the reasons why it becomes unresponsive in the first place and fix those.

Tuetuopay · 2024-10-22T09:58:28Z

> This seems like a band-aid solution. Instead of using a "hammer" with bgp, we should be looking at the reasons why it becomes unresponsive in the first place and fix those.

While I do agree with you that this should ultimately be fixed (we are trying to reproduce the issue in a lab), as it stands watchfrr is nigh useless because it cannot restart a failed bgpd. So yes, this is a band-aid, but the mechanism is already here. And honestly, I don't want to be woken up at 4 AM because the auto-heal does not work (though I've already deployed this as a quick measure). It's not a solution, but it keeps production happy.

Would having two hammers be better for you? Keeping the kill -2 for e.g. 12s, and going the kill -9 route if this fails?

Jafaral · 2024-10-23T16:06:49Z

> Would having two hammers be better for you? Keeping the kill -2 for e.g. 12s, and going the kill -9 route if this fails?

Yeah, that is more reasonable. Give it 12 or maybe 30 seconds?

Tuetuopay · 2024-10-23T16:44:31Z

Great!

watchfrr has a default timeout of 20s. Do you prefer that I give the daemon a bit less than that (e.g. 15s), or that I raise the default watchfrr timeout to 35s and use the big gun after 30s?
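For illustration, the "two hammers" escalation discussed above can be sketched as follows. This is not the code from this PR (the real change lives in the frrcommon shell script); it is a minimal Python sketch that assumes the daemon is a child process we spawned ourselves. The names `stop_with_escalation`, `spawn_stuck_child`, and `grace_seconds` are hypothetical. Whatever grace period is chosen has to stay below the watchfrr restart timeout, otherwise watchfrr gives up before the SIGKILL is ever sent.

```python
import signal
import subprocess
import sys


def stop_with_escalation(proc: subprocess.Popen, grace_seconds: float = 12.0) -> str:
    """Send SIGINT, wait up to grace_seconds, then fall back to SIGKILL.

    Hypothetical helper: returns "already-exited", "graceful", or "forced".
    """
    if proc.poll() is not None:
        return "already-exited"

    # First hammer: the polite request (equivalent to `kill -2`).
    proc.send_signal(signal.SIGINT)
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        # Second hammer: SIGKILL cannot be caught or ignored (`kill -9`).
        proc.kill()
        proc.wait()
        return "forced"


def spawn_stuck_child() -> subprocess.Popen:
    """Start a stand-in for a stuck daemon: a child that ignores SIGINT."""
    code = (
        "import signal, time;"
        "signal.signal(signal.SIGINT, signal.SIG_IGN);"
        "print('ready', flush=True);"
        "time.sleep(60)"
    )
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # block until the child has installed SIG_IGN
    return proc
```

Calling `stop_with_escalation(spawn_stuck_child(), grace_seconds=1)` should take the forced path, since the child never reacts to SIGINT. A healthy daemon that exits on SIGINT within the grace period would never see the SIGKILL.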

frrbot bot added the watchfrr label Oct 18, 2024

github-actions bot added master size/XS labels Oct 18, 2024

Tuetuopay force-pushed the watchfrr-force-kill branch from 9c2550a to 5e0d6d6 Compare October 18, 2024 13:34
