-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
watchfrr: force kill daemons on restart #17163
base: master
Are you sure you want to change the base?
Conversation
Today, watchfrr sends a SIGSTOP to a misbehaving daemon through frrcommon. The issue is, a stuck daemon (like in a thread starvation situation) will not honor a SIGSTOP, and watchfrr will try indefinitely to kill it. Let's not waste time and kill -9 from the get go. Signed-off-by: Tuetuopay <[email protected]>
9c2550a
to
5e0d6d6
Compare
Note: this PR is also to start the discussion. I would definitely understand if this is not the way you want to do this, or if there are huge issues I missed with those changes. Anyways. This is becoming problematic at scale, because I've been hitting a lot of issues with bgpd becoming completely unresponsive. And bgpd unresponsive also means bgpd not handling signals. Using |
This seems like a band-aid solution. Instead of using a "hammer" with bgp, we should be looking at the reasons why it becomes unresponsive in the first place and fix those. |
While I do agree with you that this should ultimately be fixed (we are trying to reproduce the issue in a lab), as it stands watchfrr is nigh useless because it cannot restart a failed bgpd. So yes, this is a band-aid, but the mechanism is already here. And honestly I don't want to be woken up at 4AM because the auto-heal does not work (though I've already deployed this as a quick measure). It's not a solution, but it keeps production happy. Would having two hammers be better for you? Keeping the kill -2 for e.g. 12s, and going the kill -9 route if this fails? |
Yeah. That is more reasonable, give it 12 or maybe 30 seconds? |
Great! watchfrr has a by default 20s timeout. do you prefer I give the daemon a bit less than that (e.g. 15s), or raise the default watchfrr timeout to 35s and use the big gun after 30s? |
Today, watchfrr sends a SIGSTOP to a misbehaving daemon through frrcommon. The issue is, a stuck daemon (like in a thread starvation situation) will not honor a SIGSTOP, and watchfrr will try indefinitely to kill it.
Let's not waste time and kill -9 from the get go.