Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

watchfrr: force kill daemons on restart #17163

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Tuetuopay
Copy link
Contributor

Today, watchfrr sends a SIGSTOP to a misbehaving daemon through frrcommon. The issue is, a stuck daemon (like in a thread starvation situation) will not honor a SIGSTOP, and watchfrr will try indefinitely to kill it.

Let's not waste time and kill -9 from the get go.

Today, watchfrr sends a SIGSTOP to a misbehaving daemon through
frrcommon. The issue is, a stuck daemon (like in a thread starvation
situation) will not honor a SIGSTOP, and watchfrr will try indefinitely
to kill it.

Let's not waste time and kill -9 from the get go.

Signed-off-by: Tuetuopay <[email protected]>
@Tuetuopay
Copy link
Contributor Author

Note: this PR is also to start the discussion. I would definitely understand if this is not the way you want to do this, or if there are huge issues I missed with those changes.

Anyways.

This is becoming problematic at scale, because I've been hitting a lot of issues with bgpd becoming completely unresponsive. And bgpd unresponsive also means bgpd not handling signals. Using kill -2 is way too polite and will never recover from such a case. Literally today I've had a route-reflector where watchfrr tries for almost two hours to restart bgpd, to no avail (and it's not like waiting any longer would have fixed anything).

@Jafaral
Copy link
Member

Jafaral commented Oct 21, 2024

This seems like a band-aid solution. Instead of using a "hammer" with bgp, we should be looking at the reasons why it becomes unresponsive in the first place and fix those.

@Tuetuopay
Copy link
Contributor Author

Tuetuopay commented Oct 22, 2024

This seems like a band-aid solution. Instead of using a "hammer" with bgp, we should be looking at the reasons why it becomes unresponsive in the first place and fix those.

While I do agree with you that this should ultimately be fixed (we are trying to reproduce the issue in a lab), as it stands watchfrr is nigh useless because it cannot restart a failed bgpd. So yes, this is a band-aid, but the mechanism is already here. And honestly I don't want to be woken up at 4AM because the auto-heal does not work (though I've already deployed this as a quick measure). It's not a solution, but it keeps production happy.

Would having two hammers be better for you? Keeping the kill -2 for e.g. 12s, and going the kill -9 route if this fails?

@Jafaral
Copy link
Member

Jafaral commented Oct 23, 2024

Would having two hammers be better for you? Keeping the kill -2 for e.g. 12s, and going the kill -9 route if this fails?

Yeah. That is more reasonable, give it 12 or maybe 30 seconds?

@Tuetuopay
Copy link
Contributor Author

Great!

watchfrr has a by default 20s timeout. do you prefer I give the daemon a bit less than that (e.g. 15s), or raise the default watchfrr timeout to 35s and use the big gun after 30s?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants