ddm needs to be resilient to transient underlying platform errors #345
I've tar'd up the logs from both switch zones and put them on catacomb:/staff/core/maghemite-345:
I'll also leave the environment around in this state for a few hours in case anybody wants to look. I'll have to put your ssh key into root's authorized_keys but otherwise you should be able to read stuff from the lab network.
My intuition here is that the request from dendrite to softnpu hung. The communication between these two entities is over a virtual UART using TTY and all the loveliness that entails. While I've tried to make that channel reasonably robust, it still remains a bit hacky. But more importantly, I think this has revealed a weakness in the code at lines 955 to 981 (at c92d6ff), which eventually calls the code at lines 99 to 120, in lines 230 to 234. This means we cannot recover from transient errors. I really need to go split that code up.
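For illustration, here's a rough sketch of the kind of retry-on-transient-failure behavior I have in mind; the `add_route_once` helper and `DendriteError` type are made up for the example, not names from the actual ddm code:

```rust
use std::time::Duration;

/// Hypothetical error type standing in for whatever the dendrite client returns.
#[derive(Debug)]
struct DendriteError(String);

/// Hypothetical one-shot route update; roughly where a transient platform
/// error currently propagates all the way up and wedges the route state.
fn add_route_once(prefix: &str) -> Result<(), DendriteError> {
    // A real implementation would issue the dendrite API request here.
    Err(DendriteError(format!("timed out adding {prefix}")))
}

/// Retry transient failures with bounded exponential backoff instead of giving
/// up on the first error, so one hung or slow request does not leave the
/// switch's route table permanently stale.
fn add_route_with_retry(prefix: &str, max_attempts: u32) -> Result<(), DendriteError> {
    let mut delay = Duration::from_millis(100);
    let mut attempt = 1;
    loop {
        match add_route_once(prefix) {
            Ok(()) => return Ok(()),
            Err(e) if attempt < max_attempts => {
                eprintln!("attempt {attempt} failed ({e:?}); retrying in {delay:?}");
                std::thread::sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(5));
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    if let Err(e) = add_route_with_retry("fd00:1122:3344:102::/64", 5) {
        eprintln!("giving up: {e:?}");
    }
}
```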
To see if this is a persistent issue, I went to the owner of the missing prefix
Now back in the switch zone we see the prefix
and the ping from
So, in short: if we had an upper/lower architecture for mg-ddm with a state-driver/reconciler execution model, this would not have resulted in a permanent bad state.
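To make that concrete, here's a minimal sketch of a state-driver/reconciler loop, assuming an in-memory set of desired prefixes and a stand-in for the switch's actual route table; none of these names come from the real mg-ddm code:

```rust
use std::collections::HashSet;
use std::thread;
use std::time::Duration;

/// Upper half: the prefixes learned from DDM peers (desired state).
/// Lower half: what we believe the switch currently has (actual state).
/// The reconciler repeatedly diffs the two and re-issues whatever is missing,
/// so a transient failure only delays convergence instead of wedging it.
fn reconcile(desired: &HashSet<String>, actual: &mut HashSet<String>) {
    let missing: Vec<String> = desired.difference(actual).cloned().collect();
    for prefix in missing {
        // A real lower half would call the dendrite API here and simply leave
        // the prefix out of `actual` if the call fails, to be retried next pass.
        println!("installing missing route {prefix}");
        actual.insert(prefix);
    }

    let stale: Vec<String> = actual.difference(desired).cloned().collect();
    for prefix in stale {
        println!("removing stale route {prefix}");
        actual.remove(&prefix);
    }
}

fn main() {
    let desired: HashSet<String> =
        ["fd00:1122:3344:102::/64".to_string()].into_iter().collect();
    let mut actual: HashSet<String> = HashSet::new();

    // A real state driver would run forever; a few passes show the idea.
    for _ in 0..3 {
        reconcile(&desired, &mut actual);
        thread::sleep(Duration::from_secs(1));
    }
}
```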
Using 5ba7808685dcbfa5c4ef0bc251d27d16d1671304, I launched an a4x2 setup with this environment:
a4x2 launch succeeded but the system never made the handoff to Nexus. Sled Agent reports that it's failing to connect to that Nexus instance's internal API. Over on that Nexus, the log is reporting a bunch of:
I also noticed a bunch of database errors from other Nexus zones:
I dug into CockroachDB and found that 2 of the 5 nodes are reported "dead" and the reason is that their heartbeats are routinely taking more than 5s. Both of these nodes are on the same sled, g1. And those two nodes don't seem to have connectivity to the nodes on other sleds. After a bunch of digging I boiled it down to this observation: from g1's global zone, I cannot ping the underlay address of g3's global zone, but I can ping in the reverse direction. But even the reverse direction fails if I pick a different path.
So this works:
This doesn't work:
The question is: where is the packet being lost? @rmustacc helped me map out the various L2 devices that make up the Falcon topology. For my own future reference: you start with the config files in a4x2/.falcon/g{0,1,2,3}.toml, find the Viona NICs there, and look at the corresponding illumos devices to see which simnet each VNIC is over, which simnet that one is connected to, and which VNIC is over that. For g1, that's:
For g3, it's:
(I also learned through this that softnpu is running as an emulated device inside Propolis for the Scrimlet VMs.)
By snooping along these various points, we can figure out exactly where the packet is being dropped. We did that with
pfexec snoop -d DEVICE -r icmp6
on ivanova, the machine hosting the a4x2. From the above, there are two paths from g1's global zone to g3's global zone, and the one in use turns out to be:
It turns out the replies are going back over the other path, which goes through the switch attached to g0 rather than g3:
So the packet is being dropped in g0's softnpu. But why? Either softnpu is broken or the system has configured it wrong. Well, let's look at its routes:
That's odd. We have no route for fd00:1122:3344:102::/64. We do on the other switch (g3):
This seems likely to be the problem! But why do we have no routes there? g0's mgd does seem to know about the 102 prefix:
@FelixMcFelix helped dig in and mentioned that it's mg-ddm that's responsible for keeping Dendrite (and thus the switch) up to date with the prefixes found. It's at this point that I start to lose the plot about what happened.
Searching through the mg-ddm log for that prefix, we find:
and then:
It's a little clearer if we grep for tfportrear1_0:
At this point it seems like:
Over in Dendrite, we do have one instance of this message from Dropshot:
These are consistent with the client-side timeouts reported by mg-ddm. Note that if this happens, it shouldn't actually affect the request because Dropshot won't cancel the request handler. So even if this happened with all the "add route" requests, I think this wouldn't explain why the routes were not ultimately added.
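To illustrate that point (with plain std threads and channels, not Dropshot itself): a client-side timeout only stops the client from waiting, while the work it requested still runs to completion on the other side.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel();

    // "Server": handles the request slowly, then reports completion.
    thread::spawn(move || {
        thread::sleep(Duration::from_secs(2)); // slow handler
        println!("server: route added");
        let _ = tx.send(());
    });

    // "Client": only willing to wait one second for the response.
    match rx.recv_timeout(Duration::from_secs(1)) {
        Ok(()) => println!("client: got response"),
        Err(_) => println!("client: timed out waiting for response"),
    }

    // Even after the client gives up, the handler finishes its work.
    thread::sleep(Duration::from_secs(2));
}
```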
But why do we only have one of these on the server, while we have a bunch of timeouts on the client:
The only explanations I can come up with are:
That's about as far as we got. I should mention that I saw this note in the docs:
I believe we correctly applied that workaround and it made no difference here. And from what I can tell, the static routing config only affects what's upstream of the switches, not the rack-internal routing, so I think this is unrelated to my use of static routing here.