Synapse fails to restart after zookeeper node disappears #94

Jaykah · 2014-10-09T23:45:33Z

I am using the Zookeeper watcher to get a list of available nodes - which it successfully does.

However, after a node fails, it does not get removed from haproxy.

Haproxy monitoring page shows those nodes as down, Zookeeper no longer has them as they are ephemeral, but the reconfiguration does not take place.

If I restart Synapse, however, everything goes back to normal.

I, [2014-10-09T22:17:20.844370 #12]  INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service proddb
I, [2014-10-09T22:17:20.861672 #12]  INFO -- Synapse::ZookeeperWatcher: synapse: discovered 6 backends for service proddb
[WARNING] 281/221756 (157) : Server proddb/100.0.10.10:3306_100.0.10.10 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 1ms. 6 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 281/221756 (157) : Server proddb/100.0.10.10:3306_100.0.10.10 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 2ms. 5 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
I, [2014-10-09T22:18:15.682057 #12]  INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service proddb
I, [2014-10-09T22:18:15.690843 #12]  INFO -- Synapse::ZookeeperWatcher: synapse: discovered 5 backends for service proddb

Note: Synapse is deployed in Docker and managed by supervisord

The text was updated successfully, but these errors were encountered:

Jaykah · 2014-10-16T13:16:42Z

I've done some further debugging, and so far I can say that:

-Haproxy config is getting successfully re-applied when there are changes in the number of endpoints (begin begins & end ends below)
-But it does not get changed by Synapse once the number of endpoints changes
-Because it is not getting re-read for changes in the first place
-Because it is not getting changes from Zookeeper unless it is restarted

I, [2014-10-14T12:55:37.381646 #500]  INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service db
I, [2014-10-14T12:55:37.388731 #500]  INFO -- Synapse::ZookeeperWatcher: synapse: discovered 4 backends for service db
W, [2014-10-14T12:55:37.389157 #500]  WARN -- Synapse::ZookeeperWatcher: synapse: warning: begin begins
W, [2014-10-14T12:55:37.389220 #500]  WARN -- Synapse::ZookeeperWatcher: synapse: warning: end ends

Here's what happens if I restart Synapse (for reference):

 I, [2014-10-14T12:56:55.014435 #524]  INFO -- Synapse::ZookeeperWatcher: synapse: discovered 4 backends for service db
W, [2014-10-14T12:56:55.014877 #524]  WARN -- Synapse::ZookeeperWatcher: synapse: warning: begin begins
W, [2014-10-14T12:56:55.014951 #524]  WARN -- Synapse::ZookeeperWatcher: synapse: warning: end ends
I, [2014-10-14T12:56:55.015023 #524]  INFO -- Synapse::Synapse: synapse: regenerating haproxy config
W, [2014-10-14T12:56:55.016053 #524]  WARN -- Synapse::Haproxy: synapse: warning: read old conf start
W, [2014-10-14T12:56:55.016214 #524]  WARN -- Synapse::Haproxy: synapse: warning: read old conf end
W, [2014-10-14T12:56:55.016265 #524]  WARN -- Synapse::Haproxy: synapse: warning: conf write begin
W, [2014-10-14T12:56:55.016481 #524]  WARN -- Synapse::Haproxy: synapse: warning: conf write ends

I'm curious if anyone is experiencing a similar issue, or if it is a corner case caused by Docker/Supervisord combination

Jaykah · 2014-11-06T22:37:58Z

More debugging:

Identifying threads:

Restart Synapse

I, [2014-11-06T22:02:59.634475 #3181]  INFO -- Synapse::ZookeeperWatcher: synapse: zookeeper watcher connecting to ZK at xhbtr.internal.com:2181,xhbtr3.internal.com:2181,xhbtr2.internal.com:2181
I, [2014-11-06T22:02:59.707555 #3181]  INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service db
I, [2014-11-06T22:02:59.719320 #3181]  INFO -- Synapse::ZookeeperWatcher: synapse: discovered 5 backends for service db
I, [2014-11-06T22:02:59.719512 #3181]  INFO -- Synapse::Synapse: RECONFIGURE INSIDE SYNAPSE OBJECT ID 70237320665780 THREAD ID 70237310423840
I, [2014-11-06T22:02:59.719685 #3181]  INFO -- Synapse::Synapse: TEST PING OBJECT ID 70237320665780 THREAD ID 70237310423840
I, [2014-11-06T22:02:59.719853 #3181]  INFO -- Synapse::Synapse: TEST PING OBJECT ID 70237320665780 THREAD ID 70237310423840
I, [2014-11-06T22:02:59.720117 #3181]  INFO -- Synapse::Synapse: TEST PING OBJECT ID 70237320665780 THREAD ID 70237310423840
I, [2014-11-06T22:02:59.720362 #3181]  INFO -- Synapse::Synapse: CONFIG UPDATED IS true OBJECT ID 70237320665780 THREAD ID 70237310423840
I, [2014-11-06T22:02:59.720582 #3181]  INFO -- Synapse::Synapse: synapse: regenerating haproxy config
I, [2014-11-06T22:02:59.720797 #3181]  INFO -- Synapse::Haproxy: CODE BLOCK 1 OBJECT ID 70237320663500 THREAD ID 70237310423840
I, [2014-11-06T22:02:59.721970 #3181]  INFO -- Synapse::Haproxy: CODE BLOCK 2 OBJECT ID 70237320663500 THREAD ID 70237310423840
I, [2014-11-06T22:02:59.722415 #3181]  INFO -- Synapse::Haproxy: SLEEP IS -63582530577.72238
[WARNING] 309/220259 (3198) : config : backend 'chat' : ignoring cookie for server '100.0.10.13:5280_100.0.10.13' as HTTP mode is disabled.
[WARNING] 309/220259 (3198) : config : backend 'db' : ignoring cookie for server '100.0.10.11:3306_100.0.10.11' as HTTP mode is disabled.
[WARNING] 309/220259 (3198) : config : backend 'db' : ignoring cookie for server '100.1.10.4:3306_100.1.10.4' as HTTP mode is disabled.
[WARNING] 309/220259 (3198) : config : backend 'db' : ignoring cookie for server '100.0.10.5:3306_100.0.10.5' as HTTP mode is disabled.
[WARNING] 309/220259 (3198) : config : backend 'db' : ignoring cookie for server '100.0.10.20:3306_100.0.10.20' as HTTP mode is disabled.
[WARNING] 309/220259 (3198) : config : backend 'db' : ignoring cookie for server '100.0.10.10:3306_100.0.10.10' as HTTP mode is disabled.

Kill one of the nodes:

[WARNING] 309/220329 (3198) : Server db/100.0.10.10:3306_100.0.10.10 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 4 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
I, [2014-11-06T22:03:48.709254 #3181]  INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service db
I, [2014-11-06T22:03:48.717118 #3181]  INFO -- Synapse::ZookeeperWatcher: synapse: discovered 4 backends for service db
I, [2014-11-06T22:03:48.717329 #3181]  INFO -- Synapse::Synapse: RECONFIGURE INSIDE SYNAPSE OBJECT ID 70237320665780 THREAD ID 70236942398180

Counting threads

Restart Synapse

I, [2014-11-06T22:18:53.418929 #3283]  INFO -- Synapse::Synapse: THREAD COUNT IN INTIALIZE 2
I, [2014-11-06T22:18:53.419181 #3283]  INFO -- Synapse::Synapse: synapse: starting...
I, [2014-11-06T22:18:53.419383 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: starting ZK watcher application @ hosts: xhbtr3.internal.com:2181,xhbtr2.internal.com:2181,xhbtr.internal.com:2181, path: /services/application
I, [2014-11-06T22:18:53.419662 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: zookeeper watcher connecting to ZK at xhbtr3.internal.com:2181,xhbtr2.internal.com:2181,xhbtr.internal.com:2181
I, [2014-11-06T22:18:53.490143 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service application
I, [2014-11-06T22:18:53.492769 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: discovered 1 backends for service application
I, [2014-11-06T22:18:53.492996 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: starting ZK watcher chat @ hosts: xhbtr.internal.com:2181,xhbtr3.internal.com:2181,xhbtr2.internal.com:2181, path: /services/chat
I, [2014-11-06T22:18:53.493161 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: zookeeper watcher connecting to ZK at xhbtr.internal.com:2181,xhbtr3.internal.com:2181,xhbtr2.internal.com:2181
I, [2014-11-06T22:18:53.562536 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service chat
I, [2014-11-06T22:18:53.564986 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: discovered 1 backends for service chat
I, [2014-11-06T22:18:53.565202 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: starting ZK watcher db @ hosts: xhbtr3.internal.com:2181,xhbtr.internal.com:2181,xhbtr2.internal.com:2181, path: /services/db
I, [2014-11-06T22:18:53.565365 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: zookeeper watcher connecting to ZK at xhbtr3.internal.com:2181,xhbtr.internal.com:2181,xhbtr2.internal.com:2181
I, [2014-11-06T22:18:53.631336 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service db
I, [2014-11-06T22:18:53.639681 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: discovered 5 backends for service db
I, [2014-11-06T22:18:53.639878 #3283]  INFO -- Synapse::Synapse: THREAD COUNT IN LOOP 14
I, [2014-11-06T22:18:53.640057 #3283]  INFO -- Synapse::Synapse: synapse: regenerating haproxy config

Kill one of the nodes:

[WARNING] 309/221943 (3300) : Server db/100.0.10.10:3306_100.0.10.10 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 4 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
I, [2014-11-06T22:20:02.709009 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: discovering backends for service db
I, [2014-11-06T22:20:02.724917 #3283]  INFO -- Synapse::ZookeeperWatcher: synapse: discovered 4 backends for service db

Somehow, when the update happens, another thread is taking over, and @config_updated gets overwritten by each of them, causing the config update to fail (so I assume).

Most likely, it's the ZK gem that's causing the issue (by creating threads and locks).

I will try various Ruby versions, as well as to roll back ZK/Synapse, to see if that helps, as I am out of ideas.

P.S. I am using a manually compiled Ruby 2.1.2

# Install RUBY
RUN wget http://cache.ruby-lang.org/pub/ruby/2.1/ruby-2.1.2.tar.gz
RUN tar xzfv ruby-2.1.2.tar.gz
WORKDIR /tmp/ruby-2.1.2
ENV RUBY_CONFIGURE_OPTS --with-readline-dir=/usr/lib/x86_64-linux-gnu/libreadline.so
RUN ./configure
RUN make && make install
RUN echo "gem: --no-ri --no-rdoc" > ~/.gemrc

Jaykah · 2014-11-07T22:36:15Z

Update: I have taken Synapse out of Docker and Supervisord, as well as have tried multiple versions of Ruby and Synapse and still have the same issue.

Jaykah · 2014-11-10T22:17:49Z

FINALLY!

A solution:

TL;DR

The issue has been caused an has been repeatedly reproduced because of the way it was being launched - in the foreground.

Forking the process did the trick.

Apparently, having one instance of Synapse would prevent it from being able to both query for changes and execute loops - something that I still can't wrap my head around.

Running from Supevisord, as a command, or via plain synapse -c /etc/synapse.conf.yaml in the foreground was doing just that - making the process busy and not able to write modified Haproxy configs.

To make it work, I had to adjust my Supervisord config:

[program:synapse]
process_name=synapse-%(process_num)s
command=synapse -c /etc/synapse.conf.yaml
numprocs=2
stopasgroup=true
autorestart=unexpected

thus forking Synapse.

It took me a while to debug, so hopefully this will help someone else too.

Jaykah · 2014-11-16T00:51:44Z

Today I have noticed another abnormality - although the changes made (as per my previous comment) did help with dead node removal, new nodes never made it to Haproxy config when registered.

However, they started adding with numprocs=4, which is even stranger.

Despite that working, I am still convinced that repeating every command multiple times is not the right way to go. What I have also noticed is that with numprocs=4 there are only 3 processes executing, which means that something is getting lost.

Can anyone shed some light on how forking/threading is being done in Synapse?

hankya · 2015-08-29T07:48:09Z

is this fixed? anyone?

jolynch · 2015-08-29T11:09:33Z

@hankya We haven't seen the behavior referenced in this PR but we don't use the supervisord. Have you been seeing the same issue and if so what's your setup?

I did see something that I thought was like this bug when ZK was overloaded, but in that case the issue was just a large lag in event delivery. Our connection pooling fix alleviated those issues (plus a bug fix to the pooling code). While investigating the pooling bug though I setup a pretty thorough test setup where I was removing and adding nodes while messing with the network via netem and I still wasnt able to repro with Ruby 1.9 and running Synapse directly. I could try to modify that setup to get a repro of this issue if it's still impacting people.

Jaykah · 2015-08-31T16:31:36Z

@jolynch This issue is related to the fact that (AFAIK) Synapse does reloads by forking the process and then executing the reload. The problem with Supervisord is that it keeps attached to STDOUT, running the process in the foreground, thus messing with the ability to fork&reload. That's a pity, because Supervisord is a really great tool for process monitoring.

igor47 · 2015-08-31T17:02:42Z

i don't think that makes sense. we run synapse via runit, which also keeps the process attached in the foreground, and we have no problems. also, i'm not sure what you're referring to as far as reloads via forking. our usual model is to exit the process and allow the process supervisor to restart it.

jolynch · 2015-08-31T17:46:08Z

I'm also confused by the forking issue. We use upstart and just have it kill the old process before starting the new one. For nerve we run two copies during a flip to achieve hitless nerve reloads. @Jaykah why do you need to fork Synapse to reload?

I think that @hankya is having a separate issue, which I'm following up on in #137.

Jaykah · 2015-08-31T19:15:37Z

@jolynch The first six comments of this ticket explain my reasoning. Basically what happens is if we have only one instance of Synapse running it does not write modified Haproxy configs for node removal, if we run two it does remove nodes, but does not add new ones. And if we run is as numprocs=4 (basically forking it into 4 processes) it seems to work fine, but I've still had some cases with reloads not happening (maybe once every 100 reloads or so as opposed to 100% of the time without forking). Unfortunately, this is something that I wasn't yet able to figure out...

jolynch · 2015-08-31T20:16:32Z

@Jaykah I think the relevant context here is that you are using ruby 2.1.2, same as @hankya over in #137. I will focus my efforts on that ruby version. It's also been reported that 2.2 doesn't work (although ruby-zk doesn't work with 2.2 so ...).

For context, I am unable to reproduce this at all in our production environment or in my own testing using ruby 1.9, so if you can downgrade, that might be a good short term solution.

jolynch · 2015-08-31T20:21:46Z

@igor47 I thought you had mentioned airbnb moving to ruby 2.X soon, if you guys get around to debugging this before me that is also cool heh :-)

igor47 · 2015-08-31T21:59:12Z

fyi @rickypai

evie404 · 2015-09-01T01:45:08Z

I sent a PR to the zk gem to have its tests run and pass against Ruby 2.2. Unfortunately, we are yet to be running our infrastructure on Ruby 2.x, so we have not seen this error.

Addresses incompatibility of per_callback threading in ZK, the way that synapse calls unsubscribe within the callback, and ruby 2.X really not liking joining on the current running thread. Should fix #137 #99 and possibly #94

c4urself · 2015-09-10T16:14:19Z

FWIW, I've run the following test on Ubuntu 14.04 which works on ruby1.9.1 and does not work on ruby2.0 running a supervisord based nerve and synapse.

[on service B] supervisorctl stop nerve
- Correct. HAproxy keeps servers in backend.
[on service A] supervisorctl stop synapse
- Correct. HAproxy keeps servers in backend.
[on service A] supervisorctl start synapse
- Correct. Synapse falls back to default servers
[on service B] supervisorctl start nerve
- Correct. Nerve starts fine and registers with Zookeeper on service B
- Incorrect. Synapse does not get callback for change and does not include newly registered servers in backend, keeps running on default servers on service A

The difference is that on step 4, ruby1.9.1 synapse re-registers the backends.

jolynch · 2015-09-10T16:32:53Z

@c4urself did you run this test with latest master? We had a bug with a Thread api change between 1.9 and 2.0, but I thought that I fixed it in PR #138.

c4urself · 2015-09-10T16:44:20Z

@jolynch No, I ran the test with 0.12.1. I retried it with the master branch and it looks like it's working correctly now. Thanks!

jolynch · 2015-09-10T17:00:30Z

@igor47 can you cut a bug fix release? I haven't been able to roll it out here but if you've done that then we probably want to cut a release.

igor47 · 2015-09-11T16:24:21Z

i'll do that now; i've only given it the barest testing on our infra, haven't rolled it out site-wide yet

jolynch · 2015-10-04T02:12:49Z

@Jaykah how does master work for you? Do you still see the lack of restart on update?

Jaykah · 2015-10-12T20:52:02Z

@jolynch Have just tested it, and it still does not do reloads properly. Built from master.

I, [2015-10-12T20:47:38.823828 #18]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: discovering backends for service application
I, [2015-10-12T20:47:38.855753 #18]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: discovered 2 backends for service application
I, [2015-10-12T20:47:38.857344 #18]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: discovering backends for service application

and then after restarted through supervisorctl:

I, [2015-10-12T20:53:29.343200 #520]  INFO -- Synapse::Synapse: synapse: starting...
I, [2015-10-12T20:53:29.344505 #520]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: starting ZK watcher application @ hosts: xhbtr.site.com:2181,xhbtr2.site.com:2181,xhbtr3.site.com:2181, path: /services/application
I, [2015-10-12T20:53:29.344907 #520]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: zookeeper watcher connecting to ZK at xhbtr.site.com:2181,xhbtr2.site.com:2181,xhbtr3.site.com:2181
I, [2015-10-12T20:53:29.345259 #520]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: creating pooled connection to xhbtr.site.com:2181,xhbtr2.site.com:2181,xhbtr3.site.com:2181
I, [2015-10-12T20:53:29.388866 #520]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: successfully created zk connection to xhbtr.site.com:2181,xhbtr2.site.com:2181,xhbtr3.site.com:2181
...
I, [2015-10-12T20:53:29.517319 #520]  INFO -- Synapse::Synapse: synapse: configuring haproxy
I, [2015-10-12T20:53:29.519207 #520]  INFO -- Synapse::Haproxy: synapse: restart required because we have a new backend application/10.0.12.119:80_10.0.12.119
I, [2015-10-12T20:53:29.520283 #520]  INFO -- Synapse::Haproxy: synapse: reconfigured haproxy

jolynch · 2015-10-12T21:02:26Z

@Jaykah, OK I'll look into it. Can you post your docker file, it might make it easier for me to reproduce.

Just to make sure we're talking about the same thing, Synapse is not supposed to reload itself if Synapse configuration changes (adding or removing services) however it is supposed to reload HAProxy if the servers backing those services change. I guess the part I'm confused about in this ticket is why Synapse needs to fork at all to reload HAProxy.

Actually, how does this work at all in a Docker container? Does HAProxy run inside the container?

Jaykah · 2015-10-12T21:07:51Z

@jolynch Sure, will post the Dockerfile shortly.

That is correct, I am referring to the HAProxy reload when the list of nodes changes in Zookeeper. So the steps would be:

Launch a new backend container that has Nerve
It will register its IP in Zookeeper
Frontend container's Synapse would see that the list of nodes changed in Zk
It should then reload HAProxy (but doesn't).

Both HAProxy and Synapse run in Docker with Supervisord.

igor47 · 2015-10-12T21:09:26Z

it needs to fork to reload haproxy because the haproxy reload is accomplished with a command. the unix way to execute another command is to fork and then execve. i wonder if this is docker-related at this point

jolynch · 2015-10-12T21:20:10Z

I think that having the docker file (or something close to it) will be a big help in reproducing the bug. I have a suspicion supervisord and docker are not acting like we might expect.

jolynch · 2015-10-12T21:28:21Z

Yea there are going to be two forks (one by Synapse and one by the HAProxy that starts up), so if supervisord is like "what is this thing that just forked" that might be an issue.

Jaykah · 2015-10-12T21:56:54Z

@jolynch I've created a Dockerfile bundle for you that can be used to reproduce the issue. All you have to do is add a "synapse.conf.yaml" and your public key (to ssh into container via ip:222), named "myserver.pub".

You will then be able to connect to Supervisord console via "supervisorctl" with user "supervisord" and password "testpass". Port 3212 is also exposed for HAProxy stats.

https://www.dropbox.com/sh/yjs6eds1t0xdpg7/AAA7BUQ-MkP8nfOyjPvAZ549a?dl=0

Jaykah · 2015-10-22T17:00:54Z

@jolynch Were you able to reproduce it?

jolynch · 2015-10-24T02:58:19Z

@Jaykah sorry I'll try to take a look this weekend, sorry about that.

Jaykah · 2015-10-29T17:59:08Z

@jolynch Let me know if I can help in solving that issue in any way, would be glad to provide a staging environment or any other information that might be helpful.

jolynch · 2015-11-03T07:24:04Z

@Jaykah I'm still working on getting a repro working, but one of the first things I noticed is that you have the supervisord process managing haproxy which is probably not going to work with Synapse causing it to fork. This sounds similar to the problems that folks have with systemsd which assumes that the process it's managing does not fork. See the patch for more background.

So I guess two questions:

What is your haproxy reload command in your synapse.conf.json?
Have you tried managing the haproxy-systemd-wrapper (instead of haproxy) with supervisord and then having synapse trigger a SIGHUP2, see this post for some insight.

Jaykah · 2015-11-04T02:42:43Z

@jolynch

[program:haproxy] 
command=/usr/sbin/haproxy -db -f /etc/haproxy/haproxy.cfg
numprocs=1

Haproxy is launched in foreground as 1 process, but then exits with a status 0 (expected) when Synapse loads it. That way we are monitoring the initial launch of Haproxy without restraining forking (for the purpose you've mentioned). We are then stopping it's management by Supervisord with:

autorestart=unexpected

The reload command is the following:

reload_command: "haproxy -db -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(pidof haproxy)"

We have also tested dropping SYN and reloading via supervisorctl with limited success)

iptables -I INPUT -p tcp --dport 3213 --syn -j DROP
sleep 1
supervisorctl -u login -p pass restart haproxy:*
iptables -D INPUT -p tcp --dport 3213 --syn -j DROP

2.-Will try the haproxy-systemd-wrapper shortly.

Jaykah · 2015-11-11T19:32:50Z

@jolynch Unfortunately nothing changes by using the wrapper. Synapse is not restarting Haproxy:

I, [2015-11-11T19:27:53.419479 #345]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: discovering backends for service application
I, [2015-11-11T19:27:53.423973 #345]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: discovered 1 backends for service application
I, [2015-11-11T19:27:53.969748 #345]  INFO -- Synapse::Synapse: synapse: configuring haproxy
I, [2015-11-11T19:27:53.971475 #345]  INFO -- Synapse::Haproxy: synapse: reconfigured haproxy

If I restart Synapse manually I get:

I, [2015-11-11T19:29:36.421316 #366]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: discovering backends for service proddb
I, [2015-11-11T19:29:36.429431 #366]  INFO -- Synapse::ServiceWatcher::ZookeeperWatcher: synapse: discovered 4 backends for service proddb
I, [2015-11-11T19:29:36.429514 #366]  INFO -- Synapse::Synapse: synapse: configuring haproxy
I, [2015-11-11T19:29:36.431798 #366]  INFO -- Synapse::Haproxy: synapse: reconfigured haproxy
I, [2015-11-11T19:29:36.450096 #366]  INFO -- Synapse::Haproxy: synapse: restarted haproxy

Moreover, with supervisord.conf

command=/usr/sbin/haproxy-systemd-wrapper -Ds -f /etc/haproxy/haproxy.cfg

and synapse.conf.yaml

reload_command: "/bin/kill -SIGUSR2 $(pidof haproxy-systemd-wrapper)"

I can still see Haproxy's forks even after the process is terminated via Supervisor (we start with haproxy-systemd-wrapper & two children, two more are started when I issue a HUP, and then one of the first two and one of the second two remain after).

Were you able to reproduce the issue with the Dockerfile I provided?

Jaykah · 2015-11-13T02:55:04Z

Did some more tests with/without -Ds, with/without -db, with numprocs=1/2/3/4, without fork control, restarts/reloads via -USR2, -HUP - nothing worked. At this time I am almost certain that the issue has to do with Supervisor reaping Synapse processes because Synapse does not have a "foreground" (or non-daemon) mode, making it not systemd compliant.

2015-11-13 02:37:41,657 CRIT reaped unknown pid 478)
2015-11-13 02:38:46,552 CRIT reaped unknown pid 473)
2015-11-13 02:38:46,562 CRIT reaped unknown pid 477)
2015-11-13 02:38:46,564 CRIT reaped unknown pid 479)

Jaykah · 2015-11-18T00:37:30Z

@jolynch Were you able to reproduce it? We're debating if we should move our infra off Synapse because of this issue, but if you have any pointers as to how we can make it run in foreground, that would be super helpful.

Thanks.

jolynch · 2015-11-18T01:03:31Z

@Jaykah Yea I've been dropping the ball on this. I really think that the issue lies with how a Docker container really isn't a proper init system. We really like running synapse/haproxy outside the containers because of all kinds of reasons (namely our service data plane stays available during a docker roll and we don't like paying conntrack), is there any possibility you can run on the bare metal? To be honest I thought that Synapse does stay attached in the foreground (as @igor47 has implied they run it via runit).

Why do you think those reaps are Synapse PIDs? Couldn't they just as easily be HAProxy instances?

I'll keep working on repro but ... well I've been saying that for a while.

jolynch · 2016-01-17T01:46:19Z

@Jaykah Alright I got a few hours to debug this today and this is not a bug in Synapse, it's an issue with how supervisord and haproxy interact. There are a couple of problems with running Synapse like this but they all revolve around the problem that Supervisord assumes that it is the only thing creating processes ever and this is simply not true with Synapse (which typically forks haproxy) or HAProxy (which typically forks itself).

Because we're running in supervisord the only way that Synapse can reload the service is via supervisorctl, e.g. supervisorctl restart haproxy:haproxy because it can't just fork off a process as we usually would because supervisord can't handle that (this leads to the reaping of unknown pid problem you were mentioning).

However, if we do that and use the haproxy-systemd-wrapper, then we run into the problem that supervisord only supports stop then start restarts (I think), which means that it waits for the previous haproxy to completely finish before it will start the new one. This is bad because the way that HAProxy gracefully drains traffic is to have the old process drop its listeners but crucially not exit until all TCP connections are finished cleanly. This leads to the unknown pid problem as well, because there are a bunch of haproxy instances hanging around that will never go away due to the way the haproxy systemd wrapper works.

I think this would all be solved if we could just supply the pid in the command for HAProxy but I don't know how to do this in supervisord.

I'm going to keep poking at this to see if I can properly work around supervisord and haproxy's terribleness, but at this point I'm pretty convinced that this is not a bug with Synapse but rather an issue with supervisord being a really bad replacement for a proper init system.

jolynch · 2016-01-18T20:25:42Z

For what it's worth I was able to get a container that would restart haproxy, but it worked by basically having a bash script that would exec:

#!/bin/bash                                                                                                                    
touch /var/haproxy/haproxy.pid                                                  
exec /usr/sbin/haproxy -db -f /etc/haproxy/haproxy.cfg -p /var/haproxy/haproxy.pid -sf $(cat /var/haproxy/haproxy.pid)

And then a supervisord configuration that would just use this as the command and USR2 as the stopsignal. This is really bad though because it means there will be a dead period where no HAProxy is listening for connections between when supervisord stops the old and HAProxy and starts the new one.

I'm pulling in HAProxy 1.6 to see if they've made any improvements to the haproxy-systemd-wrapper, but given my reading of the source code the haproxy-systemd-wrapper I'm not hopeful.

@Jaykah I really think that this is going to work better running on bare metal... but it would be nice to support a docker deployment so I'll keep poking.

jolynch · 2016-01-18T21:21:46Z

Alright @Jaykah, I've got a working Docker setup with latest synapse, haproxy 1.6 and ruby 2.2 over at https://gist.github.com/jolynch/54db6eb64a8951450d6e. The TLDR is: use the systemd wrapper, have synapse send signals to the systemd wrapper (not to haproxy itself) and have supervisord always autostart the wrapper and try to start it before synapse (there is a bit of a chicken and an egg problem because synapse has to generate the config before supervisord can start it properly).

But yea, I really wouldn't encourage someone doing this in production because:

Your service data plane (haproxy) is tied to this docker image's uptime. This means that you can't restart synapse with new configuration or reload docker without dropping traffic.
HAProxy really plays a lot better with real init systems, it is designed for bare metal (and it's packaged on basically every platform so this isn't a huge deal imo...)
I don't think that Docker actually buys you that much over building .deb/.rpm/w.e. packages your distro provides. I think the only dependency that's tricky is ruby, and that's really not that tricky.

Jaykah closed this as completed Nov 10, 2014

Jaykah reopened this Nov 16, 2014

Jaykah mentioned this issue Nov 16, 2014

HAProxy configuration is not being updated #99

Closed

Jaykah mentioned this issue Apr 2, 2015

Allow registrations to be manifested on the file system and add 'use_previous_backends' option #115

Closed

evie404 mentioned this issue Sep 1, 2015

Run tests against Ruby 2.2 zk-ruby/zk#89

Merged

jolynch mentioned this issue Sep 3, 2015

Do not unregister the callback in the zk watcher #138

Merged

che mentioned this issue Dec 11, 2015

Reload if HAProxy config has been changed #150

Closed

Synapse fails to restart after zookeeper node disappears #94

Synapse fails to restart after zookeeper node disappears #94

Comments

Jaykah commented Oct 9, 2014

Jaykah commented Oct 16, 2014

Jaykah commented Nov 6, 2014

Jaykah commented Nov 7, 2014

Jaykah commented Nov 10, 2014

Jaykah commented Nov 16, 2014

hankya commented Aug 29, 2015

jolynch commented Aug 29, 2015

Jaykah commented Aug 31, 2015

igor47 commented Aug 31, 2015

jolynch commented Aug 31, 2015

Jaykah commented Aug 31, 2015

jolynch commented Aug 31, 2015

jolynch commented Aug 31, 2015

igor47 commented Aug 31, 2015

evie404 commented Sep 1, 2015

c4urself commented Sep 10, 2015

jolynch commented Sep 10, 2015

c4urself commented Sep 10, 2015

jolynch commented Sep 10, 2015

igor47 commented Sep 11, 2015

jolynch commented Oct 4, 2015

Jaykah commented Oct 12, 2015

jolynch commented Oct 12, 2015

Jaykah commented Oct 12, 2015

igor47 commented Oct 12, 2015

jolynch commented Oct 12, 2015

jolynch commented Oct 12, 2015

Jaykah commented Oct 12, 2015

Jaykah commented Oct 22, 2015

jolynch commented Oct 24, 2015

Jaykah commented Oct 29, 2015

jolynch commented Nov 3, 2015

Jaykah commented Nov 4, 2015

Jaykah commented Nov 11, 2015

Jaykah commented Nov 13, 2015

Jaykah commented Nov 18, 2015

jolynch commented Nov 18, 2015

jolynch commented Jan 17, 2016

jolynch commented Jan 18, 2016

jolynch commented Jan 18, 2016