Dispersed volume becomes near unresponsive if brick replaced and healing is ongoing with performance.iot-pass-through enabled #4395

edrock200 opened this issue Jul 29, 2024
Description of problem:

Gluster 11.1
Dispersed 4+1, 200 bricks across 20 nodes

I've run into a rather odd scenario. I have a dispersed cluster with global threading enabled and performance.iot-pass-through enabled, as recommended in the documentation. A drive/brick went bad, so I followed the reset-brick process to replace it and allow it to heal.
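For context, the replacement was roughly the following sequence (volume name, host, and brick path here are placeholders, not my actual layout):

gluster volume reset-brick myvol node01:/bricks/b17 start
# swap the failed drive, recreate the filesystem, and remount it at the same path, then:
gluster volume reset-brick myvol node01:/bricks/b17 node01:/bricks/b17 commit force
# healing of the replaced brick then kicks off; progress can be watched with:
gluster volume heal myvol info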

After the brick was replaced and healing began, CPU utilization and throughput on all of the SAN nodes dropped dramatically, even on nodes hosting subvolumes unrelated to the one with the replaced/healing drive, and the client mounts became pretty much unresponsive. All but two clients are mounted read-only. As a test, I blocked the SAN node performing the drive rebuild from all the read-only clients, but did not block the two clients with rw access, and everything became responsive again. On further troubleshooting, I disabled performance.iot-pass-through, unblocked that SAN node from the read-only clients, and things remained responsive. To test the theory, I re-enabled performance.iot-pass-through and everything became unresponsive again; disabled it, and everything came back to life.

It seems odd that replacing one of 200 bricks would have such a significant impact on performance with this setting enabled.

The exact command to reproduce the issue:
Enable global threading
Enable performance.iot-pass-through
Replace a brick via the reset-brick procedure and let healing start
See performance degradation
Disable performance.iot-pass-through
See performance improvement
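As commands, the reproduction looks roughly like this (volume/brick names are placeholders):

gluster volume set myvol config.global-threading on
gluster volume set myvol performance.iot-pass-through on
gluster volume reset-brick myvol node01:/bricks/b17 start
# replace the drive, then:
gluster volume reset-brick myvol node01:/bricks/b17 node01:/bricks/b17 commit force
# while healing runs, client mounts become nearly unresponsive; then:
gluster volume set myvol performance.iot-pass-through off
# responsiveness returns shortly after the option is disabled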

Hosts are Ubuntu 22.04
Gluster 11.1

Options Reconfigured:
disperse.shd-max-threads: 4
disperse.background-heals: 4
performance.read-ahead-page-count: 4
config.brick-threads: 16
config.client-threads: 16
cluster.rmdir-optimize: off
performance.readdir-ahead-pass-through: off
dht.force-readdirp: true
disperse.eager-lock: on
performance.least-prio-threads: 2
server.event-threads: 8
client.event-threads: 8
server.outstanding-rpc-limit: 128
performance.md-cache-timeout: 600
performance.md-cache-statfs: on
performance.iot-cleanup-disconnected-reqs: off
cluster.background-self-heal-count: 4
performance.read-ahead-pass-through: disable
performance.write-behind-pass-through: disable
performance.open-behind-pass-through: disable
performance.nl-cache-pass-through: disable
performance.io-cache-pass-through: disable
performance.md-cache-pass-through: disable
performance.quick-read-cache-size: 256MB
cluster.rebal-throttle: aggressive
features.scrub-freq: monthly
features.scrub-throttle: normal
features.scrub: Active
features.bitrot: on
performance.quick-read: off
performance.open-behind: on
performance.write-behind: on
performance.io-cache: on
performance.write-behind-window-size: 128MB
performance.rda-cache-limit: 1GB
performance.cache-max-file-size: 7GB
performance.cache-size: 8GB
performance.nl-cache-timeout: 600
performance.nl-cache: off
performance.parallel-readdir: enable
performance.cache-invalidation: true
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.client-io-threads: on
cluster.lookup-optimize: on
performance.flush-behind: on
performance.read-ahead: off
cluster.lookup-unhashed: off
cluster.weighted-rebalance: off
performance.readdir-ahead: on
cluster.readdir-optimize: off
cluster.min-free-disk: 5%
network.compression.mem-level: -1
network.compression: off
storage.build-pgfid: on
config.global-threading: on
performance.iot-pass-through: disable
cluster.force-migration: disable
cluster.disperse-self-heal-daemon: enable
performance.cache-refresh-timeout: 60
performance.enable-least-priority: on
locks.trace: off
storage.linux-io_uring: on
server.tcp-user-timeout: 60
performance.stat-prefetch: off
performance.xattr-cache-list: *
performance.cache-capability-xattrs: on
performance.quick-read-cache-invalidation: on
performance.cache-samba-metadata: on
server.manage-gids: off
performance.nl-cache-positive-entry: disable
client.send-gids: off
features.acl: off
disperse.read-policy: gfid-hash
disperse.stripe-cache: 10
cluster.brick-multiplex: off
cluster.enable-shared-storage: disable
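
The current value of any individual option can be confirmed with gluster volume get (volume name is again a placeholder):

gluster volume get myvol performance.iot-pass-through
gluster volume get myvol config.global-threading
# or list every option, including defaults:
gluster volume get myvol all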
