# ungleich Matrix Setup anaylsis (2021-07-16) ## Findings * Postgres is too slow * Redis has timeouts that it should not have (30s) * Login does not work ## Problems / Situation * Cannot send messages * Cannot login anymore Situation: * Added new workers to federation, sync, client before * Added new path to nginx for client workers * However login did not work before either * Own login worker gets gateway timeout * Checking if the ldap server gets a request: seems not to be the case! * Listening on wrong source IP! * Signin directly causes ldap traffic -> good! * However still getting timeout * Main worker is running at 100% cpu * Might be the reason... * Replication timeout messages in the log might be the main reason * * Incoming messages from federation seem to work / arrive! * Browsing rooms from outside seems to be impossible * Federation workers have high load -> let's disable federation for the moment * redis responses not reached ## Steps * Read up about *how it should be* * Then compare *how it is* * Verify the nginx configuration * Verify the worker configuration * Verify that all needed workers are up and running * I think we were missing the persister before ? * But there is no configuration for it ## Should be state definition (based on research) * 1 main worker running on port 8008 * According to https://matrix.org/blog/2020/11/03/how-we-fixed-synapses-scalability * One or more event persisters? * Redis is up and running ## *IS* state research (what is on our system) * **NO PASSWORD** for worker_replication_secret * Redis is up and running * Main process * 1 main worker running on port 9093 with [replication] * 1 main worker running on port 8008 with [federation,client,metrics] * Workers * background, metrics, port 8086 * run_background_tasks_on: background in the homeserver.yaml * Jul 16 02:10:07 ungleich-v6 matrix-synapse-background[2322]: Config is missing macaroon_secret_key * Jul 16 02:10:08 ungleich-v6 matrix-synapse-background[2322]: 2021-07-16 02:10:08,820 - synapse.storage.engines.postgres - 69 - WARNING - main- Database has incorrect collation of 'C.UTF-8'. Should be 'C' * client1: metrics+client port 8084 * client2: metrics+client port 8092 * event1: metrics port 8082 * federation1: metrics+federation 8085 * federation2: metrics+federation 8087 * federation3: metrics+federation 8095 * federation4: metrics+federation 8096 * sync1: metrics+client port 8083 * sync2: metrics+client port 8093 * sync3: metrics+client port 8094 * **NEW WORKERS added 2021-07-16** * login1: metrics+client port 8097 * Other synapse settings * pushing is still in the main process * Persisting seems to be in the main process * No MEDIA worker (could be very helpful!) synapse.app.media_repository * nginx upstreams * inbound_federation: 8085, 8087, 8095, 8096: ok * sync_workers: 8083, 8093, 8094: ok * federation: 8085, 8087, 8095, 8096: ok * client_workers: 8084, 8092: ok * **event_sending: 8082: UNCLEAR** * nginx config / paths * /sync and /initialSync for same user * /sync with ?since can go to different instances * Isolating the login syncs vs. regular syncs * Federation and client requests can be balanced via simple round robin. * ^/_matrix/federation/v1/send/ (INBOUND federation ) should be balanced by source IP * Registration/login requests can be handled separately purely to help ensure that unexpected load doesn't affect new logins and sign ups. * event sending requests can be balanced by the room ID in the URI - the room ID is the path component after /rooms/ ## Should vs. IS table | Should | IS | Notes | |---|---|---| | main worker port: replication | yes | | main worker NO other ports | NO | Also listening on 8008 | | workers should be on ::1 | NO | Are on :: for, allowing world access to /metrics | ## Open todos * background: Fix Jul 16 02:10:08 ungleich-v6 matrix-synapse-background[2322]: 2021-07-16 02:10:08,820 - synapse.storage.engines.postgres - 69 - WARNING - main- Database has incorrect collation of 'C.UTF-8'. Should be 'C' * background: Fix Config is missing macaroon_secret_key ## Actions taken 2021-07-16 * Remove port 8008 from main worker, restart (0946) * Result: messages can still not be sent from our instance ``` [10:00:04] ungleich-v6:/etc/nginx/sites-enabled# mkdir ~/backup-2021-07-16 [10:00:10] ungleich-v6:/etc/nginx/sites-enabled# cp https-ungleich.matrix.ungleich.cloud ~/backup-2021-07-16/ ``` * Adding login worker ``` [10:29:12] ungleich-v6:/etc/matrix-synapse/workers# vi login.yaml (reverse-i-search)`': ^C [10:31:52] ungleich-v6:/etc/matrix-synapse/workers# systemctl enable matrix-synapse-worker@login1.service Created symlink /etc/systemd/system/matrix-synapse.target.wants/matrix-synapse-worker@login1.service _ /etc/systemd/system/matrix-synapse-worker@.service. [10:38:03] ungleich-v6:/etc/matrix-synapse/workers# mv login.yaml login1.yaml [10:38:09] ungleich-v6:/etc/matrix-synapse/workers# systemctl start matrix-synapse-worker@login1.service ``` * Installing ldap-utils to check on ldap connection ## System state definition Before: ``` [09:45:08] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# ls matrix-synapse-worker@background.service matrix-synapse-worker@federation_sender-01.service matrix-synapse-worker@client1.service matrix-synapse-worker@generic-01.service matrix-synapse-worker@client2.service matrix-synapse-worker@generic-02.service matrix-synapse-worker@event1.service matrix-synapse-worker@generic-03.service matrix-synapse-worker@event_persister-01.service matrix-synapse-worker@generic-04.service matrix-synapse-worker@federation1.service matrix-synapse-worker@sync1.service matrix-synapse-worker@federation2.service matrix-synapse-worker@sync2.service matrix-synapse-worker@federation3.service matrix-synapse-worker@sync3.service matrix-synapse-worker@federation4.service matrix-synapse-worker@synchrotron-01.service ``` * Stopping federation workers to reduce load ``` [10:30:23] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# systemctl stop matrix-synapse-worker@federation1.service [11:00:02] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# systemctl stop matrix-synapse-worker@federation2.service [11:00:08] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# systemctl stop matrix-synapse-worker@federation3.service [11:00:29] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# systemctl stop matrix-synapse-worker@federation4.service [11:00:36] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# ``` * Restarting redis ``` txredisapi.TimeoutError: Not received Redis response in 30 seconds 2021-07-16 11:06:09,494 - synapse.metrics.background_process_metrics - 218 - ERROR - send-cmd-56- Background process 'send-cmd' threw an excep tion Traceback (most recent call last): File "/usr/lib/python3/dist-packages/synapse/metrics/background_process_metrics.py", line 214, in run return await maybe_awaitable(func(*args, **kwargs)) File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command self.synapse_stream_name, encoded_string File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply raise r txredisapi.TimeoutError: Not received Redis response in 30 seconds 2021-07-16 11:06:09,500 - synapse.metrics.background_process_metrics - 218 - ERROR - send-cmd-57- Background process 'send-cmd' threw an exce$ tion ``` * Disabling federation using an empty list at 1111 * Returning 503 on inbound federation at 1119 via nginx * On two paths * Splitting worker log of main process log ``` [11:25:22] ungleich-v6:/etc/matrix-synapse/workers# sed -i "s,/etc/matrix-synapse/log.yaml,/etc/matrix-synapse/log-worker.yaml," *.yaml [11:25:52] ungleich-v6:/etc/matrix-synapse/workers# mv ../worker_log.yaml ../log-worker.yaml [11:26:49] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# for a in matrix-synapse-worker@background.service matrix-synapse-worker@client1.service matrix-synapse-worker@client2.service matrix-synapse-worker@login1.service matrix-synapse-worker@sync1.service ; do systemctl restart $a; done [11:29:12] ungleich-v6:/etc/matrix-synapse# chmod a+r log-worker.yaml ``` * disable all workers EXCLUDING the login workers @ 1153 * Setup is 1 main process + 1 worker * Still same 504 timeout on login ``` [11:53:22] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# ps aux | grep login1 matrix-+ 15438 4.9 0.0 642920 99944 ? Ssl 11:30 1:08 /usr/bin/python3 -m synapse.app.generic_worker --config-path=/etc/matrix-synapse/homeserver.yaml --config-path=/etc/matrix-synapse/conf.d/ --config-path=/etc/matrix-synapse/workers/login1.yaml root 16998 0.0 0.0 3084 892 pts/2 S+ 11:53 0:00 grep login1 [11:53:25] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# ``` * switching all traffic BESIDES login to the main worker @ 1156 * Port 8008 re-enabled * Nginx reconfigured to move everything there (besides login parts) * filtering out metrics support to calm log down @ 1204 * setting in nginx ``` ``` * Disabling login worker -> everything back to main worker @ 1209 * It might "just" be a replication issue * Matrix sending messages WORKS, but it is extremely slow. * Main process is running at 100% cpu * Many replication failures ``` 2021-07-16 12:05:28,355 - synapse.replication.tcp.handler - 670 - WARNING - replication_notifier-14- Dropping command as not connected: 'RDATA' ``` * Restarting federation workers (4x), re-enabling them in nginx, re-enabling federation -- NOT DONE ``` ``` * Shutting down matrix to trigger state compression * Room selection already takes quite some time in postgres ``` [12:20:37] ungleich-v6:~# mkdir state-compressiontfiles [12:20:42] ungleich-v6:~# systemctl stop matrix-synapse.service [12:22:10] ungleich-v6:/usr/local/bin# sudo -u matrix-synapse bash sudo: unable to resolve host ungleich-v6: Name or service not known matrix-synapse@ungleich-v6:/usr/local/bin$ pwd /usr/local/bin matrix-synapse@ungleich-v6:/usr/local/bin$ cd /tmp/ matrix-synapse@ungleich-v6:/tmp$ mkdir state-compression matrix-synapse@ungleich-v6:/tmp$ matrix-synapse@ungleich-v6:/tmp/state-compression$ /usr/local/bin/compress-rooms.sh [12:23:12] ungleich-v6:~# df -h /tmp/ Filesystem Size Used Avail Use% Mounted on /dev/vda1 247G 108G 139G 44% / [12:23:15] ungleich-v6:~# ``` * Deleting oldvar directory ``` [12:25:16] ungleich-v6:/var/lib/matrix-synapse# rm -rf oldhddvar/ & [1] 18293 ``` * Creating a data base backup @ 1227 ``` [12:27:08] ungleich-v6:/var/lib/matrix-synapse# mkdir backup-2021-07-16 [12:27:14] ungleich-v6:/var/lib/matrix-synapse# chown postgres backup-2021-07-16/ [12:27:45] ungleich-v6:~$ pg_dump matrix-synapse | xz > /var/lib/matrix-synapse/backup-2021-07-16/pgdump.xz ``` * Upgrading matrix-synapse to 1.38.0 via Debian SID repo ``` [12:44:08] ungleich-v6:/etc/apt# apt install -t sid matrix-synapse Reading package lists... Done Building dependency tree Reading state information... Done The following packages were automatically installed and are no longer required: irqbalance libbind9-140 libdns162 libisc160 libisccc140 libisccfg140 liblwres141 libnuma1 libruby2.3 libxerces-c3.1 libxml-security-c17v5 python3-daemonize ruby2.3 runit-helper Use 'apt autoremove' to remove them. Suggested packages: python3-authlib Recommended packages: python3-pympler The following packages will be upgraded: matrix-synapse 1 upgraded, 0 newly installed, 0 to remove and 560 not upgraded. Need to get 1550 kB of archives. After this operation, 97.3 kB of additional disk space will be used. Get:1 http://deb.debian.org/debian sid/main amd64 matrix-synapse all 1.38.0-1 [1550 kB] Fetched 1550 kB in 0s (4760 kB/s) Preconfiguring packages ... (Reading database ... 64918 files and directories currently installed.) Preparing to unpack .../matrix-synapse_1.38.0-1_all.deb ... Unpacking matrix-synapse (1.38.0-1) over (1.37.1-1~fto10+1) ... Setting up matrix-synapse (1.38.0-1) ... Processing triggers for systemd (241-7~deb10u7) ... # Disabling repo afterwards [12:46:28] ungleich-v6:/etc/apt# cat sources.list deb http://deb.debian.org/debian/ buster main deb-src http://deb.debian.org/debian/ buster main deb http://deb.debian.org/debian/ buster-updates main deb-src http://deb.debian.org/debian/ buster-updates main deb http://security.debian.org/debian-security buster/updates main deb-src http://security.debian.org/debian-security buster/updates main deb https://fasttrack.debian.net/debian/ buster-fasttrack main contrib deb https://fasttrack.debian.net/debian/ buster-backports main contrib #deb http://deb.debian.org/debian/ sid main ``` * Aborted state compression script due to high load @ 1247 * Stopping background worker @ 1258 * To phase in work into main worker * Commented out "run_background_tasks_on: background_worker" in homeserver.yaml ``` [12:58:00] ungleich-v6:/etc/matrix-synapse# vi homeserver.yaml [12:58:40] ungleich-v6:/etc/matrix-synapse# systemctl restart matrix-synapse.service ``` * No listener on port 8008 or 9093 * Raising log level to DEBUG in log.yaml * Aborted pg_dump @ 1305 * Synapse starts! * Removing DEBUG log, changing back to INFO * Trying to sign in @ 1310 * Still getting a 504 / timeout, even on main worker * Many redis timeout messages * No more redis warnings after tuning sysctl and hugepages ``` 22206:M 16 Jul 2021 13:01:16.214 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128. 22206:M 16 Jul 2021 13:01:16.215 # Server initialized 22206:M 16 Jul 2021 13:01:16.215 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect. # fixing [13:13:19] ungleich-v6:~# sysctl -w net.core.somaxconn=8192 net.core.somaxconn = 8192 [13:13:22] ungleich-v6:~# sysctl vm.overcommit_memory=1 vm.overcommit_memory = 1 [13:14:05] ungleich-v6:~# echo never > /sys/kernel/mm/transparent_hugepage/enabled ``` * Removing redis support @ 1320 * Still timeouts, even after redis tuning * Disabling metrics support @ 1328 To ensure metrics are not the core problem ``` txredisapi.TimeoutError: Not received Redis response in 30 seconds 2021-07-16 13:18:56,159 - synapse.metrics.background_process_metrics - 218 - ERROR - send-cmd-132- Background process 'send-cmd' threw an exce ption Traceback (most recent call last): File "/usr/lib/python3/dist-packages/synapse/metrics/background_process_metrics.py", line 214, in run return await maybe_awaitable(func(*args, **kwargs)) File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command self.synapse_stream_name, encoded_string File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply raise r txredisapi.TimeoutError: Not received Redis response in 30 seconds ``` * Re-enabling redis @1334 * To be potentially able to re-add workers * Getting timeouts again! ``` File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply txredisapi.TimeoutError: Not received Redis response in 30 seconds File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply txredisapi.TimeoutError: Not received Redis response in 30 seconds File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply txredisapi.TimeoutError: Not received Redis response in 30 seconds File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply txredisapi.TimeoutError: Not received Redis response in 30 seconds File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply txredisapi.TimeoutError: Not received Redis response in 30 seconds File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply txredisapi.TimeoutError: Not received Redis response in 30 seconds File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply txredisapi.TimeoutError: Not received Redis response in 30 seconds File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply txredisapi.TimeoutError: Not received Redis response in 30 seconds File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply txredisapi.TimeoutError: Not received Redis response in 30 seconds 2021-07-16 13:39:00,270 - synapse.replication.tcp.redis - 259 - WARNING - redis_ping-3- Failed to send ping to a redis connection 2021-07-16 13:39:11,476 - synapse.replication.tcp.redis - 282 - INFO - sentinel- Connection to redis server b'localhost':6379 lost: Connection was aborted locally, using. 2021-07-16 13:39:11,478 - synapse.replication.tcp.redis - 181 - INFO - sentinel- Lost connection to redis 2021-07-16 13:39:11,556 - synapse.replication.tcp.redis - 282 - INFO - GET-174- Connection to redis server b'localhost':6379 lost: Connection was aborted locally, using. ``` * Removing the "activerehasing" config in redis.conf @ 1341 * It is causing additional latency, up to 2ms * Still getting timeouts ``` 2021-07-16 13:43:00,210 - synapse.replication.tcp.redis - 266 - INFO - federation_transaction_transmission_loop-3969- Connecting to redis server b'localhost':6379 2021-07-16 13:43:00,251 - synapse.replication.tcp.redis - 274 - INFO - federation_transaction_transmission_loop-3969- Connection to redis server b'localhost':6379 failed: User timeout caused connection failure. ``` * Re-enabling backgronud task @ 1351 * enabling "run_background_tasks_on: background_worker" in homeserver.yaml ``` [13:50:50] ungleich-v6:/etc/matrix-synapse# systemctl restart matrix-synapse-worker@background.service [13:51:19] ungleich-v6:/etc/matrix-synapse# vi homeserver.yaml [13:51:40] ungleich-v6:/etc/matrix-synapse# systemctl restart matrix-synapse ``` * Re-configure by cdist it still can't send a message * Reboot & restart synapse and workers it still can't send a message ... Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: return func(*args,**kw) Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: File "/usr/lib/python3/dist-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: conn.rollback() Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: --- <exception caught here> --- Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: File "/usr/lib/python3/dist-packages/twisted/enterprise/adbapi.py", line 56, in rollback Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: self._connection.rollback() Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: psycopg2.InterfaceError: connection already closed ... ... Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: 2021-07-16 16:49:49,211 - synapse.replication.tcp.handler - 276 - ERROR - process-replication-data-6779- Failed to handle command <synapse.replication.tcp.commands.RdataCommand object at 0x7f63fc3f59b0> Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: Traceback (most recent call last): Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: File "/usr/lib/python3/dist-packages/synapse/storage/database.py", line 549, in new_transaction Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: r = func(cursor, *args, **kwargs) Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: File "/usr/lib/python3/dist-packages/synapse/storage/databases/main/roommember.py", line 484, in _get_rooms_for_user_with_stream_ordering_txn Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: txn.execute(sql, (user_id, Membership.JOIN)) Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: File "/usr/lib/python3/dist-packages/synapse/storage/database.py", line 297, in execute Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: self._do_execute(self.txn.execute, sql, *args) Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: File "/usr/lib/python3/dist-packages/synapse/storage/database.py", line 330, in _do_execute Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: return func(sql, *args) Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: psycopg2.OperationalError: server closed the connection unexpectedly ... ... ul 16 20:13:31 ungleich-v6 python3[23875]: File "/usr/lib/python3/dist-packages/psycopg2/extras.py", line 1209, in execute_batch Jul 16 20:13:31 ungleich-v6 python3[23875]: cur.execute(b";".join(sqls)) Jul 16 20:13:31 ungleich-v6 python3[23875]: psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "e2e_one_time_keys_json_uniqueness" Jul 16 20:13:31 ungleich-v6 python3[23875]: DETAIL: Key (user_id, device_id, algorithm, key_id)=(@nico:ungleich.ch, DJALANLHYA, signed_curve25519, AAAGBw) already exists. ... * Remove worker message sending success * Re-configure worker by cdist message sending success ## Failing URLS * https://ungleich.matrix.ungleich.cloud/_matrix/client/r0/rooms/!ayZCWJehJJJpcWvQxS%3Aungleich.ch/read_markers ## Related documents in the web * https://matrix.org/blog/2020/11/03/how-we-fixed-synapses-scalability * https://github.com/matrix-org/synapse/blob/develop/docs/workers.md * https://github.com/matrix-org/synapse/blob/develop/docs/reverse_proxy.md