# ungleich Matrix Setup anaylsis (2021-07-16)
## Findings
* Postgres is too slow
* Redis has timeouts that it should not have (30s)
* Login does not work
## Problems / Situation
* Cannot send messages
* Cannot login anymore
Situation:
* Added new workers to federation, sync, client before
* Added new path to nginx for client workers
* However login did not work before either
* Own login worker gets gateway timeout
* Checking if the ldap server gets a request: seems not to be the case!
* Listening on wrong source IP!
* Signin directly causes ldap traffic -> good!
* However still getting timeout
* Main worker is running at 100% cpu
* Might be the reason...
* Replication timeout messages in the log might be the main reason
*
* Incoming messages from federation seem to work / arrive!
* Browsing rooms from outside seems to be impossible
* Federation workers have high load -> let's disable federation for the moment
* redis responses not reached
## Steps
* Read up about *how it should be*
* Then compare *how it is*
* Verify the nginx configuration
* Verify the worker configuration
* Verify that all needed workers are up and running
* I think we were missing the persister before ?
* But there is no configuration for it
## Should be state definition (based on research)
* 1 main worker running on port 8008
* According to https://matrix.org/blog/2020/11/03/how-we-fixed-synapses-scalability
* One or more event persisters?
* Redis is up and running
## *IS* state research (what is on our system)
* **NO PASSWORD** for worker_replication_secret
* Redis is up and running
* Main process
* 1 main worker running on port 9093 with [replication]
* 1 main worker running on port 8008 with [federation,client,metrics]
* Workers
* background, metrics, port 8086
* run_background_tasks_on: background in the homeserver.yaml
* Jul 16 02:10:07 ungleich-v6 matrix-synapse-background[2322]: Config is missing macaroon_secret_key
* Jul 16 02:10:08 ungleich-v6 matrix-synapse-background[2322]: 2021-07-16 02:10:08,820 - synapse.storage.engines.postgres - 69 - WARNING - main- Database has incorrect collation of 'C.UTF-8'. Should be 'C'
* client1: metrics+client port 8084
* client2: metrics+client port 8092
* event1: metrics port 8082
* federation1: metrics+federation 8085
* federation2: metrics+federation 8087
* federation3: metrics+federation 8095
* federation4: metrics+federation 8096
* sync1: metrics+client port 8083
* sync2: metrics+client port 8093
* sync3: metrics+client port 8094
* **NEW WORKERS added 2021-07-16**
* login1: metrics+client port 8097
* Other synapse settings
* pushing is still in the main process
* Persisting seems to be in the main process
* No MEDIA worker (could be very helpful!) synapse.app.media_repository
* nginx upstreams
* inbound_federation: 8085, 8087, 8095, 8096: ok
* sync_workers: 8083, 8093, 8094: ok
* federation: 8085, 8087, 8095, 8096: ok
* client_workers: 8084, 8092: ok
* **event_sending: 8082: UNCLEAR**
* nginx config / paths
* /sync and /initialSync for same user
* /sync with ?since can go to different instances
* Isolating the login syncs vs. regular syncs
* Federation and client requests can be balanced via simple round robin.
* ^/_matrix/federation/v1/send/ (INBOUND federation ) should be balanced by source IP
* Registration/login requests can be handled separately purely to help ensure that unexpected load doesn't affect new logins and sign ups.
* event sending requests can be balanced by the room ID in the URI - the room ID is the path component after /rooms/
## Should vs. IS table
| Should | IS | Notes |
|---|---|---|
| main worker port: replication | yes |
| main worker NO other ports | NO | Also listening on 8008 |
| workers should be on ::1 | NO | Are on :: for, allowing world access to /metrics |
## Open todos
* background: Fix Jul 16 02:10:08 ungleich-v6 matrix-synapse-background[2322]: 2021-07-16 02:10:08,820 - synapse.storage.engines.postgres - 69 - WARNING - main- Database has incorrect collation of 'C.UTF-8'. Should be 'C'
* background: Fix Config is missing macaroon_secret_key
## Actions taken 2021-07-16
* Remove port 8008 from main worker, restart (0946)
* Result: messages can still not be sent from our instance
```
[10:00:04] ungleich-v6:/etc/nginx/sites-enabled# mkdir ~/backup-2021-07-16
[10:00:10] ungleich-v6:/etc/nginx/sites-enabled# cp https-ungleich.matrix.ungleich.cloud ~/backup-2021-07-16/
```
* Adding login worker
```
[10:29:12] ungleich-v6:/etc/matrix-synapse/workers# vi login.yaml
(reverse-i-search)`': ^C
[10:31:52] ungleich-v6:/etc/matrix-synapse/workers# systemctl enable matrix-synapse-worker@login1.service
Created symlink /etc/systemd/system/matrix-synapse.target.wants/matrix-synapse-worker@login1.service _ /etc/systemd/system/matrix-synapse-worker@.service.
[10:38:03] ungleich-v6:/etc/matrix-synapse/workers# mv login.yaml login1.yaml
[10:38:09] ungleich-v6:/etc/matrix-synapse/workers# systemctl start matrix-synapse-worker@login1.service
```
* Installing ldap-utils to check on ldap connection
## System state definition
Before:
```
[09:45:08] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# ls
matrix-synapse-worker@background.service matrix-synapse-worker@federation_sender-01.service
matrix-synapse-worker@client1.service matrix-synapse-worker@generic-01.service
matrix-synapse-worker@client2.service matrix-synapse-worker@generic-02.service
matrix-synapse-worker@event1.service matrix-synapse-worker@generic-03.service
matrix-synapse-worker@event_persister-01.service matrix-synapse-worker@generic-04.service
matrix-synapse-worker@federation1.service matrix-synapse-worker@sync1.service
matrix-synapse-worker@federation2.service matrix-synapse-worker@sync2.service
matrix-synapse-worker@federation3.service matrix-synapse-worker@sync3.service
matrix-synapse-worker@federation4.service matrix-synapse-worker@synchrotron-01.service
```
* Stopping federation workers to reduce load
```
[10:30:23] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# systemctl stop matrix-synapse-worker@federation1.service
[11:00:02] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# systemctl stop matrix-synapse-worker@federation2.service
[11:00:08] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# systemctl stop matrix-synapse-worker@federation3.service
[11:00:29] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# systemctl stop matrix-synapse-worker@federation4.service
[11:00:36] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants#
```
* Restarting redis
```
txredisapi.TimeoutError: Not received Redis response in 30 seconds
2021-07-16 11:06:09,494 - synapse.metrics.background_process_metrics - 218 - ERROR - send-cmd-56- Background process 'send-cmd' threw an excep
tion Traceback (most recent call last): File "/usr/lib/python3/dist-packages/synapse/metrics/background_process_metrics.py", line 214, in run return await maybe_awaitable(func(*args, **kwargs))
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command self.synapse_stream_name, encoded_string File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply raise r txredisapi.TimeoutError: Not received Redis response in 30 seconds 2021-07-16 11:06:09,500 - synapse.metrics.background_process_metrics - 218 - ERROR - send-cmd-57- Background process 'send-cmd' threw an exce$
tion
```
* Disabling federation using an empty list at 1111
* Returning 503 on inbound federation at 1119 via nginx
* On two paths
* Splitting worker log of main process log
```
[11:25:22] ungleich-v6:/etc/matrix-synapse/workers# sed -i "s,/etc/matrix-synapse/log.yaml,/etc/matrix-synapse/log-worker.yaml," *.yaml
[11:25:52] ungleich-v6:/etc/matrix-synapse/workers# mv ../worker_log.yaml ../log-worker.yaml
[11:26:49] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# for a in matrix-synapse-worker@background.service matrix-synapse-worker@client1.service matrix-synapse-worker@client2.service matrix-synapse-worker@login1.service matrix-synapse-worker@sync1.service ; do systemctl restart $a; done
[11:29:12] ungleich-v6:/etc/matrix-synapse# chmod a+r log-worker.yaml
```
* disable all workers EXCLUDING the login workers @ 1153
* Setup is 1 main process + 1 worker
* Still same 504 timeout on login
```
[11:53:22] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants# ps aux | grep login1
matrix-+ 15438 4.9 0.0 642920 99944 ? Ssl 11:30 1:08 /usr/bin/python3 -m synapse.app.generic_worker --config-path=/etc/matrix-synapse/homeserver.yaml --config-path=/etc/matrix-synapse/conf.d/ --config-path=/etc/matrix-synapse/workers/login1.yaml
root 16998 0.0 0.0 3084 892 pts/2 S+ 11:53 0:00 grep login1
[11:53:25] ungleich-v6:/etc/systemd/system/matrix-synapse.target.wants#
```
* switching all traffic BESIDES login to the main worker @ 1156
* Port 8008 re-enabled
* Nginx reconfigured to move everything there (besides login parts)
* filtering out metrics support to calm log down @ 1204
* setting in nginx
```
```
* Disabling login worker -> everything back to main worker @ 1209
* It might "just" be a replication issue
* Matrix sending messages WORKS, but it is extremely slow.
* Main process is running at 100% cpu
* Many replication failures
```
2021-07-16 12:05:28,355 - synapse.replication.tcp.handler - 670 - WARNING - replication_notifier-14- Dropping command as not connected: 'RDATA'
```
* Restarting federation workers (4x), re-enabling them in nginx, re-enabling federation -- NOT DONE
```
```
* Shutting down matrix to trigger state compression
* Room selection already takes quite some time in postgres
```
[12:20:37] ungleich-v6:~# mkdir state-compressiontfiles
[12:20:42] ungleich-v6:~# systemctl stop matrix-synapse.service
[12:22:10] ungleich-v6:/usr/local/bin# sudo -u matrix-synapse bash
sudo: unable to resolve host ungleich-v6: Name or service not known
matrix-synapse@ungleich-v6:/usr/local/bin$ pwd
/usr/local/bin
matrix-synapse@ungleich-v6:/usr/local/bin$ cd /tmp/
matrix-synapse@ungleich-v6:/tmp$ mkdir state-compression
matrix-synapse@ungleich-v6:/tmp$
matrix-synapse@ungleich-v6:/tmp/state-compression$ /usr/local/bin/compress-rooms.sh
[12:23:12] ungleich-v6:~# df -h /tmp/
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 247G 108G 139G 44% /
[12:23:15] ungleich-v6:~#
```
* Deleting oldvar directory
```
[12:25:16] ungleich-v6:/var/lib/matrix-synapse# rm -rf oldhddvar/ &
[1] 18293
```
* Creating a data base backup @ 1227
```
[12:27:08] ungleich-v6:/var/lib/matrix-synapse# mkdir backup-2021-07-16
[12:27:14] ungleich-v6:/var/lib/matrix-synapse# chown postgres backup-2021-07-16/
[12:27:45] ungleich-v6:~$ pg_dump matrix-synapse | xz > /var/lib/matrix-synapse/backup-2021-07-16/pgdump.xz
```
* Upgrading matrix-synapse to 1.38.0 via Debian SID repo
```
[12:44:08] ungleich-v6:/etc/apt# apt install -t sid matrix-synapse
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
irqbalance libbind9-140 libdns162 libisc160 libisccc140 libisccfg140 liblwres141 libnuma1 libruby2.3
libxerces-c3.1 libxml-security-c17v5 python3-daemonize ruby2.3 runit-helper
Use 'apt autoremove' to remove them.
Suggested packages:
python3-authlib
Recommended packages:
python3-pympler
The following packages will be upgraded:
matrix-synapse
1 upgraded, 0 newly installed, 0 to remove and 560 not upgraded.
Need to get 1550 kB of archives.
After this operation, 97.3 kB of additional disk space will be used.
Get:1 http://deb.debian.org/debian sid/main amd64 matrix-synapse all 1.38.0-1 [1550 kB]
Fetched 1550 kB in 0s (4760 kB/s)
Preconfiguring packages ...
(Reading database ... 64918 files and directories currently installed.)
Preparing to unpack .../matrix-synapse_1.38.0-1_all.deb ...
Unpacking matrix-synapse (1.38.0-1) over (1.37.1-1~fto10+1) ...
Setting up matrix-synapse (1.38.0-1) ...
Processing triggers for systemd (241-7~deb10u7) ...
# Disabling repo afterwards
[12:46:28] ungleich-v6:/etc/apt# cat sources.list
deb http://deb.debian.org/debian/ buster main
deb-src http://deb.debian.org/debian/ buster main
deb http://deb.debian.org/debian/ buster-updates main
deb-src http://deb.debian.org/debian/ buster-updates main
deb http://security.debian.org/debian-security buster/updates main
deb-src http://security.debian.org/debian-security buster/updates main
deb https://fasttrack.debian.net/debian/ buster-fasttrack main contrib
deb https://fasttrack.debian.net/debian/ buster-backports main contrib
#deb http://deb.debian.org/debian/ sid main
```
* Aborted state compression script due to high load @ 1247
* Stopping background worker @ 1258
* To phase in work into main worker
* Commented out "run_background_tasks_on: background_worker" in homeserver.yaml
```
[12:58:00] ungleich-v6:/etc/matrix-synapse# vi homeserver.yaml
[12:58:40] ungleich-v6:/etc/matrix-synapse# systemctl restart matrix-synapse.service
```
* No listener on port 8008 or 9093
* Raising log level to DEBUG in log.yaml
* Aborted pg_dump @ 1305
* Synapse starts!
* Removing DEBUG log, changing back to INFO
* Trying to sign in @ 1310
* Still getting a 504 / timeout, even on main worker
* Many redis timeout messages
* No more redis warnings after tuning sysctl and hugepages
```
22206:M 16 Jul 2021 13:01:16.214 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
22206:M 16 Jul 2021 13:01:16.215 # Server initialized
22206:M 16 Jul 2021 13:01:16.215 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
# fixing
[13:13:19] ungleich-v6:~# sysctl -w net.core.somaxconn=8192
net.core.somaxconn = 8192
[13:13:22] ungleich-v6:~# sysctl vm.overcommit_memory=1
vm.overcommit_memory = 1
[13:14:05] ungleich-v6:~# echo never > /sys/kernel/mm/transparent_hugepage/enabled
```
* Removing redis support @ 1320
* Still timeouts, even after redis tuning
* Disabling metrics support @ 1328
To ensure metrics are not the core problem
```
txredisapi.TimeoutError: Not received Redis response in 30 seconds
2021-07-16 13:18:56,159 - synapse.metrics.background_process_metrics - 218 - ERROR - send-cmd-132- Background process 'send-cmd' threw an exce
ption
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/synapse/metrics/background_process_metrics.py", line 214, in run
return await maybe_awaitable(func(*args, **kwargs))
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command
self.synapse_stream_name, encoded_string
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply
raise r
txredisapi.TimeoutError: Not received Redis response in 30 seconds
```
* Re-enabling redis @1334
* To be potentially able to re-add workers
* Getting timeouts again!
```
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply
txredisapi.TimeoutError: Not received Redis response in 30 seconds
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply
txredisapi.TimeoutError: Not received Redis response in 30 seconds
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply
txredisapi.TimeoutError: Not received Redis response in 30 seconds
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply
txredisapi.TimeoutError: Not received Redis response in 30 seconds
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply
txredisapi.TimeoutError: Not received Redis response in 30 seconds
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply
txredisapi.TimeoutError: Not received Redis response in 30 seconds
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply
txredisapi.TimeoutError: Not received Redis response in 30 seconds
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply
txredisapi.TimeoutError: Not received Redis response in 30 seconds
File "/usr/lib/python3/dist-packages/synapse/replication/tcp/redis.py", line 212, in _async_send_command
File "/usr/local/lib/python3.7/dist-packages/txredisapi.py", line 533, in handle_reply
txredisapi.TimeoutError: Not received Redis response in 30 seconds
2021-07-16 13:39:00,270 - synapse.replication.tcp.redis - 259 - WARNING - redis_ping-3- Failed to send ping to a redis connection
2021-07-16 13:39:11,476 - synapse.replication.tcp.redis - 282 - INFO - sentinel- Connection to redis server b'localhost':6379 lost: Connection was aborted locally, using.
2021-07-16 13:39:11,478 - synapse.replication.tcp.redis - 181 - INFO - sentinel- Lost connection to redis
2021-07-16 13:39:11,556 - synapse.replication.tcp.redis - 282 - INFO - GET-174- Connection to redis server b'localhost':6379 lost: Connection was aborted locally, using.
```
* Removing the "activerehasing" config in redis.conf @ 1341
* It is causing additional latency, up to 2ms
* Still getting timeouts
```
2021-07-16 13:43:00,210 - synapse.replication.tcp.redis - 266 - INFO - federation_transaction_transmission_loop-3969- Connecting to redis server b'localhost':6379
2021-07-16 13:43:00,251 - synapse.replication.tcp.redis - 274 - INFO - federation_transaction_transmission_loop-3969- Connection to redis server b'localhost':6379 failed: User timeout caused connection failure.
```
* Re-enabling backgronud task @ 1351
* enabling "run_background_tasks_on: background_worker" in homeserver.yaml
```
[13:50:50] ungleich-v6:/etc/matrix-synapse# systemctl restart matrix-synapse-worker@background.service
[13:51:19] ungleich-v6:/etc/matrix-synapse# vi homeserver.yaml
[13:51:40] ungleich-v6:/etc/matrix-synapse# systemctl restart matrix-synapse
```
* Re-configure by cdist
it still can't send a message
* Reboot & restart synapse and workers
it still can't send a message
...
Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: return func(*args,**kw)
Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: File "/usr/lib/python3/dist-packages/twisted/enterprise/adbapi.py", line 303, in _runWithConnection
Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: conn.rollback()
Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: --- <exception caught here> ---
Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: File "/usr/lib/python3/dist-packages/twisted/enterprise/adbapi.py", line 56, in rollback
Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: self._connection.rollback()
Jul 16 16:49:04 ungleich-v6 matrix-synapse-sync1[9275]: psycopg2.InterfaceError: connection already closed
...
...
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: 2021-07-16 16:49:49,211 - synapse.replication.tcp.handler - 276 - ERROR - process-replication-data-6779- Failed to handle command <synapse.replication.tcp.commands.RdataCommand object at 0x7f63fc3f59b0>
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: Traceback (most recent call last):
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: File "/usr/lib/python3/dist-packages/synapse/storage/database.py", line 549, in new_transaction
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: r = func(cursor, *args, **kwargs)
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: File "/usr/lib/python3/dist-packages/synapse/storage/databases/main/roommember.py", line 484, in _get_rooms_for_user_with_stream_ordering_txn
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: txn.execute(sql, (user_id, Membership.JOIN))
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: File "/usr/lib/python3/dist-packages/synapse/storage/database.py", line 297, in execute
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: self._do_execute(self.txn.execute, sql, *args)
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: File "/usr/lib/python3/dist-packages/synapse/storage/database.py", line 330, in _do_execute
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: return func(sql, *args)
Jul 16 16:49:49 ungleich-v6 matrix-synapse-event1[11971]: psycopg2.OperationalError: server closed the connection unexpectedly
...
...
ul 16 20:13:31 ungleich-v6 python3[23875]: File "/usr/lib/python3/dist-packages/psycopg2/extras.py", line 1209, in execute_batch
Jul 16 20:13:31 ungleich-v6 python3[23875]: cur.execute(b";".join(sqls))
Jul 16 20:13:31 ungleich-v6 python3[23875]: psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "e2e_one_time_keys_json_uniqueness"
Jul 16 20:13:31 ungleich-v6 python3[23875]: DETAIL: Key (user_id, device_id, algorithm, key_id)=(@nico:ungleich.ch, DJALANLHYA, signed_curve25519, AAAGBw) already exists.
...
* Remove worker
message sending success
* Re-configure worker by cdist
message sending success
## Failing URLS
* https://ungleich.matrix.ungleich.cloud/_matrix/client/r0/rooms/!ayZCWJehJJJpcWvQxS%3Aungleich.ch/read_markers
## Related documents in the web
* https://matrix.org/blog/2020/11/03/how-we-fixed-synapses-scalability
* https://github.com/matrix-org/synapse/blob/develop/docs/workers.md
* https://github.com/matrix-org/synapse/blob/develop/docs/reverse_proxy.md