# Faster remote room joins

**Outdated**: this document is now superseded by [MSC3902](https://github.com/matrix-org/matrix-spec-proposals/pull/3902).

## Design overview

* We no longer get the whole state in response to a `send_join`.
  * This leads to a dramatic reduction in response size, making the response come back quicker, and (particularly) making it much faster to process.
* So, the join event is flagged as having *partial state*...
  * ... as are any events that use that join event as a `prev_event`, and so on.
* Synapse's DB layer is updated so that any queries for the state at such events *block* until the state is resolved. (This is where we need good cancellation support.)
  * One exception: if the `StateFilter` shows that we don't need the membership events, then there is no need to block. This allows lazy-loading clients to keep using the room anyway.
* We have a background process which back-populates the state.
  * In theory we can do this with `/state_ids` and lots of `/event` requests, but that is glacial, so we have to optimise the code to use `/state` instead. This has shaken out a surprising number of bugs.
* Once the state at a particular event is populated, we can unblock any pending DB queries for state at that event. This requires a certain amount of marshalling (and is particularly involved in a multi-worker environment).

The initial draft doesn't need any client-side changes, though it's likely we will want to make some once we see how both lazy-loading and non-lazy-loading clients perform (i.e., let's do better than just presenting spinners).
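As a rough illustration of the blocking behaviour described above, a partial-state gate might look like the sketch below. All names here (`PartialStateTracker`, `StateFilter`, their methods) are hypothetical, not Synapse's actual API:

```python
import asyncio


class StateFilter:
    """Describes which state event types a query needs."""

    def __init__(self, types):
        # `types` is a set of event types, or None meaning "all types".
        self.types = types

    def needs_memberships(self):
        return self.types is None or "m.room.member" in self.types


class PartialStateTracker:
    """Tracks events with partial state, and blocks queries against them."""

    def __init__(self):
        self._partial = {}  # event_id -> asyncio.Event, set when resolved

    def mark_partial(self, event_id):
        self._partial[event_id] = asyncio.Event()

    def mark_resolved(self, event_id):
        # Unblock any queries waiting for full state at this event.
        waiter = self._partial.pop(event_id, None)
        if waiter is not None:
            waiter.set()

    async def await_full_state(self, event_id, state_filter):
        # Lazy-loading queries that don't need membership events
        # need not block on the resync.
        if not state_filter.needs_memberships():
            return
        waiter = self._partial.get(event_id)
        if waiter is not None:
            await waiter.wait()
```

The key design point is that the `StateFilter` check happens *before* blocking, which is what lets lazy-loading clients keep using the room while the resync runs.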
## Metrics

* A graph showing time taken to join a selection of rooms over time: [prometheus](https://synapse-performance-test.lab.element.dev/prometheus/graph?g0.expr=performance_join_time_seconds%20and%20performance_join_success%20%3E%200%20and%20(time()%20-%20performance_join_timestamp%20%3C%2030000)&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=16d)

## Detailed spec changes

* Extend the `send_join` API to return less state: covered in [MSC3706](https://github.com/matrix-org/matrix-doc/pull/3706).

### Receiving events over federation during the resync

We may not have enough state, so we replace the "state at the event" check (cf. [Checks performed on receipt of a PDU](https://spec.matrix.org/v1.3/server-server-api/#checks-performed-on-receipt-of-a-pdu)) with a check against the state-res of the auth events and the state at the event.

### Soft fail

We can't follow the current soft-fail algorithm, since we may not have the sender's membership event in the current state. For now, we will skip the soft-fail check if there is partial state. (We may wish to return to this and improve it.)

### Device lists

* https://github.com/matrix-org/synapse/pull/13913
* https://github.com/matrix-org/synapse/issues/13891

### Handling incoming federation requests

* https://github.com/matrix-org/matrix-spec-proposals/pull/3895
* TODO: we're changing how these are authenticated (see https://github.com/matrix-org/synapse/issues/13288).
* TODO: does it imply changes to how we send events?

## 2022-05-30 state of play

Done so far:

* server-side support for the extended `send_join` API: [MSC3706](https://github.com/matrix-org/matrix-doc/pull/3706), [#11967](https://github.com/matrix-org/synapse/pull/11967).
* Initial client-side support for just hitting the API and populating the DB: [#11994](https://github.com/matrix-org/synapse/pull/11994), [#12005](https://github.com/matrix-org/synapse/pull/12005), [#12011](https://github.com/matrix-org/synapse/pull/12011), [#12012](https://github.com/matrix-org/synapse/pull/12012), [#12039](https://github.com/matrix-org/synapse/pull/12039).
* Making `/state` work correctly for outliers: [#12173](https://github.com/matrix-org/synapse/pull/12173), [#12155](https://github.com/matrix-org/synapse/pull/12155), [#12154](https://github.com/matrix-org/synapse/pull/12154), [#12087](https://github.com/matrix-org/synapse/pull/12087), test fixes, and more in flight ([sytest#1211](https://github.com/matrix-org/sytest/pull/1211), [sytest#1192](https://github.com/matrix-org/sytest/pull/1192), [#12191](https://github.com/matrix-org/synapse/pull/12191)).
* Use `/state` for resyncing large fractions of the room state: [#12013](https://github.com/matrix-org/synapse/pull/12013), [#12040](https://github.com/matrix-org/synapse/pull/12040).
* Walk the list of partial-state events, and fill them in: [#12394](https://github.com/matrix-org/synapse/pull/12394).
* A manager for tracking which events have partial state: [#12399](https://github.com/matrix-org/synapse/pull/12399).

## Testing results, 2022/05/30

I attempted to join #element:matrix.org (a room of 13K users) from sw1v.org. Results:

* 20:08:48 (+0:00): Start.
* 20:09:28 (+0:40): The join itself completes, comprising:
  * 2s warming up (`/query/directory`, `/make_join`, etc.)
  * 16s waiting for the `/send_join` response
  * 19s checking signatures on the `/send_join` response
  * 3s persisting events in the `/send_join` response
* 20:09:33 (+0:45): room is included in `/sync`. At this point, Element Web no longer shows the room as "joining", but it still shows a spinner for history.
* 20:15:49 (+7:01): join event is de-partial-stated.
  * Any messages sent by local users before this point are now processed.
* 20:16:00 (+7:12): `/backfill` request made.
* 20:16:12 (+7:24): state resync process completes.
* 20:16:34 (+7:46): `/members` request completes.
* 20:21:22 (+12:34): `/messages` request completes.

For comparison, a regular (not-faster-joins) join:

* 21:55:30 (+0:00): start.
* 21:56:29 (+1:00): client times out, reports an error.
* 22:01:22 (+5:52): join completes, client shows room with pagination spinner.
* 22:05:11 (+9:41): `/messages` request completes.

## Next steps

Work is now being tracked under milestones in the Synapse issue tracker:

* [Q2 2022 ─ Faster joins phase 2: correctness](https://github.com/matrix-org/synapse/milestone/6)
* [Q3 2022: Faster joins: fix major known bugs for monoliths](https://github.com/matrix-org/synapse/milestone/8)
* [Q4 2022: Faster joins: worker-mode and remaining work](https://github.com/matrix-org/synapse/milestone/10)

## Outstanding questions

* What do we do if the `/state` request never completes (e.g., the resident server becomes unreachable, or leaves the room, or the `/state` response causes us to OOM)?
  * We probably struggle on, zombie-like, repeatedly retrying the `/state`. But we could end up with lots of rooms like that...
* What happens if we try to leave the room while the resync is still in progress? Once we do so, we will be unable to make `/state` requests.
  * Just leave the state incomplete?
  * Purge the room?
  * Not allow the last user to leave?
  * Allow them to leave, but not tell other servers about it?
* It's possible that, once we get the full state and chase it down through the DAG, we'll discover some state transition is impossible. (E.g., a state event was created by a user who turns out to have left the room at that point.) How do we handle this?
  * If we'd known about the problem upfront, we'd have just rejected the event.
  * We can mark the event as rejected retrospectively, but we might have told clients and even other servers about it in the meantime.
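On the first outstanding question: one way to "struggle on" without hammering a single resident server is a retry loop with exponential backoff across the candidate servers. The following is only an illustrative sketch — `request_state` and all parameter names are hypothetical, not Synapse's actual resync code:

```python
import time


def resync_with_backoff(servers, request_state, *, initial_delay=5,
                        max_delay=3600, max_attempts=None, sleep=time.sleep):
    """Try each resident server in turn until one returns the room state.

    `request_state(server)` is assumed to raise on failure. After a full
    round of failures we back off exponentially, capped at `max_delay`.
    Returns None if `max_attempts` rounds all fail.
    """
    delay = initial_delay
    rounds = 0
    while max_attempts is None or rounds < max_attempts:
        rounds += 1
        for server in servers:
            try:
                return request_state(server)
            except Exception:
                continue  # this server is unreachable; try the next one
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return None
```

This still leaves the "lots of zombie rooms" problem open — the sketch only bounds how often each room retries, not how many rooms are retrying.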
## Other TODO list (richvdh brain dump)

* resync:
  * fix the race in persistence (where the persistence thread reads a lazy-stated event just before we resync it and finish up the resync job)
  * there are a bunch of races in the resync code
* [x] add the tables to `purge_rooms` (https://github.com/matrix-org/synapse/pull/12889)
* [ ] find out why `/send_join` is so slow to respond (Jaeger shows it doing lots of `bulk_get_push_rules`; oddly, sending a message first doesn't help, so maybe it's just not being cached right on our test server)
* Tests:
  * [ ] ex-outliers with lazy-loading. A unit test?
  * [ ] state which turns out to be wrong when we resync
* [x] port the schema defs to Postgres
* [x] switch the schema to use `event_ids`. It's too difficult to de-outlier things otherwise.

---

# Older design notes - no longer relevant

## Handling the half-joined state

Auth doesn't actually depend on resolved room state: it depends on *auth events* (though the magic "reconcile auth events" code is likely to make things behave oddly). What *does* depend on room state is soft-fail. Maybe we can get away with not soft-failing anything while the state sync is in progress.

So once the `send_join` completes, we need to kick off a process which:

* does a `/state` request, and updates the state at the initial join event;
* updates the state at any subsequently-received events;
* does the post-room-upgrade stuff.

For added fun, that process needs to withstand server restarts.

So how do we identify a half-joined room? Guess we should keep a DB table.

### What do we do for state_groups in half-joined rooms?

We need to be able to auth events sent by local users, which really does mean having the concept of "current state", even if it's partial. So, I think we'll have to have cut-down state groups, and generate new state groups at resync.
### Processing incoming requests

We only want to do the reprocessing for events whose prev-events are all either fully stated or on the list of events to fix up (otherwise we won't be able to figure out the correct state). So, for any event that arrives in the meantime:

* if any prev_events are unknown, we should `get_missing_events` for them (which will populate them as regular events);
* if any prev_events are on the lazy-stated list, the new event joins the list.

### Managing the list of lazy-stated events

* we could make it implicit via links to the DAG, but that gets annoyingly inefficient
* we could assume *all* events are lazy-stated (which implies that we must be able to get all prev_events for incoming events, or ignore them) - but that's pretty bogus in a leave/rejoin scenario
* it's just everything with a `stream_ordering` larger than the join event

### Syncing

Ideally we want to withhold lazy-joined rooms from non-lazy-loading `/sync` requests until the `/state` completes. "State sync completion" therefore needs to trigger some marker for the sync handler to pick up.

### Endpoints which need changing

* federation:
  * `send_join`
  * `state`
  * `state_ids`
* c-s:
  * `/members`
  * `/joined_members`
  * `/state`
  * `/initialSync`
  * `/sync`

## Old notes from outlier-based design

* It seems nice to avoid giving the events state_groups at all?
* Why can't we just mark the damn things as outliers? It'll mean updating sync and push code not to just ignore outliers, but that might be a good thing.
  * Doesn't work, because we need partial state at these events.

### What should we do with forward and backward extremities?

Clearly, the lazy-stated events should not be excluded from being forward extremities. So, either we need to update the forward-extremity logic to consider lazy-stated events despite their being outliers, or we need to decide they aren't really outliers (and update everything else that expects non-outliers to have state).

https://github.com/matrix-org/synapse/issues/9595
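The prev_events rule under "Processing incoming requests" above can be expressed as a small classification function. This is a toy sketch with hypothetical names, not real Synapse code:

```python
def classify_incoming(prev_event_ids, known_events, lazy_stated):
    """Decide how to handle an event arriving during the resync.

    `known_events` is the set of event IDs we have persisted;
    `lazy_stated` is the subset of those whose state is still partial.
    """
    if any(p not in known_events for p in prev_event_ids):
        # We can't work out the state yet: fetch the gap first
        # (get_missing_events will populate them as regular events).
        return "fetch_missing"
    if any(p in lazy_stated for p in prev_event_ids):
        # The new event inherits partial state and joins the resync list.
        return "lazy_stated"
    # All prev_events are fully stated: process normally.
    return "fully_stated"
```

Note that "everything with a `stream_ordering` larger than the join event" (the third option above) would make an explicit `lazy_stated` set unnecessary, at the cost of mishandling leave/rejoin.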