# RabbitMQ TWG Notes

## Pinned notes

https://confluence.diamond.ac.uk/display/SSCC/RabbitMQ

### Tasks

* BW: Looking into RabbitMQ server configuration, DH offers to assist
* JH: FITSM/ITSM
* ~~RG: Added RMQ version of queue monitor [(dlstbx/#58)](https://github.com/DiamondLightSource/python-dlstbx/pull/58)~~
* SciComp: [HTTPS to admin proxy (SC-2991)](https://jira.diamond.ac.uk/browse/SC-2991)
* SciComp: [Host aliases for RabbitMQ servers (SC-2969)](https://jira.diamond.ac.uk/browse/SC-2969)
* SciComp: [Set up systemd script for RabbitMQ (SC-2967)](https://jira.diamond.ac.uk/browse/SC-2967)
* RG: To chase up above tickets so they can be resolved before project week
* DH: 2-way bridge
* RG: did set this up monitoring, this is running
* ??: Alerting

### Upcoming project week

2021-08-13 (Friday) to 2021-08-19 (Thursday)

* Document and link monitoring on Confluence; Alerting
    * https://gitlab.diamond.ac.uk/scisoft/zocalo/-/tree/master/monitoring
* Setting up logging RabbitMQ -> Graylog
    * Erlang 23 / RabbitMQ 3.8: Logging via Lager. That can probably be configured/connected to Graylog (GELF) in some way
    * ~~Erlang 24 offers a new logging framework (name?) which may make this easier. We can deploy RabbitMQ 3.9+Erlang 24 via the Jenkins job. May or may not suffice to make use of the new logging framework.~~
    * Try to use lager
        * https://github.com/esl/lager_graylog last release 2018, but last activity 2020
        * https://github.com/silviucpp/graylog_lager last activity June, was fix for Erlang 24
* Identify a service for 2-way bridge and set this up
    * Need configuration/support to start a service on RabbitMQ
        * all --live/--test command line arguments, replaced by -e/--environment
    * Service suggestions:
        * Images services, fairly high number of messages, but lost messages don't really matter, leaf node
        * Dispatcher? Could update Synchweb to submit to RMQ
    * Note: liveness-messages go on the same server, so if the Controller runs on ActiveMQ then it won't see RabbitMQ services.
We could spin up 2 controllers, but then they'd need separate configuration files. Probably best start with a non-critical service that only runs on RMQ (run on toolservers, or from a cluster array job) and figure out the rest from there.
* FitSM
    * James: Thursday
* Monitoring: See if we can get individual queue monitoring / queue selection on Grafana
* First test scenario
    * ActiveMQ: Run SampleTxnProducer - writes to transient.transaction
    * Bridge: transient.transaction ActiveMQ -> RabbitMQ
    * RabbitMQ: Run SampleTxn - reads from t.t, writes to transient.destination
    * Bridge: transient.destination RabbitMQ -> ActiveMQ
    * ActiveMQ: Run SampleConsumer - reads from t.d
    * Actions needed to make this happen:
        * ~~MG: make workflows/zocalo releases, reinstall dials/now~~
        * ~~MG: create second configuration file for RabbitMQ and dials+rabbitmq modules that point to that second config file~~
        * ~~DH: add transient.transaction to the RabbitMQ config~~
        * rewrite pikatransport
        * make `transient.status` a topic

## 2021-03-08

* PMG has worked on the Workflows transport class
* VM 3 misbehaving (DLS-33639)
* Script to start a server:
    ```bash
    module load rabbitmq
    /dls_sw/apps/rabbitmq/configuration/start-cluster
    ```
* What user will the servers run as?
    * Functional account, MG will contact MC and set this up together
* How will the servers be started? (systemd, other, ..?)
    * Will be systemd

## 2021-03-01

* MC has joined us from Scientific Computing
* Machines to be racked at the beginning of Run 2, possibly in Shutdown 1 if MC can find the time
* Basic monitoring with Nagios via Scientific Computing
* Further monitoring to be discussed between PMG and MC
* JH to discuss service catalogue with Anton.

## 2021-02-22

* PMG has made progress on Workflows+Pika
* AH is working on a configuration generator and the first bits have been merged into the zocalo configuration repository
* JH will meet up with NS and AL to discuss how we can approach FitSM for the RabbitMQ service
* There has been no update from Scientific Computing on delivery schedules, at this point it looks unlikely that we will make the next run
* MG to send an update to the project board

## 2021-02-17

JH is the project lead for the FitSM component.

## 2021-02-01

##### `module load rabbitmq` limitations

* Plugins are disabled by default
    * PMG has found a way to get this working
* Permissions by default do not allow sending from hosts other than localhost
* Default configuration does not have a user
* Information is on Confluence
    * https://confluence.diamond.ac.uk/display/~atd44888/RabbitMQ+Setup

##### Queue setup

* On the dashboard you can import a policies file, that sets up queues and exchanges
    * https://www.rabbitmq.com/parameters.html
* Exclusive queues are tied to the lifetime of the connection:
    * https://www.rabbitmq.com/ha.html#exclusive-queues-are-not-mirrored
    * https://www.rabbitmq.com/reliability.html#clustering

##### Monitoring

* Scientific computing would like to have rabbitmq in kubernetes because it would make monitoring easier.
    * That's not going to happen.

##### Action points:

* [x] PMG to add link to confluence information above
* [ ] MG and PMG to discuss monitoring in more detail
* [ ] PMG and MG to have a look at the configuration/policies

## 2021-01-27

##### [Quorum queues vs. mirrored queues](https://www.rabbitmq.com/quorum-queues.html)

Former are better for data integrity and should be preferred. They do not provide the same features. Notable differences are:

* no exclusivity.
We rely on exclusivity
    * in the X-ray centring service: only one instance is allowed to collect results, so that this one instance can aggregate everything
    * in the controller service: to determine which instance is allowed to shut down and spin up new services.
    * mimas backlog: same reason, only one instance to drip feed jobs into the cluster
    * Further: cluster statistics service, dropfile pickup, `dc_sim_verify`
* no message TTL.
    * We use this in ActiveMQ for development and testing (`transient.` queues), but only because ActiveMQ requires messages to expire before queues can be expired, and what we really care about is the queues expiring.
    * So this is not an issue.

As we need exclusivity we know that not all queues can be quorum queues. We will definitely need some sort of `declare_queue()` function in workflows.

##### Queue configuration

In ActiveMQ queues could be created on the fly. A service could start listening for a name without that name being predefined. Similarly, another service could send messages to a queue that did not already exist.

In RabbitMQ we will need to predefine exchanges/queues in some way:

* services can set up their environment
    * easy to add new services
    * declaration next to code using it
    * obvious problem when e.g. the PIA service starts before the X-ray centring service, as the destination queue isn't set up
    * Q: Does the configuration translate across the RabbitMQ cluster?
    * Q: What happens if a brand new RabbitMQ cluster node appears? Does the configuration get transferred instantaneously, or will there be a period of potential data loss or failure because destinations are not set up?
* queues can be declared in a central configuration repository
    * more rigid configuration, less flexible
    * single place to go to if you want to know who is responsible for a queue (think DLQ messages)
    * Q: How will this work together with system self tests, where we need to generate hundreds of unique temporary queues?

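
A rough sketch of what the `declare_queue()` helper mentioned above might look like, to make the quorum vs. exclusive split concrete. Nothing here is agreed: the function names, the `expires_ms` option, the queue names and the policy itself (exclusive or expiring queues become classic non-durable queues, everything else becomes a durable quorum queue) are assumptions for illustration. The only real API relied on is the `queue_declare(queue, durable, exclusive, auto_delete, arguments)` call that a pika channel provides.

```python
from typing import Any, Dict, Optional


def queue_declare_kwargs(
    name: str,
    *,
    exclusive: bool = False,
    expires_ms: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a queue.declare call.

    Assumed policy (not a decision from these notes): queues that need
    exclusivity or should expire are classic, non-durable queues; all
    other queues are durable quorum queues.
    """
    if exclusive or expires_ms is not None:
        arguments: Dict[str, Any] = {}
        if expires_ms is not None:
            # x-expires: broker removes the queue after this many ms unused
            arguments["x-expires"] = expires_ms
        return {
            "queue": name,
            "durable": False,
            "exclusive": exclusive,
            "auto_delete": False,
            "arguments": arguments,
        }
    return {
        "queue": name,
        "durable": True,  # quorum queues have to be durable
        "exclusive": False,  # quorum queues cannot be exclusive
        "auto_delete": False,
        "arguments": {"x-queue-type": "quorum"},
    }


def declare_queue(channel: Any, name: str, **options: Any) -> Any:
    """Declare `name` on `channel`, e.g. a pika BlockingChannel."""
    return channel.queue_declare(**queue_declare_kwargs(name, **options))
```

With pika the channel would come from `pika.BlockingConnection(...).channel()`. A service needing exclusivity would then call something like `declare_queue(channel, "some.exclusive.queue", exclusive=True)`, and a self test could use `declare_queue(channel, "transient.destination", expires_ms=60000)`. Keeping the policy in a pure function means it can be checked without a running broker, whichever of the two configuration approaches above is chosen.
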
##### Action points:

* [x] PMG to investigate clustering, monitoring and alerting ([via kubernetes](https://confluence.diamond.ac.uk/display/SC/Kubernetes+User+Guide#KubernetesUserGuide-Monitoring))
* [ ] AH and MG to read through [pika tutorial](https://www.rabbitmq.com/tutorials/tutorial-one-python.html)
* [x] Investigate how other projects do queue configuration

## 2021-01-25

PMG presented her evaluation of 5 client libraries. Discussion:

* 2 discarded due to development model (we don't really do async/coroutines)
* reference implementation ([pika](https://pypi.org/project/pika/)) has best documentation
* alternative implementation ([amqp](https://github.com/celery/py-amqp)) can be twice as fast, but documentation is out of date
* both implementations are close in terms of the development model

**Result: we will use [pika](https://pypi.org/project/pika/)**, and if performance becomes an issue we can consider switching in the future.