# RabbitMQ TWG Notes
## Pinned notes
https://confluence.diamond.ac.uk/display/SSCC/RabbitMQ
### Tasks
BW: Looking into RabbitMQ server configuration, DH offers to assist
JH: FITSM/ITSM
~~RG: Added RMQ version of queue monitor [(dlstbx/#58)](https://github.com/DiamondLightSource/python-dlstbx/pull/58)~~
SciComp: [HTTPS to admin proxy (SC-2991)](https://jira.diamond.ac.uk/browse/SC-2991)
SciComp: [Host aliases for RabbitMQ servers (SC-2969)](https://jira.diamond.ac.uk/browse/SC-2969)
SciComp: [Set up systemd script for RabbitMQ (SC-2967)](https://jira.diamond.ac.uk/browse/SC-2967)
RG: To chase up above tickets so they can be resolved before project week
DH: 2-way bridge
RG: Set up monitoring for this; it is now running
??: Alerting
### Upcoming project week
2021-08-13 (Friday) to 2021-08-19 (Thursday)
* Document and link monitoring on Confluence; Alerting
* https://gitlab.diamond.ac.uk/scisoft/zocalo/-/tree/master/monitoring
* Setting up logging RabbitMQ -> Graylog
* Erlang 23 / RabbitMQ 3.8: Logging via Lager. That can probably be configured/connected to Graylog (GELF) in some way
* ~~Erlang 24 offers a new logging framework (name?) which may make this easier. We can deploy RabbitMQ 3.9+Erlang 24 via the Jenkins job. May or may not suffice to make use of the new logging framework.~~
* Try to use lager
* https://github.com/esl/lager_graylog last release 2018, but last activity 2020
* https://github.com/silviucpp/graylog_lager last activity June, was fix for Erlang 24
* Identify a service for 2-way bridge and set this up
* Need configuration/support to start a service on RabbitMQ
* All `--live`/`--test` command line arguments: replaced by `-e`/`--environment`
* Service suggestions:
* Image services: fairly high number of messages, but lost messages don't really matter; leaf node
* Dispatcher? Could update Synchweb to submit to RMQ
* Note: liveness messages go on the same server, so if the Controller runs on ActiveMQ then it won't see RabbitMQ services. We could spin up 2 controllers, but then they'd need separate configuration files. Probably best to start with a non-critical service that only runs on RMQ (run on toolservers, or from a cluster array job) and figure out the rest from there.
* FitSM
* James: Thursday
* Monitoring: See if we can get individual queue monitoring / queue selection on Grafana
* First test scenario
* ActiveMQ: Run SampleTxnProducer - writes to transient.transaction
* Bridge: transient.transaction ActiveMQ -> RabbitMQ
* RabbitMQ: Run SampleTxn - reads from transient.transaction, writes to transient.destination
* Bridge: transient.destination RabbitMQ -> ActiveMQ
* ActiveMQ: Run SampleConsumer - reads from transient.destination
* Actions needed to make this happen:
* ~~MG: make workflows/zocalo releases, reinstall dials/now~~
* ~~MG: create second configuration file for RabbitMQ and dials+rabbitmq modules that point to that second config file~~
* ~~DH: add transient.transaction to the RabbitMQ config~~
* rewrite pikatransport
* make `transient.status` a topic
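
The first test scenario above can be written down as an ordered list of hops, which makes it easy to check that every message is either consumed on the broker it sits on or bridged to the other one. This is only a sketch: the `Hop` structure and `check_chain()` are hypothetical helpers for illustration and are not part of Zocalo or workflows.

```python
from typing import NamedTuple, Optional


class Hop(NamedTuple):
    broker: str                   # broker the step runs on (source broker for a bridge)
    actor: str                    # service name, or "Bridge"
    reads: Optional[str]          # queue consumed (None for the initial producer)
    writes: Optional[str]         # queue written to (None for the final consumer)
    target: Optional[str] = None  # destination broker, set for bridges only


SCENARIO = [
    Hop("ActiveMQ", "SampleTxnProducer", None, "transient.transaction"),
    Hop("ActiveMQ", "Bridge", "transient.transaction", "transient.transaction", "RabbitMQ"),
    Hop("RabbitMQ", "SampleTxn", "transient.transaction", "transient.destination"),
    Hop("RabbitMQ", "Bridge", "transient.destination", "transient.destination", "ActiveMQ"),
    Hop("ActiveMQ", "SampleConsumer", "transient.destination", None),
]


def check_chain(hops):
    """Return a list of problems; empty if each hop reads where the previous one delivered."""
    problems = []
    location = None  # (broker, queue) where the message currently sits
    for hop in hops:
        if hop.reads is not None and (hop.broker, hop.reads) != location:
            problems.append(
                f"{hop.actor} reads {hop.reads} on {hop.broker}, but the message is at {location}"
            )
        if hop.writes is not None:
            # A bridge delivers to its target broker; a service writes locally.
            location = (hop.target or hop.broker, hop.writes)
        else:
            location = None
    return problems
```

Dropping one of the two bridge hops from `SCENARIO` makes `check_chain()` report the step that would wait on a queue that never receives the message.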
## 2021-03-08
* PMG has worked on the Workflows transport class
* VM 3 misbehaving (DLS-33639)
* Script to start a server:
```bash
module load rabbitmq
/dls_sw/apps/rabbitmq/configuration/start-cluster
```
* What user will the servers run as?
* Functional account, MG will contact MC and set this up together
* How will the servers be started? (systemd, other, ..?)
* Will be systemd
## 2021-03-01
* MC has joined us from Scientific Computing
* Machines to be racked at the beginning of Run 2, possibly in Shutdown 1 if MC can find the time
* Basic monitoring with Nagios via Scientific Computing
* Further monitoring to be discussed between PMG and MC
* JH to discuss service catalogue with Anton.
## 2021-02-22
* PMG has made progress on Workflows+Pika
* AH is working on a configuration generator and the first bits have been merged into the zocalo configuration repository
* JH will meet up with NS and AL to discuss how we can approach FitSM for the RabbitMQ service
* There has been no update from Scientific Computing on delivery schedules; at this point it looks unlikely that we will make the next run
* MG to send an update to the project board
## 2021-02-17
JH is the project lead for the FitSM component.
## 2021-02-01
##### `module load rabbitmq` limitations
* Plugins are disabled by default
* PMG has found a way to get this working
* Permissions by default do not allow sending from hosts other than localhost
* Default configuration does not have a user
* Information is on Confluence
* https://confluence.diamond.ac.uk/display/~atd44888/RabbitMQ+Setup
##### Queue setup
* On the dashboard you can import a policies file that sets up queues and exchanges
* https://www.rabbitmq.com/parameters.html
* Exclusive queues are tied to the lifetime of the connection:
https://www.rabbitmq.com/ha.html#exclusive-queues-are-not-mirrored
https://www.rabbitmq.com/reliability.html#clustering
##### Monitoring
* Scientific Computing would like to run RabbitMQ in Kubernetes because it would make monitoring easier.
* That's not going to happen.
##### Action points:
* [x] PMG to add link to confluence information above
* [ ] MG and PMG to discuss monitoring in more detail
* [ ] PMG and MG to have a look at the configuration/policies
## 2021-01-27
##### [Quorum queues vs. mirrored queues](https://www.rabbitmq.com/quorum-queues.html)
Quorum queues are better for data integrity and should be preferred, but they do not provide the same features as mirrored queues. Notable differences are:
* no exclusivity. We rely on exclusivity
* in the X-ray centring service: only one instance is allowed to collect results, so that this one instance can aggregate everything
* in the controller service: to determine which instance is allowed to shut down and spin up new services.
* mimas backlog: same reason, only one instance to drip feed jobs into the cluster
* Further: cluster statistics service, dropfile pickup, `dc_sim_verify`
* no message TTL.
* We use this in ActiveMQ for development and testing (`transient.` queues), but only because ActiveMQ requires messages to expire before queues can be expired, and what we really care about is the queues expiring.
* So this is not an issue.
As we need exclusivity we know that not all queues can be quorum queues. We will definitely need some sort of `declare_queue()` function in workflows.
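
A minimal sketch of what such a `declare_queue()` helper could look like, assuming a pika-style channel whose `queue_declare()` accepts `queue`, `durable`, `exclusive` and `arguments`. The helper names and signatures here are hypothetical, not the eventual workflows API. The `x-queue-type` and `x-expires` arguments are standard RabbitMQ queue arguments; how queue expiry behaves on quorum queues should be checked against the deployed RabbitMQ version.

```python
def queue_arguments(exclusive=False, expires_ms=None):
    """Choose queue_declare() keyword arguments for one queue."""
    arguments = {}
    if expires_ms is not None:
        # x-expires: the broker deletes the queue once it has been unused this long.
        arguments["x-expires"] = int(expires_ms)
    if exclusive:
        # Quorum queues cannot be exclusive, so fall back to a classic queue.
        return {"durable": False, "exclusive": True, "arguments": arguments}
    # Everything else becomes a quorum queue, which must be declared durable.
    arguments["x-queue-type"] = "quorum"
    return {"durable": True, "exclusive": False, "arguments": arguments}


def declare_queue(channel, name, exclusive=False, expires_ms=None):
    """Declare `name` on a pika-style channel and return the broker's reply."""
    return channel.queue_declare(queue=name, **queue_arguments(exclusive, expires_ms))
```

With pika, `channel` would come from `pika.BlockingConnection(...).channel()`. For example, `declare_queue(channel, "transient.transaction", expires_ms=60000)` would request a quorum queue that expires after a minute of disuse, while a service that relies on exclusivity would pass `exclusive=True`.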
##### Queue configuration
In ActiveMQ queues could be created on the fly. A service could start listening on a queue name without that name being predefined. Similarly, another service could send messages to a queue that did not already exist.
In RabbitMQ we will need to predefine exchanges/queues in some way:
* services can set up their environment
* easy to add new services
* declaration next to code using it
* obvious problem when e.g. the PIA service starts before the X-ray centring service, as the destination queue isn't set up yet
* Q: Does the configuration translate across the RabbitMQ cluster?
* Q: What happens if a brand new RabbitMQ cluster node appears? Does the configuration get transferred instantaneously, or will there be a period of potential data loss or failure because destinations are not set up?
* queues can be declared in a central configuration repository
* more rigid configuration, less flexible
* single place to go to if you want to know who is responsible for a queue (think DLQ messages)
* Q: How will this work together with system self tests, where we need to generate hundreds of unique temporary queues?
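
For the central-repository option, one possible shape is a small generator that turns a list of queue names into a definitions document for import into the broker. The sketch below assumes the JSON layout used by the management plugin's exported definitions (`queues` entries with `name`, `vhost`, `durable`, `auto_delete`, `arguments`). The function name and the queue list are illustrative only, and a real file would typically also carry exchanges, bindings and users.

```python
import json


def build_definitions(queue_names, vhost="/"):
    """Build a definitions document declaring one durable queue per unique name."""
    return {
        "queues": [
            {
                "name": name,
                "vhost": vhost,
                "durable": True,
                "auto_delete": False,
                "arguments": {},
            }
            for name in sorted(set(queue_names))  # de-duplicate, keep output stable
        ]
    }


if __name__ == "__main__":
    names = ["transient.transaction", "transient.destination"]
    print(json.dumps(build_definitions(names), indent=2))
```

Keeping the queue list in one file gives the single place to look up who owns a queue, but it does not answer the self-test question above: hundreds of unique temporary queues would still have to be declared by the tests themselves.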
##### Action points:
* [x] PMG to investigate clustering, monitoring and alerting ([via kubernetes](https://confluence.diamond.ac.uk/display/SC/Kubernetes+User+Guide#KubernetesUserGuide-Monitoring))
* [ ] AH and MG to read through [pika tutorial](https://www.rabbitmq.com/tutorials/tutorial-one-python.html)
* [x] Investigate how other projects do queue configuration
## 2021-01-25
PMG presented her evaluation of 5 client libraries.
Discussion:
* 2 discarded due to development model (we don't really do async/coroutines)
* reference implementation ([pika](https://pypi.org/project/pika/)) has best documentation
* alternative implementation ([amqp](https://github.com/celery/py-amqp)) can be twice as fast, but documentation is out of date
* both implementations are close in terms of the development model
**Result: we will use [pika](https://pypi.org/project/pika/)**, and if performance becomes an issue we can consider switching in the future.