# Resource Group Development Notes

## Overview

Resource Group is the resource management module of Greenplum Database. It controls (isolates) resource usage across the whole Greenplum Database cluster. Four kinds of resources can be managed:

1. Concurrency
2. CPU
3. Memory
4. Disk IO

## Application Layers using Cgroup

The implementation of `CPU` and `Disk IO` management is based on Linux Cgroup.

![](https://hackmd.io/_uploads/H1Fugr3E2.png)

* The top layer faces the client user: Greenplum provides the concept of Resource Group, with utility SQLs to manage database tasks. This layer defines the semantics of resgroup.
* The bottom layer is the OS: Greenplum implements some resgroup features (CPU, IO) using Linux Cgroup.
* The middle layer is Greenplum Database itself. This layer "translates" the user input into values (or configurations) and passes them to Linux Cgroup.

The whole architecture is similar to Docker's.

## Processes in a Greenplum Database Cluster

It is critical to understand the different processes in a running Greenplum Database cluster, because an OS process is the unit whose resources we can manage. A running Greenplum Database cluster contains many running Postgres instances on different hosts:

* 1 Coordinator Instance
* 1 Standby Instance
* Many Primary Segment Instances
* Many Mirror Segment Instances

A running Postgres instance has a main process called `Postmaster`.

## Resgroup IO limit

### Linux Cgroup IO control

Linux Cgroup provides the io subsystem [https://docs.kernel.org/admin-guide/cgroup-v2.html#io](https://docs.kernel.org/admin-guide/cgroup-v2.html#io):

> The "io" controller regulates the distribution of IO resources. This controller implements both weight based and absolute bandwidth or IOPS limit distribution; however, weight based distribution is available only if cfq-iosched is in use and neither scheme is available for blk-mq devices.

`io.max` is the interface Greenplum will use for resource group IO control.
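The value written into `io.max` is a device id followed by `key=value` pairs. A minimal sketch of how such a line could be assembled (`io_max_line` is a hypothetical helper for illustration, not Greenplum code):

```python
def io_max_line(major, minor, **limits):
    """Build one io.max value line from keyword limits.

    Keys mirror the cgroup v2 io.max keys: rbps, wbps, riops, wiops.
    Illustrative only; not Greenplum's actual implementation.
    """
    allowed = ("rbps", "wbps", "riops", "wiops")
    unknown = set(limits) - set(allowed)
    if unknown:
        raise ValueError(f"unknown io.max keys: {unknown}")
    # Emit keys in a stable order so the line is deterministic.
    parts = [f"{k}={limits[k]}" for k in allowed if k in limits]
    return f"{major}:{minor} " + " ".join(parts)

# The resulting line is what gets written into <cgroup>/io.max:
print(io_max_line(8, 16, rbps=2097152, wiops=120))  # 8:16 rbps=2097152 wiops=120
```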
The configuration parameters `rbps, wbps, riops, wiops` keep the same meanings in Greenplum resource group as in Linux Cgroup. The syntax to set Linux Cgroup `io.max` is:

```
echo "8:16 rbps=2097152 wiops=120" > io.max
```

Both the configuration parameters and the device ID need to be provided.

### Resource Group IO Limit Design

Greenplum does two things to control the IO limit:

1. write the device id and configuration parameters into `io.max` of the corresponding resource group
2. write process ids into `cgroup.procs` of the corresponding resource group

We choose the tablespace in Greenplum as the object to control, for two reasons:

* given an OS path, we can find its device id
* data is stored in tablespaces in Greenplum, so heavy IO (reads and writes) happens on tablespaces

The mapping relationship is:

* Linux Cgroup <----> Resource Group
* Device Id <----> Tablespace Path
* A database role belongs to a resource group

|            | tablespace 1 | tablespace 2 | ... | tablespace n |
| ---------- | ------------ | ------------ | --- | ------------ |
| resgroup 1 | io configs   | io configs   | ... | io configs   |
| resgroup 2 | io configs   | io configs   | ... | io configs   |
| ...        | ...          | ...          | ... | ...          |
| resgroup m | io configs   | io configs   | ... | io configs   |

Linux Cgroup needs the configuration matrix above. Greenplum needs to provide user-friendly DDLs to generate and maintain this matrix.
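The matrix above can be flattened into the per-resource-group lines that would be written into each cgroup's `io.max`. A minimal sketch, assuming a dict-of-dicts representation and a tablespace-to-device mapping (all names are illustrative, not Greenplum's internal structures):

```python
def io_max_entries(matrix, tablespace_dev):
    """Flatten the resgroup x tablespace matrix into per-resgroup io.max lines.

    matrix:          {resgroup: {tablespace: "key=value ..." io configs}}
    tablespace_dev:  {tablespace: "major:minor" device id}

    Each returned line would be echoed into that resource group's
    cgroup directory; backend pids then go into its cgroup.procs.
    """
    return {
        rg: [f"{tablespace_dev[ts]} {cfg}" for ts, cfg in ts_cfgs.items()]
        for rg, ts_cfgs in matrix.items()
    }

matrix = {
    "rg1": {"ts1": "rbps=2097152", "ts2": "wiops=120"},
    "rg2": {"ts1": "wbps=max"},
}
devs = {"ts1": "8:16", "ts2": "8:32"}
print(io_max_entries(matrix, devs))
```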
### DDLs for IO Limit

```sql
create resource group <rg_name> with (iolimit='iolimit_config_string');
alter resource group <rg_name> set iolimit 'iolimit_config_string';
```

Typical cases:

```sql
create resource group rg with (iolimit='ts1:rbps=100,wbps=13;ts2:rbps=9,wiops=1');
create resource group rg with (iolimit='*:rbps=100,wbps=13');
```

Static part of `iolimit_config_string`:

```
<iolimit_config_string> :: tablespace_io_config
                        :: tablespace_io_config;<iolimit_config_string>
<tablespace_io_config>  :: tablespace:ioconfigs
<tablespace>            :: identifier | *
<ioconfigs>             :: ioconfig
                        :: ioconfig,<ioconfigs>
<ioconfig>              :: key=value
<key>                   :: wbps | rbps | wiops | riops
<value>                 :: integer | max
```

The dynamic rules of the DDL are:

1. `*` can be used to represent all tablespaces. If `*` is specified, there must be no other tablespace io configs; otherwise, an error will be raised.
2. Allowed io config parameters are: `rbps`, `wbps`, `riops`, `wiops`. In a single `<ioconfigs>`, each parameter can appear at most once.
3. Duplicated tablespace names are not allowed.
4. Two tablespaces pointing to the same device are not allowed.

Semantics of the DDL in Greenplum:

```
create resource group rg with (iolimit='ts1:rbps=100,wbps=13;ts2:rbps=9,wiops=1;...');

Given rg, find its cgroup path
for each tablespace ioconfigs:
    dev id = find the tablespace device id
    echo "dev id ioconfigs" > cgroup path/io.max
```

## Get device number (BDI)

### How to check whether major:minor (or a block device) is a disk

```
> ls /sys/dev/block/{major:minor}/start
```

If the `start` file exists, the device is a partition; if not, the device is a disk.

### How to get the disk of a path

1. get the BDI of the corresponding partition:

```shell
# using stat, you can get the BDI from the Device field
> stat .
  File: .
  Size: 4096       Blocks: 8          IO Block: 4096   directory
Device: 8,17       Inode: 11272193    Links: 3
Access: (0755/drwxr-xr-x)  Uid: ( 1000/      mt)   Gid: (  100/   users)
Access: 2023-05-11 16:31:15.358499430 +0800
Modify: 2023-05-05 10:51:35.878812903 +0800
Change: 2023-05-07 20:58:21.158723508 +0800
 Birth: 2023-05-07 20:58:17.955677414 +0800

# find the mount entry from /proc/self/mountinfo
> cat /proc/self/mountinfo
39 27 0:36 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime shared:19 - fusectl fusectl rw
40 27 0:37 / /sys/kernel/config rw,nosuid,nodev,noexec,relatime shared:20 - configfs configfs rw
63 24 0:38 / /run/credentials/systemd-sysctl.service ro,nosuid,nodev,noexec,relatime shared:21 - ramfs ramfs rw,mode=700
65 24 0:39 / /run/credentials/systemd-tmpfiles-setup-dev.service ro,nosuid,nodev,noexec,relatime shared:22 - ramfs ramfs rw,mode=700
43 30 0:27 /@home /home rw,relatime shared:45 - btrfs /dev/nvme0n1p3 rw,ssd,discard=async,space_cache=v2,subvolid=798,subvol=/@home
42 30 259:6 / /boot rw,relatime shared:48 - vfat /dev/nvme0n1p1 rw,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro
99 24 0:42 / /run/credentials/systemd-tmpfiles-setup.service ro,nosuid,nodev,noexec,relatime shared:50 - ramfs ramfs rw,mode=700
235 30 0:27 /@/var/lib/docker/btrfs /var/lib/docker/btrfs rw,relatime shared:1 - btrfs /dev/nvme0n1p3 rw,ssd,discard=async,space_cache=v2,subvolid=797,subvol=/@
252 24 0:62 / /run/user/1000 rw,nosuid,nodev,relatime shared:575 - tmpfs tmpfs rw,size=6577684k,nr_inodes=1644421,mode=700,uid=1000,gid=100
674 252 0:63 / /run/user/1000/gvfs rw,nosuid,nodev,relatime shared:593 - fuse.gvfsd-fuse gvfsd-fuse rw,user_id=1000,group_id=100
710 252 0:64 / /run/user/1000/doc rw,nosuid,nodev,relatime shared:627 - fuse.portal portal rw,user_id=1000,group_id=100
244 43 8:17 / /home/mt/workstage/data rw,relatime shared:155 - ext4 /dev/sdb1 rw,stripe=8191
```

2. find the disk of the partition:

```shell
# 1. get the real link
# readlink -f /sys/dev/block/{BDI of partition}
> readlink -f /sys/dev/block/8:17
/sys/devices/pci0000:00/0000:00:08.1/0000:2f:00.3/usb6/6-1/6-1:1.0/host10/target10:0:0/10:0:0:0/block/sdb/sdb1

# 2. get the basename of the parent dir
> basename $(dirname "$(readlink -f /sys/dev/block/8:17)")
sdb
```

### Should we consider stacked block devices?

+ LVM is ok, because every logical volume of LVM is a `Disk`.
+ software RAID?
+ btrfs?
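The path-to-disk resolution above can be sketched in Python. This is a minimal illustration under the stated sysfs layout (`/sys/dev/block/<maj>:<min>` symlinks, `start` marking a partition); helper names are hypothetical, not Greenplum code:

```python
import os

SYSFS_BLOCK = "/sys/dev/block"  # parameterized so it can be faked in tests

def dev_of_path(path):
    """(major, minor) of the block device backing *path*, via stat(2)."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)

def is_partition(major, minor, sysfs=SYSFS_BLOCK):
    """A 'start' file under /sys/dev/block/<maj>:<min>/ marks a partition."""
    return os.path.exists(f"{sysfs}/{major}:{minor}/start")

def disk_of_partition(major, minor, sysfs=SYSFS_BLOCK):
    """Mirror of: basename $(dirname "$(readlink -f /sys/dev/block/M:m)")."""
    real = os.path.realpath(f"{sysfs}/{major}:{minor}")
    return os.path.basename(os.path.dirname(real))
```

For the example above, `dev_of_path("/home/mt/workstage/data")` would give `(8, 17)`, `is_partition(8, 17)` would be true, and `disk_of_partition(8, 17)` would return `sdb`.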