# Fedora Metrics for Apps - Conversation with Mark/Kevin
## Meeting 2021-07-06
Meeting with Kevin/Mark
- [0] https://docs.google.com/document/d/1IczzjEviRtqleGzt7eWjRVmD-_dmhLJn1WqOWrU4qs4/edit#
- [1] https://pagure.io/fedora-infra/metrics-for-apps/boards/metrics-for-apps
- [2] https://pagure.io/fedora-infra/metrics-for-apps
- [3] https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps
## Questions
### Fedora Infra access.redhat.com account
- Stephen added David last year
- need vipul and akashdeep to have access
- can we be administrators?
- this gives us access to interact with the clusters at cloud.redhat.com, open support tickets etc
### subscriptions
- How many openshift 4 subscriptions do we currently have?
- Can we delete the older clusters defined there (the old 2019 communishift)? That might free up entitlements
- I know where we can update it internally to get more entitlements if required
### sysadmin-noc
- the noc01 instance seems to be inaccessible for David and Akashdeep; Vipul has access. Is it the sysadmin-noc group which unlocks this? Can we all get access to it?
### sysadmin-dns
- is it possible for us to get access?
- on batcave01, /srv/git/dns has the raw files (read-only access)
- No way to do PRs; the process will be to create git patches and email them to the fedora-infra mailing list (see the sketch below)
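A minimal sketch of that patch workflow, assuming the repo under /srv/git/dns on batcave01 can be cloned locally for read-only work (the working-copy path and commit message are just placeholders):
```
# On batcave01: make a local clone of the read-only DNS repo
git clone /srv/git/dns ~/dns-work && cd ~/dns-work

# ...edit the relevant zone file(s), then commit locally...
git add -A
git commit -m "Add ocp4 records for the new cluster"

# Turn the commit into a patch file and email it to the fedora-infra
# mailing list for someone with write access to review and apply.
git format-patch -1 --stdout > ocp4-dns.patch
```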
### Proxies
- use the existing method of proxying
- this is all in the ansible repo
- migration plan: migrating the apps over is not a problem we have to fix within the confines of this initiative
- need to add `ocp4.` entries to the DNS as well (see the sanity-check sketch after this list)
- we will have an HTTP console
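For reference, OpenShift 4 expects resolvable `api.`, `api-int.`, and wildcard `*.apps.` records for the cluster domain. A quick sanity check once the `ocp4.` entries land could look like this; the cluster domain below is only an assumption:
```
# Assumed cluster domain for illustration; substitute the real one.
CLUSTER_DOMAIN=ocp4.stg.fedoraproject.org

dig +short api.${CLUSTER_DOMAIN}        # external API
dig +short api-int.${CLUSTER_DOMAIN}    # internal API
dig +short test.apps.${CLUSTER_DOMAIN}  # any name under *.apps should resolve
```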
### Hardware
- For the existing OCP3 cluster, can/will we repurpose those nodes to add extra nodes to the new clusters?
- especially needed for staging, since you need a minimum of 6 machines to install: 3 masters, 2 workers, 1 bootstrap (the bootstrap machine can be repurposed later). The official docs list the minimum requirements as 5 nodes with 4 cores and 16 GB RAM
- These are the management ips for the hosts, accessible from the Red Hat VPN.
- Mark can get the management password and reset it if necessary
- Is there a way for us to access the console? It was helpful during the install on the CentOS CI cluster.
- the existing OCP3 nodes are all VMs, so we can't repurpose them
- Master nodes could be put on VMs
- Are all these in the same chassis? We think they are all on 10G, and all in different boxes
New machines for prod:
(These are all AMD-based boxes: 96 cores, 256 GB RAM, 8x 450 GB SSDs)
```
oshift-dell01 IN A 10.3.160.180
oshift-dell02 IN A 10.3.160.181
oshift-dell03 IN A 10.3.160.182
oshift-dell04 IN A 10.3.160.183
oshift-dell05 IN A 10.3.160.184
oshift-dell06 IN A 10.3.160.185
```
older machines we had marked for masters perhaps:
```
oshift4-x86-fe-01 IN A 10.3.160.58
oshift4-x86-fe-02 IN A 10.3.160.45
oshift4-x86-fe-03 IN A 10.3.160.44
```
older machines we had marked for perhaps a staging cluster:
```
oshift4-x86-be-01-stg IN A 10.3.160.51
oshift4-x86-be-02-stg IN A 10.3.160.50
oshift4-x86-fe-01-stg IN A 10.3.160.40
oshift4-x86-fe-02-stg IN A 10.3.160.41
```
### Storage
- What storage is available and in use on the ocp3 cluster currently?
- NFS is not suitable storage for Prometheus
- What storage is available on the nodes? (see the sketch after this list)
- Might be able to use https://docs.openshift.com/container-platform/4.7/storage/persistent_storage/persistent-storage-ocs.html to offer managed shared storage built from whatever local storage is available on the nodes (OCS is downstream of Rook)
- the Fedora OCP3 cluster currently uses NFS served from a NetApp
- NetApp has a product for doing storage for OpenShift
- no quotas, but it's all controlled via ansible
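To answer "what storage is available on the nodes?", one option once a cluster is up is to inspect the block devices from a node debug pod; a small sketch (the node name is a placeholder):
```
# List the nodes, then look at local block devices on one of them.
oc get nodes
oc debug node/<node-name> -- chroot /host lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
```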
### Bastion access / playbook execution
- None of us are in `sysadmin-main`; will we be able to run playbooks from the current `os-control01.stg.iad2.fedoraproject.org` bastion node? (rough sketch at the end of this list)
- make sure we set up new playbooks to use `sysadmin-noc`, which will give us the permissions to run the playbooks
- maybe add a sysadmin-openshift group?
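A rough sketch of what running a playbook could look like once the group and permissions are in place; the `rbac-playbook` wrapper name and the playbook path are assumptions based on the rbac playbook mentioned in the pxeboot section below:
```
# From the control/bastion host, assuming membership in the new group and
# that the playbook has been added to the rbac allow-list (names assumed).
sudo rbac-playbook openshift-apps/<app>.yml
```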
### pxeboot/ipmi provisioning
```
Once thats sorted out, installs are via pxeboot...
noc01.iad2.fedoraproject.org is the dhcp server for all the networks.
dhcpd.conf is in ansible. The pxelinux config is on batcave01 in /srv/web/infra/bigfiles/tftpboot
(it's not in ansible because it's got a bunch of image files, etc)
We can stick ignition files on batcave01 as well.
```
- We know technically how to pxeboot a machine and pass the various things necessary to bring up RHCOS (4.3 at least; we might need to refresh that knowledge for 4.7). How do we do those things in Fedora infra? Is there a playbook to restart a node?
- Are these files stored on the noc01 instance?
- Does noc01 have a basic HTTP server running which can serve some of the files needed during the RHCOS boot?
- Where are they stored in source control? Some things can't be put in the ansible repo: SSH keys, ignition files, etc.
- Can we get access to this, or should we have Mark/Kevin do it?
- https://docs.infra.centos.org/operations/ci/installation/install/ some notes we took during the baremetal install for the CentOS CI cluster
- We will store the files in the fedora infra private repo
- We need to write playbooks to deploy from the private repo to a location on the OpenShift control/bastion node where we will store the ignition files, kubeadmin certs, etc.
- We need to set up a webserver on that node, and set iptables rules to only allow connections from the OpenShift nodes (rough sketch at the end of this list)
- Ping mark/kevin when adding new playbooks, to have them update the rbac playbook so our group can run them.
- https://pagure.io/fedora-infra/ansible/blob/main/f/inventory/group_vars/bodhi_backend#_42
- https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/groups/ipa.yml#_8
- Ansible flag we should use to stop secrets being printed to the log file (presumably `no_log: true`)
- Dell boxes: management port 443, reachable from the Red Hat VPN
- several management passwords (2-3); need to get them from Mark
- web browser, iDRAC
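A rough sketch of the "webserver plus iptables" idea for serving ignition files from the control/bastion node during installs; the port, the throwaway Python server, and the directory path are assumptions (the allowed IPs are the prod boxes listed above), and in practice the rules would be templated through the ansible iptables role rather than added by hand:
```
# Throwaway HTTP server for the ignition files (port 8080 is an assumption).
cd /path/to/ignition-files && python3 -m http.server 8080 &

# Allow only the OpenShift nodes to fetch from it; reject everything else.
for ip in 10.3.160.180 10.3.160.181 10.3.160.182 \
          10.3.160.183 10.3.160.184 10.3.160.185; do
    iptables -A INPUT -p tcp --dport 8080 -s "${ip}" -j ACCEPT
done
iptables -A INPUT -p tcp --dport 8080 -j REJECT
```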
### Kubevirt
- On the CentOS CI cluster, fedora-coreos (and others) have elevated permissions that allow them to access /dev/kvm directly on the OCP4 nodes themselves; they are bringing up VMs `directly on the nodes`, bypassing monitoring checks, potentially making the nodes unstable, etc.
- Are we planning to allow this on this new cluster?
- Can we request they instead run any x86 workloads via the koji jobs, similar to what we were suggesting for the aarch64 builds?
### Longer term..
- Can we ever actually replace this cluster with OSD, self-managed OCP4, or AWS ROSA?
- Will we always be tied to having baremetal in the data center?
- Is Prometheus/Alertmanager going to be our metrics and alerting stack for Fedora applications going forward, or should we focus on integrating the Prometheus metrics into a Zabbix stack?
### Main points
- start off with a staging cluster
## Meeting with Mark - Jul 7th
* Agreed: create a new sysadmin-openshift group and give it the needed permissions
* ssh tunnel to avoid the VPN (to access the management login): `ssh -L 4430:10.3.160.180:443 -L 5900:10.3.160.180:5900 batcave01`, then open https://localhost:4430/
* Reference links:
  * https://pagure.io/fedora-infra/ansible/blob/main/f/inventory/group_vars/bodhi_backend#_42
  * https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/groups/ipa.yml#_8
  * https://pagure.io/fedora-infra/ansible/blob/main/f/files/osbs/fix-docker-iptables.staging
  * https://pagure.io/fedora-infra/ansible/blob/main/f/roles/base/templates/iptables
# metrics-for-apps - conversation with kevin
## 26 Aug 2021
### Machines
- we don't have to move them, as the frontend nodes are way more powerful than the VMs that we have with OpenShift 3.x
### Authentication
`ansible/files/communishift/objects`
- has info about the OIDC setup
- we have to add some config and retrieve some secrets from Ipsilon (rough sketch after this list)
- proper group-based authentication needs to be implemented for the cluster
- Kevin seems at least open to having an officially supported solution, e.g. an operator to sync groups with IPA
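For reference, wiring Ipsilon in as an OpenID identity provider on OCP4 looks roughly like the sketch below; the issuer URL, client ID, secret name, and claim names are all assumptions, and the real values would come from the communishift objects and Ipsilon config mentioned above:
```
# Sketch only: client ID, secret, issuer, and claims are placeholders/assumptions.
oc create secret generic ipsilon-client-secret \
    --from-literal=clientSecret=<secret-from-ipsilon> -n openshift-config

oc apply -f - <<'EOF'
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: fedora-idp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: <client-id-registered-in-ipsilon>
      clientSecret:
        name: ipsilon-client-secret
      issuer: https://id.fedoraproject.org/openidc
      claims:
        preferredUsername: [nickname]
        name: [name]
        email: [email]
EOF
```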
### Apps
- there are several categories of apps
- some not in prod
- some have cronjobs that we don't want more than one copy of running
- some have storage requirements
- some have external databases, these should be ok to deploy as is
- some may use old/legacy/no-longer-supported API versions for objects (see the grep sketch after this list)
- we would want to move the applications slowly, one by one
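One quick way to spot apps still using API groups that recent Kubernetes releases have removed (and which are therefore gone on OCP4) is to grep the app definitions in the ansible repo; the role path here is an assumption about the repo layout:
```
# extensions/v1beta1 and apps/v1beta1|2 no longer exist on current clusters.
grep -rnE 'apiVersion: *(extensions/v1beta1|apps/v1beta[12])' roles/openshift-apps/
```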
### Storage
- Some applications have external storage / PVCs, and we have to figure out OCS before we can make the shift