# Fedora Metrics for Apps - Conversation with Mark/Kevin
## Meeting 2021-07-06
Meeting with Kevin/Mark
- [0] https://docs.google.com/document/d/1IczzjEviRtqleGzt7eWjRVmD-_dmhLJn1WqOWrU4qs4/edit#
- [1] https://pagure.io/fedora-infra/metrics-for-apps/boards/metrics-for-apps
- [2] https://pagure.io/fedora-infra/metrics-for-apps
- [3] https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps
## Questions
### Fedora Infra access.redhat.com account
- Stephen added David last year
- need vipul and akashdeep to have access
- can we be administrators?
- this gives us access to interact with the clusters at cloud.redhat.com, open support tickets etc
### subscriptions
- How many openshift 4 subscriptions do we currently have?
- Can we delete the older clusters defined there (the old 2019 communishift)? That might free up entitlements
- I know where we can update it internally to get more entitlements if required
### sysadmin-noc
- the noc01 instance seems to be inaccessible for David and Akashdeep; Vipul has access. Is it the sysadmin-noc group which unlocks this? Can we all get access to it?
### sysadmin-dns
- is it possible for us to get access?
- on batcave01, /srv/git/dns has the raw files (read-only access)
- No way to do PRs; the process will be to create git patches and email them to the fedora-infra mailing list (see the sketch below)
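A minimal sketch of that patch workflow, assuming the repo under /srv/git/dns on batcave01 can be cloned locally for read-only work (the working-copy path and commit message are just placeholders):
```
# On batcave01: make a local clone of the read-only DNS repo
git clone /srv/git/dns ~/dns-work && cd ~/dns-work

# ...edit the relevant zone file(s), then commit locally...
git add -A
git commit -m "Add ocp4 records for the new cluster"

# Turn the commit into a patch file and email it to the fedora-infra
# mailing list for someone with write access to review and apply.
git format-patch -1 --stdout > ocp4-dns.patch
```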
### Proxies
- use the existing method of proxying
- this is all in the ansible repo
- migration plan: migrating the apps over is not a problem we have to fix within the confines of this initiative
- need to add `ocp4.` entries to the DNS as well (see the sanity-check sketch after this list)
- we will have an HTTP console
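For reference, OpenShift 4 expects resolvable `api.`, `api-int.`, and wildcard `*.apps.` records for the cluster domain. A quick sanity check once the `ocp4.` entries land could look like this; the cluster domain below is only an assumption:
```
# Assumed cluster domain for illustration; substitute the real one.
CLUSTER_DOMAIN=ocp4.stg.fedoraproject.org

dig +short api.${CLUSTER_DOMAIN}        # external API
dig +short api-int.${CLUSTER_DOMAIN}    # internal API
dig +short test.apps.${CLUSTER_DOMAIN}  # any name under *.apps should resolve
```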
### Hardware
- For the existing OCP3 cluster, can/will we repurpose those nodes to add extra nodes to the new clusters?
- especially needed for staging, since you need a minimum of 6 machines to install: 3 masters, 2 workers, 1 bootstrap (the bootstrap machine can be repurposed later). The official docs list the minimum requirements as 5 nodes with 4 cores and 16 GB RAM
- These are the management ips for the hosts, accessible from the Red Hat VPN.
- Mark can get the management password and reset it if necessary
- Is there a way for us to access the console? It was helpful during the install on the CentOS CI cluster.
- the existing OCP3 nodes are all VMs, so we can't repurpose them
- Master nodes could be put on VMs
- Are all these in the same chassis? We think they are all on 10G, and all in different boxes
New machines for prod:
(These are all AMD-based boxes: 96 cores, 256 GB RAM, 8x 450 GB SSDs)
```
oshift-dell01 IN A 10.3.160.180
oshift-dell02 IN A 10.3.160.181
oshift-dell03 IN A 10.3.160.182
oshift-dell04 IN A 10.3.160.183
oshift-dell05 IN A 10.3.160.184
oshift-dell06 IN A 10.3.160.185
```
older machines we had marked for masters perhaps:
```
oshift4-x86-fe-01 IN A 10.3.160.58
oshift4-x86-fe-02 IN A 10.3.160.45
oshift4-x86-fe-03 IN A 10.3.160.44
```
older machines we had marked for perhaps a staging cluster:
```
oshift4-x86-be-01-stg IN A 10.3.160.51
oshift4-x86-be-02-stg IN A 10.3.160.50
oshift4-x86-fe-01-stg IN A 10.3.160.40
oshift4-x86-fe-02-stg IN A 10.3.160.41
```
### Storage
- What storage is available and in use on the ocp3 cluster currently?
- NFS is not suitable storage for Prometheus
- What storage is available on the nodes? (see the sketch after this list)
- Might be able to use https://docs.openshift.com/container-platform/4.7/storage/persistent_storage/persistent-storage-ocs.html to offer managed shared storage built from whatever local storage is available on the nodes (OCS is downstream of Rook)
- the Fedora OCP3 cluster currently uses NFS served from a NetApp
- NetApp has a product for doing storage for OpenShift
- no quotas, but it's all controlled via ansible
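To answer "what storage is available on the nodes?", one option once a cluster is up is to inspect the block devices from a node debug pod; a small sketch (the node name is a placeholder):
```
# List the nodes, then look at local block devices on one of them.
oc get nodes
oc debug node/<node-name> -- chroot /host lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
```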
### Bastion access / playbook execution
- None of us are in `sysadmin-main`; will we be able to run playbooks from the current `os-control01.stg.iad2.fedoraproject.org` bastion node? (rough sketch at the end of this list)
- make sure we set up new playbooks to use `sysadmin-noc`, which will give us the permissions to run the playbooks
- maybe add a sysadmin-openshift group?
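A rough sketch of what running a playbook could look like once the group and permissions are in place; the `rbac-playbook` wrapper name and the playbook path are assumptions based on the rbac playbook mentioned in the pxeboot section below:
```
# From the control/bastion host, assuming membership in the new group and
# that the playbook has been added to the rbac allow-list (names assumed).
sudo rbac-playbook openshift-apps/<app>.yml
```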
### pxeboot/ipmi provisioning
```
Once thats sorted out, installs are via pxeboot...
noc01.iad2.fedoraproject.org is the dhcp server for all the networks.
dhcpd.conf is in ansible. The pxelinux config is on batcave01 in /srv/web/infra/bigfiles/tftpboot
(it's not in ansible because it's got a bunch of image files, etc)
We can stick ignition files on batcave01 as well.
```
- We know technically how to pxeboot a machine and pass the various things necessary to bring up RHCOS (4.3 at least; we might need to refresh that knowledge for 4.7). How do we do those things in Fedora infra? Is there a playbook to restart a node?
- Are these files stored on the noc01 instance?
- Does noc01 have a basic HTTP server running which can serve some of the files needed during the RHCOS boot?
- Where are they stored in source control? Some things can't be put in the ansible repo: SSH keys, ignition files, etc.
- Can we get access to this, or should we have Mark/Kevin do it?
- https://docs.infra.centos.org/operations/ci/installation/install/ some notes we took during the baremetal install for the CentOS CI cluster
- We will store the files in the fedora infra private repo
- We need to write playbooks to deploy from the private repo to a location on the OpenShift control/bastion node where we will store the ignition files, kubeadmin certs, etc.
- We need to set up a webserver on that node, and set iptables rules to only allow connections from the OpenShift nodes (rough sketch at the end of this list)
- Ping mark/kevin when adding new playbooks, to have them update the rbac playbook so our group can run them.
- https://pagure.io/fedora-infra/ansible/blob/main/f/inventory/group_vars/bodhi_backend#_42
- https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/groups/ipa.yml#_8
- Ansible flag we should use to stop secrets being printed to the log file (presumably `no_log: true`)
- Dell boxes: management port 443, reachable from the Red Hat VPN
- several management passwords (2-3); need to get them from Mark
- web browser, iDRAC
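A rough sketch of the "webserver plus iptables" idea for serving ignition files from the control/bastion node during installs; the port, the throwaway Python server, and the directory path are assumptions (the allowed IPs are the prod boxes listed above), and in practice the rules would be templated through the ansible iptables role rather than added by hand:
```
# Throwaway HTTP server for the ignition files (port 8080 is an assumption).
cd /path/to/ignition-files && python3 -m http.server 8080 &

# Allow only the OpenShift nodes to fetch from it; reject everything else.
for ip in 10.3.160.180 10.3.160.181 10.3.160.182 \
          10.3.160.183 10.3.160.184 10.3.160.185; do
    iptables -A INPUT -p tcp --dport 8080 -s "${ip}" -j ACCEPT
done
iptables -A INPUT -p tcp --dport 8080 -j REJECT
```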
### Kubevirt
- On the CentOS CI cluster, fedora-coreos (and others) have elevated permissions that allow them to access /dev/kvm directly on the OCP4 nodes themselves; they are bringing up VMs `directly on the nodes`, bypassing monitoring checks, potentially making the nodes unstable, etc.
- Are we planning to allow this on this new cluster?
- Can we request they instead run any x86 workloads via the koji jobs, similar to what we were suggesting for the aarch64 builds?
### Longer term..
- Can we ever actually replace this cluster with OSD, self-managed OCP4, or AWS ROSA?
- Will we always be tied to having baremetal in the data center?
- Is Prometheus/Alertmanager going to be our metrics and alerting stack for Fedora applications going forward, or should we focus on integrating the Prometheus metrics into a Zabbix stack?
### Main points
- start off with a staging cluster
## Meeting with Mark - Jul 7th
* Agreed: create a new sysadmin-openshift group and give it the needed permissions
* ssh tunnel to avoid the VPN (to access the management login): `ssh -L 4430:10.3.160.180:443 -L 5900:10.3.160.180:5900 batcave01`, then open https://localhost:4430/
* Reference links:
  * https://pagure.io/fedora-infra/ansible/blob/main/f/inventory/group_vars/bodhi_backend#_42
  * https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/groups/ipa.yml#_8
  * https://pagure.io/fedora-infra/ansible/blob/main/f/files/osbs/fix-docker-iptables.staging
  * https://pagure.io/fedora-infra/ansible/blob/main/f/roles/base/templates/iptables
# metrics-for-apps - conversation with kevin
## 26 Aug 2021
### Machines
- we don't have to move them, as the frontend nodes are way more powerful than the VMs that we have with OpenShift 3.x
### Authentication
`ansible/files/communishift/objects`
- has info about the OIDC setup
- we have to add some config and retrieve some secrets from Ipsilon (rough sketch after this list)
- proper group-based authentication needs to be implemented for the cluster
- Kevin seems at least open to having an officially supported solution, e.g. an operator to sync groups with IPA
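For reference, wiring Ipsilon in as an OpenID identity provider on OCP4 looks roughly like the sketch below; the issuer URL, client ID, secret name, and claim names are all assumptions, and the real values would come from the communishift objects and Ipsilon config mentioned above:
```
# Sketch only: client ID, secret, issuer, and claims are placeholders/assumptions.
oc create secret generic ipsilon-client-secret \
    --from-literal=clientSecret=<secret-from-ipsilon> -n openshift-config

oc apply -f - <<'EOF'
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: fedora-idp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: <client-id-registered-in-ipsilon>
      clientSecret:
        name: ipsilon-client-secret
      issuer: https://id.fedoraproject.org/openidc
      claims:
        preferredUsername: [nickname]
        name: [name]
        email: [email]
EOF
```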
### Apps
- there are several categories of apps
- some not in prod
- some have cronjobs that we don't want more than one copy of running
- some have storage requirements
- some have external databases, these should be ok to deploy as is
- some may use old/legacy/no-longer-supported API versions for objects (see the grep sketch after this list)
- we would want to move the applications slowly, one by one
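One quick way to spot apps still using API groups that recent Kubernetes releases have removed (and which are therefore gone on OCP4) is to grep the app definitions in the ansible repo; the role path here is an assumption about the repo layout:
```
# extensions/v1beta1 and apps/v1beta1|2 no longer exist on current clusters.
grep -rnE 'apiVersion: *(extensions/v1beta1|apps/v1beta[12])' roles/openshift-apps/
```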
### Storage
- Some applications have external storage / PVCs, and we have to figure out OCS before we can make the shift