# TripleO CI infrastructure upgrades

###### tags: `Design`

Jakob @ 2022-02-17:

> Why I think we should upgrade our baremetal infrastructure or OpenStack projects (tenants). Primarily, I have three
> use cases: development on (a) the Reproducer, (b) OpenStack deployments with TripleO and (c) others.
>
> a) The Reproducer supports three backends: running jobs on the reproducer host, on preprovisioned libvirt domains or
> as OpenStack compute instances. At the moment we have an OpenStack project *-pcci to run jobs on, but it is shared
> with the whole team and underpowered. Using this project always bears the risk of breaking other people's systems.
> One solution is to ask for unique (sub)projects for individual team members and for service users like
> `tripleo.ci.ruck.rover@gmail.com`. These projects must be sufficiently large. Depending on (future) jobs, project
> networks probably must provide full control over the TCP/IP link layer (layer 2), for VLANs etc. Using OpenStack
> compute instances could be a challenge whenever nested virtualization is involved, because KVM supports VM-in-VM
> only, so for VM-in-VM-in-VM one has to fall back to QEMU TCG, which is slow to unusable depending on the workload
> (a sketch of such a check follows below). Another solution would be to deploy our own OpenStack environment on
> baremetal, which would give us full control but also require maintenance.
>
> b) At the moment we are not able to deploy OpenStack environments which follow production guidelines. Doing so might
> be relevant for debugging issues such as UEFI boot bugs and for developing production-like CI scenarios. For such
> deployments we would need control over DNS and DHCP services to provision baremetal machines via PXE, e.g. with
> OpenStack Ironic (a sketch of node enrollment follows below). For production environments at least two NICs are
> recommended; our servers currently have only a single NIC connected. For performance and redundancy reasons one
> would typically use at least four NICs per node. To be able to test [Network Functions Virtualization (NFV)][nfv]
> and [SR-IOV][sriov], we (might) need suitable network hardware. To deploy production-like Ceph clusters, we would
> have to invest seriously in storage: size, performance and number of devices. To be able to test e.g.
> [PCI passthrough][pci-passthrough] we would need extra devices such as GPUs. NFV, SR-IOV and PCI passthrough might
> be out of our scope, but control over the TCP/IP link layer is fundamental.
>
> [pci-passthrough]: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/configuring_the_compute_service_for_instance_creation/assembly_configuring-pci-passthrough_pci-passthrough
> [nfv]: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html-single/network_functions_virtualization_planning_and_configuration_guide/index
> [sriov]: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/network_functions_virtualization_planning_and_configuration_guide/part-sriov-nfv-configuration
>
> c) For our daily work, we need to provision systems. For example, to work on the Ansible OpenStack collection, we
> need customized DevStack instances to run tests against. Instead of four or five people each investing time into
> setting up their own DevStack, it would help to have preprovisioned machines created in advance by one team member,
> or playbooks to quickly spin up machines (a sketch follows below). We need a place to host these services, either on
> properly sized and non-risky OpenStack projects or on baremetal machines with control over the TCP/IP link layer and
> DHCP/DNS services.
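
To make the nested-virtualization concern in (a) concrete: whether a compute instance can still run hardware-accelerated guests of its own can be read from the standard `kvm_intel`/`kvm_amd` sysfs parameters. A minimal sketch of such a check; nothing here is specific to our tenants:

```python
#!/usr/bin/env python3
"""Check whether this (possibly virtual) machine can run hardware-accelerated
guests, i.e. whether another VM level would use KVM or fall back to QEMU TCG."""
import os
import pathlib


def kvm_available() -> bool:
    """/dev/kvm exists only if the kernel exposes hardware virtualization."""
    return os.path.exists("/dev/kvm")


def nested_enabled() -> bool:
    """kvm_intel/kvm_amd export a 'nested' parameter ('Y' or '1' when enabled)."""
    for module in ("kvm_intel", "kvm_amd"):
        param = pathlib.Path(f"/sys/module/{module}/parameters/nested")
        if param.exists():
            return param.read_text().strip() in ("Y", "y", "1")
    return False


if __name__ == "__main__":
    if not kvm_available():
        print("No /dev/kvm: guests on this host run under QEMU TCG (slow).")
    elif nested_enabled():
        print("KVM with nested virtualization: one more VM level is accelerated.")
    else:
        print("KVM available, but nested virtualization is disabled.")
```

Running this inside a compute instance before scheduling reproducer jobs there tells us whether a further VM level still gets hardware acceleration or degrades to TCG.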
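
For (b), provisioning baremetal machines via PXE with Ironic boils down to enrolling each node with its out-of-band management credentials and the MAC address of the NIC on the provisioning network, which is exactly why control over DHCP/DNS and the link layer matters. A rough sketch with openstacksdk, where the cloud name, node name, addresses and credentials are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: enroll a baremetal node in Ironic via openstacksdk so that it can
be PXE-provisioned later. All names, addresses and credentials are placeholders."""
import openstack

# "undercloud" is a placeholder entry in clouds.yaml.
conn = openstack.connect(cloud="undercloud")

# Register the node with its out-of-band management (IPMI) details.
node = conn.baremetal.create_node(
    name="compute-0",                  # placeholder node name
    driver="ipmi",
    driver_info={
        "ipmi_address": "192.0.2.10",  # placeholder BMC address
        "ipmi_username": "admin",      # placeholder credentials
        "ipmi_password": "secret",
    },
    resource_class="baremetal",
)

# Register the MAC of the NIC attached to the provisioning network,
# so Ironic's DHCP/PXE setup can answer this node's boot requests.
conn.baremetal.create_port(node_id=node.id, address="52:54:00:12:34:56")

# Move the node towards an available state (inspection/cleaning omitted here).
conn.baremetal.set_node_provision_state(node, "manage", wait=True)
conn.baremetal.set_node_provision_state(node, "provide", wait=True)
```

This only works if the node's BMC and the provisioning network's DHCP/PXE traffic are reachable from wherever Ironic runs, which is the control we currently lack.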
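
For (c), the Ansible OpenStack collection is itself built on openstacksdk, so whether we end up with playbooks or a small helper script, spinning up a scratch machine for DevStack work is essentially one server-create call. A minimal sketch, assuming a suitably sized project; the cloud entry, image, flavor, network and keypair names are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: boot a scratch instance for DevStack work with openstacksdk.
Cloud name, image, flavor, network and keypair are placeholders."""
import openstack


def spin_up(name: str) -> None:
    # "pcci" is a placeholder for the cloud entry in clouds.yaml.
    conn = openstack.connect(cloud="pcci")
    server = conn.create_server(
        name=name,
        image="CentOS-Stream-9",    # placeholder image name
        flavor="ci.standard.xxl",   # placeholder flavor name
        network="private",          # placeholder project network
        key_name="tripleo-ci",      # placeholder keypair
        wait=True,
        auto_ip=True,
    )
    print(f"Created {server.name} ({server.id}), status: {server.status}")


if __name__ == "__main__":
    spin_up("devstack-scratch-01")
```

Roughly the same parameters are what an `openstack.cloud.server` task would take if we prefer a playbook over a script.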

Sandeep @ 2022-02-17:

> You are saying that we basically need more infrastructure (i.e. baremetal or virtual) for the following cases:
>
> a) Reproducer,
> b) OpenStack deployments with TripleO and
> c) others.
>
>> a) Reproducer and c) others.
>
> Big +1 to getting more quota/hardware for individual members for learning and the reproducer. The 64 GB individual
> machine our team members have is not of much use - we can only do a minimal installation there (director + 1 compute
> + 1 controller) or a devstack/standalone environment.
>
> In my personal experience, most folks generally do the following for the reproducer in practice:
>
> * Rerun the failing job and hold it to get a reproducer environment (this has its own limitations). But if we also
>   say we don't have enough resources within our quota to hold a node for a reproducer, that is really a big issue
>   for our team. So far I have only seen that being the case downstream (PSI), and Ronelle is working with infra to
>   increase our quota in PSI for our tenant, so we should soon have enough quota for downstream debugging.
>
> I believe the usage pattern of the reproducer script will change once we have enough quota in upshift and the
> reproducer script is resilient.
>
>> b) OpenStack deployments with TripleO and
>
> You are making good points here. A while back, when I joined forces in setting up the downstream component pipeline,
> I planned to create multiple customer-like scenario test cases [1], including the NFV/Ceph testing which you
> mentioned. Two outcomes:
>
> 1. The newly requested hardware never came.
>
> 2. When I went to DF with this new possible job coverage, everyone said these are great ideas, but also that it is
>    not on DF to test all the possible customer-like scenarios; it is on the particular DFGs to make sure their
>    component/functionality works well with TripleO, and most of this testing is already available in QE phase 3
>    jobs [2].
>
> Just taking the NFV example you mentioned, we already have that coverage in the NFV phase 3 jobs [2]; you can find
> coverage for the other DFGs at [3]. In fact, there is now an ongoing "consolidation effort" downstream: the QE and
> CRE teams are trying to consolidate our downstream CI so that we don't waste resources by running the same
> jobs/testing.
>
> I would argue that running jobs in phase 3 is too late and that we should shift more testing upward in our component
> pipeline/upstream (if possible), for which we need more hardware. But with the formation of the CRE team, some of
> these responsibilities would go to them downstream: taking hold of the component lines and testing detailed
> customer-like scenarios earlier in the production chain. I am not totally sure anymore what our exact role
> downstream is after the CRE team is in place - to be discussed more with Ronelle.
>
> I wholeheartedly agree with your ideas, but to really get new hardware we need to make our case:
>
> A) What additional coverage will we add that is not already present in phase 1/2/3 jobs?
> B) We need to confirm our (tripleo-ci team) role downstream after the formation of the CRE team.
>
> [1] https://docs.google.com/spreadsheets/d/1UC9BeLdoxNv1SWolOapr4TyEhCC5TFcuff_2eUUVUG0/view
> [2] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.2/view/nfv/
> [3] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.2/

Sandeep @ 2022-02-17:

> Personally, from my experience so far, I believe getting more quota in upshift is more feasible than getting new
> hardware, unless we can really prove our case. It was the opposite: these downstream responsibilities were new to
> this team, and we proved that upstream component lines can do wonders downstream. That's why they are creating a
> team to manage downstream component lines. Their team is not a thing yet - it is a work in progress (hiring and just
> starting) - you remember the folks from the clbyl project. But yes, going forward they want to use the best of both
> worlds: Zuul and Jenkins. Currently they are trying to ask the basic (right) questions first - what do we test and
> where - hence the clbyl project. Most of the jobs we manage are TripleO-specific (as our name says, we are the
> tripleo-ci team). This new CRE team has hired a bunch of members, aligned with each DFG, to add more focused jobs in
> their respective components and to monitor/manage them.

Ronelle, Sandeep, Jakob @ 2022-02-22:

New hardware is required for two separate use cases: one is development, the other is CI. The future of RHOSP and
TripleO is unclear, especially because of Red Hat's investment in OpenShift, the RHOSP leadership's decision to put
DirectorD on hold and their (renewed) focus on the OpenStack Director Operator. Together with the formation of the new
CRE team and the existing downstream DFGs, the future of the TripleO CI team, especially our responsibilities, might
change. Writing hardware specifications could prove difficult because our requirements might change if OpenShift or
the OpenStack Director Operator is going to play a bigger role for our team in the future.

For now, Ronelle will check what happened with Wes' request for new hardware from 2019/2020(?).

Jakob will have a look at Ronelle's baremetal test environment and see whether it could be reworked so that the team
can use it for development, e.g. TripleO cluster deployments. These baremetal servers haven't been used for a long
time and some might be broken. If recycling them proves feasible, we might request a recabling of their network to get
control over the TCP/IP link layer.