---
tags: Agenda
---
# Design Meetings
[TOC]
## Meeting Links
Can be found here https://wiki.openstack.org/wiki/Airship#Get_in_Touch
## Archive
Agenda/notes from prior to 2021-04-01 can be found [here](https://hackmd.io/OuI_aOfXQzCjE5NEsvDfIw?both).
## Troubleshooting Guide & FAQs HackMDs
**Purpose:** provide a more accessible, flexible & dynamic way of capturing troubleshooting information & frequently asked questions. Depending on the amount of content (or lack thereof), these may be combined in the future.
https://hackmd.io/Nbc4XF6mQBmutMX_FEs51Q
https://hackmd.io/jIr3An6MT5C2xAQbKR3qoA
:::success
Feel free to add content to the pages. Thanks!
:::
## Administrative
### [Recordings](https://hackmd.io/CvuF8MzmR9KPqyAePnilPQ)
### Old Etherpad https://etherpad.openstack.org/p/Airship_OpenDesignDiscussions
### Design Needed - Issues List
| Priority | Issues List |
| - | - |
| Critical | https://github.com/airshipit/airshipctl/issues?q=is%3Aopen+is%3Aissue+milestone%3Av2.0++label%3Apriority%2Fcritical+label%3A%22design+needed%22+|
| Medium | https://github.com/airshipit/airshipctl/issues?q=is%3Aopen+is%3Aissue+milestone%3Av2.0++label%3Apriority%2Fmedium+label%3A%22design+needed%22+ |
| Low | https://github.com/airshipit/airshipctl/issues?q=is%3Aopen+is%3Aissue+milestone%3Av2.0++label%3Apriority%2Flow+label%3A%22design+needed%22+ |
## Tuesday, November 30th
### Continue discussion about new kubeconfig workflow (Ruslan A / Alexey O)
Related to issue: https://github.com/airshipit/airshipctl/issues/666.
Reviewing particular problems with the current kubeconfig approach and how to solve them using the new solution.
## Tuesday, November 16th
### AS 2.1 Issue: First Target Node BMH Image HREF Should Not Reference Ephemeral Host - Discussion (Josh / Drew)
Related to issues:
https://github.com/airshipit/airshipctl/issues/641
Per this issue, we want to change BareMetalHost (BMH) node01's image url and checksum IP to reference the target-cluster. During the move phase, the ephemeral node's resources are moved to the target node, but this IP is not changed.
Discuss implementations of potential solutions.
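For context, the image references live directly in the BMH spec; a rough sketch (IP, port and file names below are illustrative, not the actual test-site values):
```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node01
spec:
  online: true
  image:
    # The ephemeral/Ironic host address is baked into these references today;
    # the goal is to have them point at the target cluster (or an FQDN/VIP).
    url: http://10.23.25.101:8099/target-image.qcow2
    checksum: http://10.23.25.101:8099/target-image.qcow2.md5sum
```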
**Possible solution**:
- There is an option in BMO to [maintain an Ironic endpoint with keepalived](https://github.com/metal3-io/baremetal-operator/blob/master/docs/ironic-endpoint-keepalived-configuration.md). Using this option, I successfully deployed the test-site with [this PS](https://review.opendev.org/c/airship/airshipctl/+/817821).
- On the first try, the IP managed by keepalived pointed to the target-cluster IP (10.23.24.102) since it inherits the `PROVISIONING_IP` environment variable used during target expansion. The image url and checksum were unreachable.
- A second try, using a [modified keepalived docker image](https://hub.docker.com/repository/docker/jh813b/ironic-keepalived-test), resulted in keepalived preserving the ephemeral IP, keeping the image url and checksum available after target cluster expansion.
* As a POC, I hard-coded the ephemeral IP to the `assignedIP` variable in the `manage-keepalived.sh` file [here](https://github.com/metal3-io/baremetal-operator/blob/master/resources/keepalived-docker/manage-keepalived.sh).
* **Q:** Can we host our own [keepalived-docker](https://github.com/metal3-io/baremetal-operator/tree/master/resources/keepalived-docker) image so that we can set the IP we want to preserve without relying on the `PROVISIONING_IP` env variable?
* **A:** We should not need to, as the provisioning IP is only used with the ironic container.
* **Q:** With the default keepalived image, the IP that will be preserved would ultimately come from [here](https://github.com/airshipit/airshipctl/blob/master/manifests/site/test-site/target/catalogues/shareable/networking.yaml#L17). If we can host our own image, would there be an objection to adding a new key, something like `keepalivedIP` here to mimic the environment variable workflow?
* **A:** It would only be necessary if the provisioning IP is used somewhere besides the ironic container.
Other considerations that **may not** need discussion:
- Changing the IP to a FQDN seems viable since we would not need to edit the BMH node.
- Attempted to edit the hosts file in the test site as shown in [this PS](https://review.opendev.org/c/airship/airshipctl/+/815151) but during image provision of the BMH node, it fails, either because the hosts file change does not take effect, or it would occur after provision.
- **Q:** Any suggestions on implementing a solution utilizing a FQDN?
- We previously believed that the `baremetalhost.metal3.io/detached` annotation or triggering a rolling update could be possible solutions, but these do not seem desirable because we either leave the BMH node in an unmanaged state (due to a reprovision being triggered if the annotation is removed), or we lose data during the rollout.
### New kubeconfig workflow (Ruslan A/Alexey O)
Discuss current kubeconfig issues and introduce new kubeconfig workflow (github link with detailed proposal - https://github.com/airshipit/airshipctl/issues/666, proposed PS - https://review.opendev.org/c/airship/airshipctl/+/816617).
### Create Gatekeeper function TM#167 - (Shon , Snehal)
https://github.com/airshipit/treasuremap/issues/167
With the change in direction of treasuremap, should we create the gatekeeper function in treasuremap or move it to airshipctl?
PSPs are deprecated as of Kubernetes v1.21 and will be removed in v1.25. We will need a replacement for PSPs.
*Per design discussion 6/17/21, the Gatekeeper function should be included in the multi-tenant type and applied during the initinfra phase.*
Is this still valid as we are deprecating multi-tenant sites?
## Tuesday, November 9th
### First Target Node BMH Image HREF Should Not Reference Ephemeral Host - Discussion (Josh / Drew)
Related to issues:
https://github.com/airshipit/airshipctl/issues/641
https://github.com/airshipit/airshipctl/issues/610
Per this issue, we want to change BareMetalHost (BMH) node01's image url and checksum IP to reference the target-cluster. During the move phase, the ephemeral node's resources are moved to the target node, but this IP is not changed.
Proposed PS: https://review.opendev.org/c/airship/airshipctl/+/815757
Discuss implications of `baremetalhost.metal3.io/detached` annotation or other possible solutions https://github.com/metal3-io/baremetal-operator/blob/master/docs/api.md#detaching-hosts.
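For reference, detaching is done purely through metadata on the BMH object; a minimal sketch:
```yaml
metadata:
  annotations:
    # presence of the key is what detaches the host from BMO management; the value is not significant
    baremetalhost.metal3.io/detached: ""
```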
Findings:
- Adding the `baremetalhost.metal3.io/detached` annotation prevents the initial target BMH node from reprovisioning when the BMH object is edited (last week's issues were due to my environment, not the annotation), which allowed me to update the image url and checksum IPs in testing.
- **However**, if this annotation is removed after updating the image url and checksum values, the BMH node will attempt to reprovision and shut off. So to utilize this as a solution, we would need to leave BMH node in an unmanaged state, which does not seem desirable.
Other considerations:
- A rolling update seems to be the cleanest way to avoid using the annotation, but would require a spare BMH Control Plane (CP). Currently the `test-site` only deploys one CP and one Worker in the target-cluster. Also, there is information on the deployed CP that would be lost with a rolling update.
- Changing the IP to an FQDN seems viable; we would only need to update the hosts file instead of editing the object itself. However, it seems as if the hosts file would need to be updated before the ephemeral cluster is deployed, and again when the target cluster is deployed. I am unsure how to change the hosts file for the ephemeral node since, as I understand it, we build it from an image.
**Q:** Does leaving the target BMH node annotated with `baremetalhost.metal3.io/detached` seem like a viable solution?
**Q:** Is there a preferred solution, of the three possible ones mentioned here?
**Q:** Other suggestions?
## Tuesday, November 2nd
### Multi-Node site/testing (Pallav/Andrew K)
Recently opened issue https://github.com/airshipit/airshipctl/issues/652 for a multi-node CP AiaP deployment (3 control plane nodes/2 workers). We need a new multi-node test site type, "airship-core-multinode", in Airshipctl to be able to do this. A new multi-node test site gives users the opportunity to test scenarios like rolling upgrades, HA, etc. ***Discussion:*** Look at leveraging the existing TM manifests to see what can be reused in developing this in Airshipctl. Create a separate Airshipctl gate job to use the 32GB nodes. This is part of #652.
**Q:** Should the 5-node site replace the existing 3-node site, or run in parallel?
**A:** Let's get the 5 node in place & then evaluate whether or not to replace or keep both.
**Consideration:** if 32GB VMs are as accessible as the 16GB VMs, then it may make sense to switch. If 32GB are harder to get, then perhaps keep the 3 node in place.
We have an old issue out there for multi-node testing in the gates. https://github.com/airshipit/airshipctl/issues/228
~~Can we leverage the Treasuremap resources?~~
This PS https://review.opendev.org/c/airship/airshipctl/+/815153 looks similar for multi-node deployment, but it would be better to create a new airship-core-multinode test site instead of modifying the existing test-site, so users have a choice for the deployment.
### Rook-ceph upgrade, BF deployment and Day 2 operations - Code review and discussion (Vladimir/Alexey)
Related to issues:
CPVYGR-571
CPVYGR-572
As per the decision made at the Design Call on October 5, the final POC implementation employs KRM functions to provision Argo Workflows manifests as well as to perform the upgrade/BF deployment using a DAG. A final code review and approval are needed to adopt the approach above as the default way to perform Rook-Ceph upgrade/BF related tasks.
For review and discussion:
- https://review.opendev.org/c/airship/airshipctl/+/815144
- https://review.opendev.org/c/airship/treasuremap/+/816210
### First Target Node BMH Image HREF Should Not Reference Ephemeral Host - Discussion (Josh / Drew)
Related to issues:
https://github.com/airshipit/airshipctl/issues/641
https://github.com/airshipit/airshipctl/issues/610
Per this issue, we want to change BMH node01's image url and checksum IP to reference the target cluster. During the move phase, the ephemeral node's resources are moved to the target node, but this IP is not changed.
I am looking for some suggestions to resolve this.
What has been tried:
- We tried adding "baremetalhost.metal3.io/detached" as an annotation, as we thought this would prevent node01 from reprovisioning when the BMH object is edited. When the BMH object is changed, node01 attempts to reprovision and the air-target-1 VM shuts off. Does anyone know why the air-target-1 VM would shut off?
- We also tried changing the IP address to a FQDN with the intent to update the ironic hosts, but it appears that the hosts would need to be changed before the ephemeral node is deployed. Is there a way for the baremetal operator to tell the ironic agent to change the hosts file at the initial ephemeral node deployment?
## Tuesday, October 19th (FUTURE PLACEHOLDER)
### ODIM+Airship demo (Ravi)
Introduce the work in the Anuket community that integrates ODIM into Airship 2 baremetal provisioning.
### Spike: Validate Node Label changes can be made through Metal3 BMH (Sidney S.)
Describe BaremetalHost and Node label synchronization supported by CAPM3.
Testing and validation of this feature on an Ephemeral cluster and Target cluster deployed using airshipctl with the CAPI v1alpha4 and CAPM3 v0.5.0 uplift patchsets.
This work was documented in [hackmd.io](https://hackmd.io/PEoHL01hSIaVvBIR2NzCvA?view).
### Need discussion on Plan status cmd (Bijaya)
https://github.com/airshipit/airshipctl/issues/412
Discussion/design about the real implementation of the command, and whether it is still a valid issue.
### Use KRM function to apply k8s resources (Ruslan A.)
Discuss the issue: https://github.com/airshipit/airshipctl/issues/646
Proposed PoC: https://review.opendev.org/c/airship/airshipctl/+/809291
## Tuesday, October 12 may be cancelled
## Tuesday, October(!) 5th
### Priority - Spike: Understand (and implement) Ceph upgrades - BF (Vladimir/Alexey)
Related to issues:
CPVYGR-571
CPVYGR-572
A POC approval is needed to start the implementation of Ceph upgrades.
The POC was successfully tested in a local lab; video recordings are attached to CPVYGR-571. The main idea is to deploy an additional workload via airshipctl - Argo Workflows (https://argoproj.github.io/argo-workflows/) - and accomplish the BF operations using DAG manifests.
The proof of concept mentioned above shows that the upgrade performed via Argo Workflows becomes a smooth and seamless procedure.
https://github.com/rook/rook.github.io/blob/master/docs/rook/v1.7/ceph-upgrade.md#ceph-version-upgrades
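For readers unfamiliar with Argo Workflows, a minimal sketch of the DAG shape such an upgrade workflow takes; the step names, parameters and image below are purely illustrative and are not the actual POC manifests:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: rook-ceph-upgrade-
spec:
  entrypoint: upgrade
  templates:
    - name: upgrade
      dag:
        tasks:
          - name: upgrade-operator            # step 1: bump the Rook operator
            template: run-step
            arguments:
              parameters: [{name: step, value: upgrade-operator}]
          - name: wait-ceph-healthy            # step 2: only runs after step 1 succeeds
            dependencies: [upgrade-operator]
            template: run-step
            arguments:
              parameters: [{name: step, value: health-check}]
    - name: run-step
      inputs:
        parameters:
          - name: step
      container:
        image: quay.io/example/ceph-upgrade-steps:latest   # illustrative image
        command: [sh, -c, "run-step {{inputs.parameters.step}}"]
```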
### Single-node BMO+Ironic pod (Matt)
See Pete's comment here: https://review.opendev.org/c/airship/airshipctl/+/706533/10/manifests/function/baremetal-operator/operator.yaml
Is there any reason not to combine Ironic into the BMO pod? Alan has a strong preference for this as well, and it simplifies things.
### PTG coming up
Thursday 21st, 13UTC-17UTC
Agenda: https://etherpad.opendev.org/p/airship-ptg-yoga
Registration (free): https://www.openstack.org/ptg/
## Tuesday, September 28th
### Spike: Dex OIDC Upgrade/Configuration Change in Existing Cluster (Sidney S.)
Discuss the analysis and conclusions drawn from upgrading dex-aio on a brownfield deployment.
Link for Story: https://itrack.web.att.com/browse/CPVYGR-573
Analysis, Observations and Recommendations: https://hackmd.io/4K0ds3S1S0O8uV0eTaydwA?both
### Finalize Design/Issues for Day 2 Image Delivery (Larry B./Andrew K.)
https://github.com/airshipit/airshipctl/issues/621
Issues currently created to address:
1) Creation of a new QCOW image for Day 2
2) Address the API server VIP issue in treasuremap.
We want to make sure that the proper issues are created, including a design for a VIP for Ironic and anything else necessary to support the rolling upgrade.
New issues:
1) During initial deployment, introduce a VIP for the Ironic deployment so that Ironic is accessible from more than just the first target node.
2) Demonstrate rolling upgrade of all control plane and worker nodes with new image
3) Once this is working for the Target Control Plane nodes, make modifications so the Ironic VIP from the Ephemeral Node is passed to the Target Control Plane.
Are these correct? Could we combine these, or at least combine 1 & 2, as they seem to go together? Are any others needed?
### AIAP: Support caching with limited access to the node(s) (Ian/Matt)
https://github.com/airshipit/airshipctl/issues/645
Airship-in-a-Pod has a handy caching feature which allows a developer to take the outputs of a run and re-use them in a subsequent run. This bypasses the need to rebuild resources which have time-consuming build processes such as the `airshipctl` binary.
However, this only works if the developer has access to the filesystem of the node on which AIAP is running, as it requires moving files from an output directory to a caching directory. In environments such as AKS, the developer may not have this access, preventing them from using this time-saving feature.
## Thursday, September 23rd
### Supporting multiple k8s versions simultaneously (Alexey, Matt)
* Do we need to, & what are the use cases
* What is the role of the upstream reference images / image builder definitions - what do we expect operators to override?
* If so, should we
* support multiple image-builder config definitions in parallel?
* have a single image-builder reference definition and have tags for different (older) k8s versions?
* For reference:
* https://review.opendev.org/c/airship/image-builder/+/805101/
* https://review.opendev.org/c/airship/image-builder/+/807192
### Spike Metal3.io Support for BIOS/Firmware Updates and RAID Configuration Changes (Sanjib/Saurabh)
Link for Story:
https://itrack.web.att.com/browse/CPVYGR-485
https://hackmd.io/LuE4l1PrTSaUvnOJyfKSsQ?view
Just to check the findings of current support in Metal3.io for BIOS and RAID configuration changes.
Demo for BIOS/Firmware functionality (Mahnoor A.)
### Discuss a proper place to store status map (Ruslan A.)
Related to the issue: https://github.com/airshipit/airshipctl/issues/624
Proposed location: config section of KubernetesApply executor https://review.opendev.org/c/airship/airshipctl/+/804472/16/manifests/phases/executors.yaml
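A rough sketch of what that could look like inline in the executor document; the `statusMap` field name and shape here are hypothetical (the real field names are in the linked patchset), and the `waitOptions` block only mirrors what the existing executor config looks like:
```yaml
apiVersion: airshipit.org/v1alpha1
kind: KubernetesApply
metadata:
  name: kubernetes-apply
config:
  waitOptions:
    timeout: 2000
  # hypothetical: per-kind rules describing which condition means "ready"
  statusMap:
    - apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      condition:
        type: Provisioned
        status: "True"
```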
## Tuesday, September 21st
### Spike Metal3 infrastructure provider upgrade on brownfield site #610 - Shon Phand
https://github.com/airshipit/airshipctl/issues/610
https://hackmd.io/UT9IKTDLR2u3P06axwuSvA?view
https://itrack.web.att.com/browse/CPVYGR-489
### Adding new parameter to HWCC and adding new Error type for hosts - Ashu Kumar
* The existing HWCC profile parameters are CPU, Disk, NIC and RAM. Adding two new parameters: SystemVendor and Firmware.
Links to PRs for SystemVendor and Firmware:
System Vendor: https://github.com/metal3-io/hardware-classification-controller/pull/65
Firmware: https://github.com/metal3-io/hardware-classification-controller/pull/66
* HWCC previously supported the error types Registration Error, Inspection Error, Provisioning Error and Power Management Error. Adding three new error types: Preparation Error, Provisioned Registration Error and Detach Error (proposed)
Link to the proposal submitted to the Metal3 community: https://github.com/metal3-io/metal3-docs/pull/192
## Tuesday, September 14th
### v1.21 upgrade (Andrew/Matt)
https://github.com/airshipit/airshipctl/issues/621
* Further break out into separate stories
https://github.com/airshipit/airshipctl/issues/589
* Status - Diwakar
* Getting certification
TODO: review ready v1.21 patchsets
* https://review.opendev.org/c/airship/airshipctl/+/802771
* https://review.opendev.org/c/airship/images/+/802465
TODO: Andrew creating an issue for the 1.21 recert
### Spike CAPI upgrade - Sirisha Gopigiri
https://github.com/airshipit/airshipctl/issues/609
https://hackmd.io/8OXbEcpQTY-P-aoRqFG_WA?view
## Thursday, September 9th
### CAPI v0.4.0 v1alpha -- Sirisha Gopigiri
Related to https://github.com/airshipit/airshipctl/issues/518
* CAPI v0.4.0 requires Kubernetes v1.19.1 at minimum. Do we need to wait for the Kubernetes uplift? https://github.com/airshipit/airshipctl/issues/589
Related PS: https://review.opendev.org/c/airship/image-builder/+/805101
* CAPI v0.4.1 is available. Do we have to build capm3 using that, or using v0.4.0?
Related PSs:
**With CAPI v0.4.0:**
https://review.opendev.org/c/airship/airshipctl/+/802025 - Manifests to add capi v0.4.0
https://review.opendev.org/c/airship/airshipctl/+/804834 - capm3 and capi v0.4.0
**With CAPI v0.4.1:**
https://review.opendev.org/c/airship/airshipctl/+/805164 - capi v0.4.1 manifests
https://review.opendev.org/c/airship/airshipctl/+/805167 - capm3 and capi v0.4.1
### Co-Existing Multiple versions of CAPI (v0.4.x) with Providers v0.5.x -- Sidney S.
CAPI v0.4.2 became available recently, and CAPZ announced v0.5.2 at almost the same time. It became clear that CAPI providers (capm3, capz, capo, etc.) have different release cadences, and it is a problem today for airshipctl to support them all.
All the above manifests (v0.4.0, v0.4.1, v0.4.2) can co-exist, and the use of **kustomize** in the reference test site allows picking the specific CAPI and CAP(operator) versions.
The current limitation is within the **clusterctl** KRM function, which is only able to "burn" a single version of the **clusterctl** CLI into its container image.
```dockerfile=
# <<<< single version of clusterctl CLI
ARG CCTL_VERSION=0.4.2
RUN curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v${CCTL_VERSION}/clusterctl-linux-amd64 -o /clusterctl
RUN chmod +x /clusterctl
```
In order to support multiple versions of CAPI, one approach would be to "burn" all supported **clusterctl** CLI binaries into the container image (storing them under a known location, e.g., a v0.4.x directory), then add a mechanism to the **clusterctl-init** executor to determine which version of the CLI to execute.
#### UPD (Alexey O.):
As per the discussion during the call, we decided to proceed with the following approach:
1. Instead of creating a single clusterctl krm-function with all needed versions of the clusterctl binary, we will create several krm-functions, each of which includes the needed version of clusterctl.
2. Right now we're using `localhost/clusterctl:latest`, but we're going to create `localhost/clusterctlV0.4.0:latest`, `localhost/clusterctlV0.4.1:latest` and so on. This is due to the limitations of our approach of always using the latest version of krm-functions. The changes can be done in `tools/deployment/21_systemwide_executable.sh` if needed.
3. The recommended option to build clusterctl krm-functions with the needed versions of clusterctl is to: 1. rename the current `krm-functions/clusterctl/` to `krm-functions/clusterctl-base`; 2. modify `krm-functions/clusterctl/Dockerfile` by removing the section https://github.com/airshipit/airshipctl/blob/master/krm-functions/clusterctl/Dockerfile#L5-L15 and line https://github.com/airshipit/airshipctl/blob/master/krm-functions/clusterctl/Dockerfile#L39; 3. create 2 folders, `krm-functions/clusterctlV0.4.0/` and `krm-functions/clusterctlV0.4.1/`, and put the corresponding Dockerfile there (see below); 4. update the Makefile by specifying the needed params for those images (see below).
The Dockerfile snippet should look something like this:
```
# PLUGINS_BUILD_IMAGE must be declared before it can be used in FROM
ARG PLUGINS_BUILD_IMAGE
FROM ${PLUGINS_BUILD_IMAGE} as ctls
# Inject custom root certificate authorities if needed
# Docker does not have a good conditional copy statement and requires that a source file exists
# to complete the copy function without error. Therefore the README.md file will be copied to
# the image every time even if there are no .crt files.
RUN apk update && apk add curl
COPY ./certs/* /usr/local/share/ca-certificates/
RUN update-ca-certificates
# use 0.4.1 for the v0.4.1 image
ARG CCTL_VERSION=0.4.0
RUN curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/v${CCTL_VERSION}/clusterctl-linux-amd64 -o /clusterctl
RUN chmod +x /clusterctl
FROM quay.io/airshipit/clusterctl-base as release
COPY --from=ctls /clusterctl /usr/local/bin/
```
Makefile modifications should look something like:
```
# put it instead of clusterctl_IS_INDEPENDED:=true
clusterctl-base_IS_INDEPENDED:=true
clusterctlV0.4.0_IS_INDEPENDED:=true
clusterctlV0.4.1_IS_INDEPENDED:=true
# put it after docker-image-toolbox-virsh_DEPENDENCY:=docker-image-toolbox, to make sure that clusterctl will be built first
docker-image-clusterctlV0.4.0_DEPENDENCY:=clusterctl-base
docker-image-clusterctlV0.4.1_DEPENDENCY:=clusterctl-base
```
## Thursday August 26, 2021
**TODO: put the recordings somewhere publicly accessible**
### Kernel/Driver/Package upgrade (Sreejith) [#603, #604, #605]
There are various issues created for uplifting the kernel/drivers/packages in both airshipctl and treasuremap. It's mentioned to use image-builder for this purpose. When we use image-builder, we will have to do an OS reinstallation on all the nodes, and if we want to perform this multiple times a year, it consumes a lot of time. Also, I have found that the UUID of the disk changes after reinstallation, which may cause problems with the Ceph cluster. Can't we use hostconfig-operator to perform the kernel/driver/package upgrade, and then use an updated image when performing a distro upgrade?
https://hackmd.io/@Pallav/BkU2FuWZY
TODO: K8s apiserver VIP ought to be working, but seems not to be; may be a document bug
TODO: We don't have a VIP configured for Ironic. Latest version of Ironic has support for a VIP, to select among multiple active Ironic servers - a keepalived pod. We should retest post-ironic-uplift/configuration.
We will (have) documented our findings in the POC issue, and will create a new issue to implement the ironic VIP with a dependency on the ironic uplift.
Per Arvinder: we should additionally be using a VIP to front Ironic as it moves from the ephemeral to the target cluster.
TODO: after getting the ironic VIP working in the target cluster, in a new issue, extend use of the VIP into the ephemeral cluster and validate it works over a clusterctl move
### Spike Metal3.io Support for BIOS/Firmware Updates and RAID Configuration Changes (Sanjib)
Link for Story:
https://itrack.web.att.com/browse/CPVYGR-485
https://hackmd.io/LuE4l1PrTSaUvnOJyfKSsQ?view
Just to check, only need to find current support in Metal3.io for BIOS and RAID configuration changes.
TODO: JT to set up a meeting to walk through the changes w/ the dev team
### Ironic boot over wan
Does AS2 require PXE for booting? Ironic has Redfish as an option for the boot interface in its configuration. Are there any known issues or concerns with using Redfish for booting?
TODO: send email to Richard Pioso to confirm 1) Ironic supports it
TODO: follow up with the M3 community on whether 2) Metal3 exposes it
### Re-imaging of cluster management node - Target Node1 (Pallav) [#606]
In the case of a major upgrade for any Airship 2 site (e.g. an OS upgrade), we will need to perform a re-imaging operation on existing nodes.
This operation can easily be performed for the other two control plane nodes through the metal3 rolling upgrade strategy, but various issues
have been observed when we upgrade Target Node1 through a rolling upgrade:
1. API Server VIP
The current upstream version of airshipctl doesn't provide a VIP for the API server, so when we remove Target Node1, we need to manually
update the API server IP in config maps, kubernetes conf files, etc. It would be better if we had an API server VIP so we don't need
to perform these updates.
2. Ironic bound to Target Node1
In the current version of airshipctl, the Ironic provisioning IP is hardcoded to the First Target Node API Server IP, so when we try to move
Ironic to other CP nodes, Ironic gets stuck in init-bootstrap. Also, the provisioning IP is hardcoded in the qcow image url for the control plane,
so we need to update the BMH and m3m templates in an existing site. Can we introduce a VIP for Ironic (active/passive, 3 replicas, 1 pod per node)?
Do we have any other thoughts on how to upgrade Target Node1 with minimal disruption?
https://hackmd.io/@Pallav/BkU2FuWZY
### SOPS GPG Key Management Working proposal [#586]
[WIP patch](https://review.opendev.org/c/airship/airshipctl/+/803503) & [SOPS plugin branch](https://github.com/aodinokov/kpt-functions-catalog/tree/allParams)
**Note**: The proposal is done on top of another already discussed topic - how to improve encryption/decryption. It was implemented [here](https://review.opendev.org/c/airship/airshipctl/+/794887/).
This introduces the place to store info on who can decrypt data. That means each individual or system must have its own private secret. Site manifests must contain the public part of those secrets. E.g. based on [that](https://review.opendev.org/c/airship/airshipctl/+/803503/14/manifests/site/test-site/ephemeral/catalogues/.public-keys/kustomization.yaml) only people with the listed PGP fingerprints can decrypt data:
```
literals:
# user U1, U2 and U3
- pgp=FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4,D7229043384BCC60326C6FB9D8720D957C3D3074,9DC6FBBDB3801E4E1144017138959A55322BC64B
# - hc-vault-transit=http://127.0.0.1:8200/v1/sops/keys/firstkey,http://127.0.0.1:8200/v1/sops/keys/secondkey
```
If we have a Vault server (e.g. set it up following [these steps](https://github.com/mozilla/sops#encrypting-using-hashicorp-vault)) and uncomment the hc-vault-transit line, it will also be possible to decrypt data for everyone who has access to those keys.
In order to be able to decrypt, it's only necessary to put your key [here](https://review.opendev.org/c/airship/airshipctl/+/803503/14/manifests/.private-keys/my.key) (not the recommended way)
or to export the key via env variable (as before), e.g.:
```
export SOPS_IMPORT_PGP=$(cat manifests/.private-keys/exampleU1.key)
```
or if the key is already in gpg:
```
export SOPS_IMPORT_PGP=$(gpg -a --export-secret-keys FBC7B9E2A4F9289AC0C1D4843D16CEE4A27381B4)
```
If we're using Vault, it's only necessary to do
```
vault login
```
or to already have some token, e.g.
```
export VAULT_TOKEN=toor
```
The info about which key on which server to use is already provided in the [sops metadata](https://review.opendev.org/c/airship/airshipctl/+/803503/14/manifests/site/test-site/ephemeral/catalogues/encrypted/secrets.yaml#31):
```
hc_vault:
- created_at: '2021-08-11T17:27:07Z'
enc: 'vault:v1:dTgln4Sz23VgKsMigpRssTtx7X8XB6wjCPDJGzvLRnM+LKpGnYdyppyYg4mha5mXLes5ke5RAj5CQHa5ccj+yaZFnCKihqZ1SkHDYhExXyBy9dNb2X8yDHx8Iix8Ir8icSEw+GZkG92xIbDHYxU4LgPgMAu9mQ5BUKGKv+IDpA/WKBRvvsczgVVDsuleBNnIQkxiU811RnqhYPojrPJefBcBXNsC2IgV0E9Lfo49Zm5HvOvPDaolucfteVAxIw3nTYToO/v2IV3I9X5NiWOvmYQ9JMvv83pmYgdkXlqekez4PPlADqUSZ/cW8B2UV21i46rW9Ilqui9eDv9SQMFg/xRbDu1pfXlKc4BGmUVrnH838mSCfizvNN+sX1ST6wrGtfOQA05wYtssbqRXrXbJ9dzjnkWnHWqEsTmS82uSu4tohsu29fRVwOgWfxHGKmhZuKYt2iggI/fn43CNyLgw2cRaXaXQFuTtefCAQ9toUOH4vOiZ9rDYsM8dInBukzYAcRAydZ1hVnhfm+UjfhS+e6MRDhA33BF/4VZzFW+mv9/1VzzbrZZE9x+juTDakmfcxj+Y88a8fgmkFfHpCAGnapdqpwvQ1/jomiCzLkQYPw8nRsirxDThggJBQ5IWqmINpr6wbx1A5eaepoAiGxEUTatFZdfVYL+tqO9Auz1xdvA='
engine_path: sops
key_name: firstkey
vault_address: 'http://127.0.0.1:8200'
```
SOPS will decrypt the data if at least one of the private credentials is provided.
If it's necessary to update the list of those credentials, it's done in the git repo and `airshipctl phase run secret-update` is executed in order to get secrets with new sops metadata. Of course, that has to be done by a person who owns a valid credential, because this requires decryption first.
For Vault cases it's enough to reconfigure access to the secrets inside Vault. For user exclusion it would also be good to update the key in Vault, because that secret is shared.
TODO: Review Needed on this stack:
https://review.opendev.org/c/airship/airshipctl/+/794887
https://review.opendev.org/c/airship/airshipctl/+/803503 (only WIP until SOPS change merges)
## Thursday August 19, 2021
### Spike TM#196 - Implement Multus + SRIOV support in Airship - Digambar, Manoj and Jess
Related to https://github.com/airshipit/treasuremap/issues/196
* SRIOV design proposal discussion - https://hackmd.io/tZd-vemBQ6WpUq65Gy5vcQ.
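For context, a Multus NetworkAttachmentDefinition for an SR-IOV network typically looks something like the sketch below (resource name, VLAN and IPAM values are illustrative):
```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  annotations:
    # ties the attachment to a VF pool advertised by the SR-IOV device plugin
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_netdevice
spec:
  config: |
    {
      "type": "sriov",
      "cniVersion": "0.3.1",
      "name": "sriov-net1",
      "vlan": 100,
      "ipam": {
        "type": "host-local",
        "subnet": "10.56.217.0/24"
      }
    }
```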
### Spike CPVYGR-491 - Understand CoreDNS upgrades for an existing cluster (AS608)
Related to https://itrack.web.att.com/browse/CPVYGR-491
Just to check if there are any other expectations for this user story besides checking the handling of CoreDNS as a brownfield scenario.
WIP Hackmd link https://hackmd.io/iogRxOGfSZWu7ypHq-w9mw
## Thursday August 12, 2021
Cancelled
## Thursday July 29, 2021
### Per cluster-type Image-builder configuration using kpt approach (Alexey, as per internal preliminary discussion)
We can do this per type:
* pull image-builder similarly to how we did with SIP/Vino: [example](https://review.opendev.org/c/airship/treasuremap/+/802615/1/manifests/type/multi-tenant/image-builder/Kptfile).
* update image-builder config if needed: [example](https://review.opendev.org/c/airship/treasuremap/+/802615/1/manifests/type/multi-tenant/image-builder/manifests/rootfs/multistrap-vars.yaml#51).
* Next time we need to pull changes from upstream, just execute `kpt pkg update [@<commit-id>]` from the image-builder dir and kpt will do a 3-way merge; it will be possible to see what new changes were introduced and whether they conflict with local changes. Once conflicts are resolved, just commit the changes: [example](https://review.opendev.org/c/airship/treasuremap/+/802627).
Needed changes:
* Makefile in treasuremap has to be updated in order to build images per type: [example](https://review.opendev.org/c/airship/treasuremap/+/802615/1/Makefile). Manifests have to be updated as well to use the right tag of the image (TBD)
* Image-builder has to be adjusted a bit to work with kpt: https://review.opendev.org/c/airship/images/+/802626
* POST Gating has to be updated to build images on merge (TBD).
### Upgrade kpt to v1.0.0-beta.x (Matt F)
v1.0.0-beta brings some breaking changes for our current design. This relates to [#598](https://github.com/airshipit/airshipctl/issues/598)
* concept of local package with upstream dependencies is gone; `dependencies` key in Kptfile has been deprecated
* Kptfiles must define an `upstream` repository, or they will be assumed to be dependent subpackages with a parent higher up in the directory tree that provides the root upstream repository (see the Kptfile sketch after this list)
* local changes must be committed via git before performing an update
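For reference, a per-package Kptfile under the v1 schema carries its own upstream pointer, roughly like this (repo, directory and ref are illustrative; the exact apiVersion depends on the beta release in use):
```yaml
apiVersion: kpt.dev/v1
kind: Kptfile
metadata:
  name: helm-controller
upstream:
  type: git
  git:
    repo: https://github.com/fluxcd/flux2
    directory: /manifests/install
    ref: v0.17.2
  updateStrategy: resource-merge
```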
How this affects current workflow:
* No top-level Kptfile that can be easily modified to update pkg version. Each individual package's Kptfile must be edited (e.g. manifests/function/flux/base/upstream/policies/Kptfile)
* Cannot simply run `kpt pkg update .` from root of function (e.g. flux/helm-controller) anymore. Packages must be updated independently, with git commits before and after each update
Possible solution:
* Replace function's top-level Kptfile with a script (`update.sh`?) that runs `kpt pkg update upstream/<pkgname>` for everything in each function's `upstream` directory, and makes the necessary intermediate local commits between pkg updates
### Upgrade capm3/bmo/ironic deployment [#554]
Some new features from upstream; are we interested in leveraging any/all of them? Referring to [CAPM3 releases](https://github.com/metal3-io/cluster-api-provider-metal3/releases)
* ironic TLS and basic auth
* ironic keepalived
* bmo live iso image support
Manifest pattern for upstream dependency:
* copy vs reference to upstream manifests
## Thursday July 15, 2021
#### *Carryovers from 7/8 meeting ->*
### K8s v1.21 upgrade & general uplift approach (Andrew)
When we had the v1.20 uplift scramble, we discussed the need to establish an ongoing k8s uplift cadence to keep Airship current, certified & in conformance.
This is initially for greenfield deployments, i.e. what version of K8s we are deploying out of the box. Brownfield upgrades will be handled under a different set of issues, but we will need to bring the findings of those back to ensure we have a common approach.
* With v1.20, the approach was to upgrade the version for each provider. We want to have a more general approach, if possible for v1.21 & going forward.
* Determine the timeframe by when we need to uplift the k8s version in Airshipctl (i.e. how many months after a release should we uplift?). The k8s release process is fairly mature at this point, though there do tend to be several point releases per main version release.
* Should we bake conformance testing into the version upgrades? Separate issues for testing? Testing tasks:
* Perform conformance testing (Anuket and/or CNCF) for Bare Metal against the uplifted version
* Perform conformance testing against at least one other provider (maybe all, includes: Docker, Azure, GCP & Openstack)
* Report uplift/conformance testing to CNCF to maintain certification
Updated the below issue to be more v1.21 focused with the more general approach.
https://github.com/airshipit/airshipctl/issues/589
v1.22 info
https://github.com/kubernetes/sig-release/tree/master/releases/release-1.22
Some future topics for brownfield upgrades once we've worked the spikes:
* Determine impacts/process for brownfield bare metal upgrades via CAPI.
* Determine position on backwards compatibility with prior releases (Do we support? If so, how many?)
### #491 Redesign airship cluster status command (Vladimir Kozhukalov)
https://github.com/airshipit/airshipctl/issues/491
This feature was dependent on a different feature. Here is an old discussion https://hackmd.io/BbFyJRKGRQiuXYJduPhu4Q
Current architecture cannot support this command. Vladimir's comments from the issue:
> Airship phase engine is stateless, we don't use any operator to deal with phases/phase plans. We don't deploy phase CRs anywhere. We also don't store phase run results/errors anywhere. A user who runs a phase plan can see which of the phases were successful and thus cluster status. But we can not have a separate command to gather the cluster status since we don't have such a place from where the status could be collected.
>
Discuss path forward and if we should close this issue for now.
### #597 Airship specific implementation of KRM Function Specification
https://github.com/airshipit/airshipctl/issues/597
> At the moment Airship uses two types of generic containers: krm (kyaml/runfn) and airship (airship specific docker client). kyaml/runfn has some limitations that do not allow to use it everywhere. For example, sometimes we want a container to be run from a privileged user. Also kyaml/runfn does not provide some of the functionality we potentially would like to have (e.g. running timeout).
>
Let's
- Get rid of Airship specific docker client
- Copy the code from kyaml/runfn and modify it to cover Airship specific needs (just like what KPT did)
#### *New Items ->*
### SIP#19 Configurable HAProxy in SIP (Manoj Alva)
This is related to https://github.com/airshipit/sip/issues/19 and PS https://review.opendev.org/c/airship/sip/+/799161 is put in place. Discussion needed on the following item.
* The attributes of HAProxy configuration to be considered. The PS currently considers ==timeout== parameters only
[HAProxy Config Ref] https://cbonte.github.io/haproxy-dconv/2.0/configuration.html
### #545 Generic container timeout validation & enforcement (Manoj Alva)
For the requirement "provide a compliance mechanism to validate the timeout has been acknowledged & action has been taken", need help on the scope covering this issue.
Are the requirements targeted at ensuring e2e testing of the timeout support implemented via #544, possibly via the Ginkgo framework?
## Thursday July 8, 2021
### Component Uplift Review
Ensure we have all the major components accounted for, and determine if we have any gaps.
v2.2 milestone issues list:
https://github.com/airshipit/airshipctl/issues?q=is%3Aopen+is%3Aissue+milestone%3Av2.2
K8s uplift to v1.20 / v1.21 >
* https://github.com/airshipit/airshipctl/issues/555 < Bare Metal
* https://github.com/airshipit/airshipctl/issues/556 < Docker
* https://github.com/airshipit/airshipctl/issues/569 < Azure
* https://github.com/airshipit/airshipctl/issues/570 < GCP
* https://github.com/airshipit/airshipctl/issues/571 < OpenStack
* Vulnerability issue:
* https://github.com/airshipit/airshipctl/issues/559 < etcd - should be resolved by uplift
CAPM3, BMO & Ironic to v0.4.2 -- needs
It is still against CAPI v1alpha3.
What Ironic OS?
* https://github.com/airshipit/airshipctl/issues/554
* Vulernability issues:
* https://github.com/airshipit/airshipctl/issues/558 < dependent on upstream issue https://github.com/metal3-io/ironic-image/issues/266
* https://github.com/airshipit/airshipctl/issues/560 < kube-rbac-proxy - should be resolved by uplift
CAPM3, BMO & Ironic to v0.5.0 -- needs to wait ..
It is still against CAPI v1alpha3.
BMO and Ironic are now separated.
Maybe we can drive versions?
What Ironic OS is targeted?
CAPI to v0.4 & CAPM3 to 0.5.0
* https://github.com/airshipit/airshipctl/issues/518 < need upstream to drop first. Arvinder updated today, now need CAPM3 v0.5.0 upgrade to fully implement.
CAPI (and docker Provider ) to v1alpha4 uplift [ NEW ISSUE ]
[NEW ISSUE] Explore KPT upgrade options 0.37 vs. v1.0
[NEW ISSUE] Kustomize Upgrade to the latest version (v4.2.0)
Clusterctl binary as KRM function:
* https://github.com/airshipit/airshipctl/issues/568 < removes CAPI dependencies & positions for the future; moved to v2.1
Sonobuoy to v0.51
* https://github.com/airshipit/airshipctl/issues/557 < believe this is already being used, just need patchset reviewed & merged. v2.1 issue just needs review
iLO Redfish API
* https://github.com/airshipit/airshipctl/issues/540 < Updated yesterday to v2.1
## Thursday June 24, 2021
### Incorporating Image Builder manifests into Treasuremap (Need: Matt M., Craig, Pallav)
* Today, this part of our declarative intent is really just captured as `image-builder` defaults
* How should we integrate it w/ Treasuremap?
* Need a pattern that makes sense for operators
* Should we use catalogues to build type/site-specific catalogue info into config files on qcows:
* For Example: https://review.opendev.org/c/airship/airshipctl/+/775035/45/manifests/function/ephemeral/chrony-secret.yaml
:::info
Craig : "I would suggest we re-use the pattern of parent+child zuul jobs that we currently use. The parent Zuul job is defined upstream images repo, and allows for child job (like we have downstream) to override any needed parameters for image building. This would be a good pattern as well to follow for other container image customizations (e.g., the same pattern would permit operator customization of airshipctl)"
:::
### airshipctl secret generate encryptionkey, airshipctl cluster rotate-sa-token, airshipctl cluster check-certificate-expiration commands discussion (Ruslan A.)
* rotate-sa-token, check-certificate-expiration - both commands don't work at all since they were never used in test deployments and they use the old-style API; since we have an upcoming upgrade of k8s dependencies in airshipctl, it would be difficult to upgrade these commands to make them work;
* secret generate encryptionkey was introduced 2 years ago and it's not clear what its purpose is in the current circumstances; it just generates a secure random string of variable length;
* Do we really need the above commands? If so, shouldn't they be phases rather than core airshipctl functionality?
* https://github.com/airshipit/airshipctl/issues/588
## Tuesday June 22, 2021
### Generating secrets for subclusters
* it's not implemented in treasuremap yet
* there are some params that may be good candidates for that: [ssh keys](https://github.com/airshipit/treasuremap/blob/master/manifests/type/airship-core/target/generator/secret-template.yaml#L58) and [dex clientSecret](https://github.com/airshipit/treasuremap/blob/master/manifests/type/airship-core/target/generator/secret-template.yaml#L63)
* Questions:
* should we do that? (seems like we have to if there are no other approaches)
* what master key should we use? from target? or create gpg per subcluster? (probably, per subcluster - we don't want to allow any subcluster users to decrypt target cluster creds, right?)
* should target cluster admin have access to that subcluster master keys? (probably yes, because target cluster admin will also do day1 for subcluster, right?)
Alternatively, we may even avoid storing everything in git and instead follow the Cluster API approach, where we keep everything in the target cluster?
Probably it would be great to get a 'big picture' of how encryption/decryption will work for the subcluster scenario. Let's make it together? :)
**ISSUE**: Define an integration with a GPG key management system that provides RBAC, distribution, etc., e.g. a mutating admission controller that injects keys as needed.
### VINO namespace issue.
Vino creates BMHs in a single namespace, currently in the same as vino-manager runs (vino-system).
Cluster-API CAPM3 right now requires that BMHs reside in the same namespace as m3m objects and hence KCP. So KCP needs to be in the same namespace as BMHs. With the VINO design, we can only specify 'count' per vino flavor (nodeset), so even if we add a namespace field to the VINO CR nodeset, it is still going to be one namespace for the whole vino nodeset infrastructure. For example, if we have a nodeset called `control-plane` with count=1 and 40 nodes on the site, we will end up with 40 masters in a single namespace.
### FAQs page
Similar to the Troubleshooting Guide, the documentation team has created a FAQs page to allow for community input to develop content that may go into https://docs.airshipit.org
https://hackmd.io/jIr3An6MT5C2xAQbKR3qoA
## Thursday June 17, 2021
### airshipctl exits with error when expanding controlplane nodes
* https://github.com/airshipit/airshipctl/issues/525
* It blocks the bare metal phase plan STL2 testing. Is it 2.1?
* Labeled as design needed.
### Managing Gatekeeper Policy Constraint Templates & Constraints in Treasuremap (cont'd)
* Are there any treasuremap policies we might want to enforce,
i.e. what policies do we start with?
* Restrictions on image sources?
* Restrictions on helm chart repositories?
* Any other standards for treasuremap deployments?
* Any PodSecurityPolicy implementations needed in treasuremap?
* Talk about how we might handle Audit violations in TM
* I would say some basic ones (a constraint sketch follows this list):
* https://github.com/open-policy-agent/gatekeeper-library/tree/master/library/general/allowedrepos
* https://github.com/open-policy-agent/gatekeeper-library/tree/master/library/general/containerlimits
* https://github.com/open-policy-agent/gatekeeper-library/tree/master/library/general/imagedigests
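As a concrete example of what one of these would look like in Treasuremap manifests, a minimal allowedrepos constraint (repo list illustrative; assumes the corresponding ConstraintTemplate from the gatekeeper-library is already installed):
```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-image-repos
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    repos:
      - "quay.io/airshipit/"
```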
## Tuesday June 15, 2021
### Use clusterctl as a binary inside of KRM function instead of calling API (Ruslan A.)
Discussion of the issue - https://github.com/airshipit/airshipctl/issues/568
PoC patchset - https://review.opendev.org/c/airship/airshipctl/+/793701
Diagram - https://drive.google.com/file/d/1lqTW4ALAKOJcCTCYvMEh9C7RyWxMGwME/view
### Managing Gatekeeper Policy Constraint Templates & Constraints in Treasuremap (Larry)
* Design discussion for https://github.com/airshipit/treasuremap/issues/174
* Definition of the Policy == Constraint Template
* e.g. https://github.com/open-policy-agent/gatekeeper-library/blob/master/library/pod-security-policy/users/template.yaml
manifests/function/gatekeeper/policies/
manifests/function/gatekeeper/policies/<policy-name>
manifests/function/gatekeeper/policies/<policy-name>/
manifests/function/gatekeeper/policies/<policy-name>/kustomization.yaml
manifests/function/gatekeeper/policies/<policy-name>/template
e.g. https://github.com/open-policy-agent/gatekeeper-library/tree/master/library/pod-security-policy/users
* Instance of a Policy
* e.g https://github.com/open-policy-agent/gatekeeper-library/blob/master/library/pod-security-policy/users/samples/psp-pods-allowed-user-ranges/constraint.yaml
manifests/function/gatekeeper/policies/instances/
manifests/function/gatekeeper/policies/instances/<instance-of-policy-x-name>
manifests/function/gatekeeper/policies/instances/<instance-of-policy-x-name>/kustomization.yaml
manifests/function/gatekeeper/policies/instances/<instance-of-policy-x-name>/constraint.yaml
manifests/function/gatekeeper/policies/instances/<instance-of-policy-x-name>/replacements/... || TBD if we use catalogue info for defining the constraints
* How do we define a collection of policies as a group that means something, e.g. PodSecurityPolicy ...
manifests/composite/gatekeeper/<name of policy group>
manifests/composite/gatekeeper/<name of policy group>/kustomization.yaml
.. Uses Instance of policy as resources.
manifests/composite/gatekeeper/<name of policy group>/replacements/kustomization.yaml
* When do we deliver the Policies
Will keep this as a TBD, expect we might need to deliver policies in multiple phases, yet to be determined.
* Installing gatekeeper is in the initinfra phase, ... whatever helm "thingie"
* Explore using this for policy validation : https://github.com/GoogleContainerTools/kpt-functions-catalog/tree/master/functions/go/gatekeeper
### Bake Helm Charts in Helm-Chart-Collator (Sidney S.)
Some of the service deployments relying on the (Flux) Helm operator are still pulling charts from the public repository.
Should all Helm charts used by airshipctl be "baked" in the Helm Chart Collator Docker image, so charts are not exposed to the public?
(Sean) This is being worked here: https://github.com/airshipit/treasuremap/issues/162
## Thursday June 10, 2021
### Discuss different approaches to Day2 upgrade operations (Vladimir S., Alexey O.)
Several approaches based on Ceph/Rook upgrade examples
Presentation in this hackmd note https://hackmd.io/0Sw53doBSwiOzgfzYrfmRQ
## Tuesday June 8, 2021
### RAID implementation (JT & Matt)
### NextGen secret generation (Alexey)
(Following up our conversation started [here](https://hackmd.io/QiEksO4fRk-MnBjwBFaAkQ#Tuesday-Apr-27-2021))
UPD: **we decided to proceed with these changes in 2.2**
Implementation is here: https://review.opendev.org/c/airship/airshipctl/+/794887
Requires kustomize 4.x; that's why it is based on https://review.opendev.org/c/airship/airshipctl/+/794269/
Supports:
1. Generation with different time periods (e.g. once, monthly, yearly, etc.). Includes `forced` regeneration by a given period, e.g. regenerate all `yearly` secrets. The template contains info about how often to regenerate each group.
2. All secrets: generated and externally provided - **are in 1 single file**
3. `Pinning` of a secret - this secret will be considered manually (externally) provided and won't be regenerated
4. All modifications are done with 1 phase `secret-update` + [file](https://review.opendev.org/c/airship/airshipctl/+/794887/13/manifests/site/test-site/target/encrypted/updater/import.yaml) that should contain a patch that is getting merged to the final secrets file.
The document structure can be strictly defined - we can switch from VariableCatalogue to something like SecretCatalogue with CRD and that will allow validation.
See example: https://review.opendev.org/c/airship/airshipctl/+/794887/13/manifests/site/test-site/target/encrypted/results/secrets.yaml
Each group has date of last update.
Here is a [template example](https://review.opendev.org/c/airship/airshipctl/+/794887/13/manifests/type/gating/target/regenerate-secrets/template.yaml).
The implementation introduced `functions` and `modules` for the templater. A function defined in a module can be called with the `include` function (the definition was taken from helm). A module is a document that contains function definitions. E.g. [this file](https://review.opendev.org/c/airship/airshipctl/+/794887/13/manifests/function/templater-helpers/secret-generator/lib.yaml) is included [here](https://review.opendev.org/c/airship/airshipctl/+/794887/13/manifests/site/test-site/target/encrypted/updater/kustomization.yaml#4) and contains the implementation of the function `group` that does the main magic of understanding what to regenerate, what to import, etc.
## Thursday June 3, 2021
### Treasuremap to include Gatekeeper/OPA? (Bryan)
Does it make sense to include [Gatekeeper](https://kubernetes.io/blog/2019/08/06/opa-gatekeeper-policy-and-governance-for-kubernetes/) (and [here](https://open-policy-agent.github.io/gatekeeper/website/docs/howto/)) as a "core platform" level policy agent in the reference implementation?
If so, what are the resiliency and deployed features that would be considered fundamental to the implementation? Is the [basic installation](https://open-policy-agent.github.io/gatekeeper/website/docs/install#deploying-via-helm) sufficient?
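If we install it the same way our other functions are installed (Flux), the basic installation could be as small as a HelmRepository + HelmRelease pair; a sketch, with illustrative chart version and values:
```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: gatekeeper
spec:
  interval: 10m
  url: https://open-policy-agent.github.io/gatekeeper/charts
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: gatekeeper
spec:
  interval: 5m
  chart:
    spec:
      chart: gatekeeper
      version: "3.x"
      sourceRef:
        kind: HelmRepository
        name: gatekeeper
  values:
    replicas: 3        # webhook/controller replicas for resiliency
    auditInterval: 60  # seconds between audit runs
```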
Does this provide a real path to the support of deprecation of PSP, or too early to say?
* [PodSecurityPolicy Deprecation: Past, Present, and Future](https://kubernetes.io/blog/2021/04/06/podsecuritypolicy-deprecation-past-present-and-future/)
* [PodSecurity admission (PodSecurityPolicy replacement) #2579](https://github.com/kubernetes/enhancements/issues/2579)
Create some issues :
* Issue 1 - Define a gatekeeper function in treasuremap / identify the proper phase/type, etc., where it should be deployed.
* Issue 2 - Bare policies as part of deployment that demonstrate gatekeeper is properly configured and working.
* Issue 3 - Policies that enforce PSP-like rules. Essentially remove PSPs.
### Base Images (Andrii, MattF, MattM)
We have a [story](https://github.com/airshipit/airshipctl/issues/514) for switching KRM function base images away from dockerhub. The idea was to switch to an alternative non-dockerhub base if there was a good alternative to alpine, and maintain our own mirror in quay for alpine if not. It's slightly more nuanced than that; a few things to consider:
- Alpine has a strong security stance. Do we want to keep with Alpine for that reason alone?
- We already have downstream overrides of these base images to a downstream Alpine build. Changing upstream requires changing downstream.
- We could get more cross-distro compatibility if we use `#!/bin/bash`. However, Alpine doesn't have bash, so if we want Alpine that won't work.
- We could get more cross-distro compatibility if we refactored package management (`apk` vs `apt` vs `yum`) into a shared base image. In that case we are back to maintaining our own base images, however.
- Expecting compatibility between base images is tough:
- Alpine is weird because it doesn't have `/bin/bash`
- minideb is weird because `/bin/sh` is `dash`
Matt has a patchset switching to minideb [here](https://review.opendev.org/c/airship/airshipctl/+/790865), which changes both shell conventions (bash-like vs dash-like) and package manager (`apk`->`apt`).
### Uplifting BMO/CAPM3/Ironic, CAPI & the CAPI Management Operator (Andrew - Arvinder)
**Issues:** [#554](https://github.com/airshipit/airshipctl/issues/554) & [#518](https://github.com/airshipit/airshipctl/issues/518)
**Timing:** v1alpha4 is currently available for BMO/CAPM3/Ironic. v1alpha4 for CAPI is a month or so out. There aren't dependencies between the two uplifts though an additional uplift of CAPM3 will be required when shifting to CAPI v1alpha4.
**Approach:** Let's leave #554 for the BMO, CAPM3 & Ironic uplift. Let's create a new issue or revise #518 for CAPI uplift to v1alpha4 when its available (and required CAPM3 uplift).
**Discussion:** Do we want to shift to using CAPI Management Operator at some point?
https://hackmd.io/qdTfhNj8RSuQOM0QZL0JOA#CAPI-Management-Cluster-Operator
## Tuesday June 1, 2021
### Using AGE as an alternative to PGP in SOPS (Matt, Alexey)
Pronounced "ah-gay". A recently-merged option in SOPS, recommended by the SOPS community: "age is a simple, modern, and secure tool for encrypting files. It's recommended to use age over PGP, if possible."
It has a more compact key representation. Public key (recipient):
```age1yt3tfqlfrwdwx0z0ynwplcr6qxcxfaqycuprpmy89nr83ltx74tqdpszlw```
Secret key:
```AGE-SECRET-KEY-1NJT5YCS2LWU4V4QAJQ6R4JNU7LXPDX602DZ9NUFANVU5GDTGUWCQ5T59M6```
Is it something we should consider adopting? (a `.sops.yaml` sketch follows the links below)
* https://github.com/GoogleContainerTools/kpt-functions-catalog/pull/241
* https://github.com/mozilla/sops#encrypting-using-age
* https://github.com/FiloSottile/age
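If we adopted it, wiring age into SOPS is just another recipient type in the creation rules; a minimal sketch using the example public key above (decryption would need the matching secret key, e.g. via `SOPS_AGE_KEY_FILE`):
```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*encrypted.*\.yaml$
    age: age1yt3tfqlfrwdwx0z0ynwplcr6qxcxfaqycuprpmy89nr83ltx74tqdpszlw
```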
## Thursday May 27, 2021
### Dex HelmRelease - LDAP Patch location (Sidney S.)
A while ago it was decided to split the Dex/API server configuration from Dex/LDAP, the former relying on replacement rules while the latter kustomized through Strategic Merge patch.
The split was done with the Dex/LDAP patch being implemented in the *treasuremap/manifests/type/**airship-core**/target/workload/dex-aio* folder. As the *multi-tenant* type will need a similar patch, where would be the best place to implement this patch so that it is shared between **airship-core**, **multi-tenant**, etc.?
*Could it be under composite/utility (new) where all utility patch would be added, starting with dex/ldap patch?*
>REMINDER: this patch is composed of kustomization.yaml and the patch as shown below.
kustomization.yaml
```yaml=
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../../../../function/dex-aio
patchesStrategicMerge:
- dex-aio-helm-patch.yaml
```
dex-aio-helm-patch.yaml
```yaml=
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
name: dex-aio
namespace: default
spec:
values:
params:
ldap:
bind_password: "your LDAP bind password"
name: "LDAP TEST SERVICES"
config:
host: "your LDAP FQDN"
port: 636
bind_dn: "your LDAP bind username"
bind_pw_env: LDAP_BIND_PW
username_prompt: SSO Username
user_search:
base_dn: dc=testservices,dc=test,dc=com
filter: "(objectClass=person)"
username: cn
idAttr: cn
emailAttr: name
nameAttr: name
group_search:
base_dn: ou=groups,dc=testservices,dc=test,dc=com
filter: "(objectClass=group)"
userMatchers:
userAttr: DN
groupAttr: member
nameAttr: cn
```
### Upgrade to kustomize 4.1.3 issues (Alexey O.)
Description of issues
```
kustomize 4.1.3 upgrade was difficult. It's possible to understand by the number of different options I've tried: https://review.opendev.org/c/airship/airshipctl/+/788582
The main issue appeared because of this commit in kustomize:
https://github.com/kubernetes-sigs/kustomize/commit/a2871181fe7e01b42741019c49d12d1ca24602a3 . They switched from github.com/go-openapi to k8s.io/kube-openapi and they use k8s.io/kube-openapi/pkg/validation/spec module.
(this module appeared recently - when k8s.io/kube-openapi had already switched to gnostic 0.4.1+)
There is another module called github.com/googleapis/gnostic that is used by a variety of components, including
k8s.io/kube-openapi and as well as such modules as k8s.io/apiextensions-apiserver, k8s.io/apimachinery, k8s.io/apiserver, k8s.io/client-go and k8s.io/kubectl.
The mainteiners of gnostic have made a breaking change in API here: https://github.com/google/gnostic/commit/896953e6749863beec38e27029c804e88c3144b8:
All versions <=0.4.0 had a module github.com/googleapis/gnostic/OpenAPIv2
and started >=0.4.1 github.com/googleapis/gnostic/openapiv2
The problem was that k8s.io/kube-openapi/pkg/validation/spec was added AFTER k8s.io/kube-openapi switched to github.com/googleapis/gnostic >0.4.1.
That is possible to see that it references github.com/googleapis/gnostic/openapiv2.
That means that if we switching to kustomize newer/or commit a2871181fe7e01b42741019c49d12d1ca24602a3, we
have only a choice to use github.com/googleapis/gnostic >=0.4.1.
k8s modules (k8s.io/apiextensions-apiserver, k8s.io/apimachinery, k8s.io/apiserver, k8s.io/client-go and k8s.io/kubectl) switched to
github.com/googleapis/gnostic >=0.4.1 starting version 0.18.0
Airshipctl is currently using Cluster-Api 0.3.13 (v1alpha3). The latest from v1alpha3 is 0.3.16 (see branch release 0.3)
Unfortunately the whole branch is compatible only with k8s components version 0.17.9 and if we switch to 0.18 it requires changes.
There are 2 more branches: master and 0.4, that are compatible with k8s compontents versions 0.18+.
But there are not released 0.4 branch. Since we want to stick with stable version, we should use 0.3 versions.
So, here is dependency chain:
if we switch to the latest kustomize components, we need to switch to kube-openapi that references gnostic >= 4.1 and that automatically
switches to k8s components >=0.18.0, but these components are not compatible with the latest stable releases of Cluster-API.
To make that switch possible I've created a special version of kube-openapi that works with gnostic 4.0:
https://github.com/aodinokov/airshipctl_3rdparty/commit/7e53cf79bd173bab65b996a4e5769987806e9d17#diff-8ecbe9b28adc79fb26a4c43a7eb214fa923a24551a741669d1ba010101de6022
if we replace:
k8s.io/kube-openapi => github.com/aodinokov/airshipctl_3rdparty/kube-openapi/95288971da7e v0.0.0-20210524214255-1b8a0f8bc487
it's possible to keep k8s components as is (0.17.4 and 0.17.9) and don't touch cluster-api as well.
But there is another issue that is described here:
https://github.com/kubernetes-sigs/kustomize/blob/1eb77a6cab06d5cdedf6b0e5ccbeb42a2e2010b4/cmd/depprobcheck/README.md
k8s.io/cli-runtime@v0.20.4 and older depend on sigs.k8s.io/kustomize@v2.0.3+incompatible
but if we're taking newer k8s.io/kube-openapi it can't be compiled.
Here are the changes that have to be done
https://github.com/aodinokov/airshipctl_3rdparty/commit/5ee7556e95f25866e33dfdb8e6efdacd4de3f152 +
https://github.com/aodinokov/airshipctl_3rdparty/commit/1b8a0f8bc487d96e5a0b36eddcbbda88883ac3e7
This modified version should be referenced in the replacement:
sigs.k8s.io/kustomize v2.0.3+incompatible => github.com/aodinokov/airshipctl_3rdparty/kustomize/a6f65144121d v0.0.0-20210524214255-1b8a0f8bc487
Going forward:
Everything will work if we switch to the latest versions:
e.g github.com/googleapis/gnostic 0.5.5
latest k8s.io/kube-openapi
k8s components 0.21.0+
The problem is that it causes update of cluster-api to unstable versions: e.g.:
https://review.opendev.org/c/airship/airshipctl/+/788582/29/go.mod
It causes the issue for our case:
[airshipctl] 2021/05/22 08:32:01 opendev.org/airship/airshipctl/pkg/clusterctl/client/client.go:104: Starting cluster-api initiation
[airshipctl] 2021/05/22 08:32:01 opendev.org/airship/airshipctl/pkg/events/processor.go:60: Received error on event channel {unsupported management cluster server version: v1.18.19 - minimum required version is v1.19.1}
Error events received on channel, errors are:
[unsupported management cluster server version: v1.18.19 - minimum required version is v1.19.1]
Here is that constant: https://github.com/kubernetes-sigs/cluster-api/blob/bfc6f80add5c21b8dc2b704951f42bc14708ebc4/cmd/clusterctl/client/cluster/client.go#L35
If we are creating a special version where we change this constant back, it still doesn't work with (See https://review.opendev.org/c/airship/airshipctl/+/788582/38):
[airshipctl] 2021/05/23 08:16:38 opendev.org/airship/airshipctl/pkg/clusterctl/implementations/repository.go:83: Building cluster-api provider component documents from kustomize path at /tmp/airship/airshipctl/manifests/function/capi/v0.3.7
[airshipctl] 2021/05/23 08:16:38 opendev.org/airship/airshipctl/pkg/events/processor.go:60: Received error on event channel {failed to read "metadata.yaml" from the repository for provider "cluster-api": document filtered by selector [Group="clusterctl.cluster.x-k8s.io", Version="v1alpha3", Kind="Metadata"] found no documents}
Error events received on channel, errors are:
[failed to read "metadata.yaml" from the repository for provider "cluster-api": document filtered by selector [Group="clusterctl.cluster.x-k8s.io", Version="v1alpha3", Kind="Metadata"] found no documents]
That version of cluster-api isn't released and may be unstable.
That's why the simplest and easiest way of using latest kustomize - is with additional 3rd-party modified modules:
https://github.com/aodinokov/airshipctl_3rdparty. We will need to create that repo in airshipctl.. or maybe use a directory in airshipctl, but
it will make issues for linter.
Once the new version of Cluster-Api is released that will support 0.21.0+ versions of k8s components,
we'll need to switch to it. In addition it will be necessary to upgrade k8s that we deploy at least to v1.19.1
```
**Short-term solution:** upgrade kustomize to 4.1.2 (this will be much easier).
**Long-term solution (preliminary):** the Cluster API 0.4.x release is planned for June-July (it may slip further, though). Once it's released, our plan will be to upgrade our gating to k8s at least 1.19.1 (0.4 has that [dependency](https://github.com/kubernetes-sigs/cluster-api/blob/bfc6f80add5c21b8dc2b704951f42bc14708ebc4/cmd/clusterctl/client/cluster/client.go#L35)), and AFTER that try to upgrade the k8s modules in go.mod to 0.21.0. That should resolve all dependency issues.
## Tuesday May 25, 2021
### Clusterctl move & the CA
The target cluster CA that is used for generating Dex certs is currently generated on the ephemeral cluster, then moved over to the target cluster via `clusterctl move`. It is created in the `default` NS, so that's where it winds up; can we create it in a more appropriate namespace, or is `default` special somehow?
- `clusterctl move` moves from ephemeral NS X to target NS X
- The KubeadmControlPlane that creates it is put in the default NS
- clusterctl move can only move resources from a single NS at a time
- `clusterctl move` has an option to change the target NS (don't need to use this)
- We want to be consistent in naming between target & subclusters
- E.g. `target-infra`, `lma-infra`, etc
- Then, just put the resources for the target cluster in the ephemeral cluster's `target-infra` NS, and they'll end up where they should go (see the sketch below)
- Need issue: implement the above and make sure it works
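A minimal sketch of that convention, using a hypothetical Secret (the name, label, and data are illustrative, not the actual Dex CA resource): create it in `target-infra` on the ephemeral cluster and `clusterctl move` will land it in the same namespace on the target cluster.
```yaml
# Sketch only: illustrative Secret showing the namespace convention
apiVersion: v1
kind: Secret
metadata:
  name: dex-ca                 # hypothetical name
  namespace: target-infra      # same NS name is used on both ephemeral and target clusters
  labels:
    # opt-in label for objects clusterctl move doesn't already pick up via owner references;
    # may be unnecessary for secrets that are already moved
    clusterctl.cluster.x-k8s.io/move: ""
type: Opaque
stringData:
  ca.crt: "<placeholder>"
```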
### Enabling Physical Disks and Controllers parameters for RAID Configuration - Zainub/Mahnoor
This is related to the demo presented by Noor last year on RAID configurations in Metal3.
https://review.opendev.org/c/airship/airshipctl/+/749043/6/manifests/function/hardwareprofile-example/hardwareprofile.yaml
Support for RAID configurations for bare metal servers has been added. The link for this is:
https://github.com/metal3-io/baremetal-operator/pull/292
Furthermore, we want to extend BMH with disk names and RAID controllers. While doing this, we learned that we need to extend BMH first, as the Ironic API does not contain such information right now.
https://github.com/metal3-io/baremetal-operator/issues/206
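For context, a rough sketch of the RAID shape the linked BMO change adds to the BareMetalHost spec (field names follow the upstream Metal3 API as I understand it; the Airship hardwareprofile wrapper may expose this differently). Note there is no field yet for naming specific physical disks or a RAID controller, which is the extension being discussed:
```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node01                     # illustrative
spec:
  raid:
    hardwareRAIDVolumes:
      - name: os-volume            # illustrative volume name
        level: "1"
        numberOfPhysicalDisks: 2
        sizeGibibytes: 500
        # no way today to pin this volume to specific physical disks or a controller
```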
airship-discuss@lists.airshipit.org
[Snia Swordfish](https://www.snia.org/forums/smi/swordfish)
- Cross platform redfish extension for managing disk things
- Does M3 or Ironic make use of this / plan to? - not sure, will follow up
- Look at this for inspiration for how we want Airship manifests to look
How will M3 handle bad disks when doing raid config?
- We may want to be tolerant of some level of disk failure
- E.g. a rule like "I need Xgb"
- We'll need to understand how Ironic and Metal3 will handle this scenario, and then build on top how we want Airship to behave. Zainub will follow up with the M3 community.
## Thursday May 20, 2021
### Generic Container timeout implementation details
https://github.com/airshipit/airshipctl/issues/544
### Function-specific catalogues (Matt, Sidney)
Let's talk about our approach to function-specific catalogues.
1. What kinds of config do we want to target to normal patches, vs. config we want to put in catalogues?
- Site-specific stuff should go in catalogues
- Highly duplicated stuff should go in catalogues
2. Do we want to scope catalogues to individual functions, or broader categories?
- Eg: we added a `networking-ha` catalogue for VIP configuration, rather than an `ingress` (function name) catalogue. I think we've added related data to that catalogue since.
Dex example: **https://review.opendev.org/c/airship/treasuremap/+/791835/4/manifests/site/test-site/target/catalogues/dex-aio.yaml**
Catalogue Conventions
https://hackmd.io/HM-CNNIuRIm2MseaL523eA
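To make the trade-off concrete, a hypothetical function-scoped catalogue for dex might look like the sketch below. The kind and the deploy-k8s label mirror the existing catalogue pattern; how the values are laid out inside the document is purely illustrative:
```yaml
# Hypothetical function-scoped catalogue; value layout is illustrative only
apiVersion: airshipit.org/v1alpha1
kind: VariableCatalogue
metadata:
  name: dex-aio
  labels:
    airshipit.org/deploy-k8s: "false"   # data-only document, not deployed to the cluster
# site-specific and highly duplicated values would live here and be replaced
# into the function's documents by replacement rules
ldap:
  host: ldap.example.com               # placeholder
  port: 636
```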
### Using AGE as an alternative to PGP in SOPS (Matt, Alexey)
Pronounced "ah-gay". A recently-merged option in SOPS, recommended by the SOPS community: "age is a simple, modern, and secure tool for encrypting files. It's recommended to use age over PGP, if possible."
It has a more compact key representation - public key:
```age1yt3tfqlfrwdwx0z0ynwplcr6qxcxfaqycuprpmy89nr83ltx74tqdpszlw```
Secret key:
```AGE-SECRET-KEY-1NJT5YCS2LWU4V4QAJQ6R4JNU7LXPDX602DZ9NUFANVU5GDTGUWCQ5T59M6```
Is it something we should consider adopting?
* https://github.com/GoogleContainerTools/kpt-functions-catalog/pull/241
* https://github.com/mozilla/sops#encrypting-using-age
* https://github.com/FiloSottile/age
## Tuesday May 18, 2021
### Target-state PhasePlans
We currently have phase plans named `phasePlan` and `iso`. We should true that up for the v2.1 release.
* We'd discussed greenfield vs brownfield before. Do we want to call those something like `deploy` and `upgrade`?
* An alternative which would avoid duplicate phase lists between the two phase groups would be `ephemeral` vs `deploy` or `target`, where ephemeral takes you up through clusterctl initing the target cluster (upgrade and greenfield look the same after that)
* Vlad's work to change scripting into generic containers means that we have VM-specific phases too: `virsh-eject-cdrom-images`, `virsh-destroy-vms`.
* We have subcluster-specific phases, but I believe we still need to create subcluster-specific PhasePlans for them
Conclusion: create `deploy`, `deploy-virt`, and `manage-secrets` phasePlans at the type level for now (see the sketch below). In the future we can add (as needed) things like `upgrade`, `update`, `release` (for generating qcows), `rotate-secrets`, etc.
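A minimal sketch of what the type-level `deploy` plan could look like (the phase names are illustrative placeholders drawn from the current phase set, not a final list):
```yaml
apiVersion: airshipit.org/v1alpha1
kind: PhasePlan
metadata:
  name: deploy
description: "Greenfield deployment, ephemeral bootstrap through target cluster workloads"
phases:
  - name: initinfra-ephemeral        # illustrative phase names
  - name: clusterctl-init-ephemeral
  - name: controlplane-ephemeral
  - name: clusterctl-move
  - name: workers-target
```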
### Can we archive the old content in this agenda?
HackMD is getting really slow for me :)
## Thursday May 13, 2021
### Discuss Design for [Make executors respect timeouts (airshipctl #533, v2.1)]( https://github.com/airshipit/airshipctl/issues/533)
- Only the KubernetesApply executor respects the timeout.
- The solution might need to be executor-specific:
    - **Generic Container**
        - Issue 1 - Generic Container implementation; interface agreement discussed on the call.
        - Issue 2 - Wait interface agreement verification and enforcement will be a future issue.
    - **ClusterCtl**
        - Issue raised with the clusterctl community:
        - [Clusterctl should wait for providers to be available.](https://github.com/kubernetes-sigs/cluster-api/issues/4474)
### Troubleshooting Guide HackMD
Purpose: provide a more accessible, flexible & dynamic way of capturing troubleshooting information.
https://hackmd.io/Nbc4XF6mQBmutMX_FEs51Q
We can gauge usage & see if there's value in transposing this into our formal documentation suite or if this is sufficient.
Secondary topic: do we have a comprehensive list of all errors produced by airshipctl?
## Tuesday May 11, 2021
### Discuss Design for [Make executors respect timeouts (airshipctl #533, v2.1)]( https://github.com/airshipit/airshipctl/issues/533)
### Discuss next steps for KRM function gating/version management
Below are notes from a smaller-group meeting on a proposed approach for https://github.com/airshipit/airshipctl/issues/524.
- Decided against moving KRM functions out of airshipctl and into their own repo(s). This reduces some complexity in version management. KRM image versions would be managed to match the Airship release version.
Recommended approach:
Task 1:
- Update airshipctl to build KRM function images and tag locally with some local tag (not latest)
- Update all references in KRM functions in manifests to reference local tagged images so that all gating will use current KRM function code
- Update treasuremap the same way to use the local tags so that the versions of the KRM functions will match the airshipctl version specified by ${AIRSHIPCTL_REF}.
- During release process, KRM functions would be pinned with same version as airshipctl and release script would update manifests with pinned version.
- Treasuremap manifests would also be updated with pinned version of KRM functions during release process.
- Downstream sites could follow similar process or pin to specific versions.
Task 2:
- Externalize the KRM function image versions such that versions are only specified in one place. See https://github.com/airshipit/airshipctl/issues/524#issuecomment-833004449
- https://review.opendev.org/c/airship/airshipctl/+/790507
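For Task 1 above, the manifest-side change is essentially pointing each KRM function's container annotation at the locally built tag during gating, and at the pinned release tag at release time; a rough sketch (the kind, image name, and tag are placeholders):
```yaml
# Sketch: a KRM function declaration referencing a locally built image for gating
apiVersion: airshipit.org/v1alpha1
kind: ReplacementTransformer          # example function kind
metadata:
  name: example-replacements
  annotations:
    config.kubernetes.io/function: |
      container:
        # hypothetical local tag used in gating; the release script would rewrite this
        # to the version pinned to the airshipctl release
        image: localhost/replacement-transformer:local-gate
```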
### Generate user guide command links (Sirisha Gopigiri)
Related to https://github.com/airshipit/airshipctl/issues/281
PS: https://review.opendev.org/c/airship/airshipctl/+/789775
* Does the PS address the issue?
* Do we need to add a custom help template to all commands? This would update the golden help output.
## Tuesday May 4, 2021
### CLI documentation (Sirisha Gopigiri)
Documentation structure in issue https://github.com/airshipit/airshipctl/issues/280
To render the documents properly on https://docs.airshipit.org/airshipctl/cli/airshipctl.html, two approaches are proposed; which one should we take?
* https://review.opendev.org/c/airship/airshipctl/+/784358/20 - Implemented custom code to generate the docs appropriate to docs.airshipit.org website
* https://review.opendev.org/c/airship/airshipctl/+/789250 - Custom code + cobra.doc.GenReSTTreeCustom function used
### Generate CA certificate/Secret from a known authority (Sidney S.)
The API server/OIDC authenticator plugin is configured with a CA certificate. When using a CA generated without being signed by a known authority I get the error **Unable to connect to the server: x509: certificate signed by unknown authority**.
- What is the mechanism to auto generate a CA signed by a known authority and keep it secure?
- How can the generated signed CA be secured, e.g., SOPS encryption?
### Splitting up the KRM toolbox image (Vlad/Matt)
https://review.opendev.org/c/airship/images/+/786664
## Thursday Apr 29, 2021
### Lifecycle management of Airship KRM functions (Sean, Matt)
The new `:v2` container tag is a moving tag, but it only moves when a git tag is pushed to the repo.
* How frequently / under what conditions do we want to push a tag to update the `:v2` tag?
* Probably more frequently than our "big" releases
* How do we want to test/validate the updated container?
* Test via a DNM patchset using `:latest` functions?
* At the moment `:v2` is 22 days old
* Base image changes for RT have broken `:latest`
* How do we improve KRM function gating: https://github.com/airshipit/airshipctl/issues/524
* Can we externalize KRM function versions: https://github.com/airshipit/airshipctl/issues/419#issuecomment-829267689
### CLI documentation (Kostiantyn Kalynovskyi)
Documentation structure in issue https://github.com/airshipit/airshipctl/issues/280
We have a patch set with a script that changes MarkDown format to ReST:
* https://review.opendev.org/c/airship/airshipctl/+/784358/20
* I've put a -2; please help me understand if I am wrong here.
    * The script inside is complex and has no tests: hard to support and understand.
    * Why not use the doc.GenReSTTree() function, which generates reST documentation in a similar fashion to what was done before?
### Generate CA certificate/Secret from a known authority (Sidney S.)
The API server/OIDC authenticator plugin is configured with a CA certificate. When using a CA generated without being signed by a known authority I get the error **Unable to connect to the server: x509: certificate signed by unknown authority**.
What is the mechanism to auto generate a CA signed by a known authority and keep it secure?
### Continued pleas for assistance with Troubleshooting Guide - Andrew K
Lots of opportunities to contribute!
## Tuesday Apr 27, 2021
### Validate encryption/decryption design of externally provided secrets (Alexey O.)
* working patchset: https://review.opendev.org/c/airship/airshipctl/+/786286
* Discussion image https://go.gliffy.com/go/publish/13490574
* Dex/Ldap - will be the first user of that (Sidney)
* It would be great to discuss and simplify that if possible
* There may be additional requirements for encryption we didn't set before: we may want to be able to override generated secrets with manually set ones. In that case even this approach won't work and we should instead (or in addition) use an encrypted SMP. The only issue with 'in addition' is that it adds one more SOPS call for decryption. On the other hand, the encrypted SMP approach has an issue: it won't be possible to render without decryption.
* Consider simplification on overrides (maybe get rid of them by default?)
* Nice to have - move manifests/site/test-site/target/generator/results/decrypt-secrets/ to functions (functions/sops/gpg_decrypt) or type, so we don't copy it everywhere.
## Thursday Apr 15, 2021
### Airship 2.0 Troubleshooting Guide Continued - Andrew K
* WIP patchset - https://review.opendev.org/c/airship/docs/+/786062
* What do we want to target for v2.1? This will be evolutionary.
* Collaborators needed
* Related question: do we log what phase is being executed when it starts/ends?
### FQDN resolution - Andrii O.
* Some software requires FQDN resolution from the hostname. /etc/hosts and the systemd-resolved config need to be populated accordingly during deployment.
* Arvinder:
* I see three possible approaches to doing this:
1. host-config operator: this is best when updates need to happen dynamically on existing Kubernetes nodes.
2. pre/post kubeadm hooks in KubeadmConfigSpec: you can write custom scripts that, for example, patch /etc/hosts accordingly. For example, you can use KubeadmConfigSpec.Files to create one or more files in, say, the /etc/hosts.d directory and then, during PreKubeadmCommands, run something along the lines of `cat /etc/hosts.d/*.conf > /etc/hosts` (see the sketch after this list).
3. Metal3DataTemplate: the above KubeadmConfig approach applies the same configuration across all nodes in a KCP or MD. Metal3DataTemplate provides more flexibility by allowing some of the configuration to be node specific. https://github.com/metal3-io/metal3-docs/blob/master/design/metadata-handling.md
:::warning
We are already using #3 for networking etc, its unclear to me that this helps with the original question. FQDN/Systemd changes etc.
:::
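A minimal sketch of option 2 above, assuming a CAPI v1alpha3 KubeadmControlPlane (the file path, content, and command mirror the description and are illustrative; other required fields are omitted):
```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: cluster-controlplane          # illustrative
spec:
  # replicas, version, infrastructureTemplate, etc. omitted for brevity
  kubeadmConfigSpec:
    files:
      # placeholder address/FQDN below; real values would come from site manifests
      - path: /etc/hosts.d/fqdn.conf
        permissions: "0644"
        content: |
          10.23.24.10 node01.example.com node01
    preKubeadmCommands:
      - cat /etc/hosts.d/*.conf > /etc/hosts   # assemble /etc/hosts before kubeadm runs
```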
## Tuesday Apr 13, 2021
### Hostconfig-operator integration (Sreejith P)
While integrating HCO with treasuremap, we found that we need to annotate the secret onto nodes and we also need to have a specific label. What would be the best way to annotate nodes? Also, would it be best to add a mechanism to override the default labels in HCO via manifests?
- Metal3 Feature for Label Cascading: https://github.com/airshipit/airshipctl/issues/377
### Replacing into multiple targets (Reddy / Matt)
There are use cases where we need to ReplacementTransform the same source data into multiple target paths -- e.g. replacing an IP address into many network policy rules. It would be helpful for the RT to support this natively. Some options:
1) add a `targets` list as an alternative to `target`
   * Let's open an issue to support this: improve the transformer to support `target` as a slice (see the sketch below): https://github.com/airshipit/airshipctl/blob/7998615a7b5847c367c29874641c8422157ebb52/pkg/document/plugin/replacement/transformer.go#L77
2) allow for global substring replacements across a document, in addition to the path-based substring replacement we have
   * Issue: open this for a future priority; specify a pattern that does not have a specific target.
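A sketch of what option 1 could look like in a ReplacementTransformer config. The `targets` list is the proposed addition (today only a single `target` is supported); object names and field paths are placeholders:
```yaml
# Sketch of the proposed `targets` list; names and field paths are placeholders
apiVersion: airshipit.org/v1alpha1
kind: ReplacementTransformer
metadata:
  name: ip-replacements
replacements:
  - source:
      objref: {kind: VariableCatalogue, name: networking}   # placeholder source document
      fieldref: spec.vip                                     # placeholder field path
    targets:                                                 # proposed list form of `target`
      - objref: {kind: NetworkPolicy, name: policy-a}
        fieldrefs: ["spec.egress[0].to[0].ipBlock.cidr"]
      - objref: {kind: NetworkPolicy, name: policy-b}
        fieldrefs: ["spec.egress[0].to[0].ipBlock.cidr"]
```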
### PTG later this month
* Thurs, Apr 22 1300 UTC (8:00 AM CDT) - 1700 UTC (12:00 PM CDT)
* Fri, Apr 23 1300 UTC (8:00 AM CDT) - 1700 UTC (12:00 PM CDT)
* Open agenda: https://etherpad.opendev.org/p/xena-ptg-airship
* Registration (free): https://april2021-ptg.eventbrite.com
### Need to discuss document pull command behavior (Kozhukalov)
Related issues
* https://github.com/airshipit/airshipctl/issues/416
* https://github.com/airshipit/airshipctl/issues/417
Patch is ready for review (probably needs rebasing)
* https://review.opendev.org/c/airship/airshipctl/+/767571
The patch implements two things:
* Default target path is ${HOME}/.airship/<manifest_name> (i.e. ${HOME}/.airship/default)
* If target path is not defined in the config for a manifest, then the current directory is used to pull the manifests repo
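For orientation, a rough sketch of the airship config `manifests` entry this behavior applies to (the exact schema may differ; the repo URL and paths are placeholders):
```yaml
# ~/.airship/config (sketch): the manifest entry whose target path the patch defaults
manifests:
  default:
    repositories:
      primary:
        url: https://opendev.org/airship/airshipctl   # illustrative
    # with the patch, this would default to ${HOME}/.airship/<manifest_name>
    # (${HOME}/.airship/default here) when not explicitly set
    targetPath: /home/user/.airship/default
```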
## Thursday Apr 8, 2021
### Discuss the document-validation solution for airshipctl (Ruslan A.)
https://hackmd.io/t2mxDiB3TdGXI8B6gDtA-Q
### Discuss "dex-aio" Implementation Short-/Long-term (Sidney S.)
Approach to discuss described in https://hackmd.io/bdPFHBBSQy-IrpPe1U9itg
### Align on approach to troubleshooting guide (Andrew K)
Use the life cycle states as a high level framework:
- Initialize State
- Prepare State
- Physical Validation
- Bootstrap State
- Target Cluster Lifecycle
- Target Workload Lifecycle
- Sub-clusters (separate state or combine with Target?)
- Day 2 Operations
The proposed approach would be ~~to list the phases/steps within each higher level lifecycle state & then~~ to reference the relevant troubleshooting areas listed below within the life cycle states. Generally speaking, here's what you need to look at within a phase (based on the executor, do x, y, z).
Troubleshooting areas:
- Manifests:
    - Documents
    - Kustomize
    - Kustomize-KRM-Plugins
        - Our troubleshooting for Replacement, SOPS
- Running phases: how to debug a failed phase, where to start, which logs to read > focus from the phase perspective.
    - Identify the phase
    - Understand Phase.yaml
    - Identify the executor
    - Understand Executor.yaml
    - Given the executor type, different guidance:
        - Is it generic?
            - Which one: sops, gpg, kubeval, image builder, etc.
        - Is it k8s?
        - Is it clusterctl?
- Cluster-API & Kubernetes Deployment: grouped together as the k8s deployment is done by Cluster-API
    - Proxy settings
- Networking: is this too broad/complex/specific to individual use cases?
- Helm Charts: Helm Operator & Helm Chart Collator debugging
- Image Builder: base image generation debugging, ISO/QCOW generation & application debugging
- Host Config Operator (may be a future topic)
- Sub-Clusters?
- Services/Applications
    - i.e. CEPH
    - DEX
    - LMA stuff
    - ... point to their documentation
Assuming our other documentation will provide details on what each phase does, would it make sense to incorporate troubleshooting into the deployment guide so you have a one-stop shop, or keep it separate so it doesn't clutter the deployment guide?
We created an issue for this quite a while back: [#328](https://github.com/airshipit/airshipctl/issues/328). It references this TM v1 debug report script as a potential starting point. Is this still valid? https://github.com/airshipit/treasuremap/blob/master/tools/gate/debug-report.sh
Next Steps:
- Create WIP Patchset with documentation framework
- Identify SMEs for each troubleshooting area who can help contribute
- Try to get a basic framework & some initial content by the EOM for v2.1 & continue to build on it
## Tuesday Apr 6, 2021
### Discuss the document-validation solution for airshipctl (Ruslan A.)
Review the following commits:
https://review.opendev.org/q/topic:%22add-validation-phases%22+(status:open)
### Discuss "dex-aio" certifcate generated by Cert-Manager (Sidney S.)
Approach to discuss described in https://hackmd.io/bdPFHBBSQy-IrpPe1U9itg
## Thursday Apr 1, 2021
### Discuss the rook-ceph cluster implementation (Vladimir S.)
- Review the initial commit https://review.opendev.org/c/airship/treasuremap/+/784184
- Discuss which rook-ceph components should be deployed by default
- Discuss the further downstream/WHARF work
- Set failure domain to host by default
### Place for scripts (such as waiting scripts) that are currently in tools/deployment (KKalynovskyi)
We have a new pattern of waiting and adding new scripts to gate/deployments: https://review.opendev.org/c/airship/airshipctl/+/782520
As an example, we placed the script here, but it is now test-site specific; we need a place that can be shared between all sites:
https://github.com/airshipit/airshipctl/tree/master/manifests/site/test-site/phases/helpers