Design Meetings
Meeting Links
Can be found here https://wiki.openstack.org/wiki/Airship#Get_in_Touch
Archive
Agenda/notes from prior to 2021-04-01 can be found here.
Troubleshooting Guide & FAQs HackMDs
Purpose: provide a more accessible, flexible & dynamic way of capturing troubleshooting information & frequently asked questions. Depending on the amount of content (or lack thereof) these may be combined in the future.
https://hackmd.io/Nbc4XF6mQBmutMX_FEs51Q
https://hackmd.io/jIr3An6MT5C2xAQbKR3qoA
Feel free to add content to the pages. Thanks!
Administrative
Recordings
Old Etherpad https://etherpad.openstack.org/p/Airship_OpenDesignDiscussions
Design Needed - Issues List
Tuesday, November 30th
Continue discussion about new kubeconfig workflow (Ruslan A / Alexey O)
Related to issue: https://github.com/airshipit/airshipctl/issues/666.
Reviewing particular problems with the current kubeconfig approach and ways to solve them using the new solution.
Tuesday, November 16th
AS 2.1 Issue: First Target Node BMH Image HREF Should Not Reference Ephemeral Host - Discussion (Josh / Drew)
Related to issues:
https://github.com/airshipit/airshipctl/issues/641
Per this issue, we want to change BareMetalHost (BMH) node01's image url and checksum IP to reference the target-cluster. During the move phase, the ephemeral node's resources are moved to the target node, but this IP is not changed.
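For reference, the fields in question live on the BareMetalHost resource; a minimal sketch (the IP and paths below are hypothetical examples, not the site's real values):

```yaml
# Sketch of the BMH image section; 10.23.25.101 stands in for the ephemeral
# host address that should instead reference the target cluster.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node01
spec:
  image:
    url: http://10.23.25.101/images/control-plane.qcow2
    checksum: http://10.23.25.101/images/control-plane.qcow2.md5sum
```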
Discuss implementations of potential solutions.
Possible solution:
- `PROVISIONING_IP` environment variable used during target expansion. The image url and checksum were unreachable.
- `assignedIP` variable in the `manage-keepalived.sh` file here.
- Q: `PROVISIONING_IP` env variable?
- Q: `keepalivedIP` here to mimic the environment variable workflow?

Other considerations that may not need discussion:
- The `baremetalhost.metal3.io/detached` annotation or triggering a rolling update could be possible solutions, but these do not seem desirable because we either leave the BMH node in an unmanaged state (due to a reprovision being triggered if the annotation is removed), or we lose data during the rollout.

New kubeconfig workflow (Ruslan A/Alexey O)
Discuss current kubeconfig issues and introduce new kubeconfig workflow (github link with detailed proposal - https://github.com/airshipit/airshipctl/issues/666, proposed PS - https://review.opendev.org/c/airship/airshipctl/+/816617).
Create Gatekeeper function TM#167 - (Shon , Snehal)
https://github.com/airshipit/treasuremap/issues/167
With change in the direction of treasuremap, should we create gatekeeper in treasuremap or move it to airshipctl?
PSPs are being deprecated as of kubernetes v1.21 and will be removed in v1.25. We will need a replacement for PSP.
Per design discussion 6/17/21, the Gatekeeper function should be included in the multi-tenant type and applied during the initinfra phase.
Is this still valid as we are deprecating multi-tenant sites?
Tuesday, November 9th
First Target Node BMH Image HREF Should Not Reference Ephemeral Host - Discussion (Josh / Drew)
Related to issues:
https://github.com/airshipit/airshipctl/issues/641
https://github.com/airshipit/airshipctl/issues/610
Per this issue, we want to change BareMetalHost (BMH) node01's image url and checksum IP to reference the target-cluster. During the move phase, the ephemeral node's resources are moved to the target node, but this IP is not changed.
Proposed PS: https://review.opendev.org/c/airship/airshipctl/+/815757
Discuss implications of the `baremetalhost.metal3.io/detached` annotation or other possible solutions: https://github.com/metal3-io/baremetal-operator/blob/master/docs/api.md#detaching-hosts
Findings: the `baremetalhost.metal3.io/detached` annotation prevents the initial target BMH node from reprovisioning when the BMH object is edited (last week's issues were due to my environment, not the annotation), which allowed me to update the image url and checksum IPs in testing.
Other considerations: `test-site` only deploys one CP and one Worker in the target-cluster. Also, there is information on the deployed CP that would be lost with a rolling update.
Q: Does leaving the target BMH node annotated with `baremetalhost.metal3.io/detached` seem like a viable solution?
Q: Is there a preferred solution, of the three possible ones mentioned here?
Q: Other suggestions?
Tuesday, November 2nd
Multi-Node site/testing (Pallav/Andrew K)
Recently opened issue https://github.com/airshipit/airshipctl/issues/652 for a multi-node CP AiaP deployment (3 control plane nodes / 2 workers). We need a new multi-node test site, an "airship-core-multinode" type in airshipctl, to be able to do this. By having a new multi-node test site, we give users the opportunity to test scenarios like rolling upgrades, HA, etc. Discussion: look at leveraging the existing TM manifests to see what can be reused in developing this in airshipctl. Create a separate airshipctl gate job to use the 32GB nodes. This is part of #652.
Q: Should the 5 node replace the existing 3 node, or run in parallel?
A: Let's get the 5 node in place & then evaluate whether or not to replace or keep both.
Consideration: if 32GB VMs are as accessible as the 16GB VMs, then it may make sense to switch. If 32GB are harder to get, then perhaps keep the 3 node in place.
We have an old issue out there for multi-node testing in the gates. https://github.com/airshipit/airshipctl/issues/228
Can we leverage the Treasuremap resources? This PS https://review.opendev.org/c/airship/airshipctl/+/815153 looks similar for multi-node deployment, but it would be better to have a new airship-core-multinode test site created instead of modifying the existing test-site, so users have a choice for the deployment.
Rook-ceph upgrade, BF deployment and Day 2 operations - Code review and discussion (Vladimir/Alexey)
Related to issues:
CPVYGR-571
CPVYGR-572
As per the decision made at the Design Call on October 5, the final implementation of the POC employs KRM functions to provision Argo Workflows manifests, as well as to perform the upgrade/BF deployment using a DAG. There is a need for a final code review and approval to consider the approach mentioned above as the default way to perform Rook-Ceph upgrade/BF related tasks.
For review and discussion:
First Target Node BMH Image HREF Should Not Reference Ephemeral Host - Discussion (Josh / Drew)
Related to issues:
https://github.com/airshipit/airshipctl/issues/641
https://github.com/airshipit/airshipctl/issues/610
Per this issue, we want to change BMH node01's image url and checksum IP to reference the target cluster. During the move phase, the ephemeral node's resources are moved to the target node, but this IP is not changed.
I am looking for some suggestions to resolve this.
What has been tried:
Tuesday, October 19th (FUTURE PLACEHOLDER)
ODIM+Airship demo (Ravi)
Introduce the work in the Anuket community that integrates ODIM into Airship 2 baremetal provisioning.
Spike: Validate Node Label changes can be made through Metal3 BMH (Sidney S.)
Describe BaremetalHost and Node label synchronization supported by CAPM3.
Test and validation of this feature based on Ephemeral cluster and Target cluster deployed using airshipctl based on CAPI v1alpha4 and CAPM3 v0.5.0 uplift patchsets.
This work was documented in hackmd.io.
Need discussion on Plan status cmd (Bijaya)
https://github.com/airshipit/airshipctl/issues/412
Discussion/design about the real implementation of the command, and whether it is still a valid issue.
Use KRM function to apply k8s resources (Ruslan A.)
Discuss the issue: https://github.com/airshipit/airshipctl/issues/646
Proposed PoC: https://review.opendev.org/c/airship/airshipctl/+/809291
Tuesday, October 12 may be cancelled
Tuesday, October(!) 5th
Priority - Spike: Understand (and implement) Ceph upgrades - BF (Vladimir/Alexey)
Related to issues:
CPVYGR-571
CPVYGR-572
There is a need for POC approval to start the implementation of Ceph upgrades.
POC was successfully tested in a local lab, video recordings are attached to the CPVYGR-571. The main idea is to deploy via airshipctl an additional workload - Argo Workflows (https://argoproj.github.io/argo-workflows/) - and accomplish the BF operations using DAG manifests.
The proof of concept mentioned above shows that the upgrade performed via Argo Workflows becomes a smooth and seamless procedure.
https://github.com/rook/rook.github.io/blob/master/docs/rook/v1.7/ceph-upgrade.md#ceph-version-upgrades
Single-node BMO+Ironic pod (Matt)
See Pete's comment here: https://review.opendev.org/c/airship/airshipctl/+/706533/10/manifests/function/baremetal-operator/operator.yaml
Is there any reason not to combine Ironic into the BMO pod? Alan has a strong preference for this as well, and it simplifies things.
PTG coming up
Thursday 21st, 13UTC-17UTC
Agenda: https://etherpad.opendev.org/p/airship-ptg-yoga
Registration (free): https://www.openstack.org/ptg/
Tuesday, September 28th
Spike: Dex OIDC Upgrade/Configuration Change in Existing Cluster (Sidney S.)
Discuss the analysis, and the conclusions drawn from it, when upgrading dex-aio on a brownfield deployment.
Link for Story: https://itrack.web.att.com/browse/CPVYGR-573
Analysis, Observations and Recommendations: https://hackmd.io/4K0ds3S1S0O8uV0eTaydwA?both
Finalize Design/Issues for Day 2 Image Delivery (Larry B./Andrew K.)
https://github.com/airshipit/airshipctl/issues/621
Issues currently created to address:
Want to make sure that the proper issues are created and designed for the VIP for Ironic, and any others necessary to support the rolling upgrade.
New issues:
Are these correct? Could we combine these, or at least combine 1 & 2 as they seem to go together. Are any others needed?
AIAP: Support caching with limited access to the node(s) (Ian/Matt)
https://github.com/airshipit/airshipctl/issues/645
Airship-in-a-Pod has a handy caching feature which allows a developer to take the outputs of a run and re-use them in a subsequent run. This bypasses the need to rebuild resources which have time-consuming build processes, such as the `airshipctl` binary.
However, this only works if the developer has access to the filesystem of the node on which AIAP is running, as it requires moving files from an output directory to a caching directory. In environments such as AKS, the developer may not have this access, preventing them from using this time-saving feature.
Thursday, September 23rd
Supporting multiple k8s versions simultaneously (Alexey, Matt)
Spike Metal3.io Support for BIOS/Firmware Updates and RAID Configuration Changes (Sanjib/Saurabh)
Link for Story:
https://itrack.web.att.com/browse/CPVYGR-485
https://hackmd.io/LuE4l1PrTSaUvnOJyfKSsQ?view
Just to check the findings of current support in Metal3.io for BIOS and RAID configuration changes.
Demo for BIOS/Firmware functionality (Mahnoor A.)
Discuss a proper place to store status map (Ruslan A.)
Related to the issue: https://github.com/airshipit/airshipctl/issues/624
Proposed location: config section of KubernetesApply executor https://review.opendev.org/c/airship/airshipctl/+/804472/16/manifests/phases/executors.yaml
Tuesday, September 21st
Spike Metal3 infrastructure provider upgrade on brownfield site #610 - Shon Phand
https://github.com/airshipit/airshipctl/issues/610
https://hackmd.io/UT9IKTDLR2u3P06axwuSvA?view
https://itrack.web.att.com/browse/CPVYGR-489
Adding new parameter to HWCC and adding new Error type for hosts - Ashu Kumar
Links of PRs for SystemVendor and Firmware
System Vendor: https://github.com/metal3-io/hardware-classification-controller/pull/65
Firmware: https://github.com/metal3-io/hardware-classification-controller/pull/66
Link of Proposal submitted to Metal3 community: https://github.com/metal3-io/metal3-docs/pull/192
Tuesday, September 14th
v1.21 upgrade (Andrew/Matt)
https://github.com/airshipit/airshipctl/issues/621
https://github.com/airshipit/airshipctl/issues/589
TODO: review ready v1.21 patchsets
TODO: Andrew creating an issue for the 1.21 recert
Spike CAPI upgrade - Sirisha Gopigiri
https://github.com/airshipit/airshipctl/issues/609
https://hackmd.io/8OXbEcpQTY-P-aoRqFG_WA?view
Thursday, September 9th
CAPI v0.4.0 v1alpha – Sirisha Gopigiri
Related to https://github.com/airshipit/airshipctl/issues/518
CAPI v0.4.0 requires Kubernetes v1.19.1 at minimum. Do we need to wait for the kubernetes uplift https://github.com/airshipit/airshipctl/issues/589?
Related PS: https://review.opendev.org/c/airship/image-builder/+/805101
CAPI v0.4.1 is available. Do we have to build capm3 using that, or using v0.4.0?
Related PSs:
With CAPI v0.4.0:
https://review.opendev.org/c/airship/airshipctl/+/802025 - Manifests to add capi v0.4.0
https://review.opendev.org/c/airship/airshipctl/+/804834 - capm3 and capi v0.4.0
With CAPI v0.4.1:
https://review.opendev.org/c/airship/airshipctl/+/805164 - capi v0.4.1 manifests
https://review.opendev.org/c/airship/airshipctl/+/805167 - capm3 and capi v0.4.1
Co-Existing Multiple versions of CAPI (v0.4.x) with Providers v0.5.x – Sidney S.
CAPI v0.4.2 became available recently, and CAPZ announced v0.5.2 at almost the same time. It became clear that CAPI providers (capm3, capz, capo, etc) have different release speeds, and it is a problem today for airshipctl to support them all.
All the above manifests (v0.4.0, v0.4.1, v0.4.2) can co-exist, and the use of kustomize in the reference test site allows picking the specific CAPI and CAP(operator) versions.
The current limitation is within the clusterctl krm function, which only has the ability to "burn" a single version of the clusterctl CLI into its container image.
In order to support multiple versions of CAPI, one approach would be to "burn" all supported clusterctl CLI binaries (and store them under a known location, e.g., v0.4.x directory) to the container image, then add a mechanism to the clusterctl-init executor to determine the version of CLI to execute.
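That version-selection mechanism could be as small as a wrapper in the executor image; a minimal sketch, assuming binaries are stored under per-version directories (the paths and naming below are hypothetical, not the actual airshipctl implementation):

```shell
#!/bin/sh
# Hypothetical sketch: resolve which "burned" clusterctl binary to execute,
# assuming each supported release lives in its own versioned directory.
select_clusterctl() {
  version="${1:-v0.4.0}"                      # default to the oldest supported version
  echo "/usr/local/bin/${version}/clusterctl"
}

select_clusterctl v0.4.1                      # prints /usr/local/bin/v0.4.1/clusterctl
```

The clusterctl-init executor would then exec the resolved path instead of a fixed `clusterctl` binary.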
UPD (Alexey O.): as per discussion during the call we decided to proceed with the following approach:
- We're not going to use a single `localhost/clusterctl:latest`; instead we're going to create `localhost/clusterctlV0.4.0:latest`, `localhost/clusterctlv0.4.1:latest` and so on. This is due to the limitations of our approach to always use the latest version of krm-functions. The changes can be done in `tools/deployment/21_systemwide_executable.sh` if needed.
- 1. Rename `krm-functions/clusterctl/` to `krm-functions/clusterctl-base`; 2. modify `krm-functions/clusterctl/Dockerfile` by removing section https://github.com/airshipit/airshipctl/blob/master/krm-functions/clusterctl/Dockerfile#L5-L15 and line https://github.com/airshipit/airshipctl/blob/master/krm-functions/clusterctl/Dockerfile#L39; 3. create 2 folders `krm-functions/clusterctlV0.4.0/` and `krm-functions/clusterctl0.4.1/` and put the corresponding Dockerfile there (see below); 4. update the Makefile by specifying the needed params for those images (see below).

Dockerfile snippet should look something like this:
Makefile modifications should look something like:
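The referenced snippets did not survive in these notes; a hedged sketch of what a per-version Dockerfile could look like (the base image name, ARG names, and download URL pattern are assumptions):

```dockerfile
# Hypothetical krm-functions/clusterctlV0.4.0/Dockerfile: reuse the shared
# clusterctl-base image and pin a single clusterctl release.
ARG BASE_IMAGE=localhost/clusterctl-base:latest
FROM ${BASE_IMAGE}
ARG CCTL_VERSION=v0.4.0
RUN curl -fsSL -o /usr/local/bin/clusterctl \
      "https://github.com/kubernetes-sigs/cluster-api/releases/download/${CCTL_VERSION}/clusterctl-linux-amd64" \
 && chmod +x /usr/local/bin/clusterctl
```

The Makefile change would then presumably add one image target per supported version, passing the version (and image tag) as build args.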
Thursday August 26, 2021
TODO: put the recordings somewhere publicly accessible
Kernel/Driver/Package upgrade (Sreejith) [#603, #604, #605]
There are various issues created for uplifting the kernel/driver/packages in both airshipctl and treasuremap. It's mentioned to use image-builder for this purpose. When we use image-builder, we will have to do OS reinstallation on all the nodes, and if we want to perform this multiple times a year, it consumes a lot of time. Also, I have found that the UUID of the disk changes after reinstallation, which may cause problems with the ceph cluster. Can't we use hostconfig-operator to perform the kernel/driver/package upgrade, and then use an updated image when performing a distro upgrade?
https://hackmd.io/@Pallav/BkU2FuWZY
TODO: K8s apiserver VIP ought to be working, but seems not to be; may be a document bug
TODO: We don't have a VIP configured for Ironic. Latest version of Ironic has support for a VIP, to select among multiple active Ironic servers - a keepalived pod. We should retest post-ironic-uplift/configuration.
We will (have) documented our findings in the POC issue, and will create a new issue to implement the ironic VIP with a dependency on the ironic uplift.
Per Arvinder: we should additionally be using a VIP to front Ironic as it moves from the ephemeral to the target cluster.
TODO: after getting the ironic VIP working in the target cluster, in a new issue, extend use of the VIP into the ephemeral cluster and validate it works over a clusterctl move
Spike Metal3.io Support for BIOS/Firmware Updates and RAID Configuration Changes (Sanjib)
Link for Story:
https://itrack.web.att.com/browse/CPVYGR-485
https://hackmd.io/LuE4l1PrTSaUvnOJyfKSsQ?view
Just to check, only need to find current support in Metal3.io for BIOS and RAID configuration changes.
TODO: JT to set up a meeting to walk through the changes w/ the dev team
Ironic boot over wan
Does AS2 require pxe for booting? Ironic has Redfish as an option for the boot interface in its configuration. Are there any known issues or concerns to use Redfish for booting?
TODO: send email to Richard Pioso to confirm 1) Ironic supports it
TODO: follow up with M3 community on whether 2) Metal3 exposes it
Re-imaging of cluster management node - Target Node1 (Pallav) [#606]
In case of major upgrade for any airship2 site (e.g. OS upgrade), we will need to perform re-imaging operation on existing nodes.
This operation can be easily performed for the other two control plane nodes through the metal3 rolling upgrade strategy, but various issues have been observed
when we upgrade Target Node1 through a rolling upgrade:
API Server VIP
The current upstream version of airshipctl doesn't provide a VIP for the API server, so when we remove Target Node1, we need to manually
update the API server IP in config maps, kubernetes conf files, etc. It would be better if we had an API server VIP so we don't need
to perform these updates.
Ironic bound to Target Node1
In the current version of airshipctl, the Ironic provisioning IP is hardcoded to the First Target Node API Server IP, so when we try to move
Ironic to other CP nodes, Ironic gets stuck in init-bootstrap. The provisioning IP is also hardcoded in the qcow image url for the control plane,
so we need to update the BMH and m3m template in the existing site. Can we introduce a VIP for Ironic (active/passive, 3 replicas, 1 pod per node)?
Do we have any other thoughts on how to upgrade Target Node1 with minimal disruption?
https://hackmd.io/@Pallav/BkU2FuWZY
SOPS GPG Key Management Working proposal [#586]
WIP patch & SOPS plugin branch
Note: The proposal is done on top of another already discussed topic - how to improve encryption/decryption. It was implemented here.
This introduces the place to store info on who can decrypt data. That means each individual or system must have its own private secret. Site manifests must contain the public part of those secrets. E.g. based on that, only people with listed PGP fingerprints can decrypt data:
If we have a vault server (e.g. try to make it work with these steps) and uncomment the hc-vault-transit line, it will also be possible to decrypt data for everyone who has access to those keys.
In order to be able to decrypt, it's only necessary to put your key here (not a recommended way),
or to export the key via an env variable (as before), e.g.:
or if the key is already in gpg:
If we're using Vault, it's only necessary to do
or to already have some token, e.g.
the info about which key to use, and on which server it lives, is already provided in the sops metadata:
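For illustration, the standard gpg/Vault commands behind these options look roughly like this (the `SOPS_IMPORT_PGP` variable name and the addresses are assumptions):

```shell
# Import a private PGP key into the local keyring (the "already in gpg" case)
gpg --import my-private-key.asc

# Or hand the key over via an environment variable (as before)
export SOPS_IMPORT_PGP="$(cat my-private-key.asc)"

# For Vault: either log in interactively...
vault login
# ...or provide an existing token and server address
export VAULT_ADDR="https://vault.example.com:8200"
export VAULT_TOKEN="s.example-token"
```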
SOPS will decrypt the data as long as at least one of the private credentials is provided.
In case it's necessary to update the list of those credentials, it's done in the git repo, and `airshipctl phase run secret-update` is executed in order to get secrets with new sops-metadata. Of course that has to be done by a person who owns a valid credential, because this requires decryption first.
For Vault cases it's enough to just reconfigure access to the secrets inside Vault. For user exclusion it would also be good to update the key in Vault, because that secret is shared.
TODO: Review Needed on this stack:
https://review.opendev.org/c/airship/airshipctl/+/794887
https://review.opendev.org/c/airship/airshipctl/+/803503 (only WIP until SOPS change merges)
Thursday August 19, 2021
Spike TM#196 - Implement Multus + SRIOV support in Airship - Digambar, Manoj and Jess
Related to https://github.com/airshipit/treasuremap/issues/196
Spike CPVYGR-491 - Understand CoreDNS upgrades for an existing cluster (AS608)
Related to https://itrack.web.att.com/browse/CPVYGR-491
Just to check if there are any other expectations from this US, besides checking the handling of CoreDNS as a brownfield scenario.
WIP Hackmd link https://hackmd.io/iogRxOGfSZWu7ypHq-w9mw
Thursday August 12, 2021
Cancelled
Thursday July 29, 2021
Per cluster-type Image-builder configuration using kpt approach (Alexey, as per internal preliminary discussion)
We can do this per type: run `kpt pkg update [@<commit-id>]` from the image-builder dir, and kpt will do a 3-way merge - it will be possible to see what new changes were introduced and whether they conflict with local changes. Once conflicts are resolved, just commit the changes: example.

Needed changes:
Upgrade kpt to v1.0.0-beta.x (Matt F)
v1.0.0-beta brings some breaking changes for our current design. This relates to #598
- The `dependencies` key in Kptfile has been deprecated.
- Packages must declare their own `upstream` repository, or they will be assumed to be dependent subpackages with a parent higher up in the directory tree that provides the root upstream repository.

How this affects current workflow:
- We can't run `kpt pkg update .` from the root of a function (e.g. flux/helm-controller) anymore. Packages must be updated independently, with git commits before and after each update.

Possible solution:
- A script (`update.sh`?) that runs `kpt pkg update upstream/<pkgname>` for everything in each function's `upstream` directory, and makes the necessary intermediate local commits between pkg updates.

Upgrade capm3/bmo/ironic deployment [#554]
Some new features from upstream; are we interested in leveraging any/all of them? Referring to CAPM3 releases.
Manifest pattern for upstream dependency:
Thursday July 15, 2021
Carryovers from 7/8 meeting ->
K8s v1.21 upgrade & general uplift approach (Andrew)
When we had the v1.20 uplift scramble, we discussed the need to establish an ongoing k8s uplift cadence to keep Airship current, certified & in conformance.
This is initially for greenfield deployments, i.e. what version of K8s we are deploying out of the box. Brownfield upgrades will be handled under a different set of issues, but we will need to bring the findings of that back to ensure we have a common approach.
Updated the below issue to be more v1.21 focused with the more general approach.
https://github.com/airshipit/airshipctl/issues/589
v1.22 info
https://github.com/kubernetes/sig-release/tree/master/releases/release-1.22
Some future topics for brownfield upgrades once we've worked the spikes:
#491 Redesign airship cluster status command (Vladimir Kozhukalov)
https://github.com/airshipit/airshipctl/issues/491
This feature was dependent on a different feature. Here is an old discussion https://hackmd.io/BbFyJRKGRQiuXYJduPhu4Q
Current architecture cannot support this command. Vladimir's comments from the issue:
Discuss path forward and if we should close this issue for now.
#597 Airship specific implementation of KRM Function Specification
https://github.com/airshipit/airshipctl/issues/597
Let's
New Items ->
SIP#19 Configurable HAProxy in SIP (Manoj Alva)
This is related to https://github.com/airshipit/sip/issues/19 and PS https://review.opendev.org/c/airship/sip/+/799161 is put in place. Discussion needed on the following item.
[HAProxy Config Ref] https://cbonte.github.io/haproxy-dconv/2.0/configuration.html
#545 Generic container timeout validation & enforcement (Manoj Alva)
For the requirement "provide a compliance mechanism to validate the timeout has been acknowledged & action has been taken", need help on the scope covering this issue.
Are the requirements targeted at ensuring e2e testing of the timeout support implemented via #544, possibly via the Ginkgo framework?
Thursday July 8, 2021
Component Uplift Review
Ensure we have all the major components accounted for, and determine if we have any gaps.
v2.2 milestone issues list:
https://github.com/airshipit/airshipctl/issues?q=is%3Aopen+is%3Aissue+milestone%3Av2.2
K8s uplift to v1.20 / v1.21 >
CAPM3, BMO & Ironic to v0.4.2 – needs
Still against capi v1alpha3.
What Ironic OS?
CAPM3, BMO & Ironic to v0.5.0 – needs to wait ..
Still against capi v1alpha3
BMO and Ironic are now separated.
Maybe we can drive versions ?
What Ironic OS is targeted?
CAPI to v0.4 & CAPM3 to 0.5.0
CAPI (and docker Provider ) to v1alpha4 uplift [ NEW ISSUE ]
[NEW ISSUE] Explore KPT upgrade options 0.37 vs. v1.0
[NEW ISSUE] Kustomize Upgrade to the latest version (v4.2.0)
Clusterctl binary as KRM function:
Sonobuoy to v0.51
iLO Redfish API
Thursday June 24, 2021
Incorporating Image Builder manifests into Treasuremap (Need: Matt M., Craig, Pallav)
`image-builder` defaults
Craig: "I would suggest we re-use the pattern of parent+child zuul jobs that we currently use. The parent Zuul job is defined in the upstream images repo, and allows for a child job (like we have downstream) to override any needed parameters for image building. This would be a good pattern as well to follow for other container image customizations (e.g., the same pattern would permit operator customization of airshipctl)"
`airshipctl secret generate encryptionkey`, `airshipctl cluster rotate-sa-token`, `airshipctl cluster check-certificate-expiration` commands discussion (Ruslan A.)
Tuesday June 22, 2021
Generating secrets for subclusters
Alternatively, we may even avoid storing everything in git - instead follow the ClusterAPI approach, where we keep everything in the target cluster?
Probably it would be great to get a 'big picture' of how encryption/decryption will work for the subcluster scenario. Let's make it together? :)
ISSUE: Define an integration with a system to manage GPG keys that provides RBAC, distribution, etc. I.e. such as an Admission Controller / Mutating Admission Controller that injects keys as needed.
VINO namespace issue.
Vino creates BMHs in a single namespace, currently in the same as vino-manager runs (vino-system).
Cluster-API CAPM3 right now requires that BMHs reside in the same namespace as m3m objects, and hence KCP. So KCP needs to be in the same namespace as BMHs. With the VINO design, we can only specify 'count' per vino flavor (nodeset), so even if we add a namespace field to the VINO CR nodeset, it is still going to be one namespace per whole vino nodeset infrastructure. For example, if we have a nodeset called `control-plane` with count=1, and 40 nodes on the site, we will end up with 40 masters in a single namespace.

FAQs page
Similar to the Troubleshooting Guide, the documentation team has created a FAQs page to allow for community input to develop content that may go into https://docs.airshipit.org
https://hackmd.io/jIr3An6MT5C2xAQbKR3qoA
Thursday June 17, 2021
airshipctl exits with error when expanding controlplane nodes
Managing Gatekeeper Policy Constraint Templates & Constraints in Treasuremap (cont'd)
i.e. What policies do we start with?
Tuesday June 15, 2021
Use clusterctl as a binary inside of KRM function instead of calling API (Ruslan A.)
Discussion of the issue - https://github.com/airshipit/airshipctl/issues/568
PoC patchset - https://review.opendev.org/c/airship/airshipctl/+/793701
Diagram - https://drive.google.com/file/d/1lqTW4ALAKOJcCTCYvMEh9C7RyWxMGwME/view
Managing Gatekeeper Policy Constraint Templates & Constraints in Treasuremap (Larry)
Design discussion for https://github.com/airshipit/treasuremap/issues/174
Definition of the Policy == Constraint Template
manifests/function/gatekeeper/policies/
manifests/function/gatekeeper/policies/<policy-name>
manifests/function/gatekeeper/policies/<policy-name>/
manifests/function/gatekeeper/policies/<policy-name>/kustomization.yaml
manifests/function/gatekeeper/policies/<policy-name>/template
e.g. https://github.com/open-policy-agent/gatekeeper-library/tree/master/library/pod-security-policy/users
manifests/function/gatekeeper/policies/instances/
manifests/function/gatekeeper/policies/instances/<instance-of-policy-x-name>
manifests/function/gatekeeper/policies/instances/<instance-of-policy-x-name>/kustomization.yaml
manifests/function/gatekeeper/policies/instances/<instance-of-policy-x-name>/constraint.yaml
manifests/function/gatekeeper/policies/instances/<instance-of-policy-x-name>/replacements/… || TBD if we use catalogue info for defining the constraints
How do we define a collection of policies as a group that means something, e.g. PodSecurityPolicy …
manifests/composite/gatekeeper/<name of policy group>
manifests/composite/gatekeeper/<name of policy group>/kustomization.yaml
.. Uses Instance of policy as resources.
manifests/composite/gatekeeper/<name of policy group>/replacements/kustomization.yaml
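A composite for a policy group could then be wired up along these lines (a sketch; the `pod-security` group name is a placeholder following the layout above):

```yaml
# Hypothetical manifests/composite/gatekeeper/pod-security/kustomization.yaml
resources:
  - ../../../function/gatekeeper/policies/<policy-name>
  - ../../../function/gatekeeper/policies/instances/<instance-of-policy-x-name>
```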
When do we deliver the Policies
Will keep this as a TBD, expect we might need to deliver policies in multiple phases, yet to be determined.
Installing gatekeeper is in the initinfra phase, … whatever helm "thingie"
Explore using this for policy validation : https://github.com/GoogleContainerTools/kpt-functions-catalog/tree/master/functions/go/gatekeeper
Bake Helm Charts in Helm-Chart-Collator (Sidney S.)
Some of the service deployments relying on the (Flux) Helm operator are still pulling charts from the public repository.
Should all Helm charts used by airshipctl be "baked" into the Helm Chart Collator Docker image, so charts are not exposed to the public?
(Sean) This is being worked here: https://github.com/airshipit/treasuremap/issues/162
Thursday June 10, 2021
Discuss different approaches to Day2 upgrade operations (Vladimir S., Alexey O.)
Several approaches based on Ceph/Rook upgrade examples
Presentation in this hackmd note https://hackmd.io/0Sw53doBSwiOzgfzYrfmRQ
Tuesday June 8, 2021
RAID implementation (JT & Matt)
NextGen secret generation (Alexey)
(Following up our conversation started here)
UPD: we decided to proceed with those changes in 2.2
Implementation is here: https://review.opendev.org/c/airship/airshipctl/+/794887
Requires kustomize 4.x: that's why based on https://review.opendev.org/c/airship/airshipctl/+/794269/
Supports:
- `forced` regeneration on a per-period basis, e.g. regenerate all `yearly` secrets. The template contains info about how often to regenerate each group.
- Pinning of a secret - this secret will be considered as manually (externally) provided and won't be regenerated.
- `secret-update` + a file that should contain a patch that is getting merged into the final secrets file.

The document structure can be strictly defined - we can switch from VariableCatalogue to something like SecretCatalogue with a CRD, and that will allow validation.
See example: https://review.opendev.org/c/airship/airshipctl/+/794887/13/manifests/site/test-site/target/encrypted/results/secrets.yaml
Each group has date of last update.
Here is a template example.
The implementation introduced `functions` and `modules` for the templater. A function defined in a module can be called with the `include` function (the definition was taken from helm). A module is a document that contains function definitions. E.g. this file is included here and contains the implementation of the function `group` that does the main magic of understanding what to regenerate, what to import, etc.

Thursday June 3, 2021
Treasuremap to include Gatekeeper/OPA? (Bryan)
Does it make sense to include Gatekeeper (and here) as a "core platform" level policy agent in the reference implementation?
If so, what are the resiliency and deployed features that would be considered fundamental to the implementation? Is the basic installation sufficient?
Does this provide a real path to supporting the deprecation of PSP, or is it too early to say?
Create some issues:
Base Images (Andrii, MattF, MattM)
We have a story for switching KRM function base images away from dockerhub. The idea was to switch to an alternative non-dockerhub base if there was a good alternative to alpine, and maintain our own mirror in quay for alpine if not. It's slightly more nuanced than that; a few things to consider:
- Our scripts start with `#!/bin/bash`. However, Alpine doesn't have bash, so if we want Alpine that won't work.
- We could abstract the package manager differences (`apk` vs `apt` vs `yum`) into a shared base image. In that case we are back to maintaining our own base images, however.
- `/bin/bash` vs `/bin/sh`: in Debian-based images, `/bin/sh` is `dash`.
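As an illustration of the shell-convention gap (not from the patchset itself): bash-only syntax such as `[[ ... ]]` breaks under dash, while POSIX constructs run in both. A minimal sketch:

```shell
# Bash-only pattern test (fails under dash, where /bin/sh is dash):
#   [[ "$name" == web* ]] && echo match
# POSIX-portable equivalent that works in bash, dash, and ash (Alpine):
name="webserver"
case "$name" in
  web*) echo match ;;
  *)    echo nomatch ;;
esac
# → match
```

Running this under either `bash` or `dash` prints `match`, which is one reason the choice of base image shell matters.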
Matt has a patchset switching to minideb here, which changes both shell conventions (bash-like vs dash-like) and package manager (`apk` -> `apt`).

Uplifting BMO/CAPM3/Ironic, CAPI & the CAPI Management Operator (Andrew - Arvinder)
).Uplifting BMO/CAPM3/Ironic, CAPI & the CAPI Management Operator (Andrew - Arvinder)
Issues: #554 & #518
Timing: v1alpha4 is currently available for BMO/CAPM3/Ironic. v1alpha4 for CAPI is a month or so out. There aren't dependencies between the two uplifts, though an additional uplift of CAPM3 will be required when shifting to CAPI v1alpha4.
Approach: Let's leave #554 for the BMO, CAPM3 & Ironic uplift. Let's create a new issue or revise #518 for the CAPI uplift to v1alpha4 when it's available (and the required CAPM3 uplift).
Discussion: Do we want to shift to using CAPI Management Operator at some point?
https://hackmd.io/qdTfhNj8RSuQOM0QZL0JOA#CAPI-Management-Cluster-Operator
Tuesday June 1, 2021
Using AGE as an alternative to PGP in SOPS (Matt, Alexey)
Pronounced "ah-gay". A recently-merged option in SOPS, recommended by the SOPS community: "age is a simple, modern, and secure tool for encrypting files. It's recommended to use age over PGP, if possible."
It has a more compact representation than PGP. Note that `age1…` strings are public keys (recipients) and `AGE-SECRET-KEY-…` strings are private keys:
Public key:
age1yt3tfqlfrwdwx0z0ynwplcr6qxcxfaqycuprpmy89nr83ltx74tqdpszlw
Secret key:
AGE-SECRET-KEY-1NJT5YCS2LWU4V4QAJQ6R4JNU7LXPDX602DZ9NUFANVU5GDTGUWCQ5T59M6
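If we adopted it, a `.sops.yaml` creation rule could reference an age recipient instead of a PGP fingerprint — a sketch using the `age1…` key above as the recipient (the `path_regex` is illustrative):

```yaml
# Sketch: SOPS creation rule using an age recipient (paths illustrative)
creation_rules:
  - path_regex: .*/encrypted/.*\.yaml$
    age: age1yt3tfqlfrwdwx0z0ynwplcr6qxcxfaqycuprpmy89nr83ltx74tqdpszlw
    encrypted_regex: ^(data|stringData)$
```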
Is it something we should consider adopting?
Thursday May 27, 2021
Dex HelmRelease - LDAP Patch location (Sidney S.)
A while ago it was decided to split the Dex/API server configuration from Dex/LDAP, the former relying on replacement rules while the latter kustomized through Strategic Merge patch.
The split was done with the Dex/LDAP patch implemented in the treasuremap/manifests/type/airship-core/target/workload/dex-aio folder. As the multi-tenant type will need a similar patch, where would be the best place to put it so that it can be shared between airship-core, multi-tenant, etc.?
Could it be under composite/utility (new), where all utility patches would be added, starting with the dex/ldap patch?
kustomization.yaml
dex-aio-helm-patch.yaml
Upgrade to kustomize 4.1.3 issues (Alexey O.)
Description of issues
Short-term solution:
upgrade kustomize to 4.1.2 (this will be much easier)
Long-term solution (preliminary):
The cluster-api 0.4.x release is planned for June-July (it may slip further, though). Once it's released, our plan will be:
upgrade our gating to k8s at least 1.19.1 (0.4 has that dependency), and AFTER that try to upgrade the k8s modules in go.mod to 0.21.0. That should resolve all dependency issues.
Tuesday May 25, 2021
Clusterctl move & the CA
The target cluster CA that is used for generating Dex certs is currently generated on the ephemeral cluster, then `clusterctl move`d over to target. It is created in the `default` NS, so that's where it winds up; can we create it in a more appropriate namespace, or is `default` special somehow?
- `clusterctl move` moves from ephemeral NS X to target NS X
- `clusterctl move` has an option to change the target NS (we don't need to use this)
- So we can create the secrets directly in the `target-infra`, `lma-infra`, etc. namespaces: create them in the `target-infra` NS, and they'll end up where they should go

Enabling Physical Disks and Controllers parameters for RAID Configuration - Zainub/Mahnoor
This is related to the demo presented by Noor last year on RAID configurations in Metal3.
https://review.opendev.org/c/airship/airshipctl/+/749043/6/manifests/function/hardwareprofile-example/hardwareprofile.yaml
Support of RAID configurations for Baremetal Servers has been added. The link for this is:
https://github.com/metal3-io/baremetal-operator/pull/292
Furthermore, we want to extend BMH for disk names and RAID controllers. While doing this, we realized we need to extend BMH first, as the Ironic API does not expose such information right now.
https://github.com/metal3-io/baremetal-operator/issues/206
airship-discuss@lists.airshipit.org
SNIA Swordfish
How will M3 handle bad disks when doing RAID config?
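Per the linked baremetal-operator PR, hardware RAID is expressed on the BareMetalHost spec; a hedged sketch (host and volume names illustrative — the physical-disk/controller fields discussed above do not exist yet):

```yaml
# Sketch of Metal3 hardware RAID config (per baremetal-operator PR #292);
# names are illustrative, and disk/controller selection is not yet supported.
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node01
spec:
  raid:
    hardwareRAIDVolumes:
      - name: os-volume
        level: "1"              # RAID 1 mirror
        numberOfPhysicalDisks: 2
```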
Thursday May 20, 2021
Generic Container timeout implementation details
https://github.com/airshipit/airshipctl/issues/544
Function-specific catalogues (Matt, Sidney)
Let's talk about our approach to function-specific catalogues.
E.g. we have a `networking-ha` catalogue for VIP configuration, rather than an `ingress` (function name) catalogue. I think we've added related data to that catalogue since.
Dex example: https://review.opendev.org/c/airship/treasuremap/+/791835/4/manifests/site/test-site/target/catalogues/dex-aio.yaml
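For reference, a data-centric catalogue — named for the data's purpose rather than the consuming function — might look roughly like this (shape and values are made up for illustration):

```yaml
# Illustrative data-centric catalogue, named for the data (networking-ha)
# rather than the consuming function (ingress); values are made up.
apiVersion: airshipit.org/v1alpha1
kind: VariableCatalogue
metadata:
  name: networking-ha
spec:
  vip:
    ingress: 10.23.25.102
    oam: 10.23.25.101
```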
Catalogue Conventions
https://hackmd.io/HM-CNNIuRIm2MseaL523eA
Using AGE as an alternative to PGP in SOPS (Matt, Alexey)
Pronounced "ah-gay". A recently-merged option in SOPS, recommended by the SOPS community: "age is a simple, modern, and secure tool for encrypting files. It's recommended to use age over PGP, if possible."
It has a more compact representation than PGP. Note that `age1…` strings are public keys (recipients) and `AGE-SECRET-KEY-…` strings are private keys:
Public key:
age1yt3tfqlfrwdwx0z0ynwplcr6qxcxfaqycuprpmy89nr83ltx74tqdpszlw
Secret key:
AGE-SECRET-KEY-1NJT5YCS2LWU4V4QAJQ6R4JNU7LXPDX602DZ9NUFANVU5GDTGUWCQ5T59M6
Is it something we should consider adopting?
Tuesday May 18, 2021
Target-state PhasePlans
We currently have phase plans named `phasePlan` and `iso`. We should true that up for the v2.1 release. Options:
- `deploy` and `upgrade`?
- `ephemeral` vs `deploy` or `target`, where ephemeral takes you up through clusterctl initing the target cluster (upgrade and greenfield look the same after that)
- Some phases are virt-only, e.g. `virsh-eject-cdrom-images`, `virsh-destroy-vms`.

Conclusion: create `deploy`, `deploy-virt`, and `manage-secrets` phasePlans at the type level for now. In the future we can add (as needed) things like `upgrade`, `update`, `release` (for generating qcows), `rotate-secrets`, etc.

Can we archive the old content in this agenda?
HackMD is getting really slow for me :)
Thursday May 13, 2021
Discuss Design for Make executors respect timeouts (airshipctl #533, v2.1)
Troubleshooting Guide HackMD
Purpose: provide a more accessible, flexible & dynamic way of capturing troubleshooting information.
https://hackmd.io/Nbc4XF6mQBmutMX_FEs51Q
We can gauge usage & see if there's value in transposing this into our formal documentation suite or if this is sufficient.
Secondary topic: do we have a comprehensive list of all errors produced by airshipctl?
Tuesday May 11, 2021
Discuss Design for Make executors respect timeouts (airshipctl #533, v2.1)
Discuss next steps for KRM function gating/version management
Below are notes from a smaller-group meeting on a proposed approach for https://github.com/airshipit/airshipctl/issues/524.
Recommended approach:
Task 1:
Task 2:
Generate user guide command links (Sirisha Gopigiri)
Related to https://github.com/airshipit/airshipctl/issues/281
PS: https://review.opendev.org/c/airship/airshipctl/+/789775
Tuesday May 4, 2021
CLI documentation (Sirisha Gopigiri)
Documentation structure in issue https://github.com/airshipit/airshipctl/issues/280
To render the documents properly in https://docs.airshipit.org/airshipctl/cli/airshipctl.html we are proposing two approaches; which approach should we take?
Generate CA certificate/Secret from a known authority (Sidney S.)
The API server/OIDC authenticator plugin is configured with a CA certificate. When using a CA generated without being signed by a known authority I get the error `Unable to connect to the server: x509: certificate signed by unknown authority`.
Splitting up the KRM toolbox image (Vlad/Matt)
https://review.opendev.org/c/airship/images/+/786664
Thursday Apr 29, 2021
Lifecycle management of Airship KRM functions (Sean, Matt)
The new `:v2` container tag is a moving tag, but it only moves when a git tag is pushed to the repo.
- When should we move the `:v2` tag?
- Should gating consume `:latest` functions?
- `:v2` is 22 days old; `:latest` …
CLI documentation (Kostiantyn Kalynovskyi)
Documentation structure in issue https://github.com/airshipit/airshipctl/issues/280
We have a patch set with a script that changes MarkDown format to ReST:
Generate CA certificate/Secret from a known authority (Sidney S.)
The API server/OIDC authenticator plugin is configured with a CA certificate. When using a CA generated without being signed by a known authority I get the error `Unable to connect to the server: x509: certificate signed by unknown authority`.
What is the mechanism to auto generate a CA signed by a known authority and keep it secure?
Continued pleas for assistance with Troubleshooting Guide - Andrew K
Lots of opportunities to contribute!
Tuesday Apr 27, 2021
Validate encryption/decryption design of externally provided secrets (Alexey O.)
Thursday Apr 15, 2021
Airship 2.0 Troubleshooting Guide Continued - Andrew K
FQDN resolution - Andrii O.
cat /etc/hosts.d/*.conf > /etc/hosts
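The drop-in approach above can be sketched as follows (paths swapped for a temp dir so it runs unprivileged; fragment names hypothetical):

```shell
# Assemble a hosts file from drop-in fragments, mirroring
# `cat /etc/hosts.d/*.conf > /etc/hosts` but against a temp dir.
workdir=$(mktemp -d)
mkdir -p "$workdir/hosts.d"
printf '10.0.0.5 node1.example.local node1\n' > "$workdir/hosts.d/10-nodes.conf"
printf '10.0.0.6 node2.example.local node2\n' > "$workdir/hosts.d/20-nodes.conf"
# Write to a temp file first, then rename, so readers never see a partial file.
cat "$workdir"/hosts.d/*.conf > "$workdir/hosts.tmp"
mv "$workdir/hosts.tmp" "$workdir/hosts"
cat "$workdir/hosts"
rm -rf "$workdir"
```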
We are already using #3 for networking etc; it's unclear to me that this helps with the original question. FQDN/systemd changes etc.
Tuesday Apr 13, 2021
Hostconfig-operator integration (Sreejith P)
While integrating HCO with treasuremap, we found that we need to annotate the secret onto nodes and we also need a specific label. What would be the best way to annotate nodes? Also, would it be best to add a mechanism to override the default labels in HCO via manifests?
Replacing into multiple targets (Reddy / Matt)
There are use cases where we need to ReplacementTransform the same source data into multiple target paths – e.g. replacing an IP address into many network policy rules. It would be helpful for the RT to support this natively. Some options:
- Support a `targets` list as an alternative to `target`

ISSUE: Open this for a future priority: specify a pattern that does not have a specific target.
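A sketch of how the proposed `targets` list might read alongside today's single `target` (the plural field is the proposal, not an existing API; all object and field names are illustrative):

```yaml
# Hypothetical: `targets` as a list alternative to the existing single
# `target` in airshipctl's ReplacementTransformer. Paths are illustrative.
apiVersion: airshipit.org/v1alpha1
kind: ReplacementTransformer
metadata:
  name: ip-replacements
replacements:
  - source:
      objref: {kind: VariableCatalogue, name: networking}
      fieldref: spec.oamIP
    targets:                       # proposed; today this is a single `target`
      - objref: {kind: NetworkPolicy, name: allow-oam}
        fieldrefs: ["spec.egress[0].to[0].ipBlock.cidr"]
      - objref: {kind: ConfigMap, name: oam-config}
        fieldrefs: ["data.oamIP"]
```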
PTG later this month
Need to discuss document pull command behavior (Kozhukalov)
Related issues
Patch is ready for review (probably needs rebasing)
Patch implements two things
Thursday Apr 8, 2021
Discuss the document-validation solution for airshipctl (Ruslan A.)
https://hackmd.io/t2mxDiB3TdGXI8B6gDtA-Q
Discuss "dex-aio" Implementation Short-/Long-term (Sidney S.)
Approach to discuss described in https://hackmd.io/bdPFHBBSQy-IrpPe1U9itg
Align on approach to troubleshooting guide (Andrew K)
Use the life cycle states as a high level framework:
Proposed approach would be to list the phases/steps within each higher-level lifecycle state, and then reference the relevant troubleshooting areas (listed below) within the lifecycle states. Generally speaking, here's what you need to look at within a phase (based on executor, do x, y, z).

Troubleshooting areas:
Manifests:
Running phases: How to debug a failed phase, where to start, which logs to read > Focus from the phase perspective.
Cluster-API & Kubernetes Deployment: grouped together as the k8s deployment is done by Cluster-API
Proxy settings
Networking: is this too broad/complex/specific to individual use cases?
Helm Charts: Helm Operator & Helm Chart Collator debugging
Image Builder: base image generation debugging, ISO/QCOW generation & application debugging
Host Config Operator (may be a future topic)
Sub-Clusters?
Services/Applications
Assuming our other documentation will provide details on what each phase does: would it make sense to incorporate troubleshooting into the deployment guide so you have a one-stop shop, or keep it separate so it's not cluttering up the deployment guide?
We created an issue for this quite a while back: #328. It references this TM v1 debug-report script as a potential starting point; is this still valid? https://github.com/airshipit/treasuremap/blob/master/tools/gate/debug-report.sh
Next Steps:
Tuesday Apr 6, 2021
Discuss the document-validation solution for airshipctl (Ruslan A.)
Review the followings commits
https://review.opendev.org/q/topic:"add-validation-phases"+(status:open)
Discuss "dex-aio" certificate generated by Cert-Manager (Sidney S.)
Approach to discuss described in https://hackmd.io/bdPFHBBSQy-IrpPe1U9itg
Thursday Apr 1, 2021
Discuss the rook-ceph cluster implementation (Vladimir S.)
Review the initial commit https://review.opendev.org/c/airship/treasuremap/+/784184,
Discuss rook-ceph components which should be deployed by default,
Discuss the further downstream/WHARF work
Set failure domain to host by default
Place for scripts such as waiting, that are currently in tools/deployment. (KKalynovskyi)
We have new pattern of waiting and adding new scripts to gate/deployments: https://review.opendev.org/c/airship/airshipctl/+/782520
As an example, we placed the script here, but it is test-site specific; we need a place that can be shared between every site:
https://github.com/airshipit/airshipctl/tree/master/manifests/site/test-site/phases/helpers