Kepler Release Community Meeting Minutes
===
###### tags: `kepler` `release`
- **Meeting recordings:**
https://youtube.com/playlist?list=PLz3pRD3kGUsd25nwuE-cDWgISktOpJmCx
- **Date:**
- May 9, 2023
- **Agenda**
- 0.5 release update
- 0.6 planning https://github.com/orgs/sustainable-computing-io/projects/2/views/1
- **Date:**
- Apr 4, 2023
- **Agenda**
- Development updates:
- dependent library update
- cgroup v1/v2 are both supported
- CPU map update
- https://github.com/sustainable-computing-io/kepler/pull/601
- Any need to update models?
- validate the models
- DRAM-intensive workload to train the DRAM model
- VM case: how to get memory heuristics to use the DRAM model
- Operator available at OperatorHub, new release will be cut around 04/15
- 0.5 Release date
- Around Apr 15th
- Any critical PRs?
- https://github.com/sustainable-computing-io/kepler/pull/609
- https://github.com/sustainable-computing-io/kepler/pull/611
- The motivation makes sense: reduce pod listing overhead and get pod info in real time.
- Privilege escalation via access to the Docker socket is a concern; will discuss it with an OpenShift architect.
- Critical issues
- https://github.com/sustainable-computing-io/kepler/issues/610
- https://github.com/sustainable-computing-io/kepler/issues/608
- add cgroup id
- https://github.com/sustainable-computing-io/kepler/issues/594
- https://github.com/sustainable-computing-io/kepler/pull/599 (?)
- CNCF TAG demos
- Sustainability (Apr 5th 11AM ET)
- Runtime (Apr 6th 11AM ET)
- **Date:**
- March 21, 2023
- **Agenda**
- Development updates: `30min`
- Operator
- Standalone Kepler
- Multi arch support
- 0.6 release feature planning
- Kepler project board update
- Training on different CPU architecture
- External power source support: BMC, Sentry
- Scalability: Prometheus client (https://github.com/sustainable-computing-io/kepler/discussions/439)
- Models documentation and usability improvement
- Planning meeting date
- End of Apr (after KubeCon EU) and end of May (end of OSS NA)
- **Date:**
- March 07, 2023
- **Agenda**
- New kepler model server pipeline `20-25min`
- Follow up on license on models and training data
- Development updates: `15min`
- Operator
- Standalone Kepler
- TBD
- Discussions `10-15min`
- ACPI/IPMI/HMC power reading
- CGroup/Kubelet library import vs reading from cgroup sysfs and kubelet metrics endpoint
- **Date:**
- Feb 21, 2023
- **Agenda**
- Benchmark test intro `15min`
- How to add new kepler model server pipeline `20-25min`
- Discussions `5-10min`
- **Date:**
- Feb 07, 2023
- **Agenda**
- Current progress
- Scalability issues
- Testing
- Deployment
:::info
- **Date:**
- Jan. 24, 2023
- **Agenda**
- 0.5 release planning
- Release captain: Parul
- Project planning
- CNCF TAG presentation video
- **Date:**
- Jan. 10, 2023
- **Agenda**
- Bertrand introduces Sentry
- HW agent/monitoring (BMC, etc)
- support OpenTelemetry, collecting HW metrics from a range of vendors (cisco/HPE/etc) through SNMP/Redfish/etc
- Power estimate based on HW specs
- HW metrics: power in watts and energy in joules (hw_host_power_watts, hw_host_energy_joules_counter)
- Looking into VM using ratio based approach
- Sentry SW is free but not open sourced (legacy issue)
- Demo: https://hws-demo.sentrysoftware.com/d/SHFBpSH7z/hardware-sentry-site?orgId=1&var-site=Sentry-Ottawa&from=now-2d&to=now
- Architecture diagram: https://www.sentrysoftware.com/products/hardware-sentry.html
- https://www.sentrysoftware.com/docs/hws-doc/3.0.00/index.html
1. v0.4 release review `15min`
- Feature
- Deployment
- Test
- Doc
2. v0.5 planning `30min`
- https://github.com/orgs/sustainable-computing-io/projects/2
- Feature: please add your thinking to the project board.
- Sam: MVC like architecture, OTEL support
- Deployment
- helm chart: Kepler is included, Prometheus/Grafana is under consideration. GHA to generate more data during test.
- operator: works for k8s, testing for ocp. v0.5 will support model server. bundles ready for operatorhub registration. cluster-admin for operator?
- Test: consolidated integration test using Github action, setting kind, install bcc, can be used for operator. Test coverage improvement (especially those silently bypassing panic), benchmark testing.
- Doc
- Model: offline to online upload. Node power modeling and verify online training. Formalize Accuracy testing of Models.
3. Discussion `15min`
- Conference talks
- CNCF TAG presentation: prepare offline recording (5min each) and reserve a 15-20min TAG meeting time for presentation.
4. Issues
- Root privilege in deploying kepler (check eBPF)
- GPU support: HW spec support matrix (check nvidia library supportability matrix), shared or dedicated GPU usage.
:::info
- **Date:**
- Dec. 13, 2022
- **Agenda**
1. Update and progress `40min`
- Kepler: no urgent tasks
- GPU issue under debugging (thanks to David Gray)
- CPU frequency: reading sysfs is expensive, and CPU frequency is only available on bare metal (BM). Reading from the kernel tracepoint doesn't always work because it is activated only when the governor changes. Reading HW counters (cycles, hperf) can provide real-time frequency. https://github.com/sustainable-computing-io/kepler/pull/427/files. A CPU time calculation issue was also identified and fixed. The PR removes many expensive calls, which also improves performance.
- Kepler Test: CI, GH Action
- Kepler Doc:
- Operator: Progress and release estimate
- Helm: Progress and next step (chart will be merged)
2. Release decision `10min`
- Kepler: Dec 16
- Kepler Operator: week of Dec 19
- Helm chart: TBD
3. Welcome new committers `5min`
- https://github.com/sustainable-computing-io/kepler/pull/459
4. 2023 community meeting schedule `5min`
- Starting mid-January, bi-weekly; Huamin to update the Zoom invite in README.md
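
The CPU frequency item in the Dec. 13 update (deriving frequency from hardware counters rather than reading sysfs) can be illustrated with a minimal sketch. It assumes the caller samples a cycle counter and the CPU run time at two points; the function name, parameters, and sample values are hypothetical and are not taken from the PR.

```python
def average_frequency_mhz(cycles_start, cycles_end, cpu_time_start_ns, cpu_time_end_ns):
    """Estimate the average CPU frequency over an interval from a cycle counter.

    Assumption: the cycle values count unhalted CPU cycles and the time values
    are the nanoseconds the CPU actually ran, both sampled by the caller.
    """
    delta_cycles = cycles_end - cycles_start
    delta_ns = cpu_time_end_ns - cpu_time_start_ns
    if delta_cycles < 0 or delta_ns <= 0:
        return None  # counter reset or empty interval: no estimate possible
    # cycles per nanosecond equals GHz; scale by 1000 to report MHz
    return delta_cycles * 1000.0 / delta_ns


# Hypothetical samples: 2.4e9 cycles over one second of CPU time
print(average_frequency_mhz(0, 2_400_000_000, 0, 1_000_000_000))
```

Because both inputs are deltas between two samples, the estimate needs no sysfs read per CPU and does not depend on governor change events.
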
- **Date:**
- Nov. 29, 2022
- **Agenda**
1. Update and progress `30min`
- VM support and model estimator integration
- Offline models are available, power estimate on VM is working.
- e2e test cases are available to verify kepler container and node metrics (Sam: can you add test case to detect new Pods and their metrics? Huamin to add issues. Prometheus client is used in e2e, metrics reading can be added)
- Estimator sidecar will be verified and added to e2e test (not in this release)
- What is the status of online model training and updating? (manual test first, Pang will share her sidecar and model server manifests)
- test coverage
- improved from the low 30s to 39%; more unit tests are needed (Huamin to investigate which pkgs need more tests and create issues. Sam: create tests for each pkg. Internal pointers/connectors make test cases hard to write; pointer value validation etc. needs refactoring. Some pkgs require the bcc library, making it hard for Mac users to add test cases; maybe a conditional build tag? Huamin will create issues for Mac/refactor)
- Deployment
- Operator
- [v1alpha1](https://github.com/sustainable-computing-io/kepler-operator/tree/v1alpha1) (Parul will send a demo based on kind. TODO integrating model server. OperatorHub integration will happen later. Sally will help on deploying on OpenShift/MicroShift)
- tested kepler-exporter on kind, cluster-prerequisite for openshift WIP
- working on integration with estimator and offline models
- TO-DO Parul: Document features present in current operator and what will be expected in the next release.
- Helm (https://github.com/sustainable-computing-io/kepler-helm-chart)
- PR ready, Sam commented/reviewed. Tested on kind (validated exporter and output). Still working on Prometheus and Grafana (may add in the future)
- to investigate how to release the chart (maybe use github actions)
- Docs
- [Simplify end user doc](https://github.com/sustainable-computing-io/kepler/issues/418) (Nikki to make two contributions)
2. Issues `20min`
- [0 process](https://github.com/sustainable-computing-io/kepler/issues/422) (Add logging)
- [podEnergyStatLabels needs update](https://github.com/sustainable-computing-io/kepler/issues/408) (already in local repo, doesn't affect model training atm)
- [block_device_used logging](https://github.com/sustainable-computing-io/kepler/issues/355) (let's lower the verbosity first)
3. Questions and Discussions `10min`
- Release criteria and date to be finalized
- https://github.com/sustainable-computing-io/kepler/issues/333 (Estimator/kepler: configmaps. Pang will share the examples (in the discussion and docs PR))
- Should cgroup v1 be supported in cgroup metrics based models? (Let's document this and investigate more next release)
- Shared e2e on all repos:
- Kepler(including estimator and model server)
- Operator, helm
:::info
- **Date:**
- Nov. 14, 2022
- **Agenda**
1. Issues and progress `40min`
- Issues
- VM support, Estimator, Model Server usage
- VM: CPU host passthrough tested (with perf counters metrics), cgroup metric model comes next
- development in model branch: cgroup pkg issues found. cgroup metrics not reached. The work function is not finalized, under debugging. Pang is working on it and will report an issue.
- Estimator sidecar: tested before the metric refactoring. Config names in env var (also in the dev branch): set estimator to true. Huamin to add debugging to the sidecar.
- Kaiyi: update namespace in model server to kepler (in deployment and in Service endpoint)
- Process to ensure usability and performance
- Marcelo: metrics doc (including samples) updated, new grafana dashboard PR (not yet using all the metrics), all power sources are in their own metrics
- Separate metrics vs aggregating at the Prometheus: performance hit on prometheus should be avoided; Aggregating on the Kepler side can help the scalability.
- Dedicated power source metrics can be used for label-based aggregation. End users can query/check individual or aggregate metrics based on the basic metrics.
- Docs
- mkdocs vs Sphinx vs Hugo: kepler-docs needs a dev preview in the local env. Hugo is used by k8s but is not as easy as mkdocs. Sphinx provides apidocs, but so does mkdocs. Sphinx is complicated, and its preview is not consistent between GitHub Pages and the local VS Code plugin.
- mkdocs also reports broken links, maybe a test needed to ensure all links valid (refer to https://github.com/redhat-et/microshift-documentation/blob/main/.github/workflows/broken-link-check.yml)
- kepler-docs approver: Marcelo, Parul, Pang
- Update
- Operator
- v1alpha1 branch: specs defined, main reconciler, abstraction in placeholder. Kepler-exporter: Parul, others: Kaiyi. Preview on kepler-exporter on bare metal in the next two weeks.
- Helm
- helm chart PR drop today. Will need review. Prometheus/grafana integration.
- reviewer: Sally
- Next
- Test coverage, e2e testing
- pkgs that need test coverage
- unit test:
- power pkg, simple to implement
- cgroup
- complex ones: comments + TODO
- simple test cases now, refactor can come later.
- system level: library dependency (how to mock them?), maybe borrow from k8s mock tests.
- how to run focused tests: VS Code Ginkgo plugin (Huamin and Sam to share the command, If/focus)
- basic e2e test cases
- we deploy workload on kind cluster
- validate the kepler metrics with that workload
- gh uses ubuntu server, manifests with ebpf, that may cause issues with kepler (lib/modules bind mount)
- build on ubuntu or run containerized mode.
- dind vs VM on GH action: the flow of creating Fedora derived OS on GH VMs. Kind is a dind. The limitation of kind? eBPF/bcc library dependency. Sam please create issues
2. Questions `20min`
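
The label-based aggregation point from this meeting (dedicated per-power-source metrics that end users can read individually or in aggregate) can be sketched as follows. This is only an illustration: the label names (`pod`, `source`) and the sample values are hypothetical and are not actual Kepler metric names.

```python
from collections import defaultdict


def aggregate_by_label(samples, label):
    """Sum sample values grouped by the value of one label.

    Each sample is a (labels, joules) pair, where labels is a dict.
    """
    totals = defaultdict(float)
    for labels, joules in samples:
        totals[labels[label]] += joules
    return dict(totals)


# Hypothetical per-source energy samples, one series per pod and power source
samples = [
    ({"pod": "web", "source": "package"}, 12.0),
    ({"pod": "web", "source": "dram"}, 3.0),
    ({"pod": "db", "source": "package"}, 20.0),
    ({"pod": "db", "source": "dram"}, 5.0),
]

per_pod = aggregate_by_label(samples, "pod")        # total energy per pod
per_source = aggregate_by_label(samples, "source")  # total energy per power source
print(per_pod, per_source)
```

Keeping one series per power source means the same basic metrics serve both views; where the summation runs (in Kepler or in Prometheus) is the scalability trade-off noted in the minutes.
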
:::info
- **Date:**
- Nov. 1, 2022
- **Agenda**
1. Walk through project board `40min`
- Release criteria: urgent + high priority tasks done
- Size
- Size is used to determine the development time and deadline of the task
- XL: 1+ months
- L: 2 weeks - 1 month
- M: 1-2 weeks
- S: < 1 week
- Early PR is recommended.
- If the anticipated deadline goes beyond the release date, the priority of the task is lowered and the task may be moved to the next release.
2. Logistics `10min`
- release tracking
- biweekly meeting (for dev)
- release date
- Mid Dec (tentative, Dec 16th)
- release captain
- tracking the tasks and PRs, create tags for the issues that have release, priority, size (Parul)
- tags that can be reused (Marcelo)
- document everything release captain does so the process can be reused (Sally)
- manage PR merge
3. Development process `10min`
- task -> issue -> design -> PR -> test -> doc
- only task PR before release, refactor PR will be merged after release
- when merge conflicts exist, high priority PR and small PR are merged first
- feature PRs must have test cases (i.e. do not drop test coverage)
- bug fixes always have high priority
- **Participants:**
- Huamin Chen
- Chen Wang
- Parul
- Sally O'Malley
- Sam Yuan
- Kaiyi Liu
- Chen Ji
- Peng Hui Jiang
- Marcelo Amaral
- Sunyanan Choochotkaew
- Ken Lu
- Ruomeng Hao
::: note
- version scheme: incremental integer, decimal, periodical release
- milestone definition:
- support all clouds
- accuracy of power measurement
- support all HW (x86, arm, s390x)
- backlog project to track new ideas that are not covered in the current release
## Walk through project board
:dart: Goal
---
- e2e integration
- owner: Huamin (also include e2e test)
- will also include the operator for deployment
- also test the API (maybe with mock data and a long-running test with real data)
- (GPU test will be at risk)
- platform: cpu architecture (e.g. icelake), linux distro (issues in 6.2 kernel) (Ken)
- test coverage
- owner: Sally
- platform: cpu architecture (e.g. icelake), linux distro (issues 6.2 kernel) (Ken)
- documentation
- owner: Marcelo (metrics), Parul (overall), Pang (model server/estimator)
- basic feature
- owner: Chen Wang
- deployment
- owner: Peng Hui Jiang (helm) and Parul and Pang (Operator)
:books: Backlog
---
- Features
- Testing
- Documentation
- Deployment
:closed_book: Tasks
--
==Importance== (Urgent - Low) / Name / **Size** (Small - X Large)
### Feature
- PRs to review
- Low / Detect kepler is running inside VM [PR #302](https://github.com/sustainable-computing-io/kepler/pull/302) / Small
- Not needed at the moment
- Medium / Use local model if no model server endpoint given [PR #384](https://github.com/sustainable-computing-io/kepler/pull/384) / Medium
- Urgent / power: switch to model based estimator when the RAPL interface is not available [PR #388](https://github.com/sustainable-computing-io/kepler/pull/388) / Small
- Merged
- Issues to prioritize
- Urgent / Difference between kepler-model-server and kepler estimator [issue #375](https://github.com/sustainable-computing-io/kepler/issues/375) / X Large
- High / Kepler general energy metrics vs energy tracing metrics [issue #365](https://github.com/sustainable-computing-io/kepler/issues/365) / X Large
- New metrics need to propose a new design if it involves new power source
- Need general enhancement template for new proposals (will follow up on slack)
- Low / Kepler on VM with Hardware Counters and RAPL [issue #367](https://github.com/sustainable-computing-io/kepler/issues/367) / X Large
- Medium / Fix BPF dependency so that Kepler can always find containers [issue #364](https://github.com/sustainable-computing-io/kepler/issues/364)
- find a use case.
- Urgent / end-to-end integration with model server, estimator and kepler [issue #349](https://github.com/sustainable-computing-io/kepler/issues/349)
- ansible expert needed
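
The fallback described in PR #388 in the list above (switch to the model-based estimator when the RAPL interface is not available) can be sketched roughly as below. This is a simplified illustration, not the PR's implementation: the function name and the returned strings are hypothetical, and the check only looks for the standard Linux powercap directory for Intel RAPL.

```python
import os

# Standard Linux powercap sysfs location for Intel RAPL domains
RAPL_SYSFS_PATH = "/sys/class/powercap/intel-rapl"


def select_power_source(rapl_path=RAPL_SYSFS_PATH):
    """Return "rapl" if the RAPL sysfs directory exists, else fall back.

    The returned names are placeholders for whatever the caller uses to
    pick between a measured and an estimated power source.
    """
    if os.path.isdir(rapl_path):
        return "rapl"
    return "model-estimator"


print(select_power_source())
```

On VMs and on hosts without RAPL the directory is typically absent, so the sketch selects the estimator path there.
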
### Test
- [ ]
### Document
- [ ]
### Deployment
- [ ]
## Notes
<!-- Other important details discussed during the meeting can be entered here. -->