OpenEBS Jiva Discussions Notes
===

:::info
- **Location:** https://meet.google.com/nvu-dhwx-jhb?hs=122
- **Date:** Every Wednesday from 7:00pm to 7:30pm (IST)
- **Agenda**
1. Walk through the status update
2. Discussion topic

*Add your request, question or suggestion to our [issue list](https://github.com/openebs/openebs/issues)*

*If you have anything you’d like to put on the agenda, please do so below for the next meeting:*
:::

## Mar 10th

Topics for discussion:
- Jiva Volume status being updated very frequently even in RW mode.

      // Ready and Syncing volumes refresh their status on each reconcile.
      switch instance.Status.Phase {
      case jv.JivaVolumePhaseReady, jv.JivaVolumePhaseSyncing:
          return reconcile.Result{}, r.getAndUpdateVolumeStatus(instance)
      }

- Merge jiva csi and operator yaml

NFS
- Repos:
  - https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner
  - https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner
  - https://github.com/openebs/dynamic-nfs-provisioner
- Give a walkthrough of NFS
- Comment on the issues, follow up with users
- Set up CI/CD pipeline
- Verify storage classes in dynamic nfs provisioner
- Documentation and examples
- Native support for RWM volumes
- Support for RO volumes

## Pending items for winding up jiva-csi
- Upgrade github actions test (Merged)
- Migration doc from non CSI to CSI volumes (PR to be raised)
- Jiva Operator Documentation (PR raised, under review)
- Jiva README (PR raised, under review)
- e2e PR merge/pipeline handover (waiting on pipeline fix, PR raised and reviewed)

## Mar 3
- [Shubham]
  - Add version label on all the resources (Merged)
  - operator-sdk PR under review by Akhil and Prateek (Merged)
  - Upgrade pending (Waiting on maya update)
  - Upgrade test cases (")
  - Non CSI to CSI migration doc (Probably next release)
  - Non CSI to CSI migration test (")
  - Remove '(' from version in jivavolume CR
  - Error logs
- [Prateek]
  - Add jiva operator to openebs charts repo
  - Google analytics (Conflicts to be resolved)
- [Payes]
  - Verify support for RO volumes in cstor/jiva
  - Setup e2e pipeline for jiva-operator (A new branch has been created)
  - README (PR under review)
  - Docs
  - Add events to Jiva Volume CR
  - Verify if Local PV are resized when jiva Volume PVC is resized
- [Kiran]
  - Remove travis
  - Close the following Jira issues:
    - https://mayadata.atlassian.net/browse/OC-11?filter=-

## Feb 24
- [Shubham]
  - Add version label on all the resources (Merged)
  - operator-sdk PR under review by Akhil and Prateek (Merged)
  - Upgrade pending (Waiting on maya update)
  - Upgrade test cases (")
  - Non CSI to CSI migration doc (Probably next release)
  - Non CSI to CSI migration test (")
  - Jiva Volume status being updated very frequently even in RW mode. (Fixed by increasing resync interval to 5s)
- [Prateek]
  - Add jiva operator to openebs charts repo
  - Google analytics (Under review)
- [Payes]
  - Verify support for RO volumes in cstor/jiva
  - Setup e2e pipeline for jiva-operator (A new branch has been created)
  - README (PR under review)
  - Docs
  - Add events to Jiva Volume CR
  - Verify if Local PV are resized when jiva Volume PVC is resized
- [Kiran]
  - Remove travis
  - Close the following Jira issues:
    - https://mayadata.atlassian.net/browse/OC-11?filter=-

## Feb 17
- [Shubham]
  - Jiva Volume status being updated very frequently even in RW mode.
  - Upgrade operator-sdk in jiva-operator (To be decided)
  - Upgrade (Draft PR raised), will be taken in next release
  - Upgrade test (Next release)
  - Non CSI to CSI migration doc (Probably next release)
  - Non CSI to CSI migration test (")
- [Prateek]
  - Analytics for jiva-csi volumes
- [Payes]
  - Verify support for RO volumes in cstor/jiva
  - Setup e2e pipeline for jiva-operator
  - Docs (PR sent to update README)
  - Add events to Jiva Volume CR
  - Verify if Local PV are resized when jiva Volume PVC is resized
- [Kiran]
  - Remove travis
  - Close the following Jira issues:
    - https://mayadata.atlassian.net/browse/OC-11?filter=-

## Feb 10
- [Shubham]
  - Reviewing the helm chart PRs
  - jiva-operator doesn't reconcile if jiva-volume policy is applied later (Merged)
  - Upgrade (Draft PR raised), will be taken in next release
  - Upgrade test (Next release)
  - Non CSI to CSI migration doc (Probably next release)
  - Non CSI to CSI migration test (")
- [Prateek]
  - Multi arch (Merged)
  - Helm chart (Merged)
  - Topology support (Merged)
- [Payes]
  - Jiva = hash based quorum. Check for the case where 3 replicas are registered and a non-master replica tries to reconnect with a new ID. (Verified)
  - jiva: restart while initial registration (Merged)
  - Update go version in jiva to 1.14.7 (Merged)
  - Raw block support (Merged)
  - Resize failing for raw block volumes (k8s does not support this in 1.19; need to document this after cross verifying)
  - gotgt: metrics - issue/enhancement
  - gotgt: refactor io processing to single thread with queue vs thread per io.
  - csi: ginkgo tests
  - e2e: csi based tests
  - e2e: upgrade / migration tests
  - e2e: missing resiliency tests?
## Feb 3
- [Shubham]
  - Merge operator and csi (Merged)
  - Version details on jiva-volume CR (Merged)
  - Target Affinity (Merged)
  - Upgrade (Draft PR raised), will be taken in next release
  - Upgrade test (Next release)
  - Non CSI to CSI migration doc (Probably next release)
  - Non CSI to CSI migration test (")
- [Prateek]
  - Multi arch (Merged)
  - Helm chart (In progress - RC2)
- [Payes]
  - jiva: restart while initial registration (Up for review)
  - csi: Root mount (Merged)
  - csi: Attach required as false (Merged)
  - Raw block support (PR raised)
  - csi matrix for raw block (PR raised)
  - Update go version in jiva to 1.14.7 (PR raised)
  - Use logrus for jiva operator (PR needs to be raised)
  - Topology support (To be added)
  - Readme pending (RC2)
  - jiva-operator doesn't reconcile if jiva-volume policy is applied later
  - Resize failing for raw block volumes
  - gotgt: metrics - issue/enhancement
  - gotgt: refactor io processing to single thread with queue vs thread per io.
  - csi: ginkgo tests
  - e2e: csi based tests
  - e2e: upgrade / migration tests
  - e2e: missing resiliency tests?

## Jan 27
- [Shubham]
  - Merge operator and csi (Merged)
  - Version details on jiva-volume CR (In progress)
  - Upgrade (Draft PR raised)
  - Upgrade test
  - Non CSI to CSI migration doc
  - Non CSI to CSI migration test
- [Prateek]
  - Multi arch (In progress)
  - Helm chart (In progress)
- [Payes]
  - jiva: restart while initial registration (Merged)
  - csi: Root mount (Merged)
  - csi: Attach required as false (Merged)
  - Use logrus for jiva operator
  - Topology support (To be added)
  - Check raw block support (PR to be sent)
  - Check csi matrix (Code complete, PR will be raised)
  - gotgt: metrics - issue/enhancement
  - gotgt: refactor io processing to single thread with queue vs thread per io.
  - csi: ginkgo tests
  - e2e: csi based tests
  - e2e: upgrade / migration tests
  - e2e: missing resiliency tests?
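
For the recurring "gotgt: refactor io processing to single thread with queue vs thread per io" item, a minimal sketch of the two models being compared. This is not gotgt code: `ioRequest`, `perIO` and `singleWorker` are hypothetical names, and a byte slice stands in for the backing store.

```go
package main

import (
	"fmt"
	"sync"
)

// ioRequest is a hypothetical stand-in for one write command.
type ioRequest struct {
	offset int
	data   byte
}

// perIO handles every request in its own goroutine. Shared state needs a
// lock, and the order in which overlapping writes land is not defined.
func perIO(reqs []ioRequest, disk []byte) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for _, r := range reqs {
		wg.Add(1)
		go func(r ioRequest) {
			defer wg.Done()
			mu.Lock()
			disk[r.offset] = r.data
			mu.Unlock()
		}(r)
	}
	wg.Wait()
}

// singleWorker drains a queue from one goroutine, so requests are applied
// in submission order and the store needs no lock.
func singleWorker(reqs []ioRequest, disk []byte) {
	queue := make(chan ioRequest, len(reqs))
	done := make(chan struct{})
	go func() {
		defer close(done)
		for r := range queue {
			disk[r.offset] = r.data
		}
	}()
	for _, r := range reqs {
		queue <- r
	}
	close(queue)
	<-done
}

func main() {
	// Two writes hit offset 0; with the queue the later one always wins.
	reqs := []ioRequest{
		{offset: 0, data: 1},
		{offset: 1, data: 2},
		{offset: 0, data: 3},
	}
	disk := make([]byte, 2)
	singleWorker(reqs, disk)
	fmt.Println(disk) // prints [3 2]
}
```

The queue gives ordering and avoids per-request goroutines at the cost of serialising all IO through one worker. Which model performs better for Jiva is the open question in the notes and is not answered by this sketch.
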
## Jan 20
- [Shubham]
  - Merge operator and csi (Review pending)
  - Version details on jiva-volume CR
  - Upgrade
  - Node Affinity pending in Jiva Volume Policy (Not required, dependent on local PV)
- [Prateek]
  - Multi arch
  - Helm chart
- [Payes]
  - jiva: restart while initial registration (PR raised)
  - csi: Root mount (PR raised)
  - csi: Attach required as false (PR raised)
  - Use logrus for jiva operator
  - Review and merge https://github.com/openebs/jiva-operator/pull/28/files
  - gotgt: metrics - issue/enhancement
  - gotgt: refactor io processing to single thread with queue vs thread per io.
  - csi: ginkgo tests
  - e2e: csi based tests
  - e2e: upgrade / migration tests
  - e2e: missing resiliency tests?

## Jan 13
- [Shubham]
  - Created separate issues for tasks.
  - Working on adding version details to jivavolume CR.
  - CSI volume upgrades == will be added to openebs/upgrade
  - multi-arch builds with github actions
  - helm chart for jiva
  - Do we need an admission webhook to validate jiva volume day 2 operations?
    - Resize will be done via PVC
    - Replica scale-up? Will not support this.
- [Payes]
  - jiva: restart while initial registration
  - gotgt: metrics - issue/enhancement
  - gotgt: refactor io processing to single thread with queue vs thread per io.
  - csi: Root mount
  - csi: Attach required as false
  - csi: ginkgo tests
  - e2e: csi based tests
  - e2e: upgrade / migration tests
  - e2e: missing resiliency tests?
- Open Discussions
  - Merge jiva-csi and jiva operator code into jiva-operators
    - jiva-operators will have the CSI as well as helm charts
    - jiva-operators follow Operator SDK pattern and JivaVolumeCR is embedded into this repo.
    - Pros:
      - Avoids maintaining dependencies and E2e tests in two repos
    - Cons:
      - log framework
        - cStor CSI uses logrus
        - K8s uses klog everywhere
        - uber/zap is used in jiva operator
      - troubleshooting
        - depends on the above log framework
        - easy to integrate with tools used in the customer clusters
        - prometheus/grafana/alertmanager integrations

## Jan 6
- [Shubham]
  - Work on completion of parity
- [Payes]
  - Start working on uuid change
- [Kiran]
  - Got permission for kubernetes-sigs repo

## Dec 23
- [Shubham/Prateek]
  - Continue with feature parity
- [Payes]
  - Clean empty files in raw block mode (Kubernetes bug, not ours)
  - Add attacher container in multi attach PR (Done)

## Dec 16
- [Shubham/Prateek]
  - https://github.com/openebs/jiva-operator/issues/25
  - https://docs.google.com/spreadsheets/d/1QfMFGjqDxp_buzEq8XtuTGHJd4zMSKVKYugWQaPIgTQ/edit?ts=5df0bae6#gid=0
- [Payes]
  - Trying out strok
  - Multi-attach error PR
  - Look into nfs issues
  - Familiarize with Velero
- [Kiran]
  - NFS ganesha automation setup
  - Dynamic NFS provisioner
  - PRs to be merged:
    - https://github.com/openebs/jiva/pull/337
    - https://github.com/openebs/sparse-tools/pull/7
    - https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner/pull/17

## Dec 2nd
- [Payes]
  - Node failure test cases in progress
    - Learning Ansible (done)
    - Learning the current jiva pipelines and seeing the execution status
    - Raising a PR for simple change in the current jiva pipeline
    - Raise a PR to crash application node (Under review)
    - Raise a PR to crash RW replica node (Under review)
    - Raise a PR to crash WO node (Under review)
    - Work on raising a PR to litmus for node shutdown tests
  - m_upgrade or jiva-csi
  - NFS Provisioner
    - Raised a PR to fix 152.152 (done)
    - Fix the CLA (done)
    - Test for upgrades with nfs volumes that crossed 152 with previous code, where users might have manually edited the exports file (done)
    - Follow up with community to merge
    - To be documented:
      - Quota doesn't work if the underlying filesystem of nfs provisioner is not xfs
    - Below observation needs to be shared with McKinsey (Sent):
      - If only Filesystem_id is updated, the volume gets deleted but a stale entry in vfs.conf is left
      - If export ID is updated, then the PV will not be deleted, as the PV has an annotation with the old Export ID
      - To delete the PV when export ID is updated, the PV's annotation should also be updated.
- [Kiran]
  - Dynamic NFS provisioner
    - https://github.com/kmova/dynamic-nfs-provisioner
    - 02/12 : Pushed the new repo into openebs with create/delete functionality.

## Nov 4th/18th, 2020

### Status Updates
- [Payes]
  - Node failure test cases in progress
    - Learning Ansible (done)
    - Learning the current jiva pipelines and seeing the execution status
    - Raising a PR for simple change in the current jiva pipeline
    - Raise a PR to crash application node (Under review)
    - Raise a PR to crash RW replica node (Under review)
    - Raise a PR to crash WO node (Under review)
    - Work on raising a PR to litmus for node shutdown tests
  - m_upgrade or jiva-csi
  - NFS Provisioner
    - Raised a PR to fix 152.152 (done)
    - Fix the CLA (done)
    - Test for upgrades with nfs volumes that crossed 152 with previous code, where users might have manually edited the exports file (done)
    - Follow up with community to merge
    - To be documented:
      - Quota doesn't work if the underlying filesystem of nfs provisioner is not xfs
    - Below observation needs to be shared with McKinsey:
      - If only Filesystem_id is updated, the volume gets deleted but a stale entry in vfs.conf is left
      - If export ID is updated, then the PV will not be deleted, as the PV has an annotation with the old Export ID
      - To delete the PV when export ID is updated, the PV's annotation should also be updated.
- [Kiran]
  - Dynamic NFS provisioner
    - https://github.com/kmova/dynamic-nfs-provisioner

## Oct 21, 2020

### Status Updates
- [Payes]
  - Node failure test cases in progress

## Oct 14, 2020

### Status Updates
- [Utkarsh]
  - PRs related to vendor updates in Jiva CSI and Jiva operator
  - CSI Sanity
- [Payes]
  - Remount, resize and volume provisioning test cases merged
  - Need to connect with gostor team (Pending)
  - Raise gotgt PR upstream (Not yet raised)
  - Node Failure Tests (Pending)
- [Kiran]
  - NFS operator build code to be pushed
  - OpenEBS NFS Operator (Pending)

## Oct 7, 2020

### Status Updates
- [Payes]
  - Remount issue (Merged)
  - Jiva CSI integration tests (In progress)
    - PR pushed, travis failing
  - Need to connect with gostor team (Pending)
  - Raise gotgt PR upstream (Not yet raised)
  - Node Failure Tests (Pending)
  - NFS Operator (Pending)
- [Utkarsh]
  - PR merged in CSI sanity tests
  - Will create PR to remove custom code for CSI sanity
- [Kiran]

## Sept 30, 2020

### Status Updates
- [Shashank]
  - Test list (in pipeline and planned)
    - https://docs.google.com/spreadsheets/d/18kxwynIe59gIDeIKGhf1oZzWZdEAsnqMgjC9-H0Jx9s/edit#gid=0
  - Node failure in case of NFS provisioner on Jiva (Planned)
  - Kubelet failure where nfs provisioner is scheduled
- [Payes]
  - License update in Jiva-csi and Jiva-operator (Done)
  - Build failures due to version mismatches (Done) - received help from Prateek
  - Remount issue PR raised (Under review) - Sai is reviewing
  - Jiva CSI integration tests in progress
  - Need to connect with gostor team (Pending)
  - Raise gotgt PR upstream (Not yet raised)
  - Node Failure Tests (Pending)
  - NFS Operator (Pending)

## Sept 23, 2020

Attendees: Kiran/Payes/Shashank

### Status Updates
- [Payes]
  - License update (Done)
  - Jiva CSI Stability (In progress)
  - Node Failure Tests (Planned)
  - Raise gotgt PR upstream (In progress)
  - NFS Operator (Not planned)
- [Shashank]
  - Review Node Failure tests (Planned)
  - List NFS tests on Jiva in e2e pipeline (Done)
  - Verify if Jiva Metrics are tested (Planned)
  - Jiva CSI driver related e2e tests (Not planned) - not present in pipeline
- [Kiran]
  - NFS external provisioner CI automation (Planned)

### Backlogs
- Multi arch build for Jiva and Jiva CSI driver (Issue created)
- Recovering from complete node failure (code to be added in operator)
- Jiva CSI and operator integration tests (Pending)
- Update Contributor and User Documentation
- Jiva CSI Resize feature - Supported
- Jiva CSI Metrics - Supported
- Enhancements to Jivactl
- Enhancements to OpenEBS ctl to include Jiva Volumes
- Verify Jiva Volume CR status updates and events

## Sept 16, 2020

Attendees: Kiran/Payes

### Status Updates
- [Payes]
  - Update license in all files
  - Jiva CSI Stability
  - Node Failure Tests

## Sept 2, 2020

Attendees: Payes

### Status Updates
- [Payes]
  - Jiva Read performance improvement done, code yet to be pushed to upstream gostor repo

## Aug 26th, 2020

Attendees: Shashank, Payes, Kiran

### Status Updates
- [Shashank/Payes]
  - Jiva Reads give n*x performance numbers when no writes are being done (Not really happening)
  - Verify if metrics is tested in pipelines
  - Jiva CSI is in progress
  - Verify if Jiva over NFS tests are being run in pipeline
  - What happens if filesystem goes in RO state in pipeline
  - Test e2e replica rebuild slowness using replicated
  - Replica data loss case for dummy path in e2e

## Aug 19th, 2020

Attendees: Shashank, Payes, Kiran

### Status Updates
- [Shashank/Payes]
  - 2.0 released with fixes for auto delete of internal snapshots
  - The fix is back-ported/cherry picked to 1.12.x
  - 2.0 also includes refactoring on the rebuild with check-points
  - Enhanced Travis Integration tests around check-point and data integrity tests with chaos = random and continuous were executed manually
  - E2e pipelines were updated with chaos tests on controller failure.
### Discussion Topics
- 2.x Goals - Brainstorming
  - Stability of Jiva
    - e2e and integration tests
    - NFS on Jiva => stability and best practices
    - Complete cluster shutdown and restart
  - Error reporting for failures and troubleshooting guides
    - Support bundle - take application name as input and generate a zip of config and logs
  - Jiva user guides
  - CSI Driver with remount feature
  - CSI Driver with Replica with Local PV hostpath
  - Workload cluster = jiva
  - Support from Kubera for Jiva
  - Performance comparison b/w single and parallel IO
- [Refactor E2e snapshot delete test]
  - Current test - trigger the delete and verify data integrity while deletion is in progress
    - controller to replica connection happens
    - restart master replica. Secondary replicas go to crash-loop-back-off. Takes time for replica to reconcile.
    - master comes back online after t1
    - replica1 comes back online after t1+2s == starts the sync process
    - master restarted again at t1+1s while sync is in progress
  - Test Details:
    - kill master and wait for new master to take over and bring replicas to RW state
    - R1 (RW), R2 (RW), R3 (WO) - 3 replicas are connected
    - Deployed application
    - R1 (RW), R2 (RW), R3 (WO) => Restart R1
      - 1 mins (wait for 1 min)
      - 5 mins (identify new master)
      - x mins (wait for RW, RW, WO)
    - R1 (NA), R2 (RW), R3 (NA) =>
    - R1 (NA), R2 (RW), R3 (WO) =>
    - R1 (WO), R2 (RW), R3 (RW) => Restart R15
    - R1 (NA), R2 (WO), R3 (RW) =>
    - R1 (WO), R2 (RW), R3 (RW) =>
    - 40 snapshots or 20 snapshots
    - R1 (WO), R2 (NA), R3 (RW) => ( 15 min ==> n snapshots * ~copy time + delta R1 CLBTimer)
    - R1 (RW), R2 (WO), R3 (RW) => ( 15 min ==> n snapshots * ~copy time + delta R2 CLBTimer)
    - R1 (RW), R2 (RW), R3 (RW) =>
    - Verify that all replicas are RW ( 15 min ==> n snapshots * 2 * ~copy time )
    - Wait till snaps become 12 ( 15 min ==> n or restarts * 1 mins? )

## May 13th, 2020

Attendees: Utkarsh, Payes, Vitta, Giri

### Status Updates
- [Payes]
  - Review pending PRs
  - Help Somesh on preload tests
  - Work on preload test scenario
- [Utkarsh]
  - Re-enable auto-snap deletion (design is done)
  - Re-architect jiva rebuilding and registration (design in progress)
- [Somesh]
  - e2e for pre-load optimizations (done)
  - jiva replica scaleup refactoring (done)

### Discussion Topics
- [Utkarsh] Failure during a resync could cause data loss
  - Rebuilding and registration process needs architecture change

## May 6th, 2020

Attendees: Utkarsh, Payes, Vitta, Kiran

### Status Updates
- [Payes]
  - Travis CI failing (blocker for RC1)
- [Utkarsh]
  - Retry reload via configurable timeout (https://github.com/openebs/sparse-tools/pull/6)
  - Re-enable auto-snap deletion (design in progress)
- [Somesh]
  - e2e for pre-load optimizations (in progress)

### Discussion Topics
- [Utkarsh] Failure during a resync could cause data loss
  - During resync, holes in image files are getting synced
  - If there is a failure after a partial sync of sparse files, the slave can become master.
  - This issue is triggered mainly due to either automatic or manual snap deletion

## April 29, 2020

Attendees: Somesh, Utkarsh, Payes, Vitta, Kiran

### Status Updates
- [Somesh]
  - Automated cleanup job testing (review is pending)
  - Manually verified flushing logs to file feature and automated it (review is pending)
- [Utkarsh]
  - Proposed the improvement/fix for two child issue
  - Sync progress feature has been merged
  - Sync progress on clone volume (review is in progress)
  - Configuring sync server with custom http timeouts or retry (WIP)
- [Payes]
  - Preload fix has been merged (the degraded replica was getting restarted because preload took too long while write IOs continued)
  - Work with Somesh to validate the fix

### Discussion Topics
- [Vitta]
  - Review Backlog at https://github.com/orgs/openebs/projects/1
    - Prioritize auto-snap deletion.
      Start with the design
    - jiva replica under IO pressure will be taken up by Payes. Work with Utkarsh.
    - jivactl two child is not a concern, so move to near term
- [Utkarsh]
  - Should a release tag be created on openebs/sparse-tools?
    - create a tag 1.0 and use that in jiva.
    - updating dependencies in a different commit helps to know what changes went in.
- [Giri/Somesh]
  - Functional Testing on Jiva
    - Provisioning of Jiva. Verify setting up and cleaning of various resources (done)
    - Provisioning Jiva with XFS (instead of default Ext4)
    - Volume Metrics from Jiva Target and Replicas
  - Application Testing on Jiva
    - Running Percona/MySQL on Raw Block Device with Jiva
    - Running MongoDB on Jiva.
    - Backup and Restore of PostgreSQL on Jiva
    - Application pods with Cross AZ functionality
  - Chaos Testing on Jiva
    - Provisioners, Jiva Target and Replica (podkill) during provisioning and deletion of PVCs.
    - Application pod kill
    - Node Failure/Restart
    - Node Delete (or changing the hostname of the node) and handle the graceful application pod rescheduling/creating new pod.
  - Scale Testing on Jiva
    - Test for capacity threshold crossing
    - Test for underlying device running out of capacity
    - Test for 50, 100, 200 volumes
  - Soak Testing on Jiva
    - Setup long running clusters - soak test bed
      - PostgreSQL on Jiva
      - MongoDB on Jiva
    - Use data generators to keep pumping data
    - Introduce occasional chaos as part of build/release pipelines into the soak test bed cluster on these applications.