OpenEBS Jiva Discussions Notes
===
:::info
- **Location:** https://meet.google.com/nvu-dhwx-jhb?hs=122
- **Date:** Every Wednesday from 7:00pm to 7:30pm (IST)
- **Agenda**
1. Walk through the status update
2. Discussion topic
*Add your request, question or suggestion to our [issue list](https://github.com/openebs/openebs/issues)*
*If you have anything you’d like to put on the agenda, please do so below for the next meeting:*
:::
# Mar 10th
Topics for discussion:
- Jiva Volume status being updated very frequently even in RW mode.
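```go
// Status is fetched and updated on every reconcile while the volume is Ready or Syncing.
switch instance.Status.Phase {
case jv.JivaVolumePhaseReady, jv.JivaVolumePhaseSyncing:
	return reconcile.Result{}, r.getAndUpdateVolumeStatus(instance)
}
```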
- Merge jiva csi and operator yaml
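
For the status-update topic above, one possible shape of a fix is sketched below. It is not the jiva-operator code: `Volume`, `Phase`, `Result` and `reconcileStatus` are hypothetical names, and `Result` only copies the two fields of controller-runtime's `reconcile.Result` so the sketch has no dependencies. The 5s interval comes from the Feb 24 notes; skipping the write when nothing changed is an extra assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Phase is a hypothetical stand-in for the JivaVolume phase type.
type Phase string

const (
	PhaseReady   Phase = "Ready"
	PhaseSyncing Phase = "Syncing"
	PhasePending Phase = "Pending"
)

// Result copies the fields of controller-runtime's reconcile.Result
// so that this sketch compiles on its own.
type Result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// resyncInterval is the 5s value mentioned in the notes.
const resyncInterval = 5 * time.Second

// Volume is a hypothetical, trimmed-down volume object.
type Volume struct {
	Phase  Phase
	Status string // last status written to the API server
}

// reconcileStatus writes the status only when it differs from what is
// stored, and asks to be requeued after resyncInterval.
// The second return value reports whether a write happened.
func reconcileStatus(v *Volume, observed string) (Result, bool) {
	switch v.Phase {
	case PhaseReady, PhaseSyncing:
		wrote := false
		if v.Status != observed {
			v.Status = observed
			wrote = true
		}
		return Result{RequeueAfter: resyncInterval}, wrote
	}
	return Result{}, false
}

func main() {
	v := &Volume{Phase: PhaseReady, Status: "RW"}
	res, wrote := reconcileStatus(v, "RW")
	fmt.Println(res.RequeueAfter, wrote)
}
```

With this shape, a volume that stays in RW mode is polled every few seconds but produces no writes to the CR.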
NFS
- Repos:
- https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner
- https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner
- https://github.com/openebs/dynamic-nfs-provisioner
- Give a walkthrough of NFS
- Comment on the issues, follow up with users
- Set up CI/CD pipeline
- Verify storage classes in dynamic nfs provisioner
- Documentation and examples
- Native support for RWM volumes
- Support for RO volumes
# Pending items for winding up jiva-csi
- upgrade github actions test(Merged)
- Migration doc from non-CSI to CSI volumes(PR to be raised)
- Jiva Operator Documentation(PR raised under review)
- Jiva README(PR raised under review)
- e2e PR merge/pipeline handover(waiting on pipeline fix, PR raised and reviewed)
# Mar 3
- [Shubham]
- Add version label on all the resources(Merged)
- operator-sdk pr under review by akhil and prateek(Merged)
- Upgrade pending(Waiting on maya update)
- Upgrade test cases (")
- Non CSI to CSI migration doc (Probably next release)
- Non CSI to CSI migration test (")
- Remove '(' from version in jivavolume CR
- Error logs
- [Prateek]
- Add jiva operator to openebs charts repo
- Google analytics (Conflicts to be resolved)
- [Payes]
- Verify support for RO volumes in cstor/jiva
- Setup e2e pipeline for jiva-operator(A new branch has been created)
- README(PR under review)
- Docs
- Add events to Jiva Volume CR
- Verify if Local PV are resized when jiva Volume PVC is resized
- [Kiran]
- Remove travis
- Close the following Jira issues:
- https://mayadata.atlassian.net/browse/OC-11?filter=-
# Feb 24
- [Shubham]
- Add version label on all the resources(Merged)
- operator-sdk pr under review by akhil and prateek(Merged)
- Upgrade pending(Waiting on maya update)
- Upgrade test cases (")
- Non CSI to CSI migration doc (Probably next release)
- Non CSI to CSI migration test (")
- Jiva Volume status being updated very frequently even in RW mode. (Fixed by increasing resync interval to 5s)
- [Prateek]
- Add jiva operator to openebs charts repo
- Google analytics (Under review)
- [Payes]
- Verify support for RO volumes in cstor/jiva
- Setup e2e pipeline for jiva-operator(A new branch has been created)
- README(PR under review)
- Docs
- Add events to Jiva Volume CR
- Verify if Local PV are resized when jiva Volume PVC is resized
- [Kiran]
- Remove travis
- Close the following Jira issues:
- https://mayadata.atlassian.net/browse/OC-11?filter=-
# Feb 17
- [Shubham]
- Jiva Volume status being updated very frequently even in RW mode.
- Upgrade operator-sdk in jiva-operator(To be decided)
- Upgrade (Draft PR raised) will be taken in next release
- Upgrade test (Next release)
- Non CSI to CSI migration doc (Probably next release)
- Non CSI to CSI migration test (")
- [Prateek]
- Analytics for jiva-csi volumes
- [Payes]
- Verify support for RO volumes in cstor/jiva
- Setup e2e pipeline for jiva-operator
- Docs(PR sent to update README)
- Add events to Jiva Volume CR
- Verify if Local PV are resized when jiva Volume PVC is resized
- [Kiran]
- Remove travis
- Close the following Jira issues:
- https://mayadata.atlassian.net/browse/OC-11?filter=-
# Feb 10
- [Shubham]
- Reviewing the helm chart PRs
- jiva-operator doesn't reconcile if jiva-volume policy is applied later(Merged)
- Upgrade (Draft PR raised) will be taken in next release
- Upgrade test (Next release)
- Non CSI to CSI migration doc (Probably next release)
- Non CSI to CSI migration test (")
- [Prateek]
- Multi arch (Merged)
- Helm chart (Merged)
- Topology support(Merged)
- [Payes]
- Jiva = hash based quorum. Check for the case where 3 replicas are registered and a non-master replica tries to reconnect with a new ID.(Verified)
- jiva: restart while initial registration (Merged)
- Update go version in jiva to 1.14.7(Merged)
- raw block support(merged)
- Resize failing for raw block volumes(k8s does not support in 1.19, need to document this after cross verifying)
- gotgt: metrics - issue/enhancement
- gotgt: refactor io processing to single thread with queue vs thread per io.
- csi: ginkgo tests
- e2e: csi based tests
- e2e: upgrade / migration tests
- e2e: missing resiliency tests?
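
The "single thread with queue vs thread per io" item above can be illustrated with a minimal sketch. This is not gotgt code: `ioRequest`, `serveQueue` and `submitAll` are hypothetical names, and the only point shown is that one worker draining a queue serves requests in arrival order without locking.

```go
package main

import (
	"fmt"
	"sync"
)

// ioRequest is a hypothetical stand-in for one IO command.
type ioRequest struct {
	id   int
	done chan int // receives the position in which the request was served
}

// serveQueue drains reqs on a single goroutine, so requests are handled
// strictly in arrival order and the counter needs no lock.
func serveQueue(reqs <-chan ioRequest, wg *sync.WaitGroup) {
	defer wg.Done()
	served := 0
	for r := range reqs {
		served++
		r.done <- served
	}
}

// submitAll pushes n requests through one queue and returns, for each
// request, the position in which the worker served it.
func submitAll(n int) []int {
	reqs := make(chan ioRequest, n)
	var wg sync.WaitGroup
	wg.Add(1)
	go serveQueue(reqs, &wg)

	dones := make([]chan int, n)
	for i := 0; i < n; i++ {
		dones[i] = make(chan int, 1) // buffered so the worker never blocks
		reqs <- ioRequest{id: i, done: dones[i]}
	}
	close(reqs)
	wg.Wait()

	order := make([]int, n)
	for i, d := range dones {
		order[i] = <-d
	}
	return order
}

func main() {
	fmt.Println(submitAll(4))
}
```

A thread-per-IO design would instead start one goroutine per request, which gives more parallelism but needs explicit synchronization around any shared state.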
# Feb 3
- [Shubham]
- Merge operator and csi (Merged)
- Version details on jiva-volume CR (Merged)
- Target Affinity (Merged)
- Upgrade (Draft PR raised) will be taken in next release
- Upgrade test (Next release)
- Non CSI to CSI migration doc (Probably next release)
- Non CSI to CSI migration test (")
- [Prateek]
- Multi arch (Merged)
- Helm chart (Inprogress - RC2)
- [Payes]
- jiva: restart while initial registration (Up for review)
- csi: Root mount (Merged)
- csi: Attach required as false (Merged)
- raw block support(PR raised)
- csi matrix for raw block(PR raised)
- Update go version in jiva to 1.14.7(PR raised)
- Use logrus for jiva operator (PR needs to be raised)
- Topology support (To be added)
- Readme Pending(RC2)
- jiva-operator doesn't reconcile if jiva-volume policy is applied later
- Resize failing for raw block volumes
- gotgt: metrics - issue/enhancement
- gotgt: refactor io processing to single thread with queue vs thread per io.
- csi: ginkgo tests
- e2e: csi based tests
- e2e: upgrade / migration tests
- e2e: missing resiliency tests?
# Jan 27
- [Shubham]
- Merge operator and csi (Merged)
- Version details on jiva-volume CR (In progress)
- Upgrade (Draft PR raised)
- Upgrade test
- Non CSI to CSI migration doc
- Non CSI to CSI migration test
- [Prateek]
- Multi arch (Inprogress)
- Helm chart (Inprogress)
- [Payes]
- jiva: restart while initial registration (Merged)
- csi: Root mount (Merged)
- csi: Attach required as false (Merged)
- Use logrus for jiva operator
- Topology support (To be added)
- Check raw block support(PR to be sent)
- Check csi matrix(Code complete, PR will be raised)
- gotgt: metrics - issue/enhancement
- gotgt: refactor io processing to single thread with queue vs thread per io.
- csi: ginkgo tests
- e2e: csi based tests
- e2e: upgrade / migration tests
- e2e: missing resiliency tests?
## Jan 20
- [Shubham]
- Merge operator and csi (Review pending)
- Version details on jiva-volume CR
- Upgrade
- Node Affinity pending in Jiva Volume Policy(Not Required, dependent on local PV)
- [Prateek]
- Multi arch
- Helm chart
- [Payes]
- jiva: restart while initial registration (PR Raised)
- csi: Root mount (PR Raised)
- csi: Attach required as false (PR Raised)
- Use logrus for jiva operator
- Review and merge https://github.com/openebs/jiva-operator/pull/28/files
- gotgt: metrics - issue/enhancement
- gotgt: refactor io processing to single thread with queue vs thread per io.
- csi: ginkgo tests
- e2e: csi based tests
- e2e: upgrade / migration tests
- e2e: missing resiliency tests?
## Jan 13
- [Shubham]
- Created separate issues for tasks.
- Working on adding version details to jivavolume CR.
- CSI volume upgrades == will be added to openebs/upgrade
- multi-arch builds with github actions
- helm chart for jiva
- Do we need an admission webhook to validate jiva volume day 2 operations?
- Resize will be done via PVC
- Replica Scale-up? will not support this.
- [Payes]
- jiva: restart while initial registration
- gotgt: metrics - issue/enhancement
- gotgt: refactor io processing to single thread with queue vs thread per io.
- csi: Root mount
- csi: Attach required as false
- csi: ginkgo tests
- e2e: csi based tests
- e2e: upgrade / migration tests
- e2e: missing resiliency tests?
- Open Discussions
- Merge jiva-csi and jiva operator code into jiva-operators
- jiva-operators will have the CSI as well as helm charts
- jiva-operators follow Operator SDK pattern and JivaVolumeCR is embedded into this repo.
- Pros:
- Avoids maintaining dependencies and E2e tests in two repos
- Cons:
- log framework
- cStor CSI uses logrus
- K8s uses klog everywhere
- uber/zap is used in jiva operator
- troubleshooting
- depends on the above log framework
- easy to integrate with tools used in the customer clusters
- prometheus/grafana/alertmanager integrations
## Jan 6
- [Shubham]
- Work on completion of parity
- [Payes]
- Start working on uuid change
- [Kiran]
- Got permission for kubernetes-sigs repo
## Dec 23
- [Shubham/prateek]
- Continue with feature parity
- [Payes]
- Clean empty files in raw block mode(Kubernetes bug not ours)
- Add attacher container in multi attach PR(Done)
## Dec 16
- [Shubham/Prateek]
- https://github.com/openebs/jiva-operator/issues/25
- https://docs.google.com/spreadsheets/d/1QfMFGjqDxp_buzEq8XtuTGHJd4zMSKVKYugWQaPIgTQ/edit?ts=5df0bae6#gid=0
- [Payes]
- Trying out Stork
- Multi-attach error PR
- Look into nfs issues
- Familiarize with Velero
- [Kiran]
- NFS ganesha automation setup
- Dynamic NFS provisioner
- PRs to be merged:
- https://github.com/openebs/jiva/pull/337
- https://github.com/openebs/sparse-tools/pull/7
- https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner/pull/17
## Dec 2nd
- [Payes]
- Node failure test cases in progress
- Learning Ansible (done)
- Learning the current jiva pipelines and seeing the execution status
- Raising a PR for simple change in the current jiva pipeline
- Raise a PR to crash application node (Under review)
- Raise a PR to crash RW replica node (Under review)
- Raise a PR to crash WO node (Under review)
- Work on raising a PR to litmus for node shutdown tests
- m_upgrade or jiva-csi
- NFS Provisioner
- Raised a PR to fix 152.152 (done)
- Fix the CLA(done)
- Test for upgrades with nfs volumes that crossed 152 with previous code and users might have manually edited the exports file(done)
- Follow up with community to merge
- To be documented:
- Quota doesn't work if the underlying filesystem of nfs provisioner is not xfs
- The observation below needs to be shared with McKinsey(Sent):
- If only Filesystem_id is updated volume gets deleted but a stale entry in vfs.conf is left
- If export ID is updated, then the PV will not be deleted, as PV has annotation with old Export ID
- To delete PV when export ID is updated, PV's annotation should also be updated.
- [Kiran]
- Dynamic NFS provisioner
- https://github.com/kmova/dynamic-nfs-provisioner
- 02/12 : Pushed the new repo into openebs with create/delete functionality.
## Nov 4th/18th, 2020
### Status Updates
- [Payes]
- Node failure test cases in progress
- Learning Ansible (done)
- Learning the current jiva pipelines and seeing the execution status
- Raising a PR for simple change in the current jiva pipeline
- Raise a PR to crash application node (Under review)
- Raise a PR to crash RW replica node (Under review)
- Raise a PR to crash WO node (Under review)
- Work on raising a PR to litmus for node shutdown tests
- m_upgrade or jiva-csi
- NFS Provisioner
- Raised a PR to fix 152.152 (done)
- Fix the CLA(done)
- Test for upgrades with nfs volumes that crossed 152 with previous code and users might have manually edited the exports file(done)
- Follow up with community to merge
- To be documented:
- Quota doesn't work if the underlying filesystem of nfs provisioner is not xfs
- The observation below needs to be shared with McKinsey:
- If only Filesystem_id is updated volume gets deleted but a stale entry in vfs.conf is left
- If export ID is updated, then the PV will not be deleted, as PV has annotation with old Export ID
- To delete PV when export ID is updated, PV's annotation should also be updated.
- [Kiran]
- Dynamic NFS provisioner
- https://github.com/kmova/dynamic-nfs-provisioner
## Oct 21, 2020
### Status Updates
- [Payes]
- Node failure test cases in progress
## Oct 14, 2020
### Status Updates
- [Utkarsh]
- PRs related to vendor updates in Jiva CSI and Jiva operator
- CSI Sanity
- [Payes]
- Remount, resize and volume provisioning test cases merged
- Need to connect with gostor team (Pending)
- Raise gotgt PR upstream(Not yet raised)
- Node Failure Tests(Pending)
- [Kiran]
- NFS operator build code to be pushed
- OpenEBS NFS Operator (Pending)
## Oct 7, 2020
### Status Updates
- [Payes]
- Remount issue(Merged)
- Jiva CSI integration tests(In Progress) - PR pushed, travis failing
- Need to connect with gostor team (Pending)
- Raise gotgt PR upstream(Not yet raised)
- Node Failure Tests(Pending)
- NFS Operator (Pending)
- [Utkarsh]
- PR merged in CSI sanity tests
- Will create PR to remove custom code for CSI sanity
- [Kiran]
-
## Sept 30, 2020
### Status Updates
- [Shashank]
- Test list (in pipeline and planned)
- https://docs.google.com/spreadsheets/d/18kxwynIe59gIDeIKGhf1oZzWZdEAsnqMgjC9-H0Jx9s/edit#gid=0
- Node failure in case of NFS provisioner on Jiva(Planned)
- Kubelet failure where nfs provisioner is scheduled
- [Payes]
- License update in Jiva-csi and Jiva-operator (Done)
- Build failures due to version mismatches (Done)- Received help from Prateek
- Remount issue PR raised (Under Review) - Sai is Reviewing
- Jiva CSI integration tests under progress
- Need to connect with gostor team (Pending)
- Raise gotgt PR upstream(Not yet raised)
- Node Failure Tests(Pending)
- NFS Operator (Pending)
## Sept 23, 2020
Attendees: Kiran/Payes/Shashank
### Status Updates
- [Payes]
- License update(Done)
- Jiva CSI Stability(InProgress)
- Node Failure Tests(planned)
- Raise gotgt PR upstream(InProgress)
- NFS Operator(Not Planned)
- [Shashank]
- Review Node Failure tests(Planned)
- List NFS tests on Jiva in e2e pipeline(Done)
- Verify if Jiva Metrics are tested (Planned)
- Jiva CSI driver related e2e tests (Not Planned)- Not present in pipeline
- [Kiran]
- NFS external provisioner CI automation(Planned)
### Backlogs
- Multi arch build for Jiva and Jiva CSI driver (Issue Created)
- Recovering from complete node failure(code to be added in operator)
- Jiva CSI and operator integration tests (Pending)
- Update Contributor and User Documentation
- Jiva CSI Resize feature - Supported
- Jiva CSI Metrics - Supported
- Enhancements to Jivactl
- Enhancements to OpenEBS ctl to include Jiva Volumes
- Verify Jiva Volume CR status updates and events
## Sept 16, 2020
Attendees: Kiran/Payes
### Status Updates
- [Payes]
- Update license in all files
- Jiva CSI Stability
- Node Failure Tests
## Sept 2, 2020
Attendees: Payes
### Status Updates
- [Payes]
- Jiva Read performance improvement done, code yet to be pushed to upstream gostor repo
## Aug 26th, 2020
Attendees: Shashank, Payes, Kiran
### Status Updates
- [Shashank/Payes]
- Jiva Reads give n*x performance numbers when no writes are being done(Not really happening)
- Verify if metrics is tested in pipelines
- Jiva CSI is under progress
- Verify if Jiva over NFS tests are being run in pipeline
- What happens if filesystem goes in RO state in pipeline
- Test e2e replica rebuild slowness using replicated
- Replica data loss case for dummy path in e2e
## Aug 19th, 2020
Attendees: Shashank, Payes, Kiran
### Status Updates
- [Shashank/Payes]
- 2.0 released with fixes for auto delete of internal snapshots
- The fix is back-ported/cherry picked to 1.12.x
- 2.0 also includes refactoring on the rebuild with check-points
- Enhanced Travis Integration tests around check-point and data integrity tests with chaos = random and continuous were executed manually
- E2e pipelines were updated with chaos tests on controller failure.
### Discussion Topics
- 2.x Goals - Brainstorming
- Stability of Jiva - e2e and integration tests
- NFS on Jiva => stability and best practices
- Complete cluster shutdown and restart
- Error reporting for failures and troubleshooting guides
- Support bundle - take application name as input and generate a zip of config and logs
- Jiva user guides
- CSI Driver with remount feature
- CSI Driver with Replica with Local PV hostpath
- Workload cluster = jiva
- Support from Kubera for Jiva
- Performance comparison b/w single and parallel IO
- [Refactor E2e snapshot delete test]
- Current test - trigger the delete and verify data integrity while deletion is in progress
- controller to replica connection happens - restart master replica. secondary replicas go to crash-loop-back-off. takes time for replica to reconcile.
- master comes back online after t1
- replica1 comes back online after t1+2s == starts the sync process
- master restarted again at t1+1s while sync is in progress
- Test Details:
- kill master and wait for new master to take over and bring replicas to RW state
- R1 (RW), R2 (RW), R3 (WO)
- 3 replicas are connected
- Deployed application
- R1 (RW), R2 (RW), R3 (WO) => Restart R1
- 1 mins (wait for 1 min)
- 5 mins (identify new master)
- x mins (wait for RW, RW, WO)
- R1 (NA), R2 (RW), R3 (NA) =>
- R1 (NA), R2 (RW), R3 (WO) =>
- R1 (WO), R2 (RW), R3 (RW) => Restart R15
- R1 (NA), R2 (WO), R3 (RW) =>
- R1 (WO), R2 (RW), R3 (RW) =>
- 40 snapshots or 20 snapshots
- R1 (WO), R2 (NA), R3 (RW) => ( 15 min ==> n snapshots * ~copy time + delta R1 CLBTimer)
- R1 (RW), R2 (WO), R3 (RW) => ( 15 min ==> n snapshots * ~copy time + delta R2 CLBTimer)
- R1 (RW), R2 (RW), R3 (RW) =>
- Verify that all replicas are RW ( 15 min ==> n snapshots * 2 * ~copy time )
- Wait till snaps become 12 ( 15 min ==> n or restarts * 1 mins? )
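
The "wait for RW" steps in the test details above could be expressed as a polling helper. This is a sketch under assumptions: `allRW`, `waitForAllRW` and the `getModes` hook are hypothetical, the mode labels (RW, WO, NA) are the ones used in the notes, and the simulated sequence replaces a real query to the Jiva controller.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// allRW reports whether every replica is in RW mode.
func allRW(modes map[string]string) bool {
	if len(modes) == 0 {
		return false
	}
	for _, m := range modes {
		if m != "RW" {
			return false
		}
	}
	return true
}

// waitForAllRW polls getModes every interval until all replicas are RW
// or the timeout expires. getModes is a hypothetical hook; a real test
// would ask the controller for the current replica modes.
func waitForAllRW(getModes func() map[string]string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if allRW(getModes()) {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for all replicas to become RW")
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulated tail of the sequence in the notes: R1 rebuilds, then R2.
	states := []map[string]string{
		{"R1": "WO", "R2": "NA", "R3": "RW"},
		{"R1": "RW", "R2": "WO", "R3": "RW"},
		{"R1": "RW", "R2": "RW", "R3": "RW"},
	}
	i := 0
	next := func() map[string]string {
		s := states[i]
		if i < len(states)-1 {
			i++
		}
		return s
	}
	err := waitForAllRW(next, time.Second, 10*time.Millisecond)
	fmt.Println("all replicas RW:", err == nil)
}
```

In a real run the timeout would be derived from the budget in the notes (number of snapshots times copy time, plus the crash-loop back-off delay) rather than a fixed second.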
## May 13th, 2020
Attendees: Utkarsh, Payes, Vitta, Giri
### Status Updates
- [Payes]
- Review pending PRs
- Help Somesh on preload tests
- Work on preload test scenario
- [Utkarsh]
- Re-enable auto-snap deletion (design is done)
- Re-architect jiva rebuilding and registration (design in progress)
- [Somesh]
- e2e for pre-load optimizations (done)
- jiva replica scaleup refactoring (done)
### Discussion Topics
- [Utkarsh] Failure during a resync could cause data loss
- Rebuilding and registration process needs architecture change
## May 6th, 2020
Attendees: Utkarsh, Payes, Vitta, Kiran
### Status Updates
- [Payes]
- Travis CI failing (blocker for RC1)
- [Utkarsh]
- Retry reload via configurable timeout (https://github.com/openebs/sparse-tools/pull/6)
- Re-enable auto-snap deletion (design in progress)
- [Somesh]
- e2e for pre-load optimizations (in-progress)
### Discussion Topics
- [Utkarsh] Failure during a resync could cause data loss
- During resync, holes in image files are getting synced
- If there is a failure after a partial sync of sparse files, the slave can become master.
- This issue is triggered mainly due to either automatic or manual snap deletion
## April 29, 2020
Attendees: Somesh, Utkarsh, Payes, Vitta, Kiran
### Status Updates
- [Somesh]
- Automated cleanup job testing (review is pending)
- Manually verified flushing logs to file feature and automated it (review is pending)
- [Utkarsh]
- Proposed the improvement/fix for two child issue
- Sync progress feature has been merged
- Sync progress on clone volume (review is in progress)
- Configuring sync server with custom http timeouts or retry (WIP)
- [Payes]
- Preload fix has been merged (the degraded replica was getting restarted because preload took too long while write IOs continued)
- Work with somesh to validate the fix
### Discussion Topics
- [Vitta]
- Review Backlog at https://github.com/orgs/openebs/projects/1
- Prioritize auto-snap deletion. Start with the design
- jiva replica under IO pressure will be taken up by Payes, working with Utkarsh.
- jivactl two child is not a concern, so move to near term
- [Utkarsh]
- Should a release tag be created on openebs/sparse-tools?
- create a tag 1.0 and use that in jiva.
- update dependencies in a different commit helps to know what changes went in.
- [Giri/Somesh]
- Functional Testing on Jiva
- Provisioning of Jiva. Verify setting up and cleaning of various resources (done)
- Provisioning Jiva with XFS (instead of default Ext4)
- Volume Metrics from Jiva Target and Replicas
- Application Testing on Jiva
- Running Percona/MySQL on Raw Block Device with Jiva
- Running MongoDB on Jiva.
- Backup and Restore of PostgreSQL on Jiva
- Application pods with Cross AZ functionality
- Chaos Testing on Jiva
- Provisioners, Jiva Target and Replica (podkill) during provisioning and deletion of PVCs.
- Application pod kill
- Node Failure/Restart
- Node Delete (or changing the hostname of the node) and handle the graceful application pod rescheduling/creating new pod.
- Scale Testing on Jiva
- Test for capacity threshold crossing
- Test for underlying device running out of capacity
- Test for 50, 100, 200 volumes
- Soak Testing on Jiva
- Setup long running clusters - soak test bed
- Postgresql on Jiva
- MongoDB on Jiva
- Use data generators to keep pumping data
- Introduce occasional chaos as part of build/release pipelines into the soak test bed cluster on these applications.