# LVM / DM discussion
## Performance results
### ext4, thin-provisioning, LVM
```text
BenchmarkLVM/run-88 1 4464814810 ns/op 37.58 MB/s
BenchmarkLVM/prepare 1 2569627045 ns/op
BenchmarkLVM/write 1 1887356260 ns/op
BenchmarkLVM/commit 1 7589561 ns/op
```
### xfs, thin-provisioning, LVM
```text
BenchmarkLVM/run-88 1 3896416208 ns/op 43.06 MB/s
BenchmarkLVM/prepare 1 2470850110 ns/op
BenchmarkLVM/write 1 1417508987 ns/op
BenchmarkLVM/commit 1 7853430 ns/op
```
### ext4, thin-provisioning, devmapper
```text
BenchmarkDeviceMapper/run-88 1 3713029688 ns/op 45.18 MB/s
BenchmarkDeviceMapper/prepare 1 1745955485 ns/op
BenchmarkDeviceMapper/write 1 1960767979 ns/op
BenchmarkDeviceMapper/commit 1 5976977 ns/op
```
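For reference on reading these tables: the figures are Go benchmark output, and the `MB/s` column appears only when the benchmark calls `b.SetBytes`. A minimal sketch of how such numbers are produced; the workload here is a hypothetical stand-in, not the snapshotter's actual prepare/write/commit path:

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkCopy stands in for the snapshotter work being measured;
// the real benchmarks time prepare/write/commit against snapshots.
func benchmarkCopy(b *testing.B) {
	src := make([]byte, 1<<20) // 1 MiB per op (illustrative size)
	dst := make([]byte, len(src))
	b.SetBytes(int64(len(src))) // enables the MB/s column in the output
	for i := 0; i < b.N; i++ {
		copy(dst, src)
	}
}

func main() {
	// testing.Benchmark runs the function the same way `go test -bench` does;
	// MB/s is then derived as (bytes * N) / 1e6 / elapsed-seconds.
	r := testing.Benchmark(benchmarkCopy)
	fmt.Printf("%d ns/op\n", r.NsPerOp())
}
```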
### Discussion
- ext4 versus xfs: any obvious sources of the performance difference? None identified yet.
#### Why LVM over devicemapper
- Tooling: with LVM it is possible to use the command-line tools to get a good idea of what has happened and to manipulate things on the fly; DM is very complex.
- Performance: not yet clear.
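As an illustration of the tooling point: LVM state can be inspected with `lvs --reportformat json` and parsed programmatically. A hedged Go sketch; the sample JSON is inlined because running `lvs` needs root and a real volume group, and the VG/LV names are made up (the struct fields mirror the real report keys `lv_name`, `vg_name`, `lv_size`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lvsReport mirrors the JSON emitted by `lvs --reportformat json`.
type lvsReport struct {
	Report []struct {
		LV []struct {
			Name   string `json:"lv_name"`
			VGName string `json:"vg_name"`
			Size   string `json:"lv_size"`
		} `json:"lv"`
	} `json:"report"`
}

// parseLVS extracts "vg/lv" names from an lvs JSON report.
func parseLVS(data []byte) ([]string, error) {
	var rep lvsReport
	if err := json.Unmarshal(data, &rep); err != nil {
		return nil, err
	}
	var names []string
	for _, r := range rep.Report {
		for _, lv := range r.LV {
			names = append(names, lv.VGName+"/"+lv.Name)
		}
	}
	return names, nil
}

func main() {
	// In practice this bytes slice would come from
	// exec.Command("lvs", "--reportformat", "json").Output().
	sample := []byte(`{"report":[{"lv":[{"lv_name":"snap0","vg_name":"fcpool","lv_size":"10.00g"}]}]}`)
	names, err := parseLVS(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [fcpool/snap0]
}
```

Getting an equivalent view out of raw devicemapper means correlating `dmsetup table`/`dmsetup status` output by hand, which is part of why the tooling argument favors LVM.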
## Questions from review
1. Q: What is the default size of the snapshot we create, and what is a good value? (LVM: 10G; not defined by dm.)
   A: Hasn't been addressed in the implementation at this point; not much thought has gone into an appropriate value yet.
2. Q: xfs vs ext4: write performance is better with xfs as shown above, but what are the trade-offs, and why choose ext4?
   A: Both the Kata and Firecracker teams took the same approach: start with one known filesystem and make it configurable for the user later. Kata went with xfs because of reflink support; Firecracker just started with ext4.
3. Q: Active devices: why does dm leave all layers/snapshots always active? (LVM deactivates on commit, activates on prepare/view.)
   A: It was not an explicit decision to leave unused layers active; we should look at what the theoretical limit is. There is a small overhead when deactivation is included.
4. Q: On commit, the size of the snapshot is not measured by dm. Any reason for that? (LVM does a temp mount, uses containerd/continuity/fs to walk the inodes, and reports the size back.)
   A: Something being considered for the future.
5. Q: Metadata store location: LVM creates a new LV and stores the metadata inside it. Any reason dm does not prefer to do that?
   A: The benefit of keeping it within dm-managed storage is that when the DM storage comes from a remote disk used entirely for DM, the metadata can just be carried with it. For Kata, we created a volume within what was given to us, so that everything is consolidated. This wasn't considered thoroughly; dm treats the system as a single unified entity today, and boltDB was easy to use. We still use boltDB on the LVM side; we just keep it in a volume that is part of the snapshotter's managed volumes. (The difference is just where we create the file.)
6. Q: Both LVM and dm shell out to command-line tools. Have you pursued linking a library (liblvm) dynamically instead of shelling out?
   A: Not asked.
7. Q: Any SELinux implications? Have you tried this snapshotter with SELinux?
   A: Only with `setenforce 0`; otherwise untested.
8. Discussion regarding RO images, [moby/27364](https://github.com/moby/moby/pull/27364). This has been discussed on the AWS containers team, but isn't on the roadmap now.
   - Why not make the VMM image itself be a container: create a block device that is the VM rootfs, then have a snapshot for each VM; if it is RO, then great.
   - If there is work and discussion around this, we can send the PRs that were denied in the past, which show how we've been pushing this. Harder for runc, but very applicable for both solutions. More than one community requesting it would be helpful.
   - RO path (LVM or DM): even though they are snapshots, file caching on the host doesn't really kick in (they are two different filesystems). If they are truly RO, then the file cache on the host will be shared (useful for the runc use case; not sure about the VM case, where the caching will happen inside the VMs).
CI: runs the tests for the snapshotter; nothing runs at a bigger scale (no soak testing). Discussed what kind of e2e tests to add, and using containerd's snapshotter test suite.
What Firecracker would like for fs proxying:
- Non-POSIX by design, but kind of like POSIX
- Really just want people to move away from POSIX usage
## Next Steps
- Quantify the benefits of LVM and reach analysis/closure on the performance gap
- Look at their code base with a plan to integrate LVM support and make it configurable