# FileCache investigation
It has been reported that enabling `...runners.gtfn.FileCache` in the "translation" step of GTFN does not seem to speed things up. This note collects questions and findings about that.
## Question: What's up with `FileCache.__init__`?
### Answer so far
Quote from the docstring of `FileCache`:
> This class extends `diskcache.Cache` to ensure the cache is properly
>
> - opened when accessed by multiple processes using a file lock. This guards the creating of the cache object, which has been reported to cause `sqlite3.OperationalError: database is locked` errors and slow startup times when multiple processes access the cache concurrently. While this issue occurred frequently and was observed to be fixed on distributed file systems, the lock does not guarantee correct behavior in particular for accesses to the cache (beyond opening) since the underlying SQLite database is unreliable when stored on an NFS based file system. It does however ensure correctness of concurrent cache accesses on a local file system. See #1745 for more details.
> - closed upon deletion, i.e. it ensures that any resources associated with the cache are properly released when the instance is garbage collected.
Quote from https://github.com/GridTools/gt4py/pull/1745:
> The disk cache used to cache compilation in the gtfn backend has a race condition manifesting itself in sqlite3.OperationalError: database is locked errors if multiple python processes try to initialize the diskcache.Cache object concurrently. This PR fixes this by guarding the object creation by a file-based lock in the same directory as the database.
> While this issue occurred frequently and was observed to be fixed on distributed file systems, the lock does not guarantee correct behavior in particular for accesses to the cache (beyond opening) since the underlying SQLite database is unreliable when stored on an NFS based file system. It does however ensure correctness of concurrent cache accesses on a local file system. See more information here:
> https://grantjenks.com/docs/diskcache/tutorial.html#settings
> https://www.sqlite.org/faq.html#q5
> tox-dev/filelock#73
>
> NFS safe locking:
> https://gitlab.com/warsaw/flufl.lock
### Additional thoughts
According to [the diskcache docs](https://grantjenks.com/docs/diskcache/tutorial.html#cache), a `diskcache.Cache` can be **opened** multiple times concurrently. However, the docs do not say anything about **creating** multiple `diskcache.Cache` instances in the same directory, concurrently or otherwise. The current usage of `FileCache` creates an instance per cached toolchain variant.
Running the test suite with pytest-xdist will also create an instance per variant per worker, concurrently.
### Conclusion for the investigation
While testing the `FileCache` class, check for
- [X] multiple toolchain variants trying to create a database in the same place for their file cache. Currently they would wait their turn, but then each would ensure that the database is set up properly (by running SQL commands; source: `diskcache` source code on GitHub), which is probably slow.
- this happens 15 times when running `test_next-3.11(internal, cpu, nomesh)` on 12 workers.
- [X] toolchain variants created on the fly overwriting the existing database with an empty one, effectively forcing all the work to be redone once again.
- no overwriting should happen according to the `diskcache` source code.
### Conclusion for future refactors
There are currently no tests for `FileCache`, neither unit tests nor integration or performance tests. Or if they exist, they are not recognizable as such.
Any refactor should start with a test that shows:
- using a `CachedStep` with a `FileCache` fills the cache
- creating a new, identical `CachedStep` instance does not destroy the cache
- creating a new, identical `CachedStep` reuses the cache
## Question: how many instances of FileCache are created during a run of the test suite?
### Answer
15 instances when running `nox -s "test_next-3.11(internal, cpu, nomesh)"` on 12 workers.
Answer obtained using the following debug code in `FileCache.__init__`:
```python
import pathlib

# `self` and `directory` refer to FileCache.__init__'s scope
logfile = pathlib.Path("/Users/ricoh/Documents/gt4py/file-cache-log.txt")
with logfile.open("a") as debugfile:
    debugfile.write(f"Initing FileCache {id(self)} at {directory or 'some tempdir'}\n")
```
### Conclusions
`FileCache` might be enabled by default. There seem to be three additional on-the-fly toolchain variants using it (or one variant picked up by three workers, or something in between).
## Question: what does the `FileCache` look like after running the tests?
### Answer
After running `nox -s "test_next-3.11(internal, cpu, nomesh)"`, there are two items in there (type: `ProgramSource`).
### Conclusions
This is weird: 15 instances were created, but only two of them were used to cache a `ProgramSource`, and there should be more than two programs in this test suite. Is file caching disabled by default in the tested GTFN backends and only exercised for two programs?
## Question: what happens if one were to switch the tests to using cached GTFN by default?
### Method
1. Change the default `runners.gtfn.run_gtfn` to set `otf_workflow__cached_translation=True`, then run `nox -s "test_next-3.11(internal, cpu, nomesh)"`. (Starting state: cache empty except for the two stencils that were already explicitly tested with the cached backend before.)
2. Change back and compare (Cache full but file caching off by default).
3. Keeping the file cache warm, switch back again and compare
### Answer
18 tests fail only with file caching, all with the message
```
TypeError: CachedStep.__init__() got an unexpected keyword argument 'enable_itir_transforms'
```
They are parametrizations of
- `test_builtins.py::test_arithmetic_and_logical_functors_gtfn`
- `test_temporaries_with_sizes.py::test_verification`
Duration:
- 127s (avg of 3) without file caching translation
- 122.95s with file caching translation (cold start)
- 111s (avg of 3) with file caching translation (warm start)
The number of keys in the file cache grows every run even with warm starts.
### Conclusions
#### Test failures
Again, the factories make things difficult when used to nest steps automatically, because steps need to stay configurable. In this case the translation step is configurable, and somewhere in the code that configurability is relied on (probably via `replace()`). However, when the translation step is wrapped in a `CachedStep`, this breaks.
This can be fixed with extra logic in `CachedStep.replace` to pass unknown kwargs to the inner step's `replace` (PoC available).
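The forwarding idea can be sketched on top of `dataclasses.replace`. The classes below are minimal stand-ins, not gt4py's actual `CachedStep` (whose fields and `replace` differ); the point is splitting kwargs between the wrapper and the inner step:

```python
import dataclasses
import typing


@dataclasses.dataclass(frozen=True)
class TranslationStep:
    """Stand-in for a configurable inner step."""

    enable_itir_transforms: bool = True


@dataclasses.dataclass(frozen=True)
class CachedStep:
    """Stand-in wrapper that forwards unknown `replace` kwargs inward."""

    step: TranslationStep

    def replace(self, **kwargs: typing.Any) -> "CachedStep":
        own_fields = {f.name for f in dataclasses.fields(self)}
        own = {k: v for k, v in kwargs.items() if k in own_fields}
        inner = {k: v for k, v in kwargs.items() if k not in own_fields}
        if inner:
            # Unknown kwargs (e.g. enable_itir_transforms) go to the
            # wrapped step instead of raising a TypeError.
            own["step"] = dataclasses.replace(self.step, **inner)
        return dataclasses.replace(self, **own)
```

With this in place, `CachedStep(...).replace(enable_itir_transforms=False)` reconfigures the wrapped translation step instead of failing.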
#### performance
Cold-start performance is within run-to-run fluctuation of the no-cache baseline; warm-start performance is well outside it.
The ratio of compiling / translating new programs vs. rerunning existing ones is quite high in the test suite. So file caching should not be expected to help much from a cold start.
The gain from warm starting is roughly 12 % ((127 − 111) / 127 ≈ 12.6 %), which is pretty good for this scenario, where there are lots of small programs.