# PulpImport/Export Memory Investigation

## Problem statement

* Large datasets are serialized in memory before being written to disk.
* `json.dump` cannot be tricked into looping over iterators (in any sane way).
* django-import-export's `export()` returns a list in memory instead of a lazy iterator.
* **This is a real problem "in the wild"** with actual Red Hat data and users - we need a short-term solution, ASAP.

## Proposed solutions

### Export in chunks

1. Write `"["` to a tempfile
2. Call `export(...).json.encode(...)` on chunks of our queryset (the result is a string containing a JSON list, starting/ending with `"["`/`"]"`)
3. Remove the first and last character from the exported string
4. Write the string to the tempfile
5. If chunks remain, write `","`
6. Repeat from 2.
7. Write `"]"` to the tempfile
8. Add the tempfile to the tar

#### Advantages

* Contained in one place (`pulpcore.app.importexport._write_export()`)
* We own "all" the code involved
* Doesn't try to "take advantage of" current implementation details of lower-level libs
* Never more than one resource's worth of tempfile at a time on disk
* Never more than one "chunk" in memory at a time ("how much" depends on the batch size we use to work with the queryset as we export)
* Doesn't change the export file format - so 'import' doesn't need to know that we've done anything
* We continue to call the libraries as intended

#### Disadvantages

* `post_export()` is called per-batch, instead of once per entire export set
  * this may not be a problem
  * if it is - we already have it in low-memory situations
  * we have tests!
* This is UGLY - much commentary is needed in the code to explain why we're doing this
* We do string operations on the output

### Create our own export_to_file in QueryModelResource

* This is not necessary for the fix, but a possible way to refactor the code towards the ultimate goal below.
* Would use the above approach, but encapsulate it inside `pulpcore.plugin.importexport.QueryModelResource`, which all PIE model-resources subclass from.
* Sets us up to make stream-to-file possible in the future, by changing this one method (instead of inline code somewhere else).
* `export_to_file` would take a `query_set`, a `file_stream`, the `format`, and maybe a `batch_size` parameter.

## Ultimate Goal

* Show django-import-export what we have to do to work through this problem
* Work with them to add a new d-i-e API that makes this possible
* Adopt the new d-i-e release, and remove all this ugly code all at once

###### tags: `import/export`