PulpImport/Export Memory Investigation

Problem statement

  • Large datasets are serialized entirely in memory before being written to disk.
  • json.dump cannot (in any sane way) be made to lazily consume an iterator.
  • django-import-export's export() returns the full result as an in-memory list instead of a lazy iterator.
  • This is a real problem "in the wild" with actual Red Hat data and users - we need a short-term solution, ASAP.
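The json.dump limitation above is easy to demonstrate: the stdlib encoder rejects generators outright rather than iterating them, so there is no clean way to stream a lazy queryset through it. A minimal illustration:

```python
import json

# json.dumps/json.dump refuse to iterate a generator, so a lazy
# queryset cannot be streamed through the stdlib encoder directly.
gen = (i for i in range(3))
try:
    json.dumps(gen)
except TypeError as err:
    print(err)  # "Object of type generator is not JSON serializable"
```

The same TypeError occurs with json.dump to a file object, which is why the workaround below serializes one chunk at a time and stitches the results together by hand.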

Proposed solutions

Export in chunks

  1. Write "[" to a tempfile
  2. Call export(...).json.encode(...) on a chunk of our queryset (the result is a string containing a JSON list, starting with "[" and ending with "]")
  3. Remove the first and last character from the exported string
  4. Write the string to the tempfile
  5. If chunks remain, write ","
  6. Repeat from 2.
  7. Write "]" to the tempfile
  8. Add the tempfile to the tar
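The steps above can be sketched as follows. This is a minimal stand-alone illustration: `json.dumps(batch)` stands in for the real `resource.export(queryset_chunk).json` call from django-import-export, and `write_chunked_json` is a hypothetical name (the real home would be `pulpcore.app.importexport._write_export()`):

```python
import json


def write_chunked_json(batches, out_path):
    """Write batches of records as a single JSON list without ever
    holding the whole serialized payload in memory.

    `batches` is an iterable of record-lists; each batch is serialized
    independently, so peak memory is one batch's worth of string.
    """
    with open(out_path, "w") as f:
        f.write("[")                       # step 1: opening bracket
        wrote_any = False
        for batch in batches:
            chunk = json.dumps(batch)      # step 2: serialize one chunk -> "[...]"
            chunk = chunk[1:-1]            # step 3: strip the "[" and "]"
            if not chunk:
                continue                   # empty batch, nothing to write
            if wrote_any:
                f.write(",")               # step 5: separator between chunks
            f.write(chunk)                 # step 4: write the chunk body
            wrote_any = True
        f.write("]")                       # step 7: closing bracket
```

Because the chunks are concatenated with "," separators inside one pair of brackets, the result parses identically to a single monolithic `json.dump` of the whole dataset, which is why the import side needs no changes.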

Advantages:

  • contained in one place (pulpcore.app.importexport._write_export())
  • we own "all" the code involved
  • doesn't try to "take advantage of" current implementation details of lower level libs
  • never more than one resource's worth of tempfile on disk at a time
  • never more than one "chunk" in memory at a time (how much depends on the batch-size we use to walk the queryset as we export)
  • this change doesn't alter the export-file format - so 'import' doesn't need to know we've done anything
  • We continue to call the libraries as intended

Disadvantages:

  • post_export() is called per-batch, instead of once per entire export-set
    • this may not be a problem
    • if it is - we already have it in existing low-memory situations
    • we have tests!
  • this is UGLY - much commentary is needed in the code to explain why we're doing it
  • We do string operations on the serialized output

Create our own export_to_file in QueryModelResource

  • This is not necessary for the fix, but is a possible way to refactor the code toward the ultimate goal below.
  • Would use the above approach, but encapsulated inside pulpcore.plugin.importexport.QueryModelResource, which all PIE model-resources subclass.
  • Sets us up to do stream-to-file in the future by changing this one method (instead of inline code somewhere else)
  • export_to_file would take a query_set, a file_stream, the format, and maybe a batch_size parameter.
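A rough sketch of what that method could look like. Everything here is hypothetical: the real QueryModelResource lives in pulpcore.plugin.importexport, the slice-based batching is an assumption (Django QuerySets slice lazily, translating to LIMIT/OFFSET), and `dehydrate` is an invented stand-in for the field mapping django-import-export resources actually perform:

```python
import json


class QueryModelResource:
    """Hypothetical sketch of the proposed export_to_file() hook."""

    def export_to_file(self, query_set, file_stream, fmt="json", batch_size=100):
        """Stream query_set to file_stream as one JSON list, one batch
        at a time, following the doc's proposed signature."""
        assert fmt == "json"  # this sketch only handles JSON
        file_stream.write("[")
        wrote_any = False
        start = 0
        while True:
            # Slicing a QuerySet is lazy in Django; for this sketch a
            # plain list behaves the same way.
            batch = query_set[start : start + batch_size]
            if not batch:
                break
            chunk = json.dumps([self.dehydrate(obj) for obj in batch])[1:-1]
            if chunk:
                if wrote_any:
                    file_stream.write(",")
                file_stream.write(chunk)
                wrote_any = True
            start += batch_size
        file_stream.write("]")

    def dehydrate(self, obj):
        # Invented placeholder for django-import-export's per-field
        # export mapping; real subclasses would override this.
        return obj
```

Centralizing the loop here means a future streaming-capable d-i-e release only requires rewriting this one method, leaving every plugin's model-resource subclass untouched.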

Ultimate Goal

  • show the django-import-export maintainers what we have to do to work around this problem
  • work w/ them to add a new d-i-e API that makes streaming export possible
  • adopt the new d-i-e release, and remove all this ugly code at once
tags: import/export