> To: https://discuss.python.org/c/ideas/6
# Disk space minimization for Python distributors
We found out that Python as distributed in Fedora is larger than it could be. This is not a problem in general use, but in some container environments, people want their system as minimal as possible.
Currently, for every ``.py`` source file, we ship the corresponding ``.pyc`` bytecode caches. We ship these for all 3 optimization levels (none, `-O` and `-OO`).
This means that:
- For regular users, import is as fast as possible, no matter the optimization level they choose.
- For the superuser (root), Python does not create files in system locations. (These files would be complicated to track, verify and clean up on uninstall.)
- Python does not *attempt* to write files to system locations. Such attempts are indistinguishable from malicious software written in Python attempting to inject executable files, and security software (e.g. SELinux) can flag them as such.
- All files of the standard library are installed and tracked by the package manager, so their integrity can be verified with standard tools.
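To make the three cache files per module concrete, here is a small sketch that uses the standard `importlib.util.cache_from_source` helper to list them. The module path is a made-up example, not a real Fedora location.

```python
import importlib.util
import sys


def cache_paths(source_path):
    """Return the cache file paths for optimization levels none, -O and -OO."""
    # optimization="" selects the non-optimized cache; 1 and 2 map to -O and -OO.
    return [
        importlib.util.cache_from_source(source_path, optimization=level)
        for level in ("", 1, 2)
    ]


if __name__ == "__main__":
    # Hypothetical module path, used only for illustration.
    print("cache tag:", sys.implementation.cache_tag)
    for path in cache_paths("/usr/lib64/python3/example.py"):
        print(path)
```

Each source file therefore has up to three companions in ``__pycache__``: one plain ``.pyc``, one ``.opt-1.pyc`` and one ``.opt-2.pyc``.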
However, for minimal environments, shipping 3 `.pyc` files and the source for each Python module is undesirable.
We're looking for a way to cut the size down for space-minded people, while preserving functionality and security for everyone else.
Is anyone else having similar issues?
If possible, we would like to standardize how downstream distributors with similar goals can ship Python libraries (starting with stdlib), so we don't have to all reinvent the wheel, rely on hacks and/or break users' expectations.
One more restriction we have is that we'd like any minimal environment to be a *subset*, file-wise, of a more complete one. Installing additional files (or removing unneeded ones) is much easier to deal with than having several different sets of files.
Options
=======
We see two possible approaches to fixing the issues.
Option 1: Shipping only ``.py`` files, and disabling creation of ``.pyc`` files
-------------------------------------------------------------------------------
Python programs will run fine from only ``.py`` files, so shipping ``.pyc`` bytecode files is not necessary.
Programs will take a bit longer to start, but in our testing the slowdown is acceptable.
For example, importing ``importlib.py`` takes on average 0.025s longer on our machines compared to the ``.pyc`` bytecode file.
The problem is that when Python imports a ``.py`` source file without finding a corresponding ``.pyc`` in ``__pycache__``, it will try to create it. This is undesirable for the reasons listed above. Also, under the superuser account, these files are actually created and the disk footprint starts to grow, defeating the purpose of minimization.
To remedy this, distributors could mark the ``__pycache__`` directories as write-protected.
When Python loads a ``.py`` file, it would look for a marker file called ``__pycache__/__dont_write_bytecode__`` and, if found, it would skip writing the ``.pyc`` file.
Using a file as a marker would make it easy to configure, manage and verify this mechanism with package managers (and other tools that aren't Python-specific).
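As a rough sketch of the check described above (not an existing CPython feature), the lookup could be as simple as the following. The function name `bytecode_writes_allowed` is hypothetical; only the marker file name comes from this proposal.

```python
import os
import tempfile
from pathlib import Path

MARKER_NAME = "__dont_write_bytecode__"  # marker name taken from the proposal


def bytecode_writes_allowed(source_path):
    """Return False if the proposed marker sits in the module's __pycache__.

    Hypothetical helper: the import system does not perform this check today.
    """
    marker = Path(source_path).parent / "__pycache__" / MARKER_NAME
    return not marker.exists()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        source = os.path.join(directory, "some_file.py")
        Path(source).touch()
        print(bytecode_writes_allowed(source))  # no marker yet
        cache_dir = Path(directory) / "__pycache__"
        cache_dir.mkdir()
        (cache_dir / MARKER_NAME).touch()
        print(bytecode_writes_allowed(source))  # marker present
```

A real implementation would live in the import machinery and would probably cache the result per directory. The sketch only shows that the decision needs a single path check.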
The directory structure could look like this:

    project/
    ├── some_file.py
    └── __pycache__
        └── __dont_write_bytecode__

The ``__dont_write_bytecode__`` marker would only prevent *creating and updating* the `.pyc` files.
If they already exist, they would be checked and used as usual.
This means that for normal installations, we would ship ``.pyc`` files along with the ``__dont_write_bytecode__`` marker.
We could also let the user choose which optimization levels to install.
Option 2: Shipping the non-optimized ``.pyc`` files and compressed ``.py`` source files
---------------------------------------------------------------------------------------
While option 1 has its advantages, it suffers from somewhat slower start times and needs a new kind of marker.
Alternatively, we can ship non-optimized ``.pyc`` bytecode files *instead* of ``.py`` source files.
The non-optimized ``.pyc`` files would be placed where the ``.py`` source files would have been (i.e. outside of ``__pycache__``: files inside ``__pycache__`` are checked only if a corresponding ``.py`` file exists in the directory above).
To save space further, we would not ship the *optimized* ``.pyc`` files (``.opt-1.pyc`` and ``.opt-2.pyc``).
This would mean that users running Python with optimizations (``-O``, ``-OO``, ``$PYTHONOPTIMIZE``) would get non-optimized library modules.
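We believe that this would not have an adverse impact: the optimizations are rather superficial (``-O`` removes ``assert`` statements and code conditional on ``__debug__``; ``-OO`` additionally strips docstrings).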
If desired, we could devise a mechanism for CPython to handle the relevant bytecode parts (docstrings, `__debug__`) properly when running in either mode (optimized or non-optimized).
We can start shipping only ``.pyc`` files right now without any changes to Python.
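The sourceless import that makes this possible can be tried out with the standard library alone. The sketch below compiles a throwaway module, writes the ``.pyc`` next to where the source was, deletes the source and imports the result. The helper name and the module name in the usage part are made up for illustration.

```python
import importlib
import os
import py_compile
import sys
import tempfile


def import_without_source(module_name, source_text):
    """Compile source_text to <module_name>.pyc, delete the .py, then import it."""
    # The temporary directory is left on disk; this is only a demonstration.
    directory = tempfile.mkdtemp()
    source_path = os.path.join(directory, module_name + ".py")
    with open(source_path, "w", encoding="utf-8") as f:
        f.write(source_text)
    # Write the bytecode where the source would be, not into __pycache__.
    bytecode_path = os.path.join(directory, module_name + ".pyc")
    py_compile.compile(source_path, cfile=bytecode_path, doraise=True)
    os.remove(source_path)
    sys.path.insert(0, directory)
    importlib.invalidate_caches()
    try:
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(directory)


if __name__ == "__main__":
    module = import_without_source("sourceless_demo", "VALUE = 42\n")
    print(module.VALUE, module.__file__)
```

The imported module's ``__file__`` points at the ``.pyc`` file, which is one of the reasons tools that expect a ``.py`` path get confused.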
However, this would be problematic because Python tools generally assume the ``.py`` source files are available.
One prominent example is Python tracebacks, which need the source to display the contents of each source line, which is useful for debugging.
To fix this, we can add a new optional ``__pysources__`` directory, which would hold the source files.
Python would load the ``.pyc`` files for execution, but when it needs the source, it would look in ``__pysources__``.
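A minimal sketch of that lookup, assuming the source keeps its original file name inside ``__pysources__``, could look like this. The function `find_source` is hypothetical and is not part of any existing API.

```python
import tempfile
from pathlib import Path

SOURCES_DIR = "__pysources__"  # directory name taken from the proposal


def find_source(pyc_path):
    """Return the source path for a sourceless .pyc, or None if not installed.

    Hypothetical helper illustrating the proposed lookup rule.
    """
    pyc = Path(pyc_path)
    candidate = pyc.parent / SOURCES_DIR / (pyc.stem + ".py")
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        pyc = Path(directory) / "some_file.pyc"
        pyc.touch()
        print(find_source(pyc))  # minimal system: no sources installed
        sources = Path(directory) / SOURCES_DIR
        sources.mkdir()
        (sources / "some_file.py").write_text("print('hello')\n")
        print(find_source(pyc))  # full system: source found
```

Returning ``None`` when the directory is missing matches the minimal case: code still runs, and tools that want the source simply do not get it.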
The directory structure could look like this:

    project/
    ├── some_file.pyc
    └── __pysources__
        └── some_file.py

Minimal systems would not have ``__pysources__`` installed at all, while others get almost all benefits of having the sources available.
On normal systems, ``__pysources__`` would be installed, so the system would behave as it does now.
Importing could even become a bit faster, as importlib would only need to stat the ``.pyc`` file to run it (compared to 2 files today: the ``*.py`` *and* the corresponding ``__pycache__/*.pyc``).
A caveat is that Python wouldn't pick up modifications to the sources. Users would need to copy the source outside ``__pysources__`` if they wished to edit installed libraries.
We think that the ``__pysources__`` directory is a sufficiently strong signal to make them research the situation.
For more space savings (and a stronger “don't edit” signal), the sources in ``__pysources__`` could be compressed.
Since this mechanism is intended for distributors, there would be no problems with compression libraries being optional: only distributions with `zlib` would compress the files.
Size impact
===========
We calculated the size impact on Fedora's `python3-libs` RPM package, which contains most of the standard library, but omits `tkinter` (and ``IDLE``, ``turtle``, ``turtledemo``) for dependency reasons, and ``test`` (and test suites of other modules) for size reasons. The omitted parts can be installed from other packages. Such a split is fairly typical in Linux distributions.
The exact numbers will vary between distributions and Python versions, but the following table should be representative.
Option | Size | Difference (MiB) | Difference (%)
------ | ---- | ----------------- | --------------
Status quo | 31.76 MiB
Shipping ``.py`` and non-optimized ``.pyc`` | 22.8 MiB | -8.9 MiB | -28%
**Option 1** Shipping only ``.py``, disabling creation of ``.pyc`` | 15.2 MiB | -16.5 MiB | -52%
**Option 2** Shipping non-optimized ``.pyc`` and zip-compressed ``.py`` | 17.1 MiB | -14.6 MiB | -46%
**Minimal 2** Shipping non-optimized ``.pyc`` only | 13.5 MiB | -18.3 MiB | -57%
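For anyone who wants to get comparable numbers for their own distribution, the following sketch sums file sizes per kind under a directory tree. It is not the script used for the table above: it counts apparent file sizes rather than packaged or on-disk sizes, so results will differ somewhat from package manager figures.

```python
import os
import sysconfig
from collections import Counter


def classify(filename):
    """Map a file name to one of the kinds discussed in this post."""
    if filename.endswith(".opt-1.pyc"):
        return "opt-1 .pyc"
    if filename.endswith(".opt-2.pyc"):
        return "opt-2 .pyc"
    if filename.endswith(".pyc"):
        return "non-optimized .pyc"
    if filename.endswith(".py"):
        return ".py source"
    return "other"


def size_by_kind(root):
    """Return a Counter mapping each kind to its total size in bytes."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so files are not counted twice
            totals[classify(name)] += os.path.getsize(path)
    return totals


if __name__ == "__main__":
    # Measures the standard library of the running interpreter.
    for kind, size in sorted(size_by_kind(sysconfig.get_path("stdlib")).items()):
        print(f"{kind:20} {size / 2**20:8.2f} MiB")
```

Comparing the ``.py source`` total with the three ``.pyc`` totals gives a quick estimate of what each option would save on a given installation.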
Other ideas
===========
We have a [much more thorough brainstorming document](https://github.com/hroncok/python-minimization/blob/2020-02/document.md).
One idea we already implemented is that large auto-generated files (`pydoc_data` and several `encodings` modules) are shipped as `.pyc` only, without source, since the source is not very informative and differences in optimization levels are negligible.