Introduction
Syncing content into Pulp3 is a time-consuming process. Fortunately, one can specify that content-artifacts be retrieved in parallel, in order to minimize time by maximizing bandwidth utilization. This paralellism is accomplished by specifying the download_concurrency
value on a given Remote.
Pulp3's default value for download_concurrency
is 20. This can put an undue burden on the server the remote is retrieving content from - sometimes to the point of causing such a server to begin throttling (or outright failing)Pulp3's requests.
Rather than blindly changing that default, we would like to have some idea of the performance impact of various settings of download_concurrency
, as noted in Redmine issue 7212
This document describes a set of tests, their environment, and some results, in search of that impact.
Methodology
- Pick a repository of significant size and complexity
- Sync its content (using sync-immediate) into a 'clean' Pulp3 instance, at
download_concurrency
values of 20, 15, 10, 7, 5, and 3
- For each level, repeat the sync at least three times to average out general-internet-nondeterminism
- Find the average sync-time for a given concurrency level.
Environment
- Hardware
- Intel Core i7-6700 (4Ghzx8)
- 32Gb memory
- 1TB Western Digital 7200RPM drive
- Google Fibre network connection, ~900Mbps up/down
- local network has frequent high internet use (Twitch streaming, video, high-bandwidth gaming, etc.)
- system under test was not dedicated to the test process, and was running other loads
- Software
- Fedora 31
- tests run on the pulplift vagrant box, pulp3-source-fedora31
- pulpcore and pulp_rpm master as of 2020-07-25
Caveats
- The higher the concurrency, the heavier the load on the Pulp3 instance
- Connection-bandwidth of the Pulp3 instance is a limiting factor no matter how high concurrency is set
- Connection-bandwidth of the source server is a limiting factor no matter how high concurrency is set
- Internet connectivity is subject to arbitrary changes
Results
Decreasing concurrency by a factor of four (from twenty to five) did not quite double the sync-time (10:39 to 18:15).
Reducing it to three results in nearly tripling the sync-time (to 29:12)
Runs were executed on 2020-07-25 and 07-26:
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
(Horizontal axis ticks are 20/15/10/7/5/3 concurrency)
As can be seen from the chart, while reducing concurrency does impact performance, there is a distinct knee below download_concurrency=5
Conclusion
In the context of this specific test, on this specific weekend, with this specific hardware - it would seem that a concurrency of seven or five, while having an impact on sync-time, still results in acceptable results. Going below five starts to have a disproprtionate effect.
It would therefore appear to be useful to reduce Pulp3's default concurrency to, but not lower than, five, based on these results.
Further Experiments
- Repeat the experiment using
on_demand
instead of immediate
- Monitor the load-average of the Pulp3 instance under various concurrency levels
- Monitor the Pulp3 instance's memory usage under different concurrency levels
- Monitor the performance of postgresql under different concurrency levels (although the load of moving the actual bits across the network is likely to swamp that metric)
- Test the same sync against different source sites (what impact does latency have? source-bandwidth limitiations?)
- Test when syncing onto an SSD instead of HDD
Raw Data
concurrency,avg duration,duration,start,finish
3,,0:29:07,2020-07-25 16:48:11,2020-07-25 17:17:18
3,,0:27:32,2020-07-25 17:21:37,2020-07-25 17:49:09
3,0:29:12,0:30:56,2020-07-25 19:00:45,2020-07-25 19:31:41
5,,0:18:29,2020-07-24 20:45:03,2020-07-24 21:03:32
5,,0:17:47,2020-07-24 21:41:11,2020-07-24 21:58:59
5,0:18:15,0:18:29,2020-07-25 13:07:23,2020-07-25 13:25:52
7,,0:15:00,2020-07-25 23:15:42,2020-07-25 23:30:42
7,,0:15:36,2020-07-26 0:45:38,2020-07-26 1:01:14
7,0:15:30,0:15:53,2020-07-26 3:01:19,2020-07-26 3:17:12
10,,0:11:56,2020-07-25 14:38:27,2020-07-25 14:50:23
10,,0:15:00,2020-07-25 15:01:04,2020-07-25 15:16:04
10,0:13:33,0:13:43,2020-07-25 15:53:02,2020-07-25 16:06:45
15,,0:12:46,2020-07-25 21:43:20,2020-07-25 21:56:06
15,,0:12:44,2020-07-25 22:20:47,2020-07-25 22:33:31
15,0:12:31,0:12:04,2020-07-25 22:37:57,2020-07-25 22:50:01
20,,0:10:36,2020-07-24 18:51:45,2020-07-24 19:02:21
20,,0:10:58,2020-07-24 20:14:53,2020-07-24 20:25:50
20,0:10:39,0:10:23,2020-07-24 20:27:51,2020-07-24 20:38:14