
Pulp3 Concurrency impact on sync-performance

Introduction

Syncing content into Pulp3 is a time-consuming process. Fortunately, content-artifacts can be retrieved in parallel, which shortens the sync by making fuller use of the available bandwidth. This parallelism is controlled by the download_concurrency value on a given Remote.
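As an illustration of setting that value, the sketch below builds (but does not send) a PATCH request against a Remote's REST endpoint. The host name and the Remote's href are hypothetical placeholders, and authentication is left out; treat this as one possible way to change the setting, not as the procedure used for these tests.

```python
import json
import urllib.request


def build_concurrency_update(base_url, remote_href, concurrency):
    """Build (without sending) a PATCH request that sets download_concurrency."""
    if concurrency < 1:
        raise ValueError("download_concurrency must be a positive integer")
    body = json.dumps({"download_concurrency": concurrency}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + remote_href,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


# Hypothetical host and Remote href, for illustration only.
request = build_concurrency_update(
    "http://pulp.example.com",
    "/pulp/api/v3/remotes/rpm/rpm/00000000-0000-0000-0000-000000000000/",
    5,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request (for example with urllib.request.urlopen, plus credentials) is deliberately omitted, so the sketch can be run without a Pulp3 instance.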

Pulp3's default value for download_concurrency is 20. This can put an undue burden on the server the remote is retrieving content from, sometimes to the point that the server begins throttling (or outright failing) Pulp3's requests.

Rather than blindly changing that default, we would like to have some idea of the performance impact of various settings of download_concurrency, as noted in Redmine issue 7212.

This document describes a set of tests, their environment, and some results, in search of that impact.

Methodology

  • Pick a repository of significant size and complexity
  • Sync its content (using sync-immediate) into a 'clean' Pulp3 instance, at download_concurrency values of 20, 15, 10, 7, 5, and 3
  • For each level, repeat the sync at least three times to average out general-internet-nondeterminism
  • Find the average sync-time for a given concurrency level.
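The averaging step can be sketched as follows, using the three recorded runs at download_concurrency=20 from the Raw Data section. The helper names are illustrative, and rounding the mean to the nearest second is an assumption about how the averages were produced.

```python
from datetime import timedelta
from typing import Sequence


def parse_duration(text: str) -> timedelta:
    """Parse an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def average_duration(durations: Sequence[str]) -> timedelta:
    """Mean of several 'H:MM:SS' durations, rounded to the nearest second."""
    total = sum((parse_duration(d) for d in durations), timedelta())
    return timedelta(seconds=round(total.total_seconds() / len(durations)))


runs_at_20 = ["0:10:36", "0:10:58", "0:10:23"]
print(average_duration(runs_at_20))  # 0:10:39
```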

Environment

  • Hardware
    • Intel Core i7-6700 (4GHz x 8)
    • 32GB memory
    • 1TB Western Digital 7200RPM drive
    • Google Fiber network connection, ~900Mbps up/down
    • local network has frequent high internet use (Twitch streaming, video, high-bandwidth gaming, etc.)
    • system under test was not dedicated to the test process, and was running other loads
  • Software
    • Fedora 31
    • tests run on the pulplift vagrant box, pulp3-source-fedora31
    • pulpcore and pulp_rpm master as of 2020-07-25

Caveats

  • The higher the concurrency, the heavier the load on the Pulp3 instance
  • Connection-bandwidth of the Pulp3 instance is a limiting factor no matter how high concurrency is set
  • Connection-bandwidth of the source server is a limiting factor no matter how high concurrency is set
  • Internet connectivity is subject to arbitrary changes

Results

Decreasing concurrency by a factor of four (from twenty to five) did not quite double the sync-time (10:39 to 18:15).

Reducing it to three nearly tripled the sync-time (to 29:12).
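The ratios behind "did not quite double" and "nearly tripled" can be checked with a few lines of arithmetic on the averages quoted above; the helper name is illustrative.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' string to whole seconds."""
    minutes, seconds = (int(part) for part in mmss.split(":"))
    return minutes * 60 + seconds


baseline = to_seconds("10:39")  # average at download_concurrency=20
ratios = {}
for concurrency, average in [(5, "18:15"), (3, "29:12")]:
    ratios[concurrency] = to_seconds(average) / baseline
    print(f"concurrency {concurrency}: {ratios[concurrency]:.2f}x the sync-time at 20")
```

This gives roughly 1.71x at a concurrency of five and 2.74x at three.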

Runs were executed from 2020-07-24 through 2020-07-26:

[Chart: average sync-time by download_concurrency]

(Horizontal axis ticks are 20/15/10/7/5/3 concurrency)

As can be seen from the chart, while reducing concurrency does impact performance at every step, there is a distinct knee below download_concurrency=5.
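One way to put a number on that knee is to compute, for each pair of adjacent levels, how many seconds of sync-time are added per unit of concurrency removed. This metric is an illustrative choice, not part of the original analysis; the averages are the ones reported in this document.

```python
# Average sync-times from this document, as (download_concurrency, "M:SS").
AVERAGES = [
    (20, "10:39"), (15, "12:31"), (10, "13:33"),
    (7, "15:30"), (5, "18:15"), (3, "29:12"),
]


def to_seconds(mmss):
    """Convert an 'M:SS' string to whole seconds."""
    minutes, seconds = (int(part) for part in mmss.split(":"))
    return minutes * 60 + seconds


def marginal_costs(averages):
    """Seconds added per unit of concurrency removed, for each adjacent pair."""
    costs = []
    for (high, high_avg), (low, low_avg) in zip(averages, averages[1:]):
        added = to_seconds(low_avg) - to_seconds(high_avg)
        costs.append((high, low, added / (high - low)))
    return costs


costs = marginal_costs(AVERAGES)
for high, low, cost in costs:
    print(f"{high:>2} -> {low:>2}: {cost:6.1f} s per unit of concurrency")
```

On these averages the last step (5 to 3) costs 328.5 seconds per unit, roughly four times the 82.5 seconds per unit of the step before it (7 to 5).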

Conclusion

In the context of this specific test, on this specific weekend, with this specific hardware, it would seem that a concurrency of seven or five, while having an impact on sync-time, still yields acceptable sync-times. Going below five starts to have a disproportionate effect.

It would therefore appear to be useful to reduce Pulp3's default concurrency to, but not lower than, five, based on these results.

Further Experiments

  • Repeat the experiment using on_demand instead of immediate
  • Monitor the load-average of the Pulp3 instance under various concurrency levels
  • Monitor the Pulp3 instance's memory usage under different concurrency levels
  • Monitor the performance of postgresql under different concurrency levels (although the load of moving the actual bits across the network is likely to swamp that metric)
  • Test the same sync against different source sites (what impact does latency have? source-bandwidth limitations?)
  • Test when syncing onto an SSD instead of HDD

Raw Data

concurrency,avg duration,duration,start,finish
3,,0:29:07,2020-07-25 16:48:11,2020-07-25 17:17:18
3,,0:27:32,2020-07-25 17:21:37,2020-07-25 17:49:09
3,0:29:12,0:30:56,2020-07-25 19:00:45,2020-07-25 19:31:41
5,,0:18:29,2020-07-24 20:45:03,2020-07-24 21:03:32
5,,0:17:47,2020-07-24 21:41:11,2020-07-24 21:58:59
5,0:18:15,0:18:29,2020-07-25 13:07:23,2020-07-25 13:25:52
7,,0:15:00,2020-07-25 23:15:42,2020-07-25 23:30:42
7,,0:15:36,2020-07-26 0:45:38,2020-07-26 1:01:14
7,0:15:30,0:15:53,2020-07-26 3:01:19,2020-07-26 3:17:12
10,,0:11:56,2020-07-25 14:38:27,2020-07-25 14:50:23
10,,0:15:00,2020-07-25 15:01:04,2020-07-25 15:16:04
10,0:13:33,0:13:43,2020-07-25 15:53:02,2020-07-25 16:06:45
15,,0:12:46,2020-07-25 21:43:20,2020-07-25 21:56:06
15,,0:12:44,2020-07-25 22:20:47,2020-07-25 22:33:31
15,0:12:31,0:12:04,2020-07-25 22:37:57,2020-07-25 22:50:01
20,,0:10:36,2020-07-24 18:51:45,2020-07-24 19:02:21
20,,0:10:58,2020-07-24 20:14:53,2020-07-24 20:25:50
20,0:10:39,0:10:23,2020-07-24 20:27:51,2020-07-24 20:38:14
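As a consistency check on the table above, the sketch below recomputes each run's duration from its start and finish timestamps and lists any rows where the recorded duration differs. The parsing helpers are illustrative.

```python
import csv
import io
from datetime import datetime, timedelta

RAW = """concurrency,avg duration,duration,start,finish
3,,0:29:07,2020-07-25 16:48:11,2020-07-25 17:17:18
3,,0:27:32,2020-07-25 17:21:37,2020-07-25 17:49:09
3,0:29:12,0:30:56,2020-07-25 19:00:45,2020-07-25 19:31:41
5,,0:18:29,2020-07-24 20:45:03,2020-07-24 21:03:32
5,,0:17:47,2020-07-24 21:41:11,2020-07-24 21:58:59
5,0:18:15,0:18:29,2020-07-25 13:07:23,2020-07-25 13:25:52
7,,0:15:00,2020-07-25 23:15:42,2020-07-25 23:30:42
7,,0:15:36,2020-07-26 0:45:38,2020-07-26 1:01:14
7,0:15:30,0:15:53,2020-07-26 3:01:19,2020-07-26 3:17:12
10,,0:11:56,2020-07-25 14:38:27,2020-07-25 14:50:23
10,,0:15:00,2020-07-25 15:01:04,2020-07-25 15:16:04
10,0:13:33,0:13:43,2020-07-25 15:53:02,2020-07-25 16:06:45
15,,0:12:46,2020-07-25 21:43:20,2020-07-25 21:56:06
15,,0:12:44,2020-07-25 22:20:47,2020-07-25 22:33:31
15,0:12:31,0:12:04,2020-07-25 22:37:57,2020-07-25 22:50:01
20,,0:10:36,2020-07-24 18:51:45,2020-07-24 19:02:21
20,,0:10:58,2020-07-24 20:14:53,2020-07-24 20:25:50
20,0:10:39,0:10:23,2020-07-24 20:27:51,2020-07-24 20:38:14
"""


def parse_duration(text):
    """Parse an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def parse_timestamp(text):
    """Parse a 'YYYY-MM-DD H:MM:SS' timestamp as written in the table."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


mismatches = []
for row in csv.DictReader(io.StringIO(RAW)):
    recorded = parse_duration(row["duration"])
    elapsed = parse_timestamp(row["finish"]) - parse_timestamp(row["start"])
    if elapsed != recorded:
        mismatches.append((int(row["concurrency"]), row["start"], recorded, elapsed))

for concurrency, start, recorded, elapsed in mismatches:
    print(f"concurrency {concurrency}, start {start}: recorded {recorded}, elapsed {elapsed}")
```

Two rows differ from their timestamps by one second each, which is plausibly sub-second rounding in the original measurements (that explanation is a guess); the other rows agree exactly.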