---
title: Multiprocessing/Multithreading in Python
tags: Parallel Computing
---
# Multiprocessing/Multithreading in Python
:::info
### TL;DR
- **Multithreading** - Use it when the code you want to speed up is IO bound, e.g. reading/writing files on the local machine, or sending requests to and awaiting responses from a server.
    - Use Python's `threading` module
    - For Python 3.2+, use the `concurrent.futures.ThreadPoolExecutor` class
- **Multiprocessing** - Use it when the code you want to speed up is CPU bound, e.g. running a computationally heavy algorithm.
    - Use Python's `multiprocessing` module
    - For Python 3.2+, use the `concurrent.futures.ProcessPoolExecutor` class
- If you use multithreading for CPU-bound code, it does not speed anything up, due to Python's GIL: only one thread executes at a time, so your code runs concurrently but not in parallel.
:::
Python is a great programming language for crunching data and automating repetitive tasks. Got a few gigs of web server logs to process or a million images that need resizing? No problem! You can almost always find a helpful Python library that makes the job easy.
But while Python makes coding fun, it’s not always the quickest to run. By default, Python programs execute as a single process using a single CPU. If you have a computer made in the last decade, there’s a good chance it has 4 (or more) CPU cores. That means that 75% or more of your computer’s power is sitting there nearly idle while you are waiting for your program to finish running!
Let’s learn how to take advantage of the full processing power of your computer by running Python functions in parallel.
## Concurrency and Parallelism in Python: Threading Example
Threading is one of the most well-known approaches to achieving concurrency and parallelism in Python. Threading is a feature usually provided by the operating system. Threads are lighter than processes, and share the same memory space.

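As a minimal sketch (separate from the download example below), threads created with the `threading` module run in the same memory space and can mutate shared objects:
```python
import threading

results = []  # shared state: every thread sees the same list

def worker(name):
    results.append('hello from {}'.format(name))

threads = [threading.Thread(target=worker, args=('thread-{}'.format(i),)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # block until each thread has finished
print(results)
```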
#### Multithreading Example: Download images
The following script fetches a list of image links and downloads them. It contains three functions:
- `get_links`
- `download_link`
- `setup_download_dir`
Imgur’s API requires HTTP requests to bear the `Authorization` header with the client ID, which you can find in the dashboard of the application you have registered on Imgur. The response is JSON encoded, so we can use Python’s standard `json` library to decode it.
```python
import json
import logging
import os
from pathlib import Path
from urllib.request import urlopen, Request

logger = logging.getLogger(__name__)

types = {'image/jpeg', 'image/png'}


def get_links(client_id):
    headers = {'Authorization': 'Client-ID {}'.format(client_id)}
    req = Request('https://api.imgur.com/3/gallery/random/random/', headers=headers, method='GET')
    with urlopen(req) as resp:
        data = json.loads(resp.read().decode('utf-8'))
    return [item['link'] for item in data['data'] if 'type' in item and item['type'] in types]


def download_link(directory, link):
    download_path = directory / os.path.basename(link)
    with urlopen(link) as image, download_path.open('wb') as f:
        f.write(image.read())
    logger.info('Downloaded %s', link)


def setup_download_dir():
    download_dir = Path('images')
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir
```
Next, we will write a module, named `single.py`, that uses these functions to download the images one by one.
```python
import logging
import os
from time import time

from download import setup_download_dir, get_links, download_link

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = get_links(client_id)
    for link in links:
        download_link(download_dir, link)
    logging.info('Took %s seconds', time() - ts)


if __name__ == '__main__':
    main()
```
On a certain machine, this script took 19.4 seconds to download 91 images. Please do note that these numbers may vary based on the network you are on. 19.4 seconds isn’t terribly long, but what if we wanted to download more pictures? Perhaps 900 images, instead of 90. With an average of 0.2 seconds per picture, 900 images would take approximately 3 minutes. For 9000 pictures it would take 30 minutes. The good news is that by introducing concurrency or parallelism, we can speed this up dramatically.
#### Applying Multithreading for Downloading Images
In this Python threading example, we will write a new module to replace `single.py`. This module will create a pool of eight threads, making a total of nine threads including the main thread. I chose eight worker threads because my computer has eight CPU cores and one worker thread per core seemed a good number for how many threads to run at once. In practice, this number is chosen much more carefully based on other factors, such as other applications and services running on the same machine.
```python
import logging
import os
import threading
from time import time

from download import setup_download_dir, get_links, download_link

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = get_links(client_id)
    download_threads = []
    for link in links:
        # Create one thread per link, keep a reference to it, and start it
        download_thread = threading.Thread(target=download_link, args=(download_dir, link))
        download_threads.append(download_thread)
        download_thread.start()
    for th in download_threads:
        th.join()  # join() blocks until the thread has finished
    logging.info('Took %s seconds', time() - ts)


if __name__ == '__main__':
    main()
```
Running this Python threading example script on the same machine used earlier results in a download time of 4.1 seconds! That’s 4.7 times faster than the previous example. While this is much faster, it is worth mentioning that **only one thread was executing at a time throughout this process due to the GIL**. Therefore, this code is concurrent but not parallel. The reason it is still faster is because this is an IO bound task. The processor is hardly doing any heavy computations while downloading these images, and the majority of the time is spent waiting for the network. This is why Python multithreading can provide a large speed increase. The processor can switch between the threads whenever one of them is ready to do some work. Using the threading module in Python or any other interpreted language with a GIL can actually result in reduced performance.
**If your code is performing a CPU bound task**, such as decompressing gzip files, using the threading module will result in a slower execution time. For CPU bound tasks and truly parallel execution, we can use the **multiprocessing** module.
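As a rough illustration of that difference (with a toy `count` function invented here, not part of the download example), the same pure-Python loop can be timed under a thread pool and a process pool:
```python
import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def count(n):
    # Pure-Python countdown: holds the GIL the whole time, so threads cannot overlap
    while n > 0:
        n -= 1

if __name__ == '__main__':
    work = [10_000_000] * 4
    for pool_cls in (ThreadPool, Pool):
        ts = time.time()
        with pool_cls(processes=4) as pool:
            pool.map(count, work)
        # On a multi-core machine, Pool should finish several times faster than ThreadPool
        print(pool_cls.__name__, round(time.time() - ts, 2), 'seconds')
```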
### Concurrency and Parallelism in Python: Spawning Multiple Processes

To use multiple processes, we create a multiprocessing Pool. With the `map` method it provides, we will pass the list of URLs to the pool, which in turn will spawn up to `number-of-CPU-cores` new processes and use each one to download the images in parallel. This is true parallelism, but it comes with a cost. The entire memory of the script is copied into each subprocess that is spawned. In this simple example, it isn’t a big deal, but it can easily become serious overhead for non-trivial programs.
```python
import logging
import multiprocessing
import os
from functools import partial
from multiprocessing.pool import Pool
from time import time

from download import setup_download_dir, get_links, download_link

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logging.getLogger('requests').setLevel(logging.CRITICAL)
logger = logging.getLogger(__name__)


def main():
    # One worker process per CPU core
    num_workers = multiprocessing.cpu_count()
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = get_links(client_id)
    download = partial(download_link, download_dir)
    with Pool(processes=num_workers) as p:
        p.map(download, links)
    logging.info('Took %s seconds', time() - ts)


if __name__ == '__main__':
    main()
```
**Notice**: The `map()` method takes only two arguments: the function to execute and an iterable supplying that function's single argument. Since `download_link` takes two arguments, we perform an additional step using `partial()`, which derives from a function with $X$ parameters a new function with fewer parameters, with fixed values set for the omitted ones. Example:
```python
from functools import partial

def multiply(x, y):
    return x * y

# create a new function that multiplies by 2
dbl = partial(multiply, 2)
print(dbl(4))  # prints 8
```
**Alternative**: A different approach to handling multiple arguments is to use the `starmap()` method instead of `map()`. It lets us pass a tuple of arguments for each call to the target function, as sketched below.
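A minimal sketch of the `starmap()` approach, reusing the `download_link`, `download_dir`, `links`, and `num_workers` names from the pool example above:
```python
from multiprocessing.pool import Pool

with Pool(processes=num_workers) as p:
    # starmap() unpacks each tuple into the positional arguments of
    # download_link, so no partial() wrapper is needed
    p.starmap(download_link, [(download_dir, link) for link in links])
```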
:::info
**How can we merge back results computed across multiple processes?**
For some use cases, such as splitting a very long text at punctuation marks (`. ? ! ;`) into sub-sentences and putting them through the target function for parallel computation, we may want to merge the results back in the original order to reconstruct the full text.
In that case, we can rely on `map()` returning results in the same order as its inputs, and obtain a flat array of results by:
```python
with multiprocessing.Pool(processes=num_workers) as pool:
    # map() preserves input order, so results line up with the inputs
    results = pool.map(download, links)
# If each call returns a list, flatten the list of lists into one list
full_results = list(itertools.chain.from_iterable(results))
```
:::
### Python Multithreading vs. Multiprocessing
If your code is IO bound, both multiprocessing and multithreading in Python will work for you. Multiprocessing is easier to drop in than threading but has a higher memory overhead. If your code is CPU bound, multiprocessing is most likely going to be the better choice—especially if the target machine has multiple cores or CPUs.
### Newer Addition - Python `concurrent.futures`
New since Python 3.2 is the `concurrent.futures` package. It provides yet another way to use concurrency and parallelism in Python.
#### `concurrent.futures.ThreadPoolExecutor`
Using a `concurrent.futures.ThreadPoolExecutor` makes the Python threading example much easier.
```python
import logging
import os
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from time import time

from download import setup_download_dir, get_links, download_link

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = get_links(client_id)
    # Placing the executor inside a with block calls the executor's shutdown()
    # method automatically, cleaning up the threads.
    # If max_workers is omitted, the executor picks a default based on the CPU
    # count (min(32, os.cpu_count() + 4) on Python 3.8+).
    with ThreadPoolExecutor() as executor:
        # Create a new partially applied function that stores the directory
        # argument. This lets download_link, which normally takes two
        # arguments, work with map(), which expects a one-argument function.
        fn = partial(download_link, download_dir)
        # Executes fn concurrently, using threads, on the links iterable. The
        # timeout is for the entire batch, not a single call, so downloading
        # all images must complete within 30 seconds. Wrapping the lazy map()
        # iterator in list() forces it to run and surfaces worker exceptions.
        list(executor.map(fn, links, timeout=30))
    logging.info('Took %s seconds', time() - ts)


if __name__ == '__main__':
    main()
```
#### `concurrent.futures.ProcessPoolExecutor`
[This link](https://medium.com/@ageitgey/quick-tip-speed-up-your-python-data-processing-scripts-with-process-pools-cf275350163a) gives a very good example of using `concurrent.futures.ProcessPoolExecutor` with additional explanations.
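For reference, a minimal sketch of the same pattern, with a hypothetical CPU-bound `fib` function that is not taken from the linked article:
```python
from concurrent.futures import ProcessPoolExecutor

def fib(n):
    # Deliberately slow recursive Fibonacci, standing in for a CPU-bound task
    return n if n < 2 else fib(n - 1) + fib(n - 2)

if __name__ == '__main__':
    # Worker processes import this module, so the __main__ guard is required
    with ProcessPoolExecutor() as executor:
        # map() distributes inputs across processes and yields results in input order
        for n, result in zip(range(28, 33), executor.map(fib, range(28, 33))):
            print('fib({}) = {}'.format(n, result))
```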
### References
* https://www.pyimagesearch.com/2019/09/09/multiprocessing-with-opencv-and-python/
* https://www.journaldev.com/15631/python-multiprocessing-example
* https://docs.python.org/2/library/multiprocessing.html

Good blogs to start:
* https://medium.com/towards-artificial-intelligence/the-why-when-and-how-of-using-python-multi-threading-and-multi-processing-afd1b8a8ecca
* https://medium.com/@urban_institute/using-multiprocessing-to-make-python-code-faster-23ea5ef996ba

FastAI parallel utility function:
* https://medium.com/@robertbracco1/how-to-do-fastai-even-faster-bebf9693c99a

Python Multithreading and Multiprocessing Tutorial:
* https://www.toptal.com/python/beginners-guide-to-concurrency-and-parallelism-in-python

Quick Tip: Speed up your Python data processing scripts with Process Pools:
* https://medium.com/@ageitgey/quick-tip-speed-up-your-python-data-processing-scripts-with-process-pools-cf275350163a

When to use Multiprocessing over Multithreading?
* https://stackoverflow.com/questions/24744739/multi-threading-in-python-is-it-really-performance-effiicient-most-of-the-time
* https://stackoverflow.com/questions/46045956/whats-the-difference-between-threadpool-vs-pool-in-the-multiprocessing-module