PyGrab Library Documentation

PyGrab builds on top of the requests library and adds features such as asynchronous requests, JavaScript support, local file access, file downloads, and Tor routing. It is fully interoperable with requests and uses virtually identical syntax for the shared functions, methods, and classes.

pygrab Module

This module implements the primary developer interface for pygrab.

`pygrab.get()`

Parameters

url (str): The URL or IP address to get.
retries (int, optional): The number of times to retry the request if it fails. Defaults to 5.
enable_js (bool, optional): Enable JavaScript on the request. Defaults to False.
*args: Variable length argument list passed to requests.get.
**kwargs: Arbitrary keyword arguments passed to requests.get.

Returns

requests.Response: The response from the server.

Raises

TypeError: If any of the arguments are not of the desired data type.
ValueError: If the URL or IP address is invalid. Use get_local() for local requests.
RuntimeError: If an error occurs during runtime, such as a connection error.

Notes

Set enable_js to True for sites that rely on JavaScript to render their content.
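
The retries parameter can be pictured as a simple retry loop. The following is only a sketch of that idea, not PyGrab's actual implementation: call_with_retries, flaky_request, and the delay argument are hypothetical names, and it assumes that retries counts the total number of attempts and that a final failure surfaces as a RuntimeError.

```python
import time


def call_with_retries(func, retries=5, delay=0.0):
    """Call func() until it succeeds or `retries` attempts have failed.

    Assumption: `retries` is treated as the total number of attempts.
    """
    if not isinstance(retries, int) or isinstance(retries, bool):
        raise TypeError("retries must be an int")
    if retries < 1:
        raise ValueError("retries must be at least 1")

    last_error = None
    for _ in range(retries):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try again
            time.sleep(delay)
    # Every attempt failed: report it as a runtime error.
    raise RuntimeError(f"gave up after {retries} attempts") from last_error


# Demo with a stand-in "request" that fails twice before succeeding.
attempts = []


def flaky_request():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky_request, retries=5)
print(result, len(attempts))
```

In this sketch the third attempt succeeds, so the remaining two attempts are never used.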

`pygrab.get_async()`

Description
Gets multiple URLs asynchronously.
This function sends HTTP requests to a list of URLs in separate threads, allowing the requests to run concurrently, and returns the responses for the grabbed URLs. A warning is printed to the console for each request that fails with a connection error.

Parameters

urls (list): A list of URLs to grab.
retries (int, optional): The number of times to retry the HTTP request in case of failure. Defaults to 5.
thread_limit (int, optional): The maximum number of threads that will be spawned at one time. Defaults to 800.
time_rest (int, optional): The time in seconds to wait between starting each thread. Defaults to 0.
*args: Variable length argument list to pass to the get function.
**kwargs: Arbitrary keyword arguments to pass to the get function.

Returns

dict: A dictionary of responses with the grabbed URLs as keys and their respective responses as values.

Raises

TypeError: If any of the arguments are not of the desired data type.

Notes

This function removes duplicate entries from the urls list to prevent accidental DoS attacks.
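
The behavior described above (duplicates removed, a cap on threads, an optional pause between thread starts, and a dictionary keyed by URL) can be sketched with the standard library. This is an illustration under stated assumptions, not PyGrab's code: fetch_all and fake_fetch are hypothetical names, and the fetch argument stands in for whatever function performs the real request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, thread_limit=800, time_rest=0):
    """Fetch each unique URL in its own worker thread and map URL -> result."""
    if not isinstance(urls, list):
        raise TypeError("urls must be a list")

    # dict.fromkeys drops repeated URLs while keeping the original order.
    unique_urls = list(dict.fromkeys(urls))
    if not unique_urls:
        return {}

    responses = {}
    workers = max(1, min(thread_limit, len(unique_urls)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {}
        for url in unique_urls:
            futures[url] = pool.submit(fetch, url)
            if time_rest:
                time.sleep(time_rest)  # pause between starting requests
        for url, future in futures.items():
            try:
                responses[url] = future.result()
            except ConnectionError as exc:
                # Failed requests produce a warning instead of an entry.
                print(f"Warning: could not fetch {url}: {exc}")
    return responses


def fake_fetch(url):
    """Stand-in for a real HTTP call so the sketch runs offline."""
    if "down" in url:
        raise ConnectionError("unreachable")
    return f"response for {url}"


demo = fetch_all(
    [
        "https://a.example",
        "https://b.example",
        "https://a.example",  # duplicate, fetched only once
        "https://down.example",
    ],
    fake_fetch,
)
```

Whether a failed URL is omitted from the result or mapped to a placeholder is a design choice; this sketch omits it.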

`pygrab.get_local()`

Description
Reads the contents of a file and returns it to the user.

Parameters

filename (str): The file to read from.
local_read_type (str, optional): The read type, 'r' or 'rb' for example. Defaults to 'r'.
encoding (str, optional): Encoding, 'utf-8' or 'ascii' for example. Defaults to 'utf-8'.

Returns

data: The contents of the file as a string.

Raises

TypeError: If any of the arguments are not of the desired data type.
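
As a rough picture of how the local_read_type and encoding parameters interact, the sketch below reads a file with Python's built-in open(). It is not PyGrab's implementation: read_local is a hypothetical name, and skipping the encoding in binary mode is an assumption made here because open() rejects an encoding when the mode contains 'b'.

```python
import os
import tempfile


def read_local(filename, local_read_type="r", encoding="utf-8"):
    """Return the contents of a file: str in text mode, bytes in binary mode."""
    for value in (filename, local_read_type, encoding):
        if not isinstance(value, str):
            raise TypeError("all arguments must be strings")

    if "b" in local_read_type:
        # Binary modes do not accept an encoding argument.
        with open(filename, local_read_type) as handle:
            return handle.read()
    with open(filename, local_read_type, encoding=encoding) as handle:
        return handle.read()


# Demo: write a small file, then read it back in text and binary mode.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "page.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("<p>caf\u00e9</p>")
    as_text = read_local(path)
    as_bytes = read_local(path, local_read_type="rb")
```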

`pygrab.download()`

Description
Downloads a file from a given URL and saves it locally.
This function retrieves a file from a specified URL and saves it to a local directory. The file will be saved with the filename from the URL if no local filename is specified.

Parameters

url (str): The URL of the file to be downloaded. Must include a file extension.
local_filename (str, recommended): The name to be used when saving the file locally. If none is provided, the function uses the filename from the URL. Must include a file extension if provided.
retries (int, optional): The number of retry attempts for the download in case of failure. Defaults to 5.

Returns

None

Raises

TypeError: If any of the arguments are not of the desired data type.
ValueError: If 'local_filename' is specified but does not contain a file extension.

Notes

If local_filename is not specified, the file is saved under the name it has in the URL.
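
One way to derive a default local filename from a URL is to take the last segment of the URL path. The sketch below shows that approach with the standard library; filename_from_url is a hypothetical helper, and PyGrab may derive the name differently.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, requiring a file extension."""
    if not isinstance(url, str):
        raise TypeError("url must be a string")

    # Use only the path, so query strings and fragments are ignored.
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path)
    if not posixpath.splitext(name)[1]:
        raise ValueError(f"no file extension found in {url!r}")
    return name


name = filename_from_url("https://example.com/files/annual%20report.pdf?download=1")
print(name)
```

Passing an explicit local_filename avoids depending on how the remote server names its files, which is presumably why the parameter is marked as recommended.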

`pygrab.download_async()`

Description
Executes multiple file downloads asynchronously from a list of given URLs and saves them locally.
This function uses threading to download multiple files simultaneously. Each file is saved with a filename from the list of local filenames, if provided. If no local filename is provided, the function uses the filename from the corresponding URL.

Parameters

urls (list of str): The URLs of the files to be downloaded. Each URL must include a file extension.
local_filenames (list of str, recommended): A list of names to be used when saving the files locally. If none is provided, the function uses the filename from each corresponding URL. Each filename must include a file extension if provided. Must be of same length as 'urls' if provided. Defaults to None.
retries (int, optional): The number of retry attempts for the downloads in case of failure. Defaults to 5.
thread_limit (int, optional): The maximum number of threads that will be spawned. Defaults to 500.
time_rest (int, optional): The amount of time to rest between the start of each download thread. Defaults to 0 seconds.

Returns

None

Raises

TypeError: If any of the arguments are not of the desired data type.
ValueError: If 'local_filenames' is specified but one of its entries does not contain a file extension.
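
The rules for urls and local_filenames (same length, every name has an extension, names fall back to the URL when omitted) can be checked before any download starts. The following sketch only illustrates those rules; plan_downloads is a hypothetical helper and is not part of PyGrab.

```python
import posixpath
from urllib.parse import urlparse


def plan_downloads(urls, local_filenames=None):
    """Pair each URL with the local filename it would be saved under."""
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise TypeError("urls must be a list of strings")

    if local_filenames is None:
        # Fall back to the last path segment of each URL.
        local_filenames = [posixpath.basename(urlparse(u).path) for u in urls]
    elif not isinstance(local_filenames, list) or not all(
        isinstance(n, str) for n in local_filenames
    ):
        raise TypeError("local_filenames must be a list of strings")

    if len(local_filenames) != len(urls):
        raise ValueError("local_filenames must be the same length as urls")
    for name in local_filenames:
        if not posixpath.splitext(name)[1]:
            raise ValueError(f"{name!r} has no file extension")

    return list(zip(urls, local_filenames))


demo_urls = ["https://example.com/a.png", "https://example.com/docs/b.pdf"]
default_plan = plan_downloads(demo_urls)
named_plan = plan_downloads(demo_urls, ["first.png", "second.pdf"])
```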

`pygrab.head()`

Description
Behaves like requests.head(), with the added ability to route the request through the Tor network.

`pygrab.post()`

Description
Behaves like requests.post(), with the added ability to route the request through the Tor network.

`pygrab.post_local()`

Description
Writes data to a local file.
This function is used to write or append data to a local file. It can be used in various scenarios such as saving request data, logging, or other local storage needs.

Parameters

filepath (str): The path to the file where the data will be written. If the file does not exist, it will be created.
data (str): The data that will be written to the file.
local_save_type (str, optional): The mode in which the file is opened. Defaults to 'w' (write mode), and can also be set to 'a' (append mode) or any other valid file mode.
encoding (str, optional): The encoding to be used when opening the file. Defaults to 'utf-8'.

Returns

None
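
The difference between the 'w' and 'a' values of local_save_type can be seen in a short sketch built on Python's open(). This is an assumption-labeled illustration rather than PyGrab's implementation, and write_local is a hypothetical name.

```python
import os
import tempfile


def write_local(filepath, data, local_save_type="w", encoding="utf-8"):
    """Write data to a file; 'w' overwrites the file and 'a' appends to it."""
    for value in (filepath, data, local_save_type, encoding):
        if not isinstance(value, str):
            raise TypeError("all arguments must be strings")
    # The file is created if it does not exist yet.
    with open(filepath, local_save_type, encoding=encoding) as handle:
        handle.write(data)


# Demo: create a log file, then append a second line to it.
with tempfile.TemporaryDirectory() as folder:
    log_path = os.path.join(folder, "requests.log")
    write_local(log_path, "first\n")
    write_local(log_path, "second\n", local_save_type="a")
    with open(log_path, encoding="utf-8") as handle:
        contents = handle.read()
```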

`pygrab.put()`

Description
Behaves like requests.put(), with the added ability to route the request through the Tor network.

`pygrab.patch()`

Description
Behaves like requests.patch(), with the added ability to route the request through the Tor network.

`pygrab.delete()`

Description
Behaves like requests.delete(), with the added ability to route the request through the Tor network.

`pygrab.options()`

Description
Behaves like requests.options(), with the added ability to route the request through the Tor network.

`pygrab.tor_status()`

Description
Returns True if the Tor service is running and False otherwise.

Parameters

None

Returns

bool: True if the Tor service is running and False otherwise.

`pygrab.display_tor_status()`

Description
Prints information about your Tor connection: whether the Tor service is enabled, plus your public IP address, country, region, and city.

Parameters

None

Returns

None

`pygrab.rotate_tor()`

Description
Starts the Tor service if it isn't already running, then configures the library to reconnect to the Tor network after every num_req_per_rotation requests. Each reconnection rotates your IP address for greater anonymity.

Parameters

num_req_per_rotation (int): The maximum number of requests that can be sent before a rotation.

Returns

None
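
The idea of rotating after a fixed number of requests can be modeled with a small counter. This sketch shows only the bookkeeping and does not touch Tor at all: RotationCounter, record_request, and the rotate callback are hypothetical names, and PyGrab's internal logic may differ.

```python
class RotationCounter:
    """Invoke a rotate callback after every `num_req_per_rotation` requests."""

    def __init__(self, num_req_per_rotation, rotate):
        if not isinstance(num_req_per_rotation, int):
            raise TypeError("num_req_per_rotation must be an int")
        if num_req_per_rotation < 1:
            raise ValueError("num_req_per_rotation must be at least 1")
        self.num_req_per_rotation = num_req_per_rotation
        self.rotate = rotate
        self.sent = 0  # requests sent since the last rotation

    def record_request(self):
        """Count one request and rotate once the limit is reached."""
        self.sent += 1
        if self.sent >= self.num_req_per_rotation:
            self.rotate()
            self.sent = 0


# Demo: with a limit of 3, seven requests trigger two rotations.
rotations = []
counter = RotationCounter(3, lambda: rotations.append("new connection"))
for _ in range(7):
    counter.record_request()
print(len(rotations), counter.sent)
```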

`pygrab.end_rotate_tor()`

Description
Stops rotating Tor connections. Note that this does not end the Tor service itself (run pygrab.Tor.end_tor() to do that).

Parameters

None

Returns

None

`pygrab.warn_settings()`

Description
Changes the warning configuration for the entire library. All warnings are shown by default; calling pygrab.warn_settings(False) turns off all warnings.

Parameters

warn (bool): Boolean value to enable or disable warnings.

Returns

None

Tor Object

This object implements the primary developer interface for connecting pygrab to the Tor network.

`pygrab.Tor.start_tor()`

Description
Starts the Tor service and configures all requests to be routed through the Tor network.

Parameters

verbose (int, optional): Defaults to 0. If set to 1, the logs from starting the Tor service are displayed.
force_start (bool, optional): Defaults to False. With the default, the program will crash if a service is already running on 127.0.0.1:9050. If that service is an instance of Tor, set force_start to True and the program will run as normal.

Returns

None

Notes

If force_start is set to True, the program will crash if the service running on 127.0.0.1:9050 is not Tor.
Even if the service running on 127.0.0.1:9050 is Tor, some functionality may be lost (such as rotating Tor connections). It is recommended to have only one instance of Tor running at any one time.
If you get an error message stating that you are missing the tor.exe dependency, it is probably because you are. If you're on Windows, download it from "https://www.torproject.org/download/tor/", then copy the path to the .tar.gz file and enter it when prompted by pygrab.Tor.start_tor(). If you're on Linux, simply run sudo apt-get install tor.
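
Before deciding whether force_start is needed, it can help to check whether anything is already listening on 127.0.0.1:9050. The sketch below does that with a plain socket; port_in_use is a hypothetical helper that is not part of PyGrab, and a True result only means that some service holds the port, not that the service is Tor.

```python
import socket


def port_in_use(host="127.0.0.1", port=9050, timeout=1.0):
    """Return True if a TCP service accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


# Demo on a throwaway local port so the sketch does not depend on Tor.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

busy = port_in_use(port=demo_port)   # something is listening
listener.close()
free = port_in_use(port=demo_port)   # nothing is listening any more
```

Calling port_in_use() with its defaults checks the address named in the notes above.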

`pygrab.Tor.end_tor()`

Description
Ends the Tor service.

Parameters

None

Returns

None

`pygrab.Tor.tor_status()`

Description
Returns True if the Tor status is active and False otherwise.

Parameters

None

Returns

bool: True if the Tor status is active and False otherwise.

`pygrab.Tor.override_status()`

Description
Returns True if a second instance of Tor has been overridden and False otherwise. Note that IP/Tor rotations are not available while an instance of Tor is overridden.

Parameters

None

Returns

bool: True if a second instance of Tor has been overridden and False otherwise.

`pygrab.Tor.load_tor_dependencies()`

Description
Loads the tor.exe dependency from the .tar.gz file that is downloaded from "https://www.torproject.org/download/tor/".

Parameters

filepath (str): The path to the .tar.gz file downloaded from Tor's website.

Returns

None

Notes

This function is for Windows only. If you're on Linux, run sudo apt-get install tor.
Running pygrab.Tor.start_tor() without the tor.exe dependency installed will prompt you for the path to the .tar.gz file. This method is merely an alternative to that.

Session Object

This module implements the primary developer interface for Session handling. These sessions also include all of the functionality provided by the main pygrab module with the exception of IP/Tor rotations (coming soon!).

`s_obj = pygrab.Session()`

Description
Constructor for the pygrab.Session object.

Parameters

use_tor (bool, optional): If True, the Tor service is enabled for the session. Defaults to True if the Tor service is running and False if it isn't.

Returns

pygrab.Session: The new session object.

Notes

Setting use_tor to True starts the Tor service if it isn't already running.
Setting use_tor to False prevents the session from using the Tor network, even if the Tor service is already running.

`s_obj.start_tor()`

Description
Enables use of the tor service for the session object. If the tor service is not running, this method will start the service.

Parameters

None

Returns

None

`s_obj.end_tor()`

Description
Disables use of the tor service for the session object. However, this method does not end the tor service itself (run pygrab.Tor.end_tor() to do that).

Parameters

None

Returns

None

`s_obj.get()`

Parameters

url (str): The URL to get.
enable_js (bool, optional): Enable JavaScript on the request. Defaults to False.
**kwargs: Arbitrary keyword arguments passed to requests.get.

Returns

requests.Response: The response from the server.

Raises

TypeError: If any of the arguments are not of the desired data type.
ValueError: If the URL doesn't start with http. Use get_local() for local requests.
Exception: If the URL is invalid.

Notes

Set enable_js to True for sites that rely on JavaScript to render their content.

`s_obj.get_async()`

Parameters

urls (list): A list of URLs to grab.
thread_limit (int, optional): The maximum number of threads that will be spawned at one time. Defaults to 800.
time_rest (int, optional): The time in seconds to wait between starting each thread. Defaults to 0.
*args: Variable length argument list to pass to the get function.
**kwargs: Arbitrary keyword arguments to pass to the get function.

Returns

dict: A dictionary of responses with the grabbed URLs as keys and their respective responses as values.

Raises

TypeError: If any of the arguments are not of the desired data type.

Notes

This function removes duplicate entries from the urls list to prevent accidental DoS attacks.

`s_obj.get_local()`

Description
Reads the contents of a file and returns it to the user.

Parameters

filename (str): The file to read from.
local_read_type (str, optional): The read type, 'r' or 'rb' for example. Defaults to 'r'.
encoding (str, optional): Encoding, 'utf-8' or 'ascii' for example. Defaults to 'utf-8'.

Returns

data: The contents of the file as a string.

Raises

TypeError: If any of the arguments are not of the desired data type.

`s_obj.head()`

Description
Behaves like requests.Session().head(), with the added ability to route the request through the Tor network.

`s_obj.post()`

Description
Behaves like requests.Session().post(), with the added ability to route the request through the Tor network.

`s_obj.post_local()`

Parameters

filepath (str): The path to the file where the data will be written. If the file does not exist, it will be created.
data (str): The data that will be written to the file.
local_save_type (str, optional): The mode in which the file is opened. Defaults to 'w' (write mode), and can also be set to 'a' (append mode) or any other valid file mode.
encoding (str, optional): The encoding to be used when opening the file. Defaults to 'utf-8'.

Returns

None

`s_obj.put()`

Description
Behaves like requests.Session().put(), with the added ability to route the request through the Tor network.

`s_obj.patch()`

Description
Behaves like requests.Session().patch(), with the added ability to route the request through the Tor network.

`s_obj.delete()`

Description
Behaves like requests.Session().delete(), with the added ability to route the request through the Tor network.

`s_obj.options()`

Description
Behaves like requests.Session().options(), with the added ability to route the request through the Tor network.

`s_obj.download()`

Parameters

url (str): The URL of the file to be downloaded. Must include a file extension.
local_filename (str, recommended): The name to be used when saving the file locally. If none is provided, the function uses the filename from the URL. Must include a file extension if provided.

Returns

None

Raises

TypeError: If any of the arguments are not of the desired data type.
ValueError: If 'local_filename' is specified but does not contain a file extension.

Notes

If local_filename is not specified, the file is saved under the name it has in the URL.

`s_obj.download_async()`

Parameters

urls (list of str): The URLs of the files to be downloaded. Each URL must include a file extension.
local_filenames (list of str, recommended): A list of names to be used when saving the files locally. If none is provided, the function uses the filename from each corresponding URL. Each filename must include a file extension if provided. Must be of the same length as 'urls' if provided. Defaults to None.
thread_limit (int, optional): The maximum number of threads that will be spawned. Defaults to 500.
time_rest (int, optional): The amount of time to rest between the start of each download thread. Defaults to 0 seconds.

Returns

None

Raises

TypeError: If any of the arguments are not of the desired data type.

`s_obj.tor_status()`

Description
Returns True if Tor is enabled for the session object and False otherwise.

Parameters

None

Returns

bool: True if Tor is enabled for the session object and False otherwise.

`s_obj.display_tor_status()`

Parameters

None

Returns

None

Request Object

This object is a carbon copy of the requests.Request object.

Response Object

This object is a carbon copy of the requests.Response object.

Conclusion

The PyGrab library provides a comprehensive set of tools for web scraping, including session management, Tor integration, and IP rotation. With these tools, developers can efficiently extract data from websites while maintaining anonymity and flexibility. If this library was helpful to you, make my day and leave a star on the PyGrab GitHub repository!