# JS Fetcher

`JSFetcher` is a Python-based tool designed to fetch, save, and analyze JavaScript files from a list of provided URLs. It supports features such as proxy usage, custom headers, URL extraction, JavaScript beautification, and optional retrieval of source maps.

## Features

:::success
- **JavaScript Beautification**: Beautify fetched JavaScript files for easier analysis.
- **URL Extraction**: Extract additional `.js` or `.chunk.js` URLs from fetched content and process them recursively.
- **Source Map Retrieval**: Optionally fetch mapping files (`.js.map` and `.map`), both those referenced in JavaScript files and those found by appending `.map` to the JavaScript filename.
- **Output Directory**: Save fetched content to a specified directory, preserving the directory structure.
- **Multithreading**: Fetch multiple URLs concurrently using a specified number of threads.
- **Custom Headers**: Add custom headers to requests for authentication or other purposes.
- **Proxy Support**: Use a proxy for requests with validation checks.
- **Retry Mechanism**: Retry failed requests with exponential backoff.
:::

## Requirements

- Python 3.7+

## Download

The latest version of `JS Fetcher` can be downloaded from the following URL:

```bash
wget https://static.k0.lc/share/js_fetcher.py
```

## Installation & run

Install the required libraries, then display the help:

```bash
pip3 install docopt jsbeautifier coloredlogs verboselogs tldextract==3.2.0
python3 js_fetcher.py -h
```

## Usage

```html
Usage:
    js_fetcher.py [--scope <scope>] [--mapping-search] [--outdir <directory>] [--proxy <proxy>] [(-H <header>)...]
                  [--retry <num_retries>] [--threads <num_threads>] [--timeout <seconds>] [--follow-redirect]
                  [--only-request-js-files] [--only-save-js-files | --only-save-orig-js-files] [--no-save-embedded-js]
                  [--response-min-size <size>] [--disable-js-beautify] [--disable-js-embedded] [--disable-url-search]
                  [--disable-proxy-check] [--disable-recursion] [--disable-cpath-filter] [--headers-from-file <file>]
                  [--keep-minified-content] [-v | -d | -dd] -u <URL>

Program options:
    -u,--url <URL>                URL(s) to fetch, could be a filename, a string a comma separated string list or a list
    -s,--scope <scope>            Set domain(s)/subdomain(s) as a valid scope instead of root_url. Multi-format (like --url)
    -m,--mapping-search           Attempt to recover '.js.map' files for all '.js' files found
    -o,--outdir <directory>       Directory to save the responses
    -x,--proxy <proxy>            Proxy to use for requests
    -H,--headers <header>...      Custom headers for the requests
    -r,--retry <num_retries>      Number of retries on failure [default: 1]
    -t,--threads <num_threads>    Number of threads to use [default: 5]
    -T,--timeout <seconds>        Timeout for each request in seconds [default: 8]

Filter options:
    --only-request-js-files       Input Filter: Clean '--url' argument to keep and only request '.js' files
    --only-save-js-files          Output filter: Save only beautified '.js' and '.js.map' files in output directory
    --only-save-orig-js-files     Output filter: Save only original '.js' and '.js.map' files in output directory
    --no-save-embedded-js         Output filter: No save in separated file embedded JavaScript code found in script tag
    --response-min-size <size>    Output filter: Skip all URLs whose response size is shorter than this value [default: 50]

Misc options
    --follow-redirect             Follow redirection when request responds an HTTP/3xx redirection
    --disable-cpath-filter        Disables consecutive paths detection and filter mechanism (ex: '../assets/assets/..')
    --disable-js-beautify         Disables the JavaScript beautification mechanism applied to '.js' files
    --disable-js-embedded         Disables the
                                  embedded JavaScript code recovery mechanism applied to non- '.js' files
    --disable-url-search          Disables the '.js' and '.chunk.js' recovering mechanism applied to fetched URLs
    --disable-proxy-check         Disables proxy check mechanism when proxy is present
    --disable-recursion           Disables recursion, the program stops when the first-level URLs are retrieved
    --headers-from-file <file>    Get raw request headers from file
    --keep-minified-content       Keep a copy of original minified content (.minified.js) for further unmapping with '.map'

General options:
    -v, --verbose                 Enable verbose mode
    -d, --debug                   Show more details on what the program does under the hood
    -dd, --debug                  Print Debug level 2 (with all classes debug_class output)
    -V, --version                 Show version info
```

## More about supported arguments

### Arguments parsing

`JS Fetcher` lets you define some arguments in several ways:

- the `-u,--url` and `-s,--scope` arguments can be a filename, a string, a comma-separated string list or a list (when `JS Fetcher` is used as a library);
- `-H,--headers` can be specified multiple times (like `curl`);
- `stdin` (with `-`) is supported for all these arguments.

For example, to define one or several target URLs (`-u,--url`), each of the following commands is a valid way to do so:

```bash
js_fetcher -u /path/urls
js_fetcher -u http://www.example.com/app/
js_fetcher -u "http://target.tld.com/app/index, https://target.tld.com/file.js"
cat /path/urls | js_fetcher -u -
echo 'https://target.tld.com/file.js' | js_fetcher -u -
```

### Scope

By default, if the tool finds a complete URL that doesn't match the root URL, it marks it as out-of-scope (`OOS`). For example, if all the `.js` files on the home page of **www.target.com** are served from **assets.target.com**, these files will be rejected by default unless the argument `-s *.target.com` is passed to the program.
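The scope rule described above can be illustrated with a minimal sketch. It assumes that scope patterns behave like shell-style wildcards applied to the hostname; `in_scope` is a hypothetical helper written for this example, not a function of `JS Fetcher`, and the tool's own matching logic may differ.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit


def in_scope(url, root_url, scopes=()):
    """Hypothetical scope check (not JS Fetcher's actual code).

    Without scope patterns, only the host of the root URL is accepted.
    With scope patterns, the host must match at least one of them.
    """
    host = (urlsplit(url).hostname or "").lower()
    if not scopes:
        root_host = (urlsplit(root_url).hostname or "").lower()
        return host == root_host
    # Assumption: patterns such as "*.target.com" are shell-style wildcards.
    return any(fnmatch(host, pattern.strip().lower()) for pattern in scopes)


root = "https://www.target.com/"
asset = "https://assets.target.com/static/app.js"

print(in_scope(asset, root))                    # False: would be flagged as OOS
print(in_scope(asset, root, ["*.target.com"]))  # True: accepted by the wildcard scope
```

In this sketch a pattern like `*.target.com` does not match the bare `target.com` host, so both forms would need to be listed to cover it.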
### Filters

It is also possible to filter the program's inputs/outputs:

+ **Input filter** `--only-request-js-files`: Clean the `--url` argument to keep and request only `.js` files;
+ **Output filter** `--only-save-js-files`: Save only `.js` files (beautified by default) and `.js.map` files in the output directory;
+ **Output filter** `--only-save-orig-js-files`: Save only original `.js` files (not beautified) and `.js.map` files in the output directory.

### Internal JS beautify mechanism

By default, all fetched `.js` files are *beautified*, which makes it easier and more efficient for the program to search for patterns in minified files. The tool uses the Python version of the [https://github.com/beautifier/js-beautify](https://github.com/beautifier/js-beautify) library, configured with the following options:

```python
{
    "indent_size": 4,
    "indent_char": " ",
    "max_preserve_newlines": 2,
    "preserve_newlines": True,
    "keep_array_indentation": True,
    "break_chained_methods": True,
    "indent_scripts": "normal",
    "brace_style": "collapse",
    "space_before_conditional": False,
    "unescape_strings": True,
    "jslint_happy": True,
    "end_with_newline": True,
    "wrap_line_length": 200,
    "indent_inner_html": True,
    "comma_first": False,
    "e4x": False,
    "indent_empty_lines": False
}
```

If the main goal is simply to beautify JavaScript code, these parameters can be adjusted in the source code, in the method `JSFetcher.get_beautifier_config()`, as well as in the two global variables of the same class, `BEAUTIFIER_INDENT_SIZE` and `BEAUTIFIER_WRAP_LINE_LENGTH`.

:::warning
If you modify this configuration or disable the beautification mechanism with the `--disable-js-beautify` option, the program may no longer find additional URLs in `.js` files.
:::

These parameters were initially obtained from (and can be tested in) the online version of this library: [https://beautifier.io/](https://beautifier.io/)

## Practical examples

### Retrieve all JavaScript code of target(s)

Classic use of this tool.
It takes a list of URLs and tries to recover as much JavaScript code as possible, saving each recovered file. Beautification is automatically applied to each valid JavaScript file.

```bash
js_fetcher -u /tmp/urls.txt -s "*.target1.fr, *.target2.com" -r 2 -t 8 -T 10 -o /tmp/target-jscode/ --only-save-js-files --follow-redirect -d
```

### Just beautify JavaScript code of target(s)

If the goal is simply to beautify and save all the JavaScript files in a list of URLs, you can use the filters to keep only JavaScript code as input/output and, optionally, disable the search for additional `.js` files.

```bash
js_fetcher -u /tmp/urls.txt --only-request-js-files -t 8 -o /tmp/beautified-jscode/ --only-save-js-files --disable-url-search -v
```

### Get the mapping file of any fetched JavaScript URL

With the `-m,--mapping-search` option, the tool also searches for the mapping file (`.js.map`) associated with each fetched `.js` URL.

```bash
js_fetcher -u /tmp/urls.txt --only-request-js-files -t 8 -o /tmp/jscode-with-mapping/ -m -v
```

### Just replay a list of URLs through a proxy server (like Burp)

To replay a list of URLs through a proxy server, there's no need to waste time beautifying JavaScript code or searching for patterns to discover other URLs:

```bash
js_fetcher -u /tmp/urls.txt -r 2 -t 8 -T 10 --follow-redirect --disable-js-beautify --disable-url-search -x http://127.0.0.1:8080
```

## Changelog

### Version 2.5

:::spoiler Version improvements:
- Add `try_webpack_without_key` mode (global variable, True by default) to support a special case seen in `Nuxt.js`. Example: `"js/" + { 0: "6aaec45", ... } [e] + ".js"` => `js/6aaec45.js`
:::

### Version 2.4

:::spoiler Version improvements:
- Improve `REGEX_JS_URL_3` and `REGEX_END_BY_JS_OR_MAP` regexes to support mapping files ending with `.map` (instead of `.js.map`).
:::

### Version 2.3

:::spoiler Version improvements:
- Bug fix allowing URLs running on a port other than 80 or 443 (ex: `http://app.domain.com:3000/`) to be processed;
- Improve `REGEX_WEBPACK_2` regex to support a new loader format without a prefix folder and using a `.` (instead of `-`) as separator (encountered on an *AngularJS* frontend).
:::

### Version 2.2

:::spoiler Version improvements:
- Add new `REGEX_WEBPACK_3` regex to support a special webpack loader (seen on *boutique.orange.fr*);
- Improve `REGEX_JS_URL_2` regex to support `.js` URLs containing a cache-busting suffix (ex: `/assets/file.js?v=2`);
- Improve `REGEX_SCRIPT` regex and normalize *fake* `.embedded.js` URL(s).
:::

### Version 2.1

:::spoiler Version improvements:
- Important bug fix in `UTF-8` string encoding: the default `strict` mode "lost" part of the content of some files instead of raising an exception. All encode/decode calls are now set to `errors="replace"`;
- Added *fake URLs* `.embedded.js` for files containing embedded JavaScript code to the results displayed by the tool on `STDOUT` and in the log file.
:::

### Version 2.0

:::spoiler Version improvements:
- Add a `references` key to fetched `.map` files containing the URL of the JavaScript file(s) linked to this mapping file:
  ![image](https://hackmd.io/_uploads/S1N2w58Wye.png)
  This reference will be used later by `js_unmap` when unmapping the `index.js` file to find out where it came from:
  ![image](https://hackmd.io/_uploads/By49tcLbJe.png)
- Added a new `--headers-from-file <file>` option to retrieve all headers present in a raw request pasted into a file;
- Improve some details in logging: the program shows more info in debug mode `-d`, and `OOS` URLs now issue a warning and are displayed in all display modes;
- Add class comparison and hashing functions, and harmonize docstrings.
:::

### Version 1.9

:::spoiler Version improvements:
- Add support for files embedding JavaScript code (in `<script>` tags).
  Very useful when using the home page or `.html` files as a starting point. This behavior can be disabled with the new `--disable-js-embedded` option;
- JavaScript source code embedded in `<script>` tags (in files not ending with `.js`) is saved by default in a file with the extension `.embedded.js`. If you don't want to save this code, you can use the new `--no-save-embedded-js` option:
  ![image](https://hackmd.io/_uploads/Skb3wt5l1e.png)
- Add mapping coverage statistics, with two types of coverage percentages for `.map` files in the results URL list:
  ![image](https://hackmd.io/_uploads/rkvGKRtgJg.png)
  - Exact match coverage: percentage of files that have their exact corresponding mapping file;
  - Global coverage: percentage based on the total number of `.map` files versus other files.
- Internal doc, minor code refactoring and typo fix.
:::

### Version 1.8

:::spoiler Version improvements:
- Major bug fix in `REGEX_JS_URL_3` regex;
- Get additional `.js` files from the `file` key value in source map files, if present;
- Get additional `.js` files by removing `.map` in `.js.map` URLs;
- Remove useless filter in `URLExtractor.extract_*`;
- Improve some details in debug_class logging (`-dd`).
:::

### Version 1.7

:::spoiler Version improvements:
- Add new `--only-save-min-js-files` misc option to save the original minified content instead of its beautified version, for further unmapping with the related `js.map` file;
- Small refactoring of the `fetch_url()` method to deal with the new options;
- Parse discovered `.js.map` files to keep only valid `JSON` source map files;
- Format / pretty-print the content of discovered `.js.map` files before saving;
- Various minor bugs, internal doc and typo fix.
:::

### Version 1.6

:::spoiler Version improvements:
- Add new `--keep-minified-content` misc option to keep, in addition to the beautified `.js` file, a copy of the original minified content (`.minified.js`) for further unmapping with the related `js.map` file;
- Improve `REGEX_JS_URL_2` to avoid matching `.json` in addition to `.js`;
- Improve `REGEX_JS_URL_3` regex to detect (sourceMappingURL|sourceURL), according to the official documentation (https://tc39.es/source-map/#linking-generated-code);
- Add an additional security filter to skip invalid `.js` or `.js.map` matches from the general regex;
- Add Python `set` support for library mode;
- Improve exception handling in the proxy check method (now in the Tools class);
- Various minor bugs, internal doc and typo fix.
:::

### Version 1.5

:::spoiler Version improvements:
- Add `nextJS` support;
- Add embedded mapping files support (`sourceMappingURL=data...base64,xxx`);
- Add consecutive paths detection (ex: `x/v1/assets/v1/assets/x.js` => `x/v1/assets/x.js`);
- Add 2 new misc options:
  + `--disable-recursion`: Disables recursion, the program stops when the first-level URLs are retrieved;
  + `--disable-cpath-filter`: Disables consecutive paths detection and filter mechanism. Useful for debugging the program;
- Avoid beautifying the content of fake `HTTP/200` HTML error pages;
- Various bugs & typo fix.
:::

### Version 1.1

:::spoiler Version improvements:
- Improve relative paths (`../xxx`) management;
- Get URLs for `.js.map` files that differ from the `.js` filename;
- Various bugs & typo fix.
:::

### Version 1.0

Initial release.