GURL - HackMD

# GURL [GURL](https://source.chromium.org/chromium/chromium/src/+/main:url/gurl.h;l=47;drc=f5bdc89c7395ed24f1b8d196a3bdd6232d5bf771) is a library parsing URL. The word "GURL" is a short word of Google's URL. We need to be careful on handling url as it may consist of malicious scripts. Also, there are multiple types of urls such as about url, file scheme, chrome url... GURL library will validate urls and parse it in a proper way. Today, let's check the flow of url handling and the temrinology used in url. ## Overview You can pass the url text to GURL to call [constructor](https://source.chromium.org/chromium/chromium/src/+/main:url/gurl.cc;l=46;drc=f5bdc89c7395ed24f1b8d196a3bdd6232d5bf771). The passed utl is std::string_view/std::u16string_view, the string type without ownership. GURL stores the actual url text as `spec_`, and parsed url as `parsed_`. Also, `is_valid_` carries the information whether the url has something invalid. It also has `inner_url_` used for nested url schemes, but this is now only used from filesystem. Let's ignore thi for today's note. ## GURL initialization GRUL object is initialized by [this constructor](https://source.chromium.org/chromium/chromium/src/+/main:url/gurl.cc;l=46;drc=f5bdc89c7395ed24f1b8d196a3bdd6232d5bf771) with UTF8 url text. First, it creates [url::StdStringCanonOutput](https://source.chromium.org/chromium/chromium/src/+/main:url/url_canon_stdstring.h;l=37;drc=f5bdc89c7395ed24f1b8d196a3bdd6232d5bf771) object as a string handler for canonicalization. This does not own the string itself so the user of the object must guarantee that the string stays alive throughout the lifetime of this StdStringCanonOutput object. Then, it calculates `is_valid_` from passed url text `input_spec`. It uses url util library [Canonicalize](https://source.chromium.org/chromium/chromium/src/+/main:url/url_util.cc;l=768;drc=f5bdc89c7395ed24f1b8d196a3bdd6232d5bf771). The algorithm is basiaclly check all SchemeType properly. [Resolve](https://source.chromium.org/chromium/chromium/src/+/main:url/gurl.h;l=155;drc=f5bdc89c7395ed24f1b8d196a3bdd6232d5bf771) will parse the url via [ResolveRelative](https://source.chromium.org/chromium/chromium/src/+/main:url/url_util.cc;l=801;drc=f5bdc89c7395ed24f1b8d196a3bdd6232d5bf771). ## url::Parsed `spec_` is [url::Parsed](https://source.chromium.org/chromium/chromium/src/+/main:url/third_party/mozilla/url_parse.h;l=79;drc=f5bdc89c7395ed24f1b8d196a3bdd6232d5bf771) type. This is included in mozilla third party library The url is divided into several [Component](https://source.chromium.org/chromium/chromium/src/+/main:url/third_party/mozilla/url_parse.h;l=17;drc=f5bdc89c7395ed24f1b8d196a3bdd6232d5bf771). Component type consists of two int values, `begin` and `len`. As you can assume from the name, it only contains the index in the url text. If it's unspecified, the length is set to `-1`. There are 8 components: scheme, username, password, host, port, path, query and ref. `scheme` is like "http". The thing before `:`. This may be empty if the url looks like "foo.com". `username` is spceified with `@` sign before the host, and `password` is its password. "http://me:secret@host". `host` and `port` are just host and port. `path` contains anything comeafter the host or port with `/` dividier such as `index` from `someone.com/index`. `query` is the thing after `?`. By adding `?n=1000` after the path, you can set `n` value in the page as 1000. This is query. `ref` is anything after `#`. It's often used to select a section. Corresponding component can be obtained from GURL library via [ComponentString(const url::Component& comp)](https://source.chromium.org/chromium/chromium/src/+/main:url/gurl.h;l=466;drc=f5bdc89c7395ed24f1b8d196a3bdd6232d5bf771). ## Memo: URL types Current URL type list is following: ```cpp const char kAboutBlankURL[] = "about:blank"; const char kAboutSrcdocURL[] = "about:srcdoc"; const char kAboutBlankPath[] = "blank"; const char kAboutSrcdocPath[] = "srcdoc"; const char kAboutScheme[] = "about"; const char kBlobScheme[] = "blob"; const char kContentScheme[] = "content"; const char kContentIDScheme[] = "cid"; const char kDataScheme[] = "data"; const char kFileScheme[] = "file"; const char kFileSystemScheme[] = "filesystem"; const char kFtpScheme[] = "ftp"; const char kHttpScheme[] = "http"; const char kHttpsScheme[] = "https"; const char kJavaScriptScheme[] = "javascript"; const char kMailToScheme[] = "mailto"; const char kTelScheme[] = "tel"; const char kUrnScheme[] = "urn"; const char kUuidInPackageScheme[] = "uuid-in-package";; const char kWebcalScheme[] = "webcal"; const char kWsScheme[] = "ws"; const char kWssScheme[] = "wss"; ```