# Search PR

This PR adds support for search in odoc. It leaves the actual search to an external search engine, but exposes the information such an engine needs, and defines the communication between the search engine and odoc's pages. It is not polished (the HTML part and the tests need more love), but it is close, so I am opening the PR to start the review and discuss some of the design choices.

More precisely, here is what the PR is about:

1. The addition of an `Odoc_model.Fold` module, folding over the values of the types in `Lang` (this can also be useful to generate a glossary of values/modules/...).
2. The addition of a type, `Odoc_search.Entry.t`. A value of this type corresponds to an entry in a search index. This type is supposed to stay stable, so that it can be consumed by search indexes written in OCaml (while still allowing breaking modifications of the `Lang` module). Entries can be printed as (sometimes simplified) JSON objects.
3. The specification of the communication between search engines and odoc: the JSON file consumed by search engines (in `json_search.ml`) and the JSON output by the search engine (in `json_display.ml`).
4. The modification of doc-comments and mld pages, to have them include an ID on basic blocks of text (paragraphs, "leaf" list items, verbatim blocks, ...), so that they are searchable/linkable.
5. An option not to render links or HTML IDs/anchors, to avoid duplicated IDs or links inside links when rendering the search results.
6. On the generated HTML pages: the execution of the search engine "front-end code" as a webworker, and the display of the results (the code for this needs to be reworked a bit).

And, in terms of CLI:

- The addition of the `compile-index` command, which outputs a JSON file meant to be read for indexing by external search engines.
- The addition of a `--search-file <path>` option to `support-files`. This option copies the given file into the relevant directory.
- The addition of a `--search-file <name>` option to `html-generate`. This option activates the search elements on the generated pages (text input and result area), and loads the `name` file (which has to be installed with the corresponding option of `support-files`).

Finally, to be able to test:

- An update of the driver to provide client-side search, using the minisearch library.
- Some tests (to be improved).

You can view the result of running the driver directly here. (The minisearch library sometimes behaves strangely, but its use here is just an example, for reference.) And the result of the test here.

### TODOs/Improvements:

- [ ] Paragraphs and basic blocks of text have their ID disambiguated at the link stage (the `href` needs to be fully known after the link stage, for the index to have the right URL). However, headings still have their ID disambiguated at the document level. This needs to change so that the search index has correct URLs for headings.
- [ ] @EmileTrotignon has reported seeing some internal values being indexed in his tests. Need to find an example and fix it.
- [ ] @EmileTrotignon has reported many `Warning, resolved hidden path: Base__.Set_intf.Named.t` messages being output when indexing.
- [ ] The rendering of the search bar and results could be improved, and so could the code for it: since the situation is different, I think we should write a new generator that takes `Entry.t` and generates HTML, and which does not depend on `Odoc_html`.
- [ ] Starting the webworker only when the user clicks on the search bar might save a lot of computing resources.
- [ ] Make it so that the information in a JSON entry and in an `Odoc_search.Entry.t` is the same.
- [ ] Benchmark and add more tests.

### Separating the search engine from `odoc`

As hinted above, the actual search engine is supposed to be external to `odoc`.
It can generate its database from a JSON file (whose format is defined in the `src/search/json_search.ml` file), or by using `odoc` as a library. It should provide a JavaScript file that will be run as a webworker. This webworker should listen to "webworker messages", each message being a query (given as a plain string) to search for. The result of a query must be an array of JSON objects of the type defined in `src/search/json_display.ml`. An odoc-generated value for each entry is included in the input of the search engine, which is welcome to modify it or generate its own, for instance to highlight the reason a result was chosen.

As the search script runs in a webworker, the result of the search has to be sent back to the main thread via `postMessage`, and `odoc` will render the results.

This allows for different scenarios:

- Client-side search using a JS library (possibly compiled with jsoo). Good for search restricted to a single package. However, the whole search index and search logic is sent to the client on each page load. `dune build @doc` could use that.
- Server-side search. The `index.js` file simply makes a request to an API. This can search much bigger databases, using any search engine (elasticsearch, a custom OCaml-specific search engine, ...).

`odoc` could also depend on a blessed search engine (`sherlodoc`), to allow for more "out-of-the-box" search support.

#### Sherlodoc

Sherlodoc has already been modified by @EmileTrotignon to use the functionality added by this PR to generate the sherlodoc search index. It can be compiled to JavaScript using jsoo, to have client-side search on single libraries, or be used server-side for bigger indexes. Compared to the current web version (doc.sherlocode.com), the modified version also searches inside doc comments, and is not restricted to values (it can search modules, module types, constructors, ...). The corresponding branch is [jsoo-compat](https://github.com/art-w/sherlodoc/tree/jsoo-compat). Pinging also @art-w !
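To make the worker-side contract concrete, here is a minimal sketch of what a conforming search script could look like. `runQuery` and the in-memory index are hypothetical stand-ins for whatever the external engine actually does; only the message shapes (a plain-string query in, an array of display objects out) come from this PR.

```javascript
// Minimal sketch of the worker side of the protocol. `runQuery` and the
// index format are hypothetical stand-ins for an actual search engine;
// only the message shapes are fixed by odoc.

// Returns one display object per matching entry, in the shape defined
// in src/search/json_display.ml.
function runQuery(index, query) {
  // Placeholder ranking: a real engine would score entries against `query`.
  return index
    .filter((entry) => entry.doc.includes(query))
    .map((entry) => entry.display); // { url, html }, precomputed by odoc
}

const searchIndex = []; // in practice, built from the compile-index JSON

// Each incoming webworker message is a query, given as a plain string...
globalThis.onmessage = (e) => {
  // ...and the reply is an array of display objects, sent via postMessage
  // back to the main thread, where odoc renders them.
  postMessage(runQuery(searchIndex, String(e.data)));
};
```

Since the `display` values are odoc-generated, even this trivial engine can return results the page knows how to render.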
### Data to build the search index

The data to build the search index can be read from the JSON file generated by the `compile-index` command. However, the code is written so that it is also easy to use `odoc` as a library, working with values of type `Odoc_search.Entry.t` (which you get by combining `Odoc_model.Fold` with `Odoc_search.Entry.of_{value/type/...}`). Each has its advantages and use cases:

- The JSON format is readable by most programming languages and search engines. It is easily extensible with new fields without breaking the consumers of this format.
- The OCaml `Entry.t` format encodes slightly more information, and is more practical to use from OCaml programs.

The `Entry.t` type is defined in the `Odoc_search.Entry` module. The JSON format is defined in `Odoc_search.Json_search`. An example of a JSON entry would be:

```json
{
  "id": [
    { "kind": "Root", "name": "Main" },
    { "kind": "Value", "name": "lorem" }
  ],
  "doc": "lorem 1 and a link",
  "extra": { "kind": "Value", "type": "int" },
  "display": {
    "html": "<div class=\"search-entry val\"><code class=\"entry-title\"><span class=\"entry-kind\">val</span><span class=\"prefix-name\">Main.</span><span class=\"entry-name\">lorem</span><code class=\"entry-rhs\"> : int</code></code><div class=\"entry-comment\"><div><p>lorem 1 and a <span>link</span></p></div></div></div>",
    "url": "Main/index.html#val-lorem"
  }
}
```

It contains:

- A JSONification of the ID,
- A stringified version of the docstring,
- An `extra` field containing:
  - The kind of the entry,
  - All extra information for this kind of entry. For instance, for values, the type of the value is included (as a string).
  The set of extra information per entry kind is available in `src/search/json_output.ml`.
- A `display` field, corresponding to the JSON value to return when this entry is included in the search results:
  - The `url` of the entry (relative to the base of the compilation unit, which is known by the page executing the search),
  - An `html` field, containing the HTML used to display the entry in the search results.

I am not sure the intermediate representation of a search entry, `Odoc_search.Entry.t`, is a good design. I would be curious about your opinion!

### Running as a webworker

Running as a webworker is nice to isolate the search engine, and to prevent it from freezing the UI at any point.

Browsers' interpretation of the CORS origin policy prevents running webworkers from JavaScript files fetched over the `file://` protocol. I used a hack to work around this restriction.
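For reference, the main-thread side of this setup could be sketched as follows. This is not odoc's actual code: the file name, CSS selectors, and result markup are illustrative; what it shows is the flow described above (query goes to the worker via `postMessage`, an array of `{ url, html }` display objects comes back, and the page renders them).

```javascript
// Sketch of the page side of the protocol: send each query to the search
// worker and render the { url, html } display objects it posts back.
// All names (script file, selectors) are illustrative, not odoc's code.

// Turn one display object from the worker into an HTML fragment.
function resultHtml({ url, html }) {
  // `url` is relative to the base of the compilation unit,
  // which the page executing the search knows.
  return `<a class="search-result" href="${url}">${html}</a>`;
}

// Wire up the page (guarded so the sketch also loads outside a browser).
if (typeof document !== "undefined") {
  const worker = new Worker("index.js"); // the engine's webworker script
  const input = document.querySelector(".search-bar");
  const output = document.querySelector(".search-results");

  // Each keystroke posts the current query, as a plain string...
  input.addEventListener("input", () => worker.postMessage(input.value));
  // ...and each reply is an array of display objects to render.
  worker.onmessage = (e) => {
    output.innerHTML = e.data.map(resultHtml).join("");
  };
}
```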