# Papyri
**CLI**: `papyri/__init__.py`
## Meeting Notes: 24 Nov, 202
- Parsing is handled either by tree-sitter or by numpydoc: function docstrings are not parsed correctly by tree-sitter because numpydoc has its own syntax.
- I tried to design an AST that is generic enough to represent all the documentation. Where possible it avoids specifics of how the source was written, and it stays as simple as possible while still being able to represent the document.
- We aimed for something generic enough that, once a document is completely processed, you cannot tell whether it was rST or Markdown.
- Another thing we wanted was delayed link resolution.
- We don't use the tree-sitter AST directly: it has too much information about the source that we don't want.
- Our AST still has too much information, but while we can't change the tree-sitter AST, we can change our own.
- Another problem with the tree-sitter AST is that all the nodes are of a generic type; to get the actual type you have to inspect something like `<node>.kind` at runtime.
```python
>>> title_node
<Node kind=title, start_point=(1, 0), end_point=(1, 5)>
>>> adornment_node
<Node kind="adornment", start_point=(2, 0), end_point=(2, 5)>
```
Because of this you have to do runtime checks on the types of objects; you can't have static typing, so we can't use mypy.
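The runtime-dispatch problem described above can be sketched as follows. This is a minimal illustration, not papyri code; `FakeNode` is a hypothetical stand-in for a tree-sitter node.

```python
# Hedged sketch: with a single generic node type carrying a string kind,
# every consumer must branch at runtime, and mypy cannot narrow the type.
def render(node):
    if node.kind == "title":        # runtime string check; no static guarantee
        return node.text.decode()
    elif node.kind == "adornment":  # a typo'd kind string would fail silently
        return ""
    return ""

class FakeNode:                     # hypothetical stand-in for a tree-sitter Node
    def __init__(self, kind, text):
        self.kind, self.text = kind, text

print(render(FakeNode("title", b"NumPy")))
```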
- It's an experimental project, so sometimes the reason something is done a particular way is simply that the author started it that way.
- tree-sitter also has a lot of optional fields that are sometimes present and sometimes not. I was trying to make something more coherent, in which the fields are always present when possible; that's one of the reasons we moved to our own AST.
- That might also be the case with MyST.
- Another thing about the MyST spec is that you can serialize to JSON, but our internal representation as Python objects can be well typed, so we can actually make sure everything type-checks statically instead of relying on runtime value checking.
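The point above can be sketched with a typed node class that still serializes to JSON. This is an illustrative example, not papyri's actual classes; the field names are assumptions.

```python
# Hedged sketch: a well-typed Python node that mypy can check statically,
# yet which still round-trips through JSON as the MyST spec allows.
from dataclasses import dataclass, asdict
import json

@dataclass
class MText:
    value: str          # mypy verifies this is a str at type-check time
    type: str = "text"  # MyST-style type tag (name assumed)

node = MText("NumPy")
payload = json.dumps(asdict(node))  # serialize without losing structure
print(payload)
```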
- Next steps (this can be done over many pull requests; it doesn't all need to happen at once): either modify `ts.py`, or create a `ts2.py` that would not emit the AST in `take2.py` but instead a new AST in, say, `myst.py`, so that we can replace things progressively.
- When we do the rendering to HTML, we can use some of the MyST machinery.
- Almost everything will change in some way, because everything relies on the current AST.
- I was planning some changes in the current AST but didn't have time to make them. For example, by the time you reach the rendering task, the renderer shouldn't see any directives: directives are really meant for adding content, not for rendering, so the directive node should go away at some point.
- Discussion with the MyST folks: they want to do something similar, and we may have feedback on their AST. Since they parse only Markdown and we parse rST, we might ask them to change things; we can influence the AST MyST has.
- It's really exploratory. The answer could be no, we really can't use myst ast for reason x, y and z.
## `gen` command
Generates documentation for a given package. The first item should be the root package to import; if subpackages need to be analyzed but are not reachable from the root, pass them as extra arguments.
```python
if examples:
    g.collect_examples_out()
if api:
    g.collect_api_docs(target_module_name)
if narrative:
    g.collect_narrative_docs()
```
### Where it all starts
The `gen` command does the parsing using tree-sitter and returns objects from `take2.py`.
### Code Flow:
- command: `papyri gen examples/numpy.toml`: the `gen` command takes a TOML configuration file to generate documentation for a given package.
- Saves the generated documentation in the `~/.papyri/data` directory.
- `gen_main`: Main entry point to generate docbundle files.
- This function collects package metadata, api docs, examples, narrative docs.
- `collect_api_docs`:
- the collector (which comes from `_get_collector`) constructs a depth-first-search collector that will try to find all the objects it can.
- For numpy, for example, `collected` contained *2667* items.
- `('numpy', <module 'numpy' ..>)`
- `('numpy.distutils', <module 'numpy.distutils'..>)`
- We call `helper_1` with the fully qualified name (`qa`) and the `target_item` (i.e. the module) for each of the collected items. It returns the following three things:
- `item_docstring`: docstring of the module
- `arbitrary`: List of `papyri.take2.Section`, each section will have the title and items inside that section, this is basically a section in the documentation.
- `api_object`: a `papyri.gen.APIObjectInfo`, a structured object which contains all the information about the parsed documentation; in fact `api_object.parsed` is equal to `arbitrary`.
- If there is an error, we continue to the next item in `collected`.
- `prepare_doc_for_one_object`: gets documentation information for one Python object. It returns the following:
- `DocBlob`: an object containing information about the documentation of an arbitrary object.
- After all this processing the docs are written in the file system.
- For each collected item, the data is written in `~/.papyri/data/numpy_1.23.4/module/<collected_item.json>`
### Tree sitter parsed object
```python
>>> tree
<tree_sitter.Tree object at 0x1090cf3f0>
>>> tree.root_node
<Node kind=document, start_point=(0, 0), end_point=(104, 0)>
>>> tree.root_node.children[0]
<Node kind=section, start_point=(1, 0), end_point=(2, 5)>
>>> tree.root_node.children[0].text
b'NumPy\n====='
>>> tree.root_node.children[1].text
b'Provides\n 1. An array object of arbitrary homogeneous items\n 2. Fast mathematical operations over arrays\n 3. Linear Algebra, Fourier Transforms, Random Number Generation'
>>> tree.root_node.children[2].text
b'How to use the documentation\n----------------------------'
>>> tree.root_node.children[3].text
b'Documentation is available in two forms: docstrings provided\nwith the code, and a loose standing reference guide, available from\n`the NumPy homepage <https://numpy.org>`_.'
>>> tree.root_node.children[0].children
[<Node kind=title, start_point=(1, 0), end_point=(1, 5)>, <Node kind="adornment", start_point=(2, 0), end_point=(2, 5)>]
>>> tree.root_node.children[0].children[0]
<Node kind=title, start_point=(1, 0), end_point=(1, 5)>
>>> tree.root_node.children[0].children[0].children
[<Node kind="text", start_point=(1, 0), end_point=(1, 5)>]
>>> tree.root_node.children[0].children[0].children[0]
<Node kind="text", start_point=(1, 0), end_point=(1, 5)>
>>> tree.root_node.children[0].children[0].children[0].children
[]
>>> tree.root_node.children[0].children[0].children[0].text
b'NumPy'
```
### TreeSitter Parsing (`ts.py` and `take2.py`):
- We pass the tree sitter root node to Papyri's `Node` object.
- That `Node` object is then passed to `TSVisitor`.
- Then we call the `visit_document` method of `TSVisitor`, which eventually calls the `visit` method, which visits all the children.
- All the children (tree-sitter objects) have a type (`c.type`), which we call a kind. For each tree-sitter type we have defined a method in the `TSVisitor` class named `visit_{kind}`.
- For each tree-sitter type we have a node defined in `take2.py`.
- For each child we call the corresponding `visit_{kind}`, which parses the child and returns the respective object from `take2.py`.
- In the last step, `nest_sections` puts things under a `Section` node.
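The steps above can be sketched as a classic visitor dispatch. This is a simplified stand-in, not papyri's actual `TSVisitor`; the `Node` class here is a minimal fake.

```python
# Hedged sketch of the visit_{kind} dispatch pattern described above.
class Node:
    """Minimal stand-in for a tree-sitter node (type, children, text)."""
    def __init__(self, type, children=(), text=b""):
        self.type, self.children, self.text = type, list(children), text

class TSVisitor:
    def visit(self, node):
        acc = []
        for c in node.children:
            # look up visit_{kind} by the child's tree-sitter type
            method = getattr(self, "visit_" + c.type, self.generic_visit)
            acc.extend(method(c))
        return acc

    def generic_visit(self, node):
        # no specific handler: recurse into the children
        return self.visit(node)

    def visit_text(self, node):
        # a specific handler returns the parsed representation
        return [node.text.decode()]

doc = Node("document", [Node("section", [Node("title", [Node("text", text=b"NumPy")])])])
print(TSVisitor().visit(doc))
```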
## `ingest` command
Example: `papyri ingest ~/.papyri/data/numpy_1.23.4`
Given the path to a docbundle folder, ingests it into the known libraries.
- This uses the library [cbor2](https://pypi.org/project/cbor2/) to create Concise Binary Object Representation (CBOR) of the doc_blob.
```python
encoder.encode(doc_blob)
```
- This is then saved in files inside the `~/.papyri/ingest` directory.
- At this point we also save refs/links to a database.
- The database is saved as `papyri.db` inside the `~/.papyri/ingest/` directory.
## `papyri.db`
The `papyri.db` database contains the following tables:
```
main.destinations
main.documents
main.links
```
This is managed by the `graphstore` module (a class abstraction over the filesystem to store documents in a graph-like structure).
### `destinations`
|id |package |version |category |identifier |
|---|--------------|---------------|----------|---------------------|
|1 |numpy |1.23.4 |module |numpy.ndarray |
|2 |current-module|current-version|to-resolve|ogrid |
|3 |builtins |* |module |builtins.tuple |
|4 |numpy |1.23.4 |module |numpy.indices |
|5 |current-module|current-version|to-resolve|mgrid |
|6 |numpy |* |module |numpy.ndarray.reshape|
### `documents`
|id |package |version |category |identifier |
|---|--------------|---------------|----------|---------------------|
|1 |numpy |1.23.4 |assets |fig-numpy.kaiser-1-ce19905e.png|
|2 |numpy |1.23.4 |assets |fig-numpy.histogram2d-0-3819e7bf.png|
|30 |numpy |1.23.4 |module |numpy.polynomial.hermite.hermfit|
|31 |numpy |1.23.4 |module |numpy.lib.function_base._i0_dispatcher|
|32 |numpy |1.23.4 |module |numpy.lib.index_tricks.MGridClass|
### `links`
|id |source |dest |metadata |
|---|--------------|---------------|----------|
|1 |29 |1 |debug |
|2 |29 |2 |debug |
|3 |29 |3 |debug |
|4 |29 |4 |debug |
|5 |29 |5 |debug |
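The three tables above can be reproduced with stdlib `sqlite3` to show how a link row joins a document to a destination. The column names are taken from the dumps above; papyri's actual schema and types may differ.

```python
import sqlite3

# Hedged sketch: an in-memory mock of the three papyri.db tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE destinations (id INTEGER PRIMARY KEY, package TEXT, version TEXT,
                           category TEXT, identifier TEXT);
CREATE TABLE documents    (id INTEGER PRIMARY KEY, package TEXT, version TEXT,
                           category TEXT, identifier TEXT);
CREATE TABLE links        (id INTEGER PRIMARY KEY, source INTEGER, dest INTEGER,
                           metadata TEXT);
""")
# sample rows lifted from the dumps above
conn.execute("INSERT INTO destinations VALUES (1, 'numpy', '1.23.4', 'module', 'numpy.ndarray')")
conn.execute("INSERT INTO documents VALUES (30, 'numpy', '1.23.4', 'module', 'numpy.polynomial.hermite.hermfit')")
conn.execute("INSERT INTO links VALUES (1, 30, 1, 'debug')")

# resolve a link: which document points at which destination identifier
row = conn.execute("""
SELECT d.identifier, t.identifier
FROM links l
JOIN documents d    ON d.id = l.source
JOIN destinations t ON t.id = l.dest
""").fetchone()
print(row)
```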
## `render` command
Example: `papyri render`
This does static rendering of all the given files.
- This decodes the ingested blobs (cbor2 bytes) to get the doc_blob back.
- That doc_blob is passed to a jinja template (`html.tpl.j2`) to render the html.
- Each html for api is written into: `~/.papyri/html/p/numpy/1.23.4/api/<qa>.html`
- The html jinja template also has the logic for ordering of various sections in the DocBlob.
DocBlob attributes (understanding one of the nodes):
```python
>>> doc_blob.content.keys()
dict_keys(['Attributes', 'Extended Summary', 'Methods', 'Notes', 'Other Parameters', 'Parameters', 'Raises', 'Receives', 'Returns', 'Summary', 'Warnings', 'Warns', 'Yields'])
>>> returns = doc_blob.content['Returns']
>>> type(returns)
<class 'papyri.take2.Section'>
>>> type(returns.children[0])
<class 'papyri.take2.Parameters'>
>>> parameters = returns.children[0]
>>> type(parameters.children[0])
<class 'papyri.take2.Param'>
>>> param = parameters.children[0]
>>> type(param.children[0])
<class 'papyri.take2.Paragraph'>
>>> paragraph = param.children[0]
>>> type(paragraph.children[0])
<class 'papyri.take2.Words'>
>>> words = paragraph.children[0]
>>> words.value
'Chebyshev coefficients ordered from low to high. If '
```
```
[Returns - Section]
|
V
[Parameters]
|
V
[Param]
|
V
[Paragraph]
|
V
[Words]
```
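The nesting shown above is what the renderer has to walk. A minimal sketch of that walk, using hypothetical stand-in classes rather than the real `papyri.take2` ones:

```python
# Hedged sketch: stand-ins mirroring the Section > Parameters > Param >
# Paragraph > Words nesting shown above (real take2 classes are richer).
class Node:
    def __init__(self, *children):
        self.children = list(children)

class Words(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value

def collect_text(node):
    """Depth-first walk gathering Words values, as a renderer would."""
    if isinstance(node, Words):
        return [node.value]
    return [t for c in node.children for t in collect_text(c)]

section = Node(Node(Node(Node(Words("Chebyshev coefficients ordered from low to high. If ")))))
print(collect_text(section))
```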
## `serve` command
Example `papyri serve`
This serves the rendered html files.
## MyST
*myst-spec is in development; any structures or features present in the JSON schema may change at any time without notice.*
### Directives & Roles
- Roles and directives are two of the most powerful parts of MyST.
- They both serve a similar purpose, but roles are written in one line whereas directives span many lines.
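A quick sketch of the difference in MyST syntax (hypothetical content, for illustration only):

````markdown
A role is inline: {math}`e^{i\pi} + 1 = 0`.

```{note}
A directive spans multiple lines and can
contain options and a body.
```
````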
## Questions
Q1: What's `rst.so`?
```python
pth = str(Path(__file__).parent / "rst.so")
RST = Language(pth, "rst")
parser = Parser()
parser.set_language(RST)
```
Q2: To replace the current AST with the MyST AST, would we need to find an equivalent in the MyST spec for each item in the current AST (almost every item; some would need manual additions)?
## Action Items
- [X] Understand how the various commands in `papyri` work.
- [X] Understand the tree-sitter AST at a high level.
- [X] Understand the current AST at a high level.
- [ ] Improve/fix the JSON schema to Python dataclasses code.
- [ ] Create another `myst.py` (or `take3.py`) to return the MyST AST after tree-sitter parsing.
## Using MyST AST
- Trying to replace `Word`/`Words` with `Text` from the MyST AST.
- The `Words` node in the current AST is different from `Text` in the MyST AST: `Words` is a single word, whereas `Text` is a continuous block of words.
- Need to figure out a way for that single element to pass all the assertions during the construction of the tree, for example:
```python
# Ref: papyri/tree.py:366
# `Node` here is the one defined in take2.py, whereas `c` (a MyST Text)
# is an instance of the Node defined in myst_ast.py, so this assert fails
assert isinstance(c, Node), c
```
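The failing assertion can be reproduced with two stand-in hierarchies. This is an illustrative sketch of the migration issue, not papyri code; one possible (assumed) transition strategy is to let the check accept both base classes.

```python
# Hedged sketch: stand-ins for the two node hierarchies
# (papyri's real take2.Node and myst_ast nodes are richer).
class Take2Node:
    pass

class MystNode:
    pass

class MText(MystNode):
    def __init__(self, value):
        self.value = value

c = MText("An enhanced distutils")
# the existing assertion rejects MyST nodes:
print(isinstance(c, Take2Node))
# during the transition the check could accept both hierarchies:
print(isinstance(c, (Take2Node, MystNode)))
```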
Trying it on `numpy.distutils`:
```bash
papyri gen examples/numpy.toml --only numpy.distutils
```
### Current AST:
```python
[<Section:
|children: [<Paragraph:
| |children: [An enhanced distutils, providing support for Fortran compilers, for BLAS, LAPACK and other common libraries for numerical computing, and more.]
| |>, <Paragraph:
| |children: [Public submodules are: ]
| |>, <BlockVerbatim '47'>, <Paragraph:
| |children: [For details, please see the , *Packaging*, and , *NumPy Distutils User Guide*, sections of the NumPy Reference Guide.]
| |>, <Paragraph:
| |children: [For configuring the preference for and location of libraries like BLAS and LAPACK, and for setting include paths and similar build options, please see , <Verbatim ``site.cfg.example``>, in the root of the NumPy repository or sdist.]
| |>]
|title: None
|level: 0
|target: None
|>]
```
```python
# nss - the structure shown above
>>> nss[0].children[0].children[0]
An enhanced distutils, providing support for Fortran compilers, for BLAS, LAPACK and other common libraries for numerical computing, and more.
>>> type(nss[0].children[0].children[0])
<class 'papyri.take2.Words'>
```
### New Attempted AST
```python
>>> nss
[<Section:
|children: [<Paragraph:
| |children: [<MText:
| | |value: 'An enhanced distutils, providing support for Fortran compilers, for BLAS, LAPACK and other common libraries for numerical computing, and more.'
| | |>]
| |>, <Paragraph:
| |children: [<MText:
| | |value: 'Public submodules are '
| | |>]
| |>, <BlockVerbatim '47'>, <Paragraph:
| |children: [<MText:
| | |value: 'For details, please see the '
| | |>, *Packaging*, <MText:
| | |value: ' and '
| | |>, *NumPy Distutils User Guide*, <MText:
| | |value: ' sections of the NumPy Reference Guide.'
| | |>]
| |>, <Paragraph:
| |children: [<MText:
| | |value: 'For configuring the preference for and location of libraries like BLAS and LAPACK, and for setting include paths and similar build options, please see '
| | |>, <Verbatim ``site.cfg.example``>, <MText:
| | |value: ' in the root of the NumPy repository or sdist.'
| | |>]
| |>]
|title: None
|level: 0
|target: None
|>
]
```
### Current Parsing Flow
- Node
```python
>>> type(root)
<class 'papyri.ts.Node'>
>>> tsv.visit_document(root)
```
- `Word`s are compressed into a `Words` object in the `visit_paragraph` function.
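The compression step can be sketched as follows. This is a simplified illustration of the idea, not papyri's actual `visit_paragraph` logic (which handles whitespace and inline nodes differently).

```python
# Hedged sketch: adjacent word tokens are merged into one text value,
# while non-word tokens (e.g. inline markup) are passed through.
def compress_words(tokens):
    out, buf = [], []
    for t in tokens:
        if isinstance(t, str):
            buf.append(t)            # accumulate adjacent words
        else:
            if buf:
                out.append(" ".join(buf))
                buf = []
            out.append(t)            # keep non-word tokens as-is
    if buf:
        out.append(" ".join(buf))
    return out

print(compress_words(["An", "enhanced", "distutils"]))
```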
---
```rst
.. plot::
   :format: png

   import matplotlib.pyplot as plt
   ...
```
Once parsed by tree sitter:
```
Directive:
  children:  # list of N elements
    Options:
      format: png
    Code:
      Text:
        value: "import matplotlib.pyplot as plt"
```
```rst
.. plot::

   import matplotlib.pyplot as plt
   ...
```
Once parsed by tree sitter:
```
Directive:
  children:  # list of N elements
    Code:
      Text:
        value: "import matplotlib.pyplot as plt"
```
You don't know whether the first child is an option or not. A cleaner structure would be:
```
Directive:
  Options: Option or None
  Code: Words
```
---
:bulb: Probably: the structure we currently return via our AST is the same as the structure returned by tree-sitter (a tree of nodes), and that structure is different for MyST.
## Links:
- https://github.com/stsewd/tree-sitter-rst
- https://myst.tools/docs/spec/myst-schema
## Action Items
- [x] Replace all node items with the MyST AST in the `papyri gen examples/numpy.toml --only numpy.distutils` call.
- [ ] Change serialisation to emit proper MyST-AST JSON and try rendering using MyST tools.
## 22 March 2023
- Replace `BlockVerbatim` with `Code`?
- Codecov Token fix
Remaining nodes to replace/remove:
- [ ] `Verbatim`
- [ ] `Directive`
- [ ] `Link`
- [ ] `Math`
- [ ] `BlockMath`
- [ ] `SubstitutionDef`
- [ ] `SubstitutionRef`
- [ ] `Target`
- [ ] `Unimplemented`
- [ ] `Comment`
- [ ] `Fig`
- [ ] `RefInfo`
- [ ] `ListItem`
- [ ] `Signature`
- [ ] `NumpydocExample`
- [ ] `NumpydocSeeAlso`
- [ ] `NumpydocSignature`
- [ ] `Section`
- [ ] `Parameters`
- [ ] `Param`
- [ ] `Token`
- [ ] `Code3`
- [ ] `CodeLine`
- [ ] `Code2`
- [ ] `GenToken`
- [ ] `Code`
- [ ] `BlockQuote`
- [ ] `Transition`
- [ ] `Paragraph`
- [ ] `Admonition`
- [ ] `TocTree`
- [ ] `BlockDirective`
- [ ] `BlockVerbatim`
- [ ] `Options`
- [ ] `FieldList`
- [ ] `FieldListItem`
- [ ] `DefList`
- [ ] `DefListItem`
- [ ] `SeeAlsoItem`