Papyri

CLI: papyri/__init__.py

Meeting Notes: 24 Nov, 202

  • The parsing is handled partly by tree-sitter and partly by numpydoc, because function docstrings are not correctly parsed by tree-sitter: numpydoc has its own syntax.
  • We tried to design an AST that is generic enough to represent all the documentation. When possible it doesn't carry too many specifics of how the source was written; it is kept as simple as possible while still being able to represent the document.
  • What we tried to do is something generic enough that, after a document is completely processed, you can no longer tell whether it was rST or Markdown.
  • Another thing we wanted was delayed link resolution.
  • We don't use the tree-sitter AST directly: it carries too much information about the source that we don't want.
  • Our AST still has too much information, but while we can't change the tree-sitter AST, we can change our own.
  • Another problem with the tree-sitter AST is that all the nodes have a generic type; to get the actual type you have to do something like <node>.kind at runtime:
>>> title_node
<Node kind=title, start_point=(1, 0), end_point=(1, 5)>

>>> adornment_node
<Node kind="adornment", start_point=(2, 0), end_point=(2, 5)>

Because of this you have to do runtime checks on the types of objects; you can't have static typing, so we can't use mypy.
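The problem can be sketched roughly like this (a simplified stand-in for tree_sitter.Node, not the real class): the node's actual kind is only a runtime string, so a type checker cannot narrow or verify the dispatch.

```python
from dataclasses import dataclass


@dataclass
class TSNode:
    """Hypothetical stand-in for a tree-sitter node."""
    type: str          # "title", "adornment", "text", ...
    text: bytes = b""


def describe(node: TSNode) -> str:
    # Runtime dispatch on a string: mypy cannot check that every kind is
    # handled, nor which fields are valid for which kind.
    if node.type == "title":
        return f"title: {node.text.decode()}"
    if node.type == "adornment":
        return "adornment"
    return "unknown"


print(describe(TSNode("title", b"NumPy")))
```

With papyri's own AST, each kind becomes its own (statically typed) class instead.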

  • It's an experimental project, so sometimes the reason something is a particular way is simply that the author started out that way.
  • tree-sitter also has a lot of optional fields that are sometimes present and sometimes not. We were trying to make something more coherent, in which the fields are always present when possible; that's one of the reasons for converting to our own AST.
  • The same might be the case with MyST.
  • Another thing about the MyST spec is that you can serialize to JSON, but our internal representation as Python objects can be well typed, so we can actually make sure everything type-checks and avoid runtime value checking.
  • Next steps (this can be done over many pull requests; it doesn't all need to happen at once): either modify ts.py, or create a ts2.py that does not emit the AST in take2.py but instead emits a new AST in, say, myst.py, so that things can be replaced progressively.
  • When we do the rendering to HTML, we can use some of the MyST machinery.
  • Almost everything will somehow change, because everything relies on the current ast.
  • I was planning some changes to the current AST but didn't have time to make them. For example, the rendering task shouldn't see any directives: directives are really meant for extension, not for rendering, so the directive node should go away at some point.
  • Discussion with the MyST folks: they want to do something similar, and we may have feedback on their AST. Since they parse only Markdown and we parse rST, we might ask them to change things in their AST; we can influence the AST MyST ends up with.
  • It's really exploratory. The answer could be no, we really can't use myst ast for reason x, y and z.

gen command

Generates documentation for a given package. The first argument should be the root package to import; if subpackages need to be analyzed but are not accessible from the root, pass them as extra arguments.

if examples:
    g.collect_examples_out()
if api:
    g.collect_api_docs(target_module_name)
if narrative:
    g.collect_narrative_docs()

Where it all starts

The gen command does the parsing using tree-sitter and returns objects from take2.py.

Code Flow:

  • Command papyri gen examples/numpy.toml: the gen command takes a TOML configuration file to generate documentation for a given package.
  • Saves the generated documentation in the ~/.papyri/data directory.
  • gen_main: Main entry point to generate docbundle files.
  • This function collects package metadata, api docs, examples, narrative docs.
  • collect_api_docs:
    • collector (comes from _get_collector) constructs a depth-first-search collector that will try to find all the objects it can.
    • For example, for numpy, 2667 items were collected:
      • ('numpy', <module 'numpy' ..>)
      • ('numpy.distutils', <module 'numpy.distutils'..>)
  • We call helper_1 with the fully qualified name (qa) and target_item (the module) for each collected item. It returns the following three items:
    • item_docstring: docstring of the module
    • arbitrary: List of papyri.take2.Section, each section will have the title and items inside that section, this is basically a section in the documentation.
    • api_object: papyri.gen.APIObjectInfo, a structured object which contains all the information about the parsed documentation; in fact api_object.parsed is equal to arbitrary.
  • If there is an error, then we continue to the next item in collected.
  • prepare_doc_for_one_object: gets documentation information for one Python object. It returns the following:
    • DocBlob: An object containing information about the documentation of an arbitrary object.
  • After all this processing the docs are written in the file system.
    • For each collected item, the data is written in ~/.papyri/data/numpy_1.23.4/module/<collected_item.json>

Tree sitter parsed object

>>> tree
<tree_sitter.Tree object at 0x1090cf3f0>

>>> tree.root_node
<Node kind=document, start_point=(0, 0), end_point=(104, 0)>

>>> tree.root_node.children[0]
<Node kind=section, start_point=(1, 0), end_point=(2, 5)>

>>> tree.root_node.children[0].text
b'NumPy\n====='
>>> tree.root_node.children[1].text
b'Provides\n  1. An array object of arbitrary homogeneous items\n  2. Fast mathematical operations over arrays\n  3. Linear Algebra, Fourier Transforms, Random Number Generation'
>>> tree.root_node.children[2].text
b'How to use the documentation\n----------------------------'
>>> tree.root_node.children[3].text
b'Documentation is available in two forms: docstrings provided\nwith the code, and a loose standing reference guide, available from\n`the NumPy homepage <https://numpy.org>`_.'

>>> tree.root_node.children[0].children
[<Node kind=title, start_point=(1, 0), end_point=(1, 5)>, <Node kind="adornment", start_point=(2, 0), end_point=(2, 5)>]

>>> tree.root_node.children[0].children[0]
<Node kind=title, start_point=(1, 0), end_point=(1, 5)>

>>> tree.root_node.children[0].children[0].children
[<Node kind="text", start_point=(1, 0), end_point=(1, 5)>]

>>> tree.root_node.children[0].children[0].children[0]
<Node kind="text", start_point=(1, 0), end_point=(1, 5)>

>>> tree.root_node.children[0].children[0].children[0].children
[]

>>> tree.root_node.children[0].children[0].children[0].text
b'NumPy'

TreeSitter Parsing (ts.py and take2.py):

  • We pass the tree sitter root node to Papyri's Node object.
  • That Node object is then passed to TSVisitor.
  • Then we call the visit_document method of TSVisitor, which eventually calls the visit method, which visits all the children.
  • All the children (tree-sitter objects) have a type (c.type), which we call kind. For each tree-sitter type we have defined a method in the TSVisitor class named visit_{kind}.
  • For each tree-sitter type we have a node defined in take2.py.
  • For each child we call the corresponding visit_{kind}, which parses the child and returns the respective object from take2.py.
  • In the last step we call nest_sections, which puts things under a Section node.
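The dispatch described above can be sketched roughly like this (a deliberately simplified, hypothetical version; the real TSVisitor in papyri/ts.py is more involved, and MiniNode here just stands in for a tree-sitter node):

```python
from dataclasses import dataclass, field


@dataclass
class MiniNode:
    """Hypothetical stand-in for a tree-sitter node."""
    type: str
    text: bytes = b""
    children: list = field(default_factory=list)


class MiniVisitor:
    def visit(self, node):
        # For each child, dispatch on its kind (c.type) to a visit_{kind}
        # method, falling back to a generic handler for unknown kinds.
        return [getattr(self, f"visit_{c.type}", self.generic_visit)(c)
                for c in node.children]

    def generic_visit(self, node):
        return ("Unknown", node.type)

    def visit_title(self, node):
        # In papyri this would build the corresponding take2.py node.
        return ("Title", node.text.decode())


doc = MiniNode("document", children=[MiniNode("title", b"NumPy")])
print(MiniVisitor().visit(doc))
```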

ingest command

Example: papyri ingest ~/.papyri/data/numpy_1.23.4

Given paths to a docbundle folder, ingest it into the known libraries.

  • This uses the cbor2 library to create a Concise Binary Object Representation (CBOR) of the doc_blob:
encoder.encode(doc_blob)
  • This is then saved in files inside the ~/.papyri/ingest directory.
  • At this point we also save refs/links to the database.
  • The database is saved in a file at ~/.papyri/ingest/papyri.db.

papyri.db

The papyri.db database contains the following tables

main.destinations
main.documents
main.links

This is managed by graphstore module (Class abstraction over the filesystem to store documents in a graph-like structure)

destinations

id package version category identifier
1 numpy 1.23.4 module numpy.ndarray
2 current-module current-version to-resolve ogrid
3 builtins * module builtins.tuple
4 numpy 1.23.4 module numpy.indices
5 current-module current-version to-resolve mgrid
6 numpy * module numpy.ndarray.reshape

documents

id package version category identifier
1 numpy 1.23.4 assets fig-numpy.kaiser-1-ce19905e.png
2 numpy 1.23.4 assets fig-numpy.histogram2d-0-3819e7bf.png
30 numpy 1.23.4 module numpy.polynomial.hermite.hermfit
31 numpy 1.23.4 module numpy.lib.function_base._i0_dispatcher
32 numpy 1.23.4 module numpy.lib.index_tricks.MGridClass
links

id source dest metadata
1 29 1 debug
2 29 2 debug
3 29 3 debug
4 29 4 debug
5 29 5 debug

render command

Example: papyri render

This does static rendering of all the given files.

  • This decodes the ingested blobs (cbor2 bytes) to get the doc_blob back.
  • That doc_blob is passed to a jinja template (html.tpl.j2) to render the html.
  • Each html for api is written into: ~/.papyri/html/p/numpy/1.23.4/api/<qa>.html
  • The html jinja template also has the logic for ordering of various sections in the DocBlob.

DocBlob Attributes: (Understanding one of the Nodes)

>>> doc_blob.content.keys()
dict_keys(['Attributes', 'Extended Summary', 'Methods', 'Notes', 'Other Parameters', 'Parameters', 'Raises', 'Receives', 'Returns', 'Summary', 'Warnings', 'Warns', 'Yields'])

>>> returns = doc_blob.content['Returns']
>>> type(returns)
<class 'papyri.take2.Section'>

>>> type(returns.children[0])
<class 'papyri.take2.Parameters'>

>>> parameters = returns.children[0]
>>> type(parameters.children[0])
<class 'papyri.take2.Param'>

>>> param = parameters.children[0]
>>> type(param.children[0])
<class 'papyri.take2.Paragraph'>

>>> paragraph = param.children[0]
>>> type(paragraph.children[0])
<class 'papyri.take2.Words'>

>>> words = paragraph.children[0]
>>> words.value
'Chebyshev coefficients ordered from low to high. If '
[Returns - Section]
 |
 V
[Parameters]
 |
 V
[Param]
 |
 V
[Paragraph]
 |
 V
[Words]

serve command

Example: papyri serve

This serves the rendered html files.

MyST

myst-spec is in development; any structures or features present in the JSON schema may change at any time without notice.

Directives & Roles

  • Roles and directives are two of the most powerful parts of MyST.
  • They both serve a similar purpose, but roles are written in one line whereas directives span many lines.

Questions

Q1: What's rst.so?

from pathlib import Path
from tree_sitter import Language, Parser

# rst.so is the compiled tree-sitter grammar for reStructuredText,
# shipped as a shared library next to this file.
pth = str(Path(__file__).parent / "rst.so")
RST = Language(pth, "rst")
parser = Parser()
parser.set_language(RST)

Q2: To replace the current AST with MyST, would we need to find an equivalent in the MyST spec for (almost) every item in the current AST, with some manual additions?

Action Items

  • Understand how the various commands in papyri work.
  • Understand the tree-sitter AST at a high level.
  • Understand the current AST at a high level.
  • Improve/fix the JSON-schema-to-Python-dataclasses code.
  • Create another myst.py (or take3.py) to return a MyST AST after tree-sitter parsing.

Using MyST AST

  • Trying to replace Word/Words with Text from the MyST AST.

  • The Words node in the current AST is different from Text in the MyST AST: Words is a single word, while Text is a continuous block of words.

  • Need to figure out a way for that single element to pass all the assertions during the construction of the tree, for example:

# Ref: papyri/tree.py:366
# Here Node is the class defined in take2.py, whereas c is a MyST Text,
# i.e. an instance of the Node defined in myst_ast.py, so this assertion fails.
assert isinstance(c, Node), c

Trying it on numpy.distutils

papyri gen examples/numpy.toml --only numpy.distutils

Current AST:

[<Section:
   |children: [<Paragraph:
   |   |children: [An enhanced distutils, providing support for Fortran compilers, for BLAS, LAPACK and other common libraries for numerical computing, and more.]
   |   |>, <Paragraph:
   |   |children: [Public submodules are:      ]
   |   |>, <BlockVerbatim '47'>, <Paragraph:
   |   |children: [For details, please see the , *Packaging*,  and , *NumPy Distutils User Guide*,  sections of the NumPy Reference Guide.]
   |   |>, <Paragraph:
   |   |children: [For configuring the preference for and location of libraries like BLAS and LAPACK, and for setting include paths and similar build options, please see , <Verbatim ``site.cfg.example``>,  in the root of the NumPy repository or sdist.]
   |   |>]
   |title: None
   |level: 0
   |target: None
   |>]
# nss is the structure shown above
>>> nss[0].children[0].children[0]
An enhanced distutils, providing support for Fortran compilers, for BLAS, LAPACK and other common libraries for numerical computing, and more.
>>> type(nss[0].children[0].children[0])
<class 'papyri.take2.Words'>

New Attempted AST

>>> nss

[<Section:
   |children: [<Paragraph:
   |   |children: [<MText:
   |   |   |value: 'An enhanced distutils, providing support for Fortran compilers, for BLAS, LAPACK and other common libraries for numerical computing, and more.'
   |   |   |>]
   |   |>, <Paragraph:
   |   |children: [<MText:
   |   |   |value: 'Public submodules are      '
   |   |   |>]
   |   |>, <BlockVerbatim '47'>, <Paragraph:
   |   |children: [<MText:
   |   |   |value: 'For details, please see the '
   |   |   |>, *Packaging*, <MText:
   |   |   |value: ' and '
   |   |   |>, *NumPy Distutils User Guide*, <MText:
   |   |   |value: ' sections of the NumPy Reference Guide.'
   |   |   |>]
   |   |>, <Paragraph:
   |   |children: [<MText:
   |   |   |value: 'For configuring the preference for and location of libraries like BLAS and LAPACK, and for setting include paths and similar build options, please see '
   |   |   |>, <Verbatim ``site.cfg.example``>, <MText:
   |   |   |value: ' in the root of the NumPy repository or sdist.'
   |   |   |>]
   |   |>]
   |title: None
   |level: 0
   |target: None
   |>
]

Current Parsing Flow

  • Node:

>>> type(root)
<class 'papyri.ts.Node'>

>>> tsv.visit_document(root)
  • Word(s) are compressed into a Words object in the visit_paragraph function.
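That compression step can be sketched as follows (an assumed simplification: the real visit_paragraph works on take2/MyST node objects, not (kind, value) tuples):

```python
def compress_words(tokens):
    """Merge runs of consecutive ("word", ...) tokens into single ("text", ...) tokens."""
    out, buf = [], []
    for kind, value in tokens:
        if kind == "word":
            buf.append(value)
        else:
            if buf:
                out.append(("text", " ".join(buf)))
                buf = []
            out.append((kind, value))
    if buf:
        out.append(("text", " ".join(buf)))
    return out


print(compress_words([("word", "Chebyshev"), ("word", "coefficients"),
                      ("verbatim", "``c``")]))
```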

.. plot::
   :format: png

   import matplotlib.pyplot as plt

Once parsed by tree sitter:

Directive:
  children:  # list of N elements
    Options:
      format: png
    Code:
      Text:
        value: "import matplotlib.pyplot as plt."

.. plot::

   import matplotlib.pyplot as plt

Once parsed by tree sitter:

Directive:
  children:  # list of N elements
    Code:
      Text:
        value: "import matplotlib.pyplot as plt."

You don't know whether the first child is an option or not. What we would like instead is:

Directive:
  Options: Option or None
  Code: Words
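One way to get that predictable shape could be a small normalisation pass (hypothetical names and tuple representation; this is not papyri's actual code):

```python
def normalize_directive(children):
    """Split a directive's raw children into an options dict and a code body.

    children: (kind, value) pairs as produced by parsing; the first child may
    or may not be an option, so we sort that out here once and for all.
    """
    options, body = {}, []
    for kind, value in children:
        if kind == "option":
            key, _, val = value.partition(":")
            options[key.strip()] = val.strip()
        else:
            body.append(value)
    return {"options": options, "code": "\n".join(body)}


print(normalize_directive([("option", "format: png"),
                           ("code", "import matplotlib.pyplot as plt")]))
```

Consumers then always see an options dict (possibly empty) and a code string, with no positional guessing.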


Probably: currently the structure we return via our AST is the same as the structure returned by tree-sitter, i.e. a tree of nodes, and that structure is different for MyST.

Action Items

  • Replace all node items with the MyST AST in the papyri gen examples/numpy.toml --only numpy.distutils call.
  • Change serialisation so the JSON output follows the MyST AST properly, and try rendering using MyST tools.

22 March 2023

  • Replace BlockVerbatim with Code?
  • Codecov Token fix

To Replace/Remove remaining nodes:

  • Verbatim
  • Directive
  • Link
  • Math
  • BlockMath
  • SubstitutionDef
  • SubstitutionRef
  • Target
  • Unimplemented
  • Comment
  • Fig
  • RefInfo
  • ListItem
  • Signature
  • NumpydocExample
  • NumpydocSeeAlso
  • NumpydocSignature
  • Section
  • Parameters
  • Param
  • Token
  • Code3
  • CodeLine
  • Code2
  • GenToken
  • Code
  • BlockQuote
  • Transition
  • Paragraph
  • Admonition
  • TocTree
  • BlockDirective
  • BlockVerbatim
  • Options
  • FieldList
  • FieldListItem
  • DefList
  • DefListItem
  • SeeAlsoItem