
Explorations on TG Instant Views

Collections on public resources (for test datasets)

Parsing

Basic syntax traits:

  1. Line delimited.
  2. # means comment for the rest of the line.
  3. \ at line end can continue the line with the next line.

Program: ( (Rule 1|2|3)? (# any)? )*

The grammar can be loosely described as follows: A rule generally has the following structure (if it does not end with {):

Rule 1: ( [~?!@$]? RuleName | < TagName > )
  (( Args ))?   – only meaningful for funcs
  : (anything else…)

Args: arg ( ( + | ,) arg )* ,?

If ending with {, the parsing rule is different:

Rule 2: [~?!@$]? RuleNameAndArgs {

where RuleNameAndArgs is anything in between, optionally having an argument list:

RuleName (( Args ))?

It may open a block function, which is the only recursive structure the format currently supports. The block function is closed by a line consisting of a single }.

Rule 3: }

Parsing quirks

  • The continuation backslash can "glue" two lines literally, with trailing and leading spaces trimmed
    • Tracing position information gets tricky since a rule can span multiple lines
    • It is not possible to skip over single-line comments up front, because they can also be glued
    • Comments can also appear at the end of a line, but a # wrapped in " does not count
  • Not really token-driven
    • Most of the time you will not see "unterminated string literal" or "xxx expected but yyy found"
    • Rule name / value is guided by :, not some identifier token
      • e.g. <a"> would be a tag token, but a" is illegal as a tag name (spoiler: it is valid in HTML5)
    • When you see "XXX as an alias of YYY" in doc, it is likely an actual text replacement.
      • which can be seen by observing the syntax error message of: <<xxx>>; it says Invalid function name: replace_tag(<<xxx>)>.
      • Again, < is not a valid tag name, but it is included.
    • However, the XPath engine is mostly a black box. It is hard to know whether it genuinely works this way.
    • Is this parser PEG-based 🤯? To model the behavior precisely, a handwritten parser is preferred over a parser generator.
  • A rule is always parsed differently if it ends in {, not just rules starting with @:
    • The normal rule is to seek the first : to be the delimiter, skipping any strings. But if the rule ends with {, parse everything before as { name + args }.
      • @debug::: -> XPath error: Invalid expression in query ::
      • @debug:::{ -> Invalid function name: debug:::
      • @debug:::{{ -> Invalid function name: debug:::{
      • @debug::{: -> Invalid expression in query :{:
      • Applies to all types of rule. E.g., ?true:{ and ?true{.
    • ( is only meaningful when paired with ):
      • if(: Invalid property name: if(
      • if(): unexpected (
    • It seems that each block func is specialized by name.
      • @if{ -> fine
      • @debug{ -> unexpected {; weird!
    • It is technically impossible to reverse-engineer the behavior when a block is left open: a ?true block is always appended to the template, and the resulting error, that a condition cannot appear in a block func, prevails.
      • You can use a condition that evaluates to false to trigger the error "} expected". The logic seems to be "if false, find the nearest paired } and jump there".
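The splitting behavior observed above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the function name `split_rule` and the returned tuple shape are hypothetical, only double-quoted strings are skipped, and `<tag>` rules get no special treatment.

```python
import re

# Double-quoted string literal; single-quoted strings are not handled here.
_STRING = re.compile(r'"(?:\\.|[^"\\])*"')
_PREFIXES = "~?!@$"


def split_rule(rule):
    """Split one preprocessed, non-empty rule into (kind, prefix, head, value)."""
    if rule == "}":
        return ("close", "", "", "")
    prefix = rule[0] if rule[0] in _PREFIXES else ""
    body = rule[len(prefix):]
    if body.endswith("{"):
        # Block form: everything before the final '{' is name + args, unparsed.
        return ("block", prefix, body[:-1].rstrip(), "")
    # Normal form: the first ':' outside a string literal is the delimiter.
    pos = 0
    while pos < len(body):
        m = _STRING.match(body, pos)
        if m:
            pos = m.end()
        elif body[pos] == ":":
            return ("rule", prefix, body[:pos].rstrip(), body[pos + 1:].lstrip())
        else:
            pos += 1
    return ("rule", prefix, body.rstrip(), "")
```

With this sketch, `@debug:::` splits into the name `debug` and the value `::`, while `@debug:::{` yields the block name `debug:::`, which lines up with the error messages quoted in the list above.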

Based on fuzzing done to guess the official IV engine's inner workings, we simplify parsing by preprocessing the line stream before tokenization takes place. The steps and their postconditions are as follows:

  1. String concatenation: glue every line ending in /\s*\\\s*$/ to the next line, also eliminating any leading spaces of the next line.
  2. Stripping trailing comments: eliminate all comments from statements; afterwards, any remaining # char MUST be inside some string literal.
  3. Dropping empty lines: remove lines of length zero; afterwards, a rule must be a non-empty string whose first character unambiguously determines its type.
  4. Split the rule type, the rule name, arguments if any, and the rest.
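Steps 1 to 3 can be sketched as below. The names `glue_lines`, `strip_comment`, and `preprocess` are made up for illustration, and two simplifications are assumptions rather than observed behavior: only double-quoted strings protect a `#`, and each rule is stripped of surrounding whitespace before the empty-line check.

```python
import re

# Either a complete double-quoted string literal or a comment-starting '#'.
_STRING_OR_HASH = re.compile(r'"(?:\\.|[^"\\])*"|#')
# A trailing backslash, with optional spaces around it.
_CONTINUATION = re.compile(r'\s*\\\s*$')


def glue_lines(lines):
    """Step 1: join each line ending in a backslash with the following line."""
    out, pending = [], None
    for line in lines:
        if pending is not None:
            line = pending + line.lstrip()
        m = _CONTINUATION.search(line)
        if m:
            pending = line[:m.start()]
        else:
            out.append(line)
            pending = None
    if pending is not None:
        out.append(pending)
    return out


def strip_comment(line):
    """Step 2: cut at the first '#' that is not inside a closed string literal."""
    for m in _STRING_OR_HASH.finditer(line):
        if m.group() == "#":
            return line[:m.start()]
    return line


def preprocess(source):
    """Steps 1-3: return the list of non-empty, comment-free rules."""
    rules = []
    for line in glue_lines(source.splitlines()):
        line = strip_comment(line).strip()
        if line:  # Step 3: drop empty lines.
            rules.append(line)
    return rules
```

Gluing runs before comment stripping, so a comment line ending in `\` swallows the next line, and a `#` after an unterminated `"` still starts a comment.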

Fuzzing also suggests that the engine interprets the template line by line, i.e., it does not produce an IR. This makes it easier to implement shortcut behaviors but harder to discover syntax errors early.

Tokens

All tokens are effectively strings in most cases, parsed depending on the patterns, so it is impossible to "escape" special strings.

Some properties of string literals:

  • wrapped in " (or ', but not where strict JSON strings are required)
  • supports \n, \", \u1234
  • the only token in which a # char can appear; this does not apply if the string literal is left open
  • illegal escape sequences are forbidden (e.g. no "\.")
  • the standard approach is to match "(?:\\.|[^"\\])*" and pass it to JSON.parse
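The matching approach in the last bullet can be sketched in Python, with `json.loads` standing in for `JSON.parse`. The helper name `read_string` is hypothetical, and single-quoted literals are left out of this sketch.

```python
import json
import re

# The pattern from the notes: a double-quoted literal with backslash escapes.
_STRING = re.compile(r'"(?:\\.|[^"\\])*"')


def read_string(text, pos=0):
    """Return (value, end) for a string literal at pos, or None if there is none."""
    m = _STRING.match(text, pos)
    if m is None:
        return None  # not a literal here, or the literal is left open
    # json.loads raises ValueError on illegal escapes such as "\." .
    return json.loads(m.group()), m.end()
```

The regex accepts any backslash pair, so the job of rejecting illegal escapes is delegated to the JSON parser.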

Specialized types:

  • var
    • In the form $var.
    • allowed pattern? Likely [a-zA-Z]\w* from code highlighting.
    • Special variables: $$ and $@.
  • attr
    • In the form @attr with attr non-empty. Rarely used, e.g. by @append_to.
    • allowed pattern?
  • regexp
    • which regex does the engine use?
      • the doc refers to PCRE when talking about ims modifiers.
    • some regexps imply the i flag
  • xpath query
    • context syntax: $context/query
      • despite the $ prefix normally for variables, context can refer to properties as a fallback
    • zeroing syntax: (query)[n]
    • In arguments they are treated like strings. Some arguments accept a "." prefix, which is expanded to self::*.
    • additional functions; fortunately they are all implementable by text replacements
      • has-class("class")
        -> contains(concat(" ", normalize-space(@class), " "), " class ")
      • ends-with("haystack", "needle")
        -> (substring("haystack", string-length("haystack") - string-length("needle") + 1) = "needle")
    • additional axes:
      • prev-sibling -> preceding-sibling::*[1]/self
      • next-sibling -> following-sibling::*[1]/self
    • (speculation) In property and variable assignments, the word null might just happen to be a valid XPath query that always returns an empty list (the only valid words under the default context node (root) are head and body for any valid HTML document). However, specializing that value to enforce its semantics might be a good idea.
  • tag
    • Find the nearest following >, skipping over any string literals
    • By bruteforcing all printable ASCII characters, allowed tag names are [a-zA-Z_][-\w.]* PLUS some Unicode categories (not tested exhaustively).
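The XPath extensions listed above can be sketched as plain text replacements. This is a deliberately naive sketch: the name `expand_xpath` is hypothetical, it assumes the extra axes are written with the usual `::` separator, it would also rewrite matches inside string literals, and `ends-with` is left out because its arguments can be arbitrary expressions.

```python
import re

# has-class("x") with a double-quoted class name.
_HAS_CLASS = re.compile(r'has-class\(\s*"([^"]*)"\s*\)')
# The extra axes, matched only when followed by '::'.
_AXIS = re.compile(r'\b(prev-sibling|next-sibling)(?=::)')
_AXES = {
    "prev-sibling": "preceding-sibling::*[1]/self",
    "next-sibling": "following-sibling::*[1]/self",
}


def expand_xpath(query):
    """Rewrite has-class() and the extra axes into standard XPath 1.0."""
    query = _HAS_CLASS.sub(
        lambda m: 'contains(concat(" ", normalize-space(@class), " "), " %s ")'
        % m.group(1),
        query,
    )
    return _AXIS.sub(lambda m: _AXES[m.group(1)], query)
```

For example, `./prev-sibling::p` becomes `./preceding-sibling::*[1]/self::p`.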

Props & Vars

  • When a variable is not defined, using it emits a warning "Unknown variable …". Note that this is different from setting a variable to null.
    PoC:
    $foo: "foo"
    $foo: null
    # `$foo` is now null
    @debug: $foo  # does not emit this error
    
  • A string literal, when materialized, is treated as a new text node whose text content equals the literal's value.

Function blocks

@function lpar PropList rpar lbrac
  Rule*
rbrac

options

version

err: version should be a (quoted) string
invalid version
quirk? "1" or "1." is interpreted as "1.0"
TODO: "2." -> ? ("2.10000" causes an internal error)

err: Version should be defined once

version not placed as the first rule:

Version 1.0 is outdated. Please update your template to the last version 2.1
[medium.com:12] ~version: "2.1"
Version should be set at the beginning of template

string quote rules?

Block functions

Conditions are not allowed inside block functions, e.g. ?true inside @if.

Arguments

Parse from ( to the nearest ) (i.e., no nesting):

  1. Eat spaces. Break if the end is reached.
  2. If a " is met, find its extent ("\b) and read the whole as a string literal. Otherwise, read until the first space or , and interpret the content as a string.
  3. Eat the , if there is one.
  4. Back to 1. to read the next argument.

Note that the trailing comma matters. @x(1,2,) should be parsed as if the last argument is "", but @x(1,2 ) should not. This is easier to test through @append(<tag>, ...), since it requires an odd number of arguments in this form.
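The loop above, including the trailing-comma rule, can be sketched as follows. The function name `parse_args` is hypothetical, quoted arguments are decoded with `json.loads` as an assumption, and the result for inputs such as `(,)` is whatever falls out of the steps, not a verified engine behavior.

```python
import json
import re

_STRING = re.compile(r'"(?:\\.|[^"\\])*"')


def parse_args(inner):
    """Parse the text between '(' and the nearest ')' into a list of strings."""
    args, pos, after_comma = [], 0, False
    while True:
        # Step 1: eat spaces; stop once the end is reached.
        while pos < len(inner) and inner[pos].isspace():
            pos += 1
        if pos >= len(inner):
            if after_comma:
                args.append("")  # a trailing comma yields an empty last argument
            return args
        # Step 2: a quoted literal, or a bare word up to the first space or comma.
        m = _STRING.match(inner, pos)
        if m:
            args.append(json.loads(m.group()))
            pos = m.end()
        else:
            start = pos
            while pos < len(inner) and not inner[pos].isspace() and inner[pos] != ",":
                pos += 1
            args.append(inner[start:pos])
        # Step 3: eat the comma if there is one, then loop back to step 1.
        after_comma = pos < len(inner) and inner[pos] == ","
        if after_comma:
            pos += 1
```

Remembering whether a comma was just consumed is what separates `1,2,` from `1,2 `: only the former ends right after a comma.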

TODO: how to test @x vs. @x() vs. @x(,)?