# Lutra - a proposal for a data tool
Lutra is a tool for executing relational data queries.
It operates on a directory of PRQL source files and .parquet input data. It
extracts queries, executes them using DuckDB and writes the results to local disk.
It is highly interactive, asking for additional information about what the user
is trying to accomplish.
It is designed as a CLI-library duo, so it can be used from the command line as
well as integrated into other data tools.
It uses the directory as the source of truth for project state and implements
appropriate locking to allow many concurrent invocations on the same directory.
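To make the concurrency point concrete, here is one possible shape of that locking. This is only a sketch, assuming a coarse whole-project lock file; the `.lutra.lock` name, the retry policy and the Rust implementation language are illustrative rather than part of the proposal:
```
use std::{fs, io, path::{Path, PathBuf}, thread, time::Duration};

/// A coarse, whole-project lock: a `.lutra.lock` file created atomically.
/// An advisory OS lock (flock/LockFileEx) would be a more crash-safe variant.
struct ProjectLock {
    path: PathBuf,
}

impl ProjectLock {
    fn acquire(project_dir: &Path) -> io::Result<ProjectLock> {
        let path = project_dir.join(".lutra.lock");
        loop {
            // `create_new` fails if the file already exists, so creation is
            // atomic and only one invocation can hold the lock at a time.
            match fs::OpenOptions::new().write(true).create_new(true).open(&path) {
                Ok(_) => return Ok(ProjectLock { path }),
                Err(e) if e.kind() == io::ErrorKind::AlreadyExists => {
                    // Another invocation is running; wait and retry.
                    thread::sleep(Duration::from_millis(200));
                }
                Err(e) => return Err(e),
            }
        }
    }
}

impl Drop for ProjectLock {
    fn drop(&mut self) {
        // Best effort: release the lock when this invocation finishes.
        let _ = fs::remove_file(&self.path);
    }
}
```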
This is a typical project structure:
```
- lutra.toml            # project config file
- _project.prql         # main PRQL source file
- utils.prql            # a PRQL source with utils
- sales/                # a directory module
  - orders.prql         # a sub module
  - order_dump.parquet  # data input file
```
## Building
PRQL source files contain queries that either load data or transform that data.
Any relational query that is not used in any other query is exported into a
.parquet file.
```
# orders.prql

# loading data
let all_orders = (
    from read_parquet('./sales/order_dump.parquet')
    filter important_column == null
)

let closed = (
    from all_orders
    filter status == 'closed'
)

all_orders
```
Now we run:
```
$ lutra run
Building project... done.
Results:
- sales/orders.main.parquet
- sales/orders.closed.parquet
```
And we get the following files:
```
- ...
- sales/                    # (already existed)
  - orders.prql             # (already existed)
  - order_dump.parquet      # (already existed)
  - orders.main.parquet     # result of `sales.orders.main`
  - orders.closed.parquet   # result of `sales.orders.closed`
```
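Under the hood, each exported relation boils down to a single DuckDB statement. As a rough sketch, assuming the `duckdb` Rust crate and hand-writing the SQL that the `closed` pipeline above would approximately compile to, `lutra run` could produce `sales/orders.closed.parquet` like this:
```
use duckdb::Connection;

fn main() -> duckdb::Result<()> {
    let conn = Connection::open_in_memory()?;

    // Roughly the SQL that `sales.orders.closed` compiles to, wrapped in a
    // COPY so DuckDB writes the result straight to a parquet file.
    let sql = r#"
        COPY (
            WITH all_orders AS (
                SELECT *
                FROM read_parquet('./sales/order_dump.parquet')
                WHERE important_column IS NULL
            )
            SELECT * FROM all_orders WHERE status = 'closed'
        ) TO 'sales/orders.closed.parquet' (FORMAT PARQUET);
    "#;

    conn.execute_batch(sql)?;
    Ok(())
}
```
The real tool would get the SQL from the PRQL compiler rather than a hard-coded string; the point is only that DuckDB can read the parquet input and write the parquet output in one statement.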
## Auto-importing
When we add:
```
- ...
- sales/               # (already existed)
  - orders.prql        # (already existed)
  - ...
  - articles.parquet   # new data input
```
... and run:
```
$ lutra run
> Found unimported data in `sales/articles.parquet`. Do you want to import it?
[y] Yes, use default name `sales.articles.main`
[e] Yes, but edit the name.
[n] No, ignore it.
Creating file `sales/articles.prql` with contents:
from read_parquet('sales/articles.parquet')
Building project... done.
Results:
- sales/orders.main.parquet
- sales/orders.closed.parquet
- sales/articles.main.parquet
```
Lutra is reading and writing .parquet files all over your project. To know which
files are inputs, which are outputs and which were recently added, it needs to
keep track of all `read_parquet` calls and all output files.
This means that we need to build a DAG of all relations. Since we are traversing
the trees anyway, we might as well build the DAG and infer inputs and outputs
from it.
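A minimal sketch of that bookkeeping, assuming each PRQL file has already been parsed into a list of relations together with the `read_parquet` paths and relation references found in them (the names and types here are hypothetical):
```
use std::collections::HashSet;

/// One declared relation, as extracted from a parsed PRQL file.
struct Relation {
    name: String,            // e.g. "sales.orders.closed"
    reads: Vec<String>,      // paths passed to read_parquet
    depends_on: Vec<String>, // other relations referenced by this pipeline
}

/// Inputs are all files referenced by read_parquet; outputs are relations
/// that no other relation depends on (the ones exported to .parquet).
fn classify(relations: &[Relation]) -> (HashSet<String>, Vec<String>) {
    let mut inputs = HashSet::new();
    let mut referenced = HashSet::new();

    for rel in relations {
        inputs.extend(rel.reads.iter().cloned());
        referenced.extend(rel.depends_on.iter().cloned());
    }

    let outputs = relations
        .iter()
        .filter(|r| !referenced.contains(&r.name))
        .map(|r| r.name.clone())
        .collect();

    (inputs, outputs)
}
```
Any .parquet file found on disk that lands in neither set is exactly what the auto-importing prompt above treats as unimported data.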
## Importing other file formats
When CSV, JSON, XLSX, ODS, HTML or SQL files are detected to have been added to
the project, Lutra will try to convert them into parquet. Parquet is preferred
over the other formats because it is very close to relational data.
Each of the other formats may contain wildly varied data, which needs to be
converted and regularized into tabular data:
- CSV contains only string data. Any numbers, booleans or temporal values have to
  be parsed from strings.
- JSON can contain tabular data in several different structures: an array of row
  objects, an array of row arrays, or an array of column arrays, to name just a
  few (see the sketch after this list).
- XLSX and ODS can contain multiple sheets in one file, along with non-tabular
structures.
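To illustrate the JSON case mentioned above, regularizing the first of those shapes (an array of row objects) into named columns is a small but real piece of conversion logic. A hypothetical sketch using `serde_json`; the function name and intermediate representation are illustrative:
```
use serde_json::Value;
use std::collections::BTreeMap;

/// Regularize an "array of row objects" into named columns.
/// Keys missing from a row become `Value::Null`, so ragged rows still line up.
fn rows_to_columns(json: &Value) -> Option<BTreeMap<String, Vec<Value>>> {
    let rows = json.as_array()?;

    // First pass: collect every column name that appears in any row.
    let mut columns: BTreeMap<String, Vec<Value>> = BTreeMap::new();
    for row in rows {
        for key in row.as_object()?.keys() {
            columns.entry(key.clone()).or_default();
        }
    }

    // Second pass: fill each column, padding absent keys with null.
    for row in rows {
        let obj = row.as_object()?;
        for (name, col) in columns.iter_mut() {
            col.push(obj.get(name).cloned().unwrap_or(Value::Null));
        }
    }

    Some(columns)
}
```
The other shapes need similar but slightly different normalization before the data can be written out as parquet columns.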
The workflow would look like this:
```
$ lutra run
> Found unimported data in `sales/articles.csv`. Do you want to import it?
[y] Yes, use default name `sales.articles.main`
[e] Yes, but edit the name.
[n] No, ignore it.
> CSV files need to be converted to Parquet. Does this layout look correct?
article_id | article_name | category | out_of_stock | price
int32      | text         | text     | bool         | float64
-----------+--------------+----------+--------------+--------
42         | Pilot M2 1mm | pen      | false        |  7.50
55         | Fountain G33 | pen      | true         | 10.99
... 411 more rows (413 total) ...
[y] Yes, import as-is
[e] Edit column types
[n] No, abort.
Successfully imported.
> How do you want to clean up the original CSV file?
[d] Delete it.
[a] Move it into `archive/` directory.
[i] Keep it where it is and ignore it in future runs.
```
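The conversion behind the `[y]` answer does not need much custom code either: DuckDB's CSV reader can sniff the column types shown in the preview (and the preview itself could be driven by a `DESCRIBE` over the same query). A hypothetical sketch, again assuming the `duckdb` Rust crate and illustrative paths:
```
use duckdb::Connection;

/// Convert a CSV file to parquet, letting DuckDB sniff the column types.
fn csv_to_parquet(conn: &Connection, csv: &str, parquet: &str) -> duckdb::Result<()> {
    let sql = format!(
        "COPY (SELECT * FROM read_csv_auto('{csv}')) TO '{parquet}' (FORMAT PARQUET);"
    );
    conn.execute_batch(&sql)
}

fn main() -> duckdb::Result<()> {
    let conn = Connection::open_in_memory()?;
    csv_to_parquet(&conn, "sales/articles.csv", "sales/articles.parquet")?;
    Ok(())
}
```
If the user picks `[e]` to edit column types, the `read_csv_auto` call can be replaced with an explicit `read_csv` column specification before the copy runs.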